Moneyball: Sports Analytics in Soccer to Predict Performance and Outcomes


There is no doubt that soccer is the most popular sport in the world, and its popularity is growing in the US. Over 25 million fans watched the U.S. Women’s FIFA World Cup 2015, which earned Fox over $40 million in ad revenue. Similarly, the United States’ 2-2 draw with Portugal in the 2014 World Cup was seen by an average of 24.7 million viewers on Univision and ESPN.

What’s Sports Analytics?

Sports analytics is the set of processes that identify and acquire knowledge and insight about players’ potential performance, drawing on a variety of data sources such as game data and individual player performance data. These advanced analytics should yield actionable insights that coaches and managers can use.

Sports analytics can be utilized in various domains including:

Predicting the outcome of a game
Predicting the performances of teams or individual players
Building new strategies for upcoming competitions
Deciding the price of a player if a club were to loan, sell, or buy him or her
Connecting players to brands and sponsors

Of course, not all teams use analytical tools. In addition to the costs involved, there’s also the problem of explaining complex analytical methods to coaches in ways they can understand. Thus, soccer analytics is more widespread in big clubs where they have the necessary financial power to utilize these methods.

But that should not necessarily be the case. Since traditional sports analysts do not reveal the logic behind their methods, we’ve decided to give you a flavor of what can be done with soccer data. While this post will not answer all the questions you have about soccer analytics, it can help you understand how to get started.

In this project, we used a very limited number of player attributes (see the section under Players’ Features/Attributes) that are easy and inexpensive to gather, both to reverse engineer the most advanced rating and performance indices (the results are presented in Plots 1-12) and to propose a more robust and simpler model for predicting future player ratings and performance (see the section under Machine-Learning and AI Models). The program runs on Spark in a cloud environment and can be used for terabyte- to petabyte-scale data covering multiple years, thousands of players, and hundreds of attributes.

What’s Soccer Analytics?

Soccer analytics is the art of deriving insights and actionable decisions from soccer-related data. While predictive analytics uses big data to determine the probability of a certain outcome, intelligent descriptive analytics looks at big data and analyzes it using machine learning and artificial intelligence methods to come up with suggestions that will improve the likelihood of a desired outcome.

Some important concepts to know while conducting this analysis are:

Game Modeling: Modeling of the game before, during and after the game using scientific techniques to match or predict a set of outcomes.
Expert Player Rating: Player ratings given by an expert. These ratings take a black-box approach, and they vary according to the prior knowledge of the expert.
Soccer Performance Analytics: A tool that helps players, coaches, and managers quantitatively assess player and team performance, improve both, and design a set of winning strategies for upcoming game(s).

Here are some questions to ask when running analysis on soccer players:

How do expert ratings differ from ratings generated by data-driven Machine Learning models?
Which player attributes are most strongly linked to performance?
What criteria do experts use when they evaluate players? Is there a way to reverse engineer their criteria?
Which attributes are important for each specific position?
Can we use ratings and players’ attributes to predict the outcome of a game?
Can we aggregate the players’ ratings to come up with a team rating?
Is there a way to correlate the team rating to the outcome of the game?
Do the outcomes of games influence expert ratings more than individual performance indicators do?
Can we predict the outcome of a new game given the past performance of the players?
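
To make the questions about team ratings and game outcomes concrete, here is a minimal sketch of one way to aggregate player ratings into a team rating and check how often the higher-rated side wins. Everything in it is an assumption made for illustration: the minutes-weighted mean, the function names team_rating and agreement_rate, and the four made-up matches. It is not the aggregation used in this project.

```python
def team_rating(ratings, minutes):
    """Minutes-weighted mean of player ratings (one possible aggregation)."""
    total = sum(minutes)
    if total == 0:
        raise ValueError("team has no recorded minutes")
    return sum(r * m for r, m in zip(ratings, minutes)) / total


def agreement_rate(matches):
    """Share of decided matches won by the side with the higher team rating."""
    decided = [m for m in matches if m[4] != 0]  # ignore draws
    hits = 0
    for home_r, home_m, away_r, away_m, goal_diff in decided:
        gap = team_rating(home_r, home_m) - team_rating(away_r, away_m)
        if gap * goal_diff > 0:  # rating gap and result point the same way
            hits += 1
    return hits / len(decided)


# Made-up matches: (home ratings, home minutes, away ratings, away minutes,
# home goal difference). Real data would have 11+ players per side.
matches = [
    ([7.5, 7.0, 6.5], [90, 90, 90], [6.0, 6.5, 6.0], [90, 90, 90], 2),
    ([6.0, 6.5, 7.0], [90, 60, 30], [7.0, 7.5, 6.5], [90, 90, 90], -1),
    ([6.8, 7.2, 6.4], [90, 90, 45], [6.9, 6.1, 6.6], [90, 90, 90], 1),
    ([6.2, 6.0, 6.9], [90, 90, 90], [6.4, 6.3, 6.0], [90, 45, 90], -1),
]

print(f"Higher-rated side won {agreement_rate(matches):.0%} of decided matches")
```

A simple agreement rate like this says nothing about causation, and a real study would need far more matches, but it shows the shape of the calculation.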

Using advanced analytics and visualization tools such as Machine Learning and network analytics to predict the outcomes of soccer games is becoming more and more popular, as new tools make these methods easier to apply and move them into the mainstream.

Some insights into our approach:

In this project, we applied a selected set of clustering and classification techniques, then ranked the models and selected the best one using a train-validation-test process.
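
A train-validation-test process of this kind can be sketched in a few lines. The example below uses only the Python standard library and synthetic data; the single "finishing" attribute, the 60/20/20 split, and the two candidate models (a mean baseline and a one-variable least-squares line) are all illustrative assumptions, not the models or data used in this project.

```python
import random
from math import sqrt


def split(rows, seed=0, train=0.6, validation=0.2):
    """Shuffle once, then cut into train / validation / test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def fit_mean(rows):
    """Baseline: always predict the mean training rating."""
    mean = sum(y for _, y in rows) / len(rows)
    return lambda x: mean


def fit_line(rows):
    """One-variable least-squares regression of rating on the attribute."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return lambda x: my + slope * (x - mx)


def rmse(model, rows):
    return sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))


# Synthetic players: rating depends linearly on finishing, plus noise.
rng = random.Random(42)
rows = []
for _ in range(200):
    finishing = rng.uniform(40, 99)
    rows.append((finishing, 40 + 0.5 * finishing + rng.gauss(0, 2)))

train, validation, test = split(rows)
fitted = {"mean baseline": fit_mean(train), "linear regression": fit_line(train)}

# Rank candidates on validation data only; touch the test set once, at the end.
val_scores = {name: rmse(model, validation) for name, model in fitted.items()}
best_name = min(val_scores, key=val_scores.get)
test_rmse = rmse(fitted[best_name], test)
print(best_name, round(test_rmse, 2))
```

The key design point is that the test split plays no part in choosing the model, so the final test error is an honest estimate of how the chosen model behaves on unseen players.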

Companies such as OPTA, Prozone, Amisco, and WhoScored are now collecting rich soccer data. These can be utilized to conduct accurate assessments.
For our project, a rich data set containing more than 210 player attributes, including 198 performance statistics, was used. Some or all of these attributes were used to calculate the overall performance and ratings of the players. Some of the most advanced expert ratings include the Capello Index, the Castrol Index, and WhoScored.com. These ratings include each player’s cumulative ratings and game-based ratings.
Many Machine Learning models can be used for classification-regression and for clustering. For classification-regression, you can use SVMs, logistic regression, linear regression, naive Bayes, Regression by Discretization using J48, Additive Regression with Decision Stump, decision trees, ensembles of trees (Random Forests and Gradient-Boosted Trees), isotonic regression, Multilayer Perceptron, and RBF Network. For clustering, you can use k-means, affinity propagation, Agglomerative Clustering (Ward, Average, and Complete), Gaussian mixture, power iteration clustering (PIC), and latent Dirichlet allocation (LDA). Furthermore, you can use dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA) to reduce the feature space. In our case study, we tested all of these models; the best results were combined and are presented in Plots 1-12, without elaborating on the specific models or how they can be aggregated.
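
As an illustration of the simplest clustering method in that list, here is a minimal k-means sketch written with the standard library only. The two-attribute player vectors (Speed, Finishing), the farthest-point initialisation, and the function names are assumptions made for this example; in practice you would more likely rely on a library implementation, and this is not the clustering pipeline used for the plots.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic start: first point, then the point farthest from all chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iterations=100):
    """Lloyd's algorithm: assign to nearest centroid, recompute means, repeat."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Made-up forwards described by (Speed, Finishing).
forwards = [
    (90, 85), (88, 82), (92, 87), (86, 84),  # quick, clinical finishers
    (62, 70), (65, 72), (60, 68), (64, 74),  # slower forwards
]

labels, centroids = k_means(forwards, k=2)
print(labels)
print(centroids)
```

With real data the attributes should be scaled to comparable ranges first, and dimensionality reduction such as PCA can be applied before clustering when there are many attributes.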

Players’ Features/Attributes:

In this project, we used a subset of the players’ features/attributes from the following list. Keep in mind that your model should be able to select and rank these attributes based on their importance. In this post, we do not elaborate on which features the models selected, as these will depend on the specific approach you choose to take when building your model.

Nationality, Club, League, Age, Height, Strong Foot, Position (GK, CB, RB, LB, DM, CM, RM, LM, AM, RW, LW, SS, CF)
Attacking Prowess, Ball Control, Dribbling, Low Pass, Lofted Pass, Finishing
Place Kicking, Swerve, Header, Defensive Prowess, Ball Winning, Kicking Power, Speed, Explosive Power, Body Balance, Jump, Stamina, Goalkeeping, Saving, Form, Injury Resistance, Weak Foot Use, Weak Foot Accuracy, Trickster, Mazing Run, Speeding Bullet, Incisive Run, Long Ball Expert, Early Cross, Long Ranger
Scissors Feint, Flip Flap, Marseille Turn, Sombrero, Cut Behind & Turn, Scotch Move, Long Range Drive, Knuckle Shot, Acrobatic Finishing, First-time Shot, One-touch Pass, Weighted Pass, Pinpoint Crossing, Outside Curler, Low Punt Trajectory, Long Throw, GK Long Throw, Man Marking, Track Back, Captaincy, Super-sub, Fighting Spirit
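
One simple way to rank attributes like these by importance is to score each one by the absolute value of its correlation with the rating. The sketch below does this with the standard library on synthetic players. The choice of Pearson correlation, the three attributes shown, the function names, and the way the synthetic rating is generated are all assumptions for illustration; this is a stand-in for model-based feature selection, not the feature ranking produced by our models.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_attributes(players, target="Rating"):
    """Sort numeric attributes by |correlation| with the target, strongest first."""
    names = [name for name in players[0] if name != target]
    ratings = [p[target] for p in players]
    scores = {
        name: abs(pearson([p[name] for p in players], ratings)) for name in names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic forwards: the rating is driven mostly by Finishing, a little by
# Speed, and not at all by Height (by construction, for illustration only).
rng = random.Random(7)
players = []
for _ in range(200):
    finishing = rng.uniform(40, 99)
    speed = rng.uniform(40, 99)
    height = rng.uniform(165, 200)
    rating = 0.8 * finishing + 0.2 * speed + rng.gauss(0, 2)
    players.append(
        {"Finishing": finishing, "Speed": speed, "Height": height, "Rating": rating}
    )

ranking = rank_attributes(players)
for name, score in ranking:
    print(f"{name:10s} {score:.2f}")
```

A correlation filter only captures linear, one-at-a-time relationships and handles numeric attributes only. Categorical attributes such as Position would need encoding, and tree ensembles or regularized models can pick up interactions that this ranking misses.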

While we do not intend to disclose the final features selected by our optimized model and strategy, we used the following ML-AI based aggregator operator model.

Finally, here’s a list of the plots created as a result of this analysis:

Clustering (similarities of players’ clusters) and prediction of the rating (plot: prediction vs. actual) for the Forward Players. Train, test and Validation. The similarities of the individual players are shown by the lines on clustering plots.

Clustering (similarities of players’ clusters) and prediction of the rating (plot: prediction vs. actual) for the Goalkeepers. Train, test and Validation. The similarities of the individual players are shown by the lines on clustering plots.

Clustering (similarities of players’ clusters) and prediction of the rating (plot: prediction vs. actual) for the Defensive Players. Train, test and Validation. The similarities of the individual players are shown by the lines on clustering plots.

Clustering (similarities of players’ clusters) and prediction of the rating (plot: prediction vs. actual) for the Midfield Players. Train, test and Validation. The similarities of the individual players are shown by the lines on clustering plots.

Clustering (similarities of players’ clusters) using advanced Visual-Analytics-Clustering for Forward Players. The similarities of the individual players are shown by the lines on the clustering plot, and the strength of each similarity by the width of its line.

Clustering (similarities of players’ clusters) using advanced Visual-Analytics-Clustering for Forward Players. Similar clusters are closer to each other.

Typical performance of our Regression model’s Prediction for the rating of the Forward players (Prediction Vs. Actual). Train data for Forward Players.

Typical performance of our Regression model’s Prediction for the rating of the Forward players (Prediction Vs. Actual). Validation data for Forward Players.

Typical performance of our Regression model’s Prediction for the rating of the Forward players (Prediction Vs. Actual). Test data for Forward Players.

Typical performance of our Regression model’s Prediction for the rating of the players (Prediction Vs. Actual). Test data for Forward-Midfield-Defensive Players.

Typical performance of our Regression model’s Prediction for the rating of the players (Prediction Vs. Actual). Validation data for Forward-Midfield-Defensive Players.