Predicting Ultimate Fighting Championship (UFC) Judging Decisions Using Multivariate Linear Regression

David Wismer
Oct 5, 2021

Abstract

Using historical fight statistics from UFCStats.com and fight scoring data from MMADecisions.com, I created a multivariate linear regression model that accepts fight data (strikes, takedowns, submission attempts, control time, etc.) and predicts the score of the fight. In addition to evaluating the model’s accuracy in predicting the correct score, I also evaluated its ability to pick the winning fighter. After testing several different algorithms, I settled on LinearRegression from scikit-learn for my multivariate linear regression model. In predicting fight scores, the model returned an R-squared of 0.7168. The model predicted the correct winner of the fight with approximately 85.5% accuracy.

This model can be used to understand how judges (and the media) are scoring fights. Which fight statistics have the biggest impact on scoring decisions? How do total strikes, significant strikes, grappling statistics, and control time weigh into the judges’ decisions? The UFC has prescribed judging criteria, and the model results indicate that the criteria are largely being followed. But perhaps some judges have been more in line with the model over time than others. This model could be used to evaluate individual judge performance across time and also to study how changes in scoring rules and criteria impact fighting styles and fight scoring.

Visit my GitHub repository for all project code and presentation materials.

Watch my project presentation and code review videos on my YouTube channel.

UFC Scoring Criteria

Judges award 10 points to the winner of the round and 9 points to the loser of the round. In cases of particularly severe performance disparity, the loser of the round may only receive 8 points. This was previously rare, but recent changes have made 10–8 rounds more common. At the end of the fight, round scores are added and the fighter with the most points for a given judge will earn that judge’s vote. The fighter with the most votes of the three judges wins the fight. The UFC provides the following hierarchy for deciding the winner of a single round:

1. Effective Striking and Grappling — This is meant to be the deciding factor in the majority of rounds. Effective striking includes legal strikes (punches, kicks, knees, elbows) that inflict damage upon the opponent. Effective grappling includes successful takedowns, reversals, and submission attempts.

2. Effective Aggression — In the absence of a clear winner in the area of effective striking and grappling, effective aggression is the deciding factor. The round is awarded to the fighter who made the greatest attempt to finish the fight in the round.

3. Effective Control of the Fight Area — The final criterion for deciding a round winner, if there is no clear winner in striking/grappling or aggression, is control of the fight area. This encompasses pushing the pace of the fight and controlling the fight area (e.g. keeping the opponent up against the fence).

These judging guidelines are published and understood by judges, fighters, and fans alike. The guidelines help to create entertaining fights by aligning entertaining fight styles with a greater likelihood of winning fights. The scoring system rewards high output, high aggression, and fighting at a fast pace.

Data Used in the Model

Data was obtained from UFCStats.com and MMADecisions.com. I scraped all available UFCStats.com data, going back to 1997. I scraped decision data from MMADecisions.com for all fights from January 1, 2010 through May 31, 2021. As such, the data for the regression model includes all UFC fights from January 2010 through May 2021 for which decision data was available on MMADecisions.com. If a fight ends in a knockout or submission, judging data is not tracked on MMADecisions.com.

Ultimately, my dataset includes 2,258 fights. For each fight, I scraped fight statistics by round. For each fight statistic, I calculated the disparity between the fighters to feed to the model. For example, if Fighter 1 landed 10 head strikes and Fighter 2 landed 5 head strikes in a round, “Significant Head Strikes Disparity” would be 5 for Fighter 1 and -5 for Fighter 2.
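As a concrete illustration, here is a minimal sketch of that disparity calculation, assuming a pandas DataFrame with one row per fighter per round (the column names are hypothetical, not the project’s actual schema):

```python
import pandas as pd

# Hypothetical per-round data: one row per fighter per round.
rounds = pd.DataFrame({
    "fight_id": [1, 1],
    "round": [1, 1],
    "fighter": ["Fighter 1", "Fighter 2"],
    "sig_head_landed": [10, 5],
})

# Disparity = own stat minus the opponent's stat for the same fight and round.
# With exactly two fighters per group, the opponent's value is the group sum
# minus the fighter's own value.
group = rounds.groupby(["fight_id", "round"])["sig_head_landed"]
opponent = group.transform("sum") - rounds["sig_head_landed"]
rounds["sig_head_disparity"] = rounds["sig_head_landed"] - opponent

print(rounds[["fighter", "sig_head_disparity"]])  # 5 for Fighter 1, -5 for Fighter 2
```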

The following fight metrics were included prior to feature reduction:

  • Knockdown Disparity — A knockdown occurs when a fighter falls to the ground after being struck.
  • Significant Head Strikes Disparity
  • Significant Body Strikes Disparity
  • Significant Leg Strikes Disparity
  • Significant Distance Strikes Disparity — A strike “at distance” refers to a strike landed when the fighters are standing and not in the clinch position.
  • Significant Clinch Strikes Disparity — A “clinch strike” refers to a strike landed when the fighters are in a clinch position.
  • Significant Ground Strikes Disparity — A “ground strike” refers to a strike landed when the fighters are on the ground.
  • Significant Strikes Disparity
  • Total Strikes Landed Disparity
  • Total Strikes Attempted Disparity
  • Takedowns Landed Disparity
  • Takedowns Attempted Disparity
  • Submissions Attempted Disparity
  • Reversals Disparity — A “reversal” occurs when the fighter being controlled on the ground reverses position and obtains control.
  • Control Time Disparity — “Control time” refers to the number of seconds that a fighter maintains a controlled position on the ground.

The fight statistics above were the starting point for the independent variables of the regression model. Some of these were trimmed during the feature selection process, which was conducted primarily by inspecting correlations and variance inflation factors. The dependent variable, the score of the fight, was based on the MMADecisions.com data. Rather than using judge data only, I created a combined average score for each fight using all available judge and media scoring data. There are only three judges per fight, and their decisions are often heavily criticized; in many cases, fight media are more knowledgeable than the judges. Including additional scores made for a more robust dependent variable.
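A minimal sketch of that blending step, assuming a long-format DataFrame of scorecards (the column names and values are illustrative):

```python
import pandas as pd

# One row per scorecard per fighter, from judges and media alike.
scorecards = pd.DataFrame({
    "fight_id": [1, 1, 1, 1, 1, 1],
    "fighter":  ["A", "B", "A", "B", "A", "B"],
    "scorer":   ["judge1", "judge1", "judge2", "judge2", "media1", "media1"],
    "score":    [29, 28, 29, 28, 30, 27],
})

# The dependent variable: each fighter's average score across all scorers.
y = scorecards.groupby(["fight_id", "fighter"])["score"].mean()
print(y)
```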

Data Collection and Storage

The web scraping and data storage code can be found on my GitHub here.

All data used in the model was obtained using Scrapy, an object-oriented web scraping framework. Initializing a Scrapy project creates a package of modules that work together to scrape and store data from the web.

The majority of the work is done by the spider classes. Each of my spiders serves a separate web scraping function. I created a PerformanceSpider that scrapes UFCStats.com, grabbing fight metrics by round for each fight as well as information for each event (location, date, etc.). I also created a FighterSpider that scrapes the fighter information pages on UFCStats.com. Finally, I created a DecisionSpider that scrapes MMADecisions.com; it locates the relevant fight already stored in the database by the PerformanceSpider and then stores scoring-related information.
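For readers unfamiliar with Scrapy, a spider pairs start URLs with parse callbacks. The following skeleton shows the general shape of something like the PerformanceSpider; the CSS selectors and field names are illustrative, not taken from the project code:

```python
import scrapy

class PerformanceSpider(scrapy.Spider):
    """Sketch of a spider that walks completed-event pages on UFCStats.com."""
    name = "performance"
    start_urls = ["http://ufcstats.com/statistics/events/completed"]

    def parse(self, response):
        # Follow each event link to its detail page (selector is illustrative).
        for href in response.css("a.event-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_event)

    def parse_event(self, response):
        # Yield one item per fight row; field names are illustrative.
        for row in response.css("tr.fight-row"):
            yield {
                "event_url": response.url,
                "fight_url": row.css("a::attr(href)").get(),
            }
```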

The Scrapy package links directly to my PostgreSQL database using the SQLAlchemy library. To train and test the model, data was queried from the database using SQLAlchemy.
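The glue between Scrapy and a database is an item pipeline. A minimal sketch of the idea, assuming SQLAlchemy 1.4+, a placeholder connection string, and a hypothetical table:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Fight(Base):
    __tablename__ = "fights"  # hypothetical table
    id = Column(Integer, primary_key=True)
    fight_url = Column(String, unique=True)

class PostgresPipeline:
    """Scrapy item pipeline that persists scraped items via SQLAlchemy."""

    def open_spider(self, spider):
        engine = create_engine("postgresql://user:pass@localhost/ufc")  # placeholder DSN
        Base.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        with self.Session() as session:
            session.add(Fight(fight_url=item["fight_url"]))
            session.commit()
        return item
```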

If you are a budding Data Scientist building a web scraping and data storage project with Scrapy, I highly recommend using the following two resources in tandem:

Building the Model

The regression modeling and evaluation code can be found on my GitHub here.

Feature Engineering and Feature Reduction

The feature engineering and reduction code can be found on my GitHub in the Data Preparation notebook.

Before building the regression model, I first needed to engineer the appropriate features and eliminate collinearity through feature reduction. As detailed in the Data section above, all features were converted such that the independent variables each represented the disparity between Fighter 1 and Fighter 2 for each fight statistic. I also created a single dependent variable that blended all available judge and media scores into a single combined average score.

To explore feature reduction, I first created a correlation heatmap, as shown below. The heatmap shows, for example, that Takedowns Attempted (takedown_att_disparity) and Takedowns Landed (takedown_land_disparity) have moderately negative correlations with most striking statistics, particularly Significant Distance Strikes (sig_dist_land_disparity). This makes intuitive sense: a fighter who attempts many more takedowns than their opponent is less likely to be focused on striking, especially striking at distance rather than in the clinch or on the ground.
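Generating such a heatmap takes only a few lines with Seaborn; here X is assumed to be the DataFrame of disparity features built earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# X: DataFrame of disparity features (assumed to exist from earlier steps).
plt.figure(figsize=(12, 10))
sns.heatmap(X.corr(), cmap="coolwarm", center=0)
plt.title("Correlation of Disparity Features")
plt.tight_layout()
plt.show()
```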

Examining the heatmap, while informative, is not particularly actionable. A better way to decide which variables to drop is the Variance Inflation Factor (VIF), a standard tool for detecting multicollinearity in a regression model. I imported and ran variance_inflation_factor from the statsmodels library, which helped me find features that were highly correlated and adversely impacting model performance. On my first run, six features returned a VIF of infinity. After removing collinear features, aided by the Seaborn heatmap, my second run returned VIFs under 5.0 (a typical benchmark) for all features. See below for the two runs of VIF.
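A small helper in the spirit of what I ran, tabulating the VIF for every feature; anything persistently above the ~5.0 benchmark is a candidate to drop:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Return the variance inflation factor for each column of X."""
    return pd.DataFrame({
        "feature": X.columns,
        "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    }).sort_values("VIF", ascending=False)

# Iterate: drop the worst offender, re-run, and repeat until all VIFs < 5.0.
```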

I also explored polynomial features and feature interactions, but they did not provide any meaningful performance benefit to the model.

Train-Test Split

Rather than using the built-in random train-test split from scikit-learn, I opted for a time-based split, which divides observations by the date of the fight. This ensures that the database records for Fighter 1 and Fighter 2 in a given fight land in the same dataset (train or test), which allows for comparing the two fighters’ predicted scores and declaring the winner based on the higher of the two.

In addition to enabling a more reasonable winner-prediction methodology, a time-based split avoids issues that arise from placing one half of a fight in the training dataset and the other half in the testing dataset. The nature of the data is such that Fighter 1 and Fighter 2 are mirror images of each other, because each independent variable reflects the difference between the two. Training on Fighter 1’s row of a fight and then predicting Fighter 2’s row of the same fight would not be a fair methodology.
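A minimal sketch of the idea, assuming a DataFrame df with a fight_date column; the 80/20 cutoff is an assumption for illustration, not necessarily the project’s:

```python
# Sort by date, then cut at a date threshold so that both rows of every
# fight (Fighter 1 and Fighter 2) fall on the same side of the split.
df = df.sort_values("fight_date")
cutoff = df["fight_date"].quantile(0.8)  # assumed 80/20 split point
train = df[df["fight_date"] < cutoff]
test = df[df["fight_date"] >= cutoff]
```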

Regression Model Validation and Testing

I tested every model and data combination detailed in the table below. Ultimately, none of the data or model tweaks provided a significant boost over the simple scikit-learn LinearRegression model on unaltered data. Normalization, standardization, and regularization all had minimal impact on R-squared. In addition, SGDRegressor (stochastic gradient descent) and RandomForestRegressor each performed very similarly to LinearRegression. All models were evaluated with five-fold cross-validation.
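The comparison loop looks roughly like this, with X_train and y_train assumed from the time-based split above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import cross_val_score

models = {
    "linear": LinearRegression(),
    "sgd": SGDRegressor(random_state=0),
    "forest": RandomForestRegressor(random_state=0),
}
for name, model in models.items():
    # Five-fold cross-validation on the training data, scored by R-squared.
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    print(f"{name}: mean R-squared = {scores.mean():.4f}")
```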

Model Selection and Performance

I ultimately selected LinearRegression as my final model, as the more sophisticated and complex models provided no significant benefit. The final model uses normalized data, achieved with MinMaxScaler, because a common 0–1 scale allows for a more meaningful comparison of the linear regression coefficients. See the screenshot below for the coefficients of each feature in the final model.
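A sketch of the final setup: scale, fit, and read off comparable coefficients (X_train and y_train again assumed from the split above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_train, y_train)

# With every feature on a common 0-1 scale, coefficient magnitudes can be
# compared directly as a rough measure of importance.
coefs = pd.Series(pipe.named_steps["linearregression"].coef_, index=X_train.columns)
print(coefs.sort_values(key=abs, ascending=False))
```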

You’ll notice that these features align well with the prescribed judging criteria. The first criterion in the judging hierarchy is “Effective Striking and Grappling,” which is represented by Significant Strikes and Takedowns Landed, the two most important features in the model. The next most important features are Control Time Disparity and Knockdown Disparity. Each of these can be interpreted as contributing to “Effective Aggression,” and Control Time Disparity directly relates to “Effective Control of the Fight Area.”

Another interesting aspect of the model is that Takedowns Attempted has a negative coefficient, while Takedowns Landed has a larger positive coefficient. In other words, a fighter who attempts a takedown but fails to complete it is penalized in the eyes of the judges.

In predicting fight scores, the model returned an R-squared of 0.7168. The accuracy of the model is affected by the way rounds are scored. A typical round is scored 10–9, and most fights are three rounds, which means the closest score in a typical fight is 29–28, or roughly 9.67 to 9.33 on a per-round basis. As can be seen in the “Predicted vs. Actual Per Round Score” graphic below, there is a chasm between 9.33 and 9.67 in the actual per-round score, while there is no such chasm in the predicted per-round score produced by the regression model. In reality, the problem of determining the winner on a round-by-round basis may be better suited to logistic regression than linear regression.

Because of this shortcoming in the interpretation of R-squared, I prefer to evaluate the model on its ability to predict the winner of the fight. The model predicted the correct winner with approximately 85.5% accuracy. As shown below, the model tends to pick the winner correctly when the actual and predicted per-round scores fall on the same side of 9.5 points per round.
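Winner accuracy can be computed by comparing each fight’s two predicted scores; a sketch with illustrative column names (fight_id and won_fight are assumptions):

```python
# For each fight, pick the fighter with the higher predicted per-round score
# and check whether that fighter actually won.
test = test.assign(pred_score=pipe.predict(X_test))
picked = (
    test.sort_values("pred_score", ascending=False)
        .groupby("fight_id")
        .head(1)  # the predicted winner of each fight
)
print(f"Winner prediction accuracy: {picked['won_fight'].mean():.1%}")
```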

Where the Model Fails

I examined the feature distributions for fights where the model made correct versus incorrect predictions. As expected, the fights where the model got it wrong tend to have noticeably tighter distributions for key judging criteria. The black box plots below show that the incorrect predictions came when fights were more closely contested in terms of pure statistics. For fights such as these, it may be that the deciding factors of the fight could not be captured in striking and grappling statistics. For example, one fighter may have landed the more damaging strikes or may have been the clear aggressor in ways not captured in the numbers. Perhaps one fighter was very close to finishing the fight with multiple submission attempts. Statistics, while fairly predictive in the long run, will never tell the full story of an individual fight.

Future UFC Model Developments

There are several interesting directions to take with this data. First, the regression model itself can be tweaked to improve performance. The most obvious enhancement would be to build a model that determines winners on a per-round basis. There would be challenges, primarily because round-by-round scoring data is less complete than full-fight scoring data on MMADecisions.com. It would also be worthwhile to explore logistic regression on a per-round basis to determine the winner of each round. Even applying the linear regression model to individual rounds rather than the full fight is likely to improve winner prediction accuracy, since that is closer to how judges actually score fights, and logistic regression could improve accuracy further.

I’d also like to apply the model to explore a few interesting questions. Which judges historically deviate most from the regression model predictions? Which fighters and fighter styles tend to generate incorrect predictions more than others? How have rule changes and scoring criteria changes impacted model performance across time? Are there any clear differences in judging criteria between different athletic commissions (e.g. different states or countries)? All of these questions could be addressed using this model.

Finally, I intend to use the data pipeline I’ve built to generate new models and explore new topics within the UFC. The first of these new projects will be examining fighter styles. I plan to use fighter statistics and unsupervised learning techniques to cluster fighters together by their different styles. I can then evaluate what happens when different styles meet in the octagon. Which styles tend to win out over others? Which style matchups produce the most exciting fights?

Stay tuned for future posts combining my passions for MMA and machine learning! Drop any ideas, questions, or critiques you might have in the comments.
