Building a large database of MMA fight results III: summarizing the demographics of 140,000 MMA fighters
13 May 2016The essential units of analysis in MMA are the sport’s fighters and the bouts between them. In my previous post, I discussed how to standardize match data so that the major categories of finishes could be easily visualized.
In this entry, I will discuss how to clean-up fighter-level data. The main goal of this post will be to determine the factors that affect fighter matchups. For some fighters, our data contains useful demographics, while for other fighters, this information is missing. One goal of this post will be to infer missing demographic data based on fighters’ previous bouts. For example, if we don’t know where in the world a fighter lives, we can probably determine their location with good accuracy if all their fights have been against people from the same region.
The data that we will primarily work with was collected seperately for each fighter. We will first combine fighters’ data into a single table that can be more easily used.
Name | Nickname | Birthday | Height | Weight | nationality | weight_class | camp |
---|---|---|---|---|---|---|---|
Andrei Arlovski | The Pit Bull | 1979-02-04 | 193.04 | 109.32 | Belarus | Heavyweight | Jackson-Wink MMA |
Name | Nickname | Birthday | Height | Weight | nationality | weight_class | camp | Query |
---|---|---|---|---|---|---|---|---|
Andrei Arlovski | The Pit Bull | 1979-02-04 | 193.04 | 109.32 | Belarus | Heavyweight | Jackson-Wink MMA | /fighter/Andrei-Arlovski-270 |
Ikuhisa Minowa | Minowaman | 1976-01-12 | 175.26 | 83.91 | Japan | Middleweight | Kuma Gym | /fighter/Ikuhisa-Minowa-250 |
Kazushi Sakuraba | The Gracie Hunter | 1969-07-14 | 182.88 | 75.75 | Japan | Welterweight | Laughter7 | /fighter/Kazushi-Sakuraba-84 |
Ronda Rousey | Rowdy | 1987-02-01 | 167.64 | 61.23 | United States | Bantamweight | Team Hayastan | /fighter/Ronda-Rousey-73073 |
B.J. Penn | The Prodigy | 1978-12-13 | 175.26 | 65.77 | United States | Featherweight | BJ Penn’s MMA | /fighter/BJ-Penn-1307 |
Thales Trindade | Pezao | NA | NA | NA | Brazil | NA | NA | /fighter/Thales-Trindade-166551 |
Summarizing fighter demographics: gender, weight and locale
For many fighters, our dataset contains important fighter attributes such as nationality, weight class and age. One attribute that is noticeably absent is gender. If we are interested in the factors affecting who-fights-who in MMA, then gender is definitely important.
One method that we can use to try to predict the genders of fighters is to look at the gender that is generally associated with a fighter’s first name. The gender provides great resources for exactly this task. Using the “ssa” method, which looks at Social Security Administration baby names, for each first name, we can determine the fraction of male babies and the fraction of female babies with the name.
Name | Query | first_name | Pr_male |
---|---|---|---|
Andrei Arlovski | /fighter/Andrei-Arlovski-270 | Andrei | 1.0000 |
Ikuhisa Minowa | /fighter/Ikuhisa-Minowa-250 | Ikuhisa | NA |
Kazushi Sakuraba | /fighter/Kazushi-Sakuraba-84 | Kazushi | NA |
Ronda Rousey | /fighter/Ronda-Rousey-73073 | Ronda | 0.0083 |
B.J. Penn | /fighter/BJ-Penn-1307 | B.J. | NA |
Thales Trindade | /fighter/Thales-Trindade-166551 | Thales | 1.0000 |
Classifying names works for the majority of fighters, but for 8% of fighters predicted gender is clearly ambiguous (0.1 < Pr(Male) < 0.9), while for 11% of fighters no gender prediction is possible at all.
In addition to gender, we would like to summarize each fighter’s weight class and where he/she is from. To summarize where fighters are from, nationality may be too specific for some purposes (there are 193 countries in the UN). To provide a less specific summary of fighters’ locations, we can use the countrycode package. Countrycode provides mappings between countries and regions. We can then also manually add the continents in which these regions fall.
nationality | n | nation_region | nation_continent |
---|---|---|---|
165 | 1 | NA | NA |
185 | 1 | NA | NA |
Afganistan | 1 | NA | NA |
Afghanistan | 98 | Southern Asia | Asia |
Albania | 13 | Southern Europe | Europe |
Algeria | 10 | Northern Africa | Africa |
Based on our starting weight class and nationality data along with the genders that were inferred based on names, we have complete data for 49% of fighters. We can be confident in weight class (although this does change across fighters’ careers). We are less confident, however, in the inferred gender because there are many cases of ambigously predicted genders.
Network-based imputation of fighter demographics
To improve our prediction of fighters’ locale, weight class and gender, we can use an additional important source of information: their matchups with other fighters. Fighters are likely to fight people who are a similar weight, the same gender and who reside in a similar location.
Our strategy is to capitalize on bout information to label fighters’ unknown attributes (and gender, as this may be mispredicted by name), according to a consensus of fighters’ neighbors. This approach is implemented as an iterative algorithm. During one iteration, we start with bout data (fighter-opponent) and fill in the opponent demographics. Then a consensus of opponents’ demographics is taken for each fighter and the fighter’s attributes are updated if they were previously unknown. After each iteration we learn more about opponents’ demographics (if they were missing); this information is then passed onto fighters until we are left with only ambiguous attributes.
Name | first_name | Pr_male | Pr_male_impute | weight_class | weight_class_impute | nationality | nationality_impute |
---|---|---|---|---|---|---|---|
Andrei Arlovski | Andrei | 1.0000 | 1.0000 | Heavyweight | Heavyweight | Belarus | Belarus |
Ikuhisa Minowa | Ikuhisa | NA | NA | Middleweight | Middleweight | Japan | Japan |
Kazushi Sakuraba | Kazushi | NA | NA | Welterweight | Welterweight | Japan | Japan |
Ronda Rousey | Ronda | 0.0083 | 0.0083 | Bantamweight | Bantamweight | United States | United States |
B.J. Penn | B.J. | NA | NA | Featherweight | Featherweight | United States | United States |
Thales Trindade | Thales | 1.0000 | 1.0000 | NA | NA | Brazil | Brazil |
As the imputation algorithm proceeds, we are able to fill in a large amount of missing information. This method should accurately predict unknown genders, weight classes and the general locale of fighters (inferred nationalities are more uncertain in many cases).
Pr_male_impute | weight_class_impute | nationality_impute | nation_region_impute | nation_continent_impute | iteration |
---|---|---|---|---|---|
15890 | 55810 | 42317 | 42317 | 42317 | 0 |
8719 | 26693 | 20429 | 20429 | 20429 | 1 |
820 | 13393 | 9595 | 9595 | 9595 | 2 |
688 | 11144 | 7775 | 7775 | 7775 | 3 |
671 | 10680 | 7412 | 7412 | 7412 | 4 |
670 | 10576 | 7338 | 7338 | 7338 | 5 |
670 | 10551 | 7319 | 7319 | 7319 | 6 |
670 | 10546 | 7312 | 7312 | 7312 | 7 |
670 | 10544 | 7308 | 7308 | 7308 | 8 |
670 | 10542 | 7307 | 7307 | 7307 | 9 |
670 | 10542 | 7307 | 7307 | 7307 | 10 |
Summary of fighter demographics
As a quick summary of cleaned-up fighter-specific data, we can look at summaries of two major types of fighter demographics: fighter weight classes and where fighters live.
First we will look at weight classes. Because weight classes are ordered, with Atomweight the lightest and Super Heavyweight the heaviest, weight categories were sorted according to their logical progression and the frequency of fighters in each weight class was determined
Looking at weight classes: most fighters compete at intermediate weight classes; the most common weight classes are Lightweight and Welterweight.
The geographical summaries that we generated: nationality, region and continent are a nested hierarchy with nations residing within regions and regions within continents. To make use of this hierarchy for visualization we need to make sure that this nested structure is maintained for all fighters such that nations can only belong to one region and regions can only belong to one continent.
Continent | Region | Nationality | n |
---|---|---|---|
N. America | Northern America | United States | 57170 |
S. America | South America | Brazil | 18553 |
Europe | Northern Europe | United Kingdom | 8190 |
Asia | Eastern Asia | Japan | 5928 |
N. America | Northern America | Canada | 4618 |
Europe | Eastern Europe | Russia | 4478 |
N. America | Northern America | USA | 2436 |
Europe | Eastern Europe | Poland | 2361 |
Having established a hierarchy of fighter locations, summarized based on the number of fighters residing in each country, nested within geographical regions and continents, we can visualize this hierarchy using a treemap. Essentially we will turn each country into a rectangle with area proportional to the number of fighter’s from that country. These rectangles will be nested according to the continent-region-nationality hierarchy.
To allow for clearer discrimination of regions (which will be especially useful for later analyses) we can shade similar countries with similar colors. Conventionally, we would only color one level of our hierarchy by either choosing unique colors for continents or regions or nations. If we only colored continents, we would only have a very coarse summary of geograophy. If we colored nations then similar country might receive very different shades of color.
Using a hierarchy to guide the choice of colors seems natural when such information is present. Here, I chose a general color for each continent, shades of that color for regions, and more specific colors for each country.
By coloring and organizing fighter according to their nationality we can see that Mixed Martial Arts is truely a world-wide phenomenon, although its popularity is greatest in the USA, Europe, Brazil and Japan.
In my next post, I will demonstrate how the wide geographical distribution of MMA fighters and other factors such as weight class and gender structure the matchups between fighters. This structure emerges directly from the matchups between fighters and can be readily visualized by treating MMA fights as an undirected graph.