Building a large database of MMA fight results IV: turning tables into a database with R and SQL
11 Aug 2016Previously, I discussed acquiring and standardizing a compendium of fighter and fight-level MMA results.
- I: Scraping fighter and fight data: Aggregating data from ~250,000 MMA fights between ~140,000 fighters.
- II: Cleaning up fight-level data: summarizing the major ways that fights are won.
- III: Cleaning up fighter-level data: summarizing where fighters are from and major weight classes.
In this post, I will go over how the previously created fight, fighter and event-level tables can be transformed into a MySQL database that can both be created and accessed using R.
Setting up MySQL
Before we can make an MMA database we may have to setup SQL.
First we need to install a distribution of MySQL. For this purpose, I would recommend MySQLWorkbench. In addition to including a distribution of MySQL that can be accessed from the command line, MySQL Workbench also has a nice, interactive interface for making SQL queries and contains other features, such as tools to make an Enhanced Entity-Relationship (EER) diagram.
Before we can actually start querying SQL, we need to start the local server. For OS X, I do this using the MySQL Preference Pane, which allows you to start the server from Systems Preferences.
Another step that may be useful is creating an alias for MySQL so that it can be more easily worked with. I did this by adding the following code to my .bash_profile.
Setting up the MMA database using MySQL
Once the initial setup of MySQL is complete, the next step is to create the database that we will populate with our fight data.
Populating the MMA database from R using RMySQL
Working with our SQL database using MySQL Workbench
Now that we have piped the bouts, fighters and events into the MMA database we can start working with this dataset in SQL.
MySQL Workbench makes interacting with SQL queries very straightforward. For example, shown below, we can see all the results of bouts that ended in headbutts.
To better understand this dataset, it would be useful to know how fields in each dataset are connected. Generally, each table will have a primary key that uniquely identifies each entry (a row). For fighters and events, a logical choice for the table’s primary key are the urls that pointed to the fighter and event’s webpage (Query and Event_link, respectively). Individual bouts entries were derived from fighters’ pages, so the table doesn’t have any great choices for a primary key. To generate a unique primary key for bouts, we can add an additional field Fight_ID that provides a unique identifier to each bout entry. We can also auto-incriment this ID such that if additional fights are added, they will automatically be provided a new unique ID.
Primary keys are useful because they can be used to perform summaries at the fighter/bout/event level. An even more important role for these keys is that they tell us how to combine data. For example, we have both fighter-level data and bout-level data: if we want to be able to combine these two datasets, we need a clear way of linking fighter-level data to the fighter and his/her opponent in the bouts dataset. We can accurately carry out this combination because the bouts data contains two columns, Fighter_link and Opponent_link, that serve as foreign keys; their entries map to the entries of Query in the fighters table. A similar link is provided by the Event_link in the bouts data; it maps bouts to a specific MMA event stored in the events table.
An EER diagram for this dataset summarizes the fields in fighters, bouts and events as well as the primary key, foreign key relationships that link them:
Querying SQL from R using dplyr
While SQL is a powerful language for querying and combining tables, if we want to analyze our database using R, it is useful to access data from within R.
One way that we can access our database from within R is by executing a SQL call from within R.
Fighter | Opponent | Finish | Event |
---|---|---|---|
Ely Galvao | Heleno Nunes | Headbutt | BVF 11 - Circuito Brasileiro de Vale Tudo 4 |
Heleno Nunes | Ely Galvao | Headbutt | BVF 11 - Circuito Brasileiro de Vale Tudo 4 |
Roger Huerta | Jeff Carlson | Headbutt | EC - Best of the Best 2 - Day Event |
Jeff Carlson | Roger Huerta | Headbutt | EC - Best of the Best 2 - Day Event |
Sean Robinson | Chris du Pont | Headbutt | EFC Africa - Extreme Fighting Championship Africa 25 |
An alternative to making queries using SQL syntax is to use dplyr to set up similar select, filter, etc. commands and then fetch the data that we are interested in.
Fighter | Opponent | Finish | Event |
---|---|---|---|
Ely Galvao | Heleno Nunes | Headbutt | BVF 11 - Circuito Brasileiro de Vale Tudo 4 |
Heleno Nunes | Ely Galvao | Headbutt | BVF 11 - Circuito Brasileiro de Vale Tudo 4 |
Roger Huerta | Jeff Carlson | Headbutt | EC - Best of the Best 2 - Day Event |
Jeff Carlson | Roger Huerta | Headbutt | EC - Best of the Best 2 - Day Event |
Sean Robinson | Chris du Pont | Headbutt | EFC Africa - Extreme Fighting Championship Africa 25 |
Now that we can cleanly combine and access fighter and fight-level data, we can dive into some deeper analyses. I have already talked about one such analysis: treating bouts as an undirected graph to investigate the factors affecting fighter matchups.