In a previous blog post, I introduced logistic regression, a simple machine learning classifier. Towards the end, I promised that we would build a model to predict which teams are likely to win awards in a given year. The approach we’ll use in this post was developed by Jordan Mattingley, Ben Worlton, and me.

## A bit of feature engineering

For this model, we’re going to make a key assumption: the awards a team has won in the past are a good indication of the awards it will win in the future.

Our justification here is twofold. First, some awards tend to be grouped together. Intuitively, it makes sense that the teams who win Quality awards are similar to the teams who win Excellence in Engineering awards. Likewise, the Chairman’s award has similar criteria to the Engineering Inspiration award. Second, using past awards as the input to our model makes the features year-independent, which allows us to pool many years of data.

Without further ado, let’s get our data.

## Pulling data from APIv3

The Blue Alliance provides a RESTful API we can use to get award histories for a particular team for a particular year. In a nutshell, using the TBA API works like this:

- Request an API key (it’s under More, then Account). The key should be kept safe. It’s probably unwise to commit it to a shared repository; I suggest keeping it in a separate file or as an environment variable.
- The API represents the data in JSON format (JavaScript Object Notation). Most programming languages have standard libraries for parsing and unpacking JSON into something easier to work with in the language. You’ll need to figure out how to parse this in your particular language. For instance, if you’re working in Python, use the `json` module.
- Now that you’ve got a key and a way to interpret the data, you need to make an API request. This is as simple as making an HTTP request (Pythonistas can use `requests` to do this), though you’ll need to set the `User-Agent` and `X-TBA-Auth-Key` headers.
- Decide which endpoints to use by referencing the API documentation. It’s worth pointing out that the documentation is interactive, and you can make requests and tune parameters in there before transferring it into your code. For this project, we’ll use two endpoints: `/teams` and `/team/{team_key}/awards`.
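The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not Jordan’s code: the base URL, the `TBA_KEY` environment variable name, the default user-agent string, and the helper names are all assumptions of mine, so check them against the API documentation. It uses the standard library’s `urllib.request` so it runs without extra installs; `requests` would work just as well.

```python
import json
import urllib.request

# Assumed base URL for TBA APIv3; confirm it in the API documentation.
BASE_URL = "https://www.thebluealliance.com/api/v3"


def build_request(endpoint, api_key, user_agent="award-predictor-example"):
    """Build a GET request for an endpoint such as '/team/frc254/awards'."""
    return urllib.request.Request(
        BASE_URL + endpoint,
        headers={"X-TBA-Auth-Key": api_key, "User-Agent": user_agent},
    )


def fetch_json(endpoint, api_key):
    """Send the request and parse the JSON response body."""
    with urllib.request.urlopen(build_request(endpoint, api_key)) as response:
        return json.loads(response.read().decode("utf-8"))


def team_awards(team_key, api_key):
    """Fetch the award records for one team (team_key looks like 'frc254')."""
    return fetch_json(f"/team/{team_key}/awards", api_key)
```

With the key stored in an environment variable (here hypothetically named `TBA_KEY`), a call would look like `team_awards("frc254", os.environ["TBA_KEY"])` after `import os`, which keeps the key out of the source file.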

Jordan put together some code for this when we were developing this model together; I’ve uploaded his code as a GitHub gist.

## Preprocessing data

The input to our logistic regression model is a fixed number of independent variables, so every data instance needs to have the same set of inputs. This gets tricky, because not all teams have been around for the same amount of time. We have a few options:

- Use only the previous year’s information.
- Use a set number of prior years.
- Find a way to summarize awards over multiple years of history.

If we go with the first option, we’ll end up throwing away about 90% of the award data we have. It also eliminates the benefit of using year-independent features. The second option prevents us from making predictions for teams that don’t have enough consecutive prior years. This also throws away a lot of data. These downsides make the third option look really attractive in comparison.

There are a few strategies for summarizing data over time, but we’ll use the discounting approach found in the Q-learning algorithm, which is, coincidentally, the one Jim Zondag uses to compute his derated points in his championship analyses. More mathematically, we’re going to use a geometric series: we multiply the awards earned in each prior year by a value between 0 and 1 raised to the power of the number of years in the past, then add the terms together. The sum of a geometric series like this stays finite even if it has infinitely many terms. So, even if a team had been around forever, we’d still have a finite number. For reference, Jim uses 2/3 for his derating factor.
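To make the weighting concrete, here is the sum written out. The notation is mine rather than anything from the original analysis: $\gamma$ is the derating factor, $a_{t-k}$ is 1 if the team won the award $k$ years before the target year $t$ and 0 otherwise, and I’m assuming the most recent prior season counts as one year in the past.

$$
x_t = \sum_{k \ge 1} \gamma^{k}\, a_{t-k}, \qquad 0 < \gamma < 1
$$

Because each $a_{t-k}$ is at most 1, the feature is bounded by the full geometric series:

$$
x_t \le \sum_{k=1}^{\infty} \gamma^{k} = \frac{\gamma}{1-\gamma}
$$

With $\gamma = 2/3$, for example, the feature can never exceed $\frac{2/3}{1/3} = 2$, no matter how long a team’s history is.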

Once we get up and running, we can play with this constant, which is a hyperparameter of the model. In practice, I’ve found that a value of 0.8 works well for this particular problem.

Here’s some code to calculate the input and output vectors. To keep things simple, my friends and I only looked at whether a team won an award in a particular year; ordering the multiple events within a year chronologically gets a bit tricky.
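As a minimal sketch of that calculation (not the original code), the functions below build one input vector and one output vector for a single team and target year. The `history` layout, the award names, and the function names are hypothetical choices for illustration, and I’m assuming last year’s awards get a weight of `decay ** 1`.

```python
def discounted_history(award_years, target_year, decay=0.8):
    """Sum decay**k over each prior year, k years before target_year, with a win."""
    return sum(
        decay ** (target_year - year)
        for year in set(award_years)  # a year counts once, however many events
        if year < target_year
    )


def build_example(history, award_types, target_year, decay=0.8):
    """Return (x, y) for one team and one target year.

    history maps an award name to the list of years the team won it.
    x holds one discounted feature per award type; y holds 0/1 labels for
    whether the team won each award in target_year.
    """
    x = [discounted_history(history.get(a, []), target_year, decay) for a in award_types]
    y = [int(target_year in history.get(a, [])) for a in award_types]
    return x, y


# Made-up team history, using decay=0.5 so the arithmetic is easy to follow.
history = {"quality": [2016, 2017], "chairmans": [2018]}
x, y = build_example(history, ["quality", "chairmans"], 2018, decay=0.5)
# quality feature: 0.5**2 + 0.5**1 = 0.75; chairmans has no prior wins
```

Repeating this for every team and every target year gives the rows of the training set, and teams with short histories simply have fewer nonzero terms.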

## Logistical learning

This is the easiest part of all of this. We just take our input and output vectors, and leverage a library (like `sklearn`) to do the logistic regression on its own. We’ll assume each award is judged independently and make a model for each award. Example code is available here.
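For a sense of what the per-award fit might look like, here is a minimal sketch using scikit-learn’s `LogisticRegression` with its default settings. The helper name, the award names, and the tiny dataset are made up purely to show the shapes involved; this is an illustration of the one-model-per-award idea rather than the code we actually ran.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def fit_award_models(X, Y, award_types):
    """Fit one independent logistic regression per award.

    X: (n_samples, n_features) discounted award histories.
    Y: (n_samples, n_awards) 0/1 labels, one column per award.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=int)
    models = {}
    for column, award in enumerate(award_types):
        labels = Y[:, column]
        if len(np.unique(labels)) < 2:
            continue  # cannot fit a classifier when only one class is present
        models[award] = LogisticRegression().fit(X, labels)
    return models


# Tiny made-up dataset: two features and two awards per row.
X = [[1.4, 0.0], [0.0, 0.8], [0.8, 0.0], [0.0, 0.0], [0.0, 1.4], [0.6, 0.6]]
Y = [[1, 0], [0, 1], [1, 0], [0, 0], [0, 1], [1, 1]]
models = fit_award_models(X, Y, ["quality", "chairmans"])

# Estimated probability of winning each award for a team with feature vector [0.8, 0.0].
probabilities = {a: m.predict_proba([[0.8, 0.0]])[0, 1] for a, m in models.items()}
```

Each model sees the full feature vector, so a team’s history with one award can still inform the prediction for another.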

In a future post, we’ll look at what this model tells us, and how we can use it effectively.