Schedule Strengths (1 of 3): Finding the Best Strength of Schedule Metric

Background

A few months ago, I kicked off a discussion about how to define Strength of Schedule for FRC, and introduced a new metric as my best shot at quantifying it. Well, after a long hiatus, I’m back at it, looking at strength of schedule metrics. To summarize, what I am looking for is a metric which is forward-looking, year-independent, and mirrors as much as possible what we colloquially mean when we say a schedule is “good” or “bad”. I want these three properties for the following reasons:

  1. Forward-looking means that I want to be able to tell, before the matches take place, whether the schedule is good or bad. There are lots of easy backward-looking metrics we could use (that is, metrics that evaluate schedule strength after the event based on observed performance there), but such metrics cannot be applied to judge a schedule right when it is released, which is the moment we most want to evaluate schedule strength. Furthermore, such metrics could not be used to generate “balanced” schedules, which is a long-term goal of mine.
  2. Year-independence means that the metric is broadly applicable in any FRC game, provided the general 3v3 structure remains the same. This is important because I don’t want to have to re-do all of this work every year; I want something that makes as few assumptions as possible.
  3. Matches our colloquial definition of schedule strength means that the metric has the properties we would expect it to have. For example, we expect that schedules with more matches will tend to be fairer than schedules with fewer matches. We also expect to be able to look at the “worst” schedules at a glance and recognize why they are so bad, and vice versa for the “best” schedules. If we don’t have these properties, we are probably not measuring anything useful.

With this in mind, I have developed 7 candidate forward-looking metrics and will be sharing with you my analysis of which one(s) have the most general value moving forward, particularly for the future development of “balanced” schedules.

Candidate metrics

Here are the seven candidate metrics, as well as their descriptions. I just made up all of these names, so sorry if you don’t like them:

  • Caleb’s Strength of Schedule: A detailed description of this metric can be found in these two posts. However, I’ve made a slight change since then. Essentially, this metric is the probability that the given schedule will end up being better than a random schedule for your team according to my event simulator. The formula in my second link has been slightly modified to the following:

[Image: the modified strength of schedule formula]

This changes it so that the average schedule is now 50%, and teams who have the first seed locked now have strengths around 50% instead of 100%.

  • Expected Rank Change: This metric also uses my simulator, but instead of all the crazy math of the previous metric, this is simply the given team’s average rank using random schedules subtracted from their average rank using the given schedule, divided by the number of teams at the event. In addition to its simplicity, a large advantage of this metric is that, since it uses my simulator, it factors in the bonus RPs, which none of the remaining metrics are capable of doing.
  • Average Elo Difference: For each match, sum the opponent Elos and subtract the partner Elos; this metric is the average of that quantity over all matches, minus the average event Elo (to allow comparison between events). Pretty straightforward.
  • Expected Wins Added: Again using Elo, this metric is the expected percentage of wins the schedule would add to an average team at the event. So a value of 0 indicates that an average team would be expected to win 50% of their matches, while a value of 0.4 (40%) would indicate that an average team is expected to win 90% of their matches.

The following 3 metrics are all found by sorting all of the teams entering the event by their Elo rating, and then comparing only these Elo “ranks”, not the actual Elo values. A rough code sketch of these rank-based metrics, along with the two Elo-based metrics above, follows the list.

  • Average Rank Difference: This metric is found by taking the sum of all opponent ranks minus the sum of all partner ranks in each match, averaging that over all matches, subtracting ((# of teams + 1)/2) (the average rank at the event), and then dividing by the number of teams in order to allow comparison between events.
  • Weighted Rank Difference: Very similar to the above metric, except this one is found by weighting partners more heavily than opponents. So it is the sum of all opponent ranks minus (3/2)*(sum of partner ranks), and then divided by the number of teams at the event. This is because you will always have 1.5 times as many opponents as partners.
  • Winning Rank Matches: This metric is found by treating each match as a binary event: it is either a “winning” match or a “losing” match depending on your partner and opponent ranks. If the sum of your partner ranks + ((# of teams + 1)/2) is lower than the sum of the opponent ranks, it is considered a winning match for an average team; otherwise it is a losing match. The number of winning matches is then divided by the number of matches played, and 0.5 is subtracted from the result in order to allow comparison between events.
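
To make the arithmetic above concrete, here is a minimal Python sketch of the two Elo-based metrics and the three rank-based metrics for a single team’s schedule. All of the names here are hypothetical, the Weighted Rank Difference is averaged over matches the same way as the unweighted version, and the win probability uses the standard Elo logistic formula as an assumption, so treat this as an illustration of the calculations rather than my exact implementation.

```python
import statistics

def schedule_strength_metrics(matches, all_elos, avg_event_elo):
    """Sketch of the Elo- and rank-based schedule strength metrics.

    matches: list of (partner_elos, opponent_elos) tuples for one team's
        schedule (2 partner Elos and 3 opponent Elos per match).
    all_elos: Elo ratings of every team at the event.
    avg_event_elo: average Elo rating at the event.
    """
    n_teams = len(all_elos)
    avg_rank = (n_teams + 1) / 2  # the average Elo "rank" at the event

    def rank_of(elo):
        # Rank 1 = highest Elo at the event (ties broken arbitrarily).
        return 1 + sum(1 for e in all_elos if e > elo)

    elo_diffs, win_probs, rank_diffs, weighted_diffs = [], [], [], []
    winning_matches = 0
    for partners, opponents in matches:
        # Average Elo Difference: opponents minus partners, centered on the event average.
        elo_diffs.append(sum(opponents) - sum(partners) - avg_event_elo)

        # Expected Wins Added: win probability for an average-Elo team with these
        # partners and opponents, assuming the standard Elo logistic formula.
        diff = (sum(partners) + avg_event_elo) - sum(opponents)
        win_probs.append(1 / (1 + 10 ** (-diff / 400)))

        partner_ranks = [rank_of(e) for e in partners]
        opponent_ranks = [rank_of(e) for e in opponents]

        # Average and Weighted Rank Difference contributions for this match.
        rank_diffs.append(sum(opponent_ranks) - sum(partner_ranks) - avg_rank)
        weighted_diffs.append(sum(opponent_ranks) - 1.5 * sum(partner_ranks))

        # Winning Rank Matches: does an average team (rank ~ avg_rank) plus its
        # partners out-rank the opposing alliance? (Lower rank sums are better.)
        if sum(partner_ranks) + avg_rank < sum(opponent_ranks):
            winning_matches += 1

    n_matches = len(matches)
    return {
        "average_elo_difference": statistics.mean(elo_diffs),
        "expected_wins_added": statistics.mean(win_probs) - 0.5,
        "average_rank_difference": statistics.mean(rank_diffs) / n_teams,
        "weighted_rank_difference": statistics.mean(weighted_diffs) / n_teams,
        "winning_rank_matches": winning_matches / n_matches - 0.5,
    }
```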

Results

I have uploaded a file to my Miscellaneous Statistics Projects paper titled “2018_schedule_strengths_v4” which contains these metrics for all FRC teams at all 2018 events. According to 5 out of the 7 metrics, 2220 on Archimedes was the best 2018 schedule. There is less agreement between the metrics on the worst schedule, but 2096 on Hopper was among the 10 worst schedules by every metric. I posted a brief comparison of 2220’s and 2096’s partners and opponents here. It seems clear to me that we are on the right track based on this validation, as these two schedules are clearly extremely “good” and “bad”, respectively.

Determining the Best Metric

Now, with the results in hand, let’s determine which of these metrics are best to use going forward. All 7 are forward-looking, so we can’t winnow down any options based on that. They also all at least roughly meet our colloquial definition of “schedule strength”, based on the simple validation of the best and worst schedules above. However, we can determine which metrics better meet this criterion by examining the correlations between them. Here is a chart showing the correlation coefficients between each of the metrics:

One of these very clearly stands out from the others: the “Expected Wins Added” metric. This metric is dominated by every other metric, meaning that its correlation with any third metric is always lower than the correlation between that third metric and any of the alternatives. This suggests that “Expected Wins Added” is probably not capturing the colloquial definition of “schedule strength” as well as the other metrics do. Note, though, that its correlations are still well above zero, so this metric is not completely useless; it is just not as well suited as the alternatives for what we are looking for. In a similar way, “Winning Rank Matches” is clearly a step below all of the other options, so let’s throw out that metric as well. Removing these two gives us the new correlation chart:

There are no longer any obvious candidates to remove based on their correlations. What we see instead are 3 groups of metrics. Group 1 is “Caleb’s Strength of Schedule” and “Expected Rank Change”. These metrics are understandably very strongly correlated, since they are both direct outputs of my event simulator and factor in bonus RPs and other team attributes not found in the other Elo-based metrics. Group 2 contains “Average Rank Difference” and “Weighted Rank Difference”. These metrics are understandably very correlated, since the opponent calculations are equivalent and the partner calculations differ only by a factor of 1.5. Group 3 contains “Average Elo Difference”, which has slight attributes of both Group 1 and Group 2, and thus has intermediate correlations with all of the other metrics.
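
For anyone who wants to reproduce the correlation charts from the spreadsheet, the comparison is just a pairwise correlation matrix over the per-team metric values. A minimal sketch using pandas follows; the file and column names are placeholders, not the actual headers in my file:

```python
import pandas as pd

# Placeholder file/column names -- substitute the actual headers from the
# "2018_schedule_strengths_v4" spreadsheet.
df = pd.read_csv("2018_schedule_strengths_v4.csv")
metric_cols = [
    "calebs_strength_of_schedule",
    "expected_rank_change",
    "average_elo_difference",
    "average_rank_difference",
    "weighted_rank_difference",
]
# Pairwise Pearson correlation coefficients between the remaining metrics.
print(df[metric_cols].corr())
```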

So we can’t easily eliminate any of these based on criterion 3, but fortunately we can eliminate some based on criterion 2, year-independence. I personally think my simulator is an incredible tool for this work (but I’m biased :)); however, what it certainly is not is year-independent. There are a lot of 2018-specific features in it. So both “Caleb’s Strength of Schedule” and “Expected Rank Change” should be thrown out by criterion 2. But you might ask, “why include them at all if you were planning to throw them out from the start?” Well, for one, I wanted to see how they would compare to the others, because I still think they are excellent for finding schedule strengths in 2018, and I think the results have shown that. More importantly though, we can still look at their correlations with the other metrics even if we weren’t planning to use them.

In a similar vein, we should also throw out “Average Elo Difference”. Elo is really cool (again, I’m biased), but it is not widely used or accepted in the broader FRC community relative to things like event ranks or District Points. Either of the latter can easily be substituted for the Elo ranks used in “Average Rank Difference” and “Weighted Rank Difference”, but trying to map them onto something like Elo ratings would get messy very quickly.

So we’re left with just “Average Rank Difference” and “Weighted Rank Difference”. “Average Rank Difference” has the benefit of being simpler to explain and understand. “Weighted Rank Difference” is slightly harder to explain, but it does correlate marginally better with the output of my event simulator. I believe the higher correlation of “Weighted Rank Difference” comes from the fact that individual partners should be weighted higher than individual opponents due to their effect on the bonus RPs. Good opponents can cost you the win, but good partners can both help you win and help you to achieve bonus RPs. Both of these options are good choices and I can understand using either.

Final Thoughts

My personal choice moving forward, though, will be to use “Average Rank Difference”. Current students have never experienced a game without the bonus RPs, so they might not realize that this ranking structure is actually a recent phenomenon in FRC. I am not yet convinced that the 2 RP win + 2 bonus RPs for separate game tasks formula will hold into the future, so I think it makes more sense at this point in time to weight all partners and opponents equally for strength of schedule, and not assume we will continue having bonus RPs indefinitely. If the GDC continues this pattern for a few more years I will re-evaluate, but that is where I stand for now.

That’s all I’ve got for now, but I’m not done yet. It’s one thing to complain about existing schedules, but my next step is to use this metric to actually generate “balanced” schedules. I’d also like to go back in time and apply this metric to all previous 3v3 games so that we can see the best and worst schedules of all time; I want to see in context how awful the 2007 “algorithm of death” actually was.
