War Story: Working Around a Broken API at the 2015 Championship

This is a post about when things go wrong.

Incident response is an important part of running any production service. The first step is to make immediate fixes so that things either fail gracefully or return to a semi-operable state. Then it is important to review the incident so you can make improvements that prevent similar problems in the future. Championship 2015 is one such story; we’ve gone through our chat logs and old documents to reconstruct a timeline of events that illustrates how software development is an iterative process.

FIRST began providing event results and team information through an API in 2015. Almost a whole season went by with only a few hiccups here and there, which is not bad for a first cut at providing an API. However, during the 2015 Championship Event, their API went down hard.

We knew a lot of teams depend on The Blue Alliance for scouting data and match notifications, so we wanted a workaround as fast as possible. Here’s an account of what happened on our side when we realized that FIRST was having problems posting match results.


Thursday, April 23, 2015 @ 10:04 AM Eastern: We discover that our logs show the FIRST API returning error code 500: Internal Server Error on a subset of match result endpoints.

10:12 AM: Debugging shows that any match query that includes a played match returns a 500 error. Queries that include only unplayed matches work fine. An email is sent to the FIRST API team documenting our findings and asking what they can do about it.

10:16 AM: @FRCTeams tweets:

A discussion begins on how TBA can handle this. “Go back to HTML parsing” is suggested, but we quickly realize that results are not available on the FRC Events web pages either.

10:25 AM: We get word from people at the event that FIRST “announced on Carver that they couldn’t connect to the FIRST servers.” It doesn’t fully explain the 500 errors, but confirms that match results likely won’t be appearing on FRC Events or the FIRST API anytime soon. We decide to wait until the “update at lunch” @FRCTeams Tweeted about and analyze the situation then.

1:14 PM: FRC Events appears to be updated with pre-lunch matches, but is not continuing to update live as matches are played. The FIRST API still doesn’t have match results, but oddly, rankings are working fine.

3:12 PM: We write a script to parse match results from the FRC Events web pages. Pre-lunch matches are imported into TBA, but new matches still don’t show up.
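
For readers curious what such a scraping script involves, here is a minimal sketch using only Python's standard library. The table layout (one row per match with match name, red score, and blue score) and the names `MatchTableParser`, `parse_played_matches`, and `SAMPLE` are assumptions made for illustration; the real FRC Events markup and our actual script are not reproduced here.

```python
from html.parser import HTMLParser


class MatchTableParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None


def parse_played_matches(html):
    """Return one dict per row that has both scores filled in."""
    parser = MatchTableParser()
    parser.feed(html)
    parser.close()
    matches = []
    for row in parser.rows:
        if len(row) != 3:
            continue  # not the (hypothetical) match/red/blue layout
        name, red, blue = row
        if red.isdigit() and blue.isdigit():  # unplayed matches have blank scores
            matches.append({"match": name, "red": int(red), "blue": int(blue)})
    return matches


# Hypothetical page fragment: one played match and one unplayed match.
SAMPLE = """
<table>
  <tr><th>Match</th><th>Red</th><th>Blue</th></tr>
  <tr><td>Qualification 1</td><td>84</td><td>62</td></tr>
  <tr><td>Qualification 2</td><td></td><td></td></tr>
</table>
"""

print(parse_played_matches(SAMPLE))
```

Only the played match comes back from the sample, which mirrors why the import stalled: the script can only pick up results that the pages actually show.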

7:48 PM: Some divisions are being updated with the day’s matches on FRC Events, so they are imported into TBA.

Friday, April 24 @ 12:18 AM: After talking to non-FIRST-officials who have some experience with FMS, we come to the conclusion that FIRST’s sync issues probably stem from a lack of internet bandwidth at the venue for uploading match results, a consequence of the way their database syncing works (we won’t go into those details here).

12:26 AM: Discussion about why FIRST’s sync uses so much bandwidth reminds us that we implemented an “Authenticated Write API” for Chezy Champs 2014. This API allows anyone with the appropriate authentication keys to send match results to TBA programmatically. We quickly explore how we can get trusted people to provide us match results, such as the FIRST scorekeeper or teams we know in each division.

12:38 AM: We’re now deciding whether to have a Google form for submission that TBA periodically pulls from or to make a mobile-optimized input page. A Google form would require less implementation work, but entering results properly on mobile devices may be hard.

12:48 AM: The plan is set: Generate 8 sets of authentication keys (one for each division), contact teams from each division to see if they would be willing to help out, and create a mobile-friendly page that takes in the keys and has an interface to submit and edit match scores. This page would be a completely client-side solution, relying on JavaScript to make calls to the TBA Authenticated Write API.
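
Generating one set of keys per division can be sketched in a few lines of Python. This is an illustration only: the division labels, the `auth_id` and `auth_secret` field names, and the key lengths are assumptions, not the format TBA actually used.

```python
import secrets

# Placeholder labels for the 8 divisions (hypothetical; real division names omitted).
DIVISIONS = [f"division-{n}" for n in range(1, 9)]


def generate_division_keys(divisions):
    """Return a fresh, random ID/secret pair for each division."""
    return {
        division: {
            "auth_id": secrets.token_urlsafe(12),      # identifies the submitter
            "auth_secret": secrets.token_urlsafe(32),  # shared privately with the trusted team
        }
        for division in divisions
    }


keys = generate_division_keys(DIVISIONS)
for division, pair in keys.items():
    print(division, pair["auth_id"])  # never print or log the secret itself
```

Scoping each pair to a single division limits the damage if one key leaks: it can be revoked without interrupting the other seven.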

2:10 AM: Still trying to finalize the list of teams for each division.

2:27 AM: A first prototype of the interface is completed. Matches are automatically populated from the TBA API. Pressing “SUBMIT” calls the Authenticated Write API to send match results.

A first prototype of the match submission interface.
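
To give a feel for what happens when “SUBMIT” is pressed, here is a rough sketch of building and checking a signed write request. It is written in Python rather than the JavaScript the page used, and the header names, the HMAC-SHA256 signing scheme, the request path, and the payload shape are all assumptions chosen for illustration; this is not a description of TBA's actual Authenticated Write API.

```python
import hashlib
import hmac
import json


def _sign(auth_secret, path, body):
    """HMAC-SHA256 over the request path plus body (an illustrative scheme)."""
    message = (path + body).encode("utf-8")
    return hmac.new(auth_secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def build_signed_request(auth_id, auth_secret, path, matches):
    """Client side: serialize the results and attach ID and signature headers."""
    body = json.dumps(matches, sort_keys=True, separators=(",", ":"))
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Id": auth_id,                        # hypothetical header name
        "X-Auth-Sig": _sign(auth_secret, path, body),  # hypothetical header name
    }
    return headers, body


def verify_signed_request(auth_secret, path, headers, body):
    """Server side: recompute the signature and compare in constant time."""
    expected = _sign(auth_secret, path, body)
    return hmac.compare_digest(expected, headers.get("X-Auth-Sig", ""))


# Hypothetical path and payload for one division's match result.
path = "/write/division-1/matches"
results = [{"match": "Qualification 1", "red": 84, "blue": 62}]
headers, body = build_signed_request("example-id", "example-secret", path, results)
print(verify_signed_request("example-secret", path, headers, body))
```

Because the signature covers both the path and the body, a request signed for one division cannot be replayed against another, and an edited score fails verification.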

3:23 AM: The first revision of the match input page is committed to the TBA git repository.

3:38 AM: We hear back from FIRST that it is indeed a bandwidth issue at the venue. The 500 errors are due to incomplete data that makes it to their servers. No guarantee that it will be fixed at any point during the weekend, so we continue with our plan.

4:30 AM: Interface is refined, bugs are squashed, instructions are written. Time to get some sleep.

Instructions demonstrating “success” and “error” cases.

8:36 AM: Awake and still trying to confirm contacts for teams in each division.

9:17 AM: All teams (111, 118, 254, 1114, 1625, 2338, and 3132) are confirmed!

9:42 AM: We realize that FIRST’s API is returning stale data that is wiping any manually input results. TBA disables querying FIRST’s API for data.
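
Our fix at the time was simply to switch off the upstream query. An alternative, sketched below purely as an illustration, is to guard the merge step so that stale upstream records cannot clobber manual entries. The record shape (`red`, `blue`, `source`) and the function names are hypothetical and do not reflect TBA's data model.

```python
def is_played(match):
    """A match counts as played once both alliance scores are present."""
    return match["red"] is not None and match["blue"] is not None


def merge_match(existing, incoming):
    """Decide which record to keep for a single match."""
    if existing is None:
        return incoming  # nothing stored yet, so accept whatever arrives
    if existing["source"] == "manual" and incoming["source"] != "manual":
        return existing  # upstream data never overwrites a manual entry
    if is_played(existing) and not is_played(incoming):
        return existing  # never replace a result with an unplayed stub
    return incoming


manual = {"red": 84, "blue": 62, "source": "manual"}
stale = {"red": None, "blue": None, "source": "upstream"}
print(merge_match(manual, stale) is manual)
```

With a guard like this, manual corrections still go through (a newer manual record replaces an older one), while a stale upstream sync becomes a no-op.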

9:52 AM: A bug in the code causes the red alliance to show up for both red and blue. Oops!

10:16 AM-10:20 AM: The bug is fixed and we roll the site. One bug for a late night coding session isn’t too bad.

10:18 AM: Nathan posts a thread on CD to showcase our awesome community and to coordinate fixing errors due to manual data entry.

And that’s mostly it! The rest of the weekend was spent on things like manually inputting alliance selections and creating scripts to auto-generate semifinal matches based on winners from the quarterfinals (and finals from semifinals) so that the match result input page would work. Sadly, that meant I missed this unfortunate moment on the live webcast.

Phil’s “War Room” whiteboard ensuring all divisions had correct results coming in during eliminations.

Aftermath

After Championship was over (and we were able to catch up on sleep), we realized we had the opportunity to build something powerful. We already had the infrastructure built to programmatically import data from a non-FIRST source. All we had to do was build a webpage that exposed an interface for offseason events to input all their data: schedules, rankings, results, and match videos. And we did: the first revision of the TBA Event Wizard was merged on May 14, a short three weeks after Championship ended. We tested the system at an offseason event where we had someone physically on site to provide support, and launched it to the world a few days later.

The Event Wizard has allowed TBA to greatly scale its offseason event data offerings and has been one of the most impactful ancillary features of the site. Offseason events can now provide real-time results and push notifications to all TBA users, and the admins no longer have to manually import all data. Nowadays, it is even easier to post data from your event to The Blue Alliance – this page explains all the ways you can help contribute.

Thanks to the teams and people who were a part of the crowdsourcing efforts and to FIRST for working hard to iterate and improve their own systems. In the years since, we have never needed to use the emergency match input capability again for an official event.
