This is a post about when things go wrong.
Incident response is an important part of running any production service. The first step is to make immediate fixes so that things either fail gracefully or return to a semi-operable state. Then it is important to review the incident so you can make improvements that prevent similar problems in the future. Championship 2015 is one such story; we’ve gone through our chat logs and old documents to reconstruct a timeline of events that illustrates how software development is an iterative process.
FIRST began providing event results and team information through an API in 2015. Almost a whole season went by with only a few hiccups here and there, which is not bad for a first cut at providing an API. However, during the 2015 Championship Event, their API went down hard.
We knew a lot of teams depend on The Blue Alliance for scouting data and match notifications, so we wanted a workaround as fast as possible. Here’s an account of what happened on our side when we realized that FIRST was having problems posting match results.
Thursday, April 23, 2015 @ 10:04 AM Eastern: We discover that our logs show the FIRST API returning error code 500: Internal Server Error on a subset of match result endpoints.
10:12 AM: Debugging shows that any match query that includes a played match returns a 500 error. Queries that include only unplayed matches work fine. An email is sent to the FIRST API team documenting our findings and asking what they can do about it.
10:16 AM: @FRCTeams Tweets:
— FRC Teams (@FRCTeams) April 23, 2015
A discussion begins on how TBA can handle this. “Go back to HTML parsing” is suggested, but we quickly realize that results are not available on the FRC Events web pages either.
10:25 AM: We get word from people at the event that FIRST “announced on Carver that they couldn’t connect to the FIRST servers.” It doesn’t fully explain the 500 errors, but confirms that match results likely won’t be appearing on FRC Events or the FIRST API anytime soon. We decide to wait until the “update at lunch” @FRCTeams Tweeted about and analyze the situation then.
1:14 PM: FRC Events appears to be updated with pre-lunch matches, but is not continuing to update live as matches are played. The FIRST API still doesn’t have match results, but oddly, rankings are working fine.
3:12 PM: We write a script to parse match results from the FRC Events web pages. Pre-lunch matches are imported into TBA, but new matches still don’t show up.
7:48 PM: Some divisions are being updated with the day’s matches on FRC Events, so they are imported into TBA.
Friday, April 24 @ 12:18 AM: After talking to non-FIRST-officials who have some experience with FMS, we come to the conclusion that FIRST’s sync issues probably stem from a lack of internet bandwidth at the venue for uploading match results, given the way their database syncing works (we won’t go into those details here).
12:26 AM: Discussion about why FIRST’s sync uses so much bandwidth reminds us that we implemented an “Authenticated Write API” for Chezy Champs 2014. This API allows anyone with the appropriate authentication keys to send match results to TBA programmatically. We quickly explore how we can get trusted people to provide us match results, such as the FIRST scorekeeper or teams we know in each division.
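To make the idea of an authenticated write concrete, here is a minimal sketch of how a client might sign a match-result payload with a shared secret, and how a server could check it. The header names, the request path, the payload fields, and the choice of HMAC-SHA256 are all assumptions made for illustration; this is not a description of TBA's actual API.

```python
import hashlib
import hmac
import json


def sign_request(secret: str, path: str, body: str) -> str:
    """Return a hex HMAC-SHA256 signature over the request path and body."""
    message = (path + body).encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def build_signed_request(auth_id: str, secret: str, path: str, payload) -> dict:
    """Serialize the payload and attach the (hypothetical) auth headers."""
    body = json.dumps(payload, sort_keys=True)
    return {
        "path": path,
        "headers": {
            "X-Auth-Id": auth_id,  # hypothetical header: identifies the key holder
            "X-Auth-Sig": sign_request(secret, path, body),  # hypothetical header
        },
        "body": body,
    }


def verify_request(secret: str, request: dict) -> bool:
    """Server side: recompute the signature and compare in constant time."""
    expected = sign_request(secret, request["path"], request["body"])
    return hmac.compare_digest(expected, request["headers"]["X-Auth-Sig"])


# Example: a trusted volunteer submits one match result (made-up values).
example = build_signed_request(
    auth_id="division-volunteer",
    secret="not-a-real-secret",
    path="/write/matches",  # hypothetical path
    payload=[{"match": "qm12", "red_score": 120, "blue_score": 98}],
)
```

Because the signature covers both the path and the body, a request that is altered in transit, or replayed against a different endpoint, fails verification without the secret ever being sent over the wire.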
12:38 AM: We’re now deciding whether to have a Google form for submission that TBA periodically pulls from or to make a mobile-optimized input page. A Google form would require less implementation work, but it may be hard to enter results properly on mobile devices.
2:10 AM: Still trying to finalize the list of teams for each division.
2:27 AM: A first prototype of the interface is completed. Matches are automatically populated from the TBA API. Pressing “SUBMIT” calls the Authenticated Write API to send match results.
3:23 AM: The first revision of the match input page is committed to the TBA git repository.
3:38 AM: We hear back from FIRST that it is indeed a bandwidth issue at the venue. The 500 errors are due to incomplete data that makes it to their servers. No guarantee that it will be fixed at any point during the weekend, so we continue with our plan.
4:30 AM: Interface is refined, bugs are squashed, instructions are written. Time to get some sleep.
8:36 AM: Awake and still trying to confirm contacts for teams in each division.
9:17 AM: All teams (111, 118, 254, 1114, 1625, 2338, and 3132) are confirmed!
9:42 AM: We realize that FIRST’s API is returning stale data, which is wiping out any manually input results. TBA disables querying FIRST’s API for data.
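Our fix was simply to stop querying FIRST's API. A finer-grained alternative, sketched here as an assumption rather than something TBA actually shipped, is a merge rule that refuses to let upstream data replace a manually entered result. The `MatchResult` type, its fields, and the `"manual"`/`"upstream"` source labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchResult:
    key: str
    red_score: Optional[int]
    blue_score: Optional[int]
    source: str  # "manual" or "upstream" (hypothetical labels)


def has_scores(result: MatchResult) -> bool:
    """A result counts as played only if both scores are present."""
    return result.red_score is not None and result.blue_score is not None


def merge_result(existing: Optional[MatchResult], incoming: MatchResult) -> MatchResult:
    """Decide which record to keep when new data arrives for a match."""
    if existing is None:
        return incoming
    # Never let an upstream fetch overwrite a manually entered result.
    if existing.source == "manual" and incoming.source == "upstream":
        return existing
    # Never replace a scored match with an unscored (stale) record.
    if has_scores(existing) and not has_scores(incoming):
        return existing
    return incoming
```

With a rule like this, a stale upstream record with no scores would be ignored for any match that already has a manual entry, while later manual corrections would still go through.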
9:52 AM: A bug in the code causes the red alliance to show up for both red and blue. Oops!
10:16 AM-10:20 AM: The bug is fixed and we roll the site. One bug for a late night coding session isn’t too bad.
10:18 AM: Nathan posts a thread on CD to showcase our awesome community and to coordinate fixing errors due to manual data entry.
And that’s mostly it! The rest of the weekend was spent on things like manually inputting alliance selections and creating scripts to auto generate semifinal matches based on winners from the quarterfinals (and finals from semifinals) so that the match result input page would work. Sadly, that meant that I missed this unfortunate moment on the live webcast.
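For a rough idea of what such a script does, here is a minimal sketch that builds the next round's matches from a list of advancing alliances. It assumes the list is already in bracket order, so neighbors play each other; the match key format, the function name, and the alliance labels are hypothetical and not taken from our actual scripts.

```python
def next_round_matches(event_key: str, level: str, winners: list) -> list:
    """Pair advancing alliances (in bracket order) into next-round matches.

    winners[0] plays winners[1], winners[2] plays winners[3], and so on.
    """
    if len(winners) < 2 or len(winners) % 2 != 0:
        raise ValueError("need an even number of advancing alliances")
    matches = []
    for i in range(0, len(winners), 2):
        set_number = i // 2 + 1
        matches.append({
            "key": f"{event_key}_{level}{set_number}m1",  # hypothetical key format
            "red": winners[i],
            "blue": winners[i + 1],
        })
    return matches


# Four quarterfinal winners become two semifinal matches,
# and the two semifinal winners become one final.
semis = next_round_matches("2015demo", "sf", ["A1", "A4", "A2", "A3"])
final = next_round_matches("2015demo", "f", ["A1", "A2"])
```

Once these placeholder matches exist, the input page has something to populate, and volunteers can enter scores for them like any other match.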
After Championship was over (and we were able to catch up on sleep), we realized we had the opportunity to build something powerful. We already had the infrastructure built to programmatically import data from a non-FIRST source. All we had to do was build a webpage that exposed an interface for offseason events to input all their data: schedules, rankings, results, and match videos. And we did – the first revision of the TBA Event Wizard was merged on May 14, a short 3 weeks after Championship ended. We tested the system at an offseason event where we had someone there physically to do support and launched it to the world a few days later.
The Event Wizard has allowed TBA to greatly scale its offseason event data offerings and has been one of the most impactful ancillary features of the site. Offseason events can now provide real-time results and push notifications to all TBA users, and the admins no longer have to manually import all data. Nowadays, it is even easier to post data from your event to The Blue Alliance – this page explains all the ways you can help contribute.
Thanks to the teams and people that were a part of the crowdsourcing efforts and to FIRST for working hard to iterate and improve their own systems. In the years since, we have never needed to use the emergency match input capability again for an official event.