Forecasting DCI Scores Part 1: How Not to Forecast DCI Scores

drum-corps
modeling
Published June 25, 2018

Welcome to my DCI Modeling Series - a series of an indeterminate number of posts I’m doing on forecasting DCI scores. This all started as an educational endeavor for me back in 2016, so I’ve decided to share my experience and methodologies as I work through developing the model. Everything is open source and available on GitHub (until Microsoft destroys it), and this project wouldn’t be possible without data provided by corpsreps.com (now dcxmuseum.org) – thanks guys!

I aged out of DCI in 2015, so this is my third summer as a “real person”. In what is obviously a complete coincidence, this is also my third summer of modeling DCI scores and producing Finals Week forecasts. My first two years went okay, and I had some fun, but the models had lots of mistakes. This post summarizes my previous efforts and talks about where I dropped the ball.

Just as a note, all my code for this project is openly available on GitHub - starting in 2016 and going through 2018 (when I actually get around to finishing the code for the 2018 model). If you want to really know what’s going on under the hood, looking at the code is the best way to do it. I’ve done my best to comment and make the code readable (it’s still pretty bad - don’t judge me. Yours isn’t any better).

Let’s start with the basic questions - at what level of detail should we model scores? Each show can have up to 11 judges - should we model them all or just worry about total scores? When I started, I figured it was best to model at the caption level - GE, Music, and Visual - because this allows the model to account for corps that are really good in one area but keeps it flexible enough to account for varying panel sizes. We also don’t have to worry about what individual judges are thinking. Three years later, I still think this is the best approach (this is just about the only thing I got right the first time).

We start in 2016, with a simple model I threw together because I was bored one day in late July. The model fits a line through each corps’ scores and simply extrapolates those lines out to Finals Week (similar to what FloMarching did last year in their rankings). This approach is mostly fine if you just want to get a sense of things, but it breaks down pretty quickly.

Consider an example (we’ll use total scores instead of captions to keep things simple) from last year. Bluecoats performed on days 1, 3, and 5 of the season, getting a 72.3, 72.7, and then a 74.9. The best fit line through their data yields the following equation:

Score = 71.35 + 0.65*Day

The model claims Bluecoats starts with a score of 71.35 on a fictional Day 0 and improves by an average of 0.65 points per day through the season. Finals was Day 52 last year, which means the model would forecast a score of … 105.15. Oops.
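To make the problem concrete, here’s a minimal sketch of that extrapolation using numpy. The scores are the Bluecoats example from above; the actual 2016 code on GitHub differs in the details, so treat this as an illustration rather than the model itself.

```python
import numpy as np

# Bluecoats' early-season scores from the example above
days = np.array([1, 3, 5])
scores = np.array([72.3, 72.7, 74.9])

# Fit a straight line: Score = intercept + slope * Day
slope, intercept = np.polyfit(days, scores, deg=1)
print(f"Score = {intercept:.2f} + {slope:.2f}*Day")   # Score = 71.35 + 0.65*Day

# Extrapolate to Finals (Day 52 last year) -- well past the 100-point ceiling
finals_day = 52
print(f"Finals forecast: {intercept + slope * finals_day:.2f}")   # 105.15
```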

So what went wrong? The biggest problem is that DCI scores don’t improve linearly through the season. We fit a line through data that isn’t linear. Early-season scores are close to linear, but they level off towards the end of the season, so a straight line will tend to overestimate late-season scores. The next mistake is that the model never tells us how sure it is. Is the improvement actually 0.65 points per day, or could it be 0.5 instead? For us to trust a model, we need an indication of how precise, or confident, it is.

The 2017 model fixed both of these issues. Instead of fitting a line to the data, it fits an exponential (we’ll get to the uncertainty in a bit), which basically amounts to moving the b in the equation below from a coefficient up into an exponent.

Score = a + b*Day       --->       Score = a + Day^b

When we fit a line, a gives us the initial condition and b gives us the pace of improvement. They have the same basic jobs in the exponential, but define a curve that levels off instead of a straight line.
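For the curious, here’s a rough sketch of how you could fit that kind of curve with scipy. The scores below are made up for a hypothetical corps and the starting guesses are mine - this isn’t lifted from the 2017 code, it’s just meant to show the mechanics.

```python
import numpy as np
from scipy.optimize import curve_fit

def improvement_curve(day, a, b):
    # Score = a + Day^b: with b below 1, the curve keeps rising but levels off
    return a + day**b

# Made-up scores for a hypothetical corps, just to illustrate the fit
days = np.array([1, 4, 8, 12, 17, 22, 28, 34, 40])
scores = np.array([71.2, 73.1, 75.0, 77.1, 79.0, 81.3, 83.4, 85.6, 87.7])

# curve_fit returns the best-fit (a, b) plus their covariance matrix,
# which is where the fit uncertainty discussed below comes from
(a, b), cov = curve_fit(improvement_curve, days, scores, p0=(70.0, 0.8))
print(f"a = {a:.2f}, b = {b:.3f}")
print(f"Forecast for Day 52: {improvement_curve(52, a, b):.2f}")
```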

The exponential fits the data better than a linear model, but it’s far from perfect. For starters, it’s harder to fit an exponential to data than a line, so we need more data to make a prediction (the model didn’t try until a corps had performed 6 times, but it typically took 8-10 performances to get a reasonable fit). This is important for Open Class, as we often didn’t get enough data for those corps until late July or even early August. In addition, the fit is more sensitive to performances that come on the heels of long gaps than it is to other scores. This means the 2017 model is less confident in its exponential fit than the 2016 model is in its linear fit.

Luckily, the 2017 model let us know how confident it was and took some steps to hedge its bets. Instead of implying it was 100% sure like the 2016 model, it gave odds, such as giving the Blue Devils a 65% chance to win after Prelims night last year. How does the model go from an exponential to assigning odds of winning?

In modeling DCI scores, there are two types of uncertainty. The first is the uncertainty in the fit of the improvement curve (aka the exponential) - how sure are we that this corps is this good and not better or worse? The model can easily quantify this uncertainty because it’s dictated by the strength of fit of the improvement curve.

The second kind of uncertainty is noise from show to show that’s basically random - corps have good nights and bad nights, some panels give unusually high or low scores, and so on. This is a little trickier to quantify. The naive approach would be to simply figure out what the magnitude of this uncertainty is (which is pretty easy to do with the historical data provided by corpsreps) and add that much noise to each corps. Unfortunately, things aren’t so simple because the noise is highly correlated from corps to corps.

Suppose we’re watching Semifinals in 2017, and Mandarins gets an unusually high score compared to what the model expected. It’s tempting to think their chances of making Finals have increased substantially, but history tells us they didn’t. This is because the fact that Mandarins got a high score is often an indication that the Madison Scouts will also get a higher score than the model expects. In other words, the noise in Mandarins’ score tells us something about what we would expect the noise in Madison’s score to be. This is what it means for scores to be correlated, and it makes it tougher for one corps to pass another late in the season.
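To put a number on that intuition, here’s a tiny simulation assuming the show-to-show noise is jointly normal. The correlation (0.7) and noise scale (0.4 points) are made up for illustration - the point is just that when one corps runs hot, a correlated corps tends to run hot too, so the gap between them barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, sigma = 0.7, 0.4   # made-up correlation and noise scale (in points)
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

# Simulate joint noise for two corps, then look only at nights where corps A ran hot
noise = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
hot_nights = noise[noise[:, 0] > 0.5]
print(f"Corps B's average noise when A is +0.5 or better: {hot_nights[:, 1].mean():+.2f}")
# The average is well above zero: B also tends to beat its curve on those nights,
# so A's surprise doesn't buy it much ground on B.
```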

In practice, the model runs a Monte Carlo, simulating each show 1000 times. It draws a random number for each corps’ score according to the exponential fit - a random a and b based on the improvement curve and its uncertainty. Then, it simulates the random noise for each corps by drawing random numbers from a multivariate distribution, which makes sure the noise is correlated from corps to corps. Once the model has 1000 shows’ worth of simulated results, assigning odds is just an exercise in counting. If the model says a corps has a 95.4% chance of winning, that’s literal: the corps won in 954 of the 1000 simulations.
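Here’s a condensed sketch of that simulation loop. Everything numeric below - the corps names, fitted (a, b) values, fit covariances, and noise covariance - is a placeholder I made up for illustration; the real model estimates those from the season’s data and historical residuals.

```python
import numpy as np

rng = np.random.default_rng(2017)
n_sims, finals_day = 1000, 52

# Placeholder inputs: fitted (a, b) per corps, each fit's covariance matrix,
# and a corps-to-corps noise covariance estimated from historical residuals
corps = ["Corps A", "Corps B", "Corps C"]
fits = {"Corps A": (70.5, 0.79), "Corps B": (70.0, 0.80), "Corps C": (68.5, 0.81)}
fit_cov = {name: np.array([[0.25, 0.0], [0.0, 1e-4]]) for name in corps}
noise_cov = 0.16 * (0.7 * np.ones((3, 3)) + 0.3 * np.eye(3))   # correlated noise

wins = dict.fromkeys(corps, 0)
for _ in range(n_sims):
    # 1. Draw a plausible (a, b) for each corps from its fit and fit uncertainty
    curves = {name: rng.multivariate_normal(fits[name], fit_cov[name]) for name in corps}
    # 2. Draw one night's worth of correlated show-to-show noise
    noise = rng.multivariate_normal(np.zeros(3), noise_cov)
    # 3. Score the simulated show and record the winner
    scores = {name: curves[name][0] + finals_day ** curves[name][1] + noise[i]
              for i, name in enumerate(corps)}
    wins[max(scores, key=scores.get)] += 1

for name in corps:
    print(f"{name}: {wins[name] / n_sims:.1%} chance to win")
```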

The fundamental approach of the 2017 model to handling error is sound and standard practice, but I still made some mistakes. First, I artificially reduced the uncertainty in the exponential fit by one third. My thought was that the statistical uncertainty would be greater than the actual uncertainty because of effects like slotting. That may sound reasonable, but it’s wrong. Second, I assumed that the correlation between scores for all corps was equal regardless of placement - that is, I assumed that the correlation in noise between the Blue Stars and Crossmen was the same as between the Blue Stars and Legends. This may once again seem reasonable, but it’s dead wrong.

The 2017 model also didn’t account for slotting very well. For example, the model treated Santa Clara Vanguard as the favorite for most of the season, at one point giving them an 85% chance to win it all, despite the fact that they were never able to consistently beat the Blue Devils. Vanguard’s scores started lower and they improved quickly, and the model assumed this would continue and that they would therefore pass the Blue Devils. That never happened, and in fact it rarely does in DCI’s history. The next version of the model needs to be more skeptical - if Vanguard can’t consistently beat or at least draw even with the Blue Devils, they shouldn’t be the favorites to win.

Even with all its flaws, the 2017 model was pretty good. It used the correct exponential to estimate the pace of improvement throughout the season, and it used the correct approach to hedging the uncertainty. It was just calibrated poorly. In general, it was too confident in the exponential and operated on incorrect assumptions about how the uncertainty propagates. As we’ll see in the next installment of this series, the 2018 DCI model is mostly a well-calibrated version of the 2017 model.