Since 2017, I have been maintaining a model that forecasts the DCI scores and results for Finals Week. Since the models are based on historical data, it makes sense to look at what they would have forecasted if they were run for seasons in the past. This can help give us a sense for where the model does and doesn't do well and how it can be improved in the future.
Before getting into the details of how the model is built, it's important to note a key difference between the historical model and the season-by-season forecasts I've been doing since 2017. The season-specific models forecast captions (General Effect, Music, and Visual) individually, but the historical model only uses total score. This is not so much a conscious design decision as a limitation of the data I have - I only have caption-specific scores for every show in the DCI season since 2016.
The DCI model is just four steps:
Each corps' skill curve is based on a curve fitting where the independent variable is time and the dependent variable is their score. This curve follows the form y = a + x^b, where each corps has their own a and b coefficients. In practice, that looks something like this:
This plot is a comparison of the skill curves for Santa Clara Vanguard and Bluecoats in 2018. Loosely, we can interpret the a coefficient as each corps' starting skill level, while the b coefficient describes how quickly each corps improves. In 2018, the Bluecoats had a higher a but a lower b, so they started the season out in first place. But eventually Santa Clara overtook Bluecoats because they had a higher b.
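To make the curve fitting concrete, here is a minimal Python sketch. It reads the curve form as y = a + x^b (with b as an exponent) and uses a simple grid search rather than whatever fitting routine the model actually runs. The function name, the search range, and the scores are all made up for illustration.

```python
def fit_skill_curve(days, scores, b_lo=0.0, b_hi=2.0, steps=2000):
    """Least-squares fit of y = a + x**b by scanning candidate values of b."""
    best = None
    for i in range(steps + 1):
        b = b_lo + (b_hi - b_lo) * i / steps
        powered = [d ** b for d in days]
        # For a fixed b, the best-fitting a is the mean of (score - day**b).
        a = sum(s - p for s, p in zip(scores, powered)) / len(days)
        sse = sum((s - (a + p)) ** 2 for s, p in zip(scores, powered))
        if best is None or sse < best[2]:
            best = (a, b, sse)
    return best[0], best[1]

# Hypothetical scores generated from a = 70 and b = 0.8 (not real DCI data).
days = [1, 5, 10, 15, 20, 25, 30]
scores = [70 + d ** 0.8 for d in days]
a, b = fit_skill_curve(days, scores)
```

On this synthetic data the search recovers the a and b used to generate it. Real scores are noisy, which is where the standard errors discussed later come from.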
Each day and score available to the model is used to fit the skill curves, but only for corps with at least 7 shows. This is to make sure the model has a large enough sample to make an educated guess about how good a corps is and how fast they're improving. There are also cases where the curve fitting can fail, normally because the data is too scattered or sparse to make a good prediction. When that happens, the corps is ignored until the model gets more data. If you see corps missing from the predictions, it's because of one of these two reasons.
In the curve fitting algorithm, scores are not weighted the same. The model has a "discounting factor" built in, where it underweights shows that just happened. This makes the model more skeptical. It won't improve the skill curve much if a corps scores higher than it expects - it waits to see that success sustained over 3 or 4 shows before adjusting. In addition to discounting recent shows, the model also weights scores based on the day of the show, preferring scores later in the season to earlier ones. For example, a show from July 29 is weighted more than a show from June 29, because scores are more stable late in the season and more predictive of Finals Week scoring.
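The two weighting ideas can be sketched as follows. The exact weighting functions the model uses are not given here, so everything in this block is an assumption: the linear day-of-season weight, the 0.5 discount, the window of 3 shows, and the function name are all hypothetical.

```python
def show_weights(show_days, today, recency_discount=0.5, recent_window=3):
    """Hypothetical weights: later days count more, the newest few shows count less."""
    # Indices of the most recent shows, newest first.
    order = sorted(range(len(show_days)), key=lambda i: show_days[i], reverse=True)
    newest = set(order[:recent_window])
    weights = []
    for i, day in enumerate(show_days):
        w = day / today  # later in the season -> larger weight
        if i in newest:
            w *= recency_discount  # stay skeptical of shows that just happened
        weights.append(w)
    return weights

# Made-up show days (days since the start of the season).
weights = show_weights([5, 12, 20, 27, 33, 38, 40], today=40)
```

With these made-up numbers, the show on day 27 gets the largest weight: it is late enough to be informative but old enough that it is no longer discounted.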
The model uses the skill curve to predict how good each corps will be during Finals Week. It also tracks the uncertainty in the coefficients themselves. It treats this basically like measurement error - the model accounts for the fact that it's using imperfect data to determine each corps' skill curves and hedges accordingly. The more uncertain the curve fitting, the more it hedges.
The model creates a distribution for each corps' a and b coefficients and pulls 10,000 samples for each one. It assumes each corps is independent. Doing thousands of trials is how the model converts raw scores to percentages. For example, the model thinks that Santa Clara will clearly be better than Bluecoats during Finals Week from the plot above, but that doesn't necessarily translate to 100% certainty. One of the 10,000 runs could end up with a higher b for Bluecoats than Santa Clara Vanguard. In fact, the model had the Bluecoats with a 5% chance or so to beat Santa Clara Vanguard head-to-head very late into the season.
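A rough sketch of how repeated sampling turns coefficient uncertainty into a head-to-head percentage. It assumes the coefficients are drawn from independent normal distributions and that the curve is y = a + x^b. The coefficient values, standard errors, and finals day are invented for illustration and are not the model's real estimates.

```python
import random

def head_to_head(corps_a, corps_b, finals_day, n=10_000, seed=0):
    """Share of trials in which corps_a outscores corps_b on finals_day."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        scores = []
        for corps in (corps_a, corps_b):
            # Each corps is sampled independently of the other.
            a = rng.gauss(corps["a"], corps["a_se"])
            b = rng.gauss(corps["b"], corps["b_se"])
            scores.append(a + finals_day ** b)
        wins += scores[0] > scores[1]
    return wins / n

# Made-up coefficients and standard errors, for illustration only.
vanguard = {"a": 70.0, "a_se": 0.5, "b": 0.86, "b_se": 0.01}
bluecoats = {"a": 72.0, "a_se": 0.5, "b": 0.80, "b_se": 0.01}
p_vanguard = head_to_head(vanguard, bluecoats, finals_day=50)
```

Even though the made-up Vanguard curve is clearly higher on finals day, a small share of trials can still go the other way, which is why the forecast stops short of 100%.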
Natural variability comes from variation in how judges assign scores from show to show, as they don't always perfectly match how well the corps actually did. This variability does not mean judges are biased or political. In fact, the model does not consider individual judges in the forecast, and there is no evidence in the data to suggest that judges are collectively biased against particular corps. Rather, this part of the model just accounts for the fact that sometimes judges score a little low or a little high. They're pretty good on average.
Unlike the uncertainty in the skill-based forecast, this uncertainty is not independent from corps to corps. This is because judges tend to be off consistently from corps to corps at a show. If they score Bluecoats higher than expected, there's a very good chance they will do the same for Santa Clara Vanguard at the same show. This correlation is pretty strong, and stronger for corps that perform back-to-back than those that perform first and last. The correlations the model uses are based on those in the historical data, from 1995 to 2016.
In order for the model to be able to forecast Finals Week, it needs to know which order the corps are performing in so it can set the pairwise correlations accordingly. Because all corps perform in order of their rank during Finals Week, it just needs to rank the corps. Using the most recent scores of all corps is unfair, because corps who have performed more recently will have higher scores. Therefore, it uses the skill curves to do the ranking.
Once all the corps are ranked, the model forecasts Finals Week based just on the natural variability. It does this by pulling a random sample of 10,000 scores for each corps, based on the distribution of error in the historical data and the correlation from corps to corps.
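One simple way to draw errors that are strongly correlated for back-to-back corps and only weakly correlated for the first and last corps is a first-order autoregressive chain along the performance order. That choice is an assumption made for this sketch, not the model's actual correlation structure, which comes from the historical data. The rho and sd values are placeholders.

```python
import math
import random

def correlated_errors(n_corps, rho, sd, n=10_000, seed=0):
    """Draw n rows of judging errors; neighbours in performance order correlate by rho."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = [rng.gauss(0, sd)]
        for _ in range(n_corps - 1):
            # AR(1) step: variance stays at sd**2, correlation decays with distance.
            row.append(rho * row[-1] + math.sqrt(1 - rho ** 2) * rng.gauss(0, sd))
        rows.append(row)
    return rows

def correlation(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

rows = correlated_errors(n_corps=12, rho=0.8, sd=1.0)
adjacent = correlation([r[0] for r in rows], [r[1] for r in rows])
far_apart = correlation([r[0] for r in rows], [r[11] for r in rows])
```

In this setup the sampled correlation between adjacent corps sits near rho, while the first and last corps in a 12-corps lineup are close to uncorrelated.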
At this point, the model has 10,000 skill-based simulations of Finals Week and 10,000 simulations based on natural variability. All it does now is combine them, making sure to match the overall profile of DCI scores. Generally speaking, the natural variability is about twice as large as the skill-based uncertainty in the historical data, so those simulations are weighted about twice as much. The model also makes sure that the overall error distribution matches the historical error distribution. The margin of error for any specific show is generally about 2 to 2.5 points.
In this forecasting and averaging process, the interpretability of the model's raw score predictions breaks down. The winning corps' score can vary from less than 90 to more than 100 as the season progresses, but we know that's impossible. What the model does retain is the gaps between corps, which is why the scores in the prediction are never greater than 0. The last thing the model does before giving us output is adjust the mean score prediction so that the winner is always assigned 0, and each corps' average is the distance between them and the winning corps.
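The final adjustment amounts to subtracting the top mean score from every corps. A minimal sketch, with placeholder corps names and invented scores:

```python
def gaps_to_winner(mean_scores):
    """Re-express each mean score as its distance from the top mean score."""
    top = max(mean_scores.values())
    return {corps: round(score - top, 2) for corps, score in mean_scores.items()}

# Invented mean predictions for three placeholder corps.
example = gaps_to_winner({"Corps A": 97.4, "Corps B": 96.9, "Corps C": 95.1})
```

The predicted winner ends up at 0 and every other corps gets a negative number equal to its gap behind the winner.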
First, a shoutout: all of the content on this page is based on the analysis fivethirtyeight did on their own models. I have long wanted to make this model and publish the results, but didn't feel I had a good way to put the overall model performance in the context necessary to judge it fairly. I couldn't come up with a way of doing that I was satisfied with until I saw their self-evaluations. The content of this page is heavily based on their work. If you are interested in the approach to, and process of, making forecasting models in general, I highly recommend reading their site regularly.
Broadly speaking, there are two ways to evaluate the "goodness" of the DCI model: Calibration and Skill.
Calibration measures the accuracy of the model in assigning probabilities. If the model is well calibrated, then all the things it assigns a probability of 50% should happen roughly half the time. Likewise, things it gives a 90% chance of happening should happen 90% of the time, and things it gives a 10% chance of happening should only happen 10% of the time. If all of this is true, the model is well-aligned with what happened in the real world, which is a key part of evaluating its performance.
But calibration is easy to "hack". For example, I can guarantee a perfectly well-calibrated model by assigning all corps the "climatological mean" probabilities. For example, if there are 25 corps, I can assign each one a 4% (1 in 25) chance of winning the Founders Trophy and a 48% (12 in 25) chance of making finals. This "model" (we'll call it the Climatological Mean Model because it will be important later) will be perfectly calibrated by definition, but nobody would think it's very useful. This is why we also need to consider skill.
Model skill is a sort of catch-all term for how much more useful the model is in comparison to the Climatological Mean Model (the aforementioned perfectly calibrated, but useless, model). In particular, we will evaluate the DCI model based on the three Brier Score Decompositions and its overall Brier Skill Score.
The Brier Score is based on a simple idea. When we evaluate a model, we should give it credit for forecasts it gets "right", but that credit should also be based on how confident the model was. For example, the model gave Santa Clara Vanguard a 70% chance of winning Gold on the eve of Prelims day in 2018. While it got this right - Santa Clara did win last year - it would have been "more right" if it had given Santa Clara a 90% chance to win. This idea also applies in reverse. If the Blue Devils (which the model gave a 25% chance to win) had won last year, the model would have been wrong, and some credit would be taken away. But more credit would have been taken away if it had only given Blue Devils a 10% chance to win.
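The Brier Score has a range from 0 to 1, where 0 is good and 1 is bad. The Climatological Mean Model lands well short of 1, though: because it always forecasts the base rate, its Brier Score equals the base rate times one minus the base rate, which is at most 0.25. It is unskilled because it can never do better than that baseline, not because it scores near 1.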
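The Santa Clara and Blue Devils examples can be worked through with the standard Brier Score formula, the squared gap between the forecast probability and the outcome (1 if it happened, 0 if not), averaged over forecasts. The function and variable names below are just for illustration.

```python
def brier_term(prob, happened):
    """Squared gap between a forecast probability and the 0/1 outcome."""
    return (prob - (1.0 if happened else 0.0)) ** 2

def brier_score(probs, outcomes):
    """Average of the squared gaps over every forecast."""
    return sum(brier_term(p, o) for p, o in zip(probs, outcomes)) / len(probs)

# Santa Clara wins after being given 70%, versus a hypothetical 90%.
right_at_70 = brier_term(0.70, True)
right_at_90 = brier_term(0.90, True)
# Hypothetical upset: Blue Devils win after being given 25%, versus 10%.
upset_at_25 = brier_term(0.25, True)
upset_at_10 = brier_term(0.10, True)
```

The penalties come out to about 0.09 and 0.01 for the two correct calls, and about 0.56 and 0.81 for the two upsets: being confident and right costs the least, and being confident and wrong costs the most.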
The Brier Score's components and Brier Skill Score give us a sense for how the model performs in specific areas, especially compared to the Climatological Mean Model.
The Brier Score components are Reliability, Resolution, and Uncertainty. All three components range from 0 to roughly 0.25. Reliability is a measure for the model's calibration, where a lower score is better. Resolution measures how far the model's predictions are from the Climatological Mean Model, where a higher score means the model's predictions are farther away. This is generally considered better, but not always. Uncertainty captures how difficult the predictions actually are, where a high score means the predictions are more difficult.
It's helpful to think of the components in how they relate to the total Brier Score, through the equation:
Brier Score = Uncertainty + Reliability - Resolution
Because a low score is better, this means models get more credit when the Reliability is low and Resolution is high. We can think of the uncertainty as the baseline from which the model is judged.
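The decomposition can be checked numerically. This sketch groups forecasts by their exact probability value, in which case the identity holds exactly. If forecasts are instead binned into ranges, it holds only approximately. The forecasts and outcomes are invented, and the function name is illustrative.

```python
from collections import defaultdict

def brier_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty), grouping by forecast value."""
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)
    # Reliability: gap between each forecast value and how often it came true.
    reliability = sum(len(g) * (p - sum(g) / len(g)) ** 2 for p, g in groups.items()) / n
    # Resolution: how far each group's observed rate sits from the base rate.
    resolution = sum(len(g) * (sum(g) / len(g) - base_rate) ** 2 for g in groups.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Invented forecasts and outcomes (1 = it happened, 0 = it did not).
probs = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
rel, res, unc = brier_decomposition(probs, outcomes)
direct = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```

For this toy data, Reliability is 0.013, Resolution is 0.115, and Uncertainty is 0.24, which combine to 0.138, the same value as the directly computed Brier Score.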
The Brier Skill Score is an adjustment of the model's Brier Score relative to the Climatological Mean Model and uncertainty. It ranges from negative infinity to 1. Anything less than 0 means the model is less useful than the Climatological Mean Model, and anything greater than 0 is more useful. By definition, a model with a Brier Skill Score of 1 is perfect.
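In its standard form, the Brier Skill Score is one minus the ratio of the model's Brier Score to the reference model's Brier Score. A small sketch with invented Brier Scores (the 0.24 reference is a placeholder for the Climatological Mean Model's score):

```python
def brier_skill_score(model_brier, reference_brier):
    """1 is perfect, 0 matches the reference, negative is worse than the reference."""
    return 1.0 - model_brier / reference_brier

reference = 0.24  # placeholder Brier Score for the reference model
better = brier_skill_score(0.138, reference)
perfect = brier_skill_score(0.0, reference)
worse = brier_skill_score(0.30, reference)
```

With these placeholder numbers, the first model scores about 0.425, the perfect model scores 1, and the model that does worse than the reference scores about -0.25.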
In order to evaluate the model, we need data. Using DCI seasons from 1995 to 2018, we can compare the model's predictions to almost a quarter century of actual, real-world events. Essentially, all we need is two lists - one of the model's predictions and another of what happened.
In practice, that means we'll have 4 data points per corps per season - the odds the model gave for them to make Finals (and whether they did), the odds the model gave them to win the Bronze (and whether they did), the odds the model gave them to win Silver (and whether they did), and the odds the model gave them to win Gold (and whether they did). Because the number of corps which make Semifinals has fluctuated over time, I left it out.
We can't just use one forecast per season though, because the evaluation of the model depends greatly on the day from which we forecast. We should expect the model to perform better during Finals Week, when it has an entire season's worth of data, than in early-mid July. So we'll run the model at 9 points throughout the season, ranging from 3 days before Finals (the eve of Prelims) to 30 days before Finals. That way, the data captures the model's ability to predict throughout the course of the season. This has the added benefit of increasing the sample size, meaning we can understand the model's behavior with higher confidence.
The best way to assess the calibration of the model is to see it visually. We do that by plotting the percent of the time something actually happened versus the odds the model gave. If it's well calibrated, the plot will look like a straight line from the point (0,0) - things that happened 0% of the time which the model gave 0% odds - to the point (1,1) - things that happened 100% of the time which the model gave 100% odds.
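The points on a calibration plot come from bucketing forecasts by the odds the model gave and counting how often those things happened. A minimal sketch, assuming ten equal-width buckets (the bucket count is an assumption, and the forecasts and outcomes are invented):

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Return (mean forecast, observed frequency, count) for each non-empty bucket."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the top bucket
        bins[idx].append((p, o))
    table = []
    for members in bins:
        if not members:
            continue
        mean_forecast = sum(p for p, _ in members) / len(members)
        observed = sum(o for _, o in members) / len(members)
        table.append((mean_forecast, observed, len(members)))
    return table

# Invented forecasts and outcomes (1 = it happened, 0 = it did not).
probs = [0.05, 0.08, 0.45, 0.55, 0.52, 0.95, 0.97]
outcomes = [0, 0, 0, 1, 1, 1, 1]
table = calibration_table(probs, outcomes)
```

Each row is one point on the plot: the mean forecast is one coordinate and the observed frequency is the other, and the count gives a sense of how much to trust that point.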
In the figure below, this perfect calibration line appears in blue. The points indicate the model's predictions, with the 95% confidence interval around them. Points below the line indicate underconfidence in the model, because things happened more often than the model thought. Likewise, points above the line indicate where the model was overconfident.
In general, the model is underconfident because most points fall below the perfect calibration line. But it's not wildly underconfident, because the real-world results are within the confidence interval of the model's predictions in many cases. There could be many reasons for this - the most obvious explanation is that the model thinks scores are more variable than they are.
Another potential reason, which I think is actually the case, is that the skill curve fitting algorithm is more uncertain than it should be in this context. The standard errors of the curve fitting algorithm are calculated based on the mathematical calculation of uncertainty. But if DCI judges come to a consensus on how good corps are quickly, then the confidence intervals might be larger than they need to be. 6 shows is not a large sample, mathematically speaking, but it could be large enough for judges to establish a baseline for how good the corps is. If this is the case, then the mathematical uncertainty would cause the model to hedge more than it needs to.
Because data was collected at 9 points throughout each season, the sample size is large enough to evaluate model skill on a season by season basis.
The model's performance is pretty consistent. The Brier Skill score is generally between 0.5 and 0.75. If you look closely and squint your eyes (generally a sign that you're looking for a significant effect when there isn't one), maybe you can convince yourself the model is tending to get better as time goes on. This wouldn't necessarily be a surprise, if DCI judging is getting more predictable over time.
Uncertainty appears to be pretty consistent through time, at around 0.2. It does drop a little bit in 2014 - this is because Open Class corps were switched from being judged on their own sheets to being judged on World Class sheets. This makes it easier for the model because it only forecasts who makes Finals and wins medals, and can almost automatically rule out all Open Class corps. If it were to forecast who made Semifinals, however, this wouldn't be the case. Because many Open Class corps make Semifinals, the uncertainty would actually increase. The resolution is pretty constant at an intermediate level.
The reliability and overall Brier Score tend to track together, which means the reliability is the component which has the biggest influence on the Brier Score. This isn't to say that reliability is the most important part of the Brier Score in general terms, but rather that the reliability is the only component that changed all that much from season to season. Basically, the unpredictability of DCI scores hasn't changed much, nor has the model's resolution compared to the Climatological Mean Model. What does change is that the model has slightly better years and slightly worse years, probably based mostly on random chance, so the reliability and Brier Score move together.
Overall, the DCI model is ... fine. The model performs pretty well, and is clearly better than the Climatological Mean Model, but that's not surprising because the Climatological Mean Model is, quite literally, the definition of useless in this context.
For broader comparisons, we can compare the model's performance to fivethirtyeight's forecasts. Compared to all of their sports and politics models, the DCI model is good but not great; middle of the pack. However, predicting DCI scores is generally much easier than predicting March Madness. We don't call Finals Week "August Madness" after all. So we should probably take the comparison with a grain of salt - given data and time, I'm pretty sure Nate Silver could do much better than this.
The model struggles with calibration. Even with all the data, it tends to be underconfident, sometimes significantly so. I don't entirely know why that is. If anything, I thought it'd be overconfident. However, my best guess is that the model "fools itself" with the small sample from the curve fitting, overestimating the standard errors of the exponential coefficients relative to the true, "real-world" uncertainty. I think this is because DCI judges come to a consensus on how good corps are quickly.
If my intuition is correct, the season-specific DCI models, like the one I did for 2018, should have better reliability and resolution than the historical model. The season-specific models make forecasts on a per-caption basis instead of total score. The fact that they make three separate forecasts and add them up actually decreases the overall uncertainty in the model.
Naturally, the next thing on my to-do list is to translate this analysis to season-by-season modeling on a per-caption basis. I have reliable caption-specific score data for only 2016 to 2018. While this isn't enough data to build a caption-specific model from scratch like I did for this model, it is enough to test how the assumptions I make here apply to the finer detail modeling. We can also use the predictions from 2016 to 2018 to see if the caption-based model is better calibrated, as I suspect it will be, and higher resolution. Stay tuned!