Welcome to my DCI Modeling Series - a series of an indeterminate number of posts I’m doing on forecasting DCI scores. This all started as an educational endeavor for me back in 2016, so I’ve decided to share my experience and methodologies as I work through developing the model. Everything is open source and available on GitHub, and this project wouldn’t be possible without data provided by corpsreps.com (now dcxmuseum.org) – thanks guys!
Now that the DCI season is a couple of weeks old, it's time to introduce the 2018 DCI model. I plan on launching the 2018 model with live predictions after the shows in Minnesota and Orlando this upcoming weekend (July 7-8), so this is just going to be a general introduction - Part 3 of this series will be more thorough and get into the weeds. If you can't wait until then, check out the code on GitHub.
The 2018 model is basically the same as the 2017 model, with some additions to fix the issues that were discussed in Part 1. Here's a quick recap of the issues:
- We know that we can't model DCI scores as a straight line, as I did in 2016. Instead, we need to use an exponential, which was the important innovation of the 2017 model.
- We need to account for uncertainty in the model, of which there are two types:
  - The uncertainty in how good a corps is - basically, how good is the exponential fit?
  - The uncertainty in scores from show to show - from good days and bad days.
- The model needs to be more skeptical and account for slotting. Before it gives Corps A a good chance to beat Corps B, we should see Corps A actually beat, or at least draw even with, Corps B consistently.
The 2017 model didn't account for either type of uncertainty very well, and it didn't address slotting at all. The new model is better on both counts.
The 2018 model has two pieces - an exponential piece and a rank piece. Each one makes a set of predictions more or less independently, using the same Monte Carlo technique that the 2017 model did. The model blends their predictions together to form the overall forecast.
As you may have figured out from the name, the exponential piece of the model is basically the 2017 model. It fits an exponential to each corps’ data and uses that to predict their Finals Week scores. The fitting algorithm isn't quite the same, but the change is mostly marginal. To alleviate the distorting effect that scores after long breaks have on the predictions, the model discounts the weight of recent scores in the curve fitting algorithm. This makes the model, in effect, more skeptical, so it won't get thrown off by a random high score or two (Part 3 will talk about how the model does this more specifically). Unlike the 2017 model, the exponential piece doesn't try to deal with both types of uncertainty; it deals only with the first, the uncertainty in how good a corps is.
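To make the idea of a discounted, weighted fit concrete, here's a minimal sketch. It is not the model's actual fitting code (that's on GitHub and will be covered in Part 3). It assumes a simple curve of the form score = a * exp(b * day), fits it by weighted least squares on the logarithm of the scores, and uses made-up scores and made-up weights. The names `fit_exponential` and `predict` are hypothetical.

```python
import math


def fit_exponential(days, scores, weights):
    """Weighted least-squares fit of score = a * exp(b * day).

    Assumption: the fit is done as a straight line through log(score)
    versus day, which only approximates a true nonlinear fit.
    """
    logs = [math.log(s) for s in scores]
    w_sum = sum(weights)
    mean_day = sum(w * d for w, d in zip(weights, days)) / w_sum
    mean_log = sum(w * y for w, y in zip(weights, logs)) / w_sum
    cov = sum(w * (d - mean_day) * (y - mean_log)
              for w, d, y in zip(weights, days, logs))
    var = sum(w * (d - mean_day) ** 2 for w, d in zip(weights, days))
    b = cov / var
    a = math.exp(mean_log - b * mean_day)
    return a, b


def predict(a, b, day):
    """Evaluate the fitted curve on a given day of the season."""
    return a * math.exp(b * day)


# Made-up data for one hypothetical corps: day of season and score.
days = [1, 4, 8, 11, 15]
scores = [70.0, 71.5, 73.2, 74.6, 76.9]

# Hypothetical discount: the two most recent scores count half as much,
# so a single surprising score moves the curve less.
weights = [1.0, 1.0, 1.0, 0.5, 0.5]

a, b = fit_exponential(days, scores, weights)
print(predict(a, b, 16))  # the curve's estimate for the next day
```

The smaller a score's weight, the less it can pull the curve toward itself, which is the sense in which discounting makes the fit more skeptical.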
The first thing the rank piece does, as the name would suggest, is rank the corps. This is the only time the two pieces overlap, as the rank piece uses the exponential to rank the corps instead of using the most recent scores like dciscores.com. This means the model ranks the corps using its best guess for scores as though they all performed on the same day. For example, I'm publishing this on July 5 and the model needs to determine who's ranked higher between the Bluecoats and Santa Clara Vanguard. It uses the exponential to estimate each corps’ score on July 5, instead of using the July 3 score for Bluecoats and July 1 for Santa Clara Vanguard (for what it's worth, Bluecoats has the edge right now, but history suggests it might not last).
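Here's a small sketch of ranking on a common day. The curve form score = a * exp(b * day), the corps names, and the parameter values are all assumptions for illustration, not real fits.

```python
import math


def estimate(curve, day):
    """Score a fitted curve (a, b) implies on a given day of the season."""
    a, b = curve
    return a * math.exp(b * day)


def rank_corps(curves, day):
    """Order corps from highest to lowest estimated score on the same day."""
    return sorted(curves, key=lambda name: estimate(curves[name], day),
                  reverse=True)


# Hypothetical fitted parameters (a, b) for two made-up corps.
curves = {
    "Corps A": (70.0, 0.0060),
    "Corps B": (69.0, 0.0066),
}

print(rank_corps(curves, 14))  # ranking as of day 14
print(rank_corps(curves, 40))  # ranking as of day 40
```

Because each corps is evaluated on the same day, it doesn't matter which one performed more recently. In this made-up example the ranking also flips later in the season, since Corps B's curve grows faster.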
Once the corps are ranked, the rank piece deals with the second type of uncertainty. Assuming the score gaps it sees now hold until Finals Week, it simulates the random noise from show to show like the 2017 model did, by drawing random numbers from a multivariate distribution. These errors are correlated from corps to corps, but the correlation is not the same between every pair of corps. So rather than use a uniform correlation like the 2017 model did, the 2018 model uses a “step-down” approach. The correlation is strongest between nearest neighbors, second strongest between second nearest neighbors, et cetera. The 8th place corps is most correlated with 7th and 9th, second most with 6th and 10th, and so on until it gets to the background correlation about 8 corps away. This better reflects the difficulty of moving up in DCI's rankings, and so it accounts for slotting better than the 2017 model did.
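The step-down idea can be sketched as a correlation matrix that depends only on how far apart two corps are in the rankings. Everything specific here is an assumption for illustration: the peak correlation of 0.8, the background of 0.2, the linear step between them, the noise size of 0.5 points, and the choice of a multivariate normal distribution. The real values and details live in the code on GitHub.

```python
import math
import random


def step_down_correlation(n, peak=0.8, background=0.2, reach=8):
    """Correlation that steps down with rank distance, then stays flat.

    Assumption: it falls linearly from `peak` (adjacent ranks) to
    `background` (ranks `reach` apart or more).
    """
    step = (peak - background) / (reach - 1)
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            distance = abs(i - j)
            if distance == 0:
                row.append(1.0)
            else:
                row.append(max(background, peak - step * (distance - 1)))
        matrix.append(row)
    return matrix


def cholesky(matrix):
    """Lower-triangular L with L * L^T equal to the input matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def draw_noise(lower, sigma, rng):
    """One draw of correlated show-to-show noise, one value per corps."""
    z = [rng.gauss(0.0, 1.0) for _ in lower]
    return [sigma * sum(lower[i][k] * z[k] for k in range(i + 1))
            for i in range(len(lower))]


corr = step_down_correlation(12)   # 12 corps, index 0 is first place
lower = cholesky(corr)
rng = random.Random(2018)
noise = draw_noise(lower, sigma=0.5, rng=rng)
print(corr[7][6], corr[7][5], corr[7][0])  # 8th place vs 7th, 6th, 1st
```

Because neighbors in the rankings tend to have good and bad nights together in this setup, their gaps move less than their raw scores do, which makes it harder for a simulated corps to jump over the one just ahead of it.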
Both pieces of the model predict the corps’ placements 10,000 times (this is 10 times as many as last year because the code is much faster this year). The 10,000 runs from each piece are then combined, with the rank piece of the model getting more weight than the exponential piece (this is because more variation in DCI scores comes from the second kind of uncertainty than the first - more on this in Part 3). Like the 2017 model, the 2018 model assigns odds by counting - if a corps has a 50% chance to win, it's because it won in 5,000 of the 10,000 simulations.
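Here's a rough sketch of odds by counting, plus one possible way to blend the two pieces. The blending rule (a weighted average of each piece's win frequency), the 0.7 weight on the rank piece, and the two stand-in simulators with their made-up scores are all assumptions; Part 3 will cover how the model actually does it.

```python
import random


def win_counts(simulate_scores, n_runs, rng):
    """Count how often each corps has the top score across n_runs."""
    counts = {}
    for _ in range(n_runs):
        scores = simulate_scores(rng)
        winner = max(scores, key=scores.get)
        counts[winner] = counts.get(winner, 0) + 1
    return counts


def blend_odds(rank_counts, exp_counts, n_runs, rank_weight=0.7):
    """Weighted average of the two pieces' win frequencies (assumed rule)."""
    names = set(rank_counts) | set(exp_counts)
    return {
        name: rank_weight * rank_counts.get(name, 0) / n_runs
        + (1.0 - rank_weight) * exp_counts.get(name, 0) / n_runs
        for name in names
    }


# Stand-in simulators with made-up numbers; each returns one simulated
# Finals score per corps.
def rank_piece(rng):
    return {"Corps A": 97.0 + rng.gauss(0, 0.5),
            "Corps B": 96.5 + rng.gauss(0, 0.5)}


def exponential_piece(rng):
    return {"Corps A": 96.8 + rng.gauss(0, 0.8),
            "Corps B": 96.9 + rng.gauss(0, 0.8)}


n_runs = 10_000
rng = random.Random(2018)
rank_counts = win_counts(rank_piece, n_runs, rng)
exp_counts = win_counts(exponential_piece, n_runs, rng)
odds = blend_odds(rank_counts, exp_counts, n_runs)
print(odds)
```

A corps that wins 5,000 of 10,000 runs in both pieces comes out at exactly 50% no matter how the pieces are weighted; the weight only matters when the two pieces disagree.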
There's one more design change for the 2018 model - it doesn't predict exact scores as explicitly as the 2017 model did. Instead, it focuses more on the gaps between corps. If the model predicts that a corps will get a 90.25, it's not so much predicting a 90.25 as it is predicting that the corps will be 8 points behind first place, whose mean score was a 98.25. Therefore, the 2018 model will not display raw score estimates; instead, it will display each corps’ average gap behind first place.
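The arithmetic behind that display is simple. As a quick sketch using the numbers from the example above (the function name `gaps_behind_first` and the corps labels are hypothetical):

```python
def gaps_behind_first(mean_scores):
    """Convert mean predicted scores into gaps behind the top mean score."""
    top = max(mean_scores.values())
    return {name: round(top - score, 2) for name, score in mean_scores.items()}


# Mean predicted scores from the example in the text.
mean_scores = {"First place": 98.25, "Some corps": 90.25}
gaps = gaps_behind_first(mean_scores)
print(gaps)
```

The corps predicted at 90.25 shows up as 8 points behind, and first place shows up with a gap of 0.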
So that's how the 2018 DCI model is going to work, at least on a high level. If you want more detail than I provided here, check back when I post Part 3 of this series. In the meantime, I plan on launching live predictions a day or two after the Orlando show, so keep an eye out for that too. If you're here a week or so after I posted this, check out the Drum Corps page of this site for a link.
Lastly, if you haven't gone to a drum corps show this summer, you need to. The achievement level in the activity right now is unbelievable!