
Big Data and the Big Dance
The NCAA Tournament is famous for its unpredictability. This year, the University of North Carolina, the University of Virginia, the University of Kansas and the University of Oregon are the top teams vying for tournament glory, but the odds are that at least one of these teams will be taken down by an underdog before reaching the Final Four. Every year, sports pundits and fans alike eagerly wait to see who the next “Cinderella story” will be, who will take up the mantle held by teams like Butler in 2011, George Mason University in 2006 and North Carolina State in 1983. In fact, since seeding began in 1979, only once – in 2008 – have all four top-seeded teams made it to the Final Four.
It is precisely these unexpected twists and turns that make March Madness one of the most popular events in sports, with millions of people filling out brackets to join office pools, enter national tournaments and compete with their friends to see who can predict which of the 68 eligible teams will come out on top. Unfortunately, statisticians estimate the odds of correctly predicting the outcome of each game in the tournament to be one in over nine quintillion.
Yes, you read that correctly…one in over nine quintillion. To date, no perfect bracket has ever been verified, a fact of which Warren Buffett was probably well aware when he offered a billion-dollar prize for one in 2014 (it was a pretty safe bet; no one who entered the competition made it past the second round with their bracket still intact). To try to make sense of the madness, fans and coaches alike are turning to Big Data analysis to build not only the perfect bracket but also the perfect team. In fact, March Madness has been called “America’s most popular exercise in statistical reasoning”.
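For the curious, the nine-quintillion figure can be reproduced with a few lines of Python. This is only a sketch under two assumptions that the statisticians' estimate appears to rest on: a bracket of 63 games (the 64-team main draw, leaving out the play-in games) and a 50/50 coin flip on every pick. The function name and the 70% pick rate used for comparison are made up for illustration.

```python
# Assumption: a standard bracket has 63 games (64-team main draw,
# play-in games excluded) and every pick is a pure coin flip.
GAMES_IN_BRACKET = 63


def perfect_bracket_odds(games: int = GAMES_IN_BRACKET, p_correct: float = 0.5) -> float:
    """Return N, where the chance of picking every game correctly is 1 in N.

    p_correct is the (assumed constant) chance of getting any single game right.
    """
    return 1.0 / (p_correct ** games)


# Number of distinct ways to fill out a 63-game bracket.
coin_flip_brackets = 2 ** GAMES_IN_BRACKET
print(f"Coin-flip odds: 1 in {coin_flip_brackets:,}")

# Hypothetical fan who gets 70% of individual games right.
print(f"70% picker:     1 in {perfect_bracket_odds(p_correct=0.7):,.0f}")
```

The first line printed is 1 in 9,223,372,036,854,775,808. A fan who actually knows the teams faces far shorter odds than a coin flipper, but they are still long enough to keep a billion-dollar prize safe.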
Before the Big Dance even begins, coaches analyze massive amounts of data they gather through technology such as SportVU, a system of arena cameras that records and stores raw information at 25 frames per second. This data, in turn, helps them determine probable outcomes for matchups or predict how a certain player can be expected to perform under pressure. This analysis has become such a powerful tool that former Butler University coach Brad Stevens (now head coach of the Boston Celtics) employed a full-time statistician on his staff to help him decide his starting lineup and create optimal player combinations.
It doesn’t seem like much of a stretch to apply the extensive analysis gleaned from the court to predicting tournament outcomes. The Dance Card, a formula developed by Jay Coleman, Mike DuMond and Allen Lynch, has a 97% success rate in predicting which teams will receive at-large bids into the Tournament, correctly choosing 141 of 146 teams over the last four years. However, success rates for predictive analysis based on Big Data drop sharply after the initial bids. Even with the extensive availability of data, the right formula for predicting the NCAA tournament continues to elude forecasters; Nate Silver, the incredibly successful baseball statistician who was able to correctly forecast the winner of each state in the 2012 presidential election, only achieved a 33% accuracy rate for his 2014 tournament predictions.
In pursuit of the perfect algorithm, Kaggle, a leading platform for predictive analysis and data modeling, challenges data scientists with March Machine Learning Mania. The 600+ competitors not only build and test their predictive models against past tournaments, but are also required to attach a quantitative measure of confidence to each prediction, which ensures a scientific approach and keeps a “lucky guess” from being rewarded. The 2014 winners combined a formula that analyzed teams’ performance per possession with Las Vegas betting odds, which incorporate intangible factors such as injuries and home-court advantage.
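To see how a confidence requirement takes the luck out of guessing, consider logarithmic loss, one common way to score probability forecasts. The article does not say which scoring rule the competition applies, so treat the choice of metric here as an assumption; the function name and sample numbers are illustrative only.

```python
import math


def log_loss(predictions, outcomes, eps=1e-15):
    """Average logarithmic loss; lower is better.

    predictions: forecast probability that team A wins each game (0..1)
    outcomes:    1 if team A actually won, 0 otherwise
    eps:         clamp that keeps log() away from 0 for extreme forecasts
    """
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(predictions)


# One game, which team A goes on to win (outcome = 1).
print("Confident and right:", round(log_loss([0.9], [1]), 3))
print("Sitting on the fence:", round(log_loss([0.5], [1]), 3))
# Same 90% confidence, but team A loses (outcome = 0).
print("Confident and wrong:", round(log_loss([0.9], [0]), 3))
```

Under a rule like this, a bold forecast that misses is punished far more heavily than a cautious one, so a model cannot climb the leaderboard by guessing loudly and hoping.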
In the end, most of the estimated 70 million brackets will be created through some combination of data analysis and pure gut instinct (or perhaps, in some cases, blind hope). If you do decide to rely completely on a predictive data model, Will Cukierski, a competition administrator at Kaggle, suggests using several techniques at once, or “stacking”. Merging a formula that focuses on better seeds with one that focuses on defensive prowess, for example, could give you a powerful advantage.
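For readers who want to try the idea, here is a rough sketch of merging a seed-based formula with a defense-based one. Everything in it is hypothetical: the two toy models, their coefficients and the weights are invented for illustration, and the merge is a plain weighted average, a simplified stand-in for full stacking, in which a second model would learn the weights from past tournaments.

```python
import math


def seed_model(seed_a: int, seed_b: int) -> float:
    """Toy model: chance team A wins, based only on seeds (lower seed is better).

    The 0.15 coefficient is a made-up illustration, not a fitted value.
    """
    return 1.0 / (1.0 + math.exp(-0.15 * (seed_b - seed_a)))


def defense_model(points_allowed_a: float, points_allowed_b: float) -> float:
    """Toy model: chance team A wins, based only on points allowed per game.

    The 0.10 coefficient is likewise a made-up illustration.
    """
    return 1.0 / (1.0 + math.exp(-0.10 * (points_allowed_b - points_allowed_a)))


def blend(probabilities, weights):
    """Weighted average of several models' win probabilities."""
    return sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)


# Hypothetical matchup: a 3 seed allowing 68 points a game
# against a 6 seed allowing 61 points a game.
by_seed = seed_model(3, 6)
by_defense = defense_model(68.0, 61.0)
combined = blend([by_seed, by_defense], weights=[0.6, 0.4])
print(f"seed: {by_seed:.2f}  defense: {by_defense:.2f}  combined: {combined:.2f}")
```

Here the two formulas disagree about the favorite, and the blended number lands between them, which is the point: each model covers a blind spot of the other.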
Then again, you might just be better off choosing teams by mascot.