Beat the Streak Day Fifteen: A back-testing framework

In this blog post, I describe a back-testing framework I developed to evaluate different Beat the Streak pick selection strategies on historical data, including the data sources I use, the evaluation metrics I look at, and some of the baseline models I've considered.  The source code for my back-testing framework is open source.  Everything is written in Python, and the code relies heavily on the pandas package for data processing.

Data Sources

I draw on data from multiple sources.  The most important is Statcast data, which I obtain using pybaseball.  This dataset contains information about every pitch, including the outcome (ball/strike/hit/etc.) as well as other characteristics whose coverage has expanded over time (pitch type, velocity, spin, etc.).  From this pitch-level data, I derive at-bat level data and (batter, game)-level data.  This data includes many context features that can be useful for predicting whether a batter gets a hit in a given game, like the ballpark, starting pitcher, teams, etc.
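
As a rough illustration of how pitch-level rows can be rolled up to at-bat level and then to (batter, game) level, here is a minimal pandas sketch.  The input column names follow Statcast naming, but the toy data and the derived columns (is_hit, plate_appearances, hits, got_hit) are assumptions made for this example, and the aggregation in the actual repository may differ.

```python
import pandas as pd

# Toy pitch-level rows with Statcast-style column names. `events` is only
# filled in on the final pitch of a plate appearance.
pitches = pd.DataFrame({
    "game_date": ["2016-04-03"] * 6,
    "game_pk": [1] * 6,
    "batter": [10, 10, 10, 10, 20, 20],
    "at_bat_number": [1, 1, 5, 5, 2, 2],
    "events": [None, "single", None, "strikeout", None, "field_out"],
})

HIT_EVENTS = {"single", "double", "triple", "home_run"}

# At-bat level: keep only the pitch that ended each plate appearance.
at_bats = pitches.dropna(subset=["events"]).copy()
at_bats["is_hit"] = at_bats["events"].isin(HIT_EVENTS)

# (batter, game) level: one row per batter per game.
batter_games = at_bats.groupby(
    ["game_date", "game_pk", "batter"], as_index=False
).agg(plate_appearances=("is_hit", "size"), hits=("is_hit", "sum"))

# The Beat the Streak target: did the batter get at least one hit?
batter_games["got_hit"] = batter_games["hits"] > 0
print(batter_games)
```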

Some of my models also use Retrosheet data, primarily because it includes umpire and weather information that is not present in the Statcast data.  This data is also available from pybaseball.  One downside of the Retrosheet data is that it is not updated regularly, so if I wanted to deploy a pick selection model in real time that uses it, I would need to find another data source for weather / umpire information.  However, at the moment my models have not been good enough in my back-testing simulations to justify putting in that work (I'm looking to consistently achieve >= 80% accuracy first).

I also download data from meteostat, which is a non-baseball dataset that can be used to fetch historical weather information.  None of my models currently use this data, though, since similar information is present in the Retrosheet data.

On top of these raw data sources, I do some feature engineering.  The engineered features typically consist of rolling averages and exponential moving averages over historical data.  For example, one of the engineered features is a distribution over pitch types for each pitcher, which is calculated by taking an exponential moving average over their pitch type distributions from previous games.  I engineered many other statistics, partitioned by a variety of groups.  The full set of features can be found in the open source repository.
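
To make the pitch-type example concrete, here is one way such a feature could be computed with pandas.  This is a sketch under assumptions: the smoothing factor ALPHA, the toy data, and the choice to shift by one game (so each game only sees earlier games) are mine, not necessarily what the repository does.

```python
import pandas as pd

# Toy pitch-level data for one pitcher across three games.
pitches = pd.DataFrame({
    "pitcher": [1] * 8,
    "game_date": pd.to_datetime(
        ["2016-04-01"] * 4 + ["2016-04-06"] * 2 + ["2016-04-11"] * 2
    ),
    "pitch_type": ["FF", "FF", "FF", "SL", "SL", "SL", "FF", "SL"],
})

# Per-game pitch-type distribution: one row per (pitcher, game_date),
# one column per pitch type, rows summing to 1.
per_game = pd.crosstab(
    [pitches["pitcher"], pitches["game_date"]],
    pitches["pitch_type"],
    normalize="index",
)

ALPHA = 0.5  # hypothetical smoothing factor

# Exponential moving average within each pitcher, shifted by one game so the
# feature for a game is built from previous games only.
pitch_mix = per_game.groupby(level="pitcher").transform(
    lambda s: s.ewm(alpha=ALPHA, adjust=False).mean().shift(1)
)
print(pitch_mix)
```

The first game for each pitcher has no history, so its feature row is missing and a model has to handle that case.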

I have data going all the way back to the year 2000, but fine-granularity pitch information was not available until much later (I think 2008), so there is some missing data that models must take care to handle appropriately.  

Model Definition and Simulation

The next part of my evaluation framework is a definition of a pick-selection model, as well as some baseline models.  In my evaluation framework, a pick selection model must be a class that implements 3 methods: (1) train (2) predict and (3) update.

1. train: trains a model on a given data set from scratch
2. predict: predicts the probability of a hit for a set of (batter, game) pairs using the current model.  
3. update: updates the current model with a small batch of data (usually the data batch that was just passed to predict).  

Given these three methods, evaluation proceeds by training an initial model via the train method on, e.g., data from 2000, then iterating over the remainder of the dataset grouped and sorted by "game_date", calling "predict" followed by "update" on each group.  This procedure closely approximates how the pick selection model would be used in reality, and it ensures that no test set leakage occurs, because every prediction is made before the model sees that day's outcomes.  Note that this differs from the traditional static train/test split that is common in data science.  
After iterating through the entire dataset we have predicted probabilities for every (batter, game) since 2001, as well as the true outcome (hit / no hit).
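
The train / predict / update loop described above can be sketched roughly as follows.  The names PickModel, LeagueAverageModel, simulate, and the "hit" and "proba" columns are hypothetical and chosen for this example; the interface in the actual repository may look different.

```python
from typing import Protocol

import pandas as pd


class PickModel(Protocol):
    """Hypothetical interface with the three methods described above."""

    def train(self, data: pd.DataFrame) -> None: ...
    def predict(self, data: pd.DataFrame) -> pd.Series: ...
    def update(self, data: pd.DataFrame) -> None: ...


class LeagueAverageModel:
    """Toy stand-in: predicts the overall hit rate seen so far for everyone."""

    def __init__(self) -> None:
        self.hits = 0
        self.games = 0

    def train(self, data: pd.DataFrame) -> None:
        self.hits, self.games = int(data["hit"].sum()), len(data)

    def predict(self, data: pd.DataFrame) -> pd.Series:
        rate = self.hits / self.games if self.games else 0.5
        return pd.Series(rate, index=data.index)

    def update(self, data: pd.DataFrame) -> None:
        self.hits += int(data["hit"].sum())
        self.games += len(data)


def simulate(model: PickModel, data: pd.DataFrame, train_end: str) -> pd.DataFrame:
    """Train on data up to train_end, then predict and update one day at a time."""
    data = data.sort_values("game_date", kind="stable")
    is_train = data["game_date"] <= pd.Timestamp(train_end)
    model.train(data[is_train])
    results = []
    for _, day in data[~is_train].groupby("game_date", sort=True):
        # Predict without the outcome column, then reveal outcomes via update.
        proba = model.predict(day.drop(columns=["hit"]))
        model.update(day)
        results.append(day.assign(proba=proba))
    return pd.concat(results)


data = pd.DataFrame({
    "game_date": pd.to_datetime(
        ["2000-09-30", "2000-09-30", "2001-04-01", "2001-04-01", "2001-04-02"]
    ),
    "batter": [10, 20, 10, 20, 10],
    "hit": [1, 0, 1, 1, 0],
})
results = simulate(LeagueAverageModel(), data, train_end="2000-12-31")
print(results)
```

Dropping the outcome column before calling predict is a design choice in this sketch that makes leakage harder to introduce by accident.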

Baseline Model Example

One of the baseline models I've implemented is very simple.  It tracks the historical hit rate (hits / games) for each batter in the "train" and "update" steps, and uses that directly as the probability in the "predict" step.  This is a naive baseline, but it is useful as a comparison point and as a concrete example of how the train/predict/update methods could be implemented.  
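
A baseline along these lines might look like the sketch below.  The class name BatterHitRateModel, the column names, and the fallback probability for batters with no history are assumptions for illustration, not the code from the repository.

```python
import pandas as pd


class BatterHitRateModel:
    """Naive baseline: P(hit) = historical games with a hit / games played."""

    def __init__(self, default: float = 0.6) -> None:
        # `default` is a hypothetical fallback for batters with no history.
        self.default = default
        self.hits: dict = {}
        self.games: dict = {}

    def train(self, data: pd.DataFrame) -> None:
        # Training from scratch is just an update starting from empty counts.
        self.hits, self.games = {}, {}
        self.update(data)

    def update(self, data: pd.DataFrame) -> None:
        for batter, outcomes in data.groupby("batter")["hit"]:
            self.hits[batter] = self.hits.get(batter, 0) + int(outcomes.sum())
            self.games[batter] = self.games.get(batter, 0) + len(outcomes)

    def _rate(self, batter) -> float:
        games = self.games.get(batter, 0)
        return self.hits[batter] / games if games else self.default

    def predict(self, data: pd.DataFrame) -> pd.Series:
        return data["batter"].map(self._rate)


history = pd.DataFrame({"batter": [10, 10, 20], "hit": [1, 0, 1]})
model = BatterHitRateModel()
model.train(history)

today = pd.DataFrame({"batter": [10, 20, 30]})
proba = model.predict(today)
print(proba.tolist())  # [0.5, 1.0, 0.6]

# Once today's outcomes are known, fold them into the counts.
model.update(pd.DataFrame({"batter": [10], "hit": [1]}))
```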


The next part of my evaluation framework is a set of utilities for summarizing the performance of different models.  They operate on the dataset produced by the simulation, which contains the model-predicted hit probabilities as well as the true outcomes.  I look at a variety of scalar metrics and visualizations, enumerated below.  Each one is intended to show how the model is performing from a different angle. 

Scalar Metrics

  • Likelihood: The geometric mean of the likelihood (i.e., the probability the model assigned to each observed outcome).  This is simply the exponentiated average log-likelihood, and the average log-likelihood is the negative of the cross-entropy used in classification problems.  This is a fairly standard metric and not specific to BTS, but a useful general-purpose measure of model quality.
  • Top-k Accuracy Per Day: The accuracy of the top k picks per day.  This metric speaks for itself.  When k=2, that corresponds to a "double down" in BTS.  Using larger k is helpful because it increases the amount of data, which reduces the variance in the metric, and hence we are less likely to misinterpret something due to random chance.  
  • Conditional success: The accuracy of picks whose predicted probabilities were at or above a threshold (by default, >=0.78).  Some days the model might not identify good picks, and the previous metric would penalize the model for that.  This metric provides an alternate and equally useful view of model performance.  
  • Conditional count: The number of picks whose predicted probabilities were at or above a threshold (by default, >=0.78).  Some models might achieve 100% conditional success if they only identified a handful of (batter, game) pairs with probabilities >= 0.78.  The conditional count metric tells us how much confidence we should place in the conditional success metric.  
  • Top-k accuracy: The accuracy of the top k picks over the entire evaluation window.  Like the other metrics, this looks at the (batter, game) pairs with the highest predicted hit probability.  Unlike the per-day metric, it uses a much larger k, like 300 or 500, and ranks picks over the entire dataset rather than one day at a time.  

As an example, here are some numbers for one model I've implemented, "Singlearity-Lite", when evaluated on 2016 data:
  • Likelihood: 0.5281 
  • Top 2/Day Accuracy: 0.7514 
  • Top 5/Day Accuracy: 0.7743 
  • Accuracy(proba>0.78): 0.8211 
  • Count(proba>0.78): 123.0000 
  • Top 300 Accuracy: 0.7700 
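
For readers who want to compute similar numbers, the scalar metrics above can be sketched in a few lines of pandas.  The function names and the "game_date", "proba", and "hit" columns are assumptions for this example, and the toy data is made up purely to exercise the functions.

```python
import numpy as np
import pandas as pd


def likelihood(results: pd.DataFrame) -> float:
    """Geometric mean of the probability assigned to each observed outcome."""
    p = results["proba"].clip(1e-12, 1 - 1e-12)
    p_observed = np.where(results["hit"] == 1, p, 1 - p)
    return float(np.exp(np.mean(np.log(p_observed))))


def top_k_per_day_accuracy(results: pd.DataFrame, k: int = 2) -> float:
    """Accuracy of the k highest-probability picks on each day."""
    ranked = results.sort_values("proba", ascending=False)
    picks = ranked.groupby("game_date").head(k)
    return float(picks["hit"].mean())


def conditional_success_and_count(results: pd.DataFrame, threshold: float = 0.78):
    """Accuracy and number of picks with probability at or above the threshold."""
    picks = results[results["proba"] >= threshold]
    accuracy = float(picks["hit"].mean()) if len(picks) else float("nan")
    return accuracy, len(picks)


def top_k_accuracy(results: pd.DataFrame, k: int = 300) -> float:
    """Accuracy of the k highest-probability picks over the whole window."""
    return float(results.nlargest(k, "proba")["hit"].mean())


results = pd.DataFrame({
    "game_date": ["d1", "d1", "d1", "d2", "d2", "d2"],
    "proba": [0.8, 0.7, 0.6, 0.9, 0.5, 0.4],
    "hit": [1, 0, 1, 1, 1, 0],
})
print(likelihood(results))
print(top_k_per_day_accuracy(results, k=2))    # 0.75
print(conditional_success_and_count(results))  # (1.0, 2)
print(top_k_accuracy(results, k=3))
```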


Visualizations

  • Calibration curve: This visualization plots the predicted probability vs. the observed frequency on the evaluation dataset.  If the observed frequency matches the predicted probability, the model is said to be well-calibrated.  To make this plot, we sort the data by predicted probability, then take a rolling average of both the predicted probability and the binary outcome (hit / no hit).  
  • Success curve: This visualization plots the cumulative accuracy for the top-k (batter, game) pairs for k=1...1000.  It is essentially a continuous version of the Top-k accuracy metric described above.  Along with the plot, three reference lines are shown: two reference lines for 75% and 80% accuracy, and a line showing what is "expected" based on the predicted probabilities.  If the observed cumulative accuracy matches the expected accuracy, that is usually a good sign, although for small k there is significant variance and the plot can sometimes be erratic.  
  • Success distribution: This visualization plots the distribution of hit probabilities for the top-2 batters per game.  If the model is well-calibrated, this visualization is useful for determining whether it can identify highly favorable situations, i.e., (batter, game) pairs with >=80% hit probability, on a regular basis.
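
The data behind the calibration curve and the success curve can be computed as in the sketch below, leaving the actual plotting to whatever library you prefer.  The function names, the column names, and the rolling window size are assumptions for illustration; the window in particular would be much larger on real data.

```python
import numpy as np
import pandas as pd


def calibration_curve(results: pd.DataFrame, window: int = 1000) -> pd.DataFrame:
    """Rolling means of predicted probability and outcome, sorted by prediction."""
    ordered = results.sort_values("proba")
    rolled = ordered[["proba", "hit"]].rolling(window).mean().dropna()
    return rolled.rename(columns={"proba": "predicted", "hit": "observed"})


def success_curve(results: pd.DataFrame, max_k: int = 1000) -> pd.DataFrame:
    """Cumulative observed and expected accuracy of the top-k picks, k = 1..max_k."""
    top = results.nlargest(max_k, "proba").reset_index(drop=True)
    k = np.arange(1, len(top) + 1)
    return pd.DataFrame({
        "k": k,
        "observed": top["hit"].cumsum() / k,    # accuracy of the top-k picks
        "expected": top["proba"].cumsum() / k,  # what the model itself predicts
    })


results = pd.DataFrame({
    "proba": [0.9, 0.8, 0.7, 0.6],
    "hit": [1, 0, 1, 1],
})
print(calibration_curve(results, window=2))
print(success_curve(results))
```

Plotting "predicted" against "observed" gives the calibration curve, and plotting "observed" and "expected" against "k" gives the success curve with its expected reference line.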


In this blog post, I summarized my back-testing and evaluation framework for Beat the Streak.  This framework allows me to quickly iterate on new models and evaluate their performance on historical data.  If you are interested in developing your own models for Beat the Streak, my evaluation framework is a great starting point.  The code is open source, and if you have any issues with it, shoot me a message and I'd be happy to help you figure it out.  While my odds of beating the streak are slim to none, I still enjoy working on these problems for fun and blogging about them.  

