Posts

Showing posts from 2023

Beat the Streak Day Fourteen: Singlearity

After a fortnight of work, I am back with another blog post on MLB's Beat the Streak contest. And before you ask, yes, that is still a thing in 2023. No one has won it yet, and this year the longest streak was 44, 13 shy of the number needed and 7 short of the previous all-time leader in BTS. In short, it doesn't seem like we're any closer to winning it now than we were 10 years ago. In the last few days, I have been thinking about new algorithms, models, and approaches to this longstanding problem. But before I can or should test these new approaches, it's important to better understand the limitations of simpler approaches when done very carefully. Singlearity was the first approach to this problem I've seen that convincingly demonstrated solid performance where it matters. The idea is to apply standard neural network training techniques to a dataset of (batter, game, outcome) data. The neural network is trained on carefully feature-engineered data. Specif

Deriving Perfect Strategy for Blackjack

In this blog post, I will discuss the algorithmic techniques needed to derive perfect strategy for the game of blackjack. Perfect strategy utilizes knowledge of the exact composition of cards remaining in the deck. No human can execute perfect strategy in practice since it requires keeping track of the counts of 13 different cards simultaneously. It’s theoretically possible to use a technique like this to gain a slight advantage in online live dealer blackjack via software, and this strategy does provide a positive EV game, although as we will see later, the advantage is quite small. Nevertheless, it is quite an interesting algorithmic problem, and I had fun solving it. The Wizard of Odds has a great YouTube video showing how you can derive basic strategy for a simplified game with an infinite deck of cards. This is a great starting point for understanding this way of thinking. There are two main questions we ultimately want to answer

Beat the Streak: Day 13

Recall from Beat the Streak: Day 12 that my grand vision requires building a collection of several models for different subproblems, which will all be combined to get a model for the probability that a batter will get a hit in a given game. In this blog post, my aim is to tackle the first subproblem. Specifically, I'd like to build a model to predict the probability that a ball put into play results in a hit, given its launch angle, spray angle, launch speed, and any other relevant context (like the ballpark). Note that Statcast data already has a column called "estimated_ba_from_speedangle". This only looks at launch angle and launch velocity, and ignores spray angle. It therefore acts as a good baseline for this problem that we can hopefully improve upon. An even more naive baseline is to assume the probability of a hit is constant given that the ball was put into play, ignoring all other context. Evaluating these models gives a negative log likelihood of 0.409 and
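As a rough illustration of the "constant probability" baseline mentioned in the excerpt above, here is a minimal sketch of how an average negative log likelihood could be computed for it. The function names (`mean_nll`, `constant_baseline`) and the toy outcomes are made up for this example; they are not from the original post, and the numbers are not Statcast results.

```python
import math


def mean_nll(probs, outcomes):
    """Average negative log likelihood of binary outcomes (1 = hit, 0 = no hit)."""
    eps = 1e-12  # clip probabilities so log() never sees exactly 0 or 1
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y else math.log(1.0 - p)
    return total / len(outcomes)


def constant_baseline(outcomes):
    """Naive baseline: predict the overall hit frequency for every ball in play."""
    return sum(outcomes) / len(outcomes)


# Hypothetical toy data, NOT real balls in play.
toy_outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
p_hat = constant_baseline(toy_outcomes)
baseline_nll = mean_nll([p_hat] * len(toy_outcomes), toy_outcomes)

# A model that uses context should score a lower NLL than the constant baseline.
informed_probs = [0.9 if y else 0.1 for y in toy_outcomes]
informed_nll = mean_nll(informed_probs, toy_outcomes)
```

Any model that uses launch angle, launch speed, or spray angle can be scored with the same `mean_nll` function, which makes the comparison against the baselines direct.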

Beat the Streak, Day 12: My Grand Vision

Dear blog readers and beat-the-streak enthusiasts alike, today I want to talk about my grand vision for beat-the-streak modelling. I will explain how I have always envisioned a solution for beat-the-streak, but never actually attempted to solve it this way due to the complexity of the project and the limited time I have to work on it. I hope this could be the year that I start chipping away at putting this grand vision into practice. In this blog post, I want to explicitly write out this vision along with the subproblems that would need to be solved to execute it. I have hinted at this vision in some earlier blog posts, but now I want to dive a little deeper into exactly what this idea would entail. In short, this vision requires modeling probability distributions at the finest level of granularity (pitches) and using those as building blocks for coarser-granularity models (at-bats and games). The specific models that I am proposing to train / develop are listed below. Among these,

Beat the Streak: Day 11

In a previous blog post, I showed that simply looking at empirical frequencies to estimate hit probabilities can be misleading, as a positive bias is introduced when we take the maximum over a bunch of empirical frequencies. This bias will incorrectly lead us to believe that the probability of a hit for the best batter is higher than it truly is, which is clearly a problem from the perspective of beat the streak. It is fairly straightforward to correct for the bias. In this blog post, I will explain how, and discuss the implications of the bias-corrected hit probabilities. Recall from the previous blog post that our setup is as follows: suppose we have a collection of batters $i = 1, \dots, 250$, and each batter has a certain (unknown) probability $p_i$ of getting a hit in a given game. Moreover, assume each batter plays in $162$ games, and that the outcomes for each player across games are i.i.d. For the purposes of this problem abstraction, let's assume $p_i
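The positive bias described in the excerpt above can be seen in a small simulation. This is only a sketch under a loudly stated assumption: every batter is given the same hypothetical true hit probability of 0.6, which is chosen for illustration and is not the distribution used in the original post. The function name `simulate_max_bias` is also made up here. It does not implement the bias correction, only the bias itself.

```python
import random


def simulate_max_bias(n_batters=250, n_games=162, true_p=0.6, seed=0):
    """Simulate i.i.d. game outcomes and return (max, mean) of empirical hit frequencies.

    ASSUMPTION: all batters share the same true probability `true_p`.
    """
    rng = random.Random(seed)
    freqs = []
    for _ in range(n_batters):
        # One Bernoulli(true_p) draw per game for this batter.
        hits = sum(rng.random() < true_p for _ in range(n_games))
        freqs.append(hits / n_games)
    return max(freqs), sum(freqs) / len(freqs)


true_p = 0.6  # hypothetical value, for illustration only
best_freq, avg_freq = simulate_max_bias(true_p=true_p)
bias = best_freq - true_p  # how far the "best" batter overshoots the truth
```

Even though no batter here is truly better than 0.6, the largest empirical frequency lands well above it, while the average frequency stays close to the true value. That gap is the selection bias that comes from taking a maximum over noisy estimates.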

Beat the Streak: Day 10

In this blog post, we augment our dataset with information about weather, and look at how different weather-related features affect the probability of getting a hit. Specifically, we will look at 3 weather-related variables: temperature, wind speed + direction, and precipitation.

Temperature

We will begin by analyzing the effect of temperature on the probability of getting a hit. Our data consists of plate appearances from 2010 - 2021, or roughly 2 million records. For each plate appearance, we have access to the temperature (presumably, this corresponds to the temperature at the start of the game) and the outcome (hit / no hit). We sort this dataset by temperature, and then compute a rolling mean of width 100,000. This allows us to take the discrete events (hit / no hit) and turn them into probabilities. The plot below shows the trend. This shows that higher temperatures lead to increased hit probability. Specifically, going from 50 degrees to 70 degrees gives approxima
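The sort-then-rolling-mean procedure described in the excerpt above can be sketched as follows. This is a minimal stdlib-only version, not the author's code: the function name `rolling_hit_rate` is hypothetical, the toy records are invented, and the window width of 3 stands in for the 100,000 used in the post.

```python
def rolling_hit_rate(records, width):
    """Sort (temperature, hit) records by temperature and slide a fixed-width window.

    Returns one (mean_temperature, hit_rate) pair per window position.
    """
    if width <= 0 or width > len(records):
        raise ValueError("width must be between 1 and len(records)")
    ordered = sorted(records, key=lambda r: r[0])

    # Running sums for the first window.
    temp_sum = sum(t for t, _ in ordered[:width])
    hit_sum = sum(h for _, h in ordered[:width])
    curve = [(temp_sum / width, hit_sum / width)]

    # Slide the window: add the entering record, drop the leaving one.
    for i in range(width, len(ordered)):
        temp_sum += ordered[i][0] - ordered[i - width][0]
        hit_sum += ordered[i][1] - ordered[i - width][1]
        curve.append((temp_sum / width, hit_sum / width))
    return curve


# Invented toy records (temperature in degrees, hit = 1 / no hit = 0), deliberately unsorted.
toy_records = [(70, 1), (50, 0), (65, 0), (55, 0), (75, 1), (60, 1)]
curve = rolling_hit_rate(toy_records, width=3)
```

Each entry of `curve` pairs the average temperature in a window with the fraction of hits in that window, which is what turns the discrete hit / no hit events into a smooth probability estimate. The toy data says nothing about the real effect size; with millions of records a pandas `rolling` mean would be the more practical tool.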