Posts

Beat the Streak: Day 13

Recall from Beat the Streak: Day 12 that my grand vision requires building a collection of models for different subproblems, which will all be combined into a model for the probability that a batter gets a hit in a given game. In this blog post, my aim is to tackle the first subproblem. Specifically, I'd like to build a model to predict the probability that a ball put into play results in a hit, given its launch angle, spray angle, launch speed, and any other relevant context (like the ballpark). Note that Statcast data already has a column called "estimated_ba_from_speedangle". This only looks at launch angle and launch speed, and ignores spray angle. It therefore acts as a good baseline for this problem that we can hopefully improve upon. An even more naive baseline is to assume the probability of a hit is constant given that the ball was put into play, ignoring all other context. Evaluating these models gives a negative log likelihood of 0.409 and
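To make the naive constant baseline concrete, here is a minimal Python sketch of how its negative log likelihood could be computed. The function names and the toy outcomes are hypothetical, the data is made up rather than taken from Statcast, and I am assuming the metric is averaged per ball in play.

```python
import math

def negative_log_likelihood(outcomes, probabilities):
    """Average negative log likelihood of binary outcomes (1 = hit, 0 = no hit)."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        # Penalize the probability assigned to the outcome that actually occurred.
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(outcomes)

def constant_baseline(outcomes):
    """Predict the overall hit rate for every ball in play, ignoring all context."""
    rate = sum(outcomes) / len(outcomes)
    return [rate] * len(outcomes)

# Toy, made-up outcomes for balls put into play (not real Statcast data).
toy_outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
baseline_nll = negative_log_likelihood(toy_outcomes, constant_baseline(toy_outcomes))
print(round(baseline_nll, 3))
```

Any model that uses launch angle, launch speed, or spray angle should score lower than this number on the same data, which is what makes it a useful floor for comparison.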

Beat the Streak, Day 12: My Grand Vision

Dear blog readers and beat-the-streak enthusiasts alike, today I want to talk about my grand vision for beat-the-streak modelling. I will explain how I have always envisioned a solution for beat-the-streak, but never actually attempted to solve it this way due to the complexity of the project and the limited time I have to work on it. I hope this could be the year that I start chipping away at putting this grand vision into practice. In this blog post, I want to explicitly write out this vision along with the sub-problems that would need to be solved to execute it. I have hinted at this vision in some earlier blog posts, but now I want to dive a little deeper into exactly what this idea would entail. In short, this vision requires modeling probability distributions at the finest level of granularity (pitches) and using those as building blocks for coarser granularity models (at-bats and games). The specific models that I am proposing to train / develop are listed below. Among these,

Beat the Streak: Day 11

In a previous blog post, I showed that simply looking at empirical frequencies to estimate hit probabilities can be misleading, as a positive bias is introduced when we take the maximum over a bunch of empirical frequencies. This bias will incorrectly lead us to believe that the probability of a hit for the best batter is higher than it truly is, which is clearly a problem from the perspective of beat the streak. It is fairly straightforward to correct for the bias. In this blog post, I will explain how, and discuss the implications of the bias-corrected hit probabilities. Recall from the previous blog post that our setup is as follows: Suppose we have a collection of batters $i = 1, \dots, 250$, and each batter has a certain (unknown) probability $p_i$ of getting a hit in a given game. Moreover, assume each batter plays in $162$ games, and that the outcomes for each player across games are i.i.d. For the purposes of this problem abstraction, let's assume $p_i
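The positive bias from taking a maximum over empirical frequencies is easy to see in a small simulation. The sketch below is only an illustration: the function name is hypothetical, and the uniform range used for the true probabilities is my own assumption for demonstration, not the distribution used in the post.

```python
import random

def selection_bias(n_batters=250, n_games=162, n_trials=50, seed=0):
    """Average gap between the top empirical frequency and that batter's true probability."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_trials):
        # Hypothetical true per-game hit probabilities (an assumption for illustration).
        true_p = [rng.uniform(0.55, 0.75) for _ in range(n_batters)]
        # Empirical frequency of games with a hit for each batter over one season.
        freq = [sum(rng.random() < p for _ in range(n_games)) / n_games for p in true_p]
        # Pick the batter who looks best on paper and compare with the truth.
        best = max(range(n_batters), key=freq.__getitem__)
        total_gap += freq[best] - true_p[best]
    return total_gap / n_trials

bias = selection_bias()
print(f"average overestimate for the apparent best batter: {bias:.3f}")
```

The gap comes out positive because the batter with the highest observed frequency is disproportionately likely to be one who also got lucky that season.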

Beat the Streak: Day 10

In this blog post, we augment our dataset with information about weather, and look at how different weather-related features affect the probability of getting a hit. Specifically, we will look at 3 weather-related variables: temperature, wind speed + direction, and precipitation. Temperature: We will begin by analyzing the effect of temperature on the probability of getting a hit. Our data consists of plate appearances from 2010-2021, or roughly 2 million records. For each plate appearance, we have access to the temperature (presumably, this corresponds to the temperature at the start of the game) and the outcome (hit / no hit). We sort this dataset by temperature, and then compute a rolling mean of width 100,000. This lets us turn the discrete events (hit / no hit) into probabilities. The plot below shows the trend. This shows that higher temperatures lead to increased hit probability. Specifically, going from 50 degrees to 70 degrees gives approxima
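The sort-then-rolling-mean step can be sketched as follows. This is a minimal version under my own assumptions: the function name is hypothetical, the inputs are plain Python lists, and the example uses a tiny made-up dataset with a window of 2 rather than 100,000.

```python
from itertools import accumulate

def rolling_hit_rate(temperatures, hits, window):
    """Sort plate appearances by temperature, then average hits over a sliding window."""
    pairs = sorted(zip(temperatures, hits))
    temps = [t for t, _ in pairs]
    outcomes = [h for _, h in pairs]
    # Prefix sums let each window average be computed in constant time.
    cum_t = [0] + list(accumulate(temps))
    cum_h = [0] + list(accumulate(outcomes))
    result = []
    for start in range(len(pairs) - window + 1):
        end = start + window
        mean_temp = (cum_t[end] - cum_t[start]) / window
        hit_rate = (cum_h[end] - cum_h[start]) / window
        result.append((mean_temp, hit_rate))
    return result

# Tiny made-up example: four plate appearances, window of 2.
curve = rolling_hit_rate([70, 50, 60, 80], [1, 0, 0, 1], window=2)
print(curve)
```

Each entry pairs the average temperature in a window with the fraction of plate appearances in that window that ended in a hit, which is what gets plotted as the trend.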

Biased Best-of-K Rock Paper Scissors

The topic of today's blog post is an interesting twist on the classical rock paper scissors game. Alice and Bob agree to play rock paper scissors (best of 1). If Alice loses, she loses gracefully and accepts defeat. If Bob loses, he will insist on playing best of 3. If he loses yet again, he will insist on playing best of 5, and so on and so forth. What is Alice's probability of winning this game if she is willing to agree to Bob's request $k$ times (best of up to $K=2k+1$)? Let's begin by working out the formula for $k=1$, which is perhaps the most realistic scenario. How much does Alice give up by agreeing to Bob's request once? The probability of Alice winning a given round is $1/2$, and the same for Bob (a round consists of a sequence of ties followed by exactly one non-tie). Here are the possible sequences of events, along with their probabilities, with A/B denoting that Alice/Bob wins the given round respectively. B - 0.5 - Bob wins in 1 turn AA - 0.25 - A
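The game described above can also be evaluated exactly with a short recursion. The sketch below rests on one assumption of mine: rounds already played carry over when Bob extends the match, which is how I read the "B" and "AA" sequences. The function name is hypothetical.

```python
from fractions import Fraction
from functools import lru_cache

def alice_win_probability(k):
    """Exact probability that Alice wins if she grants Bob's request up to k times."""
    half = Fraction(1, 2)

    @lru_cache(maxsize=None)
    def win(a, b, stage):
        # At stage j the match is best of 2j+1, so j+1 round wins are needed.
        target = stage + 1
        if b >= target:
            return Fraction(0)  # Bob wins and Alice accepts defeat
        if a >= target:
            if stage == k:
                return Fraction(1)  # Alice refuses any further extension
            return win(a, b, stage + 1)  # Bob insists on a longer match
        # Each round is a fair coin flip once ties are discarded.
        return half * win(a + 1, b, stage) + half * win(a, b + 1, stage)

    return win(0, 0, 0)

for k in range(4):
    print(k, alice_win_probability(k))
```

Under that carry-over assumption, Alice has to stay ahead at every checkpoint (best of 1, best of 3, and so on), so her winning probability can only shrink as $k$ grows.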

Beat the Streak: Day Nine

In this blog post, I want to talk about why getting an 80% success rate in beat the streak is so challenging. I believe I have identified a mathematical reason for this, which I am going to share in this blog post. First, let's look at some simple statistics that hint that 80% success should not be out of reach. The table below shows the fraction of games with a hit for the most successful batter in each year from 2011-2019.
Year  Batter           Fraction of games with a hit
2011  Jacoby Ellsbury  0.821656
2012  Derek Jeter      0.812121
2013  Michael Cuddyer  0.807692
2014  Jose Altuve      0.803797
2015  Dee Gordon       0.800000
2016  Mookie Betts     0.807453
2017  Ender Inciarte   0.775641
2018  Jose Altuve      0.786207
2019  DJ LeMahieu

Beat the Streak: Day Eight

In this blog post, we will explore three factors that influence the probability of correctly selecting a player to get a hit on a given day. These are:
1. Individual batter strength, as measured by the proportion of plate appearances that resulted in a hit.
2. Team offensive strength, as measured by the average number of plate appearances per game by the batting team.
3. The position in the batting order.
We plot the distribution of these statistics over (batter, year) pairs and (team, year) pairs. The plots below reveal that the best batters get a hit in about 30% of plate appearances, and the strongest offensive teams average 39 plate appearances per game. The tables below show the top-performing batters and teams:
Batter         Year  Proportion of plate appearances with a hit
Josh Hamilton  2010  0.326
Trea Turner    2016  0.324
Jose Altuve    2014  0.319