### Beat the Streak: Day One

In 2001, MLB.com released the Beat the Streak challenge, which asks fans to beat the all-time best hitting streak, set by Joe DiMaggio in 1941. In that season, DiMaggio had a 56-game hitting streak. The longest hitting streak by any MLB player since then is 44 games. When the challenge was first introduced, fans were asked to pick a (possibly different) player every day who they expected to get a hit. If that player got a hit, the fan's streak would increase by 1. Otherwise, it would go back to 0. The first fan to reach a streak of 57 would win the grand prize. Since its introduction in 2001, the grand prize has grown from $100,000 to $5,600,000, and a number of new features have been added to improve the odds for the fans. Yet no one has even broken the 50-game barrier, let alone won the grand prize.

I have been a casual Beat the Streak player for a few seasons, but I never took it too seriously. Last season, I decided to use my background in statistics and programming to look at historical data and improve my odds. It worked reasonably well, but I am completely rethinking the model this season. This is the first post in a series that will cover my progress on my Beat the Streak tool and share the mathematical and algorithmic insights I pick up along the way. As this is my first post on the topic, I will not get too technical today. I will share a few simple insights, however.

MLB.com claims that it is 1,000,000 times easier to beat the streak than it is to fill out a perfect bracket. Even so, nobody has won in its 14 years of existence. As a warmup, and to see whether I'm completely wasting my time on this, let's approximate the probability of winning under a naive player selection model. In 2014, Jose Altuve had the highest batting average, and he got at least one hit in about 80% of the games he played. Using that as the probability of getting a hit in a single game, we can approximate the probability of beating the streak over the course of a season. The season lasts about 183 days (162 games for each team), so there are about (183 - 57) windows in which to complete the streak, and the probability of completing it in a single window is given by

$$ 0.8^{57} \approx 0.000002993 $$

The probability of beating the streak at some point during the season is higher, but the calculation requires slightly more advanced math.

These calculations were made under a simplified model of the game. The actual rules are more favorable: you can pick up to 2 players every day, and you get one mulligan. The mulligan can be used to save your streak if you happen to lose it when it is between 10 and 15, inclusive. So we can now win by, for example, getting 10 hits in a row, then 5 hits out of 6, then 42 hits in a row. This feature improves your odds, but since you only get one mulligan, it's not likely to help very much (and I'm just too lazy to calculate the odds).
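The season-long probability can be sketched as a small Markov chain over the current streak length. The function below is a minimal illustration under the same naive assumptions as above: one pick per day, each pick getting a hit independently with probability `p`. The `mulligan` flag is a rough, assumed reading of the mulligan rule (it is applied automatically on the first miss while the streak is between `lo` and `hi`, and the streak is kept unchanged); the function name and parameters are my own and are not part of any official tool.

```python
def streak_probability(p, target=57, days=183, mulligan=False, lo=10, hi=15):
    """Chance of reaching `target` hits in a row within `days` daily picks.

    Assumes one pick per day, each getting a hit independently with
    probability p. The mulligan handling is a simplifying assumption.
    """
    # dist[m][s] = probability of having streak s with m mulligans left
    dist = [[0.0] * target for _ in range(2)]
    dist[1 if mulligan else 0][0] = 1.0
    won = 0.0
    for _ in range(days):
        new = [[0.0] * target for _ in range(2)]
        for m in range(2):
            for s in range(target):
                mass = dist[m][s]
                if mass == 0.0:
                    continue
                # Hit: the streak grows by one, or the target is reached.
                if s + 1 == target:
                    won += mass * p
                else:
                    new[m][s + 1] += mass * p
                # Miss: the streak resets unless the mulligan saves it.
                if m == 1 and lo <= s <= hi:
                    new[0][s] += mass * (1 - p)
                else:
                    new[m][0] += mass * (1 - p)
        dist = new
    return won


if __name__ == "__main__":
    print("no mulligan:  ", streak_probability(0.8))
    print("with mulligan:", streak_probability(0.8, mulligan=True))
```

Comparing the two printed values gives a feel for how much a single mulligan is worth under this model. It does not account for picking 2 players in a day.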

In future blog posts, I will discuss the following topics which I have either solved, looked into, or plan on looking into:

**How to better approximate the probability of a hit**

I will explore ways to approximate hit percentages based on situational data, for example the handedness of the batter and pitcher, the location of the game, the time of the game, and the strength of the bullpen. I may also discuss machine learning techniques that can be applied to this type of data.

**When to pick 2 players, 1 player, or 0 players**

Sometimes the calculated probability of a hit will be below some threshold, in which case it is better to not pick any player and instead keep the current streak. I have developed a dynamic programming algorithm to calculate these thresholds based on the current streak. For example, when your streak is higher, you might want to be more selective in the picks you make.
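To give a flavor of how such thresholds can be computed, here is a minimal sketch under strong simplifying assumptions; it is not the full algorithm. It assumes that each day the hit probability of the best available pick is drawn independently from a known discrete distribution (`quality`, a made-up input), that you either pick that one player or skip the day, and that there is no mulligan and no 2-player option. Working backwards from the last day, it compares the value of skipping with the value of picking and solves for the break-even hit probability.

```python
def skip_thresholds(quality, target=57, days=183):
    """Break-even hit probabilities for a toy pick-or-skip model.

    quality: list of (hit_probability, weight) pairs describing the best
             available pick on a random day (weights should sum to 1).
             This distribution is an assumption of the sketch.
    Returns thresholds where thresholds[d][s] is the minimum hit
    probability worth picking with d days left and a streak of s.
    """
    # value[s] = chance of eventually reaching target from streak s
    value = [0.0] * target + [1.0]  # with zero days left
    thresholds = [None]  # index 0 unused: no decision with no days left
    for _ in range(days):
        row = []
        new_value = [0.0] * target + [1.0]
        for s in range(target):
            skip = value[s]  # keep the streak, lose a day
            gain = value[s + 1] - value[0]
            # Pick iff q*value[s+1] + (1-q)*value[0] >= value[s].
            row.append((skip - value[0]) / gain if gain > 0 else 0.0)
            new_value[s] = sum(
                w * max(skip, q * value[s + 1] + (1 - q) * value[0])
                for q, w in quality
            )
        thresholds.append(row)
        value = new_value
    return thresholds


if __name__ == "__main__":
    # Hypothetical distribution of the day's best hit probability.
    table = skip_thresholds([(0.70, 0.3), (0.78, 0.5), (0.85, 0.2)])
    for streak in (0, 10, 30, 50):
        print(streak, table[100][streak])
```

With a streak of 0 there is nothing to lose, so the threshold is 0 and you should always pick. Printing the table for a few streak lengths shows how the break-even probability changes as the streak grows under the assumed distribution.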

**Taking into account PITCHf/x data**

As data becomes more and more readily available, Big Data has grown in popularity. I will use the PITCHf/x data (data describing every pitch of every at bat in every game since 2008) to match up batters with pitchers that they are likely to get a hit against. This is the idea that I hope will give me a significant advantage in this challenge.
