Beat the Streak: Day Four
In this blog post, I will introduce an idea I recently came up with to predict the most likely players to get a hit in a given game based on situational factors such as opposing starter, opposing team, ballpark, and so on. I have not written much on this topic in this blog, although I did some work on it last fall, which you can find here. In that work, I had the chance to explore a bunch of ideas that I had, but ultimately had to back up a few steps and rethink my approach. I think the ideas are still valid, and I will continue to refine them as time permits. A few weeks ago, I came up with a new approach that is completely different from my other approaches so far, and I will share it in the rest of this post.
Defining the Problem
Before we dive into the math, let's talk about what exactly we are trying to do. The end goal is to pick the player who is most likely to get a hit on a given day based on factors associated with that day's games. Some of these factors include: the batter, the pitcher, the ballpark, the two teams, the time of day, home/away for the batter, handedness of the opposing starter, handedness of the batter, and order in the lineup. To determine the most likely player to get a hit on a given day, we have to assign a probability to every batter in that day's starting lineups, then look at the players with the highest probabilities. I will note here that if these probabilities are well calibrated, they can also be used to decide whether it is worthwhile to pick a player on a given day, or whether you are better off taking a pass and maintaining your current streak. I formally analyzed this problem in this blog post.
Previous Approaches
In case you didn't get through my writeup where I outlined my previous approaches, I will summarize them here. The main difference between my approach last fall and my current approach is that I am looking at different data. Ultimately we want to know who is going to get a hit in a particular game, and my previous approaches attempted to answer this question by looking at data associated with individual at bats and even individual pitches. I tried a variety of things, one of which was weighted decision trees, to approximate the outcome probabilities for the possible events in an at bat and/or a pitch. With at-bat probabilities in hand, I estimated the distribution of the number of at bats to expect in a particular game, then combined the two to approximate the probability of getting a hit in a given game.
My current approach is different because, instead of looking at individual at bats and/or pitches and then transforming those predictions into predictions for an entire game, I look at the entire game directly. My new data set is derived from my old data set of at bats by combining the at bats that share the same date and player, then collapsing those rows into a single row with a new column for whether or not the batter got at least one hit in any of those at bats. My new approach also differs from previous approaches because it doesn't rely on the simplifying assumption that every batter has been facing average pitchers and that every pitcher has been facing average batters. Depending on the strength of the teams in the same division, different players will face opponents of different strengths. Eventually, I think I will go back to my original idea, since there's valuable information to mine there. However, I will not be talking about that in this blog post.
The Approach
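The collapsing step described above, from at-bat rows to one row per batter per date, can be sketched in a few lines of Python. The rows, field layout, and outcome labels below are made up for illustration; the real data set is not shown in this post and has many more fields.

```python
from collections import defaultdict

# Hypothetical at-bat rows: (date, batter, outcome). These values are
# invented for illustration and do not come from the real data set.
at_bats = [
    ("2016-04-10", "Player A", "single"),
    ("2016-04-10", "Player A", "strikeout"),
    ("2016-04-10", "Player B", "groundout"),
    ("2016-04-10", "Player B", "lineout"),
    ("2016-04-11", "Player A", "flyout"),
    ("2016-04-11", "Player B", "home_run"),
]

# Assumed outcome labels that count as a hit.
HIT_OUTCOMES = {"single", "double", "triple", "home_run"}


def collapse_to_games(rows):
    """Return {(date, batter): got_hit} with one entry per batter per date."""
    games = defaultdict(bool)
    for date, batter, outcome in rows:
        key = (date, batter)
        # The game is a success if any at bat on that date was a hit.
        games[key] = games[key] or (outcome in HIT_OUTCOMES)
    return dict(games)


games = collapse_to_games(at_bats)
for (date, batter), got_hit in sorted(games.items()):
    print(date, batter, got_hit)
```

Grouping on date and player alone means a doubleheader is treated as a single game, which matches the description above but may or may not be what you want.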
Finally, I feel like I've sufficiently introduced the topic, so I can start talking about the solution. Here are some basic facts which are either completely obvious or easily verifiable:
- An average player gets a hit in 60-65% of games
- Some batters are above average or below average
- Some pitchers are above average or below average
- Some ballparks are more hitter friendly than others
- Some teams have stronger bullpens than other teams
- Some teams have stronger lineups than other teams (more at bats for each player)
- Some other factors affect the likelihood of a player getting a hit
These facts raise two questions:
- How should the transformation function be defined?
- How do we assign values to each batter/pitcher/ballpark/etc.?
For a simple example with three variables, define:
- Let \( a_i \) be the value for batter \( i \)
- Let \( b_i \) be the value for starting pitcher \( i \)
- Let \( c_i \) be the value for ballpark \( i \)
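The transformation function itself is not reproduced here, so the following is only a sketch of one possible form, not necessarily the one I used. It assumes the values are multiplied together and applied as an exponent to a base rate, so that a value of 1 is neutral. The function name `hit_probability`, the base rate of 0.62, and the three example values are all hypothetical.

```python
from math import prod


def hit_probability(base_rate, values):
    """Assumed transformation: base_rate raised to the product of the values.

    With a base rate between 0 and 1, a value of 1 is neutral, a value
    below 1 raises the probability, and a value above 1 lowers it.
    This form is an assumption made for illustration.
    """
    if not 0.0 < base_rate < 1.0:
        raise ValueError("base_rate must be strictly between 0 and 1")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    return base_rate ** prod(values)


# Assumed base rate, picked from the 60-65% range for an average player.
BASE_RATE = 0.62

# Hypothetical values for one batter, one starting pitcher, one ballpark.
a, b, c = 0.8, 1.1, 0.95

p = hit_probability(BASE_RATE, [a, b, c])
print(f"P(at least one hit) = {p:.3f}")
```

One appeal of a form like this is that the output always stays between 0 and 1 no matter how many positive values are multiplied in.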
Results
In my actual implementation, I took into account more variables than I demonstrated in the simple example above. Unfortunately, the best values for the parameters are not close to 1 as I was hoping they would be. For some variables, all of the values are well above 1, and for others all of the values are well below 1. Taken together, they more or less cancel out, so we must use all variables at once to get a probability that makes sense. The tables below show, for each variable, the values for the 10 most hitter-friendly players/situations (or all of them when there are fewer than 10).
Batter | Value |
---|---|
Corey Seager | 0.3731 |
Jose Abreu | 0.3775 |
Andres Blanco | 0.3805 |
Devon Travis | 0.3927 |
Dee Gordon | 0.4264 |
Danny Valencia | 0.4288 |
Matt Duffy | 0.4351 |
Martin Prado | 0.4427 |
Lorenzo Cain | 0.4453 |
Daniel Murphy | 0.4460 |
Starting Pitcher | Value |
---|---|
Trevor May | 0.2775 |
Mike Pelfrey | 0.2855 |
Buck Farmer | 0.2939 |
Phil Hughes | 0.3337 |
Alex Colome | 0.3504 |
Tommy Milone | 0.3640 |
Vance Worley | 0.3732 |
Ervin Santana | 0.3741 |
Tyler Duffey | 0.3804 |
Ricky Nolasco | 0.3848 |
Ballpark (by home team) | Value |
---|---|
Rangers | 0.9589 |
Rockies | 0.9754 |
Red Sox | 0.9896 |
Indians | 1.0238 |
Twins | 1.0590 |
Orioles | 1.1303 |
Yankees | 1.1638 |
Astros | 1.1661 |
D-backs | 1.1680 |
Tigers | 1.1977 |
Pitcher Team | Value |
---|---|
Yankees | 0.4415 |
D-backs | 0.4493 |
Brewers | 0.6695 |
Cardinals | 0.7631 |
Royals | 0.7877 |
Giants | 0.8231 |
Braves | 0.8539 |
Nationals | 0.8670 |
Athletics | 0.9461 |
Cubs | 0.9498 |
Batting Order | Value |
---|---|
1 | 0.2029 |
2 | 0.2212 |
3 | 0.2578 |
4 | 0.2372 |
5 | 0.2397 |
6 | 0.2398 |
7 | 0.2512 |
8 | 0.2708 |
9 | 0.3540 |
Time | Value |
---|---|
Day | 1.4976 |
Night | 1.5330 |
Location (Batter) | Value |
---|---|
Home | 2.5846 |
Away | 2.6131 |
If we take the smallest value in every category we end up with a situation where the batter has a ~98% chance of getting a hit. Clearly this idea needs to be revised but it seems to work reasonably well as a proof of concept. I've used it to make a few of my picks and it usually makes good picks, although it doesn't handle players very well if they have only played in a few major league games.
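As a rough check of that figure, here is the arithmetic under an assumption: that the category values are multiplied together and applied as an exponent to a base rate \( p_0 \). The transformation is not reproduced here, so treat this form as a guess. Taking the smallest value from each of the seven tables above:
\[
0.3731 \times 0.2775 \times 0.9589 \times 0.4415 \times 0.2029 \times 1.4976 \times 2.5846 \approx 0.0344
\]
With \( p_0 = 0.62 \), a value picked from the 60-65% range for an average player:
\[
p_0^{0.0344} = 0.62^{0.0344} \approx 0.984
\]
Under that assumption the result lands near the ~98% quoted above, and it stays between roughly 0.98 and 0.99 for any base rate in the 60-65% range.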