Beat the Streak: Day Four
In this blog post, I will introduce an idea I recently came up with to predict the most likely players to get a hit in a given game based on situational factors such as opposing starter, opposing team, ballpark, and so on. I have not written much about this on the blog, although I did some work on the topic last fall which you can find here. In that work, I had the chance to explore a bunch of ideas, but ultimately had to back up a few steps and rethink my approach. I think the ideas are still valid, and I will continue to refine them as time permits. A few weeks ago, I came up with a new approach that is completely different from my other approaches so far, and I will share it in the rest of this post.
Defining the Problem
Before we dive into the math, let's talk about what exactly we are trying to do. The end goal is to pick the player who is most likely to get a hit on a given day based on factors associated with the games for that day. Some of these factors include: the batter, the pitcher, the ballpark, the two teams, the time of day, home/away for the batter, handedness of the opposing starter, handedness of the batter, and order in the lineup. To determine the most likely player to get a hit on a given day, we have to assign a probability to every batter in the starting lineup for that day, then look at the players with the highest probabilities. I will note here that if these probabilities are well calibrated, they can be used to determine whether it is worthwhile to pick a player on a given day, or whether you are better off taking a pass and maintaining your current streak. I formally analyzed this problem in this blog post.
Previous Approaches
In case you didn't get through my writeup where I outlined my previous approaches, I will summarize them here. The main difference between my approach last fall and my current approach is the data I am looking at. Ultimately we want to know who is going to get a hit in a particular game, and my previous approaches attempted to answer this question using data on individual at bats and even individual pitches. I tried a variety of things, one of which was weighted decision trees, to approximate the outcome probabilities for the possible events in an at bat and/or a pitch. With at bat probabilities in hand, I estimated the distribution of the number of at bats to expect in a particular game, then combined the two to approximate the probability of getting a hit in that game.
My current approach is different because, instead of predicting individual at bats and/or pitches and then transforming those predictions into a prediction for an entire game, I look at the entire game directly. My new data set is derived from my old data set of at bats by grouping the at bats that share the same date and player, then collapsing each group into a single row with a new column indicating whether or not the batter got at least one hit in any of those at bats.
My new approach also differs from the previous ones because it doesn't rely on the simplifying assumption that every batter has faced average pitchers and every pitcher has faced average batters. Depending on the strength of the teams in their division, different players face opponents of different strengths. Eventually, I think I will go back to my original idea, since there's valuable information to mine there. However, I will not be talking about that in this blog post.
The Approach
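Before getting to the model itself, here is a minimal sketch of how the game-level data set described above (one row per batter per date, with a flag for at least one hit) could be built from at-bat rows. The record layout, the outcome labels, and the function name `collapse_to_games` are assumptions made for illustration; they are not the actual schema or code behind this post.

```python
from collections import defaultdict

# Hypothetical at-bat records: (date, batter, outcome).
# Field layout and outcome labels are illustrative assumptions.
at_bats = [
    ("day 1", "Batter A", "strikeout"),
    ("day 1", "Batter A", "single"),
    ("day 1", "Batter B", "walk"),
    ("day 1", "Batter B", "groundout"),
    ("day 2", "Batter A", "flyout"),
]

# Outcomes that count as a hit (assumed labels).
HIT_OUTCOMES = {"single", "double", "triple", "home_run"}


def collapse_to_games(rows):
    """Collapse at-bat rows into one entry per (date, batter).

    The value is True if the batter had at least one hit in any
    at bat on that date, and False otherwise.
    """
    got_hit = defaultdict(bool)  # missing keys start as False
    for date, batter, outcome in rows:
        got_hit[(date, batter)] |= outcome in HIT_OUTCOMES
    return dict(got_hit)


games = collapse_to_games(at_bats)
for (date, batter), hit in sorted(games.items()):
    print(date, batter, hit)
```

Each key in `games` plays the role of one row in the new data set. In practice the row would also carry the situational columns (pitcher, ballpark, batting order, and so on) for that game.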
Finally I feel like I've sufficiently introduced the topic, so I can start talking about the solution. Here are some basic facts which are either completely obvious or easily verifiable:
 An average player gets a hit in 60-65% of games
 Some batters are above average or below average
 Some pitchers are above average or below average
 Some ballparks are more hitter friendly than others
 Some teams have stronger bullpens than other teams
 Some teams have stronger lineups than other teams (more at bats for each player)
 Some other factors affect the likelihood of a player getting a hit
 How should the transformation function be defined?
 How do we assign values to each batter/pitcher/ballpark/etc.?
 Let \( a_i \) be the value for batter \( i \)
 Let \( b_i \) be the value for starting pitcher \( i \)
 Let \( c_i \) be the value for ballpark \( i \)
Results
In my actual implementation, I took into account more variables than I demonstrated in the simple example above. Unfortunately, the best values for the parameters are not close to 1 as I was hoping they would be. For some variables, all of the values are well above 1, and for others all of the values are well below 1. Taken together they more or less cancel out, so we must use all of the variables at once to get a probability that makes sense. The tables below show, for each variable, the values for the 10 most hitter-friendly players/situations (or all of them if there are fewer than 10). Smaller values are more favorable to the hitter.
Batter  Value

Corey Seager  0.3731
Jose Abreu  0.3775
Andres Blanco  0.3805
Devon Travis  0.3927
Dee Gordon  0.4264
Danny Valencia  0.4288
Matt Duffy  0.4351
Martin Prado  0.4427
Lorenzo Cain  0.4453
Daniel Murphy  0.4460
Starting Pitcher  Value 

Trevor May  0.2775 
Mike Pelfrey  0.2855 
Buck Farmer  0.2939 
Phil Hughes  0.3337 
Alex Colome  0.3504 
Tommy Milone  0.3640 
Vance Worley  0.3732 
Ervin Santana  0.3741 
Tyler Duffey  0.3804 
Ricky Nolasco  0.3848 
BallPark  Value 

Rangers  0.9589 
Rockies  0.9754 
Red Sox  0.9896 
Indians  1.0238 
Twins  1.0590 
Orioles  1.1303 
Yankees  1.1638 
Astros  1.1661 
Dbacks  1.1680 
Tigers  1.1977 
Pitcher Team  Value 

Yankees  0.4415 
Dbacks  0.4493 
Brewers  0.6695 
Cardinals  0.7631 
Royals  0.7877 
Giants  0.8231 
Braves  0.8539 
Nationals  0.8670 
Athletics  0.9461 
Cubs  0.9498 
Batting Order  Value

1  0.2029
2  0.2212
3  0.2578
4  0.2372
5  0.2397
6  0.2398
7  0.2512
8  0.2708
9  0.3540
Time  Value 

Day  1.4976 
Night  1.5330 
Location (Batter)  Value 

Home  2.5846 
Away  2.6131 
If we take the smallest value in every category we end up with a situation where the batter has a ~98% chance of getting a hit. Clearly this idea needs to be revised but it seems to work reasonably well as a proof of concept. I've used it to make a few of my picks and it usually makes good picks, although it doesn't handle players very well if they have only played in a few major league games.
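To make the "smallest value in every category" case concrete, here is a small sketch that collects the minimum from each table above and multiplies them together. Multiplying the values is an assumption on my part for this illustration: the remark that the variables "more or less cancel out" suggests they combine as a product, but the combining rule and the transformation function are not reproduced in this post, so the snippet stops at the combined value and does not convert it into a probability.

```python
from math import prod

# Smallest value in each category, copied from the tables above.
smallest = {
    "batter": 0.3731,            # Corey Seager
    "starting_pitcher": 0.2775,  # Trevor May
    "ballpark": 0.9589,          # Rangers
    "pitcher_team": 0.4415,      # Yankees
    "batting_order": 0.2029,     # batting first
    "time": 1.4976,              # Day
    "location": 2.5846,          # Home
}

# ASSUMPTION: the per-category values combine by multiplication.
# The actual combining rule is not shown in this post.
combined = prod(smallest.values())
print(f"combined value: {combined:.4f}")
```

Under that assumption the combined value comes out to roughly 0.034. How it maps to the ~98% figure depends on the transformation function and on any variables not shown in the tables.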