Beat the Streak: Day Four

In this blog post, I will introduce an idea I recently came up with for predicting which players are most likely to get a hit in a given game, based on situational factors such as the opposing starter, the opposing team, the ballpark, and so on. I have not written much on this topic in this blog, although I did some work on it last fall, which you can find here. In that work, I had the chance to explore a bunch of ideas, but I ultimately had to back up a few steps and rethink my approach. I think the ideas are still valid, and I will continue to refine them as time permits. A few weeks ago, I came up with a new approach that is completely different from my other approaches so far, and I will share it in the rest of this post.

Defining the Problem

Before we dive into the math, let's talk about what exactly we are trying to do. The end goal is to pick the player who is most likely to get a hit on a given day, based on factors associated with the games for that day. Some of these factors include: the batter, the pitcher, the ballpark, the two teams, the time of day, home/away for the batter, handedness of the opposing starter, handedness of the batter, and order in the lineup. To determine the most likely player to get a hit on a given day, we have to assign a probability to every batter in a starting lineup that day, then look at the players with the highest probabilities. I will note here that if these probabilities are well calibrated, they can also be used to decide whether it is worthwhile to pick a player on a given day, or whether you are better off taking a pass and maintaining your current streak. I formally analyzed this problem in this blog post.

Previous Approaches

In case you didn't get through my writeup where I outlined my previous approaches, I will summarize them here. The main difference between my approach last fall and my current approach is that I am looking at different data. Ultimately we want to know who is going to get a hit in a particular game, and my previous approaches attempted to answer this question by looking at data associated with individual at bats and even individual pitches. I tried a variety of things, one of which was weighted decision trees, to approximate the outcome probabilities for possible events in an at bat and/or a pitch. With at bat probabilities in hand, I estimated the distribution of the number of at bats to expect in a particular game, then combined the two to approximate the probability of getting a hit in a given game.

My current approach is different because instead of looking at individual at bats and/or pitches and then transforming those predictions into predictions for an entire game, I am looking at the entire game directly. My new data set is derived from my old data set of at bats by grouping at bats for which the date and player are the same, then collapsing those rows into a single row with a new column for whether or not the batter got at least one hit in any of those at bats. My new approach also differs from previous approaches because it doesn't rely on the simplifying assumption that every batter has been facing average pitchers and that every pitcher has been facing average batters. Depending on the strength of the teams in the same division, different players will face opponents of different strengths. Eventually, I think I will go back to my original idea, since there's valuable information to mine there. However, I will not be talking about that in this blog post.
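
The collapsing step described above can be sketched in a few lines of plain Python. This is only an illustration, not the code used for this post: the row layout, the placeholder names, and the choice to treat the pitcher from a batter's first at bat as the starter are all assumptions made for the example.

```python
# Hypothetical at-bat rows: (date, batter, pitcher, ballpark, got_hit).
# The names and layout are placeholders, not the real data set.
at_bats = [
    ("2015-06-01", "Batter A", "Pitcher X", "Park 1", False),
    ("2015-06-01", "Batter A", "Pitcher X", "Park 1", True),
    ("2015-06-01", "Batter A", "Pitcher X", "Park 1", False),
    ("2015-06-01", "Batter B", "Pitcher X", "Park 1", False),
    ("2015-06-02", "Batter A", "Pitcher Y", "Park 2", False),
]


def collapse_to_games(rows):
    """Collapse at-bat rows into one row per (date, batter).

    Assumes rows are in chronological order, so the pitcher seen in the
    first at bat is taken to be the opposing starter.
    """
    games = {}
    for date, batter, pitcher, park, got_hit in rows:
        key = (date, batter)
        if key not in games:
            games[key] = {
                "date": date,
                "batter": batter,
                "starter": pitcher,
                "ballpark": park,
                "hit": False,
            }
        # The game counts as a success if any at bat produced a hit.
        games[key]["hit"] = games[key]["hit"] or got_hit
    return list(games.values())


game_rows = collapse_to_games(at_bats)
for row in game_rows:
    print(row)
```

Grouping on date and player alone merges both games of a doubleheader into one row, which matches the description above but may or may not be what you want.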

The Approach

Finally I feel like I've sufficiently introduced the topic, so I can start talking about the solution. Here are some basic facts which are either completely obvious or easily verifiable:

An average player gets a hit in 60-65% of games
Some batters are above average or below average
Some pitchers are above average or below average
Some ballparks are more hitter friendly than others
Some teams have stronger bullpens than other teams
Some teams have stronger lineups than other teams (more at bats for each player)
Some other factors affect the likelihood of a player getting a hit

My idea works by using 0.63 as the base probability of getting a hit before looking at any other information. Then I update the probability based on the situational variables. For example, if Miguel Cabrera is the batter, the 0.63 might get transformed to 0.78. If Mike Pelfrey is pitching, that 0.78 might get transformed into 0.81. The other variables have a similar effect on the probability. There are two questions that we need to answer at this point.

How should the transformation function be defined?
How do we assign values to each batter/pitcher/ballpark/etc.?

Note that the second question might not make sense yet, but after I answer the first question it should be clear what I mean. What properties should a transformation function have? Certainly it needs to be a map $ f : [0,1] \rightarrow [0,1] $, because the input and output should always be probabilities. Further, we want the transformation function for each variable to have the same form, and the order in which the variables are processed shouldn't affect the final output. Luckily there is a very natural function that satisfies these criteria, namely $$ f_a(x) = x^a $$ where $ x $ is the base probability and $ a $ is a positive number assigned to the value of one of the variables (e.g. the batter). The order doesn't matter because $ f_a(f_b(x)) = x^{a \cdot b} = f_b(f_a(x)) $. For Miguel Cabrera, for example, we could set $ a = 0.53 $, so that $ f_{0.53}(0.63) = 0.63^{0.53} \approx 0.78 $. With this method, average batters will have $ a \approx 1 $, above average batters will have $ a < 1 $, and below average batters will have $ a > 1 $. Similarly, variables that take on values favorable to the batter will have $ a < 1 $, and otherwise $ a > 1 $. As another example, hitter friendly ballparks like Coors Field should have $ a < 1 $, while tougher ballparks like Citi Field should have $ a > 1 $. Every variable that I listed can be dealt with in the same way. To work out a full example, assume six of the variables listed above are assigned the values $ [0.53, 0.87, 1.2, 0.9, 0.95, 1.0] $. We can estimate the probability of a batter getting a hit in this situation by evaluating $$ (f_{0.53} \circ f_{0.87} \circ f_{1.2} \circ f_{0.9} \circ f_{0.95} \circ f_{1.0}) (0.63) = 0.63^{0.53 \cdot 0.87 \cdot 1.2 \cdot 0.9 \cdot 0.95 \cdot 1.0} \approx \boxed{0.804} $$ So we can conclude that in this situation, the likelihood of the player getting a hit is about 80%.
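
The worked example above is easy to check numerically. The snippet below is a minimal sketch; the function names `transform` and `hit_probability` are made up for illustration. It applies the six values one at a time, then all at once through their product, and confirms that reversing the order changes nothing.

```python
import math

BASE_RATE = 0.63  # base probability of at least one hit in a game


def transform(prob, a):
    """Apply f_a(x) = x**a to a probability."""
    return prob ** a


def hit_probability(values, base=BASE_RATE):
    """Compose f_a over all values; equivalent to base ** product(values)."""
    return base ** math.prod(values)


values = [0.53, 0.87, 1.2, 0.9, 0.95, 1.0]

# Apply the transformations one variable at a time.
step_by_step = BASE_RATE
for a in values:
    step_by_step = transform(step_by_step, a)

combined = hit_probability(values)
reversed_order = hit_probability(list(reversed(values)))

print(round(transform(BASE_RATE, 0.53), 2))  # batter alone
print(round(combined, 3))                    # all six variables
```

Because the exponents simply multiply, a situation is fully summarized by one number, the product of its values.
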

Now that I've shown how to determine the probability of getting a hit in a given situation, assuming we know the number $ a $ associated with each value of every variable, I will explain how to go about finding these numbers. Everything up to this point has been fairly straightforward. This next part is a little more complicated, but if you have a strong background in mathematics you should be fine. I haven't quite settled on a notation that I like for this part of the problem, so it might seem a little confusing, but I will try my best to explain it clearly. Let's assume for a moment that we are only dealing with the first three variables (batter, pitcher, and ballpark), whose values we denote by $ a $, $ b $, and $ c $ respectively.

Let $ a_i $ be the value for batter $ i $
Let $ b_i $ be the value for starting pitcher $ i $
Let $ c_i $ be the value for ballpark $ i $

Note that $ a_i $, $ b_i $, and $ c_i $ are parameters in a statistical model. As such, we can use maximum likelihood estimation to find the values that best explain the training data (we have a data set that contains tens of thousands of examples to train from). Given a set of parameters, we can compute the likelihood of observing the data, assuming those are the true parameters, with the formulas below: $$ p_j = 0.63^{a_{x_j} \cdot b_{y_j} \cdot c_{z_j}} $$ $$ \text{Likelihood} = \prod_{j=1}^{N} \left[ h_j \cdot p_j + (1-h_j) \cdot (1 - p_j) \right] $$ I know the notation sucks, but unfortunately I can't think of a better way to set it up. $ x_j $ is the index of the batter in row $ j $ of the data set, $ y_j $ is the index of the starting pitcher in row $ j $, and $ z_j $ is the index of the ballpark in row $ j $. $ h_j = 1 $ if the player got a hit in the game, and $ h_j = 0 $ otherwise. $ N $ is the number of rows in the training data. One of the reasons I set the notation up this way is that every batter, pitcher, and ballpark appears in many different rows in many different combinations.

We seek the parameters $ a_i, b_i, c_i $ that maximize the likelihood. However, the likelihood is a product over every row of the data, so it is numerically $ 0 $ (meaning it's so small it can't be represented as a 64-bit double), and we must work with the log likelihood instead: $$ \text{LogLikelihood} = \sum_{j=1}^N \left[ h_j \log{(p_j)} + (1 - h_j) \log{(1 - p_j)} \right] $$ We want to maximize this with respect to the parameters $ a_i, b_i, c_i $. To do that, I defined the log likelihood in Python as a function of the parameters (with the data accessed globally) and maximized it using methods from scipy.optimize. Since I don't have a good intuition for whether or not this problem is convex, I used global optimization instead of local optimization.
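
To make the notation concrete, here is a small pure-Python sketch of the log likelihood for the three-variable case. It is not the optimized code used for the actual fit; the toy rows and parameter values are invented purely for illustration. The last lines show the underflow issue: exponentiating the log likelihood of a large data set gives exactly 0.0 in double precision.

```python
import math

BASE_RATE = 0.63


def game_probability(a, b, c, base=BASE_RATE):
    """p_j = base ** (a * b * c) for one row."""
    return base ** (a * b * c)


def log_likelihood(rows, batter_vals, pitcher_vals, park_vals):
    """Sum of h_j*log(p_j) + (1 - h_j)*log(1 - p_j) over all rows."""
    total = 0.0
    for x, y, z, h in rows:
        p = game_probability(batter_vals[x], pitcher_vals[y], park_vals[z])
        total += math.log(p) if h else math.log(1.0 - p)
    return total


# Toy rows (made up): (batter index x_j, pitcher index y_j, park index z_j, h_j)
rows = [(0, 0, 0, 1), (0, 1, 0, 1), (1, 0, 1, 0), (1, 1, 1, 1)]
batter_vals = [0.8, 1.1]
pitcher_vals = [1.0, 0.9]
park_vals = [1.0, 1.05]

ll = log_likelihood(rows, batter_vals, pitcher_vals, park_vals)
print(ll)

# With tens of thousands of rows the raw likelihood underflows to zero,
# while the log likelihood stays a perfectly ordinary number.
many_rows = rows * 20000
big_ll = log_likelihood(many_rows, batter_vals, pitcher_vals, park_vals)
raw_likelihood = math.exp(big_ll)
print(big_ll, raw_likelihood)
```
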

After many hours of coding and optimizing for speed (after all, the statistical model is a function of tens of thousands of things), I was finally able to run this program in a reasonable amount of time on three years' worth of data. If you want to code this yourself, you will need to supply the Jacobian (the gradient) of the log-likelihood function, or it will take way too long to converge. Anyway, it found the best parameters after about an hour of computation, but I let it run for an additional 10+ hours just to be sure that it had found the best solution. I know global optimization algorithms aren't guaranteed to converge to a global optimum, but based on the results I am reasonably convinced that it found one in this case.
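
For anyone who wants to try this, the sketch below shows one way to supply the gradient to scipy.optimize. It is not the code behind the results in this post and it makes several assumptions that should be stated plainly: it runs on synthetic data, it optimizes the logarithm of each value so that the values stay positive, and it uses a local method (L-BFGS-B through `scipy.optimize.minimize`) rather than global optimization. With $ \theta = \log $ of a value, the per-row derivative of the log likelihood works out to $ \frac{h_j - p_j}{1 - p_j} \cdot \log(0.63) \cdot a_{x_j} b_{y_j} c_{z_j} $, which is what the code accumulates.

```python
import numpy as np
from scipy.optimize import minimize

LOG_BASE = np.log(0.63)


def neg_log_likelihood(theta, batter, pitcher, park, hit, sizes):
    """Return (negative log likelihood, gradient); theta holds log-values."""
    n_b, n_p, n_k = sizes
    t_b, t_p, t_k = np.split(theta, [n_b, n_b + n_p])
    exponent = np.exp(t_b[batter] + t_p[pitcher] + t_k[park])  # a*b*c per row
    p = np.clip(np.exp(exponent * LOG_BASE), 1e-12, 1 - 1e-12)
    nll = -np.sum(hit * np.log(p) + (1 - hit) * np.log1p(-p))
    # Per-row derivative of the log likelihood w.r.t. any theta in that row.
    row_grad = (hit - p) / (1 - p) * LOG_BASE * exponent
    grad = -np.concatenate([
        np.bincount(batter, weights=row_grad, minlength=n_b),
        np.bincount(pitcher, weights=row_grad, minlength=n_p),
        np.bincount(park, weights=row_grad, minlength=n_k),
    ])
    return nll, grad


# Synthetic data (an assumption for illustration, not real baseball data).
rng = np.random.default_rng(0)
sizes = (30, 20, 5)  # number of batters, pitchers, ballparks
n_rows = 5000
batter = rng.integers(0, sizes[0], n_rows)
pitcher = rng.integers(0, sizes[1], n_rows)
park = rng.integers(0, sizes[2], n_rows)
true_theta = rng.normal(0.0, 0.2, sum(sizes))
true_exponent = np.exp(
    true_theta[batter]
    + true_theta[sizes[0] + pitcher]
    + true_theta[sizes[0] + sizes[1] + park]
)
hit = (rng.random(n_rows) < np.exp(true_exponent * LOG_BASE)).astype(float)
data = (batter, pitcher, park, hit, sizes)

theta0 = np.zeros(sum(sizes))  # every value starts at exp(0) = 1
result = minimize(neg_log_likelihood, theta0, args=data, jac=True,
                  method="L-BFGS-B")
fitted_values = np.exp(result.x)
print(result.fun, fitted_values[:5])
```

Passing `jac=True` tells `minimize` that the objective returns the gradient alongside the value, so no finite-difference evaluations are needed. Whether a local method like this reaches the same solution as a global search on the real data is untested here.
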

Results

In my actual implementation, I took into account more variables than I demonstrated in the simple example above. Unfortunately, the best values for the parameters are not close to 1 as I was hoping they would be. For some variables, all of the values are well above 1 and for others all of the values are well below 1. When taken into account together they more or less cancel out. Thus, we must use all variables at once to get a probability that makes sense. The tables below show the numbers for each variable corresponding to the 10 most hitter friendly players/situations (if there are more than 10 to begin with).

Batter	Value
Corey Seager	0.3731
Jose Abreu	0.3775
Andres Blanco	0.3805
Devon Travis	0.3927
Dee Gordon	0.4264
Danny Valencia	0.4288
Matt Duffy	0.4351
Martin Prado	0.4427
Lorenzo Cain	0.4453
Daniel Murphy	0.4460

Starting Pitcher	Value
Trevor May	0.2775
Mike Pelfrey	0.2855
Buck Farmer	0.2939
Phil Hughes	0.3337
Alex Colome	0.3504
Tommy Milone	0.3640
Vance Worley	0.3732
Ervin Santana	0.3741
Tyler Duffey	0.3804
Ricky Nolasco	0.3848

BallPark	Value
Rangers	0.9589
Rockies	0.9754
Red Sox	0.9896
Indians	1.0238
Twins	1.0590
Orioles	1.1303
Yankees	1.1638
Astros	1.1661
D-backs	1.1680
Tigers	1.1977

Pitcher Team	Value
Yankees	0.4415
D-backs	0.4493
Brewers	0.6695
Cardinals	0.7631
Royals	0.7877
Giants	0.8231
Braves	0.8539
Nationals	0.8670
Athletics	0.9461
Cubs	0.9498

Batting Order	Value
1	0.2029
2	0.2212
3	0.2578
4	0.2372
5	0.2397
6	0.2398
7	0.2512
8	0.2708
9	0.3540

Time	Value
Day	1.4976
Night	1.5330

Location (Batter)	Value
Home	2.5846
Away	2.6131
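
A possible reading of why whole tables sit well above or well below 1, offered as an observation about the form of the model rather than something checked against the fitted output: only the product of the values enters $ p_j $, so for any constant $ k > 0 $ $$ (k \cdot a_{x_j}) \cdot \left( \frac{b_{y_j}}{k} \right) \cdot c_{z_j} = a_{x_j} \cdot b_{y_j} \cdot c_{z_j} $$ Multiplying every batter value by $ k $ and dividing every pitcher value by $ k $ leaves every $ p_j $, and therefore the likelihood, unchanged. Under this model the overall level of any single table is not pinned down by the data; only the comparisons within a table and the product across tables are.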

If we take the smallest value in every category we end up with a situation where the batter has a ~98% chance of getting a hit. Clearly this idea needs to be revised but it seems to work reasonably well as a proof of concept. I've used it to make a few of my picks and it usually makes good picks, although it doesn't handle players very well if they have only played in a few major league games.
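
The ~98% figure can be reproduced from the tables above. The snippet below is a quick check that assumes the base rate of 0.63 and uses only the seven variables shown in the tables.

```python
import math

BASE_RATE = 0.63

# Smallest value in each table above.
smallest = {
    "batter (Corey Seager)": 0.3731,
    "starting pitcher (Trevor May)": 0.2775,
    "ballpark (Rangers)": 0.9589,
    "pitcher team (Yankees)": 0.4415,
    "batting order (1)": 0.2029,
    "time (Day)": 1.4976,
    "location (Home)": 2.5846,
}

exponent = math.prod(smallest.values())
best_case = BASE_RATE ** exponent
print(f"exponent = {exponent:.4f}, probability = {best_case:.3f}")
```

The combined exponent comes out to roughly 0.034, which is what pushes the probability so close to 1.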

Concluding Thoughts

Anyway, there's still a good amount of programming ahead of me to determine whether this approach works better than my previous approaches. I wanted to share this idea with other people who are interested in this problem so we can possibly open up a dialogue and make real progress towards solving it. I think my idea is a good example of thinking outside the box, which is what I think is necessary for this problem. At the same time, I don't think there is a very strong justification for the statistical model that I chose other than the fact that it has the properties I was looking for. However, since I parameterized the model and found the optimal values for the parameters, it seems like it should produce high quality estimates for most situations. It remains to be seen whether this idea will lead anywhere. There are a number of variations of this idea that I am going to try out once I get more free time. If you are interested in reproducing this work, have ideas to contribute, or want to work together on this, shoot me an email and let me know.

Ryan's Repository of Random Reflections