Beat the Streak: Day Two

With the midway point of the MLB season fast approaching, it is becoming apparent that I need to finish my program if I want any shot at compiling a respectable streak this season. In this blog post, I will talk about one formula that I rely on heavily in my tool. It is based on Bill James' Log5 formula but is modified for non-binary output.

In general, the Log5 formula is used to approximate the probability that team A will defeat team B given each of their individual winning percentages. It can be modified to handle batter vs. pitcher matchups as well. There is no mathematically rigorous derivation of the formula, so using it in my tool is a little bit iffy to say the least. However, I think using it will yield a better approximation to batter vs. pitcher matchups than not using it.

In its simplest form, the approximate batting average of batter B against pitcher P is given by:
$$ AVG_{B v P} = \frac{\frac{AVG_B \cdot AVG_P}{AVG_L}}{\frac{AVG_B \cdot AVG_P}{AVG_L} + \frac{(1 - AVG_B) \cdot (1 - AVG_P)}{(1 - AVG_L)}} $$ where $AVG_B$ is the batter's average, $AVG_P$ is the average that the pitcher allows, and $AVG_L$ is the league average, which is incorporated into the formula as a normalizing factor. While I can't rigorously justify this formula, it has a number of intuitive properties:
  • \( AVG_B = AVG_L \leftrightarrow AVG_{B v P} = AVG_P \)
  • \( AVG_P = AVG_L \leftrightarrow AVG_{B v P} = AVG_B \)
  • \( AVG_B > AVG_L \leftrightarrow AVG_{B v P} > AVG_P \)
  • \( AVG_P > AVG_L \leftrightarrow AVG_{B v P} > AVG_B \)
  • \( 0 \leq AVG_B, AVG_P, AVG_L \leq 1 \rightarrow 0 \leq AVG_{B v P} \leq 1 \)
We can reason about this formula by breaking it into two components: the numerator is where the weighting occurs, and the denominator is where the normalization occurs. Here is the numerator of the formula: $$AVG_B \Big(\frac{AVG_P}{AVG_L}\Big) = \Big(\frac{AVG_B}{AVG_L}\Big) AVG_P$$ This gives two intuitive interpretations of the formula: $AVG_{BvP}$ is proportional to $AVG_B$ weighted by the ratio of $AVG_P$ to the league average. Alternatively, $AVG_{BvP}$ is proportional to $AVG_P$ weighted by the ratio of $AVG_B$ to the league average. The denominator is just there to enforce the last property in the list above. Without it, the resulting average could exceed 1, which obviously would not be correct.
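
To make the formula concrete, here is a minimal Python sketch of it. The function name `log5_avg` and the sample averages are my own illustrative choices, not part of the actual tool, and the sketch assumes the league average lies strictly between 0 and 1.

```python
def log5_avg(avg_b: float, avg_p: float, avg_l: float) -> float:
    """Estimate a batter-vs-pitcher average with the Log5-style formula.

    avg_b: the batter's average
    avg_p: the average the pitcher allows
    avg_l: the league average (assumed to be strictly between 0 and 1)
    """
    if not 0.0 < avg_l < 1.0:
        raise ValueError("league average must be strictly between 0 and 1")

    # Weighted terms for "hit" and "no hit", each scaled by the league rate.
    hit = avg_b * avg_p / avg_l
    no_hit = (1.0 - avg_b) * (1.0 - avg_p) / (1.0 - avg_l)

    total = hit + no_hit
    if total == 0.0:
        # Happens only for contradictory inputs such as avg_b = 0, avg_p = 1.
        raise ValueError("inputs are contradictory; the estimate is undefined")

    # The denominator normalizes the result into [0, 1].
    return hit / total


# Made-up numbers, purely for illustration.
print(log5_avg(0.300, 0.250, 0.260))
```

With these made-up inputs, the batter is better than the league average, so the estimate lands above the pitcher's allowed average; the pitcher is also better than the league average, so it lands below the batter's own average.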

Now that we have an intuitive understanding of what gives this formula the (desirable) properties it has, we can extend it one step further to suit our needs. There are five distinct outcomes that can come from any pitch: Ball, Strike, Foul, In Play (Hit), and In Play (Out).

Knowing the distribution of these outcomes for a given batter, a given pitcher, and the entire league, we can estimate this distribution for a given batter vs. pitcher matchup! For example, the estimated probability of a given pitch ending up as a Hit is given by:
$$ P_{B v P}(Hit) = \frac{\frac{P_B(Hit) P_P(Hit)}{P_L(Hit)}}
                            {\frac{P_B(Ball) P_P(Ball)}{P_L(Ball)} +
                            \frac{P_B(Strike) P_P(Strike)}{P_L(Strike)} +
                            \frac{P_B(Foul) P_P(Foul)}{P_L(Foul)} +
                            \frac{P_B(Hit) P_P(Hit)}{P_L(Hit)} +
                            \frac{P_B(Out) P_P(Out)}{P_L(Out)}}$$ This formula has all the nice properties that the last formula had, but is generalized to handle more outcomes.
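
The same computation can be sketched for the whole distribution at once. This is only a sketch under stated assumptions: the function name `matchup_distribution` and all of the probabilities below are hypothetical placeholders rather than real data, and every league probability is assumed to be nonzero.

```python
def matchup_distribution(batter: dict, pitcher: dict, league: dict) -> dict:
    """Estimate per-pitch outcome probabilities for a batter vs. pitcher matchup.

    Each argument maps an outcome name to its probability. All three must
    share the same keys, and every league probability must be nonzero.
    """
    # Numerator for each outcome: P_B(o) * P_P(o) / P_L(o).
    weights = {o: batter[o] * pitcher[o] / league[o] for o in league}

    # Denominator: the sum of those weighted terms over all outcomes,
    # which makes the estimated probabilities sum to 1.
    total = sum(weights.values())
    return {o: w / total for o, w in weights.items()}


# Hypothetical distributions, made up for illustration only.
league = {"Ball": 0.36, "Strike": 0.28, "Foul": 0.17, "Hit": 0.06, "Out": 0.13}
batter = {"Ball": 0.38, "Strike": 0.25, "Foul": 0.17, "Hit": 0.08, "Out": 0.12}
pitcher = {"Ball": 0.33, "Strike": 0.31, "Foul": 0.18, "Hit": 0.05, "Out": 0.13}

for outcome, prob in matchup_distribution(batter, pitcher, league).items():
    print(f"{outcome}: {prob:.4f}")
```

Because the function iterates over whatever outcomes the league distribution contains, passing just two outcomes (hit and no hit) reproduces the simpler batting average formula from earlier.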

Using this information, we can look at individual at-bats and estimate the probability of a hit in a given at-bat by looking at the possible pitch sequences. I have derived a Markov chain that can describe these events, and I will discuss it in a future blog post.
