### Beat the Streak, Day 12: My Grand Vision

Dear blog readers and beat-the-streak enthusiasts alike,

Today I want to talk about my grand vision for beat-the-streak modelling.  I will explain how I have always envisioned a solution for beat-the-streak, but never actually attempted to solve it this way due to the complexity of the project and the limited time I have to work on it.  I hope this could be the year that I start chipping away at putting this grand vision into practice.  In this blog post, I want to explicitly write out this vision along with the sub-problems that would need to be solved to execute it.  I have hinted at this vision in some earlier blog posts, but now I want to dive a little deeper into exactly what this idea would entail.

In short, this vision requires modeling probability distributions at the finest level of granularity (pitches) and using those as building blocks for coarser-granularity models (at-bats and games).  The specific models that I am proposing to train and develop are listed below.  Among these, there are three standard probabilistic classification models, two continuous distribution fitting models, and three known models (which could be lightly parameterized/learned).  I have looked at some of these sub-problems in the past, but not all of them, and have not yet put them together into a unified grand model.

Previously, my approach was to directly learn a model at the coarsest level of granularity (games), which is ultimately what we care about for beat-the-streak.  However, I have several reasons to believe it would be helpful to draw on finer-granularity data, both to get more signal and richer features.  One benefit of this approach is that improvements to any sub-model should ultimately be reflected in the final coarse-grained model.  In the sections below, I'll talk briefly about each sub-model and a natural baseline to consider (and hopefully improve over significantly).

# Pitch Level Granularity

## [Probabilistic Classification] P(Hit | launch angle, spray angle, exit velocity, context)

For the first model, we want to simply calculate the probability of a hit, assuming the ball was put in play, given its launch angle, spray angle, exit velocity, and other outside context (wind, ballpark, etc.).  A natural baseline for this model is to simply ignore the context and say the probability of a hit is constant.  This is a trivial model and so would not be that good, but will hopefully be easy to improve upon [in a future blog post].
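To make the baseline comparison concrete, here is a minimal sketch on synthetic batted-ball data (the features, coefficients, and generative rule are all made up for illustration) of the constant-probability baseline next to a simple logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n = 5000
# Synthetic batted balls: launch angle (deg) and exit velocity (mph)
launch_angle = rng.normal(12, 15, n)
exit_velo = rng.normal(88, 10, n)
# Toy generative rule: hard-hit line drives are more likely to fall for hits
logit = 0.08 * (exit_velo - 88) - 0.004 * (launch_angle - 15) ** 2
hit = rng.random(n) < 1 / (1 + np.exp(-logit))

# Constant baseline: ignore all features, predict the overall in-play hit rate
p_const = hit.mean()
ll_const = log_loss(hit, np.full(n, p_const))

# One step up: logistic regression on simple batted-ball features
X = np.column_stack([launch_angle, exit_velo, launch_angle ** 2])
clf = LogisticRegression(max_iter=1000).fit(X, hit)
ll_model = log_loss(hit, clf.predict_proba(X)[:, 1])
```

Even this small step should beat the constant baseline in log loss, which is the kind of incremental improvement I have in mind for each sub-problem.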

## [Continuous Distribution Fitting] P(launch angle, spray angle, exit velocity | in play=True, x, y, type, speed, nastiness, context)

For the second model, we want to fit a distribution over launch angle, spray angle, and exit velocity, given that the ball was put in play, as a function of the pitch characteristics and other context (batter, etc.).  A natural baseline model would ignore the context and just fit a model to P(launch angle, spray angle, exit velocity | in play = True).  A natural model class to consider could be a mixture of Gaussians or a kernel density estimator.  This sub-problem will be explored in greater depth in a future blog post.
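As a sketch of what the context-free baseline could look like, here is a Gaussian mixture fit to synthetic (launch angle, spray angle, exit velocity) triples; the two synthetic clusters are stand-ins for, say, ground balls and fly balls, and all the numbers are placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic batted-ball triples: (launch angle, spray angle, exit velocity)
ground_balls = rng.normal([-5, 0, 85], [8, 20, 8], size=(2000, 3))
fly_balls = rng.normal([30, 0, 90], [10, 25, 9], size=(1500, 3))
X = np.vstack([ground_balls, fly_balls])

# Context-free baseline: fit P(la, sa, ev | in play) as a Gaussian mixture
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# The fitted density can be evaluated pointwise or sampled from,
# which is what the downstream hit-probability model would consume
log_density = gmm.score_samples(X[:5])
samples, _ = gmm.sample(10)
```

Conditioning on context could then mean anything from fitting separate mixtures per batter to making the mixture parameters the output of a neural network.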

## [Probabilistic Classification] P(outcome | x, y, type, speed, nastiness, context)

For the third model, we want to fit a distribution for the outcome of the pitch (i.e., ball, strike, foul, hit by pitch, in play) given the pitch characteristics and other context (batter, etc.).  One natural baseline is to assume the probability is constant for all pitch characteristics and batters.  Another natural baseline is to assume the distribution depends on the pitch characteristics, but not on the other context (including batter).  One model class to consider for this sub-problem is a feed-forward neural network.  Alternatively, we could instead fit P(x, y, type, speed, nastiness | outcome = Z) for each Z, and apply Bayes rule to flip it around.  This sub-problem will be explored in greater depth in a future blog post.
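For illustration, here is a rough sketch of the feed-forward approach on synthetic pitch data; the features, zone definition, and outcome probabilities are all made up for the example, and the real model would condition on much richer context:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
n = 3000
# Synthetic pitch characteristics: location (x, y) and speed
x = rng.normal(0, 0.8, n)      # horizontal location, feet from plate center
y = rng.normal(2.5, 0.8, n)    # vertical location, feet
speed = rng.normal(92, 5, n)
in_zone = (np.abs(x) < 0.83) & (np.abs(y - 2.5) < 1.0)
# Toy outcome labels (0=ball, 1=strike, 2=in play), skewed by location
probs = np.where(in_zone[:, None], [0.1, 0.6, 0.3], [0.7, 0.2, 0.1])
outcome = np.array([rng.choice(3, p=p / p.sum()) for p in probs])

# Feed-forward network mapping pitch characteristics to outcome distribution
X = np.column_stack([x, y, speed])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                    random_state=0).fit(X, outcome)
# predict_proba gives the full conditional outcome distribution per pitch
dist = clf.predict_proba(X[:1])
```

The Bayes-rule alternative would instead fit one density per outcome class and combine them with the class priors, which can be attractive when some outcomes are rare.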

## [Continuous Distribution Fitting] P(x, y, speed, nastiness | type, context)

For the fourth model, I already have a pretty good solution that I came up with in my graduate machine learning course.  See the link above (and below) for more information about the approach and the baselines considered there.

## [Probabilistic Classification] P(type | context)

The fifth model is also something I solved in my graduate machine learning course.

# At Bat and Game Level Granularity

## [Known Model, Markov Chain] P(hit | context) - depends on previous models

The role of the sixth model is to "lift" our pitch-level models into an at-bat level model.  By combining the models learned above, we can construct a Markov chain that represents the sequence of events in a plate appearance.  By analyzing this Markov chain, we can exactly calculate the probability of ending in the right terminal state (in play, hit).  This is a problem that I considered in my senior capstone course for mathematics back in 2016.  See the link above.
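To illustrate the idea, here is a sketch of the absorbing Markov chain over ball-strike counts.  The per-pitch outcome probabilities below are count-independent placeholders; in the grand vision they would come from the pitch-level sub-models, conditioned on context:

```python
import numpy as np

# Count states (balls, strikes) plus terminal states, as an absorbing chain
counts = [(b, s) for b in range(4) for s in range(3)]
terminals = ["walk", "strikeout", "out_in_play", "hit"]
states = counts + terminals
idx = {s: i for i, s in enumerate(states)}

# Placeholder per-pitch outcome probabilities (sum to 1)
p_ball, p_strike, p_foul, p_inplay = 0.36, 0.25, 0.19, 0.20
p_hit_given_inplay = 0.33  # would come from the P(hit | batted ball) model

n = len(states)
T = np.zeros((n, n))
for (b, s) in counts:
    i = idx[(b, s)]
    # Ball: advance the count, or walk on ball four
    T[i, idx["walk"] if b == 3 else idx[(b + 1, s)]] += p_ball
    # Strike: advance the count, or strike out on strike three
    T[i, idx["strikeout"] if s == 2 else idx[(b, s + 1)]] += p_strike
    # Foul: adds a strike only below two strikes, otherwise stay put
    T[i, i if s == 2 else idx[(b, s + 1)]] += p_foul
    # Ball in play: split into hit vs out using the batted-ball model
    T[i, idx["hit"]] += p_inplay * p_hit_given_inplay
    T[i, idx["out_in_play"]] += p_inplay * (1 - p_hit_given_inplay)
for t in terminals:
    T[idx[t], idx[t]] = 1.0  # terminal states absorb

# Absorption probabilities B = (I - Q)^{-1} R, starting from an 0-0 count
Q = T[:len(counts), :len(counts)]
R = T[:len(counts), len(counts):]
B = np.linalg.solve(np.eye(len(counts)) - Q, R)
p_hit_pa = B[idx[(0, 0)], terminals.index("hit")]
```

The payoff is that `p_hit_pa` is an exact function of the pitch-level models, so any improvement to those models flows directly into the plate-appearance hit probability.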

## [Known Model, Learned Generalized Negative Binomial] P(team plate appearances | context) - depends on previous model

Baseline: Negative Binomial, team-independent

For the seventh model, we need to estimate the number of plate appearances that a team will get in a given game from the relevant context.  There are multiple ways we could approach this problem.  A natural option is to combine the at-bat-level model with the known lineup to exactly calculate the distribution of plate appearances via the negative binomial distribution.  This is something I scratched the surface of in Beat the Streak: Day 8, but it definitely needs revisiting in a future post.
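As a sketch of the team-independent baseline (with a placeholder per-PA out probability), here is how a distribution over team plate appearances falls out of the 27 outs a team records in a nine-inning game:

```python
import numpy as np
from scipy.stats import nbinom

# Team-independent baseline: a team bats until it records 27 outs, and
# each plate appearance is an out with probability q.  Total PAs are then
# 27 plus a negative binomial number of non-out plate appearances.
q = 0.68           # placeholder per-PA out probability (~1 - league OBP)
outs_needed = 27

# scipy's nbinom counts failures (non-out PAs) before the 27th success (out)
extra_pa = nbinom(outs_needed, q)
pa_values = np.arange(0, 40) + outs_needed
pa_pmf = extra_pa.pmf(pa_values - outs_needed)

expected_pa = outs_needed + extra_pa.mean()  # 27 + 27 * (1 - q) / q
```

The learned generalization would replace the single constant `q` with a per-PA out probability that depends on the lineup and other context, which is where the at-bat-level model comes in.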

## [Known Model, Exact Formula] P(hit in game | context) - depends on both models above

The eighth and final model is the probability of getting a hit in a given game.  If we have good models for the hit probability in a plate appearance and the number of plate appearances in a game, there is a simple formula we can use to calculate the probability of a hit in a given game.  This was also covered in Beat the Streak: Day 8.  The baseline we would hopefully be able to improve upon is a model that directly learns this conditional distribution without going through any sub-models.  That is how I have previously approached this problem due to its simplicity, so I already have some reasonable baselines to compare against.
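The simple formula is just a mixture over the plate-appearance count: P(hit in game) = 1 - Σ_k P(PA = k)(1 - p)^k, where p is the per-PA hit probability.  A sketch with placeholder numbers:

```python
def p_hit_in_game(p_hit_pa, pa_dist):
    """P(at least one hit in the game), given the per-PA hit probability
    and a distribution over the player's number of plate appearances.

    pa_dist: dict mapping k -> P(player gets k plate appearances)."""
    return 1.0 - sum(prob * (1.0 - p_hit_pa) ** k
                     for k, prob in pa_dist.items())

# Placeholder numbers: a decent hitter who usually gets 3-5 PAs
pa_dist = {3: 0.2, 4: 0.6, 5: 0.2}
p = p_hit_in_game(0.28, pa_dist)
```

Both inputs come straight from the two models above, so this final step introduces no new parameters of its own.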

# Next Steps

Now that I've outlined the sub-problems that need to be solved to realize this grand vision, I want to conclude this blog post by briefly discussing what's next in this line of work.
- Develop a robust evaluation framework.
- Implement baselines for each sub-problem and keep a leaderboard of sorts.
- Tackle each sub-problem one at a time.