Improving Naive Bayes with MLE

Today I'm going to be talking about the probabilistic classifier known as Naive Bayes, and a recent idea I came up with to improve it. My idea relies on the same assumptions that Naive Bayes does, but it finds different values for the conditional and class probabilities, values that describe the data better (or so I originally thought). In this blog post, I am going to quickly go through the traditional Naive Bayes setup, introduce my idea, and then compare the two in terms of prediction quality.

The Traditional Setup

Let's assume we have a data set with \( N \) observations, where each observation has \( n \) attributes \( x_1, x_2, \dots, x_n \) and a single class value \( C_k \). Naive Bayes says that we can compute the probability of observing \( C_k \) given the attribute information by evaluating the following formula: $$ P(C_k \mid x_1, \dots, x_n) = \frac{ P(C_k) \prod_{i=1}^n P(x_i \mid C_k) }{P(x_1, \dots, x_n)} $$ where $$ P(x_1, \dots, x_n) = \sum_{j} P(C_j) \prod_{i=1}^n P(x_i \mid C_j) $$

These class and conditional probabilities are approximated from the training observations in the intuitive way. \( P(C_k) \) is approximated as the proportion of training observations that have class \( C_k \). \( P(x_i \mid C_k) \) is approximated as the proportion of observations whose \( i^{th} \) attribute takes on value \( x_i \) among those with class value \( C_k \). This is the intuitive thing to do, but is it the right thing to do? In both courses where I learned about Naive Bayes, not much justification was provided for approximating these parameters in the intuitive way. I initially accepted the intuition, but after reflecting on it for a bit I thought I could do better (it turns out I was both right and wrong).
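To make the counting concrete, here is a minimal sketch of how these estimates could be computed, assuming the attributes and class labels are encoded as small integers. The function name `fit_counts` and the array layout are my own choices for illustration, not something taken from a library or from my actual scripts.

```python
import numpy as np

def fit_counts(X, y, n_values, n_classes):
    """Estimate P(C_k) and P(x_i = v | C_k) by counting proportions.

    X         : (N, n) integer array of attribute values
    y         : (N,)   integer array of class labels
    n_values  : list with the number of possible values for each attribute
    n_classes : number of distinct classes
    """
    N, n = X.shape
    # P(C_k): proportion of training rows with class k
    class_probs = np.array([(y == k).mean() for k in range(n_classes)])

    # cond_probs[i][k, v] = P(x_i = v | C_k)
    cond_probs = []
    for i in range(n):
        table = np.zeros((n_classes, n_values[i]))
        for k in range(n_classes):
            rows = X[y == k, i]
            for v in range(n_values[i]):
                table[k, v] = (rows == v).mean() if len(rows) else 0.0
        cond_probs.append(table)
    return class_probs, cond_probs
```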

In order to find a better set of parameters, I will use a common technique from statistics called maximum likelihood estimation. While this technique is often introduced for low dimensional problems, such as fitting the parameters of a normal distribution, it can be applied in higher dimensional cases as well, like the situation we are dealing with here. Maximum likelihood estimation evaluates the probability of observing the training data under a given set of parameters; the parameters that maximize this likelihood are then used to describe the model. The likelihood function for this problem is given by: $$ Likelihood = \prod_{m=1}^N \frac{ P(C^m) \prod_{i=1}^n P(x^m_i \mid C^m) }{ \sum_{j} P(C_j) \prod_{i=1}^n P(x^m_i \mid C_j) } $$ where \( C^m \) is the class value of the \( m^{th} \) observation and \( x^m_i \) is \( x_i \) for the \( m^{th} \) observation in the training data. The log likelihood is given by: $$ LogLike = \sum_{m=1}^N \log{\frac{ P(C^m) \prod_{i=1}^n P(x^m_i \mid C^m) }{ \sum_{j} P(C_j) \prod_{i=1}^n P(x^m_i \mid C_j) }} $$ The parameters that maximize this log likelihood are not easy to calculate in closed form. In my experiments, I compute them numerically using SciPy's scipy.optimize.minimize routine. This is a significant computational burden in general, but there are steps that can be taken to limit its magnitude.
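For concreteness, here is a rough sketch of how this numerical optimization might be set up. It assumes the parameters are packed into one flat vector of unnormalized logits and mapped to probabilities with a softmax, which conveniently keeps the "probabilities sum to one" constraints satisfied during optimization. The function names, the parameterization, and the choice of L-BFGS-B are all illustrative; my actual scripts may be organized differently.

```python
import numpy as np
from scipy.optimize import minimize

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def neg_cond_log_like(theta, X, y, n_values, n_classes):
    """Negative conditional log likelihood of the class labels given attributes.

    theta packs unnormalized logits: the first n_classes entries parameterize
    P(C_k), followed by n_classes * n_values[i] entries for each attribute's
    P(x_i | C_k) table.
    """
    n = X.shape[1]
    class_probs = softmax(theta[:n_classes])
    pos = n_classes
    cond_probs = []
    for i in range(n):
        block = theta[pos:pos + n_classes * n_values[i]].reshape(n_classes, n_values[i])
        cond_probs.append(np.apply_along_axis(softmax, 1, block))
        pos += n_classes * n_values[i]

    total = 0.0
    for x_row, c in zip(X, y):
        # joint score P(C_j) * prod_i P(x_i | C_j) for every class j
        joint = class_probs.copy()
        for i, v in enumerate(x_row):
            joint *= cond_probs[i][:, v]
        total += np.log(joint[c] / joint.sum())
    return -total

# Hypothetical usage with n_values = [2, 3, 4] and n_classes = 3:
# theta0 = np.zeros(n_classes + n_classes * sum(n_values))
# result = minimize(neg_cond_log_like, theta0,
#                   args=(X_train, y_train, n_values, n_classes),
#                   method="L-BFGS-B")
```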


Notes

When I originally came up with this idea, I did a quick Google search to see if anything like it had been tried before. While my initial results came up empty (I was mainly looking at teaching PowerPoints), after digging a little deeper I found that something like this had indeed been considered. In finding this information, I also found a rigorous justification for the intuitive choice of parameters.

It turns out that the intuitive approximations of the parameters are also based on maximum likelihood estimation, but the likelihood function is slightly different. Instead of maximizing the probability of observing the class values given the attributes for all the training data (which is what I do), it maximizes the probability of observing all the data together, i.e., the joint likelihood. By setting it up this way, the log likelihood decomposes really nicely, and the optimal parameters can be solved for in closed form as a constrained optimization problem (e.g., using the method of Lagrange multipliers). In my setup, however, we are maximizing the conditional likelihood: the likelihood of observing the class values given the attribute values. My setup makes sense because in (probabilistic) classification we are interested in exactly that conditional probability. However, it doesn't decompose as nicely, so I have to resort to numerical optimization instead.
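To see why the joint likelihood leads to the counting estimates, here is a quick sketch of the decomposition in the same notation as above (an outline, not a full derivation). Under the Naive Bayes assumption the joint log likelihood splits into independent sums: $$ \log{\prod_{m=1}^N P(C^m, x^m_1, \dots, x^m_n)} = \sum_{m=1}^N \log{P(C^m)} + \sum_{m=1}^N \sum_{i=1}^n \log{P(x^m_i \mid C^m)} $$ Each sum involves a separate group of parameters, so each group can be maximized on its own subject to the constraints \( \sum_k P(C_k) = 1 \) and \( \sum_{v} P(x_i = v \mid C_k) = 1 \). Applying Lagrange multipliers to, say, the first sum gives \( P(C_k) = \frac{\#\{m : C^m = C_k\}}{N} \), which is exactly the proportion from the intuitive approach.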


Comparison

To test out my idea, I set up an experiment where I randomly assign values for all the parameters, then randomly generate a training set and a test set from those parameters under the standard Naive Bayes assumption. This way, the results are based on data where the Naive Bayes assumption is valid. While in my scripts the number of attributes, the number of classes, and the sizes of the training and test sets are all tunable, I will present results for one specific configuration:
  • 3 kinds of attributes that can take on 2, 3, and 4 values respectively.
  • 3 classes
  • 10000 rows in training set
  • 10000 rows in testing set
  • 100 random trials
In each trial, I record the (conditional) log likelihood of the parameters produced by the traditional method as well as by my method, on both the training set and the test set. I use these log likelihood values to compute a likelihood ratio between my method and the traditional method for both data sets. Finally, I compute the geometric mean of these ratios over all the trials. The likelihood can be interpreted as the probability of observing the data under the given parameters, and the likelihood ratio is a measure of how much better my method is than the traditional one. Likelihood ratios greater than 1.0 favor my method, while ratios less than 1.0 favor the traditional method.
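As a small illustration, here is how the geometric mean of the likelihood ratios can be computed directly from the per-trial log likelihoods; working in log space avoids underflow, since the raw likelihoods are astronomically small. The function name and the numbers in the comment are made up for illustration.

```python
import numpy as np

def geometric_mean_ratio(loglike_mine, loglike_traditional):
    """Geometric mean of per-trial likelihood ratios, computed in log space.

    loglike_mine, loglike_traditional : per-trial (conditional) log likelihoods
    evaluated on the same data set.
    """
    log_ratios = np.asarray(loglike_mine) - np.asarray(loglike_traditional)
    return np.exp(log_ratios.mean())

# Hypothetical example with made-up numbers:
# geometric_mean_ratio([-5200.1, -5180.3], [-5201.0, -5181.0]) returns
# a value greater than 1.0, which would favor the conditional-likelihood
# parameters on that data set.
```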

The geometric mean of the likelihood ratio for the training set was 2.15, while the geometric mean of the likelihood ratio for the test set was 0.51. These results are both expected and surprising. They are expected because, by the way my method is defined, it has to be at least as good as the traditional method on the training data (with one small caveat which I am not going to talk about in this blog post). On the test set, however, the traditional method outperforms my method. This came as a slight surprise to me, since I thought training on the conditional likelihood would help when evaluating the conditional likelihood on unseen data as well. In both cases the difference between the two choices of parameters is very small, as they produce likelihood values on the same order of magnitude. Additionally, the classification success rate is very similar between the two models, which is to be expected.

The increased computational requirements of my method make it not worth pursuing much further, but it was interesting to run these experiments and verify that fact for myself. I was prepared to provide much more extensive results if the preliminary findings had been more promising, but unfortunately I think this idea ends here. If there's one thing I think you should take away from this blog post, it is that you should always try to recognize situations in your coursework where hidden assumptions or choices are made, and try to justify those assumptions for yourself. If you are interested in the code behind this blog post, I have added it to one of my GitHub repositories.
