### Neural Networks Simplified

Artificial Neural Networks, and specifically deep neural networks have been gaining a lot of attention recently, and they are being used successfully across many disciplines. I first encountered neural networks a few years ago, but after hearing the term "backpropagation" and having no idea what it meant I feared that neural networks may be too complicated for me to understand. I made no attempts to truly understand them until recently, and have found that they are actually very easy to understand. I'm writing this blog post share my perspective on how to think about neural networks without jumping directly into the math.  This post is intended for beginners interested in machine learning using neural networks.  By the end of this blog post, you should be able to understand basics of neural networks enough to talk about them intelligently. And if you are so inclined, you can do a deeper dive into the mathematics that underlies the idea.

In this blog post I will talk about neural networks in the context of a probabilistic classification task, which is the problem of determining which class a given observation belongs to by assigning probabilities to each possible class. For example, I have used these types of neural networks to predict the outcome of an at bat based on the context (e.g., batter, pitcher, location, time, etc.) In general, there is some directly observable data $X$ which is just a set of numerical vectors, one vector per data item. Note that this setup can work with non-numerical data too, since there are techniques to convert non-numerical features to numerical ones. There is a correct classification $y_i$ for each $x_i \in X$. In the baseball example, for example, $y_i$ can be either hit, out or walk. A slightly simpler version assumes there are only 2 possible classes, $0$ or $1$, and I will assume we will be working in this setting for the remainder of the post. The goal of the neural network is to effectively capture the relationship between $X$ and $y$ so that it can be used to accurately predict the correct class of new data - i.e., it should generalize to new data effectively.

A neural network is just a complicated function from inputs $x$ to output probabilities $P[y=1]$.  A simple 3-layer neural network is shown below.  A neural network is a parametric model (meaning it can be expressed as a vector of numbers), where the parameters are just the edge weights of the graph below.  The expressive power of a neural network comes from the number of parameters in the model and the structure of the network. http://cs231n.github.io/neural-networks-1/
For a fixed setting of these edge weights, we can compute $P[y=1]$ from $x$ by filling in the input layer with the entries in $x$, computing the entries in the first hidden layer from the entries in the input layer, computing the entries in the second hidden layer from the first hidden layer, and finally computing $P[y=1]$ from the entries in the second hidden layer.  The procedure for calculating the value of a node is a very simple to step process: (1) take a weighted sum of all entries in the previous layer and (2) pass the weighted sum through a non-linear "squashing" function that maps numbers from $\mathbb{R} \rightarrow [0,1]$ such as the logistic function $f(x) = \frac{1}{1 + e^x}$.   Note that in step (1), the weights in the weighted sum are specified by the parameters associated with the edge weights going into the node.  In step (2), other non-linear functions are also possible, but as long as the squashing function is continuous and differentiable, the specific choice is not particularly important for understanding the rest of this post, but this non-linearity is what allows the neural network to approximate any arbitrary smooth non-linear function.

Anyway, since we can compute value of the output layer $P[y=1]$ for a given observation $x$ with a fixed set of parameters, we can easily understand how good a particular network is by comparing the output probabilities produced on the training data with the the actual classifications $y$.  Formally, we can compute the likelihood of observing the training classes $y$ given the training observations $X$ assuming the given neural network (i.e., the parameters specifying it) is the model that generated the classifications.

$$Likelihood = \prod_{i=1}^N P[y_i = 1]^{y_i} (1 - P[y_i = 1])^{1 - y_i}$$

Here, $P[y_i = 1]$ is the estimated probability that $y_i = 1$ as determined by the neural network for observation $x_i$.  To understand the formula, observe that $y_i \in \{ 0, 1 \}$ and we are just computing the probability of observing a collection of independent events.  With this function in hand, we have a way of measuring how good a particular network explains the training data.  Thus, the network which best describes the data is the one which maximizes this likelihood.

So far, we have mainly been working with a neural network with fixed parameters, but in reality they are unknown, and we want to find the parameters that best explain the training data.  Thus, we can express the likelihood as a function of $\theta$.  In practice, the log likelihood is preferred over the likelihood, because the product decomposes into a sum, which is more convenient to work with, and the log likelihood also doesn't become numerically $0$ with lots of data.  Since log is a strictly increasing function, the parameters that maximize the log likelihood also maximize the likelihood.  The log likelihood is

$$LogLike (\theta) = \sum_{i=1}^N y_i \log{(P_{\theta}[y_i = 1])} + (1-y_i) \log{(1-P_{\theta}[y_i = 1])}$$

where $\theta$ are the parameters of the model.  Note that the log likelihood is a smooth function of $\theta$ because it is a sum of $N$ smooth functions of $\theta$.  Thus, we can find the best set of parameters $\theta$ by maximizing the function with respect to $\theta$.  While it's not possible in general to maximize this function in closed form, there are several ways to numerically find (local) maximum.  The best methods are gradient-based (the gradient is a generalization of a derivative for multi-dimensional functions).  Gradient-based methods are sometimes referred to as "hill-climbing algorithms" because they attempt to find the maximum value of the function by iteratively taking a small step in the direction that has the largest slope.  Backpropagation is just an efficient algorithm for computing these gradients, so fast optimization can be done using gradient-based optimization tools.

Hopefully you understand the basic mechanics of a neural network now.  There's a lot of things I didn't talk about in this post such as over fitting and dealing with local (instead of global) maximums produced by most gradient-based optimization algorithms.   While this post is fairly long, the length of this post does not translate to complexity of neural networks.  Training a neural network can be summarized as finding the parameters of a statistical model that best explain the data by using gradient ascent.  That's all there is too it, and unfortunately when I first encountered neural networks I thought they were so much more complicated than that.