Posts

L2 projection onto the probability simplex with bound constraints

Projecting onto the probability simplex is a common problem that arises in frequency estimation and related tasks. This problem can be solved efficiently with an $O(n \log n)$ algorithm. In this blog post I consider the related problem of projecting onto the probability simplex with bound constraints. Given a vector $\mathbf{r} \in \mathbb{R}^n$, our goal is to find a vector $\mathbf{p}^*$ that solves the following optimization problem: $$ \begin{equation*} \begin{aligned} & \underset{\mathbf{p}}{\text{minimize}} & & \frac{1}{2} \| \mathbf{p} - \mathbf{r} \|_2^2 \\ & \text{subject to} & & \mathbf{1}^T \mathbf{p} = 1 \\ & & & \mathbf{a} \leq \mathbf{p} \leq \mathbf{b} \end{aligned} \end{equation*} $$ This problem generalizes the standard probability simplex projection problem by introducing arbitrary bound constraints $\mathbf{a} \leq \mathbf{p} \leq \mathbf{b}$. Here, $\mathbf{a}, \mathbf{b} \in \mat...
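To make the structure of the solution concrete, here is a minimal sketch of one standard approach (not necessarily the exact algorithm the full post derives): the KKT conditions imply the solution has the form $p_i = \mathrm{clip}(r_i - \tau, a_i, b_i)$ for a scalar $\tau$, and since $\sum_i \mathrm{clip}(r_i - \tau, a_i, b_i)$ is nonincreasing in $\tau$, bisection recovers it. The function name is mine, and feasibility ($\mathbf{1}^T\mathbf{a} \leq 1 \leq \mathbf{1}^T\mathbf{b}$) is assumed.

```python
import numpy as np

def project_bounded_simplex(r, a, b, tol=1e-10):
    # For a fixed multiplier t, the minimizer of the Lagrangian is
    # clip(r - t, a, b); bisect on t until the entries sum to 1.
    lo = np.min(r - b)   # here clip(r - lo, a, b) == b, so the sum is >= 1
    hi = np.max(r - a)   # here clip(r - hi, a, b) == a, so the sum is <= 1
    while hi - lo > tol:
        t = 0.5 * (lo + hi)
        if np.clip(r - t, a, b).sum() > 1.0:
            lo = t
        else:
            hi = t
    return np.clip(r - 0.5 * (lo + hi), a, b)
```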

Beyond vectorization: writing efficient numpy code

Numpy is a great library for expressing computations over vectors, matrices, and multi-dimensional arrays in python due to its simplicity and efficiency. In a previous blog post, I discussed fairly well-known techniques for speeding up (and cleaning up) numpy code by avoiding loops and exploiting problem structure. I showed that when you use the suggestions in that post, you can speed up your numpy code by orders of magnitude. In this blog post, I'll show you how to get even more speed out of your numpy code on top of the improvements you get from vectorization. As a simple example to illustrate an inefficiency of numpy, consider computations of the form z = 0.2*x + 0.8*y where x and y are large numpy arrays. This computation will form the intermediate arrays 0.2*x and 0.8*y, so the memory overhead can be problematic and slow down the computation. On my laptop, it takes about 1 second for arrays of size 100 million:

```python
import numpy as np

def foo(x, y):
    return 0.2*x + 0.8*y
```
...
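As a hedged sketch of the kind of fix the post points at (the helper names are mine), two options: reuse a preallocated buffer via numpy's out= arguments, or hand the whole expression to a library like numexpr, which evaluates it in cache-sized blocks without materializing full-size intermediates.

```python
import numpy as np
import numexpr as ne

def foo_out(x, y, out):
    # Write 0.2*x directly into a caller-supplied buffer;
    # this removes one of the two temporaries (0.8*y is still allocated).
    np.multiply(x, 0.2, out=out)
    out += 0.8*y
    return out

def foo_ne(x, y):
    # numexpr compiles the expression and streams over the inputs,
    # avoiding full-size intermediate arrays entirely.
    return ne.evaluate('0.2*x + 0.8*y')
```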

Beat the Streak: Day Seven

In this blog post, I discuss questions of the form: “Does batter X perform better at home or away? At day or night? Against lefties or righties? On Friday or Monday?” What I found was a little bit surprising. Take, for example, the batter Daniel Murphy. When you look at his data from 2011-2017, you will see that he got a hit in 29.85% of 1424 plate appearances during day games and he got a hit in 26.97% of 2673 plate appearances during night games. This is a pretty meaningful difference, but is it statistically significant? In other words, could this difference be explained purely by chance? To answer this question, we can perform a chi-squared test under the null hypothesis that the true probabilities are the same. When we do this we get a chi-squared value of 3.35 and a corresponding p-value of 0.067. Thus, we can reject the null hypothesis that the true underlying probabilities are the same at the 90% confiden...
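For readers who want to reproduce this kind of test, here is a minimal sketch using scipy. The counts below are reconstructed from the quoted rates, so rounding means the statistic may not exactly match the 3.35 reported in the excerpt.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hit counts reconstructed from the quoted rates:
# 29.85% of 1424 day PAs ~ 425 hits, 26.97% of 2673 night PAs ~ 721 hits.
table = np.array([[425, 1424 - 425],     # day:   hits, non-hits
                  [721, 2673 - 721]])    # night: hits, non-hits

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(chi2, p)  # on these rounded counts, roughly chi2 = 3.8, p = 0.05
```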

Optimal Strategy for Farkle Dice

In this post, I will discuss my findings in terms of an optimal strategy for Farkle, which is a dice game of chance. If you are unfamiliar with the game, I encourage you to skim the linked article so that you can better understand this blog post. All of my findings are based on sound mathematical methods, and I used a computer program to determine the optimal strategy. Before I begin, let me first state the values of the different dice rolls I assumed while developing the strategy:

- Each 1: 100 points
- Each 5: 50 points
- 3 1's: 1000 points
- 3 2's: 200 points
- ...
- 3 6's: 600 points
- 4 1's: 2000 points
- ...
- 4 6's: 1200 points
- 5 1's: 3000 points
- ...
- 5 6's: 1800 points
- Straight 1-6: 1500 points
- 3 Pairs (e.g., 22 33 44): 750 points

Everything else is considered to be a Farkle (0 points). There are many variants of Farkle, but I like to play by this set of rules (a code sketch of this scoring table follows the excerpt). The "game state" of a Farkle roll can roughly be characterized by 2 values: the curre...
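To make the scoring table concrete, here is a minimal sketch of a scorer for a single roll under the values above. It assumes the pattern in the table extends so that n-of-a-kind (n ≥ 3) is worth (n-2) times the triple value, that the six-die combinations take precedence, and that leftover 1's and 5's score as singles; the function name is mine, and this is not necessarily how the post's program represents things.

```python
from collections import Counter

def score_roll(roll):
    counts = Counter(roll)
    # Special six-die combinations take precedence in this sketch.
    if sorted(roll) == [1, 2, 3, 4, 5, 6]:
        return 1500                      # straight 1-6
    if len(roll) == 6 and sorted(counts.values()) == [2, 2, 2]:
        return 750                       # three pairs
    score = 0
    for face, n in counts.items():
        triple = 1000 if face == 1 else 100 * face
        if n >= 3:
            score += (n - 2) * triple    # 3/4/5 of a kind: 1x/2x/3x triple
        elif face == 1:
            score += 100 * n             # leftover 1's
        elif face == 5:
            score += 50 * n              # leftover 5's
    return score                         # 0 means the roll is a Farkle

print(score_roll([1, 1, 1, 5, 2, 3]))   # 1000 + 50 = 1050
```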

Writing Efficient Numpy Code

In this blog post, I am going to talk about writing efficient numpy code in python. I do a good amount of numerical linear algebra for my research and personal projects, and I typically code in python with numpy (and scipy) because it is very easy to use and it is also very efficient when used correctly. Consider the following problem, which will serve as a running example throughout this post. Suppose we have a (column) vector $x$ of length $n = 2^k$ and we want to compute $H_k x$ where $H_k$ is a “hierarchical” matrix with branching factor $2$, defined by $$ \begin{align*} H_0 &= \begin{bmatrix} 1 \end{bmatrix} \\ H_{k+1} &= \begin{bmatrix} 1 & 1 \\ H_k & 0 \\ 0 & H_k \end{bmatrix} \end{align*} $$ where the top row is a vector of all ones and $0$ denotes a matrix of zeros having the same size as $H_k$. For example, $H_2$ is a $7 \times 4$ matrix that looks like this: $$ \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0...
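The recursive definition immediately suggests a way to compute $H_k x$ without ever materializing the matrix: the first output entry is the sum of $x$, and the rest is $H_{k-1}$ applied to each half. A minimal recursive sketch (the function name is mine; the full post may develop a different, more vectorized implementation):

```python
import numpy as np

def hmatvec(x):
    # Computes H_k @ x for len(x) == 2**k without forming H_k.
    if x.size == 1:
        return x.copy()                  # base case: H_0 = [1]
    half = x.size // 2
    return np.concatenate(([x.sum()],           # top row of ones: sum of x
                           hmatvec(x[:half]),   # H_{k-1} on the left half
                           hmatvec(x[half:])))  # H_{k-1} on the right half

x = np.array([1.0, 2.0, 3.0, 4.0])
print(hmatvec(x))   # [10. 3. 1. 2. 7. 3. 4.] -- matches H_2 @ x
```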

Representing Graphical Model Factors in Numpy

I've been working a bit with graphical models lately for my research. For a while I was using a library called pgmpy for its implementations of factor arithmetic and inference algorithms. However, as my research has progressed I need more control than pgmpy offers, so I decided to re-implement and extend the algorithms that I needed. If you are in a similar situation and find yourself implementing your own graphical model algorithms, this post is for you. In the setting I am working in, I have a Markov random field model with only a small number of variables (no more than a few dozen, often far fewer). However, each variable can take on a possibly large number of categorical values, and the graph may contain relatively large cliques that make exact inference computationally challenging (although still possible). Suppose the model has $d$ variables and variable $i$ can take on $n_i$ possible values. Usually a factor defined...
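A natural representation, and a hedged sketch of the kind of machinery the post builds (the class below is mine, not pgmpy's API), is to store a factor as a numpy ndarray with one axis per variable; the factor product then reduces to axis alignment plus broadcasting.

```python
import numpy as np

class Factor:
    def __init__(self, variables, values):
        self.variables = tuple(variables)   # e.g., ('A', 'B')
        self.values = np.asarray(values)    # one axis per variable

    def __mul__(self, other):
        # Factor product: align both tables on the union of their
        # variables, then let numpy broadcasting do the multiplication.
        all_vars = self.variables + tuple(v for v in other.variables
                                          if v not in self.variables)
        def expand(f):
            # Reorder f's axes to follow all_vars, inserting size-1
            # axes for the variables f does not mention.
            shape = [f.values.shape[f.variables.index(v)]
                     if v in f.variables else 1 for v in all_vars]
            src = [f.variables.index(v) for v in all_vars if v in f.variables]
            return np.moveaxis(f.values, src, range(len(src))).reshape(shape)
        return Factor(all_vars, expand(self) * expand(other))

# Example: the product of f(A, B) and g(B, C) is a factor over (A, B, C).
f = Factor(['A', 'B'], np.random.rand(2, 3))
g = Factor(['B', 'C'], np.random.rand(3, 4))
h = f * g
print(h.variables, h.values.shape)   # ('A', 'B', 'C') (2, 3, 4)
```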

A Streamlined Workflow for Scientific Reproducibility and Efficiency

As an undergraduate researcher, I was often unnecessarily bogged down by my poor workflow. I consistently programmed "one off" scripts which sometimes turned into an important component of my research, and I was frequently frustrated by how difficult things were to use. My workflow improved a little when I got to grad school, but early this semester I finally decided to come up with a workflow that I could be happy with. After collaborating with another student on a project and having a really hard time integrating with his code, I decided I needed to come up with a good set of practices for my own sanity and for the sake of future collaborators as well. While there is an initial upfront cost to using good programming principles, the long-term benefits will usually far outweigh it. In this blog post, I'll summarize some key pieces of my new workflow. If you are struggling with some of the problems that I mentioned then I hope ...