Posts

Showing posts from 2017

A Streamlined Workflow for Scientific Reproducibility and Efficiency

As an undergraduate researcher, I was often unnecessarily bogged down by my poor workflow. I consistently wrote "one off" scripts that sometimes turned into important components of my research, and I was frequently frustrated by how difficult my own code was to use and reuse. My workflow improved a little when I got to grad school, but early this semester I finally decided to come up with a workflow I could be happy with. After collaborating with another student on a project and having a really hard time integrating with his code, I decided I needed a good set of practices, for my own sanity and for the sake of future collaborators as well. While there is an upfront cost to adopting good programming principles, the long-term benefits usually far outweigh it. In this blog post, I'll summarize some key pieces of my new workflow. If you are struggling with some of the problems I mentioned, then I hope you will find…

Kronecker Product for Matrices with Special Structure Using LinearOperators

I have been working with very large matrices for my research lately; some of them are sparse, while others are dense but have special structure that can be exploited to represent them efficiently. I have also been constructing enormous matrices from these smaller building blocks using the Kronecker product. Both numpy and scipy have built-in functions for computing the Kronecker product of two matrices, but they require the input matrices to be explicitly represented (as a dense numpy array or a sparse scipy matrix), so they are not compatible with the special-structure matrices I've been dealing with. I recently solved this problem, and that is the topic of this blog post.

An Example

As an example to make things more concrete, consider the two matrices shown below: $$ A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad B = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & \cdots \end{bmatrix} $$ …
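The excerpt cuts off before the solution, but the core idea can be sketched with scipy's LinearOperator interface. The snippet below is a minimal sketch of that idea rather than the post's actual code (the name kron_operator is mine); it represents $A \otimes B$ implicitly via the identity $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^\top)$, so neither factor ever has to be materialized:

```python
# A minimal sketch (not the post's exact code): a matrix-free Kronecker product
# built from two LinearOperators, using (A ⊗ B) vec(X) = vec(B X A^T) with
# column-major vectorization.
import numpy as np
from scipy.sparse.linalg import LinearOperator, aslinearoperator


def kron_operator(A, B):
    """Return a LinearOperator representing A ⊗ B without forming it."""
    A = aslinearoperator(A)
    B = aslinearoperator(B)
    p, q = A.shape
    r, s = B.shape

    def matvec(x):
        # Un-vectorize x (length q*s) into an s-by-q matrix, column-major.
        X = np.asarray(x).reshape((s, q), order='F')
        BX = B.matmat(X)          # r x q
        Z = A.matmat(BX.T).T      # r x p, equal to B X A^T
        return Z.reshape(-1, order='F')

    dtype = np.result_type(A.dtype, B.dtype)
    return LinearOperator((p * r, q * s), matvec=matvec, dtype=dtype)


# Quick sanity check against the explicit Kronecker product.
A_dense = np.random.rand(4, 3)
B_dense = np.random.rand(5, 4)
x = np.random.rand(3 * 4)
assert np.allclose(kron_operator(A_dense, B_dense).dot(x),
                   np.kron(A_dense, B_dense).dot(x))
```

Because the operator only needs matvec/matmat from its factors, A and B can themselves be implicit (sparse, low-rank, FFT-based, and so on), which is exactly the situation described above.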

Modeling and Predicting the Outcome of a Sequence of Games

Here's something I thought of today that I think is pretty interesting. Consider the following toy problem: suppose Alice and Bob play a sequence of games, and the probability that Alice wins game $i$ is given by $p_i$. Further, assume $p_1 \sim U[0,1]$ is uniform and that $ p_{i+1} \sim \mathrm{Beta}(\alpha p_i + \beta, \alpha (1-p_i) + \beta) $ for known $ (\alpha, \beta) $. We are unable to directly observe $p_i$, but we do observe the outcome of each game, $ o_i $, for $ 1 \leq i \leq k $. The task is to estimate $ p_{k+1} $ given the evidence $ o_1, \dots, o_k $. The figure below shows how Alice's win probability might evolve over time; this particular figure was generated with $ \alpha = 1000 $ and $ \beta = 5 $. We can model this situation with a modified Hidden Markov Model. The graph below shows how the random variables are related in the form of a directed graphical model. There are very efficient algorithms for doing inference and learning on Hidden Markov Models…
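The excerpt ends before the inference details, but the generative model itself is easy to write down in code. Below is a small sketch of my own (not the post's code): it simulates the chain of win probabilities and outcomes, then approximates $E[p_{k+1} \mid o_1, \dots, o_k]$ with a bootstrap particle filter, which is just one possible way to do the inference and may differ from the HMM-style approach the post goes on to describe:

```python
# A hypothetical sketch of the toy model (not the post's code): simulate the
# chain p_1..p_k with game outcomes, then approximate E[p_{k+1} | o_1..o_k]
# with a simple bootstrap particle filter.
import numpy as np


def simulate_games(k, alpha=1000.0, beta=5.0, rng=None):
    """Sample win probabilities p_1..p_k and outcomes o_1..o_k (1 = Alice wins)."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.empty(k)
    o = np.empty(k, dtype=int)
    p[0] = rng.uniform()                     # p_1 ~ U[0, 1]
    for i in range(k):
        o[i] = rng.random() < p[i]           # o_i ~ Bernoulli(p_i)
        if i + 1 < k:                        # p_{i+1} | p_i ~ Beta(...)
            p[i + 1] = rng.beta(alpha * p[i] + beta, alpha * (1 - p[i]) + beta)
    return p, o


def estimate_next_p(o, alpha=1000.0, beta=5.0, n_particles=5000, rng=None):
    """Particle-filter approximation of E[p_{k+1} | o_1, ..., o_k]."""
    rng = np.random.default_rng() if rng is None else rng
    particles = rng.uniform(size=n_particles)         # samples of p_1
    for obs in o:
        w = particles if obs else 1.0 - particles     # likelihood of o_i given p_i
        w = w / w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)   # resample
        # Propagate survivors through the Beta transition to get p_{i+1} samples.
        particles = rng.beta(alpha * particles[idx] + beta,
                             alpha * (1 - particles[idx]) + beta)
    return particles.mean()


p, o = simulate_games(k=51)
est = estimate_next_p(o[:50])   # condition on the first 50 outcomes
print(est, p[50])               # approximation of p_51 vs. its true value
```

The particle mean only gives a rough estimate; with nothing but binary outcomes to condition on, the posterior over $p_{k+1}$ stays fairly wide, which is part of what makes the problem interesting.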

Efficiently Remove Duplicate Rows from a 2D Numpy Array

Suppose you have a 2d numpy array and you want to remove duplicate rows (or columns). In this blog post, I'll show you a trick you can use to do this more efficiently than using np.unique(A, axis=0). The algorithm has time complexity $ O(\max(n \log{n}, n m)) $ for an $ n \times m $ matrix and works almost surely. By "almost surely" I mean that it is a randomized algorithm that works correctly with probability $1$. To find the unique rows of a matrix $A$, the algorithm generates a random vector $x$ of real numbers, computes the dot product $ y = A x $, and then analyzes the unique elements of $y$. The indices of the unique elements of $y$ are the same as the indices of the unique rows of $A$ with probability $1$. Below is an example to demonstrate how it works:

> import numpy as np
> A = np.random.randint(low=0, high=3, size=(10, 2))
> A
array([[2, 1],
       [1, 2],
       [0, 0],
       [2, 2],
       [1, 2],
       [0, 0],
       [0, 2],
       [2, 1],
       …
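The excerpt cuts off mid-example, so here is a short self-contained sketch of the trick as described above (the function name unique_rows and the exact details are my own and may differ from the post's code):

```python
# A minimal sketch of the random-projection trick described above
# (my own illustration; the post's exact code may differ).
import numpy as np


def unique_rows(A, return_index=False):
    """Remove duplicate rows of a 2-D array in O(max(n log n, n m)) time."""
    A = np.asarray(A)
    x = np.random.rand(A.shape[1])             # random real vector
    y = A @ x                                  # one scalar "hash" per row: O(n m)
    _, idx = np.unique(y, return_index=True)   # sort the n scalars: O(n log n)
    idx.sort()                                 # keep surviving rows in original order
    return (A[idx], idx) if return_index else A[idx]


A = np.array([[2, 1], [1, 2], [0, 0], [2, 2], [1, 2], [0, 0], [0, 2], [2, 1]])
print(unique_rows(A))
# Duplicate rows collapse to the same value of y, while distinct rows map to
# distinct dot products with probability 1.
```

Duplicate columns can be handled the same way by passing A.T. In floating point a collision between two distinct rows is technically possible, but the "almost surely" claim rests on the fact that, with a continuous random $x$, two distinct rows produce the same dot product with probability $0$.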

Context-Dependent Pitch Prediction with Neural Networks

This semester I took the machine learning class at UMass, and for my final project I developed models for predicting characteristics of pitches based on the context they were thrown in. The context consists of the relevant information available before the pitch is thrown, such as the identities of the batter and pitcher, the batter's height and stance, the pitcher's handedness, the current count, and many others. In particular, I looked into modeling the distribution over pitch types and locations. This problem is challenging primarily because for a particular context (a specific setting of the different features that make up the context), there is usually very little data available for pitches thrown in that context. In general, for each context feature we include, we have to split up our data into $k$ groups, where $k$ is the number of values that the context feature can take on (for batter stance $k=2$, for count $k=12$, etc.). Thus, developing models by partitioning…
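To make the data-sparsity point concrete, here is a hypothetical back-of-the-envelope count of context cells. Only the stance ($k=2$) and count ($k=12$) cardinalities come from the text above; the other numbers are assumptions chosen purely for illustration:

```python
# Hypothetical illustration of how quickly context features fragment the data.
# Stance (2) and count (12) come from the post; the other cardinalities are
# assumed round numbers, not figures from the project.
import numpy as np

cardinalities = {
    "batter stance": 2,
    "count": 12,
    "pitcher handedness": 2,
    "batter identity": 500,   # assumed rough number of distinct batters
}

n_contexts = int(np.prod(list(cardinalities.values())))
print(n_contexts)  # 24000 distinct contexts

# Even a dataset with a few hundred thousand pitches averages only tens of
# pitches per context, and many contexts never appear at all, which is why
# naively partitioning the data by context breaks down.
```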