Showing posts from 2015

A Neat Random Walk Problem

As many of the people that read this blog probably know, I enjoy inventing math problems, especially problems that are probabilistic in nature, and then working out the math to solve that problem. In this blog post, I will introduce one problem that I solved that I think is particularly elegant: You are randomly walking along the x-axis starting at \( x=0 \). When you are at \( x = n \), you move to \( x = n-1 \) with probability \( p = \frac{n}{n+1} \) and to \( x = n+1 \) with probability \( p = \frac{1}{n+1} \). What is the expected number of steps until you return to \( x = 0 \) for the first time? This problem differs from standard random walk problems because the probability of moving left and right along the x-axis depends on your current location. In this problem, there is a force that is pulling you towards \( x = 0 \), and the further away you get, the more of an effect this force has on you. Show/Hide Solution So how do we go about solving this problem? When

Modeling Baseball At Bats

This semester in my mathematics capstone course, students had the chance to develop mathematical models to describe real world research problems.  I took this as an opportunity to research and develop probabilistic models that can be used to predict the outcome distribution of a baseball at bat.  My hope was that I could apply my findings from this research project to my side project of Beat the Streak.  I learned a lot about mathematical modeling in this class, and explored a variety of techniques for making predictions about baseball at bats.  At the end of the class, we wrote a report that introduces the models we came up with, and compares the quality of the predictions they produce.  I have made this report available here .

Evaluating the Sum of a Geometric Sequence

This is a short blog post that shows you an easy and intuitive way to derive the formula for the summation of an infinite geometric sequence. Let \( 0 \leq p < 1 \), and let \( a \) be some constant; then we wish to find the value of \( x \) such that $$ x = \sum_{k=0}^\infty a p^k $$ Writing out the first few terms of the summation, we get: $$ x = a + a p + a p^2 + a p^3 + \dots $$ Rewriting the equation by factoring out a \( p \) from every term except the first, we get: $$ x = a + p (a + a p + a p^2 + \dots) $$ Notice that the expression in parenthesis is exactly how \( x \) is defined. Replacing the expression with \( x \) leaves us with: $$ x = a + p x $$ Solving the equation for \( x \) yields $$ x = \frac{a}{1-p} $$ Just remember this simple derivation and you will never have to look up the formula for evaluating the sum of an infinite geometric sequence ever again!

Beat the Streak: Day Three

In order to maximize your probability of beating the streak, you should (1) predict the probability that a batter will get a hit in a given game given the game parameters and (2) determine if it's worthwhile to risk your current streak in order to possibly improve it by 1 or 2 games. In this blog post, I outline my solution to (2). In previous blog posts, I've hinted at what I do to solve (1) and will continue that discussion in a later blog post. Motivation When your current streak is short, the optimal strategy is to pick the best player every day, regardless of how likely their probability of getting a hit is (to a certain extent). However, as your streak grows, you have an important decision to consider: is it better to pick the best player today and possibly lose your current streak, or skip picking a player and instead maintain your current streak. Naturally, this decision should be guided by the players probability of getting a hit, as well as the distribution of the

Automatically Finding Recurrence Relations from Integer Sequences

In a lot of the recreational math and computer science problems I solve, I try to derive a recurrence relation that describes the situation. These recurrence relations pop up pretty often in counting problems (how many ways can you tile a 2xn grid with 2x1 tiles?) In some cases, I am unable to derive a recurrence relation directly from the problem statement, but still suspect that there exists some recurrence relation. In these cases, I end up solving the problem usin g brute force methods for small input sizes and analyzing/extrapolating the pattern. However, when the relation is obscure, it becomes hard to find. In order to make this process easier in the future, I created a program that can automatically find recurrence relations for arbitrary integer sequences. To use it, simply type in the first few terms of a sequence you are interested in, and it will try to find a recurrence relation for that sequence. There are some constraints on the allowable inputs/outputs which I di

Beat the Streak: Day Two

With the mid way point of the MLB season fast approaching, it is starting to become apparent that I need to finish my program if I want to have any shot at compiling a respectable streak this season. In this blog post, I will talk about one formula that I am relying heavily on in my tool. This formula is based on Bill James' Log5 formula but is modified for non-binary output. In general, the Log5 formula is used to approximate the probability that team A will defeat defeat team B given each of their individual winning percentages. It can be modified to handle batter vs. pitcher matchups as well. There is no mathematically rigorous derivation of the formula, so using it in my tool is a little bit iffy to say the least. However, I think using it will yield a better approximation to batter vs. pitcher matchups than not using it. In it's simplest form, the approximate batting average of batter B against pitcher P is given by: $$ AVG_{B v P} = \frac{\frac{AVG_B \cdot AVG_P}{AVG

Non-Numeric Matrices 2

In my last blog post , I showed how matrices and matrix multiplication in particular can make sense for more than just numbers. In fact, I used a boolean matrix construct to find the connected components of a graph. In this blog post I will extend that idea to two more types of matrices: "Distance" matrices to find the shortest distances between any two nodes Path matrices to find the (shortest) list of actions that lead between any two states The first type of matrix that I will talk about is what I call a 'Distance' matrix. This is not to be confused with how distance matrices are usually defined in mathematics and computer science (pairwise distance between points). Instead, my definition is more closely related to an adjacency matrix, where connected nodes have non-zero entries (given by the weight, or distance between the nodes), and unconnected nodes have a distance of 'X' to denote that there is no edge between the nodes. Using this newly de

Non-Numeric Matrices

Matrices are a fundamental part of linear algebra, and we typically think of a matrix as a 2 dimensional array of numbers. In this post, I will show that matrices can be used in other ways too, and I will show that these new types of matrices can be used to solve real problems. Adjacency Matrix If you are already familiar with Adjacency matrices, I recommend you skip ahead to the next section; otherwise, read on! Matrices are especially important in graph theory, as they can be used to model the structure of a graph. A graph is a set of vertices and edges, and the edges indicate how the vertices are connected. Graphs can be used to model a number of real world things: the internet, social networks, road systems, etc. One common representation of a graph is an adjacency matrix: For a matrix \( A = [a_{ij}] \), \( a_{ij} = 1 \) if node i is connected to node j and \( a_{ij} = 0 \) otherwise (Note I implicitly implied there is some ordering on the nodes). One interesting prop

Beat the Streak: Day One

In 2001, released the Beat the Streak challenge : A challenge to fans to essentially beat the all time best hitting streak established by Joe DiMaggio in 1941. In that season, Joe DiMaggio had a 56 game hitting streak. The longest hitting streak by any MLB player since that is 45 games. When the challenge was first introduced, fans were asked to pick a (possibly different) player every day who they expect to get a hit. If that player earned a hit, then their streak would increase by 1. Otherwise, it would go back to 0. The first fan to reach a streak of 57 would win the grand prize. Since it's introduction in 2001, the grand prize has grown from $100,000 to $5,600,000, and a number of new features have been added to improve the odds for the fans. Yet, no one has even broken the 50 game streak barrier, let alone win the grand prize. I have been a casual beat the streak player for a few seasons, but I never really took it too seriously. Last season, I decided to use my back

Memoization in Haskell

When I was first learning Haskell, there were some imperative design patterns that I had trouble converting to functional style Haskell. Memoization is the first thing that comes to mind. I use memoization so much that it was pretty much trivial do in an imperative language like Java. However, I could not figure out how to do it in Haskell. In this blog post, I will discuss how you can memoize your functions in Haskell to boost performance in case you're ever in a situation where you want to use it. It turns out that memoization in Haskell is just as easy, if not easier, than memoization in an imperative language. Before I talk about memoization in Haskell, I will discuss the traditional approach to memoizing a recursive function in an imperative language. It essentially boils down to these steps: Query lookup table to see if result is computed; If so, return it Compute result recursively Store result in lookup table Return result For example, consider the numbers in Catal

Markov Chains and Expected Value

A few weeks ago, I was using a Markov Chain as a model for a Project Euler problem, and I learned about how to use the transition matrix to find the expected number of steps to reach a certain state. In this post, I will derive the linear system that will help answer that question, and will work out specific an example using a 1-dimensional random walk as the model. Terminology Before I begin the derivation, let me define some terms. The transition matrix is an \( n \) by \( n \) matrix that describes how you can move from one state to the next. The rows of the transition matrix must always sum to 1. States can either be transient or absorbing . An absorbing state is one that cannot be left once reached (it transitions to itself with probability 1). A transient state is a state that is not an absorbing state. In many problems, it is of general interest to compute the expected number of steps to reach an absorbing state (from some start state). Derivation  Let \( p_

Investigating Bezout's Identity for Fibonacci Numbers

The Fibonacci numbers: \( 1,1,2,3,5,8,13,21,... \) have a number of interesting properties. A few days ago, I discovered and proved one such property that I find particularly interesting. It turns out that successive Fibonacci numbers are always relatively prime (I will prove this later). Further, Bezout's lemma guarantees the existence of integers \( p \) and \( q \) such that \( p F_n + q F_{n+1} = 1 \), where \( F_n \) denotes the \( n^{th} \) number in the Fibonacci sequence. In this blog post, I will find a general formula for \( p \) and \( q \). There is a simple result and an elegant proof of that result which I will demonstrate. Before we find a general formula for \( p \) and \( q \), let me first prove that \( F_n \) and \( F_{n+1} \) are always relatively prime: Assume for contradiction that \( F_n \) and \( F_{n+1} \) have a factor, \( d > 1 \) that they share. Then \( F_n = d k_1 \) and \( F_{n+1} = d k_2 \) for some integers \( k_1 \) and \( k_2 \).

Interpolation Search from Any Distribution

In my last blog post , I introduced interpolation search, a searching algorithm that can beat binary search asymptotically. To introduce the concept of interpolation search, I made a simplifying assumption that the underlying data came from a uniform distribution. It is still possible to use interpolation search on non-uniform distributions, as long as you have the CDF for that distribution. Background Info A cumulative distribution function is a function that tells you the probability that a random item from the distribution is less than or equal to a specified value. $$ F(k) = P(X \leq k) $$ Thus, $F(k)$ can take on values in the range $[0,1]$ and it is non-descending. For a discrete uniform distribution of integers in the range $[1,10]$, the CDF looks like: And for a continuous normal distribution, the CDF looks like: How to get the CDF There are a number of options to get a handle on the CDF. You can find information online about your distribution, do a one-pass

Interpolation Search Explained

If you have taken an introductory computer science course, you've probably seen the binary search algorithm - an algorithm to efficiently find the index of an item in a sorted array, if it exists. You might not have heard of interpolation search, however. Interpolation search is an alternative to binary search that utilizes information about the underlying distribution of data to be searched. By using this additional information, interpolation search can be as fast as $O(log(log(n)))$, where $n$ is the size of the array. In this post I am going to talk about two implementations of interpolation search. First, I will discuss the algorithm as described on Wikepedia, then I will describe my own version of the algorithm that has a better worst case time complexity than the first and performs better in practice than the traditional binary search and interpolation search algorithms. The Basic Idea Interpolation search models how humans search a dictionary better than a binary search,

An Efficient Communication Scheme

In this blog post, I'll be talking about an efficient communication scheme that can be leveraged in parallel programs for broadcast or reduce operations.  The idea is very simple, but I thought it was pretty cool when I learned it, so I thought I would pass along the information to you.  Let's abstract away the parallel programming aspect and analyze the following question: You have a message that you want to share with a large group of people, but you can only share the message with one person at a time. How should you distribute this message? Setting up the Problem To clear up some possible ambiguities, I want to emphasize that it takes one time step to share a message regardless of who you share it with, and anybody can share the message with anybody else (granted they've received it first).  I will identify each person in the group by a unique integer in ${0,1,...,n}$, and I will assign myself the id $0$. This is similar to how we might structure a distributed prog

Generating Random Numbers from Arbitrary Distributions

Generating random numbers is an important aspect of programming, and just about every programming language offers some source of (pseudo) randomness, usually in the form of a uniform distribution. For example, java has a builtin function, <code class="java">Math.random()</code> that returns a random floating point number in the range $[0,1)$. In some instances, we would like to have random numbers coming from a different distribution, but most programming languages don't offer this functionality (except very specialized languages such as R). In the rest of this blog post, I am going to explain how you can use a uniformly distributed random number to generate a number from some other distribution. Special Case: The Normal Distribution If you would like to get numbers from a binomial distribution, or are willing to accept an approximation for the normal distribution in favor of simplicity, then you may want to use this technique. If you add together some num

Triangles Containing the Origin

You may have found this site by searching for information regarding Project Euler Problem 184. This is not a solution to that problem. In fact, it really bothers me when I find people who post their solutions to these problems online, especially the higher level ones. In this blog post, I talk about a simple problem from probability that was motivated from this Project Euler problem, but the solution to this problem is not likely to help you solve that one. Pick three points uniformly at random along the circumference of a unit circle centered at the origin. What is the probability that the triangle connecting these three points contains the origin? Like I said, this problem is not as difficult as the problems that I usually write about, but I decided to write a blog post about it for two main reasons: I wanted to create the animation/simulation for this problem I eventually want to extend this problem to explore all convex polygons instead of just triangles (which is a more diff