Efficiently Remove Duplicate Rows from a 2D Numpy Array
Suppose you have a 2D numpy array and you want to remove duplicate rows (or columns). In this blog post, I'll show you a trick you can use to do this more efficiently than using
np.unique(A, axis=0)
. This algorithm has time complexity $ O(\max(n \log{n}, n m)) $ for an $ n \times m $ matrix, and works almost surely. By "almost surely" I mean that it is a randomized algorithm that gives the correct answer with probability $1$. To find the unique rows of a matrix $A$, the algorithm generates a random vector $x$ of real numbers, computes the dot product $ y = A x $, and then analyzes the unique elements of $y$: with probability $1$, the indices of the unique elements of $y$ are exactly the indices of the unique rows of $A$. Below is an example to demonstrate how it works:
> import numpy as np
> A = np.random.randint(low=0, high=3, size=(10, 2))
> A
array([[2, 1],
       [1, 2],
       [0, 0],
       [2, 2],
       [1, 2],
       [0, 0],
       [0, 2],
       [2, 1],
       [1, 0],
       [1, 2]])
> x = np.random.rand(A.shape[1])
> y = A.dot(x)
> y
array([1.23529869, 1.74609273, 0.        , 1.98759428, 1.74609273,
       0.        , 1.50459119, 1.23529869, 0.24150155, 1.74609273])
> unique, index = np.unique(y, return_index=True)
> A[index]
array([[0, 0],
       [1, 0],
       [2, 1],
       [0, 2],
       [1, 2],
       [2, 2]])
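The steps above can be packaged into a small reusable function. This is a minimal sketch of the trick as described in this post; `unique_rows` is a name I chose for the sketch, not a NumPy function:

```python
import numpy as np

def unique_rows(A, rng=None):
    """Return the unique rows of a 2-D array via random projection.

    Two rows are equal iff their dot products with a random vector are
    equal (with probability 1), so deduplicating the projections
    deduplicates the rows.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = rng.random(A.shape[1])              # random vector drawn from U[0, 1)
    y = A.dot(x)                            # one scalar per row
    _, index = np.unique(y, return_index=True)
    return A[index]
```

Note that, as in the example above, the returned rows come back in the sort order of their projections, not in their original order.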
This method can easily be modified to remove duplicate columns by letting $ x $ be a random vector of size $n$ (as opposed to $m$), computing the product $ y = x A $, and indexing with $ A[:, index] $.

So why does this algorithm work? It should be clear that if two rows of $ A $ are equal, then the corresponding two entries of $ y = A x $ are also equal, no matter what $x$ is. What's not as obvious is the converse: if two entries of $ y = A x $ are equal, does that imply that the corresponding rows of $A$ are equal? Of course, if $x$ is chosen adversarially then this claim would not hold, because an adversary can always select an $x$ that breaks the algorithm by solving a linear system. For example, if they wanted $ \begin{bmatrix} 2 & 1 \end{bmatrix} $ and $ \begin{bmatrix} 1 & 2 \end{bmatrix} $ to be incorrectly labeled as the same row, they could solve the linear system $ \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} c \\ c \end{bmatrix} $; for instance, $ x = \begin{bmatrix} 1 & 1 \end{bmatrix}^T $ maps both rows to $3$. However, if we generate $x$ by taking $m$ independent random samples from a $ U[0,1] $ distribution, then the claim holds with probability $1$.

While it may not be obvious how to prove this, it should hopefully be intuitive that it is true. The intuition follows from the well-known fact that the probability that a continuous random variable takes on any particular value is $0$. Let $ r_1 \neq r_2 $ be two different rows. For this algorithm to incorrectly label them as the same, the random $x$ would have to satisfy $ r_1 \cdot x - r_2 \cdot x = (r_1 - r_2) \cdot x = 0 $. Since $ (r_1 - r_2) \cdot x $ is a continuous random variable, we can apply the fact stated earlier: this event occurs with probability zero.
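As a sketch of the column variant just described (the example matrix here is one I made up so that two columns collide):

```python
import numpy as np

# Duplicate-column removal: project each column onto a random vector
# of length n (the number of rows) and deduplicate the projections.
A = np.array([[2, 1, 2, 0],
              [1, 2, 1, 0],
              [0, 0, 0, 0]])          # columns 0 and 2 are duplicates

x = np.random.rand(A.shape[0])        # size n, not m
y = x.dot(A)                          # one scalar per column
_, index = np.unique(y, return_index=True)
B = A[:, index]                       # unique columns of A
```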
One subtle detail that I should point out: since computers use a floating point representation for real numbers, and there are only finitely many possible floating point numbers, this event, which in theory occurs with probability $0$, actually occurs with some nonzero probability. However, that probability is so small that we can safely ignore it. I will also point out that the method described in this blog post works just as well with scipy sparse matrices. In fact, as long as the matrix supports row indexing (such as
scipy.sparse.csr_matrix
), we don't have to change the code above at all, except in the construction of $A$!
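Here is a minimal sketch of the sparse version, assuming SciPy is installed; `csr_matrix` returns a dense 1-D array from `.dot` with a 1-D vector and supports the row indexing used here:

```python
import numpy as np
import scipy.sparse as sp

# Same trick on a sparse matrix: only the construction of A changes.
dense = np.array([[2, 1],
                  [1, 2],
                  [0, 0],
                  [1, 2],
                  [0, 0]])
A = sp.csr_matrix(dense)

x = np.random.rand(A.shape[1])
y = A.dot(x)                          # dense 1-D ndarray of row projections
_, index = np.unique(y, return_index=True)
unique = A[index]                     # row indexing yields a csr_matrix
```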