Efficiently Remove Duplicate Rows from a 2D Numpy Array

Suppose you have a 2d numpy array and you want to remove duplicate rows (or columns).  In this blog post, I'll show you a trick you can use to do this more efficiently than using np.unique(A, axis=0).  This algorithm has time complexity $ O(\max(n \log{n}, n m)) $ for an $ n \times m $ matrix, and works almost surely.  By "almost surely" I mean that it is a randomized algorithm that works correctly with probability $1$.  To find the unique rows of a matrix $A$, the algorithm works by generating a random vector $x$ of real numbers, computing the dot product $ y = A x $, then analyzing the unique elements of $y$.  The indices of unique elements in $y$ is the same as the unique row indices in $A$ with probability $1$.  Below is an example to demonstrate how it works:

> import numpy as np
> A = np.random.randint(low=0, high=3, size=(10, 2))
> A
array([[2, 1],
       [1, 2],
       [0, 0],
       [2, 2],
       [1, 2],
       [0, 0],
       [0, 2],
       [2, 1],
       [1, 0],
       [1, 2]])
> x = np.random.rand(A.shape[1])
> y = A.dot(x)
> y
array([ 1.23529869,  1.74609273,  0.        ,  1.98759428,  1.74609273,
        0.        ,  1.50459119,  1.23529869,  0.24150155,  1.74609273])
> unique, index = np.unique(y, return_index=True)
> A[index]
array([[0, 0],
       [1, 0],
       [2, 1],
       [0, 2],
       [1, 2],
       [2, 2]])

This method can easily be modified to removing duplicate columns by letting $ x $ be a random vector of size $n$ (as opposed to $m$), then computing the dot product $ y = x A $ and indexing with $ A[:, index] $.  So why does this algorithm work?  It should be clear that if the two rows in $ A $ are equal then the same two rows in $ y = A x $ will also be equal no matter what $x$ is.  What's not as obvious is the converse: if two rows in $ y = A x $ are equal, does that imply that those two rows in $A$ are equal?  Of course, if $x$ is chosen adversarially then this claim would not hold because the adversary can always select $x$ that breaks the algorithm by solving a linear system.  For example, if they wanted $ \begin{bmatrix} 2 & 1 \end{bmatrix} $ and $ \begin{bmatrix} 1 & 2 \end{bmatrix} $ to be incorrectly labeled as the same rows, then they could solve the linear system $ \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} c \\ c \end{bmatrix} $.  However, if we generate $x$ by taking $m$ independent random samples from a $ U[0,1] $ distribution, then the claim holds with probability $1$.  While it may not be obvious how to prove this, it should hopefully be intuitive that it is true.  The intuition follows from the well known fact that the probability that a continuous random variable takes on a particular value is $0$.  Let $ r_1 \neq r_2 $ be two different rows, then in order for this algorithm to incorrectly label them as the same, the random $x$ would have to satisfy $ r_1 \cdot x - r_2 \cdot x = (r_1 - r_2) \cdot x = 0 $.  Since $ (r_1 - r_2) \cdot x $ is a continuous random variable, we can apply the well known fact stated earlier that this event occurs with probability zero.  One subtle detail that I should point out is that since computers use a floating point representation for real numbers and there are only a finite number of possible floating point numbers, this event that in theory occurs with probability $0$, actually occurs with nonzero probability.  However, the probability is so small and the event is so unlikely that we can just ignore it.  I will also point out that the method described in this blog post works just as well with scipy sparse matrices.  In fact, assuming the matrix supports row indexing (such as scipy.sparse.csr_matrix), then we don't have do change the code above at all except for in the construction of $A$!

Comments

  1. Silahkan di kunjungi ya kawan-kawan 100% Memuaskan
    > Hoki anda ada di sini <
    Bandar Q Online Terpercaya dan Teraman di GUNUNGPOKER
    Link daftar : http://bandaraduq.com/Register.aspx?lang=id
    BBM : 56978317

    SEMUA GAME HANYA PAKAI 1 USER ID : Poker, Domino QQ, Capsa Susun, Adu Q, Bandar Poker,

    Segera daftarkan userid anda di GUNUNGPOKER
    Promo Terbaru dari GUNUNGPOKER
    - Minimal DEPOSIT & WITHDRAW Rp 20.000,-
    - Tersedia 7 game dalam 1 USER ID
    - BONUS Turnover 0.5%
    - BONUS Referral 20%

    UNTUK INFORMASI SELANJUTNYA BISA HUB KAMI DI :
    LIVECHAT GUNUNGPOKER 24 JAM ONLINE
    Fanspage FB : @agengunungpoker
    Pin BB : 56978317
    WA : +62812-7287-4416
    LINE : gunungpokercsr1
    WECHAT : gunungpokercsr1
    YM : gunungpokercsr1@yahoo.com
    Agen domino
    BandarQ
    Domino 99
    Domino QQ
    Agen Poker
    Bandar Poker
    Agen Judi QQ
    Judi Online
    Forum Judi Online

    hubungi kami di :
    Line : gunungpokercsr1
    Bbm : 56978317
    Wa : +6281272874416

    ReplyDelete




  2. Your good knowledge and kindness in playing with all the pieces were very useful. I don’t know what I would have done if I had not encountered such a step like this.


    UIPath Training in Bangalore

    ReplyDelete

Post a Comment

Popular posts from this blog

Multi-Core Programming with Java

Beat the Streak: Day Three