### Efficiently Remove Duplicate Rows from a 2D Numpy Array

Suppose you have a 2d numpy array and you want to remove duplicate rows (or columns). In this blog post, I'll show you a trick you can use to do this more efficiently than using

This method can easily be modified to removing duplicate columns by letting $ x $ be a random vector of size $n$ (as opposed to $m$), then computing the dot product $ y = x A $ and indexing with $ A[:, index] $. So why does this algorithm work? It should be clear that if the two rows in $ A $ are equal then the same two rows in $ y = A x $ will also be equal no matter what $x$ is. What's not as obvious is the converse: if two rows in $ y = A x $ are equal, does that imply that those two rows in $A$ are equal? Of course, if $x$ is chosen adversarially then this claim would not hold because the adversary can always select $x$ that breaks the algorithm by solving a linear system. For example, if they wanted $ \begin{bmatrix} 2 & 1 \end{bmatrix} $ and $ \begin{bmatrix} 1 & 2 \end{bmatrix} $ to be incorrectly labeled as the same rows, then they could solve the linear system $ \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} c \\ c \end{bmatrix} $. However, if we generate $x$ by taking $m$ independent random samples from a $ U[0,1] $ distribution, then the claim holds with probability $1$. While it may not be obvious how to prove this, it should hopefully be intuitive that it is true. The intuition follows from the well known fact that the probability that a continuous random variable takes on a particular value is $0$. Let $ r_1 \neq r_2 $ be two different rows, then in order for this algorithm to incorrectly label them as the same, the random $x$ would have to satisfy $ r_1 \cdot x - r_2 \cdot x = (r_1 - r_2) \cdot x = 0 $. Since $ (r_1 - r_2) \cdot x $ is a continuous random variable, we can apply the well known fact stated earlier that this event occurs with probability zero. One subtle detail that I should point out is that since computers use a floating point representation for real numbers and there are only a finite number of possible floating point numbers, this event that in theory occurs with probability $0$, actually occurs with nonzero probability. However, the probability is so small and the event is so unlikely that we can just ignore it. I will also point out that the method described in this blog post works just as well with scipy sparse matrices. In fact, assuming the matrix supports row indexing (such as

`np.unique(A, axis=0)`

. This algorithm has time complexity $ O(\max(n \log{n}, n m)) $ for an $ n \times m $ matrix, and works *almost surely*. By "almost surely" I mean that it is a randomized algorithm that works correctly with probability $1$. To find the unique rows of a matrix $A$, the algorithm works by generating a random vector $x$ of real numbers, computing the dot product $ y = A x $, then analyzing the unique elements of $y$. The indices of unique elements in $y$ is the same as the unique row indices in $A$ with probability $1$. Below is an example to demonstrate how it works:```
> import numpy as np
> A = np.random.randint(low=0, high=3, size=(10, 2))
> A
array([[2, 1],
[1, 2],
[0, 0],
[2, 2],
[1, 2],
[0, 0],
[0, 2],
[2, 1],
[1, 0],
[1, 2]])
> x = np.random.rand(A.shape[1])
> y = A.dot(x)
> y
array([ 1.23529869, 1.74609273, 0. , 1.98759428, 1.74609273,
0. , 1.50459119, 1.23529869, 0.24150155, 1.74609273])
> unique, index = np.unique(y, return_index=True)
> A[index]
array([[0, 0],
[1, 0],
[2, 1],
[0, 2],
[1, 2],
[2, 2]])
```

This method can easily be modified to removing duplicate columns by letting $ x $ be a random vector of size $n$ (as opposed to $m$), then computing the dot product $ y = x A $ and indexing with $ A[:, index] $. So why does this algorithm work? It should be clear that if the two rows in $ A $ are equal then the same two rows in $ y = A x $ will also be equal no matter what $x$ is. What's not as obvious is the converse: if two rows in $ y = A x $ are equal, does that imply that those two rows in $A$ are equal? Of course, if $x$ is chosen adversarially then this claim would not hold because the adversary can always select $x$ that breaks the algorithm by solving a linear system. For example, if they wanted $ \begin{bmatrix} 2 & 1 \end{bmatrix} $ and $ \begin{bmatrix} 1 & 2 \end{bmatrix} $ to be incorrectly labeled as the same rows, then they could solve the linear system $ \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} c \\ c \end{bmatrix} $. However, if we generate $x$ by taking $m$ independent random samples from a $ U[0,1] $ distribution, then the claim holds with probability $1$. While it may not be obvious how to prove this, it should hopefully be intuitive that it is true. The intuition follows from the well known fact that the probability that a continuous random variable takes on a particular value is $0$. Let $ r_1 \neq r_2 $ be two different rows, then in order for this algorithm to incorrectly label them as the same, the random $x$ would have to satisfy $ r_1 \cdot x - r_2 \cdot x = (r_1 - r_2) \cdot x = 0 $. Since $ (r_1 - r_2) \cdot x $ is a continuous random variable, we can apply the well known fact stated earlier that this event occurs with probability zero. One subtle detail that I should point out is that since computers use a floating point representation for real numbers and there are only a finite number of possible floating point numbers, this event that in theory occurs with probability $0$, actually occurs with nonzero probability. However, the probability is so small and the event is so unlikely that we can just ignore it. I will also point out that the method described in this blog post works just as well with scipy sparse matrices. In fact, assuming the matrix supports row indexing (such as

`scipy.sparse.csr_matrix`

), then we don't have do change the code above at all except for in the construction of $A$!

ReplyDeleteSilahkan di kunjungi ya kawan-kawan 100% Memuaskan

> Hoki anda ada di sini <

Bandar Q Online Terpercaya dan Teraman di GUNUNGPOKER

Link daftar : http://bandaraduq.com/Register.aspx?lang=id

BBM : 56978317

SEMUA GAME HANYA PAKAI 1 USER ID : Poker, Domino QQ, Capsa Susun, Adu Q, Bandar Poker,

Segera daftarkan userid anda di GUNUNGPOKER

Promo Terbaru dari GUNUNGPOKER

- Minimal DEPOSIT & WITHDRAW Rp 20.000,-

- Tersedia 7 game dalam 1 USER ID

- BONUS Turnover 0.5%

- BONUS Referral 20%

UNTUK INFORMASI SELANJUTNYA BISA HUB KAMI DI :

LIVECHAT GUNUNGPOKER 24 JAM ONLINE

Fanspage FB : @agengunungpoker

Pin BB : 56978317

WA : +62812-7287-4416

LINE : gunungpokercsr1

WECHAT : gunungpokercsr1

YM : gunungpokercsr1@yahoo.com

Agen domino

BandarQ

Domino 99

Domino QQ

Agen Poker

Bandar Poker

Agen Judi QQ

Judi Online

Forum Judi Online

hubungi kami di :

Line : gunungpokercsr1

Bbm : 56978317

Wa : +6281272874416

ReplyDeleteYour good knowledge and kindness in playing with all the pieces were very useful. I don’t know what I would have done if I had not encountered such a step like this.

UIPath Training in Bangalore