Efficiently Remove Duplicate Rows from a 2D Numpy Array
Suppose you have a 2d numpy array and you want to remove duplicate rows (or columns). In this blog post, I'll show you a trick you can use to do this more efficiently than using np.unique(A, axis=0) . This algorithm has time complexity $ O(\max(n \log{n}, n m)) $ for an $ n \times m $ matrix, and works almost surely . By "almost surely" I mean that it is a randomized algorithm that works correctly with probability $1$. To find the unique rows of a matrix $A$, the algorithm works by generating a random vector $x$ of real numbers, computing the dot product $ y = A x $, then analyzing the unique elements of $y$. The indices of unique elements in $y$ is the same as the unique row indices in $A$ with probability $1$. Below is an example to demonstrate how it works: > import numpy as np > A = np.random.randint(low=0, high=3, size=(10, 2)) > A array([[2, 1], [1, 2], [0, 0], [2, 2], [1, 2], [0, 0], [0, 2], ...