public class HyperplaneLSH extends LSH
Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).
In particular, this implementation implements a locality sensitive hashing scheme which maps high-dimensional vectors which are close together (with high probability) according to Cosine Similarity into the same buckets. Each LSH maps a vector onto one side or the other of a random hyperplane, thereby producing a single bit as the hash value. Multiple, independent, hashes can be run on the same input and aggregated together to form a more broad domain than a single bit.
For more information, see Charikar, Moses S.. (2002). "Similarity Estimation Techniques from Rounding Algorithms". Proceedings of the 34th Annual ACM Symposium on Theory of Computing 2002.
Constructor and Description |
---|
HyperplaneLSH(int dim,
org.apache.commons.math.random.RandomGenerator rg)
Locality sensitive hash that maps vectors onto 0,1 in such a way that colliding
vectors are "near" one another according to cosine similarity with high probability.
|
Modifier and Type | Method and Description |
---|---|
long |
apply(org.apache.commons.math.linear.RealVector vector)
Compute which side of the hyperplane that the parameter is on.
|
getDim, getRandomGenerator
public HyperplaneLSH(int dim, org.apache.commons.math.random.RandomGenerator rg)
Generally, multiple LSH are combined via repetition to increase the range of the hash function to the full set of longs. This repetition is accomplished by wrapping instances of the LSH in a LSHFamily, which does the combination.
The size of the hash family corresponds to the number of independent hashes you want to apply to the data. In a k-near neighbors style of searching, this corresponds to the number of neighbors you want to find (i.e. the number of vectors within a distance according to cosine similarity).
dim
- The dimension of the vectors which are to be hashedrg
- random number generator