public class L2PStableHash extends LSHFunc
Localitysensitive hashing (LSH) is a method of performing probabilistic dimension reduction of highdimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).In particular, this implementation implements a locality sensitive hashing scheme which maps highdimensional vectors which are close together (with high probability) according to the L2 distance metric into the same buckets. This implementation uses a 2stable distribution (a Gaussian distribution) in order to accomplish this. For more information, see Datar, M.; Immorlica, N.; Indyk, P.; Mirrokni, V.S. (2004). "LocalitySensitive Hashing Scheme Based on pStable Distributions". Proceedings of the Symposium on Computational Geometry.
Constructor and Description 

L2PStableHash(java.lang.String sDim,
java.lang.String sW,
java.lang.String sRepeat,
java.lang.String sNumHashes) 
L2PStableHash(java.lang.String sDim,
java.lang.String sW,
java.lang.String sRepeat,
java.lang.String sNumHashes,
java.lang.String sSeed)
Locality sensitive hash that maps vectors onto a long in such a way that colliding
vectors are "near" one another according to cosine similarity with high probability.

Modifier and Type  Method and Description 

protected LSHCreator 
createLSHCreator() 
protected int 
getDimension() 
exec, getSeed, outputSchema
allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
public L2PStableHash(java.lang.String sDim, java.lang.String sW, java.lang.String sRepeat, java.lang.String sNumHashes, java.lang.String sSeed)
Generally, multiple LSH are combined via repetition to increase the range of the hash function to the full set of longs. The number of functions which you want to internally repeat is specified by the sRepeat parameter. The size of the hash family corresponds to the number of independent hashes you want to apply to the data. In a knear neighbors style of searching, this corresponds to the number of neighbors you want to find (i.e. the number of vectors within a distance according to cosine similarity). This UDF, indeed all pstable LSH functions are parameterized with a quantization parameter (w or r in the literature , depending on where you look). Consider the following excerpt from Datar, M.; Immorlica, N.; Indyk, P.; Mirrokni, V.S. (2004). "LocalitySensitive Hashing Scheme Based on pStable Distributions". Proceedings of the Symposium on Computational Geometry.
Decreasing the width of the projection (w) decreases the probability of collision for any two points. Thus, it has the same effect as increasing k . As a result, we would like to set w as small as possible and in this way decrease the number of projections we need to make.In the literature, the quantization parameter (or width of the projection) is found empirically given a sample of the data and the likely threshold of for the metric. Tuning this parameter is very important for the performance of this algorithm.
Consider the following example where we input some 3dimensional points and a set of 3dimensional queries and find the nearest neighbors of the query points:
 Create a L2PStableHash of  3 dimensional data  projection width of 200  1 internal hashes  family of 5 hashes  with a seed of 0  This creates a bag of tuples:  lsh_id:Integer the family ID (in this case, 04)  hash:Long the hash define LSH datafu.pig.hash.lsh.L2PStableHash('3', '200', '1', '5', '0'); define METRIC datafu.pig.hash.lsh.metric.L2(); PTS = LOAD 'input' AS (dim1:double, dim2:double, dim3:double); hash the input points PTS_HASHED = foreach PTS generate TOTUPLE(dim1, dim2, dim3) as pt , FLATTEN(LSH(TOTUPLE(dim1, dim2, dim3)));  the hash family ID and the hash should group the input points into partitions PARTITIONS = group PTS_HASHED by (lsh_id, hash);  take in the query points and hash them QUERIES = LOAD 'queries' as (dim1:double, dim2:double, dim3:double); QUERIES_HASHED = foreach QUERIES generate TOTUPLE(dim1, dim2, dim3) as query_pt , FLATTEN(LSH(TOTUPLE(dim1, dim2, dim3))) ;  join the hashed query points with the (presumably larger) list of input data split by partitions QUERIES_W_PARTS = join QUERIES_HASHED by (lsh_id, hash), PARTITIONS by (group.$0, group.$1);  Now, use the appropriate METRIC UDF (in this case L2 (aka Euclidean) distance) to find the first point within  a parameterized threshold (in this case, 1000). It takes:  query_pt:Tuple the query point  threshold:Double the threshold, so that if the distance between the query point and a point  in the partition is less than this threshold, it returns the point (and stops searching)  partition:Bag The bag of tuples in the partition. tuples from NEAR_NEIGHBORS = foreach QUERIES_W_PARTS generate query_pt as query_pt , METRIC(query_pt, 1000, PTS_HASHED) as neighbor ; describe NEAR_NEIGHBORS; describe NEAR_NEIGHBORS;  {query_pt: (dim1: double,dim2: double,dim3: double)  ,neighbor: (pt: (dim1: double,dim2: double,dim3: double)  ,lsh::lsh_id: int  ,lsh::hash: long  )  }  project out the query and the matching point NEIGHBORS_PROJ = foreach NEAR_NEIGHBORS { generate query_pt as query_pt, neighbor.pt as matching_pts; };  Filter out the hashes which resulted in no matches NOT_NULL = filter NEIGHBORS_PROJ by SIZE(matching_pts) > 0;  group by the query NEIGHBORS_GRP = group NOT_NULL by query_pt; describe NEIGHBORS_GRP;  Generate the query, the number of matches and the bag of matching points NEIGHBOR_CNT = foreach NEIGHBORS_GRP{ MATCHING_PTS = foreach NOT_NULL generate FLATTEN(matching_pts); DIST_MATCHING_PTS = DISTINCT MATCHING_PTS; generate group as query_pt, COUNT(NOT_NULL), DIST_MATCHING_PTS; }; describe NEIGHBOR_CNT;  NEIGHBOR_CNT: {query_pt: (dim1: double,dim2: double,dim3: double)  ,long  ,DIST_MATCHING_PTS: { (matching_pts::dim1: double,matching_pts::dim2: double,matching_pts::dim3: double)  }  } STORE NEIGHBOR_CNT INTO 'neighbors';
sDim
 Dimension of the vectorssW
 A double representing the quantization parameter (also known as the projection width)sRepeat
 Number of internal repetitions (generally this should be 1 as the pstable hashes have a larger range than one bit)sNumHashes
 Size of the hash family (if you're looking for k near neighbors, this is the k)sSeed
 Seed to use when constructing LSH familypublic L2PStableHash(java.lang.String sDim, java.lang.String sW, java.lang.String sRepeat, java.lang.String sNumHashes)
protected LSHCreator createLSHCreator()
createLSHCreator
in class LSHFunc
protected int getDimension()
getDimension
in class LSHFunc