# Espy Yonder

maheshakyas' personal website

## To bridge the gaps - An overview of WSO2 Machine Learner

• technology

How effectively can we bring purely mathematical and merely conceptual models and mechanisms into the hands of the common man? Is it plausible to make the prerequisite technical know-how for dealing with such models uniform across a wide range of intellectual levels? Are these mechanisms capable of functioning with a variety of applications and data? To what extent can they scale? The advent of tools for emulating learning theory has transpired amidst these concerns.

Frameworks which perform computational learning tasks started to emerge at approximately the same time as the inception of Machine Learning as another branch of Artificial Intelligence. At the early stages, these frameworks basically emulated the exact algorithms with small samples of data, and almost all of them were CLI (Command Line Interface) based. The majority of those were used only in laboratories, in very specific applications, and were restricted to researchers. But our society has evolved much and progressed through a myriad of milestones along the path of artificial intelligence, and nowadays almost all organizations require some amount of data analytics and machine learning to keep abreast of the ever-growing technology. This trend has become exigent due to the massive amounts of data being collected and stored each and every day. With this inclination, the demand for tools that conduct the entire range of data analytics workflows has been elevated colossally.

As an open-source middleware organization, WSO2 has been able to establish its name among a vast number of communities which seek different types of solutions in a variety of application domains. A few years ago, WSO2 entered the arena of data analytics by providing a rich platform for batch analytics and real-time data analytics (WSO2 Data Analytics Server and WSO2 Complex Event Processor). In order to fortify the position that WSO2 has built around data analytics, WSO2 Machine Learner has been introduced with the intention of providing a rich, user-friendly, flexible and scalable predictive analytics platform which is compatible with standard machine learning methodologies as well as with the other products of WSO2. At the humble nascence of the first release of WSO2 Machine Learner (version 1.0.0), this discussion will provide an elaborative overview of its ambition of bridging the gaps that hinder prevailing predictive analytics tasks for various reasons: complexity, lack of skill in data analytics, inability to scale, real-time prediction requirements, the rapid growth of unanalyzed data stores, proprietary rights of software, etc.

In the subsequent sections, you will be able to understand how WSO2 Machine Learner (WSO2 ML from this point onwards) fits in and what sets it apart from other available machine learning tools and frameworks with respect to the following concerns.

1. Functionality
2. HCI aspects and usability
3. Scalability
4. License - Availability to users
5. Compatibility and use cases

## A quick guide to plotting with python and matplotlib - part 2

• miscellaneous

In the previous part of this guide, we discussed how to create the basic components of a plot. It’s time to move on to some more advanced material. This is a continuation of part 1 of this guide, and it assumes that you have read and grasped that part already.

# A bar chart grouped according to some variable

In this plot, a bar chart will be created, grouped based on a variable, with each group having its own similar components. Once you see what it actually looks like after plotting, it will be easy to analyze the code and learn from it.

import matplotlib.pyplot as plt
import numpy as np

# Defines the sizes of the axes
plt.axis([0,14, 0,140])

p1 = plt.Rectangle((0, 0), 0.1, 0.1, fc="crimson")
p2 = plt.Rectangle((0, 0), 0.1, 0.1, fc="burlywood")
p3 = plt.Rectangle((0, 0), 0.1, 0.1, fc="chartreuse")

plt.legend((p1, p2, p3), ('category 1','category 2','category 3'), loc='upper left')

# Defines labels for each group
labels = ['        ', '      1', '        ', '        ', '      2', '        ', '        ', '      4', '        ']

# Creates discrete values for x co-ordinates (widths of the bars)
x = np.array([0,1,2,5,6,7,10,11,12]) + 1

# Defines some random set of values for y (heights of the bars)
y = np.array([55.433, 55.855, 55.719, 55.433, 90.199, 93.563, 55.433, 104.807, 106.693])

# Replaces the names in the x-axis with labels
plt.xticks(x, labels)

# Creates the bar chart
plt.bar(left = x, height=y, color=['crimson', 'burlywood', 'chartreuse'])

plt.grid(which='both')
plt.xlabel('This is my x-axis')
plt.title("This is our title")

plt.show()

Now we’ll analyze each new individual component of this piece of code, starting from the beginning:

• plt.axis([0,14, 0,140]) sets the limits of x-axis from $$0$$ to $$14$$ and limits of y-axis from $$0$$ to $$140$$
• labels = [' ', ' 1', ' ', ' ', ' 2', ' ', ' ', ' 4', ' '] is used to create a representation for a group name. Here, there are $$9$$ elements in the list, since there are $$9$$ bars. Those bars need to be grouped into $$3$$ groups, hence for each group a label is given. Each label should be displayed right below the bar at the middle of each group.
• plt.xticks(x, labels) replaces the display names of the values on x-axis with the labels, but the actual co-ordinates of the bars remain same at those previously defined x-axis values.
• plt.bar(left = x, height=y, color=['crimson', 'burlywood', 'chartreuse']) is where the actual bars are plotted. For the left parameter, the x values are given, so that the left edge of each bar is placed at the corresponding element of x. For the height parameter, the y values are given. The color parameter applies the given set of colors to the bars in a circular order. Here, only $$3$$ colors are given, so those colors rotate around the $$9$$ bars exactly $$3$$ times.
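As a side note, instead of padding the labels with spaces, the center of each group can be computed directly and passed to plt.xticks together with the group names. A small numpy sketch based on the x values above (the reshape into 3 groups of 3 bars mirrors this particular plot):

```python
import numpy as np

# Same x positions as in the code above: three groups of three bars
x = np.array([0, 1, 2, 5, 6, 7, 10, 11, 12]) + 1

# Center of each group = mean of its three bar positions
group_centers = x.reshape(3, 3).mean(axis=1)

# group_centers is [2., 7., 12.]; passing these to plt.xticks with
# the labels ['1', '2', '4'] centers each label under its group.
```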

## A quick guide to plotting with python and matplotlib - part 1

• miscellaneous

Representation of information and data visualization can often be considered one of the most indispensable skills, requiring the subtlest attention, in the present community of science and technology. So that’s that. Now let’s hop into plotting!

Python has this amazing library called matplotlib where you can create plots of almost everything you could ever wish for (yes, it now supports almost all sorts of plots that can be drawn with R). But for a novice, this could be a little tricky at the beginning. Figuring out how to draw exactly what you need is kind of a headache, since you don’t have much experience in manipulating the resources packaged with this library. The documentation indeed provides a nice overview of what this library is capable of, but one might still want to create a simple, yet weird plot that no documentation or Stack Overflow answer could ever help with (I guess I’m one of them, and I know many others as well). So, let’s try out some fundamental techniques first and then move on to the deeper ones. These are some of the graphs I wanted to plot in many different circumstances. I assume they will provide you at least some assistance in creating your own plots. I’ll be using the numpy library as well to create the required data in these demonstrations. Note that matplotlib and numpy are imported in advance.

import matplotlib.pyplot as plt
import numpy as np

# Lines connecting scatter points with different marker for each line

The term marker refers to a symbol that represents a point in a graph. There are numerous markers in matplotlib, so we will choose a few of them for this demonstration. The typical syntax for a scatter plot is plt.scatter(x, y, marker=m) where x is the set of x-co-ordinates, y is the set of y-co-ordinates (these are compulsory arguments of the scatter function) and m is the marker we are going to use. Here are some example markers:

• ‘o’
• ’+’
• ‘v’
• ‘*’

Let’s plot now.

# There are 3 lines
num_of_lines = 3

# Defines a colour for each line
colours = ['c', 'crimson', 'chartreuse']

# Defines a marker for each line
markers = ['o', 'v', '*']

# Creates x array with numbers ranged from 0 to 10(exclusive)
# Creates an empty list for y co-ordinates in each line
x = np.arange(10)
y = []

# For each line
for i in range(num_of_lines):
    # Adds to y according to the function y = i*x + 1
    y.append(x*i + 1)

# This is where plotting happens!!!
# For each line
for i in range(num_of_lines):
    # Scatter plot with point_size^2 = 75, and with respective colours
    plt.scatter(x, y[i], marker=markers[i], s=75, c=colours[i])
    # Connects points with lines, and with respective colours
    plt.plot(x, y[i], c=colours[i])

# Show grid in the plot
plt.grid()
# Finally, display the plot
plt.show()

Fabulous, isn’t it? Now we shall add a simple legend to this plot. This is what we are going to do: the upper left corner seems like an open space, so let’s add the legend there. A Rectangle is used to represent an entry in the legend.

# Creates 3 Rectangles
p1 = plt.Rectangle((0, 0), 0.1, 0.1, fc=colours[0])
p2 = plt.Rectangle((0, 0), 0.1, 0.1, fc=colours[1])
p3 = plt.Rectangle((0, 0), 0.1, 0.1, fc=colours[2])

# Adds the legend into plot
plt.legend((p1, p2, p3), ('line1', 'line2', 'line3'), loc='best')

In the legend function, in addition to the rectangles and the names of the entries, it is possible to specify the location of the legend as well. ‘best’ automatically places the legend where it overlaps the data least (in this plot, the upper left corner). The available locations are as follows:

| Position | # |
| --- | --- |
| ‘best’ | 0 |
| ‘upper right’ | 1 |
| ‘upper left’ | 2 |
| ‘lower left’ | 3 |
| ‘lower right’ | 4 |
| ‘right’ | 5 |
| ‘center left’ | 6 |
| ‘center right’ | 7 |
| ‘lower center’ | 8 |
| ‘upper center’ | 9 |
| ‘center’ | 10 |

Note that you can even use the corresponding number to specify the location.
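For example, the following minimal sketch (with hypothetical data, using the non-interactive Agg backend so it runs headlessly) places the legend with loc=2, which is equivalent to loc='upper left':

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt

plt.plot([0, 1], [0, 1], c='crimson')

# loc=2 is the numeric equivalent of loc='upper left' (see the table)
legend = plt.legend(['line1'], loc=2)
```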

Now, simply add the legend code segment just before the plt.show() in the first code. You will see that there is a nice legend at the upper left corner.

Only a little work is left to do… What? Naming the axes and entitling the plot. It takes only $$3$$ lines of code. Add these lines just above the plt.show() function.

# Sets the x-axis label
plt.xlabel('This is my x-axis')

# Sets the y-axis label
plt.ylabel('This is my y-axis')

# Sets the title
plt.title("This is our title")

Now you know how to create the basic components of a plot. This will be the end of the first part of this guide. More interesting stuff will follow in the next parts.

## Performance comparison among LSH Forest, ANNOY and FLANN

• gsoc

Finally, it is time to compare the performance of Locality Sensitive Hashing Forest (the approximate nearest neighbor search implementation in scikit-learn), Spotify Annoy and FLANN.

## Criteria

Synthetic datasets of different sizes (varying n_samples and n_features) are used for this evaluation. For each dataset, the following measures were calculated.

1. Index building time of each ANN (Approximate Nearest Neighbor) implementation.
2. Accuracy of nearest neighbors queries with their query times.

The Python code used for this evaluation can be found in this Gist. The parameters of LSHForest (n_estimators=10 and n_candidates=50) are kept fixed during this experiment. Accuracies can be raised by tuning these parameters.
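For clarity, the accuracy of a single query is computed as the fraction of the returned approximate neighbors that appear among the exact neighbors. A minimal sketch with made-up index arrays:

```python
import numpy as np

# Hypothetical neighbor indices: one array from an ANN method,
# one from exact brute-force search
neighbors_approx = np.array([3, 7, 1, 9, 4])
neighbors_exact = np.array([3, 7, 2, 9, 4])

# Accuracy = |intersection| / number of true neighbors
intersection = np.intersect1d(neighbors_approx, neighbors_exact).shape[0]
accuracy = intersection / float(neighbors_exact.shape[0])
```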

## Results

For each dataset, two graphs have been plotted according to the measures described in the section above. It is evident that the index building times of LSH Forest and FLANN are almost incomparable to that of Annoy for almost all the datasets. Moreover, for larger datasets, LSH Forest outperforms Annoy by large margins with respect to accuracy and query speed. Observations from these graphs show that LSH Forest is competitive with FLANN for large datasets.

## A demonstration of the usage of Locality Sensitive Hashing Forest in approximate nearest neighbor search

• gsoc

This is a demonstration of how to use the approximate nearest neighbor search implementation based on locality sensitive hashing in scikit-learn, and an illustration of how the behavior of nearest neighbor queries varies with the parameters. This implementation has an API which is essentially the same as that of the NearestNeighbors module, since approximate nearest neighbor search is used to speed up queries at the cost of accuracy when the database is very large.

Before beginning the demonstration, background has to be set. First, the required modules are loaded and a synthetic dataset is created for testing.

import time
import numpy as np
from sklearn.datasets.samples_generator import make_blobs
from sklearn.neighbors import LSHForest
from sklearn.neighbors import NearestNeighbors

# Initialize size of the database, iterations and required neighbors.
n_samples = 10000
n_features = 100
n_iter = 30
n_neighbors = 100
rng = np.random.RandomState(42)

# Generate sample data
X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                  centers=10, cluster_std=5, random_state=0)

There are two main parameters which affect queries in the LSH Forest implementation.

1. n_estimators : Number of trees in the LSH Forest.
2. n_candidates : Number of candidates chosen from each tree for distance calculation.

In the first experiment, average accuracies are measured as the value of n_estimators varies, while n_candidates is kept fixed. sklearn.neighbors.NearestNeighbors is used to obtain the exact neighbors, so that the returned approximate neighbors can be compared against them.

# Set n_estimators values
n_estimators_values = np.linspace(1, 30, 5).astype(np.int)
accuracies_trees = np.zeros(n_estimators_values.shape[0], dtype=float)

# Calculate average accuracy for each value of n_estimators
for i, n_estimators in enumerate(n_estimators_values):
    lshf = LSHForest(n_candidates=500, n_estimators=n_estimators,
                     n_neighbors=n_neighbors)
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')

    lshf.fit(X)
    nbrs.fit(X)
    for j in range(n_iter):
        query = X[rng.randint(0, n_samples)]
        neighbors_approx = lshf.kneighbors(query, return_distance=False)
        neighbors_exact = nbrs.kneighbors(query, return_distance=False)

        intersection = np.intersect1d(neighbors_approx,
                                      neighbors_exact).shape[0]
        ratio = intersection/float(n_neighbors)
        accuracies_trees[i] += ratio

    accuracies_trees[i] = accuracies_trees[i]/float(n_iter)

Similarly, average accuracy vs n_candidates is also measured.

# Set n_candidate values
n_candidates_values = np.linspace(10, 500, 5).astype(np.int)
accuracies_c = np.zeros(n_candidates_values.shape[0], dtype=float)

# Calculate average accuracy for each value of n_candidates
for i, n_candidates in enumerate(n_candidates_values):
    lshf = LSHForest(n_candidates=n_candidates, n_neighbors=n_neighbors)
    nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')
    # Fit the nearest neighbor models
    lshf.fit(X)
    nbrs.fit(X)
    for j in range(n_iter):
        query = X[rng.randint(0, n_samples)]
        # Get neighbors
        neighbors_approx = lshf.kneighbors(query, return_distance=False)
        neighbors_exact = nbrs.kneighbors(query, return_distance=False)

        intersection = np.intersect1d(neighbors_approx,
                                      neighbors_exact).shape[0]
        ratio = intersection/float(n_neighbors)
        accuracies_c[i] += ratio

    accuracies_c[i] = accuracies_c[i]/float(n_iter)

You can get a clear view of the behavior of queries from the following plots.

The next experiment demonstrates the behavior of queries for different database sizes (n_samples).

# Initialize the range of n_samples
n_samples_values = [10, 100, 1000, 10000, 100000]
average_times = []
# Calculate the average query time
for n_samples in n_samples_values:
    X, labels_true = make_blobs(n_samples=n_samples, n_features=10,
                                centers=10, cluster_std=5,
                                random_state=0)
    # Initialize LSHForest for queries of a single neighbor
    lshf = LSHForest(n_candidates=1000, n_neighbors=1)
    lshf.fit(X)

    average_time = 0

    for i in range(n_iter):
        query = X[rng.randint(0, n_samples)]
        t0 = time.time()
        approx_neighbors = lshf.kneighbors(query,
                                           return_distance=False)[0]
        T = time.time() - t0
        average_time = average_time + T

    average_time = average_time/float(n_iter)
    average_times.append(average_time)

The n_samples space is defined as [10, 100, 1000, 10000, 100000], and the query time for a single neighbor is measured for each of these values.

## Improvements for LSH Forest implementation and its applications

• gsoc

GSoC 2014 is coming to an end, but the LSH Forest implementation requires a little more work to be completed. The following tasks remain; they will be completed during the next two weeks.

## 1. Improving LSH Forest implementation

I have received a lot of feedback from the scikit-learn community about my implementation of LSH Forest. Much of it concerns possible optimizations. Making those optimizations happen will bring a significant improvement in performance.

## 2. Applying LSH Forest in DBSCAN

The idea here is to speed up the clustering method using approximate neighbor search, rather than spending much time on an exhaustive exact nearest neighbor search. DBSCAN requires the neighbors whose distance from a data point is less than a given radius; we use the term radius neighbors for this. As LSH Forest is implemented to adhere to the scikit-learn API, it has a radius_neighbors function (NearestNeighbors in scikit-learn has a radius_neighbors function as well). Therefore, LSH Forest can be applied directly in place of the exact nearest neighbor search.
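As a quick illustration of the API in question, here is a minimal sketch using the exact NearestNeighbors estimator, which exposes the same radius_neighbors signature (the data and the radius value are made up for this example):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# radius_neighbors returns, for each query point, every point whose
# distance is within the given radius -- exactly what DBSCAN needs.
nbrs = NearestNeighbors(radius=0.5).fit(X)
distances, indices = nbrs.radius_neighbors(X[:2])
```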

After this application, it will be benchmarked to analyze the performance. Approximate neighbors are more useful when the size of the database (and often the number of features) is very large. So the benchmark will address the following question: at what sample sizes and numbers of features does approximate neighbor search reasonably outperform exact neighbor search?

## 3. Documentation and wrapping up work

After completing the implementation and benchmarking, it will be documented with the scikit-learn documentation standards.

## Testing LSH Forest

• gsoc

There are two types of tests to perform in order to ensure the correct functionality of the LSH Forest.

1. Tests for individual functions of the LSHForest class.
2. Tests for accuracy variation with parameters of implementation.

scikit-learn provides a nice set of testing tools for this task, elaborated in the utilities for developers section. I have used the following assertions, imported from sklearn.utils.testing. Note that numpy is imported as np.

1. assert_array_equal - Compares each element in an array.
2. assert_equal - Compares two values.
3. assert_raises - Checks whether the given type of error is raised.
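As a minimal illustration of how these three assertions behave, here they are used via the equivalent helpers in numpy.testing (which behave the same way; the values are made up):

```python
import numpy as np
from numpy.testing import assert_array_equal, assert_equal, assert_raises

# Element-wise comparison of two arrays
assert_array_equal(np.array([1, 2, 3]), np.array([1, 2, 3]))

# Comparison of two values
assert_equal(2 + 2, 4)

# Checks that the callable raises the given error type
assert_raises(ValueError, int, "not a number")
```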

## Testing individual functions

### Testing fit function

The requirement of this test is to ensure that the fit function does not work without the necessary arguments, and that it produces the correct attributes in the class object (in the sense of the dimensions of the arrays).

Suppose we initialize an LSHForest as lshf = LSHForest()

If the estimator is not fitted with proper data, it will produce a ValueError, which is tested as follows:

X = np.random.rand(samples, dim)
lshf = LSHForest(n_trees=n_trees)
lshf.fit(X)

We define the sample size and the dimension of the dataset as samples and dim respectively and the number of trees in the LSH forest as n_trees.

# Test whether a value error is raised when X=None
assert_raises(ValueError, lshf.fit, None)

Then, after fitting the estimator, the following assertions should hold true.

# _input_array = X
assert_array_equal(X, lshf._input_array)
# A hash function g(p) for each tree
assert_equal(n_trees, lshf.hash_functions_.shape[0])
# Hash length = 32
assert_equal(32, lshf.hash_functions_.shape[1])
# Number of trees in the forest
assert_equal(n_trees, lshf._trees.shape[0])
# Each tree has entries for every data point
assert_equal(samples, lshf._trees.shape[1])
# Original indices after sorting the hashes
assert_equal(n_trees, lshf._original_indices.shape[0])
# Each set of original indices in a tree has entries for every data point
assert_equal(samples, lshf._original_indices.shape[1])

All the other tests for functions also contain the tests for valid arguments, therefore I am not going to describe them in those sections.

### Testing kneighbors function

kneighbors tests are based on the number of neighbors returned, neighbors for a single data point and multiple data points.

We define the required number of neighbors as n_neighbors and create an LSHForest.

n_neighbors = np.random.randint(0, samples)
point = X[np.random.randint(0, samples)]
neighbors = lshf.kneighbors(point, n_neighbors=n_neighbors,
                            return_distance=False)
# Desired number of neighbors should be returned.
assert_equal(neighbors.shape[1], n_neighbors)

For multiple data points, we define number of points as n_points:

points = X[np.random.randint(0, samples, n_points)]
neighbors = lshf.kneighbors(points, n_neighbors=1,
                            return_distance=False)
assert_equal(neighbors.shape[0], n_points)

The above tests ensure that the maximum hash length is also exhausted, because the data points used in the queries come from the same data set. But to ensure that hash lengths less than the maximum also get involved, there is another test.

# Test a random point (not in the data set)
point = np.random.randn(dim)
lshf.kneighbors(point, n_neighbors=1,
                return_distance=False)

### Testing distances of the kneighbors function

The returned distances should be in sorted order, therefore this is tested as follows. Suppose distances is the array returned from the kneighbors function when the return_distance parameter is set to True.

assert_array_equal(distances[0], np.sort(distances[0]))

### Testing insert function

Testing insert is somewhat similar to testing fit, because what we have to ensure are dimensions and sample sizes. The following assertions should hold true after fitting the LSHForest.

# Insert wrong dimension
assert_raises(ValueError, lshf.insert,
              np.random.randn(dim-1))
# Insert 2D array
assert_raises(ValueError, lshf.insert,
              np.random.randn(dim, 2))

lshf.insert(np.random.randn(dim))

# size of _input_array = samples + 1 after insertion
assert_equal(lshf._input_array.shape[0], samples+1)
# size of _original_indices[1] = samples + 1
assert_equal(lshf._original_indices.shape[1], samples+1)
# size of _trees[1] = samples + 1
assert_equal(lshf._trees.shape[1], samples+1)

## Testing accuracy variation with parameters

The accuracy of the results obtained from the queries depends on two major parameters.

1. c value - c.
2. number of trees - n_trees.

Separate tests have been written to ensure that the accuracy varies correctly with these parameters.

### Testing accuracy against c variation

Accuracy should increase as the value of c is raised. Here, the average accuracy is calculated for different c values.

c_values = np.array([10, 50, 250])

Then the calculated accuracy values for each c value are stored in an array. The following assertion should hold true in order to make sure that the higher the c value, the higher the accuracy.

# Sorted accuracies should be equal to original accuracies
assert_array_equal(accuracies, np.sort(accuracies))

### Testing accuracy against n_trees variation

This is almost the same as the above test. First, the n_trees values are stored in ascending order.

n_trees = np.array([1, 10, 100])

After calculating average accuracies, the assertion is performed.

# Sorted accuracies should be equal to original accuracies
assert_array_equal(accuracies, np.sort(accuracies))

## What is left?

In addition to the above tests, precision should also be tested against the c values and n_trees. But this has already been tested in the prototyping stage, and those tests consume a reasonably large amount of time, which makes them unsuitable for the scikit-learn test suite. Therefore, a separate benchmark will be done on this topic.

As provided in the guidelines in scikit-learn contributors’ page, Nose testing framework has been used to automate the testing process.

## Optimizations on Locality sensitive hashing forest data structure

• gsoc

In order to get LSH forest approximate nearest neighbor search method into a competitive and useful state, optimization was a necessity. These optimizations were based on the portions of the LSH forest algorithm and implementation of that algorithm. There are two general cases of bottlenecks where requirement for optimization arises.

1. Memory usage of the data structure.
2. Speed of queries.

Let’s discuss these two cases separately in detail. Remember that there will always be a trade off between memory and speed.

## Optimizations on memory usage

In the previous post about the implementation of the LSH forest, I explained the basic data structure and its functions. Some of the optimizations caused obvious changes in those functions. First of all, I need to mention that this data structure stores the entire fitted data set, because it is required to calculate the actual distances of the candidates. So the memory consumption is usually dominated by the size of the data set, and the LSH Forest data structure has no control over this. The only thing that can be controlled from the data structure is the set of hashes of the data points.

The current data structure maintains a fixed length hash (the user can define that length at the time of initialization). Hashes are comprised of binary values (1s and 0s). At the moment, these hashes are stored as binary strings.

One of the optimizations which can be done to maintain a more compact index for hashes is using the equivalent integer of the binary string. For example, the string 001110 can be represented as 14. For very large data sets with very high dimensions, this technique has little effect, as the improvement in the memory usage of the hash indices is insignificant compared to the memory required to store the data set itself.
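As a quick aside, this conversion is a one-liner in each direction in Python; the sketch below just illustrates the 001110-to-14 example from above:

```python
# '001110' read as a base-2 number gives the integer 14
h = int('001110', 2)

# formatting back with a fixed width of 6 bits recovers the string
s = format(h, '06b')
```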

As I have mentioned earlier, there is always a cost for every optimization. Here we have to trade off speed for improved memory consumption. The LSH algorithm is applied to a single data point once per bit of the required hash length. In the default case, there will be 32 hashed bits per data point in a single tree, and those hashed values have to be concatenated in order to create $$g(p)$$ (refer to the LSH Forest paper). Because the hashing is done separately for individual elements of $$g(p)$$, they are stored in an array in the first place. In order to create a single hash from these values in the array, they are first concatenated as a string. This is the point where the memory/speed dilemma occurs. These string hashes can be converted into integer hashes for compact indices, but it costs extra time.

In the next section, I will explain how this extra overhead can be minimized using a trick, but the fact that “you cannot eliminate the memory speed trade-off completely” holds.

## Optimizations on query speeds

(I strongly recommend you to go through the LSH forest paper before reading the following section because the description involves terms directly taken from the paper)

I was hapless at the beginning, because the query speeds of the implemented LSH forest structure were not competitive enough to challenge the other approximate nearest neighbor search implementations. But with the help of my project mentors, I was able to get over this issue.

### Optimizations on the algorithm

In the initial implementation, the binary hash $$g(p)$$ for each tree was computed during the synchronous ascending phase, when it was required. This causes a great overhead because each time it is computed, the LSH function has to be applied to the query vector. For random projections, this is a dot product, which is an expensive operation for large vectors. Computing the binary hashes for each tree in advance and storing them is a reliable solution to this issue. The descend algorithm needs to be performed on each tree to find the longest matching hash length, so it iterates over the trees in the forest. For each iteration, the binary hash of the query is calculated and stored as follows. This does not consume much memory because the number of trees is small (typically 10).

def descend(query, hash_functions, trees):
    bin_queries = []
    max_depth = 0
    for i in range(len(trees)):
        bin_query = do_hash(query, hash_functions[i])
        k = find_longest_prefix_match(trees[i], bin_query)
        if k > max_depth:
            max_depth = k
        bin_queries.append(bin_query)
    return max_depth, bin_queries

It is an obvious fact that as the number of candidates for the nearest neighbors grows, the number of actual distance calculations grows proportionally. This computation is also expensive compared with the other operations in the LSH Forest structure, so limiting the number of candidates is another passable optimization that can be done with respect to the algorithm. In the above function, one of the terminating conditions of the candidate search is the maximum depth reaching 0 (the while loop runs while x > 0). But it need not run until the maximum depth reaches 0, because we are looking for the most viable candidates. It is a known fact that for small matching hash lengths, it is unlikely that two data points have a large similarity. Therefore, as the considered hash length decreases, the eligibility of the candidates we get also decreases. Thus, having a lower bound on the maximum depth sets a bound on the candidates we collect and decreases the probability of collecting irrelevant candidates. This can be done by setting x > lower_bound in the while loop for some value of lower_bound > 0. But there is a risk of not retrieving the required number of neighbors for very small data sets, because the candidate search may terminate before it acquires the required number of candidates. Therefore, the user should be aware of a suitable lower_bound for the queries at the time of initialization.
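To make the lower bound idea concrete, here is an illustrative toy sketch (not the actual implementation: trees are plain lists of binary-string hashes, matching is a linear scan rather than a sorted-array search, and all names are hypothetical):

```python
def candidate_search(trees, bin_queries, max_depth, min_candidates,
                     lower_bound=2):
    # Collect candidates by shrinking the matched prefix length x,
    # but stop at lower_bound instead of descending all the way to 0.
    candidates = set()
    x = max_depth
    while x > lower_bound and len(candidates) < min_candidates:
        for tree, bin_query in zip(trees, bin_queries):
            prefix = bin_query[:x]
            candidates.update(i for i, h in enumerate(tree)
                              if h.startswith(prefix))
        x -= 1
    return candidates
```

With lower_bound=2, the search never considers hashes that share fewer than two leading bits with the query, trading a small risk of returning too few candidates for fewer irrelevant distance computations.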

### Optimizations on the implementation

I have indicated that the bisect operations that come with Python are faster for small lists, but as the size of the list grows, numpy's searchsorted function becomes faster. In my previous implementation, I used an alternative version of the bisect_right function, as the original does not fulfill the requirement of finding the right-most index for a particular hash length in a sorted array (please refer to my previous post if things are not clear). But we cannot create an alternative version of numpy's searchsorted function, therefore a suitable transformation is required on the hash itself.

Suppose we have a binary hash item = 001110. What we need is the largest number with the first 4 bits being the same as item. 001111 suffices this requirement. So the transformation needed is replacing the last 2 bits with 1s.

def transform_right(item, h, hash_size):
    transformed_item = item[:h] + "".join(['1' for i in range(hash_size - h)])
    return transformed_item

The transformation needed to get the left-most index is simpler. It is 001100, which is the last two bits replaced by 0s. Matching against this is the same as matching against the prefix 0011, therefore only a string slicing operation item[:4] will do the job.

It gets more complicated when it comes to integer hashes. The integer has to be converted to a string, transformed, and converted back to an integer.

def integer_transform(int_item, hash_size, h):
    str_item = ('{0:0' + str(hash_size) + 'b}').format(int_item)
    transformed_str = transform_right(str_item, h, hash_size)
    transformed_int = int(transformed_str, 2)
    return transformed_int

But this is a very expensive operation to perform at query time. For indexing, only int(binary_hash, 2) is required, and this does not have much effect because the LSH algorithm applied to the data set dominates that operation completely. But for a query, this is a significant overhead, therefore we need an alternative.

The required integer representations for the left and right operations can be obtained by performing bitwise $$AND$$ and $$OR$$ operations with suitable masks. The masks are generated by the following function.

def _generate_masks(hash_size):
    # left_masks[h] keeps the first h bits (AND); right_masks[h] fills
    # the trailing bits with 1s (OR).
    left_masks, right_masks = [], []
    for length in range(hash_size + 1):
        left_masks.append(int("".join(['1' for i in range(length)])
                              + "".join(['0' for i in range(hash_size - length)]), 2))
        right_masks.append(int("".join(['0' for i in range(length)])
                               + "".join(['1' for i in range(hash_size - length)]), 2))
    return left_masks, right_masks

These masks will be generated at indexing time. Then they are applied to the integer hashes.

def apply_masks(item, left_masks, right_masks, h):
    # AND clears the trailing bits (left bound); OR sets them (right bound).
    item_left = item & left_masks[h]
    item_right = item | right_masks[h]
    return item_left, item_right

The left most and right most indices of a sorted array can be obtained in the following fashion.

import numpy as np

def find_matching_indices(sorted_array, item_left, item_right):
    left_index = np.searchsorted(sorted_array, item_left)
    right_index = np.searchsorted(sorted_array, item_right, side='right')
    return np.arange(left_index, right_index)

As the masks are precomputed, the speed overhead at query time is minimal in this approach. But there is still a little overhead, because the original hashed binary numbers are stored in an array and need to be concatenated and converted to obtain the corresponding integers. If the integers are cached, this overhead is eliminated.

### Cached hash

This is a method which guarantees a significant speed-up in queries at the expense of index building speed. At indexing time, we can create a dictionary whose keys are arrays of hashed bits and whose values are the corresponding integers. The number of items in the dictionary is the number of bit combinations for a particular hash length.

Example:

Suppose the hash length is 3. Then the bit combinations are: 000, 001, 010, 011, 100, 101, 110 and 111.

Then it is only a matter of a dictionary look-up: once an integer is required for a particular hash bit array, it can be retrieved directly from the dictionary.
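As a concrete sketch (my own illustration, not code from the implementation), the cache for hash length 3 can be built and queried like this:

```python
import itertools

hash_length = 3
# One entry per bit combination: 2**3 = 8 keys.
cache = {
    bits: int("".join(str(b) for b in bits), 2)
    for bits in itertools.product((0, 1), repeat=hash_length)
}

print(len(cache))        # 8
print(cache[(1, 0, 1)])  # 5
```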

The implementation of this type of hash is a bit tricky. Typically in LSH forests, the maximum hash length will be 32, so the size of the dictionary would be $$2^n = 4294967296$$ where $$n = 32$$. That is an entirely infeasible size to hold in memory (it may require tens of gigabytes). But when $$n = 16$$, the size becomes $$65536$$, which can easily be stored in memory. Therefore we use two caches of 16 bits each.

First a list for digits 0 and 1 is created.

digits = ['0', '1']

Then the cache is created as follows.

import itertools

cache_N = 32 // 2  # integer division: two 16-bit caches
c = 2 ** cache_N # compute once

cache = {
tuple(x): int("".join([digits[y] for y in x]),2)
for x in itertools.product((0,1), repeat=cache_N)
}

Suppose item is a list of length 32 with binary values (0s and 1s). Then we can obtain the corresponding integer for that item as follows:

xx = tuple(item)
int_item = cache[xx[:16]] * c + cache[xx[16:]]
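Here is a self-contained sketch of the two-cache look-up, with a check that it agrees with the direct string conversion (the random item is just for illustration):

```python
import itertools
import random

cache_N = 32 // 2          # two caches of 16 bits each
c = 2 ** cache_N           # weight of the high-order half

# 65536-entry cache mapping a 16-bit tuple to its integer value.
cache = {
    x: int("".join("01"[b] for b in x), 2)
    for x in itertools.product((0, 1), repeat=cache_N)
}

# A random 32-bit item, converted via the two caches.
random.seed(0)
item = [random.randint(0, 1) for _ in range(32)]
xx = tuple(item)
int_item = cache[xx[:16]] * c + cache[xx[16:]]

# Must agree with the direct string conversion.
assert int_item == int("".join(map(str, item)), 2)
```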

This technique can be used in queries to obtain a significant improvement in speed assuming that the user is ready to sacrifice a portion of the memory.

The performance of all these techniques was measured using Python's line_profiler and memory_profiler. In my next post, I will explain concisely how these tools have been used to evaluate these implementations.

## An illustration of the functionality of the LSH Forest

• gsoc

Before digging into more technical detail on the implementation of LSH forest, I thought it would be useful to provide an insightful illustration on how the best candidates are selected from the fitted data points.

For this task, I decided to use a randomly generated dataset with a sample size of 10000 in two-dimensional space. The reason for using two dimensions is that it is convenient to depict the vectors in 2D space, and a normal human being can easily grasp the idea of convergence there.

Following configuration of the LSH forest has been selected for this illustration.

• Number of trees = 1
• Hash length = 32
• c = 1
• Lower bound of hash length at termination = 4
• Expected number of neighbors = 10

You can get an idea of these parameters from my previous article on the LSH forest.

The following illustrations show the convergence of data points towards the query point as the considered hash length is increased. The important fact I want to emphasize here is that the candidates chosen by matching the most significant hash bits converge towards the actual data points we are interested in. This happens because of the amplification property of the Locality Sensitive Hashing algorithm.

(Beware! The query point is in RED)

The algorithm searches for more candidates at smaller matching hash lengths only if the required number of candidates is not reached during the early sweeps. Then the best neighbors are selected from that set of candidates by calculating the actual distance.
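That final distance-based selection can be sketched as follows (a simplified illustration; select_best_neighbors is a hypothetical helper, not part of the actual implementation):

```python
import numpy as np

def select_best_neighbors(X, candidates, query, m):
    # Compute true Euclidean distances only for the candidate set,
    # then keep the m closest points.
    cand = np.asarray(sorted(set(candidates)))
    distances = np.linalg.norm(X[cand] - query, axis=1)
    return cand[np.argsort(distances)[:m]]

# Example: 5 points in 2D, query at the origin, 2 neighbors.
X = np.array([[3., 0.], [0., 1.], [2., 2.], [0.5, 0.], [4., 4.]])
print(select_best_neighbors(X, [0, 1, 2, 3, 4], np.zeros(2), 2))  # [3 1]
```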

In my next post, I will enlighten you about the subtle optimizations done on the LSH forest data structure to find the best candidates at a maximum speed.

## Performance evaluation of Approximate Nearest Neighbor search implementations - Part 2

• gsoc

This continues from the post Performance evaluation of Approximate Nearest Neighbor search implementations - Part 1. The evaluation of memory consumption was already completed in that post. In this post, the next two aspects of the evaluation framework, precision and query speed, will be discussed.

When measuring the performance of approximate nearest neighbor search methods, expressing precision and query speed independently is less useful, since the entire idea of approximating is to obtain the desired precision within a better time period. In order to evaluate precision, an ANN implementation should be able to provide multiple neighbors (rather than just the nearest neighbor). After querying for all data points in the data set with the ANN method, the first few entries in each neighbors list (ten neighbors in my evaluation tests) are taken as the ground-truth neighbors of that particular ANN method. This set of data points is compared against the neighbors retrieved when the number of queried neighbors is varied. Precision tests are adapted from the tests performed for ANNOY.
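The precision computation itself reduces to a set intersection; a minimal sketch (precision_at is my own naming, not from the ANNOY tests):

```python
def precision_at(ground_truth, retrieved):
    """Fraction of the true neighbors that appear in the retrieved set."""
    gt = set(ground_truth)
    return len(gt & set(retrieved)) / float(len(gt))

# Ground truth: e.g. the first 10 neighbors from an exhaustive query;
# retrieved: the neighbors returned by the ANN method under test.
print(precision_at([1, 2, 3, 4], [2, 3, 9, 4]))  # 0.75
```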

This precision measure eliminates some of our candidate ANN implementations, because they are not capable of producing multiple neighbors. Obtaining multiple neighbors is an essential requirement of the precision tests described above, as well as of the general applications of nearest neighbor search. Therefore, for the precision tests, only ANNOY, FLANN, KGraph and LSH Forest are taken into consideration. All evaluation tests, the LSH forest implementation and the plots can be found in this Github repository.

Before jumping into comparisons, I thought it imperative to get a notion of the characteristics of the LSH forest implementation. Unlike other ANN implementations, LSH forest provides some user-controlled parameters to tune the forest and the queries in different scenarios.

For the entire forest:

• Number of trees
• Maximum hash length

For queries:

• c : A value which determines the number of candidates chosen into the neighbors set. It acts together with the number of trees.
• A lower bound for the maximum depth of hashes considered.

In these precision tests, all of those factors except c are fixed to constant values as follows:

• Number of trees = 10
• Maximum hash length = 32
• Lower bound = 4

The same data set, prepared using singular value decomposition on the MovieLens data, is used in all of these tests. Following are the resulting graphs of the performance of LSH forest. Time is measured in seconds.

The next section of this post illustrates the performance comparisons among ANNOY, FLANN, LSH forest and KGraph. Precision vs. time graphs are used for this comparison. Comparing ANNOY, FLANN and LSHF in one place:

KGraph has a significantly higher precision rate than the other ANN implementations. (Rather than approximating, it gives the actual nearest neighbors, according to the KGraph documentation.)

One of the main considerations of these evaluations is the maintainability of the code: any implementation which goes into scikit-learn should have reasonably uncomplicated code. Both FLANN and KGraph use complex data structures and algorithms to achieve higher speeds. ANNOY has a passable precision-query speed combination with a less complex implementation. Our current implementation of LSH forest has been able to beat ANNOY in the precision-query speed comparison.

## Indexing speeds of ANN implementations

In addition to precision and query speed, indexing speed is of major importance. Therefore, as a final step in this evaluation, I will provide the indexing times for the same data set used above for each ANN implementation.

• KGraph : 65.1286380291 seconds
• ANNOY (metric=’Angular’) : 47.5299789906 seconds
• FLANN : 0.314314842224 seconds
• LSHF : 3.73900985718 seconds

In my next post, I will discuss the implementation of LSH forest in detail and how ANN methods will be implemented in scikit-learn.

### Acronyms:

1. ANN : Approximate Nearest Neighbors
2. LSH : Locality Sensitive Hashing

## LSH Forest with sorted arrays and binary search

• gsoc

More on GSoC with scikit-learn! LSH forest is a promising, novel alternative method introduced in order to alleviate the drawbacks from which vanilla LSH suffers. I assume you have a passable idea of what LSH means; if not, I suggest you refer to this: Locality-sensitive hashing. LSH forest has a theoretical guarantee of its suggested improvements. For more information, refer to the published paper: LSH Forest: Self-Tuning Indexes for Similarity Search.

In general, the data structure used to implement LSH forest is a prefix tree (trie). In this article, I will elaborate on how to implement it with sorted arrays and binary search. This reduces the complexity involved with a separate data structure (such as a tree). You can see the complete implementation in this gist.

## How it is done

This implementation follows every design aspect suggested in the LSH forest paper except the data structure. LSH_forest is a class which has the initialization method as follows:

def __init__(self, max_label_length = 32, number_of_trees = 5):
    self.max_label_length = max_label_length
    self.number_of_trees = number_of_trees
    self.random_state = np.random.RandomState(seed=1)

numpy has been used and imported as np. Variable names are the same as the names in the paper. The length of the hash is a fixed value, which will be a small integer for almost all applications.

In a normal LSH based nearest neighbor search, there are two main operations.

1. Building index
2. Queries

### Building index

The first stage of building the index is hashing the data points in the data set passed into the function. Random projection, which belongs to the LSH family, has been used as the hashing algorithm. In order to perform random projection, a set of random hyperplanes is required with the shape $$expectedHashSize \times dimensionOfTheDataVector$$. It is done by the following function.

def _get_random_hyperplanes(self, hash_size = None, dim = None):
    """
    Generates hyperplanes from the standard normal distribution and
    returns them as a 2D numpy array. This is g(p,x) for a particular tree.
    """
    return self.random_state.randn(hash_size, dim)

Then, the random projection is performed. It is a simple operation: all it needs to do is take the dot product of the generated hyperplanes and the data vectors, then create a binary string from the sign of each projection:

def _hash(self, input_point = None, hash_function = None):
    """
    Hashes the data point with the provided hash_function: g(p,x).
    """
    projections = np.dot(hash_function, input_point)
    return "".join(['1' if i > 0 else '0' for i in projections])

After this, a tree (a figurative tree) is built by sorting those binary hashes. At this point, the original indices are retained, because they will be the only way to refer to the original vectors from now on. It is done as follows:

def _create_tree(self, input_array = None, hash_function = None):
    """
    Builds a single tree (in this case creates a sorted array of
    binary hashes).
    """
    number_of_points = input_array.shape[0]
    binary_hashes = []
    for i in range(number_of_points):
        binary_hashes.append(self._hash(input_array[i], hash_function))

    binary_hashes = np.array(binary_hashes)
    o_i = np.argsort(binary_hashes)
    return o_i, np.sort(binary_hashes)

This is the process for a single tree, but there are multiple trees, so it has to be repeated for each of them. The above function is called for each tree with the corresponding hash function $$g(p)$$. Then the hash functions, trees and original indices are stored as numpy arrays.
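Putting the pieces together, the whole index-building step is a loop over the trees. The following is a condensed, self-contained sketch of that loop (MiniLSHForest is illustrative only; the real code lives in the gist):

```python
import numpy as np

class MiniLSHForest(object):
    """Condensed sketch of the index-building loop, not the full gist."""
    def __init__(self, max_label_length=32, number_of_trees=5):
        self.max_label_length = max_label_length
        self.number_of_trees = number_of_trees
        self.random_state = np.random.RandomState(seed=1)

    def _hash(self, point, hyperplanes):
        projections = np.dot(hyperplanes, point)
        return "".join(['1' if p > 0 else '0' for p in projections])

    def fit(self, X):
        self.hash_functions, self.trees, self.original_indices = [], [], []
        for _ in range(self.number_of_trees):
            # g(p): one set of random hyperplanes per tree.
            g = self.random_state.randn(self.max_label_length, X.shape[1])
            hashes = np.array([self._hash(x, g) for x in X])
            o_i = np.argsort(hashes)        # original indices, kept for look-up
            self.hash_functions.append(g)
            self.original_indices.append(o_i)
            self.trees.append(hashes[o_i])  # sorted binary hashes

forest = MiniLSHForest(number_of_trees=3)
forest.fit(np.random.RandomState(0).randn(20, 5))
print(len(forest.trees), forest.trees[0].shape)  # 3 (20,)
```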

### Queries

This is the tricky part of this implementation. All the tree operations indicated in the paper have to be converted into range queries in order to work with sorted arrays and binary search. I will go step by step through how binary search is used in this application.

The first objective is: given a sorted array of binary hashes, a binary query and a hash length h, retrieve an array of indices where the most significant h bits of the entries are the same as the most significant h bits of the query. In order to achieve this, I have re-implemented the bisect functions (which come by default with Python) with a small but essential modification.

def bisect_left(a, x):
    lo = 0
    hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if a[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

def bisect_right(a, x):
    lo = 0
    hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        # Move left only when the entry is greater than x and does not
        # share x as a prefix.
        if x < a[mid] and not a[mid][:len(x)] == x:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Accepts a sorted array of bit strings, a query string and a hash length h.
# Returns an array containing all indices which share the first h bits of the query.
def simpleFunctionBisectReImplemented(sorted_array, item, h):
    left_index = bisect_left(sorted_array, item[:h])
    right_index = bisect_right(sorted_array, item[:h])
    return np.arange(left_index, right_index)

Here I considered a minor aspect of slicing versus startswith for Python strings. In place of a[mid][:len(x)]==x, I could have used the built-in startswith function, but after a little research it became obvious why the latter is not suitable here: startswith works efficiently with very long strings, whereas slicing is optimized at the C level for small strings. In this application, hash strings have no requirement to be very long. You can read more about this in this question.
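A rough way to observe the difference yourself is timeit (absolute numbers vary by machine, so treat them as indicative only):

```python
import timeit

# Compare prefix matching via slicing and via startswith on a short string.
setup = "s = '10110010' * 4; prefix = s[:8]"
slice_time = timeit.timeit("s[:8] == prefix", setup=setup)
startswith_time = timeit.timeit("s.startswith(prefix)", setup=setup)
print(slice_time, startswith_time)
```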

The time complexity of this method is that of any binary search. The number of entries n is the length of the array of sorted binary hashes. There are two searches in this method, but the overall complexity is still $$O(\log n)$$. (You can do the math and confirm.)

There is another binary search: finding the longest prefix match for a binary query string in a sorted array of binary hashes.

def find_longest_prefix_match(bit_string_list, query):
    hi = len(query)
    lo = 0
    res = 0  # initialized so that a query with no match returns 0

    if len(simpleFunctionBisectReImplemented(bit_string_list, query, hi)) > 0:
        return hi

    while lo < hi:
        mid = (lo+hi)//2
        k = len(simpleFunctionBisectReImplemented(bit_string_list, query, mid))
        if k > 0:
            lo = mid + 1
            res = mid
        else:
            hi = mid

    return res

The time complexity of this operation is a little trickier, since binary searches over two different parameters are involved. The outer binary search corresponds to the length of the query string, say v, and the inner binary search corresponds to the length of the sorted array of binary hashes, say n. So it has a complexity of $$O(\log v \times \log n)$$, which is approximately $$O(\log n^K)$$ where $$K = \log v$$.

That is all the basic setting required to implement LSH forest with sorted arrays and binary search. Now we can move on to the actual implementation of queries as indicated in the paper. There are two main phases described to perform queries.

1. Descending phase.
2. Synchronous ascending phase.

In the descending phase, the longest matching hash length for a particular query is retrieved from all the trees. This step is quite straightforward, as all it does is apply the longest prefix match function described above on each tree. From this, max_depth (x in the paper) is found.
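In code, the descending phase is just a maximum over the per-tree prefix matches. Here is a sketch with a simplified linear prefix matcher (the real implementation uses the binary search described above):

```python
def longest_prefix_match(sorted_hashes, query):
    # Simplified linear version for illustration only.
    best = 0
    for h in sorted_hashes:
        match = 0
        while match < len(query) and h[match] == query[match]:
            match += 1
        best = max(best, match)
    return best

def descend(trees, query):
    """max_depth (x in the paper): the longest match over all trees."""
    return max(longest_prefix_match(t, query) for t in trees)

trees = [["0010", "0111", "1100"], ["0001", "1011", "1111"]]
print(descend(trees, "1010"))  # 3 ("1011" shares the first 3 bits)
```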

The query function accepts a value for $$c$$ (refer to the paper) as well. This determines the number of candidates returned from the function: $$M$$, which is equal to $$c \times numberOfTrees$$. In the synchronous ascending phase, starting from x, every entry matching the query in its first x bits is collected from each tree (in a loop). Then x is decreased by one, and the same is done repeatedly for each tree until the required number of candidates is retrieved. During the process, the candidate list may grow longer than the required number of candidates, but the search continues only while the following condition holds (as described in the synchronous ascend algorithm in the paper).

condition: $$x > 0$$ and $$(length(candidates) < M$$ or $$length(unique(candidates)) < m)$$

$$M \gg m$$ where $$m$$ is the actual number of neighbors required, so after selecting the candidates, a true distance measure is used to determine the actual neighbors. This will be done later as the project proceeds. The current implementation will be used to perform the tasks in the evaluation criteria that I discussed in my earlier post.

In my next post I will illustrate how the various versions of LSH forest perform, and compare them with other ANN implementations.

### Acronyms

1. LSH : Locality Sensitive Hashing

## Performance evaluation of Approximate Nearest Neighbor search implementations - Part 1

• gsoc

This marks the official start of my GSoC project. In my previous post, I discussed how to create the benchmarking data set which will be used from here on. Here I will discuss how the evaluation framework is designed to evaluate the performance of the existing approximate nearest neighbor search implementations. For this evaluation, I have chosen the following implementations:

## Singular value decomposition to create a bench marking data set from MovieLens data

• gsoc

This is the second article on my Google Summer of Code project, following my previous post describing the project: Approximate nearest neighbor search using Locality sensitive hashing. Here, I will elaborate on how I created my data set for prototyping, evaluation and benchmarking purposes in the project. I have used singular value decomposition on the MovieLens 1M data to create this sample data set.

## MovieLens 1M data

GroupLens Research is an organization that publishes research articles in conferences and journals, primarily in the field of computer science, but also in other fields including psychology, sociology, and medicine. It has collected and made available rating data sets from the MovieLens web site. The data sets were collected over various periods of time, depending on the size of the set.

The MovieLens 1M data set contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000. After extracting the compressed content, you will have the following files at hand:

• ratings.dat : Contains user IDs, movie IDs, ratings on 5 star scale and time stamp.
• movies.dat : Contains movie IDs, titles and genres.
• users.dat : Contains user IDs, genders, ages, occupations and zip-codes.

## A brief explanation of singular value decomposition and its role in machine learning

Singular value decomposition is a matrix factorization method. The general equation can be expressed as follows:

$$X = U S V^T$$

Suppose $$X$$ has $$n$$ rows and $$d$$ columns. $$U$$ is a matrix whose dimensions are $$n \times n$$, $$V$$ is another matrix whose dimensions are $$d \times d$$, and $$S$$ is a matrix whose dimensions are $$n \times d$$, the same dimensions as $$X$$. In addition, $$U^T U = I_n$$ and $$V^T V = I_d$$.
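These shapes are easy to verify with numpy's full SVD (a toy example; note that numpy returns the singular values as a vector, from which the $$n \times d$$ matrix $$S$$ can be built):

```python
import numpy as np

n, d = 6, 4
X = np.random.RandomState(0).randn(n, d)
U, s, Vt = np.linalg.svd(X, full_matrices=True)

print(U.shape, Vt.shape)  # (6, 6) (4, 4)

# U and V are orthogonal: U^T U = I_n and V^T V = I_d.
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(d))
```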

### What is the significance of SVD (Singular Value Decomposition) in machine learning and what does it have to do with MovieLens data?

We can represent each movie as a dimension, with each user corresponding to a data point in this high-dimensional space (though we cannot visualize more than three dimensions). This data can be represented by a matrix, the $$X$$ in the above equation. A sample depiction of the matrix may look as follows. Because the number of users who rate any given movie is very small compared to the total number of users, this matrix contains a large number of empty entries and is therefore very sparse. Hence, approximating this matrix with a lower rank matrix is a worthwhile attempt.

Consider the following scenario:

If every user who likes “movie X” also likes “movie Y”, then it is possible to group them together to form an agglomerative movie, or feature. After forming new features in this way, two users can be compared by analyzing their ratings for different features rather than for individual movies.

In the same way, different users may rate the same movies similarly, so there can be different types of similarities among user preferences.

According to this factorization method (you might want to read more about SVD at this point from the reference I have provided earlier), the matrix $$S$$ is a diagonal matrix containing the singular values of the matrix $$X$$. The number of singular values is exactly equal to the rank of the matrix $$X$$. The rank of a matrix is the number of linearly independent rows or columns in the matrix. We know that two vectors are linearly independent if they cannot be written as the sum or scalar multiple of any other vectors in that vector space. You can notice that this linear independence somehow captures the notion of a feature or agglomerative item which we try to generate in this approach. According to the above scenario, if every user who liked “Movie X” also liked “Movie Y”, then those two movie vectors would be linearly dependent and would only contribute one to the rank.

So how are we to get rid of this redundant data? We can group movies together if most users who like one also like the other. In order to do that, we keep the largest k singular values in $$S$$. This gives us the best rank-k approximation to $$X$$.

So the entire procedure can be boiled down to following three steps:

1. Compute the SVD: $$X = U S V^T$$.
2. Form the matrix $$S'$$ by keeping the k largest singular values and setting the others to zero.
3. Form the matrix $$X_{lr}$$ by $$X_{lr} = U S' V^T$$.

## Implementation

To perform SVD on the MovieLens data set and recompose the matrix with a lower rank, I used scipy's sparse matrices, numpy and pandas. It is done in the following steps.

1) Import required packages.

import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds
import pickle

2) Load data set into a pandas data frame.

data_file = pd.read_table(r'ratings.dat', sep = '::', header=None)

Here, I have assumed that the ratings.dat file from the MovieLens 1M data is in the working directory. The only reason I am using a pandas data frame is its convenience of usage. You can directly open the file and proceed, but then you will have to adapt the following steps to that method.

3) Extract required meta information from the data set.

users = np.unique(data_file[0])
movies = np.unique(data_file[1])

number_of_rows = len(users)
number_of_columns = len(movies)

movie_indices, user_indices = {}, {}

for i in range(len(movies)):
    movie_indices[movies[i]] = i

for i in range(len(users)):
    user_indices[users[i]] = i

As the user IDs and movie IDs are not continuous integers (there are missing numbers in between), a proper mapping is required. It will be used when inserting data into the matrix. At this point, you can delete the loaded data frame in order to save memory, but that is optional.

4) Creating the sparse matrix and inserting data.

#scipy sparse matrix to store the 1M matrix
V = sp.lil_matrix((number_of_rows, number_of_columns))

#adds data into the sparse matrix
for line in data_file.values:
    u, i, r, timestamp = map(int, line)  # timestamp is not used
    V[user_indices[u], movie_indices[i]] = r

You can save the sparse matrix V using pickle if you want to use it later.

#as these operations consume a lot of time, it's better to save processed data
with open('movielens_1M.pickle', 'wb') as handle:
    pickle.dump(V, handle)

5) Perform SVD.

#gets SVD components from the 1M matrix
u, s, vt = svds(V, k = 500)

with open('movielens_1M_svd_u.pickle', 'wb') as handle:
    pickle.dump(u, handle)
with open('movielens_1M_svd_s.pickle', 'wb') as handle:
    pickle.dump(s, handle)
with open('movielens_1M_svd_vt.pickle', 'wb') as handle:
    pickle.dump(vt, handle)

The svds method performs the SVD. The parameter k is the number of singular values we want to retain. Here also I have saved the intermediate data using pickle.

After this decomposition you will get u, s and vt, with shapes (number of users, k), (k,) and (k, number of movies) respectively.

6) Recomposing the lower rank matrix.

As s is a vector, we need to create a diagonal matrix from it, with the diagonal containing the values of that vector. It is done as follows:

s_diag_matrix = np.zeros((s.shape[0], s.shape[0]))

for i in range(s.shape[0]):
    s_diag_matrix[i, i] = s[i]

This creates the diagonal matrix. After that, all you have to do is take the matrix product of u, s_diag_matrix and vt, in that order.

X_lr = np.dot(np.dot(u, s_diag_matrix), vt)
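As an aside, the same diagonal matrix can be produced in one call with np.diag (shown here on a toy vector):

```python
import numpy as np

s = np.array([3.0, 2.0, 1.0])
s_diag_matrix = np.diag(s)  # equivalent to the explicit loop above
print(s_diag_matrix)
# [[3. 0. 0.]
#  [0. 2. 0.]
#  [0. 0. 1.]]
```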

Now we have the lower rank approximation of $$X$$: $$X_{lr} = U S' V^T$$. This matrix can be used as a benchmarking data set for the application.

## References

1. Dimensionality Reduction and the Singular Value Decomposition, Available[online]: http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/svd.html
2. Data Mining Algorithms In R/Dimensionality Reduction/Singular Value Decomposition, Available[online]: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition

## GSoC2014: Approximate nearest neighbor search using LSH

• gsoc

This project has been initiated as a Google Summer of Code project. The Python Software Foundation serves as an “umbrella organization” for a variety of Python related open source projects, and scikit-learn is a machine learning module which operates under that umbrella. I’m undertaking this project under that module. In the following sections, I will describe what this really is, with the help of the essence of my project proposal. This is just the start of the journey; you will be able to obtain a clearer picture as it proceeds.

Nearest neighbor search is a well known problem which can be defined as follows: given a collection of n data points, create a data structure which, given any query point, reports the data point that is closest to the query. This problem holds major importance in certain applications: data mining, databases, data analysis, pattern recognition, similarity search, machine learning, image and video processing, information retrieval and statistics. There are several efficient algorithms known for the case where the dimension is low, but those methods suffer from either space or query time that is exponential in the dimension.

In order to address the “Curse of Dimensionality” in large data sets, recent research has focused on approximate nearest neighbor search. It has been proven that in many cases an approximate nearest neighbor is almost as good as the exact one[1]. Locality Sensitive Hashing is one of those approximating methods. The key idea of LSH is to hash data points using several hash functions to ensure that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart.

## What does this have to do with me?

In scikit-learn, exact nearest neighbor search is currently implemented, but when it comes to higher dimensions, it fails to perform efficiently[2]. So I’m taking the initiative to implement LSH based ANN for scikit-learn. In this project, several variants of LSH-ANN methods will be prototyped and evaluated. After identifying the most appropriate method for scikit-learn, it will be implemented in accordance with scikit-learn’s API and documentation standards, which include narrative documentation. Then, with the results obtained from the prototyping stage, the storing and querying structure of ANN will be implemented. After that, the ANN part will be integrated into the sklearn.neighbors module. Next comes the application of this method: most clustering algorithms use nearest neighbor search, so this method will be adapted for use in those modules in order to improve their speed. As these activities proceed, testing, examples and documentation will be covered. Benchmarking will be done to assess the implementation.

## Milestones of the project

1. Prototyping/evaluating existing LSH based ANN methods (vanilla and others) in order to find the most appropriate method to have in scikit-learn. There is no point in having methods in scikit-learn which are impractical to use with real data.
2. Approximate neighbor search uses hashing algorithms of the LSH family. These algorithms will be implemented.
3. Implementation of an efficient storing structure to retain trained/hashed data.
4. Integrating the ANN search into current implementation of neighbor search, so that this can be used with the existing API.
5. Improving speed of existing clustering models with the implemented ANN search.
6. Completing tests, examples, documentation and bench marking.

Well, that’s it for the moment. I will guide you through when things are in motion.

## Abbreviations

• LSH : Locality sensitive hashing
• ANN : Approximate nearest neighbor
• API : Application programming interface

## References

1. A. Andoni and P. Indyk,”Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions”, Available[online]:http://people.csail.mit.edu/indyk/p117-andoni.pdf
2. R. Rehurek, “Performance Shootout of Nearest Neighbours: Contestants”, Available[online]:http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/

## The Pythonizer!

• miscellaneous

Though I’m a Python fanatic, Jekyll is inconceivably awesome! (It’s Ruby)

Jekyll also offers powerful support for code snippets:

def print_hi(name)
  puts "Hi, #{name}"
end

print_hi('Gonaa')
#=> prints 'Hi, Gonaa' to STDOUT.

Check out the Jekyll docs for more info on how to get the most out of Jekyll. File all bugs/feature requests at Jekyll’s GitHub repo.