Espy Yonder

maheshakyas' personal website

To bridge the gaps - An overview of WSO2 Machine Learner <p>How effectively can we bring purely mathematical and conceptual models into the hands of the common man? Is it plausible to make the prerequisite technical know-how for dealing with such models uniform across a range of intellectual levels? Are these mechanisms capable of functioning with a variety of applications and data? To what extent can they scale? The advent of tools for emulating learning theory has transpired amidst these concerns. </p> <p>Frameworks that perform computational learning tasks started to emerge at approximately the same time as the inception of Machine Learning as a branch of Artificial Intelligence. At the early stages, these frameworks emulated the exact algorithms on small samples of data, and almost all of them were CLI (Command Line Interface) based. The majority were used only in laboratories, for very specific applications, and were restricted to researchers. But our society has evolved and progressed through a myriad of milestones along the path of artificial intelligence, and nowadays almost every organization requires some amount of data analytics and machine learning to keep abreast of the ever-growing technology. This trend has become exigent due to the massive amounts of data being collected and stored every day. With this inclination, the demand for tools that conduct the entire range of data analytics workflows has risen colossally.</p> <p>As an open-source middleware organization, WSO2 has established its name among a vast number of communities which seek different types of solutions in a variety of application domains. 
A few years ago, WSO2 entered the arena of data analytics by providing a rich platform for batch and real time data analytics (<a href="">WSO2 Data Analytics Server</a> and <a href="http://wso2.com/products/complex-event-processor/">WSO2 Complex Event Processor</a>). To fortify the position that WSO2 has built around data analytics, <a href="http://wso2.com/products/machine-learner/"><strong>WSO2 Machine Learner</strong></a> has been introduced with the intention of providing a rich, user friendly, flexible and scalable predictive analytics platform which is compatible with standard machine learning methodologies as well as with the other WSO2 products. On the occasion of the first release of WSO2 Machine Learner (version 1.0.0), this discussion provides an overview of its ambition of bridging the gaps that hinder predictive analytics tasks for various reasons: complexity, lack of data analytics skills, inability to scale, real time prediction requirements, the rapid growth of unanalyzed data stores, proprietary software licensing, etc.</p> <p>In the subsequent sections, you will see how WSO2 Machine Learner (WSO2 ML from this point onwards) fits in and what sets it apart from other machine learning tools and frameworks with respect to the following concerns.</p> <ol> <li>Functionality</li> <li>HCI aspects and usability</li> <li>Scalability</li> <li>License - Availability to users</li> <li>Compatibility and use cases</li> </ol> <h1 id="functionality">1. Functionality</h1> <p><img src="https://docs.wso2.com/download/attachments/45949448/Ml-overview.png?version=3&amp;modificationDate=1443422195000&amp;api=v2" alt="ML key concepts" title="Overview functionality of WSO2 Machine Learner" class="center-image" /></p> <p>The above diagram shows the key concepts and terminology of WSO2 ML. 
To put it simply, WSO2 ML supports the entire workflow of data mining: </p> <h3 id="data-sources-and-data-formats">1. Data sources and data formats</h3> <p>WSO2 ML supports data sources such as CSV/TSV files from the local file system, files from an HDFS file system, and tables from WSO2 Data Analytics Server.</p> <h3 id="data-exploration">2. Data exploration</h3> <p>Analyzing and visualizing the characteristics and structure of the data is considered one of the most imperative tasks in data science. Most often, choosing the <strong>right</strong> machine learning algorithm depends on the understanding a user has about the data that he/she is dealing with. <img src="https://docs.google.com/drawings/d/1tZ1Sv7k1NK78Obe5GY9ievctmjaBJar5VA106jftCy4/pub?w=499&amp;h=380" alt="Know thy data" title="Know Thy Data!" class="center-image" /> WSO2 ML provides plenty of tools to visualize the data and get to know it better. <img src="https://docs.google.com/drawings/d/1soCBzrrpB1A7QXvoh3Mj4VsLvdpF2vRsV2u7fx8XreA/pub?w=960&amp;h=720" alt="data exploration" title="Data exploration!" class="center-image" /></p> <p>In order to identify relations among the features of datasets, compute statistical metrics such as mean, standard deviation and skewness, identify class imbalances and frequencies, and visualize how the data is structured, WSO2 ML provides the following graphs, which are considered standard methods of data visualization.</p> <ul> <li>Scatter plots</li> <li>Parallel sets</li> <li>Trellis charts</li> <li>Cluster diagrams</li> <li>Histograms and pie charts</li> </ul> <p>See more about <a href="https://docs.wso2.com/display/ML100/Exploring+Data">exploring data at the WSO2 ML documentation</a>.</p> <h3 id="data-preprocessing-and-feature-engineering">3. Data preprocessing and feature engineering</h3> <p>This is considered the key to building a successful machine learning model within the data analytics community. 
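One of the simplest preprocessing techniques in that category is mean imputation of missing values. A minimal, illustrative NumPy sketch of what mean imputation does (this is my own example, not WSO2 ML code):

```python
import numpy as np

def mean_impute(column):
    """Replace missing values (NaN) in a 1-D column with the column mean.

    Illustrative sketch of mean imputation only; not WSO2 ML code.
    """
    column = np.asarray(column, dtype=float)
    mean = np.nanmean(column)          # mean computed over non-missing values
    column[np.isnan(column)] = mean    # fill the gaps with that mean
    return column

print(mean_impute([1.0, np.nan, 3.0]))  # prints [1. 2. 3.]
```

The alternative mentioned here, discarding, would simply drop the rows containing NaN instead of filling them in.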
WSO2 ML version 1.0.0 provides some primitive data preprocessing techniques such as feature selection and missing value handling with mean imputation and discarding.</p> <h3 id="learning-algorithms">4. Learning algorithms</h3> <p>This initial release of WSO2 ML supports supervised and unsupervised learning. It has a set of algorithms for numerical prediction, classification and clustering.</p> <ul> <li>Numerical prediction - Linear Regression, Ridge Regression, Lasso Regression</li> <li>Classification - Logistic Regression, Naive Bayes, Decision Tree, Random Forest and Support Vector Machines</li> <li>Clustering - K-Means</li> </ul> <p>For each of these learning algorithms, users can set the parameters that specify how the algorithm should behave with respect to convergence, regularization, information gain, etc. These are called hyper-parameters. </p> <h3 id="prediction">5. Prediction</h3> <p>Users can use trained models to predict values. Feature values can be provided via the UI, or a file containing feature values can be uploaded to obtain predictions for the entire set of records in the file.</p> <h3 id="model-evaluation-and-comparison">6. Model evaluation and comparison</h3> <p>WSO2 ML allows you to retain a fraction of the training data and evaluate your trained models against it. Model evaluation is a crucial task when determining the <strong>right learning algorithm</strong> and the <strong>right hyper-parameters</strong>. It helps users gain insight into the models they have trained and make decisions based on interpretations of the results. Some of the evaluation metrics are:</p> <ul> <li>Accuracy</li> <li>Area under ROC curve</li> <li>Confusion Matrix</li> <li>Predicted vs. 
Actual graphs</li> <li>Feature importances</li> </ul> <p><img src="https://docs.wso2.com/download/attachments/45949563/screencapture-10-100-7-51-9443-ml-site-analysis-view-model-jag-1443421639332.png?version=1&amp;modificationDate=1443466811000&amp;api=v2" alt="Model evaluation" title="Model Evaluation" class="center-image" /></p> <p>Moreover, WSO2 ML allows you to compare the models you have trained and choose the best fit among them.</p> <p>Read more about <a href="https://docs.wso2.com/display/ML100/Evaluating+Models">model evaluation at the WSO2 ML documentation</a>.</p> <h3 id="additional-functionalities">7. Additional functionalities</h3> <p>In addition to standard machine learning tasks, WSO2 ML provides capabilities such as downloading trained models, sending e-mail notifications to users when model building is complete, publishing trained models to the registry, etc.</p> <h1 id="hci-aspects-and-userbility">2. HCI aspects and usability</h1> <p>Even though information technology has become a ubiquitous science in the modern community, there is still a huge gap between those who have data and those who can analyze it.</p> <p><img src="https://docs.google.com/drawings/d/1iRK5_3eN_KJU0nCu0YOHLVM6zJxQtvUDcJjT2_NOBDE/pub?w=537&amp;h=255" alt="gap" title="Gap" class="center-image" /></p> <p>Many industries suffer from a scarcity of data analytics knowledge workers, and data analytics is considered one of the most expensive skills in the current market.</p> <h3 id="i-keep-saying-the-sexy-job-in-next-ten-years-will-be-the-statisticians-"><strong><em>“I keep saying the sexy job in the next ten years will be statisticians”</em></strong></h3> <p><strong>- Hal Varian, Google chief economist, 2009</strong></p> <p>There is solid evidence that analytics skills will be among the most sought-after requirements in future industrial communities. But gaining such immense knowledge and experience in a short period of time is not plausible. 
WSO2 ML is here to help. For novice data analysts, it provides a verbose, highly interactive guided process, so anyone with a basic understanding of software can use WSO2 ML with ease.</p> <p><img src="https://docs.google.com/drawings/d/18kSP3WoqtwYH45nR4nT1ERiNTpIs6GwpqUvaH6tEMJQ/pub?w=960&amp;h=720" alt="Guided process" title="Guided Process" class="center-image" /></p> <p>It is well known that software frameworks fail without proper human-computer interaction design. WSO2 ML is designed so that anyone can quickly grasp what to do with it and how to get what they want out of it. It makes the data mining workflow easier with rich graphical user interfaces and essential assistance, such as descriptions of visualization techniques and tooltips that explain the effect of the hyper-parameters of learning algorithms. Moreover, WSO2 ML presents an elaborate and easy-to-follow <a href="https://docs.wso2.com/display/ML100/Quick+Start+Guide">quick start guide in the documentation</a>.</p> <p><img src="https://docs.google.com/drawings/d/1ENic1CExtNyZK87A_eLy5-Ui6Y6v7Z5HCoWPL62Xlt4/pub?w=472&amp;h=274" alt="Help and Tooltips" title="Help and Tooltips" class="center-image" /></p> <p>For data scientists with greater experience and knowledge of data analytics, WSO2 ML provides advanced data visualizations and preprocessing techniques which can be useful in critical decision making. With the evaluations and feedback it provides, data analysts can identify the discrepancies in their models and improve them to obtain the best results.</p> <h1 id="scalability">3. Scalability</h1> <p>In the modern community, scalability is one of the most decisive factors when selecting software solutions for industrial/enterprise domains, due to the sheer volume of data available. Data analytics tools also need to adhere to the scalability requirements of industries in order to provide the best support. 
WSO2 ML is built on top of <a href="http://spark.apache.org/">Apache Spark</a>, one of the most successful frameworks for large scale data processing. This allows WSO2 ML to extend its scalability in several directions.</p> <h3 id="standalone-deployment">1. Standalone deployment</h3> <p>This mode starts WSO2 ML as a single JVM instance, and all processing is done locally on the same machine. This setting is recommended for small scale tasks such as learning WSO2 ML and analyzing small or sample datasets. It is not recommended to use the standalone mode for processor-intensive or memory-intensive tasks with large datasets; WSO2 ML provides other deployment patterns to support such cases.</p> <h3 id="with-external-spark-cluster">2. With an external Spark cluster</h3> <p>Running WSO2 ML with an external Spark cluster brings maximum efficiency to machine learning tasks and utilizes the cluster's resources to reach the highest achievable performance. By allocating sufficient resources to the Spark cluster, WSO2 ML can perform predictive analytics on datasets of any preferred size without performance worries. This setting is recommended for very large datasets and CPU-intensive algorithms.</p> <h3 id="with-wso2-data-analytics-server-as-spark-cluster">3. With WSO2 Data Analytics Server as the Spark cluster</h3> <p>This setting is similar to working with an external Spark cluster, but when the entire WSO2 data analytics platform is available, it is preferable to use WSO2 Data Analytics Server as the Spark cluster for WSO2 ML. This avoids the trouble of setting up an external Spark cluster and the extra overhead associated with it.</p> <p>You can read more about the <a href="https://docs.wso2.com/display/ML100/Deployment+Patterns">deployment patterns at the WSO2 ML documentation</a>.</p> <h1 id="lisence---availability-to-users">4. 
License - Availability to users</h1> <p>One of the prominent aspects that makes WSO2 ML distinct is that it is free and open source under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License 2.0</a>. Therefore, any interested person can get the code, play with it and add or change anything they want. Interested developers can even contribute to WSO2 ML by resolving issues and improving it.</p> <p>(Refer to <a href="https://wso2.org/jira/browse/ML">WSO2 ML Jira</a> and <a href="https://wso2.org/jira/browse/CARBON">WSO2 Carbon Jira</a>)</p> <h1 id="compatibility-and-use-cases">5. Compatibility and use cases</h1> <p>Some of the ways (but not limited to these) in which WSO2 ML can be used are as follows:</p> <h3 id="rest-api">REST API</h3> <p>WSO2 ML has a rich RESTful API which allows users to invoke all the functionalities of ML externally. This API can be used to perform tasks from uploading data up to prediction and model evaluation. You can get a sound idea of what this REST API is capable of by referring to the <a href="https://docs.wso2.com/display/ML100/REST+API+Guide">REST API guide of WSO2 ML</a>.</p> <h3 id="working-with-tables-in-wso2-data-analytics-server">Working with tables in WSO2 Data Analytics Server</h3> <p>Most of the data accumulated in today's industries cannot be stored in typical file systems in a properly structured way; often, the data has to be stored in distributed environments. WSO2 Data Analytics Server provides the capability to store data in tables and allows users to perform SQL operations on the uploaded data. WSO2 ML can directly import these tables as its datasets. 
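As a rough illustration of how such a RESTful API is typically driven from code, the sketch below assembles a prediction request. The base URL, resource path and payload field below are hypothetical placeholders invented for this illustration, not the actual WSO2 ML API; consult the REST API guide of WSO2 ML for the real resource paths and payload formats.

```python
import json

# Hypothetical base URL: WSO2 servers conventionally serve HTTPS on port
# 9443, but treat this value as an assumption, not documentation.
BASE_URL = "https://localhost:9443/api"

def build_predict_request(model_id, feature_rows):
    """Assemble the URL and JSON body for a (hypothetical) prediction call.

    The "/models/<id>/predict" path and the "data" field are illustrative
    only; check the official REST API guide for the real schema.
    """
    url = "%s/models/%s/predict" % (BASE_URL, model_id)
    body = json.dumps({"data": feature_rows})
    return url, body

url, body = build_predict_request(42, [[5.1, 3.5, 1.4, 0.2]])
print(url)  # https://localhost:9443/api/models/42/predict
```

The returned URL and body could then be sent with any HTTP client (curl, the Python requests library, etc.).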
Read more on how to use <a href="https://docs.wso2.com/display/ML100/Integration+with+WSO2+Data+Analytics+Server">WSO2 DAS tables in the WSO2 ML documentation</a>.</p> <h3 id="real-time-prediction-with-wso2-complex-event-processor">Real time prediction with WSO2 Complex Event Processor</h3> <p>The most vital duty of predictive machine learning models is to make predictions on previously unseen data, and with all the smart technologies we have today, analyzing data in real time has become a major requirement. Hence, making predictions on real time data can be considered the most eminent usage of machine learning models nowadays. WSO2 ML provides a feasible solution for this task with its extension for WSO2 Complex Event Processor. Read more about <a href="https://docs.wso2.com/display/ML100/WSO2+CEP+Extension+for+ML+Predictions">real time prediction with the WSO2 CEP extension at the WSO2 ML documentation</a>.</p> <h1 id="whats-next">What’s next?</h1> <p>WSO2 ML is still in its infancy and has a long journey ahead with many directions to venture in. Some of the features expected to be incorporated into future versions of WSO2 ML are as follows:</p> <ol> <li>PMML support</li> <li>Deep Learning</li> <li>Advanced data preprocessing methods and data wrangler support</li> <li>Recommendation systems</li> <li>Anomaly detection algorithms</li> <li>Dimensionality reduction methods</li> <li>Ensemble methods</li> <li>Natural Language Processing and many more!</li> </ol> Sun, 04 Oct 2015 00:00:00 +0000 http://maheshakya.github.io/technology/2015/10/04/to-bridge-the-gaps.html technology A quick guide to plotting with python and matplotlib - part 2 <p>In the <a href="/miscellaneous/2015/06/04/a-quick-guide-plotting-with-python-and-matplotlib.html">previous part</a> of this guide, we discussed how to create the basic components of a plot. It’s time to move into some advanced material. 
This is a continuation of the <a href="/miscellaneous/2015/06/04/a-quick-guide-plotting-with-python-and-matplotlib.html">part 1</a> of this guide, assuming that you have read and grasped it already.</p> <h1 id="a-bar-chart-grouped-accoring-to-some-variable">A bar chart grouped according to some variable</h1> <p>In this plot, a bar chart is created with the bars grouped by a variable, each group having its own similar components. Once you see what it looks like after plotting, it will be easy to analyze the code and learn from it.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> <span class="c"># Defines the sizes of the axes</span> <span class="n">plt</span><span class="o">.</span><span class="n">axis</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">14</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span><span class="mi">140</span><span class="p">])</span> <span class="n">p1</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">Rectangle</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="s">&quot;crimson&quot;</span><span class="p">)</span> <span class="n">p2</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">Rectangle</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="mf">0.1</span><span
class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="s">&quot;burlywood&quot;</span><span class="p">)</span> <span class="n">p3</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">Rectangle</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="s">&quot;chartreuse&quot;</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">((</span><span class="n">p1</span><span class="p">,</span> <span class="n">p2</span><span class="p">,</span> <span class="n">p3</span><span class="p">),</span> <span class="p">(</span><span class="s">&#39;category 1&#39;</span><span class="p">,</span><span class="s">&#39;category 2&#39;</span><span class="p">,</span><span class="s">&#39;category 3&#39;</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="s">&#39;upper left&#39;</span><span class="p">)</span> <span class="c"># Defines labels for each group</span> <span class="n">labels</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39; &#39;</span><span class="p">,</span> <span class="s">&#39; 1&#39;</span><span class="p">,</span> <span class="s">&#39; &#39;</span><span class="p">,</span> <span class="s">&#39; &#39;</span><span class="p">,</span> <span class="s">&#39; 2&#39;</span><span class="p">,</span> <span class="s">&#39; &#39;</span><span class="p">,</span> <span class="s">&#39; &#39;</span><span class="p">,</span> <span class="s">&#39; 4&#39;</span><span class="p">,</span> <span class="s">&#39; &#39;</span><span class="p">]</span> <span class="c"># Creates discrete values for x co-ordinates 
(widths of the bars)</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">10</span><span class="p">,</span><span class="mi">11</span><span class="p">,</span><span class="mi">12</span><span class="p">])</span> <span class="o">+</span> <span class="mi">1</span> <span class="c"># Defines some random set of values for y (heights of the bars)</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">55.433</span><span class="p">,</span> <span class="mf">55.855</span><span class="p">,</span> <span class="mf">55.719</span><span class="p">,</span> <span class="mf">55.433</span><span class="p">,</span> <span class="mf">90.199</span><span class="p">,</span> <span class="mf">93.563</span><span class="p">,</span> <span class="mf">55.433</span><span class="p">,</span> <span class="mf">104.807</span><span class="p">,</span> <span class="mf">106.693</span><span class="p">])</span> <span class="c"># Replaces the names in the x-axis with labels</span> <span class="n">plt</span><span class="o">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span> <span class="c"># Creates the bar chart</span> <span class="n">plt</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">left</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span 
class="n">y</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="p">[</span><span class="s">&#39;crimson&#39;</span><span class="p">,</span> <span class="s">&#39;burlywood&#39;</span><span class="p">,</span> <span class="s">&#39;chartreuse&#39;</span><span class="p">])</span> <span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">(</span><span class="n">which</span><span class="o">=</span><span class="s">&#39;both&#39;</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">&#39;This is your y-axis&#39;</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">&#39;This is my x-axis&#39;</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">&quot;This is our title&quot;</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span></code></pre></div> <p><img src="https://docs.google.com/drawings/d/17Fx3dWiy6119BrOkSOyTLZWY01ekxIQihSIIKlk8lDA/pub?w=960&amp;h=720" alt="plot_4" /></p> <p>Now we’ll analyze each new individual component of this code piece. Starting from the beginning:</p> <ul> <li><code>plt.axis([0,14, 0,140])</code> sets the limits of the x-axis from \(0\) to \(14\) and the limits of the y-axis from \(0\) to \(140\)</li> <li><code>labels = [' ', ' 1', ' ', ' ', ' 2', ' ', ' ', ' 4', ' ']</code> is used to create a representation for each group name. Here, there are \(9\) elements in the list, since there are \(9\) bars. Those bars need to be grouped into \(3\) groups, hence for each group a label is given. 
Each label is displayed right below the middle bar of each group.</li> <li><code>plt.xticks(x, labels)</code> replaces the display names of the values on the x-axis with the labels, but the actual co-ordinates of the bars remain the same as the previously defined x-axis values.</li> <li><code>plt.bar(left = x, height=y, color=['crimson', 'burlywood', 'chartreuse'])</code> is where the actual bars are plotted. For the parameter <code>left</code>, the x values are given so that the left edges of the bars are placed at the corresponding elements of <code>x</code>. For the <code>height</code> parameter, the y values are given. The <code>color</code> parameter takes the given set of colors and applies them to the bars in a cyclic order. Here, only \(3\) colors are given, so those colors rotate around the \(9\) bars exactly \(3\) times.</li> </ul> Thu, 04 Jun 2015 00:00:00 +0000 http://maheshakya.github.io/miscellaneous/2015/06/04/a-quick-guide-plotting-with-python-and-matplotlib-2.html miscellaneous A quick guide to plotting with python and matplotlib - part 1 <p>Representation of information and data visualization can often be considered one of the most indispensable feats, requiring the subtlest attention, in the present community of science and technology. So that’s that. Now let’s hop into plotting!</p> <p>Python has this amazing library called <a href="http://matplotlib.org/"><strong>matplotlib</strong></a> where you can create plots of almost everything you could ever wish for (yes, it now supports almost all sorts of plots that can be drawn with <a href="http://www.r-project.org/"><strong>R</strong></a>). But for a novice, this could be a little tricky at the beginning. Figuring out how to draw exactly what you might need is kind of a headache, since you don’t have much experience in manipulating the resources packaged with this library. 
The documentation indeed provides a nice overview of what this library is capable of, but still one might want to create a simple, yet weird plot with which no documentation or <a href="http://stackoverflow.com/"><em>Stack Overflow</em></a> answer could ever help (I guess I’m one of them, and there are many others I know as well). So, let’s try out some fundamental techniques first and then move on to the deeper ones. These are some of the graphs I wanted to plot in many different circumstances. I assume these would provide you at least some assistance in creating your own plots. I’ll be using the <a href="http://www.numpy.org/"><strong>numpy</strong></a> library as well to create the required data in these demonstrations. <em>Note that <strong>matplotlib</strong> and <strong>numpy</strong> are imported in advance.</em></p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span></code></pre></div> <h1 id="lines-connecting-scatter-points-with-different-marker-for-each-line">Lines connecting scatter points with different marker for each line</h1> <p>The term <code>marker</code> refers to a symbol that represents a point in a graph. There are numerous markers in <strong>matplotlib</strong>, so we will choose a few of them to demonstrate this. The typical syntax for a scatter plot is <code>plt.scatter(x, y, marker=m)</code> where <code>x</code> is the set of x co-ordinates, <code>y</code> is the set of y co-ordinates (these are compulsory arguments for the <code>scatter</code> function) and <code>m</code> is the marker we are going to use. 
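For instance, a minimal self-contained scatter call with the 'o' marker (the data here is made up, and the non-interactive Agg backend is used so the script runs headless):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: render to a file, no display needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [v * v for v in x]         # made-up sample points on y = x^2
plt.scatter(x, y, marker='o')  # circle markers at each (x, y) point
plt.savefig('scatter_demo.png')
```

In an interactive session you would call `plt.show()` instead of `plt.savefig`.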
Here are some example markers:</p> <ul> <li>‘o’</li> <li>‘+’</li> <li>‘v’</li> <li>‘*’</li> </ul> <p>Let’s plot now.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># There are 3 lines</span> <span class="n">num_of_lines</span> <span class="o">=</span> <span class="mi">3</span> <span class="c"># Defines a colour for each line</span> <span class="n">colours</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;c&#39;</span><span class="p">,</span> <span class="s">&#39;crimson&#39;</span><span class="p">,</span> <span class="s">&#39;chartreuse&#39;</span><span class="p">]</span> <span class="c"># Defines a marker for each line</span> <span class="n">markers</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;o&#39;</span><span class="p">,</span> <span class="s">&#39;v&#39;</span><span class="p">,</span> <span class="s">&#39;*&#39;</span><span class="p">]</span> <span class="c"># Creates x array with numbers ranging from 0 to 10 (exclusive)</span> <span class="c"># Creates an empty list for y co-ordinates in each line</span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> <span class="n">y</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># For each line</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_of_lines</span><span class="p">):</span> <span class="c"># Adds to y according to y=ix+1 function</span> <span class="n">y</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="c"># This is where plotting 
happens!!!</span> <span class="c"># For each line</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_of_lines</span><span class="p">):</span> <span class="c"># Scatter plot with point_size^2 = 75, and with respective colors</span> <span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">marker</span><span class="o">=</span><span class="n">markers</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">s</span><span class="o">=</span><span class="mi">75</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="n">colours</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="c"># Connects points with lines, and with respective colours</span> <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">colours</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="c"># Show grid in the plot</span> <span class="n">plt</span><span class="o">.</span><span class="n">grid</span><span class="p">()</span> <span class="c"># Finally, display the plot</span> <span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span></code></pre></div> <p><img src="https://docs.google.com/drawings/d/1f4e4RIevPIcHI-ZVTzPRC3fh6gsG6qz1hGMeJ9Y_ajk/pub?w=960&amp;h=720" alt="plot_1" /></p> <p>Fabulous, isn’t it? 
Now we shall add a simple <strong>legend</strong> to this plot. This is what we are going to do. The upper left corner seems like an open space. So, let’s add the legend there. A <code>Rectangle</code> is used to represent an entry in the legend.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Creates 3 Rectangles</span> <span class="n">p1</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">Rectangle</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="n">colours</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="n">p2</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">Rectangle</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="n">colours</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="n">p3</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">Rectangle</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="n">fc</span><span class="o">=</span><span class="n">colours</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span> <span class="c"># Adds the legend into plot</span> <span 
class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">((</span><span class="n">p1</span><span class="p">,</span> <span class="n">p2</span><span class="p">,</span> <span class="n">p3</span><span class="p">),</span> <span class="p">(</span><span class="s">&#39;line1&#39;</span><span class="p">,</span> <span class="s">&#39;line2&#39;</span><span class="p">,</span> <span class="s">&#39;line3&#39;</span><span class="p">),</span> <span class="n">loc</span><span class="o">=</span><span class="s">&#39;best&#39;</span><span class="p">)</span></code></pre></div> <p>In the <code>legend</code> function, in addition to the rectangles and the names of the entries, it is possible to specify the location of the legend as well. <strong>‘best’</strong> lets matplotlib pick the position that overlaps the data the least (in this case, the upper left corner). The other locations are as follows:</p> <table> <thead> <tr> <th style="text-align: center">Position</th> <th style="text-align: center">#</th> </tr> </thead> <tbody> <tr> <td style="text-align: center">‘best’</td> <td style="text-align: center">0</td> </tr> <tr> <td style="text-align: center">‘upper right’</td> <td style="text-align: center">1</td> </tr> <tr> <td style="text-align: center">‘upper left’</td> <td style="text-align: center">2</td> </tr> <tr> <td style="text-align: center">‘lower left’</td> <td style="text-align: center">3</td> </tr> <tr> <td style="text-align: center">‘lower right’</td> <td style="text-align: center">4</td> </tr> <tr> <td style="text-align: center">‘right’</td> <td style="text-align: center">5</td> </tr> <tr> <td style="text-align: center">‘center left’</td> <td style="text-align: center">6</td> </tr> <tr> <td style="text-align: center">‘center right’</td> <td style="text-align: center">7</td> </tr> <tr> <td style="text-align: center">‘lower center’</td> <td style="text-align: center">8</td> </tr> <tr> <td style="text-align: center">‘upper center’</td> <td style="text-align: 
center">9</td> </tr> <tr> <td style="text-align: center">‘center’</td> <td style="text-align: center">10</td> </tr> </tbody> </table> <p><em>Note that you can even use the corresponding number to specify the location.</em></p> <p>Now, simply add the legend code segment just before the <code>plt.show()</code> in the first code. You will see that there is a nice legend at the upper left corner.</p> <p><img src="https://docs.google.com/drawings/d/1TKLhYlSUnMBKrMvQOIbkwR968EbDWX6MpNI_PFJUSuQ/pub?w=960&amp;h=720" alt="plot_2" /></p> <p>Only a little work is left to do… What? Naming the axes and entitling the plot. It takes only \(3\) lines of code. Add these lines just above the <code>plt.show()</code> function.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Sets x-axis</span> <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">&#39;This is my x-axis&#39;</span><span class="p">)</span> <span class="c"># Sets y-axis</span> <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">&#39;This is your y-axis&#39;</span><span class="p">)</span> <span class="c"># Sets title</span> <span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">&quot;This is our title&quot;</span><span class="p">)</span></code></pre></div> <p><img src="https://docs.google.com/drawings/d/1uQmm-dhvkYS65xjugbT35EHcas5ywsawx0_2eFTC19c/pub?w=960&amp;h=720" alt="plot_3" /></p> <p>Now you know how to create the basic components of a plot. This will be the end of the first part of this guide. 
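For easy reference, the pieces above can be combined into one self-contained script. This is simply the walkthrough's own code gathered in one place (same data, markers, legend, labels, and title); it assumes only that numpy and matplotlib are installed.

```python
import numpy as np
import matplotlib.pyplot as plt

num_of_lines = 3
colours = ['c', 'crimson', 'chartreuse']
markers = ['o', 'v', '*']

# x values 0..9; one y array per line, following y = i*x + 1
x = np.arange(10)
y = [x * i + 1 for i in range(num_of_lines)]

for i in range(num_of_lines):
    # Scatter points and connect them, using the respective marker and colour
    plt.scatter(x, y[i], marker=markers[i], s=75, c=colours[i])
    plt.plot(x, y[i], c=colours[i])

# Legend entries represented by coloured rectangles, placed automatically
handles = [plt.Rectangle((0, 0), 0.1, 0.1, fc=c) for c in colours]
plt.legend(handles, ('line1', 'line2', 'line3'), loc='best')

plt.xlabel('This is my x-axis')
plt.ylabel('This is your y-axis')
plt.title("This is our title")
plt.grid()
plt.show()
```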
More interesting stuff will follow in the next parts.</p> Thu, 04 Jun 2015 00:00:00 +0000 http://maheshakya.github.io/miscellaneous/2015/06/04/a-quick-guide-plotting-with-python-and-matplotlib.html http://maheshakya.github.io/miscellaneous/2015/06/04/a-quick-guide-plotting-with-python-and-matplotlib.html miscellaneous Performance comparison among LSH Forest, ANNOY and FLANN <p>Finally, it is time to compare the performance of Locality Sensitive Hashing Forest (the approximate nearest neighbor search implementation in <a href="http://scikit-learn.org/stable/">scikit-learn</a>), <a href="https://github.com/spotify/annoy">Spotify Annoy</a> and <a href="http://www.cs.ubc.ca/research/flann/">FLANN</a>. </p> <h2 id="criteria">Criteria</h2> <p>Synthetic datasets of different sizes (varying <code>n_samples</code> and <code>n_features</code>) are used for this evaluation. For each dataset, the following measures were calculated.</p> <ol> <li>Index building time of each ANN (Approximate Nearest Neighbor) implementation.</li> <li>Accuracy of nearest neighbor queries, along with their query times.</li> </ol> <p>The Python code used for this evaluation can be found in this <a href="https://gist.github.com/maheshakya/b7bcf915c9d5bab89d8d">Gist</a>. The parameters of <code>LSHForest</code> (<code>n_estimators=10</code> and <code>n_candidates=50</code>) are kept fixed during this experiment. Accuracies can be raised by tuning these parameters.</p> <h2 id="results">Results</h2> <p>For each dataset, two graphs have been plotted according to the measures described in the section above. 
<img src="https://docs.google.com/drawings/d/1PqPDygNOpIIicm4YEdN5FwLgECzPSa2toAsGsczIUP0/pub?w=960&amp;h=720" alt="n_samples=1000, n_features=100" /> <img src="https://docs.google.com/drawings/d/1v7mtz_xC_9njA-d3rzUYYfbJrzKr0kX5-sBfVOj-yCs/pub?w=960&amp;h=720" alt="n_samples=1000, n_features=500" /> <img src="https://docs.google.com/drawings/d/1gl65_N8mKJPGFardWwUt-8WXtqYbIeJM2jbAuk5qKJg/pub?w=960&amp;h=720" alt="n_samples=6000, n_features=3000" /> <img src="https://docs.google.com/drawings/d/1GIIjyHOG03WxNOEHhF4p19uP_9JZ9ctyYRRJp2HvCp4/pub?w=960&amp;h=720" alt="n_samples=10000, n_features=100" /> <img src="https://docs.google.com/drawings/d/1DFA6i-s661Liplq_lysUyz0KvIH4CnByA5jsEuCw7ig/pub?w=960&amp;h=720" alt="n_samples=10000, n_features=500" /> <img src="https://docs.google.com/drawings/d/14ta-LAR9KdPRaTDDJgKWYHpPsVhqvxCs9rjwHStbRz0/pub?w=960&amp;h=720" alt="n_samples=10000, n_features=1000" /> <img src="https://docs.google.com/drawings/d/1KoQ3wXMb-rm4ZGW8d2mMGCcsvxGPU1lgSJTCA0IbRec/pub?w=960&amp;h=720" alt="n_samples=10000, n_features=6000" /> <img src="https://docs.google.com/drawings/d/1RIr-jf818iQIIaOOi5C_sX87ynAR6cBVI1FF25Wbq9o/pub?w=960&amp;h=720" alt="n_samples=50000, n_features=5000" /> <img src="https://docs.google.com/drawings/d/1onzy44K6Yk4CcX4msTGLJK61UFNOqkkEzZLH3yZrKtY/pub?w=960&amp;h=720" alt="n_samples=100000, n_features=1000" /> It is evident that the index building times of LSH Forest and FLANN are negligible compared to that of Annoy for almost all the datasets. Moreover, for larger datasets, LSH Forest outperforms Annoy by large margins with respect to accuracy and query speed. 
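As a side note for readers who do not open the Gist: the accuracy measure used in these graphs is the fraction of returned approximate neighbors that also appear in the exact neighbor set computed by brute-force search. A minimal standalone sketch of that computation follows; the function name and the index arrays are made-up illustrations, not taken from the benchmark.

```python
import numpy as np

def accuracy(approx_neighbors, exact_neighbors):
    """Fraction of approximate neighbors that are also exact neighbors."""
    intersection = np.intersect1d(approx_neighbors, exact_neighbors).shape[0]
    return intersection / float(len(exact_neighbors))

# Hypothetical neighbor indices returned for one query point
exact = np.array([4, 8, 15, 16, 23, 42, 7, 99, 3, 12])
approx = np.array([4, 8, 15, 16, 23, 42, 51, 60, 3, 12])  # 8 of 10 match

print(accuracy(approx, exact))  # 0.8
```

Averaging this ratio over many random query points gives the accuracy values plotted above.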
Observations from these graphs show that LSH Forest is competitive with FLANN for large datasets.</p> Sun, 17 Aug 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/08/17/performance-comparison-among-lsh-forest-annoy-and-flann.html http://maheshakya.github.io/gsoc/2014/08/17/performance-comparison-among-lsh-forest-annoy-and-flann.html gsoc A demonstration of the usage of Locality Sensitive Hashing Forest in approximate nearest neighbor search <p>This is a demonstration of how to use the approximate nearest neighbor search implementation based on locality sensitive hashing in <a href="http://scikit-learn.org/stable/">scikit-learn</a>, and an illustration of the behavior of nearest neighbor queries as the parameters vary. This implementation has an API which is essentially the same as that of the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors">NearestNeighbors</a> module, as approximate nearest neighbor search is used to speed up queries at the cost of accuracy when the database is very large.</p> <p>Before beginning the demonstration, some background has to be set. 
First, the required modules are loaded and a synthetic dataset is created for testing.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">time</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> <span class="kn">from</span> <span class="nn">sklearn.datasets.samples_generator</span> <span class="kn">import</span> <span class="n">make_blobs</span> <span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">LSHForest</span> <span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">NearestNeighbors</span> <span class="c"># Initialize size of the database, iterations and required neighbors.</span> <span class="n">n_samples</span> <span class="o">=</span> <span class="mi">10000</span> <span class="n">n_features</span> <span class="o">=</span> <span class="mi">100</span> <span class="n">n_iter</span> <span class="o">=</span> <span class="mi">30</span> <span class="n">n_neighbors</span> <span class="o">=</span> <span class="mi">100</span> <span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span> <span class="c"># Generate sample data</span> <span class="n">X</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">make_blobs</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="n">n_samples</span><span class="p">,</span> <span class="n">n_features</span><span class="o">=</span><span class="n">n_features</span><span class="p">,</span> <span class="n">centers</span><span class="o">=</span><span class="mi">10</span><span 
class="p">,</span> <span class="n">cluster_std</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span></code></pre></div> <p>There are two main parameters which affect queries in the LSH Forest implementation. </p> <ol> <li><code>n_estimators</code> : Number of trees in the LSH Forest.</li> <li><code>n_candidates</code> : Number of candidates chosen from each tree for distance calculation.</li> </ol> <p>In the first experiment, average accuracies are measured as the value of <code>n_estimators</code> varies, while <code>n_candidates</code> is kept fixed. <code>sklearn.neighbors.NearestNeighbors</code> is used to obtain the exact neighbors, against which the returned approximate neighbors are compared.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Set `n_estimators` values</span> <span class="n">n_estimators_values</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int</span><span class="p">)</span> <span class="n">accuracies_trees</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">n_estimators_values</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">)</span> <span class="c"># Calculate average accuracy for each value of `n_estimators`</span> <span 
class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">n_estimators</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">n_estimators_values</span><span class="p">):</span> <span class="n">lshf</span> <span class="o">=</span> <span class="n">LSHForest</span><span class="p">(</span><span class="n">n_candidates</span><span class="o">=</span><span class="mi">500</span><span class="p">,</span> <span class="n">n_estimators</span><span class="o">=</span><span class="n">n_estimators</span><span class="p">,</span> <span class="n">n_neighbors</span><span class="o">=</span><span class="n">n_neighbors</span><span class="p">)</span> <span class="n">nbrs</span> <span class="o">=</span> <span class="n">NearestNeighbors</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">=</span><span class="n">n_neighbors</span><span class="p">,</span> <span class="n">algorithm</span><span class="o">=</span><span class="s">&#39;brute&#39;</span><span class="p">)</span> <span class="n">lshf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="n">nbrs</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_iter</span><span class="p">):</span> <span class="n">query</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">rng</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n_samples</span><span class="p">)]</span> <span class="n">neighbors_approx</span> <span class="o">=</span> <span class="n">lshf</span><span class="o">.</span><span 
class="n">kneighbors</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">return_distance</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="n">neighbors_exact</span> <span class="o">=</span> <span class="n">nbrs</span><span class="o">.</span><span class="n">kneighbors</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">return_distance</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="n">intersection</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">intersect1d</span><span class="p">(</span><span class="n">neighbors_approx</span><span class="p">,</span> <span class="n">neighbors_exact</span><span class="p">)</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">ratio</span> <span class="o">=</span> <span class="n">intersection</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">n_neighbors</span><span class="p">)</span> <span class="n">accuracies_trees</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">ratio</span> <span class="n">accuracies_trees</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">accuracies_trees</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">n_iter</span><span class="p">)</span></code></pre></div> <p>Similarly, average accuracy vs <code>n_candidates</code> is also measured.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Set `n_candidate` values</span> <span 
class="n">n_candidates_values</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int</span><span class="p">)</span> <span class="n">accuracies_c</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">n_candidates_values</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">)</span> <span class="c"># Calculate average accuracy for each value of `n_candidates`</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">n_candidates</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">n_candidates_values</span><span class="p">):</span> <span class="n">lshf</span> <span class="o">=</span> <span class="n">LSHForest</span><span class="p">(</span><span class="n">n_candidates</span><span class="o">=</span><span class="n">n_candidates</span><span class="p">,</span> <span class="n">n_neighbors</span><span class="o">=</span><span class="n">n_neighbors</span><span class="p">)</span> <span class="n">nbrs</span> <span class="o">=</span> <span class="n">NearestNeighbors</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">=</span><span class="n">n_neighbors</span><span class="p">,</span> <span class="n">algorithm</span><span class="o">=</span><span class="s">&#39;brute&#39;</span><span class="p">)</span> <span class="c"># Fit 
the Nearest neighbor models</span> <span class="n">lshf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="n">nbrs</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_iter</span><span class="p">):</span> <span class="n">query</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">rng</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n_samples</span><span class="p">)]</span> <span class="c"># Get neighbors</span> <span class="n">neighbors_approx</span> <span class="o">=</span> <span class="n">lshf</span><span class="o">.</span><span class="n">kneighbors</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">return_distance</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="n">neighbors_exact</span> <span class="o">=</span> <span class="n">nbrs</span><span class="o">.</span><span class="n">kneighbors</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">return_distance</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="n">intersection</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">intersect1d</span><span class="p">(</span><span class="n">neighbors_approx</span><span class="p">,</span> <span class="n">neighbors_exact</span><span class="p">)</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span 
class="n">ratio</span> <span class="o">=</span> <span class="n">intersection</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">n_neighbors</span><span class="p">)</span> <span class="n">accuracies_c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">+=</span> <span class="n">ratio</span> <span class="n">accuracies_c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">accuracies_c</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">n_iter</span><span class="p">)</span></code></pre></div> <p>You can get a clear view of the behavior of queries from the following plots. <img src="https://docs.google.com/drawings/d/1NLa73eY1JDspoYBWW5bYl0JPoXTyA64JZWunQ4JYVO8/pub?w=960&amp;h=720" alt="accuracies_c_l" /></p> <p>The next experiment demonstrates the behavior of queries for different database sizes (<code>n_samples</code>).</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Initialize the range of `n_samples`</span> <span class="n">n_samples_values</span> <span class="o">=</span> <span class="p">[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mi">1000</span><span class="p">,</span> <span class="mi">10000</span><span class="p">,</span> <span class="mi">100000</span><span class="p">]</span> <span class="n">average_times</span> <span class="o">=</span> <span class="p">[]</span> <span class="c"># Calculate the average query time</span> <span class="k">for</span> <span class="n">n_samples</span> <span class="ow">in</span> <span class="n">n_samples_values</span><span class="p">:</span> <span class="n">X</span><span class="p">,</span> <span class="n">labels_true</span> <span 
class="o">=</span> <span class="n">make_blobs</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="n">n_samples</span><span class="p">,</span> <span class="n">n_features</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">centers</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">cluster_std</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c"># Initialize LSHForest for queries of a single neighbor</span> <span class="n">lshf</span> <span class="o">=</span> <span class="n">LSHForest</span><span class="p">(</span><span class="n">n_candidates</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">n_neighbors</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">lshf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="n">average_time</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_iter</span><span class="p">):</span> <span class="n">query</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">rng</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">n_samples</span><span class="p">)]</span> <span class="n">t0</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> <span class="n">approx_neighbors</span> <span class="o">=</span> <span 
class="n">lshf</span><span class="o">.</span><span class="n">kneighbors</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">return_distance</span><span class="o">=</span><span class="bp">False</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="n">T</span> <span class="o">=</span> <span class="n">time</span><span class="o">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">t0</span> <span class="n">average_time</span> <span class="o">=</span> <span class="n">average_time</span> <span class="o">+</span> <span class="n">T</span> <span class="n">average_time</span> <span class="o">=</span> <span class="n">average_time</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">n_iter</span><span class="p">)</span> <span class="n">average_times</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">average_time</span><span class="p">)</span></code></pre></div> <p>The <code>n_samples</code> space is defined as [10, 100, 1000, 10000, 100000]. The query time for a single neighbor is measured for these different values of <code>n_samples</code>. <img src="https://docs.google.com/drawings/d/1KZwjo9Cx33kK-kTNayAT-eb79TPCbvX-kidyoPO7VJA/pub?w=960&amp;h=720" alt="query_time_vs_n_samples" /></p> Sun, 17 Aug 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/08/17/a-demonstration-of-the-usage-of-locality-sensitive-hashing-forest-in-approximate-nearest-neighbor-search.html http://maheshakya.github.io/gsoc/2014/08/17/a-demonstration-of-the-usage-of-locality-sensitive-hashing-forest-in-approximate-nearest-neighbor-search.html gsoc Improvements for LSH Forest implementation and its applications <p>GSoC 2014 is coming to an end, but the LSH Forest implementation requires a little more work to be completed. Following is the list of tasks to be done. 
They will be completed during the next two weeks.</p> <h2 id="improving-lsh-forest-implementation">1. Improving LSH Forest implementation</h2> <p>I have got a lot of feedback from the <a href="http://scikit-learn.org/stable/">scikit-learn</a> community about my implementation of LSH Forest. Much of it concerns possible optimizations. Making those optimizations will bring a significant improvement in performance.</p> <h2 id="applying-lsh-forest-in-dbscanhttpenwikipediaorgwikidbscan">2. Applying LSH Forest in <a href="http://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a></h2> <p>The idea here is to speed up the clustering method using approximate neighbor search, rather than spending much time on exhaustive exact nearest neighbor search. DBSCAN requires the neighbors whose distance from a data point is less than a given radius. We use the term <code>radius neighbors</code> for this. As LSH Forest is implemented to adhere to the scikit-learn API, we have a <code>radius_neighbors</code> function in LSH Forest (<code>NearestNeighbors</code> in scikit-learn has a <code>radius_neighbors</code> function). Therefore, LSH Forest can be directly applied in place of exact nearest neighbor search. </p> <p>After this application, it will be benchmarked to analyze the performance. Approximate neighbors are more useful when the database is very large (often in the number of features as well as samples). So the benchmark will address the following question: at which sample sizes and numbers of features does approximate neighbor search reasonably outperform exact neighbor search?</p> <h2 id="documentation-and-wrapping-up-work">3. 
Documentation and wrapping up work</h2> <p>After completing the implementation and benchmarking, it will be documented according to the scikit-learn documentation standards.</p> Tue, 05 Aug 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/08/05/improvements-for-lsh-forest-implementation-and-its-applications.html http://maheshakya.github.io/gsoc/2014/08/05/improvements-for-lsh-forest-implementation-and-its-applications.html gsoc Testing LSH Forest <p>There are two types of tests to perform in order to ensure the correct functionality of the LSH Forest.</p> <ol> <li>Tests for individual functions of the LSHForest class.</li> <li>Tests for accuracy variation with the parameters of the implementation.</li> </ol> <p>scikit-learn provides a nice set of testing tools for this task. They are elaborated in the <a href="http://scikit-learn.org/stable/developers/utilities.html">utilities for developers</a> section. I have used the following assertions, which were imported from <code>sklearn.utils.testing</code>. Note that <code>numpy</code> is imported as <code>np</code>.</p> <ol> <li><code>assert_array_equal</code> - Compares each element in an array.</li> <li><code>assert_equal</code> - Compares two values.</li> <li><code>assert_raises</code> - Checks whether the given type of error is raised.</li> </ol> <h2 id="testing-individual-functions">Testing individual functions</h2> <h3 id="testing-fit-function">Testing <code>fit</code> function</h3> <p>This test ensures that the fit function does not work without the necessary arguments and that it produces the correct attributes in the class object (in terms of array dimensions).</p> <p>Suppose we initialize an LSHForest as <code>lshf = LSHForest()</code></p> <p>If the estimator is not fitted with proper data, it will raise a value error, which is tested as follows:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">X</span> <span class="o">=</span> <span 
class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span> <span class="n">dim</span><span class="p">)</span> <span class="n">lshf</span> <span class="o">=</span> <span class="n">LSHForest</span><span class="p">(</span><span class="n">n_trees</span><span class="o">=</span><span class="n">n_trees</span><span class="p">)</span> <span class="n">lshf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span></code></pre></div> <p>We define the sample size and the dimension of the dataset as <code>samples</code> and <code>dim</code> respectively, and the number of trees in the LSH forest as <code>n_trees</code>.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Test whether a value error is raised when X=None</span> <span class="n">assert_raises</span><span class="p">(</span><span class="ne">ValueError</span><span class="p">,</span> <span class="n">lshf</span><span class="o">.</span><span class="n">fit</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span></code></pre></div> <p>Then, after fitting the estimator, the following assertions should hold true.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># _input_array = X</span> <span class="n">assert_array_equal</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">lshf</span><span class="o">.</span><span class="n">_input_array</span><span class="p">)</span> <span class="c"># A hash function g(p) for each tree</span> <span class="n">assert_equal</span><span class="p">(</span><span class="n">n_trees</span><span class="p">,</span> <span class="n">lshf</span><span class="o">.</span><span class="n">hash_functions_</span><span class="o">.</span><span 
class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c"># Hash length = 32</span> <span class="n">assert_equal</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="n">lshf</span><span class="o">.</span><span class="n">hash_functions_</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="c"># Number of trees in the forest</span> <span class="n">assert_equal</span><span class="p">(</span><span class="n">n_trees</span><span class="p">,</span> <span class="n">lshf</span><span class="o">.</span><span class="n">_trees</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c"># Each tree has entries for every data point</span> <span class="n">assert_equal</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span> <span class="n">lshf</span><span class="o">.</span><span class="n">_trees</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="c"># Original indices after sorting the hashes</span> <span class="n">assert_equal</span><span class="p">(</span><span class="n">n_trees</span><span class="p">,</span> <span class="n">lshf</span><span class="o">.</span><span class="n">_original_indices</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="c"># Each set of original indices in a tree has entries for every data point</span> <span class="n">assert_equal</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span> <span class="n">lshf</span><span class="o">.</span><span class="n">_original_indices</span><span class="o">.</span><span class="n">shape</span><span 
class="p">[</span><span class="mi">1</span><span class="p">])</span></code></pre></div> <p>All the other tests for functions also contain the tests for valid arguments, therefore I am not going to describe them in those sections.</p> <h3 id="testing-kneighbors-function">Testing <code>kneighbors</code> function</h3> <p><code>kneighbors</code> tests are based on the number of neighbors returned, neighbors for a single data point and multiple data points.</p> <p>We define the required number of neighbors as <code>n_neighbors</code> and crate a LSHForest.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">n_neighbors</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">samples</span><span class="p">)</span> <span class="n">point</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">samples</span><span class="p">)]</span> <span class="n">neighbors</span> <span class="o">=</span> <span class="n">lshf</span><span class="o">.</span><span class="n">kneighbors</span><span class="p">(</span><span class="n">point</span><span class="p">,</span> <span class="n">n_neighbors</span><span class="o">=</span><span class="n">n_neighbors</span><span class="p">,</span> <span class="n">return_distance</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="c"># Desired number of neighbors should be returned.</span> <span class="n">assert_equal</span><span class="p">(</span><span class="n">neighbors</span><span class="o">.</span><span class="n">shape</span><span 
class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">n_neighbors</span><span class="p">)</span></code></pre></div> <p>For multiple data points, we define number of points as <code>n_points</code>:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">points</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">samples</span><span class="p">,</span> <span class="n">n_points</span><span class="p">)]</span> <span class="n">neighbors</span> <span class="o">=</span> <span class="n">lshf</span><span class="o">.</span><span class="n">kneighbors</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">n_neighbors</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">return_distance</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="n">assert_equal</span><span class="p">(</span><span class="n">neighbors</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">n_points</span><span class="p">)</span></code></pre></div> <p>The above tests ensures that the maximum hash length is also exhausted because the data points in the same data set are used in queries. But to ensure that hash lengths less than the maximum hash length also get involved, there is another test. 
</p> <div class="highlight"><pre><code class="language-python" data-lang="python"># Test a random point (not in the data set)
point = np.random.randn(dim)
lshf.kneighbors(point, n_neighbors=1, return_distance=False)</code></pre></div> <h3 id="testing-distances-of-the-kneighbors-function">Testing distances of the <code>kneighbors</code> function</h3> <p>Returned distances should be in sorted order. Suppose <code>distances</code> is the array of distances returned by the <code>kneighbors</code> function when the <code>return_distance</code> parameter is set to <code>True</code>; then the following should hold:</p> <div class="highlight"><pre><code class="language-python" data-lang="python">assert_array_equal(distances[0], np.sort(distances[0]))</code></pre></div> <h3 id="testing-insert-function">Testing <code>insert</code> function</h3> <p>Testing <code>insert</code> is somewhat similar to testing <code>fit</code>, because what we have to ensure are dimensions and sample sizes. The following assertions should hold true after fitting the LSHForest.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"># Insert wrong dimension
assert_raises(ValueError, lshf.insert,
              np.random.randn(dim - 1))
# Insert 2D array
assert_raises(ValueError, lshf.insert,
              np.random.randn(dim, 2))

lshf.insert(np.random.randn(dim))

# size of _input_array = samples + 1 after insertion
assert_equal(lshf._input_array.shape[0], samples + 1)
# size of _original_indices[1] = samples + 1
assert_equal(lshf._original_indices.shape[1], samples + 1)
# size of _trees[1] = samples + 1
assert_equal(lshf._trees.shape[1], samples + 1)</code></pre></div> <h2 id="testing-accuracy-variation-with-parameters">Testing accuracy variation with parameters</h2> <p>The accuracy of the results obtained from the queries depends on two major parameters.</p> <ol> <li>c value - <code>c</code>.</li> <li>number of trees - <code>n_trees</code>.</li> </ol> <p>Separate tests have been written to ensure that the accuracy variation is correct with these parameter variations.</p> <h3 id="testing-accuracy-against-c-variation">Testing accuracy against <code>c</code> variation</h3> <p>Accuracy should increase as the value of <code>c</code> is raised. Here, the average accuracy is calculated for different <code>c</code> values. 
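As a rough, illustrative sketch of how such an average accuracy can be computed (this is not the actual test code; approx and exact stand for neighbor-index arrays returned by the approximate search and an exact search respectively):

```python
import numpy as np

def average_accuracy(approx, exact):
    # For each query, the fraction of true neighbors that the
    # approximate search recovered, averaged over all queries.
    ratios = [len(np.intersect1d(a, e)) / float(len(e))
              for a, e in zip(approx, exact)]
    return np.mean(ratios)
```

Accuracies obtained this way for each value in an ascending parameter grid are what the sorted-order assertion checks.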
</p> <div class="highlight"><pre><code class="language-python" data-lang="python">c_values = np.array([10, 50, 250])</code></pre></div> <p>The calculated accuracy for each <code>c</code> value is then stored in an array. The following assertion should hold true in order to make sure that the higher the <code>c</code> value, the higher the accuracy.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"># Sorted accuracies should be equal to original accuracies
assert_array_equal(accuracies, np.sort(accuracies))</code></pre></div> <h3 id="testing-accuracy-against-ntrees-variation">Testing accuracy against <code>n_trees</code> variation</h3> <p>This is almost the same as the above test. First, the <code>n_trees</code> values are stored in ascending order.</p> <div class="highlight"><pre><code class="language-python" data-lang="python">n_trees = np.array([1, 10, 100])</code></pre></div> <p>After calculating the average accuracies, the assertion is performed.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"># Sorted accuracies should be equal to original accuracies
assert_array_equal(accuracies, np.sort(accuracies))</code></pre></div> <h2 id="what-is-left">What is left?</h2> <p>In addition to the above tests, precision should also be tested against <code>c</code> values and <code>n_trees</code>. But it has already been tested in the prototyping stage, and those tests consume a reasonably large amount of time, which makes them unsuitable for the scikit-learn test suite. 
Therefore, a separate benchmark will be done on this topic.</p> <h2 id="about-nose">About Nose</h2> <p>As provided in the guidelines on the scikit-learn <a href="http://scikit-learn.org/stable/developers/">contributors’ page</a>, the <a href="http://nose.readthedocs.org/en/latest/index.html">Nose</a> testing framework has been used to automate the testing process.</p> Thu, 24 Jul 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/07/24/testing-lsh-forest.html http://maheshakya.github.io/gsoc/2014/07/24/testing-lsh-forest.html gsoc Optimizations on Locality sensitive hashing forest data structure <p>In order to get the LSH forest approximate nearest neighbor search method into a competitive and useful state, <strong>optimization</strong> was a necessity. These optimizations target both the LSH forest algorithm and the implementation of that algorithm. There are two general bottlenecks where the requirement for optimization arises.</p> <ol> <li>Memory usage of the data structure.</li> <li>Speed of queries.</li> </ol> <p>Let’s discuss these two cases separately in detail. Remember that there will always be a trade-off between memory and speed.</p> <h2 id="optimizations-on-memory-usage">Optimizations on memory usage</h2> <p>In the <a href="/gsoc/2014/06/01/lsh-forest-with-sorted-arrays-and-binary-search.html">previous post about the implementation of the LSH forest</a>, I explained the basic data structure and its functions. Some of the optimizations caused obvious changes in those functions. First of all, I need to mention that this data structure stores the entire fitted data set, because it is required to calculate the actual distances of the candidates. So the memory consumption is usually dominated by the size of the data set; the LSH Forest data structure has no control over this. The only thing that can be controlled from the data structure is the set of hashes of the data points. 
</p> <p>The current data structure maintains a fixed-length hash (the user can define that length at the time of initialization). Hashes are comprised of binary values (<code>1</code>s and <code>0</code>s). At the moment, these hashes are stored as binary strings. </p> <p>One of the optimizations which can be done to maintain a more compact index for hashes is using the equivalent integer of the binary string. For example, the string <code>001110</code> can be represented as <code>14</code>. For very large data sets with very high dimensions, this technique has little effect, as the improvement in the memory usage of the hash indices is insignificant compared to the memory required to store the data set. </p> <p>As I have mentioned earlier, there is always a cost for every optimization. Here we have to trade off speed for improved memory consumption. The LSH algorithm is applied to a single data point once per bit of the required hash length. In the default case, there will be <code>32</code> hashes per data point in a single tree, and those hashed values have to be concatenated in order to create \(g(p)\) (refer to the <a href="http://ilpubs.stanford.edu:8090/678/1/2005-14.pdf">LSH Forest paper</a>). Because the hashing is done separately for the individual elements of \(g(p)\), they are stored in an array in the first place. In order to create a single hash from the values in this array, they are first concatenated as a string. This is the point where the memory-speed dilemma occurs. These string hashes can be converted into integer hashes for compact indices, but that conversion costs extra time. 
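For illustration, the concatenation and conversion steps described above look roughly like this (a sketch with made-up values, not the library code):

```python
# Hashed bits for one point in one tree are collected in an array...
hash_bits = ['0', '0', '1', '1', '1', '0']

# ...then concatenated into a binary string hash...
string_hash = "".join(hash_bits)      # '001110'

# ...which can be packed into a compact integer for the index.
int_hash = int(string_hash, 2)        # 14

# The hash width must be remembered to recover the string form.
restored = format(int_hash, '06b')    # '001110'
```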
</p> <p>In the next section, I will explain how this extra overhead can be minimized using a trick, but the fact that “you cannot eliminate the memory-speed trade-off completely” holds.</p> <h2 id="optimizations-on-query-speeds">Optimizations on query speeds</h2> <p>(I strongly recommend going through the <a href="http://ilpubs.stanford.edu:8090/678/1/2005-14.pdf">LSH forest paper</a> before reading the following section, because the description involves terms taken directly from the paper.)</p> <p>I was hapless at the beginning, because the query speeds of the implemented LSH forest structure were not at a competitive state to challenge the other approximate nearest neighbor search implementations. But with the help of my project mentors, I was able to overcome this issue.</p> <h3 id="optimizations-on-the-algorithm">Optimizations on the algorithm</h3> <p>In the initial implementation, the binary hash \(g(p)\) for each tree was computed during the <strong>synchronous ascending</strong> phase, at the point where it was required. This causes a great overhead because each time it is computed, the LSH function has to be applied to the query vector. For <em>random projections</em>, this is a dot product, which is an expensive operation for large vectors. <img src="https://docs.google.com/drawings/d/1DCy4UsrJo1FXigeZvXeDzPEs_xsZ4AYgZfqERCjBujA/pub?w=960&amp;h=720" alt="descending phase for a single tree" /> Computing the binary hashes for each tree in advance and storing them is a reliable solution to this issue. The descend algorithm needs to be performed on each tree to find the longest matching hash length, so it will iterate over the number of trees in the forest. For each iteration, the binary hash of the query is calculated and stored as follows. This does not consume much memory, because the number of trees is small (typically 10). 
</p> <div class="highlight"><pre><code class="language-python" data-lang="python">def descend(query, hash_functions, trees):
    bin_queries = []
    max_depth = 0
    for i in range(number_of_trees):
        # Hash the query once per tree and store the result.
        bin_query = do_hash(query, hash_functions[i])
        k = find_longest_prefix_match(trees[i], bin_query)
        if k &gt; max_depth:
            max_depth = k
        bin_queries.append(bin_query)
    return max_depth, bin_queries</code></pre></div> <p>It is an obvious fact that as the number of candidates for the nearest neighbors grows, the number of actual distance calculations grows proportionally. 
This computation is also expensive when compared with the other operations in the LSH forest structure. So limiting the number of candidates is another viable optimization which can be done with respect to the algorithm. <img src="https://docs.google.com/drawings/d/1Q9ocBPxZzdvTEjP-Byw-OwXbBfPKwpBt-XV3_Fsmsqw/pub?w=946&amp;h=386" alt="synchronous ascend phase" /> In the above function, one of the terminating conditions of the candidate search is the maximum depth reaching 0 (the while loop runs until <code>x &gt; 0</code>). But it need not run until the maximum depth reaches 0, because we are looking for the most viable candidates. It is a known fact that for smaller matching hash lengths, a large similarity between two data points is unlikely. Therefore, as the considered hash length is decreased, the eligibility of the candidates we get decreases as well. Thus, a lower bound on the maximum depth bounds the set of candidates we collect and decreases the probability of collecting irrelevant ones. This can be done by setting <code>x &gt; lower_bound</code> in the while loop for some value of <code>lower_bound &gt; 0</code>. But there is a risk of not retrieving the required number of neighbors for very small data sets, because the candidate search may terminate before it acquires the required number of candidates. Therefore the user should be aware of a suitable <code>lower_bound</code> for the queries at the time of initialization. </p> <h3 id="optimizations-on-the-implementation">Optimizations on the implementation</h3> <p>I have indicated that the <code>bisect</code> operations that come with Python are faster for small lists, but as the size of the list grows, the numpy <code>searchsorted</code> function becomes faster. 
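As a reminder of the numpy behaviour relied on here (illustrative values): np.searchsorted returns the leftmost insertion point by default and the rightmost one with side='right', which together bracket a run of equal entries in a sorted array.

```python
import numpy as np

# A sorted array with a run of equal integer hashes.
sorted_hashes = np.array([3, 7, 7, 7, 12, 15])

left = np.searchsorted(sorted_hashes, 7)                 # 1
right = np.searchsorted(sorted_hashes, 7, side='right')  # 4

# Indices 1, 2 and 3 hold all the entries equal to 7.
matches = np.arange(left, right)
```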
In my previous implementation, I used an alternative version of the <code>bisect_right</code> function, since the standard one does not fulfill the requirement of finding the rightmost index for a particular hash length in a sorted array (please refer to my <a href="/gsoc/2014/06/01/lsh-forest-with-sorted-arrays-and-binary-search.html">previous post</a> if things are not clear). But we cannot create an alternative version of the numpy <code>searchsorted</code> function; therefore a suitable transformation is required on the hash itself.</p> <p>Suppose we have a binary hash <code>item = 001110</code>. What we need is the largest number with the same first 4 bits as <code>item</code>; <code>001111</code> satisfies this requirement. So the transformation needed is replacing the last 2 bits with 1s. </p> <div class="highlight"><pre><code class="language-python" data-lang="python">def transform_right(item, h, hash_size):
    # Keep the first h bits and replace the rest with 1s.
    transformed_item = item[:h] + "".join(['1' for i in range(hash_size - h)])
    return transformed_item</code></pre></div> <p>The transformation needed to get the leftmost index is simple. It is <code>001100</code>, which is the last two bits replaced by 0s. This is the same as having <code>0011</code>. 
Therefore only a string slicing operation, <code>item[:4]</code>, will do the job. </p> <p>It gets more complicated when it comes to integer hashes. The integer has to be converted to a string, transformed, and converted back to an integer. </p> <div class="highlight"><pre><code class="language-python" data-lang="python">def integer_transform(int_item, hash_size, h):
    # Zero-padded binary string of the integer hash.
    str_item = ('{0:0' + str(hash_size) + 'b}').format(int_item)
    transformed_str = transform_right(str_item, h, hash_size)
    transformed_int = int(transformed_str, 2)
    return transformed_int</code></pre></div> <p>But this is a very expensive operation for a query. For indexing, only <code>int(binary_hash, 2)</code> is required, and that does not have much effect because the LSH algorithm on the data set dominates that operation completely. But for a query, this is a significant overhead. 
Therefore we need an alternative.</p> <p>The required integer representations for the left and right operations can be obtained by performing bitwise \(AND\) and \(OR\) operations with a suitable mask. Masks are generated by the following function.</p> <div class="highlight"><pre><code class="language-python" data-lang="python">def _generate_masks(hash_size):
    left_masks, right_masks = [], []
    for length in range(hash_size + 1):
        # length 1s followed by (hash_size - length) 0s
        left_mask = int("".join(['1' for i in range(length)])
                        + "".join(['0' for i in range(hash_size - length)]), 2)
        left_masks.append(left_mask)
        # length 0s followed by (hash_size - length) 1s
        right_mask = int("".join(['0' for i in range(length)])
                         + "".join(['1' for i in range(hash_size - length)]), 2)
        right_masks.append(right_mask)
    return left_masks, right_masks</code></pre></div> <p>These masks will be generated at indexing time. 
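For intuition, with a hypothetical hash size of 6 and a matching prefix length of 4, the masks and their effect on the earlier example hash 001110 would be (values chosen purely for illustration):

```python
item = 0b001110          # the example hash as an integer (14)

left_mask = 0b111100     # four 1s then 0s: left_masks[4]
right_mask = 0b000011    # four 0s then 1s: right_masks[4]

item_left = item & left_mask     # 0b001100 == 12, the left bound
item_right = item | right_mask   # 0b001111 == 15, the right bound
```

This reproduces the string transformations from the previous section (001100 and 001111) with two cheap bitwise operations.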
Then the masks will be applied to the integer hashes.</p> <div class="highlight"><pre><code class="language-python" data-lang="python">def apply_masks(item, left_masks, right_masks, h):
    item_left = item &amp; left_masks[h]
    item_right = item | right_masks[h]
    return item_left, item_right</code></pre></div> <p>The leftmost and rightmost indices of a sorted array can be obtained in the following fashion.</p> <div class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

def find_matching_indices(sorted_array, item_left, item_right):
    left_index = np.searchsorted(sorted_array, item_left)
    right_index = np.searchsorted(sorted_array, item_right, side='right')
    return np.arange(left_index, right_index)</code></pre></div> <p>As the masks are precomputed, the speed overhead at query time is minimal in this approach. But there is still a little overhead, because the original hashed binary numbers are stored in an array, and those numbers need to be concatenated and converted to obtain the corresponding integers. If the integers are cached, this overhead is eliminated.</p> <h3 id="cached-hash">Cached hash</h3> <p>This is a method which guarantees a significant speed-up in the queries at the expense of index building speed. At indexing time, we can create a dictionary whose keys are arrays of hashed bits and whose values are the corresponding integers. The number of items in the dictionary is the number of bit combinations for a particular hash length.</p> <p>Example:</p> <p>Suppose the hash length is 3. Then the bit combinations are: <img src="https://docs.google.com/drawings/d/1hiUf22xhydYwCedrn0QChcB8yGcv7C1puW0Y1a_n8ZI/pub?w=882&amp;h=384" alt="hash_bit_table" /></p> <p>Then it will be only a matter of a dictionary look-up. Once an integer is required for a particular hash bit array, it can be retrieved directly from the dictionary. </p> <p>The implementation of this type of hash is a bit tricky. Typically in LSH forests, the maximum hash length will be 32. Then the size of the dictionary will be \(2^n = 4294967296\) where \(n = 32\). This is an extremely infeasible size to hold in memory (it may require tens of gigabytes). 
But when \(n = 16\), the size becomes \(65536\). This is a modest size which can easily be stored in memory. Therefore we use two caches of size 16. </p> <p>First, a list for the digits <code>0</code> and <code>1</code> is created.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">digits</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;0&#39;</span><span class="p">,</span> <span class="s">&#39;1&#39;</span><span class="p">]</span></code></pre></div> <p>Then the cache is created as follows.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">itertools</span> <span class="n">cache_N</span> <span class="o">=</span> <span class="mi">32</span> <span class="o">/</span> <span class="mi">2</span> <span class="n">c</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="n">cache_N</span> <span class="c"># compute once</span> <span class="n">cache</span> <span class="o">=</span> <span class="p">{</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="nb">int</span><span class="p">(</span><span class="s">&quot;&quot;</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="n">digits</span><span class="p">[</span><span class="n">y</span><span class="p">]</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">x</span><span class="p">]),</span><span class="mi">2</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">itertools</span><span class="o">.</span><span class="n">product</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span> <span class="n">repeat</span><span class="o">=</span><span
class="n">cache_N</span><span class="p">)</span> <span class="p">}</span></code></pre></div> <p>Suppose <code>item</code> is a list of length 32 with binary values(0s and 1s). Then we can obtain the corresponding integer for that <code>item</code> as follows:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">xx</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">(</span><span class="n">item</span><span class="p">)</span> <span class="n">int_item</span> <span class="o">=</span> <span class="n">cache</span><span class="p">[</span><span class="n">xx</span><span class="p">[:</span><span class="mi">16</span><span class="p">]]</span> <span class="o">*</span> <span class="n">c</span> <span class="o">+</span> <span class="n">cache</span><span class="p">[</span><span class="n">xx</span><span class="p">[</span><span class="mi">16</span><span class="p">:]]</span></code></pre></div> <p>This technique can be used in queries to obtain a significant improvement in speed assuming that the user is ready to sacrifice a portion of the memory.</p> <p>Performance of all these techniques were measured using Pythons’ <code>line_profiler</code> and <code>memory_profiler</code>. 
In my next post, I will explain concisely how these tools have been used to evaluate these implementations.</p> Mon, 07 Jul 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/07/07/optimizations-on-locality-sensitive-hashing-forest-data-structure.html http://maheshakya.github.io/gsoc/2014/07/07/optimizations-on-locality-sensitive-hashing-forest-data-structure.html gsoc An illustration of the functionality of the LSH Forest <p>Before digging into more technical detail on the implementation of LSH forest, I thought it would be useful to provide an insightful illustration on how the best candidates are selected from the fitted data points.</p> <p>For this task, I decided to use a randomly generated dataset with a sample size of 10000 in two-dimensional space. The reason for using two dimensions is that it is convenient to depict the vectors in 2D space, and a normal human being can easily grasp the idea of convergence in a 2D space. </p> <p>The following configuration of the LSH forest has been selected for this illustration. </p> <ul> <li>Number of trees = 1</li> <li>Hash length = 32</li> <li>c = 1</li> <li>Lower bound of hash length at termination = 4</li> <li>Expected number of neighbors = 10</li> </ul> <p>You can get an idea of these parameters from my <a href="/gsoc/2014/06/01/lsh-forest-with-sorted-arrays-and-binary-search.html">previous article on the LSH forest</a>.</p> <p>The following illustrations show the convergence of data points towards the query point as the considered hash length is increased. The important fact I want to emphasize here is that candidates chosen by matching the most significant hash bits converge towards the actual data point we are interested in. This happens because of the <a href="http://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification">amplification property</a> of the Locality Sensitive Hashing algorithm. </p> <p>(Beware! 
The query point is in RED) <img src="https://docs.google.com/drawings/d/1R8cajY5tZMxy9Q2_JlPCB2UV3FX8Uifw6tY57kY14Uk/pub?w=960&amp;h=720" alt="hash length = 0" /> <img src="https://docs.google.com/drawings/d/1xJfWym3OXfWMx9BZzLLX7iweWC4NVj7vqgxoijhEDlI/pub?w=960&amp;h=720" alt="hash length = 1" /> <img src="https://docs.google.com/drawings/d/1IOjYl-JsUxTzegKdCYK4C8jvGLV77d0FAhZAXLaj9Jk/pub?w=960&amp;h=720" alt="hash length = 3" /> <img src="https://docs.google.com/drawings/d/1lGJrddMp54dOk6pC6_miJUxjlwEGm8wfbF5Xj7x7N6w/pub?w=960&amp;h=720" alt="hash length = 4" /> <img src="https://docs.google.com/drawings/d/1CwUGIY4iiBEcyQhuV_zguvNQZx1LBR1Gg734Gx2nw7k/pub?w=960&amp;h=720" alt="hash length = 5" /> <img src="https://docs.google.com/drawings/d/1_2NU__OJ_5dio6KWDAb8FZoMHJqM-7XyQaw8-RvvH3A/pub?w=960&amp;h=720" alt="hash length = 7" /> <img src="https://docs.google.com/drawings/d/1IhoEw-k66h4EXa1g07_ywvmyrrbGsVyby4mnhRrmydk/pub?w=960&amp;h=720" alt="hash length = 8" /> <img src="https://docs.google.com/drawings/d/1oII5NtKCH3WYThoQuOVfx4JNrRbOLANGf5GGXSEmvt0/pub?w=960&amp;h=720" alt="hash length = 24" /> <img src="https://docs.google.com/drawings/d/1rxYddtwfE1ZGmTBLBJp1xhTCclYRF-KsLrWjFsesRHY/pub?w=960&amp;h=720" alt="hash length = 32" /></p> <p>Only if the required number of candidates is not reached during the early sweeps will the algorithm search for more candidates at smaller matching hash lengths. Then the best neighbors are selected from that set of candidates by calculating the actual distance. 
</p> <p>In my next post, I will enlighten you about the subtle optimizations done on the LSH forest data structure to find the best candidates at maximum speed.</p> Sat, 28 Jun 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/06/28/what-does-locality-sensitive-hashing-forests-do.html http://maheshakya.github.io/gsoc/2014/06/28/what-does-locality-sensitive-hashing-forests-do.html gsoc Performance evaluation of Approximate Nearest Neighbor search implementations - Part 2 <p>This continues from the post <a href="/gsoc/2014/05/25/performance-evaluation-of-approximate-nearest-neighbor-search-implementations---part-1.html">Performance evaluation of Approximate Nearest Neighbor search implementations - Part 1</a>. The evaluation of memory consumption was already completed in that post. In this post, the next two aspects of the evaluation framework, precision and query speed, will be discussed. </p> <p>When measuring the performance of Approximate nearest neighbor search methods, expressing precision and query speed independently is less useful, since the entire idea of approximating is to obtain the desired precision within a shorter time. In order to evaluate precision, an ANN implementation should be able to provide multiple neighbors (rather than just the nearest neighbor). After obtaining all data points in the data set from the ANN method, the first few entries in that neighbors list (ten neighbors in my evaluation tests) are taken as the ground truth neighbors for that particular ANN method. This set of data points is compared against the neighbors retrieved as the number of queried neighbors is varied. Precision tests are adapted from the <a href="https://github.com/spotify/annoy/blob/master/examples/precision_test.py">tests performed for ANNOY</a>.</p> <p>This precision measure eliminates some of our candidate ANN implementations because they are not capable of producing multiple neighbors. 
Obtaining multiple neighbors is an essential requirement for the precision tests described above, as well as for general applications of nearest neighbor search. Therefore, for the precision tests, only <a href="https://github.com/spotify/annoy">ANNOY</a>, <a href="http://www.cs.ubc.ca/research/flann/">FLANN</a>, <a href="http://www.kgraph.org/">KGraph</a> and LSH Forest are taken into consideration. All evaluation tests, the LSH forest implementation and the plots can be found in <a href="https://github.com/maheshakya/Performance_evaluations_ANN">this Github repository</a>.</p> <p>Before jumping into comparisons, I thought it was imperative to get a sense of the characteristics of the LSH forest implementation. Unlike other ANN implementations, LSH forest provides some user-controlled parameters to tune the forest and queries in different scenarios. </p> <p>For the entire forest: </p> <ul> <li>Number of trees</li> <li>Maximum hash length</li> </ul> <p>For queries:</p> <ul> <li>c : a value which determines the number of candidates chosen into the neighbors set; it acts together with the number of trees.</li> <li>A lower bound for the maximum depth of hashes considered.</li> </ul> <p>In these precision tests, all of those factors except <code>c</code> are fixed to constant values as follows:</p> <ul> <li>Number of trees = 10</li> <li>Maximum hash length = 32</li> <li>Lower bound = 4</li> </ul> <p>The same data set which has been prepared using <a href="/gsoc/2014/05/18/preparing-a-bench-marking-data-set-using-singula-value-decomposition-on-movielens-data.html">singular value decomposition on movielens data</a> is used in all of these tests. The following are the resulting graphs of the performance of LSH forest. Time is measured in seconds. 
<img src="https://docs.google.com/drawings/d/14CLx4l4VNxJzINJUurlSsrXGl9iWH9fjVH1OideeVO0/pub?w=960&amp;h=720" alt="precision vs c LSHF" /> <img src="https://docs.google.com/drawings/d/1Qr0bHs9Q9pnoszRn-PgGEC5mFMNhwQzqkJ4K4vHYjCw/pub?w=960&amp;h=720" alt="precision vs time LSHF" /> <img src="https://docs.google.com/drawings/d/1EPkeWOfMt7y6_nKk0StNri71xkrwEtoX-8TySqmM8tk/pub?w=960&amp;h=720" alt="precision vs number of candidates LSHF" /> <img src="https://docs.google.com/drawings/d/1klhtpde7N5YLHuCZHX6ekTEuD54IPVM5Tgltj5Xukqk/pub?w=960&amp;h=720" alt="number of candidates vs c LSHF" /></p> <p>The next section of this post illustrates the performance comparisons among ANNOY, FLANN, LSH forest and KGraph. Precision vs. time graphs are used for this comparison. <img src="https://docs.google.com/drawings/d/1Lx8jRYyGbHBD8JzCG_0-HpO14fXyu-pMcto26gsa5zw/pub?w=960&amp;h=720" alt="precision vs time LSHF and ANNOY" /> <img src="https://docs.google.com/drawings/d/1kfUAr2W6WP_OL4l_vGZ-zaEXirG0jstFkk-r2TAH6rU/pub?w=960&amp;h=720" alt="precision vs time FLANN &amp; LSHF" /> Comparing ANNOY, FLANN and LSHF in one place: <img src="https://docs.google.com/drawings/d/1EQTMURuWB7hjoi9r0sGS2Z-IqL_Z_4kg7K5hC6sSt-g/pub?w=960&amp;h=720" alt="precision vs time LSHF, ANNOY &amp; FLANN" /></p> <p>KGraph has a significantly higher precision rate than other ANN implementations.(Rather than approximating, it gives the actual nearest neighbors according the KGraph documentation ) <img src="https://docs.google.com/drawings/d/1HRED65X5AlRuYIo1YaeU6D6ouVufdD8N3v1FFldNESg/pub?w=960&amp;h=720" alt="precision vs time LSHF &amp; KGraph" /> </p> <p>One of the main considerations of these evaluations is the maintainability of the code. Therefore, any implementation which goes into <a href="http://scikit-learn.org/stable/">scikit-learn</a> should have reasonably less complex code. Both FLANN and KGraph use complex data structures and algorithms to achieve higher speeds. 
ANNOY has a reasonably passable precision-query speed combination with a less complex implementation. Our current implementation of LSH forest has been able to beat ANNOY in the precision-query speed comparison. </p> <h2 id="indexing-speeds-of-ann-implementations">Indexing speeds of ANN implementations</h2> <p>In addition to precision and query speed, a measure of indexing speed is of major importance. Therefore, as a final step of this evaluation, here are the indexing times on the same data set used above for each ANN implementation.</p> <ul> <li>KGraph : 65.1286380291 seconds</li> <li>ANNOY (metric=’Angular’) : 47.5299789906 seconds</li> <li>FLANN : 0.314314842224 seconds</li> <li>LSHF : 3.73900985718 seconds</li> </ul> <p>In my next post, I will discuss the implementation of LSH forest in detail and how ANN methods will be implemented in scikit-learn.</p> <h3 id="acronyms">Acronyms:</h3> <ol> <li>ANN : Approximate Nearest Neighbors</li> <li>LSH : Locality Sensitive Hashing</li> </ol> Sat, 14 Jun 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/06/14/performance-evaluation-of-approximate-nearest-neighbor-search-implementations-part-2.html http://maheshakya.github.io/gsoc/2014/06/14/performance-evaluation-of-approximate-nearest-neighbor-search-implementations-part-2.html gsoc LSH Forest with sorted arrays and binary search <p>More on GSoC with <a href="http://scikit-learn.org/stable/index.html">scikit-learn</a>! LSH forest is a promising, novel alternative method introduced to alleviate the drawbacks from which vanilla LSH suffers. I assume you have a passable idea of what LSH means. If not, I suggest you refer to this: <a href="http://en.wikipedia.org/wiki/Locality-sensitive_hashing">Locality-sensitive hashing</a>. LSH forest has a theoretical guarantee of its suggested improvements. 
For more information, refer to the published paper: <a href="http://ilpubs.stanford.edu:8090/678/1/2005-14.pdf">LSH Forest: Self-Tuning Indexes for Similarity Search</a></p> <p>In general, the data structure used to implement LSH forest is a <a href="http://en.wikipedia.org/wiki/Trie">prefix tree</a>(trie). In this article, I will elaborate how to implement it with sorted arrays and binary search. This will reduce the complexity involved with a separate data structure(such as a tree). You can see the complete implementation in this <a href="https://gist.github.com/maheshakya/b22f640f67d7b574fd56">gist</a>.</p> <h2 id="how-it-is-done">How it is done</h2> <p>This implementation follows every design aspect suggested in the LSH forest paper except the data structure. LSH_forest is a class which has the initialization method as follows:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">max_label_length</span> <span class="o">=</span> <span class="mi">32</span><span class="p">,</span> <span class="n">number_of_trees</span> <span class="o">=</span> <span class="mi">5</span><span class="p">):</span> <span class="bp">self</span><span class="o">.</span><span class="n">max_label_length</span> <span class="o">=</span> <span class="n">max_label_length</span> <span class="bp">self</span><span class="o">.</span><span class="n">number_of_trees</span> <span class="o">=</span> <span class="n">number_of_trees</span> <span class="bp">self</span><span class="o">.</span><span class="n">random_state</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">RandomState</span><span class="p">(</span><span class="n">seed</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span></code></pre></div> 
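<p>Before walking through the class methods, the core random-projection idea the class builds on can be previewed in isolation. This is a toy sketch in plain Python (the names echo the post, but it is not the class itself):</p>

```python
import numpy as np

rng = np.random.RandomState(seed=1)
hash_size, dim = 8, 3

# g(p): a set of random hyperplanes drawn from the standard normal distribution.
hyperplanes = rng.randn(hash_size, dim)

def hash_vector(x):
    """Random-projection hash: the sign pattern of the dot products."""
    projections = np.dot(hyperplanes, x)
    return "".join('1' if p > 0 else '0' for p in projections)

point = np.array([0.5, -1.2, 0.3])
code = hash_vector(point)
print(len(code))  # 8
# A tiny perturbation of `point` almost always yields the same code,
# which is what makes prefix matching on these strings meaningful.
```

<p>The class below does the same thing, tree by tree, with one set of hyperplanes per tree.</p>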
<p><code>numpy</code> has been used and imported as <code>np</code>. Variable names are the same as the names in the paper. The length of the hash is a fixed value. This value will be a small integer for almost all applications.</p> <p>In a normal LSH based nearest neighbor search, there are two main operations. </p> <ol> <li>Building index</li> <li>Queries</li> </ol> <h3 id="building-index">Building index</h3> <p>The first stage of building the index is hashing each data point in the data set passed into the function. <a href="http://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection">Random projection</a> has been used as the hashing algorithm (it belongs to the LSH family). In order to perform random projection, a set of random hyper-planes is required with the shape of \(expected Hash Size \times dimension Of The Data Vector\). It is done by the following function. </p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">_get_random_hyperplanes</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">hash_size</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">dim</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot; </span> <span class="sd"> Generates hyperplanes from standard normal distribution and return </span> <span class="sd"> it as a 2D numpy array. This is g(p,x) for a particular tree.</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">random_state</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">hash_size</span><span class="p">,</span> <span class="n">dim</span><span class="p">)</span></code></pre></div> <p>Then, the random projection is performed. 
It is a simple operation as all it needs to do is get the dot product of the generated hyper-planes and the data vectors. Then it will create a binary string taking the sign of the hash into account:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">_hash</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_point</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">hash_function</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;</span> <span class="sd"> Does hash on the data point with the provided hash_function: g(p,x).</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="n">projections</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">hash_function</span><span class="p">,</span> <span class="n">input_point</span><span class="p">)</span> <span class="k">return</span> <span class="s">&quot;&quot;</span><span class="o">.</span><span class="n">join</span><span class="p">([</span><span class="s">&#39;1&#39;</span> <span class="k">if</span> <span class="n">i</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">else</span> <span class="s">&#39;0&#39;</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">projections</span><span class="p">])</span></code></pre></div> <p>After this, a tree (a figurative tree) is built by sorting those binary hashes. At this point, the original indices are retained because they will be the only way to refer to the original vectors from now on. 
It is done as follows:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">_create_tree</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">input_array</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">hash_function</span> <span class="o">=</span> <span class="bp">None</span><span class="p">):</span> <span class="sd">&quot;&quot;&quot;</span> <span class="sd"> Builds a single tree (in this case creates a sorted array of </span> <span class="sd"> binary hashes).</span> <span class="sd"> &quot;&quot;&quot;</span> <span class="n">number_of_points</span> <span class="o">=</span> <span class="n">input_array</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="n">binary_hashes</span> <span class="o">=</span> <span class="p">[]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">number_of_points</span><span class="p">):</span> <span class="n">binary_hashes</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_hash</span><span class="p">(</span><span class="n">input_array</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">hash_function</span><span class="p">))</span> <span class="n">binary_hashes</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">binary_hashes</span><span class="p">)</span> <span class="n">o_i</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span 
class="n">binary_hashes</span><span class="p">)</span> <span class="k">return</span> <span class="n">o_i</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">binary_hashes</span><span class="p">)</span></code></pre></div> <p>This is the process which has to be done a single tree. But there are multiple number of trees. So this has to be done for each tree. The above function is called for each tree with the corresponding hash function which is \(g(p)\). Then hash functions, trees and original indices are stored as <code>numpy</code> arrays.</p> <h3 id="queries">Queries</h3> <p>This is the tricky part of this implementation. All the tree operations indicated in the paper have to be converted into range queries in order to work with sorted arrays and binary search. I will move step by step about how binary search has been used in this application. </p> <p>The first objective is: given a sort array of binary hashes, a binary query and a hash value <code>h</code>, retrieve an array of indices where the most significant <code>h</code> bits of the entries are as same as the most significant <code>h</code> bits of the query. 
In order to achieve this, I have re-implemented the <code>bisect</code> functions(which comes by default with Python) with a little essential modification.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">bisect_left</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="n">lo</span> <span class="o">=</span> <span class="mi">0</span> <span class="n">hi</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="k">while</span> <span class="n">lo</span> <span class="o">&lt;</span> <span class="n">hi</span><span class="p">:</span> <span class="n">mid</span> <span class="o">=</span> <span class="p">(</span><span class="n">lo</span><span class="o">+</span><span class="n">hi</span><span class="p">)</span><span class="o">//</span><span class="mi">2</span> <span class="k">if</span> <span class="n">a</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">x</span><span class="p">:</span> <span class="n">lo</span> <span class="o">=</span> <span class="n">mid</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">else</span><span class="p">:</span> <span class="n">hi</span> <span class="o">=</span> <span class="n">mid</span> <span class="k">return</span> <span class="n">lo</span> <span class="k">def</span> <span class="nf">bisect_right</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span> <span class="n">lo</span> <span class="o">=</span> <span class="mi">0</span> <span class="n">hi</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">a</span><span class="p">)</span> <span class="k">while</span> <span class="n">lo</span> <span 
class="o">&lt;</span> <span class="n">hi</span><span class="p">:</span> <span class="n">mid</span> <span class="o">=</span> <span class="p">(</span><span class="n">lo</span><span class="o">+</span><span class="n">hi</span><span class="p">)</span><span class="o">//</span><span class="mi">2</span> <span class="k">if</span> <span class="n">x</span> <span class="o">&lt;</span> <span class="n">a</span><span class="p">[</span><span class="n">mid</span><span class="p">]</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">a</span><span class="p">[</span><span class="n">mid</span><span class="p">][:</span><span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)]</span><span class="o">==</span><span class="n">x</span><span class="p">:</span> <span class="n">hi</span> <span class="o">=</span> <span class="n">mid</span> <span class="k">else</span><span class="p">:</span> <span class="n">lo</span> <span class="o">=</span> <span class="n">mid</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">return</span> <span class="n">lo</span> <span class="c">#function which accepts an sorted array of bit strings, a query string</span> <span class="c">#This returns an array containing all indices which share the first h bits of the query</span> <span class="k">def</span> <span class="nf">simpleFunctionBisectReImplemented</span><span class="p">(</span><span class="n">sorted_array</span><span class="p">,</span> <span class="n">item</span><span class="p">,</span> <span class="n">h</span><span class="p">):</span> <span class="n">left_index</span> <span class="o">=</span> <span class="n">bisect_left</span><span class="p">(</span><span class="n">sorted_array</span><span class="p">,</span> <span class="n">item</span><span class="p">[:</span><span class="n">h</span><span class="p">])</span> <span class="n">right_index</span> <span class="o">=</span> <span class="n">bisect_right</span><span class="p">(</span><span 
class="n">sorted_array</span><span class="p">,</span> <span class="n">item</span><span class="p">[:</span><span class="n">h</span><span class="p">])</span> <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">left_index</span><span class="p">,</span> <span class="n">right_index</span><span class="p">)</span></code></pre></div> <p>Here I have considered a minor aspect about slicing and <code>startswith</code> in Python string. In place of <code>a[mid][:len(x)]==x</code>, I could have used <code>startswith</code> in-built function. But after a little research, it became obvious why the latter is not suitable here. <code>startswith</code> works efficiently with very long strings, but slicing has been optimized from C level for efficiency for small strings. In this application, hash strings do not have a requirement to be very long. You can read more about this from this <a href="http://stackoverflow.com/questions/13270888/why-is-startswith-slower-than-slicing">question</a>.</p> <p>The time complexity of this method is as any binary search. The number of entries <code>n</code> is the length of the array of sorted binary hashes. There are two searches in this method, but after performing it, the overall complexity will be \(O(log n)\). (You can do the math and confirm)</p> <p>There is another binary search. 
It is to find the longest prefix match for a binary query string in a sorted array of binary hashes.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">find_longest_prefix_match</span><span class="p">(</span><span class="n">bit_string_list</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span> <span class="n">hi</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">query</span><span class="p">)</span> <span class="n">lo</span> <span class="o">=</span> <span class="mi">0</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">simpleFunctionBisectReImplemented</span><span class="p">(</span><span class="n">bit_string_list</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">hi</span><span class="p">))</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span> <span class="k">return</span> <span class="n">hi</span> <span class="k">while</span> <span class="n">lo</span> <span class="o">&lt;</span> <span class="n">hi</span><span class="p">:</span> <span class="n">mid</span> <span class="o">=</span> <span class="p">(</span><span class="n">lo</span><span class="o">+</span><span class="n">hi</span><span class="p">)</span><span class="o">//</span><span class="mi">2</span> <span class="n">k</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">simpleFunctionBisectReImplemented</span><span class="p">(</span><span class="n">bit_string_list</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">mid</span><span class="p">))</span> <span class="k">if</span> <span class="n">k</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span> <span class="n">lo</span> <span class="o">=</span> <span 
class="n">mid</span> <span class="o">+</span> <span class="mi">1</span> <span class="n">res</span> <span class="o">=</span> <span class="n">mid</span> <span class="k">else</span><span class="p">:</span> <span class="n">hi</span> <span class="o">=</span> <span class="n">mid</span> <span class="k">return</span> <span class="n">res</span></code></pre></div> <p>The time complexity of this operation is a little trickier, since binary searches on two different parameters are involved. The outer binary search corresponds to the length of the query string, say <code>v</code>, and the inner binary search corresponds to the length of the sorted array of binary hashes, say <code>n</code>. It therefore has a complexity of \(\log v \times \log n\), which is approximately \(O(\log n^K)\) where \(K=\log v\).</p> <p>That is all the basic setting required to implement LSH forest with sorted arrays and binary search. Now we can move on to the actual implementation of queries as indicated in the paper. There are two main phases described to perform queries.</p> <ol> <li>Descending phase.</li> <li>Synchronous ascending phase. </li> </ol> <p>In the descending phase, the longest matching hash length for a particular query is retrieved from all the trees. This step is quite straightforward, as all it does is apply the longest prefix match function described above on each tree. From this, <code>max_depth</code> (<code>x</code> in the paper) is found.</p> <p>The query function accepts a value for \(c\) (refer to the paper) as well. This determines the number of candidates returned from the function: \(M\), which is equal to \(c \times numberOfTrees\). In the synchronous ascending phase, starting from <code>x</code>, every matching entry of length <code>x</code> from each tree is collected (in a loop). Then <code>x</code> is decreased by one. The same is done repeatedly for each tree until the required number of candidates is retrieved. 
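Put together, the two phases can be sketched on plain lists of sorted binary hash strings, one list per tree. This is a simplified sketch with my own names and bookkeeping, not the project's code, and the <code>prefix + "2"</code> sentinel assumes a 0/1 alphabet:

```python
import bisect


def matching_entries(sorted_hashes, query, h):
    """Entries sharing the first h characters with query (binary alphabet)."""
    prefix = query[:h]
    left = bisect.bisect_left(sorted_hashes, prefix)
    right = bisect.bisect_right(sorted_hashes, prefix + "2")
    return sorted_hashes[left:right]


def longest_prefix_match(sorted_hashes, query):
    """Binary search on the prefix length (the descending-phase helper)."""
    lo, hi, res = 0, len(query), 0
    if matching_entries(sorted_hashes, query, hi):
        return hi
    while lo < hi:
        mid = (lo + hi) // 2
        if matching_entries(sorted_hashes, query, mid):
            lo, res = mid + 1, mid
        else:
            hi = mid
    return res


def query_candidates(trees, query, m):
    # Descending phase: the longest match over all trees gives max_depth (x).
    x = max(longest_prefix_match(tree, query) for tree in trees)
    # Synchronous ascending phase: collect matches at depth x from every
    # tree, then shorten the prefix until enough distinct candidates exist.
    candidates = []
    while x > 0 and len(set(candidates)) < m:
        for tree in trees:
            candidates.extend(matching_entries(tree, query, x))
        x -= 1
    return candidates
```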
During the process, the length of the candidate list may grow beyond the required number of candidates, but the search continues as long as the following condition holds (as described in the synchronous ascend algorithm in the paper).</p> <p>condition: \(x&gt;0\) and \((length(candidates) &lt; M\) or \(length(unique(candidates)) &lt; m)\)</p> <p>\(M \gg m\), where \(m\) is the actual number of neighbors required. So after selecting the candidates, a true distance measure will be used to determine the actual neighbors. This will be done later as the project proceeds. The current implementation will be used to perform the tasks in the <a href="/gsoc/2014/05/25/performance-evaluation-of-approximate-nearest-neighbor-search-implementations---part-1.html">evaluation criteria</a> that I have discussed in my earlier post. </p> <p>In my next post I will illustrate how the various versions of LSH forest perform, along with a comparison against other ANN implementations. </p> <h3 id="acronyms">Acronyms</h3> <ol> <li>LSH : Locality Sensitive Hashing</li> </ol> Sun, 01 Jun 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/06/01/lsh-forest-with-sorted-arrays-and-binary-search.html http://maheshakya.github.io/gsoc/2014/06/01/lsh-forest-with-sorted-arrays-and-binary-search.html gsoc Performance evaluation of Approximate Nearest Neighbor search implementations - Part 1 <p>This marks the official start of my GSoC project. In my <a href="/gsoc/2014/05/18/preparing-a-bench-marking-data-set-using-singula-value-decomposition-on-movielens-data.html">previous post</a>, I discussed how to create the benchmarking data set which will be used from here on. Here, I will discuss how the evaluation framework is designed to evaluate the performance of the existing approximate nearest neighbor search implementations. 
For this evaluation, I have chosen the following implementations:</p> <ul> <li>Spotify <a href="https://github.com/spotify/annoy">ANNOY</a></li> <li><a href="http://www.cs.ubc.ca/research/flann/">FLANN</a> </li> <li><a href="http://www.kgraph.org/">KGraph</a></li> <li><a href="http://nearpy.io/">nearpy</a></li> <li><a href="https://pypi.python.org/pypi/lshash/0.0.4dev">lshash</a></li> </ul> <h2 id="evaluation-framework">Evaluation framework</h2> <p>There are three main considerations in this evaluation framework. </p> <ol> <li>Memory consumption</li> <li>Precision</li> <li>Query speed</li> </ol> <p>For each of these aspects, there are separate tests which I will explain in the upcoming sections. In addition to these, another requirement emerged from the <a href="http://scikit-learn.org/stable/">scikit-learn</a> community: the index building time should also be considered, in order to assist with incremental learning. This is an optimization which has to be done in the ANN version that will be implemented in <a href="http://scikit-learn.org/stable/">scikit-learn</a>. </p> <p>Note: The evaluation framework will be run on a system with the following configuration:</p> <ul> <li>Memory : 16 GB (1600 MHz)</li> <li>CPU : Intel® Core™ i7-4700MQ CPU @ 2.40GHz × 8 </li> </ul> <p>In this post, I will discuss the evaluation done for memory consumption.</p> <h3 id="memory-consumption-of-existing-ann-methods">Memory consumption of existing ANN methods</h3> <p>Memory consumption corresponds to the index building step in a nearest neighbor search data structure, since that is the process which stores the data in the data structure. In this framework, there are two main aspects taken into account to express the memory consumption of an index building process. </p> <ol> <li><strong>Peak memory consumption</strong> : While the index building process takes place, there is a maximum amount of memory used. 
No amount of memory beyond this peak will be consumed during this process.</li> <li><strong>Overall increment</strong> : This is the actual memory used by the data structure after the index building process. This may be less than or equal to the peak memory consumption.</li> </ol> <p>These two aspects of the above-mentioned ANN implementations were measured. The results are as follows.</p> <p><img src="https://docs.google.com/drawings/d/1gCLDnk_UJ-kk5bkY-g-jp1hnqO5GcDqusQch2OHw0VI/pub?w=668&amp;h=468" alt="Memory_usage_table" /></p> <p>The following graph illustrates the peak memory consumptions in log scale. <img src="https://docs.google.com/drawings/d/1j7BjozhffmhVMbJHtymDYtlKFvxo1LQRQnHY2PHDIQU/pub?w=804&amp;h=614" alt="Peak_memory_usage" /></p> <p>The following graph illustrates the overall memory increments in log scale. <img src="https://docs.google.com/drawings/d/1EhBe1c45BIn5tEs6hzqF8MNMzrYs_NZEivii8Wqs0A0/pub?w=808&amp;h=614" alt="Overall_memory_increment" /></p> <p>In the upcoming posts, I will discuss the other two aspects of the evaluation framework in detail, as well as the performance of the LSH forest implementation. </p> <p>Cheers!</p> Sun, 25 May 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/05/25/performance-evaluation-of-approximate-nearest-neighbor-search-implementations---part-1.html http://maheshakya.github.io/gsoc/2014/05/25/performance-evaluation-of-approximate-nearest-neighbor-search-implementations---part-1.html gsoc Singular value decomposition to create a bench marking data set from MovieLens data <p>This is the second article on my Google Summer of Code project, and it follows from my <a href="/gsoc/2014/05/04/approximate-nearest-neighbor-search-using-lsh.html">previous post</a> describing my project: approximate nearest neighbor search using locality sensitive hashing. Here, I will elaborate on how I created the data set used for prototyping, evaluation and benchmarking purposes in the project. 
I have used <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition">Singular value decomposition</a> on the <a href="http://grouplens.org/datasets/movielens/">MovieLens 1M</a> data to create this sample data set.</p> <h2 id="movielens-1m-data">MovieLens 1M data</h2> <p><a href="http://grouplens.org/">GroupLens Research</a> is an organization that publishes research articles in conferences and journals, primarily in the field of computer science, but also in other fields including psychology, sociology, and medicine. It has collected and made available rating data sets from the <a href="http://movielens.org">MovieLens</a> web site. The data sets were collected over various periods of time, depending on the size of the set.</p> <p>The MovieLens 1M data set contains 1,000,209 anonymous ratings of approximately 3,900 movies made by 6,040 MovieLens users who joined MovieLens in 2000. After extracting the compressed content, you will find the following files:</p> <ul> <li>ratings.dat : Contains user IDs, movie IDs, ratings on a 5-star scale and time stamps.</li> <li>movies.dat : Contains movie IDs, titles and genres.</li> <li>users.dat : Contains user IDs, genders, ages, occupations and zip codes.</li> </ul> <p>More information about this can be found in the <a href="http://files.grouplens.org/datasets/movielens/ml-1m-README.txt">README</a>. </p> <h2 id="a-brief-explanation-about-singular-value-decomposition-and-its-role-in-machine-learning">A brief explanation about singular value decomposition and its role in machine learning</h2> <p>Singular value decomposition is a matrix factorization method. The general equation can be expressed as follows.</p> <script type="math/tex; mode=display">X = USV^T</script> <p>Suppose \(X\) has \(n\) rows and \(d\) columns. \(U\) is a matrix whose dimensions are \(n \times n\), \(V\) is another matrix whose dimensions are \(d \times d\), and \(S\) is a matrix whose dimensions are \(n \times d\), the same dimensions as \(X\). 
In addition, \(U^T U = I _ n\) and \(V^T V = I _ d\).</p> <p>You can read and understand more about this decomposition method and how it works from this <a href="http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition">article</a>.</p> <h3 id="what-is-the-significance-of-the-svdsingular-value-decomposition-in-machine-learning-and-what-does-it-have-to-do-with-movielens-data">What is the significance of SVD (Singular Value Decomposition) in machine learning and what does it have to do with MovieLens data?</h3> <p>We can represent each movie as a dimension; each user then corresponds to a data point in this high dimensional space. But we are not able to visualize more than three dimensions. This data can be represented by a matrix (the \(X\) in the above equation). A sample depiction of the matrix may look as follows. <img src="https://docs.google.com/drawings/d/1oBQ7iNf-c6GCYBvalyM7HXlcscX1ATz9lQsxzpHdCyQ/pub?w=960&amp;h=720" alt="user_movie_matrix" /> Because each user rates only a small fraction of the movies, this matrix contains a large number of empty entries and will therefore be very sparse. Hence, approximating this matrix with a lower rank matrix is a worthwhile attempt.</p> <p>Consider the following scenario:</p> <p>If every user who likes “movie X” also likes “movie Y”, then it is possible to group them together to form an agglomerative movie or feature. After forming new features in that way, two users can be compared by analyzing their ratings for different features rather than for individual movies.</p> <p>In the same way, different users may rate the same movies similarly. 
So there can be different types of similarities among user preferences.</p> <p>According to this factorization method (you might want to read more about SVD at this point from the reference I have provided earlier), the matrix \(S\) is a diagonal matrix containing the singular values of the matrix \(X\). The number of singular values is exactly equal to the rank of the matrix \(X\). The rank of a matrix is the number of linearly independent rows or columns in the matrix. We know that vectors are linearly independent if none of them can be written as a linear combination of the others. You can notice that this linear independence somehow captures the notion of a feature or agglomerative item which we try to generate in this approach. According to the above scenario, if every user who liked “Movie X” also liked “Movie Y”, then those two movie vectors would be linearly dependent and would only contribute one to the rank.</p> <p>So how are we to get rid of this redundant data? We can compare movies if most users who like one also like the other. In order to do that, we will keep the largest \(k\) singular values in \(S\). This will give us the best rank-k approximation to \(X\). </p> <p>So the entire procedure can be boiled down to the following three steps:</p> <ol> <li>Compute the SVD: \(X = U S V^T\).</li> <li>Form the matrix \(S’\) by keeping the \(k\) largest singular values and setting the others to zero.</li> <li>Form the matrix \(X _ lr\) by \(X _ lr = U S’ V^T\).</li> </ol> <h2 id="implementation">Implementation</h2> <p>To perform SVD on the MovieLens data set and recompose the matrix with a lower rank, I used scipy sparse matrices, numpy and pandas. 
It is done in the following steps.</p> <p>1) Import required packages.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> <span class="kn">import</span> <span class="nn">scipy.sparse</span> <span class="kn">as</span> <span class="nn">sp</span> <span class="kn">from</span> <span class="nn">scipy.sparse.linalg</span> <span class="kn">import</span> <span class="n">svds</span> <span class="kn">import</span> <span class="nn">pickle</span></code></pre></div> <p>2) Load the data set into a pandas data frame.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">data_file</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s">r&#39;ratings.dat&#39;</span><span class="p">,</span> <span class="n">sep</span> <span class="o">=</span> <span class="s">&#39;::&#39;</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span></code></pre></div> <p>Here, I have assumed that the <code>ratings.dat</code> file from the MovieLens 1M data is in the working directory. The only reason I am using a pandas data frame is its convenience. You can directly open the file and proceed. 
But then you will have to change the following steps to adapt to that method.</p> <p>3) Extract required meta information from the data set.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">users</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">unique</span><span class="p">(</span><span class="n">data_file</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="n">movies</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">unique</span><span class="p">(</span><span class="n">data_file</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="n">number_of_rows</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">users</span><span class="p">)</span> <span class="n">number_of_columns</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">movies</span><span class="p">)</span> <span class="n">movie_indices</span><span class="p">,</span> <span class="n">user_indices</span> <span class="o">=</span> <span class="p">{},</span> <span class="p">{}</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">movies</span><span class="p">)):</span> <span class="n">movie_indices</span><span class="p">[</span><span class="n">movies</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span> <span class="o">=</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">users</span><span class="p">)):</span> <span class="n">user_indices</span><span class="p">[</span><span class="n">users</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span> <span class="o">=</span> <span class="n">i</span></code></pre></div> <p>As the user IDs and movie IDs are not continuous integers (there are missing numbers in between), a proper mapping is required. It will be used when inserting data into the matrix. At this point, you can delete the loaded data frame in order to save memory, but that is optional.</p> <p>4) Creating the sparse matrix and inserting data.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#scipy sparse matrix to store the 1M matrix</span> <span class="n">V</span> <span class="o">=</span> <span class="n">sp</span><span class="o">.</span><span class="n">lil_matrix</span><span class="p">((</span><span class="n">number_of_rows</span><span class="p">,</span> <span class="n">number_of_columns</span><span class="p">))</span> <span class="c">#adds data into the sparse matrix</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">data_file</span><span class="o">.</span><span class="n">values</span><span class="p">:</span> <span class="n">u</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">gona</span> <span class="o">=</span> <span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">line</span><span class="p">)</span> <span class="n">V</span><span class="p">[</span><span class="n">user_indices</span><span class="p">[</span><span class="n">u</span><span class="p">],</span> <span class="n">movie_indices</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span> <span class="o">=</span> <span class="n">r</span></code></pre></div> <p>You can save the sparse matrix <code>V</code> using <code>pickle</code> if you are willing to use it 
later. </p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#as these operations consume a lot of time, it&#39;s better to save processed data </span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">&#39;movielens_1M.pickle&#39;</span><span class="p">,</span> <span class="s">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span> <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="n">handle</span><span class="p">)</span></code></pre></div> <p>5) Perform SVD.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#as these operations consume a lot of time, it&#39;s better to save processed data </span> <span class="c">#gets SVD components from the 1M matrix</span> <span class="n">u</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">vt</span> <span class="o">=</span> <span class="n">svds</span><span class="p">(</span><span class="n">V</span><span class="p">,</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">500</span><span class="p">)</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">&#39;movielens_1M_svd_u.pickle&#39;</span><span class="p">,</span> <span class="s">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span> <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">u</span><span class="p">,</span> <span class="n">handle</span><span class="p">)</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">&#39;movielens_1M_svd_s.pickle&#39;</span><span class="p">,</span> <span class="s">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span> <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">handle</span><span class="p">)</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">&#39;movielens_1M_svd_vt.pickle&#39;</span><span class="p">,</span> <span class="s">&#39;wb&#39;</span><span class="p">)</span> <span class="k">as</span> <span class="n">handle</span><span class="p">:</span> <span class="n">pickle</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">vt</span><span class="p">,</span> <span class="n">handle</span><span class="p">)</span></code></pre></div> <p>The <code>svds</code> method performs the SVD. Parameter <code>k</code> is the number of singular values we want to retain. Here too, I have saved the intermediate data using <code>pickle</code>.</p> <p>After this decomposition you will get <code>u</code>, <code>s</code> and <code>vt</code>. They have (<code>number of users</code>, <code>k</code>), (<code>k</code>, ) and (<code>k</code>, <code>number of movies</code>) shapes respectively.</p> <p>6) Recomposing the lower rank matrix.</p> <p>As <code>s</code> is a vector, we need to create a diagonal matrix from it, with the diagonal containing the values of that vector. 
It is done as follows:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">s_diag_matrix</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">s</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">s</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]))</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">s</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]):</span> <span class="n">s_diag_matrix</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">s</span><span class="p">[</span><span class="n">i</span><span class="p">]</span></code></pre></div> <p>This will create a diagonal matrix. After that, all you have to do is get the matrix product of <code>u</code>, <code>s_diag_matrix</code> and <code>vt</code>, in that order.</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">X_lr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">u</span><span class="p">,</span> <span class="n">s_diag_matrix</span><span class="p">),</span> <span class="n">vt</span><span class="p">)</span></code></pre></div> <p>Now we have the lower rank approximation for \(X\) as \(X _ lr = U S’ V^T\). 
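For reference, steps 1–6 can be condensed into one self-contained sketch. A small random sparse matrix stands in for the MovieLens one here, and `np.diag` builds the diagonal matrix in place of the explicit loop:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

rng = np.random.RandomState(0)

# A small sparse "ratings" matrix standing in for the MovieLens one.
V = sp.lil_matrix((60, 40))
for _ in range(300):
    V[rng.randint(60), rng.randint(40)] = rng.randint(1, 6)

# Keep only the k largest singular values.
k = 10
u, s, vt = svds(V.tocsr(), k=k)

# Recompose the rank-k approximation X_lr = U S' V^T.
X_lr = np.dot(np.dot(u, np.diag(s)), vt)
print(X_lr.shape)  # (60, 40)
```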
Now this matrix can be used as a benchmarking data set for the application.</p> <h2 id="references">References</h2> <ol> <li>Dimensionality Reduction and the Singular Value Decomposition, Available [online]: <a href="http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/svd.html">http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/svd.html</a></li> <li>Data Mining Algorithms In R/Dimensionality Reduction/Singular Value Decomposition, Available [online]: <a href="http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition">http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition</a></li> </ol> Sun, 18 May 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/05/18/preparing-a-bench-marking-data-set-using-singula-value-decomposition-on-movielens-data.html http://maheshakya.github.io/gsoc/2014/05/18/preparing-a-bench-marking-data-set-using-singula-value-decomposition-on-movielens-data.html gsoc GSoC2014: Approximate nearest neighbor search using LSH <p>This project has been initiated as a Google Summer of Code project. The <a href="https://www.python.org/">Python Software Foundation</a> serves as an “umbrella organization” for a variety of Python-related open source projects. <a href="http://scikit-learn.org/stable/index.html">Scikit-learn</a> is a machine learning module which operates under that umbrella, and I’m carrying out this project under that module. In the following sections, I will describe what this really is, with the help of the essence of my project proposal. 
This is just a head start of the journey, and you will be able to obtain a clearer picture as it proceeds.</p> <h2 id="the-struggle-for-nearest-neighbor-search">The struggle for nearest neighbor search</h2> <p>Nearest neighbor search is a well known problem which can be defined as follows: given a collection of n data points, create a data structure which, given any query point, reports the data point that is closest to the query. This problem holds major importance in many applications: data mining, databases, data analysis, pattern recognition, similarity search, machine learning, image and video processing, information retrieval and statistics. There are several efficient algorithms known for performing nearest neighbor search in the case where the dimension is low, but those methods suffer from either space or query time that is exponential in the dimension.</p> <p>In order to address the “Curse of Dimensionality” in large data sets, recent research has focused on approximate nearest neighbor search. It has been proven that in many cases, an approximate nearest neighbor is almost as good as the exact one [1]. Locality Sensitive Hashing is one of those approximation methods. The key idea of LSH is to hash data points using several hash functions, ensuring that for each function the probability of collision is much higher for objects that are close to each other than for those that are far apart.</p> <h2 id="what-does-this-have-to-do-with-me">What does this have to do with me?</h2> <p>In scikit-learn, exact nearest neighbor search is currently implemented, but when it comes to higher dimensions, it fails to perform efficiently [2]. So I’m taking the initiative to implement LSH based ANN for scikit-learn. In this project, several variants of LSH-ANN methods will be prototyped and evaluated. 
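To make the collision idea concrete, here is a toy random-hyperplane LSH sketch. It only illustrates the general technique and is in no way the implementation this project will produce:

```python
import numpy as np

rng = np.random.RandomState(42)
dim, n_bits = 20, 16

# Each hash bit records which side of a random hyperplane a point falls on.
planes = rng.randn(n_bits, dim)


def lsh_hash(point):
    """Binary hash string; nearby points agree on most bits with high probability."""
    return "".join("1" if side else "0" for side in planes.dot(point) > 0)


x = rng.randn(dim)
near = x + 0.01 * rng.randn(dim)  # a slight perturbation of x
far = rng.randn(dim)              # an unrelated random point

agree_near = sum(a == b for a, b in zip(lsh_hash(x), lsh_hash(near)))
agree_far = sum(a == b for a, b in zip(lsh_hash(x), lsh_hash(far)))
print(agree_near, agree_far)  # the near point typically shares more bits
```

In an index, points are bucketed by these hash strings, so near neighbors tend to land in the same bucket.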
After identifying the most appropriate method for scikit-learn, it will be implemented in accordance with scikit-learn’s API and documentation standards, which include narrative documentation. Then, with the results obtained from the prototyping stage, the storing and querying structure of ANN will be implemented. After that, the ANN part will be integrated into the <code>sklearn.neighbors</code> module. Next comes the application of this method: most clustering algorithms use nearest neighbor search, so this method will be adapted for use in those modules in order to improve their speed. As these activities proceed, testing, examples and documentation will be covered. Benchmarking will be done to assess the implementation.</p> <h2 id="milestones-of-the-project">Milestones of the project</h2> <ol> <li>Prototyping/evaluating existing LSH based ANN methods (vanilla and others) in order to find the most appropriate method to have in scikit-learn. There is no point in having methods in scikit-learn which are impractical to use with real data.</li> <li>Approximate neighbor search uses hashing algorithms of the LSH family. These algorithms will be implemented.</li> <li>Implementation of an efficient storing structure to retain trained/hashed data.</li> <li>Integrating the ANN search into the current implementation of neighbor search, so that it can be used with the existing API.</li> <li>Improving the speed of existing clustering models with the implemented ANN search.</li> <li>Completing tests, examples, documentation and benchmarking.</li> </ol> <p>Well, that’s it for the moment. I will guide you through when things are in motion. </p> <h2 id="abbreviations">Abbreviations</h2> <ul> <li>LSH : Locality sensitive hashing</li> <li>ANN : Approximate nearest neighbor</li> <li>API : Application programming interface</li> </ul> <h2 id="references">References</h2> <ol> <li>A. Andoni and P. 
Indyk, “Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions”, Available [online]: <a href="http://people.csail.mit.edu/indyk/p117-andoni.pdf">http://people.csail.mit.edu/indyk/p117-andoni.pdf</a></li> <li>R. Rehurek, “Performance Shootout of Nearest Neighbours: Contestants”, Available [online]: <a href="http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/">http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants/</a></li> </ol> Sun, 04 May 2014 00:00:00 +0000 http://maheshakya.github.io/gsoc/2014/05/04/approximate-nearest-neighbor-search-using-lsh.html http://maheshakya.github.io/gsoc/2014/05/04/approximate-nearest-neighbor-search-using-lsh.html gsoc The Pythonizer! <p>Though I’m a Python fanatic, Jekyll is inconceivably awesome! (It’s Ruby)</p> <p>Jekyll also offers powerful support for code snippets:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">print_hi</span><span class="p">(</span><span class="n">name</span><span class="p">):</span> <span class="k">print</span> <span class="s">&quot;Hi, &quot;</span> <span class="o">+</span> <span class="n">name</span> <span class="n">print_hi</span><span class="p">(</span><span class="s">&#39;Gonaa&#39;</span><span class="p">)</span> <span class="c">#=&gt; prints &#39;Hi, Gonaa&#39; to STDOUT.</span></code></pre></div> <p>Check out the <a href="http://jekyllrb.com">Jekyll docs</a> for more info on how to get the most out of Jekyll. File all bugs/feature requests at <a href="https://github.com/mojombo/jekyll">Jekyll’s GitHub repo</a>.</p> Sun, 06 Apr 2014 15:40:56 +0000 http://maheshakya.github.io/miscellaneous/2014/04/06/welcome-to-jekyll.html http://maheshakya.github.io/miscellaneous/2014/04/06/welcome-to-jekyll.html miscellaneous