{"id":81,"date":"2010-06-25T15:29:30","date_gmt":"2010-06-25T23:29:30","guid":{"rendered":"http:\/\/bynomial.com\/blog\/?p=81"},"modified":"2010-06-25T15:29:30","modified_gmt":"2010-06-25T23:29:30","slug":"fast-numeric-queries","status":"publish","type":"post","link":"http:\/\/bynomial.com\/blog\/?p=81","title":{"rendered":"Fast numeric queries"},"content":{"rendered":"<p>This post introduces a relatively new algorithm, along with an open-source demo iPad app, that can pre-process a large set of numeric data to help perform extremely fast error-tolerant lookups at query time.<\/p>\n<h2>Problems you can solve<\/h2>\n<p>There is a class of algorithms known as <a href=\"http:\/\/en.wikipedia.org\/wiki\/Locality_sensitive_hashing\">locality-sensitive hashing<\/a> (LSH) algorithms; this is a new one of those. \u00c2\u00a0They were pioneered by Rajeev Motwani and Piotr Indyk, who remains one of the primary world experts. \u00c2\u00a0LSH algorithms can be used to help solve the nearest neighbors problem. \u00c2\u00a0Solving this problem, which I&#8217;ll describe a little more below, is still the most obvious way to use LSH algorithms, along with the similar <em>k<\/em>-nearest neighbors problem.<\/p>\n<h3>The nearest neighbor \u00c2\u00a0(NN) and <em>k<\/em>-NN problems<\/h3>\n<p>The nearest neighbor question is this: Given a large set of points (<em>P<\/em>), pre-process it so that later, when you receive a new query point (<em>q<\/em>), you can very quickly return the subset of original points <em>P<\/em> which are within a fixed radius <em>r<\/em> of <em>q<\/em> (the desired output is the set near(<em>q<\/em>, <em>P<\/em>) = {<em>p<\/em> in <em>P<\/em> : ||<em>p<\/em> &#8211; <em>q<\/em>|| &lt; <em>r<\/em>}). \u00c2\u00a0The radius <em>r<\/em> is chosen in advance and\u00c2\u00a0doesn&#8217;t change, although you can store different outputs of the pre-processing which can handle multiple radii. 
Technically speaking, the NN problem actually asks for the single point in <em>P<\/em> closest to <em>q<\/em>, and the <em>k<\/em>-NN problem asks for the <em>k<\/em> points in <em>P<\/em> closest to <em>q<\/em>; we introduce the radius-based version because it is close enough for practical purposes, and is more directly solved by LSH algorithms.<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-82 aligncenter\" title=\"File-KnnClassification\" src=\"http:\/\/bynomial.com\/blog\/wp-content\/uploads\/2010\/06\/File-KnnClassification.png\" alt=\"\" width=\"220\" height=\"199\" \/><\/p>\n<h3>Stuff you can do with NN or k-NN<\/h3>\n<p>Unless you&#8217;re a quasi-academic algorithm junkie like me, your eyes probably glazed over a little in that last paragraph. Despite sounding abstract, these problems have a myriad of very real applications.<\/p>\n<h4>Recommendations<\/h4>\n<p>For example, let&#8217;s say you have a user database with usage data per user, such as ratings from Netflix or Yelp users. Temporarily forgetting about missing fields (that&#8217;s a somewhat independent challenge), you could represent every user as a vector, and then quickly find each user&#8217;s 10 &#8220;nearest neighbors,&#8221; which in this case means the 10 other users most like the query user. From there, you can combine those 10 users to get a reasonable guess at what else the query user likes, and recommend the guessed-as-highly-rated items to the query user (excluding things they&#8217;ve already rated).<\/p>\n<h4>Handwriting recognition<\/h4>\n<p>Written characters can be encoded as an image, and images can be further reduced, in a way that allows for some rotations and skews, to get a relatively small thumbnail &#8211; say, 500 coordinates, each a float computed by some function of a small black &amp; white image.
If you made the thumbnails well, then slightly different ways of writing the number &#8216;7&#8217; will result in slightly different &#8211; but mostly similar &#8211; thumbnails. As vectors, they will be close together.<\/p>\n<p>Later you get a new image as input, and your job is to figure out what character it is. You compute its thumbnail in the same way, and now just do a quick lookup via LSH and return the most common character in that set. (This general method can actually be used for any machine learning classification problem.)<\/p>\n<h4>Duplicate detection<\/h4>\n<p>Suppose you&#8217;re a site like YouTube, with user-generated content and copyright lawyers at other companies who are feeling sue-happy. Your job is to quickly and efficiently find any user-uploaded videos that violate copyright. You have a database of known-copyright material, but the tricky part is that users will mess with these videos in small ways so that they are never exact copies.<\/p>\n<p>LSH to the rescue! Because any LSH-based lookup is automatically error tolerant, the copies don&#8217;t have to be exact matches. You can choose a meaningful threshold for what it means to be a &#8220;copy,&#8221; or perhaps simply flag a possible copy for further scrutiny with a slower algorithm.<\/p>\n<h4>Fraud detection<\/h4>\n<p>This case is theoretically similar to handwriting recognition. Suppose you&#8217;re a bank or credit card company, and you want to figure out as quickly as possible when a credit or debit card is stolen.
Just as in the handwriting case, you have a set of sample data (which would look like times, locations, and purchase prices, with numerical categories), and you do an LSH lookup for the nearest usage cases; you may also want to scale the coordinates so that being numerically close corresponds to being similar from a fraud-detection standpoint.<\/p>\n<p>I could go on for a while. Basically, this fits anything with numeric data that also involves some error tolerance, a lot of data to go through or large data points (like videos), and the need for speed.<\/p>\n<h2>How it works<\/h2>\n<p>The full details are left to other places, such as the paper and slides below; but here is the basic idea:<\/p>\n<p>An LSH approach is like asking a group of people to write down their zip codes. You keep a zipcode\/name table sorted by zip code. When a new person arrives, you ask their zip code, do a quick lookup in the table, and can tell them who lives near them.<\/p>\n<p>The zip code function is a real-life locality-sensitive hash. It has these properties:<\/p>\n<ul>\n<li>It&#8217;s a mapping from data to a discrete (non-continuous) set<\/li>\n<li>Data points (people) with the same hash value (zip code) tend to live close together<\/li>\n<\/ul>\n<p>This is nice because it allows for fast lookups. So far this describes the general LSH idea, which is not new.<\/p>\n<p>What is new in this algorithm is how it computes the hash values. Instead of relying on dimension reduction, it relies on a tessellation (tiling) of <em>d<\/em>-dimensional space with one shape. First imagine this shape is a hypercube (a <em>d<\/em>-dimensional cube), so we have a fixed grid of cubes in space. If we turn every corner into a hash table bin, then each hypercube will be thought of as a neighbor of each adjacent cube.
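As a toy sketch of this hypercube scheme (hypothetical helper names, not code from the app or paper): hash each point to every corner of its grid cell, and report two points as candidate neighbors whenever any corner key matches.

```python
import itertools
import math
from collections import defaultdict

def corner_keys(p, cell=1.0):
    """All 2^d corner 'zip codes' of the grid cell containing point p."""
    base = [math.floor(x / cell) for x in p]
    return {tuple(b + off for b, off in zip(base, offs))
            for offs in itertools.product((0, 1), repeat=len(p))}

def build_table(points, cell=1.0):
    """Pre-processing: map each corner key to the indices of points hashed to it."""
    table = defaultdict(set)
    for i, p in enumerate(points):
        for key in corner_keys(p, cell):
            table[key].add(i)
    return table

def candidates(q, table, cell=1.0):
    """Query: indices of points whose cell shares at least one corner with q's cell."""
    out = set()
    for key in corner_keys(q, cell):
        out |= table[key]
    return out

P = [(0.2, 0.3), (0.9, 0.8), (3.5, 3.5)]
table = build_table(P)
print(sorted(candidates((0.5, 0.5), table)))  # [0, 1]
```

Each point stores 2^d corner keys here, which is exactly the exponential cost the post goes on to discuss; the simplex tiling replaces this with only d+1 keys per point.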
In other words, every corner of every cube gets its own zip code, and we call two people neighbors if <em>any<\/em> of their many zip codes match.<\/p>\n<p>But this idea by itself is very bad in high dimensions &#8211; it is attacked by the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Curse_of_dimensionality\">curse of dimensionality<\/a>. The number of corners is 2<sup><em>d<\/em><\/sup>, and the number of neighboring hypercubes is 3<sup><em>d<\/em><\/sup>-1. So the cost for us is exponentially many hash outputs per point, and the gain is that we can match exponentially many neighboring shapes (hypercubes).<\/p>\n<p>The new algorithm solves this by using a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Simplex\">simplex<\/a> to tile space. A simplex is a <em>d<\/em>-dimensional triangle, so it has <em>d<\/em>+1 corners in <em>d<\/em> dimensions. Now we pay the cost of only linearly many hash outputs per point, with the gain of super-exponentially many neighboring shapes! Compared to the hypercube approach, it is a clear win.<\/p>\n<h2>The iPad app<\/h2>\n<p>The iPad app demonstrates the 2D version of the algorithm. It generates a set of random points and plots them on the screen. When you touch the screen, it connects your fingertip to the neighbors found by the LSH algorithm.
It can also display the proximity graph found during preprocessing, and lets you switch the skew-correction component of the algorithm on or off (it&#8217;s generally more accurate when on).<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-83\" title=\"LSH demo screenshot\" src=\"http:\/\/bynomial.com\/blog\/wp-content\/uploads\/2010\/06\/LSH-demo-screenshot.png\" alt=\"\" width=\"546\" height=\"696\" srcset=\"http:\/\/bynomial.com\/blog\/wp-content\/uploads\/2010\/06\/LSH-demo-screenshot.png 546w, http:\/\/bynomial.com\/blog\/wp-content\/uploads\/2010\/06\/LSH-demo-screenshot-235x300.png 235w\" sizes=\"(max-width: 546px) 100vw, 546px\" \/><\/p>\n<p>The code for this app is open source under the Apache 2 license, and currently hosted as <a href=\"http:\/\/code.google.com\/p\/ipadlocalityhash\/\">ipadlocalityhash on Google Code<\/a>.<\/p>\n<h2>Source &amp; References<\/h2>\n<p>The official open-source version of the demo code can be found here:<\/p>\n<p><a href=\"http:\/\/code.google.com\/p\/ipadlocalityhash\/\">http:\/\/code.google.com\/p\/ipadlocalityhash\/<\/a><\/p>\n<p>If you want to skip a couple of links, you can also download the Xcode project source (for v1.0) right here:<\/p>\n<p><a href=\"http:\/\/bynomial.com\/blogfiles\/ipadlocalityhash.tar.gz\">ipadlocalityhash.tar.gz<\/a><\/p>\n<p>Finally, here are links to the original paper describing the algorithm, and some slides I recently used to present it (with the help of this demo app):<\/p>\n<p><a href=\"https:\/\/www.siam.org\/proceedings\/soda\/2010\/SODA10_094_neylont.pdf\">https:\/\/www.siam.org\/proceedings\/soda\/2010\/SODA10_094_neylont.pdf<\/a><\/p>\n<p><a href=\"http:\/\/wiki.hackerdojo.com\/f\/lsh_hacker_dojo_talk.pdf\">http:\/\/wiki.hackerdojo.com\/f\/lsh_hacker_dojo_talk.pdf<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post introduces a relatively new algorithm, along with an open-source demo iPad app, that can pre-process a 
large set of numeric data to help perform extremely fast error-tolerant lookups at query time. Problems you can solve There is a class of algorithms known as locality-sensitive hashing (LSH) algorithms; this is a new one of [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false},"categories":[74,15,75],"tags":[105,77,78,76],"_links":{"self":[{"href":"http:\/\/bynomial.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/81"}],"collection":[{"href":"http:\/\/bynomial.com\/blog\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/bynomial.com\/blog\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/bynomial.com\/blog\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/bynomial.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=81"}],"version-history":[{"count":0,"href":"http:\/\/bynomial.com\/blog\/index.php?rest_route=\/wp\/v2\/posts\/81\/revisions"}],"wp:attachment":[{"href":"http:\/\/bynomial.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=81"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/bynomial.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=81"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/bynomial.com\/blog\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=81"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}