<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Information Retrieval Blog &#187; autocomplete</title>
	<atom:link href="http://blog.zye.me/tag/autocomplete/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.zye.me</link>
	<description>REAL TIME DATA PROCESSING, DISTRIBUTED COMPUTING, PATTERN DISCOVERY</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:33:32 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Trie-based approximate autocomplete implementation with support for ranks and synonyms</title>
		<link>http://blog.zye.me/2009/07/53262.html</link>
		<comments>http://blog.zye.me/2009/07/53262.html#comments</comments>
		<pubDate>Thu, 02 Jul 2009 18:58:09 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[autocomplete]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[Nutch]]></category>

		<guid isPermaLink="false">http://blog.so8848.com/?p=53262</guid>
		<description><![CDATA[Thoughts on Lucene, Solr, Nutch and vertical search  Trie-based approximate autocomplete implementation with support for ranks and synonyms Posted by Kelvin on 01 Jul 2009 at 02:30 am &#124; Tagged as: programming The problem of auto-completing user queries is a well-explored one. For example, Type less, find more: fast autocompletion search with a succinct index <a href='http://blog.zye.me/2009/07/53262.html'>[...]</a>]]></description>
			<content:encoded><![CDATA[<h1>Thoughts on Lucene, Solr, Nutch and vertical search</h1>
<h2 class="post-title"><a title="Permanent Link: Trie-based approximate autocomplete implementation with support for ranks and synonyms" rel="bookmark" href="http://www.supermind.org/blog/530/trie-based-approximate-autocomplete-implementation-with-support-for-ranks-and-synonyms">Trie-based approximate autocomplete implementation with support for ranks and synonyms</a></h2>
<p class="day-date">Posted by <em>Kelvin</em> on <em>01 Jul 2009</em> at <em>02:30 am</em> | Tagged as: <em><a title="View all posts in programming" rel="category tag" href="http://www.supermind.org/blog/category/programming">programming</a></em></p>
<p>The problem of auto-completing user queries is a well-explored one.</p>
<p>For example,<br />
<a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.3776&amp;rep=rep1&amp;type=pdf">Type less, find more: fast autocompletion search with a succinct index</a><br />
<a href="http://stevedaskam.wordpress.com/2009/06/07/putting-autocomplete-data-structure-to-the-test/">http://stevedaskam.wordpress.com/2009/06/07/putting-autocomplete-data-structure-to-the-test/</a><br />
<a href="http://suggesttree.sourceforge.net/">http://suggesttree.sourceforge.net/</a><br />
<a href="http://sujitpal.blogspot.com/2007/02/three-autocomplete-implementations.html">http://sujitpal.blogspot.com/2007/02/three-autocomplete-implementations.html</a></p>
<p>However, there’s been little written about supporting synonyms and approximate matching for the purpose of autocompletion.</p>
<p>The approach for autocompletion I’ll be discussing in this article supports the following features:</p>
<p>- given a prefix, returns all words starting with that prefix<br />
- approximate/fuzzy prefix matching based on k-errors / edit distance / Levenshtein distance, with the operations substitution, insertion and deletion (adjacent transpositions would be trivial to add)<br />
- support for ranks (i.e. a number associated with a word that affects ranking, such as the number of search results for a given phrase)<br />
- support for synonyms</p>
<p>The running time of the search depends on <em>k</em> (the edit distance) and the length of the prefix, not on the size of the dictionary.</p>
<h3>Data Structures</h3>
<p>The data structure used is a trie. Words are broken into characters. Each character forms a node in a tree.</p>
<p>To support synonyms, each terminal node gets a pointer to the canonical word of its synonym ring.</p>
<p>To support ranks, each terminal node also stores the rank value of its word. At search time, results are sorted first by edit distance, then by rank.</p>
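<p>A minimal Python sketch of such a trie, to make the structure concrete. The names <code>TrieNode</code>, <code>insert</code> and <code>words_with_prefix</code> are illustrative assumptions and not taken from the original implementation, and the canonical word is stored as a plain string rather than a node pointer for simplicity.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrieNode:
    """One character of the trie (hypothetical layout, for illustration only)."""
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    is_terminal: bool = False          # True when a word ends at this node
    rank: int = 0                      # only meaningful on terminal nodes
    canonical: Optional[str] = None    # canonical word of the synonym ring


def insert(root: TrieNode, word: str, rank: int = 0,
           canonical: Optional[str] = None) -> None:
    """Add a word; a word that is not a synonym is its own canonical form."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_terminal = True
    node.rank = rank
    node.canonical = canonical if canonical is not None else word


def words_with_prefix(root: TrieNode, prefix: str) -> List[str]:
    """Exact prefix lookup: every stored word that starts with the prefix."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    found = []
    stack = [(node, prefix)]
    while stack:
        current, text = stack.pop()
        if current.is_terminal:
            found.append(text)
        for ch, child in current.children.items():
            stack.append((child, text + ch))
    return sorted(found)


root = TrieNode()
insert(root, "ham", rank=10)
insert(root, "hammer", rank=50)
insert(root, "mallet", rank=50, canonical="hammer")  # synonym of "hammer"
print(words_with_prefix(root, "ham"))  # ['ham', 'hammer']
```

<p>Keeping the rank and the canonical word on the terminal node means no extra lookup is needed once a match has been found.</p>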
<h3>Implementation</h3>
<p>At index-time, as mentioned above, a trie is built from input words/lexicon. No other preprocessing is done.</p>
<p>At search-time, dynamic programming (DP) is applied on a depth-first traversal of the trie, to collect all the “active sub-tries” of the trie which are within <em>k</em> errors of the given prefix.</p>
<p>Where traditional DP uses an m x n table, with m and n the lengths of the two strings being compared, we instantiate a single (prefix length + max k) x (prefix length) matrix for the entire trie, where the prefix is the string for which we want to produce autocomplete results.</p>
<p>Why (prefix length + max k) x (prefix length)?</p>
<p>1. We don’t need to compare full strings because we’re only interested in collecting active sub-tries which satisfy the k-errors constraint.</p>
<p>2. For cases where the length of the word to be evaluated is greater than the length of the prefix, evaluations of the word should be performed up to prefix length + maximum k. This accounts for the scenario where deleting the first <em>k</em> characters of a word leaves a string that satisfies the edit distance constraint, e.g. prefix of <strong>hamm</strong> and word <strong>bahamm</strong>, with k=2.</p>
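<p>The bounded table can be sketched for a single word as follows. This is only an illustration under my own assumptions: the function names are hypothetical, rows follow the word and columns follow the prefix, and one extra row and column are added for the empty string, which the dimensions quoted above leave out.</p>

```python
from typing import List


def build_table(prefix: str, word: str, max_k: int) -> List[List[int]]:
    """Edit-distance table; table[i][j] compares word[:i] with prefix[:j]."""
    # Never evaluate more than prefix length + max k characters of the word.
    depth = min(len(word), len(prefix) + max_k)
    table = [list(range(len(prefix) + 1))]  # row 0: empty word
    for i in range(1, depth + 1):
        prev, row = table[-1], [i]
        for j in range(1, len(prefix) + 1):
            cost = 0 if word[i - 1] == prefix[j - 1] else 1
            row.append(min(prev[j] + 1,          # delete a character of the word
                           row[j - 1] + 1,       # insert a character of the prefix
                           prev[j - 1] + cost))  # substitute, or match for free
        table.append(row)
    return table


def prefix_distance(prefix: str, word: str, max_k: int) -> int:
    """Smallest distance between the prefix and any evaluated head of the word."""
    return min(row[-1] for row in build_table(prefix, word, max_k))


print(prefix_distance("hamm", "bahamm", 2))  # 2: delete "b" and "a"
print(prefix_distance("hamm", "hammer", 2))  # 0: "hamm" is an exact head
```

<p>With <code>max_k=2</code> the table for <strong>bahamm</strong> reaches row 6, which is exactly where the two deletions pay off; stopping at row 4 (the prefix length) would miss the match.</p>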
<h4>Cutoff optimizations</h4>
<p>The value of <em>k</em> (edit distance) never decreases as we descend from one level of the trie to the next.</p>
<p>Therefore when <em>k</em> &gt; maximum k at a node, we can reject the entire sub-trie of that node. In other words, given the prefix <strong>hamm</strong>, and maximum k of <strong>1</strong>, when we encounter <strong>hen</strong> which has k=2, we can discard any children of <strong>hen</strong> because their k-values will be &gt;= 2.</p>
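<p>Here is a rough sketch of the cutoff on a toy trie built from nested dicts. I am assuming that the k-value of a node is the minimum of its DP row, which reproduces the k=2 for <strong>hen</strong> in the example above; <code>build_trie</code> and <code>find_active</code> are hypothetical names, not the original code.</p>

```python
from typing import Dict, List, Tuple


def build_trie(words: List[str]) -> Dict:
    """Toy trie: each node is a dict mapping a character to its child node."""
    root: Dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def find_active(trie: Dict, prefix: str, max_k: int) -> Tuple[List[str], List[str]]:
    """Return (roots of active sub-tries, nodes rejected by the cutoff)."""
    active: List[str] = []
    pruned: List[str] = []

    def visit(node: Dict, text: str, row: List[int]) -> None:
        if row[-1] <= max_k:          # whole prefix matched within max_k errors
            active.append(text)
            return
        if min(row) > max_k:          # k-value too large: reject the sub-trie
            pruned.append(text)
            return
        if len(text) == len(prefix) + max_k:
            return                    # depth limit: prefix length + max k
        for ch, child in node.items():
            nxt = [row[0] + 1]
            for j, pc in enumerate(prefix, start=1):
                nxt.append(min(row[j] + 1,                # deletion
                               nxt[j - 1] + 1,            # insertion
                               row[j - 1] + (pc != ch)))  # substitution or match
            visit(child, text + ch, nxt)

    visit(trie, "", list(range(len(prefix) + 1)))
    return active, pruned


trie = build_trie(["hammer", "hence"])
print(find_active(trie, "hamm", 1))  # (['ham'], ['hen'])
```

<p>The traversal never visits <strong>henc</strong> or <strong>hence</strong>: once <strong>hen</strong> exceeds the maximum k, its children are skipped entirely.</p>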
<h4>Collecting autocomplete results</h4>
<p>With a list of active sub-tries, we then collect all terminal strings, sort them first by edit distance and then by rank, and return the top <em>n</em> results.</p>
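<p>Putting the pieces together, the whole search might look roughly like the sketch below. Several things here are my assumptions rather than the original implementation: the names <code>Node</code>, <code>insert</code> and <code>autocomplete</code>, treating a higher rank as better, and returning the canonical word whenever a synonym matches.</p>

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    rank: Optional[int] = None        # set only on terminal nodes
    canonical: Optional[str] = None   # canonical word of the synonym ring


def insert(root: Node, word: str, rank: int, canonical: Optional[str] = None) -> None:
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.rank = rank
    node.canonical = canonical or word


def autocomplete(root: Node, prefix: str, max_k: int, top_n: int) -> List[str]:
    limit = len(prefix) + max_k
    best: Dict[str, Tuple[int, int]] = {}  # canonical word -> (distance, -rank)

    def collect(node: Node, dist: int) -> None:
        # Gather every terminal below the root of an active sub-trie.
        if node.rank is not None:
            key = (dist, -node.rank)
            if node.canonical not in best or key < best[node.canonical]:
                best[node.canonical] = key
        for child in node.children.values():
            collect(child, dist)

    def visit(node: Node, row: List[int], depth: int) -> None:
        if row[-1] <= max_k:
            # Active sub-trie. Deeper nodes may match with a smaller distance,
            # so we keep descending and let `best` keep the minimum.
            collect(node, row[-1])
        if min(row) > max_k or depth == limit:
            return  # cutoff, or prefix length + max k reached
        for ch, child in node.children.items():
            nxt = [row[0] + 1]
            for j, pc in enumerate(prefix, start=1):
                nxt.append(min(row[j] + 1, nxt[j - 1] + 1, row[j - 1] + (pc != ch)))
            visit(child, nxt, depth + 1)

    visit(root, list(range(len(prefix) + 1)), 0)
    ranked = sorted(best.items(), key=lambda item: item[1])
    return [word for word, _ in ranked[:top_n]]


root = Node()
insert(root, "ham", 10)
insert(root, "hammer", 50)
insert(root, "hammock", 80)
insert(root, "hen", 99)
insert(root, "mallet", 50, canonical="hammer")
print(autocomplete(root, "hamm", 1, 10))  # ['hammock', 'hammer', 'ham']
print(autocomplete(root, "mall", 0, 5))   # ['hammer'] via the synonym "mallet"
```

<p>Sorting on the tuple (distance, negated rank) gives the ordering described above: exact matches first, and within the same distance the higher-ranked word wins. Re-collecting overlapping sub-tries is wasteful but keeps the sketch short.</p>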
<h3>Further explorations</h3>
<p>1. Implementing adjacent transpositions<br />
2. Implementing Ukkonen’s cutoff algorithm for DP (not to be confused with Ukkonen’s algorithm for linear-time construction of suffix trees)<br />
3. Comparing performance of tries vs compact tries (where non-branching nodes are collapsed)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2009/07/53262.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

