<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Information Retrieval Blog &#187; SaberLucene</title>
	<atom:link href="http://blog.zye.me/tag/saberlucene/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.zye.me</link>
	<description>REAL TIME DATA PROCESSING, DISTRIBUTED COMPUTING, PATTERN DISCOVERY</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:33:32 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Be Careful with RangeQuery in Lucene</title>
		<link>http://blog.zye.me/2009/09/54218.html</link>
		<comments>http://blog.zye.me/2009/09/54218.html#comments</comments>
		<pubDate>Thu, 03 Sep 2009 02:13:38 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[RangeQuery]]></category>
		<category><![CDATA[SaberLucene]]></category>

		<guid isPermaLink="false">http://blog.so8848.com/?p=54218</guid>
		<description><![CDATA[Reminder: Lucene has many Query types: TermQuery, BooleanQuery, ConstantScoreQuery, MatchAllDocsQuery, MultiPhraseQuery, FuzzyQuery, WildcardQuery, RangeQuery, PrefixQuery, PhraseQuery, Span*Query, DisjunctionMaxQuery, etc. There are many Query implementations in Lucene, which makes Lucene very powerful for search. However, you should be careful when using queries such as RangeQuery, especially when the size of your collection is very <a href='http://blog.zye.me/2009/09/54218.html'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p><strong><span style="font-family: mceinline;">Reminder: Lucene has many Query types</span></strong></p>
<p>TermQuery, BooleanQuery, ConstantScoreQuery, MatchAllDocsQuery, MultiPhraseQuery, FuzzyQuery, WildcardQuery, RangeQuery, PrefixQuery, PhraseQuery, Span*Query, DisjunctionMaxQuery, etc.</p>
<p>There are many Query implementations in Lucene, which makes Lucene very powerful for search. However, you should be careful when using queries such as RangeQuery, especially when your collection is very large.</p>
<p>As you may know, Lucene rewrites the original query before executing it, and some of the rewrite implementations can be inefficient. Let&#8217;s look at a code snippet from RangeQuery first.</p>
<pre>
public RangeQuery(Term lowerTerm, Term upperTerm, boolean inclusive,
                  Collator collator)
{
    this(lowerTerm, upperTerm, inclusive);
    this.collator = collator;
}

public Query rewrite(IndexReader reader) throws IOException {

    BooleanQuery query = new BooleanQuery(true);
    String testField = getField();
    if (collator != null) {
        TermEnum enumerator = reader.terms(new Term(testField, ""));
        String lowerTermText = lowerTerm != null ? lowerTerm.text() : null;
        String upperTermText = upperTerm != null ? upperTerm.text() : null;

        try {
            do {
                Term term = enumerator.term();
                if (term != null &amp;&amp; term.field() == testField) { // interned comparison
                    if ((lowerTermText == null
                            || (inclusive ? collator.compare(term.text(), lowerTermText) &gt;= 0
                                          : collator.compare(term.text(), lowerTermText) &gt; 0))
                        &amp;&amp; (upperTermText == null
                            || (inclusive ? collator.compare(term.text(), upperTermText) &lt;= 0
                                          : collator.compare(term.text(), upperTermText) &lt; 0))) {
                        addTermToQuery(term, query);
                    }
                }
            }
            while (enumerator.next());
        }
        finally {
            enumerator.close();
        }
    }
    &#8230;&#8230;&#8230;&#8230;&#8230;
}
</pre>
<p>As the source code shows, a RangeQuery may be rewritten into thousands of TermQuery clauses. This makes the search inefficient and can even cause a TooManyClauses exception. In addition, when a Collator is supplied, the rewrite method shown above enumerates terms from the start of the field to the end of the dictionary and tests each one. This is another reason why RangeQuery can make a search slow.</p>
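<p>To make the clause explosion concrete, here is a minimal sketch in plain Java rather than real Lucene code. The class RangeRewriteSketch, its expand method and the generated terms are hypothetical names used only for illustration; the limit of 1024 mirrors Lucene's default maximum clause count for BooleanQuery. The sketch only counts how many single-term clauses a range would expand into and fails once that limit is exceeded.</p>
<pre>
```java
class RangeRewriteSketch {
    // Mirrors Lucene's default BooleanQuery clause limit; a plain constant here.
    static final int MAX_CLAUSES = 1024;

    // Scans every term of a sorted dictionary and counts the terms that fall
    // inside the inclusive range [lower, upper]. Each match stands for one
    // clause that a rewritten range query would carry.
    static int expand(String[] sortedTerms, String lower, String upper) {
        int clauses = 0;
        for (String term : sortedTerms) {
            boolean aboveLower = term.compareTo(lower) >= 0;
            boolean belowUpper = upper.compareTo(term) >= 0;
            if (aboveLower && belowUpper) {
                clauses++;
                if (clauses > MAX_CLAUSES) {
                    throw new IllegalStateException("too many clauses: " + clauses);
                }
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: 5000 distinct, zero-padded numeric terms.
        String[] terms = new String[5000];
        for (int i = 0; i < terms.length; i++) {
            terms[i] = String.format("%08d", 20000000 + i);
        }

        // A narrow range stays under the limit: prints 100.
        System.out.println(expand(terms, "20000000", "20000099"));

        // A wide range trips the limit: prints "too many clauses: 1025".
        try {
            expand(terms, "20000000", "20004999");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```
</pre>
<p>The number of clauses depends on how many distinct terms the field holds inside the range, not on how many documents match, so a field with fine-grained values (timestamps, prices) reaches the limit quickly.</p>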
<p>In contrast to RangeQuery, RangeFilter does this job faster. Although RangeFilter also traverses the term dictionary, it does not incur the additional search over the expanded clauses that RangeQuery does.</p>
<p>The RangeFilter implementation in Lucene does not consume much memory either. Its result is a bit set with one bit per document, so a collection with 10M documents needs only about 1.25 MB (10M bits / 8). For these reasons, I recommend using RangeFilter rather than RangeQuery.</p>
<p><span style="background-color: #ffffff;">ConstantScoreRangeQuery is a wrapper around RangeFilter that lets us run a range search as a query. It returns a constant score, equal to its boost, for every document in the range. It is a better choice than RangeQuery when the range is only meant to restrict the result set rather than to contribute to the ranking of the results.</span></p>
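<p>The filter-based approach can be sketched in plain Java as well, again without real Lucene classes. RangeFilterSketch, its bits and score methods, and the in-memory terms and postings arrays are hypothetical stand-ins for the index; the sketch assumes document ids are small non-negative integers. It marks every document that contains a term in the range and then gives each marked document the same score.</p>
<pre>
```java
import java.util.BitSet;

class RangeFilterSketch {
    // Marks every document that contains at least one term in the inclusive
    // range [lower, upper]. sortedTerms[i] is a term and postings[i] holds the
    // ids of the documents containing it (both hypothetical in-memory stand-ins).
    static BitSet bits(String[] sortedTerms, int[][] postings,
                       String lower, String upper, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        for (int i = 0; i < sortedTerms.length; i++) {
            String term = sortedTerms[i];
            if (term.compareTo(lower) >= 0 && upper.compareTo(term) >= 0) {
                for (int doc : postings[i]) {
                    result.set(doc); // no clause is created, so no clause limit applies
                }
            }
        }
        return result;
    }

    // Constant scoring: every matching document gets the boost, all others 0.
    static float score(BitSet matches, int doc, float boost) {
        return matches.get(doc) ? boost : 0f;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "c", "d"};
        int[][] postings = {{0}, {1, 2}, {2, 3}, {4}};

        BitSet matches = bits(terms, postings, "b", "c", 5);
        System.out.println(matches);                 // prints {1, 2, 3}
        System.out.println(score(matches, 2, 1.5f)); // prints 1.5
        System.out.println(score(matches, 0, 1.5f)); // prints 0.0
    }
}
```
</pre>
<p>Because the result is a set of document ids instead of a list of clauses, the number of terms inside the range affects only the time spent filling the bit set, not the size of the query.</p>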
<p><strong>Note:</strong> The implementations of FuzzyQuery, <span style="background-color: #ffffff;">WildcardQuery, RangeQuery and PrefixQuery are much the same, so be careful when using them as well.</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2009/09/54218.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Ivory Toolkit with the SMRF Retrieval Engine (under Hadoop Framework)</title>
		<link>http://blog.zye.me/2009/08/53939.html</link>
		<comments>http://blog.zye.me/2009/08/53939.html#comments</comments>
		<pubDate>Sat, 08 Aug 2009 17:53:41 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[IR toolkit]]></category>
		<category><![CDATA[IR experimental system]]></category>
		<category><![CDATA[Ivory]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[SaberLucene]]></category>

		<guid isPermaLink="false">http://blog.so8848.com/?p=53939</guid>
		<description><![CDATA[As IR datasets grow in size, a powerful platform for rapid indexing and searching seems to be needed. Ivory is a newly announced experimental platform built on Hadoop. It could be a good choice as we enter the era of billion-document collections. The system has shown very competitive performance. I believe it will be the next successful <a href='http://blog.zye.me/2009/08/53939.html'>[...]</a>]]></description>
			<content:encoded><![CDATA[<table border="0" width="80%">
<tbody>
<tr>
<td align="left"><p>As IR datasets grow in size, a powerful platform for rapid indexing and searching seems to be needed. Ivory is a newly announced experimental platform built on Hadoop. It could be a good choice as we enter the era of billion-document collections. The system has shown very competitive performance, and I believe it will be the next successful experimental platform if more documentation is provided. However, out of the box Ivory does not implement as many algorithms as Terrier (which itself does not have enough). This is also a planned step for our LabLucene project (not yet released). Besides the MapReduce framework, we would also like to integrate the Indri Query Language into LabLucene. After these two major steps, we expect a first release of LabLucene. Right now I am just starting to learn Hadoop, and I would welcome help. Anyone who wants to get involved in this unfunded project is warmly welcome.</p>
<hr />
<h2>The Ivory Toolkit with the SMRF Retrieval Engine</h2>
<div class="main">
<p>Ivory is a Hadoop toolkit for Web-scale information retrieval research that features a retrieval engine based on Markov Random Fields, appropriately named SMRF (Searching with Markov Random Fields). This open-source project began in Spring 2009 and represents a collaboration between the University of Maryland and Yahoo! Research. Ivory takes full advantage of the <a href="http://hadoop.apache.org/core/">Hadoop</a> distributed environment (the MapReduce programming model and the underlying distributed file system) for both indexing and retrieval.</p>
<p>In order to temper expectations, please note that Ivory is not meant to serve as a full-featured search engine (e.g., Lucene), but rather aimed at information retrieval researchers who need access to low-level data structures and who generally know their way around retrieval algorithms. As a result, a lot of &#8220;niceties&#8221; are simply missing—for example, fancy interfaces or ingestion support for different file types. It goes without saying that Ivory is a bit rough around the edges, but our philosophy is to release early and release often. In short, Ivory is <strong>experimental</strong>!</p>
<p>Ivory was specifically designed to work with Hadoop &#8220;out of the box&#8221; on the <a href="http://boston.lti.cs.cmu.edu/clueweb09/wiki/tiki-index.php?page=ClueWeb09%20Wiki">ClueWeb09 collection</a>, a 1 billion page (25 TB) Web crawl distributed by Carnegie Mellon University. The initial release of Ivory is meant to serve as a reference implementation of indexing and retrieval algorithms that can operate at the multi-terabyte scale. Another interesting experimental aspect of Ivory is its retrieval architecture: we&#8217;ve been playing with retrieval engines that directly read postings from HDFS. The getting started guide with <a href="trec.html">TREC disks 4-5</a> provides more details.</p>
<h3>Download</h3>
<ul>
<li>Ivory, release 0.1 (July 18, 2009): <a href="http://www.umiacs.umd.edu/~jimmylin/dist/ivory-r0.1.tar.gz">ivory-r0.1.tar.gz</a> (4.9 MB)</li>
</ul>
<h3>Documentation</h3>
<ul>
<li><a href="javadoc/index.html">Ivory API javadoc</a></li>
<li><a href="start.html">Downloading and setting up</a> the Ivory toolkit</li>
<li>Getting started with <a href="trec.html">TREC disks 4-5</a></li>
<li>Getting started with <a href="clue.html">the ClueWeb09 collection</a></li>
<li>Other <a href="thirdparty.html">third-party libraries</a> on which Ivory depends</li>
<li><a href="team.html">Project team</a></li>
</ul>
</div>
</td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2009/08/53939.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

