<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Information Retrieval Blog &#187; RangeQuery</title>
	<atom:link href="http://blog.zye.me/tag/rangequery/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.zye.me</link>
	<description>REAL TIME DATA PROCESSING, DISTRIBUTED COMPUTING, PATTERN DISCOVERY</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:33:32 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Be Careful with RangeQuery in Lucene</title>
		<link>http://blog.zye.me/2009/09/54218.html</link>
		<comments>http://blog.zye.me/2009/09/54218.html#comments</comments>
		<pubDate>Thu, 03 Sep 2009 02:13:38 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[RangeQuery]]></category>
		<category><![CDATA[SaberLucene]]></category>

		<guid isPermaLink="false">http://blog.so8848.com/?p=54218</guid>
		<description><![CDATA[Reminder: Lucene has many Query types (TermQuery, BooleanQuery, ConstantScoreQuery, MatchAllDocsQuery, MultiPhraseQuery, FuzzyQuery, WildcardQuery, RangeQuery, PrefixQuery, PhraseQuery, Span*Query, DisjunctionMaxQuery, etc.). Lucene ships with many Query implementations, which makes it very powerful for search. However, you should be very careful when using a query like RangeQuery, especially when the size of your collection is very <a href='http://blog.zye.me/2009/09/54218.html'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p><strong><span style="font-family: mceinline;"> Reminder, Lucene has many Query types</span></strong></p>
<p>TermQuery, BooleanQuery, ConstantScoreQuery, MatchAllDocsQuery, MultiPhraseQuery, FuzzyQuery, WildcardQuery, RangeQuery, PrefixQuery, PhraseQuery, Span*Query, DisjunctionMaxQuery, etc.</p>
<p>Lucene ships with many Query implementations, which makes it very powerful for search. However, you should be very careful when using a query like RangeQuery, especially when your collection is very large.</p>
<p>As you may know, Lucene rewrites the original Query before searching, but some of the rewrite implementations can be inefficient. Let&#8217;s look at a code snippet from RangeQuery first.</p>
<pre>
public RangeQuery(Term lowerTerm, Term upperTerm, boolean inclusive,
                  Collator collator)
{
    this(lowerTerm, upperTerm, inclusive);
    this.collator = collator;
}

public Query rewrite(IndexReader reader) throws IOException {

    BooleanQuery query = new BooleanQuery(true);
    String testField = getField();
    if (collator != null) {
        TermEnum enumerator = reader.terms(new Term(testField, ""));
        String lowerTermText = lowerTerm != null ? lowerTerm.text() : null;
        String upperTermText = upperTerm != null ? upperTerm.text() : null;

        try {
            do {
                Term term = enumerator.term();
                if (term != null &amp;&amp; term.field() == testField) { // interned comparison
                    if ((lowerTermText == null
                            || (inclusive ? collator.compare(term.text(), lowerTermText) &gt;= 0
                                          : collator.compare(term.text(), lowerTermText) &gt; 0))
                        &amp;&amp; (upperTermText == null
                            || (inclusive ? collator.compare(term.text(), upperTermText) &lt;= 0
                                          : collator.compare(term.text(), upperTermText) &lt; 0))) {
                        addTermToQuery(term, query);
                    }
                }
            }
            while (enumerator.next());
        }
        finally {
            enumerator.close();
        }
    }
    &#8230;&#8230;&#8230;&#8230;&#8230;
}
</pre>
<p>As the source code shows, a RangeQuery may be rewritten into thousands of TermQuery clauses, one for every term that falls inside the range. This makes the search inefficient and can even throw a BooleanQuery.TooManyClauses exception once the clause limit (1024 by default) is exceeded. In addition, the rewrite method shown above keeps walking the term dictionary to its end instead of stopping at the upper bound. This is another reason why RangeQuery slows the search down.</p>
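<p>To see how quickly the expansion hits the clause limit, here is a minimal stand-alone sketch. It does not use the Lucene API: the sorted string array standing in for the term dictionary, the class RangeExpansionSketch and its methods are all hypothetical, and only the limit of 1024 mirrors Lucene's default maximum clause count.</p>

```java
import java.util.Arrays;

// Hypothetical stand-in for the term dictionary and BooleanQuery; not Lucene code.
class RangeExpansionSketch {
    // Mirrors Lucene's default BooleanQuery clause limit.
    static final int MAX_CLAUSES = 1024;

    // Builds a sorted dictionary of zero-padded terms: 00000000, 00000001, ...
    static String[] buildDictionary(int count) {
        String[] terms = new String[count];
        for (int i = 0; i < count; i++) {
            terms[i] = String.format("%08d", i);
        }
        Arrays.sort(terms);
        return terms;
    }

    // Adds one "clause" per term inside [lower, upper] and fails past the limit.
    // Like the collator branch above, it visits every term in the dictionary.
    static int countClauses(String[] dictionary, String lower, String upper) {
        int clauses = 0;
        for (String term : dictionary) {
            boolean inRange = term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
            if (inRange) {
                if (clauses >= MAX_CLAUSES) {
                    throw new IllegalStateException("too many clauses, limit is " + MAX_CLAUSES);
                }
                clauses++;
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        String[] dictionary = buildDictionary(5000);
        System.out.println(countClauses(dictionary, "00000100", "00000199"));
        try {
            countClauses(dictionary, "00000000", "00004999");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

<p>With 5000 zero-padded terms, a range covering 100 terms expands without trouble, while a range covering all 5000 terms fails at the limit.</p>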
<p>RangeFilter, by contrast, does this job faster. Although RangeFilter also walks the term dictionary, it does not run the additional search over one clause per term that RangeQuery does.</p>
<p>The RangeFilter implementation in Lucene does not consume much memory either. Its result is a bit set with one bit per document, so a collection of 10M documents needs only about 1.25 MB (10M bits divided by 8). For these reasons, I recommend using RangeFilter rather than RangeQuery.</p>
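<p>The filter approach can be sketched in the same stand-alone style. This is an illustration under assumptions, not Lucene code: RangeFilterSketch, the postings array and bitSetBytes are hypothetical names, and java.util.BitSet stands in for the one-bit-per-document result of the filter.</p>

```java
import java.util.BitSet;

// Hypothetical illustration of a range filter; not the Lucene RangeFilter source.
class RangeFilterSketch {
    // terms is the sorted dictionary; postings[i] lists the documents containing terms[i].
    // Matching documents are marked in one shared bit set instead of one clause per term.
    static BitSet filter(String[] terms, int[][] postings, String lower, String upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int i = 0; i < terms.length; i++) {
            boolean inRange = terms[i].compareTo(lower) >= 0 && terms[i].compareTo(upper) <= 0;
            if (inRange) {
                for (int doc : postings[i]) {
                    bits.set(doc);
                }
            }
        }
        return bits;
    }

    // Bytes needed for one bit per document, rounded up to a whole byte.
    static long bitSetBytes(long maxDoc) {
        return (maxDoc + 7) / 8;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "c", "d"};
        int[][] postings = {{0}, {1, 2}, {2, 3}, {4}};
        System.out.println(filter(terms, postings, "b", "c", 5));
        System.out.println(bitSetBytes(10000000L) + " bytes");
    }
}
```

<p>However many terms fall inside the range, the result stays a single bit set whose size depends only on the number of documents.</p>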
<p>ConstantScoreRangeQuery is actually a wrapper around RangeFilter that lets us run a range search as a query. It returns a constant score, equal to its boost, for every document in the range. It is a better choice than RangeQuery when we only want to restrict the results to a range and do not want the range itself to influence their ranking.</p>
<p><strong>Note:</strong> FuzzyQuery, WildcardQuery, RangeQuery and PrefixQuery are implemented in much the same way, so be careful when using them as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2009/09/54218.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

