<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Information Retrieval Blog &#187; information Retrieval</title>
	<atom:link href="http://blog.zye.me/category/%e5%ad%a6%e6%9c%af%e7%a0%94%e7%a9%b6/information-retrieval/feed" rel="self" type="application/rss+xml" />
	<link>http://blog.zye.me</link>
	<description>REAL TIME DATA PROCESSING, DISTRIBUTED COMPUTING, PATTERN DISCOVERY</description>
	<lastBuildDate>Tue, 31 Jan 2012 02:05:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>How to Read Lucene Index Data, Part 1 (work in progress)</title>
		<link>http://blog.zye.me/2011/09/3886.html</link>
		<comments>http://blog.zye.me/2011/09/3886.html#comments</comments>
		<pubDate>Sat, 10 Sep 2011 01:52:04 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[信息检索]]></category>
		<category><![CDATA[索引]]></category>

		<guid isPermaLink="false">http://www.5yiso.cn/articles/%e5%a6%82%e4%bd%95%e8%af%bb%e5%8f%96lucene%e7%b4%a2%e5%bc%95%e6%95%b0%e6%8d%ae1-%e6%95%b4%e7%90%86%e4%b8%ad.html</guid>
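		<description><![CDATA[Lucene source code analysis (1): how to read Lucene index data. I have finally worked out how to read a Lucene index. This post shows how to get information out of an IndexReader. Why read the index at all? Because I need to: (1) get a term's document frequency (DF) over the whole collection; (2) get a term's total number of occurrences in the whole collection; (3) get a term's frequency in a given document (term frequency, TF); (4) list the positions at which a term occurs in a given document; (5) get the number of documents in the collection. Why are these numbers needed? They are the essential raw material of text retrieval (TR), and processed raw material at that. Before indexing there is only raw text; after the indexer has processed it, the text has become a sequence of terms (or tokens), and the indexer has recorded where each one occurs and how many times. With these data and a retrieval model, you can implement what a search engine does: text retrieval. A sharp reader might say this looks easy: it is only counting. True, it is counting, or statistics. But the seemingly simple process gets harder once memory is limited. Suppose each document has 100 terms and each term needs 10 bytes; storing 1,000,000 documents takes 10 x 100 x 10^6 = 10^9 bytes, roughly 2^30 bytes, i.e. about 1 GB. 1 GB of RAM is not much nowadays, but you cannot keep 1 GB of data in memory all the time. So put it on disk and load the 1 GB into memory when it is needed. Fine, go and make a coffee, then come back and carry on. And that is for 1,000,000 documents; with more, and with no auxiliary data structure, efficiency would be very poor. Lucene's index splits the data into segments and reads them only when needed; the rest of the time the data stays on disk. Lucene itself is an excellent indexing engine with efficient indexing and retrieval. The purpose of this post is to show how to use the Lucene API to read the required information from an index that has already been built. How to use Lucene in general will be covered in later posts. Let's go step by step. Assume an index has already been built and is stored in the index directory. To read it we first need an index reader (an instance of Lucene's IndexReader), so we write the following (the code is C#; this post uses DotLucene). IndexReader reader; Here is the problem: IndexReader is an abstract class and cannot be instantiated. So try a derived class. IndexReader has two children, SegmentReader and MultiReader. Which one? Both need a pile of arguments (it took me a good deal of effort to work out what they are for; explained later), so using Lucene's index data does not look that easy. After tracing the code and reading the documentation, I finally found the key: IndexReader has a static factory method, IndexReader.Open. It is defined as follows: #0001 public static IndexReader Open(System.String path) #0002 public static <a href='http://blog.zye.me/2011/09/3886.html'>[...]</a>]]></description>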
			<content:encoded><![CDATA[<p>Lucene source code analysis (1): how to read Lucene index data</p>
<p id="msgcns!3BB36966ED98D3E5!408" class="bvMsg"><font face="Verdana"></font><font size="2">I have finally worked out how to read a Lucene index. This post shows how to get information out of an IndexReader. Why read the index at all? Because I need to:<br />
(1) get a term's document frequency (DF) over the whole collection;<br />
(2) get a term's total number of occurrences in the whole collection;<br />
(3) get a term's frequency in a given document (term frequency, TF);<br />
(4) list the positions at which a term occurs in a given document;<br />
(5) get the number of documents in the collection.</font></p>
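<p><font face="Verdana"></font><font size="2">Why are these numbers needed? They are the essential raw material of text retrieval (TR), and processed raw material at that. Before indexing there is only raw text; after the indexer has processed it, the text has become a sequence of terms (or tokens), and the indexer has recorded where each one occurs and how many times. With these data and a retrieval model, you can implement what a search engine does: text retrieval.</font></p>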
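<p><font face="Verdana"></font><font size="2">A sharp reader might say this looks easy: it is only counting. True, it is counting, or statistics. But the seemingly simple process gets harder once memory is limited. Suppose each document has 100 terms and each term needs 10 bytes; storing 1,000,000 documents takes 10 x 100 x 10^6 = 10^9 bytes, roughly 2^30 bytes, i.e. about 1 GB. 1 GB of RAM is not much nowadays, but you cannot keep 1 GB of data in memory all the time. So put it on disk and load the 1 GB into memory when it is needed. Fine, go and make a coffee, then come back and carry on. And that is for 1,000,000 documents; with more, and with no auxiliary data structure, efficiency would be very poor.</font></p>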
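<p>The estimate above can be checked with a few lines of Java. This is only a sketch of the arithmetic; the class and method names are made up for illustration and have nothing to do with Lucene.</p>

```java
// Back-of-the-envelope estimate; all names here are illustrative only.
final class IndexSizeEstimate {
    /** Raw bytes needed when every term occurrence stores a fixed-size record. */
    static long rawBytes(long termsPerDoc, long bytesPerTerm, long numDocs) {
        return termsPerDoc * bytesPerTerm * numDocs;
    }

    /** Converts a byte count to binary gigabytes (1 GiB = 2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (double) (1L << 30);
    }

    public static void main(String[] args) {
        long bytes = rawBytes(100, 10, 1_000_000);
        System.out.printf("%d bytes = %.2f GiB%n", bytes, toGiB(bytes));
    }
}
```

<p>10^9 bytes is slightly less than 2^30 bytes, so "1 GB" is a round figure rather than an exact one.</p>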
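<p><font face="Verdana"></font><font size="2">Lucene's index splits the data into segments and reads them only when needed; the rest of the time the data stays on disk. Lucene itself is an excellent indexing engine with efficient indexing and retrieval. The purpose of this post is to show how to use the Lucene API to read the required information from an index that has already been built. How to use Lucene in general will be covered in later posts.</font></p>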
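<p><font face="Verdana"></font><font size="2">Let's go step by step. Assume an index has already been built and is stored in the index directory. To read it we first need an index reader (an instance of Lucene's IndexReader), so we write the following (the code is C#; this post uses DotLucene).<br />
IndexReader reader;<br />
Here is the problem: IndexReader is an abstract class and cannot be instantiated. So try a derived class. IndexReader has two children, SegmentReader and MultiReader. Which one? Both need a pile of arguments (it took me a good deal of effort to work out what they are for; explained later), so using Lucene's index data does not look that easy. After tracing the code and reading the documentation, I finally found the key: IndexReader has a static factory method, IndexReader.Open. Its overloads are:<br />
public static IndexReader Open(System.String path)<br />
public static IndexReader Open(System.IO.FileInfo path)<br />
public static IndexReader Open(Directory directory)<br />
private static IndexReader Open(Directory directory, bool closeDirectory)<br />
Three of them are public and can be called. Opening an index is then as simple as this (index is the path of the index directory):<br />
IndexReader reader = IndexReader.Open(index);</font></p>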
<p><font face="Verdana"></font><font size="2">Internally, opening the index goes through the following steps:<br />
#0001 SegmentInfos infos = new SegmentInfos();<br />
#0002 Directory directory = FSDirectory.GetDirectory(index, false);<br />
#0003 infos.Read(directory);<br />
#0004 bool closeDirectory = false;<br />
#0005 if (infos.Count == 1)<br />
#0006 {<br />
#0007 // index is optimized<br />
#0008 return new SegmentReader(infos, infos.Info(0), closeDirectory);<br />
#0009 }<br />
#0010 else<br />
#0011 {<br />
#0012 IndexReader[] readers = new IndexReader[infos.Count];<br />
#0013 for (int i = 0; i &lt; infos.Count; i++)<br />
#0014 readers[i] = new SegmentReader(infos.Info(i));<br />
#0015 return new MultiReader(directory, infos, closeDirectory, readers);<br />
#0016 }</font></p>
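<p><font face="Verdana"></font><font size="2">First the index's segment information is read (#0001~#0003). Then the number of segments is checked: if there is only one, the index has probably been optimized and that single segment is read directly (#0008); otherwise each segment is read in turn (#0013~#0014) and the segment readers are combined into a MultiReader (#0015). That is the whole process of opening an index.</font></p>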
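<p>The same factory idea can be sketched in plain Java. This is a toy model under stated assumptions: ToyReader, ToySegmentReader and ToyMultiReader are hypothetical stand-ins and not Lucene classes, and a "segment" is reduced to its document count.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the open() logic described above. All class names are
// hypothetical stand-ins, not Lucene classes.
abstract class ToyReader {
    abstract int numDocs();

    /** Factory: one segment yields a plain segment reader, several yield a composite. */
    static ToyReader open(List<Integer> docsPerSegment) {
        if (docsPerSegment.size() == 1) {
            return new ToySegmentReader(docsPerSegment.get(0));
        }
        List<ToyReader> subReaders = new ArrayList<>();
        for (int docs : docsPerSegment) {
            subReaders.add(new ToySegmentReader(docs));
        }
        return new ToyMultiReader(subReaders);
    }
}

final class ToySegmentReader extends ToyReader {
    private final int docs;
    ToySegmentReader(int docs) { this.docs = docs; }
    @Override int numDocs() { return docs; }
}

final class ToyMultiReader extends ToyReader {
    private final List<ToyReader> subReaders;
    ToyMultiReader(List<ToyReader> subReaders) { this.subReaders = subReaders; }

    /** A composite reader answers by summing over its segment readers. */
    @Override int numDocs() {
        int total = 0;
        for (ToyReader r : subReaders) total += r.numDocs();
        return total;
    }
}
```

<p>The caller only ever sees the abstract type, which is why the concrete reader classes never need to be constructed by hand.</p>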
<p><font face="Verdana"></font><font size="2">Next, let's see how to read the information. The following code illustrates it.<br />
#0001 public static void PrintIndex(IndexReader reader)<br />
#0002 {<br />
#0003 // show how many documents there are<br />
#0004 System.Console.WriteLine(reader + "\tNumDocs = " + reader.NumDocs());<br />
#0005 for (int i = 0; i &lt; reader.NumDocs(); i++)<br />
#0006 {<br />
#0007 System.Console.WriteLine(reader.Document(i));<br />
#0008 }<br />
#0009<br />
#0010 // enumerate the terms and get &lt;document, term freq, position*&gt; for each<br />
#0011 TermEnum termEnum = reader.Terms();<br />
#0012 while (termEnum.Next())<br />
#0013 {<br />
#0014 System.Console.Write(termEnum.Term());<br />
#0015 System.Console.WriteLine("\tDocFreq=" + termEnum.DocFreq());<br />
#0016<br />
#0017 TermPositions termPositions = reader.TermPositions(termEnum.Term());<br />
#0018 int i = 0;<br />
#0019 int j = 0;<br />
#0020 while (termPositions.Next())<br />
#0021 {<br />
#0022 System.Console.WriteLine((i++) + "-&gt;" + " DocNo:" + termPositions.Doc() + ", Freq:" + termPositions.Freq());<br />
#0023 for (j = 0; j &lt; termPositions.Freq(); j++)<br />
#0024 System.Console.Write("[" + termPositions.NextPosition() + "]");<br />
#0025 System.Console.WriteLine();<br />
#0026 }<br />
#0027<br />
#0028 // get the &lt;term freq, document&gt; information directly<br />
#0029 TermDocs termDocs = reader.TermDocs(termEnum.Term());<br />
#0030 while (termDocs.Next())<br />
#0031 {<br />
#0032 System.Console.WriteLine((i++) + "-&gt;" + " DocNo:" + termDocs.Doc() + ", Freq:" + termDocs.Freq());<br />
#0033 }<br />
#0034 }<br />
#0035<br />
#0036 // FieldInfos fieldInfos = reader.fieldInfos;<br />
#0037 // FieldInfo pathFieldInfo = fieldInfos.FieldInfo("path");<br />
#0038<br />
#0039 // show the term frequency vector<br />
#0040 for (int i = 0; i &lt; reader.NumDocs(); i++)<br />
#0041 {<br />
#0042 // the terms produced by tokenizing "contents" are stored in a TermFreqVector<br />
#0043 TermFreqVector termFreqVector = reader.GetTermFreqVector(i, "contents");<br />
#0044<br />
#0045 if (termFreqVector == null)<br />
#0046 {<br />
#0047 System.Console.WriteLine("termFreqVector is null.");<br />
#0048 continue;<br />
#0049 }<br />
#0050<br />
#0051 String fieldName = termFreqVector.GetField();<br />
#0052 String[] terms = termFreqVector.GetTerms();<br />
#0053 int[] frequences = termFreqVector.GetTermFrequencies();<br />
#0054<br />
#0055 System.Console.Write("FieldName:" + fieldName);<br />
#0056 for (int j = 0; j &lt; terms.Length; j++)<br />
#0057 {<br />
#0058 System.Console.Write("[" + terms[j] + ":" + frequences[j] + "]");<br />
#0059 }<br />
#0060 System.Console.WriteLine();<br />
#0061 }<br />
#0062 System.Console.WriteLine();<br />
#0063 }</font></p>
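<p><font face="Verdana"></font><font size="2">#0004 gets the number of documents (the loop in #0005~#0008 assumes no documents have been deleted; otherwise iterate up to MaxDoc() and skip every id for which IsDeleted(i) is true).<br />
#0012~#0034 enumerate every term in the collection.<br />
Within that, #0017~#0026 enumerate every position of the term in each document it occurs in (the ordinal of the token within the field; Lucene numbers positions from 0); #0029~#0033 list the documents each term occurs in and its frequency in each (the number of such documents is the DF, each frequency is a TF).<br />
#0036~#0037 are only valid when reader is a SegmentReader.<br />
#0040~#0061 quickly read the terms that occur in a given document and their frequencies. This only works if storeTermVector was set to true when the index was built, for example<br />
doc.Add(Field.Text("contents", reader, true));<br />
where the third argument is that flag. The default is false.</font></p>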
<p><font face="Verdana"></font><font size="2">With these data I can compute the statistics I need. Later posts will cover how to build an index and how to use Lucene.</font></p>
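<p>To make the five statistics from the start of this post concrete, here is a minimal in-memory positional inverted index in Java. It is a sketch, not Lucene: ToyIndex and its methods are hypothetical names, tokenization is a plain whitespace split, and nothing is stored on disk.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy positional inverted index; ToyIndex is a hypothetical class, not Lucene's API.
final class ToyIndex {
    // term -> (docId -> positions of the term in that document, counted from 0)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
    private int numDocs = 0;

    /** Splits non-empty text on whitespace, records every position, returns the doc id. */
    int addDocument(String text) {
        int docId = numDocs++;
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
        return docId;
    }

    /** (5) Number of documents in the collection. */
    int numDocs() { return numDocs; }

    /** (1) Document frequency: how many documents contain the term. */
    int docFreq(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    /** (4) Positions of the term in one document (empty if it does not occur). */
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs == null || !docs.containsKey(docId)) {
            return new ArrayList<>();
        }
        return docs.get(docId);
    }

    /** (3) Term frequency: occurrences of the term in one document. */
    int termFreq(String term, int docId) { return positions(term, docId).size(); }

    /** (2) Total occurrences of the term over the whole collection. */
    int collectionFreq(String term) {
        int total = 0;
        Map<Integer, List<Integer>> docs = postings.get(term);
        if (docs != null) {
            for (List<Integer> p : docs.values()) total += p.size();
        }
        return total;
    }
}
```

<p>A real index keeps these postings in compressed on-disk files and reads them lazily, which is exactly what the IndexReader API hides.</p>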
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/09/3886.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Searching and Indexing with Lucene at the Same Time</title>
		<link>http://blog.zye.me/2011/07/36221.html</link>
		<comments>http://blog.zye.me/2011/07/36221.html#comments</comments>
		<pubDate>Sun, 17 Jul 2011 14:27:50 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[searching]]></category>
		<category><![CDATA[查询]]></category>
		<category><![CDATA[索引]]></category>

		<guid isPermaLink="false">http://www.5yiso.cn/2008/07/36221.html</guid>
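		<description><![CDATA[Does Lucene allow searching and indexing simultaneously? Yes. However, an IndexReader only searches the index as of the "point in time" that it was opened. Lucene allows an IndexWriter to update an index while an IndexReader has it open, but the IndexReader only sees the documents that were in the index when it was opened. To search newly indexed documents, call the reader's reopen method (which is relatively cheap) and use the reader it returns. IndexReader.isCurrent() can be used to test whether the index has changed since the reader was opened.]]></description>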
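			<content:encoded><![CDATA[<p>Does Lucene allow searching and indexing simultaneously? Yes. However, an IndexReader only searches the index as of the "point in time" that it was opened.</p>
<p>Lucene allows an IndexWriter to update an index while an IndexReader has it open, but the IndexReader only sees the documents that were in the index when it was opened. To search newly indexed documents, call the reader's reopen method (which is relatively cheap) and use the reader it returns.</p>
<p>IndexReader.isCurrent() can be used to test whether the index has changed since the reader was opened.</p>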
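<p>The point-in-time behaviour can be modelled in a few lines of Java. This is a toy sketch: ToyStore and ToySnapshot are hypothetical names that only mimic the isCurrent()/reopen() behaviour described above, using a version counter instead of real index files.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of point-in-time readers; ToyStore and ToySnapshot are
// hypothetical classes, not part of Lucene.
final class ToyStore {
    private final List<String> docs = new ArrayList<>();
    private long version = 0;

    /** Every write bumps the version, which is how readers detect staleness. */
    synchronized void add(String doc) { docs.add(doc); version++; }

    synchronized long version() { return version; }

    /** Opens a reader over a private copy of the documents as they are right now. */
    synchronized ToySnapshot openReader() {
        return new ToySnapshot(this, new ArrayList<>(docs), version);
    }
}

final class ToySnapshot {
    private final ToyStore store;
    private final List<String> docs;
    private final long version;

    ToySnapshot(ToyStore store, List<String> docs, long version) {
        this.store = store;
        this.docs = docs;
        this.version = version;
    }

    /** Only documents present when the snapshot was taken are visible. */
    int numDocs() { return docs.size(); }

    /** True while nothing has been written since this snapshot was opened. */
    boolean isCurrent() { return version == store.version(); }

    /** Returns this snapshot if still current, otherwise a fresh one. */
    ToySnapshot reopen() { return isCurrent() ? this : store.openReader(); }
}
```

<p>The important detail is that reopen() returns a reader; the old one keeps showing the old view until the caller switches over.</p>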
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/07/36221.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Online Querying with Lucene</title>
		<link>http://blog.zye.me/2011/06/23889.html</link>
		<comments>http://blog.zye.me/2011/06/23889.html#comments</comments>
		<pubDate>Sun, 12 Jun 2011 14:27:35 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[信息检索]]></category>
		<category><![CDATA[在线查询]]></category>
		<category><![CDATA[索引]]></category>

		<guid isPermaLink="false">http://www.5yiso.cn/2008/03/23889.html</guid>
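		<description><![CDATA[Lucene itself does not support in-place updates, so the only option is to delete the old record and then add the new one. Lucene supports two ways of deleting: 1. DeleteDocument(int docNum) // delete the document with the given document number 2. DeleteDocuments(Term term) // delete every document containing the term. The second is the one normally used: IndexReader reader = IndexReader.Open(path); int count = reader.DeleteDocuments(new Term("FieldName", "Txt")); A Lucene delete is in effect a search. Note: the field used to match documents for deletion must not be tokenized when it is indexed. My questions: 1. When the index is updated this way, especially when a lot of data changes, how can the query service stay available during the update? optimize() is still slow once the data gets large and cannot finish in a short time. When the volume of updates is large, would it work to build a new index and then switch the searcher over to it? 2. How should online querying be implemented with Lucene? Could one build an index in memory and search both indexes at once, then write it to disk or merge it once the in-memory index reaches a certain size? The cost of merging indexes that way is fairly high, though. If you know the answer, please let me know. Thanks! &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211; This problem has been solved; Lucene has this built in. For online indexing, keeping the index in memory still seems the most convenient. After the index is modified, to keep queries in sync simply reopen the index with reopen(); this call is fast. Below is a piece of code from my application; adapt it as needed. public void refreshSearcher() { synchronized (searcher) { try { int preNum = searcher.maxDoc(); IndexSearcher[] searchers = new IndexSearcher[ireaderList .size()]; ArrayList&#60;IndexReader&#62; list = new <a href='http://blog.zye.me/2011/06/23889.html'>[...]</a>]]></description>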
			<content:encoded><![CDATA[<p>Lucene itself does not support in-place updates, so the only option is to delete the old record and then add the new one.<br />
 Lucene supports two ways of deleting:<br />
 <span style="font-family: 'Courier New';">1. DeleteDocument(int docNum) // delete the document with the given document number</span><br />
 <span style="font-family: 'Courier New';">2. DeleteDocuments(Term term) // delete every document containing the term<br />
 The second is the one normally used:<br />
 </span><span style="font-family: 'Courier New';"> IndexReader reader = IndexReader.Open(path);<br />
 int count = reader.DeleteDocuments(new Term("FieldName", "Txt")); </span></p>
<p><span style="font-family: 'Courier New';"> </span></p>
<p><span style="font-family: 'Courier New';">A Lucene delete is in effect a search.<br />
 Note: the field used to match documents for deletion must not be tokenized when it is indexed.</span></p>
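<p>The delete-then-add pattern can be sketched as follows. ToyDocStore is a hypothetical in-memory store, not Lucene; it matches field values by exact string equality, which mirrors why the key field has to be indexed without tokenization.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Toy document store; ToyDocStore is a hypothetical class. Values are matched
// exactly, as they would be for an untokenized key field.
final class ToyDocStore {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void add(Map<String, String> doc) { docs.add(new HashMap<>(doc)); }

    /** Deletes every document whose field equals value; returns how many were removed. */
    int deleteDocuments(String field, String value) {
        Objects.requireNonNull(value, "value");
        int deleted = 0;
        for (Iterator<Map<String, String>> it = docs.iterator(); it.hasNext();) {
            if (value.equals(it.next().get(field))) {
                it.remove();
                deleted++;
            }
        }
        return deleted;
    }

    /** An "update" is a delete by key followed by an add; returns the number replaced. */
    int update(String keyField, Map<String, String> newDoc) {
        int deleted = deleteDocuments(keyField, newDoc.get(keyField));
        add(newDoc);
        return deleted;
    }

    int size() { return docs.size(); }
}
```

<p>If the key were tokenized into several terms, a delete by the full key value would match nothing, and the update would silently add a duplicate.</p>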
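<p><strong><span style="font-family: 'Courier New';">My questions:</span></strong></p>
<p><span style="font-family: 'Courier New';">1. When the index is updated this way, especially when a lot of data changes, how can the query service stay available during the update? optimize() is still slow once the data gets large and cannot finish in a short time. </span></p>
<p><span style="font-family: 'Courier New';">When the volume of updates is large, would it work to build a new index and then switch the searcher over to it?</span></p>
<p><span style="font-family: 'Courier New';">2. How should online querying be implemented with Lucene? Could one build an index in memory and search both indexes at once, then write it to disk or merge it once the in-memory index reaches a certain size? The cost of merging indexes that way is fairly high, though.</span></p>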
<p><br class="spacer_" /></p>
<p><span style="font-family: 'Courier New';">If you know the answer, please let me know. Thanks!</span></p>
<p><br class="spacer_" /></p>
<p>&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8211;</p>
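<p>This problem has been solved; Lucene has this built in. For online indexing, keeping the index in memory still seems the most convenient.</p>
<p>After the index is modified, to keep queries in sync simply reopen the index with reopen(); this call is fast.</p>
<p>Below is a piece of code from my application; adapt it as needed.</p>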
<p><br class="spacer_" /></p>
<p><span style="white-space: pre;"> </span>public void refreshSearcher() {</p>
<p><span style="white-space: pre;"> </span>synchronized (searcher) {</p>
<p><span style="white-space: pre;"> </span>try {</p>
<p><span style="white-space: pre;"> </span>int preNum = searcher.maxDoc();</p>
<p><span style="white-space: pre;"> </span>IndexSearcher[] searchers = new IndexSearcher[ireaderList</p>
<p><span style="white-space: pre;"> </span>.size()];</p>
<p><span style="white-space: pre;"> </span>ArrayList&lt;IndexReader&gt; list = new ArrayList&lt;IndexReader&gt;();</p>
<p><span style="white-space: pre;"> </span>for (int i = 0; i &lt; ireaderList.size(); i++) {</p>
<p><span style="white-space: pre;"> </span>IndexReader treader = ireaderList.get(i).reopen();</p>
<p><span style="white-space: pre;"> </span>searchers[i] = new IndexSearcher(treader);</p>
<p><span style="white-space: pre;"> </span>list.add(treader);</p>
<p><span style="white-space: pre;"> </span>}</p>
<p><span style="white-space: pre;"> </span>searcher = new MultiSearcher(searchers);</p>
<p><span style="white-space: pre;"> </span>ireaderList = list;</p>
<p><span style="white-space: pre;"> </span>int aftNum = searcher.maxDoc();</p>
<p><span style="white-space: pre;"> </span>LOG.info("before refresh: " + preNum + " after refresh: "</p>
<p><span style="white-space: pre;"> </span>+ aftNum);</p>
<p><span style="white-space: pre;"> </span>} catch (CorruptIndexException e) {</p>
<p><span style="white-space: pre;"> </span>// TODO Auto-generated catch block</p>
<p><span style="white-space: pre;"> </span>e.printStackTrace();</p>
<p><span style="white-space: pre;"> </span>} catch (IOException e) {</p>
<p><span style="white-space: pre;"> </span>// TODO Auto-generated catch block</p>
<p><span style="white-space: pre;"> </span>e.printStackTrace();</p>
<p><span style="white-space: pre;"> </span>}</p>
<p><span style="white-space: pre;"> </span>}</p>
<p><span style="white-space: pre;"> </span>}</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/06/23889.html/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>How to Search Within Lucene Search Results (Java)</title>
		<link>http://blog.zye.me/2011/06/20849.html</link>
		<comments>http://blog.zye.me/2011/06/20849.html#comments</comments>
		<pubDate>Tue, 07 Jun 2011 14:29:38 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[二次检索]]></category>
		<category><![CDATA[缓存]]></category>

		<guid isPermaLink="false">http://www.5yiso.cn/2008/03/20849.html</guid>
		<description><![CDATA[How to search within Lucene search results (Java). 1. The easiest approach: combine the keywords of the first and the follow-up search with a BooleanQuery. It is simple, convenient and easy to understand. 2. Use a Lucene Filter; see org.apache.lucene.search.CachingWrapperFilter in the Lucene API. It caches the result of the previous search, which makes searching within results possible. Note that DateFilter and WrapperFilter do not cache; a non-caching filter has to be wrapped in a CachingWrapperFilter to get caching. Below is sample code from Lucene in Action, a simple test program. Real applications can of course be more elaborate. public void testCachingWrapper(){ Date jan1 = parserDate("2004 jan 01"); Date dec31 = parserDate("2004 Dec 31"); DateFilter dateFilter = new DateFilter("modified", jan1, dec31); cachingFilter = new CachingWrapperFilter(dateFilter); Hits hits = searcher.search(allbooks, cachingFilter); }]]></description>
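			<content:encoded><![CDATA[<p>How to search within Lucene search results (Java)</p>
<p>1. The easiest approach: combine the keywords of the first and the follow-up search with a BooleanQuery. It is simple, convenient and easy to understand.</p>
<p>2. Use a Lucene Filter; see org.apache.lucene.search.CachingWrapperFilter in the Lucene API. It caches the result of the previous search, which makes searching within results possible. Note that DateFilter and WrapperFilter do not cache; a non-caching filter has to be wrapped in a CachingWrapperFilter to get caching. Below is sample code from Lucene in Action,<br />
a simple test program. Real applications can of course be more elaborate.</p>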
<p>public void testCachingWrapper(){</p>
<p>Date jan1 = parserDate("2004 jan 01");</p>
<p>Date dec31 = parserDate("2004 Dec 31");</p>
<p>DateFilter dateFilter = new DateFilter("modified", jan1, dec31);</p>
<p>cachingFilter = new CachingWrapperFilter(dateFilter);</p>
<p>Hits hits = searcher.search(allbooks, cachingFilter);</p>
<p>}</p>
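<p>The idea behind the second approach, computing a filter's set of matching documents once and intersecting later searches with it, can be sketched in plain Java. CachedFilter is a hypothetical class for illustration and is not how Lucene implements CachingWrapperFilter; documents are represented only by their ids in a BitSet.</p>

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Toy filter cache; CachedFilter is a hypothetical class illustrating the idea
// of a caching filter wrapper, not Lucene's implementation.
final class CachedFilter {
    private final Map<String, BitSet> cache = new HashMap<>();
    private int computations = 0;

    /** Returns the matching-doc bitmap for a named filter, computing it at most once. */
    BitSet bits(String name, int maxDoc, IntPredicate matches) {
        BitSet cached = cache.get(name);
        if (cached == null) {
            computations++;
            cached = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                if (matches.test(doc)) cached.set(doc);
            }
            cache.put(name, cached);
        }
        return (BitSet) cached.clone(); // hand out a copy so callers cannot corrupt the cache
    }

    /** Second-pass search: keep only the previous hits that also pass the filter. */
    static BitSet searchWithin(BitSet previousHits, BitSet filterBits) {
        BitSet result = (BitSet) previousHits.clone();
        result.and(filterBits);
        return result;
    }

    /** How many times a filter actually had to be evaluated. */
    int computations() { return computations; }
}
```

<p>The first approach (one BooleanQuery with both sets of keywords) re-runs the whole search; the cached bitmap trades memory for skipping that work on repeated refinements.</p>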
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/06/20849.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Toolkits</title>
		<link>http://blog.zye.me/2011/06/55851.html</link>
		<comments>http://blog.zye.me/2011/06/55851.html#comments</comments>
		<pubDate>Thu, 02 Jun 2011 15:22:19 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[Toolkits]]></category>

		<guid isPermaLink="false">http://blog.so8848.com/2011/06/55851.html/</guid>
		<description><![CDATA[FlexCRFs: Flexible Conditional Random Fields CRFTagger: CRF English POS Tagger CRFChunker: CRF English Phrase Chunker JTextPro: A Java-based Text Processing Toolkit JWebPro: A Java-based Web Processing Toolkit JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool &#160;]]></description>
			<content:encoded><![CDATA[<p>
<ul style="font-family: 'Times New Roman'; font-size: medium;">
<li>
<p style="margin-top: 0px; margin-bottom: 8px;"><span style="font-family: Arial;"><a href="http://flexcrfs.sourceforge.net/">FlexCRFs</a>: Flexible Conditional Random Fields</span></p>
</li>
<li>
<p style="margin-top: 0px; margin-bottom: 8px;"><span style="font-family: Arial;"><a href="http://crftagger.sourceforge.net/">CRFTagger</a>: CRF English POS Tagger</span></p>
</li>
<li>
<p style="margin-top: 0px; margin-bottom: 8px;"><span style="font-family: Arial;"><a href="http://crfchunker.sourceforge.net/">CRFChunker</a>: CRF English Phrase Chunker</span></p>
</li>
<li>
<p style="margin-top: 0px; margin-bottom: 8px;"><span style="font-family: Arial;"><a href="http://jtextpro.sourceforge.net/">JTextPro</a>: A Java-based Text Processing Toolkit</span></p>
</li>
<li>
<p style="margin-top: 0px; margin-bottom: 8px;"><span style="font-family: Arial;"><a href="http://jwebpro.sourceforge.net/">JWebPro</a>: A Java-based Web Processing Toolkit</span></p>
</li>
<li>
<p style="margin-top: 0px; margin-bottom: 8px;"><span style="font-family: Arial;"><a href="http://jvnsegmenter.sourceforge.net/">JVnSegmenter</a>: A Java-based Vietnamese Word Segmentation Tool</span></p>
</li>
</ul>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/06/55851.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Inside Lucene: Studying a Hugely Popular Search Engine (2.0), Reading the Index</title>
		<link>http://blog.zye.me/2011/05/17463.html</link>
		<comments>http://blog.zye.me/2011/05/17463.html#comments</comments>
		<pubDate>Fri, 27 May 2011 14:30:02 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[信息检索]]></category>
		<category><![CDATA[索引]]></category>

		<guid isPermaLink="false">http://www.5yiso.cn/2008/03/17463.html</guid>
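		<description><![CDATA[Two articles I collected a while ago that dig a little deeper into Lucene; worth reading if you want to study or modify the source code. The original is here: Inside Lucene: Studying a Hugely Popular Search Engine (2.0), Reading the Index. Index in Practice: Following the Map. Where does TermDocs read its data from? From an index already built on disk, of course; more precisely, from particular files in that index. To understand what TermDocs reads and how, we have to look at the detailed structure of a Lucene index where necessary. TermDocs is an abstract type (an interface in Lucene), which is good: later one could create one's own index structure and search algorithm. But first we need to understand how Lucene does it, and the abstract type does not tell us that, so we first have to find which TermDocs implementation TermQuery uses. Recall where the TermDocs in the scorer comes from. public class TermQuery extends Query { private class TermWeight implements Weight { public Scorer scorer(IndexReader reader) throws IOException { TermDocs termDocs = reader.termDocs(term); if (termDocs == null) return null; return new TermScorer(this, termDocs, getSimilarity(searcher), reader.norms(term.field())); } ... <a href='http://blog.zye.me/2011/05/17463.html'>[...]</a>]]></description>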
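			<content:encoded><![CDATA[<div class="postTitle"><em>Two articles I collected a while ago that dig a little deeper into Lucene; worth reading if you want to study or modify the source code.</em> The original is here:</div>
<div class="postTitle"><a href="http://blog.csdn.net/bluemiles/archive/2006/07/24/968433.aspx">Inside Lucene: Studying a Hugely Popular Search Engine (2.0), Reading the Index</a></div>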
<div class="postText">
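<p><span style="FONT-SIZE: 180%"><strong><span style="FONT-SIZE: 0.75em">Index in Practice: Following the Map</span></strong></span></p>
<p><span style="FONT-SIZE: 180%"><strong><span style="FONT-SIZE: 1.5em">W</span></strong></span>here does TermDocs read its data from? From an index already built on disk, of course; more precisely, from particular files in that index. To understand what TermDocs reads and how, we have to look at the detailed structure of a Lucene index where necessary.</p>
<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>T</strong></span>ermDocs is an abstract type (an interface in Lucene), which is good: later one could create one's own index structure and search algorithm. But first we need to understand how Lucene does it, and the abstract type does not tell us that, so we first have to find which TermDocs implementation TermQuery uses.</p>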
<p>Recall where the TermDocs in the scorer comes from. <br /><code><br />public class TermQuery extends Query { <br />private class TermWeight implements Weight { <br />public Scorer scorer(IndexReader reader) throws IOException { <br />TermDocs termDocs = reader.termDocs(term); <br /><br />if (termDocs == null) <br />return null; <br /><br />return new TermScorer(this, termDocs, getSimilarity(searcher), <br />reader.norms(term.field())); <br />} <br />... <br />} <br />... <br />} <br /></code></p>
<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>T</strong></span>his code shows which class actually creates the TermDocs: IndexReader. <br />Which TermDocs implementation is used is not up to TermQuery; that is IndexReader's call. What kind of TermDocs TermQuery gets is decided entirely by the IndexReader we pass to TermQuery.weight.scorer(). Positioning the TermDocs on the given Term is also entirely the IndexReader's job. Unfortunately, IndexReader is abstract too. Want the inside story? First find the classes that implement IndexReader.</p>
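<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>I</strong></span>f you search the way the user manual describes, a static method of IndexReader is called, and it returns the IndexReader implementation we need: SegmentReader (for a single-segment index; an index with several segments yields a MultiReader over SegmentReaders). This is the reader used throughout the query.</p>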
<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>F</strong></span>ollowing the trail, it is easy to find the SegmentTermDocs class, the TermDocs used by SegmentReader in a default query; most query results are traversed through instances of this class. Now it is time to dig into it and see how it traverses the data and where that data comes from. <br /><code><br />class IndexReader{ <br />public TermDocs termDocs(Term term) throws IOException { <br />TermDocs termDocs = termDocs(); <br />termDocs.seek(term); <br />return termDocs; <br />} <br />... <br />}</code></p>
<p><code>class SegmentReader extends IndexReader{ <br />public final TermDocs termDocs() throws IOException { <br />return new SegmentTermDocs(this); <br />} <br />... <br />} <br /></code></p>
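<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>F</strong></span>rom the code listed so far, we can clearly see the sequence of actions a SegmentTermDocs goes through between being created and being handed to the scorer:</p>
<p>1. SegmentTermDocs construction: it sets its own fields from its parent <br />2. IndexReader calls TermDocs.seek(term); in the implementation classes this becomes SegmentReader calling SegmentTermDocs.seek(term)</p>
<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>I</strong></span>n the second step, SegmentTermDocs actually reads the index files. As said earlier, this I/O can only be done through the IndexReader, which is why the SegmentTermDocs constructor takes a SegmentReader argument.</p>
<p>In seek(term), SegmentTermDocs uses its sole constructor argument, the IndexReader (the reader that created it, called its "parent"), to locate the given term in the on-disk index files and read the related information: df (the number of documents containing the term) and the position in the index file of the set of documents matching the term. From that position onward lies the information, already sorted at indexing time, about the documents that contain the term.</p>
<p>Once seek has finished, TermDocs is ready to read data. On request, TermDocs.read immediately returns the id of each document and tf, the number of times the term occurs in that document. As described earlier, the scorer object calls read, then iterates over all the documents it returns and feeds them one by one into the Collector.</p>
<p>The intricate part: how does seek work?</p>
<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>T</strong></span>hat involves the index structure. We can now lift a corner of the index files and take a peek.</p>
<p>tis file: Term InformationS <br />frq file: FReQuency</p>
<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>N</strong></span>ote that I/O is always done in a few members of IndexReader; all I/O in other classes is either done with clones of those members or delegated directly to IndexReader. SegmentTermDocs.seek(term) goes through IndexReader: SegmentTermDocs treats the IndexReader that created it as its parent and relies on it to read index data at the critical moment of seek. There is no way around this, because reading index files (creating and positioning the input streams) is entirely IndexReader's responsibility. <br /><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>T</strong></span>o do the positioning, seek uses tis, the IndexReader member responsible for locating terms; its class name, TermInfosReader, says what it is for. tis finds the given term in the .tis file and reads everything we need: how many documents the term occurs in (df), where in the .frq file those documents are recorded (the start offset), and so on.</p>
<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>W</strong></span>ith this information, TermDocs then does its own seek. This step is simple: apart from a few additions and assignments, the only notable thing is a seek() on the input stream of the .frq file (an IndexInput in Lucene 2.0), and the seek target is exactly the "position of the document records in the .frq file" returned by tis. The stream is created when the IndexReader is initialized and is dedicated to reading from the .frq file. When IndexReader creates the TermDocs, the TermDocs clones this stream and assigns the clone to its own field. This seek() positions the read pointer of the .frq file; the only place that later really needs the stream is where document data is read from the .frq file. That reading happens in the often-mentioned termDocs.read(), whose implementation we now know is SegmentTermDocs.read().</p>
<p>read() is a simple sequential read of the file stream. The details involve the binary layout of Lucene's index files, which I do not want to get bogged down in now. With a rough understanding of how TermDocs locates its data, half the puzzle is solved. The index file structure, the relationships among the files, and how the code sorts out those relationships deserve further discussion.</p>
<p><span style="FONT-SIZE: 1.5em; FONT-FAMILY: Courier New"><strong>A</strong></span>t this point, combined with what we already know about how the scorer calls TermDocs, the basic path of a query has begun to emerge.</p>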
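<p>The seek-then-read flow can be modelled with two in-memory structures standing in for the .tis and .frq files. This is a sketch under loud assumptions: ToyPostings and TermInfo are hypothetical names, and the real Lucene files use a compressed binary encoding that is not reproduced here.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy two-part layout: a term dictionary (playing the role of .tis) pointing
// into a flat postings list (playing the role of .frq). Hypothetical classes,
// not Lucene's file format.
final class ToyPostings {
    /** Dictionary entry: document frequency and start offset into the postings list. */
    static final class TermInfo {
        final int docFreq;
        final int offset;
        TermInfo(int docFreq, int offset) { this.docFreq = docFreq; this.offset = offset; }
    }

    private final TreeMap<String, TermInfo> dictionary = new TreeMap<>();
    private final List<Integer> postings = new ArrayList<>(); // docId, tf, docId, tf, ...

    /** Appends a term's postings; docIdTfPairs is {doc0, tf0, doc1, tf1, ...}. */
    void addTerm(String term, int... docIdTfPairs) {
        dictionary.put(term, new TermInfo(docIdTfPairs.length / 2, postings.size()));
        for (int v : docIdTfPairs) postings.add(v);
    }

    /** seek: look the term up in the dictionary; returns null if the term is absent. */
    TermInfo seek(String term) { return dictionary.get(term); }

    /** read: sequentially scan docFreq (docId, tf) pairs starting at the stored offset. */
    List<int[]> read(TermInfo info) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < info.docFreq; i++) {
            int base = info.offset + 2 * i;
            result.add(new int[] { postings.get(base), postings.get(base + 1) });
        }
        return result;
    }
}
```

<p>The dictionary lookup is the expensive, random-access part; once the offset is known, reading the postings is a plain sequential scan, which matches the division of labour between tis and the .frq stream described above.</p>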
</p></div>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/05/17463.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lucene Highlighting Without Re-tokenizing (repost)</title>
		<link>http://blog.zye.me/2011/05/12152.html</link>
		<comments>http://blog.zye.me/2011/05/12152.html#comments</comments>
		<pubDate>Fri, 20 May 2011 14:28:57 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[信息检索]]></category>
		<category><![CDATA[分词]]></category>

		<guid isPermaLink="false">http://www.5yiso.cn/2008/02/12152.html</guid>
		<description><![CDATA[1. Where the problem comes from. After word segmentation was added, result accuracy improved, but users reported that results came back very slowly. The reason is that when Lucene highlights the matching keywords of each document, it runs the tokenizer many times at run time, which hurts performance. 2. The solution. A new feature in Lucene 1.4.3 solves this. Term Vector now supports storing Token.getPositionIncrement(), Token.startOffset() and Token.endOffset(). With this stored token information, documents no longer have to be re-analyzed at run time just for highlighting. Whether the information is stored is controlled through the Field constructor. Modify HighlighterTest.java as follows: // Store term position information when adding a document. private void addDoc(IndexWriter writer, String text) throws IOException { Document d = new Document(); //Field f = new Field(FIELD_NAME, text, true, true, true); Field f = new Field(FIELD_NAME, text , Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS); d.add(f); writer.addDocument(d); } // Use the stored term positions to save highlighting time. void doStandardHighlights() throws Exception { Highlighter highlighter =new <a href='http://blog.zye.me/2011/05/12152.html'>[...]</a>]]></description>
			<content:encoded><![CDATA[<div>
<p><span style="FONT-FAMILY: Verdana"><span lang="EN-US" xml:lang="EN-US">1</span><span>. Where the problem comes from</span></span></p>
<p style="TEXT-INDENT: 21pt"><span>After word segmentation was added, result accuracy improved, but users reported that results came back very slowly. The reason is that when Lucene highlights the matching keywords of each document, it runs the tokenizer many times at run time, which hurts performance.</span></p>
<p><span lang="EN-US" xml:lang="EN-US">2</span><span>. The solution</span></p>
<p style="TEXT-INDENT: 21pt"><span>A new feature in Lucene 1.4.3 solves this. Term Vector now supports storing</span> <span lang="EN-US" xml:lang="EN-US">Token.getPositionIncrement()</span><span>,</span> <span lang="EN-US" xml:lang="EN-US">Token.startOffset()</span> <span>and</span> <span lang="EN-US" xml:lang="EN-US">Token.endOffset()</span><span>. With this stored token information, documents no longer have to be re-analyzed at run time just for highlighting. Whether the information is stored is controlled through the Field constructor. Modify</span> <span lang="EN-US" xml:lang="EN-US">HighlighterTest.java</span> <span>as follows:</span></p>
<p style="TEXT-INDENT: 21pt">
<p style="TEXT-INDENT: 21pt"><span lang="EN-US" xml:lang="EN-US">// Store term position information when adding a document.</span></p>
<p><span lang="EN-US" xml:lang="EN-US">private void addDoc(IndexWriter writer, String text) throws IOException</span></p>
<p><span lang="EN-US" xml:lang="EN-US">{</span></p>
<p><span lang="EN-US" xml:lang="EN-US">Document d = new Document();</span></p>
<p><span lang="EN-US" xml:lang="EN-US">//Field f = new Field(FIELD_NAME, text, true, true, true);</span></p>
<p><span lang="EN-US" xml:lang="EN-US">Field f = new Field(FIELD_NAME, text ,</span></p>
<p><span lang="EN-US" xml:lang="EN-US"><span><wbr /></span> Field.Store.YES, Field.Index.TOKENIZED,</span></p>
<p><span lang="EN-US" xml:lang="EN-US"><span><wbr /></span> Field.TermVector.WITH_POSITIONS<wbr />_OFFSETS);</span></p>
<p><span lang="EN-US" xml:lang="EN-US">d.add(f);</span></p>
<p><span lang="EN-US" xml:lang="EN-US">writer.addDocument(d);</span></p>
<p><span lang="EN-US" xml:lang="EN-US">}</span></p>
<p style="TEXT-INDENT: 21pt">
<p style="TEXT-INDENT: 21pt"><span lang="EN-US" xml:lang="EN-US">// Use the stored term positions to save highlighting time.</span></p>
<p><span lang="EN-US" xml:lang="EN-US">void doStandardHighlights() throws Exception</span></p>
<p><span lang="EN-US" xml:lang="EN-US">{</span></p>
<p><span lang="EN-US" xml:lang="EN-US">Highlighter highlighter =new Highlighter(this,new QueryScorer(query));</span></p>
<p><span lang="EN-US" xml:lang="EN-US">highlighter.setTextFragmenter(new SimpleFragmenter(20));</span></p>
<p><span lang="EN-US" xml:lang="EN-US">for (int i = 0; i &lt; hits.length(); i++)</span></p>
<p><span lang="EN-US" xml:lang="EN-US">{</span></p>
<p><span lang="EN-US" xml:lang="EN-US">String text = hits.doc(i).get(FIELD_NAME);</span></p>
<p><span lang="EN-US" xml:lang="EN-US">int maxNumFragmentsRequired = 2;</span></p>
<p><span lang="EN-US" xml:lang="EN-US">String fragmentSeparator = "...";</span></p>
<p><span lang="EN-US" xml:lang="EN-US">TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(hits.id(i), FIELD_NAME);</span></p>
<p><span lang="EN-US" xml:lang="EN-US">// If stop words are not removed, this can be changed to TokenSources.getTokenStream(tpv, true); for a further speedup.</span></p>
<p><span lang="EN-US" xml:lang="EN-US">TokenStream tokenStream = TokenSources.getTokenStream(tpv);</span></p>
<p><span lang="EN-US" xml:lang="EN-US">// analyzer.tokenStream(FIELD_NAME, new StringReader(text));</span></p>
<p><span lang="EN-US" xml:lang="EN-US">String result =</span></p>
<p><span lang="EN-US" xml:lang="EN-US">highlighter.getBestFragments(</span></p>
<p><span lang="EN-US" xml:lang="EN-US"><span><wbr /></span> tokenStream,</span></p>
<p><span lang="EN-US" xml:lang="EN-US"><span><wbr /></span> text,</span></p>
<p><span lang="EN-US" xml:lang="EN-US"><span><wbr /></span> maxNumFragmentsRequired,</span></p>
<p><span lang="EN-US" xml:lang="EN-US"><span><wbr /></span> fragmentSeparator);</span></p>
<p><span lang="EN-US" xml:lang="EN-US">System.out.println("\t" + result);</span></p>
<p><span lang="EN-US" xml:lang="EN-US">}</span></p>
<p><span lang="EN-US" xml:lang="EN-US">}</span></p>
<p><span lang="EN-US" xml:lang="EN-US">Finally, remove one extra check from the highlight package. Chinese has no clear word boundaries, so the following check is wrong for Chinese text:</span></p>
<p><span lang="EN-US" xml:lang="EN-US">tokenGroup.isDistinct(token)</span></p>
<p><span lang="EN-US" xml:lang="EN-US">This way, Chinese word segmentation no longer slows down queries.</span></p>
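<p>To see why stored offsets make highlighting cheap, here is a minimal, stand-alone sketch in plain Java. It does not call Lucene at all: the class name OffsetHighlighter, its highlight method, and the bracket markers are made up for illustration, and the offsets are assumed to be sorted and non-overlapping, much like the start/end offsets a term vector could return for the matched terms.</p>

```java
/** Stand-alone sketch: highlight text using already-known [start, end) offsets. */
final class OffsetHighlighter {

    /**
     * Wraps every [start, end) span of text in the given markers.
     * offsets[i] = {startOffset, endOffset}; this sketch assumes the spans
     * are sorted by start and do not overlap.
     */
    static String highlight(String text, int[][] offsets, String open, String close) {
        StringBuilder out = new StringBuilder();
        int cursor = 0; // first character of text that has not been copied yet
        for (int[] span : offsets) {
            out.append(text, cursor, span[0]); // plain text before the match
            out.append(open).append(text, span[0], span[1]).append(close);
            cursor = span[1];
        }
        out.append(text, cursor, text.length()); // tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Lucene stores term offsets";
        int[][] hits = { {0, 6}, {14, 18} }; // hypothetical offsets of two matched terms
        System.out.println(highlight(text, hits, "[", "]"));
    }
}
```

<p>Because the spans are already known, the text is copied once and never tokenized again, which is where the time is saved.</p>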
</p></div>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/05/12152.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Tips for Improving Lucene Indexing Speed</title>
		<link>http://blog.zye.me/2011/05/9319.html</link>
		<comments>http://blog.zye.me/2011/05/9319.html#comments</comments>
		<pubDate>Thu, 19 May 2011 02:28:17 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[信息检索]]></category>
		<category><![CDATA[索引]]></category>

		<guid isPermaLink="false">http://www.5yiso.cn/2008/02/9319.html</guid>
		<description><![CDATA[ImproveIndexingSpeed How to make indexing faster Here are some things to try to increase the indexing speed of your Lucene application. Please see ImproveSearchingSpeed for how to speed up searching. Be sure you really need to speed things up. Many of the ideas here are simple to try, but others will necessarily add <a href='http://blog.zye.me/2011/05/9319.html'>[...]</a>]]></description>
			<content:encoded><![CDATA[<h1>ImproveIndexingSpeed</h1>
<div class="u_bd">
<div>
<h2>How to make indexing faster</h2>
<p>Here are some things to try to increase the indexing speed of your Lucene application. Please see <a href="http://wiki.apache.org/lucene-java/ImproveSearchingSpeed"><span style="color: #22148d;">ImproveSearchingSpeed</span></a> for how to speed up searching.</p>
<ul>
<li><strong>Be sure you really need to speed things up.</strong>
<p>Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your indexing speed is indeed too slow and the slowness is indeed within Lucene.</li>
<li><strong>Make sure you are using the latest version of Lucene.</strong></li>
<li><strong>Use a local filesystem.</strong>
<p>Remote filesystems are typically quite a bit slower for indexing. If your index needs to be on the remote filesystem, consider building it first on the local filesystem and then copying it to the remote filesystem.</li>
<li><strong>Get faster hardware, especially a faster IO system.</strong></li>
<li><strong>Open a single writer and re-use it for the duration of your indexing session.</strong></li>
<li><strong>Flush by RAM usage instead of document count.</strong>
<p>Call <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#ramSizeInBytes()"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">writer.ramSizeInBytes()</span></a> after every added doc then call <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#flush()"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">flush()</span></a> when it&#8217;s using too much RAM. This is especially good if you have small docs or highly variable doc sizes. You need to first set <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMaxBufferedDocs(int)"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">maxBufferedDocs</span></a> large enough to prevent the writer from flushing based on document count. However, don&#8217;t set it too large otherwise you may hit <a href="http://issues.apache.org/jira/browse/LUCENE-845"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">LUCENE-845</span></a> . Somewhere around 2-3X your &#8220;typical&#8221; flush count should be OK.</li>
<li><strong>Use as much RAM as you can afford.</strong>
<p>More RAM before flushing means Lucene writes larger segments to begin with, which means less merging later. Testing in <a href="http://issues.apache.org/jira/browse/LUCENE-843"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">LUCENE-843</span></a> found that around 48 MB is the sweet spot for that content set, but your application could have a different sweet spot.</li>
<li><strong>Turn off compound file format.</strong>
<p>Call <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">setUseCompoundFile(false)</span></a> . Building the compound file format takes time during indexing (7-33% in testing for <a href="http://issues.apache.org/jira/browse/LUCENE-888"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">LUCENE-888</span></a> ). However, note that doing this will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.</li>
<li><strong>Re-use Document and Field instances</strong>
<p>As of Lucene 2.3 (not yet released) there are new setValue(&#8230;) methods that allow you to change the value of a Field. This allows you to re-use a single Field instance across many added documents, which can save substantial GC cost.</p>
<p>It&#8217;s best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document. For example you might have an idField, bodyField, nameField, storedField1, etc. After the document is added, you then directly change the Field values (idField.setValue(&#8230;), etc), and then re-add your Document instance.</p>
<p>Note that you cannot re-use a single Field instance within a Document, and you should not change a Field&#8217;s value until the Document containing that Field has been added to the index. See <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.html"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">Field</span></a> for details.</li>
<li><strong>Re-use a single Token instance in your analyzer</strong>
<p>Analyzers often create a new Token for each term in sequence that needs to be indexed from a Field. You can save substantial GC cost by re-using a single Token instance instead.</li>
<li><strong>Use the char[] API in Token instead of the String API to represent token Text</strong>
<p>As of Lucene 2.3 (not yet released), a Token can represent its text as a slice into a char array, which saves the GC cost of new&#8217;ing and then reclaiming String instances. By re-using a single Token instance and using the char[] API you can avoid new&#8217;ing any objects for each term. See <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">Token</span></a> for details.</li>
<li><strong>Use autoCommit=false when you open your <a href="http://wiki.apache.org/lucene-java/IndexWriter"><span style="color: #22148d;">IndexWriter</span></a></strong>
<p>In Lucene 2.3 (not yet released), there are substantial optimizations for Documents that use stored fields and term vectors, to save merging of these very large index files. You should see the best gains by using autoCommit=false for a single long-running session of <a href="http://wiki.apache.org/lucene-java/IndexWriter"><span style="color: #22148d;">IndexWriter</span></a> . Note however that searchers will not see any of the changes flushed by this <a href="http://wiki.apache.org/lucene-java/IndexWriter"><span style="color: #22148d;">IndexWriter</span></a> until it is closed; if that is important you should stick with autoCommit=true instead or periodically close and re-open the writer.</li>
<li><strong>Instead of indexing many small text fields, aggregate the text into a single &#8220;contents&#8221; field and index only that (you can still store the other fields).</strong></li>
<li><strong>Increase <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMergeFactor(int)"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">mergeFactor</span></a> , but not too much.</strong>
<p>Larger <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexWriter.html#setMergeFactor(int)"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">mergeFactors</span></a> defer merging of segments until later, thus speeding up indexing because merging is a large part of indexing. However, this will slow down searching, and you will run out of file descriptors if you make it too large. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking for the hard drives.</li>
<li><strong>Turn off any features you are not in fact using.</strong>
<p>If you are storing fields but not using them at query time, don&#8217;t store them. Likewise for term vectors. If you are indexing many fields, turning off norms for those fields may help performance.</li>
<li><strong>Use a faster analyzer.</strong>
<p>Sometimes analysis of a document takes a lot of time. For example, StandardAnalyzer is quite time consuming, especially in Lucene version &lt;= 2.2. If you can get by with a simpler analyzer, then try it.</li>
<li><strong>Speed up document construction.</strong>
<p>Often the process of retrieving a document from somewhere external (database, filesystem, crawled from a Web site, etc.) is very time consuming.</li>
<li><strong>Don&#8217;t optimize unless you really need to (for faster searching).</strong></li>
<li><strong>Use multiple threads with one IndexWriter.</strong>
<p>Modern hardware is highly concurrent (multi-core CPUs, multi-channel memory architectures, native command queuing in hard drives, etc.) so using more than one thread to add documents can give good gains overall. Even on older machines there is often still concurrency to be gained between IO and CPU. Test the number of threads to find the best performance point.</li>
<li><strong>Index into separate indices then merge.</strong>
<p>If you have a very large amount of content to index, you can break your content into N &#8220;silos&#8221;, index each silo on a separate machine, then use writer.addIndexesNoOptimize to merge them all into one final index.</li>
<li><strong>Run a Java profiler.</strong>
<p>If all else fails, profile your application to figure out where the time is going. I&#8217;ve had success with a very simple profiler called <a href="http://www.khelekore.org/jmp"><img src="http://wiki.apache.org/wiki/modern/img/moin-www.png" alt="[WWW]" width="11" height="11" /> <span style="color: #22148d;">JMP</span></a> . There are many others. Often you will be pleasantly surprised to find some silly, unexpected method is taking far too much time.</li>
</ul>
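<p>The &#8220;Flush by RAM usage&#8221; tip from the list above can be sketched as a small loop. This is only an illustration under stated assumptions: BufferingWriter and FakeWriter are hypothetical stand-ins that mimic just the ramSizeInBytes() and flush() calls named in the tip, so that the sketch runs without Lucene on the classpath. A real IndexWriter takes Document objects rather than strings, and the 2-bytes-per-character estimate is invented for the fake.</p>

```java
/** Sketch of "flush by RAM usage": flush when the buffer reaches a byte budget. */
final class RamBudgetIndexer {

    /** Adds every document, flushing whenever buffered RAM reaches maxRamBytes. */
    static void indexAll(BufferingWriter writer, String[] docs, long maxRamBytes) {
        for (String doc : docs) {
            writer.addDocument(doc);
            if (writer.ramSizeInBytes() >= maxRamBytes) {
                writer.flush(); // budget reached: write a segment now
            }
        }
        if (writer.ramSizeInBytes() > 0) {
            writer.flush(); // write out whatever is still buffered
        }
    }

    public static void main(String[] args) {
        FakeWriter writer = new FakeWriter();
        String[] docs = { "tiny", "a much longer document body", "mid-sized doc" };
        indexAll(writer, docs, 48);
        System.out.println("flushes: " + writer.flushCount);
    }
}

/** Hypothetical stand-in for the few writer calls the tip relies on. */
interface BufferingWriter {
    void addDocument(String doc);
    long ramSizeInBytes();
    void flush();
}

/** In-memory fake: counts 2 bytes per character and records each flush. */
final class FakeWriter implements BufferingWriter {
    private long buffered; // bytes currently held in the pretend RAM buffer
    int flushCount;        // number of flushes so far

    public void addDocument(String doc) { buffered += 2L * doc.length(); }
    public long ramSizeInBytes() { return buffered; }
    public void flush() { buffered = 0; flushCount++; }
}
```

<p>The point of the design is that the flush decision depends on bytes buffered, not on how many documents were added, so a run of tiny documents and one huge document are treated fairly.</p>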
</div>
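<p>The &#8220;Use multiple threads with one IndexWriter&#8221; tip can be sketched in the same spirit. The DocSink interface below is a hypothetical stand-in for a single shared, thread-safe writer; the pool size and the one-minute timeout are arbitrary values chosen for the example, not recommendations.</p>

```java
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: several worker threads feed documents to one shared writer. */
final class ParallelIndexer {

    /** Hypothetical stand-in for a thread-safe writer's addDocument call. */
    interface DocSink {
        void addDocument(String doc);
    }

    /** Submits one task per document to a fixed pool and waits for all of them. */
    static void indexAll(DocSink sink, String[] docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.execute(() -> sink.addDocument(doc));
        }
        pool.shutdown(); // accept no new tasks; queued ones still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }

    public static void main(String[] args) {
        AtomicInteger added = new AtomicInteger();
        String[] docs = new String[1000];
        Arrays.fill(docs, "doc");
        indexAll(doc -> added.incrementAndGet(), docs, 4);
        System.out.println("added " + added.get() + " documents");
    }
}
```

<p>As the tip says, the best thread count depends on the machine, so treat the number of threads as something to measure rather than guess.</p>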
<h2>See also:</h2>
<div></div>
<div class="u_bd">http://wiki.apache.org/jakarta-lucene/ImproveIndexingSpeed</div>
<div class="u_bd">http://hi.baidu.com/expertsearch/blog/item/393c702cbede6c33359bf706.html</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/05/9319.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Survey of Cross-Language Information Retrieval</title>
		<link>http://blog.zye.me/2011/05/3259.html</link>
		<comments>http://blog.zye.me/2011/05/3259.html#comments</comments>
		<pubDate>Wed, 11 May 2011 02:30:29 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[信息检索]]></category>
		<category><![CDATA[跨语言检索]]></category>

		<guid isPermaLink="false">http://jeffye.yo2.cn/articles/%e8%b7%a8%e8%af%ad%e8%a8%80%e4%bf%a1%e6%81%af%e6%a3%80%e7%b4%a2%e7%bb%bc%e8%bf%b0-%e4%b8%8d%e6%96%ad%e6%9b%b4%e6%96%b0%e4%b8%ad.html</guid>
		<description><![CDATA[1. An English-language survey (2005) that focuses on current approaches to CLIR systems. literature-review-of-cross-language-information-retrieval.pdf 2. A paper on the dictionary-based approach. It records a series of problems the authors ran into while taking part in the CLEF 2000-2002 evaluations (mainly across European languages) and while building a real system. Some of the experimental results are worth reading. The descriptions in the paper are not very clear, but either way it gives me some credible reference points. Paper title: Dictionary-Based Cross-Language Information Retrieval: Learning Experiences from CLEF 2000-2002 in progress]]></description>
			<content:encoded><![CDATA[<p>1. An English-language survey (2005) that focuses on current approaches to CLIR systems. <a title="literature-review-of-cross-language-information-retrieval.pdf" href="http://jeffye.yo2.cn/wp-content/uploads/192/19263/2008/02/literature-review-of-cross-language-information-retrieval.pdf">literature-review-of-cross-language-information-retrieval.pdf</a></p>
<p>2. A paper on the dictionary-based approach. It records a series of problems the authors ran into while taking part in the CLEF 2000-2002 evaluations (mainly across European languages) and while building a real system. Some of the experimental results are worth reading. The descriptions in the paper are not very clear, but either way it gives me some credible reference points. Paper title: Dictionary-Based Cross-Language Information Retrieval: Learning Experiences from CLEF 2000-2002</p>
<p class="zoundry_raven_tags"><span class="ztags">in progress</span></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/05/3259.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGIR 2011 accepted full paper list</title>
		<link>http://blog.zye.me/2011/04/55847.html</link>
		<comments>http://blog.zye.me/2011/04/55847.html#comments</comments>
		<pubDate>Tue, 19 Apr 2011 17:06:31 +0000</pubDate>
		<dc:creator>yezheng</dc:creator>
				<category><![CDATA[information Retrieval]]></category>
		<category><![CDATA[full paper]]></category>
		<category><![CDATA[SIGIR2011]]></category>

		<guid isPermaLink="false">http://blog.so8848.com/2011/04/55847.html/</guid>
		<description><![CDATA[SIGIR 2011 accepted full paper list (without poster papers) http://www.sigir2011.org/papers.htm Papers Utilizing Marginal Net Utility for Recommendation in E-commerce Jian Wang, Yi Zhang Summarizing the Differences in Multilingual News Xiaojun Wan, Houping Jia Cross-Language Web Page Classification via Joint Nonnegative Matrix Tri-factorization Based Dyadic K Wang Hua, Heng Huang Incremental Diversification for Very Large Sets: <a href='http://blog.zye.me/2011/04/55847.html'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p>SIGIR 2011 accepted full paper list (without poster papers)</p>
<p><a href="http://www.sigir2011.org/papers.htm">http://www.sigir2011.org/papers.htm</a></p>
<h1 style="font-family: ARIAL, Verdana; font-size: 20px; font-weight: normal; color: #4394d0; line-height: 22px; text-align: left; padding: 0px; margin: 0px;">Papers</h1>
</p>
<table style="font-family: ARIAL, Verdana; line-height: 18px;" border="0" cellspacing="15" cellpadding="0" width="100%">
<tbody>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Utilizing Marginal Net Utility for Recommendation in E-commerce<br />
</strong><em>Jian Wang, Yi Zhang</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Summarizing the Differences in Multilingual News<br />
</strong><em>Xiaojun Wan, Houping Jia</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Cross-Language Web Page Classification via Joint Nonnegative Matrix Tri-factorization Based Dyadic K<br />
</strong><em>Wang Hua, Heng Huang</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Incremental Diversification for Very Large Sets: a Streaming-based Approach<br />
</strong><em>Enrico Minack, Wolf Siberski, Wolfgang Nejdl</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Multifaceted Toponym Recognition for Streaming News<br />
</strong><em>Michael Lieberman, Hanan Samet</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Collaborative Competitive Filtering: Learning Recommender Using Context of User Choices<br />
</strong><em>Shuang-Hong Yang, Bo Long, Alex Smola, Hongyuan Zha, Zhaohui Zheng</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Who Should Share What? Item-level Social Influence Prediction for Users and Posts Ranking<br />
</strong><em>Cui Peng, Fei Wang, Shao-Wei Liu, Ming-Dong Ou, Shi-Qiang Yang</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Identifying Points of Interest by Self-Tuning Clustering</strong><br />
<em>YiYang Yang, Zhiguo Gong, Leong Hou U</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>PICASSO &#8211; To Sing you must Close Your Eyes and Draw</strong><br />
<em>Aleksandar Stupar, Sebastian Michel</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Collective Entity Linking in Web Text: A Graph-Based Method</strong><br />
<em>Xianpei Han, Le Sun</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Toward Social Context Summarization For Web Documents</strong><br />
<em>Zi Yang, Keke Cai, Jie Tang, Li Zhang, Zhong Su, Juanzi Li</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>CLR: A Collaborative Location Recommendation Framework based on Co-Clustering</strong><br />
<em>Kenneth Wai-Ting Leung, Wang-Chien Lee, Dik Lun Lee</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>A Boosting Approach to Improving Pseudo-Relevance Feedback</strong><br />
<em>Yuanhua Lv, Wan Chen, ChengXiang Zhai</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Detecting Outlier Sections in US Congressional Legislation</strong><br />
<em>Elif Aktolga, Irene Ros, Yannick Assogba</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Out of sight, not out of mind: On the effect of social and physical detachment on information need</strong><br />
<em>Elad Yom-Tov, Fernando Diaz</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Synthesizing High Utility Suggestions for Rare Web Search Queries</strong><br />
<em>Alpa Jain, Umut Ozertem, Emre Velipasaoglu</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Measuring Improvement in User Search Performance Resulting From Optimal Search Tips</strong><br />
<em>Neema Moraveji, Daniel Russell, Jacob Bien, David Mease</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>ILDA: Interdependent LDA Model for Learning Latent Aspects and their Ratings from Online Product Rev</strong><br />
<em>Samaneh Moghaddam, Martin Ester</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Evolutionary Timeline Summarization: a Balanced Optimization Framework via Iterative Substitution</strong><br />
<em>Rui Yan, Xiaojun Wan, Jahna Otterbacher, Liang Kong, Yan Zhang, Xiaoming Li</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Efficient Manifold Ranking for Image Retrieval</strong><br />
<em>Bin Xu, Jiajun Bu, Chun Chen, Zhanying He, Deng Cai</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>The Economics in Interactive Information Retrieval</strong><br />
<em>Leif Azzopardi</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Cluster-based fusion of retrieved lists</strong><br />
<em>Anna Khudyak Kozorovitsky, Oren Kurland</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Active Learning to Maximize Accuracy vs. Effort in Interactive Information Retrieval</strong><br />
<em>Aibo Tian, Matthew Lease</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Document Clustering with Universum</strong><br />
<em>Dan Zhang, Jingdong Wang, Luo Si</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Composite Hashing with Multiple Information Sources</strong><br />
<em>Dan Zhang, Fei Wang, Luo Si</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Automatic Boolean Query Suggestion for Professional Search</strong><br />
<em>Youngho Kim, Jangwon Seo, Bruce Croft</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Inverted Indexes for Phrases and Strings</strong><br />
<em>Manish Patil, Sharma V. Thankachan, Rahul Shah, Wing-Kai Hon, Jeffrey Vitter, Sabrina Chandrasekaran</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Query by Document via a Decomposition-Based Two-Level Retrieval Approach</strong><br />
<em>Linkai Weng</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>UPS: Efficient Privacy Protection in Personalized Web Search</strong><br />
<em>He Bai, Lidan Shou, Ke Chen, Gang Chen, Yunjun Gao</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Learning Search Tasks in Queries and Web Pages via Graph Regularization</strong><br />
<em>Ming Ji, Jun Yan, Siyu Gu, Jiawei Han, Xiaofei He</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Multimedia Answering: Enriching Text QA with Media Information</strong><br />
<em>Liqiang Nie, Meng Wang, Zha Zhengjun, Li Guangda, Tat Seng Chua</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Evaluating Diversified Search Results Using Per-intent Graded Relevance</strong><br />
<em>Tetsuya Sakai, Ruihua Song</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Functional Matrix Factorizations for Cold-Start Collaborative Filtering</strong><br />
<em>Ke Zhou, Shuang-Hong Yang, Hongyuan Zha</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Mining Topics on Participations for Community Discovery</strong><br />
<em>Guoqing Zheng, Jinwen Guo, Lichun Yang, Shengliang Xu, Shenghua Bao, Zhong Su, Dingyi Han, Yong Yu</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Integrating Hierarchical Feature Selection and Classifier Training for Multi-Label Image Annotation</strong><br />
<em>Cheng Jin, Xiangyang Xue</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Understanding Re-finding Behaviour in Naturalistic Email Interaction Logs</strong><br />
<em>David Elsweiler, Morgan Harvey, Martin Hacker</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models</strong><br />
<em>Yasser Ganjisaffar, Rich Caruana, Cristina Lopes</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Query Suggestions in the Absence of Query Logs</strong><br />
<em>Sumit Bhatia, Debapriyo Majumdar, Prasenjit Mitra</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Estimation Methods for Ranking Recent Information</strong><br />
<em>Miles Efron, Gene Golovchinsky</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Unsupervised Query Segmentation Using Clickthrough for Information Retrieval</strong><br />
<em>Yanen Li, Bo-June Hsu, ChengXiang Zhai, Kuansan Wang</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Pseudo Test Collections for Learning Web Search Ranking Functions</strong><br />
<em>Nima Asadi, Donald Metzler, Tamer Elsayed, Jimmy Lin</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Predicting Web Searcher Satisfaction with Existing Community-based Answers</strong><br />
<em>Qiaoling Liu, Eugene Agichtein, Gideon Dror, Evgeniy Gabrilovich, Yoelle Maarek, Dan Pelleg, Idan Szpektor</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Social Annotation in Query Expansion: a Machine Learning Approach</strong><br />
<em>Yuan Lin, Song Jin, Hongfei Lin, Yunlong Ma, Kan Xu</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Probabilistic Factor Models for Web Site Recommendation</strong><br />
<em>Hao Ma, Chao Liu, Irwin King, Michael R. Lyu</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Quaternary Semantic Analysis: Providing Recommendations based on Explicit and Implicit Feedbacks</strong><br />
<em>Chen Wei, Hsu Wynne, Mong Li Lee</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Mining Tags Using Social Endorsement Networks</strong><br />
<em>Theodoros Lappas, Kunal Punera, Tamas Sarlos</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Why Searchers Switch: Understanding and Predicting Engine Switching Rationales</strong><br />
<em>Qi Guo, Ryen White, Yunqiao Zhang, Blake Anderson, Susan Dumais</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Exploiting Geographical Influence for Collaborative Point-of-Interest Recommendation</strong><br />
<em>Mao Ye, Peifeng Yin, Wang-Chien Lee, Dik Lun Lee</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Clickthrough-Based Latent Semantic Models for Web Search</strong><br />
<em>Jianfeng Gao, Kristina Toutanova, Wen-tau Yih</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Indexing Strategies for Graceful Degradation of Search Quality</strong><br />
<em>Shuai Ding, Sreenivas Gollapudi, Samuel Ieong, Krishnaram Kenthapadi, Alexandros Ntoulas</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Enriching Document Representation via Translation for Improved Monolingual Information Retrieval</strong><br />
<em>Seung-Hoon Na, Hwee Tou Ng</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Regularized Latent Semantic Indexing</strong><br />
<em>Quan Wang, Jun Xu, Hang Li, Nick Craswell</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>People Searching for People: Analysis of a People Search Engine Log</strong><br />
<em>Wouter Weerkamp, Richard Berendsen, Bogomil Kovachev, Edgar Meij, Krisztian Balog, Maarten de Rijke</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>An Event-centric Model for Multilingual Document Similarity</strong><br />
<em>Jannik Str&ouml;tgen, Conny Junghans, Michael Gertz</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>From One Tree to a Forest: a Unified Solution for Structured Web Data Extraction</strong><br />
<em>Qiang Hao, Rui Cai, Lei Zhang</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>A Site Oriented Method For Segmenting Web Pages</strong><br />
<em>David Fernandes de Oliveira, Edleno Moura, Altigran da Silva, Berthier Ribeiro-Neto, Edisson Braga</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Learning to Rank for Freshness and Relevance</strong><br />
<em>Na Dai, Milad Shokouhi, Brian Davison</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>A Novel Corpus-Based Stemming Algorithm using Co-occurrence Statistics</strong><br />
<em>Jiaul Paik, Dipasree Pal, Swapan Parui</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>DOM Based Content Extraction via Text Density</strong><br />
<em>Fei Sun, Dandan Song, Lejian Liao</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Scalable multi-dimensional user intent identification using tree structured distributions</strong><br />
<em>Vinay Jethava, Liliana Calderon-Benavides, Chiranjib Bhattacharyya, Devdatt Dubhashi, Ricardo Baeza-Yates</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Handling Data Sparsity in Collaborative Filtering using Emotion and Semantic Based Features</strong><br />
<em>Yashar Moshfeghi, Benjamin Piwowarski, Joemon M. Jose</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Improved video categorization from text metadata and user comments</strong><br />
<em>Katja Filippova, Keith Hall</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Distributed, Private, and Anonymous Search Logs</strong><br />
<em>Henry Feild, James Allan, Joshua Glatt</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Evaluating Multi-Query Sessions</strong><br />
<em>Evangelos Kanoulas, Ben Carterette, Paul Clough, Mark Sanderson</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Parameterized Concept Weighting in Verbose Queries</strong><br />
<em>Michael Bendersky, Donald Metzler, Bruce Croft</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Mining Weakly Labeled Web Facial Images for Search-based Face Annotation</strong><br />
<em>Steven Hoi, Dayong Wang, Ying He</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Fast Context-aware Recommendations with Factorization Machines</strong><br />
<em>Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, Lars Schmidt-Thieme</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>No Free Lunch: Brute Force vs. Locality-Sensitive Hashing for Cross-lingual Pairwise Similarity</strong><br />
<em>Ferhan Ture, Tamer Elsayed, Jimmy Lin</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Dynamics of Focus-of-Attention in Diagnostic Search</strong><br />
<em>Marc-Allen Cartright, Ryen White, Eric Horvitz</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>CRTER: Using Cross Terms to Enhance Probabilistic Information Retrieval</strong><br />
<em>Jiashu Zhao, Jimmy Huang, Ben He</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>User Behavior in Zero-Recall eCommerce Queries</strong><br />
<em>Gyanit Singh, Nish Parikh, Neel Sundaresan</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Faster Temporal Range Queries over Versioned Text</strong><br />
<em>Jinru He, Torsten Suel</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Hypergeometric Language Models for Republished Article Finding</strong><br />
<em>Manos Tsagkias, Maarten de Rijke, Wouter Weerkamp</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>SCENE: A Scalable Two-Stage Personalized News Recommendation System</strong><br />
<em>Lei Li, Dingding Wang, Tao Li</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Crowdsourcing for Book Search Evaluation: Impact of Quality on Comparative System Ranking</strong><br />
<em>Gabriella Kazai, Jaap Kamps, Marijn Koolen, Natasa Milic-Frayling</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Post-Ranking Query Suggestion by Diversifying Search Results</strong><br />
<em>Yang Song, Dengyong Zhou, Li-wei He</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Authorship Classification: A Discriminative Syntactic Tree Mining Approach</strong><br />
<em>Sangkyum Kim, Hyungsul Kim, Tim Weninger, Jiawei Han, Hyun Duk Kim</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Learning Online Discussion Structures by Conditional Random Fields</strong><br />
<em>Wang Hongning, Chi Wang, ChengXiang Zhai, Jiawei Han</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Filtering Semi-Structured Documents Based on Faceted Feedback</strong><br />
<em>Lanbo Zhang, Yi Zhang</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>A Cascade Ranking Model for Efficient Ranked Retrieval</strong><br />
<em>Lidan Wang, Jimmy Lin, Donald Metzler</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Ranking Related News Predictions</strong><br />
<em>Nattiya Kanhabua, Roi Blanco, Michael Matthews</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Modeling and Analysis of Cross-Session Search Tasks</strong><br />
<em>Alexander Kotov, Paul Bennett, Ryen White, Susan Dumais, Jaime Teevan</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Repeatable and Reliable Search System Evaluation using Crowd-Sourcing</strong><br />
<em>Roi Blanco, Harry Halpin, Daniel Herzig, Peter Mika, Jeffrey Pound, Henry Thompson, Thanh D. Tran</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Enhanced Results for Web Search</strong><br />
<em>Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Efficiently Collecting Relevance Information from Clickthroughs for Web Retrieval System Evaluation</strong><br />
<em>Jing He, Xin Zhao, Baihan Shu, Xiaoming Li, Hongfei Yan</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>System Effectiveness, User Models, and User Utility: A General Framework for Investigation</strong><br />
<em>Ben Carterette</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Seeding Simulated Queries with User-study Data for Personal Search Evaluation</strong><br />
<em>David Elsweiler, David Losada, Jose Carlos Toucedo, Ronald T. Fern&aacute;ndez</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Associative Tag Recommendation Exploiting Multiple Textual Features</strong><br />
<em>Fabiano Bel&eacute;m, Eder Martins, Tatiana Pontes, Jussara Almeida, Marcos Goncalves</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Recommending Ephemeral Items at Web Scale</strong><br />
<em>Ye Chen, John Canny</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Improving Local Search Ranking through External Logs</strong><br />
<em>Klaus Berberich, K&ouml;nig Arnd, Dimitrios Lymberopoulos, Peixiang Zhao</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Intent-Aware Search Result Diversification</strong><br />
<em>Rodrygo Santos, Craig Macdonald, Iadh Ounis</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Inferring and Using Location Metadata to Personalize Web Search</strong><br />
<em>Paul Bennett, Filip Radlinski, Ryen White, Emine Yilmaz</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Evaluating the Synergic Effect of Collaboration in Information Seeking</strong><br />
<em>Chirag Shah, Roberto Gonz&aacute;lez-Ib&aacute;&ntilde;ez</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Temporal Index Sharding for Space-Time Efficiency in Archive Search</strong><br />
<em>Avishek Anand, Srikanta Bedathur, Ralf Schenkel, Klaus Berberich</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Energy-Price-Driven Query Processing in Multi-center Web Search Engines</strong><br />
<em>Enver Kayaaslan, B. Barla Cambazoglu, Roi Blanco, Cevdet Aykanat</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Learning Sentiment Streams with Training Expansion and Demand-Driven Partitioning</strong><br />
<em>Ismael Santana Silva, Jana&iacute;na Gomide, Adriano Veloso, Renato Ferreira, Wagner Meira Jr.</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>ViewSer: Enabling Large-Scale Remote User Studies of Web Search Examination and Interaction</strong><br />
<em>Dmitry Lagun, Eugene Agichtein</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Posting List Intersection on Multicore Architectures</strong><br />
<em>Shirish Tatikonda, B. Barla Cambazoglu, Flavio Junqueira</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Timestamp-based Result Cache Invalidation Mechanisms for Web Search Engines</strong><br />
<em>Adiye Alici, Ismail Altingovde, Rifat Ozcan, B. Barla Cambazoglu, Ozgur Ulusoy</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Enhancing Multi-Label Music Genre Classification Through Ensemble Techniques</strong><br />
<em>Chris Sanden, John Zhang</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Learning Relevance in a Heterogeneous Social Network and Its Application in Online Targeting</strong><br />
<em>Chi Wang, Rajat Raina, David Fong, Ding Zhou, Jiawei Han</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Find It If You Can: Modeling and Predicting Different Types of Web Search Success with Behavior Data</strong><br />
<em>Mikhail Ageev, Qi Guo, Dmitry Lagun, Eugene Agichtein</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Faster Top-k Document Retrieval Using Block-Max Indexes</strong><br />
<em>Shuai Ding, Torsten Suel</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Enhancing Ad-hoc Relevance Weighting Using Probability Density Estimation</strong><br />
<em>Xiaofeng Zhou, Jimmy Huang, Ben He</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>On Theme Location Discovery for Travelogue Services</strong><br />
<em>Mao Ye, Rong Xiao, Wang-Chien Lee, Xing Xie</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Quantifying test collection quality based on the consistency of relevance judgements</strong><br />
<em>Falk Scholer, Andrew Turpin, Mark Sanderson</em></td>
</tr>
<tr style="height: 15px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Competition-based User Expertise Level Estimation</strong><br />
<em>Jing Liu, Young-In Song, Chin-Yew Lin</em></td>
</tr>
<tr style="height: 30px;">
<td class="xl65112" style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; padding-left: 10px; font-family: ARIAL, Verdana; font-size: 13px; vertical-align: bottom; white-space: normal; text-align: left; margin: 0px;"><strong>Relevant Knowledge Helps in Choosing Right Teacher: Active Query Selection for Ranking Adaptation</strong><br />
<em>Peng Cai, Wei Gao, Aoying Zhou, Kam-Fai Wong</em></td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://blog.zye.me/2011/04/55847.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

