Information Retrieval Blog » information Retrieval
http://blog.zye.me ANTI-GFW Sun, 29 Aug 2010 03:59:54 +0000

Computer Science Department Journal Rankings
http://blog.zye.me/2010/08/55787.html Sat, 21 Aug 2010 21:59:01 +0000 Jeffye

Artificial Intelligence

Premium: Artificial Intelligence; Computational Linguistics; IEEE Trans on Pattern Analysis and Machine Intelligence; IEEE Trans on Robotics and Automation; IEEE Trans on Image Processing; Journal of AI Research; Neural Computation; Machine Learning; Intl Jnl of Computer Vision; IEEE Trans on Neural Networks

Leading: Artificial Intelligence Review; ACM Transactions on Asian Language Information Processing; AI Magazine; Annals of Mathematics and AI; Applied Artificial Intelligence; Applied Intelligence; Artificial Intelligence in [...]

SIGIR 2010 Full Paper List
http://blog.zye.me/2010/04/55738.html Sat, 17 Apr 2010 15:19:15 +0000 Jeffye

30 Estimation of Statistical Translation Models Based on Mutual Information for Ad Hoc Information Retrieval. Maryam Karimzadehgan, ChengXiang Zhai (University of Illinois at Urbana-Champaign)

502 DivQ: [...]

Be care of RangeQuery in Lucene
http://blog.zye.me/2009/09/54218.html Thu, 03 Sep 2009 02:13:38 +0000 jeffye

Reminder, Lucene has many Query types:

TermQuery, BooleanQuery, ConstantScoreQuery, MatchAllDocsQuery, MultiPhraseQuery, FuzzyQuery, WildcardQuery, RangeQuery, PrefixQuery, PhraseQuery, Span*Query, DisjunctionMaxQuery, etc.

Lucene provides many Query implementations, which makes it very powerful for search. However, you should be very careful when using queries such as RangeQuery, especially when your collection is very large.
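As background, and not necessarily the argument the post itself goes on to make: in Lucene 2.x a RangeQuery is rewritten into a BooleanQuery with one clause per indexed term that falls inside the range, and BooleanQuery rejects more than 1024 clauses by default (TooManyClauses). The sketch below does not call Lucene. It uses a plain sorted set as a stand-in for the term dictionary, and the class and method names are hypothetical.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

// Illustration only: a sorted set stands in for Lucene's term dictionary.
final class RangeExpansionSketch {
    // BooleanQuery's default clause limit in Lucene 2.x.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    // Number of indexed terms in [lower, upper]; each would become one clause.
    static int termsInRange(NavigableSet<String> terms, String lower, String upper) {
        return terms.subSet(lower, true, upper, true).size();
    }

    // True if expanding the range would go past the default clause limit.
    static boolean wouldExceedClauseLimit(NavigableSet<String> terms, String lower, String upper) {
        return termsInRange(terms, lower, upper) > DEFAULT_MAX_CLAUSES;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            terms.add(String.format("%04d", i)); // zero-padded so string order matches numeric order
        }
        System.out.println(termsInRange(terms, "0000", "0099"));           // 100
        System.out.println(wouldExceedClauseLimit(terms, "0000", "1999")); // true
    }
}
```

The larger the collection, the more distinct terms (dates, IDs, prices) tend to fall inside a given range, which is why the problem shows up mainly on large indexes.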

As you know, Lucene will [...]

The Ivory Toolkit with the SMRF Retrieval Engine (under Hadoop Framework)
http://blog.zye.me/2009/08/53939.html Sat, 08 Aug 2009 17:53:41 +0000 jeffye

This would also be a future step for our SaberLucene Project (under release). Besides the MapReduce framework, [...]

Trie-based approximate autocomplete implementation with support for ranks and synonyms
http://blog.zye.me/2009/07/53262.html Thu, 02 Jul 2009 18:58:09 +0000 jeffye

Posted by Kelvin on 01 Jul 2009 at 02:30 am | Tagged as: programming

The problem of auto-completing user queries is a well-explored one.

For example:

Type less, find more: fast autocompletion search with a succinct index
http://stevedaskam.wordpress.com/2009/06/07/putting-autocomplete-data-structure-to-the-test/
http://suggesttree.sourceforge.net/
http://sujitpal.blogspot.com/2007/02/three-autocomplete-implementations.html

However, there’s been little written [...]

New Lucene subproject OpenRelevance launches
http://blog.zye.me/2009/06/53099.html Sat, 27 Jun 2009 03:42:39 +0000 jeffye

25 June 2009 – Apache Open Relevance Kickoff

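Just yesterday, Apache officially voted to start a new Lucene subproject, the Open Relevance Project (ORP). ORP's main goal is to provide retrieval collections, relevance judgments, and queries, so that Lucene developers and users can run relevance evaluations more easily. It resembles evaluation campaigns such as TREC and NTCIR, but the project will be more open.

For more information, see:

http://lucene.apache.org/openrelevance/

http://wiki.apache.org/lucene-java/OpenRelevance

This is fairly exciting news, especially for someone like me who likes Lucene and wants to use it for some IR research.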

Online free book: Search User Interface
http://blog.zye.me/2009/06/53091.html Fri, 26 Jun 2009 03:33:59 +0000 jeffye

Marti Hearst has just finished a new book, Search User Interfaces. Excitingly, it can be read online for free! Marti is a professor at UC Berkeley and an expert in search user interface design. The book's companion blog: SearchUpTicious.

url:  http://searchuserinterfaces.com/book/

Reading *.tar or *.tar.gz files with the Java Tar Package
http://blog.zye.me/2009/06/53089.html Fri, 26 Jun 2009 02:10:20 +0000 jeffye

The Java Tar Package, com.ice.tar, implements an I/O package for reading and writing tar archives. Its usage is close to that of the java.util.zip package bundled with Java, so it should be very easy to pick up if you have used Java's zip package.

Combined with Java's GZIPInputStream, it also makes reading .tar.gz files easy. Steps:

1. Read the file and create a GZIPInputStream.
2. Pass the GZIPInputStream from step 1 to the TarInputStream in the Java Tar Package.

The process is very simple; the code is as follows:

    private void visitTARGZ(P parser, File targzFile) throws IOException {
        FileInputStream fileIn = null;
        BufferedInputStream bufIn = null;
        GZIPInputStream gzipIn = null;
        TarInputStream taris = null;
        try {
            fileIn = new FileInputStream(targzFile);
            bufIn = new BufferedInputStream(fileIn);
            gzipIn = new GZIPInputStream(bufIn); //first [...]

Factors affecting Lucene indexing speed: MergeFactor, MaxMergeDocs, RAMBufferSizeMB
http://blog.zye.me/2009/06/52813.html Tue, 23 Jun 2009 07:48:21 +0000 jeffye

Version: Java Lucene 2.4

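With the indexing algorithm fixed, three IndexWriter parameters have the greatest effect on Lucene's indexing speed: MergeFactor, MaxMergeDocs, and RAMBufferSizeMB. These parameters simply control how often data moves between memory and disk and how often index segments are merged, and that is how they speed up indexing. They should of course be tuned to the hardware at hand.

MaxBufferedDocs

This parameter determines how many documents are indexed in memory; once that number is reached, the in-memory index is written to disk as a new index segment. It therefore acts as a memory buffer, and in general the larger it is, the faster indexing goes. Note that the parameter described here is MaxBufferedDocs, not MaxMergeDocs: MaxMergeDocs instead caps how many documents a segment may contain and still be merged with others. MaxBufferedDocs is disabled by default, because Lucene uses another parameter (RAMBufferSizeMB) to control this buffer. The two parameters can in fact be used together; when both are set, the buffer is written to disk as a new segment as soon as either trigger condition is met.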

RAMBufferSizeMB

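Controls the upper limit on the memory used to buffer indexed documents; when the buffered documents reach this limit, they are written to disk. Again, in general the larger the value, the faster indexing goes. This parameter is quite useful when document sizes are not known in advance, because it helps avoid an OutOfMemoryError.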
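The "whichever trigger fires first" behaviour of the two buffer limits can be sketched as a small simulation. This is not Lucene code: the class name, method name, and byte-based accounting are assumptions made for illustration, and real Lucene measures RAM use of its internal indexing structures rather than raw document sizes.

```java
// Illustration only: counts how many segments would be flushed for a stream
// of documents when a doc-count limit and a RAM limit are both active.
final class FlushTriggerSketch {
    static int countFlushes(long[] docSizesBytes, int maxBufferedDocs, long ramBufferBytes) {
        int flushes = 0;
        int bufferedDocs = 0;
        long bufferedBytes = 0;
        for (long size : docSizesBytes) {
            bufferedDocs++;
            bufferedBytes += size;
            // Either condition alone is enough to flush the buffer to a new segment.
            if (bufferedDocs >= maxBufferedDocs || bufferedBytes >= ramBufferBytes) {
                flushes++;
                bufferedDocs = 0;
                bufferedBytes = 0;
            }
        }
        return flushes; // documents still in the buffer at the end are not counted
    }

    public static void main(String[] args) {
        long[] sizes = {10, 10, 10, 100, 10};
        // Flush 1 is caused by the doc-count limit, flush 2 by the RAM limit.
        System.out.println(countFlushes(sizes, 3, 100)); // 2
    }
}
```

With uniformly small documents the document-count limit dominates; a single unusually large document trips the RAM limit instead, which is the case the RAM-based setting is meant to protect against.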

MergeFactor

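This parameter governs the merging of sub-indexes (segments). Indexing in Lucene proceeds roughly as follows: the index is first written to memory and, once a limit is triggered, written to disk as an independent sub-index, which Lucene calls a segment. In general these segments need to be merged into a single index, i.e. optimize(); otherwise search speed suffers, and it can also lead to "too many open files" errors. MergeFactor controls how many segments may accumulate on disk before they must be merged into a somewhat larger index. MergeFactor should not be set too high, especially when MaxBufferedDocs is small (which produces more segments); otherwise it causes "too many open files" errors and can even cause failures outside the JVM.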

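Note: Lucene's default merge mechanism is not pairwise; it appears to merge multiple segments at once, ending in one large final index. A larger MergeFactor therefore uses more memory and makes indexing somewhat faster, but in my experience a value that is too large, say 300, still makes the final merge very slow. For batch indexing, MergeFactor should be greater than 10.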
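A simplified model of the merge behaviour described above: each flush adds one level-0 segment, and whenever MergeFactor segments exist at the same level they are merged into one segment at the next level. This is a sketch under that assumption, not Lucene's actual merge policy, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: segment counts behave like the digits of a counter in base mergeFactor.
final class MergeFactorSketch {
    // Number of segments on disk after the given number of flushes.
    static int segmentsAfter(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        List<Integer> perLevel = new ArrayList<>(); // perLevel.get(i) = segments at level i
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            while (true) {
                if (perLevel.size() <= level) {
                    perLevel.add(0);
                }
                int count = perLevel.get(level) + 1;
                if (count < mergeFactor) {
                    perLevel.set(level, count);
                    break;
                }
                // mergeFactor segments at this level: merge them into one at the next level.
                perLevel.set(level, 0);
                level++;
            }
        }
        int total = 0;
        for (int c : perLevel) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(segmentsAfter(999, 10));  // 27
        System.out.println(segmentsAfter(1000, 10)); // 1
        System.out.println(segmentsAfter(299, 300)); // 299
    }
}
```

In this model a MergeFactor of 300 lets up to 299 segments pile up at each level before any merge happens, which matches the post's warning about "too many open files" and slow final merges.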

Lucene search optimization (1)
http://blog.zye.me/2009/06/52812.html Tue, 23 Jun 2009 07:47:46 +0000 jeffye

Lucene supports in-memory indexes: searching them is orders of magnitude faster than file-based I/O. http://www.onjava.com/lpt/a/3273

It is also necessary to create as few IndexSearcher instances as possible and to cache search results at the front end.

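Lucene's optimization for full-text search is that, after the first search of the index, it does not read the actual contents of every matching record (Document). It only places the IDs of the 100 best-matching results (TopDocs) in the result cache and returns. Compare this with a database query: for a result set of 10,000 rows, a database typically has to fetch the contents of all the records before it starts returning the result set to the application. So even when the total number of matches is large, Lucene's result set does not take up much memory. Typical fuzzy search applications never need that many results; the first 100 already satisfy more than 90% of search needs.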

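If results beyond the first cached batch are needed, the Searcher searches again, builds a cache twice the size of the previous one, and continues fetching from there. So if you construct a Searcher to fetch results 1-120, it actually performs two searches: once the first 100 are consumed the cache is exhausted, and the Searcher re-runs the search to build a 200-result cache, and so on with 400 and then 800. Because these caches become inaccessible once the Searcher object goes away, you may want to cache result records yourself. Try to keep the number of cached results under 100 so as to make full use of the first result cache and avoid making Lucene repeat the search; results can also be cached in tiers.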
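The doubling behaviour described above can be put into numbers with a small simulation. It models only what the paragraph states (a first cache of 100 hits that doubles each time it is exhausted) and does not call Lucene; the class and method names are hypothetical.

```java
// Illustration only: models the cache-doubling behaviour described in the text.
final class HitsCacheSketch {
    static final int INITIAL_CACHE = 100;

    // Number of searches run to reach the resultsWanted-th hit (1-based).
    static int searchesNeeded(int resultsWanted) {
        if (resultsWanted < 1) {
            throw new IllegalArgumentException("resultsWanted must be >= 1");
        }
        int searches = 1;
        long cache = INITIAL_CACHE;
        while (resultsWanted > cache) {
            cache *= 2; // the search is re-run with a cache twice as large
            searches++;
        }
        return searches;
    }

    // Total hits collected across all of those searches, repeats included.
    static long hitsCollected(int resultsWanted) {
        if (resultsWanted < 1) {
            throw new IllegalArgumentException("resultsWanted must be >= 1");
        }
        long cache = INITIAL_CACHE;
        long total = cache;
        while (resultsWanted > cache) {
            cache *= 2;
            total += cache;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(searchesNeeded(120)); // 2
        System.out.println(hitsCollected(120));  // 300
        System.out.println(searchesNeeded(801)); // 5
    }
}
```

Reading 120 results therefore costs two searches and 300 collected hits in this model, which is the waste the advice about staying under 100 cached results is meant to avoid.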

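Another feature of Lucene is that it automatically filters out low-scoring results while collecting hits. This too differs from database applications, which must return every matching result.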
