BM family weighting scheme Introduction and important Literatures
1. Okapi BM25 是IR领域中一个非常重要的 Ranking 公式,bm的意思是best match, 理论基础为 Probabilistic Theory, 由 Stephen E. Robertson 在1970s发明, 也是Robertson 教授的成名作,奠定他在IR领域崇高地位。
//////////////Okapi bm25 formula/////////////////////


double K = k_1 * ((1 – b) + b * docLength / averageDocumentLength) + tf;
return Idf.log((numberOfDocuments – n_t + 0.5d) / (n_t+ 0.5d)) * ((k_1 + 1d) * tf / (K )) * ((k_3+1)*keyFrequency/(k_3+keyFrequency));
2. 不过对于各种IR weighting 公式,个人认为理论意义大于实际意义,即时最好的weighting 公式在performance上也不会比有heuristic weighting 公式好很多,最终形式上差别也不大。所以不研究理论,看看最终weighting 公式是怎么样就OK了。
3. 关于Okapi BM25的理论可以看如下参考文献,简单了解直接看Wikipedia:Okapi BM25
4. 喜欢Lucene的可以参考, BM25 implementation for Lucene , 不过lucene中实现BM25时document的长度length信息是用一个byte来存储的,解码的时候会有一定的精度损失。这个不过也该不会对performance有太大的影响。
5. BM series 还有bm250,bm2500,bm26等。不过这些命名也不知道有什么规则
———————————————————————-
The Okapi BM25 Weighting Scheme
This is a technical note about the BM25 weighting scheme, which is the default weighting scheme used by Xapian. Recent TREC tests have shown BM25 to be the best of the known probabilistic weighting schemes. In case you’re wondering, the BM simply stands for “Best Match”. We’ll follow the evolution from the traditional probabilistic weighting scheme through to BM25.
The Traditional Probabilistic Weighting Scheme
In its most general form, the traditional probabilistic term weighting function is:
| (k3+1)q (k3+q) | · | (k1+1)f (k1L+f) | ·log | (r+0.5)(N-n-R+r+0.5) (n-r+0.5)(R-r+0.5) | ...(1) |
where: k1, k3 are constants. q is the wqf, the within query frequency, f is the wdf, the within document frequency, n is the number of documents in the collection indexed by this term, N is the total number of documents in the collection, r is the number of relevant documents indexed by this term, R is the total number of relevant documents, L is the normalised document length (i.e. the length of this document divided by the average length of documents in the collection). The factors (k3 + 1) and (k1 + 1) are unnecessary here, but help scale the weights, so the first component is 1 when q = 1 etc. But they are critical below when we add an extra item to the sum of term weights.
BM11
Stephen Robertson’s BM11 uses formula (1) for the term weights, but adds an extra item to the sum of term weights to give the overall document score:
| k2 nq | (1-L) (1+L) | ...(2) |
where: nq is the number of terms in the query (the query length), k2 is yet another constant. Note that this extra item is zero when L = 1.
BM15
BM15 is BM11 with the k1+f in place of k1L+f in (1).
BM25
BM25 combines the B11 and B15 with a scaling factor, b, which turns BM15 into BM11 as it moves from 0 to 1:
| (k3+1)q (k3+q) | · | (k1+1)f (K+f) | ·log | (r+0.5)(N-n-R+r+0.5) (n-r+0.5)(R-r+0.5) | ...(3) |
where: K = k1(bL + (1-b)) BM25 originally introduced another constant, as a power to which f and K are raised. However, Stephen remarks that powers other than 1 were ‘not helpful’, and other tests confirm this, so Xapian’s implementation of BM25 ignores this. (2) and (3) make up BM25, with which Stephen has had so much recent success. This does all seem somewhat ad-hoc, with so many unknown constants in the formula. But note that with k2 = 0 and b = 1 we get the traditional formula anyway. The default parameter values Xapian uses are k1 = 1, k2 = 0, k3 = 1, and b = 0.5. These are reasonable defaults, but the optimum values will vary with both the documents being searched and the type of queries, so you may be able to improve the effectiveness of your search system by tuning the values of these parameters. In Xapian, we also apply a floor to L (0.5 by default) which helps stop tiny documents get ridiculously high weights. And the matcher wants the extra item in the sum to be positive, so we add k2nq (constant for a given query) to (2) to give:
| 2 k2 nq (1 + L) | ...(4) |
———————————————————————-
Important Literature:
1,2,3 BM 基本理论和实践
1. Robertson, S. and S. Walker (1994). Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proceedings of the 17th ACM Conference on Research and Development in Information Retrieval (SIGIR’94), pp. 232–241.
1.1 Okapi at trec-3. 1.2 okapi at trec-5 2. Sparck-Jones, K., S. Walker, and S. Robertson (2000). A probabilistic model of information retrieval: Development and comparative experiments (part1 and 2). Information Processing & Management 36(6), 779–840. (BM250)
1,2 最原始理论文献
———————————————-
3. Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
4. Stephen E. Robertson, Steve Walker, and Micheline Hancock-Beaulieu. Okapi at TREC-7. In Proceedings of the Seventh Text REtrieval Conference. Gaithersburg, USA, November 1998
5. Nick Craswell, Hugo Zaragoza, Stephen Robertson. Microsoft Cambridge at TREC-14: Enterprise Track. In Proceedings of the Fourteenth Text REtrieval Conference (TREC 2005). Gaithersburg, USA, November 2005. Describes application and tuning of Okapi BM25F.
6. Hugo Zaragoza, Nick Craswell, Michael Taylor, S. Saria, and S. E. Robertson. Microsoft Cambridge at TREC 13: Web and hard tracks. In Text REtrieval Conference (TREC-13), 2004. here
——————————-
7. Michael Taylor, Hugo Zaragoza, Nick Craswell, Stephen Robertson, and Chris Burges. Optimisation methods for ranking functions with multiple parameters. In Fifteenth Conference on Information and Knowledge Management (ACM CIKM), 2006. here 8. S. E. Robertson, Hugo Zaragoza, and Michael Taylor. Simple bm25 extension to multiple weighted fields. In Thirteenth Conference on Information and Knowledge Management (CIKM), 2004. here 9. S. E. Robertson, C. J. van Rijsbergen, and M. F. Porter. Probabilistic models of indexing and searching. In R.N. Oddy, S.E. Robertson, C.J. van Rijsbergen, and P.W. Williams, editors, Information Retrieval Research, pages 35-56, London, 1981. Butterworths. here
——————————————————————————-
BM扩展Literatures
10. Xiangji Huang and Stephen E. Robertson. “A Probabilistic Approach to Chinese Information Retrieval: Theory and Experiments”, Proceedings of the 22nd Annual BCS-IRSG Colloquium on Information Retrieval Research, Cambridge, England, April 2000. 178-193. (BM26) 11. Robertson, S. E. (1997). Overview of the Okapi projects. Journal of Documentation, 53(1), 3-7 (BM2500) 11.1 Robertson, S. E., and Walker, S., Okapi/Keenbow at TREC-8. In TREC-8. (BM2500)
——————————————————————–
U of Glasgow 的Terrier系统中主推的Ranking Model,也是Probabilistic Models,但model中assumption不一样。该模型已经在很多TREC实验中证明效果好由于BM25。 顺便多说一局,Terrier 也是一个不可多得的IR平台,Java 平台,喜欢Java的可一同研究。Terrier out-of-box 已经非常适合batch experiment,这点感觉非常爽。
12. GIANNI AMATI el at. Probabilistic Models for Information Retrieval based on Divergence from Randomness–2002.
13. Gianni Amati and Cornelis Joost Van Rijsbergen.
Probabilistic models of information retrieval based on measuring the divergence from randomness. 20(4):357-389, 2002
——————————————————————————————-
Excellent Review and Tutorial,
我感觉是来从理论解释最深入浅出的,每一步近似的推倒都说得非常靠谱,如果想学BM系列probability Retrieval model又看不懂推倒的话,建议看看篇。Highly Recommendation
14. Norbert Fuhr. Probabilistic Models in Information Retrieval.
更新……