cognica-io / bayesian-bm25

Bayesian probability transforms for BM25 retrieval scores

61 stars

1 fork

0 issues

Python





Repository Overview (README excerpt)


Bayesian BM25

[Blog] [Papers]

The reference implementation of the Bayesian BM25 and From Bayesian Inference to Neural Computation papers, by the original author. Converts raw BM25 retrieval scores into calibrated relevance probabilities using Bayesian inference.

Overview

Standard BM25 produces unbounded scores that lack consistent meaning across queries, making threshold-based filtering and multi-signal fusion unreliable. Bayesian BM25 addresses this by applying a sigmoid likelihood model with a composite prior (term frequency + document length normalization) and computing Bayesian posteriors that output well-calibrated probabilities in [0, 1]. A corpus-level base rate prior further improves calibration by 68–77% without requiring relevance labels.

Key capabilities:

• **Score-to-probability transform**: convert raw BM25 scores into calibrated relevance probabilities via sigmoid likelihood + composite prior + Bayesian posterior
• **Base rate calibration**: corpus-level base rate prior estimated from score distribution (95th percentile, mixture model, or elbow detection) decomposes the posterior into three additive log-odds terms, reducing expected calibration error by 68–77% without relevance labels
• **Parameter learning**: batch gradient descent or online SGD with EMA-smoothed gradients and Polyak averaging, with three training modes: balanced (C1), prior-aware (C2), and prior-free (C3)
• **Probabilistic fusion**: combine multiple probability signals using AND, OR, NOT, and log-odds conjunction with multiplicative confidence scaling, optional per-signal reliability weights (Log-OP), and sparse signal gating (ReLU/Swish/GELU/Softplus activations from Paper 2, Theorems 6.5.3/6.7.4/6.8.1/Remark 6.5.4) with generalized beta control (Theorem 6.7.6)
• **Learnable fusion weights**: learns per-signal reliability from labeled data via a Hebbian gradient that is backprop-free, starting from Naive Bayes uniform initialization (Remark 5.3.2); supports optional additive bias in log-odds space
• **Attention-based fusion**: learns query-dependent signal weights via attention mechanism (Paper 2, Section 8), with exact attention pruning via and (Theorem 8.7.1); supports optional
• **Multi-head attention**: creates multiple independent attention heads with different initializations and averages their log-odds for more robust fusion (Remark 8.6, Corollary 8.7.2)
• **Neural score calibration**: (sigmoid) and (PAVA) convert raw neural model scores into calibrated probabilities for Bayesian fusion (Section 12.2 #5)
• **External prior features**: callable on allows custom document priors to replace the composite prior, enabling features like recency or popularity weighting (Section 12.2 #6)
• **Temporal adaptation**: uses exponential decay to weight recent observations more heavily in , tracking concept drift in non-stationary relevance patterns (Section 12.2 #3)
• **Hybrid search**: converts vector similarity scores to probabilities for fusion with BM25 signals via weighted log-odds conjunction
• **Vector score calibration**: converts vector distances into calibrated probabilities via likelihood ratio framework: , with KDE/GMM density estimation, gap detection, and auto-routing (Paper 3, Theorem 3.1.1)
• **Index-aware density priors**: and provide density-based prior weights from IVF cell populations and k-NN distances for informing the vector calibration (Paper 3, Strategy 4.6.2)
• **WAND pruning**: computes safe Bayesian probability upper bounds for document pruning in top-k retrieval; provides tighter block-level bounds for BMW-style pruning (Section 6.2, Corollary 7.4.2)
• **Calibration metrics**: , , , and for evaluating probability quality, with bundling all metrics into a single diagnostic
• **Fusion debugger**: records every intermediate value through the full pipeline (likelihood, prior, posterior, fusion) for transparent inspection, document comparison, and crossover detection; supports hierarchical fusion tracing with AND/OR/NOT composition and gating trace fields
• **Multi-field search**: maintains separate BM25 indexes per field and fuses field-level probabilities via log-odds conjunction with configurable per-field weights
• **Search integration**: drop-in scorer wrapping bm25s that returns probabilities instead of raw scores

Adoption

• MTEB: included as a baseline retrieval model ( ) for the Massive Text Embedding Benchmark
• txtai: used for BM25 score normalization in hybrid search ( )
• Vespa.ai: adopted as an official sample application
• UQA: used as the scoring operator for probabilistic text retrieval and multi-signal fusion in the unified query algebra

Installation

To use the integrated search scorer (requires ):

Quick Start

• Converting BM25 Scores to Probabilities
• End-to-End Search with Probabilities
• Multi-Field Search
• Combining Multiple Signals
• Hybrid Text + Vector Search
• Vector Score Calibration
• Learning Fusion Weights from Data
• Attention-Based Fusion
• Multi-Head Attention Fusion
• Neural Score Calibration
• Temporal Adaptation
• WAND Pruning with Bayesian Upper Bounds
• Debugging the Fusion Pipeline
• Evaluating Calibration Quality
• Online Learning from User Feedback
• Training Modes

Benchmarks

BEIR Hybrid Search

Evaluated on 5 BEIR datasets using the retrieve-then-evaluate protocol (top-1000 per signal, union candidates, pytrec_eval). Dense encoder: all-MiniLM-L6-v2. BM25: k1=1.2, b=0.75, Lucene variant with Snowball English stemmer.

NDCG@10

| Method | ArguAna | FiQA | NFCorpus | SciDocs | SciFact | Average |
|---|---|---|---|---|---|---|
| BM25 | 36.16 | 25.32 | 31.85 | 15.65 | 67.91 | 35.38 |
| Dense | 36.98 | 36.87 | 31.59 | 21.64 | 64.51 | 38.32 |
| Convex | 40.03 | 37.10 | 35.61 | 19.65 | 73.38 | 41.15 |
| RRF | 39.61 | 36.85 | 34.43 | 20.09 | 71.43 | 40.48 |
| Bayesian-OR | 0.06 | 25.54 | 33.47 | 15.88 | 66.97 | 28.38 |
| Bayesian-LogOdds | 37.16 | 32.93 | 35.32 | 18.54 | 72.68 | 39.33 |
| Bayesian-LogOdds-Local | 39.63 | 37.20 | 34.10 | 1…
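
The score-to-probability transform and the probabilistic fusion operators described in the key capabilities can be illustrated with a small, dependency-free sketch. It is not the package's API: every function name and parameter here (`posterior`, `log_odds_pool`, `alpha`, `beta`, `base_rate`) is hypothetical. The sketch assumes the sigmoid likelihood has the form sigmoid(alpha * (score - beta)), that the three additive log-odds terms are the likelihood, the prior, and the base rate, and that the reliability-weighted fusion is a plain logarithmic opinion pool without the confidence scaling or gating the README mentions.

```python
import math
from typing import Optional, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p: float, eps: float = 1e-12) -> float:
    """Log-odds of p, clamped away from 0 and 1 so the result stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def posterior(score: float, alpha: float, beta: float,
              prior: float, base_rate: float = 0.5) -> float:
    """Hypothetical posterior: sum of three log-odds terms, mapped back to [0, 1].

    Assumed form: logit(posterior) = alpha * (score - beta)
                                     + logit(prior) + logit(base_rate).
    With base_rate = 0.5 the last term is zero and this reduces to
    Bayes' rule with a sigmoid likelihood and the given prior.
    """
    likelihood_log_odds = alpha * (score - beta)
    return sigmoid(likelihood_log_odds + logit(prior) + logit(base_rate))


def prob_and(probs: Sequence[float]) -> float:
    """Probabilistic AND under independence: product of the probabilities."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def prob_or(probs: Sequence[float]) -> float:
    """Probabilistic OR under independence: 1 minus the product of complements."""
    miss = 1.0
    for p in probs:
        miss *= 1.0 - p
    return 1.0 - miss


def prob_not(p: float) -> float:
    """Probabilistic NOT: the complement."""
    return 1.0 - p


def log_odds_pool(probs: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Logarithmic opinion pool: weighted sum of log-odds, then sigmoid.

    Weights default to uniform (1/n each); callers passing their own
    weights are expected to make them sum to 1.
    """
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    return sigmoid(sum(w * logit(p) for w, p in zip(weights, probs)))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the repository.
    p_text = posterior(score=3.0, alpha=1.5, beta=2.0, prior=0.3)
    p_vector = 0.6
    print(p_text, log_odds_pool([p_text, p_vector]))
```

Working in log-odds space keeps each contribution additive, which is why a lower base rate simply shifts every posterior downward by a constant amount before the sigmoid is applied.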