Top-K Dynamic Pruning in Full-Text Search

Inverted index → BM25 → threshold θ → WAND / Block-Max WAND / MaxScore / BMM
A from-scratch study note on why you don't have to score every document.

1. How full-text search works: the inverted index

A user types “distributed database” into the search box, and you must find the most relevant documents out of a million. You can’t read each one, so you build an inverted index — storing, for each term, the documents it appears in:

"distributed"  →  [doc3, doc7, doc15, doc88, ...]   ← one posting list
"database"     →  [doc7, doc15, doc40, doc88, ...]

KEY CLARIFICATIONEvery document in a term’s posting list contains that term; conversely, if a document does not contain the term, it is simply not in that list. Different documents contain different terms, so a given document only appears in the lists of the terms it actually matches.

Example: doc3 only contains the word “system”, so it appears only in the “system” list — the “distributed” and “database” lists don’t contain it, and therefore can’t cover it.

Inside a posting list, entries are sorted by document id (docID) in ascending order. The sort lets you intersect lists like a zipper, and it’s exactly what later allows WAND to “skip forward”.

2. Scoring & ranking: BM25

To return the “top-k most relevant”, each matching document needs a relevance score. The standard one is BM25. No need to memorize the formula — keep the intuition:

The more times a term occurs in a document (term frequency, tf) → the more relevant → higher score
The more common a term is across the whole corpus (e.g. “the”, “data”) → the less valuable → lower weight (idf)

A document’s total score = the sum of the BM25 contributions of each query term in that document:

score(doc7) = BM25(distributed, doc7) + BM25(database, doc7)

A single term’s BM25(term, doc) depends on three quantities:

Quantity	Meaning	Property
`idf(term)`	how rare the term is across the corpus	fixed per term — one value for the whole corpus
`tf`	how many times the term occurs in this doc	varies per document
`dl`	length of this document (normalization)	varies per document

3. The core problem: scoring is expensive

Out of a million documents, hundreds of thousands may match at least one term. If you fully compute BM25 for every one of them and then sort to take the top-k, the vast majority of that work is wasted — those documents never make the list.

THE GOALCan we avoid scoring the documents that are doomed never to reach the top-k and skip them outright? That is exactly what “dynamic pruning” algorithms — WAND / Block-Max WAND / MaxScore / BMM — are for. They’re a core optimization inside full-text search engines.

4. The threshold θ: the pruning bar & cold start

During retrieval we maintain a top-k min-heap of the best documents so far.

DEFINITION OF θθ = the score of the k-th best document in the heap (the current minimum bar to make the list). Any document whose upper bound on total score can’t even reach θ can never enter the top-k, so it can be discarded without exact scoring.

Two different scores (don’t conflate them)

	How it’s obtained	Cost
True score (exact total)	sum of BM25 contributions — requires touching the index and evaluating the formula	expensive
Upper bound (optimistic estimate)	sum of the “max possible contribution” of the matched terms — pure addition	nearly free

Pruning logic: first run the cheap upper bound through a filter — if upper bound < θ, discard it (don’t even compute the true score); only when upper bound ≥ θ is it worth computing the true score.

Cold start: θ begins at 0

THE CHICKEN-AND-EGGInitially θ = 0 and the heap is empty → everything clears the bar → nothing can be pruned. So the first k documents you encounter are scored honestly to fill the heap; once it’s full, θ becomes the k-th score, and pruning only kicks in from document k+1 onward. As better documents push in, θ rises monotonically and more and more documents get skipped → you pay tuition early, then run faster and faster.

5. Term upper bounds: how and when they’re computed

CONCLUSIONAn upper bound = the maximum contribution score a term produces across all the documents it matches. It is computed once at index-build time by scanning the list, and stored; at query time it’s just read, never recomputed.

For “distributed”, idf is fixed; what makes the contribution vary across documents is tf and dl. Its contributions form a set of values, and the maximum is the upper bound:

distributed → doc3: 2.1   doc7: 5.0 ← highest   doc15: 0.8   doc88: 3.3 ...
            upper bound = 5.0   (just track a running max while building the posting list — cheap)

PITFALL (correctness, not just speed)Upper bounds can go stale after updates. If a newly inserted document pushes a term’s contribution past the old bound but the bound isn’t updated, the bound becomes too small → you wrongly prune a document that should have made the list → recall is incorrect. So an upper bound must always be a true upper bound (err high, never low), and must be maintained on incremental writes / compaction.

6. WAND: full pivot-skipping flow + trace

WAND (Weak AND) is a document-at-a-time dynamic-pruning algorithm. Each term’s posting list has a cursor pointing at the docID it currently sees; the list is sorted, so cursors only move forward.

Each round, 4 steps

① Sort: order the terms by their cursor’s current docID, ascending.

② Find the pivot: starting from the smallest, accumulate upper bounds until the running sum is ≥ θ. The term where you stop is the pivot term; its current docID is the pivot docID.

③ Skip everything before the pivot: documents before the pivot can only be covered by the terms with smaller cursors, whose upper bounds sum to < θ → hopeless → skip them all at once.

④ Is the smallest cursor at the pivot?

Yes (== pivot docID): the pivot is a candidate → compute its true score, try to insert into the heap, update θ → advance every cursor sitting on it by one (case A).
No (< pivot docID): seek one lagging term forward to ≥ pivot docID, leave the rest, and go back to ① (case B).

Termination: when the upper bounds of all terms sum to < θ (no pivot can be formed) → nothing left can qualify → done.

WHY THE PIVOT SKIP IS SAFEDocuments before the pivot can only be covered by terms whose cursors are still small (terms with large cursors haven’t scanned that far back), and those exact terms have upper bounds summing to < θ (otherwise the pivot would be earlier) → their true-score ceiling can’t reach the bar. The pivot is precisely “the first document that still has a chance of reaching θ”.

Full trace (concrete corpus)

8 documents total; lists built from which terms each document actually contains:

distributed → [8, 25, 40]        upper bound 5
database    → [8, 30, 45]        upper bound 4
system      → [3, 4, 8, 19, 33]  upper bound 1
Query "distributed database system",  θ = 6 (mid-retrieval, heap already full)

Round 1　cursors: distributed→8, database→8, system→3

① sort: system(3), distributed(8), database(8)
② pivot: system 1<6; +distributed=6 ≥6 → pivot docID = 8
③ docs before 8 (3,4,5..) can only be covered by "system" (distributed/database lists have nothing <8), bound 1<6 → skip all
④ smallest cursor system(3) ≠ 8 → case B: seek system to ≥8 → system→8 , back to ①

  re-sort: all three at 8
② pivot: distributed 5<6; +database=9 ≥6 → pivot docID = 8
④ smallest cursor == 8 == pivot → compute true score!
   doc8 matched by all three: 4.8+3.5+0.6 = 8.9 ≥ θ=6 → enters heap, θ rises to 7
   case A: the three cursors at 8 each advance by one → distributed→25, database→30, system→19

Round 2　θ=7, cursors: system→19, distributed→25, database→30

① sort: system(19), distributed(25), database(30)
② pivot: system 1<7; +distributed=6<7; +database=10 ≥7 → pivot docID = 30
③ skip everything before 30:
   · doc19 only in system list → at most 1 <7 → skip
   · doc25 only in distributed list → at most 5, but 5<θ=7 → also skipped! (once θ hit 7, distributed alone can't get in)
④ smallest cursor system(19) ≠ 30 → case B: seek system to ≥30 → system→33
   ... keep re-sorting and advancing until some document is matched by enough high-value terms to align on the pivot, triggering the next true-score computation

TWO OBSERVATIONS1) The cursors don’t crawl one-by-one — they jump forward, and documents in between that can’t reach the bar are skipped without ever computing a true score.

The higher θ goes, the more aggressive the skipping — at θ=6, distributed(5) could still “pad” a score and doc25 had a chance; at θ=7, doc25 (covered only by distributed) is hopeless. This is the “runs faster and faster” effect.

The two cursor-advance rules, summarized

Case	Trigger	Action
A	cursors aligned on the pivot, true score computed	advance every cursor sitting on that docID to the next entry in its own list
B	smallest cursor < pivot (not yet aligned)	seek one lagging term to ≥ pivot, leave the rest

7. Why a direct seek is allowed

THE PERMISSION SLIPThe pivot computation in step ② has already mathematically proven that every document before the pivot can’t reach θ. Since they are guaranteed losers, having a lagging term sit on them is worthless → just jump past them. “Proven hopeless” = “allowed to skip”.

Example: θ=7, pivot=30, system’s cursor at 19. doc19 (and any system posting between 19 and 29) are condemned documents (covered only by system, at most 1 point). So seek system from 19 directly to the first docID ≥30 in its list (33), without ever touching the condemned ones in between.

Why ≥30, not exactly 30: system’s list might not contain doc30; what we want is to advance it to the pivot line or beyond, so we take the first entry ≥30.
Why advance system: the lagging terms (cursor < pivot) are system(19) and distributed(25); a common policy advances the one with the smallest cursor first (it lags most and skips the most condemned docs).
Why it’s cheap physically: the posting list’s block / skip structure — each block records its “max docID”, so if the next block’s max is <30 the whole block is skipped, decompression and all. So 19→33 is a single leap, not a one-by-one walk.

8. What a block is & Block-Max WAND

The “block” wasn’t invented by BMW — inverted indexes already chop a sorted list into blocks for compression and skipping. BMW just hangs one extra number on each block.

"distributed" posting list (sorted by docID):
┌──── block A ────┬──── block B ────┬─ block C ─
│ 128 docIDs      │ 128 docIDs      │ ...
└─────────────────┴─────────────────┴───────────

Definition of a block: a contiguous run of documents in the sorted list, usually a fixed count (commonly 64/128; Lucene defaults to 128).
Why blocks exist anyway: ① compression (adjacent docIDs differ by little, so a block is delta-encoded + integer-compressed); ② skipping (each block header stores its “max docID”, so if that’s below the target the whole block is skipped).

WHAT BMW ADDSIt stores one more number in each block header: the maximum contribution score within that block. So when judging whether something has a chance, it uses the block-local max (tight) instead of the global upper bound (loose):

"distributed":  global bound 5.0       ← WAND uses this (loose)
  [blockA max 1.0][blockB max 5.0][blockC max 0.6]...   ← BMW stores these (tight)

Tighter bounds → the pivot is pushed further back → you can skip whole blocks (shallow skip), and only descend into a block document-by-document (deep skip) when it actually has a chance. WAND is the skeleton; BMW only swaps the “estimate the ceiling” step from global to block-level.

A TUNABLE TRADE-OFFSmaller blocks → tighter local bounds → more aggressive pruning, but more blocks means more metadata/skip overhead and worse compression; larger blocks are the opposite.

9. MaxScore: the dynamic essential / non-essential split

WAND and MaxScore are the two schools of dynamic pruning — same problem, different approach. MaxScore’s killer move: kick the terms whose bounds can’t reach the bar entirely out of the “driving” traversal.

For “distributed database system” with bounds distributed=5, database=4, system=1: sort the terms by upper bound ascending and accumulate from the smallest. The prefix whose running sum is still < θ are the “non-essential” terms; the rest are “essential” (only essential terms drive the traversal; non-essential terms are only consulted by lookup when needed).

THE DYNAMIC PART (most easily missed)The split depends on θ, and θ rises monotonically from 0 → so the split is recomputed as θ grows, the boundary only moves right, and the essential set shrinks monotonically. At cold start (low θ) nothing can be pruned (essential = all terms); the higher θ goes, the more long lists drop out of driving.

Current θ	Non-essential (not driving)	Essential (drives traversal)
0 (initial)	none	all (no pruning)
4.5	system, architecture*	database, distributed
6	system, (low-weight frequent terms)	database, distributed
9	system, database, …	only distributed

WHY MAXSCORE WINS ON MANY-TERM / HIGH-FREQUENCY-LOW-WEIGHT QUERIESFrequent terms (like “the”, “system”) have low idf → small upper bound, yet very long lists. MaxScore can kick these long-and-cheap lists entirely out of the driving traversal, saving the cost of “walking millions of useless postings”.

When there are few terms and all matter: there’s little to drop, and MaxScore’s split is just extra bookkeeping → WAND is simpler and more direct.

10. BMM = Block-Max MaxScore

MaxScore splits essential / non-essential using the global bound (one cut for the whole query). BMM does the split using each block’s local max, so “which terms count as non-essential” can change segment by segment: a term that’s “half-essential” globally can locally degrade to non-essential in the blocks where its local max is low, letting you skip more locally and dynamically.

THE SYMMETRYBMM is to MaxScore what BMW is to WAND — both refine the bound from “one global value” to “one per block”, pruning more aggressively.

11. The four algorithms side by side

	Base algorithm	With block-level bounds
School 1: pivot skipping (all lists together, document-at-a-time)	WAND	BMW (Block-Max WAND)
School 2: essential / non-essential split (only essential terms drive)	MaxScore	BMM (Block-Max MaxScore)

In common: all rely on precomputed term upper bounds + the heap threshold θ to skip documents doomed never to reach the top-k, saving exact scoring.
Difference: the WAND family does “all lists together, pivot-jump”; the MaxScore family does “split terms into essential / non-essential, only essential ones drive”.
Block-level variants (BMW/BMM): refine the bound from global to per-block, skipping harder.
Choosing: many terms / high-frequency-low-weight terms → MaxScore family; few terms that all matter → WAND family is more direct. In practice both are implemented and chosen by query shape.

12. Self-check questions

What is θ?

The score of the k-th best document in the current top-k heap — the bar for "can this make the list" — and it rises monotonically through retrieval.

True score vs. upper bound?

The true score requires touching the index and evaluating the formula (expensive); the upper bound is the sum of matched terms' max contributions (nearly free). Filter on the bound first — anything with bound<θ never gets a true score.

Why can WAND skip everything before the pivot?

Those documents can only be covered by the "still-small-cursor" terms, whose bounds sum to < θ, so their true-score ceiling can't reach the bar.

Why is the seek legitimate?

Everything before the pivot has been proven to be a loser, so a lagging term sitting on them is pointless; use the skip structure to leap to ≥pivot.

How is BMW stronger than WAND?

Block-local bounds are tighter than the global bound → pivot pushed back → whole blocks can be skipped.

Why can't MaxScore prune at cold start?

With low θ the running sum reaches θ almost immediately, so no term can be made non-essential; you must first fill the heap and let θ rise before the essential set shrinks.

Why does MaxScore win on long, many-term queries?

Frequent terms have long, cheap lists that MaxScore can kick out of driving entirely; with few terms that all matter, WAND is simpler.

What happens if an upper bound is maintained wrong?

A bound that's too low wrongly prunes documents that should make the list → incorrect recall; so incremental writes / compaction must keep it a true upper bound (err high, never low).

One-line thread: inverted index finds matches → BM25 scores relevance for top-k → scoring all is too costly → θ + term upper bounds skip the hopeless ones (WAND / MaxScore) → Block-Max refines bounds to per-block, skipping harder (BMW / BMM).

1. How full-text search works: the inverted index#

2. Scoring & ranking: BM25#

3. The core problem: scoring is expensive#

4. The threshold θ: the pruning bar & cold start#

Two different scores (don’t conflate them)#

Cold start: θ begins at 0#

5. Term upper bounds: how and when they’re computed#

6. WAND: full pivot-skipping flow + trace#

Each round, 4 steps#

Full trace (concrete corpus)#

The two cursor-advance rules, summarized#

7. Why a direct seek is allowed#

8. What a block is & Block-Max WAND#

9. MaxScore: the dynamic essential / non-essential split#

10. BMM = Block-Max MaxScore#

11. The four algorithms side by side#

12. Self-check questions#

1. How full-text search works: the inverted index

2. Scoring & ranking: BM25

3. The core problem: scoring is expensive

4. The threshold θ: the pruning bar & cold start

Two different scores (don’t conflate them)

Cold start: θ begins at 0

5. Term upper bounds: how and when they’re computed

6. WAND: full pivot-skipping flow + trace

Each round, 4 steps

Full trace (concrete corpus)

The two cursor-advance rules, summarized

7. Why a direct seek is allowed

8. What a block is & Block-Max WAND

9. MaxScore: the dynamic essential / non-essential split

10. BMM = Block-Max MaxScore

11. The four algorithms side by side

12. Self-check questions