Skip to content

MDEV-39832 mhnsw: skip non-dirty nodes when invalidating the shared cache on commit#5173

Open
RalXYZ wants to merge 1 commit into
MariaDB:mainfrom
RalXYZ:mhnsw-skip-non-dirty-on-commit
Open

MDEV-39832 mhnsw: skip non-dirty nodes when invalidating the shared cache on commit#5173
RalXYZ wants to merge 1 commit into
MariaDB:mainfrom
RalXYZ:mhnsw-skip-non-dirty-on-commit

Conversation

@RalXYZ
Copy link
Copy Markdown

@RalXYZ RalXYZ commented Jun 3, 2026

Description

While benchmarking MariaDB's vector index against pgvector with VectorDBBench (https://github.com/zilliztech/VectorDBBench), I observed that MariaDB's query latency under read/write concurrency is significantly higher than pgvector's, and grows with the volume of data already inserted.

Benchmark: an HNSW index is created at table-creation time, vectors are then inserted continuously, and read bursts are issued at 20% / 50% / 80% of the total insert progress so that reads and writes run concurrently.

Root cause

In MHNSW_Trx::do_commit() (sql/vector_mhnsw.cc), the partial-invalidation path walks every entry in the committing transaction's node cache and clears the shared-cache copy's vec pointer:

for (FVectorNode &from : trx->get_cache())
  if (FVectorNode *node= ctx->find_node(from.gref()))
    node->vec= nullptr;
ctx->start= nullptr;

The trx-local cache, however, contains every node the transaction touched, including the many nodes traversed by search_layer and evaluated by select_neighbors during the insert that were never actually modified. The shared-cache copies of those read-only nodes are still consistent with on-disk data, but they get invalidated anyway, forcing concurrent and subsequent readers to reload them through FVectorNode::load().

There is in fact a pre-existing TODO at the same location:

// also, consider flushing only changed nodes (a flag in the node)

Proposed fix

Add a dirty bit to FVectorNode and set it on the two paths that actually mutate the node:

  • in FVectorNode::save() (covers both the newly-inserted target and every neighbor rewritten by update_second_degree_neighbors);
  • in mhnsw_invalidate() alongside deleted = true.

The partial-invalidation loop in do_commit then skips entries whose dirty is false. The flag fits in the existing trailing byte alongside stored and deleted, so FVectorNode's footprint is unchanged. The flag is set before any storage-engine call inside save(), so that a write path failing mid-flight still forces the shared cache to reload.

Results

On the 1536D-50K VectorDBBench Streaming run (with an HNSW index created at table-creation time):

  • Search latency p95 at 80% insert progress: ~100 ms → ~10 ms
  • recall and QPS are unchanged

The fix is local to sql/vector_mhnsw.cc and small (≈10 lines).

Attatchments

  1. Search latency p95 (under read/write concurrency) MariaDB vs pgvector
image
  1. Search latency p95 (under read/write concurrency) after this fix
image

…ache on commit

Add a dirty bit to FVectorNode, set it in FVectorNode::save() (which
covers the inserted target node and every neighbor rewritten by
update_second_degree_neighbors) and in mhnsw_invalidate() alongside
the deleted flag. The partial-invalidation loop in do_commit now skips
non-dirty entries. The pre-existing TODO ("consider flushing only
changed nodes (a flag in the node)") is removed.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a dirty flag to FVectorNode to track modified nodes. During transaction commit (MHNSW_Trx::do_commit), only nodes marked as dirty are invalidated in the shared cache, skipping read-only cache entries and improving performance. The dirty flag is set to true when saving a node or when invalidating/deleting a node. There are no review comments, so I have no additional feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Jun 4, 2026
Copy link
Copy Markdown
Member

@gkodinov gkodinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution! This is a preliminary review.

I believe the diff is clear, understandable and well executed. Approved from my end.

One thing I'd ask you to consider is thinking about a test case for it. Just please think about it: there might be some way. I won't insist on it, but it's a good thing to consider.

Or, maybe, consider at least some observability from the SQL console so that people trying to understand what's wrong can attribute this to cache misses (?). Again: just a thought.

I see that you are already in a dialogue with the final reviewer. This is great! Keep it going. His approval is needed before this can be merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements.

Development

Successfully merging this pull request may close these issues.

4 participants