Skip to content

Add SQLite memory lifecycle hashing, versioning, and forget support#199

Open
RYB-404 wants to merge 3 commits into
XortexAI:mainfrom
RYB-404:ryb-xmem-memory-lifecycle-166
Open

Add SQLite memory lifecycle hashing, versioning, and forget support#199
RYB-404 wants to merge 3 commits into
XortexAI:mainfrom
RYB-404:ryb-xmem-memory-lifecycle-166

Conversation

@RYB-404
Copy link
Copy Markdown

@RYB-404 RYB-404 commented May 23, 2026

Summary

  • Adds reusable memory lifecycle metadata helpers for normalized SHA-256 content hashes, version lineage, current-record status, and soft-forget audit fields.
  • Updates the SQLite vector store to reuse existing current memories with the same normalized hash instead of inserting duplicates.
  • Adds explicit SQLite helpers for versioning (add_version) and soft/hard forget (forget) while keeping superseded or forgotten memories out of semantic and metadata retrieval.

Tests

  • python -m pytest tests\unit\test_memory_lifecycle.py -q
  • python -m pytest tests\e2e\test_memory_flow.py -q
  • python -m ruff check src\storage\memory_lifecycle.py src\storage\local.py tests\unit\test_memory_lifecycle.py

Addresses #166.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a memory lifecycle management system, including duplicate detection via content hashing, versioning, and soft-forgetting capabilities. The SQLiteVectorStore was updated to support these features, and a new memory_lifecycle utility module was added. Feedback focuses on performance optimizations for database queries, specifically suggesting the use of SQLite's JSON functions for hash lookups and batching updates in the forget method. Additionally, it was noted that the versioning process should be made atomic to prevent data inconsistency.

Comment thread src/storage/local.py Outdated
Comment on lines +333 to +342
def _find_current_by_hash(self, content_hash: str) -> Optional[str]:
rows = self._conn.execute(
"SELECT id, metadata FROM xmem_vectors WHERE namespace = ?",
(self._namespace,),
).fetchall()
for row in rows:
meta = json.loads(row["metadata"] or "{}")
if meta.get(CONTENT_HASH_KEY) == content_hash and is_retrievable_memory(meta):
return row["id"]
return None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of _find_current_by_hash is highly inefficient. It fetches every single record in the namespace and parses their JSON metadata in Python to find a match. This results in $O(N)$ complexity per call, which becomes $O(M \times N)$ during batch add operations. As the database grows, this will cause severe performance degradation.

You should leverage SQLite's built-in JSON functions to perform the filtering at the database level. This is significantly faster and avoids loading unnecessary data into memory.

Suggested change
def _find_current_by_hash(self, content_hash: str) -> Optional[str]:
rows = self._conn.execute(
"SELECT id, metadata FROM xmem_vectors WHERE namespace = ?",
(self._namespace,),
).fetchall()
for row in rows:
meta = json.loads(row["metadata"] or "{}")
if meta.get(CONTENT_HASH_KEY) == content_hash and is_retrievable_memory(meta):
return row["id"]
return None
def _find_current_by_hash(self, content_hash: str) -> Optional[str]:
row = self._conn.execute(
f"SELECT id FROM xmem_vectors "
f"WHERE namespace = ? "
f"AND json_extract(metadata, '$.{CONTENT_HASH_KEY}') = ? "
f"AND json_extract(metadata, '$.{IS_CURRENT_KEY}') IS NOT 0 "
f"AND json_extract(metadata, '$.{FORGOTTEN_AT_KEY}') IS NULL "
f"LIMIT 1",
(self._namespace, content_hash),
).fetchone()
return row["id"] if row else None

Comment thread src/storage/local.py Outdated
Comment on lines +274 to +285
for vec_id in ids:
row = self._conn.execute(
"SELECT metadata FROM xmem_vectors WHERE namespace = ? AND id = ?",
(self._namespace, vec_id),
).fetchone()
if not row:
continue
meta = json.loads(row["metadata"] or "{}")
meta[IS_CURRENT_KEY] = False
meta[FORGOTTEN_AT_KEY] = utc_now_iso()
meta[FORGET_REASON_KEY] = reason
self.update(vec_id, metadata=meta)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The forget method is inefficient for multiple IDs. It performs a SELECT and then calls self.update, which internally performs another SELECT and a COMMIT for every single ID. For a list of $N$ IDs, this results in $3N$ database operations and $N$ commits.

It is better to fetch the metadata for all IDs in a single query, process the updates in Python, and then use executemany to update the database in a single transaction.

        if not ids:
            return True
        placeholders = ",".join("?" for _ in ids)
        rows = self._conn.execute(
            f"SELECT id, metadata FROM xmem_vectors WHERE namespace = ? AND id IN ({placeholders})",
            (self._namespace, *ids),
        ).fetchall()

        now = utc_now_iso()
        updates = []
        for row in rows:
            meta = json.loads(row["metadata"] or "{}")
            meta[IS_CURRENT_KEY] = False
            meta[FORGOTTEN_AT_KEY] = now
            meta[FORGET_REASON_KEY] = reason
            updates.append((json.dumps(meta), self._namespace, row["id"]))

        if updates:
            self._conn.executemany(
                "UPDATE xmem_vectors SET metadata = ?, updated_at = CURRENT_TIMESTAMP "
                "WHERE namespace = ? AND id = ?",
                updates
            )
            self._conn.commit()

Comment thread src/storage/local.py Outdated
Comment on lines +259 to +261
parent_meta[IS_CURRENT_KEY] = False
self.update(parent_id, metadata=parent_meta)
self.add([text], [embedding], ids=[new_id], metadata=[new_meta])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The versioning process is not atomic. The parent record is updated to is_current=False in one call, and the new version is added in another. If the second call fails (e.g., due to a connection issue or constraint violation) or the process is interrupted, the memory will be effectively 'lost' from retrieval because the parent is no longer marked as current, but no successor was successfully created. These operations should be wrapped in a single transaction to ensure data consistency.

@RYB-404 RYB-404 force-pushed the ryb-xmem-memory-lifecycle-166 branch from e0b0a55 to e006a29 Compare May 23, 2026 05:48
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR adds memory lifecycle management to SQLiteVectorStore: SHA-256 content-hash deduplication on add(), a versioning helper (add_version) that marks parents as non-current, and a soft-forget path (forget) that stamps forgotten_at and keeps audit history. A new memory_lifecycle.py module centralises the hash, metadata construction, and retrievability check used by all three code paths.

  • add() now calls build_lifecycle_metadata, which forcefully assigns all lifecycle fields so callers cannot inject stale is_current/forgotten_at values through metadata; dedup is scoped by well-known tenant/user keys via _find_current_by_hash.
  • add_version guards against forgotten or superseded parents via is_retrievable_memory before proceeding, and marks the parent non-current atomically with the child insert.
  • update() recalculates the content hash and blocks collisions with other current records, but never calls is_retrievable_memory on the target and does not re-pin lifecycle fields after current_meta.update(metadata), leaving a path to silently modify or restore forgotten records.

Confidence Score: 3/5

Not safe to merge as-is: the update() method bypasses the lifecycle contract that the rest of this PR carefully enforces, allowing forgotten or superseded records to be silently modified or restored.

The add/add_version/forget paths are well-guarded and consistent. However, update() neither checks the retrievability of the target record nor re-pins lifecycle fields after merging caller metadata — a caller can pass {"forgotten_at": None, "is_current": True} to restore a forgotten memory without any audit trail, directly contradicting the guarantee that forget() is supposed to provide.

src/storage/local.py — specifically the update() method, which is the only write path that does not call build_lifecycle_metadata or is_retrievable_memory.

Important Files Changed

Filename Overview
src/storage/local.py Adds dedup, add_version, and forget support to SQLiteVectorStore; update() lacks a lifecycle guard allowing modification of forgotten/superseded records and silent override of lifecycle fields through the metadata parameter.
src/storage/memory_lifecycle.py New module providing SHA-256 content hashing, lifecycle metadata construction, and retrievability check; build_lifecycle_metadata forcefully assigns all lifecycle fields, preventing caller overrides in the add() path.
tests/unit/test_memory_lifecycle.py Good coverage of the happy path, hash dedup scoping, versioning, forgotten-parent guard, and soft-forget; no test exercises update() with lifecycle field overrides or with a non-current target record.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[add / add_version / update / forget] --> B{Which method?}

    B --> C[add]
    C --> C1[build_lifecycle_metadata\nforces all lifecycle fields]
    C1 --> C2[_find_current_by_hash\nscoped by namespace + dedup keys]
    C2 -->|duplicate found| C3[Return existing ID]
    C2 -->|no duplicate| C4[INSERT new record\nis_current=True]

    B --> D[add_version]
    D --> D1[get parent record]
    D1 --> D2{is_retrievable_memory?}
    D2 -->|No| D3[Return None]
    D2 -->|Yes| D4[build_lifecycle_metadata\nnext_version, parent_memory_id]
    D4 --> D5[_find_current_by_hash]
    D5 -->|existing == parent_id| D6[Return parent unchanged]
    D5 -->|other existing| D7[Mark parent is_current=False\nReturn existing ID]
    D5 -->|no duplicate| D8[Mark parent is_current=False\nINSERT new child record]

    B --> E[forget]
    E --> E1{hard_delete?}
    E1 -->|Yes| E2[delete rows]
    E1 -->|No| E3[Stamp is_current=False\nforgotten_at + reason]

    B --> F[update]
    F --> F1[Fetch row — no lifecycle guard]
    F1 --> F2[current_meta.update — lifecycle fields can be overridden]
    F2 --> F3[Recalculate CONTENT_HASH_KEY only]
    F3 --> F4[_find_current_by_hash collision check]
    F4 -->|collision with other record| F5[Return False]
    F4 -->|no collision| F6[UPDATE record]
Loading

Comments Outside Diff (1)

  1. src/storage/local.py, line 190-232 (link)

    P1 update() has no lifecycle guard and exposes lifecycle fields to caller override

    update() never calls is_retrievable_memory() on the fetched record, so it will happily mutate the content and embedding of a soft-forgotten or superseded memory. Worse, current_meta.update(metadata or {}) runs before the lifecycle fields are reconciled, so a caller can pass metadata={"forgotten_at": None, "is_current": True} to silently restore a forgotten record without going through forget()'s audit trail. Only CONTENT_HASH_KEY is re-pinned after the merge; every other lifecycle field (IS_CURRENT_KEY, FORGOTTEN_AT_KEY, FORGET_REASON_KEY, PARENT_MEMORY_ID_KEY, VERSION_KEY) can be freely overwritten. This creates an inconsistency with add(), which calls build_lifecycle_metadata and overwrites all lifecycle fields unconditionally.

    Fix in Cursor Fix in Codex Fix in Claude Code

Fix All in Cursor Fix All in Codex Fix All in Claude Code

Reviews (3): Last reviewed commit: "Prevent memory lifecycle hash collisions" | Re-trigger Greptile

Comment thread src/storage/local.py Outdated
Comment thread src/storage/local.py Outdated
Comment thread src/storage/local.py Outdated
Comment thread src/storage/memory_lifecycle.py
@RYB-404 RYB-404 force-pushed the ryb-xmem-memory-lifecycle-166 branch from e006a29 to 2601823 Compare May 23, 2026 11:51
@RYB-404
Copy link
Copy Markdown
Author

RYB-404 commented May 23, 2026

Addressed the latest bot review points in the pushed update:

  • scoped hash dedup by tenant/user-style metadata (user_id, tenant_id, org_id, workspace_id, project_id) to avoid cross-user memory reuse in shared namespaces
  • tightened _find_current_by_hash to require explicit is_current = true
  • made duplicate-version handling supersede the parent when the successor already exists
  • lifecycle fields are now owned by build_lifecycle_metadata instead of caller metadata overrides
  • added regression coverage for scoped dedup, lifecycle override protection, and duplicate-version superseding

Re-ran:

  • python -m pytest tests\unit\test_memory_lifecycle.py -q
  • python -m pytest tests\e2e\test_memory_flow.py -q
  • python -m ruff check src\storage\memory_lifecycle.py src\storage\local.py tests\unit\test_memory_lifecycle.py

Comment thread src/storage/local.py
Copy link
Copy Markdown
Author

@RYB-404 RYB-404 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed updates for the latest lifecycle review findings:

  • add_version() now refuses forgotten or superseded parents instead of creating a new current child from inactive history.
  • update() now rejects same-scope current hash collisions instead of leaving two current records with the same lifecycle hash.
  • Added regression coverage for forgotten/superseded version parents and update hash collisions.

Re-ran:

  • python -m pytest tests\unit\test_memory_lifecycle.py -q
  • python -m pytest tests\e2e\test_memory_flow.py -q
  • python -m ruff check src\storage\local.py tests\unit\test_memory_lifecycle.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant