fix(metacat): handle unmapped entity tokens and training crashes#399
mart-r merged 4 commits into CogStack:main
Conversation
When an entity's character offsets fall past the end of the tokenised text (or into a whitespace-only region), the token-mapping loop in `prepare_document` produces an empty `ctoken_idx` list. This causes:

1. `UnboundLocalError` on `ind` at line 739 — the loop variable is never assigned because `offset_mapping[last_ind:]` is empty or no pair matches the entity span.
2. `KeyError` in `_set_meta_anns` — skipped entities are absent from `ent_id2ind`, but the annotation loop does an unconditional lookup.

Fix: skip entities with empty token mappings in `prepare_document`, and guard the dict lookup in `_set_meta_anns`.
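A minimal, self-contained sketch of the skip guard (the function names `map_entity_tokens` and `prepare_entities` are illustrative, not the actual medcat code; the `offset_mapping` format follows the Hugging Face fast-tokenizer convention of `(char_start, char_end)` pairs):

```python
def map_entity_tokens(offset_mapping, ent_start, ent_end):
    """Return indices of tokens whose character spans overlap the entity,
    or an empty list when the entity falls outside the tokenised text."""
    ctoken_idx = []
    for ind, (tok_start, tok_end) in enumerate(offset_mapping):
        # Two half-open spans overlap iff each starts before the other ends
        if tok_start < ent_end and ent_start < tok_end:
            ctoken_idx.append(ind)
    return ctoken_idx


def prepare_entities(offset_mapping, entities):
    """Keep only entities that map to at least one token.

    The fix: entities with an empty ctoken_idx are skipped instead of
    falling through to code that assumes the loop variable was assigned.
    """
    prepared = {}
    for ent_id, (start, end) in entities.items():
        ctoken_idx = map_entity_tokens(offset_mapping, start, end)
        if not ctoken_idx:
            continue  # previously this case crashed downstream
        prepared[ent_id] = ctoken_idx
    return prepared
```

An entity whose span lies entirely past the last token (or inside whitespace that produced no tokens) simply disappears from the prepared output, which is what the guarded `ent_id2ind` lookup in `_set_meta_anns` then has to tolerate.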
`_prepare_from_json_loop` in `data_utils.py` has the same token-mapping pattern as `prepare_document` in `meta_cat.py` (fixed in previous commit), but in the `train_raw` → `prepare_from_json` code path. When an annotation's character offsets don't align with any token in the offset_mapping, `ctoken_idx` is empty and `ctoken_idx[0]` raises `IndexError`. Fix: skip annotations with empty token mappings, consistent with the inference path fix.
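The same guard in the training-data path can be sketched like this (the function name and annotation fields are illustrative, not the actual `_prepare_from_json_loop` code):

```python
def collect_samples(offset_mapping, annotations):
    """Build (first_token, last_token) index pairs for each annotation,
    skipping annotations that map to no tokens.

    Without the guard, ctoken_idx[0] raises IndexError whenever an
    annotation's character span aligns with no token in offset_mapping.
    """
    samples = []
    for ann in annotations:
        ctoken_idx = [
            ind for ind, (tok_start, tok_end) in enumerate(offset_mapping)
            if tok_start < ann["end"] and ann["start"] < tok_end
        ]
        if not ctoken_idx:
            continue  # previously: IndexError on ctoken_idx[0]
        samples.append((ctoken_idx[0], ctoken_idx[-1]))
    return samples
```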
`_train_meta_cat` calls `addon.mc.train_raw(data)` without providing `save_dir_path`. When the MetaCAT config has `auto_save_model=True` (the default), `train_raw` raises an exception because it needs a directory to save the best checkpoint during training. This crash occurs after all NER supervised training epochs have completed (potentially hours of work), since `_train_addons` is called at the very end of `train_supervised_raw`. Fix: create a temporary directory for the MetaCAT save checkpoint when `auto_save_model` is enabled, and pass it to `train_raw`.
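The shape of the fix can be sketched as follows (the wrapper function `train_meta_cat_addon` is hypothetical; only `train_raw(data, save_dir_path=...)` is named in the PR):

```python
import tempfile
from typing import cast


def train_meta_cat_addon(addon, data):
    """Run MetaCAT training inside a temporary directory so that
    auto_save_model always has somewhere to write checkpoints.

    The best weights are loaded back into memory before train_raw
    returns, so the directory can safely be deleted on exit.
    """
    with tempfile.TemporaryDirectory() as save_dir:
        # cast works around a mypy quirk; the types are compatible
        addon.mc.train_raw(cast(dict, data), save_dir_path=save_dir)
```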
mart-r
left a comment
Thank you so much for identifying and fixing the issue! We don't get many outside contributions so seeing one is such a treat!
With that said, I've left a few comments on the temporary file handling in MetaCAT training. The previous approach would keep saving these to new locations if you went through multiple datasets and never clean them up. This could become problematic.
PS:
I would also like the MetaCAT process to produce debug logging output during inference. But since there's currently none, we'll need to look at that separately.
medcat-v2/medcat/trainer.py
Outdated
```python
# train_raw requires save_dir_path when auto_save_model is True
save_dir = None
if cnf.train.auto_save_model:
    import tempfile
```
Move import to the top of the module. There's no reason for this to be a dynamic import.
Done — moved to module-level imports in 5796e13.
medcat-v2/medcat/trainer.py
Outdated
```python
save_dir = None
if cnf.train.auto_save_model:
    import tempfile
    save_dir = tempfile.mkdtemp(
```
This method creates a persistent temporary data path that would need to be cleared when done using it.
I would prefer a context manager. We can probably just always use one — when the save directory isn't needed, nothing is written to it. The overhead of creating (and removing) this folder even when it's not used at training time is negligible (it's a one-time operation per MetaCAT per dataset).
So I would prefer something along the lines of:
```python
with tempfile.TemporaryDirectory() as save_dir:
    # NOTE: this is a mypy quirk - the types are compatible
    addon.mc.train_raw(cast(dict, data), save_dir_path=save_dir)
```
Good call — switched to TemporaryDirectory context manager in 5796e13. The best weights are loaded back into memory before train_raw returns, so the directory can safely be cleaned up on exit.
Address review feedback: move tempfile import to module top and use TemporaryDirectory context manager instead of mkdtemp to avoid leaking orphaned temp directories across multiple training runs.
Summary
Fixes three bugs in MetaCAT supervised training that cause crashes during `train_supervised_raw(train_addons=True)`:

1. `UnboundLocalError` and `KeyError` during MetaCAT inference within NER training epochs
2. `prepare_from_json` → `_prepare_from_json_loop` causes `IndexError` during `train_raw`
3. `_train_meta_cat` missing `save_dir_path` — `train_raw` requires a save directory when `auto_save_model=True` (the default config), but `_train_meta_cat` never provides one. This crashes after all NER epochs complete, losing hours of training progress.

Reproduction
All three bugs are triggered by calling `train_supervised_raw(data, train_addons=True)` with MetaCAT addons registered. The token-mapping bugs (1 & 2) occur when entity character spans fall at document boundaries or in whitespace-only regions where no BPE tokens are produced. The `save_dir_path` bug (3) occurs unconditionally with the default MetaCAT config.

Changes
Commit 1: `meta_cat.py` — inference path

- `prepare_document`: skip entities with empty `ctoken_idx` (prevents `UnboundLocalError` on `ind`)
- `_set_meta_anns`: guard the `ent_id2ind` lookup for skipped entities (prevents `KeyError`)

Commit 2: `data_utils.py` — training data path

- `_prepare_from_json_loop`: skip annotations with empty `ctoken_idx` (prevents `IndexError` on `ctoken_idx[0]`)

Commit 3: `trainer.py` — MetaCAT addon training

- `_train_meta_cat`: create a temporary directory and pass it as `save_dir_path` to `train_raw` when `auto_save_model` is enabled

Test plan
- `train_supervised_raw(train_addons=True)` completes without crash on documents containing entities near text boundaries
- `train_raw` receives a valid `save_dir_path` and saves the best checkpoint during training
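The `_set_meta_anns` guard from commit 1 also lends itself to an isolated regression test; this sketch uses a hypothetical, simplified signature to show the guarded lookup the fix describes:

```python
def set_meta_anns(entities, ent_id2ind, predictions):
    """Attach meta-annotation predictions to entities.

    Entities skipped during preparation have no entry in ent_id2ind,
    so look them up defensively instead of crashing with KeyError.
    """
    annotated = {}
    for ent_id in entities:
        ind = ent_id2ind.get(ent_id)
        if ind is None:
            continue  # entity was skipped at prepare time
        annotated[ent_id] = predictions[ind]
    return annotated
```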