fix(metacat): handle unmapped entity tokens and training crashes#399
mart-r merged 4 commits into CogStack:main
Conversation
When an entity's character offsets fall past the end of the tokenised text (or into a whitespace-only region), the token-mapping loop in `prepare_document` produces an empty `ctoken_idx` list. This causes:

1. `UnboundLocalError` on `ind` at line 739 — the loop variable is never assigned because `offset_mapping[last_ind:]` is empty or no pair matches the entity span.
2. `KeyError` in `_set_meta_anns` — skipped entities are absent from `ent_id2ind`, but the annotation loop does an unconditional lookup.

Fix: skip entities with empty token mappings in `prepare_document`, and guard the dict lookup in `_set_meta_anns`.
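A minimal, self-contained sketch of the skip guard (the function names `map_entity_tokens` and `prepare_entities` are illustrative, not the actual medcat code; the `offset_mapping` format follows the Hugging Face fast-tokenizer convention of `(char_start, char_end)` pairs):

```python
def map_entity_tokens(offset_mapping, ent_start, ent_end):
    """Return indices of tokens whose character spans overlap the entity,
    or an empty list when the entity falls outside the tokenised text."""
    ctoken_idx = []
    for ind, (tok_start, tok_end) in enumerate(offset_mapping):
        # Two half-open spans overlap iff each starts before the other ends
        if tok_start < ent_end and ent_start < tok_end:
            ctoken_idx.append(ind)
    return ctoken_idx


def prepare_entities(offset_mapping, entities):
    """Keep only entities that map to at least one token.

    The fix: entities with an empty ctoken_idx are skipped instead of
    falling through to code that assumes the loop variable was assigned.
    """
    prepared = {}
    for ent_id, (start, end) in entities.items():
        ctoken_idx = map_entity_tokens(offset_mapping, start, end)
        if not ctoken_idx:
            continue  # previously this case crashed downstream
        prepared[ent_id] = ctoken_idx
    return prepared
```

An entity whose span lies entirely past the last token (or inside whitespace that produced no tokens) simply disappears from the prepared output, which is what the guarded `ent_id2ind` lookup in `_set_meta_anns` then has to tolerate.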
`_prepare_from_json_loop` in `data_utils.py` has the same token-mapping pattern as `prepare_document` in `meta_cat.py` (fixed in previous commit), but in the `train_raw` → `prepare_from_json` code path. When an annotation's character offsets don't align with any token in the offset_mapping, `ctoken_idx` is empty and `ctoken_idx[0]` raises `IndexError`. Fix: skip annotations with empty token mappings, consistent with the inference path fix.
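The same guard in the training-data path can be sketched like this (the function name and annotation fields are illustrative, not the actual `_prepare_from_json_loop` code):

```python
def collect_samples(offset_mapping, annotations):
    """Build (first_token, last_token) index pairs for each annotation,
    skipping annotations that map to no tokens.

    Without the guard, ctoken_idx[0] raises IndexError whenever an
    annotation's character span aligns with no token in offset_mapping.
    """
    samples = []
    for ann in annotations:
        ctoken_idx = [
            ind for ind, (tok_start, tok_end) in enumerate(offset_mapping)
            if tok_start < ann["end"] and ann["start"] < tok_end
        ]
        if not ctoken_idx:
            continue  # previously: IndexError on ctoken_idx[0]
        samples.append((ctoken_idx[0], ctoken_idx[-1]))
    return samples
```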
`_train_meta_cat` calls `addon.mc.train_raw(data)` without providing `save_dir_path`. When the MetaCAT config has `auto_save_model=True` (the default), `train_raw` raises an exception because it needs a directory to save the best checkpoint during training. This crash occurs after all NER supervised training epochs have completed (potentially hours of work), since `_train_addons` is called at the very end of `train_supervised_raw`. Fix: create a temporary directory for the MetaCAT save checkpoint when `auto_save_model` is enabled, and pass it to `train_raw`.
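The shape of the fix can be sketched as follows (the wrapper function `train_meta_cat_addon` is hypothetical; only `train_raw(data, save_dir_path=...)` is named in the PR):

```python
import tempfile
from typing import cast


def train_meta_cat_addon(addon, data):
    """Run MetaCAT training inside a temporary directory so that
    auto_save_model always has somewhere to write checkpoints.

    The best weights are loaded back into memory before train_raw
    returns, so the directory can safely be deleted on exit.
    """
    with tempfile.TemporaryDirectory() as save_dir:
        # cast works around a mypy quirk; the types are compatible
        addon.mc.train_raw(cast(dict, data), save_dir_path=save_dir)
```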
mart-r
left a comment
Thank you so much for identifying and fixing the issue! We don't get many outside contributions so seeing one is such a treat!
With that said, I've left a few comments on the temporary file handling in MetaCAT training. The previous approach would keep saving these to new locations if you went through multiple datasets and never clean them up. This could become problematic.
PS:
I would also like the MetaCAT process to produce debug logging output during inference. But since there's currently none, we'll need to look at that separately.
medcat-v2/medcat/trainer.py
Outdated
```python
# train_raw requires save_dir_path when auto_save_model is True
save_dir = None
if cnf.train.auto_save_model:
    import tempfile
```
Move import to the top of the module. There's no reason for this to be a dynamic import.
Done — moved to module-level imports in 5796e13.
medcat-v2/medcat/trainer.py
Outdated
```python
save_dir = None
if cnf.train.auto_save_model:
    import tempfile
    save_dir = tempfile.mkdtemp(
```
This method creates a persistent temporary data path that would need to be cleared when done using it.
I would prefer a context manager. We can probably just always use one — when the save directory isn't needed, nothing is written to it. The overhead of creating (and removing) this folder even when it's not used at training time is negligible (it's a one-time operation per MetaCAT per dataset).
So I would prefer something along the lines of:
```python
with tempfile.TemporaryDirectory() as save_dir:
    # NOTE: this is a mypy quirk - the types are compatible
    addon.mc.train_raw(cast(dict, data), save_dir_path=save_dir)
```
Good call — switched to TemporaryDirectory context manager in 5796e13. The best weights are loaded back into memory before train_raw returns, so the directory can safely be cleaned up on exit.
Address review feedback: move tempfile import to module top and use TemporaryDirectory context manager instead of mkdtemp to avoid leaking orphaned temp directories across multiple training runs.
Summary
Fixes three bugs in MetaCAT supervised training that cause crashes during `train_supervised_raw(train_addons=True)`:

1. `UnboundLocalError` and `KeyError` during MetaCAT inference within NER training epochs
2. `prepare_from_json` → `_prepare_from_json_loop` causes `IndexError` during `train_raw`
3. `_train_meta_cat` missing `save_dir_path` — `train_raw` requires a save directory when `auto_save_model=True` (the default config), but `_train_meta_cat` never provides one. This crashes after all NER epochs complete, losing hours of training progress.

Reproduction
All three bugs are triggered by calling `train_supervised_raw(data, train_addons=True)` with MetaCAT addons registered. The token-mapping bugs (1 & 2) occur when entity character spans fall at document boundaries or in whitespace-only regions where no BPE tokens are produced. The `save_dir_path` bug (3) occurs unconditionally with the default MetaCAT config.

Changes
Commit 1: `meta_cat.py` — inference path

- `prepare_document`: skip entities with empty `ctoken_idx` (prevents `UnboundLocalError` on `ind`)
- `_set_meta_anns`: guard the `ent_id2ind` lookup for skipped entities (prevents `KeyError`)

Commit 2: `data_utils.py` — training data path

- `_prepare_from_json_loop`: skip annotations with empty `ctoken_idx` (prevents `IndexError` on `ctoken_idx[0]`)

Commit 3: `trainer.py` — MetaCAT addon training

- `_train_meta_cat`: create a temporary directory and pass it as `save_dir_path` to `train_raw` when `auto_save_model` is enabled

Test plan
- `train_supervised_raw(train_addons=True)` completes without crash on documents containing entities near text boundaries
- `train_raw` receives a valid `save_dir_path` and saves the best checkpoint during training
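The `_set_meta_anns` guard from commit 1 also lends itself to an isolated regression test; this sketch uses a hypothetical, simplified signature to show the guarded lookup the fix describes:

```python
def set_meta_anns(entities, ent_id2ind, predictions):
    """Attach meta-annotation predictions to entities.

    Entities skipped during preparation have no entry in ent_id2ind,
    so look them up defensively instead of crashing with KeyError.
    """
    annotated = {}
    for ent_id in entities:
        ind = ent_id2ind.get(ent_id)
        if ind is None:
            continue  # entity was skipped at prepare time
        annotated[ent_id] = predictions[ind]
    return annotated
```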