Chicago's second favorite bean
This repo provides a configurable way to collate data from multiple sources into a single denormalized dataframe and create tokenized timelines from the results.
You can download and install this package as follows:

```shell
git clone git@github.com:bbj-lab/cocoa.git
cd cocoa
python -m venv .venv
. .venv/bin/activate
pip install -e .
```

The collator pulls from raw data tables (parquet or csv) and combines them into a
single denormalized dataframe in a MEDS-like format. Each row in the output
represents an event with a `subject_id`, `time`, and `code` (all mandatory), and
optional `numeric_value` / `text_value` columns.
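To make the "one row per event" shape concrete, here is a small pure-Python sketch (all table, column, and prefix names are invented for illustration) that flattens a wide vitals table into MEDS-like event rows:

```python
# Sketch: flatten a wide per-measurement table into MEDS-like event rows
# (one row per subject_id/time/code). Names here are illustrative only.
wide_vitals = [
    {"subject_id": "H001", "recorded_dttm": "2110-01-01 08:00", "heart_rate": 82, "spo2": 97},
    {"subject_id": "H001", "recorded_dttm": "2110-01-01 09:00", "heart_rate": 90, "spo2": None},
]

def to_events(rows, codes, time_col="recorded_dttm", prefix="VTL"):
    """Emit one event dict per non-null measurement."""
    events = []
    for row in rows:
        for code in codes:
            value = row.get(code)
            if value is None:
                continue
            events.append({
                "subject_id": row["subject_id"],
                "time": row[time_col],
                "code": f"{prefix}//{code}",  # prefix joined with "//", as in the output
                "numeric_value": float(value),
                "text_value": None,
            })
    return events

events = to_events(wide_vitals, ["heart_rate", "spo2"])
# 3 events: two from the first row, one from the second (spo2 was null)
```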
Collation is driven by a YAML config (like this) that specifies:

- A reference table with a primary key (`subject_id`), start/end times, and optional augmentation joins (e.g. joining a patient demographics table).
- A list of entries, each mapping a source table (or the reference frame itself via `table: REFERENCE`) to the output schema. Each entry declares which column provides the `code`, the `time`, and optionally the `numeric_value` and `text_value`. Codes can be given a prefix via `prefix`. Some preprocessing can be done with the optional entries `filter_expr`, `with_col_expr`, and `agg_expr`, which take the form of Polars expressions that are evaluated and applied to the dataframe during loading. Mild checks are performed when evaluating these expressions, but in general the YAML config is just as powerful as Python code, so check all YAML files prior to use.
- Subject splits (`train_frac` / `tuning_frac`) that partition subjects chronologically into train, tuning, and held-out sets.
A collation config has three top-level sections: identifiers, subject splits, and the reference + entries that define which events to extract.

```yaml
subject_id: hospitalization_id  # the atomic unit of interest
group_id: patient_id            # multiple subjects can belong to a group
subject_splits:
  train_frac: 0.7
  tuning_frac: 0.1
  # the remainder is held out
```

`subject_id` is the column that uniquely identifies each subject (e.g. a
hospitalization). `group_id` is an optional higher-level grouping column.
Subjects are sorted chronologically and split into train / tuning / held-out sets
according to the specified fractions.
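As an illustration of fraction-based chronological partitioning, here is a sketch in plain Python (not necessarily the package's exact implementation, e.g. in how it rounds or breaks ties):

```python
def split_subjects(subjects_by_time, train_frac=0.7, tuning_frac=0.1):
    """Partition chronologically sorted subject ids into train/tuning/held_out.

    Assumes `subjects_by_time` is already sorted by each subject's start time
    (earliest first); whatever remains after train + tuning is held out.
    """
    n = len(subjects_by_time)
    n_train = round(n * train_frac)
    n_tuning = round(n * tuning_frac)
    return {
        "train": list(subjects_by_time[:n_train]),
        "tuning": list(subjects_by_time[n_train:n_train + n_tuning]),
        "held_out": list(subjects_by_time[n_train + n_tuning:]),
    }

splits = split_subjects([f"H{i:03d}" for i in range(10)])
# 10 subjects -> 7 train, 1 tuning, 2 held out
```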
The reference table is the primary static table to which other static information can be joined:

```yaml
reference:
  table: clif_hospitalization
  start_time: admission_dttm
  end_time: discharge_dttm
  augmentation_tables:
    - table: clif_patient
      key: patient_id
      validation: "m:1"
      with_col_expr: pl.lit("AGE").alias("AGE")
```

- `table`: the name of the parquet (or csv) file in `data_home` (without the extension).
- `start_time` / `end_time`: columns that define the subject's time window; used to filter events from other tables when `reference_key` is set (see below).
- `augmentation_tables`: optional list of tables to join onto the reference frame. Each needs a `key` to join on and a `validation` mode (e.g. `"m:1"`). You can also add computed columns via `with_col_expr`.
The entries list defines the events to extract. Every entry produces rows with
the columns `subject_id`, `time`, `code`, `numeric_value`, and `text_value`. The
entry's fields tell the collator which source columns map to these outputs.
Required fields:

| Field | Description |
|---|---|
| `table` | Source table name, or `REFERENCE` to pull from the reference frame. |
| `code` | Column whose values become the event code. |
| `time` | Column whose values become the event timestamp. |
Optional fields:

| Field | Description |
|---|---|
| `prefix` | String prepended to the code (separated by `//`), e.g. `LAB-RES`. |
| `numeric_value` | Column to use as the numeric value for the event. |
| `text_value` | Column to use as the text value for the event. |
| `filter_expr` | A Polars expression (or list of expressions) to filter rows before extraction. |
| `with_col_expr` | A Polars expression (or list) to add computed columns before extraction. |
| `reference_key` | Join the source table to the reference frame on this key and keep only rows within the subject's `start_time`/`end_time` window. |
Examples:
A simple categorical event from the reference frame:

```yaml
- table: REFERENCE
  prefix: DSCG
  code: discharge_category
  time: discharge_dttm
```

A numeric event from an external table:
```yaml
- table: clif_labs
  prefix: LAB-RES
  code: lab_category
  numeric_value: lab_value_numeric
  time: lab_result_dttm
```

Filtering rows before extraction (single filter):
```yaml
- table: clif_position
  prefix: POSN
  filter_expr: pl.col("position_category") == "prone"
  code: position_category
  time: recorded_dttm
```

Multiple filters (applied as a list):
```yaml
- table: clif_medication_admin_intermittent_converted
  prefix: MED-INT
  filter_expr:
    - pl.col("mar_action_category") == "given"
    - pl.col("_convert_status") == "success"
  code: med_category
  numeric_value: med_dose_converted
  time: admin_dttm
```

Creating a computed column with `with_col_expr` to use as the code:
```yaml
- table: clif_respiratory_support_processed
  prefix: RESP
  with_col_expr: pl.lit("fio2_set").alias("code")
  filter_expr: pl.col("fio2_set").is_finite()
  code: code
  numeric_value: fio2_set
  time: recorded_dttm
```

Using `reference_key` to restrict events to a subject's time window:
```yaml
- table: clif_code_status
  prefix: CODE
  code: code_status_category
  time: admission_dttm
  reference_key: patient_id
```

The tokenizer consumes the collated parquet output and converts events into integer token sequences suitable for sequence models. It:
- Adds `BOS` / `EOS` (beginning/end-of-sequence) tokens to each subject's timeline.
- Optionally inserts configurable clock tokens to mark the passage of time.
- Optionally inserts configurable time-spacing tokens between events.
- Computes quantile-based bins for numeric values (from training data only).
- Maps codes (and optionally their binned values) to integer tokens via a vocabulary that is formed during training and frozen for tuning/held-out data.
- Aggregates per-subject token sequences, sorting events by time and breaking ties with the configurable sort order.
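To illustrate the clock-token idea, here is a sketch (not the package's actual algorithm; the `CLOCK//HH` token name and clock hours are invented for illustration) that interleaves a marker whenever the timeline crosses a configured hour of the day:

```python
from datetime import datetime, timedelta

def insert_clock_tokens(events, clock_hours=(0, 4, 8, 12, 16, 20)):
    """Sketch: interleave CLOCK//HH markers whenever the timeline crosses one
    of the configured hours of the day. `events` is a list of
    (token, datetime) pairs sorted by time; token names are illustrative."""
    out = []
    prev = None
    for token, t in events:
        if prev is not None:
            # emit a clock token for every configured hour crossed since `prev`
            tick = prev.replace(minute=0, second=0, microsecond=0)
            while tick <= t:
                if prev < tick <= t and tick.hour in clock_hours:
                    out.append((f"CLOCK//{tick.hour:02d}", tick))
                tick += timedelta(hours=1)
        out.append((token, t))
        prev = t
    return out

timeline = [
    ("VTL//heart_rate", datetime(2110, 1, 1, 23, 30)),
    ("LAB-RES//sodium", datetime(2110, 1, 2, 5, 15)),
]
marked = insert_clock_tokens(timeline)
# CLOCK//00 and CLOCK//04 markers land between the two events
```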
Tokenization is driven by its own YAML config (like this) that specifies:

- `n_bins`: number of quantile bins for numeric values.
- `fused`: whether to fuse the code, binned value, and text value into a single token (`true`) or keep them as separate tokens (`false`).
- `insert_spacers`: whether to insert time-spacing tokens between events.
- `insert_clocks`: whether to insert clock tokens at specified times.
- `collated_inputs`: paths to the collated parquet files to tokenize.
- `subject_splits`: path to the subject splits parquet file.
- `ordering`: the priority order of code prefixes when sorting events within the same timestamp.
- `spacers`: mapping of time intervals (e.g., `5m-15m`, `1h-2h`) to their lower bounds in minutes, used for time-spacing tokens.
- `clocks`: list of hour strings (e.g., `00`, `04`, ...) at which to insert clock tokens.
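The `spacers` mapping can be read as "largest lower bound not exceeding the gap"; a sketch of that selection rule (the interval labels and bounds here are illustrative, not the default config):

```python
def spacer_token(gap_minutes, spacers):
    """Sketch: pick the spacer label whose lower bound is the largest one not
    exceeding the gap; gaps below the smallest bound get no spacer.
    `spacers` maps labels like "5m-15m" to lower bounds in minutes."""
    best = None
    for label, lower in spacers.items():
        if gap_minutes >= lower and (best is None or lower > best[1]):
            best = (label, lower)
    return best[0] if best else None

spacers = {"5m-15m": 5, "15m-1h": 15, "1h-2h": 60}
spacer_token(45, spacers)  # falls in the "15m-1h" band
spacer_token(2, spacers)   # too small for any spacer -> None
```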
The tokenizer produces two main outputs:

`tokens_times.parquet` gives one row per subject with three columns:

- `subject_id`
- `tokens`: the integer token sequence for the subject's timeline.
- `times`: a parallel list of timestamps, one per token, indicating when each event occurred.
The table will look something like this:

```
┌────────────┬─────────────────┬─────────────────────────────────┐
│ subject_id │ tokens          │ times                           │
│ ---        │ ---             │ ---                             │
│ str        │ list[u32]       │ list[datetime[μs]]              │
╞════════════╪═════════════════╪═════════════════════════════════╡
│ 20002103   │ [20, 350, … 21] │ [2116-05-08 02:45:00, 2116-05-… │
│ 20008372   │ [20, 350, … 21] │ [2110-10-30 13:03:00, 2110-10-… │
│ …          │ …               │ …                               │
│ 29994865   │ [20, 364, … 21] │ [2111-01-28 21:49:00, 2111-01-… │
└────────────┴─────────────────┴─────────────────────────────────┘
```
In this example, token 20 corresponds to the beginning-of-sequence token (BOS),
token 21 to the end-of-sequence token (EOS), and the tokens in between
correspond to the subject's clinical events in chronological order (with ties
broken by the configured ordering). In fused mode each event is a single token;
in unfused mode an event with a numeric value becomes two tokens (code + quantile
bin).
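A toy sketch of the fused/unfused distinction (the vocabulary, token ids, and `Q3` bin label are invented; unknown strings fall back to `UNK`):

```python
def tokenize_event(code, bin_label, lookup, fused=True, unk=0):
    """Sketch of fused vs. unfused token emission. `lookup` is a learned
    string-to-int vocabulary (frozen after training); names are illustrative."""
    if fused:
        # one token for the code + bin combination, e.g. "VTL//heart_rate_Q3"
        key = f"{code}_{bin_label}" if bin_label else code
        return [lookup.get(key, unk)]
    # unfused: code token followed by a separate bin token
    tokens = [lookup.get(code, unk)]
    if bin_label:
        tokens.append(lookup.get(bin_label, unk))
    return tokens

lookup = {"VTL//heart_rate": 7, "Q3": 12, "VTL//heart_rate_Q3": 42}
tokenize_event("VTL//heart_rate", "Q3", lookup, fused=True)   # [42]
tokenize_event("VTL//heart_rate", "Q3", lookup, fused=False)  # [7, 12]
```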
`tokenizer.yaml` is a plain YAML file that contains information about the
configuration, learned vocabulary, and bins. This file is sufficient to
reconstitute the tokenizer object. Currently, there's an entry for the lookup
that maps strings to tokens:

```yaml
lookup:
  UNK: 0
  ADMN//direct: 1
  ADMN//ed: 2
  ADMN//elective: 3
  AGE//age_Q0: 4
```

…and an entry for bin cutpoints:
```yaml
bins:
  VTL//heart_rate:
    - 65.0
    - 70.0
    - 75.0
    - 80.0
    - 84.0
    - 89.0
    - 94.0
    - 100.0
    - 108.0
  LAB-RES//platelet_count:
    - 62.0
    - 114.0
    - 147.0
    - 175.0
    - 203.0
    - 233.0
    - 267.0
    - 314.0
    - 390.0
```

The lists following each key correspond to the cutpoints for the associated category.
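Applying cutpoints amounts to a binary search; a sketch using the heart-rate cutpoints above (whether bins are left- or right-closed is an implementation detail; this sketch uses `bisect_right`):

```python
import bisect

def assign_bin(value, cutpoints):
    """Return the quantile-bin index for `value`: 0 for values below the first
    cutpoint, len(cutpoints) for values at or above the last (sketch)."""
    return bisect.bisect_right(cutpoints, value)

heart_rate_cutpoints = [65.0, 70.0, 75.0, 80.0, 84.0, 89.0, 94.0, 100.0, 108.0]
assign_bin(72.0, heart_rate_cutpoints)   # bin 2: between 70 and 75
assign_bin(130.0, heart_rate_cutpoints)  # bin 9: above the last cutpoint
```

Note that nine cutpoints yield ten bins, consistent with a default of ten quantile bins.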
`subject_splits.parquet` gives a table listing all `subject_id`s and their
corresponding split assignments:
```
┌────────────┬──────────┐
│ subject_id │ split    │
│ ---        │ ---      │
│ str        │ str      │
╞════════════╪══════════╡
│ 21081215   │ train    │
│ 20302177   │ train    │
│ …          │ …        │
│ 27116134   │ tuning   │
│ 29134959   │ tuning   │
│ …          │ …        │
│ 28150003   │ held_out │
│ 22151813   │ held_out │
└────────────┴──────────┘
```
`meds.parquet` gives a table of the collated events that were passed to the
tokenizer; this file is created in the collate step:
```
┌────────────┬─────────────────────┬──────────────────────────────┬───────────────┬────────────┐
│ subject_id │ time                │ code                         │ numeric_value │ text_value │
│ ---        │ ---                 │ ---                          │ ---           │ ---        │
│ str        │ datetime[μs]        │ str                          │ f32           │ str        │
╞════════════╪═════════════════════╪══════════════════════════════╪═══════════════╪════════════╡
│ 24591817   │ 2111-09-26 18:15:00 │ MED-CTS//sodium_chloride     │ 0.0           │ null       │
│ 21343412   │ 2112-01-11 06:31:00 │ LAB-RES//albumin             │ 3.3           │ null       │
│ 24894995   │ 2113-01-14 14:25:00 │ LAB-ORD//creatinine          │ null          │ null       │
│ 20947416   │ 2110-12-12 18:41:00 │ LAB-RES//hemoglobin          │ 8.4           │ null       │
│ 25082363   │ 2110-06-17 17:00:00 │ VTL//respiratory_rate        │ 30.0          │ null       │
│ …          │ …                   │ …                            │ …             │ …          │
│ 22074503   │ 2110-07-13 03:53:00 │ LAB-ORD//chloride            │ null          │ null       │
│ 24524153   │ 2110-10-08 03:20:00 │ LAB-RES//glucose_serum       │ 179.0         │ null       │
│ 28104308   │ 2112-03-22 14:31:00 │ LAB-RES//sodium              │ 137.0         │ null       │
│ 23859742   │ 2110-08-21 21:35:00 │ LAB-RES//ptt                 │ 26.299999     │ null       │
│ 25805890   │ 2110-10-03 11:00:00 │ LAB-ORD//eosinophils_percent │ null          │ null       │
└────────────┴─────────────────────┴──────────────────────────────┴───────────────┴────────────┘
```
All of these outputs are written to `processed_data_home`, as configured.
> **Tip:** To train a generative event model on this data, check out our configurable trainer: cotorra
The winnower prepares held-out timelines for evaluation by filtering and flagging subjects based on outcome criteria. It:
- Loads held-out data from the tokenized timelines and associated timestamps.
- Splits each subject's timeline at a configurable time horizon or at the first occurrence of a specified token, separating events into "past" (before the horizon) and "future" (after the horizon).
- Checks for the presence of outcome tokens in both the past and future periods.
- Filters out subjects whose timelines don't exceed the horizon duration, ensuring subjects have sufficient observation time.
- Outputs a winnowed dataset suitable for inference and evaluation tasks.
Winnowing is driven by a YAML config (like this) that specifies:

- `outcome_tokens`: list of event codes to track as outcomes (e.g., `XFR-IN//icu`, `DSCG//expired`). The winnower creates binary flags for each outcome indicating whether that token appears in the past or future period.
- `threshold`: defines how the threshold is set. Currently supported options are as follows:
  - `duration_s` (integer): thresholds after a given duration (in seconds)
  - `first_occurrence` (token string): thresholds after the first occurrence of the provided token
  - `uniform_random` (boolean): thresholds at a point in time chosen uniformly at random from the total duration of the timeline
- `horizon_after_threshold_s`: an optional parameter that sets a prediction window (in seconds) after the threshold is triggered
Example configuration:
```yaml
outcome_tokens:
  - XFR-IN//icu
  - RESP//imv
  - DSCG//expired
  - DSCG//hospice
threshold:
  # choose one and only one of the following
  # duration_s: !!int 86400  # 24h
  first_occurrence: XFR-IN//icu
horizon_after_threshold_s: !!int 2592000  # 30d outcome window after prediction threshold
```

The output is saved as `held_out_for_inference.parquet` with columns for each
outcome token (e.g., `XFR-IN//icu_past`, `XFR-IN//icu_future`) indicating whether
that outcome occurred in the respective time periods.
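A simplified sketch of the winnowing logic for a single subject, using a `duration_s`-style threshold (times here are seconds from timeline start; the real winnower also supports `first_occurrence` and `uniform_random`, and its exact boundary handling may differ):

```python
def winnow_subject(tokens, times, threshold_s, outcome_tokens, horizon_s=None):
    """Sketch: split a (token, time-in-seconds) timeline at `threshold_s`
    after its start, then flag each outcome token's presence in the past and
    future windows. Subjects whose timelines end before the threshold are
    filtered out."""
    start = times[0]
    cutoff = start + threshold_s
    end = cutoff + horizon_s if horizon_s is not None else None
    if times[-1] <= cutoff:
        return None  # not enough observation time; subject is filtered out
    past = [tok for tok, t in zip(tokens, times) if t <= cutoff]
    future = [tok for tok, t in zip(tokens, times)
              if t > cutoff and (end is None or t <= end)]
    flags = {}
    for outcome in outcome_tokens:
        flags[f"{outcome}_past"] = outcome in past
        flags[f"{outcome}_future"] = outcome in future
    return {"past_tokens": past, **flags}

row = winnow_subject(
    ["BOS", "XFR-IN//icu", "DSCG//expired", "EOS"],
    [0, 3_600, 100_000, 100_001],
    threshold_s=86_400,  # 24h
    outcome_tokens=["XFR-IN//icu", "DSCG//expired"],
)
# the ICU transfer falls before the 24h cutoff; the death falls after it
```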
All configuration lives under `config/`. The entrypoint is `config/main.yaml`,
which points to the collation and tokenization configs and sets shared paths:

```yaml
raw_data_home: ~/path/to/raw/data  # or use `ln -s xxx raw_data`
processed_data_home: ~/path/to/output
collation_config: ./config/collation/clif-21.yaml
tokenization_config: ./config/tokenization/clif-21.yaml
winnowing_config: ./config/winnowing/clif-21.yaml  # optional, for winnowing
```

To use a different dataset or schema, create new YAML files under
`config/collation/` and `config/tokenization/` and update the paths in
`config/main.yaml`, or pass your options directly to these objects. Both the
`Collator` and `Tokenizer` classes also accept `**kwargs` that are merged on top
of the YAML config via OmegaConf, so any config value can be overridden
programmatically:
```python
from cocoa.collator import Collator
from cocoa.tokenizer import Tokenizer

collator = Collator(raw_data_home="~/other/data")
tokenizer = Tokenizer(n_bins=20, fused=False)
```

We provide a CLI:
```
Usage: cocoa [OPTIONS] COMMAND [ARGS]...

Configurable collation and tokenization

╭─ Options ───────────────────────────────────────────────────────────────────╮
│ --install-completion          Install completion for the current shell.     │
│ --show-completion             Show completion for the current shell, to     │
│                               copy it or customize the installation.        │
│ --help                        Show this message and exit.                   │
╰─────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────╮
│ collate    Collate raw data into a denormalized format.                     │
│ tokenize   Tokenize collated data into integer sequences.                   │
│ winnow     Winnow held-out data for evaluation.                             │
│ pipeline   Run the full pipeline: collate, tokenize, & winnow.              │
│ info       Display configuration information.                               │
╰─────────────────────────────────────────────────────────────────────────────╯
```
with commands:

- `cocoa collate`

  ```
  Usage: cocoa collate [OPTIONS]

  Collate raw data into a denormalized format.

  Reads configuration from config/main.yaml and produces a MEDS-like parquet
  file with collated events.

  ╭─ Options ─────────────────────────────────────────────────────────────────╮
  │ --main-config            -m      PATH  Main configuration file (overrides │
  │                                        default)                           │
  │ --collation-config       -c      PATH  Collation configuration file       │
  │                                        (overrides config)                 │
  │ --raw-data-home          -r      TEXT  Raw data directory (overrides      │
  │                                        config)                            │
  │ --processed-data-home    -p      TEXT  Processed data directory           │
  │                                        (overrides config)                 │
  │ --verbose                -v            Verbose logging for collate; this  │
  │                                        may cause memory issues with large │
  │                                        datasets                           │
  │ --help                                 Show this message and exit.        │
  ╰───────────────────────────────────────────────────────────────────────────╯
  ```

- `cocoa tokenize`

  ```
  Usage: cocoa tokenize [OPTIONS]

  Tokenize collated data into integer sequences.

  Reads collated parquet files and produces tokenized timelines with
  vocabulary and bin information.

  ╭─ Options ─────────────────────────────────────────────────────────────────╮
  │ --main-config            -m      PATH  Main configuration file (overrides │
  │                                        default)                           │
  │ --tokenization-config    -c      PATH  Tokenization configuration file    │
  │                                        (overrides config)                 │
  │ --processed-data-home    -p      TEXT  Processed data directory           │
  │                                        (overrides config)                 │
  │ --tokenizer-home         -t      TEXT  Use a pretrained tokenizer at this │
  │                                        path (overrides config)            │
  │ --verbose                -v            Verbose logging for collate; this  │
  │                                        may cause memory issues with large │
  │                                        datasets                           │
  │ --help                                 Show this message and exit.        │
  ╰───────────────────────────────────────────────────────────────────────────╯
  ```

- `cocoa winnow`

  ```
  Usage: cocoa winnow [OPTIONS]

  Winnow held-out data for evaluation.

  Filters held-out timelines and assigns flags to disqualify certain subjects
  from evaluation based on the configured criteria.

  ╭─ Options ─────────────────────────────────────────────────────────────────╮
  │ --main-config            -m      PATH  Main configuration file (overrides │
  │                                        default)                           │
  │ --winnowing-config       -c      PATH  Winnowing configuration file       │
  │                                        (overrides config)                 │
  │ --processed-data-home    -p      TEXT  Processed data directory           │
  │                                        (overrides config)                 │
  │ --verbose                -v            Verbose logging for winnow; prints │
  │                                        summary statistics                 │
  │ --help                                 Show this message and exit.        │
  ╰───────────────────────────────────────────────────────────────────────────╯
  ```

- `cocoa pipeline`

  ```
  Usage: cocoa pipeline [OPTIONS]

  Run the full pipeline: collate, tokenize, & winnow.

  ╭─ Options ─────────────────────────────────────────────────────────────────╮
  │ --main-config            -m      PATH  Main configuration file (overrides │
  │                                        default)                           │
  │ --collation-config               PATH  Collation configuration file       │
  │                                        (overrides config)                 │
  │ --tokenization-config            PATH  Tokenization configuration file    │
  │                                        (overrides config)                 │
  │ --winnowing-config               PATH  Winnowing configuration file       │
  │                                        (overrides config)                 │
  │ --raw-data-home          -r      TEXT  Raw data directory (overrides      │
  │                                        config)                            │
  │ --processed-data-home    -p      TEXT  Processed data directory           │
  │                                        (overrides config)                 │
  │ --verbose                -v            Verbose logging for pipeline steps │
  │ --help                                 Show this message and exit.        │
  ╰───────────────────────────────────────────────────────────────────────────╯
  ```
> **Tip:** For common use cases, check out the recipes section!
