watcher.preprocess package

watcher.preprocess.create_dataset(output_dir: str, train_size: float, val_size: float, train_period: str, test_period: str, max_sequence_length: int, db_schema: str = 'public', min_timeline_length: int | None = None, patients_per_file: int = 5000, update_period: str | None = None, dx_code_segments: str | None = None, med_code_segments: str | None = None, lab_code_segments: str | None = None, n_numeric_bins: int = 601, td_small_step: int = 10, td_large_step: int = 60, max_workers: int = 1, log_dir: str | None = None)[source]

Create a full dataset for model training and evaluation.

This function performs data cleaning, temporal alignment, tokenization, aggregation, label generation, and matrix construction to produce the data used to train and evaluate the Watcher model.

Preprocessing parameters are inherited by the trained model and therefore become part of its effective hyperparameters.

Example

from watcher.preprocess import create_dataset

create_dataset(
    output_dir="/code/pretraining_data",
    train_size=0.8,
    val_size=0.1,
    train_period="2011/01/01-2022/12/31",
    test_period="2023/01/01-2023/12/31",
    update_period="2022/01/01-2022/12/31",
    db_schema="public",
    max_sequence_length=2048,
    max_workers=10,
    log_dir="/path/to/log_dir"
)

Train-Test Split:

The dataset is split using a temporal strategy (see the sketch after this list):

  • Patients whose records fall exclusively within the test_period are assigned to the test set.

  • Remaining patients are sorted by their first visit date. The most recent patients are added to the test set until the desired test size fraction (1 - train_size - val_size) is reached.

  • Clinical records in the training or validation set but outside the train_period are assigned id = -1 and excluded from model training (i.e., ignored during loss computation).
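
The sketch below outlines this assignment logic. The function name assign_test_set, the visit_spans mapping (patient ID to first and last visit dates), and the rounding of the test-set quota are illustrative assumptions, not the library's implementation.

from datetime import date

def assign_test_set(
    visit_spans: dict[str, tuple[date, date]],
    test_start: date,
    test_end: date,
    train_size: float,
    val_size: float,
) -> set[str]:
    """Return patient IDs assigned to the test set (sketch of the split strategy)."""
    # 1) Patients whose records fall exclusively within the test period.
    test = {
        pid for pid, (first, last) in visit_spans.items()
        if test_start <= first and last <= test_end
    }
    # 2) Top up with the most recent remaining patients (sorted by first visit date)
    #    until the target fraction (1 - train_size - val_size) is reached.
    target = round(len(visit_spans) * (1 - train_size - val_size))
    remaining = sorted(
        (pid for pid in visit_spans if pid not in test),
        key=lambda pid: visit_spans[pid][0],
        reverse=True,
    )
    for pid in remaining:
        if len(test) >= target:
            break
        test.add(pid)
    return test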

Medical Code Tokenization:

You can control how medical codes are tokenized using the dx_code_segments, med_code_segments, and lab_code_segments arguments.

By default, if these are set to None, each unique code is treated as a single token. However, for coding systems with hierarchical structure (e.g., ICD-10 or ATC), splitting codes into subtokens may improve training efficiency.

For example, setting dx_code_segments="1-2-1-1" tokenizes the ICD-10 code 'J156' into:

['J****', '*15**', '***6*', '[PAD]']

Each token masks the non-selected characters with *, and unused segments are padded. If dx_code_segments=None, the same code is tokenized as:

['J156']

This segmentation affects only the tokenization/embedding step; the original codes are still assigned unique vocabulary entries for modeling.
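
To illustrate the masking scheme, here is a small stand-alone re-implementation; segment_code is a hypothetical helper written for this example and is not part of the watcher API.

def segment_code(code: str, pattern: str | None) -> list[str]:
    # Split a medical code into masked subtokens according to a segment pattern
    # such as "1-2-1-1"; with pattern=None the whole code is a single token.
    if pattern is None:
        return [code]
    lengths = [int(n) for n in pattern.split("-")]
    width = sum(lengths)                # total character slots, e.g. 5 for "1-2-1-1"
    tokens, start = [], 0
    for length in lengths:
        piece = code[start:start + length]
        if piece:
            # Keep this segment's characters, mask everything else with '*'.
            tokens.append("*" * start + piece + "*" * (width - start - len(piece)))
        else:
            # Segment lies entirely past the end of the code.
            tokens.append("[PAD]")
        start += length
    return tokens

print(segment_code("J156", "1-2-1-1"))   # ['J****', '*15**', '***6*', '[PAD]']
print(segment_code("J156", None))        # ['J156']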

Parameters:
  • output_dir (str) – Directory where the dataset will be saved.

  • train_size (float) – Fraction of patients assigned to the training set.

  • val_size (float) – Fraction of patients assigned to the validation set.

  • train_period (str) – Date range for training data (e.g., “2011/01/01-2022/12/31”).

  • test_period (str) – Date range for test data (e.g., “2023/01/01-2023/12/31”).

  • max_sequence_length (int) – Maximum tokenized sequence length per patient.

  • db_schema (str) – PostgreSQL schema name to read from.

  • min_timeline_length (int, optional) – Minimum number of visits required to include a patient. Defaults to None, which includes all patients with at least one valid clinical event.

  • patients_per_file (int) – Number of patients per intermediate file. Defaults to 5000.

  • update_period (str, optional) – Date range used to create a fine-tuning dataset for adapting the model to recent medical practice (e.g., “2022/01/01-2022/12/31”).

  • dx_code_segments (str, optional) – Segmentation pattern for diagnosis codes (e.g., “1-2-1-1”).

  • med_code_segments (str, optional) – Segmentation pattern for medication codes (e.g., “1-2-1-1-2”).

  • lab_code_segments (str, optional) – Segmentation pattern for laboratory test codes.

  • n_numeric_bins (int) – Number of bins for discretizing numeric values. Defaults to 601.

  • td_small_step (int) – Step size (in minutes) to discretize time progression from +0 to +1440 min (1 day).

  • td_large_step (int) – Step size (in minutes) to discretize clock time across a 24-hour day, from 00:00 to 23:59.

  • max_workers (int) – Maximum number of parallel workers. Defaults to 1.

  • log_dir (str, optional) – Directory to store preprocessing logs.

Returns:

None
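
For reference, a call that also enables code segmentation and the discretization options might look like the following; the paths are placeholders and the segment patterns simply reuse the examples from the parameter descriptions above.

from watcher.preprocess import create_dataset

create_dataset(
    output_dir="/code/pretraining_data",
    train_size=0.8,
    val_size=0.1,
    train_period="2011/01/01-2022/12/31",
    test_period="2023/01/01-2023/12/31",
    max_sequence_length=2048,
    dx_code_segments="1-2-1-1",      # split ICD-10 diagnosis codes into subtokens
    med_code_segments="1-2-1-1-2",   # split medication codes into subtokens
    n_numeric_bins=601,              # default, shown explicitly
    td_small_step=10,
    td_large_step=60,
    max_workers=10,
    log_dir="/path/to/log_dir",
)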

watcher.preprocess.get_patient_ids(dataset_dir: str, group: Literal['train', 'validation', 'test', 'all']) → list[str][source]

Load the patient IDs for the training, validation, or test set, or for all patients.

Example

from watcher.preprocess import get_patient_ids

patient_ids = get_patient_ids(dataset_dir="/path/to/dataset", group="train")
print(patient_ids[:10])

Parameters:
  • dataset_dir (str) – Path to the dataset directory.

  • group (Literal["train", "validation", "test", "all"]) – Group to load patient IDs from:
      - "train": Training set only.
      - "validation": Validation set only.
      - "test": Test set only.
      - "all": Combine the training, validation, and test sets.

Returns:

List of patient IDs corresponding to the specified group.

Return type:

list[str]
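
For instance, to report the size of each split (the dataset path is a placeholder):

from watcher.preprocess import get_patient_ids

for group in ("train", "validation", "test", "all"):
    ids = get_patient_ids(dataset_dir="/path/to/dataset", group=group)
    print(f"{group}: {len(ids)} patients")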