ssmixtools.cleaning.optional package

Optional cleaning utilities for ssmixtools.

This module provides:
  • step1(): Maps medical codes using optional maps and extracts unique laboratory values for step2().

  • step2(): Cleans laboratory test results using optional maps for units and nonnumeric values.

  • step3(): Performs the final data cleaning.

  • step4(): Translates clinical codes into English terms.

ssmixtools.cleaning.optional.step1(source_dir: str, reference_dir: str, max_workers: int | None = None, log_dir: str | None = None)[source]

Maps codes using optional mappings and extracts unique laboratory values for the next step (step2).

Note

  • Ensure that optional_text_to_ATC.csv and optional_JLAC10_to_JLAC10.csv are completed before performing this step. These files should have been created in reference_dir/created_reference by ssmixtools.cleaning.mandatory.clean().
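A pre-flight check along the following lines can catch a missing mapping file before the step starts. This is only a sketch: the helper missing_reference_files() is hypothetical and not part of ssmixtools, and it verifies that the files exist and are non-empty, not that the mappings inside them have actually been filled in.

```python
from pathlib import Path

# File names come from the note above; the helper itself is hypothetical.
STEP1_REQUIRED_FILES = (
    "optional_text_to_ATC.csv",
    "optional_JLAC10_to_JLAC10.csv",
)


def missing_reference_files(reference_dir, required=STEP1_REQUIRED_FILES):
    """Return the required files that are absent or empty in created_reference."""
    created = Path(reference_dir) / "created_reference"
    missing = []
    for name in required:
        path = created / name
        # A zero-byte file is treated the same as a missing one.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

If the returned list is non-empty, the listed files still need to be prepared before calling step1().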

Parameters:
  • source_dir (str) – Path to the directory containing source data cleaned by ssmixtools.cleaning.mandatory.clean().

  • reference_dir (str) – Path to the directory containing reference mapping files for code mapping.

  • max_workers (int, optional) – Maximum number of workers for parallel processing. Defaults to the number of physical CPU cores minus 1.

  • log_dir (str, optional) – Directory for saving log files. If None, logs are printed to the console.
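To set max_workers explicitly rather than rely on the default, a worker count can be derived from the machine's core count. The sketch below is an approximation: the standard library reports logical cores only, whereas the documented default is based on physical cores (which a third-party package such as psutil can report). The function name conservative_worker_count() is hypothetical.

```python
import os


def conservative_worker_count(reserve=1):
    """Return the logical core count minus `reserve`, never less than 1."""
    # os.cpu_count() may return None when the count cannot be determined.
    logical = os.cpu_count() or 1
    return max(1, logical - reserve)
```

The result can be passed as max_workers to any of the step functions.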

Returns:

None

Example Usage

import ssmixtools

ssmixtools.cleaning.optional.step1(
    source_dir='/path/to/source/data',
    reference_dir='/path/to/reference/mapping/files',
    max_workers=None,
    log_dir='/path/to/log/dir',
)

Workflow

  1. Optional Code Mapping:
    • Applies optional mapping tables to clean and standardize clinical codes.

  2. Unique Laboratory Value Extraction:
    • Extracts unique units and nonnumeric laboratory values for the optional laboratory data cleaning (step2).
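The two workflow stages can be pictured with a small standard-library sketch. It is not the ssmixtools implementation: the function names, the column names (code, unit, nonnumeric), and the choice to leave unmapped codes unchanged are all assumptions made for illustration.

```python
def apply_code_map(rows, code_map, column="code"):
    """Replace codes found in code_map; leave unmapped codes unchanged."""
    return [
        {**row, column: code_map.get(row[column], row[column])}
        for row in rows
    ]


def unique_lab_values(rows):
    """Collect the distinct units and nonnumeric result strings."""
    units = sorted({r["unit"] for r in rows if r.get("unit")})
    nonnumerics = sorted({r["nonnumeric"] for r in rows if r.get("nonnumeric")})
    return units, nonnumerics


# Placeholder data; the codes and values are not real records.
rows = [
    {"code": "OLD_A", "unit": "mg/dL", "nonnumeric": ""},
    {"code": "KEEP", "unit": "mg/dL", "nonnumeric": "Positive"},
    {"code": "OLD_A", "unit": "", "nonnumeric": "Negative"},
]
mapped = apply_code_map(rows, {"OLD_A": "NEW_A"})
units, nonnumerics = unique_lab_values(mapped)
```

The unique values collected this way correspond to what a user would then review when completing the unit and nonnumeric mapping tables for step2.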

ssmixtools.cleaning.optional.step2(source_dir: str, reference_dir: str, max_workers: int | None = None, log_dir: str | None = None)[source]

Cleans laboratory test results using optionally created mapping tables for units and nonnumeric values.

Note

  • Ensure that lab_nonnumerics.csv and lab_units.csv are completed before performing this step. These files should have been created in reference_dir/created_reference by ssmixtools.cleaning.optional.step1().

Parameters:
  • source_dir (str) – Path to the directory containing source data cleaned by ssmixtools.cleaning.optional.step1().

  • reference_dir (str) – Path to the directory containing reference mapping files.

  • max_workers (int, optional) – Maximum number of workers for parallel processing. Defaults to the number of physical CPU cores minus 1.

  • log_dir (str, optional) – Directory for saving log files. Defaults to None.

Returns:

None

Example Usage

import ssmixtools

ssmixtools.cleaning.optional.step2(
    source_dir='/path/to/source/data',
    reference_dir='/path/to/reference/mapping/files',
    max_workers=None,
    log_dir='/path/to/log/dir',
)

Workflow

  1. Unit Mapping and Standardization:
    • Maps original units to standardized units using a predefined mapping table.

    • Applies unit-specific value corrections through addition and multiplication rules.

  2. Nonnumeric Value Mapping:
    • Maps nonnumeric test result values (e.g., “Positive,” “Negative”) to standardized equivalents.

    • Detects numeric values mistakenly recorded as nonnumeric and moves them to the numeric column.

  3. Handling Missing and Invalid Records:
    • Removes rows flagged as non-records or rows missing both numeric and nonnumeric results after cleaning.

    • Ensures that all rows represent valid laboratory results.
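The workflow above can be sketched for a single row as follows. This is an illustration under stated assumptions, not the library's code: the function name, the row keys (numeric, nonnumeric, unit), the shape of the mapping tables, and the order of operations (multiply, then add) are all hypothetical. Handling of rows flagged as non-records is omitted.

```python
def clean_lab_row(row, unit_map, nonnumeric_map):
    """Return a cleaned copy of row, or None if no valid result remains."""
    numeric = row.get("numeric")
    text = (row.get("nonnumeric") or "").strip()

    # A number that was stored as text is moved to the numeric field.
    if numeric is None and text:
        try:
            numeric = float(text)
            text = ""
        except ValueError:
            pass

    # Unit standardization: new_value = value * multiply + add (assumed order).
    unit = row.get("unit")
    if unit in unit_map:
        rule = unit_map[unit]
        if numeric is not None:
            numeric = numeric * rule["multiply"] + rule["add"]
        unit = rule["unit"]

    # Nonnumeric standardization; unmapped strings are kept as they are.
    if text:
        text = nonnumeric_map.get(text, text)

    # Rows with neither a numeric nor a nonnumeric result are dropped.
    if numeric is None and not text:
        return None
    return {**row, "numeric": numeric, "nonnumeric": text or None, "unit": unit}
```

For example, with a unit rule that maps g/L to g/dL using multiply=0.1 and add=0.0, a value of 50.0 g/L becomes 5.0 g/dL.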

ssmixtools.cleaning.optional.step3(source_dir: str, reference_dir: str, max_workers: int | None = None, log_dir: str | None = None)[source]

Performs final data cleaning.

Note

  • After this step, all JLAC10 codes become the method-agnostic form (e.g., 5E0560000001---11).
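A JLAC10 code is 17 characters long: analyte (5), identification (4), material (3), measurement method (3), and result identification (2). The sketch below assumes that the method-agnostic form simply masks the three-character method segment with a placeholder character. That reading is inferred from the example in the note and is not confirmed against the ssmixtools source; the function name to_method_agnostic() is hypothetical.

```python
def to_method_agnostic(jlac10, placeholder="-"):
    """Mask the method segment (characters 13 to 15) of a 17-character code."""
    if len(jlac10) != 17:
        raise ValueError(f"expected a 17-character JLAC10 code, got {jlac10!r}")
    # Keep analyte + identification + material (12 chars) and the result code (2 chars).
    return jlac10[:12] + placeholder * 3 + jlac10[15:]


# "XXX" stands in for any method code; it is not a real JLAC10 method value.
masked = to_method_agnostic("5E0560000001XXX11")
```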

Parameters:
  • source_dir (str) – Path to the directory containing source data cleaned by ssmixtools.cleaning.optional.step2().

  • reference_dir (str) – Path to the directory containing reference mapping files for cleaning.

  • max_workers (int, optional) – Maximum number of workers for parallel processing. Defaults to the number of physical CPU cores minus 1.

  • log_dir (str, optional) – Directory for saving log files. Defaults to None.

Returns:

None

Example Usage

import ssmixtools

ssmixtools.cleaning.optional.step3(
    source_dir='/path/to/source/data',
    reference_dir='/path/to/reference/mapping/files',
    max_workers=None,
    log_dir='/path/to/log/dir',
)

Workflow

  1. Standardized Code Validation:
    • Validates and cleans standardized codes (ICD-10, ATC, and JLAC10).

    • Removes missing or irregular codes, truncates excessively long codes, and ensures consistency using regular expressions.

  2. Handling Missing Values:
    • Drops records with missing values in critical columns.
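The validation stage can be illustrated with regular expressions from the standard library. The patterns, the length cutoff, the normalization (upper-casing and removing dots), and the column names below are assumptions chosen for the sketch; the expressions that ssmixtools actually applies may differ.

```python
import re

# Illustrative patterns only.
ICD10_PATTERN = re.compile(r"[A-Z][0-9]{2}[0-9A-Z]{0,4}")
ATC_PATTERN = re.compile(r"[A-Z][0-9]{2}[A-Z]{2}[0-9]{2}")
MAX_CODE_LENGTH = 7  # hypothetical cutoff for "excessively long" codes


def clean_code(raw, pattern, max_length=MAX_CODE_LENGTH):
    """Normalize, truncate, and validate a code; return None if irregular."""
    if raw is None:
        return None
    code = raw.strip().upper().replace(".", "")[:max_length]
    return code if pattern.fullmatch(code) else None


def drop_incomplete(rows, critical=("patient_id", "code")):
    """Keep only rows where every critical column has a value."""
    return [
        r for r in rows
        if all(r.get(c) not in (None, "") for c in critical)
    ]
```

Under these assumptions, clean_code("e11.9", ICD10_PATTERN) yields "E119", while a string that matches no pattern yields None and the row would then be removed by drop_incomplete().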

ssmixtools.cleaning.optional.step4(source_dir: str, reference_dir: str, max_workers: int | None = None, log_dir: str | None = None)[source]

Translates clinical codes into English terms.

This step translates standardized clinical codes (e.g., ICD-10, ATC, JLAC10) into their corresponding English terms using reference mapping tables.

Note

  • Ensure that you have prepared ICD10_to_text.csv and ATC_to_text.csv in reference_dir.

  • A new column, local_item_name, is created, and the original item names before translation are copied into this column.

  • Untranslated codes are logged for manual review.

  • A reference table to map JLAC10 codes to English terms is generated to facilitate method-agnostic mapping.
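The behavior described in these notes can be pictured with the following sketch. It is not the library's implementation: the function name translate_rows(), the column names code and item_name, and the logger name are hypothetical. Only local_item_name comes from the documentation above.

```python
import logging

logger = logging.getLogger("translation_sketch")


def translate_rows(rows, code_to_text):
    """Translate item names by code, preserving the original names."""
    translated = []
    untranslated = set()
    for row in rows:
        new_row = dict(row)
        # Preserve the pre-translation name in a separate column.
        new_row["local_item_name"] = row.get("item_name")
        code = row.get("code")
        term = code_to_text.get(code)
        if term is None:
            untranslated.add(str(code))
        else:
            new_row["item_name"] = term
        translated.append(new_row)
    # Report codes without an English term so they can be reviewed manually.
    for code in sorted(untranslated):
        logger.warning("No English term for code %s", code)
    return translated, sorted(untranslated)
```

In this sketch, rows whose code has no entry in the mapping keep their original item name and are reported through the logger.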

Parameters:
  • source_dir (str) – Path to the directory containing source data cleaned by ssmixtools.cleaning.optional.step3().

  • reference_dir (str) – Path to the directory containing reference mapping files for code translation.

  • max_workers (int, optional) – Maximum number of workers for parallel processing. Defaults to the number of physical CPU cores minus 1.

  • log_dir (str, optional) – Directory for saving log files. If None, logs are printed to the console.

Returns:

None

Example Usage

import ssmixtools

ssmixtools.cleaning.optional.step4(
    source_dir='/path/to/source/data',
    reference_dir='/path/to/reference/mapping/files',
    max_workers=None,
    log_dir='/path/to/log/dir',
)