ssmixtools.cleaning.optional package
Optional cleaning utilities for ssmixtools.
- This module provides:
step1(): Maps medical codes using optional maps and extracts unique lab values for step2().
step2(): Cleans laboratory test results using optional maps for units and nonnumeric values.
step3(): Performs the final data cleaning.
step4(): Translates codes into English terms.
- ssmixtools.cleaning.optional.step1(source_dir: str, reference_dir: str, max_workers: int | None = None, log_dir: str | None = None)
Maps codes using optional mappings and extracts unique laboratory values for the next step (step2).
Note
Ensure that optional_text_to_ATC.csv and optional_JLAC10_to_JLAC10.csv are completed before performing this step. These files should have been created in reference_dir/created_reference by ssmixtools.cleaning.mandatory.clean().
- Parameters:
source_dir (str) – Path to the directory containing source data cleaned by ssmixtools.cleaning.mandatory.clean().
reference_dir (str) – Path to the directory containing reference mapping files for code mapping.
max_workers (int, optional) – Maximum number of workers for parallel processing. Defaults to the number of physical CPU cores minus 1.
log_dir (str, optional) – Directory for saving log files. If None, logs are printed to the console.
- Returns:
None
Example Usage
import ssmixtools

ssmixtools.cleaning.optional.step1(
    source_dir='/path/to/source/data',
    reference_dir='/path/to/reference/mapping/files',
    max_workers=None,
    log_dir='/path/to/log/dir',
)
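Because step1 depends on mapping files that must be filled in beforehand, a small pre-flight check can catch a missing file early. The sketch below is not part of ssmixtools: the helper name missing_reference_files is hypothetical, and it only tests that each file exists and is non-empty. It cannot tell whether the mapping has actually been completed.

```python
from pathlib import Path

# File names come from the note for step1; the helper is a hypothetical
# convenience and is not provided by ssmixtools.
STEP1_REQUIRED = ("optional_text_to_ATC.csv", "optional_JLAC10_to_JLAC10.csv")


def missing_reference_files(reference_dir, required=STEP1_REQUIRED):
    """Return the required file names that are absent or empty."""
    created = Path(reference_dir) / "created_reference"
    missing = []
    for name in required:
        path = created / name
        # Existence and size are only a rough proxy for "completed".
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty list means every expected file is present. The same helper can be reused before step2 by passing required=("lab_nonnumerics.csv", "lab_units.csv").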
Workflow
- Optional Code Mapping:
Applies optional mapping tables to clean and standardize clinical codes.
- Unique Laboratory Value Extraction:
Extracts unique units and nonnumeric laboratory values for the optional laboratory data cleaning (step2).
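The extraction described above can be pictured with a minimal sketch. It assumes each laboratory record is a dict with hypothetical "unit" and "value" keys. The actual column names and file layout used by ssmixtools are not documented here, so this is an illustration of the idea, not the package's implementation.

```python
def _is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def extract_unique_lab_values(rows):
    """Collect distinct units and distinct nonnumeric result strings."""
    units, nonnumerics = set(), set()
    for row in rows:
        # "unit" and "value" are assumed column names for illustration.
        unit = (row.get("unit") or "").strip()
        if unit:
            units.add(unit)
        value = (row.get("value") or "").strip()
        if value and not _is_number(value):
            nonnumerics.add(value)
    return sorted(units), sorted(nonnumerics)
```

The two sorted lists correspond to what a reviewer would then map by hand before running step2.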
- ssmixtools.cleaning.optional.step2(source_dir: str, reference_dir: str, max_workers: int | None = None, log_dir: str | None = None)
Cleans laboratory test results using optionally created mapping tables for units and nonnumeric values.
Note
Ensure that lab_nonnumerics.csv and lab_units.csv are completed before performing this step. These files should have been created in reference_dir/created_reference by ssmixtools.cleaning.optional.step1().
- Parameters:
source_dir (str) – Path to the directory containing source data cleaned by ssmixtools.cleaning.optional.step1().
reference_dir (str) – Path to the directory containing reference mapping files.
max_workers (int, optional) – Maximum number of workers for parallel processing. Defaults to the number of physical CPU cores minus 1.
log_dir (str, optional) – Directory for saving log files. Defaults to None.
- Returns:
None
Example Usage
import ssmixtools

ssmixtools.cleaning.optional.step2(
    source_dir='/path/to/source/data',
    reference_dir='/path/to/reference/mapping/files',
    max_workers=None,
    log_dir='/path/to/log/dir',
)
Workflow
- Unit Mapping and Standardization:
Maps original units to standardized units using a predefined mapping table.
Applies unit-specific value corrections through addition and multiplication rules.
- Nonnumeric Value Mapping:
Maps nonnumeric test result values (e.g., “Positive,” “Negative”) to standardized equivalents.
Detects numeric values mistakenly recorded as nonnumeric and moves them to the numeric column.
- Handling Missing and Invalid Records:
Removes rows flagged as non-records or rows missing both numeric and nonnumeric results after cleaning.
Ensures that all rows represent valid laboratory results.
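The three workflow stages above can be sketched on a single record. Everything here is an assumption made for illustration: the key names ("numeric", "nonnumeric", "unit"), the mapping entries, the NON_RECORD marker, and the order of the correction (multiply first, then add) are hypothetical and may differ from what ssmixtools actually does.

```python
NON_RECORD = "<non-record>"  # hypothetical marker for values that are not results

# original unit -> (standard unit, multiplier, offset); example entries only
UNIT_MAP = {"g/dL": ("g/L", 10.0, 0.0)}
NONNUMERIC_MAP = {"(+)": "Positive", "(-)": "Negative", "not tested": NON_RECORD}


def to_float(text):
    """Parse text as a float, returning None when it is not numeric."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def clean_lab_row(row):
    """Return a cleaned copy of the row, or None if it should be dropped."""
    numeric, nonnumeric, unit = row.get("numeric"), row.get("nonnumeric"), row.get("unit")

    # Numbers mistakenly recorded as text move to the numeric column.
    if numeric is None and nonnumeric is not None:
        parsed = to_float(nonnumeric)
        if parsed is not None:
            numeric, nonnumeric = parsed, None

    # Remaining text is mapped to its standardized equivalent.
    if nonnumeric is not None:
        key = nonnumeric.strip()
        nonnumeric = NONNUMERIC_MAP.get(key, key)
        if nonnumeric == NON_RECORD:
            return None

    # Unit standardization; assumed rule: value * multiplier + offset.
    if unit in UNIT_MAP:
        unit, multiplier, offset = UNIT_MAP[unit]
        if numeric is not None:
            numeric = numeric * multiplier + offset

    # Rows with neither a numeric nor a nonnumeric result are dropped.
    if numeric is None and not nonnumeric:
        return None
    return {**row, "numeric": numeric, "nonnumeric": nonnumeric, "unit": unit}
```

Returning None for dropped rows keeps the filtering step explicit: a caller can keep only the results that are not None.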
- ssmixtools.cleaning.optional.step3(source_dir: str, reference_dir: str, max_workers: int | None = None, log_dir: str | None = None)
Performs final data cleaning.
Note
After this step, all JLAC10 codes are in the method-agnostic form (e.g., 5E0560000001---11).
- Parameters:
source_dir (str) – Path to the directory containing source data cleaned by ssmixtools.cleaning.optional.step2().
reference_dir (str) – Path to the directory containing reference mapping files for cleaning.
max_workers (int, optional) – Maximum number of workers for parallel processing. Defaults to the number of physical CPU cores minus 1.
log_dir (str, optional) – Directory for saving log files. Defaults to None.
- Returns:
None
Example Usage
import ssmixtools

ssmixtools.cleaning.optional.step3(
    source_dir='/path/to/source/data',
    reference_dir='/path/to/reference/mapping/files',
    max_workers=None,
    log_dir='/path/to/log/dir',
)
Workflow
- Standardized Code Validation:
Validates and cleans standardized codes (ICD-10, ATC, and JLAC10).
Removes missing or irregular codes, truncates excessively long codes, and ensures consistency using regular expressions.
- Handling Missing Values:
Drops records with missing values in critical columns.
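A rough sketch of this kind of validation follows. The regular expressions, the maximum lengths, and the column names are assumptions chosen for illustration, not the patterns ssmixtools uses. The JLAC10 helper assumes the common 17-character layout in which characters 13 to 15 hold the measurement-method code, which is consistent with the example code in the note for this step.

```python
import re

# code system -> (assumed pattern, assumed maximum length); illustrative only
PATTERNS = {
    "ICD10": (re.compile(r"[A-Z][0-9]{2}[0-9A-Z]?"), 4),
    "ATC": (re.compile(r"[A-Z][0-9]{2}[A-Z]{2}[0-9]{2}"), 7),
    "JLAC10": (re.compile(r"[0-9A-Z]{12}---[0-9]{2}"), 17),
}


def clean_code(system, raw):
    """Return a normalized code, or None if it is missing or irregular."""
    if raw is None:
        return None
    pattern, max_len = PATTERNS[system]
    # Normalize case, drop dots, and truncate overly long codes.
    code = raw.strip().upper().replace(".", "")[:max_len]
    return code if pattern.fullmatch(code) else None


def to_method_agnostic(jlac10):
    """Mask the assumed method segment of a 17-character JLAC10 code."""
    return jlac10[:12] + "---" + jlac10[15:]


def drop_incomplete(rows, critical=("patient_id", "code")):
    """Keep rows whose critical columns (hypothetical names) are all filled."""
    return [r for r in rows if all(r.get(c) not in (None, "") for c in critical)]
```

For instance, under these assumptions to_method_agnostic("5E056000000102311") gives "5E0560000001---11", where the method code 023 is a made-up value.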
- ssmixtools.cleaning.optional.step4(source_dir: str, reference_dir: str, max_workers: int | None = None, log_dir: str | None = None)
Translates clinical codes into English terms.
This step translates standardized clinical codes (e.g., ICD-10, ATC, JLAC10) into their corresponding English terms using reference mapping tables.
Note
Ensure that you have prepared ICD10_to_text.csv and ATC_to_text.csv in reference_dir.
A new column local_item_name is created, and the original item names before translation are copied into this column.
Untranslated codes are logged for manual review.
A reference table to map JLAC10 codes to English terms is generated to facilitate method-agnostic mapping.
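The behavior described in this note can be illustrated with a minimal sketch. Only the column name local_item_name comes from the text above. The "code" and "item_name" keys, the logger name, and the choice to leave untranslated item names unchanged are assumptions, and the function is not part of ssmixtools.

```python
import logging

logger = logging.getLogger("translation_sketch")  # hypothetical logger name


def translate_items(rows, code_to_text):
    """Translate item names by code, keeping the originals in local_item_name."""
    translated, untranslated = [], set()
    for row in rows:
        new_row = dict(row)
        # Preserve the original (local) name before overwriting it.
        new_row["local_item_name"] = row.get("item_name")
        term = code_to_text.get(row.get("code"))
        if term is None:
            untranslated.add(row.get("code"))
        else:
            new_row["item_name"] = term
        translated.append(new_row)
    missing = sorted(untranslated, key=str)
    for code in missing:
        # Untranslated codes are reported for manual review.
        logger.warning("No English term for code %s", code)
    return translated, missing
```

Here code_to_text stands in for a lookup built from a file such as ICD10_to_text.csv or ATC_to_text.csv.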
- Parameters:
source_dir (str) – Path to the directory containing source data cleaned by ssmixtools.cleaning.optional.step3().
reference_dir (str) – Path to the directory containing reference mapping files for code translation.
max_workers (int, optional) – Maximum number of workers for parallel processing. Defaults to the number of physical CPU cores minus 1.
log_dir (str, optional) – Directory for saving log files. If None, logs are printed to the console.
- Returns:
None
Example Usage
import ssmixtools

ssmixtools.cleaning.optional.step4(
    source_dir='/path/to/source/data',
    reference_dir='/path/to/reference/mapping/files',
    max_workers=None,
    log_dir='/path/to/log/dir',
)