ssmixtools.cleaning.mandatory package
Cleaning utilities for ssmixtools.
- This module provides:
render_maps(): Renders mapping tables needed for data cleaning.
clean(): Cleans clinical records extracted from SS-MIX2 storage.
- ssmixtools.cleaning.mandatory.clean(source_dir: str, output_dir: str, reference_dir: str, max_workers: int | None = None, log_dir: str | None = None)
Cleans Clinical Records Extracted from SS-MIX2 Storage
This function processes clinical records using multiple cleaning steps to standardize and validate the dataset for downstream analysis. The cleaning pipeline includes code mapping, unit normalization, handling missing values, and ensuring the consistency of clinical data formats.
Note
Detailed process analytics are also saved in output_dir.
The cleaning process relies heavily on the provided reference tables; ensure they are up to date and accurate.
- Parameters:
source_dir (str) – Path to the directory containing source data to be cleaned.
output_dir (str) – Path to the directory where cleaned data will be saved.
reference_dir (str) – Path to the directory containing reference mapping files.
max_workers (int, optional) – Maximum number of workers for parallel processing. Defaults to the number of physical CPU cores minus 1.
log_dir (str, optional) – Directory for saving log files. If None, logs are printed to the console.
- Returns:
The function saves cleaned datasets and analytics in the specified output_dir.
- Return type:
None
Example Usage
from ssmixtools.cleaning.mandatory import clean

clean(
    source_dir='/path/to/source/data',
    output_dir='/path/to/cleaned/data',
    reference_dir='/path/to/reference/files',
    max_workers=4,
    log_dir='/path/to/log/dir',
)
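When max_workers is passed explicitly, a value can be derived from the machine's core count. The following is a minimal sketch using only the standard library. Note that os.cpu_count() reports logical cores rather than physical ones, so it only approximates the documented default, and pick_max_workers is a hypothetical helper, not part of ssmixtools.

```python
import os


def pick_max_workers(reserve: int = 1) -> int:
    """Return a worker count that leaves `reserve` cores free, never below 1.

    Hypothetical helper for illustration; not part of ssmixtools.
    """
    # os.cpu_count() counts logical cores and may return None.
    logical_cores = os.cpu_count() or 1
    return max(1, logical_cores - reserve)


if __name__ == "__main__":
    print(pick_max_workers())
```

The returned value could then be passed as max_workers to clean().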
Workflow
- Initialization:
Sets up directories and configurations for cleaning.
- Code Mapping:
Maps local clinical codes to standardized codes using reference tables.
- Laboratory Data Cleaning:
Standardizes units and cleans laboratory data for consistency.
Aggregates numerical and non-numerical test results.
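The workflow steps above can be illustrated with a toy example. This is only a sketch of the general idea (code mapping, unit normalization, and separating numerical from non-numerical results), not the package's actual implementation. The mapping table, conversion factors, record layout, and the clean_lab_records function are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical reference table: local lab code -> standardized code.
CODE_MAP = {"L001": "STD-GLU", "L002": "STD-HB"}

# Hypothetical factors converting (code, unit) to a canonical unit (mg/dL here).
UNIT_FACTORS = {
    ("STD-GLU", "mg/dL"): 1.0,
    ("STD-GLU", "g/L"): 100.0,  # 1 g/L = 100 mg/dL
}


def clean_lab_records(records):
    """Split records into numeric, non-numeric, and rejected groups."""
    numeric = defaultdict(list)
    non_numeric = defaultdict(list)
    rejected = []
    for rec in records:
        # Code mapping: drop records whose local code has no standard code.
        std_code = CODE_MAP.get(rec.get("code"))
        value = rec.get("value")
        if std_code is None or value is None:
            rejected.append(rec)
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            # Non-numerical results (e.g. "positive") are kept separately.
            non_numeric[std_code].append(value)
            continue
        # Unit normalization: reject units without a known conversion.
        factor = UNIT_FACTORS.get((std_code, rec.get("unit")))
        if factor is None:
            rejected.append(rec)
            continue
        numeric[std_code].append(number * factor)
    return dict(numeric), dict(non_numeric), rejected


if __name__ == "__main__":
    sample = [
        {"code": "L001", "value": "1.5", "unit": "g/L"},
        {"code": "L001", "value": "95", "unit": "mg/dL"},
        {"code": "L002", "value": "positive", "unit": None},
        {"code": "X999", "value": "1", "unit": "mg/dL"},
    ]
    print(clean_lab_records(sample))
```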
- ssmixtools.cleaning.mandatory.render_maps(reference_dir: str, atc_tables_dir: str, log_dir: str | None = None)
Renders Mapping Tables for Data Cleaning
This function prepares and renders mapping tables required for cleaning clinical data.
Warning
Internet Usage: This function downloads necessary tables from external sources. Ensure a stable internet connection and review the implications of downloading content from these sources.
Terms of Use: Tables are sourced from the Medical Information System Development Center (https://www.medis.or.jp/). Users must comply with MEDIS terms of use before proceeding.
ATC Tables Requirement: This package does not provide ATC-related source tables. Users must prepare these tables themselves. Ensure the prepared tables include mappings from local codes and drug product names to ATC codes. The Japan Pharmaceutical Information Center (JAPIC) provides useful resources.
External Dependencies: The function fetches data from external sources. If the structure of these sources changes, the function may fail and this package will need to be updated accordingly. In that case, please report the issue on GitHub.
Note
Rendered mapping tables are used in subsequent data cleaning processes.
- Parameters:
reference_dir (str) – Directory where the rendered mapping tables will be saved.
atc_tables_dir (str) – Directory containing ATC-related source tables. This directory must contain a CSV file named ‘info_atc.csv’.
log_dir (str, optional) – Directory for saving log files. If None, logs are printed to the console.
- Returns:
The function saves mapping tables and related analytics in the specified reference_dir.
- Return type:
None
Example Usage
from ssmixtools.cleaning.mandatory import render_maps

render_maps(
    reference_dir="/path/to/reference",
    atc_tables_dir="/path/to/atc/tables",
    log_dir="/path/to/log/dir",
)
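Because atc_tables_dir must contain a CSV file named info_atc.csv, it can be convenient to verify this before calling render_maps(). A minimal sketch of such a check follows. The check_atc_tables_dir helper is hypothetical and not part of ssmixtools, and it only tests that the file exists, not that its contents are valid.

```python
from pathlib import Path


def check_atc_tables_dir(atc_tables_dir: str) -> Path:
    """Return the path to info_atc.csv, or raise if the file is missing.

    Hypothetical pre-flight check; not part of ssmixtools.
    """
    atc_file = Path(atc_tables_dir) / "info_atc.csv"
    if not atc_file.is_file():
        raise FileNotFoundError(f"Expected ATC source table at {atc_file}")
    return atc_file
```

Failing early in this way avoids starting the download step only to stop on a missing local file.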