ssmixtools.extraction package
Data extraction utilities for ssmixtools.
- This module provides:
extract(): Extracts data from SS-MIX2 storage.
- ssmixtools.extraction.extract(ssmix_root: str, output_dir: str, pubkey_path: str, identity_tag: str, max_workers: int | None = None, extraction_period_start: str | None = None, extraction_period_end: str | None = None, min_encounters: int = 1, log_dir: str | None = None) → None
Top-Level Module for Extracting Data from SS-MIX2 Storage
This module extracts clinical records from SS-MIX2 storage, processes them, and saves the results in a structured format. The extracted data is encrypted using a hybrid cryptosystem (AES + RSA) for secure handling.
Warning
Keep your private RSA key secure; it is required for decryption.
Ensure the SS-MIX2 root directory is properly structured according to the SS-MIX2 implementation guidelines.
Note
Patient IDs are deidentified by random shuffling and reassignment. The mapping between original and deidentified IDs is stored in the patient ID table.
Deidentification is irreversible if the patient ID table is deleted.
Extracted data is encrypted. To decrypt it, use the RSA private key that pairs with the public key given in pubkey_path.
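The shuffle-and-reassign idea in the note above can be pictured with a minimal sketch. It illustrates the general technique only and is not the ssmixtools implementation: the helper name `build_patient_id_table`, the `seed` argument, and the zero-padded numeric ID format are all assumptions made for this example.

```python
import random


def build_patient_id_table(original_ids, seed=None):
    """Hypothetical helper: map each original ID to a randomly assigned new ID."""
    unique_ids = sorted(set(original_ids))
    rng = random.Random(seed)
    shuffled = unique_ids[:]
    rng.shuffle(shuffled)  # random order decides which new ID each patient gets
    width = len(str(len(shuffled)))
    # The position in the shuffled order becomes the deidentified ID.
    return {orig: str(i).zfill(width) for i, orig in enumerate(shuffled, start=1)}


# Duplicate IDs collapse to one entry; every patient gets exactly one new ID.
table = build_patient_id_table(["P003", "P001", "P002", "P001"], seed=0)

# The table is the only link back to the original IDs.
reverse = {new: orig for orig, new in table.items()}
```

Without `table` (or `reverse`), nothing in the new IDs points back to the originals, which is why deleting the patient ID table makes the deidentification irreversible.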
- Parameters:
ssmix_root (str) – Path to the SS-MIX2 root directory.
output_dir (str) – Directory for saving the extracted files.
pubkey_path (str) – Path to an RSA public key used for encrypting the data.
identity_tag (str) – Identity tag appended to filenames for source identification.
max_workers (int, optional) – Maximum number of workers for multiprocessing. Defaults to the number of physical CPU cores minus 1.
extraction_period_start (str, optional) – Earliest visit date for inclusion in the extraction.
extraction_period_end (str, optional) – Latest visit date for inclusion in the extraction. If at least one of a patient's visits falls between extraction_period_start and extraction_period_end, records of all of that patient's visits are included.
min_encounters (int, optional) – Minimum number of visit dates a patient must have to be included. Defaults to 1.
log_dir (str, optional) – Directory for saving logs generated during extraction.
- Returns:
Extracted data and analytics are saved in the specified output directory.
- Return type:
None
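The interplay of extraction_period_start, extraction_period_end, and min_encounters described in the parameter list can be sketched as follows. This is a reading of the documented behaviour, not the library's code: the function `select_patients` is hypothetical, the period bounds are assumed to be inclusive, and min_encounters is assumed to count distinct visit dates across the patient's whole history.

```python
from datetime import date


def select_patients(visits, period_start=None, period_end=None, min_encounters=1):
    """Hypothetical filter. `visits` maps a patient ID to a list of visit dates."""
    selected = {}
    for patient_id, dates in visits.items():
        distinct = set(dates)
        # Assumption: the threshold counts distinct visit dates over all time.
        if len(distinct) < min_encounters:
            continue
        # Assumption: both period bounds are inclusive; None means unbounded.
        in_period = any(
            (period_start is None or d >= period_start)
            and (period_end is None or d <= period_end)
            for d in distinct
        )
        if in_period:
            # One qualifying visit keeps ALL of the patient's visits.
            selected[patient_id] = sorted(distinct)
    return selected


visits = {
    "A": [date(2010, 5, 1), date(2012, 3, 4)],  # one visit inside the period
    "B": [date(2009, 1, 1)],                    # no visit inside the period
    "C": [date(2015, 6, 1)],                    # single visit inside the period
}
included = select_patients(visits, date(2011, 1, 1), date(2023, 12, 31))
frequent = select_patients(visits, date(2011, 1, 1), date(2023, 12, 31), min_encounters=2)
print(sorted(included))  # ['A', 'C']
print(sorted(frequent))  # ['A']
```

Patient A keeps the 2010 visit even though it predates the period, because the 2012 visit qualifies the patient as a whole.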
Workflow
- Initialization:
Configures the extraction protocol and verifies prerequisites.
- Data Extraction:
Extracts records from SS-MIX2 storage and organizes them by record types.
- Inspection and Deidentification:
Inspects extracted data for quality and deidentifies patient IDs.
- Encryption:
Encrypts the extracted data using the provided RSA public key.
Extracted Records
The following clinical record tables are extracted:
Clinical Record Tables
File Name Pattern    Contents
ADT-22_*.bin         Admission records
ADT-52_*.bin         Discharge records
ADT-12_*.bin         Outpatient visits
PPR-01_*.bin         Diagnosis records
OMP-01_*.bin         Prescription orders
OMP-02_*.bin         Injection orders
OML-11_*.bin         Laboratory test results
Additionally, the following tables are created:
Additional Tables
File Name Pattern         Contents
patient_id_*.bin          Pairs of original and deidentified patient IDs
DOB_*.bin                 Patients' dates of birth
patient_metadata_*.bin    Patient metadata including sex
Example Usage
Generate RSA keys
ssh-keygen -t rsa -b 2048 -f my_rsa_key
Extract Data
import ssmixtools

def main():
    ssmixtools.extraction.extract(
        ssmix_root='/path/to/ssmix/root',
        output_dir='/directory/for/saving/extracted/files',
        pubkey_path='/path/to/your/rsa/public/key',
        identity_tag='UT',
        max_workers=None,
        extraction_period_start='2011/01/01',
        extraction_period_end='2023/12/31',
        min_encounters=1,
        log_dir='/directory/for/leaving/logs'
    )

if __name__ == "__main__":
    main()
Output Directory Structure
output_dir
├── analytics
└── tables
    ├── patient_id_UT.bin
    ├── DOB_UT.bin
    ├── patient_metadata_UT.bin
    ├── ADT-12_0_UT.bin
    ├── ...
    └── OML-11_76_UT.bin
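To take stock of what an extraction produced, the files under tables can be grouped by table type from their names alone, without decrypting anything. The sketch below assumes the naming seen in the listing above, that is `<table>_<index>_<identity_tag>.bin` for clinical record tables and `<table>_<identity_tag>.bin` for the additional tables. The helper `group_tables` is hypothetical and not part of ssmixtools.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_tables(tables_dir, identity_tag):
    """Hypothetical helper: group encrypted table files by table type."""
    suffix = f"_{identity_tag}.bin"
    groups = defaultdict(list)
    for path in sorted(Path(tables_dir).glob(f"*{suffix}")):
        stem = path.name[: -len(suffix)]  # e.g. "ADT-12_0" or "patient_id"
        head, sep, tail = stem.rpartition("_")
        # Assumption: a trailing numeric part is a chunk index, not part of the name.
        key = head if sep and tail.isdigit() else stem
        groups[key].append(path.name)
    return dict(groups)


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["patient_id_UT.bin", "DOB_UT.bin", "ADT-12_0_UT.bin",
                 "ADT-12_1_UT.bin", "OML-11_76_UT.bin"]:
        (Path(tmp) / name).touch()
    grouped = group_tables(tmp, "UT")

for table, files in sorted(grouped.items()):
    print(table, len(files))
```

With a real extraction, `tables_dir` would be the tables folder inside output_dir and `identity_tag` the value passed to extract().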