ssmixtwins package
- ssmixtwins.create_ssmix(source_dir: str, output_dir: str, max_workers: int = 1, already_validated: bool = False, encoding: str = 'iso2022_jp', n_physicians: int = 30) → None
Create SSMIX dataset from CSV files.
This function reads CSV files from the source directory, validates them, and generates SSMIX dataset files. The SSMIX root directory will be created as “<output_dir>/ssmixtwins”. All CSV files in the source directory are validated first, which may take some time. Parsing and file generation start only if every CSV file is valid.
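A minimal sketch of a call, assuming the package is installed and importable as `ssmixtwins`. The helpers `pick_workers` and `convert` are hypothetical and not part of the package; the worker heuristic is only one possible choice.

```python
import os


def pick_workers(n_files: int) -> int:
    """Hypothetical heuristic: at most one worker per file, capped at the CPU count."""
    cpus = os.cpu_count() or 1
    return max(1, min(cpus, n_files))


def convert(source_dir: str, output_dir: str) -> None:
    """Validate the CSV files in source_dir, then write <output_dir>/ssmixtwins."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from ssmixtwins import create_ssmix

    n_files = sum(1 for name in os.listdir(source_dir) if name.endswith(".csv"))
    create_ssmix(
        source_dir,
        output_dir,
        max_workers=pick_workers(n_files),
        already_validated=False,  # keep the validation pass enabled
    )
```

Calling `convert("csv_in", "out")` (both directory names are placeholders) would leave the dataset under `out/ssmixtwins` if validation succeeds.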
- Parameters:
source_dir (str) –
Directory containing CSV files. Each CSV file must be named “<start_age: integer from 0 to 120>_<patient sex: one of M, F, U, O, or N>_<file_number>.csv”. The “file_number” part is optional and serves only to avoid file name collisions. It is not used in the SSMIX data, so it can be anything, for example “64_M_1a5d9f7892cd437fb6f9b22bba876dfa.csv”. Each file must contain the data of a single patient. Each table has the following columns:
“timestamp” (str): Timestamp of the event in “YYYYMMDDHHMMSSFFFFFF” format.
“type” (int): The record type. 0 for admissions, 1 for discharges, 2 for outpatient visits, 3 for diagnoses, 4 for prescription orders, 5 for injection orders, and 6 for laboratory tests.
“text” (str): Textual description of the event (e.g., diagnosis name for diagnoses, medication name for prescription orders).
“icd10” (str): ICD-10 code for the diagnosis (only for rows with type==3, empty otherwise).
“mdcdx2” (str): MDCDX2 code for the diagnosis (only for rows with type==3, empty otherwise).
“provisional” (str): “1” if the diagnosis is provisional (only for rows with type==3, empty otherwise).
“hot” (str): HOT drug code for prescription and injection orders (only for rows with type==4 or 5, empty otherwise).
“jlac10” (str): JLAC10 code for laboratory tests (only for rows with type==6, empty otherwise).
“lab_value” (str): Laboratory test value (only for rows with type==6, empty otherwise).
“unit” (str): Unit of the laboratory test value (only for rows with type==6, empty otherwise).
“discharge_disposition” (str): Discharge disposition code (only for rows with type==1, empty otherwise).
output_dir (str) – Directory to save the SSMIX dataset. The SSMIX root directory will be created as “<output_dir>/ssmixtwins”. Error files may be saved in this directory too.
max_workers (int) – Maximum number of workers for parallel processing. Default is 1, which means no parallel processing. This works, but it may take a long time when there are many files, so a higher value is recommended in that case. For example, processing 1000 CSV files with 10 workers finishes in a short time. The appropriate value depends on your CPU and memory resources; choose one that does not overwhelm your system.
already_validated (bool) – If True, skip validation of CSV files. By default, this function first loads and validates all CSV files in the source directory and proceeds only if they are valid. Set this to True only if you have already validated the CSV files.
encoding (str) – Encoding to use when saving the files. Default is “iso2022_jp”.
n_physicians (int) – Number of random physicians to generate. Default is 30. Physicians appearing throughout the generated data are selected at random from these generated physicians. The default value is usually sufficient, but you can increase or decrease it as needed.
- Returns:
None
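The input conventions described under source_dir can be sketched with the standard library. This is an illustration only and has not been checked against the package's own validator. The names `is_valid_name`, `format_timestamp`, `make_row` and `write_patient_csv` are hypothetical. Two points are assumptions: that the underscore before “file_number” is dropped when that part is omitted, and that input files are read as UTF-8.

```python
import csv
import re
from datetime import datetime
from pathlib import Path

# Column names as listed in the source_dir description.
COLUMNS = ["timestamp", "type", "text", "icd10", "mdcdx2", "provisional",
           "hot", "jlac10", "lab_value", "unit", "discharge_disposition"]

# <start_age>_<sex>[_<file_number>].csv (optional suffix form is an assumption)
FILENAME_RE = re.compile(r"^(\d{1,3})_([MFUON])(?:_(.+))?\.csv$")


def is_valid_name(name: str) -> bool:
    """Check the file name pattern and the 0-120 age range."""
    m = FILENAME_RE.match(name)
    return bool(m) and 0 <= int(m.group(1)) <= 120


def format_timestamp(dt: datetime) -> str:
    """Render a datetime as YYYYMMDDHHMMSSFFFFFF (20 digits)."""
    return dt.strftime("%Y%m%d%H%M%S%f")


def make_row(dt: datetime, record_type: int, text: str = "", **extra: str) -> dict:
    """Build one row; columns that do not apply to the record type stay empty."""
    row = dict.fromkeys(COLUMNS, "")
    row.update(timestamp=format_timestamp(dt), type=record_type, text=text)
    row.update(extra)
    return row


def write_patient_csv(path: Path, rows: list) -> None:
    """Write the rows of a single patient (UTF-8 is an assumption)."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

For instance, `make_row(datetime(2024, 1, 2, 9, 30), 2)` builds an outpatient visit row (type 2) in which every code column is left empty.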