Sepsis onset calculation from MIMIC-III data using sepsis-3 criteria
In this blog post, I am going to desribe how we can calculate sepsis onset time using sepsis-3 criteria for the MIMIC-III patient data.
- Reading the relevant tables
- Sepsis-3 Criteria
- Operationalization in code
- Identifying the onset of suspected infection
- Identifying organ failure/dsyfunction
- Finding sepsis onset time
- Code
- Data access
- Assumptions
- References
Reading the relevant tables
We first read the data from the CSV files (corresponding to the relevant tables):
import pandas as pd
import numpy as np
MIMIC_DIR = "C:\\Users\\amits\\work\\datasets\\MIMIC-III-v1.4"
patients = pd.read_csv(f"{MIMIC_DIR}\\PATIENTS.csv.gz")
admissions = pd.read_csv(f"{MIMIC_DIR}\\ADMISSIONS.csv.gz", parse_dates=['ADMITTIME'])
diagnoses_icd = pd.read_csv(f"{MIMIC_DIR}\\DIAGNOSES_ICD.csv.gz")
icustays = pd.read_csv(f"{MIMIC_DIR}\\ICUSTAYS.csv.gz")
micro = pd.read_csv(f"{MIMIC_DIR}\\MICROBIOLOGYEVENTS.csv.gz", parse_dates=['CHARTDATE'], low_memory=False)
prescriptions = pd.read_csv(f"{MIMIC_DIR}\\PRESCRIPTIONS.csv.gz", parse_dates=['STARTDATE'], low_memory=False)
chartevents = pd.read_csv(f"{MIMIC_DIR}\\CHARTEVENTS.csv.gz",
usecols=['SUBJECT_ID', 'HADM_ID', 'CHARTTIME', 'ITEMID', 'VALUENUM'],
chunksize=1000000, # chartevents is big, so we read one chunk at a time
low_memory=False,
parse_dates=['CHARTTIME']
) # 330712483 rows
labevents = pd.read_csv(f"{MIMIC_DIR}\\LABEVENTS.csv.gz",
usecols=['SUBJECT_ID', 'HADM_ID', 'CHARTTIME', 'ITEMID', 'VALUENUM'],
chunksize=1000000,
low_memory=False,
parse_dates=['CHARTTIME']
) #27854055 rows
Sepsis-3 Criteria
From The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3):
Sepsis should be defined as life-threatening organ dysfunction caused by a dysregulated host response to infection
For clinical operationalization, organ dysfunction can be represented by an increase in the Sequential [Sepsis-related] Organ Failure Assessment (SOFA) score of 2 points or more, which is associated with an in-hospital mortality greater than 10%.
Another key point that’s worth mentioning from the above is:
As described later, even a modest degree of organ dysfunction when infection is first suspected is associated with an in-hospital mortality in excess of 10%. Recognition of this condition thus merits a prompt and appropriate response.
We will henceforth refer to infection detection as “suspected infection detection”.
Operationalization in code
To implement it, we will need to:
- Identify the onset of suspected infection
- Use the SOFA score and its increase magnitude
Identifying the onset of suspected infection
To identify the onset of suspected infection, we will:
- Find the time of drawing of first culture for a particular HADM_ID (which uniquely identifies a patient’s specific admission)
- Find the time of administering of antibiotics
- And if they are within 24 hours of each other (irrespective of order), we consider that there is an onset of suspected infection.
Time of drawing of the first culture
# ['CHARTDATE'].min() computes the minimum (earliest) CHARTDATE for each HADM_ID group
# since we already prased the CHARTDATE as datetime during reading, the min function
# works as intended
first_culture = micro.groupby('HADM_ID')['CHARTDATE'].min().reset_index()
first_culture.columns = ['HADM_ID', 'first_culture_time']
Time of administering of antibiotics
Next, we look through the prescriptions dataframe’s DRUG column and filter rows which contain one of the antibiotics
commonly identified for infection onset:
antibiotics = ['Vancomycin', 'Ceftriaxone', 'Meropenem', 'Piperacillin-Tazobactam', 'Levofloxacin']
extended_abx = [
'Cefepime', 'Azithromycin', 'Ampicillin', 'Ampicillin-Sulbactam',
'Clindamycin', 'Linezolid', 'Metronidazole', 'Cefuroxime', 'Tobramycin',
'Gentamicin', 'Imipenem', 'Imipenem-Cilastatin', 'Ciprofloxacin'
]
all_abx = antibiotics + extended_abx
abx_lc = [a.lower() for a in all_abx]
drug_lc = prescriptions['DRUG'].fillna('').str.lower()
# Create a boolean mask: True if any antibiotic name is a substring of the DRUG string
mask = drug_lc.apply(lambda s: any(abx in s for abx in abx_lc))
abx_given = prescriptions[mask]
first_abx = abx_given.groupby('HADM_ID')['STARTDATE'].min().reset_index()
first_abx.columns = ['HADM_ID', 'first_abx_time']
Identifying infection onset
Now that we have identified the first instance of administering an antibiotic and drawing blood culture, we now join the two dataframes based on HADM_ID:
suspected_infection = pd.merge(first_abx, first_culture, on='HADM_ID')
Next, we find the delta between the two times and store it as a separate column:
suspected_infection['delta'] = (
(suspected_infection['first_abx_time'] - suspected_infection['first_culture_time'])
.dt.total_seconds() / 3600
)
Finally, we filter the data to store only those for which the absolute value of delta is within 24 hours:
suspected_infection = suspected_infection[suspected_infection['delta'].abs() <= 24]
We add the suspected infection time as a separate column storing the earlier of the two times:
suspected_infection['infection_time'] = suspected_infection[['first_abx_time', 'first_culture_time']].min(axis=1)
Identifying organ failure/dsyfunction
Next, we want to identify organ failures for calculating the sequential organ failure assessment score (SOFA score). It comprises six subscores corresponding to the condition (See reference 2) of the:
- Kidney (renal)
- Coagulation
- Liver
- Central nervous system
- Cardiovascular system
- Respiratory system
To measure the first three, we will query the labevents table for the following item ids:
- Kidney - creatinine (item id: 50912)
- Coagulation - platelet count (item id: 51265)
- Liver - bilirubin total (itemid: 50885)
We read one chunk at a time of the table, group the data into one of four bins based on their value and then associate a severity label with them (0, 1, 2, 3, 4 - normal to abnormal).
For example, for renal, we do this:
creat_itemid = 50912
renal_rows = []
# Renal - Creatinine
cr = chunk[chunk['ITEMID'] == creat_itemid].copy()
cr['renal_score'] = pd.cut(
cr['VALUENUM'],
bins=[-float('inf'), 1.2, 1.9, 3.4, 4.9, float('inf')], # bins for creatinine
labels=[0, 1, 2, 3, 4] # assignemnt of severity score
).fillna(0).astype(int)
renal_rows.append(cr[['HADM_ID', 'CHARTTIME', 'renal_score']])
Once we have read all the data, stored the data for each chunk as a row in the list, we then create dataframe for each of the three measurements:
# Concatenate results
renal_scores = pd.concat(renal_rows)
coag_scores = pd.concat(coag_rows)
liver_scores = pd.concat(liver_rows)
We see that for some of the measurement, the HADM_ID is NaN, these are data for outpatients,
and will be filtered automatically later on.
Next, we will calculate the remaining three:
- Central nervous system
- Cardiovascular system
- Respiratory system
For the central nervous system, we calculate the Glasgow Coma Scale (GCS) score,
measuring the following by reading the chartevents table:
A. Eye response (item id: 223900, 220739) B. Verbal response (item id: 223901) C. Motor response (item id: 223902)
We read the data one chunk at a time, and accumulate them in a list:
gcs_rows = []
gcs_chunk = chunk[chunk['ITEMID'].isin(
gcs_itemids['EYE'] + gcs_itemids['VERBAL'] + gcs_itemids['MOTOR']
)].copy()
gcs_rows.append(gcs_chunk)
Then, we create a dataframe, and process the individual scores and calculate the final score that will be used for SOFA calculation:
gcs_all = pd.concat(gcs_rows)
gcs_all = gcs_all.dropna(subset=['VALUENUM'])
# Take the maximum when multiple measurements exist for the hour
gcs_all['CHARTTIME'] = pd.to_datetime(gcs_all['CHARTTIME']).dt.floor('h')
gcs_pivot = gcs_all.pivot_table(index=['HADM_ID', 'CHARTTIME'],
columns='ITEMID', values='VALUENUM', aggfunc='max').reset_index()
gcs_pivot.columns.name = None
# Rename columns for clarity
gcs_pivot = gcs_pivot.rename(columns={
223900: 'eye',
220739: 'eye_alt',
223901: 'verbal',
223902: 'motor'
})
# Fill missing eye from alternate
gcs_pivot['eye'] = gcs_pivot['eye'].combine_first(gcs_pivot['eye_alt'])
# Ensure all three components exist
for col in ['eye', 'verbal', 'motor']:
if col not in gcs_pivot.columns:
gcs_pivot[col] = np.nan
# Total GCS calculation (NaNs allowed)
gcs_pivot['gcs_total'] = gcs_pivot[['eye', 'verbal', 'motor']].sum(axis=1, min_count=1)
# Fill fully missing GCS with 15 (assume alert)
# TODO alternative - forward fill
gcs_pivot['gcs_total'] = gcs_pivot['gcs_total'].fillna(15)
# CNS SOFA scoring bins and severity allocation
gcs_pivot['cns_score'] = pd.cut(
gcs_pivot['gcs_total'],
bins=[-float('inf'), 5, 8, 11, 14, 15],
labels=[4, 3, 2, 1, 0]
).astype(int)
# Final GCS score output
cns_scores = gcs_pivot[['HADM_ID', 'CHARTTIME', 'cns_score']]
For Cardio vascular system score, we read the following data from different tables:
- Mean arterial pressure (MAP) from
chartevents - Vasopressors from
inputevents_mvandinputevents_cv- Dopamine
- Dobutamine
- Epinephrine
- Norepinephrine
Based on the rate of the vasopressor administration and the MAP, a severity score is assigned. See reference [2], Page 7 for the severity score assignment.
To see the calculation of the cardiovascular score, see the notebook, Cardiovascular Score Standalone.ipynb in the associated GitHub repository for this post.
For calculating the respiratory system score, we calculate the P/F ratio (P: PaO2 - Partial pressure of oxygen), (F: Fraction of inspired oxygen), by reading the following data:
- Fraction of inspired oxygen (FiO2) from
charevents(item id: 223835, 3420) - Partial pressure of oxygen(PaO2) from
labevents(item id: 50821)
Calculation of total SOFA score
Once we have the subscores, we calculate the total SOFA score:
from functools import reduce
score_dfs = [
renal_scores,
coag_scores,
liver_scores,
cardio_scores,
cns_scores,
resp_scores
]
# Merge all on HADM_ID + CHARTTIME
sofa_df = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['HADM_ID', 'CHARTTIME']), score_dfs)
# Fill missing scores with 0
for col in sofa_df.columns:
if '_score' in col:
sofa_df[col] = sofa_df[col].fillna(0).astype(int)
# Calculate total SOFA score
sofa_df['total_sofa_score'] = sofa_df[[col for col in sofa_df.columns if '_score' in col]].sum(axis=1)
Finding sepsis onset time
Now, we have the suspected infection time and the organ dysfunction data via the SOFA score, we can now use the Sepsis-3 criteria to detect the sepsis onset:
# Merge SOFA with infection time
merged = pd.merge(sofa_df, suspected_infection[['HADM_ID', 'infection_time']], on='HADM_ID', how='inner')
# Time diff in hours for each row
merged['time_diff_hours'] = (merged['CHARTTIME'] - merged['infection_time']).dt.total_seconds() / 3600
# Pre- and post-infection windows (24 h before and after)
pre_window = merged[(merged['time_diff_hours'] >= -24) & (merged['time_diff_hours'] < 0)]
post_window = merged[(merged['time_diff_hours'] >= 0) & (merged['time_diff_hours'] <= 48)]
# Get baseline SOFA (lowest in 24h before)
baseline_sofa = pre_window.groupby('HADM_ID')['total_sofa_score'].min().reset_index()
baseline_sofa.columns = ['HADM_ID', 'baseline_sofa']
# Join baseline with post-infection SOFA
post_with_baseline = pd.merge(post_window, baseline_sofa, on='HADM_ID', how='left')
post_with_baseline['sofa_delta'] = post_with_baseline['total_sofa_score'] - post_with_baseline['baseline_sofa']
# First time SOFA delta ≥ 2 = Sepsis Onset
sepsis_onset = (
post_with_baseline[post_with_baseline['sofa_delta'] >= 2]
.sort_values(['HADM_ID', 'CHARTTIME'])
.groupby('HADM_ID')
.first() # take the earliest charttime for the HADM_ID
.reset_index()
)
# Output: earliest time of sepsis
sepsis_onset = sepsis_onset[['HADM_ID', 'CHARTTIME']].rename(columns={'CHARTTIME': 'sepsis_onset_time'})
With MIMIC-III data, the sepsis onset dataframe should have the shape:
sepsis_onset.shape
(1731, 2)
Code
You can find jupyter notebooks here:
Data access
See here.
Assumptions
- The sepsis onset is determined at an admission level, rather than an ICU stay level