Predict Patient Outcome for ICU Stays

Michael Tang
Nov 24, 2021

The ICU (Intensive Care Unit) is the unit of a hospital that most frequently encounters life and death. For patients’ family members, waiting hours outside the ICU with no clear sense of the outcome is an exhausting experience. At the same time, given the large number of procedures and medications each patient goes through, health workers often do not know which procedures or medications are most likely to alleviate or worsen a patient’s symptoms. In this blog, I will go over a group project I recently worked on: we built a machine learning model that predicts a patient’s outcome after an ICU stay from a variety of variables. What’s more, the model was deployed to a Streamlit dashboard that lets people analyze and predict ICU stay outcomes in real time.


The data used for this analysis comes from PhysioNet, which was established as a sub-organization of the National Institutes of Health (NIH). One of PhysioNet’s missions is to catalyze biomedical research and education through its large collections of open-source physiological and clinical data. For this project, we used the MIMIC-III clinical dataset. MIMIC stands for Medical Information Mart for Intensive Care, and the database contains de-identified data associated with ICU stays, including demographics, laboratory test results, and the medications associated with each patient.

After reviewing past literature to understand which factors may be most crucial for predicting patient outcomes in the ICU, we selected variables that represent various aspects of the patient and their stay: prescribed medications, surgical procedures, ICU admission type, admission location, organ failure, insurance, imaging, etc. To make the data more suitable for downstream model fitting, we applied one-hot encoding to categorical variables such as diagnosis. In addition, for patients who underwent procedures multiple times, we summarized their history by keeping only the eight most frequent procedures and counting the number of times each patient underwent each of them.
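These two feature-engineering steps can be sketched with pandas. The column names and toy values below are made up for illustration; the real MIMIC-III tables use different identifiers:

```python
import pandas as pd

# Hypothetical columns for illustration; the MIMIC-III tables differ.
admissions = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "diagnosis": ["SEPSIS", "PNEUMONIA", "STROKE"],
})
procedures = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 3],
    "procedure": ["vent", "dialysis", "vent", "vent", "cabg"],
})

# One-hot encode categorical variables such as diagnosis.
diag_onehot = pd.get_dummies(admissions, columns=["diagnosis"], prefix="diag")

# Keep only the most frequent procedures (8 in the project; fewer exist in
# this toy data) and count how often each patient underwent each one.
top = procedures["procedure"].value_counts().nlargest(8).index
proc_counts = (procedures[procedures["procedure"].isin(top)]
               .groupby(["subject_id", "procedure"]).size()
               .unstack(fill_value=0)
               .reset_index())

features = diag_onehot.merge(proc_counts, on="subject_id", how="left").fillna(0)
```

The count features preserve how often a patient underwent a procedure, which a plain one-hot indicator would discard.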


The model construction consisted of two stages. In the first stage, we used the demo dataset to validate the idea of predicting mortality. The demo dataset is identical to the original dataset except that it contains only 100 patients randomly selected from it. We first split the demo data into training and testing sets, then fit eight different models, including a dummy baseline, on the training set. Since the labels were imbalanced, with more positive labels (death) than negative ones, we decided that AUC would be a better metric than accuracy for evaluating performance on the validation set. The final results are summarized in the table below:
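The comparison loop looks roughly like the following sketch. The synthetic data and the three models shown stand in for the real demo features and the eight candidates the project actually compared:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the engineered demo features (100 patients).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Three of the candidate models; the project compared eight in total,
# including this dummy baseline.
models = {
    "dummy": DummyClassifier(strategy="prior"),
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}

# With imbalanced labels, AUC is more informative than raw accuracy.
results = {name: cross_val_score(m, X_train, y_train, cv=5,
                                 scoring="roc_auc").mean()
           for name, m in models.items()}
print(results)
```

The dummy baseline scores an AUC of 0.5 by construction, which gives a floor against which the real models can be judged.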

There were two takeaways from the table above. First, the results validated the idea of predicting mortality with the features we used, such as diagnosis, insurance, and admission location. Second, the ensemble methods (Random Forest, XGBoost, CatBoost) outperformed the other models.

With these findings in mind, we moved on to the complete dataset. Due to limits on computational power, we chose the Random Forest Classifier and the XGBoost Random Forest Classifier, since random forests are less susceptible to high-variance problems. The results of these two classifiers trained on the complete dataset are shown below:

The results improved drastically over the models trained on the demo dataset, which makes sense given that far more data was involved in training. We also performed stacking on these two models, which brought the AUC to 0.877. Finally, we ran a grid search on the XGBoost Random Forest Classifier. The following figures demonstrate the model’s performance:
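Stacking and grid search can both be sketched with scikit-learn. Note that GradientBoostingClassifier is used here only to keep the sketch scikit-learn-only; the project actually stacked RandomForestClassifier with XGBoost’s XGBRFClassifier, and the toy data and parameter grid are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Stack the two base models under a logistic-regression meta-learner,
# which combines their out-of-fold predictions into one final score.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
stack_auc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])

# Grid search over a small hyper-parameter grid, optimizing AUC.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 100],
                                "max_depth": [None, 10]},
                    scoring="roc_auc", cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```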

Based on the confusion matrix above, we can see that the grid-search-optimized model makes more type I errors (false positives) than type II errors (false negatives). We calculated the model’s precision and recall, which are shown below:

The table shows that the model’s precision is noticeably lower than its recall. This is desirable in a clinical setting, since we want to minimize the chance of false negatives. Finally, we evaluated the model’s performance on the test set, which yielded an accuracy of 90.38%.
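The relationship between the confusion matrix and these two metrics can be made concrete with a small example. The labels below are made up (they are not the project’s test set), chosen so that, as in our model, false positives outnumber false negatives and recall exceeds precision:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)  # of predicted deaths, how many were real
recall = tp / (tp + fn)     # of real deaths, how many were caught

# Matches scikit-learn's built-in metrics.
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
```

Here tp=3, fp=2, fn=1: precision is 3/5 = 0.6 while recall is 3/4 = 0.75, i.e. the model errs toward false alarms rather than missed deaths.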


After deploying our best-performing model to a Streamlit app and building an interface around it, we now have a dashboard composed of two parts: EDA and prediction. On entering the app, a user can examine value distributions through two drop-down menus, one for a continuous variable and the other for a categorical variable. The page then shows a graph for each selected variable, colored by the outcome of interest (survived or died). Scrolling down, the user finds additional drop-down menus, one for each variable used in the model. After values are selected for each drop-down, a predicted probability of death is shown below, summarized as “Patient has low/medium/high risk of death.” (low: <15, medium: >=15 and <20, high: >=20)
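The final thresholding step can be sketched as a small helper using the cut-offs quoted above. In the real app the input would come from the model’s predicted probability and the message would be rendered by Streamlit; the function name here is hypothetical:

```python
def risk_bucket(prob_death: float) -> str:
    """Map a predicted death probability to the dashboard's summary message,
    using the cut-offs 15 and 20 from the app."""
    if prob_death < 15:
        return "Patient has low risk of death."
    if prob_death < 20:
        return "Patient has medium risk of death."
    return "Patient has high risk of death."
```

Bucketing the raw probability into three labels trades precision for readability, which suits an audience of patient families rather than clinicians.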


Working on this project, we leveraged tools and methods including relational databases, data cleaning, feature engineering, machine learning models, hyper-parameter tuning, code version control, and model deployment on Streamlit. We were able to use real-world data collected from a renowned hospital and to implement advanced machine learning models to predict patient outcomes. We believe our work and the dashboard can help patients’ families as well as healthcare workers make the best use of the information they have about a patient and be more informed and involved in the hospitalization process.




M.S. in Data Science candidate, 2022 @ Duke University | Biomedical Engineer | Workout Enthusiast