This repository contains a Data Science assessment project completed for the GP8R46 NPA Data Science course. The goal of this project is to act as a Data Scientist for the Scottish Government to assist in the COVID-19 pandemic response.
By analyzing daily published statistics, this project aims to predict the number of patients requiring treatment in an Intensive Care Unit (ICU), enabling the NHS to allocate resources effectively.
The analysis uses a distilled version of openly licensed relational data from the Scottish Government.
- Data Source: `covid19.csv`
- Key Features: Dates, First/Second Doses, Hospital Admissions, Positive Tests, and ICU numbers.
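
As a rough illustration of how the dataset might be loaded and checked, here is a minimal sketch using Pandas. The helper names `load_covid_data` and `summarize_missing` and the column name `date` are assumptions made for this example; the actual column names in `covid19.csv` may differ.

```python
import pandas as pd


def load_covid_data(source):
    """Load the dataset from a path (e.g. "covid19.csv") or a file-like object.

    The "date" column name is an assumption; it is parsed only if present.
    """
    df = pd.read_csv(source)
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"])
    return df


def summarize_missing(df):
    """Return the number of missing values in each column."""
    return df.isna().sum()
```

Calling `summarize_missing(load_covid_data("covid19.csv"))` would then show which columns need attention before any modeling.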
- Python 🐍
- Pandas (Data Manipulation)
- NumPy (Numerical Analysis)
- Matplotlib & Seaborn (Data Visualization)
- Scikit-Learn (Machine Learning)
- Data Cleaning: Handling missing values and structuring the dataset for analysis.
- Exploratory Data Analysis (EDA):
  - Statistical summary of the data.
  - Visualizing relationships between variables (e.g., Second Dose vs. ICU, Positive Tests vs. ICU).
  - Correlation analysis to identify key predictors.
- Feature Selection: Identified `positive_tests` as the feature most strongly correlated with ICU admissions.
- Machine Learning:
  - Splitting data into Training (90%) and Testing (10%) sets.
  - Training a Linear Regression model.
  - Evaluating model performance using $R^2$ scores.
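
The steps above can be sketched roughly as follows. This is not the notebook's actual code: the function names, the target column name `icu`, the choice to drop rows with missing values, and the fixed random seed are all assumptions made for illustration. Only `positive_tests` and the 90/10 split come from the description above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def strongest_predictor(df, target="icu"):
    """Return the numeric column with the largest absolute correlation to the target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr.abs().idxmax()


def fit_linear_baseline(df, feature="positive_tests", target="icu", seed=0):
    """Fit a single-feature Linear Regression on a 90/10 train/test split.

    Returns the fitted model plus its R^2 score on the training and test sets.
    """
    # Assumption: rows with missing values are simply dropped.
    clean = df.dropna(subset=[feature, target])
    X_train, X_test, y_train, y_test = train_test_split(
        clean[[feature]], clean[target], test_size=0.1, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # For regressors, scikit-learn's score() returns the R^2 coefficient.
    return model, model.score(X_train, y_train), model.score(X_test, y_test)
```

With a loaded DataFrame `df`, `strongest_predictor(df)` gives the candidate feature and `fit_linear_baseline(df)` returns the training and testing $R^2$ scores for comparison.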
The project compares two modeling approaches to predict ICU numbers:
- Linear Regression:
  - Training Score: ~91.4%
  - Testing Score: ~87.8%
  - Observation: The model provides a decent baseline but struggles with complex patterns.
- Random Forest:
  - Training Score: ~99.0%
  - Testing Score: ~94.7%
  - Conclusion: The Random Forest model outperformed Linear Regression. It was able to capture non-linear relationships in the data (e.g., high vaccination rates dampening the effect of positive cases on ICU admissions).
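
A comparison like the one above could be reproduced along these lines. This is a sketch under assumptions rather than the notebook's code: the function name `compare_models`, the target column `icu`, the Random Forest settings, and the set of input features are all hypothetical, since the source does not state which features or hyperparameters each model used.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def compare_models(df, features, target="icu", seed=0):
    """Fit both models on the same 90/10 split and return their R^2 scores."""
    # Assumption: rows with missing values in the used columns are dropped.
    clean = df.dropna(subset=[*features, target])
    X_train, X_test, y_train, y_test = train_test_split(
        clean[features], clean[target], test_size=0.1, random_state=seed
    )
    models = {
        "Linear Regression": LinearRegression(),
        # Assumed hyperparameters; the notebook's settings are not stated.
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = {
            "train": model.score(X_train, y_train),
            "test": model.score(X_test, y_test),
        }
    return scores
```

Using the same split for both models keeps the comparison fair. A large gap between a model's training and testing scores is worth checking as a possible sign of overfitting.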
- Clone this repository.
- Ensure you have the required libraries installed: `pip install pandas numpy matplotlib seaborn scikit-learn`
- Open the Jupyter Notebook `Covid19_ICU_Prediction_Analysis.ipynb` to view the analysis and code.
This evidence was produced for the Combined J2G246 Data Science, J2HN46 Data Citizenship & J2G646 Machine Learning Assessment.