samansiddiqui55/Diabetes-EDA-and-Classification

Diabetes EDA and Classification

This project focuses on analyzing and classifying diabetes using the Pima Indians Diabetes dataset. It includes comprehensive Exploratory Data Analysis (EDA), feature treatment, and machine learning model evaluation to predict the presence of diabetes.

🧾 Objective

  • Clean and prepare healthcare data for analysis
  • Discover patterns using EDA and statistical methods
  • Build and evaluate classification models
  • Reduce false negatives using threshold tuning

📊 Dataset

  • Source: Pima Indians Diabetes Dataset
  • Rows: 768
  • Features: 8 medical features + Outcome (0: No Diabetes, 1: Diabetes)

βš™οΈ Data Cleaning

  • Zero values in features like Glucose, BloodPressure, SkinThickness, etc., were treated as missing.
  • Replaced zero values with:
    • Median for most features
    • Mean for Insulin due to skewness
  • Features were standardized (zero mean, unit variance) using StandardScaler.
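
The cleaning steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the small DataFrame holds made-up values, and the helper `clean` and its `zero_cols` and `mean_cols` parameters are hypothetical names introduced only for this example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample using the dataset's column names (values are illustrative only).
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "Insulin": [0, 0, 94, 168, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

def clean(frame, zero_cols, mean_cols=("Insulin",)):
    """Treat zeros in zero_cols as missing, then impute with the mean or median."""
    out = frame.astype(float)
    for col in zero_cols:
        values = out[col].replace(0, np.nan)  # zero means "not measured" here
        fill = values.mean() if col in mean_cols else values.median()
        out[col] = values.fillna(fill)
    return out

cleaned = clean(df, zero_cols=["Glucose", "BloodPressure", "Insulin"])

# Standardize every column to zero mean and unit variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(cleaned), columns=cleaned.columns
)

print(cleaned.loc[2, "Glucose"])  # 113.0, the median of 148, 85, 89, 137
print(cleaned.loc[0, "Insulin"])  # 131.0, the mean of 94 and 168
```

When the data is split into training and test sets, fitting the imputation values and the scaler on the training split only avoids leaking test-set statistics into the model.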

📈 Exploratory Data Analysis (EDA)

  • Point-biserial correlation was used to measure how strongly each continuous feature relates to the binary Outcome.
  • Top influential features: Glucose, BMI, Insulin
  • Visualizations:
    • Histograms
    • Box plots (outlier detection)
    • Pair plots (class separation)
    • Heatmaps (feature correlation)
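
As a rough sketch of the point-biserial step, the snippet below ranks features by the absolute value of their correlation with a binary target using `scipy.stats.pointbiserialr`. The data is synthetic, and the `rank_features` helper and the "Noise" feature are hypothetical names used only for illustration.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Synthetic stand-in data: one feature that depends on the outcome, one that does not.
rng = np.random.default_rng(0)
outcome = rng.integers(0, 2, size=200)
glucose = 100 + 30 * outcome + rng.normal(0, 15, size=200)
noise = rng.normal(0, 1, size=200)

def rank_features(features, target):
    """Return (name, r) pairs sorted by |r|, strongest relationship first."""
    scores = {
        name: pointbiserialr(target, values)[0]  # [0] is the correlation coefficient
        for name, values in features.items()
    }
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

ranking = rank_features({"Glucose": glucose, "Noise": noise}, outcome)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

The point-biserial coefficient is numerically the same as the Pearson correlation between a 0/1 variable and a continuous one, so it can be read on the same -1 to 1 scale.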

🤖 Model Training & Evaluation

| Model | Accuracy | Notes |
| --- | --- | --- |
| K-Nearest Neighbors (k=35) | 83% | Best performing baseline model |
| Support Vector Machine | 77% | Linear kernel |
| Random Forest | 82% | Balanced precision and recall |
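
A comparison of the three models might look like the sketch below. It assumes a synthetic dataset from `make_classification` in place of the real data, an 80/20 stratified split, and default hyperparameters apart from `k=35` and the linear kernel, so the accuracies it prints will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the dataset (768 rows, 8 features).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Distance- and margin-based models get a scaler; the forest does not need one.
models = {
    "K-Nearest Neighbors (k=35)": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=35)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the training split and record its test accuracy.
results = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2%}")
```

Accuracy alone can hide an imbalance between false positives and false negatives, so it is usually read alongside precision and recall.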

✅ Threshold Optimization

  • Lowered the classification threshold from the default 0.5 to ~0.35 to reduce false negatives (missed diabetes cases), which matters most in medical diagnosis. The trade-off is typically more false positives.
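
The effect of lowering the threshold can be illustrated with a small, hand-made example. The labels and probabilities below are invented for demonstration, and `error_counts` is a hypothetical helper. In practice the probabilities would come from a fitted classifier's `predict_proba` output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented true labels and predicted probabilities of class 1 (diabetes).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
proba = np.array([0.10, 0.20, 0.40, 0.55, 0.30, 0.38, 0.45, 0.60, 0.90, 0.05])

def error_counts(y_true, proba, threshold):
    """Return (false_negatives, false_positives) at the given threshold."""
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fn), int(fp)

print(error_counts(y_true, proba, 0.50))  # (3, 1)
print(error_counts(y_true, proba, 0.35))  # (1, 2)
```

In this toy case, moving the threshold from 0.5 to 0.35 cuts false negatives from 3 to 1 while false positives rise from 1 to 2.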

πŸ—ƒοΈ Output

  • Final prediction file: diabetes classification report.csv

πŸ› οΈ Technologies Used

  • Python
  • Pandas, NumPy
  • Seaborn, Matplotlib, Plotly
  • Scikit-learn
  • SciPy
