
Google Advanced Data Analytics Capstone

Salifort Employee Retention Project

Table of Contents

  • Data Source
  • Libraries used
  • EDA and Data Preparation
  • Summary of model results
  • Logistic Regression
  • Tree-based Machine Learning
  • Feature Importance
  • Conclusion and Insights
  • Visualizations used for analysis

Data Source

  • Salifort Employee Data from Kaggle.

*(figure: data_description)*

Libraries used

  • Pandas: Used for data manipulation, cleaning, and tabular analysis
  • NumPy: Employed for high-performance numerical computing and array operations
  • Matplotlib & Seaborn: Utilized for Exploratory Data Analysis (EDA) and creating static visualizations
  • Scikit-learn: The core library used for building and evaluating machine learning models, specifically Decision Tree and Random Forest Classifiers, as well as for preprocessing and cross-validation
  • XGBoost: Applied for advanced gradient boosting (XGBClassifier) to achieve higher predictive accuracy and analyze feature importance
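As a quick reference, a minimal import sketch covering this stack (the module paths are the libraries' standard ones):

```python
# Core stack used throughout the project.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
```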

EDA and Data Preparation

Analyzed the class distribution of the target variable. The dataset reveals a 16.6% attrition rate, while 83.4% of employees remained. This proportion is consistent across all departments, indicating no specific department has a significantly higher turnover rate.
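A minimal sketch of that balance check, assuming the Kaggle dataset's column names (`left` as the target, `Department` for the department; the CSV filename is illustrative):

```python
import pandas as pd

# Load the Kaggle export (filename is an assumption).
df = pd.read_csv("HR_capstone_dataset.csv")

# Share of employees who left (1) vs. stayed (0); expect roughly 0.166 / 0.834.
print(df["left"].value_counts(normalize=True))

# Attrition rate per department, confirming no single department stands out.
print(df.groupby("Department")["left"].mean().sort_values(ascending=False))
```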

  • Categorical data -> numeric data (see the sketch after this list):
    • Mapped salary: [low, medium, high] -> [0, 1, 2]
    • One-hot encoded the Department column with pd.get_dummies
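A sketch of the encoding step described above (column names are assumed from the dataset; `drop_first` is an illustrative choice, not necessarily the project's setting):

```python
# Ordinal-encode salary: the categories have a natural order, so a mapping fits.
df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})

# One-hot encode Department, which has no natural order; drop_first avoids
# carrying a redundant dummy column.
df = pd.get_dummies(df, columns=["Department"], drop_first=True)
```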

Summary of model results

| Model | Precision | Recall | F1 | Accuracy | AUC |
| --- | --- | --- | --- | --- | --- |
| Decision Tree CV | 0.92324 | 0.91561 | 0.91934 | 0.97331 | 0.97049 |
| XGBoost CV | 0.96487 | 0.91829 | 0.94098 | 0.98087 | 0.98693 |
| Random Forest CV | 0.95067 | 0.91561 | 0.93279 | 0.97809 | 0.98133 |
| XGBoost Test | 0.96855 | 0.92771 | 0.94769 | 0.98299 | 0.96086 |
| Decision Tree FE CV | 0.95858 | 0.91427 | 0.93590 | 0.97921 | 0.97039 |
| XGBoost FE CV | 0.96887 | 0.91695 | 0.94218 | 0.98132 | 0.98496 |
| Random Forest FE CV | 0.94166 | 0.90824 | 0.92464 | 0.97543 | 0.98062 |
| XGBoost FE Test | 0.95833 | 0.92369 | 0.94070 | 0.98065 | 0.95785 |

(CV = cross-validated scores on the validation set; FE = after feature engineering.)

Logistic Regression

The logistic regression model achieved (all weighted averages; see the sketch after this list):

  • precision: 79%
  • recall: 82%
  • accuracy: 83%
  • F1-score: 80%
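These weighted averages are the kind of summary scikit-learn's classification_report prints; a minimal sketch, assuming `df` from the preparation step and illustrative split and solver settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Features/target split; "left" as the target column is an assumption.
X = df.drop(columns="left")
y = df["left"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline model; max_iter raised so the solver converges on this data.
log_clf = LogisticRegression(max_iter=500, random_state=42)
log_clf.fit(X_train, y_train)

# The "weighted avg" row of this report corresponds to the figures above.
print(classification_report(y_test, log_clf.predict(X_test)))
```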

Tree-based Machine Learning

After feature engineering (a tuning sketch follows these results):

  • Decision Tree on Validation Set:
    • precision: 95.85%
    • recall: 91.47%
    • accuracy: 97.92%
    • F1-score: 93.59%
    • AUC: 97.03%
  • Random Forest on Validation Set:
    • precision: 94.16%
    • recall: 90.82%
    • accuracy: 97.54%
    • F1-score: 92.44%
    • AUC: 98.06%
  • XGBoost on Validation Set:
    • precision: 96.88%
    • recall: 91.69%
    • accuracy: 98.13%
    • F1-score: 94.21%
    • AUC: 98.49%
  • XGBoost on Test Set:
    • precision: 95.83%
    • recall: 92.36%
    • accuracy: 98.06%
    • F1-score: 94.07%
    • AUC: 95.78%
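A sketch of how such cross-validated scores could be produced for XGBoost; the parameter grid and fold count are assumptions, not the project's exact settings:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Small illustrative grid; the actual search space may have differed.
param_grid = {
    "max_depth": [4, 6, 8],
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
}

xgb_cv = GridSearchCV(
    XGBClassifier(objective="binary:logistic", random_state=42),
    param_grid,
    scoring=["precision", "recall", "f1", "accuracy", "roc_auc"],
    refit="roc_auc",  # the metric used to pick the final model
    cv=5,
)
xgb_cv.fit(X_train, y_train)

print(xgb_cv.best_params_)
print(xgb_cv.best_score_)  # best mean cross-validated AUC
```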


XGBoost outperformed the Decision Tree and Random Forest models.

Feature Importance

*(figure: feature_importance)*
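A sketch of how such a chart could be generated from the fitted model (`xgb_cv` assumes the grid search sketch above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rank features by the fitted booster's importance scores.
importances = pd.Series(
    xgb_cv.best_estimator_.feature_importances_, index=X_train.columns
).sort_values()

importances.plot(kind="barh", title="XGBoost feature importance")
plt.tight_layout()
plt.show()
```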

Conclusion and Insights

The analysis confirms that overwork is the primary driver of employee churn. To improve retention, we recommend:

  • Limit Workload: Cap the maximum number of active projects per employee.
  • Address Stagnation: Review promotion paths for employees with 4+ years of tenure to address specific dissatisfaction.
  • Regulate Overtime: Either explicitly compensate for overtime or strictly enforce standard hours. Ensure policies are transparent.
  • Revise Evaluations: Detach high performance scores from excessive working hours (240+/month); reward efficiency over duration.
  • Improve Culture: Initiate open discussions to address work culture and expectations at both team and company levels.

Visualizations used for analysis

Number of project counts

*(figure: nmb_of_prj_cnt)*

Monthly hours by number of projects

*(figure: mnt_h_prj)*

Employee satisfaction analysis

*(figure: emp_anl)*

Employee tenure analysis

*(figure: emp_ten)*

Satisfaction vs. Evaluation scatterplot

*(figure: st_ev)*

Average monthly hours vs. Satisfaction scatterplot

*(figure: av_st)*

Heatmap

*(figure: heatmap)*
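A sketch of how the correlation heatmap could be drawn (`df` is the encoded DataFrame from the preparation step; figure size and colormap are illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between numeric columns, annotated per cell.
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="vlag")
plt.title("Feature correlation heatmap")
plt.show()
```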