- Data Source
- Libraries used
- EDA and Data Preparation
- Summary of model results
- Feature Importance
- Conclusion and Insights
- Visualizations used for analysis
- Salifort Employee Data from Kaggle.
- Pandas: Used for data manipulation, cleaning, and tabular analysis
- NumPy: Employed for high-performance numerical computing and array operations
- Matplotlib & Seaborn: Utilized for Exploratory Data Analysis (EDA) and creating static visualizations
- Scikit-learn: The core library used for building and evaluating machine learning models, specifically Decision Tree and Random Forest Classifiers, as well as for preprocessing and cross-validation
- XGBoost: Applied for advanced gradient boosting (XGBClassifier) to achieve higher predictive accuracy and analyze feature importance
Analyzed the class distribution of the target variable. The dataset reveals a 16.6% attrition rate, while 83.4% of employees stayed. This proportion is roughly consistent across departments, indicating that no single department has a significantly higher turnover rate.
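A minimal sketch of this class-distribution check, using a toy stand-in frame; the column names `left` and `Department` follow the Kaggle HR dataset and are assumptions here:

```python
import pandas as pd

# Toy stand-in for the Salifort data; in the project this would be the
# loaded Kaggle CSV. Column names (`left`, `Department`) are assumed.
df = pd.DataFrame({
    "left": [1, 0, 0, 0, 0, 1],
    "Department": ["sales", "sales", "IT", "IT", "hr", "hr"],
})

# Overall attrition rate (share of employees who left).
overall = df["left"].mean()

# Attrition rate per department, to check for outlier departments.
by_dept = df.groupby("Department")["left"].mean()
print(overall, by_dept.to_dict())
```

On the real data, `overall` is where the 16.6% figure comes from, and `by_dept` supports the per-department comparison.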
- Categorical data -> numeric data
- Mapped salary: [low, medium, high] -> [0, 1, 2]
- Converted the Department column into numeric dummy columns with the help of `pd.get_dummies`
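The two encoding steps above can be sketched as follows; the exact column names (`salary`, `Department`) and the `drop_first` choice are assumptions, not confirmed project settings:

```python
import pandas as pd

# Small illustrative frame; the real data has ~12k rows.
df = pd.DataFrame({
    "salary": ["low", "high", "medium", "low"],
    "Department": ["sales", "IT", "IT", "hr"],
})

# Ordinal-encode salary: low < medium < high -> 0 < 1 < 2.
df["salary"] = df["salary"].map({"low": 0, "medium": 1, "high": 2})

# One-hot encode Department into dummy columns.
df = pd.get_dummies(df, columns=["Department"], drop_first=False)
print(df.columns.tolist())
```

Ordinal encoding suits salary because its levels are genuinely ordered, while Department has no order, so dummies avoid imposing a fake ranking.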
CV = cross-validation results; FE = after feature engineering.

| Model | Precision | Recall | F1 | Accuracy | AUC |
|---|---|---|---|---|---|
| Decision Tree CV | 0.92324 | 0.91561 | 0.91934 | 0.97331 | 0.97049 |
| XGBoost CV | 0.96487 | 0.91829 | 0.94098 | 0.98087 | 0.98693 |
| Random Forest CV | 0.95067 | 0.91561 | 0.93279 | 0.97809 | 0.98133 |
| XGBoost Test | 0.96855 | 0.92771 | 0.94769 | 0.98299 | 0.96086 |
| Decision Tree FE CV | 0.95858 | 0.91427 | 0.9359 | 0.97921 | 0.97039 |
| XGBoost FE CV | 0.96887 | 0.91695 | 0.94218 | 0.98132 | 0.98496 |
| Random Forest FE CV | 0.94166 | 0.90824 | 0.92464 | 0.97543 | 0.98062 |
| XGBoost FE Test | 0.95833 | 0.92369 | 0.9407 | 0.98065 | 0.95785 |
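The CV rows above can be produced with scikit-learn's `cross_validate` and multiple scorers. A minimal sketch on synthetic data (the real run used the encoded Salifort features; the 5-fold split and random forest settings here are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in with a class imbalance similar to the 83/17 split.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.83, 0.17], random_state=0)

scoring = ["precision", "recall", "f1", "accuracy", "roc_auc"]
cv_results = cross_validate(RandomForestClassifier(random_state=0),
                            X, y, cv=5, scoring=scoring)

# Mean of each metric across the 5 folds, as reported in the table.
means = {m: cv_results[f"test_{m}"].mean() for m in scoring}
print(means)
```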
Logistic Regression
The logistic regression model achieved (all weighted averages):
- precision: 79%
- recall: 82%
- accuracy: 83%
- F1-score: 80%
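The weighted averages above are what `classification_report` returns under "weighted avg". A sketch on synthetic placeholder data (the real run used the encoded employee features; `max_iter` here is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Placeholder data with a similar class imbalance to the real dataset.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.83, 0.17], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# output_dict=True exposes the weighted-average precision/recall/F1
# alongside overall accuracy.
report = classification_report(y_te, clf.predict(X_te), output_dict=True)
print(report["weighted avg"], report["accuracy"])
```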
Tree-based Machine Learning
After feature engineering:
- Decision Tree on Validation Set:
- precision: 95.85%
- recall: 91.47%
- accuracy: 97.92%
- F1-score: 93.59%
- AUC: 97.03%
- Random Forest on Validation Set:
- precision: 94.16%
- recall: 90.82%
- accuracy: 97.54%
- F1-score: 92.44%
- AUC: 98.06%
- XGBoost on Validation Set:
- precision: 96.88%
- recall: 91.69%
- accuracy: 98.13%
- F1-score: 94.21%
- AUC: 98.49%
- XGBoost on Test Set:
- precision: 95.83%
- recall: 92.36%
- accuracy: 98.06%
- F1-score: 94.07%
- AUC: 95.78%
XGBoost outperformed the Decision Tree and Random Forest models on the validation set.
The analysis confirms that overwork is the primary driver of churn. To improve retention, we recommend:
- Limit Workload: Cap the maximum number of active projects per employee.
- Address Stagnation: Review promotion paths for employees with 4+ years of tenure to address specific dissatisfaction.
- Regulate Overtime: Either explicitly compensate for overtime or strictly enforce standard hours. Ensure policies are transparent.
- Revise Evaluations: Detach high performance scores from excessive working hours (240+/month); reward efficiency over duration.
- Improve Culture: Initiate open discussions to address work culture and expectations at both team and company levels.