M.S. Applied Data Science - Capstone Chronicles 2025


As illustrated in Figure 15, Class I grew by 238.41% and Class III by 933.30%. These figures show how heavily SMOTE had to augment the underrepresented classes to reach balance. They underline the severity of the initial class imbalance and reinforce the importance of resampling for achieving equitable model performance across all target categories.

Figure 15 Class Distribution After SMOTE
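The percentage increases can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the original class sizes are back-calculated from the reported percentages and the balanced count of 22,619, so they are implied values rather than counts reported in this study.

```python
def percent_increase(before: int, after: int) -> float:
    """Relative growth of a class, in percent."""
    return (after - before) / before * 100.0


def implied_original_count(after: int, pct_increase: float) -> int:
    """Invert percent_increase: the pre-SMOTE size implied by a reported growth."""
    return round(after / (1.0 + pct_increase / 100.0))


BALANCED_COUNT = 22_619  # samples per class after SMOTE (from the text)
reported_growth = {"Class I": 238.41, "Class III": 933.30}  # from the text

for name, pct in reported_growth.items():
    # Implied (not reported) original size, then a round-trip check.
    before = implied_original_count(BALANCED_COUNT, pct)
    check = round(percent_increase(before, BALANCED_COUNT), 2)
    print(f"{name}: implied original size {before}, growth {check}%")
```

The round trip shows that the reported percentages are consistent with every class being raised to the same balanced count.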

the random forest classifier was selected as the most effective model, offering a robust balance between accuracy and generalization. The optimal configuration derived from hyperparameter tuning was applied to a new instance of the random forest model, which was then incorporated into a fresh pipeline tailored specifically to the model’s requirements. Before finalizing the model, the class imbalance in the dataset was addressed. As shown in Figure 14, the original training data exhibited significant skew, with Class II dominating the distribution and Class III being underrepresented. To mitigate this, SMOTE was used to balance the class distribution by generating synthetic examples for the minority classes. After applying SMOTE, each class had equal representation (22,619 samples), so that the classifier was not biased toward the majority class. This resampling step was intended to improve fairness and robustness in model predictions.

Figure 14 Class Distribution Before SMOTE
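The idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest same-class neighbours. The code below is a simplified, standard-library illustration of that interpolation, not the implementation used in this study; the function name, the toy data, and the choice of `k` are all assumptions.

```python
import random
from collections import Counter


def smote_like_oversample(X, y, k=5, seed=0):
    """Oversample every minority class up to the majority-class count
    by interpolating between same-class nearest neighbours."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(row) for row in X], list(y)
    for label, n in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            base = rng.choice(rows)
            # k nearest same-class neighbours of the base point (excluding itself)
            neighbours = sorted(
                (r for r in rows if r is not base),
                key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
            )[:k]
            other = rng.choice(neighbours) if neighbours else base
            gap = rng.random()  # position along the segment base -> other
            X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Toy, made-up data: Class II dominates, Class III is the rarest.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2],
     [2.0, 2.0], [2.2, 2.1], [2.1, 2.3],
     [5.0, 5.0], [5.2, 5.1]]
y = ["II"] * 6 + ["I"] * 3 + ["III"] * 2

X_bal, y_bal = smote_like_oversample(X, y, k=2)
print(Counter(y_bal))
```

Because the synthetic points are interpolated rather than copied, they stay inside the region already occupied by the minority class while avoiding exact duplicates.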

Once the data was balanced, the final pipeline was constructed by incorporating preprocessing steps such as scaling (if necessary for the model) and feature selection using SelectKBest or model-based selectors. For random forest, scaling was not required, but selecting the top features based on the best configuration was beneficial. By integrating resampling, feature selection, and model training into a single end-to-end pipeline, the final model was trained on the resampled dataset in a consistent and reproducible way.

In addition to visualizing the raw class counts, the percentage increase in each class’s sample size after applying SMOTE was also examined.

