M.S. Applied Data Science - Capstone Chronicles 2025
validation process is followed to enhance accuracy and ensure robust recall classification.

4.5.1 Train-Test Split Validation

To ensure data integrity and modeling readiness, class distribution was analyzed in both the training and test datasets. In the training data, the majority class (Class 1) accounted for approximately 71.8% of the samples, while Class 0 represented 21.2% and Class 2 comprised 7%. A nearly identical distribution was observed in the test set, confirming the success of stratification during the train-test split. The dataset contains 41 features, which can be categorized into several groups: five temporal features, such as month, day, and year; 16 categorical features related to product type, status, region, and distribution scope; and 19 binary flag indicators reflecting conditions such as allergen presence, illness, or manufacturing issues. The one remaining feature does not fall into these groups and is classified separately. Data validation checks revealed no missing or infinite values, confirming the dataset's cleanliness and suitability for modeling.

4.5.2 Selection of Modeling Techniques

Logistic Regression. Employing a diverse set of modeling techniques is a strategic approach to capture various data patterns and relationships. Logistic regression serves as a foundational model within this diverse ensemble, offering a benchmark for evaluating the performance of more complex algorithms. This strategy aligns with best practices in machine learning, where baseline models are
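The stratified split and class-balance check described above can be sketched as follows. This is a minimal illustration, assuming a pandas DataFrame with a three-class target column named `recall_class`; the column name and split ratio are assumptions, not the capstone's actual configuration.

```python
# Sketch: stratified train-test split with class-distribution verification,
# assuming `df` holds the features plus a target column "recall_class".
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df: pd.DataFrame, target: str = "recall_class"):
    # Basic validation: no missing or infinite values should remain
    assert not df.isna().any().any(), "missing values detected"

    X = df.drop(columns=[target])
    y = df[target]

    # stratify=y preserves the class proportions in both partitions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Compare class proportions to confirm stratification worked
    print("train:", y_train.value_counts(normalize=True).round(3).to_dict())
    print("test: ", y_test.value_counts(normalize=True).round(3).to_dict())
    return X_train, X_test, y_train, y_test
```

A quick check that the two printed distributions are nearly identical is enough to confirm the split is stratified correctly.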
essential for contextualizing the efficacy of advanced models (MarkovML, 2023). Logistic regression represents the class of generalized linear models, providing a probabilistic framework for binary classification tasks. Its inclusion ensures coverage of linear modeling paradigms, complementing non-linear models such as decision trees and ensemble methods. This breadth allows for comprehensive analysis across different algorithmic approaches, facilitating a more robust understanding of the data (Kuhn & Johnson, 2013). The simplicity and interpretability of logistic regression make it an ideal baseline model. It enables clear insights into feature contributions and decision boundaries, which is particularly valuable in domains requiring transparency. When used alongside complex models like random forests or gradient boosting machines, logistic regression helps in diagnosing overfitting and understanding the marginal gains in predictive performance, thereby informing model selection and deployment decisions (Arshad et al., 2023). While logistic regression may not capture complex non-linear relationships as effectively as advanced models, its high interpretability and computational efficiency offer significant advantages. In scenarios where model transparency is paramount, such as healthcare or finance, logistic regression provides a balance between performance and explainability. Moreover, studies have demonstrated that in certain contexts, simpler models like logistic regression can outperform complex models, challenging the notion that increased complexity always leads to better performance (Arshad et al., 2023).
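The baseline role described above can be illustrated with a short sketch: a scaled logistic regression compared against a random forest, with the coefficient matrix exposed for interpretability. The synthetic data and class weights below are illustrative assumptions, not the capstone's dataset.

```python
# Sketch: logistic regression as an interpretable baseline alongside a more
# complex ensemble model. Synthetic data stands in for the recall dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Three imbalanced classes, loosely mirroring the distribution reported above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.21, 0.72, 0.07],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Baseline: scaling plus multinomial logistic regression
baseline = make_pipeline(StandardScaler(),
                         LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

# Complex comparator: a random forest ensemble
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

print(f"logistic regression accuracy: {baseline.score(X_test, y_test):.3f}")
print(f"random forest accuracy:       {forest.score(X_test, y_test):.3f}")

# Interpretability: per-class coefficients give direct feature attributions,
# which tree ensembles do not expose as transparently.
coefs = baseline.named_steps["logisticregression"].coef_
print("coefficient matrix shape:", coefs.shape)  # (n_classes, n_features)
```

Comparing the two scores shows the marginal gain (if any) from the more complex model, which is exactly the diagnostic role the baseline plays in model selection.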