
and 30% temporary data. This split ensures that the models have sufficient data for training while reserving enough data for testing and validation.

2. Further split into validation and test sets: The temporary data were then split equally into validation and test sets, each constituting 15% of the original dataset. This split is crucial for tuning the models and evaluating their performance on unseen data.

3. Cross-validation: To enhance the robustness of model evaluation, k-fold cross-validation was employed. This technique splits the training data into k parts; the model is trained on k-1 of these parts and tested on the remaining one. The process is repeated k times, with each part serving as the test set exactly once, and averaging the results of all k iterations provides a more accurate estimate of the model's performance (3.1. Cross-validation: Evaluating Estimator Performance, n.d.).

By maintaining the class distribution across these splits through stratification, the models are trained and evaluated on representative data, which is essential for reliable performance metrics.
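The splitting code itself is not reproduced in the report; the following is a minimal sketch of the stratified 70/15/15 split and k-fold cross-validation described above, assuming a pandas DataFrame named df with a target column named target (both names are placeholders, and k=5 is assumed).

```python
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrix X and target y; the column name is a placeholder.
X = df.drop(columns=["target"])
y = df["target"]

# Step 1: 70% training, 30% temporary data, stratified to preserve class balance.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Step 2: split the temporary data equally into validation and test sets (15% each overall).
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42
)

# Step 3: stratified k-fold cross-validation on the training data (k=5 assumed here).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=cv)
print(f"Mean cross-validation accuracy: {scores.mean():.3f}")
```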

4.4.4 Parameter Settings

To identify optimal model configurations, the parameters for each model were selected using either grid search or randomized search, depending on the model's computational cost (see Table 2). The parameter selection process is detailed below:

Ridge classifier: GridSearchCV was used to determine the best hyperparameters for this model. The hyperparameters tuned were the regularization strength, solver, and maximum iterations. The optimal parameters were {'C': 0.1, 'max_iter': 1000, 'penalty': 'l2', 'random_state': 42, 'solver': 'lbfgs'}, with a cross-validation score of about 0.73.

Lasso classifier: The same search method (GridSearchCV) was used for hyperparameter tuning. The best parameters were {'C': 1, 'max_iter': 1000, 'penalty': 'l1', 'random_state': 42, 'solver': 'saga'}, with a cross-validation score of approximately 0.73.

XGBoost: Parameters such as the number of estimators (n_estimators), maximum depth (max_depth), learning rate (learning_rate), subsample ratio (subsample), and column sample (colsample_bytree) were optimized using RandomizedSearchCV. The best configuration was {'colsample_bytree': 0.81, 'learning_rate': 0.20, 'max_depth': 5, 'n_estimators': 289, 'subsample': …}.
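The tuning code is not shown in the report; the sketch below illustrates the described GridSearchCV and RandomizedSearchCV setups, assuming the L2- and L1-penalized classifiers are implemented as regularized logistic regression (consistent with the reported C, penalty, and solver values) and that X_train and y_train come from the split above. The search grids and distributions are illustrative, not the exact ranges used.

```python
from scipy.stats import randint, uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier

# Grid search over the regularization strength, solver, and max iterations
# for the L2 ("ridge") and L1 ("lasso") penalized classifiers.
ridge_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"], "solver": ["lbfgs"], "max_iter": [1000]}
lasso_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1"], "solver": ["saga"], "max_iter": [1000]}

ridge_search = GridSearchCV(LogisticRegression(random_state=42), ridge_grid, cv=5)
lasso_search = GridSearchCV(LogisticRegression(random_state=42), lasso_grid, cv=5)

# Randomized search for XGBoost over the parameters named in the text;
# the sampling ranges here are assumptions.
xgb_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
    "colsample_bytree": uniform(0.6, 0.4),
}
xgb_search = RandomizedSearchCV(
    XGBClassifier(random_state=42), xgb_distributions, n_iter=50, cv=5, random_state=42
)

# Each search is fit on the training data; best_params_ and best_score_
# report the selected configuration and its mean cross-validation score.
ridge_search.fit(X_train, y_train)
print(ridge_search.best_params_, round(ridge_search.best_score_, 2))
```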


