M.S. Applied Data Science - Capstone Chronicles 2025


Figure 18 Logistic Regression Validation Results for SPPS Against Match Outcomes

The logistic regression model achieved remarkable performance in distinguishing between winning and losing scenarios. The fitted parameters reveal an intercept (β₀) of 1.5810 and a coefficient (β₁) of 0.9549, indicating that each one-unit increase in the rebalanced SPPS multiplies the odds of winning by e^β₁ ≈ 2.599. This substantial effect size demonstrates that our SHAP-informed metric adjustments successfully identified performance dimensions that correlate strongly with match outcomes.

4.4. Modeling and Predictions

Following the SHAP-based recalibration of the SPPS, we implemented two ensemble learning algorithms to validate the effectiveness of the rebalanced metric and establish predictive benchmarks. The dataset was randomly split into training (80%) and testing (20%) sets to ensure robust model evaluation and prevent overfitting. Random forest and XGBoost models were trained using the recalibrated SPPS as the target variable, with position-specific performance metrics serving as predictor features. The model comparison (see Table 7) showed the ensemble model achieving superior performance across all positions, with the highest average R² (0.978) and lowest MAE (0.452). XGBoost ranked second (R² = 0.975, MAE = 0.56) and excelled at midfielder prediction (R² = 0.99), where random forest performed poorly (R² = 0.50). Neural networks showed marked positional variability, performing well for forwards (R² = 0.923) but failing for goalkeepers (R² = 0.395). Random forest consistently underperformed across all positions (R² = 0.821, MAE = 1.37), confirming the superiority of ensemble and gradient-boosting methods for capturing complex tactical relationships in football performance prediction.
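The logistic regression validation reported in Figure 18 reduces to simple odds arithmetic. A minimal sketch using only the stated β₀ and β₁ (the `win_probability` helper is illustrative and not from the paper):

```python
import math

# Logistic regression parameters reported for the SPPS validation
beta0 = 1.5810  # intercept
beta1 = 0.9549  # coefficient on the rebalanced SPPS

# Each one-unit increase in SPPS multiplies the win odds by exp(beta1),
# close to the reported effect size of about 2.6
odds_ratio = math.exp(beta1)

def win_probability(spps: float) -> float:
    """Illustrative helper: predicted win probability for a given rebalanced SPPS."""
    z = beta0 + beta1 * spps
    return 1.0 / (1.0 + math.exp(-z))

print(f"odds ratio per unit of SPPS: {odds_ratio:.2f}")
print(f"win probability at SPPS = 0: {win_probability(0.0):.3f}")
```

The same arithmetic explains why the intercept alone already implies a better-than-even baseline: at SPPS = 0 the log-odds equal β₀ = 1.5810, which is a win probability above 0.8.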

