ADS Capstone Chronicles Revised


As it pertains to our regression analysis on score predictions, Linear Regression is a powerful tool due in part to its simplicity, interpretability, and effectiveness in capturing linear relationships between the predictor variables and the target variable. By fitting a linear regression model to historical game data, we can estimate the relationship between different factors, such as passing and rushing yards per attempt, and the resulting score. Moreover, linear regression provides insight into the importance of each predictor variable through its coefficients, allowing us to identify which factors have the most significant impact on the score. Using these importance measures, we can adjust the feature list used in model training to perform dimensionality reduction. Additionally, linear regression models are relatively easy to implement, understand, and interpret, making them accessible even without advanced statistical knowledge.

In the end, we use three main features in our regression model: passing yards per attempt, rushing yards per attempt, and sacks. We chose the per-attempt features rather than totals because game script affects how teams call plays toward the end of a game, which would have an unwanted impact on our models. As mentioned previously, prediction is split into separate models for the home score and the away score; thus, the features used are tied to the corresponding team.

4.4.1.1 Training and testing datasets

To train a model that can be tested on unseen data, we first split our dataset randomly into a training set and a testing set containing 80% and 20% of the data, respectively. This yields 4743 records for model training and 1186 records on which to test our models.

5 Results and Findings

In the realm of predictive modeling, understanding the performance of various algorithms is crucial for achieving reliable results. In this analysis, we evaluate the

effectiveness of several popular classification algorithms in predicting a binary outcome: Logistic Regression, Random Forest, Gradient Boosting, and Gaussian Naive Bayes.

5.1 Evaluation of Classification Results

The logistic regression model demonstrates robust performance on both the training and testing sets, achieving an accuracy of 82.33% on the training set and 82.88% on the testing set. These metrics suggest our logistic regression model is performing well and generalizing effectively to unseen data. There appears to be no significant overfitting, as performance on the testing set is comparable to that on the training set. Furthermore, the precision, recall, and F1-score metrics indicate balanced performance across both classes.

The Random Forest model exhibits remarkable accuracy, achieving a perfect score of 100% on the training set. However, this high accuracy raises concerns about overfitting, as evidenced by the lower accuracy of 86.26% on the testing set. To address overfitting, hyperparameter tuning is essential. After tuning the Random Forest model using GridSearchCV, we identified the best hyperparameters as {'max_depth': 25, 'min_samples_leaf': 2, 'min_samples_split': 5}. Surprisingly, the testing accuracy barely improved, changing from 86.26% to 87.44%. Furthermore, the model still displays signs of overfitting given the higher training accuracy, indicating a need for further investigation and refinement.

Gradient Boosting demonstrates strong predictive performance, achieving an accuracy of 91.92% on the training set and 87.77% on the testing set. However, as with Random Forest, there is evidence of overfitting, with the training accuracy surpassing the testing accuracy. Therefore,
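The 80/20 split described in Section 4.4.1.1 can be sketched as follows. The feature values here are synthetic stand-ins (the distributions and coefficients are assumptions for illustration, not the real game data), but the record counts match the split reported above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5929  # total records, consistent with the 4743/1186 split above

# Hypothetical stand-ins for the three per-team features used in the model.
X = np.column_stack([
    rng.normal(7.0, 1.5, n),   # passing yards per attempt
    rng.normal(4.3, 1.0, n),   # rushing yards per attempt
    rng.poisson(2.5, n),       # sacks
])
# Assumed linear relationship plus noise, for demonstration only.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(0, 3, n)

# 80/20 random split: 4743 training records and 1186 testing records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
# The coefficients indicate each feature's contribution to the predicted score.
print(dict(zip(["pass_ypa", "rush_ypa", "sacks"], model.coef_)))
```

In practice this would be run twice, once per team, with the home-team features predicting the home score and the away-team features predicting the away score.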
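The GridSearchCV tuning described for the Random Forest model might look like the sketch below. It runs on synthetic data rather than the actual game dataset, and the grid values other than the reported best hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data standing in for the game dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Grid includes the best values reported in the text:
# max_depth=25, min_samples_leaf=2, min_samples_split=5.
param_grid = {
    "max_depth": [10, 25, None],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```

Comparing `search.best_estimator_`'s training accuracy against its test accuracy is how the remaining overfitting noted above would be detected.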

