We used a correlation matrix to identify highly correlated features. With a correlation threshold of 0.7, we identified and excluded features that were strongly correlated with one another. This step was crucial for reducing redundancy and computational cost and for enhancing model performance, and it left us with 30 columns, a streamlined dataset ready for modeling.

4.3.2 One-hot encoding of the team columns.

In addition to handling correlations, we applied feature engineering to enhance the dataset’s predictive power. Notably, we one-hot encoded the ‘away’ and ‘home’ columns, representing each team name as a binary feature. This transformation enabled our model to capture the influence of individual teams on game outcomes. Because the ‘outcome_binary’ column was already well defined, with 1 representing a home-team win and 0 an away-team win, no further manipulation of this binary variable was needed. Lastly, to ensure a robust evaluation of our model’s performance, we partitioned the dataset into training and testing sets using an 80-20 ratio. The data was then prepared and ready for the modeling phase.
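The following is a minimal sketch of these preprocessing steps, assuming pandas and scikit-learn. The DataFrame name `games` and the input path are hypothetical; the 0.7 threshold, the ‘away’/‘home’ columns, and the 80-20 split come from the steps described above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

games = pd.read_csv("nfl_games.csv")  # hypothetical path to the cleaned dataset

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each feature pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

games = drop_highly_correlated(games, threshold=0.7)

# One-hot encode the team-name columns so each team becomes a binary feature.
games = pd.get_dummies(games, columns=["away", "home"])

# 80-20 split; 'outcome_binary' is 1 for a home-team win, 0 for an away-team win.
X = games.drop(columns=["outcome_binary"])
y = games["outcome_binary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```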
4.4 Modeling

In the domain of machine learning, selecting an appropriate model entails a nuanced understanding of algorithms, data preprocessing techniques, and evaluation methodologies. This article presents a comparative analysis of Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and Gaussian Naive Bayes for classifying the result of NFL games using in-game statistics. By examining the technical intricacies of model selection, feature engineering, and evaluation strategy, we aim to provide insights into the practical application of these algorithms. Additionally, this article explores regression analysis to predict the number of points scored, based on a smaller subset of features. We deliberately exclude features that directly reveal the score: although such features would lower the model’s error, they would dominate the model without contributing to the goal of prediction. We run only one type of model, linear regression, for the regression analysis, but we fit it separately for away and home scoring.
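Below is a sketch of that regression setup, continuing from the preprocessing sketch above. The target columns ‘home_score’ and ‘away_score’ and the score-free feature list are illustrative placeholders, not the project’s exact column names.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical subset of features that do not directly reveal the score.
score_free_features = ["home_total_yards", "away_total_yards",
                       "home_turnovers", "away_turnovers"]

# Fit one linear regression per scoring target, home and away separately.
for target in ["home_score", "away_score"]:
    X = games[score_free_features]
    y = games[target]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    reg = LinearRegression().fit(X_tr, y_tr)
    print(f"{target}: test MAE = {mean_absolute_error(y_te, reg.predict(X_te)):.2f}")
```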
4.4.1 Selection of modeling techniques.

The selection of classification models for the NFL dataset revolves around their suitability for binary prediction tasks and their capacity to capture the intricacies of game outcomes. The predicted variable is "outcome_binary," which indicates whether the home team wins (1) or the away team wins (0). Logistic Regression is a standout choice for its simplicity and interpretability, making it a solid baseline model for relating the predictor variables to the likelihood of a home win. When it comes to handling complex datasets with numerous features, ensemble approaches such as Random Forest and Gradient Boosting shine. These models combine multiple decision trees to capture non-linear relationships and interactions, and they offer the flexibility to fine-tune parameters for optimal results, making them invaluable for modeling NFL game outcomes. SVM offers a robust approach to binary classification, especially in our high-dimensional setting of 90 features. Gaussian Naive Bayes, despite its simplistic assumption of feature independence, remains a practical choice: its computational efficiency and capacity to provide meaningful insights make it a valuable addition to our model toolkit.
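As a sketch of how these five classifiers can be fit and compared on the prepared train/test split, assuming scikit-learn with default hyperparameters (the ensemble models can be tuned further, as noted above). Standardizing features for the SVM is our assumption here, a common practice rather than a documented project step.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    # Features are standardized for the SVM (our assumption, a common practice).
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Gaussian Naive Bayes": GaussianNB(),
}

# Fit each model on the training set and report held-out accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")
```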