ADS Capstone Chronicles Revised

First page Table of contents Previous page 36 Next page Last page

4.2 Data Quality There are no data quality issues pertaining to invalid data as it is collected from Kaggle (n.d.), which ensures reliability from reputable sources. Using Python, we have also discovered there are no missing values, which results in no imputation step needed. After checking the number of games in the dataset across all years, we have found 267 games each season from 2002 through 2019, 269 games in the 2020 season, 284 games in the 2022 season, and 285 games in the 2021 and 2023 seasons. These discrepancies are explained by the NFL adding an additional playoff game in each Table 1 Descriptive Statistics on Selected Features Feature easier access to time variables. We converted the originally recorded ‘possession_away’ and ‘possession_home’ times from minute:seconds format into seconds. This numeric representation not only facilitates easier evaluation and impact assessment but also streamlines our modeling processes, contributing to a smoother and more efficient analytical workflow. 4.3 Feature Engineering Several feature engineering techniques were implemented to add features deemed as beneficial to our exploration and analysis. To classify the winner of each game, we compared the home and away scores to receive a home or away label in a new column called ‘winning_team.’ Then, we referenced the team names in the ‘home’ or ‘away’ Score (home) Pass yards (home) Rush yards (home) 4.2.1 Data transformation facilitated

conference starting in the 2020 season and an additional regular season game in 2021. The one fewer game in 2022 is due to a mid-game cancellation as the result of an on-field medical incident. For each of the continuous features, we have reported the following values: count, mean, standard deviation, minimum, 1st quartile, median, 3rd quartile, and maximum. These values are displayed in Table 1 for selected variables. Although there are some outliers existent in the data, these are actual data points we decided against adjusting due to their valuable insights.

Count Mean SD Min 25% 50% 75% Max

5929 5929 5929

23.40 227.35 117.99

10.28 77.48 52.30

0 -9 -3

17 173 81

23 222 111

30 277 149

70 522 378

column depending on the winner to list the specific football team that won the specific match. Another column called ‘outcome_binary’ was introduced, which will serve as a target variable during modeling. To streamline our dataset and address potential correlation issues, we implemented a series of additional enhancements. We calculated new features that represent rates by dividing completion numbers by attempt numbers for key scenarios such as ‘third_down,’ ‘fourth_down,’ and ‘redzone.’ This not only reduced the dimensionality of our dataset but also provided clearer insights into the impact of these features on our analysis. 4.3.1 Highly correlated features were removed to prevent duplicacy.

Made with FlippingBook - Online Brochure Maker