M.S. Applied Data Science - Capstone Chronicles 2025


4.1 Data Acquisition and Aggregation

Data were collected using Python scripts that interfaced with the YouTube Data API v3. A series of automated queries was issued to collect videos across multiple categories and keywords. The API response data were parsed and structured into a Pandas DataFrame for further analysis. A total of 3,046 records were collected, each containing information such as video ID, title, channel name, published date, view count, like count, and comment count.

The original dataset consisted of eight columns. Through preprocessing and feature engineering, 12 additional features were created, resulting in a total of 20 columns. These new features were derived using Python's datetime, string, and math functions, and include variables such as title length, word count, likes per view, comments per view, publish day, and publish hour. These transformations allowed for more detailed analysis of how structural metadata and upload timing relate to engagement.

4.1.1 Data Cleaning

Data cleaning involved removing records with zero views and cases where the number of likes exceeded the number of views. These cleaning steps removed fewer than 50 records but were necessary to ensure logical consistency and improve model accuracy. Additionally, categorical and temporal features were transformed to improve interpretability. For example, the published date was converted to publish day and publish hour to better capture when a video was uploaded and how that might impact performance.

4.1.1 Exploratory Data Analysis

The EDA process began by examining relationships between key engagement metrics and metadata to uncover patterns that may influence video popularity. By analyzing correlations, scatterplots, and aggregated trends, we aimed to identify which features offer the most predictive potential for future modeling.

Understanding how core numerical features relate to each other is essential before selecting which variables to prioritize in modeling. A heatmap provides a quick visual overview of those relationships across the dataset. Figure 1 shows the correlation between numeric variables, including views, likes, comments, and publishing metadata.

Figure 1 highlights a strong correlation between views and likes (0.86), as well as between views and comments (0.77). This suggests that videos receiving more exposure typically attract more engagement across all metrics. Likes and comments are also strongly associated with each other (0.80), which further supports the idea that highly engaging videos generate interactions across multiple fronts. Meanwhile, temporal features such as day, month, and year show little to no correlation with engagement, implying they may not be strong predictive features on their own.
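
The cleaning rules, a few of the derived features, and the kind of pairwise correlation summarized in Figure 1 can be sketched with the standard library alone. This is a minimal illustration, not the study's code: the sample records, the field names (`published_at`, `views`, and so on), and the helper functions are hypothetical, the study itself worked on a Pandas DataFrame, and the use of the Pearson coefficient is an assumption, since the text does not name the correlation method.

```python
import math
from datetime import datetime

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"title": "How to learn Python fast", "published_at": "2024-03-15T14:30:00Z",
     "views": 1000, "likes": 100, "comments": 10},
    {"title": "Cat video", "published_at": "2024-03-16T09:05:00Z",
     "views": 0, "likes": 0, "comments": 0},       # zero views: removed
    {"title": "Broken counts", "published_at": "2024-03-17T20:00:00Z",
     "views": 50, "likes": 80, "comments": 1},     # likes exceed views: removed
    {"title": "Data science capstone tips", "published_at": "2024-03-18T18:45:00Z",
     "views": 4000, "likes": 300, "comments": 60},
    {"title": "Weekly vlog", "published_at": "2024-03-23T07:15:00Z",
     "views": 200, "likes": 30, "comments": 2},
]

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def clean(rows):
    """Drop rows with zero views or with more likes than views."""
    return [r for r in rows if r["views"] > 0 and r["likes"] <= r["views"]]


def add_features(row):
    """Return a copy of the row with title, ratio, and timing features added."""
    ts = datetime.strptime(row["published_at"], "%Y-%m-%dT%H:%M:%SZ")
    out = dict(row)
    out["title_length"] = len(row["title"])
    out["word_count"] = len(row["title"].split())
    # Safe to divide: clean() has already removed zero-view rows.
    out["likes_per_view"] = row["likes"] / row["views"]
    out["comments_per_view"] = row["comments"] / row["views"]
    out["publish_day"] = DAY_NAMES[ts.weekday()]
    out["publish_hour"] = ts.hour
    return out


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


featured = [add_features(r) for r in clean(records)]
views = [r["views"] for r in featured]
likes = [r["likes"] for r in featured]
corr_views_likes = pearson(views, likes)
```

Applying the cleaning filter before computing the per-view ratios keeps the division well defined, which is one practical reason to remove zero-view records early.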
