M.S. Applied Data Science - Capstone Chronicles 2025


4 Methodology

This project analyzes YouTube video metadata collected using the YouTube Data API v3, which provides public-facing video-level information, including views, likes, comments, titles, channel data, and publishing timestamps. Using a set of keyword-based queries, the dataset was constructed by retrieving a total of 3,046 videos across a wide range of content categories. The original dataset contained eight columns, but after preprocessing and feature engineering, the working dataset was expanded to 20 columns, including numerical, categorical, and temporal fields. The primary objective of this project is to determine which early indicators from metadata can reliably predict a video’s popularity.

All videos were retrieved through custom scripts written in Python using the requests library. The API key was stored securely in a local configuration file and was not shared publicly. After collection, preprocessing steps were performed to clean the dataset, derive new features, and remove records with missing or invalid values, such as videos with zero views or inconsistencies between likes and view counts. Key engineered features include title length, likes per view, comments per view, word count, publish hour, and publish day.

The dataset used in this project was collected through a publicly accessible API and contains no personally identifiable information. As a result, there are no ethical concerns related to privacy. All procedures followed are compliant with YouTube’s API Terms of Service.

The methodology for this project consists of four core steps: data acquisition, exploratory data analysis, data preparation through cleaning and feature creation, and ultimately, predictive modeling based on selected features.
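The cleaning and feature-creation step described in this section can be illustrated with a minimal sketch. It is not the project's actual script. The field names (`title`, `views`, `likes`, `comments`, `published_at`) and the helper names `engineer_features` and `build_dataset` are hypothetical. The sketch also assumes that "word count" refers to the title, that an inconsistency between likes and views means more likes than views, and that timestamps use the ISO 8601 UTC form `YYYY-MM-DDTHH:MM:SSZ`.

```python
from datetime import datetime

# Hypothetical field names; the project's real column names are not given.
REQUIRED_FIELDS = ("title", "views", "likes", "comments", "published_at")


def engineer_features(record):
    """Return a copy of `record` with derived features, or None if it is invalid."""
    # Remove records with missing values.
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return None

    views = record["views"]
    likes = record["likes"]
    comments = record["comments"]
    title = record["title"]

    # Remove zero-view videos and (assumed rule) videos with more likes than views.
    if views <= 0 or likes > views:
        return None

    # Assumed timestamp format: ISO 8601 in UTC, e.g. "2025-01-15T18:30:00Z".
    published = datetime.strptime(record["published_at"], "%Y-%m-%dT%H:%M:%SZ")

    features = dict(record)
    features.update(
        {
            "title_length": len(title),            # characters in the title
            "word_count": len(title.split()),      # whitespace-separated words (assumed: title)
            "likes_per_view": likes / views,
            "comments_per_view": comments / views,
            "publish_hour": published.hour,        # 0-23
            "publish_day": published.weekday(),    # Monday = 0 ... Sunday = 6
        }
    )
    return features


def build_dataset(records):
    """Apply cleaning and feature creation, keeping only the valid records."""
    engineered = (engineer_features(record) for record in records)
    return [row for row in engineered if row is not None]
```

Calling `build_dataset` on a list of raw metadata dictionaries returns only the rows that pass the cleaning rules, each extended with the six engineered features named above. Ratio features such as likes per view are only defined once zero-view records are removed, which is why the filter runs before the division.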

capturing how popularity builds over time through the recommendation system.

3.4 Poor Video Streaming Performance Explained (and Fixed)

This paper shifts focus from content and engagement to technical performance. Arye et al. studied how poor streaming quality affects user behavior and video metrics. A key theme is that external factors like buffering and load times can shape perceived popularity. This adds a new pattern to the field by showing that not all performance issues come from the content itself. Unlike earlier studies, this one does not aim to predict views but to explain underperformance caused by network issues. The results show that user experience can lower engagement even if the video is strong. However, the study does not explore how early metadata or content attributes might help predict performance in advance. This leaves a gap in using pre-publish features for forecasting outcomes.

3.5 Modeling Rabbit-Holes on YouTube

Le Merrer et al. (2023) examined how YouTube’s design can lead users into narrow content loops through autoplay and personalization. Their study used bots to simulate user sessions and track how content exposure changes over time. A key theme is how user interaction patterns shape the type of content being surfaced. This adds a different perspective by focusing on platform behavior rather than content or user metrics. Although their work helps explain how visibility can shift, it does not attempt to predict popularity or use video-level features. This contrasts with studies that rely on metadata and early performance. The paper does not explore how information available before viewing, such as titles or categories, can inform future engagement. This leaves a gap in using static features to model potential popularity before a video gains traction.

