M.S. Applied Data Science - Capstone Chronicles 2025
engineering to derive predictive signals from raw metadata. This included extracting, transforming, and normalizing variables across three primary domains: textual features, temporal attributes, and engagement metrics.

For textual features, we focused on the title column, which serves as the primary descriptor of video content. We calculated title_length (character count) and word_count as proxy indicators of complexity and semantic density; such surface-level metrics are commonly used in natural language processing to capture linguistic variation. In addition, the raw title text was later vectorized using TF-IDF during the preprocessing pipeline, preserving token-level information while reducing dimensionality and the noise introduced by frequently occurring stop words.

Temporal feature engineering involved parsing the ISO 8601-formatted published timestamp. Using Python's datetime utilities, we extracted discrete components, including year, month, day, and publish_hour, as well as publish_day (day of the week). To improve interpretability and support categorical encoding, we converted the numeric month into a human-readable month_name. These features allow the model to learn patterns associated with content release timing, such as the effect of weekday/weekend publishing or diurnal engagement trends.

To capture audience interaction behavior, we engineered normalized engagement metrics: likes_per_view and comments_per_view. These continuous variables standardize raw engagement counts against the view count, providing a scale-invariant measure of interaction rate. This is particularly important given the long-tailed distribution of views on platforms like YouTube,
where raw like and comment counts can be misleading due to exposure bias. Together, these engineered features give the model a rich mix of textual, temporal, and behavioral information, improving its ability to generalize across diverse video types and publishing contexts.

4.4 Modeling

The modeling phase aimed to predict video popularity, measured by total view count, from a range of video-level metadata. After preprocessing and feature engineering, the dataset had a clean structure containing normalized engagement metrics, such as likes per view and comments per view, as well as categorical and temporal indicators like publish hour. This final dataset was saved and imported into a separate environment for modeling to ensure consistency and reproducibility.

4.4.1 Selection of modeling techniques. To explore a range of predictive capabilities, four distinct supervised learning algorithms were implemented: Linear Regression, Support Vector Regression (SVR), XGBoost, and CatBoost. Using multiple models allowed a more comprehensive evaluation of how different algorithm families perform with the available features. Linear Regression was chosen as a baseline due to its simplicity and ease of interpretation. SVR was used to assess how well a kernel-based method captures the underlying relationships in the data. XGBoost was selected for its strong performance on structured data and its built-in regularization. CatBoost was included for its fast training speed and its native handling of categorical features on large datasets. These models were chosen to represent a variety of complexities and learning styles, enabling a
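The feature-engineering steps described above can be sketched in pandas. The column names (title, published_at, views, likes, comments) and the two-row frame are illustrative assumptions standing in for the actual metadata schema:

```python
import pandas as pd

# Hypothetical stand-in for the video-metadata dataset; column names are assumptions.
df = pd.DataFrame({
    "title": ["How to Train a Model", "Top 10 Python Tips"],
    "published_at": ["2024-03-15T14:30:00Z", "2024-03-16T09:05:00Z"],
    "views": [12000, 450],
    "likes": [600, 30],
    "comments": [120, 9],
})

# Textual surface features: character count and word count of the title.
df["title_length"] = df["title"].str.len()
df["word_count"] = df["title"].str.split().str.len()

# Temporal features parsed from the ISO 8601-formatted timestamp.
ts = pd.to_datetime(df["published_at"])
df["year"] = ts.dt.year
df["month"] = ts.dt.month
df["month_name"] = ts.dt.month_name()   # human-readable month for categorical encoding
df["day"] = ts.dt.day
df["publish_hour"] = ts.dt.hour
df["publish_day"] = ts.dt.day_name()    # day of the week

# Scale-invariant engagement rates normalized by view count.
df["likes_per_view"] = df["likes"] / df["views"]
df["comments_per_view"] = df["comments"] / df["views"]
```

Normalizing by views before modeling means a video with 600 likes on 12,000 views and one with 30 likes on 450 views are compared on rate rather than raw exposure.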
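The TF-IDF vectorization of titles mentioned in the preprocessing pipeline can be sketched with scikit-learn's TfidfVectorizer; the sample titles and the max_features cap of 500 are assumptions, not values from the study:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative titles; in the study, this would be the full title column.
titles = [
    "How to Train a Model",
    "Top 10 Python Tips",
    "Python Model Training Walkthrough",
]

# Dropping English stop words keeps frequent function words from dominating,
# and max_features caps dimensionality, matching the stated noise-reduction goal.
vectorizer = TfidfVectorizer(stop_words="english", max_features=500)
X_titles = vectorizer.fit_transform(titles)  # sparse (n_titles, n_terms) matrix
```

The resulting sparse matrix can be concatenated with the numeric features before modeling.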
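The model-comparison setup can be sketched as a single loop over candidate regressors. A synthetic design matrix stands in for the engineered features, and only the two scikit-learn models are instantiated here; in the full study, xgboost.XGBRegressor and catboost.CatBoostRegressor would be added to the same dictionary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the engineered feature matrix (engagement rates,
# publish hour, title statistics) and the view-count target.
rng = np.random.default_rng(42)
X = rng.random((200, 4))
y = X @ np.array([3.0, -1.5, 2.0, 0.5]) + rng.normal(0.0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline linear model and a kernel-based model; gradient-boosting
# models (XGBoost, CatBoost) plug into the same interface via .fit/.predict.
models = {
    "Linear Regression": LinearRegression(),
    "SVR (RBF kernel)": SVR(kernel="rbf", C=10.0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
```

Keeping a uniform fit/predict/score loop makes it straightforward to evaluate all four algorithms on the same held-out split.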