AAI_2025_Capstone_Chronicles_Combined
Cinema Analytics and Prediction System
ingestion, real-time inference APIs, monitoring, and user-facing dashboards for industry
applications in entertainment and media analytics.
Data Summary
We utilized the Hollywood Movies Dataset from Kaggle (Hollywood Movies Dataset).
This dataset comprises 4,803 movies and includes 22 features. During our initial exploration, we
observed missing or zero values across several critical variables, as illustrated in Figure 1. The
dataset contains both text-based and numeric features. The text-based features include keywords,
overviews, and other descriptive fields, while the numeric features capture quantitative aspects
such as budget, revenue, popularity, and runtime. While these features form the common
foundation for our analysis, we applied different feature selection and engineering approaches
tailored to each of our tasks.
Figure 1: Missing or zero counts of features
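A per-feature count of missing or zero values, of the kind summarized in Figure 1, could be computed along the following lines. This is a minimal sketch rather than the code used in the project: the helper names `is_missing_or_zero` and `missing_or_zero_counts`, the row-as-dictionary layout, and the toy rows are all assumptions made for illustration, and only the column names (budget, revenue, runtime, overview) come from the text.

```python
import math


def is_missing_or_zero(value):
    """Return True if a cell is empty, NaN, or numerically zero (hypothetical helper)."""
    if value is None:
        return True
    if isinstance(value, str):
        text = value.strip()
        if not text:
            return True
        try:
            value = float(text)
        except ValueError:
            return False  # non-numeric text counts as present
    if isinstance(value, (int, float)):
        return math.isnan(value) or value == 0
    return False


def missing_or_zero_counts(rows, columns):
    """Count, for each column, the rows whose value is missing or zero."""
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            if is_missing_or_zero(row.get(col)):
                counts[col] += 1
    return counts


# Toy rows for illustration only; these are not values from the dataset.
toy_rows = [
    {"budget": "1000", "revenue": "0", "runtime": "120", "overview": "A toy plot."},
    {"budget": "0", "revenue": "0", "runtime": "", "overview": ""},
    {"budget": "500", "revenue": "2000", "runtime": "95", "overview": "Another toy plot."},
]

toy_counts = missing_or_zero_counts(toy_rows, ["budget", "revenue", "runtime", "overview"])
```

Treating zero as missing is a judgment call that suits fields such as budget and revenue, where a zero most likely marks an unrecorded value. For a feature where zero is a legitimate value, the two cases would need to be counted separately.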
Movie Recommendation and Genre Classification
To build an effective movie recommendation and classification system, we began with
extensive exploratory data analysis (EDA) of both the textual and numeric features in the
dataset. The data included rich textual content like overviews, genres, and keywords, as well as
numeric metadata such as popularity scores, revenue, and runtime. Through EDA, we examined
word counts in overviews (averaging around 150 words), frequency distributions of keywords,
and imbalances in genre representation: Drama, Comedy, and Action appeared frequently,