AAI_2025_Capstone_Chronicles_Combined

Cinema Analytics and Prediction System

ingestion, real-time inference APIs, monitoring, and user-facing dashboards for industry

applications in entertainment and media analytics.

Data Summary

We utilized the Hollywood Movies Dataset from Kaggle (Hollywood Movies Dataset).

This dataset comprises 4,803 movies and includes 22 features. During our initial exploration, we

observed missing or zero values across several critical variables, as illustrated in Figure 1. The

dataset contains both text-based and numeric

features. The text-based features include

keywords, overviews, and other descriptive fields,

while the numeric features capture quantitative

aspects such as budget, revenue, popularity, and

runtime. While these features form the common foundation for our analysis, we applied different

feature selection and engineering approaches tailored to each of our tasks.

Figure 1: Missing or zero counts of features
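The missing-or-zero tally summarized in Figure 1 can be approximated with a few lines of pandas. The sketch below is illustrative only: the helper name `missing_or_zero_counts` and the toy rows are assumptions made for the example, not the project code or the actual Kaggle data. The column names follow the features named above (budget, revenue, runtime, overview).

```python
import pandas as pd


def missing_or_zero_counts(df: pd.DataFrame) -> pd.Series:
    """Count, per column, entries that are missing or (for numeric columns) zero.

    Hypothetical helper for illustration; not taken from the project code.
    """
    # Missing values (NaN/None) are counted in every column.
    counts = df.isna().sum()
    # Zeros are counted only in numeric columns. NaN never equals 0,
    # so the two counts do not overlap and can simply be added.
    numeric = df.select_dtypes(include="number")
    counts = counts.add(numeric.eq(0).sum(), fill_value=0).astype(int)
    return counts.sort_values(ascending=False)


# Toy rows, made up for the example (not real dataset values).
toy = pd.DataFrame({
    "budget": [1000000, 0, 0, 0],
    "revenue": [2500000, 0, 1200000, 0],
    "runtime": [120.0, None, 95.0, 0.0],
    "overview": ["A heist goes wrong.", None, "Two friends travel.", ""],
})

counts = missing_or_zero_counts(toy)
print(counts)
```

Treating zeros as missing is a judgment call. It is reasonable for fields such as budget or revenue, where a zero most likely means "not recorded", but an empty string in a text column is not counted here and would need its own check.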

Movie Recommendation and Genre Classification

To build an effective movie recommendation and classification system, we began with

extensive exploratory data analysis (EDA) of both the textual and numeric features in the

dataset. The data included rich textual content like overviews, genres, and keywords, as well as

numeric metadata such as popularity scores, revenue, and runtime. Through EDA, we examined

word counts in overviews (averaging around 150 words), frequency distributions of keywords,

and imbalances in genre representation: Drama, Comedy, and Action appeared frequently,
