AAI_2025_Capstone_Chronicles_Combined

Cinema Analytics and Prediction System

4

while genres like Western or War were

underrepresented. These insights not only guided

downstream feature engineering decisions but

also informed our strategy to handle genre

imbalance, as we can see in the following plot

(Figure 2), using techniques like resampling and

Figure 2: Genre distribution

class weighting during model training.

We then performed a series of preprocessing steps to prepare the data for modeling.

Text-based fields such as genres and keywords were initially stored as stringified JSON and

required parsing to extract meaningful lists of terms. We also normalized metadata like

popularity, budget, and revenue using min-max scaling to ensure compatibility with deep

learning models. A new feature that combined free text features was engineered by merging

each film’s overview, genres, and keywords (Figure 3).

Figure 3: Combined text feature

This became the core input for multiple modeling approaches, including TF-IDF, BERT

embeddings, and LSTM with GloVe vectors. For classification, the same text features were

paired with normalized metadata and used in both unsupervised and supervised settings. The

BERT-based models consistently demonstrated superior semantic understanding in both

171

Internal

Made with FlippingBook - Share PDF online