AAI_2025_Capstone_Chronicles_Combined


The dataset presented several quality issues: missing values, inconsistent encoding, redundancy, and class-support challenges. A small number of messages in the Related category contained an unexpected value (2) outside the expected binary range (0 or 1). Assuming a data-entry error, these were recoded to 1 to maintain binary consistency. Additionally, 19 exact duplicate messages were identified and removed from the training set to prevent model overweighting and to ensure accurate performance metrics. In a live system, input validation and message-similarity hashing would mitigate these issues in real time.

The label distribution posed significant challenges. Three categories ("child alone," "offer," and "PII") contained zero positive examples (zero support) in the training set, rendering them untrainable and requiring their exclusion from the model. Furthermore, approximately 24% of messages were completely unlabeled, representing background noise; these messages were retained as negative examples so the classifiers could learn to identify messages that need no categorization. To ensure the model learned relevant patterns rather than noise, high-frequency English stop-words (such as "the" and "and") were explicitly removed from the message text, forcing the TF-IDF feature extraction to focus on semantically rich, lower-frequency, high-impact terms such as "trapped" or "flood."

The message text is the primary predictive feature, and preliminary analysis confirms its high utility. Manual inspection showed a direct, strong lexical signal, with specific category-relevant keywords (e.g., "injured" for "medical help," "drink" for "water") consistently present. This suggests that the TF-IDF vectorization method will effectively capture meaningful discriminative features.

As seen in Figure 1, examination of label co-occurrence and correlation reveals patterns critical for model architecture selection. As expected, the Aid Related label shows strong positive correlations (r > 0.40) with most other specific-need categories, functioning as a reliable general signal, while the Weather Related label correlates positively with specific disaster types such as floods and storms. Related needs frequently co-occur, forming logical semantic clusters: basic-needs categories including food, water, and shelter show correlations between 0.25 and 0.35, while emergency-response categories including search and rescue, medical help, and direct report cluster similarly. Conversely, some categories exhibit near-zero or slightly negative correlation, such as offer and request, as a message is typically one or the other. The prevalence of positive label correlations (average pairwise correlation approximately 0.15) validates the multi-label framing of the problem, as messages often contain multiple simultaneous needs.
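The cleaning steps described above (recoding the out-of-range Related value, dropping exact duplicates, and excluding zero-support categories) can be sketched in pandas. This is a minimal illustration on toy data; the column names (`message`, `related`, `child_alone`, `water`) are assumptions, not the project's actual schema.

```python
import pandas as pd

# Toy stand-in for the disaster-response training frame.
df = pd.DataFrame({
    "message": ["we are trapped, send water", "we are trapped, send water",
                "need medical help for injured child", "hello everyone"],
    "related": [1, 1, 2, 0],        # contains the unexpected value 2
    "child_alone": [0, 0, 0, 0],    # zero-support category
    "water": [1, 1, 0, 0],
})
label_cols = [c for c in df.columns if c != "message"]

# 1) Recode the unexpected value 2 -> 1 to restore binary labels.
df["related"] = df["related"].replace(2, 1)

# 2) Drop exact duplicate messages to avoid overweighting.
df = df.drop_duplicates(subset="message").reset_index(drop=True)

# 3) Identify and exclude zero-support categories, which are untrainable.
zero_support = [c for c in label_cols if df[c].sum() == 0]
df = df.drop(columns=zero_support)

print(sorted(df["related"].unique()))  # [0, 1]
print(zero_support)                    # ['child_alone']
print(len(df))                         # 3
```

In production, step 2 would be replaced by the message-similarity hashing mentioned above, so near-duplicates arriving in real time are caught as well.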
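The stop-word removal feeding TF-IDF can be expressed with scikit-learn's `TfidfVectorizer`, assuming that is the vectorizer in use (the report names TF-IDF but not the library). The built-in English stop list discards terms like "the" and "and" before weighting, so discriminative terms such as "trapped" and "flood" dominate the feature space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "the flood trapped us and we need water",
    "please send medical help, someone is injured",
    "the weather is nice today",
]

# stop_words="english" drops high-frequency function words before
# TF-IDF weighting is applied.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(messages)

vocab = set(vectorizer.get_feature_names_out())
print("the" in vocab, "and" in vocab)      # False False
print("flood" in vocab, "trapped" in vocab)  # True True
```

The resulting sparse matrix `X` (one row per message) is what the downstream per-category classifiers would consume.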
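The average pairwise label correlation cited above can be computed as the mean of the off-diagonal entries of the label correlation matrix. A small sketch on synthetic 0/1 labels, with column names echoing the report's categories but values made up for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic binary label matrix; "water" and "food" are constructed as
# subsets of "aid_related" so they correlate positively with it.
rng = np.random.default_rng(42)
base = rng.integers(0, 2, size=200)
labels = pd.DataFrame({
    "aid_related": base,
    "water": base & rng.integers(0, 2, size=200),
    "food": base & rng.integers(0, 2, size=200),
})

corr = labels.corr()

# Average pairwise correlation = mean of off-diagonal entries; a clearly
# positive value supports the multi-label framing.
n = corr.shape[0]
avg_pairwise = (corr.values.sum() - np.trace(corr.values)) / (n * (n - 1))
print(round(float(avg_pairwise), 3))
```

On the real label matrix this quantity is what the report estimates at roughly 0.15.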
