AAI_2025_Capstone_Chronicles_Combined


simultaneous needs frequently resulted in partial predictions where the model correctly identified 2-3 categories but missed others, particularly when minority categories co-occurred with majority categories, suggesting that the independent binary classification approach fails to leverage strong co-occurrence patterns between related categories. Third, short messages with minimal lexical content (fewer than 10 words, comprising approximately 18% of validation messages) posed significant challenges, as sparse feature vectors with only 2-4 non-zero features in the 5,000-dimensional TF-IDF space provided insufficient discriminative signal for confident category assignment. Fourth, the model exhibited geographic and event-type biases inherited from training data heavily weighted toward Haiti earthquake messages (45% of dataset), showing performance degradation on flood-specific terminology and messages using region-specific place names or non-standard English, indicating limited cross-disaster generalization capability.

6 Conclusion

This project successfully developed a multi-label machine learning system for automated disaster message classification, achieving test set performance of micro F1=0.677 and macro F1=0.518 through systematic optimization involving cost-sensitive learning, per-label threshold tuning, and category-specific hyperparameter refinement. While falling short of the ambitious targets of micro F1 > 0.75 and macro F1 > 0.60, the model successfully exceeded the F1 > 0.80 threshold for critical basic needs categories including food (0.84) and water (0.81), demonstrating that multi-label classification can achieve production-ready performance for high-frequency humanitarian categories.
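To make the gap between micro and macro F1 and the role of per-label threshold tuning concrete, the following is a minimal sketch in plain Python. It is not the project's implementation: the function names (`label_counts`, `micro_macro_f1`, `tune_threshold`), the two toy labels, the scores, and the threshold grid are all illustrative assumptions. It shows how a rare label scored below a shared 0.5 cutoff pulls macro F1 down while micro F1 stays high, and how choosing a separate cutoff per label can recover it.

```python
# Toy illustration only: data, names, and grid are assumptions, not project values.

def label_counts(y_true, y_pred):
    """Return (tp, fp, fn) for one label's binary truth and prediction lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(truth_by_label, pred_by_label):
    """Micro F1 pools counts over labels; macro F1 averages per-label F1."""
    counts = [label_counts(t, p) for t, p in zip(truth_by_label, pred_by_label)]
    micro = f1_from_counts(*(sum(c[i] for c in counts) for i in range(3)))
    macro = sum(f1_from_counts(*c) for c in counts) / len(counts)
    return micro, macro

def binarize(scores, threshold):
    """Predict 1 when the score reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]

def tune_threshold(y_true, scores, grid):
    """Return the first threshold in `grid` that maximizes F1 for one label."""
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        f1 = f1_from_counts(*label_counts(y_true, binarize(scores, t)))
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# Hypothetical labels: one frequent, one rare with systematically low scores.
truth = [
    [1, 1, 1, 0, 1, 0],  # frequent label
    [0, 0, 1, 0, 1, 0],  # rare label
]
scores = [
    [0.9, 0.8, 0.6, 0.4, 0.7, 0.2],
    [0.1, 0.2, 0.35, 0.15, 0.45, 0.05],
]
grid = [i / 10 for i in range(1, 10)]  # candidate thresholds 0.1 .. 0.9

# Shared default cutoff of 0.5 for every label.
default_preds = [binarize(s, 0.5) for s in scores]
micro_default, macro_default = micro_macro_f1(truth, default_preds)

# One tuned cutoff per label.
thresholds = [tune_threshold(t, s, grid) for t, s in zip(truth, scores)]
tuned_preds = [binarize(s, th) for s, th in zip(scores, thresholds)]
micro_tuned, macro_tuned = micro_macro_f1(truth, tuned_preds)

print("default:", micro_default, macro_default)
print("thresholds:", thresholds)
print("tuned:", micro_tuned, macro_tuned)
```

In this toy setup the thresholds are selected and scored on the same six examples, which is optimistic. In practice the cutoffs would be chosen on a validation split and the resulting F1 reported on held-out test data.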

The optimization process revealed that per-label threshold tuning delivered the greatest performance gains with minimal computational cost, demonstrating that decision boundary calibration should be prioritized before pursuing expensive architectural modifications. Cost-sensitive learning successfully aligned model behavior with humanitarian priorities by shifting precision-recall trade-offs toward higher recall for minority classes, though persistent performance gaps for categories with fewer than 100 validation examples reveal fundamental TF-IDF limitations requiring transformer-based contextual embeddings as the next development step.

The project's findings were constrained by several methodological limitations. The dataset's temporal and geographic boundaries (2010-2012 disasters in Haiti, Chile, Pakistan, and the United States) limit generalizability to contemporary communications, as language patterns and platforms have evolved substantially with the proliferation of Twitter/X, WhatsApp, and other messaging platforms not captured in the training data. The random train-validation-test split strategy does not simulate cross-disaster generalization that would occur in operational deployment, potentially overestimating model performance on genuinely novel disaster scenarios. Additionally, the evaluation metrics employed (F1-score, precision, recall) assume equal importance across all 36 categories, which does not reflect operational reality where medical emergencies and search and rescue operations carry far greater humanitarian significance than infrastructure status updates. A weighted evaluation scheme incorporating domain expert input on category importance would provide more meaningful assessment of model utility for disaster response organizations. The TF-IDF
