AAI_2025_Capstone_Chronicles_Combined
MENTAL HEALTH RISK DETECTION USING ML
The goal is to develop a validated, interpretable machine learning model that classifies individuals as low, moderate, or high risk for mental health concerns. We hypothesize that a combination of demographic, occupational, and sentiment-based features can accurately predict mental health risk. The final product will be a web-based interface where users can complete a survey and receive risk scores and recommendations for intervention planning and prevention.
Data Summary
The dataset includes approximately 300,000 anonymized survey responses (Jikadara, 2024) and contains 17 columns. Most variables are categorical and stored as strings, covering demographic (gender, age group, country, state), occupational (occupation, self-employed), and psychological (family history, treatment, growing stress, coping difficulty, changes in habits, days indoors, etc.) dimensions. A timestamp column is also present but will be used only for exploratory analysis, not for model training. The dataset presented several quality issues, which were resolved through systematic cleaning. A small portion of responses contained missing values, likely because participants skipped questions or technical errors occurred during submission. These rows were removed to avoid introducing noise and to preserve data integrity. Exact duplicate entries, potentially caused by repeated submissions or system export errors, were also identified and deleted. Additionally, several countries were severely underrepresented (Figure 1), contributing fewer than 300 responses each, and showed a strong gender imbalance, with male respondents often overrepresented.
Figure 1 Gender Distribution by Country
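The cleaning steps described above can be illustrated with a minimal pandas sketch. It is not the project's actual pipeline: the function name clean_survey, the column names "Country", "Gender", and "treatment", and the toy data are assumptions made for illustration. The sketch only flags countries below the response threshold; how those countries are handled afterward is left open.

```python
import pandas as pd


def clean_survey(df: pd.DataFrame, min_country_responses: int = 300):
    """Drop incomplete and duplicate rows, then flag small-sample countries.

    Column names "Country" and "Gender" are assumed, not taken from the dataset.
    """
    # Remove rows that contain any missing value.
    cleaned = df.dropna()
    # Remove exact duplicate rows, keeping the first occurrence.
    cleaned = cleaned.drop_duplicates().reset_index(drop=True)

    # Count responses per country and list those below the threshold.
    country_counts = cleaned["Country"].value_counts()
    underrepresented = sorted(
        country_counts[country_counts < min_country_responses].index
    )

    # Share of each gender within each country (each row sums to 1).
    gender_share = pd.crosstab(
        cleaned["Country"], cleaned["Gender"], normalize="index"
    )
    return cleaned, underrepresented, gender_share


# Tiny made-up frame, used only to exercise the function.
sample = pd.DataFrame(
    {
        "Country": ["A", "A", "A", "B", "B", None],
        "Gender": ["Male", "Male", "Female", "Male", "Male", "Female"],
        "treatment": ["Yes", "Yes", "No", "No", None, "Yes"],
    }
)
cleaned, small, shares = clean_survey(sample, min_country_responses=2)
```

On the real data, the default threshold of 300 matches the cutoff mentioned in the text, and the gender_share table gives the per-country proportions that a chart such as Figure 1 would display.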