M.S. AAI Capstone Chronicles 2024
CNN Lung Disease Classification
Health (NIH) Chest X-ray Dataset, which includes 112,120 X-ray images, each 1024 × 1024 pixels, with disease labels from 30,805 unique patients. Note that the disease labels for these images were assigned using Natural Language Processing, with an estimated accuracy of 90% (National Institutes of Health, n.d.).
Dataset Summary
The NIH dataset contains images labeled with 14 diseases whose frequencies vary widely. This imbalance posed a challenge for building a deep learning model, as imbalanced data can lead to biased predictions and poor performance on underrepresented classes. To mitigate it, we selected well-represented diseases and combined them with 'No Finding.' From the 112,120 images, we created a sample of 8,000 'No Finding' images and 4,000 images for each disease, plus additional multi-label occurrences. This approach mirrored the dataset's original distribution by including a higher 'No Finding' count while maintaining sufficient diseased cases for training. Our primary objective was to balance data sufficiency, the inclusion of multi-label observations, and computational constraints. Overall, this sample represents 23% of the entire dataset, reducing the data to roughly one-fourth of its original size.
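The sampling scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the project: the function name `build_sample`, the pipe-separated label strings, the rule that multi-label images are kept only when all of their labels belong to the selected diseases, and the disease names in the toy data are all hypothetical choices made for the example.

```python
import random
from collections import Counter


def parse_labels(label_field):
    """Split a pipe-separated label string into a set of label names.

    Assumption: labels arrive as strings such as "Effusion|Atelectasis".
    """
    return frozenset(p.strip() for p in label_field.split("|") if p.strip())


def build_sample(records, diseases, n_no_finding, n_per_disease, seed=0):
    """Draw a capped sample: 'No Finding', single-label, and multi-label images.

    records  -- iterable of (image_id, label_string) pairs
    diseases -- the well-represented diseases chosen for the study
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    diseases = set(diseases)
    no_finding, multi_label = [], []
    single = {d: [] for d in diseases}

    for image_id, label_field in records:
        labels = parse_labels(label_field)
        if labels == {"No Finding"}:
            no_finding.append((image_id, labels))
        elif len(labels) == 1 and labels <= diseases:
            (disease,) = labels
            single[disease].append((image_id, labels))
        elif len(labels) > 1 and labels <= diseases:
            # Hypothetical rule: keep every multi-label image whose labels
            # all fall inside the selected disease set.
            multi_label.append((image_id, labels))

    def take(pool, n):
        # Cap at the pool size so small classes do not raise an error.
        return rng.sample(pool, min(n, len(pool)))

    sample = take(no_finding, n_no_finding)
    for disease in sorted(diseases):
        sample.extend(take(single[disease], n_per_disease))
    sample.extend(multi_label)
    return sample


# Toy usage with placeholder records (not real dataset rows).
toy = [
    ("a.png", "No Finding"), ("b.png", "No Finding"), ("c.png", "No Finding"),
    ("d.png", "Effusion"), ("e.png", "Effusion"), ("f.png", "Atelectasis"),
    ("g.png", "Effusion|Atelectasis"), ("h.png", "Hernia"),
]
sample = build_sample(toy, ["Effusion", "Atelectasis"],
                      n_no_finding=2, n_per_disease=1)
counts = Counter(label for _, labels in sample for label in labels)
```

With the caps set to 8,000 and 4,000, the same function would reproduce the shape of the sample described above. The per-label `counts` tally is a quick way to check that 'No Finding' remains the largest class while each selected disease keeps enough cases for training.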