AAI_2025_Capstone_Chronicles_Combined
Breast Tumor Classification Using Quantum Neural Networks
Dataset Summary
The dataset contains a number of features plus a binary target: malignant or benign. The
dataset is skewed towards benign tumors, which make up 63% of the samples; the
remaining 37% are malignant. One of the columns is an integer identifier, which is discarded.
This leaves 30 features, all of which are floating point values. These features describe ten
characteristics of the tumor, including the perimeter, the radius of the lobes, the surface texture,
and so on. Each characteristic appears in three variations: the mean, the worst (most
extreme), and the standard error. For example, for surface texture there is the mean surface
texture, the worst surface texture, and the standard error of the surface texture. With the obvious
exception of the target, all of these features approximately follow skewed normal distributions, with
the long tail on the side of the higher values.
Figure 1
Distribution of 3 features.
As can be seen from Figure 1, the features span very different ranges of values.
Therefore, the selected features have to be normalized. Initially, the normalization was performed
by removing the mean and scaling to unit variance (scikit-learn developers, 2025); the result
can be seen in Figure 2. The dataset itself is also very clean: there are no
missing values or extreme outliers. Therefore, the data cleaning needs are minimal.
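As a minimal sketch of what "removing the mean and scaling to unit variance" means for a single feature column, the arithmetic can be written out with the Python standard library. The function name `standardize` and the sample values are hypothetical and serve only as illustration; they are not taken from the dataset or from the project code. scikit-learn's `StandardScaler` applies the same per-column transformation, using the population standard deviation.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Return z-scores: subtract the column mean, divide by the population std."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation (divides by n)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical values for illustration only, not taken from the dataset.
mean_perimeter = [78.0, 92.0, 130.0, 85.0, 115.0]
scaled = standardize(mean_perimeter)

# After scaling, the column has mean 0 and unit variance.
print([round(z, 3) for z in scaled])
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))
```

Applying the same transformation to every feature puts columns with very different raw ranges on a common scale. In practice the mean and standard deviation would be estimated on the training split only and reused for any held-out data.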