AAI_2025_Capstone_Chronicles_Combined

Breast Tumor Classification Using Quantum Neural Networks


Dataset Summary

The dataset contains a set of numeric features plus a binary target: malignant or benign. The dataset is skewed toward benign tumors, which make up 63% of the samples; the remaining 37% are malignant. One of the columns is an integer identifier, which is discarded. This leaves 30 features, all of which are floating-point values. These features describe characteristics of the tumor, such as the perimeter, the radius of the lobes, and the surface texture. They comprise ten base measurements, each reported in three variations: the mean, the worst (most extreme) value, and the standard error. For example, for surface texture there is the mean surface texture, the worst surface texture, and the standard error of the surface texture. With the obvious exception of the target, all of these features approximately follow skewed normal distributions, with the long tail on the side of the higher values (right skew).
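To make the notion of a long tail toward higher values concrete, the following is a minimal sketch of how right skew could be checked for a single feature. The sample values and the helper name `skewness` are hypothetical and are not taken from the dataset; the sketch only illustrates that a right-skewed sample has a positive skewness and a mean above its median.

```python
from statistics import fmean, median


def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    m = fmean(values)
    m2 = fmean((v - m) ** 2 for v in values)  # second central moment (variance)
    m3 = fmean((v - m) ** 3 for v in values)  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical measurements with a long tail toward higher values
# (illustrative only; these are not values from the dataset).
sample = [10.0, 11.0, 11.5, 12.0, 12.5, 13.0, 14.0, 16.0, 21.0, 28.0]

print(f"mean={fmean(sample):.2f}  median={median(sample):.2f}  "
      f"skewness={skewness(sample):.2f}")
```

A positive skewness, together with a mean larger than the median, is the numerical counterpart of the histogram shape described above.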

Figure 1

Distribution of 3 features.

As can be seen from Figure 1, the features span very different ranges of values. Therefore, the selected features have to be normalized. Normalization was initially performed by removing the mean and scaling to unit variance (scikit-learn developers, 2025); the result can be seen in Figure 2. The dataset itself is also very clean: there are no missing values or extreme outliers, so the data cleaning needs are minimal.
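The normalization described above (removing the mean and scaling to unit variance) can be sketched in a few lines. This is a plain-Python illustration of the per-column computation, not the authors' code; the function name `standardize`, the column names, and the values are hypothetical. It uses the population standard deviation, which is also what scikit-learn's StandardScaler uses.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Remove the mean and scale to unit variance (z-scores)."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation (ddof = 0)
    if sigma == 0:
        # Constant column: centering alone yields all zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical feature columns on very different scales
# (illustrative only; these are not values from the dataset).
features = {
    "feature_a": [400.0, 550.0, 700.0, 1200.0, 2100.0],
    "feature_b": [0.08, 0.09, 0.10, 0.12, 0.16],
}

scaled = {name: standardize(col) for name, col in features.items()}

for name, col in scaled.items():
    print(name, [round(z, 2) for z in col])
```

After this step every column has mean 0 and unit variance, so features measured in the hundreds and features measured in hundredths contribute on a comparable scale. Standardization is a linear rescaling, so it does not remove the skew of a distribution.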
