AAI_2025_Capstone_Chronicles_Combined

rolloff often load heavily onto the first component, which corresponds to perceived brightness. Temporal descriptors, such as attack and decay durations, contribute more strongly to later components, which may reflect a sound’s sustained energy. Nearest-neighbor retrieval within the PCA embedding reveals several multi-label clusters: percussive samples tend to group together, for instance, and sustained pad-like samples appear near one another. However, PCA struggles to represent subtler or nonlinear timbral relationships. For example, two samples with similar perceived texture but different high-frequency content may be placed farther apart in PCA space than expected.

Overall, PCA provides an interpretable, lightweight representation that lends itself to analysis. It performs well for sounds that differ strongly in brightness, noisiness, or onset characteristics, but less well for timbre categories that involve finer temporal structure or evolving spectral behavior.

Deep Model Training Results

Training curves for the Audio Spectrogram Transformer and CRNN models (Figure 8) show a clear downward trend in training loss across epochs, indicating effective learning. Validation loss decreases sharply in the first few epochs and then stabilizes, reflecting an early plateau in generalization. The widening gap between training and validation loss indicates mild overfitting despite rebalancing. In practice, however, the downstream retrieval engine still performed well: the embedding was expressive enough to return meaningful sonic matches even though the model did not fully generalize. This is likely because similarity search depends on the relative geometry of the embedding space, that is, on which samples lie close to one another, rather than on accurate classification.
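Whether the embedding comes from PCA or from a trained network, retrieval works the same way: each sample becomes a vector, and the "most similar" sounds are simply the nearest vectors. The following is a minimal sketch of that idea, not the capstone's implementation. It assumes NumPy, five hypothetical descriptor columns, and synthetic "percussive" and "pad" prototypes with 5% multiplicative noise. It fits PCA via SVD on standardized descriptors, prints each descriptor's loading on the first component, and retrieves cosine nearest neighbors.

```python
import numpy as np

# Hypothetical descriptor columns; the project's actual feature set may differ.
FEATURES = ["spectral_centroid", "spectral_rolloff", "zero_crossing_rate",
            "attack_time", "decay_time"]

def fit_pca(x, n_components):
    """Standardize descriptors, then fit PCA via SVD."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # guard against constant descriptors
    z = (x - mean) / std
    # Rows of vt are principal axes, already sorted by decreasing variance.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s**2 / np.sum(s**2)  # fraction of variance per component
    return mean, std, vt[:n_components], explained[:n_components]

def transform(x, mean, std, components):
    """Project standardized descriptors onto the principal axes."""
    return ((x - mean) / std) @ components.T

def nearest_neighbors(embeddings, query_idx, k=3):
    """Indices of the k most cosine-similar samples, excluding the query."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # never return the query itself
    return np.argsort(-sims)[:k]

# Synthetic data: bright, fast-onset hits vs. darker, slow-onset pads.
rng = np.random.default_rng(0)
percussive = np.array([4000.0, 8000.0, 0.15, 0.005, 0.2])
pad = np.array([1500.0, 3000.0, 0.03, 0.5, 2.0])
prototypes = np.vstack([np.tile(percussive, (5, 1)), np.tile(pad, (5, 1))])
x = prototypes * (1 + 0.05 * rng.standard_normal(prototypes.shape))
labels = ["percussive"] * 5 + ["pad"] * 5

mean, std, components, explained = fit_pca(x, n_components=2)
embedding = transform(x, mean, std, components)

# Loadings on PC1 (the overall sign of a component is arbitrary).
for name, weight in zip(FEATURES, components[0]):
    print(f"{name:>20s} {weight:+.2f}")

neighbors = nearest_neighbors(embedding, query_idx=0, k=3)
print([labels[i] for i in neighbors])
```

Replacing `embedding` with vectors from a trained model (for example, pooled activations from the AST or CRNN) would leave `nearest_neighbors` unchanged. Only the relative positions of samples matter for retrieval, which is consistent with the observation that retrieval can stay useful even when classification does not fully generalize.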
