AAI_2025_Capstone_Chronicles_Combined


The PCA model provides a compact representation of the engineered timbre descriptors and helps identify which combinations matter most for our task. For example, a component with high loadings on spectral centroid and rolloff may correspond to perceived brightness, while a component driven by attack duration and spectral flux may relate to percussive quality. These relationships help us interpret the perceptual dimensions of timbre. Once reduced, each audio sample is embedded into this PCA space, giving us a classical baseline for the timbre similarity task. PCA shows how far traditional MIR descriptors can go in separating sounds by their properties, and it provides a meaningful latent space for comparing the classical embeddings with those learned by the deep model.

Although PCA captures valuable linear structure, it cannot fully model nonlinear timbral relationships (Jensen, 2005). To address this limitation, we train a deep learning model that combines the Audio Spectrogram Transformer (AST) with a convolutional recurrent neural network (CRNN). The NSynth audio files are first converted into mel spectrograms before being passed to the model. The AST branch processes time-frequency patches and uses self-attention to learn macro structure across frequency and time (Gong et al., 2021). The CRNN branch captures local harmonic, transient, and envelope-related structure and the sonic evolution across frames. The outputs of both branches are concatenated into a shared embedding space, and a supervised head predicts perceptual timbre labels, allowing the model to learn timbre awareness during training.

Audio Spectrogram Transformer and CRNN Model
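As a minimal sketch of the fusion step described above, the following toy example concatenates the two branch outputs into a shared embedding and applies a linear softmax head. The embedding sizes, class count, and random stand-ins for the branch outputs are illustrative assumptions, not the dimensions used in our model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the model's actual dimensions)
batch = 4
ast_dim, crnn_dim, n_classes = 768, 256, 10

# Stand-ins for the two branch outputs on a batch of mel spectrograms:
ast_emb = rng.normal(size=(batch, ast_dim))    # AST: global patch/attention features
crnn_emb = rng.normal(size=(batch, crnn_dim))  # CRNN: local time-frequency features

# Concatenate into the shared embedding space
shared = np.concatenate([ast_emb, crnn_emb], axis=1)  # shape: (batch, ast_dim + crnn_dim)

# Supervised head: linear layer + softmax over perceptual timbre labels
W = rng.normal(size=(shared.shape[1], n_classes)) * 0.01
b = np.zeros(n_classes)
logits = shared @ W + b
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)      # each row is a label distribution
```

In training, the head's cross-entropy loss on these label distributions is what backpropagates timbre awareness into both branches; at inference, the shared embedding itself serves as the learned timbre space.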

