note an ongoing challenge: organizing and navigating timbre spaces in an intuitive manner (Rocchesso et al., 2022), a limitation we aim to improve upon in our work.

The MIR field has formalized many timbral descriptors, providing standardized spectral, temporal, and harmonic features for audio analysis. Python libraries such as Librosa (McFee et al., 2015) have made these computations accessible to researchers and practitioners (a brief sketch of such feature extraction appears at the end of this section). As datasets and computational resources grew, research shifted toward learned representations, including convolutional and recurrent neural networks, which capture nonlinear relationships and temporal structure in timbre (Humphrey et al., 2013). These models can encode evolving sonic partials, transient behavior, envelope shapes, and noise-tonal mixtures, all of which map neatly onto perceptual timbre attributes. More recent transformer-based models extend these ideas: the Audio Spectrogram Transformer (AST) applies self-attention to spectrogram patches and learns relationships across time and frequency (Gong et al., 2021). Hybrid CRNN-transformer systems have been applied to music tagging, environmental sound classification, and instrument identification, and their performance shows that learned embeddings capture timbre effectively, an insight we incorporate into our approach.

Commercial audio platforms have also adopted MIR and ML methods. Services such as Splice and Loopmasters use semantic tags combined with learned classifiers and MIR descriptors to organize large audio collections (Splice, n.d.; Loopmasters, n.d.). Digital audio workstations, including Ableton Live, have recently incorporated ML-driven categorization and tagging within their library browsers (Ableton, n.d.). Generative audio systems such as Suno and Udio focus on
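To make the descriptor discussion above concrete, the following minimal sketch uses Librosa to extract a few standard spectral and temporal timbre features and pool them into a fixed-length, clip-level vector. The file path and the particular feature set are illustrative assumptions for this sketch, not the pipeline used in this project.

    import librosa
    import numpy as np

    # Load an audio file (path is hypothetical); librosa resamples to 22050 Hz by default.
    y, sr = librosa.load("sample.wav")

    # Spectral descriptors commonly used as timbre features.
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)    # brightness
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)  # spectral spread
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)      # high-frequency content
    flatness = librosa.feature.spectral_flatness(y=y)           # noisiness vs. tonality
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # coarse spectral envelope

    # Temporal descriptors.
    zcr = librosa.feature.zero_crossing_rate(y)                 # zero-crossing rate per frame
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)        # transient/onset activity

    # Pool frame-level features into one fixed-length vector per clip.
    features = np.concatenate([
        centroid.mean(axis=1), bandwidth.mean(axis=1), rolloff.mean(axis=1),
        flatness.mean(axis=1), mfcc.mean(axis=1), zcr.mean(axis=1),
        [onset_env.mean()],
    ])
    print(features.shape)  # (19,)

Mean pooling is only one simple summarization; frame-wise standard deviations or delta features are common extensions when more of the temporal evolution of timbre must be retained.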
