
text-to-audio synthesis, although they illustrate a broader trend toward aligning perceptual and acoustic representations through machine learning (Suno, n.d.; Udio, n.d.).

XLN Audio’s XO provides a notable commercial analogue to our system. XO maps drum samples into a two-dimensional navigable space in which nearby points correspond to timbral similarity, demonstrating how embedding-based visualizations can support sample selection (XLN Audio, n.d.). Although XO is limited to percussion, it highlights the range of possible strategies for representing timbre via deep-learned embeddings.

Dataset

This project uses Google Magenta’s NSynth dataset, which contains approximately 305,000 audio files paired with JSON metadata (Engel et al., 2017). Each audio file is a four-second monophonic recording sampled at 16 kHz. The dataset spans a diverse set of instrumental sources, from acoustic instruments such as strings and woodwinds to synthesizers and other electronic instruments.

Most useful for our task, the JSON metadata for each file includes several timbral label categories encoded in binary form (e.g., bright = 1 or 0). Each record also provides basic non-timbral attributes, including pitch, the velocity used to trigger the sound, and an instrument class. For the purposes of this project, the most relevant metadata are the perceptual timbral descriptors: whether a sound is bright or dark, percussive, distorted, or reverberant, and whether it exhibits behaviors such as a long tail release, fast decay, or a nonlinear envelope. These labels are helpful for exploratory analysis. However, their binary nature does not capture how the complexities of timbre evolve over time; the concept of a long release, for example, is flattened into a single yes-or-no flag.
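To make the metadata structure concrete, the sketch below shows one way to load a split’s examples.json file and collect the notes tagged with a given timbral quality. Field names such as qualities_str, pitch, and velocity follow the published NSynth schema, but the directory path is a placeholder; this is an illustrative sketch under those assumptions, not the project’s own loading code.

    import json
    from pathlib import Path

    # Path to one NSynth split (e.g., nsynth-train) -- placeholder, adjust as needed.
    SPLIT_DIR = Path("nsynth-train")

    # examples.json maps each note ID to its JSON metadata record.
    with open(SPLIT_DIR / "examples.json") as f:
        examples = json.load(f)

    def notes_with_quality(examples, quality):
        """Return note IDs whose binary timbral labels include `quality` (e.g., 'bright')."""
        return [
            note_id
            for note_id, record in examples.items()
            if quality in record["qualities_str"]
        ]

    bright_ids = notes_with_quality(examples, "bright")
    print(f"{len(bright_ids)} of {len(examples)} notes are labeled bright")

    if bright_ids:
        # Each record also carries the non-timbral attributes described above.
        sample = examples[bright_ids[0]]
        print(sample["pitch"], sample["velocity"], sample["instrument_family_str"])

        # The corresponding 4-second, 16 kHz recording lives in the split's audio/ folder.
        wav_path = SPLIT_DIR / "audio" / f"{bright_ids[0]}.wav"

Because each quality is a single binary flag per note, a filter like this can only partition the dataset into labeled and unlabeled notes; it cannot express how strongly, or over what portion of the four-second clip, a quality such as a long release applies.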
