AAI_2025_Capstone_Chronicles_Combined


The first notable observation concerns the distribution of instrument families. Categories such as bass and keyboard contain a large number of examples, while others, including flute, string, vocal, and especially synth-lead, are comparatively sparse. This imbalance affects supervised tasks, so to keep the model from overfitting to the most common instrument families, we apply targeted oversampling to the sparser categories.

The second observation concerns the distribution of perceptual (timbral) descriptors. Distortion and reverb labels occur often, and bright and fast-decay characteristics are heavily represented, whereas others, such as nonlinear envelope behavior, occur much less frequently. If left unaddressed, a model trained on this dataset is likely to ignore the underrepresented qualities. Oversampling and careful batching help us counter this imbalance and avoid bias toward the most common timbral categories.

The third observation concerns time variance, which our model training requires but which NSynth labels mostly ignore. Temporal characteristics relate directly to the perceptual experience of timbre. The EDA therefore motivates custom engineered features that capture sonic evolution and harmonic behavior over time and that better encode dynamic changes.
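As a rough illustration of how oversampling can offset class imbalance, the sketch below draws training indices with probability inversely proportional to class frequency. Inverse-frequency weighting is an assumption made here for illustration; the text does not specify which oversampling strategy was used, and the function names and toy label counts are hypothetical.

```python
from collections import Counter
import random


def inverse_frequency_weights(labels):
    """Give each example a weight of 1 / (size of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def oversample(labels, n_draws, seed=0):
    """Draw example indices with replacement so that every class is
    sampled about equally often (assumed strategy, not the authors')."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=n_draws)


# Toy, made-up label counts: one common family and one sparse family.
labels = ["bass"] * 90 + ["synth_lead"] * 10
indices = oversample(labels, n_draws=10_000)
drawn = Counter(labels[i] for i in indices)
print(drawn)
```

With these weights each class carries the same total sampling mass, so the sparse family appears in roughly half of the draws even though it makes up a tenth of the data. The same idea extends to the timbral descriptor labels, although multi-label data would need a weighting rule that the text does not describe.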

Methods

Our system is designed to learn complex timbres, represent them in two-dimensional spaces, and retrieve similar sounds based on distance metrics within those spaces. The methods include feature engineering, classical dimensionality reduction, deep learning, embedding extraction, similarity indexing, and supervised training on perceptual timbre labels.
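The retrieval step can be pictured with a minimal sketch, assuming Euclidean distance and a brute-force search over two-dimensional embeddings. The distance metric, the function name, and the toy coordinates are assumptions for illustration only; the text does not state which metric or index the system uses.

```python
import math


def nearest_neighbors(query, embeddings, k=3):
    """Return the indices of the k embeddings closest to `query`,
    ordered from nearest to farthest (Euclidean distance assumed)."""
    distances = [
        (math.dist(query, point), idx) for idx, point in enumerate(embeddings)
    ]
    distances.sort()
    return [idx for _, idx in distances[:k]]


# Hypothetical 2-D embeddings, one point per sound.
embeddings = [(0.0, 0.0), (1.0, 1.0), (0.1, -0.1), (5.0, 5.0)]
result = nearest_neighbors((0.0, 0.1), embeddings, k=2)
print(result)
```

A brute-force scan is adequate for small collections; a larger library would typically call for a dedicated similarity index.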
