Ensemble of deep learning, visual and acoustic features for music genre classification


Keywords

audio classification, texture, image processing, acoustic features, ensemble of classifiers, machine learning


Abstract

In this work, we present an ensemble for automated music genre classification that fuses acoustic and visual (both handcrafted and nonhandcrafted) features extracted from audio files. These features are evaluated, compared and fused in a final ensemble shown to produce better classification accuracy than other state-of-the-art approaches on the Latin Music Database, ISMIR 2004, and the GTZAN genre collection. To the best of our knowledge, this paper reports the largest test comparing the combination of different descriptors (including a wavelet convolutional scattering network, which has been tested here for the first time as an input for texture descriptors) and different matrix representations. Superior performance is obtained without ad hoc parameter optimisation; that is to say, the same ensemble of classifiers and parameter settings are used on all tested data-sets. To demonstrate generalisability, our approach is also assessed on the tasks of bird species recognition using vocalisation and whale detection data-sets. All MATLAB source code is available.

Recommended Citation

Nanni, Loris, Yandre MG Costa, Rafael L. Aguiar, Carlos N. Silla Jr, and Sheryl Brahnam. "Ensemble of deep learning, visual and acoustic features for music genre classification." Journal of New Music Research 47, no. 4 (2018): 383-397.
