Isolating the singing voice from music tracks: a deep neural network approach to karaoke

Jonathan Deboosere

Isolating Vocals from Music Using Artificial Intelligence

There is no escaping it: artificial intelligence (AI) is promising and is already being used in many sectors. AI has proven itself in, among other things, recognizing faces and objects in images. It is also applied to audio; you can, for example, give voice commands to your smartphone. To make this possible, the audio is first converted into a format the computer can learn from. In effect, we turn the sound into a kind of image.

This thesis investigates whether computers can learn to isolate the vocals or the instruments from a music track. Here too, the audio is first converted into a format the computer can read. The computer then produces a prediction containing only the vocals or only the instruments.
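A common way to realize this pipeline, which the abstract only describes in outline, is to compute a magnitude spectrogram of the mixture (the "image" of the sound) and multiply it by a time-frequency mask; in the thesis setup a neural network would predict such a mask from the mixture alone. The sketch below is illustrative, not the thesis implementation: it builds a synthetic mixture and uses an oracle (ideal ratio) mask in place of a trained network, with all parameter values chosen for the example.

```python
import numpy as np

def stft(signal, frame_len=1024, hop=256):
    """Short-time Fourier transform: slice the signal into
    overlapping windowed frames and take the FFT of each."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (frames, freq bins)

# Synthetic "mixture": a 440 Hz tone stands in for the vocals and
# low-level noise stands in for the accompaniment, at 22050 Hz.
sr = 22050
t = np.arange(sr) / sr
vocals = 0.5 * np.sin(2 * np.pi * 440.0 * t)
accompaniment = 0.1 * np.random.default_rng(0).standard_normal(sr)
mixture = vocals + accompaniment

V = np.abs(stft(vocals))         # magnitude spectrogram of the vocals
A = np.abs(stft(accompaniment))  # ... of the accompaniment
M = np.abs(stft(mixture))        # ... of the mixture (the network input)

# Ideal ratio mask: per time-frequency bin, the fraction of magnitude
# belonging to the vocals. A separation network learns to predict an
# approximation of this mask from M alone.
mask = V / (V + A + 1e-8)
vocal_estimate = mask * M  # masked mixture approximates the vocals
```

Multiplying the mask by the mixture spectrogram suppresses the bins dominated by the accompaniment, which is why the masked result lies much closer to the clean vocal spectrogram than the raw mixture does.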

[Figure: a visual representation of splitting a music track into vocals and instruments]

I investigated a new method in which the audio is converted into a format that better represents human hearing. The results show that the computer needs more time to learn from this format, but the outcomes are promising.



Degree: Burgerlijk Ingenieur Computerwetenschappen (Computer Science Engineering)
Supervisors: Tijl De Bie, Thomas Demeester