Krzysztof Rychlicki-Kicior, Bartlomiej Stasiak and Mykhaylo Yatsymirskyy
Lodz University of Technology
26.01.2015
Multipitch estimation obtains the frequencies of sounds from a polyphonic audio signal. The number of sources can be known or unknown. This problem belongs to the field of Music Information Retrieval.
How to retrieve separate sounds from one mixed sound signal?
Fig. 1. An example sound signal (amplitude vs. time [s])
Fig. 2. An example of symbolic notation of a sound signal
Fig. 3. A sinusoid, f = 440 Hz (amplitude vs. time [s])
Fig. 4. A spectrum of the signal from Fig. 3.
Fig. 5. A signal containing two sinusoids, f1 = 440 Hz and f2 = 784 Hz (amplitude vs. time [s])
Fig. 6. A spectrum of the signal from Fig. 5.
Fig. 7. A signal of the note A4 (f0 = 440 Hz) played on the flute (amplitude vs. time [s])
Fig. 8. A spectrum of the signal from Fig. 7.
Fig. 9. A signal of A4 and G5 (f1 = 440 Hz and f2 = 784 Hz) played on the flute (amplitude vs. time [s])
Fig. 10. A spectrum of the signal from Fig. 9.
Fig. 11. A spectrum of the signal from Fig. 9 (log scale)
Fig. 12. Structure of the solution: two parallel branches (Constant-Q Transform and Cepstrum), each followed by Preprocessing, SI-PLCA and Normalization, with the results merged by The Judge
Constant-Q Transform:
- a non-linear frequency transform that gives more information on lower frequencies than on higher ones,
- a much more reasonable choice for music processing than the DFT.
Cepstrum:
- shows the rate of changes in the regular spectrum,
- typically used in speech processing (especially as the basis of MFCC) and in single-f0 approaches.
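As a rough illustration, the two representations could be computed as in the sketch below. This is a minimal sketch assuming the librosa library and a simple per-frame real cepstrum; the hop length, number of bins per octave and other parameters are illustrative, not the configuration of the presented system.

```python
import numpy as np
import librosa

def analysis_representations(path, hop_length=512):
    """Compute the two input representations: a Constant-Q Transform
    and a per-frame real cepstrum (illustrative parameters)."""
    y, sr = librosa.load(path, sr=None, mono=True)

    # Constant-Q Transform: logarithmically spaced bins, denser at low
    # frequencies, which matches musical pitch spacing better than the DFT.
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length))

    # Real cepstrum per STFT frame: inverse FFT of the log magnitude spectrum.
    # Peaks in the cepstrum correspond to periodicities in the spectrum,
    # e.g. the regular spacing of harmonics.
    spec = np.abs(librosa.stft(y, hop_length=hop_length))
    cepstrum = np.fft.irfft(np.log(spec + 1e-10), axis=0)

    return cqt, cepstrum, sr
```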
After obtaining the representations, additional preprocessing is done ("pre-" referring to the fact that it is done before the representation analysis). Preprocessing includes:
- removing components with small values,
- smoothing the representations,
- calculating the salience.
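A minimal sketch of these steps is given below, assuming thresholding relative to the per-frame maximum, a short moving-average smoothing, and a generic harmonic-summation salience; the concrete threshold, smoothing width and salience definition used in the presented system may differ.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def preprocess(representation, threshold_ratio=0.05, smooth_width=3):
    """Preprocess a time-frequency representation (CQT or cepstrum):
    remove small components, then smooth along the frequency axis."""
    r = np.abs(representation).copy()

    # 1. Remove components with small values (below a fraction of the frame maximum).
    frame_max = r.max(axis=0, keepdims=True)
    r[r < threshold_ratio * frame_max] = 0.0

    # 2. Smooth along the frequency axis with a short moving average.
    r = uniform_filter1d(r, size=smooth_width, axis=0)
    return r

def harmonic_salience(frame, n_harmonics=5, bins_per_octave=36):
    """Illustrative salience for one CQT frame: sum the magnitudes at the
    first few harmonics of each bin (harmonic positions are offsets in
    log-frequency bins)."""
    salience = np.zeros_like(frame)
    for h in range(1, n_harmonics + 1):
        shift = int(round(bins_per_octave * np.log2(h)))
        if shift < len(frame):
            salience[:len(frame) - shift] += frame[shift:]
    return salience
```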
Fig. 13. The initial, unprocessed CQT
Fig. 14. CQT after removing small components
Fig. 15. CQT after removing small components
Fig. 16. CQT after smoothing
After preprocessing, the sound representations are analyzed using Shift-Invariant Probabilistic Latent Component Analysis (SI-PLCA). This method treats the spectrogram (or any similar representation, such as a time-lag representation) as a distribution of energy over time and frequency. Therefore it can be decomposed into kernel and impulse distributions.
Fig. 17. An example of decomposing a spectrogram into kernel and impulse distributions [10].
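The sketch below illustrates the idea with a shift-invariant (convolutive) factorization using multiplicative KL-divergence updates. The actual SI-PLCA model of [10] keeps all factors normalized as probability distributions and is estimated with EM, so this is an approximation of the technique, not the authors' implementation; the number of kernels, kernel length and iteration count are illustrative.

```python
import numpy as np

def shift(X, tau):
    """Shift a matrix along its columns by tau (right if tau > 0,
    left if tau < 0), zero-padding the vacated columns."""
    out = np.zeros_like(X)
    if tau == 0:
        out[:] = X
    elif tau > 0:
        out[:, tau:] = X[:, :X.shape[1] - tau]
    else:
        out[:, :tau] = X[:, -tau:]
    return out

def si_decompose(V, n_kernels=2, kernel_len=8, n_iter=100, eps=1e-10):
    """Shift-invariant factorization of a non-negative spectrogram V (F x T):
    V ~ sum_tau W[tau] @ shift(H, tau). W plays the role of the kernel
    distributions, H of the impulse distributions."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((kernel_len, F, n_kernels)) + eps  # one F x Z slice per shift tau
    H = rng.random((n_kernels, T)) + eps

    for _ in range(n_iter):
        # Update each kernel slice against the current reconstruction.
        Lam = sum(W[tau] @ shift(H, tau) for tau in range(kernel_len)) + eps
        R = V / Lam
        for tau in range(kernel_len):
            Ht = shift(H, tau)
            W[tau] *= (R @ Ht.T) / (np.ones((F, T)) @ Ht.T + eps)

        # Update the impulses, averaging contributions over all shifts.
        Lam = sum(W[tau] @ shift(H, tau) for tau in range(kernel_len)) + eps
        R = V / Lam
        num = np.zeros_like(H)
        den = np.zeros_like(H)
        for tau in range(kernel_len):
            num += shift(W[tau].T @ R, -tau)
            den += W[tau].T @ np.ones((F, T))
        H *= num / (den + eps)
    return W, H
```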
While regular methods of F0 estimation would end here, in this approach a very important part happens now: choosing the appropriate solution from the candidates reported by the methods. First, however, the candidates must be grouped and normalized.
Each method chooses a set of candidates, where each candidate has three attributes: frequency f, power p and count c. The candidates' power must be normalized, since the energy of a frequency component in the CQT differs from the energy of a component in the cepstrum.
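A minimal sketch of such per-method normalization, assuming a simple max-normalization of candidate powers (the exact normalization scheme used in the presented system is not specified here):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    freq: float    # estimated fundamental frequency [Hz]
    power: float   # energy reported by the source method
    count: int     # number of frames/components supporting the candidate

def normalize_powers(candidates):
    """Scale the candidate powers of a single method into [0, 1] so that
    CQT-based and cepstrum-based candidates become comparable."""
    if not candidates:
        return []
    max_power = max(c.power for c in candidates)
    return [Candidate(c.freq,
                      c.power / max_power if max_power > 0 else 0.0,
                      c.count)
            for c in candidates]
```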
The judge is responsible for:
- grouping the candidates returned by all the methods using chosen criteria,
- sorting the candidates and choosing the best of them as the solution of the algorithm.
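Continuing the sketch above, the judge could be illustrated as follows; the grouping tolerance in cents and the power-times-count score are illustrative assumptions, not necessarily the criteria used in the presented judge.

```python
import math

def judge(candidate_lists, group_tolerance_cents=50, n_sources=2):
    """Merge the candidate lists from all methods (each a list of Candidate
    objects from the normalization sketch above): group candidates whose
    frequencies lie within a tolerance, score each group, and return the
    mean frequency of the n_sources best groups."""
    all_candidates = sorted((c for lst in candidate_lists for c in lst),
                            key=lambda c: c.freq)
    # Group candidates that are close in pitch (distance measured in cents).
    groups = []
    for c in all_candidates:
        if groups and 1200 * abs(math.log2(c.freq / groups[-1][-1].freq)) <= group_tolerance_cents:
            groups[-1].append(c)
        else:
            groups.append([c])

    # Score a group by its accumulated (normalized) power weighted by count.
    def score(group):
        return sum(c.power * c.count for c in group)

    best = sorted(groups, key=score, reverse=True)[:n_sources]
    return [sum(c.freq for c in g) / len(g) for g in best]
```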
The accuracy for intervals (two-sound chords) reached 87%. For three-sound chords the accuracy reached 81.5%, and for four-sound chords 75.2%. Before applying the judge, however, the interval accuracy had reached 95.2% (93.6% for three-sound chords and 88.9% for four-sound chords).
Interval (semitones)   Accuracy (%)
 0                     88.83
 1                     87.82
 2                     85.77
 3                     87.62
 4                     89.11
 5                     88.24
 6                     84.62
 7                     83.00
 8                     85.11
 9                     89.67
10                     90.85
11                     83.75
Table 1. Percentage of correctly detected intervals by interval type (intervals larger than an octave have been reduced to their equivalents within an octave)
Applying multiple methods gives very good and predictable results, even for more complex polyphony. The role of the judge is very important, as the current version, while giving good overall accuracy, might still be improved.
[1] K. Dressler, "Multiple fundamental frequency extraction for MIREX 2012", in: The 13th International Conference on Music Information Retrieval (2012)
[2] J. Leon, F. Beltran, J. Beltran, "A complex wavelet-based fundamental frequency estimator in single-channel polyphonic signals", Proc. of the 16th Int. Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, September 2-5, 2013
[3] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff and A. Klapuri, "Automatic music transcription: challenges and future directions", Journal of Intelligent Information Systems, 41(3), Springer-Verlag, 407-434 (2013)
[4] M. Davy and A. Klapuri, Signal Processing Methods for Music Transcription, Springer-Verlag (2006)
[5] C. Yeh, "Multiple fundamental frequency estimation of polyphonic recordings", Ph.D. thesis, Université de Paris (2008)
[6] K. Rychlicki-Kicior, B. Stasiak, "Multipitch estimation using judge-based model", Bulletin of the Polish Academy of Sciences, Technical Sciences, Vol. 62(4), 2014
[7] M. Goto, H. Hashiguchi, T. Nishimura and R. Oka, "RWC Music Database: Music Genre Database and Musical Instrument Sound Database", Proceedings of the 4th International Conference on Music Information Retrieval (ISMIR 2003), pp. 229-230, October 2003
[8] http://www.music-ir.org/mirex/wiki/2014:Multiple_Fundamental_Frequency_Estimation_%26_Tracking_Results
[9] K. Rychlicki-Kicior, B. Stasiak, "Metaheuristic Optimization of Multiple Fundamental Frequency Estimation", in: Man-Machine Interactions 3 (Eds.: A. Gruca, T. Czachórski, S. Kozielski), Springer, pp. 307-314 (2014)
[10] P. Smaragdis, B. Raj, "Shift-Invariant Probabilistic Latent Component Analysis", 2007