This is a repository copy of A New Method of Onset and Offset Detection in Ensemble Singing.
White Rose Research Online URL for this paper:
Version: Published Version
Article: D'Amario, Sara, Daffern, Helena and Bailes, Freya (2018) A New Method of Onset and Offset Detection in Ensemble Singing. Logopedics Phoniatrics Vocology.
Reuse: This article is distributed under the terms of the Creative Commons Attribution (CC BY) licence. This licence allows you to distribute, remix, tweak, and build upon the work, even commercially, as long as you credit the authors for the original work.
Logopedics Phoniatrics Vocology
A new method of onset and offset detection in ensemble singing
Sara D'Amario, Helena Daffern & Freya Bailes
To cite this article: Sara D'Amario, Helena Daffern & Freya Bailes (2018): A new method of onset and offset detection in ensemble singing, Logopedics Phoniatrics Vocology, DOI: /
The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. Published online: 27 Mar.
RESEARCH ARTICLE
A new method of onset and offset detection in ensemble singing
Sara D'Amario (a), Helena Daffern (a) and Freya Bailes (b)
(a) Department of Electronic Engineering, University of York, York, UK; (b) School of Music, University of Leeds, Leeds, UK
ABSTRACT This paper presents a novel method combining electrolaryngography and acoustic analysis to detect the onset and offset of phonation, as well as the beginning and ending of notes within a sung legato phrase, through the application of a peak-picking algorithm, TIMEX. The evaluation of the method applied to a set of singing duo recordings shows an overall performance of 78% within a tolerance window of 50 ms compared with manual annotations performed by three experts. Results seem very promising in light of the state-of-the-art techniques presented at MIREX in 2016, which yielded an overall performance of around 60%. The new method was applied to a pilot study with two duets to analyse synchronization between singers during ensemble performances. Results from this investigation demonstrate bidirectional temporal adaptations between performers, and suggest that the precision and consistency of synchronization, and the tendency to precede or lag a co-performer, might be affected by visual contact between singers and leader-follower relationships. The outcomes of this paper promise to be beneficial for future investigations of synchronization in singing ensembles.
ARTICLE HISTORY Received 28 June 2017; Revised 8 February 2018; Accepted 12 March 2018
KEYWORDS Interpersonal interaction; offset detection; onset detection; singing ensemble; synchronization
Introduction
Accurate analysis of sound, typically musical tones, as performed by an individual is fundamental to the investigation of performed musical characteristics such as tempo, rhythm and pitch structure.
The analysis of singing ensemble recordings represents a major challenge in this respect, due to the difficulties of: (i) separating individual voices within polyphonic recordings to evaluate the contribution of each singer and (ii) identifying tone onsets and offsets. Whilst onsets and offsets are often clearly distinguishable for percussive sounds, in singing these vary according to vibrato, vocal fluctuations, timbral characteristics and onset envelopes, especially within a legato phrase where consonants are absent. Currently, there are no robust methods to identify onsets and offsets of individual voices, particularly in the context of ensemble singing. A protocol for onset/offset detection in singing ensemble recordings would be useful for a range of aspects of music performance analysis and audio signal processing, such as music information retrieval and transcription applications, and to evaluate synchronization between musicians during singing ensemble performances. The use of close-proximity microphones, although capturing the data of the individual singers, does not eliminate bleed from other performers (1), and makes isolation of individual notes, and therefore of onsets and offsets, difficult. Recent studies conducted by David Howard analysed tuning in two different SATB ensembles: the complexities of polyphonic analysis associated with audio recordings (2-4) were avoided by applying acoustic analysis in conjunction with electrolaryngography (Lx) to extract the f_o estimates from vocal fold contact information. Electrolaryngography and electroglottography (EGG), two non-invasive techniques that assess vocal fold vibration in vivo through electrodes placed externally on either side of the neck at the level of the larynx, allow measurement of performance data in solo and ensemble performances and are often employed in singing research (for a recent review, see (5)).
However, the use of Lx/EGG for the temporal analysis of onsets and offsets to assess synchronization between singers during vocal ensemble performances has still to be evaluated. Several approaches have been suggested for note-onset detection (for a review, see (6)). A few studies have focused on spectral features of the signals (7), combined phase and energy information (8), analysed phase deviations across the frequency domain (9), considered change of energy in frequency sub-bands (10), or are based on probabilistic methods such as hidden Markov models (11). Other approaches are based on the fundamental frequency contour and sound level envelope (12), or on time and frequency domain features (13). The selection and reliability of the algorithms mentioned above are strictly correlated to the type and quality of the audio signal; for example, time-domain methods perform relatively well if the signal is very percussive, as in piano or drum recordings. It is noteworthy that existing algorithms perform less well on singing compared with other classes, such as solo brass, wind instruments and polyphonic pitched instruments.
CONTACT Sara D'Amario sda513@york.ac.uk Department of Electronic Engineering, University of York, Heslington, York YO10 5DD, UK
© 2018 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In the Music Information Retrieval Evaluation eXchange (MIREX 2016), the best-performing algorithm for onset detection of the solo singing voice achieved an F-measure, which is a metric of the overall performance, of 61.7%; whereas the
best-performing algorithms for drums, plucked strings, brass and wind instruments achieved an F-measure of 93%, 92%, 91% and 78%, respectively. Toh et al. (14) implemented a system for the analysis of the solo singing voice that accurately identified 85% of onsets within 50 ms of the ground truth, i.e. the manually annotated values of the same recordings. However, this is not precise enough for the analysis of the highly accurate coordination that is found in professional music ensembles, known to be in the order of tens of milliseconds (15,16). In summary, automated onset detection of non-percussive performances, such as singing ensemble performances, from audio recordings remains a challenge and is still evolving. A robust algorithm able to automatically extract timing information in such performances would be highly beneficial for the investigation of synchronization between members of a singing ensemble. This paper addresses the complexities of analysing onset and offset timings in polyphonic singing recordings through a case study considering synchronization in singing ensemble performances. A novel method to investigate temporal coordination in singing ensembles is developed and tested, based on the combined application of electrolaryngography and acoustic analysis, and on a new automated algorithm, termed TIMEX, that automatically extracts timing in monaural singing performances. The effectiveness of this new method for the analysis of synchronization in singing ensembles was tested in a pilot study. A secondary aim of the pilot study was to investigate the importance of visual cues and leader-follower relationships in singers' synchronization during vocal ensemble performances, with the central question: do the presence/absence of visual contact (VC) between musicians and the instruction to act as leader or follower affect synchronization between singers in vocal duos?
Synchronization between musicians is maintained through iterative temporal adjustments, which might relate to expressive interpretations or to noise in cognitive-motor processes. Research suggests that synchronization in small ensembles (17,18) might be affected by VC between musicians when auditory feedback is limited or musical timing is irregular, and by leader-follower relationships between members of a musical ensemble. However, how synchronization evolves during vocal ensemble performances in relation to these factors still needs to be fully understood. Based on previous evidence, it was hypothesized that the combination of electrolaryngography and acoustic analysis is a valuable tool for the analysis of synchronization in singing ensembles by tracking the f_o profile, as this combination proved to be a successful method in studies analysing intonation in SATB quartets from f_o estimates (2-4). It was also conjectured that the leader's onsets might tend to precede those of the follower, as found by (17) in piano duos. Finally, it was hypothesized that singers do not significantly rely on VC to temporally synchronize their actions with their co-performers' actions during the ensemble performance of regular rhythms, as found by (19) in piano duos. The remainder of this paper is organized in four sections. First, an overview and evaluation of the novel onset/offset detection method is presented (see section "TIMEX"). A case study of synchronization between singers in two vocal duos, based on the application of the new protocol, is then described (see section "Case study of synchronization in singing ensembles"). Finally, results of the algorithm's evaluation and the case study are discussed and conclusions presented.
TIMEX: an algorithm for the automatic detection of note onsets and offsets
The purpose of this section is to first describe (see section "Algorithm specification") and then test (see section "Algorithm evaluation") a novel algorithm developed to automatically extract temporal information relating to the notes within a legato phrase sung on any vowel. The input for the algorithm is the f_o profile extracted from monaural audio recordings of a singing ensemble obtained using Lx and a head-mounted microphone.
Algorithm specification
When singers perform legato, there are no silences between the notes within a phrase: phonation continues until the next rest/breath, effectively creating a portamento between notes. In the development of the algorithm, it was therefore necessary to set criteria with which to analyse the beginning and ending of each note within the piece. This has resulted in four categories being defined to denote the true beginning and ending of the scored notes. These are shown in Figure 1 and defined as:
Onset (ON): beginning of phonation after a silence.
Note ending (NE): peak/trough in f_o during phonation within a legato phrase that is atypical of a vibrato cycle's characteristics for extent and frequency, calculated between 80 and 120 cents and between 2 and 9 Hz, respectively, and refined for each singer.
Note beginning (NB): peak/trough in f_o during phonation that exceeds the maximum vibrato extent and is less than the vibrato frequency, following a note ending.
Offset (OF): ending of phonation followed by a silence.
In order to automate the extraction of the above categories, the following definitions have been formulated and parameter values inputted. The values were manually determined by testing with several recordings and can be modified by the users.
Break: a sequence of one or more points where the Lx signal is null.
Rest: a sequence of a minimum number of consecutive points where the Lx signal is null.
The minimum number of points required to classify a break as a rest is arbitrarily defined; for this specific set of recordings, it has been set to a time window of 300 ms, corresponding to a quaver rest at 100 beats per minute (BPM).
Phrase: a section of the Lx recording contained between an onset and the following offset.
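The break/rest distinction above reduces to run-length classification over the null samples of the f_o track. The following is a minimal sketch, not the authors' implementation: it assumes a 1 ms time step (so the 300 ms rest threshold corresponds to 300 samples) and a track in which null Lx readings are stored as 0.0; the function name and data layout are illustrative.

```python
# Sketch (not the authors' code): classify each run of null f_o values in an
# Lx track as a "break" or a "rest", assuming a 1 ms time step so that the
# 300 ms rest threshold equals 300 consecutive samples.

def find_null_runs(fo, min_rest_samples=300):
    """Return (start_index, length, kind) for each run of null f_o values.

    kind is "rest" if the run is at least min_rest_samples long,
    otherwise "break".
    """
    runs = []
    i = 0
    n = len(fo)
    while i < n:
        if fo[i] == 0.0:
            start = i
            while i < n and fo[i] == 0.0:
                i += 1
            length = i - start
            kind = "rest" if length >= min_rest_samples else "break"
            runs.append((start, length, kind))
        else:
            i += 1
    return runs

# Example: 100 ms of phonation, a 50 ms break, more phonation, a 400 ms rest.
track = [220.0] * 100 + [0.0] * 50 + [220.0] * 100 + [0.0] * 400
print(find_null_runs(track))
# [(100, 50, 'break'), (250, 400, 'rest')]
```

A phrase, in the terms defined above, is then simply the span of samples between the end of one rest and the start of the next.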
Figure 1. The f_o profile of measures 1-3 of the raw Lx and audio signal from an upper voice performance of the two-part piece composed for this study (see section "Stimulus material"), showing: (i) on the top panel, the Lx recording with the four sets of categories identified for each note within a legato phrase (i.e. onset, note beginning, note ending and offset), a local peak and the phrases; (ii) on the bottom, the audio recording, with the ON and OF fluctuation ranges and the break range.
Fluctuation: the difference in frequency between two Lx or AUDIO points; the fluctuation can be linear or logarithmic, depending on how it is measured. For these recordings, it was set to 80 cents.
Local max: a point where the Lx/AUDIO value is higher than the Lx/AUDIO values at the previous and the following point.
Local min: a point where the Lx/AUDIO value is lower than the Lx/AUDIO values at the previous and the following point.
Onset/offset fluctuation range: the range of points after an onset or before an offset where the singer's voice typically oscillates; local max/min points are ignored within this range, because they are not aligned with note changes but are the result of vibrato. Its duration is arbitrarily defined; a value of 300 ms has been used, as appropriate for this set of recordings.
Vibrato frequency threshold: the minimum frequency of oscillation of the Lx or audio signal that classifies the segment as vibrato, and is therefore not associated with a true note change from the score. For these recordings, it was set to 5 Hz.
Local peak: a point with a positive Lx value that falls in the middle of a prescribed temporal window, where at least one point with null Lx frequency exists before and after such a point. The temporal window to conduct the check is arbitrarily defined; a time span of 500 ms centred around the point in question has been used with satisfactory results in this project.
Spiking range: a range of points immediately before an onset or after an offset where the Lx signal artificially spikes relative to the corresponding AUDIO signal. The width of such a range is arbitrarily defined; given the steepness of the spikes, a value of just 10 ms has proven sufficient to isolate the spikes.
TIMEX detects and extracts ON, NB, NE and OF, ensuring consistency of the analysis, based on the following steps, as shown in Figure 2.
Step 1: removal of Lx readings in the spiking range. The first operation performed on the raw Lx data is to remove all the positive Lx readings within the spiking range (adjacent to the breaks), replacing them with null values. This step is executed to prevent the artificial spikes from leading to a skewed and distorted reconstruction of the Lx signal from the AUDIO signal (the reconstruction procedure is explained in Step 2).
Step 2: reconstruction of the missing Lx signal from the AUDIO signal. If the Lx signal is weak, the algorithm reconstructs the signal from the audio recording. This is achieved
through a normalization procedure designed to reconstruct the Lx signal so that it follows the same shape as the AUDIO signal. The audio signal is scaled to match the original Lx values at the edges of the interval where the Lx signal is missing, therefore avoiding artificial max/min points being generated at the edges; from here on, the original Lx signal refers to the signal after the Lx readings in the spiking range have been removed, as per Step 1.
Figure 2. Algorithm flowchart.
The following nomenclature is used:
t_0, t_1: times at the boundaries of the range where the original Lx signal is missing or weak, and the audio signal is at least partially available.
f_o_Lx_0, f_o_Lx_1: the values of the original Lx signal at t_0 and t_1; they are both positive by definition of how t_0 and t_1 are selected.
f_o_AUDIO_0, f_o_AUDIO_1: the values of the AUDIO signal at t_0 and t_1; if one of them is zero, it is calculated as the other one multiplied by the ratio between f_o_Lx at that point and f_o_Lx at the other end, while if both are zero, reconstruction is not attempted for this interval.
f_o_Lx_L(t), f_o_AUDIO_L(t): the values of the linearized Lx signal and the AUDIO signal, respectively, at time t, with t falling between t_0 and t_1; these are linearized as falling on a straight line connecting f_o_Lx_0 and f_o_Lx_1, and f_o_AUDIO_0 and f_o_AUDIO_1, respectively.
f_o_Lx(t), f_o_AUDIO(t): the values of the original Lx signal and the AUDIO signal, respectively, at time t, with t falling between t_0 and t_1.
The linearized Lx and AUDIO values are first computed as follows:

f_o_Lx_L(t) = f_o_Lx_0 + (f_o_Lx_1 - f_o_Lx_0) * (t - t_0) / (t_1 - t_0)   (1)

f_o_AUDIO_L(t) = f_o_AUDIO_0 + (f_o_AUDIO_1 - f_o_AUDIO_0) * (t - t_0) / (t_1 - t_0)   (2)

Then, if f_o_AUDIO(t) = 0, f_o_Lx(t) = 0 (reconstruction is not possible at a point where even the microphone reading is not available); otherwise f_o_Lx(t) is reconstructed as

f_o_Lx(t) = f_o_AUDIO(t) * f_o_Lx_L(t) / f_o_AUDIO_L(t)   (3)

The result of this reconstruction is that the Lx signal follows the shape of the AUDIO signal in the areas where the raw signal is not available, remaining continuous with the original values where present, as shown in the example of Figure 3.
Figure 3. Excerpt of the Lx and AUDIO signals from a recording of the upper voice performance, showing the reconstruction of the f_o_Lx signal from the f_o_AUDIO signal in the temporal interval t_0-t_1, in which the Lx signal was missing. The Lx signal was reconstructed (see f_o_Lx_Reconstructed) based on the linearized Lx and AUDIO signals (see f_o_Lx_Linearized and f_o_AUDIO_Linearized, respectively).
Step 3: removal of Lx local peaks. After the Lx signal has been reconstructed, any remaining local peaks are identified, based on the selected range (see definition above), and removed. The purpose is to eliminate spurious readings that are sometimes produced by the Lx sensor; these typically occur in a narrow time range and can be identified via a proper selection of the local peak range. Removing the peaks after the signal has been reconstructed, from the AUDIO data where possible, allows the maximum amount of Lx data to be retained. The resulting Lx signal left after the removal of the local spikes is defined as the reconstructed Lx signal.
Step 4: identification of onsets, offsets, note beginnings and note endings. Once the Lx signal has been reconstructed, it is processed to extract onsets and offsets of phonation and local max/min points during phonation. Local max/min points are then retained if all the following conditions are satisfied:
1. The point is not too close to the adjacent local max/mins. Points that are too close to each other are removed, to avoid retaining small steps within an ascending or descending section as note beginnings or note endings, when they are just fluctuations of the singer's voice that sometimes occur within a note change. A value of just 10 ms is sufficient to discriminate those points from the max/mins to be retained.
2. The point does not fall within the onset or the offset fluctuation range.
3. Either of the following two conditions is satisfied:
3.1. The logarithmic fluctuation, measured in cents, of the current point from the previous onset or max/min, or to the next max/min, is greater than a prescribed threshold. The distance in cents between two points at frequencies f_1 and f_2 is defined as

c(f_1, f_2) = 3986.3137 * log10( max(f_1, f_2) / min(f_1, f_2) )   (4)

3.2. The frequency of oscillation of the point, relative to the previous and the next point, is lower than the vibrato frequency threshold; this condition is applied to disregard any max/mins that are the result of vibrato of the singer's voice, without having to set a threshold for the logarithmic fluctuation that is so high it would lead to discarding valid note beginnings or endings for semitones. The vibration frequency (vf_n) of the point is defined as the lowest of the oscillation frequencies relative to the previous and the next max/min, as shown in Figure 4:

vf_n = 1 / max(t_n - t_(n-1), t_(n+1) - t_n)   (5)

Figure 4. Example of the vibration frequency computed across a full cycle, extracted from an audio clip of the upper voice used for the study.
The ability to manually tweak the results after visual validation is provided, to ensure that all and only the relevant max/min points are retained as note beginnings/endings.
Algorithm evaluation
Testing TIMEX on a set of singing performances
The effectiveness of the algorithm was tested on 28 Lx recordings of a two-part piece composed by the first author for the following case study, as shown in Figure 5, and performed by two singing duos (see section "Participants" for more details). The data collected include 728 note beginnings, 728 note endings, 112 onsets and 112 offsets, for a total of 1680 timing extractions. Each audio file was approximately 25 s long, and the total length of the audio clips was about 10 minutes, which is much longer than the singing recordings used in the Music Information Retrieval Evaluation eXchange (MIREX 2016) onset detection task. Recordings were manually cross-annotated by three experts, external to this investigation, who marked the beginning and ending of each note using Praat software (20,21). Experts used the same software setup, displaying a spectrogram and a waveform with a fixed time window, and a tier for hand annotations; this display setup also gave the experts the chance to listen to the recordings. Markings were applied to monaural recordings of the two-part performances sampled at 48 kHz and post-processed with a time step of 1 ms.
Figure 5. Duet exercise composed for the study, showing the notes chosen for the analysis of the synchronization and the four sets of time categories (ON: onset; NB: note beginning; NE: note ending; OF: offset). All notes were used for the evaluation of the reliability of TIMEX.
This time step setting was chosen to allow the detection of small asynchronies in the order of tens of milliseconds, such as those found in the literature on music ensemble performances. The evaluation procedure followed that described in MIREX 2016 for onset detection. A tolerance value was set to ±50 ms and the detected times were compared with ground-truth values manually detected by the experts. This is a standard procedure for the evaluation of onset detection algorithms, although the comparison of values detected by the algorithm with those manually detected by experts, commonly referred to as ground-truth values, remains ambiguous and subjective, as there can be no true objective value. A large time displacement of 50 ms is a well-known criterion in the field of onset detection that takes into account the inaccuracy of the hand-labelling process (6). In addition, a small time window of 10 ms was also chosen, to detect the small asynchronies found in the synchronization of professional ensemble performances. The mean of the standard deviations for the manual annotations computed across the three experts was 59 ms. For a given ground-truth onset time, any extracted value falling within the tolerance time window of 10 or 50 ms was considered a correct detection (CD). If the algorithm detected no value within the time window, the detection of that ground-truth time was reported as a false negative (FN). Detections outside all the tolerance windows were counted as false positives (FPs).
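The tolerance-window evaluation just described can be sketched as follows. This is not the authors' code: it assumes a simple greedy pairing of each ground-truth time with its nearest unmatched detection (the paper does not specify the matching details), and it computes the standard Precision, Recall and F-measure from the resulting CD/FP/FN counts.

```python
# Sketch of a MIREX-style onset evaluation (not the authors' code): greedily
# match ground-truth times to detections within a +/- tolerance, then derive
# Precision, Recall and F-measure from the CD/FP/FN counts.

def evaluate(detected, ground_truth, tol=0.050):
    """Return (precision, recall, f_measure) for times given in seconds."""
    detected = sorted(detected)
    used = [False] * len(detected)
    cd = 0
    for gt in sorted(ground_truth):
        # nearest still-unmatched detection within the tolerance window
        best, best_dist = None, tol
        for i, d in enumerate(detected):
            if not used[i] and abs(d - gt) <= best_dist:
                best, best_dist = i, abs(d - gt)
        if best is not None:
            used[best] = True
            cd += 1                      # correct detection
    fp = used.count(False)               # detections with no ground-truth partner
    fn = len(ground_truth) - cd          # ground-truth events never detected
    p = cd / (cd + fp) if cd + fp else 0.0
    r = cd / (cd + fn) if cd + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Two of three ground-truth onsets are matched within 50 ms; the detection
# at 1.90 s is a false positive and the onset at 1.50 s a false negative.
p, r, f = evaluate([0.51, 1.02, 1.90], [0.50, 1.00, 1.50])
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```

With cross-annotation, this evaluation would simply be repeated once per expert annotation and the rates averaged, as described below.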
The performance of the detection method was evaluated based on the three measures commonly used in the field of onset detection: Precision (P), Recall (R) and F-measure (F). The Precision measures the probability that a detected value is a true value, thus quantifying how much noise the algorithm produces. The Recall indicates the probability that a true value is identified, therefore measuring how much of the ground truth the algorithm identifies. The F-measure represents the overall performance, calculated as the harmonic mean of Precision and Recall. The measures are computed as follows:

P = N_cd / (N_cd + N_fp)   (6)

R = N_cd / (N_cd + N_fn)   (7)

F = 2PR / (P + R)   (8)

N_cd is the number of correct values detected by the algorithm; N_fp is the number of false values detected; N_fn is the number of missed values. As files were cross-annotated by three experts, the mean Precision and Recall rates were defined by averaging the Precision and Recall rates computed for each annotation. The overall results are reported in Table 1.

Table 1. Performance of TIMEX.
Tolerance | Precision | Recall | F-measure
50 ms     | 65%       | 97%    | 78%
10 ms     | 23%       | 89%    | 36%

TIMEX achieved higher results in all measures than the best-performing algorithms for the singing voice from MIREX 2016 (22) with the same threshold of 50 ms, although based on a different data set and extracting different timing categories (onsets in MIREX; onsets, offsets, note beginnings and note endings for TIMEX). The full data set of detection errors was scrutinized to investigate how FP and FN errors were distributed across performers and over the duration of the pieces. As shown in Table 2, the detection errors, computed with a tolerance level set at 10 ms, varied across the four performers: the total number of FNs found for singer 2 performing the upper voice was approximately half that of singer 1 performing the same piece, and the total number of FPs for singer 4 performing the lower voice was less than that found for singer 3 performing the lower part. These results suggest that singers might have particular techniques that affect the performance of the algorithm. As shown in Figure 6, the total number of FPs was distributed similarly across the course of the piece. However, FNs were more likely to occur when the note being analysed was a semitone from the previous note (as found regarding notes 1-2 and 6-7 of the upper voice, and 4-5 and 16-18 of the lower voice) or for intervals greater than a 3rd (as found for notes of the upper voice and of the lower voice).

Table 2. Distribution of detection errors across performers.
Performer's part | False negatives | False positives
S1 upper voice   |                 |
S2 upper voice   |                 |
S3 lower voice   |                 |
S4 lower voice   |                 |
False negatives and false positives were averaged across performances. Tolerance level set at 10 ms.

Evaluating the algorithm's reconstruction process
The algorithm's reconstruction process was evaluated with respect to: (i) the reliability of the Lx signal, as indexed by measurement of the continuous/discontinuous parts of the Lx signal, and (ii) the performance of the reconstruction process. Onset/offset detection based on the AUDIO recording is not fully reliable in the case of singing ensemble recordings; therefore, quantifying the percentage of times that this step was followed is important for testing the reliability of the protocol. The analysis of the quality of the Lx signal was conducted on the full set of recordings collected for the following case study, including 96 recordings of the upper voice and 96 recordings of the lower voice part of a duet piece composed for the experiment. Sections of the Lx signal associated with rests in the music score were not scrutinized, as the Lx was expected to be null in the absence of phonation.
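The reconstruction being evaluated here (Step 2, equations (1)-(3)) can be sketched for a single point inside a gap. This is a schematic re-expression, not the authors' implementation: it assumes both boundary Lx values are positive and that the AUDIO value at the point is available (a zero AUDIO reading returns zero, as in the paper).

```python
# Sketch (not the authors' code) of the Step 2 reconstruction: inside a gap
# [t0, t1] the Lx f_o is rebuilt from the AUDIO f_o by scaling it with the
# ratio of the two linear interpolants (equations (1)-(3)).

def reconstruct_gap(t, t0, t1, lx0, lx1, audio0, audio1, audio_t):
    """Reconstruct f_o_Lx(t) for t0 < t < t1.

    lx0, lx1       : original Lx values at the gap boundaries (positive).
    audio0, audio1 : AUDIO values at the boundaries.
    audio_t        : AUDIO value at time t (0 means no reading -> return 0).
    """
    if audio_t == 0.0:
        return 0.0  # reconstruction not possible without a microphone reading
    frac = (t - t0) / (t1 - t0)
    lx_lin = lx0 + (lx1 - lx0) * frac               # equation (1)
    audio_lin = audio0 + (audio1 - audio0) * frac   # equation (2)
    return audio_t * lx_lin / audio_lin             # equation (3)

# A flat AUDIO contour with matching boundary values reproduces the straight
# Lx ramp between the two boundary readings:
print(reconstruct_gap(t=0.5, t0=0.0, t1=1.0,
                      lx0=200.0, lx1=220.0,
                      audio0=210.0, audio1=210.0, audio_t=210.0))
# 210.0
```

Because the scaling is anchored to the boundary Lx values, the reconstructed curve stays continuous with the raw Lx signal at both edges of the gap, which is the property the margin-of-error analysis below quantifies.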
Results show that the Lx signal was unusable for 0.7% of the recordings; therefore, the algorithm's recourse to the AUDIO signal was limited to 0.7% of the full set of recordings. Analysis also shows that the discontinuous Lx segments were on average 31 ms long (SD 18 ms). A subset of 40 discontinuous Lx segments averaging 30 ms in length was used to assess the precision of the reconstruction method, by comparing the reconstructed Lx signal with the corresponding raw Lx signal. The Lx values from the raw segments were initially deleted, then the reconstruction process was run based on the Lx and AUDIO signals, and finally the raw values were compared with the reconstructed recordings. Results show an average margin of error of 0.034%; the margin of error (E) was first computed for each data point as

E = (V_raw - V_rec) / V_raw   (9)

and then averaged across the entire sample. V_raw represents the raw value extracted from the Lx signal, whilst V_rec is the value reconstructed by the algorithm based on the shape of the AUDIO signal.
Figure 6. Distribution of percentage detection errors computed at the beginning and ending of each note across the course of the piece.
Case study of synchronization in singing ensembles
The following case study aims to test the overall protocol featuring the application of TIMEX to Lx and audio recordings, to analyse the effect of VC and the instruction to act as leader or follower on the synchronization between singers during singing duo performances. This study serves as a test for a subsequent experiment with a larger sample of duos.
Methods
Participants
Four undergraduate singing students (three females and one male) were recruited from the Department of Music at the University of York. Singers had at least 7 years' experience performing in a singing ensemble (mean 9.3 years, SD 2.1), but they had not sung together prior to the experiment. They reported having normal hearing and not having absolute pitch.
Stimulus material
A vocal duet exercise was composed for this study, featuring a mostly homophonic texture to allow investigation of the synchronization per note, as shown in Figure 5. The upper voice has a range of a 7th, whilst the lower voice has a range of a 5th; the upper voice features a higher tessitura than the lower voice.
Apparatus
Participants were invited to sing in a recording studio at the University of York, treated with absorptive acoustic material. Singers wore head-mounted close-proximity microphones (DPA 4065), placed on the cheek at approximately 2 cm from the lips, and electrolaryngograph electrodes (Lx, from Laryngograph Ltd) placed on the neck, positioned on either side of the thyroid cartilage. One stereo condenser microphone (Rode NT4) was placed equidistant in front of the singers, at approximately 1.5 m from the lips. The five outputs (2 Lx, 2 head-mounted microphones, 1 stereo microphone) were connected to a multichannel hard disk recorder (Tascam DR-680) and recorded at a sampling frequency of 48 kHz and 24-bit depth.
Design
The study consisted of a within-subject design in which participants were asked to sing the piece in the following four conditions, applied in a randomised order:
VC_UpperVoiceL: with VC, and upper voice designated leader and lower voice follower
VC_UpperVoiceF: with VC, and upper voice designated follower and lower voice leader
NVC_UpperVoiceL: without VC, and upper voice designated leader and lower voice follower
NVC_UpperVoiceF: without VC, and upper voice designated follower and lower voice leader
Each condition was presented three times, resulting in 12 takes; each take consisted of four repeated performances of the piece, resulting in a 4 (conditions) x 3 (takes per condition) x 4 (repeated performances within each take) design, featuring a total of 48 repetitions of the piece per duet.
Procedure
Singers received the stimulus material prior to the experiment, to practise the piece. On the day of the experiment, participants were first asked to fill in a background questionnaire and consent form. Then, head-mounted microphones and Lx electrodes were placed on each singer and adjusted. The correct placement of the Lx electrodes was verified by checking the signal on the visual display and listening over headphones. The microphones were adjusted for the sound pressure level of each participant to avoid clipping. Singers were invited to familiarize themselves with the piece for 10 minutes, singing together from the score to the vowel /i/, while listening for 10 seconds to a metronome set at 100 BPM before starting to rehearse. If the singers were able to perform the piece without errors, the four conditions and associated 12 takes were then presented; otherwise, they were allowed to practise the piece for 10 more minutes and then the test was repeated.
Once the musicians passed the performance test without errors with the score, each singer was assigned the role of leader or follower; these roles were then reversed according to the UpperVoiceL and UpperVoiceF conditions. Signs labelled leader and follower were placed on the floor in front of the participants, to remind them of their roles. Each singer only had one assigned part/musical voice. Singers were invited to face each other at a distance of 1.5 m in the VC condition and to face away from each other at the same distance in the non-visual contact (NVC) condition. Singers were not aware of the purpose of the study. The 12 takes were recorded from memory, with short breaks between them. The experiment lasted approximately one hour. Ethical approval for the study was obtained from the Physical Sciences Ethics Committee (PSEC) at The University of York (UK).

Analysis

For each recorded performance, two sets of data, comprising the audio waveform from the microphones and the Lx waveform, were imported into Praat as .wav files, and f_o was extracted with a time step of 1 ms. These data were imported into Microsoft Excel 2016 in the form of a tabular list of data points, including f_o in Hertz and the corresponding timestamp. Asynchronies were then calculated to measure the phase synchrony between singers for NB, NE, ON and OF of the selected notes, as shown in Figure 5.
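As a concrete illustration of the asynchrony computation described above, the following sketch (not the authors' code; function and variable names are illustrative) pairs the event timestamps detected for the two singers and subtracts them, leader minus follower, so that negative values indicate the leader preceded the follower:

```python
# Illustrative sketch: signed asynchronies between a designated leader and
# follower, computed leader-minus-follower per note event (ON, NB, NE, OF).
# Timestamps are in milliseconds.

def signed_asynchronies(leader_times, follower_times):
    """Pair the i-th detected event of each singer and subtract
    (leader minus follower); negative = leader preceded the follower."""
    if len(leader_times) != len(follower_times):
        raise ValueError("Both singers must have the same number of events")
    return [l - f for l, f in zip(leader_times, follower_times)]

# Hypothetical note beginnings (NB) for one take, in ms.
leader_nb = [512.0, 1103.0, 1697.0]
follower_nb = [530.0, 1095.0, 1710.0]
print(signed_asynchronies(leader_nb, follower_nb))  # [-18.0, 8.0, -13.0]
```

The same function applies unchanged to ON, NE and OF timestamps, since all four event types are reduced to one timestamp per note and per singer.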
Table 3. Summary of the mean and median values per condition, showing the differences across conditions and the levels of p values for the significant effects (p < .05; p < .01). Columns compare duo 1 and duo 2 under VC vs NVC, and duo 1 and duo 2 under UpperVoiceL vs UpperVoiceF; for each of ON, NB, NE and OF, the rows report Precision (M), Consistency (SD), Consistency (CV) and Tendency to lead (median signed). Mean, SD and median asynchronies are expressed in ms, whilst CV values are dimensionless numbers. [Numeric table body not preserved in this copy.]

Those notes were chosen as being relevant to synchronization. The phase asynchrony was computed by subtracting the follower's timestamp values from the leader's (leader minus follower) for NB, NE, ON and OF of the selected notes. Negative values show that the leader preceded the follower, while positive values indicate that the follower was ahead of the leader. The detection of ON, NB, NE and OF was automated through the application of TIMEX, and the resulting timestamp data obtained from the note detection algorithm were then analysed in SPSS (SPSS 24, IBM, Armonk, NY). This event detection method was visually validated for the entire data set by the first author (SD). In addition, occasional pitch errors due to a musician singing a wrong note were investigated by comparing the f_o values and the audio recording with the notated score. Takes in which a pitch error occurred were excluded from the analysis. The overall error rate was less than 1%. Outliers were identified based on the MAD (median absolute deviation), and asynchronies that fell more than 2.5 absolute deviations from the median were excluded.
This approach is the most robust method for detecting outliers when the distribution is not normal and outliers are present (23), as in this case.

Results

The following sections present the results of four sets of analyses that were run to measure the effect of VC and leader-follower relationships on interpersonal synchronization. The first set measures the precision of interpersonal synchronization, as indexed by the mean of absolute asynchronies. The second and third sets of analyses investigate the amount of variation in interpersonal synchronization, as indexed by the standard deviation (SD) and the coefficient of variation (CV) of the absolute asynchronies. The fourth set of analyses focuses on the tendency to precede or lag a co-performer, as indexed by the median (Mdn) of signed asynchronies. Each set of analyses was run on ON, NB, NE and OF across each duo/performance in VC, NVC, UpperVoiceL and UpperVoiceF. Each set includes descriptive analyses and paired tests, including dependent paired t-tests and Wilcoxon's signed-rank tests. t-tests were chosen to analyse differences between means within the absolute asynchronies data sample, whilst Wilcoxon's tests were selected to assess median differences across signed asynchronies. These statistical tests were run for each condition. Results, using Bonferroni's correction for multiple comparisons, are summarized in Table 3.

Visual contact

Duo 1
Mean, SD and CV of absolute asynchronies and median of signed asynchronies for duo 1, calculated for ON, NB, NE and OF during VC and NVC, are shown in Figure 7. Results from the paired sample tests showed a significant effect of the presence of VC on the NB standard deviation asynchronies, t(23) = 2.43, p = .023, r = .45. As can be seen in Figure 7(B), consistency of synchronization was found to significantly increase in the NVC condition for NB standard deviation asynchronies, compared with the VC condition.
No significant effect was found for the remaining paired sample tests conducted across duo 1.

Duo 2
Median signed asynchronies, and mean, SD and CV of absolute asynchronies for duo 2 are shown in Figure 8. Paired sample tests were run as for duo 1. The t-test on the mean NB asynchronies highlighted a significant effect of the presence/absence of VC, t(23) = 2.86, p = .018, r = .51, showing that precision improved in NVC. No significant effect was found for the remaining paired sample tests.
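The outlier rule and the four synchronization indices used in these analyses (precision as the mean of absolute asynchronies, consistency as their SD and CV, and tendency to lead as the median of signed asynchronies) can be sketched as follows. This is an illustrative reconstruction, not the SPSS procedure used in the study, and all function names are hypothetical:

```python
# Sketch of the summary measures described in the Results section:
# asynchronies more than 2.5 median absolute deviations (MAD) from the
# median are excluded, then precision/consistency/tendency are computed.
import statistics

def mad_filter(asyncs, k=2.5):
    """Drop values more than k absolute deviations from the median."""
    med = statistics.median(asyncs)
    mad = statistics.median(abs(a - med) for a in asyncs)
    if mad == 0:  # degenerate case: no spread, keep everything
        return list(asyncs)
    return [a for a in asyncs if abs(a - med) <= k * mad]

def sync_measures(asyncs):
    """Return the four indices for one duo/condition/event type."""
    kept = mad_filter(asyncs)
    abs_a = [abs(a) for a in kept]
    mean_abs = statistics.mean(abs_a)
    sd_abs = statistics.stdev(abs_a)
    return {
        "precision_mean_abs": mean_abs,          # precision (M)
        "consistency_sd": sd_abs,                # consistency (SD)
        "consistency_cv": sd_abs / mean_abs,     # consistency (CV)
        "tendency_median_signed": statistics.median(kept),  # lead/lag
    }

# Hypothetical signed asynchronies (ms) with one obvious outlier (200 ms):
m = sync_measures([-18.0, 8.0, -13.0, 5.0, -20.0, 200.0])
print(m["tendency_median_signed"])  # -13.0 (negative: leader ahead)
```

Note that precision and consistency are computed on absolute asynchronies (magnitude of the mismatch regardless of direction), whereas the tendency to lead keeps the sign, matching the leader-minus-follower convention defined in the Analysis section.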
Figure 7. Interpersonal synchronization of duo 1 with visual contact (VC) and without visual contact (NVC) between singers, as indexed by the mean (A), standard deviation (B), coefficient of variation (CV) of absolute asynchronies (C) and median of signed asynchronies (D) calculated across ON, NB, NE and OF. Error bars represent the standard error of the mean for precision and consistency, and the interquartile range for tendency to precede. Smaller values in the precision and consistency of asynchronies indicate an increase in coordination, whilst negative values in the tendency to precede mean that the designated leader is ahead of the follower. p < .05.

Leader-follower relationships

Duo 1
Mean, SD and CV of absolute asynchronies, and median signed asynchronies, averaged across the 48 performances in the UpperVoiceL and UpperVoiceF conditions for duo 1, are shown in Figure 9. Paired sample t-tests yielded a significant effect of the instruction to act as leader or follower on both measures of consistency for NB: SD asynchronies, t(23) = 2.48, p = .021, r = .46, and CV asynchronies, t(23) = 2.60, p = .016, r = .48. Consistency of NB synchronization was significantly better when the upper voice was instructed to follow, rather than to lead, as shown in Figure 9(B,C). Wilcoxon's tests revealed a main significant effect of leader-follower instruction on the degree of preceding for ON median asynchronies, T = 60, p = .010, and NB median asynchronies, T = 71, p = .024. One sample t-tests conducted on ON and NB for each condition showed that: (i) ON median asynchronies when the upper voice was instructed to follow were significantly different from 0, t(23) = 3.208, p = .004, r = .56; (ii) NB median values when the upper voice was instructed to lead were significantly different from 0, t(23) = 6.287, p < .001, r = .80; and (iii) NB median data when the upper voice was instructed to follow were also significantly different from 0, p < .001, r = .92.
These results demonstrate that when either voice was instructed to lead, the designated leader significantly tended to precede the designated follower at NB. However, when the upper voice was instructed to follow, the designated follower (i.e. the upper voice) significantly tended to precede at ON.

Duo 2
Median signed asynchronies, and mean, SD and CV of absolute asynchronies computed for duo 2 in the UpperVoiceL and UpperVoiceF conditions are shown in Figure 10. Paired sample tests were calculated as for duo 1. A significant effect of the leader-follower instruction was found on the consistency of NB synchronization, as indexed by: (i) SD
asynchronies, t(23) = 4.40, p = .0002, r = .8; and (ii) CV asynchronies, t(23) = 2.65, p = .014, r = .48. Consistency of NB synchronization was better when the upper voice was instructed to lead and the lower voice to follow. Finally, as shown in Figure 10(D), Wilcoxon's tests revealed a significant effect of leader-follower instruction on the degree of preceding/lagging: (i) median NB asynchronies, T = 38.5, p = .001; (ii) median NE asynchronies, T = 33, p = .001; and (iii) median OF asynchronies, T = 42, p = .002. One sample t-tests on median ON, NB, NE and OF were conducted as for duo 1, to observe whether the tendency to precede/lag was significant in each condition. Results showed that: (i) NB asynchronies were significantly different from 0 when the upper voice was instructed to lead, t(23) = 3.564, p = .002, r = .60, and to follow, t(23) = 2.718, p = .012, r = .49; (ii) NE values were significantly different from 0 when the upper voice was instructed to lead, t(23) = 2.845, p = .009, r = .51, and also to follow, t(23) = 3.144, p = .005, r = .55; and (iii) OF asynchronies were significantly different from 0 when the upper voice was instructed to lead, t(23) = 4.695, p = .00009, r = .70.

Figure 8. Interpersonal synchronization of duo 2 with visual contact (VC) and without visual contact (NVC) between singers, as indexed by the mean (A), standard deviation (B), coefficient of variation (CV) of absolute asynchronies (C) and median of signed asynchronies (D) calculated across ON, NB, NE and OF. Error bars represent the standard error of the mean for precision and consistency, and the interquartile range for tendency to precede. Smaller values in the precision and consistency of asynchronies indicate an increase in coordination, whilst negative values in the tendency to precede mean that the designated leader is ahead of the follower. p < .05.
These results demonstrate that when either voice was instructed to lead, the upper voice significantly tended to precede the lower voice at NB and NE. However, when the upper voice was instructed to lead, the designated leader tended to lag at OF. These results show a complex pattern of leader and follower relationships, rather than a clear separation of roles, which seems to be independent of the researcher's instruction to lead or follow.

Piece learning effects

Prior to investigating the effect of VC and leader-follower relationships, data were examined for evidence of changes in interpersonal synchrony across the course of the 48 repeated performances. The learning effect was investigated by averaging the asynchronies for each performance and for each synchronization measure (i.e. precision, consistency and tendency to precede). Results show that there were no
discernible learning effects for duo 1 or duo 2, as shown in Figures 11 and 12.

Figure 9. Interpersonal synchronization for duo 1 with the upper voice as the designated leader (UpperVoiceL) or follower (UpperVoiceF), as indexed by the mean (A), standard deviation (B), coefficient of variation (CV) of absolute asynchronies (C) and median of signed asynchronies (D) calculated across ON, NB, NE and OF. Error bars indicate the standard error of the mean for precision and consistency, and the interquartile range for tendency to precede. Smaller values in the precision and consistency of asynchronies indicate an increase in coordination, whilst negative values in the tendency to precede mean that the designated leader is ahead of the follower. p < .05.

Discussion

The aim of the study was to describe and test a novel algorithm, TIMEX, that extracts onsets and offsets of phonation and note beginnings and endings from monoaural recordings of ensemble singing. The algorithm presented in this paper is based on the fundamental frequency profile. It has been developed on the basis of a purely mathematical definition of a local max/min, with the addition of a series of rules to ignore points that the definition would retain but that would not represent a change of note in the score being performed. The rules were conceived based on the issues encountered during the first processing attempts, such as local spikes, vibrato, Lx signal interruptions and the onset/offset fluctuation range. Each of these rules is associated with a threshold parameter that enforces the rule, which has been tuned by trial and error to provide the most accurate results, comparing the output of the algorithm for the selected recording to the score that was performed. When testing the algorithm, and in the case study presented, the same parameters were used for the four semiprofessional performers involved, and for the upper and lower voice parts.
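A minimal sketch of the first ingredient of such an algorithm: the purely mathematical local max/min test on a sampled f_o contour, with one crude spike-rejection rule. This is not the TIMEX implementation; the `min_width` threshold is a hypothetical stand-in for the rule parameters described above:

```python
# Illustrative local-extremum detection on an f_o contour (one value per
# analysis frame). An index i is kept as a local maximum (or minimum) only
# if it is strictly above (below) the `min_width` neighbouring samples on
# each side, which discards single-frame spikes narrower than that window.

def local_extrema(f0, min_width=3):
    """Return indices of local maxima/minima supported by at least
    `min_width` samples on each side; plateaus are ignored because the
    comparisons are strict."""
    extrema = []
    for i in range(min_width, len(f0) - min_width):
        left = f0[i - min_width:i]
        right = f0[i + 1:i + 1 + min_width]
        if all(v < f0[i] for v in left + right):    # local maximum
            extrema.append(i)
        elif all(v > f0[i] for v in left + right):  # local minimum
            extrema.append(i)
    return extrema

# Hypothetical contour (Hz) rising to a peak then dipping to a trough:
contour = [200, 210, 220, 230, 220, 210, 200, 190, 180, 190, 200, 210]
print(local_extrema(contour))  # [3, 8]
```

A full system along the lines described would then apply the remaining rules (vibrato-rate rejection, Lx-interruption handling, fluctuation thresholds) before mapping the surviving extrema to note boundaries in the score.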
The fluctuation threshold and the vibrato frequency threshold can be expected to be different for opera singers, who might exhibit a larger vibrato extent. Optimal values for the rest, fluctuation and spiking ranges are expected to vary across pieces, especially if the tempo and the duration of rests and notes at the beginnings and ends of phrases (and therefore the onsets and offsets of phonation) are very different from the two-part piece used for this set of recordings. The evaluation of TIMEX in the present study showed an overall F-measure of 78% within a tolerance window of 50 ms, which seems very promising in light of the state-of-the-art techniques presented at MIREX in 2016 yielding F-measures of around 60%. Direct comparisons with other
methods cannot be made unless the same data set is used; comparative evaluations are planned in the future. Other avenues of research should take into account issues relating to the small fluctuations within the onsets; TIMEX limits the detection to local max/min points, whilst the ground truth also considers the steepness of the f_o profile, by detecting onsets based on the rate of change of the curve. This could be addressed by developing the algorithm further, using the second derivative of the waveform in addition to the max/min points. A future direction of this research should also consider the analysis of the singing voice with lyrics. It is reasonable to expect that this algorithm will work well with percussive instruments, although they would probably require different thresholds for the same rules. Whilst the issues of singing onset detection cannot be considered solved by this system, its potential is promising. Furthermore, this study described and tested a new protocol for the analysis of synchronization in singing ensembles, based on the combined application of electrolaryngography and acoustic analysis, and the TIMEX algorithm.

Figure 10. Interpersonal synchronization for duo 2 with the upper voice as the designated leader (UpperVoiceL) or follower (UpperVoiceF), as indexed by the mean (A), standard deviation (B), coefficient of variation (CV) of absolute asynchronies (C) and median of signed asynchronies (D) calculated across ON, NB, NE and OF. Error bars indicate the standard error of the mean for precision and consistency, and the interquartile range for tendency to precede. Smaller values in the precision and consistency of asynchronies indicate an increase in coordination, whilst negative values in the tendency to precede mean that the designated leader is ahead of the follower. p < .05; p < .01.
The use of electrolaryngography allowed the identification of the contribution of individual voices, avoiding the complications of polyphonic recordings. This set-up was very successful: the signal failed on only 0.7% of the entire set of recordings, during which the analysis had to rely on the acoustic signal, which could potentially suffer from audio bleed from the other singers. In order to ensure accurate and reliable recording of vocal fold vibration in the Lx signal, proper placement of the electrodes is fundamental. The electrodes should be placed in the thyroid region behind the vocal folds, in the middle of each thyroid lamina (24). Furthermore, consideration should be given to the fact that the Lx signal may be too weak or noisy to be reliable for use with certain populations, including children (25) and sopranos (26), and when a thick layer of subcutaneous tissue is present in the neck (24,27). Finally, the role of VC and leader-follower relationships was investigated in the two singing duets. Synchronization was assessed by analysing timings between singers in each duo, as indexed by ON, NB, NE and OF asynchronies
More informationAcoustic and musical foundations of the speech/song illusion
Acoustic and musical foundations of the speech/song illusion Adam Tierney, *1 Aniruddh Patel #2, Mara Breen^3 * Department of Psychological Sciences, Birkbeck, University of London, United Kingdom # Department
More informationAUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC
AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC A Thesis Presented to The Academic Faculty by Xiang Cao In Partial Fulfillment of the Requirements for the Degree Master of Science
More informationKeywords Separation of sound, percussive instruments, non-percussive instruments, flexible audio source separation toolbox
Volume 4, Issue 4, April 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Investigation
More informationNAA ENHANCING THE QUALITY OF MARKING PROJECT: THE EFFECT OF SAMPLE SIZE ON INCREASED PRECISION IN DETECTING ERRANT MARKING
NAA ENHANCING THE QUALITY OF MARKING PROJECT: THE EFFECT OF SAMPLE SIZE ON INCREASED PRECISION IN DETECTING ERRANT MARKING Mudhaffar Al-Bayatti and Ben Jones February 00 This report was commissioned by
More informationInstructions to Authors
Instructions to Authors European Journal of Psychological Assessment Hogrefe Publishing GmbH Merkelstr. 3 37085 Göttingen Germany Tel. +49 551 999 50 0 Fax +49 551 999 50 111 publishing@hogrefe.com www.hogrefe.com
More informationCSC475 Music Information Retrieval
CSC475 Music Information Retrieval Monophonic pitch extraction George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 32 Table of Contents I 1 Motivation and Terminology 2 Psychacoustics 3 F0
More informationEstimating the Time to Reach a Target Frequency in Singing
THE NEUROSCIENCES AND MUSIC III: DISORDERS AND PLASTICITY Estimating the Time to Reach a Target Frequency in Singing Sean Hutchins a and David Campbell b a Department of Psychology, McGill University,
More informationAnalysis of local and global timing and pitch change in ordinary
Alma Mater Studiorum University of Bologna, August -6 6 Analysis of local and global timing and pitch change in ordinary melodies Roger Watt Dept. of Psychology, University of Stirling, Scotland r.j.watt@stirling.ac.uk
More informationHow do we perceive vocal pitch accuracy during singing? Pauline Larrouy-Maestri & Peter Q Pfordresher
How do we perceive vocal pitch accuracy during singing? Pauline Larrouy-Maestri & Peter Q Pfordresher March 3rd 2014 In tune? 2 In tune? 3 Singing (a melody) Definition è Perception of musical errors Between
More informationMAutoPitch. Presets button. Left arrow button. Right arrow button. Randomize button. Save button. Panic button. Settings button
MAutoPitch Presets button Presets button shows a window with all available presets. A preset can be loaded from the preset window by double-clicking on it, using the arrow buttons or by using a combination
More informationA prototype system for rule-based expressive modifications of audio recordings
International Symposium on Performance Science ISBN 0-00-000000-0 / 000-0-00-000000-0 The Author 2007, Published by the AEC All rights reserved A prototype system for rule-based expressive modifications
More informationSpeech and Speaker Recognition for the Command of an Industrial Robot
Speech and Speaker Recognition for the Command of an Industrial Robot CLAUDIA MOISA*, HELGA SILAGHI*, ANDREI SILAGHI** *Dept. of Electric Drives and Automation University of Oradea University Street, nr.
More informationIntroduction to Performance Fundamentals
Introduction to Performance Fundamentals Produce a characteristic vocal tone? Demonstrate appropriate posture and breathing techniques? Read basic notation? Demonstrate pitch discrimination? Demonstrate
More informationWHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?
WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.
More informationANALYSING DIFFERENCES BETWEEN THE INPUT IMPEDANCES OF FIVE CLARINETS OF DIFFERENT MAKES
ANALYSING DIFFERENCES BETWEEN THE INPUT IMPEDANCES OF FIVE CLARINETS OF DIFFERENT MAKES P Kowal Acoustics Research Group, Open University D Sharp Acoustics Research Group, Open University S Taherzadeh
More informationHidden Markov Model based dance recognition
Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,
More informationSinging accuracy, listeners tolerance, and pitch analysis
Singing accuracy, listeners tolerance, and pitch analysis Pauline Larrouy-Maestri Pauline.Larrouy-Maestri@aesthetics.mpg.de Johanna Devaney Devaney.12@osu.edu Musical errors Contour error Interval error
More informationThe Measurement Tools and What They Do
2 The Measurement Tools The Measurement Tools and What They Do JITTERWIZARD The JitterWizard is a unique capability of the JitterPro package that performs the requisite scope setup chores while simplifying
More informationControlling Musical Tempo from Dance Movement in Real-Time: A Possible Approach
Controlling Musical Tempo from Dance Movement in Real-Time: A Possible Approach Carlos Guedes New York University email: carlos.guedes@nyu.edu Abstract In this paper, I present a possible approach for
More informationPHYSICS OF MUSIC. 1.) Charles Taylor, Exploring Music (Music Library ML3805 T )
REFERENCES: 1.) Charles Taylor, Exploring Music (Music Library ML3805 T225 1992) 2.) Juan Roederer, Physics and Psychophysics of Music (Music Library ML3805 R74 1995) 3.) Physics of Sound, writeup in this
More informationMELODIC AND RHYTHMIC CONTRASTS IN EMOTIONAL SPEECH AND MUSIC
MELODIC AND RHYTHMIC CONTRASTS IN EMOTIONAL SPEECH AND MUSIC Lena Quinto, William Forde Thompson, Felicity Louise Keating Psychology, Macquarie University, Australia lena.quinto@mq.edu.au Abstract Many
More informationPredicting Variation of Folk Songs: A Corpus Analysis Study on the Memorability of Melodies Janssen, B.D.; Burgoyne, J.A.; Honing, H.J.
UvA-DARE (Digital Academic Repository) Predicting Variation of Folk Songs: A Corpus Analysis Study on the Memorability of Melodies Janssen, B.D.; Burgoyne, J.A.; Honing, H.J. Published in: Frontiers in
More informationSINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION
th International Society for Music Information Retrieval Conference (ISMIR ) SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION Chao-Ling Hsu Jyh-Shing Roger Jang
More informationAutomatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting
Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting Dalwon Jang 1, Seungjae Lee 2, Jun Seok Lee 2, Minho Jin 1, Jin S. Seo 2, Sunil Lee 1 and Chang D. Yoo 1 1 Korea Advanced
More informationUsing the new psychoacoustic tonality analyses Tonality (Hearing Model) 1
02/18 Using the new psychoacoustic tonality analyses 1 As of ArtemiS SUITE 9.2, a very important new fully psychoacoustic approach to the measurement of tonalities is now available., based on the Hearing
More informationSupervised Learning in Genre Classification
Supervised Learning in Genre Classification Introduction & Motivation Mohit Rajani and Luke Ekkizogloy {i.mohit,luke.ekkizogloy}@gmail.com Stanford University, CS229: Machine Learning, 2009 Now that music
More informationTopic 11. Score-Informed Source Separation. (chroma slides adapted from Meinard Mueller)
Topic 11 Score-Informed Source Separation (chroma slides adapted from Meinard Mueller) Why Score-informed Source Separation? Audio source separation is useful Music transcription, remixing, search Non-satisfying
More informationThe Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng
The Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng S. Zhu, P. Ji, W. Kuang and J. Yang Institute of Acoustics, CAS, O.21, Bei-Si-huan-Xi Road, 100190 Beijing,
More informationMusic Information Retrieval Using Audio Input
Music Information Retrieval Using Audio Input Lloyd A. Smith, Rodger J. McNab and Ian H. Witten Department of Computer Science University of Waikato Private Bag 35 Hamilton, New Zealand {las, rjmcnab,
More informationWeek 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University
Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based
More informationBook: Fundamentals of Music Processing. Audio Features. Book: Fundamentals of Music Processing. Book: Fundamentals of Music Processing
Book: Fundamentals of Music Processing Lecture Music Processing Audio Features Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Meinard Müller Fundamentals
More informationAugmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series
-1- Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series JERICA OBLAK, Ph. D. Composer/Music Theorist 1382 1 st Ave. New York, NY 10021 USA Abstract: - The proportional
More informationInterface Practices Subcommittee SCTE STANDARD SCTE Measurement Procedure for Noise Power Ratio
Interface Practices Subcommittee SCTE STANDARD SCTE 119 2018 Measurement Procedure for Noise Power Ratio NOTICE The Society of Cable Telecommunications Engineers (SCTE) / International Society of Broadband
More informationComputational Modelling of Harmony
Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond
More informationSoundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE, and Bryan Pardo, Member, IEEE
IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 6, OCTOBER 2011 1205 Soundprism: An Online System for Score-Informed Source Separation of Music Audio Zhiyao Duan, Student Member, IEEE,
More informationAUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION
AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate
More informationy POWER USER MUSIC PRODUCTION and PERFORMANCE With the MOTIF ES Mastering the Sample SLICE function
y POWER USER MUSIC PRODUCTION and PERFORMANCE With the MOTIF ES Mastering the Sample SLICE function Phil Clendeninn Senior Product Specialist Technology Products Yamaha Corporation of America Working with
More informationA Beat Tracking System for Audio Signals
A Beat Tracking System for Audio Signals Simon Dixon Austrian Research Institute for Artificial Intelligence, Schottengasse 3, A-1010 Vienna, Austria. simon@ai.univie.ac.at April 7, 2000 Abstract We present
More informationTimbre blending of wind instruments: acoustics and perception
Timbre blending of wind instruments: acoustics and perception Sven-Amin Lembke CIRMMT / Music Technology Schulich School of Music, McGill University sven-amin.lembke@mail.mcgill.ca ABSTRACT The acoustical
More informationACCURATE ANALYSIS AND VISUAL FEEDBACK OF VIBRATO IN SINGING. University of Porto - Faculty of Engineering -DEEC Porto, Portugal
ACCURATE ANALYSIS AND VISUAL FEEDBACK OF VIBRATO IN SINGING José Ventura, Ricardo Sousa and Aníbal Ferreira University of Porto - Faculty of Engineering -DEEC Porto, Portugal ABSTRACT Vibrato is a frequency
More informationMusic BCI ( )
Music BCI (006-2015) Matthias Treder, Benjamin Blankertz Technische Universität Berlin, Berlin, Germany September 5, 2016 1 Introduction We investigated the suitability of musical stimuli for use in a
More information