Speech and Speaker Recognition for the Command of an Industrial Robot


CLAUDIA MOISA*, HELGA SILAGHI*, ANDREI SILAGHI**
* Dept. of Electric Drives and Automation, University of Oradea, University Street, nr. 1, Oradea, ROMANIA
hsilaghi@uoradea.ro, moisa.claudia@live.com
** Politehnica University of Timisoara, Vasile Parvan Bv, nr. 2, Timisoara, ROMANIA
evil4uandrei@yahoo.com

Abstract - This paper presents a recognition method for commanding an industrial robot with isolated words, together with a speaker recognition method for the case in which a command may be issued only by a particular person. Finally, some simulation results are presented.

Key-Words: Voice processing, Speech recognition, Speaker recognition, Industrial robot

1 Introduction
Automatic speech recognition has been studied for many years, with man-computer dialog as its goal. Nowadays, communication with the computer can be achieved using a speech recognition and voice synthesis system. Speech recognition is the process of automatically extracting and describing the linguistic information contained in the voice signal, and it can be performed by computers. This linguistic information is called phonetic information; when recognition is extended to the person, it becomes speaker recognition.

Speech recognition is performed by comparing the input voice signal with signals stored in a previously built library. Various parameters are extracted from the voice signal, and the comparison is carried out with various mathematical methods. There are two main types of recognition: isolated word recognition and continuous speech recognition. To build an advanced recognition system, one must also know whether it will be used by several speakers or the speaker is always the same person. Speaker-independent recognition systems are more complex because they must work adaptively, changing their parameters when the speaker changes.

This paper aims to present, technically and theoretically, an effective example of an application for real-time voice acquisition, processing and analysis. It thus gives a series of theoretical and practical views on how audio signal processing applications can be developed using digital acquisition and processing systems.

2 Speech/silence algorithm
The acoustic signal is composed of a sequence of sounds generated by the human vocal tract under the control of the brain. The voice signal is represented by a structure called "wave", which contains the following fields: the length of the analysis window, i.e. of one frame ("lc"), the number of samples ("n") and the total number of frames of the sound signal ("nc"):

wave.file = sound;                     % imported voice signal
wave.lc = 256;                         % analysis window length (one frame)
wave.n = length(wave.file);            % number of samples of the voice signal
wave.nc = floor(wave.n / wave.lc);     % number of frames

A very important aspect in the analysis, synthesis, recognition and coding of speech is the correct detection of the periods of silence and speech in the voice signal. The characteristics of the voice signal are useful in this regard: voiced segments (vowels) are characterized by high energy and a strong correlation between adjacent samples, while unvoiced segments (consonants) are very similar to noise, having low energy and a poor correlation between samples. In real time, the voiced/unvoiced decision can be taken with two parameters: energy and zero crossing rate (ZCR). The decision criteria are summarized in the following table:

Signal                   Energy         ZCR
Voiced (vowel)           high           low
Unvoiced (consonant)     medium/low     high
Noise                    low            high
Silence                  ~0             ~0

Table 1. Decision criteria for the speech/silence algorithm

Below is the silence/speech detection algorithm; it detects both the periods of silence and those of speech. The algorithm is significantly improved in comparison with ordinary speech recognition algorithms: it recognizes isolated words, and its results proved to be better than those of the classical algorithm. Steps:

1. Import the file containing the voice signal.
2. Set i = 1.
3. For i < Framework_no: Silence_speech(i) = 0.
4. For i < Framework_no:
   If En(i) > En_min
      If En(i) > En_max
         Silence_speech(i) = 1
      Else if (NTZ(i) > NTZ_min) & (NTZ(i) < NTZ_max)
         Silence_speech(i) = 1
5. Display Silence_speech.

The algorithm was implemented in MATLAB.

3 Speech recognition
The main difficulty faced by speech recognition programs is that the voices of two people may be somewhat similar, while on the other hand the voice of the same person may vary in certain situations. This matters especially in industrial robot control, where the robot must act on a command given by a user.

3.1 Frequency analysis of the voice signal
Frequency analysis of the voice signal provides a set of parameters more useful for processing than time domain analysis. The excitation and the vocal tract can be easily separated in the spectral domain. Utterances of different sentences differ in the time domain while being similar in frequency. Also, the human auditory system is more sensitive to the spectral content of the voice signal than to its phase. Therefore, spectral analysis is used to extract the majority of the voice signal parameters.

3.2 Linear prediction analysis
A common method for analyzing the voice signal is linear predictive analysis (Linear Predictive Coding), also known as LPC analysis or AR (autoregressive) modeling. It is a simple, fast and at the same time very efficient way of calculating the parameters of the voice signal. The LPC coefficients are determined by analyzing the voice signal frame by frame; we obtained them with the MATLAB function lpc(). The imported voice signal was sampled at fs = 16 kHz, we used an LPC prediction order of 18, and we chose a frame length of 256 samples. The vocal tract is modeled by a digital filter with time-varying coefficients and a gain. The voice signal parameters are:

- voiced/unvoiced decision
- fundamental frequency F0 for voiced segments
- gain of the digital filter
- filter coefficients

Note that the number of LPC coefficients is always one more than the prediction order, in this case 19. Also, the first coefficient is always 1 and corresponds to the current, unmodified sample.
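As an illustration, the speech/silence decision above can be combined with the LPC analysis of this section in a few lines of MATLAB. This is a minimal sketch, not the original program: the input file name is an assumption, and the threshold values are the statistical limits reported in Section 6.

[x, fs] = audioread('start.wav');          % assumed recording, fs = 16 kHz
wave.file = x(:, 1);
wave.lc = 256;                             % frame length
wave.n = length(wave.file);
wave.nc = floor(wave.n / wave.lc);
En_min = 1; En_max = 10;                   % energy limits (Section 6)
NTZ_min = 10; NTZ_max = 80;                % zero crossing limits (Section 6)
p = 18;                                    % LPC prediction order
Silence_speech = zeros(1, wave.nc);
coeffs = [];                               % LPC coefficients of the speech frames
for i = 1:wave.nc
    frame = wave.file((i-1)*wave.lc + 1 : i*wave.lc);
    En = sum(frame.^2);                    % frame energy
    NTZ = sum(abs(diff(sign(frame)))) / 2; % zero crossing count
    if En > En_min
        if En > En_max
            Silence_speech(i) = 1;
        elseif NTZ > NTZ_min && NTZ < NTZ_max
            Silence_speech(i) = 1;
        end
    end
    if Silence_speech(i)
        coeffs = [coeffs; lpc(frame, p)];  % 19 coefficients, the first is always 1
    end
end

The matrix coeffs accumulates one row of 19 LPC coefficients per speech frame, which is exactly the per-command record stored in the dictionary described in Section 3.3 below.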

3.3 Training
The recognition of commands is based on a dictionary of words. The creation of the dictionary is the training operation: determining the specifics of each word, i.e. the way a word is said, and saving it for later recognition. Training is done by repeating the words and updating the dictionary after each pronounced command. The words are saved in the dictionary as follows:

- a line containing the string delimiter "$$$", marking that the next word is a new one
- a line containing the command name
- a line containing an integer N, the number of frames for which feature extraction was performed
- a matrix of N rows and 19 columns (as the LPC prediction order is 18), each row holding the LPC coefficients of one frame of the word/command

It is very important that the system is trained in exactly the conditions in which it will be used. It is also recommended to use a sensitive microphone for receiving the voice commands, as it will reduce the noise. To obtain the desired performance, during training the speakers must be fluent and speak with a normal voice, neither too slowly nor too quickly. The system is designed to adapt to the user's voice: since the pronunciations of different words can be similar, the system tolerates some small errors and adapts to how the user speaks. Nevertheless, during training one should try to make as few mistakes as possible.

Training is implemented in the following way: the voice signal for training is recorded, then the features of the obtained sound are extracted: energy, zero crossing rate and LPC coefficients. Note that the LPC coefficients are extracted only where the speech/silence algorithm detects speech. The parameters thus obtained are saved in a text file that represents the dictionary.

4 Speaker recognition
Speaker recognition is basically divided into two parts: identification and verification. It is a way of automatically identifying who is speaking on the basis of individual information included in the speech. The main goal of this project is to identify the speaker from a list of reference speaker models: the algorithm compares the voice signal of an unknown speaker with a database of known speakers, so a system previously trained with a number of speakers can recognize the unknown speaker. The figure below presents the fundamental process of speaker identification. In most applications, the voice is used to confirm the identity of a speaker.

Figure 1. The fundamental process of speaker identification

The diagram in Figure 1 was implemented in MATLAB code.

4.1 The Mel cepstral analysis
In order to determine the correct speaker, we used Mel cepstral analysis. It uses the Mel scale and gives its results as the MFCC (Mel Frequency Cepstral Coefficients). After the power spectrum is computed using the Discrete Fourier Transform, the signal is mapped onto the Mel frequency scale by passing it through a triangular filter bank. Logarithmization follows, and finally the MFCC coefficients are obtained by applying the Discrete Cosine Transform to the new spectrum. The latter step is justified by the fact that the coefficients resulting from the power spectrum computation are strongly correlated among themselves; since cepstral coefficients are known to be uncorrelated, the Discrete Cosine Transform completes the switch to the Mel cepstral representation.
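The computation just described can be sketched in MATLAB for a single frame as follows. This is a minimal sketch under stated assumptions: the number of triangular filters (24) and the number of retained coefficients (13) are illustrative choices, not values given in the paper.

mel  = @(f) 2595 * log10(1 + f/700);        % Hz -> Mel
imel = @(m) 700 * (10.^(m/2595) - 1);       % Mel -> Hz
NFFT = 256; fs = 16000; nFilt = 24;
frame = wave.file(1:256);                   % one analysis frame (assumed voiced)
P = abs(fft(frame, NFFT)).^2;               % power spectrum via the DFT
P = P(1:NFFT/2 + 1);
edges = imel(linspace(0, mel(fs/2), nFilt + 2));   % filter edges, equally spaced in Mel
bins = floor((NFFT + 1) * edges / fs);             % corresponding FFT bins
H = zeros(nFilt, NFFT/2 + 1);                      % triangular filter bank
for m = 1:nFilt
    for k = bins(m):bins(m+1)                      % rising slope
        H(m, k+1) = (k - bins(m)) / (bins(m+1) - bins(m));
    end
    for k = bins(m+1):bins(m+2)                    % falling slope
        H(m, k+1) = (bins(m+2) - k) / (bins(m+2) - bins(m+1));
    end
end
c = dct(log(H * P + eps));                  % log Mel energies, then DCT
mfcc = c(1:13);                             % keep the first coefficients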
4.2 The determination of distance in the acoustic environment
The recognition of voice signals by comparing models made up of various parameters was performed through the calculation of distances. From a phonetic point of view, spectral changes that lead to different sounds must correspond to large distances, while renditions of the same perceived sound must be associated with smaller distances. Thus, to recognize a word, the distance between its parameters and those of each word in the dictionary is calculated. The word at the minimum distance is considered to be the recognized word; if the distance is greater than a maximum admissible threshold, the algorithm considers the word unknown. Repeated experiments and statistical processing of the results showed that acceptable results are obtained by setting the threshold value to 35.
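The decision rule can be sketched as a MATLAB function. This is a minimal sketch under assumptions the paper does not spell out: a Euclidean (Frobenius) distance between coefficient matrices, truncated to the common number of frames; the function and field names are illustrative.

function [word, dmin] = recognize(features, dict, threshold)
% recognize.m - dictionary lookup by minimum distance (illustrative sketch)
% features : N-by-19 matrix of LPC coefficients of the spoken command
% dict     : struct array with fields .name and .coeffs (Ni-by-19 each)
% threshold: maximum admissible distance (35 in the experiments above)
word = 'unknown'; dmin = inf; best = '';
for k = 1:numel(dict)
    n = min(size(features, 1), size(dict(k).coeffs, 1));
    d = norm(features(1:n, :) - dict(k).coeffs(1:n, :), 'fro') / n;
    if d < dmin
        dmin = d; best = dict(k).name;
    end
end
if dmin <= threshold                        % otherwise the word stays unknown
    word = best;
end
end

A call such as [word, d] = recognize(coeffs, dict, 35) then returns the recognized command or 'unknown'.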

5 Graphical User Interface
A GUI (Graphical User Interface) has been implemented in order to give the user an application that is easy to use. It consists of graphical widgets such as windows, menus, radio buttons, check boxes and graphics windows for displaying results. In addition to the keyboard, the graphical interface uses a pointing device (mouse). The GUI consists of three command-and-control parts and two windows used for displaying results, as follows:

- The training part, where the user is given the opportunity to train the program with the different industrial robot commands that the program will have to recognize. The user can also record and listen to the given/trained commands.
- The recognition part, which performs the speech and speaker recognition of user commands and displays the results in the recognized word panel.
- The results part, which is divided into two graphs and displays the waveforms and a selected feature parameter.

In the upper display panel, the user can choose to see the waveform of the recorded signal or of the trained voice command. Depending on the previously selected option, in the bottom graphics panel the user can choose between several speech parameters:

- zero crossing rate
- energy
- autocorrelation
- average magnitude difference function
- speech/silence detection algorithm
- spectrogram
- LPC spectrum

Also, the "autocorrelation" and "average magnitude difference function" options calculate and display the fundamental frequency. The program was implemented in the MATLAB programming environment.

Figure 2. The implemented program GUI

6 Results
The first results we discuss are those derived from the basic operations of the module's tools. We chose the waveform corresponding to the voice command "start".

Figure 3. The waveform corresponding to the utterance of the voice command "start"

On this signal we apply the following processing (sketched after Figure 4 below):

1. Segmentation into frames of 256 samples
2. Application of a rectangular window
3. Removal of the continuous (DC) component
4. Voice signal detection

Figure 4. The graph representing the zero crossing rate (ZCR)
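A minimal MATLAB sketch of steps 1-3, assuming the "start" recording is already loaded in wave.file as in the earlier sketch; step 4 is the speech/silence algorithm of Section 2.

nFrames = floor(length(wave.file) / 256);
frames = reshape(wave.file(1:nFrames*256), 256, nFrames);  % 1. segmentation into 256-sample frames
frames = frames .* rectwin(256);        % 2. rectangular window (identity weighting)
frames = frames - mean(frames);         % 3. per-frame removal of the DC component
% 4. detection: apply the speech/silence algorithm of Section 2 to these frames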

Figure 5. The graph representing the signal energy

In order to obtain the best results from the speech/silence detection algorithm, we determined the minimum and maximum zero crossing rate and energy with already implemented functions. First, we analyzed the parameters of signals with Gaussian noise of normal distribution. Then we performed the same analysis on signals containing traffic noise, recorded on the street and in the bus. Next, we analyzed the parameters of speech signals acquired under environmental noise. The results are presented in the following table:

Table 2. Limits of energy and ZCR for the voice commands given to the industrial robot

After also analyzing the noise signal, we extracted the following:

- environmental noise parameters: ZCR = 70, energy = 1-2
- consonant parameters: ZCR > 100

Using the results obtained, we chose the following statistical limits:

- minimum energy = 1
- maximum energy = 10
- minimum ZCR = 10
- maximum ZCR = 80

Figure 6. The graph representing the result of the speech/silence detection algorithm

The recognition rate of the speech recognition system is around 90%. Better results were obtained with the speaker recognition algorithm. For the proper functioning of the program, it is important that the commands are uttered similarly in all cases. The steps of the algorithm are as follows:

1. Application of a rectangular window to the signal
2. Removal of the continuous (DC) component
3. Speech localization using the speech/silence detection algorithm
4. Calculation of the Mel-scale cepstral coefficients
5. Calculation of the distance between the user's voice command and the dictionary commands

The recognition rate of the system under these conditions is around 95%. The user can also view the spectrum of the voice signal and the LPC spectrum in the graphics panels, as follows:

Figure 7. The graph representing the spectrum of the voice command

Figure 8. The graph representing the LPC spectrum of the voice command

Also, the user has the option to display the "autocorrelation" and the "average magnitude difference function", which also calculate and display the fundamental frequency.
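As an illustration of the autocorrelation method of Figure 9, here is a minimal MATLAB sketch of the fundamental frequency estimation on one voiced frame; the 80-400 Hz search range and the frame position are assumptions, not values from the paper.

fs = 16000;
frame = wave.file(5001:5256);                % an assumed voiced 256-sample frame
frame = frame - mean(frame);                 % remove the DC component
[r, lags] = xcorr(frame, 'coeff');           % normalized autocorrelation
r = r(lags >= 0);                            % keep non-negative lags only
lmin = floor(fs/400); lmax = ceil(fs/80);    % pitch period range for 80-400 Hz
[~, k] = max(r(lmin+1:lmax+1));              % strongest peak in the range
F0 = fs / (lmin + k - 1)                     % fundamental frequency in Hz
% The AMDF of Figure 10 replaces the peak search by the minimum of
% d(L) = sum(abs(frame(1:end-L) - frame(1+L:end))) over the same lag range.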

function" which also calculates and displays the fundamental frequency. Figure 9. The graph representing the function of "autocorrelation" of the voice command; it also calculates the fundamental frequency value Figure 10. The graph representing the "average magnitude difference function" command; it also calculates the fundamental frequency value Results obtained from experimental verification of the program developed, speech recognition testing male and female, are presented in two tables: 7 Conclusion The implemented algorithm wanted to be a tool to control industrial robots using the voice command. Besides the usefulness of tools already implemented in the application, it demonstrates its reliability and ease of future development. It allows users to easily add their own tools in any of the processing modules.based on obtained experimental results it demonstrates that the proposed algorithm is indeed functional and it can be used in voice command control of industrial robots. Percentage of correct recognition of commands is high enough, besides the fact that the used computational resources (CPU frequency, RAM) are lower compared to other algorithms. References: [1] Giurgiu M., Cepstral Analysis Of Speech, Proceedings of Rep 94, Bucuresti, pp.13-16 [2] Furui S., Cepstral Analysis Technique For Automatic Speaker Verification, IEEE Transactions On Acoustics Speech, And Signal Processing, Vol. Assp-29, Nr. 2, pp. 254-272 [3] Giurgiu, M., Isolated Word Speech Recognition System Using Both Dtw And Vq, Proceeding of 2nd International Conference: Design To Manufacture In Modern Industry,1994, Bled, Slovenia, pp. 566-571 [4] Silaghi Helga, Electrical drive systems with induction machine. Data Acquisition. Informatic Techniques,Treira Publishing, ISBN 973-8159-26-1,Oradea, Romania, 2000 [5] Silaghi Helga, Silaghi M., About Using the Microrobot System RV-M1 in the Automatization of the Dimensional Control Operations, Electrical Drives and Power Electronics, vol.i, ISBN 80-88922-06-2, High Tatras - 1996, Slovakia, p.275-279 [6] Silaghi Helga, Control Problem of an Industrial Robot Equiped with DC Servomotors, Annals of University of Oradea, ISSN 1223-2106, Felix Spa, 1996, Romania, p.357-362 [7] Silaghi Helga, The Challenge of Designing Actuated Medical Robots for Safe Human Interaction, Simpozionul National de Electrotehnica Teoretica SNET 07, ISBN 978-973-718-899-1, 2007, Bucuresti, Romania, pp.285-290 Tabel 3. ISSN: 1792-5967 36 ISBN: 978-960-474-238-7