Blind Identification of Source Mobile Devices Using VoIP Calls

Mehdi Jahanirad, Ainuddin Wahid Abdul Wahab, Nor Badrul Anuar, Mohd Yamani Idna Idris, and Mohamad Nizam Ayub
Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia
Security Research Group (SECReg)
mehdijahanirad@siswa.um.edu.my

Abstract—Audio sources such as speakers and recording environments produce signal variations that interfere with the artifacts introduced by different communication devices. Despite these convolutions, the signal variations produced by different mobile devices leave intrinsic fingerprints on recorded calls, thus allowing the models and brands of the devices involved to be tracked. This study investigates the use of recorded Voice over Internet Protocol (VoIP) calls for the blind identification of source mobile devices. The proposed scheme employs a combination of entropy and mel-frequency cepstrum coefficients to extract the intrinsic features of mobile devices and analyzes these features with a multi-class support vector machine classifier. The experimental results show accurate identification of 10 source mobile devices with an average accuracy of 99.72%.

Index Terms—Pattern recognition, mel-frequency cepstrum coefficients, entropy, device-based detection technique

I. INTRODUCTION

Audio forensics has attracted increasing attention in recent years because of its application in situations that require trust in the authenticity and integrity of audio signals [1]. An example of such an application is the forensic acquisition, analysis, and evaluation of audio recordings admissible as evidence in court cases [2]. Current audio authenticity approaches are categorized according to the artifacts extracted from the signal itself.
These approaches include: (a) environment-based techniques, in which the frequency spectra are shaped by the recording environment; (b) device-based techniques, in which the frequency spectra are produced by the recording device; and (c) ENF-based techniques, in which the frequency spectra are generated by the power source of the recording device [3]. Although advanced research has been conducted on ENF-based techniques [4], [5] and environment-based techniques [6], [7], few studies have explored the application of device-based techniques in real-time forensics [1], [8]. Device-based techniques build on blind source camera identification in image forensics [9], [10], [11]. However, adapting this approach to audio forensics is challenging because audio evidence is produced by a combination of audio sources, such as speakers and environments. The first practical evaluation of source microphone authentication was developed by Kraetzer et al. [8] through statistical pattern recognition techniques. This method utilizes features originally designed for the detection of hidden communication to identify the origin of audio streams. Buchholz et al. [12] focused on microphone classification by using Fourier coefficient histograms as audio features. This method eliminates speech convolution by computing the coefficients from near-silent frames. Although microphone forensics allows the identification of source recording devices, it cannot provide sufficient evidence to identify source communication devices. Garcia-Romero and Espy-Wilson [13] proposed an automatic acquisition device identification method using speech recordings from the Lincoln-Labs Handset Database, which covers both microphones and landline telephone handsets. This method eliminates the effects of signal variations caused by speech signals through a frequency response characterization of the device contextualized by the speech content.
The method implements Gaussian mixture models on 23 mel-frequency cepstrum coefficients (MFCCs), 38 linear-frequency cepstrum coefficients, and their combination with the first-order derivative (delta) of both feature sets to determine the Gaussian supervector (GSV) associated with each device. A linear support vector machine (SVM) classifier builds the training model by using this vector as an intrinsic fingerprint of each acquisition device. Hanilçi et al. [14] proposed a method based on speaker recognition research [15], [16], [17] that extracts MFCCs as features from recorded speech. This method collects recorded speech samples from different mobile devices to identify their source brands and models. However, Hanilçi et al. [14] treated mobile devices as ordinary tape recorders to avoid the complications of transmitting and receiving signals. Building on the acquisition device identification methods in [13], [14], Panagakis and Kotropoulos [18] proposed a telephone handset identification method that uses random spectral features (RSFs) and labeled spectral features (LSFs). This method extracts the RSFs and LSFs from the mean spectrogram of speech signals. The method also uses sparse representation-based classification (SRC), as well as neural network (NN) and SVM classifiers, to assess its performance in classifying a dataset obtained from eight telephone handsets.

978-1-4799-2027-3/14/$31.00 © 2014 IEEE 486
Previous device-based techniques have focused on authentication based on recording devices. In the present work, we propose a new approach to the identification of source mobile devices engaged in VoIP calls. The mobile devices used in this study are equipped with built-in circuits and electrical components, and the digital signal processing in these devices produces signal variations. Thus, calls recorded using these devices contain intrinsic artifacts that can be captured using a combination of entropy and MFCC features. Furthermore, the method uses the near-silent segments of the signals for feature extraction to eliminate the interference resulting from variation among speakers. Finally, the combined feature set is analyzed with a multi-class SVM classifier to identify 10 source mobile devices.

II. ALGORITHM OVERVIEW

The proposed algorithm includes two main stages: feature extraction and feature analysis. Feature extraction determines meaningful information from a collection of call recordings to distinguish the mobile devices from one another. Feature analysis utilizes these features to build a model for each mobile device and tests the model to evaluate its performance in detecting all mobile devices from the same class. A class represents the brand and model of a mobile device.

A. Feature extraction

The combination of entropy and MFCC has been used in speech recognition to improve performance and robustness in the presence of additive noise [19], [20], [21]. The present study explores the entropy of the mel-frequency cepstrum as a feature for the blind identification of source mobile devices. Fig. 1 illustrates the computation of MFCCs.

Fig. 1. Computation of MFCCs.

The feature extraction algorithm generates blocks by splitting the audio samples and then extracts the MFCC features from all blocks generated from the sample data. The algorithm splits each block into frames of approximately 23 ms, with each frame windowed by a Hamming window in the time domain. Subsequently, it determines the FFT magnitude spectrum of each windowed frame and filters the spectrum using 27 triangular-shaped filters in the mel domain. The mel domain is linear for frequencies below 1000 Hz and logarithmic for frequencies above 1000 Hz. Equation (1) computes the mel-domain central frequencies using base-10 logarithms:

f_mel = 1000 · log10(1 + f/1000) / log10(2).   (1)

The logarithm of the filterbank outputs determines the spectral envelope in decibels, and the discrete cosine transform of these envelopes determines the MFCCs. During this process, 12 coefficients are computed by the MFCC algorithm. Each row in the mel cepstrum output represents the 12 coefficients computed for one frame (Fig. 2).

Fig. 2. Entropy-MFCC feature extraction.

At this point, the feature extraction algorithm uses entropy to capture the peakiness of the distribution among all frames in the mel cepstrum output. The mel cepstrum output M_{i,j} is an array M of size {N × 12}, where N is the total number of frames, i = {1, 2, 3, ..., N}, and j = {1, 2, 3, ..., 12}. The algorithm computes the entropy for the 12 coefficients in two stages. First, it normalizes the spectrum into a probability mass function (PMF) through

x_i = X_i / Σ_{i=1}^{N} X_i,  for i = 1 to N,   (2)

where X_i is the energy of the i-th frequency component and x_i is the PMF of the signal. Second, it computes the entropy H(x) as

H(x) = − Σ_{x_i ∈ X} x_i · log2(x_i).   (3)

In this way, the algorithm generates a total of 12 entropy-MFCC features, implemented through MATLAB functions.

B. Feature analysis

Feature analysis investigates the extracted features by using classification techniques. This study applies SVM classification because of its satisfactory performance in pattern recognition approaches [22]. SVMs are initially designed for binary (one-versus-one) classification.
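The entropy-MFCC extraction described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' MATLAB implementation: the sampling rate, non-overlapping frames, and the use of the squared coefficient value as the "energy" in Eq. (2) are assumptions made here for concreteness.

```python
import numpy as np

def hz_to_mel(f):
    # Eq. (1): linear below 1 kHz, logarithmic above (base-10 logarithms)
    return 1000.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 1000.0) / np.log10(2.0)

def mel_to_hz(m):
    # inverse of Eq. (1)
    return 1000.0 * (10.0 ** (np.asarray(m, dtype=float) * np.log10(2.0) / 1000.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # triangular filters with centres spaced uniformly on the mel scale
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb

def entropy_mfcc(block, sr=8000, frame_len=184, n_filters=27, n_coeffs=12):
    """12 entropy-MFCC features for one block (frame_len = 184 ~ 23 ms at 8 kHz)."""
    n_frames = len(block) // frame_len              # non-overlapping frames for simplicity
    frames = np.reshape(block[: n_frames * frame_len], (n_frames, frame_len))
    frames = frames * np.hamming(frame_len)         # Hamming window in the time domain
    spec = np.abs(np.fft.rfft(frames, axis=1))      # FFT magnitude spectrum
    mel_energy = spec @ mel_filterbank(n_filters, frame_len, sr).T
    log_mel = 10.0 * np.log10(np.maximum(mel_energy, 1e-12))   # spectral envelope in dB
    # DCT-II over the filter outputs; keep coefficients 1..12 per frame
    n = np.arange(n_filters)
    k = np.arange(1, n_coeffs + 1)
    dct_basis = np.cos(np.pi * np.outer(k, 2 * n + 1) / (2 * n_filters))
    mfcc = log_mel @ dct_basis.T                    # mel cepstrum output M, shape (N, 12)
    # Eqs. (2)-(3): per coefficient, normalise frame-wise energies to a PMF,
    # then take the entropy across frames (energy taken as the squared value)
    energy = mfcc ** 2
    pmf = energy / np.maximum(energy.sum(axis=0, keepdims=True), 1e-12)
    return -(pmf * np.log2(np.maximum(pmf, 1e-12))).sum(axis=0)
```

Each block thus yields a 12-dimensional feature vector; a peaky coefficient distribution across frames gives low entropy, a flat one gives entropy close to log2(N).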
The multi-class SVM classifier generates N(N−1)/2 binary SVM classifiers, where each classifier is trained to separate one pair of classes and N represents the number of classes. The binary classifiers are combined through a classical voting scheme, whereby the class with the maximum number of votes is selected. The classification technique involves two stages: training and testing. Feature analysis organizes the features into 10 classes
with respect to the 10 mobile devices in Table I. For each class, the extracted features produce a data subset represented by the same label. The method randomly selects 70% of the data subset for training and uses the remaining 30% for testing. The classifier builds the training model using the training data subset. Then, the classifier predicts the labels corresponding to the testing data subset based on the training model, without considering the true labels. For evaluation, the classifier compares the actual classes against the predicted labels to determine the number of correct matches. The method computes the identification accuracy as the fraction of correct matches over the total number of testing data. The procedure is repeated 10 times; in each repetition, different training and testing data subsets are randomly selected, and the average identification accuracy is then determined for each mobile device.

III. EXPERIMENTAL SETUP

The proposed setup involves the collection of call recordings, as shown in Fig. 3. We record a total of 25 Skype calls for each device in a truly silent environment. The devices are listed in Table I. The silent session eliminates possible convolutions caused by speech signals generated by different speakers. The MP3 Skype call recorder v.3.1 freeware application [23] records the signals in .mp3 format. The method converts the recorded files to .wav format and then digitizes them into sample data. It then enhances the sample data to remove the noise generated by environmental reverberation. The histograms in Fig. 4 indicate the distinctiveness of the spectra of the recording signals obtained from mobile devices of the same model. The clean signals are, however, more distinct than the noisy signals. The color indicates the noise level in decibels, as shown by the color bar on the right side of each histogram.
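The noise-removal step mentioned above is magnitude-domain spectral subtraction in the style of Berouti et al. [24]. A minimal sketch is given below; the frame length, 50% overlap, Hann analysis window, and the estimation of the noise magnitude from the first few frames are assumptions made here, not details taken from the paper.

```python
import numpy as np

def spectral_subtraction(x, frame_len=256, noise_frames=10):
    """Magnitude-domain spectral subtraction (subtraction factor and gain
    exponent both 1, matching g = e = 1 in the paper); the noise magnitude is
    estimated from the first `noise_frames` frames, assumed noise-only."""
    hop = frame_len // 2
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # analysis: windowed FFT of each overlapping frame
    specs = [np.fft.rfft(x[i * hop : i * hop + frame_len] * win)
             for i in range(n_frames)]
    noise_mag = np.mean([np.abs(s) for s in specs[:noise_frames]], axis=0)
    out = np.zeros(len(x))
    wsum = np.zeros(len(x))
    for i, s in enumerate(specs):
        # subtract the noise magnitude, keep the noisy phase, floor at zero
        mag = np.maximum(np.abs(s) - noise_mag, 0.0)
        seg = np.fft.irfft(mag * np.exp(1j * np.angle(s)), n=frame_len)
        out[i * hop : i * hop + frame_len] += seg * win   # overlap-add
        wsum[i * hop : i * hop + frame_len] += win ** 2
    return out / np.maximum(wsum, 1e-12)
```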
Evidently, the colors change because high-level noise is reduced by the enhancement process. The enhancement process uses a subtraction factor of g = 1 and a gain exponent of e = 1 for magnitude-domain spectral subtraction [24].

TABLE I
MOBILE DEVICES, MODELS, AND CLASS NAMES USED IN THE EXPERIMENTS

Mobile Device        Model      Operating System   Class Name
Galaxy Note 10.1-A   GT-N8000   Android 4.1.2      GNA
Galaxy Note 10.1-B   GT-N8000   Android 4.1.2      GNB
Galaxy Note          GT-N7000   Android 2.3.6      GN
Galaxy Note II-A     GT-N7100   Android 4.1.2      GNIIA
Galaxy Note II-B     GT-N7100   Android 4.1.2      GNIIB
Galaxy Tab 10.1      GT-P7500   Android 3.1        GT
Apple iPad           MC775ZP    Apple iOS 5.1.1    iPadA
Apple iPad New       MD366ZP    Apple iOS 5.1.1    iPadB
Asus Nexus 7         -          Android 4.2.2      Asus
HTC Sensation XE     -          Android 4.0.3      HTC

The proposed algorithm segments the clean signals into overlapping frames with a length of 40 samples. The shortened frame signals form an array of size {N_f × n}, where N_f is the number of frames and n is the frame length. The frame shortening method segments each recorded signal of length 8 s into blocks of approximately 200 ms. As a result, we generate a total of 1000 blocks from the recorded calls of each mobile device. The 12 entropy-MFCC features are computed from the generated blocks to obtain a data subset of length 1000 for each mobile device. The method randomly selects 700 blocks for training and uses the remaining 300 blocks for testing. We thus obtain 7000 training and 3000 testing data points from the 10 mobile devices. The experiment is repeated 10 times, and the average accuracy is computed.

Fig. 3. Proposed setup for recording conversations.

IV. RESULTS

Table II shows the average confusion matrix generated by running 10 experiments using the 10-class SVM classifier.
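The evaluation protocol (10 repetitions of a random 70/30 split, classified by one-versus-one SVM voting) can be sketched as follows. scikit-learn is an assumed stand-in for the tooling, and the feature matrix below is synthetic; in the paper, the 1000 twelve-dimensional entropy-MFCC vectors per class come from the recorded Skype calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: 1000 twelve-dimensional feature vectors for each of 10 classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(1000, 12)) for c in range(10)])
y = np.repeat(np.arange(10), 1000)

accuracies = []
for rep in range(10):  # 10 repetitions with fresh random 70/30 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.7, stratify=y, random_state=rep)
    # SVC trains N(N-1)/2 one-versus-one binary SVMs and combines them by voting
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    accuracies.append((clf.predict(X_te) == y_te).mean())
print(f"average accuracy over 10 runs: {np.mean(accuracies):.4f}")
```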
The diagonal values of the matrix represent the classification accuracies of the 10 mobile devices, whereas the off-diagonal values indicate misclassifications among the devices. A high average classification accuracy of 99.72% is achieved across all mobile devices, and the misclassification rates are negligible (each less than 0.27%). The mobile devices of the same model (Galaxy Note 10.1-A/B and Galaxy Note II-A/B) achieve high average accuracy rates of 99.74% and 99.76%, respectively.

TABLE II
CONFUSION MATRIX FOR IDENTIFYING SOURCE MOBILE DEVICES BASED ON CALL RECORDINGS
(total average accuracy rate: 99.72%)

Actual   Correct (%)   Misclassified (%)
GNA      99.67         0.10, 0.13
GNB      99.80         0.10
GN       99.83
GNIIA    99.83         0.10
GNIIB    99.70         0.20, 0.10
GT       99.77
iPadA    99.60         0.10, 0.17
iPadB    99.73         0.23
Asus     99.63         0.27, 0.10
HTC      99.63         0.17, 0.13

Note: Cells marked with an asterisk in the matrix indicate values of less than 0.1%.

Fig. 4. Histogram comparison of the recording signals from mobile devices of the same model (clean vs. noisy signals): (a) Galaxy Note 10.1-A, (b) Galaxy Note 10.1-B, (c) Galaxy Note II-A, (d) Galaxy Note II-B.

In an alternative approach, Fig. 5 visualizes the classification results by using the Euclidean distance similarity method adopted from [25]. This method determines the similarity distances between the feature values in an N × N Euclidean distance matrix and then reduces its dimension to 2 to determine the X and Y components (Fig. 5). Each color represents the class label of the dataset associated with one mobile device. Unfilled markers represent data instances from the training data subset, and filled markers represent data instances from the testing data subset. As shown in Fig. 5, the Euclidean distance method clusters both the training and testing data subsets into 10 groups. This observation confirms the results obtained by the 10-class SVM classifier. We can therefore infer that the proposed entropy-MFCC features are effective in the blind identification of source mobile devices from recorded VoIP calls.

Fig. 5. Clustering of training (unfilled markers) and testing (filled markers) data subsets into 10 groups using the Euclidean distance method.

V. CONCLUSION

In this work, we presented an approach to the identification of source mobile devices using recorded VoIP calls. We adopted MFCC and entropy features from speech recognition studies to develop a framework for identifying the distinguishing patterns of different mobile devices. Given the use of a silent Skype session in the investigation, the differences between the samples are caused only by the different mobile devices. An average accuracy of 99.72% was achieved for all 10 devices. Most notably, this study is the first to investigate the distinguishing features of source mobile devices using VoIP calls. The results suggest that the proposed approach should next be tested on conversations recorded during communication over other types of service, such as cellular and PSTN networks.

ACKNOWLEDGMENT

We would like to thank the UM/MoHE High Impact Research Grant (UM.C/HIR/MOHE/FCSIT/17) for
funding this research, and all members of the Security Research Group (SECReg) of the Department of Computer System and Technology, University of Malaya, for sharing their knowledge and experience. They led us through many helpful discussions and have been a constant source of motivation, guidance, encouragement, and trust.

REFERENCES

[1] C. Kraetzer, K. Qian, and J. Dittmann, "Extending a context model for microphone forensics," in Proc. Conference on Media Watermarking, Security, and Forensics, Burlingame, CA, USA, 2012.
[2] R. Maher, "Audio forensic examination," IEEE Signal Process. Mag., vol. 26, no. 2, pp. 84-94, Mar. 2009.
[3] S. Gupta, S. Cho, and C.-C. Kuo, "Current developments and future trends in audio authentication," IEEE Multimedia, vol. 19, no. 1, pp. 50-59, Jan. 2012.
[4] A. J. Cooper, "Further considerations for the analysis of ENF data for forensic audio and video applications," International Journal of Speech, Language and the Law, vol. 18, no. 1, pp. 99-120, 2011.
[5] O. Ojowu, J. Karlsson, J. Li, and Y. Liu, "ENF extraction from digital recordings using adaptive techniques and frequency tracking," IEEE Trans. Inf. Forensics Security, vol. 7, no. 4, pp. 1330-1338, Aug. 2012.
[6] A. Rabaoui, M. Davy, S. Rossignol, and N. Ellouze, "Using one-class SVMs and wavelets for audio surveillance," IEEE Trans. Inf. Forensics Security, vol. 3, no. 4, pp. 763-775, Dec. 2008.
[7] G. Muhammad and K. Alghathbar, "Environment recognition for digital audio forensics using MPEG-7 and mel cepstral features," Journal of Electrical Engineering, vol. 62, no. 4, pp. 199-205, Aug. 2011.
[8] C. Kraetzer, A. Oermann, J. Dittmann, and A. Lang, "Digital audio forensics: a first practical evaluation on microphone and environment classification," in Proc. Workshop on Multimedia & Security, Dallas, TX, USA, 2007, pp. 63-74.
[9] M. Kharrazi, H. Sencar, and N. Memon, "Blind source camera identification," in Proc. International Conference on Image Processing (ICIP '04), vol. 1, Oct. 2004, pp. 709-712.
[10] O. Celiktutan, B. Sankur, and I. Avcibas, "Blind identification of source cell-phone model," IEEE Trans. Inf. Forensics Security, vol. 3, no. 3, pp. 553-566, Sep. 2008.
[11] A. Swaminathan, M. Wu, and K. Liu, "Nonintrusive component forensics of visual sensors using output images," IEEE Trans. Inf. Forensics Security, vol. 2, no. 1, pp. 91-106, Mar. 2007.
[12] R. Buchholz, C. Kraetzer, and J. Dittmann, "Microphone classification using Fourier coefficients," in Information Hiding, LNCS, vol. 5806, pp. 235-246, 2009.
[13] D. Garcia-Romero and C. Y. Espy-Wilson, "Automatic acquisition device identification from speech recordings," in Proc. 2010 IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Dallas, TX, USA, 2010, pp. 1806-1809.
[14] C. Hanilçi, F. Ertaş, T. Ertaş, and Ö. Eskidere, "Recognition of brand and models of cell-phones from recorded speech signals," IEEE Trans. Inf. Forensics Security, vol. 7, no. 2, pp. 625-634, Apr. 2012.
[15] F. Bimbot et al., "A tutorial on text-independent speaker verification," EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 4, 2004.
[16] W. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2002, pp. 161-164.
[17] W. M. Campbell and K. T. Assaleh, "Speaker recognition with polynomial classifiers," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 205-212, May 2002.
[18] Y. Panagakis and C. Kotropoulos, "Telephone handset identification by feature selection and sparse representations," in Proc. 2012 IEEE Int. Workshop on Information Forensics and Security, Tenerife, Spain, 2012, pp. 73-78.
[19] H. Misra, S. Ikbal, H. Bourlard, and H. Hermansky, "Spectral entropy based feature for robust ASR," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), vol. 1, 2004, pp. I-193-196.
[20] H. Yeganeh, S. Ahadi, S. Mirrezaie, and A. Ziaei, "Weighting of mel sub-bands based on SNR/entropy for robust ASR," in Proc. IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2008), 2008, pp. 292-296.
[21] Y. H. Lee and H. K. Kim, "Entropy coding of compressed feature parameters for distributed speech recognition," Speech Communication, vol. 52, no. 5, pp. 405-412, 2010.
[22] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques. Burlington, MA, USA: Elsevier, 2011.
[23] MP3 Skype Recorder v.3.1. [Online]. Available: http://voipcallrecording.com
[24] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in Proc. IEEE ICASSP, vol. 4, 1979, pp. 208-211.
[25] T. Segaran, Programming Collective Intelligence: Building Smart Web 2.0 Applications. O'Reilly Media, Inc., 2007.