Audio Compression Technology for Voice Transmission

1 SUBRATA SAHA, 2 VIKRAM REDDY
1 Department of Electrical and Computer Engineering
2 Department of Computer Science
University of Manitoba, Winnipeg, Manitoba, CANADA

Abstract:- Digitized voice is transmitted in many kinds of communication. To transmit voice, the analog voice message is first sampled and converted into a digital signal; the signal is then encoded and finally transmitted. To minimize traffic over the network, the voice message is compressed before transmission. Compression and decompression should not take much time. Moreover, in cellular technology compression and decompression must be implemented at the hardware level; if they require complex hardware, they may not be effective. In this paper a very simple, linear, effective and easy-to-implement compression and decompression technique is proposed. The technique tracks changes in the digitized voice: treating the digitized signal as a graph of amplitude vs. time, it records the changes in direction of the wave. The proposed technique is not a loss-less compression scheme, but the noise it introduces is very small and within an acceptable range.

Keywords:- Signal wave, Sharp edge, PCX-compression.

1 Introduction
Voice transfer plays a major role in today's communication. Voice, in the form of digital data, is transmitted from one node to another over a network. Voice transfer is needed in many kinds of communication, such as Internet telephony using voice over IP, cellular telephony, popular messengers such as Yahoo Messenger and MSN Messenger, online conferencing, online radio and many other technologies. In any voice transmission, the analog voice message is first sampled and thus converted from an analog to a digital signal. The digitized signal is then encoded and finally transmitted. The quality of service depends on the data transmission rate during the ongoing service.
A large amount of traffic degrades the quality of service. To minimize traffic, the digitized voice message is compressed, and the compressed message is then transmitted. At the receiving end the compressed signal is received and decompressed: the sender performs compression and the receiver performs decompression. The amount of traffic on the network is inversely related to the amount of compression achieved. A strong compression scheme is obviously preferable because it minimizes traffic and thus lets the signal be transmitted quickly. But algorithms that achieve high compression take a long time to compress and decompress, and that time introduces delay into an ongoing voice transmission. In a cellular phone, extra hardware is added for compression and decompression; this hardware should be very simple and easy to implement. If the algorithm is too complex, the required hardware may also be complex, so a very simple algorithm is needed.

An audio signal can be segmented in different ways, and the signal can be encoded further depending on the segments. Segmentation using Bayesian change-point detection [5] can be applied to detect sudden changes in a signal. Our method also detects changes in the signal, but at the magnitude level rather than the frequency level.

1.1 Problem Definition
We consider the problem of encoding the signal after sampling. In existing techniques the voice message is sampled at each small time interval and the sampled data are encoded. We introduce a new encoding method. Our algorithm compresses the signal to a significant degree, the complexity of compression and decompression is very low, and the method is very straightforward and thus very easy to implement.

1.2 Paper Organization
The remainder of the paper is organized as follows. Section 2 discusses some related compression techniques. Section 3 introduces our technique. Section 4 analyzes the performance of our technique. Section 5 presents our conclusions and some future work on this method.

2 Some Related Works
Audio signal encoding has been a challenge for many years. A large number of methods can be found for signal segmentation. Segmentation is mainly based on change-point detection using suitable signal parameters. Many reliable methods are based on maximum-likelihood and Bayesian approaches [2][3]. Bayesian detectors are very effective because they remove nuisance parameters from the analysis by a marginalization process.

RLE, or run-length encoding [1][6], is a very simple form of data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and a count, rather than as the original run. It is most useful on data that contains many such runs, for example simple graphic images such as icons and line drawings. Data with long sequential runs of bytes (such as lower-quality sound samples) can be RLE-compressed after delta encoding [7] is applied to it. Delta encoding stores data as differences (deltas) between sequential data rather than the data themselves; it is sometimes called delta compression because some instances of the encoding can make the encoded data shorter than the non-encoded data. Delta modulation [4] is used for transmission.
It is an analog-to-digital signal conversion in which (a) the analog signal is approximated by a series of segments; (b) each segment of the approximated signal is compared with the original analog wave to determine the increase or decrease in relative amplitude; (c) the decision process for establishing the state of successive bits is determined by this comparison; and (d) only the change information is sent, i.e., only an increase or decrease of the signal amplitude from the previous sample is transmitted, while a no-change condition leaves the modulated signal at the same 0 or 1 state as the previous sample.

PCX [8] compression is one form of run-length encoding, used as a format for saving pictures; bitmap pictures stored in PCX format take much less space. All the techniques mentioned above are loss-less compressions; that is, decompressing the encoded signal recovers the original data. Our proposed method, by contrast, is a lossy compression scheme.

3 Encoding Voice Message
When a voice signal is sampled and digitized, plotting the signal produces a graph of amplitude vs. time. Figure 1 shows such a graph: the representation of a simple voice signal of duration 0.058 second recorded at 11025 Hz. Analyzing the signal carefully, we see that its amplitude varies over time: sometimes it increases, sometimes it decreases and sometimes it remains the same. We can therefore define three kinds of runs for a signal: a) gradually increasing, b) gradually decreasing and c) staying the same. Figure 2, an enlarged partial view of figure 1, shows the three runs: from point a to point b the signal is on an increasing run, from b to c it is on a same run, and from c to d it is on a decreasing run. Our method encodes the signal using this concept.

3.1 Introducing Our Method
Our method detects the above three runs and extracts only the end points of each run.
Thus the encoded message is the combination of the end points of the separate runs in the original signal. When the encoded signal is decoded, we get a straight line for each run of the original signal. For example, if we encode the signal shown in figure 2, the portion a-b-c-d of the signal will be replaced by three straight lines (one from a to b, one from b to c and one from c to d). Figure 1 and figure 2 are drawn for a signal recorded at 11025 Hz. The portion from a to d consists of 61 samples (figure 1 has been drawn with 640 samples), so 61 bytes are needed to store the a-b-c-d portion. For this portion our method saves only the following items, in sequence: the amplitude of point a, the number of samples between a and b, the amplitude of point b, the number of samples between b and c, the amplitude of point c, the number of samples between c and d, and the amplitude of point d. Our method thus saves only 7 items and takes only 7 bytes to store this portion of the signal.

In this method the smoothness of the original signal is ignored, but because the voice is recorded at a high frequency, the deviation is very small, and the decoded signal is only lightly distorted. Figure 3 shows the amount of distortion: figure 3a shows the a-b-c-d portion of the original signal, figure 3b shows that portion as encoded by our method, and figure 3c superimposes the signal produced by our method on the original signal. The grayed portion expresses the amount of distortion.

In the encoded signal we store only the end points of the three types of runs, so during decoding we must reconstruct the signal from those end points alone. For example, in figure 3 there are n-1 samples between point a and point b, i.e. point b is the nth sample from point a (in the data we collected and plotted there are 22 samples between point a and point b). In the encoded stream only the magnitude of a, the magnitude of b and n are stored; we need to calculate all n-1 intermediate points during decoding and thus reconstruct the signal.
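A minimal sketch of this per-run reconstruction, assuming integer samples; the function name and integer rounding are ours, for illustration only:

```python
def reconstruct_run(ma, mb, n):
    """Linearly interpolate one run of the decoded signal.

    ma, mb : magnitudes of the run's end points a and b
    n      : point b is the n-th sample after point a
    Returns the n samples after a, up to and including b.
    """
    return [ma + (mb - ma) * i // n for i in range(1, n + 1)]

# A run rising from 10 to 50 over 4 steps:
samples = reconstruct_run(10, 50, 4)   # [20, 30, 40, 50]
```

Integer division is used here because each sample is an 8-bit value; a hardware decoder can produce the same sequence with an accumulator instead of a multiply.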
Since every individual run of the original signal will be replaced by a straight line, the magnitude of the ith point (0 < i < n) between a and b will be ma + (mb - ma) * i / n, where ma and mb are the magnitudes of points a and b respectively.

3.2 The Algorithm
Here we present the complete encoding and decoding techniques: first the algorithm for encoding the original signal, then the algorithm for decoding the encoded signal. We consider each sample an 8-bit value.

3.2.1 Algorithm Encode
Here GetNextSample( ) is a function that samples the voice message and returns the sampled value.

Input: The original signal stream, i.e. the sampled voice message.
Output: Encoded signal.

Procedure Encode ( )
    define SAME = 0
    define INCREASING = 1
    define DECREASING = 2
    variables:
        v1, v2 : BYTE
        status : BYTE
        encoded_stream : Array of BYTE
        i, n : integer

    i = 0
    v1 = GetNextSample( )
    encoded_stream[i] = v1            // store the first sample
    v2 = GetNextSample( )
    // initialize the first run
    if ( v2 > v1 )
        status = INCREASING
    else if ( v2 < v1 )
        status = DECREASING
    else
        status = SAME
    // initialization complete
    n = 1
    while ( message not end )
    {
        v1 = v2                       // advance to the next pair of samples
        v2 = GetNextSample( )
        if (( status = INCREASING and v2 > v1 ) or
            ( status = DECREASING and v2 < v1 ) or
            ( status = SAME and v2 = v1 ))
        {
            // still on the same run
            n = n + 1
        }
        else
        {
            // the run ends: save it and start the next run
            i = i + 1
            encoded_stream[i] = n     // store the number of samples on the run
            i = i + 1
            encoded_stream[i] = v1    // store the last sample of the run
            // initialize the next run
            if ( v2 > v1 )
                status = INCREASING
            else if ( v2 < v1 )
                status = DECREASING
            else
                status = SAME
            // initialization complete
            n = 1
        } //end if
    } //end while
    // flush the final run
    i = i + 1
    encoded_stream[i] = n
    i = i + 1
    encoded_stream[i] = v2
    return encoded_stream
}//end Procedure

The encoded signal (encoded_stream) is much smaller than the original signal. This algorithm can be run while the original signal is being sampled.

3.2.2 Algorithm Decode
Input: Encoded signal.
Output: Decoded signal.

Procedure Decode ( )
    variables:
        v1, v2 : BYTE
        encoded_stream, decoded_stream : Array of BYTE
        p, i, j, n : integer

    p = 0
    j = 1
    decoded_stream[p] = encoded_stream[0]
    v1 = encoded_stream[0]
    while ( encoded_stream not end )
    {
        n = encoded_stream[j]         // read the number of samples in the run
        j = j + 1
        v2 = encoded_stream[j]        // read the end point (last sample) of the run
        j = j + 1
        for i = 1 to n do             // reconstruct this run
        {
            p = p + 1
            decoded_stream[p] = v1 + (v2 - v1) * i / n
        } //end for
        v1 = v2                       // the run's end point starts the next run
    } //end while
    return decoded_stream
}//end Procedure

4 Performance Analysis
The method can be implemented during sampling, so no extra time is required for encoding; likewise, at the receiver end the signal can be decoded as soon as it is received. The complexity of our algorithm is only O(n). Both the encoder and the decoder circuits can be implemented in hardware using only a comparator, a counter and some other basic gates. The system is also parallelizable: encoding and decoding can run in parallel. Since voice is sampled at a high frequency, the distortion introduced by our technique is very low. The compression achieved by our method is higher for lower sampling rates; at a higher sampling rate less compression is achieved, but the distortion is also lower. We have analyzed the performance on several recorded voices.
The voices were recorded at 11025 Hz and 22050 Hz. On average, for the voice signals recorded at 11025 Hz our method compresses the signal by 70.4%, i.e. the size of the encoded signal is 29.6% of the original. For the voice signals recorded at 22050 Hz, our method achieves 65.2% compression on average. Figure 4 shows how we calculated the amount of distortion: we computed the rms (root mean square) value of the distortion. Suppose we have analyzed a signal of n samples. Let F_1, F_2, F_3, ..., F_n be the sampled values, i.e. the series F_i (1 <= i <= n) is the original signal, and let the series f_i (1 <= i <= n) be the decoded signal. Certainly not all f_i are equal to F_i; the amount of distortion at the ith sample is F_i - f_i. The rms value of the total distortion is sqrt(average((F_i - f_i)^2)) over i = 1 to n. Let x be the number of sampling levels; since we sample into bytes (i.e. 2^8 = 256-level sampling), x = 256 in our analysis. So the percentage of distortion is (sqrt(average((F_i - f_i)^2)) * 100) / x. The distortion produced by our process for the signals recorded at 11025 Hz is 1.5%; for the signals recorded at 22050 Hz it is 1.1%.
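As a software sketch of the whole scheme, the run-endpoint encoding, the straight-line decoding and the rms distortion measure can be written in Python as follows. The function names, the integer rounding and the explicit flush of the final run are our illustrative choices, not part of the paper's pseudocode:

```python
def encode(samples):
    """Store the first sample, then for each monotone (increasing /
    decreasing / flat) run its length and its end-point amplitude."""
    out = [samples[0]]
    v1 = samples[0]
    status = None   # +1 increasing, -1 decreasing, 0 same
    n = 0
    for v2 in samples[1:]:
        s = (v2 > v1) - (v2 < v1)
        if status is None or s == status:
            status = s
            n += 1
        else:                       # run ended at the previous sample
            out += [n, v1]
            status = s
            n = 1
        v1 = v2
    out += [n, v1]                  # flush the final run
    return out

def decode(stream):
    """Rebuild the signal: each run becomes a straight line between end points."""
    decoded = [stream[0]]
    v1 = stream[0]
    for j in range(1, len(stream), 2):
        n, v2 = stream[j], stream[j + 1]
        decoded += [v1 + (v2 - v1) * i // n for i in range(1, n + 1)]
        v1 = v2
    return decoded

def distortion_percent(original, decoded, levels=256):
    """rms distortion expressed as a percentage of the sampling range."""
    rms = (sum((F - f) ** 2 for F, f in zip(original, decoded))
           / len(original)) ** 0.5
    return rms * 100 / levels

signal = [10, 12, 14, 14, 14, 9, 4, 6, 8]
stream = encode(signal)             # [10, 2, 14, 2, 14, 2, 4, 2, 8]
assert decode(stream) == signal     # exact here: every run is already linear
```

On this toy signal the 9 samples encode into 9 bytes only because the runs are short; on real voice, where runs span many samples, the stream shrinks accordingly, and decode introduces the small interpolation error measured above.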
5 Conclusion
There are many techniques for encoding a voice signal; we have presented a completely different method. This encoding method cannot keep the original signal intact: the signal is slightly distorted, i.e. this is a lossy compression scheme. When we encode the original signal, the encoded signal is much smaller than the original, i.e. the compression is very high, and when we decode the encoded signal only a very small distortion, within an acceptable level, takes place. Lossy compression can be applied to voice transmission depending on the situation; this method will be helpful where the target is to send only the voice message, since the very small noise it introduces does not affect the tone of the voice. As the distortion level is very low and the overall performance is good, the scheme is acceptable. The encoding and decoding processes described in this paper are very straightforward, so the technique is very easy to implement at both the software and the hardware level. In future this method can be improved by smoothing the sharp edges, making the decoded signal closer to the original.

Acknowledgements
We would like to thank Manju Reddy for sending us the QAI Technical Report [4], Rajsekaran and Venugopal for assisting us in implementing several variations of our technique on recorded voice messages, and Apurba Krishna Deb for his insightful comments and suggestions.

References:
[1] DPS (1990), Digital Paper Solutions, Inc., Westmont.
[2] F. Gustafsson, Adaptive Filtering and Change Detection, J. Wiley, New York, 2000.
[3] J. J. K. Ó Ruanaidh and W. J. Fitzgerald, Numerical Bayesian Methods Applied to Signal Processing, Springer-Verlag, New York, 1996.
[4] QAI Technical Report (1992), Quality America Inc.
[5] R. Cmejla and P. Sovka, Audio Signal Segmentation using Recursive Bayesian Change-Point Detectors, 3rd WSEAS International Conference on Signal Processing, Robotics and Automation, Salzburg, Austria, 2004.
[6] Wikipedia Technical Journal (1996).
[7] Wikipedia Technical Journal (1998).
[8] ZSoft (1988), PCX Technical Reference Manual, ZSoft Corporation.