Simon S. Du*, Yichong Xu*, Yuan Li, Hongyang Zhang, Aarti Singh, Pulkit Grover. Carnegie Mellon University, Pittsburgh, PA 15213, USA. *: Contributed equally. SSDU@CS.CMU.EDU YICHONGX@CS.CMU.EDU LIYUANCHRISTY@GMAIL.COM HONGYANZ@CS.CMU.EDU AARTI@CS.CMU.EDU PGROVER@ANDREW.CMU.EDU

Abstract

Quantized data is the norm in many energy-constrained problems, a concrete example being brain signals recorded by distributed sensors placed around the head in Brain-Computer Interface (BCI) applications. However, machine learning algorithms typically ignore the quantized nature of such data. In this paper, we undertake a principled study of efficient quantization methods for linear classification. We propose and analyze a customized quantization scheme for the diagonal linear discriminant analysis classifier, covering both the learning and prediction steps. Experiments on synthetic and real datasets show the effectiveness of our proposed strategies.

1. Introduction

In this work, we investigate the problem of performing centralized prediction using quantized data obtained from distributed sensors. As an example, in Brain-Computer Interface (BCI) applications, hundreds or even thousands of electrodes placed around or inside the head are used to sense brain signals (Lebedev & Nicolelis, 2006). These quantized signals are then used for a specific prediction task such as classification. For example, a neuroprosthetic goal might involve predicting whether an individual is trying to move his hand to the left or to the right purely from the quantized brain data, in order to decode the patient's desired movement. Other applications include wireless sensor networks for the Internet of Things (Zhou et al., 2013) and the electric power grid (Nabaee & Labeau, 2012). In these settings, sensors need to communicate data at high rates and consequently consume large amounts of energy (Won et al., 2014). To avoid large energy consumption, data is quantized for both training and prediction.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).
One key observation for the prediction task is that not all features (readings of different sensors) have the same relevance to the prediction goal. Thus, if we compress each feature in accordance with its relevance, we can reduce communication cost and keep prediction error low simultaneously. Formally, given communication constraints (or, equivalently, energy constraints), our aim is to devise a data quantization technique that supports our prespecified task. Traditional information-theoretic quantization techniques, such as those proposed by (Berger, 1979; Slepian & Wolf, 1973; Cover, 1975; Wyner & Ziv, 1976), are difficult to apply to these problems because they require either moving unquantized data to a central node prior to compression, which is not applicable in the aforementioned settings, or storing and estimating parameters at each sensor, which needs complex hardware that itself consumes substantial energy. Recently, Mahzoon et al. (Mahzoon et al., 2014) proposed rate allocation and deterministic quantization strategies for quantizing signals from m sensors and then directly used these quantized data for linear regression and linear classification, but their method needs an already trained model.

In this work, we propose a two-stage active quantization strategy for training the Diagonal Linear Discriminant Analysis (DLDA) classifier. We first use initial codes based on our prior knowledge about the underlying distribution. Then, after the first round of sampling, we update our codes based on these data and sample again. Our final estimate of the parameters of the DLDA classifier is based on the second-round quantized data. Once the classifier is trained, we use a randomized dithering-noise based quantization for the testing data on which prediction is desired. Finally, Theorem 2 in Sec. 3 reveals how the number of training samples and the total number of bits used for quantization affect the prediction accuracy. To the best of our knowledge, our proposed strategy is the first one that quantizes features in both the learning and prediction steps with provable bounds. Experiments on simulated and real data demonstrate the effectiveness of our method.

1.1. Related Works

The study of quantization starts from traditional information theory, where one needs to estimate the joint distribution across all the sensors (Wyner & Ziv, 1976; Cover, 1975; Slepian & Wolf, 1973; Berger, 1979), which is hard to realize (Mahzoon et al., 2014). Recently, Zhu et al. (Zhu & Lafferty, 2014) focused on quantized estimation of Gaussian sequence models in Euclidean balls. However, there are significant differences between our work and existing ones: we search for the optimal bit allocation for the quantized data, rather than quantizing the predictors, and we conduct a solid theoretical analysis of the behaviour of the quantized data as the input to the linear predictor.

2. Notation and Problem Statement

In the distributed sensor network setting, suppose we have m sensors and a sum rate of R bits that needs to be allocated across the different sensors for quantization. We use bold X to represent a sample, and X_i is the feature from the i-th sensor. If there are n samples, we denote them by {X(j)}_{j=1}^{n}. R_i is the number of bits we assign to the i-th sensor; thus, Σ_{i=1}^{m} R_i ≤ R. For each feature X_i, its quantized representation using R_i bits is denoted by X̃_i. More precisely, the i-th sensor uses an encoder function E_i : R → {0, ..., 2^{R_i} − 1} and sends E_i(X_i) to the fusion center. We assume that the communication channel is noiseless. Also, despite the simplicity afforded by asymptotic vector quantization analysis, we use scalar rather than vector quantization strategies, because we aim to design techniques that are applicable to sensors with very small memory. The fusion center uses a corresponding decoding function D_i : {0, 1, ..., 2^{R_i} − 1} → R. Since both training and prediction are done at the fusion center, we can only use quantized data for both tasks. Our goal is to minimize prediction error.
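As a concrete illustration of this setup, a single sensor's encoder/decoder pair can be sketched as follows (a minimal sketch: the uniform codebook, the function names, and all numeric values are our illustrative choices, not the paper's exact codes):

```python
import numpy as np

def make_codec(points):
    """Scalar quantizer for one sensor: E maps a reading to an index in
    {0, ..., 2^R - 1}; D maps an index back to a reconstruction value."""
    points = np.asarray(points, dtype=float)  # codebook with 2^R entries

    def encode(x):  # E_i : R -> {0, ..., 2^R - 1}
        return int(np.argmin(np.abs(points - x)))

    def decode(k):  # D_i : {0, ..., 2^R - 1} -> R
        return float(points[k])

    return encode, decode

# R_i = 2 bits: a 4-point uniform codebook on [-3, 3]
encode, decode = make_codec(np.linspace(-3.0, 3.0, 4))
print(decode(encode(0.8)))  # the reading 0.8 is sent as an index and decoded to 1.0
```

The sensor transmits only the index (here 2 bits per reading); the fusion center, which knows the codebook, reconstructs the value.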
In this paper, we consider the problem of training a linear classifier from quantized samples and then performing prediction based on quantized features. We focus on a simple but widely used linear classifier, Diagonal Linear Discriminant Analysis (DLDA). DLDA is a classical classification method for continuous-valued features and has been widely used in various domains (Venables & Ripley, 2013). We assume each sample X belongs to one of two classes with equal probability, i.e., Pr[class(X) = 1] = Pr[class(X) = 2] = 1/2. DLDA makes the assumption that, given the class class(X) = c, each feature is distributed independently according to a Gaussian distribution: X_i ~ N(µ_{c,i}, σ_{c,i}^2), with σ_{1,i} = σ_{2,i} = σ_i for i = 1, ..., m. Without loss of generality, we also assume µ_{1,i} = −µ_{2,i} = µ_i. Under these assumptions, DLDA is a linear classifier with w_i = µ_i/σ_i^2 for i = 1, ..., m. Under quantization constraints, we can only use the quantized observations for both training and prediction. Here we want to design a training algorithm (together with quantization strategies) that minimizes the classification error Pr[Ĉ(X̃) ≠ class(X)], where Ĉ(·) denotes the linear classifier trained from quantized samples. While the problem of finding the optimal strategies that minimize classification error is hard, we instead relax the problem to estimating the decision variable w^T X, and use it to obtain upper bounds on the classification error.

3. Active Learning for Quantized DLDA

3.1. Quantized Training for DLDA

For DLDA, in the training phase, we need to estimate {µ_i}_{i=1}^{m} and {σ_i}_{i=1}^{m} using quantized features and labeled data. Notice that since the two classes have symmetric means around 0 and the same variance, whenever we have a sample with label 1, we can negate it and obtain a sample from class 2. Thus, equivalently, in the training phase we are just estimating the parameters of a Gaussian distribution. Our technique has two rounds. In the first round, we use our prior knowledge about the underlying parameters to construct initial quantizers. Then we use these quantized observations to obtain a rough estimate of the underlying distributions.
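To make this first round concrete, here is a minimal sketch (the grid construction, the initial guesses, and all numeric values are our illustrative choices; the paper's exact first-round codes are given in Eqns. (1)-(2) below):

```python
import numpy as np

def quantize_uniform(x, half_width, R):
    """Map x to the nearest of 2^R uniform points on [-half_width, half_width]."""
    d = 2.0 * half_width / (2 ** R - 1)          # quantization unit
    k = np.clip(np.round((x + half_width) / d), 0, 2 ** R - 1)
    return -half_width + k * d

rng = np.random.default_rng(0)
mu_init, sigma_init, R_init, n1 = 2.0, 2.0, 4, 1000

# true feature distribution is N(1, 1); the initial guesses above are off by 2x
x = rng.normal(1.0, 1.0, size=n1)
xq = quantize_uniform(x, mu_init + 2.0 * sigma_init, R_init)

# rough first-round estimates, to be used to build the second-round codes
mu_r1 = xq.mean()
sig2_r1 = ((xq - mu_r1) ** 2).mean()
```

Even though the first-round grid is built from wrong guesses, the plug-in estimates are close enough to place the second-round grid where the data actually lives.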
Based on the estimated parameters from the first round, we construct new codes to quantize data in the next round. Finally, we use the quantized samples from the second round to learn the parameters of the underlying distribution and the weight vector for DLDA classification. Formally, we assign R_i^init bits to the i-th sensor and use the following code in the first round:

E_i^init(X_i) = argmin_{k=0,...,2^{R_i^init}−1} | −µ_i^init − c_i^init σ_i^init + k d_i^init − X_i |,   (1)

where µ_i^init and σ_i^init are our initial guesses of the mean and variance, c_i^init = 2 max( log(σ_i^init/µ_i^init) R_i^init, 1 ) controls the range of the quantization region, and d_i^init = 2(µ_i^init + c_i^init σ_i^init)/(2^{R_i^init} − 1) is the quantization unit. The corresponding decoder is, for k = 0, ..., 2^{R_i^init} − 1,

D_i^init(k) = −µ_i^init − c_i^init σ_i^init + k d_i^init.   (2)

Let n_1 be the number of samples in the first round. We estimate the mean and variance by

µ̃_i = (1/n_1) Σ_{j=1}^{n_1} X̄_i(j),   σ̃_i^2 = (1/n_1) Σ_{j=1}^{n_1} (X̄_i(j) − µ̃_i)^2,
where X̄_i(j) = D_i^init(E_i^init(X_i(j))) is the quantized representation of X_i(j).

In the second round, we assign R̃_i bits to the i-th sensor and sample another set of data points, using a uniform quantization scheme informed by the first-round estimates of the mean and variance:

Ẽ_i(X_i) = argmin_{k=0,...,2^{R̃_i}−1} | −µ̃_i − c̃_i σ̃_i + k d̃_i − X_i |,   (3)

D̃_i(k) = −µ̃_i − c̃_i σ̃_i + k d̃_i,   (4)

where c̃_i = 2 log(m/ε) max( log(σ̃_i/µ̃_i), 1 ) and d̃_i = 2(µ̃_i + c̃_i σ̃_i)/(2^{R̃_i} − 1). Let n_2 be the number of observations from the second round; we use them to estimate the mean, the variance, and the weight vector for DLDA:

µ̂_i = (1/n_2) Σ_{j=1}^{n_2} X̃_i(j),   σ̂_i^2 = (1/n_2) Σ_{j=1}^{n_2} (X̃_i(j) − µ̂_i)^2,   ŵ_i = µ̂_i / σ̂_i^2,

where X̃_i(j) = D̃_i(Ẽ_i(X_i(j))) is the quantized representation of X_i(j) in the second round.

Figure 1. Classification accuracy of the proposed quantization scheme on synthesized data (curves: Active10x, Active30x, Active50x, Optimal). "Optimal" is the optimal Bayes classification rule applied to unquantized samples.

Figure 2. An illustration of the dithering-based quantization strategy. We use R_i = 2 bits for quantizing X_i, so d_i = 2b_i/(2^{R_i} − 1) = 2b_i/3. In this scenario, feature X_i is quantized to (1/3)b_i because, after adding the dithering noise, the nearest quantization point is (1/3)b_i.

3.2. Quantized Prediction for DLDA

In the previous section, we obtained good estimates of the underlying distributions of the features ({µ̂_i}_{i=1}^{m} and {σ̂_i}_{i=1}^{m}) and of the weight vector ŵ for DLDA. In this section, we discuss how to use these estimates for prediction. First, we assign bits to each sensor according to (9). As the first step of our quantization, we pick b_i for each sensor such that |X_i| ≤ b_i holds with high probability. Then, for the i-th sensor, we place 2^{R_i} quantization points uniformly in the region [−b_i, b_i], i.e., the quantization points are {−b_i + k d_i | k = 0, ..., 2^{R_i} − 1}, where d_i = 2b_i/(2^{R_i} − 1) is the unit quantization region. For the feature from the i-th sensor, X_i, we first add dithering noise γ_i uniformly distributed within [−d_i/2, d_i/2], and then map this value to the nearest quantization point.
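In code, this dithered quantizer might look as follows (a sketch; the values of b_i, R_i, and the feature are our illustrative choices):

```python
import numpy as np

def dithered_quantize(x, b, R, rng):
    """Add uniform dither on [-d/2, d/2], then map to the nearest of the
    2^R uniform points {-b + k*d} on [-b, b], where d = 2b / (2^R - 1)."""
    d = 2.0 * b / (2 ** R - 1)
    gamma = rng.uniform(-d / 2, d / 2, size=np.shape(x))
    k = np.clip(np.round((np.asarray(x) + gamma + b) / d), 0, 2 ** R - 1)
    return -b + k * d                             # decoded value D(E(x))

# for an in-range reading, uniform dither makes the reconstruction unbiased
rng = np.random.default_rng(0)
xq = dithered_quantize(np.full(100_000, 0.1), b=1.0, R=2, rng=rng)
print(abs(xq.mean() - 0.1) < 0.01)  # True: E[D(E(x))] = x despite a coarse grid
```

Each individual reconstruction is still coarse (here it is always ±b/3), but the dither randomizes the rounding so that the error averages out and is decorrelated across sensors.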
Formally, our encoding and decoding functions are

E_i(x) = argmin_{k ∈ {0,...,2^{R_i}−1}} | −b_i + k d_i − (x + γ_i) |,   (5)

D_i(k) = −b_i + k d_i.   (6)

Fig. 2 provides an example of such a quantization strategy. We now show that, by adding dithering noise, the correlation between the quantization errors of different sensors is removed (consistent with (Schuchman, 1964)). Specifically, we derive the following result:

Theorem 1. Suppose |X_i| ≤ b_i for i = 1, ..., m. With the dithering-noise quantization strategy, we have

E[ (w^T X̃ − w^T X)^2 ] ≤ 4 Σ_{i=1}^{m} w_i^2 b_i^2 2^{−2R_i}.   (7)

Now we can optimize the bit assignment for the test data to minimize Eqn. (7):

min_{R_i, i=1,...,m}  Σ_{i=1}^{m} w_i^2 b_i^2 2^{−2R_i}   (8)
s.t.  Σ_{i=1}^{m} R_i = R,  R_i ≥ 1 for i = 1, ..., m.

Routine algebra shows that the optimal bit assignment for the i-th sensor is

R_i = max{ [ (1/2) log( 8 ln 2 · w_i^2 b_i^2 / λ ) ]_+ , 1 },   (9)
where [x]_+ = max{x, 0} and λ is selected such that Σ_{i=1}^{m} R_i = R. The rate of each sensor is then rounded to the nearest integer to ensure feasibility of the quantization. The next theorem reveals how the number of training samples and the number of bits used for quantization affect the prediction accuracy:

Theorem 2. Assume that for i = 1, ..., m, µ_i ≍ µ_i^init and σ_i ≍ σ_i^init; that in the first stage^1 R_i^init = Ω( log( (µ_i^init ∨ σ_i^init)/(µ_i ∧ σ_i) ) ) and^2 n_1 = Ω( log(m/δ) [ (µ_i^init)^2 (σ_i^init)^2 / µ_i^4 + (µ_i^init ∨ σ_i^init)^4 / (µ_i ∨ σ_i)^4 ] ); and that in the second stage R̃_i = Ω( log( µ_i/(ε σ_i) ) ) and n_2 = Ω( (1/ε^2) log(m/δ) ( µ_i^4/σ_i^4 + σ_i^4/µ_i^4 ) ). Then, with probability at least 1 − δ,

Pr( Ĉ(X̃) ≠ class(X) ) = opt + O(ε),

where opt denotes the classification error of the best possible classifier.

Theorem 2 shows that the prediction error comes from two sources: one from quantization, the other from statistical inference. For a given target accuracy parameter ε, the number of bits required for each sensor, R̃_i, depends logarithmically on 1/ε. Therefore, in total we need O(m log(1/ε)) bits to make the error induced by quantization be of order ε. The number of samples required depends quadratically on 1/ε up to logarithmic factors. Thus, if we have infinite bits (no quantization error), we recover the standard sample complexity of parametric models for inference and prediction (Wasserman, 2013).

4. Experiments

4.1. Simulated Data

We first test our quantization strategies on synthesized data. Data is generated according to the DLDA assumptions: for i = 1, ..., m, µ_i is set to 1, and σ_i is set to 1, 1.2, and 2, respectively, for the left, middle, and right plots of Fig. 1. We use m = 100 sensors, and the number of total bits R varies from 100 to 200. We use n_1 = 1000 samples in the first round and n_2 = 10000 samples in the second round for training, and 10000 samples for testing. The initial guesses of the parameters are set to be 10 to 50 times the true values. Fig. 1 shows that the more accurate the initial guesses are, the fewer bits are needed to achieve a given classification accuracy.
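The synthetic experiment above can be sketched end-to-end as follows (an illustrative simplification, not the paper's exact protocol: equal bits per sensor, symmetric uniform grids in both rounds, milder initial guesses, and no dithering at prediction time):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n1, n2, n_test, R_bits = 100, 1000, 10_000, 10_000, 3
mu, sigma = np.ones(m), np.full(m, 1.2)

def sample(n):
    y = rng.integers(1, 3, size=n)                    # class labels 1 / 2
    s = np.where(y == 1, 1.0, -1.0)[:, None]          # class means are +mu / -mu
    return s * mu + sigma * rng.standard_normal((n, m)), y

def quantize(X, half_width, R):
    """Per-feature uniform grid with 2^R points on [-half_width_i, half_width_i]."""
    d = 2.0 * half_width / (2 ** R - 1)
    k = np.clip(np.round((X + half_width) / d), 0, 2 ** R - 1)
    return -half_width + k * d

# round 1: grids from (wrong) initial guesses; negate class-2 samples after
# quantization (the grid is symmetric, so negation stays on the grid)
mu0, sig0 = 2.0 * mu, 2.0 * sigma
X1, y1 = sample(n1)
Z1 = np.where(y1 == 1, 1.0, -1.0)[:, None] * quantize(X1, mu0 + 2 * sig0, R_bits)
mu_r1, sig_r1 = Z1.mean(axis=0), Z1.std(axis=0)

# round 2: re-quantize fresh samples with grids from the round-1 estimates
b2 = np.abs(mu_r1) + 3 * sig_r1
X2, y2 = sample(n2)
Z2 = np.where(y2 == 1, 1.0, -1.0)[:, None] * quantize(X2, b2, R_bits)
w = Z2.mean(axis=0) / Z2.var(axis=0)                  # DLDA weights mu_i / sigma_i^2

# prediction from quantized test features
Xt, yt = sample(n_test)
pred = np.where(quantize(Xt, b2, R_bits) @ w >= 0, 1, 2)
accuracy = (pred == yt).mean()
```

At this aggregate signal-to-noise ratio (m = 100 features each with µ_i/σ_i near 1), the quantized classifier's accuracy is close to the unquantized Bayes rule even with a few bits per sensor.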
Also notice that if the signal-to-noise ratios (µ_i/σ_i) of some sensors are much larger than those of others, we need fewer bits to reach the optimal classification accuracy.

1. We omit log log(·) terms.
2. We omit log(·) dependences on µ_i and σ_i.

4.2. Real Data

In this section, we test our quantization scheme on EEG data. We use the brain signals of the first subject in the experiment of data set 1 from BCI Competition IV (Blankertz et al., 2007). In the experiment, there are in total 200 trials. Each trial corresponds to a motor imagery of either the left hand or the foot and lasts 8 s. There are in total m = 59 sensors, and signals were sampled at 100 Hz. See (Blankertz et al., 2007) for the details. For each trial, the raw EEG time series are band-pass filtered with a Butterworth IIR filter (band power), and the logarithm is applied to the normalized variance to yield a feature vector for that trial. Thus, we generate 200 instances, each with 59 features. We then randomly select 40 samples for testing and use the remaining ones for training. For training, 40 samples are used in the first round and 120 samples are used in the second round. We use 10 times the true mean and variance of the training samples as initial guesses. Fig. 3 shows the classification accuracy on the testing samples for different numbers of bits used. The unquantized classifier is trained directly on the 160 training samples without quantization and is then applied to unquantized testing samples. Notice that even with just an average of 3 bits per sensor, the full (infinite number of bits) quantization accuracy can be achieved. Another observation is that as we increase the total number of bits, the result becomes more stable.

Figure 3. Classification accuracy of the proposed quantization scheme on EEG data (curves: Unquantized, Active; total bits from 100 to 250).

5. Conclusion

In this paper, we propose and analyze an active-learning based quantization algorithm, together with a prediction algorithm, that only require quantized samples for diagonal linear discriminant analysis. Experiments on synthetic and real-world data show that with a few bits we can achieve nearly the same accuracy as with unquantized samples.
In this work, we only considered the DLDA classifier. How to efficiently assign bits among sensors and quantize features for nonlinear classifiers is an important open problem with both theoretical and practical implications.
References

Berger, T. Decentralized estimation and decision theory. In IEEE Seven Springs Workshop on Information Theory, Mt. Kisco, NY, 1979.

Blankertz, Benjamin, Dornhege, Guido, Krauledat, Matthias, Müller, Klaus-Robert, and Curio, Gabriel. The non-invasive Berlin brain computer interface: fast acquisition of effective performance in untrained subjects. NeuroImage, 37(2):539-550, 2007.

Cover, Thomas M. A proof of the data compression theorem of Slepian and Wolf for ergodic sources (corresp.). Information Theory, IEEE Transactions on, 21(2):226-228, 1975.

Gamburd, Alex, Lafferty, John, and Rockmore, Dan. Eigenvalue spacings for quantized cat maps. Journal of Physics A: Mathematical and General, 36(12):3487, 2003.

Lebedev, Mikhail A and Nicolelis, Miguel AL. Brain machine interfaces: past, present and future. TRENDS in Neurosciences, 29(9):536-546, 2006.

Mahzoon, Majid, Albalawi, Hassan, Li, Xin, and Grover, Pulkit. Using relative-relevance of data pieces for efficient communication, with an application to neural data acquisition. In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on, pp. 160-166. IEEE, 2014.

Nabaee, Mahdy and Labeau, Fabrice. Quantized network coding for sparse messages. In Statistical Signal Processing Workshop (SSP), 2012 IEEE, pp. 828-831. IEEE, 2012.

Schuchman, Leonard. Dither signals and their effect on quantization noise. Communication Technology, IEEE Transactions on, 12(4):162-165, 1964.

Slepian, David and Wolf, Jack K. Noiseless coding of correlated information sources. Information Theory, IEEE Transactions on, 19(4):471-480, 1973.

Venables, William N and Ripley, Brian D. Modern applied statistics with S-PLUS. Springer Science & Business Media, 2013.

Wasserman, Larry. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.

Won, Minho, Albalawi, Hassan, Li, Xin, and Thomas, Donald E. Low-power hardware implementation of movement decoding for brain computer interface with reduced-resolution discrete cosine transform. In Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, pp. 1626-1629. IEEE, 2014.

Zhou, Yang, Huang, Chuan, Jiang, Tao, and Cui, Shuguang. Wireless sensor networks and the internet of things: Optimal estimation with nonuniform quantization and bandwidth allocation. Sensors Journal, IEEE, 13(10):3568-3574, 2013.

Zhu, Yuancheng and Lafferty, John. Quantized estimation of Gaussian sequence models in Euclidean balls. In Advances in Neural Information Processing Systems, pp. 3662-3670, 2014.