Space-Time-Frequency Bag of Words Models for Capturing EEG Variability: A Comprehensive Study
By Kyung Min Su and Kay A. Robbins
University of Texas at San Antonio
January 2, 2015 (updated April 19, 2015)
UTSA CS Technical Report TR
Contents

1. Introduction
2. Bag-of-words models for EEG data
  2.1 Extracting descriptors
    2.1.1 Key frames
    2.1.2 Gradient descriptors
    2.1.3 Normalized frame vectors
    2.1.4 Peak-frame repeat method
  2.2 Building a dictionary
  2.3 Representing EEG data using a dictionary
    2.3.1 Matching words for a frame
    2.3.2 Histogram of words in an epoch
  2.4 Improving the quality of a descriptor
    2.4.1 Sub-intervals along time axes
    2.4.2 Frequency sub-bands
3. Evaluation of parameters
  3.1 Gradient descriptor versus normalized frame descriptor
  3.2 Evaluation of the dictionary sizes
  3.3 Evaluation of clustering methods
  3.4 Evaluation of sub-intervals
  3.5 Evaluation of sub-bands
  3.6 Cross subject transfer learning
4. Discussion
Acknowledgments
References
1. Introduction

EEG data are multi-channel time-series recorded from the scalp surface to provide spatial and temporal measurements of brain signals. Because EEG headsets have varying channel numbers, channel placements, and sampling rates, EEG data may have different dimensions depending on the type of headset used for signal acquisition. These differences make it difficult to combine datasets for large-scale machine learning or data mining applications. Many traditional EEG features, including the raw signal, are channel-specific [1] and not appropriate for processing multi-headset data of various channel configurations. Frame-based EEG features, which extract values from a field topography [2], [3], are less channel-specific. However, they usually assume that all EEG datasets are from the same headset. To represent EEG data regardless of headset configuration, we have investigated several variations of the classical bag-of-words (BOW) model, a widely used technique for extracting features from images for applications such as retrieval [4]. Images come in different sizes, shapes, and orientations, and BOW approaches are effective in mapping such data to common feature sets. Traditional BOW models use a dictionary of local features based on key points and then construct a histogram of the occurrences of these features in an image. A disadvantage of BOW features is that they lose information about the global spatial relationships of the key points in the image. However, this loss also makes the features robust to variations in scale and orientation. In this document, we describe several BOW approaches for EEG data that retain some frequency, spatial, and temporal relationships in the EEG data. The proposed descriptors are relatively insensitive to the number of channels, channel placement, sampling rate, signal range, and subject response time.
As a result, we can process EEG datasets of various configurations using a common dictionary of features. We have experimentally compared various approaches and parameters to provide an empirical basis for choosing optimal conditions. Section 2 describes the ideas behind configuration-independent EEG features based on BOW models, and Section 3 explains the implementation details and test results. Section 4 briefly discusses the implications of the results for EEG analysis.

2. Bag-of-words models for EEG data

In this section, we describe the proposed EEG descriptors based on bag-of-words (BOW) models that provide a relatively uniform representation of EEG data across different data collections. Figure 1 shows an overview of the processing steps for the proposed method. Processing consists of three parts: extracting descriptors from EEG data, building a dictionary from the extracted descriptors, and representing the data using the dictionary. The details of each part are explained in the following subsections.
Figure 1 Overview of the processing steps for the proposed methods.

2.1 Extracting descriptors

EEG data is multi-channel and includes both spatial and temporal information. Figure 2 shows an example of this data in channel-time space. Each frame contains the values recorded at the headset channel positions at a particular time and captures spatial relationships among channels. To represent spatial information in BOW, we tried two descriptors: a gradient descriptor and a normalized frame vector. The gradient descriptor method of Section 2.1.2 uses spatial gradients in the same way as the scale-invariant feature transform (SIFT) [5] features of traditional image processing applications. The second method, explained in Section 2.1.3, directly uses the frame vectors of the EEG recording.

Figure 2 EEG data in channel-time space. Each column corresponds to a time sample.

2.1.1 Key frames

Following BOW approaches described in the literature for images, we initially tried representing each frame using a traditional BOW gradient descriptor and forming histograms using key frames defined in time. The method is analogous to the traditional SIFT methods [5] for finding key points and representing the key points using a gradient descriptor. Our EEG gradient descriptor method also finds the key frames, at which overall power is a local maximum, and then calculates gradient vectors from the interpolated topographic map of each key frame to capture the relative activity of the brain. A configuration-invariant gradient descriptor is generated by appending the histograms of the gradient vectors in the four quadrants of each key frame. The following sub-sections explain how to apply the gradient descriptor key frames method to EEG data in more detail.

EEG key frames are defined as the frames that correspond to the peaks of the global field power (GFP). According to Lehmann [6], brain states can be estimated from stable patterns at the power peaks.
Even though the topographic maps between power peaks change continuously, the patterns at the peaks have (by necessity) a zero time derivative and hence retain the same pattern for a few time instants. Lehmann and others have hypothesized that these patterns reflect brain status. Regardless of whether this is the case, the power peaks form dominant patterns that are analogous to image key points.

For an L × N array, representing L channels and N frames, define the i-th frame, Y_i, as:

Y_i = [y_i1, y_i2, ..., y_iL]^T

where y_ij is the element at the i-th time frame and j-th channel. The GFP is the standard deviation of the values within each frame. The GFP of the i-th frame, GFP_i, is defined as:

GFP_i = sqrt( (1/L) Σ_{j=1}^{L} ( y_ij − (1/L) Σ_{k=1}^{L} y_ik )^2 )

Low GFP means there is little variation between the channels and very little spatial information in the frame. Therefore, we use only the peak frames of the GFP, which have a large variation and a clear spatial distinction among channels.

To visualize the effect of key frames, we simulated EEG signals using the DipoleSimulator of BESA [7]. The locations and orientations of the simulated dipoles are shown in Figure 3(a). Figure 3(b) shows the topographic maps corresponding to the first 150 frames of a simulated EEG signal for the dipole simulation shown in Figure 3(a). In this simulation, two of the four dipoles are active at each time. Figure 3(c) shows the corresponding power curve for these frames. There are 10 peak frames in the first 150 frames, and these 10 peak frames have steady topographic patterns that change according to the dipole status (see Figure 3(d)).

(a) Dipole behavior for simulation of EEG signals (b) Topographic maps of 150 frames
(c) Global field power (GFP) curve (d) Topographic maps of the frames corresponding to GFP peaks.

Figure 3 Topographic maps and GFP curves of simulated EEG data.

Figure 4 shows an example of actual EEG data that has been band-pass filtered in the alpha frequency band of 8 Hz to 12 Hz. This data is taken from the publicly available 109-subject BCI-2000 dataset [8], [9]. Figure 4(a) shows the topographic maps of 90 consecutive frames. Figure 4(b) shows the power curve corresponding to these frames, and Figure 4(c) shows the topographic maps at the power peak frames. The topographic patterns at the power peaks usually retain their spatial characteristics for several frames.

(a) Topographic maps of 90 frames (b) Global field power (GFP) curve (c) Topographic maps of peak frames

Figure 4 Topographic map and GFP curve for a BCI-2000 dataset [8], [9] (Subject 1, Task 1, alpha band-pass filtered).
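The GFP computation and key-frame selection described above can be sketched in a few lines of numpy. This is a simplified illustration, not the report's own code; the function names are ours.

```python
import numpy as np

def gfp(frames):
    """Global field power: the per-frame standard deviation across channels.
    frames: (N, L) array of N time frames by L channels."""
    return frames.std(axis=1)

def peak_frames(frames):
    """Indices of the local maxima of the GFP curve (the key frames)."""
    g = gfp(frames)
    # interior points strictly larger than both neighbors
    return np.where((g[1:-1] > g[:-2]) & (g[1:-1] > g[2:]))[0] + 1
```

With frames whose channel spread rises and falls, `peak_frames` picks out exactly the frames where the spatial variation peaks, mirroring the GFP-peak selection in the text.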
Figure 5 GFP curves from three types of headsets in eight frequency sub-bands: (a) 0~4 Hz, (b) 4~8 Hz, (c) 8~12 Hz, (d) 12~16 Hz, (e) 16~20 Hz, (f) 20~24 Hz, (g) 24~28 Hz, (h) 28 Hz and above.
One difficulty with configuration-independent descriptors based on GFP is that different headset configurations are likely to have a different amount of total power depending on coverage of the scalp and on frequency band. To compare GFP curves for different headset configurations, we generated virtual EEG data from Biosemi 64-channel EEG to simulate other headsets by choosing only those channels that closely mapped the positions in the simulated headsets [10]. Figure 5 shows the power peaks of three test headsets and their correlations. In the figure, the blue solid line represents the power curve of the original Biosemi 64-channel data, the green dashed line is the power curve of an Emotiv 14-channel simulated headset, and the red dash-dot line is the power curve of an ABM 9-channel simulated headset. The correlation between the power curves of the simulated headset data and the Biosemi data is shown in the legend area. The Biosemi headset covers the entire head area evenly, while the Emotiv headset does not have any detectors on the top of the head and the ABM has only 9 channels placed at the top of the head. Headsets with different channel configurations have highly correlated GFPs even with low numbers of detectors and biased placement.

2.1.2 Gradient descriptors

Gradient descriptors provide low-dimensional vector representations of topographic maps and are generated through a SIFT (scale-invariant feature transform)-like method. Given the spatial location of each channel, we generate an interpolated topographic map on a spatial grid for each key frame in a multi-channel EEG data set. Because we interpolate EEG data into a topographic map of common dimensions, we can process all EEG data in the same way regardless of the physical channel configuration, provided there are sufficient channels to perform an interpolation. Figure 6 shows each step of the generation of the gradient descriptor from a topographic map.
A topographic map of size (24 × 24), shown in Figure 6(a), is divided into a sample grid of size (8 × 8), as shown in Figure 6(b). Each grid cell corresponds to a (3 × 3) block of pixels from the topographic map. Figure 6(c) shows the gradient information, which is extracted from the (3 × 3) pixels of each cell. We use a histogram to represent the distribution of gradients in each quadrant area of the gradient map, as shown in Figure 6(d). To reduce the boundary effects between the histogram bins, the smoothing process of the SIFT method is also applied to each histogram. Finally, the four histograms are concatenated and normalized to unit length. This method follows the SIFT technique for image retrieval quite closely. However, traditional SIFT uses a global orientation to adjust the direction of the gradient vectors. We do not measure the global orientation of the gradient map because all gradient maps of EEG data already have the same orientation. Because we use relative gradient information on a topographic map, the features are invariant to the scale of the EEG data.

(a) Topographic map (24 × 24) (b) Sample grid (8 × 8) (c) Gradient map (8 × 8) (d) Gradient descriptor (2 × 2)

Figure 6 Gradient descriptor for a frame (EEGLAB sample dataset, 32 channels).

The gradient descriptors are the EEG counterpart of the SIFT descriptors of image processing. These descriptors form montage-independent EEG descriptors. However, the method requires a sufficient number of EEG channels distributed over the entire head surface in order to interpolate a topographic map with sufficient accuracy.
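A much-simplified sketch of the quadrant-histogram construction is shown below. It computes per-pixel gradients on an interpolated (24 × 24) map and builds one magnitude-weighted orientation histogram per quadrant; the (8 × 8) cell aggregation and the SIFT bin smoothing described above are omitted for brevity, and the function name is ours.

```python
import numpy as np

def gradient_descriptor(topo_map, n_bins=8):
    """SIFT-like sketch: orientation histograms of the spatial gradient,
    one per quadrant, concatenated and L2-normalized.
    topo_map: interpolated (24 x 24) scalp map."""
    gy, gx = np.gradient(topo_map)          # gradients along rows, columns
    mag = np.hypot(gx, gy)                  # gradient magnitude
    ang = np.arctan2(gy, gx)                # gradient orientation in [-pi, pi]
    h, w = topo_map.shape
    hists = []
    for rows in (slice(0, h // 2), slice(h // 2, h)):
        for cols in (slice(0, w // 2), slice(w // 2, w)):
            # magnitude-weighted orientation histogram for this quadrant
            hist, _ = np.histogram(ang[rows, cols], bins=n_bins,
                                   range=(-np.pi, np.pi),
                                   weights=mag[rows, cols])
            hists.append(hist)
    d = np.concatenate(hists)
    norm = np.linalg.norm(d)
    return d / norm if norm > 0 else d
```

With the default 8 orientation bins, the descriptor has length 4 × 8 = 32 and unit norm, matching the "concatenate the four quadrant histograms and normalize" recipe in the text.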
Figure 7 shows a case where the gradient descriptor method fails to generate the same (or similar) descriptors for the same brain pattern on different headsets. The ABM headset has only nine channels, and the Emotiv headset places its 14 detectors only around the headband area. To simulate EEG data of the same brain pattern as detected by these other headsets, we chose the closest 9 and 14 channels of the Biosemi data, respectively. If we compute the gradient descriptors from each headset, as shown in Figure 7, the interpolated maps and gradient descriptors are not similar to each other, even for the same subject. In order to accommodate low-density headsets, we developed another method called normalized frame vectors, which we describe in the next section.

Figure 7 Channel placements, interpolated maps, and gradient descriptors from three different types of headsets (Biosemi 64 channels, ABM 9 channels, Emotiv 14 channels) for the same brain pattern.
2.1.3 Normalized frame vectors

This section presents an alternative scale-independent spatiotemporal representation of EEG that uses normalization to remove the scale differences. After normalizing each channel to have zero mean and unit standard deviation, we normalize each frame so that it has unit length:

normalized frame_i = Y_i / ||Y_i||

where Y_i is the i-th frame and ||Y_i|| is the norm of Y_i. After normalization, each normalized frame vector has the same length and represents the relative spatial relationship between channels. These normalized frame vectors are less sensitive to noise than the respective gradients. They can be used with or without temporal key frames to form a model.

2.1.4 Peak-frame repeat method

The peak frame method has the advantage of capturing patterns that are stationary and hence more likely to contain signal rather than noise. However, sometimes a signal will have a very low density of peak frames, particularly when the signal is restricted to frequency sub-bands. Normalized frame descriptors, on the other hand, have a word for each frame, so the bin densities are much better; however, many of the normalized frames may contain a significant amount of noise. We also examined a hybrid approach in which we replace each frame by the word corresponding to the closest peak frame that occurred at a time less than or equal to the current frame. We refer to this method as the peak-frame repeat method. The appendix compares the performance of the three methods for converting frames to words.

2.2 Building a dictionary

In a bag-of-words model, a word in the dictionary represents a typical pattern of data. After extracting frame descriptors representing the relations between channels, we cluster the frame descriptors to find the typical relationships between channels. K-means clustering is the most popular clustering method, but it tends to assign more clusters to the dense areas of dominant classes.
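The normalized-frame and peak-frame-repeat preparation steps described above can be sketched in numpy. This is an illustrative sketch with our own function names; in particular, frames that precede the first peak borrow the first peak's word, a detail the text leaves open.

```python
import numpy as np

def normalized_frames(data):
    """Two-step normalization: z-score each channel over time, then scale
    each frame (row) to unit length.  data: (N, L) array, N frames by L
    channels."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.where(norms > 0, norms, 1)

def peak_frame_repeat(words, peak_idx):
    """Replace each frame's word with the word of the most recent peak
    frame at or before it.  words: one word id per frame; peak_idx:
    sorted indices of the peak frames."""
    words = np.asarray(words)
    peak_idx = np.asarray(peak_idx)
    # index of the most recent peak at or before each frame; frames before
    # the first peak are clipped to the first peak (a sketch-level choice)
    pos = np.searchsorted(peak_idx, np.arange(len(words)), side="right") - 1
    return words[peak_idx[np.clip(pos, 0, None)]]
```

For example, with per-frame words `[10, 11, 12, 13, 14]` and peaks at frames 1 and 3, the repeated sequence becomes `[11, 11, 11, 13, 13]`.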
If the data is unbalanced, k-means clustering will assign more clusters to the dominant groups and fewer clusters to the minority groups. If we build a dictionary using k-means clustering, we may therefore have less resolution among samples of the minority groups. However, rare patterns are often important in EEG data. To find the best clustering method, we tried the five clustering methods given in Table 1.

TABLE 1 CLUSTERING METHODS.

k-means (k-means++): K-means tries to reduce the sum of errors, where an error is the distance between a sample and its cluster center. K-means tends to place more cluster centers in dense areas. K-means++ is basically the same as k-means, but uses a revised seeding method that results in higher accuracies: while k-means clustering selects seeds randomly, k-means++ uses random seeding but gives more weight to samples far from the already chosen seeds [11].

k-centers: K-centers decomposes the data into segments of similar volume. After picking a seed centroid randomly, it picks the next centroid from the remaining data so that it is as far as possible from all previous centroids. The method repeats the centroid assignment until the required number of centroids is reached [12]. K-centers differs from k-means++ in that it picks the next centroid deterministically.

k-medoids: After assigning data into clusters using randomly picked data points as seeds, k-medoids randomly picks a new candidate data point for each cluster from among the cluster members and computes the within-cluster score. If the score improves, the candidate becomes the new medoid. The method repeats over all medoids until no changes occur [13].

subtractive (radius-based): Subtractive clustering segments samples into groups of similar radius by removing all samples in the vicinity of the cluster centers. Subtractive clustering is less sensitive to the density of samples [14].

affinity propagation: Affinity propagation simultaneously considers all data points as potential exemplars, and messages are exchanged between data points to find the cluster members [15].

The clustering methods in Table 1 have different characteristics. We use a low-dimensional example to illustrate their differences. Test samples were generated from five clusters in two dimensions. In Figure 8(a), the cluster on the lower right has more samples than the other clusters; all other clusters have similar densities. As shown in Figures 8(b)~(e), the k-means and k-means++ methods assign more clusters to the dense lower-right area, while the subtractive clustering and affinity propagation methods assign clusters more evenly throughout the entire space.

Figure 8 Voronoi diagrams of (a) the test samples and of the clusterings produced by (b) k-means, (c) k-means++, (d) subtractive clustering, and (e) affinity propagation.

In addition to choosing a clustering method, we also need to decide the number of clusters. These values are determined experimentally as explained in Section 3.
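To make the dictionary-building step concrete, the sketch below implements a minimal Lloyd's k-means over frame descriptors. It omits the k-means++ seeding and the replicates discussed later, and the function name is ours; in practice a library implementation would be used.

```python
import numpy as np

def build_dictionary(descriptors, n_words=100, n_iter=50, seed=0):
    """Minimal Lloyd's k-means: cluster frame descriptors into a
    dictionary of n_words exemplar spatial patterns (the report's
    default dictionary size is 100 words)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(descriptors, float)
    # seed the centers with randomly chosen descriptors
    centers = X[rng.choice(len(X), n_words, replace=False)]
    for _ in range(n_iter):
        # assign each descriptor to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # move each center to the mean of its members
        for k in range(n_words):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(0)
    return centers
```

On two well-separated blobs of descriptors, a two-word dictionary converges to the two blob means, which is exactly the "typical pattern per cluster" behavior the dictionary relies on.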
2.3 Representing EEG data using a dictionary

A dictionary of exemplar spatial patterns is formed by clustering. These patterns are either the gradient descriptors described in Section 2.1.2 or the normalized frame vectors described in Section 2.1.3. A bag-of-words model represents an epoch by a histogram of matching words, which are the best matches of the spatial pattern at each frame or at key frames. The following sections explain the selection of matching words and the experimental parameters. We assume that we have built the dictionary with a base headset, in our case the 64-channel Biosemi.

2.3.1 Matching words for a frame

To represent data, we need to find matching words for each frame by comparing corresponding channels between the dictionary and the data. Many EEG headsets follow the international 10-20 system [16] for placing electrodes, so electrode positions for small headsets are subsets of the positions for larger headsets. For example, suppose the dictionary has been built from Biosemi 64-channel data. To represent data from an ABM 9-channel headset, we compare the ABM data to the POz (30th), Fz (38th), Cz (48th), C3 (13th), C4 (50th), F3 (5th), F4 (40th), P3 (21st), and P4 (58th) channels of the base (Biosemi 64-channel) headset. According to the 10-20 system, these nine channels of the Biosemi headset exactly overlap the nine channels of the ABM headset, assuming that the experimenters have not manually measured head positions. The matching procedure is as follows. If a new headset has more channels than the headset of the dictionary, we use only the overlapping channels between them when matching words. If a new headset has channels that do not spatially overlap with the headset corresponding to the dictionary, we can use interpolation or nearest-neighbor selection to generate matching channels for the new headset. Once we have matched channels, we need to determine how to select the corresponding words.
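The channel-matching step reduces to an index lookup between montages. The sketch below uses hypothetical label lists for illustration (real montages come from the headset definitions); non-overlapping channels would need interpolation or nearest-neighbor selection instead, as described above.

```python
# Hypothetical 10-20 channel-label lists for illustration only.
BIOSEMI_LABELS = ["POz", "Fz", "Cz", "C3", "C4", "F3", "F4", "P3", "P4", "Oz"]
ABM_LABELS = ["POz", "Fz", "Cz", "C3", "C4", "F3", "F4", "P3", "P4"]

def overlap_indices(base_labels, new_labels):
    """Indices into the base (dictionary) montage for each channel of the
    new headset that shares a position; channels of the new headset with
    no counterpart in the base montage are simply skipped."""
    pos = {name: i for i, name in enumerate(base_labels)}
    return [pos[name] for name in new_labels if name in pos]
```

Matching an ABM-style montage against the Biosemi-style base list returns the nine shared channel indices; matching in the other direction drops the channel (`Oz` here) that the smaller headset lacks, which is the "use only overlapping channels" rule.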
When we assign a sample to the matching words, we can use two approaches: hard assignment and soft assignment [17]. Hard assignment associates a sample with the nearest word, while soft assignment associates a sample with words based on a probability. Hard assignment is fast, but it has uncertainty and implausibility issues. Uncertainty happens when two or more words are relevant to a sample; in this case, we cannot claim one matching word for the sample because all candidate words are close enough to the sample. Implausibility happens when all words are too far from a sample; in this case, we cannot claim that the sample and its matching word are similar to each other. In this work, we use soft assignment to represent a sample as the matching probabilities of words, which are estimated based on the distance between a sample and a word in the dictionary. The soft assignment calculates a weight by multiplying a Gaussian density function and the Euclidean distance function [17]. On the assumption that all words have the same distributions of samples, we use the same smoothing parameter σ for all words. In our tests, the parameter σ was tuned empirically. The word weights for a frame are normalized so that they sum to 1. That is, if a dictionary has N words, each frame is represented as a weight vector of length N whose elements sum to 1.

2.3.2 Histogram of words in an epoch

After representing the raw EEG data as probability vectors of matching words, we represent an epoch as a histogram of words by adding the probability vectors into a histogram and normalizing the histogram to a unit vector. If we use all frames in the epoch, the epoch feature vector will be the average of the probability vectors for the individual frames. If we use only power peak frames, the epoch feature will be the average of the probability vectors corresponding to the power peak frames. No matter which method is used, the size of the epoch feature vector is always the size of the dictionary, and feature vectors can be processed in the same way as long as the same dictionary is used. Another advantage of this approach is that the histogram is invariant with respect to the sampling rate. Hence, we can easily share information across different datasets without regard to their configurations if one universal dictionary is used for all datasets.

2.4 Improving the quality of a descriptor

Because a bag-of-words model is a dimension-reduction technique, it comes with some information loss and performance degradation. To improve the quality of the descriptor, we apply two additional strategies: sub-intervals and sub-band filtering.

2.4.1 Sub-intervals along time axes

The proposed EEG descriptor uses a histogram to represent an epoch. Because a histogram is an orderless feature, it does not capture temporal relationships between frames. To compensate for the loss of temporal-order information, we divide an epoch into small sub-intervals and represent each sub-interval separately using a histogram. We then represent the epoch by concatenating the histograms of all sub-intervals. This allows the BOW features to retain some temporal-ordering information but does not require perfect time-locking to capture features.

2.4.2 Frequency sub-bands

It is generally known that neural oscillations in certain frequency bands have specific biological meanings [18]. Table 2 lists some common frequency bands and example locations on the scalp. Many studies have shown that power in specific bands such as theta or alpha is associated with processes such as fatigue or attention shifting. Because the goal of this work is to develop general-purpose EEG features, we use a series of band-pass filtered signals covering the low-frequency spectrum, instead of exclusively choosing a specific band.
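The soft assignment, epoch histogram, and sub-interval concatenation described in Sections 2.3 and 2.4.1 chain together naturally. The sketch below uses a Gaussian of the Euclidean distance as the soft-assignment weight, one plausible reading of the weighting in [17]; function names and defaults are ours.

```python
import numpy as np

def soft_assign(frame, dictionary, sigma=1.0):
    """Soft assignment: weight each dictionary word by a Gaussian of its
    Euclidean distance to the frame, normalized to sum to 1.  sigma is
    the empirically tuned smoothing parameter."""
    d = np.linalg.norm(dictionary - frame, axis=1)
    w = np.exp(-d ** 2 / (2 * sigma ** 2))
    return w / w.sum()

def epoch_feature(frames, dictionary, n_sub=1, sigma=1.0):
    """Histogram-of-words epoch feature: average the per-frame word
    probabilities, optionally over n_sub consecutive sub-intervals whose
    unit-length histograms are concatenated to keep coarse temporal
    order."""
    probs = np.array([soft_assign(f, dictionary, sigma) for f in frames])
    hists = []
    for part in np.array_split(probs, n_sub):
        h = part.mean(axis=0)
        hists.append(h / np.linalg.norm(h))
    return np.concatenate(hists)
```

With `n_sub=1` this reproduces the plain epoch histogram of Section 2.3.2 (length equal to the dictionary size); with `n_sub>1` the feature length grows to `n_sub` times the dictionary size, trading dimensionality for temporal resolution.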
If we use M bands for filtering, we obtain M word lists from the M filtered data bands. The feature is the concatenation of these M word lists. We also create a separate dictionary for each frequency band.

TABLE 2 PHYSICALLY RELEVANT EEG FREQUENCY BANDS [18].

Band  | Frequency  | Location
Delta | 0 ~ 4 Hz   | Frontally in adults, posteriorly in children
Theta | 4 ~ 7 Hz   | Lateralized or diffuse
Alpha | 8 ~ 15 Hz  | Posterior regions of the head, both sides; central sites (C3-C4) at rest
Beta  | 16 ~ 31 Hz | Both sides, symmetrical distribution
Gamma | 32 Hz ~    | Somatosensory cortex
Mu    | 8 ~ 12 Hz  | Sensorimotor cortex
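The M-band split can be illustrated with a simple FFT-masking decomposition. The report itself uses EEGLAB FIR filters (pop_eegfiltnew); this numpy-only stand-in just shows the sub-band structure, with band edges matching the eight sub-bands of Figure 5 (the last band runs from 28 Hz up).

```python
import numpy as np

def band_split(signal, fs, edges=(0, 4, 8, 12, 16, 20, 24, 28)):
    """Split a 1-D signal into len(edges) sub-bands by zeroing all FFT
    bins outside each band and inverting.  fs: sampling rate in Hz."""
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    spec = np.fft.rfft(signal)
    lows = edges
    highs = edges[1:] + (fs / 2 + 1,)   # last band: 28 Hz up to Nyquist
    bands = []
    for lo, hi in zip(lows, highs):
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(np.fft.irfft(spec * mask, n=len(signal)))
    return bands
```

Because the masks partition the frequency axis, the sub-band signals sum back to the original signal, a useful sanity check on any band decomposition.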
3. Evaluation of parameters

The proposed BOW descriptors have many parameters, including the size of the dictionary, the type of clustering method, the number of sub-intervals, and the ranges of the frequency sub-bands. This section describes the test datasets and empirically evaluates some of the parameter choices. The tests in this section use the Visual Evoked Potential (VEP) oddball task dataset from the Army Research Laboratory (ARL) [10], [19]. The dataset records EEG signals while two types of images were presented: an image of an enemy combatant (target) and an image of a U.S. soldier (non-target). As shown in Table 3, the data is not balanced, and target samples are uncommon relative to non-target samples. The ratio of target to non-target images is about 1:7. The ABM and Emotiv datasets have a similar ratio of target to non-target samples.

TABLE 3 PERCENTAGE OF EACH CLASS IN BIOSEMI DATASET.

Class | Sample number | Percentage
Non-target (label 34) |  | %
Target (label 35) |  | %
Other |  | %
Total | 13, | %

The images were presented in random order at a frequency of 0.5 Hz, and subjects were instructed to identify each image with a button press. The same subjects performed the test using the three different headsets specified in Table 4.

TABLE 4 TEST HEADSETS.

Headset | Channels (EEG + External) | Sampling rate
ABM     | 10 (9 + 1)  | 256 Hz
Biosemi | 68 (64 + 4) | 512 Hz
Emotiv  | 14 (14 + 0) | 128 Hz

Before extracting feature vectors and performing the classification tests, we used the fully automated PREP preprocessing pipeline to remove experimental artifacts from the data. The data is high-pass FIR filtered at 1 Hz, and line noise is removed using a multispectral tapering technique. Bad channels are identified based on four criteria: extreme amplitudes (deviation criterion), lack of correlation with any other channel (correlation criterion), lack of predictability by other channels (predictability criterion), and unusually high frequency noise (noisiness criterion) [20].
After interpolating the removed bad channels, we removed the remaining noise and subject-generated artifacts using the Artifact Subspace Reconstruction (ASR) method [21], implemented as part of the clean_rawdata 0.31 EEGLAB plug-in [22]. We ran ASR with the default parameters except for a burst criterion parameter of 20, in order to retain as much signal as possible. We sub-band-pass filtered the cleaned data using the EEGLAB pop_eegfiltnew function and then normalized the data in two steps. In the first step, we normalized each channel to have zero mean and unit standard deviation so that all channels have the same scale. In this study, we use feature vectors extracted from each frame, so we additionally normalized each frame to remove scale differences between frames. After normalization, each one-second EEG epoch is represented using the proposed BOW histogram features. Finally, we apply PCA to the extracted features to reduce the dimension.
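The final PCA step can be sketched with an SVD-based projection. This is a generic stand-in; the report does not state how many components were retained, so `n_components` here is a hypothetical parameter.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Reduce BOW feature vectors with PCA via SVD: center the data and
    project it onto the top right singular vectors (principal axes)."""
    X = np.asarray(features, float)
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # scores on the top axes
```

The returned components are ordered by decreasing variance, so truncating to `n_components` keeps the directions along which the epoch features vary most.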
To test various parameter options for generating histogram features, we compared classification results using the parameters summarized in Table 5. Because of the complexity of the parameter choices, we chose to vary each parameter independently. With the exception of Section 3.6, the following sections report within-subject tests, in which the training samples and test samples are from the same subject. To pick balanced samples, we randomly pick the same number of samples from each class. The measures for the within-subject tests are the average classification accuracies over 14 test subjects. Other details of the tests are explained in the following sub-sections.

TABLE 5 PARAMETERS FOR EVALUATION.

Parameter | Options | Default
Clustering methods | k-means, k-centers, k-medoids, subtractive clustering, affinity propagation | k-means
Size of a dictionary | 50, 100, 150, 200, 250, 300 words | 100
Number of sub-intervals | 1 ~ 16 | 1
Number of sub-bands | 1, 2, 4, 8, 12, 16 | 1
Classifier | Linear discriminant analysis (LDA), ARRLS [23] | LDA
Frames | Only peak frames; peak frames with gaps filled; entire frames | Only peak frames
Descriptor | Normalized frames, gradient descriptor | Normalized frames

3.1 Gradient descriptor versus normalized frame descriptor

To describe EEG frames, we compared two different features: the gradient descriptor and the normalized frame descriptor. Table 6 shows the classification performance when these two features were used in a bag-of-words model. To compare the two features, we built a bag-of-words model using each descriptor with the Biosemi 64-channel data and ran a classification test in which the training set and the test set were from different headsets. The overall goal is to find robust features that transfer across headsets, subjects, and paradigms. As explained in Section 2.1, gradient descriptors are susceptible to differences in channel configurations.
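The balanced sampling used throughout the within-subject tests, picking the same number of samples per class, can be sketched as follows (a generic sketch; the function name is ours):

```python
import numpy as np

def balanced_subsample(labels, seed=0):
    """Return sorted indices of an equal number of samples per class,
    namely the size of the smallest class, chosen at random."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n = counts.min()
    idx = [rng.choice(np.where(labels == c)[0], n, replace=False)
           for c in classes]
    return np.sort(np.concatenate(idx))
```

For unbalanced data like the roughly 1:7 target/non-target split in Table 3, this discards most of the majority class so that the classifier sees both classes equally often.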
Table 6 compares the performance for different training headsets when the test headset is the 64-channel Biosemi headset. For this test, we used the default parameters of Table 5 (k-means clustering, 100-word dictionary, one sub-interval, one sub-band, LDA classifier) and performed the within-subject classification test, in which training samples and test samples are from the same subject. This is a difficult test, as neither the ABM nor the Emotiv headset has good coverage of the scalp. If the training and test headsets are both the Biosemi headset, the gradient descriptor and the normalized frame descriptor have similar performance. However, if the training headset is the Emotiv, the performance of the gradient descriptor degrades more than that of the normalized frame descriptor.
TABLE 6 DESCRIPTOR COMPARISON FOR WITHIN SUBJECT CLASSIFICATION (TEST HEADSET: BIOSEMI).

Training headset | Gradient descriptor (Accuracy / Drop) | Normalized frame descriptor (Accuracy / Drop)
Biosemi |  |
ABM |  |
Emotiv |  |

In both cases, the accuracy drops considerably for training headsets with low numbers of channels, although the gradient descriptor appears to do slightly worse. We use the normalized frame descriptor for representing the various EEG data in the remainder of the tests.

3.2 Evaluation of the dictionary sizes

To find the best size for the dictionary, we ran the within-subject classification test on the VEP datasets with dictionary sizes ranging from 50 to 300. The within-subject classification test uses training samples and test samples from the same headset and the same subject. These tests used the dictionary built from the 64-channel Biosemi data with no sub-bands and no sub-intervals. The VEP dataset includes the three types of headsets listed in Table 4; we assume that training and test samples are from the same headset. The test uses the default parameters in Table 5, with dictionary sizes varying from 50 to 300. Table 7 shows the average accuracies of 10 runs for the various dictionary sizes. As shown in Table 7, the classification performance does not appear to be very sensitive to the size of the dictionary. We chose 100 words as the default dictionary size.

TABLE 7 COMPARISON OF VARIOUS DICTIONARY SIZES FOR WITHIN SUBJECT CLASSIFICATION.

Word number | ABM | Biosemi | Emotiv | Average

3.3 Evaluation of clustering methods

Table 3 shows the percentage of each class in our test EEG data. According to Table 3, the data is not balanced, and the interesting target samples are uncommon relative to the less-interesting non-target samples. If the clustering method used to build the dictionary is sensitive to the bias of the data, the dictionary built from the data could be biased toward these unimportant but dominant samples.
The low-dimensional example of Figure 8 shows that k-means and k-means++ assign more clusters to the dominant (dense) areas, while subtractive and affinity clustering assign clusters more evenly throughout the sample space. However, it is not clear how well these observations apply to very high-dimensional data. We therefore performed a more direct test of how sensitive the overall feature representation is to the choice of clustering procedure: a within-subject classification test with five different dictionaries built using five different clustering methods. Table 8 shows the average accuracies over 10 runs. Although affinity propagation is slightly better than the other clustering methods for unbalanced data, the difference in accuracy does not appear to be sufficient to warrant the significant increase in execution time required to compute the affinity clustering results. The k-means clustering method is also available in a GPU implementation, making computation much faster [24].
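As an illustration of the dictionary-building step, the sketch below clusters normalized frame vectors into a 100-word dictionary with k-means and a small number of replicates. It is a minimal Python example using scikit-learn rather than the GPU library of [24]; the function name `build_dictionary` and the random "frame" data are hypothetical, not from the report.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(frames, n_words=100, n_replicates=3, seed=0):
    """Cluster normalized frame vectors into a BOW dictionary.

    frames: (n_frames, n_features) array of frame descriptors.
    Returns the (n_words, n_features) cluster centers (the "words").
    """
    km = KMeans(n_clusters=n_words, n_init=n_replicates, random_state=seed)
    km.fit(frames)
    return km.cluster_centers_

# Stand-in data: 5000 random "frames" of dimension 64 (one value per channel)
rng = np.random.default_rng(0)
frames = rng.standard_normal((5000, 64))
dictionary = build_dictionary(frames)
print(dictionary.shape)  # (100, 64)
```

Each of the `n_replicates` runs restarts k-means from different random seeds and the best clustering (lowest inertia) is kept, matching the two-to-three-replicate recommendation below.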
TABLE 8 ACCURACY FOR WITHIN SUBJECT CLASSIFICATION WITH BOW DICTIONARIES BUILT USING DIFFERENT CLUSTERING METHODS. Columns: k-means, k-centers, k-medoids, affinity propagation, and subtractive clustering; rows: ABM, Biosemi, Emotiv, and Average accuracy (%).

For clustering methods that use random seeds, we usually repeat the clustering more than once to find the best seeds. Table 9 shows the classification accuracies of three clustering methods that use random seeds, each with five replicates. As shown in the table, there is very little difference among replicates. Based on these results, we suggest using k-means clustering with two or three replicates to build a suitable dictionary.

TABLE 9 WITHIN SUBJECT CLASSIFICATION ACCURACY FOR 5 CLUSTERING REPLICATES. Rows: each clustering method (k-means, k-centers, k-medoids) crossed with each headset (ABM, Biosemi, Emotiv); columns: the five replicates, their average, and their standard deviation.

3.4 Evaluation of sub-intervals
The previous sections showed the performance of bag-of-words models when the entire frequency band and the entire period of an epoch were used. The following sections show how to improve the quality of the descriptor by using sub-bands and sub-intervals as described in Section 2.4. To find the optimal number of sub-intervals, we performed within-subject classification on the VEP data with various numbers of sub-windows and a single frequency band. Table 10 shows a definite increase in performance across headsets as the number of sub-intervals increases to five. The performance increase levels off beyond eight.
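The sub-interval scheme can be sketched as follows: each epoch (channels x samples) is cut into equal-length time windows, and descriptors are then extracted per window. This is a minimal illustration; the helper name `split_subintervals` is our own, not from the report.

```python
import numpy as np

def split_subintervals(epoch, n_sub):
    """Split an epoch (channels x samples) into n_sub equal time windows.

    Trailing samples that do not divide evenly are dropped.
    """
    n_samples = epoch.shape[1]
    win = n_samples // n_sub
    return [epoch[:, i * win:(i + 1) * win] for i in range(n_sub)]

# A 1-second, 64-channel epoch sampled at 256 Hz, split into 8 sub-intervals
epoch = np.zeros((64, 256))
windows = split_subintervals(epoch, 8)
print(len(windows), windows[0].shape)  # 8 (64, 32)
```

With eight sub-intervals on a 1-second epoch, each window covers 125 ms, the value recommended later in the report.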
TABLE 10 WITHIN SUBJECT CLASSIFICATION ACCURACY FOR DIFFERENT NUMBERS OF SUB-INTERVALS. Columns: number of sub-intervals; rows: ABM, Biosemi, Emotiv, and Average.

3.5 Evaluation of sub-bands
To test the effect of frequency resolution on classification accuracy, we divided the frequency range from 0 Hz to 32 Hz into equal-sized sub-bands. For example, if the number of sub-bands is two, one band covers 0 Hz to 16 Hz and the other covers 16 Hz to 32 Hz. We use EEGLAB's pop_eegfiltnew() to extract data in specific frequency ranges [25]. To reduce the overlap between sub-bands, the transition bandwidth was set to 1 Hz. Table 11 shows the results of within-subject classification tests on the VEP datasets with various numbers of sub-bands. Except for the number of sub-bands, the tests use the default parameters: k-means clustering, a 100-word dictionary, one sub-interval, and the LDA classifier. There is a weak dependence on the number of frequency bands.

TABLE 11 WITHIN SUBJECT CLASSIFICATION ACCURACY FOR DIFFERENT NUMBERS OF FREQUENCY BANDS. Columns: number of sub-bands; rows: ABM, Biosemi, Emotiv, and Average.

When optimized independently, classification accuracy improves with an increasing number of sub-intervals and frequency bands, but the effects fall off as the time and frequency subdivisions increase beyond a certain point. Based on physical considerations and these results, a selection of eight frequency sub-bands of width 4 Hz and eight sub-intervals of length 125 ms appears to be a good choice. While we have not extensively optimized the two parameters jointly, this choice appears to be close to optimal in the cases we have examined.

3.6 Cross subject transfer learning
Because bag-of-words descriptors are relatively independent of the EEG headset configuration, these features can be combined to improve classification across different configurations. To highlight this advantage of the BOW descriptor, we performed cross-subject transfer learning tests on the VEP multi-headset data.
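The equal-width sub-band partition of Section 3.5 can be sketched in Python. The report itself uses EEGLAB's pop_eegfiltnew(); here a zero-phase Butterworth filter from SciPy serves only as a rough stand-in, and the helper names (`subband_edges`, `bandpass`) are hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def subband_edges(f_max=32.0, n_bands=8):
    """Equal-width sub-band edges over [0, f_max] Hz."""
    e = np.linspace(0.0, f_max, n_bands + 1)
    return [(float(lo), float(hi)) for lo, hi in zip(e[:-1], e[1:])]

def bandpass(data, lo, hi, fs):
    """Zero-phase Butterworth band-pass; a rough stand-in for pop_eegfiltnew().

    A 0 Hz low edge is clipped to 0.5 Hz, since a band-pass filter cannot
    include DC.
    """
    sos = butter(4, [max(lo, 0.5), hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, data, axis=-1)

edges = subband_edges()  # 8 bands of width 4 Hz: (0, 4), (4, 8), ..., (28, 32)
epoch = np.random.default_rng(0).standard_normal((64, 256))  # 64 ch, 2 s at 128 Hz
theta = bandpass(epoch, *edges[1], fs=128.0)  # the 4-8 Hz component
```

With eight bands the edges fall every 4 Hz, matching the recommended configuration; pop_eegfiltnew() instead applies a windowed-sinc FIR filter with the 1 Hz transition band described above.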
As described in Table 4, the VEP test headsets have 10, 14, or 68 channels and sampling rates varying from 128 Hz to 512 Hz. To share information between the different datasets, we use the ARRLS transfer learning classifier, which tries to minimize three factors simultaneously: the structural risk, the joint distribution matching error, and the manifold consistency error [23].
For comparison, we also tested RAW features. Because each headset has different parameters, we cannot combine the datasets directly. For the RAW features, we resampled all data to 128 Hz, used the channels common to the test headset and the training headset, and took the raw signals of those common channels in an epoch as the feature vector. The BOW tests use eight sub-intervals and eight sub-bands, with the default parameters of Table 5 for the remaining parameters. For the RAW features, the ABM-to-Emotiv and Emotiv-to-ABM conditions are not available, since these headsets have no detector overlap. The BOW features are based on a dictionary built from 64-channel Biosemi EEG. Frames from each headset are matched to the best words based on the closest channels, so BOW is available for all test conditions. In our experiments, test samples and training samples may come from different headsets of the VEP multi-headset data, as shown in Table 12. In each run, we randomly pick samples from each headset and perform a leave-one-subject-out classification test: one subject serves as the test subject, and the remaining subjects serve as training subjects for predicting the unknown labels of the test subject's samples. The previous sections used the within-subject classification test, in which training and test samples come from the same subject, so their performance cannot be directly compared to the results in Table 12, which uses training and test samples from different subjects. Table 12 shows accuracy averaged over five runs. In each run, we randomly pick balanced samples from a training subject and a test subject; to get balanced samples from an imbalanced dataset, we randomly pick the same number of samples from each class with replacement. As shown in Table 12, the BOW features can be used to combine data from different headsets and show better performance than the RAW features in all cases.
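The balanced sampling step described above (the same number of samples drawn from each class, with replacement) might be sketched as follows; the helper name `balanced_sample` is illustrative, not from the report.

```python
import numpy as np

def balanced_sample(labels, n_per_class, rng=None):
    """Pick the same number of indices from each class, with replacement.

    Mirrors the report's balancing step for imbalanced target/non-target data.
    """
    rng = np.random.default_rng(rng)
    idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        idx.append(rng.choice(members, size=n_per_class, replace=True))
    return np.concatenate(idx)

# 5% targets (label 1) versus 95% non-targets (label 0)
labels = np.array([1] * 50 + [0] * 950)
picked = balanced_sample(labels, n_per_class=100, rng=0)
print(len(picked))  # 200
```

Sampling with replacement is what allows the rare target class (50 samples here) to contribute the same 100 samples per run as the dominant class.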
TABLE 12 ACCURACY OF CROSS SUBJECT TRANSFER LEARNING TEST FOR THE VEP COLLECTION. Rows: each test headset (ABM, Biosemi, Emotiv) crossed with each training headset (ABM, Biosemi, Emotiv); columns: RAW* and bag-of-words accuracy. The RAW feature is not available for the ABM/Emotiv pairs.
* Uses the channels common to the test headset and the training headset.
4. Discussion
This technical report evaluates the performance of the bag-of-words descriptor for EEG data. Our comparison of the gradient descriptor and direct frame representations showed that frame representations had equivalent or better performance and were less sensitive to headset configuration. Gradient descriptors require a somewhat uniform distribution of detectors across the head to obtain an accurate estimate of the gradient, while frame descriptors can always be calculated. The sub-interval and sub-band approaches are useful for increasing the quality of descriptors, with dictionary sizes of 100 words for each subcase. However, these approaches can introduce boundary effects when used with peak frames. To mitigate boundary effects, we propose the peak-frame repeat method. As shown in the appendix, this method shows performance comparable to the approach using entire frames, while reducing the amount of resources required to process frames. In particular, the method reduces the clustering time needed to build a dictionary and also reduces the memory needed to store extracted frame representations, because it stores only peak frames. Based on our test results, we recommend using frequency bands of width 4 Hz and sub-intervals on the order of 100 ms. Intuitively, this choice gives temporal and frequency resolution at scales that are physically meaningful. Such a BOW representation of 1-second epochs gives features of length 6400, which is comparable in size to a raw representation of 1-second epochs of 64-channel EEG sampled at 128 Hz. An advantage of BOW is a common representation across headset configurations and sampling frequencies. Another important advantage is that BOW removes the exact time-locking requirements present in raw feature representations.
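The feature-length comparison in the discussion above is simple arithmetic: one 100-word histogram per (sub-band, sub-interval) cell versus a raw 64-channel, 128 Hz, 1-second epoch.

```python
# BOW: one 100-word histogram for each of 8 sub-bands x 8 sub-intervals
n_words, n_bands, n_intervals = 100, 8, 8
bow_len = n_words * n_bands * n_intervals
print(bow_len)  # 6400

# Raw baseline: a 1-second epoch, 64 channels, sampled at 128 Hz
raw_len = 64 * 128
print(raw_len)  # 8192
```

Unlike the raw vector, the 6400-dimensional BOW vector has the same length for every headset, regardless of channel count or sampling rate.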
Our results have not shown a strong dependence on the details of the actual dictionary, in terms of either the clustering method used or the number of replicates needed to find an optimal clustering. These results suggest that it is possible to use a single dictionary across headsets and experimental paradigms without losing much resolution. This is an important finding, as it allows the use of efficient GPU k-means clustering to produce dictionaries that are broadly applicable across headsets and paradigms.

5. Acknowledgments
The authors gratefully acknowledge the use of the data from the VEP headset comparison study and thank their collaborators at ARL, particularly David Hairston, Scott Kerick, and Vernon Lawhern. This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. Computational support for this project was provided by the Computational System Biology Core, funded by the National Institute on Minority Health and Health Disparities (G12MD007591) of the National Institutes of Health.
References
[1] F. S. Bao, X. Liu, and C. Zhang, "PyEEG: An Open Source Python Module for EEG/MEG Feature Extraction," Comput. Intell. Neurosci., vol. 2011, pp. 1-7, 2011.
[2] M. M. Murray, D. Brunet, and C. M. Michel, "Topographic ERP Analyses: A Step-by-Step Tutorial Review," Brain Topogr., vol. 20, no. 4, Jun. 2008.
[3] D. Brunet, M. M. Murray, and C. M. Michel, "Spatiotemporal Analysis of Multichannel EEG: CARTOOL," Comput. Intell. Neurosci., vol. 2011, Jan. 2011.
[4] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," in Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
[5] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int. J. Comput. Vis., vol. 60, no. 2, Nov. 2004.
[6] D. Lehmann, H. Ozaki, and I. Pal, "EEG alpha map series: brain micro-states by space-oriented adaptive segmentation," Electroencephalogr. Clin. Neurophysiol., vol. 67, no. 3, Sep. 1987.
[7] BESA Brain Electrical Source Analysis. [Online]. [Accessed: 19-Nov-2013].
[8] G. Schalk, D. J. McFarland, T. Hinterberger, N. Birbaumer, and J. R. Wolpaw, "BCI2000: A General-purpose Brain-Computer Interface (BCI) System," IEEE Trans. Biomed. Eng., vol. 51, no. 6, Jun. 2004.
[9] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals," Circulation, vol. 101, no. 23, pp. e215-e220, Jun. 2000.
[10] W. D. Hairston, K. W. Whitaker, A. J. Ries, J. M. Vettel, J. C. Bradford, S. E. Kerick, and K. McDowell, "Usability of four commercially-oriented EEG systems," J. Neural Eng., vol. 11, no. 4, Aug. 2014.
[11] D. Arthur and S. Vassilvitskii, "K-means++: the advantages of careful seeding," in Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, 2007.
[12] S. Dasgupta and P. M. Long, "Performance guarantees for hierarchical clustering," J.
Comput. Syst. Sci., vol. 70, no. 4, Jun. 2005.
[13] H.-S. Park and C.-H. Jun, "A simple and fast algorithm for K-medoids clustering," Expert Syst. Appl., vol. 36, no. 2, Part 2, Mar. 2009.
[14] S. L. Chiu, "Fuzzy model identification based on cluster estimation," J. Intell. Fuzzy Syst., vol. 2, 1994.
[15] B. J. Frey and D. Dueck, "Clustering by Passing Messages Between Data Points," Science, vol. 315, no. 5814, Feb. 2007.
[16] R. Oostenveld and P. Praamstra, "The five percent electrode system for high-resolution EEG and ERP measurements," Clin. Neurophysiol., vol. 112, no. 4, Apr. 2001.
[17] J. C. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek, "Visual Word Ambiguity," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, Jul. 2010.
[18] "Electroencephalography," Wikipedia, the free encyclopedia. 03-Feb.
[19] A. J. Ries, "A Comparison of Electroencephalography Signals Acquired from Conventional and Mobile Systems," J. Neurosci. Neuroengineering, vol. 3, no. 1, 2014.
[20] R. Kay, "EEG-Clean-Tools," GitHub. [Online]. [Accessed: 03-Apr-2015].
[21] T. Mullen, C. Kothe, Y. M. Chi, A. Ojeda, T. Kerth, S. Makeig, G. Cauwenberghs, and T.-P. Jung, "Real-time modeling and 3D visualization of source dynamics and connectivity using wearable EEG," in 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2013.
[22] "Plugin list process," SCCN. [Online]. [Accessed: 17-Apr-2015].
[23] M. Long, J. Wang, G. Ding, S. J. Pan, and P. S. Yu, "Adaptation Regularization: A General Framework for Transfer Learning," IEEE Trans. Knowl. Data Eng., vol. 26, no. 5, May 2014.
[24] K. J. Kohlhoff, M. H. Sosnick, W. T. Hsu, V. S. Pande, and R. B. Altman, "CAMPAIGN: an open-source library of GPU-accelerated data clustering algorithms," Bioinformatics, vol. 27, no. 16, Aug. 2011.
[25] A. Delorme and S. Makeig, "EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," J. Neurosci. Methods, vol. 134, no. 1, pp. 9-21, Mar. 2004.
Appendix
Within subject classification accuracy using only peak frames. For each headset (ABM, Biosemi, and Emotiv), the table rows correspond to the number of sub-bands and the columns to the number of sub-intervals.
* Error during LDA: pooled variance of the training data is not positive.
More informationAnalysis, Synthesis, and Perception of Musical Sounds
Analysis, Synthesis, and Perception of Musical Sounds The Sound of Music James W. Beauchamp Editor University of Illinois at Urbana, USA 4y Springer Contents Preface Acknowledgments vii xv 1. Analysis
More informationRobust Transmission of H.264/AVC Video Using 64-QAM and Unequal Error Protection
Robust Transmission of H.264/AVC Video Using 64-QAM and Unequal Error Protection Ahmed B. Abdurrhman, Michael E. Woodward, and Vasileios Theodorakopoulos School of Informatics, Department of Computing,
More informationHeart Rate Variability Preparing Data for Analysis Using AcqKnowledge
APPLICATION NOTE 42 Aero Camino, Goleta, CA 93117 Tel (805) 685-0066 Fax (805) 685-0067 info@biopac.com www.biopac.com 01.06.2016 Application Note 233 Heart Rate Variability Preparing Data for Analysis
More informationPredicting Performance of PESQ in Case of Single Frame Losses
Predicting Performance of PESQ in Case of Single Frame Losses Christian Hoene, Enhtuya Dulamsuren-Lalla Technical University of Berlin, Germany Fax: +49 30 31423819 Email: hoene@ieee.org Abstract ITU s
More informationUC San Diego UC San Diego Previously Published Works
UC San Diego UC San Diego Previously Published Works Title Classification of MPEG-2 Transport Stream Packet Loss Visibility Permalink https://escholarship.org/uc/item/9wk791h Authors Shin, J Cosman, P
More informationWhite Paper. Uniform Luminance Technology. What s inside? What is non-uniformity and noise in LCDs? Why is it a problem? How is it solved?
White Paper Uniform Luminance Technology What s inside? What is non-uniformity and noise in LCDs? Why is it a problem? How is it solved? Tom Kimpe Manager Technology & Innovation Group Barco Medical Imaging
More informationTorsional vibration analysis in ArtemiS SUITE 1
02/18 in ArtemiS SUITE 1 Introduction 1 Revolution speed information as a separate analog channel 1 Revolution speed information as a digital pulse channel 2 Proceeding and general notes 3 Application
More informationAPPLICATIONS OF DIGITAL IMAGE ENHANCEMENT TECHNIQUES FOR IMPROVED
APPLICATIONS OF DIGITAL IMAGE ENHANCEMENT TECHNIQUES FOR IMPROVED ULTRASONIC IMAGING OF DEFECTS IN COMPOSITE MATERIALS Brian G. Frock and Richard W. Martin University of Dayton Research Institute Dayton,
More informationAUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION
AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate
More informationDAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes
DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring 2009 Week 6 Class Notes Pitch Perception Introduction Pitch may be described as that attribute of auditory sensation in terms
More informationHUMAN PERCEPTION AND COMPUTER EXTRACTION OF MUSICAL BEAT STRENGTH
Proc. of the th Int. Conference on Digital Audio Effects (DAFx-), Hamburg, Germany, September -8, HUMAN PERCEPTION AND COMPUTER EXTRACTION OF MUSICAL BEAT STRENGTH George Tzanetakis, Georg Essl Computer
More informationRobust 3-D Video System Based on Modified Prediction Coding and Adaptive Selection Mode Error Concealment Algorithm
International Journal of Signal Processing Systems Vol. 2, No. 2, December 2014 Robust 3-D Video System Based on Modified Prediction Coding and Adaptive Selection Mode Error Concealment Algorithm Walid
More informationAutomatic Laughter Detection
Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional
More informationAutomatic Piano Music Transcription
Automatic Piano Music Transcription Jianyu Fan Qiuhan Wang Xin Li Jianyu.Fan.Gr@dartmouth.edu Qiuhan.Wang.Gr@dartmouth.edu Xi.Li.Gr@dartmouth.edu 1. Introduction Writing down the score while listening
More informationDevelopment of 16-channels Compact EEG System Using Real-time High-speed Wireless Transmission
Engineering, 2013, 5, 93-97 doi:10.4236/eng.2013.55b019 Published Online May 2013 (http://www.scirp.org/journal/eng) Development of 16-channels Compact EEG System Using Real-time High-speed Wireless Transmission
More informationEffects of lag and frame rate on various tracking tasks
This document was created with FrameMaker 4. Effects of lag and frame rate on various tracking tasks Steve Bryson Computer Sciences Corporation Applied Research Branch, Numerical Aerodynamics Simulation
More informationHigh Quality Digital Video Processing: Technology and Methods
High Quality Digital Video Processing: Technology and Methods IEEE Computer Society Invited Presentation Dr. Jorge E. Caviedes Principal Engineer Digital Home Group Intel Corporation LEGAL INFORMATION
More informationHands-on session on timing analysis
Amsterdam 2010 Hands-on session on timing analysis Introduction During this session, we ll approach some basic tasks in timing analysis of x-ray time series, with particular emphasis on the typical signals
More informationChord Classification of an Audio Signal using Artificial Neural Network
Chord Classification of an Audio Signal using Artificial Neural Network Ronesh Shrestha Student, Department of Electrical and Electronic Engineering, Kathmandu University, Dhulikhel, Nepal ---------------------------------------------------------------------***---------------------------------------------------------------------
More informationType-2 Fuzzy Logic Sensor Fusion for Fire Detection Robots
Proceedings of the 2 nd International Conference of Control, Dynamic Systems, and Robotics Ottawa, Ontario, Canada, May 7 8, 2015 Paper No. 187 Type-2 Fuzzy Logic Sensor Fusion for Fire Detection Robots
More informationCompleting Cooperative Task by Utilizing EEGbased Brain Computer Interface
Washington University in St. Louis Washington University Open Scholarship Engineering and Applied Science Theses & Dissertations Engineering and Applied Science Spring 5-18-2018 Completing Cooperative
More informationA PERPLEXITY BASED COVER SONG MATCHING SYSTEM FOR SHORT LENGTH QUERIES
12th International Society for Music Information Retrieval Conference (ISMIR 2011) A PERPLEXITY BASED COVER SONG MATCHING SYSTEM FOR SHORT LENGTH QUERIES Erdem Unal 1 Elaine Chew 2 Panayiotis Georgiou
More informationUniversity of Bristol - Explore Bristol Research. Peer reviewed version. Link to published version (if available): /ISCAS.2005.
Wang, D., Canagarajah, CN., & Bull, DR. (2005). S frame design for multiple description video coding. In IEEE International Symposium on Circuits and Systems (ISCAS) Kobe, Japan (Vol. 3, pp. 19 - ). Institute
More informationAn Efficient Reduction of Area in Multistandard Transform Core
An Efficient Reduction of Area in Multistandard Transform Core A. Shanmuga Priya 1, Dr. T. K. Shanthi 2 1 PG scholar, Applied Electronics, Department of ECE, 2 Assosiate Professor, Department of ECE Thanthai
More informationPython Quick-Look Utilities for Ground WFC3 Images
Instrument Science Report WFC3 2008-002 Python Quick-Look Utilities for Ground WFC3 Images A.R. Martel January 25, 2008 ABSTRACT A Python module to process and manipulate ground WFC3 UVIS and IR images
More informationThe Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng
The Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng S. Zhu, P. Ji, W. Kuang and J. Yang Institute of Acoustics, CAS, O.21, Bei-Si-huan-Xi Road, 100190 Beijing,
More informationGetting Started. Connect green audio output of SpikerBox/SpikerShield using green cable to your headphones input on iphone/ipad.
Getting Started First thing you should do is to connect your iphone or ipad to SpikerBox with a green smartphone cable. Green cable comes with designators on each end of the cable ( Smartphone and SpikerBox
More informationTERRESTRIAL broadcasting of digital television (DTV)
IEEE TRANSACTIONS ON BROADCASTING, VOL 51, NO 1, MARCH 2005 133 Fast Initialization of Equalizers for VSB-Based DTV Transceivers in Multipath Channel Jong-Moon Kim and Yong-Hwan Lee Abstract This paper
More informationOptimized Color Based Compression
Optimized Color Based Compression 1 K.P.SONIA FENCY, 2 C.FELSY 1 PG Student, Department Of Computer Science Ponjesly College Of Engineering Nagercoil,Tamilnadu, India 2 Asst. Professor, Department Of Computer
More informationCS 1674: Intro to Computer Vision. Face Detection. Prof. Adriana Kovashka University of Pittsburgh November 7, 2016
CS 1674: Intro to Computer Vision Face Detection Prof. Adriana Kovashka University of Pittsburgh November 7, 2016 Today Window-based generic object detection basic pipeline boosting classifiers face detection
More informationDeepID: Deep Learning for Face Recognition. Department of Electronic Engineering,
DeepID: Deep Learning for Face Recognition Xiaogang Wang Department of Electronic Engineering, The Chinese University i of Hong Kong Machine Learning with Big Data Machine learning with small data: overfitting,
More informationMusic Recommendation from Song Sets
Music Recommendation from Song Sets Beth Logan Cambridge Research Laboratory HP Laboratories Cambridge HPL-2004-148 August 30, 2004* E-mail: Beth.Logan@hp.com music analysis, information retrieval, multimedia
More informationResearch Article Design and Analysis of a High Secure Video Encryption Algorithm with Integrated Compression and Denoising Block
Research Journal of Applied Sciences, Engineering and Technology 11(6): 603-609, 2015 DOI: 10.19026/rjaset.11.2019 ISSN: 2040-7459; e-issn: 2040-7467 2015 Maxwell Scientific Publication Corp. Submitted:
More informationA Framework for Segmentation of Interview Videos
A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida
More information