HomeLog: A Smart System for Unobtrusive Family Routine Monitoring


Abstract

Research has shown that family routine plays a critical role in establishing good relationships among family members and maintaining their physical and mental health. In particular, regularly eating dinner as a family and limiting screen viewing time significantly reduce the prevalence of obesity. Fine-grained activity logging and analysis can enable a family to track their daily routine and modify their lifestyle for improved wellness. This paper presents HomeLog, a practical system to log family routine using off-the-shelf smartphones and smartwatches. HomeLog automatically detects and logs details of several important family routine activities, including family dining, TV viewing and conversation, in an unobtrusive manner. By providing a detailed family routine history, HomeLog empowers family members to actively engage in making positive changes to improve family wellness. Based on the sensor data collected from real families, we carefully design robust yet lightweight signal features for the classification of various family activities. HomeLog keeps track of the ambient noise characteristics and adapts its learning algorithms in response to the dynamics of the environment. Our extensive experiments involving 8 families with children show that HomeLog can detect family routine activities with over 8% precision and 2% recall across different families and home environments.

1 Introduction

Research has shown that family routine plays a critical role in establishing good relationships among family members and maintaining their physical and mental health [5][13][11][21]. For instance, regularly eating dinner as a family and limiting screen viewing time significantly reduce the prevalence of obesity [45][1][55][18][52][54]. Reducing sedentary behavior (e.g., screen viewing) was found to be as effective as increasing physical activity in preventing obesity [45]. According to the Social Ecological Model [38], the family environment can have a tremendous effect on child obesity. For instance, parents' eating habits and physical activities [41][47][52] strongly predict children's tendency towards obesity. In addition to the implications for family health, fine-grained analysis of family routine enables important studies in sociology and home economics. For instance, research has shown that the amount of shared time (including conversation and eating) between spouses and between parents and children has strong links with family income, the mother's employment status, the ages of children, and geographic location (urban or rural) [6][2][19]. Unfortunately, to date, there has been no unobtrusive and convenient approach to logging family activities for family routine assessment. Some of the available methods for family activity monitoring rely on videotaping, which not only incurs considerable installation/analysis costs, but also raises privacy concerns. Other methods resort to daily self-reports from family members. However, the accuracy of the result is susceptible to human error and the subjectivity of the users.
This paper presents HomeLog, a practical system to log family routine using off-the-shelf smartphones and smartwatches. HomeLog automatically detects and logs details of several important family routine activities in an unobtrusive manner. It uses the built-in accelerometer and microphone of the smartphones and smartwatches to detect activities that are closely related to family wellness, including the start/end time and participants of family dining and TV viewing. By providing a detailed family routine history, HomeLog empowers family members to actively engage in making positive changes to improve family wellness, e.g., preventing child obesity. The design of HomeLog faces several challenges, such as the significant interference from various noises in the home. Moreover, the microphone in smart devices is designed to capture close vocals and usually has low sensitivity. We carefully analyze the acoustic data collected from real families and choose acoustic features that are robust against various noises and the low sensitivity of the built-in microphone. In addition, the runtime features of an activity may substantially deviate from those in the training data. HomeLog adapts its algorithms in response to the dynamics of the environment under the users' supervision. In order to preserve the privacy of the family, HomeLog adopts lightweight yet effective algorithms to process the high-rate acoustic data stream on-the-fly without transmitting any data to a server.

We have evaluated HomeLog with extensive experiments involving 8 families with children (one or two weeks of recording in each family). Our results show the effectiveness of HomeLog in family activity detection (with average 88.8% precision and 9.2% recall) across different families and home environments. Moreover, the long-term, fine-grained family activity history provided by HomeLog makes it possible to analyze routine patterns/anomalies and improve family lifestyles.

2 Related Work

Studies by the American Academy of Pediatrics have shown that a healthy family routine is not only helpful in establishing good relationships among family members, but also critical for the proper development of children's physical and mental health [5][13][11][21]. Family meals create opportunities for communication among family members and greatly benefit the parent-child relationship [4][17]. Unlike the family meal, research has found that TV viewing can be a double-edged sword: it helps children better explore the world, but may also reduce their physical activity, or even lead to risky behaviors [15][14]. Traditionally, keeping diaries has been the primary approach for long-term studies of family activities [26][32]. However, manual recording is often susceptible to the subjectivity of individuals, leading to biased routine analysis. Video recording has also been used for monitoring family activities [12]. However, it often incurs high installation costs and may cause privacy concerns. Recently, several systems have been designed to detect the usage of electrical appliances based on electromagnetic interference and ambient sensors [4][23][28]. However, these systems can only detect family activities that involve substantial appliance usage. Recently, activity monitoring using mobile devices has received significant attention. Several systems have been developed to provide in-home behavior monitoring, such as motion-based fall detection, especially for elderly people [37][7][9]. In [29], a system based on a 3D camera is developed that leverages image processing techniques to identify human pose and infer corresponding activities. In [8], a sound-based system is used for bathroom monitoring, aiming to help caregiving for dementia patients. However, these studies focus on monitoring accidents or risk behaviors for either elderly people or patients. Moreover, they typically require custom hardware and hence present barriers to large-scale adoption. Several recent mobile health systems are designed based on off-the-shelf smartphones. The system presented in [43] detects conversation and physical activities by analyzing data collected from the built-in microphone and accelerometer, and uses them to assess the user's mental and physical wellbeing. A mobile app called iSleep [24], which is designed to analyze sound samples from the phone's built-in microphone, monitors the user's sleep quality along with other sleep-related events, such as snoring and coughing. Sleep Hunter [22] leverages both motion and acoustic data to detect light or deep sleep stages. The system presented in [46] detects the breathing rate during sleep from sound samples. Some recent studies focus on user experiences with mobile health systems, such as privacy concerns [1] and sharing behaviors [42]. Acoustic event recognition algorithms have been widely adopted in smartphone-based activity monitoring systems.
Auditeur [36] is a mobile-cloud service platform that allows a client's smartphone to recognize various sound events such as car honks or dog barking. SoundNet associates environmental sounds with words or concepts in natural languages to infer activities [33]. Crowd++ [56] counts the number of speakers in a conversation using MFCC (Mel-frequency cepstral coefficient) [48] features. The row mean vector of the spectrogram [3] is a simple but effective method for speaker recognition that compares the Euclidean distance between energy distribution features. However, these studies do not focus on voice recognition in the presence of significant noise, such as during a family meal.

3 Requirements and Challenges

HomeLog is designed to be an unobtrusive system that helps users keep track of their family routine. Specifically, it employs the built-in accelerometer and microphone of smartphones and smartwatches to detect two important family routine activities: family meals and TV viewing. For each of these activities, HomeLog logs the start and end time. We believe such fine-grained monitoring of family routine helps users better understand their lifestyle. In particular, these two activities constitute a significant portion of the time family members spend together at home. Specifically, HomeLog is designed to meet the following requirements: 1) Since HomeLog needs to operate in parallel with the family routine, it needs to be unobtrusive to use. It should minimize the burden on the user, and should not interfere with the users' daily activities in any way. 2) HomeLog needs to detect the family activities, their start/end time, and their participants in a robust fashion, i.e., across different users, smartphones, smartwatches and households. 3) Since family routine involves privacy-sensitive activities such as family conversation, the privacy of the family needs to be strictly protected. For example, the system should process the collected sensor samples on the fly and only keep the results, instead of storing or transmitting any raw data, which may contain sensitive information such as conversations. To meet these requirements, four challenges need to be addressed in developing HomeLog. First, in order to monitor the family routine in an unobtrusive manner, HomeLog samples and analyzes both acceleration and acoustic signals to detect various family activities. Family members are not required to carry their phones with them all the time, and children are not expected to wear any devices. However, the distance between the microphone and the sound source (e.g., family members in a conversation or the TV speaker) varies over time (possibly even during the same activity), which leads to highly variable acoustic features. For example, in a typical TV viewing scenario where the user is wearing a smartwatch, the loudness of the sound from the TV keeps changing as the user moves around the room, which makes it difficult to tell whether the TV is turned on simply from the captured sound volume. Moreover, the microphone in smart devices is designed to capture close vocals and usually has low sensitivity. Second, due to the inherently dynamic nature of the home environment, the activity detection is susceptible to various noises, such as sounds caused by pets.

Figure 1. System overview

Therefore, in order to provide fine-grained, accurate monitoring results, we need to design robust event detection and classification algorithms that can handle various noises in practice. Third, since the daily habits, the living environment, and the smart devices owned by users vary significantly among different families, a training process is required before using HomeLog. Moreover, the training period must be short in order to minimize the user's burden. Therefore, the training process only gains a basic understanding of the family's regular routine and captures several snapshots of the family activities. HomeLog needs to adapt to the dynamics of the family routine by adjusting its parameters based on the user's limited feedback. Lastly, HomeLog needs to process the sensor samples on the fly in order to preserve the user's privacy. As the acoustic signal is sampled at a relatively high frequency, the processing algorithm must also be lightweight enough to run in real time, yet effective at recognizing family routine activities.

4 System Design

Fig.1 shows the software architecture of HomeLog. The system can be installed on multiple smartphones (tablets are also supported) and smartwatches owned by the parents in the family, without imposing any constraints on the children. This ensures a high likelihood that at least one monitoring device is always close to the ongoing family activity. The motion of the smartwatches and the sound signal captured by each smart device are recorded. HomeLog detects family meals and TV viewing using features extracted from these data. In pre-processing, HomeLog keeps sampling data from the built-in accelerometer on the smartwatches and the microphone on every smart device. If a device is carried out of the home, or none of the family members is active, the data will not be processed further. The motion signal from the accelerometer is processed to obtain the user's current wrist motion status, which can potentially provide information about the user's activity. The collected acoustic signal is translated into 21 energy channels for each 5 ms frame. Specifically, HomeLog conducts an FFT on each frame and applies a Mel filter [5] to divide the energy spectrum from the FFT into 21 energy channels. Based on the experimental data collected in real families, we have identified a set of unique features for different family activities. Specifically, HomeLog uses the motion of the user's wrist and the clattering sound (mostly produced by tableware) to identify family meals, and special features of the low-energy sound frames to identify TV viewing. We divide time into 3-minute windows and consider all features within each window to perform activity detection. Such a window is referred to as a detection window hereafter. If a person's voice appears within a window a certain number of times, he or she is classified as a participant in the activity. The detection algorithm is built on the framework of a Hidden Markov Model (HMM).
The model is built and updated from three sources: a survey of the family, data recorded in the family for one day with manually labeled activities, and the interaction with the users. Under the definition of our HMM, for one specific family routine activity, each detection window is in one of two possible states, which indicate whether this activity occurs in the detection window. The goal is to find the best sequence of states (Markov chain) that describes the family meals and the TV viewing sessions over a whole day.

4.1 Pre-processing

To reduce the energy consumption of sound recording, we use a 16 kHz sampling rate, i.e., the lowest sampling rate commonly supported by smartphones, for our acoustic signal analysis. For every 5 ms frame of the sound signal sampled at 16 kHz, the Fast Fourier Transform (FFT) provides the energy distribution from 200 Hz to 8 kHz, with 40 effective values every 200 Hz [25][44]. However, the data size after the FFT is relatively large, which often prevents real-time, lightweight analysis in later stages of the signal processing pipeline. The Mel filter [31][39][5] provides a solution to this problem. The basic idea of the Mel filter is to use the Mel scale, which is based on just-noticeable differences of pitch, to build a series of triangular overlapping windows and to transform the data from the frequency domain into energies on channels. It simulates the human hearing process. After discarding the noise, which mainly falls below 80 Hz, we apply the rule of 1/3-octave bands from 80 Hz to 8 kHz [53] to spread 21 channels over the range from 80 Hz to 8 kHz. To the auditory sense, each pair of neighboring channels is equally spaced under this design. Then, we apply the Mel filter to transform the FFT result into 21 channels. The energy value on channel i is represented as e_i. Volume and pitch can be extracted from the energy distribution. Specifically, the volume of the sound signal, represented as V, is given by:

V = \sum_{i=1}^{21} e_i    (1)

The pitch of the sound signal is represented as P. In our design, the pitch corresponds to the channel with the highest energy, and the index of the dominating channel can be used to describe the pitch feature of the acoustic signal:

P = \arg\max_{i \in [1,21]} e_i    (2)

This process is performed every 5 ms on 10 ms segments of the acoustic signal, i.e., a 200 Hz framing rate with 50% overlap. This framing rate is also applied to all collected motion data. In order to reduce the computational load, we discard detection windows that only contain environmental noise (e.g., silence or the sound of a dishwasher) based on the variance of the sound volume V within a detection window. A low variance of V indicates that the window only contains continuous noise, which is usually captured at night when all family members are sleeping, or when no one is at home. Another case is when the device is carried out of the home, which can be detected by location sensors such as the Global Positioning System (GPS) or the Wi-Fi connection to the home hotspot. If a device cannot provide useful information for a detection window, it does not process the data further at that moment. When HomeLog runs on multiple devices, it is possible that only a few of them capture data of interest (phones may be left charging somewhere far away from the activities). This step ensures that power consumption is reduced when family routine activities are not likely to be sensed, while the devices remain active whenever they are able to detect the activities. The remaining step in the pre-processing is to extract useful features related to the family routine activities from the motion data and the sound signal. In the next sections, we introduce these features, explain why we select them, and show how they are used to facilitate detection.
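As a concrete illustration of this pre-processing step, the following minimal Python sketch (not part of the original system) reduces one audio frame to 21 channel energies, the volume V of Eqn (1), and the pitch P of Eqn (2). The band edges, the use of plain band sums in place of triangular Mel windows, and all names are simplifying assumptions.

import numpy as np

FS = 16000                          # 16 kHz sampling rate
FRAME = int(0.010 * FS)             # 10 ms analysis window ...
HOP = int(0.005 * FS)               # ... hopped every 5 ms (50% overlap)

# 22 edges -> 21 channels, roughly evenly spaced on a logarithmic (1/3-octave-like)
# scale from 80 Hz to 8 kHz; the exact edges are an assumption.
BAND_EDGES = np.geomspace(80.0, 8000.0, num=22)

def frame_features(frame):
    """Return (e_1..e_21, V, P) for one audio frame (Eqns (1) and (2))."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2            # energy spectrum of the frame
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)       # frequency of each FFT bin
    e = np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                  for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:])])
    volume = e.sum()                                       # V = sum of channel energies
    pitch = int(np.argmax(e)) + 1                          # P = dominating channel (1-based)
    return e, volume, pitch

def frames(signal):
    """Slide the 10 ms window over the signal with a 5 ms hop."""
    for start in range(0, len(signal) - FRAME + 1, HOP):
        yield signal[start:start + FRAME]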
4.2 Family Meal Detection

A well-defined family meal has two characteristics. First, the participants should be eating something. Second, parents and children should both be present. According to these characteristics, the family meal can be described with three main features. The first feature is the clattering sound caused by clashes between tableware. The clattering sound is the most distinctive acoustic characteristic of the family dining activity, regardless of other dynamics such as the type of food and the variation of tableware. The second feature is the wrist gesture of the users (parents only), detected by the smartwatches. A typical example is that the user usually sits with the arm on the table during the family meal, and the hand is neither as active as during physical exercise nor as stable as when reading a book. The third feature is the conversation between family members. It is very common for a family meal to contain a large amount of conversation between parents and children.

4.2.1 Clattering Sound

To describe the occurrences and frequency of the clattering sound within a detection window, the system looks for an energy peak on channels 12 to 16 (associated with frequencies ranging from 1 to 4 kHz) in each 5 ms frame. To detect whether a frame contains clattering sound, we adopt a lightweight algorithm. For each frame, we compute e_all, the average energy over all channels, and e_12-16, the average energy across channels 12 to 16. We use the ratio r = e_12-16 / e_all to detect clattering sound. For example, Fig.2 shows an example of clattering sound detection in a typical family meal scenario. Fig.2(a) shows the energy on the 21 channels over time, and Fig.2(b) shows the corresponding e_12-16 and e_all. We can see that one occurrence of clattering sound may result in several consecutive clattering frames, and always leads to a higher e_12-16, even when the clattering sound and human voice overlap (around the 1-second mark). Therefore, comparing e_12-16 and e_all is a simple and effective way of detecting clattering sound in a typical family meal scenario. In our study, we set up a training data set that contains one-week sound recordings from 5 families, and manually labeled 3 frames with clattering sound. For this training set, we apply Gaussian Kernel Density Estimation (Gaussian KDE) [49] to calculate the probability density function (PDF) p(r | Clattering), which represents the distribution of r for clattering sound. Then we apply Bayes' rule to derive P(Clattering | r), the probability that a frame contains clattering sound given r. To describe the occurrence and density of the clattering sound in a detection window, we use E[N_Clattering], the expected number of frames containing clattering sound, derived as the sum of P(Clattering | r) over all frames in the detection window. Fig.3 shows an example of family meal detection based on a real data set collected in a home. We can see that all family meal windows contain large numbers of clattering frames. The clash of other objects, such as keys and coins, can also produce a similar sound. Different from the clattering frames of a dining activity, such false alarms are usually isolated and are not likely to occur in bursts.
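The frame-level clattering detector described above could be sketched as follows, reusing the frame_features() helper from the earlier sketch. The placeholder training ratios and the prior are illustrative assumptions, not the authors' data; only the ratio r, the KDE densities, Bayes' rule, and the E[N_Clattering] sum follow the text.

import numpy as np
from scipy.stats import gaussian_kde

def clatter_ratio(e):
    """r: mean energy over channels 12-16 divided by the mean energy over all channels."""
    e = np.asarray(e, dtype=float)
    return e[11:16].mean() / (e.mean() + 1e-12)

# Placeholder training ratios standing in for the hand-labeled frames (assumed values):
# clattering frames carry a larger share of 1-4 kHz energy.
rng = np.random.default_rng(0)
r_clatter = rng.normal(2.5, 0.5, 300)             # ratios of frames labeled "clattering"
r_other = rng.normal(1.0, 0.3, 3000)              # ratios of the remaining frames

pdf_clatter = gaussian_kde(r_clatter)             # p(r | Clattering)
pdf_other = gaussian_kde(r_other)                 # p(r | no Clattering)
prior = len(r_clatter) / (len(r_clatter) + len(r_other))

def p_clatter_given_r(r):
    """Bayes' rule: P(Clattering | r)."""
    num = pdf_clatter(r) * prior
    den = num + pdf_other(r) * (1.0 - prior)
    return float(num / den)

def expected_clatter_frames(window_energies):
    """E[N_Clattering] for one detection window: the sum of per-frame probabilities."""
    return sum(p_clatter_given_r(clatter_ratio(e)) for e in window_energies)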

Figure 2. An example of clattering sound detection in a typical family meal scenario. (a) shows the energy on 21 channels over time, where clattering sound and human voice are marked with rectangles. (b) shows the comparison between e_12-16 and e_all for the same sound clip.

Figure 3. An example of family meal detection. Each bar represents the expected number of frames containing clattering sound in a detection window.

4.2.2 Wrist Gesture

The features from the motion data can be used to characterize the behavior of the user. In order to reduce the computational load, we sample the acceleration data from the accelerometers every 5 ms. With such a low sampling rate, we cannot focus on microscopic features of the motion data. Here we focus on the overall gesture of the user in a detection window. Since the watch is always worn on the wrist, as shown in Fig.4, the direction of the X-axis is always parallel to the arm. Therefore, the acceleration on the X-axis of a smartwatch is determined by gravity and the overall gesture of the arm. In addition, to learn how active the user's hand is, we also consider the changing rate of acceleration. If the acceleration data from each frame is represented by a vector (acceleration on the X-axis, Y-axis, and Z-axis), the changing rate of acceleration between two such vectors is represented by the angle between them. This angle describes the rotation of the smartwatch. From these observations, we select two features of the motion data for the detection of family meals, namely the acceleration on the X-axis (A_x) and the changing rate of acceleration (R_c). The value of A_x is derived as the average acceleration on the X-axis over each 5 ms, and the value of R_c is derived as the average changing rate of acceleration between each pair of neighboring frames. The behavior while eating is distinct from other common activities at home, such as cooking, reading, video gaming, TV viewing, and physical exercise, and can be observed from the acceleration data.

Figure 4. The direction of the X, Y, and Z-axes of the accelerometer in a smartwatch. The acceleration depends on gravity and the motion of the wrist. When the watch is not moving, the acceleration on one axis reads 9.8 m/s^2 if the direction of that axis points straight toward the ground.
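Below is a minimal sketch of the two motion features, assuming the accelerometer delivers one (x, y, z) vector per 5 ms frame; the function name and array layout are illustrative.

import numpy as np

def wrist_features(acc):
    """Motion features for one detection window.

    acc: array of shape (n_frames, 3), one (x, y, z) acceleration vector per 5 ms frame.
    Returns (A_x, R_c): the mean X-axis acceleration and the mean rotation angle
    (in radians) between neighboring acceleration vectors.
    """
    acc = np.asarray(acc, dtype=float)
    a_x = acc[:, 0].mean()                          # A_x: average acceleration along the arm

    v1, v2 = acc[:-1], acc[1:]                      # pairs of neighboring frames
    cos = np.sum(v1 * v2, axis=1) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1) + 1e-12)
    angles = np.arccos(np.clip(cos, -1.0, 1.0))     # rotation of the watch between frames
    r_c = angles.mean()                             # R_c: average changing rate of acceleration
    return a_x, r_c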
4.2.3 Conversation

The goal of conversation detection is to identify the occurrence of a conversation, as well as the family members who participate in it. A family meal is expected to include a large amount of the family members' voices. The speaker recognition technique presented in [16] shows that the pronunciation of vowels is an identifying characteristic of a person. However, maintaining a complete voice database for each family member is costly for smart devices. The row mean vector of the spectrogram [3] provides an effective and efficient approach to recognizing speakers by measuring the Euclidean distance between energy distributions in the frequency domain, which are already produced by the pre-processing step. For each 5 ms frame, we run the speaker recognition to calculate the probability that the frame contains the voice of at least one family member, represented as P(Voice | {e_i, i ∈ [1,21]}). Within a detection window, the feature we extract for conversation is E[N_Voice], the expected number of frames that contain the family members' voices, derived as the sum of P(Voice | {e_i, i ∈ [1,21]}) over all frames in the detection window.
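One way the row-mean-vector matching and the E[N_Voice] feature could be realized is sketched below, assuming a few template vectors enrolled per family member. The exponential mapping from Euclidean distance to a probability is an illustrative assumption, not the formulation used by the paper.

import numpy as np

def row_mean_vector(channel_energies):
    """Row mean vector of the spectrogram: mean energy per channel over a set of frames."""
    return np.asarray(channel_energies, dtype=float).mean(axis=0)

def p_voice(frame_energy, member_templates, scale=1.0):
    """Probability that one frame contains a family member's voice (illustrative).

    member_templates: enrolled 21-dimensional row mean vectors, one or more per member.
    The exponential mapping from distance to probability is an assumption.
    """
    e = np.asarray(frame_energy, dtype=float)
    e = e / (e.sum() + 1e-12)                                # normalize the energy distribution
    dists = [np.linalg.norm(e - t / (t.sum() + 1e-12)) for t in member_templates]
    return float(np.exp(-min(dists) / scale))                # closer template -> higher probability

def expected_voice_frames(window_energies, member_templates):
    """E[N_Voice]: sum of per-frame voice probabilities over a detection window."""
    return sum(p_voice(e, member_templates) for e in window_energies)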

4.3 TV Viewing Detection

TV viewing is difficult to detect because it often consists of a vast variety of different sounds. Even for a particular TV program, it is often challenging to find underlying acoustic features that uniquely identify the TV viewing activity. Therefore, instead of relying on frame-based acoustic features, HomeLog exploits characteristics within a detection window to detect TV viewing. These characteristics, reflecting the energy distribution and the variance of pitch, are not only efficient to calculate, but also more robust across different TV programs and much less susceptible to dynamics such as the distance between the smart devices and the TV. Specifically, to detect TV viewing, the system applies the following two features for each detection window. The first feature is the volume/pitch distribution, and the second feature is a fusion of the sound signals captured by multiple devices.

4.3.1 Volume Distribution & Pitch Variance

In order to detect TV viewing, two features are effective for each detection window. 1) Percentage of low-energy frames (Percent): the percentage of frames with root mean square (RMS) [2] less than 5% of the mean RMS within a detection window. 2) Variance of pitch of low-energy frames (Var_pitch): the variance of the pitch, defined as P in Eqn.(2) of Section 4.1, for frames with RMS less than 5% of the mean RMS within a detection window. Percent reflects the energy distribution within the window, and works well in separating TV sound from other foreground sounds that often involve human activities, such as family meals and conversation. This is primarily due to the fact that TV sound is usually more continuous (i.e., it contains fewer pauses or quiet frames) than foreground sounds. Therefore, the energy distribution of TV sound is more right-skewed, resulting in fewer low-energy frames and therefore a smaller Percent. Var_pitch also focuses on the low-energy frames within a detection window and describes the stability of the pitch, making it a good supplement for identifying TV sound. Due to its continuous nature, TV sound has a more stable pitch in low-energy frames compared with other foreground sounds. Fig.5 shows an example of identifying the TV viewing activity based on the feature space formed by Percent and Var_pitch. We can see that the dining and TV viewing activities can be separated in this feature space.

Figure 5. A data set including TV viewing and family meal in the feature space formed by the percentage of low-energy frames and the variance of pitch of low-energy frames. Each mark represents a detection window, labeled with ground truth.

4.3.2 Feature Fusion

When there are multiple devices in a household and none of them is moved significantly, HomeLog can leverage them to improve the sensing performance. A novel feature fusion algorithm can significantly improve the accuracy of TV detection by fusing the acoustic data captured by multiple phones in a home. Our idea of multi-device fusion is based on the following observation: the TV is a sound source with a fixed location, whose volume stays within a limited range over a relatively short period of time. This means detection can benefit from the localization of sound sources. We focus on the fusion algorithm for two devices, although it can be extended to more general scenarios. Our design is based on binaural hearing [34], a technique for determining the direction and origin of sounds using two sound receivers. In particular, we present a fusion algorithm based on the interaural level difference (ILD), the difference between the captured sound volumes, which is a basic feature for recognizing sound sources. The process of feature fusion consists of three steps: similarity check, sound source detection in high-energy frames, and sound source detection in low-energy frames. In the first step, the goal of the similarity check is to figure out whether the two devices are at home and near each other, by examining the similarity between the sounds captured by the two devices based on ILD. We define the detection windows that start at the same time instant on two different devices as binaural detection windows. To describe the similarity between binaural detection windows A and B, we define the Average Cosine Similarity per Frame, C(A,B), as:

C(A,B) = \frac{1}{l} \sum_{i=1}^{l} \cos(E(A,i), E(B,i))    (3)

Here, the energy distribution of frame i in detection window X is represented as the vector E(X,i), and l is the number of frames in a detection window. If C(A,B) is above a threshold, the two devices are likely close to each other, and the feature fusion algorithm continues to the next step.
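Before moving to the next step, here is a minimal sketch of the two window-level TV features and the similarity check of Eqn (3), assuming per-frame RMS values, pitches, and channel energies are available from pre-processing; the threshold mentioned in the final comment is an assumption.

import numpy as np

def tv_features(rms, pitch):
    """Window-level TV-viewing features (Percent, Var_pitch) from Section 4.3.1.

    rms:   per-frame RMS values within one detection window
    pitch: per-frame pitch (dominating channel index) within the same window
    """
    rms = np.asarray(rms, dtype=float)
    low = rms < 0.05 * rms.mean()                 # low-energy frames: RMS below 5% of the mean
    percent = float(low.mean())                   # Percent: fraction of low-energy frames
    var_pitch = float(np.var(np.asarray(pitch)[low])) if low.any() else 0.0
    return percent, var_pitch

def avg_cosine_similarity(E_a, E_b):
    """Eqn (3): average cosine similarity per frame between binaural windows A and B."""
    E_a, E_b = np.asarray(E_a, dtype=float), np.asarray(E_b, dtype=float)
    num = np.sum(E_a * E_b, axis=1)
    den = np.linalg.norm(E_a, axis=1) * np.linalg.norm(E_b, axis=1) + 1e-12
    return float(np.mean(num / den))

# Example admission test for the fusion pipeline (the 0.8 threshold is an assumption):
# if avg_cosine_similarity(E_a, E_b) > 0.8: proceed to sound-source detection.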
The next step in feature fusion is to detect the number of sound sources in the binaural detection windows.

Figure 6. An example of TV viewing with conversation. (a) shows the volume captured by two smartphones. (b) shows the volume ratio between corresponding frames.

If the acoustic signals captured by the two devices are highly similar, the number of sound sources can be estimated by ILD. If the acoustic signal in high-energy frames originates from a single sound source, it is more likely caused by the TV. In contrast, if the acoustic signal in low-energy frames originates from multiple sound sources, it is more likely to be caused by human activities other than TV. The method we use to detect sound sources is based on acoustic localization by ILD [3]. Specifically, if the acoustic signal is from a single source and captured by two receivers, it satisfies V_1 / V_2 = d_1^2 / d_2^2 = ∆V, where V_1 and V_2 are the volumes received by the receivers and d_1 and d_2 are the distances between the receivers and the sound source. This equation can be applied to compute the relative distances between the sound source and the devices. In indoor scenarios, ∆V may be affected by various factors (e.g., echoes and obstacles), but its coefficient of variation is limited when d_1 and d_2 are fixed. To detect whether the acoustic signals come from the same source, we define the Coefficient of Variation of the Volume Ratio per Frame, CV(A,B), in binaural detection windows A and B as:

CV(A,B) = \frac{\sigma(\Delta V(A,B))}{\mu(\Delta V(A,B))}, \quad \Delta V(A,B) = \left\{ \frac{V_{A,i}}{V_{B,i}},\ i \in [1,l] \right\}    (4)

Here, the volume of frame i in detection window X is represented by V_{X,i} from Eqn.(1), µ(∆V(A,B)) is the mean of the volume ratios between A and B, and σ(∆V(A,B)) is the standard deviation of the volume ratios. CV(A,B) is thus the ratio of the standard deviation to the mean.
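A minimal sketch of Eqn (4) and the resulting single-source test, assuming aligned per-frame volumes from the two devices; the decision threshold is an illustrative assumption.

import numpy as np

def volume_ratio_cv(vol_a, vol_b):
    """Eqn (4): coefficient of variation of the per-frame volume ratio V_A,i / V_B,i."""
    ratios = np.asarray(vol_a, dtype=float) / (np.asarray(vol_b, dtype=float) + 1e-12)
    return float(ratios.std() / (ratios.mean() + 1e-12))

def likely_single_source(vol_a, vol_b, threshold=0.3):
    """A low CV(A,B) suggests one fixed sound source (e.g., the TV).

    The 0.3 threshold is an illustrative assumption, not a value from the paper.
    """
    return volume_ratio_cv(vol_a, vol_b) < threshold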

The lower CV(A,B) is, the more likely the acoustic signals come from a single source. Fig.6 shows an example of how to detect sound sources by the volume ratio. In the first 20 seconds, phone B is carried by the user from the dining table to the sofa. The TV is turned on at the 30th second. During the 70th-75th seconds and the 140th-150th seconds, the subjects talk to each other. We can see that when the frames contain only TV sound, the volume ratio is relatively stable. In contrast, as conversation involves multiple sound sources, the variance of the volume ratio increases significantly. We now discuss how the above feature fusion algorithm can improve the accuracy of TV detection in several challenging scenarios: 1) quiet TV programs that contain discontinuous, low-volume sound may be recognized as noise; 2) TV programs that contain sounds similar to a family meal or conversation may be misclassified; and 3) home parties or party-like events with a continuous sound profile (e.g., a large amount of conversation) may be misclassified as TV. In a relatively quiet scenario, where the TV sound is discontinuous, the acoustic signal in high-energy frames still comes from a single sound source, so the event will be recognized as TV viewing by feature fusion. If the high-energy frames contain clattering sound and conversation, but they come from a single sound source, they are more likely to come from the TV than from family activities. If the high-energy frames come from multiple sound sources, we also check CV(A,B) over all low-energy frames. These frames usually contain sound from the TV and noise, which are continuous and have low volume. In the scenario of home parties, continuous conversation or music is similar to TV sound. However, these low-energy frames come from multiple sound sources and hence will not be misclassified.

4.4 Model Building

In order to establish the connection between the family routine activities and the data collected by the smartphones and the smartwatches, we build a mathematical model that calculates the probability of the activities' occurrences from the features extracted from the data. The first step of our model building is to identify the relationship between an activity and the information we can retrieve. Since we only focus on typical activities in the family, two significant characteristics can be summarized. First, the occurrences of these activities are influenced by time and date. For example, it is reasonable for a family to watch a specific TV program at a fixed time of day. Second, the durations of these activities are predictable. For example, a family meal is more likely to last for 20 minutes than for 60 or more minutes. Here we focus on building the model for one activity, because family meals and TV viewing can be described with the same model. If we assume that the data collected by the smartphones and the smartwatches reflect the occurrence of the activity, we can combine all related factors to build a Bayesian Network (BN) for one detection window in this scenario, as shown in Fig.7(a). With this BN, the goal of our system can be expressed as: given all facts and observations, find the probability of the state the detection window is in.
Figure 7. Models for family routine activities. (a) The Bayesian Network model relating facts, state, and observations; (b) the Hidden Markov Model of one activity.

A more practical model for this case is an HMM of the family routine, as shown in Fig.7(b). The HMM assigns two possible states to a detection window, "activity occurs" and "activity does not occur", corresponding to the State node in the BN. The observations of the HMM are the data collected by the smartphones and smartwatches, corresponding to the Observation nodes in the BN. We define the HMM as λ, and the goal of our system can be expressed as Eqn.(5):

\arg\max_{X} P(X \mid \lambda, O)    (5)

Here, X is a sequence of states over l detection windows, given by X = {x_1, x_2, ..., x_l}; O is a sequence of observations in the l detection windows, given by O = {o_1, o_2, ..., o_l}. A Markov chain is built from X and O, which starts in the morning when the family members wake up (we have x_1 = 2, which means no activity occurs at the very beginning of the day) and ends when the family members go to sleep at night. In our system, O is obtained by analyzing the data from the smart devices. Once the maximum-likelihood X that explains O is found, the activity's occurrences in these l detection windows are determined. This can be done with the Viterbi algorithm. From the discussion above, we have the simple expression λ = {Φ, Θ} for this specific case. Here, Φ represents the transition probabilities between the two states, given by Φ = {φ_(1,1), φ_(1,2), φ_(2,1), φ_(2,2)}; Θ represents the emission parameters for the observations associated with the two states, given by Θ = {θ_(o,1), θ_(o,2)}, where o is an observation from the smart devices in a detection window. However, according to the BN, the transition probabilities in Φ are not fixed for every detection window, but depend on Fact nodes such as time and date; moreover, the probability of a specific observation in a state involves features from multiple sources (motion data and the sound signal) and follows a continuous distribution due to the complexity of the sensor data. Thus, our HMM is a Hidden Markov Model with Dynamic Transition Probabilities and Multiple Continuous Observations [27]. In the next sections, we introduce how to decide the parameters Φ and Θ for this HMM.
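As an illustration of how Eqn (5) can be decoded, the sketch below runs the Viterbi algorithm for the two-state model, assuming callables that supply the (time-dependent) transition probabilities and the emission densities. It shows the decoding step only and is not the authors' implementation.

import numpy as np

def viterbi(n_windows, transition, emission):
    """Decode the most likely state sequence for one activity over a day (Eqn (5)).

    n_windows:  number of 3-minute detection windows in the day
    transition: transition(t, s_prev, s_next) -> probability, may depend on time/date
    emission:   emission(t, s) -> density of observation o_t under state s
    States: 1 = activity occurs, 2 = no activity; the day starts in state 2.
    """
    states = (1, 2)
    log_prob = {1: -np.inf, 2: 0.0}       # best log-probability of ending in each state
    back = []                             # back-pointers for every window
    for t in range(n_windows):
        new_log_prob, ptr = {}, {}
        for s in states:
            cands = [(log_prob[p] + np.log(transition(t, p, s) + 1e-300), p) for p in states]
            best, prev = max(cands)
            new_log_prob[s] = best + np.log(emission(t, s) + 1e-300)
            ptr[s] = prev
        back.append(ptr)
        log_prob = new_log_prob
    # Backtrack the maximum-likelihood sequence X = {x_1, ..., x_l}.
    path = [max(states, key=lambda s: log_prob[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))[1:]       # drop the implicit initial "no activity" state

In HomeLog, transition(t, ...) would be derived from the occurrence and duration models of Section 4.4.1, and emission(t, s) from the KDE-based densities of Section 4.4.2.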

4.4.1 Transition Probabilities

The transition probabilities, i.e., Φ, contain four entries {φ_(1,1), φ_(1,2), φ_(2,1), φ_(2,2)}. According to the definition of our HMM, when the activity is not occurring, we only need to know the probability of its occurrence in the next detection window. On the other hand, while the activity is currently occurring, we only need to know the probability that it continues into the next detection window. Therefore, two models are enough to describe all the transition probabilities: the probability distribution of an activity's occurrence with respect to time/date, and the probability distribution of its duration. The best way to gain knowledge of these probability distributions is to study the users' descriptions of their own family routines. In our study, we collected the ground truth of the activities in 8 families, and conducted interviews with them about their own perception of their family routine activities after the data collection. They gave us estimates of the starting times and the ranges of durations for the family meals and TV viewing. After comparing their descriptions with the ground truth, we found it is possible to build models of these probability distributions based solely on the content of the interviews. Assuming that a survey for one family shows that they usually have 2 meals at home on a weekday, occurring around 7:00 am and 6:00 pm, realistic models appear as the Gaussian distributions shown in Fig.8(a). Specifically, we use 7:00 am and 6:00 pm as the means of the Gaussian distributions, and choose 20 minutes as the standard deviation in our cases. The probability is also scaled by the number of meals at home, which means that the sum of all probabilities in a day equals 2. We can also apply a Gaussian distribution to the duration of each activity. For example, Fig.8(b) shows a probability distribution of the duration of family meals for a family who described their meals at home as lasting for 15 to 25 minutes. This process establishes the relationship between the transition probabilities of our HMM and the time of day. Different models should be applied for weekdays and weekends, because the family routine might differ. The transition probabilities of our HMM can be read directly from these probability distributions. When applying the Viterbi algorithm, the transition probability from S_2 to S_1 is equal to the probability of the activity's occurrence according to the time/date; the transition probability from S_1 to S_1 is equal to the probability that the activity continues for the next 3 minutes. Because the transition probabilities depend on previous states (the influence of the activity's duration), we need to adjust the structure of our HMM to ensure that the Viterbi algorithm runs properly. The adjusted structure is shown in Fig.9. All transition probabilities in the new structure are independent of earlier states, but the required memory space increases significantly. In practice, we run the Viterbi algorithm over 40 states in this HMM (i.e., 40 detection windows) to ensure that any activity within 2 hours can be handled.

Figure 8. An example of the probability distributions of the family meal's occurrence and duration. (a) The probability distribution of the family meal's occurrence during a weekday, according to "usually at 7:00 am and 6:00 pm"; (b) the probability distribution of the family meal's duration, according to "lasting for 15 to 25 minutes".
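The interview answers could be turned into these two Gaussian models, and then into transition probabilities, roughly as sketched below for 3-minute windows; the survival-ratio reading of the duration curve and the parameter values are assumptions for illustration.

import numpy as np
from scipy.stats import norm

WINDOW_MIN = 3.0   # length of a detection window in minutes

def occurrence_prob(minute_of_day, meal_times=(7 * 60, 18 * 60), std_min=20.0):
    """phi(2,1): probability that the activity starts in the window at this time of day.

    meal_times: interview-reported usual start times (7:00 am and 6:00 pm) in minutes.
    Each reported meal contributes one Gaussian, so the probabilities over a whole
    day sum to (approximately) the number of meals, as described above.
    """
    return sum(norm.pdf(minute_of_day, loc=m, scale=std_min) * WINDOW_MIN
               for m in meal_times)

def continue_prob(elapsed_min, dur_mean=20.0, dur_std=2.5):
    """phi(1,1): probability that a meal lasting elapsed_min continues 3 more minutes.

    The duration Gaussian is fitted to "lasting for 15 to 25 minutes"; reading it as a
    survival ratio is an assumption about how the curve in Fig.8(b) is used.
    """
    survive_now = 1.0 - norm.cdf(elapsed_min, loc=dur_mean, scale=dur_std)
    survive_next = 1.0 - norm.cdf(elapsed_min + WINDOW_MIN, loc=dur_mean, scale=dur_std)
    return survive_next / max(survive_now, 1e-12)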
Figure 9. The adjusted HMM. One activity is divided into multiple states S_1,k, indicating that the activity has already lasted for k detection windows. Here S_1,k can only transition to S_1,k+1 or to S_2, meaning that the activity either continues into the next detection window or stops.

4.4.2 Observations

The observation o within one detection window is represented as a vector of features, i.e., o = <f_1, f_2, f_3, ...>, where f_i represents a feature related to the activity. The features for family meal and TV viewing detection are shown in Table 1 and Table 2. Here, each feature is described as a continuous value. Our goal is to calculate P(o|S) and find the best expression of θ_(o,S).

Table 1. Features for family meal detection
  E[N_Clattering] - the expected number of frames containing clattering sound
  A_x - the average acceleration on the X-axis
  R_c - the changing rate of acceleration
  E[N_Voice] - the expected number of frames containing the family members' voices

Table 2. Features for TV viewing detection
  Percent - the percentage of low-energy frames
  Var_pitch - the variance of pitch of low-energy frames
  CV(A,B) - the coefficient of variation of the volume ratio per frame in binaural detection windows A and B (optional)

After HomeLog is deployed in a new family and the survey of the regular family routine is completed, as discussed in Section 4.4.1, it runs for one day to collect data and labels the data with ground truth according to the results of the survey. A training data set is constructed in this way.

Based on the training data set, we can apply Gaussian KDE to calculate two PDFs, p(o|1) and p(o|2), corresponding to the observations while the activity is occurring or not occurring. To transform these PDFs into the probabilities P(o|1) and P(o|2), we follow the assumption below:

P(o \mid S) = \lim_{\mu(B) \to 0} P(B \mid S) = \lim_{\mu(B) \to 0} \int_B p(o \mid S)\, do = \beta\, p(o \mid S)    (6)

Here, B is a small continuous space whose measure approaches zero, and we have o ∈ B. We can treat the probability of observation o as the probability of observations in its neighborhood B. We then use a constant β to describe the measure of B, which builds a connection between P(o|S) and p(o|S). We choose θ_(o,S) = p(o|S), because β is treated as a constant that does not affect the outcome of the Viterbi algorithm. A special feature is CV(A,B) from the feature fusion in TV viewing, because it is not always available to be observed. However, this still does not affect the outcome of the Viterbi algorithm, because the influence of adjusting the observation vector o in some detection windows is limited to those detection windows and does not propagate to the whole Markov chain.

4.4.3 Updating the HMM

In order to achieve high detection accuracy and maintain an up-to-date model for a gradually changing environment, HomeLog keeps updating the parameters of the HMM through interaction with the user. Specifically, whenever the user views the detected family routine presented by HomeLog, he or she may correct errors in it or confirm that it is accurate. After receiving the user's review, HomeLog treats all recently confirmed results and the corresponding features as a new training data set. The parameters of the HMM are updated accordingly. Through this process, the probability distributions of the activities' occurrences and durations, as shown in Fig.8, become closer to the truth for the family, by re-estimating the means and standard deviations of these distributions. Furthermore, the PDFs of the observations associated with the states are also adjusted. Out-of-date data are discarded from the training set and deleted to save storage. To keep HomeLog an unobtrusive monitoring system, users are only requested to review and correct occasional errors to tune its detection ability. As our evaluation shows, the user does not need to adjust HomeLog many times, because of its high accuracy.

5 Performance Evaluation

In order to evaluate the performance of HomeLog, we recruited 8 families for data collection. Our study has been approved by the Institutional Review Board (IRB) of the authors' institution. This group of families is referred to as the second group hereafter. The experiment lasted one or two weeks for each family. We provided each family with multiple devices. The app pre-installed on the devices continuously records audio and motion unless the device is taken out of the home. Users may manually start or stop the app on any device. Each parent of the family is required to carry one of the smartphones as his or her own phone, and the other smartphone is kept at a relatively fixed location at home (e.g., left charging somewhere). At least one of the parents of the family is required to wear a smartwatch if smartwatches are available. These requirements take into account the habits of different users (i.e., carrying the phone, leaving the phone at a relatively fixed position at home, etc.).
They also ensure that family activities are captured by at least one device. To obtain the ground truth, we first gained knowledge of the subjects' regular family routine from the interviews with them, then listened to the recordings and manually labeled the durations of the activities. The details of the participating families and the collected data are shown in Table 3. Because we offer the subjects the right to delete any recorded clips due to privacy concerns, not all of the family routine activities during the experiments were recorded by the app.

5.1 Micro-scale Routine Analysis

HomeLog is designed to provide fine-grained family routine logging, which allows family members to review their activities. We analyze the family routine in detail using the data collected over one or two weeks. To show the advantage of our HMM for family routine activities, we compare its performance with the classification results of a Support Vector Machine (SVM), which recognizes the activities only from the observations of the smart devices, without awareness of the time/date or the activity's duration. Such analysis also sheds light on the lifestyle of a family and motivates family members to make positive changes. In this section, we use the results from family 4 as an example. In Section 5.2, we analyze the detection accuracy of different activities across all families. The detection results along with the ground truth are shown in Fig.10. Detailed long-term family activity logs like these make it possible to analyze routine patterns/anomalies and suggest possible ways to improve family lifestyles. We can see that the family usually has dinner around 7-8 pm for about an hour, except for day 5, which is a Friday, when they started dinner at around 8 pm for about 20 minutes. Moreover, the family watched more TV and finished TV viewing later than on most other days during the week, possibly due to the fact that it was a Friday. Compared with the ground truth, we can see that HomeLog is accurate in detecting most of the activities. On days 3 and 5, SVM detection produces a few misclassifications for the dining activity, due to interference caused by TV viewing. However, these errors are reduced by the HMM detection. Furthermore, by considering the time/date and the activity's duration, the HMM can eliminate improper detection results that show discontinuous activities or short activities that last only a few minutes.

5.2 Evaluation of Event Detection

In this section, we focus on the performance of HomeLog in single-activity detection. The main objective is to evaluate the detection accuracy of the occurrence and the duration of each activity. The data collected in the experiments are processed using the methods described in Section 4.2, and the training data is built based on the interviews with the subjects and the data collected on the first day of the experiment.

Table 3. Families that participated in the experiment and their daily routine
(Columns: Family; Children, ages in years; Phone; Smartwatch; Data, weeks; Family Meal, number; TV Viewing, number)
1 1 daughter(5) Nexus 4 N/A daughter(4) Nexus 4 N/A daughters(5, 8),2 sons(1, 3) Nexus 4 Sony Smartwatch 3 sons (1, 3, 5) Nexus 3 Sony Smartwatch 3 sons (3, 5) Moto G N/A daughters(1,3),1 son(7) Moto G2 2 Sony Smartwatch 3 daughters(3,11),2 sons(7,13) Moto G2 2 N/A daughters(7,1,18) Moto G2 2 Sony Smartwatch 3 1 2

Figure 10. Detected family routine based on data collected from family 4 during 5 days.

First, for each detection window (3 minutes), we detect family activities including the family meal and TV viewing. In the rest of this section, we discuss three types of detection results: 1) The overall detection accuracy of HomeLog for each detection window. We use precision and recall as the metrics for this evaluation. Specifically, we define the precision of detecting activity A as the number of true-positive windows divided by the total number of windows detected as A. The recall of detecting activity A is defined as the number of true-positive windows divided by the total number of windows that are actually associated with A. We do not take into account the true negatives, because most of the windows containing no activities are detected or discarded. 2) The detection accuracy of the occurrence of each activity. We report the sum of false negatives and false positives for this metric. 3) The detection accuracy of the duration of each activity. We report the average detection error of each activity's start/end time in minutes for this metric. In our design of HomeLog, the detection result is further calibrated by the user's review. The HMM of the family routine is updated in a dynamic environment as described in Section 4.4.3. The performance of the updated HMM is also shown in the next sections.

Figure 11. Overall accuracy of family meal detection in detection windows. The average precision and recall by HomeLog with the HMM are 8.7% and 89.5%, respectively.

Figure 12. Errors in detecting the family meal's occurrence. The number of errors is the sum of false negatives and false positives over each family's meals and detection results.

5.2.1 Family Meal Detection

In this section, we evaluate the performance of family meal detection. Fig.11 shows the detection results for the 8 families. We can see that by applying the HMM instead of SVM detection, HomeLog increases the recall by up to 11.45% (6.82% on average). This is primarily because the HMM


More information

Department of Electrical & Electronic Engineering Imperial College of Science, Technology and Medicine. Project: Real-Time Speech Enhancement

Department of Electrical & Electronic Engineering Imperial College of Science, Technology and Medicine. Project: Real-Time Speech Enhancement Department of Electrical & Electronic Engineering Imperial College of Science, Technology and Medicine Project: Real-Time Speech Enhancement Introduction Telephones are increasingly being used in noisy

More information

Improving Frame Based Automatic Laughter Detection

Improving Frame Based Automatic Laughter Detection Improving Frame Based Automatic Laughter Detection Mary Knox EE225D Class Project knoxm@eecs.berkeley.edu December 13, 2007 Abstract Laughter recognition is an underexplored area of research. My goal for

More information

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University A Pseudo-Statistical Approach to Commercial Boundary Detection........ Prasanna V Rangarajan Dept of Electrical Engineering Columbia University pvr2001@columbia.edu 1. Introduction Searching and browsing

More information

Getting Started. Connect green audio output of SpikerBox/SpikerShield using green cable to your headphones input on iphone/ipad.

Getting Started. Connect green audio output of SpikerBox/SpikerShield using green cable to your headphones input on iphone/ipad. Getting Started First thing you should do is to connect your iphone or ipad to SpikerBox with a green smartphone cable. Green cable comes with designators on each end of the cable ( Smartphone and SpikerBox

More information

Singer Traits Identification using Deep Neural Network

Singer Traits Identification using Deep Neural Network Singer Traits Identification using Deep Neural Network Zhengshan Shi Center for Computer Research in Music and Acoustics Stanford University kittyshi@stanford.edu Abstract The author investigates automatic

More information

Speech and Speaker Recognition for the Command of an Industrial Robot

Speech and Speaker Recognition for the Command of an Industrial Robot Speech and Speaker Recognition for the Command of an Industrial Robot CLAUDIA MOISA*, HELGA SILAGHI*, ANDREI SILAGHI** *Dept. of Electric Drives and Automation University of Oradea University Street, nr.

More information

Multi-modal Kernel Method for Activity Detection of Sound Sources

Multi-modal Kernel Method for Activity Detection of Sound Sources 1 Multi-modal Kernel Method for Activity Detection of Sound Sources David Dov, Ronen Talmon, Member, IEEE and Israel Cohen, Fellow, IEEE Abstract We consider the problem of acoustic scene analysis of multiple

More information

Topic 10. Multi-pitch Analysis

Topic 10. Multi-pitch Analysis Topic 10 Multi-pitch Analysis What is pitch? Common elements of music are pitch, rhythm, dynamics, and the sonic qualities of timbre and texture. An auditory perceptual attribute in terms of which sounds

More information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 AN HMM BASED INVESTIGATION OF DIFFERENCES BETWEEN MUSICAL INSTRUMENTS OF THE SAME TYPE PACS: 43.75.-z Eichner, Matthias; Wolff, Matthias;

More information

Research Article. ISSN (Print) *Corresponding author Shireen Fathima

Research Article. ISSN (Print) *Corresponding author Shireen Fathima Scholars Journal of Engineering and Technology (SJET) Sch. J. Eng. Tech., 2014; 2(4C):613-620 Scholars Academic and Scientific Publisher (An International Publisher for Academic and Scientific Resources)

More information

Processor time 9 Used memory 9. Lost video frames 11 Storage buffer 11 Received rate 11

Processor time 9 Used memory 9. Lost video frames 11 Storage buffer 11 Received rate 11 Processor time 9 Used memory 9 Lost video frames 11 Storage buffer 11 Received rate 11 2 3 After you ve completed the installation and configuration, run AXIS Installation Verifier from the main menu icon

More information

hit), and assume that longer incidental sounds (forest noise, water, wind noise) resemble a Gaussian noise distribution.

hit), and assume that longer incidental sounds (forest noise, water, wind noise) resemble a Gaussian noise distribution. CS 229 FINAL PROJECT A SOUNDHOUND FOR THE SOUNDS OF HOUNDS WEAKLY SUPERVISED MODELING OF ANIMAL SOUNDS ROBERT COLCORD, ETHAN GELLER, MATTHEW HORTON Abstract: We propose a hybrid approach to generating

More information

Smart Traffic Control System Using Image Processing

Smart Traffic Control System Using Image Processing Smart Traffic Control System Using Image Processing Prashant Jadhav 1, Pratiksha Kelkar 2, Kunal Patil 3, Snehal Thorat 4 1234Bachelor of IT, Department of IT, Theem College Of Engineering, Maharashtra,

More information

TOWARDS IMPROVING ONSET DETECTION ACCURACY IN NON- PERCUSSIVE SOUNDS USING MULTIMODAL FUSION

TOWARDS IMPROVING ONSET DETECTION ACCURACY IN NON- PERCUSSIVE SOUNDS USING MULTIMODAL FUSION TOWARDS IMPROVING ONSET DETECTION ACCURACY IN NON- PERCUSSIVE SOUNDS USING MULTIMODAL FUSION Jordan Hochenbaum 1,2 New Zealand School of Music 1 PO Box 2332 Wellington 6140, New Zealand hochenjord@myvuw.ac.nz

More information

PRODUCTION MACHINERY UTILIZATION MONITORING BASED ON ACOUSTIC AND VIBRATION SIGNAL ANALYSIS

PRODUCTION MACHINERY UTILIZATION MONITORING BASED ON ACOUSTIC AND VIBRATION SIGNAL ANALYSIS 8th International DAAAM Baltic Conference "INDUSTRIAL ENGINEERING" 19-21 April 2012, Tallinn, Estonia PRODUCTION MACHINERY UTILIZATION MONITORING BASED ON ACOUSTIC AND VIBRATION SIGNAL ANALYSIS Astapov,

More information

inter.noise 2000 The 29th International Congress and Exhibition on Noise Control Engineering August 2000, Nice, FRANCE

inter.noise 2000 The 29th International Congress and Exhibition on Noise Control Engineering August 2000, Nice, FRANCE Copyright SFA - InterNoise 2000 1 inter.noise 2000 The 29th International Congress and Exhibition on Noise Control Engineering 27-30 August 2000, Nice, FRANCE I-INCE Classification: 7.9 THE FUTURE OF SOUND

More information

Real-time body tracking of a teacher for automatic dimming of overlapping screen areas for a large display device being used for teaching

Real-time body tracking of a teacher for automatic dimming of overlapping screen areas for a large display device being used for teaching CSIT 6910 Independent Project Real-time body tracking of a teacher for automatic dimming of overlapping screen areas for a large display device being used for teaching Student: Supervisor: Prof. David

More information

A Framework for Segmentation of Interview Videos

A Framework for Segmentation of Interview Videos A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida

More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

THE importance of music content analysis for musical

THE importance of music content analysis for musical IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007 333 Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With

More information

Query By Humming: Finding Songs in a Polyphonic Database

Query By Humming: Finding Songs in a Polyphonic Database Query By Humming: Finding Songs in a Polyphonic Database John Duchi Computer Science Department Stanford University jduchi@stanford.edu Benjamin Phipps Computer Science Department Stanford University bphipps@stanford.edu

More information

IMIDTM. In Motion Identification. White Paper

IMIDTM. In Motion Identification. White Paper IMIDTM In Motion Identification Authorized Customer Use Legal Information No part of this document may be reproduced or transmitted in any form or by any means, electronic and printed, for any purpose,

More information

Automatic Labelling of tabla signals

Automatic Labelling of tabla signals ISMIR 2003 Oct. 27th 30th 2003 Baltimore (USA) Automatic Labelling of tabla signals Olivier K. GILLET, Gaël RICHARD Introduction Exponential growth of available digital information need for Indexing and

More information

Detection and demodulation of non-cooperative burst signal Feng Yue 1, Wu Guangzhi 1, Tao Min 1

Detection and demodulation of non-cooperative burst signal Feng Yue 1, Wu Guangzhi 1, Tao Min 1 International Conference on Applied Science and Engineering Innovation (ASEI 2015) Detection and demodulation of non-cooperative burst signal Feng Yue 1, Wu Guangzhi 1, Tao Min 1 1 China Satellite Maritime

More information

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Kazuyoshi Yoshii, Masataka Goto and Hiroshi G. Okuno Department of Intelligence Science and Technology National

More information

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC A Thesis Presented to The Academic Faculty by Xiang Cao In Partial Fulfillment of the Requirements for the Degree Master of Science

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting

Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting Dalwon Jang 1, Seungjae Lee 2, Jun Seok Lee 2, Minho Jin 1, Jin S. Seo 2, Sunil Lee 1 and Chang D. Yoo 1 1 Korea Advanced

More information

Telecommunication Development Sector

Telecommunication Development Sector Telecommunication Development Sector Study Groups ITU-D Study Group 1 Rapporteur Group Meetings Geneva, 4 15 April 2016 Document SG1RGQ/218-E 22 March 2016 English only DELAYED CONTRIBUTION Question 8/1:

More information

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS Andrew N. Robertson, Mark D. Plumbley Centre for Digital Music

More information

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication Proceedings of the 3 rd International Conference on Control, Dynamic Systems, and Robotics (CDSR 16) Ottawa, Canada May 9 10, 2016 Paper No. 110 DOI: 10.11159/cdsr16.110 A Parametric Autoregressive Model

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

Full Disclosure Monitoring

Full Disclosure Monitoring Full Disclosure Monitoring Power Quality Application Note Full Disclosure monitoring is the ability to measure all aspects of power quality, on every voltage cycle, and record them in appropriate detail

More information

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate

More information

Reducing False Positives in Video Shot Detection

Reducing False Positives in Video Shot Detection Reducing False Positives in Video Shot Detection Nithya Manickam Computer Science & Engineering Department Indian Institute of Technology, Bombay Powai, India - 400076 mnitya@cse.iitb.ac.in Sharat Chandran

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox 1803707 knoxm@eecs.berkeley.edu December 1, 006 Abstract We built a system to automatically detect laughter from acoustic features of audio. To implement the system,

More information

A probabilistic framework for audio-based tonal key and chord recognition

A probabilistic framework for audio-based tonal key and chord recognition A probabilistic framework for audio-based tonal key and chord recognition Benoit Catteau 1, Jean-Pierre Martens 1, and Marc Leman 2 1 ELIS - Electronics & Information Systems, Ghent University, Gent (Belgium)

More information

A New "Duration-Adapted TR" Waveform Capture Method Eliminates Severe Limitations

A New Duration-Adapted TR Waveform Capture Method Eliminates Severe Limitations 31 st Conference of the European Working Group on Acoustic Emission (EWGAE) Th.3.B.4 More Info at Open Access Database www.ndt.net/?id=17567 A New "Duration-Adapted TR" Waveform Capture Method Eliminates

More information

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs Abstract Large numbers of TV channels are available to TV consumers

More information

Design Project: Designing a Viterbi Decoder (PART I)

Design Project: Designing a Viterbi Decoder (PART I) Digital Integrated Circuits A Design Perspective 2/e Jan M. Rabaey, Anantha Chandrakasan, Borivoje Nikolić Chapters 6 and 11 Design Project: Designing a Viterbi Decoder (PART I) 1. Designing a Viterbi

More information

Music Mood. Sheng Xu, Albert Peyton, Ryan Bhular

Music Mood. Sheng Xu, Albert Peyton, Ryan Bhular Music Mood Sheng Xu, Albert Peyton, Ryan Bhular What is Music Mood A psychological & musical topic Human emotions conveyed in music can be comprehended from two aspects: Lyrics Music Factors that affect

More information

Automatic Rhythmic Notation from Single Voice Audio Sources

Automatic Rhythmic Notation from Single Voice Audio Sources Automatic Rhythmic Notation from Single Voice Audio Sources Jack O Reilly, Shashwat Udit Introduction In this project we used machine learning technique to make estimations of rhythmic notation of a sung

More information

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis

Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Automatic Extraction of Popular Music Ringtones Based on Music Structure Analysis Fengyan Wu fengyanyy@163.com Shutao Sun stsun@cuc.edu.cn Weiyao Xue Wyxue_std@163.com Abstract Automatic extraction of

More information

Features for Audio and Music Classification

Features for Audio and Music Classification Features for Audio and Music Classification Martin F. McKinney and Jeroen Breebaart Auditory and Multisensory Perception, Digital Signal Processing Group Philips Research Laboratories Eindhoven, The Netherlands

More information

Measurement of overtone frequencies of a toy piano and perception of its pitch

Measurement of overtone frequencies of a toy piano and perception of its pitch Measurement of overtone frequencies of a toy piano and perception of its pitch PACS: 43.75.Mn ABSTRACT Akira Nishimura Department of Media and Cultural Studies, Tokyo University of Information Sciences,

More information

Musical Hit Detection

Musical Hit Detection Musical Hit Detection CS 229 Project Milestone Report Eleanor Crane Sarah Houts Kiran Murthy December 12, 2008 1 Problem Statement Musical visualizers are programs that process audio input in order to

More information

Comparison Parameters and Speaker Similarity Coincidence Criteria:

Comparison Parameters and Speaker Similarity Coincidence Criteria: Comparison Parameters and Speaker Similarity Coincidence Criteria: The Easy Voice system uses two interrelating parameters of comparison (first and second error types). False Rejection, FR is a probability

More information

Music Information Retrieval with Temporal Features and Timbre

Music Information Retrieval with Temporal Features and Timbre Music Information Retrieval with Temporal Features and Timbre Angelina A. Tzacheva and Keith J. Bell University of South Carolina Upstate, Department of Informatics 800 University Way, Spartanburg, SC

More information

Implementation of A Low Cost Motion Detection System Based On Embedded Linux

Implementation of A Low Cost Motion Detection System Based On Embedded Linux Implementation of A Low Cost Motion Detection System Based On Embedded Linux Hareen Muchala S. Pothalaiah Dr. B. Brahmareddy Ph.d. M.Tech (ECE) Assistant Professor Head of the Dept.Ece. Embedded systems

More information

homework solutions for: Homework #4: Signal-to-Noise Ratio Estimation submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition

homework solutions for: Homework #4: Signal-to-Noise Ratio Estimation submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING homework solutions for: Homework #4: Signal-to-Noise Ratio Estimation submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition May 3,

More information

Detection of Panoramic Takes in Soccer Videos Using Phase Correlation and Boosting

Detection of Panoramic Takes in Soccer Videos Using Phase Correlation and Boosting Detection of Panoramic Takes in Soccer Videos Using Phase Correlation and Boosting Luiz G. L. B. M. de Vasconcelos Research & Development Department Globo TV Network Email: luiz.vasconcelos@tvglobo.com.br

More information

An Introduction to the Spectral Dynamics Rotating Machinery Analysis (RMA) package For PUMA and COUGAR

An Introduction to the Spectral Dynamics Rotating Machinery Analysis (RMA) package For PUMA and COUGAR An Introduction to the Spectral Dynamics Rotating Machinery Analysis (RMA) package For PUMA and COUGAR Introduction: The RMA package is a PC-based system which operates with PUMA and COUGAR hardware to

More information

Using the BHM binaural head microphone

Using the BHM binaural head microphone 11/17 Using the binaural head microphone Introduction 1 Recording with a binaural head microphone 2 Equalization of a recording 2 Individual equalization curves 5 Using the equalization curves 5 Post-processing

More information

Chord Classification of an Audio Signal using Artificial Neural Network

Chord Classification of an Audio Signal using Artificial Neural Network Chord Classification of an Audio Signal using Artificial Neural Network Ronesh Shrestha Student, Department of Electrical and Electronic Engineering, Kathmandu University, Dhulikhel, Nepal ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas

Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas Marcello Herreshoff In collaboration with Craig Sapp (craig@ccrma.stanford.edu) 1 Motivation We want to generative

More information

EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION

EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION Hui Su, Adi Hajj-Ahmad, Min Wu, and Douglas W. Oard {hsu, adiha, minwu, oard}@umd.edu University of Maryland, College Park ABSTRACT The electric

More information

Modeling memory for melodies

Modeling memory for melodies Modeling memory for melodies Daniel Müllensiefen 1 and Christian Hennig 2 1 Musikwissenschaftliches Institut, Universität Hamburg, 20354 Hamburg, Germany 2 Department of Statistical Science, University

More information

Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications. Matthias Mauch Chris Cannam György Fazekas

Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications. Matthias Mauch Chris Cannam György Fazekas Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications Matthias Mauch Chris Cannam György Fazekas! 1 Matthias Mauch, Chris Cannam, George Fazekas Problem Intonation in Unaccompanied

More information

Music Source Separation

Music Source Separation Music Source Separation Hao-Wei Tseng Electrical and Engineering System University of Michigan Ann Arbor, Michigan Email: blakesen@umich.edu Abstract In popular music, a cover version or cover song, or

More information

Week 14 Music Understanding and Classification

Week 14 Music Understanding and Classification Week 14 Music Understanding and Classification Roger B. Dannenberg Professor of Computer Science, Music & Art Overview n Music Style Classification n What s a classifier? n Naïve Bayesian Classifiers n

More information

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Jana Eggink and Guy J. Brown Department of Computer Science, University of Sheffield Regent Court, 11

More information

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication

A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations in Audio Forensic Authentication Journal of Energy and Power Engineering 10 (2016) 504-512 doi: 10.17265/1934-8975/2016.08.007 D DAVID PUBLISHING A Parametric Autoregressive Model for the Extraction of Electric Network Frequency Fluctuations

More information

NewsComm: A Hand-Held Device for Interactive Access to Structured Audio

NewsComm: A Hand-Held Device for Interactive Access to Structured Audio NewsComm: A Hand-Held Device for Interactive Access to Structured Audio Deb Kumar Roy B.A.Sc. Computer Engineering, University of Waterloo, 1992 Submitted to the Program in Media Arts and Sciences, School

More information

Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn

Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn Introduction Active neurons communicate by action potential firing (spikes), accompanied

More information

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed,

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed, VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS O. Javed, S. Khan, Z. Rasheed, M.Shah {ojaved, khan, zrasheed, shah}@cs.ucf.edu Computer Vision Lab School of Electrical Engineering and Computer

More information

GNURadio Support for Real-time Video Streaming over a DSA Network

GNURadio Support for Real-time Video Streaming over a DSA Network GNURadio Support for Real-time Video Streaming over a DSA Network Debashri Roy Authors: Dr. Mainak Chatterjee, Dr. Tathagata Mukherjee, Dr. Eduardo Pasiliao Affiliation: University of Central Florida,

More information

Broken Wires Diagnosis Method Numerical Simulation Based on Smart Cable Structure

Broken Wires Diagnosis Method Numerical Simulation Based on Smart Cable Structure PHOTONIC SENSORS / Vol. 4, No. 4, 2014: 366 372 Broken Wires Diagnosis Method Numerical Simulation Based on Smart Cable Structure Sheng LI 1*, Min ZHOU 2, and Yan YANG 3 1 National Engineering Laboratory

More information

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION th International Society for Music Information Retrieval Conference (ISMIR ) SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION Chao-Ling Hsu Jyh-Shing Roger Jang

More information

What is Statistics? 13.1 What is Statistics? Statistics

What is Statistics? 13.1 What is Statistics? Statistics 13.1 What is Statistics? What is Statistics? The collection of all outcomes, responses, measurements, or counts that are of interest. A portion or subset of the population. Statistics Is the science of

More information

An ecological approach to multimodal subjective music similarity perception

An ecological approach to multimodal subjective music similarity perception An ecological approach to multimodal subjective music similarity perception Stephan Baumann German Research Center for AI, Germany www.dfki.uni-kl.de/~baumann John Halloran Interact Lab, Department of

More information

Music Genre Classification

Music Genre Classification Music Genre Classification chunya25 Fall 2017 1 Introduction A genre is defined as a category of artistic composition, characterized by similarities in form, style, or subject matter. [1] Some researchers

More information

6.UAP Project. FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System. Daryl Neubieser. May 12, 2016

6.UAP Project. FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System. Daryl Neubieser. May 12, 2016 6.UAP Project FunPlayer: A Real-Time Speed-Adjusting Music Accompaniment System Daryl Neubieser May 12, 2016 Abstract: This paper describes my implementation of a variable-speed accompaniment system that

More information

Audio-Based Video Editing with Two-Channel Microphone

Audio-Based Video Editing with Two-Channel Microphone Audio-Based Video Editing with Two-Channel Microphone Tetsuya Takiguchi Organization of Advanced Science and Technology Kobe University, Japan takigu@kobe-u.ac.jp Yasuo Ariki Organization of Advanced Science

More information

Automatic Projector Tilt Compensation System

Automatic Projector Tilt Compensation System Automatic Projector Tilt Compensation System Ganesh Ajjanagadde James Thomas Shantanu Jain October 30, 2014 1 Introduction Due to the advances in semiconductor technology, today s display projectors can

More information

RainBar: Robust Application-driven Visual Communication using Color Barcodes

RainBar: Robust Application-driven Visual Communication using Color Barcodes 2015 IEEE 35th International Conference on Distributed Computing Systems RainBar: Robust Application-driven Visual Communication using Color Barcodes Qian Wang, Man Zhou, Kui Ren, Tao Lei, Jikun Li and

More information

MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES

MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES Jun Wu, Yu Kitano, Stanislaw Andrzej Raczynski, Shigeki Miyabe, Takuya Nishimoto, Nobutaka Ono and Shigeki Sagayama The Graduate

More information

A Bayesian Network for Real-Time Musical Accompaniment

A Bayesian Network for Real-Time Musical Accompaniment A Bayesian Network for Real-Time Musical Accompaniment Christopher Raphael Department of Mathematics and Statistics, University of Massachusetts at Amherst, Amherst, MA 01003-4515, raphael~math.umass.edu

More information