Speech Recognition and Voice Separation for the Internet of Things

Mohammad Hasanzadeh Mofrad and Daniel Mosse
Department of Computer Science, School of Computing and Information, University of Pittsburgh

Discussion Outline

- Motivations and contributions
- Background
- Proposed voice-enabled IoT prototype
- Reconstruction lowpass filter for a voice-enabled IoT prototype
- Results
- Summary and conclusion

Mohammad Hasanzadeh Mofrad and Daniel Mosse. "Speech Recognition and Voice Separation for the IoTs." IoT 2018.

Motivation

Ways of communicating with IoT devices:
- Graphical user interface (GUI)
- Speech interfaces

Limitations of current smart home IoT devices (e.g., a smart speaker):
1. Devices are not customizable: their functionality (voice commands and accuracy) is static.
2. Smart home speakers cannot handle complex scenarios:
   a. They fail to process combined commands separated by "and".
   b. They fail to process two concurrent commands.

Contributions

The contributions of this paper are twofold:
1. Prototype: a customizable voice-enabled IoT system.
2. Model and implementation: a model for handling two concurrent voice commands issued to a voice-enabled IoT device, for example when one person says "Dim the lights." while another person says "Turn on the TV."

Background

Smart home speakers
- Voice-enabled devices widely use speech processing and natural language processing to create a [...]
- Recording is done by the device.
- Processing is done in the Cloud.

Blind Source Separation (BSS)
- The cocktail party effect
- The problem of processing multiple concurrent voice commands by a voice-enabled IoT device
- BSS solution: Independent Component Analysis (ICA)

Low-pass filters in signal processing (we use the Butterworth filter)

Proposed Voice-enabled IoT Prototype

Pipeline: spoken language (e.g., "Play music on Spotify") -> Raspberry Pi -> Google Cloud Speech API -> transcribed text -> text-to-intent API -> executed intent

The proposed model consists of the following components:
1. The Raspberry Pi records voice and sends it to the Google Cloud speech-to-text API.
2. The Google Cloud speech-to-text API transcribes the voice into text.
3. The text-to-intent API receives the text and converts it to an intent and a target device.

Proposed Voice-enabled IoT Prototype: Text-to-intent API

The text-to-intent API receives the transcribed text from the Google Cloud speech-to-text API and extracts the following using a simple language model:
1. The intent of the voice message
2. The target device on which the command is to be executed

The intents currently supported by our prototype are:
- Play music
- Pause music
- Resume music
- Stop music

Device: an open-source command-line music player

Flow: text-to-intent API -> FIFO queue -> music player service
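The text-to-intent step above can be sketched as follows. This is only an illustration under stated assumptions: the slides say a "simple language model" is used but do not describe it, so the keyword table, the device list, and the function names (text_to_intent, enqueue_command, next_command) are hypothetical.

```python
import re
from collections import deque

# Hypothetical keyword table: the slides list these four intents,
# but the matching rules below are an assumption.
INTENT_KEYWORDS = {
    "play": "play music",
    "pause": "pause music",
    "resume": "resume music",
    "stop": "stop music",
}

# Hypothetical device vocabulary, used only for illustration.
KNOWN_DEVICES = ("spotify",)

# FIFO queue between the text-to-intent step and the music player service.
command_queue = deque()


def text_to_intent(text):
    """Return (intent, device) for a transcribed sentence; either may be None."""
    words = re.findall(r"[a-z']+", text.lower())
    intent = next((INTENT_KEYWORDS[w] for w in words if w in INTENT_KEYWORDS), None)
    device = next((w for w in words if w in KNOWN_DEVICES), None)
    return intent, device


def enqueue_command(text):
    """Parse a transcription and queue it if an intent was recognized."""
    intent, device = text_to_intent(text)
    if intent is not None:
        command_queue.append((intent, device))
    return intent, device


def next_command():
    """Pop the oldest queued command, or None when the queue is empty."""
    return command_queue.popleft() if command_queue else None
```

A consumer such as the music player service would call next_command() in a loop, so commands are executed in the order in which they were spoken.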

Proposed Voice-enabled IoT Prototype: Hardware

Inexpensive prototype: $68.42

The main hardware components are:
- Raspberry Pi 3 Model B Motherboard, $35.80
  - Quad-core Cortex-A53 @ 1.2 GHz
  - 1 GB SDRAM
  - Wireless 802.11
  - Bluetooth 4.0
- Kinobo USB 2.0 Mini Microphone, $4.65
- Samsung 64 GB Micro SD Card, $19.99
- Raspberry Pi Case, $7.98

Other hardware: keyboard, cables, etc.
Software: Raspbian, Python, Cloud API

Reconstruction Low-pass Filter for a Voice-enabled IoT Prototype

Problem: When two Echo Dots are placed in proximity to each other and two people simultaneously talk to their nearest Dot, the voice recorded by each Echo Dot is distorted by the low-frequency voice of the other speaker.

Goal: Process both recordings made by the Echo Dots, then extract and execute both issued commands.

[Figure: the distorted voice recorded by an Amazon Echo Dot is sent to the Alexa Voice Service (AVS), which produces a transcription error.]

Proposed Reconstruction Lowpass Filter (RLF)

The Butterworth filter is used to build the proposed Reconstruction Lowpass Filter (RLF).

[Figure: Rec 1 -> Filter() -> Fil 1, and Rec 1 - Fil 2 -> Src 1; Rec 2 -> Filter() -> Fil 2, and Rec 2 - Fil 1 -> Src 2]

Proposed Reconstruction Low-pass Filter (RLF)

Consider the voice recorded by each microphone, rec_i, to be a mixture of a source signal src_i and a noise signal coming from the other speaker, where i ∈ {0, 1}. The filtered voice fil_j is an approximation of that noise:

\[
\begin{aligned}
rec_i &= src_i + noise_{(i+1) \bmod 2} \\
src_i &= rec_i - noise_{(i+1) \bmod 2} \\
src_i &\approx rec_i - fil_j, \quad i \neq j
\end{aligned}
\]

In this work we used a 6th-order Butterworth filter with a cut-off frequency of 500 Hz.
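The subtraction above can be sketched with SciPy as follows. This is a minimal illustration, not the authors' implementation: the slides give only the filter order (6) and the cut-off frequency (500 Hz), so the use of zero-phase filtering (sosfiltfilt), the function names, and the truncation to a common length are all assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def lowpass(signal, fs, cutoff_hz=500.0, order=6):
    """Low-pass a 1-D signal with a Butterworth filter (order and cut-off from the slides)."""
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    # Assumption: forward-backward filtering, so the filtered signal stays
    # time-aligned with the recording it is subtracted from.
    return sosfiltfilt(sos, signal)


def reconstruct_sources(rec0, rec1, fs):
    """Estimate src_i = rec_i - fil_j (i != j) for two simultaneous recordings."""
    n = min(len(rec0), len(rec1))  # assumption: recordings start at the same instant
    rec0 = np.asarray(rec0[:n], dtype=float)
    rec1 = np.asarray(rec1[:n], dtype=float)
    fil0 = lowpass(rec0, fs)  # approximates the noise that leaks into microphone 1
    fil1 = lowpass(rec1, fs)  # approximates the noise that leaks into microphone 0
    return rec0 - fil1, rec1 - fil0
```

With real data, rec0 and rec1 would be the two wav recordings loaded at the same sampling rate fs; the reconstructed signals are what would then be sent to the speech-to-text API.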

Dataset for Blind Source Separation

- Two people participated in the study.
- Voices are stored in wav audio format.
- Available online: https://github.com/hmofrad/viota
- Different proximities to the microphones (Person i, microphone i)
- Common smart speaker commands are used.

Dataset             Number of sentences   Microphone proximity
Dataset 1 (near)    30                    Near
Dataset 2 (far)     44                    Far

Results

The performance metric we use is Word Error Rate (WER):

\[
\mathrm{WER} = \frac{S + D + I}{N}
\]

where S is the number of substitutions, D the number of deletions, I the number of insertions, and N the number of words. WER is widely used in speech processing and NLP.

The algorithms compared are:
- Baseline model, which uses the raw recording files
- Reconstruction Independent Component Analysis (RICA)
- The proposed Reconstruction Lowpass Filter (RLF)
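The WER formula can be computed with a word-level edit distance. The sketch below is an illustration only: the function name is hypothetical, N is taken to be the number of words in the reference sentence (the usual convention), and the lowercase whitespace tokenization is an assumption, since the slides do not say how transcriptions were normalized.

```python
def word_error_rate(reference, hypothesis):
    """Return (S + D + I) / N, with N the number of words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, transcribing "turn on the tv" as "turn on tv" is one deletion out of four reference words, so the function returns 0.25.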

Results

- RICA performs the worst because it overfits the input recordings.
- The proposed RLF shows an overall improvement of 2-3% over the Baseline model.
- Our results are better on every row of both datasets.

Dataset             Microphone          Baseline       RICA           RLF
Dataset 1 (near)    Mic 1               0.96 ± 0.11    0.91 ± 0.22    0.99 ± 0.03
                    Mic 2               0.95 ± 0.13    0.35 ± 0.37    0.96 ± 0.12
                    (Mic 1 + Mic 2)/2   0.95 ± 0.12    0.63 ± 0.29    0.97 ± 0.08
Dataset 2 (far)     Mic 1               0.96 ± 0.10    0.95 ± 0.13    0.98 ± 0.04
                    Mic 2               0.44 ± 0.39    0.18 ± 0.39    0.47 ± 0.40
                    (Mic 1 + Mic 2)/2   0.70 ± 0.24    0.56 ± 0.26    0.73 ± 0.22

Discussion

The 2-3% improvement may not look groundbreaking at first glance, but:
- Our results are better than both the Baseline and RICA models.
- At scale, it contributes significantly to Cloud throughput, availability, and utilization by reducing the number of commands sent by users:
  - Avoids potential Cloud upgrades and expansion
  - Reduces the number of retries caused by low accuracy
  - Keeps the number of requests low
  - Requests are less noisy and therefore result in the intended action

Summary and Conclusion

- A customizable voice-enabled IoT prototype is proposed, which can be used as a preprocessing step to the speech-to-text API:
  - Raspberry Pi
  - Google Cloud speech-to-text API
  - Text-to-intent API
- A method is devised for voice separation in an IoT environment:
  - Reconstruction Lowpass Filter (RLF)

Takeaways
- Good preprocessing can eliminate potential retries on the Cloud.
- This is achievable with inexpensive hardware.