Speech Recognition and Signal Processing for Broadcast News Transcription


2.2.1 Speech Recognition and Signal Processing for Broadcast News Transcription

We continued research and development of a broadcast news speech transcription system. Universities and researchers inside and outside research institutions in Japan have cooperated toward the development of an automatic closed-captioning system that would display an announcer's speech as captions during a news program. To accelerate this development, we are building a news speech database from NHK's main news programs, such as "News at 7" and "Good Morning Japan." Since April 1, 1999, the contents of each day's news have been added to the database. We have also accumulated language data by transcribing programs such as "Sunday Debate" and "Sunday Sports."

For the acoustic model, we extensively examined methods of building a model from the news speech database in order to improve recognition performance. As a first step from speaker-independent toward speaker-dependent speech recognition, we clustered announcers' speech data and developed a technique that selects the appropriate cluster acoustic model at recognition time. As a result, recognition processing time was reduced by 10 to 20%.

For the language model, we tested a method that uses the manuscript produced by a reporter, which is usually submitted right before the news broadcast. We also constructed a technique to automatically estimate a word based on its context. This involves manually inputting words that were not included in the reporter's manuscript. In another recognition method we developed, the positions at which words appear in the reporter's manuscript are registered in advance, and the system prioritizes word selection during recognition according to this registered word order.
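The word-order prioritization described above can be illustrated with a minimal sketch. The report does not give the actual scoring rule, so everything here is an assumption for illustration: the function names `build_position_index` and `pick_word`, the fixed `bonus` value, and the toy candidate scores are all hypothetical.

```python
def build_position_index(manuscript_words):
    """Map each manuscript word to the positions where it appears."""
    index = {}
    for pos, word in enumerate(manuscript_words):
        index.setdefault(word, []).append(pos)
    return index


def pick_word(candidates, last_pos, index, bonus=1.0):
    """Pick one word from (word, score) candidates.

    A candidate that appears in the manuscript directly after position
    last_pos receives a fixed bonus (a hypothetical weighting, not taken
    from the report). Returns the chosen word and the updated position.
    """
    best_word, best_score, best_pos = None, float("-inf"), last_pos
    for word, score in candidates:
        follows = (last_pos + 1) in index.get(word, [])
        adjusted = score + (bonus if follows else 0.0)
        if adjusted > best_score:
            best_word, best_score = word, adjusted
            best_pos = last_pos + 1 if follows else last_pos
    return best_word, best_pos


# Toy manuscript and candidates, invented for this example.
manuscript = "the prime minister visited osaka today".split()
index = build_position_index(manuscript)
word, pos = pick_word([("crime", 0.6), ("prime", 0.5)], last_pos=0, index=index)
```

In this toy case the lower-scoring candidate "prime" wins because it continues the manuscript after "the". A real recognizer would presumably fold such a preference into its language model probabilities rather than add a constant.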
These improvements resulted in a recognition accuracy approximately 4 to 5% higher.

Regarding the decoder and related technology, the former recognition system used information from a whole sentence to output a subtitle. We switched to a new system that finalizes the recognition result sequentially as words are recognized. This reduced the average time to output a recognition result from 7.2 seconds to 0.6 seconds.

We integrated these research results and applied the updated recognition system to the September 30, 1999 editions of "News at 7," "News at Noon," and "Good Morning Japan." The system achieved an average recognition accuracy of 86% on the portions containing studio announcers' speech (492 sentences). This includes 97% on the anchor's speech (122 sentences) and 88% on the sports and weather segments (213 sentences). Recognition delay on the anchor's portion was 3 seconds. Especially notable is the recognition accuracy of 95% or higher for the studio anchor, which achieves our target performance.

Figure 1: Broadcast news transcription system

We also pursued development of a system for manually detecting and correcting mistakes in the recognition result instantly. Its correction process has two stages: "error detection" followed by "error correction." To enhance the accuracy of this process, we also constructed a prototype system that presents the speech and its text synchronously to the error detector and the error corrector by applying a speech rate conversion system.

Recognition performance on the anchor's portion of a broadcast has reached a level suitable for practical use. It was decided to launch the closed-captioning service on the NHK news program "News at 7" on the evening of March 27, 2000. In preparation, a practical speech recognition system and a practical recognition error correction system were developed.
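The move from whole-sentence output to sequential finalization can be sketched as follows. The report does not say how the decoder decides that a word is final, so this example swaps in a simple stability rule as an assumption: a word is committed once two consecutive partial hypotheses agree on it. The class name `IncrementalFinalizer` is hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading items on which the two word lists agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class IncrementalFinalizer:
    """Commit words as soon as two successive partial hypotheses agree.

    This stability rule is an assumption made for illustration; it is
    not the decoder rule used in the report.
    """

    def __init__(self):
        self.committed = []  # words already sent to the caption output
        self.previous = []   # most recent partial hypothesis

    def update(self, hypothesis):
        """Feed the newest partial hypothesis; return newly finalized words."""
        stable = common_prefix_len(self.previous, hypothesis)
        self.previous = list(hypothesis)
        newly = list(hypothesis[len(self.committed):stable])
        self.committed.extend(newly)
        return newly


finalizer = IncrementalFinalizer()
for partial in (["the"], ["the", "prime"], ["the", "prime", "minister"]):
    finalizer.update(partial)
```

Under this rule each word is output one update after it first appears, instead of waiting for the end of the sentence, which is the kind of latency reduction the paragraph above describes.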
Figure 2: Automatic news closed-caption production system (components shown: speech input, speech recognition system, database for speech recognition, recognition error correction, final check, electronic general scenario, simultaneous subtitling production system, closed-captioning)

NHK STRL ANNUAL REPORT 1999 Studies for Improving Conventional Broadcasting Services

2.2.2 Human-friendly Information Presentation

Binocular viewing of stereo 3-D images is known to give rise to a conflict between vergence and accommodation. To clarify the extent to which the accommodation function is impaired in elderly subjects, we studied accommodation responses to binocular 3-D images and performed a subjective evaluation of the results. The accommodation response induced in elderly subjects by movement in depth of 3-D images was less than 1/3 of that induced in younger subjects, indicating that older people respond less to 3-D images.

With the aim of improving TV color reproduction in light of the changes in chromatic vision associated with aging, we attempted to ascertain the most preferred TV color temperature conditions for both young and elderly viewers. The results revealed that the most preferred color temperature for the elderly is in the vicinity of 16,000 K, compared to around 9,300 K for young subjects (see figure). We also clarified the visual psychological effects under the most preferred color temperature conditions by means of principal component analysis.

Figure: Result of optimal color temperature experiment (evaluation rating, -1.0 to 1.0, versus color temperature, 5,000 to 20,000 K; image: ITE skin color chart, female face; elderly group, average age 68, and young group, average age 23)

With a view to improving access for visually impaired users to ISDB menu screens and similar displays, we continued work on a multimodal interface that uses both Braille and sounds, including auditory warning signals, in place of the usual graphical user interface (GUI). Basic research was conducted on presenting the GUI environment through tactile modalities such as shape and surface texture, as well as through sound.

2.2.3 Next Generation Human Interface

Digital satellite broadcasting will bring a variety of colorful new services. To determine the factors that make digital receiver remote controls easy for anybody to use, we built various prototype remote controls and evaluated the coordination between the actual remote operation and its display on a computer-simulated screen. Three types of remote control were used: a button type with the smallest possible number of buttons, a trackball type similar to widely used PC pointing devices, and a voice recognition type operated by vocal commands (see picture).

Figure: Prototype remote controls for digital television (speech recognition type, trackball type, button type, conventional type)

Research and development is underway on a reception terminal that will provide data broadcasting and electronic textual data to visually impaired persons and to persons with both visual and hearing impairments. Evaluation tests were conducted on the accessibility and operability of teletext, ISDB information, and six-finger Braille display systems, and improvements were made based on the test results. Additionally, with a view to future interactive services, we studied a remote communication procedure designed for people with both visual and hearing impairments. It was verified that a newly developed conversation protocol makes such remote communication possible. We also constructed a prototype wearable six-finger Braille terminal and confirmed the basic characteristics of its input and output functions.

To make broadcasting enjoyable for elderly viewers who find TV speech too rapid, we developed a speech rate conversion system. This technology can slow speech down to a speed that is easier to understand. In fiscal 1999, we turned this technology into a software application and confirmed that it can output speech through real-time speech rate conversion on a PC.
We also built a prototype non-linear editing device with a variable-speed playback function. It keeps speech synchronized with the picture, which contributes to efficient editing of broadcast scenes. We also investigated a sound signal processing method that can maintain the intelligibility of output speech at up to 5 times normal speed.
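The report does not describe the signal processing behind speech rate conversion. As a rough illustration of the general idea of changing duration without resampling, the sketch below uses plain windowed overlap-add, a generic textbook technique swapped in here as an assumption. The function name `stretch`, the frame size, and the test signal are all hypothetical. Plain overlap-add produces audible artifacts on real speech, and pitch-aware variants are usually preferred in practice.

```python
import math


def stretch(samples, rate, frame=256):
    """Change duration by roughly 1/rate using plain overlap-add.

    rate < 1 slows the signal down and rate > 1 speeds it up.
    Generic illustration only; not the method used in the report.
    """
    hop_out = frame // 2        # fixed spacing of frames in the output
    hop_in = hop_out * rate     # spacing of frames read from the input
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame)
              for i in range(frame)]  # Hann window
    n_frames = max(1, int((len(samples) - frame) / hop_in) + 1)
    out = [0.0] * (hop_out * (n_frames - 1) + frame)
    norm = [0.0] * len(out)     # summed window weights per output sample
    for k in range(n_frames):
        start_in = int(k * hop_in)
        start_out = k * hop_out
        for i in range(frame):
            j = start_in + i
            if j >= len(samples):
                break
            out[start_out + i] += samples[j] * window[i]
            norm[start_out + i] += window[i]
    # Divide by the summed weights so overlapping frames keep their level.
    return [o / n if n > 1e-8 else 0.0 for o, n in zip(out, norm)]


# A 200 Hz tone sampled at 8 kHz stands in for one second of speech.
tone = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
slowed = stretch(tone, rate=0.5)  # roughly twice as long
```

Reading input frames at half the spacing at which they are written out is what lengthens the signal; the normalization by summed window weights keeps the output level steady where frames overlap.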

2.2.4 Efficient Video Retrieval Based on Image Recognition

With a view to applications such as automatic video indexing and editing support, we pursued research centered on automatic face recognition. We improved a prototype recognition system (see figure) that detects, tracks, and recognizes people's faces in video material. The system can identify on the order of tens of individuals with a recognition accuracy sufficient for practical use. Improved facial feature selection has increased robustness to dilations and rotations of the image caused by three-dimensional movement. We also studied a method for estimating facial pose angle, which may be required to automate the process of database registration.

Figure: Outline of identity recognition system (video, face recognition system, face image database, recognition result)

To examine the use of face recognition in support of video editing, we constructed an index of face recognition results from a video clip and built a prototype graphical user interface (GUI) for a video retrieval system. The GUI accepts search keys such as a person's name or the composition of the scene, and the system retrieves the appropriate video sequences by accessing the index.

We investigated a feature extraction technique that makes use of color information, composition, and background information in an attempt to create a flexible video retrieval system with human-like capabilities. Working with the ATR Human Information Processing Research Laboratories, we investigated the extraction of optimal color statistics for use in the flexible retrieval system. We also proposed an image retrieval technique that uses the image composition and the complexity of the background as keys for recognition. Retrieval tests were conducted using still images of various types.
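The index of face recognition results mentioned in this section can be pictured as a mapping from a person's name to the video segments in which that person was recognized. The sketch below is a minimal illustration under that assumption. The report does not describe the actual index structure, and the function names, the `max_gap` parameter, and the sample names are hypothetical.

```python
from collections import defaultdict


def build_face_index(detections, max_gap=1):
    """Build {name: [(first_frame, last_frame), ...]} from detections.

    detections is an iterable of (frame_number, person_name) pairs
    sorted by frame number. Frames at most max_gap apart are merged
    into one segment. The layout is hypothetical, for illustration.
    """
    index = defaultdict(list)
    for frame, name in detections:
        segments = index[name]
        if segments and frame - segments[-1][1] <= max_gap:
            # Extend the current segment for this person.
            segments[-1] = (segments[-1][0], frame)
        else:
            # Start a new segment.
            segments.append((frame, frame))
    return dict(index)


def find_person(index, name):
    """Return the list of segments in which the named person appears."""
    return index.get(name, [])


# Invented per-frame recognition results for two people.
detections = [(0, "Anchor A"), (1, "Anchor A"), (1, "Reporter B"),
              (2, "Anchor A"), (10, "Anchor A"), (11, "Anchor A")]
index = build_face_index(detections)
segments = find_person(index, "Anchor A")
```

A search by name then becomes a single dictionary lookup, and the returned frame ranges can be handed to an editing interface. Searching by scene composition would need additional keys that this sketch does not cover.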
Concerning the possible effects on viewers' health of visual effects such as those used in some animations, we continued research cooperation with external research institutions such as the Tokyo Women's Medical University and the Medical School of Gifu University. Surveys of research trends were also conducted.

2.2.5 User-centered Video Presentation Technique

To clarify the basic structural elements of user-friendly and intelligible EPGs (Electronic Program Guides), we studied hierarchical category structures for TV programming. In the test, we used a top-down approach in compulsory classification experiments and a bottom-up approach based on subjective assessment of the similarity of listed program names. The experimental subjects were 24 persons in their 20s and up, and 295 NHK programs were used. In the compulsory classification, the average number of genres was 7.8. Eighty percent of the subjects used 10 categories to classify the programs. When the similarity of the program names was analyzed by cluster analysis, the results indicated that the most popular genres on EPGs should total approximately 10.

Additionally, we investigated the degree to which a person's understanding is affected by the asynchronous presentation of multiple stimuli, using short-term memory behavior as an index. We examined in particular the combinations of video and sound and of text and sound, since these are the presentation methods used in TV. In the experiment, we presented a word first as sound and then as text, or first as text and then as sound, and tested whether the subject could recognize whether the two were the same word. The results are shown in the figure. Recognition was markedly more accurate when sound preceded text, and a recognition accuracy of 80% or higher was obtained up to a delay of 2 seconds.

Figure: Influence of text/sound stimulus onset asynchrony on a subject's recognition (recognition score, 0.0 to 1.0, versus delay time, 0 to 8 seconds; sound-visual presentation and visual-sound presentation)
Based on these results, we deepened our understanding of how to present information intelligibly when an interface device displays a "text" image and "sound" information with a delay between them.