2.2.1 Speech Recognition and Signal Processing for Broadcast News Transcription

Continued research and development of a broadcast news speech transcription system has been promoted. Universities and researchers working both inside and outside research institutions in Japan have cooperated toward the development of an automatic closed-captioning system. Such a system would display an announcer's speech as captions during a news program. To accelerate this development, we are creating a news speech database from NHK's main news programs, such as "News at 7" and "Good Morning Japan." Starting on April 1, 1999, the contents of each day's news have been added to the database. We have also accumulated language data by transcribing programs such as "Sunday Debate" and "Sunday Sports."

Regarding the acoustic model, extensive examinations were conducted on a model production method for the news speech database to improve its recognition performance. As a first step from speaker-independent toward speaker-dependent speech recognition, we clustered announcers' speech data and developed a technique to select an appropriate cluster acoustic model at recognition time. As a result, we obtained a 10 to 20% reduction in recognition processing time.

Regarding the language model, we tested a method that utilizes the manuscript produced by a reporter, which is usually submitted just before the news broadcast. Further, we constructed a technique to automatically estimate, based on context, words that were not included in the reporter's manuscript and previously had to be input manually. In another recognition method we developed, the positions at which words appear in a reporter's manuscript are registered in advance, and the system prioritizes word selection during recognition based on this registered word order.
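The report does not describe how a cluster acoustic model is selected at recognition time; a minimal sketch, assuming each announcer cluster is summarized by a centroid of acoustic feature vectors and the nearest centroid decides which cluster's model to decode with (cluster names and feature values below are invented for illustration):

```python
import math

# Hypothetical sketch: each announcer cluster is summarized by a centroid of
# averaged acoustic feature vectors. At recognition time we pick the nearest
# cluster and decode with only that cluster's acoustic model, which is where
# a speed-up such as the reported 10-20% reduction could come from.

def select_cluster(utterance_frames, cluster_centroids):
    """Return the name of the cluster whose centroid is nearest (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # mean feature vector of the incoming utterance
    n = len(utterance_frames)
    dim = len(utterance_frames[0])
    mean = [sum(f[i] for f in utterance_frames) / n for i in range(dim)]
    return min(cluster_centroids, key=lambda name: dist(mean, cluster_centroids[name]))

clusters = {
    "anchor_male": [0.0, 1.0],
    "anchor_female": [2.0, 0.5],
    "field_reporter": [5.0, 3.0],
}
frames = [[1.9, 0.4], [2.1, 0.6], [2.0, 0.5]]  # toy 2-dim feature frames
print(select_cluster(frames, clusters))
```

In a real system the centroids would be trained from the clustered announcer speech data and the features would be cepstral vectors rather than two numbers.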
These improvements raised recognition accuracy by approximately 4 to 5%.

Regarding the decoder and related technology, the former recognition system used information from a whole sentence before outputting a subtitle. We switched to a new system that finalizes the recognition result sequentially as it recognizes words. This reduced the average time to output a recognition result from 7.2 seconds to 0.6 seconds.

We integrated these research results into an updated recognition system and used it on the September 30, 1999, editions of "News at 7," "News at Noon," and "Good Morning Japan."

Figure 1: Broadcast news transcription system

The system provided an average recognition accuracy of 86% on the portion containing a studio announcer's speech (492 sentences), including 97% on the anchor's speech (122 sentences) and 88% on the 213 sports- and weather-related sentences. Recognition delay on the anchor's portion was 3 seconds. Especially notable is the recognition accuracy of 95% and higher for the studio anchor, which achieves our target performance.

We also pursued development of a system for manually detecting and correcting mistakes in the recognition result instantly. The correction process has two stages: "error detection" followed by "error correction." To enhance the accuracy of this process, we also constructed a prototype system capable of presenting a sound and its textual manuscript synchronously to the person detecting and correcting errors, accomplished through the application of a speech rate conversion system.

Recognition performance on the anchor's portion of a broadcast has reached a level suitable for practical use. It was decided to launch this closed-captioning service on the NHK news program "News at 7" on the evening of March 27, 2000. In preparation, a practical speech recognition system and a practical recognition error correction system were developed.
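The sequential finalization of results can be sketched as follows. The commit rule here is an assumption (the report does not state one): output the longest prefix on which successive partial hypotheses agree, never retracting already-committed words, and flush the remainder at the end of the utterance.

```python
# Hedged sketch of sequential result finalization: instead of waiting for the
# whole sentence, commit the longest prefix that successive partial hypotheses
# agree on. This mimics how an incremental decoder can cut output delay
# (e.g. the reported 7.2 s down to 0.6 s); the exact rule is an assumption.

def commit_stable_prefix(prev_hyp, new_hyp, committed):
    """Extend the committed output by the prefix both hypotheses agree on."""
    stable = []
    for a, b in zip(prev_hyp, new_hyp):
        if a == b:
            stable.append(a)
        else:
            break
    # never retract committed words; only append the newly stable ones
    return committed + stable[len(committed):]

committed, prev = [], []
for hyp in [["news"], ["news", "at"], ["news", "at", "seven"]]:
    committed = commit_stable_prefix(prev, hyp, committed)
    prev = hyp
committed = committed + prev[len(committed):]  # flush the tail at utterance end
print(committed)
```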
Figure 2: Automatic news closed-caption production system (speech input, database for speech recognition, speech recognition system, recognition error correction, final check, electronic general scenario, simultaneous subtitling production system, closed-captioning output)

NHK STRL ANNUAL REPORT 1999 — Studies for Improving Conventional Broadcasting Services
2.2.2 Human-friendly Information Presentation

Binocular viewing of stereoscopic 3-D images is known to give rise to a conflict between vergence and accommodation. To clarify the extent to which the accommodation function is impaired in elderly subjects, we studied accommodation responses to binocular 3-D images and performed a subjective evaluation of the results. The accommodation response induced in elderly subjects by movement in depth of 3-D images was less than 1/3 of that induced in younger subjects, indicating that older people respond less to 3-D images.

With the aim of improving TV color reproduction in light of the changes in chromatic vision associated with aging, we attempted to ascertain the most preferred TV color temperature conditions for both young and elderly viewers. The results revealed that the most preferred color temperature for the elderly is in the vicinity of 16,000 K, compared to around 9,300 K for young subjects (see figure). We also clarified the visual psychological effects under the most preferred color temperature conditions by means of principal component analysis.

With a view to improving access for visually impaired users to ISDB menu screens and the like, we continued work on a multimodal interface which uses both Braille and sounds, including auditory warning signals, in place of the usual graphical user interface (GUI). Basic research was conducted on presenting the GUI environment through tactile modalities such as shape and surface texture, as well as sound.

Figure: Result of the optimal color temperature experiment (evaluation rating vs. color temperature, 5,000–20,000 K; image: ITE skin color chart, female face; elderly group average age 68, young group average age 23)
2.2.3 Next Generation Human Interface

Digital satellite broadcasting will bring a variety of colorful new services. To determine the factors that make digital receiver remote controls easy for anybody to use, we test-manufactured various prototype remote controls and evaluated the coordination between the actual remote operation and its display on a computer-simulated screen. Three types of remote control were used: a button type with the smallest possible number of buttons, a trackball type similar to widely-used PC pointing devices, and a voice recognition type operated by vocal commands (see picture).

Figure: Prototype remote controls for digital television (speech recognition, trackball, button, and conventional types)

Research and development is underway on a reception terminal that will provide data broadcasting and electronic textual data to visually-impaired persons or persons with both visual and hearing impairments. Evaluation tests were conducted on the accessibility and operability of teletext, ISDB information, and six-finger Braille display systems, and improvements were made based on the test results. Additionally, with a view to future interactive services, we studied a remote communication procedure designed for people with both visual and hearing impairments, and verified that a newly developed conversation protocol can make such remote communication possible. We also constructed a prototype wearable six-finger Braille terminal and confirmed the basic characteristics of its input and output functions.

To make broadcasting enjoyable for elderly viewers who find TV speech too rapid, we developed a speech rate conversion system, a technology that slows speech down to a speed that is easier to understand. In fiscal 1999, we developed this technology into a software application and confirmed that it is capable of outputting vocal data through a real-time speech conversion process on a PC.
We also test-manufactured a non-linear editing device with a variable-speed reproduction function, which allows the slowed speech to be synchronized with the picture and contributes to efficient broadcast scene editing work. We also investigated a sound signal processing method that maintains the intelligibility of output speech at up to 5 times the normal speed.
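The report does not disclose NHK's speech rate conversion algorithm; a minimal sketch of the general idea, assuming a simple windowed overlap-add time stretch in which frames are read from the input at a smaller hop than they are written to the output, so playback slows without a wholesale change of the waveform's local structure:

```python
import math

# Minimal time-stretching sketch (not NHK's actual method): Hann-windowed
# frames are taken from the input at hop_in = hop_out * rate and overlap-added
# at hop_out in the output. rate < 1.0 slows the speech down.

def stretch(signal, rate, frame=256, hop_out=128):
    hop_in = int(hop_out * rate)
    out = [0.0] * (int(len(signal) / rate) + frame)
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    t_in, t_out = 0, 0
    while t_in + frame <= len(signal):
        for n in range(frame):
            out[t_out + n] += signal[t_in + n] * win[n]
        t_in += hop_in
        t_out += hop_out
    return out

# toy input: 1 second of a 440 Hz tone at 16 kHz
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
slowed = stretch(tone, rate=0.5)  # roughly twice as long
print(len(slowed) > len(tone))
```

Production systems refine this with pitch-synchronous or waveform-similarity alignment (e.g. PSOLA or WSOLA-style techniques) to avoid phase artifacts; the plain overlap-add above only illustrates the timing arithmetic.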
2.2.4 Efficient Video Retrieval Based on Image Recognition

With a view to applications such as automatic video indexing and editing support, research has been pursued centering on automatic face recognition. We made improvements to a prototype recognition system (see figure) that detects, tracks, and recognizes people's faces in video material. The system is able to identify on the order of tens of individuals, a recognition accuracy sufficient for practical use. Improved facial feature selection has increased robustness to dilations and rotations of the image caused by three-dimensional movement. We also studied a facial pose angle estimation method that may be required to automate the process of database registration.

Figure: Outline of the identity recognition system (video input, face recognition system, face image database, recognition result)

To examine the use of face recognition to support video editing, we constructed an index of face recognition results from a video clip and test-manufactured a graphical user interface (GUI) for a video retrieval system. The GUI accepts search keys such as a person's name or the composition of the scene, and the system retrieves appropriate video sequences by accessing the index.

A feature extraction technique that makes use of color information, composition, and background information was investigated in an attempt to create a flexible video retrieval system with human-like capabilities. Working with the ATR Human Information Processing Research Laboratories, we investigated the extraction of optimal color statistics for use in the flexible retrieval system. We also proposed an image retrieval technique that uses the image composition and the complexity of the background as keys for recognition. Retrieval tests were conducted using still images of various types.
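The retrieval side of such a system reduces to a lookup table: face recognition results are stored as an index mapping each recognized person to the video segments in which they appear, and the GUI's name search key becomes a dictionary query. A hypothetical sketch (names, clip files, and timestamps below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical index structure for face-recognition-based video retrieval:
# person name -> list of (clip, start_seconds, end_seconds) segments.

class FaceIndex:
    def __init__(self):
        self._index = defaultdict(list)

    def register(self, person, clip, start, end):
        """Record that `person` was recognized in `clip` from start to end (s)."""
        self._index[person].append((clip, start, end))

    def search(self, person):
        """Return all video segments indexed under `person`, sorted by clip/time."""
        return sorted(self._index.get(person, []))

idx = FaceIndex()
idx.register("Anchor A", "news_0930.mpg", 12.0, 45.5)
idx.register("Reporter B", "news_0930.mpg", 50.0, 80.0)
idx.register("Anchor A", "sports_0930.mpg", 5.0, 20.0)
print(idx.search("Anchor A"))
```

A real editing-support GUI would populate this index automatically from the face recognizer's per-frame output and would also support compositional keys, as the report describes.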
Concerning possible effects on viewers' health from visual effects such as those used in some animations, research cooperation continued with external research institutions such as the Tokyo Women's Medical University and the Medical School of Gifu University. Surveys were also conducted on research trends.
2.2.5 User-centered Video Presentation Technique

To clarify the basic structural elements of a user-friendly and intelligible EPG (Electronic Program Guide), we studied the hierarchical categorical structure of TV programming. We used a top-down method of building the hierarchy in compulsory classification experiments, and a bottom-up technique based on subjective assessment of the similarity of listed program names. The experimental subjects were 24 persons in their 20s and up, and 295 NHK programs were used. The average number of genres produced by compulsory classification was 7.8, and eighty percent of the subjects used 10 categories to classify the programs. When the similarity of the program names was analyzed with a cluster analysis method, the results indicated that the most popular genres on an EPG should total approximately 10.

Additionally, we investigated the degree to which a person's understanding is affected by the asynchronous presentation of multiple stimuli, using short-term memory behavior as an index. We particularly examined video and sound, and text and sound, since these are the presentation methods used with TV. In the experiment, we presented a word in the order sound then text, or text then sound, to test whether a subject would recognize it as the same word. The results, shown in the figure, indicate that markedly more accurate recognition was observed in the sound-then-text order, with recognition accuracy of 80% and higher obtained up to a delay of 2 seconds.

Figure: Influence of text/sound stimulus onset asynchrony on a subject's recognition (recognition score vs. delay time, 0–8 seconds; sound-visual and visual-sound presentation conditions)
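The bottom-up cluster analysis of program-name similarity can be illustrated with a small agglomerative (single-linkage) clustering sketch. The toy similarity matrix and the chosen cut point stand in for the subjects' assessment data, which the report does not give:

```python
# Hedged illustration of bottom-up cluster analysis: agglomerative clustering
# over pairwise program-name similarities, merging the two most similar
# clusters until a target number of genre clusters remains. Similarities
# below are invented; in the study they came from subjects' judgments.

def agglomerate(items, sim, k):
    """Single-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: similarity of the closest member pair
                s = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return [[items[i] for i in c] for c in clusters]

programs = ["News at 7", "News at Noon", "Sunday Sports", "Sunday Debate"]
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.1, 0.2],
       [0.1, 0.1, 1.0, 0.4],
       [0.2, 0.2, 0.4, 1.0]]
print(agglomerate(programs, sim, 2))
```

In the study, cutting the resulting dendrogram where cluster cohesion dropped suggested roughly 10 genres; here the cut at k=2 simply demonstrates the mechanics on four program names.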
Based on these results, we further deepened our knowledge of the intelligible presentation of information when interface devices introduce display delays between "text" images and "sound" information.