GALE Phase 3 Arabic Broadcast Conversation Speech Part 2

1. Introduction

GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 contains approximately 129 hours of Arabic broadcast conversation speech collected in 2007 and 2008 by the Linguistic Data Consortium (LDC); MediaNet, Tunis, Tunisia; and MTC, Rabat, Morocco during Phase 3 of the DARPA GALE program.

Broadcast audio for the DARPA GALE (Global Autonomous Language Exploitation) program was collected at LDC's Philadelphia, PA, USA facilities and at three remote collection sites: Hong Kong University of Science and Technology (HKUST), Hong Kong (Chinese); Medianet (Arabic); and MTC (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources:

- Abu Dhabi TV, a television station based in Abu Dhabi, United Arab Emirates
- Al Alam News Channel, based in Iran
- Al Arabiya, a news television station based in Dubai
- Al Baghdadya, an Iraqi broadcast programmer based in Egypt
- Al Fayhaa, an Iraqi television channel
- Al Hiwar, a regional broadcast station based in the United Kingdom
- Alhurra, a U.S. government-funded regional broadcaster
- Aljazeera, a regional broadcaster located in Doha, Qatar
- Al Ordiniyah, a national broadcast station in Jordan
- Lebanese Broadcasting Corporation, a Lebanese television station
- Bahrain TV, a television station in the Kingdom of Bahrain
- Dubai TV, a broadcast station in the United Arab Emirates
- Kuwait TV, a national broadcast station in Kuwait
- Oman TV, a national broadcaster located in the Sultanate of Oman
- Qatar TV, a broadcast programmer in Qatar
- Saudi TV, a national television station based in Saudi Arabia
- Syria TV, the national television station in Syria
- Tunisian National TV, a national television station in Tunisia

2. Broadcast Audio Data Collection Procedure

LDC's local broadcast collection system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular; all signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high-bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output.

The collection schedule is stored in a relational database on a MySQL database server. The database contains a history of all recordings that have been made; configuration and status information for all recorders; information about all receivers, associating specific programs of interest with the appropriate receiver; a schedule of all recording jobs to be executed and their status; and all audit judgments associated with a given recording.
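The kinds of records listed above can be pictured with a small relational sketch. This is not LDC's actual schema: every table and column name below is hypothetical, the demo rows are invented placeholders, and SQLite (via Python's standard `sqlite3` module) stands in for the MySQL server so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: one table per kind of record the text describes.
SCHEMA = """
CREATE TABLE receiver (
    receiver_id INTEGER PRIMARY KEY,
    kind        TEXT NOT NULL            -- e.g. 'FTA', 'DSS', 'DBS', 'CATV'
);
CREATE TABLE recorder (
    recorder_id INTEGER PRIMARY KEY,
    config      TEXT,                    -- configuration information
    status      TEXT NOT NULL            -- current recorder status
);
CREATE TABLE program (
    program_id  TEXT PRIMARY KEY,        -- program of interest
    source      TEXT NOT NULL,
    receiver_id INTEGER NOT NULL REFERENCES receiver(receiver_id)
);
CREATE TABLE job (
    job_id      INTEGER PRIMARY KEY,     -- scheduled recording job
    program_id  TEXT NOT NULL REFERENCES program(program_id),
    recorder_id INTEGER NOT NULL REFERENCES recorder(recorder_id),
    start_time  TEXT NOT NULL,
    status      TEXT NOT NULL
);
CREATE TABLE recording (
    recording_id INTEGER PRIMARY KEY,    -- history of recordings made
    job_id       INTEGER NOT NULL REFERENCES job(job_id),
    file_name    TEXT NOT NULL
);
CREATE TABLE audit (
    audit_id     INTEGER PRIMARY KEY,    -- audit judgment for a recording
    recording_id INTEGER NOT NULL REFERENCES recording(recording_id),
    judgment     TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with one invented row per table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO receiver VALUES (1, 'FTA')")
    conn.execute("INSERT INTO recorder VALUES (1, 'demo-config', 'idle')")
    conn.execute("INSERT INTO program VALUES ('IRAQNOW', 'Al Alam', 1)")
    conn.execute(
        "INSERT INTO job VALUES (1, 'IRAQNOW', 1, '2007-08-01 12:00:00', 'done')"
    )
    conn.execute("INSERT INTO recording VALUES (1, 1, 'demo_recording.flac')")
    conn.execute("INSERT INTO audit VALUES (1, 1, 'complete')")
    return conn


def audits_for_program(conn, program_id):
    """Return (file_name, judgment) pairs for every audited recording of a program."""
    rows = conn.execute(
        """
        SELECT recording.file_name, audit.judgment
        FROM audit
        JOIN recording ON recording.recording_id = audit.recording_id
        JOIN job ON job.job_id = recording.job_id
        WHERE job.program_id = ?
        ORDER BY recording.recording_id
        """,
        (program_id,),
    )
    return rows.fetchall()
```

Keeping jobs, recordings and audits in separate tables lets a single query trace an audit judgment back to the program and receiver that produced the recording, which is the kind of association the paragraph above describes.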

For the GALE program, Medianet collected Arabic broadcast news (BN) and broadcast conversation (BC) programming from across the Gulf region using its internal system and LDC's portable broadcast collection platform, installed in 2008. Among the sources collected by Medianet were Abu Dhabi TV, Al Arabiya, Al Baghdadya, Al Fayhaa, Al Forat, Al Hiwar, Al Iraqiyah, Al Manar, Al Ordiniyah, Al Sharqiya, Bahrain TV, Dubai TV, Kuwait TV, Oman TV, Qatar TV, Palestine Satellite Channel, Saudi TV and Tunis TV. MTC collected Arabic BN and BC programming from Al Baghdadya, Alhurra, Al Maghribia, Arabiaa, Radio Sawa and Yemen TV using its internal collection system.

LDC's portable broadcast collection platform is a TiVo-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage. The portable platform deployed at Medianet's Tunisian collection facility collected multiple streams of regional Arabic programming from various sources.

Further information about LDC's broadcast collection system can be found in LDC's Broadcast Collection System Data Sheet, http://www.ldc.upenn.edu/datasheets/broadcast_collection_system_ds.pdf.

3. Broadcast Collection Audit Procedure

All broadcast data collected for GALE by LDC and by the remote collection sites managed by LDC were manually audited by Arabic, Chinese, and English speakers for language, program and quality.
The broadcast auditing process served three principal goals: as a check on the operation of LDC's broadcast collection system equipment, by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes, by identifying instances when the incorrect program was recorded; and as a guide for data selection, by retaining information about a program's genre, data type and topic.

LDC developed a Broadcast Audit Interface Tool to audit its local collection. The tool presented auditors with three segments from each recording (beginning, middle and end), from which audit judgments were made in English. Each remote collection site used a form of audit procedure based on the LDC model. Medianet generated English-language .xls reports for the Arabic programming it collected. Those reports contained one set of auditors' judgments for an entire program, including audio quality; genre; data format; percentage of Modern Standard Arabic; dialect type and percentage; topic; and comments. MTC generated English-language .xml and .html audit reports for the Arabic programming it collected. Those reports contained auditors' judgments from three portions of each program (beginning, middle and end), including whether a recording occurred, the audio quality, the language, whether the correct program was recorded, the data type and the topic.

4. Source Data Profile

This release contains 142 audio files. Following is a breakdown of files by source and distinct program:

Source        Program                         Program ID        #Broadcasts  Total Hrs.
Abu Dhabi TV  Traditions and Modern Times     TRADMODTIMES      2            1.9
Al Alam       Iraq Now                        IRAQNOW           7            7.2
Al Alam       Under Spotlight                 SPOTLITE          6            6.1
Al Alam       With the Event                  WTHEVENT          1            1.0
Al Baghdadya  Face to Face                    FACE              2            1.7
Al Fayha      Freedom Space                   FREEDOMSPACE      1            1.7
Al Fayha      Security Questions              URGENTQUESTIONS   1            0.9
Al Hiwar      To All Arabs                    ALLARABS          5            4.6
Al Hiwar      Case and Debate                 CASEDEBATE        4            4.5
Al Hiwar      Culture and Literature          CULTURE           3            2.6
Al Hiwar      Free Opinion                    FREEOPINION       4            3.7
Al Hiwar      The Khaliji Dimension           KHALIJIDIMENSION  2            1.2
Al Hiwar      Light on Events                 LIGHTEVENTS       10           12.8
Al Hiwar      Third Dimension                 THIRDDIMENSION    1            0.9
Al Hiwar      Valid for Every Time and Place  VALID             2            1.6
Al Hurra      Al Hurra Presents               ALHURRAPRESENTS   1            0.8
Al Hurra      All Directions                  ALLDIRECTIONS     3            2.6
Al Hurra      Conversations with Huyam        CONVERSATIONS     1            0.8
Al Hurra      Equality                        EQUALITY          3            2.0
Al Hurra      The Four Sides                  FOURSIDES         1            1.0
Al Hurra      Free Hour                       FREEHOUR          5            5.2
Al Hurra      Gulf Talks                      GULFTALKS         1            0.9
Al Hurra      Inside Washington               INSIDEWASHINGTON  1            0.6
Al Jazeera    From Washington                 FROMWASH          2            2.1
Al Jazeera    More Than One Opinion           MOREOPINION1      2            2.1
Al Jazeera    Today's Interview               TODINTER          3            1.6
Al Jazeera    Without Boundaries 1            WITHOUTBOUNDS1    1            1.0
Al Ordiniyah  Our Story                       OURSTORY          1            0.9
Al Ordiniyah  Al Ordiniyah Talk Show          TALKSHOW          2            1.9
Al Ordiniyah  Unfettered                      UNFETTERED        3            2.8
Al Arabiya    Across Oceans                   ACROSSOC          1            1.0
Al Arabiya    Arabs Debate                    ARABSDEBATE       1            1.8
Al Arabiya    Bil Arabi2                      BILARABI2         1            0.8
Al Arabiya    Edaat                           EDAAT             3            3.1
Al Arabiya    Events and Viewpoints           EVENTSVIEWPTS     3            5.5
Al Arabiya    Fourth Estate 2                 FOURTHES2         3            1.3
Al Arabiya    Fourth Estate                   FOURTHES          14           7.2
Al Arabiya    Frankly Speaking                FRANKLYSPEAKING   1            0.8
Al Arabiya    Point of Order                  POINTORDR         3            1.6
Bahrain TV    The Last Word                   LASTWORD          1            0.8
Dubai TV      Moreover                        MOREOVER          2            1.7
Dubai TV      This Program is for You         YOU               2            1.9
Kuwait TV     Good for Publication            PUBLICATION       1            0.9
Kuwait TV     Six by Six                      SIX               3            2.8
Kuwait TV     Weekly Issues                   WEEKLYISSUES      3            3.5
Oman TV       Affairs of the Hour             AFFAIRHR          3            3.1
Oman TV       Economic Perspective            ECONPERSPECTIVE   1            0.8
Oman TV       Morning Coffee                  MORNCOFF          1            1.0
Qatar TV      Security Questions              QUESTIONS         2            1.1
Saudi TV      No Boundaries2                  NOBOUNDARIES2     1            0.8
Saudi TV      Saudi Sermon                    SAUDISERMON       3            2.5
Saudi TV      Spotlight                       SAUDISPOTLIGHT    1            0.3
Syria TV      Circle of Events                CIRCLEVT          3            3.1
Syria TV      Weekly File                     WEEKFILE          1            1.0
Syria TV      Windows                         WINDOWS           1            1.0
Tunis 7       Tunis Sermon                    TUNISSERMON       3            1.3

5. Data Directory Structure

The directory structure in this data release is organized as follows.

Broadcast audio collection top directories: /data
Documentation directory: /docs

6. Data File Description

6.1 Audio File Format

The audio files in this release are FLAC-compressed Waveform Audio File format (.flac) files containing 16000 Hz, single-channel, 16-bit PCM audio.

6.2 Audio File Names

The broadcast audio files in this collection follow LDC's defined naming convention for broadcast audio files:

{SRC}_{PRG}_{LNG}_YYYYMMDD_HHMMSS.flac

where

- {SRC} is the source ID (e.g., CNN, VOA, etc.)
- {PRG} is the program ID (e.g., LARRYKING, etc.)
- {LNG} is the three-letter language ID defined in ISO 639-3. ARB is Standard Arabic; CMN is Mandarin Chinese; ENG is English.
- YYYYMMDD is the data collection (broadcast) date.
- HHMMSS is the start time of the program (HH is the hour in the 24-hour format).

7. Data Validation

Native Arabic speakers audited every recording in this release. All audio files were checked to be valid .flac files. The docs/checksum.md5 file contains MD5 checksums of all audio files in this corpus.

8. Copyright Information

Portions © 2007 Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya, Al Fayha, Al Hiwar, Aljazeera, Al Ordiniyah, Bahrain TV, Dubai TV, Kuwait TV, Oman TV, PAC Ltd, Qatar TV, Saudi TV, Syria TV, Tunisian National TV; © 2007, 2011 Trustees of the University of Pennsylvania

Authors: Kevin Walker, Christopher Caruso, Kazuaki Maeda, Denise DiPersio, Stephanie Strassel
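
As an illustration of the naming convention in Section 6.2 and the checksum file mentioned in Section 7, the following Python sketch splits a file name into its fields and computes the MD5 digest of a file. It is a minimal sketch under stated assumptions, not LDC tooling: the function names are hypothetical, the regular expression assumes that source and program IDs contain only letters and digits (program IDs such as FOURTHES2 include digits), and the layout of docs/checksum.md5 is not described in this document, so the sketch only computes a digest and leaves the comparison against that file to the reader.

```python
import hashlib
import re
from datetime import datetime
from pathlib import Path

# {SRC}_{PRG}_{LNG}_YYYYMMDD_HHMMSS.flac
# Assumption: SRC and PRG are alphanumeric and contain no underscores.
NAME_RE = re.compile(
    r"^(?P<src>[A-Za-z0-9]+)_(?P<prg>[A-Za-z0-9]+)_(?P<lng>[A-Z]{3})"
    r"_(?P<date>\d{8})_(?P<time>\d{6})\.flac$"
)


def parse_audio_name(name):
    """Split a broadcast audio file name into source, program, language and start time."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a valid broadcast audio file name: {name!r}")
    # strptime also rejects impossible dates and times (e.g. month 13, hour 25).
    start = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    return {
        "source": match["src"],
        "program": match["prg"],
        "language": match["lng"],
        "start": start,
    }


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `parse_audio_name("CNN_LARRYKING_ENG_20070801_210000.flac")` uses the source and program IDs given as examples in Section 6.2 with an invented date and time, and yields the source `CNN`, the program `LARRYKING`, the language `ENG` and a start time of 21:00:00 on 1 August 2007.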