GALE Phase 2 Arabic Broadcast Conversation Speech Part 1

1. Introduction

GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 contains approximately 123 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by the Linguistic Data Consortium (LDC); Medianet, Tunis, Tunisia; and MTC, Rabat, Morocco during Phase 2 of the DARPA GALE program.

Broadcast audio for the DARPA GALE (Global Autonomous Language Exploitation) program was collected at LDC's Philadelphia, PA, USA facilities and at three remote collection sites: Hong Kong University of Science and Technology (HKUST), Hong Kong (Chinese); Medianet (Arabic); and MTC (Arabic). The combined local and outsourced broadcast collection supported GALE at a rate of approximately 300 hours per week of programming from more than 50 broadcast sources, for a total of over 30,000 hours of collected broadcast audio over the life of the program.

The broadcast conversation recordings in this release feature interviews, call-in programs and roundtable discussions focusing principally on current events from the following sources: Al Alam News Channel, based in Iran; Al Arabiya, a news television station based in Dubai; Aljazeera, a regional broadcaster located in Doha, Qatar; Al Ordiniyah, a national broadcast station in Jordan; Lebanese Broadcasting Corporation, a Lebanese television station; Nile TV, a broadcast programmer based in Egypt; Oman TV, a national broadcaster located in the Sultanate of Oman; Saudi TV, a national television station based in Saudi Arabia; and Syria TV, the national television station in Syria.

2. Broadcast Audio Data Collection Procedure

LDC's local broadcast collection system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day.
The broadcast material is served to the system by a set of free-to-air (FTA) satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular; all signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high-bandwidth A/V format and are then processed to extract audio, to generate keyframes and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate automatic speech recognition (ASR) output.

The collection schedule is stored in a relational database using a MySQL database server. The database contains a history of all of the recordings that have been made; it has configuration and status information for all recorders; it has information about all receivers and associates specific programs of interest with the appropriate receiver; it contains a schedule of all recording jobs to be executed and their status; and it stores all audit judgments associated with a given recording.

For the GALE program, Medianet collected Arabic broadcast news (BN) and broadcast conversation (BC) programming from across the Gulf region using its internal system and LDC's portable broadcast collection platform installed in 2008. Among the sources collected by Medianet were Abu Dhabi TV, Al Arabiya, Al Baghdadya, Al Fayhaa, Al Forat, Al Hiwar, Al Iraqiyah, Al Manar, Al Ordiniyah, Al Sharqiya, Bahrain TV, Dubai TV, Kuwait TV, Oman TV, Qatar TV, Palestine Satellite Channel, Saudi TV and Tunis TV.

MTC collected Arabic BN and BC programming from Al Baghdadya, Alhurra, Al Maghribia, Arabiaa, Radio Sawa and Yemen TV using its internal collection system.

LDC's portable broadcast collection platform is a TiVo-style digital video recording (DVR) system that records two streams of A/V material simultaneously. It supports analog CATV (NTSC and PAL) and FTA DVB-S satellite programming and can operate outside of the United States. It has a small footprint, weighs less than 30 pounds and can be transported as carry-on luggage. The portable platform deployed at Medianet's Tunisian collection facility collected multiple streams of regional Arabic programming from various sources.

Further information about LDC's broadcast collection system can be found in LDC's Broadcast Collection System Data Sheet, http://www.ldc.upenn.edu/datasheets/broadcast_collection_system_ds.pdf.

3. Broadcast Collection Audit Procedure

All broadcast data collected for GALE by LDC and by the remote collection sites managed by LDC were manually audited by Arabic, Chinese, and English speakers for language, program and quality. The broadcast auditing process served three principal goals: as a check on the operation of LDC's broadcast collection system equipment by identifying failed, incomplete or faulty recordings; as an indicator of broadcast schedule changes by identifying instances when the incorrect program was recorded; and as a guide for data selection by retaining information about a program's genre, data type and topic.

LDC developed a Broadcast Audit Interface Tool to audit its local collection. The tool presented auditors with three segments from each recording (beginning, middle and end), from which audit judgments were made in English. Each remote collection site used a form of audit procedure based on the LDC model. Medianet generated English-language .xls reports for the Arabic programming it collected.
Those reports contained one set of auditors' judgments for an entire program, including audio quality; genre; data format; percentage of Modern Standard Arabic; dialect type and percentage; topic; and comments. MTC generated English-language .xml and .html audit reports for the Arabic programming it collected. Those reports contained auditors' judgments from three portions of each program (beginning, middle and end), including whether a recording occurred, the audio quality, language, whether the correct program was recorded, the data type and topic.

Further information about the audit procedure and LDC's Broadcast Audit Interface Tool can be found in Audit Procedure Specification, Version 2.0, included with this release.

4. Source Data Profile

This release contains 143 audio files. Following is a breakdown of files by source and distinct program:

Source        Program                Program ID      #Broadcasts  Total Hrs.
Al Alam       Iraq Now               IRAQNOW         5            5.2
Al Alam       Under Spotlight        SPOTLITE        3            3.1
Al Alam       With the Event         WITHEVENT       2            2.1
Al Jazeera    From Washington        FROMWASH        2            2.1
Al Jazeera    More Than One Opinion  MOREOPINION1    1            1.0
Al Jazeera    Open Dialogue          OPENDIAL        1            1.0
Al Jazeera    Opposite Directions    OPPDIREC        1            1.0
Al Jazeera    Platform               PLATFORM1       4            2.2
Al Jazeera    Today's Interview      TODINTER        1            0.5
Al Jazeera    Without Boundaries 1   WITHOUTBOUNDS1  2            2.1
Al Arabiya    Across Oceans          ACROSSOC        2            2.1
Al Arabiya    Bil Arabi              BILARABI        7            6.4
Al Arabiya    Edaat                  EDAAT           9            9.3
Al Arabiya    Fourth Estate          FOURTHES        29           15.6
Al Arabiya    Point of Order         POINTORDR       3            1.6
LBC           Naharkum Saiid         NAHAR           32           33.2
Oman TV       Affairs of the Hour    AFFAIRHR        2            2.1
Oman TV       Morning Coffee         MORNCOFF        1            1.0
Saudi TV      No Boundaries          NOBOUNDARIES    1            1.0
Nile TV       Egypt Nightly News     EGYPNNSCO       11           11.4
Al Ordiniyah  Jordan Nightly News    JORDNNSCO       2            2.1
Saudi TV      Saudi Nightly News     SAUDNNSCO       13           7.0
Syria TV      Circle of Events       CIRCLEVT        4            4.2
Syria TV      Weekly File            WEEKFILE        2            2.1
Syria TV      Windows                WINDOWS         3            3.1
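The breakdown above can be cross-checked against the figures quoted elsewhere in this document (143 audio files, approximately 123 hours). The following Python sketch is not part of the release; it copies the rows of the table by hand, and the names ROWS, totals and hours_by_source are illustrative only.

```python
from collections import defaultdict

# (source, program ID, number of broadcasts, total hours), copied by hand
# from the Source Data Profile table.
ROWS = [
    ("Al Alam", "IRAQNOW", 5, 5.2),
    ("Al Alam", "SPOTLITE", 3, 3.1),
    ("Al Alam", "WITHEVENT", 2, 2.1),
    ("Al Jazeera", "FROMWASH", 2, 2.1),
    ("Al Jazeera", "MOREOPINION1", 1, 1.0),
    ("Al Jazeera", "OPENDIAL", 1, 1.0),
    ("Al Jazeera", "OPPDIREC", 1, 1.0),
    ("Al Jazeera", "PLATFORM1", 4, 2.2),
    ("Al Jazeera", "TODINTER", 1, 0.5),
    ("Al Jazeera", "WITHOUTBOUNDS1", 2, 2.1),
    ("Al Arabiya", "ACROSSOC", 2, 2.1),
    ("Al Arabiya", "BILARABI", 7, 6.4),
    ("Al Arabiya", "EDAAT", 9, 9.3),
    ("Al Arabiya", "FOURTHES", 29, 15.6),
    ("Al Arabiya", "POINTORDR", 3, 1.6),
    ("LBC", "NAHAR", 32, 33.2),
    ("Oman TV", "AFFAIRHR", 2, 2.1),
    ("Oman TV", "MORNCOFF", 1, 1.0),
    ("Saudi TV", "NOBOUNDARIES", 1, 1.0),
    ("Nile TV", "EGYPNNSCO", 11, 11.4),
    ("Al Ordiniyah", "JORDNNSCO", 2, 2.1),
    ("Saudi TV", "SAUDNNSCO", 13, 7.0),
    ("Syria TV", "CIRCLEVT", 4, 4.2),
    ("Syria TV", "WEEKFILE", 2, 2.1),
    ("Syria TV", "WINDOWS", 3, 3.1),
]


def totals(rows):
    """Return (total broadcasts, total hours rounded to one decimal place)."""
    broadcasts = sum(row[2] for row in rows)
    hours = round(sum(row[3] for row in rows), 1)
    return broadcasts, hours


def hours_by_source(rows):
    """Sum the hours column per source, rounded to one decimal place."""
    by_source = defaultdict(float)
    for source, _program_id, _broadcasts, hours in rows:
        by_source[source] += hours
    return {source: round(hours, 1) for source, hours in by_source.items()}


if __name__ == "__main__":
    print(totals(ROWS))  # (143, 122.5)
    for source, hours in sorted(hours_by_source(ROWS).items()):
        print(f"{source:<14}{hours:>6.1f}")
```

The broadcast counts sum to 143, matching the stated number of audio files, and the hours sum to 122.5, consistent with the "approximately 123 hours" given in the introduction.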

5. Data Directory Structure

The directory structure in this data release is organized as follows.

- Broadcast audio collection top directories: /data
- Documentation directory: /docs

6. Data File Description

6.1 Audio File Format

The audio files in this release are FLAC-compressed Waveform Audio File format (.flac), 16000 Hz, single-channel, 16-bit PCM files.

6.2 Audio File Names

The broadcast audio files in this collection follow LDC's defined naming convention for broadcast audio files:

{SRC}_{PRG}_{LNG}_YYYYMMDD_HHMMSS.flac

where

- {SRC} is the source ID (e.g., CNN, VOA, etc.)
- {PRG} is the program ID (e.g., LARRYKING, etc.)
- {LNG} is the three-letter language ID defined in ISO 639-3. ARB is Standard Arabic; CMN is Mandarin Chinese; ENG is English.
- YYYYMMDD is the data collection (broadcast) date.
- HHMMSS is the start time of the program (HH is the hour in 24-hour format).

7. Data Validation

Native Arabic speakers audited every recording in this release. All audio files were checked to be valid .wav files. The docs/checksum.md5 file contains MD5 checksums of all audio files in this corpus.

8. Copyright Information

Portions © 2006-2007 Al Alam News Channel, Al Arabiya, Aljazeera, Al Ordiniyah, Nile TV, Oman TV, PAC Ltd, Saudi TV, Syria TV, © 2006-2007, 2011, 2013 Trustees of the University of Pennsylvania

Authors: Kevin Walker, Christopher Caruso, Kazuaki Maeda, Denise DiPersio, Stephanie Strassel
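As an illustration of the naming convention in section 6.2 and the checksums mentioned in section 7, the following Python sketch parses a broadcast file name into its fields and computes an MD5 digest. It is not part of the release. It assumes that source and program IDs contain no underscores, which this document does not state, and all function names are hypothetical. The layout of docs/checksum.md5 is not described here, so the sketch only computes a digest and leaves the comparison against that file to the reader.

```python
import hashlib
import re
from datetime import datetime
from typing import BinaryIO, NamedTuple, Optional

# {SRC}_{PRG}_{LNG}_YYYYMMDD_HHMMSS.flac
# Assumption: SRC and PRG never contain an underscore.
NAME_RE = re.compile(
    r"^(?P<src>[^_]+)_(?P<prg>[^_]+)_(?P<lng>[A-Za-z]{3})_"
    r"(?P<date>\d{8})_(?P<time>\d{6})\.flac$"
)


class BroadcastName(NamedTuple):
    source: str      # {SRC}, e.g. CNN
    program: str     # {PRG}, e.g. LARRYKING
    language: str    # {LNG}, ISO 639-3 code such as ARB
    start: datetime  # broadcast date and program start time


def parse_broadcast_name(filename: str) -> Optional[BroadcastName]:
    """Split a file name into its fields; return None if it does not fit."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    try:
        start = datetime.strptime(
            match["date"] + match["time"], "%Y%m%d%H%M%S"
        )
    except ValueError:
        # Eight and six digits were present but are not a real date/time.
        return None
    return BroadcastName(
        match["src"], match["prg"], match["lng"].upper(), start
    )


def md5_hex_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_hex(path: str) -> str:
    """Return the hexadecimal MD5 digest of the file at `path`."""
    with open(path, "rb") as handle:
        return md5_hex_stream(handle)


if __name__ == "__main__":
    # File name built from the example IDs in section 6.2; the date and
    # time are made up for illustration.
    print(parse_broadcast_name("CNN_LARRYKING_ENG_20070115_210000.flac"))
```

Returning None for names that do not match keeps the parser usable on a directory listing that mixes audio files with other files.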