THE MPEG-H TV AUDIO SYSTEM
Use Cases and Workflows

This whitepaper was produced in collaboration with Fraunhofer IIS.
MEDIA SOLUTIONS FRAUNHOFER IIS

INTRODUCTION

This document describes common use cases for the MPEG-H next-generation audio codec. It is important to understand that the MPEG-H TV Audio System does not describe only a single audio codec like, for example, Advanced Audio Coding (AAC), but rather a complete audio delivery system from capture to the end user.

Next Generation Audio (NGA) codecs exploit the fact that today's audio decoders are able to handle more complex operations than were previously possible. This allows a far greater reliance on the decoder to render the audio on the specific reproduction system being used and enables the user to personalize the audio experience. In a traditional audio decoder, a 5.1 AAC stream would be decoded to six channels, each of which would simply be fed via amplification to the corresponding speaker. In this case, the broadcaster maintained control over the end-user experience by encoding multiple streams, each with a complete mix for every coding mode (one 5.1 and one stereo, for example) and/or language. This naturally requires a lot of bandwidth and is inefficient, since much of the audio mix is common to all streams.

NGA codecs can carry channel-based data in the same way as existing codecs, but they also allow the carriage of objects and, in the case of MPEG-H, Higher Order Ambisonics (HOA). This means that, for example, a channel-based 5.1 bed can be encoded separately from user-selectable objects such as monaural language tracks. This is clearly much more efficient than multiple 5.1 encodes.

The key features of MPEG-H Audio which set it apart from previous generations of audio codecs are:

OBJECTS

The ability to transmit specific elements of the audio mix separately, allowing the user to switch between different equivalent elements (such as different languages) and to alter these elements in volume and position within the limits defined by the broadcaster.
Objects allow:

1) Personalization: the ability of the end user to select components such as biased commentary or languages.

2) Dialogue Enhancement: the ability to change the volume of the dialogue in relation to the ambience.

3) Advanced accessibility features: Audio Description services can be provided in a very efficient way together with multiple languages, while still enabling Dialogue Enhancement.

HIGHER ORDER AMBISONICS

The ability to transmit a sound field based on a mathematical description of that field. This format allows easy manipulation of immersive sound on the receiver side and is currently the most favored format for VR and AR applications.

RENDERING AND ADVANCED LOUDNESS AND DRC CAPABILITIES

The ability of the decoder/renderer to make best use of the reproduction resources (speaker/headphone configuration) and to adapt the audio to the reproduction device (TV speaker, AVR/soundbar, or mobile device).
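The interplay between user personalization and broadcaster-defined limits can be illustrated with a minimal sketch. The function name, gain values, and limit range below are hypothetical, not part of the MPEG-H metadata syntax; the point is simply that the decoder clamps the user's requested dialogue gain to the range the content creator authored.

```python
# Hypothetical sketch of Dialogue Enhancement: the user's requested gain
# for a dialogue object is clamped to broadcaster-authored limits before
# being applied; the ambience bed is left untouched.

def apply_dialogue_gain(requested_db, min_db, max_db):
    """Clamp the user's requested dialogue gain to the authored limits."""
    return max(min_db, min(max_db, requested_db))

# Assume the broadcaster allows the dialogue level to be adjusted
# between -6 dB and +9 dB relative to its default.
authored_limits = (-6.0, 9.0)

print(apply_dialogue_gain(12.0, *authored_limits))  # request exceeds limit -> 9.0
print(apply_dialogue_gain(3.0, *authored_limits))   # within limits -> 3.0
```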
THE AUTHORING UNIT (AU)

New ways of producing and defining content are required to support these complex but efficient ways of processing audio. Besides the audio data, information about the properties of each audio element and its relationship to other elements is required. All this additional information is conveyed as metadata, and for the MPEG-H TV Audio System it is defined and handled by the newly introduced concept of an Authoring Unit.

The AU is used as part of the mixing process and allows the engineer to define how all the individual captures will form the final presentation, including the personalization options available to the end user. The AU generates what is called a scene description, in which the role and properties of each audio element of the actual mix are defined. An audio element can be a channel, an object, or an ambisonic representation of the audio, and these elements can be mixed together according to the content creator's preferences.

As a simple example, the AU might be used to associate a set of mixed audio channels with a 5.1 channel-based bed plus two mono language tracks in the form of objects. The objects are defined by a set of metadata including their location, their default volume, and the type of audio. Additionally, the content creator can allow the end user to alter the position and the volume within predefined limits.

The output of the Authoring Unit is the uncompressed pulse code modulation (PCM) audio tracks in combination with metadata containing the scene description. For today's infrastructure the physical output of the AU is usually SDI. The metadata defined in the AU is modulated onto a PCM control track, which is commonly carried on the 16th SDI channel along with the uncompressed audio. This combination allows robust and secure transport of the metadata.
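A scene description of the kind produced by the AU can be sketched as a simple data model: a 5.1 bed plus two mono language objects, each carrying its default gain, position, and personalization limits. The field names, types, and values below are illustrative assumptions, not the actual MPEG-H scene-description syntax.

```python
# Hypothetical data model for an AU scene description: a 5.1 channel bed
# plus two mono language objects, with the metadata (location, default
# volume, user-adjustable limits) described in the text above.
from dataclasses import dataclass, field

@dataclass
class AudioObject:
    name: str                           # e.g. a language track
    sdi_channel: int                    # SDI channel carrying this mono object
    default_gain_db: float = 0.0
    gain_range_db: tuple = (-6.0, 9.0)  # broadcaster-authored limits
    position_azimuth: float = 0.0       # default position, in degrees
    user_movable: bool = False

@dataclass
class SceneDescription:
    bed_layout: str                     # e.g. "5.1"
    bed_channels: list                  # SDI channels carrying the bed
    objects: list = field(default_factory=list)

scene = SceneDescription(
    bed_layout="5.1",
    bed_channels=[1, 2, 3, 4, 5, 6],
    objects=[
        AudioObject("English commentary", sdi_channel=7),
        AudioObject("Korean commentary", sdi_channel=8),
    ],
)

print(len(scene.objects))  # 2
```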
ENCODING AND DECODING

The purpose of a contribution encoder is to provide compression which is relatively transparent and allows downstream manipulation of the signals without audible artifacts. Historically, contribution encoding of audio has used either standard codecs at higher bitrates, in some cases with proprietary alignment (such as Ericsson's Phase Aligned Audio), or proprietary codecs such as Dolby E. The MPEG-H TV Audio System introduces a contribution format which, like Dolby E, is designed to replace SDI channel-based transport where compression and secure transport of metadata are required.

The input to the contribution encoder will be 16 channels of PCM audio over SDI, where the 16th channel may be dedicated to the control track output from an external AU. In use cases where the AU is upstream of the contribution encoder, the control track is demodulated by the encoder and the metadata is embedded and carried in the bitstream. Where no AU is present upstream, the encoder must act as a rudimentary AU by defining the minimum required metadata, which includes the channel configuration and the object and loudness metadata.

The audio channels are carried as independently encoded full-bandwidth signals, i.e. the compression exploits neither inter-channel redundancy nor reduced bandwidth (in the case of low-frequency effects (LFE) channels, for example). It should be noted that MPEG-H encoded contribution bitstreams are not decodable by consumer MPEG-H decoder/renderers such as those found in set-top boxes; an MPEG-H contribution decoder is required.

The purpose of the contribution decoder is to decode the audio streams to PCM and re-modulate the metadata defined in the AU or encoder onto the control track. These are then output over SDI for further production, archiving, or passing directly to the emission encoder.
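The encoder's two metadata paths, passing through an upstream AU's control track versus acting as a rudimentary AU, can be sketched as follows. The data representation (a dict per SDI channel, with channel 16 optionally holding demodulated AU metadata) and the default values are assumptions for illustration only.

```python
# Hypothetical sketch of contribution-encoder metadata handling:
# if SDI channel 16 carries a control track from an upstream AU, that
# metadata is embedded in the bitstream; otherwise the encoder acts as
# a rudimentary AU and supplies the minimum required metadata.

MINIMUM_METADATA = {
    "channel_config": "16x mono",  # channel configuration
    "objects": [],                 # object metadata
    "loudness_lkfs": -24.0,        # loudness metadata
}

def metadata_for_bitstream(sdi_channels):
    """sdi_channels: dict mapping SDI channel number -> payload.
    Channel 16 may hold a dict of demodulated AU control-track metadata."""
    control_track = sdi_channels.get(16)
    if isinstance(control_track, dict):   # AU upstream: carry its metadata
        return control_track
    return dict(MINIMUM_METADATA)         # no AU: rudimentary defaults

au_metadata = {"channel_config": "5.1 + 2 objects",
               "objects": ["eng", "kor"], "loudness_lkfs": -23.0}
print(metadata_for_bitstream({16: au_metadata})["channel_config"])  # 5.1 + 2 objects
print(metadata_for_bitstream({})["channel_config"])                 # 16x mono
```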
The emission encoder is responsible for defining the bitstream which will be decoded by the end-user device, such as a set-top box or smartphone, including any channels, objects, HOA content, and personalization options. This means that all the metadata associated with these components must be made available to it. With the exception of a few predefined legacy configurations, such as stereo or 5.1 channel-based setups, the emission encoder must therefore receive all metadata together with the PCM audio from an AU.

OPERATING MODES

Here we define some common use cases of MPEG-H which place different requirements on the encode workflow.

LIVE MIX

This use case covers any case where the final mix is defined at source. A typical example would be a live sports or news event where the audio mix is defined in the outside broadcast (OB) truck. Here, the AU will be in the OB truck and will produce a control track in real time. The contribution encoder will encode the audio components and demodulate the metadata from the control track for carriage in the bitstream. A contribution decoder at a production site, for example, will decode the bitstream and present the PCM audio components and a control track over SDI. Any intermediate mixing at the production facility would require an AU to re-write the control track. There may be one or more distribution hops, all using MPEG-H contribution encoders, enabling the preservation of the control track. At the final emission encoder, the control track and PCM component data are used to define the MPEG-H bitstream for consumption by the end-user device.

[Diagram: Live Mix workflow. Audio inputs and the Authoring Unit feed SDI baseband audio and a control track through the contribution and emission stages to the consumer.]
LIVE, MULTIPLE MONO

In this use case the acquired audio is not mixed at source but is instead first compressed using a contribution encoder and delivered to a production facility where the mixing and authoring are performed. Here, the first contribution encoder must define the mandatory metadata associated with the audio being carried, in the knowledge that it will be used as input to a downstream Authoring Unit. The configuration of the encoder is limited to the transport of a number (N) of mono channels, each associated with a channel of the SDI input. The encoder GUI or API is used to define the number of mono tracks. The downstream AU is then used to define the metadata and modulate the control track to act as input to either subsequent distribution hops or an emission encoder.

[Diagram: Live, Multiple Mono workflow. Audio inputs in an N-mono configuration are carried as SDI baseband audio with a basic control track to a downstream Authoring Unit, which outputs SDI baseband audio and a full control track toward the consumer.]

DISTRIBUTION OF LEGACY, CHANNEL-BASED CONTENT

This use case applies mostly to the carriage of legacy, channel-based content using MPEG-H contribution encoders. Where a number of traditional, channel-based audio presentations are to be broadcast with a particular service (for example a 5.1 English-language and a stereo Korean-language presentation), the contribution encoder can be used to define the metadata. The encoder GUI or API is used to assign SDI input channels to one or more channel groups, each of which has an associated loudness. This data is then used to set the metadata carried in the bitstream, which will, after decoding, appear in the control track. In this way, the unit is effectively acting as a basic Authoring Unit.

[Diagram: Legacy, channel-based distribution. Channel-group inputs are carried as SDI baseband audio and control track to the consumer.]
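The channel-group assignment described above can be sketched as a configuration structure: each group binds a set of SDI input channels to a layout and an associated loudness. The keys, labels, and loudness values are hypothetical stand-ins for whatever the encoder's GUI or API actually exposes.

```python
# Hypothetical channel-group configuration for a contribution encoder
# carrying legacy channel-based content: SDI input channels are assigned
# to channel groups, each of which has an associated loudness.

channel_groups = [
    {
        "label": "5.1 English",
        "layout": "5.1",
        "sdi_channels": [1, 2, 3, 4, 5, 6],
        "loudness_lkfs": -24.0,
    },
    {
        "label": "Stereo Korean",
        "layout": "2.0",
        "sdi_channels": [7, 8],
        "loudness_lkfs": -23.0,
    },
]

def validate_groups(groups):
    """Sanity check: no SDI channel may belong to more than one group."""
    used = set()
    for group in groups:
        for ch in group["sdi_channels"]:
            if ch in used:
                raise ValueError(f"SDI channel {ch} assigned twice")
            used.add(ch)
    return True

print(validate_groups(channel_groups))  # True
```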
OFF-LINE AUTHORING

In certain circumstances, usually where the audio inputs are known to be fixed and stable, it may be desirable to use an Authoring Unit to define a set of channels, objects, and personalization features off-line. With this feature, broadcasters can generate a set of presets for their typical use cases, and the on-site crew only needs to load the right preset. This reduces the time needed to set up the transmission and reduces the risk of errors in the metadata.

The configuration and metadata, output in the previous examples as the control track in the SDI, are here output as a configuration file. This file can then be uploaded to the contribution encoder via the GUI or API and will be applied to the audio input to the encoder via SDI. The output of the contribution encoder will be an MPEG-H bitstream with the metadata encoded as per the configuration file. Downstream contribution decodes and encodes and the emission encode follow as in the previous use cases.

[Diagram: Off-line Authoring workflow. The Authoring Unit produces a configuration file which is loaded into the contribution encoder; audio inputs are carried as SDI baseband audio and control track through the contribution and emission stages to the consumer.]
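The preset mechanism can be sketched as writing and reloading a configuration file. The JSON format, file path, and field names below are assumptions for illustration; the actual configuration-file format used by the AU and encoder is product-specific.

```python
# Hypothetical sketch of off-line authoring presets: the AU writes a
# configuration file, and the on-site crew loads the matching preset
# into the contribution encoder instead of authoring metadata live.
import json

def save_preset(path, preset):
    with open(path, "w") as f:
        json.dump(preset, f, indent=2)

def load_preset(path):
    with open(path) as f:
        return json.load(f)

preset = {
    "name": "football-5.1-two-languages",
    "bed": {"layout": "5.1", "sdi_channels": [1, 2, 3, 4, 5, 6]},
    "objects": [
        {"name": "eng", "sdi_channel": 7},
        {"name": "kor", "sdi_channel": 8},
    ],
}

save_preset("/tmp/preset.json", preset)
loaded = load_preset("/tmp/preset.json")
print(loaded["name"])  # football-5.1-two-languages
```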
CONCLUSION

This document has introduced the MPEG-H TV Audio System and, in so doing, has described both the advantages of its use and the differences compared to existing audio codecs. The use of objects and Higher Order Ambisonics not only enhances the immersive audio experience, by ensuring that a single stream renders optimally on any reproduction system, but also allows user personalization of features such as commentary and crowd noise. This object-based delivery of audio also offers significant bandwidth savings, particularly when delivering content in multiple languages.