Spatially scalable HEVC for layered division multiplexing in broadcast

2017 Data Compression Conference

Kiran Misra*, Andrew Segall*, Jie Zhao*, Seung-Hwan Kim*, Joan Llach+, Alan Stein+, John Stewart+, Hendry#, Ye-Kui Wang#, Yan Ye^ and Yong He^

* Sharp Labs of America, 5750 NW Pacific Rim Blvd, Camas, WA 98607, USA. {misrak, asegall, jzhao, kimse}@sharplabs.com
+ Technicolor, 975 avenue des Champs Blancs, 35576 Cesson Sévigné, France. {joan.llach, alan.stein, john.stewart}@technicolor.com
# Qualcomm, 5775 Morehouse Dr, San Diego, CA 92121, USA. {fhendry, yekuiw}@qti.qualcomm.com
^ InterDigital, 9710 Scranton Rd, San Diego, CA 92121, USA. {yan.ye, yong.he}@interdigital.com

Abstract: Recent broadcast standards support Layered Division Multiplexing (LDM) to achieve graceful degradation as signal quality degrades at the receiver. LDM is accomplished by using different constellations within the same Radio Frequency (RF) spectrum. LDM thus enables delivering multiple service tiers in a single broadcast channel. When used in conjunction with scalable source codecs such as the Scalable extension of High Efficiency Video Coding (SHVC), LDM further helps improve overall spectrum utilization and efficiency. In this paper we investigate a 2-tier broadcast LDM based service with one service tier aimed at a lower video resolution such as 540p, 720p or 1080p for a mobile receiver (smaller/indoor antenna) and the other service tier targeting twice the video resolution of the lower tier, for stationary receivers (larger/outdoor antenna). The primary contribution of this paper is to identify 2-tier transmission configurations of interest to broadcasters and to compare the spectrum and bitrate coding efficiency gains of an SHVC-based multi-tier service versus a simulcast (single layer) based multi-tier service for an Advanced Television Systems Committee (ATSC) 3.0 transmission system. Bitrate savings ranging from 38% to 57% are observed for the SHVC based layered system.
For the large coverage and pedestrian receiver test scenarios, channel utilization savings ranging from 23% to 46% are observed. For the mobile and tablet in bedroom scenarios, smaller broadcast bandwidth savings ranging from 6% to 9% are observed.

2375-0359/17 $31.00 2017 IEEE. DOI 10.1109/DCC.2017.81

1. Introduction

Existing deployments of broadcast transmission, such as Advanced Television Systems Committee (ATSC) 1.0 [1] and Digital Video Broadcasting Terrestrial (DVB-T) [2], include a single video resolution service within a broadcast radio frequency (RF) channel. However, recent physical layer (PHY) broadcast standards such as ATSC 3.0 [3] and DVB-T2 [4] support Layered Division Multiplexing (LDM). LDM superimposes multiple physical layer pipes (PLPs) comprising different constellations, power levels, modulations and coding schemes. One usage scenario envisioned for a two-layer LDM is to transmit two different video resolution services within the same broadcast RF channel. The Core Layer and Enhanced Layer are the first and second layers, respectively, of a two-layer LDM [3]. LDM receivers would first recover the robust Core Layer signal and then the Enhanced Layer signal. In such a setup, receivers closer to the transmitting station (with higher overall SNR) may decode the service with higher video resolution, e.g. Ultra High Definition (UHD), while receivers that are farther away from the transmitting antenna (with lower SNR) would receive and decode the service with lower video resolution, e.g. High Definition (HD).

Note that none of the above assumes the use of a scalable video compression system. Such a service may be enabled by simulcasting single-layer bitstreams within each of the LDM layers. However, such an approach would waste precious spectrum when compared to an approach using a two-layer scalable video bitstream such as an SHVC (Scalable extension of High Efficiency Video Coding) [6] bitstream, where the base layer provides the lower video resolution service and the base layer together with the enhancement layer provides the higher video resolution service. Moreover, SHVC supports non-cross-layer aligned intra random access point (IRAP) pictures of any type, and an SHVC bitstream can start with any type of IRAP access unit, including an IRAP access unit where the base layer picture is an IRAP picture while the enhancement layer pictures are non-IRAP pictures. This allows easy splicing of multi-layer SHVC bitstreams at any type of IRAP access unit.

This paper investigates the use of SHVC in conjunction with LDM for delivering multiple video resolution services within the same RF channel. Since broadcast channels are bandlimited, the bitstream generated by the SHVC encoder is limited to pre-determined target bitrates. The target bitrates are enforced for every random access segment (RAS) in the bitstream.
The ATSC 3.0 PHY system described in [3] and the SHVC constraints described in [7] are used to determine the experimental conditions and demonstrate the improved performance of SHVC based LDM versus simulcast based LDM. The trade-off for improved source video compression and spectrum utilization is increased decoder complexity. However, this increased decoder complexity is borne by receivers that intend to support the higher resolution video service, and such receivers may be inherently more capable. Receivers that support only the lower video resolution service need only extract the base layer bitstream and use a single-layer High Efficiency Video Coding (HEVC) [5] decoder.

The remaining sections of the paper are organized as follows. In section 2 we briefly introduce LDM. In section 3 we describe the operation of the scalable extension of HEVC. We also compare the worst case memory bandwidth requirements for a 2-layer SHVC decoder versus a single layer HEVC decoder for the spatial scalability configurations supported in ATSC 3.0. Next, in section 4, we describe the scenarios of interest (and their PHY layer setup) for evaluating the proposed system. The experimental setup, including the test video sequences used, the encoder configuration and the metrics used for the evaluation, is described in section 5. The results are presented in section 6. We conclude in section 7.

2. Layered division multiplexing (LDM)

In recent broadcast standards such as ATSC 3.0 [3] and DVB-T2 [4], source data is carried in PLPs whose capacity and robustness may be adjusted to the required transmission characteristics. The capacity and robustness are influenced by the choice of constellation, error correcting codes and their rates. These broadcast standards also support the combining of multiple PLPs with a pre-determined power ratio for transmission over an RF channel. Such an approach allows transmission of tiered services within an RF channel.

Figure 1, replicated from [3], illustrates an example constellation for a two-layer LDM transmission where the Core Layer signal, shown in sub-figure (a), uses Quadrature Phase Shift Keying (QPSK) modulation and the Enhanced Layer signal, shown in sub-figure (b), uses a Non-Uniform Constellation (NUC) of 64-QAM (Quadrature Amplitude Modulation). The transmission constellation that results when the constellations of the two layers are combined and normalized is shown in sub-figure (c).

Figure 1: Example illustration of layered constellations from [3] for (a) Core Layer QPSK (b) Enhanced Layer non-uniform 64-QAM and (c) Combined Core and Enhanced Layer

Note that the choice of constellations depends on the code rate selected for emission. The constellations supported in ATSC 3.0 [3] are uniform QPSK modulation and five non-uniform constellation sizes: 16-QAM, 64-QAM, 256-QAM, 1024-QAM and 4096-QAM.
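The superposition described in this section can be illustrated with a small noise-free simulation. The sketch below is a minimal illustration, not the ATSC 3.0 procedure: it assumes a uniform 16-QAM Enhanced Layer in place of the non-uniform constellations of [3], a hypothetical 5 dB power ratio between the layers, and a simple receiver that decides the Core Layer symbol first and subtracts it before deciding the Enhanced Layer symbol. All function names are illustrative.

```python
import math
import random

def qpsk(b0, b1):
    """Map two bits to a unit-power QPSK symbol."""
    return complex(1 - 2 * b0, 1 - 2 * b1) / math.sqrt(2)

# Assumption: uniform 16-QAM stands in for the Enhanced Layer constellation.
# ATSC 3.0 uses non-uniform constellations that are not reproduced here.
LEVELS = (-3, -1, 1, 3)
QAM16 = [complex(i, q) / math.sqrt(10) for i in LEVELS for q in LEVELS]

def ldm_combine(core, enhanced, injection_db):
    """Superimpose the layers; the Enhanced Layer sits injection_db below the Core Layer."""
    g = 10 ** (-injection_db / 20)      # amplitude scaling of the Enhanced Layer
    norm = math.sqrt(1 + g * g)         # keeps the combined signal at unit average power
    return [(c + g * e) / norm for c, e in zip(core, enhanced)]

def ldm_receive(rx, injection_db):
    """Decide the Core Layer first, cancel it, then decide the Enhanced Layer."""
    g = 10 ** (-injection_db / 20)
    norm = math.sqrt(1 + g * g)
    core_hat, enh_hat = [], []
    for y in rx:
        # Core Layer decision: per-axis sign, with the Enhanced Layer treated as noise.
        c = complex(math.copysign(1, y.real), math.copysign(1, y.imag)) / math.sqrt(2)
        # Remove the Core Layer contribution and undo the Enhanced Layer scaling.
        residual = (y * norm - c) / g
        e = min(QAM16, key=lambda p: abs(p - residual))
        core_hat.append(c)
        enh_hat.append(e)
    return core_hat, enh_hat

random.seed(0)
core = [qpsk(random.getrandbits(1), random.getrandbits(1)) for _ in range(1000)]
enh = [random.choice(QAM16) for _ in range(1000)]
tx = ldm_combine(core, enh, injection_db=5.0)   # 5 dB is a hypothetical value
core_hat, enh_hat = ldm_receive(tx, injection_db=5.0)
print("core recovered:", core_hat == core, "enhanced recovered:", enh_hat == enh)
```

With noise added, the Core Layer decision would fail last and the Enhanced Layer decision first, which is the graceful degradation that LDM is designed to provide.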

A receiver interested in the Core Layer signal treats the Enhanced Layer signal as noise. A receiver wishing to decode the Enhanced Layer signal may first recover the Core Layer signal, cancel the Core Layer signal from the received signal and then recover the Enhanced Layer signal. Note that the recovery process is typically not specified in broadcast standards, as they only determine the emission.

3. Scalable extension of HEVC

In this section we examine the receiver side impact of choosing an SHVC system for the multi-tiered video resolution service. The ATSC 3.0 [7] system constrains the scalable extension of HEVC to configurations where the enhancement layer video resolution is 1.5, 2 or 3 times the width and height of the base layer video resolution. These spatial scalability configurations are sometimes referred to as 1.5x, 2x and 3x, respectively. To reduce the needed complexity of an SHVC decoder, ATSC 3.0 [7] further disables tools that were designed to map color gamuts across layers. For such an SHVC bitstream, the decoder illustrated in Figure 2 may be used to extract the video of interest. A receiver interested in the base layer lower resolution video need only implement the bitstream demultiplexer and the base layer decoder (enclosed in the light blue box). A receiver interested in the higher resolution video would need to implement all the blocks shown in Figure 2.

Figure 2: SHVC decoder architecture for a two layer system. The SHVC bitstream is demultiplexed into a base layer (low video resolution) bitstream and an enhancement layer (high video resolution) bitstream. The base layer decoder and its decoded picture buffer produce the low resolution video output; base layer pictures are scaled and passed to the enhancement layer decoder, which, with its decoded picture buffer, produces the high resolution video output.

If a simulcast system is used for transmission of the multi-tiered video resolution service instead of an SHVC system, then the higher resolution video decoder would be a single layer HEVC decoder.
The trade-off in such an instance would be the increased transmission bandwidth.

Next we examine the worst case memory bandwidth needed at the decoder. The worst case memory bandwidth for the system shown in Figure 2 arises in the interpolation process used in enhancement layer motion compensation for bi-predicted prediction units of size 8x8 [8]. Note that the memory bandwidth requirement for the lower resolution video decoder is unchanged from that of a single layer HEVC decoder. The worst case decoder memory bandwidth for an SHVC decoder decoding the higher video resolution (assuming a spatial scalability factor no less than 1.5x) is 1.45 to 1.60 times [8] that of the case where the higher resolution video is decoded using a single-layer HEVC decoder. The 1.60 worst case figure corresponds to a decoder architecture where the entire base layer picture is scaled, whereas the 1.45 worst case figure corresponds to a block based decoder architecture where only those base layer samples that are used by the enhancement layer for prediction are scaled. Interestingly, it has been observed in [8] that the average number of memory accesses for the block based decoder architecture is within 11% of that of the higher resolution single-layer HEVC decoder.

4. Scenarios of interest for 2-layer LDM service carrying multiple resolutions

In this section we briefly describe the various transmission scenarios of interest for a 2-layer LDM service carrying two different video resolutions. Here the Core Layer carries a video service with lower video resolution, whereas the Core + Enhanced Layer carries a video service with twice the lower video resolution. It is assumed that audio and other miscellaneous information is carried in its own PLP. Also listed below are example physical layer configurations for each transmission scenario, with properties such as Signal-to-Noise Ratio (SNR), source code rate of Forward Error Correction (FEC), etc. The example configurations make use of Additive White Gaussian Noise (AWGN) as well as Rayleigh channel models. These example configurations are used in sections 5 and 6 for experimental evaluation and comparison of the SHVC and simulcast schemes.

4.1. Scenario A: Larger coverage area

In this transmission scenario the higher resolution video service is targeted at a class of receivers that are stationary and in the current ATSC 1.0 coverage area. There is also a second class of receivers that are stationary but not in the current ATSC 1.0 coverage area (rural, or with an indoor or integrated antenna). The second class of receivers would decode the more robust Core Layer video service.
An example physical layer configuration for scenario A is shown in Table 1.

Table 1: Physical layer configuration for scenario A (32K FFT, long FEC codes, 150 microsecond guard interval)

                              PLP-1        PLP-2        PLP-3
Resolution                    UHD (2160p)  HD (1080p)   Audio/misc
SNR dB (AWGN)                 14.3         9.5          3.6
Modulation                    64-QAM NUC   16-QAM NUC   QPSK
FEC source rate               11/15        11/15        11/15
Spectral efficiency (b/s/Hz)  4.0          2.67         1.31
Bit-rate (Mbps)               11           7.5          0.47

4.2. Scenario B: Pedestrian phone/tablet

In this transmission scenario there exists a first class of receivers that are stationary and would decode the higher resolution video service. There also exists a second class of receivers that may correspond to handheld devices moving at pedestrian speeds (possibly indoors). The second class of receivers is expected to decode the more robust Core Layer video service. An example physical layer configuration for scenario B is shown in Table 2.

Table 2: Physical layer configuration for scenario B (16K FFT, short FEC codes, 150 microsecond guard interval)

                              PLP-1        PLP-2        PLP-3
Resolution                    HD (1080p)   HD (720p)    Audio/misc
SNR dB (AWGN)                 8.6          2.2          2.2
SNR dB with 6 dB SFN* gain    2.6          -3.8         -3.8
Modulation                    16-QAM NUC   QPSK         QPSK
FEC source rate               10/15        9/15         9/15
Spectral efficiency (b/s/Hz)  2.23         1.0          1.0
Bit-rate (Mbps)               4.7          3.5          0.3

* SFN (Single Frequency Network) represents a network where multiple transmitters in proximity to one another radiate the same waveform and share a frequency.

4.3. Scenario C: Mobile enabled

In this transmission scenario there exists a first class of receivers that are stationary and would decode the higher resolution video service, while the second class of receivers may be moving at relatively high speeds and would decode the Core Layer video service. An example physical layer configuration for scenario C is shown in Table 3.

Table 3: Physical layer configuration for scenario C (8K FFT, short FEC codes for qHD; 32K FFT, long FEC codes for HD; 150 microsecond guard interval)

                              PLP-1        PLP-2        PLP-3
Resolution                    HD (1080p)   qHD (540p)   Audio/misc
SNR dB (AWGN)                 14.3         -2.3         -2.3
Modulation                    64-QAM NUC   QPSK         QPSK
FEC rate                      11/15        4/15         4/15
Spectral efficiency (b/s/Hz)  4.0          0.44         0.44
Bit-rate (Mbps)               2.7          2.34

4.4. Scenario D: Tablet in bedroom

In this transmission scenario there exists a first class of receivers that are stationary and would decode the higher resolution video service, while the second class of receivers are portable and would decode the Core Layer video service. An example physical layer configuration for scenario D is shown in Table 4.

Table 4: Physical layer configuration for scenario D

                              PLP-1         PLP-2            PLP-3
Resolution                    HD (1080p)    HD (720p)        Audio/misc
SNR dB                        30.0 (AWGN)   0.86 (Rayleigh)  -2.3 (AWGN)
Modulation                    4096-QAM NUC  QPSK             QPSK
FEC rate                      12/15         6/15             4/15
Spectral efficiency (b/s/Hz)  7.1           0.59             0.44
Bit-rate (Mbps)               4.5           2.75             0.3

5. Experimental setup

In this section we describe the test video sequences used, the encoder configuration and the metrics used for comparing HEVC (Main 10) simulcast versus SHVC (Scalable Main 10) based LDM transmission. The test sequences used for each scenario are listed in Table 5. The terms BL and EL in Table 5 represent the lower and higher video resolution, respectively, for both SHVC and HEVC simulcast. Sample frames from each of the sequences are shown in Figure 5. The sequences (a), (b) and (c) in Figure 5 were downloaded for testing from the Ultra Video Group Test Sequences web site [9]. The sequences (d) and (e) were donated for testing by the Public Broadcasting Service (PBS). All of the original sequences are 4K resolution. The sequences were downsampled to the appropriate resolution for experimentation. The integer and fractional frame rates used in our evaluation are 60 fps and 59.94 fps, respectively. Since sequences (a) through (c) are originally 120 fps, every other frame was dropped from the original test sequences to generate 60 fps test sequences. No such processing was performed for the PBS sequences.

Table 5: List of test sequences for each scenario

Scenario                 Resolutions                 Bit depth  Frame rate  Frame count  Input sequences
Larger coverage area     BL 1920x1080, EL 3840x2160  8          60          300          Bosphorus, Jockey, ReadySetGo
                                                     10         59.94       1040         PBS1
                                                     10         59.94       1782         PBS2
Pedestrian phone/tablet  BL 1280x720, EL 1920x1080   8          60          300          Bosphorus, Jockey, ReadySetGo
                                                     10         59.94       1040         PBS1
                                                     10         59.94       1782         PBS2
Mobile enabled           BL 960x540, EL 1920x1080    8          60          300          Bosphorus, Jockey, ReadySetGo
                                                     10         59.94       1040         PBS1
                                                     10         59.94       1782         PBS2
Tablet in bedroom        BL 1280x720, EL 1920x1080   8          60          300          Bosphorus, Jockey, ReadySetGo
                                                     10         59.94       1040         PBS1
                                                     10         59.94       1782         PBS2

For the experiments, the SHM-7.0 [10] reference software provided by the Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG 16 WP 3 and ISO/IEC JTC 1/SC 29/WG 11 is used. The encoder rate-distortion decisions for the base layer in SHM-7.0 are made independently of the rate-distortion decisions for the enhancement layer. Testing is performed for both constant bitrate and constant quality (fixed quantization parameter, QP) configurations. Constant bitrate is achieved by adjusting the bitrate every intra-period.

Figure 5: Sample frames from the test sequences: (a) Bosphorus (b) Jockey (c) ReadySetGo (d) PBS1 (e) PBS2
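As a cross-check on the example physical layer configurations in Tables 1 to 4, dividing each PLP's bit rate by its spectral efficiency gives a rough estimate of the spectrum that PLP occupies. The sketch below assumes this simple bit-rate-over-spectral-efficiency reading, which is an interpretation and is not stated in the tables. Under that assumption every scenario sums to slightly under 6 MHz.

```python
# (bit rate in Mbps, spectral efficiency in b/s/Hz) per PLP, copied from Tables 1 to 4.
SCENARIOS = {
    "A": [(11.0, 4.0), (7.5, 2.67), (0.47, 1.31)],
    "B": [(4.7, 2.23), (3.5, 1.0), (0.3, 1.0)],
    "C": [(2.7, 4.0), (2.34, 0.44)],   # Table 3 lists only two bit rates
    "D": [(4.5, 7.1), (2.75, 0.59), (0.3, 0.44)],
}

def occupied_mhz(plps):
    """Assumed model: each PLP occupies bit rate / spectral efficiency (Mbps / (b/s/Hz) = MHz)."""
    return sum(rate / se for rate, se in plps)

for name, plps in SCENARIOS.items():
    shares = ", ".join(f"{rate / se:.2f}" for rate, se in plps)
    print(f"Scenario {name}: PLP shares [{shares}] MHz, total {occupied_mhz(plps):.2f} MHz")
```

The per-PLP shares also show how much of the spectrum goes to the robust lower resolution service: in scenarios C and D the low spectral efficiency PLP takes most of the total even though its bit rate is the smaller one.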

For the constant quality configuration, the bit rate (BR) saving is calculated from the HEVC EL bit rate, the SHVC EL bit rate and the HEVC BL bit rate. The channel utilization (CU) saving is calculated using the spectral efficiency (SE) of each PLP. The channel utilization accounts for the modulation and coding configuration for each PLP, and it quantifies how much the channel bandwidth is reduced for the same quality of service by using SHVC. The quality of the coded video is restricted to lie between 38 dB and 42 dB, which approximately represents broadcast video quality. For the constant bitrate configuration, 10-bit sequences were used for experimental evaluation.

6. Results

In this section we examine the results for the experimental setup described in section 5 for the various PHY scenarios described in section 4. Table 6 tabulates the BR and CU savings for the constant quality configuration where the random access point periods for BL and EL are both set to ½ second. Overall, the bit rate reduction ranges from 38% to 56%. For the larger coverage area scenario and the pedestrian phone/tablet scenario, the needed channel bandwidth can be reduced by 24% to 43%, while for the mobile enabled scenario and the tablet in bedroom scenario, the channel bandwidth can be reduced by about 5% to 9%.

Table 6: Test results for constant quality configuration

            Larger coverage area  Pedestrian phone/tablet  Mobile enabled   Tablet in bedroom
            BR       CU           BR       CU              BR       CU      BR       CU
Bosphorus   48%      35%          40%      24%             38%      7%      42%      6%
Jockey      56%      42%          50%      31%             41%      9%      50%      8%
ReadySetGo  42%      33%          42%      24%             41%      9%      45%      7%
PBS1        51%      32%          47%      24%             46%      7%      47%      5%
PBS2        45%      34%          43%      26%             42%      8%      43%      7%
Average     48.40%   35.20%       44.40%   25.80%          41.60%   8.00%   45.40%   6.60%

Table 7 tabulates test results for the 10-bit sequences with the constant bit rate configuration. The results show that for each test, compared to HEVC simulcast, SHVC either lowers the bit rate of the enhancement layer (EL) for nearly the same EL quality, or lowers the bit rate of the EL and improves the EL picture quality.
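The gap between the BR and CU savings in Table 6 can be illustrated with a short calculation. The sketch below is one plausible reading, not necessarily the exact formulas behind Table 6: it assumes the BR saving is taken relative to the simulcast EL bit rate, and that each layer occupies channel bandwidth equal to its bit rate divided by the spectral efficiency of its PLP. The bit rates are hypothetical; the spectral efficiencies are those of Tables 1 and 3.

```python
def bitrate_saving(hevc_el, shvc_el):
    """Assumed definition: EL bit rate saved by SHVC, relative to the simulcast EL bit rate."""
    return (hevc_el - shvc_el) / hevc_el

def channel_saving(bl, hevc_el, shvc_el, se_bl, se_el):
    """Assumed definition: relative reduction in bandwidth, with bandwidth = bit rate / SE."""
    simulcast = bl / se_bl + hevc_el / se_el   # BL plus single-layer HEVC EL
    scalable = bl / se_bl + shvc_el / se_el    # same BL plus SHVC enhancement layer
    return (simulcast - scalable) / simulcast

# Hypothetical bit rates in Mbps, chosen only for illustration.
bl, hevc_el, shvc_el = 2.5, 10.0, 5.2

br = bitrate_saving(hevc_el, shvc_el)
cu_a = channel_saving(bl, hevc_el, shvc_el, se_bl=2.67, se_el=4.0)  # SE values from Table 1
cu_c = channel_saving(bl, hevc_el, shvc_el, se_bl=0.44, se_el=4.0)  # SE values from Table 3
print(f"BR saving {br:.0%}, CU saving scenario A {cu_a:.0%}, scenario C {cu_c:.0%}")
```

Under these assumptions the same bit rate saving yields a much smaller channel saving when the base layer is carried in a very robust, low spectral efficiency PLP, because that PLP dominates the occupied bandwidth and is unaffected by the choice between SHVC and simulcast.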
The video quality is measured as Peak Signal to Noise Ratio (PSNR).

Table 7: Test results for constant bitrate configuration (intra period: BL every 32 frames, EL every 32 frames)

                    Larger coverage area  Pedestrian phone/tablet  Mobile enabled      Tablet in bedroom
                    Bitrate  Luma PSNR    Bitrate  Luma PSNR       Bitrate  Luma PSNR  Bitrate  Luma PSNR
PBS1  BL            1942     53.04        1221     52.59           913      51.83      1221     52.59
      EL Simulcast  4691     52.02        1942     53.04           1942     53.04      1942     53.04
      EL SHVC       3110     51.98        922      52.78           1337     52.92      922      52.78
PBS2  BL            6842     41.47        3240     40.29           1814     39.65      2502     39.30
      EL Simulcast  10214    41.28        4293     39.70           2447     37.84      4091     39.53
      EL SHVC       9212     42.55        3944     41.28           2420     39.30      3825     40.82

7. Conclusion

This paper presented test results for SHVC versus HEVC simulcast for layered division multiplexing in an ATSC 3.0 system. Overall, bit rate reductions ranging from 38% to 57% are observed for the SHVC system. Results show that when SHVC is used, for the larger coverage area scenario and the pedestrian phone/tablet scenario, there is a 23% to 43% channel utilization saving, which corresponds to how much the broadcast channel bandwidth may be reduced. The freed bandwidth can then be used for additional programming, improved visual quality or a larger footprint. Similarly, for the mobile enabled scenario and the tablet in bedroom scenario, there is a saving of about 6% to 9%.

Acknowledgment

The authors would like to thank PBS for making available the video sequences that enabled a successful comparison of simulcast HEVC and SHVC for LDM.
References

[1] A/53: ATSC Digital Television Standard
[2] ETSI EN 300 744 "Digital Video Broadcasting (DVB); Framing structure, channel coding and modulation for digital terrestrial television," European Standard (Telecommunications series)
[3] ATSC 3.0, A/322, ATSC Standard: Physical Layer Protocol
[4] ETSI EN 302 755 "Digital Video Broadcasting (DVB); Framing structure, channel coding and modulation for a second generation digital terrestrial television (DVB-T2)," European Standard
[5] ISO/IEC 23008-2:2013 Information technology -- High efficiency coding and media delivery in heterogeneous environments -- Part 2: High efficiency video coding
[6] ISO/IEC 23008-2:2015 Information technology -- High efficiency coding and media delivery in heterogeneous environments -- Part 2: High efficiency video coding
[7] ATSC 3.0, A/341, ATSC Candidate Standard: Video - HEVC
[8] Elena Alshina, Alexander Alshin, JCTVC-N0150, AhG17: complexity analysis of SHM2.0, Joint Collaborative Team on Video Coding (JCT-VC) 14th Meeting: Vienna, AT, 25 July - 2 Aug. 2013
[9] Ultra Video Group Test Sequences http://ultravideo.cs.tut.fi/#testsequences
[10] SHM-7.0 https://hevc.hhi.fraunhofer.de/svn/svn_shvcsoftware/tags/shm-7.0/