Efficient encoding and delivery of personalized views extracted from panoramic video content


Efficient encoding and delivery of personalized views extracted from panoramic video content

Pieter Duchi

Supervisors: Prof. dr. Peter Lambert, Dr. ir. Glenn Van Wallendael
Counsellors: Ir. Johan De Praeter, Niels Van Kets

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Electrical Engineering

Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle
Faculty of Engineering and Architecture
Academic year 2015-2016


Acknowledgments

I would like to express my gratitude to everyone who supported me in creating this master's dissertation. Without them, I would not be where I am today. First of all, I would like to thank my supervisors Johan De Praeter, Glenn Van Wallendael and Niels Van Kets. They were always available for advice, feedback and comments. I would like to emphasize the excellent guidance of Johan De Praeter and the well-developed framework (including the modified HM reference software) he made available to me. This saved me a lot of time and was of key importance for the second part of my thesis period. He was always willing to answer all my questions at any time. I wish them all the best in their future careers. I would also like to thank prof. Rik Van de Walle and prof. Peter Lambert for their enthusiastic and very interesting courses on multimedia. These classes aroused my interest and fascination in everything related to multimedia and were very helpful for understanding the basics of this research. The panoramic content I received came from a company called Kiswe. Finding panoramic content is not easy, so special thanks go to them for making this type of content available for me to work on in my master's dissertation. Next, I am also very grateful to my parents and sisters. They always supported me throughout my academic career and guided me in the right direction whenever I had to choose between different options. Finally, I would like to thank my girlfriend Valérie De Vrieze. Her support, love and ability to make me smile have been invaluable to me throughout the years.

Pieter Duchi, May 2016

Usage permission

The author gives permission to make this master's dissertation available for consultation and to copy parts of this master's dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master's dissertation.

Pieter Duchi, May 2016

Efficient encoding and delivery of personalized views extracted from panoramic video content

by Pieter Duchi

Thesis submitted to obtain the degree of Master in Electrical Engineering - Communication and Information Technology
Academic year 2015-2016
Ghent University
Faculty of Engineering and Architecture
Department of Electronics and Information Systems - Data Science Lab
Head of department: prof. dr. ir. R. Van de Walle
Promotors: prof. dr. ir. P. Lambert, dr. ir. G. Van Wallendael
Supervisors: ir. Johan De Praeter, Niels Van Kets

Summary

This master's dissertation compares two approaches to efficiently encode and deliver personalized views extracted from panoramic video content. First, the relevant features of HEVC are discussed. Next, an overview is given of existing methods for delivering the desired view of the panoramic content to the user. Two approaches are discussed in this master's dissertation, namely a tile-based method and a non-tile-based method. The first method divides the panoramic video into non-overlapping tiles, encodes them and only sends the tiles that overlap with the desired viewpoint to the user. The second method (the non-tile-based method) makes use of coding information obtained from encoding the panoramic video in order to skip encoding decisions when personalized views are encoded for each user. Both methods are applied to the panoramic video and investigated. For this master's dissertation only static views are taken into consideration. Finally, both approaches are compared in terms of the bit rate, quality and coding time of the selected views. In the end, it is concluded that the non-tile-based method has the most advantages.

Keywords: Panoramic video, video interaction, High Efficiency Video Coding (HEVC), tiling, fast encoding

Efficient encoding and delivery of personalized views extracted from panoramic video content

Pieter Duchi

Supervisor(s): prof. dr. Peter Lambert, dr. ir. Glenn Van Wallendael, ir. Johan De Praeter, Niels Van Kets

Abstract: The trend for future video services is to make the user experience more interactive. This can be achieved by letting the user select his own personalized view in an extremely high resolution video. Because encoding and delivering these personalized views for each user is a computationally complex process, two techniques are discussed in this paper. The first method is based on splitting the panoramic video into different tiles and sending only the tiles that overlap with the desired viewpoint of each user. The second method is based on reusing coding information extracted from an encoded panoramic video in order to speed up the encoding of each personal view. Simulation results, when only static views are considered, point out that the second method has the most advantages with the infrastructure available today.

Keywords: Panoramic video, video interaction, High Efficiency Video Coding (HEVC), tiling, fast encoding

I. INTRODUCTION

Today, the camera work of video content is edited by a director, which means that all users obtain the same limited experience. High resolution digital imaging sensors now exist and make it possible to capture high resolution video up to 4K+. By stitching comprehensive high-resolution views from multiple cameras, a panoramic video can be created with a resolution far beyond HD. This panoramic video can offer the possibility of viewing an arbitrary Region-of-Interest (RoI) interactively based on coordinates or tracking, which makes the user experience more interactive. By adding zoom functionality, the content can be displayed on anything from panoramic displays to lower spatial resolution displays such as tablets or even mobile devices. Many applications are possible with panoramic video, such as interactive viewing of sports events, virtual pan/tilt/zoom within the wide angle of a surveillance camera, streaming instructional videos such as lecture videos, video conferencing, etc.

Delivering these beyond-HD resolutions to the user poses some problems. Even when compressed without significant loss in video quality, delivering this high resolution content to the user is not possible due to limited network capacity. Moreover, at the user side it is not possible to display such high resolution content because the panoramic video would not fit on the limited resolution of the display devices, and the decoding load for smaller devices would be too high. In order to overcome these problems, two techniques are proposed: the tile-based approach and the non-tile-based approach. This paper compares both techniques in terms of quality, bit rate and coding delay. Much research has already been done on the tile-based approach, mostly using the H.264/AVC codec.

The outline of this paper is as follows. Both techniques are briefly explained in section II. Next, section III contains a brief overview of the HEVC standard, which is used as the compression standard in both techniques. In section IV, both techniques are investigated and conclusions for each method are given. Section V shows the results of the comparison between the tile-based method and the non-tile-based method.
Finally, conclusions are drawn in section VI.

II. RELATED WORK

The first technique, the tile-based method, has already been introduced in literature [1-3]. It was mostly applied with the H.264/AVC codec. In this approach, the panoramic video is subsampled at the server to different resolutions (including a thumbnail, see further) in order to provide zooming through multiple resolution layers. These layers are then subdivided into a grid of non-overlapping tiles and encoded. At the user side, the user selects the RoI he is interested in based on a thumbnail, which is a small resolution overview of the entire panoramic video. Next, the tiles falling within and intersecting with the RoI boundary for the requested resolution are streamed from the server. These tiles are rendered at the user side and cropped to the appropriate resolution of the display if necessary.

The tile-based method has several disadvantages. A first disadvantage is that tiled streaming pays the price of sending additional bits outside the RoI that are not displayed at the user side. This is because some tiles may only partially overlap with the RoI, since the RoI is unlikely to be aligned with tile boundaries. To reduce these wasted bits, one can reduce the tile dimensions. But since each tile is encoded independently, small tiles lead to a lower compression ratio, increasing the number of bits needed for the RoI. Another disadvantage is that the user needs a customized video player to decode, combine and synchronize the tiled streams, which makes this approach harder to deploy. Furthermore, the tiles need an encoding structure that allows random access, which means that the tiles can only be decoded starting from an intra-coded tile. Moreover, a small intraperiod is needed to allow low-delay panning, which leads to an excessive increase in bit rate.

Due to the many disadvantages of the tile-based method, another technique (called the non-tile-based approach) has been proposed. This technique is completely different in that it encodes the selected RoI of each user on the fly. Such an approach is very flexible in supporting any RoI, but it is not scalable to a large number of users. Encoding a large resolution RoI also takes a lot of time, which cannot meet the requirements of a low-latency system. In order to speed up the encoding, Van Kets et al. of the Data Science Lab at Ghent University proposed a method [4] to lower the encoding complexity.

This was done by reusing coding information of the encoded full panoramic video in order to speed up the encoding process of the RoI of each user. However, in their research only CU coding information from the panoramic video was used, while there is much more coding information that can be used to further speed up the encoding of the RoIs. The reuse of this additional coding information is therefore also investigated in depth.

An advantage of the non-tile-based approach is that the user can use a standard decoder and has very flexible digital pan/tilt/zoom possibilities. A disadvantage is that the server needs many encoders in parallel to provide a personalized view for each user. However, cloud services are available nowadays which can do the processing and encoding. A cloud system can also scale the number of encoders depending on the number of users watching their personalized stream.

III. HIGH EFFICIENCY VIDEO CODING (HEVC)

HEVC is the newest video compression standard and the successor of the AVC/H.264 standard. Its main improvement is the increased compression efficiency (up to 50% bit rate reduction for equal perceptual video quality). This is achieved by dividing the frame into Coding Tree Units (CTUs) of typically 64×64 pixels. These CTUs can be recursively split into smaller Coding Units (CUs) according to a quadtree structure. The smallest allowed CU size is 8×8 pixels. Each CU is the decision making point for the prediction mode (intra or inter) and can be partitioned further into Prediction Units (PUs), which are the basic units for intra- and inter-prediction. There are eight possible PU partition sizes. There are two types of Motion Vector (MV) prediction modes for inter-prediction, namely Advanced MV Prediction (AMVP) and merge mode. Both techniques use MVs from the neighboring PU blocks to determine a good match for the current PU block. AMVP uses these MVs as predictors to determine an MV delta with respect to the actual MV. For merge, the MV is copied from its (spatial or temporal) neighbors. This merge concept can be used in combination with a skip mode. If skip mode is used, it implies the following: merge mode is used, the CU contains only one PU (M×M) and no residual data is present in the bitstream. This is well suited to encode static regions where the prediction error tends to be very small. The prediction residual obtained in each PU is transformed using the residual quadtree (RQT) structure. This structure is obtained by recursively splitting each CU into Transform Units (TUs) according to a quadtree structure. The smallest TU size is 4×4 pixels. These TUs are used for transformation and quantization of the residual picture. More detailed information on this standard can be found in [5].

IV. INVESTIGATION OF BOTH APPROACHES

In order to implement, evaluate and later compare both methods, suitable panoramic content had to be chosen. Hockey sports panoramic content was picked, because this type of content has static areas such as the ice hockey field, moving areas such as the supporters, and fast moving parts such as the hockey players. It is important to have a large range of spatial and temporal variability in the scenes, because this influences the complexity of the encoding. The hockey content consists of five sequences, split over two scenes. Only the first sequence of each scene was used for the eventual comparison, called hockey1_1 and hockey2_1. An example of these two different scenes is shown in figure 1.
The hockey sequences each consist of three 4K videos stitched together, resulting in a panorama several times wider than a single 4K frame. They all have a frame rate of 60 frames per second (FPS) and each sequence lasts at most 10 s (600 frames). Furthermore, the sequences are in YUV format with 4:2:0 chroma subsampling.

To evaluate and extend the non-tile-based method and eventually compare both methods, RoIs were chosen that contain different types of movement: some contain little motion, some are purely static and others have high motion. The RoIs were also chosen to be regions many users would look at, such as the ice hockey field itself. The chosen RoIs are shown in figure 2. The top and middle views are indicated by their corresponding view numbers as shown in the figure; the middle views, which mostly show the ice hockey field, carry the prefix m. The views with view number five (5 and m5) were ignored because these do not have the correct RoI resolution. The RoIs each have a resolution of 1920×1088 pixels (1088p); the reason for the small deviation from 1080p is explained further on. Note that only static views without zooming are considered. In order to have a better indication of how much spatial and temporal information each view contains, the spatial perceptual information (SI) and temporal perceptual information (TI) measures were calculated as described in ITU-T Recommendation P.910 [6]. The calculated TI and SI values of each view are shown in figure 3. There is a large variety of TI/SI values, which confirms the assumption that views with different types of motion and spatial detail are considered.

A. Tile-based approach

For the tile-based approach, the panoramic sequences were split into different tile sizes. Due to the large number of possible settings, only static views were explored and no zooming was allowed. One of the goals was to find an optimal tile size, and therefore different tile sizes were picked. The choice was made to pick 16:9 tile resolutions, because the RoIs are also close to 16:9 and this is the most common aspect ratio. Another possibility would be square tiles. It is very important that the tile sizes are multiples of 8, because the smallest CU size used in encoding is 8×8. The chosen tile sizes were 16:9 resolutions ranging from 72p (128×72) to 576p (1024×576) pixels, among them 144p (256×144) and 360p (640×360). Next, these tiles were compressed using version 16.5 of the HEVC Test Model (HM) software. The tiles need to provide random access in order to allow changing the RoI at any time, as the tiles are pre-encoded on the server. Therefore, the Random Access coding configuration was chosen. An intraperiod of 32 (a multiple of the GOP size 8) was picked, because this corresponds to a delay of 0.5 s, which is a reasonable maximum delay when other tiles have to be selected, e.g. when another RoI is chosen. All tiles were encoded with four different QP values (22, 27, 32 and 37) and decoded again. From the outputs of the encoding and decoding, the decoding time, the bit rate and the YUV-PSNR were retrieved for each tile.
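To make the tile-selection step concrete, the following minimal Python sketch computes which tiles of a regular grid (partially) overlap a given RoI, and the resulting pixel overhead. This is an illustrative sketch, not the simulation code used in this work, and the RoI position in the example is made up.

```python
import math

def overlapping_tiles(roi_x, roi_y, roi_w, roi_h, tile_w, tile_h):
    """Return the grid coordinates of all tiles that (partially) overlap the RoI."""
    col0 = roi_x // tile_w
    row0 = roi_y // tile_h
    col1 = math.ceil((roi_x + roi_w) / tile_w)  # exclusive upper bounds
    row1 = math.ceil((roi_y + roi_h) / tile_h)
    return [(r, c) for r in range(row0, row1) for c in range(col0, col1)]

def pixel_overhead(roi_w, roi_h, tiles, tile_w, tile_h):
    """Fraction of transmitted pixels that fall outside the RoI."""
    sent = len(tiles) * tile_w * tile_h
    return (sent - roi_w * roi_h) / sent

# Example: a 1920x1088 RoI at position (1000, 500) on a grid of 256x144 (144p) tiles.
tiles = overlapping_tiles(1000, 500, 1920, 1088, 256, 144)
print(len(tiles), "tiles,",
      round(100 * pixel_overhead(1920, 1088, tiles, 256, 144), 1), "% overhead")
```

For this made-up RoI position, 81 tiles are needed and about 30% of the transmitted pixels lie outside the RoI, which illustrates the bit rate overhead due to extra pixels discussed below.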

Fig. 1. Snapshot of each panoramic scene (hockey1_1 and hockey2_1) used for generating results.

Fig. 2. Selected 1088p RoIs for hockey1_1. The RoIs are referred to by their corresponding notation in the figure. The middle views are indicated by the prefix m.

Fig. 3. Spatial and temporal information for each view in sequences hockey1_1 and hockey2_1. The labels beneath the markers specify the particular view.

With this data, views were simulated using the tile-based method, with their corresponding bit rate, decoding time and PSNR (computed over the tiles that partially overlap with the RoI). Next to this, two types of overhead were investigated: tiling overhead due to encoding, and bit rate overhead due to the extra pixels that are sent. The tiling overhead due to encoding consists of the extra header overhead introduced by using multiple tiles to cover a specific RoI in the video. Secondly, because each tile is encoded independently, prediction is constrained within the tile. This constrains the MV lengths, which results in less optimal predictions and thus larger residual images, meaning a reduction in compression efficiency. The other type of overhead is the bit rate overhead due to the extra pixels sent when a RoI is delivered to the user. This overhead occurs because the RoI does not necessarily align with the tiles, so redundant regions outside the RoI are transmitted as well. These two overheads determine which tile size is optimal in terms of bit rate, PSNR and decoding time, as there is a trade-off between the two overheads as the tile size changes. From various experiments, it was found that the 144p (256×144 pixels) tiles are overall the best in terms of bit rate and decoding time for static 720p and 1080p RoIs on this type of content. Some other conclusions out of these experiments are explained in section V, where both methods are compared.

B. Non-tile-based approach

For the non-tile-based method, a modified decoder and encoder are used to extract and read the coding information from the panoramic video. This is the same HM 16.5 implementation as used for the tile-based method, but with these modifications disabled there. The panoramic video was encoded with four different QP values (22, 27, 32 and 37) and decoded again while the coding information was retrieved. Next, the coding information of the encoded panoramic video was cropped to perfectly overlap the area of each view. Because the views are 1088p and positioned at multiples of 64 (the maximum CTU size), the CTUs of the panoramic video are aligned with the CTUs of the views. After this, each view was encoded with the same four QP values (22, 27, 32 and 37), using different types of coding information from the region of the panoramic video that overlaps with that view. Reusing coding information of the panoramic video lowers the coding complexity and thus speeds up the encoding process, but leads to less optimal rate-distortion (RD) decisions. By feeding the encoder more coding information, such as PU, MVs, mode, etc., more coding steps can be skipped. Different kinds of coding information can be reused from the panoramic sequence; it was chosen to gradually feed more information to the encoder, starting with CU and then adding mode, PU, MV and finally merge information. From the output of the fast encoding process, the bit rate, PSNR, encoding time and decoding time were retrieved and stored.
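As a rough illustration of the cropping step, suppose the per-CTU coding information of the panorama is stored as a 2-D grid with one record per CTU (a hypothetical layout chosen for this sketch; the actual HM modifications work differently). Because the views are CTU-aligned, cropping then reduces to slicing that grid:

```python
CTU = 64  # maximum CTU size; views are positioned at multiples of 64

def crop_ctu_info(pano_info, view_x, view_y, view_w, view_h):
    """Slice the co-located CTU records of the panorama for one view.

    pano_info: 2-D list indexed as [ctu_row][ctu_col], one record per CTU
    (e.g. CU quadtree, prediction modes, PU partitions, MVs, merge flags).
    """
    assert view_x % CTU == 0 and view_y % CTU == 0, "view must be CTU-aligned"
    r0, c0 = view_y // CTU, view_x // CTU
    rows = (view_h + CTU - 1) // CTU   # a 1088p view spans 1088 / 64 = 17 CTU rows
    cols = (view_w + CTU - 1) // CTU   # and 1920 / 64 = 30 CTU columns
    return [row[c0:c0 + cols] for row in pano_info[r0:r0 + rows]]
```

The CTU alignment is exactly why 1088p views were chosen instead of 1080p: 1088 is a multiple of 64, while 1080 is not, so a 1080p view would cut through the bottom CTU row.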
The views were also encoded without acceleration and decoded in order to retrieve the coding information, bit rate and PSNR that serve as the reference. All views were encoded with a Low Delay P configuration. This is possible because each user gets a personalized view and a dedicated encoder instance: the cropped region of the raw panoramic video (the view) is fed to one encoder instance per user. No I-frame refreshes are needed because the user keeps watching the same personalized stream. This configuration results in a lower delay, which is an important requirement for interacting with personalized views. For the tile-based method a Random Access configuration was needed, because all tiles can be retrieved by all users at any time and at any position corresponding to their selected RoI.

With the retrieved data, different experiments were performed. The first experiment determines the amount of correlation between co-located blocks of the cropped coding information of the panoramic video and the corresponding coding information of the view itself. The second experiment determines how much the encoding is sped up (complexity reduction) and how much the quality (bit rate and PSNR) is affected by fast encoding the view while reusing the cropped coding information from the panoramic video.

Suitable metrics were chosen for this. A metric that shows the difference in compression efficiency is the Bjøntegaard Delta (BD) rate [7]. In this context, it shows the average increase in bit rate, at the same PSNR, of encoding a personalized view while reusing information from the original panoramic sequence (fast encoder) compared to encoding this view without reusing information. In order to determine the complexity reduction, the time saving (TS) metric was calculated. It compares the encoding time of the fast encoder (T_fast) to the encoding time of the reference encoder (T_ref) and is given by equation 1 (a Python sketch of both metrics is given in section V). For the views, a low BD-rate and a high time saving are desired; however, a trade-off exists between the two when different coding information is reused.

TS(%) = (T_ref - T_fast) / T_ref × 100    (1)

Experiments showed that, on this type of content, sufficient correlation definitely exists (mostly above 80%) between the coding information obtained from encoding the panoramic video and the coding information obtained from encoding the view itself. This implied that reusing more coding information from the panoramic sequence to fast encode the RoIs would give good results. When only CU information was reused, the BD-rate was minimally 4.9% and maximally 7.4%, already with a time saving of around 79%. Using CU, mode, PU and MVs resulted in a BD-rate between 8.3% and 19.5% and a time saving of up to 97%. Also reusing merge (with skip) information resulted in anomalous BD-rate behavior, which could possibly be solved by encoding the panoramic video with skip disabled.

V. COMPARISON BETWEEN BOTH APPROACHES

The main focus of this extended abstract is to compare the tile-based method and the non-tile-based method in terms of bit rate and PSNR for particular views. The bit rate should be low to make the system applicable for many users who only have limited bandwidth available, while the PSNR should be high to deliver a good quality video of the RoI. Another important factor is the delay between selecting the RoI and the RoI actually appearing on the screen of the user. This delay, together with the quality, determines the Quality of Experience (QoE) of the user. However, it is difficult to measure the entire cycle consisting of the processing delay, the coding delay and the network delay. Therefore, only the coding delay is compared between both methods. The views on which the comparison is performed were already discussed (section IV) and visualized (figure 2).

A. In terms of bit rate

For the non-tile-based method, the bit rates are retrieved from the encoding step with different coding information supplied to the encoder. For the tile-based method, the bit rates are calculated as the sum over the tiles of one particular tile size that (partially) overlap with the corresponding view.

Fig. 4. Comparison between the tile-based method and the non-tile-based method in terms of bit rate for hockey1_1 view m4.

Figure 4 shows the bit rates of both methods and the reference for hockey1_1 view m4. This view shows the plain white ice hockey field, the cheerleaders and the audience. It was previously mentioned that the optimal tile size is 144p; however, in the figure the 360p tiles mostly appear to be the best in terms of bit rate for the tile-based method.
For instance, for QP 27 the view composed of 360p tiles has the lowest bit rate for the tile-based method, namely 95 Mbps. However, this is due to the alignment of the views with the 72p and 360p tile grids: those tiles have the least pixel overhead and therefore the lowest bit rate. If more arbitrary, non-aligned views were considered, the 144p tiles would have been the best for the tile-based method. Comparing both methods, it is clear that the bit rates of the non-tile-based method are much lower than those of the tile-based method. For example, for the view of figure 4, a bit rate of 6.83 Mbps is observed at QP 22 when the cropped CU coding information of the panoramic video is reused, whereas the 144p tiles yield a bit rate of around 185 Mbps at QP 22. The reason for this large difference lies in the coding configurations of both methods. The non-tile-based method uses a Low Delay P configuration and therefore only uses the first frame as an I-frame, followed by P-frames only. This is possible because every user has a dedicated encoder, which starts fast-encoding its personalized stream. For the tile-based method, every tile is pre-encoded with a Random Access configuration. This is needed because every user may request the tiles of any location at any time, and therefore the tiles were encoded with an intraperiod of 0.5 s (32 frames). This means that the tile-based method already uses 19 I-frames per tile to encode 600 frames. I-frames consume the most bit rate, because they are purely intra-predicted. Taking into account that the non-tile-based method only needs one I-frame at 1088p resolution, whereas the tile-based method needs 19 I-frames for each tile that (partially) overlaps with the RoI, it is easily seen that the bit rates of the non-tile-based method are the lowest. This difference in bit rate between both methods only increases for longer sequences. The same bit rate differences are seen for the other views. In this comparison only static views are considered, but it is expected that the non-tile-based method will still outperform the tile-based method in terms of bit rate when panning and tilting are taken into consideration.

If after some time the user selects a totally different RoI, the bit rate of the non-tile-based method will peak, because a large residual image is needed to represent the new area based on predictions from the old RoI. In the worst case, these residual images can be considered as I-frames. Therefore, the views were also encoded using the same Random Access configuration as the tile-based approach, and even in this case the non-tile-based approach still performs better in terms of bit rate. For the tile-based method, choosing a totally different RoI has no big influence on the bit rate results, due to the Random Access coding configuration. So even in the worst scenario, where the user pans/tilts around selecting totally different RoIs every 0.5 s, the non-tile-based method still performs better in terms of bit rate.

B. In terms of PSNR

Another important aspect to compare is quality, measured in PSNR. The mean PSNR for the tile-based method is calculated by first transforming all PSNR values back to the Mean Squared Error (MSE). Then the average of all MSE values, temporally and spatially corresponding to each tile size and each QP covering the view, is calculated and transformed back to PSNR. This gives a better average than simply averaging the PSNR values, because PSNR = 10 log10(MAX^2 / MSE) is logarithmic in the MSE (see the sketch at the end of this section).

Fig. 5. Comparison between the tile-based method and the non-tile-based method in terms of PSNR for hockey2_1 view m1.

Figure 5 shows the PSNR of both methods for hockey2_1 view m1. It is visible that the tile-based method performs better in terms of PSNR than the non-tile-based method for all tile sizes. A PSNR of around 38 dB at QP 32 is observed when the cropped CU, mode and PU coding information of the panoramic video is reused, whereas the 144p tiles reach a PSNR of around 39 dB at QP 32. Similar behavior is visible for the other views. Note that for the non-tile-based method the PSNR drops significantly when merge coding information is also reused from the panoramic video. The reason for this drop is that when merge information is supplied, skip is forced as well. If skip is used, no residuals are encoded, under the assumption that the eventual residual is very small for that block. This assumption no longer holds when the cropped coding structure of the panoramic video is used for fast encoding the view: the skips can cause wrong blocks to be copied, and because no residual is encoded these errors cannot be corrected, resulting in a large drop in PSNR. However, as mentioned earlier, inter-tile artifacts are visible for the tile-based method. These lower the QoE and naturally do not appear in the non-tile-based approach. Therefore, the subjective quality of the non-tile-based approach is better.

C. In terms of coding delay

It is also important to have a notion of the delay introduced by each method. As mentioned earlier, it is difficult to measure the entire cycle consisting of processing delay, coding delay and network delay; therefore only the coding delay of both methods is compared. For the tile-based method, the coding delay only consists of the sum of the decoding times of all tiles that (partially) overlap with the RoI, because the tiles are already pre-encoded on the server and therefore no encoding time needs to be taken into account.
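As a side note, the metric computations used in this comparison can be sketched in Python: the time saving of equation (1), a standard cubic-fit implementation of the BD-rate in the spirit of [7], and the MSE-domain PSNR averaging of section V-B. This is an illustrative reimplementation under stated assumptions (8-bit video, four rate/PSNR points per curve), not the evaluation scripts actually used.

```python
import math
import numpy as np

MAX = 255.0  # peak sample value, assuming 8-bit video

def time_saving(t_ref, t_fast):
    """Time saving of equation (1), in percent."""
    return (t_ref - t_fast) / t_ref * 100.0

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average bit rate difference (%) at equal PSNR, after Bjontegaard [7]."""
    p_ref = np.polyfit(psnr_ref, np.log(rates_ref), 3)    # cubic fit: log-rate vs PSNR
    p_test = np.polyfit(psnr_test, np.log(rates_test), 3)
    lo = max(min(psnr_ref), min(psnr_test))               # overlapping PSNR interval
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref, int_test = np.polyint(p_ref), np.polyint(p_test)
    avg = lambda p: (np.polyval(p, hi) - np.polyval(p, lo)) / (hi - lo)
    return (math.exp(avg(int_test) - avg(int_ref)) - 1.0) * 100.0

def mean_psnr(psnr_values):
    """Average PSNR values in the MSE domain, then convert back to dB."""
    mses = [MAX**2 / 10.0 ** (p / 10.0) for p in psnr_values]
    mean_mse = sum(mses) / len(mses)
    return 10.0 * math.log10(MAX**2 / mean_mse)

# Averaging 30 dB and 40 dB gives about 32.6 dB, not the naive 35 dB.
print(round(mean_psnr([30.0, 40.0]), 1))
```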
For the non-tile-based method, the coding time consists of both the encoding time and the decoding time, because at the server side the RoI needs to be encoded and at the user side it needs to be decoded. The reference follows the same coding time principle as the non-tile-based method. Note that the coding time covers a period of 10 s (600 frames); all results should therefore be divided by 600 to obtain an average coding delay per frame.

Figure 6 shows the coding times of both methods and the reference for one particular view. The y-axis is logarithmic in order to represent the large range of coding times. From the figure, it is visible that the coding times of the non-tile-based method are larger than those of the tile-based method. For the non-tile-based method with all coding information supplied to the encoder (CU, mode, PU, MVs and merge) at QP 22, the coding time is 850 s, whereas for the tile-based method using 144p tiles to represent the RoI, the coding time is only around 200 s for 600 frames. The other views show similar behavior. The reason is that encoding a 1088p view remains a complex operation even when coding information is supplied: TUs, intra modes, residuals and entropy coding still need to be determined. Moreover, every view is encoded using the HM implementation of the HEVC standard, which is very slow and single-threaded. Another implementation of the HEVC standard, widely used in industry, is x265. This encoder implementation is highly optimized and multi-threaded and can therefore be up to 100 times faster than the HM implementation. If the x265 encoder were modified so that it could reuse the coding information of the panoramic video, the views could be encoded in real time, making it possible for the user to change his personalized view without noticeable delay. Note that network and processing delay were not taken into account. In this comparison only static views were considered, but it is expected that there will be almost no difference in coding time between both methods when panning and tilting are taken into consideration, so the same conclusions can be made.

Fig. 6. Comparison between the tile-based method and the non-tile-based method in terms of coding time. For the tile-based method the coding time is only the decoding time; for the non-tile-based method it consists of the encoding time and the decoding time.

It should also be noted that it is assumed that the entire panoramic content is available when both methods are applied. If the panoramic content were captured live, the coding time of encoding and decoding the full panoramic video would also have to be taken into account for the non-tile-based method. For the tile-based method it is likewise assumed that the tiles are pre-encoded on the server, which no longer holds when the panoramic content is captured live: the coding time of encoding each necessary tile would then also have to be taken into account. However, this would lead to an unfair comparison, since the non-tile-based method uses accelerations to encode the views, and coding information from the panoramic video could likewise be reused to encode the tiles [8].

D. Final comparison

This subsection gives a final comparison, taking into account every known advantage and disadvantage of both methods. First, it was seen that the Low Delay P coding configuration of the non-tile-based method is better than the Random Access configuration of the tile-based method, because the excessive number of I-frames causes a very large bit rate increase. Secondly, the PSNR of the tile-based method was higher than that of the non-tile-based method; however, the non-tile-based method does not suffer from inter-tile artifacts and therefore its subjective quality is better. Next, the coding delay of the tile-based method was lower than that of the non-tile-based method, but it is expected that with a more optimized multi-threaded encoder such as x265, combined with the reuse of the coding information of the panoramic video, the non-tile-based method could run in real time.

In terms of the infrastructure needed to implement both approaches, the following conclusions can be made. The end user only needs a standard decoder for the non-tile-based method, whereas for the tile-based method the end user needs a customized video player to decode, combine and synchronize the tiled streams. In terms of storage requirements, the tile-based method needs a lot of storage at the server side in order to store all the tiles of the different resolutions of the panoramic video that are needed to allow zooming; here it is assumed that the server only has tiles of one tile size, namely the optimal tile size (144p). The non-tile-based method only needs to store the full panoramic video together with the coding information retrieved from fully encoding the panoramic content on the server. In terms of processing requirements, the tile-based method is cheaper for the server because it does not need to encode when users request a RoI, as all tiles are already pre-encoded. The non-tile-based method, on the other hand, needs many encoders in order to create a dedicated personalized stream for each user. Nowadays, cloud services can handle this encoding and processing at low cost, and a cloud system can scale the number of encoders depending on the number of users watching their personalized stream.
VI. CONCLUSION

In this extended abstract, two methods to efficiently encode and deliver personalized views extracted from panoramic video content were compared. It was shown that reusing coding information obtained from the panoramic video to fast encode each personalized view has more advantages than the tile-based method. This conclusion also holds when the user pans or tilts. During this research, it was assumed that the entire panoramic content was available when both methods were applied; if the panoramic content were captured live, the non-tile-based method would still outperform the tile-based method in terms of bit rate. Further improvements can be made to the non-tile-based method: the BD-rate of the views can be lowered by disabling encoder optimizations (such as skip) and by pre-processing the coding information obtained from encoding the panoramic video before using it to fast encode the personalized views.

REFERENCES

[1] A. Mavlankar, P. Agrawal, D. Pang, S. Halawa, N. M. Cheung, and B. Girod, "An interactive region-of-interest video streaming system for online lecture viewing," in 18th International Packet Video Workshop (PV), Dec. 2010.
[2] N. Quang Minh Khiem, G. Ravindra, A. Carlier, and W. T. Ooi, "Supporting zoomable video streams with dynamic region-of-interest cropping," in Proceedings of the First Annual ACM SIGMM Conference on Multimedia Systems (MMSys '10), p. 259, 2010.
[3] Y. Umezaki and S. Goto, "Image segmentation approach for realizing zoomable streaming HEVC video," in 9th International Conference on Information, Communications and Signal Processing (ICICS), pp. 1-4, Dec. 2013.
[4] N. Van Kets, J. De Praeter, G. Van Wallendael, J. De Cock, and R. Van de Walle, "Fast encoding for personalized views extracted from beyond high definition content," in IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), 2015.
[5] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, Dec. 2012.
[6] ITU-T Recommendation P.910, "Subjective video quality assessment methods for multimedia applications," Apr. 2008.
[7] G. Bjøntegaard, "Calculation of average PSNR differences between RD-curves," ITU-T SG16/Q6 document VCEG-M33, Apr. 2001.
[8] M. Makar, A. Mavlankar, P. Agrawal, and B. Girod, "Real-time video streaming with interactive region-of-interest," in 17th IEEE International Conference on Image Processing (ICIP), Sep. 2010.

Contents

1 Introduction
   1.1 Introduction
   1.2 Problem statement
   1.3 Goal
   1.4 Outline
2 Introduction to the H.265/High Efficiency Video Coding (HEVC) standard
   2.1 Introduction
   2.2 Typical HEVC video encoder structure
       2.2.1 Coding tree structure
       2.2.2 Intra-picture prediction
       2.2.3 Inter-frame prediction
       2.2.4 Transformation, quantization and entropy coding
3 Delivering personalized views of panoramic content to the user
   3.1 Panoramic video
   3.2 Delivering Region-of-Interest (RoI) to the user
       3.2.1 Entire panoramic video
       3.2.2 Tile-based approach
       3.2.3 Non-tile-based approach
4 Tile-based approach
   4.1 Introduction
   4.2 Panoramic content
   4.3 Splitting into tiles
   4.4 Encoding/decoding of tiles
   4.5 Results
       4.5.1 Visualization
       4.5.2 Determine tiling overhead due to encoding
       4.5.3 Determine bit rate overhead due to extra pixels
   4.6 Conclusion
5 Non-tile-based approach
   5.1 Introduction
   5.2 View selection
   5.3 Methodology
   5.4 Results
       5.4.1 Correlation
       5.4.2 BD-rates and complexity reduction
   5.5 Conclusion
6 Comparing the tile-based method with the non-tile-based approach
   6.1 Introduction
   6.2 Comparison in terms of bit rate
   6.3 Comparison in terms of PSNR
   6.4 Comparison in terms of coding delay
   6.5 Conclusion
7 Conclusions
   7.1 Conclusions
   7.2 Future work
Bibliography
A Appendix
   A.1 Full BD-rate table of the non-tile-based approach
   A.2 Reusing the correct coding information to fast-encode the views

List of Figures

2.1 Typical HEVC encoder structure. Various basic video coding blocks are visible. Image taken from [1].
2.2 Frame with CU-structure overlay.
2.3 Partitioning a certain CB of size M×M into PBs: the eight different splitting modes.
2.4 Partitioning of a CTU (outer square at left, the root node at the right) into multiple CUs (solid lines) and TUs (dotted lines). The corresponding (residual) quadtree is shown in the right figure.
2.5 Left: HEVC intra prediction modes. Right: AVC intra prediction modes.
3.1 Capturing multiple high-resolution images and stitching them together to create a panoramic image.
3.2 Tile-based streaming system architecture.
3.3 Non-tile-based system architecture of a RoI fast encoding solution.
4.1 Snapshot of each panoramic scene used for generating results.
4.2 Example of indexing and splitting for the 576p tiles for the hockey1_1 sequence.
4.3 Visualization of the bit rate as a function of tile number, with the different points per tile number representing the bit rate per 0.5 s (intraperiod).
4.4 Visualization of the decoding time as a function of tile number, with the different points per tile number representing the decoding time per 0.5 s (intraperiod).
4.5 Visualization of the PSNR as a function of tile number, with the different points per tile number representing the PSNR per 0.5 s (intraperiod).
4.6 Bit rate tiling overhead due to encoding, visualized by summing all bit rates temporally and spatially corresponding to each tile size and each QP covering the full panoramic video.
4.7 Decoding time tiling overhead due to encoding, visualized by summing all decoding times temporally and spatially corresponding to each tile size and each QP covering the full panoramic video.
4.8 Mean PSNR corresponding to each tile size and each QP covering the full panoramic video.
4.9 Inter-tile artifacts for a 360p RoI for different tile sizes and QP values.
4.10 Example of pixel overhead for the 576p tiles on the hockey1_1 sequence for a 1080p RoI.
4.11 Selected 1080p RoIs for hockey1_1. The RoIs are referred to by their corresponding notation in the figure.
4.12 Selected 1088p RoIs for hockey1_1. The bottom views are indicated by the prefix b.
4.13 Bit rate overhead for hockey1_1 view 4 and hockey1_1 view b0 compared to its reference.
4.14 Number of 1080p views for a particular tile size with best relative bit rate overhead for hockey1_1 and hockey2_1.
4.15 Number of 1080p views for a particular tile size with best relative decoding time overhead for hockey1_1 and hockey2_1.
4.16 Number of 720p views for a particular tile size for hockey1_1 and hockey2_1. Top: relative bit rate overhead. Bottom: relative decoding time overhead.
5.1 Selected 1088p RoIs for hockey1_1. The RoIs are referred to by their corresponding notation in the figure. The middle views are indicated by the prefix m.
5.2 Spatial and temporal information for each view in sequences hockey1_1 and hockey2_1. The numbers beneath the markers specify the particular view number.
5.3 Schematic figure of the necessary steps of the non-tile-based method used to discuss the results.
5.4 Visualization of the CU coding structure of the cropped panoramic coding information (left) and the coding information of the view (right) for hockey1_1 view 0. The differences between the co-located blocks are indicated in red.
5.5 Visualization of the CU coding structure of the cropped panoramic coding information (left) and the coding information of the view (right) for hockey1_1 view 1m. The differences between the co-located blocks are indicated in red.
5.6 Visualization of the CU coding structure of the cropped panoramic coding information (left) and the coding information of the view (right) for hockey1_1 view 4m. The differences between the co-located blocks are indicated in red.
5.7 Screenshot of fast encoded views with merge (including skip) supplied as coding information. Left: influenced (indicated by red rectangles). Right: non-influenced.
5.8 RD-curves when merge information is supplied, in order to explain the negative behavior of the BD-rate.
6.1 Comparison between the tile-based method and the non-tile-based method in terms of bit rate.
6.2 Comparison between the tile-based method and the non-tile-based method in terms of bit rate, where the non-tile-based method uses a Random Access configuration for fast encoding the views.
6.3 Comparison between the tile-based method and the non-tile-based method in terms of PSNR.
6.4 Comparison between the tile-based method and the non-tile-based method in terms of PSNR, where the non-tile-based method uses a Random Access configuration for fast encoding the views.
6.5 Comparison between the tile-based method and the non-tile-based method in terms of coding time. For the tile-based method the coding time is only the decoding time; for the non-tile-based method the coding time consists of the encoding time and the decoding time.

List of Tables

4.1 The chosen tile sizes with their corresponding number of tiles for the hockey sequences.
4.2 Bit rate overhead per pixel overhead for hockey1_1 view 4 and hockey1_1 view b0 with QP.
5.1 Correlation [%] between the cropped coding information of the panoramic video and the coding information of the corresponding view for CU, mode, PU and merge.
5.2 BD-rates and Time Savings obtained by supplying different coding information. The letters in the second row represent the type of coding information that is reused from the panoramic video. A: CU, B: CU & Mode, C: CU & Mode & PU, D: CU & Mode & PU & MVs, E: CU & Mode & PU & MVs & Merge.
A.1 BD-rates and Time Savings for all views obtained by supplying different coding information from the panoramic video. The letters in the second row represent the type of information that is reused from the panoramic video. A: CU, B: CU & Mode, C: CU & Mode & PU, D: CU & Mode & PU & MVs, E: CU & Mode & PU & MVs & Merge.
A.2 BD-rates and Time Savings for all views obtained by supplying different coding information. The letters in the second row represent the type of information that is reused from the view itself. A: CU, B: CU & Mode, C: CU & Mode & PU, D: CU & Mode & PU & MVs, E: CU & Mode & PU & MVs & Merge.

Acronyms

AMVP     Advanced MV Prediction
AVC      Advanced Video Coding
BD-rate  Bjøntegaard Delta rate
CABAC    Context Adaptive Binary Arithmetic Coding
CB       Coding Block
CTB      Coding Tree Block
CTU      Coding Tree Unit
CU       Coding Unit
DCT      Discrete Cosine Transform
DPB      Decoded Picture Buffer
DST      Discrete Sine Transform
FPS      frames per second
GOP      Group Of Pictures
HD       High Definition
HEVC     High Efficiency Video Coding
JCT-VC   Joint Collaborative Team on Video Coding
MPEG     Moving Picture Experts Group
MSE      Mean Squared Error
MV       Motion Vector
MVC      Multiview Video Coding
PB       Prediction Block
PSNR     Peak Signal-to-Noise Ratio
PU       Prediction Unit
QoE      Quality of Experience
QP       Quantization Parameter
RD       Rate Distortion
RoI      Region-of-Interest
SAO      Sample Adaptive Offset
TU       Transform Unit
VCEG     Video Coding Experts Group

Chapter 1

Introduction

1.1 Introduction

For conventional TV or fixed media such as Blu-ray/DVD applications, the camera work of the video content is nowadays edited by a director, which means that all users obtain the same limited experience. That is why the trend for future video services is to make the user experience more interactive: for example, a user at home watching a football match could follow different players or look at the audience in the stadium. Furthermore, high resolution digital imaging sensors exist that make it possible to capture high resolution video up to 4K+. By stitching comprehensive high-resolution views from multiple cameras, a panoramic video can be created with a resolution far beyond High Definition (HD). This panoramic video can offer the possibility of viewing an arbitrary RoI interactively based on coordinates or tracking. By adding zoom functionality, the content can be displayed on anything from panoramic displays to lower spatial resolution displays such as tablets or even mobile devices. One could make it even more immersive by automatically matching appropriate audio to the selected RoI. Many applications are possible with panoramic video, such as interactive viewing of sports events, virtual pan/tilt/zoom within the wide angle of a surveillance camera, streaming instructional videos such as lecture videos, video conferencing, etc.

1.2 Problem statement

Having a high resolution panoramic video causes some problems. Even when compressed without significant loss in video quality, delivering this high resolution content to the user is not possible due to limited network capacity. Moreover, at the user side it is not possible to display such high resolution content, because the panoramic video would not fit

on the limited resolution of the display devices and the decoding load for smaller devices would be too high. If arbitrary regions corresponding to arbitrary zoom factors can be provided to the user, the transmission and/or decoding of the entire high resolution video can be avoided. Moreover, if the video content can be encoded in such a way that arbitrary RoIs corresponding to different zoom factors can simply be extracted from the compressed bitstream, dedicated video encoding can be avoided for each user. Many solutions, such as tile-based streaming, have already been proposed to make this possible. These solutions were mostly applied with the H.264/AVC codec. One could also accelerate the video encoding of the personalized view by reusing coding information from the fully encoded panoramic video, which will be called the non-tile-based approach.

1.3 Goal

In this master's dissertation, the new video compression standard HEVC will be used. This standard has a higher coding efficiency than its predecessor H.264/AVC at the cost of an increased computational complexity. The aim of this thesis is to compare the existing tile-based method with the proposed non-tile-based method based on the quality, bit rate and coding delay of the resulting RoIs. In addition, an optimal tile size for the tile-based method will be determined, and for the non-tile-based approach the bit rate overhead and quality decrease incurred by lowering the complexity of the video encoder will be investigated.

1.4 Outline

In order to give the reader a better understanding of the rest of this master's dissertation, the relevant features of HEVC are discussed in chapter 2. An overview of the existing methods for delivering the RoI of panoramic content to the user is given in chapter 3. In chapter 4 the tile-based method is applied to panoramic content, and in chapter 5 the proposed non-tile-based method is investigated. In chapter 6 these two methods are compared, and finally in chapter 7 conclusions are drawn and further improvements are noted.

Chapter 2

Introduction to the H.265/HEVC standard

2.1 Introduction

This chapter gives a brief overview of HEVC, the successor of the Advanced Video Coding (AVC)/H.264 standard. It will give the reader a better understanding of the tile- and non-tile-based methods that are explained further in this thesis. HEVC was developed by the Joint Collaborative Team on Video Coding (JCT-VC), an expert group formed by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG). The main improvement is the increased compression efficiency (up to 50% bit rate reduction for equal perceptual video quality). However, the codec becomes far more complex to achieve this higher compression rate, which results in a considerable increase in encoding time. The main goals of HEVC are to support higher video resolutions (8K and beyond) and to increase the usability of parallel processing architectures. A complete overview of HEVC can be found in [1].

2.2 Typical HEVC video encoder structure

Figure 2.1 shows an HEVC encoder structure. Most of the blocks represent basic video coding techniques that were already used in previous block-based codec standards. These include the following basic steps:

1. Divide frames into blocks.
2. Apply prediction (intra or inter) on these blocks. The final result is called a residual image.

Figure 2.1: Typical HEVC encoder structure. Various basic video coding blocks are visible. Image taken from [1].

3. Apply a transformation (e.g. a Discrete Cosine Transform (DCT)) and a subsequent quantization on the residuals.
4. Execute entropy coding (e.g. Context Adaptive Binary Arithmetic Coding (CABAC)).
5. Use a deblocking filter to reduce the artifacts introduced by the block-based coding. Note that in this step HEVC also adds a Sample Adaptive Offset (SAO) filter. This filter adds offset values, obtained by indexing a lookup table, to certain sample values in order to fix mispredictions such as the ringing and banding artifacts caused by large transforms.

2.2.1 Coding tree structure

Coding Units (CUs)

As in AVC, HEVC uses a block-based hybrid video coding architecture. This means that every frame is divided into blocks and each of these blocks is encoded on its own. In AVC the frames are divided into so-called macroblocks: the luma macroblock has a fixed size of 16×16 pixels, the chroma macroblocks a fixed size of 8×8 for 4:2:0 chroma subsampling. In HEVC, on the other hand, each frame is divided into units of variable size, namely Coding Tree Units (CTUs), which consist of one luma and two chroma Coding Tree Blocks (CTBs). A CTU is the logical structure, whereas a CTB is the practical unit, because the video coding must be done on each luma/chroma component. This distinction between units and blocks also holds for all the units mentioned further in this section.

Figure 2.2: Frame with CU-structure overlay.

The size of the luma CTB can be chosen as 16, 32 or 64 pixels squared. In general, the greater the size, the more compression can be obtained because fewer blocks are present. This leads to fewer bits for signaling and thus a larger efficiency. However, searching for the optimal partitioning of each CTU also results in an increase in time complexity for the encoder. Subsequently, all these CTUs can be partitioned separately by using a quadtree structure. More precisely, the CTUs can be recursively divided into four equally sized units in a quadtree manner. A leaf node of the resulting recursive quadtree is called a CU, whose luma size ranges from 64×64 down to 8×8 pixels. In this way a large flexibility is provided: regions with much detail can be coded via small blocks, whereas flat regions can be coded using large blocks. Each CU becomes the decision making point for the prediction type (mode). Figure 2.2 illustrates the partitioning of a picture into CUs.

Prediction Units (PUs)

Each CU can be partitioned further into Prediction Units (PUs), which are the basic units for intra- and inter-prediction. The possible partitioning modes for a certain CB of size M×M are illustrated in figure 2.3. PUs are the only units that can be rectangular (i.e. non-square). The PU size can vary between 64×64 (the maximum CTU size) and 4×4 pixels. For intra-predicted CUs only the M×M and M/2×M/2 prediction modes are supported.

Transform Units (TUs)

The prediction residual obtained in each PU is transformed using the residual quadtree (RQT) structure. This structure is obtained by recursively splitting each CU into Transform Units (TUs) in a quadtree manner. TU sizes can range from 32×32 down to 4×4 pixels. TUs contain coefficients for spatial block transform

and quantization. An example of the partitioning of a CTU into CUs and TUs is shown in figure 2.4.

Figure 2.3: Partitioning a certain CB of size M×M into PBs: the eight different splitting modes.

Figure 2.4: Partitioning of a CTU (outer square at left, the root node at the right) into multiple CUs (solid lines) and TUs (dotted lines). The corresponding (residual) quadtree is shown in the right figure.

Depth

In chapter 5, depth will be used to refer to certain block sizes. A CU with a certain depth has a certain size and is not split further into smaller sub-CUs. More precisely, CUs of 64×64 samples have a depth of 0, while CUs of 32×32, 16×16 and 8×8 respectively have a depth of 1, 2 and 3. This is also illustrated in the figures above.

2.2.2 Intra-picture prediction

Intra-picture prediction predicts the pixels in a Prediction Block (PB) using the surrounding decoded pixel data within the same frame. It uses interpolation techniques to construct an estimation that resembles the PB being encoded as closely as possible. This technique thus exploits spatial redundancy. For an I-frame only intra-picture prediction is used. Compared to AVC, which has 9 intra-prediction modes, HEVC has a much finer granularity (35 intra-prediction modes). Most of these 35 modes (namely 33) are angular

Figure 2.5: Left: HEVC intra prediction modes. Right: AVC intra prediction modes.

These can be used if the target region has strong directional edges. Otherwise the DC mode (which takes an average of neighboring pixels) or the planar mode (a kind of bi-linear prediction) can be utilized. This finer granularity allows a more accurate prediction image to be made. The different modes are shown in figure 2.5.

2.2.4 Inter-frame prediction

As a video is a sequence of frames, a high correlation usually exists between subsequent frames. This correlation increases as the frame rate becomes higher and if the camera does not move much between consecutive frames. Due to this resemblance, a lot of temporal redundancy is present. This is exploited by inter-frame prediction, which is a technique to predict the values of the pixels in a PB using decoded PBs of other frames. The encoder will search for the corresponding pixels/sub-pixels of each block that match PBs in one or more reference frames in the Decoded Picture Buffer (DPB). The best matches with respect to the original block result in Motion Vectors (MVs) that point to these PB areas. Such an MV enables the decoder to redetermine the area used as a reference by the encoder. If no suitable match can be found, intra-picture prediction can be used for that block if it results in a better Rate Distortion (RD) cost. This process is called Motion Estimation and can be very time consuming. The DPB can contain previously decoded frames and future decoded frames (depending on the coding configuration). A frame that only uses one block from a frame in its DPB is called a P-frame, whereas a frame that uses a weighted average of two blocks from previous and/or future frames is called a B-frame. In this master's dissertation both Random Access and Low Delay P will be used as coding configurations. In the first configuration, both P- and B-frames are used. In the latter configuration, only P-frames are allowed.
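To make the motion estimation step concrete, the following sketch shows an integer-pel full search that minimizes the Sum of Absolute Differences (SAD) between a block and candidate areas in a reference frame. It is a didactic simplification, not the HM implementation: real encoders add sub-pixel refinement, fast search patterns and rate-distortion-based mode decisions.

    import numpy as np

    def full_search(block, ref, top, left, radius=8):
        """Exhaustive block matching: return the motion vector (dy, dx) that
        minimizes the SAD within a square search window around (top, left)."""
        h, w = block.shape
        best_sad, best_mv = np.inf, (0, 0)
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + h > ref.shape[0] or x + w > ref.shape[1]:
                    continue  # candidate area falls outside the reference frame
                sad = np.abs(ref[y:y + h, x:x + w].astype(np.int64)
                             - block.astype(np.int64)).sum()
                if sad < best_sad:
                    best_sad, best_mv = sad, (dy, dx)
        return best_mv, best_sad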

There are two types of MV prediction modes for inter-frame prediction: Advanced MV Prediction (AMVP) and merge mode. AMVP uses MVs from the neighbors as MV predictors. These are put in a candidate MV list and only the best candidate index plus the MV delta is transmitted to the decoder. The MV delta is the difference between the actual MV and the selected candidate MV. Because the candidate index is also transmitted, the decoder can reconstruct the actual MV. Using this technique results in more efficient compression, because the difference between neighboring motion vectors is usually smaller than the actual MV itself. Merge mode is similar to AMVP in the sense that it uses MVs from its neighbors. However, instead of using these neighbors as predictors, the MV is copied from its (spatial or temporal) neighbors. The corresponding neighbor index is sent to the decoder and therefore no MV delta is involved. This merge concept can be used in combination with a skip mode. If a skip mode is used, it implies the following: merge mode is used, the CU only contains one PU (M×M) and no residual data is present in the bitstream. This is well suited to encode static regions, where the prediction error tends to be very small.

2.2.5 Transformation, quantization and entropy coding

After intra-picture prediction or motion estimation, the information retrieved from these previous steps is used to construct an intra- or motion-compensated picture which is subsequently subtracted from the input frame signal. This results in a residual image which consists of small values, depending on the amount of correlation that could be exploited. These small values, which also have a low entropy, have good compression capabilities. The residual image is further transformed and quantized. For the transformation an Integer DCT is used, which is a good approximation of the DCT. It has the advantage that it is easy to calculate (only requiring integer arithmetic). As an exception, for the 4×4 intra luma blocks the Discrete Sine Transform (DST) is used. The transformation transforms the data of the residual image in such a way that the data describing the higher spatial frequencies (finest detail) is clustered together at the bottom-right of the resulting matrix and the data describing the lower spatial frequencies is clustered together at the top-left. After transforming the picture, the resulting matrix is quantized. The amount of quantization, which determines how much spatial detail is preserved, is set by the Quantization Parameter (QP). This QP is used to determine the quantization step size and can be set anywhere from 0 to 51. An increase of QP by six doubles the quantization step size and therefore the QP scales logarithmically with respect to the quantization step size. A larger value of the QP will cause more elements in the matrix to be discarded. These discarded elements will mostly be the higher spatial frequencies, thanks to the transformation that was applied in advance.

This step is the only step where picture fidelity is lost; no other steps reduce the picture quality. Finally, the quantized transform coefficients and the motion- and/or intra-prediction data are passed to the entropy coder. For HEVC the most used entropy coder is CABAC.
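The relation between QP and the quantization step size can be made explicit. The sketch below assumes the well-known approximate relation that the step size equals one at QP 4 and doubles for every increase of QP by six:

    def quantization_step(qp):
        """Approximate relation between QP (0-51) and the quantization step
        size: the step size doubles for every increase of QP by six."""
        return 2.0 ** ((qp - 4) / 6.0)

    # quantization_step(28) is exactly twice quantization_step(22)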

Chapter 3

Delivering personalized views of panoramic content to the user

The goal of this research is to investigate how personalized views can be efficiently encoded and delivered to the user from panoramic video content. A lot of research has already been done on this topic, mostly using H.264. In this chapter, first the creation of a panoramic video will be described. Thereafter, three approaches to deliver panoramic video will be discussed. The first one is sending the full panoramic video to the user, where it will be explained why sending the full panoramic video is infeasible. The second one is the tile-based approach, which is the most frequently used technique in research. This approach will be extensively discussed with all its extensions and different methods, each with its pros and cons. The last one is a non-tile-based method in which coding information of the full panoramic video will be used. These last two techniques will respectively be applied to H.265 in chapters 4 and 5. They will then be compared in chapter 6.

3.1 Panoramic video

In the near future, high resolution digital video will be widely available at low cost. This decrease in cost is driven by the increasing spatial resolution (up to 4-8K) offered by digital imaging sensors and the increasing capacities of storage devices. Furthermore, there exist algorithms for stitching a comprehensive high-resolution view from multiple cameras [2,3], which results in a so-called panoramic image. A simplified overview of how a panoramic image is created will be explained and clarified by means of figure 3.1. First, multiple high-resolution cameras are needed. These are arranged densely in an arch, and adjacent cameras need to have overlapping areas. The cameras need to be accurately calibrated. The captured images from the capture devices are then fed to the processing machine, where the mismatches in color of the camera images are corrected in order to generate smooth transitions between the registered views.

Figure 3.1: Capturing multiple high-resolution images and stitching them together to create a panoramic image.

The processing machine also synthesizes the panoramic video. In this synthesis process, corresponding points in the overlapping areas of the cameras are searched and the camera images are stitched. In this way, a video can be created with a resolution far beyond HD.

3.2 Delivering RoI to the user

Due to the availability of the panoramic video, it is possible to view an arbitrary RoI interactively. The main principle is that the user zooms into his RoI to view a cropped region of the video at a high resolution. This RoI is indicated by the user on a downsampled version of the panoramic video that serves as the entire field-of-view for the user, called a thumbnail. From this indication of the user, the corresponding coordinates of the RoI are sent to the server. The server then sends the RoI at a higher resolution to the client for playback. This resolution should be adapted to the particular kind of device, going from a mobile handset to a panoramic display. The user may also pan/tilt around the video, which means viewing a different region but at the same resolution. In this case, the server crops a different RoI and streams that RoI to the user. The server should also be able to react to the user's changing RoI with as little latency as possible. As mentioned earlier, different techniques enable digital pan/tilt/zoom by cropping the RoI chosen by the user. These techniques will be discussed below.

3.2.1 Entire panoramic video

A possible way to deliver the panoramic video is by sending it directly over the network to the home user [4]. Even if the panoramic video is encoded, such an amount of data cannot easily be transported to viewers at home, because the network capacity is limited and insufficient for the bit rate that is needed. Moreover, this very high resolution would not even fit on the limited resolutions of their displays, such as a smartphone, tablet or television.

For the smaller devices, the decoding load would also be too high. Moreover, for the RoI principle, the user is only interested in one specific area. To accomplish this, the video needs to be cropped at the user side to a display resolution that suits the user. The other areas in the video are not displayed at that moment and therefore a lot of bandwidth is wasted. However, this method puts cheap processing requirements on the server side, and the user would be able to change his RoI with very little latency, depending on the computational capabilities of the user device. Another possibility is to stream a spatially downsampled encoded version of the panoramic video that fits the user display resolution. This is feasible, but due to the large downsampling, watching a local RoI in the highest captured resolution would not be possible. The user would only be able to see a cropped, low-quality upsampled RoI.

3.2.2 Tile-based approach

Regular

A naive way to allow dynamic cropping of RoIs is to pre-encode each possible RoI as an independent video stream and provide the requested RoI to the user when requested. However, this results in huge storage requirements. One could also pre-encode only the popular RoIs, but this reduces the flexibility of user interactions. In order to solve the above, tile-based streaming is introduced in literature [5-15] and will be explained by means of figure 3.2. When this method is applied in chapter 4, figures will be shown to further clarify the concept. At the server side, the panoramic video is subsampled to different resolutions (including a thumbnail) in order to provide zooming by having multiple resolution layers. These different layers are then subdivided into a grid of non-overlapping tiles. Tiles of the same (x,y) coordinates from all frames in a Group Of Pictures (GOP) are grouped together and independently encoded using a video compression standard such as AVC or HEVC to create a tiled stream. Independently means that motion vectors and other dependencies are constrained to within the tile area. At the user side, the user requests the thumbnail and selects the RoI he is interested in. Based on this information and the video information from the server, including the indexing of the tiles, the tile sizes and the available resolutions of the video, the tiles falling within and intersecting with the RoI boundary for the requested resolution are streamed from the server. The indices of the overlapping tiles can be calculated at the user or at the server, as sketched below. The received tiles, including the thumbnail, need to be decoded, and consequently the user will have multiple instances of a decoder. The rendering of the received tiles is accomplished by controlling the scaling and placement of these tiles in correspondence with the chosen RoI. It is also important that these tiles are highly synchronized with each other and with the thumbnail. The tile portions falling outside the RoI are not rendered.
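A minimal sketch of this index calculation, assuming tiles are numbered row by row from left to right starting from zero (the convention used in chapter 4):

    def overlapping_tiles(roi_x, roi_y, roi_w, roi_h, tile_w, tile_h, grid_cols):
        """Return the indices of all tiles that intersect a RoI given in
        pixel coordinates of the panoramic layer."""
        first_col = roi_x // tile_w
        last_col = (roi_x + roi_w - 1) // tile_w
        first_row = roi_y // tile_h
        last_row = (roi_y + roi_h - 1) // tile_h
        return [row * grid_cols + col
                for row in range(first_row, last_row + 1)
                for col in range(first_col, last_col + 1)]

    # e.g. a 1920x1080 RoI at (100, 50) on a grid of 640x360 tiles, 16 per row
    print(overlapping_tiles(100, 50, 1920, 1080, 640, 360, 16))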

Figure 3.2: Tile-based streaming system architecture.

When the user zooms in/out, the server switches between (and crops from) different resolution videos. To allow continuous adjustment of the zoom factor without having to store a lot of resolution layers, additional re-sampling can be employed at the user side. If the user pans around in the panoramic video, new tiles may be included or tiles may be dropped. If a tile is not yet available at the user, the missing pixels are filled in by upsampling the relevant parts from the thumbnail. However, this results in a decreased Quality of Experience (QoE) for the user. The main advantage of the tile-based approach is its simplicity at the server side. The server only has to split the panoramic video and encode the tiles once to generate a repository of tiles. This approach avoids the necessity for real-time encoding. Upon receiving the RoI, the server only has to transmit a minimal set of tiled streams that cover the RoI. Because the server does not have to send out completely different tiles for each user, a publish-subscribe paradigm can be used. This means that the server can multicast each tiled stream to a channel. A client interested in a particular RoI can subscribe to the channels which have the tiles necessary for decoding that RoI. This system scales to a large number of users because it avoids having a dedicated encoder for each user's individual RoI.

The tile-based approach also has a number of disadvantages. A first disadvantage concerns mobile devices, which are unlikely to have the processing power required for performing the decoding and recombination of multiple spatial segments. Another disadvantage is that tiled streaming pays the price of sending additional bits outside the RoI that are not displayed at the user side. This is because some tiles may only partially overlap with the RoI, since the RoI is unlikely to be aligned with the tile boundaries. To reduce these wasted bits, one can reduce the dimensions of the tiles. But since each tile is encoded independently, small tiles lead to a lower compression ratio, increasing the number of bits needed for the RoI. Both effects will be researched further in chapter 4 of this thesis. The user also needs a customized video player to decode, combine and synchronize the tiled streams, which makes this approach harder to deploy. Another disadvantage is the fact that the tiles need to have an encoding structure that allows random access. This means that the tiles can only be decoded starting from an intra-coded tile. Moreover, it also needs a small intraperiod to allow low-delay panning, which leads to an excessive increase in bit rate.

Extensions

Upward Prediction

To overcome the issue of random access and enable RoI switching during any frame interval, without waiting for the end of the GOP or having to transmit extra tiles from the past, the following method is proposed [16-19]. In this setup, the base layer (a.k.a. the thumbnail) is coded using AVC with I-, P- and B-frames. The reconstructed base layer video frames are upsampled by a suitable factor and used as a prediction signal for encoding the video corresponding to the higher resolution layers. The higher resolution frames are again coded as independent P tiles. By applying upward prediction from the thumbnail, efficient random access to RoIs of any spatial resolution can be performed. At the user side, the RoI is rendered by transmitting the corresponding frame from the base layer and a few P tiles from exactly one higher resolution layer. A drawback of this method is that no Motion Compensated Prediction can be performed among temporally successive frames of the higher resolution layers. This leads to a decrease in compression efficiency. The authors further tried to improve the compression efficiency by applying background extraction and long-term memory motion-compensated prediction [20]. Another disadvantage of this method is that the coding scheme is not standard compliant and thus dedicated decoders need to be placed at the users.

MVC

The synchronization of tiles can sometimes be hard, which is why the following method is proposed [5,21,22]. In this setup all tiles and the thumbnail from the panoramic video are compressed and multiplexed together as a single stream and stored on the server.

They used the Multiview Video Coding (MVC) standard to do this, without inter-tile prediction. When a user requests a RoI, the server extracts only the needed tiles, including the thumbnail, from the original stream. From these, an MVC sub-stream is made again and sent to the user. The ease of MVC is that the sub-streams are highly synchronized and the decoded images can be output simultaneously. It also allows the flexible selection and extraction of sub-streams from the original MVC stream. Another advantage is that more tiles can be transported within the sub-stream without having to be decoded at that moment. Due to their availability at the user, they can be quickly rendered and synchronized.

Speed-up encoding tiles

When a real-time system is considered, the on-the-fly tile splitting at the server will be too slow. That is why the following method is proposed [7,23]. In this research, Makar et al. tried to speed up the tile generation and encoding operations without any degradation in video quality. In order to do this, they encoded only those tiles that intersect with the users' RoIs during a given frame interval. They also used a technique known as selective downsampling for static backgrounds in the scene. To speed up the encoding even further, they also reuse the modes (prediction, PU and skip modes) and MV information available from the original panoramic video bitstream when encoding the tiles.

Send non-RoI overlapping tiles with low quality

An alternative solution, without the need of upsampling the relevant parts of the thumbnail when the necessary tiles are not available, has been proposed [13]. In this approach the tiles of the current RoI of the panoramic video are sent at full quality, while the tiles lying outside the RoI are sent at low quality. When the user changes his RoI, these lower quality tiles can be used and will be of better quality than the upsampled thumbnail parts. This increases the QoE of the user. However, a disadvantage of this method is the increased bit rate requirement imposed on the network.

3.2.3 Non-tile-based approach

A different approach is to encode the RoIs of the users on the fly. In this way the server crops the RoI from the panoramic video at the appropriate resolution, encodes it into a video stream and transmits it to the user. Such an approach is very flexible in supporting any RoI, but it does not scale to a large number of users. Also, encoding a large resolution RoI is computationally complex and therefore it can be hard to meet the requirements for a low-latency system. In order to speed up the encoding, Van Kets et al. of the Data Science Lab of UGhent proposed a technique [24] to lower the encoding complexity while keeping the perceptual quality as high as possible. This was done by reusing coding information of the encoded full panoramic video in order to speed up the encoding process of the RoI of each user.

However, in their research only CU coding information from the panoramic video was used, while there is a lot more coding information that can be used to further speed up the encoding of the RoIs. The proposed architecture is shown in figure 3.3. On the server side, a thumbnail video is created. Now, instead of encoding tiles, the full panoramic video is pre-encoded for different resolutions to allow zooming. From these pre-encoded streams all possible coding information is retrieved, such as CU, PU, MVs, mode, TU, etc., and stored together with the video information on the server. At the user side, starting from a thumbnail, the user can select a RoI. The RoI information is sent to the server. From this information, the cropped RoI is created based on the selected resolution of the panoramic video and encoded using different types of coding information (e.g. CU with mode and PU). By reusing the information extracted from the fully encoded panoramic video, the encoding of the personalized views from the same content can be sped up. This is because with the coding information, encoding decisions can be skipped. The coding information can also be pre-processed, for instance by machine learning [25], before supplying it to the encoder. Using this approach, the encoding complexity can be tremendously reduced with only a small increase in bit rate. The reuse of more coding information from the panoramic video to fast encode the personalized views of each user, and the effects on the bit rate, quality and encoding complexity, will be investigated in depth in chapter 5. A huge advantage of this system is the ease of decoding at the user side. The user can use a standard decoder and has very flexible digital pan/tilt/zoom possibilities. A disadvantage is that the server will need a lot of encoders in parallel to provide a personalized view for each user. However, cloud services are available nowadays which can take care of the processing and encoding. The cloud system can also scale the number of encoders depending on the number of users watching their personalized stream.
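The core idea of cropping co-located coding information for a RoI can be illustrated as follows. This is a hypothetical sketch, not the interface of the modified HM framework used later: it assumes the CU depth decisions of the panoramic encode are stored as one value per 64×64 CTU in a NumPy array and that the RoI is CTU-aligned.

    import numpy as np

    CTU = 64  # assumed CTU size of the panoramic encode

    def colocated_hints(depth_grid, roi_x, roi_y, roi_w, roi_h):
        """Crop the per-CTU coding information (here: CU depths) that is
        co-located with the RoI, so the RoI encoder can skip the
        corresponding partitioning decisions."""
        r0, c0 = roi_y // CTU, roi_x // CTU
        r1 = (roi_y + roi_h) // CTU
        c1 = (roi_x + roi_w) // CTU
        return depth_grid[r0:r1, c0:c1]

    # depth_grid = np.zeros((30, 158), dtype=int)  # e.g. one entry per CTU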

Figure 3.3: Non-tile-based system architecture of a RoI fast encoding solution.

Chapter 4

Tile-based approach

4.1 Introduction

In order to send a personalized RoI to any user, the tile-based method can be applied. This means that the panoramic video is split and stored into tiles at the server and only the necessary tiles that overlap with the RoI are transmitted to the user. Subsection 3.2.2 of chapter 3 gives a more detailed explanation. Assume a situation where the user selects a RoI which overlaps with multiple tiles. Since this RoI does not necessarily align with the tiles, redundant regions outside of the RoI will be transmitted as well. Using the tile-based approach, the whole tile needs to be streamed even when the overlap with the RoI is small. The amount of data sent to the user is influenced by the tile size. A larger tile size will result in a better compression efficiency, but leads to more redundant data being transmitted for partially overlapped tiles, and vice versa for a smaller tile size. The tile size will also impact the decoding costs, which are important for mobile devices. Therefore, there must be some kind of optimal tile size, and that is what will be examined in this chapter. In order to find this optimal tile size, first panoramic content should be chosen (section 4.2), then the content needs to be split in tiles (section 4.3) and furthermore, these tiles need to be encoded with the appropriate coding configuration (section 4.4). Finally, the results are examined and evaluated (section 4.5).

4.2 Panoramic content

The available panoramic content consists of three sport games: basketball, football and hockey. The content used in this thesis is hockey, because this type of content has static areas such as the ice hockey field, moving areas such as the supporters and fast moving parts such as the hockey players in the video. It is important to have a large range of spatial and temporal variability in the scenes, because this influences the complexity of the encoding.

(a) Hockey1 1 (b) Hockey2 1

Figure 4.1: Snapshot of each panoramic scene used for generating results.

The hockey content consists of five sequences, split in two scenes. Three of them, called hockey1 x, are captured during the break, where the mascot and cheerleaders enter the ice hockey field. The other two, called hockey2 x, are captured during the match. The x can be either {1, 2, 3} for the first scene or {1, 2} for the second scene. An example of these two different scenes is shown in figure 4.1. The hockey sequences consist of three 4K videos stitched together, which results in a panoramic resolution roughly three 4K frames wide. They all have a frame rate of 60 frames per second (FPS) and one sequence lasts at most 10 s. Furthermore, the sequences are in YUV format and are 4:2:0 chroma subsampled. The results will be retrieved by using these sport sequences. As mentioned in the introduction, the tile-based method could also be applied to surveillance video; however, this type of content is mostly static and results may vary a bit.

4.3 Splitting into tiles

The next step is splitting the panoramic videos into different tile sizes. Due to the large amount of possible settings, only static views will be explored and no zooming will be allowed. Therefore, the tile splitting will be performed on the full panoramic video without being preceded by downsampling. Since the final goal is to find the optimal tile size, a large range of tile sizes should be considered. As mentioned earlier, the tile size influences the amount of redundant overlap with the RoI. Given a fixed tile area ($A_T$), tile width ($w_T$), tile height ($h_T$), RoI area ($A_R$), RoI width ($w_R$) and RoI height ($h_R$), the selected region is in the best case $\lceil w_R/w_T \rceil \cdot \lceil h_R/h_T \rceil \cdot A_T/A_R$ times the size of the actual RoI [8]. Hence, according to this formula, if the RoI width and height are integer multiples of the tile width and height, tiled streaming results in the transmission of a region equal to the dimensions of the RoI.

In the worst case, the selected region would be $(\lfloor w_R/w_T \rfloor + 1)(\lfloor h_R/h_T \rfloor + 1) \cdot A_T/A_R$ times the size of the actual RoI. The choice was to pick 16:9 resolutions for the tiles, as the resulting RoIs are also chosen to be 16:9 because this is the most common aspect ratio. Another possibility could be using square tiles. It is very important that the tile sizes are multiples of 8. This is due to the encoding, because the smallest Coding Block (CB) for luma is 8×8. Table 4.1 shows the chosen tile sizes and the resulting number of tiles for the hockey sequences.

Tile Size (pixels)   Number of Tiles (#)
1280×720             24 (3×8)
1024×576             40 (4×10)
640×360              96 (6×16)
256×144              560 (14×40)
128×72               2133 (27×79)

Table 4.1: The chosen tile sizes with their corresponding number of tiles for the hockey sequences.

The 1280×720 tile size results in 24 tiles for the full panoramic resolution of the hockey sequences. For the 128×72 tiles, there are already 2133 tiles. From this, it is clear that the number of tiles increases rapidly with a lower tile size. Further in this thesis, the tile resolutions will be abbreviated by the height of the resolution extended with p, such that for example 640×360 is shortened to 360p. For later purposes and for evaluating the results, the tiles are indexed in horizontal order from left to right, starting from zero. Figure 4.2 shows an example of the indexing and splitting for the 576p tiles. Note that in that figure the boundary tiles (right and bottom side) do not necessarily have the appropriate tile size, but their dimensions are still multiples of 8.

Figure 4.2: Example of indexing and splitting for the 576p tiles for the hockey1 1 sequence.
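The trade-off captured by the best- and worst-case expressions above can be evaluated numerically. A small sketch:

    import math

    def transmitted_area_ratio(roi_w, roi_h, tile_w, tile_h, worst_case=False):
        """Ratio of transmitted pixels to RoI pixels: best case (RoI aligned
        with the tile grid) or worst case (maximal misalignment)."""
        if worst_case:
            n_tiles = (roi_w // tile_w + 1) * (roi_h // tile_h + 1)
        else:
            n_tiles = math.ceil(roi_w / tile_w) * math.ceil(roi_h / tile_h)
        return n_tiles * tile_w * tile_h / (roi_w * roi_h)

    # A 1920x1080 RoI on 640x360 tiles: ratio 1.0 when aligned (9 tiles),
    # but 16/9 = 1.78 in the worst case (16 tiles).
    print(transmitted_area_ratio(1920, 1080, 640, 360))
    print(transmitted_area_ratio(1920, 1080, 640, 360, worst_case=True))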

4.4 Encoding/decoding of tiles

The generated tiles need to be compressed. To do this, the HEVC Test Model (HM) software is used, the HEVC implementation most commonly used in research. It is an example implementation of HEVC which follows the standard. However, it does not provide parallelization techniques; this is to maintain full correctness, completeness and readability. The disadvantage of this is that the encoding itself is slow. In an industrial context another, faster implementation of the standard would be used, such as x265, which does provide parallelization techniques. All the tiles retrieved from splitting the different hockey sequences into tiles of different tile sizes are encoded and decoded with HM version 16.5 on the HPC cluster. The Raichu cluster is used, which has 64 nodes, each with two 8-core Intel E5-2670 (Sandy Bridge, 2.6 GHz) CPUs, 32 GB memory and 400 GB disk space. As mentioned earlier in subsection 3.2.2, the tiles need to provide random access in order to allow changing the RoI at any time, as the tiles are pre-encoded on the server. That is why Random Access Main is chosen as coding configuration. A reasonable maximum delay when other tiles are selected, e.g. when another RoI is chosen, is 0.5 s. Therefore, an intraperiod of 32 (a multiple of 8, which is the GOP size) is picked. As the hockey sequences are 60 FPS, this results in approximately 0.5 s per I-frame. Note that in this delay consideration, the network delay and decoding time are not taken into account. IDR is picked as Decoding Refresh Type, which stands for Instantaneous Decoder Refresh. This means that all subsequent transmitted frames should be decodable without reference to any frame decoded prior to the IDR picture. The other settings for the encoder are kept standard. Each tile is moreover split in time (temporally) into segments of one intraperiod each, out of the (at most) 600 frames, in order to maintain the random access explained above. All of the above is done for four different QP values: 22, 27, 32 and 37. It is clear that this resulted in a very large number of different tiles that needed to be encoded and stored. Therefore, the 72p tile size was only applied to the hockey1 1 sequence. Later it will be concluded that this was a good choice. An extra measure was taken to lower the amount of traffic on the cluster: the tiles were losslessly pre-encoded with x264 and decoded at the node itself before being encoded again with the HM software. Next to this, the full panoramic video was also encoded with the same coding scheme principle.

4.5 Results

In this section the results retrieved from the encoding and decoding will be investigated. From the outputs of the encoding and decoding, the decoding time, the bit rate and the YUV-PSNR were retrieved for each tile. Peak Signal-to-Noise Ratio (PSNR) expresses the difference in quality between the original sequence and the encoded sequence based on a pixel-by-pixel comparison [26]. It is an objective metric for measuring the quality of a video sequence. In the remainder of this section, first the retrieved data will be visualized. Then, two types of overhead are discussed and visualized: the tiling overhead due to the encoding, and the bit rate overhead due to the extra pixels that are sent. Finally, the optimal tile size will be determined.

The plots shown further on are made for all the sequences, but only hockey1 1 and hockey2 1 will be visualized in this thesis.

4.5.1 Visualization

All the retrieved data can be visualized without any processing performed on the data. Figure 4.3 shows the bit rate of two sequences, each with a different tile size, as a function of the tile number. Per tile number there are different points, which represent the bit rate per 0.5 s (intraperiod) as explained in section 4.4. In sub-figure 4.3a, the bit rates for QP 27 are lower than those for QP 22. This is also visible in the other sub-figures. More generally, higher QP values need fewer bits, because more quantization is performed and therefore fewer bits need to be spent on the blocks. Looking at sub-figure 4.3a, some areas (tiles) of the video have a larger bit rate than others. Take for example tile 17, which has the highest bit rate, with a maximum at around 4.3 Mbps for a period of 0.5 s. From this it can be deduced that there must be motion present in this tile, because an area with motion needs more bits to represent than a static area. This is checked by looking at figure 4.2: tile 17 on that image corresponds to the moving cheerleaders entering the hockey field. Tiles 0 to 10 also have a considerably high bit rate for QP 22. If these tiles are again checked against figure 4.2, it can be seen that they correspond to the audience, which also contains motion and details. Other examples are tiles 24 until 29. In contrast, the lower bit rates of tile 13 range around 0.1 Mbps. This tile corresponds to the plain white ice hockey field and requires fewer bits to encode. The reason for this is that the encoder will use skip modes (see subsection 2.2.4 of chapter 2) because this is a static region and hence the prediction error tends to be very small. Another example is tile 22. Looking at sub-figure 4.3b, it is seen that tile 190 has a high bit rate relative to the other bit rates of that figure and clearly represents motion on a finer scale. If the other sub-figures of figure 4.3 are also examined, it can be seen that the smaller the tile size, the finer the granularity and the smaller the areas in which motion can be detected. This is due to the increasing number of tiles for a lower tile size: the smaller the tiles, the more precisely a particular type of movement (static, fast) can be localized. It is also obvious that smaller tile sizes have a smaller bit rate, because they need less data to represent a particular tile. To be more precise, the 720p tiles go up to 5.5 Mbps, while the 144p tiles only have a maximum of 0.74 Mbps. Note that some tiles, and especially the last tiles, have a considerably lower bit rate. An example of this is shown in sub-figure 4.3d, where tiles 80 to 95 correspond to the bottom tiles. These are the boundary tiles and do not have the full tile size; they are smaller and therefore need fewer bits to be represented. Figure 4.4 shows the decoding time of two sequences, each with a different tile size, as a function of tile number, with again per tile number different points representing the decoding time per 0.5 s (intraperiod).

(a) hockey1 1's 576p tiles (b) hockey1 1's 144p tiles (c) hockey2 1's 720p tiles (d) hockey2 1's 360p tiles

Figure 4.3: Visualization of the bit rate as a function of tile number, with the different points per tile number representing the bit rate per 0.5 s (intraperiod).

The same conclusions can be drawn from figure 4.4: the higher decoding times indicate motion, and the decoding time decreases when QP increases, as was the case for the bit rate. Also, the lower tile sizes need less decoding time. Recognizing the non-bottom boundary tiles is also easier from the decoding times. For example, from figure 4.4d it is clearly visible that tiles 15, 31, 47, 63 and so on are all boundary tiles. Figure 4.5, finally, shows the PSNR of two sequences, each with a different tile size, as a function of tile number, with again per tile number different points representing the PSNR per 0.5 s (intraperiod). Looking at sub-figure 4.5a, for tile number zero the PSNR for QP 22 is larger than for QP 37, respectively 42 dB compared to 37 dB. It can be seen that the larger the QP, the lower the quality. This is because, as explained in subsection 2.2.5, more of the higher spatial frequencies are discarded and hence less detail can be represented. The figure also shows that smaller tile sizes have a larger range of and fluctuation in PSNR values, and they also tend to have a larger PSNR. This can be verified by looking, for example, at sub-figures 4.5a and 4.5b.

(a) hockey1 1's 576p tiles (b) hockey1 1's 144p tiles (c) hockey2 1's 720p tiles (d) hockey2 1's 360p tiles

Figure 4.4: Visualization of the decoding time as a function of tile number, with the different points per tile number representing the decoding time per 0.5 s (intraperiod).

Looking only at QP 22, the PSNR values for the first sub-figure range from 42 dB to 47.5 dB, whereas the PSNR values for the second sub-figure range from 41 dB to 50.5 dB. The reason for this increase in PSNR for smaller tile sizes will be explained further in subsection 4.5.2.

4.5.2 Determine tiling overhead due to encoding

Tiling the video comes at a cost. First, there is extra header overhead present for a specific RoI in the video for the tile-based approach. Secondly, because each tile is encoded independently, the prediction is constrained to within the tile. This puts constraints on MV lengths, which results in less optimal predictions and therefore a larger residual image, meaning a reduction in compression efficiency. This leads to higher storage requirements on the server and higher bandwidth requirements for streaming the same RoI. This effect is visible in figure 4.6. The figure is obtained by summing all the bit rates, temporally and spatially, corresponding to each tile size and to each QP, that cover the full panoramic video.

(a) hockey1 1's 576p tiles (b) hockey1 1's 144p tiles (c) hockey2 1's 720p tiles (d) hockey2 1's 360p tiles

Figure 4.5: Visualization of the PSNR as a function of tile number, with the different points per tile number representing the PSNR per 0.5 s (intraperiod).

The encoded panoramic video without tiling is also shown in figure 4.6 as reference. The y-axis shows the total bit rate and the x-axis shows the different QP values. Per QP the different tile sizes are shown. From the figure a decrease in bit rate can be seen for higher QP values. Take for example sub-figure 4.6a: the total bit rate of the 720p tile size decreases from 850 Mbps for QP 22 down to 90 Mbps for QP 37. This has the same cause as mentioned in subsection 4.5.1. Another aspect is that the bit rate tiling overhead starts to increase exponentially with smaller tile sizes. For instance, in sub-figure 4.6a for QP 37, the bit rate increases from 90 Mbps for the 720p tile size, to 100 Mbps for the 360p tile size and to 340 Mbps for the 72p tile size. The reason for this is that there are large constraints on the MV lengths, as explained above, and this applies more to the smaller tile sizes such as the 144p and especially the 72p tiles. Another deduction from the figure is that for the tiles of 576p and 720p, the increase in bit rate relative to the reference is very small (max 1%). This is because the constraint on MV lengths matters less here due to the larger sizes; the main factor is the extra header overhead that is introduced.

(a) hockey1 1 (b) hockey2 1

Figure 4.6: Bit rate tiling overhead due to encoding, visualized by summing all the bit rates temporally and spatially corresponding to each tile size and to each QP that cover the full panoramic video.

Also note the large bit rates of up to hundreds of Mbps. This is caused by the repetitive I-frame every 0.5 s that is needed to maintain random access. An I-frame is only intra-predicted and is therefore less compression efficient. The same effects are seen in the other sub-figure. The same method as for the bit rate is applied to the decoding time, and the phenomenon is similar, as seen in figure 4.7. The decoding time starts to increase exponentially for smaller tile sizes. However, the total decoding time for the 576p tiles is lower than for the 720p tiles. Take for example figure 4.7a, QP 27: there, a decrease of 20 s is visible between the 720p and 576p tiles. The relative decrease in decoding time over the different QP values is less steep for the same tile size: here it is only an average decrease of 150% from QP 22 down to QP 37 (expressed relative to the QP 37 value), whereas an average decrease of 900% was seen for the bit rate. Another aspect of this figure is that the decoding time for the reference video is approximately 1.3 times larger than for the 360p-720p tiles. This means that, in terms of decoding time, it is more beneficial to decode the panoramic video by decoding larger tiles separately and stitching them together than by decoding the entire panoramic video at once. For the tile-based approach, it is also important how the tiles for the different tile sizes affect the quality. Therefore, the mean PSNR is calculated. In order to do so, all the PSNR values were first transformed back to the Mean Squared Error (MSE). Then the average of all the MSE values, temporally and spatially corresponding to each tile size and to each QP that cover the full panoramic video, was calculated and transformed back to PSNR. This gives a better average than simply averaging over the PSNR values, because $PSNR = 10 \log_{10}(MAX^2/MSE)$ (with MAX the peak sample value) is logarithmic rather than linear in the MSE.
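This averaging procedure can be written down compactly. A minimal sketch, assuming 8-bit content (peak value 255):

    import numpy as np

    def mean_psnr(psnr_values, peak=255.0):
        """Average PSNR values correctly: convert each value back to its MSE,
        average the MSEs, and convert the result back to PSNR."""
        psnr = np.asarray(psnr_values, dtype=float)
        mse = peak ** 2 / 10.0 ** (psnr / 10.0)  # invert PSNR = 10 log10(peak^2 / MSE)
        return 10.0 * np.log10(peak ** 2 / mse.mean())

    # Averaging 30 dB and 50 dB gives about 33 dB, not 40 dB:
    print(mean_psnr([30.0, 50.0]))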

(a) hockey1 1 (b) hockey2 1

Figure 4.7: Decoding time tiling overhead due to encoding, visualized by summing all the decoding times temporally and spatially corresponding to each tile size and to each QP that cover the full panoramic video.

(a) hockey1 1 (b) hockey2 1

Figure 4.8: Mean PSNR corresponding to each tile size and to each QP that cover the full panoramic video.

Figure 4.8 shows the resulting PSNR values. The first thing that can be deduced from the figure is that the PSNR starts to increase with lower tile sizes. For example, looking at sub-figure 4.8a for QP 27, the PSNR increases from 42.5 dB for the 720p tiles to 43.2 dB for the 72p tiles. Also, the increase per tile size is larger for increasing QP: for the same tile sizes as previously, but now for QP 37, the PSNR increases from 39 dB to 40.7 dB. Another aspect is that the PSNR of the reference is smaller than the average PSNR of the tiles, which is non-intuitive. This can be seen in, for example, sub-figure 4.8a, where the PSNR of the reference for QP 22 is 43.7 dB, whereas for the 144p tiles the PSNR is around 44.3 dB. The reason for this phenomenon is that smaller tile sizes make less use of skip modes, because they have fewer neighbors to take motion information from, and therefore the chance of reusable motion information that results in a good prediction is smaller. The encoder needs to search its own MVs again and will split regions containing details in the smaller tile sizes into smaller blocks. Hence more detail can be represented, which gives a larger PSNR value but also an increase in bit rate. The larger tile sizes will use skip modes whenever possible, because the HM encoder is biased towards doing that, and therefore a lower bit rate is obtained but also a lower PSNR.

This conclusion can only be made for videos where the camera is static. If the camera moved, the smaller tile sizes would also use more skip modes. These PSNR values should be interpreted cautiously; they do not cover everything with respect to quality. Because all tiles are encoded independently, there will be more artifacts between the tiles for a higher QP and a smaller tile size. These artifacts are called inter-tile artifacts. A reason for this is that when a tile is quantized, the quantized DC values of the blocks of a tile will be somewhat different from the quantized DC values of the blocks of its neighboring tiles, leading to a visible boundary between the tiles. This is an artifact that PSNR cannot cope with. Figure 4.9 illustrates these artifacts for a 360p RoI. From the QP 37 sub-figures, one may prefer to use larger tile sizes, so that fewer inter-tile artifacts are introduced and an overall nicer image perception is obtained. However, more blurriness within the tiles can be seen for larger tile sizes. For the QP 22 sub-figures, almost no inter-tile artifacts are seen. The barely visible boundaries that remain can be resolved by post-processing, such as de-blocking, when the tiles are stitched together.

4.5.3 Determine bit rate overhead due to extra pixels

Besides the tiling overhead due to encoding, there is also the extra bit rate overhead due to the extra pixels sent when a RoI is sent to the user. This overhead is illustrated by a red overlay in figure 4.10 for a 1080p RoI with 576p tiles. Due to this extra overhead, there will be a trade-off between the extra pixels sent and the tiling overhead as the tile size changes. In the extreme case, each CTU is a tile, leading to the lowest compression efficiency but the fewest redundant regions. Larger tile sizes achieve better compression efficiency, but lead to more redundant data being transmitted for tiles partially overlapping with the RoI. The optimal tile size will be determined in this section. A first attempt to find the optimal tile size is to compare how much bit rate is needed for some static RoIs of 1080p when they are constructed with the tile-based method. The chosen RoIs are shown in figure 4.11. The top and bottom views are indicated by their corresponding view numbers as shown in the figure. The bottom views are specified by their prefix b. Both views with view number five (5 and b5) were later ignored because they deviate too much from the 1080p RoI. These views were all encoded with the same coding scheme principle as the tiles. From the output of these views, the bit rates were retrieved and stored. The next step is determining the amount of bit rate needed per tile size for the tiles that (partially) cover the appropriate views. From these bit rates and the bit rates of the reference views, the bit rate overhead [%] is calculated. The results for hockey1 1 view 4 and hockey1 1 view b0 are shown in figure 4.12.

(a) Sixteen 72p tiles stitched together for QP 37 (b) Four 144p tiles stitched together for QP 37 (c) Cropped 720p tile for QP 37 (d) Sixteen 72p tiles stitched together for QP 22 (e) Four 144p tiles stitched together for QP 22 (f) Cropped 720p tile for QP 22

Figure 4.9: Inter-tile artifacts for a 360p RoI for different tile sizes and QP values.

Figure 4.10: Example of pixel overhead for the 576p tiles on the hockey1 1 sequence for a 1080p RoI.

The first view corresponds to motion from the cheerleaders and the second view represents a more static area, as seen in figure 4.11. If both sub-figures are compared, a maximum of 215% bit rate overhead is seen for the first sub-figure and a maximum of 309% for the second sub-figure, when looking at QP 37 and the 72p tiles. Comparing the remaining results as well, it is clear that the first figure has the least overall bit rate overhead.

Figure 4.11: Selected 1080p RoIs for hockey1 1. The RoIs are named with their corresponding notation on the figure. The bottom views are indicated by their prefix b.

This is because the absolute bit rate of a reference view that contains motion is larger than that of a more static view, and therefore the overhead in percent is lower for views containing motion. In both sub-figures a minimum bit rate overhead is seen for the 360p tiles over the different QP values. This is also the case for the other views and sequences, suggesting that 360p might be the optimal tile size. However, this might not be the case, because the 360p tiles are perfectly aligned with the first 1080p view (figure 4.12a) and partially with the second view (figure 4.12b). Therefore the amount of pixel overhead is zero for the first view, or small for the second view, basically leaving only the tiling header overhead. Another aspect of the figure is that the 72p tiles get worse with increasing QP, going from 27% for QP 22 to 215% for QP 37 for hockey1 1 view 4. Although the 72p tiles are also perfectly aligned, the amount of overhead due to encoding takes the upper hand. From the sub-figures, it is also clear that the larger the tile size, the less the bit rate overhead grows with increasing QP. Take for example the 720p and 144p tiles of figure 4.12b: from QP 22 until QP 37, the bit rate overhead of the 720p tile size increases by only 5 percentage points (25% to 30%), whereas for the 144p tile size it increases by 260 percentage points (50% to 310%). One of the reasons for this phenomenon is that smaller tile sizes need a lot more tiles to cover the same RoI, and therefore a larger overall header overhead occurs for smaller tile sizes. For example, the 576p tiles need 6 tiles, while the 72p tiles need 240 tiles to cover the same view of figure 4.12b. Another reason is that the absolute bit rate of the reference view is smaller for a larger QP value. These two reasons explain the eventually larger relative bit rate overhead for smaller tile sizes. From these observations, it is clear that 72p will not be the optimal tile size, and therefore only making the 72p tiles for hockey1 1 was a good choice. Similar phenomena are found for the other sequences and views. Another investigation was whether there is some kind of linear relation between the amount of pixel overhead and the resulting bit rate overhead. Table 4.2 shows the results for the same views as above, only with QP 22. In this table, the pixel overhead for the 72p and 360p tiles is zero for hockey1 1 view 4, as stated above. For the same sequence and view, there is also no equal value of bit/pixel-overhead between the different tile sizes. More precisely, there is no fixed range found.

(a) hockey1 1 view 4 (b) hockey1 1 view b0

Figure 4.12: Bit rate overhead for hockey1 1 view 4 and hockey1 1 view b0 compared to their references.

Sequence     View   Tile Size   bit/pixel-overhead [b/pel]   Pixel Overhead
hockey1 1    4      720p        ...                          ...
hockey1 1    4      576p        ...                          ...
hockey1 1    4      360p        ...                          0
hockey1 1    4      144p        ...                          ...
hockey1 1    4      72p         ...                          0
hockey1 1    b0     720p        ...                          ...
hockey1 1    b0     576p        ...                          ...
hockey1 1    b0     360p        ...                          ...
hockey1 1    b0     144p        ...                          ...
hockey1 1    b0     72p         ...                          ...

Table 4.2: Bit rate overhead per pixel overhead for hockey1 1 view 4 and hockey1 1 view b0 with QP 22.

Take for example hockey1 1 view b0: there the bit/pixel-overhead varies strongly between the different tile sizes, far from a single fixed value. This was also the case for the other sequences and views. In order to mitigate the effect of the alignment of views with particular tile sizes, more views were considered. It is, however, impossible to encode all possible views, and that is why it was chosen to look at the relative bit rate overhead between the tile sizes instead of the absolute bit rate overhead (compared with the reference). Therefore, a RoI window of 1080p was created, sliding per 64 pixels (the largest CTU size) in both the horizontal and vertical direction. For each window position, the bit rate overhead was calculated for the different tile sizes and QP values. Then, for each window, the tile size with the lowest bit rate overhead was selected and a counter specific to that tile size was incremented. Due to the large number of different views, the effect of the alignment of the tiles with the RoI is canceled out. The results for hockey1 1 and hockey2 1 are shown in figure 4.13.
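The counting procedure of this sliding-window experiment can be sketched as follows. The panorama dimensions and the overhead() callback are placeholders for the actual measured data:

    from collections import Counter

    TILE_SIZES = ['720p', '576p', '360p', '144p', '72p']

    def best_tile_size_histogram(overhead, pano_w, pano_h,
                                 roi_w=1920, roi_h=1080, step=64):
        """Slide a RoI window over the panorama in steps of 64 pixels and
        count, per tile size, how often it yields the lowest bit rate
        overhead. `overhead(tile_size, x, y)` must return the measured
        overhead of covering the window at (x, y) with that tile size."""
        counts = Counter()
        for y in range(0, pano_h - roi_h + 1, step):
            for x in range(0, pano_w - roi_w + 1, step):
                best = min(TILE_SIZES, key=lambda t: overhead(t, x, y))
                counts[best] += 1
        return counts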

(a) hockey1 1 (b) hockey2 1

Figure 4.13: Number of 1080p views for which a particular tile size has the best relative bit rate overhead, for hockey1 1 and hockey2 1.

(a) hockey1 1 (b) hockey2 1

Figure 4.14: Number of 1080p views for which a particular tile size has the best relative decoding time overhead, for hockey1 1 and hockey2 1.

From this figure, it is clear that the 144p tiles are the best suited tile size for 1080p static views at high quality, such as QP 22 and 27. In the first figure, they produce the best result in terms of minimal bit rate overhead for around 1200 out of 1651 views. Note that this optimal tile size (144p) differs from the one found in the previous experiment (360p). For lower quality (higher QP), the 144p tiles are less optimal, going from 1200 views down to 95 for QP 37. At QP 37, the 360p tiles become more suited, as they then have the most views in terms of minimal bit rate overhead, namely 659. It is also again very clear that the 72p tiles are not the tiles to choose for this RoI resolution, because they are almost never the best choice. The same method has been applied to the decoding time, and the results for the same sequences are shown in figure 4.14. In that figure, the 144p tiles are the best over the entire QP range, and therefore the 144p tiles are again the best suited tile size for 1080p static views. Until now 1080p views were considered, but the same results in terms of bit rate overhead and decoding time overhead are obtained for 720p views, as shown in figure 4.15.

(a) hockey1 1 (b) hockey2 1 (c) hockey1 1 (d) hockey2 1

Figure 4.15: Number of 720p views for a particular tile size for hockey1 1 and hockey2 1. Top: relative bit rate overhead. Bottom: relative decoding time overhead.

A small difference is that the 144p tiles are more dominant at the higher QP values for the bit rates. This can be seen in, for example, figure 4.15a, where the 144p tiles still have 939 out of 2603 views in terms of minimal bit rate overhead for QP 37. It is also seen that the 72p tiles have more views for QP 22 than was the case for the 1080p RoIs. For smaller RoIs, such as a 360p RoI, the 72p tiles are dominant. From this observation it can be seen that it is better to send small tiles to represent the 360p RoI than to fully cover this RoI with a single 720p tile.

4.6 Conclusion

For the tile-based method on this type of content, it was shown that the 144p tiles are overall the best in terms of bit rate and decoding time for static 720p and 1080p RoIs. It should be noted that this may only hold for this type of content, where the panoramic video itself stays at a fixed position. Next to this, only static views were considered, and results may vary for views where the user pans/tilts around in the panoramic video.

Also, the overhead introduced by the lower network layers, such as the packetization overhead, is not taken into account. This overhead arises because each tile needs to be put in a different transport-layer packet. Considering all these elements, it is possible that a bigger tile size, such as 360p, is more likely to be optimal.

Chapter 5

Non-tile-based approach

5.1 Introduction

A different approach to sending a personalized RoI to each user is encoding the RoIs of the users on the fly. In order to speed up the encoding process of each personalized view, coding information such as CU, MV, mode, TU, etc., retrieved from the fully encoded panoramic video is used to skip encoding decisions. This approach is called the non-tile-based method (1) and was introduced in subsection 3.2.3 of chapter 3. However, a question that arises is how much correlation exists between the cropped coding information of the fully encoded panoramic video that overlaps with the RoI and the coding information of the view encoded without any acceleration. Another question is how the bit rate or the quality changes when different coding information retrieved from the panoramic video is used to speed up the encoding process of the views. These questions will be investigated and answered in this chapter. Before investigating these questions, the RoIs or views have to be selected, and this selection is done in section 5.2. Secondly, the methodology with its different steps and components will be explained in section 5.3. From the data that will be retrieved in section 5.3, first the correlation between the coding structure of the panoramic video and the coding structure of the view itself will be explored. Finally, the quality in terms of BD-rate and the complexity reduction of the fast encoded views will be investigated in section 5.4.

(1) Note that the proposed method is based upon the technique of Van Kets et al. of the Data Science Lab of UGhent [24], in which only CU coding information was used to speed up the encoding process of each personalized view.

Figure 5.1: Selected 1088p RoIs for hockey1 1. The RoIs are named with their corresponding notation on the figure. The middle views are indicated by their prefix m.

5.2 View selection

The same panoramic sequences as in the previous chapter 4 are used. However, the method will only be applied to the first sequence of the different scenes, hockey1 1 and hockey2 1. This is because the other sequences will not give much more information, as they are sequels of those particular scenes. Another reason is that encoding the full panoramic video takes a lot of time. In the chosen sequences, different RoIs with a resolution of 1920×1088 pixels (1088p) are selected. The reason for this small deviation from 1080p RoIs will be explained further on. The choice was to pick RoIs that contain different types of movement: some contain little motion, some are purely static and others have high motion. It was also foreseen that the RoIs are regions a lot of users will look at, such as the ice hockey field itself. The chosen RoIs are shown in figure 5.1. The top and middle views are indicated by their corresponding view numbers as shown in the figure. The middle views, which mostly show the ice hockey field, are specified by their prefix m. The views with view number five (5 and m5) were ignored, as was the case for the tile-based approach in chapter 4. Note again that only static views without zooming are considered. In order to have a better indication of how much spatial and temporal information each view contains, the spatial perceptual information (SI) and temporal perceptual information (TI) measures are calculated as described in ITU-T Recommendation P.910 [27]. The SI is calculated by filtering the luminance plane of each frame ($F_n$) at time n with the Sobel filter. Next, the standard deviation over the pixels in each Sobel-filtered frame is computed. This is repeated for each frame of the corresponding view and the maximum value in the time series of that view is chosen. This maximum value represents the spatial information content of that view. This is expressed in equation 5.1:

$SI = \max_{time}\{ \mathrm{std}_{space}[\mathrm{Sobel}(F_n)] \}$   (5.1)

The TI is based upon the motion difference feature, which is the difference between the pixel values (of the luminance plane) at the same position in space but in successive frames. Taking the maximum over time of the standard deviation over space of the pixel differences leads to the value that represents the motion information of that view. This is expressed in equation 5.2, where $F_n(i, j)$ is the pixel at the i-th row and j-th column of the n-th frame in time. More motion in adjacent frames will result in higher values of TI:

$TI = \max_{time}\{ \mathrm{std}_{space}[F_n(i, j) - F_{n-1}(i, j)] \}$   (5.2)
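One way to compute both measures from the luminance frames of a view, in line with the definitions above (a sketch using SciPy's Sobel filter, with the gradient magnitude taken over both directions):

    import numpy as np
    from scipy import ndimage

    def si_ti(frames):
        """SI and TI of a view; `frames` is a list of 2-D luminance
        arrays of equal size, in display order."""
        si = max(np.hypot(ndimage.sobel(f.astype(float), axis=0),
                          ndimage.sobel(f.astype(float), axis=1)).std()
                 for f in frames)
        ti = max((b.astype(float) - a.astype(float)).std()
                 for a, b in zip(frames, frames[1:]))
        return si, ti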

These values (SI and TI) are calculated for each view corresponding to the view overlay of figure 5.1 and are shown in figure 5.2. In the figure it is visible that views 3, m3, 4 and m4 of hockey2 1 do not contain a lot of motion (TI < 3). This also corresponds to figure 4.1b, which uses the same overlay grid as figure 5.1. These views mostly correspond to parts of the ice hockey field where no players or cheerleaders are active. However, looking at the same middle view numbers (m3 and m4) but now for hockey1 1, high TI values (> 12) are seen. This corresponds to figure 5.1, where the cheerleaders are entering the field, resulting in the higher motion indices. Looking at view 1 of both sequences, the hockey1 1 sequence has a lower temporal index than the hockey2 1 sequence. This can again be verified by figures 5.1 and 4.1b: in the second sequence the ice hockey players are concentrated in that view, while for the other sequence the view mostly consists of the ice hockey field. During the hockey1 1 sequence a player does pass by, however, resulting in a larger TI than for hockey2 1 view m4. This variety of TI values corresponds with the intention of considering views with different types of motion. The largest temporal index is sixteen; this value would be higher if the camera were non-static. The ice hockey players and cheerleaders also have more spatial details, resulting in the larger values of the spatial index. Higher spatial indices would have been obtained if the views had contained more spatial details such as grass or water. For this type of sports content, no such spatial details are present. A football match sequence, however, consists of a grass field and would have led to higher SI values.

5.3 Methodology

In this section, the methodology used to obtain the necessary information for the results in section 5.4 is discussed. An overview of the necessary steps is shown in figure 5.3. Selecting and cropping the personal views out of the panoramic sequences has already been discussed in section 5.2. Because the views will reuse coding information retrieved from the coding process of the panoramic video, a modified decoder and encoder are used to extract and read that information. It is the same HM 16.5 implementation that was used in chapter 4 for the tile-based method; for that method, however, these modifications were not triggered.

The panoramic video is first fully encoded with four different QP values: 22, 27, 32 and 37. The coding configuration used is not Random Access, as was the case for the tile-based method (chapter 4), but Low Delay P. This configuration consists of an I-frame followed by P-frames. The reason why this configuration is used will be explained further on.
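As an illustration of this first encode step, the sketch below drives a stock HM encoder binary over the four QP values from Python, using HM's standard command-line shortcuts. The binary name, input file name, panorama resolution and any extra options of the modified HM 16.5 encoder are assumptions for illustration; only the frame count and frame rate (600 frames, 10 s) follow from the experiments.

    import subprocess

    for qp in (22, 27, 32, 37):
        subprocess.run([
            "./TAppEncoderStatic",                        # assumed stock HM binary name
            "-c", "cfg/encoder_lowdelay_P_main.cfg",      # Low Delay P: one I-frame, then P-frames
            "-i", "panorama.yuv",                         # hypothetical raw panoramic source
            "-wdt", "4096", "-hgt", "1760",               # assumed panorama resolution
            "-fr", "60", "-f", "600",                     # 10 s at 60 fps, as in the experiments
            "-q", str(qp),                                # fixed QP per encode
            "-b", "panorama_q%d.hevc" % qp,               # one output bitstream per QP
        ], check=True)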

Figure 5.2: Spatial and temporal information for each view in the sequences hockey1 1 and hockey2 1. The numbers beneath the markers specify the particular view number.

Figure 5.3: Schematic figure of the necessary steps for the non-tile-based method in order to obtain the results.

After the encoding process, the panoramic video encoded at the four different QP values is decoded again. While the decoding takes place, the decoder generates a textual representation of the coding information in different files, such as the CU structure, modes, PUs, MVs, merge, etc. For the CU structure, this representation is a sequence of numbers corresponding to the depth of each CU in its quadtree representation. The depths have already been illustrated in figure 2.4 of chapter 2; the sequence for a CTU block follows from the depths shown there. The mode file indicates for each block whether it is intra-coded (i) or inter-coded (p). The PU file indicates which splitting mode (see figure 2.3 of chapter 2) is used for each block. The MV files contain the horizontal and vertical components of both the motion vector and the motion vector difference, as well as the reference POC, reference index, motion vector predictor index and inter direction. The merge file indicates whether merge is used, with the corresponding merge index for each PU in a CU, and whether skip is eventually used. Merge only applies when the mode of that CU is inter (p). More coding information files were retrieved, such as TU, residual information, intra-prediction modes, etc., but these will not be used to speed up the encoding process of the personalized views.
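To make the CU representation concrete, the sketch below expands such a depth sequence into a per-8x8-block grid for one 64x64 CTU. The pre-order, z-scan ordering of the dumped sequence is an assumption for illustration; the exact file layout of the modified decoder is not specified here.

    def expand_ctu_depths(depth_seq, ctu=64, unit=8):
        # Expand a pre-order sequence of leaf-CU depths into an 8x8-granularity
        # grid for one CTU. Pre-order z-scan ordering is an assumption.
        n = ctu // unit
        grid = [[0] * n for _ in range(n)]
        pos = [0]  # read index into depth_seq

        def fill(y, x, size, depth):
            if depth_seq[pos[0]] == depth:       # leaf CU covering this whole region
                for yy in range(y, y + size):
                    for xx in range(x, x + size):
                        grid[yy][xx] = depth
                pos[0] += 1
            else:                                # deeper split: recurse into the 4 quadrants
                h = size // 2
                for dy, dx in ((0, 0), (0, h), (h, 0), (h, h)):
                    fill(y + dy, x + dx, h, depth + 1)

        fill(0, 0, n, 0)
        return grid

    # Example: three depth-1 CUs plus one depth-1 region split into four depth-2 CUs
    # expand_ctu_depths([1, 1, 1, 2, 2, 2, 2])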

The views themselves were also fully encoded and decoded with the four different QP values (22, 27, 32 and 37) in order to obtain their coding information, bit rates, PSNR values and encoding times. This information provides an optimal reference representation of the views. Next, the coding information of the encoded panoramic video was cropped to perfectly overlap the area of each view. Because the views are 1088p and are positioned at multiples of 64 pixels (the maximum CTU size), the CTUs of the panoramic video are aligned with the CTUs of the views.
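A minimal sketch of this cropping step, assuming the coding information has already been parsed into the per-8x8-block matrix representation described in section 5.4 (the function name and data layout are illustrative):

    def crop_coding_info(pano_matrix, x0, y0, view_w=1920, view_h=1088):
        # Crop the panorama's per-8x8-block coding-information matrix to one view.
        # (x0, y0) is the top-left pixel of the view; it must be a multiple of 64
        # so that the view CTUs align with the panorama CTUs.
        assert x0 % 64 == 0 and y0 % 64 == 0, "view must be CTU-aligned"
        bx, by = x0 // 8, y0 // 8   # matrix indices are pixel positions divided by 8
        return [row[bx:bx + view_w // 8] for row in pano_matrix[by:by + view_h // 8]]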

The views were then encoded again, but now reusing the cropped coding information from the panoramic video. By reusing this coding information, the encoding is accelerated: normally the encoder searches, for instance, for the best CU structure of each CTU per frame, but it is now forced to use the structure described in the CU file. It can therefore directly start determining the optimal PU and TU partitioning for the CU it read from the CU file. This lowers the coding complexity, but leads to a less optimal RD optimization. By feeding the encoder more coding information, such as PU, MVs, mode, etc., more coding steps can be skipped. From the output of the fast encoding process, the bit rate, PSNR and encoding time are retrieved and stored.

All the views were encoded with a Low Delay P configuration. The first reason is of course that the supplied coding information comes from a Low Delay P encoding process. However, the most important reason is the following. Because each user gets a personalized view and a dedicated encoder instance, the cropped region of the raw panoramic video can be fed to that instance for each user. It does not need I-frame refreshes, because the user keeps watching the same personalized stream. This configuration results in a lower delay, which is an important requirement for interacting with personalized views. This was not the case for the tile-based method, where the tiles are pre-stored on the server and can be retrieved by all users at any time and for any position that corresponds with their selected RoI. The encoding and decoding were again performed on the same HPC cluster that was used for the tile-based method.

5.4 Results

After the encoding and decoding steps, the results can be retrieved and examined. First, the coding information retrieved from the views and the cropped coding information of the fully encoded panoramic video that overlaps with the views are visualized. In order to do this, the text files first need to be parsed and interpreted. This was done with a framework that was already available. This framework can visualize the coding information and store it in a matrix that corresponds to the position of the blocks in the frame. The position in the matrix is the pixel position divided by eight, because 8x8 is the smallest luma CU size. Using this matrix, the correlation between the cropped panoramic coding information and the coding information of the view itself can be calculated; this is discussed in subsection 5.4.1. Next, the eventual goal is to measure how the complexity reduction of the encoder, obtained by forcing the different coding information, affects the quality and bit rate of the RoIs. This analysis is performed in subsection 5.4.2.

5.4.1 Correlation

The correlation between the coding information of the views and the cropped coding information of the panoramic video gives an idea of how much resemblance there is between them. If the resemblance is high, then, given that the coding information of the view itself is the optimal one, it is expected that the bit rate and PSNR will not degrade much when coding information of the panoramic video is reused to encode the view. A first way to get an idea of the correlation is to simply visualize the coding information for an arbitrary frame. Figure 5.4 shows the CU coding structure for QP 22, 27 and 37 of the cropped panoramic coding information and the coding information of the view for hockey1 1 view 0. Figure 5.5 shows the same information for hockey1 1 view m1 and figure 5.6 for hockey1 1 view m4. In the figures it can be seen that for larger QP values, the blocks (mostly CUs) tend to become bigger, both for the cropped CU info of the panoramic video and for the view CU info. This is because a large QP value discards high spatial frequencies and therefore represents fewer details. The fewer details present, the easier it is to find a suitable match, and therefore larger blocks are chosen: these larger blocks need fewer bits to encode the same information while maintaining the same quality. This only holds when almost no high-frequency components are left after quantization.

Figure 5.4: Visualization of the CU coding structure of the cropped panoramic coding information (left) and the coding information of the view (right) for hockey1 1 view 0, at QP 22, 27 and 37. The differences between the co-located blocks are indicated in red.

Looking at figure 5.4 and comparing the cropped CU info of the panorama with the view CU info for QP 22, the resemblance between the blocks at the same position (co-located blocks) is small. This is because the more blocks are split by the encoder, the smaller the chance of resemblance. Looking at the CU info for QP 27, the resemblance is larger. This is due to the presence of larger blocks, because fewer details need to be represented. For QP 37, the resemblance is even higher and co-located blocks are mostly split in the same way. Now looking at figure 5.5, it is directly visible that the resemblance is already high for QP 22. This is because this is a more static view (except for the audience at the top), which means that almost no high-frequency components are present, and therefore larger blocks are more useful, as explained before. Sometimes random splits are visible in the cropped CU info of the panoramic video that do not appear in the CU info of the view. The optimal encoding structure for the view will always be the one obtained by encoding the view itself.

Figure 5.5: Visualization of the CU coding structure of the cropped panoramic coding information (left) and the coding information of the view (right) for hockey1 1 view m1, at QP 22, 27 and 37. The differences between the co-located blocks are indicated in red.

A possible reason why these splits occur in the cropped panoramic CU structure is that these blocks may have been used as a prediction, or determined from a prediction, for blocks outside the borders of the RoI in another frame. For QP 37, almost all blocks are the same, except for one CTU block in the upper-left corner. Figure 5.6 contains a mix of everything: the ice hockey field, the cheerleaders and the audience. The cheerleaders are represented with smaller blocks, because they are moving and smaller blocks are therefore necessary to represent them with inter-prediction. The resemblance for QP 22 is low, but increases for higher QP. The ice hockey field is again coded with larger blocks.

Figure 5.6: Visualization of the CU coding structure of the cropped panoramic coding information (left) and the coding information of the view (right) for hockey1 1 view m4, at QP 22, 27 and 37. The differences between the co-located blocks are indicated in red.

Until now, only a single frame of the different views was considered. In order to get a global idea of the correlation, the correlation was calculated over all frames of a particular view. This correlation per view and per QP is obtained by using the matrix representation of the framework and determining the equality in terms of CU depth of the co-located blocks between the cropped CU info of the panoramic video and the CU info of the view. The number of equal CU depths divided by the total number of positions gives the CU correlation for one frame. Doing this for all frames and averaging the correlation values results in the CU correlation for that particular view. However, the coding information consists of more than the CU structure; therefore the mode, PU and merge correlations were also calculated using the same technique. For the modes, the correlation is in terms of equality between co-located i/p blocks; for PU, in terms of equality between the splitting modes of co-located blocks; and for merge, in terms of equality between the merge flags of co-located blocks.

Table 5.1 shows the results for the same views as before, together with two middle views of hockey2 1. From the table it is visible that the correlation increases with QP for each type of coding information. For example, for hockey1 1 view 0, the CU correlation increases from 54.43% at QP 22 to 92.04% at QP 37. This corresponds with the conclusion of the previous section: the larger the QP, the more the encoder uses larger blocks and the higher the resemblance between the cropped coding information of the panoramic video and the coding information of the view. This applies to the mode, PU and merge coding information as well. Another observation is that the correlation is everywhere larger than 50% (mostly above 80%), which shows that the correlation between the two is high and definitely not random. The correlations for the modes have high values (> 99%). The reason is that both encodings have an I-frame as first frame (100% correlation for that frame), and for the other frames the most used prediction mode is P, resulting in the large correlation values.

If the correlations of the different views are compared to each other, hockey1 1 view 0 has the lowest CU, PU and merge correlation for QP 22, namely 54.43%, 60.38% and 75.99%. Hockey2 1 view m3 has the highest correlation for QP 22, namely 91.08% for CU, 95.61% for PU and 96.34% for merge. The reason can be explained by looking at figures 5.1 and 4.1b. From the first figure, hockey1 1 view 0 corresponds to the audience. Motion and details are therefore present, which leads to a larger number of splits, and the correlation decreases. This also corresponds with the conclusion drawn from figure 5.4. The same reasoning applies to hockey1 1 view m4, where the cheerleaders enter the field. Hockey2 1 view m3 mostly shows the static ice hockey field, resulting in the use of larger blocks and therefore a higher correlation. Hockey2 1 view m1 consists of ice hockey players and the ice hockey field; due to this combination, its correlation lies between the lowest and the highest correlation. View m1 of hockey1 1 has a lower correlation than expected, because it seemed to correspond to view m3 of hockey2 1, where almost only the ice hockey field is visible. From figure 5.5, it was also seen that a high correlation existed between the cropped coding information of the panoramic video and the coding information of that view. However, looking at the other frames of that view, a player ice-skates from the bottom left to the upper right of the field throughout the view. This results in smaller CU splits and less chance of correlation between the cropped coding information and the coding information of the view. The other views of the different sequences give correlation ranges that behave the same way, according to the type of motion present in the view.
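As an illustration, a sketch of how such a correlation percentage can be computed from the per-frame matrix representation (NumPy-based; the data layout is the hypothetical one used in the cropping sketch above):

    import numpy as np

    def coding_info_correlation(view_info, cropped_pano_info):
        # view_info / cropped_pano_info: one 2-D array per frame, holding the
        # coding information (e.g. CU depth) per 8x8 block position.
        per_frame = [np.mean(np.asarray(v) == np.asarray(p)) * 100.0
                     for v, p in zip(view_info, cropped_pano_info)]
        return float(np.mean(per_frame))  # percentage, as reported in table 5.1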

Table 5.1: Correlation [%] between the cropped coding information of the panoramic video and the coding information of the corresponding view, for CU, mode, PU and merge (views 0, m1 and m4 of hockey1 1 and views m1 and m3 of hockey2 1, at QP 22, 27, 32 and 37).

5.4.2 BD-rates and complexity reduction

The goal of this chapter is to determine how much the encoding is sped up (complexity reduction) and how much the compression quality (bit rate and PSNR) is affected by fast encoding the views while reusing the cropped coding information from the panoramic video. A metric that shows the difference in compression efficiency is the Bjøntegaard Delta rate (BD-rate) [28]. In this context, it shows the average increase in bit rate, at the same PSNR, of encoding a personalized view by reusing information from the original panoramic sequence (fast encoder) compared to encoding this view without reusing information. In order to determine the complexity reduction, the time saving (TS) metric is calculated. It compares the encoding time of the fast encoder (T_fast) to the encoding time of the reference encoder (T_ref) and is given by equation 5.3:

TS(\%) = \frac{T_{ref} - T_{fast}}{T_{ref}}   (5.3)

Different kinds of coding information can be reused from the panoramic sequence. For this master's dissertation, gradually more information is fed to the encoder, starting from the CU structure and then adding mode, PU, MVs and finally merge. As explained in section 5.3, for each fast encoded view and each normally encoded view, the bit rate, PSNR and encoding time were retrieved and stored. This allows the BD-rates and TS to be calculated.
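A sketch of both metrics in Python/NumPy, following the usual Bjøntegaard procedure (a cubic fit of log-rate against PSNR over the four QP points, integrated over the overlapping PSNR range); this is a common reference implementation, not necessarily the exact tool used for these experiments:

    import numpy as np

    def bd_rate(rate_ref, psnr_ref, rate_fast, psnr_fast):
        # Average bit rate difference [%] at equal PSNR (Bjontegaard delta rate).
        p_ref = np.polyfit(psnr_ref, np.log10(rate_ref), 3)    # cubic fit: log-rate vs. PSNR
        p_fast = np.polyfit(psnr_fast, np.log10(rate_fast), 3)
        lo = max(min(psnr_ref), min(psnr_fast))                # overlapping PSNR interval
        hi = min(max(psnr_ref), max(psnr_fast))
        int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
        int_fast = np.polyval(np.polyint(p_fast), hi) - np.polyval(np.polyint(p_fast), lo)
        avg_diff = (int_fast - int_ref) / (hi - lo)            # average log-rate difference
        return (10 ** avg_diff - 1) * 100.0

    def time_saving(t_ref, t_fast):
        # Equation 5.3, expressed as a percentage.
        return (t_ref - t_fast) / t_ref * 100.0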

The lower the BD-rate and the higher the time saving, the better it is to reuse these coding structures; a trade-off exists between the two. Table 5.2 shows the results (BD-rate and TS) for the same views of the sequences as in table 5.1. Looking first at the BD-rates up to column D (CU, mode, PU and MVs), the BD-rate increases as more coding information from the panoramic sequence is given to the encoder. For example, for hockey2 1 view m1, the BD-rate increases from 5.5% to 13.7%. The reason is that the more non-optimal coding information is given to the encoder, the worse the predictions are and the larger the eventual residual images will be. Larger residual images result in higher bit rates and therefore higher BD-rates. The reason why the BD-rate is already high when only the CU structure is supplied to the encoder is an encoder optimization, which is explained in more detail in appendix A.2.

An exception is visible for hockey2 1 view m3. For this view, the BD-rate decreases again between PU and PU with MVs. This is caused by an optimization inside the HM encoder. More precisely, the MV of the most recently calculated 2Nx2N PU with the same reference picture as the currently tested MV can be used as a candidate starting point for motion estimation. However, because the PUs are supplied, the MV of the most recently calculated 2Nx2N PU is not available and can therefore not be used as a starting point. This can lead to a prediction match that is less optimal. When the MVs are supplied as well, the best-match MV of the cropped panoramic video is used, which can lead to a better prediction than when no MVs were supplied.

It is also seen that the BD-rates are the smallest for hockey2 1 view m3, namely 4.9%-8.3%. This corresponds with the correlation results, where this view had the highest correlation. Its supplied coding structure is therefore close to the optimal coding structure obtained from encoding the view itself, resulting in the lowest BD-rate. It is also the view with the lowest spatial and temporal information, as seen in figure 5.2. Looking again at table 5.2, the increase from supplying only the CU structure to supplying the CU structure with the modes is large, considering that the modes had a high correlation between co-located blocks, as concluded in subsection 5.4.1. This shows that choosing the wrong mode results in a lower compression efficiency than when the optimal mode is picked. The highest increase in BD-rate happens when the PU information is supplied as well: for hockey1 1 view m4, for instance, the BD-rate goes from 8.0% to 15.0%, which is almost double.

Looking at view m1 of both hockey1 1 and hockey2 1, the BD-rates deviate from each other, although based on correlation table 5.1 it was expected that they would be similar. However, looking at the temporal and spatial information of these views in figure 5.2, hockey2 1 view m1 has a higher temporal and spatial index than hockey1 1 view m1. It is also visible that the BD-rates of hockey1 1 view m1 and view 0 are close to each other. Looking again at the spatial and temporal information of these views, their temporal indices are close to each other, which explains the similar behavior.

Now looking again at table 5.2, where the merge information is supplied as well (column E), high and even negative BD-rates appear. The reason these values appear is that when merge information is supplied, skip is forced as well, and forcing skip causes the strange behavior. If skip is used, no residuals are encoded, under the assumption that the eventual residual is really small for that block. This assumption no longer holds when the cropped coding structure of the panoramic video is used for fast encoding the view. For this application, the skips can cause wrong blocks to be copied. Because no residual is encoded, these errors cannot be corrected, resulting in a large quality decrease and therefore the high BD-rates. An example of such wrongly encoded blocks is shown in sub-figure 5.7a for hockey2 1 view m1, indicated by red rectangles. Normal behavior is visible, however, for hockey1 1 view 0, for which the BD-rate only increases by 2.7%. This is because the merge indices used by skip all point to MVs used inside the view, and skip is therefore used in the correct way. The reason why all MVs point inside the view is that the full panoramic video is encoded from the upper left to the bottom right. Because view 0 is the upper-left view, no merge indices can point to MVs used by blocks beyond the left and upper borders of the view. This is confirmed by figure 5.7b, where no wrong blocks are copied.

The negative BD-rate is explained by the two RD-curves in figure 5.8, namely the RD-curve of hockey1 1 view 0 and the RD-curve of hockey2 1 view m3. In the first sub-figure, the PSNR increases when the bit rate increases, which is the normal behavior. For the second sub-figure, however, this is not the case: the PSNR drops from 40 dB to 39.5 dB when the bit rate increases from 100 kbps to 300 kbps, resulting in the eventual negative value of the BD-rate. A possible solution to these faults, which occur when merge is also reused as coding information, is to encode the full panoramic video without skip. Because encoding the full panoramic video takes a lot of time, this could not be investigated anymore within this master's dissertation.

Now looking at the time saving values in table 5.2, when only the CU information from the panoramic sequence is reused, large TS values are already obtained, ranging from 78.1% to 81.2%. This is because the encoder knows the CU depths in advance and does not have to check every possible combination of CU configurations for each CTU. Supplying the mode information to the encoder as well gives an extra time saving of only 0.3%-1.2%. Compared with the BD-rate increase it causes, it is not optimal to use it. However, it is necessary in order to apply the other coding structures, because, for example, merge only works if the mode is inter-predicted.

Figure 5.7: Screenshots of fast encoded views with merge (including skip) supplied as coding information. Left (a): hockey2 1 view m1, influenced (wrongly copied blocks indicated by red rectangles); right (b): hockey1 1 view 0, not influenced.

Figure 5.8: RD-curves when merge information is supplied, illustrating the negative behavior of the BD-rate: (a) hockey1 1 view 0, (b) hockey2 1 view m3.

Supplying PU as coding information for encoding the view as well increases the time saving by approximately 13%. This shows that finding the optimal PU partitioning modes is also computationally complex. With the motion information also included, the time saving is already around 97%, and when the merge information is given to the encoder as well, the time saving rises above 99%. However, the resulting BD-rate increase caused by supplying merge information makes it not worth using; perhaps it could be used if no skip were allowed. The BD-rates and time savings of the other views can be found in table A.1 (appendix A); they lie in the same ranges.

Until now, only static views were considered. If the user selects a totally different RoI after some time, the bit rate is expected to peak, because the residual image will be large in order to represent the new area based on predictions from the old RoI. During panning/tilting, reusing MV information is not a good idea, because the MVs will not represent the correct length or will point to the wrong position. Therefore, it would be good to modify (e.g. scale) the MVs to take panning/tilting into account, or even to discard the reuse of MV information in this scenario.

Table 5.2: BD-rates (%) and time savings (%) obtained by supplying different coding information, for views 0, m1 and m4 of hockey1 1 and views m1 and m3 of hockey2 1. The column letters represent the type of coding information reused from the panoramic video. A: CU, B: CU & mode, C: CU & mode & PU, D: CU & mode & PU & MVs, E: CU & mode & PU & MVs & merge.

5.5 Conclusion

For the non-tile-based method on this type of content, it is shown that there is definitely enough correlation between the coding information obtained from encoding the panoramic video and the coding information obtained from encoding the view itself. This justified the reuse of coding information from the panoramic sequence to fast encode the RoIs. Reusing only the CU information, the BD-rate was minimum 4.9% and maximum 7.4%, already with a time saving of around 79%. Using CU, mode, PU and MVs resulted in a BD-rate between 8.3% and 19.5% and a speed-up of up to 97%. Also using merge (with skip) information resulted in strange BD-rate behavior, which could possibly be solved by encoding the panoramic video without skip. How much coding information should be reused for encoding these views depends on the amount of bit rate increase that is allowed. In this master's dissertation, this depends on the amount of bit rate the tile-based approach needs in order to represent these views. In the next chapter, both methods are compared in terms of bit rate and PSNR.

Chapter 6

Comparing the tile-based method with the non-tile-based approach

6.1 Introduction

To recapitulate, the idea is to send each user a personalized HD RoI extracted from a full panoramic video whose resolution is far beyond HD. In the last two chapters (4 and 5), two methods were investigated in depth in order to efficiently encode and deliver these RoIs. The first method was the tile-based approach and the second method was the non-tile-based approach. It is important to know which method is superior, and this is what is investigated in this chapter. To accomplish this, the two methods are compared in terms of bit rate (section 6.2) and PSNR (section 6.3) for particular views. The bit rate should be low to make the system applicable for many users who only have a limited bandwidth available, while the PSNR should be high to obtain a good-quality video of the RoI. Another important factor is the delay between selecting the RoI and the RoI actually appearing on the screen of the user. This delay, together with the quality, determines the QoE of the user. However, it is difficult to measure the entire cycle consisting of the processing delay, the coding delay and the network delay. Therefore, only the delay in terms of coding is compared between the two methods, in section 6.4. The same views (see figure 5.1) are considered as in chapter 5. These views all have a resolution of 1088p; some are mostly static, while others contain more motion, as seen in figure 5.2.

6.2 Comparison in terms of bit rate

In this section, the bit rates of both methods are compared. For the non-tile-based method, these are the bit rates retrieved from the encoding step with the different types of coding information supplied to the encoder.

For the tile-based method, the bit rates are calculated as the sum of the bit rates of the tiles of one particular tile size that (partially) overlap with the corresponding view. Figure 6.1 shows the bit rates of both methods for four different views. The tile-based method can be recognized by the star pattern on its chart bars. The bit rates of the fully encoded reference views using a Low Delay P configuration are also shown in the figure. The bit rates in sub-figure 6.1d are the lowest, because that view (hockey2 1 view m3) mostly represents the ice hockey field and is therefore mostly static. In chapter 4, it was concluded that the optimal tile size was 144p. However, the 360p tiles appear to be the best in terms of bit rate for the tile-based method. For instance, in sub-figure 6.1c, the view composed of 360p tiles has the lowest bit rate for QP 22, namely 110 Mbps. Keep in mind, however, that due to the alignment of the views with the 360p tiles, the 360p tiles have the least pixel overhead and therefore the lowest bit rate. If more arbitrary, non-aligned views were considered, the 144p tiles would have been the best for the tile-based method.

If both methods are compared, it is directly visible that for these four views the bit rates of the non-tile-based method are much lower than the bit rates of the tile-based method. For example, the first sub-figure 6.1a represents the view corresponding to the audience and has a bit rate of 7.28 Mbps for QP 22 when the cropped CU coding information of the panoramic video is reused, whereas the 144p tiles have a bit rate of around 170 Mbps for QP 22. The reason for this large difference is the coding configuration of both methods. The non-tile-based method uses a Low Delay P configuration and therefore only uses the first frame as an I-frame, followed by all P-frames. This is possible because every user has a dedicated encoder, which fast-encodes its personalized stream. For the tile-based method, every tile is pre-encoded with a Random Access configuration. This is needed because every user may use the tiles at any location and at any time; therefore the tiles were encoded with an intra period of 0.5 s (32 frames). This means that the tile-based method already uses nineteen I-frames per tile to encode 600 frames. I-frames consume the most bit rate, because they are only intra-predicted. Taking into account that the non-tile-based method needs only one I-frame at 1088p resolution, while the tile-based method needs nineteen I-frames for each tile that (partially) overlaps with the RoI, it is easy to see that the bit rates of the non-tile-based method will be the lowest. This difference in bit rate between the two methods only increases as the sequences become longer.

In order to give the tile-based method a fair chance, the non-tile-based method was also applied using the same coding configuration as the tile-based method. Keep in mind that a Random Access configuration is actually not needed in this method, because each user has its own dedicated personalized stream. For this coding configuration, reusing all the coding information of the panoramic video was not possible, because the modified encoder does not support the reuse of bi-predictive (B) motion vectors. The modified encoder only supports uni-predictive (P) motion vectors; therefore, only the cropped CU, mode and PU coding information from the full panoramic video could be reused to fast encode the views.
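As a small illustration of the tile-based aggregation described at the start of this section, the sketch below sums the bit rates of all tiles overlapping a static view; the dictionary-based tile indexing is an assumption for illustration.

    import math

    def tiled_view_bitrate(x0, y0, tile_w, tile_h, tile_bitrates,
                           view_w=1920, view_h=1088):
        # Sum the bit rates of every tile that (partially) overlaps the view.
        # tile_bitrates[(col, row)] holds the bit rate of one pre-encoded tile;
        # this indexing scheme is hypothetical.
        c0, c1 = x0 // tile_w, math.ceil((x0 + view_w) / tile_w)
        r0, r1 = y0 // tile_h, math.ceil((y0 + view_h) / tile_h)
        return sum(tile_bitrates[(c, r)]
                   for r in range(r0, r1) for c in range(c0, c1))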

Figure 6.1: Comparison between the tile-based method and the non-tile-based method in terms of bit rate, for (a) hockey1 1 view 0, (b) hockey1 1 view m4, (c) hockey2 1 view m1 and (d) hockey2 1 view m3.

Figure 6.2 shows the bit rates of the non-tile-based method, the tile-based method and the fully encoded reference view, all using the same coding configuration (Random Access). In sub-figure 6.2a, for example, the bit rates of the non-tile-based method are closer to the bit rates of the tile-based method, but still lower. For instance, the bit rate when reusing the CU, mode and PU coding information of the panoramic video is around 51 Mbps for QP 22, whereas the view composed of 576p tiles shows a bit rate of around 55 Mbps. Comparing both methods for the other sub-figures, the non-tile-based method still performs better; this is also the case for the other views, which are not shown in figure 6.2. The main reason the non-tile-based method still performs better is that the bit rate increase caused by reusing coding information from the panoramic video is not large enough compared to the bit rate overhead of the tile-based method.

Figure 6.2: Comparison between the tile-based method and the non-tile-based method in terms of bit rate, where the non-tile-based method uses a Random Access configuration for fast encoding the views, for (a) hockey1 1 view 0, (b) hockey1 1 view m4, (c) hockey2 1 view m1 and (d) hockey2 1 view m3.

For the tile-based method, there were two types of overhead, namely the bit rate overhead due to tiling and the bit rate overhead due to the extra pixels sent to the user. These two types of overhead are more pronounced than the effect of reusing coding information from the fully encoded panoramic video. In this comparison, only static views are considered, but it is expected that the non-tile-based method will still outperform the tile-based method in terms of bit rate when panning and tilting are taken into consideration. As noted in subsection 5.4.2, if the user selects a totally different RoI after some time, the bit rate of the non-tile-based method will peak, because the residual image will be large in order to represent the new area based on predictions from the old RoI. In the worst case, these residual images can be considered as I-frames, and therefore the results for the non-tile-based method with a Random Access coding configuration can be considered.

For the tile-based method, choosing a totally different RoI has no big influence on the bit rate results, due to the Random Access coding configuration. So even in the worst scenario, when the user pans/tilts around and selects totally different RoIs every 0.5 s, the non-tile-based method will still perform better in terms of bit rate.

6.3 Comparison in terms of PSNR

In the previous section, it was clear that the non-tile-based method outperforms the tile-based method in terms of bit rate. Another important aspect to compare is quality, which is again measured in PSNR. The PSNR for the tile-based method is calculated in the same way as in chapter 4. Figure 6.3 shows the PSNR of both methods for the same four views. Starting with the first sub-figure 6.3a, the tile-based method performs better in terms of PSNR than the non-tile-based method for all tile sizes. A PSNR of around 38 dB is visible for QP 32 when the cropped CU, mode and PU coding information of the panoramic video is reused, whereas the 144p tiles have a PSNR of around 39 dB for QP 32. Similar behavior is visible in sub-figures 6.3b and 6.3c. In sub-figure 6.3d, however, the non-tile-based method performs better than certain tile sizes. For example, a PSNR of around 45.7 dB is visible for QP 22 when the cropped CU coding information of the panoramic video is reused, whereas the 576p tiles show a PSNR of around 45 dB for QP 22. The view of that sub-figure represents the static ice hockey field, whereas the other views contain the motion of the audience, players and cheerleaders. Note that for the non-tile-based method the PSNR drops significantly when the merge coding information is also reused from the panoramic video, except for the upper-left view (hockey1 1 view 0). The reason for this drop in PSNR was already visualized in figure 5.7 and discussed in subsection 5.4.2 of the previous chapter.

Due to the large bit rate difference caused by the different coding configurations of the two methods, the PSNR was also retrieved from the views when Random Access was used as coding configuration for fast encoding the views. Figure 6.4 shows the PSNR of both methods for the same views encoded using a Random Access configuration. The sub-figures are very similar to those of figure 6.3, and the same conclusions can therefore be drawn. However, as seen in figure 4.9 of chapter 4, inter-tile artifacts are visible for the tile-based method. These lower the QoE and, of course, do not appear in the non-tile-based approach. Therefore, the subjective quality of the non-tile-based approach is better.
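For reference, a minimal sketch of the PSNR computation used throughout these comparisons, assuming 8-bit luminance frames (the standard definition, not the exact script from the experiments):

    import numpy as np

    def psnr(ref, rec):
        # PSNR in dB for 8-bit content: 10 * log10(MAX^2 / MSE), with MAX = 255.
        diff = ref.astype(np.float64) - rec.astype(np.float64)
        mse = np.mean(diff ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)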

Figure 6.3: Comparison between the tile-based method and the non-tile-based method in terms of PSNR, for (a) hockey1 1 view 0, (b) hockey1 1 view m4, (c) hockey2 1 view m1 and (d) hockey2 1 view m3.

6.4 Comparison in terms of coding delay

It is also important to have a notion of the delay introduced by each method. However, as mentioned in the introduction of this chapter, it is difficult to measure the entire cycle consisting of the processing delay, the coding delay and the network delay. Therefore, only the coding delay of both methods is compared. For the tile-based method, the coding delay consists only of the sum of the decoding times of all the tiles that (partially) overlap with the RoI; the tiles are already pre-encoded on the server, so no encoding time needs to be taken into account. For the non-tile-based method, the coding time consists of both the encoding time and the decoding time, because the RoI needs to be encoded at the server side and decoded at the user side. The reference follows the same coding time principle as the non-tile-based method.
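A tiny sketch of this delay accounting, using the 600-frame (10 s) measurement window noted below (function names and inputs are illustrative):

    def delay_per_frame_tiled(tile_decode_times, n_frames=600):
        # Tile-based: tiles are pre-encoded, so only decoding contributes.
        return sum(tile_decode_times) / n_frames

    def delay_per_frame_non_tiled(encode_time, decode_time, n_frames=600):
        # Non-tile-based: server-side (fast) encoding plus client-side decoding.
        return (encode_time + decode_time) / n_frames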

Figure 6.4: Comparison between the tile-based method and the non-tile-based method in terms of PSNR, where the non-tile-based method uses a Random Access configuration for fast encoding the views, for (a) hockey1 1 view 0, (b) hockey1 1 view m4, (c) hockey2 1 view m1 and (d) hockey2 1 view m3.

Note that the coding time is measured over a period of 10 s (600 frames); all results should therefore be divided by 600 to obtain an average coding delay per frame. Figure 6.5 shows the coding times of both methods and the reference for one particular view. The y-axis is logarithmic in order to represent the large range of the different coding times. From the figure, it is visible that the coding times of the non-tile-based method are larger than those of the tile-based method. For the non-tile-based method with all coding information supplied to the encoder (CU, mode, PU, MVs and merge) at QP 22, the coding time is 850 s, whereas for the tile-based method using 144p tiles to represent the RoI, the coding time is only around 200 s for 600 frames. The other views show similar behavior. The reason is that encoding a 1088p view is still a complex operation: even though coding information is supplied, the TUs, intra modes, residuals and entropy coding still need to be determined. Moreover, every view is encoded using the HM implementation of the HEVC standard, which is very slow and single-threaded. Another implementation of the HEVC standard, more widely used in industry, is x265.


More information

HEVC: Future Video Encoding Landscape

HEVC: Future Video Encoding Landscape HEVC: Future Video Encoding Landscape By Dr. Paul Haskell, Vice President R&D at Harmonic nc. 1 ABSTRACT This paper looks at the HEVC video coding standard: possible applications, video compression performance

More information

Final Report Time Optimization of HEVC Encoder over X86 Processors using SIMD. Spring 2013 Multimedia Processing EE5359

Final Report Time Optimization of HEVC Encoder over X86 Processors using SIMD. Spring 2013 Multimedia Processing EE5359 Final Report Time Optimization of HEVC Encoder over X86 Processors using SIMD Spring 2013 Multimedia Processing Advisor: Dr. K. R. Rao Department of Electrical Engineering University of Texas, Arlington

More information

Introduction to Video Compression Techniques. Slides courtesy of Tay Vaughan Making Multimedia Work

Introduction to Video Compression Techniques. Slides courtesy of Tay Vaughan Making Multimedia Work Introduction to Video Compression Techniques Slides courtesy of Tay Vaughan Making Multimedia Work Agenda Video Compression Overview Motivation for creating standards What do the standards specify Brief

More information

PERCEPTUAL QUALITY COMPARISON BETWEEN SINGLE-LAYER AND SCALABLE VIDEOS AT THE SAME SPATIAL, TEMPORAL AND AMPLITUDE RESOLUTIONS. Yuanyi Xue, Yao Wang

PERCEPTUAL QUALITY COMPARISON BETWEEN SINGLE-LAYER AND SCALABLE VIDEOS AT THE SAME SPATIAL, TEMPORAL AND AMPLITUDE RESOLUTIONS. Yuanyi Xue, Yao Wang PERCEPTUAL QUALITY COMPARISON BETWEEN SINGLE-LAYER AND SCALABLE VIDEOS AT THE SAME SPATIAL, TEMPORAL AND AMPLITUDE RESOLUTIONS Yuanyi Xue, Yao Wang Department of Electrical and Computer Engineering Polytechnic

More information

Digital Image Processing

Digital Image Processing Digital Image Processing 25 January 2007 Dr. ir. Aleksandra Pizurica Prof. Dr. Ir. Wilfried Philips Aleksandra.Pizurica @telin.ugent.be Tel: 09/264.3415 UNIVERSITEIT GENT Telecommunicatie en Informatieverwerking

More information

Advanced Computer Networks

Advanced Computer Networks Advanced Computer Networks Video Basics Jianping Pan Spring 2017 3/10/17 csc466/579 1 Video is a sequence of images Recorded/displayed at a certain rate Types of video signals component video separate

More information

Multimedia Communications. Video compression

Multimedia Communications. Video compression Multimedia Communications Video compression Video compression Of all the different sources of data, video produces the largest amount of data There are some differences in our perception with regard to

More information

Video Over Mobile Networks

Video Over Mobile Networks Video Over Mobile Networks Professor Mohammed Ghanbari Department of Electronic systems Engineering University of Essex United Kingdom June 2005, Zadar, Croatia (Slides prepared by M. Mahdi Ghandi) INTRODUCTION

More information

CODING EFFICIENCY IMPROVEMENT FOR SVC BROADCAST IN THE CONTEXT OF THE EMERGING DVB STANDARDIZATION

CODING EFFICIENCY IMPROVEMENT FOR SVC BROADCAST IN THE CONTEXT OF THE EMERGING DVB STANDARDIZATION 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 CODING EFFICIENCY IMPROVEMENT FOR SVC BROADCAST IN THE CONTEXT OF THE EMERGING DVB STANDARDIZATION Heiko

More information

Dual Frame Video Encoding with Feedback

Dual Frame Video Encoding with Feedback Video Encoding with Feedback Athanasios Leontaris and Pamela C. Cosman Department of Electrical and Computer Engineering University of California, San Diego, La Jolla, CA 92093-0407 Email: pcosman,aleontar

More information

Modeling and Evaluating Feedback-Based Error Control for Video Transfer

Modeling and Evaluating Feedback-Based Error Control for Video Transfer Modeling and Evaluating Feedback-Based Error Control for Video Transfer by Yubing Wang A Dissertation Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE In partial fulfillment of the Requirements

More information

A two-stage approach for robust HEVC coding and streaming

A two-stage approach for robust HEVC coding and streaming Loughborough University Institutional Repository A two-stage approach for robust HEVC coding and streaming This item was submitted to Loughborough University's Institutional Repository by the/an author.

More information

Selective Intra Prediction Mode Decision for H.264/AVC Encoders

Selective Intra Prediction Mode Decision for H.264/AVC Encoders Selective Intra Prediction Mode Decision for H.264/AVC Encoders Jun Sung Park, and Hyo Jung Song Abstract H.264/AVC offers a considerably higher improvement in coding efficiency compared to other compression

More information

Error Resilient Video Coding Using Unequally Protected Key Pictures

Error Resilient Video Coding Using Unequally Protected Key Pictures Error Resilient Video Coding Using Unequally Protected Key Pictures Ye-Kui Wang 1, Miska M. Hannuksela 2, and Moncef Gabbouj 3 1 Nokia Mobile Software, Tampere, Finland 2 Nokia Research Center, Tampere,

More information

06 Video. Multimedia Systems. Video Standards, Compression, Post Production

06 Video. Multimedia Systems. Video Standards, Compression, Post Production Multimedia Systems 06 Video Video Standards, Compression, Post Production Imran Ihsan Assistant Professor, Department of Computer Science Air University, Islamabad, Pakistan www.imranihsan.com Lectures

More information

Multimedia Communications. Image and Video compression

Multimedia Communications. Image and Video compression Multimedia Communications Image and Video compression JPEG2000 JPEG2000: is based on wavelet decomposition two types of wavelet filters one similar to what discussed in Chapter 14 and the other one generates

More information

Real-time SHVC Software Decoding with Multi-threaded Parallel Processing

Real-time SHVC Software Decoding with Multi-threaded Parallel Processing Real-time SHVC Software Decoding with Multi-threaded Parallel Processing Srinivas Gudumasu a, Yuwen He b, Yan Ye b, Yong He b, Eun-Seok Ryu c, Jie Dong b, Xiaoyu Xiu b a Aricent Technologies, Okkiyam Thuraipakkam,

More information

Camera Motion-constraint Video Codec Selection

Camera Motion-constraint Video Codec Selection Camera Motion-constraint Video Codec Selection Andreas Krutz #1, Sebastian Knorr 2, Matthias Kunter 3, and Thomas Sikora #4 # Communication Systems Group, TU Berlin Einsteinufer 17, Berlin, Germany 1 krutz@nue.tu-berlin.de

More information

Quarter-Pixel Accuracy Motion Estimation (ME) - A Novel ME Technique in HEVC

Quarter-Pixel Accuracy Motion Estimation (ME) - A Novel ME Technique in HEVC International Transaction of Electrical and Computer Engineers System, 2014, Vol. 2, No. 3, 107-113 Available online at http://pubs.sciepub.com/iteces/2/3/5 Science and Education Publishing DOI:10.12691/iteces-2-3-5

More information

A Novel Approach towards Video Compression for Mobile Internet using Transform Domain Technique

A Novel Approach towards Video Compression for Mobile Internet using Transform Domain Technique A Novel Approach towards Video Compression for Mobile Internet using Transform Domain Technique Dhaval R. Bhojani Research Scholar, Shri JJT University, Jhunjunu, Rajasthan, India Ved Vyas Dwivedi, PhD.

More information

Project Proposal: Sub pixel motion estimation for side information generation in Wyner- Ziv decoder.

Project Proposal: Sub pixel motion estimation for side information generation in Wyner- Ziv decoder. EE 5359 MULTIMEDIA PROCESSING Subrahmanya Maira Venkatrav 1000615952 Project Proposal: Sub pixel motion estimation for side information generation in Wyner- Ziv decoder. Wyner-Ziv(WZ) encoder is a low

More information

SCALABLE EXTENSION OF HEVC USING ENHANCED INTER-LAYER PREDICTION. Thorsten Laude*, Xiaoyu Xiu, Jie Dong, Yuwen He, Yan Ye, Jörn Ostermann*

SCALABLE EXTENSION OF HEVC USING ENHANCED INTER-LAYER PREDICTION. Thorsten Laude*, Xiaoyu Xiu, Jie Dong, Yuwen He, Yan Ye, Jörn Ostermann* SCALABLE EXTENSION O HEC SING ENHANCED INTER-LAER PREDICTION Thorsten Laude*, Xiaoyu Xiu, Jie Dong, uwen He, an e, Jörn Ostermann* InterDigital Communications, Inc., San Diego, CA, SA * Institut für Informationsverarbeitung,

More information

Video Compression - From Concepts to the H.264/AVC Standard

Video Compression - From Concepts to the H.264/AVC Standard PROC. OF THE IEEE, DEC. 2004 1 Video Compression - From Concepts to the H.264/AVC Standard GARY J. SULLIVAN, SENIOR MEMBER, IEEE, AND THOMAS WIEGAND Invited Paper Abstract Over the last one and a half

More information

Advanced Video Processing for Future Multimedia Communication Systems

Advanced Video Processing for Future Multimedia Communication Systems Advanced Video Processing for Future Multimedia Communication Systems André Kaup Friedrich-Alexander University Erlangen-Nürnberg Future Multimedia Communication Systems Trend in video to make communication

More information

Error concealment techniques in H.264 video transmission over wireless networks

Error concealment techniques in H.264 video transmission over wireless networks Error concealment techniques in H.264 video transmission over wireless networks M U L T I M E D I A P R O C E S S I N G ( E E 5 3 5 9 ) S P R I N G 2 0 1 1 D R. K. R. R A O F I N A L R E P O R T Murtaza

More information

SCALABLE video coding (SVC) is currently being developed

SCALABLE video coding (SVC) is currently being developed IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 16, NO. 7, JULY 2006 889 Fast Mode Decision Algorithm for Inter-Frame Coding in Fully Scalable Video Coding He Li, Z. G. Li, Senior

More information

A Novel Macroblock-Level Filtering Upsampling Architecture for H.264/AVC Scalable Extension

A Novel Macroblock-Level Filtering Upsampling Architecture for H.264/AVC Scalable Extension 05-Silva-AF:05-Silva-AF 8/19/11 6:18 AM Page 43 A Novel Macroblock-Level Filtering Upsampling Architecture for H.264/AVC Scalable Extension T. L. da Silva 1, L. A. S. Cruz 2, and L. V. Agostini 3 1 Telecommunications

More information

Reduced complexity MPEG2 video post-processing for HD display

Reduced complexity MPEG2 video post-processing for HD display Downloaded from orbit.dtu.dk on: Dec 17, 2017 Reduced complexity MPEG2 video post-processing for HD display Virk, Kamran; Li, Huiying; Forchhammer, Søren Published in: IEEE International Conference on

More information

P1: OTA/XYZ P2: ABC c01 JWBK457-Richardson March 22, :45 Printer Name: Yet to Come

P1: OTA/XYZ P2: ABC c01 JWBK457-Richardson March 22, :45 Printer Name: Yet to Come 1 Introduction 1.1 A change of scene 2000: Most viewers receive analogue television via terrestrial, cable or satellite transmission. VHS video tapes are the principal medium for recording and playing

More information

Express Letters. A Novel Four-Step Search Algorithm for Fast Block Motion Estimation

Express Letters. A Novel Four-Step Search Algorithm for Fast Block Motion Estimation IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 6, NO. 3, JUNE 1996 313 Express Letters A Novel Four-Step Search Algorithm for Fast Block Motion Estimation Lai-Man Po and Wing-Chung

More information

Hardware Implementation for the HEVC Fractional Motion Estimation Targeting Real-Time and Low-Energy

Hardware Implementation for the HEVC Fractional Motion Estimation Targeting Real-Time and Low-Energy Hardware Implementation for the HEVC Fractional Motion Estimation Targeting Real-Time and Low-Energy Vladimir Afonso 1-2, Henrique Maich 1, Luan Audibert 1, Bruno Zatt 1, Marcelo Porto 1, Luciano Agostini

More information

COMP 249 Advanced Distributed Systems Multimedia Networking. Video Compression Standards

COMP 249 Advanced Distributed Systems Multimedia Networking. Video Compression Standards COMP 9 Advanced Distributed Systems Multimedia Networking Video Compression Standards Kevin Jeffay Department of Computer Science University of North Carolina at Chapel Hill jeffay@cs.unc.edu September,

More information

An Efficient Low Bit-Rate Video-Coding Algorithm Focusing on Moving Regions

An Efficient Low Bit-Rate Video-Coding Algorithm Focusing on Moving Regions 1128 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 10, OCTOBER 2001 An Efficient Low Bit-Rate Video-Coding Algorithm Focusing on Moving Regions Kwok-Wai Wong, Kin-Man Lam,

More information

UHD 4K Transmissions on the EBU Network

UHD 4K Transmissions on the EBU Network EUROVISION MEDIA SERVICES UHD 4K Transmissions on the EBU Network Technical and Operational Notice EBU/Eurovision Eurovision Media Services MBK, CFI Geneva, Switzerland March 2018 CONTENTS INTRODUCTION

More information

Dual frame motion compensation for a rate switching network

Dual frame motion compensation for a rate switching network Dual frame motion compensation for a rate switching network Vijay Chellappa, Pamela C. Cosman and Geoffrey M. Voelker Dept. of Electrical and Computer Engineering, Dept. of Computer Science and Engineering

More information

A Study of Encoding and Decoding Techniques for Syndrome-Based Video Coding

A Study of Encoding and Decoding Techniques for Syndrome-Based Video Coding MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com A Study of Encoding and Decoding Techniques for Syndrome-Based Video Coding Min Wu, Anthony Vetro, Jonathan Yedidia, Huifang Sun, Chang Wen

More information

techniques for 3D Video

techniques for 3D Video Joint Source and Channel Coding techniques for 3D Video Valentina Pullano XXV cycle Supervisor: Giovanni E. Corazza January 25th 2012 Overview State of the art 3D videos Technologies for 3D video acquisition

More information

HEVC Real-time Decoding

HEVC Real-time Decoding HEVC Real-time Decoding Benjamin Bross a, Mauricio Alvarez-Mesa a,b, Valeri George a, Chi-Ching Chi a,b, Tobias Mayer a, Ben Juurlink b, and Thomas Schierl a a Image Processing Department, Fraunhofer Institute

More information

PERCEPTUAL QUALITY OF H.264/AVC DEBLOCKING FILTER

PERCEPTUAL QUALITY OF H.264/AVC DEBLOCKING FILTER PERCEPTUAL QUALITY OF H./AVC DEBLOCKING FILTER Y. Zhong, I. Richardson, A. Miller and Y. Zhao School of Enginnering, The Robert Gordon University, Schoolhill, Aberdeen, AB1 1FR, UK Phone: + 1, Fax: + 1,

More information

Digital Video Telemetry System

Digital Video Telemetry System Digital Video Telemetry System Item Type text; Proceedings Authors Thom, Gary A.; Snyder, Edwin Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

MPEG-2. ISO/IEC (or ITU-T H.262)

MPEG-2. ISO/IEC (or ITU-T H.262) 1 ISO/IEC 13818-2 (or ITU-T H.262) High quality encoding of interlaced video at 4-15 Mbps for digital video broadcast TV and digital storage media Applications Broadcast TV, Satellite TV, CATV, HDTV, video

More information

White Paper. Video-over-IP: Network Performance Analysis

White Paper. Video-over-IP: Network Performance Analysis White Paper Video-over-IP: Network Performance Analysis Video-over-IP Overview Video-over-IP delivers television content, over a managed IP network, to end user customers for personal, education, and business

More information

Free Viewpoint Switching in Multi-view Video Streaming Using. Wyner-Ziv Video Coding

Free Viewpoint Switching in Multi-view Video Streaming Using. Wyner-Ziv Video Coding Free Viewpoint Switching in Multi-view Video Streaming Using Wyner-Ziv Video Coding Xun Guo 1,, Yan Lu 2, Feng Wu 2, Wen Gao 1, 3, Shipeng Li 2 1 School of Computer Sciences, Harbin Institute of Technology,

More information

1022 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 19, NO. 4, APRIL 2010

1022 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 19, NO. 4, APRIL 2010 1022 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 19, NO. 4, APRIL 2010 Delay Constrained Multiplexing of Video Streams Using Dual-Frame Video Coding Mayank Tiwari, Student Member, IEEE, Theodore Groves,

More information

H.264/AVC Baseline Profile Decoder Complexity Analysis

H.264/AVC Baseline Profile Decoder Complexity Analysis 704 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 7, JULY 2003 H.264/AVC Baseline Profile Decoder Complexity Analysis Michael Horowitz, Anthony Joch, Faouzi Kossentini, Senior

More information

PACKET-SWITCHED networks have become ubiquitous

PACKET-SWITCHED networks have become ubiquitous IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 13, NO. 7, JULY 2004 885 Video Compression for Lossy Packet Networks With Mode Switching and a Dual-Frame Buffer Athanasios Leontaris, Student Member, IEEE,

More information

Scalable multiple description coding of video sequences

Scalable multiple description coding of video sequences Scalable multiple description coding of video sequences Marco Folli, and Lorenzo Favalli Electronics Department University of Pavia, Via Ferrata 1, 100 Pavia, Italy Email: marco.folli@unipv.it, lorenzo.favalli@unipv.it

More information