Video is among the most computationally intensive applications on mobile and consumer devices, and as display resolutions grow, the computational demands of video processing become even more challenging. The latest video coding technologies, HEVC (H.265) and VP9, need much more processing power than their respective predecessors, H.264 and VP8. With current silicon technology it may not be possible to raise the CPU clock beyond a certain point because of thermal issues. However, chip makers have recently launched heterogeneous Systems on Chip (SoCs) with multiple processing units that can deliver the compute performance video algorithms increasingly demand. The Samsung® Exynos™, NVIDIA® Tegra® and Qualcomm® Snapdragon™ chipset series, to name just a few, are powered by the ARMv7 architecture and incorporate multiple CPU cores (running as high as 2.5GHz) along with GPU Compute capability. These platforms give video software makers greater computational power, but programmers need to design and architect their software for parallel execution to extract the maximum performance from multi-core systems.
In this blog we discuss the conventional techniques that video software designers have adopted to parallelize their software for multi-core platforms, together with their pros and cons and the key points to consider when writing software for a parallel compute architecture. We argue that traditional techniques can no longer deliver the desired results on modern SoCs, where CPU clocks change dynamically under power management technologies such as Dynamic Voltage and Frequency Scaling (DVFS) and ARM® big.LITTLE™. We then propose a novel hybrid technique based on massive parallelism that exploits multiple CPUs effectively in various scenarios, achieves almost perfect load balancing and facilitates easy work offload to GPU Compute.
Multi-core Programming Models
Parallel computing is becoming commonplace and most performance critical software is being ported to take advantage of multi-core systems. Optimal load balancing can be challenging if the software has not been suitably architected and resource sharing has not been designed correctly. This becomes even more challenging when GPU Compute based acceleration is used to complement CPU based parallelism. To use the CPU cores simultaneously and effectively, the software needs to create multiple threads so that the scheduler can map them onto the available cores. Two popular multithreading models are as follows:
Model 1: Functional Split
In the functional split-based model, one core performs a specific type of functionality and passes the processed intermediate data on to the next core, and so on. The pipeline typically operates at a granularity of a few blocks. The module-to-core mapping is done statically, based on profile data obtained for a few representative streams for the use case, in such a way that all cores are uniformly loaded.
On a quad-core system (like Qualcomm’s Snapdragon 805) the core mapping in the functional split-based model might resemble Figure 1. In this model each module of the decoder is mapped to a particular core. Because each core executes a smaller amount of code, the approach makes good use of the instruction cache. Intermediate data can be passed efficiently to the next core via the shared L2 cache with the help of the Snoop Control Unit (SCU).
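The functional split can be sketched as a chain of threads connected by small queues. This is a minimal illustration, not the decoder described here: the four stage functions are hypothetical stand-ins that only tag a block, the queue depth of 4 is an assumed value, and the threads are not pinned to cores (the OS scheduler decides placement). CPython threads also do not run compute in parallel, so the sketch shows the structure rather than a speed-up.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream

def stage_worker(func, inbox, outbox):
    """Run one pipeline stage: take a block, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))

# Hypothetical stand-ins for decoder modules; each just tags the block.
def entropy_decode(block):
    return block + ["entropy"]

def inverse_transform(block):
    return block + ["idct"]

def motion_compensate(block):
    return block + ["mc"]

def loop_filter(block):
    return block + ["filter"]

STAGES = [entropy_decode, inverse_transform, motion_compensate, loop_filter]

def run_pipeline(blocks, stages=STAGES, depth=4):
    # Bounded queues keep the pipeline only a few blocks deep; the final
    # queue is unbounded so the last stage never stalls on output.
    queues = [queue.Queue(maxsize=depth) for _ in stages] + [queue.Queue()]
    threads = [
        threading.Thread(target=stage_worker,
                         args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for block in blocks:
        queues[0].put(block)
    queues[0].put(SENTINEL)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            return results
        results.append(item)
```

Calling `run_pipeline([[i] for i in range(8)])` returns the blocks in their original order, each tagged by all four stages. If one stage is slower than the others, every queue upstream of it fills up, which is the load-balancing weakness of this model.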
Model 2: Spatial Split
In the spatial split-based model, each core performs all the processing for a specific region of a video frame. For instance, a frame can be spatially divided into four parts, with each part processed by a different core. Since a core runs all the functional modules itself for its allocated region of the frame, no frequent synchronization is needed during frame processing. The HEVC and VP9 standards already support tile based encoding, which allows spatial partitioning across cores.
Improved data cache efficiency is an advantage of this model: because intermediate data is consumed within the same core, it stays cached in L1 for the next module. However, significant L1 instruction cache thrashing may occur if the decoder code size exceeds the L1 instruction cache size. This method can provide an almost ideal load balance across cores on symmetric multi-core SoCs.
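A minimal sketch of the spatial split, assuming the frame is a list of pixel rows and that it is divided into horizontal bands, one per core. `decode_region` is a hypothetical placeholder for the full chain of decoder modules (here it only inverts pixel values), and the thread pool illustrates the structure rather than real parallel speed-up under CPython.

```python
from concurrent.futures import ThreadPoolExecutor

def split_rows(height, parts):
    """Return (start, end) row ranges covering [0, height) as evenly as possible."""
    base, extra = divmod(height, parts)
    ranges, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges

def decode_region(frame, rows):
    """Placeholder for running every decoder module on one band of the frame."""
    start, end = rows
    return [[255 - pixel for pixel in frame[r]] for r in range(start, end)]

def decode_frame_spatial(frame, cores=4):
    ranges = split_rows(len(frame), cores)
    with ThreadPoolExecutor(max_workers=cores) as pool:
        # One task per band; the only synchronization is at the end of the frame.
        bands = list(pool.map(lambda rows: decode_region(frame, rows), ranges))
    return [row for band in bands for row in band]
```

For a frame with 10 rows and 4 cores, `split_rows(10, 4)` gives `[(0, 3), (3, 6), (6, 8), (8, 10)]`. The bands are equal in area but not necessarily in decoding cost, which matters once content complexity or core speed varies.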
Proposed Hybrid and Massive Parallelism Based Multithreading
There are numerous shortcomings that make the above traditional multithreading techniques ineffective with next generation multi-core devices. The functional split method makes load balancing difficult as the execution times of divided processes may not be uniform, particularly with the increasing number of cores in the system. Frames with more spatial dependencies (intra blocks) may also cause a slowdown in this model. Load balancing may worsen when there is a significant variation in content bit-rate or certain toolsets are not present.
The spatial split method is ideal for symmetric multi-core parallelism; however, it is difficult to follow on heterogeneous platforms that involve GPU cores, where massive block-level parallelism is desired. Load balancing with the spatial split also becomes suboptimal when CPU clock frequencies are changed at run time by power management mechanisms like DVFS and big.LITTLE. The varying spatial complexity of a video frame can cause further load imbalance with this method.
To mitigate the above, a hybrid and massive parallelism based method is proposed for multi-core video decoder implementations. This model meets four primary objectives:
- Real-time processing for any kind of stream (even when toolsets like tiles and Wavefront Parallel Processing (WPP) are not enabled in the stream)
- Uniform DDR load throughout the processing
- A design that facilitates massive parallelism for easy GPU offload
- Scalability as per number of cores and changing clocks
Figure 3 below illustrates the hybrid multithreading approach. Entropy decoding of the current frame runs in parallel with the second half of the previous frame's processing; that second half can be further split spatially into multiple threads, or massively parallelized if it is to be accelerated on a GPU. If the cores are not adequately loaded, even entropy decoding can be further split into multiple threads (when the tiles/WPP toolsets are present).
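The overlap between frames can be sketched as follows. This is an illustration under stated assumptions, not the actual decoder: `entropy_decode` and `reconstruct_region` are hypothetical placeholders, the second half of the processing is assumed to split into four spatial regions, and a single thread pool stands in for the available cores.

```python
from concurrent.futures import ThreadPoolExecutor

REGIONS = 4  # assumed number of spatial regions per frame

def entropy_decode(frame_id):
    """Placeholder: produce the decoded symbols for each region of a frame."""
    return [(frame_id, region) for region in range(REGIONS)]

def reconstruct_region(symbols):
    """Placeholder for the second half of the processing on one region."""
    frame_id, region = symbols
    return f"frame{frame_id}-region{region}"

def decode_stream(num_frames, cores=4):
    output = []
    with ThreadPoolExecutor(max_workers=cores) as pool:
        pending = pool.submit(entropy_decode, 0)  # prime the pipeline
        for frame_id in range(num_frames):
            symbols = pending.result()
            # Start entropy decoding of the next frame before the current
            # frame's reconstruction, so both are in flight together.
            if frame_id + 1 < num_frames:
                pending = pool.submit(entropy_decode, frame_id + 1)
            output.append(list(pool.map(reconstruct_region, symbols)))
    return output
```

The functional dimension (entropy decoding versus reconstruction) and the spatial dimension (regions within reconstruction) are combined in one task pool, so a core that finishes its region early can pick up the next frame's entropy decoding instead of waiting at the end-of-frame barrier.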
Parallelizing an HEVC Decoder to Deliver the Best Performance
We applied the proposed hybrid technique to optimize an HEVC decoder for ARM Cortex®-A15 CPU based multi-core devices and achieved improved load balancing. Although the HEVC decoder is used as the example, the techniques are generic and can be applied to most performance critical software.
CPU time is precious, never leave it waiting (have adequate threads): To minimize CPU waiting time, ensure that the number of active (running) threads at any given time never falls below the number of cores in the system. This can be achieved by scheduling the compute intensive tasks in a set of N parallel threads in such a way that another set of N threads is always in a ready state before the currently running N threads finish their work (where N is the number of cores in the system).
Create threads in multiple planes: In the HEVC decoder implementation, we found that even with adequate threads (e.g. with spatial plane based multithreading) there could be synchronization delays towards the end of frame processing. The hybrid multithreading approach mitigated this issue and brought significant benefits: we triggered a few threads that processed the next frame in parallel with the current frame, which significantly improved utilization of CPU time.
Memory bandwidth – maximize the number of ALU instructions per memory access: Software is a mix of arithmetic and memory instructions, and the right ratio of the two makes it high performing. Memory is typically clocked at a lower frequency than the CPU and may struggle to keep pace with CPUs running at much higher frequencies if the software is memory access intensive. Memory throughput is yet another aspect that needs to be addressed when architecting software for multi-core. For example, if we launch four threads in parallel, the number of memory accesses per second can also quadruple and might become a bottleneck. In such a case it may be wiser to re-architect the software to reduce memory reads and writes. More compact data structures, data packing and avoiding lookups can help in such scenarios.
Maintain uniform memory access rate: In video processing a few modules are memory access intensive, for example motion estimation in video encoders. With the spatial split all the cores execute similar processing, and so can end up choking the memory channels when the memory intensive modules run concurrently on all cores. Combining the functional multithreading method with the spatial method helped the HEVC decoder distribute DDR accesses uniformly. In addition, sudden bursts of memory copies were avoided to further improve performance.
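As a small illustration of data packing, the sketch below stores three per-block fields in a single byte instead of one byte each. The field names and bit widths (a 1-bit skip flag, a 2-bit mode and a 3-bit log2 block size) are hypothetical choices made for this example and are not taken from the HEVC decoder.

```python
def pack_block_info(skip, mode, log2_size):
    """Pack into one byte: bit 5 = skip, bits 4-3 = mode, bits 2-0 = log2_size."""
    assert skip in (0, 1) and 0 <= mode < 4 and 0 <= log2_size < 8
    return (skip << 5) | (mode << 3) | log2_size

def unpack_block_info(byte):
    """Inverse of pack_block_info: return (skip, mode, log2_size)."""
    return (byte >> 5) & 1, (byte >> 3) & 3, byte & 7

blocks = [(1, 2, 5), (0, 3, 4), (0, 0, 6)]

# Packed layout: one byte per block.
packed = bytes(pack_block_info(*b) for b in blocks)

# Naive layout: one byte per field, three bytes per block.
unpacked = bytes(value for b in blocks for value in b)
```

Here `len(packed)` is 3 and `len(unpacked)` is 9, so the packed layout moves a third of the bytes through the memory system for the same information, at the cost of a few shift and mask instructions per block. That trade raises the ratio of ALU work to memory traffic.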
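The effect can be illustrated with a toy model that counts how many cores are in a memory-heavy stage in each time slot. The stage names, the choice of `motion_comp` as the memory-heavy stage and the equal-length time slots are assumptions made for this sketch, and the staggered schedule is only one simple way to desynchronize cores, not necessarily the mechanism used in the decoder.

```python
STAGES = ["entropy", "motion_comp", "transform", "filter"]
MEMORY_HEAVY = {"motion_comp"}  # assumed memory-intensive stage

def lockstep_schedule(cores, slots):
    """Pure spatial split: every core runs the same stage in the same slot."""
    return [[STAGES[t % len(STAGES)] for t in range(slots)]
            for _ in range(cores)]

def staggered_schedule(cores, slots):
    """Each core is offset by one stage relative to its neighbour."""
    return [[STAGES[(t + c) % len(STAGES)] for t in range(slots)]
            for c in range(cores)]

def peak_memory_heavy(schedule):
    """Largest number of cores in a memory-heavy stage during any one slot."""
    slots = len(schedule[0])
    return max(sum(core[t] in MEMORY_HEAVY for core in schedule)
               for t in range(slots))
```

With four cores, `peak_memory_heavy(lockstep_schedule(4, 8))` is 4 while `peak_memory_heavy(staggered_schedule(4, 8))` is 1: the total number of memory-heavy slots is the same, but the demand on the memory channels is spread evenly over time.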
Load balancing with dynamically changing clocks (DVFS and big.LITTLE): Most platforms targeted at the handheld market use power management techniques such as DVFS (Dynamic Voltage and Frequency Scaling), where CPU frequencies vary dynamically based on the processing load and other factors. big.LITTLE technology from ARM is another innovation that switches dynamically between CPU cores of different sizes, based on the application load, in order to improve battery life. A load balancing (multithreading) design that assumes uniform clocking of all CPU cores would fail to balance the load optimally in such scenarios. To circumvent this we need to add a third dimension to software, called “scalability”. Smart scalable software adapts itself to non-uniform and dynamically changing CPU clocks and keeps the load balanced as conditions change. In the HEVC decoder the proposed hybrid multithreading method worked well with non-uniform clocking too. We tested the HEVC decoder on various commercial devices (Samsung Galaxy Note 10.1, Google Nexus 5, Google Nexus 10 and Samsung Galaxy S4) with DVFS and big.LITTLE enabled and found that the proposed hybrid approach delivered 20-25% better results than the functional or spatial method of load balancing. The performance gain went up to 40% for content with higher bit-rate variations.
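Why a design that assumes uniform clocks loses out can be seen in a small simulation. It compares a static equal split of work against cores pulling small chunks from a shared queue, which is one common way to obtain this kind of scalability and is an assumption of the sketch rather than a description of the decoder's internals. The core speeds and work units are made-up values, and the numbers it produces are not related to the measurements quoted above.

```python
import heapq

def static_split_time(total_work, speeds):
    """Equal share per core up front; the frame ends when the slowest core does."""
    share = total_work / len(speeds)
    return max(share / speed for speed in speeds)

def dynamic_queue_time(total_work, speeds, chunk):
    """Cores pull fixed-size chunks from a shared queue whenever they are free."""
    # Heap of (time at which the core becomes free, core index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    remaining = total_work
    finish = 0.0
    while remaining > 0:
        now, core = heapq.heappop(free_at)
        work = min(chunk, remaining)
        remaining -= work
        done = now + work / speeds[core]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, core))
    return finish

# Two fast cores and two cores running at half the speed (hypothetical values).
speeds = [2.0, 2.0, 1.0, 1.0]
static_time = static_split_time(120, speeds)       # 30.0 time units
dynamic_time = dynamic_queue_time(120, speeds, 1)  # 20.0 time units
```

With the static split the two slow cores take 30 time units while the fast cores sit idle for half of that. With fine-grained chunks the frame completes in 20 time units, which is the ideal for a combined speed of 6 work units per time unit. The finer the work units, the less the design depends on knowing core speeds in advance.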
Enabling massive parallelism for GPU Compute: Using a GPU (such as an ARM Mali Graphics Processing Unit) for general purpose computation not only gives the software an additional performance boost but also improves battery life, an advantage on handheld devices. All modern SoCs are loaded with GPU Compute capabilities, and load balancing becomes even more challenging on such heterogeneous platforms. GPUs are different! Unlike CPUs, they work best when there are many tasks that can be executed in parallel. For GPU Compute, massive data parallelism is to be created as explained in . Offloading (mapping/un-mapping) overheads and memory sharing between the CPU and GPU have to be taken into account when designing software that exploits GPU Compute along with multi-core processing. In the hybrid approach for the HEVC decoder, the data required by the GPU was accumulated upfront on the CPU for the entire frame, so that the modules running on the GPU could exhibit block-level parallelism seamlessly. Entropy decoding (running on the multi-core CPUs) was extended with extra processing that prepared data for the various modules running on the GPU; as a result the GPU could comfortably perform the processing for motion compensation, IDCT, loop filtering and SAO using block-level parallelism with fewer synchronization overheads.
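The idea of accumulating data upfront can be sketched on the CPU alone. In this illustration the frame is reduced to a one-dimensional row of blocks, each with a hypothetical motion offset `mv` and a `residual`; `block_kernel` plays the role of a single GPU work-item. No real GPU API is used, and the names are invented for the example.

```python
def prepare_gpu_inputs(blocks):
    """CPU side: flatten per-block data for the whole frame into parallel arrays."""
    offsets = [b["mv"] for b in blocks]
    residuals = [b["residual"] for b in blocks]
    return offsets, residuals

def block_kernel(i, reference, offsets, residuals):
    """Work for block i: prediction from the reference frame plus the residual.

    It reads only the inputs for index i and the previous (reference) frame,
    so it has no dependency on any other block of the current frame.
    """
    src = (i + offsets[i]) % len(reference)
    return reference[src] + residuals[i]

def run_frame(reference, blocks, order=None):
    offsets, residuals = prepare_gpu_inputs(blocks)
    out = [None] * len(blocks)
    # A GPU would run all indices concurrently; any order gives the same result.
    for i in (order if order is not None else range(len(blocks))):
        out[i] = block_kernel(i, reference, offsets, residuals)
    return out
```

Because every input a block needs is available before the first work-item starts, the blocks can be processed in any order, which is the property that lets them be dispatched as one large batch rather than with a synchronization point per block.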
In this blog, we discussed a novel hybrid technique based on massive parallelism that can deliver high performing software on modern multi-core SoCs, along with the key factors to consider when designing software for highly parallel platforms. We concluded that the proposed hybrid multithreading technique can facilitate GPU based acceleration and deliver better results when CPU clocks are changing dynamically. An HEVC decoder was used as the example to compare results and benefits. The concepts discussed are generic and apply to any performance critical software written for a parallel platform.
Bingbing Xia, Fei Qiao, Huazhong Yang and Hui Wang, "An efficient methodology for transaction-level design of multi-core h.264 video decoder", Consumer Electronics (ICCE), 2011 IEEE International Conference, Jan. 2011
Kue-Hwan Sihn, Hyunki Baik, Jong-Tae Kim, Sehyun Bae and Hyo Jung Song, "Novel approaches to parallel H.264 decoder on symmetric multicore systems", Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference, Apr. 2009
Nishihara, K., Hatabu, A. and Moriyoshi, T., "Parallelization of H.264 video decoder for embedded multicore processor", Multimedia and Expo, 2008 IEEE International Conference, Apr. 2008
Falcao, G., Sousa, L. and Silva, V., "Massively LDPC Decoding on Multicore Architectures", Parallel and Distributed Systems, IEEE Transactions, Feb. 2011
Ngai-Man Cheung, Xiaopeng Fan, Au, O.C. and Man-Cheung Kung, "Video Coding on Multicore Graphics Processors", Signal Processing Magazine, IEEE, Issue 2, Mar. 2010
Yun-il Kim, Jong-Tae Kim, Sehyun Bae, Hyunki Baik and Hyo Jung Song, "H.264/AVC decoder parallelization and optimization on asymmetric multicore platform using dynamic load balancing", Multimedia and Expo, 2008 IEEE International Conference, June 23 2008-April 26 2008
Hyunki Baik, Kue-Hwan Sihn, Yun-il Kim, Sehyun Bae, Najeong Han and Hyo Jung Song, "Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture", Signal Processing and Information Technology, 2007 IEEE International Symposium, 15-18 Dec. 2007
Alvarez Mesa, M., Ramirez, A., Azevedo, A., Meenderinck, C., Juurlink, B. and Valero, M., "Scalability of Macroblock-level Parallelism for H.264 Decoding", Parallel and Distributed Systems (ICPADS), 2009 15th International Conference, 8-11 Dec. 2009
ARM Limited, "Cortex™-A15 Revision: r2p0, Technical Reference Manual", http://infocenter.arm.com, Sept 2011
ITU-T, "Recommendation ITU-T H.265", www.itu.int, Apr. 2013
Sanjeev Verma, "Enabling GPU Compute on an ARM Mali-T600 GPU creates a power efficient HEVC decode solution", Aricent: Enabling GPU Compute on an ARM Mali-T6… | ARM Connected Community, Feb 2014
This article was originally published in ARM Connected Community