MM-4105: Realtime 4K HDR Decoding with GPU ACES, by Gary Demos

REALTIME 4K HDR DECODING WITH GPU ACES

GARY DEMOS

IMAGE ESSENCE LLC

4k Realtime (24fps 2D) Image Bandwidth

• EXR half-float (e.g. ACES/OCES) or 16-bit unsigned short integers:
  - 2 Bytes/color x 3 colors (RGB) x 4096 x 2160 x 24fps = 1.27 GBytes/sec = 10.2 Gbps
• 32-bit floats (used inside OpenCL in the GPU and within most CPU decoding steps):
  - 4 Bytes/color x 3 colors (RGB) x 4096 x 2160 x 24fps = 2.54 GBytes/sec = 20.4 Gbps
• 10-bit DPX-packed pixels:
  - 4 Bytes/3 colors x 3 colors (RGB) x 4096 x 2160 x 24fps = 0.85 GBytes/sec = 6.8 Gbps
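The arithmetic behind these figures can be reproduced with a short sketch. The function name `image_bandwidth` and the format labels are illustrative only, and decimal units (1 GByte = 10^9 bytes) are assumed, which is what the slide figures appear to use.

```python
def image_bandwidth(bytes_per_pixel, width=4096, height=2160, fps=24):
    """Return (bytes_per_second, bits_per_second) for uncompressed frames."""
    bytes_per_second = bytes_per_pixel * width * height * fps
    return bytes_per_second, bytes_per_second * 8


# Bytes per RGB pixel for each storage format named above.
formats = {
    "half-float / 16-bit RGB": 2 * 3,  # 2 bytes per color, 3 colors
    "32-bit float RGB": 4 * 3,         # 4 bytes per color, 3 colors
    "10-bit DPX packed RGB": 4,        # three 10-bit colors in one 32-bit word
}

for name, bytes_per_pixel in formats.items():
    byte_rate, bit_rate = image_bandwidth(bytes_per_pixel)
    print(f"{name}: {byte_rate / 1e9:.2f} GBytes/sec = {bit_rate / 1e9:.1f} Gbps")
```

Passing `fps=60` or a different resolution gives the corresponding rate for other formats.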

Future Frontiers

• 2 Bytes/color x 3 colors (RGB) x 4096 x 2160 x 60fps = 3.19 GBytes/sec = 25.5 Gbps
• 3D: any of the above x2

• DisplayPort 1.2 goes up to 20 Gbps
• A W9000 has six DisplayPort 1.2 outputs
• The demonstration system has four W9000s
• That’s 24 DisplayPort 1.2 outputs!
• Total available pixel output is 24 x 20 Gbps = 480 Gbps
• That’s more than:
  - 2x (3D) 2 Bytes/color x 3 colors (RGB) x 8192 x 4320 x 120fps = 51.0 GBytes/sec = 407.7 Gbps!
  - Could work up to this in an array of displays
• Still a few issues (at least for this author):
  - Locking playback speed with pixels from CL
  - Synchronizing audio
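As a rough check on the display-output headroom, the sketch below estimates how many DisplayPort 1.2 outputs a given uncompressed stream would occupy. It uses the 20 Gbps per-output figure quoted on the slide and assumes, purely for illustration, that a stream can be divided freely across outputs (as in a tiled array of displays). The names `stream_gbps` and `outputs_needed` are hypothetical.

```python
import math

DP12_GBPS = 20         # per-output rate quoted on the slide
OUTPUTS_PER_W9000 = 6
NUM_W9000 = 4


def stream_gbps(bytes_per_color, width, height, fps, views=1, colors=3):
    """Uncompressed pixel rate in Gbps (10^9 bits/sec); views=2 for 3D."""
    return bytes_per_color * colors * width * height * fps * views * 8 / 1e9


def outputs_needed(gbps, per_output=DP12_GBPS):
    """Number of outputs needed if the stream is split evenly across them."""
    return math.ceil(gbps / per_output)


total_outputs = OUTPUTS_PER_W9000 * NUM_W9000
capacity = total_outputs * DP12_GBPS
demand = stream_gbps(2, 8192, 4320, 120, views=2)

print(f"capacity: {total_outputs} outputs, {capacity} Gbps")
print(f"8k 3D 120fps half-float: {demand:.1f} Gbps, "
      f"needs {outputs_needed(demand)} of {total_outputs} outputs")
```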

Realtime Floating Point ACES Decoding Including Realtime Interactive Adjustment and RRT/ODT in the GPU

Pipeline (block diagram):

• Compressed Bitfiles (SATA FlashRAM)
• Floating Point Decoding
• ACES GPU Processing in OpenCL
  - Sharpen/soften spatial filter
  - Transform to P3 colorspace
  - ASC CDL adjustments
  - Transform back to ACES
  - RRT and ODT in 3D LUT
  - Fix and pack pixels
• Packed Pixels Ready for Display
• DVS Atomix: Fifo of Frames for Smooth Playout
• 4k Realtime 10/12-bits RGB

Hardware: 4x FirePro W9000s, 2x Intel E5-2690 CPUs
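The ASC CDL adjustment step in the GPU processing stage can be sketched as follows. This is a minimal per-pixel illustration of the published ASC CDL slope/offset/power and saturation operations, not the presentation's OpenCL macros. Two assumptions are made here: negative values are clamped to zero before the power function while values above 1.0 are left unclamped for HDR, and the standard Rec. 709 luma weights are used for saturation even though the presentation applies the CDL in P3.

```python
REC709_LUMA = (0.2126, 0.7152, 0.0722)  # luma weights from the ASC CDL saturation step


def asc_cdl(rgb, slope=(1, 1, 1), offset=(0, 0, 0), power=(1, 1, 1), saturation=1.0):
    """Apply slope/offset/power per channel, then saturation, to one RGB pixel."""
    graded = []
    for value, s, o, p in zip(rgb, slope, offset, power):
        v = value * s + o
        # Assumption: clamp negatives so fractional powers stay real;
        # no upper clamp, so HDR values above 1.0 pass through.
        graded.append(max(v, 0.0) ** p)
    luma = sum(w * c for w, c in zip(REC709_LUMA, graded))
    return tuple(luma + saturation * (c - luma) for c in graded)


# Example: lift and gain a mid-gray pixel, then desaturate a pure red pixel.
print(asc_cdl((0.18, 0.18, 0.18), slope=(1.2, 1.2, 1.2), offset=(0.01, 0.01, 0.01)))
print(asc_cdl((1.0, 0.0, 0.0), saturation=0.5))
```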

CPU Partitioning

• Running Scientific Linux 6.4
• Relying on a fifo-of-frames in the DVS Atomix, using the FIFO API, to smooth out the non-realtime attributes of Linux
• Multiple decoder processes forked at startup
• Compressed bitfiles are retrieved by each process from SATA FlashRAM/SSD
• The number of decoder processes is selected at runtime startup (tuned for performance and available memory)

CPU Partitioning (cont.)

• Parent process becomes the display process
• Display process creates shared memory and signals the decoder processes, via semaphores, when buffers are available
• Each decoder process creates a frame or range of frames
• The display process manages shared memory and DMA to/from the GPUs and DVS Atomix
• Display process tells decoder processes when buffers again become available
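The buffer handshake between the display process and the decoder processes can be illustrated with a small in-process model. This sketch substitutes Python threads for forked processes and ordinary lists for shared memory, so it shows only the semaphore protocol, not the actual implementation. All names (`DecoderChannel`, `decode`, the interleaved frame assignment, two buffers per decoder) are assumptions made for illustration.

```python
import threading

NUM_DECODERS = 3
BUFFERS_PER_DECODER = 2
NUM_FRAMES = 12


class DecoderChannel:
    """Buffers plus the two semaphores shared by the display and one decoder."""

    def __init__(self):
        self.buffers = [None] * BUFFERS_PER_DECODER
        self.free = threading.Semaphore(BUFFERS_PER_DECODER)  # display -> decoder
        self.ready = threading.Semaphore(0)                   # decoder -> display


def decode(frame):
    return f"frame-{frame}"  # placeholder for real bitfile decoding


def decoder_loop(index, channel):
    slot = 0
    # Each decoder takes every NUM_DECODERS-th frame (hypothetical assignment).
    for frame in range(index, NUM_FRAMES, NUM_DECODERS):
        channel.free.acquire()            # wait until a buffer is available
        channel.buffers[slot] = decode(frame)
        channel.ready.release()           # tell the display a frame is ready
        slot = (slot + 1) % BUFFERS_PER_DECODER


def display_loop(channels):
    shown = []
    slots = [0] * NUM_DECODERS
    for frame in range(NUM_FRAMES):
        d = frame % NUM_DECODERS
        channel = channels[d]
        channel.ready.acquire()           # wait for this frame's decoder
        shown.append(channel.buffers[slots[d]])
        slots[d] = (slots[d] + 1) % BUFFERS_PER_DECODER
        channel.free.release()            # buffer is available again
    return shown


def run():
    channels = [DecoderChannel() for _ in range(NUM_DECODERS)]
    workers = [threading.Thread(target=decoder_loop, args=(i, ch))
               for i, ch in enumerate(channels)]
    for worker in workers:
        worker.start()
    shown = display_loop(channels)        # the parent acts as the display
    for worker in workers:
        worker.join()
    return shown


if __name__ == "__main__":
    print(run())
```

Giving each decoder its own buffers and semaphore pair keeps a fast decoder from starving the one that holds the next frame to display, which would otherwise stall in-order playout.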

GPU Partitioning:

• numDevices OpenCL call provides the number of GPUs available
• Vertical screen height partitioned into numDevices
• Four FirePro W9000 GPUs in this demonstration system
• All GPUs share a common “context” and associated “kernels” (one CL interpret)
• Each of the four GPUs given a “command_queue” and separate “cl_mem” buffers

GPU Partitioning (cont.)

• Kernel args for each cl_mem are updated for each of the four GPUs before invoking the kernel with that GPU’s command_queue
• Each GPU is given 1/4 of the screen height via EnqueuedWrites of half-float ACES
• Each GPU’s packed pixels are retrieved into the appropriate quarter of the screen height via EnqueuedReads of packed pixels
• Double-buffered DMA (getbuffer/putbuffer) to the DVS Atomix using the FIFO API (the fifo of frames helps smooth Linux non-realtime aspects, yielding realtime)
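The division of screen height among the detected devices can be sketched as below. The helper names `partition_rows` and `band_byte_range` are hypothetical; the byte offset and size they produce are the kind of values that would be passed to each GPU's write of its band of half-float ACES pixels (6 bytes per RGB pixel is assumed).

```python
def partition_rows(height, num_devices):
    """Split `height` scanlines into `num_devices` contiguous (start, end) bands."""
    base, extra = divmod(height, num_devices)
    bands = []
    start = 0
    for device in range(num_devices):
        rows = base + (1 if device < extra else 0)  # spread any remainder
        bands.append((start, start + rows))
        start += rows
    return bands


def band_byte_range(band, width=4096, bytes_per_pixel=6):
    """Return (byte_offset, byte_count) of a band within a full frame buffer."""
    first_row, end_row = band
    row_bytes = width * bytes_per_pixel
    return first_row * row_bytes, (end_row - first_row) * row_bytes


for device, band in enumerate(partition_rows(2160, 4)):
    offset, size = band_byte_range(band)
    print(f"GPU {device}: rows {band[0]}..{band[1] - 1}, offset {offset}, {size} bytes")
```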

OpenCL Code:

• Macros are used for all math
• For CPU code, “.h” files are included and the macros invoked
• For GPU code, the CL source includes the same “.h” files, and the macros are invoked within each CL kernel
• Macros are separated into various types:
  - Interaction processing: ACES to/from P3, with ASC_CDL applied in P3
  - RRT (Reference Rendering Transform) processing, using a LUT (faster but less accurate, realtime at 4k) or direct computation (slower but highly accurate, realtime at 2k)
  - ODT (Output Device Transform) processing, for the type of ODT selected
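The tradeoff between LUT-based and direct computation can be illustrated with a generic trilinear 3D LUT lookup. This is a sketch only: `toy_curve` is a placeholder tone curve and not the RRT, the LUT domain is assumed to be [0, 1] per channel, and a production LUT for scene-linear ACES would likely also need an input shaper for values above 1.0, which is omitted here.

```python
def make_lut(size, transform):
    """Sample `transform` (rgb -> rgb) on a size^3 grid covering [0, 1]^3."""
    step = 1.0 / (size - 1)
    return [[[transform((r * step, g * step, b * step))
              for b in range(size)]
             for g in range(size)]
            for r in range(size)]


def lookup(lut, rgb):
    """Trilinear interpolation of `rgb` in a LUT built by make_lut."""
    size = len(lut)
    index, frac = [], []
    for c in rgb:
        pos = min(max(c, 0.0), 1.0) * (size - 1)
        i = min(int(pos), size - 2)  # keep i + 1 inside the grid
        index.append(i)
        frac.append(pos - i)
    (r0, g0, b0), (fr, fg, fb) = index, frac
    out = [0.0, 0.0, 0.0]
    # Blend the eight surrounding grid points by their trilinear weights.
    for dr, wr in ((0, 1 - fr), (1, fr)):
        for dg, wg in ((0, 1 - fg), (1, fg)):
            for db, wb in ((0, 1 - fb), (1, fb)):
                corner = lut[r0 + dr][g0 + dg][b0 + db]
                weight = wr * wg * wb
                for k in range(3):
                    out[k] += weight * corner[k]
    return tuple(out)


def toy_curve(rgb):
    """Placeholder tone curve standing in for a real rendering transform."""
    return tuple(c / (1.0 + c) for c in rgb)


lut = make_lut(33, toy_curve)
pixel = (0.3, 0.5, 0.7)
print("direct:", toy_curve(pixel))
print("3D LUT:", lookup(lut, pixel))
```

The lookup costs a fixed eight fetches and a handful of multiplies regardless of how expensive the sampled transform is, at the price of interpolation error between grid points.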

OpenCL Code (cont.)

• Final step in CL is conversion of 32-bit floats to fixed point, and RGB packing (either 10 bits or 16 bits), adding +-1/2 LSB noise dither
• OpenCL does not include a random number intrinsic, so random numbers for dithering are DMA’d up to the GPU for use in noise dither, using a randomizing function of frame number and scanline
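The fix-and-pack step can be sketched on the CPU as follows. The noise generator `scanline_noise` is a hypothetical stand-in for the precomputed random numbers sent to the GPU, and the 10-bit word layout (red in the top bits, two padding bits at the bottom, as in DPX "filled method A") is an assumption, since the slides do not state the packing order.

```python
import math
import random


def scanline_noise(frame, scanline, count):
    """Hypothetical noise table in [-0.5, 0.5), seeded by frame and scanline."""
    rng = random.Random(frame * 100003 + scanline)
    return [rng.random() - 0.5 for _ in range(count)]


def to_fixed(value, noise, bits=10):
    """Quantize a float in [0, 1] to `bits` bits with +-1/2 LSB noise dither."""
    max_code = (1 << bits) - 1
    code = math.floor(value * max_code + 0.5 + noise)  # rounding perturbed by noise
    return min(max(code, 0), max_code)


def pack_rgb10(r, g, b):
    """Pack three 10-bit codes into one 32-bit word (assumed layout, see above)."""
    return (r << 22) | (g << 12) | (b << 2)


def pack_scanline(pixels, frame, scanline):
    """Dither and pack a list of (r, g, b) float pixels into 32-bit words."""
    noise = scanline_noise(frame, scanline, 3 * len(pixels))
    words = []
    for i, (r, g, b) in enumerate(pixels):
        codes = [to_fixed(c, noise[3 * i + k]) for k, c in enumerate((r, g, b))]
        words.append(pack_rgb10(*codes))
    return words


print([hex(w) for w in pack_scanline([(0.0, 0.5, 1.0), (0.18, 0.18, 0.18)], 1, 0)])
```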

Reasons for liking OpenCL:

• Support for DEVICE_TYPE_CPU as well as DEVICE_TYPE_GPU
• Vendor independence
• Portability
• Easily extended to automatically utilize multiple GPUs by setting up multiple command queues based upon the number of devices detected at runtime
• Runtime interpret is often convenient
• Excellent description of expected precision for math intrinsic functions
• Strong support for both 32-bit and 64-bit floating point

Reasons for liking OpenCL (cont.)

• Well-thought-out device and system query capabilities
• get_global_id provides an excellent mechanism for parallelism without requiring further consideration of lower-level hardware organization
• Easy specification of global, constant, and local datatypes
• Pipelining control via blocking and non-blocking read and write queues and via clFinish and kernel barriers
• First-class support of half-float using vload_half and vstore_half
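The half-float storage model behind vload_half and vstore_half (16-bit values in memory, 32-bit float arithmetic in the kernel) can be illustrated on the CPU with Python's `struct` module, whose `e` format code is IEEE 754 binary16. This is only an analogy for the storage format, not OpenCL code; the helper names are hypothetical.

```python
import struct


def store_half(value):
    """Round a float to half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def load_half(bits):
    """Expand a 16-bit half-float pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


for value in (1.0, 0.18, 1000.0, 65504.0):
    bits = store_half(value)
    print(f"{value!r}: bits 0x{bits:04X}, reloaded {load_half(bits)!r}")
```

Values such as 0.18 do not survive the round trip exactly, which is one reason to keep intermediate math in 32-bit floats and convert only at load and store.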

Weaknesses of OpenCL (aka “wish list”):

• Difficult to obtain visibility during debugging (although print statements are available on some systems with DEVICE_TYPE_CPU)
• No detail provided by the “out of resources” error (e.g. what resources are we out of?)

Weaknesses of OpenCL (aka “wish list”, cont.):

• Lack of visibility during performance tuning
  - How much time is being spent in read/write queues to/from the CPU?
  - How full are global and constant memory?
  - How much global memory bandwidth is being utilized?
  - How full are registers?
  - If caches are present, how effective are they on a given kernel?
  - Are there unnecessary waits that could be async overlapped?
• The 4-, 8-, and 16-wide CL SIMD types are not mirrored in CPU SSE/AVX/F16 intrinsics.
  - Were they to be identical, they could be used in macros that are included in common between CL kernels and CPU threads

System Performance:

• Limited by memory and bus bandwidth issues
• DirectGMA will improve this
• Plenty of GPU power still available for realtime 4k processing when using 3D LUT RRT/ODT
• CPU power sufficient for wavelet-only floating point decoding at 4k
• CPU power sufficient for motion-compensated flowfield sinc-and-wavelet full configuration at 2k. Speed is about 1/3 realtime at 4k.
• With threads and forked processes, will be able to take advantage of the anticipated major increase in computational cores

CL/GL Interop Exploration:

• Using X11 on Linux (no glut support)
• Get 10-bit depth at setup from X11 (as configuration using glXChooseFBConfig)
• Uses GL, GLX, and CL/GL context (some of this is recent, as of CL 1.2)
• Reduces the amount of memory transfer required, through direct output from the GPU
• Can take over the screen (using X11 XChangeProperty)
• Relies on “FrameBufferObject” and “Acquire” and “Release” by CL (Release by CL implies re-acquire by GL; must clFinish and glFinish correspondingly)
• Can support 4k at 10 bits via DisplayPort 1.2 (and HDMI 1.4a via DP to HDMI dongle)
• Reportedly can be used with MacOSX and Windows (with X11-style constructs)

CL/GL Interop Weaknesses:

• Limited to a single GPU for CL when using a CL/GL FBO
  - Would be nice to have separate FBO quadrant output from each of the four GPUs
• Not smooth timing if the GPU is running near capacity
• No locked audio sync
• No “fifo-of-frames” to smooth out the non-smooth, non-realtime Linux behavior
  - Working on using cl_gl event sync to simulate this

Attributes of the floating point codec

• Layered with 5 layers up to base layer at 1k using wavelets
• Two more layers from 1k to 2k and 2k to 4k built with sinc filters, using wavelet stacks to code the up-res deltas
• Base and up-res layers can be motion compensated (the sinc filter is phase-neutral, with sub-pixel displacement precision to 1/100 pixel)

Floating Point Codec (cont.)

• Flowfield is used at low resolution for motion displacement, coded also as a wavelet stack. Upsized for each layer when applied.
• Floating point coding is automatically adaptive to gamma, since a floating point quantization scale is used for each image region using the average and minimum brightness
• YUV encoding takes advantage of the codec’s unlimited range and negative number reproduction to support full ACES gamut and dynamic range

Frontiers for using the available GPU power:

• Spectral color processing to improve upon CIE 1931 limitations
• More ODTs to take advantage of new HD and UHD displays and new projectors and projection light sources as they increase dynamic range and gamut
• More processing in the pipeline
  - more elaborate sharpening
  - dynamic range regional contrast adaptation
  - additional interactive controls
  - adaptation to viewing surround (if not dark surround)
• Additional work on the RRT, and on existing ODT types (in conjunction with the RRT algorithmic modifications)

Many Thanks to the AMD/ATI FirePro Professional Graphics Group For Their Support
Many Thanks to AMD/APU team for providing 4k Display
Thanks also to R&S/DVS

ACES Overview:

http://www.oscars.org/science-technology/council/projects/pdf/ACESOverview.pdf

Reference papers for Gary Demos:

• The Unfolding Merger of Television and Movie Technology
  SMPTE Conference, Oct 2012
• File and Folder Interactive Decoding
  SMPTE Conference, Oct 2011, including YouTube Video:
  http://www.youtube.com/watch?v=Ggt_8qseGtw
• Layered Motion Compensation
  SMPTE Journal, Jan 2009

24 | PRESENTATION TITLE | NOVEMBER 19, 2013 | CONFIDENTIAL

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.