# Fast HDR-->SDR Conversion Without a GPU
A slightly insane solution to a devilishly irritating problem.
## Problem: What's the Problem?
I like to master my 3D work in HDR, because, well, I can. I'm a color nerd without a reference monitor.
When I want to show my friends, I try to stream this HDR content using the wonderful Jellyfin (Emby fork). Problems ensue: https://github.com/jellyfin/jellyfin/issues/415. This isn't the first time somebody has wanted to do something like this and had trouble: https://www.verizondigitalmedia.com/blog/best-practices-for-live-4k-video-ingestion-and-transcoding/.
Long story short? The colors look **WRONG**. It's all wrong. It irritates me.
I am a normal person: I don't own an HDR screen, so like any normal person I spend days writing software to solve my irritations!
*Plus, my screen just doesn't do BT.2020. I'm not sure any screen really does.*
## Problem: Isn't This Solved in VLC/mpv/etc.?
It is indeed! They do a very nice 2020/PQ --> 709 on the GPU, in real time, and with very nice tone mapping.
However, `ffmpeg` cannot do this in realtime. This means no live transcoding --> streaming of HDR clips through Jellyfin - and none of the other fun realtime imaging one wants to do when one isn't plugged into a loud ASIC more power-hungry than my refrigerator.
## Problem: How to Solve It?
My approach is as follows:
* **Piping**: We keep `ffmpeg` to decode, but have it spit out raw YUV444 data, which we process and spit back out to `ffmpeg` for encoding.
* **Threaded I/O**: Read, Write, and Process happen in different threads - the slowest determines the throughput.
* **8-Bit LUT**: The actual imaging operations are first precomputed into a `256x256x256x3` LUT. These arbitrarily complex global operations are then applied with a simple lookup (no interpolation - just pure 8-bit madness!). There's a sketch of the lookup just below.
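To make the LUT idea concrete, here's a minimal sketch of the lookup (names are illustrative assumptions, not the actual `fast-hdr` internals):

```cpp
// Minimal sketch of the 8-bit LUT lookup; names are illustrative.
// The LUT maps every possible 8-bit YUV triplet to a precomputed output
// triplet: 256^3 entries * 3 channels ~= 50 MB.
#include <cstdint>
#include <cstddef>

constexpr size_t CUBE = 256UL * 256 * 256; // entries per channel "cube"

// Per pixel: three shifts/ORs to build the index, three loads. No math at all.
inline void apply_lut(const uint8_t *lut, uint8_t &y, uint8_t &u, uint8_t &v) {
	size_t i = (size_t)y | ((size_t)u << 8) | ((size_t)v << 16);
	uint8_t ny = lut[i + 0 * CUBE]; // Y cube
	uint8_t nu = lut[i + 1 * CUBE]; // U cube
	uint8_t nv = lut[i + 2 * CUBE]; // V cube
	y = ny; u = nu; v = nv;
}
```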
## Problem: So - Was it Worth It?
Well - all the optimization and reverse engineering aside, the colors in my HDR content look right through Jellyfin. So? All is right with the world again!
Though, to be honest, none of my non-technical peers quite understand where I've been the past couple days...
# Features
How can `fast-hdr` make your day brighter?
* [**Hackable**] Want to easily implement arbitrarily complex global imaging operations, whilst seeing them executed with decent accuracy in real time? It's not like there's a lot of code to sift through - just spice up `hdr_sdr_gen.cpp` with your custom tonemapping function, or whatever you want!
* [**CPU-Based**] There's no GPU here, which means flexibility. Run it on your Raspberry Pi! Run it on a phone! Run it cheaply in - dare I say it - the cloud!
* [**Fast**] With `ffmpeg` piped in as a decoder, I get `30` FPS at 4K on a `Threadripper 1950X (16-core)`.
* Of course, not all CPUs in the world are Threadrippers, so the `ffmpeg` wrapper sets the max resolution to `1080p`, which works like smooth butter; `80` FPS on my less-powerful-but-still-kinda-beefy server. Comment it out if you don't like it :P
* [**Jellyfin-oriented `ffmpeg` Wrapper**] Designed for Jellyfin, there's a Python wrapper around `ffmpeg` which injects the HDR-->SDR conversion into any complex `ffmpeg` invocation!
* Everyone uses `ffmpeg` - therefore, `fast-hdr` can run everywhere :) !
* [**Realtime, Arbitrarily Complex Global Operations**] With a `3 * 256^3` sized 8-Bit LUT (just 50MB in memory!) precomputed at compile-time by `hdr_sdr_gen.cpp`, you can perform arbitrarily complex global operations, and apply them to an image (each 8-bit YUV triplet has a corresponding triplet) without any interpolation whatsoever. In this case, that's just an inverse PQ operator, followed by a global tonemap, followed by an sRGB correction.
* [**UNIX Philosophy**] `fast-hdr` does processing, and that's it. It's just a brick. You can use any decoder / encoder you want - as long as it gobbles up / spits out YUV444p data over `stdin` / `stdout`! (There's a sketch of the idea just below.) Hell, I don't know, go crazy with OpenHEVC, or whatever shiny new thing ILM cooked up this time. (Though I'd probably use `ffmpeg`, as it's less hard and at least tested.)
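To show just how brick-like it is, here's an unthreaded sketch of the whole stdin --> LUT --> stdout loop (illustrative only - the real `hdr_sdr` overlaps read/process/write in separate threads, and parses the resolution from its arguments):

```cpp
// Unthreaded sketch of the YUV444p stdin -> LUT -> stdout "brick".
// Names and the fixed resolution are illustrative assumptions.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
	const size_t w = 1920, h = 1080;       // in practice parsed from argv
	const size_t plane = w * h;
	std::vector<uint8_t> lut(3UL << 24);   // load cnv.lut8 into this
	std::vector<uint8_t> frame(3 * plane); // planar YUV444p: Y, U, V
	while (fread(frame.data(), 1, frame.size(), stdin) == frame.size()) {
		uint8_t *Y = frame.data(), *U = Y + plane, *V = U + plane;
		for (size_t i = 0; i < plane; i++) {
			size_t idx = (size_t)Y[i] | ((size_t)U[i] << 8) | ((size_t)V[i] << 16);
			Y[i] = lut[idx];               // Y cube
			U[i] = lut[idx + (1UL << 24)]; // U cube
			V[i] = lut[idx + (2UL << 24)]; // V cube
		}
		fwrite(frame.data(), 1, frame.size(), stdout);
	}
	return 0;
}
```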
# How To
0. Make sure you're on Linux, and have `python3` and `gcc` (and can invoke `g++`).
1. Get your hands on a standalone `ffmpeg` and `ffprobe`, and put them in `res`.
2. Run `compile.sh`. It might take a sec - it has to precompute the `cnv.lut8` LUT.
Then, let it play with some footage! For example, here's a complex ffmpeg invocation (run by Jellyfin):
```bash
rm -rf .tmp
FFMPEG="./dist/ffmpeg-custom"
FFMPEG_PY="./dist/ffmpeg"
HDR_SDR="./dist/hdr_sdr"
LUT_PATH="./dist/cnv.lut8"
# Test on a nice shot of a cute little caterpillar.
TIME="00:01:28"
FILE="./Life Untouched 4K Demo.mp4"
HLS_SEG=".tmp/transcodes/hash_of_your_hdr_file_no_need_to_replace%d.ts"
FILE_OUT=".tmp/transcodes/hash_of_your_hdr_file_no_need_to_replace.m3u8"
TEST=true "$FFMPEG_PY" -ss "$TIME" -f mp4 -i file:"$FILE" -map_metadata -1 -map_chapters -1 -threads 0 -map 0:0 -map 0:1 -map -0:s -codec:v:0 libx264 -pix_fmt yuv420p -preset veryfast -crf 23 -maxrate 34541128 -bufsize 69082256 -profile:v high -level 4.1 -x264opts:0 subme=0:me_range=4:rc_lookahead=10:me=dia:no_chroma_me:8x8dct=0:partitions=none -force_key_frames:0 "expr:gte(t,0+n_forced*3)" -g 72 -keyint_min 72 -sc_threshold 0 -vf "scale=trunc(min(max(iw\,ih*dar)\,1920)/2)*2:trunc(ow/dar/2)*2" -start_at_zero -vsync -1 -codec:a:0 aac -strict experimental -ac 2 -ab 384000 -af "volume=2" -copyts -avoid_negative_ts disabled -f hls -max_delay 5000000 -hls_time 3 -individual_header_trailer 0 -hls_segment_type mpegts -start_number 0 -hls_segment_filename "$HLS_SEG" -hls_playlist_type vod -hls_list_size 0 -y "$FILE_OUT"
```
## How To: Jellyfin
To get Jellyfin to convert HDR footage when transcoding for web playback, follow the How To steps and make sure it works locally.
Then:
0. Make sure the python script `ffmpeg`, the actual binaries `ffmpeg-custom` and `ffprobe`, the compiled binary `hdr_sdr`, and the generated LUT `cnv.lut8` are in `dist`.
1. Copy `dist` to somewhere on your server owned by the `jellyfin` user. Probably a good idea to `chmod` it too.
2. In the Jellyfin interface, in Playback -> `ffmpeg` Path, point it at the Python script `ffmpeg`. *It's critical that `ffprobe` is there too, otherwise Jellyfin will be unable to read header info about your files, like audio tracks or subtitle tracks.*
Things to be aware of:
- Seeking is super slow: for some reason the wrapper doesn't understand keyframes, and will try to encode its way to wherever you seek to. So don't seek :) Resuming works fine, however.
- Occasionally, you might have to `killall ffmpeg-custom && killall hdr_sdr`, or they'll keep eating CPU cycles after you've stopped watching. I'm still not sure why.
# Testing / Stability / Modularization / Any Kind of Good Software Development Practices
No
# Contributing
*Is that a `malloc` I see in your C++? In this devout Stroustrup'ian neighborhood? Blasphemy...*
I like you, you insane monkey - if this horrible hack is actually useful to you, then I'm speechless!
I'm happy to help you make it work, help with bugs (make an Issue!), and/or (probably) accept any kind of PR. God knows there's enough wrong with this piece of software that it needs some fixing up...
Development happens at https://git.sofusrose.com/so-rose/fast-hdr .
# TODO
It gets wilder!
* **Seeking in Jellyfin**: One cannot seek from the Jellyfin interface. This may have something to do with the `ffmpeg` wrapper not catching the `q` to quit.
* **Better Profiling**: Measuring performance characteristics of a threaded application like this isn't super easy, but probably worth it.
* **Dithering**: There's a reason nobody precomputes transforms on every possible 8-bit YUV value: Posterization. We can solve this to a quite reasonable degree by dithering while applying the LUT!
* **More Robust `ffmpeg` Wrapper**: It's a little touchy right now. Like, it throws an exception if you give it a wrong `-f`...
* **More Vivid Image**: Personal preference, I like vivid! This is just a matter of tweaking the image processing pipeline.
* **Verify Color Science**: Right now, it's all done a bit by trial and error. It does, however, match VLC's output quite well.
* **10-bit LUT**: Who cares if this needs a 3GB buffer to compute? It solves posterization issues, and allows directly processing 10-bit footage to boot!
* **Gamut Rendering**: Currently, there's no gamut mapping or rendering intent management. It could be nice to have.
* **Better Tonemapping**: The present tonemapping is arbitrarily chosen, even though it does look nice.
* **Variable Input**: YUV444p isn't nirvana. Plus, imagine how fast clever LUT'ing directly on 4K YUV420p data might be.
#!/bin/bash
set -e
rm -rf dist
mkdir -p dist
# Compile to Dist
g++ --std=c++17 -pthread -o dist/hdr_sdr -fopenmp -fopenmp-simd -Ofast -frename-registers -funroll-loops -D_GLIBCXX_PARALLEL -flto src/hdr_sdr.cpp
# Generate LUT8
cd gen
./compile.sh
./hdr_sdr_gen ../dist/cnv.lut8
cd -
# Copy Things to Dist
#~ cp src/hdr_sdr.cpp dist/
cp src/ffmpeg dist/
cp res/* dist/
#!/bin/bash
set -e
g++ --std=c++17 -o hdr_sdr_gen -fopenmp -O3 -march=native hdr_sdr_gen.cpp
/*
* This file is part of the fast-hdr project (https://git.sofusrose.com/so-rose/fast-hdr).
* Copyright (c) 2020 Sofus Rose.
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, version 3.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
// Usage: <YUV444p PRODUCER> | ./hdr_sdr [WIDTH] [HEIGHT] | <YUV444p ENCODER>
// TODO:
// - Desaturate Highlights
// - Dithering
// Libraries
#include <iostream>
#include <fstream>
#include <string>
#include <cmath>
#include <algorithm>
#include <thread>
#include <unistd.h>
#include <sys/stat.h>
// User Defines
#define IMG_BITS 8
// User Types
typedef uint8_t img_uint; // Must hold one image sample of >= IMG_BITS bits
#define IMG_INT_MAX ((1 << IMG_BITS) - 1)
#define IMG_INT_MAX_D ( (double) IMG_INT_MAX )
// Resolution and Size of LUTD (Dimensioned LUT)
#define LUTD_BITS IMG_BITS
#define LUTD_CHNLS 3
#define LUTD_RES (1 << LUTD_BITS) // 2**LUTD_BITS
#define LUTD_SIZE (LUTD_RES * LUTD_RES * LUTD_RES * LUTD_CHNLS)
// Each 8-Bit YUV Triplet => Corresponding YUV Triplet.
// 4D LUT, Three "Cubes": Y Cube, U Cube, V Cube.
// 0. To Advance One Y, U, V, or C(hannel) Value, Advance by a Stride
// 1. Use Old YUV to Find X,Y,Z Index On Cube(s).
// 2. Compute New YUV by Indexing Each Cube Identically
#define LUTD_Y_STRIDE(y) (y << (0 * LUTD_BITS)) // Y: Multiply by (2**LUTD_BITS)**0
#define LUTD_U_STRIDE(u) (u << (1 * LUTD_BITS)) // U: Multiply by (2**LUTD_BITS)**1
#define LUTD_V_STRIDE(v) (v << (2 * LUTD_BITS)) // V: Multiply by (2**LUTD_BITS)**2
#define LUTD_C_STRIDE(c) (c << (3 * LUTD_BITS)) // C: Multiply by (2**LUTD_BITS)**3
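// Worked example (illustrative): the triplet (y=3, u=1, v=2) has base index
//   3*(256^0) + 1*(256^1) + 2*(256^2) = 3 + 256 + 131072 = 131331;
// its output Y, U, V then sit at base + 0*(256^3), base + 1*(256^3), base + 2*(256^3).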
// Namespacing
using namespace std;
//###########
// - LUT Methods
//###########
void write_lutd(img_uint *lutd, string path_lutd) {
// The array must be sized as LUTD_SIZE.
ofstream file_lutd(path_lutd, ofstream::binary);
if (file_lutd.is_open()) {
file_lutd.write(reinterpret_cast<char*>(lutd), LUTD_SIZE * sizeof(img_uint));
} else {
cerr << "Error: could not open LUT file for writing: " << path_lutd << endl;
}
}
//###########
// - Color Model Conversions
//###########
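// Note: yuv_rgb below uses full-range BT.601-style decode coefficients (no -16
// offset on Y), while rgb_yuv uses limited-range ("studio swing") BT.601 encode
// coefficients (+16 offset on Y). The asymmetry is presumably part of the
// trial-and-error tuning mentioned in the README ("Verify Color Science").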
void yuv_rgb(
img_uint *y,
img_uint *u,
img_uint *v,
double *buf_rgb
) {
double c = (double) *y;
double d = (double) *u - 128.0;
double e = (double) *v - 128.0;
buf_rgb[0] = clamp( c + 1.370705 * e, 0.0, IMG_INT_MAX_D ) / IMG_INT_MAX_D;
buf_rgb[1] = clamp( c - 0.337633 * d - 0.698001 * e, 0.0, IMG_INT_MAX_D ) / IMG_INT_MAX_D;
buf_rgb[2] = clamp( c + 1.732446 * d , 0.0, IMG_INT_MAX_D ) / IMG_INT_MAX_D;
}
void rgb_yuv(
double *buf_rgb,
img_uint *y,
img_uint *u,
img_uint *v
) {
double r = buf_rgb[0] * 255.0;
double g = buf_rgb[1] * 255.0;
double b = buf_rgb[2] * 255.0;
*y = (img_uint) clamp( 0.257 * r + 0.504 * g + 0.098 * b + 16.0 , 0.0, IMG_INT_MAX_D);
*u = (img_uint) clamp(-0.148 * r - 0.291 * g + 0.439 * b + 128.0, 0.0, IMG_INT_MAX_D);
*v = (img_uint) clamp( 0.439 * r - 0.368 * g - 0.071 * b + 128.0, 0.0, IMG_INT_MAX_D);
}
void rgb_hsl(
double *buf_rgb,
double *buf_hsl
) {
// RGB to HSL (lightness = (max + min) / 2)
// Calculate Chroma
double chnl_max = max(buf_rgb[0], max(buf_rgb[1], buf_rgb[2]));
double chnl_min = min(buf_rgb[0], min(buf_rgb[1], buf_rgb[2]));
double chroma = chnl_max - chnl_min;
// Calculate Lightness
buf_hsl[2] = (chnl_max + chnl_min) / 2;
// Calculate Saturation
if (buf_hsl[2] != 0.0 && buf_hsl[2] != 1.0) {
buf_hsl[1] = chroma / ( 1 - abs(2 * buf_hsl[2] - 1) );
} else {
buf_hsl[1] = 0.0;
}
// Calculate Hue (Degrees)
if (chroma == 0) {
buf_hsl[0] = 0.0;
} else if (chnl_max == buf_rgb[0]) {
buf_hsl[0] = (buf_rgb[1] - buf_rgb[2]) / chroma;
} else if (chnl_max == buf_rgb[1]) {
buf_hsl[0] = 2.0 + (buf_rgb[2] - buf_rgb[0]) / chroma;
} else if (chnl_max == buf_rgb[2]) {
buf_hsl[0] = 4.0 + (buf_rgb[0] - buf_rgb[1]) / chroma;
}
buf_hsl[0] *= 60.0;
}
void hsl_rgb(
double *buf_hsl,
double *buf_rgb
) {
// HSL to RGB (lightness = (max + min) / 2)
double buf_rgb_1[3] {0.0, 0.0, 0.0};
double chroma = ( 1 - abs(2 * buf_hsl[2] - 1) ) * buf_hsl[1];
double H_reduc = buf_hsl[0] / 60.0;
double X = chroma * ( 1 - abs(fmod(H_reduc, 2.0) - 1) );
if (H_reduc == 0.0 || ceil(H_reduc) == 1) { // hue 0 belongs to the first sextant
buf_rgb_1[0] = chroma;
buf_rgb_1[1] = X;
buf_rgb_1[2] = 0.0;
} else if (ceil(H_reduc) == 2) {
buf_rgb_1[0] = X;
buf_rgb_1[1] = chroma;
buf_rgb_1[2] = 0.0;
} else if (ceil(H_reduc) == 3) {
buf_rgb_1[0] = 0.0;
buf_rgb_1[1] = chroma;
buf_rgb_1[2] = X;
} else if (ceil(H_reduc) == 4) {
buf_rgb_1[0] = 0.0;
buf_rgb_1[1] = X;
buf_rgb_1[2] = chroma;
} else if (ceil(H_reduc) == 5) {
buf_rgb_1[0] = X;
buf_rgb_1[1] = 0.0;
buf_rgb_1[2] = chroma;
} else if (ceil(H_reduc) == 6) {
buf_rgb_1[0] = chroma;
buf_rgb_1[1] = 0.0;
buf_rgb_1[2] = X;
} else {
buf_rgb_1[0] = 0.0;
buf_rgb_1[1] = 0.0;
buf_rgb_1[2] = 0.0;
}
double m = buf_hsl[2] - (chroma / 2);
buf_rgb[0] = buf_rgb_1[0] + m;
buf_rgb[1] = buf_rgb_1[1] + m;
buf_rgb[2] = buf_rgb_1[2] + m;
}
void s_sat(double *buf_rgb, double fac_sat) {
double buf_hsl[3] {0.0, 0.0, 0.0};
rgb_hsl(buf_rgb, buf_hsl);
buf_hsl[1] *= fac_sat;
hsl_rgb(buf_hsl, buf_rgb);
//~ buf_rgb[0] = clamp(buf_rgb[0], 0.0, 1.0);
//~ buf_rgb[1] = clamp(buf_rgb[1], 0.0, 1.0);
//~ buf_rgb[2] = clamp(buf_rgb[2], 0.0, 1.0);
}
//###########
// - HDR Transfer Curves
//###########
#define c1 0.8359375
#define c2 18.8515625
#define c3 18.6875
#define m1 0.1593017578125
#define m2 78.84375
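// These are the SMPTE ST 2084 (PQ) constants: c1 = 3424/4096, c2 = 2413/128,
// c3 = 2392/128, m1 = 2610/16384, m2 = 2523/32. gam_pq_lin is the PQ EOTF
// (encoded signal -> normalized linear light); gam_lin_pq is its inverse.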
double gam_pq_lin(double v) {
double p = pow(v, 1/m2);
return pow( max(p - c1, 0.0) / (c2 - c3 * p), 1/m1 ); // max() guards against NaN on very dark values
}
double gam_lin_pq(double v) {
return pow( (c1 + c2 * pow(v, m1)) / (1 + c3 * pow(v, m1)), m2 );
}
//###########
// - Colorspace Conversions
//###########
void bt2020_bt709(double *buf_rgb) {
// bt2020_bt709
double cmat[9] {
1.6605, -0.5876, -0.0728,
-0.1246, 1.1329, -0.0083,
-0.0182, -0.1006, 1.1187
};
double r = buf_rgb[0];
double g = buf_rgb[1];
double b = buf_rgb[2];
buf_rgb[0] = r * cmat[0] + g * cmat[1] + b * cmat[2]; //Red
buf_rgb[1] = r * cmat[3] + g * cmat[4] + b * cmat[5]; //Green
buf_rgb[2] = r * cmat[6] + g * cmat[7] + b * cmat[8]; //Blue
}
double normal_pdf(double x, double m, double s)
{
static const double inv_sqrt_2pi = 0.3989422804014327;
double a = (x - m) / s;
return inv_sqrt_2pi / s * exp(-0.5 * a * a);
}
void render_gamut(double *buf_rgb) {
// Calculate Lightness
double chnl_max = max(buf_rgb[0], max(buf_rgb[1], buf_rgb[2]));
double chnl_min = min(buf_rgb[0], min(buf_rgb[1], buf_rgb[2]));
// Zebras
if (chnl_min <= 0.0) {
buf_rgb[0] = 0.0;
buf_rgb[1] = 0.0;
buf_rgb[2] = 1.0;
}
if (chnl_max >= 1.0) {
buf_rgb[0] = 1.0;
buf_rgb[1] = 0.0;
buf_rgb[2] = 0.0;
}
//~ double luma = 0.2126 * buf_rgb[0] + 0.7152 * buf_rgb[1] + 0.0722 * buf_rgb[2];
//~ double L = (chnl_max + chnl_min) / 2;
//~ // Scale Towards Gray Based on Luminance
//~ double scl = normal_pdf(clamp(chroma, 0.0, 1.0), 1.0, 0.01) / 3.98942;
//~ buf_rgb[0] = (1 - scl) * buf_rgb[0] + scl * luma;
//~ buf_rgb[1] = (1 - scl) * buf_rgb[1] + scl * luma;
//~ buf_rgb[2] = (1 - scl) * buf_rgb[2] + scl * luma;
//~ buf_rgb[0] = (1 - chroma) * buf_rgb[0] + chroma * luma;
//~ buf_rgb[1] = (1 - chroma) * buf_rgb[1] + chroma * luma;
//~ buf_rgb[2] = (1 - chroma) * buf_rgb[2] + chroma * luma;
//~ buf_rgb[0] = chroma;
//~ buf_rgb[1] = chroma;
//~ buf_rgb[2] = chroma;
}
//###########
// - Tonemapping
//###########
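// This looks like John Hable's "Uncharted 2" filmic operator, applied after a
// 150x exposure scale. W (the linear white point) is currently unused - the
// canonical operator also divides by tm(W) so that white maps to 1.0. The
// commented-out alternative is a plain Reinhard curve, v / (v + 1).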
double tm_cool(double v) {
v *= 150.0;
double A = 0.15;
double B = 0.50;
double C = 0.10;
double D = 0.20;
double E = 0.02;
double F = 0.30;
double W = 11.2;
return ((v*(A*v+C*B)+D*E)/(v*(A*v+B)+D*F))-E/F;
//~ return v / (v + 1);
}
//###########
// - SDR Transfer Curves
//###########
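// gam_lin_srgb / gam_srgb_lin are the standard sRGB (IEC 61966-2-1) encode /
// decode pair, with the usual linear-segment thresholds 0.0031308 and 0.04045.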
double gam_lin_srgb(double v) {
return v > 0.0031308 ? (( 1.055 * pow(v, (1.0f / 2.4)) ) - 0.055) : v * 12.92;
}
double gam_srgb_lin(double v) {
return v > 0.04045 ? pow(( (v + 0.055) / 1.055 ), 2.4) : v / 12.92;
}
#define alpha 1.099
#define beta 0.018
#define zeta 0.081
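// BT.709 OETF constants: alpha = 1.099, beta = 0.018 (linear-domain threshold);
// zeta = 0.081 = 4.5 * beta is the corresponding encoded-domain threshold.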
double gam_lin_709(double v) {
if (v <= 0) {
return 0.0;
} else if (v < beta) {
return 4.5 * v;
} else if (v <= 1) {
return alpha * pow(v, 0.45) - (alpha - 1);
} else {
return 1.0;
}
}
double gam_709_lin(double v) {
if (v <= 0) {
return 0.0;
} else if (v < zeta) {
return v * (1 / 4.5);
} else if (v <= 1) {
return pow( (v + (alpha - 1)) / (alpha), (1 / 0.45) );
} else {
return 1.0;
}
}
//###########
// - Processing Methods
//###########
void proc(img_uint *y, img_uint *u, img_uint *v) {
// Create RGB Buffer
double buf_rgb[3] {0.0, 0.0, 0.0};
// YUV to RGB
yuv_rgb(y, u, v, buf_rgb);
// PQ --> Lin
for (int chnl = 0; chnl < 3; chnl++) {
buf_rgb[chnl] = gam_pq_lin( buf_rgb[chnl] );
}
// Global Tonemapping
for (int chnl = 0; chnl < 3; chnl++) {
buf_rgb[chnl] = tm_cool( buf_rgb[chnl] );
}
// Lin --> sRGB
for (int chnl = 0; chnl < 3; chnl++) {
buf_rgb[chnl] = gam_lin_srgb( buf_rgb[chnl] );
}
// RGB to YUV
rgb_yuv(buf_rgb, y, u, v);
}
void gen_lutd(img_uint *lutd) {
// Process the Payload Using Precomputed YUV Destinations
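// Note: with index = y + (u << 8) + (v << 16), the innermost y loop writes
// contiguous memory, and OpenMP parallelizes over the outermost v loop.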
#pragma omp parallel for
for (size_t v = 0; v <= IMG_INT_MAX; v++) {
for (size_t u = 0; u <= IMG_INT_MAX; u++) {
for (size_t y = 0; y <= IMG_INT_MAX; y++) {
size_t ind_lutd = (
LUTD_Y_STRIDE(y) +
LUTD_U_STRIDE(u) +
LUTD_V_STRIDE(v)
);
// Get LUTD Pointers
uint8_t *y_lutd = &lutd[ind_lutd + LUTD_C_STRIDE(0)];
uint8_t *u_lutd = &lutd[ind_lutd + LUTD_C_STRIDE(1)];
uint8_t *v_lutd = &lutd[ind_lutd + LUTD_C_STRIDE(2)];
// Set LUTD Values to Default
*y_lutd = y;
*u_lutd = u;
*v_lutd = v;
// Process Default LUTD Values
proc(y_lutd, u_lutd, v_lutd);
}
}
}
}
//###########
// - Application
//###########