Utilising Low Complexity CNNs to Lift Non-Local Redundancies in Video Coding

Jan P. Klopp, Liang-Gee Chen, Shao-Yi Chien

Graduate Institute of Electrical Engineering, National Taiwan University

Upsampling version: Online-Trained Upsampler

Main Results

Online (at encoding time) trained denoiser
Design of a highly efficient CNN
Parameters can be signalled to decoder
Compared to H.265: coding gain of up to 6.8% on chroma and up to 14.4% on luma
Complexity 216 to 486 MAC/Pixel, a thousand times less than pretrained denoisers
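
To make the MAC/pixel figure concrete, the following minimal sketch counts multiply-accumulate operations per output pixel for a stack of convolutional layers. It assumes square kernels, stride 1, the same resolution at every layer and no bias terms. The layer list is purely hypothetical and is not the architecture from the paper.

```python
def macs_per_pixel(layers):
    """Sum k * k * c_in * c_out over stride-1 conv layers given as (k, c_in, c_out)."""
    return sum(k * k * c_in * c_out for k, c_in, c_out in layers)


# Hypothetical three-layer network, NOT the architecture from the paper.
example_layers = [(3, 1, 6), (3, 6, 6), (3, 6, 1)]
total = macs_per_pixel(example_layers)
print(total)  # 54 + 324 + 54 = 432
```

Under these assumptions the count depends only on kernel size and channel widths, not on the frame resolution, which is why complexity can be quoted per pixel.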

Methods

Train CNN on the fly to predict residuals of a group of pictures (GoP)
Quantise the CNN's parameters
Test the CNN's performance
Compress parameters and add to bit stream if test is positive
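
The four steps above can be sketched as a small decision loop. This is an illustration under stated assumptions, not the paper's implementation: a least-squares affine model stands in for the CNN, the quantisation step `STEP` is chosen arbitrarily, and the test simply compares mean squared error with and without the correction, ignoring the bit cost of the parameters. All function names are hypothetical.

```python
import numpy as np

STEP = 1 / 64  # assumed quantisation step, chosen for illustration only


def quantise(params, step=STEP):
    """Uniform quantisation: round each parameter to the nearest multiple of step."""
    return np.round(params / step) * step


def mse(a, b):
    return float(np.mean((a - b) ** 2))


def fit_residual_model(decoded, original):
    """Stand-in for CNN training: least-squares fit of residual ~ a * decoded + b."""
    x = decoded.ravel()
    design = np.stack([x, np.ones_like(x)], axis=1)
    target = (original - decoded).ravel()
    params, *_ = np.linalg.lstsq(design, target, rcond=None)
    return params


def apply_model(decoded, params):
    """Add the predicted residual to the decoded frames."""
    a, b = params
    return decoded + a * decoded + b


def encode_gop(original, decoded):
    """Train, quantise, test; return the parameters only if the test is positive."""
    params = quantise(fit_residual_model(decoded, original))
    corrected = apply_model(decoded, params)
    positive = mse(original, corrected) < mse(original, decoded)
    return (params if positive else None), positive


# Toy GoP: 4 frames of 16x16 samples with a systematic coding distortion.
rng = np.random.default_rng(0)
original = rng.random((4, 16, 16))
decoded = 0.9 * original + 0.05
params, signalled = encode_gop(original, decoded)
print(signalled, params)
```

The test runs on the quantised parameters rather than the trained ones, so the encoder evaluates exactly what the decoder will reconstruct.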

The dashed arrows and boxes indicate data transfers and operations that are only carried out in streaming scenarios, where the decoder can access data from previous GoPs. In that case, the following steps apply:

Test previously signalled CNN on the new GoP
Fine-tune CNN on new GoP
Quantise fine-tuned CNN's parameters
Test fine-tuned and quantised CNN
If gains are higher than from previous CNN, compress and add to bit stream
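
The streaming decision above can be sketched as a comparison of two gains on the new GoP. This is a minimal sketch with assumptions: gain is measured here as the reduction in mean squared error, and the outputs of the previous and the fine-tuned CNN are passed in as arrays. The function names and return labels are hypothetical, and the case where neither CNN improves on the plain decoded frames is not handled.

```python
import numpy as np


def mse(a, b):
    return float(np.mean((a - b) ** 2))


def gain(original, decoded, corrected):
    """Distortion reduction from applying a CNN; positive means an improvement."""
    return mse(original, decoded) - mse(original, corrected)


def select_cnn(original, decoded, out_previous, out_finetuned):
    """Signal the fine-tuned CNN only if it beats the previously signalled one."""
    gain_previous = gain(original, decoded, out_previous)
    gain_finetuned = gain(original, decoded, out_finetuned)
    if gain_finetuned > gain_previous:
        return "signal_finetuned", gain_finetuned
    return "reuse_previous", gain_previous


# Toy data: constant frames with different amounts of remaining error.
original = np.full((2, 4, 4), 0.5)
decoded = original + 0.10        # codec output without any CNN
out_previous = original + 0.05   # output of the previously signalled CNN
out_finetuned = original + 0.02  # output of the fine-tuned, quantised CNN
decision, best_gain = select_cnn(original, decoded, out_previous, out_finetuned)
print(decision)
```

When the previous CNN already performs at least as well, nothing new is written to the bit stream and the decoder keeps using the parameters it has.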

Material

Paper (IEEE Xplore)
Paper (preprint)

Additional Results

We supply a few preliminary results for applying our method to two codecs used in practice: x265 and AV1. For a full comparison, please refer to the results in the paper (link above), as our method has not been adjusted for x265 or AV1. Results are given as BD-rate coding gain over the respective baseline codec; negative values indicate bit rate savings. YUV PSNR values were averaged with weights 6, 1 and 1 for Y, U and V, as is common practice in video coding.
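
For readers unfamiliar with these metrics, here is a minimal sketch of the 6:1:1 weighted PSNR and of a Bjøntegaard delta rate computation. It uses the common formulation with a cubic polynomial fit of log rate over PSNR. The exact tooling used for the tables below is not stated, so treat this as an assumption, and note that the rate and PSNR values in the example are made up.

```python
import numpy as np


def weighted_psnr(psnr_y, psnr_u, psnr_v):
    """6:1:1 weighted average of per-channel PSNR values in dB."""
    return (6 * psnr_y + psnr_u + psnr_v) / 8


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta rate in percent; negative means the test codec saves bits."""
    # Fit log(rate) as a cubic polynomial in PSNR for each codec.
    fit_anchor = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    fit_test = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Integrate only over the PSNR range covered by both curves.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))

    def mean_log_rate(coeffs):
        antiderivative = np.polyint(coeffs)
        area = np.polyval(antiderivative, hi) - np.polyval(antiderivative, lo)
        return area / (hi - lo)

    diff = mean_log_rate(fit_test) - mean_log_rate(fit_anchor)
    return (np.exp(diff) - 1) * 100


# Made-up rate points (kbit/s) and PSNR values (dB), for illustration only.
psnr = np.array([32.0, 35.0, 38.0, 41.0])
rate_anchor = np.array([1000.0, 2000.0, 4000.0, 8000.0])
rate_test = 0.9 * rate_anchor  # test codec needs 10% fewer bits at equal PSNR

print(weighted_psnr(40.0, 44.0, 44.0))  # 41.0
print(bd_rate(rate_anchor, psnr, rate_test, psnr))  # close to -10
```

In this toy case the test codec reaches every PSNR value with 10% fewer bits, so the BD-rate comes out at about -10%, matching the sign convention of the tables.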

Comparison with x265 as baseline

		GoP: 32 Frames		GoP: 128 Frames
Dataset	Resolution	Complexity (MAC/pixel)	Coding Gain	Complexity (MAC/pixel)	Coding Gain
HM A	2560×1600	148.5	-6.5%	148.5	-6.9%
HM B	1920×1080	486	-5.5%	486	-5.7%
HM C	832×480	486	-5.1%	486	-5.4%
VTM A	3840×2160	148.5	-4.9%	148.5	-5.0%
VTM B	1920×1080	148.5	-4.5%	486	-5.4%

Comparison with AV1 as baseline

		GoP: 32 Frames		GoP: 128 Frames
Dataset	Resolution	Complexity (MAC/pixel)	Coding Gain	Complexity (MAC/pixel)	Coding Gain
HM A	2560×1600	148.5	-4.8%	148.5	-4.7%
HM B	1920×1080	486	-4.3%	486	-4.2%
HM C	832×480	486	-3.8%	486	-4.2%
VTM A	3840×2160	148.5	-3.0%	148.5	-3.0%
VTM B	1920×1080	486	-4.0%	486	-3.9%

Sequences: HM A: NebutaFestival, PeopleOnStreet, SteamLocomotiveTrain, Traffic; HM B: BasketballDrive, BQTerrace, Cactus, Kimono1, ParkScene; HM C: BasketballDrill, BQMall, PartyScene, RaceHorses; VTM A: Campfire, CatRobot1, DaylightRoad2, FoodMarket4, ParkRunning3, Tango2; VTM B: BasketballDrive, BQTerrace, Cactus, MarketPlace, RitualDance

Cite

If you find our work helpful for your research, please consider citing it:

@ARTICLE{9088301,
            author={J. P. {Klopp} and L. {Chen} and S. {Chien}},
            journal={IEEE Transactions on Image Processing},
            title={Utilising Low Complexity CNNs to Lift Non-Local Redundancies in Video Coding},
            year={2020},
            volume={},
            number={},
            pages={1-1},
            doi={10.1109/TIP.2020.2991525},
            ISSN={1941-0042},
            month={},}