Abstract
Implicit Neural Representations (INRs) have demonstrated significant potential in video compression by representing videos as neural networks. However, as the number of frames increases, memory consumption for training and inference grows substantially, posing challenges in resource-constrained scenarios. Inspired by the success of traditional video compression frameworks, which process video frame by frame and can efficiently compress long videos, we adopt this modeling strategy for INRs to reduce memory consumption, while aiming to unify the two frameworks from the perspective of timeline-based autoregressive modeling. In this work, we present a novel understanding of INR models from an autoregressive (AR) perspective and introduce a Unified AutoRegressive Framework for memory-efficient Neural Video Compression (UAR-NVC). UAR-NVC integrates timeline-based and INR-based neural video compression under a unified autoregressive paradigm. It partitions a video into several clips and processes each clip with a separate INR model instance, leveraging the advantages of both compression frameworks while adapting seamlessly to either form. To further reduce temporal redundancy between clips, we treat the corresponding model parameters as proxies for these clips and design two modules to optimize their initialization, training, and compression. In particular, the Residual Quantization and Entropy Constraint (RQEC) module dynamically balances the reconstruction quality of the current clip against the newly introduced bitrate cost, using the previously optimized parameters as conditioning. In addition, the Interpolation-based Initialization (II) module flexibly adjusts the degree of reference used when initializing neighboring video clips, based on their correlation. UAR-NVC supports adjustable latencies by varying the clip length. Extensive experimental results demonstrate that UAR-NVC, with its flexible clip settings, adapts well to resource-constrained environments and significantly improves performance over different baseline models.

Figure 1: Different AutoRegressive (AR) methods: (a) Pixel-level AR. (b) Frame-level AR. (c) Implicit domain (INR-based) AR. (d) Our UAR-NVC, which integrates INR-based AR and timeline-based AR by using INR models to represent each video clip and applying AR between INR models to capture inter-clip relationships.
The contributions of our work are summarized as follows:
- Unified Framework: We propose UAR-NVC, a novel framework that unifies timeline-based and INR-based NVC frameworks. Most existing INR models can be seamlessly integrated into our framework, demonstrating its flexibility and generality.
- Enhancements in Autoregressive Modeling: We design two modules to enhance autoregressive modeling for videos under the proposed UAR-NVC framework. The RQEC module dynamically balances the newly introduced information and the distortion of the current clip, based on the INR model parameters of the previous clip. The II module adaptively controls the reference strength for INR model initialization, allowing it to handle diverse video data effectively.
- Comprehensive Experiments: We conduct extensive experiments to verify the effectiveness of the proposed UAR-NVC framework. The results demonstrate superior rate-distortion performance and highlight its advantages in practical applications.
Method

Figure 2: The proposed UAR-NVC, a practical INR framework for video compression. Video frames are grouped into several GOPs (video clips), and one model is trained per GOP. To balance correlation capture and random access, GOPs are further grouped into GOMs, and temporal dependency exists only between GOPs within the same GOM. The right part shows the training and compression pipeline of one GOM in UAR-NVC.
Our UAR-NVC framework unifies timeline-based and INR-based neural video compression from an autoregressive perspective. The video is partitioned into clips (GOPs), each modeled by a separate INR model instance. To leverage correlation between clips and reduce redundancy, we introduce two key modules:
- Residual Quantization and Entropy Constraint (RQEC): This module uses the parameters of the previously optimized model as conditioning to dynamically balance the reconstruction quality of the current clip and the bitrate cost of the newly introduced information. It stores the difference (residual) between the initialized and optimized parameters for efficient compression (see the first sketch after this list).
- Interpolation-based Initialization (II): This module adaptively controls the reference strength for INR model initialization based on the correlation between neighboring clips. It interpolates between the previous model's initial and optimized parameters to provide a better starting point for training (sketched below, together with the GOM schedule).
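The RQEC objective can be made concrete with a short sketch. The following is a minimal PyTorch-style illustration under our own assumptions, not the authors' released code: parameters are flattened into a single tensor, rounding uses a straight-through estimator, and a zero-mean Laplacian prior stands in for the entropy model (the paper conditions the rate estimate on the previous clip's parameters, which may differ in detail).

```python
# Minimal sketch of an RQEC-style rate-distortion objective for one P-model.
import math
import torch

def rqec_loss(theta_opt, theta_init, distortion, step=1e-3, lam=0.01):
    """theta_opt: trainable parameters of the current clip's INR (1-D tensor).
    theta_init: frozen initialization derived from the previous clip's model.
    distortion: reconstruction loss of the current clip (e.g., MSE).
    step, lam: quantization step and rate weight (assumed values)."""
    residual = theta_opt - theta_init            # only this residual is stored
    q = residual / step
    q_hat = q + (torch.round(q) - q).detach()    # straight-through quantization
    b = q_hat.abs().mean().clamp(min=1e-6)       # Laplacian scale estimate
    nll_nats = (q_hat.abs() / b + torch.log(2.0 * b)).sum()
    rate_bits = nll_nats / math.log(2.0)         # rate proxy in bits
    return distortion + lam * rate_bits
```

At encoding time, only the rounded residual (plus the entropy-model parameters) must be entropy-coded; the decoder regenerates the initialization from the previous clip's model and adds the residual back.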
The framework also incorporates the concept of Group of Models (GOM), analogous to Group of Pictures (GOP) in traditional codecs, to support random access.
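To illustrate how II interacts with the GOM schedule, here is a minimal sketch assuming all per-clip models share one architecture and parameters are kept as state dicts; `train_fn` (optimizes an INR on one clip) and `corr_fn` (maps inter-clip correlation to a reference strength alpha in [0, 1]) are hypothetical placeholders rather than the paper's exact procedures.

```python
# Sketch of Interpolation-based Initialization within one GOM. The first
# clip gets an I-model trained from scratch (random access point); each
# later P-model starts from a blend of its predecessor's initial and
# optimized parameters.
import copy
import torch

@torch.no_grad()
def ii_init(prev_init, prev_opt, alpha):
    """Blend the previous model's initial and optimized parameter dicts."""
    return {k: (1 - alpha) * prev_init[k] + alpha * prev_opt[k]
            for k in prev_opt}

def train_gom(base_state, clips, train_fn, corr_fn):
    prev_init = prev_opt = None
    optimized = []
    for i, clip in enumerate(clips):
        if i == 0:
            init = copy.deepcopy(base_state)            # I-model: no reference
        else:
            alpha = corr_fn(clips[i - 1], clip)         # reference strength
            init = ii_init(prev_init, prev_opt, alpha)  # P-model: II init
        opt = train_fn(clip, init)  # optimize the INR on this clip
        prev_init, prev_opt = init, opt
        optimized.append(opt)
    return optimized
```

With alpha = 0 a P-model falls back to the previous model's fresh initialization (weak reference); with alpha = 1 it starts exactly from the previously optimized parameters (strong reference).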

Figure 3: (a) GOP-level video partition with I-frames and P-frames. (b) GOM-level video partition with I-models (#1/4) and P-models (#2/3/5/6).

Figure 4: Residual quantization-based entropy estimation for the P-model.
Results
Experimental results on the UVG and VVC Class B datasets demonstrate the effectiveness of UAR-NVC. When integrated with base INR models like NeRV, HNeRV, and HiNeRV, our framework achieves significant improvements in rate-distortion (RD) performance compared to their baseline versions. Key findings include:
- Substantial BD-rate savings (up to 64.97% for HiNeRV on UVG) across different base models and GOP sizes (the BD-rate metric itself is sketched after this list).
- Superior performance compared to traditional codecs like H.265/x265 (veryslow), especially at larger GOP sizes.
- Effective balance between memory efficiency (due to clip-based processing) and compression performance, enabled by the RQEC and II modules.
- Flexibility to adapt to different resource constraints by varying the clip length (p).
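For reference, the BD-rate numbers in Table 1 follow the standard Bjøntegaard delta-rate metric. The sketch below is a common cubic-fit implementation of that metric (not code from the paper); negative values mean bitrate savings over the anchor at equal quality.

```python
# Standard Bjontegaard delta-rate (BD-rate) from (bitrate, PSNR) points
# of an anchor codec and a test codec.
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate change (%) of the test codec relative to the anchor."""
    pa = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)  # log-rate vs. PSNR
    pt = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping quality range
    hi = min(max(psnr_anchor), max(psnr_test))
    ia, it = np.polyint(pa), np.polyint(pt)     # integrate both fits
    avg = (np.polyval(it, hi) - np.polyval(it, lo)
           - np.polyval(ia, hi) + np.polyval(ia, lo)) / (hi - lo)
    return (np.exp(avg) - 1.0) * 100.0
```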

Table 1: BD-rate (%) for PSNR/MS-SSIM on different datasets and GOP sizes p.

Table 2: Rate-distortion performance on different sequences of UVG.

Figure 8: Subjective quality comparison between HiNeRV (base) and HiNeRV (ours).

Figure 9: Rate-distortion performance on the VVC Class B dataset with different GOP sizes p.