Research on Video Super-Resolution Technology Based on Diffusion Models
Name: Wang Chengzhang Student ID: 231040395
1. Abstract
With the rapid development of digital video technology, expectations for video quality keep rising. Video super-resolution aims to improve the visual clarity of a video by increasing the spatial resolution of its frames, which is of great significance for video surveillance, video conferencing, film and television post-production, and other fields. Video Super-Resolution (VSR) refers to the process of reconstructing a high-resolution video sequence from a low-resolution one. In VSR research, the diffusion model can not only sharpen individual video frames but also generate clearer and more temporally coherent frame sequences. With the vigorous development of artificial intelligence and video technology in recent years, this line of research has attracted considerable attention.
Keywords: artificial intelligence; video super-resolution; diffusion model
2. Background
Video super-resolution (VSR) technology aims to reconstruct high-resolution video from low-resolution video, and it has wide applications in fields such as medical imaging, video surveillance, and video enhancement. With the popularity of high-definition and ultra-high-definition display devices, VSR has received growing attention. However, VSR faces challenges such as how to effectively use inter-frame information and how to handle motion and scene changes in video.
As a class of generative models, diffusion models have achieved remarkable success in image generation, image super-resolution, and image editing since their emergence, surpassing a series of earlier architectures for image generation, including convolutional neural networks (CNNs) and generative adversarial networks (GANs), and showing great potential for image tasks. Despite this success on images, video super-resolution with diffusion models remains challenging, for the following reasons. 1. Compared with image tasks, video input carries one extra dimension: an image batch is typically a four-dimensional tensor of shape [B, C, H, W], whereas a video batch is a five-dimensional tensor of shape [B, C, T, H, W], as illustrated in the sketch below. 2. When processing video, we must consider not only the spatial information within each frame (the positional information from its width and height) but also the temporal information across frames (from the preceding and following frames). 3. Training and validation are expensive: existing datasets typically consist of many videos, and their sheer volume demands long training times and high-performance hardware.
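To make the dimensionality difference in point 1 concrete, the following PyTorch sketch (all sizes illustrative) contrasts the two tensor layouts and the two common ways of processing the extra temporal axis:

```python
import torch

image_batch = torch.randn(2, 3, 64, 64)       # [B, C, H, W]
video_batch = torch.randn(2, 3, 8, 64, 64)    # [B, C, T, H, W]

# A 2D convolution consumes the image layout directly ...
conv2d = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)
feat_2d = conv2d(image_batch)                 # [2, 16, 64, 64]

# ... whereas video needs either a 3D convolution over (T, H, W) ...
conv3d = torch.nn.Conv3d(3, 16, kernel_size=3, padding=1)
feat_3d = conv3d(video_batch)                 # [2, 16, 8, 64, 64]

# ... or folding time into the batch axis to reuse spatial-only 2D layers.
b, c, t, h, w = video_batch.shape
frames = video_batch.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
feat_per_frame = conv2d(frames)               # [2*8, 16, 64, 64]
```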
At present, one line of research on diffusion models adapts existing image super-resolution diffusion models by modifying their network layers so that they fit the video super-resolution task, while another line directly extends the original model into a 3D model that learns the video as a whole. Both ideas have produced breakthroughs in tasks such as video generation and video super-resolution.
In short, as the technology continues to advance, video super-resolution is expected to be applied more widely in the future, bringing significant improvements in video quality and user experience.
3. Application prospects
Thanks to the rapid development of deep learning, and in particular driven by generative adversarial networks (GANs) and convolutional neural networks (CNNs), video super-resolution has achieved significant improvements in image quality. The technology has found wide application across industries such as remote monitoring, medical imaging, and high-quality live video, and its development is further driven by the growing demand for high-resolution video. Academic interest in video super-resolution continues to grow: researchers are exploring new algorithms and models, including new loss functions and network architectures, to improve the performance and efficiency of super-resolution pipelines. Real-time video super-resolution is critical for online video and live-streaming services, and researchers are working to improve the real-time performance of algorithms so as to deliver high-quality video streams without sacrificing bandwidth. In addition, a pre-trained super-resolution model can be deployed on the user's device to upscale a transmitted low-resolution video, yielding high-resolution video without causing bandwidth congestion. Although video super-resolution has made significant progress, challenges remain, such as acquiring training data, improving model generalization, and performing well in real-world scenarios; these challenges provide direction for future research and opportunities for technological innovation. As research deepens and the technology matures, the field of video super-resolution is expected to maintain its momentum of rapid development.
4. Research Problem and Goals
This study focuses on two key directions for improving video quality: super-resolution enhancement of video frames and optimization of the video frame rate.
First, one core goal of this research is the resolution-enhancement task for video frames. This task uses deep learning to predict high-resolution counterparts from existing low-resolution video frames. Through this process, we aim to significantly improve the clarity and detail of the video, making the content more detailed and realistic in its visual presentation.
Second, another goal of the research is to increase the video frame rate. Using deep learning, we plan to realize intelligent video frame interpolation, that is, inserting newly synthesized frames into an existing frame sequence to increase the coherence between frames. This will make playback smoother and more natural; in dynamic scenes in particular, it can effectively reduce motion blur and stuttering and improve the viewing experience.
In summary, this study aims not only to improve the resolution and visual clarity of video frames through deep learning, but also to enhance the dynamic fluency of video through frame-rate optimization. The combination of these two techniques promises a marked improvement in the quality and enjoyment of video content.
5. Literature review
With the continuous progress of deep learning, diffusion models have become the mainstream approach for tasks such as image generation and image super-resolution thanks to their excellent generation quality and capability. This section introduces the literature in three parts: first, the derivation of the denoising diffusion model and the basic structure of the current mainstream latent diffusion model (LDM); second, work on diffusion models for image super-resolution; third, work on diffusion models for video super-resolution and video generation.
5.1 Diffusion model
Denoising Diffusion Probabilistic Models (DDPM) [1] were first proposed by Jonathan Ho et al. in 2020 to address problems of earlier models such as GANs, namely unstable training and incomplete coverage of the data distribution. The core idea of DDPM is to generate data by simulating a physical diffusion process: the model learns the image distribution through a series of noising and denoising steps, and once it has learned the distribution it can generate the required images. It consists of two main processes: forward diffusion and reverse denoising. As shown in Figure 1, during forward diffusion, Gaussian noise is gradually added to the original image until the image becomes pure noise. This mimics molecular diffusion in thermodynamics, in which molecules gradually diffuse from a region of high concentration to a region of low concentration. Mathematically, the process is a Gaussian Markov chain: the distribution at each step depends only on the previous step and is Gaussian. The reverse process inverts the forward process, removing the noise added at each forward step until a random sample drawn from a standard normal distribution becomes a realistic image; this reverse process is also a Gaussian Markov chain. In practice, a U-Net is usually trained to predict, at each step, the noise to remove from the noisy image so as to restore the original. The loss function measures the discrepancy between the forward (noising) distribution and the reverse (denoising) distribution at each step. Once trained, the model starts from pure noise at generation time and removes noise step by step to finally produce a clear image.
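For reference, the forward process just described can be written in the standard notation of [1], with noise schedule $\beta_t$, $\alpha_t = 1 - \beta_t$, and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$$

The closed form on the right is what makes training practical: a noisy sample at any step $t$ can be drawn directly from $x_0$ without simulating the whole chain.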
Figure 1: Denoising diffusion model flow
Figure 2 shows the training and sampling procedures. In the training part on the left, the original image X0 is progressively noised; the noise predicted by the model at a sampled time step is compared with the true noise added at that step, so that the network fits the distribution of the noising process. On the right is the sampling part: once each noise distribution has been fitted, the noise predicted by the neural network can be subtracted from the current noisy image according to the update formula, yielding the denoised image for the next step.
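Assuming Figure 2 follows the parameterization of the original paper [1], the two formulas it shows are the simplified training objective and the per-step sampling update:

$$L_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2\right], \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, \mathbf{I})$$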
Figure 2: Noise-adding, denoising formula
The LDM model: since its proposal in 2020, the DDPM denoising diffusion model has received extensive attention. In 2022, Robin Rombach et al. [2] improved DDPM and applied the diffusion model to image generation, making the resulting latent diffusion model (LDM) the mainstream architecture for image tasks today.
Figure 3: LDM model structure
As shown in Figure 3, the model adds a VAE module (red) and a condition module (white) on top of DDPM (green), so that it can be better applied to image tasks. The VAE consists of an encoder and a decoder; its purpose is to compress and reconstruct the image while preserving its information content. The condition module processes the input conditioning information, reshaping the input text or image condition, after passing through a neural network, into the shape required by the denoising process. These two modules address two problems of DDPM. First, denoising in the original pixel space consumes a great deal of compute. Second, the original DDPM cannot generate a specific desired image, because its input and output distributions are fixed and it lacks a guidance mechanism. The red VAE module addresses the first problem: it compresses the original image into a latent space, reducing the amount of data to be processed during training. In practice, the VAE downsamples by a factor of 4 or 8 per side, so the image entering the DDPM stage has 16 or 64 times fewer pixels, which greatly reduces the input size. After the DDPM stage completes inference, the VAE decoder upsamples, restoring the compressed latent to the original resolution to obtain the real output. To solve the second problem, the model adds a condition module: the condition serves as an input that guides the denoising generation. Specifically, during training, image or text conditions are introduced to guide generation toward outputs consistent with the input image; at inference time, pure Gaussian noise and the conditioning information are the inputs, and the desired image is obtained through the model's denoising process. The model thus greatly alleviates the cost of working with large images and innovatively introduces conditional guidance, allowing diffusion models to be truly applied to a range of tasks such as image generation. Through its architecture and optimizations, LDM opened a new era of high-resolution image synthesis and brought new breakthroughs to deep learning and image processing.
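A minimal sketch of the inference flow just described, assuming stand-in modules for the VAE decoder and the conditional denoising U-Net (only the control flow of compress, denoise under a condition, and decompress is meant to be faithful):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ldm_sample(decoder: nn.Module, denoiser: nn.Module, cond: torch.Tensor,
               latent_shape, timesteps: int = 50) -> torch.Tensor:
    betas = torch.linspace(1e-4, 0.02, timesteps)   # simple linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(latent_shape)                   # start from pure latent noise
    for t in reversed(range(timesteps)):
        # Predict the noise at step t, guided by the condition.
        eps = denoiser(z, torch.tensor([t]), cond)
        # Standard DDPM posterior mean; fresh noise is added except at t == 0.
        z = (z - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return decoder(z)                               # VAE decoder: latent -> pixels
```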
DDIM: DDPM has achieved remarkable success in image generation, image super-resolution, and image editing, but because its theoretical derivation requires step-by-step noising and denoising, its heavy computational cost has always been a focus of attention. The DDIM network proposed by Jiaming Song et al. [3] improves on the theory of DDPM. Specifically, DDIM modifies the update formula of the denoising process: as shown in Equation (1) (restated below in the standard notation of [3]), the stochastic variance term of the original DDPM is replaced by the predicted noise, so the denoising step becomes a deterministic process. Secondly, whereas DDPM's derivation assumes a Markov chain, so that noising and denoising must proceed step by step, DDIM removes the Markov assumption, allowing high-quality data to be generated in far fewer steps and significantly improving generation speed.
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\;\epsilon_\theta(x_t, t) + \sigma_t \epsilon_t \qquad (1)$$

Setting $\sigma_t = 0$ removes the random term, making each denoising step deterministic.
5.2 Image super-resolution with diffusion models
Image super-resolution is a classic image restoration problem: reconstructing a high-resolution image from a low-resolution one. With the development of digital image processing, demand for high-resolution images keeps growing, since they provide more detailed information and therefore significant improvements in visual quality and downstream applications. Traditional methods such as nearest-neighbor, bilinear, and bicubic interpolation are computationally simple but often fail to recover image details, especially in high-frequency regions. With the rise of deep learning, learning-based methods, especially those based on convolutional neural networks, have made breakthroughs in image super-resolution. These methods learn complex mappings from low- to high-resolution images, producing more visually satisfying results.
However, despite their success in image super-resolution, deep learning methods often require large amounts of high-resolution training data and may perform poorly under unknown degradation models or in complex scenarios. They also still struggle to generate high-frequency detail while keeping the image natural. To address these issues, researchers have introduced the diffusion model, a powerful generative model that generates data by simulating a gradual process from the data distribution to a noise distribution and then reversing it. Diffusion models have shown their power in a variety of fields, such as image synthesis, image inpainting, and image super-resolution.
Research on diffusion-based image super-resolution falls mainly into the following directions: first, using the DDPM model to learn the mapping between low-resolution and high-resolution images; second, using the LDM model as the main structure, with the low-resolution image guiding the denoising process toward a high-resolution output; third, combining the diffusion model with a GAN, so that the network learns to match the denoising distribution between the diffusion network's generated images and real images.
The first direction uses the DDPM model to learn a direct mapping from low to high resolution, consistent with the approach of convolutional and GAN networks: during training, the model receives low-resolution and high-resolution image pairs as input and target. The network is thereby trained to predict the transition from a low-resolution image to a high-resolution one. This involves not only upsampling but also feature extraction and reconstruction, to ensure the quality of the generated high-resolution image.
In deep learning, one of the early applications of convolutional neural networks (CNNs) was image processing, and in particular image super-resolution [4][5][6][7]. The core goal of this task is to convert low-resolution images into high-resolution ones while preserving or enhancing detail and quality. Such networks use convolutions, residual connections, and similar modules, taking low/high-resolution image pairs as the model's input and target during training, and learning the mapping between the two.
GAN-based networks [8][9][10][11] take an LR image as input, output a synthetic high-resolution image, and then use a discriminator to compare it against real images; when the discriminator can no longer distinguish real from fake, training is complete. In practice, models such as ESRGAN, BSRGAN, and SRGAN cope with different degradation environments and improve the reconstruction and capture of image detail by optimizing individual modules, modifying the discriminator's logic, and adding complex degradation pipelines. The advantages of GANs and CNNs are their clear formulation and relatively simple network design; however, simple architectures struggle with large-scale data, GAN training is unstable, and the parameterization is complex, so their super-resolution results are often unsatisfactory.
Networks represented by Refusion [12] build on DDPM and are derived using the SDE formulation; the resulting network transforms a low-resolution image into a high-resolution one from the perspective of probability distributions, as shown in Figure 4. Its generation process is closer to the denoising process of the diffusion model than a GAN's, and its final results are better than those of the other two families.
To sum up, learning a direct mapping between low- and high-resolution images has the advantage of conceptual clarity and relatively simple network design. However, because a fixed mapping is established, such networks inevitably struggle to model complex degradations; as a result, most of them work only on paired datasets, and their results are often unsatisfactory under the complex degradations found in the real world.
Figure 4: Refusion infrastructure
The second direction uses the LDM model as the main structure for generating high-resolution images. It focuses on using the low-resolution image as a guiding condition and on optimizing or replacing the denoising network to achieve better generation. Work centered on the DDPM structure focuses on improving the U-Net used in the denoising process, changing the number of layers or the network itself to achieve better results.
SR3 [13] was the first paper to use a diffusion model for super-resolution, varying the number of layers at different depths and improving the residual-block network. In this paper, the upsampled low-resolution image and the noisy image are concatenated as a single input, and parts of the denoising network's residual modules are adjusted to better extract features.
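A sketch of this conditioning pattern, assuming SR3's published recipe of concatenating the bicubic-upsampled LR image with the noisy image channel-wise (shapes illustrative):

```python
import torch
import torch.nn.functional as F

lr = torch.randn(1, 3, 64, 64)        # low-resolution input
x_t = torch.randn(1, 3, 256, 256)     # noisy high-resolution image at step t

# Upsample the LR image to the target size, then stack along channels.
lr_up = F.interpolate(lr, size=(256, 256), mode="bicubic", align_corners=False)
denoiser_input = torch.cat([x_t, lr_up], dim=1)   # [1, 6, 256, 256]
```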
Figure 5: Basic structure of LWTDM
As shown in Figure 5, LWTDM [14] replaces the U-Net with an attention-based structure to implement the diffusion model. This paper also offered guidance for the later use of attention as a denoising architecture; subsequent structures such as DiT and ViT [15][16] use attention to complete image-related tasks. Experiments show that with large datasets and rich information, attention outperforms the U-Net, which is why many studies attempt to use attention for image-related tasks.
Denoising Diffusion Probabilistic Model for Retinal Image Generation and Segmentation [17] learns the texture and the retinal structure separately, so that after recombination the sub-parts can be better distinguished from each other and the edge texture of the image becomes more pronounced. The paper also uses a cosine noise scheduler and a retraining technique. The cosine noise scheduler controls the addition of noise through a cosine function, making the change in noise intensity smoother and more gradual, which affects both the training speed and the generation quality of the model. The retraining technique trains the model repeatedly on the same data: because denoising is stochastic, the same input yields different outputs, so training multiple times on the same data lowers the loss on that data and makes the model more robust to changes and perturbations in the input. With these two techniques, the paper successfully created a dataset for retinal segmentation.
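The cosine scheduler referenced here is presumably the one introduced in the improved-DDPM line of work (Nichol and Dhariwal, 2021); a sketch of that schedule:

```python
import math
import torch

def cosine_beta_schedule(timesteps: int, s: float = 0.008) -> torch.Tensor:
    """Per-step betas derived from the cosine alpha-bar curve: noise
    intensity changes slowly near t = 0 and t = T, i.e. the smoother
    behaviour the paper relies on."""
    steps = torch.arange(timesteps + 1, dtype=torch.float64)
    f = torch.cos(((steps / timesteps) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()
```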
DDNM [18] performs a mathematical derivation using range-null space decomposition, observing that for a linear degradation from a high-resolution image to a low-resolution one, the range-space component of the original image is preserved while the null-space component is lost. Therefore, on top of the DDPM structure, the predicted image at each step is combined with the degraded observation to form a rectified image, and this image is used as Xt-1 for the next prediction step, as shown in Figure 6.
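In the notation of [18], with $A$ the linear degradation operator, $A^{\dagger}$ its pseudo-inverse, $y = Ax$ the degraded observation, and $x_{0|t}$ the network's estimate of the clean image at step $t$, the rectification described above is

$$\hat{x}_{0|t} = A^{\dagger} y + \left(I - A^{\dagger} A\right) x_{0|t}$$

The first term pins down the range-space component, so that $A\hat{x}_{0|t} = y$ holds exactly (data consistency), while the diffusion model only has to fill in the null-space component that the degradation destroyed.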
Figure 6: DDNM structure
The IDM [19] network addresses the problems of fixed magnification factors and artifacts. The architecture mainly fuses the LR image with the noisy HR image, the aim being to integrate features into the image more effectively. As shown in Figure 7, the yellow part extracts features, the blue part on the right adds positional information, and the combination of LR features, the HR noise map, and positional information is finally fed into the denoising network. On the left, X is the LR image and S is the scale factor used to select the magnification level. EDSR, a residual-block network, establishes initial features, which are then enlarged by bicubic interpolation. The resulting conditioning consists of three kinds of information: the top branch turns the scale factor S into a map and adds it to the image; in the second branch, U concatenates the LR image and the HR noise image and then downsamples to extract features, with the relative contribution of each determined by the scale S; and the final input to the downsampling path is the enlarged LR image together with the fusion of the enlarged LR image and the HR noise. Coordinate information is then added during upsampling to obtain the final result.
Figure 7: IDM structure
The LDDPM [20] network adopts the LDM structure but, after the conditional guidance input, adds a parallel conditional feature-extraction network to better integrate all the information from the two images; by fusing the original denoising network with the feature-extraction network, it obtains better image generation and edge detail. The pyramid-shaped PDDPM [21] upsamples by a factor of two with each denoising stage, reaching the required output size after several such stages. In addition, the paper innovatively feeds positional encodings together with the image input: the extracted position information is encoded and concatenated at the stage where the low-resolution image is input, so that more positional information is available and the output image is not deformed or distorted. The paper is a testament to the importance of positional encoding.
More research focuses on the LDM structure, because LDM does not merely learn a linear mapping of the image: it can represent more nonlinear relations and its outputs are more flexible, which is why researchers favor this model. The biggest difference between the LDM architecture and the other structures is that the low-resolution image is used as a condition to guide the denoising model's denoising process; many papers modify the structure, by changing the conditional input structure, the choice of denoising network, or the way the high-resolution image is fed in, to obtain better experimental results. SRDiff [23] did not use attention to modify the input structure, but instead used the residual between the high-resolution and upsampled low-resolution images as the target, with the whole model predicting this residual during noise prediction. This showed that the residual carries more usable features, and further supports the conclusion that the more conditional information is extracted and integrated into the denoising process, the better the generated image; the LDDPM paper reaches the same conclusion.
For the main U-Net structure, the principal approach is to modify the U-Net itself; the most advanced such structure is the SD3 model, which achieves good output results on large datasets. Denoising networks are designed to predict noise successfully while fusing in the LR image, and under the LDM framework they fall roughly into the following categories: 1. U-Net-based networks, which are also the most common, usually modified by changing the residual blocks in the U-Net, using different normalization layers, or combining attention with the U-Net and applying attention at the lowest-resolution layers. 2. Networks with attention as the main structure, in the line of ViT, DiT, and SD3. The motivation for attention is that the U-Net's inductive bias is not critical to the performance of diffusion models, attention networks can be scaled in a standard way, and attention has been shown to train better at large data volumes.
Finally, consider the pre-training of the autoencoder for HR images. In latent diffusion models (LDMs), an encoder-decoder structure compresses the image data into a low-dimensional latent space and then decodes it back to the original data form. The encoder is usually a deep neural network responsible for encoding high-dimensional image data into a low-dimensional latent representation; this representation captures the main features of the image and, thanks to its low dimensionality, allows the image to be processed and generated more efficiently. The decoder is the inverse of the encoder, restoring the latent representation to the original image data. In LDM, the decoder is usually symmetric to the encoder, progressively restoring the details of the image through a series of upsampling and convolution operations. Encoders and decoders are included in LDM so that images can enter the latent domain for more efficient computation. There is relatively little research in this area; one paper that improves this component is Refusion, which uses U-Net-style skip connections to make the autoencoder's input and output better matched.
Many papers have shown that when the LR image is used as the conditional input, the richer the input features, the better the quality of the result. Moreover, during denoising, the greater the weight given to the LR image, the more the output matches expectations, which is exactly what a super-resolution task wants: the guidance_scale parameter raises the weight of the condition while preserving the original features, so higher values produce images that better satisfy the condition, while lower values give the generated image more freedom. In short, super-resolution tasks call for stronger conditioning. Methods along these lines include multi-feature extraction (IDM, SD3), with the goal of capturing as much high-frequency information as possible; directly enlarging the LR image and training on the residual (SRDiff); direct concatenation with the noise (LWTDM); and the synthetic captions of DALL-E 3 [22]. DALL-E 3 is OpenAI's latest advance in text-to-image generation, and its core innovation is to improve the model's prompt-following ability by improving image descriptions. Earlier text-to-image models often had difficulty with detailed image descriptions, ignoring or confusing prompt words, mainly because the image descriptions in the training dataset were noisy and inaccurate. DALL-E 3 solves this by training a dedicated image-description generator and using it to regenerate the descriptions of the training dataset. Its training proceeds in two stages: first, a small dataset describing only the image subject is constructed for fine-tuning, producing short synthetic captions (SSC); then a large dataset of detailed descriptions is created, producing descriptive synthetic captions (DSC). These captions describe not only the subject of the image but also details such as the environment, background, text, style, and color.
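The guidance_scale mechanism mentioned above is the classifier-free guidance rule; a sketch, with the denoiser and condition tensors as hypothetical stand-ins:

```python
import torch

def guided_noise(denoiser, x_t, t, cond, uncond, guidance_scale: float = 7.5):
    eps_cond = denoiser(x_t, t, cond)      # prediction with the LR/text condition
    eps_uncond = denoiser(x_t, t, uncond)  # prediction with a null condition
    # Larger guidance_scale pushes the sample harder toward the condition;
    # smaller values leave the model more freedom.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```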
In experiments, DALL-E 3 was compared with models such as DALL-E 2 and Stable Diffusion XL, and the results showed that it outperformed them in prompt following, style, and consistency, and was preferred by human evaluators. This demonstrates that training with synthetic captions carrying richer semantic information produces better results, an idea now used in many papers. Since most conditional input information is vague or incomplete, synthetic captions provide more refined conditional guidance: raw captions can lead to unsatisfactory outputs because of errors and insufficient description, so richer descriptions are synthesized before input, making the conditional guidance more adequate. SD3's paper cites this idea, using 50% synthetic captions as the condition. I believe a pre-trained network could similarly be designed to let LR images incorporate positional and temporal information, extending and enriching the input features.
Figure 8: DDGAN network
The third idea is to use DDGAN-style networks for image generation [24][25][26][27]. This model was first proposed by Zhisheng Xiao et al. in 2021; the network combines the GAN with the diffusion model, so that the model keeps the diffusion model's foundations while reducing running time. Traditional generative models, such as GANs, VAEs, and normalizing flows, face trade-offs between quality, speed, and diversity: GANs produce high-quality samples and sample quickly but often lack diversity, while denoising diffusion models produce high-quality, diverse samples but sample slowly and at great computational expense.
To solve this problem, the authors of DDGAN [26] propose a new model that uses a conditional GAN to match the denoising distribution at each step, rather than assuming it is Gaussian. This allows denoising with much larger step sizes, which reduces the number of sampling steps required and speeds up sampling. Experiments on the CIFAR-10 dataset show that the method is competitive with the original diffusion model in sample quality and diversity.
In addition, the model demonstrated better mode coverage and sample diversity than traditional GANs. To the best of the authors' knowledge, it was the first model to reduce the sampling cost of a diffusion model to the point where it can be economically applied to real-world applications. The paper also discusses the limitations of the Gaussian assumption on the denoising distribution under different conditions, and proposes a multimodal distribution to better model the denoising process, especially when the marginal data distribution is not Gaussian. A key advantage of this approach is more stable training and a more diverse set of generated samples. Overall, the paper proposes an innovative solution to the trilemma in generative learning and verifies its effectiveness experimentally, providing valuable insights and direction for subsequent generative-model development.
SRDDGAN and SRDIGAN [24][25] modify this model to fit super-resolution tasks. The experimental results show that SRDDGAN surpasses existing diffusion-based methods in PSNR and perceptual-quality metrics while achieving a fast sampling speed, roughly 11 times that of the existing diffusion model SR3, making it more suitable for real-world applications. SRDDGAN additionally introduces a low-resolution (LR) encoder module to exploit the information in the LR image more effectively as conditional input, improving the model's fidelity and detail recovery. In short, SRDDGAN proposes a new single image super-resolution (SISR) method that combines the advantages of the denoising diffusion model and the GAN, significantly increasing sampling speed and the diversity of generated samples while maintaining image quality.
SRDIG [27] explores how to combine denoising diffusion models to simultaneously handle motion deblurring and super-resolution in images. The work targets the common motion-blur problem in low-resolution (LR) images and attempts to restore detail while enhancing resolution. The proposed method achieves competitive sample quality and diversity compared with diffusion-based image super-resolution models on combined deblurring and super-resolution tasks; in particular, compared with diffusion models such as SR3, its sampling requires only four steps, making it about 11 times faster. Compared with traditional GANs, the model improves significantly in training stability and sample diversity while remaining competitive in fidelity. DDGAN-style networks have been tested extensively on remote sensing and medical images, confirming the model's very high sampling speed.
In summary, a large body of work on these networks has demonstrated the effectiveness of the DDPM and LDM structures: using conditional constraints as guidance, instead of directly fitting a linear mapping, lets neural networks cope with the complex degradations of the real world. A series of image super-resolution models focus on how to better integrate low-resolution information and positional information. These studies show that extracting information more effectively, and integrating it more thoroughly into the denoising process, significantly improves generation quality, that is, it better predicts the image content in the null-space component. We should therefore innovate in the extraction and fusion of conditional information in order to obtain better generated images.
5.3 Video super-resolution with diffusion models
Video super-resolution (VSR) increases the resolution of a video by enhancing a low-resolution video into a high-resolution one, providing more detail and clarity. With the development of display technology, demands on video quality keep rising, and high-definition and even ultra-high-definition video have become mainstream. However, due to limitations of capture devices, storage space, and transmission bandwidth, much video content still exists at lower resolutions; VSR effectively enhances the viewing experience for such videos, through hardware or software methods. The development of VSR is closely tied to image super-resolution: with the vigorous development of artificial intelligence, image super-resolution has achieved remarkable results, and VSR builds on this foundation, needing to handle not only the resolution enhancement of individual frames but also the temporal continuity and motion compensation across frames in a video sequence.
The application of diffusion models to video super-resolution has become a research hotspot in recent years. By simulating the diffusion process of the data, these models establish a mapping between different resolutions, realizing the conversion from low-resolution to high-resolution images. The ECDP [52] model reduces time consumption and improves the efficiency of super-resolution image generation through a conditional diffusion process and probability-flow sampling; experiments show that ECDP significantly reduces inference time while maintaining image quality. To address the computational cost of high-resolution image generation, DistriFusion [51] splits the model input into multiple patches processed in parallel across multiple GPUs, making the model more efficient and faster. SinSR [53] proposes single-step super-resolution generation, accelerating diffusion-based SR by deriving a deterministic sampling process from recent methods; this not only improves inference speed but also reaches performance comparable to multi-step approaches. ResShift [54] improves conversion efficiency by constructing a Markov chain that shifts the residual between the high-resolution and low-resolution images; with only 15 sampling steps, ResShift matches or exceeds current state-of-the-art methods.
The most prominent work on video diffusion models is Sora and Google's Imagen Video, each with its own strengths. At its core, Sora is a diffusion-based Transformer model that implements video generation through the following steps: a video compression network compresses the raw video into a latent spatiotemporal representation, and a series of spacetime latent patches is extracted from the compressed video; the diffusion Transformer then starts from patches full of visual noise and gradually denoises them into a video. As an innovative technology, Sora demonstrates the great potential of AI in video generation: it can simulate motion and interaction in the physical world and handle complex scenarios in the digital world. Imagen Video [32] is a text-to-video generation model developed by Google that turns text prompts into high-definition video through a cascade of diffusion models. It can generate high-definition clips up to 5.3 seconds long at 1280×768 resolution and 24 fps. At its core is a cascade of sub-models, including a text encoder, a base video diffusion model, spatial super-resolution models, and temporal super-resolution models, totaling 11.6 billion parameters. Generation proceeds in stages: an initial video is generated from the text prompt, and then a series of spatial and temporal super-resolution models increase its resolution and frame rate. Although Imagen Video shows significant experimental progress, it still faces challenges, including the use and creation of datasets; future directions may include improving video quality and duration, enhancing model generalization, and exploring new use cases. Imagen Video demonstrates the potential of AI to understand and generate complex video content, and research in this area still holds great promise. Both networks adapt the image super-resolution LDM to the requirements of video, which again shows how closely the video task is tied to the image task.
In summary, video-related tasks remain an emerging field, and the relevant studies show that video processing must attend to several issues: first, the machine must learn the temporal correlation between frames of the same video; second, we need models that can generate coherent, temporally linked frames that compose a video. In this study, some layers of the model are upgraded to 3D layers so that the model gains temporal perception, and an optical-flow auxiliary model is introduced so that motion and pose changes over time can be understood.
Overall, diffusion models are gaining momentum in super-resolution; they have become a hot topic of current research and show great potential for model optimization and innovation. With continued technical progress, diffusion models have achieved remarkable results in quality improvement and detail enhancement. Researchers are working to further improve them and to explore new algorithms and strategies for more accurate and efficient processing in image super-resolution. The prospects of this field are promising, and more breakthroughs in image processing technology can be expected in the future.
6. Research Methods
This study aims to reconstruct high-resolution video from low-resolution video using deep-learning diffusion models. The task involves several strands of research: first, low- to high-resolution reconstruction of video frames; second, improving the smoothness of the video itself; third, optimizing artifact removal. The research methods and innovations are described in the following three parts.
6.1 Super-resolution image tasks
This task targets the high-resolution restoration of video frames, which differs from general image restoration: when restoring video frames, we must also consider the temporal causality of frames from the same video, that is, the neural network needs to learn the relationship between different frames of one video. Therefore, we add 3D network layers to the original LDM model to learn the relationships between video frames. Here we apply the design idea of the Upscale-A-Video paper: alternating 2D and 3D layers within the U-Net. This extracts not only spatial information but also spatiotemporal information, as in the sketch below.
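A minimal sketch of this alternating spatial/temporal design, assuming illustrative module sizes rather than Upscale-A-Video's actual configuration:

```python
import torch
import torch.nn as nn

class Spatial2DBlock(nn.Module):
    """Per-frame spatial features: fold time into the batch axis."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):                        # x: [B, C, T, H, W]
        b, c, t, h, w = x.shape
        y = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        y = torch.relu(self.conv(y))
        return y.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)

class Temporal3DBlock(nn.Module):
    """Temporal mixing only: kernel (3, 1, 1) looks at neighbouring frames."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Conv3d(ch, ch, (3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):                        # x: [B, C, T, H, W]
        return torch.relu(self.conv(x))

# Alternate 2D (spatial) and 3D (temporal) layers, as in the design above.
backbone = nn.Sequential(Spatial2DBlock(64), Temporal3DBlock(64),
                         Spatial2DBlock(64), Temporal3DBlock(64))
```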
As shown in the figure below, the lower half is a U-Net used to extract feature information from the low-resolution video, called the conditional U-Net, while the upper half is the main denoising network, called the denoising U-Net. The network takes the input low-resolution video as the condition, extracts features through the conditional U-Net, transforms each extracted feature to the required tensor shape through a shared linear layer, and feeds it into the denoising U-Net as a conditional input, so that the denoising model can better accept the condition and use it to guide the denoising process. For the denoising U-Net, hardware constraints lead this study to adopt transfer learning: we load pre-trained weights, freeze the layers that are already trained, and train only the modified layers to adapt to the video generation task. Concretely, we freeze the VAE and most of the denoising U-Net, and train the conditional U-Net and the newly added layers to output high-resolution video. Notably, we connect the conditional U-Net to the denoising U-Net in reverse correspondence: the conditional features produced by the conditional U-Net's downsampling path serve as the conditions received by the denoising U-Net's upsampling path. The rationale is that downsampling extracts feature information while upsampling recovers pixel information; with this reversed connection, the denoising U-Net obtains both feature and pixel information during its own down- and upsampling, fusing more information. A sketch of this reversed wiring follows.
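A sketch of the reverse correspondence under stated assumptions: the shared projection is modeled as a 1x1 convolution, stage channel counts are taken to be equal, and feature addition stands in for whatever fusion rule the full design uses:

```python
import torch
import torch.nn as nn

class ConditionInjector(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Shared projection applied to every conditional feature map.
        self.shared_proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, denoise_up_feats, cond_down_feats):
        """Pair the conditional U-Net's *downsampling* features with the
        denoising U-Net's *upsampling* stages in reverse order (deepest
        condition feature first) and fuse them by addition."""
        fused = []
        for up_f, cond_f in zip(denoise_up_feats, reversed(cond_down_feats)):
            fused.append(up_f + self.shared_proj(cond_f))
        return fused
```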
Figure 9: Video image super-resolution network
Figure 10: Denoising unet network vs. conditional unet network
6.2 Video Frame Interpolation Task
Specifically, we want to predict the middle frame from known preceding and following frames, and then insert it to increase the frame rate. In this task, we build on the video LDM model trained in the previous task.
As shown in Figure 11, the network is a video interpolation network. It samples two frames of the low-resolution video and passes them, as a joint condition, through a condition-fusion network; the output then enters the LDM model trained in task one, which takes the two input frames and synthesizes the image of the middle frame. These three frames are compared against the corresponding three frames of the real high-resolution video, and this comparison serves as the model's loss function, measuring the difference between generated and real images; back-propagation then brings the generated images closer to the real ones and completes training. Once training is complete, we can run the model on existing low-resolution videos to perform the high-resolution task and insert a frame between every two frames, increasing the video frame rate, as in the sketch below.
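The inference loop just described reduces to splicing predicted middle frames between every consecutive pair; a sketch, with predict_middle as a hypothetical stand-in for the trained interpolation model:

```python
import torch

def double_frame_rate(frames: torch.Tensor, predict_middle) -> torch.Tensor:
    """frames: [T, C, H, W]; predict_middle(prev, nxt) -> [C, H, W].
    Returns [2T - 1, C, H, W]: a middle frame between each pair."""
    out = [frames[0]]
    for prev, nxt in zip(frames[:-1], frames[1:]):
        out.append(predict_middle(prev, nxt))   # inserted middle frame
        out.append(nxt)
    return torch.stack(out)
```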
Figure 11: Video interpolation network
Figure 12: De-artifact network
6.3 Artifact removal tasks
The main causes of artifacts and flickering are as follows. 1. Inaccurate motion estimation: in video super-resolution, when the video contains significant lighting changes or large motion, the prediction network may fail to perceive such large changes, introducing artifacts and blur. 2. Underuse of inter-frame information: traditional video super-resolution methods may not fully exploit the spatial and temporal information between successive frames, so details are not effectively recovered and artifacts appear. 3. Flicker artifacts: these are usually caused by inconsistencies in brightness or color between frames during reconstruction, stemming from insufficient feature fusion between frames or inaccurate motion compensation. We therefore design a denoising 3D layer in the decoder, together with a transition-recognition network, to remove artifacts and flickering, as shown in Figure 12. The 3D convolution module operates on the spatiotemporal domain and can account for both spatial and temporal correlations between frames, which helps restore detail and reduce artifacts; adding a 3D layer also enhances low-level consistency and reduces flicker artifacts by strengthening information fusion during decoding. The transition-recognition network identifies large transitions in the video and splits the input into two tensors, before and after the transition frame, which enter the decoder separately; this addresses the artifact blur caused by large motion. Specifically, following the design idea of RAFT, we compare the correlation between consecutive frames and set a filter that flags frame pairs with poor correlation; the input video is then split at the transition frame into two tensors fed into the decoder, as in the sketch below.
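A sketch of the transition-recognition idea, with RAFT's all-pairs correlation volume replaced, purely for illustration, by a cosine similarity between flattened consecutive frames:

```python
import torch
import torch.nn.functional as F

def split_at_transitions(frames: torch.Tensor, threshold: float = 0.5):
    """frames: [T, C, H, W] -> list of contiguous segments, split wherever
    the correlation between neighbouring frames drops below the threshold."""
    flat = frames.flatten(start_dim=1)                      # [T, C*H*W]
    sim = F.cosine_similarity(flat[:-1], flat[1:], dim=1)   # [T-1]
    segments, start = [], 0
    for i, s in enumerate(sim):
        if s < threshold:                                   # transition detected
            segments.append(frames[start:i + 1])
            start = i + 1
    segments.append(frames[start:])
    return segments
```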
In summary, adding a 3D layer to the decoder and a transition-recognition network lets the model use the spatiotemporal information in video data more effectively, improving the quality of video super-resolution and reducing artifacts and flicker.
7. Work Plan
In this study, we conducted an in-depth literature review to grasp the latest research trends and technical challenges of diffusion models in video super-resolution, laying a solid foundation for subsequent theoretical work and experimental design. We choose the LDM model as the core framework, with a U-Net as the main denoising network, specially designed with alternating 2D and 3D layers to meet the particular needs of video processing.
For datasets, we take a phased approach. In the initial stage, we select small-scale datasets such as SPMCS, UDM10, REDS30, and YouHQ40 for model parameter tuning and preliminary trial runs, to verify the feasibility and effect of the model. After achieving stable results on small-scale datasets, we extend the model to larger datasets such as VideoLQ and AIGC30 to comprehensively evaluate its performance.
In the model training and validation phase, we adopt a combination of pre-training, fine-tuning, and transfer learning. By importing pre-trained weights, we accelerate the convergence of the model and train the new weights to fit the specific task. At the same time, we use a validation set to monitor the model's performance and avoid overfitting.
For performance evaluation and analysis, we use a range of quantitative and qualitative measures, including peak signal-to-noise ratio (PSNR), structural similarity (SSIM), video quality assessment (VQA), and visual information fidelity (VIF), to comprehensively evaluate the model.
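PSNR, for instance, follows directly from the mean squared error; a minimal implementation:

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB, for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return (10 * torch.log10(max_val ** 2 / mse)).item()
```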
Finally, in the analysis and optimization stage, the outputs will be compared, and selected layers and model weights will be optimized to further improve the performance and generalization ability of the model. Through this process, we hope not only to advance video super-resolution technology but also to provide new ideas and methods for future research on related techniques.
References
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.
Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
Song, Jiaming, Chenlin Meng, and Stefano Ermon. "Denoising diffusion implicit models." arXiv preprint arXiv:2010.02502 (2020).
Dong, Chao, et al. "Image super-resolution using deep convolutional networks." IEEE transactions on pattern analysis and machine intelligence 38.2 (2015): 295-307.
Zhang, Yulun, et al. "Residual dense network for image super-resolution." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
Zhou, Fuqiang, Xiaojie Li, and Zuoxin Li. "High-frequency details enhancing DenseNet for super-resolution." Neurocomputing 290 (2018): 34-42.
Li, Zhuangzi. "Image super-resolution using attention based densenet with residual deconvolution." arXiv preprint arXiv:1907.05282 (2019).
Zhang K, Liang J, Van Gool L, et al. Designing a practical degradation model for deep blind image super-resolution[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 4791-4800.
Wang X, Yu K, Wu S, et al. Esrgan: Enhanced super-resolution generative adversarial networks[C]//Proceedings of the European conference on computer vision (ECCV) workshops. 2018: 0-0.
Ledig C, Theis L, Huszár F, et al. Photo-realistic single image super-resolution using a generative adversarial network[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4681-4690.
Nirkin, Yuval, Yosi Keller, and Tal Hassner. "Fsgan: Subject agnostic face swapping and reenactment." Proceedings of the IEEE/CVF international conference on computer vision. 2019.
Luo Z, Gustafsson F K, Zhao Z, et al. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 1680-1691.
An T, Xue B, Huo C, et al. Efficient remote sensing image super-resolution via lightweight diffusion models[J]. IEEE Geoscience and Remote Sensing Letters, 2023.
Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
Dosovitskiy, Alexey. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
Alimanov, Alnur, and Md Baharul Islam. "Denoising diffusion probabilistic model for retinal image generation and segmentation." 2023 IEEE international conference on computational photography (ICCP). IEEE, 2023.
Alimanov, Alnur, and Md Baharul Islam. "Denoising diffusion probabilistic model for retinal image generation and segmentation." 2023 IEEE international conference on computational photography (ICCP). IEEE, 2023.
Wang, Yinhuai, Jiwen Yu, and Jian Zhang. "Zero-shot image restoration using denoising diffusion null-space model." arXiv preprint arXiv:2212.00490 (2022).
Gao S, Liu X, Zeng B, et al. Implicit diffusion models for continuous super-resolution[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 10021-10030.
Wang, Xin, et al. "Super-resolution reconstruction of single image for latent features." Computational Visual Media (2024): 1-21.
Ryu, Dohoon, and Jong Chul Ye. "Pyramidal denoising diffusion probabilistic models." arXiv preprint arXiv:2208.01864 (2022).
Betker, James, et al. "Improving image generation with better captions." Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2.3 (2023): 8.
Li H, Yang Y, Chang M, et al. Srdiff: Single image super-resolution with diffusion probabilistic models[J]. Neurocomputing, 2022, 479: 47-59.
Xiao Z, Kreis K, Vahdat A. Tackling the generative learning trilemma with denoising diffusion gans[J]. arXiv preprint arXiv:2112.07804, 2021.
Tan Z, Chai M, Chen D, et al. Diverse semantic image synthesis via probability distribution modeling[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 7962-7971.
Karras T, Laine S, Aila T. A style-based generator architecture for generative adversarial networks[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 4401-4410.
Betker J, Goh G, Jing L, et al. Improving image generation with better captions[J]. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2023, 2(3): 8.
Luo Z, Gustafsson F K, Zhao Z, et al. Refusion: Enabling large-size realistic image restoration with latent-space diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 1680-1691.
Zhang Y, Yang Q, Zhou Y, et al. Tcdm: Transformational complexity based distortion metric for perceptual point cloud quality assessment[J]. IEEE Transactions on Visualization and Computer Graphics, 2023.
Meng C, He Y, Song Y, et al. Sdedit: Guided image synthesis and editing with stochastic differential equations[J]. arXiv preprint arXiv:2108.01073, 2021.
Zhang K, Liang J, Van Gool L, et al. Designing a practical degradation model for deep blind image super-resolution[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 4791-4800.
Ho J, Chan W, Saharia C, et al. Imagen video: High definition video generation with diffusion models[J]. arXiv preprint arXiv:2210.02303, 2022.
Harvey W, Naderiparizi S, Masrani V, et al. Flexible diffusion modeling of long videos[J]. Advances in Neural Information Processing Systems, 2022, 35: 27953-27965.
Ho J, Salimans T, Gritsenko A, et al. Video diffusion models[J]. Advances in Neural Information Processing Systems, 2022, 35: 8633-8646.
Blattmann A, Rombach R, Ling H, et al. Align your latents: High-resolution video synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 22563-22575.
Luo Z, Chen D, Zhang Y, et al. Videofusion: Decomposed diffusion models for high-quality video generation[J]. arXiv preprint arXiv:2303.08320, 2023.
Chen H, Zhang Y, Cun X, et al. Videocrafter2: Overcoming data limitations for high-quality video diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 7310-7320.
Esser P, Chiu J, Atighehchian P, et al. Structure and content-guided video synthesis with diffusion models[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 7346-7356.
Jain S, Watson D, Tabellion E, et al. Video interpolation with diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 7341-7351.
Danier D, Zhang F, Bull D. Ldmvfi: Video frame interpolation with latent diffusion models[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(2): 1472-1480.
Anwar S, Khan S, Barnes N. A deep journey into super-resolution: A survey[J]. ACM Computing Surveys (CSUR), 2020, 53(3): 1-34.
Croitoru F A, Hondru V, Ionescu R T, et al. Diffusion models in vision: A survey[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(9): 10850-10869.
Yang L, Zhang Z, Song Y, et al. Diffusion models: A comprehensive survey of methods and applications[J]. ACM Computing Surveys, 2023, 56(4): 1-39.
Kingma D, Salimans T, Poole B, et al. Variational diffusion models[J]. Advances in neural information processing systems, 2021, 34: 21696-21707.
Gao S, Liu X, Zeng B, et al. Implicit diffusion models for continuous super-resolution[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023: 10021-10030.
Moser B B, Shanbhag A S, Raue F, et al. Diffusion models, image super-resolution and everything: A survey[J]. arXiv preprint arXiv:2401.00736, 2024.
Niu A, Pham T X, Zhang K, et al. ACDMSR: Accelerated conditional diffusion models for single image super-resolution[J]. IEEE Transactions on Broadcasting, 2024.
Wang Y, Yang W, Chen X, et al. SinSR: diffusion-based image super-resolution in a single step[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 25796-25805.
Liu H, Ruan Z, Zhao P, et al. Video super-resolution based on deep learning: a comprehensive survey[J]. Artificial Intelligence Review, 2022, 55(8): 5981-6035.
Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
Li M, Cai T, Cao J, et al. Distrifusion: Distributed parallel inference for high-resolution diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 7183-7193.
Yuan Y, Yuan C. Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(7): 6862-6870.
Wang Y, Yang W, Chen X, et al. SinSR: diffusion-based image super-resolution in a single step[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 25796-25805.
Yue Z, Wang J, Loy C C. Resshift: Efficient diffusion model for image super-resolution by residual shifting[J]. Advances in Neural Information Processing Systems, 2024, 36.