AttentionX Model: Improving Perceptional Feature Building

1. Objective

This project aims to better understand how the attention mechanism captures local and global correlations in vision tasks, especially learned image compression and post-processing. The investigation also includes one use case of the proposed AttentionX model: removing compression artefacts and learning perceptional features so that a compressed image can be restored with more detailed high-frequency components, making the decoded image look more vivid. In particular, this research will focus on improving image compression efficiency at different trade-offs between performance and complexity. To achieve this goal, we will combine the intra-frame coding of VVC with a grouped residual fusion network of perceptional attention (PA-GRFN) as a post-processing filter. Compared to other attention-based models, PA-GRFN aims at fewer parameters, lower computational complexity, and a more interpretable feature learning mechanism for perceptional understanding.

2. Novelty

Attention models have recently drawn wide interest across vision tasks such as person re-ID, image classification, and object recognition, and are now being extended to learned or generative image compression, image restoration [1], and super-resolution. SENet [4] introduces an effective, lightweight gating mechanism that recalibrates the feature map according to channel-wise importance. Beyond the channel dimension, CBAM [3] introduces spatial attention in a similar way. SKNet [5] further introduces a dynamic kernel selection mechanism guided by multi-scale group convolutions, which improves classification performance with a small number of additional parameters and computations. GCNet [6] analyses the advantages and disadvantages of the Non-Local [7] and SE [4] modules and combines the advantages of both into a more effective global context module, obtaining compelling results on object detection tasks. Google Brain proposed a “stand-alone” attention network [8,9], in which convolutional layers can be replaced entirely or partly by attention blocks for some vision tasks.
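To make the channel-wise gating of SENet [4] concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the reference implementation: the weights are random placeholders rather than trained parameters, and the function name `se_gate`, the reduction ratio `r`, and the tensor sizes are hypothetical choices.

```python
import numpy as np

def se_gate(x, w1, w2):
    """Squeeze-and-excitation style channel gating (illustrative sketch).

    x  : feature map of shape (C, H, W)
    w1 : reduction weights of shape (C // r, C)  (hypothetical, untrained)
    w2 : expansion weights of shape (C, C // r)  (hypothetical, untrained)
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = x.mean(axis=(1, 2))                     # (C,)
    # Excitation: bottleneck with ReLU, then a sigmoid gate in (0, 1).
    h = np.maximum(w1 @ z, 0.0)                 # (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))         # (C,)
    # Recalibration: rescale every channel by its gate value.
    return x * s[:, None, None], s

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4                         # assumed sizes for the demo
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y, s = se_gate(x, w1, w2)
```

The gate adds only the two small weight matrices per block, which is why this kind of rescaling is considered lightweight.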

GRDN [2] achieved the best performance in the NTIRE 2019 Real Image Denoising Challenge, and it can be combined with an attention scheme for effective removal of compression artefacts. In the Challenge on Learned Image Compression (CLIC) 2019, a solution [1] combining GRDN and CBAM, namely cascaded grouped residual dense blocks (GRDBs) followed by a convolutional block attention module (CBAM) [3], achieved the second-highest PSNR in the low-rate track and ranked first in speed among the decoders with a high mean opinion score (MOS). Our question is whether perceptional attention can adaptively adjust its receptive field size based on multiple scales of input image information. To answer this question, we will implement “Perceptional Attention” for vision tasks for the first time, aiming for a near state-of-the-art architecture with lower model complexity. We expect that perceptional attention will selectively capture target objects at different scales, effectively model the global context [6,7,10], and adapt to structural and perceptional retention through multi-scale hierarchical feature fusion [14]. We expect this to outperform major benchmarks on some vision tasks; the example use case is removing compression artefacts such as blockiness, ringing, and contouring.

3. Impact

The primary benefit of using AI and ML for image and video compression is savings in cost and development effort. Image and video codecs are complex algorithms that take a significant amount of time to develop and fine-tune, while AI systems help learning-based algorithms converge faster [11]. The adaptation of deep AI tools for VVC standardization is one successful example. At low or extremely low bitrates, it is very difficult to remove compression artefacts completely with traditional filters. Over the past several years, it has been confirmed that learned image/video compression and post-processing using DNNs can be a breakthrough in solving this problem, but at 10 to 100 times the computational complexity. The main impact of this work will be a reduction of network model size, memory usage, and computational complexity in AI-based image compression and post-processing. In the special use case discussed here, it will improve codec density by combining the best-performing traditional video coding algorithm with a learned post-processing scheme for image restoration and enhancement.

4. Methods

Image restoration covers the removal of motion blur, haze, raindrops, rain streaks, noise, and compression artefacts. This project takes the removal of compression artefacts as the special use case for demonstrating the performance of the proposed AttentionX model. The Dual Residual Network exploits the potential of paired operations, such as up-/down-sampling or convolutions with different kernel sizes, in a modular dual residual block (DRB) [12]. The Grouped Residual Dense Network (GRDN) uses cascaded grouped residual dense blocks followed by a convolutional block attention module (CBAM) for removing compression artefacts [1]. We will use a modified GRDN with multi-path attention fusion as the baseline network, together with a perceptional attention scheme based on multi-scale hierarchical feature fusion, to better capture long-range dependencies for removing compression artefacts and enhancing image quality.

Attention plays an important role in many computer vision problems, including learned image compression. As noted in Section 2, GCNet [6] combines the advantages of the Non-Local [7] and SE [4] modules into a more effective global context module. Through a rigorous empirical analysis, GCNet [6] observes that the global contexts modelled by a non-local network are almost the same for different query positions within an image. The next question for us is how to dynamically aggregate features into an attention map so as to enhance a decoded image under unknown combined distortions [13], namely how to split, fuse, and select features in each attention layer to generate perceptional attention weights from the input. In most existing methods, however, the weights of the attention layers are fixed parameters (“fixed attention”) that are learned during training and do not vary with the input [1]. SKNet [5] addresses this by introducing a dynamic kernel selection mechanism guided by multi-scale group convolutions. Operation-wise Attention [13] reports that attention weights have higher variance in the middle layers than in other layers, which suggests that the middle layers change their selection of operations more often. At this stage, the best solution for feature fusion is still unknown. This project will examine it carefully through experiments comparing early, middle, and late fusion for stacked or embedded attention layers, selected dynamically according to the distortion type of the input image.

In this proposal, we simplify the attention block by explicitly using a query-independent attention map for all query positions. This simplified block has a significantly smaller computation cost with almost no decrease in accuracy, as indicated in [6]. To model global context attention features, SENet [4] rescales the channels to recalibrate channel dependency with global context. CBAM [3] recalibrates the importance of both spatial positions and channels via rescaling. Both [3-4] adopt rescaling for feature fusion, which is not effective enough for global context modelling [6]. One reason is that the larger the target object is, the more attention needs to be assigned to it. However, when an attention block is inserted at a much higher layer, the scale information has been lost and this pattern disappears [5].
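The query-independent attention map described above can be sketched in NumPy as follows. This is a minimal illustration in the spirit of the simplified non-local block of [6], not the proposal's implementation: one attention map over all spatial positions is computed once, pooled into a single context vector, and added to every position. The names `global_context` and `simplified_nl_block` and the random weights `w_k` and `w_v` are assumptions made for the example.

```python
import numpy as np

def global_context(x, w_k):
    """Pool a (C, H, W) feature map into one context vector.

    w_k has shape (C,) and plays the role of a 1x1 convolution that
    produces one attention logit per spatial position (hypothetical weights).
    """
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                  # (C, HW)
    logits = w_k @ flat                         # (HW,)
    logits = logits - logits.max()              # numerical stability
    alpha = np.exp(logits)
    alpha = alpha / alpha.sum()                 # softmax over positions
    context = flat @ alpha                      # (C,) attention-weighted average
    return context, alpha

def simplified_nl_block(x, w_k, w_v):
    """Add the same transformed context to every query position."""
    context, _ = global_context(x, w_k)
    return x + (w_v @ context)[:, None, None]   # broadcast over H and W

rng = np.random.default_rng(0)
C, H, W = 6, 5, 5                               # assumed sizes for the demo
x = rng.standard_normal((C, H, W))
w_k = rng.standard_normal(C)
w_v = rng.standard_normal((C, C)) * 0.1
context, alpha = global_context(x, w_k)
y = simplified_nl_block(x, w_k, w_v)
```

Because the attention map is shared by all queries, it has H·W entries rather than the (H·W)² entries of a full pairwise attention map, which is the source of the reduced computation.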

Fig. 1 The basic block proposed in this project (α is the residual scaling parameter)

Fig. 2 The network architecture of our basic AttentionXNet model.

Fig. 3 The network architecture of our proposed PA-GRFN as a post-processing filter for enhancing high-frequency components.

GCNet [6] models this via addition fusion, in the form of a Simplified NL (SNL) block or a Global Context (GC) block. In this proposal, we propose a dynamic selection mechanism in the attention layer that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. As shown in Fig. 1, the building block is called a Dynamic Context (DC) block, in which two branches with different scales, namely an SNL block and a GC block, are fused using softmax attention guided by the information in these branches. Different attention weights on these branches yield different effective receptive field sizes for the neurons in the fusion layer. Three DC blocks are stacked into a deeper block termed the Flexible Attention (AttentionX) block. As shown in Fig. 2, multiple AttentionX blocks are stacked into a deep network termed Dynamic Context Networks (AttentionXNets), which can capture target objects at different scales, or with different receptive field sizes, per image. The final network architecture of AttentionXNet will be determined by our experiments after the project starts. At this stage, we expect more AttentionX layers to give better performance at the cost of increased complexity.
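The softmax fusion of two branches can be sketched as follows. The proposal does not specify the exact fusion layers, so this sketch assumes a selection step modelled on the selective-kernel idea of [5]: the branches are summed, squeezed into a channel descriptor, and mapped to per-channel softmax weights over the two branches. The function name `softmax_branch_fusion`, the hidden size `D`, and all weights are hypothetical, and the branch outputs `u_snl` and `u_gc` are random stand-ins for the SNL and GC branch outputs.

```python
import numpy as np

def softmax_branch_fusion(u_a, u_b, w_fc, w_a, w_b):
    """Fuse two (C, H, W) branch outputs with per-channel softmax weights.

    w_fc : (D, C) shared reduction weights    (hypothetical, untrained)
    w_a  : (C, D) logits for the first branch  (hypothetical, untrained)
    w_b  : (C, D) logits for the second branch (hypothetical, untrained)
    """
    # Fuse: sum the branches, then squeeze to one descriptor per channel.
    s = (u_a + u_b).mean(axis=(1, 2))                   # (C,)
    z = np.maximum(w_fc @ s, 0.0)                       # (D,)
    # Select: softmax across the two branches, separately for each channel.
    logits = np.stack([w_a @ z, w_b @ z])               # (2, C)
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)
    fused = (weights[0][:, None, None] * u_a
             + weights[1][:, None, None] * u_b)
    return fused, weights

rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 4                                 # assumed sizes
u_snl = rng.standard_normal((C, H, W))                  # stand-in for the SNL branch
u_gc = rng.standard_normal((C, H, W))                   # stand-in for the GC branch
w_fc = rng.standard_normal((D, C)) * 0.1
w_a = rng.standard_normal((C, D))
w_b = rng.standard_normal((C, D))
fused, weights = softmax_branch_fusion(u_snl, u_gc, w_fc, w_a, w_b)
```

Since the weights depend on the input through `s`, the mixture between the two branches changes per image, in contrast to fixed attention weights.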

For the use case design, we will combine the intra-frame coding of VVC with a grouped residual fusion network of perceptional attention (PA-GRFN) as a post-processing filter. As shown in Fig. 3, the Reconstruction Modules for Content, Structure, and Perception differ in their loss functions, which are MSE, MS-SSIM, and the perception loss defined in [15], respectively.
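For reference, the MSE content loss and the PSNR metric derived from it (the metric quoted for CLIC 2019 in Section 2) can be computed as in the sketch below. It assumes 8-bit images with a peak value of 255; the function names and the toy images are illustrative. MS-SSIM and the perception loss of [15] are not reproduced here.

```python
import math
import numpy as np

def mse(reference, decoded):
    """Mean squared error between two images of equal shape."""
    a = np.asarray(reference, dtype=np.float64)
    b = np.asarray(decoded, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(reference, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    err = mse(reference, decoded)
    if err == 0.0:
        return math.inf                         # identical images
    return 10.0 * math.log10(peak ** 2 / err)

# Toy example: one pixel out of 16 is off by 4, so MSE = 16 / 16 = 1.0.
ref = np.full((4, 4), 100, dtype=np.uint8)
dec = ref.copy()
dec[0, 0] = 104
print(mse(ref, dec), psnr(ref, dec))
```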

Compared to some other attention-based models [14], PA-GRFN aims at fewer parameters, lower computational complexity, and a more interpretable feature learning mechanism for perceptional understanding. The sequence of our experiments will be:

(1) Our first attempt will be the AttentionX block design. This model is relatively simple, but it will provide a simple benchmark to confirm the performance of the approach.

(2) The second attempt, the AttentionXNet network design, will apply the concept to the first step, removing compression artefacts.

(3) The implementation of the PA-GRFN model will require more time and resources, but ideally it will achieve state-of-the-art performance at reduced complexity as our second-step module, with image enhancement as the use case.

(4) If experiments 1, 2, and 3 result in the expected reduction in model complexity, we will extend the experiments with our two-step solution to the restoration of large compressed natural images.

CNNs generate feature representations of complex objects by collecting hierarchical semantic sub-features from different parts, but these features are usually distributed in grouped form in the feature vector of each layer, representing various semantic entities. For example, we can take a convolutional feature map with C channels and spatial size H × W and divide it into G groups along the channel dimension. SGE [10] uses a Spatial Group-wise Enhance module that adjusts the importance of each sub-feature by generating an attention factor for each spatial location in each semantic group, so that every group can autonomously enhance its learnt expression of the semantic feature regions and suppress possible noise. SGE measures spatial similarity with a simple dot product between the global semantic feature of a group (its spatial average) and each local feature, and outperforms SENet [4] on detecting small objects by about 1% AP. From this point of view, Semantic Attention with or without a segmentation map can be considered for our follow-on research project.
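The group-wise enhancement described above can be sketched in NumPy as follows. This is an illustrative reading of the SGE idea [10], not its reference code: channels are split into G groups, each position is scored by a dot product with the group's spatial average, and the normalised scores pass through a sigmoid to give one attention factor per position and group. The scale `gamma` and shift `beta` are placeholders for learned per-group parameters.

```python
import numpy as np

def spatial_group_enhance(x, groups, gamma=None, beta=None, eps=1e-5):
    """Group-wise spatial attention on a (C, H, W) feature map (sketch)."""
    C, H, W = x.shape
    assert C % groups == 0, "C must be divisible by the number of groups"
    xg = x.reshape(groups, C // groups, H * W)          # (G, C/G, HW)
    # Global semantic feature of each group: spatial average.
    g = xg.mean(axis=2, keepdims=True)                  # (G, C/G, 1)
    # Similarity of each position to the global feature: dot product.
    c = (xg * g).sum(axis=1)                            # (G, HW)
    # Normalise the scores over space within each group.
    c = (c - c.mean(axis=1, keepdims=True)) / (c.std(axis=1, keepdims=True) + eps)
    # Placeholder values; in a trained model these would be learned per group.
    gamma = np.ones(groups) if gamma is None else gamma
    beta = np.zeros(groups) if beta is None else beta
    a = 1.0 / (1.0 + np.exp(-(gamma[:, None] * c + beta[:, None])))  # (G, HW)
    out = xg * a[:, None, :]                            # same factor for all channels in a group
    return out.reshape(C, H, W), a.reshape(groups, H, W)

rng = np.random.default_rng(0)
C, H, W, G = 8, 4, 4, 2                                 # assumed sizes for the demo
x = rng.standard_normal((C, H, W))
y, attn = spatial_group_enhance(x, G)
```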

5. Risks

The incremental approach of 4 steps described in “Methods” will mitigate the risks involved in this research plan. At each step, the requirement is a significant reduction in complexity at similar performance with respect to the baseline model. If this is not achieved, intermediate approaches could be followed, such as substituting attention only in the deep layers of each model [1]. A comparison of performance and complexity will then be carried out at different depths to find the best trade-off between attention and convolutions before moving to the next step. The scope of the project and the complexity of the models could be adjusted while the project is running, by using additional models and resources that have not been initially considered in this plan [9-10, 12].

6. Timeline

The following timeline is planned for the 4 steps described in “Methods”:

(1) 4 person-weeks (possible early stop)

(2) 6 person-weeks

(3) 6 person-weeks (possible early stop)

(4) 8 person-weeks

The total effort is 24 person-weeks, or 2 persons for three months, with a retrospective and a potential early stop at the end of the first step (after two weeks) and of the third step (after two months).
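The timeline arithmetic can be checked with a short calculation. The sketch assumes that both team members work full time on each step and that the steps run strictly one after another; the variable names are illustrative.

```python
from itertools import accumulate

effort_person_weeks = [4, 6, 6, 8]      # steps 1 to 4 from the timeline
team_size = 2                           # assumed: both people full time

total_effort = sum(effort_person_weeks)
# Calendar weeks elapsed at the end of each step.
elapsed_weeks = [e / team_size for e in accumulate(effort_person_weeks)]

print(total_effort)    # 24
print(elapsed_weeks)   # [2.0, 5.0, 8.0, 12.0]
```

Under these assumptions the first checkpoint falls after 2 weeks, the second after 8 weeks (about two months), and the project ends after 12 weeks (about three months), matching the figures stated above.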

7. References

[1] S. Cho et al., “Low Bit-rate Image Compression based on Post-processing with Grouped Residual Dense Network”, Paper: http://openaccess.thecvf.com/content_CVPRW_2019/papers/CLIC%202019/Cho_Low_Bit-rate_Image_Compression_based_on_Post-processing_with_Grouped_Residual_CVPRW_2019_paper.pdf

[2] D. Kim et al., “GRDN: Grouped residual dense network for real image denoising and GAN-based real-world noise modeling”, https://arxiv.org/pdf/1905.11172.pdf, TestCode: https://github.com/BusterChung/NTIRE_test_code

[3] S. Woo et al., “CBAM: Convolutional block attention module”, https://arxiv.org/pdf/1807.06521.pdf, Code: https://github.com/Youngkl0726/Convolutional-Block-Attention-Module

[4] J. Hu et al., “Squeeze-and-excitation networks”, https://arxiv.org/pdf/1709.01507.pdf, Code: https://github.com/ResearchingDexter/SKNet_pytorch

[5] X. Li et al., “Selective kernel networks”, https://arxiv.org/pdf/1903.06586.pdf, Code: https://github.com/implus/SKNet

[6] Y. Cao et al., “GCNet: Non-local networks meet squeeze excitation networks and beyond”, https://arxiv.org/pdf/1904.11492.pdf, Code: https://github.com/xvjiarui/GCNet

[7] X. Wang et al., “Non-local neural networks”, https://arxiv.org/pdf/1711.07971v3.pdf, Code: https://paperswithcode.com/paper/non-local-neural-networks

[8] P. Ramachandran et al., “Stand-Alone Self-Attention in Vision Models”, https://arxiv.org/pdf/1906.05909.pdf

[9] I. Bello et al., “Attention Augmented Convolutional Networks”, https://arxiv.org/pdf/1904.09925.pdf

[10] X. Li et al., “Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks”, https://arxiv.org/pdf/1905.09646.pdf, Code: https://github.com/implus/PytorchInsight

[11] J. Diascorn, “How is AI technology impacting video compression?”, https://www.v-net.tv/2019/03/18/how-is-ai-technology-impacting-video-compression/

[12] X. Liu et al., “Dual Residual Networks Leveraging the Potential of Paired Operations for Image Restoration”, http://openaccess.thecvf.com/content_CVPR_2019/papers/Liu_Dual_Residual_Networks_Leveraging_the_Potential_of_Paired_Operations_for_CVPR_2019_paper.pdf

[13] M. Suganuma et al., “Attention-based Adaptive Selection of Operations for Image Restoration in the Presence of Unknown Combined Distortions”, http://openaccess.thecvf.com/content_CVPR_2019/papers/Suganuma_Attention-Based_Adaptive_Selection_of_Operations_for_Image_Restoration_in_the_CVPR_2019_paper.pdf

[14] Z. Hui, “Progressive Perception-Oriented Network for Single Image Super-Resolution”, https://arxiv.org/pdf/1907.10399.pdf, Code: https://github.com/Zheng222/PPON

[15] T. Tariq, “A HVS-inspired Attention to Improve Loss Metrics for CNN-based Perception-Oriented Super-Resolution”, https://arxiv.org/pdf/1904.00205.pdf
