mxahan.github.io

Paper Review Part 1

I understand the hard work a researcher goes through to get a publication or to keep research going. As a researcher, I take ideas from multiple sources so that my own research can contribute effectively to my field. It is absolutely unacceptable to knowingly plagiarize or steal ideas from people without properly crediting them.

This blog contains reviews of different interesting papers, written for my personal interest. I have tried to cite them correctly throughout. Please let me know if I have failed to cite something properly. I would also be glad if somebody wants to discuss or argue with my understanding; I am always willing to change it based on facts. {(zhasan3@umbc.edu) subject: Missed citation in proper way/ Paper discussion from blog}

Paper List

Paper1

Clevrer: Collision events for video representation and reasoning [1]

IMRAD

Missing in existing datasets: the underlying logic and the temporal and causal structure behind the reasoning process! The authors study these gaps from a complementary perspective.

The paper proposes a video dataset and provides a benchmark for video reasoning models.

Motivation: CLEVR, psychology

Figure: Sample from the dataset [1]

Analysis of various visual reasoning models on the CLEVRER.

Prior Arts

Video Understanding, VQA, Physical and Causal reasoning.

Argument and Assumption/ Context

A dataset that fills the missing domain should offer: video, diagnostic annotations, temporal relations, explanation, prediction, and counterfactual questions. The proposed dataset needs to match all of these.

Figure: Dataset Comparison [1]

Problem statement

Missing in existing datasets: the underlying logic and the temporal and causal structure behind the reasoning process! Can we prepare a dataset that covers these missing links?

contribution

Dataset

NS-DR model ??

Approach and Experiments:

Controlled environment. Offered CLEVRER dataset. CLEVRER:

Evaluations

Baseline models used:

Figure: Evaluated models [1]

Results:

Show different model strengths.

Neuro-Symbolic Dynamic Reasoning - Model

oracle model

My thoughts

Reference

[1] Yi, Kexin, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. “Clevrer: Collision events for video representation and reasoning.” arXiv preprint arXiv:1910.01442 (2019).

Paper2

A Style-Based Generator Architecture for Generative Adversarial Networks [2]

Interesting blog

IMRAD

Prior Arts

Argument and Assumption/ Context

Problem statement

Improve the traditional GAN generator by learning styles layer by layer. Can the features be selected better by introducing a trick before feeding them to the generator? Is it possible to disentangle the styles and features?

contribution

Approach and Experiments:

Style based Generator:

A mapping network takes the random vector z to an intermediate latent vector w of the same size (I guess the loss backpropagates through it too). A learned affine transformation (how?) maps w to the style y, so y comes from a random vector instead of a style image! The style y then feeds the AdaIN layer.

Introducing Gaussian noise causes stochastic variation in the generated images.

Figure: Source

Important note about the style generator: each style is overridden by the next AdaIN operation, so each AdaIN controls only one convolution layer. This is because normalization and rescaling happen after each layer.
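
The path from z to a styled feature map can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions, not the authors' implementation: the layer sizes, the random weights standing in for trained ones, and the names `mapping_network`, `A`, and `noise_strength` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mapping_network(z, layers):
    """Map the random vector z to the intermediate latent w (same width)."""
    h = z
    for W, b in layers:
        h = h @ W + b
        h = np.where(h > 0, h, 0.2 * h)  # leaky ReLU
    return h

def adain(x, y_scale, y_bias, eps=1e-5):
    """Normalize each channel of x (C, H, W), then rescale and shift by the style."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    return y_scale[:, None, None] * x_norm + y_bias[:, None, None]

latent_dim, channels, size = 8, 4, 16

# Hypothetical random parameters standing in for trained ones.
mlp = [(rng.normal(0, 0.5, (latent_dim, latent_dim)), np.zeros(latent_dim))
       for _ in range(2)]
A = rng.normal(0, 0.5, (latent_dim, 2 * channels))  # affine map from w to y
A_bias = np.concatenate([np.ones(channels), np.zeros(channels)])
noise_strength = np.full(channels, 0.1)  # per-channel noise scaling

z = rng.normal(size=latent_dim)          # random vector, no style image involved
w = mapping_network(z, mlp)              # intermediate latent
y = w @ A + A_bias                       # style = (scale, bias) per channel
y_scale, y_bias = y[:channels], y[channels:]

features = rng.normal(size=(channels, size, size))  # stand-in for a conv output
noise = rng.normal(size=(1, size, size))            # one noise image, shared by channels
features = features + noise_strength[:, None, None] * noise
styled = adain(features, y_scale, y_bias)
```

Because AdaIN renormalizes every channel, the per-channel mean and spread of `styled` are set entirely by `y`, which is why the style from one layer does not carry over past the next AdaIN.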

Disentanglement studies: ideally, the latent space consists of linear subspaces, each of which controls one factor of variation. The latent space W is learned from the random vector through the mapping f(z).

where,

and

Evaluations

Results:

Figure: Source [2]

My thoughts

This is not the traditional style transfer from one image to another, but rather from one group to another. There is no image input here, only the random vector, as in the original GAN.

Key ideas & Piece of Pie

AdaIN in generator. Feature transform network for the latent random variables.

Reference

[2] Karras, Tero, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative adversarial networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

(R. Zhang et al 2018) Zhang, Richard, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. “The unreasonable effectiveness of deep features as a perceptual metric.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

Paper3

Arbitrary Style transfer in Real-time with Adaptive Instance Normalization [3]

IMRAD

Prior Arts

Argument and Assumption/ Context

Problem statement

Can the generation depend more on statistical features than on the pixel values themselves? Can the features be transformed statistically to match the style?

contribution (Piece of Pie)

AdaIN

Background

Where

and SD accordingly.

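
As I read the paper, AdaIN replaces the per-channel mean and standard deviation of the content features with those of the style features: AdaIN(x, y) = σ(y) (x − μ(x)) / σ(x) + μ(y), with μ and σ computed over the spatial positions of each channel. A minimal NumPy sketch, where the function names and the random arrays standing in for encoder features are my own assumptions:

```python
import numpy as np

def channel_stats(feat, eps=1e-5):
    """Per-channel mean and standard deviation over the spatial axes of (C, H, W)."""
    mean = feat.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(feat.var(axis=(1, 2), keepdims=True) + eps)
    return mean, std

def adain(content_feat, style_feat):
    """Align the channel statistics of the content features with the style features."""
    c_mean, c_std = channel_stats(content_feat)
    s_mean, s_std = channel_stats(style_feat)
    return s_std * (content_feat - c_mean) / c_std + s_mean

rng = np.random.default_rng(0)
content = rng.normal(2.0, 3.0, size=(3, 8, 8))    # stand-in for encoder features of the content image
style = rng.normal(-1.0, 0.5, size=(3, 10, 10))   # style features may have a different spatial size
out = adain(content, style)
```

Note that AdaIN has no learnable parameters here; the scale and shift come directly from the style input.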

Approach and Experiments:

Preprocessing

Architecture:

Encoder f, Adaptive IN layer, Decoder g

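Features t = AdaIN(f(c), f(s)), where c is the content image and s is the style image.

And reconstructed image T(c, s) = g(t).

The encoder f is taken from a pretrained VGG-19 model.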

Figure: Mind the AdaIN layer in between enc-dec. [3]

Loss Function: Weighted sum of the two losses, a content loss and a style loss.
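
Written out, the weighted sum looks as follows, as far as I understand the paper. Here f is the encoder, g the decoder, t the AdaIN output, s the style image, φ_i the i-th VGG-19 layer used for the style loss, and λ the style weight; μ and σ are the per-channel mean and standard deviation.

```latex
\begin{aligned}
\mathcal{L} &= \mathcal{L}_c + \lambda \, \mathcal{L}_s \\
\mathcal{L}_c &= \lVert f(g(t)) - t \rVert_2 \\
\mathcal{L}_s &= \sum_{i=1}^{L} \lVert \mu(\phi_i(g(t))) - \mu(\phi_i(s)) \rVert_2
               + \sum_{i=1}^{L} \lVert \sigma(\phi_i(g(t))) - \sigma(\phi_i(s)) \rVert_2
\end{aligned}
```

The content loss compares against the AdaIN output t rather than the raw content features, and the style loss only matches the same statistics that AdaIN transfers.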

Evaluations

Results:

Dataset

Metrics

Alternatives:

Test-time controls (no modification to training)

Content-style trade-off: convex interpolation between the encoder output and the AdaIN output.

Example

Figure: Source [3]

To interpolate between a set of K styles, the authors take a convex weighted sum of the AdaIN features for the K styles.


Example

Figure: Source [3]
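
Both test-time controls act on features only, so they can be sketched without the decoder. This is a rough NumPy illustration under my own assumptions: the random arrays stand in for encoder outputs, the function names are hypothetical, and in the paper the blended features would still be passed through the decoder g.

```python
import numpy as np

def adain(content_feat, style_feat, eps=1e-5):
    """Give the content features (C, H, W) the channel mean/std of the style features."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = np.sqrt(content_feat.var(axis=(1, 2), keepdims=True) + eps)
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = np.sqrt(style_feat.var(axis=(1, 2), keepdims=True) + eps)
    return s_std * (content_feat - c_mean) / c_std + s_mean

def blend_content_style(content_feat, style_feat, alpha):
    """Content-style trade-off: alpha=0 keeps the content features, alpha=1 is full AdaIN."""
    return (1.0 - alpha) * content_feat + alpha * adain(content_feat, style_feat)

def interpolate_styles(content_feat, style_feats, weights):
    """Convex weighted sum of the AdaIN features for K styles."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * adain(content_feat, s) for w, s in zip(weights, style_feats))

rng = np.random.default_rng(0)
content = rng.normal(size=(3, 8, 8))
styles = [rng.normal(loc, 1.0, size=(3, 8, 8)) for loc in (-2.0, 2.0)]

half_styled = blend_content_style(content, styles[0], alpha=0.5)
mixed = interpolate_styles(content, styles, weights=[0.25, 0.75])
```

Since neither control touches the trained weights, the same network can be steered per image at inference time.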

My thoughts

Key ideas & Piece of Pie

Reference

[3] Huang, Xun, and Serge Belongie. “Arbitrary style transfer in real-time with adaptive instance normalization.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501-1510. 2017.

Paper4

THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS [4]

IMRAD

Train a smaller network from the start instead of pruning afterwards. Pruned networks are hard to train from scratch, and sparse networks learn slowly. Is there a smaller subnetwork that trains well?

A randomly initialized, dense NN contains a subnetwork that is initialized such that, when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations. Mathematically: consider a dense network f(x; θ) with initial parameters θ = θ0. After training, it reaches test accuracy a and minimum validation loss l at iteration j. Now train f(x; m ⊙ θ) with a fixed mask m ∈ {0, 1}^|θ|, starting from m ⊙ θ0; it reaches test accuracy a' and minimum validation loss l' at iteration j'. According to the lottery ticket hypothesis, there exists an m such that j' ≤ j, a' ≥ a, and m has far fewer 1s than θ has parameters.

The subnetwork needs the appropriate initialization. Train the full network, then prune it to get the winning ticket, and reset the remaining weights to their original initial values.
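
The train, prune, and reset loop can be tried on a toy problem. The sketch below is my own illustration and not the authors' code: it swaps the neural network for a linear regression model trained by gradient descent, and the helper names, pruning fraction, and number of rounds are assumptions chosen for the example.

```python
import numpy as np

def train(theta_init, mask, X, y, steps=200, lr=0.1):
    """Gradient descent on squared error; masked weights stay at zero."""
    theta = theta_init * mask
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad * mask
    return theta

def find_winning_ticket(theta0, X, y, rounds=3, prune_frac=0.5):
    """Iteratively train, prune the smallest surviving weights, and reset to theta0."""
    mask = np.ones_like(theta0)
    for _ in range(rounds):
        theta = train(theta0, mask, X, y)          # always restart from the original init
        alive = np.flatnonzero(mask)
        n_prune = int(prune_frac * alive.size)
        weakest = alive[np.argsort(np.abs(theta[alive]))[:n_prune]]
        mask[weakest] = 0.0                        # prune by magnitude
    return mask

rng = np.random.default_rng(0)
n_features = 16
X = rng.normal(size=(256, n_features))
true_w = np.zeros(n_features)
true_w[:2] = [3.0, -2.0]                           # only two features matter
y = X @ true_w

theta0 = rng.normal(0, 0.1, n_features)            # the original initialization
mask = find_winning_ticket(theta0, X, y)
ticket = train(theta0, mask, X, y)                 # sparse model trained in isolation
ticket_loss = np.mean((X @ ticket - y) ** 2)
```

In this convex toy the starting values matter much less than in a deep network, so the example only shows the mechanics of the procedure, not the paper's finding about initialization.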

Implications: improve training performance, design better networks, and improve theoretical understanding.

Prior Arts

Argument and Assumption/ Context

Only a few weights are important (which ones depends on the initialization). Different initializations give different important weights. Given the same initialization, the weights will reach the same conclusion.

Problem statement

Can we train a smaller network directly, rather than pruning a large one, by initializing it appropriately?

contribution (Piece of Pie)

On computer vision tasks, a small subnetwork with the appropriate initialization learns the same task as the big network, with fewer parameters.

Background

Approach and Experiments:

On Fully connected

Convolutional network.

Evaluations strengths and Weakness

Weakness: evaluated only on image tasks and small datasets! The results depend on the learning rate.

Results:

Exhaustive results on: 1. pruning strategy; 2. early stopping; 3. training accuracy; 4. comparing random reinitialization and random sparsity; 5. examining winning tickets; 6. hyperparameter exploration for fully connected networks; 7. hyperparameter exploration for CNNs; 8. hyperparameter exploration for VGG and ResNet.

My thoughts

Key ideas & Piece of Pie

Different initializations give different important weights. Given the same initialization, the weights will reach the same conclusion.

Reference

[4] Frankle, Jonathan, and Michael Carbin. “The lottery ticket hypothesis: Finding sparse, trainable neural networks.” arXiv preprint arXiv:1803.03635 (2018).

Paper5

Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification [5]

IMRAD

Problems: 3D ConvNets fail to capture long-term relationships across video frames and are over-parameterized! Existing approaches also depend on optical-flow features.

The paper proposes a network architecture (motivated by the Conv2D DenseNet) for feature extraction in video action recognition, and introduces the notion of a 3D temporal transition layer (TTL).

The network (T3D) is claimed to give an efficient representation and to be computationally efficient and robust. Evaluated on three datasets: HMDB51, UCF101, Kinetics.

Additional contribution: supervised transfer across architectures for weight initialization (student-teacher knowledge distillation).

Prior Arts

Three related lines of research: video classification, temporal ConvNets, and transfer learning.

Argument and Assumption/ Context

Problem statement

contribution (Piece of Pie)

Conv3D DenseNet with TTL for video action recognition (AR). Cross-architecture knowledge distillation from ImageNet to initialize the AR network. End-to-end training.

Background

Conv2D DenseNet, GoogLeNet.

Approach and Experiments:

Architecture:

Figure: Overall architecture [5]. The TTL considers multiple temporal depths. The dense block is similar to the one in the 2D ConvNet, but built for a 3D ConvNet.
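
The multi-depth idea behind the TTL can be illustrated with a temporal-only simplification. This is a sketch under heavy assumptions and not the T3D layer itself: the real layer uses learned 3D convolutions and 3D pooling, while here fixed averaging kernels act along the time axis only, and the depths (1, 3, 5) and function names are hypothetical.

```python
import numpy as np

def temporal_conv(x, depth):
    """Filter x (C, T, H, W) along time with a window of `depth` frames, same length out."""
    kernel = np.ones(depth) / depth  # fixed kernel standing in for learned weights
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), 1, x)

def temporal_transition_layer(x, depths=(1, 3, 5)):
    """Run parallel branches with different temporal depths, concatenate, then pool in time."""
    branches = [temporal_conv(x, d) for d in depths]
    stacked = np.concatenate(branches, axis=0)          # (len(depths) * C, T, H, W)
    t_even = stacked.shape[1] - stacked.shape[1] % 2    # drop a trailing odd frame
    c, _, h, w = stacked.shape
    # Average-pool pairs of neighboring frames (temporal stride 2).
    return stacked[:, :t_even].reshape(c, t_even // 2, 2, h, w).mean(axis=2)

rng = np.random.default_rng(0)
clip_features = rng.normal(size=(2, 8, 4, 4))           # (channels, frames, height, width)
out = temporal_transition_layer(clip_features)
```

Each branch sees a different temporal extent, so the concatenated output mixes short-range and longer-range motion information before the next dense block.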

Transfer Learning:

Figure: Knowledge transfer from a 2D DenseNet pre-trained on ImageNet to activity recognition. [5]

Model depth increases accuracy!

Experiments with multi frame resolutions.

PyTorch and GPU implementation. Experiments with both supervised transfer and training from scratch (Kinetics dataset).

Comparison with 3D ConvNets based on Inception and ResNet.

Evaluations strengths and Weakness

Results:

State of the art performance using supervised pretraining.

My thoughts

Cross domain transfer! How long to pretrain! Parameters claim validation! Ablation study description! Frame rate and size dependency.

Key ideas & Piece of Pie

Cross-domain transfer pretraining.

Reference

[5] Diba, Ali, Mohsen Fayyaz, Vivek Sharma, Amir Hossein Karami, Mohammad Mahdi Arzani, Rahman Yousefzadeh, and Luc Van Gool. “Temporal 3d convnets: New architecture and transfer learning for video classification.” arXiv preprint arXiv:1711.08200 (2017).

Paper6

ViP: Virtual Pooling for Accelerating CNN-based Image Classification and Object Detection [6]

IMRAD

New pooling method. It computes only some of the filter activations and interpolates the rest. It can be added to an existing network. The authors provide a bound on the error introduced by the ViP layer.

Prior Arts

CNN acceleration:

Argument and Assumption/ Context

Spatial redundancy can be removed by computing some intermediate values and interpolating the rest, so less compute is spent on neighboring values that are nearly identical. Values are computed at a fixed interval, and the missing points are interpolated in each filter output for each channel.
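
The compute-then-interpolate idea can be sketched for a single channel in NumPy. This is my own minimal illustration, not the ViP implementation: the function names are hypothetical, the convolution is a plain loop, and the test image is a linear ramp chosen so that linear interpolation happens to be exact.

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """2D cross-correlation without padding, evaluated only at strided positions."""
    kh, kw = kernel.shape
    rows = range(0, image.shape[0] - kh + 1, stride)
    cols = range(0, image.shape[1] - kw + 1, stride)
    return np.array([[np.sum(image[r:r + kh, c:c + kw] * kernel) for c in cols]
                     for r in rows])

def upsample_linear(coarse, out_shape, stride):
    """Fill the skipped positions by linear interpolation, first along columns, then rows."""
    rows_known = np.arange(coarse.shape[0]) * stride
    cols_known = np.arange(coarse.shape[1]) * stride
    rows_all, cols_all = np.arange(out_shape[0]), np.arange(out_shape[1])
    tmp = np.array([np.interp(cols_all, cols_known, row) for row in coarse])
    return np.array([np.interp(rows_all, rows_known, col) for col in tmp.T]).T

def vip_conv(image, kernel, stride=2):
    """Compute the filter output at a fixed interval and interpolate the rest."""
    full_shape = (image.shape[0] - kernel.shape[0] + 1,
                  image.shape[1] - kernel.shape[1] + 1)
    coarse = conv2d_valid(image, kernel, stride)
    return upsample_linear(coarse, full_shape, stride), coarse

ii, jj = np.meshgrid(np.arange(11), np.arange(11), indexing="ij")
image = ii + 2.0 * jj                       # smooth ramp: neighboring outputs are redundant
kernel = np.ones((3, 3)) / 9.0

exact = conv2d_valid(image, kernel)         # every output position computed
approx, coarse = vip_conv(image, kernel)    # only every other position computed
```

With stride 2 in both directions, `coarse` holds roughly a quarter of the values in `exact`; on real feature maps the interpolated values would only approximate the skipped ones.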

Problem statement

Improve on PerforatedCNNs by removing the data dependency. Can we leverage the spatial redundancy in CNN channels?

contribution (Piece of Pie)

CNN computation reduction. Compute only some intermediate values and linearly interpolate the rest.

Background

Approach and Experiments:

Algorithm for the ViP

Experimented on

Evaluations strengths and Weakness

So it is basically dropping every other value. Does it work at smaller resolutions, when the information sits in a single line? Or is it just low-pass filtering over the masked channel?

Results:

Accuracy drops only a little, and computational speed increases significantly.

My thoughts

Key ideas & Piece of Pie

Low-pass filtering over the masked conv output channels works!

Reference

[6] Chen, Zhuo, Jiyuan Zhang, Ruizhou Ding, and Diana Marculescu. “Vip: Virtual pooling for accelerating cnn-based image classification and object detection.” In The IEEE Winter Conference on Applications of Computer Vision, pp. 1180-1189. 2020.

Paper7

link
