
Trending Topics

So far in this writing I have covered:

1. Representation Learning

Representation Learning: A Review and New Perspectives (Bengio, Y. et al. 2014)

original paper

Review paper on unsupervised learning covering probabilistic models, auto-encoders, manifold learning, and deep networks for learning data representations.

Representation learning: learning representations of the data that make it easier to extract useful information when building classifiers and other predictors.

Application:
What makes Representation good?
Building Deep Representations

Greedy layerwise unsupervised pre-training: the resulting deep features are used as input to another standard classifier. Another approach is greedy layerwise supervised pre-training. Layerwise stacking often leads to better representations.

single-layer learning modules

Two approaches

PCA is connected to all three: probabilistic models, encoder-decoders, and manifolds.

Probabilistic Model

Learning the joint distribution $p(x, h)$ and the posterior distribution $p(h \mid x)$, where $h$ denotes the latent variables for the observed data $x$.

1. Directed Graphical Models

Probabilistic interpretation of PCA

Prior: $p(h) = \mathcal{N}(h;\, 0,\, \sigma_h^2 I)$

And the likelihood: $p(x \mid h) = \mathcal{N}(x;\, W h + \mu_x,\, \sigma_x^2 I)$

Sparse Coding

The clever part is the choice of prior and likelihood. A variation of sparse coding leads to spike-and-slab sparse coding (S3C); Goodfellow et al. showed that S3C outperforms sparse coding.
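For reference, sparse coding's MAP inference is the familiar L1-regularized reconstruction problem (a Laplace prior on $h$ paired with a Gaussian likelihood for $x$):

$$h^{*} = \arg\min_{h}\; \|x - W h\|_2^2 + \lambda \|h\|_1$$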

2. Undirected Graphical Models

Markov random fields express the joint probability as a product of clique potentials. A special form of Markov random field is the Boltzmann distribution, with positive clique potentials. Here comes the notion of an energy function.

Restricted Boltzmann machines (RBMs)

A small modification of the energy function of the Boltzmann machine leads to the restricted Boltzmann machine.
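For the binary case, the standard RBM energy and joint distribution are

$$E(x, h) = -b^\top x - c^\top h - h^\top W x, \qquad P(x, h) = \frac{1}{Z} e^{-E(x, h)},$$

where the restriction is that there are no visible-visible or hidden-hidden interaction terms, which makes the conditionals $P(h \mid x)$ and $P(x \mid h)$ factorial and easy to sample.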

Generalizations of the RBM to real-valued data

Directly Learning A parametric Map from input to Representation

Exact inference in such graphical models becomes intractable once multiple layers are stacked.

1. Auto-encoders

Encoder: for each data instance $x^{(t)}$, the encoder computes a feature vector $h^{(t)} = f_\theta(x^{(t)})$.

Decoder: maps the code back to input space, $r^{(t)} = g_\theta(h^{(t)})$.

The objective function to minimize is the reconstruction error

$$\mathcal{J}_{AE}(\theta) = \sum_t L\big(x^{(t)}, g_\theta(f_\theta(x^{(t)}))\big)$$
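A minimal sketch of this encoder/decoder/reconstruction-loss setup (assuming PyTorch; the layer sizes, sigmoid activations, and MSE loss are illustrative choices, not from the paper):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, latent_dim), nn.Sigmoid())  # h = f(x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, input_dim), nn.Sigmoid())  # r = g(h)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                        # dummy batch standing in for real data
loss = nn.functional.mse_loss(model(x), x)     # reconstruction loss L(x, g(f(x)))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```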

2. Regularized Auto-encoders

Use a bottleneck by forcing the latent space dimension to be lower than the input space dimension.

Sparse auto-encoders

Apply a sparsity regularizer to force the latent representation to be close to zero.
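Continuing the autoencoder sketch above (again an assumed implementation, not the paper's), the sparsity penalty can be an L1 term on the latent activations:

```python
h = model.encoder(x)
sparsity_weight = 1e-3                         # assumed hyperparameter
loss = nn.functional.mse_loss(model.decoder(h), x) + sparsity_weight * h.abs().mean()
```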

Representation learning as Manifold Learning

Maybe I will add this later.

Connection between probabilistic and Directed encoding Models

Global training in Deep Models

Building in invariance

Generating transformed Examples

Convolution and pooling

Temporal coherence and Slow features

Algorithms to disentangle factors of Variation

2. Region based Object Detectors

original post link

3 Parts:

Sliding-window Detectors

# slide a window over the image and classify every patch
for window in windows:
    patch = get_patch(image, window)
    results = detector(patch)


Selective Search (SS)

This is a good point to step back and talk about region proposal methods, which are an alternative to (and an improvement over) sliding-window detectors. The OpenCV tutorial lists several proposal methods.

As things stand, Selective Search is the most commonly used of these methods.

Selective Search is fast and has high recall. It computes a hierarchical grouping of similar regions based on color, texture, size, and shape compatibility (fill).

The final similarity is a weighted sum of these four measures: $s(r_i, r_j) = a_1 s_{colour}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)$.

A region proposal method is used to find regions of interest (ROIs).

R-CNN

R-CNN takes about 2,000 ROIs from a region proposal method. The regions are warped into fixed-size images and fed into a CNN individually, followed by fully connected layers that classify the object and refine the boundary box.

The system flow


Pseudocode

ROIs = region_proposal(image)        # e.g. Selective Search, ~2,000 proposals
for ROI in ROIs:
    patch = get_patch(image, ROI)    # crop and warp each proposal
    results = detector(patch)        # run the full CNN once per proposal (slow)

Boundary box regressor
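The regressor predicts offsets relative to the proposal rather than absolute coordinates; the usual parameterization from the R-CNN family of papers is

$$t_x = (x - x_a)/w_a,\quad t_y = (y - y_a)/h_a,\quad t_w = \log(w/w_a),\quad t_h = \log(h/h_a),$$

where $(x_a, y_a, w_a, h_a)$ is the proposal (or anchor) box and $(x, y, w, h)$ is the predicted box.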

Fast R-CNN

R-CNN is slow because it runs the full CNN on every proposal. Fast R-CNN instead extracts features once for the whole image with a CNN feature extractor, and still uses an external region proposal method (e.g. Selective Search) to create ROIs, which are then pooled from the shared feature map.


Network flow:


Pseudocode

feature_maps = process(image)               # run the CNN once on the whole image
ROIs = region_proposal(image)               # external proposals (e.g. Selective Search)
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)  # crop features instead of pixels
    results = detector2(patch)

Here comes the multitask loss (classification loss plus localization loss).
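Written out (as in the Fast R-CNN paper), the per-ROI multitask loss is

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda\,[u \ge 1]\, L_{loc}(t^u, v),$$

where $p$ is the predicted class distribution, $u$ the ground-truth class, $t^u$ the predicted box offsets for class $u$, $v$ the ground-truth box targets, and $[u \ge 1]$ switches the localization term off for the background class.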

Faster R-CNN

feature_maps = process(image)
ROIs = region_proposal(image)               # Expensive!
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    results = detector2(patch)

Network flow


The external region proposal method is replaced by a Region Proposal Network (RPN), a small convolutional network.


Region Proposal network

ZF networks structure


The RPN overcomes the computational complexity of Selective Search by proposing regions from the CNN feature maps instead of the original image. There are three key steps.

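One piece of the RPN that is easy to make concrete is the anchor layout: k anchor boxes (one per scale/aspect-ratio pair) are placed at every feature-map location. A small sketch (assumed code, not the post's):

```python
import numpy as np

def generate_anchors(fm_h, fm_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return one (x1, y1, x2, y2) box per (location, scale, aspect ratio) combination."""
    anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            cx, cy = x * stride, y * stride          # anchor centre in image coordinates
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = generate_anchors(fm_h=38, fm_w=50)         # 38 * 50 * 9 = 17,100 candidate boxes
```

The RPN then scores each anchor for objectness, regresses box offsets, and passes the top proposals on as ROIs.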

Region-based Fully convolutional Networks (R-FCN)

Faster R-CNN

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
for ROI in ROIs:
    patch = roi_pooling(feature_maps, ROI)
    class_scores, box = detector(patch)         # Expensive!
    class_probabilities = softmax(class_scores)

R-FCN

feature_maps = process(image)
ROIs = region_proposal(feature_maps)
score_maps = compute_score_map(feature_maps)
for ROI in ROIs:
    V = region_roi_pool(score_maps, ROI)
    class_scores, box = average(V)              # Much simpler!
    class_probabilities = softmax(class_scores)

Networks


SSD

Inspired by blog

SSD eliminates the need for a region proposal network.

2 parts

Figure: source (38x38x512 to 38x38x4x(21+4)). The additional 4 values are the box coordinate offsets.


MultiBox: making multiple predictions, each containing a boundary box with class scores (here, 4 boxes per location, each with class scores).

The core advantage of SSD is its use of multi-scale feature maps.

Figure: source paper (at some auxiliary stages SSD makes 6 box predictions per location).

Default boundary boxes are a similar idea to the anchors in Faster R-CNN. This is the hard part: the model needs to predict boxes of different shapes, and the default boxes are chosen manually.

The boxes are labeled as positive or negative based on the IoU between the matching default boxes and the ground truth.
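A small sketch of the IoU computation used for this matching (assumed code, boxes as (x1, y1, x2, y2); the 0.5 threshold is the commonly used value):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# a default box counts as positive if iou(default_box, gt_box) > 0.5
```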

Figure: source. Multi-scale feature maps and default boundary boxes; higher-resolution maps can detect smaller objects.

Loss function: the sum of the localization loss (mismatch between the ground truth and the predicted positive matches) and the confidence loss (classification loss).
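As written in the SSD paper,

$$L(x, c, l, g) = \frac{1}{N}\big(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\big),$$

where $N$ is the number of matched default boxes, $L_{loc}$ is a smooth L1 loss between the predicted boxes $l$ and the ground truth $g$, and $L_{conf}$ is a softmax loss over the class confidences $c$.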

3. Region based Fully Convolutional networks (R-FCN)

Original post link

Two stage detection

There is a notion of position, which leads to position-sensitive score maps. Example: for 9 relative positions (a 3x3 grid) we get nine score maps.

We then pool from each score map according to position: from each of the 9 score maps we select the cell corresponding to that position within the ROI and use it for voting.


This creates a total of (C+1)x3x3 score maps. Further example: a person detector with 9 relative positions gives 9 score maps; we take one cell from each of the 9 maps and combine them for the vote.
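A rough sketch of this position-sensitive ROI pooling and voting (assumed code and channel layout, not from the post; it assumes the ROI spans at least k cells in each direction):

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3, num_classes=21):
    """score_maps: (H, W, k*k*num_classes) array; roi: (x1, y1, x2, y2) in map coords."""
    x1, y1, x2, y2 = roi
    row_bins = np.array_split(np.arange(y1, y2), k)      # split the ROI into a k x k grid
    col_bins = np.array_split(np.arange(x1, x2), k)
    votes = np.zeros((k, k, num_classes))
    for i in range(k):
        for j in range(k):
            start = (i * k + j) * num_classes            # maps reserved for position (i, j)
            cell = score_maps[np.ix_(row_bins[i], col_bins[j])][:, :, start:start + num_classes]
            votes[i, j] = cell.mean(axis=(0, 1))         # average-pool within the bin
    return votes.mean(axis=(0, 1))                       # vote: average over the k x k bins
```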


The R-FCN network is shown above; the following shows the flow the network follows.


Boundary Box Regression

A convolutional filter creates k x k x (C+1) score maps. Another convolutional filter creates 4 x k x k maps from the same feature maps. We apply position-sensitive ROI pooling to compute a k x k array whose elements contain boundary box offsets, and finally average them.

Region Proposal Network (RPN) - Backbone of Faster R-CNN

original post link


Feature Pyramid Networks (FPN)


Key Idea: Multi-scale Feature map

Top-down and bottom-up pathways:

Figure: FPN with RPN (the 3x3 and 1x1 convs are the RPN head)

Total structure


4. Deep learning for object Detection P2

original post

R-CNN and Fast R-CNN

Figure: R-CNN modules

Figure: Fast R-CNN module

Faster R-CNN Architecture

Figure: Building blocks of Faster R-CNN

Spatial Pyramid pooling

Figure: network with a spatial pyramid pooling layer

The fixed-size constraint comes only from the fully connected layers.


ROI pooling layer

interesting blog

Figure: ROI pooling vs SPP

Figure: Zooming into the network

ROI (region of interest) pooling converts the region proposals into the fixed shape required by the fully connected layers for classification.

ROI pooling takes two inputs: the feature map and the region proposals.

The ROI pooling layer converts each ROI (regardless of the input feature map or proposal size) into a fixed-dimension map.
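A rough sketch of ROI max pooling (assumed code, not the blog's; it assumes the ROI is at least output_size cells in each direction):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """feature_map: (H, W, C) array; roi: (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    out = np.zeros((output_size, output_size, feature_map.shape[2]))
    row_bins = np.array_split(np.arange(y1, y2), output_size)   # split the ROI into a grid
    col_bins = np.array_split(np.arange(x1, x2), output_size)
    for i in range(output_size):
        for j in range(output_size):
            cell = feature_map[np.ix_(row_bins[i], col_bins[j])]
            out[i, j] = cell.max(axis=(0, 1))                    # max pool within each grid cell
    return out                                                   # fixed (7, 7, C) output per ROI
```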

NN prune

detail TL;DR: the original inspiration comes from biological synaptic pruning. In a neural network, rank the individual weights and drop the smallest p% by setting them to zero; this is weight pruning. Whole neurons can also be dropped, by deleting an entire column of the weight matrix based on its L2 norm.

detail TL;DR: an old idea going back to Yann LeCun. The main point is how to rank the neurons. The ranking is usually done by the L1/L2 norm of the weights, the activations, how often a neuron is zero at validation time, etc. After pruning, the network's performance drops, and it is recovered by retraining iteratively.

Not popular yet because of the pain of implementation, the instability of the ranking methods, and some authors' unwillingness to share.
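A minimal sketch of the two pruning styles described above (assumed implementation, not from the linked posts):

```python
import numpy as np

def prune_weights(weight_matrix, p=0.2):
    """Weight pruning: zero out the fraction p of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weight_matrix), p)
    mask = np.abs(weight_matrix) >= threshold
    return weight_matrix * mask, mask          # keep the mask so pruned weights stay frozen

def prune_neurons(weight_matrix, p=0.2):
    """Neuron pruning: drop whole columns, ranked by their L2 norm."""
    col_norms = np.linalg.norm(weight_matrix, axis=0)
    keep = np.sort(col_norms.argsort()[int(p * len(col_norms)):])   # surviving columns
    return weight_matrix[:, keep]
```

In practice the prune-retrain cycle is repeated until the accuracy drop is no longer acceptable.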

Some keypoints with reference

GraphNN

Very nice survey

Motivation

Models: -

Energy Based learning

key tutorial

Introduction

training

Video Activity recognition

informative resource blog post

Approaches

Contemporary works based on the single-stream and two-stream papers:

  1. LRCN (Long-term Recurrent Convolutional Networks for visual recognition and description, Donahue et al. 2014)
    • Contributions
      • Based on RNNs (not a stream architecture)
      • Encoder-decoder for video representation
      • End-to-end training (but uses optical flow!)
  2. C3D (Learning spatiotemporal features with 3D convolutional networks, Du Tran et al. 2014)
    • Contributions
      • 3D CNNs as feature extractors
      • Extensive search for the best 3D CNN kernel and architecture.
      • Using deconvolutional layers for interpretation.
      • Factorized spatio-temporal CNNs
  3. Conv3D and attention (Yao et al., 2015)
    • Contributions
      • Novel 3D CNN-RNN encoder-decoder for spatiotemporal modeling
      • Use of attention within the CNN-RNN encoder-decoder framework.
    • Not actually action recognition, but CNN+LSTM.
  4. TwoStreamFusion (Feichtenhofer et al. 2016)
    • Contributions
      • Long-range temporal modeling and better long-range losses
      • Multi-level fusion architecture
  5. TSN (Temporal Segment Networks: Wang et al. 2016)
    • Contributions
      • Long-range temporal modeling
      • Batch norm, dropout, and pre-training
  6. ActionVLAD (Girdhar et al 2017)
    • Contributions:
      • learnable video level aggregation of features (!) - Pooling from different regions
      • End2end training
  7. HiddenTwoStream (Zhu et al 2017)
    • Contributions
      • Novel architecture for optical flow input using separate network
      • Spatial stream CNN, parallel with MotionNet (for optical flow) and Temporal Stream CNN then late fusion.
  8. I3D (Carreira et al. 2017)
    • Contributions
      • 3D-based models within a two-stream architecture
      • New dataset (Kinetics)
      • Extension of C3D (see 2.)
  9. T3D (Diba et al. 2017)
    • Contributions
      • Combining temporal information across variable depth
      • Supervised transfer learning

Some Dataset

Multitask Learning

Summary blog by S. Ruder

Motivation:

Two major ways:

Benefits:

Two main lines of focus

Auxiliary tasks:

Matrix Factorization

As the name suggests, MF expresses a matrix as a product of matrices: $R = P Q^\top$. An interesting way to understand it starts with the product of P and Q-transpose: each row of R, $r_j$, is the projection of the corresponding row of P onto the rows of Q. In other words, each row of P is transformed by the rows of Q into new coordinates along those rows.

The objective of MF is straightforward: reconstruct the original matrix. Sometimes the objective is modified with regularization and a weighted reconstruction loss. Since we obtain two matrices, MF connects to the embedding concept: P and Q are embeddings of the rows and columns of the original matrix R. link

Ways to find the P and Q matrices. link
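A minimal sketch of one such way (assumed code, not from the linked post): gradient descent on the regularized reconstruction error of $R \approx P Q^\top$.

```python
import numpy as np

def matrix_factorize(R, k=2, lr=0.01, reg=0.01, steps=5000):
    n_rows, n_cols = R.shape
    P = np.random.randn(n_rows, k) * 0.1        # row embeddings
    Q = np.random.randn(n_cols, k) * 0.1        # column embeddings
    for _ in range(steps):
        err = R - P @ Q.T                       # reconstruction error
        P_grad = -err @ Q + reg * P             # gradients of the regularized squared loss
        Q_grad = -err.T @ P + reg * Q
        P -= lr * P_grad
        Q -= lr * Q_grad
    return P, Q

R = np.array([[5.0, 3.0, 1.0],
              [4.0, 2.0, 1.0],
              [1.0, 1.0, 5.0]])
P, Q = matrix_factorize(R)
print(np.round(P @ Q.T, 2))                     # should be close to R
```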