
Introduction


---
layout: post
title: "Theories1"
categories: Math
---

So far in this writing I have covered:

GAN math

I am covering some of the GAN math from my old notes.

The informative PDF: GAN math 1

Recurrent Neural Network

An RNN learns while training: it remembers what it learned from prior inputs while generating outputs.

An RNN takes a series of inputs to produce a series of output vectors (no preset limitation on size).

Output and hidden state:
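To keep this self-contained, a minimal sketch of the standard vanilla RNN update (the weight names $W_{xh}$, $W_{hh}$, $W_{hy}$ are my own labels, not from the source):

$$ h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h), \qquad y_t = W_{hy} h_t + b_y $$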

Unrolled version:

Optimization algorithm: backpropagation through time (BPTT), which faces the vanishing gradient problem. Parameters are shared across time steps.

Deep RNNs

Bidirectional RNNs

Recursive Neural Network

Encoder Decoder Sequence to Sequence RNNs

LSTM: a modified form of the RNN designed to avoid the vanishing gradient problem.
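For reference, a compact sketch of the standard LSTM gate equations (notation mine, following common usage):

$$ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) $$

$$ \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t) $$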

An elaborate tutorial with equations.

A step-by-step tutorial of LSTM

Understanding the diagram

Another nice tutorial from Roger Grosse (source).

Attention

This writing consists of a personalized summary of the attention blog by Lilian Weng. A great writer and a personal inspiration. I hope to put my understanding alongside her content.

The primary problem with seq2seq models is the limitation of the fixed-length context vector, which has to summarize the entire source sequence.

Figure: Source: Bahdanau et al., 2015
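A compact sketch of the additive (Bahdanau-style) attention the figure illustrates; the symbols $s_t$ (decoder state), $h_i$ (encoder states), and the score parameters $v_a, W_a$ follow common notation rather than the original note:

$$ \alpha_{t,i} = \frac{\exp\big(\text{score}(s_{t-1}, h_i)\big)}{\sum_{j} \exp\big(\text{score}(s_{t-1}, h_j)\big)}, \qquad c_t = \sum_i \alpha_{t,i}\, h_i $$

with, for example, $\text{score}(s, h) = v_a^\top \tanh(W_a [s; h])$.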

Three key points to remember

Here are some keywords and terms we should distinguish from each other.

Figure: Global vs. local attention, source: Luong et al., 2015

Neural Turing Machine

Figure: NTM addressing mechanism from Source

Transformer

Seq2seq without recurrent units. The model architecture is cleverly designed.

Figure: Source Vaswani et al., 2017

Here the key point is how the values (V) are weighted by the compatibility of the queries (Q) with the keys (K) to get the final attention.
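A minimal NumPy sketch of scaled dot-product attention as described in Vaswani et al.; the function name and shapes are my own illustration, not from the original note:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    # Compatibility of each query with each key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)          # (n_q, n_k)
    # Softmax over the keys gives the attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the values.
    return weights @ V                        # (n_q, d_v)

# Toy usage: 3 queries attending over 5 key/value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```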

Figure: Source Vaswani, et al., 2017

Figure: Source Vaswani, et al. 2017

Figure: Source Vaswani et al., 2017

SNAIL

It solves the positioning problem in transformer models.

Figure: Source Mishra et al., 2017

Self-Attention GAN

Figure: Convolution operation and self-attention Source Zhang et al. 2018

It has concepts similar to key, query, and value (the f, g, h projections).

Figure: Self-attention source Zhang et al., 2018
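For reference, the self-attention map in Zhang et al. (2018) is, up to my notation, computed roughly as follows, where $f$, $g$, $h$ are the projections mentioned above:

$$ s_{ij} = f(x_i)^\top g(x_j), \qquad \beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i} \exp(s_{ij})}, \qquad o_j = \sum_i \beta_{j,i}\, h(x_i) $$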

Normalizing Flows

Motivational blog1 and blog2

More motivational blog

Problems with the VAE and GAN: they don't explicitly learn the pdf of the real data, since it is intractable to integrate over all configurations of the latent variable z. Flow-based generative models overcome this problem with the technique of normalizing flows.

Types of Generative models

Figure: sources

Normalizing flows are techniques to transform a simple distribution into a complex one. The idea is entangled with invertible transformations of densities. These transformations, composed together, form a normalizing flow.

Figure: Source

Here, assuming

$$ z_{i-1} \sim p_{i-1}(z_{i-1}) $$

and

$$ z_i = f_i(z_{i-1}), \qquad z_{i-1} = f_i^{-1}(z_i), $$

so, by the change-of-variables formula, we find

$$ p_i(z_i) = p_{i-1}\big(f_i^{-1}(z_i)\big) \left| \det \frac{d f_i^{-1}}{d z_i} \right| = p_{i-1}(z_{i-1}) \left| \det \frac{d f_i}{d z_{i-1}} \right|^{-1}. $$

Expanding the total form for the full chain $x = z_K = f_K \circ \dots \circ f_1(z_0)$ and taking the log, this becomes

$$ \log p(x) = \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{d f_i}{d z_{i-1}} \right|. $$

In short, the transformation that the random variable goes through is the flow, and the full chain is the normalizing flow.
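A toy NumPy sketch of the chain above, assuming 1-D affine flows $f_i(z) = a_i z + b_i$ of my own choosing, so the log-determinant of each step is just $\log|a_i|$:

```python
import numpy as np

# Each flow step is z -> a * z + b (invertible when a != 0).
steps = [(2.0, 1.0), (0.5, -3.0), (3.0, 0.2)]  # (a_i, b_i), chosen arbitrarily

def log_prob_x(x):
    """log p(x) under the flow, with a standard normal base density p_0."""
    z, log_det_sum = x, 0.0
    # Invert the chain: run the steps backwards to recover z_0.
    for a, b in reversed(steps):
        z = (z - b) / a
        log_det_sum += np.log(abs(a))   # accumulates sum of log|det df_i/dz_{i-1}|
    log_p0 = -0.5 * (z**2 + np.log(2 * np.pi))
    return log_p0 - log_det_sum

def sample():
    """Sample x by pushing base noise z_0 forward through the chain."""
    z = np.random.default_rng().standard_normal()
    for a, b in steps:
        z = a * z + b
    return z

print(log_prob_x(sample()))
```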

Estimating a parametric distribution helps significantly in:

There has been a lack of attention to points 2, 3, and 4 above.

The question is the feasibility of finding a distribution with the following properties:

The solution is:

Change of variables, change of volume

Let's go over some points.

Consider $x$ as a random variable and $y = f(x)$ an invertible transformation. Now the PDF of $y$ follows the change-of-variables rule

$$ p(y) = p(x) \left| \det \frac{dx}{dy} \right| = p(x) \left| \det \frac{df}{dx} \right|^{-1}. $$

Taking the log,

$$ \log p(y) = \log p(x) - \log \left| \det \frac{df}{dx} \right|. $$

Models with Normalizing Flows

The training objective is simply the negative log-likelihood of the data.
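Written out, assuming a dataset $\mathcal{D}$ and the flow decomposition above, the objective is:

$$ \mathcal{L}(\mathcal{D}) = -\frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \log p(x) = -\frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \left[ \log p_0(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{d f_i}{d z_{i-1}} \right| \right] $$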

Real-NVP (Non-volume preserving)

Another recent network; it is built from affine coupling layers.

Figure: Special case of the IAF, as the 1:d components of both x and u are equivalent. Source
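For reference, the Real-NVP affine coupling layer splits the input at dimension $d$ and transforms one part conditioned on the other (notation follows the paper; $s$ and $t$ are neural networks):

$$ y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} \odot \exp\big(s(x_{1:d})\big) + t(x_{1:d}), $$

with the convenient log-determinant $\sum_j s(x_{1:d})_j$, since the Jacobian is triangular.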

NICE

Same as Real-NVP, but the coupling layer has only a shift (additive coupling), no scaling.
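So the layer reduces to (again in the Real-NVP notation above, with $m$ a neural network):

$$ y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} + m(x_{1:d}), $$

which is volume preserving: the log-determinant is zero.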

Glow

Figure: Sources

Autoregressive Models are Normalizing Flows

A density estimation technique that represents a complex joint density as a product of conditional distributions.
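That is, by the chain rule of probability:

$$ p(x) = \prod_{i=1}^{D} p(x_i \mid x_{1:i-1}) $$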

For common choices, the conditionals are Gaussian:

$$ p(x_i \mid x_{1:i-1}) = \mathcal{N}\big(x_i \mid \mu_i, (\exp \alpha_i)^2\big), $$

where

$$ \mu_i = f_{\mu_i}(x_{1:i-1}) $$

and

$$ \alpha_i = f_{\alpha_i}(x_{1:i-1}). $$

Here the inductive bias says that earlier variables don't depend on later variables! To sample from the distribution and get data $x$ for $i = 1:D$:

$$ x_i = u_i \exp(\alpha_i) + \mu_i, $$

where

$$ u_i \sim \mathcal{N}(0, 1). $$

Figure: Graphical View source

Here the learnable parameters are the $\alpha$'s and $\mu$'s, obtained by training a neural network on the data. And the inverse is

$$ u_i = (x_i - \mu_i) \exp(-\alpha_i). $$

Figure: Earlier Source

MADE

Figure: Source

PixelRNN

Figure: Source

WaveNet

Figure: Top diagram shows Wavent and bottom picture shows the causal convolutional layers Source

Inverse Autoregressive Flows (IAF)

Two major changes in the equations from the masked AF (MAF) earlier:

$$ x_i = u_i \exp(\alpha_i) + \mu_i, $$

where

$$ \mu_i = f_{\mu_i}(u_{1:i-1}) $$

and

$$ \alpha_i = f_{\alpha_i}(u_{1:i-1}). $$

Mind the distinction between $u$ and $\mu$: the shift and scale now depend on the previously sampled noise variables $u_{1:i-1}$ rather than on the data $x_{1:i-1}$.

Figure: Source. This looks like the inverse pass of MAF with changed labels. Extract the u's from the x's using the alphas and mus.

MAF (trains quickly but samples slowly) and IAF (trains slowly but samples quickly) are trade-offs of each other.
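A toy NumPy sketch of why MAF sampling is sequential while its inverse (density-evaluation) pass is not; the functions f_mu and f_alpha stand in for the autoregressive networks and are my own placeholders:

```python
import numpy as np

D = 5
rng = np.random.default_rng(0)

def f_mu(x_prev):     # placeholder for the autoregressive network mu_i(x_{1:i-1})
    return 0.1 * x_prev.sum()

def f_alpha(x_prev):  # placeholder for the autoregressive network alpha_i(x_{1:i-1})
    return 0.05 * x_prev.sum()

def maf_sample():
    """Sampling: each x_i needs the previously generated x's, so it is a sequential loop."""
    u = rng.standard_normal(D)
    x = np.zeros(D)
    for i in range(D):
        mu, alpha = f_mu(x[:i]), f_alpha(x[:i])
        x[i] = u[i] * np.exp(alpha) + mu
    return x

def maf_inverse(x):
    """Inverse (density evaluation): each u_i depends only on the already-known x,
    so there is no sequential dependence (a real MAF does this in one masked network pass)."""
    mus = np.array([f_mu(x[:i]) for i in range(D)])
    alphas = np.array([f_alpha(x[:i]) for i in range(D)])
    return (x - mus) * np.exp(-alphas)

x = maf_sample()
print(np.round(maf_inverse(x), 3))
```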

Masked Autoregressive Flows

Figure: Difference between MAF and IAF. I believe the IAF equation should be z instead of x in the transition. Source

Parallel WaveNet

Parallel WaveNet combines MAF and IAF. It uses the idea that an IAF can compute the likelihood of its own samples cheaply but not of external data, while a MAF can cheaply evaluate the likelihood of external data.

Figure: Source. Here the MAF (teacher) is trained to generate samples, and the IAF (student) tries to generate samples that closely match the teacher's likelihood of the IAF-generated samples.

Distribution

Probability and machine learning are deeply connected. In ML we need a clear understanding of probabilities, distributions, densities, and how they are connected to each other.

Here is my short note

And another, bigger set of notes, not by me.

Word2vec

Elaborate explanation

Gradient, Hessian, Jacobian

good cover

Simple things, but they come back to increase your confusion again and again. They all arise from functions of several variables and their derivatives, written in vector/matrix form. Let's dive into the gradient first. The gradient (∇) is associated with a multi-variable, scalar-valued function [R^n to R] and its partial derivatives. The Hessian just follows the definition of the gradient to second order. The gradient captures the slope of f with respect to each variable. The Hessian looks into the second derivatives: the Hessian matrix contains the partial derivatives of the function with respect to each pair of variables.

Now, the Jacobian is defined for multi-variable, vector-valued functions [R^n to R^m]: a matrix with the derivative of each component function with respect to each variable, taken once. The Jacobian is basically defined for the first derivatives of the functions.

Now, if we think of the gradient as a multi-variable, vector-valued function, then its Jacobian [the Jacobian of the gradient vector field] is the Hessian of the original function f.
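A tiny worked example of my own, for $f(x, y) = x^2 y$:

$$ \nabla f = \begin{bmatrix} 2xy \\ x^2 \end{bmatrix}, \qquad H_f = J(\nabla f) = \begin{bmatrix} 2y & 2x \\ 2x & 0 \end{bmatrix} $$

so the Jacobian of the gradient field $\nabla f : \mathbb{R}^2 \to \mathbb{R}^2$ is exactly the Hessian of $f$.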

The key point to remember is the distinction between vector-valued and scalar-valued functions of multiple variables. Interesting, and there is a connection to the Laplacian (the trace of the Hessian).