mxahan.github.io

Introduction

This blog contains some commonly encountered terms in DL. Let's move forward. The terms covered are:

Embedding

A very nice post to start with.

Collaborative Filter

Boltzmann Machine

courtesy: Hinton

Stochastic dynamics of Boltzmann Machine.

The probability of a neuron turning on is p(s_i = 1) = 1 / (1 + e^{-z_i}), where z_i = b_i + sum_j s_j w_ij is the unit's total input.

The interesting result: if the units are updated sequentially, in an order that does not depend on their total inputs, the network will reach the Boltzmann/Gibbs distribution. (Google it.) Here comes the notion of the energy of a state vector. It is also connected to the Hopfield network (1982), described by an undirected graph. The states are updated via the connections from neighboring nodes; the neurons attract or repel each other. The energy depends on the node values, the thresholds, and the edge weights. The update rule: s_i = 1 if sum_j w_ij * s_j > threshold_i, else -1. Following this update rule the energy either decreases or stays the same. The Boltzmann machine differs from the Hopfield network in its probabilistic (stochastic) state updates.
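The Hopfield update rule and energy above can be sketched in a few lines (a minimal illustration; the function names are mine, and the weights are assumed symmetric with zero diagonal):

```python
import numpy as np

def hopfield_update(s, W, theta):
    """One full sweep of asynchronous Hopfield updates.

    s: state vector with entries in {-1, +1}
    W: symmetric weight matrix with zero diagonal
    theta: per-unit thresholds
    """
    s = s.copy()
    for i in range(len(s)):
        total_input = W[i] @ s
        s[i] = 1 if total_input > theta[i] else -1
    return s

def energy(s, W, theta):
    # E = -1/2 * sum_ij w_ij s_i s_j + sum_i theta_i s_i
    return -0.5 * s @ W @ s + theta @ s
```

With a Hebbian weight matrix W = p p^T - I, a stored pattern p is a fixed point of the update, and each sweep never increases the energy.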

The probability of the network being in state v follows the Boltzmann distribution: P(v) = e^{-E(v)} / sum_u e^{-E(u)}, where E(v) is the energy function. The probabilistic state values help overcome energy barriers, since the network sometimes jumps to a higher-energy state.

Learning rule: update the weights by gradient ascent on the log-likelihood, delta w_ij proportional to <s_i s_j>_data - <s_i s_j>_model, where <> denotes expectation. This causes slow learning because of the expectation operations. Quick notes: the data expectation is the expected value of s_i*s_j under the data distribution, and the model expectation means sampling state vectors from the equilibrium distribution at a temperature of 1. It is a two-step process: first sample v, then sample the state values.
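A minimal sketch of both pieces above, the stochastic unit update and the (data - model) weight update; the function names, batch layout, and learning rate are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(s, W, b):
    """Stochastic update of each binary unit: p(s_i = 1) = sigmoid(b_i + sum_j w_ij s_j)."""
    s = s.copy()
    for i in range(len(s)):
        p_on = sigmoid(b[i] + W[i] @ s)
        s[i] = 1.0 if rng.random() < p_on else 0.0
    return s

def weight_update(data_states, model_states, lr=0.01):
    """delta w_ij = lr * (<s_i s_j>_data - <s_i s_j>_model), estimated from sampled batches."""
    data_corr = data_states.T @ data_states / len(data_states)
    model_corr = model_states.T @ model_states / len(model_states)
    return lr * (data_corr - model_corr)
```

The slowness mentioned above comes from `model_states`: getting honest equilibrium samples requires running many Gibbs steps per weight update.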

Convex: the learning problem is convex when the observed data specifies a binary state for every unit in the Boltzmann machine.

Higher order Boltzmann machine

Instead of pairwise terms only, the energy function includes higher-order interaction terms among more than two nodes.

Conditional Boltzmann Machine

A plain Boltzmann machine learns the joint data distribution; a conditional Boltzmann machine models the distribution of some units conditioned on others, which are clamped.

Restricted Boltzmann machine (Smolensky 1986)

One hidden layer and one visible layer. Restriction: no hidden-hidden or visible-visible connections. The hidden units are conditionally independent given the visible layer. Now the data expectation can be found in one sweep, but the model expectation still needs multiple iterations. Hinton introduced reconstruction in place of the full model expectation for s_i*s_j: learning by contrastive divergence.
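A rough CD-1 sketch for a binary RBM, assuming the rows of `v0` are binary visible vectors; the function name and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_v, b_h, lr=0.1):
    """One CD-1 step for a binary RBM on a batch of visible vectors v0."""
    # Positive phase: hidden probabilities given the data.
    # One sweep suffices because hiddens are conditionally independent given visibles.
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Reconstruction: sample visibles back, then recompute hidden probabilities
    pv1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)
    # Approximate gradient: <v h>_data - <v h>_reconstruction
    batch = v0.shape[0]
    dW = (v0.T @ ph0 - v1.T @ ph1) / batch
    return (W + lr * dW,
            b_v + lr * (v0 - v1).mean(axis=0),
            b_h + lr * (ph0 - ph1).mean(axis=0))
```

The reconstruction `v1` stands in for a sample from the model's equilibrium distribution, which is exactly the shortcut the note above describes.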

Related to markov random field, Gibbs sampling, conditional random fields.

RBM

Nice post by the inventor.

Key reminders: introducing probabilities into the Hopfield network gives the Boltzmann machine. Adding one hidden layer with restrictions (no visible-visible or hidden-hidden connections) makes it an RBM: a single-layer NN.

Learning rule: contrastive divergence (an approximation of the exact learning rule, data expectation minus model expectation).

A cool demonstration of contrastive divergence.

Learning Vector Quantization

Fairly simple idea. Notes:

It's a prototype-based learning method (representation).
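A minimal sketch of the classic LVQ1 variant: the winning prototype moves toward the example when the labels match and away otherwise. Function names and the learning rate are illustrative:

```python
import numpy as np

def lvq1_step(x, y, prototypes, proto_labels, lr=0.1):
    """LVQ1: move the nearest prototype toward x if labels match, away otherwise."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    j = int(np.argmin(dists))                    # winning prototype
    sign = 1.0 if proto_labels[j] == y else -1.0
    prototypes = prototypes.copy()
    prototypes[j] += sign * lr * (x - prototypes[j])
    return prototypes

def lvq_predict(x, prototypes, proto_labels):
    """Classify x by the label of its nearest prototype."""
    return proto_labels[int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))]
```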

Noise Contrastive Estimation

original Paper

Contrastive loss:

link 1 The idea is that similar things (positive examples) should stay close, while negative examples should be pushed apart (e.g., toward orthogonal representations).
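The pairwise form of that idea (Hadsell-style contrastive loss) is easy to sketch; the margin value is an illustrative choice:

```python
import numpy as np

def contrastive_loss(z1, z2, same, margin=1.0):
    """Pairwise contrastive loss: pull matching embeddings together,
    push non-matching ones at least `margin` apart."""
    d = np.linalg.norm(z1 - z2)
    if same:
        return 0.5 * d ** 2                      # positive pair: penalize any distance
    return 0.5 * max(0.0, margin - d) ** 2       # negative pair: penalize only inside the margin
```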

link 2

link 3 Same ideas, different applications.

the initial paper with the algorithm, plus a simpler explanation resource.

concise resource

Triplet loss training tricks

some loss function to formulate
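The standard triplet objective behind those tricks can be sketched as follows (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin):
    the positive should be closer to the anchor than the negative, by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)
```

Most of the training tricks concern which triplets to feed this loss (e.g., mining negatives that actually produce a non-zero loss), since easy triplets contribute zero gradient.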

visual product similarity - Cornell

Some notes on Recent review

prototypical CL

Model Fairness

Active Learning

good start: three ways to select the data to annotate

another source: three types

common link resource

very good resource: which rows to label?
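Two of the common row-selection heuristics, least-confidence and entropy sampling, are easy to sketch; here `probs` is assumed to hold the current model's class probabilities, one row per unlabeled example:

```python
import numpy as np

def least_confident(probs):
    """Pick the unlabeled row whose top predicted probability is lowest."""
    return int(np.argmin(probs.max(axis=1)))

def entropy_sampling(probs):
    """Pick the row with the highest predictive entropy."""
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(ent))
```

Both heuristics send the most uncertain example to the annotator; they agree here, but can differ with more than two classes.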

Deep learning and Active learning

one of the nicest introductory blog

Here we will discuss autoencoders and variational autoencoders, with the math.

link1 link2

Bregman Divergence

can serve as a 101.

Three key things to cover:

Definition: D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>, for a strictly convex, differentiable generator F.

Properties: convexity (D_F is convex in its first argument) and non-negativity.

Applications: ML optimization (e.g., mirror descent; squared Euclidean distance and KL divergence are both special cases).
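A sketch of the definition, checking the two classic special cases (squared Euclidean distance and generalized KL); the helper names are mine:

```python
import numpy as np

def bregman(p, q, F, gradF):
    """D_F(p, q) = F(p) - F(q) - <gradF(q), p - q>."""
    return F(p) - F(q) - gradF(q) @ (p - q)

# F(x) = ||x||^2 generates the squared Euclidean distance
sq = lambda x: x @ x
sq_grad = lambda x: 2 * x

# F(x) = sum x log x (negative entropy) generates the generalized KL divergence
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1
```

Plugging in `sq` recovers ||p - q||^2 exactly, and `negent` recovers sum(p log(p/q) - p + q), which reduces to KL when p and q are normalized.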