  • Boltzmann Machine with Energy-Based Models and Restricted Boltzmann Machines (RBM)
    MLAI/DeepLearning · 2019. 10. 19. 19:36

    1. Overview

    A Boltzmann machine (also called stochastic Hopfield network with hidden units) is a type of stochastic recurrent neural network and Markov random field.

    Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield networks. They were one of the first neural networks capable of learning internal representations, and are able to represent and (given sufficient time) solve difficult combinatoric problems.

    They are theoretically intriguing because of the locality and Hebbian nature of their training algorithm (being trained by Hebb's rule), and because of their parallelism and the resemblance of their dynamics to simple physical processes. Boltzmann machines with unconstrained connectivity have not proven useful for practical problems in machine learning or inference, but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems.

    They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function. That's why they are called "energy-based models" (EBM). 

    2. Energy-Based Models

    2.1 Boltzmann distribution

    $$p_{i}=\frac{e^{\frac{-\varepsilon _{i}}{kT}}}{\sum_{j=1}^{M}e^{\frac{-\varepsilon _{j}}{kT}}}$$

    $p_{i}$ is the probability of the system being in state $i$, and $\varepsilon_{i}$ is the energy of that state. $k$ is the Boltzmann constant and $T$ is the temperature of the system. The denominator sums over all $M$ possible states of the system, so it normalizes the probabilities. The upshot is that the probability of a state is inversely related to its energy: higher-energy states are exponentially less likely.
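    As a quick illustration of the formula (a minimal sketch; the energies and the value of $kT$ below are made-up numbers, not from the post):

    ```python
    import numpy as np

    # Hypothetical energies of four possible states (arbitrary units)
    energies = np.array([1.0, 2.0, 5.0, 10.0])
    kT = 1.0  # the product k*T, treated here as a single constant

    # Boltzmann distribution: p_i = exp(-e_i / kT) / sum_j exp(-e_j / kT)
    weights = np.exp(-energies / kT)
    probabilities = weights / weights.sum()

    print(probabilities)  # higher-energy states get exponentially smaller probability
    ```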

    In principle, the molecules of a gas could all be collected in the top-right corner of a container, but what the Boltzmann distribution says is that the probability of that state occurring is very low, because the energy in that state would be very high. The molecules would be very close to each other, bumping into one another, creating a lot of chaos and havoc, and moving very quickly. To be in that state, the system would have to have a huge amount of energy.

    What we normally observe is that the molecules are distributed widely throughout the container, because this is the lowest energy state for that system.

    3. Boltzmann machines

    In a standard (or full) Boltzmann machine, every single node connects to every other node, and there is no output layer. In theory, it can solve lots of different problems.

    Energy is defined in a Boltzmann machine through the weights of its synapses. Once the system is trained, that is, once the weights are set, the system will, based on those weights, always try to find the lowest energy state for itself.

    4. Restricted Boltzmann machine (RBM)

    In practice, the standard Boltzmann machine is hard to implement: as you increase the number of nodes, the number of connections between them grows quadratically, and exact training requires summing over an exponential number of joint states. Therefore a different type of architecture was proposed, called the Restricted Boltzmann machine.

    The restriction is simple: hidden nodes cannot connect to each other, and visible nodes cannot connect to each other. Other than that, everything is the same.

    $$E(v,h)=-\sum_{i}a_{i}v_{i}-\sum_{j}b_{j}h_{j}-\sum_{i}\sum_{j}v_{i}w_{i,j}h_{j}$$

    The energy function is defined through the weights. $a_{i}$ and $b_{j}$ are the biases in the system, so they are just constants. $v_{i}$ is the visible node you are looking at, $h_{j}$ is the hidden node, and $w_{i,j}$ is the weight between the visible and hidden nodes.

    $$P(v,h)=\frac{1}{Z}e^{-E(v,h)}$$

    The probability of being in a certain state is inversely related to the energy of that state. Here $Z$ is the partition function, the sum of $e^{-E(v,h)}$ over all of the possible states, which normalizes the probabilities. Because of the way we set the system up, it is going to seek out the lowest energy state.
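    A minimal sketch of these two formulas for a small binary RBM (the biases, weights, and configurations below are arbitrary example values, not from the post):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    n_visible, n_hidden = 6, 3
    a = rng.normal(size=n_visible)               # visible biases a_i
    b = rng.normal(size=n_hidden)                # hidden biases b_j
    W = rng.normal(size=(n_visible, n_hidden))   # weights w_ij

    v = rng.integers(0, 2, size=n_visible)       # one binary visible configuration
    h = rng.integers(0, 2, size=n_hidden)        # one binary hidden configuration

    # E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j
    E = -(a @ v) - (b @ h) - (v @ W @ h)

    # P(v, h) = exp(-E(v, h)) / Z.  Z sums exp(-E) over every possible (v, h) pair,
    # so only the unnormalized probability is cheap to compute for one configuration.
    unnormalized_p = np.exp(-E)
    print(E, unnormalized_p)
    ```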

    4.1 How it works

    Through these inputs and through the training process, the RBM comes to a better understanding of which features these movies might have in common, and the weights are assigned in such a way that the hidden nodes become reflective of those specific features. That is how the training of the RBM happens: it figures out features and similarities.

    Suppose we want to make a recommendation for a person: will they like Fight Club or not? Will they like The Departed or not? We feed that person's row of ratings into our Restricted Boltzmann machine, and certain features light up if they are present in this user's tastes, preferences, likes, and biases.

    The drama node lights up because this person likes Forrest Gump and Titanic, which are dramas.

    This person doesn't like The Matrix or Pulp Fiction, which are action movies, so the action node lights up in red. We don't care about the movies we already have ratings for, since those are part of the training; we only care about the movies where we don't have ratings, and we use the reconstructed values as predictions.

    So, once again, from here the Boltzmann machine is going to reconstruct these input values based on what it has learned. For Fight Club, for example, it looks at all of the hidden nodes and, based on the weights determined during training, it knows which nodes actually connect to Fight Club.

    And because that connection leads to the node that lit up in red, Fight Club is going to be a movie that this person is not going to like.

    For The Departed, we can see each node vote: yes, no, yes, and yes. Based on these votes, the answer, in simplistic terms, is 1, meaning the person will probably like it.
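    A hedged sketch of that forward-then-backward prediction pass, using the standard sigmoid/binary-RBM formulation (the weights here are random placeholders standing in for trained values, and the hidden layer is shrunk to two "genre" features for illustration):

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Ratings for 6 movies: 1 = liked, 0 = disliked, -1 = not rated yet
    movies = ["Forrest Gump", "Titanic", "The Matrix", "Pulp Fiction",
              "Fight Club", "The Departed"]
    v = np.array([1, 1, 0, 0, -1, -1], dtype=float)

    rng = np.random.default_rng(1)
    W = rng.normal(scale=0.1, size=(6, 2))  # pretend these were learned (2 hidden features)
    a = np.zeros(6)                         # visible biases
    b = np.zeros(2)                         # hidden biases

    # Forward pass: hidden nodes "light up" based on the rated movies only
    v_in = np.where(v < 0, 0.0, v)          # unrated entries contribute nothing going in
    p_h = sigmoid(b + v_in @ W)

    # Backward pass: reconstruct every visible node from the hidden activations
    p_v = sigmoid(a + p_h @ W.T)

    # Use the reconstructed values only where we had no rating
    for name, rating, pred in zip(movies, v, p_v):
        if rating < 0:
            print(f"{name}: predicted like probability {pred:.2f}")
    ```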

    4.1.1 How it learns

    After the hidden nodes light up, a backward pass happens: the Boltzmann machine tries to reconstruct our input.

    What the Boltzmann machine does is accept values into the hidden nodes and then try to reconstruct the inputs based on those hidden nodes. If, during training, the reconstruction is incorrect, the weights are adjusted and we reconstruct again, and again.

    4.2 Contrastive Divergence

    Because this is an undirected network, this is where contrastive divergence comes in. Every hidden node is constructed from all six visible nodes, so the hidden nodes hold values that did not come only from one visible node but from the other visible nodes as well. Therefore, when you reconstruct a visible node, even though you use the same weights, the hidden-node values partly came from other visible nodes, and the reconstructed value of a visible node is not going to be equal to what that node held initially.

    4.2.1 Gibbs sampling

    The visible values at each iteration are not equal to one another, because the hidden nodes are reconstructed from the previous input. Finally, at some point, we get reconstructed input values such that, when we feed them into the RBM and try to reconstruct them again, we get those same values back. From that point on we don't keep going forward: the process has converged, and our network has become a good model of that specific input.
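    A rough sketch of that alternating sampling chain (Gibbs sampling) for a binary RBM; the weights are random here, so this only illustrates the v → h → v alternation rather than a trained model:

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(2)
    n_visible, n_hidden = 6, 3
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    a = np.zeros(n_visible)
    b = np.zeros(n_hidden)

    v = rng.integers(0, 2, size=n_visible).astype(float)  # start from some visible vector

    for step in range(1000):
        # sample hidden given visible, then visible given hidden
        h = (rng.random(n_hidden) < sigmoid(b + v @ W)).astype(float)
        v = (rng.random(n_visible) < sigmoid(a + h @ W.T)).astype(float)

    # After enough alternations the chain draws samples from the model's own
    # distribution P(v, h) rather than echoing the original input.
    print(v)
    ```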

    The formula gives the gradient of the log probability of a certain state of our system with respect to the weights. What it tells us is how the weights affect the log probability: how changing the weights will change it. That change is the difference between the initial (data) state of the visible nodes and the final (converged) state. Remember that the way we define energy is through the weights.
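    For reference, the usual textbook form of that gradient for the RBM defined above (this exact expression is not written out in the post, but it is the standard statement) is

    $$\frac{\partial \log P(v)}{\partial w_{i,j}}=\left \langle v_{i}h_{j} \right \rangle_{data}-\left \langle v_{i}h_{j} \right \rangle_{model}$$

    where the first term is measured with the visible nodes clamped to the training data and the second term comes from the model's own converged distribution.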

    In the figure, the green dot represents the initial state and the red dot the second pause. A system governed by its energy will always try to end up in the lowest energy state possible, so in the picture the ball rolls toward the bottom of the curve.

    4.2.2 Hinton's shortcut

    Hinton's shortcut tells us how to adjust our curve without running the sampling process to the very end: we only need two pauses, the first pause and the second pause. This is Contrastive Divergence 1 (CD1), marked by the red circle.

    We are trying to adjust the energy curve by modifying the weights so that the resulting system resembles our input values as closely as possible, and we do that using the gradient formula above.
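    A minimal sketch of one CD1 weight update under those assumptions (binary units; the learning rate and the training vector are invented for illustration):

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(3)
    n_visible, n_hidden, lr = 6, 3, 0.1
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    a = np.zeros(n_visible)
    b = np.zeros(n_hidden)

    v0 = np.array([1, 1, 0, 0, 1, 0], dtype=float)  # one training example (first pause)

    # Positive phase: hidden probabilities and a sample, with v clamped to the data
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(n_hidden) < ph0).astype(float)

    # Negative phase: a single reconstruction step (the "second pause" of CD1)
    pv1 = sigmoid(a + h0 @ W.T)
    v1 = (rng.random(n_visible) < pv1).astype(float)
    ph1 = sigmoid(b + v1 @ W)

    # CD1 update: <v h>_data minus <v h>_reconstruction
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    a += lr * (v0 - v1)
    b += lr * (ph0 - ph1)
    ```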

    5. References

    http://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf

    https://en.wikipedia.org/wiki/Boltzmann_distribution

    https://en.wikipedia.org/wiki/Boltzmann_machine

    http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf

    https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf
