Generative Modeling is a branch of machine learning that focuses on creating models representing distributions of data, denoted as $P(X)$.
$X$ represents the data points, such as images.
Each data point (like an image) consists of many dimensions, typically corresponding to pixels.
A good generative model needs to understand and capture the relationships and dependencies among these pixels.
For example, it should recognize that adjacent pixels in an image usually exhibit similar colors and likely form parts of objects.
The most straightforward kind of generative model simply lets us compute $P(X)$ numerically, but this alone can be insufficient.
A model that can recognize which images are real versus noise is important, yet it doesn't focus on generating new useful examples.
Knowing that an image has a low probability doesn't equate to having the ability to create new, high-probability examples.
The real value of generative models lies in their ability to create new instances that resemble examples from a given database, such as generating new images that appear real or creating additional 3D models for applications like gaming.
Variational Autoencoder (VAE)
VAEs are a powerful framework for this: they learn a model distribution $P(X)$ that is as close as possible to the unknown target distribution $P_{gt}(X)$, and from which new instances can be sampled.
Traditional generative modeling approaches often struggled due to:
Strong Assumptions: Some models require predefined structures in data which might not reflect its complexity.
Severe Approximations: These can lead to suboptimal models, failing to capture the true nature of the data.
Computational Expense: Classical methods like Markov Chain Monte Carlo can be too slow for practical applications.
VAEs address many of these issues:
They make minimal assumptions about the data structure.
They can effectively approximate the data distribution using neural networks without the heavy computational burden normally associated with generative models.
The approximations they introduce are generally small, allowing for effective training using fast techniques like back-propagation.
Latent Variable Models
A latent variable ($z$) represents hidden characteristics that influence the generation of observable data (e.g., images).
It acts as an intermediate step that the model utilizes to condition the data it generates.
Example of Handwritten Characters:
When generating images of digits (0-9), the model first needs to decide which digit to create (e.g., 5 or 0).
This decision, represented by the latent variable $z$, ensures that the generated features of the digit are coherent and align with the chosen character.
(Why "latent"?) Given only an output character, the settings of the latent variables that produced it are unknown; recovering them requires inference (for example, with computer vision techniques).
The latent variables’ values must effectively map to the data points in the dataset to ensure the generative model adequately represents the distribution of the data.
The process described focuses on how to ensure that a Variational Autoencoder (VAE) model can effectively represent the dataset.
It emphasizes the importance of having latent variables (denoted as $z$), which are sampled from a probability density function (PDF) $P(z)$.
The parameters $\theta$ of a deterministic function $f(z; \theta)$ are optimized so that, when $z$ is sampled from $P(z)$, the outputs $f(z; \theta)$ are, with high probability, similar to the data points $X$ in the original dataset.
The goal is to maximize the overall probability of generating the observed data, represented mathematically as: $$P(X) = \int P(X|z; \theta) P(z) dz$$
Here, $P(X|z; \theta)$ indicates the conditional probability of $X$ given the latent variable $z$ and the parameters $\theta$.
Gaussian Distribution in VAEs: $$P(X|z; \theta) = N(X|f(z; \theta), \sigma^2 I)$$
$f(z; \theta)$ is a deterministic function that defines the mean of the Gaussian output.
$\sigma^2 I$ is the covariance (the identity matrix scaled by $\sigma^2$), so the model is asked to produce samples that are close to the training data rather than identical to any specific training point.
The choice of a Gaussian distribution is crucial because it allows gradient-based optimization (like stochastic gradient descent) to be applied: $P(X)$ can be increased gradually by making $f(z; \theta)$ approach $X$ for some $z$.
Because the Gaussian assigns non-zero probability to outputs that are merely similar to $X$, the model receives a training signal even when $f(z; \theta)$ does not match $X$ exactly.
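As a rough illustration of the integral $P(X) = \int P(X|z; \theta) P(z) dz$ with a Gaussian $P(X|z; \theta)$, the sketch below estimates $P(X)$ by averaging the Gaussian density over samples $z \sim N(0, 1)$. Everything here is a toy assumption: the data is one-dimensional, and the linear function `f` is a hypothetical stand-in for a neural network, chosen only because it makes the exact answer available for comparison.

```python
import math
import random

def gaussian_pdf(x, mean, sigma):
    """Density of a 1-D Gaussian N(mean, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def f(z, theta):
    """Hypothetical decoder: a linear map standing in for a neural network."""
    return theta * z

def estimate_px(x, theta, sigma, n_samples, rng):
    """Monte Carlo estimate of P(X) = E_{z ~ N(0,1)}[ N(X | f(z; theta), sigma^2) ]."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)  # z ~ P(z)
        total += gaussian_pdf(x, f(z, theta), sigma)
    return total / n_samples

rng = random.Random(0)
x, theta, sigma = 1.0, 2.0, 0.5
estimate = estimate_px(x, theta, sigma, 50_000, rng)

# For this linear toy model only, the integral is known: X ~ N(0, theta^2 + sigma^2).
exact = gaussian_pdf(x, 0.0, math.sqrt(theta ** 2 + sigma ** 2))
print(f"Monte Carlo: {estimate:.4f}  exact: {exact:.4f}")
```

In one dimension this naive estimator works well. In high-dimensional spaces most sampled $z$ contribute almost nothing to the average, which is why sampling from the prior alone becomes impractical.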
Why Not a Dirac Delta Function?
If $P(X|z)$ were a Dirac delta function:
Each $z$ would deterministically produce a single $X = f(z; \theta)$, so $P(X|z)$ would be zero unless $f(z; \theta)$ matched $X$ exactly.
The likelihood would then provide no gradient indicating how to move $f(z; \theta)$ closer to $X$, so gradient-based training would not be possible.
The rectangle is “plate notation” meaning that we can sample from z and X N times while the model parameters θ remain fixed.
Variational Autoencoders
Variational Autoencoders (VAEs) utilize latent variables to model complex data distributions.
Unlike classical autoencoders, they do not directly copy input data, but instead, they create a distribution over the input data.
Latent Variables
In a VAE, latent variables $z$ represent underlying factors generating the observed data $X$. VAEs assume $z$ follows a simple distribution, typically $N(0, I)$ (a standard normal distribution), which allows them to learn complex mappings from $z$ to $X$ using a neural network.
Sampling and Likelihood
VAEs approximate the likelihood of data $P(X)$ through sampling from the latent space.
They sample directly from a simple distribution and maximize the likelihood of the data using stochastic gradient descent, which avoids computationally expensive methods like Markov Chain Monte Carlo.
Setting and Objective
Objective of Sampling in VAEs:
The goal is to compute the likelihood $P(X)$ of the data $X$ based on latent variables $z$ that are likely to generate $X$.
Use of Function $Q(z|X)$
The function $Q(z|X)$ is introduced to provide a distribution of latent variables $z$ that are likely to produce a particular $X$.
This makes the process efficient, as you only need to focus on the $z$ values that contribute meaningfully to $P(X)$.
Kullback-Leibler Divergence $D$:
The KL divergence measures how much one probability distribution differs from a second, reference distribution. $$D[Q(z) || P(z|X)] = E_{z \sim Q} [\log Q(z) - \log P(z|X)]$$
This quantifies the difference between $Q(z)$ and the true posterior $P(z|X)$.
Relation Between Expectations
Applying Bayes' rule to $P(z|X)$ brings $P(X)$ and $P(X|z)$ into the equation; $\log P(X)$ moves outside the expectation because it does not depend on $z$: $$D[Q(z) || P(z|X)] = E_{z \sim Q} [\log Q(z) - \log P(X|z) - \log P(z)] + \log P(X)$$
Maximizing the Log Probability
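Negating both sides, rearranging, and letting $Q$ depend on $X$ gives the core VAE identity. Its left-hand side is the quantity we want to maximize, $\log P(X)$, minus the error term $D[Q(z|X) || P(z|X)]$; its right-hand side can be optimized with stochastic gradient descent: $$\log P(X) - D[Q(z|X) || P(z|X)] = E_{z \sim Q} [\log P(X|z)] - D[Q(z|X) || P(z)]$$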
This means that $Q(z|X)$ is ideally constructed to closely match $P(z|X)$, thereby allowing us to optimize the likelihood of observing $X$.
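The identity above can be checked numerically in a toy setting. The sketch below assumes a one-dimensional model with $z \sim N(0, 1)$ and $P(X|z) = N(z, \sigma^2)$, which is not a VAE but has a posterior and marginal likelihood in closed form, so both sides can be computed exactly. The values of `x`, `sigma`, `q_mean`, and `q_std` are arbitrary illustrative choices.

```python
import math

def kl_gauss_1d(m0, s0, m1, s1):
    """KL divergence D[N(m0, s0^2) || N(m1, s1^2)] for 1-D Gaussians."""
    return math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5

def log_gauss_1d(x, mean, s):
    """Log density of N(mean, s^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * s ** 2) - (x - mean) ** 2 / (2.0 * s ** 2)

# Toy model (assumption for illustration): z ~ N(0, 1), X | z ~ N(z, sigma^2).
x, sigma = 1.5, 0.8
# An arbitrary, deliberately imperfect Q(z|X) = N(q_mean, q_std^2).
q_mean, q_std = 0.4, 0.9

# Closed forms that exist for this toy model only.
log_px = log_gauss_1d(x, 0.0, math.sqrt(1.0 + sigma ** 2))
post_mean = x / (1.0 + sigma ** 2)
post_std = math.sqrt(sigma ** 2 / (1.0 + sigma ** 2))

# Left-hand side: log P(X) - D[Q(z|X) || P(z|X)]
lhs = log_px - kl_gauss_1d(q_mean, q_std, post_mean, post_std)

# Right-hand side: E_{z~Q}[log P(X|z)] - D[Q(z|X) || P(z)]
expected_log_lik = (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                    - ((x - q_mean) ** 2 + q_std ** 2) / (2.0 * sigma ** 2))
rhs = expected_log_lik - kl_gauss_1d(q_mean, q_std, 0.0, 1.0)

print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}, log P(X) = {log_px:.6f}")
```

Both sides agree, and both sit below $\log P(X)$ by exactly the divergence between $Q(z|X)$ and the true posterior. Setting `q_mean` and `q_std` to `post_mean` and `post_std` closes the gap.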
Convergence and Optimization
If $Q(z|X)$ can accurately approximate $P(z|X)$, the KL divergence becomes zero, and we can effectively optimize $P(X)$ directly.
The framework essentially introduces a method of variational inference, where one models the posterior $P(z|X)$ using a simpler distribution $Q(z|X)$.
VAEs optimize a complex sampling procedure to approximate the likelihood of data points using simpler distributions, enabling efficient representation learning.
The central mathematical tool used here is the Kullback-Leibler divergence, which guides the optimization of the model, ensuring that $Q(z|X)$ provides a distribution that effectively captures the characteristics of latent variables associated with observed data.
Optimizing the objective function
A training-time variational autoencoder implemented as a feed-forward neural network, where $P(X|z)$ is Gaussian.
Left is without the “reparameterization trick”, and right is with it.
Red shows sampling operations that are non-differentiable.
Blue shows loss layers.
The feedforward behavior of these networks is identical, but backpropagation can be applied only to the right network.
The objective involves maximizing the likelihood of the data while minimizing the Kullback-Leibler divergence between two probability distributions: the approximate posterior $Q(z|X)$ and the prior $P(z)$.
The usual choice for $Q(z|X)$ is a multivariate Gaussian distribution: $$Q(z|X) = N(z|\mu(X; \theta), \Sigma(X; \theta))$$
where $\mu$ and $\Sigma$ are deterministic functions parameterized by $\theta$ that are learned from data.
This choice simplifies computation and facilitates the optimization process.
KL-divergence computation
The KL-divergence between two multivariate Gaussians $N(\mu_0, \Sigma_0)$ and $N(\mu_1, \Sigma_1)$ can be computed in closed form: $$D[N(\mu_0, \Sigma_0) || N(\mu_1, \Sigma_1)] = \frac{1}{2} \left( \text{tr}(\Sigma_1^{-1} \Sigma_0) + (\mu_1 - \mu_0)^{\top} \Sigma_1^{-1} (\mu_1 - \mu_0) - k + \log\left(\frac{\det \Sigma_1}{\det \Sigma_0}\right) \right)$$
$k$ is the dimensionality of the distribution. In the VAE objective, $N(\mu_0, \Sigma_0)$ is $Q(z|X)$ and $N(\mu_1, \Sigma_1)$ is the prior $P(z) = N(0, I)$, so the divergence can be evaluated and differentiated efficiently.
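For the case the VAE objective needs, where the second Gaussian is the prior $N(0, I)$, the general formula simplifies considerably. The sketch below additionally assumes a diagonal covariance for $Q(z|X)$, a common practical choice that these notes do not state, so that the trace and determinant reduce to sums over dimensions. The function name is illustrative.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, var):
    """D[N(mu, diag(var)) || N(0, I)] for a diagonal covariance.

    Setting mu_1 = 0 and Sigma_1 = I in the general formula gives
    0.5 * (tr(Sigma_0) + mu_0^T mu_0 - k - log det Sigma_0),
    which for a diagonal Sigma_0 becomes a sum over dimensions.
    """
    k = len(mu)
    trace = sum(var)
    mu_sq = sum(m * m for m in mu)
    log_det = sum(math.log(v) for v in var)
    return 0.5 * (trace + mu_sq - k - log_det)

# A Q that equals the prior has zero divergence.
kl_same = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# Moving the mean or changing the variance away from 1 increases it.
kl_other = kl_diag_gaussian_to_standard_normal([1.0, -0.5], [0.5, 2.0])
print(kl_same, kl_other)
```

Because this expression is built from sums, squares, and logarithms of the encoder outputs, it is differentiable with respect to them and needs no sampling.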
Gradient computation and reparameterization trick:
To estimate the expected value $E_{z \sim Q}[\log P(X|z)]$, a sample $z$ is drawn from $Q(z|X)$.
Since $E_{z \sim Q}[\log P(X|z)]$ depends on the parameters of both $P$ and $Q$, gradients must reach $Q$ as well, but a sampling operation is non-differentiable and would block the gradient flow during backpropagation.
The solution is to employ the reparameterization trick, where instead of directly sampling from $Q(z|X)$, we first sample $\epsilon \sim N(0, I)$ and express $z$ as: $$z = \mu(X) + \Sigma^{1/2}(X) \cdot \epsilon$$
$z$ is thereby a deterministic function of $X$ and $\epsilon$, so gradients can flow through $\mu$ and $\Sigma$ while the randomness enters only through the input $\epsilon$.
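A minimal sketch of the reparameterization trick, in its standard form $z = \mu + \Sigma^{1/2} \epsilon$ with $\epsilon \sim N(0, I)$. It assumes a diagonal covariance, so $\Sigma^{1/2}$ is an element-wise square root; `mu` and `var` are hard-coded stand-ins for encoder outputs. Plain Python has no automatic differentiation, so the code only shows where the randomness is isolated.

```python
import math
import random

def reparameterize(mu, var, rng):
    """Draw z = mu + sqrt(var) * eps with eps ~ N(0, I), for a diagonal covariance.

    All randomness lives in eps, so z is a deterministic, differentiable
    function of mu and var once eps is fixed.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.sqrt(v) * e for m, v, e in zip(mu, var, eps)]
    return z, eps

rng = random.Random(0)
mu, var = [1.0, -2.0], [0.25, 4.0]
z, eps = reparameterize(mu, var, rng)

# With eps held fixed, dz/dmu = 1 and dz/dsigma = eps in every dimension,
# which is what lets backpropagation pass through the sampling step.
samples = [reparameterize(mu, var, rng)[0] for _ in range(20_000)]
mean_0 = sum(s[0] for s in samples) / len(samples)
print(f"empirical mean of z[0]: {mean_0:.3f} (target {mu[0]})")
```

The samples still follow $N(\mu, \Sigma)$, so the forward behavior matches direct sampling; only the path taken by gradients changes.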
Final optimization equation
The overall equation we want to optimize becomes: $$E_{X \sim D} \left[ E_{z \sim Q} \left[ \log P(X|z) \right] - D[Q(z|X) || P(z)] \right]$$
This formulation captures the dual objective of maximizing the likelihood while minimizing the divergence, enabling effective optimization using stochastic gradient descent.
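The objective can be estimated stochastically with one reparameterized sample of $z$ per data point, averaged over the data. The sketch below is a toy under stated assumptions: a one-dimensional latent, a hypothetical identity decoder $f(z) = z$, a hand-written `encoder` in place of a learned network, and synthetic data standing in for the dataset $D$. No parameters are trained here; it only evaluates the quantity that stochastic gradient descent would ascend.

```python
import math
import random

def kl_to_standard_normal(mu, var):
    """D[N(mu, var) || N(0, 1)] for a 1-D latent."""
    return 0.5 * (var + mu * mu - 1.0 - math.log(var))

def log_likelihood(x, z, sigma):
    """log N(x | f(z), sigma^2) with a hypothetical identity decoder f(z) = z."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - z) ** 2 / (2.0 * sigma ** 2)

def encoder(x):
    """Hypothetical encoder returning the mean and variance of Q(z|X)."""
    return 0.5 * x, 0.5

def objective(batch, sigma, rng):
    """One-sample-per-datapoint estimate of E_X[ E_Q[log P(X|z)] - D[Q(z|X) || P(z)] ]."""
    total = 0.0
    for x in batch:
        mu, var = encoder(x)
        eps = rng.gauss(0.0, 1.0)
        z = mu + math.sqrt(var) * eps  # reparameterized sample from Q(z|X)
        total += log_likelihood(x, z, sigma) - kl_to_standard_normal(mu, var)
    return total / len(batch)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.4) for _ in range(20_000)]  # synthetic stand-in for the dataset D
value = objective(data, sigma=1.0, rng=rng)
print(f"estimated objective: {value:.3f}")
```

The KL term is computed in closed form and only the reconstruction term is sampled, which keeps the variance of the estimate low.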
The testing-time variational “autoencoder,” which allows us to generate new samples. The “encoder” pathway is simply discarded.
Testing the learned model
Test-Time Process
During testing, the encoder is removed from the VAE architecture.
New samples are generated by inputting values from a standard normal distribution $ z \sim N(0, I) $ into the decoder.
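The test-time procedure amounts to very little code. In the sketch below, `decoder` is a hypothetical stand-in for the trained network $f(z; \theta)$: a fixed linear layer followed by a sigmoid, with made-up weights, mapping a 2-D latent vector to four pixel intensities.

```python
import math
import random

def decoder(z):
    """Hypothetical stand-in for the trained decoder f(z; theta).

    Maps a 2-D latent vector to 4 'pixel' intensities in (0, 1)
    through a fixed linear layer followed by a sigmoid.
    """
    weights = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.4], [0.0, 1.5]]
    out = []
    for row in weights:
        a = sum(w * zi for w, zi in zip(row, z))
        out.append(1.0 / (1.0 + math.exp(-a)))
    return out

def generate(n, latent_dim, rng):
    """Sample z ~ N(0, I) and decode; no encoder is involved."""
    return [decoder([rng.gauss(0.0, 1.0) for _ in range(latent_dim)]) for _ in range(n)]

rng = random.Random(0)
for sample in generate(3, latent_dim=2, rng=rng):
    print([round(p, 3) for p in sample])
```

Nothing in `generate` refers to $Q(z|X)$, $\mu(X)$, or $\Sigma(X)$, which mirrors the encoder pathway being discarded.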
Evaluating Probability
The probability $P(X)$ of the generated samples is generally intractable to compute directly.
Because the Kullback-Leibler divergence $D[Q(z|X) || P(z|X)]$ is non-negative, the right-hand side of the objective, $E_{z \sim Q}[\log P(X|z)] - D[Q(z|X) || P(z)]$, is a lower bound on $\log P(X)$.
This bound still contains an expectation over $z$ that requires sampling, but sampling $z$ from $Q(z|X)$ gives an estimator that generally converges much faster than sampling $z$ from the prior $N(0, I)$.
Usefulness of Lower Bound
This lower bound gives insight into how well the VAE captures a particular data point $X$, since it indicates how probable $X$ is under the learned model.
Interpreting the objective
The VAE framework seeks to optimize the log-likelihood of the data, $\log P(X)$, but it does so with approximations, and understanding them is crucial to understanding its performance.
The learning objective incorporates two key components:
$D[Q(z|X) || P(z|X)]$: This term is the Kullback-Leibler divergence which measures how well the approximating distribution $Q(z|X)$ aligns with the true posterior $P(z|X)$.
Optimizing this term, while necessary for making the model tractable and efficient, introduces some error; this error arises because the exact posterior is often complex and not easily computable.
$\log P(X)$: This represents the log-likelihood of the observed data under the model, and maximizing it makes the observed data points probable under the model.
VAE's potential error comes from balancing these two terms:
If $Q(z|X)$ is an accurate approximation of $P(z|X)$, the divergence term becomes small, and the model performs well.
If not, larger divergences may indicate poor model performance, hence affecting the data generation capability of the VAE.
The relationship to information theory is through the concept of Minimum Description Length (MDL):
Higher values of the objective, which is a lower bound on $\log P(X)$, imply a more efficient coding of the data, meaning the model describes the data with fewer bits.
This helps in understanding the efficiencies gained through the variational inference framework, suggesting that the chosen approximating distribution $Q(z|X)$ must minimize additional overhead.
The tuning of $\sigma$ in $P(X|z)$ could be seen as a form of regularization, akin to parameters used in sparse autoencoders, which control the complexity of the model and prevent overfitting.
The error from $D[Q(z|X) || P(z|X)]$
$Q(z|X)$ is the approximate posterior distribution of the latent variable $z$ given the input $X$.
$P(z|X)$ is the true posterior distribution of $z$ given the input $X$.
Kullback-Leibler Divergence (KL Divergence)
The term $D[Q(z|X) || P(z|X)]$ represents the KL Divergence between these two distributions.
It quantifies how much information is lost when using the approximate distribution $Q(z|X)$ instead of the true distribution $P(z|X)$.
Convergence to True Distribution:
For the model's output distribution $P(X)$ to converge to the true distribution, $D[Q(z|X) || P(z|X)]$ must approach zero.
This means the approximate posterior $Q(z|X)$ needs to accurately represent the true posterior $P(z|X)$.
Challenges in Achieving Zero Divergence
The author highlights that simply having high capacity (i.e., complex) functions for $\mu(X)$ (mean) and $\Sigma(X)$ (covariance) does not guarantee that $Q(z|X)$ will resemble $P(z|X)$ closely enough to make the divergence zero.
Whether the true posterior $P(z|X)$ is Gaussian at all depends on the decoder function $f$; for an arbitrary $f$ it generally is not.
Existence of a Suitable Function
The text posits that there may exist a sufficiently flexible function $f$ that can ensure $P(z|X)$ is Gaussian for all $X$ while simultaneously maximizing the likelihood $\log P(X)$.
If such a function exists, it would facilitate minimizing the divergence $D[Q(z|X) || P(z|X)]$.
The author acknowledges that proving general results for all distributions remains an open problem, but notes that it's theoretically proven in some 1D cases.
A small $\sigma$ (the standard deviation of $P(X|z)$) can make modeling easier, though it may also lead to badly scaled gradients during training.
Information-theoretic interpretation
Minimum Description Length Principle
This principle suggests that the best model is one that minimizes the number of bits needed to encode the data. In the context of VAEs, $-\log P(X)$ represents the total bits required for encoding data $X$ using an ideal encoding strategy.
Step 1 - Encoding Latent Variable $z$
Some bits are used to determine the latent variable $z$.
The KL-divergence $D[Q(z|X)||P(z)]$ quantifies the extra information needed to adjust from a prior distribution $P(z)$ (uninformative) to the posterior distribution $Q(z|X)$.
This measures how much information we gain about the latent variable $z$ when it is informed by the observed data $X$.
Step 2 - Decoding
The term $-\log P(X|z)$ measures the amount of information needed to reconstruct $X$ once $z$ has been determined.
The total number of bits, $-\log P(X)$, is the sum of the bits used in both steps minus a penalty, $D[Q(z|X) || P(z|X)]$, paid because $Q(z|X)$ is a sub-optimal encoding of $z$.
This penalty reflects the inefficiency of the encoding process: the further $Q(z|X)$ is from the true posterior $P(z|X)$, the more excess bits are needed.
VAEs and the regularization parameter
In a traditional sparse autoencoder, a regularization parameter $\lambda$ is used in the objective function, which can be represented as: $$L = \| \phi( \psi(X) ) - X \|^2 + \lambda \| \psi(X) \|_0$$
$\psi$ and $\phi$ are the encoder and decoder functions, respectively, and $\| \cdot \|_0$ is the L0 norm promoting sparsity in the encoding.
Unlike sparse autoencoders, VAEs typically do not have a separate explicit regularization parameter to tune.
This is advantageous as it reduces hyperparameter tuning for practitioners.
Absorption of Constants
The text suggests that even though one might think of introducing a parameter through scaling the latent variable $z$ (like using $z' \sim N(0, \lambda I) $), this does not fundamentally change the model.
The model remains the same because you can absorb the constant into the probabilistic definitions of $P$ and $Q$: $$f'(z') = f(z'/\lambda), \quad \mu'(X) = \mu(X) \cdot \lambda, \quad \Sigma'(X) = \Sigma(X) \cdot \lambda^2$$
Output Distribution and Regularization Parameter
The output distribution for continuous data is typically Gaussian: $$P(X|z) \sim N(f(z), \sigma^2 I)$$
The log-probability can be expressed as: $$\log P(X|z) = C - \frac{1}{2} \frac{\| X - f(z) \|^2}{\sigma^2}$$
Here, $C$ is a constant that does not depend on $f$. In this context, $\sigma$ acts like a regularization parameter that controls the balance between the two terms of the objective:
how well the model fits the data (the reconstruction term) vs. how close $Q(z|X)$ stays to the prior (the KL term).
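To see how $\sigma$ acts as a weighting, the sketch below evaluates $\log P(X|z)$ for one fixed reconstruction at several values of $\sigma$. The vectors `x` and `reconstruction` are made-up numbers, not MNIST data. Note that $C = -\frac{d}{2} \log(2 \pi \sigma^2)$ is constant with respect to $f(z)$ but does change with $\sigma$.

```python
import math

def gaussian_log_likelihood(x, fx, sigma):
    """log N(x | fx, sigma^2 I) = C - ||x - fx||^2 / (2 sigma^2), with C = -(d/2) log(2 pi sigma^2)."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, fx))
    c = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    return c - sq_err / (2.0 * sigma ** 2)

x = [0.2, 0.9, 0.4]
reconstruction = [0.25, 0.7, 0.5]  # hypothetical decoder output f(z)
sq_err = sum((a - b) ** 2 for a, b in zip(x, reconstruction))

for sigma in (1.0, 0.5, 0.1):
    weight = 1.0 / (2.0 * sigma ** 2)  # multiplier on the squared error
    ll = gaussian_log_likelihood(x, reconstruction, sigma)
    print(f"sigma={sigma}: weight on squared error={weight:.1f}, log P(X|z)={ll:.3f}")
```

Shrinking $\sigma$ multiplies the squared reconstruction error by a larger factor while the KL term in the objective is unchanged, so a small $\sigma$ pushes the model toward accurate reconstruction and a large $\sigma$ toward staying near the prior.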
Binary vs. Continuous Inputs
If the output $X$ is binary and modeled with a Bernoulli distribution, the regularization parameter disappears altogether: the bits needed to encode $X$ can then be counted directly, and both terms on the right-hand side of the objective use the same units to count them.
For continuous $X$, however, each sample carries infinite information, so $\sigma$ must be chosen to fix how accurately the model is expected to reconstruct $X$, which makes the information content finite.
Examples: MNIST & VAE
The VAE is applied to the MNIST dataset, which consists of grayscale images of handwritten digits (0-9).
The values for each pixel are constrained between 0 and 1.
Instead of using pre-existing VAE architectures, the authors adapt the basic AutoEncoder example from Caffe, ensuring flexibility in implementation.
Loss Function
The authors mention using the Sigmoid Cross Entropy loss for the probability distribution $P(X|z)$ of the data given the latent variables $z$.
This loss function is appropriate since the MNIST pixel values are between 0 and 1.
This loss has a probabilistic interpretation: imagine creating a new data point $X'$ by independently sampling each dimension as: $$X'_i \sim \text{Bernoulli}(X_i)$$
Each pixel is treated as a Bernoulli trial whose success probability $X_i$ is the observed pixel value from the training set.
The cross entropy then measures the expected negative log-probability of $X'$ under the model, so grayscale pixel values are handled without explicitly binarizing the images.
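The link between the cross-entropy loss and the Bernoulli sampling of $X'$ can be verified directly on a tiny example. The sketch below uses three made-up pixel values `x` and hypothetical decoder outputs `p`, enumerates every binary image $X'$, and compares the expected log-probability of $X'$ with the negative cross entropy.

```python
import itertools
import math

def bernoulli_log_prob(x_binary, p):
    """log P(X'|z) for binary pixels under independent Bernoulli(p_i) outputs."""
    return sum(math.log(pi) if xi == 1 else math.log(1.0 - pi)
               for xi, pi in zip(x_binary, p))

def cross_entropy(x, p):
    """Cross entropy between grayscale targets x in [0, 1] and predictions p in (0, 1)."""
    return -sum(xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
                for xi, pi in zip(x, p))

x = [0.9, 0.1, 0.5]  # grayscale pixel values (illustrative)
p = [0.8, 0.2, 0.6]  # hypothetical decoder outputs after the sigmoid

# Exact expectation over X' with X'_i ~ Bernoulli(x_i), by enumerating all 2^3 images.
expected = 0.0
for bits in itertools.product([0, 1], repeat=len(x)):
    prob = 1.0
    for b, xi in zip(bits, x):
        prob *= xi if b == 1 else 1.0 - xi
    expected += prob * bernoulli_log_prob(bits, p)

print(f"E[log P(X'|z)] = {expected:.6f}")
print(f"-cross_entropy = {-cross_entropy(x, p):.6f}")
```

The two numbers coincide, so minimizing the cross entropy on grayscale targets is the same as maximizing the expected log-likelihood of the randomly binarized images.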
Training Process
The model is trained once fully, but with multiple restarts to identify the optimal learning rate for minimizing the loss.
This indicates that achieving good performance does not heavily depend on the initial setup or deep structural modifications.
Generated Samples
The results from the VAE show that while many generated digits appear realistic, some samples fall in-between digits, exemplifying the VAE’s tendency to interpolate between classes rather than producing distinctly different outputs.
For example, a digit might look like a blend between '7' and '9'.
The dimensionality of the latent variable $z$ in VAEs appears to have varying impacts on model performance.
If the dimensionality is too low (e.g., fewer than 4), the model struggles to capture the complexities in the data, leading to poor performance.
Specifically, a model with too few dimensions fails to adequately represent the variations present in the input data.
Conversely, increasing the dimensions of $z$ improves performance to a certain extent, but when the dimensionality is excessively high (e.g., 10,000), training becomes difficult, especially during optimization with stochastic gradient descent.
This happens because stochastic gradient descent has a harder time keeping the Kullback-Leibler divergence $D[Q(z|X) || P(z)]$, a measure of how far the approximate posterior is from the prior, low when the dimensionality of $z$ is very large.
Tutorial on Variational Autoencoders