Posts on social media showing photos modified with the help of AI to look older or younger demonstrate AI's ability to generate objects. In principle, Naive Bayes could be used to generate objects based on certain features, but its 'naive' assumption that variables are independent of one another, even when they should be related, causes difficulties. A model that is capable of this task is the Variational Auto-Encoder (VAE).
What is a Variational Autoencoder?
A variational autoencoder (VAE) is a generative model used in machine learning (ML) to generate new data in the form of variations of the input data used to train it. In addition, variational autoencoders also perform tasks common to other autoencoders, such as denoising.
Like all autoencoders, a variational autoencoder is a deep learning model consisting of an encoder that learns to isolate important latent variables from the training data and a decoder, which then uses these latent variables to reconstruct the input data.
However, most autoencoder architectures encode a discrete, fixed representation of the latent variables, while VAEs encode a continuous, probabilistic representation of the latent space. This allows VAEs to not only accurately reconstruct the original input but also use variational inference to generate new data samples that resemble the original input data.
The neural network architecture for variational autoencoders was originally proposed in a 2013 paper by Diederik P. Kingma and Max Welling, titled Auto-Encoding Variational Bayes. That paper also popularized what they called the reparameterization trick, an important machine learning technique that allows randomness to be used as a model input without sacrificing the model's differentiability, that is, the ability to optimize the model's parameters.
While VAEs are often discussed in the context of image generation, including in this article, they can be used for a wide range of artificial intelligence (AI) applications, from anomaly detection to generating new drug molecules.
[Figure] Top: handwritten digits from the MNIST dataset; bottom: original samples generated by a VAE trained on those MNIST digits. Source: Mak, Hugo & Han, Runze & Yin, Hoover (2023).
A VAE has three components: an encoder, a latent space, and a decoder. The encoder produces a latent space that stores the input's characteristics in the form of a distribution, and the decoder regenerates the input as a new variant drawn from that distribution. So where is the AI? It lies in the ability to create variants: the newly formed object is not the initial object, but is genuinely created from the distribution of features extracted from it.
The image above shows a system that is capable of recreating the digits from the initial objects (bottom row). Some are wrong, especially the digits 8 and 4, which come out resembling 5 and 9. For more detail, please try running the following Colab code [Link]
What is the meaning of latent space?
Central to understanding VAEs, and autoencoders in general, is the idea of latent space, the term for the collective latent variables of a given set of input data. In short, latent variables are the underlying variables that inform how the data is distributed but are often not directly observable.
As a visualization of the concept of latent variables, imagine a bridge with sensors that measure the weight of each vehicle passing by. Of course, there are a variety of vehicles using the bridge, ranging from small, light-duty cars to large, heavy-duty trucks. Since there are no cameras, we have no way of detecting whether a particular vehicle is a convertible, sedan, van, or truck. However, we do know that the type of vehicle significantly affects the weight of that vehicle.
Thus, this example involves two random variables, x and z, where x is the directly observable variable, the weight of the vehicle, and z is the latent variable, the type of vehicle. The primary training goal of any autoencoder is to learn how to efficiently model the latent space of a given input.
1. Latent space and dimensionality reduction
Autoencoders model latent space through dimensionality reduction: the compression of data into a low-dimensional space that captures the meaningful information contained in the original input.
In the context of machine learning (ML), mathematical dimensions do not correspond to the familiar spatial dimensions in the physical world, but to features of the data. For example, a 28x28 pixel image of handwritten black-and-white numbers from the MNIST dataset can be represented as a 784-dimensional vector, where each dimension corresponds to an individual pixel whose values range from 0 (for black) to 1 (for white). The same image in color, on the other hand, can be represented as a 2,352-dimensional vector, where each of the 784 pixels is represented in three dimensions corresponding to red, green, and blue (RGB) values.
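As a rough NumPy sketch of this idea (the random arrays below simply stand in for real MNIST pixels), flattening the images makes the dimension counts explicit:

```python
import numpy as np

# A 28x28 grayscale image becomes a 784-dimensional vector;
# each dimension is one pixel intensity in the range [0, 1].
grayscale_image = np.random.rand(28, 28)        # stand-in for a real MNIST digit
grayscale_vector = grayscale_image.reshape(-1)  # shape: (784,)

# The same image in RGB has three values per pixel: 28 * 28 * 3 = 2352 dimensions.
color_image = np.random.rand(28, 28, 3)
color_vector = color_image.reshape(-1)          # shape: (2352,)

print(grayscale_vector.shape, color_vector.shape)  # (784,) (2352,)
```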
However, not all of these dimensions contain useful information. The numbers themselves actually represent only a small portion of the image, so most of the input space is background noise. Compressing data into only dimensions containing relevant information, namely the latent space, can improve the accuracy, efficiency, and effectiveness of various ML tasks and algorithms.
What is an autoencoder?
VAEs are part of the larger category of autoencoders, which are neural network architectures commonly used in deep learning for tasks such as data compression, image denoising, anomaly detection, and facial recognition.
An autoencoder is a self-supervised system whose training goal is to compress (or encode) input data through dimensionality reduction and then accurately reconstruct (or decode) the original input using that compressed representation.
Essentially, the function of an autoencoder is to effectively extract the most salient information from the data, its latent variables, while discarding irrelevant noise. What distinguishes different types of autoencoders from one another is the specific strategy used to extract that information and the use cases for which each strategy is best suited.
During training, the encoder network passes the input data from the training dataset through a “bottleneck” before it reaches the decoder. The decoder network is then responsible for reconstructing the original input using only the latent variable vectors.
After each training epoch, an optimization algorithm such as gradient descent is used to adjust the model weights in a way that minimizes the difference between the original input data and the decoder's output. Eventually, the encoder learns to pass through the information most conducive to accurate reconstruction, and the decoder learns to reconstruct it effectively.
While this is intuitively suitable for simple data compression tasks, the ability to efficiently encode accurate latent representations of unlabeled data enables a wide range of uses for autoencoders. For example, autoencoders can be used to recover corrupted audio files, colorize grayscale images, or detect anomalies (such as those caused by fraud) that are invisible to the naked eye.
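Before breaking down these components, here is a minimal PyTorch sketch of the whole picture described above: an encoder that compresses a flattened 784-dimensional input through a bottleneck, a decoder that reconstructs it, and a single training step that minimizes the reconstruction error. The layer sizes, optimizer, and random stand-in data are illustrative assumptions, not a prescribed architecture.

```python
import torch
from torch import nn

# Minimal "vanilla" autoencoder: encoder -> bottleneck -> decoder.
# The 784 -> 128 -> 32 layer sizes are illustrative choices, not a standard.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),               # bottleneck: the latent vector z
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),  # output back in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                   # reconstruction loss

# One toy training step on random data standing in for flattened images.
x = torch.rand(64, 784)
reconstruction = model(x)
loss = loss_fn(reconstruction, x)        # difference between input and reconstruction
optimizer.zero_grad()
loss.backward()
optimizer.step()
```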
What is the structure of an autoencoder?
While different types of autoencoders add or modify certain aspects of their architecture to better suit specific purposes and data types, all autoencoders share three main structural elements:
1. Encoder
The encoder extracts latent variables from the input data x and outputs them as a vector representing the latent space z. In a “vanilla” autoencoder, each subsequent encoder layer typically contains fewer nodes than the previous layer; as the data passes through each encoder layer, it is “squeezed” into progressively fewer dimensions.
Other variants of autoencoders use regularization terms, such as functions that enforce sparsity by penalizing the number of activated nodes at each layer, to achieve this dimensionality reduction.
2. Bottleneck
The bottleneck, or “code,” is the output layer of the encoder network and the input layer of the decoder network. The bottleneck contains the latent space: a fully compressed, lower-dimensional embedding of the input data. A sufficiently narrow bottleneck is needed to help ensure that the decoder cannot simply copy or memorize the input data, which would nominally fulfill its training task but would prevent the autoencoder from learning.
3. Decoder
The decoder uses this latent representation to reconstruct the original input by inverting the encoder: in a typical decoder architecture, each subsequent layer contains an increasingly larger number of activated nodes.
While the encoder and decoder networks of many autoencoders are built from standard multilayer perceptrons (MLPs), autoencoders are not limited to a specific type of neural network.
Autoencoders used for computer vision tasks are often convolutional neural networks (CNNs) and are thus called convolutional autoencoders. Autoencoders built from transformer architectures have been used in areas such as computer vision and music.
The main advantage of autoencoders over other dimensionality reduction algorithms, such as Principal Component Analysis (PCA), is that autoencoders can model nonlinear relationships between variables. For that reason, the nodes of autoencoder neural networks typically use nonlinear activation functions.
In many uses of autoencoders, the decoder serves only to assist in optimizing the encoder and is discarded after training. In variational autoencoders, however, the decoder is retained and used to generate new data points.
How does a variational autoencoder work?
What sets a VAE apart from other autoencoders is its unique way of encoding the latent space and the variety of use cases to which its probabilistic encoding can be applied.
Unlike most autoencoders, which are deterministic models that encode a single vector of discrete latent variables, a VAE is a probabilistic model. A VAE encodes the latent variables of the training data not as fixed discrete values z, but as a continuous range of possibilities expressed as a probability distribution p(z).
In Bayesian statistics, this learned range of possibilities for the latent variables is called the prior distribution. In variational inference, the generative process of synthesizing new data points, this prior distribution is used to compute the posterior distribution p(z|x): the distribution of the latent variable z given an observed value x.
For each latent attribute of the training data, the VAE encodes two distinct latent vectors: a mean vector, “μ”, and a standard deviation vector, “σ”. In essence, these two vectors represent the range of possibilities for each latent variable and the expected variance within each range of possibilities.
By randomly sampling from these encoded ranges of possibilities, the VAE can synthesize new data samples that, while unique and original, resemble the original training data. While relatively intuitive in principle, this methodology requires further adaptation of standard autoencoder methodology to be put into practice.
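To make the μ and σ encoding concrete, the sketch below shows one common way to structure a VAE encoder in PyTorch, with two separate linear heads producing the mean vector and a log-variance vector from which σ is derived. The class name, layer sizes, and head names are illustrative assumptions.

```python
import torch
from torch import nn

# Sketch of a VAE encoder: instead of a single latent vector, it outputs
# a mean vector mu and a log-variance vector (from which sigma is derived).
class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu_head = nn.Linear(hidden_dim, latent_dim)      # mean of each latent variable
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)  # log-variance, i.e. log(sigma^2)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu_head(h), self.logvar_head(h)

mu, logvar = VAEEncoder()(torch.rand(64, 784))
sigma = torch.exp(0.5 * logvar)  # standard deviation for each latent dimension
```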
To explain the capabilities of the VAE, we will review the following concepts:
1. Reconstruction loss
Like all autoencoders, VAEs use reconstruction loss, also called reconstruction error, as the primary loss function during training. Reconstruction error measures the difference (the “loss” of information) between the original input data and the reconstructed version of that data output by the decoder. Several algorithms, including cross-entropy loss or mean squared error (MSE), can be used as the reconstruction loss function.
As explained earlier, the autoencoder architecture creates a bottleneck that allows only a subset of the original input data to be passed to the decoder. At the beginning of training, which typically begins with random initialization of the model parameters, the encoder has not yet learned which subsets of the data should be weighted more heavily. As a result, the encoder will initially produce a suboptimal latent representation, and the decoder will produce an inaccurate or incomplete reconstruction of the original input.
By minimizing the reconstruction error through some form of gradient descent on the parameters of the encoder and decoder networks, the autoencoder's model weights are adjusted in a way that produces a more useful encoding of the latent space (and thus a more accurate reconstruction). Mathematically, the goal of the reconstruction loss function is to optimize pθ(x|z), where θ represents the model parameters that yield an accurate reconstruction of the input x given the latent variables z.
Reconstruction loss alone is sufficient to optimize most autoencoders, whose primary goal is to produce a compressed representation of the learned input data that is conducive to accurate reconstruction.
However, the goal of a variational autoencoder is not to reconstruct the original input; rather, it is to generate new samples that resemble the original input. For that reason, an additional optimization term is needed.
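For reference, the snippet below shows two common choices of reconstruction loss for inputs scaled to [0, 1]; the random tensors are placeholders for a real batch and its decoder output, and the choice between them is an assumption that depends on the data.

```python
import torch
import torch.nn.functional as F

# Two common reconstruction losses; both compare the decoder output
# against the original input, summed over the batch.
x = torch.rand(64, 784)      # original (flattened) inputs
x_hat = torch.rand(64, 784)  # stand-in for the decoder's reconstruction

mse_loss = F.mse_loss(x_hat, x, reduction="sum")
bce_loss = F.binary_cross_entropy(x_hat, x, reduction="sum")
```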
2. Kullback-Leibler Divergence
For the purpose of variational inference, i.e., the generation of new samples by a trained model, reconstruction loss alone can result in an irregular latent space encoding because it overfits the training data and does not generalize well to new samples. Therefore, VAEs incorporate another regularization term: Kullback-Leibler Divergence, or KL divergence.
To generate an image, the decoder samples from the latent space. Sampling from specific points in the latent space that represent the original inputs in the training data will replicate those original inputs. To generate new images, the VAE must be able to sample from anywhere in the latent space in between the original data points. For this to be possible, the latent space must exhibit two types of regularity:
- Continuity: Nearby points in the latent space should produce similar content when decoded.
- Completeness: Every point sampled from the latent space should produce meaningful content when decoded.
A simple way to implement continuity and completeness in the latent space is to help ensure that it follows a standard normal (Gaussian) distribution. But minimizing the reconstruction loss alone does not incentivize the model to regularize the latent space in any particular way, since the space “in between” the training points is irrelevant to accurately reconstructing the original data points. This is where the KL divergence regularization term comes into play.
KL divergence is a metric used to compare two probability distributions. Minimizing the KL divergence between the learned latent variable distribution and a standard Gaussian distribution, with mean 0 and standard deviation 1, forces the learned encoding of the latent variables to follow a normal distribution. This allows smooth interpolation at any point in the latent space, and thus the generation of new images.
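When each latent dimension is encoded as a Gaussian with mean μ and standard deviation σ, the KL divergence against the standard normal has a well-known closed form, KL = -0.5 * Σ(1 + log σ² - μ² - σ²), which the sketch below computes from the mean and log-variance vectors.

```python
import torch

def kl_divergence(mu, logvar):
    # Closed-form KL divergence between N(mu, sigma^2) and the standard normal N(0, 1):
    # KL = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

mu = torch.zeros(64, 32)
logvar = torch.zeros(64, 32)
print(kl_divergence(mu, logvar))  # 0 when the encoding already matches N(0, 1)
```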
[Figure] An example of how reconstruction loss and KL divergence affect latent space modeling for handwritten digits 0-9 from the MNIST dataset.
3. Evidence lower bound (ELBO)
One of the pitfalls of using KL divergence for variational inference is that the denominator of the equation is intractable, meaning that it would take theoretically infinite time to compute it directly. To address that issue, and integrate the two main loss functions, VAEs approximate the minimization of KL divergence by maximizing an Evidence Lower Bound (ELBO).
In statistical terminology, the “evidence” in “evidence lower bound” refers to p(x), the observable input data that the VAE is tasked with reconstructing. The observable variables in that input data are the “evidence” for the latent variables discovered by the autoencoder. The “lower bound” refers to a worst-case estimate of the log-likelihood of a particular distribution; the true log-likelihood may be higher than the ELBO.
In the context of VAEs, the evidence lower bound is the worst-case estimate of the likelihood of a particular posterior distribution (in other words, a particular output of the autoencoder, conditioned by the KL divergence loss term and the reconstruction loss term) given the “evidence” of the training data. Training a model for variational inference can therefore be framed as maximizing the ELBO.
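Putting the two terms together, a common way to train a VAE is to minimize the negative ELBO, that is, the reconstruction loss plus the KL divergence term. The function below is a minimal sketch of that combined loss, assuming binary cross-entropy as the reconstruction term and the closed-form KL shown earlier.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    # Negative ELBO: reconstruction term plus KL regularization term.
    # Minimizing this quantity is equivalent to maximizing the evidence lower bound.
    reconstruction = F.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + kl
```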
4. Reparameterization trick
As discussed, the goal of variational inference is to generate new data in the form of random variations of the training data x. At first glance, this seems relatively straightforward: use a function ƒ that selects random values for the latent variables z, which the decoder can then use to generate an approximate reconstruction of x.
However, the inherent property of randomness is that it cannot be optimized. There is no “best” random value; a vector of random values, by definition, has no derivatives—that is, no gradients that express any pattern in the resulting model output—and therefore cannot be optimized via backpropagation using any form of gradient descent. This means that a neural network that uses a random sampling process cannot learn the optimal parameters to perform its task.
To circumvent this constraint, VAEs use the reparameterization trick. The reparameterization trick introduces a new parameter, ε, a random value sampled from a standard normal distribution with mean 0 and standard deviation 1.
The latent variable z is then reparameterized as z = μx + εσx. More simply, this trick chooses a value for the latent variable z by starting from the mean of the variable (represented by μ) and shifting it by a random multiple (represented by ε) of a standard deviation (σ). Conditioned by the specific value of z, the decoder outputs a new sample.
Since the random value ε is not derived from and has no relation to the parameters of the autoencoder model, it can be ignored during backpropagation. The model is updated through some form of gradient descent, most commonly through Adam, a gradient-based optimization algorithm also developed by Kingma, to maximize the ELBO.
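A minimal sketch of the reparameterization trick is shown below: all of the randomness enters through ε, which is sampled independently of the model parameters, so gradients can still flow through μ and σ during backpropagation. The tensor shapes are illustrative.

```python
import torch

def reparameterize(mu, logvar):
    # z = mu + eps * sigma, where eps ~ N(0, 1) carries all of the randomness.
    # Gradients flow through mu and sigma; eps itself is treated as a constant input.
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)
    return mu + eps * sigma

mu = torch.zeros(64, 32, requires_grad=True)
logvar = torch.zeros(64, 32, requires_grad=True)
z = reparameterize(mu, logvar)  # differentiable with respect to mu and logvar
```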
Conditional VAE (CVAE)
One of the drawbacks of conventional “vanilla” VAEs is that the user has no control over the specific outputs produced by the autoencoder. For example, a conventional VAE trained on the aforementioned MNIST dataset would generate new samples of handwritten digits from 0 to 9, but could not be constrained to generate only 4s and 7s.
As the name suggests, a conditional VAE (CVAE) allows for outputs that are conditioned by specific inputs, rather than simply generating random variations of the training data. This is achieved by incorporating an element of supervised learning (or semi-supervised learning) alongside the conventional autoencoder’s typically unsupervised training objective.
By further training the model with labeled examples of a particular variable, that variable can be used to condition the decoder’s output. For example, a CVAE could be first trained on a large dataset of facial images, and then trained using supervised learning to learn a latent encoding for “beard” so that it can generate new images of bearded faces.
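One common way to implement this conditioning, sketched below, is to concatenate a one-hot label vector to the latent code before decoding (and, in a full CVAE, to the encoder input as well). The CVAEDecoder class, the layer sizes, and the digit-7 example are illustrative assumptions rather than a fixed recipe.

```python
import torch
from torch import nn

# Conditioning sketch: a one-hot label y is concatenated to the latent code
# before decoding, so generation can be steered by y.
class CVAEDecoder(nn.Module):
    def __init__(self, latent_dim=32, num_classes=10, output_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + num_classes, 128), nn.ReLU(),
            nn.Linear(128, output_dim), nn.Sigmoid(),
        )

    def forward(self, z, y_onehot):
        return self.net(torch.cat([z, y_onehot], dim=1))

decoder = CVAEDecoder()
z = torch.randn(16, 32)                  # latent codes sampled from the prior
y = torch.eye(10)[torch.full((16,), 7)]  # condition: generate only the digit "7"
samples = decoder(z, y)                  # shape: (16, 784)
```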
Is VAE better than GAN?
VAEs are often compared to generative adversarial networks (GANs), another model architecture used to generate samples that resemble training data, particularly for images.
Like VAEs, GANs are a hybrid architecture that combines two neural networks: a generator network that is responsible for generating image samples that resemble images from a training dataset and a discriminator network that is responsible for determining whether a given image is a “real” image from the training data or a “fake” image from the generator network.
The two networks are trained adversarially in a zero-sum game: feedback from the discriminator is used to improve the generator’s output until the discriminator can no longer distinguish between real and fake samples.
For image synthesis, both have advantages and disadvantages: GANs produce clearer images, but are unstable in training due to the adversarial trade-off between the two composite models. VAEs are easier to train, but, due to their nature of generating images from “averaging” features from the training data, tend to produce blurrier images.
VAE-GAN
VAE-GAN is, as the name suggests, a combination of variational autoencoder (VAE) and generative adversarial network (GAN). VAE-GAN reduces the image blur produced by VAE by replacing the reconstruction loss term of the VAE model with a discriminator network.