
The EM algorithm cannot be used either, since the true posterior $p_\theta(z|x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(x)}$ is intractable as well. To address this, the VAE defines a parametric approximate posterior $q_\phi(z|x)$ that can be used for variational inference. The parameters $\phi$ are modelled with a neural network, called the encoder, since it encodes observed data points into a distribution over the latent space $\mathcal{Z}$. The marginal log likelihood can then be written as
\begin{equation}
    \log p_\theta(x) = \mathrm{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) + \mathrm{ELBO}(\phi, \theta; x), \tag{2.1}
\end{equation}
where the first term on the right is the Kullback-Leibler (KL) divergence of the approximate from the true posterior. Since the KL divergence is non-negative, the second term is a lower bound on the marginal log likelihood, often called the evidence lower bound (ELBO). This KL divergence is intractable, so we focus instead on maximising the ELBO. In particular, from Equation (2.1) we see that maximising the ELBO with respect to $\phi$ is the same as minimising the KL divergence of the approximate from the true posterior, meaning that we are in fact performing variational inference to learn $q_\phi(z|x)$. The ELBO can be written as
\begin{equation}
    \mathrm{ELBO}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big), \tag{2.2}
\end{equation}
which we want to optimise (i.e. maximise) with respect to the parameters $\phi$ and $\theta$, using gradient-based methods. However, estimating a gradient with respect to $\phi$ for the expectation is not trivial, and typical gradient estimators exhibit high variance. Instead, Kingma and Welling (2013) propose a reparameterisation trick, noting that it is often possible to express a continuous random variable $z \sim q_\phi(z|x)$ as a deterministic variable $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an auxiliary random variable with a fixed marginal distribution $p(\epsilon)$, and $g_\phi(\cdot, \cdot)$ is some vector-valued function parameterised by $\phi$. Such a reparameterisation allows for a simple Monte Carlo estimate of the expectation that is differentiable with respect to $\phi$, by sampling from the fixed distribution $p(\epsilon)$ rather than the parameterised distribution $q_\phi(z|x)$:
\begin{equation}
    \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x \,|\, z^{(l)}\big), \tag{2.3}
\end{equation}
where $z^{(l)} = g_\phi(\epsilon^{(l)}, x)$ and $\epsilon^{(l)} \sim p(\epsilon)$. For training a VAE, a single sample (i.e. $L = 1$) is typically sufficient, so in practice this loss component boils down to
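As a concrete illustration of Equations (2.2) and (2.3), the sketch below shows a minimal VAE forward pass in PyTorch with a Gaussian encoder and a standard normal prior; the layer sizes, the Bernoulli decoder, and the specific architecture are illustrative assumptions rather than the model used in this thesis.
\begin{verbatim}
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=10, h_dim=400):
        super().__init__()
        # Encoder q_phi(z|x): mean and log-variance of a diagonal Gaussian.
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder p_theta(x|z): Bernoulli logits per dimension (assumption).
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterisation trick: z = g_phi(eps, x) = mu + sigma * eps,
        # eps ~ N(0, I), so gradients w.r.t. phi flow through mu and logvar.
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        x_logits = self.dec(z)
        # Single-sample (L = 1) Monte Carlo estimate of E_q[log p_theta(x|z)],
        # as in Equation (2.3).
        log_px_z = -nn.functional.binary_cross_entropy_with_logits(
            x_logits, x, reduction='none').sum(dim=1)
        # Closed-form KL(q_phi(z|x) || p(z)) for a diagonal Gaussian posterior
        # against a standard normal prior.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=1)
        elbo = log_px_z - kl   # Equation (2.2), estimated per data point
        return -elbo.mean()    # negative ELBO as the loss to minimise
\end{verbatim}
Minimising this negative ELBO with a stochastic gradient optimiser then corresponds to jointly maximising the ELBO with respect to both $\phi$ and $\theta$.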
