derive
\begin{align}
\log p_\theta(x \mid z) &= \log \mathcal{N}\!\left(x \mid \mu_{\text{dec}}(z),\, \sigma_{\text{dec}}^2 I\right) \tag{2.7}\\
&= \log \prod_{d=1}^{D} \mathcal{N}\!\left(x_d \mid \mu_d, \sigma_{\text{dec}}^2\right) \tag{2.8}\\
&= \sum_{d=1}^{D} \log\!\left( \frac{1}{\sqrt{2\pi\sigma_{\text{dec}}^2}} \, e^{-\frac{(x_d - \mu_d)^2}{2\sigma_{\text{dec}}^2}} \right) \tag{2.9}\\
&= \sum_{d=1}^{D} \left( -\frac{1}{2}\log 2\pi\sigma_{\text{dec}}^2 - \frac{(x_d - \mu_d)^2}{2\sigma_{\text{dec}}^2} \right), \tag{2.10}
\end{align}
where $D$ is the dimensionality of the data space $\mathcal{X}$, $\mu_{\text{dec}}(z) = (\mu_1, \dots, \mu_D)^T$, and $x = (x_1, \dots, x_D)^T$. Since the first term in this last line is constant with respect to the trainable parameters $\mu_{\text{dec}}(z)$, we only optimise the second term, which acts as a (negative) squared error between the input $x$ and the predicted mean $\mu_{\text{dec}}$, scaled by $\frac{1}{2\sigma_{\text{dec}}^2}$. This clearly relates to the mean squared error (but without averaging over the data dimensions), a commonly used reconstruction loss for neural networks. In particular, if we choose $\sigma_{\text{dec}} = \frac{1}{\sqrt{2}}$, we obtain an unscaled (negative) squared error.

Alternatively, if the data is binary (i.e. each data point $x$ is a vector of 0's and 1's), it is common to model $p_\theta(x \mid z)$ with independent Bernoulli trials instead, resulting in the following reconstruction term:
\begin{align}
\log p^*_\theta(x \mid z) &= \log \prod_{d=1}^{D} \operatorname{Bernoulli}(x_d \mid \rho_d) \tag{2.11}\\
&= \sum_{d=1}^{D} \log \rho_d^{x_d} (1-\rho_d)^{1-x_d} \tag{2.12}\\
&= \sum_{d=1}^{D} x_d \log \rho_d + (1-x_d) \log(1-\rho_d), \tag{2.13}
\end{align}
where $\rho(z) = (\rho_1, \dots, \rho_D)^T$ is now the output of the decoder. Note that this is essentially the (negative) binary cross-entropy loss (but without averaging over the data dimensions), a common loss function for neural networks. Although this expression is theoretically valid only for binary data, in practice it is often used as a reconstruction loss for non-binarised image data as well, which consists of pixel values that can attain any real value from 0 to 1.
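To make the two reconstruction terms concrete, the following is a minimal NumPy sketch, not taken from this thesis: the function names, the clipping constant `eps`, and the toy arrays are illustrative assumptions. It evaluates Eq. (2.10) and Eq. (2.13) for a single data point, summing over the $D$ data dimensions without averaging, and checks that with $\sigma_{\text{dec}} = \frac{1}{\sqrt{2}}$ the Gaussian term reduces to a constant minus the unscaled squared error.

```python
import numpy as np

def gaussian_log_likelihood(x, mu_dec, sigma_dec=1.0 / np.sqrt(2.0)):
    """Log p(x|z) under a factorised Gaussian decoder, Eq. (2.10),
    summed over the D data dimensions."""
    var = sigma_dec ** 2
    return np.sum(-0.5 * np.log(2.0 * np.pi * var)
                  - (x - mu_dec) ** 2 / (2.0 * var), axis=-1)

def bernoulli_log_likelihood(x, rho, eps=1e-7):
    """Log p(x|z) under a factorised Bernoulli decoder, Eq. (2.13),
    i.e. the negative binary cross-entropy summed over dimensions."""
    rho = np.clip(rho, eps, 1.0 - eps)  # guard against log(0)
    return np.sum(x * np.log(rho) + (1.0 - x) * np.log(1.0 - rho), axis=-1)

# Toy check: with sigma_dec = 1/sqrt(2), 2*pi*sigma^2 = pi, so the Gaussian
# term equals a constant minus the unscaled squared error.
x = np.array([0.2, 0.7, 0.1])
mu = np.array([0.25, 0.6, 0.0])
ll = gaussian_log_likelihood(x, mu)
const = -0.5 * x.size * np.log(np.pi)
assert np.isclose(ll, const - np.sum((x - mu) ** 2))
```

In a training loop one would maximise these quantities (or minimise their negatives) with respect to the decoder outputs $\mu_{\text{dec}}(z)$ or $\rho(z)$; the constant term in the Gaussian case can be dropped, as noted above.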