only the ELBO values of individual data points. In this case, bad OOD detection indicates good generalisation. We can visualise the ELBO values for training and OOD data in a density plot; see Figure 5.7a for an example. A large overlap between the training and OOD sets indicates good generalisation. We can quantify this overlap by plotting an ROC curve (see Figure 5.7b) and computing the area under the ROC curve (AUROC). An AUROC of 0.5 indicates that ELBO values give no better indication of whether data is OOD than random guessing, so in that case generalisation would be very good. Higher AUROC scores (towards 1) indicate that OOD samples typically receive a much lower ELBO than training samples, indicating poor generalisation.

Figure 5.7: OOD detection example for LSBD-VAE on the Arrow 0.25 split. (a) ELBO density plot. (b) ROC curve.

In Figure 5.8 we summarise the AUROC scores for all models and datasets. Again we show the traditional models with 7 latent dimensions, which had the best performance. As before, we observe that LSBD-VAE only shows an advantage on the Square dataset, although we now also see that even for the limited 0.125 split the AUROC is far from the optimal value of 0.5, indicating that OOD samples are not represented well compared to training samples for any of the models. For the Arrow dataset, we see that decent generalisation is possible up to the 0.375 split, after which all models start to generalise worse. For the dSprites and 3D Shapes datasets, we once again confirm the findings of Montero et al. (2021) that generalisation mostly happens in limited cases, that models extrapolate badly, and that disentanglement models do not necessarily perform better than a regular VAE.
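To make the AUROC computation described above concrete, the following is a minimal sketch, not the evaluation code used in this chapter. It assumes per-sample ELBO values for the training and OOD sets are already available as NumPy arrays; the names `elbo_train`, `elbo_ood`, and `ood_auroc` are hypothetical placeholders.

```python
# Minimal sketch of ELBO-based OOD detection (assumed setup, not the
# exact evaluation code used in this chapter).
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(elbo_train: np.ndarray, elbo_ood: np.ndarray) -> float:
    """AUROC for detecting OOD samples from per-sample ELBO values.

    OOD samples are treated as the positive class. Since OOD data
    typically receives a *lower* ELBO, the negated ELBO serves as the
    OOD score: an AUROC near 0.5 means OOD samples are indistinguishable
    from training data (good generalisation), while values towards 1
    indicate poor generalisation.
    """
    scores = -np.concatenate([elbo_train, elbo_ood])
    labels = np.concatenate([np.zeros(len(elbo_train)),   # 0 = training
                             np.ones(len(elbo_ood))])     # 1 = OOD
    return roc_auc_score(labels, scores)

# Hypothetical usage with placeholder ELBO arrays:
# auroc = ood_auroc(elbo_train, elbo_ood)
```

Negating the ELBO simply orients the score so that, as in the discussion above, AUROC values towards 1 correspond to OOD samples receiving lower ELBOs.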