4.7 Results: Evaluating LSBD with $D_{\mathrm{LSBD}}$

achieve good scores on all traditional metrics. In particular, SAP, DCI, and MIG scores are low. We believe this is a result of the cyclic nature of the symmetries underlying our datasets, further emphasising the need for disentanglement methods that can capture such symmetries. The SAP and MIG scores measure to what extent generative factors are disentangled into a single latent dimension. However, since the factors in our dataset are inherently cyclic due to their symmetry structure, they cannot be properly represented in a single latent dimension, as shown by Pérez Rey et al. (2019). Instead, at least two dimensions are needed to continuously represent each cyclic factor in our data. A similar conclusion was reached by Caselles-Dupré et al. (2019) and Painter et al. (2020).

DCI disentanglement measures whether a latent dimension captures at most one generative factor. This is accomplished by measuring the importance of each latent dimension in predicting each true generative factor using boosted trees (a minimal sketch of this computation is given at the end of this section). However, since the generative factors are cyclic, the performance of the boosted-tree classifiers is far from optimal, which spreads importance over several dimensions when predicting the generative factors and gives overall lower DCI scores.

4.7.2 LSBD-VAE and Other LSBD Methods Can Learn LSBD Representations with Limited Supervision on Transformations

From Figure 4.6 we observe that methods focusing specifically on LSBD can score higher on $D_{\mathrm{LSBD}}$, showing that they are indeed more suitable for learning LSBD representations. In particular, LSBD-VAE obtained very good $D_{\mathrm{LSBD}}$ scores for all datasets. Moreover, our experiments on the Arrow, Airplane, and Square datasets also show that only limited supervision suffices to obtain good $D_{\mathrm{LSBD}}$ scores with low variability, either with a few transformation-labelled pairs or with paths of consecutive observations that are easy to obtain in agent-environment settings.

To further highlight this, Figure 4.7 shows $D_{\mathrm{LSBD}}$ scores for LSBD-VAE trained on the Square, Arrow, and Airplane datasets, respectively, for various values of the number of labelled pairs $L$. For each $L$ and each dataset, we trained 10 models so we can report box plots of the $D_{\mathrm{LSBD}}$ scores. For low values of $L$ we see worse scores and high variability, but for slightly higher $L$ the scores are consistently good, starting already at $L = 512$ for the Square, $L = 768$ for the Arrow, and $L = 256$ for the Airplane dataset. This corresponds to
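As a complement to the DCI discussion above, the following sketch illustrates the two ingredients involved: a cyclic factor encoded continuously with two latent dimensions via $(\cos\theta, \sin\theta)$, and a DCI-style importance matrix read off from boosted-tree regressors. This is a minimal illustration under our own assumptions (synthetic factors, scikit-learn's GradientBoostingRegressor, and the standard entropy-based disentanglement formula), not the evaluation code used for the scores reported in this chapter.

```python
# Illustrative sketch only (not the chapter's evaluation code).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000

# Two cyclic generative factors, e.g. two rotation angles in [0, 2*pi).
factors = rng.uniform(0.0, 2.0 * np.pi, size=(n, 2))

# A cyclic factor needs at least two latent dimensions to be represented
# continuously; here each angle theta is encoded as (cos theta, sin theta).
latents = np.concatenate(
    [np.stack([np.cos(factors[:, k]), np.sin(factors[:, k])], axis=1)
     for k in range(factors.shape[1])],
    axis=1,
)  # shape (n, 4)

# DCI-style importance matrix R[i, j]: importance of latent dimension i when
# predicting generative factor j with a boosted-tree regressor.
R = np.zeros((latents.shape[1], factors.shape[1]))
for j in range(factors.shape[1]):
    reg = GradientBoostingRegressor(random_state=0).fit(latents, factors[:, j])
    R[:, j] = reg.feature_importances_

# Per-dimension disentanglement: 1 minus the normalised entropy of each
# dimension's importance distribution over factors (the usual DCI formula).
P = R / (R.sum(axis=1, keepdims=True) + 1e-12)
entropy = -np.sum(P * np.log(P + 1e-12), axis=1) / np.log(factors.shape[1])
disentanglement = 1.0 - entropy

# Because each cyclic factor occupies two dimensions, the importance mass for
# each factor (each column of R) is split over its two encoding dimensions.
print("importance matrix R:\n", np.round(R, 3))
print("per-dimension disentanglement:", np.round(disentanglement, 3))
```

In the full DCI computation these per-factor importances are aggregated into disentanglement and completeness scores; when every cyclic factor necessarily occupies two dimensions, the importance mass per factor is split across them, which helps explain the lower scores discussed above.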