spaced values with an in-between angle of 2π/n_classes. Since there are only a few classes, this naturally encourages class-based clustering in the corresponding latent subspaces. The factors scale, x-position, and y-position in dSprites, as well as scale and orientation in 3D Shapes, are essentially continuous but not cyclic. We therefore map them to angle values ranging from 0 to 0.9 · 2π radians, introducing a discontinuity between the lowest and highest observed factor values. The remaining factors are cyclic, so we represent them with regular angle values between 0 and 2π (a Python sketch of this mapping is given below).

Architecture and hyperparameters

For the encoder we use a convolutional architecture as in Locatello et al. (2018), with 4 convolutional layers (4×4 kernels and 2×2 strides, with 32, 32, 64, and 64 filters, respectively), followed by a fully-connected layer with 256 units. The output of this layer is connected by fully-connected layers to the parameters of the LSBD-VAE latent subspaces; a sketch of this encoder is given below. The decoder mirrors the encoder, using strided transposed convolutions. All hidden layers use ReLU non-linearities, and the output layer uses a sigmoid activation to predict pixel values. Each model is trained with the Adam optimiser, using an early stopping criterion. We use a mini-batch size of 8, where each element of the mini-batch is itself a transformation-supervised batch of M = 32 images, so each mini-batch consists of 256 images with batch shape (8, 32), as illustrated in a sketch below. We train each combination of dataset and OOD split 3 times and report the mean scores over these 3 runs in our evaluations.

5.3.4 Traditional Disentanglement Models

For comparison, we train a regular VAE as well as 5 traditional unsupervised disentanglement models: BetaVAE (Higgins et al., 2017), DIP-VAE-I and II (Kumar et al., 2017), FactorVAE (Kim and Mnih, 2018), and cc-VAE (Burgess et al., 2018), as implemented in disentanglement_lib (Locatello et al., 2018), all using the same architecture as described above for the LSBD-VAE. See Section 2.2 for more information on traditional disentanglement models. These models do not address disentanglement from the perspective of LSBD, but are instead based on statistical properties of the data. Because of the large number of model-dataset combinations and limited computing power, we train each combination only once. Each model is trained for 30,000 training steps with a batch size of 64. For the Square and Arrow datasets, which have 2 cyclic factors, we trained each model-dataset combination with 2, 4, and 7 latent dimensions. Although
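To make the factor-to-angle mapping concrete, the following Python sketch reproduces the three cases described above. The function name, its signature, and the encoding of factor values as integer indices are our own illustrative assumptions, not part of the released code.

    import numpy as np

    def factor_to_angle(index, n_values, kind):
        # Map a factor value index (0, ..., n_values - 1) to an angle in radians.
        # "categorical": the few classes are spread evenly over the circle,
        #                with an in-between angle of 2*pi / n_values
        # "continuous":  non-cyclic factors are squeezed into [0, 0.9 * 2*pi],
        #                leaving a gap between the lowest and highest values
        # "cyclic":      regular angles covering the full range [0, 2*pi)
        if kind in ("categorical", "cyclic"):
            return index * 2 * np.pi / n_values
        if kind == "continuous":
            return index / (n_values - 1) * 0.9 * 2 * np.pi
        raise ValueError("unknown factor kind: %s" % kind)

Note that the categorical and cyclic cases coincide numerically: spacing a few classes at in-between angles of 2π/n_classes places them on the circle exactly as regular cyclic angles would; the difference lies only in the interpretation of the factor.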
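The encoder described above can be sketched in Keras as follows. The per-subspace output heads (a 2-unit location on the unit circle plus a scalar concentration) are an assumed parameterisation of the LSBD-VAE latent subspaces, and the input shape and number of subspaces are placeholders.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_encoder(input_shape=(64, 64, 3), n_subspaces=2):
        # Four 4x4 convolutions with stride 2 and 32, 32, 64, 64 filters,
        # followed by a 256-unit fully-connected layer, as in Locatello
        # et al. (2018). All hidden layers use ReLU.
        x_in = layers.Input(shape=input_shape)
        h = x_in
        for n_filters in (32, 32, 64, 64):
            h = layers.Conv2D(n_filters, 4, strides=2, padding="same",
                              activation="relu")(h)
        h = layers.Flatten()(h)
        h = layers.Dense(256, activation="relu")(h)
        # One head per latent subspace: a point on the circle (2 units)
        # and a concentration parameter (assumed parameterisation).
        outputs = []
        for i in range(n_subspaces):
            outputs.append(layers.Dense(2, name="loc_%d" % i)(h))
            outputs.append(layers.Dense(1, activation="softplus",
                                        name="conc_%d" % i)(h))
        return tf.keras.Model(x_in, outputs)

The decoder mirrors this architecture with Conv2DTranspose layers of the same kernel size and stride, traversing the hidden layer widths in reverse and ending in a sigmoid output over pixel values.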
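Finally, a minimal sketch of the nested batching: each mini-batch stacks 8 transformation-supervised batches of M = 32 images, giving batch shape (8, 32). The helper below is hypothetical; how the M related images are selected depends on the transformation supervision and is not shown here.

    import numpy as np

    BATCH_SIZE = 8   # transformation-supervised batches per mini-batch
    M = 32           # images per transformation-supervised batch

    def make_minibatch(images, group_indices):
        # `group_indices` holds, per mini-batch element, the indices of
        # M images related by known transformations (assumed given).
        batch = np.stack([images[idx] for idx in group_indices])
        assert batch.shape[:2] == (BATCH_SIZE, M)
        return batch

    # Example with dummy data: 256 images per mini-batch.
    images = np.random.rand(1000, 64, 64, 3).astype("float32")
    groups = [np.random.choice(len(images), size=M, replace=False)
              for _ in range(BATCH_SIZE)]
    print(make_minibatch(images, groups).shape)  # (8, 32, 64, 64, 3)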