Linearly Symmetry-Based Disentangled Representations and their Out-of-Distribution Behaviour Loek Tonnaer

Linearly Symmetry-Based Disentangled Representations and their Out-of-Distribution Behaviour, by Loek Tonnaer. Eindhoven: Technische Universiteit Eindhoven, 2023. Dissertation.

A catalogue record is available from the Eindhoven University of Technology Library.
ISBN 978-90-386-5847-6
SIKS Dissertation Series No. 2023-26

The research reported in this thesis has been carried out under the auspices of SIKS, the Dutch Research School for Information and Knowledge Systems.

This work has received funding from the Electronic Component Systems for European Leadership Joint Undertaking under grant agreement No. 737459 (project Productive4.0). This Joint Undertaking received support from the European Union’s Horizon 2020 research and innovation programme and Germany, Austria, France, Czech Republic, Netherlands, Belgium, Spain, Greece, Sweden, Italy, Ireland, Poland, Hungary, Portugal, Denmark, Finland, Luxembourg, Norway, Turkey.

Copyright © 2023 by Loek Tonnaer. All Rights Reserved.

Linearly Symmetry-Based Disentangled Representations and their Out-of-Distribution Behaviour

DISSERTATION

to obtain the degree of doctor at the Technische Universiteit Eindhoven, on the authority of the rector magnificus prof.dr. S.K. Lenaerts, to be defended in public before a committee appointed by the Doctorate Board (College voor Promoties) on Wednesday 1 November 2023 at 16:00

by

Loek Tonnaer

born in Oirschot

This dissertation has been approved by the promotors, and the composition of the doctoral committee is as follows:

chair: prof.dr. E.R. van den Heuvel
promotor: prof.dr. M. Pechenizkiy
co-promotors: dr. V. Menkovski, dr. M. Holenderski
members: prof.dr. J. Hare (University of Southampton), prof.dr. T. Kärkkäinen (University of Jyväskylä), dr. J. Tomczak, dr.habil. C. de Campos

The research or design described in this dissertation has been carried out in accordance with the TU/e Code of Scientific Conduct.

Summary

Deep learning models excel at solving pattern recognition tasks by learning from high-dimensional data. Discriminative models, which predict low-dimensional labels from the data observations, have been successful but discard valuable information about underlying real-world mechanisms. To counter this, generative modelling and representation learning aim to model such real-world mechanisms, by learning how to generate new data or find lower-dimensional representations that capture its complexity. In this thesis, we focus in particular on the Variational Autoencoder (VAE), a probabilistic generative model that learns latent representations of observed data.

A key motivation for learning representations is their usefulness for downstream tasks. One example is anomaly detection, which is often achieved by assigning an anomaly score to datapoints. The negative likelihood assigned by a VAE trained on normal data forms a natural candidate for such an anomaly score. This entails that datapoints are considered anomalous if they cannot be represented well by the VAE. If the VAE’s representations accurately model the underlying real-world properties of the data, this should provide a reliable method for anomaly detection. The first question this thesis addresses is how VAEs can be used for anomaly detection and how they perform in certain practical cases.

Regular VAEs essentially compress high-dimensional data based only on an information bottleneck, but by extending the VAE framework we can model certain desirable properties of the real world. A particular example is disentangling independent generative factors. The idea is that data can be described by various real-world factors that represent independent mechanisms, and that representations should model these factors in separate latent subspaces. However, there is no consensus on how disentanglement should be defined exactly and whether statistical independence is the right measure to reason about disentanglement. Therefore, we focus on Linear Symmetry-Based Disentanglement (LSBD), a formal group-theoretic definition of disentanglement inspired by physics, which describes how symmetries from the real world should be reflected in learned representations. The second question this thesis addresses is how LSBD can be properly quantified, how LSBD representations can be learned, and how LSBD compares to other notions of disentanglement.

Lastly, this thesis addresses how disentanglement can help in situations where regular VAE-based anomaly detection is not well-aligned with the empirical data distribution. In particular, we focus on cases where certain combinations of generative factors are not observed, and thus empirically out-of-distribution (OOD), but where we still wish to develop a model that can generalise to such factor combinations, i.e. they should not be detected as anomalous. LSBD provides a sensible framework to deal with such OOD generalisation, since the generative factors are modelled by symmetry groups that need not depend on the observed data distribution.

The contributions in this thesis can be summarised as follows:

• With a VAE trained on normal data samples, we can detect anomalous samples if their assigned probability density is lower than for normal samples. We test this anomaly detection framework on applications for visual quality control and lung cancer detection. Results show that anomaly detection is possible in certain cases, confirming the validity of this approach, but in more complicated settings the models fail to represent the data well enough for reliable anomaly detection.

• The LSBD definition gives an explicit formalisation of disentanglement, but it does not provide a metric to quantify LSBD. Such a metric is crucial to evaluate LSBD methods and to compare to previous notions of disentanglement. Therefore, we propose D_LSBD, a formal metric based on group theory to quantify LSBD for arbitrary representations, as well as a practical implementation to compute this metric for common group decompositions. Furthermore, from this metric we derive LSBD-VAE, a weakly-supervised model to learn LSBD representations. We also demonstrate how this model can be used for a particular image retrieval challenge. We demonstrate the utility of the D_LSBD metric by showing that (1) common VAE-based disentanglement methods don’t learn LSBD representations, (2) LSBD-VAE, as well as other recent methods, can learn LSBD representations needing only limited supervision on transformations, and (3) various desirable properties expressed by existing disentanglement metrics are also achieved by LSBD representations.

• Lastly, we investigate how well both LSBD-VAE and traditional VAE-based disentanglement models can perform OOD generalisation in a number of controlled settings. Results show that both model types struggle with generalisation in more challenging settings. However, we also observe that the LSBD-VAE encoder often still learns a meaningful mapping that reflects the underlying group structure. In other words, the encoder may generalise well to data with unseen factor combinations even if the decoder struggles to correctly reconstruct this data.

List of Publications

Relating to Chapter 3

1. Tonnaer, L., Li, J., Osin, V., Holenderski, M., and Menkovski, V. (2019). Anomaly Detection for Visual Quality Control of 3D-Printed Products. In Proceedings of the International Joint Conference on Neural Networks (IJCNN)

2. Santos Buitrago, N., Tonnaer, L., Menkovski, V., and Mavroeidis, D. (2018). Anomaly detection for imbalanced datasets with deep generative models. In 27th Belgian-Dutch Conference on Machine Learning (Benelearn 2018)

Relating to Chapter 4

3. Tonnaer, L., Pérez Rey, L. A., Menkovski, V., Holenderski, M., and Portegies, J. W. (2022). Quantifying and Learning Linear Symmetry-Based Disentanglement. In Proceedings of the 39th International Conference on Machine Learning (ICML)

4. Pérez Rey, L., Tonnaer, L., Menkovski, V., Holenderski, M., and Portegies, J. (2020). A metric for linear symmetry-based disentanglement. In NeurIPS 2020 workshop on Differential Geometry meets Deep Learning (DiffGeo4DL)

5. Sipiran, I., Lazo, P., Lopez, C., Jimenez, M., Bagewadi, N., Bustos, B., Dao, H., Gangisetty, S., Hanik, M., Ho-Thi, N.-P., Holenderski, M., Jarnikov, D., Labrada, A., Lengauer, S., Licandro, R., Nguyen, D.-H., Nguyen-Ho, T.-L., Perez Rey, L., Pham, B.-D., Pham, M.-K., Preiner, R., Schreck, T., Trinh, Q.-H., Tonnaer, L., von Tycowicz, C., and Vu-Le, T.-A. (2021). SHREC 2021: Retrieval of cultural heritage objects. Computers and Graphics (Pergamon), 100

Relating to Chapter 5

6. Tonnaer, L., Holenderski, M., and Menkovski, V. (2023). Out-of-Distribution Generalisation with Symmetry-Based Disentangled Representations. In Advances in Intelligent Data Analysis XXI (IDA). Springer, Cham

Contents

Summary
List of Publications
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Research Questions
    1.2.1 Anomaly Detection with Probabilistic Generative Models
    1.2.2 Quantifying and Learning Linear Symmetry-Based Disentanglement (LSBD)
    1.2.3 Out-of-Distribution Generalisation with Linear Symmetry-Based Disentangled Representations
  1.3 Thesis Outline and Contributions

2 Background
  2.1 Variational Autoencoders
  2.2 Disentanglement
  2.3 Group Theory
  2.4 Measure Theory

3 Anomaly Detection with Variational Autoencoders
  3.1 Introduction
  3.2 Related Work
  3.3 Anomaly Detection with Generative Models
    3.3.1 Likelihood Estimation with VAEs
    3.3.2 Likelihood Estimation with GANs
    3.3.3 Thresholds for Likelihood Scores
    3.3.4 Localising Anomalies with Reconstructions
  3.4 Experimental Setup
    3.4.1 Datasets
    3.4.2 Architectures and Hyperparameters
  3.5 Results
    3.5.1 MNIST
    3.5.2 3D-Printed Products
    3.5.3 NLST 3D nodules
  3.6 Conclusion

4 Quantifying and Learning Linear Symmetry-Based Disentanglement (LSBD)
  4.1 Introduction
  4.2 LSBD Definition
  4.3 Related Work
  4.4 D_LSBD: Quantifying LSBD
    4.4.1 Intuition: Measuring Equivariance with Dispersion
    4.4.2 D_LSBD: A Metric for LSBD
    4.4.3 Practical Computation of D_LSBD
  4.5 LSBD-VAE: Learning LSBD Representations
    4.5.1 Assumptions
    4.5.2 Unsupervised Learning on a Latent Manifold with ∆VAE
    4.5.3 Semi-Supervised Learning with Transformation Labels
  4.6 Experimental Setup
    4.6.1 Datasets
    4.6.2 LSBD-VAE with Semi-supervised Labelled Pairs
    4.6.3 LSBD-VAE with Paths of Consecutive Observations
    4.6.4 Other Disentanglement Methods
    4.6.5 Disentanglement Metrics
    4.6.6 Further Experimental Details
  4.7 Results: Evaluating LSBD with D_LSBD
    4.7.1 Standard Disentanglement Methods Don’t Learn LSBD Representations
    4.7.2 LSBD-VAE and other LSBD Methods Can Learn LSBD Representations with Limited Supervision on Transformations
    4.7.3 LSBD Representations Also Satisfy Previous Disentanglement Notions
    4.7.4 Full Quantitative Results
    4.7.5 Further Qualitative Results
  4.8 SHREC 2021 Object Retrieval Challenge
    4.8.1 The Challenge
    4.8.2 Our Methodology
    4.8.3 Results and Conclusions
  4.9 Conclusion

5 Out-of-Distribution Generalisation with LSBD Representations
  5.1 Introduction
  5.2 Related Work
  5.3 Experimental Setup
    5.3.1 Datasets
    5.3.2 OOD Splits: Left-out Factor Combinations
    5.3.3 LSBD-VAE
    5.3.4 Traditional Disentanglement Models
  5.4 Experiments and Results
    5.4.1 Likelihood Ratio: Training vs. OOD ELBO
    5.4.2 OOD Detection: Area Under ROC Curve (AUROC)
    5.4.3 Reconstructions of OOD Combinations
    5.4.4 Equivariance of OOD Combinations
  5.5 Conclusion

6 Conclusion and Future Work
  6.1 Conclusions
  6.2 Limitations
  6.3 Future Work

Bibliography
Curriculum Vitae
Acknowledgements
SIKS Dissertations

List of Figures

1.1 Discriminative modelling. Given data in X, predict labels in Y.
1.2 Data in X are observations governed by underlying world descriptors in W.
1.3 Directly predicting labels in Y from data X essentially sidesteps the world descriptors W, yet from W it’s typically trivial to predict Y.
1.4 Latent variable modelling: the lower-dimensional latent space Z describes high-dimensional data in X.

3.1 ELBO density distribution, ROC curves, and Precision-Recall curves for anomaly “edge erosion” on 3D-printed products, with a 64-dimensional latent space.
3.2 The full 3D-printed product.
3.3 Examples for each defect class.
3.4 Examples of samples in the dataset with their axial, coronal, and sagittal perspective, for (a) 3 different healthy nodules and (b) 3 different nodules identified as anomalous (positive for cancer).
3.5 Displaying 25 slices of 28×28 pixels, as a representation of the cube of 28×28×28 voxels used for training our models.
3.6 VAE architecture for 3D-printed products.
3.7 Trained 3D WGAN-GP architecture for the generator.
3.8 3D VAE encoder architecture for NLST 3D nodules.
3.9 ELBO density distributions for normal (blue) and anomalous (red) MNIST digits.
3.10 Originals (left), reconstructions (middle), and difference images (right) for anomalous MNIST digits.
3.11 ELBO density distributions for normal (blue) and anomalous (red) 3D-printed products with a 64-dimensional latent space.
3.12 Original (left), reconstruction (middle), and difference (right) images for 3D-printed products.
3.13 Four nodules generated by the 3D WGAN-GP.
3.14 GAN-based anomaly detection results for NLST 3D nodules.
3.15 VAE-based anomaly detection results for NLST 3D nodules.

4.1 A dataset of images from a rotating object expressed in terms of the group G = SO(2) acting on a base image x0.
4.2 Intuitive description of the practical computation of D_LSBD.
4.3 Overview of the supervised part of LSBD-VAE.
4.4 Example images from each of the datasets used.
4.5 Example paths of consecutive observations.
4.6 D_LSBD scores for all methods on all datasets.
4.7 Box plots for D_LSBD scores over 10 training repetitions for different numbers of labelled pairs L, for all datasets.
4.8 Results from Quessard et al. (2020)’s method on the Arrow dataset.
4.9 Comparing D_LSBD to previous disentanglement metrics.
4.10 Images obtained by decoding latent variables sampled according to the prior over the latent space for different models trained on COIL-100 and ModelNet40.
4.11 Image generation by traversing the circular latent variable for a sampled object identity.
4.12 Diagrams illustrating the interpolation between the latent variables associated to two objects.
4.13 Images produced from the decoding of interpolated latent variables using cc-VAE and LSBD-VAE trained with COIL-100.
4.14 Sample objects for every class of the retrieval-by-shape dataset.
4.15 Sample objects for every class of the retrieval-by-culture dataset.
4.16 Diagram of the multi-view data generation.
4.17 Diagrams with the architectures used in the Triplet Loss (TL), Autoencoder with Triplet Loss (AE-TL) and LSBD-VAE with Triplet Loss (LSBD-VAE-TL) submissions.

5.1 Illustrative example of the misalignment between underlying factors and observed distributions.
5.2 Example images of Square, Arrow, dSprites, and 3D Shapes.
5.3 Visualisation of OOD splits for datasets with 2 factors.
5.4 Mean negative ELBOs for all datasets and models.
5.5 Differences between train and OOD ELBO for all datasets and models.
5.6 Examples of training and OOD samples (top lines) and their reconstructions (bottom lines) by two different models, for the Arrow 0.625 split.
5.7 OOD detection example for LSBD-VAE on the Arrow 0.25 split.
5.8 AUROC scores for detecting OOD from train data, for all datasets and models.
5.9 LSBD-VAE reconstructions of OOD data from various splits of dSprites (left) and 3D Shapes (right).
5.10 D_LSBD scores (lower is better) for various OOD splits.
5.11 2D latent embeddings (top) and latent traversals (bottom) for LSBD-VAE trained on Arrow for increasingly large OOD splits, visualised on a flattened 2D torus.

List of Tables

3.1 Number of data points per defect type for the 3D-printed products dataset.
3.2 Lung nodule dataset after data augmentation.
3.3 auROC scores for MNIST.
3.4 auPRC scores for MNIST.
3.5 auROC scores for 3D-Printed Products.
3.6 auPRC scores for 3D-Printed Products.

4.1 Encoder and decoder architectures used in most methods.
4.2 Encoder and decoder architecture used to train LSBD-VAE/0 for ModelNet40 dataset.
4.3 LSBD-VAE hyperparameters for all datasets.
4.4 Model hyperparameters for all datasets.
4.5 Scores for the Square dataset.
4.6 Scores for the Arrow dataset.
4.7 Scores for the Airplane dataset.
4.8 Scores for the ModelNet40 dataset.
4.9 Scores for COIL-100 dataset.
4.10 Hyperparameters for the different variants submitted to the SHREC 2021 3D object retrieval challenge.
4.11 Evaluation measures for the retrieval-by-shape challenge.
4.12 Evaluation measures for the retrieval-by-culture challenge.

5.1 OOD splits for dSprites and 3D Shapes.

Chapter 1
Introduction

1.1 Motivation

Machine learning (ML) has proven to be highly effective in solving problems related to pattern recognition (Bishop, 2006), where rule-based solutions or symbolic equations are difficult to formulate. Instead of relying on predetermined rules, ML models learn from data, identifying patterns and relationships that can be used to make predictions on new data.

But when the data is high-dimensional, the curse of dimensionality (Bellman, 1957) poses new challenges for ML models. Lower-level individual data dimensions, or local properties, may carry little meaning on their own and need to be combined in more complex ways to reveal higher-level global structures. Detecting these structures is essential for accurately modelling and predicting outcomes in high-dimensional data, as they provide a way to reduce the complexity of the data and extract meaningful insights.

Images are a clear example of such high-dimensional data; even small 32-by-32-pixel images have over 1000 dimensions (i.e. pixel values). Each pixel carries little information on its own; only larger patterns of many pixels are meaningful. ML models that work well on data with fewer dimensions would need an enormous amount of training data to succeed on images, and even then they would likely learn spurious connections that aren’t semantically meaningful and won’t generalise to unseen data.

Deep learning (DL) models (LeCun et al., 2015), or neural networks, overcome the challenges of high-dimensional data by learning layers or hierarchies of representations through a sequence of non-linear transformations. These representations start with low-level patterns and move towards high-level concepts that can help solve the task at hand. For example, in image processing, DL models can detect low-level patterns in the pixels, such as lines or angles, and then successively combine them into lower-dimensional higher-level descriptors of what is shown in the image. This approach allows for more complex and abstract patterns to be identified and leveraged, resulting in more accurate predictions.

A clear example of the success of DL is the performance of convolutional neural networks (CNNs) on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Deng et al., 2010). This challenge involves a dataset with a large number of images, with associated labels representing various classes such as animal species or object types. CNNs are trained to predict the right label given a query image, e.g. identifying which animal is shown in an image, based on the images and labels in the training dataset.

This is an example of discriminative modelling; given data observations from a data space X and associated labels from a set of potential values Y, the task is to learn a model that can predict labels in Y given datapoints in X, see Figure 1.1. From the perspective of probability theory, discriminative models represent a conditional distribution p(Y|X), i.e. they model a distribution over Y, conditioned on X. As mentioned before, DL models learn layers of representations of the data, from high-dimensional low-level features towards lower-dimensional features that represent high-level concepts. Discriminative models essentially try to throw away any information that is irrelevant for predictions in Y, until ending up with features that are most informative for this task.

Figure 1.1: Discriminative modelling. Given data in X, predict labels in Y.

A discriminative model is only encouraged to model features that empirically work best for the particular task of predicting Y on a given dataset. Such features don’t necessarily relate to real-world concepts, e.g. those that humans would use to communicate decisions. Moreover, such models are susceptible to learning shortcuts (Geirhos et al., 2020): tricks that may work in simple benchmark settings, but that do not generalise well to real-world applications and have little to do with the true reasons why a datapoint has a certain label. In particular, such shortcuts may focus on unintended bias in the datasets. In a typical example by Ribeiro et al. (2016), a model is trained to differentiate huskies from wolves, but the dataset mainly contained wolves on a snowy background and huskies on other backgrounds. The model essentially learned a snow detector instead, predicting wolf whenever the background contained snow or a lot of white, thus incorrectly classifying images of huskies in the snow as wolves.

There is often a lot of information in the data that isn’t described by a simple label from Y. E.g. animals may be recognised by the shape of their ears or the patterns in their fur, but the background of an image does not influence which animal is shown. Humans can identify such properties and use them to decide what animal they see, and they can articulate why they make this decision. Discriminative models mostly lack such explainability, since the task of predicting Y given X doesn’t encourage or require such information explicitly.

High-dimensional data are typically observations from the real world, which is governed by certain underlying real-world mechanisms. To formalise this, we may say that world descriptors in a space of world states W produce observations in the data space X, see Figure 1.2. For example, a picture of an animal is an observation of that particular animal, in some pose, on some background. The animal itself can be described by properties such as the shape of its ears and nose, the colour and patterns of its fur or skin, and whether it has legs or wings. Such world descriptors in W are not observed directly and can still be complicated, but they are typically much more low-dimensional and meaningful than data observations in X.

Figure 1.2: Data in X are observations governed by underlying world descriptors in W.

Discriminative modelling, i.e. directly predicting labels in Y from data in X, essentially sidesteps such world descriptors in W. However, from a good world description in W, the correct label in Y typically follows trivially and with a clear explanation, as illustrated schematically in Figure 1.3. Although world descriptors in W are generally unknown or unobserved, we are often able to model or extract certain aspects of them. Sidestepping W inherently limits discriminative models, since Y carries less information than W. Therefore, it is desirable to have a modelling paradigm that considers more aspects of the real world W.

Figure 1.3: Directly predicting labels in Y from data X essentially sidesteps the world descriptors W, yet from W it’s typically trivial to predict Y.

A common approach to consider these real-world aspects is to model an additional representation space Z (Bengio et al., 2012), in such a way that Z shares desirable properties with W, as illustrated in Figure 1.4. This space Z is typically much lower-dimensional than the data space X, and can ultimately function as a step in between X and Y, i.e. it should be easier and more reliable to predict targets in Y from Z than directly from X. But Z can already be useful on its own, e.g. for other still unknown tasks, often referred to as downstream tasks. Values in Z represent unobserved variables, also called latent variables, so we often refer to Z as the latent space. A key question is which properties from W we want to model in Z, and how to do so.

Figure 1.4: Latent variable modelling: the lower-dimensional latent space Z describes high-dimensional data in X. Ideally, Z should be similar to W, i.e. share similar properties or factors. Given some targets in Y, it should then be much easier and more reliable to predict Y from Z. But simply modelling Z can already be useful on its own, to provide insight into what the data describes.

Since X contains observations that are essentially generated from world descriptors in W, and we want Z to share desirable properties with W, a reasonable objective is that we should also be able to generate data observations from Z. In that case, we know that Z contains a decent description of the data that could be much lower-dimensional than X. This is an example of generative modelling (Tomczak, 2022), where the goal is to learn to describe the data X itself by learning how to generate it, as opposed to discriminative modelling, where the goal is to learn how to assign labels in Y to data X.

In probabilistic terms, generative models learn a distribution p(X), or in the presence of labels a joint distribution p(X,Y). This allows a model to include information that is present in X but not in Y. Discriminative models on the other hand learn a conditional distribution p(Y|X), which is only described over Y.

In particular, a generative model with a latent space Z is called a latent variable model. Here, p(X) is modelled indirectly through a latent space prior p(Z) and a conditional distribution p(X|Z) such that the joint distribution becomes p(X,Z) = p(Z)p(X|Z). Such a model can generate data in X by sampling from a simple prior distribution over Z and then from the learned conditional distribution p(X|Z).

Using neural networks to model the parameters of these distributions, we can formulate a Variational Autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014). In a VAE, the parameters of the conditional distribution p(X|Z) are modelled by a decoder network. Furthermore, the parameters of an approximate posterior distribution q(Z|X) are modelled by an encoder network, to perform inference of latent variables given input data. The prior p(Z) is often simple and fixed.

As motivated before, ideally we want Z to share desirable properties with W, or in other words, we want the mechanisms of the real world W to apply to Z as well. This should help us with building more reliable models that can do well on downstream tasks. Motivated by this, we present the following research topics and questions.

1.2 Research Questions

1.2.1 Anomaly Detection with Probabilistic Generative Models

A key motivation for learning representations is that they are useful for downstream tasks. Probabilistic generative models, such as Variational Autoencoders (VAEs), learn latent representations to describe the probability distribution of observed data and allow for density estimation of unseen data points. This makes them well-suited for the task of probabilistic anomaly detection, or outlier detection (Pimentel et al., 2014). By training a generative model on what is considered normal data, the model should be able to detect outliers, i.e. anomalous or abnormal datapoints, as they should get assigned a lower probability density. Thus, we pose the following research question:

Q: How can the density estimation of probabilistic generative models, in particular Variational Autoencoders (VAEs), help with anomaly detection, i.e. detecting out-of-distribution datapoints?

1.2.2 Quantifying and Learning Linear Symmetry-Based Disentanglement (LSBD)

As motivated above, we want representations in Z to capture the underlying mechanisms of the real world W. Regular VAEs try to achieve this by learning how to generate data in X from Z, but this still imposes little structure on Z.

The real world can often be described in terms of independent factors of variation. E.g. images of animals can be described with factors such as fur colour, ear shape, and pose. Such factors are often called generative factors, as knowing their values allows one to generate data observations. Generative factors are implicitly present in W, and it would be beneficial to find representations in Z that model these factors as well.

This is the goal of disentanglement, or learning disentangled representations (Bengio et al., 2012). The idea is that each factor should be represented in a separate subspace of Z, such that changes in one generative factor only lead to changes in one subspace. Several approaches have been proposed to learn and quantify disentangled representations (Locatello et al., 2018); we refer to these collectively as traditional disentanglement.
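The subspace idea above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not part of the thesis: two scalar factors (labelled colour and pose), a 4-dimensional latent space split into two 2-dimensional subspaces, and two hand-written linear maps standing in for learned encoders. It only shows what "a change in one factor affects one subspace" means operationally.

```python
import numpy as np

# Hypothetical toy encoders: factors = (colour, pose), latent space = R^4,
# split into subspace 0 (dims 0-1) and subspace 1 (dims 2-3).
def encode_disentangled(factors):
    colour, pose = factors
    # colour only enters subspace 0, pose only enters subspace 1
    return np.array([colour, 2.0 * colour, pose, -pose])

def encode_entangled(factors):
    colour, pose = factors
    # every latent dimension mixes both factors
    return np.array([colour + pose, colour - pose, 2.0 * colour + pose, pose - colour])

def changed_subspaces(encode, base, varied, subspaces):
    """Return indices of the latent subspaces that differ between two factor settings."""
    delta = encode(varied) - encode(base)
    return [i for i, dims in enumerate(subspaces) if not np.allclose(delta[dims], 0.0)]

subspaces = [slice(0, 2), slice(2, 4)]
base = (0.3, 1.0)
pose_changed = (0.3, 1.5)  # same colour, different pose

print(changed_subspaces(encode_disentangled, base, pose_changed, subspaces))  # [1]
print(changed_subspaces(encode_entangled, base, pose_changed, subspaces))     # [0, 1]
```

In this toy setting the check is trivial because the factors are known and the maps are linear; with learned representations and unknown factors, neither is given.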
However, it is difficult to turn these ideas into a formal definition that can be used to design disentanglement models. It is not obvious how to formally describe what a factor is. To address this, Higgins et al. (2018) propose the formal definition of Linear Symmetry-Based Disentanglement (LSBD), arguing that symmetries of the real world W are what cause variability in the data. The real-world mechanisms that we want to disentangle in Z are then symmetry transformations that change one aspect of the real world but leave all others invariant. The mathematical language of group theory is used to capture this in a formal definition.

This definition, however, only describes what perfect LSBD should look like; it does not supply a metric to quantify how well LSBD is achieved for given representations. Moreover, it does not provide any method to actually obtain LSBD representations. It is crucial to have a metric that quantifies LSBD, to be able to evaluate methods that aim to obtain LSBD representations and to compare LSBD to previous understandings of disentanglement. Since the proposal of LSBD, several methods have been proposed to learn LSBD representations (Caselles-Dupré et al., 2019; Quessard et al., 2020; Painter et al., 2020). Although some of these works do propose metrics that measure some aspect of LSBD, none of them provides a general metric that fully quantifies LSBD according to its formal definition and for any learned representation.

A quantification of LSBD can furthermore be helpful to develop methods to learn LSBD representations. Typically, evaluation metrics of traditional disentanglement assume access to the underlying factors that should be disentangled, whereas methods to learn disentangled representations have limited access to these factors. But even partial information can be beneficial, as demonstrated by weakly-supervised methods (Locatello et al., 2020; Träuble et al., 2021). Similarly, a quantification metric for LSBD may be useful for learning LSBD representations, even if this metric can only be computed under certain assumptions and in a weakly-supervised fashion. Given the lack of a general quantification metric for LSBD, as well as the need for methods that can obtain LSBD representations, we pose the following research questions:

Q: How, and under which assumptions, can Linear Symmetry-Based Disentanglement (LSBD) be quantified according to its formal definition?

Q: How, and under which assumptions, can such a quantification help with learning LSBD representations?
1.2.3 Out-of-Distribution Generalisation with Linear Symmetry-Based Disentangled Representations

We previously focused on the suitability of probabilistic latent variable models, in particular Variational Autoencoders (VAEs), for anomaly detection. Since the latent space Z ideally shares desirable properties with the real world W, it should form a reliable model to describe what is normal and what is anomalous.

Therefore, better representations in Z should lead to a more reliable anomaly detection method, whose decisions are based on real-world properties. We argued above that disentanglement is a good strategy to improve representations in Z, by disentangling underlying factors of variation from the real world W. In the context of anomaly detection, this is particularly helpful to prevent false alarms, where data is flagged as anomalous but should be considered normal and thus well-represented by the model.

Most traditional disentanglement methods assume that factors are statistically independent and that datasets contain examples from all possible combinations of factor values. In practice, however, there may be correlations between different factors, but they can still be disentangled (Träuble et al., 2021). For example, a dataset of human bodies would contain the underlying factors body height and foot size, which are clearly not independent. Nevertheless, we can identify these two factors as independent mechanisms, and unlikely combinations of factor values should still be modelled. For example, small humans with big feet can exist even if they are not observed in a particular dataset due to their lower likelihood. Furthermore, as the number of factors grows, the number of possible combinations of factor values grows exponentially, so it becomes unrealistic to expect a dataset to cover all possible combinations of factor values.

Therefore, it is useful for models to be able to disentangle factors without needing to see all possible combinations and without assuming statistical independence between factors. A model that generalises well to unseen combinations is then less likely to flag such unseen cases as anomalous. Particular combinations of factor values may be out-of-distribution (OOD) from a probabilistic point of view, but should still be considered “normal” given that these factors represent underlying mechanisms of the world.
Generalising to such unseen combinations of factor values is a type of out-of-distribution generalisation (Shen et al., 2021), more specifically combinatorial generalisation. LSBD models are a sensible candidate to handle such generalisation, since they focus on modelling underlying mechanisms and should thus be capable of modelling unseen combinations that are the result of applying these mechanisms. Since LSBD defines disentanglement with respect to real-world symmetries, rather than statistical properties of the data and its underlying factors, it provides a suitable framework to generalise to unseen combinations even if they are empirically OOD.

Q: How well do LSBD models generalise towards unseen observations that are the result of mechanisms observed during training? I.e., how well do LSBD models perform out-of-distribution (OOD) generalisation?

Q: How do LSBD models compare to traditional disentanglement models with respect to OOD generalisation?

1.3 Thesis Outline and Contributions

In Chapter 2 we provide relevant background and preliminaries for the work in this thesis. The subsequent chapters detail the contributions made in this thesis, addressing the research questions outlined above. These contributions can be summarised as follows.

In Chapter 3 we present anomaly detection with probabilistic generative models, in particular Variational Autoencoders (VAEs). We train a VAE on normal data samples, such that it can detect anomalous samples if their assigned probability density is lower than for normal samples. We apply this method to visual quality control and lung cancer detection. Results show that anomaly detection is possible in certain cases, which confirms the validity of this approach. However, in more complicated and realistic settings the models may fail to represent the data well enough for reliable anomaly detection. This suggests that improvements in the VAE framework could be beneficial for anomaly detection performance as well.

In Chapter 4 we focus on quantifying and learning Linear Symmetry-Based Disentanglement (LSBD). We propose $\mathcal{D}_{\mathrm{LSBD}}$, a well-formalised metric to quantify LSBD. We give a practical implementation of this metric for SO(2), a common group structure that models 2D rotations or other cyclic properties. From this metric, we derive LSBD-VAE, a semi-supervised method to learn LSBD representations. We use the $\mathcal{D}_{\mathrm{LSBD}}$ metric to compare LSBD with previous notions of disentanglement, as well as to evaluate models designed to learn LSBD representations, including our own LSBD-VAE.

In Chapter 5 we explore how LSBD helps with out-of-distribution generalisation, and how LSBD models compare to traditional disentanglement models for this task.
We train models on datasets with held-out factor combinations, and test their generalisation on these unseen factor combinations. We observe that models struggle with generalisation in more challenging settings, and that LSBD models show no obvious improvement over traditional disentanglement when measuring generalisation in terms of the likelihood of unseen data. However, we also observe that the encoder of LSBD models may still generalise well by learning a meaningful mapping that reflects the underlying real-world mechanisms.

Lastly, in Chapter 6 we summarise the overall conclusions of this thesis, and outline possible directions for future work.

Chapter 2 Background

In this chapter, we summarise relevant background for understanding this thesis. We assume the reader is somewhat familiar with machine learning, neural networks, probability theory, and set theory, as the topics in this chapter build on concepts from these fields.

2.1 Variational Autoencoders

In this section, we describe the variational autoencoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014), a model that is at the core of most other models discussed in this thesis. A VAE introduces neural networks into a very simple latent variable model, where latent variables $z \in Z$ predict data observations $x \in X$. There is a prior $p(z)$ over the latent space $Z$, which is typically fixed, as well as a parametric conditional distribution $p_\theta(x|z)$, where $\theta$ represents the parameters of this distribution. In a VAE, these parameters are modelled with a neural network, called the decoder, since it decodes latent variables into a distribution over the data space $X$. Together, these distributions constitute the generative model $p(z)\,p_\theta(x|z)$ that, once trained, allows us to sample new data points, making the VAE a generative model.

To train the model, we wish to learn the parameters $\theta$ through Maximum Likelihood Estimation, i.e. we wish to maximise the marginal log likelihood $\log p_\theta(x) = \log \int_z p_\theta(x|z)\,p(z)\,dz$ with respect to the parameters $\theta$. However, since $p_\theta(x|z)$ is parameterised by a neural network, it is intractable to integrate over $z$ to obtain gradients for gradient-based learning. Moreover, methods such as the EM algorithm cannot be used either, since the true posterior $p_\theta(z|x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(x)}$ is intractable as well.

To address this, the VAE defines a parametric approximate posterior $q_\phi(z|x)$ that can be used for variational inference. The parameters $\phi$ are modelled with a neural network, called the encoder, since it encodes observed data points into a distribution over the latent space $Z$. The marginal log likelihood can then be written as

$$\log p_\theta(x) = \mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x)) + \mathrm{ELBO}(\phi, \theta; x), \tag{2.1}$$

where the first term on the right is the Kullback-Leibler (KL) divergence of the approximate from the true posterior. Since the KL divergence is non-negative, the second term is a lower bound on the marginal log likelihood, often called the evidence lower bound (ELBO). This KL divergence is intractable, so we focus instead on maximising the ELBO. In particular, since the left-hand side of Equation (2.1) does not depend on $\phi$, maximising the ELBO with respect to $\phi$ is the same as minimising the KL divergence of the approximate from the true posterior, meaning that we are in fact performing variational inference to learn $q_\phi(z|x)$. The ELBO can be written as

$$\mathrm{ELBO}(\phi, \theta; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}(q_\phi(z|x)\,\|\,p(z)), \tag{2.2}$$

which we want to optimise (i.e. maximise) with respect to the parameters $\phi$ and $\theta$, using gradient-based methods. However, estimating a gradient with respect to $\phi$ for the expectation is not trivial, and typical gradient estimators exhibit high variance. Instead, Kingma and Welling (2013) propose a reparameterisation trick, noting that it is often possible to express a continuous random variable $z \sim q_\phi(z|x)$ as a deterministic variable $z = g_\phi(\epsilon, x)$, where $\epsilon$ is an auxiliary random variable with a fixed marginal distribution $p(\epsilon)$, and $g_\phi(\cdot, \cdot)$ is some vector-valued function parameterised by $\phi$.
Such a reparameterisation allows for a simple Monte Carlo estimate of the expectation that is differentiable with respect to $\phi$, by sampling from the fixed distribution $p(\epsilon)$ rather than the parameterised distribution $q_\phi(z|x)$:

$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \approx \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x|z^{(l)}), \tag{2.3}$$

where $z^{(l)} = g_\phi(\epsilon^{(l)}, x)$ and $\epsilon^{(l)} \sim p(\epsilon)$. For training a VAE, a single sample (i.e. $L = 1$) is typically sufficient, so in practice this loss component boils down to $\log p_\theta(x|z)$, where $z = g_\phi(\epsilon, x)$ is computed from a single noise variable $\epsilon$ and the encoder output $\phi(x)$. The KL divergence can often be computed analytically, such that obtaining gradients is easy, but otherwise this reparameterisation trick can also help to obtain gradients for the KL divergence. The entire VAE is trained by maximising the ELBO with respect to $\phi$ and $\theta$ simultaneously, in practice by minimising the negative ELBO with stochastic gradient descent (SGD) or related optimisers.

Looking at the two terms that make up the ELBO, we can see that the first term $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ acts as a negative expected reconstruction loss between an input data point $x$ and its predicted reconstruction according to the encoder and decoder networks, showing that the VAE indeed acts as an autoencoder. The KL divergence in the second term then acts as a kind of regularisation for the latent space, ensuring that the learned approximate posteriors stay similar to the latent space prior. This, together with the sampling procedure for the reparameterisation trick, promotes a certain smoothness in the latent space.

A common choice for the various distributions in the VAE framework is to use Gaussian distributions with diagonal covariance, which can also be interpreted as independent univariate Gaussians. In particular, the latent space prior is often a standard Gaussian, and the generative conditional distribution usually has a fixed variance. Put more formally, the distributions are then

\begin{align}
q_\phi(z|x) &= \mathcal{N}(z|\phi(x)) = \mathcal{N}(z|\mu_{\mathrm{enc}}(x), \mathrm{diag}(\sigma_{\mathrm{enc}}^2(x))), \tag{2.4}\\
p_\theta(x|z) &= \mathcal{N}(x|\theta(z)) = \mathcal{N}(x|\mu_{\mathrm{dec}}(z), \sigma_{\mathrm{dec}}^2 I), \tag{2.5}\\
p(z) &= \mathcal{N}(z|0, I), \tag{2.6}
\end{align}

where $\mu_{\mathrm{enc}}$ and $\sigma_{\mathrm{enc}}$ are outputs of the encoder network with input $x$, $\mu_{\mathrm{dec}}$ is the output of the decoder network with input $z$, $\sigma_{\mathrm{dec}}$ is a fixed scalar standard deviation, and $I$ is the identity matrix (size inferred from context). For such a Gaussian approximate posterior, a valid reparameterisation of $z \sim \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ is $z = \mu + \sigma \odot \epsilon$ for $\epsilon \sim \mathcal{N}(0, I)$, where $\odot$ denotes the Hadamard (or element-wise) product.
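The Gaussian reparameterisation z = µ + σ ⊙ ϵ can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not code from the thesis: the function name `reparameterise` and the example values are hypothetical, and the lists `mu` and `sigma` stand in for encoder outputs. In an automatic-differentiation framework the same expression would let gradients flow to µ and σ, because the randomness is confined to ϵ.

```python
import random


def reparameterise(mu, sigma, rng):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, I).

    mu and sigma are lists of per-dimension means and standard deviations
    (hypothetical stand-ins for the encoder outputs).
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
mu, sigma = [1.0, -2.0], [0.5, 0.0]

# A single sample, as used for the L = 1 Monte Carlo estimate.
z = reparameterise(mu, sigma, rng)

# Averaging many samples should approximately recover the mean of
# N(mu, diag(sigma^2)) in each dimension.
samples = [reparameterise(mu, sigma, rng) for _ in range(20000)]
mean_0 = sum(s[0] for s in samples) / len(samples)
```

A dimension with zero standard deviation is deterministic, which is why the second coordinate of `z` always equals its mean in this example.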
As explained before, we can then use single-sample Monte Carlo estimation, such that the reconstruction term of the ELBO is essentially $\log p_\theta(x|z)$. Then, for our chosen Gaussian $p_\theta(x|z)$ we can derive

\begin{align}
\log p_\theta(x|z) &= \log \mathcal{N}(x|\mu_{\mathrm{dec}}(z), \sigma_{\mathrm{dec}}^2 I) \tag{2.7}\\
&= \log \prod_{d=1}^{D} \mathcal{N}(x_d|\mu_d, \sigma_{\mathrm{dec}}^2) \tag{2.8}\\
&= \sum_{d=1}^{D} \log\left( \frac{1}{\sqrt{2\pi\sigma_{\mathrm{dec}}^2}} \, e^{-\frac{(x_d - \mu_d)^2}{2\sigma_{\mathrm{dec}}^2}} \right) \tag{2.9}\\
&= \sum_{d=1}^{D} \left( -\frac{1}{2} \log 2\pi\sigma_{\mathrm{dec}}^2 - \frac{(x_d - \mu_d)^2}{2\sigma_{\mathrm{dec}}^2} \right), \tag{2.10}
\end{align}

where $D$ is the dimensionality of the data space $X$, $\mu_{\mathrm{dec}}(z) = (\mu_1, \ldots, \mu_D)^T$, and $x = (x_1, \ldots, x_D)^T$. Since the first term in this last line is constant with respect to the decoder output $\mu_{\mathrm{dec}}(z)$, and hence with respect to the trainable parameters, we only optimise the second term, which acts as a (negative) squared error between the input $x$ and the predicted mean $\mu_{\mathrm{dec}}$ (scaled by $\frac{1}{2\sigma_{\mathrm{dec}}^2}$), clearly relating to the mean squared error (but without averaging over the data dimensions), a commonly used reconstruction loss for neural networks. In particular, if we choose $\sigma_{\mathrm{dec}} = \frac{1}{\sqrt{2}}$, we obtain an unscaled (negative) squared error.

Alternatively, if the data is binary (i.e. each data point $x$ is a vector of 0's and 1's), it is common to model $p_\theta(x|z)$ with independent Bernoulli trials instead, resulting in the following reconstruction term:

\begin{align}
\log p^*_\theta(x|z) &= \log \prod_{d=1}^{D} \mathrm{Bernoulli}(x_d|\rho_d) \tag{2.11}\\
&= \sum_{d=1}^{D} \log\left( \rho_d^{x_d} (1 - \rho_d)^{(1 - x_d)} \right) \tag{2.12}\\
&= \sum_{d=1}^{D} x_d \log \rho_d + (1 - x_d) \log(1 - \rho_d), \tag{2.13}
\end{align}

where $\rho(z) = (\rho_1, \ldots, \rho_D)^T$ is now the output of the decoder. Note that this is essentially the (negative) binary cross-entropy loss (but without averaging over the data dimensions), a common loss function for neural networks. Although theoretically this expression is only for binary data, in practice it is often used as a reconstruction loss for non-binarised image data as well, which consists of pixel values that can attain any real value from 0 to 1.
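The two reconstruction terms in Equations (2.10) and (2.13) can be checked numerically with a small sketch in plain Python. The function names and example vectors below are assumptions made for illustration only; they are not taken from the thesis. The sketch also checks the remark that choosing σdec = 1/√2 reduces the Gaussian term to a constant minus the unscaled squared error.

```python
import math


def gaussian_log_likelihood(x, mu, sigma_dec):
    """Equation (2.10): sum over dimensions of log N(x_d | mu_d, sigma_dec^2)."""
    const = -0.5 * math.log(2.0 * math.pi * sigma_dec ** 2)
    return sum(const - (xd - md) ** 2 / (2.0 * sigma_dec ** 2)
               for xd, md in zip(x, mu))


def bernoulli_log_likelihood(x, rho):
    """Equation (2.13): negative binary cross-entropy, summed over dimensions.

    Each rho_d must lie strictly between 0 and 1 for the logarithms to exist.
    """
    return sum(xd * math.log(rd) + (1.0 - xd) * math.log(1.0 - rd)
               for xd, rd in zip(x, rho))


# Hypothetical example values.
x = [0.0, 1.0, 1.0]
mu = [0.5, 0.5, 1.0]
sigma_dec = 1.0 / math.sqrt(2.0)

log_lik = gaussian_log_likelihood(x, mu, sigma_dec)

# With sigma_dec = 1/sqrt(2), the scale factor 1 / (2 sigma_dec^2) equals 1,
# so the log likelihood is a constant minus the plain squared error.
squared_error = sum((xd - md) ** 2 for xd, md in zip(x, mu))
constant = -0.5 * len(x) * math.log(2.0 * math.pi * sigma_dec ** 2)

bce = bernoulli_log_likelihood([1.0, 0.0], [0.5, 0.5])
```

Both functions sum over the data dimensions rather than averaging, matching the remark in the text that these terms differ from the usual mean squared error and mean binary cross-entropy only by that normalisation.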

With the chosen Gaussian distributions for the prior and approximate posterior, the KL divergence in Equation (2.2) can be computed analytically and becomes

$$\mathrm{KL}(q_\phi(z|x)\,\|\,p(z)) = \frac{1}{2} \sum_{k=1}^{K} \left( m_k^2 + s_k^2 - \log s_k^2 - 1 \right), \tag{2.14}$$

where $K$ is the dimensionality of the latent space, $\mu_{\mathrm{enc}}(x) = (m_1, \ldots, m_K)^T$, and $\sigma_{\mathrm{enc}}(x) = (s_1, \ldots, s_K)^T$. Gradients with respect to $\mu_{\mathrm{enc}}$ and $\sigma_{\mathrm{enc}}$ can be computed exactly for this expression, so it can directly be used for gradient-based optimisation.

2.2 Disentanglement

In Section 1.2.2, we motivated the need for learning disentangled representations (Bengio et al., 2012), where the idea is that data contains underlying generative factors that we wish to model in separate dimensions or subspaces of our latent representation. Liu et al. (2022) provide an overview of key concepts and methods in the field of disentanglement, in particular for applications in the imaging domain. Here, we briefly summarise a number of methods that have been proposed to learn disentangled representations, based on extending the VAE framework with various loss components to encourage disentanglement in the latent space. We also summarise some proposed metrics for quantifying the level of disentanglement in representations. Furthermore, we briefly discuss some limitations of current disentanglement approaches, including the lack of a formal agreed-upon definition for disentanglement.

Unsupervised disentanglement methods

The following unsupervised methods all extend the VAE loss function with some regulariser to encourage disentanglement. In particular, they assume that generative factors are one-dimensional, and should thus be modelled in single independent dimensions in the latent space.

β-VAE (Higgins et al., 2017) adds a weight parameter $\beta > 1$ to the KL divergence term in the VAE loss, thereby constraining the capacity of the VAE bottleneck.
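As a rough sketch of the two ingredients just described, the following plain-Python snippet evaluates the analytic KL divergence of Equation (2.14) and uses it in a β-weighted single-sample objective. The function names, the default β = 4 and the example numbers are illustrative assumptions rather than values from the thesis; setting β = 1 recovers the ordinary ELBO of Equation (2.2).

```python
import math


def gaussian_kl(m, s):
    """Equation (2.14): KL(N(m, diag(s^2)) || N(0, I)), with s the standard deviations."""
    return 0.5 * sum(mk ** 2 + sk ** 2 - math.log(sk ** 2) - 1.0
                     for mk, sk in zip(m, s))


def beta_vae_objective(reconstruction_log_lik, m, s, beta=4.0):
    """Single-sample objective with the KL term weighted by beta.

    beta = 1 gives the plain ELBO; the default of 4 is only an example value.
    """
    return reconstruction_log_lik - beta * gaussian_kl(m, s)


# The KL divergence vanishes when the approximate posterior equals the prior.
kl_at_prior = gaussian_kl([0.0, 0.0], [1.0, 1.0])

# Shifting one mean by 1 adds 0.5 * 1^2 to the divergence.
kl_shifted = gaussian_kl([1.0, 0.0], [1.0, 1.0])

# Hypothetical reconstruction log likelihood of -3.0 for the same posterior.
elbo = beta_vae_objective(-3.0, [1.0, 0.0], [1.0, 1.0], beta=1.0)
beta_elbo = beta_vae_objective(-3.0, [1.0, 0.0], [1.0, 1.0], beta=4.0)
```

For the same posterior, a larger β lowers the objective by a larger multiple of the KL divergence, so deviations from the prior are penalised more heavily than in the plain ELBO.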
The heavier KL penalty forces the posterior (encoder) distribution to better match the prior, which is typically a factorised unit Gaussian, and should therefore lead to more disentangled latent variables. Building on this, cc-VAE (Burgess et al., 2018) gradually increases this bottleneck capacity over time, allowing the encoder to learn one generative factor at a time.

Chen et al. (2018) show by rewriting the ELBO that it contains a Total Correlation (TC) term (Watanabe, 1960), which is a measure of dependence between variables. They claim that a heavier penalty on this specific term should induce a more disentangled representation by encouraging independence between latent variables, and thus propose β-TCVAE, where a weight parameter β over-penalises this TC term, which they compute using a tractable but biased Monte Carlo estimator. Similarly, FactorVAE (Kim and Mnih, 2018) also over-penalises an additional TC term, using adversarial training instead.

DIP-VAE-I and DIP-VAE-II (Kumar et al., 2017) both add an additional term that penalises some divergence between the aggregate posterior q(z) and a factorised prior. Since using the KL divergence would make this term intractable, they instead propose a moment matching solution.

Disentanglement metrics

Various metrics have been proposed to quantify disentanglement, aiming to capture various desirable properties that a disentangled representation should have. Often, new metrics are proposed alongside a new disentanglement method, aiming to address various issues of previous metrics. The following metrics all assume that a single generative factor should be modelled in a single latent dimension.

The Beta metric (Higgins et al., 2017) measures the accuracy of a linear classifier that tries to predict the index of a generative factor that is kept fixed, aiming to measure both independence and interpretability of the learned latent variables.
The Factor metric (Kim and Mnih, 2018) addresses several issues with this previous metric, by using a majority vote classifier that tries to predict the index of the fixed generative factor based on the index of the latent dimension with the lowest variance. Chen et al. (2018) argue that the Beta and Factor metrics are neither general nor unbiased, since they rely on certain hyperparameters. Instead, they propose the Mutual Information Gap (MIG), which for each generative factor measures the normalised gap in mutual information between the two latent dimensions that have the highest and second highest mutual information with that factor. Conversely, Modularity (MOD) (Ridgeway and Mozer, 2018) measures if each latent dimension depends on at most one generative factor, by computing the average normalised squared difference between the mutual information of the