
DL models learn representations through a sequence of non-linear transformations. These representations start with low-level patterns and move towards high-level concepts that can help solve the task at hand. For example, in image processing, DL models can detect low-level patterns in the pixels, such as lines or angles, and then successively combine them into lower-dimensional, higher-level descriptors of what is shown in the image. This approach allows more complex and abstract patterns to be identified and leveraged, resulting in more accurate predictions.

A clear example of the success of DL is the performance of convolutional neural networks (CNNs) on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Deng et al., 2010). This challenge involves a dataset with a large number of images, each with an associated label representing a class such as an animal species or object type. CNNs are trained to predict the right label for a query image, e.g. identifying which animal is shown in an image, based on the images and labels in the training dataset. This is an example of discriminative modelling: given data observations from a data space X and associated labels from a set of potential values Y, the task is to learn a model that can predict labels in Y given datapoints in X, see Figure 1.1. From the perspective of probability theory, discriminative models represent a conditional distribution p(Y|X), i.e. they model a distribution over Y, conditioned on X. As mentioned before, DL models learn layers of representations of the data, from high-dimensional low-level features towards lower-dimensional features that represent high-level concepts. Discriminative models essentially try to throw away any information that is irrelevant for predictions in Y, until ending up with the features that are most informative for this task.

Figure 1.1: Discriminative modelling. Given data in X, predict labels in Y.

A discriminative model is only encouraged to model features that empirically work best for the particular task of predicting Y on a given dataset. Such features do not necessarily relate to real-world concepts, e.g. those that humans would use to communicate decisions. Moreover, discriminative models are susceptible to learning shortcuts (Geirhos et al., 2020): tricks that may work in simple benchmark settings, but that do not generalise well to real-world applications and have little to do with
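To make the notion of a discriminative model p(Y|X) concrete, the sketch below shows a small CNN classifier in PyTorch that maps an image x in X to a conditional distribution p(Y|X = x) over labels via a softmax output. It is a minimal illustrative example only, not a model used in this thesis; the layer sizes, the 32x32 input resolution and the choice of ten classes are assumptions made purely for brevity.

    # Minimal sketch of a discriminative CNN classifier (illustrative only).
    # The network maps an image x to unnormalised scores; a softmax turns
    # these into a conditional distribution p(Y | X = x) over class labels.
    import torch
    import torch.nn as nn

    class SmallCNN(nn.Module):
        def __init__(self, num_classes: int = 10):  # 10 classes: an assumption
            super().__init__()
            # Early convolutions pick up low-level patterns (edges, angles);
            # later layers combine them into lower-dimensional, higher-level features.
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
            self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 inputs

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.features(x).flatten(start_dim=1)  # high-level feature vector
            return self.classifier(h)                  # unnormalised class scores

    model = SmallCNN(num_classes=10)
    x = torch.randn(4, 3, 32, 32)                 # a batch of query images
    p_y_given_x = torch.softmax(model(x), dim=1)  # each row sums to 1: p(Y | X = x)

Training such a network with a cross-entropy loss on labelled image–label pairs amounts to maximum-likelihood estimation of the conditional distribution p(Y|X), which is precisely the discriminative setting described above.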
