Data generation. For each 3D model, we render M = 12 images of 256×256 pixels from equally spaced viewpoints surrounding the object, similar to Su et al. (2015), as illustrated in Figure 4.16. Each object is scaled to have a maximum dimension of 15 meters along any of the canonical axes, so that the rendered images show well-centred objects of similar size. Note that this procedure discards the information about an object's absolute dimensions.

Figure 4.16: Diagram of the multi-view data generation.

Triplet loss. To accommodate the retrieval task, we use a triplet loss (Schroff et al., 2015) as a training objective that encourages the representations of similar objects to be close to each other. Consider a triplet of data points x_a, x_p, x_n ∈ X, where the anchor data point x_a shares the same class of interest as the positive data point x_p but has a different class from the negative data point x_n. The data is passed through a parametric encoding function h: X → Z (i.e. a neural network) from the data space X into a low-dimensional encoding space Z. The triplet loss for encodings z_a := h(x_a), z_p := h(x_p), z_n := h(x_n) is defined as

\mathcal{L}_{\mathrm{TL}}(z_a, z_p, z_n) := \max\left( \lVert z_a - z_p \rVert_2^2 - \lVert z_a - z_n \rVert_2^2 + \alpha,\ 0 \right), \qquad (4.26)

where α is a margin that keeps negative data points sufficiently far away from the anchor. This training objective is used to optimise the parameters of the encoding function h so that data points sharing the same class are encoded close to each other in the representation space, while points of different classes are encoded far apart. In all our experiments the margin α is set to 1.
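As a concrete illustration of Eq. (4.26), the following is a minimal sketch of the triplet loss for a batch of encodings. The use of PyTorch, the batching convention, and the function name triplet_loss are assumptions made for illustration only and are not taken from the text.

```python
import torch

def triplet_loss(z_a, z_p, z_n, alpha=1.0):
    """Triplet loss of Eq. (4.26) for batches of encodings of shape (B, d).

    alpha is the margin; the text sets it to 1 in all experiments.
    """
    d_pos = torch.sum((z_a - z_p) ** 2, dim=-1)  # squared distance anchor-positive
    d_neg = torch.sum((z_a - z_n) ** 2, dim=-1)  # squared distance anchor-negative
    return torch.clamp(d_pos - d_neg + alpha, min=0.0).mean()  # hinge with margin
```

In practice this quantity would be averaged over minibatches of sampled triplets and minimised with respect to the parameters of the encoder h.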
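For the data-generation step described earlier, the rendering backend and camera parameters are not specified in the text; the sketch below only illustrates the two stated operations, namely the M = 12 equally spaced viewpoints and the rescaling of each object so that its largest extent along the canonical axes is 15. It assumes vertex coordinates are available as an N×3 array, and the function names are hypothetical.

```python
import numpy as np

def viewpoint_azimuths(m=12):
    """M equally spaced azimuth angles (in degrees) around the object."""
    return np.arange(m) * (360.0 / m)

def rescale_to_max_extent(vertices, target=15.0):
    """Rescale an (N, 3) vertex array so that the largest bounding-box extent
    along the canonical axes equals `target`; absolute size is discarded."""
    extents = vertices.max(axis=0) - vertices.min(axis=0)  # per-axis bounding-box size
    return vertices * (target / extents.max())
```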