Thesis

Chapter 9

Appendix C – Supplementary methods

The supplementary methods cover the database pre-processing steps, the distance metric used for the hierarchical clustering algorithm, the combination of clustering evaluation metrics used to determine the number of clusters, and the mathematical definitions of the clustering outcome measures.

Database pre-processing

In total, 97 variables were entered into the hierarchical clustering algorithm after removing the follow-up variables, the near-zero-variance variables (i.e., variables whose values were almost the same in all people), and the variables with more than 95% missing values. The number of included variables was substantially higher for some aspects of hypersomnolence (e.g., cataplexy). Because we wanted each aspect to have a similar potential influence on the clustering, we used expert opinion to group the variables into 15 overarching categories, as presented in Figure 1 of the main manuscript. We divided the clustering weight equally over the 15 categories and then split each category's weight evenly over the variables within that category. For example, the category cataplexy received 1/15th of the total weight, so that each of the 17 cataplexy-related variables received 1/17th of 1/15th of the total weight.

All variables were normalized to the range [0, 1] to make them comparable. For categorical variables, this was done by distributing the outcome options evenly over the normalized range. To ensure sufficient spread in continuous variables while reducing the influence of outliers, the outermost 10 values of continuous variables (about 2% of the raw continuous data) were changed to the corresponding 11th highest or lowest value before normalization. This data pre-processing step refined the dataset while preserving the distributions of the subjects.
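The weighting and normalization steps described above can be illustrated with a minimal sketch. It is not the code used for the analysis: the function names, the toy data, the choice to clip 10 values in each tail, and the coding of categorical options as integers 0, 1, 2, ... are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def category_weights(categories):
    """Per-variable weights: every category gets an equal share of the total
    weight, split evenly over the variables belonging to that category."""
    counts = Counter(categories)
    share = 1.0 / len(counts)
    return np.array([share / counts[c] for c in categories])


def clip_outermost(values, k=10):
    """Set the k lowest/highest observed values to the (k+1)-th lowest/highest.
    Assumption: k values are replaced in each tail. NaN (missing) stays NaN."""
    values = np.asarray(values, dtype=float)
    observed = np.sort(values[~np.isnan(values)])
    if observed.size <= 2 * k:  # too few observations to clip safely
        return values.copy()
    return np.clip(values, observed[k], observed[-(k + 1)])


def min_max(values):
    """Rescale to [0, 1], ignoring missing values."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(values), np.nanmax(values)
    if hi == lo:  # constant variable: map observed values to 0
        return np.where(np.isnan(values), np.nan, 0.0)
    return (values - lo) / (hi - lo)


def spread_options(codes, n_options):
    """Place option codes 0..n_options-1 evenly over [0, 1] (assumed coding)."""
    return np.asarray(codes, dtype=float) / (n_options - 1)


# Toy example: 17 cataplexy variables plus 14 single-variable categories,
# giving 15 categories in total (hypothetical category names).
categories = ["cataplexy"] * 17 + [f"category_{i}" for i in range(2, 16)]
weights = category_weights(categories)

raw = np.arange(100.0)  # stand-in for one continuous variable
normalized = min_max(clip_outermost(raw, k=10))
```

In this toy setup each cataplexy variable receives a weight of 1/15 × 1/17, the weights sum to 1, and the clipped variable still spans the full range [0, 1] after normalization.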
Distance metrics

Different distance metrics can be used to determine the distances between individuals, and different linkage types can be used to determine which clusters the clustering algorithm merges next. Gower’s distance was the most suitable for the EU-NN database, as it is the simplest distance metric that accepts input containing mixed categorical, continuous, or missing data [343]. When a value was missing for an individual, that variable was not included in the distance calculations for this individual. Different linkage methods (single, average and complete) were tested [344]. Single and average linkage consistently assigned individual
