parameters, and microbial compositions were expressed as relative abundances in the fecal samples.26 Microbial relative abundances were transformed using the centered log-ratio (CLR) transformation. Bacteria not present in at least 10% of samples were discarded.

Statistical analysis

Descriptive data were presented as mean ± standard deviation (SD), median [interquartile range, IQR] or proportions (n) with corresponding percentages (%). Demographic and clinical characteristics were compared between groups using Mann-Whitney U-tests, chi-squared tests or Fisher’s exact tests (if n observations were <10). Two-tailed P-values ≤ 0.05 were considered statistically significant. PCA was performed for dimensionality reduction to identify distinct clusters, followed by an assessment of their potential determinants. A k-means clustering algorithm (k = 2) was additionally applied to the first two principal components (PCs 1 and 2) to label the observed clusters. To assess the associations between patient phenotypic and clinical factors and the presence of antibody-bound peptides, logistic regression analysis was performed with adjustment for age and sex. Adjustment for multiple comparisons was performed using the Benjamini-Hochberg method,27 and associations with a false discovery rate (FDR) < 5% were considered statistically significant.

Using GenBank®, we queried the corresponding amino acid sequences of four additional IBD-associated antibodies (anti-Escherichia coli outer membrane porin C [anti-OmpC] and the anti-flagellin antibodies anti-CBir1, anti-Fla2 and anti-A4-FlaX) and used the blastp application from BLAST® (v.2.10.1) to analyze sequence homology with the peptide amino acid sequences incorporated in the PhIP-Seq library.

Classification analysis was performed using logistic regression with an elastic net penalty, optimizing the alpha (elastic net mixing) and lambda (regularization strength) parameters, using the caret package (v.6.0-90) in R (v.4.2.1).
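As an illustration of two of the transformations described above, the CLR transformation and the Benjamini-Hochberg adjustment can be sketched in a few lines of Python. This is a minimal sketch and not the code used in the study, which was carried out in R. The function names are hypothetical, and the pseudocount used to handle zero abundances is an assumption, since the text does not state how zeros were treated.

```python
import math


def clr(abundances, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's relative abundances.

    The pseudocount is an assumption of this sketch: CLR needs strictly
    positive values, and the source does not say how zeros were handled.
    """
    logs = [math.log(x + pseudocount) for x in abundances]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # stay monotone in the rank of the raw p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical input values, for illustration only.
sample = [0.50, 0.30, 0.20, 0.0]
clr_values = clr(sample)

raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
significant = [q < 0.05 for q in fdr]  # FDR < 5% threshold from the text
```

The CLR values of a sample sum to zero by construction, which gives a quick sanity check on the transform. In R, the adjustment step corresponds to `p.adjust(p, method = "BH")`.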
In addition, three different machine learning algorithms were used to compare classification performance: GBM (package gbm v.2.1.8, with optimization of the number of trees and interaction depth), an SVM model with a radial kernel function (package kernlab v.0.9-29, with optimization of the cost (C) parameter) and neural networks using model averaging (avNNet, package nnet v.7.3-16, with optimization of the number of network nodes and the decay value).

Input data, consisting of antibody-bound peptides from 256 CD patients, 207 UC patients and equal numbers of age- and sex-matched healthy controls, were randomly partitioned into training (80%) and test (20%) sets. The training set was pre-processed by removing antibody-bound peptides that were highly correlated (Pearson’s correlation coefficient ≥ 0.99), showed zero variance, were present in <1% or >99% of the data, or did not differ significantly between classes (Fisher’s exact tests, P>0.005). After these pre-processing steps, the final input data consisted of 186 antibody-bound peptides for the CD vs. healthy controls, 64 for the UC vs. healthy controls and 89 for the CD vs. UC classification tasks. The training set was used for training the models, including parameter optimization by five repeats of five-fold cross-validation to maximize Cohen’s Kappa value (which is a balanced metric of positive and negative predictive values), whereas the test set was used solely for evaluating model performance. Classification performance was assessed by calculating receiver operating characteristic (ROC) statistics and corresponding evaluation metrics, including the area under the curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), the F1-score (harmonic mean of precision and recall), Cohen’s Kappa and overall classification accuracy. All classification performance metrics were reported for the test
The antibody epitope repertoire in IBD