set (metrics for training sets after internal cross-validation are provided in the corresponding supplementary tables). Feature selection from the classification models was performed using recursive feature elimination (RFE) with resampling to identify and quantify the relative importance of the antibody-bound peptides that contributed most to the classification, as well as to minimize the chance of overfitting and maximize performance. Finally, an additional dataset from a previously published population-based cohort study (featuring data on 1,874 antibody-bound peptides from 997 healthy individuals) was used as an external test set to evaluate the generalization of the models to a geographically distinct population.12 Performance of models using the top 5 and top 10 antibody-bound peptides obtained using RFE was compared to that of models using all antibody-bound peptides by DeLong's test, a nonparametric test for comparing the AUCs of ROC curves, as implemented in the pROC package (v.1.18.0) in R.

Statistical analyses were performed using the Python programming language (v.3.8.5, Python Software Foundation, https://www.python.org) with the pandas (v.1.2.3), numpy (v.1.20.0), statsmodels (v.0.12.2), scipy (v.1.7.0) and sklearn (v.0.24.2) packages, and using R (v.4.2.1). Data visualization was performed using the seaborn (v.0.11.1) and matplotlib (v.3.4.1) packages in Python.

Data processing of PhIP-Seq

Sequencing reads from NGS after IP were downsized to 1.25 million identifiable reads per sample, capturing reads that were within one error of all possible barcodes and for which the paired end matched the identifiable antibody-bound peptide. When fewer reads were obtained, a minimum of 750,000 reads was required for data analysis. Enrichment of antigens was calculated by comparing the total number of reads per antigen with the input read level (obtained by sequencing the phage library before IP).
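As a rough illustration of the read-depth normalization described above, the following minimal Python sketch subsamples a per-peptide read-count vector to a fixed depth. It assumes that downsizing can be approximated by sampling without replacement from counts that have already been assigned to peptides. The function name `downsample_counts`, the handling of samples that fall between the minimum and the target depth (kept unchanged), and the fixed random seed are illustrative assumptions, not the pipeline used in this study.

```python
from typing import Optional, Sequence

import numpy as np

TARGET_READS = 1_250_000  # target depth per sample (from the text)
MIN_READS = 750_000       # minimum depth required for analysis (from the text)


def downsample_counts(
    counts: Sequence[int],
    target: int = TARGET_READS,
    minimum: int = MIN_READS,
    seed: int = 0,
) -> Optional[np.ndarray]:
    """Subsample per-peptide read counts of one sample to `target` reads.

    Returns None when the sample has fewer than `minimum` reads.
    Samples with between `minimum` and `target` reads are returned
    unchanged (an assumption made for this sketch).
    """
    counts = np.asarray(counts, dtype=np.int64)
    total = int(counts.sum())
    if total < minimum:
        return None  # too few reads: sample is not analysed
    if total <= target:
        return counts.copy()
    rng = np.random.default_rng(seed)
    # Draw `target` reads without replacement from the pool of all reads.
    return rng.multivariate_hypergeometric(counts, target)
```

Sampling without replacement keeps every downsampled count at or below its original value, which simple rescaling and rounding would not guarantee.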
Each input read level per sample was assumed to generate an output read level null distribution that was fitted using a generalized Poisson distribution. Estimation of its parameters was performed separately for each input read level in each individual sample, and scores were generated while the parameters were fitted to three distribution parameters for all samples, following interpolation for each input read level.28 Individual P-values were calculated and adjusted for multiple comparisons using Bonferroni correction (adjusted P ≤ 0.05 was considered statistically significant and defined as seropositivity). Fold changes were computed as the number of reads for antibody-bound peptides (after IP) relative to the number of input reads (before IP). They were computed only for significantly enriched antibody-bound peptides (called seropositive), whereas the remaining peptides were set to zero. Furthermore, fold changes were only calculated for peptides with a minimum of 25 input reads. Baseline sequencing of the antigen phage library (before IP) was performed at >100-fold coverage. Samples with <200 significantly enriched antigens (compared to input reads) were excluded. Two prevalence filters were applied to select the antibody-bound peptides used in statistical analyses: we included peptides that appeared in at least 5% but below 95% of subjects in either cohort (IBD or LL-DEEP). Finally, to avoid redundancy among antibody-bound peptides with identical sequences, the most prevalent peptide among sequence replicates was chosen, resulting in a final selection of 2,815 antibody-bound peptides for case–control analyses and 2,368 antibody-bound peptides for IBD-cohort-specific analyses.
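The thresholds described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in this study: it takes per-peptide P-values as given (the generalized Poisson fit is not reproduced), applies the Bonferroni correction across peptides within each sample, uses a plain read ratio as the fold change, sets peptides with fewer than 25 input reads to zero, and reads "either cohort" as "within range in at least one cohort". All function and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

ALPHA = 0.05               # adjusted P-value threshold (from the text)
MIN_INPUT_READS = 25       # minimum input reads per peptide (from the text)
MIN_ENRICHED = 200         # minimum enriched peptides per sample (from the text)


def call_fold_changes(pvals, ip_reads, input_reads,
                      alpha=ALPHA, min_input=MIN_INPUT_READS):
    """pvals, ip_reads: samples x peptides; input_reads: Series per peptide."""
    inp = input_reads.reindex(pvals.columns).to_numpy(dtype=float)
    ip = ip_reads.reindex(index=pvals.index,
                          columns=pvals.columns).to_numpy(dtype=float)
    # Assumption: Bonferroni over all peptides tested within a sample.
    p_adj = np.minimum(pvals.to_numpy(dtype=float) * pvals.shape[1], 1.0)
    enough_input = inp >= min_input                 # one flag per peptide
    seropositive = (p_adj <= alpha) & enough_input  # broadcast over samples
    ratio = ip / np.where(enough_input, inp, np.nan)
    fold = np.where(seropositive, ratio, 0.0)       # non-enriched set to zero
    return pd.DataFrame(fold, index=pvals.index, columns=pvals.columns)


def filter_samples(fold, min_enriched=MIN_ENRICHED):
    """Drop samples with fewer than `min_enriched` enriched peptides."""
    return fold.loc[(fold > 0).sum(axis=1) >= min_enriched]


def prevalence_filter(fold, cohort, low=0.05, high=0.95):
    """Keep peptides with prevalence in [low, high) in at least one cohort.

    `cohort` is a Series mapping sample IDs to cohort labels.
    """
    present = (fold > 0).astype(float)
    prevalence = present.groupby(cohort).mean()     # cohorts x peptides
    in_range = (prevalence >= low) & (prevalence < high)
    return fold.loc[:, in_range.any(axis=0)]
```

In this sketch a typical call order would be `call_fold_changes`, then `filter_samples`, then `prevalence_filter`. Removing peptides with identical sequences would follow as a separate step.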