2.5 Statistical Analysis
For the statistical analysis, we used the statsmodels package (version 0.11.1) in Python 3.7.2. The analysis covered three variables: distance covered, distance covered in speed category, and energy expenditure in power category. First, the normality of the variables was checked. For entire match players, this was done for the first half, the second half, and the 15-minute periods of both halves. For substitutes, it was done for the second half and the 15-minute periods of the second half. Normality was assessed with the Kolmogorov-Smirnov test. None of the variables was normally distributed, for either entire match players or substitutes: (i) distance covered (P < 0.001), (ii) distance covered in speed category in all speed categories (P < 0.001), and (iii) energy expenditure in power category in all power categories (P < 0.001). The Kruskal-Wallis test was used to evaluate the differences between the different periods and variables. There were significant differences between every period and variable (P < 0.001) for both entire match players and substitutes. As a measure of effect size, epsilon squared (ε²) was calculated for the Kruskal-Wallis test; its values range from 0 (no relationship) to 1 (a perfect relationship) [23]. In the event of a significant difference, Conover post-hoc tests were used to identify any localized effects. The pairwise comparisons of the variables were used to reject the null hypothesis (P < 0.01). Statistical significance was set at P < 0.05. The source code, access to the data, and the corresponding Jupyter notebooks of the statistical procedure are available as open-source software on GitHub (https://github.com/dijkhuist/Early-Performance-Prediction-Machine-Learning-in-Soccer).
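To make the effect size concrete, the following is a minimal sketch of the Kruskal-Wallis H statistic and the epsilon squared derived from it, written with the Python standard library only. It is an illustration, not the code from the notebooks: the function names are hypothetical, the sample values are invented, and the formula ε² = H(n + 1)/(n² − 1) is one commonly used definition that we assume matches the one cited in [23].

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of samples (e.g. periods)."""
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Correction factor for tied observations.
    tie_counts = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_counts) / (n ** 3 - n)
    return h / correction


def epsilon_squared(h, n):
    """Effect size for Kruskal-Wallis (assumed definition): H(n+1)/(n^2-1)."""
    return h * (n + 1) / (n ** 2 - 1)


# Invented example: one variable measured in three periods.
periods = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_stat = kruskal_wallis_h(periods)
effect = epsilon_squared(h_stat, n=9)
```

With these three fully separated samples, H is about 7.2 and ε² about 0.9, which is close to the "perfect relationship" end of the scale. Two identical samples give H = 0 and therefore ε² = 0.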
2.6 Machine learning
To predict the physical performance of individual players, machine learning models were constructed for each variable: distance covered, distance covered in speed category, and energy expenditure in power category. The physical performance differences between players were eliminated by individualizing and normalizing the variables and outcome measures. Variables were calculated per five-minute period of the match. The performance in the current match was compared to the average individual performance of a player over the whole season. In other words, the mean value of the performance variable over the entire season, computed from all entire matches played by an individual player, was set as a personal baseline. We further calculated these baseline values for each of the 18 five-minute periods of a match. Given this approach, we could calculate a relative individual performance for each player. All constructed features are presented in Table 1.
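The individualization step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the notebooks: the record layout `(player, match, period, value)`, the function names, and the sample values are all hypothetical, and expressing relative performance as the ratio of the observed value to the personal baseline is our assumption, since the text does not state whether a ratio or a difference was used.

```python
from collections import defaultdict
from statistics import mean


def personal_baselines(records):
    """Season mean per (player, period) over all of that player's matches.

    Each record is (player, match, period, value), with period in 0..17
    for the 18 five-minute periods of a match.
    """
    grouped = defaultdict(list)
    for player, _match, period, value in records:
        grouped[(player, period)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


def relative_performance(records, baselines):
    """Observed value divided by the personal baseline (assumed definition).

    Returns None for the relative value when the baseline is zero,
    e.g. a player who never covers distance in a given speed category.
    """
    result = []
    for player, match, period, value in records:
        baseline = baselines[(player, period)]
        relative = value / baseline if baseline else None
        result.append((player, match, period, relative))
    return result


# Invented example: distance covered (m) in the first five-minute period.
records = [
    ("A", 1, 0, 500.0),
    ("A", 2, 0, 700.0),
    ("B", 1, 0, 400.0),
]
baselines = personal_baselines(records)
relative = relative_performance(records, baselines)
```

In this example player A has a baseline of 600 m for the first period, so match 1 scores below 1 and match 2 above 1, while player B, with a single match, scores exactly 1. Values on this scale are comparable across players regardless of their absolute physical capacity.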