

THE VALIDITY AND IMPACT OF STUDENT PERCEPTIONS OF TEACHING QUALITY

Hannah Jakoba Elisabeth Bijlsma

Cover design: Marjolein Vormgeving
Printed by: Ipskamp Printing
Lay-out: Marjolein Vormgeving / Douwe Oppewal
ISBN: 978-90-365-5479-4
DOI: 10.3990/1.9789036554794

© 2022 Hannah Jakoba Elisabeth Bijlsma, The Netherlands. All rights reserved. No part of this thesis may be reproduced, stored in a retrieval system or transmitted in any form or by any means without permission of the author.

This dissertation has been approved by prof. dr. A. J. Visscher.

THE VALIDITY AND IMPACT OF STUDENT PERCEPTIONS OF TEACHING QUALITY

DISSERTATION

to obtain the degree of doctor at the University of Twente, on the authority of the rector magnificus, prof. dr. ir. A. Veldkamp, on account of the decision of the Doctorate Board, to be publicly defended on Friday 9 December 2022 at 16.45 hours

by

Hannah Jakoba Elisabeth Bijlsma

born on 12 January 1991 in Gorinchem, the Netherlands

GRADUATION COMMITTEE:

Chair / secretary: prof. dr. T. Bondarouk
Supervisor: prof. dr. A. J. Visscher, Universiteit Twente
Members:
prof. dr. P. B. Den Brok, Universiteit Wageningen
prof. dr. J. Hattie, University of Melbourne
prof. dr. H. Korpershoek, Rijksuniversiteit Groningen
prof. dr. M. T. Mainhard, Universiteit Leiden
prof. dr. S. E. McKenney, Universiteit Twente
prof. dr. ir. B. P. Veldkamp, Universiteit Twente

CONTENTS

1 General introduction
2 The reliability and construct validity of student perceptions of teaching quality
3 Factors associated with differences in digitally measured student perceptions of teaching quality
4 Does smartphone-assisted student feedback affect the quality of teachers' teaching?
5 Factors influencing teachers' use of digital student feedback to improve their teaching
6 Conclusion and discussion of the findings and implications for educational practice and for future research
Reference list
Appendices
Publications, presentations and contributions
Nederlandse samenvatting (Dutch summary)
Dankwoord (Acknowledgements)


1 General introduction

1.1 INTRODUCTION

Within schools, the quality of the teaching is one of the most important factors that impact student achievement (Nye et al., 2004; Rivkin et al., 2005). Partly in response to a decline in student achievement all over the world (OECD, 2014), great emphasis has been placed on measuring teaching quality (Timperley et al., 2007). Measuring teaching quality can be done for different purposes, for example, to foster the professional development of teachers, to support timely and efficient human resource decisions, and for research purposes (e.g., to measure the effectiveness of an intervention aimed at improving teaching quality).

Teaching quality can be measured in several ways. Single (face-to-face) lesson observations by external observers are quite common. However, multiple lessons should be observed and assessed by multiple raters to obtain a reliable picture of the quality of a teacher's instruction (Hill et al., 2012; Praetorius et al., 2014). Another way to evaluate teaching quality is to analyse student achievement growth. Yet, it remains difficult to infer the added value of a teacher's instruction by using such an approach, because many external factors simultaneously influence students' outcomes. Information on student achievement growth (value added) also does not provide teachers with advice on how to improve their lessons (Visscher, 2017). Teachers can evaluate themselves, but according to findings from Kruger and Dunning (1999), many people think of themselves as performing above average; underperformers, in particular, vastly overestimate their performance. This also appears to be the case for (underperforming) teachers (Inspectorate, 2013). Teacher self-evaluations may thus be invalid measures of teaching quality (Muijs, 2006).

Another way to evaluate teaching quality is to measure the perceptions of the target group, the students, concerning their teacher's teaching (Coles, 2002; Kane & Staiger, 2012). As students are the only ones who experience a teacher's instruction almost daily, they are in some respects in a better position to make judgments about the teacher than, for example, an outside observer who visits the classroom only once or a few times (den Brok, Brekelmans, et al., 2006; Donahue, 1994; Korfhage, 1997). However, validly measuring teaching quality through students' eyes is not a given, and the debate about the use of student ratings as a basis for improving teaching has been going on for some time now (cf. section 1.3 of this chapter). Moreover, the impact of student perceptions of teaching quality on the subsequent improvement of teaching requires more than just providing student perception data to teachers: even if student ratings were guaranteed to be accurate measures of teaching quality, those ratings likely could not by themselves support improvement of individual teaching performance (Loeb, 2013).

The research described in this dissertation contributes to the debate on the validity of student perception data, as well as to the effective use of such data in schools as developmental feedback for teachers. The introduction of this dissertation starts with an overview of student feedback research over the last 100 years, followed by an overview of the main research findings regarding the validity and impact of student perceptions of teaching quality, which serves as a basis for the four studies in the subsequent chapters of this dissertation.

1.2 A BRIEF HISTORY OF THE RESEARCH ON STUDENT FEEDBACK

Röhl et al. (under review) compiled an overview of the history of student feedback research. They indicated that student ratings of teaching as feedback for teachers have been studied for almost 100 years, starting with some experiments in higher education in the 1920s (Stalnaker & Remmers, 1928), and later in primary and secondary schools (Porter, 1942; Remmers, 1934). Starting in the early 1960s, three lines of research emerged regarding the use and effect of student perceptions of teaching quality.

The first research line focused on how student feedback to teachers about their teaching affects its quality. This research started with the work of Gage (1960) and Bryan (1962), which became the basis for subsequent studies by Tuckman and Olivier (1968), Tuckman (1976), Novak (1972), Knox (1973), Lauroesch et al. (1969), and Tacke and Hofer (1979). More recent studies in this research line include Ditton and Arnold (2004), Tozoglu (2006), Mandouit (2018), and the study described in chapter 4 of this dissertation. This student feedback movement focused on using student voice to improve teaching and learning.

In the early 1980s, research on student perceptions of their learning environment or classroom climate (the physical locations, contexts, and cultures in which students learn) increased rapidly. Fraser and colleagues (e.g., Fraser & Fisher, 1982; Fraser et al., 1983), the founders of research on the learning environment, were not the only ones to study students' views of their learning environments; den Brok, Bergen, et al. (2006), Wubbels and Levy (1993), Bell and Aldridge (2014), and Thorp et al. (1994) also intensively surveyed learning environments from the students' perspective. In contrast to the early work on student feedback, this research focused not on how student feedback affects teaching quality, but rather on how, from the students' perspective, the teacher creates a climate within the classroom that affects student learning.

At the start of the 21st century, a new approach, grounded in theories and definitions of teaching quality, entered the research field. Student perceptions of teaching quality were now investigated in order to identify the factors that influence the validity of these measures. For example, researchers argued that student perceptions of teaching quality may be associated with other variables (e.g., the popularity of the teacher, or the students' average grades for a subject), and they studied students' ability to distinguish between different aspects of teaching quality (Ferguson, 2012; van der Lans, 2017; the study presented in chapter 3 of this dissertation). Moreover, researchers increasingly studied the extent to which the student ratings of a teacher correlated with ratings by others, such as ratings by an external observer or a teacher's self-evaluations (Clausen, 2020; Dobbelaer et al., submitted; van der Lans et al., 2015). Despite concerns about the validity and reliability of student perceptions of teaching quality, a significant number of studies have supported the use of student ratings as a valuable source of information about teaching quality and teacher performance (e.g., Atlay et al., 2019; Fauth et al., 2014; Kyriakides, 2005; van der Scheer et al., 2019; Wagner et al., 2013; Wallace et al., 2016).

1.3 VALIDITY OF STUDENT PERCEPTIONS OF TEACHING QUALITY

The scientific debate on the merits of student perceptions of teaching quality has thus been going on for some time now. It has been argued that by measuring student perceptions, the quality of a teacher's teaching can be assessed efficiently many times at once (as many ratings as there are students in the class; Kane & Staiger, 2012), and the measurement can easily be repeated, because the teacher usually teaches the class at least once a week. Therefore, compared to classroom observations, for example, student perceptions can be reliable measures of teaching quality (Coles, 2002; Kane & Staiger, 2012). However, concerns regarding student teaching quality ratings generally involve arguments concerning their validity. That is: do they really measure what we want to measure? Although several types of validity can be distinguished, the construct validity and content validity of student perceptions are two of the main topics of this dissertation.

Construct validity concerns the underlying theoretical assumptions regarding (the items in) the measure. If the construct of "teaching quality" is intended to be measured by means of student perceptions, then the content of the items must represent and cover this construct adequately, based on the theoretical assumptions made by the developers of the measurement instrument.

The construct validity of a questionnaire can be determined psychometrically by calculating the extent to which an item loads on (in other words, "belongs to") a scale that is used to measure a construct. The theoretical foundation of the items therefore needs to be very solid, as a construct validity measure by itself does not tell much about the underlying theory regarding the construct being measured. The construct validity of a questionnaire can also be assessed by experts, by asking them to what extent the questionnaire covers all elements of the construct being measured.

Content validity refers to the extent to which a measure represents all facets of a given construct. Concerning the content validity of students' teaching quality ratings, several factors can influence the ratings students give to the quality of a teacher's teaching. For example, the average teaching quality score may be lower in a class with many low-performing students, without the teaching quality really being lower. Or female teachers might receive significantly lower ratings from male students, although they are doing just as good a job as male teachers do. Or popular, well-liked teachers may receive higher student perception scores regardless of the quality of their teaching (Atlay et al., 2019; Ben-Chaim & Zoller, 2001; Bijlsma & Röhl, 2021). Some associations might not be legitimate, because the factors are unrelated to criteria for effective teaching and beyond the teacher's control, such as age and gender (Benton & Cashin, 2014). Other factors, such as teacher popularity, cannot be eliminated, as they play a role in the teaching process (Bijlsma & Röhl, 2021). Content validity can be investigated psychometrically by checking to what extent such potentially influential factors affect students' ratings.

1.4 THE IMPACT OF STUDENT PERCEPTIONS OF TEACHING QUALITY

Students' teaching quality ratings basically and primarily provide information about how students experience their teachers' teaching (Wisniewski & Zierer, 2021). In order to have these measures positively impact teaching quality, the teacher must receive and use the feedback data to improve their follow-up lessons. Fraser (2007) argued that the improvement of teaching quality based on student ratings includes five steps:
1. Data on student perceptions of teaching quality are collected.
2. The results are provided to teachers as feedback.
3. Teachers identify elements of their lessons that need improvement and consider alternative ways of acting.
4. Teachers carry out improvement-oriented actions based on the feedback.
5. To determine the effectiveness of the actions undertaken by teachers, student feedback can be collected again.

Although Fraser's steps seem obvious, the process of using student perceptions of teaching quality as data to improve teaching is complicated (Röhl et al., 2021). Several factors can hinder or stimulate teachers' use of student feedback to improve their teaching (Wisniewski & Zierer, 2021), including teachers' attitudes towards student feedback, their improvement-oriented actions, characteristics of the school context, and data (system) characteristics (characteristics of the instrument or tool used for collecting the student perceptions). The factors influencing the use of student feedback by teachers, and teachers' improvement activities based on the student feedback, are two other main topics of this dissertation.

1.5 THE IMPACT! TOOL

Student perceptions of teaching quality can be collected efficiently by means of the digital Impact! tool, which allows students to use a digital device to provide teachers with feedback on a lesson. The Impact! tool was originally developed by the University of Twente for the purpose of the research included in this dissertation. A company developed the technical part of the Impact! tool, based on specifications from the author of this dissertation and her colleagues. Figure 1.1 shows an example of an item on a student phone (normally, questions are presented in Dutch).

Figure 1.1 Example of an item on a student phone

Students respond to the items on a 4-point Likert scale (3 = totally agree, 2 = agree, 1 = disagree, 0 = totally disagree). If an item does not apply, the option not applicable (niet van toepassing in Dutch) can be used (this option was added to three items). The "i" button provides additional user information. Every time the tool is used, teachers receive a summary of the feedback the students have given; responses are displayed per student performance group (low, middle, high; see Figures 4.1a and 4.1b in chapter 4 of this dissertation), as indicated by the teacher. The feedback is confidential: the teacher does not know what an individual student answered. If the tool is used multiple times, the change in the student ratings can be displayed per item. The tool is now available for primary, secondary, special and higher education and is mainly used for teacher development and research purposes.
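To illustrate the kind of per-item, per-performance-group summary the tool reports, the following minimal sketch aggregates some ratings in Python. The data layout, item labels and scores are hypothetical and do not describe the tool's actual implementation.

```python
from statistics import mean

# Hypothetical responses: (student_id, performance_group, item, score 0-3).
responses = [
    (1, "low", "clear_goals", 1), (2, "middle", "clear_goals", 2),
    (3, "high", "clear_goals", 3), (4, "low", "safe_climate", 2),
    (5, "middle", "safe_climate", 3), (6, "high", "safe_climate", 3),
]

# Collect the scores per (item, performance group) combination.
summary = {}
for _, group, item, score in responses:
    summary.setdefault((item, group), []).append(score)

# Reporting only group means keeps individual answers confidential.
for (item, group), scores in sorted(summary.items()):
    print(f"{item:12s} {group:7s} mean = {mean(scores):.1f}")
```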

1.5.1 Conceptualization of teaching quality

To develop the Impact! questionnaire, the "teaching quality" construct was operationalized by first conducting a literature review of teacher effectiveness research, to identify characteristics of teaching that positively affect student learning. Several meta-analyses and other publications about effective teaching were found that pointed to a number of teaching practices that are effective for student learning (Creemers, 1994; Day et al., 2008; Fauth et al., 2014; Hattie, 2008; Muijs et al., 2014; Pianta & Hamre, 2006; Praetorius et al., 2018; Reynolds et al., 2014; Sammons et al., 1995; van de Grift, 2007). These practices were categorized using the following seven general characteristics of effective teaching:
1. creating a supportive and positive classroom climate;
2. well-organized and structured classroom management;
3. providing clear instruction;
4. adapting instruction to students' needs;
5. teacher–student interaction;
6. the cognitive activation of students to promote deep learning;
7. assessing student learning during the lesson (formative assessment).

Second, based on this, draft versions of the questionnaire were developed and iteratively discussed among the researchers. To require the least effort from students in answering the items, a standard set of a few items was included in Impact!, and each item addressed only one aspect of teaching. Four answer options were given to avoid a central tendency in responding to the items (Weisberg, 1992). Students were asked to rate a lesson on scientifically proven characteristics of effective lessons (based on the above-mentioned literature review) instead of, for example, being asked if they liked the lesson, and all items were about one single lesson instead of about the teacher's lessons in general. The items were all formulated in a teacher-centred way (e.g., "The teacher created a safe atmosphere during the lesson") to determine the teacher's contribution to lesson quality. Because students were supposed to give their own opinion when answering the items, "I" was used instead of "our class" (e.g., "The teacher clearly indicated what I was going to learn"). Based on feedback from fellow researchers, teachers and students, a final set of 16 items was included in the Impact! questionnaire. The questionnaire can be found in Appendix A.

1.6 THIS DISSERTATION

In order to investigate the main topics that were introduced in this chapter – the validity and impact of student perceptions of teaching quality – four studies were conducted that form the core of this dissertation. In the following sections, each of these studies is introduced.

1.6.1 The reliability and validity of student perceptions of teaching quality

In the first study that we conducted, the following research questions were answered: Is there support for the construct validity of the Impact! questionnaire? How reliable are student perceptions of teaching quality as measured by means of the Impact! tool?

We investigated the student Impact! ratings for 26 mathematics teachers (717 students) using a multi-level modelling approach. A combined item response theory (IRT) and generalizability theory (GT) model (a unidimensional model with a latent score) was used to model and measure the scores. Three models were examined (Glas, 1999, 2016). The models differed systematically:
1) a full model, in which differences between students, teachers and measurement timings, and interactions between these variance components, were included;
2) a model in which the interaction between time and teachers was removed;
3) a linear regression model, to analyse teachers' growth curves over time.

To find statistical support for the construct validity of the Impact! questionnaire, the fit of the combined IRT and GT model was investigated. The model also made it possible to simultaneously estimate variance components (students, teachers, times, and their interactions), to determine the reliability of the scores from students. In addition, a decision study (d-study) was conducted to investigate what happens to the reliability of the measurements in the case of more or fewer measurement times and students.

1.6.2 Factors associated with differences in student perceptions of teaching quality

Insight into confounding variables involved in student perceptions of teaching quality can help us understand better what is measured when we collect student perceptions of teaching quality. Our innovative statistical methods from the first study (section 1.6.1) could be used to scrutinize the specific effects of several variables, as was demonstrated when investigating the following research question: What factors on the student, teacher and class levels are related to differences in student perceptions of teaching quality?

To answer this question, we used the same data and the same full model as in the first study. Theoretically grounded variables that are potentially associated with differences in student perceptions of teaching quality were added as covariates to the model. These variables were teachers' age, gender, teaching experience and popularity among students; students' gender, performance level and attitude towards the feedback; class size; and the average performance level of the students in a class. Factors that proved to be related and unrelated to differences in student ratings were reported and discussed, as well as the advantages of the complex statistical multi-level approach for evaluating extensive longitudinal designs and for giving insight into the factors biasing student perceptions of teaching quality.

1.6.3 The effectiveness of student feedback

In this study, we investigated what happens when the student teaching quality ratings are provided to teachers as feedback on their instruction, in order to develop their insight into the strengths and weaknesses of their lessons. The goal was to promote teachers' professional reflection on the quality of their teaching as a basis for their attempts to improve their teaching. The following research question was answered in the study: Does student feedback promote teachers' insight into the strengths and weaknesses of their teaching, their reflection, and their improvement-oriented actions, and does it help to improve the quality of their teaching?

The feedback was provided to teachers by means of the Impact! tool. Teachers using the tool had access to a private web environment where they could choose which lesson they wanted to receive feedback on. At the end of that lesson, students were asked to respond to the 16 items of the Impact! questionnaire on their digital device.

To answer the research question, a randomized controlled trial was conducted. Mathematics teachers in the experimental group (n = 26) used the Impact! tool at the end of the lessons they chose, to collect student perceptions of teaching quality (13- to 14-year-old students, n = 717). Teachers in the control group (n = 32) did not use the Impact! tool during the research period of 4 months. Questionnaires (including questions on teacher and student background characteristics and the extent of teachers' professional reflection) were administered to teachers (n = 58) and students (n = 1489) in both groups at the pretest and the posttest. Teachers in the experimental group completed two additional digital questionnaires during the intervention period, about whether they had obtained insight into where they could improve their lessons, and about whether they had undertaken improvement-oriented actions during the follow-up lessons and/or outside those lessons.

Descriptive statistics were calculated for the frequency of Impact! use by teachers. Bar charts were made to show the extent of teachers' insight into where they could improve their lessons, and to examine the number of improvement-oriented actions undertaken by teachers. The development of teachers' professional reflection was also analysed. A four-step multilevel modelling procedure was used to investigate the development of teaching quality over time.

1.6.4 Factors influencing teachers' use of digital student feedback to improve their teaching

In this final study, we used a qualitative research approach with semi-structured interviews with teachers and students. Eight teachers and 21 of their students who participated in the Impact! project were interviewed about their perceptions of the influence that factors known to influence data use in general had on teachers' use of Impact! student feedback to improve their teaching. Using a comparative case study approach, two teachers were also compared with each other regarding their own and their students' perceptions.

The quality of one teacher's teaching had improved after receiving student feedback; the quality of the other's teaching had not. The findings show which of the factors included in the study were important for teachers' use of digital student feedback to improve teaching quality, according to teachers and students.

1.7 FINAL NOTES

The four studies just introduced are reported in the chapters that follow (chapters 2, 3, 4 and 5, respectively), and are based on four separate research papers, which were all either published in or submitted to scientific journals. Each chapter can be read independently; however, the chapters may overlap slightly in their introduction and theoretical framework. Chapter 6 provides a summary of and reflection on the main findings of the four studies, considerations regarding implementing student feedback use in schools, and recommendations for future research.


2 The reliability and construct validity of student perceptions of teaching quality

Abstract

Student perceptions of teaching quality can be collected efficiently by means of the Impact! tool, which allows students to provide feedback to teachers regarding a lesson. However, the reliability and validity of student perceptions of teaching quality are a subject of scientific debate. In this study, data from 26 teachers and their students were therefore analysed to assess the construct validity of the Impact! questionnaire, using a combined item response theory and generalizability theory model. Moreover, the global and local reliability of the scores were investigated. Results supported the construct validity of the questionnaire. The student Impact! scores were also found to be reliable measures of teaching quality.

Based on: Bijlsma, H. J. E., Glas, C. A. W., & Visscher, A. J. (under review). The reliability and construct validity of student perceptions of teaching quality. Educational Assessment, Evaluation and Accountability.

2.1 INTRODUCTION

In education, there has been an increasing demand for the evaluation of teaching quality for a variety of purposes (Centra, 2003; Darling-Hammond, 2000). The scores can be used for (scientific) research, for example, for investigating whether an intervention has had an effect on teaching quality (Goe et al., 2008; Muijs, 2006). The measurements can also serve as feedback to teachers, enabling the teacher to obtain insight into the strengths and weaknesses of a lesson or lessons (Fraser, 2007; Marzano & Toth, 2013). Based on this, customized professionalization activities can be used to improve teaching quality and enhance student learning (Scheerens & Bosker, 1997; Seidel & Shavelson, 2007; Wang et al., 1993). Moreover, statements about the quality of the teacher's instruction based on teaching quality scores can be used for management decisions (van der Lans & Maulana, 2018).

Teaching quality can be measured in several ways. Lesson observations by external observers are quite common; however, multiple lessons should be observed and assessed by multiple raters to obtain a reliable picture of the quality of a teacher's lessons (Hill et al., 2012; Praetorius et al., 2014). Another way to evaluate teaching quality is to analyse student achievement growth. Yet, it remains difficult to calculate the added value of a teacher using such an approach, because many external factors also influence students' outcomes, and information on students' achievement growth does not provide teachers with tips for improving their lessons (Sammons et al., 1995; Timmermans et al., 2011). Teachers can also evaluate themselves, but according to Kruger and Dunning (1999), many people think of themselves as performing above average. Underperformers, in particular, vastly overestimate their performance, including underperforming teachers (Inspectorate, 2013). Teacher self-evaluations may thus be invalid measures of teaching quality (Muijs, 2006).

Another way to evaluate teaching quality is to measure the perceptions of the target group, the students, about the quality of their teachers' instruction (Coles, 2002; Kane & Staiger, 2012). Compared to the use of classroom observations to assess teaching quality, teaching quality can be assessed easily and efficiently many times at once by means of student perceptions (as many ratings as there are students in a class; Kane & Staiger, 2012). Furthermore, as students are the only ones who observe a teacher daily, they are in some respects in a better position to make judgments about the teacher than an outside evaluator who visits the classroom only once or a few times (den Brok, Brekelmans, et al., 2006; Donahue, 1994; Korfhage, 1997).

Technical developments nowadays enable efficient collection and processing of student perceptions of teaching quality. The Impact! tool is an example of such a technological development. It enables students to use a digital device to easily give feedback to their teachers about the degree to which the lesson that just ended met several characteristics of effective teaching. Despite the advantages of using student perceptions for evaluating teaching quality, some critical concerns remain (Camburn, 2012; de Jong & Westerhof, 2001; Ferguson, 2012; Muijs, 2006) with regard to, among other things, the construct validity and reliability of the instruments used for measuring teaching quality. Research on this topic is limited and scientific debate continues. The investigation of these psychometric features of the Impact! tool contributes to the debate about the validity of student perception measures of teaching quality. The following research questions are answered in this chapter: Is there support for the construct validity of the Impact! questionnaire? How reliable are student perceptions of teaching quality as measured by means of the Impact! tool?

2.2 THEORETICAL FRAMEWORK

2.2.1 Construct validity

The construct validity of a questionnaire reflects the extent to which the questionnaire measures the construct that is intended to be measured (Cronbach & Meehl, 1955; Messick, 1995). To evaluate construct validity, one first needs to establish whether the questionnaire covers every single element of a construct (Gravetter & Forzano, 2012; Radhakrishna, 2007; Shuttleworth, 2009). For that purpose, the construct that one intends to measure needs to be conceptualized, by searching for specific characteristics of that construct and by reviewing relevant research on the construct. Experts in the particular research field can be asked for their ideas about characteristics of the construct. Next, the construct characteristics need to be operationalized in items. A draft version of the questionnaire can then be presented to the target population, to ensure that the items are answerable and understandable for them (Baarda & De Goede, 2006; Camburn, 2012).

Secondly, the construct validity of a questionnaire can be evaluated using a statistical approach. Factor-analytic approaches are the most familiar. The key concept of factor analysis is that several items have similar patterns of responses because they are all associated with a latent, that is, not directly measured, variable. To determine to what extent a pre-defined theoretical construct can be found in the data collected with the questionnaire, the factor loadings of the items can be calculated.
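As a minimal sketch of this factor-analytic check (using simulated ratings, not Impact! data), the following example generates item responses driven by a single latent variable; the item correlation matrix then shows one dominant eigenvalue, and the first-factor loadings indicate how strongly each item "belongs to" the scale. All values are illustrative.

```python
import numpy as np

# Simulated example: 15 items all driven by one latent "teaching quality"
# variable, plus noise. All numbers here are invented for illustration.
rng = np.random.default_rng(1)
n_students, n_items = 500, 15
quality = rng.normal(size=n_students)            # latent variable per student
true_loadings = rng.uniform(0.5, 0.9, n_items)   # each item taps the construct
responses = np.outer(quality, true_loadings) + rng.normal(size=(n_students, n_items))

# One dominant eigenvalue of the item correlation matrix is consistent with
# a unidimensional construct; the corresponding eigenvector gives the
# (sign-arbitrary) loadings of the items on that single factor.
corr = np.corrcoef(responses, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)          # eigenvalues in ascending order
print("three largest eigenvalues:", np.round(eigvals[-3:][::-1], 2))
print("first-factor loadings:", np.round(eigvecs[:, -1] * np.sqrt(eigvals[-1]), 2))
```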

The studies by Marsh and Roche (1997) and Abrami et al. (1990) are classic examples of such a factor-analytic approach in the case of student perception questionnaires for higher education. However, as the factor analysis is always conducted on aggregated scores at the class level, it does not take into account differences in the psychometric properties (e.g., reliability) of the questionnaire at the student level and the class level. Because scores are aggregated, the analysis also does not deal with missing values (Marsh et al., 2012).

As an improvement on this one-level approach, an item response theory (IRT) modelling approach can be used for statistically investigating construct validity. This approach is analogous to factor analysis, but also takes into account the multi-level structure of the data if scores are nested. Further, compared to classical test theory, it is easy to take missing data into account in IRT modelling, as the missing responses are not included in the estimates (they are not scored as zero). In IRT, it is assumed that all items in a test are indicators of one unidimensional construct. The parameter estimates of an IRT model on item scores, as well as goodness-of-fit indices, give insight into the contribution of individual items to the reliability of the model, and thus into the extent to which the items are indicators of the underlying construct. IRT modelling has become the standard statistical approach for quantitatively determining construct validity (Hambleton et al., 1991; Lord, 1980).

Based on statistical analyses alone, what the underlying construct precisely is remains unclear: do the items reflect the construct one is attempting to measure (Muijs, 2006)? Construct validity can therefore be strengthened in other, more content-based ways. We will next explain how this was done.

2.2.2 The content of the Impact! questionnaire and enhancing construct validity

For the development of the Impact! questionnaire, the construct of "teaching quality" was conceptualized by first conducting a literature review of teacher effectiveness research, to identify characteristics of teaching that positively affect student learning. This included a review of student perception questionnaires (Bijlsma, 2016), to identify constructs and items regarding teaching quality used in other student questionnaires. Several meta-analyses and other publications about effective teaching were found that showed a number of teaching practices that are known to be effective for student learning (Black & Wiliam, 2006; Creemers, 1994; Day et al., 2008; Drijvers, 2015; Fauth et al., 2014; Fraser, 1998b; Hattie, 2008; Hattie & Timperley, 2007; Hiebert & Grouws, 2007; Hollingsworth & Ybarra, 2015; Marzano, 2003; Maulana et al., 2015; Muijs et al., 2014; Pianta & Hamre, 2006; Reynolds et al., 2014; Rosenshine, 1995; Sammons et al., 1995; van de Grift, 2007).

These practices were categorized into the following seven general characteristics of effective teaching:
1. creating a supportive and positive classroom climate;
2. well-organized and structured classroom management;
3. providing clear instruction;
4. adapting instruction to students' needs;
5. teacher–student interaction;
6. the cognitive activation of students to promote deep learning;
7. assessing student learning during the lesson (formative assessment).

Experts in the field of educational science were asked for their ideas about potential items that, in their opinion, reflect the core of high-quality teaching and could be included in the questionnaire. With the information collected through the literature study and the questionnaire review, and the input from the experts, draft versions of the questionnaire were developed and iteratively discussed among the authors of this article. During these discussions, the question was always whether the items reflected the seven characteristics of effective teaching presented above. After the authors had reached consensus about the formulation of the items, the items were presented to grade 9 students from Dutch secondary schools, who were asked whether the items were clear to them, and what they thought the items were about. Their teachers were also asked whether they thought the items were clear for the students, and whether they thought specific important aspects of teaching were missing from the draft questionnaire. Based on the feedback from the teachers and students, a final set of 16 items was included in the questionnaire (see Appendix A). The characteristic of effective teaching that each item refers to is also given in Appendix A. Items 1-15 were responded to using a 4-point Likert scale. Item 16 was an open-ended question, where students could type in their answer. For items 6, 8 and 11, an extra option, not applicable, was added, because these questions do not apply to every situation (for an example on a student phone, see Figure 1.1).

The construct validity of the Impact! questionnaire was evaluated based on the data collected with the Impact! tool, using the IRT modelling approach described in the previous section. Further details on the IRT model and the goodness-of-fit indices are outlined in the Method section of this chapter (section 2.3).

2.2.3 Reliability

A measure is said to have high reliability when it produces similar results under comparable conditions (Baarda & De Goede, 2006; Fraenkel et al., 2012). To indicate the amount of error in the measures, various kinds of reliability coefficients can be calculated, with values ranging between 0.00 (much error) and 1.00 (no error). In general, there are two ways to approach reliability, namely global reliability and local reliability.

Global reliability refers to the extent to which two parallel versions of the same test correlate with each other (Kenny et al., 1994). It can be conceptualized as the proportion of variance explained by the differences between two (or more) measurements, or the extent to which two randomly chosen respondents from a population can be distinguished from each other. The global reliability coefficient can be estimated by, for example, calculating Cohen's kappa (Cohen, 1960; Shrout, 1998). However, coefficient kappa can underestimate reliability, because it does not consistently calculate the exact match between the measurements. Another way to determine global reliability is to calculate Cronbach's coefficient alpha (Cronbach, 1951; Santos, 1999). Although Cronbach's alpha has received quite some critique (e.g., in situations where it does not take into account the nested structure of data; Dunn et al., 2013; Sijtsma, 2009), it is widely used as a measure of the reliability of (psychological) tests (COTAN, 2010). The concept of global reliability can also be defined as the dependability of scores: the extent to which the variance of scores on a questionnaire depends on several variance components, such as respondents, time points and different tasks (Brennan, 2001). This provides an understanding of how much of the total observed variance in the measurements can be decomposed into these components, and of what happens to the reliability coefficient if cases are added or removed within a component (Cronbach et al., 1972; Shavelson & Webb, 2005). Compared to other approaches to global reliability, this approach takes into account the multilevel structure of the data. As the dataset used in this study has a multilevel structure, the approach evaluating the dependability of scores was used.
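To make the dependability notion concrete, the sketch below computes a generalizability coefficient for teacher scores from a set of variance components and projects, d-study style, how the coefficient changes when students or measurement moments are added or removed. The component values are invented for illustration and are not estimates from this study.

```python
# A minimal d-study sketch: the generalizability (dependability) coefficient
# of a teacher's mean score for relative decisions, given variance components
# for teachers, students nested in teachers, the teacher-by-time interaction,
# and the residual. All component values below are illustrative.

def g_coefficient(var_teacher, var_student, var_teacher_time, var_residual,
                  n_students, n_times):
    # Error variance shrinks as more students and measurement moments are
    # averaged over. The main effect of time is omitted here: when all
    # teachers are rated at the same moments, it does not differentiate them.
    error = (var_student / n_students
             + var_teacher_time / n_times
             + var_residual / (n_students * n_times))
    return var_teacher / (var_teacher + error)

# Hypothetical components, then a few d-study scenarios.
for n_s, n_t in [(10, 3), (25, 3), (25, 7), (25, 15)]:
    rho = g_coefficient(0.30, 0.50, 0.10, 0.60, n_s, n_t)
    print(f"{n_s:3d} students, {n_t:2d} moments -> E(rho^2) = {rho:.2f}")
```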

The local reliability investigated in this study is an indication of the measurement precision at a specific scale point. In other words, it is a measure of the precision with which a specific teaching quality score is estimated. The standard error of that score is inversely related to what is called test information. Test information is the sum of the item information values, which indicate how much the individual items contribute to the reliability of the instrument. This can be investigated by determining the information value of every single item on the questionnaire. The information value of an item depends on the location parameter, which indicates how difficult it is to receive a high score on that item, and the discrimination parameter, which indicates the contribution of that item to the scale and how well responses (in this study: responses of students) discriminate between items and students (Fauth et al., 2014; Kyriakides, 2005).

In this study, both the global and the local reliability of the Impact! questionnaire were investigated by (respectively) evaluating the dependability of scores and calculating the measurement precision at a specific scale point. Further details are outlined in the Method section.

2.3 METHOD

2.3.1 Participants and research design

In total, 26 teachers (58.3% of whom were male) with an average age of 41.1 years (SD = 10.8) and on average 12.7 years of teaching experience (SD = 8.9), and 717 students (48.5% male; all aged 14 or 15 years old) participated in the study. Throughout a period of 4 months during the 2016-2017 school year, the teachers and their students used the Impact! tool at the end of a number of mathematics lessons chosen by the teachers. In this way, student perceptions of the quality of their mathematics teachers' teaching were collected. The number of measurement moments differed between teachers, ranging from 3 to 17, with an average of 7.

2.3.2 The data

The items on the Impact! questionnaire were originally scored from 0 to 3. However, because the lowest category was used in too few responses (which negatively affected the stability of the analyses), the two lowest categories were combined in the analyses. The extra option not applicable, available for three of the 16 questions, was coded as missing values. The data followed a multilevel structure pertaining to teachers, students, and time points (students' responses were nested within teachers and collected at different times). For every teacher, the first and last measurements were administered in a paper-based format as part of a pretest/posttest (the pretest and posttest also included questions regarding background characteristics of students and teachers), while the intermediate measurements, at time points 2 up to 16, were conducted digitally using the Impact! tool. Items in both measurement formats were similar. To answer the research questions, the data were analysed using a combination of an item response theory model and a generalizability theory model. In the following sections, these two models are described.

2.3.3 The item response theory model

Item response theory (IRT) was used to model students' responses to the Impact! questionnaire items at the different measurement moments. All scored items (k = 1, ..., 15) are associated with a unidimensional latent variable theorized as "teaching quality". The model used was the generalized partial credit model (GPCM; Muraki, 1992). A response to item k in category m (m = 0, 1, or 2) pertaining to a teacher j, by a student i, at time point t, is denoted by U_{ijtk}. The probability of the response is given by

p(U_{ijtk} = m \mid \theta_{ijt}) = \frac{\exp(m \alpha_k \theta_{ijt} - \delta_{km})}{1 + \sum_{h=1}^{2} \exp(h \alpha_k \theta_{ijt} - \delta_{kh})},    (1)

where \alpha_k and \delta_{km} are the item discrimination and an item location parameter of item k, respectively, and \theta_{ijt} is the quality of teacher j as perceived by student i at time point t.

2.3.4 The generalizability theory model

In the present context, the generalizability theory (GT) model is a linear model that decomposes the quality of teacher j as perceived by student i at time point t, that is, \theta_{ijt}, into a number of random effects, that is,

\theta_{ijt} = \theta_j + \omega_t + \tau_{i|j} + \delta_{jt} + \varepsilon_{it|j}.    (2)

The interpretation and the distribution of these random effects are given in Table 2.1.

Table 2.1 Generalizability theory model: parameters, interpretations and distributions

Parameter            Interpretation                                    Distribution
\theta_j             main effect teacher                               N(0, \sigma^2_\theta)
\omega_t             main effect time                                  N(0, \sigma^2_\omega)
\tau_{i|j}           main effect students nested in teachers           N(0, \sigma^2_\tau)
\delta_{jt}          interaction teachers & time                       N(0, \sigma^2_\delta)
\varepsilon_{it|j}   interaction students & time nested in teachers    N(0, \sigma^2_\varepsilon)
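As a numerical companion to equations (1) and (2), the sketch below evaluates the GPCM category probabilities for a single item and assembles a perceived-quality value theta_ijt from the five random effects of Table 2.1. All parameter values are illustrative, not the estimates obtained in this study.

```python
import numpy as np

def gpcm_probs(theta, alpha, delta):
    """Equation (1): P(U = 0), P(U = 1), P(U = 2) for one item.

    theta : perceived teaching quality (theta_ijt)
    alpha : item discrimination (alpha_k)
    delta : the two item location parameters (delta_k1, delta_k2)
    """
    # The exponent for category m is m*alpha*theta - delta_km; for m = 0
    # it is 0, so that category contributes the 1 in the denominator.
    exponents = np.array([0.0,
                          1 * alpha * theta - delta[0],
                          2 * alpha * theta - delta[1]])
    numerators = np.exp(exponents)
    return numerators / numerators.sum()

# Equation (2): theta_ijt as the sum of the random effects of Table 2.1,
# drawn here with illustrative standard deviations.
rng = np.random.default_rng(0)
theta_ijt = (rng.normal(0, 0.5)     # theta_j: main effect teacher
             + rng.normal(0, 0.2)   # omega_t: main effect time
             + rng.normal(0, 0.6)   # tau_i|j: student nested in teacher
             + rng.normal(0, 0.3)   # delta_jt: teacher-by-time interaction
             + rng.normal(0, 0.7))  # eps_it|j: student-by-time residual

print(gpcm_probs(theta_ijt, alpha=1.2, delta=(-0.4, 0.6)))  # sums to 1.0
```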

Note that since students are nested within teachers, the between-teacher variance is confounded with the between-student-groups variance.

2.3.5 Estimation method

The combined IRT and GT model is a multilevel model, as introduced by Fox and Glas (2001, 2003). The combined model was estimated in a Bayesian framework using OpenBugs (Lunn et al., 2009). The motive for this choice, rather than using traditional standard software for latent variable modelling, was that OpenBugs allows users to completely specify their own model, taking all facets and levels into account. Estimation in OpenBugs was done with a Markov chain Monte Carlo (MCMC) procedure. The script for the estimation of the overall model is provided in Appendix B.

The MCMC sampling procedures in the application were run with over 24,000 iterations for each of two chains. Each chain had a burn-in of 10,000 iterations. These large numbers were chosen to ensure convergence and small estimation errors. The sampling procedures were checked for convergence by visual inspection of the trace plots, by comparing the estimates of the two chains, and by checking whether the MCMC errors were acceptably small (< 5% of the estimated standard deviation; Lunn et al., 2012).

2.4 DATA ANALYSIS

2.4.1 Construct validity

To assess the construct validity of the Impact! questionnaire statistically, the fit of the combined IRT and GT model was investigated in three ways. The first investigation tested the probability of the predicted item responses as implied by the formula. For that purpose, the data were divided into three subgroups: a group with low total scores from students (below 11 out of the possible
