=Paper=
{{Paper
|id=Vol-2770/paper20
|storemode=property
|title=Psychometrical Modeling of Components of Composite Constructs: Recycling Data Can Be Useful
|pdfUrl=https://ceur-ws.org/Vol-2770/paper20.pdf
|volume=Vol-2770
|authors=Denis Federiakin,Elena Kardanova
}}
==Psychometrical Modeling of Components of Composite Constructs: Recycling Data Can Be Useful==
Psychometrical Modeling of Components of Composite Constructs: Recycling Data Can Be Useful

Denis Federiakin [0000-0003-0993-5315] and Elena Kardanova [0000-0003-2280-1258]

National Research University Higher School of Economics, Potapovsky Lane 16, build. 10, 101000 Moscow, Russia
dafederiakin@hse.ru, ekardanova@hse.ru

Supported by the Russian Science Foundation, grant No. 19-29-14110.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Proceedings of the 4th International Conference on Informatization of Education and E-learning Methodology: Digital Technologies in Education (IEELM-DTE 2020), Krasnoyarsk, Russia, October 6-9, 2020.

Abstract. This paper describes the set of studies needed to justify the simultaneous use of both the overall test score and the subscale scores when measuring complex constructs. We investigate in detail one strategy for modeling composite constructs that is popular within international comparative studies of education. This strategy is based on repeated recalibration of the same data, using unidimensional models to report the overall test score and multidimensional models to report its components. We use Monte-Carlo simulations to illustrate that repeated recalibrations of the data with unidimensional and multidimensional models yield essentially the same results once the estimates are transformed to a common scale. However, we also illustrate that the fit of the unidimensional models to the data may be confounded if the components of the composite vary in terms of their relations with each other and their variances. We illustrate the studied strategy for modeling composite constructs using the computer adaptive test PROGRESS-ML, which measures basic math literacy in the third grade.

Keywords: Composite Constructs, Composite Tests, Multidimensional Rasch Models, Unidimensional Rasch Models, PROGRESS-ML.

===1 Introduction===

Within contemporary educational sciences and, more broadly, the social sciences, there is a growing need for composite measurement instruments – instruments with a complex structure, for example, ones consisting of subscales that each contribute in some way to the overall test score. This is partly a consequence of the trend toward measuring complex constructs, such as 21st century skills or new literacies. Such constructs consist of multiple components, and it is not easy to portray them as a classic unidimensional, single-component trait of respondents. It is widely assumed that information about the integral trait level is valuable for policymakers, while information about its components is valuable for practitioners. Such information provides important insights for improving the performance of, for example, the educational system or psychological practice at different levels of the social system.

The Standards for Educational and Psychological Testing [1] clearly state that (1) test scores should not be reported to users until their validity, fairness, and reliability have been studied, and (2) if a test produces more than one score, the psychometric quality of all reported scores must be confirmed. This is important because inaccurate information about the overall test score can lead to decisions with undesirable social consequences, while erroneous information about subscores can lead to incorrect decisions to correct or improve the situation [2].
In an academic environment, low-quality subscores can lead to false conclusions about the nature of the phenomenon being studied.

===2 Psychometrics of composite instruments===

In psychometric terms, composite tests are multidimensional. Therefore, the task is to evaluate, if possible, both the overall ability and its components. Psychometric modeling of such tests consists of several stages. First of all, a researcher needs to check whether the test is essentially unidimensional. This can be done by utilizing the weak definition of local item independence, which states that item residual correlations are zero after extracting a single factor estimated by the unidimensional model (figure 1a) [3]. If so, a researcher can report the overall test score – provided, of course, that it is shown to be valid and psychometrically consistent. If the test is not unidimensional, it is necessary to use multidimensional models, and then the overall test score requires additional research using hierarchical models [4]. Two types of hierarchical models are particularly popular – models with higher-order factors (figure 1c) [5] and bifactor models (figure 1d) [6, 7]. Despite their algebraic similarities [8, 9] and the fact that both groups of models assume the use of the overall test score (called the general factor in factor-analytic terminology), their interpretations differ [10, 11]. While models with higher-order factors estimate a general factor that manifests in the items through the subscores, bifactor models assume a complete separation of the general factor and the specific factors.

Second, if a researcher intends to report subscores (for example, by cognitive operations or content areas), several approaches are available. The first is to apply a unidimensional model to each subscale separately [12]. This is called the "consecutive approach". It is the least attractive option, since the number of items in each subscale is usually small. Therefore, the measurement reliability will not be high enough, and the measurement error will be too large, which makes it impossible to report the subscores [1].

The second approach involves the use of bifactor models. These models, hypothetically, allow simultaneous reporting of the overall score and the subscores as additional independent information. However, studies show that subscores estimated in bifactor models rarely have satisfactory reliability, because they describe information not extracted by the overall score. Therefore, valuable information is often suppressed by random noise [13]. Moreover, their interpretation is difficult due to the model assumptions.

Fig. 1. Structural models for modeling composite constructs. Latent variables are drawn as circles, observed variables as squares. One-headed arrows represent regression dependencies, two-headed arrows represent correlations.

The third approach involves the use of non-compensatory multidimensional models [14] (correlated traits models, or models for between-item multidimensionality [15]; figure 1b). Such models are, essentially, several unidimensional models combined in a single likelihood equation. This is the approach investigated in this paper. From a modeling perspective, it is crucial to distinguish this analysis strategy from bifactor modeling and from the consecutive approach. The described approach breaks the general factor into parts that are proportional to the number of items dedicated to a particular dimension. Each latent trait is calculated from respondents' responses to the corresponding items, taking into account the estimated correlations between the latent variables.
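To make the contrast between these structures concrete, the short base-R sketch below builds the two loading patterns being compared: a between-item ("correlated traits") Q matrix, in which each item measures exactly one dimension, and a bifactor pattern, in which every item additionally loads on a general factor. The 30-item, five-subscale split is purely illustrative.

<pre>
# Illustrative loading structures for a 30-item test with five subscales.
n_items <- 30; n_dims <- 5
subscale <- rep(1:n_dims, each = n_items / n_dims)

# Between-item ("correlated traits") Q matrix: one non-zero entry per item.
Q_correlated <- matrix(0, n_items, n_dims,
                       dimnames = list(NULL, paste0("Dim", 1:n_dims)))
Q_correlated[cbind(1:n_items, subscale)] <- 1

# Bifactor pattern: a general factor plus the item's specific factor.
Q_bifactor <- cbind(General = 1, Q_correlated)

head(Q_correlated); head(Q_bifactor)
</pre>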
Multidimensional models thus use information about each dimension and compute the probability of completing or endorsing an item as a function of several latent variables, taking into account the relationships between them. As a result, the reliability of such measurements will be greater than under the consecutive approach, and it is therefore more likely that the subscores can be reported. Bifactor modeling, by contrast, models additional subscale-specific components, which add up with the general factor to produce the observed item scores. Consequently, the interpretation of the subscale-specific scores from bifactor models is too convoluted for most practical tasks. As a result, the application of bifactor models is mostly limited to modeling testlet-based assessments and local item dependence conditional on person parameters.

The third analysis strategy illustrates the use of collateral information. Collateral information is any information about items, respondents, or their interaction which, when introduced into the measurement model, does not change the interpretation of its parameters but reduces the uncertainty of the estimates [16]. In this case, for each subscale, the responses to all other subscales (together with the correlation matrix of the latent dimensions) serve as collateral information [17].

Thus, to use the results of composite tests, regardless of the chosen data-analysis strategy, it is necessary to conduct extensive psychometric research and to decide whether the overall test score and the subscores are reliable and psychometrically consistent enough to be reported to users.

===2.1 Modeling components of the composite===

Breaking the overall test score into its components is popular within cross-national comparative studies of education. For example, PISA [18] and TIMSS [19] use repeated recalibrations of their testing data to decompose the overall test score into the components that produce it. TIMSS uses its theoretical framework to report subscores on the cognitive operations required to solve an item. From a statistical point of view, this de facto amounts to ignoring model-fit indices and recycling the data. Nevertheless, the interpretation allows researchers to describe the composition of the overall test scores in terms of how respondents achieve them, which enables policymakers to make decisions based on information described in the language of the social sciences.

However, the difference and equivalence between multidimensional and unidimensional models is a challenging area of psychometric research. Many studies have already touched upon the idea of the unidimensional interpretation of multidimensional measurements. For example, Reckase et al. [20] showed that if test items are selected according to specific conditions, a unidimensional model can fit such data; however, this requires strictly guiding the test-development process by the psychometric parameters of the items. Several researchers have also tried to conceptualize the fit of unidimensional models to multidimensional data in terms of the strength of the general factor.
For example, Drasgow and Parsons [21] demonstrated that if the general factor is "strong" (that is, if the factors in the multidimensional model are firmly positively correlated), then the unidimensional model can fit the data well. Our paper describes the same phenomenon directly in terms of the correlation matrix of the latent dimensions. Many other researchers have studied how model modification can allow a unidimensional model to fit multidimensional data. The main implication of those findings is that it is possible to use the overall test score even if the general factor is weak, as long as the multidimensional structure of the data is explicitly modeled [22].

Nonetheless, much research has found that the differences between parameter estimates from multidimensional and unidimensional models are expressed mainly in the item parameters. Numerous studies have highlighted unpredictable distortion in the estimated item parameters when the model's dimensionality is misspecified with respect to the data-generating model [23]. Another conclusion from this stream of research, however, concerns the stability of the person parameters. As DeMars noted (although in another context), "if the focus is on estimated θ's and not on the item parameters, any of the models will perform satisfactorily" [24]. Reise et al. [25] summarized that the correlation of person parameters from different models tends to be close to 1 regardless of the model's misspecification.

===3 Simulation study===

To illustrate the possibility of fitting unidimensional models to multidimensional data, we perform a small-scale Monte-Carlo simulation study. We generate the data under the multidimensional Rasch model and calibrate both the unidimensional (misspecified) and the multidimensional (correctly specified) Rasch models on the data. We then compare the average of the multiple person abilities from the multidimensional model with the estimated person ability from the unidimensional model, using the Pearson linear correlation.

We also analyze the essential unidimensionality of the simulated data by means of residual analysis. To do so, we apply principal components analysis (PCA) to the standardized response residuals under the unidimensional model. This is standard practice for the analysis of unidimensionality under the Rasch modeling paradigm. The method rests on the assumption that if the data are unidimensional (and do not exhibit local item dependence conditional on person parameters), the residuals are noise, and no significant principal component can be extracted from them [26, 27]. To analyze local fit, we used the Rasch InFit and OutFit item-wise statistics [28], particularly their range from the minimum to the maximum value. A larger range of InFit and OutFit means that some items deviate from the model prediction and do not fit the Rasch model, while a smaller range means that all items fit the Rasch model.

We conduct the simulations for 2000 respondents responding to 30 dichotomous items, separated equally into five subscales (6 items per subscale). We carry out 100 replications with randomly varying positive definite variance-covariance matrices with a positive manifold (all latent dimensions non-negatively correlated). Note, however, that when the variance-covariance matrix is varied randomly, the variances of the latent dimensions are altered as well. To control this source of differences in the results, we also carry out 50 replications for each of three fixed variance-covariance matrices of person parameters (in which all correlations are equal to 0.80, 0.50, or 0.20, and all variances are equal). For the randomly varying variance-covariance matrices, we quantify the spread of the correlations as the standard deviation of the values in the lower triangle of the correlation matrix; this allows us to analyze the fit of the unidimensional models conditional on how much the values of the variance-covariance matrix differ. Both the multidimensional and the unidimensional model can be considered special cases of the Multidimensional Random Coefficients Multinomial Logit Model [15]. The quasi-Monte-Carlo algorithm implemented in the TAM package v. 3.5-19 [29] for R v. 3.6.2 was used to estimate all models.
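A minimal sketch of one such replication under roughly these settings is given below, using MASS for the multivariate normal draw and TAM for estimation. The construction of the random positive-manifold covariance matrix and the names of the output fields (the EAP columns of $person, the Infit/Outfit columns returned by tam.fit) are our assumptions and may differ across TAM versions; the sketch is illustrative rather than the exact code used in the study.

<pre>
library(MASS)  # mvrnorm
library(TAM)   # Rasch estimation, as in the paper

set.seed(1)
n_persons <- 2000; n_items <- 30; n_dims <- 5

# One random positive definite covariance matrix with a positive manifold
# (all off-diagonal elements non-negative); construction is illustrative.
L <- matrix(runif(n_dims * n_dims), n_dims, n_dims)
Sigma <- L %*% t(L) + diag(0.5, n_dims)
R <- cov2cor(Sigma)
sd_lower_triangle <- sd(R[lower.tri(R)])  # spread of the latent correlations

# Between-item Q matrix: each item measures exactly one of the five dimensions.
Q <- matrix(0, n_items, n_dims)
Q[cbind(1:n_items, rep(1:n_dims, each = n_items / n_dims))] <- 1

# Simulate person parameters and dichotomous Rasch responses.
theta <- MASS::mvrnorm(n_persons, mu = rep(0, n_dims), Sigma = Sigma)
b     <- rnorm(n_items)                       # item difficulties
p     <- plogis(theta %*% t(Q) - matrix(b, n_persons, n_items, byrow = TRUE))
resp  <- matrix(rbinom(n_persons * n_items, 1, p), n_persons, n_items)

# Calibrate the misspecified unidimensional and the correctly specified
# multidimensional Rasch models on the same responses.
uni   <- TAM::tam.mml(resp)
multi <- TAM::tam.mml(resp, Q = Q, control = list(snodes = 2000))

# Compare the unidimensional ability with the average of the five
# dimension-specific abilities (EAP estimates; column naming assumed).
theta_uni   <- uni$person$EAP
theta_multi <- rowMeans(multi$person[, grep("^EAP\\.Dim", names(multi$person))])
cor(theta_uni, theta_multi)

# Unidimensionality check: PCA of standardized residuals under the
# unidimensional model (first eigenvalue vs. the critical value of 2).
p_hat <- plogis(outer(theta_uni, uni$xsi$xsi, "-"))
z     <- (resp - p_hat) / sqrt(p_hat * (1 - p_hat))
eigen(cor(z))$values[1]

# Local fit: range of the item-wise InFit and OutFit statistics.
fit <- TAM::tam.fit(uni)
diff(range(fit$itemfit$Infit)); diff(range(fit$itemfit$Outfit))
</pre>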
===3.1 Results of the simulation study===

The average correlation between the person parameters from the unidimensional model and the average of the person parameters from the multidimensional model is 0.99 (p < 0.01), with a standard deviation of less than 0.01 across all simulated conditions. These results hold in every case – whether or not the variance-covariance matrix was fixed. This result agrees with other similar research suggesting that person parameters are relatively stable under model dimensionality misspecification.

Further, the results of the dimensionality analysis using PCA on the unidimensional model residuals do vary depending on the size of the correlations of the latent dimensions in the data-generating multidimensional model. They suggest that the eigenvalue of the first component depends on the mean correlation of those dimensions (r = -0.48, p < 0.01, figure 2) and, to a lesser extent, on the differences among the values of the correlation matrix (r = -0.25, p < 0.05, figure 3). Note, however, that the critical value for the first eigenvalue is 2 [30, 31]. Since the first component's eigenvalue exceeds this critical value, all unidimensional models are critically misspecified for these simulated data, and their results are therefore inconsistent.

Fig. 2. Scatterplot of the eigenvalue of the first component from PCA applied to the standardized model residuals versus the mean correlation of the latent dimensions, from the simulations with randomly varying variance-covariance matrices. Each point represents a single simulation.

Fig. 3. Scatterplot of the eigenvalue of the first component from PCA applied to the standardized model residuals versus the standard deviation of the correlations of the latent dimensions, from the simulations with randomly varying variance-covariance matrices. Each point represents a single simulation.

To support these findings, we additionally analyzed the eigenvalue of the first component from PCA applied to the unidimensional model residuals when the variance-covariance matrix was fixed, comparing it across the different values of the fixed correlation and the randomly varied matrix. The results are presented in figure 4. They also suggest that this eigenvalue depends on the size of the fixed correlation; however, it never exceeds the critical value of 2. Therefore, data with small (or absent) variation in the values of the correlation matrix of the underlying latent factors can be considered unidimensional.
Fig. 4. Boxplot of the variance of the first component from PCA applied to the standardized model residuals, depending on the simulation condition.

Next, we analyzed the item fit statistics, comparing their values across the different simulation conditions in the same way as for the previous results. The results are presented in figures 5 and 6. We found a similar pattern: the range of the item fit statistics from the unidimensional models under the randomly varied variance-covariance matrices exceeds that under the fixed variance-covariance matrices. However, since the data-generating model is a Rasch model, as is the model used for the analysis, the item fit statistics do not react to differences in item discrimination parameters. Instead, they react to the violation of unidimensionality, which is expected [27].

Fig. 5. Range of the Rasch InFit item-wise statistic depending on the simulation condition.

Fig. 6. Range of the Rasch OutFit item-wise statistic depending on the simulation condition.

Thus, we showed that the unidimensional IRT model can fit the data well even if the data were actually generated under a multidimensional model. This holds for the cases where the values in the correlation matrix of the latent dimensions are positive, the correlations are strong, and they do not vary much. Regardless of model fit, however, the average of the ability estimates from the multidimensional model is equal to the ability estimate from the unidimensional model. For this, of course, the estimates must be transformed to scales with the same numerical values (e.g., a linear transformation to the N(500, 100) scale). This finding is in agreement with previous studies, which found that the person parameters are not as sensitive to the misspecification of model dimensionality as the item parameters.
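The transformation meant here is an ordinary linear rescaling of the ability estimates (in logits) to a reporting scale with mean 500 and standard deviation 100; a minimal sketch with illustrative values:

<pre>
# Linear transformation of ability estimates (in logits) to an N(500, 100)
# reporting scale: center, scale, then shift.
to_reporting_scale <- function(theta, target_mean = 500, target_sd = 100) {
  target_mean + target_sd * (theta - mean(theta)) / sd(theta)
}

theta_logits <- c(-1.2, -0.3, 0.0, 0.4, 1.1)   # illustrative estimates
round(to_reporting_scale(theta_logits))
</pre>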
Nonetheless, the psychometric consistency of the unidimensional score can be confounded if there is variation in the correlation matrix of the "true" latent dimensions. If this is the case, the extraction of the overall test score from multidimensional data cannot be done by averaging the multidimensional model's estimates. Additional research on the "sufficient unidimensionality" of the data is crucial for overall test score reporting.

===4 Real data example===

This section demonstrates the scope of psychometric studies necessary for reporting both overall test scores and specific scores, interpreting them as components of a composite construct. We do so by applying these studies to the PROGRESS-ML basic mathematical literacy test.

The PROGRESS-ML test evaluates how well a student is oriented in mathematics after completing two years of primary school. When developing the test, we relied on the following definition of basic mathematical literacy [32]: "basic mathematical literacy (including working with data) – the ability to apply mathematical tools, reasoning, and modeling in everyday life, including in the digital environment". The PROGRESS-ML basic math literacy test consists of 30 dichotomous items. The assessment is built as a computerized adaptive test with an automated stopping rule.

The content of the test was selected so that, on the one hand, it meets the definition of basic mathematical literacy and, on the other hand, it takes into account the content of the Russian Federal Educational Standard. As a result, we identified five content areas: spatial representations, measurement of quantities, regularities, modeling, and information processing. Test items are grouped into blocks according to content area.

Additionally, the PROGRESS-ML test evaluates the cognitive processes students require to solve the items. When developing the test items, we used the TIMSS theoretical framework for the 4th grade [33]. Therefore, in addition to the content areas, three groups of cognitive operations are measured: knowing, application, and reasoning.

Thus, the PROGRESS-ML test is a composite tool: it covers five content areas and reflects three groups of cognitive operations. It is assumed that the test results will report the students' overall test score (in this case, the level of their basic mathematical literacy) as well as subscores (in this case, by content areas and cognitive operations).

The sample consisted of 6078 third-grade students from two regions of the Russian Federation; the samples were representative of those regions. The average age was 9.06 years (SD = 0.46), and 52.36% of the students were girls.

===4.1 Results of the analysis of the real data===

In the PCA of the standardized residuals, we found that the first component's eigenvalue is 1.45, which corresponds to 4.2% of the residual variance. The eigenvalues of the next four components lie in the interval from 1.15 to 1.2. The explained residual variance is distributed almost uniformly among the components – about 4% per component. Therefore, we conclude that the unidimensional model sufficiently describes the distribution of response probabilities across persons, and the test can be considered unidimensional.

The model Expected-a-Posteriori (EAP) reliability [34] of the overall test score from the unidimensional model was 0.76. For comparison, we calculated the reliability using methods of Classical Test Theory (CTT): the Greatest Lower Bound (GLB) [35] reliability was 0.86, and Cronbach's α [36] was 0.81. It is essential to note, however, that the computerized adaptive design implies that not all items are administered to all respondents, and CTT parameters become unstable in the presence of missing responses. Therefore, even though in our example the reliability estimated within CTT (both GLB and Cronbach's α) is slightly higher than the reliability of the scores evaluated within IRT, these indices should not be trusted.

Overall, the analysis results suggest that the test can be considered unidimensional, even though there are different ways to group the items. This implies that it is possible to report one overall test score of mathematical literacy based on the test results, and that this score will have good reliability and psychometric consistency.

Then, we calibrated the multidimensional IRT models to examine whether the subscores would also have good psychometric characteristics. The results of the reliability analysis are shown in Table 1 for the content areas and in Table 2 for the cognitive operations.
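Before turning to the tables, a rough sketch of such a calibration with the TAM package is shown below. The item-to-content-area assignment and the output fields used (EAP.rel, variance) are assumptions on our part, and the placeholder response matrix only makes the sketch self-contained; the operational analysis uses the real adaptive-test data with its missing responses.

<pre>
library(TAM)

# Placeholder response matrix so the sketch runs end-to-end; in the real
# analysis this is the 6078 x 30 scored matrix with NAs for items that the
# adaptive algorithm did not administer.
set.seed(1)
resp <- matrix(rbinom(6078 * 30, 1, 0.6), nrow = 6078, ncol = 30)

# Hypothetical assignment of the 30 items to the five content areas
# (7, 6, 6, 6, and 5 items, as in Table 1).
content_area <- rep(1:5, times = c(7, 6, 6, 6, 5))
Q <- matrix(0, 30, 5)
Q[cbind(1:30, content_area)] <- 1

mod_uni  <- TAM::tam.mml(resp)                                        # overall score
mod_area <- TAM::tam.mml(resp, Q = Q, control = list(snodes = 2000))  # subscores

mod_uni$EAP.rel             # EAP reliability of the overall score
mod_area$EAP.rel            # EAP reliability of each content-area subscore
cov2cor(mod_area$variance)  # latent correlations between content areas
</pre>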
Table 1. Analysis of relations between content areas (correlations in the upper triangle)

{| class="wikitable"
! Content area !! Spatial representation !! Measurement of quantities !! Regularities !! Modeling !! Information processing
|-
| Spatial representation || || 0.85 || 0.80 || 0.83 || 0.80
|-
| Measurement of quantities || || || 0.85 || 0.90 || 0.83
|-
| Regularities || || || || 0.86 || 0.84
|-
| Modeling || || || || || 0.83
|-
| Variance || 0.89 || 1.23 || 1.12 || 1.06 || 2.95
|-
| Reliability || 0.68 || 0.71 || 0.67 || 0.68 || 0.63
|-
| Number of items || 7 || 6 || 6 || 6 || 5
|}

Table 2. Analysis of relations between cognitive operations (correlations in the upper triangle)

{| class="wikitable"
! Cognitive operation !! Knowing !! Application !! Reasoning
|-
| Knowing || || 0.95 || 0.85
|-
| Application || || || 0.85
|-
| Variance || 1.37 || 0.82 || 0.60
|-
| Reliability || 0.75 || 0.74 || 0.61
|-
| Number of items || 12 || 14 || 4
|}

From the tables, we can conclude that all dimensions have sufficient reliability for use in a monitoring test. Despite the small number of items per subscale, relatively high reliability is possible thanks to the IRT modeling approach used; in fact, such a small number of items per dimension makes raw subtest scores unusable. Additionally, we examined the correlations between the latent dimensions: both the content areas and the cognitive operations correlate approximately equally, at the level of 0.8-0.9. Based on the simulation study, we conclude that this can be seen as an additional argument in favor of the unidimensional model, even though the multidimensional models fit the data better than the unidimensional model according to the AIC [37] and BIC [38] indices. These indices assess the relative fit of models to the data by penalizing extra model parameters (AIC), with the penalty additionally depending on sample size (BIC); lower values indicate better fit. The indices are presented in Table 3.

Table 3. Analysis of model fit

{| class="wikitable"
! Model !! Deviance !! Sample !! Number of parameters !! AIC !! BIC
|-
| Unidimensional || 144255.6 || rowspan="3" | 6078 || 31 || 144318 || 144526
|-
| Content areas || 143875.7 || 45 || 143966 || 144268
|-
| Cognitive operations || 143965.4 || 36 || 144037 || 144279
|}
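As a cross-check, the information criteria in Table 3 can be reproduced directly from the reported deviances and numbers of parameters (AIC = deviance + 2p, BIC = deviance + p·ln(N)); a minimal sketch:

<pre>
# Reproducing AIC and BIC in Table 3 from deviance and parameter counts.
dev   <- c(unidimensional = 144255.6, content_areas = 143875.7,
           cognitive_operations = 143965.4)
n_par <- c(31, 45, 36)
N     <- 6078

aic <- dev + 2 * n_par
bic <- dev + n_par * log(N)
round(rbind(aic, bic))  # matches Table 3 up to rounding

# With TAM, the same quantities are also reported for a fitted model object
# (e.g., in mod$ic; field names may differ across versions).
</pre>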
Thus, multidimensional IRT models allowed us to obtain reasonably reliable subscores (for both content areas and cognitive operations) and therefore made it possible to report them to users. Moreover, the described reliability estimates are derived from IRT models into which no context variables were entered. Note that introducing such variables into the model (using latent regression modeling) leads to even more reliable subscale scores, because they explain part of the ability variance.

===5 Conclusion===

The contemporary psychometric literature notes the growing popularity of composite tests designed to produce both an overall test score and subscores. There are several strategies for processing such test data, including the use of raw test scores and the application of hierarchical models. However, in most cases raw test scores cannot be used due to their low reliability [13], and hierarchical models require extraordinary caution because of their complex mathematical nature and interpretation.

In this paper, we describe a strategy for modeling subscores as components of the overall composite test score. This strategy is based upon repeated recalibration of the same data using unidimensional and multidimensional models. We demonstrate that the average of the ability estimates from the multidimensional models is equal to the ability estimate from the unidimensional model estimated on the same data. Interestingly, this statement holds regardless of whether or not the unidimensional model fits the data. However, the application of any statistical model in the social sciences needs to be backed by checking its assumptions and thinking through its theoretical consequences. Therefore, the meaningfulness of the unidimensional model must be argued in terms of both model fit and construct definition. As we demonstrate in our simulation study, the unidimensional model does not always fit the data despite the equivalence of its estimates to the average of the estimates from the multidimensional models. This means that the unidimensional model's adequacy needs to be verified either way if a researcher intends to follow the described approach to modeling composite constructs.

We also provide an example of the described strategy for modeling composite constructs using the PROGRESS-ML basic mathematical literacy test. We demonstrate that the use of IRT models allows us to report the respondent's overall test score and subscores in accordance with the test specification. For this test, the main result of testing is the respondent's overall test score. However, repeated recalibration of the data based on content areas and on the groups of cognitive operations required for solving the items allows us to report subscores on those dimensions as well. These estimates possess greater reliability and simpler interpretation than estimates from other approaches to modeling composite constructs. The essence of these results is the decomposition of the overall test score into the components that make it up.

===References===

1. American Educational Research Association, American Psychological Association, and National Council on Measurement in Education: Standards for Educational and Psychological Testing. American Educational Research Association, Washington, DC (2014).
2. Sinharay, S., Puhan, G., & Haberman, S.J.: An NCME instructional module on subscores. Educational Measurement: Issues and Practice, 30(3), 29–40 (2011).
3. Hattie, J.: Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139–164 (1985).
4. Yung, Y.F., Thissen, D., & McLeod, L.D.: On the relationship between the higher-order factor model and the hierarchical factor model. Psychometrika, 64(2), 113–128 (1999).
5. Gignac, G.E.: Higher-order models versus direct hierarchical models: g as superordinate or breadth factor? Psychology Science, 50(1), 21 (2008).
6. Holzinger, K.J., & Swineford, F.: A study in factor analysis: The stability of a bi-factor solution. Supplementary Educational Monographs (1939).
7. Reise, S.P.: The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696 (2012).
8. Schmid, J., & Leiman, J.M.: The development of hierarchical factor solutions. Psychometrika, 22(1), 53–61 (1957).
9. Rijmen, F.: Formal relations and an empirical comparison among the bi-factor, the testlet, and a second-order multidimensional IRT model. Journal of Educational Measurement, 47(3), 361–372 (2010).
10. Brunner, M., Nagy, G., & Wilhelm, O.: A tutorial on hierarchically structured constructs. Journal of Personality, 80(4), 796–846 (2012).
11. Mansolf, M., & Reise, S.P.: When and why the second-order and bifactor models are distinguishable. Intelligence, 61, 120–129 (2017).
12. Davey, T., & Hirsch, T.M.: Concurrent and consecutive estimates of examinee ability profiles. Paper presented at the Annual Meeting of the Psychometric Society, New Brunswick, NJ (1991).
13. Haberman, S.J., & Sinharay, S.: Reporting of subscores using multidimensional item response theory. Psychometrika, 75(2), 209–227 (2010).
14. Reckase, M.D.: Multidimensional item response theory models. In: Multidimensional Item Response Theory, pp. 79–112. Springer, New York, NY (2009).
15. Adams, R.J., Wilson, M., & Wang, W.C.: The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1–23 (1997).
16. Wang, W., Chen, P., & Cheng, Y.: Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9(1), 116–136 (2004).
17. Wu, M., Tam, H.P., & Jen, T.H.: Multidimensional IRT models. In: Educational Measurement for Applied Researchers: Theory into Practice (2016).
18. Scaling PISA Data (Chapter 9). In: PISA 2018 Technical Report. OECD, Paris (2019).
19. Foy, P., & Yin, L.: Scaling the TIMSS 2015 achievement data. In: Martin, M.O., Mullis, I.V., & Hooper, M. (eds.) Methods and Procedures in TIMSS 2015, pp. 13.1–13.62. TIMSS & PIRLS International Study Center, Boston College, Chestnut Hill, MA (2016).
20. Reckase, M.D., Ackerman, T.A., & Carlson, J.E.: Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25(3), 193–203 (1988).
21. Drasgow, F., & Parsons, C.: Application of unidimensional item response theory models to multidimensional data. Applied Psychological Measurement, 7, 189–199 (1983).
22. Ip, E.H.: Empirically indistinguishable multidimensional IRT and locally dependent unidimensional item response models. British Journal of Mathematical and Statistical Psychology, 63(2), 395–416 (2010).
23. Steinberg, L., & Thissen, D.: Uses of item response theory and the testlet concept in the measurement of psychopathology. Psychological Methods, 1, 81–97 (1996).
24. DeMars, C.E.: Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43, 145–168 (2006).
25. Reise, S.P., Cook, K.F., & Moore, T.M.: Evaluating the impact of multidimensionality on unidimensional item response theory model parameters. In: Reise, S.P., & Revicki, D.A. (eds.) Handbook of Item Response Theory Modeling. Routledge, New York (2014).
26. Linacre, J.M.: Structure in Rasch residuals: why principal components analysis? Rasch Measurement Transactions, 12(2), 636 (1998).
27. Smith, E.V.: Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. Journal of Applied Measurement, 3, 205–231 (2002).
28. Linacre, J.M.: What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878 (2002).
29. Robitzsch, A., Kiefer, T., & Wu, M.: TAM: Test Analysis Modules. R package version 3.5-19 (2020).
30. Raîche, G.: Critical eigenvalue sizes in standardized residual principal components analysis. Rasch Measurement Transactions, 19(1), 1012 (2005).
31. Linacre, J.M.: Winsteps® Rasch Measurement Computer Program User's Guide. Winsteps.com, Beaverton, OR (2018).
32. Frumin, I.D., Dobryakova, M.S., Barannikov, K.A., & Remorenko, I.M.: Universal Competencies and New Literacy: What to Teach Today for Success Tomorrow. Preliminary Conclusions of the International Report on Trends in the Transformation of School Education. Modern Analytics of Education, 2(19) (2018). (In Russian)
33. Mullis, I.V., & Martin, M.O.: TIMSS 2019 Assessment Frameworks. International Association for the Evaluation of Educational Achievement, Amsterdam, The Netherlands (2017).
34. Bock, R.D., & Mislevy, R.J.: Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431–444 (1982).
35. Jackson, P.H., & Agunwamba, C.C.: Lower bounds for the reliability of the total score on a test composed of non-homogeneous items: I: Algebraic lower bounds. Psychometrika, 42(4), 567–578 (1977).
36. Cronbach, L.J.: Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334 (1951).
37. Akaike, H.: A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723 (1974).
38. Schwarz, G.: Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464 (1978).