=Paper=
{{Paper
|id=Vol-3051/PA_4
|storemode=property
|title=Understanding Students’ Problem-Solving Processes via Action Sequence Analyses (Short Paper)
|pdfUrl=https://ceur-ws.org/Vol-3051/PA_4.pdf
|volume=Vol-3051
|authors=Ruhan Circi,Manqian Liao,Chad Scott,Juanita Hicks
|dblpUrl=https://dblp.org/rec/conf/edm/CirciLSH21
}}
==Understanding Students’ Problem-Solving Processes via Action Sequence Analyses (Short Paper)==
Understanding Students’ Problem-Solving Processes via Action Sequence Analyses

Ruhan Circi (American Institutes for Research), Manqian Liao (Duolingo), Chad Scott (Deloitte), Juanita Hicks (American Institutes for Research)

Research in this paper was developed and conducted during the 2019 NAEP doctoral internship program administered by AIR and funded by NCES under Contract No. ED-IES-12-D-0002/0004. The views, thoughts, and opinions expressed in the paper belong solely to the authors and do not reflect NCES position or endorsement.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

The transition of the National Assessment of Educational Progress (NAEP) to digitally based assessments (DBAs) allowed for the collection of data that can provide insights into students’ problem-solving processes. When students interact with a NAEP DBA item, the recorded timestamped events in the process data form sequences. We refer to action sequences as the series of clicks or other actions a student makes within an item. Using data from one released block of the NAEP 2017 mathematics assessment for grade 4, this study aims to provide an understanding of the relationships among action sequence characteristics, item characteristics, and student performance.

First, we extract individual action sequences across items. Second, we categorize each individual action into one of four activities: Browsing, Passive investigation, Active investigation, or Decision. This categorization enables us to investigate sequence patterns within and across different items. Sequence characteristics are summarized from two perspectives: a) the time spent on each activity is calculated for each student across items, and b) the within-sequence entropy (Shannon, 1948) and turbulence (Elzinga, 2006) of the sequences are calculated to quantify students’ action mobility.

We found that the time students spend on “Decision” and “Passive investigation” activities can be used to predict student performance.

Keywords: NAEP, process data, digitally based assessments, sequence mining, action sequences

1. BACKGROUND

In 2017 the National Assessment of Educational Progress (NAEP) transitioned from paper-based assessments (PBAs) to digitally based assessments (DBAs). DBAs allow us to capture student interactions with the test screen, which are recorded as timestamped events. These records form data known as process data.

When students interact with a NAEP DBA item, the recorded timestamped events in the process data form sequences. These sequences contain information about the order, mobility, and duration of the tasks students perform throughout the problem-solving process, and they may shed light on the processes underlying students’ test-taking behaviors. In this study, we divide response times into subcategories using action definitions to provide a more meaningful understanding of student test-taking behavior, and we examine differences across item types and student performance.

1.1 Literature

Process data is most commonly used to calculate response time (RT), defined as the time an examinee takes to complete an item or assessment.
Due to the association of RT with psychological and cognitive processes (e.g., Huff & Sireci, 2001), RT is often used to make decisions such as setting assessment time limits (e.g., van der Linden, 2011) and capturing aberrant test-taking behaviors (e.g., Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014). It has also become commonplace to include RT alongside responses in psychometric models to account for speed and accuracy (e.g., Goldhammer, 2015) and to examine the relationship between RT and item- and person-level factors (e.g., Masters, Schnipke, & Connor, 2005). RT is used to examine the psychometric quality of items and students’ test-taking behaviors, and it is considered promising for various assessment elements.

However, RT alone may not provide sufficient information to draw inferences about the processes underlying students’ test-taking behaviors (Lee & Haberman, 2016). In fact, RT could consist of the time for various components of the problem-solving process, such as preparation (e.g., forming a response plan) and writing down/typing the response, and the decomposition of RT can differ depending on item type (e.g., Li, Banerjee, & Zumbo, 2017). Thus, to ensure the validity of inferences drawn from RTs, it is necessary to understand what students actually do throughout the RT. Process data contain richer information, such as the actions students take during their problem-solving processes, yet the allocation of the time students spend on particular activities within a single RT has remained unexplored.

2. CURRENT STUDY

Since the NAEP mathematics assessment consists of items of mixed types (e.g., multiple choice, constructed response), and the decomposition of RT can differ by item type, using the decomposition of RT into different tasks (e.g., investigation, decision) rather than total RT could be helpful when different decisions (e.g., setting assessment time limits, capturing aberrant test-taking behaviors) are to be made based on time. A more fine-grained understanding of the relationships among RTs and students’ problem-solving behaviors can be gained by analyzing students’ action sequences, which can further improve the usefulness of RT in psychometric research (e.g., determining non-response categories such as omit and not reached).

The goals of this study are to: a) identify and describe the action sequences of students in a meaningful way, b) examine mobility across actions, c) differentiate profiles of action sequences, and d) explore students’ performance in connection to sequence clusters.

The steps taken for the current project are as follows. First, individual actions are extracted. Second, students’ response processes are represented as sequences consisting of four tasks, i.e., Browsing, Passive investigation, Active investigation, and Decision (see definitions in Table 1). Since the variation in the time individual actions take can be very large, we use a set cut-off point (2 seconds) as the time unit for defining each action, and we recode the sequence of student actions into these groups for further analyses (see Figure 1 for an example). Then, the characteristics of the sequences are summarized from two perspectives: a) the time spent on each task is calculated for each student, which allows the decomposition of the RT, and b) the within-sequence entropy (Shannon, 1948) and turbulence (Elzinga, 2006) of the sequences are calculated to quantify students’ action mobility.

In sum, by decomposing RT and examining the relationships among sequence characteristics, item characteristics, and student performance, this study aims to inform more meaningful ways of calculating RT (e.g., different ways of calculating RT for different items) and the validity of score categories such as “omit” and “not reached”. For instance, if the sequences of students who were scored as “not reached” were found to contain actions related to making responses (i.e., “Decision” actions), the scores of these students might better be considered “omit” rather than “not reached”.
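To make the recoding step concrete, the sketch below shows one way it could be implemented in R with the TraMineR package (the toolkit used for the sequence analyses in this paper). The event names, timestamps, and the action-to-task lookup table are hypothetical stand-ins rather than the actual NAEP process-data schema; the 2-second discretization follows the cut-off described above.

```r
library(TraMineR)

# Hypothetical event log for one student on one item: each row is a
# timestamped action between "Enter Item" and "Exit Item".
log <- data.frame(
  action = c("Vertical scroll", "Change theme", "Draw scratchwork", "Click choice"),
  start  = c(0, 10, 14, 26),  # seconds since entering the item
  end    = c(10, 14, 26, 30)
)

# Assumed lookup from raw actions to the four task categories of Table 1.
task_of <- c("Vertical scroll"  = "B",   # Browsing
             "Change theme"     = "PI",  # Passive investigation
             "Draw scratchwork" = "AI",  # Active investigation
             "Click choice"     = "D")   # Decision
log$task <- task_of[log$action]

# Expand each action into 2-second time units, so that sequence length is
# proportional to time duration (a 10-second scroll becomes 5 "B" units).
units <- rep(log$task, times = pmax(1, round((log$end - log$start) / 2)))

# One row per student-item visit; seqdef turns it into a state sequence object.
seqs <- seqdef(as.data.frame(t(units)),
               alphabet = c("B", "PI", "AI", "D"))
print(seqs)
```

Each student-item visit processed this way contributes one row of the state sequence object on which the time decomposition and mobility measures can then be computed.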
2.1 Research Questions

Specifically, the following research questions are examined in the current study:

RQ1. What actions do students take, and what are the characteristics of the action sequences (mobility, time distribution) throughout the RTs of the NAEP math items?

RQ2. How do students’ action sequences differ across item types (e.g., multiple-choice item, constructed-response item)?

RQ3. Which action sequence characteristic(s) best predict the item scores?

3. DATA

We used data from one of the released blocks of the NAEP 2017 Grade 4 Mathematics assessment. The released block includes 29,100 4th graders in both public and private schools and consists of 14 cognitive items. The sample was collected using the conventional NAEP sampling procedures, i.e., a two-stage stratified random sampling design with schools selected in the first stage and students in the second stage. In the data cleaning procedure, students with accommodations or interruptions were excluded. Comparisons of the demographic composition of the two samples, the full sample and the analytical sample, are presented in Table 2.

Table 1. Definition and Example Action of Each Behavior Category

| Behavior category | Definition | Example action |
| Browsing | Examinees browse the content of an item by scrolling on the screen | Horizontal scrolling, vertical scrolling |
| Passive investigation | Examinees get support from an assistive tool for their problem-solving process without interacting with the item | Change theme (change the color of the background) |
| Active investigation | Examinees interact with the item as part of their problem-solving process | Draw with scratchwork, highlight |
| Decision | Examinees make responses to an item | Click choice, text entry |

Table 2. Summary Statistics for Full and Analytical Samples: Student Demographic Characteristics

| | Weighted, Analytical | Weighted, Full | Unweighted, Analytical | Unweighted, Full |
| Observations | 649,500 | 780,500 | 24,100 | 29,100 |
| Female (%) | 50 | 49 | 50 | 49 |
| White (%) | 51 | 49 | 52 | 50 |
| Black (%) | 14 | 15 | 17 | 18 |
| Hispanic (%) | 24 | 26 | 20 | 22 |
| Asian (%) | 6 | 5 | 4 | 4 |
| American Indian (%) | 1 | 1 | 2 | 2 |
| Other (%) | 4 | 4 | 5 | 5 |
| NSLP* Eligible (%) | 48 | 50 | 51 | 54 |
| NSLP* Not Eligible (%) | 46 | 44 | 45 | 43 |

* National School Lunch Program; “No Information” categories are not presented.
Note: Because all extended-time accommodation students (who are excluded from the analyses) either have limited English proficiency or are in an individualized education program, results for these variables are not included. Detail may not sum to totals because of rounding.

Small, non-significant differences in the proportions of White students (50.5% in the analytical sample vs. 48.9% in the full sample) and Hispanic students (24.4% vs. 25.9%) are observed. A significant difference is found for the NSLP non-eligible category (46% vs. 43.8%).

In addition to summarizing sequence characteristics in a descriptive manner, this study examines the relationships among the sequence characteristics, item characteristics, and students’ item responses. Specifically, to examine the relationship between sequence characteristics and item characteristics, the RT decomposition and students’ action mobility are compared across items. Furthermore, representative sequence(s) are identified for each item with the use of a sequence dissimilarity measure and a clustering algorithm; the representative sequence(s) can inform the typical response process of an item. Finally, to examine the relationship between sequence characteristics and student performance, sequence characteristics such as the time duration of each task, the within-sequence entropy, and the turbulence are used as features to predict students’ item scores. The results could inform which feature(s) of the sequences best contribute to correct/incorrect item responses or the presence/absence of responses. Moreover, the score distributions are compared across sequence clusters.
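For reference, the within-sequence (longitudinal) entropy used here is, in the standard sequence-analysis formulation (the paper does not spell the formula out), Shannon’s entropy of the state distribution within a single student’s sequence, normalized by its maximum:

\[
h(\pi_1,\ldots,\pi_a) = -\sum_{i=1}^{a} \pi_i \log \pi_i,
\qquad
h_{\mathrm{norm}} = \frac{h}{\log a},
\]

where \(\pi_i\) is the proportion of time units the student spends in task \(i\) and \(a = 4\) is the number of task categories. Hence \(h_{\mathrm{norm}} = 0\) for a student who stays in a single task for the whole visit and \(h_{\mathrm{norm}} = 1\) when time is spread evenly across all four tasks. Turbulence (Elzinga, 2006) additionally reflects the number of distinct subsequences and the variance of state durations, so it can distinguish sequences that entropy alone would treat as identical.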
4. ANALYSIS

To construct sequences and decompose RT from the process data, we followed two steps (see Figure 1 for a demonstration of the procedure): a) recoding the actions into the four task categories (i.e., Browsing, Passive investigation, Active investigation, Decision; see definitions in Table 1), and b) calculating the time duration of each task. Students’ item response processes were thus represented as sequences whose lengths are proportional to the time durations. Since the variation in the time students spend on an item can be large (i.e., ranging from 0.01 second to 30 minutes), using a small time unit (e.g., 0.01 second) could result in extremely long sequences that exceed computer memory capacity. Therefore, we used 2 seconds as the time unit while constructing the sequences. Only actions in a student’s initial item visit (i.e., actions between the first pair of “Enter Item” and “Exit Item” events) were included in the sequence. Students whose initial item visit lasted longer than 8 minutes (480 seconds) were excluded from the analyses to avoid extremely long sequences; for all the items in the block, the percentage of students with an initial item visit longer than 8 minutes was below 1%.

Figure 1. The procedure of turning raw process data into an action sequence.

The mean time spent on each action and the action mobility were summarized as the sequence characteristics. The number of task transitions, the Shannon entropy (Shannon, 1948), and the turbulence (Elzinga, 2006) measures were used to quantify action mobility. To examine how students’ action sequences differ across item types (e.g., multiple-choice item, constructed-response item), the characteristics of the sequences were summarized and compared across items.
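Continuing the earlier sketch (and assuming seqs now holds one row per student), the RT decomposition and the three mobility measures are one-liners in TraMineR; the multiplication by 2 converts 2-second time units back to seconds.

```r
# Per-student RT decomposition: seconds spent in each task category.
time_in_task <- seqistatd(seqs) * 2

# Per-student action-mobility measures.
mobility <- data.frame(
  transitions = seqtransn(seqs),  # number of task transitions
  entropy     = seqient(seqs),    # normalized within-sequence entropy (Shannon, 1948)
  turbulence  = seqST(seqs)       # turbulence (Elzinga, 2006)
)
summary(mobility)
```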
To identify the typical response process for an item, a hierarchical agglomerative clustering algorithm was applied to all the students’ sequences based on the optimal matching edit distance (Levenshtein, 1966) matrix. The medoids of the clusters (i.e., the sequence nearest to the virtual center of each cluster) were treated as the representative sequences, representing the typical response processes for an item. Since, to our knowledge, no study has determined the optimal number of clusters when the clustering is based on an edit distance matrix, Ward’s algorithm was used to form clusters by maximizing within-cluster homogeneity, and we chose the number of clusters by visually inspecting the dendrogram and assessing the interpretability of the clusters. Specifically, for each item, we examined the cluster medoids as the number of clusters ranged from 2 to 4 and chose the number that resulted in clusters interpretable from a practical perspective. All sequence analyses were performed using the TraMineR R package.

To examine the relationship between the sequence characteristics and student performance, the sequence characteristics were used as features to predict the item scores using a regression tree (Breiman, 2017). In addition, the score distributions were compared across the sequence clusters identified with the hierarchical clustering algorithm and the edit distance.
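The clustering step might look as follows under the same assumptions. With an indel cost of 1 and a constant substitution cost, optimal matching reduces to the Levenshtein-style edit distance used here; applying Ward’s criterion to this non-Euclidean dissimilarity matrix is a pragmatic choice, consistent with the caveat above about the lack of established guidance.

```r
# Optimal matching (edit) distance between all pairs of student sequences.
dm <- seqdist(seqs, method = "OM", indel = 1, sm = "CONSTANT")

# Hierarchical agglomerative clustering with Ward's criterion.
hc <- hclust(as.dist(dm), method = "ward.D2")
plot(hc)                        # inspect the dendrogram before fixing k in 2..4
clusters <- cutree(hc, k = 3)   # e.g., the three archetypes retained for Item B

# Medoid of each cluster: the sequence with the smallest summed distance to the
# rest of its cluster, treated as the representative (typical) response process.
medoids <- disscenter(dm, group = clusters, medoids.index = "first")
print(seqs[medoids, ])
```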
5. RESULTS

For the purposes of this paper, we present the results for two selected items, listed in Table 3 (the released items are available through the NAEP Questions Tool: https://nces.ed.gov/NationsReportCard/nqt/Search). The items are of different types (Item A is a multiple-choice item, while Item B is a constructed-response item) and are close in presentation order. The two items were thus chosen to demonstrate the difference in sequence characteristics between items of different types, with minimal confounding of presentation order.

Table 3. Characteristics of the Two Example Items

| Item characteristic | Item A | Item B |
| Item type | Multiple choice | Fill in blank |
| Presentation order | 2 | 4 |
| Item difficulty parameter | -0.17 | 0.29 |
| Item content description | Compare heights of objects in a figure | Divide 3-digit whole number by 1-digit whole number |

5.1 Response Time Decomposition

The average time students spent on each recoded behavior category, i.e., browsing, passive investigation, active investigation, and decision, is shown in Figure 2. For Item A, the “decision” task had the highest average time among the four tasks; for Item B, however, the “passive investigation” task had the highest. On average, students spent 10 seconds browsing Item A by scrolling on the screen, while students hardly spent any time browsing Item B. This difference in browsing time could be associated with the content of the items: Item A is solved by inspecting and comparing the heights of the trees in a figure, which may elicit browsing actions, while Item B is a straightforward computational item that may not require much browsing.

Figure 2. Average time students spent on recoded actions when interacting with the two selected items.

5.2 Sequence Characteristics

Figure 3 presents the state distribution at each time unit for the two selected items; each unit on the x-axis represents 2 seconds. For instance, in the first 2 seconds, students conducting “passive investigation” made up the largest proportion for both items. When responding to Item A, more than 10% of the students were browsing the item in the first 2 seconds; when interacting with Item B, nearly no students browsed the item in this time unit.

Figure 3. State distribution plot of the two selected items.

Table 4 lists the summary statistics of the three mobility measures, i.e., the number of task transitions, the within-sequence entropy, and the turbulence. A task transition refers to switching among the four tasks (i.e., browsing, passive investigation, active investigation, and decision) within a sequence. The average numbers of task transitions for Items A and B are 2.28 and 2.13, respectively. For the within-sequence entropy and turbulence measures, higher values indicate larger mobility; on average, Item A has higher within-sequence entropy and turbulence.

Table 4. Mobility Measures of the Two Selected Items

| Mobility measure | Statistic | Item A | Item B |
| Number of task transitions | Min | 1 | 1 |
| | Median | 2 | 2 |
| | Mean | 2.28 | 2.13 |
| | Max | 4 | 4 |
| Within-sequence entropy | Min | 0 | 0 |
| | Median | 0.38 | 0.29 |
| | Mean | 0.37 | 0.30 |
| | Max | 1 | 0.92 |
| Turbulence | Min | 1 | 1 |
| | Median | 3.18 | 2.76 |
| | Mean | 3.36 | 3.28 |
| | Max | 11.24 | 14.43 |

5.3 Typical Response Process

Figure 4 shows the representative sequences for Item A and Item B. A representative sequence refers to the sequence with the smallest sum of edit distances to the rest of the sequences; it is considered representative of the typical response process for an item. As sequence length is proportional to time duration, the overall time duration of the typical response process is shorter for Item A than for Item B. We observe that, in the typical response process:

• for Item A, the student conducts passive investigation, browses the item, and makes response decisions, sequentially;
• for Item B, the student conducts passive and active investigations and makes response decisions.

Figure 4. Representative sequences of the two selected items.

While identifying a single typical response process for an item is desirable for the purpose of interpretation, a single sequence may not be enough to represent all the sequences: there may be multiple response-process archetypes for an item. Thus, we conducted hierarchical agglomerative clustering based on the edit distance matrix, with each student as a unit in the clustering. After examining the dendrogram and the interpretability of the clusters, we chose to retain three clusters (labeled Type 1, Type 2, and Type 3). The weighted cluster sizes and students’ demographic characteristics by sequence cluster for Item B are presented in Table 5.

Table 5. Student Demographic Characteristics by Sequence Cluster in Item B (weighted percentages)

| | Type 1 | Type 2 | Type 3 |
| Observations | 307,100 | 172,500 | 157,400 |
| Female (%) | 45 | 52 | 57 |
| White (%) | 49 | 53 | 51 |
| Black (%) | 15 | 13 | 13 |
| Hispanic (%) | 24 | 23 | 26 |
| Asian (%) | 6 | 6 | 4 |
| American Indian (%) | 1 | 1 | 1 |
| Other (%) | 4 | 5 | 4 |
| NSLP Eligible (%) | 48 | 45 | 49 |
| NSLP Not Eligible (%) | 45 | 49 | 45 |

Detail may not sum to totals because of rounding.

Figure 5 displays the representative sequences of the three clusters in Item B, which represent three archetypes of response processes for this item. The representative sequences of Type 1 and Type 3 consist only of “passive investigation” and “decision”; the time duration of “passive investigation” is longer for the Type 3 representative sequence than for Type 1. The representative sequence of Type 2 includes “active investigation” in addition to “passive investigation” and “decision”.

Figure 5. Representative sequences of the three sequence clusters found in Item B.
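Displays like Figures 3–5 correspond to TraMineR’s standard plotting helpers; a sketch, reusing the objects assumed above:

```r
# State distribution plot (cf. Figure 3): the proportion of students in each
# task category at every 2-second time unit.
seqdplot(seqs, main = "State distribution")

# Index plot of the cluster medoids (cf. Figures 4 and 5): one horizontal bar
# per representative sequence, with segment length proportional to time.
seqiplot(seqs[medoids, ], main = "Representative sequences")
```

TraMineR also provides seqrep() for a density-based notion of representative sequences; the medoid definition sketched above matches the one used in this paper.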
5.4 Relationship Between the Sequence Characteristics and Student Performance

Figure 6 shows the regression tree learned from the process data of Item B. The time durations of browsing, passive investigation, active investigation, and decision, the number of task transitions, the within-sequence entropy, and the turbulence are used to predict item scores. Item B has five score categories, i.e., incorrect, correct, off task, omitted, and not reached. Each box in Figure 6 is called a “node”, and the five decimals in each box are the predicted proportions of students in the five score categories for that node. The name (and color) of a node is determined by the score category with the highest proportion among the five. For example, the first split is performed on decision time: for students with a decision time longer than 14 seconds (25% of the students in the sample), the predicted proportions of “incorrect” and “correct” scores are 0.70 and 0.28, respectively. Notably, all the splits in this tree are performed on either the decision or the passive investigation time duration.

Figure 6. Regression tree learned from the process data of Item B.

5.5 Relationship Between the Sequence Cluster and Student Performance

Table 6 lists the distribution of scores within each sequence cluster found in Item B (a fill-in-blank item, fourth in the block, with a difficulty level of 0.29). In all three clusters, the proportion of students receiving an “incorrect” score was the highest among the five score categories. The similarity of the score distributions across sequence clusters implies that no clear pattern of performance differences was found among students with different response-process archetypes.

Table 6. Score Distribution of Each Sequence Cluster in Item B

| Cluster | Cluster size | Correct (%) | Incorrect (%) | Omitted (%) | Not reached (%) | Off task (%) |
| Type 1 | 11,500 | 42.2 | 54.0 | 3.4 | 0.2 | 0.2 |
| Type 2 | 6,100 | 42.9 | 53.5 | 3.3 | 0.2 | 0.1 |
| Type 3 | 5,900 | 42.5 | 53.7 | 3.5 | 0.2 | 0.2 |

Note: Percentages in each row add up to 100%.
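The tree in Figure 6 can be grown with the rpart implementation of CART (Breiman, 2017). The sketch below fits such a tree on synthetic data: the feature table and the score-generating rule are fabricated for illustration only (they loosely echo the direction of the paper’s first split), so the fitted tree will not reproduce Figure 6.

```r
library(rpart)
set.seed(1)

# Synthetic per-student features for one item: seconds spent in each task
# category plus the three mobility measures. A two-category score stands in
# for Item B's five categories (incorrect, correct, off task, omitted,
# not reached).
n <- 500
features <- data.frame(
  browsing    = rexp(n, 1 / 5),
  passive     = rexp(n, 1 / 15),
  active      = rexp(n, 1 / 10),
  decision    = rexp(n, 1 / 10),
  transitions = sample(1:4, n, replace = TRUE),
  entropy     = runif(n),
  turbulence  = runif(n, 1, 12)
)
# Longer decision time lowers the chance of a correct response, loosely
# echoing the 14-second split reported for Item B.
p_correct <- plogis(1 - 0.1 * features$decision)
features$score <- factor(ifelse(runif(n) < p_correct, "correct", "incorrect"))

# Classification tree over the sequence features; each node reports the
# predicted class proportions, as in Figure 6.
tree <- rpart(score ~ ., data = features, method = "class")
print(tree)
```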
6. DISCUSSION

6.1 Summary

In summary, this study provided insights into the decomposition of RT by constructing action sequences from students’ process data. In particular, the action sequences contained information on the time duration, order, and mobility of the tasks students executed to solve the NAEP mathematics items. By presenting the sequences of two selected NAEP released items as examples, this paper demonstrated the differences in RT decomposition and typical response processes between items of different types (i.e., a multiple-choice item vs. a fill-in-blank item). This methodology and set of results suggest that examining action sequences and RT decomposition can be a useful way to mine process data and uncover educational processes. Action sequence mining can also be useful for analyzing high-variance data such as process data.

Response-process archetypes were found by applying a hierarchical clustering algorithm to the edit distance matrix of students’ action sequences. As for the relationship between student performance and sequence characteristics, the times students spent on “Decision” and “Passive investigation” were incorporated in the learned regression tree of the example fill-in-blank item, meaning that these components of RT can be used to predict the scores of this item. Further, among the 10,000 students who correctly responded to Item B, 48.6% had their action sequences clustered into Type 1, 26.3% into Type 2, and 25.2% into Type 3, which implies that students who respond to an item correctly may nevertheless follow different response processes.

6.2 Limitations and Future Research

As an initial exploration of action sequences in the NAEP mathematics items, this study has limitations that open up opportunities for future research. First, the actions were categorized into four tasks (browsing, passive investigation, active investigation, and decision), but this is not the only way to categorize them. For instance, in a multiple-choice item, the actions could be recoded based on students’ selected options, yielding sequences that reflect students’ trajectories of answer changes.

Second, the number of clusters was determined only from the dendrogram and the interpretability of the clusters. To better justify the choice of the number of clusters, future studies could develop quantitative measures to determine the optimal number of clusters based on an edit distance matrix.

Finally, this study included only a limited number of sequence characteristics as features for the regression tree. Other features, such as the frequencies of subsequences (e.g., the frequency of a student switching from passive investigation to active investigation and then to decision), together with feature selection algorithms, could be incorporated in future studies.

7. REFERENCES

[1] Breiman, L. (2017). Classification and regression trees. Routledge.
[2] Elzinga, C. H. (2006). Turbulence in categorical time series. Mathematical Population Studies.
[3] Goldhammer, F. (2015). Measuring ability, speed, or both? Challenges, psychometric solutions, and what can be gained from experimental control. Measurement: Interdisciplinary Research and Perspectives, 13(3–4), 133–164.
[4] Huff, K. L., & Sireci, S. G. (2001). Validity issues in computer-based testing. Educational Measurement: Issues and Practice, 20(3), 16–25.
[5] Lee, Y.-H., & Haberman, S. J. (2016). Investigating test-taking behaviors using timing and process data. International Journal of Testing, 16(3), 240–267.
[6] Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.
[7] Li, Z., Banerjee, J., & Zumbo, B. D. (2017). Response time data as validity evidence: Has it lived up to its promise and, if not, what would it take to do so? In B. D. Zumbo & A. M. Hubley (Eds.), Understanding and investigating response processes in validation research (pp. 159–177). Springer.
[8] Marianti, S., Fox, J.-P., Avetisyan, M., Veldkamp, B. P., & Tijmstra, J. (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behavioral Statistics, 39(6), 426–451.
[9] Masters, J., Schnipke, D. L., & Connor, C. (2005). Comparing item response times and difficulty for calculation items. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.
[10] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379–423.
[11] van der Linden, W. J. (2011). Setting time limits on tests. Applied Psychological Measurement, 35(3), 183–199.