
Using Semantics of Textbook Highlights to Predict Student Comprehension and Knowledge Retention

David Y.J. Kim

Tyler R. Scott (0, 2)

Debshila Basu Mallick

Michael C. Mozer (0, 2)

0 Google Research, Brain Team, Mountain View, CA, USA
1 Rice University, Houston, TX, USA
2 University of Colorado, Boulder, CO, USA

As students read textbooks, they often highlight the material they deem to be most important. We analyze students' highlights to predict their subsequent performance on quiz questions. Past research in this area has encoded highlights in terms of where the highlights appear in the stream of text, that is, a positional representation. In this work, we construct a semantic representation based on a state-of-the-art deep-learning sentence embedding technique (SBERT) that captures the content-based similarity between quiz questions and highlighted (as well as non-highlighted) sentences in the text. We construct regression models that include latent variables for student skill level and question difficulty and augment the models with highlighting features. We find that highlighting features reliably boost model performance. We conduct experiments that validate models on held-out questions, students, and student-questions and find strong generalization for the latter two but not for held-out questions. Surprisingly, highlighting features improve models for questions at all levels of the Bloom taxonomy, from straightforward recall questions to inferential synthesis/evaluation/creation questions.

Keywords: deep embeddings, natural language processing, student modeling, textbook annotation

1 Introduction

As digital textbooks become increasingly common, researchers have the extraordinary opportunity to observe students as they initially engage with unfamiliar material. Eye gaze has been used as one measure of student behavior [8]. Our interest is in using another source of information that students often provide as they read textbooks: highlighting of the material deemed to be most important.

From this manner of student engagement, our goal is to infer students' comprehension and knowledge retention. To the degree that this is possible, early interventions can be designed to steer students toward a deeper understanding of the material.

Our team has engaged in several lines of research on this topic. Winchell et al. [15] conducted a laboratory experiment with three passages from a biology text. Participants were asked to read and highlight the material. Following initial reading, they were given a brief opportunity to review the material along with any highlights they chose to make and were then tested on factual questions that spanned all three sections. Winchell et al. found that the pattern of highlights yields small but reliable improvements in predicting a participant's accuracy on a specific quiz question. Moving to an authentic learning environment, Waters et al. [14] and Kim et al. [5] modeled a data set of highlights obtained from students in actual college-level courses using the OpenStax Tutor platform [12]. Waters et al. found that highlighting the sentence that contains the answer to a question is predictive of performance on that question. Kim et al. extended these results to utilize the entire pattern of highlights in a section of the textbook to predict the overall accuracy on a quiz based on the content of the section.

This past research was limited in two important respects. First, models predicting quiz performance were based on a positional encoding of highlights. That is, each section of the text was divided into segments (words, phrases, sentences, or fixed-length chunks) and a student's highlighting pattern was represented by a binary vector whose elements indicate whether or not each segment had some highlighting. (Continuous encodings were also explored in which each vector element indicated the proportion of words in that segment that had been highlighted.) Positional encodings contain no explicit information about the content of material that has been highlighted; they only allow models to discover regularities such as "if a student highlighted sentence 14 but not sentence 28, their accuracy on question 2 should increase." Such regularities will of course not generalize to other sections of text or to other questions from the same section. A key contribution of the present work is to explore a semantic encoding of the highlighted and non-highlighted textbook material. The results presented in this paper show that model accuracy is higher with the semantic encoding than the positional encoding.
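To make the positional encodings described above concrete, the following is a minimal sketch. It assumes a section has already been split into segments given as half-open word-index ranges and that a student's highlights are available as a set of highlighted word indices; the function names and input layout are illustrative assumptions, not the data format used in the earlier studies.

```python
def binary_encoding(segments, highlighted_words):
    """Return 1 for each segment containing at least one highlighted word, else 0."""
    return [
        1 if any(w in highlighted_words for w in range(start, end)) else 0
        for start, end in segments
    ]


def proportion_encoding(segments, highlighted_words):
    """Return the fraction of words highlighted in each segment."""
    return [
        sum(w in highlighted_words for w in range(start, end)) / (end - start)
        for start, end in segments
    ]


# Hypothetical section: three segments covering words 0-3, 4-9, and 10-11.
segments = [(0, 4), (4, 10), (10, 12)]
# Hypothetical student: highlighted two words in segment 1 and all of segment 3.
highlighted = {1, 2, 10, 11}

binary = binary_encoding(segments, highlighted)
proportion = proportion_encoding(segments, highlighted)
```

Both vectors are indexed by position only, which is why a model trained on them cannot transfer what it learns to a different section or question.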

The second limitation of past research concerns the nature of information that highlights provide. Models based on only the highlighting pattern may succeed because the highlights provide some general information about how skilled or motivated a particular student is, not because they determine whether students have understood the specific material. To address this possibility, our present work uses a simple latent-variable model, the Rasch model [9], as a baseline. The Rasch model assumes that each student has an ability (perhaps better characterized as a skill level) and each question has a difficulty. Collaborative filtering methods can be used to infer these latent parameters, from which predictions can be made for new students, new questions, and for known students answering known questions which were not part of the model-training corpus. With the Rasch model as a baseline, we explore whether highlights offer an orthogonal source of information to student ability and question difficulty. We were surprised and pleased to discover that highlights are indeed informative, even when student ability and question difficulty are known.

2 Methodology

2.1 Data

We obtained data from the OpenStax Tutor platform [12]. The data were collected from January 1, 2019, through December 31, 2019, spanning two academic semesters, and consist of four different subjects: College Biology, College Physics, Introduction to Sociology, and American History. It is essential to emphasize that these data were collected in a real-world setting, with no control over how the OpenStax Tutor platform was administered, and thus, how the data were collected.

The data set consists of 11,134 students, 897 distinct sections, and 830,320 sessions, where a session consists of a particular student reading a particular section. We have no further meta-information about the students since the process was completely anonymous; thus we are unable to report or utilize demographic information about the student sample. For the analysis, we used only the sessions that contain highlights, which is 27,019 of the 830,320 sessions. Each section is analyzed independently, and we report mean results across sections. Because the textbooks were electronic, they were revised during the period in which we obtained data. As a result, some sections have multiple versions. We collapsed these revisions together since typically only a few words changed from one version to the next, and it was trivial to align the highlighted fragments.

Fig. 1: Sketch of our highlight-based model of student performance. On the left side of the figure is a highlighted passage of text and a specific quiz question. Each of the highlighted and non-highlighted sentences is fed one at a time into SBERT to produce an embedding, which is compared with the embedding of the question to determine a match score. The match scores are summarized and fed into a regression model to predict a student's correctness on the given question. Not pictured are latent student-ability and question-difficulty parameters.

2.2 Model Design

To capture the semantics of text, we used a pre-trained neural network model: BERT [4]. BERT is a transformer [13] that has produced state-of-the-art results in various natural-language processing tasks. We specifically use Sentence-BERT (SBERT): a modification of BERT that uses a Siamese network structure to derive sentence-level embeddings that can be compared using cosine similarity [10]. As shown in Figure 1, we predict a student's likelihood of answering a given quiz question correctly by comparing the SBERT embeddings of both highlighted and non-highlighted sentences to the embedding of the question.

In Figure 2, we illustrate the effectiveness of this framework in identifying semantic similarities between sentences from the textbook and quiz questions. The figure shows a sample question from a biology section entitled "The Science of Biology" along with the correct answer to the question. Following the question and answer are the five sentences from the section deemed to be most similar to the question by SBERT. The cosine-similarity score between each sentence and the question is shown in parentheses. In this example, the question is about the definition of peer review. The most related sentence identified by SBERT is a paraphrased definition. The other sentences with high similarity scores are either related to peer review or contain the phrase within the sentence.

Representing the semantic similarity between highlights and quiz questions. Here we address several methodological decisions needed to fully specify a predictive model with semantic features. First, we have decided to partition the textbook into sentences [6] and group the sentences in a section into those that have one or more characters highlighted and those that contain no highlights. For each sentence, s, of the section, we obtain an SBERT match score (i.e., cosine similarity) to question q; we denote this match score B(s, q). Since this similarity score lies in the range [-1, 1], for mathematical convenience and interpretability of model parameters, we rescale this score to the range [0, 2] by adding 1. We thus obtain a set of match scores for highlighted content and a set of match scores for non-highlighted content.
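The match-score step can be sketched as follows. This is a minimal illustration that assumes sentence and question embeddings have already been computed (for example, by an SBERT model); the three-dimensional vectors below are made-up toy values, and the helper names are hypothetical rather than taken from the authors' code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_score(sentence_emb, question_emb):
    """B(s, q): cosine similarity shifted from [-1, 1] to [0, 2]."""
    return cosine_similarity(sentence_emb, question_emb) + 1.0


def split_match_scores(sentence_embs, is_highlighted, question_emb):
    """Return (highlighted scores, non-highlighted scores) for one student."""
    highlighted, non_highlighted = [], []
    for emb, flag in zip(sentence_embs, is_highlighted):
        target = highlighted if flag else non_highlighted
        target.append(match_score(emb, question_emb))
    return highlighted, non_highlighted


# Toy embeddings (assumed, not real SBERT output).
question = [1.0, 0.0, 0.0]
sentences = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
flags = [True, False, False]  # only the first sentence is highlighted

hl_scores, nhl_scores = split_match_scores(sentences, flags, question)
```

In this toy case the highlighted sentence points in the same direction as the question and receives the maximum score of 2, while the two non-highlighted sentences receive 1 (orthogonal) and 0 (opposite).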

Because the number of sentences, and hence the number of match scores, in each set varies from student to student and section to section, we need to recast the two sets of scores into a fixed-length vector. A simple approach is to compute the max of the highlighted and non-highlighted sets, resulting in a two-element vector. The maximum score would reflect whether or not the student highlighted the most relevant sentences for a given question. However, the feature is biased in cases where a student highlights excessively. One could instead use the mean score, which would combat over-highlighting, but it is not clear that highlighting material unrelated to the question should make it less likely that the student can answer the question. Rather than choosing either the mean or the maximum, we devised a scheme that interpolates between them, and chose a fixed-length vector containing statistics that span the entire range.
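A small numeric illustration of the trade-off between the two summaries may help. The match scores below are invented for the example, and the two "students" are hypothetical: one highlights only the best-matching sentence, the other highlights the whole section.

```python
def mean(scores):
    return sum(scores) / len(scores)


# Hypothetical match scores B(s, q) for the five sentences of a section.
section_scores = [1.9, 1.1, 1.0, 0.9, 0.6]

selective = section_scores[:1]   # highlights only the most relevant sentence
excessive = section_scores[:]    # highlights every sentence

summary = {
    "selective": (max(selective), mean(selective)),
    "excessive": (max(excessive), mean(excessive)),
}
```

The maximum cannot tell the two students apart, because both highlighted the most relevant sentence, whereas the mean penalizes the second student for the additional, less relevant highlights.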

If $x$ is a vector of $n$ match scores, and $\|x\|_p$ denotes the $L_p$ norm, then $n^{-1}\|x\|_1$ is the arithmetic mean and $\|x\|_\infty$ is the maximum. We can define a continuum of norms based on the following relationship:

$$\|x\|_r \le n^{\frac{1}{r} - \frac{1}{p}} \, \|x\|_p .$$

If we apply this inequality with $p = r + 1$ for all $r = 1, 2, \ldots$, we obtain the following relation:

$$n^{-1}\|x\|_1 \le n^{-\frac{1}{2}}\|x\|_2 \le n^{-\frac{1}{3}}\|x\|_3 \le \cdots \le \|x\|_\infty .$$

For a given $p$, we obtain the following definition of a highlight match score, or HMS:

$$\mathrm{HMS}_{p,q,i} = \left[ \frac{1}{n_h} \sum_{s \in S_i^{h}} B(s, q)^p \right]^{1/p},$$

where $S_i^{h}$ is the set of $n_h$ sentences that contain one or more highlights from student $i$. Because well-matching, non-highlighted sentences might provide additional information, we also construct a score for all the non-highlighted sentences, which we refer to as the non-highlighted match score, or NHMS:

$$\mathrm{NHMS}_{p,q,i} = \left[ \frac{1}{n_{nh}} \sum_{s \in S_i^{nh}} B(s, q)^p \right]^{1/p},$$

where $S_i^{nh}$ is the set of $n_{nh}$ non-highlighted sentences from student $i$.

[Figure 3: y-axis "Expected simulated HMS_{p,q}", with ticks from 1.0 to 1.8.]
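The two scores are power means of the match scores, which can be sketched in a few lines. This is a minimal illustration assuming both sets of match scores are non-empty and lie in [0, 2]; how empty sets are handled is not specified in the text, so the sketch simply raises an error. The function names and the example scores are hypothetical.

```python
P_VALUES = (1, 5, 10, 100)  # norm orders used for both HMS and NHMS


def power_mean(scores, p):
    """((1/n) * sum(x**p)) ** (1/p) for a non-empty list of non-negative scores."""
    if not scores:
        raise ValueError("power_mean needs at least one match score")
    return (sum(x ** p for x in scores) / len(scores)) ** (1.0 / p)


def highlight_features(hl_scores, nhl_scores, p_values=P_VALUES):
    """Fixed-length feature vector: HMS for each p, followed by NHMS for each p."""
    hms = [power_mean(hl_scores, p) for p in p_values]
    nhms = [power_mean(nhl_scores, p) for p in p_values]
    return hms + nhms


# Hypothetical match scores for one student and one question.
features = highlight_features([1.8, 1.2], [1.0, 0.4, 0.2])
```

With p = 1 the score equals the arithmetic mean (1.5 for the highlighted set here), and as p grows the score rises toward the largest match score in the set (1.8), so the eight features summarize each set at several points between its mean and its maximum.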

As mentioned above, instead of selecting a single value of p to compute HMS and NHMS, we use multiple values. To assist with selecting the values of p, we ran a simulation in which we randomly sampled vectors of match scores, where each match score was drawn from a uniform distribution, U(0, 2). We then computed the expected HMS for various values of p ∈ [1, 125]. The results of the simulation are shown in Figure 3. As expected, p = 1 is exactly the mean and p → ∞ approaches the maximum. To approximately span the range, we manually selected {1, 5, 10, 100} as the values of p for computing both HMS and NHMS.
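A simulation of this kind can be sketched as below. The vector length (20 scores) and the number of random draws (2000) are assumptions chosen only to make the example run quickly; the text does not state the values used, and the function names are hypothetical.

```python
import random


def power_mean(scores, p):
    """((1/n) * sum(x**p)) ** (1/p) for non-negative scores."""
    return (sum(x ** p for x in scores) / len(scores)) ** (1.0 / p)


def expected_hms(p, n_scores=20, n_trials=2000, seed=0):
    """Monte Carlo estimate of the expected power mean of U(0, 2) match scores."""
    rng = random.Random(seed)  # same seed for every p, so all p see the same vectors
    total = 0.0
    for _ in range(n_trials):
        scores = [rng.uniform(0.0, 2.0) for _ in range(n_scores)]
        total += power_mean(scores, p)
    return total / n_trials


curve = {p: expected_hms(p) for p in (1, 5, 10, 100, 125)}
```

Under these assumptions, the estimate at p = 1 sits near 1.0 (the mean of U(0, 2)) and the curve rises monotonically with p while staying below the upper bound of 2.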

Combining the match scores for highlighted and non-highlighted sentences over various values of p, we obtain a parameterized linear model for the overall match:

$$\mathrm{OverallMatch}_{i,q} = \sum_j \gamma_{q,j} \, \mathrm{HMS}_{p_j,q,i} + \sum_j \phi_{q,j} \, \mathrm{NHMS}_{p_j,q,i},$$

where $j$ is an index over the set of norm values $p \in \{1, 5, 10, 100\}$, and $\gamma_{q,j}$ and $\phi_{q,j}$ are free parameters fit to data.

Prediction model. Our prediction model is an extension of the Rasch model [9], a specific instantiation of the classic item-response theory model for students. To formalize the Rasch model, let $y_{i,q} = 1$ if the response from student $i$ to question $q$ is correct and $y_{i,q} = 0$ otherwise. Model predictions are computed as follows:

$$P(y_{i,q} = 1) = \frac{1}{1 + \exp\left(-(\alpha_i - \delta_q)\right)},$$

where $\alpha_i$ denotes the latent ability of student $i$ and $\delta_q$ denotes the latent difficulty of question $q$. We refer to the standard Rasch model as a+d since it uses latent parameters for both student ability (a) and question difficulty (d). Our model extends the Rasch model with highlighting features (h), hereafter a+d+h:

$$P(y_{i,q} = 1) = \frac{1}{1 + \exp\left(-(\alpha_i - \delta_q + \mathrm{OverallMatch}_{i,q})\right)},$$

where $\alpha, \delta, \gamma, \phi \sim \mathcal{N}(0, 2.5)$. All of the models were fit using Stan [2]. We sample four Markov chain Monte Carlo (MCMC) chains, each with 4000 samples, and from each chain we remove the first half of the samples as burn-in. The remaining samples are then averaged together across the four chains to obtain the estimated parameters, which are then used to compute predictions. We chose hierarchical Bayesian models over a simple maximum likelihood fit to the parameters in order to support principled prediction for new students and new questions.
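Once parameter estimates are available, a prediction for one student and one question reduces to a logistic function of a linear score. The sketch below assumes that the overall match term is simply added to the Rasch logit (ability minus difficulty); that additive form, the function names, and all numeric values are illustrative assumptions rather than a statement of the fitted model.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def rasch_probability(ability, difficulty):
    """a+d baseline: probability of a correct response under the Rasch model."""
    return logistic(ability - difficulty)


def overall_match(hms, nhms, hms_weights, nhms_weights):
    """Weighted sum of the HMS and NHMS features (one weight per value of p)."""
    return (sum(w * x for w, x in zip(hms_weights, hms))
            + sum(w * x for w, x in zip(nhms_weights, nhms)))


def extended_probability(ability, difficulty, hms, nhms, hms_weights, nhms_weights):
    """a+d+h sketch: Rasch logit plus the overall match (assumed additive)."""
    match = overall_match(hms, nhms, hms_weights, nhms_weights)
    return logistic(ability - difficulty + match)


# Hypothetical estimates for one student-question pair, with p in {1, 5, 10, 100}.
hms = [1.5, 1.6, 1.7, 1.8]
nhms = [0.5, 0.7, 0.8, 1.0]
baseline = rasch_probability(0.5, 0.5)
no_effect = extended_probability(0.5, 0.5, hms, nhms, [0.0] * 4, [0.0] * 4)
boosted = extended_probability(0.5, 0.5, hms, nhms, [0.2] * 4, [0.0] * 4)
```

Setting every highlighting weight to zero recovers the baseline, which is one way to see that the extended model nests the Rasch model.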

We use two performance measures to evaluate models: area under the receiver-operating-characteristic curve (AUC) and area under the precision-recall curve (PRC). We choose to report PRC in addition to AUC due to an imbalance between correct and incorrect responses to questions in the data. AUC measures a trade-off between sensitivity (or recall) and specificity, neither of which depends on the base rates for each class (i.e., the number of questions correctly answered versus incorrectly answered). PRC, in contrast, uses precision instead of specificity, and precision is sensitive to the base rate of the positive class. In settings where there are many fewer instances of the positive class, PRC assigns more credit to models that successfully classify positive instances (i.e., true positives) [3, 11]. We found that our results are consistent with respect to AUC and PRC, but report both for completeness.
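Both measures can be computed directly from predicted probabilities and binary outcomes. The sketch below uses the pairwise-ranking form of AUC and uses average precision as a step-wise estimate of the area under the precision-recall curve; choosing average precision is an assumption here, since the text does not say how the PRC area was computed, and the sketch assumes there are no tied scores when ranking for precision.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / sum(labels)


# Hypothetical outcomes (1 = correct) and model probabilities for five responses.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

In this toy example five of the six positive-negative pairs are ranked correctly, so the AUC is 5/6, and the average precision is (1 + 1 + 3/4) / 3.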

3 Results

3.1 Performance within cross-validation settings

We conduct three cross-validation analyses: (1) held-out student-questions, where the validation set is a random selection of {student, question} pairs; (2) held-out students, where the validation set contains all questions from a random selection of students; and (3) held-out questions, where the validation set contains all students from a random selection of questions. In all three cases, we perform five-fold cross-validation within each section. The five performance values within each section are averaged, resulting in a single performance metric per section. We then report the mean and standard error across sections.
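The three schemes differ only in which unit is assigned to a fold. A minimal sketch follows, assuming the responses for one section are stored as (student, question, correct) tuples; the record layout, function names, and toy data are assumptions made for illustration.

```python
import random


def assign_folds(keys, n_folds=5, seed=0):
    """Shuffle the distinct keys and deal them evenly into n_folds folds."""
    ordered = sorted(set(keys))
    random.Random(seed).shuffle(ordered)
    return {key: k % n_folds for k, key in enumerate(ordered)}


def cv_split(responses, mode, fold, n_folds=5, seed=0):
    """Split (student, question, correct) records for one validation fold.

    mode names the held-out unit: "pair", "student", or "question".
    """
    key_fns = {
        "pair": lambda r: (r[0], r[1]),
        "student": lambda r: r[0],
        "question": lambda r: r[1],
    }
    key_of = key_fns[mode]
    fold_of = assign_folds([key_of(r) for r in responses], n_folds, seed)
    train = [r for r in responses if fold_of[key_of(r)] != fold]
    test = [r for r in responses if fold_of[key_of(r)] == fold]
    return train, test


# Hypothetical section: 10 students each answer 5 questions (all marked correct).
responses = [(f"s{i}", f"q{j}", 1) for i in range(10) for j in range(5)]
train_s, test_s = cv_split(responses, "student", fold=0)
train_q, test_q = cv_split(responses, "question", fold=0)
train_p, test_p = cv_split(responses, "pair", fold=0)
```

With held-out students, no test student appears in the training set, so the model has only the prior for that student's ability; the held-out question split has the same property for question difficulty.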

Held-out student-questions. In this analysis, the training set typically provides some information about each student and some information about each question. However, it excludes some particular students answering some particular questions. As shown in Figure 4, the three models with highlighting features outperform the corresponding models without highlighting, and the a+d+h model with all features performs the best. Thus, the highlighting features provide information that is distinguishable from ability and difficulty. We observe that a alone provides the least amount of information, but this is expected since the portion of the training set that constrains each student's ability is far smaller than the portion of the training set that constrains each question's difficulty. Although performance of a+h about matches performance of d, one might suppose that there is redundancy between the two sets of features; however, the superiority of a+d+h over all other models rules out this possibility.

Held-out students. Our next analysis performs cross-validation on students, removing a portion of students from the training set each fold and using them to evaluate the model. This procedure removes any explanatory power of the student ability parameter since at test only the prior distribution is available. As expected (Figure 5), a alone can do no better than chance, yielding an AUC of 0.5, and the models that include ability (purple bars) perform no better than the corresponding models that exclude ability (blue bars). Just as with held-out student-questions, the d+h model outperforms d alone. It is thus reasonable to conclude that the highlighting features provide additional information that can be distinguished from question difficulty.

Fig. 4: Results for held-out student-questions with ability, difficulty, and both ability and difficulty features. The darker-colored bars indicate the use of highlighting features in addition to the features listed along the abscissa. Each bar indicates the mean AUC (left) and PRC (right) across sections; error bars reflect ±1 standard error of the mean, corrected to remove variance due to the random factor [7].

Held-out questions. We performed cross-validation on questions, removing a portion of questions from the training set for each fold and using them to evaluate the model. This procedure removes any explanatory power of the question-difficulty parameter because at test only the prior distribution is available. As expected (Figure 6), d alone can do no better than chance, yielding an AUC around 0.5, and the models that include difficulty (purple bars) perform no better than the corresponding models that exclude difficulty (red bars). The a-alone model offers some degree of discrimination; however, none of the models reliably improve when highlighting features are incorporated. This finding is consistent with the laboratory study of Winchell et al. [15], where it was found that with held-out questions, highlighting features did not boost model performance relative to the baseline model (and in fact did somewhat worse due to overfitting). A possible reason for the failure to generalize to new questions is that we train models for each section separately, and each section has relatively few questions. As a result, the model may overfit to the set of questions in the section's training set. We speculate that better generalization to new questions might be obtained if a single model were trained for all sections rather than using section-specific models. Ongoing simulations are addressing this issue.

[Figure 5 panels: "Held-out student (AUC)" and "Held-out student (PRC)"; legend: w/o highlight, w/ highlight.]

Fig. 6: Results for held-out question models, with panels "Held-out questions (AUC)" and "Held-out questions (PRC)" (legend: w/o highlight, w/ highlight). The plots have the same layout as those in Figure 4. See the caption of Figure 4 for details.

While it is disappointing that the current models do not generalize to new questions in a section, this finding does not seriously impact the potential to leverage highlights. When textbooks are designed, the author knows at that point what knowledge should be acquired and, correspondingly, what questions should be asked of students. It would be of far greater concern if models did not generalize to new students; fortunately, our models do this well (Figure 5).

3.2 Performance across levels of conceptual difficulty

In addition to exploring various cross-validation settings, we investigated the performance of both the a+d and a+d+h models across varying levels of conceptual difficulty distinguished by the six levels of the Bloom taxonomy [1]. The taxonomy reflects a continuum from concrete factual questions to abstract reasoning questions; the Bloom levels are: (1) recall, (2) understand, (3) apply, (4) synthesize, (5) evaluate, and (6) create. Waters et al. [14] found that highlights had predictive value only for recall (i.e., Bloom level 1) questions. However, their predictions were based on identifying whether or not a specific critical sentence in the text was highlighted; the information required for questions at higher levels of the Bloom taxonomy is likely to be more diffuse in the text. Thus, the previously used positional encoding of highlights may not have been sufficiently powerful to capture subtle information that the highlights provide.

Because OpenStax Tutor had fewer questions at the higher Bloom levels, we clustered Bloom levels. Figure 7 compares a+d models (faint purple) to a+d+h models (dark purple) for three clusters: Bloom level 1, {2, 3}, and {4, 5, 6}. Adding highlighting features improves model performance across all clusters of the Bloom taxonomy. Interestingly, the middle cluster (understand and apply questions) obtains the biggest boost from highlighting features. A possible explanation is that recall questions are so straightforward that they do not depend on the complex pattern and semantics of highlights; consequently, the highlighting representation may provide less value. For the third cluster (synthesize, evaluate, and create questions), which requires holistic comprehension, our semantic highlighting representation should also be valuable. The predictive power of our models tends to drop for higher levels of the Bloom taxonomy, which we expected: at higher levels, the complexity of the questions implies that many more factors can come into play in determining student correctness.

Fig. 7: Held-out student-question results for a+d (lighter-colored bars) and a+d+h (darker-colored bars) across increasing levels of conceptual difficulty, along the abscissa, determined by the Bloom taxonomy. Each bar indicates the mean AUC (left) and PRC (right) across sections; error bars reflect ±1 standard error of the mean, corrected to remove variance due to the random factor [7].

3.3 Comparing positional and semantic representations of highlights

In previous work [5, 15], we used a positional encoding of highlights. Essentially, we constructed a vector whose elements indicate whether a particular segment of text has been highlighted. We found that feeding this high-dimensional vector directly into regression models produced overfitting due to the large number of free parameters. As an alternative, we performed principal components analysis (PCA) on the highlighting representation and chose the top k principal components for the highlighting representation. We, in fact, discovered that k = 1 worked best generically across sections of text. The previous work is not directly comparable to the present work because it used smaller data sets and Kim et al. [5] evaluated on overall quiz accuracy, not individual question accuracy.
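Reducing a positional highlighting matrix to its top principal component can be sketched without any numerical library by using power iteration. This is an illustrative sketch only: the preprocessing used in the earlier work is not described here, the function name and the toy highlighting matrix are hypothetical, and power iteration is a swapped-in technique chosen to keep the example self-contained.

```python
import math
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (direction, scores) for the top principal component of `rows`.

    rows: one positional highlighting vector per student.
    The sign of the direction is arbitrary, as with any PCA.
    """
    n, d = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - col_means[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random starting direction
    for _ in range(n_iter):
        # One power-iteration step: v <- X^T (X v), then renormalize.
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance left to explain
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical binary highlights: 4 students by 4 text segments.
highlights = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
direction, pc_scores = first_principal_component(highlights)
```

With k = 1, each student's whole highlighting pattern is replaced by a single number (the entry of `pc_scores`), which here separates students who highlighted the first half of the section from those who highlighted the second half.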

We compared the positional highlighting encoding with the encoding developed in this paper and evaluated on individual questions using the current, large data set. As Figure 8 shows, both highlighting representations improve model performance over the baseline a+d model, but augmenting the baseline model with the semantic encoding is superior to augmenting with the positional encoding. We have yet to explore the obvious question of whether augmenting the baseline model with both feature sets would further improve model performance.

Fig. 8: Comparison of three predictive models with latent ability and difficulty parameters, and optionally using positional or semantic highlighting features, hpos and hsem, respectively. Panels: "Model Comparison (AUC)", showing mean AUC across sections, and "Model Comparison (PRC)", showing mean PRC across sections, each for A + D, A + D + Hpos, and A + D + Hsem. Error bars reflect ±1 SEM, corrected to remove variance due to the random factor [7].

4 Conclusions and Future Research

We explored the relationship between student highlighting patterns and question-answering performance using an encoding of highlights based on deep neural network embeddings of text and question content. We found that augmenting a baseline model with this semantic highlighting representation improved predictions of whether a student would answer a specific question correctly. The baseline model is conditioned on latent factors representing student skill level and question difficulty. Our results suggest that highlights provide a source of information that complements these other factors, which may not be surprising in retrospect given that the highlight encoding we used is based on how the particular student interacts with the textbook content that is relevant for the specific question. What is surprising is how effective the SBERT model is in producing embeddings that can be used to judge the similarity of highlighted content to individual questions. We obtained several other key results, including: (1) our models predict well for new students, but not for new questions; (2) our models predict well for all levels of the Bloom taxonomy; and (3) our models that use semantic highlight encodings predict better than models using positional highlight encodings.

From here, there are several potential paths we intend to investigate. First, we should more systematically explore several methodological decisions that we made; in our past work [5], such decisions mattered. The assumptions we might question include: whether the correct decomposition of highlights is at the level of complete sentences and not smaller or larger segments; whether a segment of text should be considered highlighted if any portion is highlighted, as opposed to explicitly representing the fraction of the segment highlighted; and whether the summary statistics (i.e., values of p) we selected best capture the distribution of highlighted and non-highlighted match scores.

Second, we modeled each section separately from every other section. However, in principle, semantic-highlighting models could apply across multiple sections. Constructing a multi-section model might improve predictions, particularly for held-out questions, because the model would be trained on more data, but it might harm predictions because the weighting of semantic information may vary across sections.

Third, the ultimate goal of our work is not just to predict student performance, but to leverage the predictions to boost student comprehension and retention. Once our investigation of predictive models is complete, we can begin to realize the true value of these models for improving student learning.

5 Acknowledgement

This research is supported by NSF awards DRL-1631428 and DRL-1631556. We thank Adam Winchell for helping with the initial stage of the research and two anonymous reviewers for their helpful feedback on earlier drafts of this manuscript.

References

1. Bloom, B.S., Krathwohl, D.R., Masia, B.B.: Bloom taxonomy of educational objectives. In: Allyn and Bacon. Pearson Education (1984)

2. Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M.A., Guo, J., Li, P., Riddell, A.: Stan: A probabilistic programming language. Grantee Submission 76(1), 1–32 (2017)

3. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning. pp. 233–240 (2006)

4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

5. Kim, D.Y., Winchell, A., Waters, A.E., Grimaldi, P.J., Baraniuk, R.G., Mozer, M.C.: Inferring student comprehension from highlighting patterns in digital textbooks: An exploration of an authentic learning platform (2020)

6. Loper, E., Bird, S.: NLTK: The natural language toolkit. In: Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Philadelphia: Association for Computational Linguistics (2002)

7. Masson, M.E., Loftus, G.R.: Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale 57(3), 203 (2003)

8. Mills, C., Graesser, A., Risko, E.F., D'Mello, S.K.: Cognitive coupling during reading. Journal of Experimental Psychology: General 146(6), 872 (2017)

9. Rasch, G.: Probabilistic models for some intelligence and attainment tests. ERIC (1993)

10. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084 (2019)

11. Saito, T., Rehmsmeier, M.: The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10(3), e0118432 (2015)

12. Stafford, D., Flatley, R.: OpenStax. The Charleston Advisor 20(1), 48–51 (2018)

13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)

14. Waters, A.E., Grimaldi, P.J., Baraniuk, R.G., Mozer, M.C., Pashler, H.: Highlighting associated with improved recall performance in digital learning environment (Submitted)

15. Winchell, A., Lan, A., Mozer, M.: Highlights as an early predictor of student comprehension and interests. Cognitive Science 44(11), e12901 (2020)