<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Semantics of Textbook Highlights to Predict Student Comprehension and Knowledge Retention</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Y.J. Kim</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tyler R. Scott</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Debshila Basu Mallick</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael C. Mozer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Google Research, Brain Team, Mountain View</institution>
          ,
          <addr-line>CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rice University</institution>
          ,
          <addr-line>Houston, TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Colorado</institution>
          ,
          <addr-line>Boulder, CO</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As students read textbooks, they often highlight the material they deem to be most important. We analyze students' highlights to predict their subsequent performance on quiz questions. Past research in this area has encoded highlights in terms of where the highlights appear in the stream of text (a positional representation). In this work, we construct a semantic representation based on a state-of-the-art deep-learning sentence embedding technique (SBERT) that captures the content-based similarity between quiz questions and highlighted (as well as non-highlighted) sentences in the text. We construct regression models that include latent variables for student skill level and question difficulty and augment the models with highlighting features. We find that highlighting features reliably boost model performance. We conduct experiments that validate models on held-out questions, students, and student-questions and find strong generalization for the latter two but not for held-out questions. Surprisingly, highlighting features improve models for questions at all levels of the Bloom taxonomy, from straightforward recall questions to inferential synthesis/evaluation/creation questions.</p>
      </abstract>
      <kwd-group>
        <kwd>deep embeddings</kwd>
        <kwd>natural language processing</kwd>
        <kwd>student modeling</kwd>
        <kwd>textbook annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As digital textbooks become increasingly common, researchers have the
extraordinary opportunity to observe students as they initially engage with unfamiliar
material. Eye gaze has been used as one measure of student behavior [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our
interest is in using another source of information that students often provide as
they read textbooks: highlighting of the material deemed to be most important.
      </p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>From this manner of student engagement, our goal is to infer students'
comprehension and knowledge retention. To the degree that this is possible, early
interventions can be designed to steer students toward a deeper understanding
of the material.</p>
      <p>
        Our team has engaged in several lines of research on this topic. Winchell et
al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] conducted a laboratory experiment with three passages from a biology
text. Participants were asked to read and highlight the material. Following initial
reading, they were given a brief opportunity to review the material along with
any highlights they chose to make and were then tested on factual questions that
spanned all three sections. Winchell et al. found that the pattern of highlights
yields small but reliable improvements in predicting a participant's accuracy on
a specific quiz question. Moving to an authentic learning environment, Waters
et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and Kim et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] modeled a data set of highlights obtained from
students in actual college-level courses using the OpenStax Tutor platform [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Waters et al. found that highlighting the sentence that contains the answer to a
question is predictive of performance on that question. Kim et al. extended these
results to utilize the entire pattern of highlights in a section of the textbook to
predict the overall accuracy of a quiz based on the content of the section.
      </p>
      <p>This past research was limited in two important respects. First, models
predicting quiz performance were based on a positional encoding of highlights. That
is, each section of the text was divided into segments (words, phrases, sentences,
or fixed-length chunks) and a student's highlighting pattern was represented by
a binary vector whose elements indicate whether or not each segment had some
highlighting. (Continuous encodings were also explored in which each vector
element indicated the proportion of words in that segment that had been
highlighted.) Positional encodings contain no explicit information about the content
of material that has been highlighted; they only allow models to discover
regularities such as "if a student highlighted sentence 14 but not sentence 28, their
accuracy on question 2 should increase." Such regularities will of course not
generalize to other sections of text or to other questions from the same section. A
key contribution of the present work is to explore a semantic encoding of the
highlighted and non-highlighted textbook material. The results presented in this
paper show that model accuracy is higher with the semantic encoding than the
positional encoding.</p>
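      <p>The positional encoding described in this paragraph can be illustrated with a minimal sketch. It assumes that segments and highlights are given as character ranges and that the continuous variant measures the highlighted fraction in characters rather than words; the function name positional_encoding and this span representation are hypothetical and are not taken from the earlier work.</p>
      <preformat>
```python
def positional_encoding(segments, highlights, binary=True):
    """Encode highlights by position.

    segments:   list of (start, end) character offsets, one per segment (end exclusive).
    highlights: list of (start, end) character offsets the student highlighted.
    Returns one value per segment: 1/0 if binary, else the highlighted fraction.
    """
    encoding = []
    for seg_start, seg_end in segments:
        covered = set()
        for hl_start, hl_end in highlights:
            lo = max(seg_start, hl_start)
            hi = min(seg_end, hl_end)
            covered.update(range(lo, hi))  # empty when the ranges do not overlap
        fraction = len(covered) / (seg_end - seg_start)
        encoding.append((1 if covered else 0) if binary else fraction)
    return encoding


segments = [(0, 10), (10, 30), (30, 40)]   # three toy segments
highlights = [(5, 15)]                     # one highlight spanning two segments
print(positional_encoding(segments, highlights))                # [1, 1, 0]
print(positional_encoding(segments, highlights, binary=False))  # [0.5, 0.25, 0.0]
```
      </preformat>
      <p>Each element of the resulting vector is tied to a position in one section, which is why a model trained on such vectors cannot transfer to other sections.</p>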
      <p>
        The second limitation of past research concerns the nature of information that
highlights provide. Models based on only the highlighting pattern may succeed
because the highlights provide some general information about how skilled or
motivated a particular student is, not because they determine whether students
have understood the specific material. To address this possibility, our present
work uses a simple latent-variable model, the Rasch model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], as a baseline. The
Rasch model assumes that each student has an ability (perhaps better
characterized as a skill level) and each question has a difficulty. Collaborative filtering
methods can be used to infer these latent parameters, from which predictions
can be made for new students, new questions, and for known students answering
known questions which were not part of the model-training corpus. With the
Rasch model as a baseline, we explore whether highlights offer an orthogonal
source of information to student ability and question difficulty. We were
surprised and pleased to discover that highlights are indeed informative, even when
student ability and question difficulty are known.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <sec id="sec-2-1">
        <title>Data</title>
        <p>
          We obtained data from the OpenStax Tutor platform [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The data were
collected from January 1, 2019, through December 31, 2019 (spanning two
academic semesters) and cover four different subjects: College Biology, College
Physics, Introduction to Sociology, and American History. It is essential to
emphasize that these data were collected in a real-world setting, with no control
over how the OpenStax Tutor platform was administered, and thus, how the data
were collected.
        </p>
        <p>The data set consists of 11,134 students, 897 distinct sections, and 830,320
sessions, where a session consists of a particular student reading a particular
section. We have no further meta-information about the students since the process
was completely anonymous, thus we are unable to report or utilize the
demographic information about the student sample. For the analysis, we used only
the sessions that contain highlights, which is 27,019 of the 830,320 sessions. Each
section is analyzed independently, and we report mean results across sections.
Because the textbooks were electronic, they were revised during the period in
which we obtained data. As a result, some sections have multiple versions. We
collapsed these revisions together since typically only a few words changed from
one version to the next, and it was trivial to align the highlighted fragments.</p>
        <p>Fig. 1: Sketch of our highlight-based model of student performance. On the left
side of the figure is a highlighted passage of text and a specific quiz question.
Each of the highlighted and non-highlighted sentences are fed one-at-a-time into
SBERT to produce an embedding which is compared with the embedding of the
question to determine a match score. The match scores are summarized and fed
into a regression model to predict a student's correctness on the given question.
Not pictured are latent student-ability and question-difficulty parameters.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Model Design</title>
        <p>
          To capture the semantics of text, we used a pre-trained neural network model:
BERT [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. BERT is a transformer [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] that has produced state-of-the-art results
in various natural-language processing tasks. We specifically use Sentence-BERT
(SBERT): a modification of BERT that uses a Siamese network structure to
derive sentence-level embeddings that can be compared using cosine similarity
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. As shown in Figure 1, we predict a student's likelihood of answering a given
quiz question correctly by comparing the SBERT embeddings of both highlighted
and non-highlighted sentences to the embedding of the question.
        </p>
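        <p>The comparison step can be sketched in a few lines. The example below is only an illustration: it uses toy three-dimensional vectors in place of SBERT embeddings (an assumption, since real embeddings have hundreds of dimensions and come from the trained network), and the names cosine_similarity, question, and sentences are hypothetical.</p>
        <preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for SBERT embeddings (assumption, not real model output).
question = [1.0, 0.0, 1.0]
sentences = {
    "s1": [2.0, 0.0, 2.0],    # same direction as the question
    "s2": [0.0, 1.0, 0.0],    # orthogonal to the question
    "s3": [-1.0, 0.0, -1.0],  # opposite direction
}

# Rank sentences from most to least similar to the question.
ranked = sorted(
    sentences,
    key=lambda name: cosine_similarity(sentences[name], question),
    reverse=True,
)
print(ranked)  # ['s1', 's2', 's3']
```
        </preformat>
        <p>Cosine similarity depends only on the direction of the vectors, so s1 matches the question perfectly even though it is twice as long.</p>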
        <p>
          In Figure 2, we illustrate the effectiveness of this framework in identifying
semantic similarities between sentences from the textbook and quiz questions.
The figure shows a sample question from a biology section entitled "The Science
of Biology" along with the correct answer to the question. Following the question
and answer are the five sentences from the section deemed to be most similar
to the question by SBERT. The cosine-similarity score between each sentence
and the question is shown in parentheses. In this example, the question is about
the definition of peer review. The most related sentence identified by SBERT
is a paraphrased definition. The other sentences with high similarity scores are
either related to peer review or contain the phrase within the sentence.
        </p>
        <p>
          Representing the semantic similarity between highlights and quiz
questions. Here we address several methodological decisions needed to fully specify
a predictive model with semantic features. First, we have decided to partition
the textbook into sentences [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and group the sentences in a section into those
that have one or more characters highlighted and those that contain no
highlights. For each sentence, s, of the section, we obtain an SBERT match score (i.e.,
cosine similarity) to question q; we denote this match score B(s, q). Since this
similarity score lies in the range [−1, 1], for mathematical convenience
and interpretability of model parameters, we rescale this score to the range [0, 2]
by adding 1. We thus obtain a set of match scores for highlighted content and a
set of match scores for non-highlighted content.
        </p>
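        <p>A minimal sketch of this step follows, assuming sentence embeddings are already available as plain lists of floats and that each sentence comes with a count of its highlighted characters. The helper names match_score and split_match_scores, and the toy two-dimensional vectors, are hypothetical.</p>
        <preformat>
```python
import math


def match_score(sentence_vec, question_vec):
    """Cosine similarity shifted from [-1, 1] to [0, 2], i.e. the rescaled B(s, q)."""
    dot = sum(a * b for a, b in zip(sentence_vec, question_vec))
    norms = math.hypot(*sentence_vec) * math.hypot(*question_vec)
    return dot / norms + 1.0


def split_match_scores(embeddings, highlighted_chars, question_vec):
    """Group match scores by whether the sentence has any highlighted character.

    embeddings:        list of sentence vectors, in textbook order.
    highlighted_chars: number of highlighted characters in each sentence.
    """
    highlighted, non_highlighted = [], []
    for vec, n_chars in zip(embeddings, highlighted_chars):
        target = highlighted if n_chars > 0 else non_highlighted
        target.append(match_score(vec, question_vec))
    return highlighted, non_highlighted


question = [3.0, 4.0]
embeddings = [[3.0, 4.0], [4.0, -3.0], [-3.0, -4.0]]  # toy vectors (assumption)
highlighted_chars = [12, 0, 0]                        # only the first sentence is highlighted
print(split_match_scores(embeddings, highlighted_chars, question))
# ([2.0], [1.0, 0.0])
```
        </preformat>
        <p>A sentence counts as highlighted as soon as one character is marked, which mirrors the grouping rule stated above.</p>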
        <p>Because the number of sentences (and match scores) in each set varies from
student-to-student and section-to-section, we need to recast the two sets of scores
into a fixed-length vector. A simple approach is to compute the max of the
highlighted and non-highlighted sets, resulting in a two-element vector. The
maximum score would reflect whether or not the student highlighted the most
relevant sentences for a given question. However, the feature is biased in cases
where a student highlights excessively. One could instead use the mean score,
which would combat over-highlighting, but it is not clear that highlighting
material unrelated to the question should make it less likely the student can answer
the question. Rather than choosing either the mean or the maximum, we
devised a scheme that interpolates between them, and chose a fixed-length vector
containing statistics that span the entire range.</p>
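        <p>If <inline-formula><tex-math>x</tex-math></inline-formula> is a vector of <inline-formula><tex-math>n</tex-math></inline-formula> match scores, and <inline-formula><tex-math>\|x\|_p</tex-math></inline-formula> denotes the <inline-formula><tex-math>L_p</tex-math></inline-formula> norm, then <inline-formula><tex-math>n^{-1}\|x\|_1</tex-math></inline-formula>
is the arithmetic mean and <inline-formula><tex-math>\|x\|_\infty</tex-math></inline-formula> is the maximum. We can define a continuum
of norms based on the following relationship:
          <disp-formula><tex-math>\|x\|_r \le n^{\frac{1}{r} - \frac{1}{p}} \, \|x\|_p .</tex-math></disp-formula>
If we apply this inequality with <inline-formula><tex-math>p = r + 1</tex-math></inline-formula> for all <inline-formula><tex-math>r = 1, 2, \ldots</tex-math></inline-formula>, we obtain the
following relation:
          <disp-formula><tex-math>n^{-1}\|x\|_1 \le n^{-\frac{1}{2}}\|x\|_2 \le n^{-\frac{1}{3}}\|x\|_3 \le \cdots \le \|x\|_\infty .</tex-math></disp-formula>
For a given <inline-formula><tex-math>p</tex-math></inline-formula>, we obtain the following definition of a highlight match score or
HMS:
          <disp-formula><tex-math>\mathrm{HMS}_{p,q,i} = \left[ \frac{1}{n_h} \sum_{s \in S_i^h} B(s,q)^p \right]^{1/p},</tex-math></disp-formula>
where <inline-formula><tex-math>S_i^h</tex-math></inline-formula> is the set of <inline-formula><tex-math>n_h</tex-math></inline-formula> sentences that contain one or more highlights from
student <inline-formula><tex-math>i</tex-math></inline-formula>. Because well-matching, non-highlighted sentences might provide
additional information, we also construct a score for all the non-highlighted
sentences, which we refer to as the non-highlighted match score or NHMS:
          <disp-formula><tex-math>\mathrm{NHMS}_{p,q,i} = \left[ \frac{1}{n_{nh}} \sum_{s \in S_i^{nh}} B(s,q)^p \right]^{1/p},</tex-math></disp-formula>
where <inline-formula><tex-math>S_i^{nh}</tex-math></inline-formula> is the set of <inline-formula><tex-math>n_{nh}</tex-math></inline-formula> non-highlighted sentences from student <inline-formula><tex-math>i</tex-math></inline-formula>.</p>
        <p>[Figure 3: plot whose vertical axis is labeled "Expected Simulated HMS" and spans 1.0 to 1.8.]</p>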
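        <p>The two scores share one formula, a generalized (power) mean of the rescaled match scores, so a single helper covers both. The sketch below is an illustration under assumptions: the function name power_mean is hypothetical, and factoring out the maximum before raising to the power p is a numerical-stability choice made here, not something stated in the text.</p>
        <preformat>
```python
def power_mean(scores, p):
    """Generalized mean of non-negative scores: ((1/n) * sum(x**p)) ** (1/p)."""
    if not scores:
        raise ValueError("need at least one match score")
    top = max(scores)
    if top == 0:
        return 0.0
    # Factor out the maximum so that large p does not overflow.
    scaled = sum((x / top) ** p for x in scores) / len(scores)
    return top * scaled ** (1.0 / p)


highlighted = [2.0, 1.0, 0.5]  # toy rescaled match scores in [0, 2]
for p in (1, 5, 10, 100):
    # p = 1 gives the arithmetic mean; larger p moves toward the maximum (2.0).
    print(p, round(power_mean(highlighted, p), 4))
```
        </preformat>
        <p>Calling the same helper on the non-highlighted scores gives the NHMS for the chosen value of p.</p>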
        <p>As mentioned above, instead of selecting a single value of p to compute HMS
and NHMS, we use multiple values. To assist with selecting the values of p, we
ran a simulation in which we randomly sampled vectors of match scores, where
each match score was drawn from a uniform distribution, U(0, 2). We then
computed the expected HMS for various values of p ∈ [1, 125]. The results of
the simulation are shown in Figure 3. As expected, p = 1 is exactly the mean
and p → ∞ approaches the maximum. To approximately span the range, we
manually selected {1, 5, 10, 100} as the values of p for computing both HMS and
NHMS.</p>
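        <p>A simulation of this kind can be sketched as follows. The vector length (n_scores), the number of trials, and the random seed are assumptions chosen for illustration; the text does not report the settings that were actually used, and the helper names are hypothetical.</p>
        <preformat>
```python
import random


def power_mean(scores, p):
    """Generalized mean ((1/n) * sum(x**p)) ** (1/p), computed stably."""
    top = max(scores)
    if top == 0:
        return 0.0
    return top * (sum((x / top) ** p for x in scores) / len(scores)) ** (1.0 / p)


def expected_hms(p, n_scores=20, n_trials=2000, seed=0):
    """Monte Carlo estimate of the expected HMS for scores drawn from U(0, 2)."""
    rng = random.Random(seed)  # same seed for every p, so the curves are comparable
    total = 0.0
    for _ in range(n_trials):
        scores = [rng.uniform(0.0, 2.0) for _ in range(n_scores)]
        total += power_mean(scores, p)
    return total / n_trials


for p in (1, 5, 10, 100):
    print(p, round(expected_hms(p), 3))
```
        </preformat>
        <p>Because the same random vectors are reused for every p, the estimates rise monotonically from roughly the distribution mean of 1 toward the expected maximum.</p>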
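        <p>Combining the match scores for highlighted and non-highlighted sentences
over various values of p, we obtain a parameterized linear model for the overall
match:
          <disp-formula><tex-math>\mathrm{OverallMatch}_{i,q} = \sum_j \beta_{q,j} \, \mathrm{HMS}_{p_j,q,i} + \sum_j \gamma_{q,j} \, \mathrm{NHMS}_{p_j,q,i},</tex-math></disp-formula>
where <inline-formula><tex-math>j</tex-math></inline-formula> is an index over a set of norm values <inline-formula><tex-math>p \in \{1, 5, 10, 100\}</tex-math></inline-formula>, and <inline-formula><tex-math>\beta_{q,j}</tex-math></inline-formula> and
<inline-formula><tex-math>\gamma_{q,j}</tex-math></inline-formula> are free parameters fit to data.</p>
        <p>
          Prediction model. Our prediction model is an extension of the Rasch model
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a specific instantiation of the classic item-response theory model for students.
To formalize the Rasch model, let <inline-formula><tex-math>y_{i,q} = 1</tex-math></inline-formula> if the response from student <inline-formula><tex-math>i</tex-math></inline-formula> to
question <inline-formula><tex-math>q</tex-math></inline-formula> is correct. Model predictions are computed as follows:
          <disp-formula><tex-math>P(y_{i,q} = 1) = \mathrm{logistic}(\alpha_i - \delta_q),</tex-math></disp-formula>
where <inline-formula><tex-math>\alpha_i</tex-math></inline-formula> denotes the latent ability of student <inline-formula><tex-math>i</tex-math></inline-formula> and <inline-formula><tex-math>\delta_q</tex-math></inline-formula> denotes the latent difficulty
of question <inline-formula><tex-math>q</tex-math></inline-formula>. We refer to the standard Rasch model as a+d since it uses latent
parameters for both student ability (a) and question difficulty (d). Our model
extends the Rasch model with highlighting features (h), hereafter a+d+h:
          <disp-formula><tex-math>P(y_{i,q} = 1) = \mathrm{logistic}(\alpha_i - \delta_q + \mathrm{OverallMatch}_{i,q}),</tex-math></disp-formula>
where <inline-formula><tex-math>\alpha, \delta, \beta, \gamma \sim \mathcal{N}(0, 2.5)</tex-math></inline-formula>. All of the models were fit using Stan [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
We sample four Markov chain Monte Carlo (MCMC) chains each with 4000
samples, and from each chain we remove the first half of samples as burn-in.
The remaining samples are then averaged together across the four chains to
obtain the estimated parameters, which are then used to compute predictions.
We chose hierarchical Bayesian models over a simple maximum likelihood fit to
the parameters in order to support principled prediction for new students and
to new questions.
        </p>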
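        <p>How a prediction is formed from fitted parameters can be sketched as below. This is an illustration under assumptions: it takes the Rasch prediction to be a logistic function of ability minus difficulty, and it assumes the highlighting term enters additively inside the same logistic function. The function names and all numeric values are hypothetical and are not fitted estimates.</p>
        <preformat>
```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def overall_match(hms, nhms, hms_weights, nhms_weights):
    """Weighted sum of highlighted and non-highlighted match scores.

    hms, nhms:                 one score per norm value p (e.g. 1, 5, 10, 100).
    hms_weights, nhms_weights: per-question weights, one per norm value.
    """
    highlighted_part = sum(w * s for w, s in zip(hms_weights, hms))
    non_highlighted_part = sum(w * s for w, s in zip(nhms_weights, nhms))
    return highlighted_part + non_highlighted_part


def predict_correct(ability, difficulty, match=0.0):
    """P(correct): match=0 gives the a+d (Rasch) model, otherwise a+d+h."""
    return logistic(ability - difficulty + match)


hms = [1.2, 1.5, 1.6, 1.8]            # toy scores for p = 1, 5, 10, 100
nhms = [0.9, 1.3, 1.5, 1.7]
hms_weights = [0.4, 0.0, 0.0, 0.1]    # illustrative weights, not fitted values
nhms_weights = [-0.2, 0.0, 0.0, 0.0]
m = overall_match(hms, nhms, hms_weights, nhms_weights)

print(predict_correct(0.3, 0.3))      # 0.5: ability equals difficulty
print(predict_correct(0.3, 0.3, m))   # above 0.5 because the match term is positive
```
        </preformat>
        <p>In a full analysis the ability, difficulty, and weight values would be posterior means obtained after discarding burn-in samples, rather than hand-picked numbers.</p>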
        <p>
          We use two performance measures to evaluate models: area under the
receiver-operating-characteristic curve (AUC) and the area under the precision-recall
curve (PRC). We choose to report PRC in addition to AUC due to an imbalance
between correct and incorrect responses to questions in the data. AUC
measures a trade-off between sensitivity (or recall) and specificity, neither of which
depends on the base rates for each class (i.e., the number of questions correctly
answered versus incorrectly answered). PRC, in contrast, uses precision
instead of specificity, and precision is sensitive to the base rate of the positive class. In
settings where there are many fewer instances of the positive class, PRC assigns
more credit to models that successfully classify positive instances (i.e., true
positives) [
          <xref ref-type="bibr" rid="ref11 ref3">3, 11</xref>
          ]. We found that our results are consistent with respect to AUC and
PRC, but report both for completeness.
        </p>
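        <p>Both measures can be computed directly from predicted probabilities and binary correctness labels. The sketch below is a simplified illustration: it computes AUC as a pairwise ranking probability and approximates the area under the precision-recall curve by average precision, which is one common estimator and an assumption here, since the exact estimator is not stated in the text. The function names and data are hypothetical.</p>
        <preformat>
```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each new recall level
    return total / n_pos


labels = [1, 1, 0, 1, 0]             # 1 = answered correctly
scores = [0.9, 0.8, 0.7, 0.6, 0.2]   # toy predicted probabilities
print(round(roc_auc(labels, scores), 4))            # 0.8333
print(round(average_precision(labels, scores), 4))  # 0.9167
```
        </preformat>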
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>Performance within cross-validation settings.
We conduct three cross-validation analyses: (1) held-out student-questions where
the validation set is a random selection of {student, question} pairs, (2) held-out
students where the validation set contains all questions from a random selection
of students, and (3) held-out questions where the validation set contains all
students from a random selection of questions. In all three cases, we perform
five-fold cross validation within each section. The five performance values within
each section are averaged, resulting in a single performance metric per section.
We then report the mean and standard error across sections.</p>
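      <p>The three settings differ only in the unit that is held out. A minimal sketch, assuming responses are stored as (student, question) tuples for one section; the function names k_fold_groups and held_out_splits and the toy data are hypothetical.</p>
      <preformat>
```python
import random


def k_fold_groups(items, k=5, seed=0):
    """Shuffle items and deal them into k folds of near-equal size."""
    items = sorted(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]


def held_out_splits(responses, by, k=5, seed=0):
    """Yield (train, validation) lists of (student, question) responses.

    by: "pair" holds out random pairs, "student" holds out every response
        of the sampled students, "question" does the same for questions.
    """
    key = {"pair": lambda r: r, "student": lambda r: r[0], "question": lambda r: r[1]}[by]
    for fold in k_fold_groups({key(r) for r in responses}, k, seed):
        held = set(fold)
        train = [r for r in responses if key(r) not in held]
        valid = [r for r in responses if key(r) in held]
        yield train, valid


# Toy section: 10 students who each answered 4 questions.
responses = [(student, question) for student in range(10) for question in range(4)]
for train, valid in held_out_splits(responses, by="student"):
    print(len(train), len(valid))  # 32 8 on every fold
```
      </preformat>
      <p>With by="student" no validation student appears in training, so the ability parameter for those students is informed only by its prior; by="question" has the same effect on the difficulty parameter.</p>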
      <p>Held-out student-questions. In this analysis, the training set typically
provides some information about each student and some information about each
question. However, it excludes some particular students answering some
particular questions. As shown in Figure 4, the three models with highlighting features
outperform the corresponding models without highlighting, and the a+d+h
model with all features performs the best. Thus, the highlighting features
provide information that is distinguishable from ability and difficulty. We observe that a
alone provides the least amount of information, but this is expected since the
portion of the training set that constrains each student's ability is far smaller
than the portion of the training set that constrains each question's difficulty.
Because performance of a+h roughly matches performance of d, one might
suppose that there is redundancy between the two sets of features; however, the
superiority of a+d+h over all other models rules out this possibility.</p>
      <p>Held-out students. Our next analysis performs cross-validation on students,
removing a portion of students from the training set each fold and using them
to evaluate the model. This procedure removes any explanatory power of the
student ability parameter since at test only the prior distribution is available.
As expected (Figure 5), a alone can do no better than chance, yielding an AUC
of 0.5, and the models that include ability (purple bars) perform no better than
the corresponding models that exclude ability (blue bars). Just as with held-out
student-questions, the d+h model outperforms d alone. It is thus reasonable to
conclude that the highlighting features provide additional information that can
be distinguished from question difficulty.</p>
      <p>
        Fig. 4: Results for held-out student-questions with ability, difficulty, and both
ability and difficulty features. The darker-colored bars indicate the use of
highlighting features in addition to the features listed along the abscissa. Each bar
indicates the mean AUC (left) and PRC (right) across sections; error bars reflect
±1 standard error of the mean, corrected to remove variance due to the random
factor [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Held-out questions. We performed cross-validation on questions, removing a
portion of questions from the training set for each fold and using them to
evaluate the model. This procedure removes any explanatory power of the
question-difficulty parameter because at test only the prior distribution is available. As
expected (Figure 6), d alone can do no better than chance, yielding an AUC
around 0.5, and the models that include difficulty (purple bars) perform no
better than the corresponding models that exclude difficulty (red bars). The a-alone
model offers some degree of discrimination; however, none of the models reliably
improves when highlighting features are incorporated. This finding is consistent
with the laboratory study of Winchell et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] where it was found that with
held-out questions, highlighting features did not boost model performance
relative to the baseline model (and in fact did somewhat worse due to overfitting).
A possible reason for the failure to generalize to new questions is that we train
models for each section separately, and each section has relatively few questions.
As a result, the model may overfit to the set of questions in the section's training
set. We speculate that better generalization to new questions might be obtained
if a single model were trained for all sections rather than using section-specific
models. Ongoing simulations are addressing this issue.
      </p>
      <p>While it is disappointing that the current models do not generalize to new
questions in a section, this finding does not seriously impact the potential to
leverage highlights. When textbooks are designed, the author knows at that
point what knowledge should be acquired and correspondingly, what questions
should be asked of students. It would be of far greater concern if models did
not generalize to new students; fortunately, our models do this well (Figure 5).</p>
      <p>[Figure 5: panels "Held-out student (AUC)" and "Held-out student (PRC)"; legend: w/o highlight, w/ highlight.]</p>
      <p>Fig. 6: Results for held-out question models. The plots have identical layout as
those in Figure 4. See the caption of Figure 4 for details.</p>
      <sec id="sec-3-3">
        <title>Performance across levels of conceptual difficulty</title>
        <p>
In addition to exploring various cross-validation settings, we investigated the
performance of both the a+d and a+d+h models across varying levels of
conceptual difficulty distinguished by the six levels of the Bloom taxonomy [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The
taxonomy reflects a continuum from concrete factual questions to abstract
reasoning questions; the Bloom levels are: (1) recall, (2) understand, (3) apply, (4)
synthesize, (5) evaluate, and (6) create. Waters et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] found that highlights
had predictive value only for recall (i.e., Bloom level 1) questions. However, their
predictions were based on identifying whether or not a specific critical sentence
in the text was highlighted; the information required for questions at higher
levels of the Bloom taxonomy is likely to be more diffuse in the text. Thus, the
previously used positional encoding of highlights may not have been sufficiently
powerful to capture subtle information that the highlights provide.
        </p>
        <p>
          Because OpenStax Tutor had fewer questions at the higher Bloom levels,
we clustered Bloom levels. Figure 7 compares a+d models (faint purple) to
a+d+h models (dark purple) for three clusters: Bloom level 1, {2,3}, and
{4,5,6}. Adding highlighting features improves model performance across all
clusters of the Bloom taxonomy. Interestingly, the middle cluster (understand
and apply questions) obtains the biggest boost from highlighting features. A
possible explanation is that recall questions are so straightforward that they do
not depend on the complex pattern and semantics of highlights; consequently,
the highlighting representation may provide less value. For the third cluster
(synthesize, evaluate, and create questions), which requires holistic
comprehension, our semantic highlighting representation should also be valuable. The
predictive power of our models tends to drop for higher levels of the Bloom
taxonomy, which we expected: at higher levels, the complexity of
the questions implies that many more factors can come into play in determining
student correctness.
        </p>
        <p>
          Fig. 7: Held-out student-question results for a+d (lighter-colored bars) and
a+d+h (darker-colored bars) across increasing levels of conceptual difficulty,
along the abscissa, determined by the Bloom taxonomy. Each bar indicates the
mean AUC (left) and PRC (right) across sections; error bars reflect ±1
standard error of the mean, corrected to remove variance due to the random factor [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          Comparing positional and semantic representations of highlights.
In previous work [
          <xref ref-type="bibr" rid="ref15 ref5">5, 15</xref>
          ], we used a positional encoding of highlights. Essentially,
we constructed a vector whose elements indicate whether a particular segment of
text has been highlighted. We found that feeding this high-dimensional vector
directly into regression models produced overfitting due to the large number of
free parameters. As an alternative, we performed principal components analysis
(PCA) on the highlighting representation and chose the top k principal
components for the highlighting representation. We, in fact, discovered that k = 1
worked best generically across sections of text. The previous work is not directly
comparable to the present work because it used smaller data sets and Kim et
al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] evaluated on overall quiz accuracy not individual question accuracy.
        </p>
        <p>We compared the positional highlighting encoding with the encoding
developed in this paper and evaluated on individual questions using the current, large
data set. As Figure 8 shows, both highlighting representations improve model
performance over the baseline a+d model, but augmenting the baseline model
with the semantic encoding is superior to augmenting with the positional
encoding. We have yet to explore the obvious question of whether augmenting the
baseline model with both feature sets would further improve model performance.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Model Comparison (AUC)</title>
        <p>
          Fig. 8: Comparison of three predictive models (A + D, A + D + Hpos, and A + D + Hsem) with latent ability and difficulty
parameters, and optionally using positional or semantic highlighting features,
hpos and hsem, respectively. The two panels, Model Comparison (AUC) and Model Comparison (PRC),
plot the mean AUC and mean PRC across sections. Error bars reflect ±1 SEM, corrected to remove
variance due to the random factor [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Research</title>
      <p>We explored the relationship between student highlighting patterns and
question-answering performance using an encoding of highlights based on deep neural
network embeddings of text and question content. We found that augmenting
a baseline model with this semantic highlighting representation improved
predictions of whether a student would answer a specific question correctly. The
baseline model is conditioned on latent factors representing student skill level
and question difficulty. Our results suggest that highlights provide a source of
information that complements these other factors, which may not be surprising
in retrospect given that the highlight encoding we used is based on how the
particular student interacts with the textbook content that is relevant for the
specific question. What is surprising is how effective the SBERT model is in
producing embeddings that can be used to judge the similarity of highlighted
content to individual questions. We obtained several other key results,
including: (1) our models predict well for new students, but not for new questions;
(2) our models predict well for all levels of the Bloom taxonomy; and (3) our
models that use semantic highlight encodings predict better than models using
positional highlight encodings.</p>
      <p>
        From here, there are several potential paths we intend to investigate. First,
we should more systematically explore several methodological decisions that we
made; in our past work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], these decisions matter. The assumptions we might
question include: whether the correct decomposition of highlights is at the level
of complete sentences and not smaller or larger segments; whether a segment of
text should be considered highlighted if any portion is highlighted, as opposed
to explicitly representing the fraction of the segment highlighted; whether the
summary statistics (i.e., values of p) we selected best capture the distribution of
highlighted and non-highlighted match scores.
      </p>
      <p>Second, we modeled each section apart from each other section. However,
in principle, semantic-highlighting models could apply across multiple sections.
Constructing a multi-section model might improve predictions, particularly for
held-out questions, because the model would be trained on more data, but it
might harm predictions because the weighting of semantic information may vary
across sections.</p>
      <p>Third, the ultimate goal of our work is not just to predict student
performance, but to leverage the predictions to boost student comprehension and
retention. Once our investigation of predictive models is complete, work on realizing the true value
of these models for improving student learning can begin.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgement</title>
      <p>This research is supported by NSF awards DRL-1631428 and DRL-1631556. We
thank Adam Winchell for his help with the initial stage of the research and two
anonymous reviewers for their helpful feedback on earlier drafts of this manuscript.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bloom</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krathwohl</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masia</surname>
            ,
            <given-names>B.B.</given-names>
          </string-name>
          :
          <article-title>Bloom taxonomy of educational objectives</article-title>
          .
          <source>Allyn and Bacon, Pearson Education</source>
          (
          <year>1984</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Carpenter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gelman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoffman</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodrich</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betancourt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brubaker</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riddell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Stan: a probabilistic programming language</article-title>
          .
          <source>Grantee Submission</source>
          <volume>76</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>32</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goadrich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The relationship between precision-recall and ROC curves</article-title>
          .
          <source>In: Proceedings of the 23rd International Conference on Machine Learning</source>
          . pp.
          <fpage>233</fpage>
          –
          <lpage>240</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winchell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waters</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grimaldi</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baraniuk</surname>
            ,
            <given-names>R.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mozer</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>Inferring student comprehension from highlighting patterns in digital textbooks: An exploration of an authentic learning platform</article-title>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>NLTK: The natural language toolkit</article-title>
          .
          <source>In: Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics</source>
          . Philadelphia: Association for Computational Linguistics (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Masson</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loftus</surname>
            ,
            <given-names>G.R.</given-names>
          </string-name>
          :
          <article-title>Using confidence intervals for graphically based data interpretation</article-title>
          .
          <source>Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale</source>
          <volume>57</volume>
          (
          <issue>3</issue>
          ),
          <fpage>203</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mills</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graesser</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Risko</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>D'Mello</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          :
          <article-title>Cognitive coupling during reading</article-title>
          .
          <source>Journal of Experimental Psychology: General</source>
          <volume>146</volume>
          (
          <issue>6</issue>
          ),
          <fpage>872</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rasch</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Probabilistic models for some intelligence and attainment tests</article-title>
          .
          <source>ERIC</source>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          .
          <source>arXiv preprint arXiv:1908.10084</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Saito</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rehmsmeier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets</article-title>
          .
          <source>PLoS ONE</source>
          <volume>10</volume>
          (
          <issue>3</issue>
          ),
          <elocation-id>e0118432</elocation-id>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
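          12.
          <string-name>
            <surname>Stafford</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flatley</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>OpenStax</article-title>
          .
          <source>The Charleston Advisor</source>
          <volume>20</volume>
          (
          <issue>1</issue>
          ),
          <fpage>48</fpage>
          –
          <lpage>51</lpage>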
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention is all you need</article-title>
          .
          <source>arXiv preprint arXiv:1706.03762</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Waters</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grimaldi</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baraniuk</surname>
            ,
            <given-names>R.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mozer</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pashler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Highlighting associated with improved recall performance in digital learning environment (Submitted)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Winchell</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mozer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Highlights as an early predictor of student comprehension and interests</article-title>
          .
          <source>Cognitive Science</source>
          <volume>44</volume>
          (
          <issue>11</issue>
          ),
          <elocation-id>e12901</elocation-id>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>