<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classification in Math Class: Using Convolutional Neural Networks to Categorize Student Cognitive Demand</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Delaney</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jai Bhatia</string-name>
          <email>jbhatia187@student.fuhsd.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fremont High School</institution>
          ,
          <addr-line>575 W. Fremont Avenue, Sunnyvale, CA 94087</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Stanford University</institution>
          ,
          <addr-line>485 Lasuen Mall, Stanford, CA 94305</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Maintaining cognitively demanding instruction is a primary goal of classroom teachers. Yet students' cognitive demand is difficult to measure and track during the enactment of a rigorous task. This in-progress research addresses this problem space by predicting and modeling students' cognitive demand with computer vision and convolutional neural networks, providing an in-the-moment analysis of cognitive demand during an eighth grade mathematics task enactment. The findings suggest that models which leveraged behavior-based visual proxies for cognitive demand (e.g., gesturing, using a computer) achieved substantially higher accuracy than the baseline model. Taken together, the results of this work build toward a classroom analytic tool for teachers and have implications for the contributions of computer vision in real-world classroom studies.</p>
      </abstract>
      <kwd-group>
<kwd>Cognitive Demand</kwd>
        <kwd>Mathematics Education</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Transfer Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        There has been much interest in applying artificial intelligence analytic tools to classroom
settings in the past decade. Although many educational applications that leverage AI examine
speech data with natural language processing [26, 29], there exists a growing enthusiasm for
computer vision-based research in classrooms to analyze and improve teachers’ instructional
practices [
        <xref ref-type="bibr" rid="ref11 ref3 ref8">2, 7, 10</xref>
        ]. This study explores the extent to which students’ cognitive demand, one
aspect of classroom instruction, can be modeled with computer vision via the analysis of
classroom video recordings in eighth grade mathematics.
      </p>
      <p>
        The maintenance of students’ cognitive demand, the amount of intellectual work required
to create meaning for a mathematical task and solve it [
        <xref ref-type="bibr" rid="ref17">14</xref>
        ], is crucial for teachers to measure
and track because of its direct relationship to learning outcomes [27]. When
students exhibit high cognitive demand, they develop deeper understandings and connect
concepts across the discipline [28]. However, cognitive demand is not a static construct and
can be influenced by a number of instructional factors, including the initial presentation of the
task to students [
        <xref ref-type="bibr" rid="ref17">14</xref>
        ], resources provided to the students while solving the task [
        <xref ref-type="bibr" rid="ref22">19</xref>
        ], and
teacher-student and student-student interactions during enactment [
        <xref ref-type="bibr" rid="ref16 ref18">15</xref>
        ]. Because measuring cognitive
demand in-the-moment is difficult, yet potentially beneficial for teachers, we are curious to
explore the extent to which computer vision may be used to provide cognitive demand
measurements as students solve a mathematical task in small groups. Such data may assist
teachers by providing indicators for which students continue to exhibit high cognitive demand
throughout the task’s enactment, and conversely, which students struggle to uphold high
demand after the task is launched.
      </p>
      <p>Since cognitive demand is not a purely visual construct, our model draws upon five proxy
student behaviors to identify potentially cognitively demanding activity while solving a
mathematics task, then uses the presence of the five behaviors to predict the level of cognitive
demand. Though this approach omits cues from students’ speech, we hypothesized that
modeling cognitively demanding visual behaviors may yield additional contributions toward
predicting overall demand. We therefore ask: to what extent can computer vision model
changes in students’ cognitive demand during mathematical problem solving?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>Modeling cognitive demand with computer vision is a novel task in classroom analytics
research. Our exploration of relevant literature investigates the extent to which other computer
vision-based methods have demonstrated success in tasks with adjacent features. By
incorporating three such features (transfer learning, multiclass binary classification, and the use of
pretrained networks) into the present study, we aim to bring the affordances of computing to
a classroom setting. We discuss each feature in detail below.</p>
      <p>
Transfer Learning. This research leverages transfer learning using ImageNet pre-trained
weights, an approach that is not uncommon for developing novel applications in image
classification. Since its introduction, ImageNet has been established as a reliable, general-purpose
benchmark for transfer learning across a variety of learning tasks, numbers of classes, and amounts
of training data [
        <xref ref-type="bibr" rid="ref20">17</xref>
        ]. Numerous past studies have investigated the relationship between factors
that impact transfer learning and fine-tuning of convolutional neural network (CNN) models,
including the perils of model overfitting [
        <xref ref-type="bibr" rid="ref7">6, 30</xref>
        ] and the layers of ImageNet that should be
optimized for transfer learning [
        <xref ref-type="bibr" rid="ref19">16</xref>
]. We drew upon this research when considering the duration
of hyperparameter tuning and the overall fit of the training data to each binary classification
model, as it suggests that overfitting the training data can undermine the transfer of learned
features to the validation and testing sets.
      </p>
      <p>
        Pre-Trained Networks for Multiclass Classification. Additionally, we relied on MobileNet V2,
a neural network specifically constructed for classification tasks, for binary and categorical
classifications of cognitive demand. MobileNet V2 was developed for “lightweight
classification tasks” [
        <xref ref-type="bibr" rid="ref12">11</xref>
        ] in transfer learning, image classification, and localization. It is
commonly used in object recognition and classification tasks, such as detecting human tissue
abnormalities in medical research [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ]. As our investigation involves the classification of certain
objects in order to detect human behavior (e.g., detecting the presence of a computer in the
“using a computer” proxy behavior class), MobileNet V2 served as a reasonable choice for a
first-pass exploration of the data.
      </p>
      <p>
        One drawback experienced was the amount of labeled training data needed to optimize
transfer learning using ImageNet and MobileNet V2. Past studies and experiments suggest that
a large quantity of labeled training data is required for transfer learning, particularly in tasks
that involve feature localization [
        <xref ref-type="bibr" rid="ref2">1, 31</xref>
        ] and modification of architectures that improve transfer
learning [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ]. Using unlabeled data has been an appealing area to explore in this research space
[
        <xref ref-type="bibr" rid="ref24">20</xref>
        ] and some self-supervised methods have attempted to improve feature generalization in
auxiliary tasks [
        <xref ref-type="bibr" rid="ref1 ref6">5</xref>
        ], although none have outperformed ImageNet’s performance on purely
supervised learning tasks. Weak supervision, which applies noisy labels from non-expert users
[
        <xref ref-type="bibr" rid="ref21">18</xref>
        ], is now seen as a plausible middle-ground for large-scale ImageNet transfer learning tasks.
We utilized weak supervision when applying hand labels to binary classes in the training data,
as one member of the research team was unfamiliar with coding student and teacher behaviors
in mathematics education research.
      </p>
      <sec id="sec-2-1">
        <title>3. Methods</title>
        <p>Our approach to modeling cognitive demand through convolutional neural networks consisted
of three primary steps. First, we constructed the baseline cognitive demand model for
comparison, which predicted demand from still images alone. Next, we devised the
experimental model, which utilized binary classification for students’ cognitive demand proxy
activities (computer use, leaning in, pointing to the task, talking to the teacher, and writing on
the task) to predict cognitive demand. Finally, we compared performance between the two
models.</p>
<p>Both models applied transfer learning from ImageNet weights. Although the MobileNet
V2 network, which relies upon ImageNet weights, contains approximately 2.3 million
parameters, our method restricts training to the bottom Dense bottleneck
(approximately 1,300 trainable parameters). These layers solely focus upon the localized features
of the five binary classes. Figure 1 shows a depiction of our model as well as a schematic of
the trainable network architecture that was applied.</p>
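<p>As a quick arithmetic check on the figures above (a sketch, which assumes the trainable bottleneck is a single dense unit on top of MobileNet V2's 1280-dimensional global-average-pooled feature vector):</p>

```python
# Sketch: parameter count for a trainable Dense bottleneck, assuming a
# single sigmoid unit over MobileNet V2's 1280-dimensional pooled features
# (the frozen base contributes no trainable parameters).
def dense_param_count(in_features: int, out_units: int) -> int:
    # one weight per input-output pair, plus one bias per output unit
    return in_features * out_units + out_units

MOBILENET_V2_FEATURES = 1280  # feature dimension after global average pooling

# one binary output per behavior sub-model
params = dense_param_count(MOBILENET_V2_FEATURES, 1)
print(params)  # 1281, i.e. roughly the ~1,300 trainable parameters cited
```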
        <p>Both the experimental and baseline models freeze a majority of MobileNet V2’s layers to
preserve the classification architecture and build upon the network’s ability to detect edges,
objects, and groups of objects. This model consists of 16 repeated blocks, each containing a
2-dimensional convolutional layer, batch normalization, and a ReLU activation layer to contend
with nonlinearity.</p>
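<p>The normalization and activation stages of such a block can be sketched in NumPy as follows (an illustrative stand-in, not the authors' implementation; the 2-D convolution itself is elided for brevity):</p>

```python
import numpy as np

# Sketch of one block's post-convolution stages: inference-mode batch
# normalization followed by a ReLU activation.
def batch_norm(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize to zero mean / unit variance, then scale and shift
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def relu(x):
    # elementwise max(x, 0) introduces the nonlinearity
    return np.maximum(x, 0.0)

feature_map = np.array([[-1.0, 2.0], [0.5, -3.0]])
out = relu(batch_norm(feature_map, feature_map.mean(), feature_map.var()))
print(out.min() >= 0.0)  # True: ReLU clamps negative activations to zero
```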
        <p>
          Categorical cross-entropy loss was used in the baseline model to classify levels of students’
cognitive demand, where level 1 indicated the least demanding activity and level 4 indicated
the most demanding activity. Binary cross-entropy loss was used in the experimental model to
categorize each of the five feature classes, as we aimed to assess whether each student behavior
was present. Finally, we implemented a support vector machine classifier to transform
intermediate binary feature predictions to cognitive demand scores on testing data. SVMs
potentially work well with smaller datasets, such as ours, and are ideal for categorizing data
into linear classes [
          <xref ref-type="bibr" rid="ref28">24</xref>
          ].
        </p>
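<p>The two loss functions can be sketched as follows (illustrative NumPy definitions, not the authors' training code):</p>

```python
import numpy as np

# Sketch: categorical cross-entropy over the four demand levels, and binary
# cross-entropy for one behavior class (present vs. absent).
def categorical_cross_entropy(y_true_onehot, y_pred_probs, eps=1e-12):
    p = np.clip(y_pred_probs, eps, 1.0)
    return float(-np.sum(y_true_onehot * np.log(p), axis=-1).mean())

def binary_cross_entropy(y_true, y_pred_probs, eps=1e-12):
    p = np.clip(y_pred_probs, eps, 1.0 - eps)
    return float(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)).mean())

# e.g. a demand-level-3 image (one-hot over levels 1-4) predicted confidently
y = np.array([[0, 0, 1, 0]])
p = np.array([[0.05, 0.10, 0.80, 0.05]])
print(categorical_cross_entropy(y, p))  # ~0.223 (= -ln 0.80)
```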
      </sec>
      <sec id="sec-2-2">
        <title>4. Data and Preprocessing</title>
        <p>
          The data were collected from two eighth grade mathematics classrooms that focused on
building students’ capacities for cognitively demanding work through engagement with
mathematical tasks. Four 30-minute video recordings were taken in Spring 2017 that featured
students solving “The Washing Machine Problem” with Desmos, a dynamic graphing
calculator application. The video recordings were rated for cognitive demand on a 1-4 scale
called the Instructional Quality of Assessment Rubric [
          <xref ref-type="bibr" rid="ref10 ref9">8, 9</xref>
          ], a research-backed tool for rating
cognitive demand of students’ mathematical activity. Demand was rated at the level of the
entire student group, and importantly, cognitive demand ratings were not uniformly distributed
across the 1-4 scale. This is to be expected, because students were more likely to achieve
moderate cognitive demand throughout the task (level 2 or 3) than extreme ratings (level 1 or
4). Initial ratings were assigned in Winter 2021 by two mathematics education experts (Delaney
&amp; Kinsey) who reached 87.9% inter-rater agreement. This value, classified as “very good
agreement” [
          <xref ref-type="bibr" rid="ref25">21</xref>
          ], serves as the upper accuracy threshold for human performance on this task.
        </p>
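<p>The agreement statistic reported above is simple percent agreement, which can be sketched as follows (the rating sequences below are toy values, not the study's data):</p>

```python
# Sketch: percent inter-rater agreement between two raters' 1-4 ratings.
def percent_agreement(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)

a = [2, 3, 3, 2, 4, 1, 3, 2]
b = [2, 3, 2, 2, 4, 1, 3, 2]
print(percent_agreement(a, b))  # 87.5 for this toy example (7 of 8 match)
```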
        <p>The video recordings were spliced into still images taken at 1-second intervals, and we
assigned 1-4 cognitive demand labels to each image from Delaney and Kinsey’s ratings. We
then hand-labeled each image in the five binary classes according to the following schematic:
• Computer class: the image received a “1” if students were using the computer to solve
the task, and a “0” otherwise.
• Leaning class: the image received a “1” if more than one student was leaning into the
center of the table to collaborate with the group, and a “0” otherwise.
• Pointing class: the image received a “1” if one or more students were visibly pointing
or gesturing at the task or computer, and a “0” otherwise.
• Teacher class: the image received a “1” if the teacher and students were conversing
with one another at the same table, and a “0” otherwise.
• Writing class: the image received a “1” if one or more students were writing on the
task card, and a “0” otherwise.</p>
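<p>The schematic above amounts to encoding each image as a five-dimensional binary vector; a minimal sketch (the class names and example labels here are illustrative):</p>

```python
# Sketch: each image receives five binary labels, one per behavior class.
BEHAVIOR_CLASSES = ["computer", "leaning", "pointing", "teacher", "writing"]

def encode_labels(observed: set) -> list:
    # 1 if the behavior was observed in the image, 0 otherwise
    return [1 if c in observed else 0 for c in BEHAVIOR_CLASSES]

# e.g. an image where students use the computer and one points at the task
print(encode_labels({"computer", "pointing"}))  # [1, 0, 1, 0, 0]
```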
        <p>
          The five binary classes were generated based on our hypothesized relationship of each
indicator to students’ cognitive demand. Prior research has demonstrated that students’ use of
computational tools to assist with problem solving can either raise or lower cognitive demand,
contingent upon how students use it [
          <xref ref-type="bibr" rid="ref14 ref15">13</xref>
          ]. Similarly, conferring with a teacher should increase
cognitive demand, as teachers may draw students’ attention to cognitively demanding features
of the task during small-group interactions [
          <xref ref-type="bibr" rid="ref16 ref18">15</xref>
          ]. Finally, the ways in which students work
collaboratively and use one another as resources may increase cognitive demand, as visually
indicated through pointing, collective writing, and leaning in toward the “middle space” [
          <xref ref-type="bibr" rid="ref23 ref26">22</xref>
          ].
        </p>
        <p>In total, the data set contained 2000 images distributed uniformly across the four classroom
video recordings. Each image was rescaled to 224 by 224 pixels to match the maximum
input size supported by the MobileNet V2 pre-trained weights.</p>
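<p>The rescaling step can be sketched with nearest-neighbour indexing (the paper does not state which interpolation was used, so this stand-in is an assumption):</p>

```python
import numpy as np

# Sketch: nearest-neighbour rescaling of a video frame to the 224x224
# input expected by MobileNet V2's ImageNet weights.
def rescale(image: np.ndarray, size: int = 224) -> np.ndarray:
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return image[rows][:, cols]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # a 480p-style frame
print(rescale(frame).shape)  # (224, 224, 3)
```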
      </sec>
      <sec id="sec-2-3">
<title>5. Results</title>
      </sec>
      <sec id="sec-2-4">
<title>5.1. Experiment 1: Training the Baseline Model</title>
        <p>The first model classifies cognitive demand from images alone. We expected the accuracy of
this model to be relatively low because, in comparison with the five binary class indicators in
the experimental model, the baseline model’s feature space was high-dimensional. We
experimented with various combinations of hyperparameters: learning rates, batch sizes,
epochs, and optimizers to investigate the training accuracy of the baseline. We aimed to achieve
accuracy of around 25%, the expected accuracy that would be generated from a balanced
cognitive demand distribution over the four levels. Table 1 shows the training and validation
accuracy as we tuned hyperparameters over 20 epochs.</p>
        <p>Categorical Cross-Entropy Loss was selected to model the data with a batch size of 16 and
a learning rate of 0.0001. Conceptually, we anticipated that Sparse Categorical Cross-Entropy
Loss would have been a better fit because it is designed for integer input; however, this was
not the case during training. The final combination of hyperparameters caused the training
accuracy to increase quickly, then level off after approximately 10 epochs.</p>
      </sec>
      <sec id="sec-2-5">
        <title>5.2. Experiment 2: Training the Experimental Model using Binary Behavioral Proxies</title>
      </sec>
      <sec id="sec-2-7">
        <title>5.2.1. Phase 1: Binary classification using MobileNet V2</title>
        <p>The experimental model sought to improve cognitive demand predictions by first identifying
five binary student behaviors that might impact demand, then applying predicted binary class
labels for the behaviors to testing data for prediction. Each of the five binary class sub-models
were trained using MobileNet V2 with ImageNet weights. Data were split into 80% training,
10% validation, and 10% testing. We ensured that both the validation and testing sets contained
all four cognitive demand levels.</p>
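<p>A split of this kind can be sketched by partitioning within each demand level, which guarantees that every level appears in the validation and testing sets (a sketch with synthetic labels, not the authors' exact procedure):</p>

```python
import random

# Sketch: an 80/10/10 split performed per cognitive demand level, so the
# validation and testing sets each contain all four levels.
def stratified_split(items, labels, seed=0):
    rng = random.Random(seed)
    train, val, test = [], [], []
    by_level = {}
    for item, level in zip(items, labels):
        by_level.setdefault(level, []).append(item)
    for level, group in by_level.items():
        rng.shuffle(group)
        n = len(group)
        n_val = max(1, n // 10)   # at least one example per level
        n_test = max(1, n // 10)
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test

items = list(range(200))
labels = [1 + i % 4 for i in items]  # synthetic four-level demand labels
tr, va, te = stratified_split(items, labels)
print(len(tr), len(va), len(te))  # 160 20 20
```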
        <p>Hyperparameters were tuned for each class separately, although many classes showed
optimal training accuracy using similar inputs. Similar to Experiment 1, all binary classes were
tuned for learning rate, number of epochs, loss optimization function, and batch size. The
ADAM optimizer was used in all classes because it handled the noisy classroom data well, an
important consideration for localization of class features.</p>
        <p>We anticipated that the Teacher and Computer classes would achieve high training
accuracy faster, because there was less ambiguity in labeling those classes compared to
Writing, Leaning, and Pointing. We hypothesized that the latter classes would take longer to
converge because they were based on pose estimation, and were more likely to vary per student.
For example, we associated students’ elbows on the table with the “leaning” class, but since
not every student in the group needed to have exhibited the “leaning” action in order for the image
to be classified as “leaning,” this nuance may have been difficult for the model to detect. Table
2 shows the training and validation accuracies as we tuned hyperparameters for all five student
behavioral proxies trained over 50 epochs.</p>
        <p>The models performed best given low learning rates, smaller batch sizes, and longer
training duration to achieve high training and validation accuracy. This is not surprising, given
the localization required for the network to learn and classify each of the five feature behaviors.
Models were trained until each obtained a training accuracy over 85%, a value similar to the human
accuracy achieved for the original cognitive demand labels. In the event that multiple models fit
this criterion, the model whose parameters yielded the highest validation accuracy was selected. The
final selected hyperparameters are highlighted in yellow in Table 2.</p>
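<p>The selection rule described above can be sketched as follows (the run names and accuracy values are illustrative):</p>

```python
# Sketch: among hyperparameter settings whose training accuracy exceeds
# 85%, keep the one with the highest validation accuracy.
def select_model(candidates, train_threshold=0.85):
    eligible = [c for c in candidates if c["train_acc"] > train_threshold]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["val_acc"])

runs = [
    {"name": "lr=1e-3", "train_acc": 0.82, "val_acc": 0.80},  # under threshold
    {"name": "lr=1e-4", "train_acc": 0.91, "val_acc": 0.84},
    {"name": "lr=1e-5", "train_acc": 0.88, "val_acc": 0.86},
]
print(select_model(runs)["name"])  # lr=1e-5: best validation among eligible
```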
<p>Figure 2 illustrates one example of our error analysis for each binary class. As we
interrogated the nuances of these errors, it appeared that some class models learned to identify
subtleties in the data better than others. For example, the highly-accurate Computer class
differentiated between closed computers and open computers after 50 epochs of training. In a
majority of images, the Teacher class teased apart differences between the teacher’s presence
at the table versus the teacher performing other actions in the image background. Classification
errors occurred when the teacher was only partially visible in the image, which made sense, as
teachers were not actively monitoring their body position and placement during the original
video recordings. Errors in the Pointing, Writing, and Leaning classes occurred when the
students did not clearly demonstrate the intended action; for example, when the point was
blurry or incomplete, when only one student was writing or leaning, or when the leaning action
was subtle.</p>
        <p>Once the binary multiclass models were established, we utilized a small test set of data (n = 40
images, 10 per cognitive demand class) to examine the Computer, Leaning, Pointing, Teacher,
and Writing models’ abilities to (1) correctly predict the five binary classes of students’
behaviors in the test set and (2) calculate cognitive demand based on labels generated by the
five models. We tested both a linear and a generalized support vector machine to predict final
cognitive demand labels. Regularization parameters were tuned in both models (e.g., the kernel
and gamma parameters in the generalized SVM, and the loss function in the linear SVM).
Figure 3 summarizes the results for both classifiers and provides a confusion matrix to
summarize classification errors.</p>
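<p>For the linear case, the mapping from the five predicted behavior labels to a demand prediction can be sketched with a from-scratch hinge-loss SVM; this toy version collapses demand to a binary high/low target and trains on synthetic data, unlike the study's four-level library SVMs:</p>

```python
import random

# Sketch: a linear SVM trained by hinge-loss subgradient descent on
# five binary behavior features; labels are +1 (high demand) / -1 (low).
def train_linear_svm(X, y, lr=0.05, lam=0.01, epochs=300, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = y[i] * (sum(w[j] * X[i][j] for j in range(d)) + b)
            if margin < 1:  # inside margin or misclassified: hinge gradient
                for j in range(d):
                    w[j] += lr * (y[i] * X[i][j] - lam * w[j])
                b += lr * y[i]
            else:           # only the L2 regularizer contributes
                for j in range(d):
                    w[j] -= lr * lam * w[j]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# synthetic, linearly separable toy data: many behaviors -> high demand
X = [[1, 1, 1, 0, 1], [1, 0, 1, 1, 1], [0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0], [1, 1, 0, 1, 1], [0, 0, 1, 0, 0]]
y = [1, 1, -1, -1, 1, -1]
w, b = train_linear_svm(X, y)
preds = [predict(w, b, x) for x in X]
print(preds)  # recovers the separable toy labels
```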
        <p>After tuning the regularization rate and aforementioned hyperparameters, it did not appear
that the models’ predictive accuracies for cognitive demand varied substantially. The general
SVM classifier was the better overall choice because it improved cognitive demand
classification from the baseline model (55.7%), although it did not surpass human
performance (87.9%). This result is not surprising, because cognitive demand is an abstract
concept that was previously rated by human experts using both speech and visual cues.
However, the drastic improvements in cognitive demand classification from the baseline model
validate our current approach despite the relatively small size of the data set.</p>
      </sec>
      <sec id="sec-2-8">
        <title>6. Discussion</title>
        <p>The experimental two-phase model did not approach human-level performance, but showed
both improvements from the baseline model and promise for future work. The SVM classifier,
trained on weakly supervised behavior labels, performed better than the baseline model's direct classification. This
result presents a case for weak supervision to be used when training data are identified, sorted,
and labeled, which could draw upon the expertise of more teachers in future iterations of this
work. We hypothesize that teachers’ involvement during data labeling would offer
improvements to the model due to their developed practices in interpreting students’ behaviors
in their day-to-day experiences.</p>
        <p>
          Future developments in this study will increase the sample sizes and apply data
augmentation to re-examine outcomes. Increasing the sample size will improve predictive
stability in the binary models, particularly the Pointing class, which contained a smaller
proportion of positive cases with respect to the others. Future data augmentations to be tested
include varying the brightness in classroom photos, rotating classroom images, and including
more classroom images with noisy features (for example, the presence of additional individuals
in the image frame who are not the teacher). Although MobileNet V2 appeared to be a suitable
classifier for binary class inputs, it was likely not the best choice for the baseline categorical
model. Other neural networks, such as VGG net, may have produced better transfer learning
[
          <xref ref-type="bibr" rid="ref13">12</xref>
          ], and will be tested in future iterations of this work.
        </p>
        <p>A key long-term goal of this project is to build toward a cognitive demand classification
tool that can be used to support and empower teachers’ professional learning. By analyzing
their students’ variations in cognitive demand throughout a mathematical task, teachers may
better understand the range and variation in students’ enacted demand, and adjust their future
instructional practices accordingly. Such a tool may be useful in teachers’ video clubs [25], a
form of professional development activity designed to hone teachers’ noticing and inquiry into
student behavior. By supplying teachers with a cognitive demand classifier, teachers may
attend to student behavioral features that impact cognitive demand more frequently, and adjust
their practices in response. We aim to test this theory in future iterations of this work.</p>
      </sec>
      <sec id="sec-2-9">
        <title>7. Acknowledgements</title>
        <p>We thank the Amir Lopatin Fellowship committee, which supplied funding to this project in
support of its potential contributions to the learning sciences. This study emerged under the
mentorship of Dr. Nick Haber and Dr. Ranjay Krishna during Stanford’s Spring 2021 academic
quarter. It was originally submitted as the project component of their CS 432 and CS 231n
courses, respectively. We thank Gina Kinsey for her work in hand-labeling the original
cognitive demand data and Jagriti Agrawal for her contributions during the initial conception
and modeling in this study.</p>
        <p>8. References</p>
        <p>and learning in a reform mathematics project. Educational Research and Evaluation, 2(1), 50-80.</p>
        <p>[28] Tekkumru-Kisa, M., &amp; Stein, M. K. (2015). Learning to see teaching in new ways: A foundation for maintaining cognitive demand. American Educational Research Journal, 52(1), 105-136.</p>
        <p>[29] Thille, C., &amp; Zimmaro, D. (2017). Incorporating learning analytics in the classroom. New Directions for Higher Education, 2017(179), 19-31.</p>
        <p>[30] Xiang, Q., Wang, X., Li, R., Zhang, G., Lai, J., &amp; Hu, Q. (2019, October). Fruit image classification based on MobileNetV2 with transfer learning technique. In Proceedings of the 3rd International Conference on Computer Science and Application Engineering (pp. 1-7).</p>
        <p>[31] Yosinski, J., Clune, J., Bengio, Y., &amp; Lipson, H. (2014). How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>5.2.2. Phase</source>
          <volume>2</volume>
          :
          <string-name>
            <given-names>Labeling</given-names>
            <surname>Cognitive</surname>
          </string-name>
          <article-title>Demand using Trained Multiclass Models and a Support Vector Machine</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2014</year>
          ,
          <article-title>September). Analyzing the performance of multilayer neural networks for object recognition</article-title>
          .
          <source>In European conference on computer vision</source>
          (pp.
          <fpage>329</fpage>
          -
          <lpage>344</lpage>
          ). Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ngoc</given-names>
            <surname>Anh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Tung Son</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Truong</surname>
          </string-name>
          <string-name>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Le</surname>
          </string-name>
          <string-name>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Huu Tuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Cong Dat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            , ... &amp;
            <surname>Van Dinh</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>A computer-vision based application for student behavior monitoring in classroom</article-title>
          .
          <source>Applied Sciences</source>
          ,
          <volume>9</volume>
          (
          <issue>22</issue>
          ),
          <fpage>4729</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Ansar</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shahid</surname>
            ,
            <given-names>A. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raza</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dar</surname>
            ,
            <given-names>A. H.</given-names>
          </string-name>
          (
          <year>2020</year>
          , March).
          <article-title>Breast cancer detection and localization using MobileNet based transfer learning for mammograms</article-title>
          .
          <source>In International symposium on intelligent computing systems</source>
          (pp.
          <fpage>11</fpage>
          -
          <lpage>21</lpage>
          ). Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Azizpour</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharif</surname>
            <given-names>Razavian</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Sullivan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Maki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            , &amp;
            <surname>Carlsson</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>From generic to specific deep representations for visual recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition workshops</source>
          (pp.
          <fpage>36</fpage>
          -
          <lpage>45</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Vincent</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Unsupervised feature learning and deep learning: A review and new perspectives</article-title>
          .
          <source>CoRR, abs/1206.5538</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hénaff</surname>
            ,
            <given-names>O. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolesnikov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Oord</surname>
            ,
            <given-names>A. V. D.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Are we done with ImageNet?</article-title>
          .
          <source>arXiv preprint arXiv:2006.07159</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Bosch</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mills</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wammes</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Smilek</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2018</year>
          , June).
          <article-title>Quantifying classroom instructor dynamics with computer vision</article-title>
          .
          <source>In International Conference on Artificial Intelligence in Education</source>
          (pp.
          <fpage>30</fpage>
          -
          <lpage>42</lpage>
          ). Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Boston</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Assessing instructional quality in mathematics</article-title>
          .
          <source>The Elementary School Journal</source>
          ,
          <volume>113</volume>
          (
          <issue>1</issue>
          ),
          <fpage>76</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Boston</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Assessing Academic Rigor in Mathematics Instruction: The Development of the Instructional Quality Assessment Toolkit</article-title>
          .
          <source>CSE Technical Report 672</source>
          . National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Canedo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trifan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Neves</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          (
          <year>2018</year>
          , June).
          <article-title>Monitoring students' attention in a classroom through computer vision</article-title>
          .
          <source>In International Conference on Practical Applications of Agents and Multi-Agent Systems</source>
          (pp.
          <fpage>371</fpage>
          -
          <lpage>378</lpage>
          ). Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suzauddola</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nanehkaran</surname>
            ,
            <given-names>Y. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Identification of plant disease images via a squeeze-and-excitation MobileNet model and twice transfer learning</article-title>
          .
          <source>IET Image Processing</source>
          ,
          <volume>15</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1115</fpage>
          -
          <lpage>1127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>P. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Malhi</surname>
            ,
            <given-names>H. S.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Transfer learning with convolutional neural networks for classification of abdominal ultrasound images</article-title>
          .
          <source>Journal of digital imaging</source>
          ,
          <volume>30</volume>
          (
          <issue>2</issue>
          ),
          <fpage>234</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          Common Core State Standards Initiative.
          (
          <year>2010</year>
          ).
          <article-title>Common Core State Standards for mathematics</article-title>
          . Retrieved from http://www.corestandards.org/assets/CCSSI_Math%20Standards.pdf
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Doyle</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>1983</year>
          ).
          <article-title>Academic work</article-title>
          .
          <source>Review of educational research</source>
          ,
          <volume>53</volume>
          (
          <issue>2</issue>
          ),
          <fpage>159</fpage>
          -
          <lpage>199</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Franke</surname>
            ,
            <given-names>M. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turrou</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Webb</surname>
            ,
            <given-names>N. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ing</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Student engagement with others' mathematical ideas: The role of teacher invitation and support moves</article-title>
          .
          <source>The Elementary School Journal</source>
          ,
          <volume>116</volume>
          (
          <issue>1</issue>
          ),
          <fpage>126</fpage>
          -
          <lpage>148</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grauman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosing</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Feris</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Spottune: transfer learning through adaptive fine-tuning</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          (pp.
          <fpage>4805</fpage>
          -
          <lpage>4814</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Huh</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Efros</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>What makes ImageNet good for transfer learning?</article-title>
          .
          <source>arXiv preprint arXiv:1608.08614</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Der Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jabri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Vasilache</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2016</year>
          , October).
          <article-title>Learning visual features from large weakly supervised data</article-title>
          .
          <source>In European Conference on Computer Vision</source>
          (pp.
          <fpage>67</fpage>
          -
          <lpage>84</lpage>
          ). Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Kaput</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          (
          <year>1992</year>
          ).
          <article-title>Technology and mathematics education</article-title>
          .
          <source>Handbook of research on mathematics teaching and learning</source>
          ,
          <fpage>515</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Welling</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Auto-encoding variational bayes</article-title>
          .
          <source>arXiv preprint arXiv:1312.6114</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Landis</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>G. G.</given-names>
          </string-name>
          (
          <year>1977</year>
          ).
          <article-title>An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers</article-title>
          .
          <source>Biometrics</source>
          ,
          <fpage>363</fpage>
          -
          <lpage>374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Lotan</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>Talking and Working Together: Conditions for Learning in Complex Instruction</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Pal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Mather</surname>
            ,
            <given-names>P. M.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Support vector machines for classification in remote sensing</article-title>
          .
          <source>International journal of remote sensing</source>
          ,
          <volume>26</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1007</fpage>
          -
          <lpage>1011</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Recht</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roelofs</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Shankar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2019</year>
          , May).
          <article-title>Do ImageNet classifiers generalize to ImageNet?</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          (pp.
          <fpage>5389</fpage>
          -
          <lpage>5400</lpage>
          ). PMLR.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Sherin</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>van Es</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Effects of video club participation on teachers' professional vision</article-title>
          .
          <source>Journal of teacher education</source>
          ,
          <volume>60</volume>
          (
          <issue>1</issue>
          ),
          <fpage>20</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Suresh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sumner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jacobs</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foland</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          (
          <year>2018</year>
          , December).
          <article-title>Using deep learning to automatically detect talk moves in teachers' mathematics lessons</article-title>
          .
          <source>In 2018 IEEE International Conference on Big Data (Big Data)</source>
          (pp.
          <fpage>5445</fpage>
          -
          <lpage>5447</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lane</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>1996</year>
          ).
          <article-title>Instructional tasks and the development of student capacity to think and reason: An analysis of the relationship between teaching and learning in a reform mathematics project</article-title>
          .
          <source>Educational Research and Evaluation</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <fpage>50</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>