=Paper= {{Paper |id=Vol-2734/paper4 |storemode=property |title=Evaluations as Research Tools: Gender Differences in Academic Self-Perception and Care Work in Undergraduate Course Reviews |pdfUrl=https://ceur-ws.org/Vol-2734/paper4.pdf |volume=Vol-2734 |authors=David Lang,Youjie Chen,Andreas Paepcke,Mitchell L. Stevens |dblpUrl=https://dblp.org/rec/conf/edm/LangCPS20 }} ==Evaluations as Research Tools: Gender Differences in Academic Self-Perception and Care Work in Undergraduate Course Reviews== https://ceur-ws.org/Vol-2734/paper4.pdf
      Evaluations as Research Tools: Gender Differences in
           Academic Self-Perception and Care Work in
                Undergraduate Course Reviews

                       David Lang, Youjie Chen, Andreas Paepcke, Mitchell L. Stevens
                                                            Stanford University
                                                               Stanford, CA
                             {dnlang86, minachen, paepcke, stevens4}@stanford.edu

ABSTRACT                                                                    through reviews with institutional data describing those who submit
Student course reviews are rarely considered as research instruments,       them. Thus while course reviews may be problematic means of as-
yet their ubiquity makes them promising tools for education data            sessing the quality of instructors and instruction – a matter on which
science. To illustrate this potential, we use a corpus of student           we make no comment here– we believe these data hold substantial
reviews to observe gender differences in how students appraise their        promise for education data science.
own learning and in the advice they give to future students. We
find systematic differences in who submits course reviews, with             To illustrate this promise, we leverage a corpus of 11,255 student
female and academically high-achieving students more likely to              reviews submitted by undergraduate students enrolled in Computer
submit. Among submitters, we find (a) females understate their              Science (CS) classes at a private research university during the 2015-
achievement of learning goals relative to males earning the same            16 and 2016-17 academic years. Because each review is linked with
grades; (b) females offer lengthier written advice to future students       the academic transcript and self-reported gender of its submitter, we
than males; (c) advice written by females exhibits more positive            are able to observe variation in submissions by an important aspect
tone, even after accounting for grades and course selections.               of student identity and documented academic accomplishment.

Keywords                                                                    While a variety of student characteristics are of interest to educa-
                                                                            tion data scientists, we focus on students’ gender for reasons both
care work; course evaluations; gender; higher education; topic
                                                                            practical and theoretical. For privacy purposes, our case university
models; survey design 1
                                                                            has currently granted researcher access to only a few variables de-
                                                                            scribing review submitters; we utilize those available data here. Yet
1.    INTRODUCTION                                                          we also have two theoretical motivations for focusing on gender.
Student course reviews are a controversial subject in academia.             First, borrowing from social psychology, we recognize that women
While considerable work has addressed problems of validity and              tend to under-estimate their own abilities, while men tend to over-
bias in the use of these instruments for assessing instructors and          estimate, conditional on measured accomplishment [7]. Second,
instruction [25] [21], few have recognized reviews as potentially           borrowing from feminist social science, we posit that submission
useful research tools for education data science.                           of a course review is a form of care work – a voluntary investment
                                                                            in the well-being of others – and thus implicated differently in fem-
Several features of course reviews make them potentially attractive         inine and masculine gender roles and identities [9]. Our findings
for researchers. First, reviews are ubiquitous features of teaching         comport with the contours of these larger literatures in ways that
and learning in US higher education. Because they are so commonly           are both important in their own right, and instructive for any future
solicited and so frequently submitted, the data yielded from reviews        deployments of course reviews for education data science.
represents a very wide swath of student populations. Second, re-
views are routinely submitted through online platforms and carried          We pursue three sets of analyses below. In the first set, we ob-
out by administrative units supported on hard budget lines, bringing        serve variation in rates of review submission by gender and earned
the marginal cost of acquiring research data through reviews close          grades. These analyses illustrate how researchers might test the
to zero. Third, it is technically simple to link information obtained       representativeness of corpora of reviews. Second, we observe how
1 "Copyright © 2020 for this paper by its authors. Use permitted            submitters respond to multiple close-ended review prompts targeting
                                                                            self-assessments of learning, but that are phrased differently. These
under Creative Commons License Attribution 4.0 International (CC
BY 4.0)."                                                                   analyses illustrate how review design may interact with student
                                                                            characteristics to produce patterned variation in reported learning
                                                                            progress. Third, we conduct computational text analyses of submis-
                                                                            sions to an open-ended review prompt. These analyses illustrate
                                                                            how qualitative reviews can be efficiently leveraged for scientific
                                                                            insight.


                                                                            2.    RELATED WORK
                                                                            2.1     Course Reviews

Research on course reviews typically has focused on questions of                 popular of these dictionaries is the Linguistic Inquiry and Word
their value as instruments for evaluating the quality of instruction.            Count (LIWC) dictionary [18]. In addition to grouping words into
Analyses conducted at scale typically focus on whether measures of               75 distinct categories and themes (e.g. family, power, death, etc), the
learning are correlated with instructional quality [27].                         dictionary generates four psycho-social variables that were validated
                                                                                 on college application essays through a rating process. Each of these
One potential concern about course reviews from the psychomet-                   variables is scored on a 1 to 99 interval where 1 is a complete lack of
rics literature is the potential for differential item function, a phe-          the construct and 99 is a highly pronounced form of the construct.
nomenon in which respondents of equal ability will exhibit different             These constructs are:
responses to a given survey item or question. Studies of differential
item function in course reviews have focused on the quantitative
difficulty of a class, or characteristics of the instructor [17] [8]. Rel-          1. Tone- This is a summary variable describing the emotional
atively little work has focused on how student characteristics may                     quality of the text. A score of 99 reflects a positive tone and a
be associated with differential item function on course reviews.                       score of 1 reflects a negative tone. A score of 50 represents
                                                                                       neutral valence.
If findings from copious research on product reviews translate to                   2. Analytic- This is a measure of how much formal logic is used
academic course reviews, we would expect that students with high-                      in the text. A score of 1 indicates little use of formal logic and
valence opinions about a course are more likely to respond, resulting                  a score of 99 indicates statements with a great deal of formal
in a bimodal or j-shaped distribution [12]. In practice, these findings                logic.
may not translate. There are often other incentives for filling out
reviews, for example, giving students earlier access to their final                 3. Authenticity- This is a measure of the sincerity/honesty of
grades as an incentive to respond. Those who submit reviews may                        a text. A score of 1 indicates insincerity and a score of 99
not be representative of the larger population of students who en-                     indicates high sincerity.
rolled in a particular course, or of the overall campus population.
This problem is exacerbated when analysts do not have access to                     4. Clout- This is a measure of the text’s authority, relative po-
reviewer characteristics such as gender or grades [1]. Past work that                  sition, and confidence. A score of 1 suggests relatively little
has tried to adjust for non-response bias in reviews has suggested                     authority and a score of 99 suggests high authority.
that non-response bias tends to favor positive reviews [11].
                                                                                 Researchers can also create their own custom measures of text.
There is varying theory around why students may opt to submit                    We take advantage of this affordance by capturing mentions of
reviews. Studies conclude that students are more likely to respond               instructors’ names.
to course evaluations if they are majoring in the subject of the course
[1]. Other work suggests that female students are generally more
likely to respond to course evaluations than males [23]. However,
                                                                                 2.2.2     Token-based approaches
                                                                                 Token-based approaches treat every word in a text as input into a
little work has focused specifically on this topic in Computer Science
                                                                                 model. These approaches often result in the loss of syntactic mean-
courses.
                                                                                 ing but are often very effective at classifying documents. Token-
                                                                                 based approaches have proven effective at detecting socioeconomic
Under experimental conditions where researchers manipulated the
                                                                                 features of authors such as race, gender, and income in college appli-
information content and valence of course reviews, researchers
                                                                                 cation essays [4]. Other applications have generated algorithms with
found that these factors had material effects on course enrollment
                                                                                 high predictive validity on classroom observation and evaluation
decisions. Students were more likely to enroll in courses if course
                                                                                 rubrics [15].
evaluations had positive valence, particularly if there was a large
number of such evaluations [14]. Similar work found that exposure
to positive or negative course reviews had modest to large effects on            2.2.3     Unsupervised approaches
students’ expected performance within a course, and their likelihood             The basic premise behind unsupervised approaches is that texts
of recommending the course in the future [13]. These findings are                include multiple topics, and topics comprise words. Using unsu-
particularly relevant for CS courses as CS courses tend to have                  pervised methods such as Latent Dirichlet Allocation (LDA), we
relatively enrollments compared to other subjects.                               can group texts categorically. These same methods have been aug-
                                                                                 mented recently to allow the distribution of topics to co-vary with
2.2     Text Analysis                                                            other relevant metadata, a technique known as structural topic mod-
There is a burgeoning literature on using computational text analysis            eling [22]. In this case, we can examine the concentration of topics
and methods to quantify differences in corpora based on characteris-             by features such as student gender or grades. This method further al-
tics of the author and the text. These techniques have been quickly              lows us to perform statistical inference to see if topic preponderance
adopted to educational applications but we have seen relatively few              varies systematically by characteristics of authors.
instances of text analysis of course reviews. When text analyses of
course evaluations are done, they are typically focused on keyword               2.3     Gender Differences in Academic Experi-
extraction and on predicting Likert item responses as a function of                      ences, Skill Perception and Care Work
the text [24].                                                                   Our work has three motivations from prior social-science literature
                                                                                 on higher education and gender. The first is that male and female stu-
We group text analysis methods into the following three categories:              dents may have different experiences when taking the same courses.
                                                                                 For example, women are less comfortable asking questions and have
2.2.1     Dictionary and rule-based approaches                                   less confidence in CS courses than their male peers [19]. This "gen-
Dictionary-based approaches characterize the words of documents                  der confidence gap" grows as students take more advanced courses
into groups of predefined categories such as sentiment. The most                 [3]. Analyses of communal academic resources in CS programs


find substantial differences in how contributions by male and fe-           to their male counterparts. We hypothesize that men are more
male users are acknowledged on GitHub and Stack Overflow [16] [26].         likely to see course reviews as a form of positive self-reflection and
Consequences of these phenomena may extend beyond college, as               promotion, and that females are more likely to see reviews as a form of
women with degrees in STEM fields are less likely than men to enter         care work. We believe these differences will have stronger valence
STEM occupations [5]. While course reviews cannot capture empir-            in items that focus on a student’s accomplishments rather than other
ical variation in experience per se, they can capture how submitters        constructs such as student learning. We will model these analyses
make sense of those experiences.                                            as a fixed-effect regression model with the following specification:

Second, gendered differences in skill perception may influence                  $Y_{ij} = \beta_1 \mathrm{Male}_i + \mathrm{Grades}_{ij} + \Gamma_j + \varepsilon_{ij}$    (1)
how students report their experiences and learning gains in reviews.
                                                                            The subscripts i and j correspond to indices for student and course.
While women tend to approach STEM fields with less confidence,
                                                                            The Y variable corresponds to our focal outcome variable, in this
men tend to over-estimate their abilities. Experimental work by
                                                                            case, responses to a Likert item. Male corresponds to a student’s
Correll [7] found that men expressed inflated perceptions of their
                                                                            self-reported indicator variable of whether the student identifies
own skill at completing quantitative tasks compared to women per-
                                                                            as male and β1 corresponds to the associated coefficient with this
forming at the same level of measured accomplishment. Together
                                                                            variable. We represent course effects with Γ_j to control for factors
these inquiries suggest that course reviews may bear traces of gen-
                                                                            like the difficulty of the course or instructional quality. We also
dered patterns of academic self-perceptions. Our third motivation is
                                                                            control for grades with an additional fixed-effect for each possible
the gendered character of care work. Social scientists define care
                                                                            grade a student could receive 2 . The error is represented by ε. Errors
work as work that attends to the well-being of others. It comprises
                                                                            are clustered at the course level.
activities and services intended to help other people develop their
capabilities and pursue their goals [9]. Care work is consistently
associated with femininity and female role expectations, and often          3.2      Open-Response Questions
is unpaid or poorly compensated [10]. To the extent that submitting         We pay particular interest to open-response items in course reviews.
course reviews is an act of assistance – to improve classes and to          We suspect that such items may be the most valuable and least
inform future students – it is appropriately theorized as a form of         explored element of course reviews. As such, we may be able to
care work. Thus we might expect that female and male students will          detect subtle differences in qualitative responses.
approach the task of course reviews with different dispositions, such
that the number, extensiveness, and content of course evaluations
may vary by gender of submitters.                                            3.2.1    Psychosocial variables
                                                                            H4: Course evaluations written by females will express more pos-
3.    RESEARCH QUESTIONS                                                    itive and sincere sentiment.
Our working hypothesis is that course reviews will exhibit gendered
patterns of academic experience, self-perceptions and advice-giving.        Given our care work hypothesis, we believe that female students
Specifically: (1) reviews from male students will profess stronger          will express more positive sentiment in open-response items. We
learning gains; (2) reviews from female students                            use the same analytical strategy as in Equation 1 using LIWC’s
will exhibit characteristics of care work.                                  tone variable. Specifically, we examine gender differences in these
                                                                            psycho-social variables after controlling for variation that can
We group our analyses into two parts. The first part examines varia-        be attributed to the course, or to student grade. We report outcomes
tion by gender and earned grades on review submission rates and             in standardized effect sizes to facilitate interpretability.
on Likert-scale items on course reviews. The second part exam-
ines variation in male and female responses to a qualitative review         We also hypothesize that a corollary to the care work hypothesis
prompt eliciting advice for future students considering the same            is that female students will use more "I" statements and tentative
courses.                                                                    language. This tendency would manifest as reviews written by
                                                                            female students exhibiting more authentic language.
3.1    Review Submission Rates and Likert Items
H1: Female students will respond to course evaluations more of-             H5: Course evaluations written by male students will express
ten than males.                                                             more clout. Based on prior literature pertaining to a confidence
                                                                            gap in CS by gender, we hypothesize this trend should manifest
Our care work hypothesis is that female students will be more               with fewer expressions of clout and authority in course evaluations by
responsive to institutional requests for reviews. We investigate this       female authors.
hypothesis using an exact two-sample binomial test. We examine
results by gender and grade.                                                 3.2.2    Hand-crafted rules
                                                                            H6: Female students will write more on course evaluations and
H2: There are systematic differences in response rate by grade.             mention the instructor more often.
There are many competing theories of how grades might influence
response rates to course evaluations. If students have a poor grade,        We hypothesize that care work will manifest in other ways beyond
they may be more inclined to view the evaluation as an opportunity          psycho-social variables. Specifically: female submitters will put
to retaliate against the grader. Alternatively, students who receive        more effort into reviews by writing more; and they will take a more
a low grade may opt to avoid opportunities to reflect on negative           individualized approach by mentioning the instructor explicitly.
experiences. We will investigate this hypothesis utilizing a simple
χ² test of response rates by grade.                                          2 In our analyses, there are over twenty grade types, including + and
                                                                            - variants as well as credit and no-credit courses. We report A, B, C, D,
H3: Female students will understate their achievements relative              and not-passing grades for simplicity.


We have crafted two simple measures to facilitate investigation of
this hypothesis: the length of each response in number of words,
and a capture of each instance of an instructor name.
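
The two measures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our production code: the function names `response_length` and `count_instructor_mentions`, the whitespace-based word count, the case-insensitive whole-word matching, and the sample review and instructor name are all hypothetical.

```python
import re
from typing import Sequence


def response_length(text: str) -> int:
    """Length of a response in whitespace-delimited words (assumed tokenization)."""
    return len(text.split())


def count_instructor_mentions(text: str, instructor_names: Sequence[str]) -> int:
    """Count case-insensitive whole-word matches of any instructor name."""
    if not instructor_names:
        return 0
    # Escape each name so punctuation in a name is matched literally.
    alternatives = "|".join(re.escape(name) for name in instructor_names)
    pattern = r"\b(?:" + alternatives + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical review text and instructor name, for illustration only.
review = "Professor Rivera explains recursion clearly. Go to Rivera's office hours!"
names = ["Rivera"]
print(response_length(review), count_instructor_mentions(review, names))
```

Whole-word matching avoids counting a name that appears inside a longer word, at the cost of missing misspellings or nicknames; a real deployment would likely need a per-course roster of name variants.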

3.2.3      Topic models
H7: There will be systematic variation in topics depending on the
author’s gender.

Our final analysis is exploratory, using structural topic models to
identify whether qualitative components of the corpora systemati-
cally vary with the gender of the submitter. The goal of this analysis is
to develop efficient means of sorting and categorizing qualitative
components of course reviews.

4.     DATA
Data comprise information describing enrollments in courses offered
through the Computer Science (CS) Department of a private research
university during the 2015-16 and 2016-17 academic years, and the
entire population of formal reviews submitted by students enrolled
in those courses. Reviews were administered near the end of the
academic term but before the beginning of the term’s official final
exam period. As an incentive for submitting reviews, students were
given the ability to see their final course grades a bit earlier than
non-submitters.

In total these data yield 11,255 student responses from 251 courses.
Courses range in character from very large introductory lecture-and-
lab formats to small advanced seminars. Institutional data made
available to us for analysis include each student’s grade, gender,
GPA, declared major (if known), and academic year. We combine
these data with the corpus of reviews submitted for CS courses
during the study period specified above. Approximately one-third of
submitted reviews are from female students, and approximately half are
from undergraduates. We cannot track or identify students enrolling                     Figure 1: Responses to Likert item questions
in multiple CS courses during the study period; however, we can
compute and generate response rates by grade and gender.
                                                                               Aggregated responses to these two prompts appear in Figure 1.
We limit our analysis to responses in which submitters offered a
response to the review’s only open-ended question. That question               5. ANALYSES
reads:                                                                         5.1 H1: Response Rates By Gender
                                                                               Rates of review submission by student gender and earned grade are
"What would you like to say about this course to a student who is              reported in Figure 2. Two features are notable. First, females are
considering taking it in the future?"                                          more likely to submit overall. On average, females submit to 78.0%
                                                                               of opportunities to do so; males, 74.5% (p<.001).
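
A comparison of two submission rates of this kind can be sketched as below. The sketch uses a pooled two-proportion z-test, which is a large-sample stand-in and not the exact binomial two-sample test named in H1; the function name and the counts are hypothetical, chosen only to mirror rates of 78.0% and 74.5%.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (not the study's data): 780 of 1,000 opportunities
# submitted by females and 1,490 of 2,000 by males.
z, p_value = two_proportion_z_test(780, 1000, 1490, 2000)
print(round(z, 2), round(p_value, 3))
```

With enrollment counts as large as those in a multi-year course corpus, the normal approximation and an exact test would typically lead to the same conclusion.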
The prompt is very well aligned with our care work hypotheses, in
that it specifically asks submitters to give advice to a hypothetical          Second, those receiving higher grades in a course are more likely
future student. Individual responses vary substantially in length:             to submit reviews. Females receiving a grade of "A" are 3.7% more
from a single character to over 5,964 characters (the latter equivalent        likely to submit than their male counterparts. The gen-
to 1004 words). The mean response length is 132 characters –                   der submission gap is greatest among students receiving a grade of
approximately the length of a tweet. The entire corpus of responses            "B," with female "B" recipients 6.5% (p<.001) more likely to sub-
to this question is 300,000 words.                                             mit than males. We do not observe statistically significant gender
                                                                               differences in submission rates for those receiving grades below
Additionally we analyze responses to two review prompts with                   "B," however such grades represent fewer than 5% of grades in the
five-point Likert responses: 3                                                 research sample.

     • How much did you learn from this course?                                5.2    H2: Variation in Submission by Grade
                                                                               We also examined whether review submission varied systematically
     • How well did you achieve the learning goals of this course?             by grade. Figure 2 indicates a strong positive correlation between
3 We limit this analysis to complete cases because one                         grade and likelihood of submission. A student with a grade of A or
item was not consistently administered across courses. We ignore               higher has an 80% chance of responding to the evaluation, while
questions pertaining to quality of instruction and focus on student            students who do not pass the course or receive credit without a
learning goals.                                                                grade responded approximately 50 percent of the time. Effectively,


                                                                               Figure 3: Percent of Students Saying they Achieved learning
                                                                               goals extremely well

                                                                                                                             Tone    Analytic   Clout    Authentic
                                                                                     Male                                  −0.08∗∗    −0.01      0.01    −0.10∗∗∗
                                                                                                                            (0.03)    (0.02)    (0.02)    (0.02)
                                                                                     Num. obs.                              11255     11255     11255     11255
                                                                                     R2 (full model)                         0.07      0.05      0.05      0.04
                                                                                     R2 (proj model)                         0.00      0.00      0.00      0.00
                                                                                     Adj. R2 (full model)                    0.05      0.02      0.02      0.02
                                                                                     Adj. R2 (proj model)                   −0.02     −0.02     −0.02     −0.02
                                                                                     Num. groups: grade                       20        20        20        20
               Figure 2: Survey response rates by grade                              Num. groups: evalunitid                 251       251       251       251
                                                                                     ∗∗∗ p < 0.001; ∗∗ p < 0.01; ∗ p < 0.05

                                             Learning   Achievement
       Male                                    0.01       0.08∗∗∗                                            Table 2: LIWC Regressions
                                              (0.01)      (0.01)
       Num. obs.                               9002        9002
       R2 (full model)                         0.15        0.18                in self-reported measures of ‘how much they learned’ in a class.
       R2 (proj model)                         0.00        0.01                However, when we look at a similar question about ‘achievement
       Adj. R2 (full model)                    0.13        0.16                of learning goals’, we see a stark difference. Male students are 8%
       Adj. R2 (proj model)                   −0.03       −0.02                points more likely than females to state they mastered the learning
       Num. groups: grade                        20          20                goals of a course. This finding suggests two concerns. First, given
       Num. groups: evalunitid                  203         203                the similarity of these questions, we see that subtle differences in
       ∗∗∗
             p < 0.001; ∗∗ p < 0.01; ∗ p < 0.05                                phrasing yield substantial differences in student responses. Second,
                                                                               female students report lower-level of mastery even after controlling
             Table 1: Gender Differences in Likert Items                       for grades. Notably, these surveys are collected before students
                                                                               know their final grades. These perceptions may change after this
                                                                               information is revealed to them.
this means that students who fail courses are represented by review
submissions about half as often as students who excel in a courses.            5.4         Open Text Responses
Differences are statistically significant with a χ 2 statistic of 395.37
and a p-value of less than 0.001.
                                                                               5.4.1           Psycho-social variables
                                                                               graphicx

5.3     H3: Reports of Learning and Goal-Meeting                               We report our analyses for H4 and H5 in table 2. We find modest
We observe the proportion of students reporting having achieved the            variation by gender in how submitters describe their experience in
learning goals of a course Extremely Well by grade in Figure 3. Not            the same course, conditional on grades. On average, submissions
surprisingly, we find a strong direct correlation with course grade,           from males evince slightly more negative and slightly less authentic
such that reported goal achievement declines with grade. What is               language. While these gender differences are highly significant, their
striking is that at every grade level, there is a clear gap in reported        magnitude is modest: on the order of a tenth of a standard deviation.
goal achievement, with males more likely to report achievement                 Nevertheless, they are consistent with our care work hypotheses. To
than females earning the same grade.                                           wit, men are somewhat more critical and less honest in their reviews
                                                                               than women, suggesting greater empathy and investment among
We extend this analysis to see if this same pattern occurs with the            female submitters.
question of how much students learn. Using the same specification
as described in equation 1 in table 1. We see that after controlling           With respect to our hypothesis around clout, we find little evidence
for grades and course, males and females exhibit no differences                that qualitative open-responses exhibit any significant differences in


                                                                           5
                                                ProfessorName   Word Count
          Male                                      −0.02∗       −0.15∗∗∗
                                                    (0.01)        (0.03)
          Num. obs.                                 11255         11255
          R2 (full model)                            0.08          0.08
          R2 (proj model)                            0.00          0.00
          Adj. R2 (full model)                       0.06          0.06
          Adj. R2 (proj model)                      −0.02         −0.02
          Num. groups: grade                          20            20
          Num. groups: evalunitid                    251           251
          ∗∗∗
                p < 0.001; ∗∗ p < 0.01; ∗ p < 0.05


                        Table 3: Handcrafted Features


submissions from males and females.
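
The regression tables in this section report a coefficient on Male with
grade and course (evalunitid) as grouping variables. As a rough
illustration of how such a coefficient can be obtained, the sketch below
absorbs two grouping variables by repeatedly demeaning within groups and
then regresses the residualized outcome on the residualized Male
indicator. It assumes a linear model with grade and course fixed effects;
the function names, the synthetic data, and the effect size planted in
that data are all hypothetical and are not the authors' code or data.

```python
import random
from collections import defaultdict


def demean_by(values, groups):
    """Subtract each group's mean from the values belonging to that group."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]


def project_out(values, group_vars, sweeps=50):
    """Alternately demean by each grouping variable (alternating projections)."""
    out = list(values)
    for _ in range(sweeps):
        for groups in group_vars:
            out = demean_by(out, groups)
    return out


def male_coefficient(outcome, male, grade, course):
    """Slope on `male` after absorbing grade and course fixed effects."""
    y = project_out(outcome, [grade, course])
    x = project_out(male, [grade, course])
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


# Synthetic data only: a gender gap of 0.08 is planted on purpose so the
# estimate can be checked against a known value.
random.seed(0)
n = 2000
grade = [random.randrange(5) for _ in range(n)]
course = [random.randrange(10) for _ in range(n)]
male = [random.randrange(2) for _ in range(n)]
outcome = [
    0.08 * m + 0.1 * g + 0.05 * c + random.gauss(0, 0.05)
    for m, g, c in zip(male, grade, course)
]

print(round(male_coefficient(outcome, male, grade, course), 3))
```

Because grade and course effects are removed from both the outcome and
the Male indicator before the final slope is computed, the printed value
should sit close to the planted gap regardless of how large the grade
and course effects are.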

5.4.2      Handcrafted features
We report the standardized results of our analysis in Table 3. We
observe marginally significant differences in the frequency with which
submissions from males and females mention the instructor's name, with
women approximately two percent more likely to do so. Submissions
from women are also longer, by about 0.15 of a standard deviation.
While modest in magnitude, these statistically significant findings
comport with our care work hypotheses that female submitters approach
the task of submitting reviews with more attention to specificity and
investment.
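
To make the two handcrafted features concrete, here is a minimal sketch
of how a word count and an instructor-name indicator might be computed
and standardized. The tokenization (whitespace splitting), the
case-insensitive substring match, the helper names, and the example
reviews are assumptions for illustration; the source does not describe
how the features were extracted.

```python
from statistics import mean, pstdev


def word_count(text):
    """Number of whitespace-separated tokens in a review."""
    return len(text.split())


def mentions_name(text, instructor_names):
    """True if any instructor name appears in the review (case-insensitive)."""
    lowered = text.lower()
    return any(name.lower() in lowered for name in instructor_names)


def standardize(values):
    """Convert raw values to z-scores (mean 0, standard deviation 1)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical reviews and a hypothetical instructor name.
reviews = [
    "Professor Rivera explained recursion clearly.",
    "Take it.",
    "Lots of work but worth it overall",
]
counts = [word_count(r) for r in reviews]
flags = [mentions_name(r, ["Rivera"]) for r in reviews]
z_counts = standardize(counts)
```

Standardizing the word counts is what allows a coefficient such as the
one in Table 3 to be read in standard-deviation units.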

5.5     Topic Models
We ran structural topic models, allowing topic prevalence to vary with
submitter gender. We tuned the number of topics over the range 2 to 50
using an exclusivity measure called FREX [2] [20]. FREX (see Equation 2)
is a weighted harmonic mean of two quantities: frequency (F), how often
a word occurs in a topic, and exclusivity (E), how frequently the word
occurs in a given topic relative to others. The parameter ω sets the
relative importance of these two components. We used the default value
of ω = 0.7, which weights exclusivity more heavily.
        FREX = \left( \frac{\omega}{E} + \frac{1 - \omega}{F} \right)^{-1}        (2)
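
As a small worked illustration of Equation 2, the sketch below computes
the weighted harmonic mean for single (E, F) pairs. It assumes that E and
F have already been scaled to scores in (0, 1]; how those scores are
derived from a fitted topic model is not shown, and the function name
`frex` and the input values are hypothetical.

```python
def frex(exclusivity, frequency, omega=0.7):
    """Weighted harmonic mean of exclusivity (E) and frequency (F), per Equation 2."""
    if not (0 < exclusivity <= 1 and 0 < frequency <= 1):
        raise ValueError("E and F are assumed to be scores in (0, 1]")
    return 1.0 / (omega / exclusivity + (1.0 - omega) / frequency)


# Two hypothetical words: one exclusive but rare, one frequent but shared.
exclusive_word = frex(exclusivity=0.9, frequency=0.3)
frequent_word = frex(exclusivity=0.3, frequency=0.9)
print(exclusive_word > frequent_word)
```

With ω = 0.7 the exclusive-but-rare word scores higher than the
frequent-but-shared one, which is the sense in which this weighting
favors exclusivity.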

Using this criterion, we found the locally optimal number of topics to
be 32. We then hand-labelled each topic by reading the ten statements
with the highest probability of belonging to that topic (see Figure 4,
Top). The most common topics were comments about a course being a
tutorial, a suggestion to take the course, or a positive review. The
least common topics were highly specific suggestions and issues
pertaining to course prerequisites. While we did not see substantial
gender variation in topics overall, there are exceptions of note. First,
submissions from males are more likely to talk about math prerequisites
and to claim that instruction was poorly organized or of poor quality.
They are also more likely to discuss course organization and
instruction. Submissions from females are more likely to bear topics
pertaining to workload, study practices, and attendance. These patterns
provide at least modest evidence that women are offering relatively more
specific advice that may be relevant to larger numbers of future
students.

We note an important caveat to this analysis, however. In contrast
with the above studies, the results of the topic models presented in
this section do not control for grades or course selection; thus
reported gender differences in topic prevalence may be an artifact of
these other factors. We attempted to model the data with all of these
parameters but found the models to be degenerate.

     Figure 4: (Top): Topic Proportions
     (Bottom): Differences in Topic Prevalence by Gender


6.    DISCUSSION
Even while they are controversial for evaluating instructors and
instruction, course reviews are ubiquitous features of the US higher
education landscape and potentially powerful tools for education
data science. In the work presented here we have sought to demonstrate
the promise of course reviews as a window into students’
perceptions of their academic experiences and their orientation to
the task of submitting evaluations. Taking advantage of archival data
that included 11,255 reviews submitted to 251 computer science courses
at a single university between 2015 and 2017, linked to administrative
information describing submitters’ gender (M/F) and grades,
we found patterned variation in who submits course reviews, and
how.

In three observational studies we found that (a) women and those
earning high grades were disproportionately likely to submit reviews;
(b) the phrasing of close-ended review prompts influenced
patterns of response by gender; and (c) responses to qualitative review
prompts differed subtly but significantly by gender, with women
writing somewhat more positive, individualized, and longer reviews.
These empirical findings comport with theoretical insights
from educational social psychology and feminist social science,
which suggest gender variation in how men and women perceive
their own academic accomplishments and their obligations for the
well-being of others.

While the empirical findings presented here are modest, they suggest
the promise of leveraging course reviews for cumulative science in
at least two ways.

First, we note that the inquiries presented here are based entirely on
the premise that course reviews and submitter demographic information
are "found" data. To the extent that virtually every US college
and university possesses data such as these, we can only imagine the
number and variety of insights that might be gained from parallel
investigations at other schools. To a nascent field whose promise
lies substantially in observing phenomena at scale, course reviews
provide exceptionally promising sources of data for education data
science.

Second, there is every reason to imagine that education data scientists
might collaborate with school administrators to more explicitly
and conscientiously instrument reviews for systematic experimental
and quasi-experimental research. The basic conditions for such inquiries
are already in place and sustained by established administrative
rhythms: schools have offices conducting the reviews, students
anticipate receiving them, and they take place multiple times a year.
It is possible to imagine substantial scientific insight through the
linkage of review submissions with administrative data describing
characteristics of submitters. The initial efforts presented here
provide an inkling of this promise.

As with any novel research strategy, pursuing education data science
through course reviews comes with important ethical considerations
regarding participant consent and responsible use. We are grateful
that such discussions are already well underway nationwide [6],
and we hope that our own illustrative work here might helpfully
contribute to them. Indeed, addressing questions of responsible use
of student data in the context of course reviews may have the additional
benefit of improving the collective value of an institutional
practice currently regarded with ambivalence and suspicion but that,
in whatever form, will likely be part of the academic landscape for
a long time.

7.   REFERENCES
 [1] Meredith J. D. Adams and Paul D. Umbach. 2012.
     Nonresponse and Online Student Evaluations of Teaching:
     Understanding the Influence of Salience, Fatigue, and
     Academic Environments. Research in Higher Education 53, 5
     (8 2012), 576–591. DOI:
     http://dx.doi.org/10.1007/s11162-011-9240-5
 [2] Edoardo M Airoldi and Jonathan M Bischof. 2012. A Poisson
     convolution model for characterizing topical content with
     word frequency and exclusivity. arxiv.org (2012).
     https://arxiv.org/abs/1206.4631
 [3] Christine Alvarado, Yingjun Cao, and Mia Minnes. 2017.
     Gender Differences in Students’ Behaviors in CS Classes
     throughout the CS Major. In Proceedings of the 2017 ACM
     SIGCSE Technical Symposium on Computer Science
     Education. ACM, New York, NY, USA, 27–32. DOI:
     http://dx.doi.org/10.1145/3017680.3017771
 [4] A.J. Alvero, Noah Arthurs, anthony lising antonio,
     Benjamin W. Domingue, Ben Gebre-Medhin, Sonia Giebel,
     and Mitchell L. Stevens. 2020. AI and Holistic Review. In
     Proceedings of the AAAI/ACM Conference on AI, Ethics, and
     Society. ACM, New York, NY, USA, 200–206. DOI:
     http://dx.doi.org/10.1145/3375627.3375871
 [5] David N. Beede, Tiffany A. Julian, David Langdon, George
     McKittrick, Beethika Khan, and Mark E. Doms. 2011.
     Women in STEM: A Gender Gap to Innovation. SSRN
     Electronic Journal (8 2011). DOI:
     http://dx.doi.org/10.2139/ssrn.1964782
 [6] Michael Brown and Carrie Klein. 2020. Whose Data? Which
     Rights? Whose Power? A Policy Discourse Analysis of
     Student Privacy Policy Documents. The Journal of Higher
     Education (2020), 1–30.
 [7] Shelley J. Correll. 2004. Constraints into Preferences: Gender,
     Status, and Emerging Career Aspirations. American
     Sociological Review 69, 1 (2 2004), 93–113. DOI:
     http://dx.doi.org/10.1177/000312240406900106
 [8] Erica DeFrain. 2016. An Analysis of
     Differences in Non-Instructional Factors Affecting
     Teacher-Course Evaluations over Time and Across
     Disciplines. (2016).
     https://repository.arizona.edu/handle/10150/621018
 [9] Paula England. 2005. Emerging Theories of Care Work.
     Annual Review of Sociology 31, 1 (8 2005), 381–399. DOI:
     http://dx.doi.org/10.1146/annurev.soc.31.041304.122317
[10] Nancy Folbre. 1995. “Holding hands at midnight”: The
     paradox of caring labor. Feminist Economics 1, 1 (3 1995),
     73–92. DOI: http://dx.doi.org/10.1080/714042215
[11] Maarten Goos and Anna Salomons. 2017. Measuring teaching
     quality in higher education: assessing selection bias in course
     evaluations. Research in Higher Education 58, 4 (6 2017),
     341–364. DOI:
     http://dx.doi.org/10.1007/s11162-016-9429-8
[12] Nan Hu, Jie Zhang, and Paul A. Pavlou. 2009. Overcoming
     the J-shaped distribution of product reviews. (10 2009). DOI:
     http://dx.doi.org/10.1145/1562764.1562800
[13] Neneh Kowai-Bell, Rosanna E. Guadagno, Tannah Little,
     Najean Preiss, and Rachel Hensley. 2011. Rate My
     Expectations: How online evaluations of professors impact
     students’ perceived control. Computers in Human Behavior
     27, 5 (9 2011), 1862–1867. DOI:
     http://dx.doi.org/10.1016/J.CHB.2011.04.009
[14] Cong Li and Xiuli Wang. 2013. The power of eWOM: A
     re-examination of online student evaluations of their
     professors. Computers in Human Behavior 29, 4 (7 2013),
     1350–1357. DOI:
     http://dx.doi.org/10.1016/J.CHB.2013.01.007
[15] Jin Liu and Julie Cohen. 2020. Measuring Teaching Practices
     at Scale: A Novel Application of Text-as-Data Methods.
     EdWorkingPapers. (2020).
     https://www.edworkingpapers.com/ai20-239
[16] Anna May, Johannes Wachs, and Anikó Hannák. 2019.
     Gender differences in participation and reward on Stack
     Overflow. Empirical Software Engineering 24, 4 (8 2019),
     1997–2019. DOI:
     http://dx.doi.org/10.1007/s10664-019-09685-x
[17] Arunachalam Narayanan, William J. Sawaya, and Michael D.
     Johnson. 2014. Analysis of Differences in Nonteaching
     Factors Influencing Student Evaluation of Teaching between
     Engineering and Business Classrooms. Decision Sciences
     Journal of Innovative Education 12, 3 (7 2014), 233–265.
     DOI: http://dx.doi.org/10.1111/dsji.12035
[18] JW Pennebaker, RL Boyd, K Jordan, and K Blackburn. 2015.
     The development and psychometric properties of LIWC2015.
     (2015).
     https://repositories.lib.utexas.edu/handle/2152/31333
[19] Katie Redmond, Sarah Evans, and Mehran Sahami. 2013. A
     large-scale quantitative study of women in computer science
     at Stanford University. In Proceeding of the 44th ACM
     technical symposium on Computer science education -
     SIGCSE ’13. ACM Press, New York, New York, USA, 439.
     DOI: http://dx.doi.org/10.1145/2445196.2445326
[20] J Reich, DH Tingley, J Leder-Luis, and M Roberts. 2014.
     Computer-assisted reading and discovery for student
     generated text in massive open online courses. (2014).
     https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2499725
[21] Lauren A Rivera and András Tilcsik. 2019. Scaling down
     inequality: Rating scales, gender bias, and the architecture of
     evaluation. American Sociological Review 84, 2 (2019),
     248–274.
[22] Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley,
     Christopher Lucas, Jetson Leder-Luis, Shana Kushner
     Gadarian, Bethany Albertson, and David G. Rand. 2014.
     Structural Topic Models for Open-Ended Survey Responses.
     American Journal of Political Science 58, 4 (10 2014),
     1064–1082. DOI: http://dx.doi.org/10.1111/ajps.12103
[23] Linda J. Sax, Shannon K. Gilmartin, and Alyssa N. Bryant.
     2003. Assessing response rates and nonresponse bias in web
     and paper surveys. (2003). DOI:
     http://dx.doi.org/10.1023/A:1024232915870
[24] T Sliusarenko, LH Clemmensen International . . . , and
     Undefined 2013. 2013. Text Mining in Students’ Course
     Evaluations. pdfs.semanticscholar.org (2013).
     https://pdfs.semanticscholar.org/cb02/b880ef86371461b3ebe46d2f8c293b43c7a2.pdf
[25] Philip Stark, Kellie Ottoboni, and Anne Boring. 2016. Student
     Evaluations of Teaching (Mostly) Do Not Measure Teaching
     Effectiveness. ScienceOpen Research (2016). DOI:
     http://dx.doi.org/10.14293/s2199-1006.1.sor-edu.aetbzc.v1
[26] Josh Terrell, Andrew Kofink, Justin Middleton, Clarissa
     Rainear, Emerson Murphy-Hill, Chris Parnin, and Jon
     Stallings. 2017. Gender differences and bias in open source:
     pull request acceptance of women versus men. PeerJ
     Computer Science 3 (5 2017), e111. DOI:
     http://dx.doi.org/10.7717/peerj-cs.111
[27] Bob Uttl, Carmela A. White, and Daniela Wong Gonzalez.
     2017. Meta-analysis of faculty’s teaching effectiveness:
     Student evaluation of teaching ratings and student learning are
     not related. Studies in Educational Evaluation 54 (9 2017),
     22–42. DOI:
     http://dx.doi.org/10.1016/J.STUEDUC.2016.08.007