=Paper= {{Paper |id=Vol-2734/paper4 |storemode=property |title=Evaluations as Research Tools: Gender Differences in Academic Self-Perception and Care Work in Undergraduate Course Reviews |pdfUrl=https://ceur-ws.org/Vol-2734/paper4.pdf |volume=Vol-2734 |authors=David Lang,Youjie Chen,Andreas Paepcke,Mitchell L. Stevens |dblpUrl=https://dblp.org/rec/conf/edm/LangCPS20 }} ==Evaluations as Research Tools: Gender Differences in Academic Self-Perception and Care Work in Undergraduate Course Reviews== https://ceur-ws.org/Vol-2734/paper4.pdf

Evaluations as Research Tools: Gender Differences in
Academic Self-Perception and Care Work in
Undergraduate Course Reviews

David Lang, Youjie Chen, Andreas Paepcke, Mitchell L. Stevens
Stanford University
Stanford, CA
{dnlang86, minachen, paepcke, stevens4}@stanford.edu

ABSTRACT
Student course reviews are rarely considered as research instruments, yet their ubiquity makes them promising tools for education data science. To illustrate this potential, we use a corpus of student reviews to observe gender differences in how students appraise their own learning and in the advice they give to future students. We find systematic differences in who submits course reviews, with female and academically high-achieving students more likely to submit. Among submitters, we find (a) females understate their achievement of learning goals relative to males earning the same grades; (b) females offer lengthier written advice to future students than males; (c) advice written by females exhibits more positive tone, even after accounting for grades and course selections.

Keywords
care work; course evaluations; gender; higher education; topic models; survey design 1

1. INTRODUCTION
Student course reviews are a controversial subject in academia. While considerable work has addressed problems of validity and bias in the use of these instruments for assessing instructors and instruction [25] [21], few have recognized reviews as potentially useful research tools for education data science.

Several features of course reviews make them potentially attractive for researchers. First, reviews are ubiquitous features of teaching and learning in US higher education. Because they are so commonly solicited and so frequently submitted, the data yielded from reviews represent a very wide swath of student populations. Second, reviews are routinely submitted through online platforms and carried out by administrative units supported on hard budget lines, bringing the marginal cost of acquiring research data through reviews close to zero. Third, it is technically simple to link information obtained through reviews with institutional data describing those who submit them. Thus while course reviews may be problematic means of assessing the quality of instructors and instruction – a matter on which we make no comment here – we believe these data hold substantial promise for education data science.

To illustrate this promise, we leverage a corpus of 11,255 student reviews submitted by undergraduate students enrolled in Computer Science (CS) classes at a private research university during the 2015-16 and 2016-17 academic years. Because each review is linked with the academic transcript and self-reported gender of its submitter, we are able to observe variation in submissions by an important aspect of student identity and documented academic accomplishment.

While a variety of student characteristics are of interest to education data scientists, we focus on students’ gender for reasons both practical and theoretical. For privacy purposes, our case university has currently granted researcher access to only a few variables describing review submitters; we utilize those available data here. Yet we also have two theoretical motivations for focusing on gender. First, borrowing from social psychology, we recognize that women tend to under-estimate their own abilities, while men tend to over-estimate, conditional on measured accomplishment [7]. Second, borrowing from feminist social science, we posit that submission of a course review is a form of care work – a voluntary investment in the well-being of others – and thus implicated differently in feminine and masculine gender roles and identities [9]. Our findings comport with the contours of these larger literatures in ways that are both important in their own right, and instructive for any future deployments of course reviews for education data science.

We pursue three sets of analyses below. In the first set, we observe variation in rates of review submission by gender and earned grades. These analyses illustrate how researchers might test the representativeness of corpora of reviews. Second, we observe how submitters respond to multiple close-ended review prompts that target self-assessments of learning but are phrased differently. These analyses illustrate how review design may interact with student characteristics to produce patterned variation in reported learning progress. Third, we conduct computational text analyses of submissions to an open-ended review prompt. These analyses illustrate how qualitative reviews can be efficiently leveraged for scientific insight.

1 "Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

2. RELATED WORK
2.1 Course Reviews

Research on course reviews typically has focused on questions of their value as instruments for evaluating the quality of instruction. Analyses conducted at scale typically focus on whether measures of learning are correlated with instructional quality [27].

One potential concern about course reviews from the psychometrics literature is the potential for differential item functioning, a phenomenon in which respondents of equal ability will exhibit different responses to a given survey item or question. Studies of differential item functioning in course reviews have focused on the quantitative difficulty of a class, or characteristics of the instructor [17] [8]. Relatively little work has focused on how student characteristics may be associated with differential item functioning on course reviews.

If findings from copious research on product reviews translate to academic course reviews, we would expect that students with high-valence opinions about a course are more likely to respond, resulting in a bimodal or j-shaped distribution [12]. In practice, these findings may not translate. There are often other incentives for filling out reviews, for example, giving students earlier access to their final grades. Those who submit reviews may not be representative of the larger population of students who enrolled in a particular course, or of the overall campus population. This problem is exacerbated when analysts do not have access to reviewer characteristics such as gender or grades [1]. Past work that has tried to adjust for non-response bias in reviews has suggested that non-response bias tends to favor positive reviews [11].

There is varying theory around why students may opt to submit reviews. Studies conclude that students are more likely to respond to course evaluations if they are majoring in the subject of the course [1]. Other work suggests that female students are generally more likely to respond to course evaluations than males [23]. However, little work has focused specifically on this topic in Computer Science courses.

Under experimental conditions where researchers manipulated the information content and valence of course reviews, researchers found that these factors had material effects on course enrollment decisions. Students were more likely to enroll in courses if course evaluations had positive valence, particularly if there was a large number of such evaluations [14]. Similar work found that exposure to positive or negative course reviews had modest to large effects on students’ expected performance within a course, and their likelihood of recommending the course in the future [13]. These findings are particularly relevant for CS courses as CS courses tend to have relatively high enrollments compared to other subjects.

2.2 Text Analysis
There is a burgeoning literature on using computational text analysis and methods to quantify differences in corpora based on characteristics of the author and the text. These techniques have been quickly adopted to educational applications but we have seen relatively few instances of text analysis of course reviews. When text analyses of course evaluations are done, they are typically focused on keyword extraction and on predicting Likert item responses as a function of the text [24].

We group text analysis methods into the following three categories:

2.2.1 Dictionary and rule-based approaches
Dictionary-based approaches sort the words of documents into groups of predefined categories such as sentiment. The most popular of these dictionaries is the Linguistic Inquiry and Word Count (LIWC) dictionary [18]. In addition to grouping words into 75 distinct categories and themes (e.g. family, power, death), the dictionary generates four psycho-social variables that were validated on college application essays through a rating process. Each of these variables is scored on a 1 to 99 interval where 1 is a complete lack of the construct and 99 is a highly pronounced form of the construct. These constructs are:

1. Tone- This is a summary variable describing the emotional quality of the text. A score of 99 reflects a positive tone and a score of 1 reflects a negative tone. A score of 50 represents neutral valence.
2. Analytic- This is a measure of how much formal logic is used in the text. A score of 1 indicates little use of formal logic and a score of 99 indicates statements with a great deal of formal logic.
3. Authenticity- This is a measure of the sincerity/honesty of a text. A score of 1 indicates insincerity and a score of 99 indicates high sincerity.
4. Clout- This is a measure of the text’s authority, relative position, and confidence. A score of 1 suggests relatively little authority and a score of 99 suggests high authority.

Researchers can also create their own custom measures of text. We take advantage of this affordance by capturing mentions of instructors’ names.

2.2.2 Token-based approaches
Token-based approaches treat every word in a text as input into a model. These approaches often result in the loss of syntactic meaning but are often very effective at classifying documents. Token-based approaches have proven effective at detecting socioeconomic features of authors such as race, gender, and income in college application essays [4]. Other applications have generated algorithms with high predictive validity on classroom observation and evaluation rubrics [15].

2.2.3 Unsupervised approaches
The basic premise behind unsupervised approaches is that texts include multiple topics, and topics comprise words. Using unsupervised methods such as Latent Dirichlet Allocation (LDA), we can group texts categorically. These same methods have been augmented recently to allow the distribution of topics to co-vary with other relevant metadata, a technique known as structural topic modeling [22]. In this case, we can examine the concentration of topics by features such as student gender or grades. This method further allows us to perform statistical inference to see if topic preponderance varies systematically by characteristics of authors.

2.3 Gender Differences in Academic Experiences, Skill Perception and Care Work
Our work has three motivations from prior social-science literature on higher education and gender. The first is that male and female students may have different experiences when taking the same courses. For example, women are less comfortable asking questions and have less confidence in CS courses than their male peers [19]. This "gender confidence gap" grows as students take more advanced courses [3]. Analyses of communal academic resources in CS programs
find substantial differences in how contributions by male and female users are acknowledged on GitHub and Stack Overflow [16] [26]. Consequences of these phenomena may extend beyond college, as women with degrees in STEM fields are less likely than men to enter STEM occupations [5]. While course reviews cannot capture empirical variation in experience per se, they can capture how submitters make sense of those experiences.

Second, gendered differences in skill perception may influence how students report their experiences and learning gains in reviews. While women tend to approach STEM fields with less confidence, men tend to over-estimate their abilities. Experimental work by Correll [7] found that men expressed inflated perceptions of their own skill at completing quantitative tasks compared to women performing at the same level of measured accomplishment. Together these inquiries suggest that course reviews may bear traces of gendered patterns of academic self-perceptions.

Our third motivation is the gendered character of care work. Social scientists define care work as work that attends to the well-being of others. It comprises activities and services intended to help other people develop their capabilities and pursue their goals [9]. Care work is consistently associated with femininity and female role expectations, and often is unpaid or poorly compensated [10]. To the extent that submitting course reviews is an act of assistance – to improve classes and to inform future students – it is appropriately theorized as a form of care work. Thus we might expect that female and male students will approach the task of course reviews with different dispositions, such that the number, extensiveness, and content of course evaluations may vary by gender of submitters.

3. RESEARCH QUESTIONS
Our working hypothesis is that course reviews will exhibit gendered patterns of academic experience, self-perceptions and advice-giving. Specifically: (1) reviews from male students will profess stronger learning gains; (2) reviews from female students will exhibit characteristics of care work.

We group our analyses into two parts. The first part examines variation by gender and earned grades on review submission rates and on Likert-scale items on course reviews. The second part examines variation in male and female responses to a qualitative review prompt eliciting advice for future students considering the same courses.

3.1 Review Submission Rates and Likert Items
H1: Female students will respond to course evaluations more often than males.

Our care work hypothesis is that female students will be more responsive to institutional requests for reviews. We investigate this hypothesis using an exact two-sample binomial test. We examine results by gender and grade.

H2: There are systematic differences in response rate by grade.

There are many competing theories of how grades might influence response rates to course evaluations. If students have a poor grade, they may be more inclined to view the evaluation as an opportunity to retaliate against the grader. Alternatively, students who receive a low grade may opt to avoid opportunities to reflect on negative experiences. We will investigate this hypothesis utilizing a simple χ² test of response rates by grade.

H3: Female students will understate their achievements relative to their male counterparts.

We hypothesize that men are more likely to see course reviews as a form of positive self-reflection and promotion, and that females are more likely to see reviews as a form of care work. We believe these differences will have stronger valence in items that focus on a student’s accomplishments rather than other constructs such as student learning. We will model these analyses as a fixed-effect regression model with the following specification:

$$Y_{ij} = \beta_1 \mathrm{Male}_i + \mathrm{Grades}_{ij} + \Gamma_j + \varepsilon_{ij} \qquad (1)$$

The subscripts $i$ and $j$ correspond to indices for student and course. The $Y$ variable corresponds to our focal outcome variable, in this case, responses to a Likert item. Male is an indicator variable for whether the student self-reports identifying as male, and $\beta_1$ is the coefficient associated with this variable. We represent course effects with $\Gamma_j$ to control for factors like the difficulty of the course or instructional quality. We also control for grades with an additional fixed effect for each possible grade a student could receive 2. The error is represented by $\varepsilon$. Errors are clustered at the course level.

3.2 Open-Response Questions
We pay particular attention to open-response items in course reviews. We suspect that such items may be the most valuable and least explored element of course reviews. As such, we may be able to detect subtle differences in qualitative responses.

3.2.1 Psychosocial variables
H4: Course evaluations written by females will express more positive and sincere sentiment.

Given our care work hypothesis, we believe that female students will express more positive sentiment in open-response items. We use the same analytical strategy as in equation 1 using LIWC’s tone variable. Specifically, we examine gender differences in these psycho-social variables after controlling for variation that can be attributed to the course, or to student grade. We report outcomes in standardized effect sizes to facilitate interpretability.

We also hypothesize that a corollary to the care work hypothesis is that female students will use more "I" statements and tentative language. This tendency would manifest as reviews written by female students exhibiting more authentic language.

H5: Course evaluations written by male students will express more clout.

Based on prior literature pertaining to a confidence gap in CS by gender, we hypothesize this trend should manifest as fewer expressions of clout and authority in course evaluations by female authors.

3.2.2 Hand-crafted rules
H6: Female students will write more on course evaluations and mention the instructor more often.

We hypothesize that care work will manifest in other ways beyond psycho-social variables. Specifically: female submitters will put more effort into reviews by writing more; and they will take a more individualized approach by mentioning the instructor explicitly.

2 In our analyses, there are over twenty grade types, including + and - variants as well as credit and no-credit courses. We report A, B, C, D, and not passing grades for simplicity.
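
As an aside on the specification in equation 1, with its course effects and one fixed effect per grade type, the following is a minimal sketch of how such a coefficient could be estimated. It is not the authors' software: the data are synthetic, every name (`fe_coefficient`, `absorb`, `TRUE_BETA`) is hypothetical, and the fixed effects are absorbed by alternating group demeaning, a swapped-in technique chosen to keep the example dependency-free. Course-clustered standard errors are deliberately omitted.

```python
import random
from collections import defaultdict

def demean_by(values, groups):
    """Subtract each group's mean from its members (one fixed-effect sweep)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def absorb(values, group_lists, sweeps=100):
    """Alternately demean by each grouping to absorb several fixed effects."""
    out = list(values)
    for _ in range(sweeps):
        for groups in group_lists:
            out = demean_by(out, groups)
    return out

def fe_coefficient(y, x, group_lists):
    """Slope of y on x after partialling out the fixed effects from both."""
    y_t = absorb(y, group_lists)
    x_t = absorb(x, group_lists)
    return sum(a * b for a, b in zip(x_t, y_t)) / sum(a * a for a in x_t)

# Synthetic data (an assumption for illustration, not the paper's data).
rng = random.Random(0)
TRUE_BETA = 0.08
course_effect = {c: rng.uniform(-1, 1) for c in range(10)}
grade_effect = {"A": 0.6, "B": 0.3, "C": 0.0}

male, course, grade, y = [], [], [], []
for _ in range(2000):
    m = rng.randint(0, 1)          # Male_i indicator
    c = rng.randrange(10)          # course j
    g = rng.choice("ABC")          # grade earned in course j
    male.append(float(m))
    course.append(c)
    grade.append(g)
    y.append(TRUE_BETA * m + course_effect[c] + grade_effect[g] + rng.gauss(0, 0.05))

beta_hat = fe_coefficient(y, male, [course, grade])
print(round(beta_hat, 3))
```

In practice one would likely rely on an established regression package that also reports clustered standard errors; the sketch only shows what "controlling for course and grade" does to the Male coefficient.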

We have crafted two simple measures to facilitate investigation of
this hypothesis: the length of each response in number of words,
and a capture of each instance of an instructor name.
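
The two measures just described can be sketched as follows. This is an illustrative assumption about how such features might be computed, not the authors' implementation: the function names, the whitespace tokenization, and the example instructor name "Smith" are all hypothetical.

```python
import re

def word_count(text):
    """Length of a response in whitespace-delimited words."""
    return len(text.split())

def instructor_mentions(text, instructor_names):
    """Count case-insensitive whole-word occurrences of any instructor name."""
    total = 0
    for name in instructor_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total

# Hypothetical review text and instructor name, for illustration only.
review = ("Take it with Professor Smith. Smith's lectures are clear, "
          "but start the assignments early.")
n_words = word_count(review)
n_mentions = instructor_mentions(review, ["Smith"])
print(n_words, n_mentions)
```

A whole-word match avoids counting names embedded in longer words, though it would still miscount instructors whose surnames are also common words.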

3.2.3 Topic models
H7: There will be systematic variation in topics depending on the
author’s gender.

Our final analysis is exploratory, using structural topic models to
identify whether qualitative components of the corpora vary
systematically with the gender of the submitter. The goal of this analysis is
to develop efficient means of sorting and categorizing qualitative
components of course reviews.

4. DATA
Data comprise information describing enrollments in courses offered
through the Computer Science (CS) Department of a private research
university during the 2015-16 and 2016-17 academic years, and the
entire population of formal reviews submitted by students enrolled
in those courses. Reviews were administered near the end of the
academic term but before the beginning of the term’s official final
exam period. As an incentive for submitting reviews, students were
given the ability to see their final course grades a bit earlier than
non-submitters.

In total these data yield 11,255 student responses from 251 courses. Courses range in character from very large introductory lecture-and-lab formats to small advanced seminars. Institutional data made available to us for analysis include each student’s grade, gender, GPA, declared major (if known), and academic year. We combine these data with the corpus of reviews submitted for CS courses during the study period specified above. Approximately one-third of submitted reviews are from female students, and approximately half are from undergraduates. We cannot track or identify students enrolling in multiple CS courses during the study period; however, we can compute response rates by grade and gender.

We limit our analysis to responses in which submitters offered a response to the review’s only open-ended question. That question reads:

"What would you like to say about this course to a student who is considering taking it in the future?"

The prompt is very well aligned with our care work hypotheses, in that it specifically asks submitters to give advice to a hypothetical future student. Individual responses vary substantially in length: from a single character to 5,964 characters (the latter equivalent to 1004 words). The mean response length is 132 characters – approximately the length of a tweet. The entire corpus of responses to this question is 300,000 words.

Additionally we analyze responses to two review prompts with five-point Likert responses: 3

• How much did you learn from this course?
• How well did you achieve the learning goals of this course?

Aggregated responses to these two prompts appear in Figure 1.

Figure 1: Responses to Likert item questions

3 We limit this analysis to complete cases because one item was not consistently administered across courses. We ignore questions pertaining to quality of instruction and focus on student learning goals.

5. ANALYSES
5.1 H1: Response Rates By Gender
Rates of review submission by student gender and earned grade are reported in Figure 2. Two features are notable. First, females are more likely to submit overall. On average, females submit reviews in 78.0% of opportunities to do so; males, 74.5% (p<.001).

Second, those receiving higher grades in a course are more likely to submit reviews. Females receiving a grade of "A" are 3.7% more likely to submit than their male counterparts. The gender submission gap is greatest among students receiving a grade of "B," with female "B" recipients 6.5% (p<.001) more likely to submit than males. We do not observe statistically significant gender differences in submission rates for those receiving grades below "B"; however, such grades represent fewer than 5% of grades in the research sample.

5.2 H2: Variation in Submission by Grade
We also examined whether review submission varied systematically with grades. Figure 2 indicates a strong positive correlation between grade and likelihood of submission. A student with a grade of A or higher has an 80% chance of responding to the evaluation, while students who do not pass the course or receive credit without a grade responded approximately 50 percent of the time. Effectively,
this means that students who fail courses are represented by review submissions far less often than students who excel in a course (roughly 50% versus 80%). Differences are statistically significant with a χ² statistic of 395.37 and a p-value of less than 0.001.

Figure 2: Survey response rates by grade

5.3 H3: Reports of Learning and Goal-Meeting
We observe the proportion of students reporting having achieved the learning goals of a course Extremely Well by grade in Figure 3. Not surprisingly, we find a strong direct correlation with course grade, such that reported goal achievement declines as grades fall. What is striking is that at every grade level, there is a clear gap in reported goal achievement, with males more likely to report achievement than females earning the same grade.

Figure 3: Percent of Students Saying they Achieved learning goals extremely well

We extend this analysis to see if this same pattern occurs with the question of how much students learn, using the same specification as described in equation 1; results appear in table 1. We see that after controlling for grades and course, males and females exhibit no differences in self-reported measures of ‘how much they learned’ in a class. However, when we look at a similar question about ‘achievement of learning goals’, we see a stark difference. Male students are 8 percentage points more likely than females to state they mastered the learning goals of a course. This finding suggests two concerns. First, given the similarity of these questions, we see that subtle differences in phrasing yield substantial differences in student responses. Second, female students report lower levels of mastery even after controlling for grades. Notably, these surveys are collected before students know their final grades. These perceptions may change after this information is revealed to them.

                          Learning   Achievement
Male                      0.01       0.08∗∗∗
                          (0.01)     (0.01)
Num. obs.                 9002       9002
R2 (full model)           0.15       0.18
R2 (proj model)           0.00       0.01
Adj. R2 (full model)      0.13       0.16
Adj. R2 (proj model)      −0.03      −0.02
Num. groups: grade        20         20
Num. groups: evalunitid   203        203
∗∗∗ p < 0.001; ∗∗ p < 0.01; ∗ p < 0.05

Table 1: Gender Differences in Likert Items

5.4 Open Text Responses
5.4.1 Psycho-social variables
We report our analyses for H4 and H5 in table 2. We find modest variation by gender in how submitters describe their experience in the same course, conditional on grades. On average, submissions from males evince slightly more negative and slightly less authentic language. While these gender differences are highly significant, their magnitude is modest: on the order of a tenth of a standard deviation. Nevertheless, they are consistent with our care work hypotheses. To wit, men are somewhat more critical and less honest in their reviews than women, suggesting greater empathy and investment among female submitters.

                          Tone       Analytic   Clout    Authentic
Male                      −0.08∗∗    −0.01      0.01     −0.10∗∗∗
                          (0.03)     (0.02)     (0.02)   (0.02)
Num. obs.                 11255      11255      11255    11255
R2 (full model)           0.07       0.05       0.05     0.04
R2 (proj model)           0.00       0.00       0.00     0.00
Adj. R2 (full model)      0.05       0.02       0.02     0.02
Adj. R2 (proj model)      −0.02      −0.02      −0.02    −0.02
Num. groups: grade        20         20         20       20
Num. groups: evalunitid   251        251        251      251
∗∗∗ p < 0.001; ∗∗ p < 0.01; ∗ p < 0.05

Table 2: LIWC Regressions

With respect to our hypothesis around clout, we find little evidence that qualitative open-responses exhibit any significant differences in submissions from males and females.

                          ProfessorName   Word Count
Male                      −0.02∗          −0.15∗∗∗
                          (0.01)          (0.03)
Num. obs.                 11255           11255
R2 (full model)           0.08            0.08
R2 (proj model)           0.00            0.00
Adj. R2 (full model)      0.06            0.06
Adj. R2 (proj model)      −0.02           −0.02
Num. groups: grade        20              20
Num. groups: evalunitid   251             251
∗∗∗ p < 0.001; ∗∗ p < 0.01; ∗ p < 0.05

Table 3: Handcrafted Features

5.4.2 Handcrafted features
We report the standardized results of our analysis in table 3. We observe marginally significant differences in the frequency with which submissions from males and females mention instructor name, with women approximately two percent more likely to do so. Submissions from women are also lengthier – about .15 of a standard deviation. While modest in magnitude, these statistically significant findings comport with our care work hypotheses that female submitters approach the task of submitting reviews with more attention to specificity and investment.

5.5 Topic Models
We ran structural topic models while allowing topic prevalence to vary with submitter gender. We searched over 2 to 50 topics and selected the number of topics using an exclusivity measure called FREX [2] [20]. FREX (see Equation 2) is a weighted harmonic mean of the frequency (F) with which a word occurs in a topic and its exclusivity (E), that is, how frequently the word occurs in a given topic relative to others. The tuning parameter ω sets the relative importance of these two features. We used the default of ω = .7 to favor topics that had more exclusivity.
$$\mathrm{FREX} = \left( \frac{\omega}{E} + \frac{1 - \omega}{F} \right)^{-1} \qquad (2)$$
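
To make the weighting in the FREX measure concrete, here is a minimal sketch of the weighted harmonic mean described above. The function name `frex` and the example values are hypothetical, and the inputs are assumed to be already scaled to the interval (0, 1], for instance as within-topic ranks; that scaling is an assumption of this sketch rather than something stated in the text.

```python
def frex(exclusivity, frequency, omega=0.7):
    """Weighted harmonic mean of exclusivity (E) and frequency (F).

    Both inputs are assumed to lie in (0, 1]; omega is the weight
    placed on exclusivity.
    """
    if not (0 < exclusivity <= 1 and 0 < frequency <= 1):
        raise ValueError("exclusivity and frequency must lie in (0, 1]")
    return 1.0 / (omega / exclusivity + (1.0 - omega) / frequency)

# Hypothetical words: one exclusive but rare, one frequent but shared.
balanced = frex(0.5, 0.5)
exclusive_word = frex(0.9, 0.3)
frequent_word = frex(0.3, 0.9)
print(balanced, exclusive_word, frequent_word)
```

With ω = .7 the exclusive-but-rare word scores higher than the frequent-but-shared one, which is the sense in which this setting favors exclusivity.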

Using this criterion, we found the locally optimal number of topics to be 32. We then hand-labelled each topic by inspecting the ten (10) statements with the highest probability of belonging to that topic (see Figure 4, Top). The most common topics were comments about a course being a tutorial, a suggestion to take the course, or a positive review. The least common topics were highly specific suggestions and issues pertaining to course prerequisites. While we did not see substantial gender variation in topics overall, there are exceptions of note. First, submissions from males are more likely to talk about math prerequisites and claims that instruction was poorly organized or of poor quality. They were also more likely to discuss course organization and instruction. Submissions from females were more likely to bear topics pertaining to workload, study practices, and attendance. These patterns provide at least modest evidence that women are offering relatively more specific advice that may be relevant to larger numbers of future students.

We note an important caveat to this analysis, however. In contrast with the above studies, results of the topic models presented in this section do not control for grades or course selections; thus reported gender differences in topic prevalence may be an artifact of these other factors. We attempted to model the data with all of these parameters but found the models to be degenerate.

Figure 4: (Top): Topic Proportions (Bottom): Differences in Topic Prevalence by Gender

6. DISCUSSION 7. REFERENCES
Even while they are controversial for evaluating instructors and [1] Meredith J. D. Adams and Paul D. Umbach. 2012.
instruction, course reviews are ubiquitous features of the US higher Nonresponse and Online Student Evaluations of Teaching:
education landscape and potentially powerful tools for education Understanding the Influence of Salience, Fatigue, and
data science. In the work presented here we have sought to demon- Academic Environments. Research in Higher Education 53, 5
strate the promise of course reviews as a window into students’ (8 2012), 576–591. DOI:
perceptions of their academic experiences and their orientation to http://dx.doi.org/10.1007/s11162-011-9240-5
the task of submitting evaluations. Taking advantage of archival data [2] Edoardo M Airoldi and Jonathan M Bischof. 2012. A Poisson
that included 11,255 reviews submitted to 251 computer science courses at convolution model for characterizing topical content with
a single university between 2015-2017 that was linked to adminis- word frequency and exclusivity. arxiv.org (2012).
trative information describing submitters’ gender (M/F) and grades, https://arxiv.org/abs/1206.4631
we found patterned variation in who submits course reviews, and
how. [3] Christine Alvarado, Yingjun Cao, and Mia Minnes. 2017.
Gender Differences in Students’ Behaviors in CS Classes
In three observational studies we found that (a) women and those throughout the CS Major. In Proceedings of the 2017 ACM
earning high grades were disproportionately likely to submit re- SIGCSE Technical Symposium on Computer Science
views (b) the phrasing of close-ended review prompts influenced Education. ACM, New York, NY, USA, 27–32. DOI:
patterns of response by gender (c) responses to qualitative review http://dx.doi.org/10.1145/3017680.3017771
prompts differed subtly but significantly by gender, with women [4] A.J. Alvero, Noah Arthurs, anthony lising antonio,
writing somewhat more positive, individualized, and lengthier re- Benjamin W. Domingue, Ben Gebre-Medhin, Sonia Giebel,
views. These empirical findings comport with theoretical insights and Mitchell L. Stevens. 2020. AI and Holistic Review. In
from educational social psychology and feminist social science, Proceedings of the AAAI/ACM Conference on AI, Ethics, and
which suggest gender variation in how men and women perceive Society. ACM, New York, NY, USA, 200–206. DOI:
their own academic accomplishments and their obligations for the http://dx.doi.org/10.1145/3375627.3375871
well-being of others. [5] David N. Beede, Tiffany A. Julian, David Langdon, George
McKittrick, Beethika Khan, and Mark E. Doms. 2011.
While the empirical findings presented here are modest, they suggest Women in STEM: A Gender Gap to Innovation. SSRN
the promise of leveraging course reviews for cumulative science in Electronic Journal (8 2011). DOI:
at least two ways. http://dx.doi.org/10.2139/ssrn.1964782
[6] Michael Brown and Carrie Klein. 2020. Whose Data? Which
First, we note that the inquiries presented here are based entirely on
Rights? Whose Power? A Policy Discourse Analysis of
the premise that course reviews and submitter demographic informa- Student Privacy Policy Documents. The Journal of Higher
tion are "found" data. To the extent that virtually every US college Education (2020), 1–30.
and university possesses data such as these, we can only imagine the
[7] Shelley J. Correll. 2004. Constraints into Preferences: Gender,
number and variety of insights that might be gained from parallel
Status, and Emerging Career Aspirations. American
investigations at other schools. To a nascent field whose promise
Sociological Review 69, 1 (2 2004), 93–113. DOI:
lies substantially in observing phenomena at scale, course reviews
http://dx.doi.org/10.1177/000312240406900106
provide exceptionally promising sources of data for education data
science. [8] Erica DeFrain and Erica DeFrain. 2016. An Analysis of
Differences in Non-Instructional Factors Affecting
Second, there is every reason to imagine that education data scien- Teacher-Course Evaluations over Time and Across
tists might collaborate with school administrators to more explicitly Disciplines. (2016).
and conscientiously instrument reviews for systematic experimental https://repository.arizona.edu/handle/10150/621018
and quasi-experimental research. The basic conditions for such in- [9] Paula England. 2005. Emerging Theories of Care Work.
quiries are already in place and sustained by established administra- Annual Review of Sociology 31, 1 (8 2005), 381–399. DOI:
tive rhythms: schools have offices conducting the reviews, students http:
anticipate receiving them, and they take place multiple times a year. //dx.doi.org/10.1146/annurev.soc.31.041304.122317
It is possible to imagine substantial scientific insight through the [10] Nancy Folbre. 1995. “Holding hands at midnight”: The
linkage review subsmissions with with administrative data describ- paradox of caring labor. Feminist Economics 1, 1 (3 1995),
ing characteristics of submitters. The initial efforts presented here 73–92. DOI:http://dx.doi.org/10.1080/714042215
provide an inkling of this promise. [11] Maarten Goos and Anna Salomons. 2017. Measuring teaching
quality in higher education: assessing selection bias in course
As with any novel research strategy, pursuing education data science evaluations. Research in Higher Education 58, 4 (6 2017),
through course reviews comes with important ethical considerations 341–364. DOI:
regarding participant consent and responsible use. We are grateful http://dx.doi.org/10.1007/s11162-016-9429-8
that such discussions are already well underway nationwide [6] [12] Nan Hu, Jie Zhang, and Paul A. Pavlou. 2009. Overcoming
and we hope that our own illustrative work here might helpfully the J-shaped distribution of product reviews. (10 2009). DOI:
contribute to them. Indeed, addressing questions of responsible use http://dx.doi.org/10.1145/1562764.1562800
of student data in the context of course reviews may have the addi- [13] Neneh Kowai-Bell, Rosanna E. Guadagno, Tannah Little,
tional benefit of improving the collective value of an institutional Najean Preiss, and Rachel Hensley. 2011. Rate My
practice currently regarded with ambivalence and suspicion but that, Expectations: How online evaluations of professors impact
in whatever form, will likely be part of the academic landscape for students’ perceived control. Computers in Human Behavior
a long time. 27, 5 (9 2011), 1862–1867. DOI:

http://dx.doi.org/10.1016/J.CHB.2011.04.009
[14] Cong Li and Xiuli Wang. 2013. The power of eWOM: A
re-examination of online student evaluations of their
professors. Computers in Human Behavior 29, 4 (7 2013),
1350–1357. DOI:
http://dx.doi.org/10.1016/J.CHB.2013.01.007
[15] Jin Liu and Julie Cohen. 2020. Measuring Teaching Practices
at Scale: A Novel Application of Text-as-Data Methods.
EdWorkingPapers (2020).
https://www.edworkingpapers.com/ai20-239
[16] Anna May, Johannes Wachs, and Anikó Hannák. 2019.
Gender differences in participation and reward on Stack
Overflow. Empirical Software Engineering 24, 4 (8 2019),
1997–2019. DOI:
http://dx.doi.org/10.1007/s10664-019-09685-x
[17] Arunachalam Narayanan, William J. Sawaya, and Michael D.
Johnson. 2014. Analysis of Differences in Nonteaching
Factors Influencing Student Evaluation of Teaching between
Engineering and Business Classrooms. Decision Sciences
Journal of Innovative Education 12, 3 (7 2014), 233–265.
DOI: http://dx.doi.org/10.1111/dsji.12035
[18] JW Pennebaker, RL Boyd, K Jordan, and K Blackburn. 2015.
The development and psychometric properties of LIWC2015.
(2015).
https://repositories.lib.utexas.edu/handle/2152/31333
[19] Katie Redmond, Sarah Evans, and Mehran Sahami. 2013. A
large-scale quantitative study of women in computer science
at Stanford University. In Proceeding of the 44th ACM
technical symposium on Computer science education -
SIGCSE ’13. ACM Press, New York, New York, USA, 439.
DOI: http://dx.doi.org/10.1145/2445196.2445326
[20] J Reich, DH Tingley, J Leder-Luis, and M Roberts. 2014.
Computer-assisted reading and discovery for student
generated text in massive open online courses. (2014).
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2499725
[21] Lauren A Rivera and András Tilcsik. 2019. Scaling down
inequality: Rating scales, gender bias, and the architecture of
evaluation. American Sociological Review 84, 2 (2019),
248–274.
[22] Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley,
Christopher Lucas, Jetson Leder-Luis, Shana Kushner
Gadarian, Bethany Albertson, and David G. Rand. 2014.
Structural Topic Models for Open-Ended Survey Responses.
American Journal of Political Science 58, 4 (10 2014),
1064–1082. DOI: http://dx.doi.org/10.1111/ajps.12103
[23] Linda J. Sax, Shannon K. Gilmartin, and Alyssa N. Bryant.
2003. Assessing response rates and nonresponse bias in web
and paper surveys. (2003). DOI:
http://dx.doi.org/10.1023/A:1024232915870
[24] T Sliusarenko, LH Clemmensen International . . . , and
Undefined 2013. 2013. Text Mining in Students’ Course
Evaluations. pdfs.semanticscholar.org (2013).
https://pdfs.semanticscholar.org/cb02/b880ef86371461b3ebe46d2f8c293b43c7a2.pdf
[25] Philip Stark, Kellie Ottoboni, and Anne Boring. 2016. Student
Evaluations of Teaching (Mostly) Do Not Measure Teaching
Effectiveness. ScienceOpen Research (2016). DOI:
http://dx.doi.org/10.14293/s2199-1006.1.sor-edu.aetbzc.v1
[26] Josh Terrell, Andrew Kofink, Justin Middleton, Clarissa
Rainear, Emerson Murphy-Hill, Chris Parnin, and Jon
Stallings. 2017. Gender differences and bias in open source:
pull request acceptance of women versus men. PeerJ
Computer Science 3 (5 2017), e111. DOI:
http://dx.doi.org/10.7717/peerj-cs.111
[27] Bob Uttl, Carmela A. White, and Daniela Wong Gonzalez.
2017. Meta-analysis of faculty’s teaching effectiveness:
Student evaluation of teaching ratings and student learning are
not related. Studies in Educational Evaluation 54 (9 2017),
22–42. DOI:
http://dx.doi.org/10.1016/J.STUEDUC.2016.08.007