=Paper=
{{Paper
|id=Vol-2734/paper4
|storemode=property
|title=Evaluations as Research Tools: Gender Differences in Academic Self-Perception and Care Work in Undergraduate Course Reviews
|pdfUrl=https://ceur-ws.org/Vol-2734/paper4.pdf
|volume=Vol-2734
|authors=David Lang,Youjie Chen,Andreas Paepcke,Mitchell L. Stevens
|dblpUrl=https://dblp.org/rec/conf/edm/LangCPS20
}}
==Evaluations as Research Tools: Gender Differences in Academic Self-Perception and Care Work in Undergraduate Course Reviews==
David Lang, Youjie Chen, Andreas Paepcke, Mitchell L. Stevens
Stanford University, Stanford, CA
{dnlang86, minachen, paepcke, stevens4}@stanford.edu

ABSTRACT

Student course reviews are rarely considered as research instruments, yet their ubiquity makes them promising tools for education data science. To illustrate this potential, we use a corpus of student reviews to observe gender differences in how students appraise their own learning and in the advice they give to future students. We find systematic differences in who submits course reviews, with female and academically high-achieving students more likely to submit. Among submitters, we find (a) females understate their achievement of learning goals relative to males earning the same grades; (b) females offer lengthier written advice to future students than males; (c) advice written by females exhibits more positive tone, even after accounting for grades and course selections.

Keywords

care work; course evaluations; gender; higher education; topic models; survey design

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. INTRODUCTION

Student course reviews are a controversial subject in academia. While considerable work has addressed problems of validity and bias in the use of these instruments for assessing instructors and instruction [25] [21], few have recognized reviews as potentially useful research tools for education data science.

Several features of course reviews make them potentially attractive for researchers. First, reviews are ubiquitous features of teaching and learning in US higher education. Because they are so commonly solicited and so frequently submitted, the data yielded from reviews represent a very wide swath of student populations. Second, reviews are routinely submitted through online platforms and carried out by administrative units supported on hard budget lines, bringing the marginal cost of acquiring research data through reviews close to zero. Third, it is technically simple to link information obtained through reviews with institutional data describing those who submit them. Thus while course reviews may be problematic means of assessing the quality of instructors and instruction – a matter on which we make no comment here – we believe these data hold substantial promise for education data science.

To illustrate this promise, we leverage a corpus of 11,255 student reviews submitted by undergraduate students enrolled in Computer Science (CS) classes at a private research university during the 2015-16 and 2016-17 academic years. Because each review is linked with the academic transcript and self-reported gender of its submitter, we are able to observe variation in submissions by an important aspect of student identity and documented academic accomplishment.

While a variety of student characteristics are of interest to education data scientists, we focus on students' gender for reasons both practical and theoretical. For privacy purposes, our case university has currently granted researcher access to only a few variables describing review submitters; we utilize those available data here. Yet we also have two theoretical motivations for focusing on gender. First, borrowing from social psychology, we recognize that women tend to under-estimate their own abilities, while men tend to over-estimate, conditional on measured accomplishment [7]. Second, borrowing from feminist social science, we posit that submission of a course review is a form of care work – a voluntary investment in the well-being of others – and thus implicated differently in feminine and masculine gender roles and identities [9]. Our findings comport with the contours of these larger literatures in ways that are both important in their own right, and instructive for any future deployments of course reviews for education data science.

We pursue three sets of analyses below. In the first set, we observe variation in rates of review submission by gender and earned grades. These analyses illustrate how researchers might test the representativeness of corpora of reviews. Second, we observe how submitters respond to multiple close-ended review prompts that target self-assessments of learning but are phrased differently. These analyses illustrate how review design may interact with student characteristics to produce patterned variation in reported learning progress. Third, we conduct computational text analyses of submissions to an open-ended review prompt. These analyses illustrate how qualitative reviews can be efficiently leveraged for scientific insight.

2. RELATED WORK

2.1 Course Reviews

Research on course reviews typically has focused on questions of their value as instruments for evaluating the quality of instruction. Analyses conducted at scale typically focus on whether measures of learning are correlated with instructional quality [27].

One potential concern about course reviews from the psychometrics literature is the potential for differential item functioning, a phenomenon in which respondents of equal ability will exhibit different responses to a given survey item or question. Studies of differential item functioning in course reviews have focused on the quantitative difficulty of a class, or characteristics of the instructor [17] [8]. Relatively little work has focused on how student characteristics may be associated with differential item functioning on course reviews.

If findings from copious research on product reviews translate to academic course reviews, we would expect that students with high-valence opinions about a course are more likely to respond, resulting in a bimodal or j-shaped distribution [12]. In practice, these findings may not translate. There are often other incentives for filling out reviews, for example, giving students earlier access to their final grades as an incentive to respond. Those who submit reviews may not be representative of the larger population of students who enrolled in a particular course, or of the overall campus population. This problem is exacerbated when analysts do not have access to reviewer characteristics such as gender or grades [1]. Past work that has tried to adjust for non-response bias in reviews has suggested that non-response bias tends to favor positive reviews [11].

There is varying theory around why students may opt to submit reviews. Studies conclude that students are more likely to respond to course evaluations if they are majoring in the subject of the course [1]. Other work suggests that female students are generally more likely to respond to course evaluations than males [23]. However, little work has focused specifically on this topic in Computer Science courses.

Under experimental conditions where researchers manipulated the information content and valence of course reviews, these factors had material effects on course enrollment decisions. Students were more likely to enroll in courses if course evaluations had positive valence, particularly if there was a large number of such evaluations [14]. Similar work found that exposure to positive or negative course reviews had modest to large effects on students' expected performance within a course, and their likelihood of recommending the course in the future [13]. These findings are particularly relevant for CS courses, as CS courses tend to have relatively large enrollments compared to other subjects.

2.2 Text Analysis

There is a burgeoning literature on using computational text analysis methods to quantify differences in corpora based on characteristics of the author and the text. These techniques have been quickly adopted in educational applications, but we have seen relatively few instances of text analysis of course reviews. When text analyses of course evaluations are done, they typically focus on keyword extraction and on predicting Likert item responses as a function of the text [24].

We group text analysis methods into the following three categories:

2.2.1 Dictionary and rule-based approaches

Dictionary-based approaches characterize the words of documents into groups of predefined categories such as sentiment. The most popular of these dictionaries is the Linguistic Inquiry and Word Count (LIWC) dictionary [18]. In addition to grouping words into 75 distinct categories and themes (e.g. family, power, death, etc.), the dictionary generates four psycho-social variables that were validated on college application essays through a rating process. Each of these variables is scored on a 1 to 99 interval, where 1 is a complete lack of the construct and 99 is a highly pronounced form of the construct. These constructs are:

1. Tone - This is a summary variable describing the emotional quality of the text. A score of 99 reflects a positive tone and a score of 1 reflects a negative tone. A score of 50 represents neutral valence.

2. Analytic - This is a measure of how much formal logic is used in the text. A score of 1 indicates little use of formal logic and a score of 99 indicates statements with a great deal of formal logic.

3. Authenticity - This is a measure of the sincerity/honesty of a text. A score of 1 indicates insincerity and a score of 99 indicates high sincerity.

4. Clout - This is a measure of the text's authority, relative position, and confidence. A score of 1 suggests relatively little authority and a score of 99 suggests high authority.

Researchers can also create their own custom measures of text. We take advantage of this affordance by capturing mentions of instructors' names.

2.2.2 Token-based approaches

Token-based approaches treat every word in a text as input into a model. These approaches often result in the loss of syntactic meaning but are often very effective at classifying documents. Token-based approaches have proven effective at detecting socioeconomic features of authors such as race, gender, and income in college application essays [4]. Other applications have generated algorithms with high predictive validity on classroom observation and evaluation rubrics [15].

2.2.3 Unsupervised approaches

The basic premise behind unsupervised approaches is that texts include multiple topics, and topics comprise words. Using unsupervised methods such as Latent Dirichlet Allocation (LDA), we can group texts categorically. These same methods have been augmented recently to allow the distribution of topics to co-vary with other relevant metadata, a technique known as structural topic modeling [22]. In this case, we can examine the concentration of topics by features such as student gender or grades. This method further allows us to perform statistical inference to see if topic preponderance varies systematically by characteristics of authors.

2.3 Gender Differences in Academic Experiences, Skill Perception and Care Work

Our work has three motivations from prior social-science literature on higher education and gender. The first is that male and female students may have different experiences when taking the same courses. For example, women are less comfortable asking questions and have less confidence in CS courses than their male peers [19]. This "gender confidence gap" grows as students take more advanced courses [3].
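The dictionary-based approach described in Section 2.2.1 can be illustrated with a toy scorer. The word lists and the mapping onto LIWC's 1-99 scale below are invented for illustration only; the actual LIWC dictionary is proprietary and far more extensive.

```python
# Toy illustration of a dictionary-based scorer in the spirit of LIWC's
# Tone variable. Word lists and the 1-99 rescaling are invented here;
# real LIWC uses validated, much larger category dictionaries.
POSITIVE = {"great", "love", "fun", "helpful", "amazing"}
NEGATIVE = {"hard", "boring", "unfair", "awful", "confusing"}

def tone_score(text: str) -> float:
    """Return a 1-99 score: below 50 is negative, 50 neutral, above 50 positive."""
    words = text.lower().split()
    if not words:
        return 50.0
    pos = sum(w.strip(".,!?") in POSITIVE for w in words)
    neg = sum(w.strip(".,!?") in NEGATIVE for w in words)
    # Net valence in [-1, 1], mapped onto a 1-99 interval like LIWC's.
    net = (pos - neg) / len(words)
    return 50.0 + 49.0 * net

print(tone_score("Great course, helpful staff!"))  # above 50: positive
print(tone_score("Boring and unfair grading."))    # below 50: negative
```

A custom rule, such as the instructor-name capture used later in the paper, can be added to the same pipeline as another set of match terms.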
Analyses of communal academic resources in CS programs find substantial differences in how contributions by male and female users are acknowledged on GitHub and StackOverflow [16] [26]. Consequences of these phenomena may extend beyond college, as women with degrees in STEM fields are less likely than men to enter STEM occupations [5]. While course reviews cannot capture empirical variation in experience per se, they can capture how submitters make sense of those experiences.

Second, gendered differences in skill perception may influence how students report their experiences and learning gains in reviews. While women tend to approach STEM fields with less confidence, men tend to over-estimate their abilities. Experimental work by Correll [7] found that men expressed inflated perceptions of their own skill at completing quantitative tasks compared to women performing at the same level of measured accomplishment. Together these inquiries suggest that course reviews may bear traces of gendered patterns of academic self-perceptions.

Our third motivation is the gendered character of care work. Social scientists define care work as work that attends to the well-being of others. It comprises activities and services intended to help other people develop their capabilities and pursue their goals [9]. Care work is consistently associated with femininity and female role expectations, and often is unpaid or poorly compensated [10]. To the extent that submitting course reviews is an act of assistance – to improve classes and to inform future students – it is appropriately theorized as a form of care work. Thus we might expect that female and male students will approach the task of course reviews with different dispositions, such that the number, extensiveness, and content of course evaluations may vary by gender of submitters.

3. RESEARCH QUESTIONS

Our working hypothesis is that course reviews will exhibit gendered patterns of academic experience, self-perceptions and advice-giving. Specifically: (1) reviews from male students will profess stronger learning gains; (2) reviews from female students will exhibit characteristics of care work.

We group our analyses into two parts. The first part examines variation by gender and earned grades in review submission rates and in Likert-scale items on course reviews. The second part examines variation in male and female responses to a qualitative review prompt eliciting advice for future students considering the same courses.

3.1 Review Submission Rates and Likert Items

H1: Female students will respond to course evaluations more often than males.

Our care work hypothesis is that female students will be more responsive to institutional requests for reviews. We investigate this hypothesis using an exact binomial two-sample test. We examine results by gender and grade.

H2: There are systematic differences in response rate by grade.

There are many competing theories of how grades might influence response rates to course evaluations. If students have a poor grade, they may be more inclined to view the evaluation as an opportunity to retaliate against the grader. Alternatively, students who receive a low grade may opt to avoid opportunities to reflect on negative experiences. We investigate this hypothesis utilizing a simple χ² test of response rates by grade.

H3: Female students will understate their achievements relative to their male counterparts.

We hypothesize that men are more likely to see course reviews as a form of positive self-reflection and promotion, and that females are more likely to see reviews as a form of care work. We believe these differences will have stronger valence in items that focus on a student's accomplishments rather than on other constructs such as student learning. We model these analyses as a fixed-effect regression with the following specification:

Y_ij = β1 Male_i + Grades_ij + Γ_j + ε_ij    (1)

The subscripts i and j correspond to indices for student and course. The Y variable corresponds to our focal outcome variable, in this case responses to a Likert item. Male_i is an indicator variable for whether the student self-identifies as male, and β1 is the coefficient associated with this variable. We represent course fixed effects with Γ_j to control for factors like the difficulty of the course or instructional quality. We also control for grades with an additional fixed effect for each possible grade a student could receive (in our analyses, there are over twenty grade types, including + and - variants as well as credit and no-credit courses; we report A, B, C, D, and not-passing grades for simplicity). The error is represented by ε. Errors are clustered at the course level.

3.2 Open-Response Questions

We pay particular attention to open-response items in course reviews. We suspect that such items may be the most valuable and least explored element of course reviews. As such, we may be able to detect subtle differences in qualitative responses.

3.2.1 Psychosocial variables

H4: Course evaluations written by females will express more positive and sincere sentiment.

Given our care work hypothesis, we believe that female students will express more positive sentiment in open-response items. We use the same analytical strategy as in equation 1, using LIWC's tone variable. Specifically, we examine gender differences in these psycho-social variables after controlling for variation that can be attributed to the course, or to student grade. We report outcomes in standardized effect sizes to facilitate interpretability.

We also hypothesize, as a corollary to the care work hypothesis, that female students will use more "I" statements and tentative language. This tendency would manifest as reviews written by female students exhibiting more authentic language.

H5: Course evaluations written by male students will express more clout.

Based on prior literature pertaining to a confidence gap in CS by gender, we hypothesize this trend should manifest as fewer expressions of clout and authority in course evaluations by female authors.

3.2.2 Hand-crafted rules

H6: Female students will write more on course evaluations and mention the instructor more often.

We hypothesize that care work will manifest in ways beyond psycho-social variables. Specifically: female submitters will put more effort into reviews by writing more, and they will take a more individualized approach by mentioning the instructor explicitly. We have crafted two simple measures to facilitate investigation of this hypothesis: the length of each response in number of words, and a capture of each instance of an instructor name.

3.2.3 Topic models

H7: There will be systematic variation in topics depending on the author's gender.

Our final analysis is exploratory, using structural topic models to identify whether qualitative components of the corpora systematically vary with the gender of the submitter. The goal of this analysis is to develop efficient means of sorting and categorizing qualitative components of course reviews.

4. DATA

Data comprise information describing enrollments in courses offered through the Computer Science (CS) Department of a private research university during the 2015-16 and 2016-17 academic years, and the entire population of formal reviews submitted by students enrolled in those courses. Reviews were administered near the end of the academic term but before the beginning of the term's official final exam period. As an incentive for submitting reviews, students were given the ability to see their final course grades a bit earlier than non-submitters. In total these data yield 11,255 student responses from 251 courses. Courses range in character from very large introductory lecture-and-lab formats to small advanced seminars.
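The gender comparison of submission rates proposed for H1 can be sketched as follows. The paper uses an exact binomial two-sample test; this sketch substitutes a two-proportion z-test, a normal approximation that is adequate at sample sizes like these. The cell counts below are illustrative stand-ins; only the overall 78.0% and 74.5% rates come from the paper.

```python
# Sketch of an H1-style comparison: do females submit reviews at a higher
# rate than males? Normal-approximation two-proportion z-test; counts are
# hypothetical, chosen to match the reported 78.0% vs 74.5% rates.
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p) for H0: the two proportions are equal."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

z, p = two_proportion_z(3900, 5000, 4470, 6000)  # 78.0% vs 74.5%
print(round(z, 2), p < 0.001)
```

The same function can be applied within each grade stratum to reproduce the by-grade comparisons described for H1 and H2.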
Institutional data made available to us for analysis include each student's grade, gender, GPA, declared major (if known), and academic year. We combine these data with the corpus of reviews submitted for CS courses during the study period specified above. Approximately one-third of submitted reviews are from female students, and approximately half are from undergraduates. We cannot track or identify students enrolling in multiple CS courses during the study period; however, we can compute and generate response rates by grade and gender.

We limit our analysis to responses in which submitters offered a response to the review's only open-ended question. That question reads:

"What would you like to say about this course to a student who is considering taking it in the future?"

The prompt is very well aligned with our care work hypotheses, in that it specifically asks submitters to give advice to a hypothetical future student. Individual responses vary substantially in length: from a single character to over 5,964 characters (the latter equivalent to 1004 words). The mean response length is 132 characters – approximately the length of a tweet. The entire corpus of responses to this question is 300,000 words.

Additionally we analyze responses to two review prompts with five-point Likert responses:

• How much did you learn from this course?

• How well did you achieve the learning goals of this course?

(We limit this analysis to complete cases because one item was not consistently administered across courses. We ignore questions pertaining to quality of instruction and focus on student learning goals.)

Aggregated responses to these two prompts appear in Figure 1.

[Figure 1: Responses to Likert item questions]

5. ANALYSES

5.1 H1: Response Rates By Gender

Rates of review submission by student gender and earned grade are reported in Figure 2. Two features are notable. First, females are more likely to submit overall. On average, females submit to 78.0% of opportunities to do so; males, 74.5% (p<.001).

Second, those receiving higher grades in a course are more likely to submit reviews. Females receiving a grade of "A" are 3.7% more likely to submit than their male counterparts. The gender submission gap is greatest among students receiving a grade of "B," with female "B" recipients 6.5% (p<.001) more likely to submit than males. We do not observe statistically significant gender differences in submission rates for those receiving grades below "B"; however, such grades represent fewer than 5% of grades in the research sample.

[Figure 2: Survey response rates by grade]

5.2 H2: Variation in Submission by Grade

We also examined whether review submission varied systematically by grade. Figure 2 indicates a strong positive correlation between grade and likelihood of submission. A student with a grade of A or higher has an 80% chance of responding to the evaluation, while students who do not pass the course or who receive credit without a grade responded approximately 50 percent of the time. Effectively, this means that students who fail courses are represented by review submissions about half as often as students who excel in a course. Differences are statistically significant, with a χ² statistic of 395.37 and a p-value of less than 0.001.

5.3 H3: Reports of Learning and Goal-Meeting

We observe the proportion of students reporting having achieved the learning goals of a course "Extremely Well," by grade, in Figure 3. Not surprisingly, we find a strong direct correlation with course grade, such that reported goal achievement declines with grade. What is striking is that at every grade level there is a clear gap in reported goal achievement, with males more likely to report achievement than females earning the same grade.

[Figure 3: Percent of students saying they achieved learning goals extremely well]

We extend this analysis to see if this same pattern occurs with the question of how much students learned, using the same specification as described in equation 1; results appear in Table 1. We see that after controlling for grades and course, males and females exhibit no differences in self-reported measures of 'how much they learned' in a class. However, when we look at a similar question about 'achievement of learning goals', we see a stark difference. Male students are 8 percentage points more likely than females to state they mastered the learning goals of a course. This finding suggests two concerns. First, given the similarity of these questions, we see that subtle differences in phrasing yield substantial differences in student responses. Second, female students report lower levels of mastery even after controlling for grades. Notably, these surveys are collected before students know their final grades. These perceptions may change after this information is revealed to them.

Table 1: Gender Differences in Likert Items

                          Learning    Achievement
Male                      0.01        0.08***
                          (0.01)      (0.01)
Num. obs.                 9002        9002
R² (full model)           0.15        0.18
R² (proj model)           0.00        0.01
Adj. R² (full model)      0.13        0.16
Adj. R² (proj model)      −0.03       −0.02
Num. groups: grade        20          20
Num. groups: evalunitid   203         203
*** p < 0.001; ** p < 0.01; * p < 0.05

5.4 Open Text Responses

5.4.1 Psycho-social variables

We report our analyses for H4 and H5 in Table 2. We find modest variation by gender in how submitters describe their experience in the same course, conditional on grades. On average, submissions from males evince slightly more negative and slightly less authentic language. While these gender differences are highly significant, their magnitude is modest: on the order of a tenth of a standard deviation. Nevertheless, they are consistent with our care work hypotheses.

Table 2: LIWC Regressions

                          Tone      Analytic   Clout     Authentic
Male                      −0.08**   −0.01      0.01      −0.10***
                          (0.03)    (0.02)     (0.02)    (0.02)
Num. obs.                 11255     11255      11255     11255
R² (full model)           0.07      0.05       0.05      0.04
R² (proj model)           0.00      0.00       0.00      0.00
Adj. R² (full model)      0.05      0.02       0.02      0.02
Adj. R² (proj model)      −0.02     −0.02      −0.02     −0.02
Num. groups: grade        20        20         20        20
Num. groups: evalunitid   251       251        251       251
*** p < 0.001; ** p < 0.01; * p < 0.05

To wit, men are somewhat more critical and less honest in their reviews than women, suggesting greater empathy and investment among female submitters. With respect to our hypothesis around clout, we find little evidence that qualitative open-responses exhibit any significant differences in submissions from males and females.

5.4.2 Handcrafted features

We report the standardized results of our analysis in Table 3. We observe marginally significant differences in the frequency with which submissions from males and females mention the instructor's name, with women approximately two percent more likely to mention it. Submissions from women are also lengthier – about .15 of a standard deviation. While modest in magnitude, these statistically significant findings comport with our care work hypotheses that female submitters approach the task of submitting reviews with more attention to specificity and investment.

Table 3: Handcrafted Features

                          ProfessorName   Word Count
Male                      −0.02*          −0.15***
                          (0.01)          (0.03)
Num. obs.                 11255           11255
R² (full model)           0.08            0.08
R² (proj model)           0.00            0.00
Adj. R² (full model)      0.06            0.06
Adj. R² (proj model)      −0.02           −0.02
Num. groups: grade        20              20
Num. groups: evalunitid   251             251
*** p < 0.001; ** p < 0.01; * p < 0.05

5.5 Topic Models

We ran structural topic models while allowing submitter gender to vary with topic prevalence. We tuned the optimal number of topics from 2 to 50 using an exclusivity measure called FREX [2] [20]. FREX (see Equation 2) is a harmonic weighting of the frequency (F) with which a word occurs in a topic, and exclusivity (E), how frequently the word occurs in a given topic relative to others.
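The FREX criterion described above can be sketched directly. In the structural topic modeling literature, F and E are typically ECDF ranks of a word's topic-specific frequency and exclusivity; the toy values below simply illustrate the weighted harmonic form with the paper's ω = 0.7.

```python
# Minimal sketch of the FREX score: a weighted harmonic mean of a word's
# frequency within a topic (F) and its exclusivity to that topic (E).
# Input values here are illustrative, not derived from a fitted model.
def frex(exclusivity: float, frequency: float, omega: float = 0.7) -> float:
    """Weighted harmonic mean; omega = 0.7 favors exclusivity, as in the paper."""
    return 1.0 / (omega / exclusivity + (1.0 - omega) / frequency)

# A word highly exclusive to a topic outranks an equally frequent but
# widely shared word, because omega up-weights exclusivity.
print(frex(exclusivity=0.9, frequency=0.5))  # exclusive word scores higher
print(frex(exclusivity=0.2, frequency=0.5))  # shared word scores lower
```

Ranking a topic's vocabulary by this score surfaces words that are both common within the topic and distinctive to it, which is what makes the hand-labelling step described below tractable.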
The parameter ω corresponds to a tuning parameter of the relative importance of these features. We used the default parameter of ω = .7 to favor topics that had more exclusivity.

FREX = (ω/E + (1 − ω)/F)^(−1)    (2)

Using this criterion, we found the locally optimal parameter to be 32 distinct topics. We then hand-labelled each topic, observing the ten statements that had the highest probability under that topic; see Figure 4 (Top). The most common topics were comments about a course being a tutorial, a suggestion to take the course, or a positive review. The least common topics were highly specific suggestions and issues pertaining to course prerequisites.

While we did not see substantial gender variation in topics overall, there are exceptions of note. First, submissions from males are more likely to talk about math prerequisites and to claim that instruction was poorly organized or of poor quality. They were also more likely to discuss course organization and instruction. Submissions from females were more likely to bear topics pertaining to workload, study practices, and attendance. These patterns provide at least modest evidence that women are offering relatively more specific advice that may be relevant to larger numbers of future students.

We note an important caveat to this analysis, however. In contrast with the analyses above, results of the topic models presented in this section do not control for grades or course selections; thus reported gender differences in topic prevalence may be an artifact of these other factors. We attempted to model the data with all of these parameters but found the models to be degenerate.

[Figure 4: (Top) Topic Proportions. (Bottom) Differences in Topic Prevalence by Gender]

6. DISCUSSION

Even while they are controversial for evaluating instructors and instruction, course reviews are ubiquitous features of the US higher education landscape and potentially powerful tools for education data science. In the work presented here we have sought to demonstrate the promise of course reviews as a window into students' perceptions of their academic experiences and their orientation to the task of submitting evaluations. Taking advantage of archival data that included 11,255 reviews submitted to 251 computer science courses at a single university between 2015-2017, linked to administrative information describing submitters' gender (M/F) and grades, we found patterned variation in who submits course reviews, and how.

In three observational studies we found that (a) women and those earning high grades were disproportionately likely to submit reviews; (b) the phrasing of close-ended review prompts influenced patterns of response by gender; (c) responses to qualitative review prompts differed subtly but significantly by gender, with women writing somewhat more positive, individualized, and lengthier reviews. These empirical findings comport with theoretical insights from educational social psychology and feminist social science, which suggest gender variation in how men and women perceive their own academic accomplishments and their obligations for the well-being of others.

While the empirical findings presented here are modest, they suggest the promise of leveraging course reviews for cumulative science in at least two ways.

First, we note that the inquiries presented here are based entirely on the premise that course reviews and submitter demographic information are "found" data. To the extent that virtually every US college and university possesses data such as these, we can only imagine the number and variety of insights that might be gained from parallel investigations at other schools. To a nascent field whose promise lies substantially in observing phenomena at scale, course reviews provide exceptionally promising sources of data for education data science.

Second, there is every reason to imagine that education data scientists might collaborate with school administrators to more explicitly and conscientiously instrument reviews for systematic experimental and quasi-experimental research. The basic conditions for such inquiries are already in place and sustained by established administrative rhythms: schools have offices conducting the reviews, students anticipate receiving them, and they take place multiple times a year. It is possible to imagine substantial scientific insight through the linkage of review submissions with administrative data describing characteristics of submitters. The initial efforts presented here provide an inkling of this promise.

As with any novel research strategy, pursuing education data science through course reviews comes with important ethical considerations regarding participant consent and responsible use. We are grateful that such discussions are already well underway nationwide [6] and we hope that our own illustrative work here might helpfully contribute to them. Indeed, addressing questions of responsible use of student data in the context of course reviews may have the additional benefit of improving the collective value of an institutional practice currently regarded with ambivalence and suspicion but that, in whatever form, will likely be part of the academic landscape for a long time.

7. REFERENCES

[1] Meredith J. D. Adams and Paul D. Umbach. 2012. Nonresponse and Online Student Evaluations of Teaching: Understanding the Influence of Salience, Fatigue, and Academic Environments. Research in Higher Education 53, 5 (2012), 576–591. DOI: http://dx.doi.org/10.1007/s11162-011-9240-5

[2] Edoardo M. Airoldi and Jonathan M. Bischof. 2012. A Poisson convolution model for characterizing topical content with word frequency and exclusivity. arXiv (2012). https://arxiv.org/abs/1206.4631

[3] Christine Alvarado, Yingjun Cao, and Mia Minnes. 2017. Gender Differences in Students' Behaviors in CS Classes throughout the CS Major. In Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science Education. ACM, New York, NY, USA, 27–32. DOI: http://dx.doi.org/10.1145/3017680.3017771

[4] A.J. Alvero, Noah Arthurs, anthony lising antonio, Benjamin W. Domingue, Ben Gebre-Medhin, Sonia Giebel, and Mitchell L. Stevens. 2020. AI and Holistic Review. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. ACM, New York, NY, USA, 200–206. DOI: http://dx.doi.org/10.1145/3375627.3375871

[5] David N. Beede, Tiffany A. Julian, David Langdon, George McKittrick, Beethika Khan, and Mark E. Doms. 2011. Women in STEM: A Gender Gap to Innovation. SSRN Electronic Journal (2011). DOI: http://dx.doi.org/10.2139/ssrn.1964782

[6] Michael Brown and Carrie Klein. 2020. Whose Data? Which Rights? Whose Power? A Policy Discourse Analysis of Student Privacy Policy Documents. The Journal of Higher Education (2020), 1–30.

[7] Shelley J. Correll. 2004. Constraints into Preferences: Gender, Status, and Emerging Career Aspirations. American Sociological Review 69, 1 (2004), 93–113. DOI: http://dx.doi.org/10.1177/000312240406900106

[8] Erica DeFrain. 2016. An Analysis of Differences in Non-Instructional Factors Affecting Teacher-Course Evaluations over Time and Across Disciplines. (2016). https://repository.arizona.edu/handle/10150/621018

[9] Paula England. 2005. Emerging Theories of Care Work. Annual Review of Sociology 31, 1 (2005), 381–399. DOI: http://dx.doi.org/10.1146/annurev.soc.31.041304.122317

[10] Nancy Folbre. 1995. "Holding hands at midnight": The paradox of caring labor. Feminist Economics 1, 1 (1995), 73–92. DOI: http://dx.doi.org/10.1080/714042215

[11] Maarten Goos and Anna Salomons. 2017. Measuring teaching quality in higher education: assessing selection bias in course evaluations. Research in Higher Education 58, 4 (2017), 341–364. DOI: http://dx.doi.org/10.1007/s11162-016-9429-8

[12] Nan Hu, Jie Zhang, and Paul A. Pavlou. 2009. Overcoming the J-shaped distribution of product reviews. (2009). DOI: http://dx.doi.org/10.1145/1562764.1562800

[13] Neneh Kowai-Bell, Rosanna E. Guadagno, Tannah Little, Najean Preiss, and Rachel Hensley. 2011. Rate My Expectations: How online evaluations of professors impact students' perceived control. Computers in Human Behavior 27, 5 (2011), 1862–1867. DOI: http://dx.doi.org/10.1016/J.CHB.2011.04.009

[14] Cong Li and Xiuli Wang. 2013. The power of eWOM: A re-examination of online student evaluations of their professors. Computers in Human Behavior 29, 4 (2013), 1350–1357. DOI: http://dx.doi.org/10.1016/J.CHB.2013.01.007

[15] Jin Liu and Julie Cohen. 2020. Measuring Teaching Practices at Scale: A Novel Application of Text-as-Data Methods. EdWorkingPapers (2020). https://www.edworkingpapers.com/ai20-239

[16] Anna May, Johannes Wachs, and Anikó Hannák. 2019. Gender differences in participation and reward on Stack Overflow. Empirical Software Engineering 24, 4 (2019), 1997–2019. DOI: http://dx.doi.org/10.1007/s10664-019-09685-x

[17] Arunachalam Narayanan, William J. Sawaya, and Michael D. Johnson. 2014.

[26] … Stallings. 2017. Gender differences and bias in open source: pull request acceptance of women versus men. PeerJ Computer Science 3 (2017), e111. DOI: http://dx.doi.org/10.7717/peerj-cs.111

[27] Bob Uttl, Carmela A. White, and Daniela Wong Gonzalez. 2017. Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation 54 (2017), 22–42. DOI: http://dx.doi.org/10.1016/J.STUEDUC.2016.08.007
Analysis of Differences in Nonteaching Factors Influencing Student Evaluation of Teaching between Engineering and Business Classrooms. Decision Sciences Journal of Innovative Education 12, 3 (7 2014), 233–265. DOI:http://dx.doi.org/10.1111/dsji.12035 [18] JW Pennebaker, RL Boyd, K Jordan, and K Blackburn. 2015. The development and psychometric properties of LIWC2015. (2015). https: //repositories.lib.utexas.edu/handle/2152/31333 [19] Katie Redmond, Sarah Evans, and Mehran Sahami. 2013. A large-scale quantitative study of women in computer science at Stanford University. In Proceeding of the 44th ACM technical symposium on Computer science education - SIGCSE ’13. ACM Press, New York, New York, USA, 439. DOI:http://dx.doi.org/10.1145/2445196.2445326 [20] J Reich, DH Tingley, J Leder-Luis, and M Roberts. 2014. Computer-assisted reading and discovery for student generated text in massive open online courses. (2014). https://papers.ssrn.com/sol3/papers.cfm?abstract_id= 2499725 [21] Lauren A Rivera and András Tilcsik. 2019. Scaling down inequality: Rating scales, gender bias, and the architecture of evaluation. American Sociological Review 84, 2 (2019), 248–274. [22] Margaret E. Roberts, Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014. Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science 58, 4 (10 2014), 1064–1082. DOI:http://dx.doi.org/10.1111/ajps.12103 [23] Linda J. Sax, Shannon K. Gilmartin, and Alyssa N. Bryant. 2003. Assessing response rates and nonresponse bias in web and paper surveys. (2003). DOI: http://dx.doi.org/10.1023/A:1024232915870 [24] T Sliusarenko, LH Clemmensen International . . . , and Undefined 2013. 2013. Text Mining in Students’ Course Evaluations. pdfs.semanticscholar.org (2013). https://pdfs.semanticscholar.org/cb02/ b880ef86371461b3ebe46d2f8c293b43c7a2.pdf [25] Philip Stark, Kellie Ottoboni, and Anne Boring. 
2016. Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness. ScienceOpen Research (2016). DOI:http: //dx.doi.org/10.14293/s2199-1006.1.sor-edu.aetbzc.v1 [26] Josh Terrell, Andrew Kofink, Justin Middleton, Clarissa Rainear, Emerson Murphy-Hill, Chris Parnin, and Jon 8