Lessons from a Massive Open Online Course (MOOC) on Natural Language Processing for Digital Humanities

Simon Clematide, Isabel Meraner, Noah Bubenhofer, Martin Volk
University of Zurich, Switzerland
Institute of Computational Linguistics
simon.clematide@uzh.ch, isabel.meraner@uzh.ch, bubenhofer@cl.uzh.ch, volk@cl.uzh.ch

Abstract

In this paper, we present the concept, content and experience with an actively running Massive Open Online Course (MOOC) on Natural Language Processing for Digital Humanities. This video-based course is held in German, does not require any programming skills, and serves as an introduction to automatic text analysis. The target audience is anyone who is interested in applying basic language technology to text corpora. It has a strong empirical focus on digital representations, tools and corpus linguistics. The main goal is to grasp the fundamental terminology and concepts of computational linguistics, to understand the main problems and solutions, and to know about the performance and limitations of current methods. Furthermore, manual annotation and data visualization are introduced in this course.

1 Introduction

More and more scientific disciplines use automatic text analysis in their digital scholarship. In the humanities, we have literary and cultural studies (e.g. popularized as “distant reading” (Moretti, 2013), “corpus based discourse analysis” (Sinclair, 2004; Bubenhofer, 2009) etc.), empirical corpus linguistics and computational social sciences (including automatic media monitoring (Reamy, 2016)), but text mining is also popular in the natural sciences, for instance in the bio-medical domain (Cohen and Hunter, 2008).

Being able to apply Natural Language Processing (NLP) methods to texts requires special knowledge and skills. The goal of this course is not to teach these skills, but to didactically introduce important concepts and techniques related to digital text representation and analysis. Therefore, programming experience is neither required for this introductory course nor provided in it.

According to Ubell (2017), more than 58 million people have signed up worldwide for Massive Open Online Courses (MOOCs) by now. This form of distance learning in higher education has grown popular over the last 6 years, and several commercial and non-commercial platforms compete for participants.

Our free course is held on Coursera¹, one of the largest commercial platforms, which distributes classes mostly held in English and created by lecturers of top universities around the world. Our course language is German², which on the one hand has the disadvantage of excluding participants who do not speak German, but on the other hand allows us to occupy a niche in language technology focusing on German texts. A first session of the course was run in summer 2015, and about 900 learners visited the course at least once. Due to legal issues between our university and Coursera, and due to the introduction of Coursera’s new platform³ and the resulting course migration effort, it took two years to start the next session of our course.

The rest of this paper is organized as follows: in Section 2, we introduce and motivate the syllabus of our course; in Section 3, we discuss our experience from running the course twice so far.

¹ The MOOC can be accessed via this link: https://www.coursera.org/learn/digital-humanities.
² All videos have German subtitles, which is especially useful for users with a limited understanding of German. We explicitly allow English contributions in the discussion forum and peer assignments.
³ Coursera now offers all courses more flexibly on demand by restarting each course regularly at intervals of several weeks. Learners can now easily switch from one instance of a course to the next if they cannot complete within their initial learner cohort. According to Saraf (2017), these cohorts improve the completion rate compared to purely self-paced learning and still offer more flexibility.

2 Course Structure

The course is designed to run over a period of 6 weeks, each of which has its own thematic focus. Each thematic module consists of 30 to 90 minutes of videos, which mostly use a fairly traditional format where a lecturer presents slides and explains NLP methods in an accessible and illustrative way. In addition, we provide learner-oriented learning objectives, more detailed readings regarding the presented topics and further course material within each module. In order to test the individual learning progress, we integrated either a brief final quiz or a peer assessment at the end of each module. The course syllabus is structured as follows:

Module 1 “Paths into the Digital World”
This introductory module presents the fundamental concepts and terminology regarding the digitization of texts. We present techniques such as scanning and OCR (Optical Character Recognition) as well as other approaches for the acquisition of text corpus material (including digital-born documents), and we discuss potential problems related to digitization and corpus design. Additionally, short interviews about digitization techniques and the relevance of digitization with two experts from the Zurich central library complete the first module.

Module 2 “Structured and Effective Representation of Corpus Data”
The second module provides an overview of different encodings, the markup language XML and the TEI P5 standard for text representation. The second half of the module focuses on automatic tokenization and sentence segmentation. Finally, in a non-graded hands-on discussion prompt, the learner needs to apply the acquired XML knowledge concerning well-formedness and identify syntax errors in an XML document.

Module 3 “Properties of Corpora and Basic Methods for Analysis”
In this module, we present the basic concepts of corpus linguistics such as term frequencies, n-grams, collocations and methods for analyzing texts according to Lemnitzer and Zinsmeister (2006). In addition, we demonstrate the functionality of various platforms and interfaces for corpus analysis and show some hands-on corpus query examples. In the last part of Module 3, we introduce the topic “visual linguistics” (Bubenhofer, 2016) together with a variety of tools for displaying the properties of texts in a creative, interactive and illustrative way.

Module 4 “Automatic Corpus Annotation Using NLP Tools”
In this module, we introduce different automatic corpus annotation methods, such as part-of-speech tagging, lemmatization, stemming, parsing, Named Entity Recognition, and Entity Linking (Ratinov and Roth, 2009) for automatic disambiguation. Furthermore, we investigate potential problems and sources of errors that can emerge while using such automatic annotation tools, and we offer approaches to solving these issues.

Module 5 “Manual Annotation and Evaluation of Corpus Data”
The main topic of Module 5 is the efficient combination of manual and automatic annotation and the integration of machine learning methods in the vein of Pustejovsky and Stubbs (2013). Subsequently, we present the most common metrics for measuring the quality of NLP systems and introduce the concept of inter-rater reliability. In the second part of Module 5, we focus on the possibilities and restrictions of crowd-sourcing methods in the digital humanities.

Module 6 “Challenges in Multilingual Text Analysis”
The last module concentrates on multilingual and parallel corpora as well as on automatic language identification in large-scale text collections. Finally, we introduce several up-to-date tools for automatic alignment of parallel corpora on the level of documents, sentences and words.

Assessments
In terms of graded assignments, we integrate short single and multiple choice quizzes ranging from 5 to 12 questions at the end of each module⁴. Module 3 and Module 5 additionally include a graded peer assignment where each learner is supposed to assess at least two other submissions according to detailed grading instructions. The peer assessment in Module 3 encourages the learner to apply the acquired knowledge on complex corpus queries. To this end, each learner performs individual queries on the IMS Open Corpus Workbench or the COSMAS II interface regarding diachronic language change. Apart from this, the learner is supposed to generate frequency charts or collocation profiles and to interpret the findings and insights gained from this task.

The peer assignment in Module 5 requires the learner to run the online demo version of the Stanford Named Entity Tagger (Finkel et al., 2005) or the Thomson Reuters Open Calais (Reuters, 2008) on a small sample text of their own choice and to evaluate the NER tagger’s output according to the evaluation metrics precision and recall that we explained in this module. In this manner, peer assessments motivate the learners to try out different NLP tools and corpus query platforms individually, and to question and critically analyze their output.

⁴ One module currently does not have a quiz.

Community building and feedback
In order to enhance community building and thematic exchange between enrolled learners, we included a “Meet and Greet” discussion prompt section in the first module as well as a “Feedback and Thank You” discussion field at the very end of the last module. For each module, a weekly discussion forum is automatically generated on the platform where participants can ask or answer questions regarding the content of a module. Additionally, for each discussion prompt, individual threads are automatically included in the weekly forum to allow topic-related discussions and exchange. To ensure a friendly discussion atmosphere and make the learners feel well looked after, the course tutor is actively present in the forums and tries to answer or comment on every contribution.

2.1 Lecturers and tutors

Three different lecturers teach in this course, and they agreed beforehand on the overall content, syllabus and presentation style. After that, each lecturer was responsible for developing his own module content, preparing the slides and organizing additional material. A student assistant supported this process, cut the video recordings, added some video effects (zooming, highlighting, textual annotations, in-video quizzes in order to avoid monotony) to the slide recordings and published everything on Coursera’s electronic learning management platform. All lecturers already had a lot of teaching experience in the subjects of their modules, yet everyone had to invest a large amount of time to fit the existing teaching material from normal university classes into video sequences of an appropriate length for online courses. In fact, some of our videos are still too long by current standards (5-7 minutes).

Having 3 different lecturers makes the course more varied and offers the learner slightly different perspectives on the matter. In addition, every lecturer was able to teach the topics in which he is most specialized and experienced.

2.2 Building a studio and gaining recording experience

For the duration of the video recordings, we turned an office into a makeshift studio. We decided to record the videos on our own rather than have them recorded by a multimedia production team from our university, who, however, kindly instructed us in the beginning. Although a professionally produced result would have looked more polished, recording on our own gave us the urgently needed flexibility in production, as none of the lecturers had prior experience in teaching in front of a camera.

The scene background was white with some books and logos for ease of recognition (see Fig. 1). Lighting was installed to keep the scene equally illuminated without making the lecturers look pale. The lecturers were filmed from the side while sitting in order to offer a relaxed learning atmosphere. Lecturers needed a while to learn to keep eye contact with the camera rather than looking at their slides. Small slips of the tongue were accepted as ingredients of natural talks. Larger wording errors were cut out and required repeated recordings.

Figure 1: Talking head recording of speaker presenting slides in a weekly video session.

2.3 Resources and NLP Tools

As a running example, we use the diachronic and multilingual corpus Text+Berg (Volk et al., 2010), which allows us to illustrate many different NLP tasks and exploitation techniques on a coherent resource that is freely available for academic use. This corpus has texts mostly in French, German, and Italian, some of them translated, and spans a period of 150 years.

In our videos, we also mention, demonstrate and reference a lot of other initiatives, resources, frameworks, and open-source tools:
(a) digitization initiatives (Projekt Gutenberg, Europeana, TextGrid);
(b) OCR crowd-correction and crowd-sourcing in general (TypeWright, Crowdflower, Artigo);
(c) online corpora and corpus query tools (COSMAS II/DeReKo, DWDS, CQPweb);
(d) parallel corpora (EuroParl, Canadian Hansard);
(e) sentence and word alignment tools for parallel corpora (InterText, HunAlign, GIZA++);
(f) language identification (lingua-ident, LangId);
(g) text representation standards (Unicode, UTF-8, XML, TEI-P5);
(h) annotation standards (STTS, Universal tags and dependencies);
(i) standard lexical and syntactic NLP tools (Porter Stemmer, Durm Lemmatizer, TreeTagger, Connexor-Tagger; chunkers and parsers);
(j) named entity recognition (Open Calais, Stanford NER);
(k) tools for manual annotation of linguistic structures (and/or querying the annotations) (WebAnno, ANNIS, EXMARaLDA, RSTTool);
(l) visualization (Graphviz, Leaflet, Gephi).

3 Discussion

The field of language technology and NLP is a rapidly evolving discipline. In the last 25 years, systems based on hand-written rules and application-specific algorithms have been largely superseded by statistical systems that are typically built by supervised or semi-supervised machine learning techniques.

In our course we reflect this paradigm change, e.g. by contrasting the output of a rule-based part-of-speech tagger with a statistical one, and make our participants aware of the different requirements for these approaches (e.g. manually built training material needed for supervised machine learning). However, we do not introduce “Neural Deep Learning” methods (Manning, 2015), which currently dominate NLP research and already have an impact on practical NLP systems. Our course design, which roughly follows the traditional NLP pipeline steps of language identification, tokenization, part-of-speech tagging, syntactic analysis and semantic analysis, does not particularly fit the recent trend for neural end-to-end systems (Zhang et al., 2015), which, in the extreme, try to avoid these steps altogether and favor purely character-based approaches.

For an introductory course targeting the basics of text analysis for digital humanities and addressing learners with a mostly arts and humanities background, we strongly believe that our course structure results in a better understanding of the problems that one needs to tackle when processing natural language.

In addition, white box instead of black box systems using valid features⁵ are most important for digital humanities and linguistics: it is often crucial to properly design linguistically meaningful features in order to obtain valid categories for understanding the specificity of a text corpus or a linguistic phenomenon. To give a simple example: even if a statistical model based on character n-grams turns out to perform best for authorship attribution, this model is of low interest for a linguistic research question on writing styles. That is because character n-grams do not represent a linguistically meaningful category and it is unclear what a character n-gram measures.

That said, a follow-up intermediate course clearly would need to focus more on distributional (word embeddings and topic modeling) and neural approaches, which, however, require more knowledge in mathematics and programming skills.

⁵ Valid for categories in the respective discipline.

Active Learning
Successful MOOCs have to offer more than just recorded video streams of lectures. Freeman et al. (2014) show that active learning settings generally improve the learning outcome of participants. Platforms such as Coursera offer several technical solutions for making distance learning more than passive consumption of videos. Individual user activity for an active and enduring learning experience is encouraged through several course items. In-video questions recapture the learner’s attention and require brief reflections on recently learned course content. Peer assignments encourage learners to apply knowledge from the current module, to try out NLP tools individually and to critically evaluate their actual performance. By assessing other peers, further reflection and critical feedback is demanded from the learners. In a hands-on video in Module 3, we provide step-by-step instructions for individual corpus analysis with the IMS Open Corpus Workbench. A brief introduction to the CQP query language allows the learner to issue more complex queries. Furthermore, we constantly invite the learner to apply his or her newly acquired skills and do further experiments on his or her own at the end of each module.

Community building and forum activity
In the first session of the course, held in summer 2015, users showed only a limited need for exchange in the forums. There was some discussion on more advanced topics such as dependency parsing, which was mentioned in Module 3 but formally introduced only later in Module 4. In the past, the peer assessments on the evaluation of named entity taggers triggered some discussions, for instance, on the question whether the German word “Mittelmeerraum” (Mediterranean) should be recognized as a toponym or not.

Our course participants on Coursera come from all over the world⁶, although naturally participants from the German speaking countries dominate. The participants have different backgrounds and interests; in our current course, 37% declare themselves as higher education students. Others are either looking for a job after graduation or already employed and willing to expand their knowledge regarding NLP for Digital Humanities.

⁶ Some of them are also motivated by the fact that the course is given in German.

Course development
After successfully running our MOOC on the new Coursera learning management system in Summer 2017, we fine-tuned our course for its future iterations. We tried to respond to previous learners’ feedback and to include a variety of small adjustments such as smaller quizzes after each video instead of longer quizzes at the end of each module. We now provide guidelines at the beginning of the MOOC and explain how the course can be used in order to satisfy wide-ranging needs of learners with different backgrounds, thereby easing “cherry-picking” of certain course modules and not forcing everybody into following the one-module-per-week order. Additionally, we integrated more discussion and reading prompts related to the course content to maintain the learner’s active attention. A new outlook section in the last module provides further links and information on machine translation and recent trends on applying Neural Network methods in NLP. In October 2017, a new version of the course goes live where learners will be able to purchase a certificate provided by the platform Coursera that can be helpful when seeking a job in the field of Digital Humanities.

4 Conclusion

This paper presents the content of an ongoing introductory MOOC on Natural Language Processing for Digital Humanities. Any participant who successfully completes this course will have a broad overview of the problems and solutions for automatically enriching and exploiting text corpora (via visual exploration or more sophisticated corpus queries). The course introduces the process of digitization, corpus creation, text representation, statistical analysis, visualization, automatic and manual annotation on different linguistic levels, as well as the challenges and benefits of multilingual resources.

As with any MOOC, the number of participants that actually complete the course is only a small fraction (5 to 12%) of all registered users (Ubell, 2017). When our course was run for the first time in 2015, 46 participants achieved a certificate of accomplishment out of 883 learners who actually visited the course at least once. In the current on-demand setup of the course that started in July 2017, we have a lower number of registered learners; however, the majority of them seem to be actively following the course.⁷

The number of participants cannot be considered “massive” in the literal sense of “Massive Open Online Course”; however, MOOCs actually do not need to have thousands of students. The strength of courses like ours lies in their openness, in the way they present and offer specialist knowledge to interested people all over the world, and last but not least, in how they structure the learning process and the topics in an accessible way and in easily digestible portions.

⁷ Regarding the participation in the course, we currently have 211 active learners out of 293 enrolled learners.

Acknowledgment

The production of our MOOC was financed by the division “Digitale Lehre und Forschung (DLF)” from the Faculty of Arts of the University of Zurich (UZH). We would like to thank Anita Holdener (DLF) for her constant technical support, Lukas Meyer from “Multimedia & E-Learning-Services (MELS)” of the UZH for producing our promotion video and for his introduction to video recording, and last but not least, Sara Wick, our proactive student tutor and production assistant.

References

Noah Bubenhofer. 2009. Sprachgebrauchsmuster. Korpuslinguistik als Methode der Diskurs- und Kulturanalyse. Sprache und Wissen, 4. De Gruyter, Berlin, New York.

Noah Bubenhofer. 2016. Drei Thesen zu Visualisierungspraktiken in den Digital Humanities. Rechtsgeschichte Legal History - Journal of the Max Planck Institute for European Legal History, (24):351–355.

K. Bretonnel Cohen and Lawrence Hunter. 2008. Getting started in text mining. PLOS Computational Biology, 4(1):1–3.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 6:363–370.

Scott Freeman, Sarah L. Eddy, Miles McDonough, Michelle K. Smith, Nnadozie Okoroafor, Hannah Jordt, and Mary Pat Wenderoth. 2014. Active learning increases student performance in science, engineering, and mathematics. Proceedings of the National Academy of Sciences, 111(23):8410–8415.

Lothar Lemnitzer and Heike Zinsmeister. 2006. Korpuslinguistik. Eine Einführung. Narr, Tübingen.

Christopher D. Manning. 2015. Last words: Computational linguistics and deep learning. Computational Linguistics, 41:701–707.

Franco Moretti. 2013. Distant Reading. Verso Books, London.

James Pustejovsky and Amber Stubbs. 2013. Natural language annotation for machine learning. O’Reilly Media, Sebastopol, CA.

Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. CoNLL, 6:147–155.

Tom Reamy. 2016. Deep text: using text analytics to conquer information overload, get real value from social media, and add big(ger) text to big data. Information Today.

Thomson Reuters. 2008. Open calais demo. http://www.opencalais.com/opencalais-demo/. Date accessed: 20/07/2017.

Kapeesh Saraf. 2017. Life gets in the way: How Coursera is solving for the biggest challenge in online learning. https://blog.coursera.org/life-gets-way-coursera-solving-biggest-challenge-online-learning/. Date accessed: 20/07/2017.

John Sinclair. 2004. Trust the Text. Language, Corpus and Discourse. Routledge, London.

Robert Ubell. 2017. MOOCs come back to earth. IEEE Spectrum, 54(3):22–22.

Martin Volk, Noah Bubenhofer, Adrian Althaus, Maya Bangerter, Lenz Furrer, and Beni Ruef. 2010. Challenges in building a multilingual alpine heritage corpus. Seventh International Conference on Language Resources and Evaluation (LREC).

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.