Lessons from a Massive Open Online Course (MOOC) on Natural
                 Language Processing for Digital Humanities

              Simon Clematide, Isabel Meraner, Noah Bubenhofer, Martin Volk
                             University of Zurich, Switzerland
                           Institute of Computational Linguistics
                 simon.clematide@uzh.ch, isabel.meraner@uzh.ch,
                       bubenhofer@cl.uzh.ch, volk@cl.uzh.ch


                     Abstract                                 text representation and analysis. Therefore, pro-
    In this paper, we present the concept,                    gramming experience is neither required for this
    content and experience with an actively                   introductory course nor provided in it.
    running Massive Open Online Course                           According to Ubell (2017), more than 58 mil-
    (MOOC) on Natural Language Processing                     lion people have signed up worldwide for Massive
    for Digital Humanities. This video-based                  Open Online Courses (MOOCs) by now. This form
    course is held in German, does not require                of distance learning in higher education has grown
    any programming skills, and serves as an                  popular over the last 6 years and several commer-
    introduction to automatic text analysis. The              cial and non-commercial platforms compete for
    target audience is anyone who is interested               participants.
    in applying basic language technology to                     Our free course is held on Coursera1 , one of
    text corpora. It has a strong empirical fo-               the largest commercial platforms that distributes
    cus on digital representations, tools and                 classes mostly held in English and created by lectur-
    corpus linguistics. The main goal thereby                 ers of top universities around the world. Our course
    is to grasp the fundamental terminology                   language is German2 which on the one hand has
    and concepts of computational linguistics,                the disadvantage of excluding participants who do
    to understand the main problems and solu-                 not speak German, but on the other hand, it allows
    tions, as well as to know about the perfor-               us to occupy a niche in language technology focus-
    mance and limitations of current methods.                 ing on German texts. A first session of the course
    Furthermore, manual annotation and data                   was run in summer 2015, and about 900 learners
    visualization are introduced in this course.              visited the course at least once. Due to legal issues
                                                              between our university and Coursera, and due to
1   Introduction                                              the introduction of Coursera’s new platform3 and
More and more scientific disciplines use automatic            the resulting course migration effort, it took two
text analysis in their digital scholarship. In the hu-        years to start the next session of our course.
manities, we have literary and cultural studies (e.g.            The rest of this paper is organized as follows: in
popularized as “distant reading” (Moretti, 2013),             section 2, we introduce and motivate the syllabus of
“corpus based discourse analysis” (Sinclair, 2004;            our course; in section 3, we discuss our experience
Bubenhofer, 2009) etc.), empirical corpus linguis-            from running the course twice so far.
tics and computational social sciences (including                 1 The MOOC can be accessed via this link:

automatic media monitoring (Reamy, 2016)), but                https://www.coursera.org/learn/digital-humanities.
                                                                  2 All videos have German subtitles, which is especially
text mining is also popular in the natural sciences,          useful for users with a limited understanding of German. We
for instance in the bio-medical domain (Cohen and             explicitly allow English contributions in the discussion forum
Hunter, 2008).                                                and peer assignments.
                                                                  3 Coursera now offers all courses more flexibly on demand
   Being able to apply Natural Language Process-              by restarting each course regularly at intervals of several
ing (NLP) methods to texts requires special knowl-            weeks. Learners can now easily switch from one instance
edge and skills. The goal of this course is not to            of a course to the next if they cannot complete within their
                                                              initial learner cohort. According to Saraf (2017), these cohorts
teach these skills, but to didactically introduce im-         improve the completion rate compared to purely self-paced
portant concepts and techniques related to digital            learning and still offer more flexibility.


2   Course Structure                                         texts in a creative, interactive and illustrative way.

The course is designed to run over a period of 6             Module 4 “Automatic Corpus Annotation Using
weeks each of which has its own thematic focus.              NLP Tools”
Each thematic module consists of 30 to 90 minutes            In this module, we introduce different automatic
of videos, which mostly use a fairly traditional             corpus annotation methods, such as part-of-speech
format where a lecturer presents slides and explains         tagging, lemmatization, stemming, parsing, Named
NLP methods in an accessible and illustrative way.           Entity Recognition, and Entity Linking (Ratinov
In addition, we provide learner-oriented learning            and Roth, 2009) for automatic disambiguation. Fur-
objectives, more detailed readings regarding the             thermore, we investigate potential problems and
presented topics and further course material within          sources of errors that can emerge while using such
each module. In order to test the individual learning        automatic annotation tools and we offer approaches
progress, we integrated either a brief final quiz or         to solving these issues.
a peer assessment at the end of each module. The
course syllabus is structured as follows:                    Module 5 “Manual Annotation and Evaluation
                                                             of Corpus Data”
Module 1 “Paths into the Digital World”                      The main topic of module 5 is the efficient combi-
This introductory module presents the fundamental            nation of manual and automatic annotation
concepts and terminology regarding the digitization          and the integration of machine learning methods
of texts. We present techniques such as scanning             in the vein of Pustejovsky and Stubbs (2013). Sub-
and OCR (Optical Character Recognition) as well              sequently, we present the most common metrics
as other approaches for the acquisition of text cor-         for measuring the quality of NLP systems and in-
pus material (including digital-born documents),             troduce the concept of inter-rater reliability. In the
and we discuss potential problems related to dig-            second part of module 5, we focus on the possibili-
itization and corpus design. Additionally, short             ties and restrictions of crowd-sourcing methods in
interviews about digitization techniques and the             the digital humanities.
relevance of digitization with two experts from the
Zurich central library complete the first module.            Module 6 “Challenges in Multilingual Text Anal-
                                                             ysis”
Module 2 “Structured and Effective Representa-               The last module concentrates on multilingual and
tion of Corpus Data”                                         parallel corpora as well as on automatic language
The second module provides an overview of differ-            identification in large-scale text collections. Finally,
ent encodings, the markup language XML and the               we introduce several up-to-date tools for automatic
TEI P5 standard for text representation. The second          alignment of parallel corpora on the level of docu-
half of the module has its focus on automatic tok-           ments, sentences and words.
enization and sentence segmentation. Finally, in a
non-graded hands-on discussion prompt, the learner           Assessments
needs to apply the acquired XML knowledge con-               In terms of graded assignments, we integrate short
cerning well-formedness and identify syntax errors           single and multiple choice quizzes ranging from 5
in an XML document.                                          to 12 questions at the end of each module4 . Mod-
                                                             ule 3 and Module 5 additionally include a graded
Module 3 “Properties of Corpora and Basic                    peer assignment where each learner is supposed to
Methods for Analysis”                                        assess at least two other submissions according to
In this module, we present the basic concepts of cor-        detailed grading instructions. The peer assessment
pus linguistics such as term frequencies, n-grams,           in Module 3 encourages the learner to apply the
collocations and methods for analyzing texts ac-             acquired knowledge on complex corpus queries.
cording to Lemnitzer and Zinsmeister (2006). In              By this means, each learner performs individual
addition, we demonstrate the functionality of vari-          queries on the IMS Open Corpus Workbench or the
ous platforms and interfaces for corpus analysis and         COSMAS II interface, regarding diachronic lan-
show some hands-on corpus query examples. In the             guage change. Apart from this, the learner is sup-
last part of Module 3, we introduce the topic “vi-           posed to generate frequency charts or collocation
sual linguistics” (Bubenhofer, 2016) together with
a variety of tools for displaying the properties of             4 One module currently does not have a quiz.


profiles and to interpret the findings and insights
gained from this task.
   The peer assignment in module 5 requires the
learner to run the online demo version of the Stan-
ford Named Entity Tagger (Finkel et al., 2005) or
the Thomson Reuters Open Calais (Reuters, 2008)
on a small sample text of his or her own choice and to
evaluate the NER tagger’s output according to the
evaluation metrics precision and recall that we ex-
plained in this module. In this manner, peer assess-
ments motivate the learners to try out different NLP
tools and corpus query platforms individually,               Figure 1: Talking head recording of a speaker pre-
and to question and critically analyze their output.         senting slides in a weekly video session.
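
As a minimal sketch of the evaluation asked for in this peer assignment, the following Python snippet computes precision and recall for a tagger's output against a manually created gold standard. The function name, the representation of an entity as a (text, type) pair, and the example entities are illustrative assumptions and not part of the course material; real evaluations often also compare entity positions in the text.

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    # An annotation counts as correct only if text and type both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations: (entity text, entity type)
gold = {("Zürich", "LOC"), ("Matterhorn", "LOC"), ("Coursera", "ORG")}
predicted = {("Zürich", "LOC"), ("Coursera", "ORG"),
             ("Matter", "LOC"), ("Alpen", "ORG")}

p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

In this toy case, two of the four predicted entities are correct (precision 0.5), and two of the three gold entities are found (recall about 0.67).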

Community building and feedback
In order to enhance community building and the-              Having 3 different lecturers makes the course more
matic exchange between enrolled learners, we in-             varied and offers the learner slightly different per-
cluded a “Meet and Greet” discussion prompt sec-             spectives on the matter. In addition, every lecturer
tion in the first module as well as a “Feedback and          was able to teach the topics he is more specialized
Thank You” discussion field at the very end of the           and experienced in.
last module. For each module a weekly discussion
                                                             2.2   Building a studio and gaining recording
forum is automatically generated on the platform
                                                                   experience
where participants can ask or answer questions re-
garding the content of a module. Additionally, for           For the time of the video recordings we turned an
each discussion prompt, individual threads are au-           office into a makeshift studio. We decided to record
tomatically included in the weekly forum to al-              the videos on our own and not by a multimedia pro-
low topic-related discussions and exchange. To               duction team from our university, who, however,
ensure a friendly discussion atmosphere and make             instructed us kindly in the beginning. Although
the learners feel well looked after, the course tutor        their result would have looked more professional,
is actively present in the forums and tries to answer        recording on our own gave us the urgently needed flexibility in pro-
or comment on every contribution.                            duction, as none of the lecturers had prior experience in
                                                             teaching in front of a camera.
2.1   Lecturers and tutors                                      The scene background was white with some
Three different lecturers teach in this course and           books and logos for ease of recognition (see Fig. 1).
they agreed beforehand on the overall content,               Lighting was installed to keep the scene equally
syllabus and presentation style. After that, each            illuminated without making the lecturers look pale.
lecturer was responsible for developing his own              The lecturers were filmed from the side while sit-
module content, preparing the slides and organiz-            ting in order to offer a relaxed learning atmosphere.
ing additional material. A student assistant sup-            Lecturers needed a while to learn to keep eye-
ported this process, cut the video recordings, added         contact with the camera rather than looking at their
some video effects (zooming, highlighting, tex-              slides. Small slips of the tongue were accepted as
tual annotations, in-video quizzes in order to avoid         ingredients of natural talks. Larger wording errors
monotony) to the slide recordings and published              were cut out and required repeated recordings.
everything on Coursera’s electronic learning man-
agement platform. All lecturers already had a lot            2.3   Resources and NLP Tools
of teaching experience in the subjects of their mod-         As a running example, we use the diachronic and
ules, yet, everyone had to invest a large amount             multilingual corpus Text+Berg (Volk et al., 2010)
of time to fit the existing teaching material from           which allows us to illustrate many different NLP
normal university classes into video sequences of            tasks and exploitation techniques on a coherent
an appropriate length for online courses. Actually,          and academically freely available resource. This
some of our videos are still too long by current             corpus has texts mostly in French, German, and
standards (5-7 minutes).                                     Italian, some of them translated, and spans over a


period of 150 years.                                           text analysis for digital humanities and addressing
   In our videos, we also mention, demonstrate and             learners with a mostly arts and humanities back-
reference a lot of other initiatives, resources, frame-        ground, we strongly believe that our course struc-
works, and open-source tools: (a) digitization ini-            ture results in a better understanding of the prob-
tiatives (Projekt Gutenberg, Europeana, TextGrid);             lems that one needs to tackle when processing nat-
(b) OCR crowd-correction and crowd-sourcing in                 ural language.
general (TypeWright, Crowdflower, Artigo); (c)                    In addition, white box instead of black box sys-
online corpora and corpus query tools (COSMAS                  tems using valid features5 are most important for
II/DeReKo, DWDS, CQPweb); (d) parallel corpora                 digital humanities and linguistics: It is often crucial
(EuroParl, Canadian Hansard); (e) sentence and                 to properly design linguistically meaningful features to
word alignment tools for parallel corpora (Inter-              receive valid categories for understanding the speci-
Text, HunAlign, GIZA++); (f) language identifica-              ficity of a text corpus or a linguistic phenomenon.
tion (lingua-ident, LangId); (g) text representation           To give a simple example: Even if a statistical
standards (Unicode, UTF-8, XML, TEI-P5); (h)                   model based on character n-grams turns out to per-
annotation standards (STTS, Universal tags and de-             form best for authorship attribution, this model is
pendencies); (i) standard lexical and syntactic NLP            of low interest for a linguistic research question on
tools (Porter Stemmer, Durm Lemmatizer, Tree-                  writing styles. That is because character n-grams
Tagger, Connexor-Tagger; chunkers and parsers);                do not represent a linguistically meaningful category
(j) named entity recognition (Open Calais, Stan-                  and it is unclear what a character n-gram measures.
ford NER); (k) tools for manual annotation of lin-                Even so, a follow-up intermediate course
guistic structures (and/or querying the annotations)           would clearly need to focus more on distributional
(WebAnno, ANNIS, EXMARaLDA, RSTTool); (l)                      (word embeddings and topic modeling) and neural
visualization (Graphviz, Leaflet, Gephi).                      approaches, which, however, require more knowl-
                                                               edge in mathematics and programming skills.
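
To make the notion of a character n-gram from the authorship example concrete, here is a minimal Python sketch that extracts and counts character trigrams. The function name and the sample sentence are illustrative assumptions; the sketch shows only the feature extraction step that such a statistical model might build on, not an authorship attribution model itself.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sample = "der Berg und der Bach"  # hypothetical example sentence
counts = Counter(char_ngrams(sample, n=3))

# Show the most frequent trigrams with their counts.
for ngram, freq in counts.most_common(5):
    print(repr(ngram), freq)
```

Trigrams such as "er " or "r B" cross word boundaries and correspond to neither morphemes nor words, which illustrates why such features are hard to interpret as linguistic categories.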
3   Discussion
                                                               Active Learning
The field of language technology and NLP is a                  Successful MOOCs have to offer more than just
rapidly evolving discipline. In the last 25 years, sys-        recorded video streams of lectures. Freeman et
tems based on hand-written rules and application-              al. (2014) show that active learning settings gener-
specific algorithms have been largely superseded               ally improve the learning outcome of participants.
by statistical systems that are typically built by             Platforms such as Coursera offer several technical
supervised or semi-supervised machine learning                 solutions for making distance learning more than
techniques.                                                    passive consumption of videos. Individual user
   In our course we reflect this paradigm change,              activity for an active and enduring learning expe-
e.g. by contrasting the output of a rule-based part-           rience is encouraged through several course items.
of-speech tagger with a statistical one, and make              In-video-questions re-captivate the learner’s atten-
our participants aware of the different requirements           tion and require brief reflections on recently learned
for these approaches (e.g. manually built training             course content. Peer assignments encourage learn-
material needed for supervised machine learning).              ers to apply knowledge from the current module
However, we do not introduce “Neural Deep Learn-               and to try out NLP tools individually and critically
ing” methods (Manning, 2015), which currently                  evaluate their actual performance. By assessing
dominate NLP research and already have an im-                  other peers, further reflection and critical feedback
pact on practical NLP systems. Our course design,              is demanded from the learners. In a hands-on video
which roughly follows the traditional NLP pipeline             in Module 3, we provide step-by-step instructions
steps with language identification, tokenization,              for individual corpus analysis with the IMS Open
part-of-speech tagging, syntactic analysis and se-             Corpus Workbench. A brief introduction of the
mantic analysis does not particularly fit the recent           CQP query language allows the learner to issue
trend for neural end-to-end systems (Zhang et al.,             more complex queries. Furthermore, we constantly
2015), which – in the extreme – try to avoid these             invite the learner to apply his or her newly acquired
steps altogether and favor purely character-based              skills and do further experiments on his or her own
approaches.                                                    at the end of each module.
   For an introductory course targeting the basics of             5 Valid for categories in the respective discipline.


Community building and forum activity                            4   Conclusion
Our experience with the first session of the course,
held in summer 2015, suggests that users have only               This paper presents the content of an ongoing intro-
a limited need for exchange in the forums. There                 ductory MOOC on Natural Language Processing
was some discussion on more advanced topics such                 for Digital Humanities. Any participant who suc-
as dependency parsing which was mentioned in                     cessfully completes this course will have a broad
Module 3, however, more formally introduced only                 overview on the problems and solutions for au-
later in Module 4. In the past, the peer assess-                 tomatically enriching and exploiting text corpora
ments on the evaluation of named entity taggers                  (via visual exploration or more sophisticated cor-
triggered some discussions, for instance, on the                 pus queries). The course introduces the process
question whether the German word “Mittelmeer-                    of digitization, corpus creation, text representation,
raum” (Mediterranean) should be recognized as a                  statistical analysis, visualization, automatic and
toponym or not.                                                  manual annotation on different linguistic levels, as
                                                                 well as the challenges and benefits of multilingual
   Our course participants on Coursera come from
                                                                 resources.
all over the world6 , although naturally participants
from the German speaking countries dominate. The                    As with any MOOC, the number of participants
participants have different backgrounds and inter-               that actually complete the course is only a small
ests, in our current course 37% declare themselves               fraction (5 to 12%) of all registered users (Ubell,
as higher education students. Others are either look-            2017). When our course was run for the first time
ing for a job after graduation or already employed               in 2015, 46 participants achieved a certificate of
and willing to expand their knowledge regarding                  accomplishment out of 883 learners who actually
NLP for Digital Humanities.                                      visited the course at least once. In the current on-
                                                                 demand setup of the course that started in July
                                                                 2017 we have a lower number of registered learners,
Course development                                               however, the majority of them seems to be actively
After successfully running our MOOC on the new                   following the course.7
Coursera learning management system in Summer                       The number of participants cannot be considered
2017, we fine-tuned our course for its future iter-              “massive” in the literal sense of “Massive Open
ations. We tried to respond to previous learners’                Online Course”, however, MOOCs actually do not
feedback and to include a variety of small adjust-               need to have thousands of students. The strength
ments such as smaller quizzes after each video                   of courses like ours lies in their openness, in the
instead of longer quizzes at the end of each module.             way they present and offer specialist knowledge to
We now provide guidelines at the beginning of the                interested people all over the world, and last but not
MOOC and explain how the course can be used in                   least, how they structure the learning process and
order to satisfy wide-ranging needs of learners with             the topics in an accessible way and easily digestible
different backgrounds, therefore easing “cherry-                 portions.
picking” of certain course modules and not forcing
everybody into following the one-module-per-week                 Acknowledgment
order. Additionally, we integrated more discussion
and reading prompts related to the course content                The production of our MOOC was financed by
to maintain the learner’s active attention. A new                the division “Digitale Lehre und Forschung (DLF)”
outlook section in the last module provides further              from the Faculty of Arts of the University of Zurich
links and information on machine translation and                 (UZH). We would like to thank Anita Holdener
recent trends on applying Neural Network meth-                   (DLF) for her constant technical support, Lukas
ods in NLP. In October 2017, a new version of                    Meyer from “Multimedia & E-Learning-Services
the course goes live where learners will be able                 (MELS)” of the UZH for producing our promotion
to purchase a certificate provided by the platform               video and for the introduction to video recording
Coursera that can be helpful when seeking a job in               he gave to us, and last but not least, Sara Wick, our
the field of Digital Humanities.                                 initiative student tutor and production assistant.

   6 Some of them are also motivated by the fact that the           7 Regarding the participation in the course, we currently
course is given in German.                                       have 211 active learners out of 293 enrolled learners.


References                                                     John Sinclair. 2004. Trust the Text. Language, Corpus
                                                                 and Discourse. Routledge, London.
Noah Bubenhofer. 2009. Sprachgebrauchsmuster. Ko-
  rpuslinguistik als Methode der Diskurs- und Kultur-          Robert Ubell. 2017. MOOCs come back to earth.
  analyse. Sprache und Wissen, 4. De Gruyter, Berlin,            IEEE Spectrum, 54(3):22–22.
  New York.
                                                               Martin Volk, Noah Bubenhofer, Adrian Althaus, Maya
Noah Bubenhofer.       2016.   Drei Thesen zu Vi-               Bangerter, Lenz Furrer, and Beni Ruef. 2010. Chal-
  sualisierungspraktiken in den Digital Humanities.             lenges in building a multilingual alpine heritage cor-
  Rechtsgeschichte Legal History - Journal of the               pus. Seventh International Conference on Language
  Max Planck Institute for European Legal History,              Resources and Evaluation (LREC).
  (24):351–355.
                                                               Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015.
K. Bretonnel Cohen and Lawrence Hunter. 2008. Get-               Character-level convolutional networks for text clas-
  ting started in text mining. PLOS Computational Bi-            sification. In Advances in Neural Information Pro-
  ology, 4(1):1–3.                                               cessing Systems 28, pages 649–657. Curran Asso-
                                                                 ciates, Inc.
Jenny Rose Finkel, Trond Grenager, and Christopher
  Manning. 2005. Incorporating non-local informa-
  tion into information extraction systems by Gibbs
  sampling. Proceedings of the 43rd Annual Meet-
  ing of the Association for Computational Linguistics
  (ACL 2005), 6:363–370.

Scott Freeman, Sarah L. Eddy, Miles McDonough,
  Michelle K. Smith, Nnadozie Okoroafor, Hannah
  Jordt, and Mary Pat Wenderoth. 2014. Active learn-
  ing increases student performance in science, engi-
  neering, and mathematics. Proceedings of the Na-
  tional Academy of Sciences, 111(23):8410–8415.

Lothar Lemnitzer and Heike Zinsmeister. 2006. Kor-
  puslinguistik. Eine Einführung. Narr, Tübingen.

Christopher D. Manning. 2015. Last words: Computa-
  tional linguistics and deep learning. Computational
  Linguistics, 41:701–707.

Franco Moretti. 2013. Distant Reading. Verso Books,
  London.

James Pustejovsky and Amber Stubbs. 2013. Natural
  language annotation for machine learning. O’Reilly
  Media, Sebastopol, CA.

Lev Ratinov and Dan Roth. 2009. Design chal-
  lenges and misconceptions in named entity recogni-
  tion. CoNLL, 6:147–155.

Tom Reamy. 2016. Deep text: using text analytics to
  conquer information overload, get real value from
  social media, and add big(ger) text to big data. In-
  formation Today.

Thomson Reuters. 2008. Open Calais demo.
  http://www.opencalais.com/opencalais-demo/.
  Date accessed: 20/07/2017.

Kapeesh Saraf. 2017. Life gets in the way: How
  Coursera is solving for the biggest challenge in
  online learning.
  https://blog.coursera.org/life-gets-way-coursera-solving-biggest-challenge-online-learning/.
  Date accessed: 20/07/2017.

