=Paper= {{Paper |id=Vol-1918/vela |storemode=property |title=A Practical Course in Corpus Linguistics for Students with a Humanist Background |pdfUrl=https://ceur-ws.org/Vol-1918/vela.pdf |volume=Vol-1918 |authors=Mihaela Vela,Hannah Kermes |dblpUrl=https://dblp.org/rec/conf/gldv/VelaK17 }} ==A Practical Course in Corpus Linguistics for Students with a Humanist Background== https://ceur-ws.org/Vol-1918/vela.pdf
                        A Practical Course in Corpus Linguistics
                       for Students with a Humanist Background

                   Mihaela Vela                                   Hannah Kermes
          Language Science and Technology                  Language Science and Technology
                Saarland University                              Saarland University
         m.vela@mx.uni-saarland.de                       h.kermes@mx.uni-saarland.de


Abstract

We present a practical course in corpus linguistics meant to provide students with a humanities background with the knowledge and skills needed for an empirical study as the basis for term papers and BA or MA theses. The course is part of a new Bachelor program and is combined with a theoretically oriented course on corpus linguistics. The challenge is to provide students with the necessary understanding of the underlying concepts and skills of corpus linguistics without overwhelming them with too much technical detail. The course material is modular, allowing for easy updates, modifications and adaptations, and can be reused for different target groups, settings, and applications.

1   Introduction

In this paper we present a practical course in corpus linguistics, which is meant to provide students with a humanities background with the knowledge and skills needed for an empirical study as the basis for term papers and BA or MA theses. The course is part of a new Bachelor program, Language Science, and is combined with a theoretically oriented course on corpus linguistics. The students in the program come from various backgrounds, including translatology and language studies. Most of the students have little or no experience in natural language processing.

The challenge is to provide them with the necessary understanding of the underlying concepts and skills of corpus linguistics without overwhelming them with too much technical detail. The course material is modular, allowing for easy updates, modifications and adaptations, and can be reused for different target groups, settings, and applications. The described processes, analyses and exercises are reproducible and portable to further studies.

In the following we discuss challenges for teachers and students (Section 2) and describe the general concept of the course (Section 3) and its composition (Section 4). We conclude with a brief summary and outlook (Section 5).

2   Challenges for teachers and students

A practical course on corpus linguistics for students with a humanities background poses challenges for both teachers and students.

The challenges stem from the seemingly opposed character of digital applications and the humanities disciplines, as well as from the character of a practical course, which requires a lot of active learning on the part of the students.

Challenges for teachers include:

  • motivating students and lowering the psychological and practical barriers
  • trying to avoid or solve technical problems
  • dealing with heterogeneous groups, both with regard to the prior knowledge of the students and with regard to their different learning pace
  • keeping track of the learning success of the group and of individual students, adjusting the teaching speed and/or type accordingly

Challenges for students include:

  • engaging with a potentially new kind of subject matter
  • dealing with and solving technical problems
  • coping with the high demands of active learning

Motivating students to engage with the technical aspects of corpus linguistics often boils down to
answering the question about the usefulness of the methodology. Good and obvious examples of applications in the students’ discipline(s) demonstrate the additional value. Useful in this respect are simple and understandable practical exercises that illustrate the methodology and help lower potential psychological barriers with regard to technical applications. Motivating students remains a challenge throughout a course, as technical aspects can easily become cumbersome and tedious. Active learning plays an important role in this respect. We understand active learning as an instructional method that engages students in meaningful learning activities in the classroom (e.g. doing exercises, working on and discussing problems/results) (Prince, 2004; Bonwell and Eison, 1991). It keeps the students active and involved and gives them immediate feedback on their learning success. The broader goal of the session, however, should be made clear to show why the activities of the students are necessary.

Technical problems with regard to applications and code are a challenge for both teachers and students. Many technical problems, e.g. with installing software, can be avoided by using online tools, e.g. online corpora or web services for corpus annotation. Another way to limit technical problems is to provide sample code for more complex examples, which only needs to be modified or complemented for exercises or later application. This helps to focus on the main aspects of the methodology, as the technical difficulties are reduced to a minimum. Nevertheless, technical problems cannot be avoided completely; it can even be good to provoke particular problems in class. Problem solving, especially finding errors in code or regular expressions, is also an aspect of corpus linguistic research. Working through examples and exercises in class will inevitably lead to some technical problems. However, as they occur in class, the teacher can immediately provide help in finding and solving the problems.

Teachers are often confronted with heterogeneous groups, both with regard to prior knowledge and with regard to learning speed. Although this is a general challenge in teaching, groups are often more heterogeneous in digital humanities settings, and the differences are especially pronounced when teaching technical skills. Again, an active learning environment where examples and exercises are worked through in class can help to adapt to individual needs. It is easier for the teacher to find out about individual problems and to provide immediate support. It is also possible to provide exercises for different levels or extra exercises for more advanced students. Working on a problem as a group can also foster deeper understanding.

Making students present the results of exercises and discussing them as a group helps teachers keep track of the learning success of the group and of individual students. This is important in order to adjust the teaching speed and/or type accordingly. The discussion of the results can also be used to sum up and point to important aspects of a teaching unit. This helps the students to reflect on their own learning success and to evaluate the personal organization and structure of their learning activities.

In the following we describe the general concept of the course and how it addresses the challenges described above.

3   General concept

The course covers the two main aspects of corpus linguistics: (i) corpus building and (ii) corpus analysis. The main emphasis, however, is on corpus analysis, as depicted in Figure 1. Tutorials lead the students through the main steps in corpus building, from the digitized text to an annotated searchable corpus, and from the linguistic research question and corpus extraction to corpus analysis.

The course is constructed like a sample study, each tutorial representing a particular step in the process. In this sense, the tutorials in both parts build on one another, as each tutorial produces the input data for the next. However, as the necessary sample data is provided at the beginning of each tutorial, the tutorials are also self-contained. In the first part, we create a corpus out of a small plain text sample, adding meta data and basic linguistic annotation. In the second part, we look at a sample research question, extracting and analyzing the respective data. Because the course is organized as a sample study, the students become acquainted with the process of an empirical corpus linguistic study, which facilitates a later application of the methodology to a study of their own.

The course material is provided as a website with detailed step-by-step tutorials including the necessary background information, examples with sample data and exercises. Links to external knowledge sources and tutorials provide access to additional information, including more (technical) details, more profound background or more complex
                                      Figure 1: Structure of the course


applications. The tutorials may be worked through at an individual pace, adapting to the specific needs of different target groups and individual students by skipping sections or providing additional information or exercises.

The tutorials are written in R Markdown and converted into HTML websites. Using R Markdown for the source documents has several advantages: (i) the tutorials are easy to modify, (ii) additional information as well as new material can easily be integrated, (iii) the students can download the source document, which allows them to add individual notes and to reproduce the provided sample analysis.

Although the tutorials were initiated as course material for a university course, they can also function as self-learning tutorials or as a (knowledge) base and memory aid for later use when writing a term paper or a BA or MA thesis. As a university course it follows the concept of the inverted or flipped classroom (Bergmann and Sams, 2012; Handke et al., 2012) in the sense that the sample study and exercises are worked through in the classroom individually or as a group, followed by a group discussion. This makes it possible to address problems immediately, discuss them as a group, work through advanced concepts and engage in collaborative learning and problem solving (Tucker, 2012), again more or less simulating a research process in a team.

4   Tutorial Description

The practical course described here consisted of ten incremental sessions on Corpus Linguistics. The goal, as described in the previous section, is to get students acquainted with the concepts and notions of Corpus Linguistics. Each session was planned as an interactive combination of a tutorial (presented by the lecturers or prepared by the students) and the corresponding exercises (solved in class by the students with the lecturer’s assistance). The tutorial corresponds to the theoretical part, introducing a new topic, while the exercises are meant as practical applications.

In this section we describe the course in detail, including not only the structure, content and technical aspects of the course, but also the infrastructure necessary to teach this type of class. We decided to publish the entire material of the course online¹, facilitating the availability and reproducibility of all course materials. As mentioned before, we use R Markdown for this purpose, which lets us combine narrative text and code and produce formatted output.

¹ http://fedora.clarin-d.uni-saarland.de/teaching/Corpus_Linguistics/index.html

As shown in Figure 1, the course is structured into four blocks, Corpus Building, Corpus Annotation, Corpus Query and Data Analysis, distributed over ten sessions. Each new session introduces a new concept while building on previously introduced concepts. In terms of corpora we use the Royal Society Corpus (RSC) (Kermes et al., 2016), a historical corpus of written scientific English, as well as the BROWN (Francis and Kučera, 1979), FLOB (Mair, 1999b), LOB (Johansson and Goodluck, 1978) or FROWN (Mair, 1999a) corpora, covering different time periods and registers
for both American and British English.

The course is structured as follows.

  • Session 1: Corpus building with XML and TEI
  • Session 2: Tagging with TreeTagger
  • Session 3: Corpus annotation with WebLicht
  • Session 4: Corpus query with regular expressions
  • Session 5: Corpus query with patterns
  • Session 6: Data extraction and data formats
  • Session 7: Data analysis and data evaluation with Excel
  • Session 8: Manipulating data sets with R
  • Session 9: Normalization and frequency distribution with R
  • Session 10: Plotting analysis results with R

Session 1 belongs to the Corpus Building block and provides an introduction to XML (EXtensible Markup Language) and TEI (Text Encoding Initiative). The goal of this class is to make students understand the importance and the syntax of mark-up languages when working with corpora. The first class starts with an exercise in which students are asked to mark title, paragraphs and sentences in a .txt file. The different solutions are meant to show the possible variation in marking linguistic units, here title, paragraphs and sentences, and to make the point that a standardized mark-up language is necessary. The tutorial first introduces the XML syntax, followed by the TEI syntax. To round off the session, it ends with an exercise on encoding the same text according to the TEI guidelines.

As described in Section 3, we provided additional information beyond the scope of the specific class. In Session 1 this additional information was provided as links, as depicted in Figure 2.

Figure 2: Additional resources for data encoding.

Session 2 and Session 3 are part of the Corpus Annotation block. Session 2 deals with part-of-speech tagging, including its definition and the introduction of the concept of a tagset. More specifically, this class also provides an introduction to the usage and configuration options of the TreeTagger (Schmid, 1994; Schmid, 1995). The exercise deals with the installation and usage of the TreeTagger and with tagging both .txt files and .xml files, as depicted in Figure 3.

Session 3 goes one step further by introducing additional annotation layers on top of the already existing part-of-speech annotation. This is carried out with WebLicht (Hinrichs et al., 2010), a web-based environment for the annotation of corpora. It includes tools for tokenization, lemmatization, pos-tagging and parsing (among others), which can be combined individually into tool chains. The WebLicht tutorial describes the usage of the tool with screenshots and examples. The exercise for the students is to build a processing chain in WebLicht including at least a tokenizer and the TreeTagger. The files used for this exercise are the same as in Session 2.

Session 4 and Session 5 belong to the Corpus Query block and are concerned with corpus query and thus with the qualitative analysis of the texts. Session 4 is meant as an introduction to regular expressions, defining the concept of a regular expression and explaining, by examples, the special characters and their role in formulating a regular expression. After practicing formulating queries with regular expressions in Notepad++², the students were introduced to the Saarland University CQPWeb (Hardie, 2012) platform and to the CQP syntax (Evert and Hardie, 2011)³. The tutorial consists of a series of example queries meant to consolidate the knowledge about regular expressions. The corresponding exercises consist of a set of queries to be carried out in CQPWeb on the RSC and BROWN corpora. An example of such an exercise can be found in Figure 4.

² https://notepad-plus-plus.org/
³ https://corpora.clarin-d.uni-saarland.de/cqpweb/

Session 5 is a continuation of Session 4. It builds on regular expression syntax, extending the simple search queries introduced before to more complex queries including patterns. In the exercise part
                          Figure 3: Instructions for tagging with the TreeTagger.
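To make the result of such a tagging step more concrete, the following minimal sketch reads tagger output into a small data structure. It is written in Python rather than taken from the course material, and it assumes a one-token-per-line, tab-separated layout (word, part-of-speech tag, lemma) in which XML tags are passed through on lines of their own. The sample text, the tag names and the helper `parse_tagged` are illustrative assumptions, not actual course data.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


# Hypothetical tagger output: one token per line with tab-separated columns
# (word, part-of-speech tag, lemma). XML tags sit on lines of their own.
SAMPLE_OUTPUT = (
    "<s>\n"
    "The\tDT\tthe\n"
    "students\tNNS\tstudent\n"
    "tag\tVVP\ttag\n"
    "texts\tNNS\ttext\n"
    ".\tSENT\t.\n"
    "</s>\n"
)


def parse_tagged(text: str) -> list[Token]:
    """Turn tab-separated tagger output into Token records, skipping XML tags."""
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        # Skip empty lines and mark-up lines such as <s> or </s>.
        if not line or (line.startswith("<") and line.endswith(">")):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"expected 3 tab-separated columns, got: {line!r}")
        tokens.append(Token(*parts))
    return tokens


tokens = parse_tagged(SAMPLE_OUTPUT)
# Collect the lemmas of all tokens whose tag starts with NN (nouns in this tagset).
nouns = [t.lemma for t in tokens if t.pos.startswith("NN")]
print(nouns)
```

Keeping word, tag and lemma together per token is what later allows queries and counts to refer to any of the three annotation layers.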


of the class the students are asked to build their own pattern using the CQP syntax and to query the RSC and BROWN corpora again using CQPWeb, as shown in Figure 5.

Figure 4: Introductory exercise in CQPWeb.

Session 6 belongs to both the Corpus Query and the Data Analysis block, dealing with the results of a query. The result of a specific linguistically motivated query is usually a data set containing information about a particular linguistic phenomenon extracted from a particular corpus. In this class students were introduced to the concept of a data set and to creating, formatting and manipulating it using the online search tool CQPWeb. Closely related to the data set, this class introduces the notions of observations, features and values of features. During the practical part students were asked to create their own data set based on a research question (e.g. the distribution of content verbs and their parts-of-speech across registers in a specific corpus) and to formulate the research question in terms of query, observations and features.

Sessions 7 to 10 belong to the Data Analysis block, introducing basic data analysis and data evaluation methods such as frequency distribution, normalization and statistical significance testing using the χ² (chi-square) test. The statistical analyses in these sessions are based on the queries and data extracted in the previous sessions.

Session 7 is a gentle introduction to data analysis, introducing (with relevant examples) all theoretical notions related to frequency distribution, normalization and χ². The practical application of these concepts is realized by exercises executed in Excel/Libre Office. Excel/Libre Office is not the state of the art in statistical analysis but has one big advantage: the statistical analysis can be performed step by step, permitting students to understand the path to the final formula. Understanding how a specific formula works (including the intermediate steps) is a great benefit for learners, who need to use this kind of knowledge later in their academic
                              Figure 5: Exercises with patterns in CQPWeb.
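The idea behind such pattern queries can be illustrated outside CQPWeb. A CQP-style pattern such as `[pos="JJ.*"] [pos="NN.*"]` describes a sequence of token positions, each constrained by a regular expression over an annotation layer. The sketch below mimics this in Python under simplifying assumptions: it supports only fixed-length sequences (no quantifiers or optional positions), and the token list, attribute names and the helper `find_pattern` are hypothetical.

```python
import re

# Hypothetical tagged sentence; each token carries a word and a pos attribute.
TOKENS = [
    {"word": "the", "pos": "DT"},
    {"word": "new", "pos": "JJ"},
    {"word": "corpus", "pos": "NN"},
    {"word": "contains", "pos": "VVZ"},
    {"word": "historical", "pos": "JJ"},
    {"word": "scientific", "pos": "JJ"},
    {"word": "texts", "pos": "NNS"},
]


def find_pattern(tokens, pattern):
    """Return the word sequences matching a list of per-token constraints.

    Each constraint maps an attribute name to a regular expression that
    must match the whole attribute value of the token at that position.
    """
    compiled = [
        {attr: re.compile(rx) for attr, rx in constraint.items()}
        for constraint in pattern
    ]
    hits = []
    for start in range(len(tokens) - len(compiled) + 1):
        window = tokens[start:start + len(compiled)]
        if all(
            rx.fullmatch(tok[attr])
            for tok, constraint in zip(window, compiled)
            for attr, rx in constraint.items()
        ):
            hits.append(" ".join(tok["word"] for tok in window))
    return hits


# Adjective followed by noun, in the spirit of [pos="JJ.*"] [pos="NN.*"].
adjective_noun = find_pattern(TOKENS, [{"pos": r"JJ.*"}, {"pos": r"NN.*"}])
print(adjective_noun)
```

Matching against the whole attribute value (`fullmatch`) is a deliberate choice here, so that `NN.*` covers NN and NNS but a bare `NN` would not match NNS.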




                     Figure 6: Calculating normalized figures in Excel/Libre Office.
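The step-by-step calculation that the spreadsheet makes visible can also be written out in a few lines of code. The sketch below normalizes raw frequencies to occurrences per million tokens and computes a χ² statistic for a 2×2 table (hits versus all other tokens in two corpora), without continuity correction. The counts and corpus sizes are invented for illustration; only the formulas are standard.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def chi_square_2x2(hits_a: int, size_a: int, hits_b: int, size_b: int) -> float:
    """Chi-square statistic for hits vs. other tokens in two corpora."""
    observed = [
        [hits_a, size_a - hits_a],
        [hits_b, size_b - hits_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [hits_a + hits_b, total - hits_a - hits_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of corpus and outcome.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Invented example figures: raw hits and corpus sizes in tokens.
HITS_A, SIZE_A = 150, 1_000_000
HITS_B, SIZE_B = 90, 1_200_000

norm_a = per_million(HITS_A, SIZE_A)
norm_b = per_million(HITS_B, SIZE_B)
chi2 = chi_square_2x2(HITS_A, SIZE_A, HITS_B, SIZE_B)

# Critical value of the chi-square distribution for 1 degree of freedom, p = 0.05.
CRITICAL_05_DF1 = 3.841
significant = chi2 > CRITICAL_05_DF1

print(f"per million: A={norm_a:.1f}, B={norm_b:.1f}")
print(f"chi-square: {chi2:.2f}, significant at 0.05: {significant}")
```

Normalizing first makes corpora of different sizes comparable; the test itself is computed on the raw counts, not on the normalized figures.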


studies.

Session 8 introduces statistical analysis with R in RStudio. It introduces basic notions related to R, including data manipulation such as adding column names, adding additional variables (columns), summarizing the data, and merging and combining two or more data sets, presenting appropriate examples for each of these topics. In the exercise part of this session students are asked to extract similar data sets from other corpora, applying the same data manipulation as used in the examples. At the end all data sets are combined into one large data set, which is then analyzed in Session 9. Figure 7 shows the introductory part related to data frames in R.

Session 9 relates to Session 7 in that it is concerned with basic data analysis and data evaluation methods such as frequency distribution (see Figure 8), normalization and statistical significance testing using the χ² test. The two sessions differ in the tools used for the analysis. While data analysis in Session 7 was exemplified using Excel, Session 9 uses R. The exercises from Session 7 are repeated in Session 9 to show how the tools relate to each other. However, it is also shown that R is more powerful when dealing with multivariate data sets, extending the analysis performed in Session 7.

Session 10 is the continuation of Session 9, introducing additional aspects of data analysis and showing how to visualize and interpret data from different perspectives (see Figure 8). Students prepare the different visualizations and their interpretation in
                               Figure 7: Exercises with data frames in R.
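The course carries out these manipulations on R data frames. Purely as an illustration of the underlying idea (observations as rows, features as columns), the sketch below mirrors three of the steps in plain Python: adding a feature, combining two data sets, and summarizing the result. The rows, feature names and the helper `add_feature` are invented for this example and do not come from the course material.

```python
from collections import Counter

# Hypothetical query results: one observation (row) per hit,
# described by the features lemma, pos and register.
brown_rows = [
    {"lemma": "show", "pos": "VVZ", "register": "learned"},
    {"lemma": "say", "pos": "VVD", "register": "press"},
    {"lemma": "find", "pos": "VVN", "register": "learned"},
]
lob_rows = [
    {"lemma": "show", "pos": "VVD", "register": "learned"},
    {"lemma": "tell", "pos": "VVD", "register": "fiction"},
]


def add_feature(rows, name, value):
    """Return copies of the rows with one additional feature (column)."""
    return [{**row, name: value} for row in rows]


# Record the source corpus as a new feature, then combine both data sets.
combined = add_feature(brown_rows, "corpus", "BROWN") + add_feature(lob_rows, "corpus", "LOB")

# Summarize: number of observations per corpus and register.
summary = Counter((row["corpus"], row["register"]) for row in combined)

for (corpus, register), count in sorted(summary.items()):
    print(corpus, register, count)
```

Adding the corpus name before combining is what keeps the origin of each observation recoverable in the merged data set.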




groups. The results are then discussed together in class. The importance of verifying the interpretation of the macro perspective of the visualization against the micro perspective, i.e. the examples from the corpora (concordance lines), is made explicit in these discussions, linking and intertwining quantitative and qualitative analysis.

The R Markdown documents used throughout Sessions 8 to 10 include sample code for data manipulation and data analysis (see Figures 8 and 9). The documents are modular, so that they can be applied to different data sets and the included code can be copied, modified and adapted. The modification and adaptation is exemplified in the exercises, and the students are encouraged to make notes about technical aspects and interpretations.

Figure 8: Data analysis in R.

Figure 9: Data analysis in R.

5   Conclusion

We presented a ten-session practical course on Corpus Linguistics for students with a humanities background. The structure of the course is based on active learning methods to address the challenges
of teaching a technical course to students with little or no technical background. An active learning environment encourages students to work on research questions alone or as a group, addressing (technical) challenges, solving technical and research-related problems and discussing results. The role of the teacher moves in the direction of an assistant, answering questions, pushing in the right direction and helping to find solutions. The course and the course material, as presented here, allow for easy modification, adaptation and extension. This makes the course and its material applicable to different target groups and settings, making the creation of such material worth the effort.

References

Jonathan Bergmann and Aaron Sams. 2012. Flip Your Classroom: Reach Every Student in Every Class Every Day. International Society for Technology in Education, Eugene, Or.

Charles C. Bonwell and James A. Eison. 1991. Active Learning: Creating Excitement in the Classroom. Number 1, 1991 in ASHE-ERIC higher education report. School of Education and Human Development, George Washington University, Washington, DC.

Stefan Evert and Andrew Hardie. 2011. Twenty-First Century Corpus Workbench: Updating a Query Architecture for the New Millennium. In Proceedings of the Corpus Linguistics 2011 Conference, Birmingham, UK.

W.N. Francis and H. Kučera. 1979. Manual of Information to Accompany A Standard Corpus of Present-day Edited American English, for Use with Digital Computers. Brown University, Department of Linguistics.

Jürgen Handke, Alexander Sperl, and Deutsche ICM-Konferenz, editors. 2012. Das inverted classroom model: Begleitband zur ersten deutschen ICM-Konferenz. Oldenbourg, München. OCLC: 810266426.

Andrew Hardie. 2012. CQPweb – Combining Power, Flexibility and Usability in a Corpus Analysis Tool. International Journal of Corpus Linguistics, 17(3):380–409.

Erhard Hinrichs, Marie Hinrichs, and Thomas Zastrow. 2010. WebLicht: Web-based LRT services for German. In Proceedings of the ACL 2010 System Demonstrations, pages 25–29. Association for Computational Linguistics.

Stig Johansson, Geoffrey Leech, and Helen Goodluck. 1978. Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers.

Hannah Kermes, Jörg Knappen, Stefania Degaetano-Ortlieb, and Elke Teich. 2016. The royal society corpus: From uncharted data to corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’16), Portoroz, Slovenia.

Christian Mair. 1999a. The Freiburg-Brown Corpus (Frown).

Christian Mair. 1999b. The Freiburg-LOB Corpus (F-LOB).

Michael Prince. 2004. Does active learning work? A review of the research. Journal of engineering education, 93(3):223–231.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Helmut Schmid. 1995. Improvements in Part-of-Speech Tagging with an Application to German. In Proceedings of the ACL SIGDAT-Workshop.

Bill Tucker. 2012. The flipped classroom. Education next, 12(1).