=Paper= {{Paper |id=Vol-1918/ballier |storemode=property |title=R-Based Strategies for DH in English Linguistics: A Case Study |pdfUrl=https://ceur-ws.org/Vol-1918/ballier.pdf |volume=Vol-1918 |authors=Nicolas Ballier,Paula Lissón |dblpUrl=https://dblp.org/rec/conf/gldv/BallierL17 }} ==R-Based Strategies for DH in English Linguistics: A Case Study== https://ceur-ws.org/Vol-1918/ballier.pdf
R-based strategies for DH in English Linguistics: a case study

Nicolas Ballier
Université Paris Diderot
UFR Études Anglophones
CLILLAC-ARP (EA 3967)
nicolas.ballier@univ-paris-diderot.fr

Paula Lissón
Université Paris Diderot
UFR Études Anglophones
CLILLAC-ARP (EA 3967)
paula.lisson@etu.univ-paris-diderot.fr




Abstract

This paper is a position statement advocating the implementation of the
programming language R in a curriculum of English Linguistics. It illustrates
a possible strategy for meeting the requirements of Natural Language
Processing (NLP) for Digital Humanities (DH) studies in an established
curriculum. R plays the role of a Trojan Horse for NLP and statistics, while
promoting the acquisition of a programming language. We report an overview of
existing practices implemented in an MA and PhD programme at the University of
Paris Diderot in recent years. We emphasize completed aspects of the
curriculum and detail existing teaching strategies rather than work in
progress, but our last section alludes to work still under way, such as
getting PhD students to write their own R packages.

We describe our strategy, discuss better practices and teaching concepts, and
present experiments in a curriculum. We express the needs of an initially
limited NLP environment and provide directions for future DH curricular
developments. We detail the challenges in teaching a non-CL audience, showing
that some software suites can be integrated into a curriculum, and outlining
how some specific R packages contribute to the acquisition of NLP-based
techniques and foster awareness of the need for statistical modelling.

1    Introduction

This paper deals with the development of an R-based culture of DH for students
of English Linguistics at the University of Paris Diderot. We describe some
aspects of a curriculum (MA and PhD) that aims at taking advantage of the
flexibility and the adaptability of this programming language for research in
linguistics, in both quantitative and qualitative approaches.

Developing a culture based on the programming language R (R Core Team 2016)
for NLP among MA and PhD students doing English Linguistics is no easy task.
Broadly speaking, most of the students have little background in mathematics,
statistics, or programming, and usually feel reluctant to study any of these
disciplines. While most PhD students in English linguistics are former
students with a Baccalauréat in Sciences, some MA students pride themselves on
having radically opted out of Maths. However, we believe that students should
be made aware of the growing use of statistical and NLP methods in linguistics
and be able to interpret and implement these techniques.

We need to show our students how important the DH are for their research,
enabling them to see that the use of NLP techniques provides them with a whole
range of new possibilities for the treatment and the analysis of their data.
In addition, with the growth of corpus linguistics, students often need to
work with large corpora or huge databases of images, and standard tutorials
and introductory books do not cover these needs (Arnold and Tilton 2015).

Preparing students to work with NLP methods and with command lines also means
asking them to work with particular formats of data they need to get used to
(e.g. tabular format, utf8, limited use of special characters unless
necessary). However, this facilitates the interoperability of their data, as
well as the replicability of their research.

The rest of the paper is organized as follows: section 2 describes the context
of an MA in English Linguistics and explains why the culture is traditionally
limited in NLP and DH in this kind of curriculum. It also details the strategy
used to develop an ‘R-based culture’ among MA
and PhD students, taking advantage of its flexibility and adaptability to the
various requirements of linguistic data. Section 3 explains how collaboration
with the Maths department has enabled the emergence of an R-based common
culture for statistics to be taught to mostly ‘mathless’ students. Section 4
discusses the various strategies to teach R in recent textbooks of
quantitative linguistics (from Baayen, 2008 to Levshina, 2015). Several
teaching styles and aims are discussed. Section 5 discusses DH in a wider
perspective of Machine Learning (ML) based analyses and shows the benefits of
R, reporting on the possibility of an interdisciplinary bridge for programs in
data science. Section 6 proposes an epistemological interpretation of R as a
possible medium for the 3rd revolution of grammatisation, expanding on Sylvain
Auroux’s notion. The conclusion reflects on the limitations of our strategy
when compared with other programming languages. We try to assess the relevance
of this Esperanto-like, high-level programming language for digital
humanities.

2     Developing an R-based culture for NLP, DH and beyond

2.1    Why is it necessary?

In France, the study of English Linguistics in English departments has
traditionally been linked to the competitive exams to become teachers of
English (agrégation), so that linguistics is only a sub-domain in relation to
other domains of English studies such as translation and literature. As a
consequence, the core of this curriculum can hardly be dedicated to corpus
linguistics. There is also another structural (devastating) side effect for
linguistic research: since there is not a single trace of NLP-driven questions
in the agrégation, there is nothing in the curriculum of English studies about
these issues (contrary to what the introduction of English phonology somehow
triggered after 2000 in the agrégation and in the undergraduate programmes
that are meant to prepare for this competitive exam). Since in most European
universities corpus linguistics and quantitative methods have become essential
in Linguistics curricula, the gap in France between the agrégation and
linguistic research is widening.

In corpus linguistics, the treatment of large corpora and complex datasets
with many variables is getting increasingly frequent. However, the application
of statistical models, as well as the use of NLP techniques such as parsers,
Part-of-Speech (POS) taggers or classifiers, among many other possibilities,
requires some familiarity with at least one programming language. Although
most programs designed for corpus linguistics, such as AntConc (Anthony 2011),
Cesax (Komen 2011), Sketch Engine (Kilgarriff et al. 2004), Wordsmith (Scott
2016), or, within the French lexicometric tradition, Le Trameur (Fleury and
Zimina 2014) or Le Gromoteur (Gerdes 2014), are presented in a graphical and
more or less user-friendly interface (GUI) with built-in functions, the
command line offers much more flexibility in terms of exploration and
modelling of the data. For instance, R can first be used as a concordancer
(see S. Gries 2009) in order to explore a given corpus; then, once the desired
structures have been extracted, they can also be treated in R in terms of
statistical analysis and/or modelling. Finally, a visual representation of the
results of the analysis can be easily plotted.

Apart from all the statistical packages that can be used for a quantitative
analysis of data, there are currently more than 50 packages for NLP available
in the CRAN repository[1], some of them being particularly useful for
linguists. Because our students have research questions on spoken or written
data, they can find several R packages to suit their needs if they are taught
how to use them. These are some of the specific packages our students are
working with:

     • Phonetics/Phonology. Students are presented with normalization issues
       and analyses of vowel systems drawing from packages such as phonTools
       (Barreda 2015), vowels (Kendall and Thomas 2010), emuR (Winkelmann,
       Jaensch, and Harrington 2017) and phonR (McCloy 2016), which allow for
       the treatment of data extracted with Praat (Boersma and Weenink, 2017),
       one of the most well-known pieces of software used in
       phonetics/phonology.
     • Data mining/text processing: cleanNLP (Arnold 2017), koRpus (Michalke
       2017), languageR (Baayen, 2011), tidytext (Silge and Robinson 2016),
       openNLP (Baldridge 2005), qdap (Rinker 2013). For the treatment and the
       exploration of written corpora, some of the functions that these
       packages offer are: automatic POS tagging, implementations of parsers
       (e.g. Stanford coreNLP and the SpaCy library implemented in cleanNLP),
       text trimming, mathematical modelling of vocabulary with LNRE models,
       automatic syllabification, measures of lexical diversity and
       readability…

[1] https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

2.2    Aspects of the current curriculum

The implementation of R-based modules in the BA and the MA has been taking
place progressively. Here we detail the various courses and activities that
are currently implemented in our curriculum.

Undergraduate module. We present R among other software used by linguists in a
module centred on students’ projects, following Language and Computers. This
introductory module also aims at getting undergraduate students interested in
pursuing our MA research program.

MA seminars. Although during the first year of the MA efforts are made to
encourage students to start using R as a way to process their corpus-based
data, it is during the second year of the MA that a seminar on R is offered.
This seminar consists of 6 sessions of 120 minutes covering the use of R for
phonetics and phonology, and 6 sessions of 120 minutes on the use of R for
textual data. Over the two years of the MA, the introduction to R-based
packages in the curriculum takes the following form, in the different relevant
modules (cf. Table 1):

Table 1: R in the MA modules

Year   First semester                  Second semester
M1     Corpus methodology:             Phonetic Analysis 1 (normalisation,
       descriptive statistics          plots, visualisation)
M2     Language and Variation          Computational phonology (classifiers) /
       (inferential statistics)        Phonetic Analysis 2

PhD seminar. Currently, our graduate school offers 12 sessions of 90 min
covering advanced statistical techniques for linguists. A basic command of R
is presupposed, especially since MA seminars already cover the use of R for
beginners. This seminar mainly focuses on the applicability of statistical
methods to the different types of data linguists deal with, but also on the
mathematical formulae behind all these methods. Empirical cases (already
published by linguists) are used as examples. This seminar covers probability
distributions, linear regression models, ANOVAs, linear discriminant analysis,
principal component analysis…

Data sessions. As a follow-up to the Statistics seminar, PhD students can
discuss their data during data sessions. In these sessions, our students have
the opportunity to present their linguistic dataset to a statistician.
Ideally, the student has already made some progress on a basic level and knows
how to present the structure of the data and detail the kind of variables
investigated. Together with a statistician, they explore all the different
methods that can be applied according to what the student’s project requires.

Individual work. Because we are conscious that our curriculum cannot cover all
the training in R that our students would need, some individual extra work is
needed. In that sense, specific manuals for linguists such as Analyzing
Linguistic Data: a practical introduction using R (Baayen 2008), Quantitative
Corpus Linguistics with R (S. Gries 2009), Quantitative Methods in Linguistics
(Johnson 2011), Statistics for Linguists (S. T. Gries 2013), How to do
Linguistics with R (Levshina 2015), and Humanities Data in R (Arnold and
Tilton 2015) are recommended, bearing in mind that they have different
prerequisites in mathematics. Although not all these manuals have the same
approach with regard to technicality and progression path, all of them offer a
comprehensive view of what linguists can do with R. For a more detailed
summary and comparison between some of these manuals, see Ballier
(forthcoming).

Bootcamps. Bootcamps are regularly organized, both for complete beginners and
for intermediate users who seek to improve their abilities in R and to
discover new techniques. Generally, one basic bootcamp is proposed at the
beginning of the year for all students who want to participate, and a second
bootcamp is later offered for more advanced learners. Bootcamps are conceived
as an intensive way to approach particular issues (e.g. regression models,
visualisation, classifiers…) and consist of a week of 25-30 hours of
instruction. Bootcamps are normally taught with examples of datasets taken
from published papers or on-going research, but students may also explore
their own data if applicable. These official bootcamps are normally taught by
visiting scholars, such as Stefan Gries or Taylor Arnold.
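The techniques named above for the PhD seminar and the bootcamps (linear
regression, ANOVA, principal component analysis) can each be run in a few
lines of base R. The following is only a minimal sketch of what such an
exercise might look like: the choice of the built-in iris dataset and of the
variables is an illustrative assumption, not the content of the actual
sessions.

```r
# Hypothetical bootcamp-style exercise. The dataset and the variables are
# illustrative assumptions, not taken from the courses described in the paper.
data(iris)  # built-in dataset shipped with base R

# Linear regression: does sepal length predict petal length?
fit_lm <- lm(Petal.Length ~ Sepal.Length, data = iris)
slope <- coef(fit_lm)[["Sepal.Length"]]

# One-way ANOVA: does petal length differ between the three species?
fit_aov <- aov(Petal.Length ~ Species, data = iris)
p_species <- summary(fit_aov)[[1]][["Pr(>F)"]][1]

# Principal component analysis on the four standardised measurements
pca <- prcomp(iris[, 1:4], scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Inspect the three analyses
print(summary(fit_lm))
print(summary(fit_aov))
print(round(var_explained, 2))
```

Because the script relies only on base R and a built-in dataset, it can be run
as is; a student would then replace iris with their own tabular data and adapt
the model formulae to their research question.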
3     R-based strategies for English studies

In the previous section, we presented the R courses and activities that are
currently taking place in our programme, but we think that they are
insufficient, especially during the MA years. In this section, we present all
the aspects that should be taken into account when considering an R-based
curriculum for English Linguistics studies. We aim at establishing course
modules that present statistical and NLP techniques currently used in
linguistic research. We know that getting used to command lines and working
with a programming language takes a lot of time for non-specialists, but it
also offers a lot of new possibilities for linguists. This section details
some strategies to ease R’s steep learning curve.

3.1    Mathematics for linguistics and interdisciplinarity

One key feature of our strategy is the collaboration between statisticians,
mathematicians, programmers, and humanists. Although we want our students to
be independent users of R and to understand what they do when they use and
manipulate data with NLP techniques, we do not expect them to become
professional programmers. However, we do expect linguists to become familiar
enough with NLP techniques, programming and statistics that they can
efficiently identify what they need to advance in their research in terms of
technical treatment of their data. This interaction between experts from
different fields promotes cross-fertilization between the domains, and it also
allows mathematicians and programmers involved in DH and in NLP to understand
linguists’ needs.

However, one of the issues we face is the creation of a reasonable time
schedule for a progressive acquisition of all these techniques without scaring
students, knowing that most of them are rather averse to learning these
methods. Another issue related to the pedagogy of these methods is the amount
of maths that should be taught so that students get to know what they are
doing with their data. On the one hand, if all the underlying mathematical
methods are explained to students at the same time as they are introduced to
the technique, there is a risk that they do not understand the procedure and
refuse to use it. On the other hand, if students do not receive the
mathematical explanations of the method, they will never fully understand what
they are doing with their data. The risk is that students may run their
scripts too blindly.

3.2   Pedagogy of datasets and scripts

Datasets as such can be understood as a common aspect of the methodology that
enables students to understand the logic behind all the possible statistical
tests and NLP techniques. It is, in a way, a dialectic form of the
transferability of knowledge: understand the mathematics used with something
other than your specialty. Students then need to learn to adapt the code to
their own needs (dataset and research questions), as well as becoming familiar
with importing-exporting methods from the treatment of other datasets.

Part of the benefits of the R strategy can be cashed in with the on-line
forums and scientific blogs. They rely heavily on a few datasets when
explaining complex methods, so it makes sense to teach these recurrent
datasets (mtcars, Titanic, iris) to our students so that they become more
autonomous in understanding on-line explanations. The classical iris dataset,
for example, is notably used for classification problems or dimensionality
reduction. Conversely, typical analyses based on linguistic data such as the
Bresnan and Nikitina (2003) dataset (Johnson, 2008 [2011], Baayen, 2008) or
even Jane Austen’s novels (Arnold & Tilton, 2017) may also help by showing
students the multiple applications of R from a linguistic perspective.

Graphs always help motivate students. Making datasets more attractive by
showing visualisation techniques that catch students’ attention, not only
about the statistical procedure but also about how its results can later be
visualized and presented in talks and vivas, is definitely something to take
into account.

Finally, students may also rely on the ease and practicality of scripts in
order to reproduce, compare and share results. With RStudio, students can
actually send their whole project (including the environment, datasets,
graphs, objects created by them, and code) to their supervisors, for example.

The scripts actually represent a new model of exercise and teaching
methodology: scripts replace the classical textbook, showing detailed and
commented functions. Examples and datasets need to be adapted to this format
beforehand. It is an example-based methodology that promotes autonomy: in many
cases, the script is not only the example, but also includes the exercises
that need to be executed. The difficulty of these exercises increases
progressively. This is, in a way, the creation of a new didactic method: most
textbooks rely on a specific R package, R
scripts and companion website. The companion website to Levshina’s textbook is
a good case in point[2].

[2] https://benjamins.com/sites/z.195/

3.3       Package-driven pedagogy

An alternative method for students who seek to discover the usefulness of some
specific R packages for linguistic research is to present them with an MA
research project related to the functionalities of one particular package. For
instance, the {koRpus} package implements 14 metrics of lexical diversity and
35 metrics of readability that can be automatically computed. Computing each
one of these metrics individually for a corpus made up of hundreds of texts
would be time-consuming and unnecessary. Therefore, the student had to learn
how to use R and to explore the functionalities of the package: starting from
the POS tagging with TreeTagger (Schmid 1995), the interpretation of the
(huge) resulting numerical dataset with all the results of the formulae, the
correlations of the various metrics, ANOVAs to compare results between
different groups… Getting students to work with a specific interest in an R
package is another way to foster acquisition of the package itself but also of
R.

3.4       We love you just the way you ‘R’: A typology of teaching styles

R is known for its steep learning curve, counter-intuitive to real
programmers. How is it taught in the reference textbooks? In general, most
textbooks for linguistics with R do not assume any prior experience with any
programming language, or any knowledge of statistics. They start by giving
basic notions of descriptive statistics, and complexity increases
progressively. However, not all books present the same degree of mathematical
complexity, and not all books detail the mathematical processes behind the
various commands and functions used in R. Therefore, the choice of one book or
another may depend on the interests of the students, their background in
maths/programming, and their own attitude towards NLP and statistics. When it
comes to courses, the instructor faces the same type of issues. Not all
students follow the same progression curve, and this is mainly determined by
their own background and profile. These approaches are not mutually exclusive.
In an ideal curriculum, students should be exposed to several approaches and
follow whatever suits their personality best. So far, we have identified four
main teaching approaches:

a) Scaring literary/linguist students for their greatest benefit. Teaching the
   general benefit of learning R as an interface in itself. This is mostly
   relevant for future PhD students. Among the MA students, the strategy
   consists in insisting on the advantages of the existing libraries, and the
   developers’ community at large. More specific packages such as {ZipfR} are
   presented and functions are detailed for their interest for linguists but
   also in relation to a research community representing a (professional)
   lifetime investment with possible job prospects outside the linguistic
   community proper, provided time is spent on learning enough statistics and
   other interdisciplinary skills. In other words, to make the most of their
   potential role as interface in the DH, linguists of English are advised to
   invest at the same time in statistics, not to say Data Science, for longer
   term benefits.
b) R for statistics in disguise: this approach consists in presenting
   techniques from a mathematical point of view, showing, detailing and
   explaining formulae. Although it offers the possibility to really
   understand what is going on with the data, students normally get scared and
   lose interest in quantitative methods because they consider them to be too
   difficult. This methodology might be particularly useful for intermediate
   and advanced students, but not so much for beginners.
c) Motivating students. Contrary to the previous methodology, here students
   are introduced to the various techniques without considering the underlying
   mathematical processes. Although it motivates students because they can see
   ‘quick results’ and the applicability to their data, it does not cover a
   deep analysis of the processes the data is going through. The emphasis is
   on learning simple code, for instance, heavily relying on the recurrent
   functions of the tidyverse collection. Displaying fancy visualisations with
   simple code also helps catch students’ attention.
d) Intermediate point: teaching how statistics are reported in the journals.
   This approach gives students the basic maths to understand descriptive and
   inferential statistics, as well as reported results from linguistic
   journals. It prepares students to understand what they will be reading in
   quantitative linguistics. However, it does not get into much detail when it
   comes to more complex techniques.

Lastly, one should not underestimate the importance of online forums. They are
complementary to any teaching methodology. R has a big community of users who
answer quickly on forums; mailing lists of R packages for doubts and bugs are
usually very active; and pages such as Stackoverflow[3], r-bloggers[4], and
many other websites also offer a comprehensive amount of help and code that is
relatively easy to understand. Being part of the R community is being part of
a community of experts that speak the same language, in spite of their
different fields of expertise, and students may benefit from being part of
this wider research community.

4       The bigger picture: NLP, DH and the quantitative turn

A set of analytical practices is being established (while being at the same
time constantly revised) in the industry and in the research community, but
the field is evolving so quickly that this knowledge has not made it to the
modules taught in Humanities yet. At best, competing ‘how-to’ textbooks are
being produced, but they reflect individual solutions more than academic
curricula. NLP teachers have introduced modules about Machine Learning, but
the curriculum in most NLP departments is still being revised to take into
account the constant changes in the perimeter of existing technologies.

One should distinguish between sets of tools and methods, text mining, data
mining, and data science. These labels interact with NLP as such, but need to
be nuanced. One of the reasons for the complexity of contemporary curriculum
design for Humanities is that NLP in this framework is no longer an end in
itself, or an autonomous curriculum in the MA degree we describe. For
linguists from English faculties, the NLP language game can be caricatured as
getting the best F-score you can for a given task/dataset in a challenge. NLP
as a field may well become NLP as a set of tools in DH (or a step in
processing chains).

While one may argue that academic political decisions are mostly designed by
means of conflicting calls for projects in an indirect top-down approach, here
is a bottom-up approach that tries to address what should be the real order of
the day: what does it take to turn a motivated student from an English
department into a DH specialist (if not a data scientist)?

DH is an emerging field where curricula still need to be designed. Felicitous
encounters foster cross-disciplinary achievements, but how do we enforce these
developments in our curricula? There is a historical responsibility in
curriculum design to face the challenges of DH. More than a political agenda,
there is an epistemological transition going on in terms of the skills (and
knowledge) required to process data and knowledge. Advocating a programming
language with a strong background and tradition in maths rather than mere NLP
modules is also a unique possibility to bootstrap Data Science in Humanities
curricula. Modules in Maths would be required, but having acquired an R
culture may ease this transition to Maths. What follows builds on existing
modules to outline possible developments: here we sum up putative modules
around R that may help students of English to make a transition towards a
postgraduate programme in data mining:

•   First semester: mathematical bases of data mining. Pedagogical packages
    related to a particular manual (e.g. {languageR, car, caret}). Functions
    and Datasets to accompany An R Companion to Applied Regression (Fox et al.
    2014) and Implementing reproducible research (Stodden, Leisch, and Peng
    2014).
•   Second semester: For data mining, Data mining and analysis: fundamental
    concepts and algorithms (Zaki, Meira Jr, and Meira 2014). For statistical
    modelling, Applied predictive modelling (Kuhn and Johnson 2013).

5    Philosophical Implications for R as a medium for the 3rd revolution of
     grammatisation

This section draws the bigger picture that explains why we promote the
teaching of R for DH. The main revolution in this kind of faculty consists in
convincing students to use the command line. RStudio is an interface that
reassures students because it has windows and a GUI; it enables them to run
scripts and learn how to comment scripts. It is a good compromise with a
console and loading functions are actionable with
3
    https://stackoverflow.com
                                                          the mouse as with any GUI. R commander (and
4                                                         similar plugins as the one proposed with the
    https://www.r-bloggers.com




                                                      6
{Rattle} package) were previous attempts at simplifying R (not to mention RExcel and other Excel-based interfaces) with the same click-and-play approach. We would like to suggest that, as a programming language and as a programme giving access to thousands of packages, R has a special status for the DH turn under way (not to mention the fact that any epistemological angle on NLP tools should consider the programming side).

The DH turn is closely related to the quantitative turn, and we consider that this piece of software takes part in what we call, after Sylvain Auroux, the third revolution of ‘grammatisation’. According to Auroux (1994), the first revolution of ‘grammatisation’ was related to writing. With Gutenberg, speech enjoys some specific codification. Textual structures (paragraphs, books) play the role of a technological innovation that preconditions linguistic analysis.

The second technological invention that revolutionized linguistics, according to Auroux, was the invention of dictionaries and grammars, especially in 18th-century Europe. Again, the codification of the language and the standardisation of spelling triggered some reflection about lemmatisation. Linguistic data was processed according to a certain format (e.g. dictionary entries).

Auroux (1994) suggests that the emergence of corpora and NLP techniques boils down to the emergence of a third revolution of grammatisation, which is still under way. We see R as a possible interface between corpus linguistics and the quantitative turn, giving us free access to statistical libraries and NLP tools. One of the benefits is that R can increasingly be used as a concordancer for mining corpora, especially with the tidyverse collection of packages. Adopting R facilitates a roadmap to data science, since corpus extractions end up as a dataset. The current ‘tidy data’ (Wickham 2014) philosophy favours the structure of the data frame, where ‘each type of observational unit is a table’.

We believe that DH is a crossroads for the cross-fertilization of several disciplines: linguistics (where NLP is crucial), statistics, and the emerging domain of data science. In this respect, R is an excellent candidate as a tool to promote real pluridisciplinarity. Conceptions and boundaries vary as to the content of machine learning, data mining, and data science, but the recent evolutions within the tidyverse collection of R packages facilitate text mining. The common denominator of this emerging field is text mining: ‘Text mining is an interdisciplinary field which involves modelling unstructured data to extract information and knowledge, leveraging numerous statistical, machine learning, and computational linguistic techniques.’ R packages such as {koRpus} (Michalke 2017) and {tm} (Feinerer and Hornik 2015) make text mining with R much easier. We could even say that they favour hybrid uses between concordancers and the standard NLP blind processing of data.

Again, the uses of R are becoming increasingly user-friendly. Recent publications have heralded the emergence of accessible approaches to text mining, where NLP is in disguise. For example, Jockers (2014) distinguishes between micro-, meso- and macro-analysis. Microanalysis deals with frequency analysis and with the correlation between the presence or absence of a word in a text, using randomization techniques. Mesoanalysis consists in detailing lexical complexity and hapax analysis, and in teaching how to build KWIC-type concordances with R. Macroanalysis presents unsupervised clustering and an introduction to supervised learning models such as support vector machines.

This more recent approach allows for more flexible analyses, in which individual texts in a corpus may be taken into account more easily than with other command-line or corpus-based methods. The interaction with visualisation techniques and the flexibility offered by the tidyverse collection make it easier to focus on the mining of texts. As the 2017 rOpenSci Text Workshop (http://textworkshop17.ropensci.org/) puts it, working with these recent R packages encompasses ‘text analysis, natural language processing, and other aspects of text mining and text data handling’. Because it can be used as an interface for these practical and theoretical issues, R is a good candidate for the development of hybrid approaches to text mining, mixing concordance-based approaches and more ambitious analyses of metadata and quantitative (textometric) aspects of texts in a single environment.

Data visualisation and all of the flexibility of the R environment put R in a very favourable position as a specific tool for the third revolution of grammatisation that is taking place around text mining. This can be seen with the flowcharts describing the interaction of R packages and tidyverse functions for topic modelling in Silge and Robinson (2016), or the R open science efforts for the interoperability of formats in some
packages (the text interchange format package, {tif}, https://github.com/ropensci/tif).

6       Conclusion: limits and difficulties

The conclusion reflects on the limitations of our curriculum and achievements as well as the drawbacks of our choice of R. The first part discusses actions that are still under way. The second part summarises some of the arguments against R.

Alternative strategies still to be tested include reverse teaching for R packages: sessions where students would have to present an R package and what can be done with it, which is a way to get them acquainted both with the code and with the bigger picture: the why? and the what for?

Another strategy to be tested consists in supervising an MA whose every step implies specific coding in R, to the point of designing the requirements of an R package along the MA, each cornerstone of the MA corresponding to an elaborate R function which needs to be implemented to process the data. Designing your own R package centred around your research question is an option, more likely to be realised at the end of a PhD. We are encouraging PhD students reaching the end of their thesis to design their own packages, encapsulating their code and partial datasets, and to join the GitHub community in order to bypass the heavier requirements of the CRAN repository and propose their packages as prototypes or proofs of concept through the {devtools} library.

What would make sense in a pluridisciplinary university would be to teach a general introductory course resembling a seminar centred on the versatility of R, while teaching core linguistic concepts to non-specialists and presenting useful R packages for linguistically related research questions. The challenge is to teach basic coding, linguistic notions and R packages in lecture halls. Nevertheless, a module like “Language, Texts and Data mining: An R-based introduction to Digital Humanities” should be offered to undergraduates to promote DH.

Suggesting that R should be essential to students contributes to developing a coding culture and fosters online learning (as Stack Overflow is your friend), and it also makes sense in terms of lifelong learning. Our MA students get the extra benefit of initial training in data mining, however basic it may sound to a professional data scientist. The essential roadmap to emancipation still has to be designed. We do not aim to turn our students of English linguistics into programmers, but we want our PhD graduates to be as proficient as can be. Replicating and adapting a script to the needs of your data is one thing; being an expert is quite another.

As to the limitations of R, some of them are inherent to the language, while others are related to misuses of R. The first limitation is that a programming language is not NLP as such. Becoming acquainted with the logic of packages gives access to specific resources, but with more limitations as to the languages under scrutiny than with Python-based tools.

Regarding the internal limitations of the programming language, known problems with R include issues related to floating-point calculations (which seem to play a role in the R word2vec implementation of the word embedding algorithms), a quirky syntax, issues with loops, and the way everything is stored in memory. For specific kinds of data processing (very often, phonetic data), MATLAB libraries and scripts have been developed, so that R is superseded in these areas, not to mention the fact that their documentation is usually more complete on average; but this service comes at a price, whereas R is free.

Last, we would like to report on some issues that have been encountered by students. On top of the classic mismanagement of code for models, or serious issues with data coercion in R, we would like to report a typology of attitudes which may reflect poorly on some infelicitous uses of R. These attitudes are mainly related to what one could call ‘the encyclopaedic ignorance of the self-taught linguist’: becoming an expert in secondary details but missing the basic maths. This means being able to execute relatively complex methods without knowing exactly what the model does to the data. Another variant is linked to wanting to know everything about the R code without considering the big picture, i.e., the underlying mathematical model required to address the data, practising some sort of code-induced short-sightedness. It may well be the case that this package-based approach to R distorts the representation of a programming language. Full empowerment of students, leading to the possibility of writing a programme, should be the order of the day, whereas we probably endorse some empowerment limited to existing packages and scripts in a kind of problem-solving philosophy. Clearly, learning another programming
language would help with this kind of learning profile.

Promoting the teaching of R is an important aspect of pluridisciplinarity (and an easy way to start building a common background with the maths department), but it should be seen as the first step. Beginning with R, students may move to Python (for example using Jupyter or Anaconda) and then, for example, use the scikit-learn machine learning library in Python (Pedregosa et al. 2011).

The new challenges posed by this new configuration of knowledge are mostly unheard of. Most learning paths for the different intersecting subdisciplines are still uncharted territory. For this complex language game ahead of us, linguists may lack computational linguistics, but other scientific partners will need more linguistics for true interdisciplinarity.

References

Laurence Anthony. 2011. AntConc (Version 3.2.2) [Computer Software]. Tokyo, Japan: Waseda University.
Taylor Arnold. 2017. ‘A Tidy Data Model for Natural Language Processing Using cleanNLP’. arXiv Preprint arXiv:1703.09570.
Taylor Arnold, and Lauren Tilton. 2015. Humanities Data in R. Springer.
Sylvain Auroux. 1994. La Révolution Technologique de La Grammatisation. Introduction À L’histoire Des Sciences Du Langage. Liège: Mardaga.
Harald R. Baayen. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.
Jason Baldridge. 2005. ‘The OpenNLP Project’. URL: http://opennlp.apache.org/index
Nicolas Ballier. forthcoming. ‘R, Pour Un Écosystème Du Traitement Des Données? L’exemple de La Linguistique.’ In Données, Métadonnées Des Corpus et Catalogage Des Objets En Sciences Humaines et Sociales, edited by Ph. Caron. Presses universitaires de Rennes.
Santiago Barreda. 2015. ‘phonTools: Functions for Phonetics in R’. R Package Version 0.2-2.1.
Paul Boersma, and David Weenink. 2017. Praat (Version 6.0.29) [Computer Software].
Joan Bresnan, and Tatiana Nikitina. 2003. ‘The Gradience of the Dative Alternation.’ Stanford University. Retrieved from http://www-lfg.stanford.edu/bresnan/download.
Ingo Feinerer, and Kurt Hornik. 2015. tm: Text Mining Package (version 0.6-6). https://cran.r-project.org/web/packages/zipfR/index.html.
Serge Fleury, and Maria Zimina. 2014. ‘Trameur: A Framework for Annotated Text Corpora Exploration.’ In COLING (demos), 57–61.
John Fox, Sanford Weisberg, Daniel Adler, Douglas Bates, Gabriel Baud-Bovy, Steve Ellison, David Firth, Michael Friendly, Gregor Gorjanc, and Spencer Graves. 2014. ‘Companion to Applied Regression’. R Package, Version 2.0-20.
Kim Gerdes. 2014. ‘Corpus Collection and Analysis for the Linguistic Layman: The Gromoteur’. In Proceedings of the JADT.
Stefan T. Gries. 2009. Quantitative Corpus Linguistics with R. A Practical Introduction. New York-London: Routledge.
Stefan T. Gries. 2013. Statistics for Linguistics with R: A Practical Introduction. Berlin: Walter de Gruyter.
Matthew L. Jockers. 2014. Text Analysis with R for Students of Literature. Springer.
Keith Johnson. 2011. Quantitative Methods in Linguistics. Chichester, UK: John Wiley & Sons.
Tyler Kendall, and Erik R. Thomas. 2010. ‘Vowels: Vowel Manipulation, Normalization, and Plotting in R’. R Package, Version 1.1. Software resource: http://ncslaap.lib.ncsu.edu/tools/norm/.
Adam Kilgarriff, Pavel Rychly, Pavel Smrz, and David Tugwell. 2004. ‘The Sketch Engine’. In Proceedings of Euralex, 105–16.
Erwin R. Komen. 2011. ‘Cesax: Coreference Editor for Syntactically Annotated XML Corpora’. Reference Manual. Nijmegen, Netherlands: Radboud University Nijmegen.
Max Kuhn, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 810. Springer.
Natalia Levshina. 2015. How to Do Linguistics with R: Data Exploration and Statistical Analysis. John Benjamins Publishing Company.
Daniel R. McCloy. 2016. phonR: Tools for Phoneticians and Phonologists (version 1.0-7). https://CRAN.R-project.org/package=phonR.
Meik Michalke. 2017. Package koRpus: An R Package for Text Analysis (version 0.10-2). http://reaktanz.de/?c=hacking&s=koRpus.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, and Vincent Dubourg. 2011. ‘Scikit-Learn: Machine Learning in Python’. Journal of Machine Learning Research 12 (Oct): 2825–30.
R Core Team. 2016. R: A Language and Environment for Statistical Computing (version 3.3.1, 2016-06-21). Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Tyler W. Rinker. 2013. qdap: Quantitative Discourse Analysis Package (version 2.1.0.). University at Buffalo/SUNY, Buffalo, New York.
Helmut Schmid. 1995. ‘TreeTagger: A Language Independent Part-of-Speech Tagger’. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
Mike Scott. 2016. WordSmith Tools (version 7). Stroud: Lexical Analysis Software.
Julia Silge, and David Robinson. 2016. tidytext: Text Mining and Analysis Using Tidy Data Principles in R.
Julia Silge, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media, Inc.
Victoria Stodden, Friedrich Leisch, and Roger D. Peng. 2014. Implementing Reproducible Research. CRC Press.
Hadley Wickham. 2014. ‘Tidy Data’. Journal of Statistical Software 59 (10): 1–23.
Raphael Winkelmann, Klaus Jaensch, and Jonathan Harrington. 2017. emuR: Main Package of the EMU Speech Database Management System. R Package (version 0.2.3.). https://CRAN.R-project.org/package=emuR.
Mohammed J. Zaki, and Wagner Meira. 2014. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press.
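
The frequency lists, hapax analysis and KWIC-type concordancing mentioned in Section 5, together with the one-token-per-row reading of ‘tidy data’, can be illustrated with a minimal sketch in base R. Everything specific in it is an assumption made for illustration and is not taken from the curriculum described above: the two-sentence toy corpus, the helper names tokenise() and kwic(), the crude letter-only tokenisation (ASCII letters only), and the window size. Dedicated packages such as {tidytext} provide far more robust tokenisation.

```r
# Toy corpus: two short "texts" (hypothetical data, for illustration only).
texts <- c(
  doc1 = "The cat sat on the mat. The dog sat on the log.",
  doc2 = "A cat and a dog met on the mat."
)

# Crude tokeniser (assumption): lower-case, split on any non-letter run,
# and drop empty strings. Only ASCII letters are treated as word characters.
tokenise <- function(x) {
  tokens <- unlist(strsplit(tolower(x), "[^a-z]+"))
  tokens[nzchar(tokens)]
}

# One row per token: a data frame in the one-token-per-row 'tidy' layout.
tidy_tokens <- do.call(rbind, lapply(names(texts), function(id) {
  tok <- tokenise(texts[[id]])
  data.frame(doc = id, position = seq_along(tok), word = tok,
             stringsAsFactors = FALSE)
}))

# Frequency list, and hapax legomena (words occurring exactly once).
freq <- sort(table(tidy_tokens$word), decreasing = TRUE)
hapaxes <- names(freq)[freq == 1]

# KWIC (keyword in context): for each hit, collect up to `window` tokens
# on each side, without crossing the boundary of the document.
kwic <- function(tokens_df, keyword, window = 2) {
  hits <- which(tokens_df$word == keyword)
  rows <- lapply(hits, function(i) {
    same_doc <- which(tokens_df$doc == tokens_df$doc[i])
    lo <- max(min(same_doc), i - window)   # first token of the left context
    hi <- min(max(same_doc), i + window)   # last token of the right context
    data.frame(
      doc     = tokens_df$doc[i],
      left    = paste(tokens_df$word[seq_len(i - lo) + lo - 1], collapse = " "),
      keyword = keyword,
      right   = paste(tokens_df$word[seq_len(hi - i) + i], collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)   # NULL if the keyword never occurs
}

concordance <- kwic(tidy_tokens, "sat", window = 2)
print(head(freq))
print(hapaxes)
print(concordance)
```

Because the concordance is itself a data frame, it can be filtered, counted or plotted with the same tools as any other dataset, which is the sense in which corpus extractions ‘end up as a dataset’.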