<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Demonstration: The Language Application Grid as a Platform for Digital Humanities Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nancy Ide</string-name>
          <email>ide@cs.vassar.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Keith Suderman</string-name>
          <email>suderman@cs.vassar.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James Pustejovsky</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Vassar College</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Brandeis University</institution>
        </aff>
      </contrib-group>
      <fpage>71</fpage>
      <lpage>76</lpage>
      <abstract>
        <p>The LAPPS Grid project, which has developed a platform providing access to a vast array of language processing tools and resources for the purposes of research and development in natural language processing (NLP), has recently expanded to enhance its usability by non-technical users such as those in the DH community. We provide a live demonstration of LAPPS Grid use, ranging from “from scratch” construction of a workflow using atomic tools to a pre-configured Docker image that can be run off-the-shelf on a laptop or in the cloud, for several tasks of relevance to the DH community.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Over the past few years, Digital Humanities (DH) has looked to Computational
Linguistics (CL) for methods to enable richer analysis of literary, historical, and
other kinds of documents, recognizing that CL methods and procedures can in fact
enhance the kinds and amount of information that can be automatically extracted
from language data [14]. However, several obstacles have prevented humanists
from wholesale adoption of CL tools, the most well known of which is that they
are typically difficult to use without a fair amount of technical background. Other,
more subtle but perhaps more deeply rooted obstacles have also contributed, most
notably dramatic differences in perspective and approach, as well as differences in
the language data that each community typically deals with. It is only recently that
CL methods and tools have begun to be made more accessible to non-technical
users and are beginning to be widely adopted by the DH community; however,
there remains considerable work to be done to fully adapt CL tools and methods to
use by DH scholars.</p>
      <p>The Language Applications (LAPPS) Grid [6] is an NSF-funded project
involving Vassar College, Brandeis University, Carnegie Mellon University, and the
Linguistic Data Consortium at the University of Pennsylvania. The original
motivation for the project, begun in 2012, was to address the endemic lack of
interoperability among CL tools and data that has plagued the CL field for decades. Atomic
natural language processing (NLP) tools (e.g., part of speech taggers, syntactic
analyzers, entity detectors, etc.) are typically pipelined to create more sophisticated
applications; the lack of interoperability among tools, corpora, and other language
resources often leads to considerable waste of effort to make them work together
in a pipeline, or “workflow”. To overcome the problem, the LAPPS Grid project
undertook to engineer a platform that both provides access to a wide array of
language processing tools and resources, and exploits recognized standards and best
practices to negotiate incompatibilities for the user.</p>
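      <p>As a minimal illustration of why a shared representation resolves the interoperability problem: if every tool reads and writes one common document structure, any tagger can follow any tokenizer, whatever their origin. The sketch below uses an invented JSON-style layout and toy tools; it is not the actual LAPPS Interchange Format (LIF) schema or any real LAPPS service.</p>

```python
# Hypothetical sketch: tools that consume and produce one shared
# document structure can be chained freely. Field names ("views",
# "tokens", "pos") are illustrative, not the real LIF schema.

def tokenizer(doc):
    """Add a 'tokens' view: one annotation per whitespace-separated word."""
    tokens = []
    start = 0
    for word in doc["text"].split():
        begin = doc["text"].index(word, start)
        tokens.append({"word": word, "start": begin, "end": begin + len(word)})
        start = begin + len(word)
    doc["views"]["tokens"] = tokens
    return doc

def pos_tagger(doc):
    """Add a 'pos' view; a toy capitalization rule stands in for a real tagger."""
    tags = []
    for tok in doc["views"]["tokens"]:
        tag = "NNP" if tok["word"][0].isupper() else "NN"
        tags.append({"start": tok["start"], "end": tok["end"], "tag": tag})
    doc["views"]["pos"] = tags
    return doc

def pipeline(text, *tools):
    """Thread one shared document through a sequence of tools."""
    doc = {"text": text, "views": {}}
    for tool in tools:
        doc = tool(doc)
    return doc

result = pipeline("Nancy Ide teaches at Vassar", tokenizer, pos_tagger)
print([t["tag"] for t in result["views"]["pos"]])
```

      <p>Because each tool only touches the shared structure, swapping one tagger for another requires no glue code, which is the property the LAPPS Grid engineers for real services.</p>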
      <p>Over the past five years the LAPPS Grid project has collaborated with
several major projects in the US, Europe, and Asia to expand its range of accessible
tools and resources as well as to augment the capabilities of the platform. Our
collaborators serve a broad range of users, well beyond the NLP community we
originally intended to serve, including users involved in inter-cultural
communication and users from the DH community. We have also begun to create purpose-built
instances of the LAPPS Grid to use in courses aimed at non-technical users, and
we are currently working with a major project in the digital humanities [12] and
pursuing funding to collaborate with several others. As a result, the LAPPS Grid is
continually increasing its usability by non-technical users such as those in the DH
community.</p>
      <p>Our demonstration provides several sample usages of the LAPPS Grid relevant
to digital humanities research, including tool pipelines developed “from scratch”
as well as pre-configured workflows that can be used as is, and demonstrates both
the analysis and creation of resources.
</p>
    </sec>
    <sec id="sec-2">
      <title>LAPPS Grid Overview</title>
      <p>
        The LAPPS Grid is an open platform that provides access to hundreds of NLP
tools and language resources. It incorporates the Galaxy workflow and data
management framework [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which was developed by researchers in the field of
genomics and specifically designed to enable researchers in the life sciences to
access resources and compose applications without requiring technical expertise. The
LAPPS Grid is very flexible and configurable: it can be accessed through a web
interface (http://galaxy.lappsgrid.org), deployed locally on any Unix system (laptop,
desktop, or server), or run from the cloud. Another feature of the LAPPS Grid is
its Open Advancement (OA) Evaluation system, which enables the user to explore
variant pipelines involving alternative tools in order to identify the most effective
configuration in terms of precision and recall.
      </p>
      <p>
        The LAPPS Grid is part of the Federated Grid of Language Services (FGLS)
[
        <xref ref-type="bibr" rid="ref6">7</xref>
], an international network of grids including the University of Kyoto’s
Language Grid (http://langrid.org) and several other Asian and European grids. We have
recently entered into a Mellon-funded federation with the pan-European CLARIN project’s
WebLicht/Tübingen (http://weblicht.sfs.uni-tuebingen.de/) and LINDAT/CLARIN, Prague
(https://lindat.mff.cuni.cz/) frameworks, whose focus is to provide support for humanities
and social science scholarship. This federation gives the LAPPS Grid user seamless access
to all of the tools and resources in any of the federated platforms. Thus we have vastly
increased the availability of multi-lingual and multi-modal resources and tools in the
LAPPS Grid and, through our collaboration with CLARIN, expanded the range of services
applicable to DH research.
      </p>
    </sec>
    <sec id="sec-3">
      <title>CL for DH in the LAPPS Grid</title>
      <p>The LAPPS Grid in its current form addresses many of the needs for DH
research. It provides easy-to-use access to a wide variety of customizable low-level
CL tools, including tokenizers, sentence boundary detectors, part-of-speech
taggers, named entity recognizers, coreference resolvers, and phrase-structure and dependency
parsers, among others. It also provides facilities for comparing the effectiveness
of tools that perform the same task in order to identify the one that is best suited
to the task. For example, Figure 1 shows an evaluation pipeline in Galaxy that
compares the output of three named entity recognizers to gold standard
annotations; this example shows each small step in the workflow, but sub-steps (for
example, the Tokenization-SentenceSplitter-Tagger sequence that feeds the three
entity recognizers) could be bundled into a workflow and plugged in as a single
step.</p>
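      <p>The span-level scoring behind such an evaluation pipeline can be sketched as follows. The entity spans and tool names are invented for the example, and the actual LAPPS Grid evaluators may differ in details such as matching criteria (for instance, giving partial-span credit).</p>

```python
# Illustrative sketch: compare each tool's entity spans against the
# gold standard and report precision, recall, and F1 per tool.
# All data below is made up for the example.

def score(system, gold):
    """Exact-match precision, recall, and F1 over (start, end, type) spans."""
    sys_set, gold_set = set(system), set(gold)
    tp = len(sys_set & gold_set)  # spans matching the gold standard exactly
    precision = tp / len(sys_set) if sys_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = [(0, 9, "PERSON"), (24, 30, "ORG"), (45, 51, "DATE")]
tagger_a = [(0, 9, "PERSON"), (24, 30, "ORG")]                    # misses the date
tagger_b = [(0, 9, "PERSON"), (24, 30, "GPE"), (45, 51, "DATE")]  # wrong entity type

for name, spans in [("A", tagger_a), ("B", tagger_b)]:
    p, r, f = score(spans, gold)
    print(f"tagger {name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```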
      <p>
        The datasets used in DH research are diverse, often involving ancient text, texts
in languages typically not covered in NLP such as Latin, poetry, historical
documents, and multi-media, and in some cases need to be representative across
multiple genres. Large CL datasets, on the other hand, are typically largely composed
of genres such as newswire (Penn Treebank [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ], English Gigaword [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ], etc.), or
they suffer from problems such as the inclusion of digitization artifacts, opaque
and unbalanced sampling, etc. [
        <xref ref-type="bibr" rid="ref10 ref7">11, 8</xref>
        ]. As a result, readily available NLP tools
often perform quite badly on DH data, due to dramatic differences in terminology
and entities, syntactic structure, etc. This often necessitates augmenting lexicons,
gazetteers, and pattern-matching rules used by these tools for the purposes of DH
research. Recent examples include augmentation of a contemporary affective
lexicon in order to study affect change patterns in German historical texts between
1740 and 1900 [
        <xref ref-type="bibr" rid="ref2">2</xref>
], and applying automatic parsing as a “pre-annotation tool” for
manual annotation of syntax in Old East Slavic texts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In the LAPPS Grid, these
tasks are accomplished by using “human-in-the-loop” capabilities to perform
manual annotation and/or augment existing resources incrementally as new entries
or patterns emerge from analysis, without leaving the environment to use external
tools. More sophisticated analyses can exploit a cycle of automatic annotation
using machine learning followed by manual correction, which can then be used to
iteratively enhance the performance of the learning algorithm.
      </p>
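      <p>A minimal sketch of that incremental loop, with invented data and function names: run a gazetteer-based matcher, surface unknown candidate names for human review, and fold accepted entries back into the gazetteer for the next pass.</p>

```python
# Hedged sketch of the human-in-the-loop gazetteer cycle described
# above. The matcher, candidate generator, and corpus snippet are all
# toy stand-ins for real LAPPS Grid components.

def find_entities(text, gazetteer):
    """Return gazetteer entries that occur in the text."""
    return [entry for entry in gazetteer if entry in text]

def candidate_names(text):
    """Toy candidate generator: capitalized words not at sentence start."""
    words = text.split()
    return {w.strip(".,") for w in words[1:] if w[0].isupper()}

gazetteer = {"Novgorod"}
text = "The chronicle mentions Novgorod and Pskov in the same entry."

found = find_entities(text, gazetteer)
unknown = candidate_names(text) - gazetteer - set(found)

# A human annotator would review `unknown`; here we simulate
# accepting every candidate and extending the gazetteer.
gazetteer |= unknown
print(sorted(gazetteer))
```

      <p>On the next pass, “Pskov” would be matched automatically; iterating this cycle is how the lexicon grows as new entries emerge from analysis.</p>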
      <p>
        Data visualization is often essential for humanities research, and the LAPPS
Grid includes a wide range of statistical and visualization tools. A basic but
common task is to generate frequency distributions or distributions across a text,
collection, timeline, etc. for any type of phenomenon. For example, one recent study
examined the appearance of neologisms and words that become obsolete over
several decades of Dutch magazine texts as well as tweets, by generating graphs
showing initial and final word frequencies over time intervals [13]. Other projects use
visualization of relations in graph form. For example, one study used named entity
recognition and co-reference tools to identify characters in the novels
comprising A Song of Ice and Fire and then generated a weighted graph depicting social
relations among characters based on dialogue interactions [
        <xref ref-type="bibr" rid="ref11">15</xref>
        ]; while another
extracted a dictionary of concepts by parsing the English sentences from multiple
translations of Wittgenstein’s Tractatus Logico-Philosophicus and inferred
semantic relations between concepts using word contexts, eventually generating a graph
of inter-relations among concepts [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
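      <p>The frequency-over-time style of analysis mentioned above can be sketched in a few lines. The corpus snippets and decade labels are invented for the example, and a real study would plot the resulting counts rather than print a text chart.</p>

```python
# Illustrative sketch: count occurrences of a target word per time
# interval and render a crude text-mode distribution. All data is
# made up; a real pipeline would read a corpus and draw a graph.
from collections import Counter

corpus = {
    "1950s": "de fiets en de auto en de fiets",
    "1960s": "de auto de auto en de trein",
    "1970s": "de computer de computer de computer en de auto",
}

target = "auto"
counts = {}
for decade, text in sorted(corpus.items()):
    freq = Counter(text.split())   # word frequencies for this interval
    counts[decade] = freq[target]

for decade, n in counts.items():
    print(f"{decade} {'#' * n} ({n})")
```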
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        The LAPPS Grid demonstration will show how it can be used to perform tasks
relevant to DH research such as those described above, as well as many others.
Facilities suitable for DH scholarship and research not currently available in the
LAPPS Grid are being regularly added to the platform as we receive input from
the DH community, and our current collaboration with the CLARIN projects in
Europe will significantly enhance LAPPS Grid facilities for DH research in the
near future. In the meantime, LAPPS Grid users already have access to the wide
range of tools and resources available through the Language Grid and other
federated grids, which focus on machine translation and other facilities for cultural
collaboration. A new collaboration with the Alveo project [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] in Australia will
provide access to a large suite of tools for analysis of multi-modal data, including
video, audio, transcriptions of audio, and tools for their analysis. Ultimately, the
LAPPS Grid aims to provide an ever-increasing set of tools for DH research,
enhance ease of use for non-technical users, and in general help to move DH toward
more empirically-grounded (and replicable) methods.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Anca</given-names>
            <surname>Bucur</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sergiu</given-names>
            <surname>Nisioi</surname>
          </string-name>
          .
          <article-title>A Visual Representation of Wittgenstein's Tractatus Logico-Philosophicus</article-title>
          .
          <source>In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)</source>
          , pages
          <fpage>71</fpage>
          -
          <lpage>75</lpage>
          , Osaka, Japan,
          <year>December 2016</year>
          .
          <article-title>The COLING 2016 Organizing Committee</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sven</given-names>
            <surname>Buechel</surname>
          </string-name>
          , Johannes Hellrich, and
          <string-name>
            <given-names>Udo</given-names>
            <surname>Hahn</surname>
          </string-name>
          .
          <article-title>Feelings from the Past: Adapting Affective Lexicons for Historical Emotion Analysis</article-title>
          .
          <source>In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)</source>
          , pages
          <fpage>54</fpage>
          -
          <lpage>61</lpage>
          , Osaka, Japan,
          <year>2016</year>
          .
          <article-title>COLING 2016 Organizing Committee</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Steve</given-names>
            <surname>Cassidy</surname>
          </string-name>
          , Dominique Estival, Timothy Jones, Denis Burnham, and
          <string-name>
            <given-names>Jared</given-names>
            <surname>Burghold</surname>
          </string-name>
          .
          <article-title>The Alveo Virtual Laboratory: A Web based Repository API</article-title>
          .
          <source>In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          , Reykjavik, Iceland, May
          <year>2014</year>
          .
          <article-title>European Language Resources Association (ELRA).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Hanne Martine</given-names>
            <surname>Eckhoff</surname>
          </string-name>
          and
          <string-name>
            <given-names>Aleksandrs</given-names>
            <surname>Berdicevskis</surname>
          </string-name>
          .
          <article-title>Automatic parsing as an efficient pre-annotation tool for historical texts</article-title>
          .
          <source>In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)</source>
          , pages
          <fpage>62</fpage>
          -
          <lpage>70</lpage>
          , Osaka, Japan,
          <year>2016</year>
          .
          <article-title>COLING 2016 Organizing Committee</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jeremy</given-names>
            <surname>Goecks</surname>
          </string-name>
          , Anton Nekrutenko, and
          <string-name>
            <given-names>James</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences</article-title>
          .
          <source>Genome Biology</source>
          ,
          <volume>11</volume>
          :
          <fpage>R86</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Nancy</given-names>
            <surname>Ide</surname>
          </string-name>
          , James Pustejovsky, Eric Nyberg, Christopher Cieri, Keith Suderman, Marc Verhagen, Di Wang, and Jonathan Wright.
          <article-title>The Language Application Grid</article-title>
          .
          <source>In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)</source>
          , Reykjavik, Iceland,
          <year>2014</year>
          . European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Toru</given-names>
            <surname>Ishida</surname>
          </string-name>
          , Yohei Murakami, Donghui Lin,
          <string-name>
            <given-names>Takao</given-names>
            <surname>Nakaguchi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Masayuki</given-names>
            <surname>Otani</surname>
          </string-name>
          .
          <article-title>Open Language Grid: Towards a Global Language Service Infrastructure</article-title>
          .
          <source>In The Third ASE International Conference on Social Informatics</source>
          , Cambridge, Massachusetts, USA,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Koplenig</surname>
          </string-name>
          .
          <article-title>The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram data sets: Reconstructing the composition of the German corpus in times of WWII</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          , volume
          <volume>32</volume>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Mitchell P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          , Beatrice Santorini, and Mary Ann Marcinkiewicz.
          <article-title>Building a Large Annotated Corpus of English: The Penn Treebank</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Parker</surname>
          </string-name>
          , David Graff,
          <string-name>
            <given-names>Junbo</given-names>
            <surname>Kong</surname>
          </string-name>
          , Ke Chen, and
          <string-name>
            <given-names>Kazuaki</given-names>
            <surname>Maeda</surname>
          </string-name>
          .
          <source>English Gigaword Fifth Edition LDC2011T07</source>
          , Linguistic Data Consortium, Philadelphia,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Eitan Adam</given-names>
            <surname>Pechenick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher M.</given-names>
            <surname>Danforth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter Sheridan</given-names>
            <surname>Dodds</surname>
          </string-name>
          .
          <article-title>Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution</article-title>
          .
          <source>PLOS ONE</source>
          ,
          <volume>10</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>October 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Welty</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nancy</given-names>
            <surname>Ide</surname>
          </string-name>
          .
          <article-title>Using the right tools: Enhancing retrieval from marked-up documents</article-title>
          .
          <source>Computers and the Humanities</source>
          ,
          <volume>33</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>59</fpage>
          -
          <lpage>84</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Wohlgenannt</surname>
          </string-name>
          , Ekaterina Chernyak, and
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Ilvovsky</surname>
          </string-name>
          .
          <article-title>Extracting Social Networks from Literary Text with Word Embedding Tools</article-title>
          .
          <source>In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)</source>
          , pages
          <fpage>18</fpage>
          -
          <lpage>25</lpage>
          , Osaka, Japan,
          <year>2016</year>
          .
          <article-title>COLING 2016 Organizing Committee</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>