<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TypeCraft: Collaborative Databasing and Resource Sharing for Linguists</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dorothee Beermann</string-name>
          <email>dorothee.beermann@hf.ntnu.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Mihaylov</string-name>
          <email>pavel@ontotext.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Norwegian University of Science and Technology</institution>
          ,
          <addr-line>Trondheim</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ontotext</institution>
          ,
          <addr-line>Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
      </contrib-group>
      <fpage>100</fpage>
      <lpage>106</lpage>
      <abstract>
        <p>We present a linguistic application that uses web technologies to promote the reuse of research data in the form of Interlinear Glossed Text (IGT), which is a well-established data format within philology and the structural and generative fields of linguistics. Here we present the modules and procedures of the online database TypeCraft.3 IGT is a sought after commodity in NLP and an integral part of scholarly linguistic work. It not rarely represents the only structured data available for less-resourced or endangered languages. While archiving of structured data from endangered languages is already well on its way [2], the free creation and exchange of linguistic data in the form of linked IGTs still needs to gain in popularity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>With more linguists interested in the description of endangered and less-described
languages, we see a rising interest in the electronic creation and management of
linguistic data. Distinct from computational approaches to language annotation,
where tagging of language data serves to facilitate automatic processing,
annotation by linguists is meant for human consumption. TypeCraft (TC), the tool
we are going to present here, is specialised on the creation, management and
exchange of Interlinear Glossed Text (IGT), exemplified by an example from
Runyankore Rukiga (ISO 639-3,nyn), a Lacustrine language of the Great Lakes
area in Uganda:
Omu nju hakataahamu abagyenyi
o`mu` nju` h`ak`at`a`ah`amu` `ab`agy´engy`ı
Omu n ju ha ka taah a mu a ba gyenyi
in CL9 house CL16 PST enter IND LOC IV CL2 visitor
PREP N V N
‘In the house entered visitors’</p>
      <p>At present the main hindrance for IGTs to be a prime linguistic resource is
the still prevalent lack of glossing standards. In-depth manual linguistic
annotation is time consuming and often cyclic since it is an integral part of a scientific
discovery process. A tool facilitating this process needs to allow for partial
annotation It further needs to take into consideration that linguistic annotation
is often distributive work done in close collaboration between native speakers
and linguists, so that a tool that can be used independently by all partners has
to cater to the needs of users with different levels of linguistic expertise and
computer literary.</p>
      <p>The discussion is organised as follows: In Section 2, we present the TypeCraft
system from a user perspective. We discuss the creation and management of
IGT and outline how collaborative online databasing can contribute to making
linguistic data re-usable and mobile. In Section 3, we give a system description
and in Section 4, we will summarise, and outline future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>TypeCraft</title>
      <p>
        TypeCraft (TC) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] consists of a relational database combined with a tabular
text editor for the computer-aided manual annotation of text. The tool’s outer
wrapper is a customised mediawiki which serves as a general entrance port and
collaboration tool. The list below gives an overview over TC’s main
functionalities:
Annotation
Manual import of continuous text or sentence collections
Tabular interface for word level glossing with automatic sentence break-up and direct access to
pre-defined gloss-sets
Easy access to gloss definitions and linking to an online ontology of grammatical concepts
Data management
Wiki facilitated management of primary data and metadata
Easy navigation between primary and annotated data
Easy linking of data to the TC-wiki for an integrated multi-media representation of data
Collaboration
Graded access and individual work spaces: the user decides which data is shared, TC texts and
phrases have their own URL and thus can be acquired and exchanged freely online
User groups can share data
Co-editing of TC-wiki pages for early dissemination
Data export
Export of annotated sentences (individuals or sets) to Microsoft Word, Open Office and LaTEX
for paper publications. Print-friendly versions of the TypeCraft web pages including exported
database material
Export of XML for automatic data processing
2.1
      </p>
      <sec id="sec-2-1">
        <title>General Information</title>
        <p>While other manual annotation tools designed for normal linguists are desktop
systems,4 TC is a multi-user online system. Users administer their own data in</p>
        <sec id="sec-2-1-1">
          <title>4 The other tools are Toolbox and FLEx, both developed by SIL.</title>
          <p>
            their private space at TC, but they can also make use of other users’ shared data.
TC is loaded by directing a browser5 to www.typecraft.org. We use standard wiki
functionality to direct the user to the database via New text, My texts, and
Textor Phrase search. My texts displays the user’s repository of annotated material,
called ’Texts’. My texts, the user’s private space, is divided into two sections: Own
texts and Shared texts. This reflects the graded access design of TC. Texts can
be shared between groups. Data-sets can be easily exchanged between users and
co-edited in user-groups. Interlinear Glosses can be loaded to the TC-wiki where
they are displayed as part of a TC-wiki page. Under the printing of wiki pages
TC data export automatically. One additional feature is that wiki-import from
the TC database gets automatically updated when the database changes. As a
collaborative tool, TC tries to utilise the effect of collaboration for the further
standardisation of linguistic glosses. Glossing rules are conventional standards
and one way to spread them is to make them easily accessible at the point where
they are actively used. TC insists that annotators use a pre-defined set of glosses
which are rooted in the Leipzig Glossing Rules [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], but have been extended to
meet the needs of annotation in a more multi-lingual setting. We have grouped
all glosses into annotation classes and mapped them to the GOLD (General
Ontology for Linguistic Description) via URI-pointers. GOLD has been created
as a tool to facilitate a more standardised use of basic grammatical features [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
Through the linking to GOLD, TC users can relate annotations to grammatical
concepts and to find relevant bibliographic resources as they annotate.
          </p>
          <p>At present TC has 300 users, and the number is growing at a steady pace.
We started 2012 with 13 000 annotated phrases from 97 mostly African
languages. Since 2005 the TypeCraft system is under constant development and
the planning of new development is based on the feed-back from system users
and interested colleagues. Most of the active TypeCraft users are junior linguists
and graduate students, mostly working on a less-resourced languages. TypeCraft
is user driven. At present 54.2% of the data in the TC database remains private
due to pending publication. However, also part of that data is partially
represented on TC-wiki pages most of which were created by users of the system.
A feature liked by many users is the possibility to join an annotation group.
Through workshops and regular classes in linguistic text annotation new user
groups can discover the tool. In addition, TypeCraft has proven to be a
useful tool for e-learning for linguists. The TC wiki Classroom namespace contains
several examples of classroom collaborations involving TC.
2.2</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>The TypeCraft Editor</title>
        <p>After having imported a text into the TC Editor, which is easily accessed from
the TC sites navigation bar (New text ), the text is run through a simple, but
efficient sentence splitter. The user can then select via mouse click one of the
phrases to enter into the annotation mode. The system distinguishes between
translational, functional and part-of-speech glosses. Properties can be assigned</p>
        <sec id="sec-2-2-1">
          <title>5 TC runs in Firefox, Chrome, Safari and Opera.</title>
          <p>to linguistic phrases as well as to words or morphemes. Under Construction
description the user can add notes and write down open questions which are stored
together with the rest of the annotations in the XML-schema. The possibility to
report and retrieve notes is an important feature of an annotation tool which is
in particular needed at the beginning of an analysis, for collaborative annotation
and when new layers of annotation are added to previous annotations.</p>
          <p>The annotation interface is a table. Information is ordered horizontally (the
phrase and the free translation) and vertically, so that words and morphs are
aligned with their baseform and their part of speech, as well as their gloss. TC
assumes that free class morphemes are annotated for meaning while closed class
items receive a gloss. Phrases in TypeCraft are Url encoded data and thus can
be exchanged with other web applications. For further processing TC data can
be exported as XML.</p>
          <p>
            The system prompts the user for the Lazy Annotation Mode.6 Manual
annotation is further added by making TC-tags accessible from the TCwiki navigation
bar as well as by drop-down menu under annotation. Other important resources,
such as Ethnologue [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] can be directly reached from the TC Editor.
          </p>
          <p>Since TypeCraft uses Unicode, every script that the user can produce on
the PC can be entered into the browser. As of recent, TypeCraft offers the
transliteration of Mandarin Chinese and Telugu, a Dravidian language of India,
to make data that has been entered into the database using the original script
more widely accessible. The export of data to the main text editors is one of
the services that TC offers. TC tokens can be exported to Microsoft Word,
OpenOffice.org Writer and LaTeX, which allows easy integration of databased
material into the user’s favourite text editing software. Although annotating in
TypeCraft is still time consuming it is the re-usability of data that pays off.
Export can be selected from the text editing window or from the SEARCH
interface.
2.3</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>TypeCraft Search</title>
        <p>A TypeCraft search operates on phrases, which means that the result of a query is
a phrase level representation. Each line (or block) of the search result represents a
linguistic phrase. In our experience, search results consisting of lists of sentences
can be evaluated more easily by humans than lines of concordances. Search
results come in two flavours, either as lines of sentences which allow a first
quick scan of the data, or as blocks of IGTs which give the linguist access to
the sentence internal annotations. Using general browser functionality, search
results can be easily scanned. TypeCraft allows for complex searches on several
tiers from the search interface where words or morphemes queries can relatively
freely be combined with a search for specific glosses or combinations of glosses
co-occurring either in a phrase, or on a word or a morpheme. Visitors of TC can
at present search freely in the 45% of the private data on TypeCraft that has
been made available for general search.
6 Lazy Annotation Mode (LAM) is a function that automatically enriches annotation
tables with word related information already known to the database.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>System Description</title>
      <p>TypeCraft is a based on a remote server and a web-based client running locally in
the user’s browser. The various components of the system and their interactions
are shown in Figure 1.
The TC server consists of a Java server application running in Tomcat,
MediaWiki running in Apache, and a PostgreSQL database. The Apache server
acts as a single entry point for both MediaWiki and the TC server
application. MediaWiki has been extended with various extensions to enable specific
TC functionality, such as My texts, Phrase search and TC search. The login is
handled by MediaWiki but once authorised it will be shared with the TC server
application. This guarantees TC will only divulge data for which the logged in
user has permission.</p>
      <p>Storage and data model TC uses PostgreSQL for data storage. The data
mapping between Java objects and database tables is managed by Hibernate so
the system is not bound to any specific SQL database. TC data can be divided
into two specific groups:
– Common data: pos tags, gloss tags, global tags, ISO 639-3 languages. This
is data shared between all annotated tokens and users.
– Individual data: texts, phrases, words and morphemes, together with their
annotation. This is data specific to each user.</p>
      <p>Individual data items reference common data items. E.g. everyone who uses the
pos tag n will share the reference to a single common tag n.
3.2</p>
      <sec id="sec-3-1">
        <title>Client</title>
        <p>The user interacts with TC through a web-based interface. The interface consists
of the customised MediaWiki content and a text-and-phrase editor (TC editor).
The MediaWiki content provides wiki pages, wiki search, as well as the
TCspecific functions mentioned previously. The editor is used to edit texts and
phrases and assign annotation to phrases. It is written entirely in JavaScript
and builds a GUI using HTML elements. The editor opens in separate browser
window and is not directly connected to the MediaWiki content. The present
GUI uses the YUI library and a large amount of TC-specific code. An important
point is that for the end-user, TC is a single web application integrating the wiki
and the editor.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Summary and Outlook</title>
      <p>Setting it apart from the other main linguistic tools in its category, TC focuses
on collaboration and the exchange of annotated linguistic data in the form of
IGT. TC is low-tech on the outside. Non-computer-oriented linguists, language
specialists and language communities can start using it almost instantly. While
other linguistic software makes use of forums, blogs and other social software
to service their user communities, TypeCraft IS social software. It is a powerful
tool, however its real potential resides in its user community. Active users
promote the creation and exchange of IGT, they link their data and contribute in
this way to the availability, connectedness and quality of linguistic data.</p>
      <p>One of the difficulties we faced over time was how to extend the data model
when there are new requirements, or when a specific group needs a slightly
different annotation model. Another issue is how to integrate TC annotations
with third-party annotations on another level of the same data, e.g. audio and
video. We believe the right way to solve these issues is to switch to an
RDFbased data model that will allow us to be flexible and also allow us to benefit
from existing Linked Open Data (e.g. reference annotated tokens to WordNet).
This will also open possibilities to create open RDF-based standards for linking
heterogeneous annotations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Beermann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mihaylov</surname>
          </string-name>
          , P.: e-Research for Linguists.
          <source>In: Proc. of the 5th ACLHLT Workshop on Language Technology for Cultural Heritage</source>
          ,
          <source>Social Sciences, and Humanities</source>
          (
          <year>2011</year>
          )
          <fpage>24</fpage>
          -
          <lpage>32</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Broeder</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sloetjes</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trilsbeek</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Uytvanck</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Windhouwer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wittenburg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Evolving challenges in archiving and data infrastructures</article-title>
          . In: Nau,
          <string-name>
            <given-names>H. N.</given-names>
            ,
            <surname>Schnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Wegener</surname>
          </string-name>
          , C. (eds.):
          <article-title>Documenting endangered languages: Achievements and perspectives (</article-title>
          <year>2011</year>
          )
          <fpage>33</fpage>
          -
          <lpage>54</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Farrar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langendoen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A linguistic ontology for the Semantic Web</article-title>
          .
          <source>GLOT International</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ) (
          <year>2003</year>
          )
          <fpage>97</fpage>
          -
          <lpage>100</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Ethnologue:
          <article-title>Languages of the World</article-title>
          . http://www.ethnologue.com/web.asp
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Leipzig Glossing Rules.
          <source>Max Planck Gesellschaft</source>
          (
          <year>2010</year>
          ) http://www.eva.mpg.de/lingua/resources/glossing-rules.php
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>