<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DHTK: The Digital Humanities ToolKit</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Davide Picca</string-name>
        </contrib>
        <contrib contrib-type="author">
<string-name>Mattia Egloff</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>81</fpage>
      <lpage>86</lpage>
      <abstract>
        <p>The Digital Humanities have the merit of connecting two very different disciplines: the humanities and computer science. In this paper we present a new Python library, the Digital Humanities ToolKit (DHTK), whose purpose is to provide a fast and intuitive tool for the large-scale study of large literary databases by leveraging well-known semantic knowledge resources. In addition, DHTK has the ambition to go beyond textual resources and to integrate other resources of the humanities such as images (e.g., paintings, comics) or sounds (e.g., music, transcripts). DHTK is a collaborative project, and we invite anyone who has interest or ideas in this context to reach out to us.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>DHTK will be freely accessible, and anyone is free to participate in the project by contributing to its technical development.</p>
    </sec>
    <sec id="sec-2">
      <title>Design Criteria</title>
      <p>
        In order to guarantee compatibility with the main NLP tools, we decided to write DHTK in Python. From a programming point of view, DHTK takes into account some of the criteria already listed in Loper et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and includes one additional criterion specific to DH. In particular, since DHTK was born with the aim of offering researchers the opportunity to exploit the semantic resources available in the Linked Open Data (LOD) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], one of DHTK's important features is its modularity. Thanks to this feature, DHTK can be expanded with other LOD resources independently, without interfering with the pre-existing modules. In addition to the programming requirements already mentioned above, the following subsection provides a list of further specifications of our work.
      </p>
      <sec id="sec-2-1">
        <title>Requirements</title>
        <p>– Digital Humanities Oriented. This is the main and distinctive feature of DHTK. The toolkit has been conceived to exploit and treat works coming from the human sciences. Thus, access to corpora from the literary or visual arts has greater importance than the processing of the corpora itself, for which other libraries, like NLTK, have been conceived.
– Ease of use. The main purpose of the library is to facilitate the exploitation of LOD resources such as Gutenberg.org and DBpedia, thereby being accessible to researchers with more modest experience in programming. In fact, DHTK is conceived as a high-level library that provides APIs that can be easily called, as we will see in Section 4. Given the simplicity of DHTK, it could be adopted by universities as a tool for computer science courses specifically designed for humanists.
– Modularity. We conceived the library following the KISS principle, "Keep It Simple and Straightforward." Our objective is to make each module as self-contained as possible so that we can easily add higher-level modules as needed.
– Efficiency. Since we deal with large databases such as Gutenberg and DBpedia, we privilege efficiency and time-effectiveness over the coherence of the programming language.
– Extensibility. As for NLTK, our tool's main feature is extensibility. In our library we focus on Semantic Web resources such as the LOD, which counts on extensive resources. DHTK aims to provide a simple and constant interface to allow its extension to new modules and resources.
– Documentation. The toolkit, its data structures, and its implementation are carefully documented, with all nomenclature and acronyms carefully selected. The main purpose of the documentation is not only to facilitate use, but also to help future developers who wish to contribute to the project.</p>
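        <p>The modularity and extensibility requirements above can be pictured as each textual resource implementing one small, constant interface. The sketch below is illustrative only; the class and method names are assumptions, not DHTK's actual API.</p>

```python
# Illustrative sketch of DHTK-style modularity: each LOD resource lives in a
# self-contained module implementing one small, constant interface.
# Names here are assumptions for illustration, not DHTK's real classes.
from abc import ABC, abstractmethod


class TextRepository(ABC):
    """Minimal interface that every textual-resource module would implement."""

    @abstractmethod
    def search_by_author(self, first_name, last_name):
        """Return the catalog records matching the given author."""

    @abstractmethod
    def get_text(self, book_id):
        """Return the full text of one book."""


class GutenbergRepository(TextRepository):
    """Toy stand-in for a Project Gutenberg module."""

    def __init__(self, catalog):
        # catalog: records as crawled from the RDF Gutenberg catalog
        self._catalog = catalog

    def search_by_author(self, first_name, last_name):
        full_name = first_name + " " + last_name
        return [b for b in self._catalog if b["author"] == full_name]

    def get_text(self, book_id):
        return "full text of " + book_id  # placeholder for a real download
```

        <p>A new LOD resource could then be added as another subclass, without touching the pre-existing modules.</p>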
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Modules</title>
      <p>The DHTK toolkit is composed of four independent modules, organized according to the logical task they perform: common modules, text resources, metadata search, and NLP processing.</p>
      <sec id="sec-3-1">
        <title>Common Modules</title>
        <p>
          The common modules' main purpose is to ensure continuity and structural consistency between the various textual-resource modules that will be added over time. They are composed of classes that define the general concepts of Author, Book, Corpus, and TextRepository, and of the Fuseki and RDFPro [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] wrappers used to handle RDF files.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Text Resources</title>
        <p>This module is designed to contain the textual resources available on the LOD. We initially focused on Project Gutenberg because of its relevance to the humanist community and because of the lack of automatic tools available to exploit this specific resource. The main purpose of this module is to facilitate access to texts. At the time of writing, Table 1 reports some basic statistics on the data automatically crawled by DHTK.</p>
        <p>Table 1. Data crawled by DHTK: # of books: 54975 (the author, bookshelf, and book-category counts are missing from this copy).</p>
        <p>One of the most important aspects for a humanist is not just the availability of texts but also the metadata that come with them. Unfortunately, not all repositories have complete information, such as the date of first publication or the first publisher. For example, Project Gutenberg does not store the original publication date of its books, which can be a problem if a corpus needs to be delimited to a decade of published literature. DBpedia, thanks to its encyclopaedic nature, helps to retrieve these elements. To this end, we have created the metadata search module on DBpedia, which allows the missing information to be completed and expanded.</p>
        <p>Finally, the NLP processing module aims to integrate existing NLP tools from other libraries. For the time being, DHTK fully integrates the NLTK library, by which this work is inspired.</p>
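        <p>The metadata-completion step amounts to querying the DBpedia SPARQL endpoint for the properties Gutenberg lacks. The helper below only builds such a query as a string; the function name and the dbo: properties chosen are illustrative assumptions, not DHTK internals.</p>

```python
# Illustrative sketch: build a SPARQL query asking DBpedia for a book's
# release date and author, the kind of information Project Gutenberg lacks.
# PREFIX declarations for dbo:/dbr: are omitted for brevity.
def build_metadata_query(book_uri):
    """Return a SPARQL query string for the given DBpedia book URI."""
    return (
        "SELECT ?releaseDate ?author WHERE { "
        + "OPTIONAL { " + book_uri + " dbo:releaseDate ?releaseDate . } "
        + "OPTIONAL { " + book_uri + " dbo:author ?author . } "
        + "}"
    )


query = build_metadata_query("dbr:Moby-Dick")
```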
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Usage</title>
      <p>Having provided an overview of the different logical modules, this section explains their usage. First, we explain the usage procedure. Then we show two concrete examples: a search in the Gutenberg catalog and a metadata search in DBpedia.</p>
      <sec id="sec-4-1">
        <title>How to use DHTK: the Use Case workflow</title>
        <p>As mentioned earlier, DHTK is a work-in-progress project and, for the time being, the user can search for authors or works in the RDF Gutenberg Catalog that DHTK has previously loaded into a local instance of Fuseki2 (https://jena.apache.org/documentation/fuseki2/). If the information returned is incomplete, the user can query DBpedia to obtain further data to complete the annotation and to search for metadata (e.g., the year of the first edition, main characters, etc.). All metadata available in DBpedia on a specific work or author are in principle retrievable. Once the corpus has been created, the user can use the NLTK library to perform text processing and store the results in a local database provided by DHTK; our local database is an instance of PostgreSQL.</p>
        <p>As already mentioned, the following examples aim to show the specific aspects that demonstrate the suitability of this library for researchers in DH. In particular, in collaboration with colleagues who teach literature, we have identified at least two main topics that are generally of interest to the humanist community. The first is the automatic building of corpora, and the second is the search for metadata on a work to complete missing information, such as the list of main characters.</p>
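        <p>Once the texts are stored locally, the processing step hands them to NLTK. As a self-contained illustration of that step, the trivial tokenizer below stands in for NLTK's routines; it is neither DHTK nor NLTK code.</p>

```python
# Minimal stand-in for the NLTK processing step applied to a downloaded text.
import re


def tokenize(text):
    """Stand-in for NLTK tokenization: lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count token occurrences in a text."""
    freqs = {}
    for token in tokenize(text):
        freqs[token] = freqs.get(token, 0) + 1
    return freqs


freqs = word_frequencies("It is a truth universally acknowledged, that a single man ...")
```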
      </sec>
      <sec id="sec-4-2">
        <title>Example 1: Searching in Gutenberg catalog</title>
        <p>The first example shows how easily corpora can be built automatically using DHTK. In this example, we want to build a corpus based on the author "Jane Austen." It would also be possible to build corpora based on subject or bookshelf, for example. Once Austen's works have been retrieved, we specify the languages we are interested in, and the tool returns a list of all books in those languages. We can finally decide to download them into a local repository in order to process them with NLTK.</p>
        <preformat>
# Search by author
books = gs.search_by_author("Jane", "Austen")
# Retrieve only books in English
books = [gs.book_from_book_id(book["bookid"])
         for book in books if book["language"] == "en"]
# Build the corpus
corpus = Corpus("Austen",
                description="The books of Jane Austen",
                corpora_path=corpora_path, book_list=books)
# Print the list of books in the corpus
corpus.print_book_list()
# Output:
0 Jane Austen  Northanger Abbey
1 Jane Austen  Lady Susan
...
# Download all texts in a local directory
corpus.download_book_corpus()
# Output:
['Jane_Austen Mansfield_Park.txt',
 'Jane_Austen Pride_and_Prejudice.txt', ...]
        </preformat>
      </sec>
      <sec id="sec-4-3">
        <title>Example 2: Get metadata from DBpedia</title>
        <p>This example shows how to obtain metadata from DBpedia in an easy way. In only a few lines we can retrieve the missing information from DBpedia and obtain the list of characters or the categories to which the work belongs.</p>
        <preformat>
# You can now get metadata using the BookID
book = gutenberg_search.book_from_book_id(
    "http://www.gutenberg.org/ebooks/2489")
dbpedia_metadata = DbpediaMetadata()
dbpedia_uri = dbpedia_metadata.search_book_uri(book)
dbpedia_metadata.get_book_metadata(book)
# Output:
'characters':
  {'dbpedia': 'http://dbpedia.org/resource/List_of_Moby-Dick_characters'}
'subjects':
  ['http://dbpedia.org/resource/Category:1851_novels',
   'http://dbpedia.org/resource/Category:19th-century_American_novels',
   'http://dbpedia.org/resource/Category:Allegory']
# complete list of 'characters' omitted for lack of space
        </preformat>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        In this paper we have described DHTK, a work-in-progress library that aims to provide non-computer-science specialists with a tool to access the textual resources available on the LOD. Being a work in progress, only Gutenberg and DBpedia are available for the moment. We are working to integrate other repositories such as Europeana [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Yago [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In addition, DHTK has the ambition to go beyond textual resources and to integrate other resources of the humanities such as images (e.g., paintings, comics) or sounds (e.g., music, transcripts). DHTK is a collaborative project, and we invite anyone who has interest or ideas in this context to reach out to us.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tom</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Corcoglioniti</surname>
          </string-name>
          , Marco Rospocher, Michele Mostarda, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Amadori</surname>
          </string-name>
          .
          <article-title>Processing billions of RDF triples on a single machine using streaming and sorting</article-title>
          .
          <source>In Proceedings of the 30th Annual ACM Symposium on Applied Computing, SAC '15</source>
          , pages
          <fpage>368</fpage>
          -
          <lpage>375</lpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <article-title>NLTK: The natural language toolkit</article-title>
          .
          <source>In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP '02</source>
          , pages
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          , Stroudsburg, PA, USA,
          <year>2002</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Fabian M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gjergji</given-names>
            <surname>Kasneci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gerhard</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Yago: a core of semantic knowledge</article-title>
          .
          <source>In Proceedings of the 16th international conference on World Wide Web (WWW07)</source>
          , pages
          <fpage>697</fpage>
          -
          <lpage>706</lpage>
          , New York, NY, USA,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Bjarki</given-names>
            <surname>Valtysson</surname>
          </string-name>
          .
          <article-title>Europeana: The digital construction of Europe's collective memory</article-title>
          .
          <source>Information, Communication and Society</source>
          ,
          <volume>15</volume>
          (
          <issue>2</issue>
          ):
          <fpage>151</fpage>
          -
          <lpage>170</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>