<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>T-PAS Scraper: an Application for Linguistic Data Extraction and Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emma Romani</string-name>
          <email>emma.romani01@universitadipavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Gattero</string-name>
          <email>valerio.gattero01@universitadipavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabetta Ježek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>. Università degli Studi di Pavia</institution>
          ,
          <addr-line>Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we introduce T-PAS Scraper, a new online application for linguistic data extraction and analysis connected to the T-PAS resource, a corpus-based digital repository of Italian verbal patterns (Ježek et al., 2014). The application is conceived as a supplementation of the main functions of T-PAS and can be used concurrently with the resource, thus extending its accessibility. It consists of 25 different scripts which operate automatically on the database of the resource and can be useful for quantitative and qualitative studies of the linguistic data it contains.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        T-PAS Scraper is a new online application for
linguistic data extraction and analysis. It is
designed as an extension and supplementation of
T-PAS resource (
        <xref ref-type="bibr" rid="ref4">Ježek at al., 2014</xref>
        )1, a
corpusbased digital inventory of Italian verbal patterns,
which provide syntactic and semantic information
on the verb-argument structures (that is, the
patterns)2.
      </p>
      <p>Initially, the project was not conceived as
online application for the linguistic analysis of
TPAS resource data. The first idea was to find a
way to speed up the revision of the resource: we
needed a fast system that could facilitate the
manual correction of annotators’ mistakes within
the patterns contained in the editor of the resource.
As the dimension of the resource is considerable
(T-PAS is a repository of 5326 patterns)3,
checking all of them manually while going
through the correction and refinement phase
would have been time-consuming. Annotators’
mistakes are not widespread within the resource
as most of the work performed on T-PAS was
carried out manually, but isolated errors can
occur.</p>
      <p>As a solution, we developed a series of
T-PASspecific scripts running on the updated T-PAS
resource database (a JSON-structured file
containing all the patterns and the related
information), which can extract lists of aggregated
data, displayed in columns. By skimming the lists,
one can easily notice data errors in the extracted
data and therefore correct them by moving to the
editor of the resource and editing the patterns
which were wrongly annotated.</p>
      <p>As the number of scripts that we developed was
consistent, covering several aspects of the
resource, we believed that they would have been
useful for users and not only for annotators’
revision. We decided to build an online
application, which is called T-PAS Scraper, that
extends T-PAS accessibility: it can be used by
future T-PAS users for quantitative and
qualitative studies of its linguistic data4.</p>
      <p>In this paper we describe the application in its
components, how it is related to T-PAS resource
and some possible uses.</p>
      <p>The paper is structured as follows. In Section 2
we briefly describe the T-PAS resource, its main
features and the online interface. In Section 3 we
introduce T-PAS Scraper and provide a technical
has been developed at the University of Pavia during a
curricular internship in which the first two authors
were involved with the aim of refining, correcting, and
improving the resource and its main features before its
online publication.
3 Last update: 25/08/2021.
4 This is the primary and final purpose of the
application, as the refinement, correction, and
improvement phase is about to be concluded for the
upcoming release of the resource.
explanation on how it was built; we also present
the main functions of T-PAS Scraper, and in
which sense they are complementary to T-PAS
resource, as well as how it can be used from a user
perspective. In Section 4 we discuss some future
perspective on the project.
2</p>
    </sec>
    <sec id="sec-2">
      <title>T-PAS Resource</title>
      <p>
        T-PAS (
        <xref ref-type="bibr" rid="ref4">Ježek et al., 2014</xref>
        ) is a corpus-derived
resource consisting of an inventory of Typed
Predicate-Argument Structures (T-PAS) for
Italian verbs. It is a gold standard for Italian
verbargument structures. The resource is being
developed at the University of Pavia with the
technical support of Lexical Computing Ltd. (CZ)
and is intended to be used for linguistic analysis,
language teaching, and computational
applications. The resource consists of four
fundamental components:
1. a repository of corpus-derived predicate
argument structures (called patterns) with
semantic specification of their argument slots,
e.g. [Human] drinks [Beverage];
2. an inventory of ca. 200 corpus-derived
semantic classes (called Semantic Types)
organised in a hierarchy (called System of
Semantic Types), used for the semantic
specification of the arguments;
3. a corpus of manually annotated sentences that
instantiate the different patterns of the verbs
in the inventory. Corpus lines are tagged with
their respective pattern numbers and anchored
to the verb they feature, which is the lexical
unit of analysis5;
4. an editing system called Skema
        <xref ref-type="bibr" rid="ref1">(Baisa et al.,
2020)</xref>
        , which allows the registration of
patterns and all the syntactic and semantic
information associated therewith and
facilitates the manual annotation of corpus
instances (directly linked to the patterns)6.
      </p>
      <p>
        Typed predicate-argument structures are
patterns that display the semantic properties of
verbs and their arguments: for each meaning of a
verb, a specific pattern is provided. As referenced
above, the patterns are corpus-derived, i.e., they
are acquired through the manual clustering and
5 The reference corpus for the resource is the web
corpus ItWac (reduced), provided by Sketch Engine. It
contains around 935 million tokens.
6 Skema
        <xref ref-type="bibr" rid="ref1">(Baisa et al., 2020)</xref>
        is a corpus pattern editor
system implemented to facilitate the management of
manual annotation of concordance lines with
userdefined labels and the editing of the corresponding
annotation of corpus instances, following the CPA
methodology
        <xref ref-type="bibr" rid="ref2">(i.e., Corpus Pattern Analysis;
Hanks, 2013)</xref>
        . Currently, T-PAS contains 1165
implemented verbs, 5,326 patterns, and ca.
200,000 annotated corpus instances.
      </p>
      <p>In the resource, each pattern is labelled with a
pattern number and connected to a list of corpus
instances realising that specific verb meaning.
The Skema editor enables the registration of
different semantic and lexical information in each
pattern: the verb, which in T-PAS is generally in
its infinitive form - e.g., bere (Eng., ‘to drink’);
the Semantic Types (e.g. [Human], [Beverage],
always portrayed within square brackets),
specifying the semantics of the arguments
selected by the verb; argument positions7, which
are filled by the Semantic Types in the patterns;
the sense description, i.e. a brief definition of the
meaning of the verb in that specific pattern; a
lexical set (optional) for each Semantic Type in
the pattern, i.e. a selection of the most
representative lexical items instantiating that
Semantic Type (e.g. vino = ‘wine’ | birra = ‘beer’
| aranciata = ‘orange juice’ are good candidates
for the lexical set of [Beverage]); the roles
(optional) played by some specific Semantic
Types in certain contexts: in particular, the
Semantic Type [Human] can acquire the role of
Athlete, Doctor, Musician, Host, Guest, Writer,
etc., depending on the verb selecting it as an
argument; the features (optional) associated with
the Semantic Types, i.e. certain semantic
characteristics required by the pattern syntax (e.g.
Plural) or by the specific verb meaning (e.g.
Female, Negative, Visible); prepositions (for
prepositional complements); particles (for
adverbials); complementizers (for clausals);
quantifiers, and determiners (for lexical sets),
which can be implemented according to the
specific argument position in question.</p>
      <p>
        The System of Semantic Types used to classify
the semantics of arguments (
        <xref ref-type="bibr" rid="ref9">Pustejovsky et al.,
2004</xref>
        ;
        <xref ref-type="bibr" rid="ref3">Ježek, 2019</xref>
        ) is a hierarchy of general
semantic categories obtained by manual
clustering of the lexical items found in the
argument positions of corpus-derived valency
structures. The System currently contains ca. 200
Semantic Types that are hierarchically organised
patterns in terms of slots, attributes and other features
following the lexicographic technique of CPA
        <xref ref-type="bibr" rid="ref2">(Hanks,
2013)</xref>
        .
7 I.e., subject, object, adverbial, clausals, prepositional
complement, predicative complement. They can be
optional, but yet registered in the pattern if they are
relevant to the sense of the verb.
on the basis of the ‘is a’ (subsumption) relation
(e.g. [Human] is an [Animate]).
      </p>
      <p>
        T-PAS online version, which will be publicly
available for the users by the end of 2021, will
consist of:
1. the repository of predicate-argument
structures (patterns);
2. five good corpus examples
        <xref ref-type="bibr" rid="ref8">(GDEX; Kilgarriff
et al., 2008)</xref>
        for each of the patterns
(previously annotated);
3. the System of Semantic Types;
4. a search engine that allows to search Semantic
Types and argument positions (subject,
object, etc.) in combination.
      </p>
      <p>T-PAS Scraper aims at completing and
integrating T-PAS functionalities: the two
interfaces can be used complementarily when
visualizing the pattern and searching for specific
linguistic phenomena.
3
3.1</p>
    </sec>
    <sec id="sec-3">
      <title>T-PAS Scraper</title>
      <sec id="sec-3-1">
        <title>Building T-PAS Scraper</title>
        <p>T-PAS Scraper is, in its first release, constituted
by two parts: the scripts to retrieve the data from
T-PAS database, which were created using
Python8 and PyCharm9, and a graphical interface
that produces a cross-platform executable
program.</p>
        <p>The program can load a Sketch
Enginecompatible database, select a script, and run it.
Once clicked on the “Run script” button (see
Figure 1), a JSON file is produced with the
requested content (i.e., the list of extracted data).
Every script is different and prints different
complex data in its output, but the way in which
data is structured is identical in all of them.</p>
        <p>The online user application was created by
programming a web application with Angular, a
popular front-end framework, PrimeNG, a
component library, and Express, a NodeJS server
that allows the application to be loaded on
Heroku, a hosting company. The procedure
followed in order to develop this application
consists of two steps: first of all, the scripts were
formally defined using a pseudocode, the
linguistic data that are the object of the extraction.
In the second phase, each script was implemented
and printed on a JSON file.
8 https://www.python.org/ (last access: 23/09/2021).
9 https://www.jetbrains.com/pycharm/ (last access:
23/09/2021).
10 It is possible to refine the research by filtering the
columns (e.g., a specific verb and its related
information in each script can searched typing the verb
The online interface (see Figure 2) consists of
several parts. First, a menu indicates the top
categories of scripts that can been chosen. The top
categories are Verbs, Semantic Types, Arguments
and Lexical Sets; each of them contains a group
of related scripts. For each top category there is a
description page and a list of several scripts. Once
a script has been selected, the data is shown in
table format (generally long lists with different
columns), with the possibility of filtering and
paging the results in different ways. Data are
displayed in alphabetical order, based the first
column on the left (containing the verbs). There is
also the option to export the current script in Excel
format, which can be handy for studying the data
externally.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Using T-PAS Scraper</title>
        <p>T-PAS Scraper is useful to access semantic and
syntactic information about verbs and their
arguments, which cannot be accessed in T-PAS
online in aggregate (see Section 2). The basic idea
that underlies T-PAS Scraper usage is to have an
aggregated overview of the lists (i.e., columns) of
data being extracted10.</p>
        <p>As for the verbs, one can visualize the complete
list of the 1165 verbs with different information:
the number of patterns for each verb (and what is
the average number of patterns, as well as which
verbs have the highest number of patterns); the
frequency and per-million frequency in the corpus
(to check whether the frequency is somehow
in the filter function of the first column of the list,
which work as a search box, showing all the items
corresponding to the typed query), but this kind of
query better fits to T-PAS resource and its search
engine (which allows to type a specific verb and open
the list of patterns and the examples).
related to the number of patterns of that specific
verb). The entire inventory of 5326 patterns
contained in T-PAS online, together with the
related sense description is provided, both
separately and jointly. Verbs in patterns can be
registered differently from their base form (i.e.,
the infinitive) or show for examples reflexive
uses, and therefore the entire list of the verb forms
can be filtered in order to obtain those forms (e.g.,
lavarsi, Eng. ‘to wash yourself’).</p>
        <p>The complete inventory of the 212 phrasal verb
patterns annotated in the resource (e.g., buttare
via, Eng. ‘to throw away’) and 388 idiomatic uses
(e.g., bersi il cervello, ‘to go crazy’) can also be
searched, also in parallel to the patterns in T-PAS
online for explanatory examples.</p>
        <p>
          As for Semantic Types, one can search for the
most frequent alternations of Semantic Type in
argument positions (
          <xref ref-type="bibr" rid="ref5">Ježek et al., 2021</xref>
          ; see Figure
2 column 4 for examples) as well as the semantic
roles and the features associated to the Semantic
Types (see Section 2).
        </p>
        <p>For what concerns the arguments, a list of the
argument structures is provided (e.g.,
subjectobject, subject-clausals, subject-prepositional
complement) as well as those which are optional
and obligatory. Syntactic alternations of
arguments (e.g., finire il pranzo (object) vs. finire
di mangiare (clausals) – Eng., ‘to finish the meal’
vs. ‘to finish eating’) are also listed.</p>
        <p>Finally, complementizers, prepositions,
adverbial particles, and obligatory determiners
annotated within the patterns in T-PAS, as well as
lexical sets, can be extracted through T-PAS
Scraper and analysed by the researcher in their
distribution.</p>
        <p>In Table 1 we provide some quantitative data
regarding T-PAS resource, that can be extracted
through T-PAS scraper.
In this paper we introduced T-PAS Scraper, a new
online application for linguistic data extraction
and analysis specifically devised to retrieve the
data contained in the T-PAS resource and make
them available to users for purposes of linguistic
analysis, thus extending T-PAS resource
accessibility. We described why and how it was
built and its main functions related to T-PAS
resource. We also suggested some possible uses in
terms of qualitative and quantitative analysis,
from a user perspective.</p>
        <p>As a new-born project, T-PAS Scraper
application is just at its initial stage and further
work can be done. In particular, new scripts can
be added to enrich the existing ones.</p>
        <p>Currently, the data displayed on the application
are in a static form: the updated database needs to
be manually re-uploaded in case of some changes
in T-PAS editor. The final goal would be to load
the data from T-PAS database within Sketch
Engine in real time, run the scripts and show the
results: in a first phase the data will be displayed
in this way. Loading data directly from Sketch
Engine also requires coordination and the creation
of a dedicated API with external authentication,
which does not currently exist. A dynamic
infrastructure has countless advantages: it allows
to view data in real time, it is scalable and
functional, and can also communicate with other
systems.</p>
        <p>T-PAS Scraper may eventually be extended to
resources other than T-PAS whose structure is
compatible with the database configuration of
TPAS and Sketch Engine.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Baisa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiberius</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ježek</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marini</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Romani</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Skema: A New Tool for Corpus-driven Lexicography</article-title>
          .
          <source>In Proceedings of the 19th EURALEX International Congress.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Hanks</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <source>Lexical Analysis: Norms and Exploitations</source>
          . Cambridge: The MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Ježek</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Sweetening Ontologies Cont'd: Aligning bottom-up with top-down ontologies</article-title>
          . In A. Barton,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seppälä</surname>
          </string-name>
          &amp; D. Porello (eds.)
          <source>Proceedings of the Joint Ontology Workshops</source>
          <year>2019</year>
          . Graz, Austria.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Ježek</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feltracco</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bianchini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>T-PAS: A resource of corpusderived Types Predicate-Argument Structures for linguistic analysis and semantic processing</article-title>
          .
          <source>In Proceedings of LREC</source>
          . pp.
          <fpage>890</fpage>
          -
          <lpage>895</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Ježek</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marini</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romani</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Encoding semantic phenomena in verb-argument combinations</article-title>
          . In Kosem, I.,
          <string-name>
            <surname>Cukr</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakubíček</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kallas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krek</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Tiberius</surname>
            ,
            <given-names>C</given-names>
          </string-name>
          . (eds.),
          <article-title>Electronic lexicography in the 21st century</article-title>
          .
          <source>Proceedings of the eLex 2021 conference. 5-7 July</source>
          <year>2021</year>
          ,
          <article-title>virtual</article-title>
          . Brno:
          <article-title>Lexical Computing CZ, s</article-title>
          .r.o.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Kilgarriff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baisa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bušta</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakubíček</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovář</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michelfeit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rychlý</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Suchomel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>The Sketch Engine: ten years on</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Lexicography</surname>
          </string-name>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>7</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Kilgarriff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Husák</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McAdam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rundell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Rychlý</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2008</year>
          ,
          <article-title>July)</article-title>
          . GDEX:
          <article-title>Automatically finding good dictionary examples in a corpus</article-title>
          .
          <source>In Proceedings of the XIII EURALEX international congress. Barcelona</source>
          , Spain: Documenta Universitaria. pp.
          <fpage>425</fpage>
          -
          <lpage>432</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Pustejovsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanks</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Rumshisky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <source>Automated Induction of Sense in Context. In Proceedings of the 20th International Conference on Computational Linguistics (COLING)</source>
          . Geneva, Switzerland.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>