<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a Tool for Extracting Specialized Argument Structures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Beatriz Sánchez-Cárdenas</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Rienda</string-name>
          <email>prienda@correo.ugr.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nuria Medina-Medina</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Ramisch</string-name>
          <email>carlos.ramisch@lis-lab.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aix Marseille Université</institution>
          ,
          <addr-line>CNRS, LIS, Marseille</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de Granada</institution>
          ,
          <addr-line>avenida del Hospicio, 1, 18010</addr-line>
          ,
          <country>España</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad de Granada</institution>
          ,
          <addr-line>calle Buensuceso, 11, 18002</addr-line>
          ,
          <country>España</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This contribution presents the design and development of MarcoTAO, a web-based prototype for the extraction and analysis of specialized argument structures in multilingual corpora. The tool encapsulates complex command-line scripts into a user-friendly interface, allowing researchers to load, parse, and index corpora, search for noun-verb-noun triples, and organize results into lexical clusters. By leveraging distributional semantics models like word2vec, MarcoTAO refines clusters by filtering irrelevant terms and enriching them with semantically related ones. The prototype supports cross-platform accessibility, ensures centralized server-side storage, and provides scalable functionality for future extensions. Currently in the testing phase, MarcoTAO addresses the limitations of previous tools by streamlining corpus analysis and making phraseological studies more accessible to academia.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Until recently, the study of specialized language tended to focus on terms. However, at the beginning
of the century researchers realized that the description of any specialized domain should go beyond
noun description and consider other information, such as phraseological structures, which reflect not only the
language of a domain but also how scientists in that domain express a given process. As a result, translating just
the verb, rather than the whole semantic structure, might lead to non-idiomatic sentences. Such
information is typically absent from general dictionaries and even from most terminological
resources, but it can be found in specialized corpora. By adopting this approach, we are more likely
to identify appropriate equivalents to express the consequences of deforestation, such as agravar,
ocasionar or alterar in Spanish.</p>
      <p>Interestingly, neural machine translation (NMT) does not always produce satisfactory results for this issue, as it tends to
translate verbs into their formal equivalents (provoque/provocar; cause/causar) rather than
domain-specific ones. This might be partly due to the fact that general machine translation tools are trained
with English corpora and do not discriminate between genres (blogs vs. scientific papers), resulting in
non-idiomatic texts with a plain style and standardized language. This is known as the “Digital
Linguistic Bias” (DLB) (Muñoz-Basols et al., 2024).</p>
      <p>The implications of this go beyond the preservation of linguistic heritage and the specificity of
each language and culture. Indeed, it can also lead to a loss of nuance, terminological imprecision or
semantic inaccuracy. Extracting the linguistic structures from comparable corpora is one potential
solution to create linguistic tools that help mitigate the standardization that comes with the use of
NMT and AI. However, analyzing concordances manually is a rather inefficient strategy. A more
efficient alternative is to run complex queries capable of modeling lexico-grammatical
co-occurrence patterns that approximate predicate-argument structures. Yet, this is also time-consuming
and demands specific skills that most scientific translators or writers lack.</p>
      <p>In this perspective, we developed a methodology to extract “noun-verb-noun” structures, called
triples, from corpora, reflecting the argumental structure of a given concept
across several languages. For instance: [Soybean expansion] in southern Brazil [contributed] to
[deforestation] by stimulating migration to agricultural frontier regions.</p>
      <p>In order to simplify this complex task, we designed a tool prototype that automates the extraction
of [noun-verb-noun] structures from specialized corpora in multiple languages. It is an easy-to-use
web interface designed to help researchers and linguists analyze this type of linguistic
information more efficiently.</p>
      <p>
        Although similar projects and initiatives exist
        <xref ref-type="bibr" rid="ref1">(Orliac 2006; Baroni &amp; Bernardini 2004; Vezzani
2023)</xref>
        , to the best of our knowledge, none offers the possibility to extract argumental structures in
the form of triples in specialized corpora across languages.
      </p>
      <p>Nevertheless, triple extraction can be achieved by employing a range of corpus
tools able to identify argument structures. Previous research (Sánchez Cárdenas 2024)
compared the performance of MWEtoolkit and Sketch Engine in facilitating this specific task. The
key difference between these two tools lies in the specific purpose for which they were originally
designed. Our study concluded that both tools had challenges in terms of noise in the retrieved
triples. Sketch Engine had a higher percentage of noise (90.9%), while MWEtoolkit had a relatively
lower percentage (34.4%). Additionally, MWEtoolkit achieved a higher percentage of accurate triples
(65.5%) compared to Sketch Engine (9.1%).</p>
      <p>In Section 2, we describe the protocol for extracting argumental structures in the form of triples
from specialized corpora. Section 3 explains the features of a web-based tool prototype created to
simplify these searches and includes relevant screenshots for illustration.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Retrieving triples to represent argumental structures</title>
      <p>In previous research (Sánchez Cárdenas &amp; Ramisch 2019; Sánchez Cárdenas 2024), we employed
MWEtoolkit2, a computational tool for the identification of multiword expressions in corpora
(Ramisch 2015, 2023), in order to isolate triples [noun1-verb-noun2] representing argumental
structures. For this endeavor, queries were designed through Python scripts in order to process and
query the corpora, extract candidates and sort the results.
2 http://mwetoolkit.sourceforge.net</p>
      <sec id="sec-2-1">
        <title>2.1. Processing the corpora</title>
        <p>During the preprocessing phase, texts were automatically converted to UTF-8. They were then processed and
analyzed using UDPipe, a natural language processing tool, which performed the following tasks:
tokenizing sentences into words, tagging words with their part-of-speech (POS) using the Universal
Dependencies tagset, assigning lemmas, and generating syntactic dependency trees to map relations
between words.</p>
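        <p>UDPipe emits its analyses in the CoNLL-U format. As a minimal illustration (not part of the MarcoTAO code base), the following sketch parses one sentence of such output into the token fields used in the subsequent steps:</p>

```python
# Minimal sketch: parse one sentence of CoNLL-U output (the format
# produced by UDPipe) into token records holding the fields used later:
# form, lemma, Universal Dependencies POS tag, head and dependency relation.

def parse_conllu_sentence(text):
    tokens = []
    for line in text.strip().splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip sentence-level comments and blank lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],       # Universal Dependencies POS tag
            "head": int(cols[6]),  # id of the governing token (0 = root)
            "deprel": cols[7],     # dependency relation to the head
        })
    return tokens

sample = (
    "# text = Volcanoes eject lava.\n"
    "1\tVolcanoes\tvolcano\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\teject\teject\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tlava\tlava\tNOUN\t_\t_\t2\tobj\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)
tokens = parse_conllu_sentence(sample)
# tokens[0]["lemma"] → "volcano"; tokens[2]["deprel"] → "obj"
```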
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Querying the corpora</title>
        <sec id="sec-2-2-1">
          <title>Step 1: Regular expression queries</title>
          <p>Using MWEtoolkit, queries were designed as multi-level regular expressions to extract
[noun1-verb-noun2] triples, such as [volcano-eject-lava]. These searches captured argumental structures,
but also irrelevant triples such as [volcano-see-lava], which required manual filtering. To streamline
the process, searches were encapsulated in shell scripts for easier and more efficient execution.</p>
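          <p>The queries themselves are MWEtoolkit multi-level regular expressions and are not reproduced here; purely as an illustration, an equivalent [noun1-verb-noun2] pattern can be sketched over dependency-parsed tokens (the token dictionaries and helper below are hypothetical, not MWEtoolkit code):</p>

```python
# Illustrative sketch (not MWEtoolkit code): retrieve [noun1-verb-noun2]
# candidates by pairing each verb with a nominal subject and a nominal
# object in the same dependency-parsed sentence.

def extract_triples(tokens):
    """tokens: dicts with keys id, lemma, upos, head, deprel."""
    triples = []
    for verb in tokens:
        if verb["upos"] != "VERB":
            continue
        subj = obj = None
        for t in tokens:
            if t["head"] == verb["id"] and t["upos"] == "NOUN":
                if t["deprel"] == "nsubj":
                    subj = t["lemma"]
                elif t["deprel"] == "obj":
                    obj = t["lemma"]
        if subj and obj:
            triples.append((subj, verb["lemma"], obj))
    return triples

sentence = [
    {"id": 1, "lemma": "volcano", "upos": "NOUN", "head": 2, "deprel": "nsubj"},
    {"id": 2, "lemma": "eject", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "lemma": "lava", "upos": "NOUN", "head": 2, "deprel": "obj"},
]
# extract_triples(sentence) → [("volcano", "eject", "lava")]
```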
          <p>Future research will enhance triple extraction by addressing current challenges and incorporating
new strategies. Improvements include handling complex nouns, argumental structures with more
than two complements, negations or phrasal verbs.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Step 2: Search strategies</title>
          <p>In order to test the validity of the scripts, a pilot study was conducted.</p>
          <p>Initial queries were constructed using seed terms extracted from the corpora. In previous pilot
studies within the domain of environmental sciences, these seed terms were derived from
EcoLexicon3, a knowledge base. Specifically, the semantic relations between concepts were used
to identify verbs lexicalizing those semantic relations. For instance, the query pattern [volcano-?-lava]
retrieved verbs like eject, emit or spew.</p>
          <p>These verbs were then reused to identify additional nouns that could occupy the noun1 or noun2
positions. With each iteration, one of the three elements (noun1, verb, or noun2) was underspecified,
while the others were specified based on previous query results. This iterative process gradually
expanded the representativity of the results, covering a broader range of phraseological patterns in
the domain.</p>
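          <p>The iterative strategy described above can be sketched as follows; the query helper and toy triples are illustrative, not the actual shell scripts:</p>

```python
# Sketch of the iterative search strategy: one slot of [noun1-verb-noun2]
# is left underspecified (None) and the matching values are collected;
# each answer set then feeds the next round of queries.

def query(triples, noun1=None, verb=None, noun2=None):
    """Return the values that fill the single underspecified slot."""
    hits = set()
    for n1, v, n2 in triples:
        if noun1 is not None and n1 != noun1:
            continue
        if verb is not None and v != verb:
            continue
        if noun2 is not None and n2 != noun2:
            continue
        hits.add(n1 if noun1 is None else v if verb is None else n2)
    return hits

corpus_triples = [
    ("volcano", "eject", "lava"),
    ("volcano", "emit", "lava"),
    ("crater", "eject", "lava"),
]
verbs = query(corpus_triples, noun1="volcano", noun2="lava")  # {"eject", "emit"}
nouns = query(corpus_triples, verb="eject", noun2="lava")     # {"volcano", "crater"}
```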
        </sec>
        <sec id="sec-2-2-3">
          <title>Step 3: Filtering and sorting Results</title>
          <p>Triples were automatically ranked by relevance using pointwise mutual information (PMI),
calculated from co-occurrence frequencies. Results were sorted in descending order of relevance,
with the most significant triples prioritized for further analysis. Encapsulated scripts simplified
queries, and output was stored in TSV files for manual review. Finally, triples were gathered into a
single TSV file and manually ranked using a code from 0 to 4: 0 (accurate), 1 (acceptable with minor
manual modifications), 2 (irrelevant but potentially useful for refining future searches), and 4
(incorrect).</p>
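          <p>As an illustration of the ranking step, the sketch below applies a textbook PMI estimator to triple counts; the exact association measure implemented in MWEtoolkit may differ:</p>

```python
import math
from collections import Counter

# Hedged sketch of PMI ranking for triples:
# pmi(n1, v, n2) = log2( p(n1, v, n2) / (p(n1) * p(v) * p(n2)) ),
# with probabilities estimated from raw counts over the extracted triples.

def pmi_rank(triples):
    n = len(triples)
    joint = Counter(triples)
    n1_c = Counter(t[0] for t in triples)
    v_c = Counter(t[1] for t in triples)
    n2_c = Counter(t[2] for t in triples)
    scores = {}
    for t, c in joint.items():
        p_joint = c / n
        p_indep = (n1_c[t[0]] / n) * (v_c[t[1]] / n) * (n2_c[t[2]] / n)
        scores[t] = math.log2(p_joint / p_indep)
    return sorted(scores, key=scores.get, reverse=True)  # descending relevance

corpus = [
    ("volcano", "eject", "lava"),
    ("volcano", "eject", "lava"),
    ("volcano", "see", "lava"),
    ("tourist", "see", "lava"),
]
ranked = pmi_rank(corpus)
# the irrelevant [volcano-see-lava] triple ends up ranked last
```

Note that raw PMI overweights rare words (the hapax tourist scores high here), which is one reason the protocol keeps a manual review step.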
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Distributional clustering of triples</title>
        <p>The final step involves the distributional clustering of triples marked as 0 or 1. A specific script
organizes these results into clusters, grouping similar triples into linguistic schemas based on shared
patterns. For instance: (volcano, expel, {lava, magma, rock}) or ({volcano, crater}, expel, lava). These
patterns show common phraseological structures in the domain.
3 http://ecolexicon.ugr.es/es/index.htm</p>
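        <p>The grouping of triples into such schemas can be sketched as follows (illustrative code, not the actual clustering script):</p>

```python
from collections import defaultdict

# Sketch: triples that share two of their three slots are merged into
# schemas such as (volcano, expel, {lava, magma, rock}) or
# ({volcano, crater}, expel, lava).

def group_schemas(triples):
    by_n1_verb = defaultdict(set)  # (noun1, verb) -> set of noun2
    by_verb_n2 = defaultdict(set)  # (verb, noun2) -> set of noun1
    for n1, v, n2 in triples:
        by_n1_verb[(n1, v)].add(n2)
        by_verb_n2[(v, n2)].add(n1)
    return by_n1_verb, by_verb_n2

triples = [
    ("volcano", "expel", "lava"),
    ("volcano", "expel", "magma"),
    ("volcano", "expel", "rock"),
    ("crater", "expel", "lava"),
]
by_n1_verb, by_verb_n2 = group_schemas(triples)
# by_n1_verb[("volcano", "expel")] → {"lava", "magma", "rock"}
# by_verb_n2[("expel", "lava")]   → {"volcano", "crater"}
```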
        <p>To that end, we use distributional semantics via word2vec, where words in the triples are
represented as vectors based on their co-occurrence context in the corpus. Triples are automatically
grouped using a semantic similarity measure based on word embeddings (Pilehvar &amp;
Camacho-Collados 2021) with the Gensim software.</p>
        <p>The system removes words that are infrequent in the patterns and includes words that are
semantically close to the group. A word is added only if its average similarity to the group is above
a threshold, usually 0.3.</p>
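        <p>The admission rule can be sketched without external dependencies as follows; the toy two-dimensional vectors merely stand in for real word2vec embeddings trained with Gensim:</p>

```python
import math

# Sketch of the cluster-refinement rule: a candidate word joins a cluster
# only if its average cosine similarity to the current members exceeds a
# threshold (0.3 in the protocol). Toy vectors, for illustration only.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def admit(candidate, cluster, vectors, threshold=0.3):
    sims = [cosine(vectors[candidate], vectors[m]) for m in cluster]
    return sum(sims) / len(sims) > threshold

vectors = {
    "lava":  (1.0, 0.1),
    "magma": (0.9, 0.2),
    "tax":   (0.0, 1.0),
}
cluster = ["lava"]
# admit("magma", cluster, vectors) → True; admit("tax", cluster, vectors) → False
```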
        <p>As a result, all possible combinations of nouns and verbs that appear in each position of the triples
are generated. This process yields the recurring phraseological noun-verb-noun argument structures of
the analyzed concept. This lexical clustering organizes the extracted triples in a way that is useful
from a terminological point of view. In fact, it highlights productive lexical patterns relevant for
domain-specific phraseology. This could contribute to the development of terminological resources
and improve translation quality and scientific text production.</p>
        <p>Figure 1 shows the raw output for the clustering of VOLCANO. Needless to say, this information
requires some manual refinement. Table 2 presents the lexical cluster of DEFORESTATION after manual
treatment.</p>
        <p>This kind of information is highly relevant for encoding texts in a specialized domain. In fact,
when comparing these lexical schemas with those referring to the same concept in other languages,
cross-linguistic differences become evident. This reveals not only linguistic
differences, but also conceptual and cultural mismatches. Such information is important
not only for improving terminological resources and translation tools, but also for understanding how a concept is
conceptualized across languages and cultures.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Design of MarcoTAO: towards a web user interface4</title>
      <p>This protocol cannot yet be widely used by other researchers, since it is composed of
several command-line scripts that must be executed separately. The whole process is error-prone and
lacks user-friendliness. To address these limitations and make the whole process available to the
academic community, we developed a web interface, currently in the prototype phase, that encapsulates the
existing scripts for all the phases described above. The MarcoTAO prototype is capable of: loading,
parsing and indexing corpora; searching for “noun-verb-noun” triples in the indexed corpora;
grouping search results according to similar annotations; creating lexical clusters; visualizing and
storing the results. The interface is accessible, user-friendly and compatible with various operating
systems and browsers. The interface shown in Figure 2 allows users to design triple searches starting with two
elements. For instance, the query [deforestation – provoke – ?] generates results such as floods,
droughts, or desertification. Figure 4 shows a screenshot of the search interface and Figure 6
illustrates the lexical clusters obtained when analyzing the concept CLIMA in Spanish.</p>
      <p>Additionally, users can perform bulk searches using word lists. This feature enables the inclusion
of denominative variants of a term (e.g., deforestation, logging, forest loss) or verbs that express the
same concept (e.g., cause, provoke, generate). A key advantage of the tool is its ability to seamlessly
incorporate the results of one search into subsequent queries.</p>
      <p>Concerning the technical details, the MarcoTAO prototype uses a client-server architecture. All
data, including user credentials, project information, and analysis results, are stored in a MySQL
database on the server.</p>
      <p>The frontend uses standard web technologies such as HTML, CSS, and JavaScript. These allow
the interface to change dynamically depending on the user, the selected project, or the current
analysis step. On the server side, PHP handles the backend logic. It also manages the connection with
the database and prepares the content before it is sent to the client.</p>
      <p>The application runs Python scripts by generating shell commands from the backend. These
commands start Python scripts on the server, which perform tasks such as preprocessing the corpus,
extracting triples, filtering results, and creating clusters. The backend prepares the input, launches
the scripts, and collects the output. Results are saved in structured formats like TSV or JSON, and the
frontend then displays these results visually. This allows users to perform complex analyses without
needing technical expertise.</p>
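      <p>In MarcoTAO this invocation is generated by the PHP backend; a minimal Python-side equivalent of the same pattern (with a hypothetical script path and arguments) could look like this:</p>

```python
import csv
import io
import subprocess

# Sketch of the server-side pattern described above: build a shell command,
# run an analysis script, and parse its TSV output for the frontend.
# The script name and arguments are hypothetical, not MarcoTAO internals.

def run_extraction(script_path, corpus_path):
    result = subprocess.run(
        ["python3", script_path, corpus_path],
        capture_output=True, text=True, check=True,
    )
    # Parse the TSV written to stdout into a list of rows
    return list(csv.reader(io.StringIO(result.stdout), delimiter="\t"))
```

A one-line analysis script that prints a tab-separated triple would come back as a single row, ready for JSON serialization to the client.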
      <p>One main feature of the application is that users can run scripts from any operating system. Since
all scripts run on the server, users only need a web browser. The interface shows the output in a
clear and user-friendly way. Another advantage is that users do not need to install anything. All
dependencies and configurations are stored on the server. This makes the tool easy to access and
maintain.</p>
      <p>In order to illustrate the functionalities of MarcoTAO, the following figures show key steps
in the workflow, from queries of triples (Figure 4) and triple extraction (Figure 5) to the generation
of lexical clusters based on distributional similarity (Figure 6).
4 The screenshots included aim to demonstrate the functionalities of the prototype. At this stage, the linguistic content
shown is illustrative and does not reflect the final output quality expected.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This research was carried out as part of the project PID2020-118369GB-I00, Transversal integration
of culture into an environmental terminological knowledge base (TRANSCULTURE), funded by the
Spanish Ministry of Science and Innovation.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used X-GPT-4 for grammar and spelling
checking. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
      <p>[9] Muñoz-Basols, Javier, María del Mar Palomares, and Francisco Moreno Fernández. “El Sesgo
Lingüístico Digital (SLD) en la inteligencia artificial: implicaciones para los modelos de lenguaje
masivos en español.” Lengua y Sociedad 23.2 (2024): 623-648. ORCID:
https://orcid.org/0000-0002-3136-4443.
[10] Orliac, Brigitte. “Colex: un outil d’extraction de collocations spécialisées basé sur les fonctions
lexicales.” Terminology. International Journal of Theoretical and Applied Issues in Specialized
Communication 12.2 (2006): 261-280.
[11] Pilehvar, M.T., and J. Camacho-Collados. “Word Embeddings.” In Embeddings in Natural Language
Processing, Synthesis Lectures on Human Language Technologies, Springer, Cham, (2021).
https://doi.org/10.1007/978-3-031-02177-0_3
[12] Ramisch, Carlos. Multiword Expressions Acquisition: A Generic and Open Framework. Theory
and Applications of Natural Language Processing series, XIV, Springer, ISBN 978-3-319-09206-5,
230 pp., (2015).
[13] Ramisch, Carlos. “Multiword expressions in computational linguistics: down the rabbit hole and
through the looking glass.” Habilitation à diriger des recherches, Aix Marseille University,
Marseille, France, (2023).
[14] Sánchez Cárdenas, Beatriz. “Extracting Semantic Frames from Specialized Corpora for
Lexicographic Purposes.” Círculo de Lingüística Aplicada a la Comunicación 99 (2024): 163-177.
https://doi.org/10.5209/clac.90626
[15] Sánchez Cárdenas, Beatriz, and Carlos Ramisch. “Eliciting specialized frames from corpora using
argument-structure extraction techniques.” Terminology. International Journal of Theoretical and
Applied Issues in Specialized Communication 25.1, John Benjamins (2019): 1-31.
https://doi.org/10.1075/term.00026.san
[16] Sánchez-Cárdenas, Beatriz, and Miriam Buendía-Castro. “Inclusion of Verbal Syntagmatic
Patterns in Specialized Dictionaries: The Case of EcoLexicon.” In Proceedings of the 15th
EURALEX International Congress, eds. R. V. Fjeld and J. M. Torjusen, Oslo: EURALEX, pp. 554-562,
(2012).
[17] Tutin, Agnès, and Francis Grossmann, eds. L’écrit scientifique: du lexique au discours. Presses
Universitaires de Rennes, (2014).
[18] Vezzani, Federica. “Vers une méthodologie pour l’extraction et la classification automatiques
des collocations terminologiques verbales en langue médicale.” Phraséologie et terminologie, 480,
259, (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Baroni</surname>
          </string-name>
          , Marco, and
          <string-name>
            <surname>Bernardini</surname>
          </string-name>
          , Silvia. BootCaT:
          <article-title>Bootstrapping Corpora and Terms from the Web</article-title>
          .
          <source>Proceedings of LREC</source>
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Buendía-Castro</surname>
            ,
            <given-names>Miriam</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Beatriz</given-names>
            <surname>Sánchez-Cárdenas</surname>
          </string-name>
          .
          <article-title>“Using Argument Structure to Disambiguate Verb Meaning.” In Proceedings of the XVII EURALEX International Congress</article-title>
          , eds. T. Margalitadze and
          <string-name>
            <given-names>G.</given-names>
            <surname>Meladze</surname>
          </string-name>
          , Tbilisi: Ivane Javakhishvili Tbilisi University Press,
          <fpage>482</fpage>
          -
          <lpage>490</lpage>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Buendía-Castro</surname>
            ,
            <given-names>Miriam.</given-names>
          </string-name>
          <article-title>Phraseology in Specialized Language and its Representation in Environmental Knowledge Resources</article-title>
          .
          <source>PhD Thesis</source>
          , Universidad de Granada, (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Corpas Pastor</surname>
            ,
            <given-names>Gloria</given-names>
          </string-name>
          .
          <article-title>Investigar con corpus en traducción: los retos de un nuevo paradigma</article-title>
          (Vol.
          <volume>49</volume>
          ). Peter Lang, (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Faber</surname>
          </string-name>
          , Pamela, ed.
          <source>A Cognitive Linguistics View of Terminology and Specialized Language</source>
          . Vol.
          <volume>20</volume>
          , Walter de Gruyter, (
          <year>2012</year>
          ). https://doi.org/10.1515/9783110277203
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Granger</surname>
          </string-name>
          , Sylviane, and Fanny Meunier, eds.
          <source>Phraseology: An Interdisciplinary Perspective</source>
          . John Benjamins Publishing, (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Jacques</surname>
            ,
            <given-names>Marie-Paule</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>Agnès</given-names>
            <surname>Tutin</surname>
          </string-name>
          .
          <article-title>Lexique transversal et formules discursives des sciences humaines</article-title>
          . ISTE Group, (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>L'Homme</surname>
            ,
            <given-names>Marie-Claude</given-names>
          </string-name>
          . “Predicative Lexical Units in Terminology.” In Language Production, Cognition, and the Lexicon, eds.
          <string-name>
            <given-names>N.</given-names>
            <surname>Gala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rapp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bel-Enguix</surname>
          </string-name>
          , Berlin: Springer, pp.
          <fpage>75</fpage>
          -
          <lpage>93</lpage>
          , (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>