<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of ALexS 2020: First Workshop on Lexical Analysis at SEPLN.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jenny A. Ortiz-Zambrano</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arturo Montejo-Ráez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CEATIC. Universidad de Jaén</institution>
          ,
          <addr-line>Jaén</addr-line>
          ,
          <country>Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de Guayaquil</institution>
          ,
          <addr-line>Guayaquil</addr-line>
          ,
          <country country="EC">Ecuador</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In September 2020, the first edition of the ALexS workshop (Task on Lexical Analysis at SEPLN) was held in Málaga, Spain, as part of the second edition of IberLEF (Iberian Languages Evaluation Forum), which joined the efforts of the IberEval and TASS workshops. In this first edition, only one task was proposed: Complex Word Identification (CWI). More than seven teams joined the campaign, but only three of them finally submitted results and a description of their systems. The difficulty of the task, due to the lack of labeled data available to participants, has prompted interesting approaches that tackle the CWI problem in an unsupervised or semi-supervised way. This paper summarizes the approaches and the results of the systems submitted by the different teams.</p>
      </abstract>
      <kwd-group>
        <kwd>Lexical Analysis</kwd>
        <kwd>Complex Word Identification</kwd>
        <kwd>Text Simplification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A corpus of transcriptions of lectures given at the University of Guayaquil (Ecuador), the
VYTEDU-CW corpus [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], was recently created. This resource can be used to test complex word
identification systems configured for an educational setting.
      </p>
      <p>Seven research teams submitted several classification results to Subtask 1, and four teams
submitted to Subtask 2. The submitted systems are in line with the state of the art in similar
workshops: the participants developed classification systems based on Recurrent Neural
Networks, Transformer Networks and fine-tuned models built upon BERT [?]. The details of
the submitted systems are described in Sections ?? and ??.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Complex Word Identification Task</title>
      <p>
        Complex word identification (CWI) is a common task within the more general task of text
simplification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Indeed, in order to perform lexical simplification, words considered difficult for
the reader have to be identified first. CWI is of interest in areas like second language acquisition
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or reading comprehension [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for people with certain disabilities. As stated before, only
two shared tasks have tackled the problem in recent years, the CWI Shared Tasks at SemEval
2016 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and NAACL-HLT 2018 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The aim of these tasks was to contribute to the advance
of methods and techniques for effective complex word identification, as the substitution of
complex words improves the understandability of a given text for the reader (thus
yielding a better readability level). Many areas are interested in this task,
and research towards this goal remains active.
      </p>
      <p>The objective is to mark those words that can be considered complex, in the sense of being difficult
for the reader to comprehend. The corpus used in this workshop is the VYTEDU-CW corpus.
There are some interesting challenges in this task compared to other CWI tasks:</p>
      <list list-type="bullet">
        <list-item><p>Difficult terms have to be judged within the scope of academic content. That is, many technical
terms may need to be disregarded, as they are commonly used in the domain.</p></list-item>
        <list-item><p>There are several domains corresponding to different degrees, so the system has to adapt
to them.</p></list-item>
        <list-item><p>No training data will be released, only dev data for adjusting systems to file formats.</p></list-item>
      </list>
      <p>Therefore, only unsupervised or semi-supervised approaches are applicable.</p>
      <p>Next, more details on the VYTEDU-CW corpus are given.</p>
      <sec id="sec-2-1">
        <title>2.1. The VYTEDU-CW corpus</title>
        <p>An ad hoc corpus has been created as a variant of the original VYTEDU-CW corpus [<xref ref-type="bibr" rid="ref4">4</xref>]. The
collection contains 55 texts, which correspond to transcripts of academic videos in Spanish recorded
in the classrooms of the different degree programs of the University of Guayaquil. The VYTEDU-CW
(Videos and transcripts in the educational field - Complex words) corpus contains more
than 1,200 words per transcription on average and a total of 9,175 distinct words. Its data
set contains 723 words annotated as complex (difficult) terms that are present in the different
documents. These difficult words were identified and labeled by 430 annotators (students); 250
students tagged words that other users had not selected, that is, words that did not match those
annotated by other students.</p>
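        <p>The VYTEDU-CW data set is made up of seven fields, including:</p>
        <list list-type="bullet">
          <list-item><p>The word identified and labeled as complex.</p></list-item>
          <list-item><p>The student’s identification.</p></list-item>
          <list-item><p>The name of the document read.</p></list-item>
          <list-item><p>The initial position of the difficult word in the text.</p></list-item>
          <list-item><p>The length of the word in characters.</p></list-item>
          <list-item><p>The date and time of the creation of the annotation.</p></list-item>
        </list>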
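        <p>As an illustration only, the fields listed above could be represented as a small record type. The following Python sketch is not the actual file format of the corpus: the class name, the attribute names, the 0-based offset convention and the sample values are all assumptions, and only the six fields enumerated above are modeled.</p>
        <preformat preformat-type="code">
```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ComplexWordAnnotation:
    """One student annotation; all attribute names are hypothetical."""
    word: str             # the word labeled as complex
    student_id: str       # identifier of the annotating student
    document: str         # name of the transcript that was read
    start: int            # 0-based character offset of the word (assumed convention)
    length: int           # length of the word in characters
    created_at: datetime  # date and time of the annotation

    def span(self, text: str) -> str:
        """Return the characters of `text` covered by this annotation."""
        return text[self.start:self.start + self.length]


# Made-up sample sentence and annotation, for illustration only.
text = "La homogeneidad del grupo es importante."
annotation = ComplexWordAnnotation(
    word="homogeneidad",
    student_id="s001",
    document="law_01.txt",
    start=3,
    length=12,
    created_at=datetime(2019, 5, 20, 10, 30),
)
```
        </preformat>
        <p>Storing the position and the length alongside the word makes it possible to check that an annotation still points at the intended characters of the transcript, for instance by comparing the result of the hypothetical span method with the stored word.</p>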
        <p>The words labeled by the students include names of characters, abbreviations,
sophisticated terminology and technical words used by teachers in their classes, nominal verbs,
long words, unusual words that are difficult to pronounce, compound words, and proverbs used to
illustrate examples, among others. A further problem is that teachers propose examples in class
using words that do not belong to the students’ level of education or specialization.</p>
        <p>Long words and words that are difficult to pronounce are one example of what students
found hard to understand: in the “Law” degree, the students labeled words such as
interculturality, methodological, and homogeneity.</p>
        <p>Abbreviations are another barrier to the understanding of students. For example, in the
“Networking” degree the abbreviations LAN, GEAR IPAN, and MBPS were labeled. In the “Research
Unit”, the students also identified the abbreviation ISBN as difficult.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The task: complex word identification</title>
        <p>The task consists of tagging words (or multi-word expressions) from lecture transcriptions that could
be considered difficult for students to understand. Annotations in the corpus were made by
students at the same academic level as the annotated lecture. Therefore, we are facing a
scenario different from the one CWI researchers are used to. Besides, neither training nor
development sets have been released; only the total number of annotations in the corpus is
revealed to participants. This task thus encourages the exploration of unsupervised approaches to
complex word identification.</p>
        <p>As this is a first attempt at academic/educational CWI, systems are evaluated with the
classical Precision, Recall and F-score measures.</p>
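        <p>For reference, these measures can be computed over sets of predicted and gold annotations as in the following minimal Python sketch. How the official evaluation matches predictions to annotations is not described here, so representing each item as a (document, word) pair is an assumption, and the function name and the toy data are hypothetical.</p>
        <preformat preformat-type="code">
```python
def precision_recall_f1(predicted, gold):
    """Compute Precision, Recall and F-score over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted.intersection(gold))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: items are (document, word) pairs; the values are made up.
gold = {("doc1", "isbn"), ("doc1", "homogeneidad"), ("doc2", "lan")}
predicted = {("doc1", "isbn"), ("doc2", "lan"), ("doc2", "red"), ("doc2", "cable")}
p, r, f = precision_recall_f1(predicted, gold)
```
        </preformat>
        <p>In this toy case two of the four predictions are correct and two of the three gold items are found, so precision is 2/4, recall is 2/3 and the F-score is their harmonic mean, 4/7.</p>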
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Systems presented</title>
      <p>Three teams presented their systems and results for this first task. Their approaches are
summarized below.</p>
      <sec id="sec-3-1">
        <title>3.1. UDLAP participation</title>
        <p>The UDLAP team is composed of only one member: Antonio Rico-Sulayes, from Universidad
de las Américas in Mexico. The approach proposed is a pipeline of several filters [<xref ref-type="bibr" rid="ref8">8</xref>]:</p>
        <list list-type="order">
          <list-item><p>A general lexicon (CREA [<xref ref-type="bibr" rid="ref9">9</xref>]) that has been extended with proper names and verb
conjugations.</p></list-item>
          <list-item><p>A specialized lexicon of Internet-related terms.</p></list-item>
          <list-item><p>A filter on frequent n-grams (extracted from the general lexicon).</p></list-item>
          <list-item><p>A filter based on normalized frequency over corpus documents.</p></list-item>
        </list>
        <p>The thresholds and parameters of each module were selected to produce a final expected
quantity of complex-word candidates as close as possible to that exhibited by the corpus, as this
number is known by participants.</p>
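        <p>The idea of tuning a threshold so that the number of candidates approaches a known target can be sketched as follows. This Python example is only an illustration under stated assumptions and not the UDLAP implementation: it uses a single frequency-like score per word, and the function names, the scores and the target count are hypothetical.</p>
        <preformat preformat-type="code">
```python
def pick_threshold(scores, target_count):
    """Return the threshold whose candidate count is closest to target_count.

    `scores` maps each word to a frequency-like score; words scoring at or
    below the threshold are kept as complex-word candidates.
    """
    best_threshold, best_gap = None, None
    for threshold in sorted(set(scores.values())):
        count = sum(1 for s in scores.values() if threshold >= s)
        gap = abs(count - target_count)
        # Strict comparison: on ties the lowest threshold is kept.
        if best_gap is None or best_gap > gap:
            best_threshold, best_gap = threshold, gap
    return best_threshold


def candidates(scores, threshold):
    """Words whose score does not exceed the threshold, sorted alphabetically."""
    return sorted(w for w, s in scores.items() if threshold >= s)


# Made-up normalized frequencies; rarer words get lower scores.
scores = {"de": 0.95, "red": 0.40, "protocolo": 0.12, "isbn": 0.02, "mbps": 0.01}
threshold = pick_threshold(scores, target_count=2)
selected = candidates(scores, threshold)
```
        </preformat>
        <p>With a target of two candidates, the sketch settles on the threshold 0.02 and keeps the two rarest words of the toy vocabulary. A real pipeline would tune several such parameters jointly, one per filter.</p>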
        <p>This system reached the highest macro F1 score, the highest macro precision
and the highest recall among all runs submitted to ALexS by the different participants.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Vicomtech participation</title>
        <p>The Vicomtech team is composed of four members, belonging to the SNLT group at Vicomtech
Foundation, from San Sebastián, Spain.</p>
        <p>The approach followed by these participants [<xref ref-type="bibr" rid="ref10">10</xref>] is the only one that takes into account the
different categories or domains that can be identified in the VYTEDU-CW corpus, which are
related to the several academic learning profiles (degrees) from which the texts were generated.
Each word is represented by seven word-level features (lemma length, lemma frequency
in subject documents, number of synsets in WordNet, lemma frequency in domain corpora,
lemma probability in domain corpora, word frequency in Wikipedia and word probability in
Wikipedia). Then, within each domain, a clustering process is carried out. The parameters of this
unsupervised step have been taken from other complex-word detection tasks. Finally, for each
domain, different groups of words are obtained, and the group with the lowest
average feature value is considered complex. K-Means with the seven features showed the best Recall, Accuracy
and G-score (harmonic mean between Accuracy and Recall).</p>
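        <p>The clustering step described above can be illustrated with a small, self-contained Python sketch. It is not the Vicomtech system: it implements a plain version of K-Means with a deterministic initialization, uses two made-up normalized features instead of the seven listed above, and fixes the number of clusters to two; all names and values are hypothetical.</p>
        <preformat preformat-type="code">
```python
from statistics import mean


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [mean(dim) for dim in zip(*members)]
    return labels, centroids


def complex_cluster(centroids):
    """Index of the cluster whose centroid has the lowest average feature value."""
    return min(range(len(centroids)), key=lambda c: mean(centroids[c]))


# Toy data: two made-up, already normalized features per word
# (for example a domain frequency and a general-corpus frequency).
words = ["de", "isbn", "clase", "mbps", "red"]
features = [[0.9, 0.8], [0.1, 0.05], [0.7, 0.9], [0.05, 0.1], [0.8, 0.7]]
labels, centroids = kmeans(features, k=2)
target = complex_cluster(centroids)
complex_words = [w for w, label in zip(words, labels) if label == target]
```
        </preformat>
        <p>On this toy input the low-frequency words end up in the same cluster, which is the one selected as complex. In a per-domain setting the same procedure would simply be repeated on the words of each domain.</p>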
      </sec>
      <sec id="sec-3-3">
        <title>3.3. HULAT participation</title>
        <p>The third team participating in ALexS 2020 is another Spanish group of researchers. Its three
authors are from the Computer Science Department of Universidad Carlos III (Madrid, Spain).</p>
        <p>The solution proposed for solving the CWI task is a supervised one [<xref ref-type="bibr" rid="ref11">11</xref>], with an architecture
encoding each word using different features: word length, a boolean determining whether
only capital letters are used, a boolean determining its inclusion in an easy-to-read lexicon,
Word2Vec vectors and BERT vectors (these last two using pretrained models).</p>
        <p>Finally, all these features are concatenated and fed into an SVM classifier, which was trained
and fine-tuned using the Spanish partition of the BEA Workshop 2018 CWI task dataset.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Table 1 shows the results obtained on the ALexS task for complex word identification on the
VYTEDU-CW dataset. The UDLAP team obtained the overall best results in terms of F-score (0.2725)
and precision (0.3415). The Vicomtech system performed well on recall (0.6881), but at the price of a
very low precision. Although up to three different runs were allowed, only UDLAP and
Vicomtech submitted three runs, while the HULAT team submitted only one. As
organizers, we expected more participants to submit their results, since more than 10 teams
registered at the beginning of the campaign. This could be due to the COVID-19 pandemic, which
is affecting every aspect of our lives. In any case, the systems proposed were different enough to
draw interesting conclusions from such a small number of participants.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The results obtained were poor, although a significant recall value was obtained. The authors
agree on the negative effect of training on one corpus and transferring the learned model to a
different domain. It was clear that the corpus and the task were very challenging, as participants had
no clue about which types of words could be considered complex, no information on their rate in the
corpus, and no examples to gain some insight. Nevertheless, the description of the task stated that there
were different domains, and that the complex words to be identified should adapt to each domain.
That is, a word like “stock” could be considered complex in History, but not in Economics.</p>
      <p>Two of the proposed systems generated several word features, which were fed into
supervised (SVM) or unsupervised (clustering) algorithms to determine whether a word
was complex or not. The other system relied purely on lexicon-based filtering according to
the probability (i.e. frequency) of appearance. A promising solution could be a hybrid system that
combines all those features (including the frequency/probability over different lexicons)
in an unsupervised algorithm that needs only a threshold parameter to fine-tune
the identification of the class of complex words.</p>
      <p>If this task continues next year, we expect to extend the corpus and provide a
training subset to participants. Also, following the design of the CWI task in SemEval 2021, we could
consider isolated words versus multi-words as two different subtasks. Overall, more research on
CWI for Spanish is needed in order to improve future text simplification systems.</p>
      <p>This work has been partially supported by a grant from the Spanish Government under the
LIVING-LANG project (RTI2018-094653-B-C21). Eugenio Martínez Cámara was supported by
the Spanish Government Programme Juan de la Cierva Formación (FJCI-2016-28353).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>IberLEF 2020 web site</article-title>
          , http://sepln2020.sepln.org/index.php/iberlef/. Accessed: 2020-06-30.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Paetzold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <article-title>SemEval 2016 task 11: Complex word identification</article-title>
          , in:
          <source>Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          ,
          <publisher-loc>San Diego, California</publisher-loc>
          ,
          <year>2016</year>
          , pp.
          <fpage>560</fpage>
          -
          <lpage>569</lpage>
          . URL: https://www.aclweb.org/anthology/S16-1085. doi:10.18653/v1/S16-1085.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yimam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paetzold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Štajner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <article-title>A report on the complex word identification shared task 2018</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , Association for Computational Linguistics, New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>78</lpage>
          . URL: https://www.aclweb.org/anthology/W18-0507. doi:10.18653/v1/W18-0507.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Ortiz-Zambrano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montejo-Ráez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. N.</given-names>
            <surname>Lino-Castillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. R.</given-names>
            <surname>González-Mendoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Cañizales-Perdomo</surname>
          </string-name>
          ,
          <article-title>VYTEDU-CW: Difficult words as a barrier in the reading comprehension of university students</article-title>
          ,
          <source>Advances in Emerging Trends and Technologies: Volume 1 1066</source>
          (
          <year>2019</year>
          )
          <fpage>167</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          , Automatic text simplification,
          <source>Synthesis Lectures on Human Language Technologies</source>
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.-M.</given-names>
            <surname>Botarleanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dascalu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Crossley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <article-title>Sequence-to-sequence models for automated text simplification</article-title>
          , in:
          <string-name>
            <given-names>I. I.</given-names>
            <surname>Bittencourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cukurova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Muldner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Luckin</surname>
          </string-name>
          , E. Millán (Eds.),
          <source>Artificial Intelligence in Education</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rochford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Djamasbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <article-title>Automatic text simplification for people with intellectual disabilities</article-title>
          ,
          <source>Artificial Intelligence Science and Technology</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rico-Sulayes</surname>
          </string-name>
          ,
          <article-title>General lexicon-based complex word identification extended with stem n-grams and morphological engines</article-title>
          ,
          in:
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <publisher-name>CEUR-WS</publisher-name>
          ,
          <publisher-loc>Málaga, Spain</publisher-loc>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <article-title>Corpus de Referencia del Español Actual (CREA) - Listado de frecuencias</article-title>
          , http://corpus.rae.es/lfrecuencias.html. Accessed: 2020-07-30.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zotova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cuadros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Pablos</surname>
          </string-name>
          ,
          <article-title>Vicomtech at ALexS 2020: Unsupervised complex word identification based on domain frequency</article-title>
          , in:
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <publisher-name>CEUR-WS</publisher-name>
          ,
          <publisher-loc>Málaga, Spain</publisher-loc>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alarcón</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martínez</surname>
          </string-name>
          ,
          <article-title>Hulat - ALexS CWI task - CWI for language and learning disabilities applied to university educational texts</article-title>
          , in:
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020)</source>
          ,
          <publisher-name>CEUR-WS</publisher-name>
          ,
          <publisher-loc>Málaga, Spain</publisher-loc>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>