=Paper= {{Paper |id=Vol-1382/paper9 |storemode=property |title=Rule-based Location Extraction from Italian Unstructured Text |pdfUrl=https://ceur-ws.org/Vol-1382/paper9.pdf |volume=Vol-1382 |dblpUrl=https://dblp.org/rec/conf/woa/CarusoGMPT15 }} ==Rule-based Location Extraction from Italian Unstructured Text== https://ceur-ws.org/Vol-1382/paper9.pdf
     Proc. of the 16th Workshop “From Object to Agents” (WOA15)                                                             June 17-19, Naples, Italy



              Rule-based location extraction from Italian
                           unstructured text
                    Daniele Caruso, Rosario Giunta, Dario Messina, Giuseppe Pappalardo, Emiliano Tramontana
                                        Department of Mathematics and Computer Science
                                                    University of Catania, Italy

                                              Email: {giunta, pappalardo, tramontana}@dmi.unict.it


Abstract: Named entity recognition is a wide research topic concerned with the extraction of information from unlabelled texts. Existing approaches mainly deal with the English language; in this paper we present the results of a novel approach specifically tailored to the Italian language. The approach is directed at recognising location names in unstructured texts by means of several agents based on rules devised for the Italian grammar. Preliminary results show an F1 score up to 0.67.

Keywords: Information Extraction, Named entity recognition, Free Text, Natural Language Processing, Italian Language.

I. INTRODUCTION

Huge amounts of text data are easily available on the World Wide Web. Unfortunately, the great majority of such texts is in the form of unstructured or semi-structured text. This makes it difficult for both human beings and machines to make good use of the content of such texts. Information Extraction is concerned with the process of structuring existing texts (both semi-structured and free) so as to single out some parts of text and have them accessed directly by some existing postprocessors [4].

A comprehensive survey of existing approaches [22] shows how the Information Extraction community has evolved since the seminal approaches of the early ’90s, e.g. automatic learning of rules to extract entities [1], maximum entropy models [17], Conditional Random Fields [11], etc. Many, if not all, of these approaches are tested, or developed, on the English language. Moreover, specific analysers have been developed to embed security checks on software programs [10], discover structural properties [3], [12], [14], [18], [19], [23], [24], and perform automatic transformation of programs [2].

We are especially concerned with the problem of named entity extraction from free texts in the Italian language; in particular, we are interested in the extraction of location names, i.e. proper nouns of places. Free texts can be of any kind, ranging from dialogues in a movie to fiction prose, thus enacting different constraints. However, in general, a location name is assumed to be written with a capital letter, and common nouns can thus be considered location names, especially in casual speech, e.g. in Vediamoci in Dipartimento (Let’s meet at the Department)¹, where the said Department is shared knowledge between the speakers.

¹ We have decided to use both the original Italian and the translated version of any processed text we show, in order to allow a better appreciation of the proposed approach.

Unlike machine learning approaches, both unsupervised and supervised, we propose a rule-based approach built from simple grammar rules of the Italian language complemented by a dictionary. The extraction of location names is pursued by means of several specialised agents, each performing an elaboration step and connected in the pipe and filter style (see Figure 1), i.e. the application of a rule removes the bulk of the candidate words, which later have to pass a further screening based, essentially, on a variant of a dictionary comparison.

The text is pre-filtered to remove punctuation symbols and then split into sentences. Each sentence is analysed by up to three rules (see Section III) so as to identify word candidates, which are finally combined and filtered to remove false positives. The devised rules are typical Italian language patterns, identifying general contexts where a location can be found; thus the rules are not a simple filtering of words from an existing dictionary.

Preliminary results of the algorithm are encouraging: precision goes up to 0.82 and recall up to 0.92, while the comprehensive F1 score goes up to 0.67.

II. PHASE 1: PRELIMINARIES

The approach and the corresponding tool we have developed work on simple text files, i.e. a web page can be pre-processed beforehand by one of the many converters available to remove HTML tags.

For the devised rules, we make use of an especially compiled Italian lexicon, containing the following classes of words:
   • Articles. A list of definite articles, e.g. il (the).
   • Prepositions. Both kinds (semplice (simple) and articolata (composite)) but excluding con (with), as it is not used when naming places.
   • Verbs. A subset of verbs related to places, such as andare (to go), mandare (to send), partire (to leave), passeggiare (to (take a) walk).
   • Descriptors. A list of adverbs frequently related to a place, such as dentro (inside), vicino (near).
   • Non-places. Words of various kinds (verbs, adverbs, nouns, etc.) not related to places, but that can appear in grammar structures (defined by the rules we set) as if they were places. E.g. acido (sour), dormire (to sleep). As






     different words may be incorrectly identified as places, the approach assists users in the customisation of the set by incorporating additional words, so as to exclude them from (and thus refine) future results.

Table I shows the sample lists of words used in these categories.

 Verbs                      Descriptors             Non-places
 abitare (to dwell)         avanti (in front of)    altrimenti (else)
 camminare (to walk)        dietro (rear)           decimo (tenth)
 entrare (to get/come in)   fianco (side)           allora (then)
 uscire (to get out)        dentro (inside)         filosofo (philosopher)
 salire (to go up)          fuori (outside)         molto (much)
 viaggiare (to travel)      vicino (near)           scrivere (to write)
 partire (to leave)         direzione (direction)   camminare (to walk)
 andare (to go)             esterno (outer)         distrarre (to distract)
 indirizzare (to address)   interno (inner)         florido (prosperous)
 raggiungere (to reach)     lontano (away)          bere (to drink)
 risiedere (to inhabit)     sinistra (left)         visitare (to visit)
 visitare (to visit)        destra (right)          ognuno (everyone)
 imboccare (to access)      adiacente (adjacent)    cremisi (crimson)
 arrivare (to arrive)       vicinanza (proximity)   nostro (ours)
 svoltare (to turn)         ingresso (entrance)     riempire (to fill)
 tornare (to go back)       uscita (exit)           durante (while)
 fermare (to stop)          dirimpetto (opposite)   esso (it)
 giungere (to arrive)       attiguo (adjacent)      piatto (flat)
 parcheggiare (to park)                             lucente (shining)
 emigrare (to emigrate)                             spostare (to move)
 decollare (to take off)                            capace (capable)

 TABLE I: SAMPLE VERBS, DESCRIPTORS AND NON-PLACES WORDS

In the following sections, such sets will be named after their initial, e.g. we will refer to V as the set of verbs.

Given a text T (read from an input file), the first step is to separate sentences, based on standard Italian grammar rules. T is split at occurrences of one of the symbols in the set of sentence-end punctuation marks, i.e. {full-stop, ellipsis, exclamation-mark, question-mark}; all the other punctuation types are removed before the text is processed by the next agents, obtaining a list of sentences. Any other non-letter symbol is ignored, e.g. dollar sign, percent sign, etc.

Each sentence in the input text is further segmented in order to find words. This is accomplished by using the space character as word separator, and it applies to any rule we describe. The words found within a sentence are then compared with the entries in the lexicons, according to the different rules described in the next sections.

[Fig. 1: The agents implementing the pipe and filter model. Phase 1: sentence splitting. Phase 2: Rule 1, Rule 2, Rule 3. Phase 3: Filter0 (capital letters), Filter1 (non-places), Filter2 (verbs).]

III. PHASE 2: RULE-BASED EXTRACTION

We defined three finite state automata to implement three grammar cases possibly implying the use of a place in the accepting state of the automaton. Each rule identifies a different sentence pattern. The rules are applied at the sentence level, i.e. on a list of words terminated by a punctuation symbol, obtained in phase 1. The tokens (words) are fed to the automaton and, if an accepting state is reached, the current token is marked as a location candidate. If no accepting state is reached, no candidate is produced. When a candidate is found and the sentence still contains more words, the automaton restarts from its initial state using the token after the candidate, proceeding until the sentence ends.

The devised rules are independent from one another, so they can be parallelised by running as different agents, e.g. on a multicore machine or on different machines coordinated in a Cloud fashion.

The result of each rule application is a list of candidate words; such words are used as input for the next phase (see Section IV) for the definitive labelling. Different rules may yield different candidates. A way to use all the said rules is then to combine them, hence the candidate words passing the rule filter(s) will be the union of the candidate words determined by each applied rule (see Section V).

A. Rule 1: Da Roma

The first rule, translating “from Rome”, is used to identify possible candidate words as a location, and is named, as the other rules, after a typical example of a (part of a) sentence in which a place can be identified.

The automaton (see Figure 2) scans the words (tokens) of a given sentence and remains in state 0 until a preposition (P) or an article (A) is found, which makes the automaton change its current state from 0 to 1. The state then remains unchanged unless a different kind of word is found in the next token: other articles or prepositions do not enable a state change, which is instead triggered by any other kind of word. The final state is reached when a candidate word for a place is found; however, many candidates will be discarded afterwards, as described in Section IV.

[Fig. 2: The FSM implementing Rule 1. States 0 (start), 1 and 2; transitions labelled “A or P” and “else”.]

As a single rule, this yields the highest number of false positives, as the use of an article or a preposition is very common in the Italian language.

B. Rule 2: Vicino a Roma

The second rule accommodates the mentioning of a place name in sentences such as the name of the rule suggests, “Near Rome”, as the presence of a Descriptor (see Section II)






is a strong indication that a place will be mentioned in the following text in the same sentence.

Figure 3 shows the Finite State Machine (FSM) used to find a candidate place. The automaton starts by reading the words one by one and does not change its initial state (0) until a descriptor is found, then it changes the current state to 1. From state 1 a transition can take place to state 2, when an article or a preposition is found, or directly to the accepting state 3 in any other case. From state 2 it is possible to return to state 1, if another descriptor is found, or to stay in the same state, if more articles or prepositions are found. Finally, the accepting state can be reached by reading any other kind of word.

[Fig. 3: The FSM implementing Rule 2. States 0 (start), 1, 2 and 3; transitions labelled “D”, “A or P” and “else”.]

The accepting state identifies a candidate word as a place, as Roma in the rule name.

C. Rule 3: Andando a Roma

While the previous rule uses descriptors as a way to identify a possible place name, this rule is concerned with verbs, such as in the rule name, “Going to Rome”.

The FSM implementing the rule is shown in Figure 4. The behaviour of the automaton is the same as that of Rule 2, except that a (possibly conjugated) verb is used instead of a descriptor. The verbs included are only verbs related to movement, and thus usually related to places, such as staying in a place or moving to and from a place. Since a verb can be found in a conjugated form, the check is performed using the Italian version of the stemming algorithm Snowball [21].

[Fig. 4: The FSM implementing Rule 3. States 0 (start), 1, 2 and 3; transitions labelled “V”, “A or P” and “else”.]

The automaton scans the tokens and remains in state 0 until a verb is found, at which point the current state is changed to state 1. The automaton may change from state 1 to state 2 (and vice versa) by reading a preposition or an article (or by reading a verb). Any other word will make the automaton change into the accepting state 3, i.e. pointing at the current word as a possible candidate for a place name.

IV. PHASE 3: NON-PLACE WORDS REMOVAL

The candidate words yielded by the application of a rule are further filtered before being labelled as place names. A candidate, to be considered a place name and thus evaluated as a positive result, has to pass the following filters.
   • Filter1: The candidate word is checked against the Non-places lexicon (N). If it exists in N, then it is regarded as a False Positive (FP) and hence discarded. E.g. in the sentence Andare alla capitale (To go to the capital city), Rule 3 will suggest capitale as a possible place; however, it is a common noun and thus it will be discarded.
   • Filter2: After passing the previous filter (Filter1), all the remaining candidate words are filtered to avoid identifying (conjugated) verbs as places. Once again, the check is performed using a stemming algorithm [21]. To check if a candidate word is a verb, it is stemmed and then concatenated with the three possible suffixes used in Italian verbs (-are, -ere and -ire) so as to get the infinitive form of the verb, which is then searched for in the Non-places lexicon. If the resulting verb appears in the lexicon, the candidate is discarded, as it is not a place name. E.g. in the sentence Ella esce camminando (She gets out walking), Rule 3 stays in state 0 for Ella, then goes into state 1 reading the verb uscire; however, the next token is neither an article nor a preposition, thus camminando is proposed as a place candidate. In this filtering, such a place candidate is recognised as a conjugation of the verb camminare and finally discarded. The remaining candidate words are promoted as results.

Any word passing the said filters is labelled as a place name; however, such a result may be a True Positive (TP) or a False Positive (FP).

Unstructured text may not reliably follow orthographic conventions, as a text could be a professionally proof-read book or an informal automatic transcription, thus it may or may not use capital letters for location names. In the approach as described so far, we did not make any assumption on such an orthographic convention; however, experimentally, we found better results when such a convention is satisfied, thus we also provide a further filter:
   • Filter0: If the candidate begins with a lower case character, it is not deemed a location name, while it is kept if it starts with a capital letter.

As the name suggests, this filter has to be applied before Filter1 and Filter2, as the user sees fit, based on the text to be processed.

V. DISCUSSION

In the next subsections we review the rules and how they relate to the actual grammar writings we examined, and then show the results of the labelling experiments made on different texts.
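As a compact reference for the discussion, the automata of Section III and Filter0 and Filter1 of Section IV can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool described in the paper: the lexicons are toy sets made up for illustration, the names run_rule1, run_rule and extract_locations are hypothetical, verbs are matched literally rather than through Snowball stemming, and Filter2 is omitted.

```python
from typing import List, Set

# Toy lexicons: illustrative assumptions, far smaller than the paper's lexicon.
ARTICLES: Set[str] = {"il", "lo", "la", "i", "gli", "le"}
PREPOSITIONS: Set[str] = {"a", "da", "di", "in", "per", "su", "alla", "verso"}
DESCRIPTORS: Set[str] = {"vicino", "dentro", "fuori", "lontano"}
# Conjugated forms are listed literally here; the paper stems them instead.
VERBS: Set[str] = {"andare", "andando", "partire", "entrare"}
NON_PLACES: Set[str] = {"capitale", "stanza", "camminare"}
LINKERS: Set[str] = ARTICLES | PREPOSITIONS  # the "A or P" transitions


def run_rule1(tokens: List[str]) -> List[str]:
    """Rule 1: the word following a run of articles/prepositions."""
    candidates, state = [], 0
    for tok in tokens:
        if tok.lower() in LINKERS:
            state = 1
        elif state == 1:
            candidates.append(tok)  # accepting state reached
            state = 0               # restart on the next token
    return candidates


def run_rule(tokens: List[str], triggers: Set[str]) -> List[str]:
    """Rules 2 and 3: same automaton, triggered by descriptors or verbs."""
    candidates, state = [], 0
    for tok in tokens:
        word = tok.lower()
        if state == 0:
            if word in triggers:
                state = 1
        elif state == 1:
            if word in LINKERS:
                state = 2
            else:
                candidates.append(tok)  # accepting state 3
                state = 0
        else:  # state 2
            if word in triggers:
                state = 1
            elif word not in LINKERS:
                candidates.append(tok)  # accepting state 3
                state = 0
    return candidates


def extract_locations(sentence: str, use_filter0: bool = True) -> List[str]:
    """Union of the three rules, followed by Filter0 (optional) and Filter1."""
    tokens = sentence.split()  # the space character is the word separator
    found = (run_rule1(tokens)
             + run_rule(tokens, DESCRIPTORS)
             + run_rule(tokens, VERBS))
    unique = list(dict.fromkeys(found))  # set union, first-seen order kept
    if use_filter0:
        unique = [c for c in unique if c[:1].isupper()]
    return [c for c in unique if c.lower() not in NON_PLACES]


print(extract_locations("Il processo che si svolge a Milano"))
print(extract_locations("Andare alla capitale", use_filter0=False))
```

With these toy lexicons, Rule 1 proposes both processo and Milano for the first sentence and Filter0 drops the lower-case candidate, while in the second sentence capitale is proposed and then discarded by Filter1.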





A. Rule’s Assessment

Real case examples. Table II shows several fragments of sentences recognised by our approach, for both TP and FP results, as specified in column 3. The words in italics are the tokens consumed by the automaton for the current rule, while the bold word is the one identified as a place.

 Rule  Sentence                                                      Result
 1     Il processo che si svolge a Milano                            TP
       (The trial taking place in Milano)
       I treni a lunga percorrenza per la Sicilia                    TP
       (Long distance trains to Sicily)
       Il vertice che si terrà oggi a Bruxelles                      TP
       (The meeting taking place in Bruxelles)
       Se il Ministro in indirizzo non intenda intervenire           FP
       (If the addressed Minister does not mean to intervene)
 2     Scappa verso il Canale                                        TP
       (Runs away towards the Channel)
       All’interno della Basilica Palladiana                         TP
       (Inside the Basilica Palladiana)
       Mi fanno sedere accanto a Carlo                               FP
       (They let me sit beside Carlo)
       Sono operative presso le DIGOS di tutto lo Stato              FP
       (They are operational in the DIGOS (offices) of the State)
 3     Passando per Piazza Del Popolo                                TP
       (Proceed through Piazza Del Popolo)
       La prima volta che vedo Palermo                               TP
       (The first time I see Palermo)
       Non ho più visto Carlo                                        FP
       (I have not seen Carlo)
       La soglia richiesta per entrare in Parlamento                 FP
       (The threshold required to get into the Parliament)

 TABLE II: SAMPLE RESULTS USING DIFFERENT RULES

Lexicons. As the rules are based on several lexicons, the completeness of such lexicons is essential for a good recognition. In the sentence “dentro la stanza” (inside the room), Rule 2 will propose “stanza” as a candidate, which should not be output as a result, as it is not a proper location name. It is the responsibility of the Non-places filter (Filter1, see Section IV) to recognise that the word is not a location; however, if “stanza” is not in the N lexicon, it will be selected and thus proposed as a TP while being an FP.

Repetitions. A simple observation of the FSMs shown in Figures 2 to 4 reveals that a traversal of the states may recognise sentences that are illicit for the Italian grammar, e.g. “Andando camminare per per Roma” (Going to walk to to Rome), in which the application of Rule 3 would propose “Roma” as a place name. Our preliminary studies show that ungrammatical sentences, such as the previous example, are not so frequent unless we factor in informal language, such as instant messaging or poetic prose/verses.

However, the same rules are capable of recognising ungrammatical sentences appearing in both formal and informal speech. A phrase such as “Andando a... a... a Roma” (Going to... to... to Rome) would make the FSM in Figure 4 reach an accepting state pointing at “Roma”, even if the sentence is not grammatically correct. As such a sentence can be typical in speech, e.g. when one speaks while recalling something, an automatic transcription may report such sentences, and thus we left the loops in the rules.

Sentence patterns. The rules we are proposing can be considered arbitrary, even if intuitively correct. Thus, before making the actual experiments in labelling, we studied the result of the application of the rules alone on a set of unstructured texts, so as to check if such grammar structures had the needed degree of responsiveness. I.e. we are interested in the possible paths any automaton may take, given real written texts and not just simple cases (such as the ones in the titles of subsections in Section III, which are correct but also very basic).

Rules have been tested on different kinds of text files, both prose and dialogue transcriptions, for a total of 1.2 million characters. The results are shown in Table III. In each line, the first column is the rule, the second is the sentence pattern found by the rule, and the third is the number of instances of the pattern found in the test corpus. The sentence pattern is identified by the transitions in the automaton, e.g. VPA (Verb, Preposition, Article) identifies a sentence such as Viaggiare per l’Italia (To travel in Italy), which is decomposed as Viaggiare[V] per[P] l’[A] Italia, where the words before Italia are catalogued respectively as [V]erb, [P]reposition and [A]rticle.

 Rule     Sentence Pattern   Occurrences
 Rule 1   A                         1076
          P                         4174
          AA                           1
          AP                           1
          PA                          56
          PP                           4
          APA                          1
 Rule 2   D                           58
          DA                          41
          DP                          73
          DADP                         1
          DPDP                         2
 Rule 3   V                           82
          VA                          26
          VP                         190
          VPA                          1
          VPV                          1
          VAVP                         2
          VPVP                         1

 TABLE III: DIFFERENT SENTENCE KINDS

Given a rule, a Sentence Pattern such as A is more general than any pattern having A as a suffix, e.g. PA. Thus, all the occurrences of PA form a subset of the occurrences of A. For the experiments (Section V-B) the automata are set to find the longest match.

The preliminary study reported in Table III shows just the number of occurrences for each sentence pattern; it does not show the percentage of TPs or FPs, as this is just a way to check the different transitions in the proposed automata.

B. Experiments

The rules detailed in Section III have been implemented in a tool and have been tested on different kinds of unstructured texts: (i) theatrical dialogue transcriptions (texts T2, T3, T4),





(ii) official stenographic transcriptions of political debates
(T5) and (iii) news articles (T1). In the latter two cases, the
transcriptions are properly capitalised, and thus Filter0 (see
Section IV) has been used in the experiments, while the other
texts were entirely in lower case, and thus only Filter1 and
Filter2 have been used in phase 3 (see Section IV).
   All the texts used for the experiments have been manually
labelled for location names. In the experiments, all the
combinations of the rules have been tested, as shown in the
second column of Table IV. E.g., Rule “2&3” means taking the
mathematical union of the set of candidates gathered by Rule 2
and the set of candidates gathered by Rule 3, and passing such a
union to the filtering agents in phase 3.
   The precision metric is computed as a correctness measure,
using the number of True Positives (TP) and False Positives (FP),
as $\frac{TP}{TP + FP}$, while the recall is computed as a
completeness metric, using the number of False Negatives (FN),
as $\frac{TP}{TP + FN}$. The F1 score gives the harmonic mean of
precision and recall.
   There are cases where a rule fails to identify any TP;
however, this is expected. When an input text references a place
name by means of, e.g., a motion verb, only Rule 3 is able to
recognise such a place, while Rule 2, concerned with the usage of
descriptors, will never be applied.
   The results show an interesting F1 score, going up to 0.67
with an average of 0.38. The precision metric goes up to 0.82
in the best case, with a minimum value of 0.27 and an average
of 0.45. The recall also shows good results, having a maximum
value of 0.92 and an average of 0.51.

   Text   Rules    TP    FP    FN     F1   Precision   Recall
   T1     1        39    49     9   0.57        0.44     0.81
          2         1     1    47   0.04        0.50     0.02
          3         0     3    48    n/a         n/a      n/a
          1&2      39    50     9   0.57        0.44     0.81
          1&3      39    52     9   0.56        0.43     0.81
          2&3       1     4    47   0.04        0.20     0.02
          1&2&3    39    53     9   0.56        0.42     0.81
   T2     1        33    49     6   0.55        0.40     0.85
          2         2     1    37   0.10        0.67     0.05
          3         9     2    30   0.36        0.82     0.23
          1&2      33    50     6   0.54        0.40     0.85
          1&3      34    51     5   0.55        0.40     0.87
          2&3      11     3    28   0.42        0.79     0.28
          1&2&3    34    52     5   0.54        0.40     0.87
   T3     1        13     7     6   0.67        0.65     0.68
          2         0     0    19    n/a         n/a      n/a
          3         3     2    16   0.25        0.60     0.16
          1&2      13     7     6   0.67        0.65     0.68
          1&3      14     9     5   0.67        0.61     0.74
          2&3       3     2    16   0.25        0.60     0.16
          1&2&3    14     9     5   0.67        0.61     0.74
   T4     1        56   136     5   0.44        0.29     0.92
          2         0     4    61    n/a         n/a      n/a
          3         9    15    52   0.21        0.38     0.15
          1&2      56   140     5   0.44        0.29     0.92
          1&3      56   151     5   0.42        0.27     0.92
          2&3       9    19    52   0.20        0.32     0.15
          1&2&3    56   155     5   0.41        0.27     0.92
   T5     1        74   190    84   0.35        0.28     0.47
          2         5     4   153   0.06        0.56     0.03
          3         4     8   154   0.05        0.33     0.03
          1&2      76   194    82   0.36        0.28     0.48
          1&3      75   198    83   0.35        0.27     0.47
          2&3       9    12   149   0.10        0.43     0.06
          1&2&3    77   202    81   0.35        0.28     0.49

                 TABLE IV. EXPERIMENTAL RESULTS
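
   The way the figures in Table IV are obtained can be sketched as
follows. This is a minimal illustration and not the tool used for
the experiments: the function names prf and evaluate are
hypothetical, candidates are assumed to be compared as distinct
strings against the manually labelled names, and the filtering
agents of phase 3 are left out. Metrics are reported as None when a
rule finds no TP, mirroring the n/a entries of the table.

```python
from typing import Optional, Set, Tuple

Score = Tuple[Optional[float], Optional[float], Optional[float]]


def prf(tp: int, fp: int, fn: int) -> Score:
    """Return (precision, recall, F1) from raw counts.

    All three values are None when no true positive was found,
    which corresponds to the n/a entries of Table IV.
    """
    if tp == 0:
        return None, None, None
    precision = tp / (tp + fp)  # correctness: TP / (TP + FP)
    recall = tp / (tp + fn)     # completeness: TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def evaluate(gold: Set[str], *rule_candidates: Set[str]) -> Score:
    """Score the union of the candidate sets of one or more rules.

    `gold` is the set of manually labelled location names
    (assumption: names are compared as distinct strings).
    """
    combined = set().union(*rule_candidates)  # e.g. Rule "2&3"
    tp = len(combined & gold)   # candidates that are real locations
    fp = len(combined - gold)   # candidates that are not locations
    fn = len(gold - combined)   # labelled locations that were missed
    return prf(tp, fp, fn)


if __name__ == "__main__":
    # Counts taken from Table IV, text T3, Rule 1.
    p, r, f = prf(13, 7, 6)
    print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

   Called with the counts of the T3, Rule 1 row (TP = 13, FP = 7,
FN = 6), prf yields values that round to the precision, recall and
F1 reported in that row.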
   While there are cases where very few location names
are identified, we deem such preliminary experiments worth
expanding, as one of the limitations is the small number of
labelled texts we have dealt with.

                     VI. RELATED WORK
   Information extraction has become a hot research topic,
especially since huge amounts of data became publicly
available. An excellent survey on Information Extraction is
[22], where the author reviews all the significant existing
approaches in great detail. While many different approaches
have been proposed, to the best of our knowledge little to no
effort has been put towards the Italian language.
   In [20] named entities are extracted and related to classified
newspaper advertisements (in French), using different
techniques. They make use of a lexicon to store already known
entities, thus once a word is found both in an advertisement and
in the lexicon it can be automatically tagged as the lexicon
suggests. They also use regular expressions for entities such
as telephone numbers. Finally, a word spotting algorithm is
used to compute a score for unrecognised words, based on the
context (i.e. other specialised lexicons). While we also make
use of a lexicon, we use it to exclude a candidate, after a rule
has yielded one. Recognising location names by means of a
lexicon holding all existing location names would be a trivial,
brute-force approach (homonymy aside); instead, the rules we
propose allow the discovery of names not already inserted in
a lexicon.
   The approach presented in [1] shows some similarities with
ours. The authors start with sample patterns containing named
entities, then identify actual instances of named entities; the
names found are then searched for to automatically identify new
patterns, and the process is iterated.
   A different approach has been proposed in [5], which tries to
identify named entities as short sequences of words by analysing
n-gram statistics obtained from Internet documents. Their Lex
method is a semi-supervised learning algorithm based on the
assumption that a sequence of capitalised words composes a
single name when such an n-gram appears statistically more
often than by simple chance.
   A data mining approach is presented in [25], especially
crafted for geographical names. The algorithm searches for
specific keywords and patterns manually constructed and
related to geographical names, such as island of or archipelago.
The results are used to train a classifier with respect to the
found instances of a pattern.

                     VII. CONCLUSIONS
   We have presented an algorithm devised specifically for the
Italian language, based on rules built upon its grammar. The
rules represent grammar patterns, implemented by finite state
machines, typically used in both written and spoken language;
thus several agents can be coordinated in a pipe and filter style
so that an unstructured input text is filtered by the rules to
obtain candidate places. Preliminary results are promising, as the
F1 score reaches a maximum of 0.67, whereas the highest
precision and recall are 0.82 and 0.92, respectively.
   As possible future work, we aim to connect with our
previous research in which we have proposed to improve the
modularity of a software system by letting classes assume
roles on some design patterns [6]–[9]. The work presented here
can foster an approach whereby the automatic processing of
the Italian language used for program comments can assist in
the selection of roles for classes. Moreover, semantic analysis
of text can take advantage of neural networks [15], and as
further work a possible approach would aim to recognise text
fragments using a soft computing approach [13], [16].

                        ACKNOWLEDGEMENT
  This work has been supported by project PRIME funded
within POR FESR Sicilia 2007-2013 framework.

                          REFERENCES
 [1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large
     plain-text collections. In Proceedings of ACM Conference on Digital
     Libraries (DL), pages 85–94, New York, NY, USA, 2000. ACM.
 [2] F. Bannò, D. Marletta, G. Pappalardo, and E. Tramontana. Tackling
     consistency issues for runtime updating distributed systems. In
     Proceedings of International Symposium on Parallel & Distributed
     Processing, Workshops and Phd Forum (IPDPSW), pages 1–8. IEEE, 2010.
 [3] A. Calvagna and E. Tramontana. Delivering dependable reusable
     components by expressing and enforcing design decisions. In
     Proceedings of Computer Software and Applications Conference
     (COMPSAC) Workshop QUORS, pages 493–498. IEEE, July 2013.
 [4] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of
     web information extraction systems. IEEE Trans. on Knowl. and Data
     Eng., 18(10):1411–1428, Oct. 2006.
 [5] D. Downey, M. Broadhead, and O. Etzioni. Locating complex named
     entities in web text. In Proceedings of International Joint Conference
     on Artificial Intelligence (IJCAI), pages 2733–2739. Morgan Kaufmann
     Publishers Inc., 2007.
 [6] R. Giunta, G. Pappalardo, and E. Tramontana. Using Aspects and
     Annotations to Separate Application Code from Design Patterns. In
     Proceedings of Symposium on Applied Computing (SAC). ACM, 2010.
 [7] R. Giunta, G. Pappalardo, and E. Tramontana. Aspects and annotations
     for controlling the roles application classes play for design patterns. In
     Proceedings of Asia Pacific Software Engineering Conference (APSEC).
     IEEE, 2011.
 [8] R. Giunta, G. Pappalardo, and E. Tramontana. AODP: refactoring code
     to provide advanced aspect-oriented modularization of design patterns.
     In Proceedings of Symposium on Applied Computing (SAC). ACM, 2012.
 [9] R. Giunta, G. Pappalardo, and E. Tramontana. Superimposing roles
     for design patterns into application classes by means of aspects. In
     Proceedings of Symposium on Applied Computing (SAC). ACM, 2012.
[10] R. Giunta, G. Pappalardo, and E. Tramontana. A redundancy-based
     attack detection technique for java card bytecode. In Proceedings of
     International WETICE Conference, pages 384–389. IEEE, 2014.
[11] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields:
     Probabilistic models for segmenting and labeling sequence data. 2001.
[12] M. Mongiovi, G. Giannone, A. Fornaia, G. Pappalardo, and
     E. Tramontana. Combining static and dynamic data flow analysis: a
     hybrid approach for detecting data leaks in Java applications. In
     Proceedings of Symposium on Applied Computing (SAC). ACM, 2015.
[13] C. Napoli, G. Pappalardo, and E. Tramontana. A hybrid neuro-wavelet
     predictor for qos control and stability. In Proceedings of AIxIA, volume
     8249 of LNCS, pages 527–538. Springer, 2013.
[14] C. Napoli, G. Pappalardo, and E. Tramontana. Using modularity metrics
     to assist move method refactoring of large systems. In Proceedings
     of Complex, Intelligent and Software Intensive Systems (CISIS). IEEE,
     2013.
[15] C. Napoli, G. Pappalardo, and E. Tramontana. An agent-driven
     semantical identifier using radial basis neural networks and
     reinforcement learning. In Proceedings of XV Workshop "Dagli Oggetti
     agli Agenti", volume 1260. CEUR-WS, 2014.
[16] C. Napoli, G. Pappalardo, and E. Tramontana. Improving files
     availability for bittorrent using a diffusion model. In Proceedings of
     International WETICE Conference, pages 191–196. IEEE, 2014.
[17] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy
     for text classification. In IJCAI workshop on machine learning for
     information filtering, volume 1, pages 61–67, 1999.
[18] G. Pappalardo and E. Tramontana. Automatically discovering design
     patterns and assessing concern separations for applications. In
     Proceedings of Symposium on Applied Computing (SAC). ACM, 2006.
[19] G. Pappalardo and E. Tramontana. Suggesting extract class refactoring
     opportunities by measuring strength of method interactions. In
     Proceedings of Asia Pacific Software Engineering Conference (APSEC),
     pages 105–110. IEEE, December 2013.
[20] R. A. Peleato, J.-C. Chappelier, and M. Rajman. Automated
     information extraction out of classified advertisements. In Natural
     Language Processing and Information Systems, pages 203–214.
     Springer, 2001.
[21] M. F. Porter. Snowball: A language for stemming algorithms, 2001.
     URL http://snowball.tartarus.org/texts/introduction.html, 2009.
[22] S. Sarawagi. Information extraction. Found. Trends databases,
     1(3):261–377, Mar. 2008.
[23] E. Tramontana. Automatically characterising components with concerns
     and reducing tangling. In Proceedings of Computer Software and
     Applications Conference (COMPSAC) workshop QUORS. IEEE, 2013.
[24] E. Tramontana. Detecting extra relationships for design patterns roles.
     In Proceedings of AsianPlop. March 2014.
[25] O. Uryupina. Semi-supervised learning of geographical gazetteers from
     the internet. In Proceedings of the HLT-NAACL 2003 Workshop on
     Analysis of Geographic References - Volume 1, pages 18–25. Association
     for Computational Linguistics, 2003.