=Paper=
{{Paper
|id=Vol-1382/paper9
|storemode=property
|title=Rule-based Location Extraction from Italian Unstructured Text
|pdfUrl=https://ceur-ws.org/Vol-1382/paper9.pdf
|volume=Vol-1382
|dblpUrl=https://dblp.org/rec/conf/woa/CarusoGMPT15
}}
==Rule-based Location Extraction from Italian Unstructured Text==
Proc. of the 16th Workshop "From Object to Agents" (WOA15), June 17-19, Naples, Italy

Daniele Caruso, Rosario Giunta, Dario Messina, Giuseppe Pappalardo, Emiliano Tramontana
Department of Mathematics and Computer Science, University of Catania, Italy
Email: {giunta, pappalardo, tramontana}@dmi.unict.it

Abstract: Named entity recognition is a wide research topic concerned with the extraction of information from unlabelled texts. Existing approaches mainly deal with the English language; in this paper we present the results of a novel approach specifically tailored to the Italian language. The approach is directed at recognising location names in unstructured texts by several agents based on rules devised for the Italian grammar. Preliminary results show an F1 score up to 0.67.

Keywords: Information Extraction, Named Entity Recognition, Free Text, Natural Language Processing, Italian Language.

I. INTRODUCTION

Huge amounts of text data are easily available on the World Wide Web. Unfortunately, the great majority of such texts is in the form of unstructured or semi-structured text. Such a reality makes it difficult for both human beings and machines to make good use of the content of such texts. Information Extraction is concerned with the process of structuring existing texts (both semi-structured and free) so as to single out some parts of text and have them accessed directly by some existing postprocessors [4].

A comprehensive survey of existing approaches [22] shows how the Information Extraction community has evolved from the seminal approaches of the early '90s, e.g. automatic learning of rules to extract entities [1], maximum entropy models [17], Conditional Random Fields [11], etc. Many, if not all, of these approaches are tested, or developed, on the English language. Moreover, specific analysers have been developed to embed security checks on software programs [10], discover structural properties [3], [12], [14], [18], [19], [23], [24], and perform automatic transformation of programs [2].

We are especially concerned with the problem of named entity extraction from free texts in the Italian language; in particular, we are interested in the extraction of location names, i.e. proper nouns of places. Free texts can be of any kind, ranging from dialogues in a movie to fiction prose, thus enacting different constraints; however, in general, a location name is assumed to be written with a capital letter, and common names can thus be considered location names, especially in casual speech, e.g. in Vediamoci in Dipartimento (Let's meet at the Department), where the said Department is a shared knowledge between speakers. (We have decided to use both the original Italian and the translated version of any processed text we show, in order to allow a better appreciation of the proposed approach.)

Unlike machine learning approaches, both unsupervised and supervised, we propose a rule-based approach built from simple grammar rules of the Italian language complemented by a dictionary. The process of location name extraction is pursued by means of several specialised agents, each performing an elaboration step and connected in the pipe and filter style (see Figure 1), i.e. the application of a rule removes the bulk of the words, leaving candidate words which later have to pass a further screening based, essentially, on a variant of a dictionary comparison.

The text is pre-filtered to remove punctuation symbols and then split into sentences. Each sentence is analysed by up to three rules (see Section III) so as to identify word candidates, finally combined to remove false positives. The devised rules are typical Italian language patterns, identifying general contexts where a location can be found; thus the rules are not a simple filtering of words from an existing dictionary.

Preliminary results of the algorithm are encouraging: precision goes up to 0.82 and recall up to 0.92, while the comprehensive F1 score goes up to 0.67.
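To make the pipe and filter arrangement of Figure 1 concrete, the following sketch (in Python, with names of our own choosing rather than those of the actual tool) chains the three phases; it is only an illustration of the architecture described above, not the paper's implementation.

```python
import re

def phase1_split(text):
    """Phase 1 (simplified here; Section II details the actual splitting)."""
    return [s.split() for s in re.split(r"[.!?]+", text) if s.strip()]

def phase2_rules(sentences, rules):
    """Phase 2 (sketch): every rule proposes candidate words; results are united."""
    candidates = set()
    for words in sentences:
        for rule in rules:
            candidates |= rule(words)
    return candidates

def phase3_filters(candidates, filters):
    """Phase 3 (sketch): keep only candidates that pass every filter."""
    return {word for word in candidates if all(f(word) for f in filters)}

def extract_locations(text, rules, filters):
    """The whole pipe and filter chain of Figure 1, as one function call."""
    return phase3_filters(phase2_rules(phase1_split(text), rules), filters)
```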
II. PHASE 1: PRELIMINARIES

The approach and corresponding tool we have developed work on simple text files, i.e. a web page can be pre-processed beforehand by one of the many converters available to remove HTML tags.

For the devised rules, we make use of an especially compiled Italian lexicon, containing the following classes of words:

• Articles. A list of definite articles, e.g. il (the).
• Prepositions. Both kinds (semplice (simple) and articolata (composite)), but excluding con (with), as it is not used when naming places.
• Verbs. A subset of verbs related with places, such as andare (to go), mandare (to send), partire (to leave), passeggiare (to (take a) walk).
• Descriptors. A list of adverbs frequently related to a place, such as dentro (inside), vicino (near).
• Non-places. Words of various kinds (verbs, adverbs, nouns, etc.) not related to places, but that can appear in grammar structures (defined by the rules we set) as if they were places, e.g. acido (sour), dormire (to sleep). As different words may be incorrectly identified as places, the approach assists users in the customisation of the set, by incorporating additional words, so as to exclude (refine) future results.

Table I shows sample lists of words used in these categories. In the following sections, such sets will be named after their initial, e.g. we will talk of V as the set of verbs.

Table I: Sample verbs, descriptors and non-places words

Verbs | Descriptors | Non-places
abitare (to dwell) | avanti (in front of) | altrimenti (else)
camminare (to walk) | dietro (rear) | decimo (tenth)
entrare (to get/come in) | fianco (side) | allora (then)
uscire (to get out) | dentro (inside) | filosofo (philosopher)
salire (to go up) | fuori (outside) | molto (much)
viaggiare (to travel) | vicino (near) | scrivere (to write)
partire (to leave) | direzione (direction) | camminare (to walk)
andare (to go) | esterno (outer) | distrarre (to distract)
indirizzare (to address) | interno (inner) | florido (prosperous)
raggiungere (to reach) | lontano (away) | bere (to drink)
risiedere (to inhabit) | sinistra (left) | visitare (to visit)
visitare (to visit) | destra (right) | ognuno (everyone)
imboccare (to access) | adiacente (adjacent) | cremisi (crimson)
arrivare (to arrive) | vicinanza (proximity) | nostro (ours)
svoltare (to turn) | ingresso (entrance) | riempire (to fill)
tornare (to go back) | uscita (exit) | durante (while)
fermare (to stop) | dirimpetto (opposite) | esso (it)
giungere (to arrive) | attiguo (adjacent) | piatto (flat)
parcheggiare (to park) | | lucente (shining)
emigrare (to emigrate) | | spostare (to move)
decollare (to take off) | | capace (capable)

Given a text T (read from an input file), the first step is to separate sentences, based on standard Italian grammar rules. T is split at occurrences of one of the symbols in the set of sentence-end punctuation marks, i.e. {full stop, ellipsis, exclamation mark, question mark}; all the other punctuation types are removed in order to be processed by the next agents, obtaining a list of sentences. Any other non-letter symbol is ignored as well, e.g. dollar sign, percent sign, etc.

Each sentence in the input text is further segmented in order to find words; this is accomplished by using the space character as word separator, and applies to any rule we describe. The words found within a sentence are then compared with the entries in the lexicons, according to the different rules described in the next sections.
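As a rough illustration of this splitting step, the sketch below approximates it in Python; the regular expressions are our own reading of the description above, not the tool's actual code.

```python
import re

# Sentence-end punctuation: full stop, ellipsis, exclamation and question mark.
SENTENCE_END = re.compile(r"(?:\.\.\.|[.!?\u2026])+")

def split_text(text):
    """Phase 1: return a list of sentences, each sentence a list of words.

    Other punctuation and non-letter symbols (commas, dollar or percent
    signs, ...) are dropped; the space character separates words.
    """
    sentences = []
    for raw in SENTENCE_END.split(text):
        cleaned = re.sub(r"[^\w\s']+", " ", raw)   # drop remaining punctuation
        words = cleaned.split()
        if words:
            sentences.append(words)
    return sentences

# split_text("Vediamoci in Dipartimento! Poi andiamo a Roma, va bene?")
# -> [['Vediamoci', 'in', 'Dipartimento'],
#     ['Poi', 'andiamo', 'a', 'Roma', 'va', 'bene']]
```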
Fig. 1: The agents implementing the pipe and filter model (Phase 1: sentence splitting; Phase 2: Rule 1, Rule 2, Rule 3; Phase 3: Filter0 on capital letters, Filter1 on non-places, Filter2 on verbs).

III. PHASE 2: RULE-BASED EXTRACTION

We defined three finite state automata to implement three grammar cases possibly implying the use of a place in the accepting state of the automaton. Each rule identifies a different sentence pattern. The rules are applied at the sentence level, i.e. on a list of words terminated by a punctuation symbol, obtained in phase 1. The tokens (words) are fed to the automaton and, if an accepting state is reached, the current token is marked as a location candidate. If no accepting state is reached, no candidate is produced. When a candidate is found, and the sentence still contains some more words, then the automaton restarts from its initial state using the token after the candidate, proceeding until the sentence ends.

The devised rules are independent from one another, so they can be parallelised by running as different agents, e.g. on a multicore machine or on different machines coordinated in a Cloud fashion.

The result of each rule application is a list of candidate words; such words are used as input for the next phase (see Section IV) for the definitive labelling. Different rules possibly yield different candidates. Then, a way to use all the said rules is to combine them, hence the candidate words passing the rule filter(s) will be the union of the candidate words determined by each applied rule (see Section V).

A. Rule 1: Da Roma

The first rule, translating "from Rome", is used to identify possible candidate words as a location, and is named, as the other rules, after a typical example of a (part of a) sentence in which a place can be identified.

Fig. 2: The FSM implementing Rule 1 (states 0 to 2; transitions on articles (A) and prepositions (P)).

The automaton (see Figure 2) scans the words (tokens) of a given sentence and remains in state 0 until a preposition (P) or an article (A) is found; this condition makes the automaton change its current state from 0 to 1, and the state remains unchanged unless a different kind of word is found in the next token. Other articles or prepositions do not enable a state change, which is instead triggered by any other kind of word. The final state is reached when a candidate word for a place is found; however, many candidates will be ignored afterwards, as described in Section IV.

As a single rule, this yields the highest number of false positives, as the use of an article or a preposition is very common in the Italian language.
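A minimal sketch of how the FSM of Figure 2 can be coded is given below; the function name and the tiny article/preposition sets in the example are our own assumptions, standing in for the A and P lexicons.

```python
def rule1(words, articles, prepositions):
    """Rule 1 ("Da Roma"): after one or more articles/prepositions, the next
    word of any other kind becomes a place candidate (sketch of Figure 2)."""
    candidates = set()
    state = 0
    for word in words:
        a_or_p = word.lower() in articles or word.lower() in prepositions
        if state == 0 and a_or_p:
            state = 1                    # 0 -> 1 on an article or preposition
        elif state == 1 and not a_or_p:
            candidates.add(word)         # accepting state: candidate found
            state = 0                    # restart for the rest of the sentence
    return candidates

# Example (the lexicons here are a small assumed sample):
# rule1("Il processo che si svolge a Milano".split(),
#       articles={"il", "lo", "la", "i", "gli", "le"},
#       prepositions={"a", "da", "di", "in", "su", "per", "tra", "fra"})
# -> {'processo', 'Milano'}; the Phase 3 filters are meant to discard "processo".
```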
B. Rule 2: Vicino a Roma

The second rule accommodates the mentioning of a place name in sentences such as the name of the rule suggests, "Near Rome", as the presence of a Descriptor (see Section II) is a strong indication that a place will be mentioned in the following text in the same sentence.

Fig. 3: The FSM implementing Rule 2 (states 0 to 3; transitions on descriptors (D), articles (A) and prepositions (P); state 3 is accepting).

Figure 3 shows the Finite State Machine (FSM) to find a candidate place. The automaton starts by reading the words one by one and does not change its initial state (0) until a descriptor is found, then it changes the current state to 1. From state 1 a transition can take place to state 2, when an article or a preposition is found, or directly to the accepting state 3 in any other case. From state 2 it is possible to return to state 1, if another descriptor is found, or stay in the same state, if more articles or prepositions are found. Finally, the accepting state can be reached by reading any other kind of word. The accepting state identifies a candidate word as a place, as Roma in the rule name.
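The FSM of Figure 3 can be sketched along the same lines; again the code is ours, and the example assumes that verso (towards) is listed among the Descriptors.

```python
def rule2(words, descriptors, articles, prepositions):
    """Rule 2 ("Vicino a Roma"): a descriptor, possibly followed by
    articles/prepositions, makes the next word of any other kind a
    place candidate (sketch of the FSM of Figure 3)."""
    candidates = set()
    state = 0
    for word in words:
        low = word.lower()
        d = low in descriptors
        a_or_p = low in articles or low in prepositions
        if state == 0:
            if d:
                state = 1                 # descriptor found: 0 -> 1
        elif state == 1:
            if a_or_p:
                state = 2                 # 1 -> 2 on an article/preposition
            elif not d:
                candidates.add(word)      # accepting state 3
                state = 0
        elif state == 2:
            if d:
                state = 1                 # back to 1 on another descriptor
            elif not a_or_p:              # stay in 2 on more articles/prepositions
                candidates.add(word)      # accepting state 3
                state = 0
    return candidates

# rule2("Scappa verso il Canale".split(),
#       descriptors={"verso", "vicino", "dentro"},
#       articles={"il", "la"}, prepositions={"a", "di", "per"})
# -> {'Canale'}
```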
C. Rule 3: Andando a Roma

While the previous rule uses descriptors as a way to identify a possible place name, this rule is concerned with verbs, as in the rule name, "Going to Rome". The FSM implementing the rule is shown in Figure 4. The behaviour of the automaton is the same as for Rule 2, where a (possibly conjugated) verb is used instead of a descriptor. The verbs included are only verbs related to movement, and thus usually related to places, such as staying in a place or moving to and from a place. Since a verb can be found in a conjugated form, the check is performed using the Italian version of the stemming algorithm Snowball [21].

Fig. 4: The FSM implementing Rule 3 (states 0 to 3; transitions on verbs (V), articles (A) and prepositions (P); state 3 is accepting).

The automaton scans the tokens and remains in state 0 until a verb is found, and the current state is changed to state 1. The automaton may change from state 1 to state 2 (and vice versa) by reading a preposition or an article (or reading a verb). Any other word will make the automaton change into the accepting state 3, i.e. pointing at the current word as a possible candidate for a place name.

IV. PHASE 3: NON-PLACE WORDS REMOVAL

The candidate words yielded by the application of a rule are further filtered before being labelled as a place name. A candidate, to be considered a place name and thus evaluated as a positive result, has to pass the following filters.

• Filter1: The candidate word is checked against the Non-places lexicon (N). If it exists in N, then it is regarded as a False Positive (FP) and hence discarded. E.g. in the sentence Andare alla capitale (To go to the capital city), Rule 3 will suggest capitale as a possible place; however, it is a common name and thus it will be discarded.
• Filter2: After passing the previous filter (Filter1), all the remaining candidate words are filtered to avoid identifying (conjugated) verbs as places. Once again, the check is performed using a stemming algorithm [21]. To check if a candidate word is a verb, it is stemmed and then concatenated with the three possible suffixes used in Italian verbs (-are, -ere and -ire) so as to get the infinitive form of the verb, which is then searched for in the Non-places lexicon. If the word (a verb) appears in the lexicon, it is discarded as it is not a place name. E.g. in the sentence Ella esce camminando (She gets out walking), Rule 3 stays in state 0 for Ella, then goes into state 1 reading the verb uscire (conjugated as esce); however, the next token is neither an article nor a preposition, thus camminando is proposed as a place candidate. In this filtering, such a place candidate is recognised as a conjugation of the verb camminare and finally discarded. The remaining candidate words are promoted as results.

Any word passing the said filters is labelled as a place name; however, such a result may be a True Positive (TP) or a False Positive (FP).

Unstructured text may not reliably follow orthographic conventions: a text could be a professionally proof-read book or an informal automatic transcription, thus it may or may not use capital letters for location names. As described so far, our approach does not make any assumption on such an orthographic convention; however, experimentally, we found better results when such a convention is satisfied, thus we also provide a further filter:

• Filter0: If the candidate begins with a lower-case character, it is not deemed a location name, while it is output as a result if it starts with a capital letter.

As the name suggests, this filter has to be applied before Filter1 and Filter2, as the user sees fit, based on the text to be processed.
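A sketch of the three filters is given below; the Snowball Italian stemmer is taken from NLTK, and anything beyond what the text above states (e.g. the exact return conventions) is our assumption.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("italian")      # the Italian Snowball stemmer [21]
VERB_SUFFIXES = ("are", "ere", "ire")

def filter0(candidate):
    """Filter0: keep only candidates starting with a capital letter."""
    return candidate[:1].isupper()

def filter1(candidate, non_places):
    """Filter1: discard candidates listed in the Non-places lexicon N."""
    return candidate.lower() not in non_places

def filter2(candidate, non_places):
    """Filter2: discard conjugated verbs. The candidate is stemmed, the stem
    is concatenated with -are/-ere/-ire, and the resulting infinitives are
    searched for in the Non-places lexicon."""
    stem = stemmer.stem(candidate.lower())
    return not any(stem + suffix in non_places for suffix in VERB_SUFFIXES)

# filter2("camminando", {"camminare", "altrimenti"}) -> False (discarded):
# with NLTK's Italian stemmer the stem of camminando is "cammin", and
# "cammin" + "are" gives camminare, which is in N.
```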
V. DISCUSSION

In the next subsections we review the rules and how they relate to the actual grammar writings we examined, and then show the results of the labelling experiments made on different texts.

A. Rules' Assessment

Real case examples. Table II shows several fragments of sentences recognised by our approach, for both TP and FP results, as specified in the third column. The words in italics are the tokens consumed by the automaton for the current rule, while the bold word is the one identified as a place.

Table II: Sample results using different rules

Rule | Sentence | Result
1 | Il processo che si svolge a Milano (The trial taking place in Milano) | TP
1 | I treni a lunga percorrenza per la Sicilia (Long distance trains to Sicily) | TP
1 | Il vertice che si terrà oggi a Bruxelles (The meeting taking place in Bruxelles) | TP
1 | Se il Ministro in indirizzo non intenda intervenire (If the addressed Minister does not mean to intervene) | FP
2 | Scappa verso il Canale (Runs away towards the Channel) | TP
2 | All'interno della Basilica Palladiana (Inside the Basilica Palladiana) | TP
2 | Mi fanno sedere accanto a Carlo (They let me sit beside Carlo) | FP
2 | Sono operative presso le DIGOS di tutto lo Stato (They are operational in the DIGOS (offices) of the State) | FP
3 | Passando per Piazza Del Popolo (Proceeding through Piazza Del Popolo) | TP
3 | La prima volta che vedo Palermo (The first time I see Palermo) | TP
3 | Non ho più visto Carlo (I have not seen Carlo) | FP
3 | La soglia richiesta per entrare in Parlamento (The threshold required to get into the Parliament) | FP

Lexicons. As the rules are based on several lexicons, the completeness of such lexicons is essential for a good recognition. In the sentence "dentro la stanza" (inside the room), Rule 2 will propose "stanza" as a candidate, which should not be proposed as a result, as it is not a proper location name. It is the responsibility of the Non-places filter (Filter1, see Section IV) to recognise that the word is not a location; however, if "stanza" is not in the N lexicon, it will be selected and thus proposed as a TP while being a FP.

Repetitions. A simple observation of the FSMs shown in Figures 2 to 4 shows that a traversal of the states may recognise sentences that are illicit for the Italian grammar, e.g. "Andando camminare per per Roma" (Going to walk to to Rome), in which the application of Rule 3 would propose "Roma" as a candidate place name. Our preliminary studies show that ungrammatical sentences, such as the previous example, are not so frequent unless we factor in informal languages, such as instant messaging or poetic prose/verses. However, the same rules are capable of recognising ungrammatical sentences appearing in both formal and informal speech. A phrase such as "Andando a... a... a Roma" (Going to... to... to Rome) would make the FSM in Figure 4 point at "Roma" in an accepting state, even if the sentence is not grammatically correct. As such a sentence can be typical in speeches, e.g. when one speaks while recalling something, an automatic transcription may report such sentences, and thus we left the loops in the rules.

Sentence patterns. The rules we are proposing can be considered arbitrary, even if intuitively correct. Thus, before making the actual labelling experiments, we studied the result of the application of the rules alone on a set of unstructured texts, so as to check whether such grammar structures had the needed degree of responsiveness, i.e. we are interested in the possible paths any automaton may take, given real written texts and not just simple cases (such as the ones in the titles of the subsections in Section III, which are correct but also very basic).

The rules have been tested on different kinds of text files, both prose and dialogue transcriptions, for a total of 1.2 million characters. The results are shown in Table III. In each line, the first column is the rule, the second is the sentence pattern found by the rule, and the third column is the number of instances of the pattern found in the test corpus. The sentence pattern is identified by the transitions in the automaton, e.g. VPA (Verb, Preposition, Article) identifies a sentence such as Viaggiare per l'Italia (To travel in Italy), which is decomposed as Viaggiare [V] per [P] l' [A] Italia, where the words before Italia are catalogued respectively as [V]erb, [P]reposition and [A]rticle.

Table III: Different sentence kinds

Rule | Sentence Pattern | Occurrences
Rule 1 | A | 1076
Rule 1 | P | 4174
Rule 1 | AA | 1
Rule 1 | AP | 1
Rule 1 | PA | 56
Rule 1 | PP | 4
Rule 1 | APA | 1
Rule 2 | D | 58
Rule 2 | DA | 41
Rule 2 | DP | 73
Rule 2 | DADP | 1
Rule 2 | DPDP | 2
Rule 3 | V | 82
Rule 3 | VA | 26
Rule 3 | VP | 190
Rule 3 | VPA | 1
Rule 3 | VPV | 1
Rule 3 | VAVP | 2
Rule 3 | VPVP | 1

Given a rule, a Sentence Pattern such as A is more general than any pattern having A as a suffix, e.g. PA. Thus, all the occurrences of PA form a subset of the occurrences of A. For the experiments (Section V-B) the automata are set to find the longest match.

The preliminary study reported in Table III shows just the number of occurrences of each sentence pattern; it does not show the percentage of TPs or FPs, as this is just a way to check the different transitions in the proposed automata.
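For illustration, the sentence patterns counted in Table III can be collected by letting a rule record the transition labels it consumes before each candidate; the sketch below does this for Rule 1, again with tiny assumed lexicons in the example.

```python
def rule1_patterns(words, articles, prepositions):
    """Variant of Rule 1 returning (pattern, candidate) pairs, where the
    pattern is the sequence of consumed [A]rticles and [P]repositions,
    as counted in Table III (a sketch; the tool keeps the longest match)."""
    results = []
    pattern = ""
    for word in words:
        low = word.lower()
        if low in articles:
            pattern += "A"
        elif low in prepositions:
            pattern += "P"
        elif pattern:                     # any other word closes the pattern
            results.append((pattern, word))
            pattern = ""
        # otherwise: still in the initial state, nothing consumed yet
    return results

# rule1_patterns("I treni a lunga percorrenza per la Sicilia".split(),
#                articles={"i", "il", "la"}, prepositions={"a", "per", "di"})
# -> [('A', 'treni'), ('P', 'lunga'), ('PA', 'Sicilia')]
```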
B. Experiments

The rules detailed in Section III have been implemented in a tool and have been tested on different kinds of unstructured texts: (i) theatrical dialogue transcriptions (texts T2, T3, T4), (ii) official stenographic transcriptions of political debates (T5) and (iii) news articles (T1). In the two latter cases, the transcriptions are properly capitalized, and thus Filter0 (see Section IV) has been used in the experiments, while the other texts were all in lower case and thus only Filter1 and Filter2 have been used in phase 3 (see Section IV).

All the texts used for the experiments have been manually labelled for the location names. In the experiments, all the combinations of the rules have been tested, as shown in the second column of Table IV. E.g. rules "2&3" means putting together, as a mathematical union, the set of candidates gathered by Rule 2 with the set of candidates gathered by Rule 3, and using such a union as input for the filtering agents in phase 3.

The precision metric is computed as a correctness measure, as TP / (TP + FP), while the recall is computed as a completeness metric, using also the number of False Negatives (FN), as TP / (TP + FN). The F1 score gives the harmonic mean of precision and recall.

There are cases where a rule fails to identify any TP; however, this is expected. When an input text references a place name by e.g. a motion verb, then only Rule 3 is able to recognise such a place, while Rule 2, concerned with the usage of descriptors, will never be applied.

The results show an interesting F1 score, going up to 0.67 with an average of 0.38. The precision metric goes up to 0.82 in the best case, with a minimum value of 0.27 and an average of 0.45. The recall also shows good results, having a maximum value of 0.92 and an average of 0.51.

While there are cases where very few location names are identified, we deem such preliminary experiments worth expanding, as one of the limitations is the small number of labelled texts we have dealt with.

Table IV: Experimental results

Text | Rules | TP | FP | FN | F1 | precision | recall
T1 | 1 | 39 | 49 | 9 | 0.57 | 0.44 | 0.81
T1 | 2 | 1 | 1 | 47 | 0.04 | 0.50 | 0.02
T1 | 3 | 0 | 3 | 48 | n/a | n/a | n/a
T1 | 1&2 | 39 | 50 | 9 | 0.57 | 0.44 | 0.81
T1 | 1&3 | 39 | 52 | 9 | 0.56 | 0.43 | 0.81
T1 | 2&3 | 1 | 4 | 47 | 0.04 | 0.20 | 0.02
T1 | 1&2&3 | 39 | 53 | 9 | 0.56 | 0.42 | 0.81
T2 | 1 | 33 | 49 | 6 | 0.55 | 0.40 | 0.85
T2 | 2 | 2 | 1 | 37 | 0.10 | 0.67 | 0.05
T2 | 3 | 9 | 2 | 30 | 0.36 | 0.82 | 0.23
T2 | 1&2 | 33 | 50 | 6 | 0.54 | 0.40 | 0.85
T2 | 1&3 | 34 | 51 | 5 | 0.55 | 0.40 | 0.87
T2 | 2&3 | 11 | 3 | 28 | 0.42 | 0.79 | 0.28
T2 | 1&2&3 | 34 | 52 | 5 | 0.54 | 0.40 | 0.87
T3 | 1 | 13 | 7 | 6 | 0.67 | 0.65 | 0.68
T3 | 2 | 0 | 0 | 19 | n/a | n/a | n/a
T3 | 3 | 3 | 2 | 16 | 0.25 | 0.60 | 0.16
T3 | 1&2 | 13 | 7 | 6 | 0.67 | 0.65 | 0.68
T3 | 1&3 | 14 | 9 | 5 | 0.67 | 0.61 | 0.74
T3 | 2&3 | 3 | 2 | 16 | 0.25 | 0.60 | 0.16
T3 | 1&2&3 | 14 | 9 | 5 | 0.67 | 0.61 | 0.74
T4 | 1 | 56 | 136 | 5 | 0.44 | 0.29 | 0.92
T4 | 2 | 0 | 4 | 61 | n/a | n/a | n/a
T4 | 3 | 9 | 15 | 52 | 0.21 | 0.38 | 0.15
T4 | 1&2 | 56 | 140 | 5 | 0.44 | 0.29 | 0.92
T4 | 1&3 | 56 | 151 | 5 | 0.42 | 0.27 | 0.92
T4 | 2&3 | 9 | 19 | 52 | 0.20 | 0.32 | 0.15
T4 | 1&2&3 | 56 | 155 | 5 | 0.41 | 0.27 | 0.92
T5 | 1 | 74 | 190 | 84 | 0.35 | 0.28 | 0.47
T5 | 2 | 5 | 4 | 153 | 0.06 | 0.56 | 0.03
T5 | 3 | 4 | 8 | 154 | 0.05 | 0.33 | 0.03
T5 | 1&2 | 76 | 194 | 82 | 0.36 | 0.28 | 0.48
T5 | 1&3 | 75 | 198 | 83 | 0.35 | 0.27 | 0.47
T5 | 2&3 | 9 | 12 | 149 | 0.10 | 0.43 | 0.06
T5 | 1&2&3 | 77 | 202 | 81 | 0.35 | 0.28 | 0.49
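The figures of Table IV follow from the usual definitions of precision, recall and F1; the small helper below (ours, for illustration) reproduces them from the TP, FP and FN counts.

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 as used in Table IV.
    Rows with no true positive are reported as n/a in the table."""
    if tp == 0:
        return None
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# prf1(39, 49, 9) -> (0.443..., 0.8125, 0.573...), matching row T1 / Rule 1
# (0.44, 0.81 and 0.57 after rounding).
```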
VI. RELATED WORK

Information extraction has come to be a hot research topic, especially since huge amounts of data have become publicly available. An excellent survey on Information Extraction is [22], where the author reviews all the significant existing approaches in great detail. While many different approaches have been proposed, to the best of our knowledge little to no effort has been put towards the Italian language.

In [20] named entities are extracted and related to classified newspaper advertisements (in French), using different techniques. They make use of a lexicon to store already known entities, thus once a word is found in an advertisement and in the lexicon it can be automatically tagged as the lexicon suggests. They also use regular expressions for entities such as telephone numbers. Finally, a word spotting algorithm is used to compute a score for unrecognised words, based on the context (i.e. other specialised lexicons). While we also make use of a lexicon, we use it to exclude a candidate after a rule has yielded one. It would be a trivial and brute-force approach to recognise a location name using a lexicon with all existing location names (apart from homonymy); instead, the rules we propose allow the discovery of names not already inserted in a lexicon.

The approach presented in [1] shows some similarities with ours. The authors start with sample patterns containing named entities, then identify actual instances of named entities; found names are searched for to automatically identify new patterns and reiterate the process.

A different approach has been proposed in [5], which tries to identify named entities as short sequences of words, analysing n-gram statistics obtained on Internet documents. Their Lex method is a semi-supervised learning algorithm based on the assumption that a sequence of capitalised words compounds the same name when such an n-gram appears statistically more frequently than simple chance.

A data mining approach is presented in [25], especially crafted for geographical names. The algorithm searches for specific keywords and patterns, manually constructed and related to geographical names, such as island of or archipelago. The results are used to train a classifier with respect to the found instances of a pattern.

VII. CONCLUSIONS

We have presented an algorithm devised specifically for the Italian language, based on rules built upon its grammar. The rules represent grammar patterns, implemented by finite state machines, typically used in both written and spoken language; thus, several agents can be coordinated in a pipe and filter style so that an unstructured input text is filtered by the rules to get candidate places. Preliminary results are promising, as the F1 score reaches a maximum of 0.67, whereas the highest precision and recall are 0.82 and 0.92, respectively.

As possible future work, we aim to connect with our previous research, in which we have proposed to improve the modularity of a software system by letting classes assume roles in some design patterns [6]–[9]. The work presented here can foster an approach whereby the automatic processing of the Italian language used for program comments can assist in the selection of roles for classes. Moreover, semantic analysis of text can take advantage of neural networks [15], and as a further work a possible approach would aim to recognise text fragments using a soft computing approach [13], [16].
ACKNOWLEDGEMENT

This work has been supported by project PRIME funded within the POR FESR Sicilia 2007-2013 framework.

REFERENCES

[1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In Proceedings of ACM Conference on Digital Libraries (DL), pages 85–94, New York, NY, USA, 2000. ACM.
[2] F. Bannò, D. Marletta, G. Pappalardo, and E. Tramontana. Tackling consistency issues for runtime updating distributed systems. In Proceedings of International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), pages 1–8. IEEE, 2010.
[3] A. Calvagna and E. Tramontana. Delivering dependable reusable components by expressing and enforcing design decisions. In Proceedings of Computer Software and Applications Conference (COMPSAC) Workshop QUORS, pages 493–498. IEEE, July 2013.
[4] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan. A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering, 18(10):1411–1428, Oct. 2006.
[5] D. Downey, M. Broadhead, and O. Etzioni. Locating complex named entities in web text. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), pages 2733–2739. Morgan Kaufmann Publishers Inc., 2007.
[6] R. Giunta, G. Pappalardo, and E. Tramontana. Using aspects and annotations to separate application code from design patterns. In Proceedings of Symposium on Applied Computing (SAC). ACM, 2010.
[7] R. Giunta, G. Pappalardo, and E. Tramontana. Aspects and annotations for controlling the roles application classes play for design patterns. In Proceedings of Asia Pacific Software Engineering Conference (APSEC). IEEE, 2011.
[8] R. Giunta, G. Pappalardo, and E. Tramontana. AODP: refactoring code to provide advanced aspect-oriented modularization of design patterns. In Proceedings of Symposium on Applied Computing (SAC). ACM, 2012.
[9] R. Giunta, G. Pappalardo, and E. Tramontana. Superimposing roles for design patterns into application classes by means of aspects. In Proceedings of Symposium on Applied Computing (SAC). ACM, 2012.
[10] R. Giunta, G. Pappalardo, and E. Tramontana. A redundancy-based attack detection technique for Java Card bytecode. In Proceedings of International WETICE Conference, pages 384–389. IEEE, 2014.
[11] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[12] M. Mongiovi, G. Giannone, A. Fornaia, G. Pappalardo, and E. Tramontana. Combining static and dynamic data flow analysis: a hybrid approach for detecting data leaks in Java applications. In Proceedings of Symposium on Applied Computing (SAC). ACM, 2015.
[13] C. Napoli, G. Pappalardo, and E. Tramontana. A hybrid neuro-wavelet predictor for QoS control and stability. In Proceedings of AI*IA, volume 8249 of LNCS, pages 527–538. Springer, 2013.
[14] C. Napoli, G. Pappalardo, and E. Tramontana. Using modularity metrics to assist move method refactoring of large systems. In Proceedings of Complex, Intelligent and Software Intensive Systems (CISIS). IEEE, 2013.
[15] C. Napoli, G. Pappalardo, and E. Tramontana. An agent-driven semantical identifier using radial basis neural networks and reinforcement learning. In Proceedings of XV Workshop "Dagli Oggetti agli Agenti", volume 1260. CEUR-WS, 2014.
[16] C. Napoli, G. Pappalardo, and E. Tramontana. Improving files availability for BitTorrent using a diffusion model. In Proceedings of International WETICE Conference, pages 191–196. IEEE, 2014.
[17] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI Workshop on Machine Learning for Information Filtering, volume 1, pages 61–67, 1999.
[18] G. Pappalardo and E. Tramontana. Automatically discovering design patterns and assessing concern separations for applications. In Proceedings of Symposium on Applied Computing (SAC). ACM, 2006.
[19] G. Pappalardo and E. Tramontana. Suggesting extract class refactoring opportunities by measuring strength of method interactions. In Proceedings of Asia Pacific Software Engineering Conference (APSEC), pages 105–110. IEEE, December 2013.
[20] R. A. Peleato, J.-C. Chappelier, and M. Rajman. Automated information extraction out of classified advertisements. In Natural Language Processing and Information Systems, pages 203–214. Springer, 2001.
[21] M. F. Porter. Snowball: A language for stemming algorithms, 2001. http://snowball.tartarus.org/texts/introduction.html
[22] S. Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3):261–377, Mar. 2008.
[23] E. Tramontana. Automatically characterising components with concerns and reducing tangling. In Proceedings of Computer Software and Applications Conference (COMPSAC) Workshop QUORS. IEEE, 2013.
[24] E. Tramontana. Detecting extra relationships for design patterns roles. In Proceedings of AsianPLoP, March 2014.
[25] O. Uryupina. Semi-supervised learning of geographical gazetteers from the internet. In Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References, Volume 1, pages 18–25. Association for Computational Linguistics, 2003.