<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main"></title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
<author role="corresp">
							<persName><forename type="first">Pierluigi</forename><surname>D'Amadio</surname></persName>
							<email>damadio@di.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="department">Dipartimento di Informatica</orgName>
								<address>
									<addrLine>via Salaria 113</addrLine>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paola</forename><surname>Velardi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Dipartimento di Informatica</orgName>
								<address>
									<addrLine>via Salaria 113</addrLine>
									<settlement>Roma</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">3F5674FFF90911923EAEC01E320A3512</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:17+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes a methodology to detect the emergence (or the disappearance) of concepts through the observation of natural language communications (NLC). NLC are the documents, e-mails, and written communications of any kind that the members of a web community produce, access, and exchange for their purposes. The emergence of a new concept is suggested by the repetitive and consistent use of certain terms, while its intended meaning and appropriate conceptualization are obtained through a combination of text mining and algebraic methods.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The Self-Evolving Glossary</head><p>Building a glossary of terms is often the first step to model emerging knowledge domains and to favor interoperability between widely distributed communities of interest, who upload, exchange and share relevant information through the web. Modeling web communities in the IT society is significant for several reasons <ref type="bibr">(Flake et al. 2002)</ref>, spanning from socio-cultural aims, like the discovery of interdisciplinary connections, to more practical applications, like the development of focused search engines, information filtering and information integration tools.</p><p>However, glossaries capture a static portion of a reality that can instead be highly dynamic, especially when modeling emerging domains. They are conceived and built as an "a priori" agreement on common terms, a "frozen" picture of the knowledge and competences of a community, which might suffer from a shortage of up-to-date descriptions <ref type="bibr">(Staab, 2002)</ref>  <ref type="bibr">(Heflin and Hendler 2000)</ref>. On the other hand, glossary building is a time-consuming task, involving human effort to identify the relevant terms, agree on their meaning, and (in thesauri) structure terms according to some taxonomic ordering. In other words, glossary creation is a consensus-building process, often painful and tedious. There is an inherent risk in re-opening the process again and again.</p><p>The idea that we propose in this paper is that glossaries should be, as much as possible, self-evolving, continuously capturing the emergence of new concepts in dynamic web communities. The key to obtaining this is to simulate the process of consensus building in humans, through constant monitoring of natural language communications (NLC). NLC are the documents, e-mails, and written communications of any kind that the members of a web community produce, access, and exchange for their purposes. The emergence of a new concept is suggested by the repetitive and consistent use of certain terms in NLC. The simulation of consensus can be achieved through statistical indicators, aimed at selecting terms with certain distributional properties across the set of observed NLC.</p><p>This paper describes a methodology aimed at implementing this view of a self-evolving glossary, detecting the emergence (or the disappearance) of concepts through the observation of natural language communications. Experiments have been made in several domains (art, tourism, web-learning, economy and finance), but in this paper we concentrate on an experiment related to the modeling of a web community organized through a Network of Excellence, INTEROP 1 , on enterprise interoperability. Partners in INTEROP are academic and industrial institutions belonging to different research areas, grouped in three domains of expertise: Ontology, Enterprise Modeling, and Architecture and Platforms. One of the main objectives of INTEROP is to model partners' competences in a Knowledge Map, indexed through a structured taxonomy of interoperability concepts. The KMap 2 aims at drawing a picture of the status of research in interoperability and at keeping this picture up-to-date in the future. This provided us with an ideal test-bed for our methodology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&amp;ROOHFWLQJ (YLGHQFHV</head><p>The first step of the procedure is to collect a wide number of documents in written form, which should represent at best ZKDW LV FRPPXQLFDWHG DQG H[FKDQJHG among the members of a community. This is a partly manual, partly automated step, and its complexity and involved effort strongly depends upon the community under consideration. For the purpose of the self-evolving Glossary, documents must be stored with an attached information about the source, authority and date of the acquired document. We have not developed a specific document warehouse architecture, since this depends upon the community document collection strategy and organization methods. In INTEROP, a collaborative platform in Zope/Plone has been adopted by the network partners (accessible from the INTEROP web site), which is also used to store documents and related metadata.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Extraction of a Domain Lexicon</head><p>A domain lexicon L is a list of terms t commonly used within a given community of interest. The purpose of this phase is to automatically extract simple and multi-word expressions from the documentation collected in phase 1. Terminological candidates are multi-word strings with a precise syntactic structure (e.g. compounds, adjective+compound, etc.) and certain distributional properties across the domain documents. Examples in various fields are the following: in enterprise interoperability, enterprise intra-organizational integration; in tourism, gourmet restaurant; in computer networks, packet switching protocol; in art techniques, chiaroscuro. Statistical and natural language processing (NLP) tools are used for the automatic extraction of terms (details are in <ref type="bibr">(Navigli and Velardi, 2004)</ref>).</p><p>Statistical techniques are specifically aimed at simulating human consensus in accepting new domain terms. Only terms uniquely and consistently 3 found in domain-related documents, and not found in other domains used for contrast, are selected as candidates for the domain lexicon.</p></div>
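The contrastive selection step can be sketched as follows. This is a minimal stand-in, not the paper's implementation: the paper's consistency measure (domain consensus) is entropy-based, while here domain pertinence is approximated by a simple relative-frequency ratio between the domain corpus and the contrast corpora, with assumed `min_ratio` and `min_freq` knobs.

```python
from collections import Counter

def candidate_terms(domain_docs, contrast_docs, min_ratio=2.0, min_freq=3):
    """Keep term candidates that occur consistently in the domain corpus
    and rarely in contrast corpora. Each document is a list of
    already-extracted candidate strings. The ratio test below is only
    an illustrative stand-in for the entropy-based domain consensus."""
    dom = Counter(t for doc in domain_docs for t in doc)
    con = Counter(t for doc in contrast_docs for t in doc)
    n_dom = sum(dom.values()) or 1
    n_con = sum(con.values()) or 1
    lexicon = []
    for term, f in dom.items():
        if f < min_freq:          # not used consistently enough
            continue
        p_dom = f / n_dom         # relative frequency in the domain
        p_con = con.get(term, 0) / n_con  # relative frequency in contrast
        if p_dom > min_ratio * p_con:
            lexicon.append(term)
    return sorted(lexicon)
```

Candidates surviving the filter form the initial domain lexicon L.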
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Extraction of Definitions</head><p>Once an initial lexicon is extracted, the subsequent phase is to obtain a list of (one or more) definitions for each term. The extraction of definitions, as well as the subsequent step of glossary parsing, relies on a model of well-formed "definitory" sentences, which we describe through a set of regular expressions. Regular expressions, discussed later in a dedicated section, have several purposes:</p><p>• To select definitory sentences from non-definitory ones. For example, many definitory sentences follow the pattern "t is a Y", but using this pattern alone causes the extraction of a huge amount of non-definitory sentences, for example: "Knowledge management is a contradiction in terms, being a hangover from an industrial era when control modes of thinking …". Regular expressions, along with statistical indicators, are used to prune this noise. • To prefer definitory sentences with a precise structure often used by professional lexicographers, i.e. one that describes the meaning of a term by means of its kind (the so-called genus, or hypernym) followed by a modifier (what differentiates the concept from its kind, the differentia). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Extracting Definitions from Glossaries</head><p>Google recently introduced a search feature, called "define:", which can be used to search for definitions of terms in web glossaries. However, using this search facility in an unconstrained way may retrieve a large number of noisy (not pertinent to the domain) definitions. We defined the following algorithm to select pertinent definitions:</p><p>1) From the set of word components forming the extracted lexicon L of a domain D, learn a probabilistic model of the domain, i.e. assign a probability of occurrence to each word component. More precisely, let L be the lexicon of extracted terms, LT the set of word components appearing in L, and let</p><formula xml:id="formula_0">E(P(w)) = freq(w) / Σ_{w′∈LT} freq(w′)</formula><p>be the estimated probability of w in D, where w ∈ LT and the frequencies are computed in L. For example, if L = [distributed system integration, integration method] then LT = [distributed, system, integration, method] and E(P(integration)) = 2/5.</p><p>2) Search the terms in L using the Google "define" feature. Select only those definitions def(t), t ∈ L, with the following features: a) Domain pertinence: let W_t be the set of words in def(t), and let W′_t ⊆ W_t be the subset of words in def(t) belonging to LT. Compute:</p><formula xml:id="formula_1">weight(def(t)) = Σ_{w∈W′_t} E(P(w)) · log(N_t / n_t(w))</formula><p>where N_t is the number of definitions extracted for the term t, and n_t(w) is the number of such definitions including the word w. The log factor, called inverse document frequency in the information retrieval literature, reduces the weight of words that have a very high probability of occurrence in any definition (e.g. system).</p><p>Definitions are ordered according to their weight. 
The first k definitions are selected, according to a threshold computed for each t 5 : weight(def(t)) ≥ τ_t. b) Well-formedness: apply a final filter to select those def(t) matching the "genus differentia" style, expressed through a set of regular expressions described in detail in section 2.3.</p><p>To measure the performance of this method under the worst ambiguity conditions, we selected 10 highly ambiguous single-word terms from the INTEROP single-word lexicon LT (which includes over 1000 words). Three evaluators marked each definition as relevant or not relevant with respect to the domain, i.e. enterprise interoperability. The inter-annotator agreement was 84%, since the task is inherently complex and subjective. We considered only the definitions marked in the same way by at least two annotators.</p><p>Table 1: Evaluation of the definition selection algorithm. Legend: R = relevant definitions (majority-based), A = system-selected definitions, N = extracted definitions, N′ = definitions on which there is agreement (majority-based), Ra = R ∩ A, Pr = precision, Rec = recall, IAA = inter-annotator agreement.</p><formula xml:id="formula_2">Term | R | A | Ra | N | N′ | Pr = Ra/A | Rec = Ra/R | IAA</formula><p>Table <ref type="table">1</ref> shows the results. Except for the last line, all numbers refer to the result of step 2a. The effect of step 2b (well-formedness) is a considerable improvement in recall, and a small increase in precision. Notice that the algorithm always outputs at least one of the relevant definitions, often the best one, even though the annotators were requested to vote on a yes/no basis. Appendix I provides the complete output for the term framework. The definitions selected by the algorithm are underlined.</p></div>
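The probabilistic model E(P(w)) and the idf-damped definition weight can be sketched as follows. This is a minimal illustration under assumptions: whitespace tokenization stands in for real term/word segmentation, and the per-term threshold is left out.

```python
import math
from collections import Counter

def word_probs(lexicon_terms):
    """E(P(w)): relative frequency of each word component across the
    terms of the extracted lexicon L (frequencies computed in L)."""
    words = [w for term in lexicon_terms for w in term.split()]
    freq = Counter(words)
    total = len(words)
    return {w: f / total for w, f in freq.items()}

def weigh_definitions(defs, ep):
    """Rank candidate definitions of a term by
    weight(def) = sum over def words w in LT of E(P(w)) * log(N_t / n_t(w)),
    where N_t is the number of retrieved definitions and n_t(w) counts
    how many of them contain w (the idf-style damping factor)."""
    n_t = len(defs)
    tokenized = [set(d.lower().split()) for d in defs]
    # n_t(w): in how many definitions does each lexicon word appear?
    df = Counter(w for words in tokenized for w in words if w in ep)
    scored = []
    for d, words in zip(defs, tokenized):
        score = sum(ep[w] * math.log(n_t / df[w]) for w in words if w in ep)
        scored.append((score, d))
    return sorted(scored, reverse=True)
```

With L = [distributed system integration, integration method], `word_probs` reproduces the paper's example E(P(integration)) = 2/5, and definitions built from lexicon words rank above generic ones.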
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Extracting Definitions from NLC</head><p>As remarked in the introduction, the dynamic glossary needs continuous updates, as new terms and new fields emerge and are accepted within communities of interest. Definitions of new terms in well-established communities, and of the new terminology of an emerging community, are not found in glossaries, simply because of their novelty. But it is often the case that the inventors of these terms, or their initial users, provide a definition in their communications to the reference community. For example, the term "federated ontology" appeared in the scientific literature only in 2001 (Stumme and Maedche 2001), but the first explicit definition is in a paper 6 dated 2004, which rephrases the concept of federated ontology proposed in a less explicit way in (Stumme and Maedche 2001): "Federated ontologies are distributed, connected ontologies, somewhat analogous to federated databases".</p><p>Identifying definitions in texts is much more complicated than choosing "good" definitions in glossaries. Definitions are buried in texts, and they cannot be recognized by means of simple regular expressions like "X is a Y", since, as remarked at the beginning of this section, these would produce an unacceptable amount of noise. We devised the following procedure:</p><p>Let L' be the list of terms in L for which no definition was found in the previous glossary search. For each t in L', do the following:</p><p>1) Extract, from the community-provided documents first, and from the web afterwards (only in case of unsuccessful search), a set of sentences including t. This implies some amount of pre-processing, such as the treatment of various formats, like html, doc and pdf. In case of web search, it is also necessary to handle the limitations imposed by most search engines on multiple queries. 
A first filtering is applied, using regular expressions that match patterns like "t is", "t defines", "t refers", etc.</p><p>2) A second filter selects sentences which include, besides t, some of the words in LT (the set of word components appearing in L). The same probabilistic filter as in step 2a) of the previous section is applied, with a small variation:</p><formula xml:id="formula_3">weight2(def(t)) = Σ_{w∈W′_t} E(P(w)) · log(N_t / n_t(w)) + α · Σ_{w∈LT ∧ w∈t} E(P(w))</formula><p>The additional sum in this formula assigns a higher weight to those sentences including some of the components of the term t to be defined, e.g. "Schema integration is [the process by which schemata from heterogeneous databases are conceptually integrated into a single cohesive schema.]" 3) Finally, the well-formedness criterion of step 2b of the previous section is applied. Definitions are again selected according to a varying threshold, but in this case the threshold must be tuned for high recall, rather than high precision. In fact, for some terms there might be very few definitions in the literature, and it is important to capture the majority of them. Table <ref type="table">2</ref> shows the performance obtained when searching for 10 terms from the lexicon L. Appendix I (part 2) shows the definitions, with ratings, extracted for the term ontology alignment, a relatively new term in the area of ontology building.</p><p>After this phase of the ontology updating process, selected definitions are presented to domain experts with an indication of the source (document or web glossary) and its authoritativeness. Experts can modify, reject or accept each definition 7 .</p></div>
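The pattern pre-filter of step 1 and the additional reward of step 2 can be sketched as follows. The pattern list and the constant `alpha` are assumptions for illustration, not the paper's exact regular expressions or tuning.

```python
import re

def looks_definitory(term, sentence):
    """First-pass filter: keep sentences matching patterns such as
    "t is", "t defines", "t refers to". A simplified stand-in for the
    paper's regular-expression filter."""
    pattern = re.compile(
        re.escape(term) + r"\s+(is|are|defines?|refers?\s+to)\b",
        re.IGNORECASE)
    return bool(pattern.search(sentence))

def component_bonus(term, sentence, ep, alpha=1.0):
    """The additional sum of the modified weight: reward sentences
    containing lexicon words that are components of t itself
    (e.g. 'integration' when t = 'schema integration').
    ep maps lexicon words to E(P(w)); alpha is an assumed constant."""
    sentence_words = set(sentence.lower().split())
    return alpha * sum(ep[w] for w in term.lower().split()
                       if w in ep and w in sentence_words)
```

A sentence like "Schema integration is the process by which …" passes the filter and collects the bonus, while a sentence that merely mentions the term does not.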
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Parsing of Definitions</head><p>This section adds further details on the definition and use of regular expressions. We use regular expressions 8 to select well-formed sentences and to extract kind-of relations from natural language definitions. The components of a regular expression are fixed words or word sequences, parts of speech, and syntactic chunks.</p><p>First, sentence chunks (e.g. noun phrases NP, prepositional phrases PP, etc.) are identified using an available syntactic parser, the TreeTagger 9 . For example, the following regular expression is used to verify the well-formedness criterion: 7 In INTEROP an initial glossary relative to educational objectives has been acquired and evaluated. The interested reader may access deliverable 10.1 on the web site to learn the details of this process. A second, large-scale (1800 terms) interoperability glossary has been acquired and will be fully evaluated by the end of year 2 of the project. 8 http://www.oreilly.com/catalog/regex/chapter/ch04.html r1 = "^(PP)?(NP)+" This regular expression (see subsequent examples) prescribes a sentence structure at the chunk level: a definitory sentence is formed by an optional prepositional phrase (^(PP)?) followed by the main noun phrase and possibly further noun phrases ((NP)+).</p><p>When a sentence matches the well-formedness and probabilistic criteria described in the previous section, other regular expressions are applied to extract additional information.</p><p>For example, the following regular expression at the word level is applied (with others) to the main NP to separate candidate definitions from non-definitions in step 1 of section 2.3.2:</p><p>r = "^(Refers|Referring)\\sto\\s(((a|the)\\s)?(type|kind)\\sof\\s)?(.*)" If a sentence is selected as being a definition, additional regular expressions are used to extract from the main NP the kind_of (hypernym) information. 
For example, consider the regular expression r2 = "^(A|D)?((V|C|I|J|N|R)*)(N)".</p><p>Symbols in r2 are part-of-speech tags (POS), e.g. article (A), verb (V), adjective (J), etc.</p><p>A sentence matching both r1 and r2 is: </p><formula xml:id="formula_4">domain model: "In the traditional</formula></div>
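A rough rendering of how r1 and r2 might be applied, under assumptions: chunk labels are space-separated strings, and the one-character POS tagset is illustrative, not TreeTagger's actual tagset.

```python
import re

# Chunk-level well-formedness rule, after r1 = "^(PP)?(NP)+":
# an optional prepositional phrase followed by one or more noun phrases.
CHUNK_RULE = re.compile(r"^(PP )?(NP )+")

# POS-level rule over the main NP, loosely after r2 = "^(A|D)?((V|C|I|J|N|R)*)(N)":
# optional article/determiner, then modifiers, ending in the noun that
# is selected as the hypernym.
POS_RULE = re.compile(r"^[AD]?[VCIJNR]*N$")

def extract_hypernym(np_tokens):
    """np_tokens: (word, pos) pairs for the main noun phrase.
    If the POS sequence matches the rule, return the final noun,
    i.e. the head selected as the hypernym; otherwise None."""
    pos_string = "".join(pos for _, pos in np_tokens)
    if POS_RULE.match(pos_string):
        return np_tokens[-1][0]
    return None
```

For the main NP "a precise representation" (tags A, J, N), the rule matches and "representation" is returned as the hypernym, mirroring the domain model example.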
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&amp;UHDWLRQ RI D 7D[RQRP\</head><p>Parsing definitions allows it to structure the terms in T in taxonomic order. However, ordering terms according to the hypernyms extracted from definitions has well-known drawbacks. An interesting paper <ref type="bibr">(Ide and Véronis, 1993)</ref> provides an analysis of typical problems found when attempting to extract (manually or automatically) hypernymy relations from natural language definitions, e.g. attachments too high in the hierarchy, unclear choices for more general terms, or-conjoined hypernyms, absence of hypernym, circularity, etc. These problems are more or less evident -especially overgenerality -when analysing the term trees forest generated on the basis of glossary parsing.</p><p>To reduce these problems, we proceeded as follows:</p><p>1) First, we arrange the terms in T taxonomically according to simple VWULQJ LQFOX VLRQ. String inclusion is a very reliable indicator of a taxonomic relation, though it does not capture all possible relations. This step produces a forest of subtrees. 2) Then, we use hypernymy information extracted from definitions to capture additional taxonomic relations between terms DW WKH VDPH OHYHO RI JHQHUDOLW\ (e.g. in the example above: UHSUHVHQWDWLRQ PRGHO VFKHPD RQWRORJ\ NQRZOHGJH GDWD LQIRUPDWLRQ).</p><p>3) If terms have more than one selected definition, or have or-conjoined heads in the main NP, more than one hypernym is extracted by the algorithm of section 2.3. However, we select only hypernyms belonging to the set of domain relevant words LT. </p><formula xml:id="formula_5">HJUDWLRQ Q DSSOLFDWLR HJUDWLRQ VHUYLFH ! 
" $# int _ int _ _ o ).</formula><p>Appendix II shows a small fragment of the complete INTEROP taxonomy 10 (the sub-trees rooted in LQWHJUDWLRQ) At the end of Appendix II we also show an excerpt of the detected hypernymy relations, used in step 4.</p><p>Ordering terms taxonomically is a highly subjective task, therefore is not easy to evaluate the output of this phase. Golden standard are not available, especially in subdomains. However, we did a small experiment: given the initial LQWHJUDWLRQ, LQWHURS HUDELOLW\ and V\VWHP taxonomy, our method was able to detect 25 hypernymy relations, e.g. The evaluation showed that there are around 33% matches with respect to a " golden standard" taxonomy like WordNet, but on the other side, WordNet is a general purpose ontology, and some of the not-corresponding relations detected by our methodology seem still very reasonable in the interoperability domain, as the reader may verify evaluating the detected kind_of links in Appendix II. Notice that, as expected, the major problem is the over-generality of certain hypernymy links (e.g. everything is a " system" ).</p><p>In any case, our purpose here is not to fully overcome problems that are inherent with the conceptually complex task of building a domain concept hierarchy. At the end of this process we obtain, a forest of trees where nodes (the concepts) are named as the corresponding terms in natural language, and the only semantic relation is hypernymy, even though ongoing research for extracting additional relations is progressing. Discrepancies and inconsistencies can be corrected by a team of human specialists, who will verify and rearrange the nodes of the sub-tree forest.</p></div>
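Step 1 above (string inclusion) can be sketched as follows, assuming each term is attached to the longest other term that is a word-level suffix of it:

```python
def string_inclusion_forest(terms):
    """Arrange terms into a forest by word-level string inclusion,
    e.g. 'enterprise application integration' under
    'application integration' under 'integration'.
    Returns a child -> parent map; None marks a root of the forest."""
    term_set = set(terms)
    parents = {}
    for t in term_set:
        words = t.split()
        best = None
        # candidate ancestors are proper word-level suffixes of t;
        # keep the longest one that is itself a lexicon term
        for i in range(1, len(words)):
            suffix = " ".join(words[i:])
            if suffix in term_set and (best is None or len(suffix) > len(best)):
                best = suffix
        parents[t] = best
    return parents
```

Attaching to the longest suffix keeps intermediate levels intact, so "enterprise application integration" lands under "application integration" rather than directly under "integration".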
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This work has been supported by the INTEROP Network of Excellence <ref type="bibr">IST-2003-508011.</ref> </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>For example: "Knowledge management is the systematic management of vital knowledge and its associated processes of creating, gathering, organizing, diffusion", where the kind is "systematic management". A non-well-formed definition, where no kind is provided, is: "The core issue of knowledge management is to place knowledge under management remit to get value from it". • To parse definitory sentences in order to extract the kind information, and possibly more.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>software engineering perspective, a precise representation of specification and implementation concepts that define a class of existing systems". When parsing with the TreeTagger we obtain: Syntactic chunks: (PP NP PP CNP RVP NP PP) POS: (PAJNNN AJN PNCNNWVANPJN) The application of r2 returns: hypernym: representation. The bold POS (N) represents the fragment selected as the hypernym. We then learn that: domain model →kind_of representation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>10 The taxonomy includes 1800 terms belonging to the three main domains of INTEROP, i.e. ontology, enterprise modeling, architectures and platforms. We compared the detected hypernymy relations with the WordNet 11 general-purpose lexicalised ontology, in the following way: let kind_of(w_i, w_j) be a detected hypernymy relation between w_i and w_j, either a direct relation or a chain of hypernymy links, as in the schema example above. If there exist senses S_i of w_i and S_j of w_j such that kind_of(S_i, S_j) holds in WordNet, where again kind_of is either a direct relation or a chain, then mark kind_of(w_i, w_j) as positive. For example, in WordNet there is a direct hypernymy relation between sense #1 of schema and sense #1 of representation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>Table 2: Evaluation of the definition extraction algorithm.</figDesc><table><row><cell>Term</cell><cell>R</cell><cell>A</cell><cell>Ra</cell><cell>N</cell><cell>Pr = Ra/A</cell><cell>Rec = Ra/R</cell></row><row><cell>application integration</cell><cell>5</cell><cell>6</cell><cell>3</cell><cell>35</cell><cell>0.50</cell><cell>0.60</cell></row><row><cell>collaborative system</cell><cell>2</cell><cell>11</cell><cell>2</cell><cell>16</cell><cell>0.18</cell><cell>1.00</cell></row><row><cell>distributed object technology</cell><cell>4</cell><cell>10</cell><cell>4</cell><cell>12</cell><cell>0.40</cell><cell>1.00</cell></row><row><cell>knowledge sharing</cell><cell>9</cell><cell>9</cell><cell>5</cell><cell>38</cell><cell>0.56</cell><cell>0.56</cell></row><row><cell>message exchange</cell><cell>2</cell><cell>3</cell><cell>2</cell><cell>20</cell><cell>0.67</cell><cell>1.00</cell></row><row><cell>ontology alignment</cell><cell>3</cell><cell>3</cell><cell>1</cell><cell>16</cell><cell>0.33</cell><cell>0.33</cell></row><row><cell>open standard</cell><cell>5</cell><cell>14</cell><cell>5</cell><cell>19</cell><cell>0.36</cell><cell>1.00</cell></row><row><cell>process integration</cell><cell>12</cell><cell>4</cell><cell>3</cell><cell>39</cell><cell>0.75</cell><cell>0.25</cell></row><row><cell>schema integration</cell><cell>10</cell><cell>4</cell><cell>1</cell><cell>30</cell><cell>0.25</cell><cell>0.10</cell></row><row><cell>service center</cell><cell>2</cell><cell>18</cell><cell>2</cell><cell>40</cell><cell>0.11</cell><cell>1.00</cell></row><row><cell>Average performance (all steps)</cell><cell/><cell/><cell/><cell/><cell>0.41</cell><cell>0.68</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Hence, for example, knowledge has the following hypernyms: information, fact-and-relationship and meaning. Only the first is selected.</figDesc><table><row><cell>4) After step 3, component terms of the sub-trees ST_i have one or more hypernyms associated. Given a term t: t_l t_r (where t_l and t_r are the left and right components of t, e.g. t = enterprise application integration, t_l = enterprise application, t_r = integration),</cell></row><row><cell>we verify whether there is a multi-word term t': t'_l t'_r in the taxonomy such that t_r = t'_r and either t_l →kind_of t'_l or t'_l →kind_of t_l (e.g. if t = service integration and t' = application integration, it holds that service →kind_of application, and therefore service_integration →kind_of application_integration).</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://interop-noe.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">details on the K-map can be found on the INTEROP web platform</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">Consistency of use across documents is measured through an entropy-based measure called domain consensus</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">In this paper kind_of, genus and hypernym will be used interchangeably to indicate the category to which a concept belongs.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">http://www.meteck.org/AspectsOntologyIntegration.pdf</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_5">http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html We augmented TreeTagger with regular expressions that capture named entities of locations, organizations, products, persons, and time expressions. This allows us to capture other relations besides hypernymy, but this research is still in progress.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_6">http://www.wordnet.princeton.edu WordNet is the most widely used and cited lexicalized computational ontology</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Appendix I highlights in bold the hypernym extracted from each selected definition. Table <ref type="table">3</ref> shows the performance in three domains.</p><p>Table 3: Precision and recall of the hypernymy extraction task in three domains. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Art</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix I. Selection of definitions from web and document warehouses</head><p>Example 1: selection of appropriate definitions from glossaries: "framework" (selected sentences underlined, selected hypernym in bold) Def: A 4 65 87 9 A@ !B 5 C ED 8B C E5 F where the vertical boxes depict the workflow of core processes, and the horizontal boxes depict business subsystems that control the lifecycles of key business objects Weight : 0.1444115 Def: a B F G AH 'I P QB F containing a sequenced set of all groups/segments which relate to a functional business area (or multi-functional business area) and applying to all messages defined for that area (or areas) Weight : 0.12572457 Def: A body of @ SR QT B U VP W5 F designed for high reuse, with specific plugpoints for the functionality required for a particular system Weight : 0.10959378 Def: A framework is an extensible structure for describing a set of concepts, methods, technologies, and cultural changes necessary for a complete product design and manufacturing process Weight : 0.07710117 Def: We use the term framework to refer to a structured collection of software building blocks that can be used and customized to develop components, assemble them into an application, and run the application Weight : 0.07184533 Def: A logical structure for classifying and organizing complex information Weight : 0.059092086 Def: A set of object classes that provide a collection of related functions for a user or piece of software Weight : 0.055604726 Def: The software environment tailored to the needs of a specific domain Weight : 0.046193704 Def: A component that allows its functionality to be extended by writing plug-in modules ("framework extensions") (other definitions follow...)</p><p>Example 2: selecting definitory from non-definitory sentences in free texts: "ontology alignment" (selected sentences underlined, selected hypernym in bold) Def: Ontology alignment is not valuable 
for its own sake, but is worthwhile only in the service of some other function that requires it Weight:0.03227434 Def: ontology alignment refers to the @ !7 B C EP QB 7 R WX , where both the source and target ontology are known and mappings between the two ontologies are used as source for explanation Weight:0.03170026 Def: Ontology alignment is the P WC 'B R WG AP QB F 9 A5 F @ SR QI C 'B 7 R WX of semantic correspondences between the representational elements of heterogeneous systems Weight:0.026186492 Def: Ontology alignment is a foundational problem area for semantic interoperability Weight:0.0204144 Def: ontology alignment is extreme: terms from different ontologies are always assumed to mean different things by default, and all ontology mapping is done by humans (implicitly, by putting them into the same column of a report) Weight:0.020371715 Def: Ontology alignment is also crucial for reusing the existing ontologies and for facilitating their interoperability Weight:0.01861836 Def: Ontology alignment is also very relevant in a Semantic Web context Weight:0.016911233 (other definitions follow...) </p></div>			</div>
			<div type="references">

				<listBibl/>
			</div>
		</back>
	</text>
</TEI>
