=Paper=
{{Paper
|id=Vol-160/paper-6
|storemode=property
|title=Detecting The Emergence Of New Concepts In Web Communities
|pdfUrl=https://ceur-ws.org/Vol-160/paper4.pdf
|volume=Vol-160
|authors=P. D'Amadio,P. Velardi
|dblpUrl=https://dblp.org/rec/conf/caise/DAmadioV05
}}
==Detecting The Emergence Of New Concepts In Web Communities==
<pdf width="1500px">https://ceur-ws.org/Vol-160/paper4.pdf</pdf>
<pre>
         DETECTING THE EMERGENCE OF NEW
           CONCEPTS IN WEB COMMUNITIES


                                   Pierluigi D’Amadio
                                      Paola Velardi

                                              
                  Dipartimento di Informatica, via Salaria 113, Roma, Italy
                              {velardi, damadio}@di.uniroma1.it


       $EVWUDFW This paper describes a methodology to detect the emergence (or the
       disappearance) of concepts through the observation of natural language com-
       munications (NLC). NLC are the documents, e-mails, written communications
       of any kind, that the members of a web community produce, access, and e-
       xchange for their purposes. The emergence of a new concept is suggested by the
       repetitive and consistent use of certain terms, while its intended meaning and
       appropriate conceptualization is obtained through a combination of text mining
       and algebraic methods.


The Self-Evolving Glossary

Building a glossary of terms is often the first step in modeling emerging knowledge
domains and in fostering interoperability between widely distributed communities of
interest, which upload, exchange and share relevant information through the web. Modeling
web communities in the IT society is significant for several reasons (Flake et al.
2002), ranging from socio-cultural aims, like the discovery of interdisciplinary
connections, to more practical applications, like the development of focused search
engines, information filtering and information integration tools.
    However, glossaries capture a static portion of a reality that can instead be highly
dynamic, especially when modeling emerging domains. They are conceived and built
as an “a priori” agreement on common terms, a “frozen” picture of the knowledge and
competences of a community, which might suffer from a shortage of up-to-date
descriptions (Staab, 2002) (Heflin and Hendler 2000). On the other hand, glossary building
is a time-consuming task, involving human effort to identify the relevant terms, agree on
their meaning, and (in thesaura) structure terms according to some taxonomic
ordering. In other words, glossary creation is a consensus building process, often painful
and tedious. There is an inherent risk in re-opening the process again and again.
    The idea that we propose in this paper is that glossaries should be, as much as
possible, self-evolving, continuously capturing the emergence of new concepts in dynamic
web communities. The key to achieving this is to simulate the process of consensus
building in humans, through constant monitoring of natural language communications
(NLC). NLC are the documents, e-mails, and written communications of any kind
that the members of a web community produce, access, and exchange for their
purposes. The emergence of a new concept is suggested by the repetitive and consistent use
of certain terms in NLC. The simulation of consensus can be achieved through statistical
indicators, aimed at selecting terms with certain distributional properties across
the set of observed NLC.
    This paper describes a methodology aimed at implementing the view of a
self-evolving Glossary, detecting the emergence (or the disappearance) of concepts
through the observation of natural language communications. Experiments have been
made in several domains (art, tourism, web-learning, economy and finance), but in this
paper we concentrate on an experiment related to the modeling of a web community
organized through a Network of Excellence, INTEROP¹, on enterprise interoperability.
Partners in INTEROP are academic and industrial institutions belonging to different
research areas, grouped in three domains of expertise: Ontology, Enterprise Modeling,
Architecture and Platforms. One of the main objectives of INTEROP is to
model partners' competences in a Knowledge Map, indexed through a structured
taxonomy of interoperability concepts. The KMap² aims at drawing a picture of the
status of research in interoperability and at keeping this picture up-to-date in the
future. This provided us with an ideal test-bed for our methodology.


Collecting Evidences

The first step of the procedure is to collect a large number of documents in written
form, which should represent as well as possible what is communicated and exchanged among
the members of a community. This is a partly manual, partly automated step, and its
complexity and the effort involved depend strongly upon the community under consideration.
For the purpose of the self-evolving Glossary, documents must be stored with
attached information about the source, authority and date of the acquired document. We
have not developed a specific document warehouse architecture, since this depends
upon the community document collection strategy and organization methods. In
INTEROP, a collaborative platform in Zope/Plone has been adopted by the network
partners (accessible from the INTEROP web site), which is also used to store
documents and related metadata.


1 http://interop-noe.org/

2 details on the K-map can be found on the INTEROP web platform

Extraction of a Domain Lexicon

A domain lexicon L is a list of terms t commonly used within a given community of
interest. The purpose of this phase is to automatically extract simple and multi-word
expressions from the documentation collected in phase 1. Terminological candidates
are multi-word strings with a precise syntactic structure (e.g. compounds,
adjective+compound, etc.) and certain distributional properties across the domain documents.
Examples in various fields are the following: in enterprise interoperability, enterprise
intra-organizational integration; in tourism, gourmet restaurant; in computer
networks, packet switching protocol; in art techniques, chiaroscuro. Statistical and
natural language processing (NLP) tools are used for automatic extraction of terms (details
are in (Navigli and Velardi, 2004)).
    Statistical techniques are specifically aimed at simulating human consensus
in accepting new domain terms. Only terms uniquely and consistently³ found in
domain-related documents, and not found in other domains used for contrast, are
selected as candidates for the domain lexicon.


Extraction of Definitions

Once an initial lexicon is extracted, the subsequent phase is to obtain a list of (one or
more) definitions for each term.
Extraction of definitions, as well as the subsequent step, which is glossary parsing,
relies on a model of well-formed “definitory” sentences, which we describe through a set
of regular expressions. Regular expressions, discussed later in a dedicated section,
have several purposes:
         • To select definitory sentences from those that are not. For example, many
            definitory sentences have the pattern “t is a Y”, but using this pattern
            causes the extraction of a huge amount of non-definitory sentences, for
            example: “Knowledge management is a contradiction in terms, being a
            hangover from an industrial era when control modes of thinking”. Regular
            expressions, along with statistical indicators, are used to prune this noise.
         • To prefer definitory sentences with a precise structure often used by
            professional lexicographers, i.e. one that describes the meaning of a term by
            means of its kind (the so-called genus or hypernym⁴) followed by a
            modifier (what differentiates the concept from its kind, the differentia). For
            example: “Knowledge management is the systematic management of vital
            knowledge and its associated processes of creating, gathering,
            organizing, diffusion”, where the kind is “systematic management”. A non-well
            formed definition is: “The core issue of knowledge management is to place
            knowledge under management remit to get value from it”, where no kind
            is explicitly provided.
         • To parse definitory sentences in order to extract the kind information, and
            possibly more.


3 Consistency of use across documents is measured through an entropy-based measure called
   domain consensus.
4 In this paper kind_of, genus and hypernym will be used interchangeably to indicate the
  category to which a concept belongs.


Extracting Definitions from Glossaries

Google recently provided a new search feature, called “define:”, which can be used to
search definitions of terms on web glossaries. However, using this search facility in an
unconstrained way may cause the retrieval of a large number of often noisy definitions
(not pertinent to the domain). We defined the following algorithm to select pertinent
definitions:
   1) From the set of word components forming the extracted lexicon L of a domain
D, learn a probabilistic model of the domain, i.e. assign a probability of occurrence to
each word component. More precisely, let L be the lexicon of extracted terms, LT the
set of word components appearing in L, and let

    $$E(P(w)) = \frac{freq(w)}{\sum_{w_i \in LT} freq(w_i)}$$

    be the estimated probability of w in D, where $w \in LT$ and the frequencies are
computed in L. For example, if L=[distributed system integration, integration method]
then LT=[distributed, system, integration, method] and E(P(integration))=2/5.
    2) Search the terms in L using the Google “define:” feature. Select only those
definitions def(t), $t \in L$, with the following features:
          a) Domain pertinence: Let $W_t$ be the set of words in def(t). Let
$W'_t \subseteq W_t$ be the subset of words in def(t) belonging to LT. Compute:

    $$weight(def(t)) = \sum_{w \in W_t,\, w \in LT} E(P(w)) \cdot \log\frac{N_t}{n_t^w}$$

where $N_t$ is the number of definitions extracted for the term t, and $n_t^w$ is the
number of such definitions including the word w. The log factor, called inverse document
frequency in the information retrieval literature, reduces the weight of words that have
a very high probability of occurrence in any definition (e.g. system).
Definitions are ordered according to their weight. The first k definitions are selected,
according to a threshold computed for each t⁵: $weight(def(t)) \geq \theta_t$
          b) Well-formedness: apply a final filter to select those def(t) matching the
“genus-differentia” style, expressed through a set of regular expressions described in
detail in section 2.3.


5 We omit the details for the sake of brevity.
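
The probabilistic model of step 1 and the weighting of step 2a can be sketched in Python
as follows. This is a minimal illustration under stated assumptions, not the authors'
implementation: the function names, the whitespace tokenization and the three candidate
definitions are hypothetical; only the two formulas and the lexicon example come from the
text above. The per-term threshold is omitted because its details are not given.

```python
import math
from collections import Counter


def component_probabilities(lexicon):
    """Estimate E(P(w)) for every word component of the lexicon L."""
    counts = Counter(w for term in lexicon for w in term.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def definition_weights(definitions, probs):
    """Weight each candidate definition of one term t (step 2a).

    definitions: definition strings retrieved for t, so N_t = len(definitions).
    probs: dict mapping each word of LT to E(P(w)).
    """
    token_sets = [set(d.lower().split()) for d in definitions]
    n_t = len(definitions)
    # n_t^w: number of definitions of t that contain the word w
    doc_freq = Counter(w for tokens in token_sets for w in tokens if w in probs)
    weights = []
    for tokens in token_sets:
        weight = sum(probs[w] * math.log(n_t / doc_freq[w])
                     for w in tokens if w in probs)
        weights.append(weight)
    return weights


# Lexicon example taken from the text.
lexicon = ["distributed system integration", "integration method"]
probs = component_probabilities(lexicon)

# Hypothetical candidate definitions, invented for illustration only.
candidates = [
    "a method for the integration of a distributed system",
    "a system of beliefs",
    "a kind of pasta",
]
weights = definition_weights(candidates, probs)
ranked = sorted(zip(weights, candidates), reverse=True)
```

In this toy run the word system occurs in two of the three candidates, so its log factor
is smaller than that of integration, which occurs in only one.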
    To compute the performance of this method in the worst ambiguity conditions, we
selected 10 very ambiguous single-word terms in the INTEROP single-word lexicon
LT (including over 1000 words). Three evaluators marked the relevant and non-relevant
definitions (with respect to the domain, i.e. enterprise interoperability). The
inter-annotator agreement was 84%, which reflects the fact that the task is inherently
complex and subjective. We considered only the definitions marked in the same way by at
least two annotators.
7DEOHEvaluation of definition selection algorithm.


7HUP                             5    $    5D    1   1¶   3U 5D$ 5HF 5D5 ,$$
Application                        8     3     3     31   29      1.00     0.38      0.94
Component                          4     2     1     28   26      0.50     0.25      0.93
Data                              15     3     1     26   22      0.33     0.07      0.85
Design                             5     1     1     39   36      1.00     0.20      0.92
Device                             6     7     4     30   23      0.57     0.67      0.77
Framework                         10     3     3     25   15      1.00     0.30      0.60
Knowledge                          3     4     2     26   23      0.50     0.67      0.88
Process                            8     2     2     38   33      1.00     0.25      0.87
Project                            4     4     1     39   34      0.25     0.25      0.87
System                             7     4     4     34   25      1.00     0.57      0.74
$YHUDJH3HUIRUPDQFHDIWHU
VWHSD                                                          0.71     0.36      0.84
$YHUDJH3HUIRUPDQFHDIWHU
VWHSE                                                          0.73     0.72

Legenda: R=relevant definitions (majority-based), A=System-selected definitions N=extracted
definitions, N’ =definitions on which there is agreement (majority-based), Ra=RA,
Pr=Precision, Rec=Recall, IAA=Inter Annotator Agreement.


Table 1 shows the results. Except for the last line, all numbers refer to the result of
step 2a. The effect of step 2b (well-formedness) is a considerable improvement in
recall, and a small increase in precision. Notice that the algorithm always outputs at
least one of the relevant definitions, often the best, even though the annotators were
requested to vote on a yes-no basis. Appendix I provides the complete output for the
term framework. The definitions selected by the algorithm are underlined.
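
As a small worked check of how the precision and recall columns of Table 1 relate to the
counts, the following sketch recomputes Pr = Ra/A and Rec = Ra/R per term. The counts are
copied from the step 2a rows of the table; the variable and function names are
illustrative. Averages recomputed from unrounded values may differ slightly from the
printed averages because of rounding.

```python
# (term, R, A, Ra) copied from the step 2a rows of Table 1
ROWS = [
    ("Application", 8, 3, 3),
    ("Component", 4, 2, 1),
    ("Data", 15, 3, 1),
    ("Design", 5, 1, 1),
    ("Device", 6, 7, 4),
    ("Framework", 10, 3, 3),
    ("Knowledge", 3, 4, 2),
    ("Process", 8, 2, 2),
    ("Project", 4, 4, 1),
    ("System", 7, 4, 4),
]


def precision_recall(relevant, selected, relevant_selected):
    """Return (Pr, Rec) with Pr = Ra/A and Rec = Ra/R."""
    return relevant_selected / selected, relevant_selected / relevant


scores = {term: precision_recall(r, a, ra) for term, r, a, ra in ROWS}
mean_precision = sum(pr for pr, _ in scores.values()) / len(scores)
mean_recall = sum(rec for _, rec in scores.values()) / len(scores)

for term, (pr, rec) in scores.items():
    print(f"{term:12s} Pr={pr:.2f} Rec={rec:.2f}")
```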


Extracting Definitions from NLC

As remarked in the introduction, the Dynamic Glossary needs continuous updates, as
new terms and new fields emerge and are accepted within communities of interest.
Definitions of new terms in well-established communities, and new terminology in an
emerging community, are not found in glossaries, simply because of their novelty. But
it is often the case that the inventors of these terms, or their initial users, provide a
definition in their communications to the reference community. For example, the term
“federated ontology” appeared only in 2001 in the scientific literature (Stumme and
Maedche 2001), but the first explicit definition is in a paper⁶ dated 2004, which rephrases
the concept of federated ontology proposed in a less explicit way in (Stumme and
Maedche 2001): “Federated ontologies are distributed connected ontologies somewhat
analogous to federated databases”.
    Identifying definitions in texts is much more complicated than choosing “good”
definitions in glossaries. Definitions are buried in texts, and they cannot be recognized
by means of simple regular expressions, like “X is a Y”, since, as remarked at the
beginning of this section, these would produce an unacceptable amount of noise. We
devised the following procedure:
    Let L' be the list of terms in L for which no definition was found in the previous
glossary search. For each t in L', do the following:
      1) Extract a set of sentences including t, first from the community-provided
         documents and then (only in case of unsuccessful search) from the web. This
         implies some amount of pre-processing, like the treatment of various formats,
         such as html, doc and pdf. In case of web search, it is also necessary to handle
         the limitations imposed by most search engines on multiple queries.
         A first filtering is applied, using regular expressions that match patterns like
         “t is”, “t defines”, “t refers”, etc.
      2) A second filter selects sentences which include, besides t, some of the words in
         LT (the set of word components appearing in L). The same probabilistic filter
         as in step 2a) of the previous section is applied, with a small variation:

         $$weight(def(t)) = \sum_{w \in W_t,\, w \in LT} E(P(w)) \cdot \log\frac{N_t}{n_t^w} + \alpha \sum_{w \in LT,\, w \in t} E(P(w))$$

         The additional sum in this formula assigns a higher weight to those sentences
         including some of the components of the term t to be defined, e.g. “Schema
         integration is [the process by which schemata from heterogeneous databases are
         conceptually integrated into a single cohesive schema.]”
      3) Finally, the well-formedness criterion of step 2b of the previous section is
         applied.
    Terms are again selected according to a varying threshold, but, in this case, the
threshold must be tuned for high recall, rather than high precision. In fact, for some
terms, there might be very few definitions in the literature and it is important to
capture the majority of them.
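
Steps 1 and 2 of this procedure can be sketched as follows. This is a hedged
illustration, not the authors' code: the function names, the tokenization, the
probability values and the default alpha=1.0 are assumptions made here. In particular,
the additional sum is read as running over the words of the candidate definition body
that are both in LT and components of t, which is one plausible interpretation of the
formula and of the bracketed example above.

```python
import math
import re
from collections import Counter

CUES = r"(is|defines|refers)"  # cue verbs named in step 1


def split_on_cue(term, sentence):
    """Return the text following 't is', 't defines' or 't refers', else None."""
    pattern = re.compile(re.escape(term) + r"\s+" + CUES + r"\b\s*(.*)",
                         re.IGNORECASE)
    match = pattern.search(sentence)
    return match.group(2) if match else None


def nlc_weights(term, sentences, probs, alpha=1.0):
    """Score every sentence that survives the first filter (steps 1 and 2)."""
    bodies = {}
    for sentence in sentences:
        body = split_on_cue(term, sentence)
        if body is not None:
            bodies[sentence] = set(re.findall(r"[a-z]+", body.lower()))
    n_t = len(bodies)
    # n_t^w: number of surviving candidates that contain the word w
    doc_freq = Counter(w for words in bodies.values() for w in words if w in probs)
    components = set(term.lower().split())
    scores = {}
    for sentence, words in bodies.items():
        idf_part = sum(probs[w] * math.log(n_t / doc_freq[w])
                       for w in words if w in probs)
        # Extra credit for words that are in LT and are components of t
        bonus = sum(probs[w] for w in words & components if w in probs)
        scores[sentence] = idf_part + alpha * bonus
    return scores


# Hypothetical probabilities E(P(w)) for a few words of LT.
probs = {"schema": 0.2, "integration": 0.3, "database": 0.1, "process": 0.4}
sentences = [
    "Schema integration is the process by which schemata from heterogeneous "
    "databases are conceptually integrated into a single cohesive schema.",
    "Schema integration is also very relevant in practice.",
    "We discuss schema integration in section 3.",
]
scores = nlc_weights("schema integration", sentences, probs)
```

The third sentence is discarded by the first filter because the term is not followed by
a cue verb; the first one scores highest because it contains words of LT, including a
component of the term itself.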


6 http://www.meteck.org/AspectsOntologyIntegration.pdf
7DEOHEvaluation of the definition extraction algorithm.


7HUP                                 5        $         5D        1        3U 5D$        5HF 5D5
application integration               5          6         3          35          0.50            0.60
collaborative system                  2         11         2          16          0.18            1.00
distributed object
technology                            4         10          4         12           0.40            1.00
knowledge sharing                     9          9          5         38           0.56            0.56
message exchange                      2          3          2         20           0.67            1.00
ontology alignment                    3          3          1         16           0.33            0.33
open standard                         5         14          5         19           0.36            1.00
process integration                   12         4          3         39           0.75            0.25
schema integration                    10         4          1         30           0.25            0.10
service center                         2        18          2         40           0.11            1.00
$YHUDJH3HUIRUPDQFH
 DOOVWHSV                                                                         0.41            0.68


   Table 2 shows the performance obtained when searching 10 terms from the lexicon
L. Appendix I (part 2) shows the definitions, with rating, extracted for the term
ontology alignment, a relatively new term in the area of ontology building.
   After this phase of the ontology updating process, selected definitions are presented
to domain experts with an indication of the source (document or web glossary) and
authoritativeness. Experts can modify, reject or accept each definition⁷.


Parsing of Definitions

This section adds further details on the definition and use of regular expressions. We
use regular expressions⁸ to select well-formed sentences and to extract kind-of
relations from natural language definitions. The components of a regular expression are
fixed words or word sequences, parts of speech and syntactic chunks.
   At first, sentence chunks (e.g. noun phrases NP, prepositional phrases PP, etc.) are
identified using an available syntactic parser, the TreeTagger⁹. For example, the
following regular expression is used to verify the well-formedness criterion:


7 In INTEROP an initial glossary relative to educational objectives has been acquired and evaluated. The
interested reader can access deliverable 10.1 on the web site to learn the details of this process. A
second, large-scale (1800 terms) interoperability glossary has been acquired and will be fully evaluated by
the end of year 2 of the project.
8 http://www.oreilly.com/catalog/regex/chapter/ch04.html
9 TreeTagger is available at
    r0 = "^(PP)?(NP)+"
    This regular expression (see subsequent examples) prescribes a sentence structure
at the chunk level: a definitory sentence is formed by an optional prepositional phrase
(^(PP)?) followed by the main noun phrase ((NP)+, i.e. one or more NP chunks), followed
by anything else, since the expression is not anchored at the end.
    When a sentence matches the well-formedness and probabilistic criteria described
in the previous section, other regular expressions are applied to extract additional
information.
    For example, the following regular expression at the word level is applied (with
others) on the main NP to separate candidate definitions from non-definitions in step 1
of section 2.3.2:
    p = "^(Refers|Referring)\\sto\\s(((a|the)\\s)?(type|kind)\\sof\\s)?(.*)"
    If a sentence is selected as being a definition, additional regular expressions are
used to extract the kind_of (hypernym) information from the main NP.
    For example, consider the regular expression
    r1 = "^(A|D)?((V|C|,|J|N|R)*)(N)".
    Symbols in r1 are part of speech (POS) tags, e.g. article (A), verb (V), adjective
(J), etc.
    A sentence matching both r0 and r1 is:
    domain model: “In the traditional software engineering perspective, a precise
representation of specification and implementation concepts that define a class of
existing systems”
    When parsing with the TreeTagger we obtain:
    Syntactic Chunks: (PP [NP] PP CNP RVP NP PP)
    POS: (PAJNNN AJ[N]PNCNNWVANPJN)
    The application of r1 returns:
    hypernym: representation
    The bracketed POS tag [N] represents the fragment selected as the hypernym.
    We then learn that:

    $$domain\_model \xrightarrow{kind\_of} representation$$
   Appendix I highlights in bold the hypernym extracted from selected definitions.
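
The chunk-level and POS-level expressions of this section can be tried out with Python's
re module, as in the sketch below. It is an illustration under assumptions, not the
authors' pipeline: chunk tags are simply concatenated into one string before matching,
the POS string is assumed to hold one character per word of the main NP, the function
names are hypothetical, and the "Refers to ..." sentence is an invented example.

```python
import re

# Chunk-level well-formedness: optional PP, then one or more NP chunks.
R0 = re.compile(r"^(PP)?(NP)+")
# POS-level pattern: optional article/determiner, modifiers, then the head noun.
R1 = re.compile(r"^(A|D)?((V|C|,|J|N|R)*)(N)")
# Word-level pattern used to spot "Refers to (a kind of) X" definitions.
P = re.compile(r"^(Refers|Referring)\sto\s(((a|the)\s)?(type|kind)\sof\s)?(.*)")


def is_well_formed(chunks):
    """Check the chunk sequence of a sentence against R0."""
    return R0.match("".join(chunks)) is not None


def extract_hypernym(np_words, np_pos):
    """Return the word aligned with the final N matched by R1, or None."""
    match = R1.match(np_pos)
    if match is None:
        return None
    # The overall match ends right after the head noun.
    return np_words[match.end() - 1]


# Chunks and POS tags of the 'domain model' example from the text.
chunks = ["PP", "NP", "PP", "CNP", "RVP", "NP", "PP"]
np_words = ["a", "precise", "representation"]
np_pos = "AJN"

well_formed = is_well_formed(chunks)
hypernym = extract_hypernym(np_words, np_pos)

# Hypothetical sentence for the word-level pattern.
kind = P.match("Refers to a kind of structure").group(6)
```

Because the modifier group is greedy, the regular expression engine first consumes "JN"
and then backtracks by one symbol so that the final (N) can match; the head noun is
therefore the last noun of the initial modifier-noun run.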
   Table 3 shows the performances in three domains.


7DEOHPrecision and recall of the hypernymy extraction task in three domains.

                             Art               Interoperability            Computer
                                                                           Networks
      Precision             0.973                       0.947                 0.955
      Recall                0.957                       0.914                 0.932


http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html We
augmented TreeTagger with regular expressions that capture named entities of locations,
organizations, products, persons, and time expressions. This allows us to capture other
relations besides hypernymy, but this research is still in progress.

Creation of a Taxonomy

Parsing definitions makes it possible to structure the terms in T in taxonomic order.
However, ordering terms according to the hypernyms extracted from definitions has
well-known drawbacks. An interesting paper (Ide and Véronis, 1993) provides an analysis of
typical problems found when attempting to extract (manually or automatically) hypernymy
relations from natural language definitions, e.g. attachments too high in the hierarchy,
unclear choices for more general terms, or-conjoined hypernyms, absence of
hypernym, circularity, etc. These problems (especially over-generality) are more or less
evident when analysing the forest of term trees generated on the basis of glossary
parsing.
   To reduce these problems, we proceeded as follows:
   1) First, we arrange the terms in T taxonomically according to simple string
       inclusion. String inclusion is a very reliable indicator of a taxonomic relation,
       though it does not capture all possible relations. This step produces a forest of
       sub-trees.
   2) Then, we use hypernymy information extracted from definitions to capture
       additional taxonomic relations between terms at the same level of generality (e.g.
       in the example above: representation, model, schema, ontology, knowledge,
       data, information).
   3) If terms have more than one selected definition, or have or-conjoined heads in
       the main NP, more than one hypernym is extracted by the algorithm of section
       2.3. However, we select only hypernyms belonging to the set of domain-relevant
       words LT. Hence, for example, knowledge has the following hypernyms:
       information, fact and relationship, and meaning. Only the first is selected.
   4) After step 3, component terms of the sub-trees $ST_i$ have one or more hypernyms
       associated. Given a term $t = t_l t_r$ (where $t_l$ and $t_r$ are the left and right
       components of t, e.g. t = enterprise application integration, $t_l$ = enterprise
       application, $t_r$ = integration), we verify whether there is a multi-word term
       $t' = t'_l t'_r$ in the taxonomy such that $t_r = t'_r$ and either
       $t_l \xrightarrow{kind\_of} t'_l$ or $t'_l \xrightarrow{kind\_of} t_l$ (e.g.
       if t = service integration and t' = application integration, it holds that
       $service \xrightarrow{kind\_of} application$, and therefore
       $service\_integration \xrightarrow{kind\_of} application\_integration$).
   Appendix II shows a small fragment of the complete INTEROP taxonomy¹⁰ (the
sub-trees rooted in integration). At the end of Appendix II we also show an excerpt of
the detected hypernymy relations, used in step 4.
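
Steps 1 and 4 of the procedure above can be sketched as follows. This is an
illustration under assumptions rather than the authors' algorithm: string inclusion is
approximated here as word-level suffix inclusion (a term is attached to the longest
other term that forms its right-hand part), the right component of a term is assumed
to be its last word, and the function names and the small term list are hypothetical.

```python
def build_forest(terms):
    """Step 1 (sketch): map each term to its parent by suffix inclusion."""
    term_set = set(terms)
    parent = {}
    for term in terms:
        words = term.split()
        parent[term] = None
        # Drop words from the left until a known, shorter term is found.
        for start in range(1, len(words)):
            suffix = " ".join(words[start:])
            if suffix in term_set:
                parent[term] = suffix
                break
    return parent


def infer_kind_of(terms, component_kind_of):
    """Step 4 (sketch): lift kind_of links between left components.

    component_kind_of: set of (hyponym, hypernym) pairs between left components.
    Returns a set of (hyponym term, hypernym term) pairs.
    """
    split = {}
    for term in terms:
        words = term.split()
        if len(words) >= 2:
            split[term] = (" ".join(words[:-1]), words[-1])
    inferred = set()
    for t, (t_left, t_right) in split.items():
        for t2, (t2_left, t2_right) in split.items():
            if t != t2 and t_right == t2_right and (t_left, t2_left) in component_kind_of:
                inferred.add((t, t2))
    return inferred


terms = [
    "integration",
    "application integration",
    "enterprise application integration",
    "service integration",
]
forest = build_forest(terms)
links = infer_kind_of(terms, {("service", "application")})
```

With these inputs, service integration and application integration are siblings under
integration after step 1, and step 4 adds a kind_of link from the former to the latter.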
   Ordering terms taxonomically is a highly subjective task, therefore it is not easy to
evaluate the output of this phase. Gold standards are not available, especially in
sub-domains. However, we did a small experiment: given the initial integration,
interoperability and system taxonomy, our method was able to detect 25 hypernymy relations,
e.g.

      $$schema \rightarrow design \rightarrow model \rightarrow representation$$

10 The taxonomy includes 1800 terms belonging to the three main domains of INTEROP, i.e.
     ontology, enterprise modeling, architectures and platforms.

   We compared these relations with the WordNet¹¹ general purpose lexicalised
ontology, in the following way:
   let $kind\_of(w_i, w_j)$ be a detected hypernymy relation between $w_i$ and $w_j$,
either a direct relation or a chain of hypernymy links, as in the schema example above.
   If in WordNet it holds that $kind\_of(S_i, S_j)$, with $S_i \in senses\_of(w_i)$ and
$S_j \in senses\_of(w_j)$, where again kind_of is either a direct relation or a chain,
then mark $kind\_of(w_i, w_j)$ as positive. For example, in WordNet there is a direct
hypernymy relation between sense #1 of schema and sense #1 of representation.
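
The comparison rule can be sketched with a small stand-in for WordNet. The sketch does
not call any WordNet library: the sense inventory and hypernym links below are a
hypothetical toy graph, apart from the schema/representation link mentioned in the
text, and the function names are illustrative.

```python
from collections import deque


def is_hypernym_chain(sense, target, hypernyms):
    """True if target is reachable from sense through one or more hypernym links."""
    seen = set()
    queue = deque(hypernyms.get(sense, ()))
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        if current in seen:
            continue
        seen.add(current)
        queue.extend(hypernyms.get(current, ()))
    return False


def confirmed(w_i, w_j, senses_of, hypernyms):
    """Mark kind_of(w_i, w_j) as positive if some pair of senses is linked."""
    return any(
        is_hypernym_chain(s_i, s_j, hypernyms)
        for s_i in senses_of.get(w_i, ())
        for s_j in senses_of.get(w_j, ())
    )


# Toy sense inventory and hypernym graph (hypothetical, for illustration only).
senses_of = {
    "schema": ["schema#1", "schema#2"],
    "representation": ["representation#1"],
    "design": ["design#1"],
    "model": ["model#1"],
}
hypernyms = {
    "schema#1": ["representation#1"],
    "model#1": ["plan#1"],
    "plan#1": ["representation#1"],
}

direct = confirmed("schema", "representation", senses_of, hypernyms)
chained = confirmed("model", "representation", senses_of, hypernyms)
missing = confirmed("schema", "design", senses_of, hypernyms)
```

A breadth-first search is used so that both direct relations and chains of links are
accepted, as required by the rule above.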
    The evaluation showed that about 33% of the detected relations match a
“gold standard” taxonomy like WordNet. On the other hand, WordNet is a general
purpose ontology, and some of the non-matching relations detected by our
methodology still seem very reasonable in the interoperability domain, as the reader
may verify by evaluating the detected kind_of links in Appendix II. Notice that, as
expected, the major problem is the over-generality of certain hypernymy links (e.g.
everything is a “system”).
    In any case, our purpose here is not to fully overcome problems that are inherent
in the conceptually complex task of building a domain concept hierarchy. At the
end of this process we obtain a forest of trees where nodes (the concepts) are named
after the corresponding terms in natural language, and the only semantic relation is
hypernymy, although research on extracting additional relations is in progress.
Discrepancies and inconsistencies can be corrected by a team of human specialists,
who will verify and rearrange the nodes of the sub-tree forest.


Acknowledgements

This work has been supported by the INTEROP Network of Excellence IST-2003-508011.


11
     http://www.wordnet.princeton.edu WordNet is the most widely used and cited lexicalized
     computational ontology
References

   (Heflin and Hendler, 2000) Heflin, J. and Hendler, J. Dynamic Ontologies on the
Web. In: Proceedings of the Seventeenth National Conference on Artificial Intelligence
(AAAI-2000).
   (Kleinberg 1998) Kleinberg, J. Authoritative sources in a hyperlinked environment.
Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
   (Navigli and Velardi, 2004) Navigli, R. and Velardi, P. (2004). Learning Domain
Ontologies from Document Warehouses and Dedicated Web Sites. Computational
Linguistics, MIT Press, 30(2).
   (Staab, 2002) Staab, S. Emergent Semantics. IEEE Intelligent Systems, v.17 n.1,
p.78-86, January 2002.
   (Stumme and Maedche 2001) Stumme, G. and Maedche, A. Ontology Merging for
Federated Ontologies on the Semantic Web. Workshop on Ontologies and Information
Sharing, IJCAI.


Appendix I: Selection of definitions from web and document warehouses

Example 1: selection of appropriate definitions from glossaries: “framework”
(selected sentences underlined, selected hypernym in bold)

Def: A grid structure where the vertical boxes depict the workflow of core processes, and the horizontal
boxes depict business subsystems that control the lifecycles of key business objects
Weight: 0.1444115
Def: a template containing a sequenced set of all groups/segments which relate to a functional business
area (or multi-functional business area) and applying to all messages defined for that area (or areas)
Weight: 0.12572457
Def: A body of software designed for high reuse, with specific plugpoints for the functionality required
for a particular system
Weight : 0.10959378
Def: A framework is an extensible structure for describing a set of concepts, methods, technologies, and
cultural changes necessary for a complete product design and manufacturing process
Weight : 0.07710117
Def: We use the term framework to refer to a structured collection of software building blocks that can be
used and customized to develop components, assemble them into an application, and run the application
Weight : 0.07184533
Def: A logical structure for classifying and organizing complex information
Weight : 0.059092086
Def: A set of object classes that provide a collection of related functions for a user or piece of software
Weight : 0.055604726
Def: The software environment tailored to the needs of a specific domain
Weight : 0.046193704
Def: A component that allows its functionality to be extended by writing plug-in modules ("framework
extensions")
(other definitions follow...)

Example 2: selecting definitory from non-definitory sentences in free texts: “ontology
alignment” (selected sentences underlined, selected hypernym in bold)

Def: Ontology alignment is not valuable for its own sake, but is worthwhile only in the service of
some other function that requires it
Weight: 0.03227434
Def: ontology alignment refers to the situation, where both the source and target ontology are known and
mappings between the two ontologies are used as source for explanation
Weight: 0.03170026
Def: Ontology alignment is the automated resolution of semantic correspondences between the
representational elements of heterogeneous systems
Weight: 0.026186492
Def: Ontology alignment is a foundational problem area for semantic interoperability
Weight: 0.0204144
Def: ontology alignment is extreme: terms from different ontologies are always assumed to mean different
things by default, and all ontology mapping is done by humans (implicitly, by putting them into the same
column of a report)
Weight: 0.020371715
Def: Ontology alignment is also crucial for reusing the existing ontologies and for facilitating their
interoperability
Weight: 0.01861836
Def: Ontology alignment is also very relevant in a Semantic Web context
Weight: 0.016911233
(other definitions follow...)


Appendix II: An excerpt of sub-trees extracted from the INTEROP domain

integration                                                          model_integration
  system_engineering_integration                                       ontology_integration
  sensing_integration                                                  enterprise_model_integration
  system_sensing_integration                                           schema_integration
      enterprise_system_sensing_integration                            scheduling_integration
  strategy_integration                                                      process_integration
    business_strategy_integration                                    scheduling_integration
  software_integration                                               design_process_integration
    application_integration                                          business_process_integration
        enterprise_application_integration                             on_demand_business_process_integration
            legacy_enterprise_application_integration                planning_integration
                                                                     enterprise_integration
  service_integration                                              system_integration
    web_service_integration                                          information_system_integration
  computing_integration                                              process_integration
    enterprise_computing_integration                                   scheduling_integration
  inter_organisational_integration                                     design_process_integration
    enterprise_inter_organisational_integration                        business_process_integration
  intra_organisational_integration
    enterprise_intra_organisational_integration                  on_demand_business_process_integration
  organization_integration                                             planning_integration
  conceptual_integration                                               enterprise_integration
  representation_integration                                         natural_language_based_system_integration
    view_integration
    ontology_integration                                             distributed_system_integration
  database_system_integration                      content_integration
  enterprise_application_integration                    multilingual_content_integration
    legacy_enterprise_application_integration      enterprise_information_integration
  schema_integration                                  legacy_enterprise_information_integration
method_of_integration
component_integration                                  intelligent_information_integration
supply_chain_integration                           ontology_based_integration
  human_supply_chain_integration                    business_process_support_integration
semantic_integration                                database_integration
ontology_driven_integration                         data_automatic_integration
information_integration
  knowledge_integration

</pre>