=Paper=
{{Paper
|id=None
|storemode=property
|title=Cross-language Semantic Matching for Discovering Links to e-gov Services in the LOD Cloud
|pdfUrl=https://ceur-ws.org/Vol-992/paper4.pdf
|volume=Vol-992
|dblpUrl=https://dblp.org/rec/conf/esws/NarducciPS13
}}
==Cross-language Semantic Matching for Discovering Links to e-gov Services in the LOD Cloud==
Fedelucio Narducci¹, Matteo Palmonari¹, and Giovanni Semeraro²

¹ Department of Information Science, Systems Theory, and Communication, University of Milano-Bicocca, Italy (surname@disco.unimib.it)
² Department of Computer Science, University of Bari Aldo Moro, Italy (giovanni.semeraro@uniba.it)

Abstract. The large diffusion of e-gov initiatives is increasing the attention of public administrations towards the Open Data initiative. The adoption of open data in the e-gov domain produces several advantages: more transparent government, development of better public services, economic growth, and social value. However, the process of opening data should adopt standards and open formats. Only in this way is it possible to share experiences with other service providers, to exploit best practices from other cities or countries, and to be easily connected to the Linked Open Data (LOD) cloud. In this paper we present CroSeR (Cross-language Service Retriever), a tool able to match and retrieve cross-language e-gov services stored in the LOD cloud. The main goal of this work is to help public administrations connect their e-gov services to services, provided by other administrations, that are already connected to the LOD cloud. We adopted a Wikipedia-based semantic representation in order to overcome the problems of matching the very short textual descriptions associated with the services. A preliminary evaluation on an open catalog of e-gov services showed that the adopted techniques are promising and more effective than techniques based only on a keyword representation.

1 Introduction and Motivations

The main motivation behind the success of the Linked Open Data (LOD) initiative is related to well-known advantages coming from the interconnection of information sources, such as improved discoverability, reusability, and utility of information [11].
In recent years, many governments have decided to make public their data about spending, service provision, economic indicators, and so on. These datasets are also known as Open Government Data (OGD). As of February 2013, more than 1,000,000 OGD datasets had been put online by national and local governments from more than 40 countries in 24 different languages (http://logd.tw.rpi.edu/iogds_data_analytics). As the interest of governments in LOD has grown over the last years, a roadmap consisting of three data-processing stages, namely the open stage, the link stage, and the reuse stage, has been proposed to drive the transition from OGD to Linked Open Government Data (LOGD) [2]. The SmartCities project (http://www.smartcities.info/aim) is worth mentioning in this context. The general aim of that project is to create an innovation network between governments and academic partners, leading to excellence in the development and uptake of e-services and setting a new baseline for e-service delivery in the whole North Sea region. The project involves seven countries of the North Sea region: England, the Netherlands, Belgium, Germany, Scotland, Sweden, and Norway. One of the most interesting results of this project is the European Local Government Service List (LGSL), part of the Electronic Service Delivery (ESD) toolkit website (http://www.esd.org.uk/esdtoolkit/). The goal of the LGSL is to build standard lists (i.e., ESD-standards) which define the semantics of public sector services. Each country involved in the project is responsible for building and maintaining its list of public services delivered to citizens, and all of those services are interlinked with the services delivered by other countries. The ESD-standards are already linked to the LOD cloud (http://lod-cloud.net/). The LGSL is a great opportunity for local and national governments all over Europe.
Linking national or local service catalogs to the LGSL makes local or national services searchable in several languages, also improving the ability of EU citizens to access services in a foreign country, an explicit objective of the Digital Agenda for Europe (DAE) [1]. Moreover, local and national governments can learn best practices of service offerings across Europe and compare their services to make their service offerings more valuable [13]. Finally, by linking e-service catalogs to the LGSL, additional information can be exploited; e.g., services in the LGSL are linked to a taxonomy of life events, which is useful to enrich the service catalogs and support navigation. However, manually linking e-service catalogs, often consisting of several hundreds or thousands of services, to the LGSL requires a lot of effort, which often prevents administrations from taking advantage of becoming part of the LOD cloud. Automatic cross-language ontology matching methods can support local and national administrations in linking their service catalogs to the LGSL, and therefore to the LOD cloud, by reducing the cost of this activity. Although some cross-language ontology matching methods have been proposed [16], applying these methods to the problem of linking local and national service catalogs means dealing with the poor quality of the descriptions available in the catalogs. Services are represented by minimal descriptions that often consist of the name of the service and very few other data. Furthermore, as shown in Figure 1, the labels associated with services linked in the LGSL are not mere translations from one language to another.
As an example, the German service (literally translated as) Acquisition of children daycare contributions and the Dutch service (literally translated as) Grant Babysitting/Child Services have been manually linked to the English service Nursery education grant by domain experts. Therefore, the automatic matching of the service text labels is not a trivial task.

Fig. 1: Examples of linked services in the LGSL. Services with the same ID number are linked by the owl:sameAs relation in the LGSL. The automatic English translation powered by Bing is reported in brackets.

In this paper we propose the Cross-language Service Retriever (CroSeR), a tool to support the linkage of a source e-service catalog represented in any language to a target catalog represented in English, where both the source and target catalogs are characterized by minimal descriptions. Ultimately, the aim of CroSeR is to support human annotators by simplifying the selection of possibly matching services. Our tool exploits a cross-language ontology matching technique that uses an off-the-shelf machine translation tool and annotates the translated descriptions with Wikipedia concepts in order to extract semantic representations of the services; candidate links are retrieved by evaluating the similarity between the extracted semantic representations. Our method is independent of the language adopted in the source catalog and does not assume the availability of further information about the services other than the very short text descriptions used as names for the services. We conduct an experiment using the English, German, and Dutch catalogs from the LGSL dataset. In the experiment we compare several configurations of our system that leverage different semantic annotation tools and the Explicit Semantic Analysis (ESA) [7] method.
Our preliminary results show that the method based on ESA outperforms both the methods based on other annotation tools and a baseline where no semantic representation is used.

The rest of this paper is organized as follows. Section 2 analyzes the state of the art. Section 3 describes the general architecture of our system, and Section 4 presents the tools exploited for obtaining a Wikipedia-based representation of e-gov services. Finally, experimental results are presented in Section 5, and Section 6 summarizes conclusions and future work.

2 Related Work

Ontology matching, link discovery, and entity linking are tightly related research areas. In all of these areas, automatic or semi-automatic matching techniques are applied to discover correspondences among semantically related entities that appear in a source and a target information source [16]. Different types of correspondences have been addressed (e.g., equivalence, subclass, same as, and so on), depending on the types of entities considered (e.g., ontology concepts, ontology instances, generic RDF resources) and information sources (web ontologies, linked datasets, semi-structured knowledge bases). Cross-language ontology matching is the problem of matching a source ontology that uses terms from a natural language L with a target ontology that uses terms from another natural language L′ (e.g., L is German and L′ is English) [17]; multi-lingual ontology matching is the problem of matching two ontologies that use more than one language each, where the languages used in each ontology can also overlap [17]. These definitions can be easily extended to semantic matching tasks over other types of information sources (e.g., cross-language and multi-lingual matching of two document corpora). In the following we discuss the most relevant approaches to cross-language matching proposed over different information sources.
The most widely adopted approach to cross-language ontology matching is to transform a cross-lingual matching problem into a monolingual one by leveraging automatic machine translation tools [17, 6, 19]. However, the accuracy of automatic machine translation tools is limited, and several strategies have been proposed to improve the quality of the final matches. One of the most recent approaches uses a Support Vector Machine (SVM) to learn a matching function for ontologies represented in different languages [17]. This method uses features defined by combining string-based and structural similarity metrics. A translation process powered by Microsoft Bing (http://www.bing.com/translator) is used to build the feature vectors in a unique reference language (English). A first difference with respect to our work is that the proposed approach relies heavily on structural information derived from the ontology; this information is very poor in our scenario and is not used in our method. Other translation-based approaches also use structural information, i.e., neighboring concepts [6] and instances [19], which is not available in our scenario. Two ontology matching methods have been recently proposed which use the concepts' names, labels, and comments to build search keywords and query web data. The first approach queries a web search engine and uses the results to compute the similarity between the ontology concepts [14]. The system also supports cross-language alignment by leveraging the Bing API to translate the keywords. The second approach submits queries to the Wikipedia search engine [8]. The similarity between a source and a target concept is based on the similarity of the Wikipedia articles retrieved for the concepts. Cross-language matching is supported by using the links between articles written in different languages, which are available in Wikipedia, and by comparing the articles in a common language.
The authors observe that their approach has problems when it tries to match equivalent ontology elements that use a different vocabulary and lead to very different translations (e.g., Autor von (de) and has written (en)). Although we also leverage Wikipedia, our matching process uses semantic annotation tools and ESA. We can therefore incorporate light-weight disambiguation techniques (provided by the semantic annotation tools) and match entities that, when translated, are represented with significantly different terms (in particular when the system uses the ESA model). Another interesting work in the literature applies Explicit Semantic Analysis (ESA) to cross-language link discovery [9]. The goal of that paper is to investigate how to automatically generate cross-language links between resources in large document collections. The authors show that semantic similarity based on ESA is able to produce results comparable to those achieved by graph-based methods. However, in that domain, algorithms can leverage a significant amount of text that is not available in our case. Finally, the impact of translation quality on matching quality in cross-lingual scenarios is investigated in [5]. From that research it emerges that good translation quality is a prerequisite for achieving good-quality cross-lingual matches; even so, the resulting accuracy is likely to remain lower than that of monolingual matching [17].

3 CroSeR: Cross-language Service Retriever

Fig. 2: CroSeR general architecture

We assume that each service is labeled by a short textual description (called service label in the following), e.g., see the examples in Figure 1, and represents an abstract service, i.e., a high-level description of concrete services offered by one or more providers (http://www.smartcities.info/files/Smart Cities Brief What is a service list.pdf) [13].
The intuitive semantics of a link between a source and a target service description is that the two descriptions refer to the same abstract service. Although service descriptions conceptually represent categories of concrete services, which are offered by actual providers, these descriptions are represented as ontology instances and, consistently with the ESD approach, we represent the links with owl:sameAs relations. In order to discover links between a source list of services described in an arbitrary language L and a target list of services described in English (the LGSL), we designed a matching algorithm that retrieves the top-k services that are the best candidates to be linked to a given source service. We implemented the matching algorithm in a system called Cross-language Service Retriever (CroSeR). Given a source service described in an arbitrary language L, and the target list of services described in English (the LGSL), CroSeR returns a ranked list of k services in the LGSL that are candidates to be owl:sameAs-related to the source service. The user then uses the service list returned by CroSeR to validate the link between the source and the target service. The architecture of CroSeR is depicted in Figure 2. The system consists of two components: the Content Analyzer analyzes the service descriptions in the source and target lists and builds a semantic annotation for each service; the Retriever discovers the links between the source and target services by computing the similarity between the semantic annotations.

Content Analyzer. The input to the Content Analyzer consists of the service catalogs in different languages to be linked to the LGSL. The first step performed by this component is to translate the service labels into English by leveraging the Bing API (http://www.microsoft.com/en-us/translator/). Next, the translated labels are used to obtain the Wikipedia-based annotation of the services.
For each service s, a set of Wikipedia concepts Ws semantically related to the service label is generated; we call the set Ws the Wikipedia-based annotation of s. The set Ws is built by processing the short text in the label of the service with semantic annotation techniques. The Wikipedia concepts are generated for all the services (the English ones as well), since we need to adopt a unified representation. The Wikipedia-based annotation aims to capture the main topics (represented by the corresponding Wikipedia concepts) related to a service. In this context, a Wikipedia concept is defined as the title of a Wikipedia article. This solution is also able to perform a sort of word sense disambiguation of a natural-language text without applying elaborate algorithms based on lexical ontologies such as WordNet (http://wordnet.princeton.edu/). Furthermore, the annotation of a service with a set of Wikipedia concepts represents an additional link between the service and the LOD cloud (by using DBpedia as input).

Retriever. The Retriever adopts the Vector Space Model (VSM) to represent services in a multidimensional space where each dimension is a Wikipedia concept. Each service is thus represented as a point in that space. In this first implementation we weight each concept in Ws with the simplest scheme, the number of occurrences of the concept in Ws. Formally, each service is represented as a vector s = <w1, ..., wn>, where wk is the number of occurrences of the Wikipedia concept k in Ws. We conjecture that more sophisticated weighting schemes such as TF-IDF and BM25 [15] would not improve the performance of our system at this stage. Finally, the similarity between two services (vectors) is computed in terms of cosine similarity.
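As a minimal sketch of this retrieval step, occurrence-weighted concept vectors compared with cosine similarity can be implemented as follows; the service names and concept annotations below are illustrative examples, not actual CroSeR output.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse occurrence vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(source: Counter, targets: dict, k: int = 3):
    """Return the top-k target services most similar to the source."""
    scored = [(name, cosine(source, ws)) for name, ws in targets.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

# Hypothetical Wikipedia-based annotations (concept occurrence counts).
source = Counter(["License", "Festival"])  # translated Dutch label
targets = {
    "Licences - entertainment": Counter(["License", "Entertainment"]),
    "Nursery education grant":  Counter(["Nursery", "Education", "Grant"]),
}
print(rank(source, targets, k=2))
# → [('Licences - entertainment', 0.5), ('Nursery education grant', 0.0)]
```

The sparse `Counter` representation mirrors the fact that each service label yields only a handful of concepts, so only overlapping dimensions contribute to the dot product.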
Therefore, given a service in one of the supported languages (the query), the Retriever returns a ranked list of the most similar English services from the LGSL. Note that we adopt a variety of techniques (described in detail in Section 4) for annotating the services, each of which corresponds to a different CroSeR configuration. In addition to those Wikipedia-based configurations, we also evaluated our system with hybrid configurations obtained by merging the keywords extracted from the label associated with s with the corresponding Wikipedia concepts in Ws. In that case, all the above-mentioned definitions are still valid, with the slight difference that the vector space is built over both keywords and Wikipedia concepts (instead of Wikipedia concepts alone).

4 Semantic annotation of e-gov services

We exploited different techniques for semantically annotating service labels with a set of Wikipedia concepts. In particular, we adopted three well-known online services that perform semantic annotation, namely Wikipedia Miner, Tagme, and DBpedia Spotlight, and we implemented a semantic feature generation tool based on the Explicit Semantic Analysis (ESA) technique. The online services take as input a text description (the service label) and return a set of Wikipedia concepts that emerge from the input text. ESA also generates a set of Wikipedia concepts as output, but the insight behind it is quite different. All of these services allow some parameters to be configured in order to favor recall or precision. Given the conciseness of the input text in our domain, we set those parameters to favor recall over precision.

Wikipedia Miner. Wikipedia Miner is a tool for automatically cross-referencing documents with Wikipedia [12]. The software is trained on Wikipedia articles, and thus learns to disambiguate and detect links in the same way as Wikipedia editors [3].
The first step is the disambiguation of terms within the text by means of a classifier. Several features are exploited to learn the classifier; the two main ones are commonness and relatedness [10]. The commonness of a Wikipedia article is defined by the number of times it is used as the destination of some anchor text. For example, the anchor text tree has a higher commonness value for the article Tree (plant) than for the article Tree (data structure). However, this feature alone is not sufficient to disambiguate a term. Therefore, the algorithm compares each possible sense (Wikipedia article) for a given term with its surrounding context by computing a relatedness value. This measure computes the similarity of two articles by comparing their incoming and outgoing links.

Tagme. Tagme is a system that performs accurate, on-the-fly semantic annotation of short texts using Wikipedia as knowledge base [4]. The annotation process is composed of two main phases: anchor disambiguation and anchor pruning. The disambiguation is based on a process called "collective agreement". Given an anchor text a associated with a set of candidate Wikipedia pages Pa, each other anchor in the text casts a vote for each candidate annotation in Pa. As in Wikipedia Miner, the vote is based on the commonness of, and the relatedness between, the candidate page pi ∈ Pa and the candidate pages associated with the other anchors in the text. Subsequently, anchor pruning is performed by deleting the candidate pages in Pa that are not considered meaningful. This process takes into account the probability of the anchor text being used as a link in Wikipedia and the coherence between the candidate page and the candidate pages of the other anchors in the text.

DBpedia Spotlight. DBpedia Spotlight [11] was designed with the explicit goal of connecting unstructured text to the LOD cloud by using DBpedia as hub.
Also in this case the output is a set of Wikipedia articles related to a text, retrieved by following the URIs of the DBpedia instances. The annotation process works in four stages. First, the text is analyzed in order to select the phrases that may indicate a mention of a DBpedia resource; in this step, any spots composed only of verbs, adjectives, adverbs, and prepositions are disregarded. Subsequently, a set of candidate DBpedia resources is built by mapping each spotted phrase to the resources that are candidate disambiguations for that phrase. As in the above-mentioned tools, the disambiguation process uses the context around the spotted phrase to choose the best among the candidates. Finally, there is a configuration step whose goal is to set the best parameter values for the text to be disambiguated.

Explicit Semantic Analysis. Explicit Semantic Analysis (ESA) is a technique proposed by Gabrilovich and Markovitch [7] that uses Wikipedia as a space of concepts explicitly defined and described by humans. The idea is that the meaning of a generic term (e.g., ball) can be described by a list of concepts it refers to (e.g., the Wikipedia articles volleyball, soccer, football, ...). Formally, given the space of Wikipedia concepts C = {c1, c2, ..., cn}, a term ti can be represented by its semantic interpretation vector vi = <wi1, wi2, ..., win>, where wij represents the strength of the association between ti and cj. Weights are obtained from a matrix T, called the esa-matrix, in which each of the n columns corresponds to a concept and each row corresponds to a term of the Wikipedia vocabulary, i.e., the set of distinct terms in the corpus of all Wikipedia articles. Cell T[i, j] contains wij, the tf-idf value of term ti in the article (concept) cj. The semantic interpretation vector for a text fragment f (i.e.,
a sentence, a document, a service label) is obtained by computing the centroid of the semantic interpretation vectors associated with the terms occurring in f. We can observe that while the intuition behind Wikipedia Miner, Tagme, and DBpedia Spotlight is quite similar, ESA implements a different approach. Indeed, the first three tools identify Wikipedia concepts already present in the text, whereas ESA generates new articles related to a given text by using Wikipedia as knowledge base. As an example, suppose that we want to annotate the service label Home Schooling. Wikipedia Miner, Tagme, and DBpedia Spotlight link it to the Wikipedia article Homeschooling, while ESA generates (as centroid vector) the Wikipedia articles Home, School, Education, Family, .... Hence, we can state that the first three tools perform a sort of topic identification of a given text, while ESA performs a feature generation process by adding new knowledge to the input text. Another example reinforces the motivation behind the need for a semantic annotation of the service labels. Consider the English service label Licences - entertainment and the corresponding Dutch service in the LGSL, Vergunning voor Festiviteiten (translated as: Permit for Festivities). A keyword-based approach never matches these two services. Conversely, Tagme annotates the English service with the Wikipedia concepts License and Entertainment, and the translated Dutch label with the concepts License and Festival.

5 Experimental Evaluation

In this experiment we evaluated the different configurations in terms of: (1) effectiveness in retrieving the correct service in a list of k services to be presented to the user; (2) capability of ranking the correct service in the first positions of the ranked list.

Design and Dataset. We adopted two different metrics: Accuracy@n (a@n) and Mean Reciprocal Rank (MRR) [18]. The a@n is calculated considering only the first n retrieved services.
If the correct service occurs in the top-n items, the service is marked as correctly retrieved. We considered the values n = 1, 3, 5, 10, 20, 30. The second metric (MRR) considers the rank of the correctly retrieved service and is defined as follows:

MRR = (1/N) · Σ_{i=1}^{N} (1/rank_i) ,   (1)

where rank_i is the rank of the correctly retrieved service_i in the ranked list, and N is the number of services correctly retrieved. The higher the position of the correctly retrieved services in the list, the higher the MRR value for a given configuration. We adopted this normalization, instead of taking N as the total number of services in the catalogue, in order to evaluate the ranking of each configuration independently of its coverage. Hence, a configuration with a good ranking but a small coverage will obtain a higher MRR value than a configuration with a better coverage but a worse ranking. The dataset is extracted from the esd-toolkit catalogue, freely available online (http://standards.esd-toolkit.eu/EuOverview.aspx). We indexed English, Dutch, and German services. The number of Dutch services is 225, and the number of German services is 190. For each service we extracted its textual label and represented it in terms of Wikipedia concepts by exploiting the methods described in Section 4. The labels have an average length of about three words.

Results. The baseline of our experiment is the keyword-based configuration. For that configuration, only stemming and stopword elimination are performed on the text. We compared the baseline with the four Wikipedia-based configurations mentioned above (i.e., Wikipedia Miner, Tagme, DBpedia Spotlight, ESA), as well as with combinations of the keyword and Wikipedia configurations. The latter were obtained by adding to the keywords the corresponding Wikipedia concepts generated by the different methods. Results for the Dutch language are reported in Table 1.
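The two evaluation metrics can be sketched as follows; the query identifiers are hypothetical, and treating a service as "correctly retrieved" when it appears anywhere in the returned list (here cut off at n = 30, as in the largest a@n value) is our reading of the normalization described above.

```python
def accuracy_at_n(ranked_lists: dict, correct: dict, n: int) -> float:
    """Fraction of queries whose correct service appears in the top-n results."""
    hits = sum(1 for q, results in ranked_lists.items()
               if correct[q] in results[:n])
    return hits / len(ranked_lists)

def mrr(ranked_lists: dict, correct: dict, n: int = 30) -> float:
    """Mean reciprocal rank, averaged over the correctly retrieved
    services only (as in Eq. 1), not over the total number of queries."""
    ranks = []
    for q, results in ranked_lists.items():
        if correct[q] in results[:n]:
            ranks.append(results.index(correct[q]) + 1)  # 1-based rank
    return sum(1.0 / r for r in ranks) / len(ranks) if ranks else 0.0

# Toy example: three source services, their ranked candidate lists,
# and the gold links (all identifiers are made up).
ranked = {"s1": ["a", "b", "c"], "s2": ["d", "e"], "s3": ["x", "y"]}
correct = {"s1": "b", "s2": "d", "s3": "z"}
print(accuracy_at_n(ranked, correct, 1))  # 1 of 3 queries hit at rank 1
print(mrr(ranked, correct))               # (1/2 + 1/1) / 2 = 0.75
```

Note how the normalization rewards a configuration like Wikipedia Miner in the tables below: s3 is simply excluded from the MRR average because its correct service is never retrieved.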
We can observe that the best configuration in terms of a@n is ESA. That configuration becomes more effective when several services are returned (n > 5), since matching is particularly difficult in this domain. The worst configuration is the one based on Wikipedia Miner, likely due to the low effectiveness of Wikipedia Miner in identifying topics in very short texts. Most configurations improve their accuracy when the Wikipedia concepts are merged with the keywords; however, they do not generally overcome the baseline. The only tool for which adding keywords does not generally improve performance is ESA. However, by analyzing the MRR values we can observe that ESA produces the worst ranking of the retrieved list of services. Conversely, the method with the best ranking is Wikipedia Miner, but it has a very small coverage (only 24 services). Hence, we can state that ESA is able to identify the correct correspondence for the largest number of services (about 82% of services), but only if the list of retrieved services is extended. Very similar results are obtained for the German services (see Table 2). Also in this experiment ESA is the configuration with the best accuracy, while Wikipedia Miner achieves the best ranking. These results are very promising, since the service labels in the LGSL are written and matched by human experts and are not always mere translations of the English labels.

Table 1: Accuracy@n and MRR for the Dutch language. The highest values are reported in bold (total services = 225).
{| class="wikitable"
! Configuration !! a@1 !! a@3 !! a@5 !! a@10 !! a@20 !! a@30 !! MRR !! N
|-
| keyword || '''0.333''' || 0.458 || 0.502 || 0.538 || 0.542 || 0.547 || 0.610 || 123
|-
| tagme || 0.120 || 0.164 || 0.178 || 0.182 || 0.187 || 0.187 || 0.643 || 42
|-
| tagme+keyword || 0.316 || 0.453 || 0.484 || 0.551 || 0.560 || 0.569 || 0.555 || 128
|-
| wikiminer || 0.080 || 0.093 || 0.107 || 0.107 || 0.107 || 0.107 || '''0.750''' || 24
|-
| wikiminer+keyword || 0.324 || 0.440 || 0.484 || 0.529 || 0.542 || 0.547 || 0.593 || 123
|-
| esa || 0.311 || '''0.480''' || 0.538 || '''0.622''' || '''0.689''' || '''0.716''' || 0.378 || '''185'''
|-
| esa+keyword || 0.311 || 0.476 || '''0.542''' || '''0.622''' || '''0.689''' || '''0.716''' || 0.378 || '''185'''
|-
| dbpedia || 0.182 || 0.236 || 0.244 || 0.249 || 0.258 || 0.258 || 0.707 || 58
|-
| dbpedia+keyword || 0.329 || 0.449 || 0.498 || 0.556 || 0.569 || 0.573 || 0.574 || 129
|}

Table 2: Accuracy@n and MRR for the German language. The highest values are reported in bold (total services = 190).

{| class="wikitable"
! Configuration !! a@1 !! a@3 !! a@5 !! a@10 !! a@20 !! a@30 !! MRR !! N
|-
| keyword || 0.204 || 0.338 || 0.396 || 0.413 || 0.418 || 0.418 || 0.489 || 94
|-
| tagme || 0.124 || 0.147 || 0.151 || 0.151 || 0.151 || 0.151 || '''0.824''' || 34
|-
| tagme+keyword || 0.218 || 0.342 || 0.400 || 0.427 || 0.431 || 0.431 || 0.505 || 97
|-
| wikiminer || 0.098 || 0.124 || 0.124 || 0.129 || 0.129 || 0.129 || 0.759 || 29
|-
| wikiminer+keyword || 0.218 || 0.342 || 0.396 || 0.422 || 0.427 || 0.427 || 0.510 || 96
|-
| esa || '''0.244''' || '''0.360''' || '''0.431''' || '''0.484''' || '''0.556''' || '''0.600''' || 0.350 || '''157'''
|-
| keyword+esa || '''0.244''' || '''0.360''' || '''0.431''' || '''0.484''' || '''0.556''' || '''0.600''' || 0.350 || '''157'''
|-
| dbpedia || 0.138 || 0.169 || 0.182 || 0.182 || 0.182 || 0.182 || 0.756 || 41
|-
| dbpedia+keyword || 0.231 || '''0.360''' || 0.413 || 0.440 || 0.440 || 0.440 || 0.525 || 99
|}

6 Conclusions and Future Work

In this paper we proposed a tool called CroSeR that performs cross-language matching of e-gov services. Four different Wikipedia-based semantic representations were investigated. The most accurate representation turned out to be ESA, which, given a Dutch or German service, is able to retrieve the corresponding English service for most services. Hence, adding external knowledge to the representation of a very short textual description is an effective solution in this specific domain. However, the correct service is generally not ranked in the first positions of the retrieved list. Therefore, as future work we want to combine different representations to generate the ranked list.
For example, we can start by adopting the representation with the highest MRR value and then shift to representations with a worse ranking but a better accuracy. We also plan to extend the experiment to the other languages in the esd-toolkit catalogue (Belgian, Norwegian, Swedish). Another idea is to also exploit the other relations stored in the catalogue (life events, interactions) to improve the service matching. Finally, we will carry out a user study in which users can directly formulate their information need instead of using the service label as the query.

7 Acknowledgments

The work presented in this paper has been partially supported by the Italian PON project PON01 00861 SMART (Services and Meta-services for smART eGovernment).

References

1. European Commission. A digital agenda for Europe. COM(2010) 245 final/2, 2010.
2. L. Ding, V. Peristeras, and M. Hausenblas. Linked Open Government Data. IEEE Intelligent Systems, 27(3):11–15, 2012.
3. S. Fernando, M. Hall, E. Agirre, A. Soroa, P. Clough, and M. Stevenson. Comparing taxonomies for organising collections of documents. In Proceedings of COLING '12, pages 879–894. The COLING 2012 Organizing Committee, 2012.
4. P. Ferragina and U. Scaiella. Tagme: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of CIKM '10, pages 1625–1628. ACM, 2010.
5. B. Fu, R. Brennan, and D. O'Sullivan. Cross-lingual ontology mapping: an investigation of the impact of machine translation. In Proceedings of ASWC '09, pages 1–15. Springer-Verlag, 2009.
6. B. Fu, R. Brennan, and D. O'Sullivan. Using pseudo feedback to improve cross-lingual ontology mapping. In Proceedings of ESWC '11, pages 336–351. Springer-Verlag, 2011.
7. E. Gabrilovich and S. Markovitch. Wikipedia-based semantic interpretation for natural language processing. Journal of Artificial Intelligence Research (JAIR), 34:443–498, 2009.
8. S. Hertling and H. Paulheim. WikiMatch: using Wikipedia for ontology matching. In Proceedings of the 7th International Workshop on Ontology Matching (OM 2012). CEUR, 2012.
9. P. Knoth, L. Zilka, and Z. Zdrahal. Using explicit semantic analysis for cross-lingual link discovery. In Proceedings of the 5th International Workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies, 2011.
10. O. Medelyan, I. H. Witten, and D. Milne. Topic indexing with Wikipedia. In Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, pages 19–24. AAAI Press, 2008.
11. P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of I-SEMANTICS '11, pages 1–8. ACM, 2011.
12. D. Milne and I. H. Witten. Learning to link with Wikipedia. In Proceedings of CIKM '08, pages 509–518. ACM, 2008.
13. M. Palmonari, G. Viscusi, and C. Batini. A semantic repository approach to improve the government to business relationship. Data Knowl. Eng., 65(3):485–511, 2008.
14. H. Paulheim. WeSeE-Match results for OAEI 2012. In Proceedings of the 7th International Workshop on Ontology Matching (OM 2012), 2012.
15. S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, 2009.
16. P. Shvaiko and J. Euzenat. Ontology matching: state of the art and future challenges. IEEE Trans. Knowl. Data Eng., 25(1):158–176, 2013.
17. D. Spohr, L. Hollink, and P. Cimiano. A machine learning approach to multilingual and cross-lingual ontology matching. In Proceedings of ISWC 2011, pages 665–680. Springer-Verlag, 2011.
18. E. M. Voorhees. The TREC-8 question answering track report. In Proceedings of TREC-8, pages 77–82. NIST Special Publication 500-246, 1999.
19. S. Wang, A. Isaac, B. Schopman, S. Schlobach, and L. van der Meij. Matching multi-lingual subject vocabularies. In Proceedings of ECDL '09, pages 125–137. Springer-Verlag, 2009.