Entity Suggestion Ranking via Context Hashing

Rima Türker (1,2), Maria Koutraki (1,2), Jörg Waitelonis (3), and Harald Sack (1,2)

(1) FIZ Karlsruhe, Karlsruhe, Germany, firstname.lastname@fiz-karlsruhe.de
(2) Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany, firstname.lastname@kit.de
(3) Hasso Plattner Institute for IT Systems Engineering, Potsdam, Germany, joerg.waitelonis@hpi.de

1 Introduction

In text-based semantic analysis, the task of named entity linking (NEL) establishes the fundamental link between unstructured data elements and knowledge base entities. The increasing number of applications complementing web data with knowledge base entities has led to a rich toolset of NEL frameworks [4,7]. To resolve linguistic ambiguities, NEL relates the available context information via statistical analysis, e.g. term co-occurrences in large text corpora, or via graph analysis, e.g. connected component analysis on the contextually induced knowledge subgraph. The semantic document annotation achieved via NEL algorithms can furthermore be complemented, upgraded, or even substituted by manual annotation, as e.g. in [5]. For this manual annotation task, a popular approach suggests a set of potential entity candidates that fit the text fragment selected by the user, who then decides on the correct entity for the annotation. The high degree of natural language ambiguity leads to huge sets of entity candidates that have to be scanned and evaluated. To speed up this process and to enhance its usability, we propose a pre-ordering of the entity candidate set for a predefined context. The complex process of NEL context analysis is often too time consuming to be applied in an online environment. Thus, we propose to speed up the context computation via approximation, based on the offline generation of context weight vectors. For each entity, a context vector is computed beforehand and applied like a hash to quickly determine the most likely entity candidates with respect to a given context. In this paper, the process of entity hashing via context weight vectors is introduced. Context evaluation via weight vectors is evaluated on the test case of SciHi (http://scihi.org/), a web blog on the history of science providing blog posts semantically annotated with DBpedia entities.

2 Creating Context Vectors for Entity Hashing

Conceptual Idea. Most DBpedia entities are linked to related Wikipedia categories (https://en.wikipedia.org/wiki/Help:Category). Categories represent valuable characteristics of the resources and are hierarchically related via the skos:broader property to form a taxonomy. However, to identify the relation between a resource and a more general category, e.g. between Wolfgang A. Mozart and dbc:Music, a path has to be determined from dbc:Composers_for_piano, the category directly connected to Mozart, via a sequence of skos:broader connections to dbc:Music. In contrast to other studies [6,2,1], in this paper DBpedia resources are characterized based on their Wikipedia category associations. Note that related NEL approaches make use of all Wikipedia categories; in this study, however, only a small number of categories is used to reduce the complexity and to make the technique scalable.
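As an illustration of this path-based view, the following minimal Python sketch counts the paths of a given length connecting a DBpedia entity to a target category via dct:subject and skos:broader, using the public DBpedia SPARQL endpoint and the SPARQLWrapper library. The reading of "path length" (one dct:subject edge followed by skos:broader hops), the example URIs, and the endpoint's availability are assumptions for illustration; the original system may gather these counts differently, e.g. offline over a local dump.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia endpoint; availability and rate limits are not guaranteed.
ENDPOINT = "https://dbpedia.org/sparql"

PREFIXES = """PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
"""

def count_paths(entity_uri, category_uri, length):
    """Count the paths of a given length from an entity to a category.

    Here a path of length l is read as one dct:subject edge followed by
    l - 1 skos:broader edges (this reading of 'length' is an assumption).
    """
    # Chain of intermediate category variables; the last node is the target.
    nodes = [f"?c{i}" for i in range(length - 1)] + [f"<{category_uri}>"]
    triples = [f"<{entity_uri}> dct:subject {nodes[0]} ."]
    for i in range(length - 1):
        triples.append(f"{nodes[i]} skos:broader {nodes[i + 1]} .")
    query = PREFIXES + "SELECT (COUNT(*) AS ?n) WHERE { " + " ".join(triples) + " }"
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return int(bindings[0]["n"]["value"])

# Example from the text: Mozart -> ... -> dbc:Music, here probed at length 3.
print(count_paths("http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart",
                  "http://dbpedia.org/resource/Category:Music", 3))
```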
Weight Vector Heuristics. In a manual semantic annotation use case, usually many potential entity candidates are displayed to the user. In order to display the most likely entity candidates first, we propose a sorting based on the computation of a weight vector for each DBpedia entity. Weight vectors reflect the characteristics of the entities with respect to predefined Wikipedia categories. Each DBpedia entity is represented as an m-dimensional vector of weights (w_1, ..., w_m) that indicate the connectedness of the entity to the predefined categories c_1, ..., c_m. Since our use case SciHi provides 35 science-related categories, entities are represented by a 35-dimensional weight vector ranging from c_1 = Archeology to c_35 = Zoology. The single weights w_i of a weight vector wv_e = (w_1, ..., w_m) are computed based on the number of paths from the entity e to the categories c_1, ..., c_m in the RDF graph via dct:subject and skos:broader, with shorter paths indicating a stronger relation. The following formula for weight calculation has been determined empirically:

w_c(e) = \begin{cases} 0, & \text{if there is no path from } e \text{ to } c, \\ \sum_{l = l_{\min}}^{maxDepth} \frac{\#p_l(e,c)}{2^{\,l - l_{\min}}} \cdot \log \frac{N}{n_l}, & \text{otherwise} \end{cases}

where w_c(e) is the single weight value for category c and entity e, and maxDepth is the maximum path length, empirically determined as maxDepth = 6, since longer paths introduce too much noise. l_min designates the shortest discovered path length from entity e to any category, #p_l(e,c) is the number of paths of length l from entity e to category c, N is the total number of entities, and n_l is the number of entities with paths of length l to category c. The longer the path, the less relevant is the connection of entity e to category c and the more noise is introduced. Therefore, the number of paths #p_l(e,c) is divided by 2^{l - l_min}, i.e. for each hop beyond the shortest found path length l_min the weight is cut in half, and further multiplied by a term frequency-inverse document frequency (tf-idf) factor, since more general categories always have a high number of paths and should not be overemphasized [7]. Finally, the weight vectors are normalized to enable a fair comparison.
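As a concrete reading of the formula, the sketch below computes one weight w_c(e) from precomputed path counts. The dictionary-based inputs and the global statistics N and n_l are assumed to have been gathered offline; the variable names are illustrative, and the choice of the L2 norm for normalization is an assumption, since the paper only states that vectors are normalized.

```python
import math

MAX_DEPTH = 6  # empirically determined cutoff from the paper

def category_weight(path_counts, l_min, N, n_l):
    """Compute w_c(e) for one entity/category pair.

    path_counts: dict mapping path length l -> #p_l(e, c)
    l_min:       shortest path length from e to any category
    N:           total number of entities
    n_l:         dict mapping l -> number of entities with a
                 length-l path to category c
    """
    if not path_counts:          # no path from e to c at all
        return 0.0
    weight = 0.0
    for l in range(l_min, MAX_DEPTH + 1):
        if path_counts.get(l, 0) == 0 or n_l.get(l, 0) == 0:
            continue             # zero-count terms contribute nothing
        # halve the contribution per hop beyond l_min, damp general
        # categories with the tf-idf-style log(N / n_l) factor
        weight += path_counts[l] / 2 ** (l - l_min) * math.log(N / n_l[l])
    return weight

def normalize(vector):
    """L2-normalize a weight vector so entities become comparable."""
    norm = math.sqrt(sum(w * w for w in vector)) or 1.0
    return [w / norm for w in vector]
```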
3 Experiments And Results

Evaluation Requirements. The general purpose of the weight vectors is to improve semantic text annotation systems. If the general domain of a text is known, this information can assist the subsequent entity linking process: with knowledge about the domain, disambiguation can be refined by putting more focus on entities belonging to that domain. Another application is the entity lookup problem, as solved for example in auto-completion systems [3]. By entering a search string, the user is presented a list of knowledge base entities, usually in a string distance based order, from which the best suited candidate is selected. With focus on a given domain, the ranking of candidate entities in the auto-completion list can be re-organized so that entities belonging to the domain are preferred over others. The hypothesis is that this re-ranking provides better suggestions, so that working with the system becomes more efficient when domain information is incorporated. With this application scenario in mind, we conducted an evaluation experiment incorporating the auto-suggestion system for entity lookup proposed in [3]. For the evaluation, the ground truth dataset has to include potential search strings (e.g. surface forms of entity mentions), the associated entities, as well as the categories. Datasets for entity linking should fit this purpose perfectly; unfortunately, they often do not provide information about a specific domain or category. Only the WES2015 [8] dataset, created from the SciHi blog articles, provides categories for each annotated document. It contains 331 science-related articles with a total of 28,516 annotations, 8,582 distinct entities, and 11,990 different surface forms.

Experiment Design. For the evaluation, the auto-suggestion system was adapted to also take the entities' weight vectors into account. For each annotation of the dataset, the list of entities was generated with the surface form as input, and the annotated entity was localized in the list. If the entity was within the top n positions, it was counted as a true positive. The following ranking variants were evaluated: (a) order by string distance & in-degree (cf. [3]) (baseline) and (b) order by weight vector category. The baseline is the combined ranking implemented in [3]: first, the results are ordered by Jaro-Winkler distance [9] between the surface form and the entity's main label; subsequently, the top n = 50 items are re-ordered according to the entities' in-degree derived from the Wikipedia link graph to approximate entity popularity, so that popular entities are ranked higher than unpopular ones. The order by weight vector category was realized by sorting the result list by the values at the corresponding category index in the weight vector, as sketched below.
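The following sketch contrasts the two ranking variants. The Candidate layout (label, in-degree, weight vector) and the use of the jellyfish library for the Jaro-Winkler metric are assumptions for illustration, not the original implementation; jellyfish exposes a similarity, so sorting by descending similarity is used as the equivalent of ascending distance.

```python
from dataclasses import dataclass, field
from typing import List

import jellyfish  # provides jaro_winkler_similarity; any string-metric library works


@dataclass
class Candidate:                 # illustrative data layout
    uri: str
    label: str                   # the entity's main label
    in_degree: int               # from the Wikipedia link graph (popularity proxy)
    weights: List[float] = field(default_factory=list)  # 35-dim weight vector


def baseline_rank(candidates, surface_form, n=50):
    """(a) Order by Jaro-Winkler string similarity, then re-order the
    top n items by in-degree so popular entities come first."""
    by_string = sorted(
        candidates,
        key=lambda c: jellyfish.jaro_winkler_similarity(surface_form, c.label),
        reverse=True)
    head = sorted(by_string[:n], key=lambda c: c.in_degree, reverse=True)
    return head + by_string[n:]


def category_rank(candidates, category_index):
    """(b) Order by the weight at the given category index, highest first."""
    return sorted(candidates, key=lambda c: c.weights[category_index], reverse=True)
```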
Evaluation Results. Tab. 1 presents the evaluation results. The first row ("Total items") shows the total number of entities looked up; the following rows show the number of entities found within the n topmost positions. For example, with the baseline, 2,336 out of 28,516 entities were suggested at the first position, i.e. 8.2 % were a perfect hit. The number of hits increased when the weight vector based ordering was applied: whereas the improvement on the top 1 hits from 8.2 % to 9.5 % was only slight (1.3 percentage points), the top 5 (+6.5 percentage points) and top 10 (+9.6 percentage points) improvements are more pronounced.

Table 1: Evaluation results

              Baseline          Approach
Total items   28,516            28,516
Top 1          2,336 ( 8.2 %)    2,699 ( 9.5 %)
Top 5          4,609 (16.2 %)    6,481 (22.7 %)
Top 10         5,381 (18.9 %)    8,116 (28.5 %)

4 Conclusion And Future Work

In this paper, we have shown how to approximate contextual information using weight vectors. The approach has been evaluated in an entity suggestion use case and shows a clear improvement over a baseline that only takes string similarity and popularity measures into account. To be applicable in a more general setting, the weight vector dimensions have to be adapted (a) to the domain in focus or (b) to the general case without domain restrictions, e.g. by using categories from a top level ontology. Current research focuses on adapting the presented approach with weight vector hashing also to named entity linking.

References

1. Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust disambiguation of named entities in text. In: EMNLP (2011)
2. Ichinose, S., Kobayashi, I., Iwazume, M., Tanaka, K.: Ranking the results of DBpedia retrieval with SPARQL query. In: JIST (2013)
3. Osterhoff, J., Waitelonis, J., Sack, H.: Widen the Peepholes! Entity-based auto-suggestion as a rich and yet immediate starting point for exploratory search. In: IVDW (2012)
4. Shen, W., Wang, J., Han, J.: Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge & Data Engineering (2015)
5. Tietz, T., Jäger, J., Waitelonis, J., Sack, H.: Semantic annotation and information visualization for blog posts with refer. In: VOILA (2016)
6. Tonon, A., Catasta, M., Prokofyev, R., Demartini, G., Aberer, K., Cudré-Mauroux, P.: Contextualized ranking of entity types based on knowledge graphs. Web Semantics (2016)
7. Usbeck, R., et al.: GERBIL: General entity annotator benchmarking framework. In: WWW (2015)
8. Waitelonis, J., Exeler, C., Sack, H.: Linked Data enabled generalized vector space model to improve document retrieval. In: Proceedings of the NLP & DBpedia 2015 Workshop at ISWC (2015)
9. Winkler, W.E.: String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In: SRMS (1990)