LamAPI: a Comprehensive Tool for String-based
Entity Retrieval with Type-based Filters
Roberto Avogadro1 , Marco Cremaschi1 , Fabio D’Adda1 , Flavio De Paoli1 and
Matteo Palmonari1
1 Università degli Studi di Milano - Bicocca, 20126 Milano, Italy


                                         Abstract
When information is available in unstructured or semi-structured formats, e.g., tables or texts, finding links between strings appearing in these sources and the entities they refer to in some background Knowledge Graphs (KGs) is a key step to integrate, enrich and extend the data and/or the KGs. This Entity Linking task is usually decomposed into Entity Retrieval and Entity Disambiguation because of the large entity search space. This paper presents an Entity Retrieval service (LamAPI) and discusses the impact of different retrieval configurations, i.e., query and filtering strategies, on the retrieval of entities. The approach is to augment the search activity with extra information, like types, associated with the strings in the original datasets. The results have been empirically validated against public datasets.

                                         Keywords
                                         Entity Linking, Entity Retrieval, Entity Disambiguation, Knowledge Graph




1. Introduction
A key advantage of developing Knowledge Graphs (KGs) consists in effectively supporting the
integration of data coming in different formats and structures [1]. In semantic data integration,
KGs provide identifiers and descriptions of entities, thus supporting the integration of data
sources like tables or texts. The table-to-KG matching problem, also referred to as semantic table
interpretation, has recently attracted much attention in the research community [2, 3, 4] and is a
key step to enrich data [1, 5] and to construct and extend KGs from semi-structured data [6, 7].
When information is available in unstructured or semi-structured formats, e.g., tables or texts,
finding links between strings (or mentions) appearing in these sources and the entities they
refer to in some background KG is a key step to integrate, enrich and extend the data and/or the
KGs. We name this task Entity Linking (EL); it comes in different flavours depending on the
considered data formats but with some shared features.
   For example, because of the ample entity search space, most of the approaches to EL include
a first step where candidate entities for the input string are collected, i.e., Entity Retrieval (ER)
[8], and a second step where the string is disambiguated by eventually selecting one or none
of the candidate entities, i.e., Entity Disambiguation (ED) [9]. In most approaches, ER returns

OM-2022: The 17th International Workshop on Ontology Matching, Hangzhou, China
roberto.avogadro@unimib.it (R. Avogadro); marco.cremaschi@unimib.it (M. Cremaschi);
f.dadda4@campus.unimib.it (F. D’Adda); flavio.depaoli@unimib.it (F. De Paoli); matteo.palmonari@unimib.it
(M. Palmonari)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
a ranked list of candidates, while disambiguation consists of re-ranking the input list. Entity
Disambiguation is at the heart of EL, with different approaches that leverage different kinds of
evidence depending on the format and features of the input text [10]. However, the ER step
is also significant considering that its results define an upper bound for the performance of
the end-to-end linking: if an entity is not among the set of candidates, it cannot be selected as
the target for the link. Also, while it is, in principle, possible to scroll the list of candidates at
arbitrary levels of depth, maintaining acceptable efficiency levels requires cutting off the results
of ER at a reasonable depth.
Approaches to entity search can either resort to existing lookup APIs, e.g., DBpedia SPARQL
Query Editor1 , DBpedia Spotlight2 or Wikidata Query Service3 , or use recent approaches to
dense ER [11], where entities are searched in a pre-trained dense space, an approach becoming
especially popular in EL for textual data. The APIs reported above provide access to the SPARQL
endpoint because the elements are stored in Resource Description Framework (RDF) format.
Such endpoints are usually offered on local dumps of the original KGs to avoid network latency
and increase efficiency. For instance, DBpedia can be accessed by OpenLink Virtuoso, a row-wise
transaction-oriented RDBMS with a SPARQL query engine4 , and Wikidata by Blazegraph5 , a
high-performance graph database providing RDF/SPARQL-based APIs. An issue faced with
these solutions is the time required for downloading and setting up the datasets: Wikidata 2019
requires some days to set up6 since the full dump is about 1.1TB (uncompressed). Moreover,
writing SPARQL queries may be an issue since specific knowledge of the searched Knowledge
Graph (KG) is required, besides the knowledge of the required syntax. Some limitations related
to the use of these endpoints are:

    • the SPARQL endpoint response time is directly proportional to the size of the returned
      data; as a consequence, sometimes it is not even possible to get a result because the
      endpoint fails with a timeout;
    • the number of requests per second may be severely limited for online endpoints (to
      ensure feasibility) or computationally too expensive for local endpoints (a reasonable
      configuration requires at least 64GB of RAM and considerable CPU resources);
    • there are some intrinsic limits in the expressiveness of the SPARQL language (e.g., full-text
      search capability, which is required for matching table mentions, can be obtained only
      with extremely slow “contains” or “regex” queries7 ).

Regarding the approaches to dense ER, some limitations can be mentioned [12, 13]:

    • the results are strictly related to the type of representation used. Consequently, careful
      and tedious feature engineering is required when designing these systems;


    1
      dbpedia.org/sparql
    2
      www.dbpedia-spotlight.org
    3
      query.wikidata.org
    4
      virtuoso.openlinksw.com
    5
      blazegraph.com
    6
      addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/
    7
      docs.openlinksw.com/virtuoso/rdfsparqlrulefulltext/
    • generalising the trained entity linking model to other KGs or domains is challenging due
      to the strong dependence on the specific KG and domain knowledge in the process of
      designing features;
    • these systems depend excessively on external data: the effectiveness of the algorithms
      is directly affected by the quality of the training data, and their applicability is therefore
      restricted.

    Information Retrieval (IR) approaches based on search engines still provide valuable solutions
to support entity search, mainly because they do not require training, work with any KG, and
easily adapt to changes in the reference KG. Although IR-based entity search has been used
extensively, especially in table-to-KG matching [14, 15, 16, 17], its use has frequently been left
to custom optimisations and not adequately discussed or documented in scientific papers. As a
result, researchers willing to apply such solutions must develop them from scratch, including
data indexing techniques, query formulation and service set-up.
    In this paper, we aim to present: i) LamAPI, a comprehensive tool for IR-based ER, aug-
mented with type-based filtering features, and ii) a study of the impact of different retrieval
configurations, i.e., query and filtering strategies, on the retrieval of entities. The tool sup-
ports string-based retrieval but also hard and soft filters [18] based on an input entity type
(i.e., rdf:type for DBpedia and Property:P31 for Wikidata). Hard type filters remove non-
matching results, while soft type filters promote or demote results when an exact match is
not feasible. These filters are useful to support either EL in texts (e.g., by exploiting entity
types returned by a classifier [19, 20]), or in tables (e.g., by exploiting a known column type
(rdf:type) to filter out irrelevant entities). While the approach is general, the tool provides
support for EL on semi-structured data. In our study, we therefore focus on evaluating different
retrieval strategies, with and without filters, on EL in the table-to-KG matching setting,
considering two large KGs, Wikidata and DBpedia. Finally, the tool also contains mappings
among the latter two KGs and Wikipedia, thus supporting cross-KG bridges. The tool and all
the resources used for the experiments are released following the FAIR Guiding Principles8 .
LamAPI is released under the Apache 2.0 licence.
    The rest of this article is organised as follows. Section 2 presents a brief analysis of the
state of the art on string-based entity retrieval techniques. We describe the services offered
by LamAPI in Section 3. Section 4 introduces the Gold Standards and the configuration parameters,
and discusses the evaluation results. Section 5 presents the implementation of the LamAPI
retrieval service. Finally, we conclude this paper and discuss future directions in Section 6.


2. String-based entity retrieval
Given a KG containing a set of entities 𝐸 and a collection of named-entity mentions 𝑀 , the
goal of EL is to map each entity mention 𝑚 ∈ 𝑀 to its corresponding entity 𝑒 ∈ 𝐸 in the KG.
As described above, a typical EL service consists of the following modules [10]:
   1. Entity Retrieval (ER). In this module, for each entity mention 𝑚 ∈ 𝑀 , irrelevant entities
      in the KG are filtered out to return a set 𝐸𝑚 of candidate entities: entities that mention
    8
        www.nature.com/articles/sdata201618
      𝑚 may refer to. To achieve this goal, state-of-the-art techniques have been used, such as
      name dictionary-based techniques, surface form expansion from the local document, and
      methods based on search engines.
   2. Entity Disambiguation (ED). In this module, the entities in the set 𝐸𝑚 are more accurately
      ranked to select the correct entity among the candidate ones. In practice, this is a re-
      ranking activity that considers other information (e.g., contextual information) besides
      the simple textual mention 𝑚 used in the ER module.
    According to the experiments conducted in [21], the role of the ER module is critical since it
should ensure the presence of the correct entity in the returned set to let the ED module find
it. Hence, the main contribution of this work is to discuss retrieval configurations, i.e., query
and filtering strategies, for retrieving entities.
    Name dictionary-based techniques are the main approaches to ER; such techniques leverage
different combinations of features (e.g., labels, aliases, Wikipedia hyperlinks) to build an offline
dictionary 𝐷 of links between string names and the entities they may refer to, which is then used
to generate the set of candidate entities. The most straightforward approach considers exact
matching between the textual mention 𝑚 and the string names inside 𝐷. Partial matching (e.g.,
fuzzy and/or n-gram search) can also be considered.
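    To make the dictionary-based retrieval concrete, the following is a minimal sketch of exact
and fuzzy candidate lookup over an offline dictionary, using a plain Levenshtein distance; the
dictionary contents and the distance threshold are illustrative assumptions, not LamAPI internals.

# Toy offline dictionary D: string name -> candidate entity ids (illustrative data).
D = {
    "manchester": ["Manchester", "Manchester_Parish"],
    "manchester united f.c.": ["Manchester_United_F.C."],
}

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def candidates(mention: str, max_dist: int = 2) -> list:
    """Exact match first; otherwise fuzzy match within max_dist edits."""
    m = mention.lower()
    if m in D:
        return D[m]
    found = []
    for name, entities in D.items():
        if levenshtein(m, name) <= max_dist:
            found.extend(entities)
    return found

print(candidates("mnchester"))  # fuzzy hit: ['Manchester', 'Manchester_Parish']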
    Besides pure string matching, type constraints (using types/classes of the KG) associated with
string mentions can be exploited to filter candidate entities. In such a case, the dictionary needs
to be augmented with the types associated with the linked entities to enable hard or soft filtering.
Listings 1 and 2 report an example of how type constraints can influence the result of candidate
entity retrieval for the textual mention “manchester”. The former shows the result without
constraints: cities like Manchester in England and Manchester Parish in Jamaica are reported
(note the similarity score equal to 1.00). The latter shows the result when type constraints are
applied: types like “SoccerClub” and “SportsClub” allow for the promotion of soccer clubs such
as “Manchester United F.C.”, which is now ranked first (similarity score 0.83).
    Similar approaches have been proposed in this domain, such as the MTab [14] entity search,
which provides keyword search, fuzzy search and aggregation search. Another relevant
approach is EPGEL [15], where candidate entity generation uses both a keyword and a fuzzy
search; this approach also uses BERT [22] to create a profile for each entity to improve the
search results. The LinkingPark [16] method proposes a weighted combination of keyword,
trigram and fuzzy search to maximise recall during the candidate generation process; in
addition, this approach involves verifying the presence of typos before generating candidates.
Compared with these works, LamAPI provides an n-gram search and the possibility to include
type constraints in the candidate search to apply type/concept filtering in the ER. Furthermore,
LamAPI provides several services to help researchers in tasks like EL.
     Listing 1: DBpedia lookup without type constraints.
     {
      "id": Manchester
      "label": Manchester
      "type": City Settlement ...
      "ed_score": 1
     },
     {
      "id": Manchester_Parish
      "label": Manchester
      "type": Settlement PopulatedPlace
      "ed_score": 1
     }

     Listing 2: DBpedia lookup with type constraints.
     {
      "id": Manchester_United_F.C.
      "label": Manchester U
      "type": SoccerClub SportsClub ...
      "ed_score": 0.833
     },
     {
      "id": Manchester_City_F.C.
      "label": Manchester C
      "type": SoccerClub SportsClub ...
      "ed_score": 0.833
     }




 3. LamAPI
The current version of LamAPI integrates DBpedia (v. 2016-10 and v. 2022.03.01) and Wikidata
(v. 20220708), the most popular free KGs. However, any KG, even a private and domain-specific
one, could be integrated. The only constraint is to support indexing, as described in Section 3.1.

 3.1. Knowledge Graphs indexing
DBpedia, Wikidata and the like are very large KGs that require an enormous amount of time
and resources to perform ER, so we created a more compact representation of these data suitable
for ER tasks. For each KG, we downloaded a dump (e.g., ‘latest-all.json.bz2’ for Wikidata, which
is about 71 GB, while DBpedia is distributed as multiple files) and created a local copy in a single
file by extracting and storing all triples (e.g., 96,580,491 entities for Wikidata). We then created an
index with ElasticSearch9 , an engine that can search and analyse huge volumes of data in near
real-time. These customised local copies of the KGs are then used to create endpoints that provide
EL retrieval services. The advantage is that these services can work on partitions of the original
KGs to improve performance by saving time and using fewer resources.
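   As an illustration of the indexing step, the following is a minimal sketch of how entity
documents could be indexed in ElasticSearch with a trigram analyzer (supporting the n-gram
search described in Section 3.2); the index name, document fields and analyzer settings are
assumptions made for illustration, not LamAPI’s actual schema.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local instance

index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "trigram_tokenizer": {"type": "ngram", "min_gram": 3, "max_gram": 3}
            },
            "analyzer": {
                "trigram_analyzer": {
                    "type": "custom",
                    "tokenizer": "trigram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # 'label' is searchable both as plain text and as trigrams.
            "label": {
                "type": "text",
                "fields": {"trigram": {"type": "text", "analyzer": "trigram_analyzer"}},
            },
            "types": {"type": "keyword"},  # enables hard/soft type filtering
        }
    },
}
es.indices.create(index="lamapi_wikidata", body=index_body)

# Bulk-index entity documents extracted from the local copy of the dump.
docs = [
    {"_index": "lamapi_wikidata", "_id": "Q937",
     "_source": {"label": "Albert Einstein", "types": ["Q5", "Q19350898"]}},
]
helpers.bulk(es, docs)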

 3.2. LamAPI services
 Among the services provided by LamAPI to search and retrieve information in a KG, we discuss
 the Lookup and Type-similarity, which are the relevant services for entity retrieval.
 Lookup: given a string input, it retrieves a set of candidate entities from the reference KG. The
 request can be qualified by setting some attributes:

 limit: an integer value that specifies the number of entities to retrieve. The default value is 100;
        it has been empirically verified that this limit allows a good level of coverage.
    kg: specifies which KG and version to use. The default is dbpedia_2022_03_01; other
        possible values are dbpedia_2016_10 and wikidata_latest.
 fuzzy: a boolean value. When true, it matches tokens inside a string with an edit distance
        (Levenshtein distance) less than or equal to 2, which gives a greater tolerance for spelling
        errors. When false, the fuzzy operator is not applied to the input.
      9
          www.elastic.co
ngrams: a boolean value. When true, it enables n-gram search. After many empirical experi-
         ments, we set the ‘n’ of n-grams equal to 3: a lower value can introduce bias in the
         search, while a higher value is less effective against spelling errors. With n equal to 3,
         “albert einstein” is split into [’alb’, ’lbe’, ’ber’, ’ert’, ...]. When false, n-gram search is
         not applied to the input.
  types: this parameter allows the specification of a list of types (e.g., rdf:type for DBpedia
         and Property:P31 for Wikidata) associated with the input string to filter the retrieved
         entities. This attribute plays a key role in re-ranking the candidates, allowing a more
         accurate search based on the input types.
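      For illustration, a Lookup request with these attributes could be issued as follows; the base
   URL and token mirror the demonstration deployment shown in Listing 4 and are assumptions,
   not guaranteed stable endpoints.

   import requests

   # Assumed demonstration endpoint (see Listing 4), with the attributes documented above.
   resp = requests.get(
       "https://lamapi.ml/lookup/entity-retrieval",
       params={
           "name": "Albert Einstein",
           "limit": 100,
           "token": "insideslab-lamapi-2022",  # access token as in Listing 4
           "kg": "dbpedia_2022_03_01",
           "fuzzy": False,
           "ngrams": False,
       },
       timeout=30,
   )
   resp.raise_for_status()
   for candidate in resp.json():  # assuming a JSON list of candidates as in Listings 5-6
       print(candidate["id"], candidate["score"])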

       The following example shows the difference between a SPARQL query and the LamAPI
    Lookup service. Listings 3 and 4 show a search using the mention “Albert Einstein”. The
    LamAPI syntax is clearly simpler than the SPARQL one, and the Lookup service allows for
    managing the presence of misspelled mentions. Finally, another advantage over SPARQL is
    the ranking of candidates.

        Listing 3: Search example using a SPARQL query.
        select distinct ?s where {
         ?s ?p ?o .
         FILTER( ?p IN (rdfs:label)).
         ?o bif:contains "Albert Einstein".
        }
        order by strlen(str(?s))
        LIMIT 100

        Listing 4: Example of LamAPI Lookup service.
        /lookup/entity-retrieval?
         name="Albert Einstein"&
         limit=100&
         token=insideslab-lamapi-2022&
         kg=dbpedia_2022_03_01&
         fuzzy=False&
         ngrams=False


       Examples of results returned by LamAPI for the input string “Albert Einstein” are shown in
    Listing 5 and Listing 6, referring to Wikidata and DBpedia, respectively. Each candidate entity is
    described, in the W3C specification10 format, by the unique identifier id in the chosen KG, a string
    label name reporting the official name of the entity, a set of types associated with the entity,
    each one described by its unique identifier id and the corresponding string label name, and an
    optional description of the entity (e.g., DBpedia does not provide descriptions, while Wikidata
    does). Moreover, a score with the edit distance measure (Levenshtein distance) between the
    input textual mention and the entity label is reported.

        Listing 5: Lookup: returned data from Wikidata.
        {
         "id": Q937
         "label": Albert Einstein
         "description": German-born ...
         "type": Q19350898 Q16389557 ... Q5
         "score": 1.0
        },
        {
         "id": Q356303
         "label": Albert Einstein
         "description": American actor ...
         "type": Q33999 Q2526255 ... Q5
         "score": 1.0
        }

        Listing 6: Lookup: returned data from DBpedia.
        {
         "id": Albert_Einstein
         "label": Albert Einstein
         "description": ...
         "type": Scientist Animal ...
         "score": 1.0
        },
        {
         "id": Albert_Einstein_ATV
         "label": Albert Einstein ATV
         "description": ...
         "type": SpaceMission Event ...
         "score": 0.789
        }

    The score provides a candidate ranking that can be used by the Entity Disambiguation (ED)
module for a straightforward selection of the actual link. The intuition is that when there is
one candidate with a score above a certain threshold, it can be selected, whereas when multiple
candidates share the same score, or the highest score is very low, further investigation is needed
to find the correct entity.
    10
       reconciliation-api.github.io/specs/latest/
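   This selection heuristic can be sketched as follows; the threshold and tie margin are illustrative
assumptions, not values prescribed by LamAPI.

def select_entity(candidates, threshold=0.9, margin=1e-6):
    """Pick the top candidate only when it is both confident and unambiguous.

    candidates: list of dicts with 'id' and 'score', sorted by score descending.
    Returns the selected id, or None when further disambiguation is needed.
    """
    if not candidates:
        return None
    top = candidates[0]
    if top["score"] < threshold:
        return None  # highest score too low: contextual disambiguation needed
    if len(candidates) > 1 and abs(top["score"] - candidates[1]["score"]) < margin:
        return None  # candidates share the same score: ambiguous
    return top["id"]

# With the DBpedia results of Listing 6:
print(select_entity([{"id": "Albert_Einstein", "score": 1.0},
                     {"id": "Albert_Einstein_ATV", "score": 0.789}]))  # Albert_Einstein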
   The types present in the response can be used to iterate a Lookup request to filter the
results and obtain more accurate candidate lists, as shown in Listings 1 and 2. Thanks to the
Type-similarity service described below, it is possible to identify types similar to a given type
(rdf:type), which allows for relaxing the constraints in case of uncertainty about which type
to use as a filter.

 Type-similarity: given the unique id of a type as input, it retrieves the top 𝑘 most similar
 types by calculating a ranking based on cosine similarity.
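   The ranking behind this service can be sketched as a cosine-similarity search over type
embeddings; the toy vectors below stand in for real embeddings such as the RDF2Vec ones
mentioned in Section 4 and are not LamAPI’s actual model.

import numpy as np

def top_k_similar_types(type_id, embeddings, k=5):
    """Rank all types by cosine similarity to the embedding of type_id."""
    query = embeddings[type_id]
    scores = {
        t: float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)))
        for t, v in embeddings.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy vectors standing in for real type embeddings.
embeddings = {
    "Philosopher": np.array([0.9, 0.1, 0.3]),
    "Economist": np.array([0.7, 0.2, 0.4]),
    "Scientist": np.array([0.2, 0.9, 0.5]),
}
print(top_k_similar_types("Philosopher", embeddings, k=2))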
    Examples of results returned for the input types Philosopher and Scientist are shown in
 Listing 7 and Listing 8, referring to Wikidata and DBpedia, respectively.

     Listing 7: Type-similarity: returned data from Wikidata.
     Q4964182(philosopher)
     {
       "type": Q4964182(philosopher)
       "cosine_similarity": 1.0
     },
     {
       "type": Q2306091(sociologist)
       "cosine_similarity": 0.865
     }
     Q901(scientist)
     {
       "type": Q901(scientist)
       "cosine_similarity": 1.0
     },
     {
       "type": Q19350898(theoretical...)
       "cosine_similarity": 0.912
     }
     ...

     Listing 8: Type-similarity: returned data from DBpedia.
     Philosopher
     {
       "type": Philosopher
       "cosine_similarity": 1.0
     },
     {
       "type": Economist
       "cosine_similarity": 0.684
     }
     Scientist
     {
       "type": Scientist
       "cosine_similarity": 1.0
     },
     {
       "type": Medician
       "cosine_similarity": 0.723
     },
     ...




 4. Validation
In this Section, different retrieval configurations, i.e., query and filtering strategies, are illustrated
and validated.
   The dataset used for validation is 2T_2020 [23]: 2T comprises 180 tables with around 70,000
unique cells. It is characterised by cells with intentional orthographic errors, so ER with
misspelled words can be tested. The dataset is available for both the Wikidata and DBpedia
KGs, making it possible to compare the results for both KGs on the same tables.
   The validation process starts with a set of mentions 𝑀 and a number 𝑘 of candidates
associated with each mention. The Lookup service returns a set of candidates 𝐸𝑚 that includes
all the candidates found. The returned set is then checked against 2T to verify which of the
correct entities are present and in what position they appear in the ranked results in 𝐸𝑚 . We
compute the coverage with the following formula:

                  coverage = (# candidates found) / (# total candidates to find)                  (1)

where # represents “number of”.
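A direct way to compute this measure, given the gold links and the retrieved candidate sets, is
sketched below; the data structures are assumptions made for illustration.

def coverage(gold_links, retrieved):
    """gold_links: dict mention -> correct entity id.
    retrieved: dict mention -> ranked list of candidate entity ids (E_m).
    Coverage = (# correct entities found among candidates) / (# entities to find).
    """
    found = sum(
        1 for mention, entity in gold_links.items()
        if entity in retrieved.get(mention, [])
    )
    return found / len(gold_links)

gold = {"albert einstein": "Q937", "manchester": "Q18125"}
cands = {"albert einstein": ["Q937", "Q356303"], "manchester": ["Q21"]}
print(coverage(gold, cands))  # 0.5: only 'albert einstein' is covered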
Table 1 presents the coverage values for lookup based on label matching of a mention, enabling
fuzzy and n-gram searches. The experiments were conducted using 20 parallel processes on a
server with 40 Intel Xeon Silver 4114 CPUs @ 2.20GHz and 40GB of RAM.
   Tables 2 and 3 show the coverage using the constraint on types. To select and expand types,
four methods were applied.

   1. Type: this method considers only the type or set of types (seed types) indicated in the
      call to the Lookup service, and it does not carry out any expansion of types.
   2. Type Co-occurrence: for the seed types, it extracts additional types based on the co-
      occurrence of types in the KG. The co-occurrence score represents the number of times
      each type co-occurs with another type in a KG at the entity level (see the sketch after
      this list).
   3. Type Cosine Similarity: the seed types are extended using the cosine similarity of
      RDF2Vec11 embeddings.
   4. Soft Inference: the seed types are extended using a Feed Forward Neural Network
      that takes as input the RDF2Vec vector of an entity linked to a mention and predicts the
      possible types for that entity [18].
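   To make the co-occurrence score concrete, the following minimal sketch counts, over all
entities, how often pairs of types appear together; the entity-to-types map is an illustrative
stand-in for a full KG.

from collections import Counter
from itertools import combinations

# Toy map entity -> set of its types.
entity_types = {
    "Manchester_United_F.C.": {"SoccerClub", "SportsClub", "Organisation"},
    "Manchester_City_F.C.": {"SoccerClub", "SportsClub"},
    "Manchester": {"City", "Settlement"},
}

# Count how many times each unordered pair of types co-occurs at the entity level.
cooc = Counter()
for types in entity_types.values():
    for t1, t2 in combinations(sorted(types), 2):
        cooc[(t1, t2)] += 1

def expand(seed_type, k=2):
    """Return the k types that most often co-occur with the seed type."""
    scores = Counter()
    for (t1, t2), n in cooc.items():
        if t1 == seed_type:
            scores[t2] += n
        elif t2 == seed_type:
            scores[t1] += n
    return [t for t, _ in scores.most_common(k)]

print(expand("SoccerClub"))  # ['SportsClub', 'Organisation']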

   In Table 2, it is possible to notice that the first method achieves the highest coverage. The best
result is obtained by adding two types. Co-occurrence and Type Cosine Similarity are both
‘idempotent’ methods. The Soft Inference technique uses the entities obtained by a pre-linking
step; not all entity vectors are available, so we cannot always extend the set of types. In Table 3
we report the results for Wikidata. Also in this case, the best results are achieved using the
first method. The achieved coverage is the highest because this KG has a comprehensive
hierarchy with more detailed types.
   Even if lower, the coverage values obtained with the type expansion methods are promising.
We must consider that the exact type to use as a filter is often not known a priori in a real
scenario. For example, to select a type, a user should know the profile of a KG and how it is
used to describe entities. Thanks to the methods described above, the search results will contain
entities belonging to other types but still related to the input.


5. The LamAPI retrieval service
LamAPI is implemented in Python using ElasticSearch and MongoDB. A demonstration setting is
publicly available12 through a Swagger documentation page for testing purposes (Figures 1 and
2). The LamAPI repository13 is also publicly available, so the code can be downloaded and
customised if needed.
   11
      rdf2vec.org
   12
      lamapi.ml
   13
      bitbucket.org/discounimib/lamapi
Table 1
Coverage results and response times for different searches in Wikidata and DBpedia v. 2022.03.01.
                                                               DBpedia             Wikidata
                                     Methods
                                                           Coverage   Time     Coverage    Time
                                   N-gram                   0.842     228 s     0.787      649 s
                                    Fuzzy                   0.806     226 s     0.805      766 s
                                    Token                   0.561     227 s     0.530      230 s
                               N-gram + Fuzzy               0.891     267 s     0.926     1649 s
                              N-gram + Token                0.883     229 s     0.891      807 s
                                Fuzzy + Token               0.812     226 s     0.825      773 s
                           N-gram + Fuzzy + Token           0.895     270 s     0.929     1577 s


Table 2
Coverage results for 2T DBpedia.
 Methods                  w/o type     1 type    2 types     3        4       5       6       7       8       9       10
 Type                     0.892        0.904     0.905       0.904    0.889   0.884   0.879   0.872   0.870   0.867   0.848
 Type Co-occurrence       0.892        0.886     0.896       0.886    0.856   0.884   0.830   0.834   0.834   0.833   0.823
 Type Cosine Similarity   0.892        0.892     0.886       0.889    0.885   0.881   0.873   0.869   0.825   0.825   0.830
 Soft Inference           0.892        0.885     0.872       0.884    0.882   0.879   0.885   0.886   0.878   0.874   0.869


Table 3
Coverage results for 2T Wikidata.
 Methods                  w/o type     1 type    2 types     3        4       5       6       7       8       9       10
 Type                     0.929        0.941     0.939       0.946    0.946   0.947   0.947   0.945   0.945   0.943   0.944
 Type Co-occurrence       0.929        0.854     0.808       0.796    0.793   0.795   0.797   0.797   0.795   0.796   0.795
 Type Cosine Similarity   0.929        0.853     0.853       0.852    0.851   0.850   0.849   0.849   0.848   0.847   0.845




Figure 1: LamAPI documentation page.
Figure 2: LamAPI Lookup service.


   For completeness, the list of LamAPI services, with their descriptions, is provided below.
Types: given the unique id of an entity as input, it retrieves all the types of which the entity is
an instance. The service relies on vector similarity measures among the types in the KG to compute
the answer. For DBpedia entities, the service returns direct types, transitive types, and Wikidata
types of the related entity, while for Wikidata, it returns only the list of concepts/types
for the input entity.
Literals: given the unique id of an entity as input, it retrieves all relationships (predicates) and
literal values (objects) associated with that entity.
Predicates: given the unique id of two entities as input, it retrieves all the relationships
(predicates) between them.
Objects: given the unique id of an entity as input, it retrieves all related objects and predicates.
Type-predicates: given the unique id of two types as input, it retrieves all predicates that
relate entities of input types with a frequency score associated with each predicate.
Labels: given the unique id of an entity as input, it retrieves all the related labels and aliases
(rdfs:label).
WikiPageWikiLinks: given the unique id of an entity as input, it retrieves links from a
WikiPage to other WikiPages.
Same-as: given the unique id of an entity as input, it returns the corresponding entity for both
Wikidata and DBpedia (schema:sameAs).
Wikipedia-mapping: given the unique id or curid of a Wikipedia entity, it returns the corre-
sponding entity for Wikidata and DBpedia.
Literal-recogniser: given an array of strings as input, the endpoint returns the type of each
literal by applying a set of regex rules. The literals recognised are dates (e.g., 1997-08-26,
1997.08.26, 1997/08/26), numbers (e.g., 2.797.800.564, 25 thousand, +/- 34657, 2 km), URLs,
emails and times (e.g., 12.30pm, 12pm).
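A minimal sketch of such a regex-based recogniser is shown below; the patterns are simplified
illustrations and cover far fewer cases than the actual service.

import re

# Simplified regex rules; the real service recognises many more formats.
RULES = [
    ("DATE", re.compile(r"^\d{4}[-./]\d{2}[-./]\d{2}$")),
    ("TIME", re.compile(r"^\d{1,2}(\.\d{2})?\s?(am|pm)$", re.IGNORECASE)),
    ("URL", re.compile(r"^https?://\S+$")),
    ("EMAIL", re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")),
    ("NUMBER", re.compile(r"^[+-]?[\d.,]+(\s?(thousand|million|km|kg))?$", re.IGNORECASE)),
]

def recognise(literals):
    """Return the first matching type for each input string, or 'STRING'."""
    return {
        lit: next((name for name, rx in RULES if rx.match(lit.strip())), "STRING")
        for lit in literals
    }

print(recognise(["1997-08-26", "12.30pm", "2.797.800.564", "hello"]))
# {'1997-08-26': 'DATE', '12.30pm': 'TIME', '2.797.800.564': 'NUMBER', 'hello': 'STRING'}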


6. Conclusions
Effective Entity Retrieval services are crucial to effectively support the task of Entity Linking for
unstructured and semi-structured datasets. In this paper, we discussed how different strategies
can be beneficial to reduce the search space and therefore deliver more accurate results while
saving time, computing power and storage. The results have been empirically validated against
public datasets of tabular data. Preliminary experiments with textual data are encouraging.
We plan to complete such validation activities and further develop LamAPI to provide
full support for any format of input datasets. In addition, other search and filtering strategies
will be implemented and tested to provide users with a complete set of alternatives, along with
information on when and how each can be usefully adopted. The tool could also be extended to
support other natural language processing tasks, such as entity linking on free text.


References
 [1] V. Cutrona, M. Ciavotta, F. D. Paoli, M. Palmonari, ASIA: a tool for assisted semantic
     interpretation and annotation of tabular data, in: Proceedings of the ISWC 2019 Satellite
     Tracks, volume 2456 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 209–212.
 [2] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, SemTab 2019: Resources
     to benchmark tabular data to knowledge graph matching systems, in: The Semantic
     Web, Springer International Publishing, Cham, 2020, pp. 514–530.
 [3] E. Jimenez-Ruiz, O. Hassanzadeh, V. Efthymiou, J. Chen, K. Srinivas, V. Cutrona, Results
     of SemTab 2020, CEUR Workshop Proceedings 2775 (2020) 1–8.
 [4] V. Cutrona, J. Chen, V. Efthymiou, O. Hassanzadeh, E. Jimenez-Ruiz, J. Sequeda, K. Srinivas,
     N. Abdelmageed, M. Hulsebos, D. Oliveira, C. Pesquita, Results of SemTab 2021, in: 20th
     International Semantic Web Conference, volume 3103, CEUR Workshop Proceedings, 2022,
     pp. 1–12.
 [5] M. Palmonari, M. Ciavotta, F. De Paoli, A. Košmerlj, N. Nikolov, Ew-shopp project:
     Supporting event and weather-based data analytics and marketing along the shopper
     journey, in: Advances in Service-Oriented and Cloud Computing, Springer International
     Publishing, Cham, 2020, pp. 187–191.
 [6] G. Weikum, X. L. Dong, S. Razniewski, F. M. Suchanek, Machine knowledge: Creation and
     curation of comprehensive knowledge bases, Found. Trends Databases 10 (2021) 108–490.
 [7] M. Kejriwal, C. A. Knoblock, P. Szekely, Knowledge graphs: Fundamentals, techniques,
     and applications, MIT Press, 2021.
 [8] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in:
     Proceedings of the Thirteenth Conference on Computational Natural Language Learning
     (CoNLL-2009), Association for Computational Linguistics, Boulder, Colorado, 2009, pp.
     147–155.
 [9] D. Rao, P. McNamee, M. Dredze, Entity Linking: Finding Extracted Entities in a Knowledge
     Base, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 93–115.
[10] W. Shen, J. Wang, J. Han, Entity linking with a knowledge base: Issues, techniques, and
     solutions, IEEE Transactions on Knowledge and Data Engineering 27 (2015) 443–460.
[11] L. Wu, F. Petroni, M. Josifoski, S. Riedel, L. Zettlemoyer, Zero-shot entity linking with
     dense entity retrieval, in: EMNLP, 2020.
[12] W. Shen, Y. Li, Y. Liu, J. Han, J. Wang, X. Yuan, Entity linking meets deep learning:
     Techniques and solutions, IEEE Transactions on Knowledge and Data Engineering (2021)
     1–1.
[13] X. Li, Z. Li, Z. Zhang, N. Liu, H. Yuan, W. Zhang, Z. Liu, J. Wang, Effective few-shot named
     entity linking by meta-learning, 2022.
[14] P. Nguyen, I. Yamada, N. Kertkeidkachorn, R. Ichise, H. Takeda, SemTab 2021: Tabular data
     annotation with MTab tool, in: SemTab@ISWC, 2021, pp. 92–101.
[15] T. M. Lai, H. Ji, C. Zhai, Improving candidate retrieval with entity profile generation for
     wikidata entity linking, arXiv preprint arXiv:2202.13404 (2022).
[16] S. Chen, A. Karaoglu, C. Negreanu, T. Ma, J.-G. Yao, J. Williams, F. Jiang, A. Gordon, C.-Y.
     Lin, Linkingpark: An automatic semantic table interpretation system, Journal of Web
     Semantics 74 (2022) 100733.
[17] C. Sarthou-Camy, G. Jourdain, Y. Chabot, P. Monnin, F. Deuzé, V.-P. Huynh, J. Liu, T. Labbé,
     R. Troncy, Dagobah ui: A new hope for semantic table interpretation, in: European
     Semantic Web Conference, Springer, 2022, pp. 107–111.
[18] V. Cutrona, G. Puleri, F. Bianchi, M. Palmonari, Nest: Neural soft type constraints to
     improve entity linking in tables., in: SEMANTiCS, 2021, pp. 29–43.
[19] J. Raiman, O. Raiman, Deeptype: Multilingual entity linking by neural type system
     evolution, Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018).
[20] Y. Onoe, G. Durrett, Fine-grained entity typing for domain independent entity linking,
     Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020) 8576–8583.
[21] B. Hachey, W. Radford, J. Nothman, M. Honnibal, J. R. Curran, Evaluating entity linking
     with wikipedia, Artificial Intelligence 194 (2013) 130–150. Artificial Intelligence, Wikipedia
     and Semi-Structured Resources.
[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[23] V. Cutrona, F. Bianchi, E. Jiménez-Ruiz, M. Palmonari, Tough tables: Carefully evaluating
     entity linking for tabular data, in: The Semantic Web – ISWC 2020, Springer International
     Publishing, Cham, 2020, pp. 328–343.