Entity Retrieval Docker Image for OSIRRC at SIGIR 2019

Negar Arabzadeh
narabzad@ryerson.ca
Ryerson University, Toronto, Ontario

ABSTRACT

With the emergence of structured data, retrieving entities instead of documents has become more prevalent as a way to satisfy the information need behind a query. Consequently, several high-performance entity retrieval methods have been introduced to the Information Retrieval (IR) community in recent years. Replicating and reproducing these standard entity retrieval methods is considered a challenging task in the IR community. The Open-Source IR Replicability Challenge (OSIRRC) addresses this problem by introducing a unified framework for dockerizing a variety of retrieval tasks. In this paper, a Docker image is built for six different entity retrieval models: LM, MLM-tc, MLM-all, PRMS, SDM and FSDM. In addition, the Entity Linking incorporated Retrieval (ELR) extension has been implemented and can be applied on top of all the mentioned models. The entity retrieval Docker can retrieve relevant entities for any given topic.

Image Source: https://github.com/osirrc/entityretrieval-docker
Docker Hub: https://hub.docker.com/r/osirrc2019/entityretrieval

1 OVERVIEW

In the past two decades, search engines have mostly been dealing with unorganized and unclassified data, i.e., unstructured data, until the emergence of semantic search. In order to satisfy the information need behind a query using structured data, retrieving machine-recognizable "entities" has proven to be a suitable complement to document retrieval for multiple reasons. For instance, returning a document in response to a query that is looking for an entity might not be the best option, because users then have to look into the document to find their desired information. That is one of the main reasons why retrieval is becoming more and more entity-centric, particularly on the web. Document retrieval differs from entity retrieval in a couple of ways. In document retrieval, entities are usually used for query expansion or as retrieval features in order to improve learning-to-rank frameworks and consequently enhance document retrieval performance. Entity retrieval, on the other hand, is defined as searching for an entity in a knowledge base where entities are first-class citizens [2]. In other words, our goal is to retrieve the most relevant entities from a knowledge base for a given term-based query.

In this Docker image, the following standard entity retrieval models have been implemented:

• LM [9]
• MLM-tc [7]
• MLM-all [8]
• PRMS [5]
• SDM [6]
• FSDM [10]

[Figure 1: Entity retrieval flowchart for a given query Q and retrieval model M. D is the required representation of entities for the model M.]

Furthermore, an extension of the Markov Random Field (MRF) framework that incorporates entity annotations into the retrieval model, called Entity Linking incorporated Retrieval (ELR), can be applied on top of the mentioned retrieval models. Applying ELR to the state-of-the-art entity retrieval models results in the following ELR-integrated entity retrieval models [2]:

• LM-ELR
• MLM-tc-ELR
• MLM-all-ELR
• PRMS-ELR
• SDM-ELR
• FSDM-ELR

DBpedia version 3.9 has been used as the knowledge base for the entity retrieval tasks in this Docker image. A term-based index and an entity-based index have been created from the knowledge base using Lucene. In the former index, URI objects are resolved to terms and the default Lucene stop words are removed; in the latter, URI objects are preserved and literal objects are ignored. In total, the entity-based index contains 3,984,580 entities. While the term-based index is used in the standard entity retrieval models, both the term-based and the entity-based indices are used in the ELR entity retrieval models.
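To make the difference between the two indices concrete, the following minimal Python sketch contrasts the two document views built from a handful of DBpedia-style predicate-object pairs: a term-based view in which URI objects are resolved to terms and stop words are dropped, and an entity-based view in which URI objects are kept and literal objects are ignored. The field names, toy triples and helper functions are illustrative assumptions, not the actual indexing code of the image.

```python
# Illustrative sketch only: the field names, toy triples and helper functions
# below are assumptions for exposition, not the indexing code shipped in the image.

STOPWORDS = {"the", "of", "a", "an", "and", "in", "is"}  # stand-in for Lucene's default list

def is_uri(obj):
    """Crude check for a URI-style object such as 'dbpedia:Audi' (illustration only)."""
    return ":" in obj and " " not in obj

def uri_to_terms(uri):
    """Resolve a URI object to plain terms, e.g. 'dbpedia:Audi_A4' -> 'audi a4'."""
    return uri.rsplit(":", 1)[-1].replace("_", " ").lower()

def term_based_doc(predicate_objects):
    """Term-based view: URI objects resolved to terms, default stop words removed."""
    doc = {}
    for field, obj in predicate_objects:
        text = uri_to_terms(obj) if is_uri(obj) else obj
        terms = [t for t in text.lower().split() if t not in STOPWORDS]
        doc.setdefault(field, []).extend(terms)
    return doc

def entity_based_doc(predicate_objects):
    """Entity-based view: URI objects preserved, literal objects ignored."""
    doc = {}
    for field, obj in predicate_objects:
        if is_uri(obj):
            doc.setdefault(field, []).append(obj)
    return doc

# Toy predicate-object pairs for a single entity.
triples = [
    ("label", "Audi A4"),
    ("abstract", "The Audi A4 is a line of compact executive cars"),
    ("manufacturer", "dbpedia:Audi"),
]
print(term_based_doc(triples))    # terms only
print(entity_based_doc(triples))  # URIs only
```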
One of the critical components of the ELR approach is the entity annotation of queries. TAGME, an open-source entity linker, has been adopted to annotate entities in queries with the default threshold of 0.1. Hasibi et al. [2] have shown that the ELR approach is robust to the annotation threshold.

In summary, given a query Q and a retrieval model M, entity retrieval operates according to Figure 1, where D is the representation of entities. Section 2 elaborates on the standard entity retrieval methods and their combination with the ELR extension. Section 3 provides more details on the technical design of the Docker image. Section 4 describes our motivation and experience in participating in the OSIRRC 2019 challenge. The last section, i.e., Section 5, gives an outlook on further work in this area and concludes the paper.

2 RETRIEVAL MODELS

As mentioned in the Overview section, several standard retrieval models, namely LM, MLM-tc, MLM-all, PRMS, SDM and FSDM, have been implemented in this Docker. In addition, the ELR extension can be applied on top of them, which results in twelve different retrieval models in total. There are two different representations of entities: the term-based representation and the entity-based representation. For the standard retrieval models, the ranking function operates on the term-based representation of the DBpedia collection (using the term-based index). For the ELR extension, on the other hand, both the term-based and the entity-based representations of entities in DBpedia are used (the term-based and URI-based indices). The term-based and entity-based representations are compared in Figure 2 [2]. Retrieval models can be categorized into language modeling-based models, sequential dependence-based models and ELR models.

[Figure 2: Term-based vs. entity-based representation [2]]

2.1 Language Modeling-based methods

Language modeling-based models do not consider dependencies among query terms. LM [9], MLM-tc [7], MLM-all [8] and PRMS [5] are all language modeling-based methods. LM is applied only to the content field, whereas MLM-tc runs against the name field as well as the content field; the content and name fields have weights of 0.8 and 0.2, respectively. MLM-all and PRMS use the top 10 fields: the former is a mixture of language models over the top 10 fields with uniform weights, while the latter is a probabilistic retrieval model designed for semi-structured data. More details on each of the mentioned language models can be found in the respective papers.
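As a concrete illustration of the two-field mixture behind MLM-tc, the toy Python sketch below scores a query by interpolating smoothed field language models for content and name with the 0.8/0.2 weights mentioned above. The smoothing method (Jelinek-Mercer with lambda = 0.1) and all statistics are invented for exposition; the image itself relies on the implementation of Hasibi et al. [2].

```python
import math

# Toy MLM-tc style scorer: two field language models mixed with fixed weights
# (content: 0.8, name: 0.2). The smoothing choice (Jelinek-Mercer, lambda=0.1)
# and the toy statistics below are illustrative assumptions.
FIELD_WEIGHTS = {"content": 0.8, "name": 0.2}
JM_LAMBDA = 0.1

def p_term_field(term, field_tf, field_len, coll_field_prob, lam=JM_LAMBDA):
    """Smoothed field language model P(t | theta_{d,f})."""
    mle = field_tf.get(term, 0) / max(field_len, 1)
    return (1 - lam) * mle + lam * coll_field_prob.get(term, 1e-6)

def mlm_tc_score(query_terms, doc, collection):
    """log P(q | d) = sum_t log sum_f w_f * P(t | theta_{d,f})."""
    score = 0.0
    for t in query_terms:
        p_t = sum(
            w * p_term_field(t, doc[f]["tf"], doc[f]["len"], collection[f])
            for f, w in FIELD_WEIGHTS.items()
        )
        score += math.log(p_t)
    return score

# Made-up field statistics for a single candidate entity.
doc = {
    "content": {"tf": {"audi": 3, "car": 5}, "len": 120},
    "name":    {"tf": {"audi": 1, "a4": 1},  "len": 2},
}
collection = {"content": {"audi": 1e-4, "car": 1e-3}, "name": {"audi": 1e-3, "a4": 1e-4}}
print(mlm_tc_score(["audi", "a4"], doc, collection))
```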
2.2 Sequential Dependence-based methods

The Sequential Dependence Model (SDM) is a popular Markov Random Field-based retrieval model. Given a document D and a query Q, the conditional probability P(D|Q) is estimated with a Markov Random Field as in equation (1), where D is the term-based representation of an entity:

    P(D|Q) \stackrel{rank}{=} \sum_{c \in C(G)} \lambda_c f(c)    (1)

In equation (1), C(G) is the set of cliques in the graph G. The nodes of G are the query terms and the document, and the edges among nodes represent the dependencies among them. \lambda_c is the weight of the feature function f(c). More details can be found in the original paper [6].

Considering dependencies among query terms turns the Markov Random Field of equation (1) into the SDM ranking function in equation (2), with \lambda_T + \lambda_O + \lambda_U = 1:

    P(D|Q) \stackrel{rank}{=} \lambda_T \sum_{q_i \in Q} f_T(q_i, D) + \lambda_O \sum_{q_i, q_{i+1} \in Q} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{q_i, q_{i+1} \in Q} f_U(q_i, q_{i+1}, D)    (2)

2.2.1 Fielded Sequential Dependence Models (FSDM). The Fielded Sequential Dependence Model (FSDM) takes document structure into account by computing a linear interpolation of the probabilities of each document's fields. Thus, the feature functions are also calculated on the field representation of documents. In other words, in the FSDM model [10], equation (2) is used with different feature functions, as a separate language model is built for each field.

2.3 ELR models

Incorporating entity linking into entity retrieval improves entity retrieval performance [2]. Linking entities with TAGME yields a confidence score s(e) for each entity e. While sequential dependencies among query terms are considered in the MRF-based models, the annotated entities are assumed to be independent of each other and of the query terms. Applying the ELR extension to the previous models results in equation (3) as the ranking function, where |Q| is the query length and s(e) is the entity linking confidence score of entity e annotated by TAGME:

    P(D|Q) \stackrel{rank}{=} \lambda_T \sum_{q_i \in Q} \frac{1}{|Q|} f_T(q_i, D) + \lambda_O \sum_{q_i, q_{i+1} \in Q} \frac{1}{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{q_i, q_{i+1} \in Q} \frac{1}{|Q|-1} f_U(q_i, q_{i+1}, D) + \lambda_E \sum_{e \in E(Q)} s(e) f_E(e, D)    (3)

Equation (3) and its feature functions are elaborated further in [6]. The free-parameter constraint \lambda_T + \lambda_O + \lambda_U + \lambda_E = 1 holds for equation (3). For LM, MLM-tc and MLM-all, \lambda_O and \lambda_U are set to zero since they are unigram-based models. All the feature functions are defined in [2].
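To make the combination in equation (3) explicit, the following toy sketch assembles an ELR-style score from pre-computed feature values. The query, the linked entity, its TAGME confidence and all feature values are made up for illustration; the actual feature functions f_T, f_O, f_U and f_E are those defined in [2].

```python
# Toy sketch of the ELR ranking function in equation (3). Feature values,
# linked entities and TAGME confidences are invented; the real feature
# functions f_T, f_O, f_U, f_E are defined in Hasibi et al. [2].
def elr_score(query_terms, linked_entities, features, lambdas):
    """lambdas = (l_T, l_O, l_U, l_E) with l_T + l_O + l_U + l_E = 1."""
    l_t, l_o, l_u, l_e = lambdas
    q_len = len(query_terms)
    bigrams = list(zip(query_terms, query_terms[1:]))

    term_part = l_t * sum(features["T"][q] / q_len for q in query_terms)
    ordered_part = l_o * sum(features["O"][b] / (q_len - 1) for b in bigrams)
    unordered_part = l_u * sum(features["U"][b] / (q_len - 1) for b in bigrams)
    entity_part = l_e * sum(s_e * features["E"][e] for e, s_e in linked_entities)
    return term_part + ordered_part + unordered_part + entity_part

query = ["ann", "dunham"]
entities = [("<dbpedia:Ann_Dunham>", 0.9)]          # (entity, TAGME confidence s(e))
features = {
    "T": {"ann": -2.1, "dunham": -3.4},             # f_T(q_i, D)
    "O": {("ann", "dunham"): -1.7},                 # f_O(q_i, q_{i+1}, D)
    "U": {("ann", "dunham"): -1.2},                 # f_U(q_i, q_{i+1}, D)
    "E": {"<dbpedia:Ann_Dunham>": -0.8},            # f_E(e, D)
}
print(elr_score(query, entities, features, lambdas=(0.8, 0.05, 0.05, 0.1)))
```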
3 TECHNICAL DESIGN

One of the major issues when dealing with replicability is that the system should be delivered in a lightweight package [1]. Docker has this ability, since a relatively inexpensive container can be created from each Docker image. A jig was introduced in the OSIRRC 2019 workshop to make co-implementing and co-designing possible. The jig, which is open source and available on GitHub (https://github.com/osirrc/jig), acts as a tool that maintains the computational relationship between Docker images and retrieval tasks.

The entity retrieval Docker image consists of init, index and search hooks invoked with Python 3 as the interpreter. The jig triggers the init hook first and then the index and search hooks, respectively, in the Docker image. Finally, if gold standards are available for the topic set, the jig evaluates the results. Since we can get data into and out of the container built from the Docker image [1], the retrieval results end up in the jig output directory. Further explanation of how to run the entity retrieval Docker image is available in the entity retrieval GitHub repository. The remainder of this section describes the different components of the Docker image, the supported hooks and the extra options that can be passed to the jig for the entity retrieval Docker.

3.1 Dockerfile

The latest official version of Ubuntu (https://hub.docker.com/_/ubuntu) is installed in the Docker image along with all the required commands. In addition, compatible versions of other requirements such as Java 8, Apache Ant, Apache Ivy, g++, and so on are installed. Making all the components compatible with each other was quite a challenging issue. Since installing all the requirements is time-consuming, a Docker image with all the basic requirements was prepared and pushed to Docker Hub (https://hub.docker.com/r/narabzad/elr_prepared_os). This prepared image is used as the base image of our Dockerfile in order to decrease the build time of the Docker. This sets the stage for COPYing the init, index and search hooks, which must be executable files.

3.2 Supported Collections

DBpedia version 3.9 (https://wiki.dbpedia.org/services-resources/datasets/data-set-39/downloads-39) is used as the corpus of the entity retrieval Docker image. In order to reduce the run-time cost of the Docker, the original pre-built index is reused: both the term-based and the URI-based index of the collection are downloaded once in the preparation step of the jig. However, to run the Docker through the jig, a dummy collection still needs to be passed.

3.3 Supported Hooks

This section elaborates on the role of each hook. The init and index hooks are triggered in the jig preparation step, and the search hook script runs in the jig search step.

3.3.1 init. The actual implementation of the retrieval models is cloned in this hook from the GitHub repository (https://github.com/Narabzad/elr_files), and the required compatible packages are installed. Running this hook may take a while because downloading, building and installing PyLucene is time-consuming. Once the installation is done, the two indexed collections, i.e., the DBpedia term-based index and the URI-based index, are downloaded (~18 GB) and extracted.

3.3.2 index. The indexed DBpedia collection is already downloaded in the init hook. Hence, nothing happens in this hook in this Docker.

3.3.3 search. Once the image is prepared, the retrieval models can be run with their customized parameters in the search hook. The search hook runs the main implementation of the models (https://github.com/hasibi/EntityLinkingRetrieval-ELR/) provided by Hasibi et al. [2]; the code was cloned into the Docker in the init hook. Table 1 lists all the parameters that can be set for each of the retrieval models. Given a query, depending on the retrieval model, the query is annotated or not, and then retrieval takes place based on the set parameters. Finally, the ranked list of retrieved entities for each query is saved in the output directory.

Table 1: Entity retrieval models' acceptable parameters, which are the entity linking threshold (threshold), the number of selected fields (nfields) and the free parameters (λ_T, λ_O, λ_U, λ_E). For each model, ✓ indicates that the parameter affects the retrieval model and × indicates that the parameter does not have any effect on the model. Depending on the model, some parameters might be fixed to zero.

| Model       | threshold | nfields | λ_T | λ_O | λ_U | λ_E |
|-------------|-----------|---------|-----|-----|-----|-----|
| LM          | ×         | ×       | ✓   | 0   | 0   | ×   |
| MLM-tc      | ×         | ×       | ✓   | 0   | 0   | ×   |
| MLM-all     | ×         | ✓       | ✓   | 0   | 0   | ×   |
| PRMS        | ×         | ✓       | ✓   | 0   | 0   | ×   |
| SDM         | ×         | ×       | ✓   | ✓   | ✓   | ×   |
| FSDM        | ×         | ✓       | ✓   | ✓   | ✓   | ×   |
| LM-ELR      | ✓         | ×       | ✓   | 0   | 0   | ✓   |
| MLM-tc-ELR  | ✓         | ×       | ✓   | 0   | 0   | ✓   |
| MLM-all-ELR | ✓         | ✓       | ✓   | 0   | 0   | ✓   |
| PRMS-ELR    | ✓         | ✓       | ✓   | 0   | 0   | ✓   |
| SDM-ELR     | ✓         | ×       | ✓   | ✓   | ✓   | ✓   |
| FSDM-ELR    | ✓         | ✓       | ✓   | ✓   | ✓   | ✓   |

All the topic sets available in the jig, e.g., the topics of Robust04, ClueWeb09, ClueWeb12, Gov2, Core17, Core18, etc., are supported by this Docker. Any other queries are acceptable as well, as long as each query is represented in the format "query number or name:query terms", one per line. An instance of a topic file looks like:

wt09-1:obama family tree
wt09-2:french lick resort and casino
wt09-3:getting organized
...
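As a rough sketch of the plumbing around the search hook, the snippet below parses topics in the "query number or name:query terms" format shown above and writes a ranked result list in the standard six-column TREC run format. The file names, the run tag and the dummy_rank placeholder are assumptions for illustration; the real retrieval call is the ELR code base cloned in the init hook.

```python
# Sketch only: parse topics in the "id:query terms" format and emit a
# TREC-style run file. File names, the run tag and dummy_rank() are
# illustrative assumptions, not the image's actual search hook.
def read_topics(path):
    topics = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            qid, _, terms = line.partition(":")
            topics[qid] = terms
    return topics

def write_run(results, path, tag="entityretrieval"):
    """results: {qid: [(entity_id, score), ...]} sorted by descending score."""
    with open(path, "w") as out:
        for qid, ranked in results.items():
            for rank, (entity, score) in enumerate(ranked, start=1):
                # Standard TREC run format: qid Q0 docid rank score tag
                out.write(f"{qid} Q0 {entity} {rank} {score:.4f} {tag}\n")

def dummy_rank(terms):
    # Placeholder for the real retrieval call provided by the ELR code base.
    return [("<dbpedia:Example_Entity>", 1.0)]

topics = read_topics("topics.txt")
write_run({qid: dummy_rank(t) for qid, t in topics.items()}, "run.txt")
```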
If relevant entities (qrels) are available for the associated topics, the jig will use trec_eval (https://github.com/usnistgov/trec_eval) to evaluate the retrieval performance with different metrics such as MAP. If there is no ground-truth ranked list of entities for the topics, a dummy qrels file needs to be passed to the jig in order for the jig to run the Docker.

4 OSIRRC EXPERIENCE

The crucial role of repeatability, replicability, and reproducibility cannot be neglected in any research domain, especially when it comes to practical experiments. Deciding to participate in this challenge was easy: whether one is repeating one's own computation, replicating another researcher's experiments, or reproducing another team's research with a totally different setup, there is always a struggle involved. This Docker tackles all three of these challenges for entity retrieval. Not only is the Docker built by a non-author of the main paper (replicability) [2], but this work is also not limited to specific topics. In other words, the entity retrieval Docker is built in such a way that any topics can be used to retrieve the relevant entities (reproducibility), and if the relevant entities, i.e., the gold standard, are available for the topics, the model can be evaluated as well. However, the supported collection is still limited to the indexed DBpedia version 3.9.

Dockerizing the entity retrieval models was a challenging task; furthermore, standardizing the Docker with the jig increased its complexity. One of the main issues regarding this Docker was the compatibility of the different components, e.g., Python, PyLucene, Java, Apache Ant, etc. Finding compatible versions of all those components was time-consuming, and this is one of the critical benefits of this work: using the entity Docker, researchers no longer have to spend a lot of time combining and connecting different packages, libraries and components in order to run the mentioned entity retrieval models.

Another issue is that topics appear in different formats, and we must be able to work with every topic format available in the jig. Therefore, a standard topic format is defined in Section 3.3.3 so that any topic can be used in this Docker.

For the methods with the ELR extension, an annotation step had to be added to the code implemented by Hasibi et al. [2]; the TAGME tool is used to link entities to the queries.
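For context, query annotation with TAGME can be performed through its public REST endpoint. The sketch below reflects our understanding of that web API (the endpoint, parameter names and the rho confidence field may differ across versions, and the access token is a placeholder); it keeps only annotations above the 0.1 threshold used in this image.

```python
import requests

# Hedged sketch of annotating a query with the TAGME REST API. The endpoint,
# parameter names and the 'rho' confidence field reflect our understanding of
# the public TAGME web service; TOKEN is a placeholder you must replace.
TAGME_ENDPOINT = "https://tagme.d4science.org/tagme/tag"
TOKEN = "<your-gcube-token>"
THRESHOLD = 0.1  # annotation confidence threshold used in this Docker image

def annotate_query(query, threshold=THRESHOLD):
    resp = requests.get(
        TAGME_ENDPOINT,
        params={"text": query, "lang": "en", "gcube-token": TOKEN},
        timeout=30,
    )
    resp.raise_for_status()
    annotations = resp.json().get("annotations", [])
    # Keep (entity title, confidence) pairs above the threshold.
    return [(a["title"], a["rho"]) for a in annotations if a.get("rho", 0) >= threshold]

print(annotate_query("ann dunham stanley armour"))
```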
5 FUTURE WORKS AND CONCLUDING REMARKS

The more reproducible and replicable research papers are, the more baselines will be accessible for researchers to compare their results with. By increasing repeatability, reproducibility and replicability, researchers can spend less time implementing other researchers' work; therefore, they will have more time to spend on their own research rather than on re-implementing baselines. Consequently, studies will progress faster. Hence, more work needs to be done in this specific area.

Regarding the entity retrieval Docker image, this work can be extended by supporting more collections, such as DBpedia-Entity v2 [4], in addition to the current one. Nordlys [3] implements some of the models with the updated collection, so in the future a Docker could be created for Nordlys, which provides better support for indexing. Furthermore, more entity retrieval models can be added to the Docker. In terms of applications, entity retrieval can also be used to expand queries in order to improve document retrieval performance.

To sum up our work, a Docker image is wrapped around the jig introduced for the Open-Source IR Replicability Challenge 2019. This platform provides a unified framework for different retrieval tasks. The entity retrieval Docker image contains implementations of six different entity retrieval models, and the ELR extension can be applied on any of them. All models can be customized with the desired parameters, and there is no limit on the supported topics. This Docker image is implemented with a very lightweight Linux-centric design to tackle the repeatability, reproducibility and replicability problem in the IR domain.

ACKNOWLEDGEMENT

Thanks to Faegheh Hasibi for her valuable suggestions during the implementation of the Docker image and the preparation of this paper.

REFERENCES

[1] Ryan Clancy, Nicola Ferro, Claudia Hauff, Jimmy Lin, Tetsuya Sakai, and Ze Zhong Wu. 2019. The SIGIR 2019 Open-Source IR Replicability Challenge (OSIRRC 2019). https://doi.org/10.1145/3331184.3331647
[2] Faegheh Hasibi, Krisztian Balog, and Svein Erik Bratsberg. 2016. Exploiting Entity Linking in Queries for Entity Retrieval. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR 2016, Newark, DE, USA, September 12-16, 2016. 209–218. https://doi.org/10.1145/2970398.2970406
[3] Faegheh Hasibi, Krisztian Balog, Darío Garigliotti, and Shuo Zhang. 2017. Nordlys: A Toolkit for Entity-Oriented and Semantic Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017. 1289–1292. https://doi.org/10.1145/3077136.3084149
[4] Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. 2017. DBpedia-Entity v2: A Test Collection for Entity Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Shinjuku, Tokyo, Japan, August 7-11, 2017. 1265–1268. https://doi.org/10.1145/3077136.3080751
[5] Jinyoung Kim, Xiaobing Xue, and W. Bruce Croft. 2009. A Probabilistic Retrieval Model for Semistructured Data. In Advances in Information Retrieval, 31st European Conference on IR Research, ECIR 2009, Toulouse, France, April 6-9, 2009. Proceedings. 228–239. https://doi.org/10.1007/978-3-642-00958-7_22
[6] Donald Metzler and W. Bruce Croft. 2005. A Markov Random Field Model for Term Dependencies. In SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 15-19, 2005. 472–479. https://doi.org/10.1145/1076034.1076115
[7] Robert Neumayer, Krisztian Balog, and Kjetil Nørvåg. 2012. When Simple is (more than) Good Enough: Effective Semantic Search with (almost) no Semantics. In Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012, Barcelona, Spain, April 1-5, 2012. Proceedings. 540–543. https://doi.org/10.1007/978-3-642-28997-2_59
[8] Paul Ogilvie and James P. Callan. 2003. Combining Document Representations for Known-Item Search. In SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 28 - August 1, 2003, Toronto, Canada. 143–150. https://doi.org/10.1145/860435.860463
[9] ChengXiang Zhai. 2008. Statistical Language Models for Information Retrieval: A Critical Review. Foundations and Trends in Information Retrieval 2, 3 (2008), 137–213. https://doi.org/10.1561/1500000008
[10] Nikita Zhiltsov, Alexander Kotov, and Fedor Nikolaev. 2015. Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015. 253–262. https://doi.org/10.1145/2766462.2767756