=Paper= {{Paper |id=Vol-1338/paper4 |storemode=property |title=Applied Distributed Information Retrieval in Enterprise Search |pdfUrl=https://ceur-ws.org/Vol-1338/paper_4.pdf |volume=Vol-1338 |dblpUrl=https://dblp.org/rec/conf/ecir/GunadiPA15 }} ==Applied Distributed Information Retrieval in Enterprise Search== https://ceur-ws.org/Vol-1338/paper_4.pdf
Applied Distributed Information Retrieval in Enterprise Search

Erwin Gunadi, Till Plumbaum, and Sahin Albayrak
Technische Universität Berlin
10587 Berlin, Germany
{firstname.lastname}@dai-labor.de


ABSTRACT
Distributed enterprise search, a special case of Distributed Information Retrieval (DIR), is characterized by the need to query multiple repositories in enterprise environments. However, DIR research lacks publicly available real-world datasets for evaluation purposes. As a result, there is a gap between insights gained from simulated environments and real-world investigations of distributed enterprise search. In this paper, we outline three fundamental issues based on our investigations of a large real-world distributed enterprise search system. We found that (1) the utilization of security features in enterprise repositories, (2) the adaptation of resource description and resource selection to the enterprise model, and (3) repository grouping are fundamental real-world issues. We hypothesize that a better understanding of these issues will help improve distributed enterprise search and better support complex search tasks in enterprise environments. Based on our experience gained from a real-world system, we also outline the steps needed to cope with these issues.

Categories and Subject Descriptors
H.3.4 [Systems and Software]: Distributed Systems

Keywords
distributed information retrieval; enterprise search; result aggregation

1.    INTRODUCTION
   Enterprise Search is an area of information retrieval that specifically addresses the information needs of enterprise users. An enterprise is defined as an organizational entity with exclusive membership of its users. A typical enterprise environment consists of multiple layers of access rights and different dedicated data repositories such as web servers, file servers, wikis, etc. Key differences to Web search are as follows: (1) Heterogeneous document types, such as web pages, wikis, PDFs, emails, Word documents, etc. (2) Multiple document repositories: documents are normally not held in a single file server or system. (3) Access restriction: hierarchies and role rules apply to every document. (4) Managed data generation process: unlike web documents, which are created by individual entities, each enterprise defines its own document creation and update policy, which is effectively valid for all of its members [9, 5, 13].
   To address the above-mentioned issues (1)-(4), the use of Distributed Information Retrieval (DIR) has been proposed [4, 5, 13]. DIR is a concept of managing different resources through a broker. A broker mediates between users and different sources, or repositories, to collect and combine search results. To accomplish this task, three sub-problems need to be addressed: resource description, resource selection, and result merging [2, 11]. With the exception of the TREC Federated Web Search track (footnote 1), most DIR research has been based on synthetic test collections [11]. Due to the lack of appropriate datasets and the proprietary nature of enterprise data, many improvements achieved in DIR are not directly suitable for distributed enterprise search [6, 5]. These factors prevent further adoption of DIR improvements in distributed enterprise search and create a gap between these two areas.
   This paper has two main contributions. First, we bridge the gap between DIR and the enterprise context by outlining three fundamental problems that have been widely ignored in DIR research but must be considered in the enterprise context. Second, we propose ways to cope with these problems. The presented problems are (1) utilization of security features in enterprise repositories, (2) adaptation of resource description and resource selection to the enterprise model, and (3) repository grouping. All our findings are based on real-world experiences and insights we gained from operating a distributed enterprise search system at TU Berlin and the city administration of Berlin, with about 50,000 employees.
   In the next section, we outline related work that discusses the integration of DIR in enterprise environments. In Section 3, we describe the architecture of our agent-based distributed enterprise search platform and its deployment. Section 4 details the open problems that arise from our experience. In Section 5, we conclude the paper.
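
The broker concept and its three sub-problems (resource description, resource selection, result merging) introduced in this section can be made concrete with a small sketch. The following Python code is only an illustration under simplifying assumptions and is not the system described in this paper: the classes `Repository` and `Broker`, the term-count description, and the max-normalised merging are all hypothetical choices made for the example.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical in-memory repository (stands in for a wiki, file server, ...)."""
    name: str
    documents: dict[str, str]  # document id -> text

    def search(self, query: str) -> list[tuple[str, float]]:
        """Score each document by how often the query terms occur in it."""
        terms = query.lower().split()
        hits = []
        for doc_id, text in self.documents.items():
            words = Counter(text.lower().split())
            score = float(sum(words[t] for t in terms))
            if score > 0:
                hits.append((doc_id, score))
        return sorted(hits, key=lambda h: (-h[1], h[0]))


@dataclass
class Broker:
    """Toy broker covering the three DIR sub-problems named in the text."""
    repositories: list[Repository]
    descriptions: dict[str, Counter] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Resource description: term counts over all documents. This is an
        # assumption for brevity; real systems usually work from samples.
        for repo in self.repositories:
            counts: Counter = Counter()
            for text in repo.documents.values():
                counts.update(text.lower().split())
            self.descriptions[repo.name] = counts

    def select(self, query: str, k: int) -> list[Repository]:
        # Resource selection: keep the k repositories whose description
        # contains the query terms most often.
        terms = query.lower().split()

        def score(repo: Repository) -> int:
            return sum(self.descriptions[repo.name][t] for t in terms)

        ranked = sorted(self.repositories, key=score, reverse=True)
        return [repo for repo in ranked[:k] if score(repo) > 0]

    def search(self, query: str, k: int = 2) -> list[tuple[str, str, float]]:
        # Result merging: normalise scores per repository (best hit = 1.0)
        # so that lists from different repositories become comparable.
        merged = []
        for repo in self.select(query, k):
            hits = repo.search(query)
            top = hits[0][1] if hits else 1.0
            merged.extend((repo.name, doc_id, s / top) for doc_id, s in hits)
        return sorted(merged, key=lambda h: (-h[2], h[0], h[1]))
```

A broker built as `Broker([wiki, files])` would then answer `broker.search("policy")` with a single merged list of `(repository, document id, score)` tuples. Per-repository normalisation is only one of many possible merging strategies and was chosen here because it needs no training data.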


Copyright © 2015 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
ECIR Supporting Complex Search Task Workshop '15, Vienna, Austria
Published on CEUR-WS: http://ceur-ws.org/Vol-1338/.

1 https://sites.google.com/site/trecfedweb/

2.    RELATED WORKS
   Various works have proposed DIR as a possible paradigm to implement enterprise search [5, 13, 12]. The main concept
of DIR is the use of multiple resources, or repositories, through a broker in order to satisfy a user's information need. This concept fits the main characteristic of an enterprise environment, where multiple data repositories for different needs normally exist, such as web servers, file servers, wikis, etc. [9, 5, 13].
   Work on securing enterprise search systems is presented in [1, 14]. Bailey et al. [1] propose the implementation of document-level security for enterprise search systems. They evaluated how on-the-fly security checks for each document during search result list building affect the search processing time. Zhou et al. [14] propose the use of ontology-based user profiles to secure the search process. The ontology models the information search service and user role information, which can be maintained for different departments in an enterprise. How security restrictions affect search result quality in distributed enterprise search is still an open question. Current work focuses more on performance issues than on quality.
   Regarding DIR algorithms, Li et al. [7] evaluated the performance of various result merging algorithms on multiple enterprise repositories that are unique to each other.
Li et al. argue that repositories in an enterprise context are not identical to each other and that each of these repositories may have a different size, document types, intended audience, and administrative control. These characteristics need to be explored in the context of DIR in enterprise search.

3.    DISTRIBUTED ENTERPRISE SEARCH IN THE REAL-WORLD
   The findings we present in this paper are based on a research cooperation with the service provider of the administration of Berlin, where we have deployed an agent-based distributed enterprise search system [3]. The system is currently used as the standard search platform for about 50,000 employees. The structure of Berlin's network confronted us with some challenges. Even though the whole city can be regarded as a closed organization with state officials as its employees, each of the city's districts maintains its own data management policy. This means a city district is a private network with access to all main data repositories, such as the city's own intranet, but without access rights to data repositories from other city districts. As opposed to the classic DIR setting, these sub-networks cannot be served by a single main broker. Instead, each network area requires its own private broker, and an extra broker is installed in the main intranet. Another use case for an additional broker is the user desktop. To comply with the user privacy policy of the city's administration, local desktop files should not be externally accessible. Because of such restrictions, we have built a local broker so that users can find their local desktop files. The interaction between the users, the search client, and the multiple brokers is illustrated in Figure 1.

Figure 1: Multiple brokers serving different networks are contacted by a user through a search client

   Figure 1 illustrates how the search client is used to contact all of the different brokers. It enables users to search in different network areas. In each of the networks, multiple repositories are queried by the responsible broker. The user's desktop also has a dedicated broker because local files should not be queried by an external broker.

4.    OPEN PROBLEMS
   In this section, we describe the following open problems derived from our investigations of the distributed enterprise search system for the administration of Berlin. We also suggest how these open problems can be further investigated in DIR research in the enterprise context.

4.1   Utilization of security features in enterprise repositories
   Until recently, DIR research has been based on the assumption that all documents are accessible to all users [11, 10]. However, in enterprise environments access to documents is secured [1, 5, 13]. Depending on their access rights, users may see different search results for the same search query. We argue that this property can be utilized as an essential feature for improving distributed enterprise search. Utilization means not only complying with access restrictions, but also improving resource selection and result merging algorithms by using the security information.
   To accomplish this, DIR algorithms for the different steps have to integrate the security information as a feature. For example, in the DIR literature resource selection is responsible for selecting the resources with the most relevant documents. However, when a relevant repository is restricted for a particular user or user group, these documents will not be shown. This can be mitigated when repositories with more accessible documents are prioritized higher even though they are less relevant. In this case, in navigational search tasks, employees may get better recall on the subject. To the best of our knowledge, such behavior in a DIR setting has not yet been researched. Suitable test collections need to be created to evaluate this essential feature.

4.2   Adaptation of resource description and resource selection for the enterprise model
   Li et al. [7] highlighted how repositories in an enterprise environment are unique to each other. Based on this fact, Li et al. evaluate how this uniqueness may influence result merging performance. The exploitation of unique features found in enterprise repositories should
also be considered in the other sub-tasks of DIR: resource description and resource selection. Research in this area is needed in order to improve the application of DIR in Enterprise Search. Thus, it will help narrow the gap between these two research fields.
   As an example use case of such exploitation, a common repository type in an enterprise environment is the file server. Documents from file servers that are relevant to search queries are stored in a hierarchy of directories. We can include these directory names as part of the resource description of a repository. This means the content of a resource description includes not only sampled documents but also directory names. This directory information can be used as important terms that improve the resource selection algorithm. Such exploitation is yet to be investigated in the distributed enterprise search context.

4.3   Repositories grouping
   One of the challenges of our enterprise environment setting is the need for multiple brokers to handle different networks. Even though the multiple brokers can be seen as a technical feature, they introduce a new perspective on handling multiple repositories. The ability to group repositories opens the possibility of boosting repositories of similar types as a group. The application of boosting for theme-specific repositories is being actively researched in the distributed web search context [8]. In that paper, the authors investigated how news sources can be ranked and placed in the web search result. In our deployment scenario we found that search queries about specific laws and regulations are common. In this case, having a theme-specific group of repositories means that documents coming from the law-themed repositories receive a higher rank than documents from non-law-themed ones.
   By having groups of repositories, result merging techniques may rank not only based on the repository rank but also on the repository-group rank. To accomplish this, the sub-tasks of DIR must be adapted to the broker context, namely broker description and broker selection. When a particular group of repositories is highly relevant for a search query, the resulting broker ranking may be used to highlight a group of repositories in the search result page.

5.    CONCLUSION
   Due to the availability of heterogeneous repositories, previous works have proposed the application of DIR in enterprise search [5, 11, 13]. However, the improvements achieved in DIR research are rarely investigated in real-world scenarios, especially in enterprise environments [12]. Recent works show that further research on the application of DIR in real-world settings is needed in order to close the gap between DIR research and its real-world application in enterprises. In this paper, we introduced three issues based on our experience from a real distributed enterprise setting.
   For future work, we need to investigate how security information can be utilized for the different DIR sub-tasks. This also applies to the unique features found in enterprise repositories, such as file paths and directories. It permits us to better understand the application of DIR in enterprise search and thus bridges the gap between these two research areas. Also, more effort has to be put into building appropriate test collections for a real-world DIR use case, like the work of Nguyen et al. [10], for evaluating distributed enterprise search. This helps to adapt available techniques for the different tasks in DIR (resource description, resource selection, and result merging) to enterprise use cases. Another investigation is needed on how to accommodate repository grouping by building multiple brokers. Similar to news integration in web search [8], groups of repositories can be ranked differently for incoming queries. The ranking result may then be used to highlight a group of repositories in the presentation of the search result page (SERP) in enterprise search.

6.    REFERENCES
 [1] P. Bailey, D. Hawking, and B. Matson. Secure search in enterprise webs: tradeoffs in efficient implementation for document level security. CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management, 2006.
 [2] J. Callan. Distributed information retrieval. In Advances in Information Retrieval, pages 127–150. Kluwer Academic Publishers, 2000.
 [3] E. Gunadi, M. Meder, T. Plumbaum, C. Scheel, F. Hopfgartner, and S. Albayrak. Distributed enterprise search using software agents. In Proceeding AAMAS '14, pages 1623–1624, Paris, France, 2014.
 [4] D. Hawking. Challenges in enterprise search. In ADC '04 Proceedings of the 15th Australasian database conference, volume 27, pages 15–24, 2004.
 [5] D. Hawking. Enterprise Search. In R. Baeza-Yates and B. Ribeiro-Neto, editors, Modern Information Retrieval, pages 641–684. Addison-Wesley, 2010.
 [6] L. Jie, S. Lamkhede, R. Sapra, E. Hsu, H. Song, and Y. Chang. A unified search federation system based on online user feedback. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '13, page 1195, New York, New York, USA, 2013. ACM Press.
 [7] P. V. Li, P. Thomas, and D. Hawking. Merging algorithms for enterprise search. Proceedings of the 18th Australasian Document Computing Symposium on - ADCS '13, pages 42–49, 2013.
 [8] R. McCreadie, C. Macdonald, and I. Ounis. News vertical search: when and what to display to users. In SIGIR '13: 36th international ACM SIGIR conference on Research and development in information retrieval, pages 253–262, 2013.
 [9] R. Mukherjee and J. Mao. Enterprise search: Tough stuff. Queue, 2(2):36–46, April 2004.
[10] D. Nguyen, T. Demeester, D. Trieschnigg, and D. Hiemstra. Federated search in the wild: the combined power of over a hundred search engines. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1874–1878, 2012.
[11] M. Shokouhi and L. Si. Federated Search. Foundations and Trends in Information Retrieval, 5(1):1–102, 2011.
[12] P. Thomas. To what problem is distributed information retrieval the solution? Journal of the American Society for Information Science and Technology, 63(7):1471–1476, July 2012.
[13] M. White. Enterprise Search. O'Reilly Media, Inc., 2012.
[14] L. Zhou. Multi-agent based distributed secure information retrieval. In CMC'10, pages 76–79, 2010.
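
As a closing illustration of the open problems discussed in Section 4, the following Python sketch shows one way the three ideas could enter resource selection: directory names are added to the resource description (Section 4.2), relevance is weighted by the share of documents a user may access (Section 4.1), and an optional boost is applied per repository group (Section 4.3). This is a minimal sketch under stated assumptions and not an evaluated method: the names `Doc`, `Repo`, `describe`, `accessible_fraction`, and `select`, the role-based access model, and the multiplicative combination of the three factors are all hypothetical choices made for the example.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Doc:
    path: str                      # e.g. "/legal/regulations/parking.txt"
    text: str
    allowed_roles: frozenset[str]  # roles that may read the document (assumed model)


@dataclass
class Repo:
    name: str
    group: str                     # hypothetical theme label, e.g. "law"
    docs: list[Doc]


def describe(repo: Repo) -> Counter:
    """Resource description built from document text plus directory names."""
    terms: Counter = Counter()
    for doc in repo.docs:
        terms.update(doc.text.lower().split())
        directories = PurePosixPath(doc.path).parent.parts
        terms.update(part.lower() for part in directories if part != "/")
    return terms


def accessible_fraction(repo: Repo, role: str) -> float:
    """Share of the repository's documents that the given role may read."""
    if not repo.docs:
        return 0.0
    return sum(role in doc.allowed_roles for doc in repo.docs) / len(repo.docs)


def select(
    repos: list[Repo],
    query: str,
    role: str,
    group_boost: dict[str, float] | None = None,
) -> list[tuple[str, float]]:
    """Rank repositories by relevance x accessibility x group boost.

    The multiplicative combination is an assumption made for this sketch.
    """
    boosts = group_boost or {}
    terms = query.lower().split()
    scored = []
    for repo in repos:
        description = describe(repo)
        relevance = sum(description[t] for t in terms)
        score = relevance * accessible_fraction(repo, role) * boosts.get(repo.group, 1.0)
        scored.append((repo.name, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Calling `select(repos, "regulations", role="clerk")` ranks a repository lower when the clerk role can read only part of its documents, while passing `group_boost={"law": 3.0}` raises all repositories of the law-themed group together. Whether such weighting actually improves result quality is exactly the kind of question that requires suitable test collections.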