Applied Distributed Information Retrieval in Enterprise Search

Erwin Gunadi, Till Plumbaum, and Sahin Albayrak
Technische Universität Berlin
10587 Berlin, Germany
{firstname.lastname}@dai-labor.de

Copyright © 2015 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
ECIR Supporting Complex Search Task Workshop '15, Vienna, Austria.
Published on CEUR-WS: http://ceur-ws.org/Vol-1338/.

ABSTRACT
Distributed enterprise search, as a special case of Distributed Information Retrieval (DIR), is characterized by the need to query multiple repositories in enterprise environments. However, DIR research lacks publicly available real-world datasets for evaluation purposes. As a result, there is a gap between insights gained from simulated environments and real-world investigations of distributed enterprise search. In this paper, we outline three fundamental issues based on our investigations of a large real-world distributed enterprise search system. We found that (1) the utilization of security features in enterprise repositories, (2) the adaptation of resource description and resource selection to the enterprise model, and (3) repository grouping are fundamental real-world issues. We hypothesize that a better understanding of these issues will help to improve distributed enterprise search and to better support complex search tasks in enterprise environments. Based on our experience gained from a real-world system, we also outline the steps needed to cope with these issues.

Categories and Subject Descriptors
H.3.4 [Systems and Software]: Distributed Systems

Keywords
distributed information retrieval; enterprise search; result aggregation

1. INTRODUCTION
Enterprise Search is an area of information retrieval which specifically addresses the information needs of enterprise users. An enterprise is defined as an organizational entity with exclusive membership of its users. A typical enterprise environment consists of multiple layers of access rights and different dedicated data repositories such as web servers, file servers, wikis, etc. Key differences to Web search are as follows: (1) Heterogeneous document types: web pages, wikis, PDFs, emails, Word documents, etc. (2) Multiple document repositories: documents are normally not held in a single file server or system. (3) Access restriction: hierarchy and role rules apply to every document. (4) Managed data generation process: unlike web documents, which are created by individual entities, each enterprise defines its own document creation and update policy, which is effectively valid for all of its members [9, 5, 13].

To address the above-mentioned issues (1)-(4), the use of Distributed Information Retrieval (DIR) has been proposed [4, 5, 13]. DIR is a concept for managing different resources through a broker. A broker mediates between users and different sources, or repositories, to collect and combine search results. To accomplish this task, three sub-problems need to be addressed: resource description, resource selection and result merging [2, 11]. With the exception of the TREC Federated Web Search track (https://sites.google.com/site/trecfedweb/), most DIR research has been based on synthetic test collections [11]. Due to the lack of appropriate datasets and the proprietary nature of enterprise data, many improvements achieved in DIR are not directly suitable for distributed enterprise search [6, 5]. These factors prevent further adoption of DIR improvements in distributed enterprise search and create a gap between these two areas.

This paper has two main contributions. First, we bridge the gap between DIR and the enterprise context by outlining three fundamental problems that have been widely ignored in the field of DIR research but must be considered in the enterprise context. Second, we propose ways to cope with these problems. The presented problems are (1) utilization of security features in enterprise repositories, (2) adaptation of resource description and resource selection to the enterprise model, and (3) repository grouping. All our findings are based on real-world experiences and insights we gained from the operation of a distributed enterprise search system at TU Berlin and the city administration of Berlin with about 50,000 employees.

In the next section we outline related work that discusses the integration of DIR in enterprise environments. In Section 3 we describe the architecture of our agent-based distributed enterprise search platform and its deployment. Section 4 details the open problems that arise from our experience. In Section 5 we conclude our paper.

2. RELATED WORKS
Various works have proposed DIR as a possible paradigm to implement enterprise search [5, 13, 12]. The main concept of DIR is the usage of multiple resources, or repositories, through a broker in order to satisfy a user's information need. This concept fits the main characteristic of an enterprise environment, where multiple data repositories for different needs normally exist, such as web servers, file servers, wikis, etc. [9, 5, 13].

Works investigating how to secure enterprise search systems are presented in [1, 14]. Bailey et al. [1] propose the implementation of document-level security for enterprise search systems. They evaluated how on-the-fly security checks for each document during search result list building affect the search processing time. Zhou et al. [14] propose the usage of ontology-based user profiles to secure the search process.
The ontology models the information search service and user role information, which can be maintained for different departments in an enterprise. How security restrictions affect search result quality in distributed enterprise search is still an open question. Current work focuses more on performance issues than on quality.

Regarding DIR algorithms, Li et al. [7] evaluated the performance of various result merging algorithms on multiple enterprise repositories that differ from each other. Li et al. argue that repositories in an enterprise context are not identical to each other and that each of these repositories may have a different size, document types, intended audience and administrative control. These characteristics need to be explored in the context of DIR in enterprise search.

3. DISTRIBUTED ENTERPRISE SEARCH IN THE REAL-WORLD
The findings we present in this paper are based on a research cooperation with the service provider of the administration of Berlin, where we have deployed an agent-based distributed enterprise search system [3]. The system is currently used as the standard search platform for about 50,000 employees. The structure of Berlin's network confronted us with some challenges. Even though the whole city can be regarded as a closed organization with state officials as its employees, each of the city's districts maintains its own data management policy. This means that a city district is a private network with access to all main data repositories, such as the city's own intranet, but without access rights to data repositories from other city districts. As opposed to the classic DIR setting, these sub-networks cannot be served by a single main broker. Instead, each network area requires its own private broker, and an extra broker is installed in the main intranet. Another use case for an additional broker is the user desktop. To comply with the user privacy policy of the city's administration, local desktop files should not be externally accessible. Because of such restrictions we have built a local broker so that users can find their local desktop files. The interaction between the users, the search client and the multiple brokers is illustrated in Figure 1.

Figure 1: Multiple brokers serving different networks are contacted by a user through a search client.

Figure 1 illustrates how the search client is used to contact all of the different brokers. It enables users to search in different network areas. In each of the networks, multiple repositories are queried by the responsible broker. The user's desktop also has a dedicated broker because local files should not be queried by an external broker.

4. OPEN PROBLEMS
In this section we describe the following open problems derived from our investigations of the distributed enterprise search system for the administration of Berlin. We also suggest how these open problems can be further investigated in DIR research in the enterprise context.

4.1 Utilization of security features in enterprise repositories
Until recently, DIR research has been based on the assumption that all documents are accessible to all users [11, 10]. However, in enterprise environments access to documents is secured [1, 5, 13]. Depending on their access rights, users may see different search results for the same search query. We argue that this property can be utilized as an essential feature for improving distributed enterprise search. Utilization means not only complying with access restrictions, but also improving resource selection and result merging algorithms by using the security information.

In order to accomplish this, the DIR algorithms for the different steps have to integrate the security information as a feature. For example, in the DIR literature resource selection is responsible for selecting the resources with the most relevant documents. However, when a relevant repository is restricted for a particular user or user group, its documents will not be shown. This can be mitigated when repositories with more accessible documents are given higher priority even though they are less relevant. In this case, in navigational search tasks, employees may get better recall on the subject. To the best of our knowledge, such behavior in a DIR setting has not yet been researched. Suitable test collections need to be created in order to evaluate this essential feature.

4.2 Adaptation of resource description and resource selection for enterprise model
Li et al. [7] highlighted how repositories in an enterprise environment differ from each other. Based on this fact, Li et al. evaluated how this uniqueness may influence result merging performance. The exploitation of unique features found in enterprise repositories should also be considered in the other sub-tasks of DIR: resource description and resource selection. Research in this area is needed in order to improve the application of DIR in Enterprise Search. It will thus help to narrow the gap between these two research fields.

As an example use case of such exploitation, a common repository type in an enterprise environment is the file server. Documents from file servers that are relevant for search queries are stored in a hierarchy of directories. We can include these directory names as part of the resource description of a repository. This means that the content of a resource description includes not only sampled documents but also directory names. This directory information can be used as important terms that improve the resource selection algorithm. Such exploitation is yet to be investigated in the distributed enterprise search context.

4.3 Repositories grouping
One of the challenges of our enterprise environment setting is the need for multiple brokers to handle different networks. Even though the use of multiple brokers can be seen as a technical feature, it introduces a new perspective on handling multiple repositories. The ability to group repositories opens the possibility of boosting repositories of similar types as a group. The application of boosting for theme-specific repositories is being actively researched in the context of distributed web search tasks [8]. In that paper, the authors investigated how news sources can be ranked and placed in the web search result. In our deployment scenario we found that search queries about specific laws and regulations are common. In this case, having a theme-specific group of repositories means that documents coming from the law-themed repositories receive a higher rank than documents from non-law-themed ones.

By having groups of repositories, result merging techniques may rank not only based on the repository rank but also on the repository-group rank. To accomplish this, the sub-tasks of DIR must be adapted to the broker context, namely as broker description and broker selection. When a particular group of repositories is highly relevant for a search query, the resulting broker ranking may be used to highlight a group of repositories in the search result page.

5. CONCLUSION
Due to the availability of heterogeneous repositories, previous works have proposed the application of DIR in enterprise search [5, 11, 13]. However, the improvements achieved in DIR research are rarely investigated in real-world scenarios, especially in enterprise environments [12]. Recent works show that further research on the application of DIR in real-world settings is needed in order to close the gap between DIR research and its real-world application in enterprises. In this paper, we introduced three issues based on our experience from a real distributed enterprise setting.

For future work, we need to investigate how security information can be utilized for the different DIR sub-tasks. This also applies to the unique features found in enterprise repositories, such as file paths and directories. It permits us to better understand the application of DIR in enterprise search and thus bridges the gap between these two research areas. Also, more effort in building appropriate test collections for a real-world DIR use case, like the work of Nguyen et al. [10], is required for evaluating distributed enterprise search. This helps to adapt available techniques for the different tasks in DIR (resource description, resource selection and result merging) to enterprise use cases. Another investigation is needed on how to accommodate repository grouping by building multiple brokers. Similar to news integration in web search [8], groups of repositories can be ranked differently for incoming queries. The ranking result may then be used to highlight a group of repositories in the presentation of the search engine result page (SERP) in enterprise search.

6. REFERENCES
[1] P. Bailey, D. Hawking, and B. Matson. Secure search in enterprise webs: tradeoffs in efficient implementation for document level security. In CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management, 2006.
[2] J. Callan. Distributed information retrieval. In Advances in Information Retrieval, pages 127–150. Kluwer Academic Publishers, 2000.
[3] E. Gunadi, M. Meder, T. Plumbaum, C. Scheel, F. Hopfgartner, and S. Albayrak. Distributed enterprise search using software agents. In Proceedings of AAMAS '14, pages 1623–1624, Paris, France, 2014.
[4] D. Hawking. Challenges in enterprise search. In ADC '04: Proceedings of the 15th Australasian database conference, volume 27, pages 15–24, 2004.
[5] D. Hawking. Enterprise Search. In R. Baeza-Yates and B. Ribeiro-Neto, editors, Modern Information Retrieval, pages 641–684. Addison-Wesley, 2010.
[6] L. Jie, S. Lamkhede, R. Sapra, E. Hsu, H. Song, and Y. Chang. A unified search federation system based on online user feedback. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '13), page 1195, New York, New York, USA, 2013. ACM Press.
[7] P. V. Li, P. Thomas, and D. Hawking. Merging algorithms for enterprise search. In Proceedings of the 18th Australasian Document Computing Symposium (ADCS '13), pages 42–49, 2013.
[8] R. McCreadie, C. Macdonald, and I. Ounis. News vertical search: when and what to display to users. In SIGIR '13: 36th international ACM SIGIR conference on Research and development in information retrieval, pages 253–262, 2013.
[9] R. Mukherjee and J. Mao. Enterprise search: Tough stuff. Queue, 2(2):36–46, April 2004.
[10] D. Nguyen, T. Demeester, D. Trieschnigg, and D. Hiemstra. Federated search in the wild: the combined power of over a hundred search engines. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1874–1878, 2012.
[11] M. Shokouhi and L. Si. Federated Search. Foundations and Trends in Information Retrieval, 5(1):1–102, 2011.
[12] P. Thomas. To what problem is distributed information retrieval the solution? Journal of the American Society for Information Science and Technology, 63(7):1471–1476, July 2012.
[13] M. White. Enterprise Search. O'Reilly Media, Inc., 2012.
[14] L. Zhou. Multi-agent based distributed secure information retrieval. In CMC'10, pages 76–79, 2010.
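
As a supplementary illustration of the ideas in Sections 4.1 and 4.2, the following minimal sketch shows one way a broker could combine term-based relevance with security information during resource selection. It is not the method used in the deployed system. The `Repository` class, the `relevance` and `select_resources` functions, the `dir_weight` parameter, the multiplicative combination of relevance and accessible share, and all repository names and numbers are assumptions made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical resource description held by a broker."""
    name: str
    sampled_terms: set[str]                                  # terms from sampled documents
    directory_terms: set[str] = field(default_factory=set)   # terms from directory names (Section 4.2)
    accessible_share: dict[str, float] = field(default_factory=dict)  # role -> share of readable documents


def relevance(repo: Repository, query_terms: set[str], dir_weight: float = 0.5) -> float:
    """Simple term-overlap relevance; directory-name hits count with dir_weight (an assumption)."""
    if not query_terms:
        return 0.0
    doc_hits = len(query_terms & repo.sampled_terms)
    dir_hits = len(query_terms & repo.directory_terms)
    return (doc_hits + dir_weight * dir_hits) / len(query_terms)


def select_resources(repos: list[Repository], query: str, role: str, k: int = 2) -> list[str]:
    """Rank repositories by relevance scaled by the share of documents the role may read."""
    terms = set(query.lower().split())
    scored = []
    for repo in repos:
        share = repo.accessible_share.get(role, 0.0)  # unknown role: nothing accessible
        scored.append((relevance(repo, terms) * share, repo.name))
    # Highest score first; ties are broken by name to keep the order deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


# Made-up example data: the wiki matches the query better, but a clerk can read little of it.
repos = [
    Repository("law_fileserver", {"building", "regulation"}, {"law", "permits"}, {"clerk": 0.9}),
    Repository("district_wiki", {"building", "regulation", "law"}, set(), {"clerk": 0.1}),
]
ranking = select_resources(repos, "building regulation law", role="clerk")
print(ranking)
```

In this made-up example the wiki has the higher term overlap, yet the file server is ranked first for the clerk role because far more of its documents are accessible, which mirrors the prioritization argued for in Section 4.1. Whether a multiplicative weighting is appropriate would have to be evaluated on a suitable test collection.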