Exploiting ERP Systems in Enterprise Search

Diego Tosato
Eurosystem S.p.a., Via Newton 21, Villorba (Treviso), Italy
diego.tosato@eurosystem.it

Abstract. Enterprise resource planning (ERP) systems are the core of many companies: they contain entities which are the focus of enterprise searches [5]. In this paper, a model which exploits those entities to improve the search experience of enterprise users is proposed. Specifically, a graph knowledge base called entity graph is defined. It is used both to offer a novel data exploration experience that reflects the business processes and to improve search accuracy by contributing to the score of a search result in a weighted linear model. The applicability of the model is demonstrated by implementing an enterprise search prototype called SeNSE (Skyline eNterprise Search Engine).

Keywords: Enterprise search, Entity centric retrieval, Entity graph exploration

1 Introduction

According to [5], enterprise search on small data is much more important than web search on big data for many companies, but this issue still receives little attention from the information retrieval community. However, recent advances in enterprise search focus on the extraction of concepts or entities from enterprise data, which might be a promising way to enhance search performance. Among the different sources of information of an enterprise (such as relational databases, file system documents, web pages, etc.), a key role is played by ERP systems [10], which are typically composed of several modules, such as sales, finance and production, or business intelligence. Since ERP systems capture information across modules and provide an integrated view of information through enterprise business processes, we decided to model their main entities and the related entity links.
The latter are arranged in a graph knowledge base that we call the entity graph (EG), which can be used to boost the search results and to explore data in a way that reflects the business processes and the workflow of enterprise users. Unlike some state-of-the-art enterprise search systems based on entities [1,5], our entities are a small number of complex concepts (such as orders, invoices, estimates, etc.). This choice has two main advantages: from the enterprise user point of view, entities and their links can be displayed as a meaningful graph that can be exploited in everyday work; from the machine learning point of view, since we have a small number of entities and entity links, it is easier to assign them weights that can be used to improve search results.

Our contributions are summarized as follows: (1) to the best of our knowledge, we are the first to use ERP entities and their relations to build a knowledge base to improve the search experience; (2) we propose a novel data navigation model based on the EG; (3) we build an enterprise search prototype that demonstrates the applicability of our model.

2 System Design

ERP is the core of a company [10] because it contains most of the fundamental entities searched by enterprise users. Despite that, most enterprise search solutions are not able to achieve satisfactory search performance because they still work at word level. There are remarkable recent works that show how to extract concepts or entities from data automatically [1,2,4,5,8], but they still cannot deal with complex ERP entities made up of many relational tables. To improve the search experience, we decided to model the fundamental entities and their relations explicitly by exploiting our knowledge of ERP systems and enterprise user needs, which is necessary to build an effective search system [13]. Therefore, we asked our users which were the most relevant types of ERP entities and what kinds of relationships connected them.
We obtained a list of 33 entity types connected by 70 relationship types, which represent the core of the work for most of our users. These entities are made of structured and unstructured data that are represented as documents. In order to preserve the structure of the entities, documents are organized as a set of fields (see [7] for more details). Furthermore, according to our ERP domain experts, we defined a set of components, detailed in Sec. 2.1, that must influence the rank of an entity. These are related to the following fundamental aspects of an entity: content, context (in terms of its relationships), and last modified date. To combine the contributions of the components, we follow the idea proposed by [12], which led us to design a modular enterprise search engine. The modules are organized into a pipeline and the contribution of each of them is computed sequentially.

2.1 Entity Ranking

When a search is performed, the final rank of the results is a weighted linear combination of contributions computed by a pipeline of components. More formally, let $\{\alpha_i\}_{i=1,\dots,N}$ be a set of scores and $\{w_i\}_{i=1,\dots,N}$ a set of weights; the final rank $r$ of an entity $\varepsilon$ is given by

$$r(\varepsilon) = \sum_{i=1}^{N} w_i \bar{\alpha}_i \quad \text{s.t.} \quad 0 \le \bar{\alpha}_i \le 1, \tag{1}$$

where $\bar{\alpha}_i$ represents the normalized version of $\alpha_i$ obtained through the min-max normalization method [9]. We instantiated the model (Eq. (1)) considering the following contributions:

$\alpha_{cnt}$: Given the document representation $d$ of an entity $\varepsilon$, this is the TF-IDF score that reflects how relevant an entity is by its content (see [6]). More specifically, we compute the cosine similarity between the user query $q$ and an indexed document $d$, both represented as vectors. Therefore, $\alpha_{cnt}$ can be expressed as

$$\alpha_{cnt}(d) = \cos(q, d) = \frac{V(q) \cdot V(d)}{|V(q)|\,|V(d)|}, \quad d \in D,$$

where $V(\cdot)$ is the vector form of a query or document and $D$ is the set of indexed documents.

$\alpha_{dte}$: A linear score that boosts recent entities [7].
Given the date of a document expressed in days $t$ and a normalization constant $n = \max_{t \in T} t$, where $T$ represents the set of dates of the indexed documents, the score is defined as

$$\alpha_{dte}(d) = \gamma \, \frac{n - t}{n},$$

where $\gamma$ is a boost factor that we set to 2 and $d$ is the document associated with $t$.

$\alpha_{egs}$: Considering the subgraph $S$ of the EG induced by the top results of the $\alpha_{cnt}$ ranking, this is a logarithmic score that boosts connected documents. $\alpha_{egs}$ is defined as

$$\alpha_{egs}(\varepsilon) = \log\left(1 + \frac{1}{\varphi(\varepsilon, \varepsilon')}\right), \quad \varepsilon, \varepsilon' \in S,$$

where $\varphi(\cdot,\cdot)$ is a weighted distance computed by summing the weights of the edges on the shortest path between a pair of entities $(\varepsilon, \varepsilon')$ such that $\varepsilon \neq \varepsilon'$.

$\alpha_{prk}$: The score provided by PageRank, which has been shown to lead to better search performance [12].

Therefore, the rank model used by our system is

$$r(\varepsilon) = w_1 \bar{\alpha}_{cnt} + w_2 \bar{\alpha}_{dte} + w_3 \bar{\alpha}_{egs} + w_4 \bar{\alpha}_{prk}, \tag{2}$$

where $w_1, \dots, w_4$ are assigned by analyzing search results as explained in Sec. 3.3. By exploiting click-through data [6], it could be interesting to compute the weights automatically using a machine learning technique such as SVM, boosting, or neural networks [9].

2.2 Entity Graph (EG)

To meet the user need of exploring ERP entities, we build the EG, enhancing enterprise search with an exploration experience complementary to faceted navigation and full text search. The EG is a graph whose nodes represent entities extracted by a set of queries on the data sources, one for each entity type. Edges represent the underlying business relations among the entity types. They are extracted by queries that link pairs of entity references. Formally, an EG is a directed graph $G = (V, E, W)$, where $V$ is a set of nodes, $E \subseteq V \times V$ a set of edges, and $W$ a set of edge weights. We place an entity identifier into each node, while edges carry labels that explain the meaning of the relations. A configuration file determines the queries to extract the relations, their direction, and the weights of each type of relation.
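As an illustration of the link score and of the rank model in Eq. (2), the following is a minimal Python sketch, not the SeNSE implementation. The toy graph, the entity identifiers, the edge weights, and the component weights are hypothetical. The text does not state how the per-pair values $\log(1 + 1/\varphi)$ are aggregated for one entity, so summing them over the entities of $S$ reachable from $\varepsilon$ is an assumption made here; only two of the four components are combined to keep the example short.

```python
import heapq
import math

# Toy entity graph G = (V, E, W): node -> list of (successor, edge weight).
# Identifiers and weights are hypothetical, chosen only for illustration.
EG = {
    "estimate:000785": [("order_request:07ORDXV", 1.0)],
    "order_request:07ORDXV": [("order_confirmation:1234", 1.0)],
    "order_confirmation:1234": [],
    "customer:IBM": [("estimate:000785", 2.0)],
}


def weighted_distances(graph, source, allowed):
    """Dijkstra from `source`, following only edges that stay inside `allowed` (S)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for successor, weight in graph.get(node, []):
            if successor not in allowed:
                continue
            candidate = d + weight
            if candidate < dist.get(successor, math.inf):
                dist[successor] = candidate
                heapq.heappush(heap, (candidate, successor))
    return dist


def link_score(graph, entity, subgraph):
    """Sum of log(1 + 1/phi) over entities of S reachable from `entity`.

    The summation is an assumption; unreachable entities contribute nothing.
    """
    dist = weighted_distances(graph, entity, set(subgraph))
    return sum(
        math.log(1.0 + 1.0 / d)
        for other, d in dist.items()
        if other != entity and d > 0
    )


def min_max(scores):
    """Min-max normalization of a {entity: score} mapping into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def final_rank(components, weights):
    """Weighted linear combination of normalized component scores (Eq. (1))."""
    normalized = {name: min_max(scores) for name, scores in components.items()}
    entities = next(iter(components.values()))
    return {
        entity: sum(weights[name] * normalized[name][entity] for name in components)
        for entity in entities
    }


# S: top results of a (made-up) content ranking.
S = ["estimate:000785", "order_request:07ORDXV", "order_confirmation:1234"]
components = {
    "cnt": {S[0]: 0.2, S[1]: 0.9, S[2]: 0.5},  # hypothetical content scores
    "egs": {entity: link_score(EG, entity, S) for entity in S},
}
ranking = final_rank(components, {"cnt": 0.7, "egs": 0.3})  # hypothetical weights
```

Restricting the shortest-path search to the nodes of $S$ keeps the score local to the current result set; whether paths may leave $S$ is not specified in the text, so this is a design choice of the sketch.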
By analyzing the links of the EG, we found that there are huge node hubs, because some types of entities (i.e., master data types) are linked to almost all the others. This is a problem for the computation of $\alpha_{prk}$, because ranking methods such as PageRank or HITS [3] were built to rank web pages: they give a higher rank to hub nodes, which are not necessarily relevant for each enterprise information need. Although the problem is still open, our system gives $\alpha_{prk}$ a lower weight in order to mitigate the effect of the huge hub nodes.

3 Prototype

SeNSE (Skyline eNterprise Search Engine) is the name of the prototype that demonstrates the applicability of the model described in Sec. 2. The prototype is based on the ERP system Freeway Skyline¹.

3.1 Architecture

Fig. 1 provides an overview of the architecture of our system. SeNSE is designed to search data coming from any source of information, such as file servers, ERP applications, and databases. During the indexing phase, carried out by the indexing server (see Fig. 1), entities are extracted in the form of documents and analyzed, and the security information is computed. Then entity links are extracted, the EG is built and analyzed, and finally the preview of the entities is computed. From the time and space complexity point of view, this last operation is the most expensive of the entire indexing process, but it is very useful for the users because it provides some entity details without leaving the SERP (Search Engine Results Page). The entities extracted and processed by the indexing server are stored in three different repositories: the inverted index, implemented through [7], which contains all the textual information of the entities; the entity graph database, which contains an instance of the EG; and the preview database, which stores an image preview for each entity. Both databases are implemented through [11].
In order to guarantee the integrity and the synchronization of the repositories, an enterprise service bus (ESB) is adopted. The searching server provides two search web services, namely the full text search and the faceted search implemented through Bobo-Browse², which are based on the search pipeline that contains the following search components: Content Search computes the score $\alpha_{cnt}$ by exploiting the full text search capabilities of Lucene [7]; Date Boost computes the score $\alpha_{dte}$; Link Score computes the score $\alpha_{egs}$; Page Rank computes the score $\alpha_{prk}$; Final Score computes Eq. (2) given the results of the previous components; Abstract Highlighting highlights terms of the result documents that match the user query; Entity Security defines a cached security filter that is provided by the Content Search component. The searching server provides two other services, namely the entity graph exploration and the entity preview services, which are independent of the search pipeline.

¹ www.freewayskyline.com
² senseidb.github.io/bobo

[Fig. 1. Overview of the architecture of SeNSE. Blocks shown: Enterprise Data (unstructured data, structured data, ERP data), Indexing Server, Searching Server (web services and search pipeline), Repositories (Inverted Index, Entity Graph DB, Preview DB).]

We store the security information $\phi$, such as user name, company name, and database table grants, into document fields. For each $\phi$, we define the allow ($a$) and deny ($d$) policies. To establish whether a result can be listed in the SERP, the following boolean expression is evaluated:

$$(\phi_1^a \wedge \neg\phi_1^d) \vee \dots \vee (\phi_i^a \wedge \neg\phi_i^d) \vee \dots \vee (\phi_L^a \wedge \neg\phi_L^d),$$

where $i \in \{1, \dots, L\}$ is the index of a security information.
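The boolean expression above can be sketched in Python as follows. This is an illustrative reading rather than the SeNSE filter: the `Policy` class, the `"*"` wildcard, and the field names `user` and `company` are hypothetical, introduced only to make the example self-contained.

```python
from dataclasses import dataclass, field

WILDCARD = "*"  # hypothetical marker meaning "matches every value"


@dataclass
class Policy:
    """Allow and deny values stored for one security information phi_i."""
    allow: set = field(default_factory=set)
    deny: set = field(default_factory=set)


def matches(values, candidate):
    """True if `candidate` is covered by `values`, directly or via the wildcard."""
    return WILDCARD in values or candidate in values


def is_listed(document_policies, user_attributes):
    """Evaluate OR_i (phi_i^a AND NOT phi_i^d) for one document and one user."""
    for kind, policy in document_policies.items():
        value = user_attributes.get(kind)
        if value is None:
            continue  # the user carries no value for this security information
        if matches(policy.allow, value) and not matches(policy.deny, value):
            return True
    return False


# "Allow all but bob" on the user name, expressed with an allow wildcard and a deny set.
document = {"user": Policy(allow={WILDCARD}, deny={"bob"})}
visible_to_alice = is_listed(document, {"user": "alice"})
visible_to_bob = is_listed(document, {"user": "bob"})
```

Because the expression is a disjunction, a result is listed as soon as any single security information both allows and does not deny the user, which the early `return True` reflects.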
The presence of $\phi_i^d$ is not strictly necessary, but it makes it possible to implement security roles such as "allow all but . . . ".

One of the major problems we found in designing the architecture of SeNSE is that it needs different representations of an entity (namely sparse vector, node of a graph, and database entry) to provide its services. This is not only a scalability issue but also a modeling one. In fact, extending the search pipeline with further components could introduce novel representations for the entities. In particular, many machine learning techniques require a dense vector representation [12]. To the best of our knowledge, there is no unified representation to search, analyze, and explore entities.

Another tricky problem concerns the update of the indexed entities, because enterprise search engine updates should be processed in near real time. The system has to deal with all types of updates; in particular, it has to manage the deletion of entities, which is the most difficult case. To tackle the update problem, SeNSE implements three update policies: batch full, which updates all the entities of a certain type; batch delta, which updates entities modified up to a specific date; and real time. The first two policies can be scheduled depending on the number of entities involved in the update and their indexing speed. The current implementation of the policies is specific to each data source, and there is still room for improvement, because the performance of the update notification infrastructures provided by data sources is not always satisfactory: they produce too many false positive update notifications or notifications that are too generic.

3.2 User Experience

[Fig. 2. The SERP page, with numbered callouts 1 to 6.]

The most relevant pieces of the user interface of our system are shown in Fig. 2, Fig. 3, and at www.freewayskyline.com/demosense. In particular, Fig. 2 shows a small part of the SERP, which is divided into three main areas.
The first area contains the search box, as depicted in Fig. 2.1. Following our users' feedback, we provide the possibility to choose the type of entity before entering the search query. Once the search is performed, the faceted navigation can be started from the left part of the UI, as shown in Fig. 2.2. Simultaneously, the results are listed in the right part of the interface (Fig. 2.3). For each result three functions are available. Starting from the left, the first function is the EG exploration (Fig. 2.4), which is detailed below. In the middle (Fig. 2.5) is the preview function, which displays the image of the entity associated with the result in a flexbox according to its type and format. Finally, on the right (Fig. 2.6) is the user actions function, which lists a set of user-defined business actions available for the result, such as compiling an order or printing a bill.

[Fig. 3. The EG exploration page. Callout 1 marks the graph, with example nodes Order Request 07ORDXV (11/01/2013, IBM ITALY SPA), Order Confirmation 1234 (08/02/2013, IBM ITALY SPA), and Estimate 000785 (11/12/2012, IBM ITALY SPA); callout 2 marks the legend of categories (Order confirmation, Estimate, Order request) and the navigation functions.]

To implement the exploration of the EG, we use the Vis.js³ library, which can lay out the graph automatically and let users interact with it at the same time. When a search result is displayed in the SERP, the exploration can be started from the entity associated with the selected result and its neighborhood, as shown in Fig. 3.1; it is then possible to continue the exploration by selecting a neighbor. Since the EG is interactive, the business actions of each node can be executed directly from that node. The map legend and the main navigation functions are displayed in the top part of the EG exploration page (Fig. 3.2). Users found the EG exploration effective and intuitive on both tablet and PC, and asked to personalize the appearance of each entity type.
3.3 Experiments

We successfully tested SeNSE on a server with an x64 Intel Xeon E5450 3.00 GHz processor and 10 GB of RAM. Since we are not aware of any public database that fits our ERP entity model, we built three different enterprise datasets with real data. They contain approximately 1 million entities and 10 million entity links, which are typical magnitudes of data for small and medium-sized enterprises. To evaluate the performance of our system, we chose the largest dataset and computed the Precision at k ($P_k$) [6] on a testing set of 100 user information needs. We collected the needs both by interviewing users and by logging their search queries; relevance judgments were then obtained by merging the user rankings of the top 5 entities. The performance baseline of SeNSE is given by the $\alpha_{cnt}$ rank: with it, the top 5 entities for user queries are retrieved with an average precision of 54%. Adding all the other score components ($\alpha_{dte}$, $\alpha_{egs}$, and $\alpha_{prk}$) improved the performance by up to 15%. We assigned the weights $\{w_i\}_{i=1,\dots,4}$ by performing a grid search [9] that maximizes $P_k$. For this purpose all the scores are normalized (see Eq. (1)) and the weights are selected by searching the range $0 \le w_i \le 1$ with a step of 0.1. The final weights are not uniformly distributed: $\alpha_{cnt}$ is the most important contribution with respect to the others.

³ visjs.org

4 Conclusions and Future Works

We presented an enterprise search model that exploits ERP entities to enhance the enterprise search experience, and its implementation, SeNSE. We discussed the main design aspects of the model and the related open issues. Then we presented the architecture of the prototype and its user experience. In future work, we aim to clarify the benefit given by each contribution to entity ranking, and we will implement an automatic method to compute the weights for those contributions.

References

1. Brauer, F., Huber, M., Hackenbroich, G., Leser, U., Naumann, F., Barczynski, W.M.: Graph-based concept identification and disambiguation for enterprise search. In: WWW (2010)
2. Graus, D., Tsagkias, M., Weerkamp, W., Meij, E., de Rijke, M.: Dynamic collective entity representations for entity ranking. In: WSDM (2016)
3. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of massive datasets. CUP (2014)
4. Li, J., Yang, J.J., Liu, C., Zhao, Y., Liu, B., Shi, Y.: Exploiting semantic linkages among multiple sources for semantic information retrieval. EIS (2014)
5. Liu, X., Chen, F., Fang, H., Wang, M.: Exploiting entity relationship for query expansion in enterprise search. IR 17(3), 265–294 (2014)
6. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
7. McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action: Covers Apache Lucene 3.0. Manning Publications Co. (2010)
8. Meij, E., Balog, K., Odijk, D.: Entity linking and retrieval for semantic search. In: WSDM (2014)
9. Murphy, K.P.: Machine learning: a probabilistic perspective. MIT Press (2012)
10. Nazemi, E., Tarokh, M.J., Djavanshir, G.R.: ERP: a literature survey. IJAMT (2012)
11. Owens, M., Allen, G.: SQLite. Springer (2010)
12. Turney, P.D., Pantel, P., et al.: From frequency to meaning: Vector space models of semantics. JAIR (2010)
13. White, M.: Critical success factors for enterprise search. BIR (2015)