Cite4Me: Semantic Retrieval and Analysis of Scientific Publications

Bernardo Pereira Nunes, PUC-Rio, Rio de Janeiro, Brazil
Besnik Fetahu, L3S Research Center, Appelstrasse 9a, Hannover, Germany
Marco Antonio Casanova, PUC-Rio, Rio de Janeiro, Brazil

ABSTRACT
This paper presents the Cite4Me Web application and its features created for the LAK Challenge 2013. The Web application focuses on two main directions: (i) interlinking of the LAK dataset with related data sources from the Linked Open Data cloud; and (ii) providing innovative search, visualization, retrieval and recommendation of scientific publications from the LAK dataset and related interlinked resources. Our approach is based on semantic and co-occurrence relations to provide new browsing experiences to Web users and an overview of the scientific data available. Furthermore, we present a detailed analysis of the LAK dataset along with applications that contribute to the development of the learning analytics field.

Copyright © 2013 by the papers' authors. Copying permitted only for private and academic purposes. LAK-Data Challenge '13, Leuven, Belgium.

1. INTRODUCTION
The volume of information on the Web has been growing steadily over the last decade and has doubled every two years. The vast amount of data available on the Web, along with new means of communication, has transformed our society, including the way we work, live, relate to each other and learn.

In the midst of this change, Learning Analytics emerges to make sense of the educational data produced by learners, professors, institutions and so on. Analyzing and understanding the changes over the past years helps us to understand the current state and to be aware of forthcoming trends, enabling a new outlook on the future of learning.

A recent challenge initiative of SOLAR (Society for Learning Analytics Research) and the LinkedUp project aims to foster the creation of tools that enable the analysis, visualization, browsing and recommendation of scientific and educational data.

Although the scientific field has fostered the creation of new applications in several areas, such as medicine, biology and physics, amongst others, information access is based mostly on free text search and on hierarchical classification systems of the publications. However, current approaches by the main digital library providers, such as ACM Digital Library and Elsevier, do not reflect the current state of research on exploring resources using approaches from Information Retrieval, Information Extraction and the Semantic Web. Thus, getting an overview of research topics, finding publications and discovering new nomenclature remain arduous and laborious tasks that are not always successful.

In this paper, we introduce Cite4Me, a novel application for exploratory search, retrieval and visualization of scientific publications. Cite4Me intends to provide end users with a single point for accessing papers, hence reducing the effort of searching in several data sources. Our system takes advantage of reference datasets, such as DBpedia, to explore semantic relationships between scientific papers and user queries. Additionally, an analysis of topic coverage and shared concepts from related educational datasets, extracted from the Linked Open Data cloud, will be introduced.

The remainder of the paper is organized as follows. Section 2 presents the approach used for searching, retrieving and recommending papers. Section 3 describes the process of dataset discovery and interlinking, and Section 4 shows a brief analysis of the data discovery results. Finally, Section 5 presents related work and Section 6 presents some concluding remarks.

2. CITE4ME
As one of the main goals of the field of "Learning Analytics" is to support students in their learning process, we developed a Web application called Cite4Me that assists students in finding scientific publications and identifying relevant research topics.

Cite4Me implements semantic and co-occurrence methods to (a) search and retrieve scientific publications; and (b) recommend scientific publications. Moreover, it provides a Web interface that facilitates the search for publications and may help users discover terms related to a given query.

In this section, we provide an overview of the major features of the Web application and its Web interface that assist users in exploring scientific data on the Web.

2.1 Search and Retrieval
Cite4Me relies on search functionalities to meet the users' needs. Briefly, we implemented standard Information Retrieval (IR) and Semantic Web (SW) approaches to retrieve and recommend scientific papers to the users. We divided this subsection into (i) free text search; (ii) exploratory search; and (iii) semantic search.

2.1.1 Free Text Search
The purpose of the free text search functionality is to let users search for mentions, titles and authors of academic publications contained in the LAK dataset. Even though this functionality is similar to that of existing digital libraries, we consider it a basic functionality that must be provided by our application. Therefore, we use standard vector space models (tf-idf) for indexing and retrieving documents.

The tf-idf scores were computed for each term extracted from the publication content after applying stemming [14]. Furthermore, the search functionality offers boolean queries with standard operators, such as 'OR' and 'AND', and also a ranking of the matching publications based on the sum of the tf-idf scores of the individual query terms.

In summary, our free text search provides users with publications P that match the query terms and with non-matching publications P', which are related to P according to a degree of similarity (see Eq. 1) but do not contain the query terms.

The similarity between a matching publication P and a non-matching publication P' in the LAK dataset is measured by the standard cosine similarity measure, which is built on top of the computed tf-idf scores:

$$ \mathit{Sim}(P, P') = \frac{P \cdot P'}{|P| \, |P'|} \tag{1} $$

where P and P' represent the tf-idf scores for the terms in two distinct publications.

2.1.2 Exploratory Search
In this section, we provide detailed insights into the exploratory search functionality of our application. As a preliminary step to provide analytics and information about the actual content and topic coverage, all the scientific publications contained in the LAK dataset are enriched beforehand. The enrichment process was performed using the DBpedia Spotlight API, where entities, entity types and their respective categories were extracted.

After the enrichment process, we cluster the publications according to the entities and categories found in each document. The publications are clustered in a tree-based structure over the enrichments. Note that each node of the tree represents a topic that is covered by the publications under this node. Thus, the exploratory search is performed through the topics covered by each publication.

The process of linking publications, categories and extra resources is mediated by the DBpedia knowledge graph, where we use the dcterms:subject property to match the resources.

As a result, the exploratory search provides a way to explore resources through the connections between their topics, which facilitates the search for topically related resources. Figure 1 shows the exploratory search.

Figure 1: Preview of the exploratory search functionality.

2.1.3 Semantic Search
Cite4Me also provides a semantic search engine that assists users in finding publications semantically related to the query terms. Analogously to the explicit semantic analysis (ESA) technique [5], the relatedness score is computed between the enriched concepts found in the publications' content.

Basically, the semantic search is an adaptation of the free text search presented in Section 2.1.1. Instead of computing the tf-idf scores for the words in a text, it computes the tf-idf scores for the entities contained in a publication. Finally, the ranking of the results is based on the sum of the tf-idf scores of the matching concepts.

Figure 2 illustrates the semantic search functionality. It also generates a tag cloud from the matching publications, showing the most prominent terms for a given query. Specifically, the tag cloud based on the results gives users an insight into the topics and may assist them in finding related terms previously unknown to them.

Figure 2: Preview of the semantic search functionality.

2.2 Paper Recommendation
Another key feature of our system is the paper recommendation based on semantic relationships extracted from reference datasets. The recommendation is based on previous work [12, 11], where we exploit the number of paths and the distance (length of a path) between given entities to compute a relatedness score between extracted entities and associated documents. The first step in measuring the relatedness between documents is to compute the semantic connectivity score (SCS_e) of the entities found in each text (see Eq. 2):

$$ \mathit{SCS}_e(a, b) = \sum_{l=1}^{\tau} \beta^{l} \cdot \left| \mathit{paths}^{\langle l \rangle}_{(a,b)} \right| \tag{2} $$

where $\left| \mathit{paths}^{\langle l \rangle}_{(a,b)} \right|$ is the number of paths of length l between a and b, and 0 < β ≤ 1 is a positive damping factor. As in [12, 11], we used β = 0.5 as our damping factor. Furthermore, we also limited the length of a path to at most τ = 4.

Based on the score for entities, we then define the semantic connectivity score (SCS_w) between two documents W_1 and W_2 as follows:

$$ \mathit{SCS}_w(W_1, W_2) = \left( \sum_{\substack{e_1 \in E_1,\ e_2 \in E_2 \\ e_1 \neq e_2}} \mathit{SCS}_e(e_1, e_2) + \frac{|E_1 \cap E_2|}{2} \right) \cdot \frac{1}{|E_1| \cdot |E_2|} \tag{3} $$

where E_i is the set of entities associated with W_i, for i = 1, 2. Note that documents that contain the same entities receive an extra bonus (the second term inside the parentheses on the right-hand side of Eq. 3).

Thus, a list of document pairs is generated, ranked according to the score, and suggested to the user. Figure 3 illustrates the paper recommendation process computed based on SCS_w.

Figure 3: An example of paper recommendation based on SCS_w.

3. DATASET DISCOVERY AND INTERLINKING
This section briefly describes the datasets used in the automatic discovery of related data from DataHub and future steps on dataset discovery and interlinking.

3.1 LAK Dataset
The LAK dataset contains the metadata of the papers published in the proceedings of the LAK conference 2011-12, a special issue of Learning and Knowledge Analytics: Educational Technology & Society, the proceedings of the International Conference on Educational Data Mining (2008-12) and the Journal of Educational Data Mining (2008-12). In total, 315 descriptions of papers were available, containing detailed information about authors, institutions, conference venues and the full content of the paper.

3.2 Data Analysis
The goal of the data analysis procedure is to align the various publications in the LAK dataset based on mutual information, such as the topics covered by them. This is achieved using well established datasets like DBpedia and Freebase, where a reference point for the unstructured textual content of publications is created through an enrichment process.

Again, the enrichment process is carried out using DBpedia Spotlight [10] and addresses several issues of significant importance. For instance, it offers several advantages, such as: (i) identification of (common) named entities; (ii) disambiguation; and (iii) expansion of the limited dataset and resource descriptions with additional background knowledge.

3.3 Data Discovery
Our Web application uses the instances in the LAK dataset as its starting point to automatically explore and recommend to users datasets that cover similar topics. In order to query, detect and interlink related datasets, we chose DataHub as a data provider. DataHub serves as a collecting point for datasets from various fields and currently has over 5000 datasets. Note that, from this large number of datasets, only 300 are provided as Linked Open Data. As the latter is the main focus of our work, the analysis and interlinking process concentrates on such datasets.

Briefly, the data discovery is performed using the CKAN data management framework from DataHub, where related datasets are suggested based on data analysis and user interests (such as the topics covered by a publication or resource).

Figure 4: Relevant Dataset Discovery Framework based on the generated feature set used to query the DataHub Linked Data provider.

Additionally, we provide the user with a set of resources, amongst other data analytics, that enables the user to harvest and correlate new information from the discovered resources, considering the LAK dataset as the starting point of such discovery.

This approach presents several advantages, such as the adoption and widespread use of Linked Data principles for publishing scientific papers. Nowadays, many conferences make their proceedings and journals freely accessible; hence our approach would take advantage of such open data and offer users topically relevant papers for a particular resource in the LAK dataset.

4. EVALUATION OF DATA ANALYSIS AND DATA DISCOVERY
This section presents an overview of the results obtained by analyzing the LAK dataset with respect to the constructed feature set that describes the topics covered by individual publications. Moreover, based on the data analysis procedure and shared information, we show that it is possible to establish links between the different publications within the LAK dataset and publications from other datasets in DataHub.

In the following subsections, we show the analysis of the LAK dataset and the discovery of relevant datasets and publications.

4.1 Data Analysis
The data analysis of the LAK dataset focuses mostly on assessing the individual publications for their topic coverage. To this end, we build a connected data graph consisting of the individual publications and items from the feature set. This step is necessary to provide the exploratory search functionality, where we can navigate through the publications or topics of interest based on the established edges between publications and feature set items. The results obtained with respect to the constructed feature set and the LAK dataset graph are shown in what follows.

Table 1 shows the top ranked items for each of the feature sets (entity, category and type items), along with the number of associations an item has with respect to all publications. Figure 5 shows the constructed data graph for the LAK dataset.

4.2 Data Discovery
After creating the feature set based on the information provided by reference datasets, we are able to query for relevant datasets in DataHub.

Thus, the data discovery for relevant resources is considered for the top ranked feature set items. Table 2 shows the discovered resources and datasets for the top-10 entity items. Note that we focus only on bibliographic datasets, since we aim at recommending topically related scientific publications. Due to the lack of bibliographic datasets, we were not able to find related publications for all entities considered. The dataset names in Table 2 are represented by their acronyms as follows: b3kat ("Bayerische Staatsbibliothek"), hebis ("Hessisches Bibliotheks Informations System") and npg ("Nature Publishing Group - ALL").

Additionally, from the set of 96 bibliographic datasets available, only a few were offered as Linked Data, thus narrowing our search space for relevant resources.

Entity        b3kat  hebis  npg
Data             14      0   12
Learning          5      0    1
Data mining       0      0    0
Algorithm         4      0    0
Education        17      1    2
Analysis         42      1    6
Student           7      1    1
Knowledge        11      0    0
Methodology       4      0    1
Statistics        7      0    1

Table 2: Number of discovered resources from the bibliographic group for the top ranked items from the entity feature set, based on the LAK dataset.

Entity                      Assoc.  Category                       Assoc.  Type                                        Assoc.
Data                        90      Educational_psychology         161     DBpedia:TopicalConcept                      150
Learning                    80      Data_analysis                  150     Freebase:/book                              142
Data_mining                 67      Learning                       139     Freebase:/book/book_subject                 142
Algorithm                   50      Scientific_method              137     Freebase:/media_common                      138
Education                   49      Neuropsychological_assessment  136     Freebase:/media_common/quotation_subject    136
Analysis                    48      Greek_loanwords                135     Freebase:/computer                          125
Student                     46      Data                           131     Freebase:/education                         122
Knowledge                   46      Evaluation_methods             129     Freebase:/education/field_of_study          120
Methodology                 42      Computer_data                  126     Freebase:/computer/software_genre           120
Statistics                  41      Research_methods               124     Freebase:/internet                          118
System                      37      Systems_science                118     Freebase:/internet/website_category         118
Scientific_modelling        37      Formal_sciences                108     Freebase:/award                             114
Prediction                  36      Data_management                108     Freebase:/media_common/media_genre          105
Data_set                    36      Cognitive_science              107     Freebase:/organization                      103
Statistical_classification  30      Statistical_terminology        107     Freebase:/award/award_discipline            103
Evaluation                  29      Developmental_psychology       101     Freebase:/business                          102
Standard_deviation          29      Intelligence                   93      Freebase:/organization/organization_sector  99
Probability                 28      Data_mining                    91      Freebase:/people                            99
Behavior                    26      Critical_thinking              87      Freebase:/film                              94
Interaction                 24      Thought                        84      Freebase:/book/periodical_subject           93

Table 1: Top
ranked items from the feature set for the LAK dataset, from the dataset analysis.

Figure 5: Topic coverage of LAK data graph for the individual resources.

5. RELATED WORK
Cobo et al. [3] present an analysis of student participation in online discussion forums using an agglomerative hierarchical clustering algorithm, and explore the resulting profiles to find relevant activity patterns and detect different student profiles. Barber and Sharkey [1] use a predictive analytic model to prevent students from failing courses. They analyze several variables, such as grades, age, attendance and others, that can impede student learning. Khan et al. [7] present a long-term study using hierarchical cluster analysis, t-tests and Pearson correlation that identified seven behavior patterns of learners in online discussion forums based on their access.

García-Solórzano et al. [6] introduce a new educational monitoring tool that helps tutors to monitor the development of the students. Unlike traditional monitoring systems, they propose a faceted browser visualization tool to facilitate the analysis of student progress. Glass [8] provides a versatile visualization tool to enable the creation of additional visualizations of data collections. Essa and Ayad [4] utilize predictive models to identify learners who are academically at risk. They present the problem with an interesting analogy to the patient-doctor workflow: first they identify the problem, then they analyze the situation, and finally they prescribe courses that are indicated to help the student succeed. Siadaty et al. [13] present the Learn-B environment, a hub system that captures information about the users' usage of different software and learning activities in their workplace, presents feedback to the users to support future decisions and planning, and accompanies them in the learning process.

In the same way, McAuley et al. [9] propose visual analytics to support organizational learning in online communities. They present their analysis through an adjacency matrix and an adjustable timeline that show the communication actions of the users and can organize them into temporal patterns. Bramucci et al. [2] present Sherpa, an academic recommendation system that supports students in making decisions. For instance, using the learner profiles, they recommend courses or make interventions when students are at risk.

The related work above shows different perspectives on learning analytics and the need for new tools and methods that make data available and help decision-makers.

6. CONCLUSION
In this paper we presented the main features of the Cite4Me Web application. Cite4Me makes use of several data sources to provide information for users interested in scientific publications and their applications.

Additionally, we provided a general framework for data discovery and resource correlation based on a constructed feature set consisting of items extracted from reference datasets. It makes it possible for users to search and relate resources from a dataset with other resources offered as Linked Data.

For more information about the Cite4Me Web application refer to

7. REFERENCES
[1] R. Barber and M. Sharkey. Course correction: using analytics to predict course success. In Proc. of the 2nd International
[3] G. Cobo, D. García-Solórzano, J. A. Morán, E. Santamaría, C. Monzo, and J. Melenchón. Using agglomerative hierarchical clustering to model learner participation profiles in online discussion forums. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 248–251, New York, NY, USA, 2012. ACM.
[4] A. Essa and H. Ayad. Student success system: risk analytics and data visualization using ensembles of predictive models. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 158–161, New York, NY, USA, 2012. ACM.
[5] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proc. of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, pages 1606–1611, San Francisco, CA, USA, 2007. Morgan Kaufmann Pub. Inc.
[6] D. García-Solórzano, G. Cobo, E. Santamaría, J. A. Morán, C. Monzo, and J. Melenchón. Educational monitoring tool based on faceted browsing and data portraits. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 170–178, New York, NY, USA, 2012. ACM.
[7] T. M. Khan, F. Clear, and S. S. Sajadi. The relationship between educational performance and online access routines: analysis of students' access to an online discussion forum. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 226–229, New York, NY, USA, 2012. ACM.
[8] D. Leony, A. Pardo, L. de la Fuente Valentín, D. S. de Castro, and C. D. Kloos. Glass: a learning analytics visualization tool. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 162–163, New York, NY, USA, 2012. ACM.
[9] J. McAuley, A. O'Connor, and D. Lewis. Exploring reflection in online communities. In Proc. of the 2nd International Conference on Learning Analytics and Knowledge, LAK '12, pages 102–110, New York, NY, USA, 2012. ACM.
[10] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: shedding light on the web of documents. In Proc. of the 7th International Conference on Semantic Systems, I-Semantics '11, pages 1–8, New York, NY, USA, 2011. ACM.
[11] B. Pereira Nunes, S. Dietze, M. A. Casanova, R. Kawase, B. Fetahu, and W. Nejdl. Combining a co-occurrence-based and a semantic measure for entity linking. In ESWC, 2013 (to appear).
[12] B. Pereira Nunes, R. Kawase, S. Dietze, D. Taibi, M. A. Casanova, and W. Nejdl. Can entities be friends? In G. Rizzo, P. Mendes, E. Charton, S. Hellmann, and A. Kalyanpur, editors, Proc. of the Web of Linked Entities Workshop in conjunction with the 11th International Semantic Web Conference, volume 906, pages 45–57, Nov. 2012.
[13] M. Siadaty, D. Gašević, J. Jovanović, N. Milikić, Z.