Paperista: Visual Exploration of Semantically Annotated Research Papers

Nikola Milikic, Faculty of Organizational Sciences, University of Belgrade, Jove Ilića 154, Belgrade 11000, Serbia, +381-11-3950853, nikola.milikic@gmail.com
Uros Krcadinac, Faculty of Organizational Sciences, University of Belgrade, Jove Ilića 154, Belgrade 11000, Serbia, +381-11-3950853, uros@krcadinac.com
Jelena Jovanovic, Faculty of Organizational Sciences, University of Belgrade, Jove Ilića 154, Belgrade 11000, Serbia, +381-11-3950853, jeljov@gmail.com
Bojan Brankov, UZROK Labs, 107 Nehruova, Belgrade 10070, Serbia, +381-63-581879, bb@uzrok.com
Srdjan Keca, UZROK Labs, 107 Nehruova, Belgrade 10070, Serbia, +381-61-3115661, sk@uzrok.com

ABSTRACT
We consider the problem of visualizing and exploring a dataset about research publications from the fields of Learning Analytics (LA) and Educational Data Mining (EDM). Our approach is based on semantic annotation that associates publications from the dataset with Wikipedia topics. We present a visualization and exploration tool, called Paperista (www.uzrok.com/paperista), which presents these topics in the form of bubble and line charts. The tool provides multiple views, thus allowing users to observe and interact with topics, understand their evolution and relationships over time, and compare data originating from different research fields (i.e., LA and EDM). Moreover, users can explore the papers related to the presented topics, and make related Web searches to access the papers themselves.

Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design Tools and Techniques - user interfaces

General Terms
Algorithms, Design

Keywords
Learning Analytics, Visualization, Research Papers

1. MOTIVATION
The field of Learning Analytics has emerged over the past few years and is attracting more and more researchers from other areas of Technology Enhanced Learning (TEL). It aims to address the current needs in the broad area of education by making use of the latest trends in information technologies, where everything is moving towards Big Data and real-time analytics.

Learning Analytics (LA) is defined as “the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs” [16]. It is often equated with other similar fields in the TEL area, such as Academic Analytics or Educational Data Mining (EDM) [14]. EDM is a research field that focuses on using computational approaches, namely data mining and machine learning, to analyze educational data in order to facilitate and enhance the educational process, and contribute to the overall improvement of students’ learning experience [17]. Even though both LA and EDM are self-contained research fields, they are intertwined and overlap in the topics they cover. They share many similarities, but also have some distinct differences, as discussed by Siemens and Baker [18]. One of the similarities emphasized by these authors is that both fields reflect the emergence of data-intensive approaches to education, where both communities have the goal of analyzing large-scale educational data in order to support research and practice in education. They differ in the level of automation they aim to achieve. In particular, EDM has a greater focus on automating support for educational processes, such as adaptation and personalization of learning environments and learning processes. On the other hand, LA has a considerably greater focus on leveraging human judgment, on informing and empowering instructors and learners to reflect over and improve learning processes.

The Society for Learning Analytics Research (SoLAR) has published the LAK dataset (www.solaresearch.org/resources/lak-dataset) containing structured data about research publications from the Learning Analytics and Knowledge (LAK) Conference, the Educational Data Mining Conference, and the Journal of Educational Technology & Society (JETS) Special Issue on LAK. The data are represented in RDF form, which makes them easy to integrate and process by applications.

In this paper, we propose an approach to visualizing and exploring the LAK dataset. It is centered around the topics covered by the papers from the dataset, and is intended to give an overall view of the topics that the LA and EDM fields cover. As the focus of researchers and the degree of relevance of particular topics have been changing over the years, our approach tries to show the trend of those changes through the whole period the dataset covers, namely from 2008 to 2012. It also allows for topic-based exploration of research papers and easy navigation to them.

2. RELATED WORK
In [11], the authors present an interesting work aimed at automating the creation of relations between research areas by using semantically annotated data about research papers in a particular area. As a continuation of this work, the same authors have created a tool, called Rexplore (http://technologies.kmi.open.ac.uk/rexplore), which, among other things, visualizes authors' migration patterns across research areas [15].

In terms of visual representation, we find interesting the approaches to visualization of tags (topics) and categories of tags over time. For example, Dubinko et al. [4] consider the problem of visualizing the evolution of Flickr tags. The authors present a new slider-based approach based on a characterization of the most interesting tags. A Flash-based animation in a web browser allows the user to observe and interact with the tags. Zhang et al. [5] present an approach to classification and visualization of temporal and geographic tag distributions. The authors argue that their approach can help humans recognize semantic relationships between tags. Lemma [6] presents the Ebony system, an application for browsing, navigation, and visualization of the DBLP database. Wattenberg [7] introduces arc diagrams for representing complex patterns of repetition in string data. Wattenberg's application, the Shape of Song, visualizes music files, creating a static representation of repetition throughout a time series. However, to our knowledge, there has been no (published) research work on the visualization of research topics and publications in the areas of LA and EDM.

3. THE PAPERISTA SYSTEM
Our approach is illustrated through a Web application called Paperista. The application visualizes topics associated with research publications from the LAK dataset, allowing users to browse through papers, compare LA and EDM research fields, and make related Web searches. Visualizations are created for each individual year in order to display relevant topics in the LA and EDM fields for a specific year, but also for all years combined in order to give an overall depiction of the topic distribution in these research areas.

3.1 Data Preparation and Analysis
The LAK dataset consists of data about conference and journal papers published in the LA and EDM research fields in the 2008-2012 period. For each paper, the following elements are available: title, author(s), abstract, keyword(s) and full text. Also, basic information about authors is available, such as name and affiliation.

3.1.1 Topic Extraction
Since one of the main features of Paperista is visualization of research topics relevant for the given corpus, the first step in the data preparation process was to extract the main topics of the papers encompassed by the LAK dataset. A straightforward approach was to use keywords associated with the papers. This is because the authors themselves have compiled those keywords, and they know best which topics describe their work in the most appropriate way. However, the downside of this approach is that those keywords are given as free form text and are not consistent with any existing formal vocabulary. This makes them inconsistent throughout the corpus. Furthermore, the dataset is incomplete in regard to keywords, as no keywords are provided for the EDM 2008, 2009 and 2010 conferences.

Thus, we decided to employ a service for semantic annotation in order to detect paper topics. We took into consideration two Wikipedia based semantic annotators: TagMe (http://tagme.di.unipi.it) and DBpedia Spotlight (http://spotlight.dbpedia.org). The decision to use a Wikipedia based annotator was motivated by the fact that Wikipedia is the largest corpus of open encyclopedic knowledge and is often used as a well established large-scale taxonomy [8]. Both annotator services are designed to look for and retrieve recognized Wikipedia concepts from the given text. They can be configured to the specific needs of any particular usage scenario (i.e., corpus). TagMe is designed to identify Wikipedia concepts specifically in short texts. Its REST API (http://tagme.di.unipi.it/tagme_help.html) allows for configuration of two parameters: i) the rho parameter, which refers to the "goodness" of an annotation with respect to the topics of the input text, and ii) the epsilon parameter, which is used for fine-tuning the disambiguation process and indicates whether to favor the most-common topics or to take the context more into account [9]. DBpedia Spotlight annotates a given text with concepts from DBpedia, a structured representation of Wikipedia [12]. The DBpedia Spotlight REST API (http://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service) exposes two parameters: i) the confidence of the annotation process, which takes into account factors such as the topical pertinence and the contextual ambiguity, and ii) the support parameter, which specifies the minimum number of inlinks (incoming links from other DBpedia concepts to the observed DBpedia concept) [10]. We used only the paper title and abstract for topic extraction, based on the assumption that these two elements contain mentions of the most important and interesting topics a paper is related to. In order to decide which service for semantic annotation to use, the two services were tested with a random sample comprising 5% of all papers and with different parameter settings. The best results were achieved by the TagMe service (rho=0.15; epsilon=0.5). For this reason, the TagMe service was employed to annotate all papers in the corpus.

3.1.2 Identifying Popular Topics
Once all papers were associated with topics, we calculated the significance of each topic. A numerical statistic called TF-IDF (Term Frequency – Inverse Document Frequency; http://en.wikipedia.org/wiki/Tf-idf) was used, as it calculates how important a word is to a document in a corpus of documents. This metric was adapted to our case and used to calculate the importance of a topic in a paper. Instead of calculating the frequency of a word, we calculate the frequency of a topic.

Since Paperista allows for visualizing topics in a specific year and overall (in all years, 2008-2012), the significance was calculated for corpora containing papers from each of these different time periods. Accordingly, we had six different corpora (one per year and one for all years combined) and calculated the significance of a topic for each corpus. In order to present only the most significant topics, we have filtered the topic set to only those whose significance for a particular period was over 0.01. This threshold was chosen empirically and presents the best balance between the relevance of topics and their presentation in Paperista’s visualizations (i.e., assuring easy comprehension by users).

3.1.3 Topic Cleaning
Even though the output of the TagMe service consisted of topics that are relevant to the papers’ content, some of them can hardly be considered relevant research topics in the LA and EDM fields, as they are too general. For instance, topics like Methodology, Research, and Experiment can be associated with almost every paper in this corpus. Actually, these topics can be related to research papers from almost any other research area. Similarly, some of the retrieved topics were not relevant to research papers from the LAK dataset. Such topics resulted from the imperfection of the TagMe tool (and semantic annotation tools, in general). Some examples of these alien topics include The T.O. Show, Ade Easily, Henry Snapp, etc. For instance, the Henry Snapp topic was apparently confused with the SNAPP tool (http://www.snappvis.org), a popular learning analytics and visualization tool. Hence, it was important to detect and exclude all these generic and alien topics from the final visualization in order to reduce the noise.

We applied a topic cleaning approach similar to [11]. The idea is to identify topics that have little or no relationship with other topics in the corpus. This can be an indicator that a topic is too specific or alien to our set of identified topics and thus can be considered an exclusion candidate. On the other hand, if a topic has relationships with too many other topics, this can be an indicator that the topic is too generic and again should be considered an exclusion candidate. In order to detect these outlier topics, we needed a measure of relatedness between topics. To that end, we used the Wikipedia Miner service (http://wikipedia-miner.cms.waikato.ac.nz), which calculates the semantic relatedness of two topics by finding the corresponding Wikipedia articles and calculating the similarity of those articles by comparing their incoming and outgoing links [13]. Wikipedia Miner has a REST API (http://wikipedia-miner.cms.waikato.ac.nz/services) that allows for retrieving this information programmatically.

Once relatedness was calculated for all the topics in our corpus, we compiled two lists to help us detect removal candidates. In the first list, each topic was associated with the number of other topics that topic is related to. This gave us an insight into which topics can be considered too general or too specific (the higher the number of related topics, the more generic the topic is, and vice versa). In the second list, each topic was associated with the sum of its relatedness with all the other topics. This list was meant to complement the first one. The rationale here is that there might be a topic with a fair number of relations to other topics, but with weak relatedness values. This behavior also qualifies a topic to be considered too specific or alien.

The initial idea with compiling these two lists was that topics to be removed would be at the beginning and the end of the lists (top and bottom 10%), and that they could be removed automatically. However, an examination of the lists showed that, among the obvious exclusion candidates, there were also several topics that should not have been excluded. For instance, topics like Online tutoring, Process mining, Educational data mining etc. were at the end of both lists, making them removal candidates, even though these topics are obviously highly relevant for the LA and EDM fields. The reason for this lies in the nature of Wikipedia itself and the fact that not many other articles in Wikipedia link to these topics. Thus, the topic removal process could not be done completely automatically, and an expert in the area was consulted to mark the topics that should not be excluded.

Figure 1 - Paperista Interface

3.2 Data Visualization and Exploration
The topic visualization applied in Paperista is inspired by the New York Times visualizations Four Ways to Slice Obama’s 2013 Budget Proposal [1] and At the National Conventions, the Words They Used [2].

The Paperista visualization includes bubble and line charts, allowing users to gain insights into topic trends within the LA and EDM fields. Bubble charts show the importance of a certain topic for the entire dataset, each year, and/or each field. By switching between views, users can watch the changes within the dataset and compare the two fields. Animated transitions between charts help users understand these processes. In addition, since the animation does not show precise changes in a topic’s relevancy (calculated using the TF-IDF metric, see Sect. 3.1), users are also presented with a relevancy line chart for each topic.

The user interface (Figure 1) consists of an animated bubble cloud, two button sliders, a sidebar, and an optional timeline. The first slider button (All Years / By Year) allows users to choose between the “All Years” and “By Year” views. The “All Years” view presents relevant topics for the entire corpus of publications. The “By Year” view activates a timeline, showing relevant topics for each year. By using the slider, users can follow the change in topic relevancy through the years the data is available for (2008-2012).

The second slider button (All Topics / Group Topics) allows for grouping and regrouping of topics. The “All Topics” view shows one circle-shaped bubble chart. The “Group Topics” view divides the chart into three groups of bubbles. The first group presents topics that appear only in the EDM field. The third group shows topics related only to the LA field (i.e., LAK and JETS publications). The group in the middle shows “mixed” topics, i.e., those that appear at least once in both EDM and LAK/JETS. Different views of the bubble chart are presented in Figure 2.

The size of a bubble represents the topic’s relevancy (i.e., TF-IDF value). The two research fields, EDM and LA, are color-coded. Each bubble is divided into two slices, the size of which corresponds to the frequency of that topic within publications of each of the two sources. For the years 2008-2010, the dataset contains data only for the EDM research field, so the bubbles are one-colored.

The order of topic bubbles is intended to help users compare the two fields. The leftmost bubbles represent mostly EDM-related topics, while the rightmost bubbles mostly belong to the LA field. Moreover, clicking on a bubble creates a line chart in a sidebar. The line chart shows the growth and decline of a certain topic.

In addition to the visualization, the Paperista application allows users to browse papers by topic. When a user clicks on a particular bubble (topic), a list of papers related to that topic appears in the right sidebar (each represented by a title and a list of authors). Clicking on a particular paper opens a link to Google Scholar with the title of the article as a search query. Thus, if a paper is available online, a user could easily obtain the paper using the Paperista system.

Figure 2 - Different views of the bubble chart: (1) All Years / All Topics (no highlights); (2) All Years / Group Topics (with highlights); (3) By Year (2009) / All Topics (no highlights); and (4) By Year (2012) / Group Topics (with highlights)

Furthermore, when a user hovers over a paper title, all topics related to that paper become highlighted. By hovering over papers, users can gain quick insight into topic connections between publications. Users can also distinguish papers annotated with highly relevant topics from those marked with insignificant ones. This can show which papers are more related to the fields of EDM and LA, and which can be viewed as “outliers”.

3.3 Paperista Architecture and Dataset API
The Paperista system consists of a Web application and a server application that provides a RESTful API for communicating with the dataset. The Web-based visualization is written in D3, a JavaScript library for manipulating documents based on data [3]. We have chosen D3 because of its good performance for animation and interaction within the Web environment. The visualization is available at the following address: www.uzrok.com/paperista.

All data about conference topics and their significance (explained in Section 3.1) is available as a part of the Paperista Dataset API. This API supports a REST model for accessing the data and it is available at: http://147.91.128.71:9090/LAKChallenge2013. Paperista’s Web application calls these operations in order to access data from the dataset (for example, a click on a topic triggers a call to the API, which returns a list of papers).

4. DISCUSSION
When looking at the view displaying topic distribution in all years (Figure 2.1), one can observe that the EDM conference dominates in almost all topics. This is due to the fact that the EDM conference has been organized for longer than the LAK conference (3 years longer), and thus the LAK dataset contains overall more papers coming from the EDM conference.

Filtering topics by years allows for observing the popularity of topics in a particular year and a particular field (LA or EDM). This further enables one to observe the shift in interest for a particular topic by researchers in the LA and EDM fields throughout the years. For instance, one can observe that before 2011, the topic of Learning Analytics was not very popular in the papers from the EDM field; thus this topic is not displayed at all in visualizations for years 2008-2010. In 2011, it boomed in popularity, as indicated by the significant rise in the number of papers covering it. In fact, this was the first year the LAK conference was organized, and it immediately occupied the attention of researchers interested in the topic of Learning Analytics. Interestingly, this topic also gained some traction among the researchers publishing in the EDM field. In 2012, the topic’s popularity grew even more and the researchers covering it directed their effort toward the LA field. As a result, papers published within the LA field almost exclusively cover the topic of Learning Analytics. Similarly, we can observe topics that have kept high popularity in both areas over the years. For instance, this is the case with the Data topic, obviously as a consequence of research in both areas concentrating on the analysis of large amounts of data coming from various learning systems and other sources.

The application also allows us to observe that topics such as Intelligent Tutoring System, Prediction and Accuracy and Precision mostly kept their popularity throughout the years and stayed exclusively within the EDM field. On the other hand, one can observe that the large majority of topics have been covered by both fields. This suggests that the similarities between the two fields are significant, as they share many research topics.

5. CONCLUSION
In this paper we have presented our approach to visualizing topics and their trends in the LA and EDM fields. Our application allows for easy identification of the main topics researchers in these fields have been focusing on, and also exploration of papers related to those topics.

When compared to other similar tools that provide visualization of research topics, our tool is the most similar to the previously mentioned Rexplore tool. However, while Rexplore is more focused on relations between authors and topics in research areas, Paperista’s focus is on research topics and their trends over time. Also, Paperista allows for exploring papers related to different topics.

Future work for Paperista will be primarily directed towards extending the system to support other datasets, related to other research areas. Since the LAK dataset is RDF-based, Paperista can easily be expanded to support other RDF-based datasets expressed using the same or related vocabulary, such as the Semantic Web Dog Food corpus (http://data.semanticweb.org). Regarding the interface, we plan to introduce keyword-based search functionality for searching a topic by its name. This would allow for easy navigation to a desired topic and filtering of papers related to it. The final goal for Paperista is to become a universal visualization tool for research papers.

6. REFERENCES
[1] Carter, S. Four Ways to Slice Obama’s 2013 Budget Proposal. New York Times, 2012. Available online: http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html
[2] Bostock, M., Carter, S., and Ericson, M. At the National Conventions, the Words They Used. New York Times, 2012. Available online: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
[3] Bostock, M., Ogievetsky, V., and Heer, J. D3: Data-Driven Documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2011. Available online: http://vis.stanford.edu/papers/d3
[4] Dubinko, M. et al. Visualizing Tags over Time. WWW 2006, Edinburgh. Available online: http://labs.rightnow.com/colloquium/papers/visualizing_tags.pdf
[5] Zhang, H., Korayem, M., You, E., and Crandall, D. J. Beyond Co-occurrence: Discovering and Visualizing Tag Relationships from Geo-spatial and Temporal Similarities. Available online: http://www.cs.indiana.edu/~zhanhaip/wsdm2012-clustering.pdf
[6] Lemma, R. Visualizing the DBLP Database. Bachelor Thesis, 2010. Available online: http://www.inf.usi.ch/faculty/lanza/Downloads/Lemm2010a.pdf
[7] Wattenberg, M. Arc Diagrams: Visualizing Structure in Strings. InfoVis 2002. Available online: http://hint.fm/papers/arc-diagrams.pdf
[8] Ponzetto, S. P., & Strube, M. (2007, July). Deriving a large scale taxonomy from Wikipedia. In Proceedings of the national conference on artificial intelligence (Vol. 22, No. 2, p. 1440). Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. Available online: http://www.h-its.org/english/research/nlp/papers/ponzetto07b.pdf
[9] Ferragina, P., & Scaiella, U. (2010, October). TAGME: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 1625-1628). ACM.
[10] Mendes, P. N., Jakob, M., García-Silva, A., & Bizer, C. (2011, September). DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (pp. 1-8). ACM.
[11] Osborne, F., & Motta, E. (2012). Mining semantic relations between research areas. The Semantic Web–ISWC 2012, 410-426.
[12] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). DBpedia: A nucleus for a web of open data. The Semantic Web, 722-735.
[13] Milne, D., & Witten, I. H. (2008, October). Learning to link with Wikipedia. In Proceedings of the 17th ACM conference on Information and knowledge management (pp. 509-518). ACM.
[14] Siemens, G., & Long, P. (2011). Penetrating the Fog: Analytics in Learning and Education. Educause Review, 46(5), 30-32.
[15] Osborne, F., & Motta, E. (2012). Making Sense of Research with Rexplore. The Semantic Web–ISWC 2012.
[16] 1st International Conference on Learning Analytics and Knowledge, Banff, Alberta, February 27–March 1, 2011, https://tekri.athabascau.ca/analytics/
[17] Romero, C., & Ventura, S. (2010). Educational data mining: a review of the state of the art. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 40(6), 601-618.
[18] Siemens, G., & Baker, R. S. D. (2012, April). Learning analytics and educational data mining: towards communication and collaboration. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge (pp. 252-254). ACM.
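
Illustrative sketch of the data preparation steps (Sections 3.1.2 and 3.1.3). The following Python fragment is a minimal sketch of how topic significance and the two topic-cleaning lists could be computed; it is not the code used in Paperista. Several details are assumptions, since the paper does not spell them out: the term frequency is taken as the share of a paper's annotations that refer to the topic, the corpus-level significance is taken as the mean TF-IDF over all papers in the corpus, and relatedness scores are assumed to be already available as a dictionary keyed by unordered topic pairs (for example, values previously retrieved from a relatedness service). All function and parameter names are hypothetical.

```python
import math
from collections import Counter


def topic_significance(papers):
    """Mean TF-IDF of each topic over a corpus (assumed aggregation).

    papers: dict mapping a paper id to the list of topic labels
            annotated in its title and abstract (repeats allowed).
    """
    n_papers = len(papers)
    # Document frequency: number of papers that mention a topic at least once.
    df = Counter(t for topics in papers.values() for t in set(topics))
    totals = Counter()
    for topics in papers.values():
        if not topics:
            continue
        for topic, count in Counter(topics).items():
            tf = count / len(topics)              # topic frequency within the paper
            idf = math.log(n_papers / df[topic])  # rarity of the topic in the corpus
            totals[topic] += tf * idf
    return {topic: totals[topic] / n_papers for topic in df}


def significant_topics(significance, threshold=0.01):
    """Keep only topics whose significance exceeds the threshold."""
    return {t: s for t, s in significance.items() if s > threshold}


def cleaning_candidate_lists(topics, relatedness):
    """Build the two ranked lists used to spot generic or alien topics.

    relatedness: dict mapping frozenset({a, b}) to a relatedness score;
                 pairs with a score of 0 count as unrelated (assumption).
    Returns (by_count, by_sum), both sorted from highest to lowest value.
    """
    count = {t: 0 for t in topics}
    total = {t: 0.0 for t in topics}
    for pair, score in relatedness.items():
        if score <= 0:
            continue
        for t in pair:
            if t in count:
                count[t] += 1
                total[t] += score
    by_count = sorted(count.items(), key=lambda kv: kv[1], reverse=True)
    by_sum = sorted(total.items(), key=lambda kv: kv[1], reverse=True)
    return by_count, by_sum
```

Under these assumptions, calling topic_significance once per corpus (one corpus per year plus one for all years) and passing each result to significant_topics would mirror the filtering step, while the top and bottom of the two lists returned by cleaning_candidate_lists would give the removal candidates that an expert then reviews by hand.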