Paperista: Visual Exploration of Semantically Annotated Research Papers

Nikola Milikic, Uros Krcadinac, Jelena Jovanovic
Faculty of Organizational Sciences, University of Belgrade
Jove Ilića 154, Belgrade 11000, Serbia
+381-11-3950853
nikola.milikic@gmail.com, uros@krcadinac.com, jeljov@gmail.com

Bojan Brankov, Srdjan Keca
UZROK Labs
107 Nehruova, Belgrade 10070, Serbia
+381-63-581879 (Brankov), +381-61-3115661 (Keca)
bb@uzrok.com, sk@uzrok.com

ABSTRACT
We consider the problem of visualizing and exploring a dataset about research publications from the fields of Learning Analytics (LA) and Educational Data Mining (EDM). Our approach is based on semantic annotation that associates publications from the dataset with Wikipedia topics. We present a visualization and exploration tool, called Paperista (www.uzrok.com/paperista), which presents these topics in the form of bubble and line charts. The tool provides multiple views, thus allowing users to observe and interact with topics, understand their evolution and relationships over time, and compare data originating from different research fields (i.e., LA and EDM). Moreover, users can explore the papers to which the presented topics are related, and run related Web searches to access the papers themselves.

Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design Tools and Techniques - user interfaces

General Terms
Algorithms, Design

Keywords
Learning Analytics, Visualization, Research Papers

1. MOTIVATION
The field of Learning Analytics has emerged over the past few years and is attracting a growing number of researchers from other areas of Technology Enhanced Learning (TEL). It aims to address current needs in the broad area of education by making use of the latest trends in information technologies, where everything is moving towards Big Data and real-time analytics.

Learning Analytics (LA) is defined as “the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimising learning and the environments in which it occurs” [16]. It is often equated with other similar fields in the TEL area, such as Academic Analytics or Educational Data Mining (EDM) [14]. EDM is a research field that focuses on using computational approaches, namely data mining and machine learning, to analyze educational data in order to facilitate and enhance the educational process, and to contribute to the overall improvement of students’ learning experience [17]. Even though both LA and EDM are self-contained research fields, they are intertwined and overlap in the topics they cover. They share many similarities, but also have some distinct differences, as discussed by Siemens and Baker [18]. One of the similarities emphasized by these authors is that both fields reflect the emergence of data-intensive approaches to education, where both communities have the goal of analyzing large-scale educational data in order to support research and practice in education. They differ in the level of automation they aim to achieve. In particular, EDM has a greater focus on automating support for educational processes, such as adaptation and personalization of learning environments and learning processes. On the other hand, LA has a considerably greater focus on leveraging human judgment, that is, on informing and empowering instructors and learners to reflect on and improve learning processes.

The Society for Learning Analytics Research (SoLAR) has published the LAK dataset¹, which contains structured data about research publications from the Learning Analytics and Knowledge (LAK) Conference, the Educational Data Mining Conference, and the Journal of Educational Technology & Society (JETS) Special Issue on LAK. The data are represented in RDF, which makes them easy for applications to integrate and process.

¹ www.solaresearch.org/resources/lak-dataset

In this paper, we propose an approach to visualizing and exploring the LAK dataset. It is centered around the topics covered by the papers from the dataset, and is intended to give an overall view of the topics that the LA and EDM fields cover. As the focus of researchers and the degree of relevance of particular topics have been changing over the years, our approach tries to show the trend of those changes through the whole period the dataset covers, namely from 2008 to 2012. It also allows for topic-based exploration of research papers and easy navigation to them.

2. RELATED WORK
In [11], the authors present work aimed at automating the creation of relations between research areas by using semantically annotated data about research papers in a particular
area. As a continuation of this work, the same authors have created a tool, called Rexplore², which, among other things, visualizes authors' migration patterns across research areas [15].

² http://technologies.kmi.open.ac.uk/rexplore

In terms of visual representation, we find approaches to the visualization of tags (topics) and categories of tags over time particularly interesting. For example, Dubinko et al. [4] consider the problem of visualizing the evolution of Flickr tags. The authors present a new slider-based approach based on a characterization of the most interesting tags. A Flash-based animation in a web browser allows the user to observe and interact with the tags. Zhang et al. [5] present an approach to classification and visualization of temporal and geographic tag distributions. The authors argue that their approach can help humans recognize semantic relationships between tags. Lemma [6] presents the Ebony system, an application for browsing, navigation, and visualization of the DBLP database. Wattenberg [7] introduces arc diagrams for representing complex patterns of repetition in string data. Wattenberg's application, the Shape of Song, visualizes music files, creating a static representation of repetition throughout a time series. However, to our knowledge, there has been no (published) research work on the visualization of research topics and publications in the areas of LA and EDM.

3. THE PAPERISTA SYSTEM
Our approach is illustrated through a Web application called Paperista. The application visualizes topics associated with research publications from the LAK dataset, allowing users to browse through papers, compare the LA and EDM research fields, and make related Web searches. Visualizations are created for each individual year in order to display relevant topics in the LA and EDM fields for a specific year, but also for all years combined in order to give an overall depiction of the topic distribution in these research areas.

3.1 Data Preparation and Analysis
The LAK dataset consists of data about conference and journal papers published in the LA and EDM research fields in the 2008-2012 period. For each paper, the following elements are available: title, author(s), abstract, keyword(s) and full text. Basic information about authors, such as name and affiliation, is also available.

3.1.1 Topic Extraction
Since one of the main features of Paperista is the visualization of research topics relevant for the given corpus, the first step in the data preparation process was to extract the main topics of the papers encompassed by the LAK dataset. A straightforward approach was to use the keywords associated with the papers, because the authors themselves compiled those keywords, and they know best which topics describe their work most appropriately. However, the downside of this approach is that those keywords are given as free-form text and do not follow any existing formal vocabulary, which makes them inconsistent throughout the corpus. Furthermore, the dataset is incomplete with regard to keywords, as no keywords are provided for the EDM 2008, 2009 and 2010 conferences.

Thus, we decided to employ a service for semantic annotation in order to detect paper topics. We took into consideration two Wikipedia-based semantic annotators: TagMe³ and DBpedia Spotlight⁴. The decision to use a Wikipedia-based annotator was motivated by the fact that Wikipedia is the largest corpus of open encyclopedic knowledge and is often used as a well-established large-scale taxonomy [8]. Both annotator services are designed to look for and retrieve recognized Wikipedia concepts from the given text. They can be configured to the specific needs of a particular usage scenario (i.e., corpus). TagMe is designed to identify Wikipedia concepts specifically in short texts. Its REST API⁵ allows for configuration of two parameters: i) the rho parameter, which refers to the "goodness" of an annotation with respect to the topics of the input text, and ii) the epsilon parameter, which is used for fine-tuning the disambiguation process and indicates whether to favor the most common topics or to take the context more into account [9]. DBpedia Spotlight annotates a given text with concepts from DBpedia, a structured representation of Wikipedia [12]. The DBpedia Spotlight REST API⁶ exposes two parameters: the confidence parameter of the annotation process, which takes into account factors such as topical pertinence and contextual ambiguity, and the support parameter, which specifies the minimum number of inlinks⁷ [10].

We used only the paper title and abstract for topic extraction, based on the assumption that these two elements contain mentions of the most important and interesting topics a paper is related to. In order to decide which semantic annotation service to use, the two services were tested on a random sample comprising 5% of all papers and with different parameter settings. The best results were achieved by the TagMe service (rho=0.15; epsilon=0.5). For this reason, the TagMe service was employed to annotate all papers in the corpus.

³ http://tagme.di.unipi.it
⁴ http://spotlight.dbpedia.org
⁵ http://tagme.di.unipi.it/tagme_help.html
⁶ http://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service
⁷ Inlinks are incoming links from other DBpedia concepts to the observed DBpedia concept

3.1.2 Identifying Popular Topics
Once all papers were associated with topics, we calculated the significance of each topic. A numerical statistic called TF-IDF (Term Frequency – Inverse Document Frequency)⁸ was used, as it measures how important a word is to a document in a corpus of documents. This metric was adapted to our case and used to calculate the importance of a topic in a paper. Instead of calculating the frequency of a word, we calculate the frequency of a topic.

⁸ http://en.wikipedia.org/wiki/Tf-idf

Since Paperista allows for visualizing topics in a specific year and overall (in all years, 2008-2012), the significance was calculated for corpora containing papers from each of these different time periods. Accordingly, we had six different corpora (one per year from 2008 to 2012, plus one covering all years) and calculated the significance of a topic for each corpus. In order to present only the most significant topics, we filtered the topic set to only those whose significance for a particular period was over 0.01. This threshold was chosen empirically and presents the best balance between the relevance of topics and their presentation in Paperista’s visualizations (i.e., assuring easy comprehension by users).

3.1.3 Topic Cleaning
Even though the output of the TagMe service consisted of topics that are relevant to the papers’ content, some of them can hardly be considered relevant research topics in the LA and EDM fields, as they are too general. For instance, topics like Methodology,
Research, and Experiment can be associated with almost every paper in this corpus. In fact, these topics can be related to research papers from almost any other research area. Similarly, some of the retrieved topics were not relevant to research papers from the LAK dataset. Such topics resulted from the imperfection of the TagMe tool (and of semantic annotation tools in general). Some examples of these alien topics include The T.O. Show, Ade Easily, Henry Snapp, etc. For instance, the Henry Snapp topic apparently resulted from confusion with the SNAPP tool⁹, a popular learning analytics and visualization tool. Hence, it was important to detect and exclude all these generic and alien topics from the final visualization in order to reduce the noise.

We applied a topic cleaning approach similar to [11]. The idea is to identify topics that have few or no relationships with other topics in the corpus. This can be an indicator that a topic is too specific or alien to our set of identified topics and thus can be considered an exclusion candidate. On the other hand, if a topic has relationships with too many other topics, this can be an indicator that the topic is too generic, and it should again be considered an exclusion candidate. In order to detect these outlier topics, we needed a measure of relatedness between topics. To that end, we used the Wikipedia Miner¹⁰ service, which calculates the semantic relatedness of two topics by finding the corresponding Wikipedia articles and calculating the similarity of those articles by comparing their incoming and outgoing links [13]. Wikipedia Miner has a REST API¹¹ that allows for retrieving this information programmatically.

Once relatedness had been calculated for all the topics in our corpus, we compiled two lists to help us detect removal candidates. In the first list, each topic was associated with the number of other topics it is related to. This gave us an insight into which topics can be considered too general or too specific (the higher the number of related topics, the more generic the topic is, and vice versa). In the second list, each topic was associated with the sum of its relatedness with all the other topics. This list was meant to complement the first one. The rationale here is that a topic might have a fair number of relations to other topics while those relatedness values are weak. Such a topic also qualifies as too specific or alien.

The initial idea behind compiling these two lists was that the topics to be removed would be at the beginning and the end of the lists (top and bottom 10%), and that they could be removed automatically. However, on examining the lists, we found that, alongside the obvious exclusion candidates, there were also several topics that should not be excluded. For instance, topics like Online tutoring, Process mining, Educational data mining, etc. were at the end of both lists, making them removal candidates, even though these topics are obviously highly relevant for the LA and EDM fields. The reason for this lies in the nature of Wikipedia itself and the fact that not many other articles in Wikipedia link to these topics. Thus, the topic removal process could not be done completely automatically, and an expert in the area was consulted to mark the topics that should not be excluded.
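The two-list procedure described in this section can be sketched roughly as follows. This is a minimal illustration, not the code used to build Paperista: the function name `removal_candidates`, the treatment of missing topic pairs as unrelated, and all relatedness values in the toy data are assumptions made for the example. The relatedness scores would in practice come from a service such as Wikipedia Miner.

```python
from typing import Dict, FrozenSet, List, Set


def removal_candidates(
    topics: List[str],
    relatedness: Dict[FrozenSet[str], float],
    keep: Set[str],
    fraction: float = 0.10,
) -> Set[str]:
    """Flag topics at both ends of the two ranked lists, minus expert-approved ones."""

    def rel(a: str, b: str) -> float:
        # ASSUMPTION: a pair with no stored score is treated as unrelated.
        return relatedness.get(frozenset((a, b)), 0.0)

    # List 1: number of other topics each topic is related to.
    related_count = {
        t: sum(1 for u in topics if u != t and rel(t, u) > 0.0) for t in topics
    }
    # List 2: sum of a topic's relatedness with all other topics.
    related_sum = {t: sum(rel(t, u) for u in topics if u != t) for t in topics}

    k = max(1, int(len(topics) * fraction))  # size of each tail (10% by default)
    candidates: Set[str] = set()
    for ranking in (related_count, related_sum):
        ordered = sorted(topics, key=lambda t: ranking[t])
        candidates.update(ordered[:k])   # few or weak relations: too specific or alien
        candidates.update(ordered[-k:])  # many or strong relations: too generic
    # Topics the domain expert marked as relevant are never removed.
    return candidates - keep


# Toy relatedness values, invented purely for illustration.
toy_topics = ["Research", "Data mining", "Learning analytics", "Prediction", "Process mining"]
toy_relatedness = {
    frozenset(("Research", "Data mining")): 0.6,
    frozenset(("Research", "Learning analytics")): 0.6,
    frozenset(("Research", "Prediction")): 0.6,
    frozenset(("Research", "Process mining")): 0.1,
    frozenset(("Data mining", "Learning analytics")): 0.5,
    frozenset(("Data mining", "Prediction")): 0.4,
    frozenset(("Learning analytics", "Prediction")): 0.3,
}
print(sorted(removal_candidates(toy_topics, toy_relatedness, keep={"Process mining"})))
```

In the toy data, Process mining sits at the weakly connected end of both lists and would be flagged automatically; passing it in `keep` mirrors the expert's decision to retain such topics.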


                                                        Figure 1 - Paperista Interface

⁹ http://www.snappvis.org
¹⁰ http://wikipedia-miner.cms.waikato.ac.nz
¹¹ http://wikipedia-miner.cms.waikato.ac.nz/services
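The topic significance measure of Section 3.1.2 can be illustrated with a small sketch. The paper does not give the exact TF-IDF variant or say how per-paper values are combined into a corpus-level score, so everything specific here is an assumption: term frequency is a topic's share of a paper's annotations, IDF is the unsmoothed log(N / df), and the corpus-level significance is the mean over all papers. Only the 0.01 threshold is taken from the text.

```python
import math
from collections import Counter
from typing import Dict, List


def topic_significance(papers: List[List[str]]) -> Dict[str, float]:
    """Mean TF-IDF of each topic over one corpus.

    Each paper is the list of topic annotations found in its title and
    abstract; a topic may appear more than once in the same paper.
    """
    n_papers = len(papers)
    # Document frequency: the number of papers in which each topic occurs.
    doc_freq = Counter(topic for paper in papers for topic in set(paper))

    totals: Dict[str, float] = {topic: 0.0 for topic in doc_freq}
    for paper in papers:
        counts = Counter(paper)
        for topic, count in counts.items():
            tf = count / len(paper)                      # topic frequency in this paper
            idf = math.log(n_papers / doc_freq[topic])   # ASSUMPTION: unsmoothed IDF
            totals[topic] += tf * idf
    # ASSUMPTION: corpus-level significance is the mean over all papers.
    return {topic: total / n_papers for topic, total in totals.items()}


def significant_topics(scores: Dict[str, float], threshold: float = 0.01) -> Dict[str, float]:
    """Keep only topics whose significance exceeds the threshold (0.01 in the paper)."""
    return {topic: s for topic, s in scores.items() if s > threshold}


# Toy corpus of three annotated papers, invented for illustration.
toy_corpus = [
    ["Data mining", "Data mining", "Learning"],
    ["Learning", "Prediction"],
    ["Learning"],
]
toy_scores = topic_significance(toy_corpus)
print(sorted(significant_topics(toy_scores)))
```

Running the same function once per year and once over all papers would yield the six corpora mentioned above. Note that with unsmoothed IDF a topic occurring in every paper scores zero, so a smoothed variant may be closer to what a production system needs.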
3.2 Data Visualization and Exploration
The topic visualization applied in Paperista is inspired by the New York Times visualizations Four Ways to Slice Obama’s 2013 Budget Proposal [1] and At the National Conventions, the Words They Used [2].

The Paperista visualization includes bubble and line charts, allowing users to gain insights into topic trends within the LA and EDM fields. Bubble charts show the importance of a certain topic for the entire dataset, each year, and/or each field. By switching between views, users can watch the changes within the dataset and compare the two fields. Animated transitions between charts help users understand these processes. In addition, since the animation does not show precise changes in a topic’s relevancy (calculated using the TF-IDF metric, see Section 3.1), users are also presented with a relevancy line chart for each topic.

The user interface (Figure 1) consists of an animated bubble cloud, two button sliders, a sidebar, and an optional timeline. The first slider button (All Years / By Year) allows users to choose between the “All Years” and “By Year” views. The “All Years” view presents relevant topics for the entire corpus of publications. The “By Year” view activates a timeline, showing relevant topics for each year. By using the slider, users can follow the change in topic relevancy through the years for which data is available (2008-2012).

The second slider button (All Topics / Group Topics) allows for grouping and regrouping of topics. The “All Topics” view shows one circle-shaped bubble chart. The “Group Topics” view divides the chart into three groups of bubbles. The first group presents topics that appear only in the EDM field. The third one shows topics related only to the LA field (i.e., LAK and JETS publications). The group in the middle shows “mixed” topics, i.e., those that appear at least once in both EDM and LAK/JETS. Different views of the bubble chart are presented in Figure 2.

The size of a bubble represents the topic’s relevancy (i.e., its TF–IDF value). The two research fields, EDM and LA, are color-coded. Each bubble is divided into two slices, the sizes of which correspond to the frequency of that topic within the publications of each of the two sources. For the years 2008-2010, the dataset contains data only for the EDM research field, so the bubbles are single-colored.

The order of topic bubbles is intended to help users compare the two fields. The leftmost bubbles represent mostly EDM-related topics, while the rightmost bubbles mostly belong to the LA field. Moreover, clicking on a bubble creates a line chart in a sidebar. The line chart shows the growth and decline of that topic.

In addition to the visualization, the Paperista application allows users to browse papers by topic. When a user clicks on a particular bubble (topic), a list of papers related to that topic appears in the right sidebar (each represented by a title and a list of authors). Clicking on a particular paper opens a link to Google Scholar with the title of the article as the search query. Thus, if a paper is available online, a user can easily obtain it through the Paperista system.
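As a small illustration of the last step, a search link of this kind can be built by URL-encoding the paper title. The sketch below is in Python for brevity (Paperista's own front end is JavaScript), and both the helper name `scholar_search_url` and the exact URL pattern are assumptions rather than details taken from the system.

```python
from urllib.parse import urlencode


def scholar_search_url(paper_title: str) -> str:
    """Build a Google Scholar search URL that uses the paper title as the query."""
    # ASSUMPTION: the standard /scholar?q=... query pattern is used.
    return "https://scholar.google.com/scholar?" + urlencode({"q": paper_title})


print(scholar_search_url("Mining semantic relations between research areas"))
```

Encoding the title matters because titles routinely contain spaces, colons and ampersands that would otherwise break the query string.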


Figure 2 - Different views of the bubble chart: (1) All Years / All Topics (no highlights); (2) All Years / Group Topics (with highlights); (3) By Year (2009) / All Topics (no highlights); and (4) By Year (2012) / Group Topics (with highlights)
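The “Group Topics” view and the two-colored bubble slices described in Section 3.2 can be sketched as below. This is an illustrative reading of the description, not Paperista's code: the function names, the representation of each topic as a pair of paper counts (EDM, LA), and the counts themselves are assumptions.

```python
from typing import Dict, List, Tuple

# Per-topic paper counts as (EDM papers, LA papers); the values below are invented.
TopicCounts = Dict[str, Tuple[int, int]]


def group_topics(counts: TopicCounts) -> Dict[str, List[str]]:
    """Split topics into the three groups shown by the Group Topics view."""
    groups: Dict[str, List[str]] = {"EDM only": [], "Mixed": [], "LA only": []}
    for topic, (edm, la) in counts.items():
        if edm > 0 and la > 0:
            groups["Mixed"].append(topic)      # appears at least once in both fields
        elif edm > 0:
            groups["EDM only"].append(topic)
        elif la > 0:
            groups["LA only"].append(topic)
    return groups


def slice_shares(edm: int, la: int) -> Tuple[float, float]:
    """Relative sizes of the EDM and LA slices of one bubble."""
    total = edm + la
    if total == 0:
        raise ValueError("topic does not occur in either field")
    return edm / total, la / total


toy_counts: TopicCounts = {
    "Intelligent tutoring system": (12, 0),
    "Data": (9, 3),
    "Learning analytics": (0, 8),
}
toy_groups = group_topics(toy_counts)
print(toy_groups)
print(slice_shares(*toy_counts["Data"]))
```

Sorting topics by their LA share would reproduce the left-to-right ordering in which mostly EDM-related bubbles come first and mostly LA-related bubbles come last.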
Furthermore, when a user hovers over a paper title, all topics related to that paper become highlighted. By hovering over papers, users can gain quick insight into topic connections between publications. Users can also distinguish papers annotated with highly relevant topics from those marked with insignificant ones. This can show which papers are more closely related to the fields of EDM and LA, and which can be viewed as “outliers”.

3.3 Paperista Architecture and Dataset API
The Paperista system consists of a Web application and a server application that provides a RESTful API for communicating with the dataset. The Web-based visualization is written in D3, a JavaScript library for manipulating documents based on data [3]. We chose D3 because of its good performance for animation and interaction within the Web environment. The visualization is available at the following address: www.uzrok.com/paperista.

All data about conference topics and their significance (explained in Section 3.1) is available as a part of the Paperista Dataset API. This API supports a REST model for accessing the data and is available at: http://147.91.128.71:9090/LAKChallenge2013. The Paperista Web application calls the operations of this API in order to access data from the dataset (for example, a click on a topic triggers a call to the API, which returns a list of papers).

4. DISCUSSION
When looking at the view displaying the topic distribution in all years (Figure 2.1), one can observe that the EDM conference dominates in almost all topics. This is due to the fact that the EDM conference has been held for longer than the LAK conference (three years longer), and thus the LAK dataset contains overall more papers coming from the EDM conference.

Filtering topics by year allows for observing the popularity of topics in a particular year and a particular field (LA or EDM). This further enables one to observe the shift in interest in a particular topic among researchers in the LA and EDM fields throughout the years. For instance, one can observe that before 2011, the topic of Learning Analytics was not very popular in the papers from the EDM field; thus this topic is not displayed at all in the visualizations for the years 2008-2010. In 2011, it boomed in popularity, as indicated by the significant rise in the number of papers covering it. In fact, this was the first year the LAK conference was organized, and it immediately captured the attention of researchers interested in the topic of Learning Analytics. Interestingly, this topic also gained some traction among the researchers publishing in the EDM field. In 2012, the topic’s popularity grew further, and the researchers covering it directed their effort toward the LA field. As a result, papers published within the LA field almost exclusively cover the topic of Learning Analytics. Similarly, we can observe topics that have kept high popularity in both areas over the years. For instance, this is the case with the Data topic, evidently as a consequence of research in both areas concentrating on the analysis of large amounts of data coming from various learning systems and other sources.

The application also allows us to observe that topics such as Intelligent Tutoring System, Prediction, and Accuracy and Precision mostly kept their popularity throughout the years and stayed exclusively within the EDM field. On the other hand, one can observe that the large majority of topics have been covered by both fields. This suggests that the similarities between the two fields are significant, as they share many research topics.

5. CONCLUSION
In this paper we have presented our approach to visualizing topics and their trends in the LA and EDM fields. Our application allows for easy identification of the main topics researchers in these fields have been focusing on, and also for exploration of papers related to those topics.

Among tools that provide visualization of research topics, ours is most similar to the previously mentioned Rexplore tool. However, while Rexplore is more focused on relations between authors and topics in research areas, Paperista’s focus is on research topics and their trends over time. Also, Paperista allows for exploring papers related to different topics.

Future work on Paperista will be primarily directed towards extending the system to support other datasets, related to other research areas. Since the LAK dataset is RDF-based, Paperista can easily be expanded to support other RDF-based datasets expressed using the same or a related vocabulary, such as the Semantic Web Dog Food corpus¹². Regarding the interface, we plan to introduce keyword-based search functionality for finding a topic by its name. This would allow for easy navigation to a desired topic and filtering of papers related to it. The final goal for Paperista is to become a universal visualization tool for research papers.

¹² http://data.semanticweb.org

6. REFERENCES
[1] Carter, S. Four Ways to Slice Obama’s 2013 Budget Proposal. New York Times, 2012. Available online: http://www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html
[2] Bostock, M., Carter, S., and Ericson, M. At the National Conventions, the Words They Used. New York Times, 2012. Available online: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
[3] Bostock, M., Ogievetsky, V., and Heer, J. D3: Data-Driven Documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2011. Available online: http://vis.stanford.edu/papers/d3
[4] Dubinko, M. et al. Visualizing Tags over Time. WWW 2006, Edinburgh. Available online: http://labs.rightnow.com/colloquium/papers/visualizing_tags.pdf
[5] Zhang, H., Korayem, M., You, E., and Crandall, D. J. Beyond Co-occurrence: Discovering and Visualizing Tag Relationships from Geo-spatial and Temporal Similarities. Available online: http://www.cs.indiana.edu/~zhanhaip/wsdm2012-clustering.pdf
[6] Lemma, R. Visualizing the DBLP Database. Bachelor Thesis, 2010. Available online: http://www.inf.usi.ch/faculty/lanza/Downloads/Lemm2010a.pdf
[7] Wattenberg, M. Arc Diagrams: Visualizing Structure in Strings. InfoVis 2002. Available online: http://hint.fm/papers/arc-diagrams.pdf
[8] Ponzetto, S. P., & Strube, M. (2007, July). Deriving a large scale taxonomy from Wikipedia. In Proceedings of the national conference on artificial intelligence (Vol. 22, No. 2, p. 1440). Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999. Available online: http://www.h-its.org/english/research/nlp/papers/ponzetto07b.pdf
[9] Ferragina, P., & Scaiella, U. (2010, October). TAGME: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management (pp. 1625-1628). ACM.
[10] Mendes, P. N., Jakob, M., García-Silva, A., & Bizer, C. (2011, September). DBpedia spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (pp. 1-8). ACM.
[11] Osborne, F., & Motta, E. (2012). Mining semantic relations between research areas. The Semantic Web–ISWC 2012, 410-426.
[12] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). DBpedia: A nucleus for a web of open data. The Semantic Web, 722-735.
[13] Milne, D., & Witten, I. H. (2008, October). Learning to link with wikipedia. In Proceedings of the 17th ACM conference on Information and knowledge management (pp. 509-518). ACM.
[14] Siemens, G., & Long, P. (2011). Penetrating the Fog: Analytics in Learning and Education. Educause Review, 46(5), 30-32.
[15] Osborne, F., & Motta, E. (2012). Making Sense of Research with Rexplore. The Semantic Web–ISWC 2012.
[16] 1st International Conference on Learning Analytics and Knowledge, Banff, Alberta, February 27–March 1, 2011. Available online: https://tekri.athabascau.ca/analytics/
[17] Romero, C., & Ventura, S. (2010). Educational data mining: a review of the state of the art. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 40(6), 601-618.
[18] Siemens, G., & Baker, R. S. D. (2012, April). Learning analytics and educational data mining: towards communication and collaboration. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge (pp. 252-254). ACM.