Visualizing the LAK/EDM Literature Using Combined Concept and Rhetorical Sentence Extraction Davide Taibi1, Ágnes Sándor2, Duygu Simsek3, Simon Buckingham Shum3, Anna DeLiddo3, Rebecca Ferguson3 1 2 3 Institute for Educational Parsing & Semantics Group The Open University Technologies Xerox Research Centre Europe Knowledge Media Institute & Italian National Research Council 6 Chemin de Maupertuis Institute of Educational Technology Via Ugo La Malfa 153 F-38240 Meylan, France Milton Keynes, MK7 6AA, UK 90146 Palermo, Italy agnes.sandor@xrce.xerox.com firstname.lastname@open.ac.uk davide.taibi@itd.cnr.it ABSTRACT Network analysis yields sets of related papers based on Scientific communication demands more than the mere listing of statistical corpus processing (Section 3). In order to improve the empirical findings or assertion of beliefs. Arguments must be precision of information about the content of the connections constructed to motivate problems, expose weaknesses, justify among the papers, we carried out semantic and rhetorical higher-order concepts, and support claims to be advancing the analysis (Section 4). On the one hand, we extracted similar field. Researchers learn to signal clearly in their writing when concepts in order to provide topical similarity indicators they are making such moves, and the progress of natural (Section 4.1) and, on the other hand, we extracted salient language processing technology has made it possible to combine sentences that indicate the main research topics of these papers conventional concept extraction with rhetorical analysis that (Section 4.2). We repeated the statistical analysis of this reduced detects these moves. To demonstrate the potential of this list of concepts, and of the reduced list of salient sentences. At technology, this short paper documents preliminary analyses of the end of this paper, we present the design and implementation the dataset published by the Society for Learning Analytics, of the first prototype of an analytics dashboard (Section 5), comprising the full texts from primary conferences and journals which is designed to summarize results of the socio-semantic- in Learning Analytics and Knowledge (LAK) and Educational rhetorical analysis in a way that users will find both meaningful Data Mining (EDM). We document the steps taken to analyse and easy to explore. the papers thematically using Edge Betweenness Clustering, combined with sentence extraction using the Xerox Incremental 2. THE LAK DATASET Parser's rhetorical analysis, which detects the linguistic forms We selected the LAK Dataset1 published by the Society for used by authors to signal argumentative discourse moves. Initial Learning Analytics Research (SoLAR2), which provides results indicate that the refined subset derived from more machine-readable plain-text versions of the Learning Analytics complex concept extraction and rhetorically significant and Knowledge (LAK) conference proceedings and a journal sentences, yields additional relevant clusters. Finally, we special issue related to learning analytics, and of the Educational illustrate how the results of this analysis can be rendered as a Data Mining (EDM) conferences and journal. visual analytics dashboard. The corpus was extracted using the SPARQL endpoint of the LAK dataset. The corpus comprised the following: Categories and Subject Descriptors  24 papers presented at the LAK2011 conference K.3.1 [Computers and Education]: Computer Uses in  42 papers presented at the LAK2012 conference Education  10 papers from the journal of Educational Technology General Terms and Society special issue on learning analytics Design  31 papers presented at the EDM2008 conference  32 papers presented at the EDM2009 conference Keywords  64 papers presented at the EDM2010 conference Learning Analytics, Corpus Analysis, Scientific Rhetoric,  61 papers presented at the EDM2011 conference Visualization, Network Analysis, Natural Language Processing  52 papers presented at the EDM2012 conference 1. INTRODUCTION AND MOTIVATION For each resource, the title, description and keywords properties Our overall aims are to provide users automatically with were used to feed the data mining processes employed in our suggestions about similar papers, about connections between analysis. At the end of this initial process, a relational database papers, and to present these similarities and connections in ways was used to store 305 papers, 599 authors, 448 distinct that are both meaningful and searchable. keywords. After this preliminary phase the entire LAK Dataset In order to achieve this, we integrated three different approaches to linking and analysing a specific dataset of scientific papers 1 LAK Dataset: http://www.solaresearch.org/resources/lak-dataset (see section 2). These approaches were: Published by SoLAR and made available to the LAK Data challenge 1. network analysis of the 3rd International Conference on Learning Analytics and 2. rhetorical analysis Knowledge (http://lakconference.org) 2 3. visualization of the results http://www.solaresearch.org was analyzed by using the Xerox Incremental Parser (XIP) [1] and their aggregations [4]. The yEd tool allows users to balance for concept extraction and rhetorical analysis, a total of 305 quality and speed of the cluster algorithm by the use of a slider. papers, from which XIP extracted 7,847 sentences and 40,163 When the quality is set at the highest value, the Girvan and concepts. Newman algorithm is used in its normal form. At the opposite end, the lowest quality value produces the fastest running time. 3. STATISTICAL ANALYSIS In this case it executes a local betweenness calculation following A preliminary analysis reported the most-used keywords, the Gregory’s algorithm [5]. When a mid value is chosen for quality most frequently occurring authors and the most-referenced and speed, the fast betweenness approximation of Brandes and papers. A second phase of analysis was then carried out using Pich [6] is applied. In this case, less accurate clustering is the data-mining tool, RapidMiner [2]. balanced by a lower execution time. 3.1 Statistical Data from RapidMiner The clusters created with yEd have the following properties: A three-step process was developed in order to analyze the  each node (paper) is a member of exactly one cluster corpus using the data-mining tool, RapidMiner:  each node shares many edges with other members of its cluster, where edges represent the connection  Process documents from file: this module generates between a pair of papers if their similarity values is word vectors from the text files. more than a threshold value (0.3 in our experiment).  Select attributes: This allows users to select the  each node shares few or no edges with nodes of other attributes to be considered by the analysis. In our case, clusters a threshold was set in order to eliminate less important elements in the word vectors. Figure 1 shows a visualization of the primary clusters. Some of  Data to similarity: This module was used to calculate the clusters did seem to have thematic coherence, while others a similarity index for the conference papers based on were harder to label: Cosine similarity.  Cluster 1: collaborative, learning, social The first block ‘Process Documents from file’ is made up of the  Cluster 2: skills, model, slip, guess, parameters following steps:  Cluster 3: causality, variables, model, construct  Cluster 4: question, fit, grain, school, skill  Tokenize: This operator splits the text of a document into a sequence of tokens.  Cluster 5: translating, sentences, grinder, corpus  Replace token: This operator is used to replace tokens, for instance in cases where words are misspelled.  Filter tokens (by length): This operator filters tokens based on their length. In our case, all the words with fewer than three characters were removed.  Filter stopwords (English): This operator filters English stopwords from a document by removing every token that is the same as a stopword from the built-in stopword list.  Stem (Snowball): This operator stems words by applying stemming using the Snowball tool.3 At the end of the main process, the ‘Data to Similarity’ step returns two results: a) The list of the most relevant words (stemmed version) used in the entire corpus b) The measured similarity index between the papers that make up the corpus. We employed the similarity relationships between papers to Figure 1: Results of initial LAK paper clustering analysis build a network of papers. In this network each node represents The complete list of the papers belonging to the clusters has a paper, and an edge between two paper is created if the been reported in the web page5 associated to this work. similarity value of a pair of papers overcome a threshold of 0.3. This analysis was word-driven and not concept-driven. The next 3.2 Analysing the Network of Papers step was to try and refine this by distilling (1) a richer set of The network of papers was then analysed with the yEd tool4 in concepts, and (2) a more salient subset of sentences. order to extract clusters of documents using the algorithm for natural clusters “based on Edge Betweenness Clustering 4. SEMANTIC ANALYSIS proposed by Girvan and Newman” [3]. This algorithm has been In order to go beyond full-text statistical analysis and find successfully used in Network Analysis to study communities connections between papers at the level of the claims they make, we processed the corpus using the Xerox Incremental Parser 3 http://snowball.tartarus.org 4 5 http://www.yworks.com http://www.pa.itd.cnr.it/lak-data-challenge.html (XIP) [1] for extracting concepts and rhetorically salient A basic observation concerns the distribution of the pairs of sentences [7]. similar papers yielded by the three methods. According to the expectations, the most similarity pairs have been yielded by 4.1 Concept Extraction taking into account the full text only in both the LAK and the The basic module of XIP performs morphosyntactic analysis, EDM collection. There are considerable overlaps among the part-of-speech tagging, constituent analysis and dependency three methods, and there are cases when just one method yields extraction on free text. Since we define concepts as simple or similarity pairs. In subsequent evaluations we aim at evaluating compound noun phrases, they can be identified using general these various cases. As a first step towards a more complete morphosyntactic analysis. Examples of extracted concepts are evaluation, we have selected some pairs of papers and checked analytics, learning analytics, social learning analytics and their similarity according to some independent similarity social network analytics. indicators. We have found that our statistical method is coherent with independent similarity indicators in case of high similarity 4.2 Rhetorical Analysis scores and that in these cases, similarity is found with and Scientific research does not consist in providing a list of facts, without XIP-extracted text. This indicates the validity of our but in the construction of narrative and argumentation around statistical method in these cases for finding related papers. In the facts. In articles, researchers make hypotheses, support, refute, case where no independent similarity indicator could be found, reconsider, confirm, and build on previous ideas in order to but we do have XIP-based similarity pairs, we looked for related support their ideas and findings. The aim of rhetorical analysis is key claims or findings in the pairs of papers 7. In the cases where to detect where authors signal that they are making such moves. the similarity score between the two papers was high we did find This analysis builds on the widely studied feature of research such interesting related claims in the two papers. However, in articles that, besides their well-defined standard structure (title, cases where the similarity measure is low, we did not find any abstract, keywords, often IMRAD body structure) rhetorical related claims. This indicates that we might want to define a moves emphasize articles’ contribution to the state of the art, threshold score. The details of the preliminary tests are reported and the research problems they address. In previous work [7] we in the web page. described a list of rhetorical moves that characterize such salient messages, together with the extraction methodology. Figure 2 5. XIP DASHBOARD lists the detected rhetorical moves (in caps) together with The XIP Dashboard was designed to provide visual analytics examples of expressions that mark them. from XIP output in order to help readers assess the current state of the art in terms of trends, patterns, gaps and connections in the LAK and EDM literature. The dashboard also draws attention to candidate patterns of potential significance within the dataset:  the occurrence of domain concepts in different metadiscourse contexts (e.g. effective tutoring dialogue in sentences classified as contrast).  trends over time (e.g. the development of an idea)  trends within and differences between research communities, as reflected in their publications. 5.1 Implementation All the papers in the LAK dataset were analyzed using XIP. The Figure 2: Rhetorical moves (in capital red letters) followed output files of the XIP analysis, one per paper, were then by some examples of expressions used to signify them in imported into a MySQL database, and the user interface was papers implemented using PHP and JavaScript, making use of Google Chart Tools for the interactive visualizations.8 Once the XIP concept extraction and rhetorical analysis were concluded we repeated the cluster analysis on the XIP-filtered 5.2 User Interface lists of concepts and salient sentences. Thus our statistical analysis (described in Section 3.) of the LAK dataset has been The dashboard consists of three sections, each showing different analytical results in different types of chart. conducted in three different ways: Section one of the dashboard shows two line charts, representing  considering the full text of the articles the LAK and the EDM conferences respectively. Each line chart  considering only the salient sentences extracted by shows the distribution of the number of salient sentences over XIP time and by rhetorical marker type (see Figure 2 for a list of the  considering only the concepts extracted by XIP types of rhetorical markers). Each coloured line in these line The comparison of the sets of papers yielded by the three charts indicates how many sentences of a specific rhetorical type approaches is still ongoing. At this stage we can only present were extracted, and how this number changed by year (Figure 3 some preliminary observations concerning pairs of similar shows the line chart for the EDM conference). papers yielded by the three kinds of input. The data obtained through this preliminary evaluation is reported in the web page6 7 The related claims have been searched by reading the pairs of related to this work. sentences. Our long-term goal is to provide the related claims automatically. 6 8 http://www.pa.itd.cnr.it/lak-data-challenge.html https://developers.google.com/chart 6. SUMMARY This short paper has summarised an approach to conducting ‘analytics on Learning Analytics’. The LAK Dataset comprising LAK and EDM literature has been analyzed in order to identify clusters of papers dealing with similar topics (conceptual clustering), and in order to identify key contributions of papers in terms of the claims authors make, as signalled by rhetorical patterns. Our preliminary tests are promising, but more thorough testing is needed to validate the method. Finally, we showed how the results of this analysis are beginning to be visualized using an analytics dashboard. All the secondary datasets produced have been published as open data, for further research. Figure 3: Rhetorical sentences graphed by year, for EDM The second section of the dashboard (Figure 4) allows users to select a combination of the extracted concepts, in order to visualize the occurrence of these concepts in papers within any or all research communities represented in the corpus– that is to say across the whole LAK dataset (EDM plus LAK conference). Figure 6: Distribution of rhetorical types in XIP-classified sentences within a selected concept bubble In the longer term, the aim of this research is to provide users with automatic suggestions about similar papers and about Figure 4: Number of papers with rhetorically extracted connections between papers, and to present these similarities sentences containing user-selected concepts and connections in ways that are both meaningful and The third dashboard section consists of a bubble chart that searchable for the users. Future steps will validate the outputs displays the occurrence of papers within the entire dataset, from these analyses with researchers, and test the usability of the filtered by user-selected concepts (Figure 5). This visualization dashboard with different end-users (e.g. researchers, educators, can be restricted to display just the LAK or the EDM students). conference. In Figure 5, each bubble represents a concept that has been selected by the user. This is associated with a specific 7. REFERENCES number of papers and sentences in which that concept has been [1] Salah Aït-Mokhtar, Jean-Pierre Chanod, and Claude detected. The colour saturation of each bubble (expressed by the Roux. (2002). Robustness beyond shallowness: color spectrum shown at the top) represents the ‘density’ of the incremental dependency parsing. Natural Language chosen concept as defined by the number of XIP-extracted Engineering, 8(2/3):121-144. sentences in which the concept occurs. The darker the colour, [2] Jungermann, F. (2009). Information extraction with the greater the density. RapidMiner. In Proceedings of the GSCL Symposium’ Sprachtechnologie und eHumanities’. W. Hoeppner, ed. [3] Girvan M. and Newman. M. E. J. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences. 99, 12, 7821-7826. [4] Newman MEJ: Detecting Community Structure in Networks. Eur Phys J B 2004, 38:321-330. [5] Gregory, S.: Local Betweenness for Finding Communities in Networks. Technical Report, University of Bristol (2008). [6] Brandes, U., Pich, C., Centrality Estimation in Large Networks. Intl. Journal of Bifurcation and Chaos in Figure 5: Concept ‘density’ within XIP sentences, by year Applied Sciences and Engineering 17(7) 2303–2318 and number of papers [7] Ágnes Sándor. (2007). Modeling metadiscourse When a concept bubble is selected (Figure 6), a pie chart pops conveying the author's rhetorical strategy in biomedical up representing the relative distribution of the rhetorical types research abstracts. Revue Française de Linguistique for that bubble (that is to say for that concept, and across the Appliquée 200(2): 97-109 papers and sentences in which the concept has been detected).