=Paper=
{{Paper
|id=Vol-1518/paper4
|storemode=property
|title=A Network Based Approach for the Visualization and Analysis of Collaboratively Edited Texts
|pdfUrl=https://ceur-ws.org/Vol-1518/paper4.pdf
|volume=Vol-1518
|dblpUrl=https://dblp.org/rec/conf/lak/HeckingH15
}}
==A Network Based Approach for the Visualization and Analysis of Collaboratively Edited Texts==
A Network Based Approach for the Visualization and Analysis of Collaboratively Edited Texts

Tobias Hecking and H. Ulrich Hoppe
University of Duisburg-Essen, Lotharstraße 63/65, 47048 Duisburg, Germany
hecking@collide.info, hoppe@collide.info

ABSTRACT

This paper describes an approach to network text analysis and visualization for collaboratively edited documents. It incorporates network extraction from texts, where nodes represent concepts identified from the words in the text and edges represent relations between the concepts. The visualization of the concept networks depicts the general structure of the underlying text in a compact way. In addition, latent relations between concepts become visible which are not explicit in the text. This work concentrates on evolving texts such as wiki articles. This introduces additional complexity, since dynamic texts lead to dynamic concept networks. The presented method retains the user information of each revision of a text and makes it visible in the network visualization. A case study demonstrates how the proposed method can be used to characterize the contributors in collaborative writing scenarios regarding the nature of the concept relations they introduce to the text.

General Terms

Algorithms, Visualization, Experimentation

Keywords

Network Visualization, Network Analysis, Natural Language Processing, Collaborative Writing, Learning Analytics

1. INTRODUCTION

Network text analysis is the task of extracting and analyzing networks from text corpora. In these networks the nodes are concepts identified from the words in the text, and the edges between the nodes represent relations between the concepts. The visualization of concept networks can help to depict the general structure of the underlying text in a compact way. In addition, latent relations between concepts become visible which are not explicit in the text. Thus, approaches for visualizing texts as networks allow analysts to concentrate on important aspects without reading large amounts of text. Several network analysis techniques can be applied to identify important concepts and perform concept clustering, as well as comparative analysis of different texts [11]. Existing applications of network text analysis include the identification of key phrases [10], the mining of relations between real-world entities [6], and the extraction of complete concept ontologies and concept maps with labelled edges [18].

This work concentrates on the relations between concepts that can be found in evolving and collaboratively edited texts such as wiki articles. This introduces additional complexity, since dynamic texts lead to dynamic concept networks. The presented method retains the user information of each revision of a text, which allows for characterizing the contributors in collaborative writing scenarios regarding the nature of the concept relations they introduce to the text. The resulting visualization is a concept network with colored edges, where each edge color is allocated uniquely to a specific contributor. In further analysis steps, network centrality measures are calculated that give additional information about the contribution of each editor.

The outline of this paper is as follows: Section 2 gives the theoretical background of this work and highlights significant research in the area of network text analysis. The general idea of our visualization and analysis approach is presented in section 3. Section 4 focuses on the concrete implementation. This incorporates the applied natural language processing chain as well as the description of the network analysis methods.

2. Background

2.1 Collaborative Writing Activities in Education

Collaborative writing activities are a common task in educational scenarios [3, 13]. Users can learn actively by creating artefacts but can also learn passively by consuming artefacts created by others [14]. It could be shown that user-generated content is relevant to learners in addition to tutor-provided content [13]. With the emergence of online communities such as Wikipedia, collaborative knowledge building takes place at open scale in terms of the number of contributors. There is some evidence that individual and collective knowledge co-evolve through the collaborative editing of epistemic artefacts in open online environments [9]. In general, collaborative writing requires different rhetorical and organizational skills of the editors [8], and thus the learner-generated artefacts are a valuable data source for analysis. This motivates the development of methods that make collaborative writing processes visible in order to understand and improve the application of collaborative text writing in educational settings.

2.2 Visualization Approaches for Collaborative Writing

Several methods have been developed to represent evolving texts with multiple editors in a visual way. One of the first approaches for the visualization of evolving wiki articles is the History Flow method [17]. In this approach each contributor is assigned a unique color. Each revision of the evolving text is then represented as a sequence of blocks that correspond to the sections of the document. The blocks are colored according to the author who has edited the section, and the size of a block corresponds to the amount of text. This not only depicts the insertion and removal of text sections by the users but additionally allows for the identification of edit wars between authors. In contrast to this page-centric view, the iChase method [12] visualizes the activities of a set of authors across multiple wiki articles as heatmaps. Southavilay et al. [16] extend the pure depiction of the amount and location of text edits done by a user by incorporating topic modeling. They apply latent Dirichlet allocation [4] in order to identify the contributions of users to the particular topics covered in a document. Based on the identified topics, the evolution of topics as well as collaboration networks of users on particular topics can be analyzed.

2.3 Representing Mental Models as Graphs

Networks are a common representation for relations between entities of various kinds. Schvaneveldt et al. [15] argue that networks between entities based on proximities induced by people have a psychological interpretation. They assume that cognitive concepts such as memory organization and mental categories are reflected in the network structure. The pathfinder algorithm [15] derives a network of concepts from proximity data. Such proximities could be induced, for example, by associations made by a person. In general, it is also possible to derive such proximity data between concepts described in natural language texts [20]. One of the first approaches that utilize computational tools to extract mental models from text has been described by Carley [5]. After the identification of relevant words in a text, the words are linked based on a syntactical analysis of the sentences. This approach has been further developed by Diesner et al. [6] and implemented in the software tool AutoMap, where an analyst can specify a metamatrix of concepts and concept classes. This further enables the identification of relations between entities of different types from text corpora, for example, people and organizations.
3. Visualization Approach

This paper extends network extraction from texts to dynamically evolving and collaboratively edited documents. When networks extracted from texts are considered as the author's mental model of the domain, as described in section 2.3, the aggregation of the networks extracted from several revisions of a collaboratively edited text can be interpreted as the joint representation of the individual mental models of all authors.

The basic assumption is that different authors introduce different concepts and relations to the text. In order to make these differences visible, the author information is additionally incorporated into the network representation. Each connection between concepts that can be extracted from the text can be labeled with the author who established it. In the small example in Figure 1 the little piece of text was produced by two different authors. Each author is assigned a unique color - in this case blue and red. The edges of the resulting network can then be colored according to the author who first introduced the concept relation in the text.

Figure 1. A concept network extracted from a text edited by two different authors. The authors are represented by color.

This not only allows for a characterization of the underlying document in terms of concept relations but also a characterization of the contributors. Central concepts that are used by different authors but linked to different other concepts indicate different associations or views of the authors. Furthermore, the visualization approach additionally depicts which authors concentrate on thematic areas and which authors tend to relate concepts from different sub-topics, for example, by writing a summary. By calculating network measures on the concept network a further quantitative characterization of the authors is possible, as described in section 4.3.

4. Implementation

This section outlines details of the implementation from two perspectives: word network extraction using natural language processing, and network analysis.

4.1 Extracting Concept Networks from Texts

The extraction of networks from text requires several natural language processing components. In this work the DKPro toolkit [7] was used. It is based on the Apache UIMA framework (https://uima.apache.org/) and provides a large variety of natural language processing algorithms that can be combined in a flexible way. The process of the extraction of word networks from a single document is depicted in Figure 2. First, a preprocessing step is often required for text gathered from the web in order to remove wiki or HTML markup. Further, in this step irrelevant content can be filtered from the document. For example, Wikipedia pages often contain a large reference section and a list of related web resources. These parts are important for the wiki article itself but are a source of noise when the actual content of the article is to be analyzed. In the second step, the phrases representing concepts in the text have to be identified and, after that, connected to a network by using a proximity measure in step 3. Since the result might contain phrases with slightly different spellings which actually refer to the same semantic concept, the entity resolution step merges those candidate phrases into a single concept. Concepts and relations can then be encoded as a network that is used for further processing. In the following, steps 2 to 4 are described in more detail.

Figure 2. Process chain for the extraction of word networks from texts.

4.1.1 Concept Extraction

For the identification of the concepts in the input text, noun phrase chunking was applied. First, the text is segmented into its sentences. Then part-of-speech (POS) tagging (using the Stanford POS tagger, http://nlp.stanford.edu/software/tagger.shtml) is applied to label each word according to its function in its sentence. A naive solution for the extraction of concepts from the text would be to take each noun identified by the POS tagging as one concept. However, often one concept is described by more than one word. For example, the phrase "Approach [NN] for [for] teaching [NN]" would result in two concepts, namely "Approach" and "Teaching", which does not really reflect the meaning of the phrase. Thus, noun phrase chunking is applied, where the POS-labeled words are chunked into meaningful noun phrases. This is done with the OpenNLP chunker (https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html), which identifies noun phrases according to certain rules. For example, the words "Approach [NN] for [for] teaching [NN]" are then identified as one single noun phrase.

4.1.2 Relation Extraction

After all concepts in the text are identified, they have to be connected to a concept network according to a certain proximity measure. In this work, an edge between two concepts is established if the concepts co-occur in a sliding window of n words in at least one of the sentences of the text. This approach is straightforward but works well in practice [6, 10].
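The sliding-window co-occurrence rule can be sketched in a few lines. This is a minimal illustration rather than the DKPro-based implementation used in this work; the function name and the assumption that each sentence is already reduced to a list of concept tokens are ours.

```python
from itertools import combinations

def cooccurrence_edges(sentences, window=4):
    """Link two concepts with an edge if they co-occur within a sliding
    window of `window` tokens inside at least one sentence."""
    edges = set()
    for tokens in sentences:
        for start in range(max(1, len(tokens) - window + 1)):
            for a, b in combinations(tokens[start:start + window], 2):
                if a != b:
                    # store undirected edges in a canonical order
                    edges.add(tuple(sorted((a, b))))
    return edges

# Toy input: one sentence already reduced to its concept phrases.
sents = [["network", "analysis", "text", "visualization"]]
edges = cooccurrence_edges(sents, window=3)  # 5 distinct concept pairs
```

The window size n is a tuning parameter; a small window yields sparse networks of closely related concepts, while a large window approaches full sentence-level co-occurrence.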
4.1.3 Entity Resolution

As already mentioned, entity resolution is necessary in order to identify nodes in the network that represent the same concept and to merge them into single nodes. For example, the noun phrases "Wiki" and "The Wikis" can be merged into the same concept "Wiki". In order to solve this problem, first all noun phrases have to be normalized using lemmatization. After that, the concepts are compared pairwise by substring similarity [1]. If the similarity exceeds a value of 0.7, the concepts are merged and labeled with the shorter of the two concept labels.

4.2 Networks from Different Revisions

In order to extract an aggregated network from different revisions of a collaboratively edited text, the process chain described in section 4.1 is applied to each revision of the text in temporal order, from the oldest to the latest revision. Each revision of the text was done by a single author. The edges in the network of the first revision are labeled with the author of this initial revision. In the first aggregation step, all edges that are part of the network extracted from the second revision but do not exist in the network of the first revision are labeled with the author of the second revision and added to the previously extracted network. This proceeds until each revision has been processed. As described in section 3, the author information attached to the edges can then be visualized by using different colors for each author.

Since the aggregated network contains every noun phrase that has been used by the authors as a concept node, the network can be very large and is likely to contain concepts that are not relevant for the domain. Those concepts are often not well connected. Thus, in a preprocessing step the k-core [2] of the network is computed, such that the resulting network contains only concepts with at least k connections to other concepts of the core. The resulting network has a reduced number of nodes, and the visualization concentrates on the most important concepts according to their connectedness to other core concepts in the network.
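The revision aggregation and the k-core reduction can be sketched in plain Python. This is a hedged re-implementation under our own assumptions: each revision is represented as an (author, edge set) pair, which is an illustrative format rather than the data structure of the actual system.

```python
def aggregate_revisions(revisions):
    """Aggregate author-labeled edges over revisions (oldest first).
    `revisions` is a list of (author, edge_set) pairs; each edge keeps
    the author of the revision that first introduced it."""
    labeled = {}
    for author, edges in revisions:
        for edge in edges:
            labeled.setdefault(edge, author)  # first author wins
    return labeled

def k_core(edges, k):
    """Iteratively prune nodes with fewer than k connections,
    a simple form of the k-core reduction [2]."""
    edges = set(edges)
    while True:
        degree = {}
        for a, b in edges:
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
        keep = {n for n, d in degree.items() if d >= k}
        pruned = {e for e in edges if e[0] in keep and e[1] in keep}
        if pruned == edges:
            return pruned
        edges = pruned
```

Because `setdefault` never overwrites an existing label, an edge that reappears in later revisions stays attributed to the author who first introduced it, which matches the aggregation rule described above.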
4.3 Quantitative Characterization of Contributors

For quantitative analysis, the nodes (concepts) and edges can be ranked according to network centrality measures [19]. In this work, concepts are ranked according to eigenvector centrality and betweenness centrality. The eigenvector centrality is a recursive measure that assigns a weight to each node according to the number of its neighbors, while the connections are weighted according to the centrality of the neighbors. This gives high weight to concepts that have many connections to other important concepts. Edges are ranked according to the edge-betweenness centrality, which assigns high weights to edges that often occur on shortest paths between pairs of nodes.

In order to use the network measures for a characterization of the authors of the document, an aggregation is necessary. For the node-centric centralities, namely node-betweenness and eigenvector centrality, the centrality contribution of an author A can be calculated by equation 1:

    C(A) = (1 / |V_A|) * Sum_{v in V_A} c(v)    (1)

where V_A is the set of nodes incident to edges labeled with author A and c(v) is the centrality of node v. This result is the average centrality of the nodes that are incident to edges labeled with author A.

The edge-betweenness contribution of author A is the average edge-betweenness of all edges labeled with author A (equation 2):

    EBC(A) = (1 / |E_A|) * Sum_{e in E_A} b(e)    (2)

where E_A is the set of edges labeled with author A and b(e) is the edge-betweenness of edge e.

An author with a high contribution in terms of edge-betweenness centrality could be interpreted as someone who relates different parts of the text and introduces relations between concepts of different sections. This could, for example, be someone who creates a comprehensive summary of a longer wiki article. Authors with a high contribution to the eigenvector centrality of the concepts can be those who work on important sections of the text and establish many relations between important domain concepts.
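The author-level aggregation of centrality scores reduces to simple averages once node and edge centralities are available. The sketch below assumes precomputed centrality dictionaries (e.g. produced by a network analysis library) and an edge-to-author labeling; all names are illustrative, not the paper's code.

```python
def author_node_contribution(node_centrality, labeled_edges, author):
    """Average node centrality over all nodes incident to edges
    labeled with `author` (the aggregation of equation 1)."""
    nodes = {n for e, a in labeled_edges.items() if a == author for n in e}
    return sum(node_centrality[n] for n in nodes) / len(nodes)

def author_edge_contribution(edge_betweenness, labeled_edges, author):
    """Average edge-betweenness over the author's edges
    (the aggregation of equation 2)."""
    values = [edge_betweenness[e] for e, a in labeled_edges.items() if a == author]
    return sum(values) / len(values)

# Toy example: two edges, each introduced by a different author.
labeled = {("a", "b"): "alice", ("b", "c"): "bob"}
centrality = {"a": 0.2, "b": 0.4, "c": 0.6}
alice_score = author_node_contribution(centrality, labeled, "alice")
```

The same node-level helper works for both eigenvector and node-betweenness centrality, since only the input dictionary changes.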
5. Case Study

As a case study, the described method was applied to a wiki article on media economy created during a master-level university course in a study program on Applied Cognitive Science and Media Science. The relations between the concepts are based on a sliding window with a size of 4 words. Figure 3 depicts the 5-core of the resulting aggregated concept network. The size of the nodes corresponds to the number of connections in order to support the visual discovery of important concepts. It can be seen directly from the visualization that the concept "media combination" is most central. Four of the six authors relate this concept to other concepts, as can be seen by counting the different colors of the incident edges. The highest coverage of edges has the author with pink as the assigned color. Other contributors relate concepts more according to certain sub-topics like communication (see blue edges).

Figure 3. 5-core of the aggregated concept network extracted from a wiki article on media economics.

The results for the quantitative characterization of the contributors are presented in Table 1. It is important to mention that reducing the network to its 5-core mainly serves presentation purposes. Thus, for more reliable results the calculations were performed on the 2-core of the network, in which more concepts are present.

Table 1. Centrality contributions of the authors. EVC: Eigenvector centrality, NBC: Node-betweenness centrality (normalized), EBC: Edge-betweenness centrality.

  Author     Color    EVC    NBC    EBC
  Student 1  Pink     0.20   0.07   161.02
  Student 2  Red      0.71   0.16   95.85
  Student 3  Green    0.35   0.07   80.17
  Student 4  Blue     0.16   0.05   73.19
  Student 5  Orange   0.10   0.04   111.45
  Student 6  Brown    0.35   0.08   81.47

Student 1 has by far the highest contribution to the edge-betweenness centrality. This is reasonable because this student reworked large parts of the article and was highly involved in the shaping of the particular sections of the text. Student 2 has the highest scores regarding the node-based centrality measures. However, the average edge-betweenness centrality is only moderate. This indicates that this student concentrated on the core topic of the article. This can also be seen in Figure 3, where the red edges of student 2 are all incident to the central concept.
6. CONCLUSION AND FURTHER WORK

The research presented in this paper describes an approach for the extraction of concept networks from text that incorporates author information in the visualization. In contrast to other existing visualizations of evolving texts, our approach focuses on the relations between concepts rather than on the amount of text that is produced by individual authors. The case study has shown that the method is promising and can contribute to the analysis of collaborative text writing. In educational scenarios, the proposed method enables tutors to investigate how students relate important domain concepts and thereby gain insights into their (possibly different) mental conceptualizations. Thus, different views and focuses of students become visible.

In future work, the visualization will be integrated into an interactive application that supports the visual exploration of the resulting network through improved node and edge highlighting as well as facilities for data gathering and network reduction using k-core analysis. Regarding the interpretation and analysis of the extracted networks, the concept extraction can be adapted in such a way that the concepts and relations can be weighted by an expert according to their importance for the domain. This would result in more compact networks. In a further evaluation, the student characterizations derived from the colored word network can be related to self-assessments and characterizations made by a tutor.

7. REFERENCES

[1] Bär, D., Zesch, T. and Gurevych, I. DKPro Similarity: An Open Source Framework for Text Similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations. (Sofia, Bulgaria). Association for Computational Linguistics, 2013, 121-126.
[2] Bader, G. D. and Hogue, C. W. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 1 (2003), 2.
[3] Belanger, Y. and Thornton, J. Bioelectricity: A Quantitative Approach - Duke University's First MOOC. Technical Report, Duke University, 2013.
[4] Blei, D. M., Ng, A. Y. and Jordan, M. I. Latent dirichlet allocation. J. Mach. Learn. Res., 3 (2003), 993-1022.
[5] Carley, K. and Palmquist, M. Extracting, representing, and analyzing mental models. Social Forces, 70, 3 (1992), 601-636.
[6] Diesner, J. and Carley, K. M. Revealing social structure from texts: meta-matrix text analysis as a novel method for network text analysis. In Causal mapping for information systems and technology research: Approaches, advances, and illustrations, 2005, 81-108.
[7] Eckart de Castilho, R. and Gurevych, I. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT. Association for Computational Linguistics, 2014, 1-11.
[8] Flower, L. and Hayes, J. R. A Cognitive Process Theory of Writing. College Composition and Communication, 32, 4 (1981), 365-387.
[9] Harrer, A., Moskaliuk, J., Kimmerle, J. and Cress, U. Visualizing wiki-supported knowledge building: co-evolution of individual and collective knowledge. In Proceedings of the International Symposium on Wikis, 2008, 19:1-19:9.
[10] Mihalcea, R. and Tarau, P. TextRank: Bringing order into texts. In Proceedings of EMNLP. Association for Computational Linguistics, (Barcelona, Spain), 2004, 404-411.
[11] Paranyushkin, D. Identifying the pathways for meaning circulation using text network analysis. Technical Report, Nodus Labs, Berlin, 2011.
[12] Riche, N. H., Lee, B. and Chevalier, F. iChase: Supporting Exploration and Awareness of Editing Activities on Wikipedia. In Proceedings of the International Conference on Advanced Visual Interfaces. (Roma, Italy). ACM, New York, NY, USA, 2010, 59-66.
[13] Ziebarth, S. and Hoppe, H. U. Moodle4SPOC: A Resource-Intensive Blended Learning Course. In Proceedings of the European Conference on Technology Enhanced Learning. (Graz, Austria), 2014, 359-372.
[14] Scardamalia, M. and Bereiter, C. Computer Support for Knowledge-Building Communities. The Journal of the Learning Sciences, 3, 3 (1993), 265-283.
[15] Schvaneveldt, R. W., Durso, F. T. and Dearholt, D. W. Network structures in proximity data. Psychol. Learn. Motiv., 24 (1989), 249-284.
[16] Southavilay, V., Yacef, K., Reimann, P. and Calvo, R. A. Analysis of collaborative writing processes using revision maps and probabilistic topic models. In Proceedings of the Learning Analytics and Knowledge Conference. (Leuven, Belgium), 2013, 38-47.
[17] Viegas, F. B., Wattenberg, M. and Dave, K. Studying cooperation and conflict between authors with history flow visualizations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2004, 575-582.
[18] Villalon, J. J. and Calvo, R. A. Concept Map Mining: A definition and a framework for its evaluation. In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology. IEEE, 2008, 357-360.
[19] Wasserman, S. and Faust, K. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[20] Wild, F., Haley, D. and Bulow, K. Monitoring conceptual development with text mining technologies: CONSPECT. In Proceedings of eChallenges, 2010, 1-8.