=Paper=
{{Paper
|id=Vol-1518/paper4
|storemode=property
|title=A Network Based Approach for the Visualization and Analysis of Collaboratively Edited Texts
|pdfUrl=https://ceur-ws.org/Vol-1518/paper4.pdf
|volume=Vol-1518
|dblpUrl=https://dblp.org/rec/conf/lak/HeckingH15
}}
==A Network Based Approach for the Visualization and Analysis of Collaboratively Edited Texts==
A Network Based Approach for the Visualization and Analysis of Collaboratively Edited Texts

Tobias Hecking and H. Ulrich Hoppe
University of Duisburg-Essen, Lotharstraße 63/65, 47048 Duisburg, Germany
hecking@collide.info, hoppe@collide.info

ABSTRACT

This paper describes an approach to network text analysis and visualization for collaboratively edited documents. It incorporates network extraction from texts, where nodes represent concepts identified from the words in the text and edges represent relations between the concepts. The visualization of the concept networks depicts the general structure of the underlying text in a compact way. In addition, latent relations between concepts become visible which are not explicit in the text. This work concentrates on evolving texts such as wiki articles. This introduces additional complexity, since dynamic texts lead to dynamic concept networks. The presented method retains the user information of each revision of a text and makes it visible in the network visualization. A case study demonstrates how the proposed method can be used to characterize the contributors in collaborative writing scenarios regarding the nature of the concept relations they introduce to the text.

General Terms

Algorithms, Visualization, Experimentation

Keywords

Network Visualization, Network Analysis, Natural Language Processing, Collaborative Writing, Learning Analytics

1. INTRODUCTION

Network text analysis is the task of extracting and analyzing networks from text corpora. In these networks the nodes are concepts identified from the words in the text, and the edges between the nodes represent relations between the concepts. The visualization of concept networks can help to depict the general structure of the underlying text in a compact way. In addition, latent relations between concepts become visible which are not explicit in the text. Thus, approaches for visualizing texts as networks allow analysts to concentrate on important aspects without reading large amounts of text. Several network analysis techniques can be applied to identify important concepts and perform concept clustering, as well as comparative analysis of different texts [11]. Existing applications of network text analysis include the identification of key phrases [10], the mining of relations between real-world entities [6], and the extraction of complete concept ontologies and concept maps with labelled edges [18].

This work concentrates on the relations between concepts that can be found in evolving and collaboratively edited texts such as wiki articles. This introduces additional complexity, since dynamic texts lead to dynamic concept networks. The presented method retains the user information of each revision of a text, which allows for characterizing the contributors in collaborative writing scenarios regarding the nature of the concept relations they introduce to the text. The resulting visualization is a concept network with colored edges, where each edge color is allocated uniquely to a specific contributor. In further analysis steps, network centrality measures are calculated that give additional information about the contribution of each editor.

The outline of this paper is as follows: Section 2 gives the theoretical background of this work and highlights significant research in the area of network text analysis. The general idea of our visualization and analysis approach is presented in section 3. Section 4 focuses on the concrete implementation. This incorporates the applied natural language processing chain as well as the description of the network analysis methods.

2. Background

2.1 Collaborative Writing Activities in Education

Collaborative writing activities are a common task in educational scenarios [3, 13]. Users can learn actively by creating artefacts but can also learn passively by consuming artefacts created by others [14]. It could be shown that user-generated content is relevant to learners in addition to tutor-provided content [13]. With the emergence of online communities such as Wikipedia, collaborative knowledge building takes place at open scale in terms of the number of contributors. There is some evidence that individual and collective knowledge co-evolve through the collaborative editing of epistemic artefacts in open online environments [9]. In general, collaborative writing requires different rhetorical and organizational skills of the editors [8], and thus the learner-generated artefacts are a valuable data source for analysis. This motivates the development of methods that make collaborative writing processes visible in order to understand and improve the application of collaborative text writing in educational settings.

2.2 Visualization Approaches for Collaborative Writing

Several methods have been developed to represent evolving texts with multiple editors in a visual way. One of the first approaches for the visualization of evolving wiki articles is the History Flow method [17]. In this approach each contributor is assigned a unique color. Each revision of the evolving text is then represented as a sequence of blocks that correspond to the sections of the document. The blocks are colored according to the author who has edited the section, and the size of a block corresponds to the amount of text. This not only depicts the insertion and removal of text sections by the users but additionally allows for the identification of edit wars between authors. In contrast to this page-centric view, the iChase method [12] visualizes the activities of a set of authors across multiple wiki articles as heatmaps. Southavilay et al. [16] extend the pure depiction of the amount and location of text edits done by a user by incorporating topic modeling. They apply latent Dirichlet allocation [4] in order to identify the contributions of users to the particular topics covered in a document. Based on the identified topics, the evolution of topics as well as collaboration networks of users on particular topics can be analyzed.

2.3 Representing Mental Models as Graphs

Networks are a common representation for relations between entities of various kinds. Schvaneveldt et al. [15] argue that networks between entities based on proximities induced by people have a psychological interpretation. They assume that cognitive concepts such as memory organization and mental categories are reflected in the network structure. The pathfinder algorithm [15] derives a network of concepts from proximity data. Such proximities could be induced, for example, by associations made by a person. In general, it is also possible to derive such proximity data between concepts described in natural language texts [20]. One of the first approaches that utilize computational tools to extract mental models from text has been described by Carley [5]. After the identification of relevant words in a text, the words are linked based on a syntactical analysis of the sentences. This approach has been further developed by Diesner et al. [6] and implemented in the software tool AutoMap, where an analyst can specify a metamatrix of concepts and concept classes. This further enables the identification of relations between entities of different types from text corpora, for example, people and organizations.
3. Visualization Approach

This paper extends network extraction from texts to dynamically evolving and collaboratively edited documents. When networks extracted from texts are considered as the author's mental model of the domain, as described in section 2.3, the aggregation of the networks extracted from several revisions of a collaboratively edited text can be interpreted as the joint representation of the individual mental models of all authors.

The basic assumption is that different authors introduce different concepts and relations to the text. In order to make these differences visible, the author information is additionally incorporated into the network representation. Each connection between concepts that can be extracted from the text can be labeled with the author who established it. In the small example in Figure 1 the little piece of text was produced by two different authors. Each author is assigned a unique color - in this case blue and red. The edges of the resulting network can then be colored according to the author who first introduced the concept relation in the text.

Figure 1. A concept network extracted from a text edited by two different authors. The authors are represented by color.

This not only allows for a characterization of the underlying document in terms of concept relations but also a characterization of the contributors. Central concepts that are used by different authors but linked to different other concepts indicate different associations or views of the authors. Furthermore, the visualization approach additionally depicts which authors concentrate on thematic areas and which authors tend to relate concepts from different sub-topics, for example, by writing a summary. By calculating network measures on the concept network a further quantitative characterization of the authors is possible, as described in section 4.3.

4. Implementation

This section outlines details of the implementation from two perspectives: word network extraction using natural language processing, and network analysis.

4.1 Extracting Concept Networks from Texts

The extraction of networks from text requires several natural language processing components. In this work the DKPro toolkit [7] was used. It is based on the Apache UIMA framework (https://uima.apache.org/) and provides a large variety of natural language processing algorithms that can be combined in a flexible way. The process of the extraction of word networks from a single document is depicted in Figure 2. First, a preprocessing step is often required for text gathered from the web in order to remove wiki or HTML markup. Further, in this step irrelevant content can be filtered from the document. For example, Wikipedia pages often contain a large reference section and a list of related web resources. These parts are important for the wiki article itself but are a source of noise when the actual content of the article is to be analyzed. In the second step, the phrases representing concepts in the text have to be identified and, after that, connected to a network by using a proximity measure in step 3. Since the result might contain phrases with slightly different spellings which actually refer to the same semantic concept, the entity resolution step merges those candidate phrases into a single concept. Concepts and relations can then be encoded as a network that is used for further processing. In the following, steps 2 to 4 are described in more detail.

Figure 2. Process chain for the extraction of word networks from texts.

4.1.1 Concept Extraction

For the identification of the concepts in the input text, noun phrase chunking was applied. First, the text is segmented into its sentences. Then part-of-speech (POS) tagging (using the Stanford POS tagger, http://nlp.stanford.edu/software/tagger.shtml) is applied to label each word according to its function in its sentence. A naive solution for the extraction of concepts from the text would be to take each noun identified by the POS tagging as one concept. However, often one concept is described by more than one word. For example, the phrase "Approach [NN] for [for] teaching [NN]" would result in two concepts, namely "Approach" and "Teaching", which does not really reflect the meaning of the phrase. Thus, noun phrase chunking is applied, where the POS-labeled words are chunked into meaningful noun phrases. This is done with the OpenNLP chunker (https://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html), which identifies noun phrases according to certain rules. For example, the words "Approach [NN] for [for] teaching [NN]" are then identified as one single noun phrase.

4.1.2 Relation Extraction

After all concepts in the text are identified, they have to be connected to a concept network according to a certain proximity measure. In this work, an edge between two concepts is established if the concepts co-occur in a sliding window of n words in at least one of the sentences of the text. This approach is straightforward but works well in practice [6, 10].
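The sliding-window co-occurrence rule can be sketched in a few lines. This is a minimal illustration rather than the DKPro-based implementation used in this work; the function name and the assumption that each sentence is already reduced to a list of concept tokens are ours.

```python
from itertools import combinations

def cooccurrence_edges(sentences, window=4):
    """Link two concepts with an edge if they co-occur within a sliding
    window of `window` tokens inside at least one sentence."""
    edges = set()
    for tokens in sentences:
        for start in range(max(1, len(tokens) - window + 1)):
            for a, b in combinations(tokens[start:start + window], 2):
                if a != b:
                    # store undirected edges in a canonical order
                    edges.add(tuple(sorted((a, b))))
    return edges

# Toy input: one sentence already reduced to its concept phrases.
sents = [["network", "analysis", "text", "visualization"]]
edges = cooccurrence_edges(sents, window=3)  # 5 distinct concept pairs
```

The window size n is a tuning parameter; a small window yields sparse networks of closely related concepts, while a large window approaches full sentence-level co-occurrence.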
4.1.3 Entity Resolution

As already mentioned, entity resolution is necessary in order to identify nodes in the network that represent the same concept and to merge them into single nodes. For example, the noun phrases "Wiki" and "The Wikis" can be merged into the same concept "Wiki". In order to solve this problem, first all noun phrases have to be normalized using lemmatization. After that, the concepts are compared pairwise by substring similarity [1]. If the similarity exceeds a value of 0.7, the concepts are merged and labeled with the shorter of the two concept labels.

4.2 Networks from Different Revisions

In order to extract an aggregated network from different revisions of a collaboratively edited text, the process chain described in section 4.1 is applied to each revision of the text in temporal order, from the oldest to the latest revision. Each revision of the text was done by a single author. The edges in the network of the first revision are labeled with the author of this initial revision. In the first aggregation step, all edges that are part of the network extracted from the second revision but do not exist in the network of the first revision are labeled with the author of the second revision and added to the previously extracted network. This proceeds until each revision has been processed. As described in section 3, the author information attached to the edges can then be visualized by using different colors for each author.

Since the aggregated network contains every noun phrase that has been used by the authors as a concept node, the network can be very large and is likely to contain concepts that are not relevant for the domain. Those concepts are often not well connected. Thus, in a preprocessing step the k-core [2] of the network is computed, such that the resulting network contains only concepts with at least k connections to other concepts of the core. The resulting network has a reduced number of nodes, and the visualization concentrates on the most important concepts according to their connectedness to other core concepts in the network.
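The revision aggregation and the k-core reduction can be sketched in plain Python. This is a hedged re-implementation under our own assumptions: each revision is represented as an (author, edge set) pair, which is an illustrative format rather than the data structure of the actual system.

```python
def aggregate_revisions(revisions):
    """Aggregate author-labeled edges over revisions (oldest first).
    `revisions` is a list of (author, edge_set) pairs; each edge keeps
    the author of the revision that first introduced it."""
    labeled = {}
    for author, edges in revisions:
        for edge in edges:
            labeled.setdefault(edge, author)  # first author wins
    return labeled

def k_core(edges, k):
    """Iteratively prune nodes with fewer than k connections,
    a simple form of the k-core reduction [2]."""
    edges = set(edges)
    while True:
        degree = {}
        for a, b in edges:
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
        keep = {n for n, d in degree.items() if d >= k}
        pruned = {e for e in edges if e[0] in keep and e[1] in keep}
        if pruned == edges:
            return pruned
        edges = pruned
```

Because `setdefault` never overwrites an existing label, an edge that reappears in later revisions stays attributed to the author who first introduced it, which matches the aggregation rule described above.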
4.3 Quantitative Characterization of Contributors

For quantitative analysis, the nodes (concepts) and edges can be ranked according to network centrality measures [19]. In this work, concepts are ranked according to eigenvector centrality and betweenness centrality. The eigenvector centrality is a recursive measure that assigns a weight to each node according to the number of its neighbors, while the connections are weighted according to the centrality of the neighbors. This gives high weight to concepts that have many connections to other important concepts. Edges are ranked according to the edge-betweenness centrality, which assigns high weights to edges that often occur on shortest paths between pairs of nodes.

In order to use the network measures for a characterization of the authors of the document, an aggregation is necessary. For the node-centric centralities, namely node-betweenness and eigenvector centrality, the centrality contribution of an author A can be calculated by equation 1:

    C(A) = (1 / |V_A|) * Sum_{v in V_A} c(v)    (1)

where V_A is the set of nodes incident to edges labeled with author A and c(v) is the centrality of node v. This result is the average centrality of the nodes that are incident to edges labeled with author A.

The edge-betweenness contribution of author A is the average edge-betweenness of all edges labeled with author A (equation 2):

    EBC(A) = (1 / |E_A|) * Sum_{e in E_A} b(e)    (2)

where E_A is the set of edges labeled with author A and b(e) is the edge-betweenness of edge e.

An author with a high contribution in terms of edge-betweenness centrality could be interpreted as someone who relates different parts of the text and introduces relations between concepts of different sections. This could, for example, be someone who creates a comprehensive summary of a longer wiki article. Authors with a high contribution to the eigenvector centrality of the concepts can be those who work on important sections of the text and establish many relations between important domain concepts.
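The author-level aggregation of centrality scores reduces to simple averages once node and edge centralities are available. The sketch below assumes precomputed centrality dictionaries (e.g. produced by a network analysis library) and an edge-to-author labeling; all names are illustrative, not the paper's code.

```python
def author_node_contribution(node_centrality, labeled_edges, author):
    """Average node centrality over all nodes incident to edges
    labeled with `author` (the aggregation of equation 1)."""
    nodes = {n for e, a in labeled_edges.items() if a == author for n in e}
    return sum(node_centrality[n] for n in nodes) / len(nodes)

def author_edge_contribution(edge_betweenness, labeled_edges, author):
    """Average edge-betweenness over the author's edges
    (the aggregation of equation 2)."""
    values = [edge_betweenness[e] for e, a in labeled_edges.items() if a == author]
    return sum(values) / len(values)

# Toy example: two edges, each introduced by a different author.
labeled = {("a", "b"): "alice", ("b", "c"): "bob"}
centrality = {"a": 0.2, "b": 0.4, "c": 0.6}
alice_score = author_node_contribution(centrality, labeled, "alice")
```

The same node-level helper works for both eigenvector and node-betweenness centrality, since only the input dictionary changes.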
5. Case Study

As a case study, the described method was applied to a wiki article on media economy created during a master-level university course in a study program on Applied Cognitive Science and Media Science. The relations between the concepts are based on a sliding window with a size of 4 words. Figure 3 depicts the 5-core of the resulting aggregated concept network. The size of the nodes corresponds to the number of connections in order to support the visual discovery of important concepts. It can be seen directly from the visualization that the concept "media combination" is most central. Four of the six authors relate this concept to other concepts, as can be seen by counting the different colors of the incident edges. The highest coverage of edges has the author with pink as the assigned color. Other contributors relate concepts more according to certain sub-topics like communication (see blue edges).

Figure 3. 5-core of the aggregated concept network extracted from a wiki article on media economics.

The results for the quantitative characterization of the contributors are presented in Table 1. It is important to mention that reducing the network to its 5-core mainly serves presentation purposes. Thus, for more reliable results the calculations were performed on the 2-core of the network, in which more concepts are present.

Table 1. Centrality contributions of the authors. EVC: Eigenvector centrality, NBC: Node-betweenness centrality (normalized), EBC: Edge-betweenness centrality.

  Author     Color    EVC    NBC    EBC
  Student 1  Pink     0.20   0.07   161.02
  Student 2  Red      0.71   0.16   95.85
  Student 3  Green    0.35   0.07   80.17
  Student 4  Blue     0.16   0.05   73.19
  Student 5  Orange   0.10   0.04   111.45
  Student 6  Brown    0.35   0.08   81.47

Student 1 has by far the highest contribution to the edge-betweenness centrality. This is reasonable because this student reworked large parts of the article and was highly involved in the shaping of the particular sections of the text. Student 2 has the highest scores regarding the node-based centrality measures. However, the average edge-betweenness centrality is only moderate. This indicates that this student concentrated on the core topic of the article. This can also be seen in Figure 3, where the red edges of student 2 are all incident to the central concept.
6. CONCLUSION AND FURTHER WORK

The research presented in this paper describes an approach for the extraction of concept networks from text that incorporates author information in the visualization. In contrast to other existing visualizations of evolving texts, our approach focuses on the relations between concepts rather than on the amount of text that is produced by individual authors. The case study has shown that the method is promising and can contribute to the analysis of collaborative text writing. In educational scenarios, the proposed method enables tutors to investigate how students relate important domain concepts and thereby gain insights into their (possibly different) mental conceptualizations. Thus, different views and focuses of students become visible.

In future work, the visualization will be integrated into an interactive application that supports the visual exploration of the resulting network through improved node and edge highlighting as well as facilities for data gathering and network reduction using k-core analysis. Regarding the interpretation and analysis of the extracted networks, the concept extraction can be adapted in such a way that the concepts and relations can be weighted by an expert according to their importance for the domain. This would result in more compact networks. In a further evaluation, the student characterizations derived from the colored word network can be related to self-assessments and characterizations made by a tutor.

7. REFERENCES

[1] Bär, D., Zesch, T. and Gurevych, I. DKPro Similarity: An Open Source Framework for Text Similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations. (Sofia, Bulgaria). Association for Computational Linguistics, 2013, 121-126.
[2] Bader, G. D. and Hogue, C. W. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4, 1 (2003), 2.
[3] Belanger, Y. and Thornton, J. Bioelectricity: A Quantitative Approach - Duke University's First MOOC. Technical Report, Duke University, 2013.
[4] Blei, D. M., Ng, A. Y. and Jordan, M. I. Latent dirichlet allocation. J. Mach. Learn. Res., 3 (2003), 993-1022.
[5] Carley, K. and Palmquist, M. Extracting, representing, and analyzing mental models. Social Forces, 70, 3 (1992), 601-636.
[6] Diesner, J. and Carley, K. M. Revealing social structure from texts: meta-matrix text analysis as a novel method for network text analysis. In Causal mapping for information systems and technology research: Approaches, advances, and illustrations, 2005, 81-108.
[7] Eckart de Castilho, R. and Gurevych, I. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT. Association for Computational Linguistics, 2014, 1-11.
[8] Flower, L. and Hayes, J. R. A Cognitive Process Theory of Writing. College Composition and Communication, 32, 4 (1981), 365-387.
[9] Harrer, A., Moskaliuk, J., Kimmerle, J. and Cress, U. Visualizing wiki-supported knowledge building: co-evolution of individual and collective knowledge. In Proceedings of the International Symposium on Wikis, 2008, 19:1-19:9.
[10] Mihalcea, R. and Tarau, P. TextRank: Bringing order into texts. In Proceedings of EMNLP. Association for Computational Linguistics, (Barcelona, Spain), 2004, 404-411.
[11] Paranyushkin, D. Identifying the pathways for meaning circulation using text network analysis. Technical Report, Nodus Labs, Berlin, 2011.
[12] Riche, N. H., Lee, B. and Chevalier, F. iChase: Supporting Exploration and Awareness of Editing Activities on Wikipedia. In Proceedings of the International Conference on Advanced Visual Interfaces. (Roma, Italy). ACM, New York, NY, USA, 2010, 59-66.
[13] Ziebarth, S. and Hoppe, H. U. Moodle4SPOC: A Resource-Intensive Blended Learning Course. In Proceedings of the European Conference on Technology Enhanced Learning. (Graz, Austria), 2014, 359-372.
[14] Scardamalia, M. and Bereiter, C. Computer Support for Knowledge-Building Communities. The Journal of the Learning Sciences, 3, 3 (1993), 265-283.
[15] Schvaneveldt, R. W., Durso, F. T. and Dearholt, D. W. Network structures in proximity data. Psychol. Learn. Motiv., 24 (1989), 249-284.
[16] Southavilay, V., Yacef, K., Reimann, P. and Calvo, R. A. Analysis of collaborative writing processes using revision maps and probabilistic topic models. In Proceedings of the Learning Analytics and Knowledge Conference. (Leuven, Belgium), 2013, 38-47.
[17] Viegas, F. B., Wattenberg, M. and Dave, K. Studying cooperation and conflict between authors with history flow visualizations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2004, 575-582.
[18] Villalon, J. J. and Calvo, R. A. Concept Map Mining: A definition and a framework for its evaluation. In Proceedings of the International Conference on Web Intelligence and Intelligent Agent Technology. IEEE, 2008, 357-360.
[19] Wasserman, S. and Faust, K. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[20] Wild, F., Haley, D. and Bulow, K. Monitoring conceptual development with text mining technologies: CONSPECT. In Proceedings of eChallenges, 2010, 1-8.