=Paper= {{Paper |id=Vol-1518/paper3 |storemode=property |title=Getting a Grasp on Tag Collections by Visualising Tag Clusters Based on Higher-order Co-occurrences |pdfUrl=https://ceur-ws.org/Vol-1518/paper3.pdf |volume=Vol-1518 |dblpUrl=https://dblp.org/rec/conf/lak/NiemannSRWDS15 }} ==Getting a Grasp on Tag Collections by Visualising Tag Clusters Based on Higher-order Co-occurrences== https://ceur-ws.org/Vol-1518/paper3.pdf
       Getting a grasp on tag collections by visualising tag
         clusters based on higher-order co-occurrences

           Katja Niemann, Sarah León Rojas,                        Maren Scheffel, Hendrik Drachsler,
                    Martin Wolpers                                         Marcus Specht
                        Fraunhofer FIT                                  Open University of the Netherlands
                      Schloss Birlinghoven                                    Valkenburgerweg 177
                 53754 Sankt Augustin, Germany                          6419 AT Heerlen, The Netherlands
            {katja.niemann, sarah.leon.rojas,                         {maren.scheffel, hendrik.drachsler,
            martin.wolpers}@fit.fraunhofer.de                              marcus.specht}@ou.nl

ABSTRACT                                                         tos or videos) tags provide meaningful descriptors of these
Tagging learning resources in repositories or web portals of-    objects [8].
fers a way to meaningfully describe these resources. The
more tags there are, however, the more difficult it is to find   A common problem, however, when relying on tags is that
one’s way around the repository, especially when they are        they are often user-generated and not restricted to a closed
user-generated free-text tags. This paper therefore presents     vocabulary. Different users can tag the same learning re-
a visualisation of tag clusters based on higher-order co-oc-     source with different tags leading to a large collection of
currences that allows users of such repositories a plain but     rarely used but highly related tags. The use of singular or
simple way of exploring them in an intuitive manner.             plural versions of the same word, the same word in differ-
                                                                 ent languages or different words with the same meaning, i.e.
                                                                 synonyms, can also lead to problems when relying on tags
Categories and Subject Descriptors                               in order to get an overview on a collection of learning re-
H.3.3 [Information Storage and Retrieval]: Informa-
                                                                 sources. In order to detect unknown relations between tags
tion Search and Retrieval—Clustering, Information filtering,
                                                                 they therefore need to be contextualised.
Search process, Selection process; I.2.7 [Artificial Intelli-
gence]: Natural Language Processing; J.1 [Administra-
                                                                 Based on an approach of visualising large document collec-
tive Data Processing]: —Education; K.3.1 [Computers
                                                                 tions according to the documents’ keywords [6] we suggest
and Education]: Computer Uses in Education
                                                                 to use a visualisation of tag relations that allows users to
                                                                 quickly get a grasp of the resources offered by a learning por-
General Terms                                                    tal and to dig deeper to get an understanding of certain sub-
Algorithms, Experimentation, Language                            ject areas. Instead of clustering the learning objects accord-
                                                                 ing to their content, however, we cluster the tags according
Keywords                                                         to their higher-order co-occurrences and then present them
clustering, higher-order co-occurrences, tags, technology en-    in a clearly arranged and intuitive manner. The creation of
hanced learning, visualisation                                   higher-order co-occurrences is a well-known approach in cor-
                                                                 pus linguistics to discover semantic relations between words
                                                                 based on their usage in text documents [2]. We adapt this
1.   INTRODUCTION                                                approach by analysing the assignments of tags to learning
Many educational web portals allow users to manually en-         resources instead of the occurrences of terms in sentences or
rich the offered learning resources with social metadata like    text documents.
comments and free-text tags. It has been shown that tags in
particular provide powerful knowledge that can be used to        The paper is structured as follows. Chapter 2 gives a short
improve the quality of searching and recommendations [4, 7].     overview of related work. Chapter 3 describes the approach
Similar to automatically extracted keywords, tags thus offer     of higher-order co-occurrence clustering to group tags with
a way to get a quick grasp on the content or theme of mul-       similar meanings, followed by the description of the MACE
timedia objects. Especially when dealing with multimedia         data set in chapter 4 which is used in this paper. Thereafter,
objects that provide little or no textual context (e.g. pho-     chapter 5 describes the visualisation of the tag clusters and
                                                                 chapter 6 discusses the results. Finally, chapter 7 holds a
                                                                 conclusion and an outlook on future work.

                                                                 2.     RELATED WORK
                                                                 According to Rivadeneira et al. [5], a meaningful visuali-
                                                                 sation of tags supports four main functions: (1) search, i.e.
                                                                 tags can be directly included in the search process and, thus,
                                                                 enhance the findability of items, (2) browsing, i.e. the visual-
                                                                 isation offers a central entry point for users that know what
they are looking for but not what exactly to search for, (3)       Heyer et al. [2] show this for the co-occurrences of IBM,
impression formation / gisting, i.e. the visualisation allows      among other words. Their investigations are based on text
users to get a quick grasp on the items’ subject areas, and        corpora collected for the portal wortschatz.uni-leipzig.de,
(4) recognition, i.e. the users are offered the possibility to     the German treasury of words. The first co-occurrence class
understand different aspects of certain information.               is rather heterogeneous, and contains words like computer
                                                                   manufacturer, stock exchange, global and so on. After some
The most common approach to visualise a large number of            iterations of computing higher-order co-occurrence classes,
tags is the creation of tag clouds. Here, the relative size of     however, the classes become more homogenous and stable.
each tag stands in relation to its frequency in the tag col-       The tenth order co-occurrence class only contains names of
lection. Nowadays, many tools are available that allow an          other computer-related companies like Microsoft, Sony etc.
easy integration of personalised tag clouds in web sites, e.g.
TagCrowd1 and Wordle2 . While there is a huge potential in-        In the given scenario we do not have sentences in which the
herent in tag clouds they also suffer from some issues, e.g. the   tags occur. However, the tags are assigned to learning re-
missing semantic relations between the visualised tags [1, 9]. In order   sources which can be considered to represent usage contexts.
to deal with this, tag clouds have been created that analyse       Thus, two tags are co-occurrences if they are assigned to at
(first-order) co-occurrences between the tags and group tags       least one common learning resource. In order to calculate
that often co-occur [11]. Here, similar tags do not necessar-      the significance of two tags, the association measure Mutual
ily refer to the same semantic concept but are linked by           Information (MI) is used which compares the observed fre-
the resources they have in common [3]. Another problem             quency O of a co-occurrence with its expected frequency E,
of tag clouds is that many frequent tags often dominate the        see formula 1.
whole tag cloud and less frequent tags and their concepts
get lost [1].

This paper presents a clustering approach for tags that is         MI = \log_2 \frac{O}{E}                             (1)
based on higher-order co-occurrences, i.e. a corpus linguistic
technique to find semantically related terms [2]. This way
we aim to discover and visually cover all subject areas even       Here, selecting the n most significant co-occurrences for each
though it might not be possible to display all single tags.        tag would imply to have a pre-defined cluster size which is
                                                                   not desirable, thus, a threshold is used. Because the calcu-
                                                                   lated significance scores for resource pairs are only compara-
3.     HIGHER-ORDER CO-OCCURRENCE                                  ble if they have one resource in common, a resource-specific
       CLUSTERING OF TAGS                                          threshold is used to distinguish between relevant and non-
The creation of higher order co-occurrences is a corpus-           relevant co-occurrences. Here, this threshold is calculated
linguistic approach to exploit the usage context of linguistic     for each learning resource by averaging the significance val-
entities in order to find semantic relations. Two linguistic       ues of all its co-occurrences and multiplying the result with
entities are defined to be co-occurrences if they occur in at      a regulation constant α which has a value of 0.95 in the
least one common usage context, e.g. in a sentence. For ex-        presented experiment.
ample, the word dog often co-occurs with the words bark,
growl, and sniff among others.
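
The co-occurrence definition and the Mutual Information measure of formula 1 can be sketched in Python as follows. This is a minimal illustration, not the implementation used for the experiment: the function name and the input format (one set of tags per learning resource) are hypothetical, and the expected frequency E is assumed to be the frequency predicted under independence, E = f(a) * f(b) / N, which the text does not spell out.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(contexts):
    """Return {(tag_a, tag_b): MI} for every tag pair sharing a context.

    contexts: iterable of tag collections, one per learning resource.
    ASSUMPTION: E = f(a) * f(b) / N (independence); not stated in the paper.
    """
    contexts = [set(c) for c in contexts]
    n = len(contexts)
    # f(tag): number of resources a tag is assigned to.
    tag_freq = Counter(tag for c in contexts for tag in c)
    # O: number of resources on which both tags of a pair occur.
    pair_freq = Counter(
        pair for c in contexts for pair in combinations(sorted(c), 2)
    )
    scores = {}
    for (a, b), observed in pair_freq.items():
        expected = tag_freq[a] * tag_freq[b] / n
        scores[(a, b)] = math.log2(observed / expected)
    return scores
```

Pairs are stored with their tags in sorted order, so each unordered pair appears once. Tags that never share a resource are not co-occurrences and therefore receive no score.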
                                                                   4.     THE MACE DATA SET
In order to calculate the significance of a co-occurrence sta-     The MACE3 (Metadata for Architectural Contents in Eu-
tistical association measures are used. Thereafter, the most       rope) project relates digital learning resources about archi-
significant co-occurrences must be selected for each term.         tecture with each other across repository boundaries to en-
Since there is no standard scale of measurement to draw a          able a simplified discovery and access [10]. Users are able
clear distinction between significant and non-significant oc-      to search for learning resources and filter the results, e.g.
currences, there are two ways to do so, i.e. by selecting only     according to their language, the original repository, and the
the n most significant co-occurrences for each resource or by      classification terms they hold. Furthermore, the portal offers
using a threshold.                                                 a social search based on tags, a location search based on the
                                                                   geographical coordinates of buildings represented through
The significant co-occurrences of an entity form its first-        learning resources, and a competence search based on the
order co-occurrence class and entities which co-occur in first-    competencies the learning resources aim to impart. Regis-
order co-occurrence classes are second-order co-occurrences.       tered and logged-in users are able to rate, tag, and com-
These second-order co-occurrence classes again can be used         ment on learning resources. Additionally, they can follow
as input to calculate third order co-occurrences and so forth.     the metadata provision activities of other users.
When this procedure is repeated several times, the higher-
order co-occurrence classes tend to get stable, i.e. their ele-    The MACE data set holds 117,907 events on 12,442 learning
ments do not change any more. This indicates that there ex-        resources conducted by 630 registered users. 70.8% of the
ist universal relations between the entities in the remaining      learning resources hold tags in which each tagged learning
classes that induce their aggregation again in each iteration      resource holds on average 6.59 tags. Overall, the users as-
step. In fact, these stable higher-order co-occurrence classes     signed 13,291 distinct tags of which 73% are only used once
have been shown to usually hold semantically related entities.     and only about 4% of the tags are added to more than 10
                                                                   learning resources.
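
The thresholding and iteration described in Section 3 can be sketched as follows. This is one possible reading under stated assumptions, not the authors' code: the function names are hypothetical, E is again taken as the frequency expected under independence, a co-occurrence is kept when its score is at least the entity's threshold, and each order's classes are fed back in as the usage contexts of the next order. The value α = 0.95 is the one reported in the text.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

ALPHA = 0.95  # regulation constant reported in the text


def first_order_classes(contexts, alpha=ALPHA):
    """Map each entity to the set of its significant co-occurrences."""
    contexts = [set(c) for c in contexts if c]
    n = len(contexts)
    freq = Counter(e for c in contexts for e in c)
    pairs = Counter(p for c in contexts for p in combinations(sorted(c), 2))
    # MI = log2(O / E); E under independence is an assumption of this sketch.
    scores = defaultdict(dict)
    for (a, b), observed in pairs.items():
        mi = math.log2(observed * n / (freq[a] * freq[b]))
        scores[a][b] = mi
        scores[b][a] = mi
    classes = {}
    for entity, neighbours in scores.items():
        # Entity-specific threshold: mean significance times alpha.
        threshold = alpha * sum(neighbours.values()) / len(neighbours)
        classes[entity] = {e for e, mi in neighbours.items() if mi >= threshold}
    return classes


def higher_order_classes(contexts, max_order=10):
    """Reuse each order's classes as contexts until they stop changing."""
    classes = first_order_classes(contexts)
    for _ in range(max_order - 1):
        next_classes = first_order_classes(classes.values())
        if next_classes == classes:
            break  # stable: the classes no longer change
        classes = next_classes
    return classes
```

The `max_order` cap is a hypothetical safeguard for inputs that never stabilise. The text only states that the classes tend to become stable after several iterations.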
1 http://tagcrowd.com/
2 http://wordle.net/
3 http://mace-project.eu/
5.   VISUALISATION
When creating a visualisation of the tag clusters for the
MACE data set we decided to not present tags in the vi-
sualisation that are assigned to only one or two learning
resources. Only clusters that hold more than five tags are
selected for presentation. Finally, the two most frequent
tags are selected as title for each cluster. If a cluster’s most
frequent tags significantly overlap, the less frequent one is
neglected and the next frequent tag is selected.
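
The selection rules above can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, the cluster-size check is assumed to be applied after rare tags have been dropped, and "significantly overlap" is interpreted as a high Jaccard overlap of the tags' resource sets with an arbitrary cutoff of 0.8, since the text gives neither a measure nor a number.

```python
MIN_RESOURCES_PER_TAG = 3  # tags on only one or two resources are hidden
MIN_TAGS_PER_CLUSTER = 6   # "more than five tags"
OVERLAP_CUTOFF = 0.8       # hypothetical; the text gives no number


def jaccard(a, b):
    """Share of resources two tags have in common (0.0 for empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prepare_cluster(tags, tag_resources):
    """Return (title_tags, visible_tags), or None if the cluster is dropped.

    tags: iterable of tag names belonging to one cluster.
    tag_resources: dict mapping tag -> set of resource ids.
    """
    visible = [t for t in tags if len(tag_resources[t]) >= MIN_RESOURCES_PER_TAG]
    if len(visible) < MIN_TAGS_PER_CLUSTER:
        return None
    # Most frequent tags first; ties are broken alphabetically.
    visible.sort(key=lambda t: (-len(tag_resources[t]), t))
    title = [visible[0]]
    for candidate in visible[1:]:
        # Skip a candidate that largely overlaps with the first title tag.
        if jaccard(tag_resources[title[0]], tag_resources[candidate]) < OVERLAP_CUTOFF:
            title.append(candidate)
            break
    return title, visible
```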

After this data processing, the tag clusters and all attached
information are written to a JSON file. The visualisation
is realised using the Data-Driven Documents D3.js frame-
work4 , i.e. a JavaScript library, paired with HTML, CSS
and JQuery to process the previously created JSON files.
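
A possible shape for such a JSON file is sketched below. The schema (the keys "clusters", "title", "size", "tags", "name", "count") and the function name are assumptions made for illustration; the paper does not document the actual file format. The cap of 20 tags per cluster and the definition of cluster size as the number of referenced learning resources follow the description in this section.

```python
import json


def clusters_to_json(clusters, tag_resources, max_tags=20):
    """Serialise clusters, largest first, into a JSON string.

    clusters: list of (title_tags, tags) pairs, tags sorted by frequency.
    tag_resources: dict mapping tag -> set of resource ids.
    """
    records = []
    for title_tags, tags in clusters:
        shown = tags[:max_tags]
        resources = set()
        for tag in shown:
            resources |= tag_resources[tag]
        records.append({
            "title": " / ".join(title_tags),
            # Distinct resources referenced by the displayed tags.
            "size": len(resources),
            "tags": [
                {"name": t, "count": len(tag_resources[t])} for t in shown
            ],
        })
    records.sort(key=lambda r: r["size"], reverse=True)
    return json.dumps({"clusters": records}, ensure_ascii=False, indent=2)
```

Sorting by size in advance would let a front end place the clusters along the spiral in order without recomputing anything, although that division of labour is a design choice of this sketch rather than something stated in the text.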

Figure 1 shows the default starting view of the visualisa-
tion5 . The tag clusters are represented by circles and are
ordered according to their size in the form of a spiral with
the largest cluster having the largest circle and being posi-
tioned at the outside of the spiral and the smallest cluster
being in the middle of the spiral. Here, the size of a tag
cluster depends on the number of learning objects that are                             Figure 1: Start view
referenced by the tags belonging to it. In addition to size
and position, every cluster has its own color and is labelled
with its two most representing tags to enable the users to         ods, objects, and materials used for insulation, e.g. cobertes
quickly get a grasp on the clusters’ content.
                                                                   (covered), paneles (panels), as well as poliestireno (polysty-
                                                                   rene) and reference 683 distinct resources. The resources’
By clicking on a cluster, the view changes and the visualisa-      descriptions hold further tags that can be used to orientate
tion zooms into the chosen cluster for which up to 20 tags         in this field. For example, figure 3 shows an excerpt from
become visible. We chose this number to not overload the           the list of resources that are assigned with the tag sand-
visualisation. In order to continue the circle approach used       wich. While this tag might be unexpected at a first glance,
for the clusters, we adapted the common usage of font size,        the tags it was used with clarify its meaning, i.e. a (panel)
coloring and word positioning in tag clouds and used sized         structure made of three layers. Overall, 2,190 distinct tags
and spirally ordered circles for the tags as well. On the right    are given in the resource list of this cluster.
side of the visualisation, a list of all the learning resources
that are associated with that cluster is given showing the re-     Cluster 2: fachada / facana (facade). This cluster
sources’ title, media type, and language in addition to the        mainly holds Spanish and Catalan tags that deal with the
list of all tags assigned to it. All resource titles link to the   construction and cladding of buildings, e.g. sistemas con-
original resource.                                                 structivos (building systems), cerramientos (enclosure), gres
                                                                   (stoneware), and constructivos (building). Overall, this clus-
Clicking on a tag circle results in a new list next to the         ter’s tags reference 661 distinct resources that are assigned
visualisation in which all resources assigned with that tag are    with 2,080 distinct tags.
given. By clicking on a specific tag, its circle is highlighted
and the object list only displays those resources that are
                                                                   Cluster 3: seguridad obra / seguridad construccion
assigned with the highlighted tag, see figure 2.                   (work and construction safety). This cluster holds a
                                                                   mix of Spanish and English tags that deal with security,
6.   DISCUSSION                                                    e.g. seguridad trabajador (worker safety), construction se-
This chapter provides an insight on the eleven clusters shown      curity, sistemas de seguridad (security systems), and nor-
in the visualisation, discusses the topics they cover includ-      mativa (regulations). Overall, it references 532 distinct re-
ing their relations, and references further distinctive features.  sources that hold 634 distinct tags.
The following cluster descriptions are ordered by the size of
the clusters, i.e. from the outside of the spiral to its center.   Cluster 4: architects / design. The first cluster that
Whenever needed, the tags’ English translations are given in       mainly holds English tags and few Spanish ones deals with
brackets. Here, if two tags hold the same English translation      (green) architecture in the public space, e.g. architecture,
it is only given once.                                             museum, green architecture, architettura (architecture), pi-
                                                                   azza, and bioarchitettura. It references 296 distinct resources
Cluster 1: cubierta / aislante (cover and insulation).             that hold 962 distinct tags.
This cluster’s tags, which are mainly in Spanish, name meth-
                                                                   Cluster 5: movimiento tierras / tierras (land move-
4 http://d3js.org/
5 The visualisation is available at http://mitarbeiter.fit.fraunhofer.de/~niemann/VisLA/

                                                                   Cluster 5: movimiento tierras / tierras (land move-
                                                                   ment). This cluster comprises Spanish tags that deal with
                                                                   the preparation of building zones, e.g. excavaciones (dig-
                                                                   gings), maquinaria (machinery), calculo (calculation), and
                                        Figure 2: Zoomed cluster view with a selected tag


excavadora (excavator). It references 278 distinct resources       well as famous architects of those buildings, e.g. Santiago
that comprise 1,201 distinct tags.                                 Calatrava Valls and Norman Robert Foster. Overall, the
                                                                   cluster references 86 distinct learning resources that com-
Cluster 6: cimentaciones / fonaments (foundation).                 prise 107 distinct tags.
This cluster mainly holds Spanish and Catalan tags that
deal with the construction and anchoring of buildings, e.g.        Cluster 10: software / 3d . The only cluster that contains
muro (wall), building, terreno (ground), zapatas (shoes), and      less than 20 tags deals with the design of buildings using the
anclajes (anchors). Overall, this cluster’s tags reference 210     computer and comprises tags like cad (computer-aided de-
distinct learning resources that are assigned with 831 dis-        sign), rhino3d (CAD Software), tutorial, and programming.
tinct tags.                                                        The cluster references 72 distinct resources that are assigned
                                                                   with 274 distinct tags.
Cluster 7: torre / portale (tower and portal). The
main topic of this cluster is sustainability although its two      Cluster 11: ruine / schloss (ruin and castle). This
most frequent tags do not imply it. Further tags are e.g.          cluster references learning resources that describe or depict
bio edilizia (bio building), solar, and sostenibilidad (sustain-   buildings built in the mittelalter (middle ages) or hochmit-
ability). However, it can be seen in the resource list that        telalter (high middle ages) in German regions like pfealzer
the learning resources that are tagged with torre or portale       wald (Palatinate Forest) and rhein-lahn-kreis (Rhine Lahn
also deal with this topic, e.g. the insulation of towers. Thus,    circle). Consequently, all tags are in German. Overall, they
this cluster exhibits a topical relation to the first one but      reference 51 distinct resources that hold 96 distinct tags.
in contrast, it mainly contains Italian tags. Overall, the         reference 51 distinct resources that hold 96 distinct tags.
cluster references 201 distinct resources and its resource list    Concluding, the clusters mostly contain tags that indeed
comprises 661 distinct tags.                                       belong to the same subject area, though, they are not com-
                                                                   pletely separated. For example, several clusters deal with
Cluster 8: ecological / oekologisch. This cluster also             sustainability. However, their tags are in different languages
deals with sustainability but with a stronger focus on the         and they have different focuses, e.g. the generation vs. the
generation and recovery rather than on the conservation            conservation of energy or public vs. private buildings. Fur-
of energy. Furthermore, it mainly comprises German tags,           thermore, this shows that sustainability is an important field
e.g. photovoltaikanlage (photovoltaic power station), waer-        in architecture. The other clusters reference resources that
merueckgewinnung (heat recovery), and waermepumpe (heat            describe different construction phases (design of buildings,
pump). The cluster references 200 distinct resources that          preparation of building zones, as well as construction and
are assigned with 825 distinct tags.                               cladding of buildings), security issues, and notable buildings
                                                                   as study objects.
Cluster 9: hotel / mercat (hotel and market). This
cluster holds tags that reference resources dealing with (aes-     In numbers, the tags that hold their own circles in the vi-
thetic) buildings in the public space like puente (bridge),        sualisation reference 2,849 distinct learning resources, i.e. a
rascacielos (skyscraper), puerto (harbour), and hotel arts as      third of all tagged learning resources in the MACE data set.
                                                                  can be conducted to investigate if the use of the tag cluster
                                                                  visualisation increases the orientation in the portal or the
                                                                  performance of the students when solving tasks.


                                                                  8.   ACKNOWLEDGMENTS
                                                                  The work presented in this paper has been supported by the
                                                                  Open Discovery Space that is funded by the European
                                                                  Commission’s CIP-ICT Policy Support Program (Project
                                                                  Number: 297229).


                                                                  9.   REFERENCES
                                                                   [1] M. A. Hearst and D. Rosner. Tag clouds: Data
                                                                       analysis tool or social signaller? In Proc. of the 41st
                                                                       Annual Hawaii International Conference on System
Figure 3: Excerpt of the resource list for the tag sandwich            Sciences, HICSS ’08, pages 160–, Washington, DC,
                                                                       USA, 2008. IEEE Computer Society.
                                                                   [2] G. Heyer, U. Quasthoff, and T. Wittig. Text Mining:
While this number seems small at a first glance, it is quite           Wissensrohstoff Text. Konzepte, Algorithmen,
high when considering that only about 3% of the tags hold              Ergebnisse. W3L GmbH, 2006.
their own circles. However, this number can be increased           [3] O. Kaser and D. Lemire. Tag-cloud drawing:
by presenting all resources referenced by a tag that was as-           Algorithms for cloud visualization, 2007.
signed to a cluster in the visualisation. So far, the tags that
                                                                   [4] S. Lohmann, S. Thalmann, A. Harrer, and R. Maier.
do not belong to the clusters’ 20 most frequent ones are
                                                                       Learner-Generated Annotation of Learning Resources
neglected.
                                                                       - Lessons from Experiments on Tagging. In Proc. of
                                                                       the International Conference on Knowledge
Overall, the referenced learning resources are assigned with
                                                                       Management (I-KNOW 2008), pages 304–312, 2008.
6,585 distinct tags (i.e. half of all tags) which are shown in
the resource lists. Considering that about 70% of the tags         [5] A. W. Rivadeneira, D. M. Gruen, M. J. Muller, and
are only used once, this seems to be an acceptable number.             D. R. Millen. Getting our head in the clouds: Toward
Furthermore, it will be increased as well as soon as more              evaluation studies of tagclouds. In Proc. of the
resources are displayed.                                               SIGCHI Conference on Human Factors in Computing
                                                                       Systems, CHI ’07, pages 995–998, New York, NY,
                                                                       USA, 2007. ACM.
7.   CONCLUSION AND FUTURE WORK                                    [6] M. Scheffel, K. Niemann, S. Leon Rojas, H. Drachsler,
In summary, the visualisation of the tag clusters gives a
                                                                       and M. Specht. Spiral me to the core: Getting a visual
broad and easily understandable overview on the learning
                                                                       grasp on text corpora through clusters and keywords.
resources’ subject areas. Furthermore, it enables the users
                                                                       In K. Yacef and H. Drachsler, editors, Proc. of the
to explore the data set by zooming into the clusters and
                                                                       Workshops at the LAK 2014 Conference, volume 1137
browsing the result lists.
                                                                       of CEUR Proc., Indianapolis, Indiana, USA, 2014.
This visualisation, though, is not intended to be a stan-          [7] S. Sen, J. Vig, and J. Riedl. Tagommenders. In Proc.
dalone tool for the exploration of a data set. It is rather            of the 18th international conference on World wide
meant to be an additional tool that can be integrated with             web (WWW ’09), pages 671–680, New York, New
(already available) search functions like a faceted search or          York, USA, 2009. ACM Press.
a social search as offered by the MACE portal. This way,           [8] B. Sigurbjörnsson and R. van Zwol. Flickr tag
the displayed resources could for example be filtered accord-          recommendation based on collective knowledge. In
ing to their language or media type and the tags in the                Proc. of the 17th international conference on World
resources’ descriptions could be used to search for resources          Wide Web - WWW ’08, pages 327–336, New York,
assigned with one or more specific tags. Furthermore, the              New York, USA, 2008. ACM Press.
visualisation offers several possibilities for extensions. For     [9] J. Sinclair and M. Cardew-Hall. The folksonomy tag
example, by clicking on a learning resource in a tag’s or a            cloud: When is it useful? J. Inf. Sci., 34(1):15–29,
cluster’s resource list, all tags that are assigned to this re-        Feb. 2008.
source but are located in other clusters could be highlighted.    [10] M. Stefaner, E. D. Vecchia, M. Condotta, M. Wolpers,
This would further enhance the ability to discover relations           M. Specht, S. Apelt, and E. Duval. MACE - Enriching
between tags and, thus, between subject areas. Another op-             Architectural Learning Objects for Experience
tion would be to allow the users to browse all tags belonging          Multiplication. In E. Duval, R. Klamma, and
to one cluster and not only the most frequent ones.                    M. Wolpers, editors, Proc. of the 2nd European
                                                                       Conference on Technology Enhanced Learning
So far, no evaluation has been conducted. In order to do               (EC-TEL ’07), volume 4753 of LNCS, pages 322–336,
so, the tag cluster visualisation needs to be integrated in            Berlin, Heidelberg, 2007. Springer.
a web portal. Thereafter, the acceptance of this visualisa-       [11] M. Steinbach, G. Karypis, and V. Kumar. A
tion can be evaluated by analysing its usage or by conduct-            comparison of document clustering techniques. In
ing a survey. Furthermore, user studies with control groups            KDD Workshop on Text Mining, 2000.