=Paper=
{{Paper
|id=Vol-1518/paper3
|storemode=property
|title=Getting a Grasp on Tag Collections by Visualising Tag Clusters Based on Higher-order Co-occurrences
|pdfUrl=https://ceur-ws.org/Vol-1518/paper3.pdf
|volume=Vol-1518
|dblpUrl=https://dblp.org/rec/conf/lak/NiemannSRWDS15
}}
==Getting a Grasp on Tag Collections by Visualising Tag Clusters Based on Higher-order Co-occurrences==
Katja Niemann, Sarah León Rojas, Martin Wolpers
Fraunhofer FIT, Schloss Birlinghoven, 53754 Sankt Augustin, Germany
{katja.niemann, sarah.leon.rojas, martin.wolpers}@fit.fraunhofer.de

Maren Scheffel, Hendrik Drachsler, Marcus Specht
Open University of the Netherlands, Valkenburgerweg 177, 6419 AT Heerlen, The Netherlands
{maren.scheffel, hendrik.drachsler, marcus.specht}@ou.nl

===ABSTRACT===
Tagging learning resources in repositories or web portals offers a way to meaningfully describe these resources. The more tags there are, however, the more difficult it is to find one’s way around the repository, especially when they are user-generated free-text tags. This paper therefore presents a visualisation of tag clusters based on higher-order co-occurrences that gives users of such repositories a plain and simple way of exploring them in an intuitive manner.

===Categories and Subject Descriptors===
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Clustering, Information filtering, Search process, Selection process; I.2.7 [Artificial Intelligence]: Natural Language Processing; J.1 [Administrative Data Processing]: Education; K.3.1 [Computers and Education]: Computer Uses in Education

===General Terms===
Algorithms, Experimentation, Language

===Keywords===
clustering, higher-order co-occurrences, tags, technology enhanced learning, visualisation

===1. INTRODUCTION===
Many educational web portals allow users to manually enrich the offered learning resources with social metadata like comments and free-text tags. It has been shown that tags in particular provide powerful knowledge that can be used to improve the quality of searching and recommendations [4, 7]. Similar to automatically extracted keywords, tags thus offer a way to get a quick grasp on the content or theme of multimedia objects. Especially when dealing with multimedia objects that provide little or no textual context (e.g. photos or videos), tags provide meaningful descriptors of these objects [8].

A common problem when relying on tags, however, is that they are often user-generated and not restricted to a closed vocabulary. Different users can tag the same learning resource with different tags, leading to a large collection of rarely used but highly related tags. The use of singular or plural versions of the same word, the same word in different languages, or different words with the same meaning, i.e. synonyms, can also lead to problems when relying on tags in order to get an overview of a collection of learning resources. In order to detect unknown relations between tags, they therefore need to be contextualised.

Based on an approach of visualising large document collections according to the documents’ keywords [6], we suggest using a visualisation of tag relations that allows users to quickly get a grasp of the resources offered by a learning portal and to dig deeper to get an understanding of certain subject areas. Instead of clustering the learning objects according to their content, however, we cluster the tags according to their higher-order co-occurrences and then present them in a clearly arranged and intuitive manner. The creation of higher-order co-occurrences is a well-known approach in corpus linguistics to discover semantic relations between words based on their usage in text documents [2]. We adapt this approach by analysing the assignments of tags to learning resources instead of the occurrences of terms in sentences or text documents.

The paper is structured as follows. Section 2 gives a short overview of related work. Section 3 describes the approach of higher-order co-occurrence clustering to group tags with similar meanings, followed by the description of the MACE data set used in this paper in section 4. Thereafter, section 5 describes the visualisation of the tag clusters and section 6 discusses the results. Finally, section 7 holds a conclusion and an outlook on future work.

===2. RELATED WORK===
According to Rivadeneira et al. [5], a meaningful visualisation of tags supports four main functions: (1) search, i.e. tags can be directly included in the search process and, thus, enhance the findability of items, (2) browsing, i.e. the visualisation offers a central entry point for users that know what they are looking for but not what exactly to search for, (3) impression formation / gisting, i.e. the visualisation allows users to get a quick grasp on the items’ subject areas, and (4) recognition, i.e. the users are offered the possibility to understand different aspects of certain information.

The most common approach to visualise a large number of tags is the creation of tag clouds. Here, the relative size of each tag stands in relation to its frequency in the tag collection. Nowadays, many tools are available that allow an easy integration of personalised tag clouds in web sites, e.g. TagCrowd (http://tagcrowd.com/) and Wordle (http://wordle.net/). While there is a huge potential inherent in tag clouds, they also suffer from some issues, e.g. the missing semantic relations between the visualised tags [1, 9]. In order to deal with this, tag clouds have been created that analyse (first-order) co-occurrences between the tags and group tags that often co-occur [11]. Here, similar tags do not necessarily refer to the same semantic concept but are linked by the resources they have in common [3]. Another problem of tag clouds is that frequent tags often dominate the whole tag cloud, so that less frequent tags and their concepts get lost [1].

This paper presents a clustering approach for tags that is based on higher-order co-occurrences, i.e. a corpus linguistic technique to find semantically related terms [2]. This way we aim to discover and visually cover all subject areas even though it might not be possible to display all single tags.

===3. HIGHER-ORDER CO-OCCURRENCE CLUSTERING OF TAGS===
The creation of higher-order co-occurrences is a corpus-linguistic approach to exploit the usage context of linguistic entities in order to find semantic relations. Two linguistic entities are defined to be co-occurrences if they occur in at least one common usage context, e.g. in a sentence. For example, the word dog often co-occurs with the words bark, growl, and sniff among others.

In order to calculate the significance of a co-occurrence, statistical association measures are used. Thereafter, the most significant co-occurrences must be selected for each term. Since there is no standard scale of measurement to draw a clear distinction between significant and non-significant co-occurrences, there are two ways to do so, i.e. by selecting only the n most significant co-occurrences for each resource or by using a threshold.

The significant co-occurrences of an entity form its first-order co-occurrence class, and entities which co-occur in first-order co-occurrence classes are second-order co-occurrences. These second-order co-occurrence classes can in turn be used as input to calculate third-order co-occurrences and so forth. When this procedure is repeated several times, the higher-order co-occurrence classes tend to get stable, i.e. their elements do not change any more. This indicates that there exist universal relations between the entities in the remaining classes that induce their aggregation again in each iteration step. In fact, these stable higher-order co-occurrence classes have been shown to usually hold semantically related entities.

Heyer et al. [2] show this for the co-occurrences of IBM, among other words. Their investigations are based on text corpora collected for the portal wortschatz.uni-leipzig.de, the German treasury of words. The first co-occurrence class is rather heterogeneous and contains words like computer manufacturer, stock exchange, global and so on. After some iterations of computing higher-order co-occurrence classes, however, the classes become more homogeneous and stable. The tenth-order co-occurrence class only contains names of other computer-related companies like Microsoft, Sony etc.

In the given scenario we do not have sentences in which the tags occur. However, the tags are assigned to learning resources, which can be considered to represent usage contexts. Thus, two tags are co-occurrences if they are assigned to at least one common learning resource. In order to calculate the significance of the co-occurrence of two tags, the association measure Mutual Information (MI) is used, which compares the observed frequency O of a co-occurrence with its expected frequency E, see formula 1.

<math>MI = \log_2 \frac{O}{E}</math> (1)

Here, selecting the n most significant co-occurrences for each tag would imply a pre-defined cluster size, which is not desirable; thus, a threshold is used. Because the calculated significance scores for resource pairs are only comparable if they have one resource in common, a resource-specific threshold is used to distinguish between relevant and non-relevant co-occurrences. This threshold is calculated for each learning resource by averaging the significance values of all its co-occurrences and multiplying the result with a regulation constant α, which has a value of 0.95 in the presented experiment.

===4. THE MACE DATA SET===
The MACE (Metadata for Architectural Contents in Europe, http://mace-project.eu/) project relates digital learning resources about architecture with each other across repository boundaries to enable a simplified discovery and access [10]. Users are able to search for learning resources and filter the results, e.g. according to their language, the original repository, and the classification terms they hold. Furthermore, the portal offers a social search based on tags, a location search based on the geographical coordinates of buildings represented through learning resources, and a competence search based on the competencies the learning resources aim to impart. Registered and logged-in users are able to rate, tag, and comment on learning resources. Additionally, they can follow the metadata provision activities of other users.

The MACE data set holds 117,907 events on 12,442 learning resources conducted by 630 registered users. 70.8% of the learning resources hold tags, and each tagged learning resource holds 6.59 tags on average. Overall, the users assigned 13,291 distinct tags, of which 73% are only used once and only about 4% are added to more than 10 learning resources.

===5. VISUALISATION===
When creating a visualisation of the tag clusters for the MACE data set, we decided not to present tags that are assigned to only one or two learning resources. Only clusters that hold more than five tags are selected for presentation. Finally, the two most frequent tags are selected as title for each cluster. If a cluster’s most frequent tags significantly overlap, the less frequent one is neglected and the next frequent tag is selected.

After this data processing, the tag clusters and all attached information are written to a JSON file. The visualisation is realised using the Data-Driven Documents D3.js framework (http://d3js.org/), i.e. a JavaScript library, paired with HTML, CSS and JQuery to process the previously created JSON files.

Figure 1 shows the default starting view of the visualisation, which is available at http://mitarbeiter.fit.fraunhofer.de/~niemann/VisLA/. The tag clusters are represented by circles and are ordered according to their size in the form of a spiral, with the largest cluster having the largest circle and being positioned at the outside of the spiral and the smallest cluster being in the middle of the spiral. Here, the size of a tag cluster depends on the number of learning objects that are referenced by the tags belonging to it. In addition to size and position, every cluster has its own color and is labelled with its two most frequent tags to enable the users to quickly get a grasp on the clusters’ content.

Figure 1: Start view

By clicking on a cluster, the view changes and the visualisation zooms into the chosen cluster, for which up to 20 tags become visible. We chose this number to not overload the visualisation. In order to continue the circle approach used for the clusters, we adapted the common usage of font size, coloring and word positioning in tag clouds and used sized and spirally ordered circles for the tags as well. On the right side of the visualisation, a list of all the learning resources that are associated with that cluster is given, showing each resource’s title, media type, and language in addition to the list of all tags assigned to it. All resource titles link to the original resource.

Clicking on a tag circle results in a new list next to the visualisation in which all resources assigned with that tag are given. By clicking on a specific tag, its circle is highlighted and the object list only displays those resources that are assigned with the highlighted tag, see figure 2.

Figure 2: Zoomed cluster view with a selected tag

===6. DISCUSSION===
This section provides an insight into the eleven clusters shown in the visualisation, discusses the topics they cover including their relations, and points out further distinctive features. The following cluster descriptions are ordered by the size of the clusters, i.e. from the outside of the spiral to its center. Whenever needed, the tags’ English translations are given in brackets. If two tags hold the same English translation, it is only given once.

Cluster 1: cubierta / aislante (cover and insulation). This cluster’s tags, which are mainly in Spanish, name methods, objects, and materials used for insulation, e.g. cobertes (covered), paneles (panels), as well as poliestireno (polystyrene), and reference 683 distinct resources. The resources’ descriptions hold further tags that can be used for orientation in this field. For example, figure 3 shows an excerpt from the list of resources that are assigned with the tag sandwich. While this tag might be unexpected at first glance, the tags it was used with clarify its meaning, i.e. a (panel) structure made of three layers. Overall, 2,190 distinct tags are given in the resource list of this cluster.

Figure 3: Excerpt of the resource list for the tag sandwich

Cluster 2: fachada / facana (facade). This cluster mainly holds Spanish and Catalan tags that deal with the construction and cladding of buildings, e.g. sistemas constructivos (building systems), cerramientos (enclosure), gres (stoneware), and constructivos (building). Overall, this cluster’s tags reference 661 distinct resources that are assigned with 2,080 distinct tags.

Cluster 3: seguridad obra / seguridad construccion (work and construction safety). This cluster holds a mix of Spanish and English tags that deal with security, e.g. seguridad trabajador (worker safety), construction security, sistemas de seguridad (security systems), and normativa (regulations). Overall, it references 532 distinct resources that hold 634 distinct tags.

Cluster 4: architects / design. The first cluster that mainly holds English tags and few Spanish ones deals with (green) architecture in the public space, e.g. architecture, museum, green architecture, architettura (architecture), piazza, and bioarchitettura. It references 296 distinct resources that hold 962 distinct tags.

Cluster 5: movimiento tierras / tierras (land movement). This cluster comprises Spanish tags that deal with the preparation of building zones, e.g. excavaciones (diggings), maquinaria (machinery), calculo (calculation), and excavadora (excavator). It references 278 distinct resources that hold 1,201 distinct tags.

Cluster 6: cimentaciones / fonaments (foundation). This cluster mainly holds Spanish and Catalan tags that deal with the construction and anchoring of buildings, e.g. muro (wall), building, terreno (ground), zapatas (shoes), and anclajes (anchors). Overall, this cluster’s tags reference 210 distinct learning resources that are assigned with 831 distinct tags.

Cluster 7: torre / portale (tower and portal). The main topic of this cluster is sustainability, although its two most frequent tags do not imply it. Further tags are e.g. bio edilizia (bio building), solar, and sostenibilidad (sustainability). However, it can be seen in the resource list that the learning resources that are tagged with torre or portale also deal with this topic, e.g. the insulation of towers. Thus, this cluster exhibits a topical relation to the first one but, in contrast, it mainly contains Italian tags. Overall, the cluster references 201 distinct resources and its resource list comprises 661 distinct tags.

Cluster 8: ecological / oekologisch. This cluster also deals with sustainability but with a stronger focus on the generation and recovery rather than on the conservation of energy. Furthermore, it mainly comprises German tags, e.g. photovoltaikanlage (photovoltaic power station), waermerueckgewinnung (heat recovery), and waermepumpe (heat pump). The cluster references 200 distinct resources that are assigned with 825 distinct tags.

Cluster 9: hotel / mercat (hotel and market). This cluster holds tags that reference resources dealing with (aesthetic) buildings in public space like puente (bridge), rascacielos (skyscraper), puerto (harbour), and hotel arts, as well as famous architects of those buildings, e.g. Santiago Calatrava Valls and Norman Robert Foster. Overall, the cluster references 86 distinct learning resources that hold 107 distinct tags.

Cluster 10: software / 3d. The only cluster that contains fewer than 20 tags deals with the design of buildings using the computer and comprises tags like cad (computer-aided design), rhino3d (CAD Software), tutorial, and programming. The cluster references 72 distinct resources that are assigned with 274 distinct tags.

Cluster 11: ruine / schloss (ruin and castle). This cluster references learning resources that describe or depict buildings built in the mittelalter (middle ages) or hochmittelalter (high middle ages) in German regions like pfealzer wald (Palatinate Forest) and rhein-lahn-kreis (Rhine Lahn circle). Consequently, all tags are in German. Overall, they reference 51 distinct resources that hold 96 distinct tags.

In conclusion, the clusters mostly contain tags that indeed belong to the same subject area, though they are not completely separated. For example, several clusters deal with sustainability. However, their tags are in different languages and they have different focuses, e.g. the generation vs. the conservation of energy or public vs. private buildings. Furthermore, this shows that sustainability is an important field in architecture. The other clusters reference resources that describe different construction phases (design of buildings, preparation of building zones, as well as construction and cladding of buildings), security issues, and notable buildings as study objects.

In numbers, the tags that hold their own circles in the visualisation reference 2,849 distinct learning resources, i.e. a third of all tagged learning resources in the MACE data set. While this number seems small at first glance, it is quite high when considering that only about 3% of the tags hold their own circles. However, this number can be increased by presenting all resources referenced by a tag that was assigned to a cluster in the visualisation. So far, the tags that do not belong to the clusters’ 20 most frequent ones are neglected.

Overall, the referenced learning resources are assigned with 6,585 distinct tags (i.e. half of all tags), which are shown in the resource lists. Considering that 73% of the tags are only used once, this seems to be an acceptable number. Furthermore, it will increase as well as soon as more resources are displayed.

===7. CONCLUSION AND FUTURE WORK===
In summary, the visualisation of the tag clusters gives a broad and easily understandable overview of the learning resources’ subject areas. Furthermore, it enables the users to explore the data set by zooming into the clusters and browsing the result lists.

This visualisation, though, is not intended to be a standalone tool for the exploration of a data set. It is rather meant to be an additional tool that can be integrated with (already available) search functions like a faceted search or a social search as offered by the MACE portal. This way, the displayed resources could for example be filtered according to their language or media type, and the tags in the resources’ descriptions could be used to search for resources assigned with one or more specific tags. Furthermore, the visualisation offers several possibilities for extensions. For example, by clicking on a learning resource in a tag’s or a cluster’s resource list, all tags that are assigned to this resource but are located in other clusters could be highlighted. This would further enhance the ability to discover relations between tags and, thus, between subject areas. Another option would be to allow the users to browse all tags belonging to one cluster and not only the most frequent ones.

So far, no evaluation has been conducted. In order to do so, the tag cluster visualisation needs to be integrated in a web portal. Thereafter, the acceptance of this visualisation can be evaluated by analysing its usage or by conducting a survey. Furthermore, user studies with control groups can be conducted to investigate if the use of the tag cluster visualisation increases the orientation in the portal or the performance of the students when solving tasks.

===8. ACKNOWLEDGMENTS===
The work presented in this paper has been supported by the Open Discovery Space that is funded by the European Commission’s CIP-ICT Policy Support Program (Project Number: 297229).

===9. REFERENCES===
[1] M. A. Hearst and D. Rosner. Tag clouds: Data analysis tool or social signaller? In Proc. of the 41st Annual Hawaii International Conference on System Sciences, HICSS ’08, pages 160–, Washington, DC, USA, 2008. IEEE Computer Society.

[2] G. Heyer, U. Quasthoff, and T. Wittig. Text Mining: Wissensrohstoff Text. Konzepte, Algorithmen, Ergebnisse. W3L GmbH, 2006.

[3] O. Kaser and D. Lemire. Tag-cloud drawing: Algorithms for cloud visualization, 2007.

[4] S. Lohmann, S. Thalmann, A. Harrer, and R. Maier. Learner-Generated Annotation of Learning Resources - Lessons from Experiments on Tagging. In Proc. of the International Conference on Knowledge Management (I-KNOW 2008), pages 304–312, 2008.

[5] A. W. Rivadeneira, D. M. Gruen, M. J. Muller, and D. R. Millen. Getting our head in the clouds: Toward evaluation studies of tagclouds. In Proc. of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’07, pages 995–998, New York, NY, USA, 2007. ACM.

[6] M. Scheffel, K. Niemann, S. Leon Rojas, H. Drachsler, and M. Specht. Spiral me to the core: Getting a visual grasp on text corpora through clusters and keywords. In K. Yacef and H. Drachsler, editors, Proc. of the Workshops at the LAK 2014 Conference, volume 1137 of CEUR Proc., Indianapolis, Indiana, USA, 2014.

[7] S. Sen, J. Vig, and J. Riedl. Tagommenders. In Proc. of the 18th international conference on World wide web (WWW ’09), pages 671–680, New York, New York, USA, 2009. ACM Press.

[8] B. Sigurbjörnsson and R. van Zwol. Flickr tag recommendation based on collective knowledge. In Proc. of the 17th international conference on World Wide Web - WWW ’08, pages 327–336, New York, New York, USA, 2008. ACM Press.

[9] J. Sinclair and M. Cardew-Hall. The folksonomy tag cloud: When is it useful? J. Inf. Sci., 34(1):15–29, Feb. 2008.

[10] M. Stefaner, E. D. Vecchia, M. Condotta, M. Wolpers, M. Specht, S. Apelt, and E. Duval. MACE - Enriching Architectural Learning Objects for Experience Multiplication. In E. Duval, R. Klamma, and M. Wolpers, editors, Proc. of the 2nd European Conference on Technology Enhanced Learning (EC-TEL ’07), volume 4753 of LNCS, pages 322–336, Berlin, Heidelberg, 2007. Springer.

[11] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.
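
As an illustration of the procedure described in section 3 (higher-order co-occurrence clustering), the following is a minimal Python sketch, not the implementation used for the MACE experiment. It rests on several assumptions that the text does not specify: the expected frequency E is taken to be the independence estimate f(a)·f(b)/N over N usage contexts; the threshold (mean significance times α) is computed per tag; every non-empty co-occurrence class serves as one usage context for the next order; and iteration stops once the classes no longer change or a hypothetical `max_order` is reached. All function names and the toy data are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(observed, expected):
    """Formula (1): MI = log2(O / E)."""
    return math.log2(observed / expected)


def significant_classes(contexts, alpha=0.95):
    """Map each entity to the set of its significant co-occurrences.

    `contexts` is an iterable of sets of entities, e.g. the tags assigned
    to one learning resource. Hypothetical helper, not from the paper.
    """
    contexts = [set(c) for c in contexts]
    n = len(contexts)
    freq = Counter(e for c in contexts for e in c)
    pair_freq = Counter(
        pair for c in contexts for pair in combinations(sorted(c), 2)
    )

    scores = {entity: {} for entity in freq}
    for (a, b), observed in pair_freq.items():
        # Assumption: expected frequency under statistical independence.
        expected = freq[a] * freq[b] / n
        mi = mutual_information(observed, expected)
        scores[a][b] = mi
        scores[b][a] = mi

    classes = {}
    for entity, neighbours in scores.items():
        if not neighbours:
            classes[entity] = frozenset()
            continue
        # Entity-specific threshold: alpha times the mean significance.
        threshold = alpha * sum(neighbours.values()) / len(neighbours)
        # Assumption: scores equal to the threshold count as relevant.
        classes[entity] = frozenset(
            other for other, mi in neighbours.items() if mi >= threshold
        )
    return classes


def higher_order_classes(resource_tags, alpha=0.95, max_order=10):
    """Iterate co-occurrence classes until stable or max_order is reached."""
    classes = significant_classes(resource_tags.values(), alpha)
    for _ in range(max_order - 1):
        # Assumption: each non-empty class is one context of the next order.
        contexts = [cls for cls in classes.values() if cls]
        found = significant_classes(contexts, alpha)
        next_classes = {tag: found.get(tag, frozenset()) for tag in classes}
        if next_classes == classes:  # stable: the elements no longer change
            break
        classes = next_classes
    return classes


# Toy data, invented for illustration: two tag groups that never mix.
toy_resources = {
    "r1": {"cubierta", "aislante", "paneles"},
    "r2": {"cubierta", "aislante", "paneles"},
    "r3": {"ruine", "schloss", "mittelalter"},
    "r4": {"ruine", "schloss", "mittelalter"},
}
stable = higher_order_classes(toy_resources)
print(sorted(stable["cubierta"]))  # ['aislante', 'paneles']
```

In this toy case the classes are already stable after the second order, so the loop ends early. On real tag data the number of iterations, the handling of ties, and the way clusters are finally read off the stable classes would all need to be checked against the original method.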