=Paper=
{{Paper
|id=Vol-1384/paper4
|storemode=property
|title=Using Noun Phrases Extraction for the Improvement of Hybrid Clustering with Text- and Citation-based Components. The Example of "Information System Research"
|pdfUrl=https://ceur-ws.org/Vol-1384/paper4.pdf
|volume=Vol-1384
|dblpUrl=https://dblp.org/rec/conf/issi/ThijsGM15
}}
==Using Noun Phrases Extraction for the Improvement of Hybrid Clustering with Text- and Citation-based Components. The Example of "Information System Research"==
Using noun phrases extraction for the improvement of hybrid clustering with text- and citation-based components. The example of “Information System Research”

Bart Thijs (1), Wolfgang Glänzel (2), and Martin Meyer (3)

(1) bart.thijs@kuleuven.be, KU Leuven, ECOOM and Dept. MSI, Leuven (Belgium)
(2) wolfgang.glanzel@kuleuven.be, KU Leuven, ECOOM and Dept. MSI, Leuven (Belgium); Library of the Hungarian Academy of Sciences, Dept. Science Policy & Scientometrics, Budapest (Hungary)
(3) m.s.meyer@kent.ac.uk, Kent Business School, University of Kent, Canterbury (UK); KU Leuven, ECOOM and Dept. MSI, Leuven (Belgium); SC-Research, University of Vaasa, Lapua (Finland)

Abstract

The hybrid clustering approach combining lexical and link-based similarities has long suffered from the different properties of the underlying networks. We propose a method based on noun phrase extraction using natural language processing to improve the measurement of the lexical component. Term shingles of different length are created from each of the extracted noun phrases. Hybrid networks are built on a weighted combination of the two types of similarities with seven different weights. We conclude that removing all single-term shingles provides the best results in terms of computational feasibility, comparability with bibliographic coupling, and performance in a community detection application.

Workshop Topic

Text enhanced bibliographic coupling

Introduction

For a long time scientometricians have been combining textual analysis with citation-based links for many different applications. In 1991, Braam et al. (1991a; 1991b) suggested the use of co-citation in combination with word profiles, i.e., indexing terms and classification codes, for the mapping of science. In the same year, Callon et al. (1991) demonstrated how co-word analysis can be used for studying academic and technological research. Glenisson et al. (2005) encountered the disadvantage of the single-term approach and used the Dunning likelihood ratio test (Dunning, 1993; Manning & Schütze, 2000) to identify common bigrams. For this test the occurrence of each pair has to be counted together with the frequency of each term appearing separately; the bigrams with the highest score are retained. The risk of this procedure is that pairs that are less frequent, or that appear in a few variations, are not selected. Moreover, the selection of a bigram in a paper might change when additional documents are added to the dataset. It is clear that the introduction of full-text analysis increased processing complexity.

Janssens et al. (2005) introduced a truly integrated approach that combines the distance based on bibliometric features with a text-based distance using a linear combination of distances (or similarities), where a parameter can be used to fine-tune the weight of the two components. Later, Janssens et al. (2008) warned against combinations based on simple vector concatenation and linear combinations of similarity measures because of the completely different structures of the underlying vector spaces, and they proposed a combination based on Fisher’s inverse chi-square. They also showed that this method outperforms hitherto applied methods. It solves the issue of different distributions drastically, but it introduces an even more complex calculation scheme.
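As an aside on the bigram selection described above (Glenisson et al., 2005), the sketch below shows the usual 2×2 contingency-table form of Dunning’s log-likelihood ratio. It is plain Python with illustrative counts and function names of our own choosing, not code taken from any of the cited studies.

```python
import math

def entropy_term(counts):
    # Unnormalised entropy term: sum_i x_i * ln(x_i / N), skipping zero cells.
    total = sum(counts)
    return sum(x * math.log(x / total) for x in counts if x > 0)

def dunning_llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a candidate bigram (w1, w2).

    k11: occurrences of the bigram 'w1 w2'
    k12: occurrences of w1 not followed by w2
    k21: occurrences of w2 not preceded by w1
    k22: all remaining bigram positions in the corpus
    """
    table = [k11, k12, k21, k22]
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]
    return 2.0 * (entropy_term(table) - entropy_term(rows) - entropy_term(cols))

# Toy example: a bigram seen 20 times, w1 seen 40 times, w2 seen 50 times,
# in a corpus of 10,000 bigram positions.
print(dunning_llr(20, 20, 30, 9930))
```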
Glänzel & Thijs (2012) took a more pragmatic approach: exploiting the fact that both similarities can be expressed as cosines in a vector space model, they introduced a hybrid similarity defined as the cosine of the weighted linear combination of the underlying angles of the two cosine similarities. None of the solutions proposed in the literature has so far been able to eliminate, or at least considerably reduce, the effect of the different distributions of the two components without excessive computational requirements.

In this paper we introduce the use of noun phrases extracted by means of Natural Language Processing (NLP), and we investigate the different options available in syntactic parsing and the effects of these choices on the lexical similarities and on the properties of the networks based on these similarities. The rationale is that, since we use text mining to map documents in order to identify clusters of fields or emerging topics, we have to limit the textual information to those elements in the text (or, more formally, those parts of speech) that actually carry the topics. Nouns and noun phrases are used as subjects, objects, predicative expressions or prepositional complements in sentences. Syntactic parsing, as one of the applications within NLP, will be used to extract the noun phrases from the abstracts; other categories, such as verb, adjective or adpositional phrases, will be neglected. However, a selected noun phrase might contain an embedded phrase of these other types or even other embedded noun phrases. We illustrate the new approach using the example of a document set on Information System Research.

Data source and data processing

A set of 6,144 publications on ‘Information Systems’ is used in this study. This data set was retrieved from the Social Sciences Citation Index using a custom-developed search strategy focusing on ‘Management Information System’, ‘Geographical Information System’, ‘Decision Support System’ or ‘Transaction Processing System’ (Meyer et al., 2013). Publications from 1991 up to 2012 with document type Article, Letter, Note or Review were selected.

For the lexical component, the title and the abstract of each paper are processed by both Lucene (version 4.0; see http://lucene.apache.org, visited in January 2015) and the Stanford Parser. Terms used in the older single-term approach were retrieved by the following pre-processing steps: titles and abstracts are merged and converted to lower case; the text is tokenized on punctuation and white space; stop words are removed using a custom-built stop word list; and the remaining terms are stemmed with the Snowball stemmer available in Lucene, an extended version of the original Porter stemmer (Porter, 1980). All terms that occur in only one document are removed. A term-by-document matrix is constructed in a vector space model with term frequency-inverse document frequency (TF-IDF) weighting. Salton’s cosine is used as the measure of document similarity (Salton & McGill, 1986).

For the extraction of noun phrases we rely on the Stanford Parser, a Java package developed and distributed by the Stanford Natural Language Processing Group. In short, this parser returns the grammatical structure of sentences based on probabilistic language models. In this study we use the PCFG parser, version 2.0.5 (Klein & Manning, 2003). The output of the parser is given as Stanford Dependencies, which describe the grammatical relations between words in a sentence (de Marneffe & Manning, 2008a; 2008b).
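To make the parsing step concrete, here is a minimal sketch, not the authors’ code, of how noun phrases can be read off a bracketed constituency parse of the kind a PCFG parser produces, either complete or restricted to those without an embedded NP (the two selection options discussed below). The example sentence and its tree are hand-made, and NLTK is used only as a convenient reader for the bracketing.

```python
# Assumes: pip install nltk.  The bracketed tree is a hand-made example of
# PCFG-parser-style output; it is not actual Stanford Parser output.
from nltk import Tree

parse = Tree.fromstring(
    "(ROOT (S (NP (NP (JJ strategic) (NN information) (NN system)) (NN planning)) "
    "(VP (VBZ improves) (NP (NN decision) (NN support)))))"
)

def noun_phrases(tree, restricted=False):
    """Collect NP word sequences from a constituency tree.

    restricted=True keeps only noun phrases in which no other noun phrase
    is embedded (one of the two selection options considered below).
    """
    phrases = []
    for node in tree.subtrees(lambda t: t.label() == "NP"):
        has_embedded_np = any(sub is not node and sub.label() == "NP"
                              for sub in node.subtrees())
        if restricted and has_embedded_np:
            continue
        phrases.append(node.leaves())
    return phrases

print(noun_phrases(parse))                   # complete noun phrases
print(noun_phrases(parse, restricted=True))  # restricted noun phrases only
```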
In the output, nouns are tagged NN or NNS (for plurals) and noun phrases NP. For the selection of the noun phrases from the parsing result we can choose between several options. Either complete noun phrases (NP) are selected, or only restricted noun phrases in which no other noun phrase is embedded. Noun phrases can be recorded with the constituent words in the given order, or the included terms can be sorted alphabetically. It is the objective of this paper to study the consequences of these options.

After selecting the type of noun phrase and optionally sorting the terms, additional processing steps are taken. As in the single-term approach, the Snowball stemmer is applied and stop words are removed. The stemmed terms within a single phrase are then used to create term shingles. A term shingle is a sequence of consecutive terms; its length can vary between one and the number of terms in the phrase, which is the maximum. With respect to the length of the selected shingles we identified five different possibilities, with different criteria on the number of terms in the phrase and on the length of the shingle. Table 1 lists the five applicable criteria on the length of the shingle.

Table 1. All possible shingles in a phrase with three terms
Tag – Criteria
(none) – all possible shingles are included
lm – the shingle length equals the longest possible shingle (length = maximum); only the full noun phrase is used in the analysis
lm_l1 – shingles with length one or shingles with a length equal to the maximum
l>1 – any shingle with length greater than one
m1_l>1 – any shingle with length greater than one, or any single-term noun phrase

The combination of these five selection criteria with the two options for the type of noun phrase and the optional sorting creates twenty different scenarios for the creation of a phrase-by-document matrix. This matrix contains only phrases or shingles that occur in more than one document, and the weighting is a slightly modified TF-IDF in which the term frequency equals the number of sentences in which the phrase or shingle appears. Salton’s cosine is calculated to express the similarity between documents. As a result, for each document pair we have up to twenty different similarities based on the different scenarios of this NLP approach.

For the citation component we calculate the cosine similarity based on bibliographic coupling (BC), that is, the number of references shared by a document pair with respect to all of their references indexed in the Web of Science databases. The two components, lexical and bibliographic coupling, are combined by calculating a hybrid similarity as the cosine of the weighted linear combination of the underlying angles of the two cosine similarities. This method was introduced and described by Glänzel & Thijs (2012). A free parameter (λ) defines the convex combination and thus the weight of the two components. For document pairs where one of the components is not defined, π/2 is used as the underlying angle of that component; document pairs with two undefined components are discarded. In this paper we use an NLP-based lexical component with seven values of the parameter λ (0.125, 0.25, 0.33, 0.5, 0.66, 0.75 and 0.875). Clustering of the data is done with the Pajek ‘Single Refinement’ implementation (Batagelj & Mrvar, 2003) of the Louvain method for community detection (Blondel et al., 2008).
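A compact sketch of the two ingredients just described, shingle creation and the angle-based hybrid combination, is given below. It is plain Python with illustrative names, not the authors’ implementation; the direction of the weighting follows the labelling of Table 2, where λ = 0 is the pure NLP component and λ = 1 pure bibliographic coupling.

```python
import math

def term_shingles(phrase_terms, min_len=2):
    """All contiguous term shingles of a (stemmed) noun phrase.

    min_len=2 corresponds to the 'l>1' criterion of Table 1;
    min_len=1 would return every possible shingle.
    """
    n = len(phrase_terms)
    return {" ".join(phrase_terms[i:i + k])
            for k in range(min_len, n + 1)
            for i in range(n - k + 1)}

def hybrid_similarity(sim_text, sim_bc, lam):
    """Cosine of the convex combination of the underlying angles.

    sim_text and sim_bc are the cosine similarities of the lexical and the
    bibliographic-coupling components; an undefined component (None) gets
    the angle pi/2.  With this labelling lam = 0 returns the pure lexical
    similarity and lam = 1 the pure BC similarity (cf. Table 2).
    """
    angle_text = math.pi / 2 if sim_text is None else math.acos(min(1.0, max(-1.0, sim_text)))
    angle_bc = math.pi / 2 if sim_bc is None else math.acos(min(1.0, max(-1.0, sim_bc)))
    return math.cos(lam * angle_bc + (1.0 - lam) * angle_text)

# Illustrative use
print(term_shingles(["decis", "support", "system"]))
# {'decis support', 'support system', 'decis support system'}
print(hybrid_similarity(0.30, 0.10, 0.5))
print(hybrid_similarity(0.30, None, 0.5))   # undefined BC component
```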
Prior to this clustering all singletons are removed from the network. The resolution parameter is set to 1.0, and five random restarts are requested.

Results

In this section we briefly discuss the twenty networks resulting from the different options and compare seven hybrid combinations of the bibliographic coupling component with the selected NLP component according to the above parameters. The density of the networks and the outcomes of the clustering algorithm are hardly influenced by the choice of noun-phrase type or by the ordering of the terms in the phrases; we only found that restricted noun phrases resulted in much smaller data files. However, the creation of shingles from the noun phrases had a large influence on the results. Scenarios that still allow single-term phrases neither reduced the density nor changed the distribution of edge weights. Consequently, the best result was obtained when restricting the lexical component to shingles with a length greater than one.

For the second analysis we use hybrid combinations. In Table 2 the results for the two trivial combinations, i.e., λ = 0 and λ = 1, are included for reference. After the hybrid combination, 25 documents remained singletons in the network and were removed. We recall that the appropriate choice of the weight parameter used to be crucial for the quality of the clustering result, with a possible distortion of the results when too much weight is put on the single-term lexical approach (Janssens et al., 2008). However, Table 2 clearly shows that the distribution of the weighted degree is not distorted by any particular choice of the parameter. Also, for each of the chosen values a modularity above 0.3 is obtained.

Table 2. Results of hybrid clustering with different weight parameters
(Average, Median and Max refer to the weighted degree; NC and Mod. are the number of clusters and the modularity obtained by the community detection.)

Weight | Average | Median | Max | NC | Mod. | <10
NLP (λ = 0) | 16.64 | 14.86 | 118.50 | 12 | 0.338 | 2
0.125 | 16.26 | 14.88 | 104.66 | 12 | 0.322 | 2
0.25 | 15.90 | 14.82 | 90.66 | 11 | 0.312 | 2
0.33 | 15.68 | 14.58 | 81.62 | 11 | 0.308 | 2
0.5 | 15.19 | 13.64 | 62.22 | 10 | 0.310 | 3
0.67 | 14.73 | 12.40 | 69.24 | 10 | 0.317 | 3
0.75 | 14.47 | 11.62 | 75.95 | 10 | 0.323 | 4
0.875 | 14.11 | 10.48 | 85.27 | 10 | 0.333 | 3
BC (λ = 1) | 14.62 | 10.71 | 94.59 | 16 | 0.350 | 8

Table 3. Cramér’s V measures of association between the clustering schemes

Weight | NLP | 0.125 | 0.25 | 0.33 | 0.5 | 0.66 | 0.75 | 0.875
0.125 | 0.85 |
0.25 | 0.79 | 0.86 |
0.33 | 0.76 | 0.80 | 0.89 |
0.5 | 0.66 | 0.71 | 0.74 | 0.74 |
0.66 | 0.63 | 0.65 | 0.68 | 0.71 | 0.90 |
0.75 | 0.62 | 0.64 | 0.66 | 0.69 | 0.87 | 0.93 |
0.875 | 0.59 | 0.61 | 0.63 | 0.66 | 0.84 | 0.93 | 0.91 |
BC | 0.30 | 0.33 | 0.40 | 0.44 | 0.65 | 0.70 | 0.77 | 0.75

Looking at the number of clusters, it evolves from 12 in the lexical component, to 10 in the λ = 0.5 weighting scheme, to 16 in the link component. When we look at the correspondence of cluster assignments between two schemes, we observe higher stability between schemes with values of λ closer to each other. Cramér’s V measures are calculated between all schemes and reported in Table 3.

Application

This section briefly outlines the results of partitioning the hybrid network with λ = 0.5, i.e. equal weight on both components, at three levels with increasing resolution (0.7, 1.0 and 1.5). As mentioned above, we used a data set on Information System Research for our analyses. Level I resulted in three large clusters and two pairs or triplets of papers with no link to any other documents. These pairs/triplets (five papers at level I) are removed from further analysis. At level II we found seven clusters and three pairs/triplets (8 papers), and level III has 19 clusters, with the same eight papers grouped in three pairs/triplets.
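Cramér’s V, used in Table 3 above and for the comparison of the three levels reported below, can be computed from two cluster assignments as in the following sketch. The code is illustrative only, assumes NumPy and SciPy are available, and is not the authors’ implementation.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(labels_a, labels_b):
    """Cramér's V between two cluster assignments of the same documents."""
    a_vals, a_idx = np.unique(labels_a, return_inverse=True)
    b_vals, b_idx = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)                  # contingency table
    chi2 = chi2_contingency(table, correction=False)[0]  # Pearson chi-square
    n = table.sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

# Toy example: two partitions of six documents into 3 and 2 clusters.
print(cramers_v([1, 1, 2, 2, 3, 3], ["a", "a", "a", "b", "b", "b"]))
```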
Although the three levels consist of independent runs of the Louvain cluster algorithm, we can observe a near-perfect hierarchical structure. This is confirmed by Cramér’s V values of 0.94 between levels I and II, 0.93 between I and III, and 0.84 between levels II and III. The labels of the clusters at the three levels are taken from the titles of core documents within each cluster. These core documents have been determined according to Glänzel & Thijs (2011) and Glänzel (2012) on the basis of the degree h-index of the hybrid document network. In particular, core documents are represented by core nodes, which in turn are defined as nodes with at least h degrees each, where h is the h-index of the underlying graph and only edges with a minimum weight of 0.15 are retained. At the lowest level, the three clusters contain publications that fit broad categories, such as ‘planning/development/implementation’ (cluster I.2 with 3855 papers), ‘user and technology acceptance’ (cluster I.3 with 1302 papers), and ‘decision support systems’ (cluster I.1 with 957 papers).

Figure 1. Cluster solution at three levels [Data sourced from Thomson Reuters Web of Science Core Collection]

Given the size of the planning/development/implementation cluster and the hierarchical structure of the different levels, there is value in exploring the clustering at a higher resolution, which allows us to develop a more differentiated understanding of the IS literature that falls in this category. At level II, with a resolution of 1.0, we identify five clusters. There are three large clusters: ‘II.c strategic IS planning’ (1414 papers), ‘II.b development/OSS/planning’ (1119 papers) and ‘II.e supply chain’ (1108 papers). Smaller clusters were also found: one mid-sized cluster, ‘II.f intangible assets’ (376 papers), and one small but emergent topic, ‘II.h security’ (48 papers). This last cluster is not further partitioned at level III. The three large level II clusters can be divided further. The obtained hierarchical structure for the three levels is shown in Figure 1.

Conclusions

Based on the data presented in this paper we can conclude that the extraction of noun phrases from abstracts and titles can considerably improve the lexical component in hybrid clustering. However, using the noun phrase itself is not sufficient for the improvement: only if the data are restricted to shingles with at least two terms constructed from the noun phrases is an improvement in the clustering observed. We found that many of the shingles appear only once in each document, which allows us to bring the calculation of similarities in the lexical approach more in line with bibliographic coupling by abandoning the TF-IDF weighting and adopting a binary approach. The new approach was tested in a hybrid combination and resulted in a valid clustering of the field of ‘Information System Research’ with different resolution levels and changing weights for both components. This methodology has several advantages over the other scenarios. The risk of distorting the network by choosing a sub-optimal or even inappropriate weight parameter in the hybrid approach is distinctly reduced. It seems that the parameter no longer has to be used to set the right focus on the document set, but rather to change the viewpoint while the clustering stays in focus.

References

Batagelj, V. & Mrvar, A. (2003). Pajek – Analysis and visualization of large networks. In M. Jünger & P. Mutzel (Eds.), Graph Drawing Software. Berlin: Springer, 77–103.
Blondel, V.D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, P10008.
Braam, R.R., Moed, H.F. & van Raan, A.F.J. (1991a). Mapping of science by combined co-citation and word analysis, Part 1: Structural aspects. JASIS, 42(4), 233–251.
Braam, R.R., Moed, H.F. & van Raan, A.F.J. (1991b). Mapping of science by combined co-citation and word analysis, Part II: Dynamical aspects. JASIS, 42(4), 252–266.
Callon, M., Courtial, J.P., Turner, W. & Bauin, S. (1983). From translations to problematic networks: An introduction to co-word analysis. Social Science Information, 22(2), 191–235.
de Marneffe, M.-C. & Manning, C.D. (2008). The Stanford typed dependencies representation. In: COLING 2008 Workshop on Cross-framework and Cross-domain Parser Evaluation.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
Glänzel, W. & Thijs, B. (2011). Using ‘core documents’ for the representation of clusters and topics. Scientometrics, 88(1), 297–309.
Glänzel, W. & Thijs, B. (2012). Using ‘core documents’ for detecting and labelling new emerging topics. Scientometrics, 91(2), 399–416.
Glänzel, W. (2012). The role of core documents in bibliometric network analysis and their relation with h-type indices. Scientometrics, 93(1), 113–123.
Glenisson, P., Glänzel, W., Janssens, F. & De Moor, B. (2005). Combining full text and bibliometric information in mapping scientific disciplines. Information Processing & Management, 41, 1548–1572.
Janssens, F., Glenisson, P., Glänzel, W. & De Moor, B. (2005). Co-clustering approaches to integrate lexical and bibliographical information. In: P. Ingwersen & B. Larsen (Eds.), Proceedings of the 10th ISSI Conference, Karolinska University Press, Stockholm, 285–289.
Janssens, F., Glänzel, W. & De Moor, B. (2008). A hybrid mapping of information science. Scientometrics, 75(3), 607–631.
Klein, D. & Manning, C.D. (2003). Accurate unlexicalized parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, 423–430.
Manning, C.D. & Schütze, H. (2000). Foundations of Statistical Natural Language Processing. Cambridge: MIT Press.
Meyer, M., Grant, K., Thijs, B., Zhang, L. & Glänzel, W. (2013). The evolution of Information Systems as a research field. Paper presented at the 9th International Conference on Webometrics, Informetrics and Scientometrics and 14th COLLNET Meeting, Tartu, Estonia.
Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.