Research on NLP for RE at Fraunhofer FKIE: a Report on Grouping Requirements Daniel Toews Timm Heuss Fraunhofer FKIE Fraunhofer FKIE Wachtberg 53343 Wachtberg 53343 daniel.toews@fkie.fraunhofer.de timm.heuss@fkie.fraunhofer.de Abstract In this report we describe the previous research done by our institute in the field of requirement analysis using different natural language processing methods. To represent the different degrees of similarity between words we implemented different methods that make use of synonyms and hyperonyms. We present the strengths of our methods and identify their weaknesses. For our future research we want to incorporate Word Embeddings as they solve most of the difficulties we faced with synonyms and hyperonyms. 1 Team Overview In our interdisciplinary research team at the Fraunhofer FKIE, we combine practical expertise in requirement engineering of large-scale commercial projects as well as knowledge in the field of Natural Language Processing (NLP), Data Science and Machine Learning, Software Architecture, Systems Engineering and Linguistics. It is our goal to provide novel approaches, algorithms and prototypical implementations to our customers to help them with their daily businesses, such as large commercial requirement engineering projects. 2 Past Research on NLP for RE Over the past year our team at Fraunhofer FKIE conducted research in clustering and analysing requirements. This research was started because of the challenges that requirements engineers and managers faced at our institute, which include finding duplicate requirements as well as requirements formulated with different words for the same semantic context. This leads to requirements stocks that are hard to maintain. One way to address these challenges is to introduce a grouping for requirements, providing requirements engineers and managers a better overview over the data. Since this is an extensive and time consuming task, especially when looking at hundreds of requirements, an automatic system would improve the work flow of those people significantly. Thus a combination of approaches and algorithms is desired to maximize the gain for the people working on those requirement sets. 2.1 General Setup To identify such a combination we implemented multiple algoriths to improve the clustering of their methods (such as k-Means [Llo82], ClusterART [Mas03] and Neural Gas [MS+ 91]) and distance functions (such as euclidean distance and manhattan distance). Following the common Bag of Words approach, the requirements were Copyright c 2018 by the paper’s authors. Copying permitted for private and academic purposes. translated into vectors that represented how often certain words appeared. This resulted in over a thousand of dimensions and thus leaving us vulnerable for the curse of dimensionality. This ”curse” describes problems that arise when trying to calculate the distance between objects high dimensionality. With higher dimensionality distances grow more alike and thus it is hard to differentiate between similar and not similar objects [SEK04]. We tried to reduce the number of dimensions by using stopwords filtering and lemmatization of words. Ad- ditionally we used tf-idf as suggested by the literature [SM86]. 2.2 Using Synonyms for Requirements Clustering We tried to mitigate the effects that using different words for similar semantic intentions (e.g. using broadcast instead of send ) had on the distances of requirements by using synonym lists. The idea is to replace each word with a specific synonym, such that words like transfer and broadcast would be replaced by the word send. This reduced the number of dimensions by a small amount, as well as putting requirements with those similar words closer together. However, this makes words that are similar into the same word, thus having no distance between them. But since those words are not the same, a more nuanced method, in which words can be similar without being the same word, would be preferred. 2.3 Using Ontologies for Requirements Clustering In a next step we enriched the text by using an ontology. This allowed us to determine hyperonyms of a word and then enrich the text with this information. The hyperonym of a word is a word describing the more general concept. For example, the hyperonym of car would be vehicle and the hyperonym of red would be color. It is noteworthy that words can have multiple hyperonyms. For instance, the word watch is a clock as well as an accessoire. For our method we enriched the text by adding all hyperonym information available to this word (meaning the different hyperonyms as well as the hypernyms of those words and so on) to the sentence. This information can contain more then ten or twenty words and thus increases the size of the sentences, as well as more than double the dimension size. However, similar words like car and truck will then share some words in this tree, increasing the similarity between both sentences, while also describing the differences between the words with another subtree. For our tests we made use of the GermaNet ontology.1 2.4 Difficulties with Synonyms and Hyperonyms Even though both approaches have theoretical advantages, in our experience they proved to be not perfect. While our synonyms approach reduced the dimensionality, it provided only a binary similarity between words, either they are the same or not. In contrast, the hyperonym approach with an ontology was able to provide more differentiated information on words, but in doing more than doubled the number of different words in our corpus. Additionally, we faced another problem with those approaches. In general, words can have multiple meanings, for example homonyms, like the word bear, which can describe a mammal, as well as tolerating or enduring something. But other words can also have slightly different meanings. The word group can describe multiple things, for example a criminal group (gang) or a musical group (band ). In both cases a collective of persons is described, but both would have different synonyms and hyperonyms. Currently we designed an algorithm that tries to solve this problem by previously determining the context of the requirements (is the text about criminals or musicians?) and then chooses the appropriate meaning of the word. 2.5 First Evaluation In a first evaluation we tried to determine the impacts of different parts on the clustering quality. We decided to use external metrics to determine the quality of our algorithms. As we wanted to measure the potential support our software could have on requirements engineers, we measured the algorithms against a grouping provided by our own experts. They grouped the technical requirements by their topic, meaning which technical subparts were specified. The grouping was done to get a better overview about the current status of the requirement stock. This means that we had to use a fixed number of clusters to compare against the hand crafted grouping. Additionally this meant that we judged the quality of the algorithm on how well it was able to recreate a human made grouping, while dismissing potentially good results that were specific to the algorithm. Thus we wanted 1 http://www.sfs.uni-tuebingen.de/lsd/tools.shtml, last access on 2018-01-25 requirements to be treated as similar, when the specified technical concept of the requirements was similar. Despite these restrictions on our evaluation, we were not able to determine a better way of measuring the quality of our algorithms, as we did not think that for our case internal metrics were suitable. How to improve the evaluation of our requirement analysis framework is thus an open question for our team. We tested multiple combinations of algorithms, distance functions, modifiers (such as lemmatization and stop words filter), as well as our synonyms and ontology approaches. We calculated the F1 scores for all those combinations and found small indications showing good combinations. Multiple of those good combinations contained our synonyms as well as our ontology approach. We published our results2 as well as parts of our algorithmic framework3 on GitHub. 3 Research Plans on NLP for RE Today sophisticated word embedding algorithms, like word2vec [MCCD13], GloVe [PSM14] and fasttext [BGJM16], are able to place words in a vector space that contains semantic information about the words. Thus similar words will be placed close to each other and form groups of words. This means, that the distance between two words in this space can describe how semantically similar they are. Meaning that synonyms as well as hyperonyms of a certain word will be placed close to each other, while non related words will have a higher distance. Additionally, the words have a fixed dimensionality (generally orders of magnitude lower than the number of different words in the corpus). Previous work in requirement analysis discusses the possibility of integrating Word Embeddings [FDE+ 17] or integrates them in classification approaches [WV16]. Additionally, Lucassen et. al also use the approach for clustering requirements [LDvdWB16]. Looking closer at word2vec and fasttext, these models are generated by reading lage sources of human text, such as Wikipedia, and trying to determine the word positions of word x by looking at words used around x. In short, the model is trained by looking at n words before and after x and is then encouraged to guess x using only the surrounding words (other methods like the Skip-Gramm Model [MSC+ 13], GloVe [PSM14] or fasttext [BGJM16] prove to be more precise, but would be too extensive in this report). Thus the position of x in the vector space is determined by the words often used with x. The assumption is (and proves to be correct), that similar words are used with the same surrounding words. Because of this training, we assume that the model trained with an appropriate dataset contains finer nuances of words like group, as the surrounding words will catch the different meanings and contexts (see the example in 2.4 the word can be used with). Looking back again at the difficulties explained in section 2.4, word embeddings seem to solve exactly those problems. The fixed dimension size of the vectors around 200 to 500 dimensions (as recommended by Mikolov et. al [MCCD13]) is significantly less than our datasets containing thousands of requirements, with more than 1800 unique words, not counting the number of words that occur when using our ontology approach. This means that the number of dimensions in each vector is reduced significantly and thus the calculation of the distances is more meaningful [SEK04]. Words which are synonyms to each other will be placed close in the vector space and thus result in a low distance between those words, while still maintaining their identity. Further, the different meanings of the words are taken into consideration in the vector space due to the nature of the training. Thus, we hope to improve our clustering results even further when using this approach. Additionally, we want to try to use them together and evaluate these results. Another open question we face is the training of the model. The fasttext framework provides pre trained models on 294 languages, using Wikipedia as training data4 , but requirements are mostly specific to a certain domain and so the words may not be presented well or at all in Wikipedia. Thus, using a model trained on a domain specific model may also improve the results. Another research question is how to define similarity between requirements when using word embeddings. Previously, we defined a requirement by adding all words together into one vector which represented the words and their number of occurrences in this requirement. However the number of words in a requirement is now effecting the result in a more meaningful way [KDR15]. Averaging the resulting vector over the number of words could be one solution, but is not feasible, since the average of two sentences can be the same even though they do not share common words with each other. This is due to the fact that the position of words and the spaces between them carries semantic meaning. 2 https://github.com/fkie/requirement-clustering-evaluation-2017 3 https://github.com/fkie/requirement-clustering 4 https://github.com/facebookresearch/fastText#enriching-word-vectors-with-subword-information Because of these problems Kenter et. al suggest a different formula called semantic text similarity (sts) [KDR15]. The idea in short is that for the distance of two sentences (requirements) each word of sentence s1 will be matched to the best corresponding word of sentence s2 . The distance of those words will be calculated and added together. The exact formula is: X sem(w, ss ) ∗ (k1 +) fsts (sl , ss ) = IDF (w) ∗ |ss | w∈sl sem(w, ss ) + k1 ∗ (1 − b + b ∗ avgsl Here sl is the longer of the two sentences, while ss is the shorter one. The parameter avgsl describes the average sentence length of the data set, while b as well as k1 smooth the results and are set to b = 0.75 and k1 = 1.2 on default. The function sem(w, s) describes the semantic similarity of the word w w.r.t sentence s: sem(w, s) = max 0 fsem (w, w0 ) w ∈s where fs em(w, w) describes the similarity of both words. For this, the cosine similarity is suggested. In the next step of our research we will use pre trained fasttext models for clustering requirements and the fsts formula to determine the similarity between requirements. References [BGJM16] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016. [FDE+ 17] Alessio Ferrari, Felice DellOrletta, Andrea Esuli, Vincenzo Gervasi, and Stefania Gnesi. Natural language requirements processing: A 4d vision. IEEE Software, 34(6):28–35, 2017. [KDR15] Tom Kenter and Maarten De Rijke. Short text similarity with word embeddings. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1411–1420. ACM, 2015. [LDvdWB16] Garm Lucassen, Fabiano Dalpiaz, Jan Martijn EM van der Werf, and Sjaak Brinkkemper. Visualiz- ing user story requirements at multiple granularity levels via semantic relatedness. In International Conference on Conceptual Modeling, pages 463–478. Springer, 2016. [Llo82] Stuart Lloyd. Least squares quantization in pcm. IEEE transactions on information theory, 28(2):129–137, 1982. [Mas03] Louis Massey. On the quality of art1 text clustering. Neural Networks, 16(5):771–778, 2003. [MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word repre- sentations in vector space. arXiv preprint arXiv:1301.3781, 2013. [MS+ 91] Thomas Martinetz, Klaus Schulten, et al. A” neural-gas” network learns topologies. 1991. [MSC+ 13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed repre- sentations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013. [PSM14] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014. [SEK04] Michael Steinbach, Levent Ertöz, and Vipin Kumar. The Challenges of Clustering High Dimen- sional Data, pages 273–309. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. [SM86] Gerard Salton and Michael J McGill. Introduction to modern information retrieval. 1986. [WV16] Jonas Winkler and Andreas Vogelsang. Automatic classification of requirements based on con- volutional neural networks. In Requirements Engineering Conference Workshops (REW), IEEE International, pages 39–45. IEEE, 2016.