SRXCRM: Discovering Association Rules Between System Requirements and Product Specifications

Vasco Leitão, Ibéria Medeiros
LASIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal

Abstract
Industrial products integrate highly configurable safety-critical systems which must be intensively tested before being delivered to customers. This process is highly time-consuming and may require associations between product features and the requirements demanded by customers. Machine Learning (ML) has proven to help engineers in this task, through the automation of associations between features and requirements, with the latter prioritized first. However, applying ML becomes harder when requirements are written in natural language (NL) and no ground-truth dataset of them exists. This work presents SRXCRM, a Natural Language Processing-based model able to extract and associate components from product design specifications and customer requirements of safety-critical systems, both written in NL. The model includes a Weight Association Rule Mining framework that defines associations between components and generates visualizations that can help engineers prioritize the most impactful features. Preliminary results show that SRXCRM can extract such associations and visualizations.

Keywords
Requirement engineering, requirement prioritization, software testing, natural language processing, noun phrase chunking, association rule mining

1. Introduction
Safety-critical systems (e.g., automotive, railway systems) have been widely integrated into industrial products. Being critical, it is of the utmost importance to test them properly and intensively. Testing is an essential part of the requirement engineering process to better understand the system behavior and to detect failures. Quality testing aims at reducing cycle times while avoiding human intervention. This is a hard task, as products can have several configurations, features, and corresponding customer requirements (CRs), and the time providers have available before delivering them to customers is not enough to test them properly. Software Product Line Engineering (SPLE) tries to tackle this task. Software Product Lines (SPL) are sets of software-intensive systems that share a common series of product line features, which allow an individual product to be derived from reusable features of the product line [1]. CRs need to be analysed by engineers familiar with the system in order to detect discrepancies with the product. This is done in a manual process in which all features of a product and the existing project requirements on paper must be correlated. This process is prone to errors, since CRs are described in natural language (NL) or a restricted language and lack any use of formal writing techniques [2].
Accuracy is also dependent on the engineer's experience. Natural Language Processing (NLP) can be applied to support linguistic tasks present in the SPLE field [3], while requirement prioritization can identify the relevant requirements to associate with a system, reducing testing costs.
In this paper, we tackle an SPLE scenario focused on the identification of requirement similarities and variance. For that, we use NLP techniques over CRs and product design specifications (DSs) from a Propulsion and Control railway subsystem (PCS), both written in NL and belonging to two different, but related, domains. The paper presents SRXCRM, an NLP-based model for CR prioritization that links CRs to DSs in an automated manner. The model contains a Weight Association Rule Mining (WARM) framework to process both CRs and DSs, extract relevant information, and obtain association rules within each domain, together with their visualizations. Preliminary results show that the model can extract such associations and visualizations.
The contributions of the paper are: 1) a grammar able to interpret DSs and CRs written in NL; 2) the SRXCRM model, capable of extracting components and association rules from DSs and CRs and linking the latter to the former; 3) the WARM framework to extract association rules, considering the weights of components; 4) a knowledge graph of the extracted rules showing their associations; 5) preliminary results of SRXCRM on the processing of DSs and CRs.

2. Related Work
There is a great amount of research focused on adapting NLP models to named entity recognition (NER). Passos et al. [4] use a variation of the Skip-Gram model [5] that extracts information from lexicons in order to improve representations. Foley et al. [6] explored NER on limited data as a retrieval task through user interaction. In their model, expert users identify entities in a set of CRF-ranked sentences, generating training data that is suitable for NER. Some adaptations to specific domains have been made lately as well, in order to reduce user annotation and manual training. Soderland et al. [7] classified domain arguments from elements extracted from extensive noun phrases, given a small training set. They also proposed an algorithm for rule extraction that attempts to evaluate the generalization of relations between distinct classes, although most of the extracted relations are verb-centered.
For requirement modelling, Mu et al. proposed EFRF [8], a framework focused on extracting functional requirements from software requirement specifications by analyzing their linguistic structure. However, they did not specify how the model deals with domain-specific terms. Schlutter et al. [9] proposed a pipeline for concept extraction from requirement documents, with the concepts represented in graphs that help engineers in knowledge detection and querying. However, the pipeline lacks quality criteria to support validation. We seek validation through communication with domain experts, to understand whether term extraction ensures completeness. Abbas et al. [10] defined a process focused on requirement reuse, using existing CRs in order to recommend features for implementing new, unseen CRs. Their work is based on TF-IDF and Doc2Vec techniques, followed by embedding clustering to aid recommendation. Despite the good results in accuracy and requirement identification, the expert users' limited understanding of the algorithm decreased the acceptance of the methodology.
Our model aims to create a system based on chunking and Association Rule Mining (ARM) frameworks [11], which provides easy-to-follow results about component relationships. We propose a model with an Information Extraction (IE) pipeline similar to [12], but we explore different approaches to prioritization.

3. Data

3.1. Raw Dataset
The raw dataset is source material from a PCS, consisting of a set of DSs and CRs, both written in NL and not preprocessed before being handed to us. DSs and CRs belong to different domains, i.e., the former belong to a low-level domain, while the latter fit a high-level domain. The material was received in the form of a document that consists of three main elements: object identifiers, DSs, and CRs that may be associated with the DSs. This document follows a hierarchy of sections, such as Conventions or System Overview. Through exploratory data analysis, we identified 526 instances of DSs, each one identified by a unique object identifier, 85 of which are related to one or more CRs. Overall, we identified 164 CRs.
A DS is formally described as a conceptualization of a PCS standard product feature, describing its implementation in NL. It comprises various interfaces and functions, which are some of the components that compose a product feature. On the other hand, a CR is defined as a concrete description of a product feature to be deployed to a customer. It consists of a revision of a product feature in order to adapt it to the needs of a specific customer of the system. It complements the standard feature description integrated in the component specification with new information defined by a customer. Each DS is identified by an object identifier, a code that does not give any information about the system, although we use it to map DSs. In addition, a DS may be linked to various CRs, and CRs may be linked to several DSs. CRs associated with the same DS are independent.

3.2. Preprocessing
The preprocessing pipeline is focused on formatting the NL sentences of each DS and CR into noun phrase chunks, the latter being the main inputs to association rule extraction. First, we perform non-specification identification over DSs. DSs corresponding to section, subsection, and category titles are identified and discarded, since these do not bring any value to the evaluation. Afterwards, tokenization is done to split the text into tokens and determine which of them correspond to words or punctuation marks. Tokens corresponding to stopwords are removed. The final task of the process is Part-of-Speech (POS) tagging, which categorizes the tokens of each CR and DS by their grammatical tags, using the Averaged Perceptron Model [13] with the Penn Treebank tagset [14]. These tags are used as input to the Main Chunking phase, described in Section 4.1.
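To make the preprocessing steps concrete, the following is a minimal sketch of how such a pipeline could be assembled with NLTK. The function and variable names (e.g., preprocess, STOPWORDS) are illustrative assumptions, not the authors' code; the handling of the word "not" reflects the stopword adjustment reported in Section 5.

```python
# Illustrative preprocessing sketch with NLTK (not the authors' original code).
import nltk
from nltk.corpus import stopwords

# Required NLTK resources (tokenizer model, averaged perceptron POS tagger, stopword corpus).
for resource in ("punkt", "averaged_perceptron_tagger", "stopwords"):
    nltk.download(resource, quiet=True)

# English stopwords; "not" is kept, since removing it was found to harm chunk integrity.
STOPWORDS = set(stopwords.words("english")) - {"not"}

def preprocess(specification_text):
    """Tokenize, remove stopwords, and POS-tag the sentences of a DS or CR."""
    tagged_sentences = []
    for sentence in nltk.sent_tokenize(specification_text):
        tokens = nltk.word_tokenize(sentence)                        # tokenization
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]   # stopword removal
        tagged_sentences.append(nltk.pos_tag(tokens))                # Penn Treebank POS tags
    return tagged_sentences

# Example with a synthetic sentence in the style of a design specification.
print(preprocess("The MCM shall control the DC-link voltage of the propulsion converter."))
```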
4. The SRXCRM Model
This section presents SRXCRM (System Requirement eXtraction with Chunking and Rule Mining), an NLP-based model able to extract main chunks from both DSs and CRs, creating for each domain association rules that map different components. These will then be used to map new CRs to the DSs of the subsystem. SRXCRM also creates visualizations that help engineers identify the most relevant features in each domain. Figure 1 depicts the proposed model, comprising three main phases: Preprocessing (described in Section 3.2), Main Chunking, and Rule and Knowledge Extraction. The whole process requires a raw dataset of DSs and CRs as input in order to generate, for each domain, the knowledge graph and the association rules that represent the relations between relevant components of the PCS. Rules are established after a Noun Phrase Chunking refinement process and a Weighted Association Rule Mining (WARM) task. Both processes are detailed in the next sections.

Figure 1: Overview of the SRXCRM model (phases and tasks: 1. Preprocessing: Non-Specification Identification, Tokenization, Stopwords Removal, Part-of-Speech Tagging; 2. Main Chunking: Noun Phrase Chunking, Main Chunking Refinement; 3. Rule and Knowledge Extraction: Weight Attribution, Frequent Itemset Extraction, Weight Association Rule Mining; outputs: Knowledge Graph and Association Rules).

4.1. Main Chunking
The main chunking phase aims to recognize components of the PCS and extract them from DSs and CRs. For that, the phase receives a set of DSs and CRs with their sentences already tagged in the preprocessing phase, performs the Noun Phrase Chunking task over each sentence, and then refines the results to obtain the most relevant chunks, which we refer to as main chunks.
Noun Phrase Chunking. In order to retrieve components, we created a chunk grammar able to recognize and extract noun phrase chunks (chunks for short), i.e., components, over the tagged sentences. The grammar, shown in Table 1, comprises four rules (expressed as regular expressions) that define tag sequences that can be associated with the presence of components.
Main Chunking Refinement. After chunk extraction and manual analysis, we realized that several chunks derive from others. A chunk A derives from a chunk B if B is a substring of A and B occurs more frequently than A. With the aim of getting the most precise results, the main chunk concept was introduced, which denotes a chunk with frequency above a threshold t. For A and B above, B would be a main chunk. Therefore, after retrieving the chunks, we refine the whole set to extract the main chunks and, for each one, the set of their derived chunks.

Table 1: The rules defined in the chunk grammar.
  {<JJ>+<NNP>+}        At least one adjective (JJ) followed by at least one proper noun (NNP)
  {<NN>*<NNP>+<NN>*}   A noun (NN), if present, followed by at least one proper noun (NNP), followed by another noun (NN), if present
  {<JJ>+<NN>+}         At least one adjective (JJ) followed by at least one noun (NN)
  {<NN><NN>}           Two consecutive nouns (NN)
  Overall results (second iteration): #ch = 108; HR = 65.7%; PR = 16.7%; NR = 17.6%
  #ch: number of extracted main chunks; HR(%): percentage of highly relevant main chunks; PR(%): percentage of possibly relevant main chunks; NR(%): percentage of not relevant main chunks.
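As an illustration of the Noun Phrase Chunking and refinement steps, the sketch below applies a grammar built from the four rules of Table 1 with NLTK's RegexpParser, counts chunk frequencies, and derives main chunks and their derived chunks. It is a simplified reading of the approach; the helper names (extract_chunks, main_chunks, derived_chunks) are illustrative assumptions, while the frequency threshold t = 4 is the one reported in Section 5.

```python
# Illustrative chunking sketch (simplified; not the authors' implementation).
from collections import Counter
import nltk

# Chunk grammar with the four rules of Table 1 (NP = noun phrase chunk).
GRAMMAR = r"""
NP: {<JJ>+<NNP>+}        # adjective(s) followed by proper noun(s)
    {<NN>*<NNP>+<NN>*}   # optional noun(s), proper noun(s), optional noun(s)
    {<JJ>+<NN>+}         # adjective(s) followed by noun(s)
    {<NN><NN>}           # two consecutive nouns
"""
parser = nltk.RegexpParser(GRAMMAR)

def extract_chunks(tagged_sentences):
    """Return a frequency counter of noun phrase chunks found in POS-tagged sentences."""
    counts = Counter()
    for sentence in tagged_sentences:
        tree = parser.parse(sentence)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            chunk = " ".join(word for word, _ in subtree.leaves())
            counts[chunk] += 1
    return counts

def main_chunks(chunk_counts, threshold=4):
    """Main chunks: chunks whose frequency reaches the threshold t (t = 4 in the paper)."""
    return {chunk for chunk, freq in chunk_counts.items() if freq >= threshold}

def derived_chunks(main, chunk_counts):
    """Chunks that contain a main chunk as a substring and occur less frequently than it."""
    return {c for c, f in chunk_counts.items()
            if main in c and c != main and f < chunk_counts[main]}
```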
4.2. Rule and Knowledge Extraction
The Rule and Knowledge Extraction phase receives as input the main chunks, their derived chunks, and the DSs or CRs that have been processed, and outputs the association rules discovered between the main chunks and the graph representing the connections between them. This phase employs the Weight Association Rule Mining (WARM) process, which extends the ARM methodology with the weight of each chunk that occurs in the dataset. The WARM process is supported by the Weight Attribution and Frequent Itemset Extraction tasks, which, respectively, calculate the weight of participation of each chunk and extract the itemsets in which chunks occur within the dataset. Afterwards, the rules are extracted by WARM.
Weight Attribution. We aim to define a weight for each main chunk. For that, first, each chunk c (either a main chunk mc or a derived chunk dc) is given an associated uniqueness value (UV), which depends on the number of main chunks from which it derives (Equation 1). Secondly, the weight (W) of each main chunk mc is determined as the sum of the UV values of its derived chunks (Equation 2), where N is the number of derived chunks of mc. In order to avoid great discrepancies between values, we apply min-max normalization to the [0,1] range over the resulting weight array of main chunks. Since Equation 3 uses a product of the W(mc) values, we perform value smoothing by adding a thousandth unit (0.001) to zero-valued weights, so as to avoid null results in Equation 3. Similarly, we apply smoothing over unit weights by subtracting a thousandth unit, in order to keep uniformity in the task.

UV(c) = \frac{1}{\#(\text{related main chunks})_c}   (1)

W(mc) = \sum_{i=1}^{N} UV(dc_i)   (2)

Frequent Itemset Extraction. A binary occurrence dataset is created that associates the DSs or CRs with the occurrence of main chunks. Each DS or CR is encoded as a transaction that contains the occurrences of main chunks. Transactions are obtained following the reasoning described by Agrawal et al. [11]: given a transaction T and a main chunk mc, if mc ∈ T then T[mc] = 1, otherwise T[mc] = 0. For example, if we have 15 main chunks and a specification S contains 4 of them, the corresponding instance of the binary dataset will be composed of 15 elements, 4 of which have value 1. Afterwards, a further analysis is made to recognize whether there are DSs or CRs that do not contain any main chunk, i.e., instances of the dataset with all values equal to 0. These are viewed as irrelevant and are therefore removed. After this refinement, frequent itemset extraction is performed over the binary dataset, invoking the Apriori algorithm [15] to extract the itemsets with support above a manually defined threshold. These are required to build the association rules that express relations between components.
Weight Association Rule Mining (WARM). Following the reasoning of Tao et al. [16], we extended standard ARM with the weights of chunks in order to better exploit the derived chunks of each main chunk. The standard metrics of support, confidence, and lift [11] were also adjusted, resulting in weighted counterparts. WARM is described by the following pipeline: (1) For each itemset {mc_1, ..., mc_n} (it for short) from the previous task, its Itemset Transaction Weight (ITW) is defined (Equation 3), reflecting the aggregated weight of all main chunks it contains. (2) Weighted support (WS) is calculated through Equation 4, where T_it denotes the number of transactions in which it occurs. We also adjust the number of transactions accordingly, since it depends directly on the WS (Table 2, column 4). (3) Representing a rule as A ⇒ B, where A = {mc_1, ..., mc_n}, B = {mc_m}, and A ∩ B = ∅, the weighted confidence (WC) and weighted lift (WL) are obtained from Equations 5 and 6.

ITW(\{mc_1, ..., mc_n\}) = \prod_{i=1}^{n} W(mc_i)   (3)

WS(it) = \frac{T_{it} \times ITW(it)}{\#(\text{transactions})}   (4)

WC(A \Rightarrow B) = \frac{WS(A \cup B)}{WS(A)}   (5)

WL(A \Rightarrow B) = \frac{WS(A \cup B)}{WS(A) \times WS(B)}   (6)

The extracted rules must satisfy a minimum WC threshold and are ordered first by WC values. In case of a tie, rules are ordered by WS and then by WL.
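A compact sketch of Weight Attribution and the weighted metrics follows, written as we interpret Equations 1-6. The data structures (a dictionary mapping each main chunk to its derived chunks, and transactions as sets of main chunks) and the function names are assumptions made for the example, not the authors' code.

```python
# Illustrative sketch of Weight Attribution and the weighted metrics (Equations 1-6).
# Data structures and names are assumptions made for the example, not the authors' code.
from math import prod

def uniqueness_values(derived_of):
    """Equation 1: UV(c) = 1 / number of main chunks from which chunk c derives."""
    related = {}
    for mc, dcs in derived_of.items():
        for dc in dcs:
            related[dc] = related.get(dc, 0) + 1
    return {dc: 1.0 / n for dc, n in related.items()}

def chunk_weights(derived_of):
    """Equation 2, followed by min-max normalization and smoothing of 0/1 weights."""
    uv = uniqueness_values(derived_of)
    raw = {mc: sum(uv[dc] for dc in dcs) for mc, dcs in derived_of.items()}
    lo, hi = min(raw.values()), max(raw.values())
    weights = {}
    for mc, w in raw.items():
        w = (w - lo) / (hi - lo) if hi > lo else 0.5   # min-max normalization to [0,1]
        if w == 0.0:
            w += 0.001                                 # smoothing of zero weights
        elif w == 1.0:
            w -= 0.001                                 # smoothing of unit weights
        weights[mc] = w
    return weights

def itw(itemset, weights):
    """Equation 3: product of the weights of the main chunks in the itemset."""
    return prod(weights[mc] for mc in itemset)

def ws(itemset, weights, transactions):
    """Equation 4: weighted support, with T_it taken as the itemset's transaction count."""
    t_it = sum(1 for t in transactions if set(itemset) <= t)
    return t_it * itw(itemset, weights) / len(transactions)

def wc_wl(antecedent, consequent, weights, transactions):
    """Equations 5 and 6: weighted confidence and weighted lift of A => B."""
    ab = ws(tuple(antecedent) + tuple(consequent), weights, transactions)
    confidence = ab / ws(antecedent, weights, transactions)
    lift = ab / (ws(antecedent, weights, transactions) * ws(consequent, weights, transactions))
    return confidence, lift
```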
The knowledge graph that represents the relations between components (i.e., main chunks) is also created, where a graph node corresponds to a main chunk that occurs in the set of rules, annotated with its number of derived chunks, and a directed edge corresponds to a rule between two main chunks. As two main chunks might occur together in several rules, the highest confidence obtained for that pair is represented in the edge through an edge-coloring scheme. Figure 2 shows an example of a knowledge graph containing the association rules between components extracted from DSs, for a minimum WC threshold of 10.0%.

5. Experimental Results
In this section, we evaluate SRXCRM over the raw PCS dataset. First, we describe the configuration of its main phases, and then we present the preliminary results obtained. SRXCRM is implemented with different packages. For preprocessing and noun phrase chunking, we used the NLTK toolkit [17], namely its tokenization and POS tagging modules. To conduct frequent itemset and rule extraction, we used MLxtend [18]. Graphs are obtained with Graphviz [19].
Preprocessing. From preprocessing we were able to extract 2074 sentences from the 228 identified DSs, and 1076 sentences from the 164 CRs. The stopword removal task used the English stopword corpus defined by NLTK [17], although we removed the word not from the list, since we noticed that removing it affected the integrity of the extracted chunks.

Figure 2: Knowledge graph extracted from 228 DSs representing component relations (nodes include MVB, MCM2, MCM/ACM, ACM, DC, AC, LCM, interconnection contactor, MCM, PCU, DC-link, and TCMS, annotated with their numbers of derived chunks). If WC > 80%, a red-colored edge is returned. If 80% ≥ WC > 50%, an orange-red edge is returned. If 50% ≥ WC > 20%, an orange-colored edge is returned. If 20% ≥ WC > 10%, a pink-colored edge is returned.
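As an aside on how a graph like Figure 2 can be produced, the sketch below uses the Python graphviz package to draw one node per main chunk and color each edge according to the weighted-confidence bands given in the caption above. The rule representation (antecedent chunk, consequent chunk, WC triples, restricted to single-chunk antecedents for brevity) and the function names are assumptions for the example, not the authors' code.

```python
# Illustrative rendering sketch with the Python graphviz package (not the authors' code).
from graphviz import Digraph

def edge_color(wc):
    """Map a weighted confidence value to the color bands used in Figure 2."""
    if wc > 0.80:
        return "red"
    if wc > 0.50:
        return "orangered"
    if wc > 0.20:
        return "orange"
    return "pink"   # rules are only kept above the 10% WC threshold

def build_knowledge_graph(rules, derived_counts):
    """rules: iterable of (antecedent_chunk, consequent_chunk, wc); derived_counts: per-chunk totals."""
    graph = Digraph("knowledge_graph")
    best = {}                                   # keep the highest WC per ordered pair of chunks
    for a, b, wc in rules:
        best[(a, b)] = max(wc, best.get((a, b), 0.0))
    for (a, b), wc in best.items():
        for chunk in (a, b):
            graph.node(chunk, label=f"{chunk}\n{derived_counts.get(chunk, 0)}")
        graph.edge(a, b, color=edge_color(wc))
    return graph

# Example call with made-up values:
# build_knowledge_graph([("LCM", "MCM", 0.5)], {"LCM": 3, "MCM": 7}).render("kg", format="png")
```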
Main Chunking. Noun Phrase Chunking uses the POS-tagged sentences and the grammar as input. For each sentence, the grammar parser is run for 10 epochs and extracts all chunks and their respective frequencies. Chunks with frequency ≥ 4 (the threshold we defined) were considered main chunks. In the DS dataset, 8.04% of the chunks were flagged as main chunks. We decided to use PCS human expert validation to evaluate each grammar rule, based on the relevance of each extracted main chunk. To do so, for each main chunk, we asked whether it was highly relevant (recognized as a PCS component), possibly relevant (a PCS state), or not relevant. This process was done in two iterations. In the first iteration, with an initial grammar we defined, 89.6% (69) of the 77 main chunks extracted from DSs were considered highly or possibly relevant by the PCS experts. Afterwards, we adjusted the grammar by adding a new rule to capture new interesting chunk patterns (see Table 1), and we reran the model. In this second iteration, 82.4% (89) of the 108 main chunks extracted were considered highly or possibly relevant by the PCS experts. These results are described in Table 1, columns 3-6. Although the ratio of relevant chunks decreases with the new rule, the extended grammar is useful because it extracted 16 new highly relevant main chunks that are helpful to decision makers. After applying the model to CRs, we verified that 38% of the highly relevant main chunks extracted from DSs also occur in CRs.
Rule and Knowledge Extraction. We compared WARM with ARM. For ARM, we extracted frequent itemsets without using the weight parameter, to understand which patterns would be extracted without the influence of the derived chunks. Frequent itemsets were extracted from DSs with a minimum support threshold of 5 transactions (≈ 1.9%), and rules were then extracted for a minimum confidence threshold of 50% (Table 2, rows 2-6). We managed to extract 428 itemsets (columns 2-4) and 459 rules (columns 5-8) from DSs, many with the highest possible confidence (C = 1), which did not allow us to draw conclusions about main chunk relations. Next, to better understand the model behavior, we retrieved the rules with WARM (Table 2, rows 7-11). We reduced the support threshold to 0.5% to capture a significant set of rules, keeping the same minimum confidence threshold, and retrieved 65 weighted itemsets (columns 2-4) and 102 weighted rules (columns 5-8). The resulting rules are better related to the context of the problem because the weight depends on the derived chunks and their UV.

Table 2: Itemsets and rules extracted from DSs with ARM and WARM.
  F      Itemsets                  S_i     #T      Association Rules                 S_r    C      L
  ARM    (MCM)                     0.221   57.0    (MCM1) => (MCM2)                  0.07   1.00   9.92
         (AC)                      0.186   48.0    (MCM, MCM/ACM) => (ACM)           0.05   1.00   5.38
         ...                       ...     ...     ...                               ...    ...    ...
         (effort reference, Max)   0.019   5.0     (LCM, MVB) => (MCM)               0.02   0.50   2.26
         (TCMS, MCM, PCU, MVB)     0.019   5.0     (TCMS, LCM) => (MCM)              0.02   0.50   2.26
  WARM   (ACM)                     0.091   23.52   (MCM, MVB) => (ACM)               0.01   0.83   9.13
         (MCM)                     0.084   21.55   (MCM/ACM) => (ACM)                0.01   0.80   8.77
         ...                       ...     ...     ...                               ...    ...    ...
         (AC, DC, LCM)             0.005   1.31    (LCM) => (connection contactor)   0.01   0.10   5.58
         (ACM, MCM, TCMS)          0.005   1.30    (LCM) => (AC, ACM)                0.01   0.10   8.69
  F: framework; S_i: itemset support; #T: number of transactions; S_r: rule support; C: confidence; L: lift.

The knowledge graph depicted in Figure 2 shows the connections between the 12 components recognized from the WARM association rules, for a minimum weighted confidence threshold of 10.0%. The graph gives a comprehensive overview of the relations between the components, helping engineers understand how they interact within the processed domains. For CRs we applied the same process: from the 123 main chunks discovered, 238 rules were extracted, with the graph representing relations between 22 main chunks.
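For reference, the unweighted ARM baseline described above can be reproduced with MLxtend roughly as follows. The DataFrame layout (one boolean column per main chunk, one row per DS or CR) and the threshold values mirror the ones reported in this section, but the code itself is an illustrative sketch rather than the authors' implementation.

```python
# Illustrative ARM baseline with MLxtend (sketch, not the authors' implementation).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

def arm_baseline(binary_dataset: pd.DataFrame,
                 min_support: float = 0.019,      # ~5 transactions of the DS set
                 min_confidence: float = 0.5):
    """binary_dataset: one boolean column per main chunk, one row per DS or CR."""
    # Drop transactions that contain no main chunk, as these are considered irrelevant.
    binary_dataset = binary_dataset.loc[binary_dataset.any(axis=1)]
    itemsets = apriori(binary_dataset, min_support=min_support, use_colnames=True)
    rules = association_rules(itemsets, metric="confidence", min_threshold=min_confidence)
    # Order rules by confidence, then support, then lift (as done for the weighted variant).
    return rules.sort_values(["confidence", "support", "lift"], ascending=False)
```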
6. Conclusions
The paper presented SRXCRM, an NLP-based model able to extract, associate, and rank NL requirements of highly configurable systems, resorting to Noun Phrase Chunking and Weight Association Rule Mining. The model was validated through domain expert analysis, in which almost 90% of the processed noun phrase chunks were considered relevant by system engineers. As next steps, we intend to perform a sentence similarity task that uses tuples as input, where the match value is based on the main chunks and rules contained in each DS and CR. This will require better processing of CRs, since these belong to a broader domain.

Acknowledgments
This work was partially supported by the European ITEA3 programme through the XIVT project (I3C4-17039/FEDER-039238), and by national funds through FCT with reference to the LASIGE Research Unit (UIDB/00408/2020 and UIDP/00408/2020). The authors would also like to thank Bombardier Transportation AB for their continued support.

References
[1] F. J. v. d. Linden, K. Schmid, E. Rommes, Software Product Lines in Action: The Best Industrial Practice in Product Line Engineering, Springer-Verlag, Berlin, Heidelberg, 2007.
[2] M. Kassab, C. Neill, P. Laplante, State of practice in requirements engineering: contemporary data, Innovations in Systems and Software Engineering 10 (2014) 235-241.
[3] L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, R. T. Batista-Navarro, Natural language processing (NLP) for requirements engineering: A systematic mapping study, 2020. arXiv:2004.01099.
[4] A. Passos, V. Kumar, A. McCallum, Lexicon infused phrase embeddings for named entity resolution, in: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, Association for Computational Linguistics, Ann Arbor, Michigan, 2014, pp. 78-86.
[5] T. Mikolov, G. Corrado, K. Chen, J. Dean, Efficient estimation of word representations in vector space, 2013, pp. 1-12.
[6] J. Foley, S. M. Sarwar, J. Allan, Named entity recognition with extremely limited data, 2018, pp. 2-7.
[7] S. Soderland, B. Roof, B. Qin, S. Xu, Mausam, O. Etzioni, Adapting open information extraction to domain-specific relations, AI Magazine 31 (2010) 93-102.
[8] Y. Mu, Y. Wang, J. Guo, Extracting software functional requirements from free text documents, in: 2009 International Conference on Information and Multimedia Technology, 2009, pp. 194-198. doi:10.1109/ICIMT.2009.47.
[9] A. Schlutter, A. Vogelsang, Knowledge representation of requirements documents using natural language processing, RWTH, 2018.
[10] M. Abbas, M. Saadatmand, E. P. Enoiu, D. Sundmark, C. Lindskog, Automated reuse recommendation of product line assets based on natural language requirements, in: 19th International Conference on Software and Systems Reuse, 2020.
[11] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases, SIGMOD Rec. 22 (1993) 207-216. doi:10.1145/170036.170072.
[12] R. Sonbol, G. Rebdawi, N. Ghneim, Towards a semantic representation for functional software requirements, 2020, pp. 1-7. doi:10.1109/AIRE51212.2020.00007.
[13] J. Votrubec, Morphological tagging based on averaged perceptron, WDS'06 Proceedings of Contributed Papers (2006) 191-195.
[14] Alphabetical list of part-of-speech tags used in the Penn Treebank Project, 2003. URL: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
[15] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994, pp. 487-499.
[16] F. Tao, F. Murtagh, M. Farid, Weighted association rule mining using weighted support and significance framework, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 661-666. doi:10.1145/956750.956836.
[17] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O'Reilly Media, Inc., 2009.
[18] S. Raschka, MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack, Journal of Open Source Software 3 (2018) 638.
[19] J. Ellson, E. R. Gansner, E. Koutsofios, S. C. North, G. Woodhull, Graphviz and Dynagraph — Static and Dynamic Graph Drawing Tools, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004, pp. 127-148.