
Knowledge Representation of Requirements Documents Using Natural Language Processing

Aaron Schlutter, Andreas Vogelsang

Technische Universität Berlin, DCAITI, Germany

aaron.schlutter@tu-berlin.de

2013

Complex systems such as automotive software systems are usually broken down into subsystems that are specified and developed in isolation and afterwards integrated to provide the functionality of the desired system. This results in a large number of requirements documents for each subsystem written by different people and in different departments. Requirements engineers are challenged by comprehending the concepts mentioned in a requirement because coherent information is spread over several requirements documents. In this paper, we describe a natural language processing pipeline that we developed to transform a set of heterogeneous natural language requirements into a knowledge representation graph. The graph provides an orthogonal view onto the concepts and relations written in the requirements. We provide a first validation of the approach by applying it to two requirements documents including more than 7,000 requirements from industrial systems. We conclude the paper by stating open challenges and potential applications of the knowledge representation graph.

Introduction

Framework)¹. To create the graph, our NLP pipeline first performs a number of preprocessing steps on the requirements to normalize the data and then uses Semantic Role Labeling [CM04] to extract single triples. The triples are afterwards merged into a single knowledge graph.

As an early validation, we applied the pipeline to two requirements documents, one academic example with 150 requirements and one industrial document with over 7,000 requirements. Both documents contain requirements from the automotive domain. The resulting knowledge graph for the smaller document contains 345 nodes, while the graph for the larger document consists of over 13,000 nodes. These results show how much conceptual information is contained in requirements documents and motivate why knowledge representation may help in analyzing the semantics embedded in the requirements documents.

Although we have not yet elaborated on specific applications of this knowledge graph, we envision benefits for assessing the amount of implicit knowledge in requirements documents, trace link generation, and support for reviewing requirements documents.

At the end of this paper, we detail some of the challenges that need to be solved to leverage the full potential of this knowledge representation, including identification of semantic similarity and representation of causal relationships.

Background

Knowledge representation is useful in many disciplines, whenever collecting, managing, and sharing information and facts is necessary. The form of knowledge representation may vary from expressing knowledge in natural language to technical solutions that aim at creating complex logical views on the information. As a technical term, the topic evolved in the context of artificial intelligence and reasoning systems in the 1950s. Knowledge representation focuses on the depiction of information that enables computers to solve complex problems.

Knowledge representation supports users and computers in handling large amounts of information. The basic idea of knowledge representation systems is to store complex information or facts in a knowledge base and make them available for various applications. Therefore, searching the knowledge base is a crucial task. How information can be queried depends on the way the information is stored. Reasoning systems, for example, aim at extracting information to deduce new facts by an inference engine. Deduced information can be added to the knowledge base. These systems and their knowledge representation are often very formal and limited to solving specific constraints.

One possibility to represent knowledge is provided by RDF. In RDF, each statement/fact is represented by a triple that contains a subject, an object, and a predicate describing the relationship between them. The parts of an RDF triple are also categorized into different classes, e.g., an object can be an instance of the class resource or of the class literal. One triple's object can be another triple's subject, which allows multiple triples to be concatenated. A set of triples can be represented as an RDF graph where subjects and objects are represented as nodes and predicates as edges.
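To make the triple structure concrete, the following minimal sketch stores a few statements as (subject, predicate, object) tuples and derives the nodes and labeled edges of the corresponding graph. The example statements and the helper name `to_graph` are illustrative assumptions; they are neither part of the RDF standard nor taken from the pipeline described in this paper.

```python
from collections import defaultdict

# Hypothetical example statements as (subject, predicate, object) tuples.
triples = [
    ("driver", "activate", "cruise control"),
    ("cruise control", "adjust", "vehicle speed"),
    ("driver", "press", "brake pedal"),
]


def to_graph(statements):
    """Return (nodes, edges): subjects and objects become nodes,
    predicates become edge labels."""
    nodes = set()
    edges = defaultdict(list)  # subject -> list of (predicate, object)
    for subj, pred, obj in statements:
        nodes.update((subj, obj))
        edges[subj].append((pred, obj))
    return nodes, dict(edges)


nodes, edges = to_graph(triples)

# "cruise control" is the object of the first triple and the subject of the
# second one, so both statements are chained in the graph.
print(sorted(nodes))
print(edges["cruise control"])
```

Plain tuples are used here only to keep the sketch free of dependencies; a real RDF store would additionally distinguish resources from literals.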

In addition to RDF, it is possible to define an RDF schema (RDFS) and build up an ontology. Such a schema contains specifications about the classes used in an RDF model. Therefore, subclasses are defined, e.g., of resource or literal, specifying a limited value range or defining permissible relationships to other classes.

Knowledge bases can be built manually or extracted automatically from other knowledge representations. If the source for such an extraction is natural language, NLP pipelines are usually used for information extraction. In case of building up an ontology, this process is called ontology learning and is often semi-automatic. Due to the ambiguity of natural language, those pipelines usually result in knowledge representations that are imprecise and incomplete up to a certain point. Despite these shortcomings, such structural semi-formal knowledge bases enable computers to execute a wider range of algorithms.

However, for knowledge contained in industrial requirements specifications, a (semi-)automatic extraction is often the only chance for proper knowledge representation.

Related Work

Borgida et al. stated already in 1985 that knowledge representation is the basis for requirements engineering [BGM85]. They argue that in addition to the functional specification of a system, it is essential to have more knowledge about its environment, i.e., "knowledge of terms, technical and scientific rules, everyday procedures and conventions". The proposed Requirements Modeling Language (RML) can be seen as a predecessor of an ontology, and they proposed to use such a model as the basis to exchange knowledge in a formal, yet easy to understand, way.

OntoLearn [VFN13], a pipeline for ontology learning, consists of several steps to learn the taxonomy of an ontology. At first, they identify and extract terms inside a corpus, search for their hypernyms using additional knowledge like web glossaries and documents (e.g., WordNet²), and build up a "noisy hypernym graph". Afterwards, this graph is filtered by given upper terms of a specific domain. Lastly, this graph is pruned to a tree-like taxonomy to reduce cycles and hypernyms for most nodes. In case important edges were pruned, they are optionally recovered at the end. This workflow results in a directed acyclic graph.

Browarnik and Maimon proposed to build graphs via the concept "from clauses or subsentences to RDF triples and RDFS" [BM15]. They argued that the common approach to ontology learning, the so-called layer cake, is not useful because each step (layer) analyzes the whole corpus from a single point of view, which is very often faulty, and this error is passed on to all dependent layers. This results in a generally low recall and precision (less than 0.6). To circumvent this shortcoming, they propose alternative models for ontology learning by reviewing and evaluating existing papers.

One use of ontologies in the automotive context is recommendation-based decision support [HK16]. The authors search for similar, already finished safety analyses to support a human expert in hazard analysis and risk assessment, as prescribed by the ISO 26262 standard. While the ontology contains extensive expert knowledge, spreading activation, a highly configurable semantic search algorithm, is used to query the knowledge base for relevant information.

Dermeval et al. report on the use of ontologies in requirements engineering in their systematic literature review [DVB+16]. They reviewed 67 papers from academic and industrial application contexts, dealing with different types of requirements, and identified and categorized typical problems solved by ontologies in different RE phases. Most citations deal with problems like (1) ambiguity, inconsistency, and/or incompleteness, (2) requirements management/evolution, and (3) domain knowledge representation. While only 34% of the papers reuse existing ontologies in their solutions, most of them specified their own ontology. The largest number of papers in the review rely on textual requirements as the RE modeling style, especially in the specification phase.

Natural Language Processing Pipeline

The Stanford CoreNLP is a collection of common NLP tasks, assembled in a pipeline³. We are using parts of it, particularly the tokenization, sentence splitting, part-of-speech tagging, and lemmatizing (morphological analysis).

² https://wordnet.princeton.edu/
³ https://stanfordnlp.github.io/CoreNLP/

The first two tasks determine all tokens (words, punctuation marks, etc.) and sentences in a natural language text. Part-of-speech tagging categorizes the tokens by their grammatical role inside a sentence, e.g., as noun, verb, or article. Lastly, each word is annotated with its lemma, a base form depending on the part-of-speech tag. For example, the lemma of "walking" (verb) is "walk" (verb), and the lemma of "walkers" (noun) is "walker" (noun). In contrast, stemming both words would result in "walk" as the word stem.

Semantic role labeling (SRL) [CM04] is a sentence-based NLP task that consists of two steps. In the first step, all predicates or verbs of a sentence are determined. In the second step, all semantic arguments of each verb are identified and classified by their roles within the sentence. In general, the first argument is called the agent and the second the patient; all other semantic roles depend on the verb. For example, the sentence "John broke the window with a rocket" contains the verb "broke". Its arguments are [A0 John], [A1 the window], and [AM-MNR with a rocket]. The numbered arguments stay the same even if the syntax of the sentence changes, e.g., "[A1 The window] was [V broken] [AM-MNR with a rocket] [A0 by John]". The semantic roles are predefined in a database called PropBank⁴. As stated in [CM05, Table 1], most of the identified propositions have at least one argument, and about two-thirds also have a second argument.

The CoNLL (Conference on Computational Natural Language Learning) shared task for semantic role labeling was first published in 2004 [CM04] and extended in 2005 [CM05]. Within these competitions, the PropBank corpus was annotated with semantic roles as training data. While the top submission of the first shared task achieved a precision of 72.43% and a recall of 66.77%⁵, the top submission in 2005 already reached a precision of 81.18% and a recall of 74.92% on an enlarged corpus⁶. Although the competitions were designed to run for only several months and further submissions are no longer possible, new systems are still being developed and compared under those conditions [GCW+16, FTGD15, PP14].

We use SENNA⁷ as a state-of-the-art implementation for SRL tagging. It uses a unified neural network architecture to perform SRL and achieved an F1 score of 75.49% for CoNLL in 2005.
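The F1 score is the harmonic mean of precision and recall, which makes it possible to relate the SENNA result to the precision/recall pairs quoted for the shared tasks. The short sketch below computes it for the two top submissions; the function name `f1_score` is an assumption for illustration, and because the inputs are the rounded percentages from the text, the results may deviate slightly from officially published F1 values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs of the top submissions as quoted in the text.
top_2004 = f1_score(72.43, 66.77)
top_2005 = f1_score(81.18, 74.92)

print(f"CoNLL-2004 top submission: F1 = {top_2004:.2f}%")
print(f"CoNLL-2005 top submission: F1 = {top_2005:.2f}%")
```

Computed this way, the 2005 top submission lies roughly two and a half points above the 75.49% reported for SENNA.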

Building the graph

As the last step of the pipeline, an exporter builds a graph from the annotated natural language. The exporter searches the text for SRL verbs and their arguments. If at least a first and a second argument exist, it creates an RDF triple with the first SRL argument as subject, the second one as object, and the verb as predicate. All other arguments are stored as metadata of the triple inside the relation. If there is no second argument, a reflexive relation is built and exported.

SRL annotates the tokens in a sentence as they are (like "the window" or "broke"). However, this is not useful when building a graph with nodes and edges identified by unique labels. It would lead to duplicate nodes and edges if the same phrase occurs multiple times in different grammatical variations. To circumvent this issue, the exporter also trims the words by using their lemma or removing articles on the basis of their part-of-speech tags.
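The export step described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual exporter: the `Frame` type, the function names, the Penn Treebank tag "DT" for articles, lowercasing of labels, and the decision to skip frames without a first argument are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field

# A token is (word, lemma, part-of-speech tag). The Penn Treebank tag set
# (e.g., "DT" for determiners/articles) is assumed here.
Token = tuple[str, str, str]


@dataclass
class Frame:
    """One SRL proposition: a verb plus its labeled arguments (hypothetical type)."""
    verb: Token
    args: dict[str, list[Token]] = field(default_factory=dict)


def normalize(tokens):
    """Build a label from lowercased lemmas, dropping articles."""
    return " ".join(lemma.lower() for _, lemma, pos in tokens if pos != "DT")


def export_triple(frame):
    """Map a frame to (subject, predicate, object, metadata).

    Returns None if the first argument (A0) is missing (an assumption).
    """
    if "A0" not in frame.args:
        return None
    subject = normalize(frame.args["A0"])
    # Without a second argument (A1), a reflexive relation is built.
    obj = normalize(frame.args["A1"]) if "A1" in frame.args else subject
    predicate = frame.verb[1].lower()
    # All remaining arguments are kept as metadata of the relation.
    metadata = {
        role: normalize(tokens)
        for role, tokens in frame.args.items()
        if role not in ("A0", "A1")
    }
    return subject, predicate, obj, metadata


# "John broke the window with a rocket"
transitive = Frame(
    verb=("broke", "break", "VBD"),
    args={
        "A0": [("John", "John", "NNP")],
        "A1": [("the", "the", "DT"), ("window", "window", "NN")],
        "AM-MNR": [("with", "with", "IN"), ("a", "a", "DT"),
                   ("rocket", "rocket", "NN")],
    },
)
# Hypothetical frame with only a first argument: "The driver brakes"
intransitive = Frame(
    verb=("brakes", "brake", "VBZ"),
    args={"A0": [("The", "the", "DT"), ("driver", "driver", "NN")]},
)

triple = export_triple(transitive)
reflexive = export_triple(intransitive)
print(triple)
print(reflexive)
```

Because labels are built from lemmas without articles, "the window" and "a window" map to the same node label, which is what prevents duplicate nodes when the triples are merged.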

Early Validation

We applied the pipeline from Section 3 to two requirements documents. Both have an automotive background and contain real industrial data, but they differ in size and do not relate to each other. For this reason, we applied the pipeline to each of them separately. The first one is the system requirements specification Automotive System Cluster (ASC)⁸, which describes two subsystems, ELC (exterior light system) and ACC (adaptive cruise control). The document is a public case study, which is derived from real requirements but largely reduced and modified by the original authors in size and detail. The other document is provided by our industry partner, describing a single subsystem called "Charging of electric vehicles". Table 1 contains the central measures about the two requirements documents and the knowledge representation graphs that we extracted for both documents.

The graphs for both requirements documents are shown in Figure 2. The graph from Figure 2a has different colors representing from which of the two contained systems the nodes and edges originate. Nodes and edges extracted from ELC are colored grey, ACC nodes and edges are light grey. If they are contained in both systems, they are colored black.

Table 1 (columns: ASC, Charging System; rows: requirements, authors, sentences, tokens (words and marks), nodes [main graph], edges [main graph])

Figure 2: (a) Automotive System Cluster (ASC), (b) Charging System

⁴ https://verbs.colorado.edu/~mpalmer/projects/ace.html
⁵ https://www.cs.upc.edu/~srlconll/st04/st04.html
⁶ https://www.lsi.upc.edu/~srlconll/st05/st05.html
⁷ https://ronan.collobert.com/senna/
⁸ https://www.aset.tu-berlin.de/fileadmin/fg331/Docs/ASC-EN.pdf

Both graphs in Figure 2 show some similarities: both contain a "main graph" in the upper left with a lot of nodes and edges linked to each other, several "sub graphs" around the "main graph" with some nodes and edges, and a lot of "single graphs" with just two or three edges in the lower part of the figures. The number of nodes and edges for each graph is stated in Table 1, as well as the sum of the "main graph" and the "sub graphs" as [main graph].

While most nodes and edges represent concepts and relations that are only contained in one system, there are some black nodes. Two of them in the middle of the main graph are "vehicle" and "driver". They are central because they are typically stated in almost every requirements document of automotive software. The other black nodes represent "ASIL-B", which stands for an automotive safety integrity level, and "function" or "user function", which exist in both systems as common phrases.

These results cannot be interpreted correctly in detail due to a lack of corresponding quality criteria and currently non-existing use cases. Of course, the pipeline is just a prototype, which is neither able to extract all information correctly nor to build a valid ontology or RDF graph. Nevertheless, it is observable from Figure 2a that the two systems have no explicitly described common interfaces. This observation can also be made in the original text document.

The sub graphs and single graphs contain information without any connection to the main graph, which indicates that this connection is not explicitly stated in the requirements document. One reason might be that this is knowledge the author considers to be common and therefore does not document.
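The distinction between the main graph and the disconnected sub graphs and single graphs corresponds to the connected components of the graph when edge directions are ignored. The following sketch computes these components with a breadth-first search; the function name and the toy edges are assumptions for illustration and do not reproduce the data of the two documents.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the node sets of all connected components, largest first.

    Edge direction is ignored because only connectivity matters here.
    """
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)

    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical (subject, object) pairs; the last one is a reflexive relation.
toy_edges = [
    ("driver", "vehicle"),
    ("vehicle", "cruise control"),
    ("cruise control", "speed set point"),
    ("charging cable", "charging station"),
    ("asil-b", "asil-b"),
]

components = connected_components(toy_edges)
main_graph = components[0]
total_nodes = sum(len(c) for c in components)
print(f"main graph share: {len(main_graph) / total_nodes:.0%}")
```

The share of nodes in the largest component gives a simple measure of how much of the extracted knowledge is explicitly connected.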

Open Challenges

As stated in Section 4, our approach is not able to capture every bit of information contained in the original requirements documents. The major reason for this is the information loss that has to be expected whenever NLP techniques are applied to real data. Each step in the pipeline is prone to some errors (missed or incorrectly identified information).

Each requirements document has its own structure and is often extended with additional metadata. The pipeline does not consider these characteristics of a document as it interprets each sentence independently, step by step. One of the challenges that we want to consider for the future is to integrate structural properties of the document into the knowledge graph (e.g., sentences within one requirement, requirements arranged under one headline).

Causal Relations: Requirements, especially functional ones, often describe causal relationships like "If [A] then [B]". If [A] contains a subject, predicate, and object, SRL annotates those accordingly and annotates the words in [B] as a single adjunct argument like general-purpose or cause. Currently, we do not consider such causal relations as the graph is built only based on the first two arguments of the SRL. Integrating such causal relations into the current graph model is not possible since [A] is already a graph, which cannot be related (connected via an edge) to [B], which is a graph itself. Causal relations that we extract from the requirements may be represented as inference rules in RDF schema.

Semantic Similarity: Another problem we observed in the data is that some nodes and edges seem redundant because different words or phrases are used with the same meaning. Some of them are just wording issues like "speed set point of the cruise control" and "cruise control speed set point" in ASC, but there are also other difficulties like "vehicle" and "car". A major challenge for our approach is to identify such semantic similarities to keep the knowledge graph free of redundancies. The corresponding research field dealing with this kind of challenge is called semantic similarity. Determining the semantic similarity between words or phrases would enable us to identify edges that are identical in the meaning of their concept. These edges could be connected to each other or merged into a single one, which would result in a more connected "main graph" and fewer "sub graphs" or "single graphs".

There are already existing approaches to identify semantic similarity. A very promising solution is "Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity" [PJN13]. It is applicable at multiple lexical levels, from single words up to complete texts, comparing each word within its semantic signature. A disadvantage for our use case is the dependency on WordNet, as we want to analyze requirements specifications from the automotive context, which is not covered by WordNet in detail.

Another option is word embeddings like word2vec [MCCD13] or GloVe [PSM14]. They calculate one-dimensional vectors to represent words. This makes it possible to compare words via the similarity of their vectors. One downside is that the calculation requires a lot of data to build appropriate vectors. Also, these word embeddings rely on the distributional hypothesis, the assumption that semantically similar words appear in similar contexts, which is not always the case [MST+16].
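A common way to compare such word vectors is cosine similarity. The sketch below illustrates the idea with hand-made three-dimensional toy vectors; they are not real word2vec or GloVe embeddings, and the function name as well as the threshold of 0.9 for merge candidates are assumptions chosen only for this example.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hand-made toy vectors, NOT real embeddings.
toy_embeddings = {
    "vehicle": [0.9, 0.1, 0.3],
    "car": [0.8, 0.2, 0.3],
    "driver": [0.1, 0.9, 0.2],
}

# Node pairs whose similarity exceeds an assumed threshold are merge candidates.
THRESHOLD = 0.9
merge_candidates = [
    (a, b)
    for a, b in combinations(sorted(toy_embeddings), 2)
    if cosine_similarity(toy_embeddings[a], toy_embeddings[b]) >= THRESHOLD
]
print(merge_candidates)
```

In practice, the vectors would have to come from a model trained on enough domain text, which is exactly the data requirement noted above.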

No Ontology: As stated before, the graph neither meets the RDF specification nor is it a valid ontology. The nodes and edges are not classified or categorized, e.g., into literal or property in an RDF graph, nor does the current pipeline examine whether a node or edge is part of the schema or must be part of the instance level. To leverage the full potential of knowledge reasoning, we need to enhance the data with such additional metadata. Depending on the application of this pipeline, this could be done with a predefined schema that the data should fit into, or this information could be extracted from the graph itself.

Potential Applications

Once we have a good representation of the knowledge contained in a number of requirements documents, we see a number of potential applications.

Knowledge querying: In semantic networks, a requirements engineer can efficiently search for specific information that is contained in any requirements document. In contrast to simple full-text search, a user can express more complex and semantically enriched queries such as "Return all subsystems that interact with my subsystem". Besides querying the semantic net for information that is already encoded in the network, there are also algorithms that can be used to query for associated knowledge (e.g., spreading activation [Cre97]). This allows the user to state queries such as "Return a list of the most relevant stakeholders given a certain feature". If these advanced querying features are integrated in an interactive tool, requirements reviewers may greatly profit from that.
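As a rough illustration of querying for associated knowledge, the following sketch implements a strongly simplified form of spreading activation: activation starts at seed nodes and is passed on to neighbors with a decay factor for a fixed number of steps. It is not the algorithm of [Cre97] or [HK16]; the function name, the parameters `decay` and `steps`, and the toy graph are assumptions made for this example.

```python
from collections import defaultdict


def spread_activation(edges, seeds, decay=0.5, steps=2):
    """Return a dict mapping each reached node to its activation level.

    Each node keeps the highest activation it receives; edge direction
    is ignored.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    activation = {node: 1.0 for node in seeds}
    frontier = dict(activation)
    for _ in range(steps):
        next_frontier = {}
        for node, value in frontier.items():
            passed = value * decay
            for nxt in neighbors.get(node, ()):
                # Only propagate if this raises the neighbor's activation.
                if passed > activation.get(nxt, 0.0):
                    activation[nxt] = passed
                    next_frontier[nxt] = passed
        frontier = next_frontier
    return activation


# Hypothetical concept graph.
toy_edges = [
    ("adaptive cruise control", "vehicle speed"),
    ("vehicle speed", "brake system"),
    ("brake system", "brake light"),
    ("exterior light", "brake light"),
]

result = spread_activation(toy_edges, seeds=["adaptive cruise control"])
for concept, level in sorted(result.items(), key=lambda item: -item[1]):
    print(f"{level:.2f}  {concept}")
```

Ranking nodes by their activation yields concepts that are associated with the query even if they are not directly connected to it; the number of steps bounds how far the search reaches.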

Trace link generation: There are already a number of papers that propose using information retrieval techniques for trace link recovery [HDO03]. In such approaches, single requirements are formulated as queries to find related requirements in a large set of requirements documents. The resulting requirements are considered trace link candidates. With our approach, we may improve these approaches by leveraging the structure of the knowledge graph.

Integration of other NL sources: Requirements documents are not the only natural language documents that contain knowledge about the system to be developed. We plan to integrate knowledge from other natural language artifacts such as system test cases or safety analysis documents into the knowledge representation graph as well.

Detection of implicit knowledge: Implicit knowledge is a potential source of misunderstandings and errors. When requirements authors assume that certain knowledge is obvious or clear to the reader, they do not document this knowledge explicitly. However, sometimes, this assumption is not true. Knowledge representation graphs can indicate implicit knowledge by small subgraphs with concepts that are not related to other concepts. This indicates that the authors presume that additional relations of these concepts are clear to the reader. Another possibility to detect implicit knowledge is to compare the total number of relations that a concept has over all documents with the number of relations that this concept has in one document. Concepts with many relations are "rich" concepts in the sense that authors provide much information about these concepts in their documents. If there is only little information about such "rich" concepts provided in one specific document (i.e., only a small number of relations originate from one document), this may indicate that the author assumed much implicit knowledge.
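The comparison of relation counts can be sketched as follows. The sketch counts, for each concept, the relations it takes part in across all documents and within one document, and reports "rich" concepts that are only sparsely described in that document. The function names, the thresholds `min_total` and `max_share`, and the example triples are hypothetical.

```python
from collections import Counter


def relation_counts(triples):
    """Count for each concept the relations it takes part in."""
    counts = Counter()
    for subj, _pred, obj in triples:
        counts[subj] += 1
        if obj != subj:  # count a reflexive relation only once
            counts[obj] += 1
    return counts


def implicit_candidates(triples_by_doc, doc, min_total=4, max_share=0.25):
    """Concepts that are 'rich' overall but sparsely described in `doc`."""
    total = Counter()
    for triples in triples_by_doc.values():
        total.update(relation_counts(triples))
    local = relation_counts(triples_by_doc[doc])
    return sorted(
        concept
        for concept, count in local.items()
        if total[concept] >= min_total and count / total[concept] <= max_share
    )


# Hypothetical triples extracted from two documents.
triples_by_doc = {
    "A": [
        ("battery", "supply", "motor"),
        ("battery", "have", "capacity"),
        ("battery", "report", "charge level"),
    ],
    "B": [
        ("charger", "charge", "battery"),
        ("charger", "have", "plug"),
    ],
}

candidates_b = implicit_candidates(triples_by_doc, "B")
candidates_a = implicit_candidates(triples_by_doc, "A")
print(candidates_b)
```

In this toy data, "battery" has four relations overall but only one of them in document B, so document B is flagged as possibly assuming implicit knowledge about it.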

Conclusions

In this paper, we have proposed an NLP pipeline for extracting knowledge from requirements documents and expressing this knowledge as a graph. We use Semantic Role Labeling within the pipeline to extract RDF triples from requirements sentences. As a first validation, we have applied the NLP pipeline to two requirements documents, one academic example with 150 requirements and one industrial document with over 7,000 requirements. The validation shows that both resulting knowledge graphs consist of one main subgraph that contains and connects between 40% and 50% of all nodes and several smaller subgraphs with fewer than 10 connected nodes. In the future, we plan to enhance the approach by identifying semantic similarities between concepts and integrating causal relations expressed in requirements to enable reasoning about the knowledge. We have identified a number of potential applications where such a knowledge representation may support RE tasks.

References

[BGM85] Alexander Borgida, Sol Greenspan, and John Mylopoulos. Knowledge representation as the basis for requirements specifications. Computer, 18(4), 1985.

[BM15] Abel Browarnik and Oded Maimon. Ontology learning from text: Departing the ontology layer cake. International Journal of Signs and Semiotic Systems, 4(2), 2015.

[CM04] Xavier Carreras and Lluís Màrquez. Introduction to the CoNLL-2004 shared task: Semantic role labeling. In Computational Natural Language Learning (CoNLL), 2004.

[CM05] Xavier Carreras and Lluís Màrquez. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Computational Natural Language Learning (CoNLL), 2005.

[Cre97] Fabio Crestani. Application of spreading activation techniques in information retrieval. Artificial Intelligence Review, 11(6), 1997.

[CWB+11] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Computing Research Repository (CoRR), abs/1103.0398, 2011.

[DVB+16] Diego Dermeval, Jessyka Vilela, Ig Ibert Bittencourt, Jaelson Castro, Seiji Isotani, Patrick Brito, and Alan Silva. Applications of ontologies in requirements engineering: A systematic review of the literature. Requirements Engineering, 21(4), 2016.

[FTGD15] Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. Semantic role labeling with neural network factors. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.

[GCW+16] Jiang Guo, Wanxiang Che, Haifeng Wang, Ting Liu, and Jun Xu. A unified architecture for semantic role labeling and relation classification. In International Conference on Computational Linguistics (COLING), 2016.

[HDO03] Jane Huffman Hayes, Alex Dekhtyar, and Jeffrey Osborne. Improving requirements tracing via information retrieval. In 11th IEEE International Requirements Engineering Conference (RE), 2003.

[HK16] Kerstin Hartig and Thomas Karbe. Recommendation-based decision support for hazard analysis and risk assessment. In Eighth International Conference on Information, Process, and Knowledge Management (eKNOW), 2016.

[PJN13] Mohammad Taher Pilehvar, David Jurgens, and Roberto Navigli. Align, disambiguate and walk: A unified approach for measuring semantic similarity. In 51st Annual Meeting of the Association for Computational Linguistics (ACL), 2013.

[PP14] Gerhard Paass and Bhanu Pratap. Semantic role labeling using deep neural networks. In International Workshop on Representation Learning (RL), 2014.

[PSM14] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[VF13] Andreas Vogelsang and Steffen Fuhrmann. Why feature dependencies challenge the requirements engineering of automotive systems: An empirical study. In 21st IEEE International Requirements Engineering Conference (RE), 2013.

[VFN13] Paola Velardi, Stefano Faralli, and Roberto Navigli. OntoLearn reloaded: A graph-based algorithm for taxonomy induction. Computational Linguistics (CompLing), 39(3), 2013.