Annotation of CVE Descriptions

Vladimir Dimitrov
Faculty of Mathematics and Informatics, University of Sofia St. Kliment Ohridski,
5 James Bourchier Blvd., 1164, Sofia, Bulgaria
cht@fmi.uni-sofia.bg

Abstract. Knowledge extraction from texts is based on text annotation. In essence, the annotation process amounts to understanding the text contents, and it relies heavily on knowledge that cannot be found in the annotated text itself. The aim of this research is to generate a knowledge base from CVE descriptions.

Keywords: text annotation, knowledge base, ontology, CVE, vulnerability.

1 Introduction

The MITRE Corporation maintains a public database of weaknesses, namely CWE [1], and a public database of vulnerabilities, known as CVE [2]. The ontology must be simplified to be usable for educational purposes. The two terms are defined as follows:
• “Weakness - a type of mistake in software that, in proper conditions, could contribute to the introduction of vulnerabilities within that software. This term applies to mistakes regardless of whether they occur in implementation, design, or other phases of the SDLC.”
• “Vulnerability - an occurrence of a weakness (or multiple weaknesses) within software, in which the weakness can be used by a party to cause the software to modify or access unintended data, interrupt proper execution, or perform incorrect actions that were not specifically granted to the party who uses the weakness.”

The focus of this research is on the vulnerabilities, i.e. CVEs. Here, the weaknesses (CWEs) are vulnerability types.

2 CWEs and CVEs

The CWE database is organized in several views intended for different audiences. A view may be structured by categories, which are conceptual elements that group the weaknesses. The CWE views for researchers, for developers, and for architects are structured by categories. Each category can contain subcategories. Classes, bases, and variants are kinds of weaknesses at different abstraction levels.
A class is an abstract weakness that is not associated with any platform or technology. Bases are more specific than classes: a base is usually not associated with any platform or technology either, but it contains enough detail to be detected. A variant is more specific than a base and is usually associated with a specific platform or technology.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The weaknesses are organized in abstraction levels but not in inheritance hierarchies. A class can be more abstract than other classes, bases, and variants. A base can be more abstract than other bases and variants. A variant can be more abstract than other variants.

Compound weaknesses (composites and chains) combine several simple weaknesses. Chains are ordered, while composites are simply sets.

A weakness can participate in more than one view, and it can also appear more than once within a single view. In addition to this organization by structure and by abstraction, there are relations among weaknesses.

A CWE entry includes the following fields: CWE ID and name; description; alternate terms; description of the behavior; description of the exploit; likelihood of exploit; description of the consequences of the exploit; potential mitigations; node relationships; source taxonomies; code samples for the languages/architectures; CVEs (vulnerabilities) for which that type of weakness exists; and references.

Weaknesses are vulnerability types. Each CWE references the CVEs of its type, so CVEs are classified by CWEs. Initially, a vulnerability is registered as a CVE, but its type is usually not yet clear. After some investigation, a type (or several types) is assigned to the new vulnerability. If there is no suitable CWE, a new CWE can be created. The investigation process can also identify the types of attack conducted against the investigated CVE.
Attack types are classified as templates in CAPEC (Common Attack Pattern Enumeration and Classification) [3] by the MITRE Corporation. Sometimes it is impossible to identify the vulnerability type or its attack pattern. In these cases, the corresponding references are not created in the CWE.

CWEs are the cornerstone of cybersecurity activities. They contain information about a vulnerability and possibly about how to identify it, protect against it, detect it, respond to it, and recover from it.

The CVE database is very simple. Each CVE entry has a name, a description, references to external sources, and some maintenance information. NVD (National Vulnerability Database) [4] is based on the CVE database. Each CVE entry in NVD additionally contains some metrics.

The CVE description must follow one of the following two patterns, as described in [2]:
• [VULNTYPE] in [COMPONENT] in [VENDOR] [PRODUCT] [VERSION] allows [ATTACKER] to [IMPACT] via [VECTOR].
• [COMPONENT] in [VENDOR] [PRODUCT] [VERSION] [ROOT CAUSE], which allows [ATTACKER] to [IMPACT] via [VECTOR].

The product ([PRODUCT]) can be identified in the following combinations:
• “[VENDOR_NAME] [PRODUCT_NAME]”,
• “[PRODUCT_NAME]”, with keywords (the product has no name),
• the product name is written as the vendor names it,
• “[PRODUCT_NAME] (aka [ALT_NAME])”,
• “[PRODUCT_NAME] ([ACRONYM])”,
• “[PRODUCT_NAME] (formerly [OLD_NAME])”,
• “[PRODUCT_NAME] and [OTHER_PRODUCT_NAME]”,
• “[PRODUCT_NAME], as used in [BUNDLING_PRODUCT]”,
• “[PRODUCT_NAME] [COMPONENT_TYPE] for [PLATFORM]”.
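Both description patterns quoted from [2] end with the same tail, "allows [ATTACKER] to [IMPACT] via [VECTOR]". As a rough illustration of why this tail is the easier part to recognize, the following Python sketch splits a description on these keywords. It is not the GATE component used in this research: the regular expression, the function name split_description, and the sample sentence with its made-up vendor and product are assumptions for illustration only. The head of the description (vulnerability type, component, vendor, product, version) is deliberately left unsplit, and an impact phrase that itself contains " via " would be cut too early.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for the shared tail of both CVE phrasing templates:
#   ... allows [ATTACKER] to [IMPACT] via [VECTOR].
# "(?:which\s+)?" covers the second template, where the tail follows the root cause.
TAIL = re.compile(
    r"^(?P<head>.+?),?\s+(?:which\s+)?allows?\s+"
    r"(?P<attacker>.+?)\s+to\s+"
    r"(?P<impact>.+?)\s+via\s+"
    r"(?P<vector>.+?)\.?$"
)


def split_description(description: str) -> Optional[Dict[str, str]]:
    """Return head/attacker/impact/vector, or None if the tail is not found."""
    match = TAIL.match(description.strip())
    return match.groupdict() if match else None


# Made-up description that follows the first template.
sample = (
    "Cross-site scripting (XSS) vulnerability in search.php in "
    "ExampleVendor ExampleProduct 1.2.3 allows remote attackers to "
    "inject arbitrary web script or HTML via the q parameter."
)

parts = split_description(sample)
if parts is not None:
    for name, value in parts.items():
        print(f"{name}: {value}")
```

Descriptions that do not follow either template make the function return None, which corresponds to the entries that would still need manual annotation.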
The version ([VERSION]) can be represented in several variants:
• “The version 1.2.3”,
• “The versions 1.2.3, 2.3.1, and 3.1.2”,
• “The version 1.2.3 and earlier”,
• “The versions 1.2.3, 2.3.1, 3.1.2, and earlier”,
• “The versions before 1.2.3”,
• “The versions before 1.2.3, 2.x before 2.3.1, and 3.x before 3.1.2”,
• “The versions 1.2.1 through 1.2.3”,
• “The versions 1.2.1 through 1.2.3 and 2.0.1 through 2.3.1”,
• “The versions 1.2.3, 2.0.3 before 2.3.1, and 3.0.1 through 3.1.2”,
• “Product A 1.2.3 and Product B 4.5.6”,
• “Product A 1.2.3, 2.3.1, and 3.2.1 and Product B 4.5.6, 5.6.4, and 6.5.4”.

When [VERSION] is used in disclosure phrasing, the combinations are:
• “Tested: 1.2.3”,
• “Tested 1.2.3. Earlier versions are affected.”,
• “Fixed in 1.2.3”,
• “1.2.3 to 2.3.1 or Tested: 2.3.1. Introduced in 1.2.3”,
• “1.2.3 and later”,
• “Product A 1.2.3 and Product B 2.3.4”,
• “v1.2.3”.

The [ATTACKER] can be remote attackers, remote authenticated users, local users, physically proximate attackers, remote [TYPE] servers, guest OS users, guest OS administrators, context-dependent attackers, attackers, [EXTENT] user-assisted [ATTACKER], and man-in-the-middle attackers.

The [VULNTYPE] is descriptive, but more than one vulnerability type (CWE) may be applicable, and more than one component may be affected. Pattern examples given in [2] are:
• Cross-site scripting (XSS) vulnerability in [COMPONENT] in [VENDOR] [PRODUCT] [VERSION] allows remote attackers to inject arbitrary web script or HTML via the [PARAM] parameter.
• Multiple cross-site scripting (XSS) vulnerabilities in [VENDOR] [PRODUCT] [VERSION] allow remote attackers to inject arbitrary web script or HTML via the [PARAM] parameter to (1) [COMPONENT1], (2) [COMPONENT2], ..., or (n) [COMPONENTn].
• Multiple cross-site scripting (XSS) vulnerabilities in [COMPONENT] in [VENDOR] [PRODUCT] [VERSION] allow remote attackers to inject arbitrary web script or HTML via the (1) [PARAM1], (2) [PARAM2], ..., or (n) [PARAMn] parameter.
• Multiple cross-site scripting (XSS) vulnerabilities in [VENDOR] [PRODUCT] [VERSION] allow remote attackers to inject arbitrary web script or HTML via the (1) [PARAM1] or (2) [PARAM2] parameter to [COMPONENT1]; the (3) [PARAM3] parameter to [COMPONENT2]; ...; or (n) [PARAMn] parameter to [COMPONENTm].
• SQL injection vulnerability in [COMPONENT] in [VENDOR] [PRODUCT] [VERSION] allows [ATTACKER] to execute arbitrary SQL commands via the [PARAM] parameter.
• Multiple SQL injection vulnerabilities in [VENDOR] [PRODUCT] [VERSION] allow [ATTACKER] to execute arbitrary SQL commands via the [PARAM] parameter to (1) [COMPONENT1], (2) [COMPONENT2], ..., or (n) [COMPONENTn].
• Multiple SQL injection vulnerabilities in [COMPONENT] in [VENDOR] [PRODUCT] [VERSION] allow [ATTACKER] to execute arbitrary SQL commands via the (1) [PARAM1], (2) [PARAM2], ..., or (n) [PARAMn] parameter.
• Multiple SQL injection vulnerabilities in [VENDOR] [PRODUCT] [VERSION] allow [ATTACKER] to execute arbitrary SQL commands via the (1) [PARAM1] or (2) [PARAM2] parameter to [COMPONENT1]; the (3) [PARAM3] parameter to [COMPONENT2]; ...; or (n) [PARAMn] parameter to [COMPONENTm].

The [VECTOR] is the input and/or the processes required to exploit the vulnerability. Several attack vectors may be applicable to the same vulnerability.

The [COMPONENT] is a part of the product. A component can be a trigger point, where the error occurs (possibly in multiple places), or an interaction point, which accepts the vectors. A component may be unknown; in that case, it is skipped in the phrasing. In addition, the message payload can be used as a vector or as a component.
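The multi-item patterns above number their items as (1), (2), ..., (n) and attach parameters to components with the words "parameter to". The Python sketch below shows one way such a vector phrase could be unpacked into (parameter, component) pairs. It is an illustration under stated assumptions, not the JAPE grammar of this research: the names ITEM, TARGET, and vector_pairs are hypothetical, parameter and component names are assumed to contain no whitespace, and only the forms in which the numbered items are parameters are handled (the form "the [PARAM] parameter to (1) [COMPONENT1], ..." is not).

```python
import re
from typing import List, Optional, Tuple

# "(1) name" style items; the name is assumed to be a single token.
ITEM = re.compile(r"\((\d+)\)\s*([^\s,;]+)")
# "parameter to component" or "parameters to component".
TARGET = re.compile(r"\bparameters?\s+to\s+([^\s,;]+)")


def vector_pairs(vector: str) -> List[Tuple[str, Optional[str]]]:
    """Pair every numbered parameter with the component of its clause.

    Clauses are separated by ';'. The component is None when the clause
    names no component after 'parameter to'.
    """
    pairs: List[Tuple[str, Optional[str]]] = []
    for clause in vector.split(";"):
        params = [name for _number, name in ITEM.findall(clause)]
        target = TARGET.search(clause)
        # Drop a sentence-final period that may stick to the component name.
        component = target.group(1).rstrip(".") if target else None
        pairs.extend((param, component) for param in params)
    return pairs


# Made-up vector phrase in the style of the fourth XSS pattern.
example = (
    "the (1) id or (2) name parameter to index.php; "
    "the (3) q parameter to search.php."
)
for param, component in vector_pairs(example):
    print(param, "->", component)
```

Keeping the pairs explicit matters for the later ontology step, because each pair would describe a separate place where the same vulnerability type can be triggered.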
The rules for combining vectors and components are listed below:
• There are two possible component locations: after the vulnerability type but before the product name, and after the vector.
• A trigger point goes before the product name.
• An interaction point goes after the vector.
• A component goes before the product name if you are unsure which type of component it is, or if you think the component can be both a trigger and an interaction point.
• For multiple component/vector pairs, components always go after the vector, no matter their type; dot notation is used.

The aim of this research is to extract knowledge from CVE descriptions. For that purpose, GATE [5] is used. In the next section, GATE is briefly described.

3 GATE Environment

Our intention is to annotate CVE descriptions in a way that permits ontology individuals to be generated automatically for each CVE.

GATE is an open-source solution for the whole life cycle of text processing. There are many GATE modules, but here the focus is on GATE Developer, which is an integrated environment for language processing development. Its purpose is information extraction based on text annotations.

GATE has many components (language, processing, and visualization resources). The standard set of resources is called CREOLE (a Collection of Reusable Objects for Language Engineering) [6]. ANNIE (A Nearly-New Information Extraction system) [7] is a subset of CREOLE components tuned for the English language. It makes intensive use of components implemented in JAPE (Java Annotation Patterns Engine) [8].

GATE is also a template work process for language engineering. The arrangement of the ANNIE components within the standard workflow is as follows:
1. GATE inputs a single document or a set of documents (a corpus). All documents in a corpus must have the same format. The document formats accepted by GATE include XML, HTML, SGML, and plain text.
2. Initially, the document is tokenized into words, numbers, and punctuation.
English Tokenizer or Unicode Tokenizer can be used. Tokens are annotations that have attributes.
3. Then, the tokenized text can be processed with POS Tagger, which annotates the parts of speech, such as noun, verb, adjective, etc.
4. Gazetteer annotates the text with known names. Essentially, it uses pre-prepared lists of names.
5. Sentence Splitter annotates the sentences in the text using the punctuation rules of the language.
6. Semantic Tagger annotates some well-defined kinds of text: Person, Location, Organization, Money, Percent, Date, Address, Identifier, and Unknown.
7. OrthoMatcher does not introduce new named annotations, but assigns types to unclassified proper names.
8. Pronominal Coreference annotates quoted texts and processes pronouns.

Because the GATE source is freely distributed, the user can modify GATE components, create new components, and rearrange the components of the process. Ontologies can be used with Onto Gazetteer for text annotation. The user can develop JAPE components that fully manipulate ontologies and create new individuals within them.

4 Annotating CVEs

The first step is to load the CVE documents into GATE. For this step, corpora have to be created. The CVE database is available as one XML document, in which every Vulnerability element is a CVE. GATE's import process can be configured to separate each Vulnerability element into a different document in the corpus.

The CVE database is available in two formats: the original CVE format and CVRF. The latter is simpler and contains only the last updated version, so it is more suitable for import into GATE.

There are more than 128 000 CVEs and, as a result, the loading process is very slow. It is recommended to create five corpora and to load each of them with around 25 000 documents, because GATE fails to import more than 30 000 documents.

Then English Tokenizer tokenizes the documents in the corpora. It is recommended to save the XML tags in the resulting annotation set.
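The recommendation above, one document per Vulnerability element and corpora of about 25 000 documents, can be prototyped outside GATE before the import. The Python sketch below streams an XML file and groups the serialized Vulnerability elements into batches. It assumes only what is stated above, namely that each CVE is an element whose local name is Vulnerability; the function names, the decision to ignore XML namespaces, and the tiny in-memory sample are assumptions for illustration, and the actual import in this research is performed by GATE itself.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator, List


def local_name(tag: str) -> str:
    """Strip a namespace written as '{uri}name' and return 'name'."""
    return tag.rsplit("}", 1)[-1]


def vulnerability_batches(source, batch_size: int = 25_000) -> Iterator[List[str]]:
    """Yield lists of at most batch_size serialized Vulnerability elements.

    source is a file name or a binary file object. The file is streamed with
    iterparse, so the whole database is never held in memory at once.
    """
    batch: List[str] = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "Vulnerability":
            continue
        batch.append(ET.tostring(elem, encoding="unicode"))
        elem.clear()  # release the children of the element just processed
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Tiny made-up sample standing in for the real CVE file.
sample_xml = (
    b"<doc>"
    b"<Vulnerability><Title>CVE-0000-0001</Title></Vulnerability>"
    b"<Vulnerability><Title>CVE-0000-0002</Title></Vulnerability>"
    b"<Vulnerability><Title>CVE-0000-0003</Title></Vulnerability>"
    b"</doc>"
)

for number, docs in enumerate(vulnerability_batches(io.BytesIO(sample_xml), batch_size=2), start=1):
    print(f"corpus {number}: {len(docs)} documents")
```

With the default batch size, a database of a little more than 128 000 entries would be split into six batches rather than five, so the batch size would have to be raised slightly (while staying under the 30 000 limit reported above) to end up with exactly five corpora.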
The next processing steps follow the standard procedure: POS Tagger, Gazetteer, Sentence Splitter, Semantic Tagger, OrthoMatcher, and Pronominal Coreference.

The key problem in annotating CVE descriptions is the combination of product and vendor. For example, the vendor name can be part of the product name. The product and the vendor occupy key positions in the phrasing template, and these positions facilitate the recognition of the other phrasing elements. All product and vendor names are listed at [9]. These lists can be used with Gazetteer to annotate products and vendors.

The annotation of the other elements of the CVE phrasing template ([VULNTYPE], [COMPONENT], [VERSION], [ATTACKER], [IMPACT], [VECTOR], and [ROOT CAUSE]) requires the development of a processing component in JAPE.

[VULNTYPE] has to be a CWE, but the vulnerability types in CVE descriptions usually do not refer to a CWE. In the best case, a vulnerability type is a CWE name (without the CWE identifier). There are two ways to deal with this problem. The first approach is to extract all CWE names into lists and to use Gazetteer to annotate vulnerability types. The second approach relies on the fact that the vulnerability type has a fixed position in both phrasing templates, which can be used to annotate it. At this stage of the research, it is not clear which approach should be used for vulnerability type annotation. A combination of the two may be better. In any case, some manual work must be done.

The [COMPONENT] has no keywords and cannot be taken from any list, but it has fixed positions in the phrasing templates. The key to annotating it is that vendor and product annotation must precede it. The same considerations apply to [VERSION], [IMPACT], [VECTOR], and [ROOT CAUSE].

The situation with [ATTACKER] is better, because there is some keyword phrasing for it.

5 Conclusion

Annotated CVE descriptions can be used to generate ontology individuals. A GATE processing component has been developed and tested successfully.
The corresponding CVE ontology has been developed, but its description is out of the scope of this paper.

GATE annotation processing components for the key elements of the CVE phrasing template have been implemented, but their recognition efficiency is still not satisfactory. Therefore, additional research on real CVE descriptions will be done to increase the recognition power of the component. Elements that the component does not recognize must be annotated manually, which is available as an option in GATE Developer.

6 Acknowledgements

This research is supported by the National Scientific Program “Information and Communication Technologies for a Single Digital Market in Science, Education and Security (ICTinSES)”, financed by the Ministry of Education and Science.

References

1. MITRE Corporation, Common Weakness Enumeration (CWE), http://cwe.mitre.org, accessed 20.02.2020.
2. MITRE Corporation, Common Vulnerabilities and Exposures (CVE), http://cve.mitre.org, accessed 20.02.2020.
3. MITRE Corporation, Common Attack Pattern Enumeration and Classification (CAPEC), http://capec.mitre.org, accessed 20.02.2020.
4. NIST, National Vulnerability Database (NVD), http://nvd.nist.gov, accessed 20.02.2020.
5. GATE, http://gate.ac.uk, accessed 20.02.2020.
6. GATE, Chapter 4. CREOLE: the GATE Component Model, http://gate.ac.uk/sale/tao/splitch4.html, accessed 20.02.2020.
7. GATE, Chapter 6. ANNIE: a Nearly-New Information Extraction System, http://gate.ac.uk/sale/tao/splitch6.html#chap:annie, accessed 20.02.2020.
8. GATE, Chapter 8. JAPE: Regular Expressions over Annotations, http://gate.ac.uk/sale/tao/splitch8.html, accessed 20.02.2020.
9. CVE Details, The ultimate security vulnerability datasource, http://www.cvedetails.com, accessed 20.02.2020.