CPE Ontology Vladimir Dimitrov [0000-0002-7441-253X] Sofia University „St. Kliment Ohridski”, Faculty of Mathematics and Informatics 1164 Sofia, 1 James Bourchier Blvd., Bulgaria cht@fmi.uni-sofia.bg Abstract. Common Platform Enumeration (CPE) is maintained by NIST as a structured schema for naming of information technology systems, software and packages. CPE is described in a several NIST documents (the last version): • Common Platform Enumeration: Naming Specification. Version 2.3. • Common Platform Enumeration: Name Matching Specification. Version 2.3. • Common Platform Enumeration: Dictionary Specification. Version 2.3 • Common Platform Enumeration: Application Language Specification. Version 2.3. CPE names are building block in NIST classification systems for vulnerabilities (CVE), weaknesses (CWE) and attack patterns (CAPEC). In this paper, CPE names are described as an ontology in OWL Manchester Syntax. This ontology is generated from NIST “Official Common Platform Enumeration Dictionary”. Its purpose is to be a reference ontology for the ontologies representing vulnerabilities, weaknesses and attack patterns. Keywords: CPE, OWL, Ontology, Cybersecurity. 1 Introduction CPE (Common Platform Enumeration) has been developed by MITRE Corporation [1], but since November 2014, it is supported only by NIST [2]. This is a structured naming scheme for information technology systems, software and packages (platforms). The latest version of the CPE is 2.3 and is described in the NIST series of publications [3-6]. The naming specification is presented in [3]. Here, the logical structure of CPE names is defined. The name matching specification [4] describes a method for comparing two CPE names. Here, CPE names are treated as search patterns. In general, there is no difference in the notation for CPE names and search patterns. The official NIST dictionary [5] contains only “basic” CPE names. These “basic” CPE names can also be considered as search patterns because they rep- Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). resent a class of platforms that have the same vulnerabilities, vulnerabilities, and attack patterns. Each search pattern corresponds to a set of “basic” CPE names in the diction- ary. The “basic” CPE name as a search pattern has only one element – the “basic” CPE name itself. The publication [4] defines how CPE names are compared as search patterns, i.e. through the sets of their corresponding “basic” CPE names. Complex CPE name configurations can be specified with the CPE applicabil- ity language [6]. These configurations are used in guidelines, policies, and other places where platforms are referenced as CPE names. CPE names are the main building blocks in the NIST specifications and in the maintained vulnerability databases [7]. In NVD, CPE names are used to describe platforms that are affected by the vulnerabilities. This is done through the applicability language. Statements in this language refer to vulnerable platform configurations. Before proceeding to the CPE ontology, a brief presentation the naming specification has to be done. 2 CPE naming The CPE names are described with an abstract naming model WFN (Well-Formed Name). Abstract names can be bounded to specific syntax. There are two specific syntaxes described in [3]: URI [8] and FS (Formatted String). WFN is just a logical construction in this specification. It is an unordered set of attribute-value pairs that are in the form “attribute = value”. Attribute names are not case sensitive. The underline (the symbol “_”) is treated as a letter. The WFN name format is: wfn:[a1=v1, a2=v2, …, an=vn] The following attribute names are used: “part”, “vendor”, “product”, “ver- sion”, “update”, “edition”, “language”, “sw_edition”, “target_sw”, “target_hw” and “other”. Each attribute can participate no more than once in the CPE name. If an attribute does not participate in the CPE name, its value by default is the logical value ANY. Attribute values can be “logical” or character strings. Additional restrictions may be imposed on the individual attributes. The “logical” values are ANY and NA. The first value ANY (any value) means that any value is applicable for that attribute. The second value NA (not applicable / not used) means that for this attribute cannot be assigned any mean- ingful or valid value, or that the attribute is not used in the name. 303 The attribute values that are character strings consist of printed UTF-8 char- acters. They are enclosed in quotation marks. The underscore is considered as a letter. There are three special characters: “\” backslash, “*” star and “?” question mark. The backslash is used to escape characters. All non-alphanumeric characters must be escaped in the attribute value string, including the backslash itself when used without its special meaning. The attribute value string cannot be “\ _”. The star means zero or more random symbols in place, and the question mark – one random symbol. In fact, these two symbols are used to set matching patterns. However, there are restrictions on their use: • The special characters “*” and “?” can be at the beginning or end, i.e. cannot be used in the middle of the pattern string. • A single star cannot be an attribute value. • The star cannot be used more than once in a sequence. The question mark can be a single attribute value, and can be used more than once in a row. Star and a question mark at the beginning or the end of the string does not make sense, but it is permitted the string to starts or ends with a star and several question marks. These rules for attribute values are presented in [3] as a grammar in ABNF (Augmented Backus-Naur Form). Specific restrictions on attribute values are: • The “part” attribute can have a value of “a” for applications; “o” for op- erating systems; or “h” for hardware devices. • The “edition” attribute, in general, has ANY value, but may have a string value for backward compatibility with the previous versions of the speci- fication. The names of the attributes correspond to their meaning, but there are attrib- utes whose names are not so directly clear: • “sw_edition” fixes the product to a specific market or class of end users. • “target_sw” determines the software environment in which the product operates. • “target_hw” sets the architectural set of machine instructions. Bytecode intermediate languages are also considered as an architectural set of ma- chine instructions. • “language” is a valid language label according to RFC 5646 [10], but only language and region codes are used. This attribute describes the lan- guage of the user interface. • “other” is anything else that can be used to identify the class. Examples from [3] for WFN names: 304 wfn:[part=”a”, vendor=”microsoft”, product=”internet_ explorer”, version=”8\.0\.6001”, update=”beta”, edition=NA] wfn:[part=”a”, vendor=”hp”, product=”insight_ diagnostics”, version=”7\.4\.0\.1570”, sw_ edition=”online”, target_sw=”windows_2003”, target_ hw=”x64”] There are three operations defined for WFN: • new() – creates an empty WFN, i.e. there are no attribute-value pairs in it, or, equivalently, the attributes are initialized with ANY value. • get(w, a) – from WFN (argument “w”) returns the value of the sec- ond argument (“a”). If no value is set for this attribute, it will return ANY. • set(w, a, v) – in the WFN “w” of the attribute “a” sets the value “v”. The absence of an attribute in the WFN is equivalent to initializing it with an ANY value, and then deleting an attribute is equivalent to assigning to it the ANY value. The above considerations have been used in the implementation of a module for manipulating CPE names. The original idea was to use the cpe 1.2.1 module from the Python repository. This module supports all published versions of CPE: 1.1, 2.2 and 2.3, as well as the syntax of WFN, URI and FS. The module has minimal gaps in its implementation, but in general, it turned out to be completely unusable for processing huge amounts CPE names – it is too slow. Most likely this is due to the universal and simultaneous support of the all three versions of the specification. The above limitations necessitated the development of new module only for CPE 2.3 and with minimum functionality required only for conversion between the three formats (WFN, URI and FS). The binding of the abstract WFN is done with the two specific formats URI and FS. In the specification, the URI format is included rather for backward com- patibility with previous versions, while preference is given to the FS format. From the point of view of OWL ontologies, an URI is more appropriate for identifier of individuals in an ontology because it is an IRI in the sense of an OWL – each URI is an IRI. FS simply structures attributes without considering their representation ca- pabilities as IRI. In fact, the content of an FS can be considered as a set of attrib- ute values in the CPE ontology linked as values to data properties. Therefore, it is unnecessary to maintain FS in the ontology in any other way. 305 3 Binding WFN to URI The URI syntax used in CPE is presented in [3] as a grammar with ABNF. This is a restriction on the URI. The language label is in a simplified version. Note to the case: It is generally assumed that the control of CPE contents is performed by NIST, i.e. the language label is set according to the requirements. New CPEs are not entered in the ontology; instead, they are imported from the CPE dictionary. The CPE’s URI starts with “cpe:/” (header or prefix) and then follow the attribute’s values of “part”, “vendor”, “product”, “version”, “update”, “edition”, and “language” in the specified order and separated by a colon. There is also a colon after the header and before “part”. If a value of an attribute is missing, its place is marked with an empty string, i.e. “::”. The absence of trailing attributes until the header itself is allowed. Simply, the last missing values and their colons are omitted. |The URI’s format with re- moved trailing empty attribute values will be called “compressed” URI format. CPE URI version 2.3 maintains backward compatibility with CPE URI ver- sion 2.2. This is achieved by “packaging” the new attribute values with the value of the “edition” attribute. It is discussed below. Special characters and punctuation, in the attribute values, coding with the percentage notation is required. The logical value ANY is represented in the URI with the empty string at the position of the attribute, i.e. “” or in place of the attribute position it is represented as “::”. The logical value NA is represented by “-“. In WFN, non-alphanumeric characters in attribute value strings must be pre- ceded by “\”, except for “*” and “?” when they have been used as special sym- bols. In URI, non-alphanumeric characters are represented in percentage encod- ing. In WFN, the characters minus “-” and dot “.” are “escaped”, i.e. they are represented as “\-” and “\.”. In URI, they are not in the percentage notation. The special characters “?” and “*” when not “escaped” are represented by “%01” and “%02” encodings, respectively. 3.1. Packaging of the new attributes CPE version 2.3 has four new attributes: “sw_edition”, “target_sw”, “target_hw”, and “other”. When one of them has a value other than ANY, they are “packaged” in the specified order in the component “edition” of the URI. Otherwise they are skipped. In the packaged format, the symbol “~” is used as a separator. In fact, 306 the sixth component of the URI packs five attributes: the old “edition” and the four new attributes. Packaging consists of concatenation of attribute values in the specified order, using “~” as a prefix and delimiter. If the four new attributes have an ANI value, then no packaging is performed and the only value in the sixth component is that of “edition”. 4 Unbinding URI to WFN If the URI meets the format requirements of CPE 2.3, then it can be “unbind” to WFN. URI components 1-5 and 7 are decoded in the corresponding values as: • the empty string is represented by the logical value ANY; • the unit value “-” – with NA; • the dot “.” and minus “-” are escaped; • and decode the percentage forms. The sixth component is unpacked if it starts with “~”. 5 Binding WFN to FS The formatted string is similar to URI, but unfortunately, it is not. Many of URI restrictions are released. The formatted string starts with “cpe:2.3:” and is followed by the attribute values separated by “:”. Its format is: cpe:2.3: part : vendor : product : version : update : edition : language : sw_edition : target_sw : target_hw : other The symbols that are not “escaped” in the values of its attributes are the let- ters, numbers, “-”, “.” and “_”. The logical values ANY and NA are represented by “*” and “-”, respectively. All other characters are “escaped” with “\”. The special symbols for the logi- cal values, when used as such (as a stand-alone value), are not “escaped”. Each attribute must have a value in the formatted string. The binding process is relatively simple. Each WFN attribute value is pro- cessed separately and then the resulting processed values are concatenated. Since there is a difference between the “escaped” symbols in WFN and FS, the work must be a bit more precise here. Each “escaped” symbol is checked (if preceded by “\”). The characters “.”, “-” and “_” are “escaped” in WFN, but not in FS. The conver- sion of “escaped” characters in WFN to characters in the formatted string is: • When a character is “escaped” in WFN, it remains the same in FS. • From the “escaped” dot, minus and underscore (“.”, “-” and “_”) is re- moved the escape symbol (“\”). 307 The binding and unbinding algorithms are described in pseudo code in the bind_to_fs(wfn) and unbind_to_fs(uri) functions. There are two other functions recommended by the specification: con- vert_uri_to_fs(uri) and convert_fs_to_uri(fs). 6 CPE ontology The ontology has been developed in OWL 2 Manchester Syntax [11]. The reason for this is that Manchester Syntax is compact and easy to read by humans. The development was carried out with the Protégé tool [12], which supports the above version of OWL. A general scheme of the classes is presented in Fig. 1. The ontology follows the specification of CPE 2.3, but it is an incomplete implementation of CPE 2.3 requirements and recommendations. It cannot be re- garded as CPE 2.3 compatible implementation. Fig. 1. Class hierarchy. URI identifiers are selected to be ontology individual identifiers. Each CPE name description in the NIST dictionary contains both a FS name and an URI name. The latter is suitable for IRI individuals in the ontology. CPE is the main class in the ontology. It has three main subclasses: Applica- tion, Hardware and OS, which correspond to the classification given by the “part” attribute. In such a way, greater search flexibility is achieved and the “part” at- tribute itself does not need to be stored as a data property. The class Deprecated is the fourth subclass of CPE. This is the subclass of obsolete names. An obsolete name can be from the other three subclasses of CPE and therefore Deprecated is a subclass of one of Application, Hardware and OS. The definition of this class is: CPE and (Application or Hardware or OS) 308 The class Deprecation is at the level of the class CPE. This utility class as- sociates the obsolete name with its replacements. There are several possible rea- sons a name to be deprecated and for each of them a subclass has been defined: AdditionalInformation, NameCorrection and NameRemoval. It is not clear to what extent obsolete names would play role into the practical use of the ontology. The hierarchy of classes is presented partly in Fig. 2. Fig. 2. Another view on class hierarchy. Deprecated name is associated with an individual of class Deprecation. This is done through the object property “deprecation” of class Deprecated. In this individual (of class Deprecation), the reason for cancellation of the name by subclasses is specified. In the current version of the dictionary, there are mainly deprecated names that have been changed, i.e. of class NameCorrection. Maybe over time the situation will change and other reasons for canceling the names will appear. The definition of the object property “deprecation” is presented below: ObjectProperty: deprecation Domain: Deprecated Range: Deprecation Correcting one name or adding additional information about it may result in several new names. Therefore, an individual in the subclasses of Depreca- tion may point to several new (or already deprected) names. The object property “deprecated-by” of the class Deprecation is used for this purpose. The following is the definition of “deprecated-by”: ObjectProperty: deprecated-by Domain: AdditionalInformation or NameCorrection Range: CPE 309 When a name is removed, then there should be no object property “deprecat- ed-by” in the corresponding individual from class NameRemoval. The considerations in the above two paragraphs should be reflected in spe- cial axioms, but this is not done for reasons of possible errors in CPE dictionary, which would block loading of the entire ontology in Protégé. Every change of name (its cancellation) took place on a certain date. A name may be deprecated gradually, i.e. may be replaced by several new names at differ- ent times. The name can be changed or even removed. Obviously, the considera- tion a removed name not to be discarded (even when it is revoked), is to support the systems that use dictionaries with older content. For example, a CVE weak- ness that references a deprecated name continues to refer to it without the need to change the reference to it in the CVE. Most likely, the vulnerability itself will disappear at some point, but at least until the deprecated CPE name is referenced, it should be kept in the dictionary. However, the CPE dictionary is primarily in- tended to serve other systems. In addition to this scheme of aging CPE names, version 2.3 also supports compatibility with previous versions of the specification. In the latter case, one CPE name is replaced by no more than one other CPE name (removed ones are not replaced). For this purpose it uses the object property “deprecated_by” (here is an underscore, not a minus) of class Deprecated. Most likely, the outdated name substitution mechanism (before CPE 2.3) will not be used, at least in the current content of the dictionary. This will require the omission of the object property “deprecated_by” with all possible axioms for its maintenance. However, an axiom requiring the presence of an object property “deprecation” in individuals of Deprecated has to be added. In the current version of the ontology, neither is done, because backward compatibility is supported. The following is the definition of “deprecated_by”: ObjectProperty: deprecated_by Characteristics: Functional Domain: Deprecated Range: CPE Deprecated names can be viewed as a tree whose lists are removed CPE names or current ones. There are two problematic attributes in the old and the new version of the specification. These are the “deprecation-date” and “deprecation_date” data prop- erties. The last property is inherited from the previous version of the specification and it indicates the date of deprection of the CPE name. This property is a data 310 property in Deprecated. The following are the definitions of “deprecation-date” and “deprecation_date”: DataProperty: deprecation-date Characteristics: Functional Domain: Deprecation Range: xsd:dateTime DataProperty: deprecation_date Characteristics: Functional Domain: Deprecated Range: xsd:dateTime On the other hand, according to the new scheme, a CPE name can be dep- recated in stages and therefore each stage is reflected with an individual from Deprecation where the “deprecation-date” property is used. Both data properties are used in the dictionary, but the relationship between them is not obvious and at least such an information have not been found. The other data properties used are the WFN attributes (without “part”). They belong to the class CPE. The values of these attributes are according to FS rules. There is no data property for the FS identifier. The definitions of the data proper- ties are uniform. Fig. 3 presents the definition of data property “product”. Fig. 3. Definition of data property “product”. 311 The data property “language” is simply a character string – it does not need to follow RFC 5646. The value is controlled in the NIST dictionary. Finally, in addition to the dictionary specification, there are a number of de- scriptions related to official information on the maintenance of the dictionary. This information is transferred to the ontology annotation system. It turned out that, at least officially, the official information is not maintained in NIST. However, its description is included insofar as it is available in the speci- fication. 7 Conclusion The CPE ontology complies with the CPE 2.3 specification. It includes its requirements for backward compatibility with the previous versions. In the ontology development, the basic concept is the minimalism but com- plete inclusion of dictionary information. This means that information, which does not actually participate in the dictionary or is descriptive or duplicated, is presented in annotations or ignored. In this context, preference is given to remov- ing attributes at the expense of building a hierarchy of classes. The second concept in ontology design is maximum search flexibility with SPARQL. This implies the elimination of possible additional no-SPARQL search. For example, searching CPE patterns involves searching with regular expres- sions. The latter mechanism can be used to achieve full compatibility with the requirements for CPE implementation, but is inadequate with the concepts of OWL. The ontology is an extended environment used in conjunction with other ontologies and inference engines. The ontology, and more precisely its content (individuals), is generated from NIST official CPE dictionary. At this time, NIST does not plan to shift CPE dic- tionary from XML to OWL – a lot of extra work needs to be done to do this. CPE dictionary replaces obsolete names with search patterns of CPE names. In fact, these are again CPE names according to [5]. Here in the ontology, this approach is rejected because there is no similar search engine in OWL. The on- tology uses comprehensive referencing of CPE names (base names). Moreover, with this approach, the control over the integrity of the ontology is stronger using standard control tools such as HermiT reasoner that can be applied during the ontology loading in Protégé particularly. The presented ontology is used in another real ontology developed that one for CVE vulnerabilities. CPE ontology generator is published at https://github.com/VladimirDim- itrov1957/CPE-ontology-generator. 312 8 Acknowledgements I would also like to thank NIST for consulting on certain issues in the dictionary and especially to Amy Mahn for her responsiveness. This work was conducted using the Protégé resource, which is supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health. This research is supported by the National Scientific Program “Information and Communication Technologies for a Single Digital Market in Science, Educa- tion and Security (ICTinSES)”, financed by the Ministry of Education and Sci- ence. References 1. MITRE Corporation, CPE, https://cpe.mitre.org, last accessed 25/04/2021. 2. NIST, National Vulnerability Database, NVD, Official Common Platform Enumeration (CPE) Dictionary, https://nvd.nist.gov/products/cpe, last accessed 25/04/2021. 3. NIST, NISTIR 7695, Common Platform Enumeration: Naming Specification Version 2.3, htt- ps://csrc.nist.gov/publications/detail/nistir/7695/final, last accessed 25/04/2021. 4. NIST, NISTIR 7697, Common Platform Enumeration: Dictionary Specification Version 2.3, https://csrc.nist.gov/publications/detail/nistir/7697/final, last accessed 25/04/2021. 5. NIST, NISTIR 7696, Common Platform Enumeration: Name Matching Specification Version 2.3, https://csrc.nist.gov/publications/detail/nistir/7696/final, last accessed 25/04/2021. 6. NIST, NISTIR 7698, Common Platform Enumeration: Applicability Language Specification Version 2.3, https://csrc.nist.gov/publications/detail/nistir/7698/final, last accessed 25/04/2021. 7. NIST, National Vulnerability Database, NVD, https://nvd.nist.gov, last accessed 25/04/2021. 8. W3C, Naming and Addressing: URIs, URLs, ..., https://www.w3.org/Addressing, last accessed 25/04/2021. 9. ISO, ISO/IEC 14977: 1996(E), Information technology — Syntactic metalanguage — Extend- ed BNF, https://www.iso.org/standard/26153.html, last accessed 25/04/2021. 10. IETF, RFC 5646, Tags for Identifying Languages, https://tools.ietf.org/html/rfc5646, last ac- cessed 25/04/2021. 11. W3C, OWL 2 Web Ontology Language, Manchester Syntax (Second Edition, https://www. w3.org/TR/owl2-manchester-syntax, last accessed 25/04/2021. 12. Musen, M.A. The Protégé project: A look back and a look forward. AI Matters. Association of Computing Machinery Specific Interest Group in Artificial Intelligence, 1(4), June 2015. DOI: 10.1145/2557001.25757003 313