=Paper=
{{Paper
|id=None
|storemode=property
|title=Order Theoretical Semantic Recommendation
|pdfUrl=https://ceur-ws.org/Vol-1059/ordring2013-paper2.pdf
|volume=Vol-1059
|dblpUrl=https://dblp.org/rec/conf/semweb/JoslynHPPST13
}}
==Order Theoretical Semantic Recommendation==
Order Theoretical Semantic Recommendation Cliff Joslyn1 , Emilie Hogan1 , Patrick Paulson1 , Elena Peterson1 , Eric Stephan1 , and Dennis Thomas1 Pacific Northwest National Laboratory, firstname.lastname@pnnl.gov Abstract. Mathematical concepts of order and ordering relations play multiple roles in semantic technologies. Discrete totally ordered data characterize both input streams and top-k rank-ordered recommenda- tions and query output, while temporal attributes establish numerical total orders, either over time points or in the more complex case of start- end temporal intervals. But also of note are the fully partially ordered data, including both lattices and non-lattices, which actually dominate the semantic strcuture of ontological systems. Scalar semantic similari- ties over partially-ordered semantic data are traditionally used to return rank-ordered recommendations, but these require complementation with true metrics available over partially ordered sets. In this paper we report on our work in the foundations of order measurement in ontologies, with application to top-k semantic recommendation in workflows. We con- clude that true ordered set metrics are strongly preferable to traditional semantic similarities. 1 Introduction Order and sequence occur frequently in the semantic web and semantic tech- nologies, but in different contexts and manners. First, data themselves can be ordered, for example in a time sequence or order of arrival to a data stream. When time stamps are available, a mathematical “total” (or “linear”) order arises. If events can occur simultaneously, then a “weak order” is needed so as to allow “ties”. And if event occurences can extend over time, then “interval orders” are needed to reflect the interactions between start and stop times of possibly overlapping events with different durations. But data can be ordered from other ways and sources. In fact, the heart of se- mantic technology may be considered to be class structures, which sit at the cores of ontologies as taxonomic semantic categorization hierarchies. Such struc- tures are mathematically “partial orders”, that is tree-like hierarchies supporting multiple inheritance, including lattices. Note that this is the most general case, as all total orders are weak orders, which are all interval orders, which are all partial orders. Also, all lattices are partial orders. Whether input data or semantic categories are totally, weakly, intervally, or partially ordered, in any semantic selection task there is a need to rank-order the resulting calculations, in order to recommend particular choices to the user. And the algorithmic methods used to make such recommendations must measure the data in a way which acccommodates and respects its underlying mathematically ordered nature, perhaps up to and including being fully partially ordered. 2 Our interest is 1) the proper use of metrics in partially-ordered data in order to 2) make top-k rank-ordered recommendations in semantic systems. We report on the performance of order metrics in a test analytic semantic recommendation task within multi-domain scientific workflows as part of the Signature Discov- ery Initiative1 (SDI) [1] hosted by the Pacific Northwest National Laboratory (PNNL). Comparing true order metrics against traditional, non-metric seman- tic similarities, we conclude that within our test knowledgebase, the rank-order recommendations provided by partial order metrics are strongly preferable. 2 Workflows Using the Sequence Analysis Knowledgebase We now introduce our test knowledgebase and use case within which we make recommendations. Our use case concerns identifying signatures within generic sequential data objects from multiple domains. While alignment of biosequences has been in use for decades, it has only been recently recognized as a significant method for analyzing non-biological sequences [8]. The overall workflow is shown in Fig. 1. In the initial “alphabet construction” phase, members of an arbitrary data set “input corpus” are translated into a linear sequence of exchange prim- itives here called “symbols”. Each primitive is then assigned a unique character called a “token” drawn from a limited alphabet using a “map file”, a data struc- ture mapping symbols to tokens via a lossy encoding. This encoding of data to a limited character set allows linear structure to be maintained for analysis while reducing the vocabulary of characters to a computationally manageable number. Sequences encoded as these token strings are represented in the FASTA file for- mat2 as used in bio-informatics. We call encoding outputs “pseudo-proteins”, as they are represented in FASTA files, similarly to real proteins, even if they are representing any arbitrary input. Once encoded, pseudo-proteins are BLASTed against themselves, then hierarchically clustered. Additional stages involve mul- tiply aligning pseudo-proteins to generate sequence features for downstream de- tection, but these do not concern the current effort being reported on here. While this workflow targets any sequential data object which can be translated and encoded into a pseudo-protein, the obvious cases are from computational biology. Here inputs can be either gene sequences, with a map file implementing the genetic code of codons to amino acids; or protein sequences directly (with a null map file). But as shown in Table 1, we also include cyber objects, in par- ticular disassembled executable files with actual processor opcodes as symbols; and system log files, with log events (e.g. “login” or “logoff”) as symbols. The goal for recommendation is to rank order available choices of data objects and component connections in terms of semantic appropriateness, e.g. not rec- ommending a cyber objet for a genetics task. We developed a Sequence Analysis Ontology (SAO) to model our use case, including the types of data objects passed in to and out of the workflow components, and the operations performed by the workflow components. The SAO, and our methodology, uses object relations to 1 http://signatures.pnnl.gov 2 http://zhanglab.ccmb.med.umich.edu/FASTA/ 3 Fig. 1. Sequence analysis workflow. represent semantic relations of interest. For analytical and testing purposes, we augment the SAO with OWL individuals to create a Sequence Analysis Knowl- edgeBase (SA-KB) as a static representation of one time state of the overall workflow environment. Each OWL individual thus represents a specific data ob- ject or component port which may be available at any one time. Domain Subdomain Symbols Genomics Codons Biology Proteomics Amino acids Executables Opcodes Cyber Log files Log events Table 1. Domains and symbols used in sample workflows. SA-KB statistics are shown in Table 2. Note that we regard the class hierarchy, the core of the ontology, as a (partially) ordered data object of 157 classes organized into nine levels. But it’s not a tree, with an average of 1.68 parents per class. Thus methods in order theory are necessary to properly measure it. Fig. 2 show an illustrative portion of the SA-KB around the individual FASTAs_Linux, which is a KB entry for a particular FASTA file produced from encoding a Linux executable file. FASTAs_Linux is a member of the class FASTA_File_Set (classes shown in ovals, dashed lines show class membership). Below we will de- note C(x) for the classes (potentially multiple) of an individual x, so here we have C(FASTAs Linux) = {FASTA File Set}. FASTAs_Linux is connected to other individuals (shown in boxes), represent- ing data objects on which it depends semantically, by object properties. For 4 Property Value # Classes (asserted) 157 # Classes (inferred) 168 Average # children/class 2.02 Average # parents/class 1.68 Class hierarchy height 9 # Object properties 31 # Individuals 116 Table 2. SA-KB statistics. Fig. 2. A portion of the SA-KB around the FASTAs Linux individual. Individuals are in boxes, classes are in ovals. Dashed lines indicate class membership, all others are object properties or class hierarchy. 5 example, FASTAs_Linux was encoded with a particular mapfile represented by MapFile_Executables, of class Map_File, and is thus linked to it by Depends_On. MapFile_Executables effects a mapping from opcode symbols to tokens, and so is itself related by File_Constructed_Using to two individuals: Opcodes, of class Symbol_Set, representing the symbols of opcodes; and Cyber_Alphabet of class Alphabet, representing the token set used by the cyber domain. Recommendation proceeds when, given a fixed target individual (e.g. the FASTA file FASTAs_Linux) at a particular point in the workflow (e.g. as input to the BLASTer), calling for an object of a given type (e.g. scoring matrix), which of many candidates are recommended? We wish to see recommendations rank or- dered by being most semantically similar, for example the matrix SM_Executable used for executable files being highly ranked, while SM_Genetics used for genetic sequences loweest ranked. 3 Ontology Metric Annotation Methodology We now describe our semantic annotation metric methodology. Given a class hierarchy, we have a number of metrics to measure pairwise distances between classes. We collect the sets of classes associated with individuals, and can then measure the overall extent of those sets of classes within the ontology, and using a Hausdorff set distance can measure the overall metric similarity between them. This supports the rank-ordered recommendation of the most closely related ob- jects of the same data type otherwise. 3.1 Distances in Ordered Semantic Structures Real-world semantic hierarchies such as subsumption and composition are usu- ally cast as Directed Acyclic Graphs (DAGs) on a finite set of classes P . They are typically bounded above with a top element > ∈ P , with ∀a ∈ P, a ≤ >; can have (a moderate amount of) multiple inheritance; and branch downward very strongly. These are best represented mathematically as partially ordered sets (posets) P = hP, ≤i [3], where ≤ ⊆ P 2 is a reflexive, anti-symmetric, and transitive binary relation. Each DAG as described above uniquely determines a bounded poset P by taking its transitive closure and including a bottom bound ⊥ ∈ P such that ∀a ∈ P, ⊥ ≤ a. The first thought towards a metric in a graphical structure such as a semantic hierarchy is as the length of the minimum undirected path dM P (a, b) = min |p|. p∈undirected paths(a↔b) But such a distance ignores both the directional nature and the level structure of the semantic hierarchy. Instead, the knowledge systems literature has focused on semantic similarities [2, 10] to perform a similar function. These are available when P P is equipped with a probabilistic weighting function p : P → [0, 1], with a∈P p(a) = 1. While p can be derived, for example, from the frequency with which terms appear in documents [4], or which genes are annotated to bio- ontology nodes [5], the simplest approach is to uniformaly weight each a ∈ P proportionally, so that ∀a ∈ P, p(a) ≡ π, where π = |P1 | . 6 This then allows the definition of the information content of a node as X IC(a) = − log2 π p(b) = − log2 (π| ↓ a|) , b≤a where ↓ a : = {b ≤ a} is the downset of a, then set of its descendants, so that | ↓ a| is the number of such descendants. Then the information content of the most informative common ancestor of a, b ∈ P is M (a, b) : = max IC(c). c∈↑ a∩↑ b where ↑ a is defined analogously to ↓ a as the upset of a. This then allows us to define the first of three standard semantic similarity-based (SS-based) distances, the Jiang-Conrath distance [10], as dJC (a, b) = IC(a) + IC(b) − 2M (a, b). The other two follow from the Resnik and Lin similarity measures [10], after normalization, to dervie corresponding distances as: M (a, b) 2M (a, b) dR (x, y) = 1 + , dL (x, y) = 1 − . log2 (π) IC(a) + IC(b) While all of these, including dM P , are somewhat standard metrics, our purpose is more general, since we may not have such a weighting function available, and semantic similarities are not required to be metrics satisfying the triangle inequality. In seeking out the proper mathematical grounding, we turn to or- der metrics [6, 9] which can use, but do not require, a quantitative weighting. For details about order metrics built from isotone and antitone lower and up- per semimodular functions on ordered sets, see [9]. In this work, we use the cardinality-based distances. First we have the upper and lower distances: du (a, b) = | ↑ a| + | ↑ b| − 2| ↑ a ∩ ↑ b|, dl (a, b) = | ↓ a| + | ↓ b| − 2| ↓ a ∩ ↓ b|, Having two distances, upper and lower, may appear arbitrary or unfortunate, but they behave differently.3 It may at first appear to be more natural to use upper distance, since we’re then “looking upwards” towards the top bound > ∈ P . But when P is top-bounded, strongly down-branching, and with multiple inheritance (as in our cases), then it might be argued that it is preferable to use lower distance, since up-sets are typically very small and narrow, frequently single chains; where down-sets are large, branching structures. Additionally, this allows siblings deep in the hierarchy to be closer together than siblings high in the 3 Note that it can be the case that the upper distance du (a, b) is the same as dM P (a, b), but this is only required to be true when P is an upper-bounded tree: path length and all these other metrics are generally unrelated. 7 hierarchy (this will be demonstrated below). This is considered valuable, for example, where e.g. “mammal” and “reptile” are considered farther apart than “horse” and “goat”. In practice, these can be combined. Defining the hourglass of a node Ξa = ↑ a ∪ ↓ a, we then have the hourglass-based measure: dT H (a, b) = |Ξa| + |Ξb| − 2|Ξa ∩ Ξb| dT H is a direct analog of dl and du , but it is not a true metric. We need to normalize distance to the size of the structure, so that we are mea- suring the relative proportion of the overall structure two nodes are apart, or in other words, what proportion of their potential maximum distance. These normalized forms are (all ∈ [0, 1]): du (a, b) dl (a, b) dT H (a, b) d¯u (a, b) : = , d¯l (a, b) : = , d¯T H (a, b) = . |P | − 1 |P | − 1 |P | − 2 3.2 Annotation Collection and Measurement The metrics above measure distances between classes in an ontology, and in particular, for any two individuals x, y in our KB, it can measure the distance between its classes C(x), C(y). But in order to compare the entire semantic context of an individual x, we need to identify not just its classes C(x), but additionally the classes C(z), C(w) of any individuals z, w which x is related to, that is, to which x is connected via object properties. We may not wish to invoke all object properties in such connections, but perhaps particular ones. And these should be transitive, so that through transitive closure we can find both directly and indirectly related individuals. The SAO has two transitive object relations , sao:depends on and sao:has part, to play this role. But we don’t wish to restrict ourselves to just sao:depends on and sao:has part proper, but rather all of their sub-relations as well, for exam- ple sao:has corpus, which inherits from sao:depends on. So, for an individual x, let A(x) be the class annotations of x, defined as: A(x) is the set of all classes which either x or some other individual y, to which x is either directly or indi- rectly linked by a particular, transitive object property, or some sub-relation of one, are members of. From our example from Fig. 2, we have: A(FASTAs Linux) = { sao:alphabet, sao:corpus, sao:cyber sequence, sao:executable file, sao:FASTA file set, sao:linux executable file, sao:map file, sao:opcode, sao:symbol set } (1) Consider an individual x with classes C(x) and full set of class annotations A(x). We wish to have measures of A(x) for various x, and to compare A(x) and A(y) for different individuals x, y. We first wish to capture the “spread” of A(x) within the ontology. Given a distance d, we use P P ¯ C1 ∈A(x) C2 ∈A(x) d(C1 , C2) Ed (x) = |A(x)| ∈ [0, 1], 2 8 as the extent of x, which is normalized by the number of pairs of classes drawn from A(x). Next, given two individuals x, y with classes C(x), C(y) and anno- tations A(x), A(y), then we are interested in comparing the aggregate distance d(A(x), A(y)) between those sets of annotations. To do so, we use a Hausdorff distance [7] as a standard method. Given a distance d, then we have ¯ ¯ Hd (x, y) = max max min d(C, D), max min d(C, D) ∈ [0, 1]. C∈A(x) D∈A(y) D∈A(y) C∈A(x) If d is a true distance function on P (positive definite, symmetric, satisfies tri- angle inequality) then Hd is also a distance function. 4 Metrics Behavior The left side of Fig. 3 shows distributions of the 169 2 = 14, 196 distances d(C1 , C2) between the 168+1 (for the bottom node) classes in the SA-KB. The right side shows the distributions of the 116 2 = 6, 670 Hausdorff distances Hd(x, y) between the individuals. The left side of Fig. 4 shows the distribution of the extents E(x) of the individuals, while the right side shows the distribution of |A(x)|, the number of classes annotating each individual. We note the following: Fig. 3. (Left) Distances d(C1 , C2 ) between classes C1 , C2 . (Right) Hausdorff distance Hd(x, y) between annotation sets of individuals A(x), A(y). – In Figs. 3 and 4, the distributions shown are not mutually correlated, rather each is sorted separately in descending order, and then overlaid in the figure. – The SS-based distances are convex, and of the SS-based distances, JC is likely the best in the sense of the most discriminative, although it has large regions in the high distances where it fails to distinguish class pairs. – For the poset distances, lower dominates upper (as discussed above), and is concave. When combined with lower, the resulting hourglass distance dT H is less concave and more linear, providing a wide range of discriminatory values, although less in the lower values. Below we will show that SS-based distances provide poor rank orderings for recommendations, wo while we will return to this issue, in the sequel dT H will be considered preferentially. 9 Fig. 4. (Left) Extents E(x) of the individuals. (Right) The number of classes |A(x)| annotating each individual. Fig. 5. A portion of the inferred SAO class hierarchy showing class annotations for just FASTAs Linux (marked with ‘FL’ in blue), just SM Executable (marked with ’SM’ in red), and both (marked with both). 10 – The Hausdorff distances are a “selection” of the values appearing in the class distances, in virtue of its max-min composition, effectively acting as a coarsened representation. – Extents Ed (x) and counts |A(x)| are shown separately, and our normal- ization factor is effective, with generally small correlations coefficients be- tween Ed and |A|, especially for the poset distances, with Pearson correlation ρ(ET H (x), |A(x)|) = 0.23, while for the SS-based distances ρ > 0.5. 5 A Detailed Recommendation Example In our example in Fig. 2, from Equ. 1 we have that FASTAs_Linux is annotated by |A(x)| = 9 classes and, which are marked by ‘FL’ in blue in Fig. 5, and has extent ET H (x) = .0381. This is relatively low extent, but a relatively high count, as can be seen in Fig. 4. Referring back to Fig. 1, consider that we are at the portion of the workflow where a BLASTer component has been identified, and connected to its input FASTA file using the output of the upstream encoder component. In order to complete the instantiation of the BLASTer component, we need to identify its one lacking input, namely a scoring matrix, and we need one which is as seman- tically compatible as possible. The five available scoring matrices are shown in Table 3, each of which has its own set of annotations, as shown in Table 4, which is arranged with gaps to show where annotations sets overlap and differ. Hd(FASTAs Linux, y) Minpath TH Resnik Lin J+C sao:SM Executable 0.012 0.041 0.749 0.676 0.607 sao:SM Server Logfiles 0.012 0.053 0.749 0.676 0.607 sao:SM Logfiles 0.012 0.059 0.749 0.676 0.607 sao:SM Genetics 0.012 0.148 0.749 0.676 0.607 Table 3. Hausdorff distances between the annotating classes from FASTAs Linux and all the scoring matrices in SA-KB, by different distances d. Sorted up by TH. Consider the one example scoring matrix SM_Executable. On the one hand, it differs from FASTAs_Linux in that it is annotated to sao:scoring_matrix, since it is a scoring matric, but to neither sao:FASTA_file_set, since it is not a FASTA file, nor to sao:linux_executable_file, since it can be used with any executable, not being specific to any particular operating system. The right side of Fig. 5 shows this annotation set in the dual ‘FL SM’ markings. In measuring the difference in the annotation sets A(FASTAs Linux) and A(SM Executable), the first thought is to cast them as sets X, Y and measure their symmetric difference |X∆Y | = |(X \ Y ) ∪ (Y \ X)|, the number of different elements. We have |A(FASTAs Linux)∆A(SM Executable)| = 3 for these three differing classes. Table 4 shows ∆ for all the scoring matrices, along with HT H , the Hausdorff for the hourglass measure. But we actually observe that HT H is strongly preferred to ∆, in that it not only gets the correct rank ordering, but can also distinguish SM_Server_Logfiles as being preferable to SM_Logfile, in virtue of the fact that it doesn’t just count the differences, it can weight them by semantic distance, in this case that server_log_file is more specific 11 x =FASTAs Linux y =SM Executable y=SM Server Logfiles y=SM Logfiles y =SM Genetics HT H (x, y) 0.041 0.053 0.059 0.148 A(x)∆A(y) 3 8 8 12 E(y) 0.042 0.038 0.041 0.076 |A(y)| 8 8 8 9 edam:data 2976 sao:alphabet sao:alphabet sao:alphabet sao:alphabet sao:alphabet sao:codon sao:corpus sao:corpus sao:corpus sao:corpus sao:corpus sao:cyber sequence sao:cyber sequence sao:cyber sequence sao:cyber sequence sao:executable file sao:executable file sao:FASTA file set sao:file sao:gene sequence sao:linux executable file sao:log event sao:log event sao:log file sao:map file sao:map file sao:map file sao:map file sao:map file sao:opcode sao:opcode sao:scoring matrix sao:scoring matrix sao:scoring matrix sao:scoring matrix sao:server log file sao:symbol set sao:symbol set sao:symbol set sao:symbol set sao:symbol set Table 4. Annotation sets for FASTAs Linux vs. all scoring matrices. than log_file. Finally SM_Genetics is the least recommended scoring matrix, coming from another domain entirely, it is most divergent in both ∆ and HT H . Considering the other metrics available, and referring back to Table 3, we can see the Hausdorffs for other distances, and that HT H is preferable to any of the SS- based measures, in that it can distinguish all of these examples, and rank-orders them correctly, where the SS-based distances cannot. 6 Overall Comparison Broadening from our one example, we can consider an evaluation of the perfor- mance of order metrics relative to SS-based distances. To accomplish this, we considered all possible recommendations of a candidate individual against a fixed source individual in the four cases implied by the upper parts of our workflow in Fig. 1, specifically: 1) recommend symbol sets (candidates) against a mapfile (target); 2) recommend a map file against a fixed symbol set; 3) a scoring matrix against a FASTA file; and finally 4) a FASTA file against a scoring matrix. There were 33 targets across all four groups, and for each we independently es- tablished a “ground truth” for the ordering of the candidates. For example, in Table 3, ground truth for candidate scoring matrices against the FASTAs Linux target is the rank ordering shown, but with Server_Logfiles and Logfiles reversed. Then for each recommendation, we measured the Spearman rank cor- relation rs (averaging ranks over ties) for ground truth using each of the metrics. Results are shown in Fig. 6. Note that rs = 1 indicates complete agreement on rank order, while rs = −1 is a complete reversal of that order. Also, rs does not exist when all scores are the same, as exemplified by the SS-based metrics in Table 3. This was always true for minpath dM P , which was thereby eliminated, and was frequently true for the SS-based distances, as can be seen in Fig. 3. 12 Fig. 6. (Left) Spearman’s rank correlation rs for each of the 33 recommendations; (Right) Average rs for each group of recommendations. While our results only hold within our test use case and KB, these were con- structed independently of our evaluation. Poset metrics, especially lower and hourglas, provided strong performance for top-k semantic ranking, with high rank correlations above 80%. They were almost always preferable to SS-based metrics, and sometimes remarkably so, including negative rs values for some SS-based metrics. Acknowledgments This research is part of the Signature Discovery Initiative (http://signatures.pnnl.gov) at Pacific Northwest National Laboratory. It was conducted under the Labora- tory Directed Research and Development Program at PNNL, a multi-program national laboratory operated by Battelle for the U.S. Department of Energy. References 1. N Baker, JL Barr, G Bonheyo, CA Joslyn, K Krishnaswami, M Oxley, R Quadrel, LH Sego, MF Tardiff, AS Wynne. Research towards a systematic signature dis- covery process. In Wshop. on Signature Discovery for Intelligence and Security, IEEE Intelligence and Information 2013, 2013. 2. Alexander Butanitsky and Graeme Hirst. Evaluating wordnet-based measures of lexical semantic relatedness. Computational Linguistics, 32:1:13–47, 2006. 3. BA Davey and HA Priestly. Introduction to Lattices and Order. Cambridge UP, Cambridge UK, 1990. 4. C Fellbaum, ed. Wordnet: An Electronic Lexical Database. MIT Press, 1998. 5. PW Lord, Robert Stevens, A Brass, and CA Goble. Investigating semantic sim- ilarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics, 10:1275–1283, 2003. 6. B Monjardet. Metrics on partially ordered sets - a survey. Discrete Mathematics, 35:173–184, 1981. 7. Munkres. James; Topology (2nd edition). Prentice Hall, 280–281., 1999. 8. C. Oehmen et al. An organic model for detecting cyber events. In Proc. Sixth Annual Workshop on Cyber Security and Information Intelligence Research, 2010. 9. Chris Orum and Cliff A Joslyn. Valuations and metrics on partially ordered sets. Technical report, 2009. http://arxiv.org/abs/0903.2679v1. 10. C Pesquita, D Faria, A Falcão, P Lord, FM Couto. Semantic similarity in biomed- ical ontologies. PLOS Computational Biology, 5:7, 2009.