=Paper= {{Paper |id=Vol-2600/paper16 |storemode=property |title=Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data |pdfUrl=https://ceur-ws.org/Vol-2600/paper16.pdf |volume=Vol-2600 |authors=Summaya Mumtaz,Martin Giese |dblpUrl=https://dblp.org/rec/conf/aaaiss/MumtazG20 }} ==Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data== https://ceur-ws.org/Vol-2600/paper16.pdf
 Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data

                                            Summaya Mumtaz and Martin Giese
                                                    Department of Informatics,
                                                    University of Oslo, Norway
                                             {summayam@ifi.uio.no, martingi@ifi.uio.no}




                            Abstract

  Calculation of similarity between two entities is a key step in several
  data mining processes. While there are several common similarity measures
  for continuous data, there is little work for categorical data. Most
  approaches are purely data-driven and don't consider the inherent
  dependencies of complex domains such as geological structures,
  phylogenetics, etc. We propose two new similarity measures that take into
  account semantic information to calculate the similarity between two
  categorical values. Semantic information is represented as a hierarchy
  extracted from an ontology or a domain taxonomy. The first approach
  calculates semantic similarity by combining the data-driven approach with
  the hierarchy imposed on the possible categorical values. The second
  approach ignores the data and uses only the hierarchy to calculate
  semantic similarity. We apply our methods to a specific complex data
  mining task in the oil and gas industry: reservoir analogue
  identification. The two proposed measures are compared to existing
  data-based measures.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G.
Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of
the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge
Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto,
California, USA, March 23-25, 2020. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

                         1    Introduction

The context of this work is the combination of data-based (statistical)
methods with knowledge-based methods in data science. In many disciplines,
there is a considerable body of domain knowledge available, while data sets
may not always be large enough to support machine learning of complex
relationships. In this work, we look specifically at similarity measures (or
equivalently, distance measures), which lie at the core of a number of
machine learning tasks such as clustering, outlier identification and
classification (k-NN). We concentrate on entities described by categorical
data: feature values taken from a finite set of possible values with no
inherent order. The domain knowledge we wish to incorporate is given in the
form of hierarchies that can be extracted from domain ontologies, standard
classifications, etc.

   There is a variety of suitable metrics to quantify similarity for
numerical data, such as the Euclidean or Manhattan distance (Esposito et al.
2000). These methods are not directly applicable to non-numerical data.
However, defining sensible metrics for categorical attributes is
challenging.

   The most common approach in machine learning algorithms for handling
categorical data is one-hot encoding (Alkharusi 2012; Davis 2010). A binary
column is created for each unique value of the categorical column. This
yields a high-dimensional sparse matrix containing a significant proportion
of zeros. This approach requires high computational resources, is unable to
handle unseen values, and ignores any domain dependencies known to exist
between values of the same categorical attribute.

   In a supervised learning approach, the distance δ(x, y) between two
categorical values can be defined using the value difference metric
(Stanfill and Waltz 1986) and the modified value difference metric (Cost and
Salzberg 1998).

   For unsupervised learning, the Hamming distance is used, and similarity
is defined as a matching measure that assigns 1 if both values are
identical, and 0 otherwise (Esposito et al. 2000; Ahmad and Dey 2007).
Various similarity measures have been derived using this distance measure,
e.g. the Jaccard similarity coefficient, the Sokal-Michener similarity
measure, the Gower-Legendre similarity measure, etc. (Esposito et al. 2000).
These measures are inherently quite coarse: in the absence of an ordering
between the categorical values, the only possible distinction is whether two
values are identical or not (Esposito et al. 2000).

   To improve on these, frequency-based similarity measures have been
proposed that take the frequency distribution of different attribute values
into account. These measures are data-driven and hence depend on certain
data characteristics, such as the size of the data, the number of
attributes, the number of values for each attribute and the frequency
distribution of each value. While data-driven measures perform well on
simple datasets, they are unable to take semantic relationships into account
and often don't perform well on complex datasets with hidden domain
dependencies. Moreover, a concept of similarity that is based solely on how
often values occur in the data cannot be expected to give reasonable results
in all cases. Using frequencies seems more like a 'last straw' when
frequencies are the only distinguishing feature between categorical values.

   In this paper, we propose an alternative way to measure similarity for
categorical data in an unsupervised setting. We combine a frequency-based
measure with explicitly represented domain knowledge given in the form of a
hierarchy on attribute values, and we also consider a measure that is based
purely on the hierarchy, without taking frequencies into account.

   Section 2 describes the related work. Section 3 explains the problem
formulation and the proposed algorithm. Section 4 presents the dataset and
the evaluation, comparing with existing algorithms.

                      2    Literature Review

The surveys (Boriah, Chandola, and Kumar 2008; Alamuri, Surampudi, and Negi
2014) discuss various similarity measures for categorical data. Wilson and
Martinez (Wilson and Martinez 2000) have studied in depth heterogeneous
functions for mixed data (categorical and continuous variables) for
instance-based learning. Their approach is based on supervised learning,
where each instance has a class label in addition to input variables. The
focus of this paper is to find similarity in an unsupervised setting, where
information regarding classes is unknown.

   For unsupervised learning, various techniques have been proposed (Boriah,
Chandola, and Kumar 2008). The majority of these techniques are based only
on the data-driven approach. However, in other domains, like natural
language processing, research is being conducted to calculate similarity
based on semantics and domain knowledge. Below, we provide an overview of
the existing data-driven measures, followed by research done in natural
language processing.

   The simplest similarity measure used is known as the overlap measure
(Boriah, Chandola, and Kumar 2008). A similarity of 1 is assigned when two
categorical values are identical; otherwise the similarity is 0. The overall
similarity between two data instances of multivariate categorical data is
proportional to the number of attributes in which they are identical. The
overlap measure does not distinguish different values of attributes; hence
matches and mismatches are treated equally. Goodall proposed a similarity
measure that normalizes the similarity between two data instances by the
probability of occurrences in a random sample (W. Goodall 1966). This
measure assigns a higher similarity score to values which are less frequent.
Gambaryan proposed a similarity measure giving more weight to matches where
the frequency of occurrence of the categorical value is about half in the
dataset (Gambaryan 1964). (Eskin et al. 2002) developed a normalization
kernel for an intrusion detection system. This measure assigns more weight
to mismatches on attributes that contain many values. Inverse Occurrence
Frequency (IOF) assigns lower similarity values to mismatches that are based
on more frequent values. The IOF measure is derived from information
retrieval (Sparck Jones 2004) and is associated with the idea of inverse
document frequency. The Occurrence Frequency (OF) measure assigns lower
similarity to mismatches on less frequent values, while mismatches on more
frequent values are assigned higher similarity (Boriah, Chandola, and Kumar
2008).

   Lin proposed a similarity framework based on information theory (Lin
1998). According to Lin, similarity can be explained in terms of a set of
assumptions. If the assumptions are considered true, the similarity measure
necessarily follows. The similarity between two values is then calculated as
the ratio between the amount of information required to state the
commonality of both values and the information needed to fully describe both
values separately. Lin derived similarity measures for words, and for
ordinal and string data.

   Das and Mannila's research is based on the key point that attribute value
similarity is related to other attributes (Das and Mannila 2000). They
proposed Iterated Contextual Distances (ICD), based on the idea that
attribute and object similarities are interdependent. ICD finds attribute
similarity, sub-relation, and row similarity. Ahmad and Dey proposed a
distance-based measure in terms of the co-occurrence of values: the overall
distribution of two attribute values is considered along with their
co-occurrence with the values of other attributes (Ahmad and Dey 2007).

   Document or sentence similarity is considered a basic task for many
natural language processing (NLP) engines, such as information retrieval,
query answering, and text summarization. Semantic-based methods use
information from dictionaries (WordNet) to find the relatedness between two
terms. Classic methods in NLP are based on the shortest path measure (Roy et
al. 1989). (Leacock and Chodorow 1998) proposed a similarity technique based
on the shortest path between nodes in a taxonomy and the number of nodes.
(Huang and Sheng 2012) based their sentence similarity measure on WordNet
information content and string edit distance, for paraphrase recognition.

   However, the techniques mentioned above are not directly suitable for
categorical features. In an NLP setting, there are many terms in a complete
sentence or document that provide the neighborhood context and aid
understanding of the semantics. Furthermore, NLP tasks are constrained by
the sentence structures and grammar of a particular language, such as the
ordering of subject, verb, noun, etc. In contrast, categorical features are
represented by single domain terms, with no obvious representation of the
neighborhood or context that explains the semantic similarity. The main
focus here is to define semantic similarity between categorical terms based
on characteristics extracted from domain knowledge.

                    3    Problem Formulation

In this section, we first discuss a toy example to identify the drawbacks of
frequency-based similarity approaches. Further, we provide an overview of
metric properties and semantic similarity to establish the foundation of the
proposed similarity measure.

   We analyze the problems in existing work and the inherent challenges
associated with categorical data based on the toy dataset in Table 1. The
dataset consists of candidates' profiles, and we wish to retrieve matching
candidates for a given job advertisement.

   Many of the data-driven similarity measures consider two values of a
given categorical attribute to be similar if both have similar frequency
distributions. For instance, the OF similarity measure for values of an
attribute is calculated as follows (Boriah, Chandola, and Kumar 2008).
                      Table 1: Toy Dataset

       User ID    Occupation              Education
       1          Computer Programmer     Bachelors
       2          Administrative Staff    Bachelors
       3          HR Manager              Bachelors
       4          HR Manager              Masters
       5          Software Developer      Bachelors
       6          Computer Programmer     Masters


   OF(x, y) = 1                                                 if x = y
   OF(x, y) = 1 / (1 + log(N/(f(x)+1)) + log(N/(f(y)+1)))       if x ≠ y   (1)

where f(x) is the number of occurrences of the attribute value x and N
represents the total number of observations in the data set. The similarity
between the pairs 'Computer Programmer' and 'HR Manager', and 'Computer
Programmer' and 'Software Developer', based on Equation 1 is calculated as:

   OF(Comp. Programmer, HR Manager) = 0.64
   OF(Comp. Programmer, Soft. Developer) = 0.44

   These numbers would indicate that the Programmer is more similar to HR
Managers than to Developers. However, based on the semantic evidence
observed in a knowledge source (such as an ontology or a standard
classification), shown in Table 2, it is evident that computer programmers
and software developers perform the same work activities and tasks, and
hence have a greater semantic similarity.

   Semantic similarity can be made explicit in different ways, and one of
the prominent ways is through hierarchies, which we will use in this paper.
Section 3.1 explains the formal definition of hierarchies in detail.

3.1   Hierarchies

Our similarity measures are based on a given hierarchical structure on the
value range of categorical features. Formally, we assume that the
categorical values for each feature form a finite, partially ordered set
(poset). A poset is a pair (S, ⊑) of a set S and a binary relation ⊑ on S
satisfying the following properties for all x, y, z ∈ S:
 • Reflexivity: x ⊑ x
 • Antisymmetry: if x ⊑ y and y ⊑ x, then x = y
 • Transitivity: if x ⊑ y and y ⊑ z, then x ⊑ z
If a ⊑ b, we call b an ancestor of a. The intention of a ⊑ b is that b is in
some way more general, broader, etc. than a. E.g., for the occupations in
Fig. 1, TopExecutives ⊑ ManagementOccupations; for data about geographic
areas, we could have Oslo ⊑ Norway ⊑ Europe.

   If domain knowledge is given in the form of an ontology, in some cases
(depending on the modeling style) the relation ⊑ will correspond to parts of
the is-a subclass relation of the ontology, but in others it won't. E.g., it
doesn't make sense to consider Norway a sub-class or sub-concept of Europe,
but it still makes sense to consider a hierarchy of geographic regions.

   A value c ∈ S is called a lowest common ancestor of two values a, b ∈ S
if c is the lowest (i.e. deepest) node that has both a and b as descendants.
It is the first shared ancestor of a and b located farthest from the root.
In a hierarchy, the lowest common ancestor of two values is denoted a ⊔ b. A
value is called a leaf value if it is not the ancestor of any other value.

   In this paper, we add a restriction to our hierarchies by only
considering mono-hierarchies: we assume that there is some root value r in
the hierarchy, such that a ⊑ r for all a ∈ S, and that all values except the
root have exactly one direct ancestor. In other words, the hierarchy is
tree-shaped.

3.2   Semantic Similarity

Semantic similarity refers to similarity based on meaning or semantic
content, as opposed to form (Smelser and Baltes 2001). Semantic similarity
measures are automated methods for assigning a pair of concepts a measure of
similarity, and can be derived from a taxonomy of concepts arranged in is-a
relationships (Pedersen, Pakhomov, and Patwardhan 2005). The concept of
semantic similarity has been applied in natural language processing for the
past decade to solve tasks such as the resolution of ambiguities between
terms, document categorization or clustering, word spelling correction,
automatic language translation, ontology learning or information retrieval.
Similarity computation for categorical data can improve the performance of
existing machine learning algorithms (Ahmad and Dey 2007) and may ease the
integration of heterogeneous data (Wilson and Martinez 2000).

   Is-a relationships in a concept hierarchy encompass formal
classification, properties and relations between concepts and data. This
provides us with a common understanding of the structure of a domain,
explicit domain assumptions and reuse of domain knowledge. In order to
achieve interpretable and good quality results in machine learning models,
it is vital to take this information into account. This intuition motivates
us to link the notion of similarity based on is-a relationships with the
similarity measures for categorical data. We develop a framework that uses
is-a relationships extracted from a concept hierarchy to quantify semantic
similarity, and propose a distance measure for categorical data.

3.3   Proposed Framework

In this paper, we propose two techniques for measuring similarity based on
domain knowledge, extracted as a concept hierarchy. First, we present a
framework for calculating semantic similarity using information content and
the concept hierarchy, by modifying Resnik's idea (Resnik 1970). To compare
against the performance of the information-content based semantic measure,
we extend the idea to a simple similarity measure based only on the concept
hierarchy.

   Further, we are interested in computing global semantic similarity in a
multi-dimensional setting, where we have several hierarchy-structured
features. We define the global similarity between two data objects X and Y
in a d-dimensional
                  Table 2: Occupation Activities and Skills

     Occupation             Work Activity                 Skills
     HR Manager             Liaise between departments    PeopleSoft, SAP
     Computer Programmer    Write programming code        C++, Java, Python
     Software Developer     Modify software programs      C++, Oracle, Python


attribute space as,

   δ(X, Y) = Σ_{i=1..d} w_i δ(x_i, y_i)                                 (2)

where δ(x_i, y_i) corresponds to the similarity between the two values x_i
and y_i in the i-th dimension, and w_i is the weight associated with each
dimension. The following section presents both frameworks for calculating
the semantic similarity δ(x_i, y_i).

Information Content Semantic Similarity (ICS)  This approach is based on a
modification of Resnik's idea (Resnik 1970). Resnik proposed a measure for
finding semantic similarity in an is-a taxonomy based on information
content, and defined the similarity between two nodes in a hierarchy as the
extent to which they share common information.

   In order to formulate the semantic similarity of two given categorical
values, the key intuition is to find the information common to both values.
This information is represented by the lowest common ancestor in the
hierarchy that subsumes both values (Lin 1998). If the lowest common
ancestor of two values is close to the leaf nodes, both values share many
characteristics. As the lowest common ancestor moves up in the hierarchy,
fewer commonalities exist between the pair of values.

   For the given dataset, we can map the 'Occupation' attribute to the O*net
taxonomy¹ (Fig. 1) by placing all the values at the corresponding leaf nodes
in the occupation hierarchy, whereas intermediate nodes represent the lowest
common ancestors for given pairs. In Fig. 1², 'Computer Programmer' and
'Software Developer' are both subsumed by the lowest common ancestor
'Computer Occupations', whereas the lowest common ancestor that subsumes
'HR Manager' and 'Computer Programmer' is 'Occupation' (the root node of the
occupation hierarchy). Hence, taking the lowest common ancestor into
account, we expect the similarity between Computer Programmer and Software
Developer to be significantly greater than the similarity between Computer
Programmer and HR Manager.

   Our intuition about semantic similarity is that two categorical values x
and y whose lowest common ancestor c is farther from the root node are
always considered more semantically similar than two categorical values x
and z whose lowest common ancestor c′ is close to the root node. In
addition, identical values should have the maximum similarity of 1.

   In order to formulate the semantic similarity of values based on the
lowest common ancestor, we use the idea of associating probabilities with
the values (Resnik 1970). We base ourselves on a function p : S → [0, 1]
such that for any c ∈ S, p(c) represents the probability of the feature
value being ⊑ c. Furthermore, using information theory, we can state that
the information content of a feature having some value is quantified as the
negative log likelihood (Ross 1976).

   For categorical data, we can find the information content I of the lowest
common ancestor c from the information content of all the leaf values
subsumed by c in the hierarchy:

   I(c) = −log Σ_{n ∈ leaf(c)} p(n)                                     (3)

where leaf(c) is the set of all leaf values x ∈ S such that x ⊑ c. The
probability of leaf values may be estimated by the relative frequency:³

   p(n) = frequency(n) / N                                              (4)

where N is the number of samples.

   Based on the above definitions, we formulate the information content
based semantic similarity (ICS) between two categorical values x and y as

   Sim(x, y) = 1                                 if x = y
   Sim(x, y) = I(x ⊔ y) / max(I(x ⊔ y))          if x ≠ y               (5)

where I(x ⊔ y) denotes the information content of the lowest common ancestor
of x and y, calculated using Equation 3, and max(I(x ⊔ y)) represents the
maximum information content over all given pairs of leaves and is used for
normalization.

Hierarchy-based Semantic Similarity (HS)  As explained earlier, the main
intuition of semantic similarity is that any two values whose lowest common
ancestor is close to the leaf nodes should have high similarity, and vice
versa. Hence, we quantify semantic similarity by considering the level of
the lowest common ancestor in the hierarchy. The level of a node is defined
as 1 + the number of connections between the node and the root.⁴ The greater
the level of the lowest common ancestor of a given pair of values in the
hierarchy, the more similar the values are. We formulate the similarity as

   Sim(x, y) = 1                                 if x = y
   Sim(x, y) = λ^(d − level(x ⊔ y))              if x ≠ y               (6)

____
   ¹ https://www.onetcenter.org/taxonomy.html
   ² https://www.bls.gov/soc/soc_structure_2010.pdf
   ³ Probabilities may also be known from other sources, for instance known
priors for the specific domain.
   ⁴ Levels start from 1, and the level of the root is 1.
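The two proposed measures can be sketched in a few lines of Python. The
mini-hierarchy, frequency counts and decay value λ = 0.5 below are
illustrative assumptions modelled on the toy example, not part of the actual
O*net taxonomy; the functions follow Equations 3–6 under those assumptions.

```python
from math import log

# Hypothetical hierarchy fragment (child -> direct ancestor), modelled on
# the toy example; node names are illustrative placeholders.
PARENT = {
    "Computer Programmer": "Computer Occupations",
    "Software Developer": "Computer Occupations",
    "HR Manager": "Management Occupations",
    "Computer Occupations": "Occupation",
    "Management Occupations": "Occupation",
    "Occupation": None,  # root
}

def ancestors(x):
    """Chain from x up to the root, inclusive (x is its own ancestor)."""
    chain = []
    while x is not None:
        chain.append(x)
        x = PARENT[x]
    return chain

def lca(x, y):
    """Lowest common ancestor x ⊔ y: first of y's ancestors that subsumes x."""
    anc_x = set(ancestors(x))
    for a in ancestors(y):
        if a in anc_x:
            return a
    raise ValueError("values are not under a common root")

def level(n):
    """Level of a node: 1 + number of edges to the root (root has level 1)."""
    return len(ancestors(n))

def leaves(c):
    """All leaf values subsumed by c (a leaf is no node's parent)."""
    internal = set(PARENT.values())
    return [n for n in PARENT if n not in internal and c in ancestors(n)]

def information_content(c, freq, n_samples):
    """Eq. (3), with leaf probabilities estimated per Eq. (4)."""
    return -log(sum(freq.get(n, 0) / n_samples for n in leaves(c)))

def ics(x, y, freq, n_samples):
    """Eq. (5): information-content similarity, normalised by the maximum
    information content over all distinct leaf pairs."""
    if x == y:
        return 1.0
    all_leaves = leaves("Occupation")  # "Occupation" is the fragment's root
    i_max = max(information_content(lca(a, b), freq, n_samples)
                for a in all_leaves for b in all_leaves if a != b)
    return information_content(lca(x, y), freq, n_samples) / i_max

def hs(x, y, lam=0.5):
    """Eq. (6): hierarchy-only similarity with decay parameter lam (λ)."""
    if x == y:
        return 1.0
    d = max(level(n) for n in PARENT)  # maximum depth of the hierarchy
    return lam ** (d - level(lca(x, y)))
```

On the toy counts from Table 1 (Computer Programmer: 2, Software Developer:
1, HR Manager: 2, N = 6), `ics` ranks Computer Programmer closer to Software
Developer than to HR Manager, and `hs` gives λ¹ vs. λ² for the same two
pairs — the ordering that the OF measure misses.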
                                             Figure 1: O*net Occupation Taxonomy


Where 0 < λ < 1 is a fixed decay parameter, level(n) is           there exists no standard machine learning system for solving
the distance of n from the root in the hierarchy, and d =         this use case. The common industrial practice to date is to
maxn∈X level(n) is the maximum depth of the hierarchy.            conduct a manual analysis by human experts.
   The main advantage of equation 6 is that the calculation of
semantic similarity no longer requires any input from train-      4.2      Reservoir Analogues
ing data such as information content. Once the concept hier-      In the Oil and Gas Industry, during the exploration phase,
archy is formalized, we can measure the similarity between        analogous reservoirs are used to study reservoirs that lack
any two values including the categorical values not observed      critical information. Any reservoir with a deficit of critical
in the training data.                                             information is known as a “target reservoir”, and “analo-
   Below, we explain how to perform the evaluation of the proposed techniques.

                      4    Evaluation
In this section, we compare the ICSD and HDM approaches to other similarity measures for the identification of reservoir analogues of a target reservoir, given a dataset of known reservoirs. This use case is further explained in Section 4.2 below.

4.1   Baseline Methods
The following four state-of-the-art similarity/distance measures are compared with the proposed techniques: Occurrence Frequency (OF) (Boriah, Chandola, and Kumar 2008), the Eskin similarity measure (Boriah, Chandola, and Kumar 2008; Eskin et al. 2002), the Lin similarity measure (Lin 1998), and Coupled Metric Similarity (CMS) (Jian et al. 2018).
   We compare the performance of the different similarity measures in a recommendation scenario: given a query item, we compute its similarity to each item in the ‘training’ dataset using Equation 2, and determine the top k items with the highest similarity.
   For our evaluation, we do this for each of the different similarity measures and compare the outcome to a fixed ‘gold standard’ list of items to determine the average precision.
   For our experimental evaluation, we have chosen reservoir analogues (explained in the section below), a complex task in the Oil and Gas industry.

4.2   Use Case: Reservoir Analogues
“Analogous reservoirs” are ones expected to have similar characteristics (Martín Rodríguez et al. 2013).
   Usually, a technical evaluation team must analyze various data types – seismic, well logs, tests, and cores – in order to make a first approximation of analogous reservoirs. Due to a lack of resources and time constraints, the first approximation is usually based on the neighboring reservoirs, which provide an estimate of the fluid and rock properties of the target reservoir. A single analogue is mostly used because it is in the same geographic region or basin. This is risky, however, since it does not always give sufficient information to characterize a new prospect. Furthermore, it becomes a tedious task for new target reservoirs where no neighboring reservoir exists.
   Limited efforts have been made to identify analogues based on machine learning (Martín Rodríguez et al. 2013; Perez-Valiente et al. 2014). In order to generate a list of ranked reservoirs based on similarity, it is important to automate this process using a standard knowledge source and to develop a method that is flexible enough to produce analogues for reservoirs with no neighboring analogues.

4.3   Dataset
The main source of information used in this evaluation is the dataset of reservoirs licensed by IHS (https://ihsmarkit.com/index.html). It comprises a total of 43,000 reservoirs and various properties/attributes associated with each reservoir. According to domain experts, only a few key parameters are known during the initial stage of reservoir identification.
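The recommendation protocol of Section 4.1 — score every known reservoir against the query and keep the top k — can be sketched as follows. The attribute names and values, and the toy `overlap` measure, are purely illustrative stand-ins; they are not the proposed ICSD/HDM measures or the actual IHS attributes.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]  # categorical attribute name -> value


def top_k_analogues(query: Record,
                    candidates: Dict[str, Record],
                    similarity: Callable[[Record, Record], float],
                    k: int = 15) -> List[Tuple[str, float]]:
    """Rank all candidate reservoirs by similarity to the query and
    return the k most similar ones as (name, score) pairs."""
    scored = [(name, similarity(query, rec)) for name, rec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def overlap(a: Record, b: Record) -> float:
    """Toy match/mismatch similarity: fraction of shared attributes
    with equal values (stand-in for OF, Eskin, CMS, ICS, HS)."""
    shared = set(a) & set(b)
    return sum(a[attr] == b[attr] for attr in shared) / max(len(shared), 1)
```

Any of the compared measures can be plugged in as the `similarity` argument without changing the retrieval loop.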
          Figure 2: The hierarchy of geologic age.

   Figure 3: Ontology showing IS-A relationships for Lithology.

Hence, for our analysis of retrieving similar reservoirs, we use the following set of key parameters/attributes identified by domain experts:
• Depositional Environment
• Lithology
• Age
• Geographical Location
• Structural Setting
   Detailed definitions of these parameters are given in the section below.

4.4   Semantic Information for Attributes
This section explains the process of standardizing the semantic information used in the calculation of similarity. Due to data confidentiality, we only explain the two attributes ‘Age’ and ‘Lithology.’

Reservoir Age: A geologic age is a subdivision of geologic time that divides an epoch into smaller parts. A succession of rock strata laid down in a single age on the geologic timescale is a stage. Geological time has been divided into eras, periods, and epochs. The named divisions of geological time are based on fossil evidence. Fig. 2 shows a part of an ontology developed to show how geological times are organized into Erathem, Period, Epoch, and Age.
   Note that age can also be given on a linear scale, e.g. in millions of years. However, the characteristics of rocks deposited in different geologic eras, periods, and epochs differ so much that their position in the hierarchy is a much better indicator of similarity than the numerical difference in age.

Lithology: The lithology of a rock unit is a description of its physical characteristics visible at outcrop, in hand or core samples, or under low-magnification microscopy, such as color, texture, grain size, or composition. There is no standard ontology for lithology. With the help of geologists, we developed an ontology that covers all the categorical values occurring in the data and groups them based on similar physical characteristics. Fig. 3 shows a part of this ontology.

4.5   Data Pre-processing
The main challenge associated with the given data is the large number of categorical values associated with each attribute. For the attribute ‘Age,’ there are about 250 unique values. These values are not standardized; hence, there are instances where the same category exists in the dataset under various names. Furthermore, most of the age values are unofficial names, which are used only in a few specific areas of the world. With the help of geological experts, we replaced these unofficial names with standard domain names.
   For the attribute ‘Depositional Environment,’ there are 32 unique values occurring in the given dataset. Some categorical values were merged based on shared geological properties identified by domain experts.
   In the original dataset, there are 1731 categories of the attribute ‘Lithology.’ The raw values of lithology contain abbreviations for the same lithology, unofficial lithology names, and combinations of various lithologies. These categories are replaced with the standard names, and combinations are replaced with only the primary lithology, which leads to 228 unique categories.
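A minimal sketch of this normalization step follows. The mapping entries shown are hypothetical examples; the real raw-to-standard tables were compiled with domain experts and are not reproduced here.

```python
from typing import Dict, Optional

# Hypothetical raw-to-standard mappings (illustrative only).
LITHOLOGY_STANDARD: Dict[str, str] = {
    "SST": "Sandstone",               # abbreviation
    "sandst.": "Sandstone",           # abbreviation
    "Sandstone/Shale": "Sandstone",   # combination -> primary lithology
}


def normalize(raw: str, table: Dict[str, str]) -> Optional[str]:
    """Map a raw categorical value to its standard domain name.
    Returns None for values with no standard name, which are then
    treated as outliers and removed from the dataset."""
    if raw in table:
        return table[raw]
    # Already-standard names pass through unchanged.
    return raw if raw in set(table.values()) else None
```

Applying `normalize` to every attribute value and dropping the `None` results implements both the renaming and the outlier-removal steps described above.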
   Outliers are extreme values that deviate from the other observations in the data; they may indicate variability in measurement, experimental errors, or a novelty. In order to avoid a disastrous effect on the results of the statistical analysis, a step is added to identify, analyze, and delete outliers in the dataset. In this step, for every attribute, we remove the values that do not conform to standard domain names.
   After cleaning the data, the comparative evaluation between ICS, HS, and the existing similarity algorithms is conducted.

4.6   Evaluation Method
For the given task, we evaluate the similarity measures on two main objectives:
• Retrieving the top 15 similar analogues to the target reservoir.
• Producing the result in ranked order, such that the first retrieved analogue corresponds to the reservoir most similar to the target reservoir.
   Mean Average Precision (MAP) is the most commonly used evaluation metric in information retrieval and object detection (Baeza-Yates and Ribeiro-Neto 2008). MAP is the arithmetic mean of the average precision (AP) values for an information retrieval system over a set of n query topics (Liu and Özsu 2009). It can be expressed as follows:

                  MAP = (1/n) · Σ_{q=1}^{n} AP_q                      (7)

Precision for a classification task is defined as

        Precision = TruePositive / (TruePositive + FalsePositive)     (8)

Based on Equation 8, recommender-system precision (P) is defined as

   P = (# of our recommendations that are relevant) / (# of items we recommended)   (9)

   For evaluating the performance of recommender systems, we are only interested in recommending the top N items to the user. Usually, the higher the number of relevant recommendations at the top, the more positive the impression of the users. Therefore, it is sensible to compute precision and recall over the first N items instead of all items. Thus, precision at a cutoff k is introduced in order to evaluate the ranking, where k is an integer set by the user to match the objective of the top-N recommendations. Average precision at cutoff K is the average of the precisions at the places where a recommendation is a true positive, and is defined as follows:

               AP_q@K = (1/K) · Σ_{i=1}^{K} P(i) · Rel(i)             (10)

where K represents the top K recommendations for the given query q and Rel(i) shows the relevance of the recommendation: Rel(i) is 1 if the recommended item was relevant (a true positive) and 0 otherwise.
   Usually, the performance of a recommendation system is calculated over a set of queries. Therefore, given a set of queries Q, the mean average precision (MAP_Q@K) of an algorithm is defined as

               MAP_Q@K = (1/Q) · Σ_{q=1}^{Q} AP_q@K                   (11)

where AP_q@K is calculated using Equation 10.

   Table 3: Average Precision for the selected target reservoirs

        Reservoir    ICS     HS     OF     CMS     Eskin
        Snorre        39     59     39      40       34
        Snøhvit       57     66     15      29       27

                Table 4: Mean Average Precision

                     ICS    HS     OF     CMS    Eskin
          MAP         48    63     27      35      30

4.7   Experimental Results
There is no standard way to evaluate similarity measures for semantic similarity. Resnik uses human expert similarity rankings to judge similarity (Resnik 1995). We follow the same approach. In order to perform this evaluation, we selected two target reservoirs, ‘Snorre’ and ‘Snøhvit.’ We then asked our domain experts to produce a gold set for each reservoir. This gold set contains the reservoirs identified by our experts as most similar to the target reservoir, based on their hindsight knowledge about the target reservoir. Furthermore, the gold set is ranked: the first item in the list corresponds to the most similar analogue and the last item corresponds to the least similar reservoir.
   After acquiring the gold dataset, we perform an experimental evaluation to compare the performance of the proposed techniques with three existing similarity measures (OF (Boriah, Chandola, and Kumar 2008), Eskin (Eskin et al. 2002), and CMS (Jian et al. 2018)) for finding reservoir analogues. For each selected target reservoir, all the remaining reservoirs in the dataset are given as input to each similarity measure, and the similarity between the target and all remaining reservoirs is calculated. The top 15 reservoirs with maximum similarity are retrieved and are referred to as analogues to the target reservoir.
   In order to penalize poor estimations, we use Average Precision (see Equation 10) as the quality criterion for the evaluation of similarity between reservoirs. For this metric, a higher value corresponds to better results. Table 3 shows the experimental results of each similarity measure separately for each target reservoir. (The similarity measure proposed by Lin (Lin 1998) does not retrieve any similar analogues in the top-k recommendations; therefore, its results are not included in Table 3.)
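The metrics of Equations 8–11 can be sketched as follows; the recommendation lists and gold sets in the usage are illustrative, not the actual experimental data.

```python
from typing import Iterable, List, Set, Tuple


def average_precision_at_k(recommended: List[str],
                           gold: Set[str], k: int = 15) -> float:
    """AP_q@K as in Equation 10: sum P(i)*Rel(i) over the top-K
    positions, divided by K. P(i) is the precision over the first i
    recommendations; Rel(i) is 1 iff position i holds a gold item."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in gold:          # Rel(i) = 1
            hits += 1
            score += hits / i     # P(i) at a true-positive position
    return score / k


def mean_average_precision(runs: Iterable[Tuple[List[str], Set[str]]],
                           k: int = 15) -> float:
    """MAP_Q@K as in Equation 11: arithmetic mean of AP_q@K over the
    (recommendations, gold set) pair of every query."""
    runs = list(runs)
    return sum(average_precision_at_k(r, g, k) for r, g in runs) / len(runs)
```

For example, with `recommended = ["a", "b", "c"]`, `gold = {"a", "c"}` and `k = 3`, the true positives at ranks 1 and 3 give AP@3 = (1/3)·(1/1 + 2/3) = 5/9.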
   As shown in Table 3, the ICS and HS measures outperform the data-driven similarity measures for both selected reservoirs. For the target reservoirs ‘Snorre’ and ‘Snøhvit,’ the average precision for ICS is 39% and 57%, respectively, which is higher than the average precision of the other similarity measures. For HS, the average precision for ‘Snorre’ and ‘Snøhvit’ is 59% and 66%. Further, Table 4 shows that the MAP (Equation 11) for ICS and HS is 48% and 63%, respectively, which is significantly better than the MAP values of the other algorithms. This evaluation supports the initial hypothesis that by adding domain information to the similarity measure, we can increase similarity performance for complex categorical data.
   It is important to note that the results obtained using ICS and HS are not directly comparable with the gold set provided by human experts. In order to produce the gold set, human experts take into account the geological history of the current basin, an analysis of geological time periods, and the overall processes of formation of the reservoir rocks. Furthermore, they also use conceptual facies models, reservoir simulation models, core samples, and well logs when selecting appropriate analogues. In contrast to this, our experimental evaluation of the proposed technique is based on only a limited part of this information. That a hierarchy-based semantic measure alone correctly retrieves analogues in the top 15 recommendations, achieving 63% precision in this scenario, is remarkable.

           5    Conclusion & Future Work
Computing a similarity measure in an unsupervised setting is a complex task. In this paper, we propose a method based on domain information extracted in the form of is-a links from a concept hierarchy. The experimental results in the previous section show that by using domain information, the results are significantly better than those of traditional methods that compute similarity based only on frequency match/mismatch. In our current work, we approach the problem via the lowest common ancestor in the concept hierarchy, considering mono-hierarchies only and an unsupervised setting. In the future, we want to extend the notion of similarity for categorical data to a supervised setting for complex use cases such as mortality prediction in the medical domain. Furthermore, the idea can be extended to find similarity for categorical data in poly-hierarchies (i.e., hierarchies that are not tree-shaped).

                         References
Ahmad, A., and Dey, L. 2007. A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recognition Letters 28:110–118.
Alamuri, M.; Surampudi, B.; and Negi, A. 2014. A survey of distance/similarity measures for categorical data. In Proceedings of the International Joint Conference on Neural Networks, 1907–1914.
Alkharusi, H. 2012. Categorical variables in regression analysis: A comparison of dummy and effect coding. International Journal of Education 4:202–210.
Baeza-Yates, R., and Ribeiro-Neto, B. 2008. Modern Information Retrieval: The Concepts and Technology Behind Search. Addison-Wesley Publishing Company, 2nd edition.
Boriah, S.; Chandola, V.; and Kumar, V. 2008. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the SIAM International Conference on Data Mining, volume 30, 243–254.
Cost, S., and Salzberg, S. 1993. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning 10.
Das, G., and Mannila, H. 2000. Context-based similarity measures for categorical databases. In Lecture Notes in Computer Science, volume 1910, 201–210.
Davis, M. J. 2010. Contrast coding in multiple regression analysis: Strengths, weaknesses, and utility of popular coding structures. Journal of Data Science.
Eskin, E.; Arnold, A.; Prerau, M.; Portnoy, L.; and Stolfo, S. 2002. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. Applications of Data Mining in Computer Security 6.
Esposito, F.; Malerba, D.; Tamma, V.; and Bock, H.-H. 2000. Classical resemblance measures. Springer Verlag.
Gambaryan, P. 1964. A mathematical model for taxonomy. SSR 47–53.
Goodall, D. W. 1966. A new similarity index based on probability. Biometrics 22.
Huang, G., and Sheng, J. 2012. Measuring similarity between sentence fragments. In Proceedings of the 2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics, IHMSC 2012, volume 1, 327–330.
Jian, S.; Cao, L.; Lu, K.; and Gao, H. 2018. Unsupervised coupled metric similarity for non-IID categorical data. IEEE Transactions on Knowledge and Data Engineering.
Leacock, C., and Chodorow, M. 1998. Combining Local Context and WordNet Similarity for Word Sense Identification, volume 49. MIT Press.
Lin, D. 1998. An information-theoretic definition of similarity. In ICML, Madison.
Liu, L., and Özsu, M. T. 2009. Encyclopedia of Database Systems. Springer US.
Martín Rodríguez, H.; Escobar, E.; Embid, S.; Rodriguez, N.; Hegazy, M.; and Lake, L. 2013. New approach to identify analogue reservoirs. SPE Economics & Management 6.
Pedersen, T.; Pakhomov, S.; and Patwardhan, S. 2005. Measures of semantic similarity and relatedness in the medical domain. Journal of Biomedical Informatics.
Perez-Valiente, M.; Rodriguez, H.; Santos, C.; Vieira, M.; and Embid, S. 2014. Identification of reservoir analogues in the presence of uncertainty. In SPE Intelligent Energy Conference and Exhibition.
Rada, R.; Mili, H.; Bicknell, E.; and Blettner, M. 1989. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics 19:17–30.
Resnik, P. 1995. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI 95.
Ross, S. M. 1976. A First Course in Probability. Pearson Education, Inc.
Smelser, N., and Baltes, P. 2001. International Encyclopedia of the Social & Behavioral Sciences. Elsevier.
Sparck Jones, K. 2004. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28:493–502.
Stanfill, C., and Waltz, D. L. 1986. Toward memory-based reasoning. Communications of the ACM 29:1213–1228.
Wilson, D., and Martinez, T. 2000. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research 6.