<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Summaya Mumtaz</string-name>
          <email>fsummayam@i</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Giese</string-name>
          <email>martingi@i</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, University of Oslo</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>Calculation of similarity between two entities is a key step in several data mining processes. While there are several common similarity measures for continuous data, there is little work for categorical data. Most approaches are purely data-driven and don't consider the inherent dependencies of complex domains such as geological structures, phylogenetics, etc. We propose two new similarity measures that take into account semantic information to calculate the similarity between two categorical values. Semantic information is represented as a hierarchy extracted from an ontology or a domain taxonomy. The first approach calculates semantic similarity by combining the data-driven approach with the hierarchy imposed on the possible categorical values. The second approach ignores the data and uses only the hierarchy to calculate semantic similarity. We apply our methods to a specific complex data mining task in the oil and gas industry: reservoir analogue identification. The two proposed measures are compared to existing data-based measures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The context of this work is the combination of data-based
(statistical) methods with knowledge-based methods in data
science. In many disciplines, there is a considerable body of
domain knowledge available, while data sets may not always
be large enough to support machine learning of complex
relationships. In this work, we look specifically at similarity
measures (or equivalently distance measures), which lie at
the core of a number of machine learning tasks such as
clustering, outlier identification and classification (k-NN). We
concentrate on entities described by categorical data, i.e.
feature values taken from a finite set of possible values with no
inherent order. The domain knowledge we wish to
incorporate is given in the form of hierarchies that can be extracted
from domain ontologies, standard classifications, etc.</p>
      <p>There is a variety of suitable metrics to quantify
similarity for numerical data, such as the Euclidean or Manhattan
distance (Esposito et al. 2000). These methods are not directly
applicable to non-numerical data, and defining
sensible metrics for categorical attributes is challenging.</p>
      <p>
        The most common approach in machine learning
algorithms for handling categorical data is one-hot
encoding
        <xref ref-type="bibr" rid="ref3 ref8">(Alkharusi 2012; Davis 2010)</xref>
        . A binary column is
created for each unique value of the categorical column. This
yields a high-dimensional sparse matrix, containing a
significant proportion of zeros. This approach requires high
computational resources, is unable to handle unseen values and
ignores any domain dependencies known to exist between
values of the same categorical attribute.
      </p>
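      <p>To make this concrete, one-hot encoding can be sketched as follows (a minimal illustration with hypothetical category values, not tied to any particular library):</p>

```python
# Minimal one-hot encoding sketch: each unique category becomes a binary
# column; the resulting matrix is mostly zeros when the value set is large.
# A value unseen at encoding time would have no column, illustrating the
# "unseen values" limitation mentioned above.
def one_hot(values):
    categories = sorted(set(values))           # fixed column order
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                      # exactly one 1 per row
        rows.append(row)
    return categories, rows

cats, matrix = one_hot(["HR Manager", "Programmer", "HR Manager"])
# cats   -> ['HR Manager', 'Programmer']
# matrix -> [[1, 0], [0, 1], [1, 0]]
```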
      <p>
        In a supervised learning approach, the distance d(x, y)
between two categorical values can be defined using the value
distance matrix
        <xref ref-type="bibr" rid="ref26">(Stanfill and L. Waltz 1986)</xref>
        or the modified value
distance matrix
        <xref ref-type="bibr" rid="ref15 ref6">(Cost and Salzberg 1998)</xref>
        .
      </p>
      <p>
        For unsupervised learning, the Hamming distance is used,
and similarity is defined as a matching measure that assigns
1 if both values are identical, and 0 otherwise
        <xref ref-type="bibr" rid="ref1">(Esposito et al.
2000; Ahmad and Dey 2007)</xref>
        . Various similarity measures
have been derived from this distance measure, e.g. the Jaccard
similarity coefficient, the Sokal-Michener similarity measure,
the Gower-Legendre similarity measure, etc. (Esposito et al.
2000). These measures are inherently quite coarse: in the
absence of an ordering between the categorical values, the
only possible distinction is whether two values are identical
or not (Esposito et al. 2000).
      </p>
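      <p>As an illustration, the matching measure underlying these coefficients can be sketched as follows (a minimal sketch; the attribute values are hypothetical):</p>

```python
# Overlap (matching) similarity sketch: each attribute contributes 1 on an
# exact match and 0 otherwise; record similarity is the fraction of matches.
def overlap_similarity(x, y):
    assert len(x) == len(y)
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

overlap_similarity(["Shale", "Jurassic"], ["Shale", "Triassic"])  # 0.5
```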
      <p>To improve on these, frequency-based similarity measures
have been proposed that take the frequency distribution of
different attribute values into account. These measures are
data-driven and hence are dependent on certain data
characteristics such as the size of data, number of attributes,
number of values for each attribute and distribution of
frequency of each value. While data-driven measures perform
well on simple datasets, these measures are unable to take
into account semantic relationships and often don’t perform
well on complex datasets with hidden domain dependencies.
Moreover, a concept of similarity that is based solely on how
often values occur in the data cannot be expected to give
reasonable results in all cases. Using frequencies seems more
like a ‘last resort’, applicable when frequencies are the only
distinguishing feature between categorical values.</p>
      <p>In this paper, we propose an alternative way to measure
similarity for categorical data in an unsupervised setting. We
combine a frequency-based measure with explicitly
represented domain knowledge given in the form of a hierarchy
on attribute values, and we also consider a measure that is
based purely on the hierarchy, without taking frequencies
into account.</p>
      <p>Section 2 describes the related work. Section 3 explains
the problem formulation and the proposed algorithm. Section 4
presents the dataset and the evaluation against
existing algorithms.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Literature Review</title>
      <p>
        The surveys
        <xref ref-type="bibr" rid="ref2 ref4 ref5">(Boriah, Chandola, and Kumar 2008; Alamuri,
Surampudi, and Negi 2014)</xref>
        discuss various similarity
measures for categorical data. Wilson and Martinez
        <xref ref-type="bibr" rid="ref28 ref7">(Wilson and
Martinez 2000)</xref>
        have studied heterogeneous distance
functions for mixed data (categorical and continuous variables)
for instance-based learning in depth. Their approach is based on
supervised learning, where each instance has a class label in
addition to input variables. The focus of this paper is to find
similarity in an unsupervised setting where information
regarding classes is unknown.
      </p>
      <p>
        For unsupervised learning, various techniques have been
proposed
        <xref ref-type="bibr" rid="ref4 ref5">(Boriah, Chandola, and Kumar 2008)</xref>
        . The
majority of these techniques are based only on the data-driven
approach. However, in some other domains like in natural
language processing, research is being conducted to
calculate similarity based on semantics and domain knowledge.
Below, we provide an overview of the existing data-driven
measures, followed by research done in natural language
processing.
      </p>
      <p>
        The simplest similarity measure is the
overlap measure
        <xref ref-type="bibr" rid="ref4 ref5">(Boriah, Chandola, and Kumar 2008)</xref>
        :
a similarity of 1 is assigned when two categorical values are
identical; otherwise, the similarity is 0. The overall
similarity between two data instances of multivariate categorical
data is proportional to the number of attributes in which they
are identical. The overlap measure does not distinguish between
different values of an attribute; hence all matches and all mismatches
are treated equally. Goodall proposed a similarity measure
to normalize similarity between two data instances by the
probability of occurrences in a random sample
        <xref ref-type="bibr" rid="ref27">(W. Goodall
1966)</xref>
        . This measure assigns a higher similarity score to
values which are less frequent. Gambaryan proposed a
similarity measure giving more weight to matches where
the frequency of occurrence of the categorical value is about
one half in the dataset
        <xref ref-type="bibr" rid="ref11">(Gambaryan 1964)</xref>
        . Eskin et al. (2002)
developed a normalization kernel for an intrusion detection
system. This measure assigns more weight to mismatches on
attributes that contain many values. Inverse Occurrence
frequency (IOF) assigns lower similarity values to mismatches
on more frequent values. The IOF measure is
derived from information retrieval
        <xref ref-type="bibr" rid="ref25">(Sparck Jones 2004)</xref>
        and is
associated with the idea of inverse document frequency. The
Occurrence Frequency (OF) measure assigns lower
similarity to mismatches on less frequent values, while mismatches on
more frequent values are assigned higher similarity
        <xref ref-type="bibr" rid="ref4 ref5">(Boriah,
Chandola, and Kumar 2008)</xref>
        .
      </p>
      <p>
        Lin proposed a similarity framework based on
information theory
        <xref ref-type="bibr" rid="ref16">(Lin 1998)</xref>
        . According to Lin, similarity can be
explained in terms of a set of assumptions. If the
assumptions are considered true, the similarity measure
necessarily follows. Therefore, the similarity between two
values is calculated as the ratio between the amount of
information required to state the commonality of both values and
the information needed to fully describe both values
separately. Lin derived similarity measures for word, ordinal and
string data.
      </p>
      <p>
        Das and Mannila’s research is based on the key point that
attribute value similarity is related to other attributes
        <xref ref-type="bibr" rid="ref28 ref7">(Das
and Mannila 2000)</xref>
        . They proposed Iterated Contextual
Distances (ICD) based on the idea that attribute and object
similarities are interdependent. ICD finds attribute similarity,
sub-relation, and row similarity. Ahmad and Dey proposed a
distance-based measure in terms of co-occurrence of values:
the overall distribution of two attribute values is
considered along with their co-occurrence with the values of other
attributes
        <xref ref-type="bibr" rid="ref1">(Ahmad and Dey 2007)</xref>
        .
      </p>
      <p>
        Document or sentence similarity is considered a
basic task for many natural language processing (NLP)
engines such as information retrieval, query answering, and
text summarization. Semantic-based methods use
information from dictionaries (WordNet) to find relatedness between
two terms. Classic methods in NLP are based on the
shortest path measure
        <xref ref-type="bibr" rid="ref23">(Roy et al. 1989)</xref>
        .
        <xref ref-type="bibr" rid="ref15 ref6">(Leacock and Chodorow
1998)</xref>
        proposed a similarity technique based on the
shortest path between nodes in a taxonomy and the number of
nodes.
        <xref ref-type="bibr" rid="ref13">(Huang and Sheng 2012)</xref>
        based their sentence
similarity measure on WordNet information content and
string edit distance, for paraphrase recognition.
      </p>
      <p>However, the techniques mentioned above are not directly
suitable for categorical features. In an NLP setting, there are
many terms in a complete sentence or document that
provide the neighborhood context and aid in understanding the
semantics. Furthermore, NLP tasks are constrained by the
sentence structures and grammar of a particular language such
as the ordering of subject, verb, noun, etc. However,
categorical features are represented by single domain terms with no
obvious representation of neighborhood or the context that
explains the semantic similarity. The main focus here is to
define semantic similarity between categorical terms based
on the characteristics extracted from domain knowledge.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Problem Formulation</title>
      <p>In this section, we first discuss a toy example to identify the
drawbacks of frequency-based similarity approaches.
Further, we provide an overview of metric properties and
semantic similarity to establish the foundation of the proposed
similarity measure.</p>
      <p>We analyze the problems in existing work and inherent
challenges associated with categorical data based on the toy
dataset in Table 1. The dataset consists of candidates’
profiles and we wish to retrieve matching candidates for a given
job advertisement.</p>
      <p>
        Many of the data-driven similarity measures consider two
values of a given categorical attribute to be similar if both
have similar frequency distributions. For instance, the OF
similarity measure for values of an attribute is calculated as
follows
        <xref ref-type="bibr" rid="ref4 ref5">(Boriah, Chandola, and Kumar 2008)</xref>
        .
      </p>
      <p>OF(x, y) = 1 if x = y, and OF(x, y) = 1 / (1 + log(N/f(x)) · log(N/f(y))) otherwise (1)
where f(x) is the number of occurrences of the attribute
value x and N represents the total number of observations in
the data set. The similarity between the pairs ‘Computer
Programmer’ and ‘HR Manager’, and ‘Computer Programmer’ and
‘Software Developer’, based on equation 1 is calculated as:
OF(Comp. Programmer, HR Manager) = 0.64
OF(Comp. Programmer, Soft. Developer) = 0.44</p>
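      <p>Assuming hypothetical frequency counts, the OF computation of equation 1 can be sketched as:</p>

```python
import math

# Occurrence Frequency (OF) similarity sketch, following Boriah et al. (2008):
# identical values score 1; a mismatch scores higher when both values are
# frequent in the data.  The frequency table below is hypothetical.
def of_similarity(x, y, freq, n):
    if x == y:
        return 1.0
    return 1.0 / (1.0 + math.log(n / freq[x]) * math.log(n / freq[y]))

freq = {"Comp. Programmer": 40, "HR Manager": 30, "Soft. Developer": 10}
of_similarity("Comp. Programmer", "HR Manager", freq, 100)  # roughly 0.48
```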
      <p>These numbers would indicate that the Programmer is
more similar to HR Managers than to Developers. However,
based on the evaluation of semantic evidence observed in a
knowledge source (such as an ontology or a standard
classification) shown in Table 2, it is evident that computer
programmers and software developers perform the same work
activities and tasks hence having a greater semantic
similarity.</p>
      <p>Semantic similarity can be made explicit in different
ways, and one of the most prominent is through hierarchies,
which we will use in this paper. Section 3.1 gives
the formal definition of hierarchies.</p>
      <sec id="sec-3-1">
        <title>Hierarchies</title>
        <p>Our similarity measures are based on a given hierarchical
structure of the value range of categorical features. Formally,
we assume that the categorical values for each feature form a
finite, partially ordered set (poset). A poset is a set S together
with a binary relation ⊑ on S, such that (S, ⊑)
satisfies the following properties for all x, y, z ∈ S:
• Reflexivity: x ⊑ x
• Antisymmetry: if x ⊑ y and y ⊑ x, then x = y
• Transitivity: if x ⊑ y and y ⊑ z, then x ⊑ z
If a ⊑ b, we call b an ancestor of a. The intention of a ⊑ b
is that b is in some way more general, broader, etc. than a.
E.g., for the occupations in Fig. 1, TopExecutives ⊑
ManagementOccupations; for data about geographic areas, we
could have Oslo ⊑ Norway ⊑ Europe.</p>
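        <p>For illustration, a hierarchy of this kind can be encoded as child-to-parent pointers; the ancestor test below corresponds to the ordering ⊑ (the node names and map are hypothetical, not taken from the actual taxonomies):</p>

```python
# A tree-shaped hierarchy as child -> parent pointers (the root maps to None).
# ancestor_or_self(a, b) checks a ⊑ b, i.e. b is a (possibly improper)
# ancestor of a.  Node names are illustrative only.
PARENT = {
    "Occupation": None,
    "ManagementOccupations": "Occupation",
    "ComputerOccupations": "Occupation",
    "TopExecutives": "ManagementOccupations",
    "ComputerProgrammer": "ComputerOccupations",
    "SoftwareDeveloper": "ComputerOccupations",
}

def ancestor_or_self(a, b):
    node = a
    while node is not None:       # walk from a up to the root
        if node == b:
            return True
        node = PARENT[node]
    return False

ancestor_or_self("TopExecutives", "ManagementOccupations")  # True
```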
        <p>If domain knowledge is given in the form of an
ontology, in some cases (depending on the modeling style), the
relation ⊑ will correspond to parts of the is-a subclass
relation of the ontology, but in others it won’t. E.g. it doesn’t
make sense to consider Norway a sub-class or sub-concept
of Europe, but it still makes sense to consider a hierarchy of
geographic regions.</p>
        <p>A value c ∈ S is called a lowest common ancestor of
two values a ∈ S and b ∈ S if c is the lowest (i.e.
deepest) node that has both a and b as descendants.
It is the first shared ancestor of a and b located farthest from
the root. In a hierarchy, two values always have a lowest common
ancestor, denoted a ⊔ b. A value is called a leaf value if it
is not the ancestor of any other value.</p>
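        <p>For illustration, the lowest common ancestor can be computed by walking the parent pointers of a tree-shaped hierarchy (the encoding and node names are assumptions for this sketch, not part of the O*net data):</p>

```python
# Lowest common ancestor in a tree given as child -> parent pointers.
# In a tree with a single root, the LCA always exists and is unique.
def lca(a, b, parent):
    ancestors = set()
    node = a
    while node is not None:           # collect a's ancestors up to the root
        ancestors.add(node)
        node = parent[node]
    node = b
    while node not in ancestors:      # first of b's ancestors also above a
        node = parent[node]
    return node

parent = {"Occupation": None,
          "ComputerOccupations": "Occupation",
          "HRManager": "Occupation",
          "ComputerProgrammer": "ComputerOccupations",
          "SoftwareDeveloper": "ComputerOccupations"}
lca("ComputerProgrammer", "SoftwareDeveloper", parent)  # 'ComputerOccupations'
lca("ComputerProgrammer", "HRManager", parent)          # 'Occupation'
```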
        <p>In this paper, we add a restriction to our hierarchies by
only considering mono-hierarchies: we assume that there is
some root value r in the hierarchy, such that a ⊑ r for all
a ∈ S, and that all values except the root have exactly one
direct ancestor. In other words, the hierarchy is tree-shaped.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Semantic Similarity</title>
        <p>
          Semantic similarity refers to similarity based on meaning
or semantic content as opposed to form
          <xref ref-type="bibr" rid="ref24">(Smelser and Baltes
2001)</xref>
          . Semantic similarity measures are automated methods
for assigning a pair of concepts a measure of similarity and
can be derived from a taxonomy of concepts arranged in is-a
relationships
          <xref ref-type="bibr" rid="ref19">(Pedersen, Pakhomov, and Patwardhan 2005)</xref>
          .
The concept of semantic similarity has been applied in
natural language processing for the past decade to solve tasks
such as the resolution of ambiguities between terms,
document categorization or clustering, word spelling correction,
automatic language translation, ontology learning or
information retrieval. Similarity computation for categorical data
can improve the performance of existing machine learning
algorithms
          <xref ref-type="bibr" rid="ref1">(Ahmad and Dey 2007)</xref>
          and may ease the
integration of heterogeneous data
          <xref ref-type="bibr" rid="ref28 ref7">(Wilson and Martinez 2000)</xref>
          .
        </p>
        <p>Is-a relationships in a concept hierarchy encompass
formal classification, properties and relations between concepts
and data. This provides us with a common understanding of
the structure of a domain, explicit domain assumptions and
reuse of domain knowledge. In order to achieve interpretable
and good quality results in machine learning models, it is
vital to take this information into account. This intuition
motivates us to link the notion of similarity based on is-a
relationships with the similarity measures for categorical data.
We develop a framework to use is-a relationships extracted
from a concept hierarchy to quantify semantic similarity and
propose a distance measure for categorical data.
</p>
      </sec>
      <sec id="sec-3-3">
        <title>Proposed Framework</title>
        <p>
          In this paper, we propose two techniques for measuring
similarity based on domain knowledge, extracted as the concept
hierarchy. First, we present a framework for calculating
semantic similarity using information content and concept
hierarchy by modifying Resnik’s idea
          <xref ref-type="bibr" rid="ref21">(Resnik 1970)</xref>
          . To
compare against the performance of the information-content based semantic
measure, we extend the idea and introduce a simple
similarity measure based only on the concept hierarchy.
        </p>
        <p>Further, we are interested in computing global semantic
similarity in a multi-dimensional setting where we have
several hierarchy-structured features. We define the global
similarity between two data objects X and Y in a d-dimensional
setting as
Sim(X, Y) = Σ_{i=1..d} w_i · sim(x_i, y_i) (2)
where sim(x_i, y_i) corresponds to the similarity between the two
values x and y in the i-th dimension and w_i is the weight
associated with each dimension. The following section presents
both frameworks for calculating the semantic similarity
sim(x_i, y_i).</p>
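        <p>Such a weighted per-dimension combination can be sketched as follows (the per-dimension similarity functions are placeholders, and the normalization by the weight sum is an assumption of this sketch):</p>

```python
# Global similarity sketch across d hierarchical features.  sims[i] is the
# per-dimension similarity function and weights[i] its weight; both are
# placeholders here.
def global_similarity(x, y, sims, weights):
    total = sum(w * sim(a, b) for sim, w, a, b in zip(sims, weights, x, y))
    return total / sum(weights)       # normalize so weights need not sum to 1

exact = lambda a, b: 1.0 if a == b else 0.0
global_similarity(["Shale", "Jurassic"], ["Shale", "Triassic"],
                  [exact, exact], [0.7, 0.3])   # 0.7
```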
      </sec>
      <sec id="sec-3-4">
        <title>Information Content Semantic Similarity (ICS)</title>
        <p>
          This approach is based on a modification of Resnik’s idea
          <xref ref-type="bibr" rid="ref21">(Resnik
1970)</xref>
          . Resnik proposed a measure for finding semantic
similarity in an is-a taxonomy based on information content and
defined the similarity between two nodes in a hierarchy as the
extent to which they share common information.
        </p>
        <p>
          In order to formulate the semantic similarity of two given
categorical values, the key intuition is to find the common
information in both values. This information is represented by
the lowest common ancestor in the hierarchy that subsumes
both values
          <xref ref-type="bibr" rid="ref16">(Lin 1998)</xref>
          . If the lowest common ancestor of
two values is close to leaf nodes, that implies both values
share many characteristics. As the lowest common ancestor
moves up in the hierarchy, fewer commonalities exist
between a given pair of values.
        </p>
        <p>For the given dataset, we can map the ‘Occupation’
attribute to the O*net taxonomy1 (Fig. 1) by placing all the
values at the corresponding leaf nodes in the occupation
hierarchy, whereas intermediate nodes represent the lowest
common ancestors for given pairs. In Fig. 1, ‘Computer
Programmer’ and ‘Software Developer’ are both subsumed
by the lowest common ancestor ‘Computer Occupations’,
whereas the lowest common ancestor that subsumes the
concepts ‘HR Manager’ and ‘Computer Programmer’ is
‘Occupation’ (the root node of the occupation hierarchy). Hence,
taking the lowest common ancestor into account, we expect
the similarity between Computer Programmer and Software
Developer to be significantly greater than the similarity
between Computer Programmer and HR Manager.</p>
        <p>Our intuition about the concept of semantic similarity is
that for two categorical values x and y that share lowest
common ancestor c, farthest from the root node, are always
considered to be more semantically similar than to two
categorical values x and z that share lowest common ancestor c0
close to root node. In addition, identical values should have
a maximum similarity of 1.</p>
        <sec id="sec-3-4-1">
          <p>1: https://www.onetcenter.org/taxonomy.html 2: https://www.bls.gov/soc/soc_structure_2010.pdf</p>
          <p>
            In order to formulate the semantic similarity of values
based on the lowest common ancestor, we use the idea of
associating probabilities with the values
            <xref ref-type="bibr" rid="ref21">(Resnik 1970)</xref>
            . We
base ourselves on a function p : S → [0, 1] such that for any
c ∈ S, p(c) represents the probability of the feature value
being ⊑ c. Furthermore, using information theory, we can
state that the information content of a feature having some
value is quantified as the negative log-likelihood
            <xref ref-type="bibr" rid="ref22">(Ross 1976)</xref>
            .
          </p>
          </p>
          <p>For categorical data, we can find the information content I
of the lowest common ancestor c from the information
content of all the leaf values subsumed by c in the hierarchy:
I(c) = −log Σ_{n ∈ leaf(c)} p(n) (3)
where leaf(c) is the set of all leaf values x ∈ S such that
x ⊑ c. The probability of leaf values may be estimated by
the relative frequency:3
p(n) = frequency(n) / N (4)
where N is the number of samples.</p>
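          <p>A sketch of this computation, with an illustrative hierarchy and illustrative counts (not the actual reservoir or occupation data):</p>

```python
import math

# Information content sketch: I(c) is the negative log of the summed relative
# frequency of all leaf values subsumed by c.  children/counts are hypothetical.
def leaves_under(c, children):
    kids = children.get(c, [])
    if not kids:                       # c itself is a leaf
        return [c]
    out = []
    for k in kids:
        out.extend(leaves_under(k, children))
    return out

def information_content(c, children, counts, n):
    p = sum(counts[leaf] for leaf in leaves_under(c, children)) / n
    return -math.log(p)

children = {"Occupation": ["ComputerOccupations", "HRManager"],
            "ComputerOccupations": ["ComputerProgrammer", "SoftwareDeveloper"]}
counts = {"ComputerProgrammer": 25, "SoftwareDeveloper": 25, "HRManager": 50}
information_content("ComputerOccupations", children, counts, 100)  # about 0.693
information_content("Occupation", children, counts, 100)           # 0.0 (root)
```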
          <p>Based on the above definitions, we formulate the information
content based semantic similarity (ICS) between two
categorical values x and y as
Sim(x, y) = 1 if x = y; Sim(x, y) = I(x ⊔ y) / max(I(x ⊔ y)) if x ≠ y (5)
where I(x ⊔ y) denotes the information content of the lowest
common ancestor of x and y, calculated by using equation 3,
and max(I(x ⊔ y)) represents the maximum information
content over all given pairs of leaves and is used for normalization.</p>
          <p>Hierarchy-based Semantic Similarity (HS) As explained
earlier, the main intuition of semantic similarity is based on
the idea that any two values having their lowest common
ancestor close to the leaf nodes should have high similarity, and
vice versa. Hence, we quantify semantic similarity by
considering the level of the lowest common ancestor in the
hierarchy. The level of a node is defined as 1 + the number
of connections between the node and the root.4 The greater the
level of the lowest common ancestor of a given pair of
values in the hierarchy, the more similar the values are. We
formulate the similarity as
Sim(x, y) = 1 if x = y; Sim(x, y) = λ^(d − level(x ⊔ y)) if x ≠ y (6)
where 0 &lt; λ &lt; 1 is a fixed decay parameter, level(n) is
the distance of n from the root in the hierarchy, and d =
max_{n ∈ S} level(n) is the maximum depth of the hierarchy.</p>
          <p>3 Probabilities may also be known from other sources, for
instance known priors for the specific domain.</p>
          <p>4 The level starts from 1, and the level of the root is 1.</p>
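          <p>Reading the HS measure as an exponential decay in the depth gap between the hierarchy and the lowest common ancestor (our reading of equation 6), a sketch with an illustrative hierarchy could be:</p>

```python
# Hierarchy-based similarity (HS) sketch: identical values score 1; otherwise
# similarity decays as the lowest common ancestor (LCA) moves toward the root.
# The parent map, decay value, and depth below are illustrative assumptions.
def level(n, parent):                  # the root has level 1
    d = 1
    while parent[n] is not None:
        n = parent[n]
        d += 1
    return d

def hs_similarity(x, y, parent, decay, max_depth):
    if x == y:
        return 1.0
    ancestors = set()
    node = x
    while node is not None:            # collect x's ancestors up to the root
        ancestors.add(node)
        node = parent[node]
    node = y
    while node not in ancestors:       # first shared ancestor is the LCA
        node = parent[node]
    return decay ** (max_depth - level(node, parent))

parent = {"Occupation": None,
          "ComputerOccupations": "Occupation",
          "HRManager": "Occupation",
          "ComputerProgrammer": "ComputerOccupations",
          "SoftwareDeveloper": "ComputerOccupations"}
hs_similarity("ComputerProgrammer", "SoftwareDeveloper", parent, 0.5, 3)  # 0.5
hs_similarity("ComputerProgrammer", "HRManager", parent, 0.5, 3)          # 0.25
```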
          <p>The main advantage of equation 6 is that the calculation of
semantic similarity no longer requires any input from
training data such as information content. Once the concept
hierarchy is formalized, we can measure the similarity between
any two values including the categorical values not observed
in the training data.</p>
          <p>Below, we explain the evaluation of the
proposed techniques.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Evaluation</title>
      <p>In this section, we compare the ICS and HS approaches
to other similarity measures for the identification of
reservoir analogues of a target reservoir, given a dataset of known
reservoirs. This use case is further explained in Section 4.2
below.</p>
      <sec id="sec-4-1">
        <title>Baseline Methods</title>
        <p>
          The following four state-of-the-art similarity/distance
measures are compared with the proposed techniques:
Occurrence Frequency (OF)
          <xref ref-type="bibr" rid="ref4 ref5">(Boriah, Chandola, and Kumar 2008)</xref>
          ,
Eskin Similarity measure
          <xref ref-type="bibr" rid="ref4 ref5">(Boriah, Chandola, and Kumar
2008; Eskin et al. 2002)</xref>
          , Lin Similarity measure
          <xref ref-type="bibr" rid="ref16">(Lin 1998)</xref>
          and Coupled Similarity Matrix (CMS)
          <xref ref-type="bibr" rid="ref14">(Jian et al. 2018)</xref>
          .
        </p>
        <p>We compare the performance of the different
similarity measures in a recommendation scenario: given a query
item, we compute its similarity to each item in the ‘training’
dataset using Equation 2, and determine the top k items with
the highest similarity.</p>
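        <p>This retrieval step can be sketched as follows (the similarity function is a placeholder for any of the measures discussed; the items are hypothetical):</p>

```python
# Top-k retrieval sketch: rank every training item by similarity to the query
# and keep the k best.  `similarity` stands in for any measure above.
def top_k(query, items, similarity, k):
    scored = sorted(items, key=lambda item: similarity(query, item), reverse=True)
    return scored[:k]

sim = lambda q, x: sum(1 for a, b in zip(q, x) if a == b) / len(q)
items = [["Shale", "Jurassic"], ["Sandstone", "Jurassic"], ["Chalk", "Triassic"]]
top_k(["Shale", "Jurassic"], items, sim, 2)
# [['Shale', 'Jurassic'], ['Sandstone', 'Jurassic']]
```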
        <p>For our evaluation, we do this for all of the different
similarity measures, and compare the outcome to a fixed ‘gold
standard’ list of items to determine the average precision.</p>
        <p>For our experimental evaluation, we have chosen reservoir
analogues (explained in the section below): a complex task
in the Oil and Gas industry. To the best of our knowledge,
there exists no standard machine learning system for solving
this use case. The common industrial practice to date is to
conduct a manual analysis by human experts.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Reservoir Analogues</title>
        <p>
          In the Oil and Gas Industry, during the exploration phase,
analogous reservoirs are used to study reservoirs that lack
critical information. Any reservoir with a deficit of critical
information is known as a “target reservoir”, and
“analogous reservoirs” are ones expected to have similar
characteristics
          <xref ref-type="bibr" rid="ref18">(Martín Rodríguez et al. 2013)</xref>
          .
        </p>
        <p>Usually, a technical evaluation team must analyze various
data types – seismic, well logs, test, and cores – in order to
make the first approximation of analogous reservoirs. Due to
a lack of resources and time constraints, the first
approximation is usually based on the neighboring reservoirs, which provide an
estimate of the fluid and rock properties of the target reservoir.
A single analogue is mostly used because it is in the same
geographic region or basin. This is risky, however, since it
does not always give sufficient information to characterize
a new prospect. Furthermore, it becomes a tedious task for
new target reservoirs where no neighboring reservoir exists.</p>
        <p>
          Limited efforts have been made to identify analogues
based on machine learning
          <xref ref-type="bibr" rid="ref18 ref20">(Martín Rodríguez et al. 2013;
Perez-Valiente et al. 2014)</xref>
          . In order to generate a list of
ranked reservoirs based on similarity, it is important to
automate this process using a standard knowledge source and
to develop a method that is flexible enough to produce
analogues for reservoirs with no neighboring analogues.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Dataset</title>
        <p>The main source of information used in this evaluation is the
dataset of reservoirs licensed by IHS (https://ihsmarkit.com/index.html). It comprises a total
of 43,000 reservoirs and various properties/attributes
associated with each reservoir. According to domain experts, only
a few key parameters are known during the initial stage of
reservoir identification. Hence, for our analysis of retrieving
similar reservoirs, we use the following set of key
parameters/attributes identified by domain experts:
• Depositional Environment
• Lithology
• Age
• Geographical Location
• Structural Setting</p>
        <p>Detailed definitions of these parameters are described in
the section below.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Semantic Information for Attributes</title>
        <p>This section explains the process of standardizing the
semantic information used in the calculation of similarity. Due
to data confidentiality, we only explain two attributes, ‘Age’
and ‘Lithology.’</p>
        <p>Reservoir Age: A geologic age is a subdivision of
geologic time that divides an epoch into smaller parts. A
succession of rock strata laid down in a single age on the geologic
timescale is a stage. Geological time has been divided
into eras, periods and epochs. The named divisions of
geological time are based on fossil evidence. Fig. 2 shows a
part of an ontology developed to show how geological times
are organized into Erathem, Period, Epoch and Age.</p>
        <p>Note that age can also be given on a linear scale, e.g. in
millions of years. However, the characteristics of rocks
deposited in different geologic eras, periods, and epochs differ
so much that their position in the hierarchy is a much better
indicator of similarity than the numerical difference in age.</p>
        <p>Lithology: The lithology of a rock unit is a description
of its physical characteristics visible at outcrop, in hand or
core samples, or with low-magnification microscopy, such
as color, texture, grain size, or composition. There is no
standard ontology for lithology. With the help of geologists, we
developed an ontology that considers all the categorical values
occurring in the data and groups them based on similar physical
characteristics. In Fig. 3, we show a part of this ontology.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Data Pre-processing</title>
        <p>The main challenge associated with the given data is a large
number of categorical values associated with each attribute.
For the attribute ‘Age,’ there are about 250 unique values.
These values are not standardized. Hence, there are instances
where the same category exists in the dataset with various
names. Furthermore, most of the age values are unofficial
names, which are used only in a few specific areas of the
world. With the help of geological experts, we replaced these
unofficial names by standard domain names.</p>
        <p>For the attribute ‘Depositional Environment,’ there are 32
unique values occurring in the given data set. Some
categorical values are merged together based on the same geological
properties identified by domain experts.</p>
        <p>In the original data set, there are 1731 categories of the
attribute ‘Lithology.’ The raw values of lithology contain
abbreviations for the same lithology, unofficial lithology
names, and combinations of various lithologies. Abbreviations
and unofficial names are replaced with standard names, and
combinations are reduced to their primary lithology, leaving
228 unique categories.</p>
<p>Outliers are extreme values that deviate from the other
observations in the data; they may indicate variability in
measurement, experimental errors, or a novelty. To avoid
distorting the results of the statistical analysis, a
step is added to identify, analyze, and delete outliers in the
dataset. In this step, for every attribute, we remove the
values that do not conform to standard domain names.</p>
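The cleaning steps above (synonym replacement, reducing combinations to a primary lithology, and dropping non-standard values) can be sketched as a small normalization pass; the `SYNONYMS` mapping and `STANDARD` vocabulary below are hypothetical stand-ins for the expert-built tables:

```python
# Hypothetical expert-built tables: raw spellings -> standard names, and the
# standard vocabulary itself. The real tables were built with domain experts.
SYNONYMS = {"sst": "Sandstone", "sand": "Sandstone", "lmst": "Limestone"}
STANDARD = {"Sandstone", "Limestone", "Shale"}

def clean(values):
    """Normalize raw categorical values and drop those outside the vocabulary."""
    out = []
    for v in values:
        # map known abbreviations/synonyms to the standard name
        v = SYNONYMS.get(v.strip().lower(), v.strip().title())
        # combinations like "Sandstone/Shale" keep only the primary lithology
        v = v.split("/")[0]
        if v in STANDARD:  # values outside the standard vocabulary are removed
            out.append(v)
    return out
```

For example, `clean(["sst", "sandstone/shale", "tuffite"])` keeps two standardized `"Sandstone"` entries and drops the unrecognized value.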
        <p>After cleaning the data, the comparative evaluation
between ICS, HS and existing similarity algorithms is
conducted.
For the given task, we evaluate the similarity measures
against two main objectives:
• Retrieving the 15 analogues most similar to the target reservoir.
• Producing the results in ranked order, such that the first
retrieved analogue corresponds to the reservoir most similar
to the target reservoir.</p>
        <p>
Mean Average Precision (MAP) is the most commonly
used evaluation metric in information retrieval and object
detection
          <xref ref-type="bibr" rid="ref4">(Baeza-Yates and Ribeiro-Neto 2008)</xref>
          . MAP is the
arithmetic mean of the average precision (AP) values for
an information retrieval system over a set of n query
topics
          <xref ref-type="bibr" rid="ref17">(Liu and Özsu 2009)</xref>
          . It can be expressed as follows:
$$\mathrm{MAP} = \frac{1}{n}\sum_{q=1}^{n}\mathrm{AP}_q \quad (7)$$
        </p>
        <sec id="sec-4-5-1">
          <title>Precision for a classification task is defined as</title>
<p>$$\mathrm{Precision} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalsePositive}} \quad (8)$$</p>
<p>Based on Equation 8, recommender system Precision (P)
is defined as
$$P = \frac{\#\text{ of our recommendations that are relevant}}{\#\text{ of items we recommended}} \quad (9)$$</p>
          <p>For evaluating the performance of recommender systems,
we are only interested in recommending top-N items to the
user. Usually, the higher the number of relevant
recommendations at the top, the more positive is the impression of the
users. Therefore, it is sensible to compute precision and
recall metrics in the first N items instead of all the items. Thus
the precision at a cutoff k is introduced in order to evaluate
ranking, where k is an integer that is set by the user to match
the objective of the top-N recommendations. Average
precision at cutoff k, is the average of all precisions in the places
where a recommendation is a true positive and is defined as
follows:</p>
<p>$$\mathrm{AP@}K = \frac{1}{K}\sum_{i=1}^{K} P(i)\,\mathrm{Rel}(i) \quad (10)$$
where K represents the top K recommendations for the
given query q and Rel(i) shows the relevance of the i-th
recommendation: Rel(i) is 1 if the recommended item was
relevant (a true positive) and 0 otherwise.</p>
<p>Usually, the performance of a recommendation system is
calculated by considering a set of queries. Therefore, given
a set Q of queries, the mean average precision at cutoff K is
$$\mathrm{MAP@}K = \frac{1}{|Q|}\sum_{q \in Q}\mathrm{AP}_q\mathrm{@}K \quad (11)$$
where AP<sub>q</sub>@K is calculated by using Equation 10.</p>
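Equations 10 and 11 can be sketched in a few lines; `ranked` is a system's recommendation list for one query and `relevant` is the gold set of analogues (these names are illustrative):

```python
def ap_at_k(ranked, relevant, k=15):
    """Equation 10: average precision at cutoff k for one query.

    Sums the precision P(i) at each rank i where the recommendation
    is a true positive (Rel(i) = 1), then divides by k.
    """
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:      # Rel(i) = 1
            hits += 1
            score += hits / i     # P(i): precision within the first i items
    return score / k

def map_at_k(queries, k=15):
    """Equation 11: mean of AP@K over a set of (ranked_list, gold_set) queries."""
    return sum(ap_at_k(ranked, gold, k) for ranked, gold in queries) / len(queries)
```

For a ranking `["a", "b", "c"]` with gold set `{"a", "c"}` and k = 3, the true positives sit at ranks 1 and 3, so AP@3 = (1/1 + 2/3)/3 = 5/9.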
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>Experimental Results</title>
        <p>
There is no standard way to evaluate measures of
semantic similarity. Resnik uses human expert
similarity rankings to judge similarity
          <xref ref-type="bibr" rid="ref21">(Resnik 1995)</xref>
          . We follow the
same approach. In order to perform this evaluation, we
selected two target reservoirs ‘Snorre’ and ‘Snøhvit.’ We then
asked our domain experts to produce a gold set for each
reservoir. This gold set contains a set of reservoirs identified
by our experts as most similar to the target reservoir based on
their hindsight knowledge about the target reservoir.
Furthermore, the gold set is produced in a ranked manner: the first
item in the list corresponds to the most similar analogue
and the last item to the least similar reservoir.
        </p>
        <p>
          After acquiring the gold dataset, we perform an
experimental evaluation to compare the performance of the
proposed techniques with three existing similarity measures
(OF
          <xref ref-type="bibr" rid="ref5">(Boriah, Chandola, and Kumar 2008)</xref>
          , Eskin
          <xref ref-type="bibr" rid="ref9">(Eskin et al. 2002)</xref>
          , and CMS
          <xref ref-type="bibr" rid="ref14">(Jian et al. 2018)</xref>
          ) for finding reservoir
analogues. For each selected target reservoir, all the remaining
reservoirs in the dataset are given as input to each
similarity measure and the similarity between the target and all
remaining reservoirs is calculated. The top 15 reservoirs with
maximum similarity are retrieved and are now referred to as
analogues to the target reservoir.
        </p>
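The retrieval step just described amounts to scoring every other reservoir against the target and truncating the ranking; a minimal sketch, where `similarity` stands in for any of the compared measures (ICS, HS, OF, Eskin, CMS):

```python
def top_analogues(target, reservoirs, similarity, n=15):
    """Rank all reservoirs (except the target itself) by similarity to the
    target and return the n most similar ones as analogues."""
    scored = [(r, similarity(target, r)) for r in reservoirs if r != target]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return [r for r, _ in scored[:n]]
```

With a toy similarity (negated absolute difference on numbers), the two nearest neighbors come back first, which is the behavior the evaluation relies on.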
<p>In order to penalize poor estimations, we use
Average Precision (see Equation 10) as the quality criterion for
evaluating the similarity between reservoirs. For this metric,
a higher value corresponds to better results. Table 3 shows
the experimental result of each similarity measure separately
for each target reservoir.</p>
        <p>
As shown in Table 3, the ICS and HS measures outperform
the data-driven similarity measures for both selected
reservoirs. For the target reservoirs 'Snorre' and 'Snøhvit,' the
average precision for ICS is 39% and 57% respectively, which is
higher than the average precision of the other similarity measures.
For HS, the average precision for 'Snorre' and 'Snøhvit' is 59% and
66%. (The similarity measure proposed by Lin
          <xref ref-type="bibr" rid="ref16">(Lin 1998)</xref>
          does not
retrieve any similar analogues in the top k recommendations;
its results are therefore not included in Table 3.)
Further, Table 4 shows that the MAP (Equation 11)
for ICS and HS is 48% and 63% respectively, which is
significantly better than the MAP values of the other algorithms. This
evaluation supports the initial hypothesis that by adding
domain information to the similarity measure, we can increase
the similarity performance for complex categorical data.
        </p>
<p>It is important to note that the results obtained using ICS and
HS are not directly comparable with the gold set provided by
human experts. In order to produce a gold set, human experts
take into account the geological history of the basin,
analysis of geological time periods, and the overall processes
of formation of reservoir rocks. Furthermore, they also use
conceptual facies models, reservoir simulation models, core
samples, and well logs for selecting appropriate analogues. In
contrast, our experimental evaluation of the proposed
technique is based on only a limited part of this information.
Given this, achieving 63% precision, i.e. correctly retrieving
analogues in the top 15 recommendations based only on a
hierarchy-based semantic measure, is remarkable.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion &amp; Future Work</title>
<p>Computing a similarity measure in an unsupervised setting is
a complex task. In this paper, we propose a method based on
domain information extracted in the form of is-a links from
a concept hierarchy. The experimental results in the
previous section show that, by using domain information, the
results are significantly better than those of traditional methods
that find similarity based only on frequency match/mismatch.
In our current work, we approach the problem by
considering the lowest common ancestor in the concept hierarchy,
restricted to mono-hierarchies and an unsupervised
setting. In the future, we want to extend the notion of
similarity for categorical data to a supervised setting for
complex use cases such as mortality prediction in the medical
domain. Furthermore, the idea can be extended to find
similarity for categorical data in poly-hierarchies (i.e., not
tree-shaped).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set</article-title>
          .
          <source>Pattern Recognition Letters</source>
          <volume>28</volume>
          :
          <fpage>110</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Alamuri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Surampudi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Negi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>A survey of distance/similarity measures for categorical data</article-title>
          .
          <source>Proceedings of the International Joint Conference on Neural Networks</source>
<fpage>1907</fpage>
          -
          <lpage>1914</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Alkharusi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Categorical variables in regression analysis: A comparison of dummy and effect coding</article-title>
          .
          <source>International Journal of Education</source>
          <volume>4</volume>
          :
          <fpage>202</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ribeiro-Neto</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Modern Information Retrieval: The Concepts and Technology Behind Search</article-title>
          .
          <source>Addison-Wesley Publishing Company, 2nd edition.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Boriah</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chandola</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Similarity measures for categorical data: A comparative evaluation</article-title>
          .
          <source>In Proceedings of the SIAM International Conference on Data Mining</source>
          , volume
          <volume>30</volume>
          ,
          <fpage>243</fpage>
          -
          <lpage>254</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Cost</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Salzberg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
<year>1993</year>
          .
          <article-title>A weighted nearest neighbor algorithm for learning with symbolic features</article-title>
          .
          <source>Machine Learning</source>
          <volume>10</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Context-based similarity measures for categorical databases</article-title>
          .
          <source>In Lecture Notes in Computer Science</source>
          , volume
<volume>1910</volume>
          ,
          <fpage>201</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Contrast coding in multiple regression analysis: Strengths, weaknesses, and utility of popular coding structures</article-title>
          .
          <source>In Journal of Data Science.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
<string-name>
            <surname>Eskin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Arnold</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Prerau</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Portnoy</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Stolfo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data</article-title>
          .
          <source>Applications of Data Mining in Computer Security</source>
          <volume>6</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          2000.
          <article-title>Classical resemblance measures</article-title>
          . Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Gambaryan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>1964</year>
          .
<article-title>A mathematical model of taxonomy</article-title>
          .
          <source>Izvestiya Akademii Nauk Armenian SSR</source>
          :
          <fpage>47</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Measuring similarity between sentence fragments</article-title>
          .
          <source>In Proceedings of the 2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics</source>
          ,
          <string-name>
            <surname>IHMSC</surname>
          </string-name>
          <year>2012</year>
          , volume
          <volume>1</volume>
          ,
          <fpage>327</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Jian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Unsupervised coupled metric similarity for non-iid categorical data</article-title>
          .
<source>IEEE Transactions on Knowledge and Data Engineering</source>
          PP:
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Leacock</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chodorow</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>Combining Local Context and WordNet Similarity for Word Sense Identification</article-title>
          , volume
          <volume>49</volume>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>An information-theoretic definition of similarity</article-title>
          .
          <source>ICML. Madison</source>
          <volume>1</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
<string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Özsu</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          <year>2009</year>
          .
          <source>Encyclopedia of Database Systems</source>
          . Springer US.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
<string-name>
            <surname>Martín Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Escobar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Embid</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hegazy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lake</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>New approach to identify analogue reservoirs</article-title>
          .
          <source>SPE Economics &amp; Management</source>
          <volume>6</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pakhomov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Patwardhan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Measures of semantic similarity and relatedness in the medical domain</article-title>
          .
          <source>Journal of Biomedical Informatics - JBI.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Perez-Valiente</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
;
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Vieira</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Embid</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Identification of reservoir analogues in the presence of uncertainty</article-title>
          .
          <source>SPE Intelligent Energy Conference and Exhibition.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Resnik</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
<year>1995</year>
          .
          <article-title>Using information content to evaluate semantic similarity in a taxonomy</article-title>
          .
          <source>IJCAI 95.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          <year>1976</year>
          .
          <article-title>A First Course in Probability</article-title>
          . Pearson Education, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
<string-name>
            <surname>Rada</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mili</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bicknell</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Blettner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>1989</year>
          .
          <article-title>Development and application of a metric on semantic nets</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics</source>
          <volume>19</volume>
          :
          <fpage>17</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Smelser</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Baltes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>2001</year>
          .
          <source>International Encyclopedia of the Social &amp; Behavioral Sciences. Elsevier.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
<string-name>
            <surname>Sparck Jones</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          .
          <source>Journal of Documentation</source>
<volume>60</volume>
          :
          <fpage>493</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Stanfill</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
, and
          <string-name>
            <surname>Waltz</surname>
            ,
            <given-names>D. L.</given-names>
          </string-name>
          <year>1986</year>
          .
          <article-title>Toward memory-based reasoning</article-title>
          .
          <source>Commun. ACM</source>
          <volume>29</volume>
          :
          <fpage>1213</fpage>
          -
          <lpage>1228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
<string-name>
            <surname>Goodall</surname>
            ,
            <given-names>D. W.</given-names>
          </string-name>
          <year>1966</year>
          .
          <article-title>A new similarity index based on probability</article-title>
          .
          <source>Biometrics</source>
          <volume>22</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Martinez</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
<year>1997</year>
          .
          <article-title>Improved heterogeneous distance functions</article-title>
          .
          <source>J. of Artif. Intell. Res. 6.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>