=Paper= {{Paper |id=Vol-2600/paper16 |storemode=property |title=Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data |pdfUrl=https://ceur-ws.org/Vol-2600/paper16.pdf |volume=Vol-2600 |authors=Summaya Mumtaz,Martin Giese |dblpUrl=https://dblp.org/rec/conf/aaaiss/MumtazG20 }} ==Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data== https://ceur-ws.org/Vol-2600/paper16.pdf
 Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data

                                            Summaya Mumtaz and Martin Giese
                                                    Department of Informatics,
                                                    University of Oslo, Norway
                                             {summayam@ifi.uio.no, martingi@ifi.uio.no}




                            Abstract

  Calculation of similarity between two entities is a key step in several
  data mining processes. While there are several common similarity measures
  for continuous data, there is little work for categorical data. Most
  approaches are purely data-driven and don't consider the inherent
  dependencies of complex domains such as geological structures,
  phylogenetics, etc. We propose two new similarity measures that take into
  account semantic information to calculate the similarity between two
  categorical values. Semantic information is represented as a hierarchy
  extracted from an ontology or a domain taxonomy. The first approach
  calculates semantic similarity by combining the data-driven approach with
  the hierarchy imposed on the possible categorical values. The second
  approach ignores the data and uses only the hierarchy to calculate
  semantic similarity. We apply our methods to a specific complex data
  mining task in the oil and gas industry: reservoir analogue
  identification. The two proposed measures are compared to existing
  data-based measures.

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G.
Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of
the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge
Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto,
California, USA, March 23-25, 2020. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

                         1    Introduction

The context of this work is the combination of data-based (statistical)
methods with knowledge-based methods in data science. In many disciplines,
there is a considerable body of domain knowledge available, while data sets
may not always be large enough to support machine learning of complex
relationships. In this work, we look specifically at similarity measures (or
equivalently, distance measures), which lie at the core of a number of
machine learning tasks such as clustering, outlier identification and
classification (k-NN). We concentrate on entities described by categorical
data: feature values taken from a finite set of possible values with no
inherent order. The domain knowledge we wish to incorporate is given in the
form of hierarchies that can be extracted from domain ontologies, standard
classifications, etc.

   There is a variety of suitable metrics to quantify similarity for
numerical data, such as the Euclidean or Manhattan distance (Esposito et al.
2000). These methods are not directly applicable to non-numerical data.
However, defining sensible metrics for categorical attributes is
challenging.

   The most common approach in machine learning algorithms for handling
categorical data is one-hot encoding (Alkharusi 2012; Davis 2010). A binary
column is created for each unique value of the categorical column. This
yields a high-dimensional sparse matrix containing a significant proportion
of zeros. This approach requires high computational resources, is unable to
handle unseen values, and ignores any domain dependencies known to exist
between values of the same categorical attribute.

   In a supervised learning approach, the distance δ(x, y) between two
categorical values can be defined using the value difference metric
(Stanfill and Waltz 1986) and the modified value difference metric (Cost and
Salzberg 1998).

   For unsupervised learning, the Hamming distance is used, and similarity
is defined as a matching measure that assigns 1 if both values are
identical, and 0 otherwise (Esposito et al. 2000; Ahmad and Dey 2007).
Various similarity measures have been derived using this distance measure,
e.g. the Jaccard similarity coefficient, the Sokal-Michener similarity
measure, the Gower-Legendre similarity measure, etc. (Esposito et al. 2000).
These measures are inherently quite coarse: in the absence of an ordering
between the categorical values, the only possible distinction is whether two
values are identical or not (Esposito et al. 2000).

   To improve on these, frequency-based similarity measures have been
proposed that take the frequency distribution of different attribute values
into account. These measures are data-driven and hence depend on certain
data characteristics, such as the size of the data, the number of
attributes, the number of values for each attribute and the frequency
distribution of each value. While data-driven measures perform well on
simple datasets, they are unable to take semantic relationships into account
and often don't perform well on complex datasets with hidden domain
dependencies. Moreover, a concept of similarity that is based solely on how
often values occur in the data cannot be expected to give reasonable results
in all cases. Using frequencies seems more like a 'last straw' when
frequencies are the only distinguishing feature between categorical values.

   In this paper, we propose an alternative way to measure similarity for
categorical data in an unsupervised setting. We combine a frequency-based
measure with explicitly represented domain knowledge given in the form of a
hierarchy on attribute values, and we also consider a measure that is based
purely on the hierarchy, without taking frequencies into account.

   Section 2 describes the related work. Section 3 explains the problem
formulation and the proposed algorithm. Section 4 presents the dataset and
the evaluation, comparing with existing algorithms.

                      2    Literature Review

The surveys (Boriah, Chandola, and Kumar 2008; Alamuri, Surampudi, and Negi
2014) discuss various similarity measures for categorical data. Wilson and
Martinez (Wilson and Martinez 2000) have studied in depth heterogeneous
functions for mixed data (categorical and continuous variables) for
instance-based learning. Their approach is based on supervised learning,
where each instance has a class label in addition to input variables. The
focus of this paper is to find similarity in an unsupervised setting, where
information regarding classes is unknown.

   For unsupervised learning, various techniques have been proposed (Boriah,
Chandola, and Kumar 2008). The majority of these techniques are based only
on the data-driven approach. However, in other domains, like natural
language processing, research is being conducted to calculate similarity
based on semantics and domain knowledge. Below, we provide an overview of
the existing data-driven measures, followed by research done in natural
language processing.

   The simplest similarity measure used is known as the overlap measure
(Boriah, Chandola, and Kumar 2008). A similarity of 1 is assigned when two
categorical values are identical; otherwise the similarity is 0. The overall
similarity between two data instances of multivariate categorical data is
proportional to the number of attributes in which they are identical. The
overlap measure does not distinguish different values of attributes; hence
matches and mismatches are treated equally. Goodall proposed a similarity
measure that normalizes the similarity between two data instances by the
probability of occurrences in a random sample (W. Goodall 1966). This
measure assigns a higher similarity score to values which are less frequent.
Gambaryan proposed a similarity measure giving more weight to matches where
the frequency of occurrence of the categorical value is about half in the
dataset (Gambaryan 1964). (Eskin et al. 2002) developed a normalization
kernel for an intrusion detection system. This measure assigns more weight
to mismatches on attributes that contain many values. Inverse Occurrence
Frequency (IOF) assigns lower similarity values to mismatches that are based
on more frequent values. The IOF measure is derived from information
retrieval (Sparck Jones 2004) and is associated with the idea of inverse
document frequency. The Occurrence Frequency (OF) measure assigns lower
similarity to mismatches on less frequent values, while mismatches on more
frequent values are assigned higher similarity (Boriah, Chandola, and Kumar
2008).

   Lin proposed a similarity framework based on information theory (Lin
1998). According to Lin, similarity can be explained in terms of a set of
assumptions. If the assumptions are considered true, the similarity measure
necessarily follows. The similarity between two values is then calculated as
the ratio between the amount of information required to state the
commonality of both values and the information needed to fully describe both
values separately. Lin derived similarity measures for words, and for
ordinal and string data.

   Das and Mannila's research is based on the key point that attribute value
similarity is related to other attributes (Das and Mannila 2000). They
proposed Iterated Contextual Distances (ICD), based on the idea that
attribute and object similarities are interdependent. ICD finds attribute
similarity, sub-relation, and row similarity. Ahmad and Dey proposed a
distance-based measure in terms of the co-occurrence of values: the overall
distribution of two attribute values is considered along with their
co-occurrence with the values of other attributes (Ahmad and Dey 2007).

   Document or sentence similarity is considered a basic task for many
natural language processing (NLP) engines, such as information retrieval,
query answering, and text summarization. Semantic-based methods use
information from dictionaries (WordNet) to find the relatedness between two
terms. Classic methods in NLP are based on the shortest path measure (Roy et
al. 1989). (Leacock and Chodorow 1998) proposed a similarity technique based
on the shortest path between nodes in a taxonomy and the number of nodes.
(Huang and Sheng 2012) based their sentence similarity measure on WordNet
information content and string edit distance, for paraphrase recognition.

   However, the techniques mentioned above are not directly suitable for
categorical features. In an NLP setting, there are many terms in a complete
sentence or document that provide the neighborhood context and aid
understanding of the semantics. Furthermore, NLP tasks are constrained by
the sentence structures and grammar of a particular language, such as the
ordering of subject, verb, noun, etc. In contrast, categorical features are
represented by single domain terms, with no obvious representation of the
neighborhood or context that explains the semantic similarity. The main
focus here is to define semantic similarity between categorical terms based
on characteristics extracted from domain knowledge.

                    3    Problem Formulation

In this section, we first discuss a toy example to identify the drawbacks of
frequency-based similarity approaches. Further, we provide an overview of
metric properties and semantic similarity to establish the foundation of the
proposed similarity measure.

   We analyze the problems in existing work and the inherent challenges
associated with categorical data based on the toy dataset in Table 1. The
dataset consists of candidates' profiles, and we wish to retrieve matching
candidates for a given job advertisement.

   Many of the data-driven similarity measures consider two values of a
given categorical attribute to be similar if both have similar frequency
distributions. For instance, the OF similarity measure for values of an
attribute is calculated as follows (Boriah, Chandola, and Kumar 2008).
                      Table 1: Toy Dataset

       User ID    Occupation              Education
       1          Computer Programmer     Bachelors
       2          Administrative Staff    Bachelors
       3          HR Manager              Bachelors
       4          HR Manager              Masters
       5          Software Developer      Bachelors
       6          Computer Programmer     Masters


   OF(x, y) = 1                                                 if x = y
   OF(x, y) = 1 / (1 + log(N/(f(x)+1)) + log(N/(f(y)+1)))       if x ≠ y   (1)

where f(x) is the number of occurrences of the attribute value x and N
represents the total number of observations in the data set. The similarity
between the pairs 'Computer Programmer' and 'HR Manager', and 'Computer
Programmer' and 'Software Developer', based on Equation 1 is calculated as:

   OF(Comp. Programmer, HR Manager) = 0.64
   OF(Comp. Programmer, Soft. Developer) = 0.44

   These numbers would indicate that the Programmer is more similar to HR
Managers than to Developers. However, based on the semantic evidence
observed in a knowledge source (such as an ontology or a standard
classification), shown in Table 2, it is evident that computer programmers
and software developers perform the same work activities and tasks, and
hence have a greater semantic similarity.

   Semantic similarity can be made explicit in different ways, and one of
the prominent ways is through hierarchies, which we will use in this paper.
Section 3.1 explains the formal definition of hierarchies in detail.

3.1   Hierarchies

Our similarity measures are based on a given hierarchical structure on the
value range of categorical features. Formally, we assume that the
categorical values for each feature form a finite, partially ordered set
(poset). A poset is a pair (S, ⊑) of a set S and a binary relation ⊑ on S
satisfying the following properties for all x, y, z ∈ S:
 • Reflexivity: x ⊑ x
 • Antisymmetry: if x ⊑ y and y ⊑ x, then x = y
 • Transitivity: if x ⊑ y and y ⊑ z, then x ⊑ z
If a ⊑ b, we call b an ancestor of a. The intention of a ⊑ b is that b is in
some way more general, broader, etc. than a. E.g., for the occupations in
Fig. 1, TopExecutives ⊑ ManagementOccupations; for data about geographic
areas, we could have Oslo ⊑ Norway ⊑ Europe.

   If domain knowledge is given in the form of an ontology, in some cases
(depending on the modeling style) the relation ⊑ will correspond to parts of
the is-a subclass relation of the ontology, but in others it won't. E.g., it
doesn't make sense to consider Norway a sub-class or sub-concept of Europe,
but it still makes sense to consider a hierarchy of geographic regions.

   A value c ∈ S is called a lowest common ancestor of two values a, b ∈ S
if c is the lowest (i.e. deepest) node that has both a and b as descendants.
It is the first shared ancestor of a and b located farthest from the root.
In a hierarchy, the lowest common ancestor of two values is denoted a ⊔ b. A
value is called a leaf value if it is not the ancestor of any other value.

   In this paper, we add a restriction to our hierarchies by only
considering mono-hierarchies: we assume that there is some root value r in
the hierarchy, such that a ⊑ r for all a ∈ S, and that all values except the
root have exactly one direct ancestor. In other words, the hierarchy is
tree-shaped.

3.2   Semantic Similarity

Semantic similarity refers to similarity based on meaning or semantic
content, as opposed to form (Smelser and Baltes 2001). Semantic similarity
measures are automated methods for assigning a pair of concepts a measure of
similarity, and can be derived from a taxonomy of concepts arranged in is-a
relationships (Pedersen, Pakhomov, and Patwardhan 2005). The concept of
semantic similarity has been applied in natural language processing for the
past decade to solve tasks such as the resolution of ambiguities between
terms, document categorization or clustering, word spelling correction,
automatic language translation, ontology learning or information retrieval.
Similarity computation for categorical data can improve the performance of
existing machine learning algorithms (Ahmad and Dey 2007) and may ease the
integration of heterogeneous data (Wilson and Martinez 2000).

   Is-a relationships in a concept hierarchy encompass formal
classification, properties and relations between concepts and data. This
provides us with a common understanding of the structure of a domain,
explicit domain assumptions and reuse of domain knowledge. In order to
achieve interpretable and good quality results in machine learning models,
it is vital to take this information into account. This intuition motivates
us to link the notion of similarity based on is-a relationships with the
similarity measures for categorical data. We develop a framework that uses
is-a relationships extracted from a concept hierarchy to quantify semantic
similarity, and propose a distance measure for categorical data.

3.3   Proposed Framework

In this paper, we propose two techniques for measuring similarity based on
domain knowledge, extracted as a concept hierarchy. First, we present a
framework for calculating semantic similarity using information content and
the concept hierarchy, by modifying Resnik's idea (Resnik 1970). To compare
against the performance of the information-content based semantic measure,
we extend the idea to a simple similarity measure based only on the concept
hierarchy.

   Further, we are interested in computing global semantic similarity in a
multi-dimensional setting, where we have several hierarchy-structured
features. We define the global similarity between two data objects X and Y
in a d-dimensional
                  Table 2: Occupation Activities and Skills

     Occupation             Work Activity                 Skills
     HR Manager             Liaise between departments    PeopleSoft, SAP
     Computer Programmer    Write programming code        C++, Java, Python
     Software Developer     Modify software programs      C++, Oracle, Python


attribute space as,

   δ(X, Y) = Σ_{i=1..d} w_i δ(x_i, y_i)                                 (2)

where δ(x_i, y_i) corresponds to the similarity between the two values x_i
and y_i in the i-th dimension, and w_i is the weight associated with each
dimension. The following section presents both frameworks for calculating
the semantic similarity δ(x_i, y_i).

Information Content Semantic Similarity (ICS)  This approach is based on a
modification of Resnik's idea (Resnik 1970). Resnik proposed a measure for
finding semantic similarity in an is-a taxonomy based on information
content, and defined the similarity between two nodes in a hierarchy as the
extent to which they share common information.

   In order to formulate the semantic similarity of two given categorical
values, the key intuition is to find the information common to both values.
This information is represented by the lowest common ancestor in the
hierarchy that subsumes both values (Lin 1998). If the lowest common
ancestor of two values is close to the leaf nodes, both values share many
characteristics. As the lowest common ancestor moves up in the hierarchy,
fewer commonalities exist between the pair of values.

   For the given dataset, we can map the 'Occupation' attribute to the O*net
taxonomy¹ (Fig. 1) by placing all the values at the corresponding leaf nodes
in the occupation hierarchy, whereas intermediate nodes represent the lowest
common ancestors for given pairs. In Fig. 1², 'Computer Programmer' and
'Software Developer' are both subsumed by the lowest common ancestor
'Computer Occupations', whereas the lowest common ancestor that subsumes
'HR Manager' and 'Computer Programmer' is 'Occupation' (the root node of the
occupation hierarchy). Hence, taking the lowest common ancestor into
account, we expect the similarity between Computer Programmer and Software
Developer to be significantly greater than the similarity between Computer
Programmer and HR Manager.

   Our intuition about semantic similarity is that two categorical values x
and y whose lowest common ancestor c is farther from the root node are
always considered more semantically similar than two categorical values x
and z whose lowest common ancestor c′ is close to the root node. In
addition, identical values should have the maximum similarity of 1.

   In order to formulate the semantic similarity of values based on the
lowest common ancestor, we use the idea of associating probabilities with
the values (Resnik 1970). We base ourselves on a function p : S → [0, 1]
such that for any c ∈ S, p(c) represents the probability of the feature
value being ⊑ c. Furthermore, using information theory, we can state that
the information content of a feature having some value is quantified as the
negative log likelihood (Ross 1976).

   For categorical data, we can find the information content I of the lowest
common ancestor c from the information content of all the leaf values
subsumed by c in the hierarchy:

   I(c) = −log Σ_{n ∈ leaf(c)} p(n)                                     (3)

where leaf(c) is the set of all leaf values x ∈ S such that x ⊑ c. The
probability of leaf values may be estimated by the relative frequency:³

   p(n) = frequency(n) / N                                              (4)

where N is the number of samples.

   Based on the above definitions, we formulate the information content
based semantic similarity (ICS) between two categorical values x and y as

   Sim(x, y) = 1                                 if x = y
   Sim(x, y) = I(x ⊔ y) / max(I(x ⊔ y))          if x ≠ y               (5)

where I(x ⊔ y) denotes the information content of the lowest common ancestor
of x and y, calculated using Equation 3, and max(I(x ⊔ y)) represents the
maximum information content over all given pairs of leaves and is used for
normalization.

Hierarchy-based Semantic Similarity (HS)  As explained earlier, the main
intuition of semantic similarity is that any two values whose lowest common
ancestor is close to the leaf nodes should have high similarity, and vice
versa. Hence, we quantify semantic similarity by considering the level of
the lowest common ancestor in the hierarchy. The level of a node is defined
as 1 + the number of connections between the node and the root.⁴ The greater
the level of the lowest common ancestor of a given pair of values in the
hierarchy, the more similar the values are. We formulate the similarity as

   Sim(x, y) = 1                                 if x = y
   Sim(x, y) = λ^(d − level(x ⊔ y))              if x ≠ y               (6)

____
   ¹ https://www.onetcenter.org/taxonomy.html
   ² https://www.bls.gov/soc/soc_structure_2010.pdf
   ³ Probabilities may also be known from other sources, for instance known
priors for the specific domain.
   ⁴ Levels start from 1, and the level of the root is 1.
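The two proposed measures can be sketched in a few lines of Python. The
mini-hierarchy, frequency counts and decay value λ = 0.5 below are
illustrative assumptions modelled on the toy example, not part of the actual
O*net taxonomy; the functions follow Equations 3–6 under those assumptions.

```python
from math import log

# Hypothetical hierarchy fragment (child -> direct ancestor), modelled on
# the toy example; node names are illustrative placeholders.
PARENT = {
    "Computer Programmer": "Computer Occupations",
    "Software Developer": "Computer Occupations",
    "HR Manager": "Management Occupations",
    "Computer Occupations": "Occupation",
    "Management Occupations": "Occupation",
    "Occupation": None,  # root
}

def ancestors(x):
    """Chain from x up to the root, inclusive (x is its own ancestor)."""
    chain = []
    while x is not None:
        chain.append(x)
        x = PARENT[x]
    return chain

def lca(x, y):
    """Lowest common ancestor x ⊔ y: first of y's ancestors that subsumes x."""
    anc_x = set(ancestors(x))
    for a in ancestors(y):
        if a in anc_x:
            return a
    raise ValueError("values are not under a common root")

def level(n):
    """Level of a node: 1 + number of edges to the root (root has level 1)."""
    return len(ancestors(n))

def leaves(c):
    """All leaf values subsumed by c (a leaf is no node's parent)."""
    internal = set(PARENT.values())
    return [n for n in PARENT if n not in internal and c in ancestors(n)]

def information_content(c, freq, n_samples):
    """Eq. (3), with leaf probabilities estimated per Eq. (4)."""
    return -log(sum(freq.get(n, 0) / n_samples for n in leaves(c)))

def ics(x, y, freq, n_samples):
    """Eq. (5): information-content similarity, normalised by the maximum
    information content over all distinct leaf pairs."""
    if x == y:
        return 1.0
    all_leaves = leaves("Occupation")  # "Occupation" is the fragment's root
    i_max = max(information_content(lca(a, b), freq, n_samples)
                for a in all_leaves for b in all_leaves if a != b)
    return information_content(lca(x, y), freq, n_samples) / i_max

def hs(x, y, lam=0.5):
    """Eq. (6): hierarchy-only similarity with decay parameter lam (λ)."""
    if x == y:
        return 1.0
    d = max(level(n) for n in PARENT)  # maximum depth of the hierarchy
    return lam ** (d - level(lca(x, y)))
```

On the toy counts from Table 1 (Computer Programmer: 2, Software Developer:
1, HR Manager: 2, N = 6), `ics` ranks Computer Programmer closer to Software
Developer than to HR Manager, and `hs` gives λ¹ vs. λ² for the same two
pairs — the ordering that the OF measure misses.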
                                             Figure 1: O*net Occupation Taxonomy


Where 0 < λ < 1 is a fixed decay parameter, level(n) is           there exists no standard machine learning system for solving
the distance of n from the root in the hierarchy, and d =         this use case. The common industrial practice to date is to
maxn∈X level(n) is the maximum depth of the hierarchy.            conduct a manual analysis by human experts.
   The main advantage of equation 6 is that the calculation of
semantic similarity no longer requires any input from train-      4.2      Reservoir Analogues
ing data such as information content. Once the concept hier-      In the Oil and Gas Industry, during the exploration phase,
archy is formalized, we can measure the similarity between        analogous reservoirs are used to study reservoirs that lack
any two values including the categorical values not observed      critical information. Any reservoir with a deficit of critical
in the training data.                                             information is known as a “target reservoir”, and “analo-
   Below, we explain how to perform the evaluation of the proposed techniques.

                      4    Evaluation
In this section, we compare the ICSD and HDM approaches to other similarity measures for the identification of reservoir analogues of a target reservoir, given a dataset of known reservoirs. This use case is further explained in Section 4.2 below.

4.1   Baseline Methods
The following four state-of-the-art similarity/distance measures are compared with the proposed techniques: Occurrence Frequency (OF) (Boriah, Chandola, and Kumar 2008), the Eskin similarity measure (Boriah, Chandola, and Kumar 2008; Eskin et al. 2002), the Lin similarity measure (Lin 1998), and Coupled Metric Similarity (CMS) (Jian et al. 2018).
   We compare the performance of the different similarity measures in a recommendation scenario: given a query item, we compute its similarity to each item in the ‘training’ dataset using Equation 2, and determine the top k items with the highest similarity.
   For our evaluation, we do this for each of the different similarity measures and compare the outcome to a fixed ‘gold standard’ list of items to determine the average precision.
   For our experimental evaluation, we have chosen reservoir analogues (explained in the section below), a complex task in the Oil and Gas industry.

4.2   Use Case: Reservoir Analogues
“Analogous reservoirs” are ones expected to have similar characteristics (Martín Rodríguez et al. 2013).
   Usually, a technical evaluation team must analyze various data types – seismic, well logs, tests, and cores – in order to make a first approximation of analogous reservoirs. Due to a lack of resources and time constraints, the first approximation is usually based on the neighboring reservoirs, which provide an estimate of the fluid and rock properties of the target reservoir. A single analogue is mostly used because it is in the same geographic region or basin. This is risky, however, since it does not always give sufficient information to characterize a new prospect. Furthermore, it becomes a tedious task for new target reservoirs where no neighboring reservoir exists.
   Limited efforts have been made to identify analogues based on machine learning (Martín Rodríguez et al. 2013; Perez-Valiente et al. 2014). In order to generate a list of ranked reservoirs based on similarity, it is important to automate this process using a standard knowledge source and to develop a method that is flexible enough to produce analogues for reservoirs with no neighboring analogues.

4.3   Dataset
The main source of information used in this evaluation is the dataset of reservoirs licensed by IHS (https://ihsmarkit.com/index.html). It comprises a total of 43,000 reservoirs and various properties/attributes associated with each reservoir. According to domain experts, only a few key parameters are known during the initial stage of reservoir identification.
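The recommendation protocol of Section 4.1 — score every known reservoir against the query and keep the top k — can be sketched as follows. The attribute names and values, and the toy `overlap` measure, are purely illustrative stand-ins; they are not the proposed ICSD/HDM measures or the actual IHS attributes.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]  # categorical attribute name -> value


def top_k_analogues(query: Record,
                    candidates: Dict[str, Record],
                    similarity: Callable[[Record, Record], float],
                    k: int = 15) -> List[Tuple[str, float]]:
    """Rank all candidate reservoirs by similarity to the query and
    return the k most similar ones as (name, score) pairs."""
    scored = [(name, similarity(query, rec)) for name, rec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def overlap(a: Record, b: Record) -> float:
    """Toy match/mismatch similarity: fraction of shared attributes
    with equal values (stand-in for OF, Eskin, CMS, ICS, HS)."""
    shared = set(a) & set(b)
    return sum(a[attr] == b[attr] for attr in shared) / max(len(shared), 1)
```

Any of the compared measures can be plugged in as the `similarity` argument without changing the retrieval loop.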
          Figure 2: The hierarchy of geologic age.

   Figure 3: Ontology showing IS-A relationships for Lithology.

Hence, for our analysis of retrieving similar reservoirs, we use the following set of key parameters/attributes identified by domain experts:
• Depositional Environment
• Lithology
• Age
• Geographical Location
• Structural Setting
   Detailed definitions of these parameters are given in the section below.

4.4   Semantic Information for Attributes
This section explains the process of standardizing the semantic information used in the calculation of similarity. Due to data confidentiality, we only explain the two attributes ‘Age’ and ‘Lithology.’

Reservoir Age: A geologic age is a subdivision of geologic time that divides an epoch into smaller parts. A succession of rock strata laid down in a single age on the geologic timescale is a stage. Geological time has been divided into eras, periods, and epochs. The named divisions of geological time are based on fossil evidence. Fig. 2 shows a part of an ontology developed to show how geological times are organized into Erathem, Period, Epoch, and Age.
   Note that age can also be given on a linear scale, e.g. in millions of years. However, the characteristics of rocks deposited in different geologic eras, periods, and epochs differ so much that their position in the hierarchy is a much better indicator of similarity than the numerical difference in age.

Lithology: The lithology of a rock unit is a description of its physical characteristics visible at outcrop, in hand or core samples, or under low-magnification microscopy, such as color, texture, grain size, or composition. There is no standard ontology for lithology. With the help of geologists, we developed an ontology that covers all the categorical values occurring in the data and groups them based on similar physical characteristics. Fig. 3 shows a part of this ontology.

4.5   Data Pre-processing
The main challenge associated with the given data is the large number of categorical values associated with each attribute. For the attribute ‘Age,’ there are about 250 unique values. These values are not standardized; hence, there are instances where the same category exists in the dataset under various names. Furthermore, most of the age values are unofficial names, which are used only in a few specific areas of the world. With the help of geological experts, we replaced these unofficial names with standard domain names.
   For the attribute ‘Depositional Environment,’ there are 32 unique values occurring in the given dataset. Some categorical values were merged based on shared geological properties identified by domain experts.
   In the original dataset, there are 1731 categories of the attribute ‘Lithology.’ The raw values of lithology contain abbreviations for the same lithology, unofficial lithology names, and combinations of various lithologies. These categories are replaced with the standard names, and combinations are replaced with only the primary lithology, which leads to 228 unique categories.
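A minimal sketch of this normalization step follows. The mapping entries shown are hypothetical examples; the real raw-to-standard tables were compiled with domain experts and are not reproduced here.

```python
from typing import Dict, Optional

# Hypothetical raw-to-standard mappings (illustrative only).
LITHOLOGY_STANDARD: Dict[str, str] = {
    "SST": "Sandstone",               # abbreviation
    "sandst.": "Sandstone",           # abbreviation
    "Sandstone/Shale": "Sandstone",   # combination -> primary lithology
}


def normalize(raw: str, table: Dict[str, str]) -> Optional[str]:
    """Map a raw categorical value to its standard domain name.
    Returns None for values with no standard name, which are then
    treated as outliers and removed from the dataset."""
    if raw in table:
        return table[raw]
    # Already-standard names pass through unchanged.
    return raw if raw in set(table.values()) else None
```

Applying `normalize` to every attribute value and dropping the `None` results implements both the renaming and the outlier-removal steps described above.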
   Outliers are extreme values that deviate from the other observations in the data; they may indicate variability in measurement, experimental errors, or a novelty. In order to avoid a disastrous effect on the results of the statistical analysis, a step is added to identify, analyze, and delete outliers in the dataset. In this step, for every attribute, we remove the values that do not conform to standard domain names.
   After cleaning the data, the comparative evaluation between ICS, HS, and the existing similarity algorithms is conducted.

4.6   Evaluation Method
For the given task, we evaluate the similarity measures on two main objectives:
• Retrieving the top 15 similar analogues to the target reservoir.
• Producing the result in ranked order, such that the first retrieved analogue corresponds to the reservoir most similar to the target reservoir.
   Mean Average Precision (MAP) is the most commonly used evaluation metric in information retrieval and object detection (Baeza-Yates and Ribeiro-Neto 2008). MAP is the arithmetic mean of the average precision (AP) values for an information retrieval system over a set of n query topics (Liu and Özsu 2009). It can be expressed as follows:

                  MAP = (1/n) · Σ_{q=1}^{n} AP_q                      (7)

Precision for a classification task is defined as

        Precision = TruePositive / (TruePositive + FalsePositive)     (8)

Based on Equation 8, recommender-system precision (P) is defined as

   P = (# of our recommendations that are relevant) / (# of items we recommended)   (9)

   For evaluating the performance of recommender systems, we are only interested in recommending the top N items to the user. Usually, the higher the number of relevant recommendations at the top, the more positive the impression of the users. Therefore, it is sensible to compute precision and recall over the first N items instead of all items. Thus, precision at a cutoff k is introduced in order to evaluate the ranking, where k is an integer set by the user to match the objective of the top-N recommendations. Average precision at cutoff K is the average of the precisions at the places where a recommendation is a true positive, and is defined as follows:

               AP_q@K = (1/K) · Σ_{i=1}^{K} P(i) · Rel(i)             (10)

where K represents the top K recommendations for the given query q and Rel(i) shows the relevance of the recommendation: Rel(i) is 1 if the recommended item was relevant (a true positive) and 0 otherwise.
   Usually, the performance of a recommendation system is calculated over a set of queries. Therefore, given a set of queries Q, the mean average precision (MAP_Q@K) of an algorithm is defined as

               MAP_Q@K = (1/Q) · Σ_{q=1}^{Q} AP_q@K                   (11)

where AP_q@K is calculated using Equation 10.

   Table 3: Average Precision for the selected target reservoirs

        Reservoir    ICS     HS     OF     CMS     Eskin
        Snorre        39     59     39      40       34
        Snøhvit       57     66     15      29       27

                Table 4: Mean Average Precision

                     ICS    HS     OF     CMS    Eskin
          MAP         48    63     27      35      30

4.7   Experimental Results
There is no standard way to evaluate similarity measures for semantic similarity. Resnik uses human expert similarity rankings to judge similarity (Resnik 1995). We follow the same approach. In order to perform this evaluation, we selected two target reservoirs, ‘Snorre’ and ‘Snøhvit.’ We then asked our domain experts to produce a gold set for each reservoir. This gold set contains the reservoirs identified by our experts as most similar to the target reservoir, based on their hindsight knowledge about the target reservoir. Furthermore, the gold set is ranked: the first item in the list corresponds to the most similar analogue and the last item corresponds to the least similar reservoir.
   After acquiring the gold dataset, we perform an experimental evaluation to compare the performance of the proposed techniques with three existing similarity measures (OF (Boriah, Chandola, and Kumar 2008), Eskin (Eskin et al. 2002), and CMS (Jian et al. 2018)) for finding reservoir analogues. For each selected target reservoir, all the remaining reservoirs in the dataset are given as input to each similarity measure, and the similarity between the target and all remaining reservoirs is calculated. The top 15 reservoirs with maximum similarity are retrieved and are referred to as analogues to the target reservoir.
   In order to penalize poor estimations, we use Average Precision (see Equation 10) as the quality criterion for the evaluation of similarity between reservoirs. For this metric, a higher value corresponds to better results. Table 3 shows the experimental results of each similarity measure separately for each target reservoir. (The similarity measure proposed by Lin (Lin 1998) does not retrieve any similar analogues in the top-k recommendations; therefore, its results are not included in Table 3.)
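The metrics of Equations 8–11 can be sketched as follows; the recommendation lists and gold sets in the usage are illustrative, not the actual experimental data.

```python
from typing import Iterable, List, Set, Tuple


def average_precision_at_k(recommended: List[str],
                           gold: Set[str], k: int = 15) -> float:
    """AP_q@K as in Equation 10: sum P(i)*Rel(i) over the top-K
    positions, divided by K. P(i) is the precision over the first i
    recommendations; Rel(i) is 1 iff position i holds a gold item."""
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in gold:          # Rel(i) = 1
            hits += 1
            score += hits / i     # P(i) at a true-positive position
    return score / k


def mean_average_precision(runs: Iterable[Tuple[List[str], Set[str]]],
                           k: int = 15) -> float:
    """MAP_Q@K as in Equation 11: arithmetic mean of AP_q@K over the
    (recommendations, gold set) pair of every query."""
    runs = list(runs)
    return sum(average_precision_at_k(r, g, k) for r, g in runs) / len(runs)
```

For example, with `recommended = ["a", "b", "c"]`, `gold = {"a", "c"}` and `k = 3`, the true positives at ranks 1 and 3 give AP@3 = (1/3)·(1/1 + 2/3) = 5/9.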
   As shown in Table 3, the ICS and HS measures outperform the data-driven similarity measures for both selected reservoirs. For the target reservoirs ‘Snorre’ and ‘Snøhvit,’ the average precision for ICS is 39% and 57%, respectively, which is higher than the average precision of the other similarity measures. For HS, the average precision for ‘Snorre’ and ‘Snøhvit’ is 59% and 66%. Further, Table 4 shows that the MAP (Equation 11) for ICS and HS is 48% and 63%, respectively, which is significantly better than the MAP values of the other algorithms. This evaluation supports the initial hypothesis that by adding domain information to the similarity measure, we can increase similarity performance for complex categorical data.
   It is important to note that the results obtained using ICS and HS are not directly comparable with the gold set provided by human experts. In order to produce the gold set, human experts take into account the geological history of the current basin, an analysis of geological time periods, and the overall processes of formation of the reservoir rocks. Furthermore, they also use conceptual facies models, reservoir simulation models, core samples, and well logs when selecting appropriate analogues. In contrast to this, our experimental evaluation of the proposed technique is based on only a limited part of this information. That a hierarchy-based semantic measure alone correctly retrieves analogues in the top 15 recommendations, achieving 63% precision in this scenario, is remarkable.

           5    Conclusion & Future Work
Computing a similarity measure in an unsupervised setting is a complex task. In this paper, we propose a method based on domain information extracted in the form of is-a links from a concept hierarchy. The experimental results in the previous section show that by using domain information, the results are significantly better than those of traditional methods that compute similarity based only on frequency match/mismatch. In our current work, we approach the problem via the lowest common ancestor in the concept hierarchy, considering mono-hierarchies only and an unsupervised setting. In the future, we want to extend the notion of similarity for categorical data to a supervised setting for complex use cases such as mortality prediction in the medical domain. Furthermore, the idea can be extended to find similarity for categorical data in poly-hierarchies (i.e., hierarchies that are not tree-shaped).

                         References
Ahmad, A., and Dey, L. 2007. A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set. Pattern Recognition Letters 28:110–118.
Alamuri, M.; Surampudi, B.; and Negi, A. 2014. A survey of distance/similarity measures for categorical data. In Proceedings of the International Joint Conference on Neural Networks, 1907–1914.
Alkharusi, H. 2012. Categorical variables in regression analysis: A comparison of dummy and effect coding. International Journal of Education 4:202–210.
Baeza-Yates, R., and Ribeiro-Neto, B. 2008. Modern Information Retrieval: The Concepts and Technology Behind Search. Addison-Wesley Publishing Company, 2nd edition.
Boriah, S.; Chandola, V.; and Kumar, V. 2008. Similarity measures for categorical data: A comparative evaluation. In Proceedings of the SIAM International Conference on Data Mining, volume 30, 243–254.
Cost, S., and Salzberg, S. 1993. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning 10.
Das, G., and Mannila, H. 2000. Context-based similarity measures for categorical databases. In Lecture Notes in Computer Science, volume 1910, 201–210.
Davis, M. J. 2010. Contrast coding in multiple regression analysis: Strengths, weaknesses, and utility of popular coding structures. Journal of Data Science.
Eskin, E.; Arnold, A.; Prerau, M.; Portnoy, L.; and Stolfo, S. 2002. A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. Applications of Data Mining in Computer Security 6.
Esposito, F.; Malerba, D.; Tamma, V.; and Bock, H.-H. 2000. Classical resemblance measures. Springer Verlag.
Gambaryan, P. 1964. A mathematical model for taxonomy. SSR 47–53.
Goodall, D. W. 1966. A new similarity index based on probability. Biometrics 22.
Huang, G., and Sheng, J. 2012. Measuring similarity between sentence fragments. In Proceedings of the 2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics, IHMSC 2012, volume 1, 327–330.
Jian, S.; Cao, L.; Lu, K.; and Gao, H. 2018. Unsupervised coupled metric similarity for non-IID categorical data. IEEE Transactions on Knowledge and Data Engineering.
Leacock, C., and Chodorow, M. 1998. Combining Local Context and WordNet Similarity for Word Sense Identification, volume 49. MIT Press.
Lin, D. 1998. An information-theoretic definition of similarity. In ICML, Madison.
Liu, L., and Özsu, M. T. 2009. Encyclopedia of Database Systems. Springer US.
Martín Rodríguez, H.; Escobar, E.; Embid, S.; Rodriguez, N.; Hegazy, M.; and Lake, L. 2013. New approach to identify analogue reservoirs. SPE Economics & Management 6.
Pedersen, T.; Pakhomov, S.; and Patwardhan, S. 2005. Measures of semantic similarity and relatedness in the medical domain. Journal of Biomedical Informatics.
Perez-Valiente, M.; Rodriguez, H.; Santos, C.; Vieira, M.; and Embid, S. 2014. Identification of reservoir analogues in the presence of uncertainty. In SPE Intelligent Energy Conference and Exhibition.
Rada, R.; Mili, H.; Bicknell, E.; and Blettner, M. 1989. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics 19:17–30.
Resnik, P. 1995. Using information content to evaluate semantic similarity in a taxonomy. In IJCAI 95.
Ross, S. M. 1976. A First Course in Probability. Pearson Education, Inc.
Smelser, N., and Baltes, P. 2001. International Encyclopedia of the Social & Behavioral Sciences. Elsevier.
Sparck Jones, K. 2004. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28:493–502.
Stanfill, C., and Waltz, D. L. 1986. Toward memory-based reasoning. Communications of the ACM 29:1213–1228.
Wilson, D., and Martinez, T. 2000. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research 6.