<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Summaya</forename><surname>Mumtaz</surname></persName>
							<email>summayam@ifi.uio.no</email>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Giese</surname></persName>
							<email>martingi@ifi.uio.no</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Department of Informatics</orgName>
								<orgName type="institution">University of Oslo</orgName>
								<address>
									<country key="NO">Norway</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="institution">Stanford University</orgName>
								<address>
									<settlement>Palo Alto</settlement>
									<region>California</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Frequency-Based vs. Knowledge-Based Similarity Measures for Categorical Data</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">0D6331EC83E3DF0FC9B33F86DC33FE8D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T15:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Calculating the similarity between two entities is a key step in several data mining processes. While there are several common similarity measures for continuous data, there is little work on categorical data. Most approaches are purely data-driven and do not consider the inherent dependencies of complex domains such as geological structures or phylogenetics. We propose two new similarity measures that take semantic information into account when calculating the similarity between two categorical values. Semantic information is represented as a hierarchy extracted from an ontology or a domain taxonomy. The first approach calculates semantic similarity by combining the data-driven approach with the hierarchy imposed on the possible categorical values. The second approach ignores the data and uses only the hierarchy to calculate semantic similarity. We apply our methods to a specific complex data mining task in the oil and gas industry: reservoir analogue identification. The two proposed measures are compared to existing data-based measures.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The context of this work is the combination of data-based (statistical) methods with knowledge-based methods in data science. In many disciplines, there is a considerable body of domain knowledge available, while data sets may not always be large enough to support machine learning of complex relationships. In this work, we look specifically at similarity measures (or, equivalently, distance measures), which lie at the core of a number of machine learning tasks such as clustering, outlier identification, and classification (k-NN). We concentrate on entities described by categorical data, i.e. feature values taken from a finite set of possible values with no inherent order. The domain knowledge we wish to incorporate is given in the form of hierarchies that can be extracted from domain ontologies, standard classifications, etc.</p><p>There is a variety of suitable metrics to quantify similarity for numerical data, such as Euclidean or Manhattan distance <ref type="bibr">(Esposito et al. 2000)</ref>. These methods are not directly applicable to non-numerical data, and defining sensible metrics for categorical attributes is challenging.</p><p>The most common approach in machine learning algorithms for handling categorical data is one-hot encoding <ref type="bibr" target="#b2">(Alkharusi 2012;</ref><ref type="bibr" target="#b6">Davis 2010)</ref>. A binary column is created for each unique value of the categorical column. This yields a high-dimensional sparse matrix containing a significant proportion of zeros. This approach requires substantial computational resources, is unable to handle unseen values, and ignores any domain dependencies known to exist between values of the same categorical attribute.</p><p>In a supervised learning approach, the distance δ(x, y) between two categorical values can be defined using the value distance matrix <ref type="bibr" target="#b18">(Stanfill and Waltz 1986)</ref> and the modified value distance matrix <ref type="bibr" target="#b4">(Cost and Salzberg 1998)</ref>.</p><p>For unsupervised learning, the Hamming distance is used, and similarity is defined as a matching measure that assigns 1 if both values are identical and 0 otherwise <ref type="bibr">(Esposito et al. 2000;</ref><ref type="bibr" target="#b0">Ahmad and Dey 2007)</ref>. Various similarity measures have been derived from this distance measure, e.g. the Jaccard similarity coefficient, the Sokal-Michener similarity measure, and the Gower-Legendre similarity measure <ref type="bibr">(Esposito et al. 2000)</ref>. These measures are inherently quite coarse: in the absence of an ordering between the categorical values, the only possible distinction is whether two values are identical or not <ref type="bibr">(Esposito et al. 2000)</ref>.</p><p>To improve on these, frequency-based similarity measures have been proposed that take the frequency distribution of the different attribute values into account. These measures are data-driven and hence depend on certain data characteristics such as the size of the data, the number of attributes, the number of values for each attribute, and the frequency distribution of each value. While data-driven measures perform well on simple datasets, they are unable to take semantic relationships into account and often do not perform well on complex datasets with hidden domain dependencies. Moreover, a concept of similarity based solely on how often values occur in the data cannot be expected to give reasonable results in all cases. Using frequencies seems more like a last resort when frequencies are the only distinguishing feature between categorical values.</p><p>In this paper, we propose an alternative way to measure similarity for categorical data in an unsupervised setting. 
We combine a frequency-based measure with explicitly represented domain knowledge given in the form of a hierarchy on attribute values, and we also consider a measure that is based purely on the hierarchy, without taking frequencies into account.</p><p>Section 2 describes related work. Section 3 explains the problem formulation and the proposed algorithm. Section 4 presents the dataset and an evaluation comparing with existing algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Literature Review</head><p>The surveys <ref type="bibr">(Boriah, Chandola, and Kumar 2008;</ref><ref type="bibr" target="#b1">Alamuri, Surampudi, and Negi 2014)</ref> discuss various similarity measures for categorical data. <ref type="bibr" target="#b20">Wilson and Martinez (Wilson and Martinez 2000)</ref> have studied in depth heterogeneous functions for mixed data (categorical and continuous variables) for instance-based learning. Their approach is based on supervised learning, where each instance has a class label in addition to input variables. The focus of this paper is to find similarity in an unsupervised setting where information regarding classes is unknown.</p><p>For unsupervised learning, various techniques have been proposed <ref type="bibr">(Boriah, Chandola, and Kumar 2008)</ref>. The majority of these techniques are based only on the data-driven approach. However, in some other domains, such as natural language processing, research is being conducted to calculate similarity based on semantics and domain knowledge. Below, we provide an overview of the existing data-driven measures, followed by research done in natural language processing.</p><p>The simplest similarity measure used is known as the overlap measure <ref type="bibr">(Boriah, Chandola, and Kumar 2008)</ref>. A similarity of 1 is assigned when two categorical values are identical; otherwise, a similarity of 0 is assigned. The overall similarity between two instances of multivariate categorical data is proportional to the number of attributes in which they are identical. The overlap measure does not distinguish between different values of an attribute: all matches are treated equally, as are all mismatches. Goodall proposed a similarity measure that normalizes the similarity between two data instances by the probability of their occurrence in a random sample <ref type="bibr" target="#b19">(W. Goodall 1966)</ref>. 
This measure assigns a higher similarity score to values that are less frequent. Gambaryan proposed a similarity measure that gives more weight to matches where the frequency of occurrence of the categorical value is about half in the dataset <ref type="bibr" target="#b7">(Gambaryan 1964)</ref>. <ref type="bibr" target="#b6">(Eskin et al. 2002)</ref> developed a normalization kernel for intrusion detection systems. This measure assigns more weight to mismatches on attributes that contain many values. The inverse occurrence frequency (IOF) measure assigns lower similarity to mismatches on more frequent values. The IOF measure is derived from information retrieval (Sparck Jones 2004) and is associated with the idea of inverse document frequency. The occurrence frequency (OF) measure assigns lower similarity to mismatches on less frequent values, while mismatches on more frequent values are assigned higher similarity <ref type="bibr">(Boriah, Chandola, and Kumar 2008)</ref>.</p><p>Lin proposed a similarity framework based on information theory <ref type="bibr" target="#b11">(Lin 1998)</ref>. According to Lin, similarity can be explained in terms of a set of assumptions; if the assumptions are accepted, the similarity measure follows necessarily. The similarity between two values is therefore calculated as the ratio between the amount of information required to state the commonality of both values and the information needed to fully describe both values separately. Lin derived similarity measures for word, ordinal, and string data.</p><p>Das and Mannila's research is based on the key observation that attribute value similarity is related to the other attributes <ref type="bibr" target="#b5">(Das and Mannila 2000)</ref>. They proposed Iterated Contextual Distances (ICD), based on the idea that attribute and object similarities are interdependent. ICD finds attribute similarity, subrelations, and row similarity. 
Ahmed and Dey proposed a distance-based measure in term of co-occurrence of values, the overall distribution of two attribute values are considered along with their co-occurrence with the values of other attributes <ref type="bibr" target="#b0">(Ahmad and Dey 2007)</ref>.</p><p>Document or sentence similarity is considered the basic task for many natural language processing(NLP) engines such as information retrieval, query answering, and text summarization. Semantic-based methods use information from dictionaries (WordNet) to find relatedness between two terms. Classic methods in NLP are based on the shortest path measure <ref type="bibr" target="#b15">(Roy et al. 1989)</ref>. <ref type="bibr" target="#b10">(Leacock and Chodorow 1998)</ref> proposed a similarity technique based on the shortest path between nodes in a taxonomy and the number of nodes. <ref type="bibr" target="#b8">(Huang and Sheng 2012)</ref> based their sentence similarity measure by using WordNet information content and string edit distance, for paraphrase recognition.</p><p>However, the techniques mentioned above are not directly suitable for categorical features. In an NLP setting, there are many terms in a complete sentence or document, that provide the neighborhood context and aid understanding the semantics. Furthermore, NLP tasks are constrained by the sentence structures and grammar of a particular language such as the ordering of subject, verb, noun, etc. However, categorical features are represented by single domain terms with no obvious representation of neighborhood or the context that explains the semantic similarity. The main focus here is to define semantic similarity between categorical terms based on the characteristics extracted from domain knowledge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Formulation</head><p>In this section, we first discuss a toy example to identify the drawbacks of frequency-based similarity approaches. Furthermore, we provide an overview of metric properties and semantic similarity to establish the foundation of the proposed similarity measure.</p><p>We analyze the problems in existing work and the inherent challenges associated with categorical data based on the toy dataset in Table <ref type="table" target="#tab_0">1</ref>. The dataset consists of candidates' profiles, and we wish to retrieve matching candidates for a given job advertisement.</p><p>Many of the data-driven similarity measures consider two values of a given categorical attribute to be similar if both have similar frequency distributions. For instance, the OF similarity measure for values of an attribute is calculated as follows <ref type="bibr">(Boriah, Chandola, and Kumar 2008)</ref>.</p><formula xml:id="formula_0">OF(x, y) = 1 if x = y; OF(x, y) = 1 / (1 + log(N/(f(x)+1)) + log(N/(f(y)+1))) if x ≠ y<label>(1)</label></formula><p>where f(x) is the number of occurrences of the attribute value x and N represents the total number of observations in the data set. The similarity between the pairs ('Computer Programmer', 'HR Manager') and ('Computer Programmer', 'Software Developer') based on Equation 1 is calculated as: OF (Comp. Programmer, HR Manager) = 0.64; OF (Comp. Programmer, Soft. Developer) = 0.44.</p><p>These numbers would indicate that the Programmer is more similar to HR Managers than to Developers. 
However, based on the semantic evidence observed in a knowledge source (such as an ontology or a standard classification), shown in Table <ref type="table" target="#tab_1">2</ref>, it is evident that computer programmers and software developers perform the same work activities and tasks and hence have a greater semantic similarity.</p><p>Semantic similarity can be made explicit in different ways; one of the most prominent is through hierarchies, which we will use in this paper. Section 3.1 explains the formal definition of hierarchies in detail.</p></div>
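The OF computation in Equation 1 can be sketched in a few lines of Python. This is a minimal illustration only: the frequency table and the count N below are hypothetical, since Table 1 is not reproduced here, and the formula follows Equation 1 as printed.

```python
import math

def of_similarity(x, y, freq, n):
    """OF similarity, Eq. (1): 1 for identical values, otherwise
    1 / (1 + log(N/(f(x)+1)) + log(N/(f(y)+1)))."""
    if x == y:
        return 1.0
    return 1.0 / (1.0 + math.log(n / (freq[x] + 1)) + math.log(n / (freq[y] + 1)))

# Hypothetical value frequencies (Table 1 is not reproduced here):
freq = {"Computer Programmer": 2, "HR Manager": 3, "Software Developer": 1}
sim = of_similarity("Computer Programmer", "HR Manager", freq, n=10)
```

Note that only value frequencies enter the computation, which is exactly the weakness discussed above: frequency says nothing about meaning.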
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Hierarchies</head><p>Our similarity measures are based on a given hierarchical structure of the value range of categorical features. Formally, we assume that the categorical values for each feature form a finite, partially ordered set (poset). A poset is a set S together with a binary relation ⪯ such that (S, ⪯) satisfies the following properties for all x, y, z ∈ S: • Reflexivity: x ⪯ x • Antisymmetry: if x ⪯ y and y ⪯ x, then x = y • Transitivity: if x ⪯ y and y ⪯ z, then x ⪯ z. If a ⪯ b, we call b an ancestor of a. The intention of a ⪯ b is that b is in some way more general, broader, etc. than a. E.g., for the occupations in Fig. <ref type="figure" target="#fig_0">1</ref>, TopExecutives ⪯ ManagementOccupations; for data about geographic areas, we could have Oslo ⪯ Norway ⪯ Europe.</p><p>If domain knowledge is given in the form of an ontology, the relation ⪯ will in some cases (depending on the modeling style) correspond to parts of the is-a subclass relation of the ontology, but in others it won't. E.g., it doesn't make sense to consider Norway a sub-class or sub-concept of Europe, but it still makes sense to consider a hierarchy of geographic regions.</p><p>A value c ∈ S is called a lowest common ancestor of two values a, b ∈ S if c is the lowest (i.e. deepest) node that has both a and b as descendants; it is the first shared ancestor of a and b located farthest from the root. In a hierarchy, two values always have a lowest common ancestor, denoted a ∪ b. A value is called a leaf value if it is not the ancestor of any other value.</p><p>In this paper, we restrict our hierarchies to mono-hierarchies: we assume that there is some root value r in the hierarchy, such that a ⪯ r for all a ∈ S, and that all values except the root have exactly one direct ancestor. In other words, the hierarchy is tree-shaped.</p></div>
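For a tree-shaped (mono-)hierarchy, the lowest common ancestor can be computed directly from a child-to-parent map. The sketch below is illustrative; the occupation fragment is a hypothetical excerpt of the hierarchy in Fig. 1, not the full O*NET taxonomy.

```python
def ancestors(node, parent):
    """Chain of nodes from `node` up to the root (inclusive)."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def lowest_common_ancestor(a, b, parent):
    """First shared ancestor of a and b farthest from the root,
    for a tree-shaped hierarchy given as a child -> parent map."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node

# Hypothetical fragment of the occupation hierarchy of Fig. 1:
parent = {
    "Computer Programmer": "Computer Occupations",
    "Software Developer": "Computer Occupations",
    "Computer Occupations": "Occupation",
    "HR Manager": "Management Occupations",
    "Management Occupations": "Occupation",
}
lowest_common_ancestor("Computer Programmer", "Software Developer", parent)
# -> "Computer Occupations"
```

With the mono-hierarchy restriction of this section, the first shared node on the two ancestor chains is unique, so a plain upward walk suffices.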
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Semantic Similarity</head><p>Semantic similarity refers to similarity based on meaning or semantic content, as opposed to form <ref type="bibr" target="#b16">(Smelser and Baltes 2001)</ref>. Semantic similarity measures are automated methods for assigning a pair of concepts a measure of similarity and can be derived from a taxonomy of concepts arranged in is-a relationships <ref type="bibr">(Pedersen, Pakhomov, and Patwardhan 2005)</ref>. The concept of semantic similarity has been applied in natural language processing for the past decade to solve tasks such as the resolution of ambiguities between terms, document categorization or clustering, word spelling correction, automatic language translation, ontology learning, and information retrieval. Similarity computation for categorical data can improve the performance of existing machine learning algorithms <ref type="bibr" target="#b0">(Ahmad and Dey 2007)</ref> and may ease the integration of heterogeneous data <ref type="bibr" target="#b20">(Wilson and Martinez 2000)</ref>.</p><p>Is-a relationships in a concept hierarchy encompass formal classification, properties, and relations between concepts and data. This provides us with a common understanding of the structure of a domain, explicit domain assumptions, and reuse of domain knowledge. In order to achieve interpretable, good-quality results from machine learning models, it is vital to take this information into account. This intuition motivates us to link the notion of similarity based on is-a relationships with similarity measures for categorical data. We develop a framework that uses is-a relationships extracted from a concept hierarchy to quantify semantic similarity, and propose a distance measure for categorical data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Proposed Framework</head><p>In this paper, we propose two techniques for measuring similarity based on domain knowledge, extracted as a concept hierarchy. First, we present a framework for calculating semantic similarity using information content and the concept hierarchy by modifying Resnik's idea <ref type="bibr" target="#b13">(Resnik 1970)</ref>. To compare against the performance of the information-content-based semantic measure, we extend the idea to introduce a simple similarity measure based only on the concept hierarchy.</p><p>Furthermore, we are interested in computing global semantic similarity in a multi-dimensional setting where we have several hierarchy-structured features. We define the global similarity between two data objects X and Y in a d-dimensional space as</p><formula xml:id="formula_1">Sim(X, Y) = Σ_{i=1..d} w_i δ(x_i, y_i)<label>(2)</label></formula><p>where δ(x_i, y_i) corresponds to the similarity between the two values x_i and y_i in the i-th dimension and w_i is the weight associated with each dimension. The following section presents both frameworks for calculating the semantic similarity δ(x_i, y_i).</p><p>Information Content Semantic Similarity (ICS) This approach is based on a modification of Resnik's idea <ref type="bibr" target="#b13">(Resnik 1970)</ref>. Resnik proposed a measure for finding semantic similarity in an is-a taxonomy based on information content and defined the similarity between two nodes in a hierarchy as the extent to which they share common information.</p><p>In order to formulate the semantic similarity of two given categorical values, the key intuition is to find the information common to both values. This information is represented by the lowest common ancestor in the hierarchy that subsumes both values <ref type="bibr" target="#b11">(Lin 1998</ref>). If the lowest common ancestor of two values is close to the leaf nodes, both values share many characteristics. 
As the lowest common ancestor moves up in the hierarchy, fewer commonalities exist between a given pair of values.</p><p>For the given dataset, we can map the 'Occupation' attribute to the O*net taxonomy<ref type="foot" target="#foot_0">1</ref> (Fig. <ref type="figure" target="#fig_0">1</ref>) by placing all the values at the corresponding leaf nodes in the occupation hierarchy, whereas intermediate nodes represent the lowest common ancestors for given pairs. In Fig. <ref type="figure" target="#fig_0">1</ref><ref type="foot" target="#foot_1">2</ref>, 'Computer Programmer' and 'Software Developer' are both subsumed by the lowest common ancestor 'Computer Occupations', whereas the lowest common ancestor that subsumes 'HR Manager' and 'Computer Programmer' is 'Occupation' (the root node of the occupation hierarchy). Hence, taking the lowest common ancestor into account, we expect the similarity between Computer Programmer and Software Developer to be significantly greater than the similarity between Computer Programmer and HR Manager.</p><p>Our intuition about semantic similarity is that two categorical values x and y whose lowest common ancestor c is far from the root node are always considered more semantically similar than two categorical values x and z whose lowest common ancestor is close to the root node. In addition, identical values should have a maximum similarity of 1.</p><p>In order to formulate the semantic similarity of values based on the lowest common ancestor, we use the idea of associating probabilities with the values <ref type="bibr" target="#b13">(Resnik 1970</ref>). We base ourselves on a function p : S → [0, 1] such that for any c ∈ S, p(c) represents the probability of the feature value being c. 
Furthermore, using information theory, we can state that the information content of a feature having some value is quantified as the negative log likelihood <ref type="bibr" target="#b14">(Ross 1976)</ref>.</p><p>For categorical data, we can find the information content I of the lowest common ancestor c from the probabilities of all the leaf values subsumed by c in the hierarchy:</p><formula xml:id="formula_2">I(c) = −log Σ_{n ∈ leaf(c)} p(n)<label>(3)</label></formula><p>where leaf(c) is the set of all leaf values x ∈ S such that x ⪯ c. The probability of a leaf value may be estimated by its relative frequency<ref type="foot" target="#foot_2">3</ref></p><formula xml:id="formula_3">p(n) = frequency(n) / N<label>(4)</label></formula><p>where N is the number of samples. Based on the above definitions, we formulate the information-content-based semantic similarity (ICS) between two categorical values x and y as</p><formula xml:id="formula_4">Sim(x, y) = 1 if x = y; Sim(x, y) = I(x ∪ y) / max(I(x ∪ y)) if x ≠ y<label>(5)</label></formula><p>where I(x ∪ y) denotes the information content of the lowest common ancestor of x and y, calculated using Equation 3, and max(I(x ∪ y)) represents the maximum information content over all given pairs of leaves and is used for normalization.</p><p>Hierarchy-based Semantic Similarity (HS) As explained earlier, the main intuition of semantic similarity is that any two values whose lowest common ancestor is close to the leaf nodes should have a high similarity, and vice versa. Hence, we quantify semantic similarity by considering the level of the lowest common ancestor in the hierarchy. The level of a node is defined as 1 + the number of connections between the node and the root<ref type="foot" target="#foot_3">4</ref>. The greater the level of the lowest common ancestor of a given pair of values in the hierarchy, the more similar the values are. We formulate the similarity as</p><formula xml:id="formula_5">Sim(x, y) = 1 if x = y; Sim(x, y) = λ^(d − level(x ∪ y)) if x ≠ y<label>(6)</label></formula><p>Below, we explain how to perform the evaluation of the proposed techniques.</p></div>
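Under the tree assumption of Section 3.1, Equations 3-6 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hierarchy is a hypothetical child-to-parent map, leaf probabilities are relative frequencies as in Equation 4, and the normalizer i_max and the HS parameters λ and d (hierarchy depth) are passed in explicitly.

```python
import math

def lca(x, y, parent):
    """Lowest common ancestor x ∪ y in a tree-shaped hierarchy (child -> parent map)."""
    chain = {x}
    while x in parent:
        x = parent[x]
        chain.add(x)
    while y not in chain:
        y = parent[y]
    return y

def leaves_under(c, parent):
    """All leaf values subsumed by c, i.e. leaf(c) in Eq. (3)."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    out, stack = [], [c]
    while stack:
        node = stack.pop()
        if node in children:
            stack.extend(children[node])
        else:
            out.append(node)
    return out

def information_content(c, parent, freq, n):
    """Eq. (3): I(c) = -log of the summed leaf probabilities p(n) = frequency(n)/N (Eq. 4)."""
    p = sum(freq.get(leaf, 0) for leaf in leaves_under(c, parent)) / n
    return -math.log(p)

def ics_similarity(x, y, parent, freq, n, i_max):
    """Eq. (5): information content of the lowest common ancestor, normalized by i_max."""
    if x == y:
        return 1.0
    return information_content(lca(x, y, parent), parent, freq, n) / i_max

def level(node, parent):
    """Level of a node: 1 + number of edges between the node and the root."""
    lvl = 1
    while node in parent:
        node = parent[node]
        lvl += 1
    return lvl

def hs_similarity(x, y, parent, depth, lam=0.5):
    """Eq. (6): λ^(d - level(x ∪ y)); λ and the depth d are free parameters here."""
    if x == y:
        return 1.0
    return lam ** (depth - level(lca(x, y, parent), parent))
```

A weighted sum of such per-attribute similarities, as in Equation 2, then gives the global similarity of two data objects.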
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation</head><p>In this section, we compare the ICS and HS approaches to other similarity measures for the identification of reservoir analogues of a target reservoir, given a dataset of known reservoirs. This use case is further explained in Section 4.2 below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Baseline Methods</head><p>The following four state-of-the-art similarity/distance measures are compared with the proposed techniques: the Occurrence Frequency (OF) measure <ref type="bibr">(Boriah, Chandola, and Kumar 2008)</ref>, the Eskin similarity measure <ref type="bibr">(Boriah, Chandola, and Kumar 2008;</ref><ref type="bibr" target="#b6">Eskin et al. 2002)</ref>, the Lin similarity measure <ref type="bibr" target="#b11">(Lin 1998)</ref>, and Coupled Metric Similarity (CMS) <ref type="bibr" target="#b9">(Jian et al. 2018)</ref>.</p><p>We compare the performance of the different similarity measures in a recommendation scenario: given a query item, we compute its similarity to each item in the 'training' dataset using Equation <ref type="formula" target="#formula_1">2</ref> and determine the top k items with the highest similarity.</p><p>For our evaluation, we do this for each of the different similarity measures and compare the outcome to a fixed 'gold standard' list of items to determine the average precision.</p><p>For our experimental evaluation, we have chosen reservoir analogues (explained in the section below): a complex task in the Oil and Gas industry. To the best of our knowledge, there exists no standard machine learning system for solving this use case. The common industrial practice to date is to conduct a manual analysis by human experts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Reservoir Analogues</head><p>In the Oil and Gas industry, during the exploration phase, analogous reservoirs are used to study reservoirs that lack critical information. Any reservoir with a deficit of critical information is known as a "target reservoir", and "analogous reservoirs" are ones expected to have similar characteristics <ref type="bibr">(Martín Rodríguez et al. 2013)</ref>.</p><p>Usually, a technical evaluation team must analyze various data types -seismic, well logs, tests, and cores -in order to make a first approximation of analogous reservoirs. Due to a lack of resources and time constraints, the first approximation is usually based on the neighboring reservoirs, which provide an estimate of the fluid and rock properties of the target reservoir. A single analogue is mostly used because it is in the same geographic region or basin. This is risky, however, since it does not always give sufficient information to characterize a new prospect. Furthermore, it becomes a tedious task for new target reservoirs where no neighboring reservoir exists.</p><p>Limited efforts have been made to identify analogues based on machine learning <ref type="bibr">(Martín Rodríguez et al. 2013;</ref><ref type="bibr" target="#b12">Perez-Valiente et al. 2014)</ref>. In order to generate a list of reservoirs ranked by similarity, it is important to automate this process using a standard knowledge source and to develop a method that is flexible enough to produce analogues for reservoirs with no neighboring analogues.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Dataset</head><p>The main source of information used in this evaluation is the dataset of reservoirs licensed by IHS<ref type="foot" target="#foot_4">5</ref> . It comprises a total of 43000 reservoirs and various properties/attributes associated with each reservoir. According to domain experts, only a few key parameters are known during the initial stage of </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Semantic Information for Attributes</head><p>This section explains the process of standardizing the semantic information used in the calculation of similarity. Due to data confidentiality, we only explain two attributes, 'Age' and 'Lithology'. Reservoir Age: A geologic age is a subdivision of geologic time that divides an epoch into smaller parts. A succession of rock strata laid down in a single age on the geologic timescale is a stage. Geological time is divided into eras, periods, and epochs. The named divisions of geological time are based on fossil evidence. Fig. <ref type="figure" target="#fig_1">2</ref> shows a part of an ontology developed to show how geological times are organized into Erathem, Period, Epoch, and Age.</p><p>Note that age can also be given on a linear scale, e.g. in millions of years. However, the characteristics of rocks deposited in different geologic eras, periods, and epochs differ so much that their position in the hierarchy is a much better indicator of similarity than the numerical difference in age. Lithology: The lithology of a rock unit is a description of its physical characteristics visible at outcrop, in hand or core samples, or under low-magnification microscopy, such as color, texture, grain size, or composition. There is no standard ontology for lithology. With the help of geologists, we developed an ontology that considers all the categorical values occurring in the data and groups them based on similar physical characteristics. In Fig. <ref type="figure" target="#fig_2">3</ref>, we show a part of this ontology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Data Pre-processing</head><p>The main challenge associated with the given data is the large number of categorical values associated with each attribute. For the attribute 'Age', there are about 250 unique values. These values are not standardized; hence, there are instances where the same category exists in the dataset under various names. Furthermore, most of the age values are unofficial names that are used only in a few specific areas of the world. With the help of geological experts, we replaced these unofficial names with standard domain names.</p><p>For the attribute 'Depositional Environment', there are 32 unique values in the given data set. Some categorical values are merged based on shared geological properties identified by domain experts.</p><p>In the original data set, there are 1731 categories for the attribute 'Lithology'. The raw values of lithology contain abbreviations for the same lithology, unofficial lithology names, and combinations of various lithologies. These categories are replaced with the standard names, and combinations are replaced with only the primary lithology, which leads to 228 unique categories.</p><p>Outliers are extreme values that deviate from the other observations in the data; they may indicate variability in measurement, experimental errors, or a novelty. In order to avoid adverse effects on the results of the statistical analysis, a step is added to identify, analyze, and delete outliers in the dataset. In this step, for every attribute, we remove the values that do not conform to standard domain names.</p><p>After cleaning the data, the comparative evaluation of ICS, HS, and the existing similarity algorithms is conducted.</p></div>
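The replace-and-drop steps above can be sketched with a simple mapping table. The raw-to-standard pairs below are invented placeholders, not the actual mapping produced with the domain experts.

```python
# Hypothetical raw -> standard name mapping (the real table was built with geologists)
canonical = {"ss": "Sandstone", "Sandst.": "Sandstone", "lst": "Limestone"}
standard_names = set(canonical.values())

def clean(values):
    """Map raw categories to standard domain names; drop values that
    conform to neither (the outlier-removal step described above)."""
    out = []
    for v in values:
        v = canonical.get(v, v)
        if v in standard_names:
            out.append(v)
    return out
```

Applying the same idea per attribute (ages, depositional environments, lithologies) yields the reduced, standardized value sets described above.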
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Evaluation Method</head><p>For the given task, we evaluate the similarity measures on two main objectives.</p><p>• Retrieving the top 15 analogues most similar to the target reservoir.</p><p>• Producing the result in ranked order, such that the first retrieved analogue corresponds to the reservoir most similar to the target reservoir.</p><p>Mean Average Precision (MAP) is the most commonly used evaluation metric in information retrieval and object detection <ref type="bibr" target="#b3">(Baeza-Yates and Ribeiro-Neto 2008)</ref>. MAP is the arithmetic mean of the average precision (AP) values for an information retrieval system over a set of n query topics (Liu Ling 2009). It can be expressed as follows:</p><formula xml:id="formula_6">MAP = (1/n) Σ_{i=1}^{n} AP_i<label>(7)</label></formula><p>Precision for a classification task is defined as</p><formula xml:id="formula_7">Precision = TruePositive / (TruePositive + FalsePositive)<label>(8)</label></formula><p>Based on Equation 8, recommender system precision (P) is defined as P = (# of our recommendations that are relevant) / (# of items we recommended) (9)</p><p>For evaluating the performance of recommender systems, we are only interested in recommending the top-N items to the user. Usually, the higher the number of relevant recommendations at the top, the more positive the users' impression. Therefore, it is sensible to compute the precision and recall metrics over the first N items instead of over all items. Thus precision at a cutoff k is introduced in order to evaluate the ranking, where k is an integer set by the user to match the objective of the top-N recommendations. 
Average precision at cutoff k is the average of the precisions at the ranks where a recommendation is a true positive, and is defined as follows:</p><formula xml:id="formula_8">AP_q@K = (1/K) Σ_{i=1}^{K} P(i) · Rel(i)<label>(10)</label></formula><p>where K represents the top K recommendations for the given query q and Rel(i) indicates the relevance of the i-th recommendation: Rel(i) is 1 if the recommended item was relevant (a true positive) and 0 otherwise. Usually, the performance of a recommendation system is calculated over a set of queries. Therefore, given a set of queries Q, the mean average precision (MAP_Q@K) of an algorithm is defined as</p><formula xml:id="formula_9">MAP_Q@K = (1/|Q|) Σ_{q=1}^{|Q|} AP_q@K<label>(11)</label></formula><p>where AP_q@K is calculated using Equation <ref type="formula" target="#formula_8">10</ref>.</p></div>
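The metrics in Equations 10 and 11 can be sketched directly in code. This is a minimal illustration of the definitions, not the authors' evaluation script; note that, following Equation 10, the sum is divided by K rather than by the number of relevant items.

```python
# Sketch of the evaluation metrics of Equations 10 and 11:
# precision at each rank where the recommendation is a true positive,
# summed over the top K and divided by K, then averaged over queries.

def average_precision_at_k(recommended, relevant, k):
    """AP_q@K as in Equation 10: (1/K) * sum of P(i) * Rel(i)."""
    hits = 0
    score = 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:      # Rel(i) = 1
            hits += 1
            score += hits / i     # P(i) at a true-positive rank
    return score / k

def mean_average_precision_at_k(queries, k):
    """MAP_Q@K as in Equation 11; queries is a list of
    (recommended_list, relevant_set) pairs."""
    total = sum(average_precision_at_k(r, rel, k) for r, rel in queries)
    return total / len(queries)
```

For a ranking ['a', 'b', 'c'] with relevant set {'a', 'c'} and K = 3, the true positives at ranks 1 and 3 give AP = (1/1 + 2/3) / 3.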
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.7">Experimental Results</head><p>There is no standard way to evaluate measures of semantic similarity. Resnik uses similarity rankings produced by human experts to judge similarity <ref type="bibr" target="#b13">(Resnik 1995)</ref>. We follow the same approach. In order to perform this evaluation, we selected two target reservoirs, 'Snorre' and 'Snøhvit.' We then asked our domain experts to produce a gold set for each reservoir. This gold set contains the reservoirs identified by our experts as most similar to the target reservoir, based on their background knowledge about the target reservoir. Furthermore, the gold set is ranked: the first item in the list corresponds to the most similar analogue and the last item to the least similar reservoir.</p><p>After acquiring the gold dataset, we perform an experimental evaluation to compare the performance of the proposed techniques with three existing similarity measures (OF <ref type="bibr">(Boriah, Chandola, and Kumar 2008)</ref>, Eskin <ref type="bibr">(Eskin et al. 2002)</ref>, CMS <ref type="bibr" target="#b9">(Jian et al. 2018)</ref>) for finding reservoir analogues. For each selected target reservoir, all the remaining reservoirs in the dataset are given as input to each similarity measure, and the similarity between the target and all remaining reservoirs is calculated. The top 15 reservoirs with the highest similarity are retrieved and are referred to as analogues of the target reservoir.</p><p>In order to penalize poor estimations, we use Average Precision (Equation <ref type="formula" target="#formula_8">10</ref>) as the quality criterion for evaluating the similarity between reservoirs. For this metric, a higher value corresponds to better results. 
Table <ref type="table" target="#tab_2">3</ref> shows the experimental result of each similarity measure separately for each target reservoir<ref type="foot" target="#foot_5">6</ref>.</p><p>As shown in Table 3, the ICS and HS measures outperform the data-driven similarity measures for both selected reservoirs. For the target reservoirs 'Snorre' and 'Snohvit', the average precision for ICS is 39% and 57%, respectively, which is higher than the average precision of the other similarity measures. For HS, the average precision for 'Snorre' and 'Snohvit' is 59% and 66%. Further, Table <ref type="table" target="#tab_2">4</ref> shows that the MAP (Equation <ref type="formula" target="#formula_9">11</ref>) for ICS and HS is 48% and 63% respectively, which is significantly better than the MAP values of the other algorithms. This evaluation supports the initial hypothesis that by adding domain information to the similarity measure, we can increase similarity performance for complex categorical data.</p><p>It is important to note that the results obtained using ICS and HS are not directly comparable with the gold set provided by the human experts. In order to produce the gold set, the human experts take into account the geological history of the basin, an analysis of geological time periods, and the overall processes of formation of the reservoir rocks. Furthermore, they also use conceptual facies models, reservoir simulation models, core samples and well logs for selecting appropriate analogues. In contrast, our experimental evaluation of the proposed technique is based on only a limited part of this information. Achieving 63% precision in this scenario is therefore remarkable: the analogues in the top 15 recommendations are retrieved based only on a hierarchy-based semantic measure.</p></div>
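The retrieval procedure used in the experiments, scoring every candidate against the target and keeping the top 15, can be sketched generically. The `similarity` parameter is a stand-in for any of the compared measures (ICS, HS, OF, Eskin, CMS); the function itself is an illustrative assumption, not the authors' code.

```python
# Sketch of analogue retrieval: score every candidate reservoir against
# the target with a pluggable similarity function and keep the top k.
# `similarity` stands in for ICS, HS, OF, Eskin or CMS.

def top_k_analogues(target, candidates, similarity, k=15):
    """Return the k candidates most similar to `target`, best first."""
    scored = [(similarity(target, c), c) for c in candidates if c != target]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```

Evaluating a measure then amounts to comparing this ranked list against the expert gold set with AP@15, as in the tables above.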
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion &amp; Future Work</head><p>Computing a similarity measure in an unsupervised setting is a complex task. In this paper, we propose a method based on domain information extracted in the form of is-a links from a concept hierarchy. The experimental results in the previous section show that by using domain information, the results are significantly better than those of traditional methods that find similarity based only on frequency match/mismatch. In our current work, we approach the problem via the lowest common ancestor in the concept hierarchy, considering mono-hierarchies only and an unsupervised setting. In the future, we want to extend the notion of similarity for categorical data to a supervised setting for complex use cases such as mortality prediction in the medical domain. Furthermore, the idea can be extended to find similarity for categorical data in poly-hierarchies (i.e. not tree-shaped).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: O*net Occupation Taxonomy</figDesc><graphic coords="5,127.13,54.00,357.74,183.96" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The hierarchy of geologic age.</figDesc><graphic coords="6,54.00,54.00,238.50,226.78" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Ontology showing IS-A relationships for Lithology</figDesc><graphic coords="6,319.50,54.00,238.49,226.77" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Toy Dataset</figDesc><table><row><cell>User ID</cell><cell>Occupation</cell><cell>Education</cell></row><row><cell>1</cell><cell>Computer Programmer</cell><cell>Bachelors</cell></row><row><cell>2</cell><cell>Administrative Staff</cell><cell>Bachelors</cell></row><row><cell>3</cell><cell>HR Manager</cell><cell>Bachelors</cell></row><row><cell>4</cell><cell>HR Manager</cell><cell>Masters</cell></row><row><cell>5</cell><cell>Software Developer</cell><cell>Bachelors</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Occupation Activities and Skills</figDesc><table><row><cell>Occupation</cell><cell>Work Activity</cell><cell>Skills</cell></row><row><cell>HR Manager</cell><cell>Liaise between departments</cell><cell>PeopleSoft, SAP</cell></row><row><cell>Computer Programmer</cell><cell>Write programming code</cell><cell>C++, Java, Python</cell></row><row><cell>Software Developers</cell><cell>Modify software programs</cell><cell>C++, Oracle, Python</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Average Precision (%) for the selected target reservoirs</figDesc><table><row><cell>Reservoir</cell><cell>ICS</cell><cell>HS</cell><cell>OF</cell><cell>CMS</cell><cell>Eskin</cell></row><row><cell>Snorre</cell><cell>39</cell><cell>59</cell><cell>39</cell><cell>40</cell><cell>34</cell></row><row><cell>Snohvit</cell><cell>57</cell><cell>66</cell><cell>15</cell><cell>29</cell><cell>27</cell></row><row><cell cols="6">Table 4: Mean Average Precision (%)</cell></row><row><cell></cell><cell>ICS</cell><cell>HS</cell><cell>OF</cell><cell>CMS</cell><cell>Eskin</cell></row><row><cell>MAP</cell><cell>48</cell><cell>63</cell><cell>27</cell><cell>35</cell><cell>30</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.onetcenter.org/taxonomy.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.bls.gov/soc/soc structure 2010.pdf</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">Probabilities may also be known from other sources, for instance known priors for the specific domain.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">Level starts from 1 and the level of the root is 1.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://ihsmarkit.com/index.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">Similarity measure proposed by Lin<ref type="bibr" target="#b11">(Lin 1998</ref>) doesn't retrieve any similar analogues in the top k-recommendations. Therefore, results are not included in table 3.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="110" to="118" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey of distance/similarity measures for categorical data</title>
		<author>
			<persName><forename type="first">M</forename><surname>Alamuri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Surampudi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Negi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Joint Conference on Neural Networks</title>
				<meeting>the International Joint Conference on Neural Networks</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1907" to="1914" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Categorical variables in regression analysis: A comparison of dummy and effect coding</title>
		<author>
			<persName><forename type="first">H</forename><surname>Alkharusi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Education</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="202" to="210" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Modern Information Retrieval: The Concepts and Technology Behind Search</title>
		<author>
			<persName><forename type="first">R</forename><surname>Baeza-Yates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ribeiro-Neto</surname></persName>
		</author>
		<imprint>
			<publisher>Addison-Wesley Publishing Company</publisher>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3a">
	<analytic>
		<title level="a" type="main">Similarity measures for categorical data: A comparative evaluation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Boriah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chandola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the SIAM International Conference on Data Mining</title>
				<meeting>the SIAM International Conference on Data Mining</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="243" to="254" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A weighted nearest neighbor algorithm for learning with symbolic features</title>
		<author>
			<persName><forename type="first">S</forename><surname>Cost</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Salzberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="57" to="78" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Context-based similarity measures for categorical databases</title>
		<author>
			<persName><forename type="first">G</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mannila</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Lecture Notes in Computer Science</title>
		<imprint>
			<biblScope unit="volume">1910</biblScope>
			<biblScope unit="page" from="201" to="210" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Contrast coding in multiple regression analysis: Strengths, weaknesses, and utility of popular coding structures</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Davis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6a">
	<analytic>
		<title level="a" type="main">A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data</title>
		<author>
			<persName><forename type="first">E</forename><surname>Eskin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arnold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Prerau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Portnoy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Stolfo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Applications of Data Mining in Computer Security</title>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6b">
	<analytic>
		<title level="a" type="main">Classical resemblance measures</title>
		<author>
			<persName><forename type="first">F</forename><surname>Esposito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Malerba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tamma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-H</forename><surname>Bock</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Analysis of Symbolic Data</title>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A mathematical model for taxonomy</title>
		<author>
			<persName><forename type="first">P</forename><surname>Gambaryan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SSR</title>
		<imprint>
			<biblScope unit="page" from="47" to="53" />
			<date type="published" when="1964">1964</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Measuring similarity between sentence fragments</title>
		<author>
			<persName><forename type="first">G</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sheng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics</title>
				<meeting>the 2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics<address><addrLine>IHMSC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012. 2012</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="327" to="330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Unsupervised coupled metric similarity for non-iid categorical data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Jian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering PP</title>
		<imprint>
			<biblScope unit="page" from="1" to="1" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Combining Local Context and WordNet Similarity for Word Sense Identification</title>
		<author>
			<persName><forename type="first">C</forename><surname>Leacock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chodorow</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>MIT Press</publisher>
			<biblScope unit="volume">49</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">An information-theoretic definition of similarity</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifteenth International Conference on Machine Learning</title>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11a">
	<monogr>
		<title level="m" type="main">Encyclopedia of Database Systems</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Özsu</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11b">
	<analytic>
		<title level="a" type="main">Measures of semantic similarity and relatedness in the medical domain</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pedersen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11c">
	<analytic>
		<title level="a" type="main">New approach to identify analogue reservoirs</title>
	</analytic>
	<monogr>
		<title level="j">SPE Economics &amp; Management</title>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Identification of reservoir analogues in the presence of uncertainty</title>
		<author>
			<persName><forename type="first">M</forename><surname>Perez-Valiente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Santos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vieira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Embid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SPE Intelligent Energy Conference and Exhibition</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Using information content to evaluate semantic similarity in a taxonomy</title>
		<author>
			<persName><forename type="first">P</forename><surname>Resnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI 95</title>
				<imprint>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">A First Course in Probability</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Ross</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1976">1976</date>
			<publisher>Pearson Education, Inc</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Development and application of a metric on semantic nets</title>
		<author>
			<persName><forename type="first">R</forename><surname>Rada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Mili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Bicknell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blettner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Systems, Man, and Cybernetics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="17" to="30" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">International Encyclopedia of the Social &amp; Behavioral Sciences</title>
		<author>
			<persName><forename type="first">N</forename><surname>Smelser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Baltes</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A statistical interpretation of term specificity and its application in retrieval</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sparck Jones</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Documentation</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="page" from="493" to="502" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Toward memory-based reasoning</title>
		<author>
			<persName><forename type="first">C</forename><surname>Stanfill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Waltz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="1213" to="1228" />
			<date type="published" when="1986">1986</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A new similarity index based on probability</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Goodall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrics</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<date type="published" when="1966">1966</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Improved heterogeneous distance functions</title>
		<author>
			<persName><forename type="first">D</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Martinez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. of Artif. Intell. Res</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
