<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Automated Taxonomy Building by Adopting Discriminant and Characteristic Capabilities</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Giuliano</forename><surname>Armano</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Electrical and Electronic Engineering</orgName>
								<orgName type="institution">University of Cagliari Via</orgName>
								<address>
									<addrLine>Marengo 2</addrLine>
									<postCode>09123</postCode>
									<settlement>Cagliari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Giuliani</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Electrical and Electronic Engineering</orgName>
								<orgName type="institution">University of Cagliari Via</orgName>
								<address>
									<addrLine>Marengo 2</addrLine>
									<postCode>09123</postCode>
									<settlement>Cagliari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Emanuele</forename><surname>Tamponi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Electrical and Electronic Engineering</orgName>
								<orgName type="institution">University of Cagliari Via</orgName>
								<address>
									<addrLine>Marengo 2</addrLine>
									<postCode>09123</postCode>
									<settlement>Cagliari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Automated Taxonomy Building by Adopting Discriminant and Characteristic Capabilities</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B04336D0B363397BF548B0360F9632D4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:41+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Taxonomies are becoming essential in several fields, playing an important role in a large number of applications, particularly in specific domains. They provide efficient tools for organizing a huge amount of information into a compact hierarchical structure. Taxonomies were originally built by hand, but nowadays technology makes it possible to produce a vast amount of information; consequently, recent research activities have focused on automated taxonomy generation. In this paper, we propose a novel approach for automatically building a taxonomy, starting from a set of categories. We deem that, in a hierarchical structure, each node should be represented by proper meaningful and discriminant features, instead of a fixed feature space. Our proposal relies on two metrics able to identify the most meaningful features. Our conjecture is that a feature can significantly change its discriminant power (hence, its role) along the taxonomy levels. We therefore devise a greedy algorithm that builds a taxonomy by identifying the meaningful terms for each level. Preliminary experiments confirm the usefulness of the proposed approach.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Taxonomies play an important role in a growing number of applications, particularly in specific domains. Originally built by hand, taxonomies have recently become the target of automatic construction. In particular, a crucial issue in taxonomy building is the choice of the most suitable features (e.g., meaningful terms in textual documents). We deem that, in a hierarchical structure, a node should be identified by proper discriminant terms, rather than by a single feature space defined for the entire taxonomy. In this paper, we define a novel approach for automatically building a taxonomy, starting from a set of categories. We adopt two novel metrics, the discriminant capability and the characteristic capability <ref type="bibr" target="#b0">[1]</ref>: the former grows with the ability of a feature to distinguish a given category from the others, whereas the latter grows with how frequent and common the feature is over all categories. Our conjecture is that a feature can change its role, depending on its discriminant power, along the taxonomy levels. We assert that this behavior can be exploited to devise an automatic taxonomy building approach. In so doing, we propose an algorithm able to build taxonomies by identifying the meaningful terms for each level. In this work, the underlying scenario is text categorization: source items are textual documents (e.g., webpages, online news, scientific papers, or e-books), and the features are the terms in the documents.</p><p>The rest of the paper is organized as follows: Section 2 presents the background and related work on taxonomy generation; Section 3 describes the discriminant and characteristic capabilities; Section 4 describes and details the proposed algorithm; experiments are reported in Section 5; Section 6 ends the paper with conclusions and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head><p>Nowadays, taxonomies are indispensable to a huge number of applications. For example, in web search, organizing domain-specific queries into a hierarchy can help to better understand the queries and improve search results <ref type="bibr" target="#b12">[13]</ref>, or to improve query refinement <ref type="bibr" target="#b10">[11]</ref>. Originally built by hand, taxonomies have recently become the target of automatic generation. Several works have been devoted to taxonomy induction, in particular to automatically creating a domain-specific ontology or taxonomy <ref type="bibr" target="#b6">[7]</ref><ref type="bibr" target="#b7">[8]</ref><ref type="bibr" target="#b8">[9]</ref>. Several works are based on hierarchical clustering algorithms, of which there are two main approaches: agglomerative (bottom-up) and divisive (top-down). The former regards each data item as a cluster and recursively merges clusters; the latter considers the entire dataset as one cluster and recursively splits it. Both approaches end when a stop criterion is met. Hierarchical agglomerative clustering (HAC, hereinafter) has been widely adopted for building hierarchies, e.g., in the work of Li et al. <ref type="bibr" target="#b5">[6]</ref>, which proposes an algorithm able to build a dendrogram (basically, a binary tree). On the other hand, further works propose divisive approaches, as in the work of Punera et al. <ref type="bibr" target="#b9">[10]</ref>. Chuang et al. proposed a hybrid approach in which, essentially, the binary tree obtained from HAC is modified by a divisive step in order to obtain a wide tree with multiple children <ref type="bibr" target="#b4">[5]</ref>.</p><p>An important task in taxonomy building is to recognize the most meaningful features. Our insight is that, due to the hierarchical structure, each node should be represented by proper meaningful terms, instead of a fixed vocabulary for the entire structure. We deem that each document collection is unique, making it useful to devise methods and algorithms able to automatically build a distinct list of meaningful features for each collection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">The Adopted Metrics</head><p>In this Section we describe the adopted metrics and their properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Discriminant and Characteristic Capabilities</head><p>The metrics have been devised for both classifier performance assessment and feature selection <ref type="bibr" target="#b0">[1]</ref>. We apply them to feature selection, as they are able to evaluate the discriminant (δ) and characteristic (ϕ) capabilities of each feature.</p><p>In the underlying scenario of text categorization, δ measures the ability of a term to distinguish a given category C from the others, whereas ϕ measures to which extent a term is pervasive in the given set of documents. In formulas:</p><formula xml:id="formula_0">δ = #(t, C)/#(C) − #(t, C̄)/#(C̄)<label>(1)</label></formula><formula xml:id="formula_1">ϕ = #(t, C)/#(C) + #(t, C̄)/#(C̄) − 1<label>(2)</label></formula><p>where a generic term t contained in a document represents the binary feature under analysis, meaning that it can assume two values, depending on its presence or absence in the document. The meaning of each component in the formulas is the following: #(t, C) is the number of documents of C containing t; #(t, C̄) is the number of documents of C̄ containing t; #(C) is the total number of documents of C; #(C̄) is the total number of documents of C̄. Let us recall that in <ref type="bibr" target="#b0">[1]</ref> definitions are given for a binary problem, meaning that, for a given class C, the alternate category C̄ is the union of all the categories except C. </p></div>
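As an illustration, the two capabilities of Eq. 1 and Eq. 2 can be computed directly from the four document counts defined above. The following minimal sketch (function and parameter names are ours, for illustration only) shows the computation:

```python
def delta_phi(n_t_C, n_C, n_t_notC, n_notC):
    """Discriminant (delta) and characteristic (phi) capability of a
    binary term feature t for a category C against its complement C-bar.

    n_t_C    : documents of C containing t       -> #(t, C)
    n_C      : total documents of C              -> #(C)
    n_t_notC : documents of C-bar containing t   -> #(t, C-bar)
    n_notC   : total documents of C-bar          -> #(C-bar)
    """
    p = n_t_C / n_C          # relative frequency of t inside C
    q = n_t_notC / n_notC    # relative frequency of t inside C-bar
    delta = p - q            # Eq. (1): ability to separate C from C-bar
    phi = p + q - 1          # Eq. (2): pervasiveness over the whole domain
    return delta, phi
```

For instance, a term occurring in every document of C and in no document of C̄ yields δ = +1 and ϕ = 0, while a stopword occurring in every document of the domain yields δ = 0 and ϕ = +1.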
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Terms Roles</head><p>Both metrics range from -1 to +1 and show an orthogonal behavior; furthermore, it has been proved that the ϕ − δ space is constrained by a rhomboidal shape <ref type="bibr" target="#b0">[1]</ref>. In this context, a term plays a distinct role in each category, depending on the rhombus region in which it falls. Important terms for text classification appear in the upper and lower corners of the rhombus in Figure <ref type="figure" target="#fig_0">1</ref>, as they have high values of |δ|. In particular, a high positive value of δ (the region marked as δ+ in Figure <ref type="figure" target="#fig_0">1</ref>) means that the term frequently occurs in C and is rare in C̄; ideally, δ is +1 when the term occurs in all documents of C and no document of C̄ contains it. Conversely, a high negative value of δ (the δ− region) means that the term frequently occurs in C̄ and is rare in C; ideally, δ = −1 means that all documents of C̄ contain the term and no document of C contains it. As for the characteristic capability, terms that rarely occur over the entire domain are expected to appear in the left corner of the rhombus (ϕ−), while stopwords are expected to appear in the right-hand corner (ϕ+). Ideally, ϕ = +1 when the term occurs in every document of the entire domain, whereas ϕ = −1 when the term is completely absent from the domain. Figure <ref type="figure" target="#fig_0">1</ref> outlines the expected behavior in all cases.</p><p>Terms falling in ϕ+ do not necessarily represent typical stopwords only (i.e., common articles, nouns, conjunctions, verbs, and adverbs). Rather, domain-dependent stopwords are also located in that area <ref type="bibr" target="#b2">[3]</ref>.</p><p>Moreover, theoretically, if a term has a zero value for both δ and ϕ in a given category C, it is equally distributed over the domain: half of the documents of C contain the term, and half of the documents of the alternate category C̄ contain it as well. If a term is projected close to the origin of the space, there is uncertainty in considering the term a stopword, irrelevant, or discriminant<ref type="foot" target="#foot_0">1</ref>.</p><p>In a previous work, a preliminary analysis of how each feature changes its role along taxonomy nodes was performed <ref type="bibr" target="#b1">[2]</ref>, showing that a discriminant term tends to become irrelevant when moving up along a taxonomy path. Furthermore, a domain-dependent stopword becomes discriminant in the upper levels, which gives rise to the relevance of such terms.</p></div>
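The region-based reading of the ϕ − δ space can be sketched as a small classifier of term roles. Note that the paper delimits the regions graphically (Fig. 1 and Fig. 2), so the numeric threshold used here is purely an illustrative assumption of ours:

```python
def term_role(phi, delta, thr=0.5):
    """Map a (phi, delta) point of the rhomboidal space to the region
    names of Fig. 1. `thr` is an assumed, illustrative region boundary;
    the paper defines the areas graphically rather than numerically."""
    if delta >= thr:
        return "delta+"        # discriminant for C
    if delta <= -thr:
        return "delta-"        # discriminant for C-bar
    if phi >= thr:
        return "phi+"          # stopword (possibly domain-dependent)
    if phi <= -thr:
        return "phi-"          # rare / irrelevant term
    return "uncertain"         # close to the origin of the space
```

Points near the origin are reported as "uncertain", mirroring the ambiguity discussed above.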
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Methodology</head><p>The algorithm proposed in this work follows a bottom-up approach to build a hierarchy tree, starting from a set of leaf categories over a corpus of documents. The information needed to initialize the algorithm is, for each category: a) the number of documents falling into the category and b) for each term t in the corpus, the number of documents containing t.</p><p>In the previous work on term roles in a taxonomy <ref type="bibr" target="#b1">[2]</ref>, the metrics were computed by considering a binary problem, in which the alternate category is the union of all siblings of the positive node. The devised algorithm adopts a generalization to the multi-category case. After these preliminary insights, we present our algorithm for taxonomy building.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Identifying Characteristic and Discriminant Terms</head><p>As discussed in the previous section, the properties of a term depend on the region of the ϕ − δ rhombus in which it falls. Let us define two regions: the characteristic (A_ϕ) and the discriminant (A_δ) areas. In this preliminary study, we simply identify the areas according to the schema depicted in Figure <ref type="figure" target="#fig_1">2</ref>. We say that a term is characteristic (discriminant) of two categories C_i and C_j if it falls in the characteristic (discriminant) area:</p><formula xml:id="formula_3">Φ(t; C_i, C_j) = 1 if (ϕ, δ) ∈ A_ϕ, 0 otherwise<label>(3)</label></formula><formula xml:id="formula_4">∆(t; C_i, C_j) = 1 if (ϕ, δ) ∈ A_δ, 0 otherwise<label>(4)</label></formula><p>These equations need to be generalized to a hierarchy. We define a hierarchy as a series of levels, each represented by a layer of categories. Each layer contains the categories belonging to the respective level of the hierarchy, grouped by their parent category. For example, if a level contains the categories A, B, C, D, E, where A, B, C have X as parent category and D, E have Y as parent category, the layer for that level is defined as {A, B, C}, {D, E}. We call {A, B, C} and {D, E} sibling groups. We can see a layer as the partition of the categories of a level induced by their sibling relations.</p><p>We define the indicator function Φ for the characteristic terms of a sibling group S_k as follows:</p><formula xml:id="formula_5">Φ(t; S_k) = 1 if Σ_{i=1..|S_k|} Φ(t; C_i, C̄_i) ≥ |S_k|/2, 0 otherwise<label>(5)</label></formula><p>where C̄_i is the alternate category of a given class C_i, i.e., the union of the siblings of C_i.</p><p>In a similar way, we can define the indicator function ∆ for the discriminant terms of a sibling group in a layer L:</p><formula xml:id="formula_7">∆(t; S_k, L) = ∆(t; P(S_k), P(S̄_k))<label>(6)</label></formula><p>where S̄_k is the union of the remaining sibling groups, and P(S) indicates the parent node of a sibling group S. Using Eq. 5 and Eq. 6 we can now define the set of characteristic terms T_ϕ and the set of discriminant terms T_δ for each sibling group in a layer:</p><formula xml:id="formula_9">T_ϕ(S_k) = {t : Φ(t; S_k) = 1}<label>(7)</label></formula><formula xml:id="formula_10">T_δ(S_k, L) = {t : ∆(t; S_k, L) = 1}<label>(8)</label></formula></div>
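A minimal sketch of Eq. 5 and Eq. 7, assuming a caller-supplied test for Eq. 3 (i.e., whether the (ϕ, δ) point of a term, for a category against the union of its siblings, falls in A_ϕ); all names here are ours, for illustration:

```python
def phi_indicator_group(term, siblings, is_characteristic):
    """Eq. (5): a term is characteristic of a sibling group S_k when it
    is characteristic of at least half of the group's categories, each
    tested against the union of its siblings.

    `is_characteristic(term, c, others)` plays the role of
    Phi(t; C_i, C_i-bar) of Eq. (3); its actual definition (membership
    of (phi, delta) in the area A_phi) is assumed to be given."""
    votes = sum(
        is_characteristic(term, c, [s for s in siblings if s is not c])
        for c in siblings
    )
    return 1 if votes >= len(siblings) / 2 else 0

def characteristic_terms(vocabulary, siblings, is_characteristic):
    """Eq. (7): the set T_phi(S_k) of characteristic terms of a group."""
    return {t for t in vocabulary
            if phi_indicator_group(t, siblings, is_characteristic)}
```

The discriminant set T_δ of Eq. 8 follows the same pattern, with the A_δ membership test of Eq. 4 applied to the parent categories as per Eq. 6.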
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Identifying the Optimal Layer</head><p>The proposed bottom-up algorithm is split into a series of optimal partition problems. For the current set of categories at hand (starting from the leaves), we strive to identify the optimal layer, that is, the optimal set of sibling groups. Once these groups are identified, the categories inside them are "collapsed" together to generate the parent categories, and the algorithm goes on recursively, layer by layer, until a stop condition is met.</p><p>To identify the optimal layer we use a target function derived from Eq. 7 and Eq. 8. The rationale behind this function has already been discussed in Sec. 3.2: the behavior we expect in a taxonomy is that a large number of characteristic terms of a sibling group become discriminant terms of the parent category. We define the target function, which we call layer score (G), as follows:</p><formula xml:id="formula_11">G(L) = Σ_{k=1..|L|} |T_ϕ(S_k) ∩ T_δ(S_k, L)|<label>(9)</label></formula><p>The optimization task that identifies the optimal layer L* can then be stated as follows:</p><formula xml:id="formula_12">L* = arg max_L G(L)<label>(10)</label></formula><p>Solving Eq. 10 exactly is an NP-hard problem, as the size of the search space grows with the Bell number <ref type="bibr" target="#b3">[4]</ref>. We propose a simple greedy algorithm, and we will show in the experimental section that the solutions it provides are good enough for a wide range of taxonomies.</p><p>The algorithm starts from the trivial layer, in which each sibling group contains exactly one category, and proceeds recursively. The layer under analysis is the current candidate. At each step, the algorithm considers all the layers reachable by moving exactly one category from its current sibling group to another group. It then evaluates the layer score of each of these layers and, if the maximum score exceeds the score of the current candidate, the corresponding layer becomes the new candidate. Otherwise, the current candidate is returned as the layer satisfying (approximately) Eq. 10.</p><p>Once the optimal layer has been chosen, each sibling group is collapsed to generate the parent categories (P):</p><formula xml:id="formula_13">P(S_k) = ⋃_{i=1..|S_k|} C_i<label>(11)</label></formula><p>In the rest of the discussion, it is implicit that some mechanism is in place to keep track of which categories (if any) are children of a given category.</p></div>
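The greedy search above can be sketched as follows, assuming the layer score G of Eq. 9 is supplied as a function; the neighborhood is the set of layers reachable by moving a single category to another group. Names and the toy score used in the usage note are ours:

```python
def neighbors(layer):
    """All layers reachable from `layer` (a list of frozensets of
    categories) by moving exactly one category to another group."""
    result = []
    for i, src in enumerate(layer):
        for cat in src:
            for j, dst in enumerate(layer):
                if i == j:
                    continue
                cand = [g for k, g in enumerate(layer) if k not in (i, j)]
                if len(src) > 1:           # keep the depleted group only
                    cand.append(src - {cat})   # if it is non-empty
                cand.append(dst | {cat})
                result.append(cand)
    return result

def greedy_optimal_layer(categories, layer_score):
    """Greedy approximation of Eq. (10): start from the trivial layer
    (one category per sibling group) and hill-climb on the layer score
    G (Eq. 9) until no single-category move improves it."""
    current = [frozenset([c]) for c in categories]
    score = layer_score(current)
    while True:
        cands = neighbors(current)
        if not cands:                      # a single group: nothing to move
            return current
        best = max(cands, key=layer_score)
        best_score = layer_score(best)
        if best_score <= score:            # no move improves G: stop
            return current
        current, score = best, best_score
```

With a toy score that rewards groups of same-prefix categories, `greedy_optimal_layer(["a1", "a2", "b1"], score)` merges the two "a" categories and leaves "b1" alone.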
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Growing the Taxonomy</head><p>We are now able to describe the entire algorithm succinctly. It expects the following inputs:</p><p>-the set of leaf categories C leaves ; -for each leaf category C i , all the data needed to calculate ϕ and δ, described in Sec. 3.1.</p><p>The output is the generated taxonomy T . The algorithm then proceeds iteratively, until the stop condition (Step 6) is met. There are two distinct loops: in the outer one, we build the taxonomy one level at a time; in the inner loop, we search for an approximation of the optimal layer. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head><p>Assessing an algorithm for the generation of taxonomies is a hard task. This is due to the fact that existing taxonomies are far from being "golden standards", that is, they are not precise enough to be taken as an absolute reference during the test phase. Moreover, many metrics exist to measure the agreement between two taxonomies, and none of them is universally accepted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Assessing the Learned Taxonomy</head><p>Let T_R be the reference taxonomy and T_L the learned taxonomy generated by the algorithm described in Sec. 4.3. We need to assess the learned taxonomy by comparing it with the reference.</p><p>To measure the agreement between T_R and T_L, we use the metric defined by Navigli <ref type="bibr" target="#b11">[12]</ref>. Let C_leaves be the leaves of both the reference taxonomy and the learned one (the two sets are identical, as the set of leaves is one of the inputs of the taxonomy generation algorithm). Let k be the depth of the reference taxonomy. If the depth of the learned taxonomy is at least k, then for each i ∈ 0, . . . , k we have two unwrapped layers, T_R^i and T_L^i. An unwrapped layer is the same as the layer defined in the previous section, but it is always expressed in terms of the leaf categories. That is, the grouping defined by an intermediate layer is expressed not in terms of the parent categories generated by collapsing their children, but in terms of the children themselves; we call them unwrapped layers as we are effectively unwrapping each collapsed sibling group down to the leaves. By definition, T_R^0 = T_L^0 = {C_leaves}, that is, the root layer contains a single group holding all the leaves.</p><p>To assess the agreement between the two taxonomies, we first measure the agreement between two unwrapped layers:</p><formula xml:id="formula_14">B_i = n_11^i / √((n_11^i + n_10^i) · (n_11^i + n_01^i))<label>(12)</label></formula><p>where n_11 is the number of pairs of leaves included in the same group in both layers, n_01 is the number of pairs of leaves included in the same group in the learned layer but not in the reference layer, and vice versa for n_10. The overall metric is then defined as:</p><formula xml:id="formula_15">B(T_R, T_L) = (2 / (k + 1)) · Σ_{i=0..k−1} ((i + 1) / k) · B_i<label>(13)</label></formula><p>It can easily be noticed that B gives higher weight to the deeper layers of the taxonomy. This is necessary to counterbalance the fact that, as fewer sibling groups are found in the upper layers, the probability that a pair of categories falls in the same group just by chance increases; those layers thus need a lower weight in the overall metric.</p><p>If the depth of the learned taxonomy differs from the depth of the reference one, we slightly adjust the previous definition:</p><p>-if k_L &gt; k_R: the layers after k_R are ignored; this way we give a higher score to more structured learned taxonomies; -if k_L &lt; k_R: the last layer of the learned taxonomy is repeated as many times as needed to reach k_R; this way, less structured taxonomies get a lower score. </p></div>
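Assuming Eq. 12 denotes a Fowlkes–Mallows-style pair-counting index (our reading of the formula, including the square root), the agreement between two unwrapped layers, each a list of sets of leaves, can be sketched as:

```python
from itertools import combinations
from math import sqrt

def layer_agreement(ref_layer, learned_layer):
    """Pair-counting agreement between two unwrapped layers, in the
    spirit of Eq. (12). Each layer is a list of sets of leaf
    categories partitioning the same leaf set."""
    leaves = sorted(set().union(*ref_layer))
    same = lambda layer, a, b: any(a in g and b in g for g in layer)
    n11 = n10 = n01 = 0
    for a, b in combinations(leaves, 2):
        in_ref = same(ref_layer, a, b)          # same group in reference
        in_learned = same(learned_layer, a, b)  # same group in learned
        if in_ref and in_learned:
            n11 += 1
        elif in_ref:
            n10 += 1
        elif in_learned:
            n01 += 1
    if n11 == 0:
        return 0.0
    return n11 / sqrt((n11 + n10) * (n11 + n01))
```

Identical layers score 1.0; a learned layer that splits every reference group scores 0.0. The overall B of Eq. 13 is then a weighted average of these per-layer values.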
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Reference Taxonomies</head><p>Experiments are performed on a collection of webpage documents. The dataset is extracted from the DMOZ taxonomy<ref type="foot" target="#foot_1">2</ref>. Aside from the leaves, each node is built as the union of its children's documents. Textual information is extracted from the code of each page, and each document is converted into a bag-of-words representation, each word being weighted with two values, δ and ϕ, computed by applying Equations 1 and 2. We built 19 small sub-taxonomies of DMOZ and grouped them in terms of "difficulty". Table <ref type="table" target="#tab_0">1</ref> shows the main properties of each taxonomy. "Easy" and "easy-medium" taxonomies have base categories that are very different from one another, so the leaves will also be clearly distinguishable and the grouping task should be easier. "Medium" taxonomies stem from the same root category, so the similarity between leaves increases. Finally, "medium-hard" and "hard" taxonomies have many more leaves and multiple root and intermediate categories. Fig. <ref type="figure">3</ref> and Fig. <ref type="figure">4</ref> show examples of taxonomies used in the experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Results</head><p>Tab. 2 contains a summary of the experimental results, and Fig. <ref type="figure" target="#fig_3">5</ref> shows an example of a taxonomy learned by the algorithm. Let us note that the generated taxonomy is very close to the original one, even from a qualitative point of view. The obtained scores confirm this quality in most cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions and Future Work</head><p>In this paper, a novel approach for automatically building a taxonomy has been proposed. The proposal adopts two novel metrics, the discriminant capability and the characteristic capability, able to measure the discriminant power of a feature: the former grows with the ability to distinguish a given category from the others, whereas the latter grows with how frequent and common the feature is over all categories. Our conjecture is that a feature can change its role, depending on its discriminant power, along the taxonomy levels. This behavior has been exploited to devise an algorithm based on the identification of the meaningful terms for each level. In particular, we devise a greedy bottom-up algorithm that recursively identifies the optimal layer for each level. The proposal, although still at a preliminary stage, provides encouraging results, as shown by the experiments. As future work, we are currently improving and refining the algorithm with different strategies. Furthermore, a set of comparative experiments with several state-of-the-art approaches is planned, in order to further demonstrate the usefulness of the proposed approach.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: The regions of the space.</figDesc><graphic coords="3,222.65,388.98,171.48,158.51" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: The A ϕ and A δ areas.</figDesc><graphic coords="5,238.51,209.59,138.33,123.14" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>1 .</head><label>1</label><figDesc>Add the current set of categories to the taxonomy T as a taxonomy level ; 2. Put the current set of categories in a trivial layer ; 3. Mark the trivial layer as the chosen layer Lc and calculate its score Gc; 4. Mark the next layer Ln = Lc, and next layer score Gn = Gc; 5. For each (unordered) pair of sibling groups in Lc: (a) Evaluate the score G h of the layer obtained by merging the pair of groups, L h ; (b) If G h &gt; Gn: set Ln = L h and Gn = G h ; 6. If Ln is still the trivial layer: the stop condition is met: return T . 7. If Ln = Lc: set Lc = Ln and Gc = Gn; go back to Step 5; 8. Otherwise: consider Lc the optimal layer and collapse it to generate the set of parent categories (use Eq. 11 on each of its sibling groups); go back to Step 1.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 5 :</head><label>5</label><figDesc>Fig.5: Learned taxonomy starting from the leaves in the "hard2" taxonomy. The Computer Science category has been correctly split from the other categories in the same levels, whereas the categories Hardware and Internet were first grouped together and then sub-split in a wrong way: On the Web and Web Design should be put along with E-mail and RFCs. The total score of the learned taxonomy is B = 0.669.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Taxonomies for the experiments. They were built starting from the DMOZ webpage taxonomy and organized to have different degrees of difficulty.</figDesc><table><row><cell>Name</cell><cell>Leaves</cell><cell cols="2">Base Intermediate</cell></row><row><cell>easy1</cell><cell>10</cell><cell>2</cell><cell>1</cell></row><row><cell>easy2</cell><cell>6</cell><cell>2</cell><cell>1</cell></row><row><cell>easy3</cell><cell>10</cell><cell>2</cell><cell>1</cell></row><row><cell>easy4</cell><cell>9</cell><cell>2</cell><cell>1</cell></row><row><cell>easy-medium1</cell><cell>12</cell><cell>3</cell><cell>1</cell></row><row><cell>easy-medium2</cell><cell>10</cell><cell>3</cell><cell>1</cell></row><row><cell>easy-medium3</cell><cell>15</cell><cell>3</cell><cell>1</cell></row><row><cell>medium1</cell><cell>8</cell><cell>1</cell><cell>2</cell></row><row><cell>medium2</cell><cell>9</cell><cell>1</cell><cell>2</cell></row><row><cell>medium3</cell><cell>4</cell><cell>1</cell><cell>2</cell></row><row><cell>medium4</cell><cell>8</cell><cell>1</cell><cell>2</cell></row><row><cell>medium5</cell><cell>12</cell><cell>1</cell><cell>2</cell></row><row><cell>medium-hard1</cell><cell>12</cell><cell>2</cell><cell>4</cell></row><row><cell>medium-hard2</cell><cell>15</cell><cell>2</cell><cell>4</cell></row><row><cell>medium-hard3</cell><cell>25</cell><cell>2</cell><cell>4</cell></row><row><cell>hard1</cell><cell>13</cell><cell>1</cell><cell>3</cell></row><row><cell>hard2</cell><cell>8</cell><cell>1</cell><cell>3</cell></row><row><cell>hard3</cell><cell>15</cell><cell>1</cell><cell>3</cell></row><row><cell>hard4</cell><cell>17</cell><cell>1</cell><cell>3</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">An analysis of this region is a part of future work.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.dmoz.org</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Root</head><p>Fig. <ref type="figure">3</ref>: easy-medium1 taxonomy. Note how each group of leaves is "distant" from the others. The high cohesion of the groups and their highly discriminant features suggest that this kind of hierarchy is easy to learn. Fig. <ref type="figure">4</ref>: medium-hard2 taxonomy. In this case the taxonomy is far more complex than that shown in Fig. <ref type="figure">3</ref>. We expect the algorithm to make more errors on this kind of taxonomy. </p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A direct measure of discriminant and characteristic capability for classifier building and assessment</title>
		<author>
			<persName><forename type="first">G</forename><surname>Armano</surname></persName>
		</author>
		<ptr target="http://www.sciencedirect.com/science/article/pii/S0020025515005241" />
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">325</biblScope>
			<biblScope unit="page" from="466" to="483" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Analysis of term roles along taxonomy nodes by adopting discriminant and characteristic capabilities</title>
		<author>
			<persName><forename type="first">G</forename><surname>Armano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fanni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Giuliani</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-1404/paper_24.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th Italian Information Retrieval Workshop</title>
				<meeting>the 6th Italian Information Retrieval Workshop<address><addrLine>Cagliari, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">May 25-26, 2015. 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Stopwords identification by means of characteristic and discriminant analysis</title>
		<author>
			<persName><forename type="first">G</forename><surname>Armano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Fanni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Giuliani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">7th International Conference on Agents and Artificial Intelligence 2015 (ICAART 2015)</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Loiseau</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Filipe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Duval</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Van Den Herik</surname></persName>
		</editor>
		<meeting><address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<publisher>SCITEPRESS Science and Technology Publications</publisher>
			<date type="published" when="2015-01-12">10-12 Jan 2015</date>
			<biblScope unit="page" from="353" to="360" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The iterated exponential integers</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">T</forename><surname>Bell</surname></persName>
		</author>
		<ptr target="http://www.jstor.org/stable/1968633" />
	</analytic>
	<monogr>
		<title level="j">Annals of Mathematics</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="539" to="557" />
			<date type="published" when="1938">1938</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A practical web-based approach to generating topic hierarchy for text segments</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Chuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Chien</surname></persName>
		</author>
		<idno type="DOI">10.1145/1031171.1031193</idno>
		<ptr target="http://doi.acm.org/10.1145/1031171.1031193" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management</title>
				<meeting>the Thirteenth ACM International Conference on Information and Knowledge Management<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="127" to="136" />
		</imprint>
	</monogr>
	<note>CIKM &apos;04</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Hierarchical document classification using automatically generated hierarchy</title>
		<author>
			<persName><forename type="first">T</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ogihara</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10844-006-0019-7</idno>
		<ptr target="http://dx.doi.org/10.1007/s10844-006-0019-7" />
	</analytic>
	<monogr>
		<title level="j">Journal of Intelligent Information Systems</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="211" to="230" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Automatically inducing ontologies from corpora</title>
		<author>
			<persName><forename type="first">I</forename><surname>Mani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CompuTerm 2004: 3rd International Workshop on Computational Terminology</title>
				<meeting>CompuTerm 2004: 3rd International Workshop on Computational Terminology<address><addrLine>COLING</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A graph-based algorithm for inducing lexical taxonomies from scratch</title>
		<author>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Velardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Faralli</surname></persName>
		</author>
		<idno type="DOI">10.5591/978-1-57735-516-8/IJCAI11-313</idno>
		<ptr target="http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-313" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three</title>
				<meeting>the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three</meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1872" to="1877" />
		</imprint>
	</monogr>
	<note>IJCAI&apos;11</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Unsupervised ontology induction from text</title>
		<author>
			<persName><forename type="first">H</forename><surname>Poon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Domingos</surname></persName>
		</author>
		<ptr target="http://dl.acm.org/citation.cfm?id=1858681.1858712" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 48th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Stroudsburg, PA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="296" to="305" />
		</imprint>
	</monogr>
	<note>ACL &apos;10</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Automatically learning document taxonomies for hierarchical classification</title>
		<author>
			<persName><forename type="first">K</forename><surname>Punera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ghosh</surname></persName>
		</author>
		<idno type="DOI">10.1145/1062745.1062843</idno>
		<ptr target="http://doi.acm.org/10.1145/1062745.1062843" />
	</analytic>
	<monogr>
		<title level="m">Special Interest Tracks and Posters of the 14th International Conference on World Wide Web</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="1010" to="1011" />
		</imprint>
	</monogr>
	<note>WWW &apos;05</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Clustering query refinements by user intent</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sadikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Madhavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Halevy</surname></persName>
		</author>
		<idno type="DOI">10.1145/1772690.1772776</idno>
		<ptr target="http://doi.acm.org/10.1145/1772690.1772776" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th International Conference on World Wide Web</title>
				<meeting>the 19th International Conference on World Wide Web<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="841" to="850" />
		</imprint>
	</monogr>
	<note>WWW &apos;10</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Ontolearn reloaded: A graph-based algorithm for taxonomy induction</title>
		<author>
			<persName><forename type="first">P</forename><surname>Velardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Faralli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</author>
		<ptr target="http://dblp.uni-trier.de/db/journals/coling/coling39.html#VelardiFN13" />
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="665" to="707" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Predicting short-term interests using activity-based search context</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">W</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Bennett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<idno type="DOI">10.1145/1871437.1871565</idno>
		<ptr target="http://doi.acm.org/10.1145/1871437.1871565" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th ACM International Conference on Information and Knowledge Management</title>
				<meeting>the 19th ACM International Conference on Information and Knowledge Management<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="1009" to="1018" />
		</imprint>
	</monogr>
	<note>CIKM &apos;10</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
