Building Taxonomies with Triplet Queries

Donatella Firmani1,*, Sainyam Galhotra2, Barna Saha3 and Divesh Srivastava4
1 Sapienza University  2 Cornell University  3 University of California San Diego  4 AT&T Chief Data Office
SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author. donatella.firmani@uniroma1.it (D. Firmani); sg@cs.cornell.edu (S. Galhotra); bsaha@ucsd.edu (B. Saha); divesh@research.att.com (D. Srivastava)

Abstract
The organization of records referring to different entities into a taxonomy is crucial for capturing their relationships. Nevertheless, the automatic identification of such relationships often suffers from inaccuracies due to the noise and heterogeneity of records across various sources. At the same time, manual maintenance of these relationships is impractical and does not scale. This study addresses these challenges by adopting a weak supervision strategy, in the form of an oracle, to solve a novel Hierarchical Entity Resolution task. Within our framework, records are organized into a tree-like structure that contains records at the bottom level and entities and categories at the higher levels. To make the most effective use of supervision, we employ a triplet comparison oracle, which takes three records as input and outputs the most similar pair(s). Finally, we introduce HierER, a querying strategy that uses record pair similarities to minimize the number of oracle queries while maximizing the identification of the hierarchical structure. Theoretical and empirical analyses demonstrate the effectiveness and efficiency of HierER on noisy datasets with millions of records.

1. Introduction

In many applications, records are represented in diverse formats such as images and unstructured or structured text, and these records need to be organized to capture complex relationships. Assigning records to taxonomies is useful for diverse applications such as recommendation [1], categorization [2], and search [3]. For example, e-commerce websites like Amazon organize products in the form of a taxonomy to enable better search and recommendations. Animal and plant species along with their textual descriptions can be arranged in hierarchies of varying depth, average degree and shape, called phylogenetic trees. In Figure 1a we show an example hierarchy for arranging a collection of records describing birds. Example records are π‘Ÿπ‘Ž and π‘Ÿπ‘, describing respectively the little bunting (Emberiza pusilla) and the yellow-breasted bunting (Emberiza aureola). An example category in the hierarchy is 𝑒1. Note that categories are not required to have a descriptive name: in this example the category 𝑒1 is simply defined as the set comprising the little bunting (π‘Ÿπ‘Ž) and the yellow-breasted bunting (π‘Ÿπ‘). Two main types of relationships arise naturally in a hierarchy:
β€’ is-A (in orange), connecting record-nodes such as π‘Ÿπ‘Ž to category-nodes such as 𝑒1;
β€’ category-supercategory (in black), connecting pairs of category-nodes such as 𝑒1 to 𝑒2.
Figure 1: (a) Example hierarchy for phylogenetic data. The center of the first diagram corresponds to the root and the subsequent levels are placed in concentric circles. (b) Entity-nodes and duplicate records.

In order to account for duplicate records, that is, different records referring to the same entity in the real world, we introduce a third type of relationship, which we call co-reference. In Figure 1b, the co-reference relationship is shown in green. To represent this relationship we introduce intermediate entity-nodes, such as π‘’π‘Ž and 𝑒𝑏, between record-nodes and category-nodes. In the figure, records π‘Ÿ1, π‘Ÿ2 and π‘Ÿ3 are different pictures of the little bunting, now represented by π‘’π‘Ž, while records π‘Ÿ4 and π‘Ÿ5 both represent a yellow-breasted bunting, which we now call 𝑒𝑏. Both π‘’π‘Ž and 𝑒𝑏 belong to the category 𝑒1, which can be equivalently defined as a set of entities.
Starting from records π‘Ÿ1, . . . , π‘Ÿπ‘›, our task is to return the complete tree structure, identifying duplicate records and enriching them with is-A and category-supercategory relationships. We call this problem the Hierarchical Entity Resolution (HER) problem. The popular Entity Resolution (ER) problem corresponds to the identification of record-entity relationships only (see the green edges in Figure 1b), and thus our problem represents a natural extension. Prior to our work, there were two main routes to a solution of HER.
β€’ Fully automatic construction. To automatically construct categories and identify their relationships, prior techniques have proposed to leverage co-occurrence patterns of hypernyms [4, 5]. However, certain categories may not always be explicitly mentioned in the records [6], nor respect the same terminology across different sources.
β€’ Fully manual construction. A domain expert can easily generate a hierarchy that contains all well-known categories. However, it might be impossible to capture all the fine-grained relationships in the data (such as very specialized category-nodes like 𝑒1 and 𝑒2 in Figure 1a) without processing all the records manually, which is not scalable.
Even though the benefits of constructing a hierarchy are widely acknowledged, due to the limitations of fully automatic construction, the majority of category-supercategory relationships are maintained manually by domain experts. Manual maintenance of such relationships is labor-intensive, as the hierarchy evolves whenever new records are introduced or discontinued.
Our intuition.
We aim to use a hybrid, weakly supervised approach, that is, an automatic approach guided by domain experts who provide answers to targeted queries. While manually processing all the records might be infeasible, if only three records are considered in isolation, say π‘Ÿ1, π‘Ÿ2 and π‘Ÿ4, then a user – or even a trained classifier – can easily determine that π‘Ÿ1 and π‘Ÿ2 are closer to each other than either of them is to π‘Ÿ4. Even though considering record triplets in isolation can help uncover the hierarchical structure, it is still infeasible to compare all possible triplets in million-scale datasets. Our framework automatically prioritizes records to optimize the number of triplet comparisons and minimize the query workload. We refer to the domain expert (or the classifier) using the abstraction of a black-box oracle. Oracle-based algorithms have been widely used to study fairness [7], correlation clustering [8] and classification [9, 10], to identify maximum elements [11, 12] and top-π‘˜ elements [13, 14, 15], and in information retrieval [16], skyline computation [17], and so on. Our oracle can answer:
β€’ triplet queries: β€œwhich pair of records among 𝑒, 𝑣 and 𝑀 is the most similar?”;
β€’ co-reference queries: β€œdo records 𝑒 and 𝑣 refer to the same entity?”.
These queries can reveal the local hierarchical structure and can be answered without requiring (i) the context of all the identified categories in the constructed hierarchy, nor (ii) knowledge of other records in the dataset. Recent deep-learning-based classifiers for ER [18, 19, 20, 21] are also alternative implementations of our oracle.
Related works and oracle-based methods. The closest task to our problem is ER, which features a variety of oracle-based methods [22, 23, 24]. However, ER techniques typically ignore category information. Another related task is Hierarchical Clustering, which has been studied in a variety of application domains, including the construction of phylogenetic trees [25, 26] and taxonomies [27, 28]. The work in [26] describes an oracle-based method. In Hierarchical Clustering, records are assumed to refer to different entities, and thus clustering methods ignore entity resolution. Moreover, these techniques build almost binary hierarchies, where every node has approximately two children. Thus, neither ER nor Hierarchical Clustering techniques can by themselves solve our HER problem effectively. Even pipelining the two processes turns out to yield a sub-optimal query workload. Let 𝑛 be the number of records. Running a hierarchical clustering technique first and then post-processing the bottom level to detect entities can require 𝑂(𝑛²) oracle queries for non-binary hierarchies in the worst case [26], before even identifying the entities. Running an oracle-based ER technique first and post-processing the entities to detect categories can be efficient in the case of large entities, but can require 𝑂(𝑛²) queries to identify small entity clusters [23], before even starting to process categories.
Contributions and Outline. We show that previous methods can be significantly outperformed by our approach, thus alleviating the manual workload required for the construction and maintenance of taxonomies over very large databases. Section 2 contains a high-level overview of our approach, and in particular of the HierER algorithm. A detailed description of the algorithm and its theoretical analysis can be found in [29]. The main experimental results of [29] are reported in Section 3.
Finally, Section 4 contains our concluding remarks.

2. Overview

Let 𝑉 = {𝑣1, 𝑣2, . . . , 𝑣𝑛} be a collection of 𝑛 records. We use the notion of a laminar family of sets to define the ground truth hierarchy 𝐻* more formally. A family of sets π’ž* is laminar iff βˆ€π‘‹1, 𝑋2 ∈ π’ž*, either 𝑋1 ∩ 𝑋2 = πœ‘, or 𝑋1 βŠ† 𝑋2, or 𝑋2 βŠ† 𝑋1. The hierarchy 𝐻* corresponds to a laminar family of labelled sets π’ž* such that each set in π’ž* is labelled with one of three labels: record (r), entity (e) or category (t). Each labelled set is denoted as ⟨label : π‘‹βŸ©, where 𝑋 βŠ† 𝑉. According to this notation, ⟨r : {𝑣}⟩ ∈ π’ž* for all 𝑣 ∈ 𝑉, and ⟨t : 𝑉⟩ ∈ π’ž*. The hierarchy has the additional constraint that a set labelled β€˜entity’ cannot have a proper superset labelled β€˜entity’ or β€˜record’, and a set labelled β€˜category’ cannot have a superset labelled β€˜entity’ or β€˜record’. Following this definition, there exists a one-to-one mapping between the internal nodes of the hierarchy and the laminar family of sets π’ž*, where an internal node of the hierarchy (say 𝑒) is equivalent to the set 𝐢 ∈ π’ž* containing all the leaf-level descendants of 𝑒, and vice versa. In this formulation, a category node whose only child is a single category node is redundant and can be ignored. However, an entity can have a single child (record).
A hierarchy has the interesting property that, for any three records 𝑣1, 𝑣2, 𝑣3 ∈ 𝑉, the lca’s of two of the pairs are the same and the lca of the third pair is either the same or a descendant of the other two lca’s [26]. Without loss of generality, we have
lca(𝑣2, 𝑣3) β‰₯ lca(𝑣1, 𝑣2) = lca(𝑣1, 𝑣3),
where lca(𝑣1, 𝑣2) > lca(π‘₯, 𝑦) denotes that lca(𝑣1, 𝑣2) is a descendant of lca(π‘₯, 𝑦). In case all three lca’s are the same, then 𝑣1, 𝑣2 and 𝑣3 belong to three different descendant branches of the internal node corresponding to lca(𝑣1, 𝑣2) = lca(𝑣1, 𝑣3) = lca(𝑣2, 𝑣3).
A triplet oracle is a function π‘žπ‘‘ : 𝑉 Γ— 𝑉 Γ— 𝑉 β†’ 𝑉 βˆͺ {πœ‘} that takes three records as input and outputs the farthest record (if any). For an input (𝑣1, 𝑣2, 𝑣3), the oracle outputs π‘žπ‘‘(𝑣1, 𝑣2, 𝑣3) = 𝑣1 if lca(𝑣2, 𝑣3) > lca(𝑣1, 𝑣2) = lca(𝑣1, 𝑣3), and π‘žπ‘‘(𝑣1, 𝑣2, 𝑣3) = πœ‘ if lca(𝑣1, 𝑣2) = lca(𝑣1, 𝑣3) = lca(𝑣2, 𝑣3).
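To make the oracle abstraction concrete, the following is a minimal Python sketch of a simulated triplet oracle over a known ground-truth hierarchy, along the lines of the simulated oracle used in some of our experiments (Section 3). It assumes the hierarchy is given as a dictionary of parent pointers (record β†’ entity β†’ categories β†’ root); the function names, data layout and the β€œroot” label are illustrative, not part of our implementation.

```python
def _path_to_root(x, parent):
    """Ancestors of x, from x up to the root, in the ground-truth tree."""
    path = [x]
    while x in parent:
        x = parent[x]
        path.append(x)
    return path

def _lca_depth(u, v, parent):
    """Depth (distance from the root) of the lowest common ancestor of u and v."""
    anc_v = set(_path_to_root(v, parent))
    lca = next(a for a in _path_to_root(u, parent) if a in anc_v)
    return len(_path_to_root(lca, parent)) - 1  # the root has depth 0

def triplet_oracle(v1, v2, v3, parent):
    """Return the farthest of the three records, or None (phi) when all lca's coincide."""
    d23 = _lca_depth(v2, v3, parent)
    d13 = _lca_depth(v1, v3, parent)
    d12 = _lca_depth(v1, v2, parent)
    deepest = max(d12, d13, d23)
    if [d12, d13, d23].count(deepest) > 1:  # in a valid tree: all three lca's coincide
        return None
    if deepest == d23:
        return v1  # v2 and v3 are the closest pair, so v1 is the farthest
    return v3 if deepest == d12 else v2

# The hierarchy of Figure 1b, with entity nodes ea, eb and category u1:
parent = {"r1": "ea", "r2": "ea", "r3": "ea", "r4": "eb", "r5": "eb",
          "ea": "u1", "eb": "u1", "u1": "root"}
assert triplet_oracle("r1", "r2", "r4", parent) == "r4"   # r4 is the farthest
assert triplet_oracle("r1", "r2", "r3", parent) is None   # co-referent records, same lca
```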
Prioritizing triplet oracle queries so as to maximize the accuracy of the 𝐻* estimation task after every query is the goal of the high-level workflow presented in Figure 2. Note that evaluating accuracy is non-trivial. A naive way is to compute the fraction of the total Θ(𝑛³) relationships (i.e., the (𝑛 choose 3) triplet and (𝑛 choose 2) equality relationships) that are correctly identified by a given method. However, such an approach can be infeasible even on medium-sized datasets. Therefore, we extend the popular practice of comparing the F-score of different ER techniques to our hierarchical setting. In ER, the F-score is computed over two types of pairwise relationships: intra-cluster and inter-cluster. Following these ideas, we consider co-reference relationships as intra-cluster and enumerate the different types of inter-entity relationships between record pairs. We define the notion of t-ancestor relationship to capture the distance between record pairs and then use it to compute the F-score of the output hierarchy 𝐻. A pair of records (𝑒, 𝑣) satisfies a t-ancestor relationship if their lca is at most t edges away from both 𝑒’s and 𝑣’s entity nodes. A 0-ancestor relationship is equivalent to an equality (i.e., co-reference) relationship. We map the t-ancestor relationships identified from the output hierarchy 𝐻 to the ground truth hierarchy 𝐻* and then use them to compute precision, as the weighted fraction of correctly identified pairwise relationships among the identified relationships, and recall, as the weighted fraction of the total relationships that were identified. Weighting mechanisms are discussed in detail in [29]. The F-score is finally computed as the harmonic mean of precision and recall.

Figure 2: Our hierarchical ER workflow.

Auxiliary modules. In the following, we first describe the auxiliary modules in Figure 2, then we provide more intuition on the core Oracle strategy module.
β€’ The Blocking module is used to reduce the number of pairs considered for similarity computation. Blocking is a widely used operation in the ER literature [30] to efficiently generate a small set of candidate pairs, so that similarity values are computed only for this small set of candidates. Standard blocking (also known as token-based blocking) is one of the most popular mechanisms and generates a block for each token in the input set of records [30]. Similarity values may not be directly interpretable as probability distributions over the possible oracle responses; for practical purposes, the calibration approach of [23, 31] can be used to map values 𝑠(𝑣1, 𝑣2) to probability distributions 𝑝(𝑣1, 𝑣2) and 𝑝(𝑣1, 𝑣2, 𝑣3).
β€’ The Initialization module in our workflow constructs a candidate hierarchy 𝐻̄ that can be used downstream to guide the querying strategies. The construction of 𝐻̄ is based solely on the similarity scores 𝑠 : 𝑉 Γ— 𝑉 β†’ [0, 1] and requires no oracle queries. We prove in [29] that 𝐻̄ has high F-score under low noise in the similarity values.
β€’ The Inference module provides tools to infer relationships from previously asked triplet and equality queries, without asking new oracle queries that would lead to unnecessary cost. This module allows us to infer, for instance, which is the farthest record among 𝑣, 𝑀 and π‘₯ by looking at the results of previous queries on other triplets, such as (𝑒, 𝑣, 𝑀) and (𝑒, 𝑀, π‘₯). We observe that inferring category-subcategory relationships is the major challenge, whereas co-reference can be easily inferred via transitive closure (e.g., if we know that 𝑣1 refers to the same entity as 𝑣2, and 𝑣2 refers to the same entity as 𝑣3, then we can infer that 𝑣1 refers to the same entity as 𝑣3 without asking the corresponding oracle query). Consider three records π‘Ÿ1, π‘Ÿ2, π‘Ÿ6 and a query π‘ž1 ≑ π‘žπ‘‘(π‘Ÿ1, π‘Ÿ2, π‘Ÿ6) that returns π‘Ÿ6. To interpret π‘ž1 mathematically, we define three variables corresponding to the lca’s of the involved record pairs (π‘Ÿ1, π‘Ÿ2), (π‘Ÿ1, π‘Ÿ6) and (π‘Ÿ2, π‘Ÿ6). Using these variables, π‘ž1 can be represented as lca(π‘Ÿ1, π‘Ÿ2) > lca(π‘Ÿ1, π‘Ÿ6) = lca(π‘Ÿ2, π‘Ÿ6). This inequality characterizes a relation between the lca’s of the record pairs (π‘Ÿ1, π‘Ÿ2), (π‘Ÿ1, π‘Ÿ6) and (π‘Ÿ2, π‘Ÿ6). Each query can be written in the form of such inequality constraints over at most (𝑛 choose 2) lca variables. Consider another query π‘ž2 ≑ π‘žπ‘‘(π‘Ÿ1, π‘Ÿ6, π‘Ÿ11) = π‘Ÿ11, meaning lca(π‘Ÿ1, π‘Ÿ6) > lca(π‘Ÿ1, π‘Ÿ11) = lca(π‘Ÿ6, π‘Ÿ11). Using the inequalities of π‘ž1 and π‘ž2, we can infer that lca(π‘Ÿ1, π‘Ÿ2) > lca(π‘Ÿ1, π‘Ÿ6) > lca(π‘Ÿ1, π‘Ÿ11), as sketched right after this list.
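The following Python sketch illustrates this inference idea, assuming answered triplet queries are stored as strict β€œdeeper-than” constraints over lca variables (one variable per unordered record pair) and new orderings are derived by transitivity. The class and method names are hypothetical, and the equality constraints maintained by the actual inference engine of [29] are omitted for brevity.

```python
from collections import defaultdict

class LcaConstraintStore:
    """Stores, for each lca variable, the lca variables known to be strictly shallower."""

    def __init__(self):
        self.deeper = defaultdict(set)

    @staticmethod
    def var(u, v):
        return tuple(sorted((u, v)))  # lca variable of an unordered record pair

    def add_triplet_answer(self, a, b, farthest):
        """q_t(a, b, farthest) = farthest, i.e. lca(a, b) > lca(a, farthest) = lca(b, farthest)."""
        self.deeper[self.var(a, b)].update({self.var(a, farthest), self.var(b, farthest)})

    def strictly_deeper(self, pair1, pair2):
        """Can lca(pair1) > lca(pair2) be inferred from the stored constraints?"""
        start, target = self.var(*pair1), self.var(*pair2)
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            if node == target and node != start:
                return True
            for nxt in self.deeper[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        return False

# Queries q1 and q2 from the example above:
store = LcaConstraintStore()
store.add_triplet_answer("r1", "r2", "r6")    # q1: r6 is the farthest
store.add_triplet_answer("r1", "r6", "r11")   # q2: r11 is the farthest
# lca(r1, r2) > lca(r1, r11) follows by transitivity, with no third oracle query:
assert store.strictly_deeper(("r1", "r2"), ("r1", "r11"))
```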
Oracle strategy. This module prioritizes oracle queries by leveraging (i) the candidate hierarchy from the initialization step, to give higher priority to queries that yield a higher F-score increment, and (ii) the inference engine, to identify inferable relationships for free. The algorithm HierER described in [29] can be thought of as an oracle strategy that leverages the initialization and inference methods and focuses on the following principles to maximize progressive F-score.
β€’ Internal node discovery. Prioritize queries that enable the discovery of new internal nodes. Indeed, identifying the tree structure, and specifically the internal nodes between the root node and the leaf nodes corresponding to the processed records, is important to provide optimal progressive behavior.
β€’ Large entities. Give high priority to queries enabling the discovery of new children of high-degree entity nodes. This principle has also been used in the ER literature [23, 22], since asking queries in non-increasing order of entity sizes provides the maximum gain in progressive recall.
β€’ Connectivity. Prioritize queries that grow the hierarchy in a connected fashion. Indeed, ensuring that the processed records at any given time form a single connected hierarchy (rather than growing multiple disjoint hierarchies in parallel) allows for the inference of more relationships with the same number of queries.

3. Experiments

We now compare HierER and the baselines on real-world datasets and answer the following questions. Q1: What is the end-to-end quality vs. query complexity trade-off for HierER? Q2: Is HierER sensitive to noise in the dataset? Q3: Is HierER scalable to large-scale datasets?
Set-up. Figure 3 reports the six real-world datasets in our experiments, consisting of hierarchies of varying depth, average degree and shape, and comprising either textual (records) or visual (images) descriptions of entities.

Dataset        n      Depth  Average degree  Max degree  Largest Entity
Phylogenetic   1039   21     2.01            4           1
DMOZ           100K   5      274.3           17K         1
Cars           16.5K  2      330             1800        1800
Camera         30K    3      5               91          91
Amazon         30K    7      8.46            760         1
Geography      3M     5      8K              35K         1
Figure 3: Dataset description.

Phylogenetic contains scientific names of bird and insect species along with textual descriptions collected from Wikipedia; the hierarchy corresponds to the phylogenetic tree obtained from [32], and Figure 1a in Section 1 shows its shape. DMOZ [33] is an open-content directory of web pages along with a hierarchy that organizes these web pages according to categories like art, science, mathematics, etc. Cars comprises images of different models of cars; we generate textual descriptions using Google’s Vision API [34], and the hierarchy is constructed based on make and model. Camera [35] is a collection of specifications of cameras collected from over 25 retail companies, and the hierarchy corresponds to the brand-model categorization. Amazon [36] contains descriptions of products and the Amazon catalog ontology. Geography [37] contains names of cities across the world, with categories corresponding to their state, country and continent.
Due to the lower training-data requirements of random forests compared to deep-learning-based techniques [20], we used a random forest classifier trained with active learning as the oracle for the Cars, Geography and Camera datasets. This model achieved more than 0.95 F-score on all three datasets. For the other datasets we consider a simulated oracle model (i.e., using the ground truth hierarchy to generate responses) in order to control the noise level. For more implementation details, such as the similarity calculation and triplet probabilities, we refer the reader to [29].
We consider the following pipelined baseline strategies, which perform hierarchical clustering and entity resolution separately as a two-step procedure. (i) Average Linkage (denoted by AverageLink) is an agglomerative clustering technique; we used the sklearn package [38] to construct the hierarchy and then run triplet queries bottom up to merge neighboring internal nodes (a sketch of this first step is shown below). (ii) HiExpan denotes the automated taxonomy construction technique from [28], which uses an initial seed hierarchy to generate other internal nodes, assuming access to all internal nodes in the hierarchy; we added the internal nodes to run this algorithm. (iii) InsSort considers the extension of the insertion sort algorithm [26] to non-binary hierarchies to generate a hierarchy, followed by Hybrid [23] to identify equality relationships. (iv) A pipeline that performs ER followed by hierarchical clustering has almost zero F-score at the end of the ER phase when the datasets contain singleton entities, and therefore suffers from poor progressiveness; instead, we consider Hybrid, an adaptation of the state-of-the-art entity resolution strategy [23].
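As an illustration of the first step of the AverageLink baseline, the following is a minimal sketch, assuming that pairwise similarities s(u, v) in [0, 1] are available as an n Γ— n matrix and that sklearn β‰₯ 1.2 is installed (earlier versions name the metric parameter affinity). The helper function and the dictionary encoding of the merge tree are ours, not the paper's code; the subsequent bottom-up triplet queries are not shown.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def average_link_dendrogram(sim_matrix):
    """Full average-linkage merge tree over records, from a similarity matrix."""
    dist = 1.0 - np.asarray(sim_matrix, dtype=float)  # turn similarities into distances
    np.fill_diagonal(dist, 0.0)
    model = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.0,      # compute the full merge tree
        linkage="average", metric="precomputed",
    ).fit(dist)
    n = dist.shape[0]
    # children_[i] lists the two nodes merged into internal node n + i;
    # indices < n are leaf records, indices >= n refer to earlier merges.
    return {n + i: [int(a), int(b)] for i, (a, b) in enumerate(model.children_)}
```

The returned dictionary encodes the (binary) dendrogram on which the baseline then issues triplet queries bottom up to merge neighboring internal nodes.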
Figure 4: F-score vs. #queries for datasets where (top: phylogenetic, dmoz, amazon, geography) all records refer to distinct entities and (bottom-left: cars, camera) datasets have many co-references. (Bottom-right) Query complexity for varying similarity noise (processing error and data error).

Q1: Result Quality. To answer our first question, we compare the quality of the techniques by measuring the progressive F-score. The F-score value after every query is computed by mapping the relationships in the constructed hierarchy to those in the ground truth hierarchy. Figures 4a and 4b compare the F-score of HierER with the other baselines on multiple datasets. Across all datasets, HierER achieves the highest progressive F-score and is closest to the ideal curve. InsSort and Hybrid achieve poor progressive F-score: these techniques require more than 5Γ— the queries required by HierER to achieve a comparable F-score across all large-scale datasets. AverageLink generates a hierarchical clustering over the records without any oracle queries; this hierarchy achieves non-zero F-score, but the mistakes in the constructed hierarchy cannot be corrected. HiExpan performs better than AverageLink on most datasets but does not achieve high F-score. It is sensitive to the initial structure provided as input and does not generalize if that structure does not contain all levels of the hierarchy. Due to the presence of noise in the datasets, such automated techniques do not achieve high F-score.
We observe different behaviors across datasets. On the Phylogenetic dataset, the ideal strategy requires fewer than 2𝑛 queries, as the majority of the internal nodes have two children and require fewer than two queries to be inserted. For each leaf-level record, HierER requires 𝑂(log 𝑛) queries to identify its location in the processed hierarchy [29], performing better than InsSort but requiring many more queries than the ideal strategy. This is due to the high noise in the similarity values, with more than 85% of the records suffering from data error in this dataset. On the other datasets, including those with internal nodes of much higher average degree, HierER achieves near-optimal progressive F-score and performs much better than all the other baselines. The experiments in Figure 4 assume that the oracle answers all queries correctly; for experiments with independent triplet error we refer the reader to [29]. To develop robust methods, we leverage the random graph toolkit [24] for each oracle query.

Dataset        HierER      InsSort     Hybrid      AverageLink
Phylogenetic   1min 40sec  1min 50sec  1min 53sec  2min 10sec
DMOZ           1hr 23min   4hr 47min   6hr 20min   3hr 17min
Cars           45min       3hr 25min   5hr 30min   3hr 5min
Camera         52min       2hr 47min   3hr 35min   3hr 10min
Amazon         1hr 5min    3hr 27min   3hr 55min   3hr 40min
Geography      11hr 35min  30hr 35min  34hr 35min  Did not finish
Figure 5: Running time comparison.

Q2: Noise sensitivity. To validate the effect of noise, Figure 4c considers synthetic similarities on the Phylogenetic dataset and compares the query complexity with the bounds proven in [29], where we establish that HierER requires 𝑂(𝑛 log 𝑛) triplet comparisons under reasonable data error. Figure 4c (left) simulates processing error, where each pairwise similarity 𝑠(𝑒, 𝑣) is sampled independently according to a normal distribution with mean πœ‡(𝑒,𝑣) and variance 𝜎² = π‘Žπœ‡Β² (πœ‡ denotes the difference in expected similarity for pairs connected at different depths). The query complexity of HierER is the same as that of the ideal strategy (roughly 2𝑛) for π‘Ž < 3. For higher noise, the query complexity increases but plateaus at 𝑂(𝑛 log 𝑛). Figure 4c (right) simulates data error, where a random sample of the nodes is made erroneous, so that all pairwise similarities involving these records are erroneous. In this case, the query complexity of the technique is directly proportional to the error.
Q3: Scalability. Figure 5 compares the running time of HierER to reach 0.90 F-score with respect to the other baselines. HierER has significantly lower running time on all the datasets, due to the linear dependence of the running time on the number of queries, finishing in less than 2 minutes on Phylogenetic and in less than 12 hours on the million-scale Geography dataset.

4. Conclusions

In this paper, we have formalized the Hierarchical Entity Resolution problem and introduced an oracle-based approach. Our algorithm, HierER, presents a novel query ordering strategy that capitalizes on pairwise record similarities, prioritizing triplet and equality oracle queries. Theoretical and empirical analysis underscores that HierER is capable of constructing an accurate hierarchy with a limited query workload across diverse real-world datasets.
Acknowledgments Donatella Firmani has been partially supported by the following research grants: SEED PNR Project β€œFLOWER” β€œFrontiers in Linking records: knOWledge graphs, Explainability and tempo- Ral data”, Sapienza Research Project B83C22007180001 β€œTrustworthy Technologies for Augment- ing Knowledge Graphs” and HORIZON Research and Innovation Action 101135576 INTEND β€œIntent-based data operation in the computing continuum”. References [1] J. Huang, Z. Ren, W. X. Zhao, G. He, J.-R. Wen, D. Dong, Taxonomy-aware multi-hop reasoning networks for sequential recommendation, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, 2019, pp. 573–581. [2] Y. Mao, J. Tian, J. Han, X. Ren, Hierarchical text classification with reinforced label assignment, arXiv preprint arXiv:1908.10419 (2019). [3] J. Wang, C. Kang, Y. Chang, J. Han, A hierarchical dirichlet model for taxonomy expansion for search engines, in: Proceedings of the 23rd international conference on World wide web, 2014, pp. 961–970. [4] M. A. Hearst, Automatic acquisition of hyponyms from large text corpora, in: Coling 1992 volume 2: The 15th international conference on computational linguistics, 1992. [5] A. Panchenko, S. Faralli, E. Ruppert, S. Remus, H. Naets, C. Fairon, S. P. Ponzetto, C. Bie- mann, Taxi at semeval-2016 task 13: a taxonomy induction method based on lexico- syntactic patterns, substrings and focused crawling, in: Proceedings of the 10th Interna- tional Workshop on Semantic Evaluation (SemEval-2016), 2016, pp. 1320–1327. [6] Y. Mao, T. Zhao, A. Kan, C. Zhang, X. L. Dong, C. Faloutsos, J. Han, Octet: Online catalog taxonomy enrichment with self-supervision, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2247–2257. [7] C. Ilvento, Metric learning for individual fairness, arXiv preprint arXiv:1906.00250 (2019). [8] A. Ukkonen, Crowdsourced correlation clustering with relative distance comparisons, in: 2017 IEEE International Conference on Data Mining (ICDM), IEEE, 2017, pp. 1117–1122. [9] O. Tamuz, C. Liu, S. Belongie, O. Shamir, A. T. Kalai, Adaptively learning the crowd kernel, in: Proceedings of the 28th International Conference on International Conference on Machine Learning, 2011, pp. 673–680. [10] M. Hopkins, D. Kane, S. Lovett, G. Mahajan, Noise-tolerant, reliable active classification with comparison queries, arXiv preprint arXiv:2001.05497 (2020). [11] S. Guo, A. Parameswaran, H. Garcia-Molina, So who won? dynamic max discovery with the crowd, in: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012, pp. 385–396. [12] P. Venetis, H. Garcia-Molina, K. Huang, N. Polyzotis, Max algorithms in crowdsourcing environments, in: Proceedings of the 21st international conference on World Wide Web, 2012, pp. 989–998. [13] S. Davidson, S. Khanna, T. Milo, S. Roy, Top-k and clustering with noisy comparisons, ACM Trans. Database Syst. 39 (2015). URL: https://doi.org/10.1145/2684066. doi:10.1145/ 2684066. [14] N. M. Kou, Y. Li, H. Wang, L. H. U, Z. Gong, Crowdsourced top-k queries by confidence- aware pairwise judgments, in: Proceedings of the 2017 ACM SIGMOD International Conference on Management of Data, 2017, pp. 1415–1430. [15] E. Dushkin, T. Milo, Top-k sorting under partial order information, in: Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, 2018, pp. 1007–1019. [16] E. Kazemi, L. Chen, S. Dasgupta, A. 
Karbasi, Comparison based learning from weak oracles, arXiv preprint arXiv:1802.06942 (2018). [17] V. Verdugo, Skyline computation with noisy comparisons, in: Combinatorial Algorithms: 31st International Workshop, IWOCA 2020, 2020, p. 289. [18] Y. Li, J. Li, Y. Suhara, A. Doan, W.-C. Tan, Deep entity matching with pre-trained language models, arXiv preprint arXiv:2004.00584 (2020). [19] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, N. Tang, Distributed represen- tations of tuples for entity resolution, PVLDB 11 (2018) 1454–1467. [20] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep learning for entity matching: A design space exploration, in: Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, 2018, pp. 19–34. [21] R. Cappuzzo, P. Papotti, S. Thirumuruganathan, Creating embeddings of heterogeneous relational datasets for data integration tasks, Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (2020). URL: http://dx.doi.org/10.1145/ 3318464.3389742. doi:10.1145/3318464.3389742. [22] N. Vesdapunt, K. Bellare, N. Dalvi, Crowdsourcing algorithms for entity resolution, PVLDB 7 (2014) 1071–1082. [23] D. Firmani, B. Saha, D. Srivastava, Online entity resolution using an oracle, PVLDB 9 (2016) 384–395. [24] S. Galhotra, D. Firmani, B. Saha, D. Srivastava, Robust entity resolution using random graphs, in: Proceedings of the 2018 ACM SIGMOD International Conference on Manage- ment of Data, 2018, pp. 3–18. [25] D. G. Brown, J. Truszkowski, Fast error-tolerant quartet phylogeny algorithms, in: Annual Symposium on Combinatorial Pattern Matching, Springer, 2011, pp. 147–161. [26] E. Emamjomeh-Zadeh, D. Kempe, Adaptive hierarchical clustering using ordinal queries, in: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2018, pp. 415–429. [27] C. Zhang, F. Tao, X. Chen, J. Shen, M. Jiang, B. Sadler, M. Vanni, J. Han, Taxogen: Unsupervised topic taxonomy construction by adaptive term embedding and clustering, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2701–2709. [28] J. Shen, Z. Wu, D. Lei, C. Zhang, X. Ren, M. T. Vanni, B. M. Sadler, J. Han, Hiexpan: Task-guided taxonomy construction by hierarchical tree expansion, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2180–2189. [29] S. Galhotra, D. Firmani, B. Saha, D. Srivastava, Hierarchical entity resolution using an oracle, in: Z. G. Ives, A. Bonifati, A. E. Abbadi (Eds.), SIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022, ACM, 2022, pp. 414–428. URL: https://doi.org/10.1145/3514221.3526147. doi:10.1145/3514221.3526147. [30] G. Papadakis, J. Svirsky, A. Gal, T. Palpanas, Comparative analysis of approximate blocking techniques for entity resolution, PVLDB 9 (2016) 684–695. [31] S. E. Whang, P. Lofgren, H. Garcia-Molina, Question selection for crowd entity resolution, PVLDB 6 (2013) 349–360. [32] C. Roquet, S. Lavergne, W. Thuiller, One tree to link them all: a phylogenetic dataset for the european tetrapoda, PLoS currents 6 (2014). [33] Dmoz https://dmoz-odp.org/, 2018. [34] Google vision api https://cloud.google.com/vision, 2024. [35] V. Crescenzi, A. D. Angelis, D. Firmani, M. Mazzei, P. Merialdo, F. Piai, D. Srivastava, Alaska: A flexible benchmark for data integration tasks, 2021. arXiv:2101.11259. [36] R. 
He, J. McAuley, Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering, in: proceedings of the 25th international conference on world wide web, 2016, pp. 507–517. [37] World cities database https://simplemaps.com/data/world-cities, 2024. [38] Scikit-learn https://scikit-learn.org/stable/modules/clustering.html, 2024.