<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Auctus: A Search Engine for Data Discovery and Augmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sonia Castelo</string-name>
          <email>s.castelo@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rémi Rampin</string-name>
          <email>remi.rampin@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aécio Santos</string-name>
          <email>aecio.santos@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aline Bessa</string-name>
          <email>aline.bessa@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Chirigati</string-name>
          <email>fchirigati@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juliana Freire</string-name>
          <email>juliana.freire@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          Reference Format: Sonia Castelo, Rémi Rampin, Aécio Santos, Aline Bessa, Fernando Chirigati, and Juliana Freire. Auctus: A Search Engine for Data Discovery and Augmentation. In the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores, SEA Data 2021.
        </aff>
      </contrib-group>
      <abstract>
        <p>The large volumes of structured data currently available open up new opportunities for progress in answering many important scientific, societal, and business questions. However, finding relevant data is difficult. While search engines have addressed this problem for Web documents, there are many new challenges involved in supporting the discovery of structured data for specific tasks. To tackle these challenges, we propose the dataset search engine Auctus. In this paper, we describe Auctus and present open questions and future work related to dataset discovery.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Data are abundant, but because datasets are spread
over a large number of sites and repositories, finding relevant data
for a given task is difficult. Recognizing this challenge, a number of
approaches have been proposed to organize and index data
collections [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. While these approaches represent a significant step towards simplifying
data discovery, they have an important limitation: they only
support keyword-based search queries over published dataset metadata.
In addition, published metadata is often incomplete, and in many
cases it is inconsistent with the actual data. Thus, relying solely
on the metadata also hampers the discoverability of datasets. In
this work, we describe Auctus, a system we propose to tackle these
limitations. We also introduce a number of open questions related
to the problem of dataset discovery, as well as future work related
to Auctus.
      </p>
    </sec>
    <sec id="sec-2">
      <title>THE AUCTUS SYSTEM</title>
      <p>
        Auctus [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] is an open-source dataset search engine designed to
support data discovery and augmentation. In addition to keyword-based
search, the system supports a rich set of queries including spatial
and temporal queries, as well as data integration and augmentation
queries. These queries are enabled in part by a data profiler that
automatically extracts metadata from the actual datasets. The profiler
generates summaries (or sketches) of column contents and data
types, which are used to construct indices that support efficient query
evaluation. Users can explore large dataset collections through an
intuitive interface. To help users identify relevant datasets, Auctus
displays snippets that summarize the contents of datasets.
      </p>
      <p>
        Copyright © 2021 for the individual papers by the papers’ authors. Copyright © 2021
for the volume as a collection by its editors. This volume and its papers are published
under the Creative Commons License Attribution 4.0 International (CC BY 4.0).
Published in the Proceedings of the 2nd Workshop on Search, Exploration, and
Analysis in Heterogeneous Datastores, co-located with VLDB 2021 (August 16-20, 2021,
Copenhagen, Denmark) on CEUR-WS.org.
      </p>
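      <p>To make the idea of column summaries concrete, the following minimal Python sketch profiles a small CSV sample by guessing a coarse type for each column and counting missing and distinct values. It is an illustration only and not the Auctus profiler: the function names (infer_type, profile_csv), the type labels, the single supported date format, and the sample data are all assumptions introduced here.</p>
      <preformat><![CDATA[
```python
import csv
import io
from collections import Counter
from datetime import datetime


def infer_type(values):
    """Guess a coarse type for a list of non-empty strings (hypothetical labels)."""
    def all_parse(parser):
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if not values:
        return "empty"
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    # Assumption: only ISO-style dates (YYYY-MM-DD) are recognized here.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "datetime"
    return "text"


def profile_csv(text, top_k=3):
    """Return one summary dict per column of a CSV document."""
    rows = list(csv.DictReader(io.StringIO(text)))
    profile = {}
    for name in rows[0].keys() if rows else []:
        raw = [row[name] for row in rows]
        present = [v for v in raw if v not in ("", None)]
        profile[name] = {
            "type": infer_type(present),
            "missing": len(raw) - len(present),
            "distinct": len(set(present)),
            "top_values": [v for v, _ in Counter(present).most_common(top_k)],
        }
    return profile


# Made-up sample data, used only for illustration.
SAMPLE = """date,borough,trips
2021-08-16,Brooklyn,120
2021-08-17,Queens,95
2021-08-18,Brooklyn,
"""

summary = profile_csv(SAMPLE)
```
]]></preformat>
      <p>Summaries of this kind (types, missing counts, frequent values) are the sort of per-column metadata that an index could be built on, although a production profiler would need far more robust type detection than this sketch.</p>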
      <p>
        Auctus was implemented with scalability in mind: the system is
containerized using Docker [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Each data discovery plugin
corresponds to an independent container, allowing multiple plugins to be
executed in parallel. Auctus can also spin up as many profiling and
query containers as required in response to load. Users can access
the system via a Web UI or programmatically via Python and REST
APIs. Auctus has been successfully deployed and is currently used
by different research groups within the DARPA D3M program [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ].
We refer the reader to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for additional details about the system
and its architecture.
      </p>
    </sec>
    <sec id="sec-3">
      <title>FUTURE WORK AND OPEN QUESTIONS</title>
      <p>Messy data. In addition to the lack of metadata, real data is messy
and noisy. The Auctus data profiler represents a first step towards
identifying semantic types as well as summarizing datasets to support
integration queries. However, the recall and precision of discovery
queries can be negatively affected by the presence of data quality
issues. Robust techniques are needed to automate
data cleaning, discover semantic types, and support approximate
join and union queries.</p>
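      <p>As a small illustration of what an approximate join over messy keys might look like, the sketch below normalizes join keys and then matches them by string similarity using Python's standard difflib module. This is one simple possibility and not a technique used by Auctus: the function names, the similarity cutoff of 0.85, and the sample tables are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import difflib


def normalize(key):
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(str(key).lower().split())


def approximate_join(left, right, cutoff=0.85):
    """Join two {key: value} tables whose keys may differ slightly.

    Returns a list of (left_key, right_key, left_value, right_value).
    The cutoff is an assumed similarity threshold, not a tuned value.
    """
    right_by_norm = {normalize(k): k for k in right}
    candidates = list(right_by_norm)
    joined = []
    for lkey, lval in left.items():
        # Keep only the single most similar right-hand key, if any is close enough.
        match = difflib.get_close_matches(
            normalize(lkey), candidates, n=1, cutoff=cutoff
        )
        if match:
            rkey = right_by_norm[match[0]]
            joined.append((lkey, rkey, lval, right[rkey]))
    return joined


# Made-up tables with inconsistent casing, spacing, and a misspelling.
population = {"New York ": 8.4, "los angeles": 3.9, "Chicago": 2.7}
rainfall = {"new york": 1174, "Los Angelos": 362, "Houston": 1264}

pairs = approximate_join(population, rainfall)
```
]]></preformat>
      <p>Here "New York " and "los angeles" find partners despite formatting differences and a misspelling, while "Chicago" is left unmatched. Comparing every key against every candidate does not scale to large collections, which is one reason more robust and efficient techniques are needed.</p>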
      <p>
        Correlated data discovery. One of the original motivations for Auctus
was to support data augmentation to improve machine learning
models. For this task, given a large collection of tabular datasets
D and a query table Q, we need to identify all datasets T ∈ D
that are both joinable with Q and that contain an attribute that
is correlated with the target variable in Q. However, computing
these joins and correlations in real time is not feasible for large
tables and dataset collections. We have recently proposed a new
sketching-based method that supports the efficient evaluation of
join-correlation queries [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. To integrate this method with Auctus
and further improve query efficiency, it would be interesting to
explore techniques that have been successfully used to speed up web
search queries, as well as the use of locality-sensitive hashing (LSH).
      </p>
      <p>
        User interfaces for data discovery. User interfaces for dataset search are
a rather unexplored research area. We posit this is due to the prior
unavailability of efficient algorithms and systems for building dataset
search engines. There are many open questions about how to
present dataset search results to users so that they can make sense
of the data and efficiently and effectively make relevance
judgments about the suitability of the data for their task.
      </p>
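      <p>The join-correlation query from the correlated data discovery discussion above can be illustrated with a toy Python example. It first computes the exact correlation after joining two key-value tables, and then repeats the computation on small samples that keep only the keys with the smallest hash values, so that both tables tend to retain the same keys. This is a deliberately simplified sketch and should not be read as the method of [<xref ref-type="bibr" rid="ref6">6</xref>]: the function names, the use of MD5 as the hash, the sketch size of 64, and the synthetic data are all assumptions.</p>
      <preformat><![CDATA[
```python
import hashlib
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def key_hash(key):
    """Deterministic hash, so every table ranks a given key identically."""
    return int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16)


def build_sketch(table, size):
    """Keep the `size` entries whose join keys have the smallest hashes."""
    kept = sorted(table, key=key_hash)[:size]
    return {k: table[k] for k in kept}


def join_correlation(left, right):
    """Join two {key: value} maps on their keys and correlate the values."""
    shared = sorted(set(left).intersection(right))
    if len(shared) < 2:
        return None  # not enough joined rows to estimate a correlation
    return pearson([left[k] for k in shared], [right[k] for k in shared])


# Synthetic data: the candidate column is a noiseless linear function
# of the query's target column, so the true correlation is 1.
query = {f"zip{i}": float(i) for i in range(1000)}
candidate = {f"zip{i}": 3.0 * i + 7.0 for i in range(200, 1200)}

exact = join_correlation(query, candidate)
estimate = join_correlation(build_sketch(query, 64), build_sketch(candidate, 64))
```
]]></preformat>
      <p>Because the synthetic data are noise-free, the sketch-based value coincides with the exact one here; with real, noisy data it would only approximate it. The point of the design is that the estimate touches at most 64 entries per table instead of the full join.</p>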
      <p>Result ranking. Different information needs and associated
discovery tasks demand different ranking strategies; there is no
one-size-fits-all strategy. Moreover, determining whether a dataset is better
than another can be difficult even for a fixed task. Furthermore,
datasets have other properties that contribute to their value,
including the publisher (e.g., datasets published by reputable sources
can be considered ‘better’ than datasets from unknown sources)
or intrinsic quality measures (e.g., the number of NULL values).
Research is needed both to better understand ranking in the context
of structured datasets and to devise effective strategies.</p>
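      <p>One way to picture task-dependent ranking is a weighted score that combines query relevance with publisher reputation and an intrinsic quality measure. The Python sketch below is purely hypothetical and is not how Auctus ranks results: the Candidate fields, the weight profiles named "exploration" and "augmentation", and all numeric values are invented for illustration.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    relevance: float         # query-dependent match score in [0, 1]
    trusted_publisher: bool  # assumed binary notion of publisher reputation
    null_fraction: float     # share of NULL cells in [0, 1]


def score(c, weights):
    """Weighted sum of relevance, publisher trust, and completeness."""
    return (
        weights["relevance"] * c.relevance
        + weights["publisher"] * (1.0 if c.trusted_publisher else 0.0)
        + weights["completeness"] * (1.0 - c.null_fraction)
    )


def rank(candidates, weights):
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=lambda c: score(c, weights), reverse=True)


# Invented search results.
results = [
    Candidate("taxi_trips", relevance=0.9, trusted_publisher=False, null_fraction=0.40),
    Candidate("city_census", relevance=0.7, trusted_publisher=True, null_fraction=0.05),
]

# Hypothetical weight profiles for two different discovery tasks.
exploration = {"relevance": 1.0, "publisher": 0.0, "completeness": 0.0}
augmentation = {"relevance": 0.5, "publisher": 0.2, "completeness": 0.3}

by_exploration = [c.name for c in rank(results, exploration)]
by_augmentation = [c.name for c in rank(results, augmentation)]
```
]]></preformat>
      <p>With these invented weights the two tasks order the same pair of datasets differently, which illustrates why a single fixed ranking function is unlikely to serve every information need. How such weights should be chosen, or whether a linear combination is appropriate at all, remains an open question.</p>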
    </sec>
    <sec id="sec-4">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially supported by the DARPA D3M program and
NSF award OAC-1640864. Any opinions, findings, and conclusions
or recommendations expressed in this material are those of the
authors and do not necessarily reflect the views of NSF and DARPA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <source>Auctus: GitHub Repository</source>
          .
          <year>2021</year>
          . https://github.com/VIDA-NYU/auctus.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Sonia</given-names>
            <surname>Castelo</surname>
          </string-name>
          , Remi Rampin, Aecio Santos, Fernando Chirigati, and
          <string-name>
            <given-names>Juliana</given-names>
            <surname>Freire</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Auctus: A Dataset Search Engine for Data Discovery and Augmentation</article-title>
          .
          <source>In Proceedings of the 47th International Conference on Very Large Data Bases</source>
          (to appear).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Adriane</given-names>
            <surname>Chapman</surname>
          </string-name>
          , Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez, Emilia Kacprzak, and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Groth</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Dataset search: a survey</article-title>
          .
          <source>VLDB Journal 29</source>
          ,
          <issue>1</issue>
          (
          <year>2020</year>
          ),
          <fpage>251</fpage>
          -
          <lpage>272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <source>Data-Driven Discovery of Models (D3M)</source>
          .
          <year>2019</year>
          . https://www.darpa.mil/program/data-driven-discovery-of-models.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Docker</surname>
          </string-name>
          .
          <year>2021</year>
          . https://www.docker.com/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Aécio S. R.</given-names>
            <surname>Santos</surname>
          </string-name>
          , Aline Bessa, Fernando Chirigati, Christopher Musco, and
          <string-name>
            <given-names>Juliana</given-names>
            <surname>Freire</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Correlation Sketches for Approximate Join-Correlation Queries</article-title>
          .
          <source>In International Conference on Management of Data (SIGMOD)</source>
          .
          <fpage>1531</fpage>
          -
          <lpage>1544</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Aécio S. R.</given-names>
            <surname>Santos</surname>
          </string-name>
          , Sonia Castelo, Cristian Felix, Jorge Piazentin Ono, Bowen Yu, Sungsoo Ray Hong, Cláudio T. Silva, Enrico Bertini, and
          <string-name>
            <given-names>Juliana</given-names>
            <surname>Freire</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Visus: An Interactive System for Automatic Machine Learning Model Building and Curation</article-title>
          .
          <source>In Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA@SIGMOD)</source>
          .
          <fpage>6:1</fpage>
          -
          <lpage>6:7</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>