<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Density- and Correlation-based Table Extension</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benedikt Kleppmann</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Bizer</string-name>
          <email>chris@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edwin Yaqub</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian Temme</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Schlunder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Arnu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralf Klinkenberg</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RapidMiner GmbH</institution>
          ,
          <addr-line>Westfalendamm 87, 44141 Dortmund</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Mannheim</institution>
          ,
          <addr-line>68131 Mannheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>With thousands of data sources available on the Web as well as within organizations, data scientists increasingly spend more time searching for data than analyzing it. In order to ease the task of finding relevant data for data mining projects, this paper presents two data discovery and data integration methods that have been developed in a joint research project by RapidMiner Research and the University of Mannheim. Given a corpus of relational tables, the methods extend a query table with additional attributes and automatically fill these new attributes with data values from the corpus. The first method, density-based table extension, extends the query table with all attributes that can be filled with data values so that a user-specified density threshold is reached. The second method, correlation-based table extension, extends the query table with all attributes that correlate with a specific attribute of the query table. Both methods are integrated as operators into RapidMiner Studio, a popular data mining environment. This enables data scientists to search for data and apply a wide range of different mining methods to the discovered data within the same environment.</p>
      </abstract>
      <kwd-group>
        <kwd>Data discovery</kwd>
        <kwd>table extension</kwd>
        <kwd>holistic matching</kwd>
        <kwd>web tables</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper proposes and evaluates two new table extension methods.
The first method, density-based table extension, extends a
query table with all attributes that can be filled above a user-specified density
threshold given a data corpus. For instance, given a table describing cities, the
method would add various attributes providing statistics about these cities. The
second method performs correlation-based table extension: Given a query table
describing cities, the method would add all attributes that correlate with a
specific attribute of the query table. For instance, the user could specify that she
wants the new attributes to correlate with the attribute unemployment, which
would result in the addition of attributes that are related to unemployment
in the cities. Figure 1 illustrates how a query table describing Roman emperors
is extended within RapidMiner Studio with additional attributes covering the
emperors' birth and death dates, as well as the cause of their death. The table
extension operators are published as part of the Data Search for Data Mining
(DS4DM) extension on the RapidMiner Marketplace
(https://marketplace.rapidminer.com). Besides the actual search operators, the
extension also includes functionality for indexing relational tables and for
managing table repositories. The extension supports extracting tabular data
from various sources, including web pages, Google tables, tables within PDF
documents, and online spreadsheets from Microsoft and Google, as well as
accessing SharePoint. Detailed information about the extension can be found on
the DS4DM website (http://ds4dm.de).
</p>
    </sec>
    <sec id="sec-1b">
      <title>Density-based Table Extension</title>
      <p>The density-based table extension method expects
a query table, a density threshold, and a reference to a data repository as input
from the user. It returns the query table extended with all attributes that could
be filled above the density threshold using the data repository. The method
performs the following steps in order to create the extended table:
1. Subject Column Detection: The method determines the column of the query
table that most likely contains the names of the described entities. For this,
different regex-patterns are matched against the column headers (such as
.*name). If no column header is identified as a subject column header, then
the string column with the highest number of distinct values is chosen as the
subject column.
2. Table Search: Using a Lucene index, the top-k tables having the highest
overlap in subject column values with the query table are retrieved from the
repository.
3. Entity Matching: The rows of the retrieved tables are matched against the
rows of the query table in order to determine entity correspondences. The
similarity of two rows is calculated by combining the similarity of the
subject-column values (weight 50%) and the maximal similarity of the
non-subject-column values (weight 50%). The individual similarities are calculated
using datatype-specific similarity metrics (string, number, and date).
4. Schema Matching: Correspondences between the columns of the query table
and the retrieved tables are determined using a combination of label-based
and instance-based schema matching techniques.
5. Data Fusion: Using the correspondences, the data from the retrieved tables
is grouped by entity and attribute. If the retrieved tables contain conflicting
values for an attribute of a specific entity, these conflicts are resolved by
choosing the value that is most similar to all other values within the group.
6. Table Extension: All newly created attributes are added to the query table.
</p>
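      <p>As an illustration of step 1 above, the following sketch implements the subject-column detection heuristic: regex patterns are matched against the column headers, with a fallback to the string column that has the most distinct values. The pattern list and the table representation are simplifying assumptions for illustration, not the actual DS4DM implementation.</p>

```python
import re

# Assumed header patterns for illustration; the DS4DM operator may use
# a different pattern set.
SUBJECT_HEADER_PATTERNS = [r".*name", r"title", r"label"]

def detect_subject_column(headers, columns):
    """headers: list of column headers; columns: one value list per header.
    Returns the index of the detected subject column."""
    # 1) Label-based detection: match regex patterns against the headers.
    for i, header in enumerate(headers):
        for pattern in SUBJECT_HEADER_PATTERNS:
            if re.fullmatch(pattern, header.strip().lower()):
                return i
    # 2) Fallback: the string column with the highest number of distinct values.
    best_index, best_distinct = None, -1
    for i, column in enumerate(columns):
        if all(isinstance(value, str) for value in column):
            distinct = len(set(column))
            if distinct > best_distinct:
                best_index, best_distinct = i, distinct
    return best_index

headers = ["id", "city name", "country"]
columns = [["1", "2"], ["Mannheim", "Dortmund"], ["DE", "DE"]]
print(detect_subject_column(headers, columns))  # 1 ("city name" matches .*name)
```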
    </sec>
    <sec id="sec-2">
      <title>Correlation-based Table Extension</title>
      <p>In many data analysis settings, the attributes that correlate with a specific
target attribute are highly relevant, for instance for learning classification and
regression models. The correlation-based table extension method expects a query
table, an attribute of this query table to which the new attributes should
correlate, a minimum correlation threshold, a density threshold, and a reference to
a data repository as input from the user. It returns the query table extended
with all attributes that could be filled above the density threshold and
correlate with the specified correlation attribute. Only correlations between numeric
attributes are considered. Correlations are calculated using the Pearson
correlation coefficient. The correlation-based table extension method is implemented as
a post-processing step for the density-based table extension. First, the
density-based table extension method is used to add as many attributes as possible
to the query table. Afterwards, attributes with a correlation below the
minimum correlation threshold are removed.</p>
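      <p>The post-processing step described above can be sketched as follows. The representation of attributes as value lists with None for empty cells, and the helper names, are illustrative assumptions; only the Pearson correlation coefficient, the minimum-correlation threshold, and the density threshold come from the method description.</p>

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def filter_correlated(target, candidates, min_corr, min_density):
    """Keep candidate attributes (name -> value list, None = empty cell)
    whose density and absolute Pearson correlation with the target
    attribute reach the given thresholds."""
    kept = {}
    for name, values in candidates.items():
        # Density is the fraction of non-empty cells; the correlation is
        # computed only on rows where the candidate value is present.
        pairs = [(t, v) for t, v in zip(target, values) if v is not None]
        density = len(pairs) / len(values)
        if density >= min_density and len(pairs) >= 2:
            r = pearson([t for t, _ in pairs], [v for _, v in pairs])
            if abs(r) >= min_corr:
                kept[name] = values
    return kept

target = [5.0, 6.5, 8.0, 9.5]               # e.g. the unemployment attribute
candidates = {
    "crime_rate": [1.0, 2.0, 3.0, 4.0],     # perfectly correlated, kept
    "area_code":  [7.0, 7.0, 7.0, 7.0],     # zero variance, removed
    "sparse":     [1.0, None, None, None],  # density 0.25, removed
}
print(sorted(filter_correlated(target, candidates, 0.7, 0.5)))  # ['crime_rate']
```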
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>
        We evaluated both methods on the task of extending various query tables with
data from a corpus of relational web tables [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We used the T2D Gold
Standard V2 for the evaluation. This table corpus consists of 779 tables and
covers topics such as populated places, organizations, people, music, etc. The
gold standard was originally created for evaluating web table to knowledge base
matching systems. For our evaluation, we rearranged the tables into query tables
and expected result tables using the schema- and instance-correspondences from
the gold standard. We used 13 query tables (airports, currencies, lakes, etc.) to
evaluate the density-based table extension method. Comparing the tables that
were produced by the method to the expected result tables leads to a precision
of 80% and a recall of 98%. This means that the method was able to discover
and populate most attributes that could be added to the query tables. The
precision of 80% results partly from errors in the data fusion step and partly
from the system filling too many cells of the result tables due to
matching errors. For evaluating the correlation-based table extension, we used
the four query tables that result in the largest number of numeric attributes to
be added. The experiment showed a precision of 63% and a recall of 77%. These
results are due to the rather low density of many of the created attributes, which
makes calculating correlations unreliable. The results of each individual query as well
as the evaluation data can be found on the website about the DS4DM backend.
The run times for both types of table extension are between 5 and 10 seconds
when searching a repository of 500 000 web tables
(http://web.informatik.uni-mannheim.de/ds4dm/#evaluation).
      </p>
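      <p>The cell-level comparison underlying these precision and recall figures can be sketched as follows, assuming (for illustration) that the produced and expected result tables are represented as dictionaries mapping (entity, attribute) pairs to cell values.</p>

```python
def cell_precision_recall(produced, expected):
    """produced/expected: dict mapping (entity, attribute) -> cell value.
    Precision = correct filled cells / all filled cells;
    recall    = correct filled cells / all expected cells."""
    correct = sum(1 for key, value in produced.items() if expected.get(key) == value)
    precision = correct / len(produced) if produced else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall

# Toy example: one of two produced cells matches the expected table.
expected = {("berlin", "population"): "3.6M", ("paris", "population"): "2.1M"}
produced = {("berlin", "population"): "3.6M", ("paris", "population"): "2.2M"}
print(cell_precision_recall(produced, expected))  # (0.5, 0.5)
```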
    </sec>
    <sec id="sec-4">
      <title>Acknowledgment</title>
      <p>The methods and prototypes presented in this paper were developed within the
research project DS4DM (Data Search for Data Mining) funded by the
German Federal Ministry of Education and Research (BMBF) under grant number
01IS15027A-B.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Khoussainova</surname>
          </string-name>
          .
          <article-title>Data Integration for the Relational Web</article-title>
          .
          <source>Proc. of the VLDB Endow.</source>
          ,
          <volume>2</volume>
          :
          <fpage>1090</fpage>
          -
          <lpage>1101</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>WebTables: Exploring the Power of Tables on the Web</article-title>
          .
          <source>Proc. of the VLDB Endow.</source>
          ,
          <volume>1</volume>
          :
          <fpage>538</fpage>
          -
          <lpage>549</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>O.</given-names>
            <surname>Lehmberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ritze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Meusel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>A large public corpus of web tables containing time and context metadata</article-title>
          .
          <source>In Proceedings of the 25th International Conference Companion on World Wide Web</source>
          , pages
          <fpage>75</fpage>
          -
          <lpage>76</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Yakout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ganjam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          .
          <article-title>InfoGather: Entity Augmentation and Attribute Discovery by Holistic Matching with Web Tables</article-title>
          .
          <source>In Proc. of the 2012 ACM SIGMOD Int. Conference on Management of Data</source>
          , pages
          <fpage>97</fpage>
          -
          <lpage>108</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>