<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How plausible is automatic annotation of scientific spreadsheets?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martine de Vos∗</string-name>
          <email>martine.de.vos@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Wielemaker</string-name>
          <email>j.wielemaker@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bob Wielinga †</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guus Schreiber</string-name>
          <email>guus.schreiber@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Top†</string-name>
          <email>jan.top@wur.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer science, Network Institute, Vrije Universiteit Amsterdam</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Food and Biobased Research, Wageningen University and Research Centre</institution>
        </aff>
      </contrib-group>
      <author-notes>
        <fn>
          <p>∗Corresponding author</p>
        </fn>
        <fn>
          <p>Second affiliation: Computer science, Network Institute, Vrije Universiteit Amsterdam</p>
        </fn>
      </author-notes>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>It is possible to automatically annotate a natural science spreadsheet using lexical matching, given that the tables in these spreadsheets meet a number of requirements regarding the content. Results of a survey show that most of the existing natural science spreadsheets deviate from the ideal situation. We propose to complement lexical matching with both heuristics and knowledge from external vocabularies to overcome these deviations.</p>
      </abstract>
      <kwd-group>
        <kwd>Spreadsheets</kwd>
        <kwd>Knowledge engineering</kwd>
        <kwd>Domain Knowledge</kwd>
        <kwd>Heuristics</kwd>
        <kwd>Vocabularies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Model development and
analysis; • Applied computing → Physical sciences and
engineering;</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>In this paper we investigate the feasibility of automatically
annotating existing natural science spreadsheets.</p>
      <p>
        Scientists in the domain of natural science, hereafter referred to
as domain scientists, frequently use spreadsheets to analyze and
manipulate their research data [
        <xref ref-type="bibr" rid="ref13 ref18 ref25">13, 18, 25</xref>
        ]. The format of spreadsheets
gives them a great deal of freedom in how they enter their data.
Domain scientists can make their own choices with respect to the
entities and processes to be included, and the way in which these
are organized in tables. In this way their domain model is implicitly
reflected in the content and structure of the spreadsheet tables. As
researchers do not anticipate the reuse of their spreadsheet data
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] they tend to be sloppy in the specification of the semantics of
their data, and the free format allows them to do so [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However,
as the domain model is essential for understanding the meaning and
context of spreadsheet data, it is currently hard for people other than
the original developers to interpret these data unambiguously.
      </p>
      <p>The content of the tables can be annotated with concepts from
external vocabularies to facilitate interpretation, evaluation and
reuse of spreadsheet data. This annotation can be performed
automatically by using lexical matching, i.e., by assessing the lexical
similarity between spreadsheet terms and labels from selected
vocabularies. This method requires, ironically, that the spreadsheet
cells contain explicit and complete information on the
corresponding research. Furthermore, as natural science spreadsheets often
represent observational data, the tables typically contain mostly
numbers and little text. The amount of data that can be annotated
in natural science spreadsheets is thus limited.</p>
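      <p>As an illustration only, lexical matching can be sketched as a string-similarity lookup. The sketch below is not the implementation used in this study: the vocabulary entries, the <monospace>ex:</monospace> identifiers, the function names and the 0.8 threshold are all assumptions, and the ratio from Python's standard <monospace>difflib</monospace> module stands in for whatever similarity measure an actual annotator would use.</p>
      <preformat><![CDATA[
```python
from difflib import SequenceMatcher

# Hypothetical vocabulary: label -> concept identifier (not real OM or domain IRIs).
VOCABULARY = {
    "area": "ex:Area",
    "mass": "ex:Mass",
    "hectare": "ex:hectare",
    "corn": "ex:Corn",
    "urea": "ex:Urea",
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def lexical_match(term: str, vocabulary=VOCABULARY, threshold: float = 0.8):
    """Return (concept, score) for the most similar label, or None below the threshold."""
    best_label = max(vocabulary, key=lambda label: similarity(term, label))
    score = similarity(term, best_label)
    if score >= threshold:
        return vocabulary[best_label], score
    return None
```
]]></preformat>
      <p>With these assumed entries, a term such as “Area” is matched to the area concept, whereas a purely numerical cell such as “12.5” yields no annotation, which is why tables that contain mostly numbers offer little to match.</p>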
      <p>
        The goal of this paper is to evaluate to what extent automatic
annotation of the table content is possible for existing natural
science spreadsheets. In Section 3 we explain the requirements for
the content of natural science spreadsheets that enable automatic
annotation. In Section 4 we present an analysis of the nature and
frequency of common design characteristics in existing natural
science spreadsheets, and discuss how these deviate from the ideal
situation. We propose to repair these deviations by complementing
the lexical matching method with heuristics and rules (Section 5)
that have been developed in earlier work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Finally, we discuss
how these heuristics could be applied to the set of analyzed tables,
and to what extent automatic annotation would be possible (Section
6).
      </p>
    </sec>
    <sec id="sec-3">
      <title>2 RELATED WORK</title>
      <p>Many studies focus on improving the interpretation of spreadsheet
data to facilitate reuse and integration. In order to derive a correct
interpretation, the studies use different strategies to infer the
semantics from spreadsheet data and resolve ambiguities. We observe
two main types of strategies.</p>
      <p>
        One strategy is to encourage domain scientists to standardize
their data to facilitate interpretation and reuse. Semantic markup
tools like RightField [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and OntoMaton [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] may be used to
develop templates for domain scientists to enter and annotate their
data simultaneously. MAGE-Tab [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], ISA-Tab [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], and BIOM [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
are tabular formats that use an underlying data model with relevant
metadata from scientific experiments. These formats can be used
to either directly enter data, or as a template for mapping other
spreadsheet files onto one structure.
      </p>
      <p>
        Another approach is to annotate tabular data with concepts from
vocabularies. Some tools [
        <xref ref-type="bibr" rid="ref12 ref17">12, 17</xref>
        ] use existing generic ontologies,
such as Yago and DBPedia, while other tools [
        <xref ref-type="bibr" rid="ref13 ref25">13, 25</xref>
        ] use existing domain
ontologies for semantic markup. Some approaches develop their
own ontology, either manually [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] or by extracting concepts and
relations from the web [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], to annotate tabular data.
      </p>
      <p>
        All the above-mentioned studies acknowledge that a correct
interpretation of tabular data is essential for conversion or annotation.
In order to derive a correct interpretation, the studies use different
approaches to infer the semantics from tabular data and resolve
ambiguities. Some of these approaches rely on manual mapping
specifications constructed by users [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] or human analysts with sufficient
knowledge of applying semantic web techniques [
        <xref ref-type="bibr" rid="ref11 ref15 ref25 ref4 ref9">4, 9, 11, 15, 25</xref>
        ].
Others compare their tabular data with large collections of
example data, e.g., large vocabularies like Yago or DBPedia, or generic
databases extracted from the Web, and rely on probabilistic
reasoning methods to find the best suitable annotation or interpretation
for table cells and columns [
        <xref ref-type="bibr" rid="ref12 ref17 ref2 ref24">2, 12, 17, 24</xref>
        ]. Many studies also use
knowledge on the structural properties of a table to derive a correct
interpretation of its content. Several studies created a library of
commonly used layout patterns in tabular data [
        <xref ref-type="bibr" rid="ref10 ref20 ref8">8, 10, 20</xref>
        ]. Abraham
and Erwig [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] developed a framework to automatically classify roles
of cells in a table based on the spatial layout of a spreadsheet. Van
Assem and colleagues [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] introduced disambiguation strategies
for units of measure and quantities based on the way these
are notated in table cells. Chen and Cafarella [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] use heuristics
and rules on spreadsheet layout and implicit metadata structure to
automatically extract relational data from spreadsheets.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3 REQUIREMENTS FOR SPREADSHEET TABLES</title>
      <p>
        Spreadsheets from the domain of natural science, e.g., biology,
physics and medical science, often represent laboratory or field
observations. The tables in these spreadsheets therefore typically
consist of numerical data, quantities and units of measure [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and
information on the associated phenomena, i.e., objects, events and
substances.
      </p>
      <p>
        Annotation of these spreadsheets with vocabulary concepts sets
requirements for both the vocabularies and the content of the
spreadsheet tables. The selected vocabularies should contain
labeled concepts, and comprise at least one vocabulary that covers
the domain of the considered spreadsheets, and a dedicated
vocabulary on quantities and units. In our research we use the OM
Ontology for units of Measure and related concepts [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Regarding the requirements for the content of spreadsheet
tables, the terms representing units of measure should follow the
international notation standards [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] (Figure 1). This implies that
these terms consist of short strings containing one or more symbols,
and optional brackets and slashes [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. The symbol(s) in the term
should lexically match a unit symbol from the OM vocabulary,
e.g., “ha” representing the unit “hectare”. Cells that contain
information on quantities should contain a description of a quantity
concept that can be lexically matched with a concept from the
OM vocabulary, e.g., “area” or “mass”. Furthermore, quantity cells
should have an associated unit of measure that is located either
in the same or in a neighboring cell. Phenomenon cells, i.e., cells
containing phenomenon instances, should contain terms that cannot
be confused with quantities or units, i.e., these terms should
preferably not consist of very short strings, symbols, or abbreviations.
The phenomena in tables are annotated with domain concepts, e.g.,
“corn” and “urea”.
      </p>
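      <p>The notation requirement for unit terms can be checked mechanically. The following sketch assumes a small hand-picked set of unit symbols in place of the OM vocabulary, and the function names are hypothetical; it does not handle SI prefixes or exponents.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical subset of unit symbols; a real check would load them from OM.
UNIT_SYMBOLS = {"ha", "kg", "g", "MJ", "L", "mol", "yr", "m", "s"}


def unit_tokens(term: str):
    """Split a unit term such as "(kg/ha)" into its symbol tokens."""
    stripped = term.strip().strip("()[]")  # drop optional surrounding brackets
    return [tok for tok in re.split(r"[/\s.*]+", stripped) if tok]


def is_standard_unit_term(term: str, symbols=UNIT_SYMBOLS) -> bool:
    """True if the term is non-empty and every token is a known unit symbol."""
    tokens = unit_tokens(term)
    return bool(tokens) and all(tok in symbols for tok in tokens)
```
]]></preformat>
      <p>Under these assumptions “(kg/ha)” passes the check, while a term such as “MJ/1000 kg milk/yr” does not, because it mixes a number and a phenomenon into the unit.</p>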
    </sec>
    <sec id="sec-6">
      <title>4 ANALYSIS OF TABLE DESIGN</title>
    </sec>
    <sec id="sec-7">
      <title>Data set</title>
      <p>
        We conduct an analysis of a set of existing natural science
spreadsheet tables, in order to gain knowledge of common practice in
table design, and to find out in what ways the content of these
tables may deviate from the ideal situation as described in the
previous section. To this end, we analyze a total of 361 tables in 84
spreadsheets that are used in 20 existing research projects in the
domain of natural science (Table 1). All spreadsheets fall within
the scope of our research, i.e., natural science spreadsheets that
consist of numerical data, quantities and units of measure, and
information on the associated objects and events. About half of
the inspected tables were used in our earlier work on interpretation
and annotation of spreadsheets [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], or formulas [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We collect
additional spreadsheets from colleagues at Wageningen University
and Research and we use the Google Scholar web search engine to
find spreadsheets that are published online as supplementary data
alongside journal papers.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Data analysis</title>
      <p>Our analysis consists of a manual inspection of all spreadsheet
tables, in which we only consider the content of the tables, and
ignore the title or comments. We color code the cells in each block
based on the content (see, for example, the legend in Figure 2), and
analyze the syntax of the formulas and their composition in the
table. Subsequently, we determine for each table to what extent
it meets our requirements described in Section 3. We explain our
results in terms of deviations from these requirements, and discuss
some additional observations we made on the structure of the
analyzed tables.</p>
    </sec>
    <sec id="sec-9">
      <title>4.1 Results</title>
      <p>Deviant unit notations. More than half of the analyzed tables
contain unit cells (Table 2), but in almost one third the notation
of the unit terms and symbols does not follow the international
standards. In the majority of the cases the unit terms are customized
by the scientists. The resulting unit terms are not incorrect per
se, but rather unconventional, and automatic recognition of these
terms is hindered. Scientists often combine phenomena with unit
symbols, e.g., “MJ/1000 kg milk/yr” and “g CO2e/MJ” (Figures 2, 3).</p>
      <p>Incomplete quantity notations. The majority of the analyzed
tables contain quantity cells (Table 2). In almost half of the analyzed
tables, the quantity cells do not contain complete information, and
automatic recognition of the quantities is not straightforward. Some
tables contain cells with a phenomenon description, and an
associated unit of measure located in the neighboring cell (Figure 2).
Although no quantity concept is mentioned, these cells implicitly
represent quantities. In Table 2 we do not consider these cells as
quantity cells.</p>
      <p>Unclear phenomenon notations. In the majority of the analyzed
tables phenomenon cells are present (Table 2). However, some of
these tables contain cells that, judging from their position in the
table, probably represent phenomena, but do not contain full words.
Instead, these cells contain numbers or codes representing, e.g.,
dates, scientific experiments, or identification numbers (Figures 2, 3),
or abbreviations which are either application specific, e.g., chemical
elements or geographical codes, or related to scientific experiments.</p>
      <p>Other observations. In more than half of the analyzed tables the
float cells, containing the values of observations, and the string
cells, containing contextual information on these observations, are
not located in homogeneous blocks. In most of these tables, the
float blocks are interrupted, either by empty cells (Figure 3) or, less
frequently, by qualitative, i.e., string, values. In a small part of the
analyzed tables the string and float blocks are not aligned with each
other, i.e., these blocks do not have similar dimensions. In these
tables it is not clear which observations are associated with which
context.</p>
      <p>Cells representing semantically related phenomenon instances
are typically grouped in the same string block. We observe that</p>
    </sec>
    <sec id="sec-10">
      <title>5 COMPLEMENTARY HEURISTICS</title>
      <p>We observe that most of the analyzed tables do not meet one or more
of the requirements listed in Section 3, and we expect that lexical
matching will not yield many useful annotations. In this section we
therefore propose to complement lexical matching with heuristics
and knowledge from vocabularies to overcome the challenges of
incomplete information.</p>
    </sec>
    <sec id="sec-11">
      <title>5.1 Recognizing and annotating blocks</title>
      <p>
        Domain scientists typically group cells that are semantically related
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and use structure and layout features to distinguish between
these groups [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We assume that this grouping not only applies
to phenomenon cells, but also to cells representing quantities and
units of measure.
      </p>
      <p>
        We have developed heuristics that support us in the recognition
of the type of cell, i.e., unit of measure, quantity or phenomenon,
and the annotation of its content [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These heuristics combine
information on the notation of terms in cells, and the composition
and positioning of blocks of cells in a table, e.g.:
• If a cell contains both a string term and a unit of measure, it is
a quantity cell
• A block is considered a “Quantity” or a “Unit” block when at
least 30% of its cells are recognized as quantity or unit cells
• A block is considered a “Quantity” block when it is vertically or
horizontally aligned with the “Unit” block and the float block
Although the quantity cells in Figures 2B and 3B contain no
description of a quantity concept, these cells could still be recognized
as quantity cells by considering the presence and position of the
units of measure in the table.
      </p>
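      <p>The first two heuristics listed above can be sketched as follows. The function names, the cell-type labels and the way ties are resolved are assumptions made for illustration; only the 30% threshold is taken from the text, and it is read here as applying to quantity cells and unit cells separately. The alignment heuristic is left out because it requires the coordinates of the blocks.</p>
      <preformat><![CDATA[
```python
from collections import Counter

BLOCK_THRESHOLD = 0.30  # minimum share of recognized cells, as stated in the text


def cell_type(has_string_term, has_unit):
    """Heuristic 1: a string term together with a unit marks a quantity cell."""
    if has_string_term and has_unit:
        return "quantity"
    if has_unit:
        return "unit"
    return None  # undecided; other evidence would be needed


def classify_block(cell_types, threshold=BLOCK_THRESHOLD):
    """Heuristic 2: label a block "Quantity" or "Unit" if enough cells agree."""
    counts = Counter(t for t in cell_types if t in ("quantity", "unit"))
    if not counts:
        return None
    label, hits = counts.most_common(1)[0]  # assumption: most frequent type wins
    if hits / len(cell_types) >= threshold:
        return label.capitalize()
    return None
```
]]></preformat>
      <p>In this reading, a block of three cells in which only one is recognized as a unit cell is still labeled a “Unit” block, which is how sparse evidence at the cell level can be lifted to the block level.</p>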
    </sec>
    <sec id="sec-12">
      <title>5.2 Knowledge from vocabularies</title>
      <p>The phenomenon cells in natural science tables may not contain
explicit terms, but codes, numbers or abbreviations (Section 4.1),
that are commonly used by scientists to refer to domain specific
entities, e.g., chemical elements or human hormones (Figures 2A
and 3B, respectively). Vocabularies with additional information at the
instance level could be used to annotate the content of these cells,
and to recognize these as phenomenon cells. Furthermore, these
vocabularies may also facilitate recognition of these phenomenon
cells, by distinguishing the codes and abbreviations from quantities
and units of measure.</p>
      <p>
        Many quantity cells in the analyzed tables do not contain a
concept description, but do have an associated unit of measure
(Section 4.1). The missing quantity concept may be obtained by
using the following heuristic:
• Annotation concepts for quantity cells can be deduced from the
included unit term
Some units are commonly associated with certain quantities [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ],
and this type of information is included in, for example, the OM
vocabulary. For example, the missing quantity concept in a cell
containing the term “BT (nmol/L)” (Figure 3B) is probably “Molar
Concentration”, as the unit of measure “mol/L” is commonly
associated with this quantity. However, this requires both a correct
notation and interpretation of the associated units of measure, and
that the information on the ‘common association’ is indeed present
in OM.
      </p>
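      <p>The heuristic above amounts to a lookup from unit to quantity. The sketch below assumes a hypothetical three-entry association table in place of OM, a unit written between parentheses, and a deliberately small set of SI prefixes; none of these names are taken from OM itself.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical unit -> quantity associations; the text says OM records such links.
COMMON_QUANTITY = {"mol/L": "Molar Concentration", "ha": "Area", "kg": "Mass"}
SI_PREFIXES = ("n", "m", "k", "M")  # small illustrative subset


def unit_in_cell(text):
    """Return the unit written between parentheses, e.g. "nmol/L", or None."""
    match = re.search(r"\(([^()]+)\)", text)
    return match.group(1).strip() if match else None


def deduce_quantity(text):
    """Deduce the quantity concept of a cell from its unit term, if possible."""
    unit = unit_in_cell(text)
    if unit is None:
        return None
    if unit in COMMON_QUANTITY:
        return COMMON_QUANTITY[unit]
    # Retry without a leading SI prefix, so that "nmol/L" falls back to "mol/L".
    if unit[:1] in SI_PREFIXES and unit[1:] in COMMON_QUANTITY:
        return COMMON_QUANTITY[unit[1:]]
    return None
```
]]></preformat>
      <p>For the cell “BT (nmol/L)” this sketch returns “Molar Concentration”, and it returns nothing when the unit is missing or unknown, which mirrors the dependence on a correct unit notation noted above.</p>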
      <p>Besides, domain scientists often customize the units of measure
in their spreadsheets by combining unit symbols with phenomenon
terms. Recognizing these phenomena using a domain vocabulary,
and subsequently removing these from the unit terms would
probably result in better recognition and interpretation of the units of
measure in a table.</p>
    </sec>
    <sec id="sec-13">
      <title>5.3 Deduction from table context</title>
      <p>Empty cells in tables are often left empty on purpose, as data on
either the observation value or its context is missing. In many cases
these empty cells are surrounded by non-empty neighbouring cells,
which could be used to deduce the missing information:
• The content type of an empty cell is the same as that of the
neighbouring cells</p>
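      <p>A minimal sketch of this heuristic for a single row or column of cell types is given below. It assumes that “neighbouring” means the nearest non-empty cell on either side, and that an empty cell is left undecided when the two sides disagree; both choices, as well as the function name, are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def fill_empty_types(cell_types):
    """Give each empty cell (None) the type shared by its nearest non-empty neighbours."""
    filled = list(cell_types)
    for i, current in enumerate(cell_types):
        if current is not None:
            continue
        # Nearest non-empty type to the left and to the right in the original sequence.
        left = next((t for t in reversed(cell_types[:i]) if t is not None), None)
        right = next((t for t in cell_types[i + 1:] if t is not None), None)
        if left is not None and (right is None or left == right):
            filled[i] = left
        elif left is None and right is not None:
            filled[i] = right
        # Otherwise the neighbours disagree or are absent: leave the cell undecided.
    return filled
```
]]></preformat>
      <p>An empty cell between two float cells is thus treated as a missing observation value, while an empty cell between a string cell and a float cell is left open.</p>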
      <p>
        Annotation of a whole group of phenomena is often difficult, as
the semantic relation between the grouped phenomenon instances
cannot be recognized in a domain vocabulary. As suggested by
[
        <xref ref-type="bibr" rid="ref1 ref10 ref3">1, 3, 10</xref>
        ], the following heuristic may be used:
• A group of phenomenon instances may have a common
denominator cell that is present above or to the left of the phenomenon
block
The term in this common denominator cell may be annotated and
provide the concept of the phenomenon class.
      </p>
    </sec>
    <sec id="sec-14">
      <title>6 PLAUSIBILITY OF AUTOMATIC ANNOTATION</title>
      <p>In this section we investigate the applicability of our heuristics
to the set of tables analyzed in our survey. Given that for the
majority of these tables automatic annotation based solely on lexical
matching would not be successful, the applicability of the heuristics
gives us an indication of the extent to which automatic annotation may
still be possible. We distinguish three levels of plausibility:</p>
      <p>Automatic annotation is not possible. A small number of the analyzed
tables display serious deviations in their basic structure (e.g., Figure
3A). In these tables either the blocks with numerical data on observations
are accompanied by only one block of contextual information,
or it is not clear from the table structure which string and float cells
are related to each other.</p>
      <p>Although a good basic structure is not a requirement for
successful lexical matching, it is a prerequisite for our heuristics. Without
the presence and alignment of string and float blocks, our block
heuristics (Section 5.1) cannot be used to recognize the units of
measure, quantities and phenomena in these tables. Consequently,
without knowledge of the types of cells, deduction of information
from the table context (Section 5.3) and derivation of additional
knowledge from vocabularies (Section 5.2) are not possible. The annotation
of these tables would thus be based solely on lexical matching. As
these tables do not contain complete and explicit information on
the underlying research, we expect that automatic annotation of
these tables is not possible. Incidentally, the majority of these tables
would probably be hard to interpret for human readers as well.</p>
      <p>Automatic annotation is difficult. Several of the analyzed tables
have a good basic structure, but are missing entities.</p>
      <p>In some of these tables the units of measure are missing, which
hinders the recognition and annotation of quantities. The
recognition of quantity cells in a table may be improved by block heuristics
(Section 5.1). For the annotation of these quantities it is, however,
not possible to derive additional knowledge from vocabularies.</p>
      <p>In other tables the quantities are only implicitly represented (e.g.,
Figure 2A). Block heuristics may facilitate recognizing which of
the contextual blocks in the table serves as a quantity block. The
annotation of the quantities in these tables is difficult, as these are
technically not present, but may be deduced from the associated
units of measure.</p>
      <p>Reconstruction is possible. The majority of the analyzed tables
have a good basic structure, and consist of unit, quantity, and
phenomenon cells, but do not contain complete and explicit information
on these entities. As we expect all of our heuristics to be applicable
to this type of table, the missing information may be
complemented, and successful recognition and annotation of the entities in
these tables is possible.</p>
    </sec>
    <sec id="sec-16">
      <title>7 DISCUSSION AND CONCLUSION</title>
      <p>In this study we show that it is plausible to automatically
annotate natural science spreadsheets, even if the tables do not contain
complete and explicit information on the corresponding research
project.</p>
      <p>The quality and the level of detail of the annotations will, of
course, still depend on the completeness and accuracy of the
content of the tables. However, even if the quality of the content
is not sufficient to automatically annotate terms on an individual
level, the block heuristics may be used to recognize the quantity,
phenomenon and unit of measure blocks, thereby providing a basic
understanding of the table. What is more, we think that block
heuristics are useful in all spreadsheet tables, as these heuristics
provide insight into how the cells in a table are related, and as such
facilitate interpretation.</p>
      <p>The number of existing research spreadsheets, especially in the
informal area, is large. Our proposed approach could be used to
facilitate search, reuse and integration of these spreadsheet data, by
analyzing the information in annotations. Furthermore, the
observations and heuristics from this study can be used as guidelines or in
support tools for domain scientists to design new spreadsheet tables
that are easier for both humans and machines to interpret.</p>
      <p>However, we do not believe that the common practice of spreadsheet
development by domain scientists is easily changed. We expect that
domain scientists will keep using spreadsheets, as these structures
provide an easy and accessible way to store and manipulate research
data according to their preferences. Therefore we expect that the
automatic annotation of existing natural science spreadsheets will
remain an issue. Our proposed method provides part of the solution
to this issue.</p>
    </sec>
    <sec id="sec-17">
      <title>ACKNOWLEDGMENTS</title>
      <p>This publication was supported by the Dutch national program
COMMIT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Robin</given-names>
            <surname>Abraham</surname>
          </string-name>
          and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Erwig</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Inferring Templates from Spreadsheets</article-title>
          .
          <source>In Proceedings of the 28th international conference on Software engineering. ACM</source>
          ,
          <fpage>182</fpage>
          -
          <lpage>191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
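          <string-name>
            <given-names>Michael J</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alon</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Daisy Zhe</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Eugene</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yang</given-names>
            <surname>Zhang</surname>
          </string-name>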
          .
          <year>2008</year>
          .
          <article-title>WebTables: Exploring the Power of Tables on the Web</article-title>
          .
          <source>Proceedings of the VLDB Endowment 1</source>
          ,
          <issue>1</issue>
          (
          <year>2008</year>
          ),
          <fpage>538</fpage>
          -
          <lpage>549</lpage>
          . https://doi.org/10.14778/1453856.1453916
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Zhe</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael J</given-names>
            <surname>Cafarella</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Automatic web spreadsheet data extraction</article-title>
          .
          <source>In Proceedings of the 3rd International Workshop on Semantic Search Over the Web - SS@ '13</source>
          .
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . https://doi.org/10.1145/2509908.2509909
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Martin J</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Halaschek-Wiener</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mark A</given-names>
            <surname>Musen</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Mapping Master: a Flexible Approach for Mapping Spreadsheets to OWL</article-title>
          . In
          <source>The Semantic Web-ISWC</source>
          . Springer Berlin Heidelberg,
          <fpage>194</fpage>
          -
          <lpage>208</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
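          <string-name>
            <given-names>Martine G</given-names>
            <surname>De Vos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Willem Robert</given-names>
            <surname>Van Hage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan</given-names>
            <surname>Ros</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Guus</given-names>
            <surname>Schreiber</surname>
          </string-name>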
          .
          <year>2012</year>
          .
          <article-title>Reconstructing Semantics of Scientific Models : a Case Study</article-title>
          .
          <source>In Proceedings of the OEDW workshop on Ontology engineering in a data driven world, EKAW 2012</source>
          . Galway, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
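          <string-name>
            <given-names>Martine G</given-names>
            <surname>De Vos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan</given-names>
            <surname>Wielemaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hajo</given-names>
            <surname>Rijgersberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guus</given-names>
            <surname>Schreiber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bob</given-names>
            <surname>Wielinga</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Top</surname>
          </string-name>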
          .
          <year>2017</year>
          .
          <article-title>Combining Information on Structure and Content to Automatically Annotate Natural Science Spreadsheets</article-title>
          .
          <source>International Journal of Human-Computer Studies (in press)</source>
          ,
          <volume>0</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
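          <string-name>
            <given-names>Martine G</given-names>
            <surname>De Vos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan</given-names>
            <surname>Wielemaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bob</given-names>
            <surname>Wielinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guus</given-names>
            <surname>Schreiber</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Top</surname>
          </string-name>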
          .
          <year>2015</year>
          .
          <article-title>A methodology for constructing the calculation model of scientific spreadsheets</article-title>
          .
          <source>In Proceedings of the 8th International Conference on Knowledge Capture.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
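          <string-name>
            <given-names>Andres</given-names>
            <surname>Garcia-Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Asuncion</given-names>
            <surname>Gomez-Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mari Carmen</given-names>
            <surname>Suarez-Figueroa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Boris</given-names>
            <surname>Villazon-Terrazas</surname>
          </string-name>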
          .
          <year>2008</year>
          .
          <article-title>A Pattern Based Approach for Re-engineering Non-Ontological Resources into Ontologies</article-title>
          .
          <source>In The Semantic Web. Number 2</source>
          . Springer Berlin Heidelberg,
          <fpage>167</fpage>
          -
          <lpage>181</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Lushan</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Finin</surname>
          </string-name>
          , Cynthia Parr, Joel Sachs, and
          <string-name>
            <given-names>Anupam</given-names>
            <surname>Joshi</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>RDF123 : From Spreadsheets to RDF</article-title>
          . In
          <source>The Semantic Web-ISWC 2008</source>
          . Springer Berlin Heidelberg,
          <fpage>451</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Felienne</given-names>
            <surname>Hermans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Pinzger</surname>
          </string-name>
          , and Arie Van Deursen.
          <year>2010</year>
          .
          <article-title>Automatically Extracting Class Diagrams from Spreadsheets</article-title>
          .
          <source>In 24th European Conference on Object-Oriented Programming (ECOOP), Lecture Notes in Computer Science</source>
          . Springer-Verlag,
          <fpage>52</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Langegger</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wolfram</given-names>
            <surname>Wöß</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>XLWrap – Querying and Integrating Arbitrary Spreadsheets with SPARQL</article-title>
          .
          <source>In International Semantic Web Conference</source>
          .
          <fpage>359</fpage>
          -
          <lpage>374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Girija</given-names>
            <surname>Limaye</surname>
          </string-name>
          , Sunita Sarawagi, and
          <string-name>
            <given-names>Soumen</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Annotating and searching web tables using entities, types and relationships</article-title>
          .
          <source>In Proceedings of the VLDB Endowment</source>
          , Vol.
          <volume>3</volume>
          .
          <fpage>1338</fpage>
          -
          <lpage>1347</lpage>
          . https://doi.org/10.14778/1920841.1921005
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Eamonn</given-names>
            <surname>Maguire</surname>
          </string-name>
          , Alejandra González-Beltrán, Patricia L. Whetzel, Susanna Assunta Sansone, and
          <string-name>
            <given-names>Philippe</given-names>
            <surname>Rocca-Serra</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>OntoMaton: A Bioportal powered ontology widget for Google Spreadsheets</article-title>
          .
          <source>Bioinformatics</source>
          <volume>29</volume>
          ,
          <issue>4</issue>
          (
          <year>2013</year>
          ),
          <fpage>525</fpage>
          -
          <lpage>527</lpage>
          . https://doi.org/10.1093/bioinformatics/bts718
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>McDonald</surname>
          </string-name>
          , Jose C Clemente, Justin Kuczynski, Jai Rideout, Jesse Stombaugh, Doug Wendel, Andreas Wilke, Susan Huse, John Hufnagle, Folker Meyer, Rob Knight, and
          <string-name>
            <given-names>J</given-names>
            <surname>Caporaso</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome</article-title>
          .
          <source>GigaScience</source>
          <volume>1</volume>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <elocation-id>7</elocation-id>
          . https://doi.org/10.1186/2047-217X-1-7
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Albert</given-names>
            <surname>Meroño-Peñuela</surname>
          </string-name>
          , Ashkan Ashkpour, Laurens Rietveld, Rinke Hoekstra, and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Schlobach</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Linked Humanities Data : The Next Frontier ? A Case-study in Historical Census Data</article-title>
          .
          <source>In The Semantic Web: Semantics and Big Data</source>
          . Springer Berlin Heidelberg,
          <fpage>645</fpage>
          -
          <lpage>649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Roland T.</given-names>
            <surname>Mittermeir</surname>
          </string-name>
          and
          <string-name>
            <given-names>Markus</given-names>
            <surname>Clermont</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Finding High-Level Structures in Spreadsheet Programs</article-title>
          .
          <source>In Proceedings of the 9th Working Conference on Reverse Engineering</source>
          . Richmond, VA, USA,
          <fpage>221</fpage>
          -
          <lpage>232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Varish</given-names>
            <surname>Mulwad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Finin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anupam</given-names>
            <surname>Joshi</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A Domain Independent Framework for Extracting Linked Semantic Data from Tables</article-title>
          .
          <source>In Search Computing</source>
          . Springer Berlin Heidelberg,
          <fpage>16</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Tim F</given-names>
            <surname>Rayner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Philippe</given-names>
            <surname>Rocca-Serra</surname>
          </string-name>
          , Paul T Spellman, Helen C Causton, Anna Farne, Ele Holloway, Rafael A Irizarry, Junmin Liu, Donald S Maier, Michael Miller, Kjell Petersen, John Quackenbush, Gavin Sherlock, Christian J Stoeckert, Joseph White, Patricia L. Whetzel, Farrell Wymore, Helen Parkinson, Ugis Sarkans, Catherine A Ball, and
          <string-name>
            <given-names>Alvis</given-names>
            <surname>Brazma</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>7</volume>
          (
          <year>2006</year>
          ),
          <elocation-id>489</elocation-id>
          . https://doi.org/10.1186/1471-2105-7-489
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Hajo</given-names>
            <surname>Rijgersberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wigham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Top</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>How semantics can improve engineering processes: A case of units of measure and quantities</article-title>
          .
          <source>Advanced Engineering Informatics</source>
          <volume>25</volume>
          ,
          <issue>2</issue>
          (apr
          <year>2011</year>
          ),
          <fpage>276</fpage>
          -
          <lpage>287</lpage>
          . https://doi.org/10.1016/j.aei.2010.07.008
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Ivelize Rocha</given-names>
            <surname>Bernardo</surname>
          </string-name>
          , Matheus S Mota, and
          <string-name>
            <given-names>André</given-names>
            <surname>Santanchè</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Extracting and Semantically Integrating Implicit Schemas from Multiple Spreadsheets of Biology based on the Recognition of their Nature</article-title>
          .
          <source>Journal of Information and Database Management</source>
          <volume>4</volume>
          ,
          <issue>2</issue>
          (
          <year>2013</year>
          ),
          <fpage>104</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Susanna-Assunta</given-names>
            <surname>Sansone</surname>
          </string-name>
          , Philippe Rocca-Serra, Dawn Field, Eamonn Maguire, Chris Taylor, Oliver Hofmann, Hong Fang, Steffen Neumann, Weida Tong, Linda Amaral-Zettler, Kimberly Begley, Tim Booth, Lydie Bougueleret, Gully Burns, Brad Chapman, Tim Clark, Lee-Ann Coleman, Jay Copeland, Sudeshna Das, Antoine de Daruvar, Paula de Matos, Ian Dix, Scott Edmunds, Chris T Evelo, Mark J Forster, Pascale Gaudet, Jack Gilbert, Carole Goble, Julian L Griffin, Daniel Jacob, Jos Kleinjans, Lee Harland, Kenneth Haug, Henning Hermjakob, Shannan J Ho Sui, Alain Laederach, Shaoguang Liang, Stephen Marshall, Annette McGrath, Emily Merrill, Dorothy Reilly, Magali Roux, Caroline E Shamu, Catherine A Shang, Christoph Steinbeck, Anne Trefethen, Bryn Williams-Jones, Katherine Wolstencroft, Ioannis Xenarios, and
          <string-name>
            <given-names>Winston</given-names>
            <surname>Hide</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Toward interoperable bioscience data</article-title>
          .
          <source>Nature Genetics</source>
          <volume>44</volume>
          ,
          <issue>2</issue>
          (
          <year>2012</year>
          ),
          <fpage>121</fpage>
          -
          <lpage>6</lpage>
          . https://doi.org/10.1038/ng.1054
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Yanfeng</given-names>
            <surname>Shu</surname>
          </string-name>
          , David Ratcliffe,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Compton</surname>
          </string-name>
          , Geoffrey Squire, and
          <string-name>
            <given-names>Kerry</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A semantic approach to data translation: A case study of environmental observations data</article-title>
          .
          <source>Knowledge-Based Systems</source>
          <volume>75</volume>
          (
          <year>2015</year>
          ),
          <fpage>104</fpage>
          -
          <lpage>123</lpage>
          . https://doi.org/10.1016/j.knosys.2014.11.023
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Van Assem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hajo</given-names>
            <surname>Rijgersberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wigham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Top</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Converting and Annotating Quantitative Data</article-title>
          . In ISWC2010,
          <string-name>
            <given-names>P.F.</given-names>
            <surname>Patel-Schneider</surname>
          </string-name>
          (Ed.).
          <fpage>16</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Petros</given-names>
            <surname>Venetis</surname>
          </string-name>
          , Alon Halevy, and
          <string-name>
            <given-names>J</given-names>
            <surname>Madhavan</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Recovering semantics of tables on the web</article-title>
          .
          <source>In Proceedings of the VLDB Endowment</source>
          , Vol.
          <volume>4</volume>
          .
          <fpage>528</fpage>
          -
          <lpage>538</lpage>
          . https://doi.org/10.14778/2002938.2002939
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Katy</given-names>
            <surname>Wolstencroft</surname>
          </string-name>
          , Stuart Owen, Matthew Horridge, Olga Krebs, Wolfgang Mueller, Jacky L Snoep, Franco du Preez, and
          <string-name>
            <given-names>Carole</given-names>
            <surname>Goble</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>RightField: embedding ontology annotation in spreadsheets</article-title>
          .
          <source>Bioinformatics</source>
          (Oxford, England)
          <volume>27</volume>
          ,
          <issue>14</issue>
          (jul
          <year>2011</year>
          ),
          <fpage>2021</fpage>
          -
          <lpage>2</lpage>
          . https://doi.org/10.1093/bioinformatics/btr312
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>