=Paper= {{Paper |id=Vol-2065/paper06 |storemode=property |title=How Plausible is Automatic Annotation of Scientific Spreadsheets? |pdfUrl=https://ceur-ws.org/Vol-2065/paper06.pdf |volume=Vol-2065 |authors=Martine de Vos,Jan Wielemaker,Bob Wielinga,Guus Schreiber,Jan Top |dblpUrl=https://dblp.org/rec/conf/kcap/VosWWST17 }} ==How Plausible is Automatic Annotation of Scientific Spreadsheets?== https://ceur-ws.org/Vol-2065/paper06.pdf
                   How plausible is automatic annotation of scientific
                                     spreadsheets?
                  Martine de Vos∗                                            Jan Wielemaker                                  Bob Wielinga †
    Computer science, Network Institute,                        Computer science, Network Institute,             Computer science, Network Institute,
       Vrije Universiteit Amsterdam                                Vrije Universiteit Amsterdam                     Vrije Universiteit Amsterdam
           martine.de.vos@vu.nl                                         j.wielemaker@vu.nl

                                                Guus Schreiber                                           Jan Top†
                                  Computer science, Network Institute,                       Food and Biobased Research,
                                     Vrije Universiteit Amsterdam                         Wageningen University and Research
                                         guus.schreiber@vu.nl                                           Centre
                                                                                                   jan.top@wur.nl

ABSTRACT                                                                                   [23] they tend to be sloppy in the specification of the semantics of
It is possible to automatically annotate a natural science spreadsheet                     their data, and the free format allows them to do so [19]. However,
using lexical matching, given that the tables in these spreadsheets                        as the domain model is essential to understand the meaning and
meet a number of requirements regarding the content. Results of a                          context of spreadsheet data, it is currently hard to unambiguously
survey show that most of the existing natural science spreadsheets                         interpret these data for people other than the original developers.
deviate from the ideal situation. We propose to complement lex-                                The content of the tables can be annotated with concepts from
ical matching with both heuristics and knowledge from external                             external vocabularies to facilitate interpretation, evaluation and
vocabularies to overcome these deviations.                                                 reuse of spreadsheet data. This annotation can be performed au-
                                                                                           tomatically by using lexical matching, i.e., by assessing the lexical
CCS CONCEPTS                                                                               similarity between spreadsheet terms and labels from selected vo-
                                                                                           cabularies. This method requires, ironically, that the spreadsheet
• Computing methodologies → Model development and anal-
                                                                                           cells contain explicit and complete information on the correspond-
ysis; • Applied computing → Physical sciences and engineer-
                                                                                           ing research. Furthermore, as natural science spreadsheets often
ing;
                                                                                           represent observational data, the tables typically contain mostly
KEYWORDS                                                                                   numbers and little text. The amount of data that can be annotated
                                                                                           in natural science spreadsheets is thus limited.
Spreadsheets, Knowledge engineering, Domain Knowledge, Heuris-                                 The goal of this paper is to evaluate to what extent automatic
tics, Vocabularies                                                                         annotation of the table content is possible for existing natural sci-
ACM Reference Format:                                                                      ence spreadsheets. In Section 3 we explain the requirements for
Martine de Vos, Jan Wielemaker, Bob Wielinga †, Guus Schreiber, and Jan                    the content of natural science spreadsheets that enable automatic
Top. 2018. How plausible is automatic annotation of scientific spreadsheets?.              annotation. In Section 4 we present an analysis on the nature and
In . ACM, New York, NY, USA, 6 pages.                                                      frequency of common design characteristics in existing natural
                                                                                           science spreadsheets, and discuss how these deviate from the ideal
1    INTRODUCTION                                                                          situation. We propose to repair these deviations by complementing
In this paper we investigate the feasibility of automatically annotat-                     the lexical matching method with heuristics and rules (Section 5)
ing existing natural science spreadsheets.                                                 that have been developed in earlier work [6]. Finally, we discuss
   Scientists in the domain of natural science, hereafter referred to                      how these heuristics could be applied to the set of analyzed tables,
as domain scientists, frequently use spreadsheets to analyze and ma-                       and to what extent automatic annotation would be possible (Section
nipulate their research data [13, 18, 25]. The format of spreadsheets                      6).
gives them a great deal of freedom in how they enter their data.
Domain scientists can make their own choices with respect to the
entities and processes to be included, and the way in which these
                                                                                           2   RELATED WORK
are organized in tables. In this way their domain model is implicitly                      Many studies focus on improving the interpretation of spreadsheet
reflected in the content and structure of the spreadsheet tables. As                       data to facilitate reuse and integration. In order to derive a correct
researchers do not anticipate the reuse of their spreadsheet data                          interpretation, the studies use different strategies to infer the seman-
∗ Corresponding author
                                                                                           tics from spreadsheet data and dissolve ambiguities. We observe
† Second affiliation: Computer science, Network Institute, Vrije Universiteit Amsterdam    two main types of strategies.
                                                                                              One strategy is to encourage domain scientists to standardize
                                                                                           their data to facilitate interpretation and reuse. Semantic markup
K-CAP2017 Workshops and Tutorials Proceedings,                                             tools like RightField [25] and OntoMaton [13], may be used to
© Copyright held by the owner/author(s).
                                                                                           develop templates for domain scientists to enter and annotate their


data simultaneously. MAGE-Tab [18], ISA-Tab [21], and BIOM [14]           Table 1: Sources of tables inspected to perform the analysis
are tabular formats that use an underlying data model with relevant       on design characteristics in natural science spreadsheets
metadata from scientific experiments. These formats can be used
to either directly enter data, or as a template for mapping other                                     Online supplementary data   From institutes
                                                                                # research projects               12                    8
spreadsheet files onto one structure.                                           # spreadsheets                    40                    44
   Another approach is to annotate tabular data with concepts from              # tables                         128                   233
vocabularies. Some tools [12, 17] use existing generic ontologies,
like Yago and DBPedia, while other tools [13, 25] use existing domain
ontologies for semantic markup. Some approaches develop their             , e.g., “ha” representing the unit “hectare”. Cells that contain infor-
own ontology, either manually [22] or by extracting concepts and          mation on quantities should contain a description of a quantity
relations from the web [24], to annotate tabular data.                    concept, that can be lexically matched with a concept from the
   All the abovementioned studies acknowledge that a correct inter-       OM vocabulary, e.g., “area” or “mass”. Furthermore, quantity cells
pretation of tabular data is essential for conversion or annotation.      should have an associated unit of measure, that is located either
In order to derive a correct interpretation, the studies use different    in the same or in a neighboring cell. Phenomenon cells, i.e. cells
approaches to infer the semantics from tabular data and dissolve am-      containing phenomenon instances, should contain terms that cannot
biguities. Some of these approaches rely on manual mapping specifi-       be confused with quantities or units, i.e., these terms should pre-
cations constructed by users [22] or human analysts with sufficient       ferably not consist of very short strings, symbols, or abbreviations.
knowledge of applying semantic web techniques [4, 9, 11, 15, 25].         The phenomena in tables are annotated with domain concepts, e.g.,
Others compare their tabular data with large collections of exam-         "corn" and "urea".
ple data, e.g., large vocabularies like Yago or DBPedia, or generic
databases extracted from the Web, and rely on probabilistic reason-       4 ANALYSIS OF TABLE DESIGN
ing methods to find the best suitable annotation or interpretation        Data set
for table cells and columns [2, 12, 17, 24]. In addition, many studies use
                                                                          We conduct an analysis on a set of existing natural science spread-
knowledge on the structural properties of a table to derive a correct
                                                                          sheet tables, in order to gain knowledge on common practice of
interpretation of its content. Several studies created a library on
                                                                          table design, and to find out in what ways the content of these
commonly used layout patterns in tabular data [8, 10, 20]. Abraham
                                                                          tables may deviate from the ideal situation as described in the pre-
and Erwig [1] developed a framework to automatically classify roles
                                                                          vious section. To this end, we analyse a total of 361 tables in 84
of cells in a table based on the spatial layout of a spreadsheet. Van
                                                                          spreadsheets, that are used in 20 existing research projects in the
Assem and colleagues [23] introduced disambiguation strategies
                                                                          domain of natural science (Table 1). All spreadsheets fall within
for units of measure and quantities based on the way these
                                                                          the scope of our research, i.e., natural science spreadsheets that
are notated in table cells. And Chen and Cafarella [3] use heuristics
                                                                          consist of numerical data, quantities and units of measure, and
and rules on spreadsheet layout and implicit metadata structure to
                                                                          information on the associated objects and events. About half of
automatically extract relational data from spreadsheets.
                                                                          the inspected tables are used in our earlier work on interpretation
                                                                          and annotation of spreadsheets [5, 6], or formulas [7]. We collect
                                                                          additional spreadsheets from colleagues at Wageningen University
3   REQUIREMENTS FOR SPREADSHEET                                          and Research and we use the Google Scholar web search engine to
    TABLES                                                                find spreadsheets that are published online as supplementary data
Spreadsheets from the domain of natural science, e.g., biology,           alongside journal papers.
physics and medical science, often represent laboratory or field
observations. The tables in these spreadsheets therefore typically        Data analysis
consist of numerical data, quantities and units of measure [23], and      Our analysis consists of a manual inspection of all spreadsheet
information on the associated phenomena, i.e., objects, events and        tables, in which we only consider the content of the tables, and
substances.                                                               ignore the title or comments. We color code the cells in each block
   Annotation of these spreadsheets with vocabulary concepts sets         based on the content (see, for example, the legend in Figure 2), and
requirements for both the vocabularies and the content of the             analyze the syntax of the formulas and their composition in the
spreadsheet tables. The selected vocabularies should contain la-          table. Subsequently, we determine for each table to what extent
beled concepts, and comprise at least one vocabulary that covers          it meets our requirements described in section 3. We explain our
the domain of the considered spreadsheets, and a dedicated vo-            results in terms of deviations from these requirements, and discuss
cabulary on quantities and units. In our research we use the OM           some additional observations we made on the structure of the
Ontology for units of Measure and related concepts [19].                  analyzed tables.
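The lexical matching that these vocabulary requirements are meant to support can be illustrated with a minimal sketch. It assumes a tiny hand-made label table in place of OM and a domain vocabulary, and it uses the similarity ratio from Python's difflib with an arbitrary threshold of 0.85; the label table, the function names and the threshold are all illustrative assumptions, not part of the method described in this paper.

```python
from difflib import SequenceMatcher

# Hypothetical vocabulary labels. A real run would load the labels of OM
# and of a domain vocabulary instead of this hand-made table.
LABELS = {
    "area": "quantity",
    "mass": "quantity",
    "hectare": "unit",
    "corn": "phenomenon",
    "urea": "phenomenon",
}


def normalize(term):
    """Lower-case a cell term and collapse surrounding/internal whitespace."""
    return " ".join(term.lower().split())


def best_match(term, labels=LABELS, threshold=0.85):
    """Return (label, kind, score) for the most similar label, or None.

    The threshold is an assumed cut-off, not a value taken from the paper.
    """
    cleaned = normalize(term)
    best = None
    for label, kind in labels.items():
        score = SequenceMatcher(None, cleaned, label).ratio()
        if best is None or score > best[2]:
            best = (label, kind, score)
    if best is not None and best[2] >= threshold:
        return best
    return None
```

A cell that holds only a number shares no characters with any label and therefore stays unannotated, which is the situation that limits lexical matching in tables dominated by numerical data.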
   Regarding the requirements for the content of spreadsheet ta-
bles, the terms representing units of measure should follow the           4.1    Results
international notation standards [19] (Figure 1). This implies that
                                                                             Deviant unit notations. More than half of the analyzed tables
these terms consist of short strings containing one or more symbols,
                                                                          contain unit cells (Table 2), but in almost one third the notation
and optional brackets and slashes [23]. The symbol(s) in the term
                                                                          of the unit terms and symbols is not according to the international
should lexically match with a unit symbol from the OM vocabulary,
                                                                          standards. In the majority of the cases the unit terms are customized




                                           Figure 1: Tables in the stylized example spreadsheet


by the scientists. The resulting unit terms are not incorrect per         Table 2: Presence of unit, quantity and phenomenon cells in
se, but rather unconventional, and automatic recognition of these         the analyzed tables
terms is hindered. Scientists often combine phenomena with unit
symbols, e.g., “MJ/1000 kg milk/yr” and “g CO2e/MJ” (Figures 2, 3).                Cell type                          Fraction of tables (%)
                                                                                                               Unit    Quantity Phenomenon
                                                                                   Present, complete info.      29         15              46
   Incomplete quantity notations. The majority of the analyzed ta-                 Present, incomplete info.    29         47              29
bles contain quantity cells (Table 2). In almost half of the analyzed              Not present                  42         38              25
tables, the quantity cells do not contain complete information, and
automatic recognition of the quantities is not straightforward. Some
tables contain cells with a phenomenon description, and an asso-          these semantic relations are often constructed by the spreadsheet
ciated unit of measure located in the neighboring cell (Figure 2).        developer. The developer groups instances according to common
Although no quantity concept is mentioned, these cells implicitly         properties, which may be clear to users or peer scientists, but not
represent quantities. In Table 2 we do not consider these cells as        easily recognized in a domain vocabulary.
quantity cells.
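The effect of customized unit notations on automatic recognition can be made concrete with a small sketch. It assumes that a unit term counts as standard when every token, after splitting on slashes, brackets and whitespace, is a known unit symbol. The symbol set below is a small illustrative subset rather than the content of OM, and the function names are hypothetical.

```python
import re

# Illustrative subset of unit symbols; the full set would come from OM.
UNIT_SYMBOLS = {"g", "kg", "ha", "MJ", "yr", "mol", "nmol", "L"}


def split_unit_term(term):
    """Split a unit term on slashes, brackets and whitespace."""
    return [tok for tok in re.split(r"[\s/()\[\]]+", term) if tok]


def unknown_parts(term, symbols=UNIT_SYMBOLS):
    """Return the tokens of a unit term that are not known unit symbols."""
    return [tok for tok in split_unit_term(term) if tok not in symbols]


def is_standard_unit(term, symbols=UNIT_SYMBOLS):
    """True when the term is non-empty and consists of known symbols only."""
    tokens = split_unit_term(term)
    return bool(tokens) and all(tok in symbols for tok in tokens)
```

Under these assumptions the customized terms “MJ/1000 kg milk/yr” and “g CO2e/MJ” are rejected, and `unknown_parts` isolates the embedded fragments (“1000”, “milk”, “CO2e”) that a domain vocabulary could subsequently try to recognize as phenomena.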
                                                                          5     COMPLEMENTARY HEURISTICS
   Unclear phenomenon notations. In the majority of the analyzed          We observe that most of the analyzed tables do not meet one or more
tables phenomenon cells are present (Table 2). However, part of           of the requirements listed in Section 3, and we expect that lexical
these tables contains cells that, judging from the position in the        matching will not yield many useful annotations. In this section we
table, probably represent phenomena, but do not contain full words.       therefore propose to complement lexical matching with heuristics
Instead, these cells contain numbers or codes representing, e.g.,         and knowledge from vocabularies to overcome the challenges of
dates, scientific experiments, identification numbers (Figures 2,3),      incomplete information.
or abbreviations which are either application specific, e.g., chemical
elements or geographical codes, or related to scientific experiments.     5.1    Recognizing and annotating blocks
   Other observations. In more than half of the analyzed tables the       Domain scientists typically group cells that are semantically related
float cells, containing the values of observations, and the string        [16] and use structure and layout features to distinguish between
cells, containing contextual information on these observations, are       these groups [3]. We assume that this grouping not only applies
not located in homogeneous blocks. In most of these tables, the           to phenomenon cells, but also to cells representing quantities and
float blocks are interrupted, either by empty cells (Figure 3), or,       units of measure.
less frequently, by qualitative, i.e., string, values. In a small part of the   We have developed heuristics that support us in the recognition
analyzed tables the string and float blocks are not aligned with each     of the type of cell, i.e., unit of measure, quantity or phenomenon,
other, i.e., these blocks do not have similar dimensions. In these        and the annotation of its content [6]. These heuristics combine
tables it is not clear which observations are associated with which       information on the notation of terms in cells, and the composition
context.                                                                  and positioning of blocks of cells in a table, e.g.:
   Cells representing semantically related phenomenon instances                • If a cell contains both a string term and a unit of measure, it is
are typically grouped in the same string block. We observe that                  a quantity cell




Figure 2: Examples of spreadsheet tables in which the quantity is present in the title (A), the units of measure are associated
with phenomena cells (A,B), phenomenon cells contain abbreviations or numbers (A,B), the units of measure are customized
(B). The color markup is applied in our analysis, and not part of the original table.




Figure 3: Examples of spreadsheet tables with interrupted float and string blocks (A), customized units of measure (A), and
phenomenon cells with codes (B). The color markup is applied in our analysis, and not part of the original table.


     • A block is considered a “Quantity” or a “Unit” block when at         as quantity cells by considering the presence and position of the
       least 30% of the cells are recognized as a quantity or unit cell     units of measure in the table.
     • A block is considered a “Quantity” block when it is vertically or
       horizontally aligned with the “Unit” block and the float block
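The first two heuristics in this list can be sketched as follows. The sketch assumes a small illustrative set of unit symbols and a simple tokenization on slashes, brackets and whitespace; the function names, the symbol set and the choice to test for a quantity block before a unit block are assumptions made for the example. The alignment heuristic is not covered.

```python
import re

# Illustrative subset of unit symbols; the full set would come from OM.
UNIT_SYMBOLS = {"g", "kg", "ha", "MJ", "mol", "nmol", "L"}


def classify_cell(text, symbols=UNIT_SYMBOLS):
    """Label a cell 'quantity', 'unit' or 'other' from its notation."""
    tokens = [tok for tok in re.split(r"[\s/()\[\]]+", text) if tok]
    units = [tok for tok in tokens if tok in symbols]
    words = [tok for tok in tokens if tok not in symbols]
    if units and words:
        return "quantity"  # a string term combined with a unit of measure
    if units:
        return "unit"
    return "other"


def classify_block(cells, threshold=0.30):
    """Label a block when at least 30% of its cells share a recognized type."""
    labels = [classify_cell(cell) for cell in cells]
    for kind in ("quantity", "unit"):  # the order of the checks is an assumption
        if labels and labels.count(kind) / len(labels) >= threshold:
            return kind
    return "other"
```

With this sketch a cell such as “BT (nmol/L)” is labeled a quantity cell even though it names no quantity concept, because it pairs a string term with a unit of measure.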
                                                                            5.2    Knowledge from vocabularies
   Although the quantity cells in Figure 2B and 3B contain no de-
scription of a quantity concept, these cells could still be recognized      The phenomenon cells in natural science tables may not contain
                                                                            explicit terms, but codes, numbers or abbreviations (Section 4.1),


that are commonly used by scientists to refer to domain specific           Automatic annotation is not possible. A small part of the analyzed
entities, e.g., chemical elements or human hormones (resp. Figure       tables display serious deviations in their basic structure (e.g., Figure
2A and 3B). Vocabularies with additional information at the in-         3A). In these tables the blocks with numerical data on observations
stance level could be used to annotate the content of these cells,      either are accompanied by only one block of contextual information,
and to recognize these as phenomenon cells. Furthermore, these          or from the table structure it is not clear which string and float cells
vocabularies may also facilitate recognition of these phenomenon        are related to each other.
cells, by distinguishing the codes and abbreviations from quantities       Although a good basic structure is not a requirement for success-
and units of measure.                                                   ful lexical matching, it is a prerequisite for our heuristics. Without
   Many quantity cells in the analyzed tables do not contain a          the presence and alignment of string and float blocks, our block
concept description, but do have an associated unit of measure          heuristics (Section 5.1) cannot be used to recognize the units of
(Section 4.1). The missing quantity concept may be obtained by          measure, quantities and phenomena in these tables. Consequently,
using the following heuristic:                                          without knowledge of the types of cells, deduction of information
     • Annotation concepts for quantity cells can be deduced from the   from the table context (Section 5.3) and deriving additional knowl-
        included unit term                                              edge from vocabularies (Section 5.2) is not possible. The annotation
Some units are commonly associated with certain quantities [23],        of these tables would thus be based solely on lexical matching. As
and this type of information is included in, for example, the OM        these tables do not contain complete and explicit information on
vocabulary. For example, the missing quantity concept in a cell         the underlying research, we expect that automatic annotation of
containing the term “BT (nmol/L)” (Figure 3B) is probably “Molar        these tables is not possible. Moreover, the majority of these tables
Concentration”, as the unit of measure “mol/L” is commonly as-          would probably be hard to interpret for human readers as well.
sociated with this quantity. However, this requires both a correct
notation and interpretation of the associated units of measure, and        Automatic annotation is difficult. Several of the analyzed tables
that the information on the ‘common association’ is indeed present      have a good basic structure, but are missing entities.
in OM.                                                                     In some of these tables the units of measure are missing, which
   Besides, domain scientists often customize the units of measure      hinders the recognition and annotation of quantities. The recogni-
in their spreadsheets by combining unit symbols with phenomenon         tion of quantity cells in a table may be improved by block heuristics
terms. Recognizing these phenomena using a domain vocabulary,           (Section 5.1). For the annotation of these quantities it is, however,
and subsequently removing these from the unit terms would prob-         not possible to derive additional knowledge from vocabularies.
ably result in better recognition and interpretation of the units of       In other tables the quantities are only implicitly represented (e.g.,
measure in a table.                                                     Figure 2A). Block heuristics may facilitate recognizing which of
                                                                        the contextual blocks in the table serves as a quantity block. The
                                                                        annotation of the quantities in these tables is difficult, as these are
5.3    Deduction from table context                                     technically not present, but may be deduced from the associated
Empty cells in tables are often left empty on purpose, as data on       units of measure.
either the observation value or its context is missing. In many cases
these empty cells are surrounded by non-empty neighbouring cells,          Reconstruction is possible. The majority of the analyzed tables
which could be used to deduce the missing information:                  have a good basic structure, and consist of unit, quantity, and phe-
     • The content type of an empty cell is the same as that of the     nomenon cells, but do not contain complete and explicit information
        neighbouring cells                                              on these entities. As we expect all of our heuristics to be applicable
   Annotation of a whole group of phenomena is often difficult, as      in this type of table, the missing information may be comple-
the semantic relation between the grouped phenomenon instances          mented and successful recognition and annotation of the entities in
cannot be recognized in a domain vocabulary. As suggested by            these tables is possible.
[1, 3, 10], the following heuristic may be used:
     • A group of phenomenon instances may have a common denom-         7    DISCUSSION AND CONCLUSION
        inator cell that is present above or to the left of the phenomenon   In this study we show that it is plausible to automatically anno-
        block                                                           tate natural science spreadsheets, even if the tables do not contain
The term in this common denominator cell may be annotated and           complete and explicit information on the corresponding research
provide the concept of the phenomenon class.                            project.
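The deduction heuristics of Sections 5.2 and 5.3 can be sketched under a few stated assumptions. The unit-to-quantity table holds only the “mol/L” example from the text, the prefix handling is a deliberately crude illustration, and a table is represented as a list of rows of strings. All names, as well as the “Crop”/“wheat” cell contents in the checks, are made up for the example.

```python
import re

# Hypothetical unit-to-quantity table; the text suggests that OM may
# contain such common associations.
COMMON_QUANTITY = {"mol/L": "Molar Concentration"}
PREFIXES = ("n", "m", "k")  # illustrative and incomplete


def quantity_from_unit(cell_text):
    """Deduce a quantity concept from a bracketed unit term, if known."""
    match = re.search(r"\(([^)]+)\)", cell_text)
    if not match:
        return None
    unit = match.group(1).strip()
    if unit in COMMON_QUANTITY:
        return COMMON_QUANTITY[unit]
    # Retry without a leading prefix, e.g. "nmol/L" -> "mol/L".
    if unit[:1] in PREFIXES and unit[1:] in COMMON_QUANTITY:
        return COMMON_QUANTITY[unit[1:]]
    return None


def fill_empty_types(types):
    """Give an empty cell (None) the type shared by its direct neighbours."""
    filled = list(types)
    for i, cell_type in enumerate(types):
        if cell_type is None:
            before = types[i - 1] if i > 0 else None
            after = types[i + 1] if i + 1 < len(types) else None
            neighbours = {n for n in (before, after) if n is not None}
            if len(neighbours) == 1:  # only fill when neighbours agree
                filled[i] = neighbours.pop()
    return filled


def common_denominator(grid, top, left):
    """Return the cell above, else to the left of, a block's first cell."""
    if top > 0 and grid[top - 1][left]:
        return grid[top - 1][left]
    if left > 0 and grid[top][left - 1]:
        return grid[top][left - 1]
    return None
```

An empty cell is left untyped when its neighbours disagree, since the heuristic as stated gives no rule for that case.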
                                                                           The quality and the level of detail of the annotations will, of
6     PLAUSIBILITY OF AUTOMATIC                                         course, still depend on the completeness and accuracy of the
                                                                        content of the tables. However, even if the quality of the content
      ANNOTATION                                                        is not sufficient to automatically annotate terms on an individual
In this section we investigate the applicability of our heuristics      level, the block heuristics may be used to recognize the quantities,
on the set of tables analyzed in our survey. Given that for the         phenomena and units of measure blocks, thereby providing a basic
majority of these tables automatic annotation solely based on lexical   understanding of the table. What is more, we think that block
matching would not be successful, the applicability of the heuristics   heuristics are useful in all spreadsheet tables, as these heuristics
gives us an indication of the extent to which automatic annotation may  provide insight into how the cells in a table are related, and as such
still be possible. We distinguish three levels of plausibility:         facilitate interpretation.
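The three levels of plausibility distinguished in this section can be summarized as a simple decision rule. The sketch assumes that each table has already been inspected for its basic structure and for the entity types it contains; the function name and its two inputs are hypothetical.

```python
REQUIRED_ENTITIES = {"unit", "quantity", "phenomenon"}


def plausibility(good_basic_structure, present_entities):
    """Map a table summary onto one of the three levels of Section 6.

    good_basic_structure: string and float blocks are present and aligned.
    present_entities: set of entity types found in the table, possibly
    with incomplete information.
    """
    if not good_basic_structure:
        return "not possible"
    if REQUIRED_ENTITIES - set(present_entities):
        return "difficult"  # good structure, but entities are missing
    return "reconstruction possible"
```

The rule checks structure first because the block heuristics, on which the other heuristics build, depend on the presence and alignment of string and float blocks.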


   The number of existing research spreadsheets, especially in the informal area, is large. Our proposed approach could be used to facilitate search, reuse and integration of these spreadsheet data, by analyzing the information in annotations. Furthermore, the observations and heuristics from this study can be used as guidelines or in support tools for domain scientists to design new spreadsheet tables that are easier to interpret by both humans and machines. However, we do not believe that the common practice of spreadsheet development by domain scientists is easily changed. We expect that domain scientists will keep using spreadsheets, as these structures provide an easy and accessible way to store and manipulate research data according to their preferences. Therefore we expect that the automatic annotation of existing natural science spreadsheets will remain an issue. Our proposed method provides part of the solution to handle this issue.

ACKNOWLEDGMENTS
This publication was supported by the Dutch national program COMMIT.

REFERENCES
[1] Robin Abraham and Martin Erwig. 2006. Inferring Templates from Spreadsheets. In Proceedings of the 28th international conference on Software engineering. ACM, 182–191.
[2] Michael J Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008. WebTables: Exploring the Power of Tables on the Web. Proceedings of the VLDB Endowment 1, 1 (2008), 538–549. https://doi.org/10.14778/1453856.1453916
[3] Zhe Chen and Michael J Cafarella. 2013. Automatic web spreadsheet data extraction. In Proceedings of the 3rd International Workshop on Semantic Search Over the Web - SS@ ’13. 1–8. https://doi.org/10.1145/2509908.2509909
[4] Martin J O Connor, Christian Halaschek-wiener, and Mark A Musen. 2010. Mapping Master: a Flexible Approach for Mapping Spreadsheets to OWL. In The Semantic Web–ISWC. Springer Berlin Heidelberg, 194–208.
[5] Martine G De Vos, Willem Robert Van Hage, Jan Ros, and Guus Schreiber. 2012. Reconstructing Semantics of Scientific Models: a Case Study. In Proceedings of the OEDW workshop on Ontology engineering in a data driven world, EKAW 2012. Galway, Ireland.
[6] Martine G De Vos, Jan Wielemaker, Hajo Rijgersberg, Guus Schreiber, Bob Wielinga, and Jan Top. 2017. Combining Information on Structure and Content to Automatically Annotate Natural Science Spreadsheets. International Journal of Human-Computer Studies (in press), 0 (2017).
[7] Martine G De Vos, Jan Wielemaker, Bob Wielinga, Guus Schreiber, and Jan Top. 2015. A methodology for constructing the calculation model of scientific spreadsheets. In Proceedings of the 8th International Conference on Knowledge Capture.
[8] Andres Garcia-silva, Asuncion Gomez-perez, Mari Carmen Suarez-figueroa, and Boris Villazon-terrazas. 2008. A Pattern Based Approach for Re-engineering Non-Ontological Resources into Ontologies. In The Semantic Web. Number 2. Springer Berlin Heidelberg, 167–181.
[9] Lushan Han, Tim Finin, Cynthia Parr, Joel Sachs, and Anupam Joshi. 2008. RDF123: From Spreadsheets to RDF. In The Semantic Web-ISWC 2008. Springer Berlin Heidelberg, 451–466.
[10] Felienne Hermans, Martin Pinzger, and Arie Van Deursen. 2010. Automatically Extracting Class Diagrams from Spreadsheets. In 24th European Conference on Object-Oriented Programming (ECOOP), Lecture Notes in Computer Science. Springer-Verlag, 52–75.
[11] Andreas Langegger and W Wolfram. 2009. XLWrap – Querying and Integrating Arbitrary Spreadsheets with SPARQL. In International Semantic Web Conference. 359–374.
[12] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and searching web tables using entities, types and relationships. In Proceedings of the VLDB Endowment, Vol. 3. 1338–1347. https://doi.org/10.14778/1920841.1921005
[13] Eamonn Maguire, Alejandra González-Beltrán, Patricia L. Whetzel, Susanna Assunta Sansone, and Philippe Rocca-Serra. 2013. OntoMaton: A Bioportal powered ontology widget for Google Spreadsheets. Bioinformatics 29, 4 (2013), 525–527. https://doi.org/10.1093/bioinformatics/bts718
[14] Daniel McDonald, Jose C Clemente, Justin Kuczynski, Jai Rideout, Jesse Stombaugh, Doug Wendel, Andreas Wilke, Susan Huse, John Hufnagle, Folker Meyer, Rob Knight, and J Caporaso. 2012. The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. GigaScience 1, 1 (2012), 7. https://doi.org/10.1186/2047-217X-1-7
[15] Albert Meroño-Peñuela, Ashkan Ashkpour, Laurens Rietveld, Rinke Hoekstra, and Stefan Schlobach. 2013. Linked Humanities Data: The Next Frontier? A Case-study in Historical Census Data. In The Semantic Web: Semantics and Big Data. Springer Berlin Heidelberg, 645–649.
[16] Roland T. Mittermeir and Markus Clermont. 2002. Finding High-Level Structures in Spreadsheet Programs. In Proceedings of the 9th Working Conference on Reverse Engineering. Richmond, VA, USA, 221–232.
[17] Varish Mulwad, Tim Finin, and Anupam Joshi. 2012. A Domain Independent Framework for Extracting Linked Semantic Data from Tables. In Search Computing. Springer Berlin Heidelberg, 16–33.
[18] Tim F Rayner, Philippe Rocca-Serra, Paul T Spellman, Helen C Causton, Anna Farne, Ele Holloway, Rafael A Irizarry, Junmin Liu, Donald S Maier, Michael Miller, Kjell Petersen, John Quackenbush, Gavin Sherlock, Christian J Stoeckert, Joseph White, Patricia L. Whetzel, Farrell Wymore, Helen Parkinson, Ugis Sarkans, Catherine A Ball, and Alvis Brazma. 2006. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC bioinformatics 7 (2006), 489. https://doi.org/10.1186/1471-2105-7-489
[19] Hajo Rijgersberg, M. Wigham, and Jan Top. 2011. How semantics can improve engineering processes: A case of units of measure and quantities. Advanced Engineering Informatics 25, 2 (apr 2011), 276–287. https://doi.org/10.1016/j.aei.2010.07.008
[20] Ivelize Rocha Bernardo, Matheus S Mota, and André Santanchè. 2013. Extracting and Semantically Integrating Implicit Schemas from Multiple Spreadsheets of Biology based on the Recognition of their Nature. Journal of Information and Database Management 4, 2 (2013), 104–113.
[21] Susanna-Assunta Sansone, Philippe Rocca-Serra, Dawn Field, Eamonn Maguire, Chris Taylor, Oliver Hofmann, Hong Fang, Steffen Neumann, Weida Tong, Linda Amaral-Zettler, Kimberly Begley, Tim Booth, Lydie Bougueleret, Gully Burns, Brad Chapman, Tim Clark, Lee-Ann Coleman, Jay Copeland, Sudeshna Das, Antoine de Daruvar, Paula de Matos, Ian Dix, Scott Edmunds, Chris T Evelo, Mark J Forster, Pascale Gaudet, Jack Gilbert, Carole Goble, Julian L Griffin, Daniel Jacob, Jos Kleinjans, Lee Harland, Kenneth Haug, Henning Hermjakob, Shannan J Ho Sui, Alain Laederach, Shaoguang Liang, Stephen Marshall, Annette McGrath, Emily Merrill, Dorothy Reilly, Magali Roux, Caroline E Shamu, Catherine A Shang, Christoph Steinbeck, Anne Trefethen, Bryn Williams-Jones, Katherine Wolstencroft, Ioannis Xenarios, and Winston Hide. 2012. Toward interoperable bioscience data. Nature genetics 44, 2 (2012), 121–6. https://doi.org/10.1038/ng.1054
[22] Yanfeng Shu, David Ratcliffe, Michael Compton, Geoffrey Squire, and Kerry Taylor. 2015. A semantic approach to data translation: A case study of environmental observations data. Knowledge-Based Systems 75 (2015), 104–123. https://doi.org/10.1016/j.knosys.2014.11.023
[23] Mark Van Assem, Hajo Rijgersberg, M. Wigham, and Jan Top. 2010. Converting and Annotating Quantitative Data. In ISWC2010, P.F. Patel-Schneider (Ed.). 16–31.
[24] Petros Venetis, Alon Halevy, and J Madhavan. 2011. Recovering semantics of tables on the web. In Proceedings of the VLDB Endowment, Vol. 4. 528–538. https://doi.org/10.14778/2002938.2002939
[25] Katy Wolstencroft, Stuart Owen, Matthew Horridge, Olga Krebs, Wolfgang Mueller, Jacky L Snoep, Franco du Preez, and Carole Goble. 2011. RightField: embedding ontology annotation in spreadsheets. Bioinformatics (Oxford, England) 27, 14 (jul 2011), 2021–2. https://doi.org/10.1093/bioinformatics/btr312