=Paper=
{{Paper
|id=Vol-2065/paper06
|storemode=property
|title=How Plausible is Automatic Annotation of Scientific Spreadsheets?
|pdfUrl=https://ceur-ws.org/Vol-2065/paper06.pdf
|volume=Vol-2065
|authors=Martine de Vos,Jan Wielemaker,Bob Wielinga,Guus Schreiber,Jan Top
|dblpUrl=https://dblp.org/rec/conf/kcap/VosWWST17
}}
==How Plausible is Automatic Annotation of Scientific Spreadsheets?==
* Martine de Vos∗, Computer science, Network Institute, Vrije Universiteit Amsterdam, martine.de.vos@vu.nl
* Jan Wielemaker, Computer science, Network Institute, Vrije Universiteit Amsterdam, j.wielemaker@vu.nl
* Bob Wielinga†, Computer science, Network Institute, Vrije Universiteit Amsterdam
* Guus Schreiber, Computer science, Network Institute, Vrije Universiteit Amsterdam, guus.schreiber@vu.nl
* Jan Top†, Food and Biobased Research, Wageningen University and Research Centre, jan.top@wur.nl

∗ Corresponding author

† Second affiliation: Computer science, Network Institute, Vrije Universiteit Amsterdam

K-CAP2017 Workshops and Tutorials Proceedings, © Copyright held by the owner/author(s).

===ABSTRACT===
It is possible to automatically annotate a natural science spreadsheet using lexical matching, given that the tables in these spreadsheets meet a number of requirements regarding the content. Results of a survey show that most of the existing natural science spreadsheets deviate from the ideal situation. We propose to complement lexical matching with both heuristics and knowledge from external vocabularies to overcome these deviations.

===CCS CONCEPTS===
* Computing methodologies → Model development and analysis;
* Applied computing → Physical sciences and engineering;

===KEYWORDS===
Spreadsheets, Knowledge engineering, Domain Knowledge, Heuristics, Vocabularies

===ACM Reference Format===
Martine de Vos, Jan Wielemaker, Bob Wielinga †, Guus Schreiber, and Jan Top. 2018. How plausible is automatic annotation of scientific spreadsheets?. In . ACM, New York, NY, USA, 6 pages.

===1 INTRODUCTION===
In this paper we investigate the feasibility of automatically annotating existing natural science spreadsheets.

Scientists in the domain of natural science, hereafter referred to as domain scientists, frequently use spreadsheets to analyze and manipulate their research data [13, 18, 25]. The format of spreadsheets gives them a great deal of freedom in how they enter their data. Domain scientists can make their own choices with respect to the entities and processes to be included, and the way in which these are organized in tables. In this way their domain model is implicitly reflected in the content and structure of the spreadsheet tables. As researchers do not anticipate the reuse of their spreadsheet data [23], they tend to be sloppy in the specification of the semantics of their data, and the free format allows them to do so [19]. However, as the domain model is essential to understand the meaning and context of spreadsheet data, it is currently hard for people other than the original developers to interpret these data unambiguously.

The content of the tables can be annotated with concepts from external vocabularies to facilitate interpretation, evaluation and reuse of spreadsheet data. This annotation can be performed automatically by using lexical matching, i.e., by assessing the lexical similarity between spreadsheet terms and labels from selected vocabularies. This method requires, ironically, that the spreadsheet cells contain explicit and complete information on the corresponding research. Furthermore, as natural science spreadsheets often represent observational data, the tables typically contain mostly numbers and little text. The amount of data that can be annotated in natural science spreadsheets is thus limited.

The goal of this paper is to evaluate to what extent automatic annotation of the table content is possible for existing natural science spreadsheets. In Section 3 we explain the requirements for the content of natural science spreadsheets that enable automatic annotation. In Section 4 we present an analysis of the nature and frequency of common design characteristics in existing natural science spreadsheets, and discuss how these deviate from the ideal situation. We propose to repair these deviations by complementing the lexical matching method with heuristics and rules (Section 5) that have been developed in earlier work [6]. Finally, we discuss how these heuristics could be applied to the set of analyzed tables, and to what extent automatic annotation would be possible (Section 6).

===2 RELATED WORK===
Many studies focus on improving the interpretation of spreadsheet data to facilitate reuse and integration. In order to derive a correct interpretation, the studies use different strategies to infer the semantics from spreadsheet data and resolve ambiguities. We observe two main types of strategies.

One strategy is to encourage domain scientists to standardize their data to facilitate interpretation and reuse. Semantic markup tools like RightField [25] and OntoMaton [13] may be used to develop templates for domain scientists to enter and annotate their data simultaneously. MAGE-Tab [18], ISA-Tab [21], and BIOM [14] are tabular formats that use an underlying data model with relevant metadata from scientific experiments. These formats can be used either to enter data directly, or as a template for mapping other spreadsheet files onto one structure.

{| class="wikitable"
|+ Table 1: Sources of tables inspected to perform the analysis on design characteristics in natural science spreadsheets
! !! Online supplementary data !! From institutes
|-
| Number of research projects || 12 || 8
|-
| Number of spreadsheets || 40 || 44
|-
| Number of tables || 128 || 233
|}

Another approach is to annotate tabular data with concepts from vocabularies. Some tools [12, 17] use existing generic ontologies, like Yago and DBPedia, while other tools [13, 25] use existing domain ontologies for semantic markup.
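As a rough illustration of what annotating a spreadsheet term with a vocabulary concept through lexical matching (see the Introduction) might involve, the sketch below scores a cell term against vocabulary labels by string similarity. It is a minimal sketch, not the method of any cited tool. The mini-vocabulary, the concept identifiers, the use of Python's <code>difflib.SequenceMatcher</code>, and the 0.85 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical mini-vocabulary: label -> concept identifier.
# These identifiers are placeholders, not real OM or domain-vocabulary IRIs.
VOCABULARY = {
    "area": "quantity:Area",
    "mass": "quantity:Mass",
    "hectare": "unit:hectare",
    "corn": "domain:Corn",
    "urea": "domain:Urea",
}


def normalize(term):
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def lexical_match(term, vocabulary=VOCABULARY, threshold=0.85):
    """Return (concept, score) for the most similar label, or None.

    The similarity measure and the threshold are assumed choices.
    """
    needle = normalize(term)
    if not needle:
        return None
    best_label, best_score = None, 0.0
    for label in vocabulary:
        score = SequenceMatcher(None, needle, normalize(label)).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None
    return vocabulary[best_label], best_score
```

A cell containing "Mass" would match the label "mass" exactly, whereas a short code such as "BT" falls below the threshold and stays unannotated. This is the kind of cell that the paper argues pure lexical matching cannot handle.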
Some approaches develop their own ontology, either manually [22] or by extracting concepts and relations from the web [24], to annotate tabular data.

All the abovementioned studies acknowledge that a correct interpretation of tabular data is essential for conversion or annotation. In order to derive a correct interpretation, the studies use different approaches to infer the semantics from tabular data and resolve ambiguities. Some of these approaches rely on manual mapping specifications constructed by users [22] or human analysts with sufficient knowledge of applying semantic web techniques [4, 9, 11, 15, 25]. Others compare their tabular data with large collections of example data, e.g., large vocabularies like Yago or DBPedia, or generic databases extracted from the Web, and rely on probabilistic reasoning methods to find the most suitable annotation or interpretation for table cells and columns [2, 12, 17, 24]. Many studies also use knowledge on the structural properties of a table to derive a correct interpretation of its content. Several studies created a library of commonly used layout patterns in tabular data [8, 10, 20]. Abraham and Erwig [1] developed a framework to automatically classify roles of cells in a table based on the spatial layout of a spreadsheet. Van Assem and colleagues [23] introduced disambiguation strategies for units of measure and quantities based on the way these are notated in table cells. Chen and Cafarella [3] use heuristics and rules on spreadsheet layout and implicit metadata structure to automatically extract relational data from spreadsheets.

===3 REQUIREMENTS FOR SPREADSHEET TABLES===
Spreadsheets from the domain of natural science, e.g., biology, physics and medical science, often represent laboratory or field observations. The tables in these spreadsheets therefore typically consist of numerical data, quantities and units of measure [23], and information on the associated phenomena, i.e., objects, events and substances.

Annotation of these spreadsheets with vocabulary concepts sets requirements for both the vocabularies and the content of the spreadsheet tables. The selected vocabularies should contain labeled concepts, and comprise at least one vocabulary that covers the domain of the considered spreadsheets, and a dedicated vocabulary on quantities and units. In our research we use the OM Ontology for units of Measure and related concepts [19].

Regarding the requirements for the content of spreadsheet tables, the terms representing units of measure should follow the international notation standards [19] (Figure 1). This implies that these terms consist of short strings containing one or more symbols, and optional brackets and slashes [23]. The symbol(s) in the term should lexically match with a unit symbol from the OM vocabulary, e.g., “ha” representing the unit “hectare”. Cells that contain information on quantities should contain a description of a quantity concept that can be lexically matched with a concept from the OM vocabulary, e.g., “area” or “mass”. Furthermore, quantity cells should have an associated unit of measure that is located either in the same or in a neighboring cell. Phenomenon cells, i.e., cells containing phenomenon instances, should contain terms that cannot be confused with quantities or units, i.e., these terms should preferably not consist of very short strings, symbols, or abbreviations. The phenomena in tables are annotated with domain concepts, e.g., "corn" and "urea".

Figure 1: Tables in the stylized example spreadsheet

===4 ANALYSIS OF TABLE DESIGN===
====Data set====
We conduct an analysis on a set of existing natural science spreadsheet tables, in order to gain knowledge on common practice of table design, and to find out in what ways the content of these tables may deviate from the ideal situation as described in the previous section. To this end, we analyse a total of 361 tables in 84 spreadsheets that are used in 20 existing research projects in the domain of natural science (Table 1). All spreadsheets fall within the scope of our research, i.e., natural science spreadsheets that consist of numerical data, quantities and units of measure, and information on the associated objects and events. About half of the inspected tables were used in our earlier work on interpretation and annotation of spreadsheets [5, 6], or formulas [7]. We collect additional spreadsheets from colleagues at Wageningen University and Research, and we use the Google Scholar web search engine to find spreadsheets that are published online as supplementary data alongside journal papers.

====Data analysis====
Our analysis consists of a manual inspection of all spreadsheet tables, in which we only consider the content of the tables, and ignore the title or comments. We color code the cells in each block based on the content (see, for example, the legend in Figure 2), and analyze the syntax of the formulas and their composition in the table. Subsequently, we determine for each table to what extent it meets our requirements described in Section 3. We explain our results in terms of deviations from these requirements, and discuss some additional observations we made on the structure of the analyzed tables.

====4.1 Results====
{| class="wikitable"
|+ Table 2: Presence of unit, quantity and phenomenon cells in the analyzed tables
! Fraction of tables (%), by cell type !! Unit !! Quantity !! Phenomenon
|-
| Present, complete info. || 29 || 15 || 46
|-
| Present, incomplete info. || 29 || 47 || 29
|-
| Not present || 42 || 38 || 25
|}

Deviant unit notations. More than half of the analyzed tables contain unit cells (Table 2), but in almost one third of the tables the notation of the unit terms and symbols does not follow the international standards. In the majority of the cases the unit terms are customized by the scientists. The resulting unit terms are not incorrect per se, but rather unconventional, and automatic recognition of these terms is hindered. Scientists often combine phenomena with unit symbols, e.g., “MJ/1000 kg milk/yr” and “g CO2e/MJ” (Figures 2, 3).

Incomplete quantity notations. The majority of the analyzed tables contain quantity cells (Table 2). In almost half of the analyzed tables, the quantity cells do not contain complete information, and automatic recognition of the quantities is not straightforward. Some tables contain cells with a phenomenon description, and an associated unit of measure located in the neighboring cell (Figure 2). Although no quantity concept is mentioned, these cells implicitly represent quantities. In Table 2 we do not consider these cells as quantity cells.

Unclear phenomenon notations. In the majority of the analyzed tables phenomenon cells are present (Table 2). However, part of these tables contain cells that, judging from their position in the table, probably represent phenomena, but do not contain full words. Instead, these cells contain numbers or codes representing, e.g., dates, scientific experiments, identification numbers (Figures 2, 3), or abbreviations which are either application specific, e.g., chemical elements or geographical codes, or related to scientific experiments.

Other observations. In more than half of the analyzed tables the float cells, containing the values of observations, and the string cells, containing contextual information on these observations, are not located in homogeneous blocks. In most of these tables, the float blocks are interrupted, either by empty cells (Figure 3) or, less frequently, by qualitative, i.e., string values. In a small part of the analyzed tables the string and float blocks are not aligned with each other, i.e., these blocks do not have similar dimensions. In these tables it is not clear which observations are associated with which context.

Cells representing semantically related phenomenon instances are typically grouped in the same string block. We observe that these semantic relations are often constructed by the spreadsheet developer. The developer groups instances according to common properties, which may be clear to users or peer scientists, but are not easily recognized in a domain vocabulary.

Figure 2: Examples of spreadsheet tables in which the quantity is present in the title (A), the units of measure are associated with phenomena cells (A, B), phenomenon cells contain abbreviations or numbers (A, B), the units of measure are customized (B). The color markup is applied in our analysis, and not part of the original table.

===5 COMPLEMENTARY HEURISTICS===
We observe that most of the analyzed tables do not meet one or more of the requirements listed in Section 3, and we expect that lexical matching will not yield many useful annotations. In this section we therefore propose to complement lexical matching with heuristics and knowledge from vocabularies to overcome the challenges of incomplete information.

====5.1 Recognizing and annotating blocks====
Domain scientists typically group cells that are semantically related [16] and use structure and layout features to distinguish between these groups [3]. We assume that this grouping not only applies to phenomenon cells, but also to cells representing quantities and units of measure.

We have developed heuristics that support us in the recognition of the type of cell, i.e., unit of measure, quantity or phenomenon, and the annotation of its content [6]. These heuristics combine information on the notation of terms in cells, and the composition and positioning of blocks of cells in a table, e.g.:
* If a cell contains both a string term and a unit of measure, it is a quantity cell
* A block is considered a “Quantity” or a “Unit” block when at least 30% of the cells is recognized as a quantity or unit cell
* A block is considered a “Quantity” block when it is vertically or horizontally aligned with the “Unit” block and the float block

Although the quantity cells in Figure 2B and 3B contain no description of a quantity concept, these cells could still be recognized as quantity cells by considering the presence and position of the units of measure in the table.

Figure 3: Example of spreadsheet tables with interrupted float and string blocks (A), customized units of measure (A), and phenomenon cells with codes (B). The color markup is applied in our analysis, and not part of the original table.

====5.2 Knowledge from vocabularies====
The phenomenon cells in natural science tables may not contain explicit terms, but codes, numbers or abbreviations (Section 4.1) that are commonly used by scientists to refer to domain specific entities, e.g., chemical elements or human hormones (resp. Figure 2A and 3B). Vocabularies with additional information at the instance level could be used to annotate the content of these cells, and to recognize these as phenomenon cells. Furthermore, these vocabularies may also facilitate recognition of these phenomenon cells, by distinguishing the codes and abbreviations from quantities and units of measure.

Many quantity cells in the analyzed tables do not contain a concept description, but do have an associated unit of measure (Section 4.1). The missing quantity concept may be obtained by using the following heuristic:
* Annotation concepts for quantity cells can be deduced from the included unit term
Some units are commonly associated with certain quantities [23], and this type of information is included in, for example, the OM vocabulary. For example, the missing quantity concept in a cell containing the term “BT (nmol/L)” (Figure 3B) is probably “Molar Concentration”, as the unit of measure “mol/L” is commonly associated with this quantity. However, this requires both a correct notation and interpretation of the associated units of measure, and that the information on the ‘common association’ is indeed present in OM.

In addition, domain scientists often customize the units of measure in their spreadsheets by combining unit symbols with phenomenon terms. Recognizing these phenomena using a domain vocabulary, and subsequently removing them from the unit terms, would probably result in better recognition and interpretation of the units of measure in a table.

====5.3 Deduction from table context====
Empty cells in tables are often left empty on purpose, as data on either the observation value or its context is missing. In many cases these empty cells are surrounded by non-empty neighbouring cells, which could be used to deduce the missing information:
* The content type of an empty cell is the same as that of the neighbouring cells
Annotation of a whole group of phenomena is often difficult, as the semantic relation between the grouped phenomenon instances cannot be recognized in a domain vocabulary. As suggested by [1, 3, 10], the following heuristic may be used:
* A group of phenomenon instances may have a common denominator cell that is present above or to the left of the phenomenon block
The term in this common denominator cell may be annotated and provide the concept of the phenomenon class.

===6 PLAUSIBILITY OF AUTOMATIC ANNOTATION===
In this section we investigate the applicability of our heuristics to the set of tables analyzed in our survey. Given that for the majority of these tables automatic annotation based solely on lexical matching would not be successful, the applicability of the heuristics gives us an indication of the extent to which automatic annotation may still be possible. We distinguish three levels of plausibility:

Automatic annotation is not possible. A small part of the analyzed tables display serious deviations in their basic structure (e.g., Figure 3A). In these tables the blocks with numerical data on observations are either accompanied by only one block of contextual information, or the table structure does not make clear which string and float cells are related to each other.

Although a good basic structure is not a requirement for successful lexical matching, it is a prerequisite for our heuristics. Without the presence and alignment of string and float blocks, our block heuristics (Section 5.1) cannot be used to recognize the units of measure, quantities and phenomena in these tables. Consequently, without knowledge of the types of cells, deduction of information from the table context (Section 5.3) and deriving additional knowledge from vocabularies (Section 5.2) are not possible. The annotation of these tables would thus be based solely on lexical matching. As these tables do not contain complete and explicit information on the underlying research, we expect that automatic annotation of these tables is not possible. Incidentally, the majority of these tables would probably be hard to interpret for human readers as well.

Automatic annotation is difficult. Several of the analyzed tables have a good basic structure, but are missing entities.

In some of these tables the units of measure are missing, which hinders the recognition and annotation of quantities. The recognition of quantity cells in a table may be improved by block heuristics (Section 5.1). For the annotation of these quantities it is, however, not possible to derive additional knowledge from vocabularies.

In other tables the quantities are only implicitly represented (e.g., Figure 2A). Block heuristics may facilitate recognizing which of the contextual blocks in the table serves as a quantity block. The annotation of the quantities in these tables is difficult, as these are technically not present, but they may be deduced from the associated units of measure.

Reconstruction is possible. The majority of the analyzed tables have a good basic structure, and consist of unit, quantity, and phenomenon cells, but do not contain complete and explicit information on these entities. As we expect all of our heuristics to be applicable in this type of table, the missing information may be complemented, and successful recognition and annotation of the entities in these tables is possible.

===7 DISCUSSION AND CONCLUSION===
In this study we show that it is plausible to automatically annotate natural science spreadsheets, even if the tables do not contain complete and explicit information on the corresponding research project. The quality and the level of detail of the annotations will, of course, still depend on the completeness and accuracy of the content of the tables. However, even if the quality of the content is not sufficient to automatically annotate terms on an individual level, the block heuristics may be used to recognize the quantities, phenomena and units of measure blocks, thereby providing a basic understanding of the table. What is more, we think that block heuristics are useful in all spreadsheet tables, as these heuristics provide insight into how the cells in a table are related, and as such facilitate interpretation.

The number of existing research spreadsheets, especially in the informal area, is large. Our proposed approach could be used to facilitate search, reuse and integration of these spreadsheet data, by analyzing the information in annotations. Furthermore, the observations and heuristics from this study can be used as guidelines or in support tools for domain scientists to design new spreadsheet tables that are easier to interpret by both humans and machines. However, we do not believe that the common practice of spreadsheet development by domain scientists is easily changed. We expect that domain scientists will keep using spreadsheets, as these structures provide an easy and accessible way to store and manipulate research data according to their preferences. Therefore we expect that the automatic annotation of existing natural science spreadsheets will remain an issue. Our proposed method provides part of the solution to handle this issue.

===ACKNOWLEDGMENTS===
This publication was supported by Dutch national program COMMIT.

===REFERENCES===
[1] Robin Abraham and Martin Erwig. 2006. Inferring Templates from Spreadsheets. In Proceedings of the 28th international conference on Software engineering. ACM, 182–191.

[2] Michael J Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. 2008. WebTables: Exploring the Power of Tables on the Web. Proceedings of the VLDB Endowment 1, 1 (2008), 538–549. https://doi.org/10.14778/1453856.1453916

[3] Zhe Chen and Michael J Cafarella. 2013. Automatic web spreadsheet data extraction. In Proceedings of the 3rd International Workshop on Semantic Search Over the Web - SS@ ’13. 1–8. https://doi.org/10.1145/2509908.2509909

[4] Martin J O Connor, Christian Halaschek-wiener, and Mark A Musen. 2010. Mapping Master : a Flexible Approach for Mapping Spreadsheets to OWL. In The Semantic Web–ISWC. Springer Berlin Heidelberg, 194–208.

[5] Martine G De Vos, Willem Robert Van Hage, Jan Ros, and Guus Schreiber. 2012. Reconstructing Semantics of Scientific Models : a Case Study. In Proceedings of the OEDW workshop on Ontology engineering in a data driven world, EKAW 2012. Galway, Ireland.

[6] Martine G De Vos, Jan Wielemaker, Hajo Rijgersberg, Guus Schreiber, Bob Wielinga, and Jan Top. 2017. Combining Information on Structure and Content to Automatically Annotate Natural Science Spreadsheets. International Journal of Human-Computer Studies (in press), 0 (2017).

[7] Martine G De Vos, Jan Wielemaker, Bob Wielinga, Guus Schreiber, and Jan Top. 2015. A methodology for constructing the calculation model of scientific spreadsheets. In Proceedings of the 8th International Conference on Knowledge Capture.

[8] Andres Garcia-silva, Asuncion Gomez-perez, Mari Carmen Suarez-figueroa, and Boris Villazon-terrazas. 2008. A Pattern Based Approach for Re-engineering Non-Ontological Resources into Ontologies. In The Semantic Web. Number 2. Springer Berlin Heidelberg, 167–181.

[9] Lushan Han, Tim Finin, Cynthia Parr, Joel Sachs, and Anupam Joshi. 2008. RDF123 : From Spreadsheets to RDF. In The Semantic Web-ISWC 2008. Springer Berlin Heidelberg, 451–466.

[10] Felienne Hermans, Martin Pinzger, and Arie Van Deursen. 2010. Automatically Extracting Class Diagrams from Spreadsheets. In 24th European Conference on Object-Oriented Programming (ECOOP), Lecture Notes in Computer Science. Springer-Verlag, 52–75.

[11] Andreas Langegger and W Wolfram. 2009. XLWrap – Querying and Integrating Arbitrary Spreadsheets with SPARQL. In International Semantic Web Conference. 359–374.

[12] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and searching web tables using entities, types and relationships. In Proceedings of the VLDB Endowment, Vol. 3. 1338–1347. https://doi.org/10.14778/1920841.1921005

[13] Eamonn Maguire, Alejandra González-Beltrán, Patricia L. Whetzel, Susanna Assunta Sansone, and Philippe Rocca-Serra. 2013. OntoMaton: A Bioportal powered ontology widget for Google Spreadsheets. Bioinformatics 29, 4 (2013), 525–527. https://doi.org/10.1093/bioinformatics/bts718

[14] Daniel McDonald, Jose C Clemente, Justin Kuczynski, Jai Rideout, Jesse Stombaugh, Doug Wendel, Andreas Wilke, Susan Huse, John Hufnagle, Folker Meyer, Rob Knight, and J Caporaso. 2012. The Biological Observation Matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome. GigaScience 1, 1 (2012), 7. https://doi.org/10.1186/2047-217X-1-7

[15] Albert Meroño-Peñuela, Ashkan Ashkpour, Laurens Rietveld, Rinke Hoekstra, and Stefan Schlobach. 2013. Linked Humanities Data : The Next Frontier ? A Case-study in Historical Census Data. In The Semantic Web: Semantics and Big Data. Springer Berlin Heidelberg, 645–649.

[16] Roland T. Mittermeir and Markus Clermont. 2002. Finding High-Level Structures in Spreadsheet Programs. In Proceedings of the 9th Working Conference on Reverse Engineering. Richmond, VA, USA, 221–232.

[17] Varish Mulwad, Tim Finin, and Anupam Joshi. 2012. A Domain Independent Framework for Extracting Linked Semantic Data from Tables. In Search Computing. Springer Berlin Heidelberg, 16–33.

[18] Tim F Rayner, Philippe Rocca-Serra, Paul T Spellman, Helen C Causton, Anna Farne, Ele Holloway, Rafael A Irizarry, Junmin Liu, Donald S Maier, Michael Miller, Kjell Petersen, John Quackenbush, Gavin Sherlock, Christian J Stoeckert, Joseph White, Patricia L. Whetzel, Farrell Wymore, Helen Parkinson, Ugis Sarkans, Catherine A Ball, and Alvis Brazma. 2006. A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC bioinformatics 7 (2006), 489. https://doi.org/10.1186/1471-2105-7-489

[19] Hajo Rijgersberg, M. Wigham, and Jan Top. 2011. How semantics can improve engineering processes: A case of units of measure and quantities. Advanced Engineering Informatics 25, 2 (apr 2011), 276–287. https://doi.org/10.1016/j.aei.2010.07.008

[20] Ivelize Rocha Bernardo, Matheus S Mota, and André Santanchè. 2013. Extracting and Semantically Integrating Implicit Schemas from Multiple Spreadsheets of Biology based on the Recognition of their Nature. Journal of Information and Database Management 4, 2 (2013), 104–113.

[21] Susanna-Assunta Sansone, Philippe Rocca-Serra, Dawn Field, Eamonn Maguire, Chris Taylor, Oliver Hofmann, Hong Fang, Steffen Neumann, Weida Tong, Linda Amaral-Zettler, Kimberly Begley, Tim Booth, Lydie Bougueleret, Gully Burns, Brad Chapman, Tim Clark, Lee-Ann Coleman, Jay Copeland, Sudeshna Das, Antoine de Daruvar, Paula de Matos, Ian Dix, Scott Edmunds, Chris T Evelo, Mark J Forster, Pascale Gaudet, Jack Gilbert, Carole Goble, Julian L Griffin, Daniel Jacob, Jos Kleinjans, Lee Harland, Kenneth Haug, Henning Hermjakob, Shannan J Ho Sui, Alain Laederach, Shaoguang Liang, Stephen Marshall, Annette McGrath, Emily Merrill, Dorothy Reilly, Magali Roux, Caroline E Shamu, Catherine A Shang, Christoph Steinbeck, Anne Trefethen, Bryn Williams-Jones, Katherine Wolstencroft, Ioannis Xenarios, and Winston Hide. 2012. Toward interoperable bioscience data. Nature genetics 44, 2 (2012), 121–6. https://doi.org/10.1038/ng.1054

[22] Yanfeng Shu, David Ratcliffe, Michael Compton, Geoffrey Squire, and Kerry Taylor. 2015. A semantic approach to data translation: A case study of environmental observations data. Knowledge-Based Systems 75 (2015), 104–123. https://doi.org/10.1016/j.knosys.2014.11.023

[23] Mark Van Assem, Hajo Rijgersberg, M. Wigham, and Jan Top. 2010. Converting and Annotating Quantitative Data. In ISWC2010, P.F. Patel-Schneider (Ed.). 16–31.

[24] Petros Venetis, Alon Halevy, and J Madhavan. 2011. Recovering semantics of tables on the web. In Proceedings of the VLDB Endowment, Vol. 4. 528–538. https://doi.org/10.14778/2002938.2002939

[25] Katy Wolstencroft, Stuart Owen, Matthew Horridge, Olga Krebs, Wolfgang Mueller, Jacky L Snoep, Franco du Preez, and Carole Goble. 2011. RightField: embedding ontology annotation in spreadsheets. Bioinformatics (Oxford, England) 27, 14 (jul 2011), 2021–2. https://doi.org/10.1093/bioinformatics/btr312
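To make the heuristics of Sections 5.1 and 5.2 more concrete, the sketch below shows one way they could be expressed in code. It covers the cell rule (a string term plus a unit of measure makes a quantity cell), the 30% block rule, and the deduction of a missing quantity concept from the unit term. It is a hedged illustration and not the implementation described in [6]. The function names, the regular expression for bracketed units, and the small lookup tables standing in for the OM vocabulary are all assumptions.

```python
import re

# Toy lookups standing in for a units vocabulary such as OM [19].
# The entries are assumed for illustration only.
UNIT_SYMBOLS = {"kg", "g", "ha", "MJ", "mol/L", "nmol/L"}
UNIT_TO_QUANTITY = {
    "mol/L": "Molar Concentration",
    "nmol/L": "Molar Concentration",
    "ha": "Area",
    "kg": "Mass",
}

# Assumed notation: an optional term followed by a unit in brackets.
BRACKETED = re.compile(r"^(?P<term>.*?)[(\[](?P<unit>[^)\]]+)[)\]]\s*$")


def split_term_and_unit(text):
    """Split 'BT (nmol/L)' into ('BT', 'nmol/L'); unit is None if unknown."""
    match = BRACKETED.match(text.strip())
    if match and match.group("unit").strip() in UNIT_SYMBOLS:
        return match.group("term").strip(), match.group("unit").strip()
    return text.strip(), None


def classify_cell(value):
    """Label a cell as empty, float, unit, quantity or string."""
    if value is None or str(value).strip() == "":
        return "empty"
    if isinstance(value, (int, float)):
        return "float"
    text = str(value).strip()
    if text in UNIT_SYMBOLS:
        return "unit"
    term, unit = split_term_and_unit(text)
    if unit:
        # A string term plus a unit of measure makes a quantity cell.
        return "quantity" if term else "unit"
    return "string"


def classify_block(cells, threshold=0.30):
    """Label a block 'Quantity' or 'Unit' if at least 30% of cells qualify."""
    if not cells:
        return "Unknown"
    kinds = [classify_cell(cell) for cell in cells]
    shares = {
        "Quantity": kinds.count("quantity") / len(kinds),
        "Unit": kinds.count("unit") / len(kinds),
    }
    label = max(shares, key=shares.get)
    return label if shares[label] >= threshold else "Unknown"


def deduce_quantity(cell_text):
    """Return the quantity commonly associated with the cell's unit, if any."""
    _, unit = split_term_and_unit(cell_text)
    return UNIT_TO_QUANTITY.get(unit)
```

With these toy lookups, the cell “BT (nmol/L)” from Section 5.2 is classified as a quantity cell and yields “Molar Concentration”. A block in which only one of four cells carries a unit stays below the 30% threshold and is left unlabeled. The alignment rule for “Quantity” blocks is deliberately left out, since it depends on table geometry that this sketch does not model.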