=Paper=
{{Paper
|id=Vol-2277/paper09
|storemode=property
|title=
Systematization of Tabular and Graphical Resources in Quantitative Spectroscopy
|pdfUrl=https://ceur-ws.org/Vol-2277/paper09.pdf
|volume=Vol-2277
|authors=Nikolai Lavrentiev,Alexey Privezentsev,Alexander Fazliev
|dblpUrl=https://dblp.org/rec/conf/rcdl/LavrentievPF18
}}
==Systematization of Tabular and Graphical Resources in Quantitative Spectroscopy==
© N.A. Lavrentiev, © A.I. Privezentsev, © A.Z. Fazliev

Institute of Atmospheric Optics SB RAS, Tomsk, Russia

lnick@iao.ru, remake@iao.ru, faz@iao.ru

Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2018), Moscow, Russia, October 9-12, 2018

Abstract. An approach to the formation of applied ontologies in data intensive subject domains where tabular and graphical forms of data representation predominate is suggested. Sources of data and of information about data in tabular and graphical forms are described. Using quantitative spectroscopy as an example, an approach to the formation of semantic annotations characterizing these sources is presented. The main types of sources and the methods for controlling spectral data quality are described. Using scientific graphics in the spectroscopy of molecular complexes as an example, an approach to the reduction and classification of graphical resources for the search for elementary plots in the subject domain is described. The role of ontology metrics in the comparison of data collections is discussed.

Keywords: big data systematization, quantitative spectroscopy, applied ontologies.

===1 Introduction===

Research results in tabular and graphical forms make up a significant part of publications in data intensive subject domains. Search agents that process such publications usually ignore this part of the information resources. The reason is the lack of universal software that can describe such resources from different subject domains.

The search for information about tabular and graphical resources was first implemented in the 1990s using metadata integrated into HTML pages. Semantic Web technology was announced in the early 2000s [1]; its aim was to replace traditional metadata with semantic annotations. No complete transition to semantic annotations occurred: on the one hand, the introduction of new technologies turned out to be a complicated process and, on the other hand, there was no demand for detailed queries that give near-unambiguous answers.

During the initial stage of the Web, the volume of nonscientific resources significantly exceeded the volume of scientific resources. Since the end of the 2000s, the situation has begun to change, and the volume of scientific data has begun to grow explosively. In the future, these data are expected to exceed all other resources [2]. Scientific information resources are represented on the Internet as publications (files), data collections (databases), subject domain ontologies (knowledge bases), etc. Below we mainly focus on scientific papers and their systematization. This part of the resources is chosen, on the one hand, because of its traditional use in research and, on the other hand, because of the need to search for scientific resources with highly detailed queries. Note that attempts to systematize the non-textual parts of scientific resources were made in several subject domains as early as the mid-2000s [3-5]. The systematization methods in our work are illustrated by examples from quantitative spectroscopy.

We have systematized sets of spectral data over the past 15 years. Semantic annotations of these data sets have become a part of applied ontologies characterizing one of the basic properties of these sets, that is, the trust in these data [6]. We digitized tables and plots representing the parameters of spectral lines and spectral functions. The tables were digitized to control the quality of expert spectral data. The spectral functions were digitized to obtain spectral information in cases where no high-resolution results were available and to control the asymptotic behavior of calculated data.

We constructed applied ontologies that characterize the quality of information resources on molecular spectroscopy [7], states and transitions of atmospheric molecules [8], and graphical resources on spectroscopy [9]. The ontologies created characterize tabular data that describe the spectral lines studied during the past 80 years. In the first thirty years of this period, publications included many scientific plots describing spectral functions along with a small number of data tables. The creation of Fourier spectrometers in the late 1960s initiated the appearance of many numerical arrays of precise data on spectral line parameters, and in subsequent years the graphical representation of spectral data was replaced by the tabular representation in high-resolution quantitative spectroscopy.

Nevertheless, there are spectroscopy domains where it is difficult to achieve a high resolution of the spectral parameters with modern experimental techniques. Examples are the continuum absorption, which is important in the study of planetary and exoplanetary atmospheres; the spectral properties of weakly bonded molecular complexes; and molecular spectral functions in the UV region, which are necessary for the quantitative description of photochemical reactions in the gaseous phase. In these subject domains, the amount of spectral information contained in scientific graphics significantly exceeds the amount of information represented in the tabular form.

In this work, we discuss the models and features of tabular and graphical representations of data in scientific publications and define the primary and composite data sources, information sources, and elementary and composite plots and figures. In the final part of the article, we estimate the metrics of the created ontologies on quantitative spectroscopy.

===2 Features of tabular and graphical representations of resources in publications===

====2.1 Publication model====

Publications are the most common means for the storage, communication, and analysis of scientific information. Traditionally, scientific papers include text in a natural language, mathematical equations, chemical reactions, physical formulas, tables, plots, figures, etc. The text part is mainly used to find the information requested by a user. In many subject domains, data arrays that are solutions to computational problems, measurements, or observations are given in tabular and graphical representations. Every such solution is a part of a paper that contains a large number of typed facts. Equations, formulas, and sets of reactions are much more abstract resources, since most of them have no unique names and their annotation requires a certain level of professional training.

To form the part of the semantic annotations that characterizes the tabular and graphical resources of a paper, in a simple case one can take into account the description of the properties of the domain problem solutions. Note that the current trend is the creation of supplementary materials to papers, many of which contain additional data in tabular and/or graphical forms.

The solution of a computational problem is a data array supplemented by a set of properties of this array; it can represent a more accurate formal model of one or another part of a paper. The specification of the set of properties is determined by the problems of searching for information resources that are of interest to researchers in the given subject domain.

The choice of a publication model for collections of data arrays represented in tabular and graphical forms is determined by the task of automatic cataloging of such information resources in a subject domain. Our collection of papers on quantitative spectroscopy already exceeds 12,000 publications covering the period from 1898 to the present. The model of the subject domain chosen by us [8] contains solutions of seven spectroscopic problems that are of decisive importance for such applied subject domains as astronomy, atmospheric optics, spectroscopy, etc.

Publications contain not only tables with data arrays but also scientific graphics. Graphical resources in scientific subject domains can be divided into two parts: mathematical plots (usually 2D and 3D) and figures (raster graphics and graphics represented by data arrays). Today, digital images of scientific graphics appear in the supplementary materials of a number of journals, which makes the quantitative comparison of graphics less costly.

====2.2 Tabular representation====

The intensive use of numerical data has led to a wide variety of forms of tabular representation. Tabular data in the paper text and in plain-text files contain data arrays formatted either positionally with whitespace characters or with separating symbols, the so-called CSV (Comma-Separated Values) files. The form of a table does not impose restrictions on the metadata of the data arrays. The subject domain model chosen in a specific information system allows one to distinguish the structure of the intension of semantically significant data arrays in a tabular form. Thus, not all information published in the tabular form should be semantically annotated, but only the information necessary for the W@DIS information system.

In journals, the tabular data representation is still used in spectroscopy, but the volume of spectral data there has decreased significantly; most of the information resources presented in the tabular form are concentrated in supplementary materials. Note that in the first half of the 20th century the number of plots in papers was much higher than the number of tables.

In the W@DIS information system (IS) described below, data arrays extracted from tables published in scientific papers are the main resources. Figure 1 shows some stages of the formation of these resources. Figure 1a shows a fragment of a table from a paper; Fig. 1b gives a typical representation of data from tables in W@DIS; and Fig. 1c shows the metadata that are automatically generated when the published data are imported into the IS.

Fig. 1 Models of the paper fragment that contains the tabular representation: (a) source table with measurement data; (b) representation of this table in the information system; and (c) metadata that characterize the properties of the numerical array that is represented in the tables in fragments (a) and (b).

Fig. 2 Models of a graphical resource of a publication: (a) a fragment of the original publication that contains a figure; (b) the figure used for quantization; (c) the complex plot built on the basis of the quantization results of fragment b; (d) an elementary plot from fragment c; and (e) the description of the elementary plot in fragment d.

====2.3 Scientific graphics====

Scientific plots are used in quantitative spectroscopy in the fields where modern experimental techniques cannot provide exact measurements (for example, due to the complex atomic composition of a molecule or a short-wavelength range). Examples are the study of continuum absorption, which is important in the investigations of planetary and exoplanetary atmospheres, and the study of the spectral properties of weakly bonded molecular complexes and of molecules in the UV region, which are necessary for the quantitative description of photochemical reactions in the gaseous phase.

The plots with which a user works in the W@DIS IS can be divided into two classes: simple and composite. Simple plots contain only one set of coordinates, represented by a curve, a set of dots, or bars. Composite plots can contain many curves in one coordinate space. There are two types of composite plots in the IS: (1) plots obtained by combining simple plots from one publication and (2) plots obtained from the comparison of different data sets from different publications.

A simple plot is a basic data structure in the IS. It is stored as a collection of abscissas and ordinates for the corresponding data set together with the associated metadata. The set of metadata for each plot includes physical quantities, such as the substance participating in the physical process described by the plot, the temperature and pressure of the process, the data type (experimental or theoretical), the spectral function and method (measurements or calculations), and the X- and Y-coordinates and their units of measurement. It also includes auxiliary metadata: the plot style (a curve representable in several ways, or a set of points or bars), linear or logarithmic scales along the abscissa and ordinate, a caption and a commentary for the plot, a bibliographic reference to the paper from which the plot has been taken, and the figure number in this paper. Each simple plot is accompanied by the attached scanned image from the source paper, which allows us to compare the original figure with the plot built automatically in the system. In turn, by combining simple plots from one publication, one can obtain a composite plot.

The search and comparison interface allows one to find already loaded plots by a wide range of criteria, such as the physical quantities along both axes with the appropriate units of measurement, the substance, the temperature, and the pressure, or any other physical or auxiliary metadata. As a result of the search, one obtains sets of data from different publications, which can then be combined in one coordinate space for further comparison.

The scientific plots described in this work represent the dependencies of physical quantities in 1D–3D Cartesian coordinates. The most common are 2D plots. As a rule, several curves are shown in one plot in one coordinate space; they characterize the behavior of physical parameters under different thermodynamic conditions or compare the original results of the authors with the works of other researchers. Plots that contain only one curve make up a relatively small part of the total volume of published plots.

The main idea of the systematization is to separate every curve of a complex plot into a primitive plot, which is supplemented by a set of metadata describing the plot with the level of detail necessary for searching for it.

Let us give several definitions.

Definition 1. The primitive plot is a plot in Cartesian coordinates that contains only one curve from a published figure, in the same coordinate system and relating to the same physical parameter and its measurement units, together with a set of metadata describing the plot.

Definition 2. The composite plot is a plot in Cartesian coordinates that contains all primitive plots (more than one) from a published figure, in the same coordinate system and having the same physical parameters and measurement units, together with the sets of metadata describing each plot from this figure.

Definition 3. The primitive image in a published figure is an image of one object under study together with the related set of metadata that characterizes the properties of the object and its image.

Definition 4. An image that contains more than one primitive image of an object from a published figure is called the composite image in the published figure.

In particular, the set of metadata of an elementary image includes a reference to the publication from which the described figure has been extracted. Composite images can be single- or multipaper.

Definition 5. The primitive figure is a figure that contains a single scientific plot or image.

Definition 6. The composite figure is a figure that contains scientific plots and images.

===3 Data and information sources===

====3.1 Definitions====

The variety of molecules for which the problems mentioned in [10] have been solved, and of the methods used, is quite wide. For this reason, solutions to several problems by different methods for different molecules or their isotopologues can be presented in one publication. The solution to one task can be the content of several tables. During the systematization of data extracted from publications, such a variety of tables creates many problems, especially in the cases where the solution to a subject task is divided into parts and is represented in several tables. It makes no sense to tie individual data arrays to the tables from which they were extracted. For this reason, we use here an information object that represents the original data of a publication describing one molecule, one spectroscopy task, and one solution method.

=====3.1.1 Primitive and composite data sources=====

This information object shall be called the data source. Different types of data sources occur in scientific papers. Let us give several definitions.

Definition 7. All parts of the published solution to a task of quantitative spectroscopy, along with the molecule name, the reference, and the name of the solution method (or a reference to the method description), are called the “primitive data source”.

We assume that empty solutions are not published. On the other hand, solutions can include measurement data that become outdated with time or are themselves wrong. A data source whose content is completely rejected by experts is called negligible. The number of such sources in modern spectroscopy is insignificant.

Definition 8. An information object that exhibits the basic properties of a primary data source and whose cardinality differs from unity is called the composite data source.

Any expert set of spectral data (e.g., HITRAN [11]) can serve as an example of a composite data source.

=====3.1.2 Information source=====

A primary source can be endowed with additional properties. The list and number of these properties depend on the information tasks for whose solution these properties are used. A data source with additional properties is called the source of information.

Definition 9. A primitive data source with additional properties is called a primitive source of information extracted from a publication.

The source of information is a set of properties and values attributed to a data source. For a number of information tasks, for example, the search for reliable solutions to quantitative spectroscopy problems, one can select properties whose values are calculated automatically. A source of information usually includes some statements from the publication that contains the data source described by this source of information. The greater part of a source of information characterizes the knowledge contained in the publication in an implicit form.

The list of additional properties is determined by a researcher on the basis of the information tasks that are to be solved. There are two such tasks in our work: the task of semantic search and the task of the automated composition of an expert data set. Let us note that primary sources of information relating to one publication do not contain identical statements. The difference between a data publication and a related primary source of information can be significantly smaller than the difference between the publication and a related primary data source. This is due to those additional properties of the task solution in the publication that are included in the definition of a particular source of information. For example, such an additional property can be the description of the validity of the solution or the description of the standard deviations of the initial data source from other data sources, etc. In addition, the statements contained in the primary source of information may not be contained in the publication.

=====3.1.3 Sources of information attributed to pairs of data sources=====

The representation of a source of information that characterizes the properties of all pairs formed by a selected data source and every other data source is much more complex. The visualization of such a source of information is necessary for researchers for a number of reasons. First, in spectroscopy, as well as in other data intensive subject domains, it is common to compare the results of experiments performed by different groups. Second, there can be several types of such pair relationships. Third, the number of data sources in the IS varies with time (new works on state and transition parameters appear). Fourth, the measurement accuracy increases; therefore, the values of the criteria that determine the reliability of facts are to be reviewed. Fifth, the number of facts in the comparison between data sources can reach tens of thousands, which makes it more convenient to represent them graphically. The representation of this information in the text form is cumbersome and allows one to see only a local picture.

===4 Ontology metrics in quantitative spectroscopy===

Users of applied data stored in data collections related to data intensive subject domains currently face problems in selecting the necessary data; these problems concern not only the data intension but also the data quality. Ontologically described collections are preferable. Such collections can be objectively compared in terms of the metrics of the corresponding ontologies. Naturally, multiple ontology descriptions give information of significantly better quality about a collection. With time, a certain standard of such a description should arise for each applied subject domain. Below we give an example of the quantitative estimation of the ontology description of resources in the W@DIS IS [12].

As a result of the work, a set of spectral data was collected and systematized within the Molecular Spectroscopy IS for several molecules: H<sub>2</sub>O, H<sub>2</sub>S, HOCl, OCS, O<sub>3</sub>, SO<sub>2</sub>, C<sub>2</sub>H<sub>2</sub>, CH<sub>4</sub>, CO<sub>2</sub>, CH<sub>3</sub>OH, CO, HBr, HCl, HF, HI, N<sub>2</sub>, CH<sub>3</sub>Br, CH<sub>3</sub>Cl, N<sub>2</sub>O, NH<sub>3</sub>, NO<sub>2</sub>, PH<sub>3</sub>, and their isotopologues. The numerical array of spectral data in the Molecular Spectroscopy IS occupies about 80 GB in a MySQL database, where most of the data relate to the H<sub>2</sub>O molecule and its isotopologues. The size of the numerical array could be reduced by additional optimization of the data structure, but then the load on the computing resources of the Molecular Spectroscopy IS would increase significantly. To describe the parts of the complete array, the IS contains about 25 GB of metadata stored in the MySQL database; the overwhelming majority of these metadata are quantitative criteria of data quality derived from the calculated values of the correlations between pieces of the numerical data.

On the basis of the complete 80-GB data array, ontologies of molecular states and transitions are formed, which are represented as XML files in the RDF/XML notation of the OWL language with a total size of about 280 GB. It should be noted that the OWL language has several syntax notations, from the shortest, the Manchester syntax, to the longest, the OWL/XML syntax. The relatively verbose RDF/XML syntax was selected for the representation of OWL ontologies in the Molecular Spectroscopy IS for historical reasons; this choice seemed optimal at the beginning of the work on ontology representations in the Molecular Spectroscopy IS in 2006.

On the basis of the 25-GB array of metadata, a semantic information model is formed as the ontology of information resources, represented as XML files in the RDF/XML notation of the OWL language with a size of about 3 GB. A semantic model of information on spectroscopic graphics in the form of the ontology of spectroscopic plots, represented as an XML file in the RDF/XML notation of the OWL language with a size of only 2 MB, should be mentioned separately. More complete quantitative information on the resources is given in Table 1.

The completeness of the description of the subject domain and its parts by different applied ontologies is estimated using the metrics of the ontologies. Some metrics of the applied ontologies on spectroscopy are given in Table 2.

{| class="wikitable"
|+ Table 1. Volume of data, metadata, and ontologies in W@DIS IS
! List of resources in W@DIS IS !! Volume, GB
|-
| colspan="2" | Data layer
|-
| Spectral data || 80.779
|-
| colspan="2" | Metadata layer
|-
| Metadata || 24.772
|-
| colspan="2" | Ontology layer
|-
| Ontology of information resources on quantitative spectroscopy || 3.231
|-
| Ontology of molecular states and transitions || 280.079
|-
| Ontology of scientific graphics on quantitative spectroscopy || 0.002
|-
| All resources || 398.8
|}

{| class="wikitable"
|+ Table 2. Estimation of the metrics of applied ontologies on quantitative spectroscopy
! Ontology !! Axiom !! Logical axiom !! Declaration axioms !! Class !! Object property !! Data property !! Individual !! DL expressivity
|-
| OIR || 5.4×10<sup>6</sup> || 4.6×10<sup>6</sup> || 606 || 324 || 92 || 355 || 1.4×10<sup>6</sup> || ALCHON(D)
|-
| OSPM || 0.97×10<sup>9</sup> || 0.9×10<sup>9</sup> || 68 || 30 || 13 || 25 || 2.0×10<sup>9</sup> || ALC(D)
|-
| OSG || 1.81×10<sup>4</sup> || 1.37×10<sup>4</sup> || 3690 || 62 || 17 || 10 || 3.7×10<sup>3</sup> || ALCHO(D)
|}

OIR means the ontology of information resources, OSPM means the ontology of molecular states and transitions, and OSG means the ontology of spectroscopic plots.

===5 Conclusion===

The work was aimed at the ontological description of collections of information resources on quantitative spectroscopy. This description makes it possible to organize semantic search in the domain on the basis of the traditional criteria of spectroscopy. The publication models were developed and formalized with the help of OWL 2 DL. The data and information sources were constructed as a part of the formalization. The description of sources, states, transitions, and spectral functions became the basis for the construction of three applied ontologies. These ontologies were used for cataloging articles on quantitative spectroscopy topics and their parts. The metrics of the ontologies were estimated.

The proposed model can be used for formalizing information resources of different types in other subject domains.

Acknowledgments. The work was financially supported by the Russian Foundation for Basic Research (grant no. 07-13-0411).

===References===

[1] Tim Berners-Lee, James Hendler, and Ora Lassila, The Semantic Web, Scientific American, May 17, 2001.

[2] L. Kalinichenko, A. Fazliev, E. Gordov, N. Kiselyova, D. Kovaleva, O. Malkov, I. Okladnikov, N. Podkolodny, N. Ponomareva, A. Pozanenko, S. Stupnikov, A. Volnova, New Data Access Challenges for Data Intensive Research in Russia, CEUR Workshop Proceedings, v. 1536, 2015, pp. 215-237, 17th International Conference on Data Analytics and Management in Data Intensive Domains, DAMDID/RCDL 2015; Obninsk; Russian Federation; 13-16 October 2015; Code 118237.

[3] H. Keller-Rudek, G.K. Moortgat, R. Sander, R. Sörensen, The MPI-Mainz UV/VIS spectral atlas of gaseous molecules of atmospheric interest, Earth System Science Data, 5, 365–373 (2013), doi:10.5281/zenodo.6951.

[4] A.I. Privezentsev, D.V. Tsarkov, A.Z. Fazliev, Knowledge bases for the description of information resources in molecular spectroscopy 3. Basic and applied ontologies, Elektronnye Biblioteki (Digital Libraries), 2012, Vol. 15, No. 2 (in Russian), http://elbib.ru/index.phtml?page=elbib/rus/journal/2012/part2.

[5] N.A. Lavrentiev, O.B. Rodimova, A.Z. Fazliev, A.A. Vigasin, Systematization of published research graphics characterizing weakly bound molecular complexes with carbon dioxide, Proc. SPIE 10466, 23rd International Symposium on Atmospheric and Ocean Optics: Atmospheric Physics, 104660E (30 November 2017); doi: 10.1117/12.2289932.

[6] N.A. Lavrentyev, M.M. Makogon, A.Z. Fazliev, Comparison of the HITRAN and GEISA Spectral Databases Taking into Account the Restriction on Publication of Spectral Data, Atmospheric and Oceanic Optics, 2011, Vol. 24, No. 5, pp. 436–451.

[7] A. Privezentsev, D. Tsarkov, A. Fazliev, J. Tennyson, Computed Knowledge Base for Description of Information Resources of Water Spectroscopy, Proc. of the 7th International Workshop on OWL: Experiences and Directions (OWLED 2010), San Francisco, California, USA, June 21-22, 2010. Edited by Evren Sirin, Kendall Clark, CEUR-WS Proc. Vol-614, http://ceur-ws.org/Vol-614/owled2010_submission_6.pdf.

[8] S.S. Voronina, A.I. Privezentsev, D.V. Tsarkov, A.Z. Fazliev, An Ontological Description of States and Transitions in Quantitative Spectroscopy, Proc. of SPIE XX-th International Symposium on Atmospheric and Ocean Optics: Atmospheric Physics, 2014, Vol. 9292, 92920C.

[9] N.A. Lavrentiev, O.B. Rodimova, A.Z. Fazliev, Systematization of graphically plotted published spectral functions of weakly bound water complexes, Proc. SPIE of 22nd International Symposium Atmospheric and Ocean Optics: Atmospheric Physics, Eds. Gennadii G. Matvienko; Oleg A. Romanovskii, Tomsk, Russian Federation, v. 10035, 100350C (November 29, 2016); doi: 10.1117/12.2249159.

[10] A.D. Bykov, A.V. Kozodoev, A.I. Privezentsev, L.N. Sinitsa, M.V. Tonkov, N.N. Filippov, A.Z. Fazliev, M.Yu. Tretyakov, Distributed information system on molecular spectroscopy, Proc. of SPIE, International Symposium on High Resolution Molecular Spectroscopy, 2006, v. 6580, pp. 65800W.

[11] L.S. Rothman, I.E. Gordon, Y. Babikov, A. Barbe, D.Chris Benner, P.F. Bernath, M. Birk, L. Bizzocchi, V. Boudon, L.R. Brown, A. Campargue, K. Chance, L. Coudert, V.M. Devi, B.J. Drouin, A. Fayt, J.-M. Flaud, R.R. Gamache, J. Harrison, J.-M. Hartmann, C. Hill, J.T. Hodges, D. Jacquemart, A. Jolly, J. Lamouroux, R.J. LeRoy, G. Li, D. Longo, C.J. Mackie, S.T. Massie, S. Mikhailenko, H.S.P. Muller, O.V. Naumenko, A.V. Nikitin, J. Orphal, V. Perevalov, A. Perrin, E.R. Polovtseva, C. Richard, M.A.H. Smith, E. Starikova, K. Sung, S. Tashkun, J. Tennyson, G.C. Toon, Vl.G. Tyuterev, J. Vander Auwera, G. Wagner, The HITRAN 2012 Molecular Spectroscopic Database, Journal of Quantitative Spectroscopy and Radiative Transfer, 2013, Volume 130, Pages 4-50, DOI: 10.1016/j.jqsrt.2013.07.002.

[12] A. Akhlyostin, Z. Apanovich, A. Fazliev, A. Kozodoev, N. Lavrentiev, A. Privezentsev, O. Rodimova, S. Voronina, A.G. Csaszar, J. Tennyson, The current status of the W@DIS information system, Proc. SPIE of 22-nd International Symposium Atmospheric and Ocean Optics: Atmospheric Physics, Eds. Gennadii G. Matvienko; Oleg A. Romanovskii, Tomsk, Russian Federation, v. 10035, 100350D (November 29, 2016); doi: 10.1117/12.2249235.
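
As an illustration of the simple and composite plots described in Section 2.3, the following is a minimal Python sketch of how such records might be held in memory and searched by metadata. It is not the W@DIS schema: the class names, the field names, the helper functions `axes` and `search`, and all sample values are assumptions introduced only for this example, and only a subset of the metadata listed in Section 2.3 is modelled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SimplePlot:
    """One curve (or set of points/bars) plus part of its metadata (hypothetical layout)."""
    points: List[Tuple[float, float]]  # (abscissa, ordinate) pairs
    substance: str
    temperature_k: float
    pressure_atm: float
    data_type: str       # "experimental" or "theoretical"
    x_quantity: str
    x_unit: str
    y_quantity: str
    y_unit: str
    reference: str       # bibliographic reference to the source paper
    figure_number: str   # figure number in that paper


def axes(plot: SimplePlot) -> Tuple[str, str, str, str]:
    """Quantities and units that define the coordinate space of a plot."""
    return (plot.x_quantity, plot.x_unit, plot.y_quantity, plot.y_unit)


@dataclass
class CompositePlot:
    """Several simple plots placed in one coordinate space."""
    plots: List[SimplePlot] = field(default_factory=list)

    def add(self, plot: SimplePlot) -> None:
        # Curves may share a coordinate space only if quantities and units match.
        if self.plots and axes(plot) != axes(self.plots[0]):
            raise ValueError("axes or units differ from the plots already combined")
        self.plots.append(plot)

    def is_multipaper(self) -> bool:
        # True when the combined curves come from more than one publication.
        return len({p.reference for p in self.plots}) > 1


def search(collection: List[SimplePlot], **criteria) -> List[SimplePlot]:
    """Return the plots whose metadata equal every given criterion."""
    return [p for p in collection
            if all(getattr(p, name) == value for name, value in criteria.items())]


# Placeholder records: the numbers and references are invented for the example.
p1 = SimplePlot([(100.0, 0.1), (200.0, 0.2)], "H2O", 296.0, 1.0, "experimental",
                "wavenumber", "cm-1", "absorption coefficient", "cm-1", "Paper A", "3")
p2 = SimplePlot([(100.0, 0.12), (200.0, 0.19)], "H2O", 296.0, 1.0, "theoretical",
                "wavenumber", "cm-1", "absorption coefficient", "cm-1", "Paper B", "1")
p3 = SimplePlot([(100.0, 5.0)], "CO2", 296.0, 1.0, "experimental",
                "wavenumber", "cm-1", "cross section", "cm2", "Paper A", "4")
collection = [p1, p2, p3]

# Search by substance, then combine the hits in one coordinate space.
combined = CompositePlot()
for hit in search(collection, substance="H2O"):
    combined.add(hit)
```

Here `combined` holds curves from two different references, so it corresponds to the second type of composite plot in Section 2.3 (a comparison of data sets from different publications); adding `p3` would be refused because its ordinate quantity and unit differ.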