<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of Cereal Science</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Identifying and Extracting Quantitative Data in Annotated Text?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Don J.M. Willems</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hajo Rijgersberg</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan L. Top</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Sciences, Vrije Universiteit Amsterdam</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Wageningen UR, Food and Biobased Research</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <volume>46</volume>
      <issue>1</issue>
      <fpage>409</fpage>
      <lpage>416</lpage>
      <abstract>
        <p>In science it is difficult to reuse quantitative scientific data. For example, it is not possible to search for quantitative data in papers in a directed way, such as using the query "Select the storage modulus of dairy product A after the temperature has decreased from 90 to 4 ±C". This is caused by the fact that data is made available in (relatively) free formats as in scientific papers, spreadsheets, or databases, all with limited annotation and description of the way they were obtained. Meaning is lost, for example about what the numbers relate to (quantities and units are often poorly indicated). Many researchers, especially in the physical and computer sciences use LATEX in their creation of scientific papers. In this paper we present a set of LATEX-style files, which use the terminology defined in wurvoc.org, that can be used to annotate scientific papers. These style files define a set of commands, each representing a specific quantity or unit. If the LATEX is typeset into a PDF file, quantities and units in the PDF will be annotated with the appropriate references (URIs) to the corresponding concepts in the OM ontology. This will not only disambiguate the use of these quantities and units, but will also enable us to extract triples from the PDF, facilitating the use of SPARQL queries to answer advanced quantitative question.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Many scientific papers are written using the LATEX typesetting system and published
as PDF. It is desirable to process this knowledge automatically. We propose a method
to add semantic annotations to LATEX files, extending the typesetting methods that
LATEX offers. Using these annotations we can automatically extract information from
the generated PDF files.</p>
      <p>
        In scientific research one is often looking for quantitative data. One can find these
for example in external databases and spreadsheets or in scientific papers. However,
it is difficult to look for quantitative data on the Web in a directed way. For example,
the query "Select the storage modulus of dairy product A after the temperature has
decreased from 90 to 4 ±C" can not be carried out because computer tools are not
able to link the correct numbers to the correct units and correct quantities. Another
example is to give maximum, minimum and average values observed for parameter
X in a set of papers. The problem is caused by the fact that data are expressed in
? This publication was supported by the Dutch national program COMMIT.
relatively free formats, such as in text or in spreadsheets. The structure of the data is
often not clear for a computer. A lot of research is being done on parsing the structure
of scientific papers including tables (see for instance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), which often contain a lot of
quantitative data. The annotation of the data, however, is usually limited, including
quantities and units that are often poorly indicated [2]. In other words, the context
to which the numbers refer is often lost.
      </p>
      <p>To approach these problems, formal terminology is developed in the discipline
of the Semantic Web [3]. This terminology can be made publicly available through
the Web, so that it can be used (referred to) in digital sources [4]. The ideal
situation would be to have all information available in formal languages. This,
however, requires a major effort in creating a consistent conceptual model of scientific
knowledge and significant advances in automatic parsing and annotation of
scientific texts. The work presented in this paper represents an important step in
annotating texts with formal concepts.</p>
      <p>Often quantitative work, in for instance the physical sciences, is presented in
scientific papers that were created using the LATEX typesetting system[5]. LATEX is
especially suited for the physical and computer sciences because of the extensive support
for mathematical typesetting. The use of LATEXstyle files provides a convenient way to
add annotations to content, not only to the main text of an article but also to tables
and even graphics created in LATEX.</p>
      <p>In this exercise we focus on the annotation and extraction of quantities and units,
defined in our Ontology of units of Measure and related concepts (OM; see
Section 3) from annotated content. Using custom LATEX commands, quantities and units
with semantic annotations are inserted into the PDF with the appropriate references
(URIs) to the corresponding concepts in the OM ontology. This will not only
disambiguate the use of these quantities and units, but will also enable us to extract RDF
[6] triples from the PDF, enabling the use of SPARQL [7] queries to answer advanced
quantitative questions. This may greatly enhance searching and the processing of
data in general. Because a computer will suffer less from ambiguity owing to the
annotations, it can reach a higher quality in the support that it offers.</p>
      <p>In this paper we will first focus on related work (Section 2). Subsequently, in
Section 3, we will briefly describe OM, discuss aliases in LATEX, propose commands for
referring to quantities and units in LATEX, and describe the method for transforming
equations from LATEX to RDF. In Section 4 we will present a few examples of annotated
scientific texts and in Section 5 we will discuss the results.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>General developments in the area of semantic publishing, have been presented in
several papers [8,9]. The authors express the need for annotating publications and
extracting structured information from them. Otherwise the sheer number of
references hinders interaction between related scientific activities. Assessing and
integrating previous work benefits significantly if factual information is (also) available
as Linked Open Data. These authors indicate the need for explicating the structure
of arguments in scientific discourse, as well as a structured presentation of the
concepts and relations between concepts. Our work is complementary, in the sense that
we start by disclosing numerical facts which can be related to other concepts. We
take a pragmatic approach by building on standard practice in writing LATEX
documents.</p>
      <p>Some approaches exist for semantically annotating LATEX files. STEX, Semantic
Markup for TEX/LATEX [10], consists of a collection of TEX macro packages that allows
the user to markup TEX/LATEX documents semantically, turning the documents in a
format for mathematical knowledge management (MKM). The method focuses on
mathematical relations, collections, and formats of numbers (e.g., decimals). STEX,
however, cannot be used to annotate quantities and units.</p>
      <p>The SIunits package for LATEX [11] is a package that provides support for
typesetting units, in more or less the same way as we do. Quantities, however, do not appear
as such in this package. Moreover, as the package only provides typesetting, the
concepts are not linked to a centrally available vocabulary on the Semantic Web.</p>
      <p>SALT, Semantically Annotated LATEX[12], provides a means for externalising
rhetorical and argumentation captured within a publication’s content. However, the
approach does not relate to mathematical concepts or quantities and units.</p>
      <p>Mathematics can be expressed in XML using MathML [13] and can as such be
incorporated into web pages. Equations in MathML can be expressed in two distinct
formats: i) Presentation encoding, which as the name suggest supports the
construction of traditional mathematical notation. ii) Content encoding supports the
"encoding of the underlying mathematical meaning of an expression" [13]. While the scope
of MathML itself does not include units, they can, however, be expressed in MathML
[14], including a reference to the URI of a concept in a formal vocabulary of units
(such as OM).</p>
      <p>Ideally, structured information should be extracted automatically from
publications. Frameworks such as GATE [15] provide functionality for automatically
annotating text. It does not, however, provide the ability to automatically annotate
equations, where only symbols are used for quantities and units, or graphics.</p>
      <p>In previous work we have annotated quantities and units in Excel files, using
an add-in for Excel we developed along with web services disclosing quantities and
units from OM, and operations that can be performed on these terms, such as
dimension and unit consistency checks on formulas and returning possible units for
a particular quantity. In the present work we relate terms in LATEX documents to this
same ontology (OM [16]), extending the use of OM concepts to a broader audience.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>Typesetting (the creation of a visual representation of a text) using LATEX is done using
the TEX typesetting engine [17] developed by Donald Knuth in the 1970s and 1980s.
TEX provides a set of low level declarations or commands for typesetting. LATEX,
developed by Leslie Lamport in the 1980s, provides a set of higher level commands that
can be used to easily create documents without having to worry about their
typographical appearance [5]. These sets of commands can easily be extended (and often
are) by users to define a set of personal commands for typesetting particular pieces
of information that are often used. These commands are either defined in the main
LATEX source file or in style files which can be imported into the main LATEX file.</p>
      <p>In this paper we present a LATEX package (as a set of style files) that uses
terminology as offered through our ontology platform wurvoc.org.
3.1</p>
      <sec id="sec-3-1">
        <title>Ontology of units of Measure and related concepts (OM)</title>
        <p>The ontology that we use to refer to in the LATEX files is OM. OM is an ontology based
on older ontologies of units of measure, such as EngMath by Tom Gruber [18]. In
earlier work [16] we have compared OM, EngMath and other ontologies, and OM
appeared to be the most extended ontology, e.g. defining the most of the relevant
concepts in the quantitative domain, such as “quantity”, “unit of measure”, “dimension”,
“measure”, “measurement scale”, etc.</p>
        <p>OM defines concepts such as unit, quantity and dimension. Quantities are
related to units of measure and measurement scales that can be used to express them
using the relation \unit_of_measure. Units of measure are defined by some
observable standard phenomenon, such as the length of the path travelled by light in a
vacuum during a time interval of 1/299 792 458 of a second, for meter. Measures, such
as “3 kilogram” are used to indicate values of quantities. Multiples and submultiples
of units have a prefix, such as in kilogram and millimetre.</p>
        <p>Systems of units organise quantities and units of measure, e.g. the International
System of Units (SI). Such a system defines base units and derived units. Base units
are units that cannot be defined in terms of other units (e.g. metre and second).
Base units can be combined into derived units, such as for example metre per
second ( m s¡1).</p>
        <p>OM is based on a semi-formal description of the domain of units of measure,
drafted from several paper standards that we have analysed, e.g. the Guide for the
Use of the International System of Units [19], by the NIST. For a full list of statements,
the sources that we have used, and ontological choices made, see previous work [16].</p>
        <p>OM is modelled in OWL 2 [20]. The ontology is published as Linked Open Data [21]
through our vocabulary and ontology portal wurvoc.3 OM can be used freely under
the Creative Commons 3.0 Netherlands license.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3.2 Aliases in LATEX</title>
      <p>When using LATEX it is often preferable to create aliases for often used (complex)
command structures instead of retyping these command structures again and again.</p>
      <p>For instance, LATEX source code becomes more difficult to interpret when units
are used in an equation. To create a statement like:</p>
      <p>G Æ 6.673 £ 10¡11 N m2 kg¡2
(1)
which is the gravitational constant, the following LATEX source code can be used:
3 http://www.wurvoc.org/vocabularies/om-1.8/. The objective of wurvoc.org is to publish
vocabularies and associated web services relevant to the general domain of physical units
and quantities and in particular the domains of life sciences and agrotechnology. In wurvoc
one can browse vocabularies and directly interface with them.</p>
      <p>G = 6.673\times 10^{-11} \mathrm{N} \mathrm{m^2} \mathrm{kg^{-2}}
Units are written in a non-bold but upright font (i.e. not cursive as is used for
quantities and variables).</p>
      <p>To make things easier, authors using LATEX construct self-defined aliases. For
instance, for the unit for the gravitational constant we might define a new command:
\newcommand{\Gunit}{\mathrm{N} \mathrm{m^2} \mathrm{kg^{-2}}}
and for exponents:</p>
      <p>
        \newcommand{\E}[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]{\times10^{#1}}
The author can then use his custom defined commands (or aliases) \Gunit and \E
to insert the correct unit and exponent. Equation 1 can then be typeset using
      </p>
      <p>G = 6.673\E{-11} \Gunit
which is much easier to interpret by humans.</p>
      <p>Sets of often used aliases can be created and distributed using style files. These
LATEX examples use custom commands to provide easier typesetting of
mathematical expressions. We would like to use these typesetting commands (aliases) to insert
semantic information into the mathematical expressions.</p>
      <sec id="sec-4-1">
        <title>3.3 Semantic annotations</title>
        <p>As aliases are used quite often by authors, it becomes possible to add extra
information to the output produced when typesetting LATEX files. The extra information we
would like to provide in scientific texts are links (URIs) to ontological definitions of
the quantities and units used as defined in the Ontology of units of Measure (OM,
[22], prefix is om:). To this end we have created a set of style files (a package) that
define a large set of aliases that not only create the correct symbols and layout for
quantities and units, but also provides annotated links to the ontological concepts
describing these quantities and units. As most LATEX source files are typeset into PDF
files these days, we have decided to use PDF annotations (more specifically
hyperlinks) to create the links to the ontological concepts. The URI of the annotated
concept is added to the PDF as a hyperlink and will generally only be visible when the
mouse cursor hovers above the linked text.</p>
        <p>Using the hyperref package, hyperlinks are inserted into the PDF produced by
LATEX by defining aliases with a custom command:
In the first line, the command \annot is defined with three parameters. The first
parameter is the part of the URI in the OM ontology of the concept (quantity or unit)
that comes after om:, which is the base URI for the OM ontology. The second
parameter is the default symbol used for this quantity or unit, and the third (optional)
parameter can be used by an author to insert a custom symbol for the same concept.
The second line checks whether the author has used an optional symbol. If not, the
third line will insert the default symbol (the second parameter) with a hyperlink
consisting of the base URI concatenated with the first parameter. If the author used the
third parameter for a custom symbol, this symbol is used instead, with the same URI
(line 4).</p>
        <p>The \annot command is defined in the om.sty style file provided in our
OMLATEX distribution. A few other commonly used commands, such as \vect to typeset
vectors, \E to typeset exponents such as 4.2£103, and \unit to typeset units are also
provided in this file.</p>
        <p>To annotate an equation like:
jjajj Æ 5.433 £ 10¡1 m s¡2
(2)
which would normally be produced by the source code:</p>
        <p>||\vect{a}|| = 5.433 \times10^{-1}\unit{m}\unit{s^{-2}}
can now be obtained, with the same result, using the following code:
||\Acceleration || = 5.433\E{-1}\metrePerSecondSquared
This LATEX code, while not much shorter, is more easy to interpret by humans. For
all units and quantities in OM a human readable alias, such as \Acceleration, is
provided in the LATEX style files. Aliases from the SIUnits package [11] will also be
included, so that texts created with SIUnits can easily be converted to include OM
annotations.</p>
        <p>
          More importantly, however, for our purposes, is the addition of the hyperlink
pointing to the relevant concept. To facilitate this, the following aliases were defined:
\newcommand{\Acceleration}[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ][]{
        </p>
        <p>\annot{Acceleration}{\vect{a}}{#1}
}
In line 2, we use the \annot command to create an alias for the quantity acceleration
with URI: om:Acceleration and default symbol ’a’. In line 6 the same is done for the
unit metre per second squared (URI: om:metre_per_second_squared). The first
parameter to the \annot command (Acceleration in line 2, and
metre_per_second_squared in line 6) provide semantic annotations to the mathematical expression.
The second (\vect{a} in line 2, and \unit{m} \unit{s^{-2}} in line 6) and third
parameters (#1 in both line 2 and 6) are only concerned with typesetting.</p>
        <p>All LATEX commands representing quantities and units can also be used with user
defined symbols simply by adding an (optional) parameter to a command. For
instance, the command \LuminousFlux produces the symbol for the quantity
luminous flux ’F ’ with a link to the related concept (om:Lumious_flux) in the OM
ontology. If the author wants to use another symbol to represent luminous flux, he or she
can achieve this by specifying the alternative symbol as an argument:
\LuminousFlux[\Phi] produces ’©’, still linked with the same concept in the OM
ontology. If desired sub- and superscripts can also be used in the argument:
\LuminousFlux[F_{\lambda}] produces ’F¸’, again linked with the same URI.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.4 URI and equation extraction</title>
        <p>When using the typesetting tool pdflatex to create PDF files from the LATEX source,
the URIs representing the unit and quantity concepts are inserted as hyperlinks into
the PDF. To use these annotations we have to parse the PDF files to find the
hyperlinks (URIs). Using Apache’s PDFBox http://pdfbox.apache.org/ we were able to
create a small Java tool to parse the PDF files and extract the URIs representing
concepts in OM and linking these URIs to the text.</p>
        <p>Using this setup we are able to extract OM concepts (units and quantities) from a
text generated with OM-annotated LATEX. We would, however, also like to extract the
semantics of statements like V Æ 15.2 m3 (i.e. we would like to extract the fact that
the quantity volume has a value of 15.2 in units of cubic metre). To this end we have
also added the functionality of finding binary (Æ, Ç, È, ¼, etc.) relations in the text to
the extraction tool.</p>
        <p>When a PDF is parsed by the extraction tool, URIs for units and quantities,
numeric values and binary relations are tagged in the text. Operators, like \E are also
recognised, and in the case of exponents, the value is changed accordingly (e.g. 5.2 £
103 is changed to 5200). The tool then applies rules to find patterns in the text like:
[QUANTITY] [BINARY_RELATION] [VALUE] [UNIT]
If the tool comes across such a pattern, the combination of quantity, relation, value,
and unit is stored. For instance the equation:</p>
        <p>Ek Æ 1.209 £ 10¡2 eV
is extracted as:
1 [QUANTITY=om:Kinetic_energy] [BINARY_RELATION=’=’]
2 [VALUE=0.01209] [UNIT=om:electronvolt]
In this manner quantitative statements can be extracted from PDF generated with
OM annotated LATEX.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.5 Transformation to RDF</title>
        <p>The result of the extraction can then be transformed into RDF statements using the
OM ontology. For instance, the following equation
or_omme:ausnuirte_omf_emnte_ascsuarlee_
e
p
y
:ftr
s
d
om:newton</p>
        <p>INSTANCE
can be transformed into RDF (in turtle format [23]):
where mm and mq are prefixes for custom defined namespaces (possibly pointing to
the URI for the original text, thereby ensuring provenance) for measures and
quantities respectively, and om is the prefix for the OM namespace. This statement can also
be visualised as a graph (Figure 1).</p>
        <p>The current extraction tool is not only able to create the statements to model the
equation in RDF, but it is also able to export these RDF statements to an RDF triple
store, where it can be combined with other semantic data extracted from the PDF, or
obtained from other sources.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Real-world examples</title>
      <p>The number of detected measures depends on the type of paper; experimental
papers tend to contain more measures than theoretical papers. For example the fifth
page of a paper on water vapour sorption in gluten and starch films contains the
following text:
[...] The obtained parameter values for starch films (T g Æ 540§10 K and ¢Cp Æ
0.32 § 0.02 J g¡1 K¡1) differ from the values obtained by van der Sman and
Meinders (2011) from the data of starches from different botanical sources
and obtained with different experimental techniques (T g ¼ 475 K, ¢Cp ¼
0.43 J g¡1 K¡1) but are in close agreement with the values obtained by
Bizot et al. (1997) for pea amylose determined using DSC (T g Æ 589 K and
¢Cp Æ 0.27 J g¡1 K¡1). [...]4
which contained six measures. Our extraction tool extracted all six measures
correctly, for example, the following RDF triples were extracted for the last measure of
change of water specific heat (¢Cp Æ 0.27 J g¡1 K¡1):</p>
      <p>As one can see the measure has been extracted successfully and is correctly
modelled in terms of OM. At the moment, we assume that ¢Cp is one symbol for a specific
quantity and not a mathematical operation. Finally, it is not possible to add error
values to the conceptual model (e.g. T g Æ 540 § 10 K) and to distinguish between Æ and
¼.</p>
      <p>The extracted triples do not specify the source of the specific heat (i.e. the specific
heat for pea amylose determined using DSC). This is a case for further (automatic)
semantic annotation beyond OM. Including more extensive annotation would make
the semantic information even more valuable. One could start searching in scientific
RDF databases for articles on "specific water vapour heat" with a value between "0.2"
and "0.3 J g¡1 K¡1".</p>
      <p>As a second example of measures in an experimental paper consider the abstract
of a paper on observations of a young star. The abstract alone contains six measures:
We present CS(J=2-1) interferometric observations obtained with the Nobeyama
Millimeter Array (NMA) toward a protostar (GH2O 092.67 + 03.07) in the
Cygnus OB7 giant molecular cloud (distance Æ 800 pc). The data clearly
indicate the presence of a compact (size ' 8£103 AU) and young out-flow with
dynamical time scale ' 3500 year. [...] We derive a total mass of ' 0.6 M¯ and
' 12 M¯ for the outflow and disk respectively. The comparison of the NMA
data with a simple model of infalling disk indicates a mass of the central
object in the range 4.0 Ç M Ç 7.5 M¯. [...]5
In this example the names of the quantities are annotated as text (not math, e.g. ’size
' 8 £ 103 AU’) as in: \Diameter[size] and is actually parsed correctly by our
extraction tool:</p>
      <p>Please note that the numerical value containing an exponent (8 £ 103) is interpreted
correctly (8000).</p>
    </sec>
    <sec id="sec-6">
      <title>5 Discussion</title>
      <p>Embedding numerical facts in otherwise textual documents incurs a tension
between the use of natural language and structured formats. We submit that scientists
should be able to put their arguments forward with minimal technical constraints.
On the other hand, embedding RDF-OWL type annotations eliminates ambiguity
and simplifies computer processing. For example, consider the following statement:
The water vapor permeability for optimal crispness and crumb softness
retention was 8 £ 10¡9 g/(m s Pa).6
It is possible to request the author to annotate individual quantities and units of
measure (and concepts), but it would also be possible to have the author provide
RDF triples for the entire sentence. The first option seems less attractive from a
computer processing perspective. In that case, more effort is required to parse the
information into an equivalent RDF triple afterwards. Nevertheless we choose to stay
close to normal writing as much as possible. By annotating at the level of quantities
and units only, precisely enough formalisation is provided to enable automated
construction of the composite triple. Moreover, by using the alias mechanism provided
by LATEX and our definition of \annot, the natural language style is approximated as
much as possible.</p>
      <p>This paper describes how units and quantities can be annotated. However, the
value of such annotation is limited if it is not clear to which objects or phenomena
these quantities refer. For example, V Æ 15.2 m3 only becomes a useful fact if we know
that it refers to a container containing water, or even to a specific container in an
experiment. This would require annotating objects and phenomena using
domainspecific ontologies, and relating them to the quantities used. A simple generalisation
of our approach is to include the full URI in the \annot construct. This allows the
user to link any object to an ontological class or instance. However, some heuristic
processing would still be needed to link these objects to the annotated quantities. We
consider this a necessary step in our method, but beyond the scope of the present
paper.</p>
      <p>Finally we note that OM, the ontology of quantities and units, is accessible through
a set of web services that provide additional functionality if data is annotated along
the above lines. They allow automatic checking of combinations of units and
quantities for correctness and completeness, but also automatic unit conversions. These
can be useful aids during paper writing or reading.
6</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>For the research presented in this paper we have created a set of style files for LATEX
that refer to concepts from an ontology of units and quantities. By using the
commands used in these style files, quantities and units are annotated directly. The
concept’s URI is included as a hyperlink when generating the PDF. Using these
annotations we have been able to extract triples from the PDF and insert them into an RDF
triple store, which can be queried with specific querying constraints. The LATEX style
files and corresponding PDF extraction tool will be made available in the near future.</p>
      <p>In a broader sense it becomes feasible to do more with the annotated data, such
as unit conversion, checking of dimension and unit consistency, integrating,
performing computational methods on the data, etc. This functionality is available via
OM web services [16]. To make the data even more reusable, it will be important to
extract other concepts than quantities and units, such as the object or event that a
value of such quantity refers to.</p>
      <p>Ideally we should be able to annotate existing papers automatically. Frameworks
such as GATE [15], which provide automatic annotation will play an important role
in this endeavour. In earlier work [2] we have drafted heuristic rules for interpreting
and formalising quantitative information in spreadsheets. This research could be
extended towards quantitative information in scientific papers. At this moment,
however, automatic annotation of measurements cannot be performed reliably enough
in the cases we observed, which are intrinsically ambiguous and incomplete [2].
So, manual (and therefore, user-validated) annotation by authors is still required.
The described method in this paper helps to annotate quantitative concepts such as
quantities and units of measure, using the embedded URLs.</p>
      <p>As the user will likely be using alias commands in LATEX anyway, extending these
with semantic annotations does not require extra effort for the user and these
annotations are, therefore, relatively for free. The user only needs to include the OM-LATEX
package and is then able to use aliases with names close to the actual names of the
units or measures, making the text easier to read.</p>
      <p>In the light of extending this approach, we aim to investigate whether integration
with STEX, SIunits, or SALT is possible. And as it is useful to annotate mathematical
relations and operators, we will, moreover, define the URIs for relations and
operators and more in OQR (Ontology of Quantitative Research) [24].
2. van Assem, M., Rijgersberg, H., Wigham, M., Top, J.: Converting and Annotating
Quantitative Data Tables. In Patel-Schneider, P.F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J.,
Horrocks, I., Glimm, B., eds.: Proc. 9th Int’l Semantic Web Conf. (ISWC’10). Number 6496
in LNCS, Springer-Verlag (2010)
3. Berners-Lee, T., Hendler, J.: Publishing on the Semantic Web. Nature (April 26) (2001)
1023–1025
4. Hey, T., Trefethen, A.: Cyberinfrastructure for e-science. Science 308 (2005) 817 – 821
5. Mittelbach, F., Goossens, M., Carlisle, J.B.D., Rowley, C.: The LATEXCompanion, 2nd edition
(TTCT series). Addison-Wesley, Reading, Massachusetts
6. W3C: Resource Description Framework (RDF). http://www.w3.org/RDF/ (2004)
7. Prud’hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF. http://www.w3.</p>
      <p>org/TR/rdf-sparql-query/ (2008)
8. De Waard, A.: From proteins to fairytales: Directions in semantic publishing. IEEE
Intelligent Systems (March/April) (2010)
9. Shum, S.B., Clark, T., de Waard, A., Groza, T., Handschuh, S., Sandor, A.: Scientific
discourse on the semantic web : A survey of models and enabling technologies.
Semantic Web Journal Interoperability Usability Applicability (Special Issue on Survey Articles)
(2010)
10. Kohlhase, M.: Semantic Markup for TEX/LATEX. (2004)
11. Heldoorn, M.: The SIunits package. Consistent applications of SI units. (2007)
12. Groza, T., Handschuh, S., MŽller, K., Decker, S.: Salt - semantically annotated latex for
scientific publications. Lecture Notes in Computer Science 4519 (2007) 518–532
13. Ausbrooks, R., Buswell, S., Carlisle, D., Chavchanidze, G., Dalmas, S., Devitt, S., Diaz, A.,
Dooley, S., Hunter, R., Ion, P., Kohlhase, M., Lazrek, A., Libbrecht, P., Miller, B., Miner, R.,
Rowley, C., Sargent, M., Smith, B., Soiffer, N., Sutor, R., Watt, S.: Mathematical Markup
Language (MathML) Version 3.0. http://www.w3.org/TR/MathML3/ (2010)
14. Harder, D.W., Devitt, S.: Units in MathML. http://www.w3.org/TR/mathml-units/
(2003)
15. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A framework and
graphical development environment for robust NLP tools and applications. In: Proceedings of
the 40th Anniversary Meeting of the Association for Computational Linguistics. (2002)
16. Rijgersberg, H., Wigham, M., Top, J.L.: How semantics can improve engineering processes:
A case of units of measure and quantities. Advanced Engineering Informatics 25(2) (2011)
276–287
17. Knuth, D.E.: The TEXbook. Addison-Wesley (1986)
18. Gruber, T., Olsen, G.: An Ontology for Engineering Mathematics. In: Fourth
International Conference on Principles of Knowledge Representation and Reasoning, Morgan
Kaufmann (1994)
19. Taylor, B.N.: Guide for the use of the International System of Units (SI). 2008 edn.
Technical report, National Institute of Standards and Technology (2008)
20. W3C: Owl 2 web ontology language. Technical report, World Wide Web Consortium (W3C)
(2009)
21. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data - The Story So Far. International Journal
on Semantic Web and Information Systems 5 (2009) 1–22
22. Rijgersberg, H., van Assem, M., Wigham, M., Broekstra, J., Top, J.: Ontology of units of
measure (OM). http://www.wurvoc.org/vocabularies/om-1.8/ (2010)
23. Beckett, D., Berners-Lee, T.: Turtle - Terse RDF Triple Language. http://www.w3.org/</p>
      <p>TeamSubmission/turtle/ (2011)
24. Rijgersberg, H., Top, J.L., Meinders, M.: Semantic Support for Quantitative Research
Processes. Intelligent Systems 24(1) (2011) 37–46</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Oro</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruffolo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <string-name>
            <surname>PDF-TREX</surname>
          </string-name>
          :
          <article-title>An Approach for Recognizing and Extracting Tables from PDF Documents</article-title>
          .
          <source>In: Proceedings of the 10th International Conference on Document Analysis and Recognition</source>
          . (july
          <year>2009</year>
          )
          <fpage>906</fpage>
          -
          <lpage>910</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>