<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying Ontologies in the Dairy Farming Domain for Big Data Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jack P.C. Verhoosel</string-name>
          <email>jack.verhoosel@tno.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacco Spek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TNO Data Science</institution>
          ,
          <addr-line>Soesterberg</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the Dutch SmartDairyFarming project, main dairy industry organizations like FrieslandCampina, AgriFirm and CRV work together on better decision support for the dairy farmer on daily questions around feeding, insemination, calving and milk production processes. This paper is concerned with the inherent semantic interoperability problem in decision support information in a variety of big data sources containing static and dynamic sensor data of individual cows. Semantic alignment is achieved using ontologies and linked data mechanisms on a large amount of sensor data, such as grazing activity, feed intake, weight, temperature and milk production of individual cows at 7 dairy farms in The Netherlands. A Common Dairy Ontology (CDO) and a specific measurement ontology have been developed and used to transform 12GB of yearly sensor data into 350GB of RDF triples, made accessible via a SPARQL interface on the Apache Jena Fuseki triplestore. A few example applications have been developed to show how the CDO can be used for decision support and historic analysis. The performance of our linked data semantic solution is acceptable for analysis queries on large sets of data. Without optimization of queries the time for answering queries ranged from a few seconds to a couple of minutes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Dairy farmers are currently in an era of precision livestock farming in which
information provisioning for decision support is becoming crucial to maintain a
competitive advantage. Therefore, getting access to a variety of data sources on and off the
farm that contain static and dynamic individual cow data is necessary in order to
provide improved answers on daily questions around feeding, insemination, calving and
milk production processes.</p>
      <p>The process of selecting data sources was done with the wellbeing of the animals
in mind. The sensors were applied to the animals, when necessary, by the farmers
themselves. Also, these were non-invasive sensors like step-counters. Most sensors
are external sensors attached to the machines that already interact with the animals
(like milk-robots and feeding equipment). All data could be collected without causing
additional stress or discomfort to the animals. The goal of the overall project in which
this research was done was also to measure the improvement of the quality of life of
the animals. The results can be used to better cater the individual needs of the cows,
and be able to detect symptoms of illness of the animal, making a positive impact on
their wellbeing.</p>
      <p>In the Dutch SmartDairyFarming project, we work together with the main dairy
industry organizations such as FrieslandCampina, AgriFirm and CRV, and use sensor
equipment to monitor cows at 7 dairy farms in The Netherlands. Thereby, a large
amount of sensor data is generated on grazing activity, feed intake, weight,
temperature and milk production of individual cows. A challenge in this sense is how to
tackle the problem of semantic interoperability between the concepts present in dairy
farming data sources. Semantic alignment of the different meanings of similar
concepts in various data sources is therefore necessary for improved decision support and
historical analysis.</p>
      <p>We have focused on the use of ontologies and linked data mechanisms as a
solution direction for this semantic interoperability problem. A Common Dairy Ontology
(CDO) has been developed that serves as the main semantic interface to applications
for dairy farming decision-making and analysis. Besides semantic alignment, the
CDO also enables the reasoning on the dairy concepts and therefore on a large variety
of different analysis questions. In addition, all sensor data has been transformed into
triples according to the Resource Description Framework (RDF) standard 1, made
available in a triplestore and accessible via a SPARQL engine2. With a few example
applications we have measured the performance of this solution and assessed the
feasibility and performance of our solution direction.</p>
      <p>In the remainder of this paper we will describe subsequently our linked dairy
sensor setup, the ontologies used for semantic alignment, the test applications and the
performance of our setup.</p>
      <p>The main contribution of our research is the practical insights in the challenges that
arise when large amounts of sensor data are stored as triples. We have gained insights
in performance behavior that arise when working with different technologies. We
have also found some new challenges for future research, like how well do
technologies that provide a mapping between traditional relational-databases (that are known
to perform well with large amounts of data) and triples perform, and scale.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>Sensors are a great tool for collecting near real-time data, vast networks of sensors
are collecting more data than can be processed[1]. The sensor data can be enriched
with semantic metadata to increase the possibility for interoperability between sensor
networks and can also provide contextual information to the sensor data. Semantic
web technologies can help to achieve this and can aid in the discovery and integration
of new data sources. There have been efforts to enrich sensor data with semantic
metadata in several domains.</p>
      <p>In the agriculture domain, it is critical to monitor various environmental parameters
(for example temperature, moisture, pH and electric conductivity) to sustain the best
environment for the growth of plants. With the help of semantic technology like the
semantic web in combination with semantic data integration, alerts can be set to
signal the farmer of any issues regarding the attributes[2].</p>
      <sec id="sec-2-1">
        <title>1 https://www.w3.org/RDF/ 2 http://www.w3.org/TR/sparql11-query/</title>
        <p>In the industrial building context, there is great value in a high quality alerting
system, since the cost of fixing damages is significantly higher than implementing an
early warning system that signals the possibility of damage happening to a building.
Such a system, using sematic sensor technologies is proposed in [3] and allows for
notifications to be sent in case of water leaks. But also when a dangerous combination
of factors occur, like the buildup of pressure and temperature, which could indicate a
fire.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Linked dairy sensor data</title>
      <p>At each of the 7 dairy farms involved in our SmartDairyFarming project3 sensor
equipment is installed to monitor an average of 400 cows per dairy farm. These cows
are continuously monitored since 2014, which has generated a yearly 12GB of sensor
data. This sensor data is made available via a software component called the
InfoBroker to be used by various different applications for the dairy farmer. Our
approach is to transform this sensor data into linked data in order to make it
semantically rich and easily accessible.</p>
      <p>To do this in a methodological way, we have developed a Linked Data Roadmap.
This roadmap is developed to standardize the process of converting data of different
types into linked data. In this roadmap we have defined nine steps containing
toolsand best practices to generate high quality linked data. The Linked Data Roadmap is
visualized in Fig. 1. For the sake of simplicity we will not describe the details of the
roadmap here and further details can be found at the website of PDLN4.</p>
      <p>We applied the process defined in this roadmap to make the data that is currently
available through the InfoBroker available as linked data. Thereby, it becomes
possi</p>
      <sec id="sec-3-1">
        <title>3 http://www.smartdairyfarming.nl</title>
        <p>4 http://www.pilod.nl/wiki/BoekTNO/stappenplan
ble to query this data using semantic queries and make use of the advantages that this
technology offers.</p>
        <p>For the second step of the roadmap on preparing the data, we needed to extract all
the data from the InfoBroker. Because the InfoBroker offers its data as a REST-API,
we designed a python script that extracts data from the InfoBroker by calling a
sequence of API calls, and transforms these JSON responses into CSV files. We then
used GoogleRefine5 to determine the quality of the data to make a clean and
consistent dataset.</p>
        <p>Based on this clean dataset we applied step three of the roadmap on modeling the
data. Modeling data in terms of linked data means defining an ontology or knowledge
model for the data. This can be seen as the database schema, it defines the meaning of
things in terms of relationships. For this case, we defined a specific measurement
ontology based on three basic classes: Cow, Sensor and Measurement6. This ontology
is depicted in Fig. 2.</p>
        <p>With this ontology it was possible to express all the data offered by the InfoBroker
in terms of triples of the form &lt;subject, predicate, object&gt;. We then defined a
sensible naming convention for the URI’s, so that all URI’s would be unique and
meaningful.</p>
        <p>Finally, we’ve developed a python script that automates the process of converting
the CSV files to RDF files. This script uses the API of GoogleRefine to automatically
convert the CSV to RDF triples. This script uses earlier defined operations on the
CSV data (including the definition of the RDF structure). This script then exports the
GoogleRefine projects as RDF files, which it then uploads to our triplestore, so
expose the data and make it queryable using the included SPARQL endpoint. The yearly
12GB sensor data was transformed into 310GB of triple data.</p>
      </sec>
      <sec id="sec-3-2">
        <title>5 http://openrefine.org 6 http://minion02.sensorlab.tno.nl/ontologies/SDF.ttl</title>
        <p>We’ve chosen to use Apache Fuseki7 as our triplestore and SPARQL endpoint. It
uses Apache TDB (Triple DataBase) which is a graph database for storing of triples.
Apache Fuseki in combination with the Apache TDB triplestore offers advanced
SPARQL support, including federated queries, as well as a very high performance
compared to for instance Apache Marmotta8. The linked data is queryable through the
Apache Fuseki SPARQL endpoint. This data is kept up-to-date with a fully automated
process that daily extracts data from the InfoBroker and stores it as triples in the
Apache Fuseki store.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Semantic alignment</title>
      <p>One of our goals is to enable the answering of analysis questions on the
combination of large sources of measured sensor data. One of these questions is for instance
“What is the average weight per day over the last lactation period of a cow of a
farmer?” In order to achieve this, we have developed a Common Dairy Ontology
(CDO)9 that is meant to contain the main, common dairy farming concepts. The CDO</p>
      <sec id="sec-4-1">
        <title>7 https://jena.apache.org/index.html 8 http://marmotta.apache.org 9 http://minion02.sensorlab.tno.nl/ontologies/cow-model.ttl</title>
        <p>ontology includes concepts like Farm, Farmer, Cow, Weight, Milkyield, Activity,
Feed, Parcel, Equipment and so on. In addition, the CDO ontology covers the most
important relationships between these concepts. The CDO then functions as the
“semantic interface” to the users of the information that is captured by the sensor data
system. The users can be farmers, advisors of farmers or other stakeholders around
the farm and they want to express their information needs in terms of these common
concepts in the CDO. See Fig. 3 with an excerpt of the concepts in the CDO.</p>
        <p>The CDO can be used as the knowledge model for accessing the sensor data that
has been modeled as triples using the specific measurement ontology. In order to do
this, we have made a mapping from the CDO to the specific measurement ontology
and the specific parameters that are measured by the sensor equipment. For example,
the common concept Weight is mapped onto various different parameters for weight
measured by different weighing equipment at the farm, such as Lely.BodyWeight and
GallagherDairyScale.dsweight. In addition, the concept Activity is mapped onto
various specific parameters for activity, such as Lely.Activity2hours and
DeLaval.HoogActNiveau. We used the rdfs:label mechanism to make the mapping between
the classes and properties of the CDO and the measurement ontology.</p>
        <p>See Fig. 4 with an example of the mapping between the CDO and the measurement
ontology. This mapping is being used when big data analysis question are asked to the
CDO for which measured sensor data is needed.</p>
        <p>Common ontology
rdfs:label = “Activity2hours”</p>
        <p>rdfs:label = “BodyWeight”</p>
        <p>Measurement ontology</p>
        <p>In order to use the large set of historical sensor data and to assess the performance
of the triple solution, big data applications have been developed that focus on the
analysis of possible patterns in the data of individual cows over the period of one
year.</p>
        <p>One application looks at the relation between bodyweight and milk yield of
individual cows during the lactation period in 2014. For each individual cow the
development of bodyweight and milk yield during the lactation period can be drawn in a
graph. In addition, the increase/decrease of the bodyweight on a weekly basis during
that same period can be drawn as well. Finally, an overall view of the average weight
over all the cows of the same parity can be depicted. Using these views, the farmer
can derive possible relationships between the bodyweight and the milk yield. Another
application tries to find similar relations between the different types of feed and milk
yield of individual cows during the lactation period in 2014. For each individual cow
the development of total feed intake and milk yield during the lactation period can be
drawn in a graph. In addition, the division of the total feed intake over various feed
types during that same period can be drawn as well. Finally, an overall view of the
total feed intake over all the cows of the farmer can be depicted. Using these views,
the farmer can derive possible relationships between the intake of different types of
feed and the milk yield. See Fig. 5 for a view on the front end of the applications.</p>
        <p>Each of the graphs that can be shown by the applications is build up based on the
result of a SPARQL query on the large set of triples. The SPARQL queries use
FILTER statements to select those triples that are related to measurements for the
specific cow number, the correct time period and the measured parameters. See Fig. 6
for the SPARQL query code for two of the queries that we used to select feed and
weight data from the Fuseki triplestore.
6</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Performance</title>
      <p>We’ve started our experiments with the Apache Marmotta triplestore including a
SPARQL endpoint. However, it turned out that our large volume of data had a
significant impact on the performance of the Marmotta triplestore.</p>
      <p>The bad performance of the Marmotta triplestore was likely due to the fact that
Marmotta stores the data in a relational database, and translates this to triples when
it’s queried. We tried to optimize the Marmotta and the underlying PostgreSQL
database configuration. However, performance remained unacceptable for our purposes.
Another drawback of the Marmotta server is the limited support of the full
SPARQL1.1 standard and federated queries. This was a functionality we wanted to
have to be able to query the data combined with other datasets offered through other
SPARQL endpoint. Apache Fuseki performed a lot better in that sense. This is most
likely because Fuseki implements a native triple graph database, instead of a
relational database that converts data into triples.</p>
      <p>Query Input
Select an overview with the number of cows Farmer-S
of a farmer Farmer-B</p>
      <p>Farmer-A
Select the list of cows with number and parity Farmer-S</p>
      <p>Farmer-B</p>
      <p>Farmer-A
Select feed per type per day over all cows of a Farmer-S
farmer Farmer-B</p>
      <p>Farmer-A
Select average weight over all cows per day Farmer-A
per parity
Select static info for a cow NL 715820911
Select weight per day in lactation period NL 715820911
Select weight and milkyield per day in NL 715820911
lactation period
Select milkyield per day in lactation period NL 715820911</p>
      <p>We did a number of measurements on the response times for the SPARQL queries
executed in our application. In Fig. 7 we present an overview of the queries, the input
parameters (farmer or cow), the graph size in number of triples in which the query is
executed, the number of search parameters in the query and the response time in
seconds.</p>
      <p>An interesting observation is that there is hardly any pattern to be recognized
between the graph size, the number of search parameters and the response times.
Apparently, the SPARQL search engine has specific ways of indexing to assist the search
process through the set of triples. The performance of our linked data semantic
solution is acceptable for analysis queries on large sets of data. Without optimization of
queries the time for answering queries ranged from a few seconds to a couple of
minutes. This is acceptable for analysis purposes, but for analyzing real-time data, this
is probably not acceptable. The more complex queries take several minutes to
generate a response (this could even take up to 30 minutes for a complex query when the
server-load is high). A farmer might not be willing to wait that long for a response.
7</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and future work</title>
      <p>The overall conclusion of our work presented in this paper is that the use of linked
data mechanisms on large sets of sensor data is feasible. Even if big data in the order
of hundreds of gigabytes is put into RDF triple format, it remains accessible with
reasonable response times for analysis purposes. For near real-time purposes the
question remains whether this solution approach is still feasible, but this is a topic for
further work. Another future work topic is to further extend the Common Dairy
Ontology with more dairy concepts and make it available as the semantic interface in
conjunction with the InfoBroker or a local relational database. In that setup the sensor
data is not being put into RDF triples and the CDO is used as mapping tool towards
the specific sensor data sources.
8</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Lefort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barnaghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Corcho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Graybeal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herzog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Page</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Castro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Compton</surname>
          </string-name>
          , “
          <article-title>Semantic Sensor Network XG Final Report</article-title>
          .”
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Henson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Sahoo</surname>
          </string-name>
          , “Semantic sensor web,
          <source>” IEEE Internet Comput.</source>
          , vol.
          <volume>12</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>83</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Zarri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sabri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chibani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Amirat</surname>
          </string-name>
          , “
          <article-title>Semantic-Based Industrial Engineering: Problems</article-title>
          and Solutions,” Complex,
          <source>Intelligent and Software Intensive Systems (CISIS)</source>
          , 2010 International Conference on. pp.
          <fpage>1022</fpage>
          -
          <lpage>1027</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>