=Paper=
{{Paper
|id=Vol-2180/paper-18
|storemode=property
|title=RichRDF: A Tool for Enriching Food, Energy, and Water Datasets with Semantically Related
Facts and Images
|pdfUrl=https://ceur-ws.org/Vol-2180/paper-18.pdf
|volume=Vol-2180
|authors=Mohamed Gharibi,Praveen Rao,Nouf Alrasheed
|dblpUrl=https://dblp.org/rec/conf/semweb/GharibiRA18
}}
==RichRDF: A Tool for Enriching Food, Energy, and Water Datasets with Semantically Related Facts and Images==
Mohamed Gharibi, Praveen Rao, and Nouf Alrasheed
University of Missouri-Kansas City (UMKC), Kansas City MO, USA
mggvf@mail.umkc.edu, raopr@umkc.edu, nalrasheed@mail.umkc.edu
Abstract. Food, energy, and water (FEW) are the key resources to sustain human
life and economic growth on Earth. While there is a plethora of information
related to FEW systems online, there is a lack of reliable knowledge management
tools that enable easy consumption of such information. In this paper, we present
a web-based tool called RichRDF with the goal of enriching existing FEW systems
with semantically related facts and images. The main features of RichRDF
include (1) an entity extraction algorithm that extracts meaningful subjects from
Resource Description Framework (RDF) statements using natural language
processing (NLP) techniques, (2) a reliable approach to add semantic similarity
scores and relationships between different RDF subjects based on ConceptNet,
(3) an efficient way to use the numbers of WordNet synsets to request the
associated images from ImageNet, and (4) a user-friendly interface that allows
users to load and convert FEW datasets to RDF and then query the RDF datasets
using an existing SPARQL engine. A video highlighting the key features of
RichRDF is available at https://youtu.be/vyHgh4LgKCo.
Keywords: Knowledge Graphs, Food, Energy, and Water, RDF.
1 Introduction
Food, energy, and water (FEW) are interdependent resources that are
essential to life on Earth. Tremendous stress on these
resources is expected by 2050 due to population growth, natural disasters, and human
activities, which emphasizes the need to improve available FEW resources. The United
Nations has classified FEW components as a high priority within their
sustainable development goals [1].
Meanwhile, there are several federal agencies such as the United States
Department of Agriculture (USDA) and the National Drought Mitigation Center
(NDMC) that provide massive amounts of data related to FEW systems. However, the
available data exist in CSV, XML, and JSON formats that are not readily consumable
in the world of Linked Data (LD; see http://linkeddata.org).
Today, billions of RDF triples are available on the Web for developing new
Semantic Web applications [1]. These triples are expressed as (subject, predicate,
object) to represent entities and their relationships within a knowledge base. These
relationships can be indicated by using different ontologies such as FOAF and DBpedia
[2]. A specific Internationalized Resource Identifier (IRI) can be added as the context
to RDF triples. Such statements are called RDF quads. Here is an example of an RDF
triple capturing the relationship between Oswego and the United States [2]:
.
The first term in the triple represents the subject, which is ‘Oswego.’ The second
term is the predicate ‘country’, which is provided by the DBpedia ontology, and
the last term of the triple represents the object, the ‘United States.’ Such triples can be
generated from files in different formats including JSON, CSV, and TSV. Converting
such files to RDF triples will allow users to express semantic relationships between
subjects and objects and to structure information using RDF graphs. Moreover, these
RDF triples will provide several benefits over processing raw files, including the ease
of integration and use, modeling of semantic similarity between entities, and the ability
to query the knowledge base using a SPARQL engine.
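As a minimal sketch of the conversion described above, the following Python snippet turns one CSV record into (subject, predicate, object) tuples. The namespace `http://example.org/few/`, the column names, and the helper `row_to_triples` are hypothetical choices made for illustration; they are not taken from RichRDF or Karma.

```python
import csv
import io

BASE = "http://example.org/few/"  # hypothetical namespace, not from the paper


def row_to_triples(row, key):
    """Turn one CSV record into (subject, predicate, object) tuples.

    The value in column `key` names the subject; every other non-empty
    column becomes a predicate whose object is the cell value.
    """
    subject = BASE + "resource/" + row[key].replace(" ", "_")
    return [
        (subject, BASE + "property/" + column, value)
        for column, value in row.items()
        if column != key and value != ""
    ]


# A one-row CSV file mirroring the Oswego example in the text.
sample = "name,country\nOswego,United States\n"

triples = []
for record in csv.DictReader(io.StringIO(sample)):
    triples.extend(row_to_triples(record, key="name"))

for s, p, o in triples:
    print(f'<{s}> <{p}> "{o}" .')
```

Here the object is emitted as a plain literal; a real integration tool would typically map it to an IRI from an ontology such as DBpedia.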
While there are several tools that can produce RDF datasets (e.g., Karma [3]), they
do not allow us to easily add new assertions in the form of triples or quads to a dataset.
Moreover, the lack of reliable knowledge graphs serving FEW systems has motivated
us to build our own knowledge graph that helps in decision-making, enriching FEW
datasets by providing extra knowledge and images based on the semantic similarities
between the dataset entities, improving knowledge discovery, simplifying access, and
providing better search results. Our system, RichRDF, employs RDF, Web Ontology
Language (OWL), and SPARQL to construct and query the FEW knowledge graph.
2 RichRDF
The overall architecture of RichRDF is shown in Fig. 1. RichRDF executes in four
stages. Before starting each stage, it checks that the conditions for running that stage
are met, so that the total processing time is minimized.
Fig. 1: Architecture of RichRDF
We assume that an input CSV file is converted into RDF using an integration tool
such as Karma. The first stage of RichRDF checks the structure of the file uploaded by
a user. If the file contains RDF quads, then the file will be ready for the next stage of
processing. Otherwise, the user is asked to provide a context IRI, which will be added
as the context of each RDF triple to produce RDF quads. In the second
stage, RichRDF applies NLP techniques [4] to the subjects to extract the meaningful
entities, which are used later on for further processing. The main goal of
entity extraction is to identify a real entity as the main subject, since subjects may be a
word, a couple of words, long text, or a number like an ID. The
extracted entities will subsequently represent an entire subject.
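The first stage can be sketched as follows, assuming the input is line-based N-Triples or N-Quads. The regular expression is a simplified term matcher rather than a full parser, and the function name `ensure_quad` and the example IRIs are hypothetical.

```python
import re

# One RDF term: an IRI, a blank node, or a literal with an optional
# datatype or language tag. Simplified; not a complete N-Quads grammar.
TERM = re.compile(
    r'<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)


def ensure_quad(line, context_iri):
    """Return the statement as a quad, adding the context IRI if needed."""
    terms = TERM.findall(line)
    if len(terms) == 4:  # already a quad: leave it unchanged
        return line.strip()
    if len(terms) == 3:  # a triple: append the user-supplied context
        return " ".join(terms + [f"<{context_iri}>"]) + " ."
    raise ValueError(f"expected 3 or 4 RDF terms, found {len(terms)}: {line!r}")


triple = (
    "<http://example.org/few/resource/Oswego> "
    "<http://example.org/few/property/country> "
    '"United States" .'
)
quad = ensure_quad(triple, "http://example.org/few/graph/usda")
print(quad)
```

Calling `ensure_quad` on a line that is already a quad returns it unchanged, which matches the behaviour described above where files containing quads proceed directly to the next stage.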
RichRDF processes two quads at a time; therefore, there are several possibilities
during the entity extraction. Consider two subjects that end with the following strings:
“CHEESE,COTTAGE,CRMD,W/FRU” and “BUTTER,PDR,1.5OZ,PREP,W/1/1.HYD”.
After the entity extraction stage, “CHEESE,COTTAGE” denote the entities extracted
from the first subject and “BUTTER” from the second subject. We use the “isA” relation
to link the original subjects with the extracted entities. The third stage of RichRDF queries
ConceptNet [5] with the extracted entities to obtain the relationship between these
subjects. If a relationship exists between the first and the second subject, another
request will be generated to fetch the semantic similarity score. We use RDF
reification/blank nodes to represent the similarity score between the original subjects.
The semantic similarity score and the relationship between subjects can be exploited
during information retrieval and NLP, and they also allow a search to be expanded to ontology
keywords. Furthermore, these scores can be used in advanced machine learning
techniques to understand the data in a better way and to build better models [6].
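The third stage might look roughly like the sketch below. It assumes the public ConceptNet web API at api.conceptnet.io with its `/query` and `/relatedness` endpoints; the paper does not state which endpoints RichRDF calls. The `http://example.org/few/` vocabulary, the function names, and the score value are hypothetical placeholders, and no network request is made here.

```python
from urllib.parse import urlencode

API = "http://api.conceptnet.io"  # assumed endpoint, not named in the paper
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/few/"  # hypothetical vocabulary


def concept_uri(entity, lang="en"):
    """Normalise an entity string to a ConceptNet term URI."""
    return f"/c/{lang}/" + "_".join(entity.lower().split())


def relation_query_url(a, b):
    """URL asking for the edges that connect two concepts."""
    params = {"node": concept_uri(a), "other": concept_uri(b)}
    return f"{API}/query?" + urlencode(params, safe="/")


def relatedness_url(a, b):
    """URL asking for the relatedness score of two concepts."""
    params = {"node1": concept_uri(a), "node2": concept_uri(b)}
    return f"{API}/relatedness?" + urlencode(params, safe="/")


def similarity_quads(subj_a, subj_b, relation, score, context, node="_:sim0"):
    """Reify 'subj_a relation subj_b' on a blank node and attach the score."""
    g = f"<{context}>"
    return [
        f"{node} <{RDF}type> <{RDF}Statement> {g} .",
        f"{node} <{RDF}subject> <{subj_a}> {g} .",
        f"{node} <{RDF}predicate> <{EX}{relation}> {g} .",
        f"{node} <{RDF}object> <{subj_b}> {g} .",
        f'{node} <{EX}similarityScore> "{score}"^^<{XSD}double> {g} .',
    ]


print(relation_query_url("cheese", "butter"))
print(relatedness_url("cottage cheese", "butter"))
# 0.5 is a placeholder score, not a value returned by ConceptNet.
quads = similarity_quads(
    EX + "resource/CHEESE", EX + "resource/BUTTER",
    "RelatedTo", 0.5, EX + "graph/usda",
)
print("\n".join(quads))
```

The second request is only worth issuing when the first one returns at least one edge, which mirrors the two-step lookup described in the text.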
At the last stage, RichRDF queries WordNet using the extracted entities to generate
synset groups of words [7]. Using these synset groups, we generate the offset ID
numbers based on their relationship with the subjects. The offset ID numbers are used
to look up images on ImageNet [8]. For every offset ID, we request ImageNet to provide
us with the Uniform Resource Locators (URLs) of all the images that are associated
with the subject ID number. As a result, hundreds of URLs will be returned. Instead of
adding all these URLs using blank nodes, we split them into two categories based on
their content. The first category is a single blank node with the relationship
“IURLs_subjectName” that contains a link to a page containing hundreds of images
related to this subject. The second category contains the pure images that represent the
subject only. We add the second category as multiple blank nodes using the relationship
“subjectName_Images”. Finally, the user can download the output at this
stage or use a further RichRDF service to query the output through a SPARQL engine
with a user-friendly interface. The output of the queries can also be downloaded
at any time.
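A small sketch of the last stage follows. ImageNet identifies a synset by a WordNet ID made of the part-of-speech letter followed by the zero-padded eight-digit WordNet offset, so a noun offset maps to an ID of the form `n########`. The offset, the URLs, and the helper names below are illustrative placeholders; how RichRDF actually classifies the returned URLs is not specified in the paper.

```python
def wnid(offset, pos="n"):
    """Build an ImageNet WordNet ID from a WordNet synset offset."""
    return f"{pos}{offset:08d}"


def image_relations(subject_name, listing_url, image_urls):
    """Group image links under the two relationship names used in the text.

    One relation points to a single page listing many images; the other
    holds the individual image URLs, one per blank node in the output.
    """
    return {
        f"IURLs_{subject_name}": [listing_url],
        f"{subject_name}_Images": list(image_urls),
    }


# 1234567 is a placeholder offset, not a real WordNet lookup result.
synset_id = wnid(1234567)
relations = image_relations(
    "BUTTER",
    f"http://example.org/imagenet/listing?wnid={synset_id}",  # hypothetical URL
    ["http://example.org/img/1.jpg", "http://example.org/img/2.jpg"],
)
for name, urls in relations.items():
    print(name, len(urls))
```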
Performance Evaluation. We report the performance evaluation results of
RichRDF to provide insights on its speed. Fig. 2 shows the best-case and worst-case
time taken by RichRDF to process an input file containing 1,000 triples. This file
was processed in four different rounds where we added a different feature in each round.
Table 1 shows the time taken for different numbers of triples for various datasets.
Note that the execution time depends on the triples in the input
RDF dataset. The richer the dataset, i.e., the more commonly used entities it contains (such as food
types, fruits, brand names, and object names), the more time it takes to process
all the relationships and semantic scores and to obtain the relevant images. In the future, we
plan to leverage concurrency and parallelism to speed up RichRDF.
Table 1: Number of triples with the average execution time
Type      Triples    Time required
Food      1,000      5.05 seconds
Energy    10,000     46.13 seconds
Water     100,000    8.43 minutes
Fig. 2: Four execution rounds of RichRDF
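To put the figures in Table 1 on a common scale, the short calculation below converts each row to triples per second. It uses only the numbers reported in the table; the variable names are arbitrary.

```python
# (dataset type, number of triples, execution time in seconds) from Table 1
rows = [
    ("Food", 1_000, 5.05),
    ("Energy", 10_000, 46.13),
    ("Water", 100_000, 8.43 * 60),  # 8.43 minutes expressed in seconds
]

throughput = {name: triples / seconds for name, triples, seconds in rows}

for name, rate in throughput.items():
    print(f"{name}: {rate:.1f} triples per second")
```

All three rows work out to roughly 200 to 220 triples per second, which suggests that the execution time grows approximately linearly with the number of triples for these datasets.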
To conclude, RichRDF is a new tool for modeling FEW datasets using Semantic
Web technologies to enable easy consumption and analysis of FEW information for
intelligent decision making. In the future, we would like to automatically publish the
RDF data produced by RichRDF as Linked Data. The source code of RichRDF is
available at https://github.com/UMKC-BigDataLab/RichRDF.
Acknowledgments: The first author (M. G.) would like to acknowledge the support of a UMKC
SGS Travel Grant.
References
1. P. Rao, A. Katib, D. Barron: A knowledge ecosystem for the food, energy, and water
system. In KDD 2016 Workshop on Data Science for Food, Energy and Water, pp. 1-4, San
Francisco, 2016.
2. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives: DBpedia: A nucleus for
a web of open data. In The Semantic Web, Lecture Notes in Computer Science, pp. 722-735.
Springer, Berlin, 2007.
3. C. Knoblock, P. Szekely: Exploiting semantics for big data integration. AI Magazine,
Vol. 36, no. 1, 2015.
4. C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, D. McClosky: The Stanford
CoreNLP natural language processing toolkit. In Proc. of 52nd annual meeting of the
Association for Computational Linguistics, pp. 55-60, 2014.
5. R. Speer, C. Havasi: ConceptNet 5: A large semantic network for relational knowledge. The
People’s Web Meets NLP, pp. 161-176. Springer-Verlag, Berlin, 2013.
6. M. Pham, S. Alse, C. Knoblock, P. Szekely: Semantic labeling: A domain-independent
approach. In International Semantic Web Conference (ISWC), pp. 446-462, Kobe, 2016.
7. M. Hsu, M. Tsai, H. Chen: Combining WordNet and ConceptNet for automatic query
expansion: A learning approach. Information Retrieval Technology, Vol. 4993, pp. 213-224,
Springer, 2008.
8. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A.
Khosla, M. Bernstein, A. Berg, L. Fei-Fei: ImageNet large scale visual recognition
challenge. International Journal of Computer Vision, Vol. 115, pp. 211-252, Springer, 2015.