RichRDF: A Tool for Enriching Food, Energy, and Water Datasets with Semantically Related Facts and Images

Mohamed Gharibi, Praveen Rao, and Nouf Alrasheed
University of Missouri-Kansas City (UMKC), Kansas City MO, USA
mggvf@mail.umkc.edu, raopr@umkc.edu, nalrasheed@mail.umkc.edu

Abstract. Food, energy, and water (FEW) are the key resources to sustain human life and economic growth on Earth. While there is a plethora of information related to FEW systems online, there is a lack of reliable knowledge management tools that enable easy consumption of such information. In this paper, we present a web-based tool called RichRDF with the goal of enriching existing FEW systems with semantically related facts and images. The main features of RichRDF include (1) an entity extraction algorithm that extracts meaningful subjects from Resource Description Framework (RDF) statements using natural language processing (NLP) techniques, (2) a reliable approach to add semantic similarity scores and relationships between different RDF subjects based on ConceptNet, (3) an efficient way to use the offset numbers of WordNet synsets to request the associated images from ImageNet, and (4) a user-friendly interface that allows users to load and convert FEW datasets to RDF and then query the RDF datasets using an existing SPARQL engine. A video highlighting the key features of RichRDF is available at https://youtu.be/vyHgh4LgKCo.

Keywords: Knowledge Graphs, Food, Energy, and Water, RDF.

1 Introduction

Food, energy, and water (FEW) are interdependent resources that are imperative for life on Earth. Tremendous stress on these resources is expected by 2050 due to population growth, natural disasters, and human activities, which emphasizes the need to improve available FEW resources. The United Nations has classified FEW components as a high priority within its sustainable development goals [1].
Meanwhile, several federal agencies, such as the United States Department of Agriculture (USDA) and the National Drought Mitigation Center (NDMC), provide massive amounts of data related to FEW systems. However, the available data exist in CSV, XML, and JSON formats that are not readily consumable in the world of Linked Data (LD) (see http://linkeddata.org).

Today, billions of RDF triples are available on the Web for developing new Semantic Web applications [1]. These triples are expressed as (subject, predicate, object) to represent entities and their relationships within a knowledge base. These relationships can be expressed using different ontologies such as FOAF and DBpedia [2]. A specific Internationalized Resource Identifier (IRI) can be added as the context of an RDF triple; such statements are called RDF quads. Consider an RDF triple capturing the relationship between Oswego and the United States [2]. The first term of the triple is the subject, ‘Oswego’. The second term is the predicate, ‘country’, which is provided by the DBpedia ontology. The last term is the object, the ‘United States’. Such triples can be generated from files in different formats, including JSON, CSV, and TSV. Converting such files to RDF triples allows users to express semantic relationships between subjects and objects and to structure information using RDF graphs. Moreover, RDF triples provide several benefits over processing raw files, including ease of integration and use, modeling of semantic similarity between entities, and the ability to query the knowledge base using a SPARQL engine.

While there are several tools that can produce RDF datasets (e.g., Karma [3]), they do not allow us to easily add new assertions in the form of triples or quads to a dataset.
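To make the triple and quad forms concrete, the following minimal Python sketch serializes a statement in an N-Triples-like style and attaches a context IRI to turn a triple into a quad. It is only an illustration: the `Statement` type, the function names, the DBpedia-style IRIs for Oswego, country, and the United States, and the `http://example.org/few` context are all assumptions made for this example, and the sketch handles IRI terms only (no literals or blank nodes).

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    """An RDF statement whose terms are all IRIs (a simplification)."""
    subject: str
    predicate: str
    obj: str
    context: Optional[str] = None  # None means the statement is a plain triple


def to_quad(stmt: Statement, context_iri: str) -> Statement:
    """Attach a context IRI to a triple; an existing quad is returned unchanged."""
    if stmt.context is not None:
        return stmt
    return stmt._replace(context=context_iri)


def serialize(stmt: Statement) -> str:
    """Write the statement as space-separated <IRI> terms ending with ' .'."""
    terms = [stmt.subject, stmt.predicate, stmt.obj]
    if stmt.context is not None:
        terms.append(stmt.context)
    return " ".join(f"<{term}>" for term in terms) + " ."


# Illustrative IRIs only; they are assumptions, not values taken from RichRDF.
example_triple = Statement(
    "http://dbpedia.org/resource/Oswego",
    "http://dbpedia.org/ontology/country",
    "http://dbpedia.org/resource/United_States",
)
example_quad = to_quad(example_triple, "http://example.org/few")

if __name__ == "__main__":
    print(serialize(example_triple))
    print(serialize(example_quad))
```

The quad differs from the triple only by the fourth term, which is why a context can be added to an existing triple dataset without touching the subject, predicate, or object.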
Moreover, the lack of reliable knowledge graphs serving FEW systems has motivated us to build our own knowledge graph, which helps in decision-making, enriches FEW datasets with extra knowledge and images based on the semantic similarities between the dataset entities, improves knowledge discovery, simplifies access, and provides better search results. Our system, RichRDF, employs RDF, the Web Ontology Language (OWL), and SPARQL to construct and query the FEW knowledge graph.

2 RichRDF

The overall architecture of RichRDF is shown in Fig. 1. RichRDF executes in four stages. Before starting each stage, the system checks that the stage can run under the current conditions, so that the total processing time is minimized.

Fig. 1: Architecture of RichRDF

We assume that an input CSV file is converted into RDF using an integration tool such as Karma. The first stage of RichRDF checks the structure of the file uploaded by a user. If the file contains RDF quads, then the file is ready for the next stage of processing. Otherwise, the user is asked to provide a context IRI, which is added as the context of each RDF triple to produce RDF quads.

In the second stage, RichRDF runs NLP techniques [4] on the subjects to extract meaningful entities, which are used in the later stages. The main goal of entity extraction is to identify a real entity as the main subject, since a subject may be a single word, a couple of words, a long text, or a number such as an ID. The extracted entities subsequently represent the entire subject. RichRDF processes two quads at a time; therefore, there are several possibilities during entity extraction. Consider two subjects that end with the strings “CHEESE,COTTAGE,CRMD,W/FRU” and “BUTTER,PDR,1.5OZ,PREP,W/1/1.HYD”. After the entity extraction stage, “CHEESE,COTTAGE” denote the entities extracted from the first subject and “BUTTER” the entity extracted from the second subject.
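The outcome of entity extraction on the two example subjects can be imitated with a very small Python sketch. It is not the NLP pipeline used by RichRDF: as a stand-in, it splits the trailing string on commas and keeps only alphabetic tokens found in a tiny vocabulary. The `VOCABULARY` set and the function name `extract_entities` are hypothetical; a real system would rely on an NLP toolkit and a full lexicon instead.

```python
# Hypothetical toy vocabulary standing in for a real lexicon or NLP toolkit.
VOCABULARY = frozenset({"butter", "cheese", "cottage", "milk"})


def extract_entities(subject_tail, vocabulary=VOCABULARY):
    """Return the comma-separated tokens that look like real words.

    A token is kept only if it is purely alphabetic and appears in the
    vocabulary, so abbreviations (CRMD, PDR), quantities (1.5OZ), and
    slash-coded fragments (W/FRU) are dropped.
    """
    entities = []
    for token in subject_tail.split(","):
        word = token.strip().lower()
        if word.isalpha() and word in vocabulary:
            entities.append(word.upper())
    return entities


if __name__ == "__main__":
    for tail in ("CHEESE,COTTAGE,CRMD,W/FRU", "BUTTER,PDR,1.5OZ,PREP,W/1/1.HYD"):
        print(tail, "->", ",".join(extract_entities(tail)))
```

Under these assumptions, the first string yields CHEESE and COTTAGE and the second yields BUTTER, matching the example in the text.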
We use the “isA” relation to link the original subjects with the extracted entities.

The third stage of RichRDF queries ConceptNet [5] with the extracted entities to obtain the relationship between the two subjects. If a relationship exists between the first and the second subject, another request is issued to fetch the semantic similarity score. We use RDF reification with blank nodes to represent the similarity score between the original subjects. The semantic similarity score and the relationship between subjects can be exploited during information retrieval and NLP, and they also expand the search to ontology keywords. Furthermore, these scores can be used in advanced machine learning techniques to better understand the data and to build better models [6].

In the last stage, RichRDF queries WordNet using the extracted entities to generate synset groups of words [7]. Using these synset groups, we generate the offset ID numbers based on their relationship with the subjects. The offset ID numbers are used to look up images on ImageNet [8]. For every offset ID, we request from ImageNet the Uniform Resource Locators (URLs) of all the images that are associated with that ID number. As a result, hundreds of URLs are returned. Instead of adding all these URLs using blank nodes, we split them into two categories based on their content. The first category is a single blank node with the relationship “IURLs_subjectName” that contains a link to a page containing hundreds of images related to the subject. The second category contains the pure images that represent only the subject. We add the second category as multiple blank nodes using the relationship “subjectName_Images”.

Finally, the user can download the output at this stage, or use another service provided by RichRDF to query the output using a SPARQL engine with a user-friendly interface. The user can download the output of the queries at any time.
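The last stage can be sketched in Python as follows. ImageNet identifies a synset by the WordNet part-of-speech letter followed by the synset offset zero-padded to eight digits, which `imagenet_wnid` reproduces. The second function shows one possible way to attach image URLs to a subject through blank nodes using the “subjectName_Images” relationship. The `BASE` namespace, the `url` predicate, the blank node labels, and both function names are assumptions made for this illustration and are not taken from the RichRDF source code.

```python
from itertools import count

# Hypothetical namespace for the generated predicates.
BASE = "http://example.org/richrdf/"


def imagenet_wnid(offset, pos="n"):
    """Format a WordNet synset offset as an ImageNet-style ID (POS letter + 8 digits)."""
    return f"{pos}{offset:08d}"


def image_statements(subject_iri, subject_name, image_urls, ids=None):
    """Yield (subject, predicate, object) terms linking a subject to image URLs.

    Each URL gets its own blank node, reached from the subject through a
    '<subjectName>_Images' predicate; the 'url' predicate is an assumption.
    """
    ids = ids if ids is not None else count(1)
    for url in image_urls:
        node = f"_:img{next(ids)}"
        yield (f"<{subject_iri}>", f"<{BASE}{subject_name}_Images>", node)
        yield (node, f"<{BASE}url>", f"<{url}>")


if __name__ == "__main__":
    # 2084071 is used here purely as a sample offset to show the padding.
    print(imagenet_wnid(2084071))
    for terms in image_statements(
        "http://example.org/few/BUTTER", "BUTTER", ["http://example.org/images/1.jpg"]
    ):
        print(" ".join(terms), ".")
```

Keeping one blank node per image makes it possible to attach further properties to an individual image later without changing the subject itself.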
Performance Evaluation. We report the performance evaluation results of RichRDF to provide insights into its speed. Fig. 2 shows the best-case and worst-case time taken by RichRDF to process an input file containing 1,000 triples. This file was processed in four different rounds, with a different feature added in each round. Table 1 shows the time taken for different numbers of triples for various datasets. Note that the execution time depends on the triples in the input RDF dataset. The richer the dataset, i.e., the more commonly used entities it contains (such as food types, fruits, brand names, and object names), the more time is needed to process all the relationships and semantic scores and to obtain the relevant images. In the future, we plan to leverage concurrency and parallelism to speed up RichRDF.

Table 1: Number of triples with the average execution time

Type      Triples    Time required
Food      1,000      5.05 seconds
Energy    10,000     46.13 seconds
Water     100,000    8.43 minutes

Fig. 2: Four execution rounds of RichRDF

To conclude, RichRDF is a new tool for modeling FEW datasets using Semantic Web technologies to enable easy consumption and analysis of FEW information for intelligent decision making. In the future, we would like to automatically publish the RDF data produced by RichRDF as Linked Data. The source code of RichRDF is available at https://github.com/UMKC-BigDataLab/RichRDF.

Acknowledgments: The first author (M. G.) would like to acknowledge the support of the UMKC SGS Travel Grant.

References

1. P. Rao, A. Katib, D. Barron: A knowledge ecosystem for the food, energy, and water system. In KDD 2016 Workshop on Data Science for Food, Energy and Water, pp. 1-4, San Francisco, 2016.
2. S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives: DBpedia: A nucleus for a web of open data. In The Semantic Web, Lecture Notes in Computer Science, pp. 722-735. Springer, Berlin, 2007.
3. C. Knoblock, P. Szekely: Exploiting semantics for big data integration. AI Magazine, vol. 36, no. 1, 2015.
4. C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, D. McClosky: The Stanford CoreNLP natural language processing toolkit. In Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 55-60, 2014.
5. R. Speer, C. Havasi: ConceptNet 5: A large semantic network for relational knowledge. In The People’s Web Meets NLP, pp. 161-176. Springer-Verlag, Berlin, 2013.
6. M. Pham, S. Alse, C. Knoblock, P. Szekely: Semantic labeling: A domain-independent approach. In International Semantic Web Conference (ISWC), pp. 446-462, Kobe, 2016.
7. M. Hsu, M. Tsai, H. Chen: Combining WordNet and ConceptNet for automatic query expansion: A learning approach. In Information Retrieval Technology, vol. 4993, pp. 213-224. Springer, 2008.
8. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, L. Fei-Fei: ImageNet large scale visual recognition challenge. International Journal of Computer Vision, vol. 115, pp. 211-252, Springer, 2015.