LiDDM: A Data Mining System for Linked Data

Venkata Narasimha Pavan Kappara
Indian Institute of Information Technology Allahabad
Allahabad, India
kvnpavan@gmail.com

Ryutaro Ichise
National Institute of Informatics
Tokyo, Japan
ichise@nii.ac.jp

O.P. Vyas
Indian Institute of Information Technology Allahabad
Allahabad, India
opvyas@iiita.ac.in


ABSTRACT
In today's scenario, the quantity of linked data is growing rapidly. The data includes ontologies, governmental data, statistics and so on. With more and more sources publishing the data, the amount of linked data is becoming enormous. The task of obtaining data from various sources, then integrating and fine-tuning it for the desired statistical analysis, therefore assumes prominence. So there is a need for a good model with an efficient UI design to perform Linked Data Mining. We propose a model that helps to effectively interact with linked data present on the web in structured format, retrieve and integrate data from different sources, shape and fine-tune the resulting data for statistical analysis, perform data mining, and also visualize the results at the end.

Copyright is held by the author/owner(s). LDOW2011, March 29, 2011, Hyderabad, India.

1. INTRODUCTION
Since the revolution of linked data, the amount of data that is available on the web in structured format in the cloud of linked data is growing at a very fast pace. LOD (Linking Open Data) forms the foundation for linking the data available on the web in structured format. This community helps to link the data published by various domains such as companies, books, scientific publications, films, music, radio programs, genes, clinical trials, online communities, and statistical and scientific data [3]. It provides different datasets in RDF (Resource Description Framework) format and also provides RDF links between these datasets that enable us to move from a data item in one dataset to a data item in another dataset. A number of organizations are publishing their data in the linked data cloud in different domains. Linked data, as we look at it today, is very complex and dynamic owing to its heterogeneity and diversity.

The various datasets available in the linked data cloud each have their own significance in terms of usability. In today's scenario, a user query for extracting a useful hidden pattern may not always be completely answered by using only one (or many) of the datasets in isolation. Here linked data comes into the picture, as there is a need to integrate different data sources, available in different structured formats, to answer such complex queries. If you look at data sources like the World FactBook [5], Data.gov [16], and DBpedia [2], the data that they provide is real-world data. The information that these kinds of data provide can be helpful in many ways, such as predicting a future outcome from past statistics, finding the dependency of one attribute on another, and so on. In this context, it is necessary to extract hidden information from linked data, considering its richness of information.

Our proposed model suggests a framework tool for Linked Data Mining that captures data from the linked data cloud and extracts various interesting pieces of hidden information. The model is targeted at dealing with the complexities associated with mining linked data efficiently. Our hypothesis is implemented in the form of a tool that takes data from the linked data cloud, performs various KDD (Knowledge Discovery in Databases) operations on it, applies data mining techniques such as association and clustering, and visualizes the results at the end.

The remaining sections are organized as follows. The second section deals with background and related work. The third section describes the architecture of LiDDM (Linked Data Data Miner). The fourth section discusses the tool that we made to implement the model. The fifth section deals with the case study. The sixth section comes up with discussions and future work. Finally, the seventh section is the conclusion.

2. RELATED WORK
Linked data refers to a set of best practices for publishing and connecting structured data on the web [3]. With the expansion of the Linking Open Data Project, more and more data available on the web is being converted into RDF and published as linked data. The difference between interacting with a web of data and a web of documents has been discussed in [11]. This web of data is richer in information and is also available in a standard format. Therefore, to exploit the hidden information in this kind of data, we first have to understand the related work done previously. Looking at the general process of KDD, the steps in the process of knowledge discovery in databases have been explained in [8]: the data has to be selected, preprocessed, transformed, mined, evaluated and interpreted [8]. For the process of knowledge discovery in the semantic web, SPARQL-ML was introduced by extending the SPARQL language to work with statistical learning methods [12]. This places the burden of knowing the extended SPARQL and its ontology on the users. Other research [14] has extended the model for adding data mining methods to SPARQL [18], relieving users of the burden of exact knowledge of ontological structures by asking them only to specify the context from which the items that form a transaction are retrieved automatically. Ontology axioms and semantic annotations for the process of association rule mining have also been used earlier [14].

In our approach, we modified the model used by U. Fayyad et al. [8], which describes the general process of KDD, to suit the needs of linked data. Instead of extending SPARQL [18], we retrieved the linked data using normal SPARQL queries and instead focused on the process of refining and weaving the retrieved data to finally transform it so that it can be fed into the data mining module. This approach separated the work of retrieving data from the process of data mining and relieved the users of the burden of learning extended SPARQL and its ontology. This separation also allowed more flexibility: we can first choose whatever data we need from various data sources and then concentrate on mining once all the needed data has been retrieved, integrated and transformed. In addition, LiDDM works by finding classifications and clusterings, not only associations.

Figure 1: Architecture of LiDDM

3. LIDDM: A MODEL
To start with, our model modified the process of KDD, as discussed in the previous section, to conform to the needs of linked data, and it proceeds in a hierarchical manner. A data mining system was used for statistical analysis, and linked data from the linked data cloud was retrieved, processed and fed into it. Figure 1 provides an overview of our model.
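The stage sequence in Figure 1 can also be expressed in code. The following is a minimal Java sketch of that pipeline; the interface and method names are ours, for illustration only, and do not reflect the authors' actual class layout:

    // Hypothetical outline of the LiDDM pipeline stages (names are ours).
    // Each stage consumes and produces a simple row/column table.
    import java.util.List;

    interface Table {
        String[] header();
        List<String[]> rows();
    }

    interface LiDDMPipeline {
        Table retrieve(String endpoint, String sparqlQuery); // Section 3.1
        Table integrate(Table left, Table right);            // Section 3.2.1
        Table filter(Table in);                              // Section 3.2.2
        Table segment(Table in);                             // Section 3.2.3
        void writeArff(Table in, String path);               // Section 3.3
        void mine(String arffPath);                          // Section 3.4
    }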
3.1 Data Retrieval through Querying
In this initial step of LiDDM, the data in the linked data cloud is queried and retrieved. This step can be compared to the data selection step in the KDD process. The data retrieved will be in the form of a table with some rows and columns. The rows denote instances of the data retrieved, and the columns denote the value of each attribute for each instance.
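Such a table can be retrieved from a remote SPARQL endpoint with the Jena API, which the tool itself uses (Section 4.1). A minimal sketch follows; the endpoint URL and the predicate are placeholders, and the 2011-era com.hp.hpl.jena package names are assumed:

    // Minimal sketch: run a SELECT query against a remote endpoint with Jena
    // and print the resulting table. Endpoint and predicate are placeholders.
    import com.hp.hpl.jena.query.Query;
    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QueryFactory;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.query.ResultSetFormatter;

    public class RetrievalSketch {
        public static void main(String[] args) {
            String endpoint = "http://example.org/sparql"; // placeholder endpoint
            String queryString =
                "SELECT ?country ?growthRate WHERE { "
              + "?country <http://example.org/prop/growthRate> ?growthRate }";
            Query query = QueryFactory.create(queryString);
            QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query);
            try {
                ResultSet results = qexec.execSelect();
                // Rows are instances, columns are attributes, as described above.
                ResultSetFormatter.out(System.out, results);
            } finally {
                qexec.close();
            }
        }
    }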
3.2 Data Preprocessing
Once data retrieval is done, data preprocessing comes into the picture; it plays a significant role in the data mining process. Most of the time, the data is not in a format suitable for the immediate application of data mining techniques. This step emphasizes that the data must be appropriately preprocessed before going on to the further stages of knowledge discovery.

3.2.1 Data Integration
In the previous step of Linked Data Mining, data is retrieved from multiple data sources existing in the linked data cloud. This allows for the feasibility of having distributed data. This data must be integrated in order to answer the user's query. Data is integrated based on some common relation present in the respective data sources. Data sources are selected depending on the different factors a user wants to study in different sources. For example, if we want to study the effect of the growth rate of each country on its film production, the data sources selected can be the World FactBook and the Linked Movie Data Base [10]. We can first query the World FactBook for the growth rate of each country. Then we can query the Linked Movie Data Base for information regarding the film production of each country, and we then have to integrate both results in order to answer the respective query.

3.2.2 Data Filtering
In this step, the data that has been retrieved and integrated is filtered: some rows or columns or both are deleted if necessary. Filtering eliminates unwanted and unnecessary data. For example, consider the previous case of the World FactBook and the Linked Movie Data Base. If, for research purposes, we want the growth rate of a country to be no less than a certain minimum value, we can eliminate instances with growth rates below that minimum value at this step.

3.2.3 Data Segmentation
The main purpose of segmenting the data is to divide the data in each column into classes, if necessary, for statistical analysis. For example, the data in a certain range can be placed into some class. Consider the attribute 'population of a country'. Populations of less than 10,000,000 can be placed under the segment named 'Low Population', populations from 10,000,000 to 99,999,999 under 'Average Population', and populations from 100,000,000 to 999,999,999 under 'High Population'. The segmentation step thus divides the data into different classes and segments, for a class-based statistical analysis at the end.
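The population binning just described reduces to a small mapping from a numeric value to a segment label. A minimal sketch with the thresholds from the example above (the fallback label for values beyond the stated ranges is our own assumption):

    // Minimal sketch of the segmentation rule from the example above:
    // map a country's population to a named segment.
    public class SegmentSketch {
        static String populationSegment(long population) {
            if (population < 10000000L)   return "Low Population";
            if (population < 100000000L)  return "Average Population";
            if (population < 1000000000L) return "High Population";
            return "Very High Population"; // assumed label, outside the ranges above
        }

        public static void main(String[] args) {
            System.out.println(populationSegment(8500000L));   // Low Population
            System.out.println(populationSegment(52000000L));  // Average Population
            System.out.println(populationSegment(310000000L)); // High Population
        }
    }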
Figure 2: This UI shows the integration of the data retrieved from the World FactBook and the Linked Movie Data Base.


3.3 Preparing Input Data for Mining
More often than not, the format in which we retrieve the linked data is not the format required for feeding it into the data mining system. Therefore, it is necessary to convert it into the format the data mining system requires. This step does exactly that work of format conversion; it basically corresponds to the data transformation part of the KDD process.

3.4 Data Mining on Linked Data
In this step, data mining is performed on the already filtered and transformed data. The data, now in the format accepted by the data mining system, is passed from the previous step into the data mining system for analysis. Here the data may be classified or clustered, or set up for finding association rules. After applying these methods, the results are obtained and visualized for interpretation. Thus we believe that LiDDM, with all the above features, will provide a very good and easy-to-use framework tool not only for interacting with linked data and visualizing the results but also for re-shaping the retrieved data. The next section deals with the implementation of our model in an application.

4. IMPLEMENTATION WORK
4.1 Tool Environment
To test our model LiDDM, we made an application that implements it, called 'LiDDMT: Linked Data Data Mining Tool'. In this tool, we used the Jena API [4] for querying remote data sets in the linked data cloud; Jena is a Java framework for building semantic web applications. The Weka API [9] was used for the process of data mining; Weka is widely recognized as a unified platform for running most machine learning algorithms in a single place. The tool was written in Java in a NetBeans environment.

4.2 Working of the Tool
Step 1. The tool emulates our model in the following ways. It has a UI for querying the remote data sets. Two types of querying are allowed in this model. In the first, the user specifies the SPARQL endpoint and the SPARQL query for the data to be retrieved. The second type is an automatic query builder that reduces the burden on the user. The possibility of using subgraph patterns for generating RDF queries automatically has been discussed in [7]. Our query builder gives the user all the possible predicates that can be used with the given SPARQL endpoint, asks the user to specify only the triples, and returns the constructed query. The Drupal SPARQL Query Builder [17] also asks the user to specify triples.
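The essence of such a builder is assembling a SELECT query from user-chosen triple patterns. A minimal sketch of that assembly step, with hypothetical method and parameter names of our own rather than the tool's actual code:

    // Hypothetical sketch: assemble a SPARQL SELECT query from
    // user-specified triple patterns, in the spirit of the query builder.
    public class QueryBuilderSketch {
        static String buildSelect(String[] selectVars, String[][] triples) {
            StringBuilder q = new StringBuilder("SELECT");
            for (String v : selectVars) q.append(" ?").append(v);
            q.append(" WHERE {\n");
            for (String[] t : triples) {
                q.append("  ").append(t[0]).append(' ')
                 .append(t[1]).append(' ').append(t[2]).append(" .\n");
            }
            return q.append("}").toString();
        }

        public static void main(String[] args) {
            String[][] triples = {
                // Placeholder predicate URI; a real one would come from the UI.
                { "?country", "<http://example.org/prop/growthRate>", "?rate" }
            };
            System.out.println(buildSelect(new String[] { "country", "rate" }, triples));
        }
    }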
Step 2. Regarding Step 2 of our model, which is the integration of the retrieved data, the tool implements a UI that uses a JOIN operation to join the retrieved results of two or more queries. It also offers an 'append at the end' operation, which concatenates the results of two or more queries. Figure 2 shows this functionality. In this figure, the text area under 'Result-Query1' gives the results of Query 1, a query on the World FactBook, and the text area under 'Result-Query2' gives the results of Query 2, a query on the Linked Movie Data Base. The text area under 'RESULT-CURRENT STATE AFTER MERGING BOTH THE QUERIES' gives the result of the JOIN operation performed between the 3rd column of Query 1 and the 3rd column of Query 2, as shown in the figure. Once merging is done, clicking the 'Add another Query' button gives you the option to add a third query. Clicking 'Continue' takes you to Step 3.
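The JOIN described above amounts to an equi-join of two result tables on one column from each. A minimal sketch, assuming each table is a list of string rows and the join columns are given by index (our illustration, not the tool's code):

    // Minimal sketch of the JOIN of two query-result tables on one column
    // from each (e.g., the 3rd column of each, as in Figure 2).
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class JoinSketch {
        static List<String[]> join(List<String[]> left, int leftCol,
                                   List<String[]> right, int rightCol) {
            // Index the right table by its join column.
            Map<String, List<String[]>> index = new HashMap<String, List<String[]>>();
            for (String[] r : right) {
                List<String[]> bucket = index.get(r[rightCol]);
                if (bucket == null) {
                    bucket = new ArrayList<String[]>();
                    index.put(r[rightCol], bucket);
                }
                bucket.add(r);
            }
            // For each left row, emit one merged row per matching right row.
            List<String[]> out = new ArrayList<String[]>();
            for (String[] l : left) {
                List<String[]> matches = index.get(l[leftCol]);
                if (matches == null) continue;
                for (String[] r : matches) {
                    String[] merged = new String[l.length + r.length];
                    System.arraycopy(l, 0, merged, 0, l.length);
                    System.arraycopy(r, 0, merged, l.length, r.length);
                    out.add(merged);
                }
            }
            return out;
        }
    }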

Step 3. Moving on to Step 3 of our model, the tool implements a UI named 'Filter', which filters and cleans the data thus retrieved and integrated. This UI has features for removing unwanted columns, deleting rows whose values in a numerical column fall outside a certain range, deleting rows that have certain strings in certain columns, and so on.

Step 4. After filtering the data, we move on to the UI for Step 4 of our model, the segmentation of the data. It asks for the name of each segment; if the values in a column are numeric, we can specify the interval of values that falls in that segment, and if the values in a column are strings, we can specify the set of strings that belongs to that segment. Thus this UI converts the data into the segments or classes we desire, for the data mining algorithms to work on.

Step 5. The UI for Step 5 of our model performs the task of writing the data into the format required for mining. We used Weka in our tool, and Weka accepts input data in the ARFF (Attribute-Relation File Format) format [13]. Thus this UI asks for the relation name and the values of the attributes for conversion to ARFF format. Once this conversion is finished, the retrieved linked data becomes acceptable for data mining applications using Weka.
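Writing a segmented table to ARFF is plain text output: a relation name, one @attribute line per column (nominal values listed in braces), then the data rows. A minimal sketch with made-up relation, attribute names, and rows:

    // Minimal sketch: write a segmented two-column table to an ARFF file.
    // Relation name, attributes, and rows here are made up for illustration.
    import java.io.FileNotFoundException;
    import java.io.PrintWriter;

    public class ArffWriterSketch {
        public static void main(String[] args) throws FileNotFoundException {
            String[][] rows = {
                { "Low Population", "low" },
                { "Average Population", "medium" },
                { "High Population", "high" },
            };
            PrintWriter out = new PrintWriter("country.arff");
            out.println("@relation country");
            out.println("@attribute population "
                + "{'Low Population','Average Population','High Population'}");
            out.println("@attribute movieProduction {low,medium,high}");
            out.println("@data");
            for (String[] r : rows) {
                // Values containing spaces must be quoted in ARFF.
                out.println("'" + r[0] + "'," + r[1]);
            }
            out.close();
        }
    }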
Step 6. The tool has a very flexible UI for data mining (Step 6), in that it has a separate UI for using the original Weka with its full functionality. It also has a simplified UI for quick mining, in which we have implemented J48 decision tree classification [15], Apriori association [1], and EM (expectation maximization) clustering [6]. Figure 3 shows the simplified version of the data mining tool. Using this UI, you can perform data mining on the ARFF file that was made in Step 5; a file chooser also accepts any other previously formed ARFF file as input for mining, so results can be compared and visualized at the same time. In the simplified mining UI, you can specify the most common options for each of the methods (J48, Apriori, and EM) and can cross-check the results by varying the different parameters.

Figure 3: This UI shows the simplified version of the data mining tool.
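For readers unfamiliar with the Weka API that sits underneath such a UI, the following is a minimal sketch of running J48 on the ARFF file from Step 5. The file name is assumed, and this is our illustration rather than the tool's actual code:

    // Minimal sketch: load the ARFF produced in Step 5 and build a J48 tree
    // with Weka, printing the model and a 10-fold cross-validation summary.
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Sketch {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("country.arff").getDataSet(); // assumed file
            data.setClassIndex(data.numAttributes() - 1); // last attribute as class
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree); // the decision tree itself
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString()); // precision/recall etc.
        }
    }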
Views of Results. The results output from this step are visualized at the end. The results from the J48 decision tree classifier are visualized in the form of a decision tree, along with classifier output such as precision, recall, F-measure, etc. Similarly, the results from EM clustering are visualized in the form of an X-Y plot with the clusters shown. The results from Apriori association, if any, can be visualized in the form of a printout of the best associations found.

Also, as described in our model, our tool LiDDMT has forward and backward movement flexibility in Step 3, Step 4 and Step 5 (i.e., in filter, segmentation and writer), where you can get the results at any step and can go back and forth to any other step. The same is the case with Step 2 (in the model): the UI of our tool allows the integration of any number of queries, as long as they can be merged using either the 'JOIN' operation or the 'append at the end' operation.

5. CASE STUDY
Our tool LiDDMT has been tested with many datasets, such as DBpedia, the Linked Movie Data Base, the World FactBook, Data.gov, etc. Here, for the purpose of explanation, we choose to demonstrate the effectiveness of our tool with experiments on the World FactBook dataset.

The World FactBook dataset provides information on the history, people, government, economy, geography, communications and other transnational issues for about 266 world entities. We explored this dataset and found some interesting patterns using our tool.

First, we queried the World FactBook database for the GDP per capita, GDP composition by agriculture, GDP composition by industry, and GDP composition by services of every country. Then, in Step 4 (segmentation), we divided each of the attributes GDP by agriculture, industry, and services into 10 classes at equal intervals of 10 percent. GDP per capita was divided into three classes called low, average, and high, depending on whether the value is less than 10,000, between 10,000 and 25,000, or more than 25,000 respectively. This segmented data was sent as input to the Apriori algorithm, and we found two association rules that have proved to be very accurate. The rules are as follows:
• When the GDP per capita income is high (40 instances), the GDP composition by agriculture is between 0 and 10 percent (39 instances), with a confidence of 0.98.

• When the GDP composition by services is between 70 and 80 percent (32 instances), the GDP composition by agriculture is between 0 and 10 percent (29 instances), with a confidence of 0.91.

If the same data is made to undergo EM clustering using Step 6, the visualizations that are obtained (shown in Figure 4) also support these rules.

Figure 4: Here PC denotes GDP per capita and aggr-X denotes GDP composition by agriculture, which is X percent.
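Rules of the kind above are what Weka's Apriori prints for nominal (segmented) data. A minimal sketch of that call, with an assumed file name, shown for illustration rather than as the tool's actual code:

    // Minimal sketch: run Weka's Apriori on the segmented ARFF data and
    // print the best association rules found. All attributes must be nominal.
    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AprioriSketch {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("gdp_segmented.arff").getDataSet(); // assumed file
            Apriori apriori = new Apriori();
            apriori.setNumRules(10);   // report the 10 best rules
            apriori.setMinMetric(0.9); // minimum confidence
            apriori.buildAssociations(data);
            System.out.println(apriori); // prints the discovered rules
        }
    }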
Then we queried the World FactBook database for the literacy rate, labor force in agriculture, labor force in industry, and labor force in services of every country. Using Step 4 (segmentation), we segmented each of the attributes labor force in agriculture, labor force in industry, and labor force in services into three classes, namely low, medium, and high. We segmented the literacy rate attribute into three classes, namely low, medium, and high, depending on whether the literacy rate is between 0 and 50, 50 and 85, or 85 and 100 respectively. Here we compare the effect of the labor force in each sector on the literacy rate of the country. Figure 5 shows the effect of the labor force in agriculture on the literacy rate.

Figure 5: This figure shows that when the labor force in agriculture is low (A L), the literacy rate is high (L H), with a 7 percent error rate out of 68 instances. Also, when the labor force in agriculture is medium (A M), the literacy rate is high (L H), with an 11 percent error rate out of 43 instances. Thus this can signify an inverse relationship between the literacy rate and the labor force in agriculture.

We have also tested our tool by retrieving information about movies from 1991 to 2001 in various countries from DBpedia and the Linked Movie Data Base, integrating it with data retrieved from the World FactBook, such as the median age of the population and the total population, and we found the following patterns.

• If the population is greater than 58,147,733 and the median age is greater than 38, the movie production is high, with a confidence of 1.

• If the population is between 58,147,733 and 190,010,647 and the median age is less than 38, the movie production is low, with a confidence of 1.

Thus the above results demonstrate that LiDDMT helps us to find hidden relationships between the attributes in linked data, thereby helping effectively in knowledge discovery.

6. DISCUSSIONS AND FUTURE WORK
From our experiments and case study, we can say that the strength of our proposed model, LiDDM, is that it can retrieve data from multiple data sources and integrate them, instead of just retrieving data from a single source, and it can treat data from various sources in the same manner. The preprocessing and transformation steps make our model uniquely suited to dealing with linked data. This gives us the flexibility of choosing data at will first and then concentrating on mining. Also, our tool LiDDMT helps us to mine and visualize data from more than one ARFF file at the same time, thus giving us the option of comparison.

By introducing graph-based techniques, triples could be found automatically in the future. Also, currently all the available predicates are obtained only for DBpedia and the Linked Movie Data Base. For other sources, if you use the automatic query builder, you have to specify the predicates yourself, without prefixes. This functionality can be extended to other data sources easily, so more and more data sets can be incorporated, drawing predicates from all of them. Even so, with our tool, although you cannot get all the available predicates for datasets other than DBpedia and the Linked Movie Data Base, you can still use the automatic query builder to generate SPARQL queries automatically if you know the URI of the predicate you are using. Thus more functionality can be imparted to the automatic query builder.

Also, in the future, some artificial intelligence measures could be introduced into LiDDM for suggesting the machine learning algorithms that give the best possible results, depending on the data obtained from the linked data cloud. All in all, the existing functionality of LiDDMT has been tested with many examples, and our tool has proved to be very effective and usable.
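One generic way to extend predicate discovery to an arbitrary endpoint, as envisaged above, is a SPARQL probe for distinct predicates. A minimal sketch with Jena, where the endpoint URL is a placeholder and the LIMIT is an arbitrary cap of our own:

    // Minimal sketch: list distinct predicates available at any SPARQL
    // endpoint, as a generic basis for the automatic query builder.
    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QueryFactory;
    import com.hp.hpl.jena.query.ResultSet;

    public class PredicateProbe {
        public static void main(String[] args) {
            String endpoint = "http://example.org/sparql"; // placeholder endpoint
            String q = "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 500";
            QueryExecution qexec =
                QueryExecutionFactory.sparqlService(endpoint, QueryFactory.create(q));
            try {
                ResultSet rs = qexec.execSelect();
                while (rs.hasNext()) {
                    System.out.println(rs.nextSolution().get("p"));
                }
            } finally {
                qexec.close();
            }
        }
    }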
7. CONCLUSIONS
Linked data, with all its diversity and complexity, acts as a huge database of information in RDF format, which is machine readable. There is a need to mine that data to find different hidden patterns and also to make it possible for people to find out what it has in store for us.

Our model, LiDDM, successfully builds a data mining mechanism on top of linked data for the effective understanding and analysis of linked data. The features of our model are built upon the classical KDD process and are modified to serve the needs of linked data. The step of getting the required data from the remote database itself makes our model dynamic. Flexibility is an added feature of our model, as the steps of data retrieval and mining are separate. This allows users to retrieve all the possible results first and then to decide on the mining techniques. Also, the smooth cyclic movement between Step 3, Step 4 and Step 5 (i.e., filter, segmentation and writer respectively) makes our model more adaptable and more inclined towards the removal of unwanted data and the finding of richer patterns. Visualization at the end solves our problem by pictorially representing the interesting relationships hidden in the data, thereby making the data more understandable.

Regarding our tool, LiDDMT, which we built on top of our model, its functioning is effective and its results are efficient, as shown in the case studies. Using Weka in our tool for the process of data mining makes it more efficient, considering the vast popularity of Weka. The tool has much functionality implemented at each step of our model, in an effort to make it more dynamic and usable. Also, the ability to view more than one visualization at a time when running more than one data mining method makes our tool very suitable for comparing data. Still, the tool could be made more efficient, as we discussed in the previous section.

8. REFERENCES
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, San Francisco, CA, USA, Sept. 1994. Morgan Kaufmann Publishers, Inc.
[2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference, volume 4825 of Lecture Notes in Computer Science, pages 722–735, Busan, Korea, Nov. 2007.
[3] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. International Journal on Semantic Web and Information Systems, 5(3):1–22, 2009.
[4] J. J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkinson. Jena: implementing the semantic web recommendations. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pages 74–83, New York, NY, USA, 2004. ACM.
[5] Central Intelligence Agency. The world factbook. https://www.cia.gov/library/publications/the-world-factbook/, 2011.
[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39:1–38, 1977.
[7] J. Dokulil and J. Katreniaková. RDF query generator. In Proceedings of the 12th International Conference on Information Visualisation, pages 191–193. IEEE Computer Society, 2008.
[8] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, Nov. 1996.
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.
[10] O. Hassanzadeh and M. Consens. Linked movie data base. In Proceedings of the Linked Data on the Web Workshop, 2009.
[11] T. Heath. How will we interact with the web of data? IEEE Internet Computing, 12(5):88–91, 2008.
[12] C. Kiefer, A. Bernstein, and A. Locher. Adding data mining support to SPARQL via statistical relational learning methods. In Proceedings of the 5th European Semantic Web Conference, volume 5021 of Lecture Notes in Computer Science, pages 478–492. Springer, 2008.
[13] Machine Learning Group at University of Waikato. Attribute-relation file format. http://www.cs.waikato.ac.nz/ml/weka/arff.html, 2008.
[14] V. Nebot and R. Berlanga. Mining association rules from semantic web data. In Proceedings of the 23rd International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems, volume 6097 of Lecture Notes in Computer Science, pages 504–513. Springer Berlin / Heidelberg, 2010.
[15] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
[16] The United States Government. Data.gov. http://www.data.gov/, 2011.
[17] C. Wastyn. Drupal SPARQL query builder. http://drupal.org/node/306849, 2008.
[18] E. Prud'hommeaux and A. Seaborne. SPARQL query language for RDF. W3C Working Draft, 4 October 2006. http://www.w3.org/TR/2006/WD-rdf-sparql-query-20061004.