<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Wordification Approach to Relational Data Mining: Early Results</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matic Perovsek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anze Vavpetic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nada Lavrac</string-name>
          <email>nada.lavrac@ijs.si</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Knowledge Technologies, Jozef Stefan Institute</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jozef Stefan International Postgraduate School</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Nova Gorica</institution>
          ,
          <addr-line>Nova Gorica</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes a propositionalization technique called wordification. Wordification is inspired by text mining and can be seen as a transformation of a relational database into a corpus of documents. As with previous propositionalization methods, any propositional data mining algorithm can be applied after the wordification step. The most notable advantage of the presented technique is greater scalability: the propositionalization step is done in time linear in the number of attributes times the number of examples for one-to-many databases. Furthermore, wordification results in an easily understandable propositional feature representation. We present our initial experiments on two real-life datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>propositionalization</kwd>
        <kwd>text mining</kwd>
        <kwd>association rules</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Unlike traditional data mining algorithms, which look for models/patterns in
a single table (propositional patterns), relational data mining algorithms look
for models/patterns in multiple tables (relational patterns). For most types of
propositional models/patterns there are corresponding relational patterns, for
example: relational classification rules, relational regression trees, relational
association rules, and so on.</p>
      <p>
        For individual-centered relational databases, where there is a clear notion
of an individual, there exist techniques for transforming such a database into a
propositional or single-table format. This transformation, called
propositionalization [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ], can be used by traditional propositional learners, such as decision
tree or classification rule learners.
      </p>
      <p>In this paper we introduce a novel propositionalization technique called
wordification. Wordification is inspired by text mining techniques and can be
seen as a transformation of a relational database into a corpus of documents,
where each document can be characterized by a set of properties describing the
entries of a relational database.</p>
      <p>
        Unlike other propositionalization techniques [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref6">2, 3, 4, 6</xref>
        ], which search for good
(and possibly complex) relational features to construct a subsequent
propositional representation, this methodology focuses on a large number of simple
features with the aim of greater scalability. Since the feature construction step
is very efficient, it can scale well to large relational databases. In fact, the
presented methodology transforms a given database in time linear in the number
of attributes times the number of examples for one-to-many databases.
Furthermore, due to their simplicity, the generated features are easily
interpretable by domain experts. On the other hand, this methodology suffers from
a loss of information, since the generated features do not explicitly connect
relations through variables. Instead, by using the Term Frequency-Inverse
Document Frequency (TF-IDF) measure [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], it tries to capture the importance of a certain
feature (attribute value) of a relation in an aggregate manner.
      </p>
      <p>In this paper we report our initial experiments on two real-life relational
databases: a collection of the best and worst movies from the Internet Movie Database
(IMDB) and a database of Slovenian traffic accidents.</p>
      <p>The rest of the paper is organized as follows. In Section 2 we present the
new wordification methodology. Section 3 presents the initial experiments and
Section 4 concludes the paper by presenting some ideas for further work.</p>
    </sec>
    <sec id="sec-2">
      <title>Wordification</title>
      <p>This section presents the two main steps of the wordification propositionalization
technique. First, a straightforward transformation from a relational database to
a textual corpus is performed (Fig. 1). One instance (i.e., one entry of the main
table) of the initial relational database is transformed into one text document,
and the features (attribute values) describing the instance constitute the words
of this document. One document is constructed simply as a list of attribute-value
pairs - words (or features) constructed as a combination of the table's name and
the attribute's name with its discrete value:</p>
      <p>[table name]_[attribute name]_[attribute value].
Note that every attribute needs to be discretized beforehand in order to be
able to represent every value as a word. For each instance, the features are first
generated for the main table, then for each entry from the additional tables,
and finally concatenated together.</p>
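      <p>As an illustrative sketch (the helper name and the toy table entry below are our own, not part of the described implementation), the word-construction step for a single table entry can be written as:</p>

```python
def make_word(table_name, attribute, value):
    """Build one word of the form [table name]_[attribute name]_[attribute value]."""
    return "{}_{}_{}".format(table_name, attribute, value)

# A toy entry from a hypothetical Car table (attribute -> discrete value).
car = {"shape": "rectangle", "roof": "flat"}
words = [make_word("car", att, val) for att, val in car.items()]
print(words)  # ['car_shape_rectangle', 'car_roof_flat']
```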
      <p>Because we do not explicitly use existential variables in our features, we
instead rely on the Term Frequency-Inverse Document Frequency (TF-IDF)
measure to implicitly capture the importance of a certain feature for a given
instance. In the context of text mining, TF-IDF reflects the representativeness
of a certain word (feature) for the given document (instance). In the rest of this
section we refer to instances as documents and to features as words.</p>
      <p>For a given word w in a document d from a corpus D, the TF-IDF is defined
as follows: tfidf(w, d) = tf(w, d) · log(|D| / |{d' ∈ D : w ∈ d'}|), where tf(w, d) denotes the
frequency of word w in document d. In other words, a certain word will have
a high TF-IDF (i.e., the feature is given a high weight for this instance) if it
is frequent within this document (instance) and infrequent in the given corpus
(the original database). The weight of a word thus gives a strong
indication of how relevant the feature is for the given individual. The TF-IDF
weights can then be used either for filtering out words with low importance or
directly by the propositional learner.</p>
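      <p>A minimal stand-alone sketch of this weighting, using only the Python standard library (the toy documents are invented for illustration and are not from the paper):</p>

```python
import math
from collections import Counter

def tfidf(word, doc, corpus):
    """tfidf(w, d) = tf(w, d) * log(|D| / |{d' in D : w in d'}|)."""
    tf = Counter(doc)[word]
    df = sum(1 for d in corpus if word in d)  # document frequency of the word
    return tf * math.log(len(corpus) / df) if df else 0.0

docs = [["car_roof_flat", "car_shape_rectangle"],
        ["car_roof_peaked", "car_shape_rectangle"]]
# 'car_shape_rectangle' occurs in every document, so its weight is 0.
print(tfidf("car_shape_rectangle", docs[0], docs))      # 0.0
print(round(tfidf("car_roof_flat", docs[0], docs), 3))  # 0.693
```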
      <p>The technique is illustrated on a simplified version of the well-known
East-West trains domain, where the input database consists of the two tables shown in
Table 1; we have one east-bound and one west-bound train, each with two cars
with certain properties. The Train table is the main table and the trains are the
instances. We want to learn a classifier to determine the direction of an unseen
train. For this purpose the class attribute (train direction) is not preprocessed
and is only appended to the resulting feature vector of words.</p>
      <p>First, the corresponding two documents (one for each train) are generated
(Fig. 2). After this, the documents are transformed into a bag-of-words
representation by calculating the TF-IDF values for each word of each document, with
the class attribute column appended to the transformed bag-of-words table.</p>
      <preformat>def wordification(table, individual):
    words = []
    # words for the individual's entry in the current table
    for att, val in table[individual]:
        words.append(table.name + "_" + att + "_" + str(val))
    # recurse into tables which contain the current table's foreign key
    for secondary_table in table.secondary_tables():
        words.extend(wordification(secondary_table, individual))
    return words

# STEP 1: construct one document per individual
documents = {}
for individual in main_table:
    documents[individual] = wordification(main_table, individual)

# STEP 2: compute TF-IDF feature vectors and apply a propositional learner
feature_vectors = tfidf(documents)
results = propositional_learner(feature_vectors)</preformat>
      <p>Fig. 3. Pseudo-code of the wordification algorithm.</p>
    </sec>
    <sec id="sec-exp">
      <title>Experiments</title>
      <p>This section presents the initial experiments with the wordification methodology.
We performed association rule learning in combination with the wordification
approach on two real-life datasets: the best and worst ranked IMDB movies
database and the Slovenian traffic accidents database. Table 2 lists the
characteristics of both databases.</p>
      <p>The preprocessing procedure was performed on both databases as follows.
First, the wordification step was applied as described in Section 2. Next,
irrelevant features (those with the same value across all examples) were removed,
leaving fewer than half of the features (see Table 3). In order to prepare the
data for association rule mining, the data was also binarized: a feature was
assigned the value true if the corresponding TF-IDF value was above 0.06, and
false otherwise.</p>
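      <p>The filtering-and-binarization step described above can be sketched as follows; the helper name and the toy vectors are illustrative, and only the 0.06 threshold comes from the text:</p>

```python
def preprocess(vectors, threshold=0.06):
    """Drop constant (irrelevant) features, then binarize TF-IDF weights."""
    n_features = len(vectors[0])
    # keep only the features whose value varies across examples
    keep = [j for j in range(n_features)
            if len({v[j] for v in vectors}) > 1]
    return [[v[j] > threshold for j in keep] for v in vectors]

vecs = [[0.0, 0.10, 0.05],
        [0.0, 0.02, 0.09]]
# Feature 0 is constant and removed; the rest are thresholded at 0.06.
print(preprocess(vecs))  # [[True, False], [False, True]]
```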
      <p>IMDB database. The complete IMDB database is publicly available in
SQL format1. This database contains tables of movies, roles, actors, movie
genres, directors, and director genres.</p>
      <p>The database used in our experiments consists only of the movies whose
titles and years of production appear on IMDB's top 250 and bottom 100 charts2.
The database therefore consists of 166 movies, along with all of their actors,
genres and directors. Movies present in IMDB's top 250 chart were assigned
an additional label goodMovie, while those in the bottom 100 were marked as
badMovie. Additionally, the attribute age was discretized: a movie was marked as
old if it was made before 1950, fairlyNew if it was produced between 1950 and
2000, and new otherwise.</p>
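      <p>The described discretization of the production year can be sketched as below; the function name is ours, and treating the boundary year 2000 as fairlyNew is our assumption, since the text does not state whether it is inclusive:</p>

```python
def discretize_age(year):
    """Map a production year to the discrete age label used in the text."""
    if year < 1950:
        return "old"
    elif year <= 2000:  # boundary handling assumed
        return "fairlyNew"
    return "new"

print([discretize_age(y) for y in (1941, 1972, 2004)])
# ['old', 'fairlyNew', 'new']
```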
      <p>
        After preprocessing the dataset using the wordi cation methodology, we
performed association rule learning. Frequent itemsets were generated using
RapidMiner's [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] FP-growth implementation. Next, association rules for the resulting
frequent itemsets were produced. Among all the discovered rules, several
interesting ones were found; Figure 4 presents a selection.
1 http://www.webstepbook.com/supplements/databases/imdb.sql
2 As of July 2, 2012
      </p>
      <p>Fig. 4. Selected association rules found in the IMDB dataset.
goodMovie ← director_genre_drama; movie_genre_thriller; director_name_AlfredHitchcock (Support: 5.38%, Confidence: 100.00%)
movie_genre_drama ← goodMovie; actor_name_RobertDeNiro (Support: 3.59%, Confidence: 100.00%)
director_name_AlfredHitchcock ← actor_name_AlfredHitchcock (Support: 4.79%, Confidence: 100.00%)
director_name_StevenSpielberg ← goodMovie; movie_genre_adventure; actor_name_TedGrossman (Support: 1.79%, Confidence: 100.00%)</p>
      <p>The first rule states that if the movie's genre is drama and it is directed by
Alfred Hitchcock, who is also known for drama movies, then the movie is a good
movie. The second rule concludes that if a movie is good and Robert De Niro
acts in it, then it must be a drama movie. The third interesting rule shows
that Alfred Hitchcock acted only in the movies he also directed. The last rule
implies that if Ted Grossman acts in a good adventure movie, then the director
is Steven Spielberg (Ted Grossman usually plays the role of a stunt coordinator
or performer).</p>
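      <p>The support and confidence values reported with each rule can be recomputed from a binarized table as in this sketch; the function and the toy transactions are illustrative, not from the paper:</p>

```python
def support_confidence(body, head, transactions):
    """Support = P(body and head); confidence = P(head | body)."""
    n = len(transactions)
    body_count = sum(1 for t in transactions if body <= t)
    both_count = sum(1 for t in transactions if (body | head) <= t)
    support = both_count / n
    confidence = both_count / body_count if body_count else 0.0
    return support, confidence

# Toy transactions over wordification-style features.
ts = [{"goodMovie", "movie_genre_drama", "actor_name_RobertDeNiro"},
      {"goodMovie", "movie_genre_drama"},
      {"badMovie", "movie_genre_horror"},
      {"goodMovie", "actor_name_RobertDeNiro", "movie_genre_drama"}]
body = {"goodMovie", "actor_name_RobertDeNiro"}
head = {"movie_genre_drama"}
print(support_confidence(body, head, ts))  # (0.5, 1.0)
```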
      <p>Traffic accident database. The second dataset consists of all accidents that
happened in Slovenia's capital city Ljubljana between the years 1995 and 2005.
The data is publicly accessible from the national police department's website3.
The database is multi-relational and consists of the information about the
accidents along with all of the accidents' participants.</p>
      <p>Fig. 5. Selected association rules found in the traffic accidents dataset.
noInjuries ← accident_trafficDensity_rare; accident_location_parkingLot (Support: 0.73%, Confidence: 97.66%)
person_gender_male ← person_vehicleType_motorcycle (Support: 0.11%, Confidence: 99.12%)</p>
      <p>The data already contained discretized attributes, so further discretization
was not needed. Similarly as with the IMDB database, preprocessing using the
wordification methodology, FP-growth itemset mining and association rule
mining were performed. Figure 5 presents some of the interesting rules found in the
Slovenian traffic accidents dataset.</p>
      <p>The first rule indicates that if the traffic density is rare and the accident happened in
a parking lot, then no injuries occurred. The second rule implies that whenever
a motorcycle is involved in an accident, a male person is involved.
3 http://www.policija.si/index.php/statistika/prometna-varnost</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>This paper presented a novel propositionalization technique called wordification.
This methodology is inspired by text mining and can be seen as a transformation
of a relational database into a corpus of documents. As is typical for
propositionalization methods, any propositional data mining algorithm can be applied after
the wordification step. The most notable advantage of the presented technique
is greater scalability: the propositionalization step is done in time linear in the
number of attributes times the number of examples for one-to-many databases.
Moreover, the methodology produces simple, easy-to-understand
features, and consequently, easily understandable rules.</p>
      <p>We have presented initial results on two real-life databases: the best and worst
movies from the IMDB database and a database of Slovenian traffic accidents.
Given that we have found some interesting patterns using our methodology,
we are motivated to further explore this approach on new datasets. In future
work we will apply the methodology to larger databases to explore its potential
limitations. Furthermore, we will experimentally compare this methodology with
other known propositionalization techniques.</p>
      <p>Acknowledgements
We are grateful to Marko Grobelnik and Peter Ljubic for their initial work on this
topic. This work was supported by the Slovenian Research Agency and the FP7
European Commission projects MUSE (grant no. 296703) and FIRST (grant no.
257928).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Spärck Jones</surname>
          </string-name>
          .
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          .
          <source>Journal of Documentation</source>
          ,
          <volume>28</volume>
          :
          <fpage>11</fpage>
          -
          <lpage>21</lpage>
          ,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Kramer</surname>
          </string-name>
          , Bernhard Pfahringer, and
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Helma</surname>
          </string-name>
          .
          <article-title>Stochastic propositionalization of non-determinate background knowledge</article-title>
          .
          <source>In Proceedings of the Inductive Logic Programming Workshop</source>
          , pages
          <fpage>80</fpage>
          -
          <lpage>94</lpage>
          . Springer,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ondrej</given-names>
            <surname>Kuzelka</surname>
          </string-name>
          and
          <string-name>
            <given-names>Filip</given-names>
            <surname>Zelezny</surname>
          </string-name>
          .
          <article-title>Block-wise construction of tree-like relational features with monotone reducibility and redundancy</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>83</volume>
          (
          <issue>2</issue>
          ):
          <fpage>163</fpage>
          -
          <lpage>192</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Nada</given-names>
            <surname>Lavrac</surname>
          </string-name>
          , Saso Dzeroski, and
          <string-name>
            <given-names>Marko</given-names>
            <surname>Grobelnik</surname>
          </string-name>
          .
          <article-title>Learning nonrecursive definitions of relations with LINUS</article-title>
          .
          <source>In EWSL</source>
          , pages
          <fpage>265</fpage>
          -
          <lpage>281</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ingo</given-names>
            <surname>Mierswa</surname>
          </string-name>
          , Michael Wurst, Ralf Klinkenberg,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Scholz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Timm</given-names>
            <surname>Euler</surname>
          </string-name>
          .
          <article-title>YALE: Rapid prototyping for complex data mining tasks</article-title>
          .
          <source>In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD'06)</source>
          , pages
          <fpage>935</fpage>
          -
          <lpage>940</lpage>
          , Philadelphia, PA, USA, 20-23 August
          <year>2006</year>
          . ACM Press, NY, USA.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Filip</given-names>
            <surname>Zelezny</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nada</given-names>
            <surname>Lavrac</surname>
          </string-name>
          .
          <article-title>Propositionalization-based relational subgroup discovery with RSD</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>62</volume>
          :
          <fpage>33</fpage>
          -
          <lpage>63</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>