<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A CUSTOMIZABLE MULTI-AGENT SYSTEM FOR DISTRIBUTED DATA MINING</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Di Fatta</string-name>
          <email>G.DiFatta@reading.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giancarlo Fortino</string-name>
          <email>g.fortino@unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DEIS - Università della Calabria</institution>
          ,
          <addr-line>Via P. Bucci cubo 41c, 87036 Rende (CS)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Systems Engineering, University of Reading</institution>
          ,
          <addr-line>Whiteknights, Reading RG6 6AY</addr-line>
          ,
          <country country="UK">U.K.</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a general Multi-Agent System framework for distributed data mining based on a Peer-toPeer model. The framework adopts message-based asynchronous communication and a dynamic load balancing policy that is particularly suitable for irregular search algorithms. A modular design allows a separation of the general-purpose system protocols and software components from the specific data mining algorithm. While the general architecture has been implemented and successfully tested on a parallel frequent subgraph mining algorithm, several interesting issues still have to be explored. The present work will discuss them and will introduce the ongoing research efforts aimed at exploiting and leveraging the MAS for distributed data mining applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        architecture is organized as a flat P2P network of nodes, each of which is a MAS. Such an organization
supports an efficient dynamic load balancing particularly suitable for irregular search algorithms. The
MAS has been customized for the frequent subgraph mining problem for the discovery of
discriminative molecular fragments. Experimental tests in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] on real molecular compounds confirmed
its effectiveness.
      </p>
      <p>Further applications will allow the evaluation of the generality of the architecture. In order to
customize the framework for a specific domain, the problem and the sequential application should have
some general characteristics, i.e. it should be possible to partition the original problem in independent
sub-problems. Typical examples are data parallel problems and problems based on a search strategy.
Parallel applications with more complex communication patterns can also be adopted, but they require
a greater effort to customize the framework. The simple parallel computing model at which the
architecture has been applied already includes important data mining problems with broad applications,
e.g. classification trees, association rule mining and in general all the frequent pattern mining problems
(itemsets, sequences, trees, subgraphs). Even though the parallel algorithm is often quite
straightforward (e.g. parallel backtracking), the parallel efficiency is severely limited by the irregularity
of the search space. Static partitioning and load balancing cannot be applied to data mining problems
and simple dynamic load balancing policies may also be ineffective for highly irregular search
problems. For this reason, we believe that there is the need of a general approach for this category of
problems. Nevertheless, interesting research activities could look at the generalization of the
architecture for different and more complex parallel computing models.</p>
      <p>
        The sequential algorithm has to be modified in order to be embedded in the system according to the
interface of the Worker agent (see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for a general description of the MAS architecture). Three
methods have to be implemented to adapt the sequential code.
      </p>
      <p>A partitioning method (Work Splitting) divides a sequential task into two independent subtasks. The
Work Splitting mechanism depends on the particular data mining applications. In applications based on
a search tree, the search nodes are typically stored in a stack and the work splitting mechanism
corresponds to the selection of one or more elements of the local stack to generate and donate a subtask
to an idle Worker. A second method has to merge two partial results. The two methods, partitioning
and merging, will be invoked asynchronously at any peer of the distributed system according to the
agent protocols and the selected dynamic load balancing policy. These two methods are the only real
effort required to the programmer in order to embed a sequential algorithm in the DDM environment.
A third method will be used to invoke the execution of a sequential task. This simply corresponds to a
wrapper of the sequential algorithm to comply with the Worker interface.</p>
      <p>Future research efforts will focus on a more accurate formalization of the model and its simulation on
very large-scale systems. Other research directions include the adoption of the approach in other
application domains to verify and extend its general applicability and the introduction of advanced and
intelligent services based on the MAS potentiality. For example, ongoing research efforts are looking at
the dynamic and autonomous management of overlay networks of peers. Agents can dynamically
organize themselves in clusters in order to aggregate computing resources for computing-intensive
subtasks. This will be particularly useful for those problems whose complexity is not known in advance
or it is difficult to be estimated with enough precision. At run time further resources will be
dynamically aggregated, whenever necessary, for specific subtasks. This dynamic computing resource
aggregation process will play orthogonally with the search space partitioning. It can take into account
updated network and systems loads and configurations for an efficient partitioning of the resources.
Resource partitioning is also a practical mechanism to provide greater scalability to the architecture.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Di Fatta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fortino. A</surname>
          </string-name>
          <article-title>Customizable Multi-Agent System for Distributed Data Mining</article-title>
          .
          <source>Proceedings of the 22nd ACM Symposium on Applied Computing (SAC</source>
          <year>2007</year>
          ), Special Track on Agents, Interactions, Mobility, and
          <string-name>
            <surname>Systems</surname>
          </string-name>
          (AIMS),
          <source>March 11 - 15</source>
          ,
          <year>2007</year>
          , Seoul, Korea. (in press)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klusch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lodi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moro</surname>
          </string-name>
          .
          <article-title>Agent-based Distributed Data Mining: The KDEC Scheme</article-title>
          .
          <source>Intelligent Information Agents - The AgentLink Perspective. Lecture Notes in Computer Science 2586 Springer</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.G.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duggan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lyons.</surname>
          </string-name>
          <article-title>A Multi-Agent System for Context-based Distributed Data Mining</article-title>
          .
          <source>Technical Report Number NUIG-IT-170503</source>
          , Department of Information Technology,
          <string-name>
            <surname>NUI</surname>
          </string-name>
          , Galway,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>