<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Formal Concept Analysis on Graphics Hardware</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>W. B. Langdon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shin Yoo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Harman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CREST centre, Department of Computer Science, University College London Gower Street</institution>
          ,
          <addr-line>London WC1E 6BT</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We document a parallel non-recursive beam search GPGPU FCA CbO like algorithm written in nVidia CUDA C and test it on software module dependency graphs. Despite removing repeated calculations and optimising data structures and kernels, we do not yet see major speed ups. Instead GeForce 295 GTX and Tesla C2050 report 141 072 concepts (maximal rectangles, clusters) in about one second. Future improvements in graphics hardware may make GPU implementations of Galois lattices competitive.</p>
      </abstract>
      <kwd-group>
        <kwd>software module clustering</kwd>
        <kwd>MDG</kwd>
        <kwd>close-by-one</kwd>
        <kwd>arithmetic intensity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Formal Concept Analysis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a well known technique for grouping objects
by the attributes they have in common. It can be thought of as discrete data
clustering. In general the number of conceptual clusters grows exponentially.
However there are a few specialised algorithms which render FCA manageable,
even on quite large problems, provided the object-attribute table is sparse [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Krajca, Outrata and Vychodil [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] report considerable improvement in FCA
algorithms in the last two decades. All these successful algorithms use depth
first tree search to find all the conceptual clusters in an object-attribute table.
      </p>
      <p>
        Computer graphics gaming cards (GPUs) are relatively cheap and yet offer
far more computing power than the computer’s CPU alone. (E.g. a 295 GTX
contains 480 fully functioning processors and yet costs only a few hundred pounds.)
Also microprocessor trends suggest faster computing will require parallel
computing in future. There are already hundreds of millions of computers fitted with
graphics hardware which might be used for general purpose computing [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Krajca et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] report using a distributed computer to overcome the “major
drawback [of FCA’s] computational complexity”. They report their parallel
algorithm PCbO gives near linear speed increase with number of computing nodes
in a network of up to 15 PCs. In other work [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] they conclude that there is
no universal best FCA data structure. Instead they suggest that the optimum
performance will depend upon the application. In earlier work, Huaiguo Fu had
created a parallel implementation of NextClosure but it was limited to 50
attributes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] but this was subsequently greatly extended [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, like Krajca
et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], both Fu’s [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] approaches use conventional distributed
computers composed of a few CPUs rather than hundreds of GPU processing elements.
Similarly Djoufak Kengue et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]’s ParCIM implementation used a
conventional network of 8 computers connected in a star fashion with MPI. Ours is the
first FCA implementation to run in parallel on computer graphics cards (GPUs).
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>CUDA FCA Implementation</title>
      <p>
        Although In-close [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] claims to be faster we easily obtained FCbO [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] from Source
Forge. We initially implemented the Krajca sequential algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in Python.
This was followed by a version in CUDA C, where ComputeClosure is
implemented in parallel on the GPU. (For details see our technical report [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].)
      </p>
      <p>Krajca’s routines ComputeClosure and GenerateFrom essentially form a depth
first search algorithm which builds and navigates a tree of formal concepts from
a binary 0/1 matrix describing which object has which property. Since the search
is recursive and operates on one point in the tree at one time, it is unsuitable for
parallel operation on graphics cards. Our graphics card parallel version retains
the tree but uses beam search rather than depth first search.</p>
      <p>
        Instead of proceeding to the first leaf of the tree, recursively backing up
and then going forward to the next leaf and so on, in beam search, we also
start from the top of the tree and then proceed along every branch to the next
level. This requires saving information on the beam for every node at that level.
Beam search next expands the search again to cover everything at the next level
and so on until all the leafs of the tree have been reached. Notice instead of
working on a single point in the tree the beam covers many points which can
be worked on in parallel. Indeed within a couple of levels we can get a beam
containing tens of thousands of individual search points which can be processed
independently. This suits the GPU architecture which needs literally thousands
of independent processing threads for it to deliver its best performance [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. You
will have spotted that in an exponential problem, like FCA, beam search quickly
runs out of memory.
      </p>
      <p>Even for quite modest tree depths the beam width is limited by the available
space in the GPU card. (We have a configuration limit of 1.8 million
simultaneous parallel operations.) When a beam search exceeds this limit, only the first
1.8 million searches are loaded onto the GPU and the rest of the beam is queued
on the host PC. (Although we have not done this, in multi-GPU systems it would
be possible to split the beam between the GPUs, allocating up to 1.8 million to
each GPU.) The GPU only searches to the next level. It returns the concepts
found by the searches and the newly discovered branches which remain to be
searched. The concepts are printed by the host PC and the new branches are
added to the end of the beam to await their turn. Effectively the beam becomes
a queue of points in the tree waiting to be searched. The number of parallel
searches is mostly limited by the need to have space on the GPU for all the
potential new branches. This depends upon the tree’s fan out which is problem
dependent. Nonetheless the GPU can manage modest real software engineering
examples (e.g. dependence clustering of the Linux kernel). Notice the beam will
contain a mixture of pending search points at different depths in the tree.
FCbO (version 2010/10/05) was downloaded and compiled without changes on
a 2.66 GHz PC with 3 Gigabytes of RAM running 64 bit CentOS 5.0. The
performance of FCbO, our Python code and our CUDA code on two types of GPU
are given in Table 1. They show performance on: two bench mark problems, a
selection of randomly generated symmetric object-attribute pairings and software
module dependency graphs of real world example programs.
4</p>
    </sec>
    <sec id="sec-3">
      <title>Discussion</title>
      <p>It is unclear why our code does not do better.</p>
      <p>We would expect a linear speed advantage for FCbO from both using 64 bit
operations and from using compiled rather than interpreted code. However on
sizable examples, the ratio between the speed of FCbO and that of our Python
code is huge. This hints that FCbO has some algorithmic advantage.</p>
      <p>
        GPUs are often limited by the time taken to move data rather than to
perform calculations. “Arithmetic intensity” is the ratio of calculations per data
item. Typically this is in the range 4–64 FLOP/TDE [2, p206], we estimate the
arithmetic intensity of Krajca et al.’s algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is less than 1. Thus a
potential problem might be there is simply is not enough computation required by
FCA compared to the volume of data.
      </p>
      <p>Newer versions of CUDA have make it easier to overlap GPU operations.
However our implementation does not do this. Since the work is spread across
the multi-processors, we suspect that idle time is not a major problem.
5</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>
        There are many problems which are traditionally solved by depth first search.
However this may not suit low cost computer graphics GPU hardware. We have
implemented a form of beam search and demonstrated it on several existing FCA
benchmarks and ten software engineering dependence clustering problems [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
GPU beam search may also be more widely applicable.
      </p>
      <p>Acknowledgements
I am grateful for the assistance of Gernot Ziegler of nVidia. Steve Worley,
Sarnath Kannan, Stephen Swift, Stan Seibert and Yuanyuan Zhang. Software
engineering MDGs were supplied by Spiros Mancoridis. Tesla donated by nVidia.
Funded by EPSRC grant EP/G060525/2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Andrews</surname>
          </string-name>
          .
          <article-title>In-close, a fast algorithm for computing formal concepts</article-title>
          .
          <source>In Conceptual Structures Tools Interoperability Workshop at the 17th International Conference on Conceptual Structures</source>
          , Moscow,
          <fpage>26</fpage>
          -
          <issue>31</issue>
          <year>July 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Schenk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Burkhart</surname>
          </string-name>
          .
          <article-title>Automatic code generation and tuning for stencil kernels on modern shared memory architectures</article-title>
          .
          <source>CSRD</source>
          ,
          <volume>26</volume>
          (
          <issue>3</issue>
          ):
          <fpage>205</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>B. Del</given-names>
            <surname>Rizzo</surname>
          </string-name>
          .
          <article-title>Dice puts faith in nvidia PhysX technology for Mirror's Edge. NVIDIA Corporation press release</article-title>
          ,
          <source>Nov</source>
          <volume>19</volume>
          2008.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J. Djoufak</given-names>
            <surname>Kengue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Valtchev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. Tayou</given-names>
            <surname>Djamegni</surname>
          </string-name>
          .
          <article-title>Parallel computation of closed itemsets and implication rule bases</article-title>
          .
          <source>In I. Stojmenovic</source>
          , et al., eds.,
          <source>ISPA</source>
          <year>2007</year>
          , LNCS
          <volume>4742</volume>
          ,
          <fpage>pp359</fpage>
          -
          <lpage>370</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Huaiguo</given-names>
            <surname>Fu</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Nguifo</surname>
          </string-name>
          .
          <article-title>A parallel algorithm to generate formal concepts for large data</article-title>
          . In P. Eklund, ed.,
          <source>ICFCA</source>
          , LNAI
          <volume>2961</volume>
          ,
          <fpage>pp141</fpage>
          -
          <lpage>142</lpage>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Huaiguo</given-names>
            <surname>Fu and M. O'Foghlu.</surname>
          </string-name>
          <article-title>A distributed algorithm of density-based subspace frequent closed itemset mining</article-title>
          .
          <source>In HPCC</source>
          ,
          <fpage>pp750</fpage>
          -
          <lpage>755</lpage>
          . IEEE,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ganter</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wille</surname>
          </string-name>
          .
          <source>Formal Concept Analysis</source>
          . Springer,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.</given-names>
            <surname>Harman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swift</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          .
          <article-title>An empirical study of the robustness of two module clustering fitness functions</article-title>
          . In H.-G. Beyer, et al., eds.,
          <string-name>
            <surname>GECCO</surname>
          </string-name>
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>P.</given-names>
            <surname>Krajca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Outrata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Vychodil</surname>
          </string-name>
          .
          <article-title>Parallel recursive algorithm for FCA</article-title>
          . In R. Belohlavek and
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          , eds.,
          <source>CLA</source>
          <year>2008</year>
          , Olomouc, Czech Republic.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>P.</given-names>
            <surname>Krajca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Outrata</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Vychodil</surname>
          </string-name>
          .
          <article-title>Parallel algorithm for computing fixpoints of Galois connections</article-title>
          . Ann Math Artif Intel,
          <volume>59</volume>
          :
          <fpage>257</fpage>
          -
          <lpage>272</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>P.</given-names>
            <surname>Krajca</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Vychodil</surname>
          </string-name>
          .
          <article-title>Comparison of data structures for computing formal concepts</article-title>
          . In V.
          <string-name>
            <surname>Torra</surname>
          </string-name>
          , et al., eds.,
          <source>MDAI</source>
          <year>2009</year>
          , LNCS
          <volume>5861</volume>
          ,
          <fpage>pp114</fpage>
          -
          <lpage>125</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. W. B.
          <string-name>
            <surname>Langdon</surname>
          </string-name>
          .
          <article-title>Graphics processing units and genetic programming: An overview</article-title>
          .
          <source>Soft Computing</source>
          ,
          <volume>15</volume>
          :
          <fpage>1657</fpage>
          -
          <lpage>1669</lpage>
          , Aug.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. W. B.
          <string-name>
            <surname>Langdon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Yoo</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Harman</surname>
          </string-name>
          .
          <article-title>Non-recursive beam search on GPU for formal concept analysis</article-title>
          .
          <source>RN/11/18</source>
          ,
          <string-name>
            <surname>Computer</surname>
            <given-names>Science</given-names>
          </string-name>
          , UCL, London, UK,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>