<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modeling Complex Structures in Graph-FCA: Illustration on Natural Language Syntax</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sébastien Ferré</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peggy Cellier</string-name>
          <email>peggy.cellier@irisa.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Graph-FCA, Complex Data, Data Modeling, N-ary Relation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Univ Rennes, INSA, CNRS, IRISA. Campus de Beaulieu</institution>
          ,
          <addr-line>35042 Rennes</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Graph-FCA is an extension of formal concept analysis for multi-relational data. In this paper, we discuss the freedom of representation ofered by Graph-FCA, in particular by its support of n-ary relations, considering natural language syntax as a use case. Published in Pablo Cordero, Ondrej Kridlo (Eds.): The 16ℎ International Conference on Concept Lattices and Their ∗Corresponding author.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Several extensions of formal concept analysis have been proposed to handle multi-relational data,
such as Relational Concept Analysis (RCA) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Graph-FCA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Graph-FCA adopts a graph
view of multi-relational data, seeing entities as nodes and relationships as directed edges between
nodes. Graph-FCA concept intents are therefore characterized and visualized as graph patterns.
Graph-FCA is the formal basis of concepts of neighbors, which have been applied to similarity
search in knowledge graphs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], knowledge graph completion [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and relation extraction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        A distinctive feature of Graph-FCA is to uniformly represent entity attributes, relations and
n-ary relations ( &gt; 2 ) through the notion of hyper-edge, whereas RCA diferentiates between
attributes and binary relations. N-ary relations ofer a lot of freedom in the representation of
complex structures: e.g., a ternary relation can be represented directly by one ternary relation
or decomposed into three binary relations, whereas RCA has only the latter option [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The objective of this paper is to demonstrate and discuss this freedom of representation in
GraphFCA. At the same time, the input and output format of the Graph-FCA tool is explained through
concrete examples. As a use case, we have chosen to analyze syntactic descriptions of English
sentences with the goal to discover common syntactic patterns. Such descriptions are complex
because they combine: tokens, words, part-of-speech (POS) tags, parse trees, and the relative
position of tokens and phrases. An important lesson learned is that a naive representation of structures
may lead to uninteresting patterns, and that n-ary relations are useful to alleviate the problem.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Graph-FCA Theory</title>
      <p>Graph-FCA defines a graph context as an incidence relation between tuples of objects and
attributes. A graph context is a triple  = (,, ) , where  is a set of objects (the nodes),  is
a set of attributes (the hyper-edge labels), and  ⊆  ∗ × is an incidence relation (the hyper-edges).
An hyper-edge (( 1,…,  ),) ∈  can be seen as the logical atom ( 1,…,  ) where attribute  is
used as a predicate with arity  (we use the atom notation for readability). Unary attributes ( = 1 )
are used to label nodes while binary attributes ( = 2 ) are used to label edges.</p>
      <p>In a graph concept, the extent is a set of  -tuples of objects, and the intent expresses what those
 -tuples of objects have in common, where  is the arity of the concept. Unary concepts ( = 1 )
are about sets of objects, while other concepts ( &gt; 1 ) are about relationships between objects. A
concept intent is a graph pattern with  distinguished nodes, called a Projected Graph Pattern (PGP),
and it can be seen as conjunctive query. For example, the PGP [(, ) ←  (,), ( ,) ]
has three nodes , , , two edges  (,) and  ( ,) , and two distinguished nodes , .
It can be used as a definition of the ”sibling” relationship, i.e. the fact that  and  have a common
parent  . A core concept is a concept whose graph pattern cannot be “simplified”, i.e. is not
homomorphic to a smaller graph pattern. Unlike RCA, there is a concept lattice for each concept
arity rather than for each object type. Classical FCA is a special case of Graph-FCA where  = 1
(only unary attributes) and  = 1 (only unary concepts).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Graph-FCA Implementation</title>
      <p>
        A Graph-FCA tool, called gfca, is available as a web application, a web service and a source code
(visit https://bitbucket.org/sebferre/graph-fca/). It is implemented in about 3000 lines of OCaml,
and a general presentation is given in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A reference manual details the command line options,
the input format, and the diferent output formats. The input format is inspired by  Prolog [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
an extension of Prolog to  -terms that uses a currified notation for logical atoms. For example,
the atom (  ,  ) is noted capital France Paris. More in general, an hyper-edge
( 1,…,  ) is noted   1 …   . The input format ofers a system of macros (attribute definitions)
and syntactic sugar to allow for more concise notations of graph contexts.
      </p>
      <p>The tool can output graph concepts (and optionally the graph context) in a textual format
close to the input format (option -txt), and/or in JSON (option -json), and/or a graphical format
using DOT and SVG (option -dot). Examples are shown below. There are several options to
control the display of concepts: e.g., -only-cores to only show core concepts; -ext to label
each unary concept by its extent (as a list of objects); -minsupp to only show the concepts with
minsupp objects in their extent.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Use Case: Modeling Complex Syntactic Structures</title>
      <p>Syntactic structures ofer an interesting use case for demonstrating and discussing the
representation of complex structures in Graph-FCA. To make the following illustrations fit in
the paper, we consider a small context made of only two short sentences: (1) “a black cat sits on
a mat”, and (2) “a dog barks at the red car”. To obtain syntactic information about those sentences,
we apply CoreNLP (https://corenlp.run/) to each of them. Figure 1 shows the two diferent parse
trees returned by CoreNLP: the dependency tree that labels tokens with POS tags and links them
with dependencies; and the constituency tree that aggregates tokens into phrases, and recursively
phrases into larger phrases.</p>
      <sec id="sec-4-1">
        <title>4.1. Labelled Directed Graphs</title>
        <p>:X1 : [a, DT],
X2 : [black, JJ],
X3 : [cat, NN, det X1, amod X2],
X4 : [sits, VBZ, nsubj X3, obl X7],
X5 : [on, IN],
X6 : [a, DT],</p>
        <p>X7 : [mat, NN, case X5, det X6].</p>
        <p>The dependency tree of a sentence is mathematically a labelled directed graph, with tokens as
nodes and dependencies as edges. Each node is labelled by a word (e.g., “cat”) and a POS tag (e.g.,
NN), and each edge is labelled by a dependency type (e.g., nsubj). A dependency tree therefore
admits an immediate representation as a graph context, with tokens as objects, words and POS
tags as unary attributes, and dependency types as binary attributes. Figure 2 (left) shows the
graphical notation of the first sentence in the gfca tool. Each box represents an object, with its
identifier at the top (e.g., X4, the 4th token), and its unary attributes at the bottom (e.g., sits, VBZ).</p>
        <p>In gfca’s input format, the first sentence can be written as follows, as a set of edges: sits
X4, VBZ X4, nsubj X4 X3, cat X3, NN X3, det X3 X1…To enable a more compact notation, the
input format ofers some syntactic sugar. Figure 2 (right) shows an object-centric notation, as an
alternative to the default edge-centric notation. Each line describes an object by listing all edges
starting at it. For instance, the set of edges cat X3, NN X3, det X3 X1, amod X3 X2 is compacted
into X3 : [cat, NN, det X1, amod X2] by factorizing on the first argument X3.</p>
        <p>Figure 3 shows the 4 distinct graph patterns of concept intents. Several concepts may share
the same pattern, only difering by the distinguished nodes. Here, as in the rest of this paper,
we applied the options -only-cores -dot -minsupp 2 to get the graphical representation of
core patterns that occur at least twice. The 4 graph patterns can be read as follows: (Q1) a noun
(NN) with the specific determiner ’a’; (Q2) a noun with a determiner ( det, DT); (Q3) a noun with
a determiner and an adjective modifier ( amod, JJ); (Q4) a verb (VBZ) with, as a subject (nsubj),
a noun with determiner ’a’, and as an oblique complement (obl), a noun with a preposition (case,
IN) and a determiner. Q4 abstracts over the two sentences, showing what they have in common.
Q2 abstracts over all noun phrases, showing that all nouns have a determiner. Q1 and Q3 are
specializations of Q2, respectively with a specific determiner and an adjective.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Rooted Trees</title>
        <p>The constituency tree is mathematically a rooted tree. It can be seen as a directed graph with edges
from each phrase to its constituents. Compared to the dependency tree, the nodes represent not
only tokens but also phrases, and there is only one type of edge that represents a part-of hierarchy
over the nodes. The nodes are labelled with words and phrase types (e.g., S, NP). In our Graph-FCA
representation, we choose to merge the atomic phrases and the tokens, and we introduce a binary
attribute part as an edge label. Figure 4 (left) shows the graphical notation for the first sentence
(X - denotes the phrase spanning from token X to X ). Figure 5 shows the 7 found graph patterns,
among which patterns Q3, Q4, Q6 and Q7 respectively correspond to the four patterns found on
the dependency trees. The three other patterns have no other attribute than part. They only
reflect the structural shape of trees, and do not bring information on English syntax.</p>
        <p>In order to avoid uninteresting patterns, all attributes should be meaningful for the domain.
It is therefore desirable to eliminate the part attribute, while retaining the tree structure. We
observe that, whenever there is an edge (part    ℎ ), e.g., part X1-7 X4-7, there is also
a unique edge ( ℎ ) where  is a phrase type, e.g., VP X4-7. A solution is therefore to merge
the two edges into a domain-specific edge (     ℎ ), e.g. VP X1-7 X4-7. Each unary
attribute  is therefore lifted into a binary attribute that encodes for both the tree structure and
the phrase type. Figure 4 (right) shows the new representation for the first sentence. Note that
phrase types now appear as edge labels because they are now binary attributes. It can be verified
that only the four interesting graph patterns are now found (omitted for lack of space).</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Constituency Trees: Handling Relative Positions</title>
        <p>As one can see in the above figures, the previous representations adequately handle the part-of
aspect of constituency trees but not the relative positions of phrases, which is a key aspect of
syntax in general. For instance, in a noun phrase, the determiner comes first, not at an arbitrary
position, and the adjective comes before the noun. An efective way to handle relative positions
and ordering is to describe each phrase by its starting and ending position. In Graph-FCA terms,
we introduce new objects for the ofset positions ( I0..I7 for the first sentence), and two new
binary attributes start and end for linking phrases to their start/end positions. Position I is
the position right after the  th token. For instance, the new edges for X1-3 are start X1-3 I0
and end X1-3 I3. However, like with part, the start and end attributes are purely structural,
and lead to uninteresting patterns saying that phrases end where other phrases start and the
like. We can eliminate them by observing that edges (start    ) and (end    ) come with one
and only one edge (  ′  ). We can therefore merge them into a single quaternary edge (  ′
     ) that expresses at the same time the phrase type, the tree structure, and the position in the
sentence. For instance, the description X1-7 : [NP I0 X1-3 I3, VP I3 X4-7 I7] says that phrase
X1-7 is composed by a NP from position 0 to 3, followed by a VP from position 3 to 7.</p>
        <p>From there, we obtain the same four patterns, except that we now have additional information
about the relative position of constituents. The patterns show that all noun phrases start with
a determiner, and end with a noun. In some pattern, the determiner and the noun are adjacent
while in another pattern, there is a gap between the two.</p>
        <p>To conclude, we note that the elimination of structural attributes implies an increase of the
arity of domain-specific attributes. In our example, the arity of phrase types goes from unary
to binary when eliminating part, and then to quaternary when eliminating start and end.
The ability of Graph-FCA to handle n-ary edges/attributes is here a key feature. Without them,
one would be forced to keep the structural attributes, and this would result in possibly many
uninteresting concepts and patterns.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research is supported by ANR project SmartFCA (ANR-21-CE23-0023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Ayats</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cellier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferré</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Extracting relations in texts with concepts of neighbours</article-title>
          .
          <source>In: Formal Concept Analysis</source>
          . pp.
          <fpage>155</fpage>
          -
          <lpage>171</lpage>
          . LNCS 12733, Springer (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Belleannée</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brisset</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ridoux</surname>
            ,
            <given-names>O.:</given-names>
          </string-name>
          <article-title>A pragmatic reconstruction of  prolog</article-title>
          .
          <source>The Journal of Logic Programming</source>
          <volume>41</volume>
          ,
          <fpage>67</fpage>
          -
          <lpage>102</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Ferré</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Answers partitioning and lazy joins for eficient query relaxation and application to similarity search</article-title>
          . In: Gangemi,
          <string-name>
            <surname>A.</surname>
          </string-name>
          , et al. (eds.)
          <source>The Semantic Web (ESWC)</source>
          . pp.
          <fpage>209</fpage>
          -
          <lpage>224</lpage>
          . LNCS 10843, Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Ferré</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cellier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <string-name>
            <surname>Graph-FCA</surname>
          </string-name>
          :
          <article-title>An extension of formal concept analysis to knowledge graphs</article-title>
          .
          <source>Discrete Applied Mathematics</source>
          <volume>273</volume>
          ,
          <fpage>81</fpage>
          -
          <lpage>102</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ferré</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Application of concepts of neighbours to knowledge graph completion</article-title>
          .
          <source>Data Science: Methods, Infrastructure, and Applications 4</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Keip</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferré</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutierrez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huchard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silvie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Practical comparison of FCA extensions to model indeterminate value of ternary data</article-title>
          .
          <source>In: Concept Lattices and their Applications (CLA)</source>
          .
          <source>vol. CEUR 2668</source>
          , pp.
          <fpage>197</fpage>
          -
          <lpage>208</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Rouane-Hacene</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huchard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valtchev</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Relational concept analysis: mining concept lattices from multi-relational data</article-title>
          .
          <source>Annals of Mathematics and Artificial Intelligence</source>
          <volume>67</volume>
          (
          <issue>1</issue>
          ),
          <fpage>81</fpage>
          -
          <lpage>108</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>