<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Clustering Enterprise Networks by Patent Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fulvio D´Antonio</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Orsini</string-name>
          <email>orsini@diiga.univpm.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Cucchiarelli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Velardi</string-name>
          <email>velardi@di.uniroma1.it</email>
        </contrib>
      </contrib-group>
      <fpage>10</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>The analysis of networks of enterprises can lead to some important insights concerning strategic aspects that can drive the decision making process of different players: business analysts, entrepreneurs, public administrators. In this paper we present the current development status of an integrated methodology to automatically extract enterprise networks from public textual data and analyzing them. We show an application to the enterprises operating in the Italian region of Marche.</p>
      </abstract>
      <kwd-group>
        <kwd>Social Network Analysis</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p> Partnership discovery
Individuating similar or complementary enterprises aimed at establishing
business/productive co-operations.
 Funds allocation
Analysis of productive trends and gaps, and setup of regional/national funding
schemes.</p>
      <p>But where the data about Networks of Enterprises come from?</p>
      <p>The usual scenario is that the graph structure of the network is not explicitly
available but has to be “distilled” from a dataset D, i.e., one has to infer the network
structure starting from such data by applying some processing steps.</p>
      <p>
        Let‟s examine, as an example, the case of networks whose (weighted) links
represent the degree of “similarity” between the nodes. We have two possibilities:
1. We can submit questionnaires to the actors involved asking them to estimate their
similarity with, let‟s say, one hundred of other enterprises. The similarity value
could be a real number in the range [
        <xref ref-type="bibr" rid="ref1">0,1</xref>
        ], a set of symbols (sequence of stars, for
example: * little, ** medium , *** high or no stars for no similarity) or similar
representations.
2. If we have some textual data available, e.g. papers, websites, product manuals etc.
we can use some form of natural language processing and information retrieval
metrics to (semi)-automatically estimate the similarity.
      </p>
      <p>The first approach is expensive, exposed to questionnaire‟s compiler subjectivity and
implies a series of practical issues: distribution of the questionnaires, commitment to
the questionnaire compilation in a given time and collection of the results.</p>
      <p>The second approach enjoys the benefits of the general wealth of publicly available
data and of automatic processing; everyone can search the web and obtain a great
number of information (mainly textual) about the enterprises under examination. The
drawbacks of this approach rely in the generally worse performance of natural
language processing systems with respect to humans. Humans seems to be better in
performing tasks like word-sense disambiguation, contextualizing judgement and
understanding the textual information.</p>
      <p>Hybrid approaches are also commonly adopted: an automatic NLP system interact
from time to time with humans that take decisions about some harsh points.</p>
      <p>Let‟s consider an enterprise interested in finding potential partners among the
enterprises in a given geographical area, that, in turn, requires to find partners with
similar interest. Even in small areas the enterprises, generally mostly SMEs,
(SmallMedium Enterprises) can easily be in the order of several hundreds. If we decide to
assign such task to a person we could apply the following strategy: we give him/her a
list of some hundreds of enterprise names and some thousands of documents and
related websites and we ask him/here to read the documents and surf the websites to
extract key information about the business/productive sector of the enterprise in order
to estimate from such information the degree of similarity and potential
collaboration. This task is clearly not feasible for a human. A valid support can come
from a carefully designed NLP system that can be supervised by the user and
occasionally corrected by him/her (e.g. eliminating non-relevant keywords in a
particular domain, individuating uncaught spelling variation, etc).
2</p>
    </sec>
    <sec id="sec-2">
      <title>Patent and Enterprise Networks</title>
      <p>In this section we describe how we have distilled Networks of Enterprises starting
from textual data publicly available about patents deposited by European enterprises.</p>
      <p>The European Patent Office (EPO)1 provides a uniform application procedure for
individual inventors and companies seeking patent protection in up to 40 European
countries. It is the executive arm of the European Patent Organisation and is
supervised by the Administrative Council. Through its web-site and exposed
webservices it is possible to access to information about European patents that have been
registered; the information include, among the other things, the date of presentation,
the applicant name and mission, the address of the applicant and the textual
description of the patent.</p>
      <p>
        The patents presented by an enterprise is a good indicator of the business sector in
which the enterprise operates. Therefore through the EPO database we can gather
textual data about the business/industrial sector of the enterprises in a given
geographical location and we can use such data to extract similarity networks. The
methodology we use is summarized in the following steps and it is similar to the ones
used in [
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ]:
1. Gather patents registered by enterprises located in a given geographical area (a
city, a region, a country, …);
2. Pre-process textual data to extract raw text;
3. Process raw text with a part-of-speech tagger;
4. Extract candidate annotating terms using a set of part-of-speech patterns [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ];
5. Rank candidates, possibly filter them choosing a threshold [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ];
6. Output a set of weighted vectors V of annotating terms for each documents;
7. Group the vectors by enterprise (that presented the patent applications) and
construct a centroid (i.e. a mean vector) with such groups. This centroid roughly
represents the business sector of the enterprise.
8. Build a graph computing a similarity function [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for each pair of centroids.
2.1
      </p>
      <p>Clustering</p>
      <p>
        Data Clustering [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], originally conceived in the data mining field, is a very active
research domain aiming at developing methods for dividing a set of data-points into
subsets (called clusters) so that points in the same cluster are similar in some sense.
We can use clustering techniques on our Enterprise Networks in order to discover
potentially interesting networks patterns and to filter noisy phenomena.
1 http://www.epo.org/
      </p>
      <p>
        One of the main drawbacks of clustering is the substantial lack of possibility of
validating results except for very special cases, e.g. when the distribution of data is
known (like a multivariate Gaussian) or we have access to other forms of ground
truth. In literature clustering validation is approached using internal and external
validity criteria: the external criteria rely on comparison with available ground truth
while the internal ones are constituted by metrics that estimate the internal coherence
of a cluster (inter-cluster similarity) and its substantial dissimilarity from other
clusters (intra-cluster dissimilarity). According to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], each clustering technique
should be evaluated in the context of a micro-economic setting, i.e. in maximizing an
objective function.
      </p>
      <p>We relax as much as possible the notion of clustering: given a set A, a clustering C
is a set of subsets of A, i.e. C  ( A) where P(A) is the power set of A. A crisp
clustering is a clustering with pairwise disjoint clusters and a partitive clustering is
when the union of clusters is A (  Ci  A ).</p>
      <p>CiC
Most of the clustering techniques developed concentrate on producing partitive crisp
clusterings.
2.2</p>
      <p>
        Graph clustering by mean of components density maximization
In this paper we use a very simple algorithm for graph clustering. Given a graph
G=(V,E) in which V is a set of vertices and E is a set of weighted edges (x,y,w) with
x,y in V e w in [
        <xref ref-type="bibr" rid="ref1">0,1</xref>
        ], we order the edges in E with respect to the weights obtaining the
sequence e1,…,e|E|. We then construct the sequence of graphs GS=G0,…G|E| in which
Gi=(V,{e1,…,ei}, i.e. the i-eth graph is the graph containing the top-i weighted edges.
The clusters are the connected components of each graph and each graph contains all
the others following in the sequence so that, therefore, we have a hierarchical
clustering.
      </p>
      <p>To choose a representative of this sequence we maximize the function scoring the
mean components density: for a graph we compute the density of each connected
component, we sum them and we divide by the number of components. The
(weighted) density of a connected graph is:
The mean components density is:
 w
d (G)  (x, y,w)EG
| V |
 
 2 
meand (G) </p>
      <p> d (C)</p>
      <p>CComponents(G)
| Components(G) |
And finally, we can choose the preferred clustering Gpref by maximizing meand:
G pref  arg max meand (Gi ) .</p>
      <p>GiGS
3</p>
    </sec>
    <sec id="sec-3">
      <title>Applications</title>
      <p>In figure we show a detail of the graph obtained by applying the described method to
the enterprises operating in the Italian region of Marche that registered European
patents. The graph has been clustered according to the algorithm in section 2.2.
In the figure we can visually locate a very dense cluster in the middle-left;
unfortunately an in deep analysis of this clusters reveals that it is consisting of all
enterprises that deposited patents in German language. At the beginning of the
experimentation we didn‟t notice that some patents descriptions are not written in
English language. This noisy phenomenon, anyway, emerged because of clustering
and we suggest that this can become one important use of clustering techniques:
locating “spam” clusters in order to eliminate them and iteratively refine the process.</p>
      <p>In the rest of the picture we notice a high degree of fragmentation: several very
small groups (2 or 3 elements) and rare bigger groups.</p>
      <p>We report here some examples of clusters:
 Moretti forni S.p.a
 Defendi Italy S.r.l
 Officine Meccaniche Defendi S.r.l
 S.o.m.i press</p>
      <p>In which the similarity links depend mainly on the terms: gas, flame, burner,
cooking. We can suppose this is a cluster consisting of cooking-furniture enterprises.</p>
      <p>Another cluster is constituted by:
 Best S.p.a
 Gitronica S.r.l
 Intec-s.r.l</p>
      <p>depending on the terms phone, microphone, voice, electronic component.
In general is very difficult to evaluate the quality of the produced clusters and we
performed only a qualitative analysis.</p>
      <p>A high level of fragmentation is, indeed, a problem. The utility of clustering in
general is to reduce the dimension of problems: if the number of clusters is
comparable with the number of elements we haven‟t performed any reduction at all
and the clustering is useless. As we performed just an initial experimentation we are
not able to say if the fragmentation observed is a real phenomenon in the application
domain or can be reduced by refining the techniques used in the various steps of the
process.</p>
      <p>Therefore, in the future, we plan to work on the following points:
 The NLP analysis tools and techniques we adopt are powerful enough to put in
light important similarities/differences in the domain studied?
 The data used are enough complete/noise-free/etc? If not, how can we perform
data cleaning and gather additional data?
 The clustering method proposed is comparable with respect to state-of-the-art
methods?
4</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ricardo</surname>
          </string-name>
          Baeza-Yates and
          <article-title>Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1st edn</article-title>
          ., May
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>T.</given-names>
            <surname>Elfring</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Hulsink</surname>
          </string-name>
          , „
          <article-title>Networking by entrepreneurs: Patterns of tie-formation in emerging organizations‟</article-title>
          ,
          <source>Organization Studies</source>
          ,
          <volume>28</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1849</fpage>
          -
          <lpage>1872</lpage>
          , (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Sclano</surname>
          </string-name>
          and Paola Velardi, „
          <article-title>Termextractor: a web application to learn the shared terminology of emergent web communities‟</article-title>
          ,
          <source>in Proceedings of the 3rd International Conference on Interoperability for Enterprise Software and Applications (I-ESA</source>
          <year>2007</year>
          ), Funchal, Portugal, (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Paola</given-names>
            <surname>Velardi</surname>
          </string-name>
          , Alessandro Cucchiarelli, and
          <string-name>
            <surname>Fulvio</surname>
            <given-names>D</given-names>
          </string-name>
          ‟Antonio, „
          <article-title>Monitoring the status of a research community through a knowledge map‟, Web Intelli</article-title>
          . and
          <string-name>
            <given-names>Agent</given-names>
            <surname>Sys</surname>
          </string-name>
          .,
          <volume>6</volume>
          (
          <issue>3</issue>
          ),
          <fpage>273</fpage>
          -
          <lpage>294</lpage>
          , (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Paola</given-names>
            <surname>Velardi</surname>
          </string-name>
          , Roberto Navigli, Alessandro Cucchiarelli, and
          <string-name>
            <surname>Fulvio</surname>
            <given-names>D</given-names>
          </string-name>
          ‟Antonio, „
          <article-title>A new content-based model for social network analysis‟</article-title>
          ,
          <source>in ICSC ‟08: Proceedings of the 2008 IEEE International Conference on Semantic Computing</source>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>25</lpage>
          , Washington, DC, USA, (
          <year>2008</year>
          ). IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Stanley</given-names>
            <surname>Wasserman</surname>
          </string-name>
          , Katherine Faust, and Dawn Iacobucci,
          <article-title>Social Network Analysis : Methods and Applications (Structural Analysis in the</article-title>
          <source>Social Sciences)</source>
          , Cambridge University Press,
          <year>November 1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Papadimitriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          .
          <article-title>A micro-economic view of data mining</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ),
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ian</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Witten</surname>
          </string-name>
          , Eibe Frank ,
          <source>Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)</source>
          , Morgan Kaufmann June 2005
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>