<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Clustering Enterprise Networks by Patent Analysis</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Fulvio</forename><surname>D'Antonio</surname></persName>
							<email>dantonio@di.uniroma1.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Polytechnic University of Marche</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Simone</forename><surname>Orsini</surname></persName>
							<email>orsini@diiga.univpm.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Polytechnic University of Marche</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Cucchiarelli</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Polytechnic University of Marche</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paola</forename><surname>Velardi</surname></persName>
							<email>velardi@di.uniroma1.it</email>
							<affiliation key="aff1">
								<orgName type="institution">Sapienza Università di Roma</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Clustering Enterprise Networks by Patent Analysis</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">26370AAD463A4C746E35CA56D1FC745E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T18:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Social Network Analysis</term>
					<term>Natural Language Processing</term>
					<term>Clustering</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The analysis of networks of enterprises can lead to important insights concerning strategic aspects that can drive the decision-making process of different players: business analysts, entrepreneurs, public administrators. In this paper we present the current development status of an integrated methodology to automatically extract enterprise networks from public textual data and to analyze them. We show an application to the enterprises operating in the Italian region of Marche.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Networks of Enterprises <ref type="bibr" target="#b1">[2]</ref> are a special kind of social network in which the nodes represent enterprises and the links indicate some form of relationship among them.</p><p>The relationships that have traditionally been represented through links are business collaborations, enterprise similarity, mutual exchange of capital, information flows, or hierarchical relationships like the ones representing supply chains or the aggregation of enterprises into districts.</p><p>Social Network Analysis <ref type="bibr" target="#b5">[6]</ref> defines a number of measures and techniques that can be used for the evaluation and analysis of enterprise networks. Such measures, if examined by a business analyst, an entrepreneur or a public administrator, can lead to important insights concerning strategic aspects of the network.</p><p>We describe here a few scenarios in which the analysis can be conveniently applied:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Domain analysis</head><p>The analyst inspects the network in order to understand which are the main productive sectors, the groups of similar enterprises, the relative strengths of such groups and their inter-relationships.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Determining competitors</head><p>Mining non-cooperating similar enterprises which may be potential competitors in a given productive sector. Is the level of competition high or low? Is there potential for my enterprise to penetrate the market?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Partnership discovery</head><p>Identifying similar or complementary enterprises with which to establish business or production partnerships.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Funds allocation</head><p>Analysis of productive trends and gaps, and setup of regional/national funding schemes.</p><p>But where do the data about Networks of Enterprises come from?</p><p>The usual scenario is that the graph structure of the network is not explicitly available but has to be "distilled" from a dataset D, i.e., one has to infer the network structure from such data by applying some processing steps.</p><p>Let's examine, as an example, the case of networks whose (weighted) links represent the degree of "similarity" between the nodes. We have two possibilities:</p><p>1. We can submit questionnaires to the actors involved, asking them to estimate their similarity with, let's say, one hundred other enterprises. The similarity value could be a real number in the range [0,1], a set of symbols (for example a sequence of stars: * little, ** medium, *** high, or no stars for no similarity) or a similar representation.</p><p>2. If we have some textual data available, e.g. papers, websites, product manuals, etc., we can use some form of natural language processing and information retrieval metrics to (semi-)automatically estimate the similarity.</p><p>The first approach is expensive, is exposed to the subjectivity of whoever fills in the questionnaire, and raises a series of practical issues: distributing the questionnaires, securing a commitment to complete them within a given time, and collecting the results.</p><p>The second approach enjoys the benefits of the general wealth of publicly available data and of automatic processing; everyone can search the web and obtain a great deal of (mainly textual) information about the enterprises under examination. The drawback of this approach lies in the generally worse performance of natural language processing systems with respect to humans. Humans seem to be better at tasks like word-sense disambiguation, contextualizing judgement and understanding textual information.</p><p>Hybrid approaches are also commonly adopted: an automatic NLP system interacts from time to time with humans, who take decisions on the hard cases.</p><p>Let's consider an enterprise interested in finding potential partners among the enterprises in a given geographical area, which, in turn, requires finding partners with similar interests. Even in small areas the enterprises, mostly SMEs (Small and Medium Enterprises), can easily number several hundred. If we decide to assign such a task to a person, we could apply the following strategy: we give him/her a list of some hundreds of enterprise names and some thousands of documents and related websites, and we ask him/her to read the documents and surf the websites to extract key information about the business/productive sector of each enterprise, in order to estimate from such information the degree of similarity and potential collaboration. This task is clearly not feasible for a human. Valid support can come from a carefully designed NLP system that can be supervised by the user and occasionally corrected by him/her (e.g. by eliminating keywords that are not relevant in a particular domain, identifying uncaught spelling variations, etc.).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Patent and Enterprise Networks</head><p>In this section we describe how we have distilled Networks of Enterprises starting from publicly available textual data about patents filed by European enterprises.</p><p>The European Patent Office (EPO) <ref type="foot" target="#foot_0">1</ref> provides a uniform application procedure for individual inventors and companies seeking patent protection in up to 40 European countries. It is the executive arm of the European Patent Organisation and is supervised by the Administrative Council. Through its website and web services it is possible to access information about registered European patents; the information includes, among other things, the filing date, the applicant's name and mission, the applicant's address and the textual description of the patent.</p><p>The patents filed by an enterprise are a good indicator of the business sector in which the enterprise operates. Therefore, through the EPO database we can gather textual data about the business/industrial sector of the enterprises in a given geographical location, and we can use such data to extract similarity networks. The methodology we use is summarized in the following steps and is similar to the ones used in <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>:</p><p>1. Gather patents registered by enterprises located in a given geographical area (a city, a region, a country, …); 2. Pre-process textual data to extract raw text; 3. Process raw text with a part-of-speech tagger; 4. Extract candidate annotating terms using a set of part-of-speech patterns <ref type="bibr" target="#b2">[3]</ref>; 5. Rank candidates, possibly filtering them by a chosen threshold <ref type="bibr" target="#b2">[3]</ref>; 6. Output a set V of weighted vectors of annotating terms, one for each document; 7. Group the vectors by the enterprise that filed the patent applications and construct a centroid (i.e. a mean vector) for each group. This centroid roughly represents the business sector of the enterprise. 8. Build a graph by computing a similarity function <ref type="bibr" target="#b0">[1]</ref> for each pair of centroids.</p></div>
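<div xmlns="http://www.tei-c.org/ns/1.0"><p>Steps 6 to 8 above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the cosine measure is assumed here as one common similarity function from the information retrieval literature, and the function names (centroid, cosine, similarity_graph), the enterprise names and the term weights are all hypothetical.</p>
<p>
```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def centroid(vectors):
    """Mean of sparse term-weight vectors (dicts mapping term to weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(patent_vectors):
    """Steps 7 and 8: one centroid per enterprise, one weighted edge per pair."""
    centroids = {name: centroid(vecs) for name, vecs in patent_vectors.items()}
    edges = []
    for a, b in combinations(sorted(centroids), 2):
        weight = cosine(centroids[a], centroids[b])
        if weight > 0.0:  # assumption: pairs with no shared terms get no edge
            edges.append((a, b, weight))
    return edges


# Invented toy data: enterprise names and term weights are purely illustrative.
patents = {
    "A": [{"gas": 1.0, "burner": 1.0}, {"gas": 1.0, "flame": 1.0}],
    "B": [{"gas": 1.0, "burner": 1.0}],
    "C": [{"microphone": 1.0, "voice": 1.0}],
}
edges = similarity_graph(patents)
```
</p>
<p>In this toy input only enterprises A and B share terms, so the resulting graph has a single edge between them. Because the term weights are non-negative, the cosine value stays in the range [0,1].</p></div>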
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Clustering</head><p>Data Clustering <ref type="bibr" target="#b7">[8]</ref>, originally conceived in the data mining field, is a very active research domain aiming at developing methods for dividing a set of data points into subsets (called clusters) so that points in the same cluster are similar in some sense. We can use clustering techniques on our Enterprise Networks in order to discover potentially interesting network patterns and to filter noisy phenomena.</p><p>One of the main drawbacks of clustering is that its results can rarely be validated, except in very special cases, e.g. when the distribution of the data is known (like a multivariate Gaussian) or we have access to other forms of ground truth. In the literature, clustering validation is approached using internal and external validity criteria: the external criteria rely on comparison with available ground truth, while the internal ones consist of metrics that estimate the internal coherence of a cluster (intra-cluster similarity) and its substantial dissimilarity from other clusters (inter-cluster dissimilarity). According to <ref type="bibr" target="#b6">[7]</ref>, each clustering technique should be evaluated in the context of a micro-economic setting, i.e. in terms of maximizing an objective function.</p><p>We relax as much as possible the notion of clustering: given a set A, a clustering C is a set of subsets of A, i.e. C ⊆ P(A), where P(A) is the power set of A. A crisp clustering is a clustering with pairwise disjoint clusters, and a partitive clustering is one in which the union of the clusters is A (</p><formula xml:id="formula_0">\bigcup_{C_i \in C} C_i = A</formula><p>). Most of the clustering techniques developed concentrate on producing partitive crisp clusterings.</p></div>
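<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three definitions above can be checked mechanically. The following Python sketch is only an illustration under the assumption that clusters are represented as Python sets; the function names and the example sets are hypothetical.</p>
<p>
```python
from itertools import combinations


def is_clustering(clusters, universe):
    """True if every cluster is a subset of the set A (here: universe)."""
    return all(cluster.issubset(universe) for cluster in clusters)


def is_crisp(clusters):
    """True if the clusters are pairwise disjoint."""
    return all(a.isdisjoint(b) for a, b in combinations(clusters, 2))


def is_partitive(clusters, universe):
    """True if the union of the clusters is the whole set A."""
    return set().union(*clusters) == universe


A = {1, 2, 3, 4}
overlapping = [{1, 2}, {2, 3}]   # a clustering that is neither crisp nor partitive
partition = [{1, 2}, {3}, {4}]   # a partitive crisp clustering
```
</p>
<p>The first example is still a clustering in the relaxed sense, because the definition does not require the clusters to be disjoint or to cover A.</p></div>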
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Graph clustering by means of components density maximization</head><p>In this paper we use a very simple algorithm for graph clustering. Given a graph G=(V,E) in which V is a set of vertices and E is a set of weighted edges (x,y,w) with x,y in V and w in [0,1], we order the edges in E by decreasing weight, obtaining the sequence e_1, …, e_|E|. We then construct the sequence of graphs GS = G_0, …, G_|E| in which G_i = (V, {e_1, …, e_i}), i.e. the i-th graph is the graph containing the i highest-weighted edges. The clusters are the connected components of each graph, and each graph is contained in all those that follow it in the sequence; we therefore have a hierarchical clustering.</p><p>To choose a representative of this sequence we maximize the function scoring the mean components density: for a graph we compute the density of each connected component, we sum them and we divide by the number of components. The (weighted) density of a connected graph is:</p><formula xml:id="formula_1">d(G) = \frac{\sum_{(x,y,w) \in E} w}{\binom{|V|}{2}}</formula><p>The mean components density is:</p><formula xml:id="formula_2">\mathit{meand}(G) = \frac{\sum_{C \in \mathit{Components}(G)} d(C)}{|\mathit{Components}(G)|}</formula><p>And finally, we can choose the preferred clustering G_pref by maximizing meand:</p><formula xml:id="formula_3">G_{\mathit{pref}} = \arg\max_{G_i \in GS} \mathit{meand}(G_i)</formula></div>
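<div xmlns="http://www.tei-c.org/ns/1.0"><p>The procedure of this section can be sketched in a few lines of Python. This is an illustrative reading of the description, not the code used for the experiments. Two points are assumptions: a component with a single vertex is given density 0, since the text leaves that case undefined, and ties in the score are resolved in favour of the graph with fewer edges. The function names and the toy graph are hypothetical.</p>
<p>
```python
def components(vertices, edges):
    """Connected components via union-find; returns a list of vertex sets."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for x, y, _ in edges:
        parent[find(x)] = find(y)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def density(component, edges):
    """Sum of the edge weights inside a component over its |V| choose 2 pairs."""
    n = len(component)
    pairs = n * (n - 1) / 2
    if pairs == 0:  # assumption: a singleton component has density 0
        return 0.0
    weight = sum(w for x, y, w in edges if x in component and y in component)
    return weight / pairs


def mean_density(vertices, edges):
    """Average density over the connected components of the graph."""
    comps = components(vertices, edges)
    return sum(density(c, edges) for c in comps) / len(comps)


def preferred_clustering(vertices, edges):
    """Scan G_0, ..., G_|E| and keep the graph with the highest mean density."""
    ordered = sorted(edges, key=lambda e: e[2], reverse=True)
    best_score, best_clusters = -1.0, []
    for i in range(len(ordered) + 1):
        prefix = ordered[:i]  # edge set of G_i: the i highest-weighted edges
        score = mean_density(vertices, prefix)
        if score > best_score:  # strict comparison keeps the earliest maximum
            best_score, best_clusters = score, components(vertices, prefix)
    return best_score, best_clusters


# Invented toy graph: two strong pairs joined by one weak edge, plus an isolated node.
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 0.9), ("c", "d", 0.8), ("b", "c", 0.1)]
best_score, best_clusters = preferred_clustering(vertices, edges)
```
</p>
<p>On this toy graph the weak edge between b and c is rejected: adding it would merge the two pairs into one sparse component and lower the mean density, so the preferred clustering keeps the two pairs separate and leaves e on its own. The sketch recomputes the components for every prefix, which is adequate for small graphs; an incremental union-find would avoid the repeated work.</p></div>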
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Applications</head><p>In Figure 1 we show a detail of the graph obtained by applying the described method to the enterprises operating in the Italian region of Marche that registered European patents. The graph has been clustered according to the algorithm in section 2.2. In the figure we can visually locate a very dense cluster in the middle-left; unfortunately, an in-depth analysis of this cluster reveals that it consists of all the enterprises that filed patents in German. At the beginning of the experimentation we didn't notice that some patent descriptions are not written in English. This noise nevertheless emerged thanks to clustering, and we suggest that this can become one important use of clustering techniques: locating "spam" clusters in order to eliminate them and iteratively refine the process.</p><p>In the rest of the picture we notice a high degree of fragmentation: several very small groups (2 or 3 elements) and a few bigger groups.</p><p>We report here some examples of clusters:</p><p>• Moretti forni S.p.a • Defendi Italy S.r.l • Officine Meccaniche Defendi S.r.l • S.o.m.i press</p><p>Here the similarity links depend mainly on the terms: gas, flame, burner, cooking. We can suppose that this cluster consists of cooking-appliance enterprises.</p><p>Another cluster consists of: • Best S.p.a • Gitronica S.r.l • Intec-s.r.l, depending on the terms phone, microphone, voice, electronic component. In general it is very difficult to evaluate the quality of the produced clusters, and we performed only a qualitative analysis.</p><p>A high level of fragmentation is, indeed, a problem. The utility of clustering in general is to reduce the dimension of problems: if the number of clusters is comparable with the number of elements, we haven't performed any reduction at all and the clustering is useless. As we performed just an initial experimentation, we are not able to say whether the fragmentation observed is a real phenomenon in the application domain or can be reduced by refining the techniques used in the various steps of the process.</p><p>Therefore, in the future, we plan to work on the following points:</p><p>• Are the NLP analysis tools and techniques we adopt powerful enough to bring to light important similarities/differences in the domain studied? • Are the data used sufficiently complete and noise-free? If not, how can we perform data cleaning and gather additional data? • Is the proposed clustering method comparable with state-of-the-art methods?</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. The Network Of Enterprises of Region Marche (detail)</figDesc><graphic coords="5,126.38,300.90,346.37,168.75" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.epo.org/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Modern Information Retrieval</title>
		<author>
			<persName><forename type="first">Ricardo</forename><surname>Baeza-Yates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Berthier</forename><surname>Ribeiro-Neto</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999-05">May 1999</date>
			<publisher>Addison Wesley</publisher>
		</imprint>
	</monogr>
	<note>1st edn</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Networking by entrepreneurs: Patterns of tie-formation in emerging organizations</title>
		<author>
			<persName><forename type="first">T</forename><surname>Elfring</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hulsink</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Organization Studies</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="1849" to="1872" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Termextractor: a web application to learn the shared terminology of emergent web communities</title>
		<author>
			<persName><forename type="first">Francesco</forename><surname>Sclano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paola</forename><surname>Velardi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd International Conference on Interoperability for Enterprise Software and Applications (I-ESA 2007)</title>
				<meeting>the 3rd International Conference on Interoperability for Enterprise Software and Applications (I-ESA 2007)<address><addrLine>Funchal, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Monitoring the status of a research community through a knowledge map</title>
		<author>
			<persName><forename type="first">Paola</forename><surname>Velardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alessandro</forename><surname>Cucchiarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fulvio</forename><surname>D'Antonio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Web Intelli. and Agent Sys</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="273" to="294" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A new content-based model for social network analysis</title>
		<author>
			<persName><forename type="first">Paola</forename><surname>Velardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Navigli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alessandro</forename><surname>Cucchiarelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fulvio</forename><surname>D'Antonio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2008 IEEE International Conference on Semantic Computing</title>
				<meeting>the 2008 IEEE International Conference on Semantic Computing<address><addrLine>Washington, DC, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE Computer Society</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="18" to="25" />
		</imprint>
	</monogr>
	<note>ICSC '08</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Stanley</forename><surname>Wasserman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Katherine</forename><surname>Faust</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dawn</forename><surname>Iacobucci</surname></persName>
		</author>
		<title level="m">Social Network Analysis : Methods and Applications (Structural Analysis in the Social Sciences)</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="1994-11">November 1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A micro-economic view of data mining</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kleinberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Papadimitriou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Mining and Knowledge Discovery</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">Ian</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eibe</forename><surname>Frank</surname></persName>
		</author>
		<title level="m">Data Mining: Practical Machine Learning Tools and Techniques</title>
				<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="2005-06">June 2005</date>
		</imprint>
	</monogr>
	<note>Second Edition</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
