<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Methodology for Analyzing Web Search Results</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gloria</forename><surname>Bordogna</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Consiglio Nazionale delle Ricerche</orgName>
								<address>
									<settlement>Dalmine (BG)</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Psaila</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Facoltà di Ingegneria</orgName>
								<orgName type="institution">Università degli Studi di Bergamo</orgName>
								<address>
									<settlement>Dalmine (BG)</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Methodology for Analyzing Web Search Results</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8C67B2B14C023878E814A2C5F8422C3F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T09:03+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>clustered results</term>
					<term>soft aggregation operators</term>
					<term>Web exploration</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A methodology based on the use of soft aggregation operators for filtering shared contents between the results of distinct Web searches, organized into granules of distinct resolution, is described.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>This work aims at improving the exploitation and comprehension of the contents retrieved by multiple Web searches submitted to search engines <ref type="bibr" target="#b7">[8]</ref>. In previous works, we approached this objective in several ways: first by proposing operators to combine clustered results <ref type="bibr" target="#b0">[1]</ref>, then by automatically generating disambiguated queries from clusters <ref type="bibr" target="#b2">[3]</ref>, and finally by providing personalized facilities for re-ranking the clusters <ref type="bibr" target="#b1">[2]</ref>. All these approaches were defined within the Matrioshka project and implemented in the prototype system of the same name.</p><p>In this paper, we describe a methodology for exploring the results of several Web searches to filter out documents containing shared and correlated contents. Highlighting hidden content relationships between documents retrieved by distinct queries can help in understanding the topics dealt with in the documents' text and thus give new hints about their relevance <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b9">10]</ref>. To make this task feasible without accessing the full text of a retrieved document, our solution extracts the necessary information from the contents reported in the result lists provided by the search engines <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>. Then, to analyse the content relationships between the retrieved documents, we have defined soft operators based on fuzzy set theory <ref type="bibr" target="#b10">[11]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Soft Operators for combining granules of search results</head><p>The finest information granule we consider is the item i, representing a document in a ranked list retrieved by a search engine as the result of a query evaluation. An item i is defined by Uri_i, i.e., the Uniform Resource Identifier of the web document; by its Title_i and Snippet_i; and by Bag_i, a bag of strings (single terms), each weighted with a score in [0,1] expressing the significance of the string in representing the contents of the item. The strings in Bag_i are obtained by performing lexicographic analysis of Uri_i, Title_i and Snippet_i by applying Lucene functions, removing stop-words, conflating terms having the same stem, and expanding single terms with associated terms by using WordNet <ref type="bibr" target="#b6">[7]</ref>; all the selected single terms in Uri_i, Title_i and Snippet_i are then included in the bag of strings. Each string s in Bag_i is associated with a weight w_s ∈ [0,1]: an occurrence in the title counts as two occurrences in the snippet or the Uri, and the total number of occurrences of a string is then normalized with respect to the maximum weight of the strings in Bag_i. An item i also has an Irank_i ∈ [0,1] that expresses the estimated relevance of the retrieved Web document with respect to the query; it is computed as a function of the position of the item in the query result list, normalized by the list's length. Thus, Irank_i is independent of the actual relevance score computed by the search engine.</p><p>The intermediate information granule is the cluster c, which is a fuzzy set of items. 
It has a Label_c, which is the title of the most relevant item in the cluster <ref type="bibr" target="#b0">[1]</ref>, and a crank_c ∈ [0,1] that, by default, is defined as the average of the Iranks of its items, or can be computed based on personal preferences by evaluating some cluster properties, such as the cluster cardinality, novelty and heterogeneity <ref type="bibr" target="#b3">[4]</ref>. A cluster can be generated by applying an operator that combines two other clusters, or by a clustering operation. In this context, we do not focus on the clustering algorithm. To extract the features necessary to cluster the items, we parse the result list provided by the search engine, containing the first N results, and extract all the information that constitutes the representation of an item. In the Matrioshka system <ref type="bibr" target="#b1">[2]</ref>, Lingo clustering is applied <ref type="bibr" target="#b8">[9]</ref>. We are aware that the effectiveness of the proposed approach strongly depends on the clustering. Nevertheless, the combination of clusters can help to better understand the clusters' contents, and thus complements the information provided by a clustering algorithm.</p><p>The coarsest information granule is the group g, composed of ranked clusters. A group g has a Label_g that semantically synthesizes its main contents. A direct way to generate a group is to submit a query to a search engine and cluster the N top-ranked items in the result list. Alternatively, a group can be generated by an operator working on groups <ref type="bibr" target="#b0">[1]</ref>. When a group is generated by a query to a search engine, its label is the text of the query; otherwise it is the title of the most representative item of the group <ref type="bibr" target="#b0">[1]</ref>.</p><p>Note that the same web page retrieved by different search engines (or by different queries) may be represented by distinct items in distinct result lists. 
In this case, the document is uniquely identified by the same Uri, while it may have distinct Snippet, Bag and Irank. On the other hand, distinct web pages with distinct Uris may share the same or very similar Title and Snippet, because they are in fact duplicated documents at distinct web sites retrieved by the same query.</p><p>To filter documents retrieved by distinct searches that have different Snippet and Bag but the same Uri, we first introduced in <ref type="bibr" target="#b0">[1]</ref> the ranked intersection, RIntersection, and the ranked union, RUnion, operations, defined as the usual intersection and union of fuzzy sets, since clusters are regarded as fuzzy sets of ranked items. They are crisp operations that uniquely identify the items by their Uris, which are compared by exact matching. The membership degree of a resulting item is obtained as the minimum of the Iranks of the matching items for RIntersection, and as the maximum for RUnion. To obtain the Title, the Snippet and the Bag of a resulting item, we select those belonging to the item having the minimum Irank (in the case of RIntersection) or the maximum Irank (in the case of RUnion). By this choice we represent the cluster by its worst (best) representative in the case of intersection (union), in accordance with fuzzy set theory <ref type="bibr" target="#b10">[11]</ref>.</p><p>Nevertheless, it can happen that the same web page is duplicated at distinct sites, so two web pages may differ only in their Uris while sharing similar Titles, Snippets and Bags. Since RIntersection and RUnion match items only by exact Uri, such duplicated web pages are not recognized as the same document. This could be a limitation when one would like either to identify documents dealing with shared contents or to eliminate documents dealing with redundant contents. Let us consider, for example, the Expedia page of the same hotel retrieved in two different searches with two different booking dates. 
The two results refer to the same hotel on the same Web site, but they have different Uris. RIntersection considers these documents as distinct, even though their semantics is the same. This is the reason for introducing the soft operators between clusters <ref type="bibr" target="#b3">[4]</ref>. The soft intersection, SIntersection, and the soft union, SUnion, uniquely identify the ranked items by their bags, i.e., by fuzzy subsets of strings. A fuzzy relation between any two items can be defined to perform their partial matching, as for two fuzzy sets. Thus, SIntersection and SUnion are defined as the intersection and union of fuzzy sets of fuzzy sets <ref type="bibr" target="#b3">[4]</ref>.</p><p>In order to filter duplicated documents, the soft intersection between clusters can be applied. The soft intersection relaxes the ranked intersection, so that its resulting cluster includes the results of the ranked intersection plus other ranked items of the input clusters that share the most specific common contents, as represented by their bags of strings. Let us give a simple example. Given two documents, one dealing with Italian tourist places and the other with tourist places in the Mediterranean area, the second probably covers most of the places listed in the first, but the converse is unlikely, since the second document also contains places in countries other than Italy, such as Greece and Spain. So, the soft intersection retains only the shared contents, i.e., the first document on Italian places.</p><p>Conversely, the soft union restricts the ranked union, so that the resulting cluster is included in the results of the ranked union. SUnion generates a cluster that contains the results of the ranked intersection of the input clusters plus the most general ranked items that share common contents, as represented by their bags. 
Let us give an example. Suppose one wants a panoramic overview of Mediterranean tourist information and has two documents, one dealing with Italian tourist places and the other with tourist places in the Mediterranean area: the second is the more general one, and thus it is selected by the soft union. These operations between clusters are the basic building blocks on which the operators between groups of clusters were defined <ref type="bibr" target="#b0">[1]</ref>.</p></div>
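<div xmlns="http://www.tei-c.org/ns/1.0"><p>The item representation and the two ranked operations described in this section can be illustrated with a minimal Python sketch. It is not the Matrioshka implementation: the names Item, bag_weights, irank, r_intersection and r_union are hypothetical, clusters are modelled as plain lists of items, and the linear form of the Irank function is an assumption, since the text only states that Irank depends on the position normalized by the list length. The soft operators are not sketched, because their matching relation is defined elsewhere.</p>
<p><code lang="python">
```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Item:
    uri: str
    title: str
    snippet: str
    bag: dict      # term -> weight in [0, 1]
    irank: float   # membership degree in [0, 1]


def bag_weights(uri_terms, title_terms, snippet_terms):
    """A title occurrence counts as two snippet or Uri occurrences;
    counts are then divided by the largest count in the bag."""
    counts = Counter()
    for term in title_terms:
        counts[term] += 2
    for term in list(uri_terms) + list(snippet_terms):
        counts[term] += 1
    if not counts:
        return {}
    top = max(counts.values())
    return {term: n / top for term, n in counts.items()}


def irank(position, list_length):
    """Assumed linear form: 1.0 for the first result (position 1),
    1/list_length for the last one."""
    return (list_length - position + 1) / list_length


def r_intersection(c1, c2):
    """Keep Uris present in both clusters (exact match) and represent
    each one by the item with the lower Irank."""
    by_uri = {item.uri: item for item in c2}
    return [min(item, by_uri[item.uri], key=lambda x: x.irank)
            for item in c1 if item.uri in by_uri]


def r_union(c1, c2):
    """Keep every Uri; when both clusters hold it, keep the item with
    the higher Irank. The result is sorted by decreasing Irank."""
    best = {}
    for item in list(c1) + list(c2):
        current = best.get(item.uri)
        if current is None or item.irank > current.irank:
            best[item.uri] = item
    return sorted(best.values(), key=lambda x: x.irank, reverse=True)
```
</code></p>
<p>In this sketch, two items with the same Uri but different snippets collapse into a single item whose Title, Snippet and Bag come from the lower-ranked copy under r_intersection and from the higher-ranked copy under r_union, whereas two pages with different Uris are always kept apart, however similar their bags are.</p></div>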
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Conclusions</head><p>We have proposed a methodology for exploring the contents of results obtained within a Web search process, possibly by querying several search engines, and organized into information granules of distinct resolution (groups, clusters and single documents). The method is based on the application of soft operators that combine pairs of granules to filter documents with shared contents. Ongoing research aims at improving the understanding of the results yielded by the soft operators, by providing new directions of navigation within the set of retrieved documents.</p></div>		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A language for manipulating groups of clustered web documents results</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bordogna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Campi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Psaila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ronchi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 17th ACM CIKM&apos;08</title>
				<meeting>of the 17th ACM CIKM&apos;08</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="23" to="32" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A Cluster Manipulation Paradigm for Mobile Web Search Interaction</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bordogna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Campi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Psaila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ronchi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 1st IIR&apos;10</title>
				<meeting>of the 1st IIR&apos;10</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="53" to="57" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Query Disambiguation Based on Novelty and Similarity Users Feedback</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bordogna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Campi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Psaila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ronchi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of FQAS09, LNCS</title>
				<meeting>of FQAS09, LNCS</meeting>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="179" to="190" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Soft operators for exploring Information granules of Web search results</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bordogna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Psaila</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">submitted to the World Conference on Soft Computing</title>
				<meeting><address><addrLine>San Francisco</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2011">May 23-26, 2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Adaptive information retrieval: Using a connectionist representation to retrieve and learn about documents</title>
		<author>
			<persName><forename type="first">K</forename><surname>Belew</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 12th ACM SIGIR&apos;89</title>
				<meeting>of the 12th ACM SIGIR&apos;89</meeting>
		<imprint>
			<date type="published" when="1989">1989</date>
			<biblScope unit="page" from="11" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Clustering improves the exploration of graph mining results</title>
		<author>
			<persName><forename type="first">E</forename><surname>De Graaf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kosters</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of AII&apos;07, 247 of International Federation for Information Processing</title>
				<meeting>of AII&apos;07, 247 of International Federation for Information Processing</meeting>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="13" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">WordNet An Electronic Lexical Database</title>
		<editor>Fellbaum, C.</editor>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>The MIT Press</publisher>
			<pubPlace>Cambridge, MA; London</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">How are we searching the World Wide Web? A comparison of nine search engine transaction logs</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Spink</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing and Management</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="248" to="263" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A concept-driven algorithm for clustering search results</title>
		<author>
			<persName><forename type="first">S</forename><surname>Osinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weiss</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Intelligent Systems</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="48" to="54" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Information navigation on the web by clustering and summarizing query results</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Roussinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing and Management</title>
		<imprint>
			<biblScope unit="volume">37</biblScope>
			<biblScope unit="page" from="789" to="816" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Fuzzy sets</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Zadeh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information and control</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="338" to="353" />
			<date type="published" when="1965">1965</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
