<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Methodology for Analyzing Web Search Results</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gloria Bordogna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Psaila</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consiglio Nazionale delle Ricerche</institution>
          ,
          <addr-line>Dalmine (Bg)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Bergamo, Facoltà di Ingegneria</institution>
          ,
          <addr-line>Dalmine (BG)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A methodology based on the use of soft aggregation operators for filtering shared contents between the results of distinct Web searches, organized into granules of distinct resolution, is described.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        This work aims at improving the exploitation and comprehension of the
contents retrieved by multiple Web searches submitted to search engines [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In previous work, we approached this objective in several ways: first
by proposing operators to combine clustered results [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], then by automatically generating disambiguated queries from
clusters [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and finally by providing personalized facilities for re-ranking the
clusters [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. All these approaches were defined within the Matrioshka project and
implemented in the prototype system of the same name.
      </p>
      <p>
        In this paper, we describe a methodology for exploring the results of
several Web searches to filter out documents containing shared and
correlated contents. Highlighting hidden content relationships between
documents retrieved by distinct queries can help in understanding the topics
dealt with in the documents' text and thus give new hints about their
relevance [
        <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
        ]. To make this task feasible without accessing the full text of a
retrieved document, our solution extracts the necessary information from
the contents reported in the result lists provided by the search
engines [
        <xref ref-type="bibr" rid="ref6 ref8 ref9">6, 8, 9</xref>
        ]. Then, to analyse the content relationships between the
retrieved documents, we have defined soft operators based on fuzzy set
theory [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Soft Operators for Combining Granules of Search Results</title>
      <p>
        The finest information granule we consider is the item i, representing a
document in a ranked list retrieved by a search engine as the result of a query
evaluation. An item i is defined by its Uri<sub>i</sub>, i.e., the Uniform Resource
Identifier of the Web document, its Title<sub>i</sub>, its Snippet<sub>i</sub>, and its
Bag<sub>i</sub>, a bag of strings (single terms), each weighted with a score in [0,1]
expressing the significance of the string in representing the contents of the
item. The strings in Bag<sub>i</sub> are obtained by performing a lexical analysis of
Uri<sub>i</sub>, Title<sub>i</sub> and Snippet<sub>i</sub> with Lucene functions: removing
stop-words, conflating terms having the same stem, and expanding single terms
with associated terms by using WordNet [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; then, all the selected single terms in Uri<sub>i</sub>, Title<sub>i</sub> and
Snippet<sub>i</sub> are included in the bag of strings. Each string s in Bag<sub>i</sub> is
then associated with a weight w<sub>s</sub> ∈ [0,1]: an occurrence in the title counts
as two occurrences in the snippet or Uri, and the total number of occurrences
of a string is then normalized with respect to the maximum weight of the
strings in Bag<sub>i</sub>. An item i also has an Irank<sub>i</sub> ∈ [0,1] that expresses the
estimated relevance of the retrieved Web document with respect to the query,
and is computed as a function of the position of the item in the query result
list normalized by the list’s length. Thus, Irank<sub>i</sub> is independent of the
actual relevance score computed by the search engine.
      </p>
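      <p>As a minimal sketch of the weighting just described, the following Python fragment builds the weighted bag of an item and a position-based Irank. The names <monospace>build_bag</monospace> and <monospace>irank</monospace>, the pre-tokenized input lists, and the linear formula (N - position + 1) / N are illustrative assumptions, not the definitions used in Matrioshka; the Lucene analysis, stemming, stop-word removal and WordNet expansion steps are omitted.</p>
      <preformat>
```python
from collections import Counter


def build_bag(uri_terms, title_terms, snippet_terms):
    """Return a dict mapping each string to a weight in [0, 1].

    Assumes the three term lists are already analysed (stop-words
    removed, terms stemmed); that preprocessing is not shown here.
    """
    counts = Counter()
    # An occurrence in the Uri or in the snippet counts once.
    counts.update(uri_terms)
    counts.update(snippet_terms)
    # An occurrence in the title counts as two occurrences.
    for term in title_terms:
        counts[term] += 2
    if not counts:
        return {}
    top = max(counts.values())
    # Normalize by the maximum so the heaviest string has weight 1.
    return {term: n / top for term, n in counts.items()}


def irank(position, list_length):
    """Hypothetical Irank: 1-based position normalized by list length."""
    return (list_length - position + 1) / list_length


bag = build_bag(["hotel"], ["hotel", "rome"], ["rome", "cheap", "hotel"])
print(bag)
print(irank(1, 10), irank(10, 10))
```
      </preformat>
      <p>In this toy example, "hotel" occurs once in the Uri, once in the snippet and once in the title (counted twice), so it receives the maximum weight 1, while "rome" and "cheap" receive 0.75 and 0.25. Under the assumed linear formula, the first of ten results has Irank 1 and the last has Irank 0.1.</p>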
      <p>
        The intermediate information granule is the cluster c, that is, a fuzzy set
of items. It has a Label<sub>c</sub>, which is the title of the most relevant item
in the cluster [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and a crank<sub>c</sub> ∈ [0,1] that, by default, is defined as the average
of the Iranks of its items, or can be computed based on personal preferences
evaluating some cluster properties, such as the cluster cardinality, novelty and
heterogeneity [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A cluster can be generated by applying an operator that
combines two other clusters, or by a clustering operation. In this context, we
do not focus on the clustering algorithm. To extract the features necessary to
cluster the items, we parse the result list provided by the search engine,
containing the first N results, and extract all the information that constitutes
the representation of an item. In the Matrioshka system [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Lingo clustering is
applied [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We are aware that the effectiveness of the proposed approach
strongly depends on the clustering. Nevertheless, the combination of clusters
can help to better understand the clusters’ contents, and thus complements the
information provided by a clustering algorithm.
      </p>
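      <p>The default cluster rank and label can be sketched as follows. This is only an illustration: items are modelled as plain dictionaries with hypothetical keys <monospace>title</monospace> and <monospace>irank</monospace>, "most relevant" is assumed to mean "highest Irank", and the value returned for an empty cluster is an arbitrary choice.</p>
      <preformat>
```python
def crank(iranks):
    """Default cluster rank: the average of the Iranks of its items."""
    if not iranks:
        return 0.0  # assumption: an empty cluster gets rank 0
    return sum(iranks) / len(iranks)


def cluster_label(items):
    """Title of the item with the highest Irank (first one on ties)."""
    return max(items, key=lambda item: item["irank"])["title"]


# Hypothetical cluster with two items.
items = [
    {"title": "Rome hotels", "irank": 1.0},
    {"title": "Cheap flights", "irank": 0.5},
]
print(crank([item["irank"] for item in items]))
print(cluster_label(items))
```
      </preformat>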
      <p>
        The coarsest information granule is the group g, composed of ranked
clusters. A group g has a Label<sub>g</sub> that semantically synthesizes its main
contents. A direct way to generate a group is to submit a query to a search
engine and cluster the N top-ranked items in the result list. Alternatively, a
group can be generated by an operator working on groups [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. When a group is generated by a query to
a search engine, its label is the text of the query; otherwise, it is the title of the
most representative item of the group [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Notice that the same web page retrieved by different search engines (or
by different queries) may be represented by distinct items in distinct result
lists. In this case, the document is uniquely identified by the same Uri,
while it may have distinct Snippet, Bag and Irank. On the other hand, distinct
web pages with distinct Uris may share the same or very similar Title and
Snippet, because they are in fact duplicated documents at distinct web sites
retrieved by the same query.</p>
      <p>
        To filter documents retrieved by distinct searches that have different Snippet
and Bag but the same Uri, we first introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the ranked intersection,
RIntersection, and the ranked union, RUnion, defined as the usual
intersection and union of fuzzy sets, since clusters are regarded as fuzzy sets of
ranked items. They are crisp operations that uniquely identify the items by their
Uris, which are compared by exact matching. The membership degree
of a resulting item is obtained as the minimum (for RIntersection) or the
maximum (for RUnion) of the Iranks of the matching items. To obtain the Title, the
Snippet and the Bag of the resulting item, we select those belonging to the
document having the minimum Irank (in the case of RIntersection) or the maximum
Irank (in the case of RUnion). By this choice, we represent the cluster by its
worst (best) representative in the case of intersection (union), in accordance with
fuzzy set theory [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
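      <p>The two ranked operations can be sketched in a few lines of Python. The sketch assumes that a cluster is a list of items, that each item is a dictionary with hypothetical keys <monospace>uri</monospace>, <monospace>title</monospace> and <monospace>irank</monospace>, and that a Uri occurs at most once per cluster; the function names are illustrative and do not reproduce the Matrioshka implementation.</p>
      <preformat>
```python
def r_intersection(cluster_a, cluster_b):
    """Items whose Uri occurs in both clusters; keep the lower-Irank copy."""
    by_uri_b = {item["uri"]: item for item in cluster_b}
    result = []
    for item in cluster_a:
        other = by_uri_b.get(item["uri"])  # exact Uri matching
        if other is not None:
            # Minimum of the Iranks; the other fields come from that copy.
            result.append(min(item, other, key=lambda it: it["irank"]))
    return result


def r_union(cluster_a, cluster_b):
    """All Uris of either cluster; keep the higher-Irank copy of shared ones."""
    best = {}
    for item in cluster_a + cluster_b:
        current = best.get(item["uri"])
        if current is None or item["irank"] > current["irank"]:
            best[item["uri"]] = item
    return list(best.values())


# Hypothetical clusters sharing the Uri "u1".
a = [{"uri": "u1", "title": "A1", "irank": 0.9},
     {"uri": "u2", "title": "A2", "irank": 0.4}]
b = [{"uri": "u1", "title": "B1", "irank": 0.6},
     {"uri": "u3", "title": "B3", "irank": 0.8}]
print([it["title"] for it in r_intersection(a, b)])
print([it["title"] for it in r_union(a, b)])
```
      </preformat>
      <p>Here the only shared Uri is "u1": the intersection keeps the copy from the second cluster (Irank 0.6, the worst representative), whereas the union keeps the copy from the first cluster (Irank 0.9, the best representative) together with the unshared items.</p>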
      <p>Nevertheless, it can happen that the same web page is duplicated at distinct
sites, so that two web pages differ only in their Uris while sharing
similar Titles, Snippets and Bags. With the RIntersection and RUnion operations,
duplicated web pages are filtered out from the results. This can be a
limitation when one would like either to identify documents dealing with
shared contents or to eliminate documents dealing with redundant contents. Let
us consider, for example, the Expedia page of the same hotel retrieved in
two different searches with two different booking dates. The two results refer to the
same hotel on the same Web site, but they have different Uris. RIntersection
considers these documents as distinct, even though their semantics is the same.</p>
      <p>
        This is the reason for introducing the soft operators between clusters [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The soft intersection, SIntersection, and the soft union, SUnion, uniquely
identify the ranked items by their bags, i.e., by fuzzy subsets of strings. A
fuzzy relation between any two items can be defined to perform their partial
matching, as for two fuzzy sets. Thus, SIntersection and SUnion are defined as
the intersection and union of fuzzy sets of fuzzy sets [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In order to filter duplicated documents, the soft intersection between clusters
can be applied. The soft intersection relaxes the ranked intersection, so that its
resulting cluster includes the results of the ranked intersection plus other
ranked items of the input clusters that share the most specific common contents,
as represented by their bags of strings. Let us give a simple example. Given two
documents, one dealing with Italian tourist places and the other with tourist
places in the Mediterranean area, the second probably covers most of the places listed
in the first, but the converse is unlikely, since the second
document also contains places in countries other than Italy, such as Greece
and Spain. So, the soft intersection retains only the shared contents, i.e.,
the first document on Italian places.</p>
      <p>
        Conversely, the soft union restricts the ranked union, so that the resulting
cluster is included in the results of the ranked union. SUnion generates a cluster
that contains the results of the ranked intersection of the input clusters plus the
most general ranked items that share common contents, as represented by their
bags. For example, suppose one wants a panoramic overview of
Mediterranean tourist information. Given two documents, one dealing with
Italian tourist places and the other with tourist places in the Mediterranean
area, the second is the most general one and is thus selected by the soft
union. These operations between clusters are the basic building blocks on which the
operators between groups of clusters were defined [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
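      <p>The exact definitions of the soft operators are not given here. The following Python sketch only illustrates the idea of partial matching between two items on the tourist example. It assumes a standard fuzzy inclusion degree (the sum of the minimum weights over the sum of the weights of the included bag) and an arbitrary matching threshold; the names <monospace>inclusion</monospace> and <monospace>soft_pick</monospace> are hypothetical, and the handling of Iranks is left out.</p>
      <preformat>
```python
def inclusion(bag_x, bag_y):
    """Degree in [0, 1] to which the fuzzy set bag_x is contained in bag_y."""
    total = sum(bag_x.values())
    if total == 0:
        return 0.0
    shared = sum(min(w, bag_y.get(s, 0.0)) for s, w in bag_x.items())
    return shared / total


def soft_pick(item_a, item_b, threshold=0.5, most_specific=True):
    """Pick one of two items that share contents, or None if they do not.

    most_specific=True mimics the soft intersection (keep the item whose
    bag is more included in the other); False mimics the soft union.
    The threshold value is an arbitrary assumption.
    """
    a_in_b = inclusion(item_a["bag"], item_b["bag"])
    b_in_a = inclusion(item_b["bag"], item_a["bag"])
    if threshold > max(a_in_b, b_in_a):
        return None  # not enough shared content
    a_is_more_specific = a_in_b >= b_in_a
    if most_specific:
        return item_a if a_is_more_specific else item_b
    return item_b if a_is_more_specific else item_a


# Hypothetical items for the tourist example.
italy = {"title": "Italian tourist places",
         "bag": {"rome": 1.0, "venice": 1.0}}
med = {"title": "Tourist places in the Mediterranean",
       "bag": {"rome": 1.0, "venice": 1.0, "athens": 1.0, "barcelona": 1.0}}
print(soft_pick(italy, med)["title"])
print(soft_pick(italy, med, most_specific=False)["title"])
```
      </preformat>
      <p>In this example, the Italian bag is fully included in the Mediterranean bag (degree 1), while the converse inclusion is only 0.5, so the sketch returns the Italian document in the intersection-like mode and the Mediterranean document in the union-like mode.</p>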
    </sec>
    <sec id="sec-3">
      <title>3 Conclusions</title>
      <p>A methodology has been proposed for exploring the contents of results
organized into information granules of distinct resolution (groups, clusters and
single documents) and obtained within a Web search process by querying possibly
several search engines. The method is based on the
application of soft operators that combine pairs of granules to filter
documents with shared contents. Ongoing research is aimed at improving
the understanding of the results yielded by the soft operators by providing
new directions of navigation within the set of retrieved documents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bordogna</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Psaila</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ronchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>A language for manipulating groups of clustered web documents results</article-title>
          ,
          <source>In Proc. of the 17th ACM CIKM'08</source>
          ,
          <fpage>23</fpage>
          -
          <lpage>32</lpage>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bordogna</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Psaila</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ronchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>A Cluster Manipulation Paradigm for Mobile Web Search Interaction</article-title>
          .
          <source>In Proc. of the 1st IIR'10</source>
          ,
          <fpage>53</fpage>
          -
          <lpage>57</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bordogna</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Psaila</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ronchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Query Disambiguation Based on Novelty and Similarity Users Feedback</article-title>
          ,
          <source>in Proc. of FQAS09</source>
          , LNCS, Springer Verlag,
          <fpage>179</fpage>
          -
          <lpage>190</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bordogna</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Psaila</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <article-title>Soft operators for exploring Information granules of Web search results</article-title>
          , submitted to the World Conference on Soft Computing, San Francisco, May 23-26 (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Belew</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Adaptive information retrieval: Using a connectionist representation to retrieve and learn about documents</article-title>
          .
          <source>In Proc. of the 12th ACM SIGIR'89</source>
          ,
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          (
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>de Graaf</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kok</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kosters</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <article-title>Clustering improves the exploration of graph mining results</article-title>
          .
          <source>In Proc. of AII'07, vol. 247 of International Federation for Information Processing</source>
          , Springer Verlag,
          <fpage>13</fpage>
          -
          <lpage>20</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (Ed.)
          <source>WordNet: An Electronic Lexical Database</source>
          . Cambridge, MA; London: The MIT Press.
          (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Spink</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>How are we searching the World Wide Web? A comparison of nine search engine transaction logs</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>42</volume>
          ,
          <fpage>248</fpage>
          -
          <lpage>263</lpage>
          . (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Osinski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>A concept-driven algorithm for clustering search results</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>20</volume>
          ,
          <fpage>48</fpage>
          -
          <lpage>54</lpage>
          . (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Roussinov</surname>
            ,
            <given-names>D. G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <article-title>Information navigation on the web by clustering and summarizing query results</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>37</volume>
          ,
          <fpage>789</fpage>
          -
          <lpage>816</lpage>
          . (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          <article-title>Fuzzy sets</article-title>
          .
          <source>Information and control</source>
          ,
          <volume>8</volume>
          ,
          <fpage>338</fpage>
          -
          <lpage>353</lpage>
          . (
          <year>1965</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>