<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovering user groups in professional image search</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: M. Lupu, M. Salampasis, N. Fuhr, A. Hanbury, B. Larsen, H. Strindberg (eds.): Proceedings of the Integrating IR technologies for Professional Search Workshop</institution>
          ,
          <addr-line>Moscow, Russia, 24-March-2013, published at</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Theodora Tsikrika Royal School of Library and Information Science</institution>
          ,
          <addr-line>Copenhagen</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <fpage>92</fpage>
      <lpage>99</lpage>
      <abstract>
        <p>This study aims at gaining insights into user group identification in professional image search. The user groups are built by analysing the search logs recorded by a commercial picture portal for a sample of 170 users, in conjunction with the users' occupational and topical profile information, and a topical classification of the available images. Our analysis indicates that the examined groupings are meaningful and that there is variation among the groups in what people searched for and in what people considered relevant.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>differs. To this end, they analysed large-scale web search logs in terms of the users’ query topics and/or session
characteristics, together with the users’ demographic profile information augmented with U.S. census data. Their
results showed that it is possible to identify distinct patterns of behaviour along different demographic features,
a finding that could be exploited in many applications, such as sponsored search.</p>
      <p>
        Our study also focusses on analysing the searching behaviour of user groups, but in a context that differs from
the search environments examined in all of the above work in at least one of the following aspects: (i) ours is
a professional, rather than a web, environment, and (ii) it is oriented towards image, rather than text retrieval.
Furthermore, we examine occupational and topical features for discovering user groups, similarly to [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], but
consider the users’ log activity rather than data collected through a user study. Finally, previous studies in
journalistic search (e.g., [
        <xref ref-type="bibr" rid="ref4 ref6">6, 4</xref>
        ]) have mainly investigated the searching behaviour, the nature of queries, and the
image selection criteria applied by individual users, rather than user groups.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data Acquisition and Processing</title>
      <p>Three data sources were used in this study: (i) a subset of the search log data collected by the commercial picture
portal of a European news agency (http://www.belga.be/), (ii) the IPTC (International Press
Telecommunications Council, http://www.iptc.org) classification of the images provided by the news agency, and (iii) profile
information of a subset of their registered users; all data were made available to us under an NDA.
</p>
      <sec id="sec-2-1">
        <title>Search Logs Processing</title>
        <p>The search log data used in this study were collected over a two-year period (June 2007 – July 2009), with a
three-month hiatus (October – December 2007). A sample consisting only of registered users logged into their
account for whom profile information was available (see Section 3.3) was considered for analysis in this study. The
logs recorded several search interactions, including users’ query submissions and their clicks on selected images
for further viewing and/or downloading (i.e., purchasing). Each log entry consists of a timestamp, the user’s ID,
and the submitted query. Click actions (viewing/downloading images) also logged the ID of the selected image.</p>
        <p>
          Our sample was processed as follows in preparation for the analysis. First, the logs were segmented into
sessions, i.e., series of a single user’s consecutive search actions assumed to correspond to a single information
need. No intent-aware session detection was applied [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]; session boundaries were identified when the period of
inactivity between two successive actions exceeded a 30-minute timeout, similarly to [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Next, the submitted
queries’ text was ‘lightly’ normalised by converting it to lower case and removing punctuation, quotes, special
characters, extraneous whitespace, URLs, and the names of major photo agencies. Also, empty queries and
queries consisting only of numbers or whitespace characters were removed. No stemming or stopword removal
was applied at this stage. Furthermore, consecutive identical queries submitted in the same session were conflated.
The final step was to further sample the logs so as to include only “active” users, i.e., those who had issued at
least 10 queries with each followed by at least one click. Our final sample thus contains 170 registered users who
submitted a total of 198,410 queries (86,663 unique) and clicked on a total of 567,467 images (312,702 unique).
        </p>
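        <p>The session segmentation and query normalisation steps described above can be sketched as follows. This is a minimal illustration in Python, not the processing code used in the study: the function names, the regular expressions, and the input format (one user's time-ordered pairs of timestamp and query text) are assumptions, and the removal of photo agency names is omitted because the list of names is not given.</p>
        <preformat>
```python
import re
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # inactivity threshold stated in the text


def normalise_query(text):
    """Lower-case, strip URLs and punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[^\w\s]", " ", text)                 # drop punctuation and quotes
    text = re.sub(r"\s+", " ", text).strip()
    # Discard empty queries and queries made only of digits/whitespace.
    if not text or text.replace(" ", "").isdigit():
        return None
    return text


def segment_sessions(actions):
    """Split one user's time-ordered (timestamp, query) actions into sessions."""
    sessions = []
    previous_time = None
    for timestamp, query in actions:
        if previous_time is None or timestamp - previous_time > TIMEOUT:
            sessions.append([])  # gap exceeded the timeout: start a new session
        previous_time = timestamp  # assumption: discarded queries still count as activity
        query = normalise_query(query)
        if query is None:
            continue
        # Conflate consecutive identical queries within the same session.
        if not sessions[-1] or sessions[-1][-1] != query:
            sessions[-1].append(query)
    # Drop sessions left empty after query filtering (an assumption of this sketch).
    return [session for session in sessions if session]
```
        </preformat>
        <p>In this sketch, two queries issued five minutes apart fall into the same session, while a query issued 40 minutes after the previous action starts a new one.</p>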
        <p>
          Table 1 lists some session statistics and their distribution across users. On average, our sample contains 547
sessions for each user with an average duration of about 17 minutes. The average number of queries/session is
4.4, close to the upper bound reported in previous web image search studies that employed the same session
detection approach, where it ranges from 2.8 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to 4.8 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] queries per session. However, it is slightly higher than
what has been reported in professional image search [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] (3.3 queries per session); it should be noted though that
no clear description of the session identification approach applied in that study is available. There are also 7.2
clicks per session on average by the users in our sample; no comparable statistics are available for similar image
search studies (e.g., [
          <xref ref-type="bibr" rid="ref13 ref5 ref6">6, 13, 5</xref>
          ]). Finally, 71% of sessions resulted in at least one click, higher than what has been
reported in web image search [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], where the same percentage is 56%. Overall, our analysis indicates that there
are similarities with session characteristics reported in other analyses in journalistic and web image search.
        </p>
        <p>
          Compared to an earlier analysis of a much larger sample of the same logs [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], where a 15-minute (rather than
a 30-minute) timeout was employed, the session statistics in our sample when using a 15-minute timeout (3.6
queries/session, 6 clicks per session, 68% of sessions with at least one click) are comparable to those for the
logged-in users in the much larger sample (3.3, 5.5, and 62%, respectively); this indicates that the sample used in
this study is representative of the user population of this commercial picture portal with respect to their session
characteristics. Compared, though, to logs obtained from general-purpose web search engines, such as the much
larger Yahoo! sample used in a similar study [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] that also detected sessions using a 30-minute timeout (where,
on average, session duration is close to 7 minutes, with 2.4 queries/session, and 59% of sessions with at least
one click), our session statistics are different. Such differences have been observed before [
          <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
          ], indicating that
image search is potentially a more complex cognitive task than other types of search, and probably more so
in the professional context investigated in this study. This suggests that potentially different features could be
important for characterising users’, and thus groups’, searching behaviour in different environments.
        </p>
        <p>
IPTC is a consortium of the world’s major news agencies that provides news exchange formats to the news
industry, and also creates and maintains concepts to be used as metadata to news objects; this allows for a
consistent coding of news metadata. The news agency that provided us with the data uses, in addition to textual
captions, the 17 IPTC subject codes (http://cv.iptc.org/newscodes/subjectcode/) listed in Table 2, i.e.,
the top level of IPTC’s hierarchical newscodes, to describe the content of the images it provides.
        </p>
        <p>Out of the 312,702 unique images clicked in our sample, 274,201 (87.8%) had been manually classified by the
news agency’s archivists. The performed classification is considered to have close to 100% precision. A manual
inspection of some randomly selected samples indicates that there is some noise, but its level appears to be
low, though this cannot be accurately quantified. Some of this noise may be introduced by the requirement for
strict classification to a single category and the inherent subjectivity in any such process.</p>
        <p>Table 2 lists the distribution of the clicked images over the 17 IPTC subjects. Sports dominate, indicating a
slight bias in the topical interests of the sampled users, as these are reflected by their searching behaviour. This
is followed by politics and cultural topics in almost equal measures. Economics, human interest topics, crime,
and war are the next subjects of interest in descending order, with the rest following in much lower percentages.
</p>
        <p>The news agency’s editorial staff provided us with the following information for each of their registered users:
their affiliations (i.e., the name of the company they work for), the type of that affiliation (i.e., if it is a newspaper,
a TV station, etc.), and some remarks in plain text regarding the topics of interest of that user (i.e., if they are
mainly interested in sports, politics, etc.). Table 3 lists the number of users affiliated with each company type,
which shows that most journalists work for radio/TV stations, or magazines. Furthermore, based on the above
information (affiliation, affiliation type, and remarks), users were manually classified to each of the three levels of
the IPTC hierarchy (http://show.newscodes.org/), i.e., subject (level 1), subject matter (level 2), and subject
detail (level 3), so as to reflect their topical interests; this classification is listed in Table 5 and is discussed next.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>User Groups</title>
      <p>This study groups users along two axes. The first relates to whether group membership is (i) determined explicitly
by information provided by the users (to the news agency’s staff), or (ii) inferred implicitly by their searching
behaviour. The second relates to the features shared by group members: (i) occupational, relating to the jobs
people have, and (ii) topical, relating to their interests. In particular, the investigated user groupings are: (i)
two types of explicit groupings (occupational and topical), and (ii) three types of implicit groupings (all topical).</p>
      <p>Explicit groups. Occupational groups consist of people with related jobs. In our case, it is assumed that
people working in similar types of companies perform similar types of journalistic tasks. The occupational groups
listed in Table 3 correspond to the company type grouping.</p>
      <p>Topical groups consist of people who share an interest in a particular topic. Here, topical groups of users with
shared interests are explicitly formed through the manual classification of users to each of the three levels of the
IPTC hierarchy (see Section 3.3). The first grouping, denoted as iptc user (level 1), is formed by considering
only the top level IPTC classification, shown in the leftmost part of Table 5. Given the high percentage of users
assigned to Economy, Business &amp; Finance (EBF), a very broad category that appears to encompass a wide range
of topics and is therefore not very discriminative, a refined classification of the users belonging to EBF
was applied and their second level IPTC classification was considered, shown in the middle part of Table 5. The
second grouping, denoted as iptc user (level 2), thus consists of 15 groups: 9 IPTC subject (level 1) groups,
those listed in Table 5 excluding EBF, and the 6 IPTC subject matter (level 2) groups under EBF. Similarly,
given the high percentage of users under the EBF/media class, a further refinement was applied to its members
and their third level IPTC classification was considered, shown in the rightmost part of Table 5. The third
grouping, denoted as iptc user (level 3), consists of 19 groups: the 14 groups of the iptc user (level 2) grouping
excluding the EBF/media group, and the 5 IPTC subject detail (level 3) groups under EBF/media. Some of
the users in the EBF/media group had not been assigned to a third level class; these are all grouped under
EBF/media/no detail. This third grouping achieves a less biased distribution of users across the groups.</p>
      <p>Implicit groups. Topical groups of users with shared interests are implicitly formed based on the hypothesis
that such users issue similar queries and/or click on similar images, e.g., with similar captions or IPTC codes.</p>
      <p>
        The first grouping, denoted as text–kmeans–k, is formed by applying k-means clustering on term vectors
each corresponding to a user and representing their queries and the captions of their clicked images. To this end,
the text of their queries and the captions of their clicked images are concatenated. The term vector is created
after stemming, but without removing stopwords, and the term weights are estimated using a tf.idf scheme with
normalisation. The k-means clustering uses the Euclidean distance and terminates when the objective function
shows no further improvement; k-means was run 100 times for each k. Several different groupings were generated
by varying the number of clusters k from 5 to 20. Both the term vector generation and the clustering were
performed with the Text to Matrix Generator (TMG) Matlab toolbox [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
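      <p>The text–kmeans–k procedure can be illustrated with the following sketch. The study itself used the TMG Matlab toolbox; this pure-Python version is only an approximation under stated assumptions: each user's concatenated text is assumed to be already tokenised and stemmed, the idf is taken as log(N/df), vectors are normalised to unit length, and the function names and the seed parameter are hypothetical. Each run stops when the assignments no longer change, and the run with the lowest sum of squared distances is kept.</p>
      <preformat>
```python
import math
import random
from collections import Counter


def tfidf_vectors(user_texts):
    """Build unit-length tf.idf vectors, one per user, from lists of tokens."""
    vocab = sorted({t for tokens in user_texts for t in tokens})
    n_users = len(user_texts)
    df = Counter(t for tokens in user_texts for t in set(tokens))
    vectors = []
    for tokens in user_texts:
        tf = Counter(tokens)
        vec = [tf[t] * math.log(n_users / df[t]) for t in vocab]  # assumed idf form
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, runs=100, seed=0):
    """Lloyd's k-means with Euclidean distance; keeps the best of `runs` restarts."""
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _ in range(runs):
        centroids = [list(v) for v in rng.sample(vectors, k)]
        labels = None
        while True:
            new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                          for v in vectors]
            if new_labels == labels:
                break  # assignments are stable, so the objective stops improving
            labels = new_labels
            for c in range(k):
                members = [v for v, lab in zip(vectors, labels) if lab == c]
                if members:  # keep the old centroid if a cluster becomes empty
                    centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        cost = sum(sq_dist(v, centroids[lab]) for v, lab in zip(vectors, labels))
        if best_cost > cost:
            best_labels, best_cost = labels, cost
    return best_labels
```
      </preformat>
      <p>Calling kmeans(tfidf_vectors(user_texts), k) for each k from 5 to 20 would yield one grouping per value of k, mirroring the 16 text–kmeans–k groupings.</p>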
      <p>Table 3. Company types: agency, government, international organisation, magazine, newspaper, private customer, radio - tv, website.</p>
      <p>The second grouping, iptc clicks–kmeans–k, is formed by applying k-means clustering on vectors of the
IPTC subject distribution of users’ clicked images. Similarly to above, several different groupings were generated
by varying the number of clusters k between 5 and 20. The third grouping, iptc clicks, is formed by assigning to
each user the IPTC subject code of the majority of their clicked images leading to the 8 groups listed in Table 4.</p>
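      <p>The iptc clicks grouping amounts to a per-user vote over subject codes, which can be sketched as below. The input format (pairs of user ID and IPTC subject code, one per clicked image) and the function name are assumptions, and "majority" is interpreted here as the most frequent subject, with ties resolved by first occurrence.</p>
      <preformat>
```python
from collections import Counter, defaultdict


def iptc_clicks_grouping(clicks):
    """Group users by the most frequent IPTC subject among their clicked images."""
    per_user = defaultdict(Counter)
    for user_id, subject in clicks:
        per_user[user_id][subject] += 1
    groups = defaultdict(list)
    for user_id, counts in per_user.items():
        # most_common(1) returns the subject with the highest click count.
        top_subject, _ = counts.most_common(1)[0]
        groups[top_subject].append(user_id)
    return dict(groups)
```
      </preformat>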
      <p>
        A comparison of the distribution of IPTC subjects manually assigned to users to reflect their interests (Table 5)
with the distribution of IPTC subjects of their clicked images (Tables 2 and 4) shows a clear disparity. This is
difficult to explain but may be due to a number of reasons. For instance, there might be a discrepancy between
what users state when registering with the news agency based on their anticipation of what they will be working
on and actual practice in their work life. Furthermore, journalists do not necessarily work on the same topic
for a long period, and this effect might be more pronounced here given the long time period covered by our
logs. Finally, journalists may select images to illustrate articles, reports, web sites, etc., which evoke associations
rather than stress the (subject) content of such texts [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and thus diverge from their topical interests.
      </p>
      <p>Overall, four explicit and 33 implicit groupings are analysed. Topical user groups are the main focus as we
aim to gain insights on the main feature employed by most personalisation approaches for discriminating among
users: their topical interests. Further groups could be formed based on other features, including the session
characteristics presented in Section 3.1; this is left as future work.</p>
    </sec>
    <sec id="sec-4">
      <title>Analysis</title>
      <sec id="sec-4-1">
        <title>Method</title>
        <p>
          The above groupings are analysed to investigate how meaningful they are, i.e., whether group members are more
similar to each other than to members of other groups, by examining the variation in what people searched for
and in what people considered relevant. Further insights are gained by zooming in on particular clusters. First
the method applied for evaluating our user segmentation outside the context of a specific application is presented.
Evaluating these user groupings is akin to performing cluster validation; see [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for a discussion on the notions
presented next. Comparing individual groups within a given grouping or entire groupings can be performed in an
unsupervised manner, by evaluating how well the groups fit the data without any reference to external
information, or in a supervised manner, against known ground truth. Both these approaches are applied in our analysis
and are briefly described next. Clustering(s)/cluster(s) are used interchangeably with grouping(s)/group(s).
        </p>
        <p>Unsupervised cluster validation evaluates the goodness of a clustering based on inter– and intra–cluster
pairwise proximity measures. In particular, the overall validity of a clustering can be expressed as the weighted
sum of the validity of individual clusters, which is in turn measured either by intra–cluster cohesion expressing
how closely related the objects in a cluster are, or by inter–cluster separation expressing how well-separated a
cluster is from other clusters. Cohesion is defined as the average of the pairwise proximity values of all points
within the cluster and separation as the average of the pairwise proximity values of each point within the cluster
to all points in all other clusters. Average proximity in a clustering is computed based on all possible pairs.</p>
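        <p>The definitions of cohesion and separation above can be made concrete with a short sketch. The function names are hypothetical, the similarity argument stands for any of the pairwise measures used in the analysis, and weighting each cluster by its share of the points is an assumption, since the weights are not specified in the text.</p>
        <preformat>
```python
def cohesion(cluster, similarity):
    """Average pairwise similarity among the members of one cluster."""
    pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    if not pairs:
        return 0.0  # assumption: a singleton cluster has no pairs to average
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def separation(cluster, others, similarity):
    """Average similarity between a cluster's members and all outside points."""
    pairs = [(a, b) for a in cluster for b in others]
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def clustering_validity(clusters, similarity, measure):
    """Size-weighted sum of per-cluster cohesion or separation values."""
    total = sum(len(c) for c in clusters)
    score = 0.0
    for i, cluster in enumerate(clusters):
        if measure == "cohesion":
            value = cohesion(cluster, similarity)
        else:
            others = [p for j, c in enumerate(clusters) if j != i for p in c]
            value = separation(cluster, others, similarity)
        score += len(cluster) / total * value  # assumed weight: cluster size share
    return score
```
        </preformat>
        <p>With a similarity measure, a meaningful grouping is one whose cohesion value is higher than its separation value, i.e., members resemble each other more than they resemble outsiders.</p>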
        <p>Here, proximity is computed using the following similarity measures for user pairs: (i) the number of their
common queries, normalised by the maximum such value observed in our sample (there are on average about
23 common queries per user pair in our sample), or (ii) the Jensen-Shannon divergence (a symmetrised KL
divergence) between the IPTC subject distributions of users’ clicked images. The latter is actually a distance
rather than a similarity measure, so its value subtracted from 1 is employed instead (Jensen-Shannon divergence
values range between 0 and 1 when the logarithm with base 2 is used, as done in this study). Another similarity
measure that could be used is the number of common clicks in a user pair. However, our analysis showed that this would probably be an
unreliable measure given our sample’s very low numbers of shared clicks for users’ common queries. This might
be due to the highly dynamic and recency-oriented journalistic context, where image collections are constantly
updated and users typically seek up-to-date information. Therefore, time-dependent relevance in conjunction
with the long time period covered by our logs is a likely explanation for this phenomenon.</p>
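        <p>The second similarity measure can be written out as follows. This is a minimal sketch that assumes each user is represented by a probability distribution over the same ordered list of IPTC subjects; the function names are hypothetical. The base-2 logarithm keeps the divergence, and hence the similarity, between 0 and 1.</p>
        <preformat>
```python
import math


def kl_divergence(p, q):
    """KL divergence in bits; terms with zero probability in p contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_similarity(p, q):
    """One minus the Jensen-Shannon divergence (base-2 logarithm) of p and q."""
    # The mixture m is non-zero wherever p or q is, so the KL terms are defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return 1.0 - divergence
```
        </preformat>
        <p>Two users with identical subject distributions obtain a similarity of 1, while two users whose clicks never share a subject obtain 0.</p>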
        <p>
          In addition to cohesion and separation, their combination, as this is reflected in the silhouette coefficient, is
used. The silhouette coefficient of a clustering is defined as the average silhouette coefficient of all its points,
while that of a point is computed using each of the similarity measures described above (see [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for its definition).
        </p>
        <p>Supervised cluster validation measures how well the constructed groups match a given ground truth. The
following measures that evaluate the extent to which a cluster contains objects of a single class are used: (i) the
entropy of a cluster over the class distribution in the ground truth, and (ii) its purity, i.e., the frequency of the
most frequent class of the ground truth in a cluster. The entropy (purity) of a clustering is computed as the sum
of the entropy (purity) values of each cluster weighted by its size.</p>
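        <p>The two supervised measures can be sketched as below, assuming each cluster is given as the list of ground-truth class labels of its members. The function names are hypothetical, the base-2 logarithm for the entropy is an assumption, and the size weighting is taken to be each cluster's share of all users.</p>
        <preformat>
```python
import math
from collections import Counter


def cluster_entropy(classes):
    """Entropy (bits) of the ground-truth class distribution within one cluster."""
    counts = Counter(classes)
    n = len(classes)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def cluster_purity(classes):
    """Share of the cluster taken by its most frequent ground-truth class."""
    return Counter(classes).most_common(1)[0][1] / len(classes)


def weighted_average(clusters, measure):
    """Combine a per-cluster measure over a clustering, weighting by cluster size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * measure(c) for c in clusters)
```
        </preformat>
        <p>Lower entropy and higher purity both indicate that a cluster is dominated by a single ground-truth class.</p>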
      </sec>
      <sec id="sec-4-2">
        <title>Variation within Groups</title>
        <p>Our analysis first explores how meaningful the 37 groupings are by examining whether the members in a group
are more similar to each other than to members of other groups, using the similarity measures described above.
Figure 1 (left, middle) clearly indicates that, in all cases, group members are more similar to each other than to
users in other groups in terms of the queries they issue and the IPTC subjects of the images they click.</p>
        <p>Starting with the explicit groupings, users affiliated with similar types of companies (company type grouping)
appear to share more common queries, but to click on less similar IPTC subjects, compared to users manually
classified as sharing common interests (iptc user (level l) grouping), at least for l ∈ {2, 3}. These latter manual
user classifications also appear to benefit from their refinements towards the deeper levels of the IPTC hierarchy
with respect to their cohesion, but not their separation. Given that in our case only one IPTC subject (EBF)
was refined, this indicates that indeed this top level subject is too broad to be discriminative, but that, on the
other hand, its refined groups share topical interests not only among themselves, but also with users in other
groups, most likely with those in the other groups obtained from the EBF refinement.</p>
        <p>Figure 1. Average similarity (intra-cluster cohesion and inter-cluster separation) of each grouping, using the common queries similarity with max normalisation (left) and the Jensen-Shannon divergence based similarity (middle), and the silhouette coefficient of each grouping (right). Groupings on the x-axis: text–kmeans–k and iptc clicks–kmeans–k for k = 5 to 20, iptc clicks, iptc user (levels 1 to 3), and company type.</p>
        <p>For the implicit groupings, our analysis should take into account the relation that exists in some cases between
the objective function used for the clustering and the similarity measure used for cluster validation. For instance,
the objective function in the iptc clicks–kmeans–k clusterings and the Jensen-Shannon divergence measure for
cluster validation are both based on the distribution of the IPTC subjects of the images clicked by users. This
results in boosting these groupings’ cohesion and separation values when this measure is applied, as shown in
Figure 1 (middle). Similarly, but to a lesser extent, the objective function in the text–kmeans–k clusterings
considers, together with the clicked images’ captions, the users’ queries, also considered by the common queries
measure. Therefore, our focus is mainly on the validation of the text–kmeans–k clusterings using the Jensen-Shannon
divergence measure, and the validation of the iptc clicks–kmeans–k clusterings using the common queries measure.
The results of this analysis in Figure 1 (left, middle) show that group members who clicked on images with
similar IPTC subjects also issued more similar queries, and vice versa.</p>
        <p>Figure 1 (right) plots the silhouette coefficient for all groupings, combining cohesion and separation. The
effects of the relations between cluster validity measures and objective functions are also evident here. Generally,
the most cohesive and well-separated groupings are those for lower k, and also the iptc clicks grouping.</p>
        <p>Next, the entropy and the purity of each of the 32 implicit clusterings (text–kmeans–k and iptc clicks–kmeans–k)
are examined in terms of users’ classifications in each of the four explicit groupings and in iptc clicks. Regarding
the explicit groupings, Figure 2 shows that all clusterings have the lowest entropy and highest purity for the iptc
user (level 1) grouping, followed by the iptc user (level 2), company type, and iptc user (level 3) groupings. This
indicates that the clusters created in the implicit groupings mostly cluster together users who have been assigned
the same IPTC subject, rather than working for the same company. This is to be expected though given the
highly unbalanced data in the top IPTC level manual classification (see Table 5).</p>
        <p>When using the iptc clicks classification, the entropy and purity of the iptc clicks–kmeans–k clusterings achieve
their best values, as expected. In addition, though, the text–kmeans–k clusterings also have low entropy and
relatively high purity with respect to iptc clicks. This indicates that groups consisting of users issuing similar
queries and clicking on images with similar captions contain users that mostly select images with the same IPTC
subject, thus further confirming the correlation observed above.</p>
        <p>Figure 2. Entropy and purity of the implicit clusterings with respect to the iptc clicks, company type, iptc user (level 1), iptc user (level 2), and iptc user (level 3) classifications.</p>
        <p>
To gain further insights into group membership, we selected one clustering to examine: text–kmeans–10, a
clustering with relatively low entropy, high purity, and high silhouette coefficient. Table 6 lists for each cluster
its distribution, entropy, and purity over the classes in iptc clicks, the two most frequent queries submitted by its
members, and their average session statistics. For clusters with a dominant class and thus relatively low entropy
(i.e., all clusters except 5 and 8), the most frequent queries are very representative of the dominant IPTC subject
(with a Belgian focus, given their origin). For example, the most popular queries in clusters 1 and 9 are clearly
relevant to culture/entertainment and sports, respectively, while those in cluster 2 relate to the imperial and
royal matters category of IPTC’s Human Interest subject. For clusters 5 and 8, where the distribution is equally
split across a number of classes, an examination of their 20 most frequent queries shows that they cover several
different subjects, indicating a more mixed membership. Regarding the session statistics, there is a clear outlier,
while the rest are well below the sample’s overall averages (see Table 1).</p>
      </sec>
      <sec id="sec-4-3">
        <title>Individual Users</title>
        <p>Finally, we have a closer look at individual users’ searching behaviour and in particular at the distribution of the
IPTC subjects of their clicked images. Figure 3 plots the entropy of that distribution for each user. Only about
a fifth of our users have an entropy below 1 and about half have an entropy over 2. This indicates that many of
the users in our sample click on images assigned to many different IPTC subjects; thus, their interests appear to
cover several topics. Given the long period covered by our search logs and the discussion in Section 4 on journalists’
work practices, a more in-depth analysis that is time-based and also considers fuzzy clusterings is needed.</p>
        <p>Table 6. Distribution of the text–kmeans–10 clusters over the iptc clicks classes ACE, CLJ, DIS, EBF, HUM, POL, REL, and SPO.</p>
        <p>
This work studied user group identification through the analysis of search log data collected by a commercial
picture portal for a sample of 170 of their registered users, in conjunction with the users’ occupational and topical
profile information, and the topical IPTC classification of the available images. Overall, our analysis indicates
that the examined groupings are meaningful, since group members are more similar to each other than to
users in other groups. Furthermore, it provides some support to the hypothesis that users who click on images
with similar IPTC subjects also issue more similar queries, and vice versa, than the population at large. It
also indicates that the relationship between group membership and issued queries and/or IPTC subjects of
clicked images might be a good source of evidence for determining groups when these are not known a priori.
This suggests that members of highly cohesive groups with respect to the above would probably benefit from a
‘groupisation’ approach. However, given the preliminary nature of this work, further investigations are needed
for consolidating and generalising our findings. Finally, future work will follow a number of directions, including
the use of the images’ visual features for group identification, the consideration of semantic evidence for query
similarity and classification, the exploitation of session information, and joint clustering using multiple features.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          .
          <article-title>Inferring and using location metadata to personalize web search</article-title>
          .
          <source>In Proc. of the 34th SIGIR</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Collins-Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>de la Chica</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Sontag</surname>
          </string-name>
          .
          <article-title>Personalizing web search results by reading level</article-title>
          .
          <source>In Proc. of the 20th CIKM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gayo-Avello</surname>
          </string-name>
          .
          <article-title>A survey on session detection methods in query logs and a proposal for future evaluation</article-title>
          .
          <source>Information Sciences</source>
          ,
          <volume>179</volume>
          (
          <issue>12</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Hollink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tsikrika</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          .
          <article-title>Semantic search log analysis: A method and a study on professional image search</article-title>
          .
          <source>JASIST</source>
          ,
          <volume>62</volume>
          (
          <issue>4</issue>
          ),
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. O.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          .
          <article-title>The effect of specialized multimedia collections on web searching</article-title>
          .
          <source>J. of Web Engineering</source>
          ,
          <volume>3</volume>
          (
          <issue>3-4</issue>
          ),
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jörgensen</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Jörgensen</surname>
          </string-name>
          .
          <article-title>Image querying by image professionals</article-title>
          .
          <source>JASIST</source>
          ,
          <volume>56</volume>
          (
          <issue>12</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharitonov</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          .
          <article-title>Demographic context in web search re-ranking</article-title>
          .
          <source>In Proc. of the 21st CIKM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharitonov</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Serdyukov</surname>
          </string-name>
          .
          <article-title>Gender-aware re-ranking</article-title>
          .
          <source>In Proc. of the 35th SIGIR</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ornager</surname>
          </string-name>
          .
          <article-title>The newspaper image database: empirical supported analysis of users' typology and word association clusters</article-title>
          .
          <source>In Proc. of the 18th SIGIR</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Özmutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Özmutlu</surname>
          </string-name>
          .
          <article-title>Multimedia web searching trends: 1997-2001</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>39</volume>
          (
          <issue>4</issue>
          ),
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.-N.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Steinbach</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <source>Introduction to Data Mining, (First Edition), chapter 8. Cluster Analysis: Basic Concepts and Algorithms</source>
          .
          <publisher-name>Addison-Wesley Longman</publisher-name>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Teevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Morris</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bush</surname>
          </string-name>
          .
          <article-title>Discovering and using groups to improve personalized search</article-title>
          .
          <source>In Proc. of the 2nd ACM WSDM</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tjondronegoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          .
          <article-title>A study and comparison of multimedia web searching: 1997-2006</article-title>
          .
          <source>JASIST</source>
          ,
          <volume>60</volume>
          (
          <issue>9</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Castillo</surname>
          </string-name>
          .
          <article-title>The demographics of web search</article-title>
          .
          <source>In Proc. of the 33rd SIGIR</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaimes</surname>
          </string-name>
          .
          <article-title>Who uses web search for what: and how</article-title>
          .
          <source>In Proc. of the 4th WSDM</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zeimpekis</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Gallopoulos</surname>
          </string-name>
          .
          <article-title>TMG: A Matlab toolbox for generating term-document matrices from text collections</article-title>
          .
          <source>In Grouping Multidimensional Data</source>
          .
          <publisher-name>Springer</publisher-name>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>