Discovering user groups in professional image search

Theodora Tsikrika
Royal School of Library and Information Science, Copenhagen, Denmark

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: M. Lupu, M. Salampasis, N. Fuhr, A. Hanbury, B. Larsen, H. Strindberg (eds.): Proceedings of the Integrating IR technologies for Professional Search Workshop, Moscow, Russia, 24-March-2013.

This study aims at gaining insights into user group identification in professional image search. The user groups are built by analysing the search logs recorded by a commercial picture portal for a sample of 170 users, in conjunction with the users' occupational and topical profile information, and a topical classification of the available images. Our analysis indicates that the examined groupings are meaningful and that there is variation among the groups in what people searched for and in what people considered relevant.

Introduction

differs. To this end, they analysed large scale web search logs in terms of the users’ query topics and/or session characteristics, together with the users’ demographic profile information augmented with U.S. census data. Their results showed that it is possible to identify distinct patterns of behaviour along different demographic features, a finding that could be exploited in many applications, such as sponsored search.

Our study also focusses on analysing the searching behaviour of user groups, but in a context that differs from the search environments examined in all of the above work in at least one of the following aspects: (i) ours is a professional, rather than a web, environment, and (ii) it is oriented towards image, rather than text, retrieval. Furthermore, we examine occupational and topical features for discovering user groups, similarly to [12], but consider the users’ log activity rather than data collected through a user study. Finally, previous studies in journalistic search (e.g., [6, 4]) have mainly investigated the searching behaviour, the nature of queries, and the image selection criteria applied by individual users, rather than user groups.

3 Data Acquisition and Processing

Three data sources were used in this study: (i) a subset of the search log data collected by the commercial picture portal of a European news agency (http://www.belga.be/), (ii) the IPTC (International Press Telecommunications Council, http://www.iptc.org) classification of the images provided by the news agency, and (iii) profile information of a subset of their registered users; all data were made available to us under an NDA.

3.1 Search Logs Processing

The search log data used in this study were collected over a two-year period (June 2007 – July 2009), with a three-month hiatus (October – December 2007). Only registered users who were logged into their account and for whom profile information was available (see Section 3.3) were considered for analysis in this study. The logs recorded several search interactions, including users’ query submissions and their clicks on selected images for further viewing and/or downloading (i.e., purchasing). Each log entry consists of a timestamp, the user’s ID, and the submitted query. Click actions (viewing/downloading images) also logged the ID of the selected image.

Our sample was processed as follows in preparation for the analysis. First, the logs were segmented into sessions, i.e., series of a single user’s consecutive search actions assumed to correspond to a single information need. No intent-aware session detection was applied [3]; session boundaries were identified when the period of inactivity between two successive actions exceeded a 30-minute timeout, similarly to [15]. Next, the submitted queries’ text was ‘lightly’ normalised by converting it to lower case and removing punctuation, quotes, special characters, extraneous whitespace, URLs, and the names of major photo agencies. Also, empty queries and queries consisting only of numbers or whitespace characters were removed. No stemming or stopword removal was applied at this stage. Furthermore, consecutive identical queries submitted in the same session were conflated. The final step was to further sample the logs so as to include only “active” users, i.e., those who had issued at least 10 queries with each followed by at least one click. Our final sample thus contains 170 registered users who submitted a total of 198,410 queries (86,663 unique) and clicked on a total of 567,467 images (312,702 unique).
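The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the `(timestamp, query)` input layout, and the `AGENCY_NAMES` list are hypothetical, since the text does not enumerate the agency names that were removed.

```python
import re
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)      # inactivity threshold stated in the text
AGENCY_NAMES = ("reuters", "afp")    # assumption: illustrative list only

def normalise_query(text):
    """Light normalisation; returns None for queries that should be dropped."""
    text = text.lower()
    text = re.sub(r"(https?://|www\.)\S+", " ", text)           # URLs
    for name in AGENCY_NAMES:
        text = re.sub(r"\b" + re.escape(name) + r"\b", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)                        # punctuation, quotes
    text = re.sub(r"\s+", " ", text).strip()                    # extra whitespace
    if not text or text.replace(" ", "").isdigit():
        return None                                             # empty or numeric only
    return text

def segment_sessions(actions, timeout=TIMEOUT):
    """Split one user's (timestamp, query) actions at gaps longer than the timeout."""
    sessions, last_ts = [], None
    for ts, query in sorted(actions):
        if last_ts is None or ts - last_ts > timeout:
            sessions.append([])
        sessions[-1].append(query)
        last_ts = ts
    return sessions

def clean_session(queries):
    """Normalise the queries of a session and conflate consecutive duplicates."""
    cleaned = []
    for query in queries:
        query = normalise_query(query)
        if query is not None and (not cleaned or cleaned[-1] != query):
            cleaned.append(query)
    return cleaned
```

Segmenting before normalising follows the order given in the text, so that removed queries still count as activity when session boundaries are detected; whether the study handled this case in the same way is not stated.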

Table 1 lists some session statistics and their distribution across users. On average, our sample contains 547 sessions for each user with an average duration of about 17 minutes. The average number of queries/session is 4.4, close to the upper bound reported in previous web image search studies that employed the same session detection approach, where it ranges from 2.8 [13] to 4.8 [5] queries per session. However, it is slightly higher than what has been reported in professional image search [6] (3.3 queries per session); it should be noted though that no clear description of the session identification approach applied in that study is available. There are also 7.2 clicks per session on average by the users in our sample; no comparable statistics are available for similar image search studies (e.g., [6, 13, 5]). Finally, 71% of sessions resulted in at least one click, higher than what has been reported in web image search [13], where the same percentage is 56%. Overall, our analysis indicates that there are similarities with session characteristics reported in other analyses in journalistic and web image search.

Compared to an earlier analysis of a much larger sample of the same logs [4], where a 15-minute (rather than a 30-minute) timeout was employed, the session statistics in our sample when using a 15-minute timeout (3.6 queries/session, 6 clicks per session, 68% of sessions with at least one click) are comparable to those for the logged-in users in the much larger sample (3.3, 5.5, and 62%, respectively); this indicates that the sample used in this study is representative of the user population of this commercial picture portal with respect to their session characteristics. Compared, though, to logs obtained from general-purpose web search engines, such as the much larger Yahoo! sample used in a similar study [15] that also detected sessions using a 30-minute timeout (where, on average, session duration is close to 7 minutes, with 2.4 queries/session, and 59% of sessions with at least one click), our session statistics are different. Such differences have been observed before [5, 10], indicating that image search is potentially a more complex cognitive task than other types of search, and probably more so in the professional context investigated in this study. This suggests that potentially different features could be important for characterising users’, and thus groups’, searching behaviour in different environments.

IPTC is a consortium of the world’s major news agencies that provides news exchange formats to the news industry, and also creates and maintains concepts to be used as metadata to news objects; this allows for a consistent coding of news metadata. The news agency that provided us with the data uses, in addition to textual captions, the 17 IPTC subject codes (http://cv.iptc.org/newscodes/subjectcode/) listed in Table 2, i.e., the top level of IPTC’s hierarchical newscodes, to describe the content of the images it provides.

Out of the 312,702 unique images clicked in our sample, 274,201 (87.8%) had been manually classified by the news agency’s archivists. The performed classification is considered to have close to 100% precision. A manual inspection of some randomly selected samples indicates that there is some noise, but its level appears to be low, though this cannot be accurately quantified. Some of this noise may be introduced by the requirement for strict classification to a single category and the inherent subjectivity in any such process.

Table 2 lists the distribution of the clicked images over the 17 IPTC subjects. Sports dominate, indicating a slight bias in the topical interests of the sampled users, as these are reflected by their searching behaviour. This is followed by politics and cultural topics in almost equal measures. Economics, human interest topics, crime, and war are the next subjects of interest in descending order, with the rest following in much lower percentages.

The news agency’s editorial staff provided us with the following information for each of their registered users: their affiliations (i.e., the name of the company they work for), the type of that affiliation (i.e., if it is a newspaper, a TV station, etc.), and some remarks in plain text regarding the topics of interest of that user (i.e., if they are mainly interested in sports, politics, etc.). Table 3 lists the number of users affiliated with each company type, which shows that most journalists work for radio/TV stations, or magazines. Furthermore, based on the above information (affiliation, affiliation type, and remarks), users were manually classified to each of the three levels of the IPTC hierarchy (http://show.newscodes.org/), i.e., subject (level 1), subject matter (level 2), and subject detail (level 3), so as to reflect their topical interests; this classification is listed in Table 5 and is discussed next.

4 User Groups

This study groups users along two axes. The first relates to whether group membership is (i) determined explicitly by information provided by the users (to the news agency’s staff), or (ii) inferred implicitly by their searching behaviour. The second relates to the features shared by group members: (i) occupational, relating to the jobs people have, and (ii) topical, relating to their interests. In particular, the investigated user groupings are: (i) two types of explicit groupings (occupational and topical), and (ii) three types of implicit groupings (all topical).

Explicit groups. Occupational groups consist of people with related jobs. In our case, it is assumed that people working in similar types of companies perform similar types of journalistic tasks. The occupational groups listed in Table 3 correspond to the company type grouping.

Topical groups consist of people who share an interest in a particular topic. Here, topical groups of users with shared interests are explicitly formed through the manual classification of users to each of the three levels of the IPTC hierarchy (see Section 3.3). The first grouping, denoted as iptc user (level 1), is formed by considering only the top level IPTC classification, shown in the leftmost part of Table 5. A high percentage of users is assigned to Economy, Business & Finance (EBF), a very broad category that appears to encompass a wide range of topics and is therefore not very discriminative. For this reason, a refined classification of the users belonging to EBF was applied and their second level IPTC classification was considered, shown in the middle part of Table 5. The second grouping, denoted as iptc user (level 2), thus consists of 15 groups: 9 IPTC subject (level 1) groups, those listed in Table 5 excluding EBF, and the 6 IPTC subject matter (level 2) groups under EBF. Similarly, given the high percentage of users under the EBF/media class, a further refinement was applied to its members and their third level IPTC classification was considered, shown in the rightmost part of Table 5. The third grouping, denoted as iptc user (level 3), consists of 19 groups: the 14 groups of the iptc user (level 2) grouping excluding the EBF/media group, and the 5 IPTC subject detail (level 3) groups under EBF/media. Some of the users in the EBF/media group had not been assigned to a third level class; these are all grouped under EBF/media/no detail. This third grouping achieves a less biased distribution of users across the groups.

Implicit groups. Topical groups of users with shared interests are implicitly formed based on the hypothesis that such users issue similar queries and/or click on similar images, e.g., with similar captions or IPTC codes.

The first grouping, denoted as text–kmeans–k, is formed by applying k-means clustering on term vectors each corresponding to a user and representing their queries and the captions of their clicked images. To this end, the text of their queries and the captions of their clicked images are concatenated. The term vector is created after stemming, but without removing stopwords, and the term weights are estimated using a tf.idf scheme with normalisation. The k-means clustering uses the Euclidean distance and terminates when the objective function shows no further improvement; k-means ran 100 times for each k. Several different groupings were generated by varying the number of clusters k from 5 to 20. Both the term vector generation and the clustering were performed with the Text to Matrix Generator (TMG) Matlab toolbox [16].
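A rough sketch of this clustering step is given below. It is not the TMG implementation used in the study: stemming is omitted, the tf.idf variant (raw term frequency, logarithmic idf, L2 normalisation) is only one plausible reading of "tf.idf with normalisation", and all function names are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def tfidf_vectors(docs):
    """docs: one token list per user (queries plus captions of clicked images)."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0   # guard against all-zero vectors
        vectors.append([x / norm for x in v])
    return vectors

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, rng):
    """One k-means run; stops when the objective (SSE) no longer improves."""
    centroids = [list(v) for v in rng.sample(vectors, k)]
    best_sse = float("inf")
    while True:
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        sse = sum(sq_dist(v, centroids[l]) for v, l in zip(vectors, labels))
        if sse >= best_sse - 1e-12:
            return labels, sse
        best_sse = sse
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:   # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

def best_of(vectors, k, runs=100, seed=0):
    """Repeat k-means with random initialisations and keep the lowest-SSE run."""
    rng = random.Random(seed)
    return min((kmeans(vectors, k, rng) for _ in range(runs)), key=lambda r: r[1])
```

Keeping the best of several randomly initialised runs mirrors the 100 repetitions per k reported above; how TMG selects among its runs is not described in the text, so the lowest-SSE criterion is an assumption.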

[Table 3: Number of users per company type. Company types: agency, government, international organisation, magazine, newspaper, private customer, radio-tv, website.]

The second grouping, iptc clicks–kmeans–k, is formed by applying k-means clustering on vectors of the IPTC subject distribution of users’ clicked images. Similarly to above, several different groupings were generated by varying the number of clusters k between 5 and 20. The third grouping, iptc clicks, is formed by assigning to each user the IPTC subject code of the majority of their clicked images, leading to the 8 groups listed in Table 4.
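The two click-based representations can be illustrated with a short sketch. The function names and the input layout (one list of subject codes per user, one code per clicked image) are assumptions; the subject abbreviations are used only as example labels.

```python
from collections import Counter

def subject_distribution(clicked_subjects, subjects):
    """Relative frequency of each IPTC subject among a user's clicked images."""
    counts = Counter(clicked_subjects)
    total = sum(counts[s] for s in subjects)
    return [counts[s] / total for s in subjects]

def majority_subject(clicked_subjects):
    """Subject of the majority of a user's clicks (ties: first one encountered)."""
    return Counter(clicked_subjects).most_common(1)[0][0]

def iptc_clicks_grouping(clicks_by_user):
    """Group users by the majority subject of their clicked images."""
    groups = {}
    for user, clicked in clicks_by_user.items():
        groups.setdefault(majority_subject(clicked), []).append(user)
    return groups
```

The distribution vectors would serve as input to k-means for the iptc clicks–kmeans–k groupings, while the majority rule directly yields the iptc clicks grouping. The tie-breaking rule shown here is an arbitrary choice, as the text does not say how ties were resolved.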

A comparison of the distribution of IPTC subjects manually assigned to users to reflect their interests (Table 5) with the distribution of IPTC subjects of their clicked images (Tables 2 and 4) shows a clear disparity. This is difficult to explain but may be due to a number of reasons. For instance, there might be a discrepancy between what users state when registering with the news agency based on their anticipation of what they will be working on and actual practice in their work life. Furthermore, journalists do not necessarily work on the same topic for a long period, and this effect might be more pronounced here given the long time period covered by our logs. Finally, journalists may select images to illustrate articles, reports, web sites, etc., which evoke associations rather than stress the (subject) content of such texts [9], and thus diverge from their topical interests.

Overall, four explicit and 33 implicit groupings are analysed. Topical user groups are the main focus as we aim to gain insights on the main feature employed by most personalisation approaches for discriminating among users: their topical interests. Further groups could be formed based on other features, including the session characteristics presented in Section 3.1; this is left as future work.

5 Analysis

5.1 Method

The above groupings are analysed to investigate how meaningful they are, i.e., whether group members are more similar to each other than to members of other groups, by examining the variation in what people searched for and in what people considered relevant. Further insights are gained by zooming in on particular clusters. First, the method applied for evaluating our user segmentation outside the context of a specific application is presented. Evaluating these user groupings is akin to performing cluster validation; see [11] for a discussion on the notions presented next. Comparing individual groups within a given grouping or entire groupings can be performed in an unsupervised manner, by evaluating how well the groups fit the data without any reference to external information, or in a supervised manner, against known ground truth. Both these approaches are applied in our analysis and are briefly described next. Clustering(s)/cluster(s) are used interchangeably with grouping(s)/group(s).

Unsupervised cluster validation evaluates the goodness of a clustering based on intra- and inter-cluster pairwise proximity measures. In particular, the overall validity of a clustering can be expressed as the weighted sum of the validity of individual clusters, which is in turn measured either by intra-cluster cohesion, expressing how closely related the objects in a cluster are, or by inter-cluster separation, expressing how well-separated a cluster is from other clusters. Cohesion is defined as the average of the pairwise proximity values of all points within the cluster and separation as the average of the pairwise proximity values of each point within the cluster to all points in all other clusters. Average proximity in a clustering is computed based on all possible pairs.
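As an illustration of these definitions, the sketch below computes cohesion and separation for clusters of user IDs, given any pairwise similarity function `sim`. Weighting each cluster by its relative size and scoring singleton clusters as 0 are assumptions of this example; the text only states that a weighted sum is used.

```python
from itertools import combinations

def cohesion(cluster, sim):
    """Average pairwise similarity within a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0   # assumption: a singleton cluster contributes 0
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def separation(cluster, others, sim):
    """Average similarity between a cluster's members and all other users."""
    pairs = [(a, b) for a in cluster for b in others]
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

def clustering_validity(clusters, sim):
    """Size-weighted cohesion and separation; needs at least two clusters."""
    n = sum(len(c) for c in clusters)
    coh = sep = 0.0
    for i, cluster in enumerate(clusters):
        others = [u for j, o in enumerate(clusters) if j != i for u in o]
        weight = len(cluster) / n
        coh += weight * cohesion(cluster, sim)
        sep += weight * separation(cluster, others, sim)
    return coh, sep
```

Because proximity is a similarity here, a meaningful grouping shows cohesion above separation, i.e., members are more similar to each other than to users outside their group.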

Here, proximity is computed using the following similarity measures for user pairs: (i) the number of their common queries, normalised by the maximum such value observed in our sample (there are on average about 23 common queries per user pair in our sample), or (ii) the Jensen-Shannon divergence (a symmetrised KL divergence) between the IPTC subject distributions of users’ clicked images. The latter is actually a distance rather than a similarity measure, so its value subtracted from 1 is employed instead (Jensen-Shannon divergence values range between 0 and 1 when the logarithm with base 2 is used, as done in this study). Another similarity measure that could be used is the number of common clicks in a user pair. However, our analysis showed that this would probably be an unreliable measure given our sample’s very low numbers of shared clicks for users’ common queries. This might be due to the highly dynamic and recency-oriented journalistic context, where image collections are constantly updated and users typically seek up-to-date information. Therefore, time-dependent relevance in conjunction with the long time period covered by our logs is a likely explanation for this phenomenon.
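Both similarity measures are simple to compute; a minimal sketch follows. Counting each distinct query once per user is an assumption of this example, and the function names are illustrative.

```python
import math
from itertools import combinations

def common_query_similarity(queries_by_user):
    """Number of shared queries per user pair, divided by the largest such count."""
    counts = {
        frozenset((u, v)): len(set(queries_by_user[u]) & set(queries_by_user[v]))
        for u, v in combinations(queries_by_user, 2)
    }
    top = max(counts.values()) or 1   # avoid dividing by zero if nothing is shared
    return {pair: count / top for pair, count in counts.items()}

def kl_divergence(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def js_similarity(p, q):
    """1 minus the Jensen-Shannon divergence (base-2 logarithm, range 0 to 1)."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 1.0 - (0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m))
```

Identical subject distributions give a similarity of 1, and distributions with no subject in common give 0, which is what allows the divergence to be used alongside the normalised common-query count.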

In addition to cohesion and separation, their combination, as this is reflected in the silhouette coefficient, is used. The silhouette coefficient of a clustering is defined as the average silhouette coefficient of all its points, while that of a point is computed using each of the similarity measures described above (see [11] for its definition).
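For reference, the silhouette coefficient in its usual distance-based form is s = (b - a) / max(a, b), where a is a point's average distance to the other members of its cluster and b is its smallest average distance to the members of any other cluster. The sketch below implements this form; applying it to the similarity measures of this study would require a conversion such as distance = 1 - similarity, which is an assumption here rather than something stated in the text.

```python
def silhouette(clusters, dist):
    """Average silhouette coefficient over all points; needs at least two clusters."""
    scores = []
    for i, cluster in enumerate(clusters):
        for u in cluster:
            if len(cluster) == 1:
                scores.append(0.0)   # common convention for singleton clusters
                continue
            # a: average distance to the other members of the same cluster
            a = sum(dist(u, v) for v in cluster if v != u) / (len(cluster) - 1)
            # b: smallest average distance to the members of another cluster
            b = min(
                sum(dist(u, v) for v in other) / len(other)
                for j, other in enumerate(clusters) if j != i
            )
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate cohesive and well-separated clusters, while negative values indicate points that are on average closer to another cluster than to their own.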

Supervised cluster validation measures how well the constructed groups match a given ground truth. The following measures that evaluate the extent to which a cluster contains objects of a single class are used: (i) the entropy of a cluster over the class distribution in the ground truth, and (ii) its purity, i.e., the frequency of the most frequent class of the ground truth in a cluster. The entropy (purity) of a clustering is computed as the sum of the entropy (purity) values of each cluster weighted by its size.

5.2 Variation within Groups

Our analysis first explores how meaningful the 37 groupings are by examining whether the members in a group are more similar to each other than to members of other groups, using the similarity measures described above. Figure 1 (left, middle) clearly indicates that, in all cases, group members are more similar to each other than to users in other groups in terms of the queries they issue and the IPTC subjects of the images they click.

Starting with the explicit groupings, users affiliated with similar types of companies (company type grouping) appear to share more common queries, but to click on less similar IPTC subjects, compared to users manually classified as sharing common interests (iptc user (level l) grouping), at least for l ∈ {2, 3}. These latter manual user classifications also appear to benefit from their refinements towards the deeper levels of the IPTC hierarchy with respect to their cohesion, but not their separation. Given that in our case only one IPTC subject (EBF) was refined, this indicates that indeed this top level subject is too broad to be discriminative, but that, on the other hand, its refined groups share topical interests not only among themselves, but also with users in other groups, most likely with those in the other groups obtained from the EBF refinement.

[Figure 1: Average intra-cluster cohesion and inter-cluster separation of each grouping (text–kmeans–k and iptc clicks–kmeans–k for k = 5 to 20, iptc clicks, iptc user levels 1 to 3, and company type). Left: similarity based on common queries (normalisation: max); middle: similarity based on the IPTC subjects of clicked images; right: silhouette coefficient.]

For the implicit groupings, our analysis should take into account the relation that exists in some cases between the objective function used for the clustering and the similarity measure used for cluster validation. For instance, the objective function in the iptc clicks–kmeans–k clusterings and the Jensen-Shannon divergence measure for cluster validation are both based on the distribution of the IPTC subjects of the images clicked by users. This results in boosting these groupings’ cohesion and separation values when this measure is applied, as shown in Figure 1 (middle). Similarly, but to a lesser extent, the objective function in the text–kmeans–k clusterings considers, together with the clicked images’ captions, the users’ queries, also considered by the common queries measure. Therefore, our focus is mainly on the validation of the text–kmeans–k clusterings using the Jensen-Shannon divergence measure, and the validation of the iptc clicks–kmeans–k clusterings using the common queries measure. The results of this analysis in Figure 1 (left, middle) show that group members who clicked on images with similar IPTC subjects also issued more similar queries, and vice versa.

Figure 1 (right) plots the silhouette coefficient for all groupings, combining cohesion and separation. The effects of the relations between cluster validity measures and objective functions are also evident here. Generally, the most cohesive and well-separated groupings are those for lower k, and also the iptc clicks grouping.

Next, the entropy and the purity of each of the 32 implicit clusterings (text–kmeans–k and iptc clicks–kmeans–k) are examined in terms of users’ classifications in each of the four explicit groupings and in iptc clicks. Regarding the explicit groupings, Figure 2 shows that all clusterings have the lowest entropy and highest purity for the iptc user (level 1) grouping, followed by the iptc user (level 2), company type, and iptc user (level 3) groupings. This indicates that the clusters created in the implicit groupings mostly cluster together users who have been assigned the same IPTC subject, rather than users who work for the same type of company. This is to be expected though given the highly unbalanced data in the top IPTC level manual classification (see Table 5).

When using the iptc clicks classification, the entropy and purity of the iptc clicks–kmeans–k clusterings achieve their best values, as expected. In addition, though, the text–kmeans–k clusterings also have low entropy and relatively high purity with respect to iptc clicks. This indicates that groups consisting of users issuing similar queries and clicking on images with similar captions contain users that mostly select images with the same IPTC subject, thus further confirming the correlation observed above.
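The entropy and purity values compared here follow the definitions of Section 5.1 and can be sketched as below. Each cluster is represented by the list of ground-truth class labels of its members; the base-2 logarithm and the normalisation of the size weights by the total number of users are assumptions of this example.

```python
import math
from collections import Counter

def cluster_entropy(classes):
    """Entropy (in bits) of the ground-truth class distribution within a cluster."""
    n = len(classes)
    return -sum(c / n * math.log2(c / n) for c in Counter(classes).values())

def cluster_purity(classes):
    """Relative frequency of the most frequent ground-truth class in a cluster."""
    return Counter(classes).most_common(1)[0][1] / len(classes)

def clustering_score(measure, clusters):
    """Size-weighted sum of a per-cluster measure over a whole clustering."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * measure(c) for c in clusters)
```

A cluster whose members all share one class has entropy 0 and purity 1, so lower entropy and higher purity both indicate closer agreement between a clustering and the ground-truth classification.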

[Figure 2: Entropy and purity of the implicit clusterings with respect to the iptc clicks, company type, iptc user (level 1), iptc user (level 2), and iptc user (level 3) groupings.]

To gain further insights into group membership, we selected one clustering to examine: text–kmeans–10, a clustering with relatively low entropy, high purity, and high silhouette coefficient. Table 6 lists for each cluster its distribution, entropy, and purity over the classes in iptc clicks, the two most frequent queries submitted by its members, and their average session statistics. For clusters with a dominant class and thus relatively low entropy (i.e., all clusters except 5 and 8), the most frequent queries are very representative of the dominant IPTC subject (with a Belgian focus, given their origin). For example, the most popular queries in clusters 1 and 9 are clearly relevant to culture/entertainment and sports, respectively, while those in cluster 2 relate to the imperial and royal matters category of IPTC’s Human Interest subject. For clusters 5 and 8, where the distribution is equally split across a number of classes, an examination of their 20 most frequent queries shows that they cover several different subjects, indicating a more mixed membership. Regarding the session statistics, there is a clear outlier, while the rest are well below the sample’s overall averages (see Table 1).

5.4 Individual Users

Finally, we take a closer look at individual users’ searching behaviour, and in particular at the distribution of the IPTC subjects of their clicked images. Figure 3 plots the entropy of that distribution for each user. Only about a fifth of our users have an entropy below 1 and about half have an entropy over 2. This indicates that many of the users in our sample click on images assigned to many different IPTC subjects; thus, their interests appear to cover several topics. Given the long period covered by our search logs and the discussion in Section 4 on journalists’ work practices, a more in-depth analysis that is time-based and also considers fuzzy clusterings is needed.

[Table 6: Distribution, entropy, and purity of each text–kmeans–10 cluster over the iptc clicks classes (ACE, CLJ, DIS, EBF, HUM, POL, REL, SPO), with the two most frequent queries and the average session statistics of its members.]

6 Conclusions

This work studied user group identification through the analysis of search log data collected by a commercial picture portal for a sample of 170 of their registered users, in conjunction with the users’ occupational and topical profile information, and the topical IPTC classification of the available images. Overall, our analysis indicates that the examined groupings are meaningful, since group members are more similar to each other than to users in other groups. Furthermore, it provides some support to the hypothesis that users who click on images with similar IPTC subjects also issue more similar queries, and vice versa, than the population at large. It also indicates that the relationship between group membership and issued queries and/or IPTC subjects of clicked images might be a good source of evidence for determining groups when these are not known a priori. This suggests that members of highly cohesive groups with respect to the above would probably benefit from a ‘groupisation’ approach. However, given the preliminary nature of this work, further investigations are needed for consolidating and generalising our findings.
Finally, future work will follow a number of directions, including the use of the images’ visual features for group identification, the consideration of semantic evidence for query similarity and classification, the exploitation of session information, and joint clustering using multiple features.

[1] P. N. Bennett, Radlinski, R. W. White, and E. Yilmaz. Inferring and using location metadata to personalize web search. In Proc. of the 34th SIGIR, 2011.

[2] Collins-Thompson, P. N. Bennett, R. W. White, S. de la Chica, and Sontag. Personalizing web search results by reading level. In Proc. of the 20th CIKM, 2011.

[3] Gayo-Avello. A survey on session detection methods in query logs and a proposal for future evaluation. Information Sciences, 179(12), 2009.

[4] Hollink, Tsikrika, and A. P. de Vries. Semantic search log analysis: A method and a study on professional image search. JASIST, 62(4), 2011.

[5] B. J. Jansen, Spink, and J. O. Pedersen. The effect of specialized multimedia collections on web searching. J. of Web Engineering, 3(3-4), 2004.

[6] C. Jörgensen and P. Jörgensen. Image querying by image professionals. JASIST, 56(12), 2005.

[7] Kharitonov and Serdyukov. Demographic context in web search re-ranking. In Proc. of the 21st CIKM, 2012.

[8] Kharitonov and Serdyukov. Gender-aware re-ranking. In Proc. of the 35th SIGIR, 2012.

[9] Ornager. The newspaper image database: empirical supported analysis of users' typology and word association clusters. In Proc. of the 18th SIGIR, 1995.

[10] S. Özmutlu, A. Spink, and H. C. Özmutlu. Multimedia web searching trends: 1997-2001. Information Processing and Management, 39(4), 2003.

[11] P.-N. Tan, Steinbach, and Kumar. Introduction to Data Mining (First Edition), chapter 8. Cluster Analysis: Basic Concepts and Algorithms. Addison-Wesley Longman, 2005.

[12] Teevan, M. R. Morris, and Bush. Discovering and using groups to improve personalized search. In Proc. of the 2nd ACM WSDM, 2009.

[13] Tjondronegoro, Spink, and B. J. Jansen. A study and comparison of multimedia web searching: 1997-2006. JASIST, 60(9), 2009.

[14] Weber and Castillo. The demographics of web search. In Proc. of the 33rd SIGIR, 2010.

[15] Weber and Jaimes. Who uses web search for what: and how. In Proc. of the 4th WSDM, 2011.

[16] Zeimpekis and E. Gallopoulos. TMG: A Matlab toolbox for generating term-document matrices from text collections. In Grouping Multidimensional Data. Springer, 2006.