=Paper= {{Paper |id=Vol-2038/paper2 |storemode=property |title=Investigating Per-user Time Sensitivity of Search Topics |pdfUrl=https://ceur-ws.org/Vol-2038/paper2.pdf |volume=Vol-2038 |authors=Jivashi Nagar,Hussein Suleman |dblpUrl=https://dblp.org/rec/conf/ercimdl/NagarS17 }} ==Investigating Per-user Time Sensitivity of Search Topics== https://ceur-ws.org/Vol-2038/paper2.pdf
     Investigating Per-user Time Sensitivity Of
                   Search Topics

                       Jivashi Nagar and Hussein Suleman

            Department of Computer Science, University of Cape Town,
           Private Bag X3, Rondebosch, 7701, Cape Town, South Africa.
                        {jnagar,hussein}@cs.uct.ac.za




      Abstract. Search engines give the same results for the same query. They
      do not consider that a user’s topics of interest may diverge at different
      times even if the query terms are the same. This paper presents the
      findings of a study into how different topics of interest of a user are
      influenced by time. The results show that most of the users have time
      sensitive search patterns, indicating that they have different topics of
      interest that are dominant at different times.

      Keywords: Information retrieval, Query log analysis, Topic modelling,
      Users search behaviour, Time sensitive search patterns



1   Introduction

Search engines are used to search and retrieve information from the Web. The
Web has information on almost every topic but the search engines do not consider
diverging interests of a user and retrieve the same results for a query even if it
is issued at different times.
     For example, consider a computer science student who also likes sports.
Such a user will not search only for topics related to computer science; at
some point, (s)he will also search for sports. If (s)he issues the query “tag”
during study hours, (s)he may be looking for HTML tags but, during leisure
time, the same query may mean Tag Sports Gear, a sporting goods brand.
     Although the user queries are mostly small and ambiguous in nature [11],
better results can be provided if a user’s topics of interest and search patterns are
known. Search patterns have been studied before based on the query logs. Query
logs serve as an excellent store of knowledge as they have complete information
about what the users have searched in a given time frame. Previous studies have
observed that a user’s search behaviour differs between workplace and home
[18], and that topical categories and search query volume vary with time
[3][23][2]. These studies analysed query logs as a whole and found patterns in
the search queries. However, a general search pattern is not necessarily
applicable to every user.

2     Research Question

Do the topics of interest of a user vary with time of the day?
    Motivated by the observations of past studies, this study explored the time
sensitive search patterns of individual users to find the different topics of
interest that are dominant at different times. This study observed queries
issued by 100 different users in an AOL query log [1] and analysed the query
set of each of the 100 users individually. The details and outcomes of this
study are presented in this paper. Section 3 reviews related work on users’
search behaviour, pattern identification and topic inference. Section 4 covers
the methodology of our work. Section 5 presents the analysis of the data.
Section 6 discusses the limitations of the study. Section 7 concludes the work
and outlines future directions.


3     Related Work

3.1   User’s Search Behaviour And Pattern Identification

It is crucial to analyse a user’s search behaviour to provide effective and efficient
search services. The query log data can be used to know how users use the
search engines and also about their diverse interests and preferences [8]. Rose
and Levinson [19] made an attempt to understand users’ search goals. They
analysed an Alta Vista query log and found that the goal of users’ searches is less
navigational and more resource seeking. In a work by Tyler and Teevan [21], the
authors analysed repeated queries and user behaviour. According to this study,
search engines can capitalize on this re-finding behaviour of the user to
improve the user’s search experience. Along the same line, Tyler and Zhang [22]
found that repeated re-finding behaviour also contains diversification.
According to Srivastava et al. [20], identifying the patterns in Web usage can
be helpful for marketers in placing advertisements focusing on a certain target
group. Temporal analysis of the sequence patterns can prove useful in finding
the trending topics.
    Temporal analysis of query logs has also been done by many researchers to
explore users’ search behaviour and search patterns. A significant outcome of
the study by Rieh [18] is that the author was able to find the difference in search
behaviour of the users. According to this study, users’ search behaviour differs in
their workplace from that at home. The websites visited during working hours
were mostly related to their work while, at home, the search was of diverse
nature. In a similar work, Zhang and Moffat [24] analysed an MSN query log.
They found a general pattern in the volume of queries. There was a peak early
in the week that dropped steadily until Friday and decreased sharply over the
weekend. This pattern reflects the weekly routine of a typical working
person. They also observed an hourly pattern and found a rise in the volume
of queries from early morning, peaking at noon and decreasing steadily through
midnight. Bar-Ilan et al. [2] analysed the same MSN query log for topic specific
analysis. They found that, during weekdays, queries related to work were
dominant. In a comparatively recent work, Völske et al. [23] analysed
a Russian query log spanning one year. According to this study, queries related
to categories like “Health” and “Beauty and Style” were distributed more or
less constantly throughout the year. Some categories like “Education” observed
a drop during vacation periods. Cosley [6] has shown that the query terms
and search patterns of users vary with time of the day and also with device
(mobile and PC). He also compared the search patterns of weekdays and weekends.
According to this study, queries regarding task completion were dominant during
the morning on weekdays, while entertainment and shopping related queries
were dominant in the evening on mobiles and tablets. All these
studies analysed the query logs as a whole and suggested that there is a
pattern in users’ search behaviour. Some of them [23][2][6] also analysed the
time dependent popularity of some topics. They did not consider and analyse each
individual user’s search pattern. A common trend of topic change and popularity
cannot be applied to improve the search experience of an individual user. Each
user may have a specific search pattern, which is different from the others. As
reported by Völske et al. [23], the queries related to the category “Education”
observed a decline during the vacation period; this trend or pattern cannot be
generalized to each user. A user who is looking for extra classes or lessons may
search for “education” even in the vacation period.

3.2   Topic Inference
Every user has different topics of interest. To find the different topics from query
logs, most of the previous studies about topic based personalized information
retrieval systems have relied on Open Directory Project (ODP) categories [15][13]
[4]. Jansen et al. [10] used the Google Directory topical hierarchy to classify
the queries into subject categories. Arguably, Web search is not limited to these
categories because of the rich nature of the Web. It is not ideal to put the
wide range of a user’s interests into predefined categories. Unsupervised Machine
Learning may be a better tool to learn latent topics from users’ search queries
[14].

4     Methodology
The goal of this study is to investigate the time sensitivity of a user’s topics of
interest. We analysed queries submitted by each user separately to explore the
time sensitive search pattern of that user. This section presents the details of
our study in the following steps:

4.1   Data Collection
In this study, the search history of 100 users from an AOL query log [1] has been
analysed. The AOL query log is publicly available log data for research and
analysis [1][17] and has been analysed before by many researchers. For example,
Duarte Torres et al. [7] identified queries reflecting children’s search intent
in this AOL query log. This log collection contains about 20M Web queries from
650K users issued in three months from March 2006 to May 2006. The data is
anonymized and consists of the fields UserID, Query, QueryTime, ClickedRank and
DestinationDomainUrl. For this study, the UserID, Query and QueryTime fields of
the query log were considered, and it was assumed that each UserID represents a
unique user. The logs of each unique user were cleaned and pre-processed for
further analysis. The details of data cleaning and pre-processing are described
below.


4.2   Data Cleaning

The queries of each of the 100 users were processed individually. Entries
with empty queries were removed. Identical queries that were submitted on
the same date less than 10 seconds apart were also excluded from the
analysis. According to Odijk et al. [16], queries issued within a few seconds
of each other are more likely to be spelling corrections or substitution-type
reformulations of the previous query. Such queries were treated as
modifications of the previous query rather than as distinct queries. This
process of removing incomplete, irrelevant or duplicate data is called data
cleaning.
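
The cleaning rules above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `clean_queries`, the `(query, query_time)` tuple layout and the choice to compare each query with the most recent submission of the same text are assumptions made for the example.

```python
from datetime import datetime, timedelta


def clean_queries(records, min_gap=timedelta(seconds=10)):
    """Clean the query list of ONE user.

    records: iterable of (query, query_time) pairs, query_time as datetime.
    Drops empty queries and identical queries repeated on the same date
    less than `min_gap` after the previous submission of the same text.
    """
    cleaned = []
    last_seen = {}  # query text -> time of its most recent submission
    for query, when in sorted(records, key=lambda r: r[1]):
        text = query.strip()
        if not text:
            continue  # empty query: removed
        previous = last_seen.get(text)
        last_seen[text] = when
        if (
            previous is not None
            and previous.date() == when.date()
            and when - previous < min_gap
        ):
            continue  # near-immediate repeat: treated as the same query
        cleaned.append((text, when))
    return cleaned
```

The same user's records would be passed in one call; running the function once per UserID mirrors the per-user processing described in this section.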


4.3   Pre-processing

Pre-processing, also known as text normalization, gives a syntactical view of the
original text. Pre-processing was accomplished by using the Natural Language
Toolkit (NLTK). NLTK is a free and open source community-driven project [9]
and a widely used platform for working with natural human language data.
Pre-processing involved tokenization, removal of stopwords and punctuation
marks, and lemmatization.
    The process of dividing a phrase or sentence into tokens is called tokenization.
The tokens may represent words, digits or punctuation marks. After
tokenization, stopwords were detected in the data. Stopwords are common,
high-frequency words that are independent of any topic, such as “a”, “an”,
“the”, “for” and “and”. Once detected, stopwords and punctuation marks were
removed from the query terms, and the remaining terms were lemmatized.
    Lemmatization aims to remove inflectional endings and to return the base or
dictionary form of a word, which is called the lemma. This step utilizes vocab-
ulary along with a morphological analysis of words.
    Table 1 shows the aggregate number of queries of 100 users and the queries
that remained for analysis after cleaning and pre-processing.
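
The three pre-processing steps can be illustrated with a dependency-free sketch. The study used NLTK; the regular-expression tokenizer, the tiny `STOPWORDS` set and the `crude_lemma` plural-stripping rule below are simplified stand-ins chosen for the example, not NLTK's tokenizer, stopword corpus or WordNet-based lemmatizer.

```python
import re
import string

# Illustrative stopword list only; a real run would use a full list.
STOPWORDS = {"a", "an", "the", "for", "and", "of", "in", "to", "is", "on"}
PUNCTUATION = set(string.punctuation)


def tokenize(query):
    """Split a query into lowercase word/digit tokens and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", query.lower())


def crude_lemma(token):
    """Stand-in for dictionary-based lemmatization: strips a plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(query):
    """Tokenize, drop stopwords and punctuation, then lemmatize."""
    tokens = tokenize(query)
    kept = [t for t in tokens if t not in STOPWORDS and t not in PUNCTUATION]
    return [crude_lemma(t) for t in kept]
```

For instance, `preprocess("The HTML tags, for a table!")` yields `["html", "tag", "table"]` under these simplified rules.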


4.4   Topic Modelling

Topic modelling is a technique that is used to identify the latent topics present in
a corpus. Topic models are the algorithms that are used to find the main themes
or ideas in a data collection. In a number of previous works, authors have utilized
predefined topical categories like the Open Directory Project [12][4][15] and the
Google Directory topical hierarchy [10] to find the topics of interest of a user.
According to Mehrotra [14], unsupervised machine learning is a better tool for
learning the latent topics of interest from user search logs.

                            Table 1. Queries analysed

             Original number of queries   Queries remaining after cleaning
                       22096                           6850

    Latent Dirichlet Allocation (LDA) is an unsupervised approach for topic
modelling. It is a generative probabilistic model for a text corpus [5] and one
of the most commonly used approaches for topic modelling. LDA is a three-level
hierarchical Bayesian model. In LDA, each document of a corpus is modeled as a
finite mixture over an underlying set of topics and each topic is modeled as an
infinite mixture over an underlying set of topic probabilities [5].
    In this study, we assumed that each user has at least 4 different topics of
interest. After applying LDA on the queries of each user, topics were assigned
to the individual queries. According to LDA, a document may have more than
one topic. So, in this case, if a query fell under more than one topic, all those
topics were assigned to that query. The reason behind doing this is that most of
the Web queries are ambiguous in nature i.e. they may belong to many topics.
The aim of this study is to find the temporal dominance of topics of interest of a
user. We analysed which topics were dominant at a particular time. This notion
of assigning multiple topics to a query along with finding temporal dominance
of topics can be utilized to disambiguate the ambiguous queries. After assigning
the topic(s), the queries were grouped according to the Time-bins.
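
The multi-topic assignment step can be sketched as below, assuming a fitted LDA model has already produced a topic distribution for each query. The paper does not state how it decided that a query "fell under" a topic, so the function name `assign_topics` and the 0.25 threshold (the uniform level for 4 topics) are assumptions made purely for illustration.

```python
def assign_topics(topic_probs, threshold=0.25):
    """Return the indices of all topics assigned to one query.

    topic_probs: per-topic probabilities for the query (e.g. from LDA).
    threshold: assumed cut-off; every topic at or above it is assigned.
    The most probable topic is always returned so no query is left unlabelled.
    """
    best = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    chosen = [k for k, p in enumerate(topic_probs) if p >= threshold]
    return chosen or [best]
```

With this rule, an ambiguous query with distribution `[0.45, 0.40, 0.10, 0.05]` is assigned topics 0 and 1, while a query concentrated on one topic receives that topic alone.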

4.5   Time-bins
The aim of this study is to find time sensitive search patterns. We analyse the
relation between a topic of interest and the time when it is searched dominantly.
According to Rieh [18], users’ search behaviour differs in their workplace from
that at home. For the purpose of our study, we divided the time of a day into four
customized bins according to the common daily routine of a working person. The
four Time-bins are: Early morning (6h-8h), Working hours (8h-18h), Evening
time (18h-24h) and Midnight (24h-6h).
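
A minimal sketch of mapping a query timestamp to one of the four Time-bins follows. Two details are assumptions not stated in the paper: the timestamp format `YYYY-MM-DD HH:MM:SS`, and the treatment of each bin as a half-open interval so that a boundary hour such as 8h belongs to the later bin.

```python
from datetime import datetime

# (name, start hour inclusive, end hour exclusive); boundaries are assumed.
TIME_BINS = [
    ("Early morning", 6, 8),
    ("Working hours", 8, 18),
    ("Evening time", 18, 24),
    ("Midnight", 0, 6),
]


def time_bin(query_time):
    """Return the Time-bin name for a timestamp string."""
    hour = datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S").hour
    for name, start, end in TIME_BINS:
        if start <= hour < end:
            return name
    raise ValueError(f"hour {hour} is not covered by any Time-bin")
```

For example, `time_bin("2006-03-01 07:17:12")` returns `"Early morning"`.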


5     Results And Analysis
In this section, the search patterns of all of the 100 users are analysed. It
should be kept in mind that this AOL query log is from 2006, i.e. more than 10
years old. At that time, people searched less frequently, because of limited
Internet availability and usage cost, and for fewer topics. In present times,
due to fast Internet connections, easy availability and more advanced
communication devices, users search more frequently. Moreover, the number of
topics of interest has also increased. In spite of this limitation, some
promising facts are revealed by the analysis. Table 2 shows the number of
Time-bins used by the users to issue queries.


                       Table 2. Number of Time-bins used

                   Number of Users Number of Time-bins Used
                        12                    4
                        35                    3
                        50                    2
                         3                    1




    As can be seen from Table 2, out of 100 users, the largest group (50 users)
issued queries in 2 Time-bins. 35 users issued queries in 3 Time-bins while 12
users searched in 4 Time-bins. Very few users (3) searched in only one
Time-bin. Table 3 presents the time-sensitive search patterns of users along
with their respective numbers of dominant topics. Each distinct digit in an
Example Pattern denotes a different dominant topic in a user’s search pattern.
For instance, pattern 123 represents the search pattern of a user who has
searched in 3 Time-bins and in every Time-bin the dominant topic is different.
One possible pattern, 1122, in which a user has 2 dominant topics across 4
Time-bins with each topic dominating 2 Time-bins, was not found in any of the
100 patterns analysed. Out of 100 users, 1 user searched for 4 different
dominant topics in the 4 Time-bins, 13 users searched in 3 Time-bins with 3
distinct dominant topics and 42 users had 2 different dominant topics for the 2
Time-bins they searched in. So, 56 users searched for a different dominant
topic in every Time-bin they used. Among the users who searched in 4 Time-bins,
6 have 3 dominant topics and 3 have 2 dominant topics. 19 users have 2 dominant
topics in the 3 Time-bins they searched in. Thus, there are 28 users who have 2
or 3 dominant topics of interest but fewer dominant topics than Time-bins. Only
13 users have shown the same dominant topic in every searched Time-bin and, as
shown in Table 2, 3 users have searched in only 1 Time-bin.
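
How a dominant topic per Time-bin might be derived from the topic-labelled queries of one user is sketched below. The names `dominant_topics` and `summarise` are hypothetical, and the paper does not say how ties between equally frequent topics were resolved; breaking ties in favour of the lowest topic index is an assumption of this example.

```python
from collections import Counter, defaultdict


def dominant_topics(assignments):
    """Map each used Time-bin to its most frequently assigned topic.

    assignments: iterable of (time_bin, topics) pairs, one pair per query,
    where topics is the list of topic indices assigned to that query.
    Ties are broken by the smallest topic index (an assumed rule).
    """
    counts = defaultdict(Counter)
    for bin_name, topics in assignments:
        counts[bin_name].update(topics)
    return {
        bin_name: min(c, key=lambda k: (-c[k], k))
        for bin_name, c in counts.items()
        if c
    }


def summarise(dominant):
    """Return (number of Time-bins used, number of distinct dominant topics)."""
    return len(dominant), len(set(dominant.values()))
```

A user whose summary is `(3, 3)` has a different dominant topic in each of 3 Time-bins, which corresponds to the example pattern 123, while `(2, 1)` corresponds to pattern 11.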
    The following figures represent search patterns of some of the categories
shown in Table 3.
    Figure 1 shows the search pattern of a Category 2 user. In each Time-bin in
which the user has searched, the dominant topic is different. Topic3, which is
not searched much at Time1 and Time3, becomes dominant at Time4.
    Figure 2 represents the search pattern of a Category 4 user. The user has
searched only in 2 Time-bins but, in both the Time-bins, the dominant topics
are different. Topic2, which has not been searched at Time3, is dominating at
Time4. Figure 3 shows the search pattern of a Category 5 user who has made
searches in 3 Time-bins. At Time1, only Topic0 is searched and it also dominates
at Time3. Topic2, which has not been searched either at Time1 or at Time3,
clearly dominates at Time4.

            Table 3. Dominant Topic patterns and calculated entropy

 Category  Number of users  Number of Dominant Topics  Example Pattern   Entropy
    1             1                     4              1234              1.386
    2            13                     3              123               1.098
    3             6                     3              1233              1.039
    4            42                     2              12                0.693
    5            18                     2              122               0.636
    6             4                     2              1333              0.215
    7            16                     1              1111, 111, 11, 1  0

          [Figure: Topic0 to Topic3 plotted over Time1 to Time4]
                   Fig. 1. Search pattern of Category 2

          [Figure: Topic0 to Topic3 plotted over Time1 to Time4]
                Fig. 2. Search pattern of a Category 4 user
          [Figure: Topic0 to Topic3 plotted over Time1 to Time4]
                Fig. 3. Search pattern of a Category 5 user

          [Figure: Topic0 to Topic3 plotted over Time1 to Time4]
                Fig. 4. Search pattern of a Category 6 user



   Figure 4 shows the search pattern of a Category 6 user. The user has searched
Topic0 and Topic1 in all the 4 Time-bins. While Topic1 is dominating in 3 Time-
bins (Time1, Time2 and Time3), at Time4, Topic3 is dominating followed by
Topic2.
    Figure 5 shows the search pattern of a Category 7 user. The Category rep-
resents the users who have the same dominant topic in every Time-bin whether
they searched in 4, 3 or 2 Time-bins.
   These search patterns indicate that users have different topics of interest and
they prefer to search about them at different times.

          [Figure: Topic0 to Topic3 plotted over Time1 to Time4]
                Fig. 5. Search pattern of a Category 7 user

    For analysing the variability in patterns of dominant topics, entropy was
calculated. Entropy is a measure of disorder or randomness and refers to the
number of possible states a variable can have. It is calculated as:

\[ H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i) \]

A greater value of entropy points to more possible states or randomness of a
variable. Table 3 shows the calculated entropy for different patterns of dominant
topics, suggesting an ordering and grouping of different topic patterns based on
variability.
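
The entropy of an example pattern can be computed with the short sketch below. It assumes that p(x_i) is the fraction of a user's Time-bins in which dominant topic x_i appears, an interpretation that is not spelled out in the text but reproduces the Table 3 values for patterns 1234, 123, 1233, 12 and 122. The function name `pattern_entropy` is hypothetical.

```python
import math
from collections import Counter


def pattern_entropy(pattern):
    """Entropy (natural log) of a dominant-topic pattern such as "1233".

    Each character is the dominant topic of one Time-bin; p(x_i) is taken
    to be the share of Time-bins dominated by topic x_i (an assumption).
    """
    counts = Counter(pattern)
    total = len(pattern)
    # p * ln(1/p) is the same term as -p * ln(p), written to avoid -0.0.
    return sum((c / total) * math.log(total / c) for c in counts.values())
```

Under this reading, pattern 1333 evaluates to about 0.562, so the 0.215 listed for Category 6 may reflect a different calculation; the ordering of categories by variability is unchanged either way.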
    80 users have entropy greater than or equal to 0.636. Every one of these
users, including those who searched in 3 or more Time-bins, has at least 2
distinct dominant topics. In other words, some topics are dominantly searched
in a particular time interval. This indicates that there is variability in a
user’s topics of interest. A user’s topics of interest differ in different time
intervals and so we can say that the topics of interest of a user are
time-sensitive.
    This inference can be utilized to disambiguate the queries and provide more
useful and relevant search results.


6   Limitations

The aim of this study is to explore the temporal dominance of topics of interest of
a user. For this purpose, we selected the queries submitted by 100 users from an
AOL query log. The time of the day was divided into four bins and it is assumed
that each user has at least four topics of interest. After processing and analysing,
it was found that most of the users search for different topics at different times.
It can be said that topics are time sensitive.
    This study was able to find the time-sensitive search patterns of a user but
it also has some shortcomings.

 1. The query log data was old but readily available for analysis.
 2. The queries of only 100 users were analysed because the query set of every
    user was cleaned manually.
 3. The study did not determine the exact number of topics of interest of each
    user. As we divided the time of the day into 4 Time-bins according to the
    routine of a common working person, it was assumed that each user has at
    least 4 topics of interest for the 4 Time-bins.
 4. Although we were able to find search patterns based on these Time-bins,
    each user may have different search Time-bins.


7   Conclusions And Future Work
We have studied an AOL query log to explore the time sensitive search patterns
of users. We have analysed the queries of 100 different users and found that,
out of 100 users, 84 users have at least 2 different dominant topics searched at
different times. Only 13 users have searched for the same topic in every Time-
bin and 3 users have searched only in 1 Time-bin. This study concludes that
most of the users have time sensitive topics. They search for different topics at
different times. Different topics are dominant at different time intervals. The
goal of future work is to explore and exploit the time sensitive search patterns
of a user to model a user’s time sensitive search behaviour, which could prove
helpful to search engines in disambiguating the short and ambiguous queries and
also providing users with more relevant search results.


References
1. http://www.cim.mcgill.ca/ dudek/206/Logs/AOL-user-ct-collection/user-ct-test-
   collection-01.txt/.
2. Judit Bar-Ilan, Zheng Zhu, and Mark Levene. Topic-specific analysis of search
   queries. In Proceedings of the 2009 Workshop on Web Search Click Data, WSCD
   ’09, pages 35–42, New York, NY, USA, 2009. ACM.
3. Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David Grossman, and Ophir
   Frieder. Hourly analysis of a very large topically categorized web query log. In
   Proceedings of the 27th Annual International ACM SIGIR Conference on Research
   and Development in Information Retrieval, SIGIR ’04, pages 321–328, New York,
   NY, USA, 2004. ACM.
4. Paul N Bennett, Ryen W White, Wei Chu, Susan T Dumais, Peter Bailey, Fedor
   Borisyuk, and Xiaoyuan Cui. Modeling the impact of short-and long-term behavior
   on search personalization. In Proceedings of the 35th international ACM SIGIR
   conference on Research and development in information retrieval, pages 185–194.
   ACM, 2012.
5. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation.
   Journal of machine Learning research, 3(Jan):993–1022, 2003.
6. John Cosley. Hearing the rhythms of human search behavior: What we’ve learned.
   http://searchengineland.com/human-behavior-influences-search-marketing-197486/,
   July 2014.
7. Sergio Duarte Torres, Djoerd Hiemstra, and Pavel Serdyukov. Query log analysis
   in the context of information retrieval for children. In Proceedings of the 33rd
   international ACM SIGIR conference on Research and development in information
   retrieval, pages 847–848. ACM, 2010.
8. Yi Fang, Naveen Somasundaram, Luo Si, Jeongwoo Ko, and Aditya P Mathur.
   Analysis of an expert search query log. In Proceedings of the 34th international
   ACM SIGIR conference on Research and development in Information Retrieval,
   pages 1189–1190. ACM, 2011.
9. Mansi Gera and Shivani Goel. Data mining - techniques, methods and algorithms: A
   review on tools and their validity. International Journal of Computer Applications,
   113(18), 2015.
10. Bernard J Jansen, Zhe Liu, Courtney Weaver, Gerry Campbell, and Matthew
   Gregg. Real time search on the web: Queries, topics, and economic value. Informa-
   tion Processing & Management, 47(4):491–506, 2011.
11. Bernard J Jansen, Amanda Spink, and Tefko Saracevic. Real life, real users, and
   real needs: a study and analysis of user queries on the web. Information processing
   & management, 36(2):207–227, 2000.
12. Hyoung R Kim and Philip K Chan. Learning implicit user interest hierarchy for
   context in personalization. In Proceedings of the 8th international conference on
   Intelligent user interfaces, pages 101–108. ACM, 2003.
13. Jin Young Kim, Kevyn Collins-Thompson, Paul N Bennett, and Susan T Dumais.
   Characterizing web content, user interests, and search behavior by reading level and
   topic. In Proceedings of the fifth ACM international conference on Web search and
   data mining, pages 213–222. ACM, 2012.
14. Rishabh Mehrotra. Topics, tasks & beyond: Learning representations for personal-
   ization. In Proceedings of the Eighth ACM International Conference on Web Search
   and Data Mining, pages 459–464. ACM, 2015.
15. Ashish Nanda, Rohit Omanwar, and Bharat Deshpande. Implicitly learning a user
   interest profile for personalization of web search using collaborative filtering. In Web
   Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM
   International Joint Conferences on, volume 2, pages 54–62. IEEE, 2014.
16. Daan Odijk, Ryen W White, Ahmed Hassan Awadallah, and Susan T Dumais.
   Struggling and success in web search. In Proceedings of the 24th ACM Interna-
   tional on Conference on Information and Knowledge Management, pages 1551–1560.
   ACM, 2015.
17. Greg Pass, Abdur Chowdhury, and Cayley Torgeson. A picture of search. In
   Proceedings of the 1st International Conference on Scalable Information Systems,
   InfoScale ’06, New York, NY, USA, 2006. ACM.
18. Soo Young Rieh. Investigating web searching behavior in home environments.
   Proceedings of the American Society for Information Science and Technology,
   40(1):255–264, 2003.
19. Daniel E. Rose and Danny Levinson. Understanding user goals in web search. In
   Proceedings of the 13th International Conference on World Wide Web, WWW ’04,
   pages 13–19, New York, NY, USA, 2004. ACM.
20. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web
   usage mining: Discovery and applications of usage patterns from web data. SIGKDD
   Explor. Newsl., 1(2):12–23, January 2000.
21. Sarah K Tyler and Jaime Teevan. Large scale query log analysis of re-finding.
   In Proceedings of the third ACM international conference on Web search and data
   mining, pages 191–200. ACM, 2010.
22. Sarah K Tyler and Yi Zhang. Multi-session re-search: in pursuit of repetition
   and diversification. In Proceedings of the 21st ACM international conference on
   Information and knowledge management, pages 2055–2059. ACM, 2012.
23. Michael Völske, Pavel Braslavski, Matthias Hagen, Galina Lezina, and Benno Stein.
   What users ask a search engine: Analysing one billion russian question queries.
   In Proceedings of the 24th ACM International on Conference on Information and
   Knowledge Management, CIKM ’15, pages 1571–1580, New York, NY, USA, 2015.
   ACM.
24. Yuye Zhang and Alistair Moffat. Some observations on user search behaviour.
   Austr. J. Intelligent Information Processing Systems, 9(2):1–8, 2006.