<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Where is the user? Filtering Bots from the Edurep Query Logs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wim Muskee</string-name>
          <email>w.muskee@kennisnet.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kennisnet Foundation</institution>
          ,
          <addr-line>Paletsingel 32, 2718 NT Zoetermeer, NL</addr-line>
        </aff>
      </contrib-group>
      <fpage>23</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>Edurep indexes learning object metadata from several repositories, offering a webservice interface on which portals can build their own search implementation. At the Edurep query log level, no obvious distinction can be made between human users and webcrawlers visiting these portal sites. This makes it impossible to gather any meaningful data on user search behaviour. Four query types, distinguished from the websites of the six largest portals, were related to one month of query logs. For two query types, a distinction between human and automatically generated traffic could be found. However, these results can only be used to advise connected portals on their interface implementations. More research is needed to perform any reliable filtering.</p>
      </abstract>
      <kwd-group>
        <kwd>webservice</kwd>
        <kwd>crawler detection</kwd>
        <kwd>log analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Edurep is a Dutch learning object search engine, indexing harvested learning
object metadata from more than 50 different repositories. Search portal
developers can interface with the search engine using the Edurep webservice, available
through the SRU/SRW protocol (Figure 1).</p>
      <p>
        Although operational for some years [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the operators gained access to the
search query logs only recently (December 2009). Through analysis of these logs
and the webserver logs of one portal, the operators discovered that a significant
number of queries came from various search engine bots<sup>1</sup>. Of the several
harmful aspects of such traffic, two affect Edurep in particular. First, and obviously,
webcrawlers generate extra traffic, possibly limiting performance for human users.
Secondly, because this traffic is automated, it is harder for the
operators to infer meaningful human interaction results from the Edurep query logs.
Most of these search engine bots can be identified at search portal level based
on their HTTP request User-Agent string or IP address [
        <xref ref-type="bibr" rid="ref12 ref9">12,9</xref>
        ]. However, this
information is no longer available when the request reaches Edurep.
      </p>
      <p>
        This problem is not unique to Edurep, but applies to any webservice that
allows connections from a third-party search interface. Examples of these in the
learning object context include the LRE [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], MACE [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and the Spider project
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], all of them available through the SQI protocol [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>With Edurep as context, this paper aims to explore methods to distinguish
between automated and human queries in webservice query logs. To
this end, four query types were distinguished from several search portal web
interfaces. The SRU representation of each query type was used to filter the logs
for that specific query type and analyze it more closely. The paper ends with a
discussion of the results.</p>
      <p><sup>1</sup> A type of webcrawler: a program that gathers information from the internet by
recursively following the hyperlinks it finds.</p>
    </sec>
    <sec id="sec-2">
      <title>Modeling Automated Queries</title>
      <p>Because webcrawlers only follow hyperlinks, automated searches arise
wherever a hyperlink triggers an Edurep search query. An analysis of
the portals' search interfaces is therefore necessary to relate hyperlinks to logged SRU
queries.</p>
      <p><bold>Portal Search Interfaces</bold></p>
      <p>Looking at the search interfaces of the six largest portals (together accounting for 97% of
all queries), four types of hyperlinks were distinguished.</p>
      <list list-type="bullet">
        <list-item><p>search links: issuing a search to retrieve the first page of a resultset.</p></list-item>
        <list-item><p>pagination links: issuing a search to retrieve another resultset page.</p></list-item>
        <list-item><p>result links: issuing a search to retrieve a specific record.</p></list-item>
        <list-item><p>facet links: issuing a search to retrieve the number of records for that facet.</p></list-item>
      </list>
      <p>Typically, the portals retrieved either 5 or 10 results after a search query. The
number of navigation links ranged from 5 to 20, always including a next and/or
previous link and sometimes including links to the first and/or last page. A few
portals also included result and facet links.</p>
      <p>Only one portal (C) performed a search on page arrival. The resulting page
included all link types. All the portals' queries were represented as a URL in the
browser navigation bar, meaning they can easily be pasted on other webpages
for others to follow, including bots. Searching Google for the portals' URL query
prefixes indeed returned some results, and the corresponding queries
were found in the query logs.</p>
      <p><bold>SRU/SRW</bold></p>
      <p>
        Edurep can be queried using the searchRetrieve operation of the SRU/SRW
protocol [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Among several supported request parameters [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the startRecord
parameter determines which record of the resultset is displayed first. When
omitted, it defaults to 1. The maximumRecords parameter sets the number of records
each resultset contains. Edurep's default is 10.
      </p>
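      <p>As an illustration of these two parameters, the following minimal Python sketch reads startRecord and maximumRecords from a searchRetrieve URL and applies the defaults described above. The host name, path and query value in the example URL are placeholders chosen for illustration, not Edurep's actual endpoint, and the function name is hypothetical.</p>
      <preformat><![CDATA[
```python
from urllib.parse import urlsplit, parse_qs

# Defaults as described in the text: startRecord falls back to 1,
# and Edurep returns 10 records when maximumRecords is omitted.
DEFAULT_START_RECORD = 1
DEFAULT_MAXIMUM_RECORDS = 10


def paging_parameters(url):
    """Return (startRecord, maximumRecords) for an SRU searchRetrieve URL."""
    params = parse_qs(urlsplit(url).query)
    start = int(params.get("startRecord", [DEFAULT_START_RECORD])[0])
    maximum = int(params.get("maximumRecords", [DEFAULT_MAXIMUM_RECORDS])[0])
    return start, maximum


# Hypothetical request URL: host, path and query value are placeholders.
example = ("http://sru.example.org/sru?operation=searchRetrieve"
           "&version=1.1&query=rekenen&startRecord=21&maximumRecords=10")
print(paging_parameters(example))  # (21, 10)
```
]]></preformat>
      <p>A request that omits both parameters yields (1, 10) under these assumptions, which corresponds to the first page of a resultset.</p>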
      <p>A search query typically has no startRecord value at all or a value of 1. Also, to
present a reasonable number of results, the maximumRecords value is set to 5
or higher, or left out to return the default of 10. Pagination queries have a startRecord value
higher than 1.</p>
      <p>In a result query, the startRecord value is omitted or 1. Since exactly one result is
expected, the value for maximumRecords does not need to be 1. However,
because a specific record is requested, part of the query value is characteristic. In
Edurep, a specific record can be requested by filtering on lom.general.identifier
or lom.general.catalogentry, the LOM identifier, or on meta.upload.id, Edurep's
internal unique identifier.</p>
      <p>Facet queries can be performed inside a search query by adding Edurep's
xterm-drilldown parameter to the SRU query. In addition to the search results,
a drilldown count for each facet of the requested field is retrieved. Because this
function is not supported for all LOM fields, separate facet queries can also be
executed. These have a startRecord value of 1 or none at all, and a
maximumRecords value of 0 or 1<sup>2</sup>.</p>
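      <p>The four query types described in this section can be summarised as a small decision procedure. The Python sketch below is one possible reading of those rules, not the scripts used in the study: the function name, the order of the checks and the regular expression are assumptions, and maximumRecords values of 2 to 4, which the text does not cover, simply fall through to the search type.</p>
      <preformat><![CDATA[
```python
import re

# Index names are taken from the text; matching them with a regular
# expression on the raw query string is an assumption of this sketch.
IDENTIFIER_PATTERN = re.compile(
    r"lom\.general\.identifier|lom\.general\.catalogentry|meta\.upload\.id")


def classify(query, start_record=None, maximum_records=None):
    """Classify one SRU request as 'pagination', 'result', 'facet' or 'search'."""
    start = 1 if start_record is None else start_record
    maximum = 10 if maximum_records is None else maximum_records
    if start > 1:
        return "pagination"      # any page beyond the first
    if IDENTIFIER_PATTERN.search(query):
        return "result"          # a specific record is requested
    if maximum in (0, 1):
        return "facet"           # only the total count is of interest
    return "search"              # first page of a regular search


# Hypothetical query values, for illustration only.
print(classify("rekenen"))
print(classify("rekenen", start_record=21, maximum_records=10))
print(classify('meta.upload.id="abc"', maximum_records=1))
print(classify("rekenen", maximum_records=0))
```
]]></preformat>
      <p>The identifier check is placed before the facet check because a result query may also carry a maximumRecords value of 1; with the checks reversed, such a query would be counted as a facet query.</p>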
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>
        The logs of January 2010 were used as the dataset, and the analysis was done in R [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Each log entry consisted of the portal's IP address, the timestamp (UTC) at which a search
query entered the system, the size of the response data in kilobytes, the
processing time in seconds, the entry point of the query on the server, indicating
the protocol used (SRU or SRW), and the SRU search query.
      </p>
      <p>Five variables from each query were used. The IP address, startRecord and
maximumRecords values were used unprocessed. The query argument was used as
a whole, on the assumption that each portal constructs its queries in a consistent way;
query uniqueness was not compared across portals. An identifier boolean was
set to 1 if a result link was detected.</p>
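      <p>To make the five variables concrete, the Python sketch below extracts them from a single log line. The paper does not specify the layout of the log file, so the tab-separated field order, the sample line and the function name are all assumptions made for illustration; the study itself used R.</p>
      <preformat><![CDATA[
```python
from urllib.parse import parse_qs

# Index names from the text that mark a request for a specific record.
IDENTIFIER_FIELDS = ("lom.general.identifier",
                     "lom.general.catalogentry",
                     "meta.upload.id")


def extract_variables(line):
    """Split one (hypothetical) tab-separated log line into the five variables."""
    ip, _timestamp, _size_kb, _seconds, _entrypoint, sru = (
        line.rstrip("\n").split("\t"))
    params = parse_qs(sru)
    query = params.get("query", [""])[0]
    return {
        "ip": ip,
        # Kept unprocessed: the raw string, or None when the parameter is absent.
        "startRecord": params.get("startRecord", [None])[0],
        "maximumRecords": params.get("maximumRecords", [None])[0],
        "query": query,
        "identifier": int(any(field in query for field in IDENTIFIER_FIELDS)),
    }


# Made-up sample line; the address comes from a range reserved for documentation.
sample = "\t".join([
    "192.0.2.10", "2010-01-15T09:30:00Z", "12.4", "0.08", "sru",
    "operation=searchRetrieve&query=lom.general.identifier%3Dabc123&maximumRecords=1",
])
print(extract_variables(sample))
```
]]></preformat>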
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>Concerning search queries, the distinction between human and automatically
induced queries can be made based on how often a query occurs: automatically
induced queries will appear more often than human generated ones.
While Portal C's startup page query appeared more than 6 times as often as any of
its other queries, a good threshold could not be determined.</p>
      <p>
        Assuming most users will never click past the second page of search results [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
pagination queries with a startRecord value over 200 will probably be auto-generated
(PAG1). A more elegant method for determining automatic pagination queries is to
scan the logs for pagination ranges. A range was crudely defined as a set of at least 10 SRU
queries with equal query values, startRecord values that differ by
maximumRecords, and a maximum startRecord value higher than 200 (PAG2).
      </p>
      <p>Based on the occurrence of result queries, no clear evidence for automatic querying
was found in the logs. This was attributed to the dynamic nature of Edurep's
content: as resultsets change, different results will be queried.</p>
      <p>After plotting the unique facet queries of Portal C (Figure 2), the small layer of
queries below the top coincided with the facet queries executed on entering the
search page. Observing that 10 of the 12 sub-top queries were executed about
2330 times, it was assumed they were caused by automatic querying. That amount
could be subtracted from the queries of these types, leaving their human
induced occurrences (FACET). Following from this assumption, at least the same
number of automatic hits was generated by Portal C's startup search query, and
could thus also be subtracted.</p>
      <p><sup>2</sup> Technically, the same total can be retrieved by setting this value to 0, but since the
usage of 1 had been observed, it was included.
      </p>
      <p>[Figure 2: unique facet queries of Portal C, plotted by count.]
      </p>
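      <p>The PAG2 definition can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the R analysis used in the study: log entries are assumed to be already reduced to (portal, query, startRecord, maximumRecords) tuples, repeated requests for the same page are counted once, and the order in which requests arrived is ignored. All names and the sample data are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

MIN_RANGE_LENGTH = 10     # minimum number of queries in a range (from the text)
START_RECORD_LIMIT = 200  # the highest startRecord must exceed this (from the text)


def find_pagination_ranges(entries):
    """Find PAG2-style ranges in (portal, query, start_record, maximum_records) tuples."""
    groups = defaultdict(set)
    for portal, query, start_record, maximum_records in entries:
        groups[(portal, query, maximum_records)].add(start_record)

    ranges = []
    for (portal, query, step), starts in groups.items():
        if step <= 0:
            continue  # no page size, so there are no pagination steps to follow
        ordered = sorted(starts)
        run = [ordered[0]]
        for value in ordered[1:] + [None]:  # the trailing None flushes the last run
            if value is not None and value - run[-1] == step:
                run.append(value)
                continue
            if len(run) >= MIN_RANGE_LENGTH and run[-1] > START_RECORD_LIMIT:
                ranges.append((portal, query, run[0], run[-1]))
            run = [value]
    return ranges


# Synthetic data: one crawler-like walk through 30 pages, plus two short human sessions.
entries = [("C", "math", 1 + 10 * page, 10) for page in range(30)]
entries += [("C", "history", 1, 10), ("C", "history", 11, 10)]
print(find_pagination_ranges(entries))  # [('C', 'math', 1, 291)]
```
]]></preformat>
      <p>PAG1 corresponds to the simpler test of flagging any single query whose startRecord exceeds the same limit of 200.</p>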
      <p>The subtractions resulting from each filtering method are displayed next to each
portal's total number of queries in Table 1.</p>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>Considerable automatically induced querying was observed. In terms of bandwidth,
the ranges found by PAG2 alone caused 13.3 Gb of traffic, 26.5% of the total
A-F amount. Concerning the number of queries, PAG2 and FACET accounted
for 8.4% of the total A-F number of queries.</p>
      <p>
        However, assumptions were made and the filter methods used are still
rudimentary and incomplete. With PAG2, for instance, tails or heads of the ranges
may lie outside the dataset. Conversely, the dataset probably contains heads or
tails of ranges from other months. This is even more likely considering that
pagination queries do not need to appear on the timeline in the same order as
they appear on the page [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Secondly, first- and last-page pagination queries were
not considered in PAG2.
      </p>
      <p>
        The immediate findings of this study make it possible to tailor our advice for
portals. One aspect of this is blocking crawlers at the portal by
implementing the Robots Exclusion Standard [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Use of this standard could also be
enforced through Edurep's user level agreement. As an unintended side effect,
automated usage amplified some examples of inefficient querying on Edurep.
Another aspect of the advice should therefore be information on how to interface with
Edurep better.
      </p>
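      <p>As an illustration of the Robots Exclusion Standard advice, the following Python sketch uses the standard library's robots.txt parser to check which portal URLs a well-behaved crawler would skip. The robots.txt content, the /search path, the portal host name and the bot name are all hypothetical; each portal would need rules matching its own URL layout.</p>
      <preformat><![CDATA[
```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a portal whose search pages live under /search.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical portal URLs: a deep result page and the home page.
search_url = "http://portal.example.org/search?q=rekenen&page=25"
home_url = "http://portal.example.org/"

print(parser.can_fetch("ExampleBot", search_url))  # False: search pages are excluded
print(parser.can_fetch("ExampleBot", home_url))    # True: the home page stays crawlable
```
]]></preformat>
      <p>Note that the standard is advisory: it only keeps out crawlers that choose to honour it, so it reduces rather than eliminates automated queries.</p>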
      <p>The use of various scripts to parse and filter the log files proved very useful during
this study. Automating these scripts will allow the
administrators to detect undesirable behaviour at an earlier stage and act on it sooner,
leaving Edurep free to be used by actual users.</p>
      <p>Future research should improve on several aspects. First of all, more months
of logging need to be used to combine and compare with current results.
Secondly, the SRU query values need to be parsed fully to allow more accurate
filtering options and to compare queries across portals. Lastly, the portal
websites themselves should be taken into account. Parameters like the size and format of the pagination links, and
the types of search, result or facet links on the page could prove useful in
implementing better automatic detection methods.</p>
      <p>
        A longer-term product change would be to also ask for the end user's original
User-Agent string in the query to Edurep. Also asking for the original IP address
could lead to privacy concerns. Since many crawler User-Agent strings are
publicly available [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], this information could greatly enhance our filtering efforts.
A new Edurep component could be introduced, making it possible to block
requests before they are processed by the system. However, at this point it is
unclear whether the cost of such an extra check on all requests is outweighed by the benefit of not having
to process the blocked requests. For now, such a filtering component will have
to be implemented before the logs are processed by our business level reporting tool.
      </p>
      <p>
        While the ideas in this paper could be used in similar architectures, the actual
scripts cannot, because they are made for SRU and Edurep's query log format.
With more standardization in repository query languages (like SQI),
corresponding logging standards can be thought of, so that analysis tools, once developed,
benefit many and query logs can be shared easily.
      </p>
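      <p>If the end user's User-Agent string were available, a first filtering step could be as simple as the Python sketch below. It is a minimal illustration, not an existing Edurep component: the marker list is a small hand-picked sample, the function name is hypothetical, and a real filter would draw on a maintained list of crawler User-Agent strings.</p>
      <preformat><![CDATA[
```python
# Illustrative substrings only; a real list would come from a maintained
# source of crawler User-Agent strings.
CRAWLER_MARKERS = ("googlebot", "bingbot", "slurp", "crawler", "spider")


def is_probable_crawler(user_agent):
    """Return True when the User-Agent string contains a known crawler marker."""
    if not user_agent:
        return False  # a missing header tells us nothing either way
    lowered = user_agent.lower()
    return any(marker in lowered for marker in CRAWLER_MARKERS)


bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser = "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"
print(is_probable_crawler(bot))      # True
print(is_probable_crawler(browser))  # False
```
]]></preformat>
      <p>Substring matching of this kind only catches crawlers that identify themselves, so it would complement rather than replace the log-based methods discussed above.</p>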
      <p>Filtering automatic queries is after all needed to look more closely at the human
ones. The focus of interest is teacher search behaviour, not only on Edurep but
beyond our borders.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Studying user strategies and characteristics for developing web search interfaces</article-title>
          .
          <source>Dissertations in Interactive Technology</source>
          <volume>3</volume>
          (December
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dikaiakos</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stassopoulou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papageorgiou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>An investigation of web crawler behavior: characterization and metrics</article-title>
          .
          <source>Computer Communications</source>
          <volume>28</volume>
          (
          <issue>8</issue>
          ),
          <fpage>880</fpage>
          –
          <lpage>897</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Massart</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Towards a pan-european learning resource exchange infrastructure</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          <volume>5831</volume>
          ,
          <fpage>121</fpage>
          –
          <lpage>132</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Paulsson</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Connecting learning object repositories: Strategies, technologies and issues</article-title>
          .
          <source>Internet and Web Applications and Services, International Conference on</source>
          <volume>0</volume>
          ,
          <fpage>583</fpage>
          –
          <lpage>589</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <collab>R Development Core Team</collab>
          :
          <source>R: A Language and Environment for Statistical Computing</source>
          . R Foundation for Statistical Computing, Vienna, Austria (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. robotstxt.org:
          <article-title>The web robots page</article-title>
          .
          Retrieved August 3, 2010, from http://www.robotstxt.org (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Simon</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Massart</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Assche</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ternier</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duval</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brantner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olmedilla</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miklos</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>A simple query interface for interoperable learning repositories</article-title>
          .
          In:
          <source>Proceedings of the 1st Workshop On Interoperability of Web-Based Educational Systems</source>
          . pp.
          <fpage>11</fpage>
          –
          <lpage>18</lpage>
          (
          <year>2005</year>
          ), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.67.7745
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Staeding</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>List of user-agents (spiders, robots, browser)</article-title>
          .
          <source>Stichting Kennisnet</source>
          .
          <article-title>Edurep wiki</article-title>
          .
          <source>Retrieved August 5</source>
          ,
          <year>2010</year>
          , from http://www.user-agents.org
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stassopoulou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dikaiakos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Web robot detection: A probabilistic reasoning approach</article-title>
          .
          <source>Computer Networks</source>
          <volume>53</volume>
          (
          <issue>3</issue>
          ),
          <fpage>265</fpage>
          –
          <lpage>278</lpage>
          (February
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Stichting Kennisnet:
          <article-title>Edurep wiki</article-title>
          .
          Retrieved June 3, 2010, from http://edurep.wiki.kennisnet.nl
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <collab>Stichting Kennisnet ICT op School</collab>
          :
          <article-title>De educatieve contentketen: leertechnologische afspraken voor de toekomst</article-title>
          . Retrieved May 2, 2007, from http://contentketen.kennisnet.nl/attachments/990312/De Educatieve contentketen - Leertechnologische afspraken voor de toekomst.pdf (December
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Discovery of web robot sessions based on their navigational patterns</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <fpage>9</fpage>
          –
          <lpage>35</lpage>
          (January
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <collab>The Library of Congress</collab>
          :
          <article-title>SRU: Search/Retrieval via URL</article-title>
          .
          <source>Stichting Kennisnet</source>
          .
          <article-title>Edurep wiki</article-title>
          .
          <source>Retrieved August 5</source>
          ,
          <year>2010</year>
          , from http://www.loc.gov/standards/sru/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wolpers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Memmel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klerkx</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parra</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandeputte</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duval</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schirru</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niemann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bridging repositories to form the mace experience</article-title>
          .
          <source>New Review of Information Networking</source>
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <fpage>102</fpage>
          –
          <lpage>116</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>