<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Work with Knowledge on the Internet - Local Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonín Pavlíček</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josef Muknšnábl</string-name>
          <email>m@vatsiecs.canzd</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Antonín Pavlíček</institution>
          ,
          <addr-line>Josef Muknšnábl</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of System Analysis, Faculty of Informatics and Statistics, University of Economics</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country>Czech Republic</country>
        </aff>
      </contrib-group>
      <fpage>127</fpage>
      <lpage>131</lpage>
      <abstract>
        <p>Within our research grant we are looking for a new, original web local search algorithm that respects the specifics of the Czech national environment, and we would like to initiate further debate on the topic. We address three subtasks: identification of the user's geographical location, identification of web page locality, and design of the final algorithm that works with all this information together.</p>
        <p>The staggering pace of Internet growth, together with steadily increasing broadband penetration and general information literacy, leads to more frequent Internet usage. This trend is visible not only in the US but worldwide: the number of Internet users and overall usage numbers grow constantly [1]. Although the capabilities of search engines and catalogs have improved significantly over the last several years (especially after the arrival of the Google ranking algorithm), they are still not perfect in terms of accuracy and relevance. Typical areas with potential for improvement are personalized search and local search (in terms of geography and regions). Local search is the subject of an internal research grant that we have recently launched at the University of Economics, Prague. A focus on this topic is not rare, especially on a global scale: several patents related to local search have already been filed in the US [2]. Our main goal is to design and implement a new web local search algorithm that respects Czech national specifics and to verify its function on the local web page catalog Jihozapad.info. With this article we would like to indicate possible approaches, in the hope that it will initiate a wider discussion bringing new ideas.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Preamble</title>
    </sec>
    <sec id="sec-2">
      <title>Local Search and its possibilities</title>
      <p>
        Web local search is a type of search in which the user tries to find web pages that are
relevant not only by topic but also locally (in terms of geographical distance).
Typically, users search for local or regional pages related to local businesses, local
authorities or local events. Local search can be achieved in several ways. The most
common one is to specify a country/state/area/district/city/village name (or
other local information such as a ZIP code) in the query submitted to the search site.
Another is for the search site to recognize the user's physical location and offer
only results relevant to that position. Which approach is used depends on the type
of search site. All major players in the search engine and web catalog business, at both
the global and the Czech level, offer local search tools: for example Google Maps, Yahoo!
Local and MSN Live Search and, among Czech search sites, mapy.cz and centrum.cz.
As the latest numbers indicate, interest in local searching (geo-searching) is not a
fiction or a wish but a reality that everyone has to reckon with. For example, a recent
survey by comScore (a global Internet information provider) reports [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that more than
109 million people performed about 849 million local searches in July 2006,
a 43% year-over-year increase. The largest group of users, about 41%,
were searching for items such as a car rental office or a dry cleaner [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The split by
search engine or portal is as follows [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: Google sites 29.5%, Yahoo sites 29.2%,
Microsoft 12.3%, Time Warner Network 7.1%, Verizon Communications 6.6%,
YellowPages.com 3.9%, Ask Network 2.7%, Local.com 1.9%, InfoSpace Network
1.9%, DexOnline.com 1.4%, all other sites 3.2%.
      </p>
      <p>
        This trend is confirmed by other polls and studies and is generally accepted
across the IT and marketing industry [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Our initial conditions</title>
      <p>As mentioned in the preamble, we have decided to develop a new, improved
local search algorithm. The algorithm will be implemented in the local
web catalog/directory jihozapad.info, and its results will be verified on the set of
www links registered there.</p>
      <p>The web catalog/directory jihozapad.info focuses primarily on South-West
Bohemia (part of the Czech Republic). It contains mainly www links related to
local subjects such as stores, companies and authorities. It covers an
area of about 17 617 km² with about 1 180 541 inhabitants (a population density of 67
inhabitants/km²). The catalog was launched in August 2005, and its 12-month
average is 1458 unique visitors per month. The catalog and its interface are
available primarily in Czech; the other available language is German, because the covered area
directly neighbors Germany. Surprisingly, most visitors come from the US (62%, robots excluded),
followed by the Czech Republic (16%). The number of German visitors is quite
insignificant (about 1%).</p>
      <p>There are 121 registered users and 1148 registered local web links. Locality is
represented in the catalog by the possibility to choose the district within which the search should
be performed (there are currently four main districts: Klatovy [KT], Domažlice
[DO], Strakonice [ST] and Plzeň-jih [PJ]). District information is a mandatory attribute
and is available for all registered links. The catalog has no true search engine at the
moment; all links are added and registered only by registered users (and approved by the
portal administrator).</p>
      <p>Problem decomposition</p>
      <p>4.1 Identification of user geographical location</p>
      <p>Identification of the user's geographical location is also called geo-location.
Typically, the geo-location of users is derived from their IP
addresses (or MAC addresses). Such a service is often available on a commercial basis
(for example IP2Location or MaxMind). However, we are unlikely to use one because,
from our perspective, the cost of such services is high. We will try to discuss this with local
providers and agree on some form of cooperation. The level of detail that can be
obtained from an IP address depends on the quality of the service or database used for the
purpose. The easiest task is to obtain the name of the country (IP registrars
supply that information for free); it is more difficult to get further details such
as region, state, province/district, city or latitude/longitude. Another commonly used
way of determining the user's location is to use the information that the user provided
during portal registration (such as address, ZIP code, phone numbers or GPS coordinates).
The problem is that the number of registered users will very likely be much smaller
than the number of visitors, so this method will be rather limited compared with the first
one. Very likely a combined approach will be chosen.</p>
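      <p>The combined approach described above could be sketched as follows. This is a minimal illustration in Python, not the implementation used by jihozapad.info: the prefix table, the function names and the choice to prefer registration data over the IP lookup are all assumptions made for the example, and the networks shown are reserved documentation ranges rather than real allocations.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import ipaddress
from typing import Optional

# Hypothetical prefix table. A real deployment would load it from a local
# provider or a geo-location database. The networks are RFC 5737
# documentation ranges, not real allocations.
PREFIX_TO_DISTRICT = {
    ipaddress.ip_network("192.0.2.0/24"): "Klatovy",
    ipaddress.ip_network("198.51.100.0/24"): "Strakonice",
}


def locate_by_ip(ip: str) -> Optional[str]:
    """Return the district whose network contains the address, or None."""
    address = ipaddress.ip_address(ip)
    for network, district in PREFIX_TO_DISTRICT.items():
        if address in network:
            return district
    return None


def locate_user(ip: str, registered_district: Optional[str] = None) -> Optional[str]:
    """Combined approach (assumed policy): prefer the district given at
    registration and fall back to the IP lookup for anonymous visitors."""
    if registered_district:
        return registered_district
    return locate_by_ip(ip)
```
]]></preformat>
      <p>In this sketch an anonymous visitor is located by IP address only, while a registered user keeps the district entered at registration even when connecting from elsewhere; the opposite priority would be equally easy to express.</p>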
      <sec id="sec-3-1">
        <title>4.2 Web page link and its relation to particular geographical area</title>
        <p>
          There are many ways to determine the locality of a web page. So far we have
considered the following.
Use information provided by web page owners: Jihozapad.info currently stores the district
for each registered link. We do not consider this fully
sufficient, and an improvement that makes the location
of www links more precise is being implemented. We will still start from the information
entered by the user during link registration, but this information will be more detailed and
will be expressed in a standard way. As the standard we have chosen the division
into geographical areas based on the EU legal framework for the geographical division of
the territory of the European Union, also known as NUTS [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. It will be possible
to enter several geographical locations for one www link, because one link may
represent a company with several stores within the region (for example www.welstam.cz).
The following NUTS information will be gathered:
• NUTS1_uzemi: Česká republika (the same for all registered links)
• NUTS1_kod: CZ0 (the same for all registered links)
• NUTS2_oblast: Jihozápad (the same for all registered links)
• NUTS2_kod: CZ03 (the same for all registered links)
• NUTS3_kraj: Jihočeský kraj / Plzeňský kraj
• NUTS3_kod: CZ031 / CZ032
• NUTS4_okres: Strakonice / Domažlice / Klatovy / Plzeň-jih
• NUTS4_kod: CZ0316 / CZ0321 / CZ0322 / CZ0324
This information will be supplemented by a postal address in the form: City/Town,
Street and house number, ZIP code. Latitude, longitude and
altitude will also be gathered, including precise GPS coordinates (WGS-84). We hope
that all the gathered information will help to provide better local search results.
        </p>
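        <p>One possible shape for the location record described above is sketched below in Python. The class and field names are hypothetical and the sample link is invented; only the district codes and names are taken from the list above. The sketch assumes that a district code extends its region code by one digit, as the listed codes suggest.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional

# District (NUTS4) codes and names as listed in the text.
NUTS4_OKRES = {
    "CZ0316": "Strakonice",
    "CZ0321": "Domažlice",
    "CZ0322": "Klatovy",
    "CZ0324": "Plzeň-jih",
}


@dataclass(frozen=True)
class Location:
    """One geographical location of a registered link (hypothetical layout)."""
    nuts4_kod: str
    city: str
    zip_code: str
    latitude: Optional[float] = None   # WGS-84, decimal degrees
    longitude: Optional[float] = None  # WGS-84, decimal degrees

    def __post_init__(self):
        # Reject district codes the catalog does not cover.
        if self.nuts4_kod not in NUTS4_OKRES:
            raise ValueError(f"unknown district code: {self.nuts4_kod}")

    @property
    def nuts3_kod(self) -> str:
        # Assumption: dropping the last digit yields the region code.
        return self.nuts4_kod[:-1]


@dataclass
class RegisteredLink:
    """A catalog link that may have several locations (e.g. several stores)."""
    url: str
    locations: List[Location] = field(default_factory=list)

    def districts(self) -> List[str]:
        """Sorted names of all districts the link is present in."""
        return sorted({NUTS4_OKRES[loc.nuts4_kod] for loc in self.locations})


# Invented sample data for illustration only.
link = RegisteredLink(
    url="http://www.example.cz",
    locations=[
        Location("CZ0322", "Klatovy", "339 01"),
        Location("CZ0321", "Domažlice", "344 01"),
    ],
)
```
]]></preformat>
        <p>Keeping the locations as a list on the link, rather than a single district attribute, is what allows one www link to be returned for searches in more than one district.</p>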
      </sec>
      <sec id="sec-3-2">
        <title>Use local specialties from web page content</title>
        <p>
          Such an approach is applicable in the
case of an automated geo-spatial search engine (which will not be part of the current
improvement because of time restrictions). The idea is to search a particular web page
(including all subpages) for unique local words such as address parts
(village/town/district/area names), dialect words, ZIP codes and dial codes, and to derive
the web page locality from the occurrence frequency of such words (or via another algorithm).
The situation may be complicated by the fact that many addresses can be found on one
web page. However, as Jihozapad.info is strictly oriented on the region of South-West
Bohemia (districts Klatovy, Domažlice, Strakonice, Plzeň-jih), addresses found
from other regions could be ignored. An algorithm similar to "Geographic Scope [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]"
developed by Kyoto researchers could be applied, or other algorithms derived from
data mining techniques such as association analysis or clustering methods [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] etc.
        </p>
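        <p>The frequency-based idea above can be illustrated with a short Python sketch. It is only a toy under stated assumptions: the gazetteer of local terms is invented and tiny, matching is done on exact lower-cased words (so inflected Czech forms would be missed), and the district with the most hits wins. It is not the "Geographic Scope" algorithm cited above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from collections import Counter
from typing import Optional

# Hypothetical gazetteer: lower-cased local terms mapped to a district.
# Terms from other regions are simply absent, so they are ignored.
LOCAL_TERMS = {
    "klatovy": "Klatovy",
    "sušice": "Klatovy",
    "domažlice": "Domažlice",
    "strakonice": "Strakonice",
}


def guess_district(text: str) -> Optional[str]:
    """Return the district whose local terms occur most often, or None."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(LOCAL_TERMS[w] for w in words if w in LOCAL_TERMS)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```
]]></preformat>
        <p>A page mentioning Klatovy and Sušice once each and Strakonice once would be assigned to the Klatovy district, because two of the three hits fall there; a page with no known local term yields no guess at all.</p>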
      </sec>
      <sec id="sec-3-3">
        <title>Cooperation with local webhosting providers</title>
        <p>Identify and focus on local webhosting servers, where local content is very likely
to be stored. For example, the local webhosting provider ŠumavaNet hosts many regionally
oriented web pages. Webhosting providers could also become partners in gathering locally
oriented content, for example via a unified interface.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Supporting and propagating standards helping in geo-location</title>
        <p>
          The jihozapad.info catalog should be prepared to extract web page locality from
HTML geo formats/protocols such as the Microformats hCard [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] (an extension of the first approach above) or to cooperate in the
exchange of geo-spatial data associated with GIS systems and distributed in a set of
predefined formats. This would significantly improve catalog accuracy; however, because of
time restrictions it will not be possible.
        </p>
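        <p>As an illustration of what extracting locality from hCard markup might involve, the following Python sketch reads the hCard address sub-properties from a page using only the standard library. The class name, the helper function and the sample markup are assumptions made for the example. The sketch is deliberately naive: it does not check that a field is nested inside a vcard or adr element, and it keeps only the first value of each field.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from html.parser import HTMLParser

# Address sub-properties defined by the hCard microformat.
ADR_FIELDS = {"street-address", "locality", "region", "postal-code", "country-name"}


class HCardAddressParser(HTMLParser):
    """Collect the first value of each hCard address field found in a page."""

    def __init__(self):
        super().__init__()
        self.address = {}
        self._current = None  # address field of the element being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        self._current = next((c for c in classes if c in ADR_FIELDS), None)

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.address.setdefault(self._current, text)


def extract_address(html: str) -> dict:
    """Parse the markup and return the address fields that were found."""
    parser = HCardAddressParser()
    parser.feed(html)
    parser.close()
    return parser.address


# Invented sample markup for illustration only.
SAMPLE_HTML = (
    '<div class="vcard"><span class="fn org">Example s.r.o.</span>'
    '<div class="adr"><span class="locality">Klatovy</span> '
    '<span class="postal-code">339 01</span></div></div>'
)
```
]]></preformat>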
        <p>Although there are many ways to determine web page locality, none
of them guarantees the result. The reasons vary. Many regional
web pages, even locally oriented ones, contain no significant information
about their origin (they may be purely topic oriented). Many pages are locality
independent. Finally, the quality of the locality information derived by the methods
mentioned above may not be sufficient for determining locality.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.3 A final search algorithm structure</title>
        <p>At present jihozapad.info offers its users a "district" level of detail for
registered web pages. This granularity is of course not sufficient for a truly locally
oriented search site, and improvements are already being implemented.
Once we have information about the users accessing jihozapad.info and about the locality of registered
web pages, we can think about an appropriate algorithm. At this moment we are considering
a Google-style ranking algorithm with different weights for the particular
levels of granularity (region/district/town/street) and specific metrics for deriving web
page importance in a given area (pages with links from other pages in the same
district/region/town etc. would be considered more relevant).</p>
        <p>Finding a good algorithm for local searching is a complex task that combines methods
from many areas such as data mining, web page construction and search engine
principles. We are at the beginning right now. All methods mentioned in this article
should help us; finding an optimal balance that provides the most relevant and
accurate results will be a matter of tuning the algorithm on real data.</p>
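        <p>To make the idea of granularity weights more concrete, the Python sketch below scores a page by how deep its location agrees with the user's location and boosts pages that are linked from the same area. Everything in it is an assumption for illustration: the weight values, the multiplicative combination with a topic score, the field names and the sample data. The real weights would have to come from tuning on real data, as noted above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical weights: agreement at a finer level of granularity counts more.
LEVEL_WEIGHTS = (("region", 1.0), ("district", 2.0), ("town", 4.0), ("street", 8.0))


def locality_score(user: dict, page: dict) -> float:
    """Add up level weights from coarse to fine, stopping at the first mismatch."""
    score = 0.0
    for level, weight in LEVEL_WEIGHTS:
        if user.get(level) is None or user.get(level) != page.get(level):
            break
        score += weight
    return score


def rank(user: dict, pages: list, link_weight: float = 0.5) -> list:
    """Order page URLs by topic relevance boosted by locality and local in-links."""
    def final_score(page: dict) -> float:
        boost = 1.0 + locality_score(user, page) + link_weight * page.get("local_inlinks", 0)
        return page["topic_score"] * boost
    return [p["url"] for p in sorted(pages, key=final_score, reverse=True)]


# Invented sample data for illustration only.
USER = {"region": "Plzeňský kraj", "district": "Klatovy", "town": "Sušice"}
PAGES = [
    {"url": "a.example", "topic_score": 1.0, "region": "Plzeňský kraj",
     "district": "Klatovy", "town": "Klatovy"},
    {"url": "b.example", "topic_score": 1.0, "region": "Plzeňský kraj",
     "district": "Klatovy", "town": "Sušice"},
    {"url": "c.example", "topic_score": 1.0, "region": "Plzeňský kraj",
     "district": "Domažlice", "town": "Domažlice", "local_inlinks": 2},
]
```
]]></preformat>
        <p>With these sample values the page in the user's own town ranks first, the page from the same district second, and the page from a neighboring district last, even though the last one has two local in-links.</p>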
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Market Research, Internet World Stats - Usage and Population Statistics [online]. [cit. 2006-12-20]. URL: &lt;http://www.internetworldstats.com/stats.htm&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. SLAWSKI, William, Assigning Geographic Locations to Web Pages [online]. [cit. 2006-12-28]. URL: &lt;http://www.seobythesea.com/?p=386&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. YAMADA, Naoharu - LEE, Ryong - KAMBAYASHI, Yahiko.
          <article-title>Classification of Web Pages with Geographic Scope and Level of Details for Mobile Cache Management</article-title>
          ,
          <source>Proceedings of the Third International Conference on Web Information Systems Engineering (Workshops)</source>
          , 0-7695-1754-3/02,
          <year>2002</year>
          IEEE, [online]. [cit. 2006-12-20]. URL: &lt;http://csdl.computer.org/dl/proceedings/wisew/2002/1813/00/18130022.pdf&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. HAN, Jiawei - KAMBER, Micheline.
          <source>Data Mining: Concepts and Techniques</source>
          . San Diego (CA), USA: Academic Press,
          <year>2001</year>
          , 550 p., ISBN 1-55860-489-8.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. hCard Description [online]. [cit. 2006-12-20]. URL: &lt;http://microformats.org/wiki/hcard&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. comScore: Local Web Searching Soars [online]. [cit. 2006-10-02]. URL: &lt;http://www.mediaweek.com/mw/search/article_display.jsp?vnu_content_id=1003188359&amp;schema=&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. New Developments in Local Search, Part 4 [online]. [cit. 2003-11-19]. URL: &lt;http://www.clickz.com/showPage.html?page=clickz_print&amp;id=3110641&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <article-title>Common classification of territorial units for statistical purposes</article-title>
          [online]. [cit. 2006-02-06]. URL: &lt;http://europa.eu/scadplus/leg/en/lvb/g24218.htm&gt;.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>