<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Company Search – When Documents are only Second Class Citizens</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Blank</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Boosz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Henrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bamberg</institution>
          ,
          <addr-line>D-96047 Bamberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>361</fpage>
      <lpage>368</lpage>
      <abstract>
        <p>Usually retrieval systems search for documents relevant to a certain query or, more generally, an information need. However, in some situations the user is not interested in documents but in other types of entities. In the paper at hand, we propose a system searching for companies with expertise in a given field sketched by a keyword query. The system covers all aspects: determining and representing the expertise of the companies, query processing and retrieval models, as well as query formulation and result presentation.</p>
      </abstract>
      <kwd-group>
        <kwd>Domain specific search solutions</kwd>
        <kwd>expertise retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>[Fig. 1: System overview. A list with the name, URL, and address of each company feeds web-site crawling; text processing and LDA lead to indexing into a company index and a web page index, plus auxiliary data, which serve query processing and result preparation.]</p>
      <p>
        We consider company search
as a special case of expertise retrieval, a topic intensely considered in the literature
[
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The contribution of the paper at hand in this context is the design and
reflection of a system based on state-of-the-art models and components, adapted
to a real-world application scenario. As we will see, this requires some particular
design decisions and user interface considerations.
      </p>
      <p>The remainder of the paper is organized as follows: In the two subsections
of section 1 below we give a rough overview of the proposed system and briefly
address related work. Thereafter we discuss the four components identified
as making up an expertise retrieval system according to Balog et al. [3, p. 145]:
modeling and retrieval (section 2), data acquisition (section 3), preprocessing and
indexing (section 4), as well as interaction design (section 5). Finally, section 6
concludes the paper.</p>
      <p>System Overview The basic assumption of the system is that a manually
defined set of companies should be searchable in the system. These are the
members and associated companies of the IT-Cluster Upper Franconia. A further
assumption is that all these companies have a more or less expressive website.
Hence, the starting point for the system is a list of companies, consisting of
the name of the company, the URL of the corresponding website, and the office
address (represented in the upper left corner of Fig. 1).</p>
      <p>The URLs are used to crawl the websites of the companies. Roughly speaking,
each company c is represented by the concatenation of the content blocks of its
web pages, called d_c. Of course, some text processing is necessary here for noise
elimination, tokenizing, stemming, and so forth.</p>
      <p>Since the corpus (consisting of about 700 companies at present) is rather
small and queries might be specific (for example a search for "Typo3"), we
incorporated topic models (currently using Latent Dirichlet Allocation) to boost
companies with a broader background in the topic of the query. Using the terms
and LDA-based boost factors, two indexes are built: In the first index the
companies are indexed based on the pseudo documents d_c. In the second index the
single web pages are indexed, because we also want to deliver the best landing
pages for the query within the websites of the ranked companies in the result.
Finally, some auxiliary data (for example the topic models generated via LDA)
is stored, since it is needed during query processing and result presentation.</p>
      <p>When a query is issued, it is processed on the company index and on
the web page index. Then a result page is generated which represents companies
as first-class citizens. For each company, a home page thumbnail, a
characterization of its competencies and its relationship to the query, as well as up to three
query-related landing pages are presented. All these aspects will be considered
in more detail in the remainder of this paper, but beforehand we want to briefly
address related work.</p>
      <p>
        Related Work To the best of our knowledge, there is no directly related work on
company search. The two most closely related areas are expertise retrieval [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
entity search [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Many lessons can be learned from these areas. Nevertheless,
there are some peculiarities with our company search scenario. In expert finding
scenarios the identification of the experts is often a hard problem (see the expert
search in the TREC 2007 enterprise track as an example [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Another aspect
is the ambiguity of names or the vague relationship between persons and
documents. On the other hand, representing experts by pseudo documents is also an
established approach in expert search [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and an elaborate result presentation is
important here as well.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Modeling and retrieval</title>
      <p>When thinking about the retrieval model for the given scenario on a higher level,
a model of the competencies of a company has to be matched with the query
representing the user's information need. The requirements defined
that a keyword query should be used. With respect to the representation of
the company profiles, interviews showed that an automatic extraction process is
preferable to the manual definition of profiles because of the sheer creation effort
and the update problems. Due to the addressed domain of IT companies, it can be
assumed that all companies worth finding maintain a website depicting their
competencies. Of course, other sources of evidence could also be addressed, such
as newspaper reports, business reports, or mentions of the companies on other
pages on the Internet. These approaches are surely worth consideration in the
future. Nevertheless, the concentration and restriction to the company's own
website also has the advantage of predictability and clear responsibilities. Put
simply, if a company complains that it is not among the best matches for a
particular query, we can pass the buck back and encourage them to improve
their website, which they should do anyway for SEO reasons.</p>
      <p>
        To avoid our arguments eventually turning against us, we have to exploit
the information on the websites as well as we can. Besides the crawling and
preprocessing aspects addressed in the following sections 3 and 4, in particular
we have to use an appropriate retrieval model. As a first decision we have to choose
between a company-based approach (each company is represented by a pseudo
document used directly for ranking) and a document-based approach (the
documents are ranked and from this ranking a ranking of the companies is derived).
A comparison of these approaches can, for instance, be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], though without
showing a clear winner. We plan to test both approaches in the future, but we
started with the company-based approach, where each company c is represented
by a pseudo document d_c generated as the concatenation of the content blocks
of the web pages crawled from the respective website. In the future, we plan to
test weighting schemes based on the markup information, the depth of the single
pages, and other parameters.
      </p>
      <p>For a more formal look we use the following definitions:
– q = {w_1, w_2, ..., w_n} is the query submitted by the user (set of words)
– C = {c_1, c_2, ..., c_m} is the set of companies
– d_c is the concatenation of the documents representing company c
– f_{w,d_c} is the number of times w appears in d_c
– cf_w is the number of companies for which d_c contains w
– λ and μ are design parameters</p>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] we use a candidate generation model and try to rank the
companies by P(c|q), the likelihood of company c being competent for query q. As
usual, by invoking Bayes' theorem, this probability can be refactored as follows
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]:
      </p>
      <p>P(c|q) = P(q|c) · P(c) / P(q) ∝ P(q|c) · P(c) ≈ P(q|d_c) · P(c)</p>
      <p>P (qjdc)P (c)
Currently, we use a constant for the company prior P (c). However, it turned out
that this will be an interesting point for future optimizations highly correlated
with aspects of document length normalization for the pseudo documents dc. For
test purposes we inserted big national and international IT companies in the list.
In our current implementation these companies did not make it to the absolute
top ranks even for queries simply consisting of a registered product name of the
company. Instead, small service providers which have specialized in support for
this product were ranked higher. Interestingly, this problem is already an issue
with expert nding, but an even bigger challenge in company search because of
the heterogeneous company sizes.</p>
      <p>
        Another point which became obvious in first experiments was the well-known
vocabulary mismatch. For example, with the query "Typo3" the ranking did
not consider the broader competence of the companies in the topics of web
applications or content management systems. As proposed in [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], we decided to
use Latent Dirichlet Allocation (LDA) to address this problem. An independence
assumption to calculate the probabilities word-wise by P(q|d_c) = ∏_{w∈q} P(w|d_c)
and a combination of the word-based perspective with a topic-based one would
then lead to:
      </p>
      <p>P(w|d_c) = λ · (f_{w,d_c} + μ · cf_w / |C|) / (|d_c| + μ) + (1 − λ) · P_lda(w|d_c)</p>
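      <p>For illustration, this estimate translates directly into code. The following is a minimal sketch, assuming the corpus statistics f_{w,d_c}, |d_c|, cf_w, |C| and the topic model probability P_lda(w|d_c) (explained below) have been computed beforehand; the parameter names follow the formula:</p>
      <preformat><![CDATA[
/**
 * Smoothed word likelihood P(w|d_c): a Dirichlet-style estimate on the
 * pseudo document, interpolated with the LDA-based P_lda(w|d_c).
 */
static double wordLikelihood(double fWdc,   // f_{w,d_c}: occurrences of w in d_c
                             double dcLen,  // |d_c|: length of the pseudo document
                             double cfW,    // cf_w: companies whose d_c contains w
                             double numC,   // |C|: number of companies
                             double pLda,   // P_lda(w|d_c) from the topic model
                             double lambda, double mu) {
    double smoothedTf = (fWdc + mu * (cfW / numC)) / (dcLen + mu);
    return lambda * smoothedTf + (1 - lambda) * pLda;
}
]]></preformat>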
      <p>
        P_lda(w|d_c) stands for the probability that a word w is generated by a topic
which in turn is generated by d_c (see [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ] for more details). To simplify things further,
we employed an idea presented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Here, Lucene payloads are used to boost
terms via LDA. The payload lda(w, d_c) assigned to word w is determined as the
weight of w according to the topic distribution of d_c. This means that lda(w, d_c)
is high when w fits well with the broader topics dealt with in d_c. Combining this
boost factor with term frequency and inverse document frequency information,
we obtain the following scoring function:
      </p>
      <p>score(c, q) = Σ_{w∈q} tf(w, d_c) · idf(w) · lda(w, d_c)</p>
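      <p>Read literally, the score is a plain sum over the query words. The following sketch assumes the three statistics are available as precomputed lookups; the class name and map layout are illustrative, not taken from the paper:</p>
      <preformat><![CDATA[
import java.util.Map;
import java.util.Set;

/** Sketch of score(c, q) = sum over w in q of tf(w, d_c) * idf(w) * lda(w, d_c). */
class CompanyScorer {
    Map<String, Double> idf;                      // idf(w)
    Map<String, Map<String, Double>> tf;          // w -> (company -> tf(w, d_c))
    Map<String, Map<String, Double>> ldaPayload;  // w -> (company -> lda(w, d_c))

    double score(String company, Set<String> query) {
        double s = 0.0;
        for (String w : query) {
            double t = tf.getOrDefault(w, Map.of()).getOrDefault(company, 0.0);
            double b = ldaPayload.getOrDefault(w, Map.of()).getOrDefault(company, 1.0);
            s += t * idf.getOrDefault(w, 0.0) * b;  // tf · idf · LDA boost
        }
        return s;
    }
}
]]></preformat>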
      <p>Of course, this is only a first pragmatic starting point, and the above
considerations point out various interesting aspects for future comparisons and
optimizations.</p>
    </sec>
    <sec id="sec-3">
      <title>Data acquisition</title>
      <p>As a prerequisite, documents representing the companies have to be obtained first.
For crawling company websites we chose to employ crawler4j, a lightweight Java
web crawler (https://github.com/yasserg/crawler4j).</p>
      <p>The crawling of each company is an individual process, which allows us to
crawl multiple companies at once. We start with the company's home URL as
a seed and use a truncated form of that URL as a pattern to discard all links
to external domains found during the process. For our first investigation, we
crawled a maximum of 2,000 documents per company in a breadth-first
manner, where each document is a web page. We plan to leverage additional
document types, such as PDF, in the future.</p>
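      <p>A minimal per-company crawl along these lines is sketched below against the crawler4j 4.x API. This is a sketch, not our production code: the class name, storage folder, and example domain prefix are made up, and the database write in visit() is only indicated:</p>
      <preformat><![CDATA[
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class CompanyCrawler extends WebCrawler {
    // truncated home URL of the current company (illustrative value)
    static String domainPrefix = "http://www.example-company.de/";

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // discard all links leading to external domains
        return url.getURL().toLowerCase().startsWith(domainPrefix);
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            // placeholder for the database write of company, title and URL
            System.out.println(html.getTitle() + " " + page.getWebURL().getURL());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");
        config.setMaxPagesToFetch(2000);   // per-company limit; traversal is breadth-first
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed(domainPrefix);  // the company's home URL as seed
        controller.start(CompanyCrawler.class, 1);
    }
}
]]></preformat>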
      <p>
        For each page, the corresponding company, the page title, and the full URL are stored
in a database. This information is reused later when creating the web page
index. To obtain d_c (the pseudo document describing company c), the contents
of all crawled pages of the company are concatenated. To reduce noise, we first apply
Boilerpipe [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (https://code.google.com/p/boilerpipe) to all documents in
order to extract the main textual content from those pages. This step aims
to eliminate those page elements which do not contribute to the actual content
of a page and are repeated very often: navigation menus, footer information, etc.
      </p>
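      <p>In code, this extraction step is essentially a one-liner with Boilerpipe. Which of Boilerpipe's extractors the system uses is not stated above, so the choice of ArticleExtractor in this sketch is an assumption:</p>
      <preformat><![CDATA[
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class ContentExtraction {
    /** Strips navigation menus, footers and other repeated boilerplate,
        returning only the main textual content of a crawled page. */
    static String mainContent(String rawHtml) throws BoilerpipeProcessingException {
        return ArticleExtractor.INSTANCE.getText(rawHtml);
    }
}
]]></preformat>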
    </sec>
    <sec id="sec-4">
      <title>Preprocessing and indexing</title>
      <p>Early experiments have shown that data quality plays a crucial role for the
quality of the learned topic model. That is why we utilize a customized Lucene
Analyzer before applying LDA to d_c or indexing the company documents. The
analyzer filters German stop words, applies a Porter stemmer for the German
language, and uses a series of regular expressions to remove or modify tokens. As
an example, digit-only tokens are removed, while tokens of the form word1:word2
are split into two tokens, word1 and word2. Consistently, incoming user queries
are processed by the same analyzer.</p>
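      <p>A condensed sketch of such an analyzer against the Lucene 5/6 era API is given below. The concrete filter chain and the class name are assumptions; only the behavior described above (German stop words, German stemming, digit-only token removal) is taken from the text, and the rule splitting word1:word2 into two tokens would need a custom token filter, which is omitted here:</p>
      <preformat><![CDATA[
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.pattern.PatternReplaceFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class CompanyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // filter German stop words
        stream = new StopFilter(stream, GermanAnalyzer.getDefaultStopSet());
        // regex-based token rules; here: blank out digit-only tokens ...
        stream = new PatternReplaceFilter(stream, Pattern.compile("^[0-9]+$"), "", true);
        // ... and drop the resulting empty tokens
        stream = new LengthFilter(stream, 1, Integer.MAX_VALUE);
        // Snowball's German (Porter-style) stemmer
        stream = new SnowballFilter(stream, "German2");
        return new TokenStreamComponents(source, stream);
    }
}
]]></preformat>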
      <p>After the analyzing step, an LDA topic model over all company representations
d_c is created, utilizing the jgibbsLDA (http://jgibblda.sourceforge.net/)
implementation. The resulting model is represented in a Java class hierarchy,
which enables us to directly access the distribution of topics for each company,
as well as the word probability distributions within the topics. Therefore the payload
lda(w, d_c) can be computed immediately for each word w in d_c. Another
representation of d_c is created, in which each term is enriched with its determined
LDA payload. The generated LDA model is reused for result preparation.</p>
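      <p>How lda(w, d_c) is derived from these two distributions is not spelled out above. A natural reading, sketched below under that assumption, is the topic-weighted word probability Σ_t P(w|t) · P(t|d_c), i.e. the P_lda term of section 2 read directly off the trained model:</p>
      <preformat><![CDATA[
/**
 * Sketch: payload of word w for company c, read off the trained LDA model
 * as the topic-weighted word probability sum over t of P(w|t) * P(t|d_c).
 *
 * topicWordProb[t][w] ~ P(w|t)   word distribution of topic t (phi)
 * docTopicProb[c][t]  ~ P(t|d_c) topic distribution of company c (theta)
 */
static double ldaPayload(int w, int c,
                         double[][] topicWordProb, double[][] docTopicProb) {
    double p = 0.0;
    for (int t = 0; t < docTopicProb[c].length; t++) {
        p += topicWordProb[t][w] * docTopicProb[c][t];
    }
    return p;
}
]]></preformat>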
      <p>The company index is created from all pseudo documents d_c enriched with
payloads. When a query is executed, this index is examined by the index searcher and
consequently determines the ranks of the companies in the result set. To be able
to show a query's top documents for a given company, we also create an index
for the companies' web pages. All crawled web pages are considered, and for each
page we also preserve the information about the corresponding company. Both types
of indexes are based on Lucene (https://lucene.apache.org/core/). Prior to
indexing we apply the analyzing process described above.</p>
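      <p>For the web page index, adding a crawled page could look as follows. This is a sketch only: the field names and the stored/indexed choices are assumptions rather than taken from the text, and CompanyAnalyzer refers to the analyzer sketched above:</p>
      <preformat><![CDATA[
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class WebPageIndexer {
    /** Adds one crawled page to the web page index, preserving its company. */
    static void addPage(IndexWriter writer, String company, String title,
                        String url, String content) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("company", company, Field.Store.YES)); // exact-match key
        doc.add(new StringField("url", url, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.NO));    // analyzed, searched
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new CompanyAnalyzer());
        try (IndexWriter writer =
                 new IndexWriter(FSDirectory.open(Paths.get("pageIndex")), cfg)) {
            addPage(writer, "Example GmbH", "Home",
                    "http://www.example-company.de/", "...");
        }
    }
}
]]></preformat>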
      <p>With the creation of a company index representing the companies and their
competencies, a web page index for the companies, as well as the overall topic model,
all steps necessary to enable searching are completed.</p>
    </sec>
    <sec id="sec-5">
      <title>Interaction design</title>
      <p>As usual, the query is given as a simple keyword query. In the future, more
sophisticated variants are conceivable, for example allowing for geographic filter
constraints. Nevertheless, the simple and familiar keyword solution has its
advantages, and the use of geographic filter constraints is debatable as long as only
companies located in Upper Franconia are listed anyway.</p>
      <p>With respect to the result presentation, the situation is more demanding.
Discussions with potential users disclosed the following requirements: (1)
Companies are the main objects of interest. (2) Address information, a first overview,
and a visual clue would be nice. (3) The general company profile as well as its
relationship to the query should become obvious. (4) Entry points (landing pages)
for the query within the website of the company are desired.</p>
      <p>The result page depicted in Fig. 2 directly implements these requirements.
Companies are ranked with respect to the retrieval model described in section 2.
For each company in the result list, a row with the name and three information
blocks is shown. The company name is directly taken from the input data, as is
the address information (Fig. 1, upper left corner). A screenshot of the
homepage (captured with Selenium; http://www.seleniumhq.org/) and a prepared
link to OpenStreetMap complete the left overview block for each company. In
the middle block a word cloud is given. Here, the size of a term represents the
importance of the term for the company profile (based on tf·idf information).
The color represents the relationship of the term to the query; orange
represents a strong relationship. The relationship is calculated based on a company's
prevalent LDA topics. Currently, we consider the five top terms of the five topics
with the highest correlation to the query. At most thirty terms are shown in the
word cloud, taking terms important for the company profile and terms important for
the relationship to the query in a round-robin procedure. Finally, the right block
consists of up to three most relevant landing pages within the company website,
represented by their title, their URL, and a query-dependent snippet.</p>
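      <p>The round-robin selection of word cloud terms can be sketched as follows; the method name and list layout are illustrative assumptions, with profileTerms assumed sorted by tf·idf weight and queryTerms by query relatedness:</p>
      <preformat><![CDATA[
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class WordCloudTerms {
    /** Merges the two ranked term lists round-robin, up to `limit` distinct terms. */
    static List<String> select(List<String> profileTerms,
                               List<String> queryTerms, int limit) {
        LinkedHashSet<String> cloud = new LinkedHashSet<>();  // keeps insertion order
        for (int i = 0; cloud.size() < limit
                && (i < profileTerms.size() || i < queryTerms.size()); i++) {
            if (i < profileTerms.size()) cloud.add(profileTerms.get(i));
            if (cloud.size() < limit && i < queryTerms.size()) cloud.add(queryTerms.get(i));
        }
        return new ArrayList<>(cloud);  // at most `limit` terms, e.g. 30
    }
}
]]></preformat>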
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper we have described the company search problem and presented
a solution based on pseudo document ranking, the use of LDA to incorporate
topical relevance, and a suitable result presentation. Currently, the prototype
implementation is being tested. In preliminary interviews with representatives of
local IT companies, its effectiveness and efficiency turned out to be
promising. Current response times of the system are below two seconds. The most
obvious challenges are the appropriate ranking of companies of different sizes, the
visualization of the company profiles in the result page, and a reasonable
modeling and presentation of topics (number of topics in LDA and also alternative
approaches). The current prototype is available on the project web page
(http://www.uni-bamberg.de/minf/forschung/firmensuche).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bailey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Vries</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craswell</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soboro</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Overview of the TREC-2007 enterprise track</article-title>
          .
          <source>In: The Sixteenth Text REtrieval Conference (TREC</source>
          <year>2007</year>
          )
          <article-title>Proceedings</article-title>
          . NIST Special Publication: SP 500-
          <fpage>274</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azzopardi</surname>
          </string-name>
          , L.,
          <string-name>
            <surname>de Rijke</surname>
          </string-name>
          , M.:
          <article-title>Formal models for expert nding in enterprise corpora</article-title>
          .
          <source>In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06</source>
          , pp.
          <volume>43</volume>
          {
          <fpage>50</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2006</year>
          ).
          <source>DOI 10</source>
          .1145/1148170.1148181
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang</surname>
          </string-name>
          , Y.,
          <string-name>
            <surname>de Rijke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serdyukov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Si</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Expertise retrieval</article-title>
          .
          <source>Found. Trends Inf. Retr</source>
          .
          <volume>6</volume>
          (
          <issue>2</issue>
          {3),
          <volume>127</volume>
          {
          <fpage>256</fpage>
          (
          <year>2012</year>
          ).
          <source>DOI 10.1561/1500000024</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>3</volume>
          ,
          <issue>993</issue>
          {
          <fpage>1022</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metzler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strohman</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Search Engines: Information Retrieval in Practice</article-title>
          . Pearson
          <string-name>
            <surname>Education</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Probabilistic models for expert nding</article-title>
          .
          <source>In: Proceedings of the 29th European Conference on IR Research, ECIR'07</source>
          , pp.
          <volume>418</volume>
          {
          <fpage>430</fpage>
          . Springer-Verlag, Berlin, Heidelberg (
          <year>2007</year>
          ). URL http://dl.acm.org/citation.cfm?id=
          <volume>1763653</volume>
          .
          <fpage>1763703</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Kohlschutter,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Fankhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Nejdl</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          :
          <article-title>Boilerplate detection using shallow text features</article-title>
          .
          <source>In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10</source>
          , pp.
          <volume>441</volume>
          {
          <fpage>450</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2010</year>
          ).
          <source>DOI 10</source>
          .1145/1718487.1718542
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
          </string-name>
          , W.B.:
          <article-title>LDA-based document models for ad-hoc retrieval</article-title>
          .
          <source>In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pp.
          <volume>178</volume>
          {
          <fpage>185</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Zhang,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>A new ranking method based on latent dirichlet allocation</article-title>
          .
          <source>Journal of Computational Information Systems</source>
          <volume>8</volume>
          (
          <issue>24</issue>
          ),
          <volume>10</volume>
          ,141{
          <fpage>10</fpage>
          ,
          <issue>148</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Entity-centric search: querying by entities and for entities</article-title>
          .
          <source>Dissertation</source>
          , University of Illinois at Urbana-Champaign (
          <year>2014</year>
          ). URL http://hdl.handle.
          <source>net/2142/72748</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>1 http://www.uni-bamberg.de/minf/forschung/firmensuche</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>