<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Person attribute extraction from the textual parts of Web pages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>István T. Nagy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richárd Farkas</string-name>
          <email>rfarkas@inf.u-szeged.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Group of Artificial Intelligence, Hungarian Academy of Sciences</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Szeged, Department of Informatics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present the RGAI systems which participated in the third Web People Search Task challenge. The chief characteristics of our approach are that we focus on the raw textual parts of the Web pages instead of the structured parts, we group similar attribute classes together and we explicitly handle their interdependencies. The RGAI systems achieved top results on the attribute extraction subtask, and average results on the clustering subtask.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language processing</kwd>
        <kwd>information extraction</kwd>
        <kwd>Web content mining</kwd>
        <kwd>person attribute extraction</kwd>
        <kwd>document clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Personal names are among the most frequently searched items in Web search
engines. However, search results typically ignore the fact that one name
may refer to more than one person, and person names are often highly
ambiguous. The first WePS challenge, organized in 2007, focused on this
disambiguation problem. As input, the participants’ systems received Web pages
retrieved from a Web search engine using a given person name as a query. The aim of
the task was to find all the different people in the result lists and assign the
corresponding documents to each person. During the evaluation of WePS1, the
organizers realized that some attributes are very useful for the person disambiguation
problem. Hence the second WePS challenge, organized in 2009, introduced an
entirely new subtask. The attribute extraction subtask was to identify 16 different
attributes from Web pages, such as birth date, affiliation, and occupation. This subtask
proved very difficult and the best system only achieved an F-measure score of 12.2.
The third WePS shared task introduced a novel subtask which sought to mine
attributes for persons, i.e. attribute extraction from the clusters of pages belonging
to each given person. We now describe our system that participated in this third
WePS challenge.
      </p>
      <p>
        The aim of Web Content Mining is to extract useful information from the natural
language-written parts of websites. The first attempts at Web Content Mining began
around 1998-’99 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. They were expert systems
with hand-crafted rules or with rules induced in a supervised manner from
labeled corpora. The next generation of approaches, on the other hand, works in
weakly-supervised settings [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Here the input is a seed list of target
information pairs and the goal is to gather a set of pairs which are related to each
other in the same manner as the seed pairs. The pairs may contain related entities (for
example, country - capital city in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and celebrity partnerships in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) or form an
entity-attribute pair (like Nobel Prize recipient - year in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) or may be concerned
with retrieving all available attributes for entities [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These systems generally
download Web pages that contain the seed pairs, then learn syntactic / semantic rules
from the sentences of the pairs (they generally use the positive instances for one case
as negative instances for another case).
      </p>
      <p>
        The person name disambiguation subtask of the second WePS challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was
dominated by systems which had a preprocessing step, where the HTML documents
were converted to plain text, and then standard clustering algorithms were employed
with a bag-of-words representation of the pages. The participants of the attribute
extraction subtask of this challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] generally used hand-crafted rules for the
attribute classes. Named Entity Recognisers were also applied here, but in
combination with pre- and postprocessing heuristics (the best performing system used
only expert rules).
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 Our methods</title>
      <p>We shall focus on the raw text parts of the Web pages because we found that more
pages express content in textual form than in structured form [16]. The first step of
information extraction may be to construct a good section selection module. When
handling the problem, we first extract the candidate attributes from the relevant
sections of Web pages, then we cluster the pages by merging clusters having common
person attributes and aggregate attributes to the persons identified.</p>
      <sec id="sec-2-1">
        <title>3.1 Preprocessing</title>
        <p>The input of the participants’ system was a set of pages retrieved from a Web search
engine using a given person’s name as a query. We assumed that useful information is
available in the natural language-written parts of websites and in tables [16]. We
therefore concentrated on these parts and discarded many noisy and misleading
elements from the pages (e.g. menu elements), as such elements can seriously hinder the
proper functioning of Natural Language Processing (NLP) tools.</p>
        <p>
          In order to identify textual paragraphs we applied the Stanford POS tagger to each
section of the DOM tree of the HTML files. We assumed that one piece of text was a
textual paragraph if it was longer than 60 characters and it contained more than one
verb. We extracted several attributes with our own Named Entity Recognition (NER)
[
          <xref ref-type="bibr" rid="ref13">14</xref>
          ] system which was trained on CoNLL-2003 training data sets. When we used this
model on the entire set of paragraphs, the accuracy score obtained was low. To handle
this problem we developed attribute-specific, relevant section selection modules.
Firstly we looked for the occurrences of all gold standard attributes using simple
string matching in each extracted paragraph. In this way we created a database with
‘positive’ and ‘negative’ paragraphs for the actual attribute. Then we created a set of
positive words with the most frequently occurring words from the positive
paragraphs. If a paragraph in the prediction phase contained at least one word from
the actual positive list, we marked it as a positive paragraph and we only extracted
attributes from these paragraphs. This approach was used to find the occupation,
affiliation, award and school attributes.
        </p>
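        <p>The paragraph filter and the attribute-specific section selection described above can be sketched as follows. This is a minimal illustration, not the implementation used in the RGAI systems: the function names, the size of the positive word list (top_n) and the stopword handling are assumptions, and the part-of-speech tags are assumed to be Penn Treebank tags supplied by an external tagger.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lower-case a paragraph and return its alphabetic tokens."""
    return TOKEN.findall(text.lower())


def is_textual_paragraph(text, pos_tags):
    """Longer than 60 characters and more than one verb (tags starting with 'VB')."""
    verbs = sum(1 for tag in pos_tags if tag.startswith("VB"))
    return len(text) > 60 and verbs > 1


def label_paragraphs(paragraphs, gold_values):
    """Split training paragraphs into positives and negatives by string matching."""
    positives, negatives = [], []
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        hit = any(value.lower() in lowered for value in gold_values)
        (positives if hit else negatives).append(paragraph)
    return positives, negatives


def build_positive_words(positive_paragraphs, top_n=50, stopwords=frozenset()):
    """Most frequent non-stopword tokens of the positive paragraphs (top_n is assumed)."""
    counts = Counter()
    for paragraph in positive_paragraphs:
        counts.update(t for t in tokenize(paragraph) if t not in stopwords)
    return {word for word, _ in counts.most_common(top_n)}


def is_relevant(paragraph, positive_words):
    """Keep a paragraph if it contains at least one word from the positive list."""
    return any(token in positive_words for token in tokenize(paragraph))
```
]]></preformat>
        <p>In this sketch only the paragraphs accepted by is_relevant would be passed on to the attribute extractors.</p>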
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Attribute extraction</title>
        <p>
          Like other WePS2 systems [12],[
          <xref ref-type="bibr" rid="ref12">13</xref>
          ], our attribute extraction system consists of
two fundamental parts: a candidate attribute extraction module and an attribute
verification module. Following this approach, we first mark potential attribute values
in a paragraph. Second, we verify which of the candidate values are valid.
        </p>
        <p>When handling the attribute extraction subtask, it seems necessary to organize
the attribute classes in several ways. On the one hand, we aggregated similar attributes into
logical groups. For instance, the name group contains the other name, relatives and
mentor attribute classes. On the other hand, we can assume subordinate relations among the
attributes within a group. For example, we only marked a candidate name as mentor if it
was not marked as relatives or other name.</p>
        <p>Next, we will elaborate on the extraction procedure for each of the attributes.</p>
        <p>Date of birth: if a paragraph contains the phrases born, birth or birthday, we look for
candidate dates with a date validator within a window around the cue word. This validator
works with 9 different regular expression rules, and can identify dates written in
different formats in the span of text.</p>
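        <p>A cue-word-plus-window date search of this kind can be sketched as follows. The four patterns and the 60-character window are illustrative assumptions; they are not the nine rules of the validator described above, whose exact form is not given here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

# Illustrative date formats only (assumed, not the original nine rules).
DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                          # 1950-03-14
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),                      # 3/14/1950
    re.compile(r"\b(?:%s)\s+\d{1,2},\s*\d{4}\b" % MONTHS, re.I),   # March 14, 1950
    re.compile(r"\b\d{1,2}\s+(?:%s)\s+\d{4}\b" % MONTHS, re.I),    # 14 March 1950
]
CUE = re.compile(r"\b(?:born|birth|birthday)\b", re.I)


def birth_date_candidates(paragraph, window=60):
    """Return date strings found within `window` characters of a birth cue word."""
    found = []
    for cue in CUE.finditer(paragraph):
        start = max(0, cue.start() - window)
        context = paragraph[start:cue.end() + window]
        for pattern in DATE_PATTERNS:
            for match in pattern.finditer(context):
                if match.group(0) not in found:
                    found.append(match.group(0))
    return found
```
]]></preformat>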
        <p>
          Birth place: when a paragraph contains the phrases born, birth, birthplace, hometown or
native, we use the location markups given by the NER tool [
          <xref ref-type="bibr" rid="ref13">14</xref>
          ] trained on the
locations class of the CoNLL-2003 training dataset to identify candidate locations for
the birthplace. We accept a location as a birthplace if a birthplace validator validates
it.
        </p>
        <p>
          Occupation: according to the WePS2 results, this was one of the most difficult,
ambiguous and frequent attribute classes, which is due to the abstract nature of this
attribute. Hence we avoided using lists. Occupation is not covered by any NER model or
training database, so we created a training database by matching all gold annotations
to paragraphs. We used simple string matching and we did not know where the actual
attribute occurred, so the resulting dataset was very noisy. We trained our
NER tool [
          <xref ref-type="bibr" rid="ref13">14</xref>
          ] on this training database, and we used it on the candidate occupation
paragraphs.
        </p>
        <p>
          Organizations (school, award, affiliation): we found that these types of attributes
were names of organizations so we grouped them together. We also used an NER tool
[
          <xref ref-type="bibr" rid="ref13">14</xref>
          ] here, trained on the organization class of the CoNLL-2003 training data, to
identify candidate organization mentions, but only in affiliation-candidate paragraphs.
When the NER model marks a candidate organization phrase, we first check for the
school attribute. A candidate organization is a potential school if it
appears near cue phrases such as graduate, degree, attend, education and
science. We then defined a school validator that uses the MIVTU [12] school word
frequency list with School, High, Academy, Christian, HS, Central and Senior. We
extended this list with the words University, College, Elementary, New, State, Saint and
Institute. Sequences of words with capitalized first letters (apart from stopwords like of
and at) which contain at least one of these words were marked as a school by the validator. If
the school validator did not validate the candidate organization, we looked
for the award attribute. When a candidate sequence appears near cue phrases such as
award, win, won, receive and price, we assumed it was a potential award
attribute. We also defined an award validator, which validates a sequence of words with
capitalized first letters (apart from stopwords like at and of) if it contains at least one of
the words award, prize, medal, order, year, player and best. When the candidate
string is neither a valid school nor a valid award, we tag it as an affiliation attribute.
        </p>
        <p>
          Degree: we manually compiled a list of degrees which contains 62 items. When we
found an element of this list in a paragraph, we marked it as a degree attribute.
We assumed that the degree attribute might be located far from the name in a
CV-type Web page.
        </p>
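        <p>The school, award and affiliation cascade for organization candidates can be illustrated with the sketch below. The word lists are taken from the description above, while the function names, the 80-character context window, and the choice to require both a nearby cue phrase and a successful validator are assumptions made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
SCHOOL_WORDS = {"school", "high", "academy", "christian", "hs", "central", "senior",
                "university", "college", "elementary", "new", "state", "saint",
                "institute"}
AWARD_WORDS = {"award", "prize", "medal", "order", "year", "player", "best"}
SCHOOL_CUES = ("graduate", "degree", "attend", "education", "science")
AWARD_CUES = ("award", "win", "won", "receive", "price")
STOPWORDS = {"of", "at"}


def is_capitalized_sequence(candidate):
    """Every token starts with an upper-case letter, except allowed stopwords."""
    tokens = candidate.split()
    return bool(tokens) and all(
        t.lower() in STOPWORDS or t[0].isupper() for t in tokens)


def validates(candidate, keywords):
    """Capitalized sequence that contains at least one keyword."""
    tokens = {t.strip(".,").lower() for t in candidate.split()}
    return is_capitalized_sequence(candidate) and bool(tokens & keywords)


def near_cue(candidate, paragraph, cues, window=80):
    """True if a cue phrase occurs within `window` characters around the candidate."""
    pos = paragraph.find(candidate)
    if pos < 0:
        return False
    before = paragraph[max(0, pos - window):pos]
    end = pos + len(candidate)
    after = paragraph[end:end + window]
    context = (before + " " + after).lower()
    return any(cue in context for cue in cues)


def classify_organization(candidate, paragraph):
    """Try school first, then award, and fall back to affiliation."""
    if near_cue(candidate, paragraph, SCHOOL_CUES) and validates(candidate, SCHOOL_WORDS):
        return "school"
    if near_cue(candidate, paragraph, AWARD_CUES) and validates(candidate, AWARD_WORDS):
        return "award"
    return "affiliation"
```
]]></preformat>
        <p>The order of the checks encodes the subordinate relation among the three classes: affiliation is only assigned when neither of the more specific validators accepts the candidate.</p>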
        <p>
          Names (relatives, other name, mentor): these types of attributes are all person names, so
we grouped them together. To identify name attributes we used an NER tool
[
          <xref ref-type="bibr" rid="ref13">14</xref>
          ] trained on the person names of the CoNLL training data. The model marks name
phrases as relatives if the immediate context of the candidate contains a cue phrase that
indicates a family relationship, like father, son, daughter and so on. The cue phrases were
the same as in the MIVTU [12] system used in WePS2 and are also available in
Wikipedia. If we did not mark the candidate sequence as relatives,
we looked for the other name attribute instead. We hypothesized that a person does not
always write his or her name using the same number of tokens; at the same time, an other name
has to contain at least a part of the original name. This hypothesis may not be true for
nicknames. For example, when the original name was Helen Thomas, we did not
accept the candidate string Helen McCumber, but we accepted the Helen M. Thomas
sequence. If a name was not marked as relatives or other name, we analyzed the
candidate as a potential mentor name. If it appeared near cue phrases such as study
with, work with, coach, train, advis, mentor, supervisor, principal, manager and
promote, we marked the candidate sequence as a mentor attribute.
        </p>
        <p>
          Nationality: we created a list of nationalities that contained 371 elements. It has
multiple entries for certain nationalities. Once we found an element of this list in a
paragraph or table, we assumed a potential nationality attribute. Then we selected the
most frequent potential nationality attribute of the Web pages.
        </p>
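        <p>The other name hypothesis (a different number of tokens, but at least one token shared with the original name) can be written as a short check. The function name and the whitespace tokenization are assumptions of this sketch.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def is_other_name(original, candidate):
    """Accept a candidate that shares a token with the original name but has a
    different number of tokens."""
    original_tokens = original.lower().split()
    candidate_tokens = candidate.lower().split()
    if len(candidate_tokens) == len(original_tokens):
        return False
    return any(token in original_tokens for token in candidate_tokens)
```
]]></preformat>
        <p>With the example from the text, is_other_name("Helen Thomas", "Helen McCumber") is rejected and is_other_name("Helen Thomas", "Helen M. Thomas") is accepted.</p>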
        <p>When extracting the availability attribute classes (phone, fax, e-mail and website), we
did not use only the textual paragraphs, but examined the whole text of the Web pages, as
these types of attributes may occur in other parts as well.</p>
        <p>Phone: when a text contains tel, telephone, ph:, phone, mobile, call, reached at,
office, cell or contact words or a part of the original name, we applied the following
regular expression:</p>
        <p>
          (((?[0-9+(][.()0-9\s/-]{4,}[0-9])((?(\s?x|\s?ext|\s?hart).?)?\d{1,5})?)
        </p>
        <p>
          It is a permissive regular expression for potential phone numbers. We defined a phone
number validator that validated the sequence determined by the regular expression.
        </p>
        <p>
          Fax: we use the same method as for phone numbers, but we look for the fax, telfax and
telefax phrases.
        </p>
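        <p>The cue-word check followed by a permissive pattern and a validator can be sketched as below. The pattern is a simplified stand-in written for this example, not the expression quoted above, and the digit-count validator (7 to 15 digits) is an assumption, since the rules of the phone number validator are not specified.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

PHONE_CUES = ("tel", "telephone", "ph:", "phone", "mobile", "call", "reached at",
              "office", "cell", "contact")

# Simplified, permissive pattern (assumed): a leading digit, "+" or "(", at least
# four further digits or separators, a final digit, and an optional extension
# such as "x123" or "ext. 123".
PHONE = re.compile(r"[0-9+(][0-9().\s/-]{4,}[0-9](?:\s?(?:x|ext)\.?\s?\d{1,5})?")


def is_valid_phone(candidate, min_digits=7, max_digits=15):
    """Hypothetical validator: accept candidates with a plausible digit count."""
    digits = sum(ch.isdigit() for ch in candidate)
    return min_digits <= digits <= max_digits


def phone_candidates(text):
    """Return validated phone-like strings if the text contains a cue word."""
    lowered = text.lower()
    if not any(cue in lowered for cue in PHONE_CUES):
        return []
    return [m.group(0).strip() for m in PHONE.finditer(text)
            if is_valid_phone(m.group(0))]
```
]]></preformat>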
        <p>E-mail: we assumed that if somebody offers their e-mail address, it is also a link.
Therefore, we examined links that use the mailto: scheme. Moreover, we assumed that
every e-mail address contains the original name or a part of it. Hence
we defined an e-mail address validator. We generate
all character trigrams from the original name and when an e-mail address contains at
least one of them, the validator accepts it. We defined a stop list as well, which
contains words such as wiki, support, and webmaster. Should a candidate e-mail
address contain an item from the stop list, the validator does not accept it. Next we
extracted the domain from all accepted e-mail addresses, which we used for the
website attribute.</p>
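        <p>The trigram-based e-mail validator can be sketched as follows. The stop list contains only the three words mentioned above, and generating the trigrams per name token (rather than across token boundaries) is an assumption of this example, as are the function names.</p>
        <preformat preformat-type="code"><![CDATA[
```python
STOP_LIST = ("wiki", "support", "webmaster")


def name_trigrams(name):
    """Character trigrams of each token of the person name (lower-cased)."""
    grams = set()
    for token in name.lower().split():
        letters = "".join(ch for ch in token if ch.isalpha())
        grams.update(letters[i:i + 3] for i in range(len(letters) - 2))
    return grams


def accept_email(address, person_name):
    """Reject stop-listed addresses; accept those containing a name trigram."""
    address = address.lower()
    if any(stop in address for stop in STOP_LIST):
        return False
    return any(gram in address for gram in name_trigrams(person_name))


def email_domain(address):
    """Domain part of an address, reused later for the website attribute."""
    return address.rsplit("@", 1)[-1].lower()
```
]]></preformat>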
        <p>Web site: we assumed that when somebody displays a Web address on a website, it
is also a link. Hence we only
extract the website attribute from links. We marked a candidate as a
website when it contained the original name, a part of the original name, or a
domain name extracted for the e-mail attribute.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.3 Clustering</title>
        <p>Our chief hypothesis in the clustering subtask was that it can be effectively solved by
using extracted person attributes. We defined a weighting of attribute classes. The
most useful attribute classes were web address, e-mail, telephone, fax number and
other name and they got a weight of 3. In addition, we weighted birth date as 2 while
birth place, mentor, affiliation, occupation, nationality, relatives, school, and award
each got a weight of 1. Then every document was represented by a vector with
extracted person attribute values.</p>
        <p>To define a document similarity measure, we needed to normalize the attribute
values, i.e. spelling variants and synonyms have to be handled as equivalents. We
developed individual normalization rules for each attribute class. For example, the
birth place of United States of America could be referred to as USA, U.S.A., United
States, Federal United States and so on. Here, we created a synonym dictionary based
on the re-direct links of the English Wikipedia and we developed regular expressions
or transformation rules for other attribute classes.</p>
        <p>As a first approach for Web page clustering, a bottom-up heuristic clustering was
performed. Here the starting clusters consist of the individual Web pages and then the
clusters are merged iteratively until a stopping criterion is reached. In each step of
this procedure the two most similar clusters are merged (the union of their attributes
forms the attribute set of the resulting cluster). The similarity measure is the
weighted size of the intersection of the cluster attribute sets, computed after applying the
normalization rules. The stopping criterion was defined to be a similarity value
threshold of 2, i.e. if the similarity value of the closest clusters is less than 2 the
procedure is terminated (RGAI5 submission).</p>
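        <p>The bottom-up procedure with the weighted intersection similarity and the threshold of 2 can be sketched as follows. The attribute class identifiers and the data layout (one set of already normalized (class, value) pairs per page) are assumptions of this example; the weights follow the description above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Weights from the text; every class not listed here has weight 1.
WEIGHTS = {"website": 3, "email": 3, "phone": 3, "fax": 3, "other_name": 3,
           "birth_date": 2}


def similarity(attrs_a, attrs_b):
    """Weighted size of the intersection of two sets of (class, value) pairs."""
    return sum(WEIGHTS.get(cls, 1) for cls, _ in attrs_a & attrs_b)


def cluster_pages(page_attributes, threshold=2):
    """Greedy agglomerative clustering; returns lists of page indices."""
    clusters = [({i}, set(attrs)) for i, attrs in enumerate(page_attributes)]
    while len(clusters) > 1:
        best, best_pair = -1, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = similarity(clusters[i][1], clusters[j][1])
                if score > best:
                    best, best_pair = score, (i, j)
        if best < threshold:
            break  # the closest clusters are not similar enough
        i, j = best_pair
        merged = (clusters[i][0] | clusters[j][0],   # union of page indices
                  clusters[i][1] | clusters[j][1])   # union of attribute sets
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(pages) for pages, _ in clusters]
```
]]></preformat>
        <p>Because a single shared weight-1 attribute (for example nationality) gives a similarity of 1, it never triggers a merge on its own under this threshold.</p>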
        <p>Besides this heuristic bottom-up clustering, we employed the Xmeans algorithm in
the WEKA Java package [15] as well. The advantage of this approach is that it is not
necessary to define the number of clusters, but we can define the minimum number of
clusters. We used the final number of clusters obtained by the heuristic clustering as
the minimum number of clusters for Xmeans. (RGAI3)</p>
        <p>In addition to person attribute-based Web page clustering, we also experimented
with a text-based approach. For the RGAI1 run, we only used the search engine
snippet data. This type of representation compresses the most important pieces of
information. We represented the dataset with the tf-idf vector space model where
every document is a vector. The RGAI1 and the RGAI2 results were almost identical.
Lastly, RGAI4 is a hybrid method of the above two approaches, i.e. the feature sets of
the person-based attribute and the snippet-based clustering were merged.</p>
      </sec>
      <sec id="sec-2-4">
        <title>3.4 Attribute aggregation</title>
        <p>As a last step, we had to aggregate those attributes that occurred in Web pages and
were found in a cluster of pages, i.e. belonged to a particular person. The official
evaluation metric of the challenge required only one attribute value from each class. As we
extracted more than one potential attribute value for each class, we had to choose one
(e.g. a person may mention several of his or her affiliations). In the end we chose the most
frequent value per person from each attribute class. When several values were
equally frequent, we chose one of them at random.</p>
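        <p>The aggregation step amounts to a per-class majority vote with a random tie-break, which can be sketched as follows. The function name and the input layout (a list of (class, value) pairs collected from all pages of a cluster) are assumptions of this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import Counter, defaultdict


def aggregate(cluster_attributes, rng=random):
    """Pick the most frequent value per attribute class; break ties at random."""
    by_class = defaultdict(Counter)
    for cls, value in cluster_attributes:
        by_class[cls][value] += 1
    result = {}
    for cls, counts in by_class.items():
        top = max(counts.values())
        tied = sorted(value for value, count in counts.items() if count == top)
        result[cls] = rng.choice(tied)
    return result
```
]]></preformat>
        <p>Passing a seeded random.Random instance as rng makes the tie-break reproducible.</p>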
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Results and discussion</title>
      <p>
        Because the WePS3 attribute extraction subtask required clustering, the runs
we submitted performed both the attribute extraction and the clustering tasks. The test dataset
was composed of 300 person names and nearly 200 Web documents for each name.
The attributes had to be assigned to each person cluster rather than to individual
pages. The training dataset consisted of the WePS2 train and test sets, which contain 5,122
websites with 187,032 textual paragraphs. We found 2,781 affiliation, 3,419
occupation and 2,092 biographical paragraphs. For the location, organization and
names, markups given by the NER tool [
        <xref ref-type="bibr" rid="ref13">14</xref>
        ] trained on the CoNLL-2003 training
dataset achieved F scores of 89.94 on names, 87.06 on locations and 76.37 on
organizations.
      </p>
      <p>During the evaluation of the clustering subtask the organizers used the extended
versions of BCubed Precision and Recall, and their F-measure with alpha set to 0.5 was
the official evaluation metric. They evaluated the clustering of documents for each query
focusing on just two different people, except for 50 names, where only documents about
one person were considered. The official results on the clustering task of the RGAI
systems, the best performing participant and two baselines are shown in Table 1.
Among our systems, RGAI 1 achieved the best scores.</p>
      <p>For the attribute extraction subtask<fn id="fn1"><label>1</label><p>Please note that at the time of preparing the workshop proceedings, official results for the
attribute extraction subtask were not available due to unexpected difficulties of the task
organizers with the manual assessments. According to the organizers, the results in this paper were
achieved on 12.5 percent of the test dataset.</p></fn>, the evaluation metrics were computed as follows.</p>
      <p>Precision: for a given person, it is the number of correct attribute/value pairs divided
by the total number of attribute/value pairs extracted.</p>
      <p>Recall: for a given person, it is the number of attributes having at least one correct
value divided by the total number of attributes for which a correct value has been
found by at least one of the systems.</p>
      <p>F-measure: 1 / (alpha * 1/prec + (1-alpha) * 1/rec), where alpha was 0.5.</p>
      <p>The “given person” defined above is taken from the prediction of the clustering
subtask. The gold standard annotation of clustering consists of two persons (clusters)
for every document set. During the evaluation process the most similar predicted
cluster was taken into account, where either the F score or the recall was used as the similarity
metric, with the following definitions.</p>
      <p>Precision: the number of documents in the cluster that refer to the person / number of
documents in the cluster.</p>
      <p>Recall: the number of documents in the cluster that refer to the person / number of
documents that refer to the person.</p>
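        <p>The F-measure formula and the selection of the most similar predicted cluster can be illustrated with the sketch below. It is not the official scorer: the function names, the representation of clusters as sets of document identifiers, and the convention of returning 0 when precision or recall is 0 are assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def f_measure(precision, recall, alpha=0.5):
    """F = 1 / (alpha * 1/prec + (1 - alpha) * 1/rec); 0 if either term is 0."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)


def cluster_precision_recall(predicted_cluster, gold_cluster):
    """Document-level precision and recall of a predicted cluster for one person."""
    overlap = len(set(predicted_cluster) & set(gold_cluster))
    precision = overlap / len(predicted_cluster) if predicted_cluster else 0.0
    recall = overlap / len(gold_cluster) if gold_cluster else 0.0
    return precision, recall


def best_matching_cluster(predicted_clusters, gold_cluster, use_recall=False):
    """Predicted cluster most similar to the gold cluster, by F score or recall."""
    def score(cluster):
        precision, recall = cluster_precision_recall(cluster, gold_cluster)
        return recall if use_recall else f_measure(precision, recall)
    return max(predicted_clusters, key=score)
```
]]></preformat>
        <p>The two matching options can select different clusters: a large predicted cluster that covers all documents of the person wins under recall, while a smaller and purer cluster may win under the F score.</p>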
      <p>Next, the organizers defined two different interpretations of the manual annotations,
which were combined with the two cluster matching options above.</p>
      <p>Strict evaluation: we count as correct all attribute / value pairs judged as correct by a
majority of annotators and as incorrect otherwise.</p>
      <p>Lenient evaluation: we count as correct all attribute / value pairs judged as correct or
inexact by a majority of annotators, and as incorrect otherwise.</p>
      <p>Table 2 shows the results of the RGAI systems when the cluster matching was
recall-based and the manual annotation interpretation was lenient. Our best result was
achieved by the RGAI 3 system, but the Intelius system was outstanding.</p>
      <p>However, when we used the lenient annotation interpretation and the cluster matching
based on the F score, our RGAI 3 system achieved significantly better
results (see Table 3).</p>
      <p>When we used the strict annotation and recall-based cluster matching, the results of the Intelius
system were dramatically better than those of the other systems. It was able to cluster the
documents better (see Table 4).</p>
      <p>Finally, Table 5 shows the results when the cluster matching was based on the F
score and the annotation was strict. RGAI 3 achieved the best result, and our systems
performed fairly well.</p>
      <p>The above tables show that our approach achieved an F score slightly above 10 with F
score-based cluster matching. Compared to the WePS2 results, where the best system
achieved an F score of about 12, these results are competitive as we solved a
more complex problem here. On the other hand, the recall-based results show that our
clustering approach has to be improved.</p>
    </sec>
    <sec id="sec-4">
      <title>5 Summary and Conclusions</title>
      <p>In this article we presented a person name disambiguation method with biographical
attribute extraction from documents related to a person. We handled the name
disambiguation problem from person Web search results. Our method is based on
extracted biographical attributes and snippet information. The proposed clustering
method was evaluated using the test dataset created for the name disambiguation
subtask of the third Web People Search Task. Our clustering approach got an F score
of 40 and was ranked fourth among the eight participants.</p>
      <p>For the second subtask of the shared task, our method efficiently extracted the
different types of attributes from Web pages and we achieved top results on the
WePS3 challenge. We think that the reasons for the success of our attribute extractor
are the following. First, our approach groups attribute classes and introduces rules
which efficiently handle the interdependencies among these classes. Second, we
focused on the textual parts of the Web pages using NLP tools, which demonstrates
that the raw text parts of person Web pages should be analyzed besides the structured
parts of the pages.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by an NKTH grant of the Jedlik Ányos R&amp;D
Programme (project codename TEXTREND) of the Hungarian government. The authors
would like to thank the shared task organizers for their devoted efforts.</p>
    </sec>
    <sec id="sec-6">
      <title>6 References</title>
      <p>15. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H.
Witten (2009); The WEKA Data Mining Software: An Update; SIGKDD Explorations,
Volume 11, Issue 1.</p>
      <p>17. István Nagy, Richárd Farkas and Márk Jelasity. Researcher affiliation extraction from
homepages. NLPIR4DL ACL Workshop 2009 pp 1-9</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Javier</given-names>
            <surname>Artiles</surname>
          </string-name>
          , Julio Gonzalo and Satoshi Sekine.
          <article-title>WePS 2 Evaluation Campaign: Overview of the Web People Search Clustering Task</article-title>
          . In: 2nd Web People Search Evaluation Workshop (WePS
          <year>2009</year>
          ),
          <source>18th WWW Conference, April 20th-24th</source>
          ,
          <year>2009</year>
          , Madrid, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Satoshi</given-names>
            <surname>Sekine</surname>
          </string-name>
          and
          <string-name>
            <given-names>Javier</given-names>
            <surname>Artiles</surname>
          </string-name>
          .
          <article-title>WePS 2 Evaluation Campaign: Overview of the Web People Search Attribute Extraction Task</article-title>
          .
          <source>In: 2nd Web People Search Evaluation Workshop (WePS</source>
          <year>2009</year>
          ),
          <source>18th WWW Conference, April 20th-24th</source>
          ,
          <year>2009</year>
          , Madrid, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Brad</given-names>
            <surname>Adelberg</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Nodose - a tool for semiautomatically extracting structured and semistructured data from text documents</article-title>
          .
          <source>ACM SIGMOD</source>
          ,
          <volume>27</volume>
          (
          <issue>2</issue>
          ):
          <fpage>283</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Mary Elaine</given-names>
            <surname>Califf</surname>
          </string-name>
          and
          <string-name>
            <given-names>Raymond J.</given-names>
            <surname>Mooney</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Relational learning of pattern-match rules for information extraction</article-title>
          .
          <source>In Proceedings of the Sixteenth National Conference on Artificial Intelligence</source>
          , pages
          <fpage>328</fpage>
          -
          <lpage>334</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Dayne</given-names>
            <surname>Freitag</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Information extraction from html: Application of a general machine learning approach</article-title>
          .
          <source>In Proceedings of the Fifteenth National Conference on Artificial Intelligence</source>
          , pages
          <fpage>517</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Raymond</given-names>
            <surname>Kosala</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hendrik</given-names>
            <surname>Blockeel</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Web mining research: A survey</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <volume>2</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Cafarella</surname>
          </string-name>
          , Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Yates</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Unsupervised named-entity extraction from the web: An experimental study</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>165</volume>
          :
          <fpage>91</fpage>
          -
          <lpage>134</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Satoshi</given-names>
            <surname>Sekine</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>On-demand information extraction</article-title>
          .
          <source>In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions</source>
          , pages
          <fpage>731</fpage>
          -
          <lpage>738</lpage>
          , Sydney, Australia, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Kedar</given-names>
            <surname>Bellare</surname>
          </string-name>
          , Partha Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dredze</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Lightly-supervised attribute extraction for web search</article-title>
          .
          <source>In Proceedings of NIPS 2007 Workshop on Machine Learning for Web Search.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Xiwen Cheng, Peter Adolphs, Feiyu Xu,
          <string-name>
            <given-names>Hans</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hong</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Gossip galore - a self-learning agent for exchanging pop trivia</article-title>
          .
          <source>In Proceedings of the Demonstrations Session at EACL 2009</source>
          , pages
          <fpage>13</fpage>
          -
          <lpage>16</lpage>
          , Athens, Greece, April. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Hong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Feiyu</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hans</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>A seed-driven bottom-up machine learning framework for extracting relations of various complexity</article-title>
          .
          <source>In Proceedings of ACL 2007, 45th Annual Meeting of the Association for Computational Linguistics</source>
          , Prague, Czech Republic.
        </mixed-citation>
        <mixed-citation>
          12.
          <string-name>
            <given-names>Keigo</given-names>
            <surname>Watanabe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Danushka</given-names>
            <surname>Bollegala</surname>
          </string-name>
          .
          <article-title>MIVTU: A Two-Step Approach to Extracting Attributes for People on the Web</article-title>
          .
          <source>In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference</source>
          , April 20th-24th,
          <year>2009</year>
          , Madrid, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Xianpei</given-names>
            <surname>Han</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>CASIANED: People Attribute Extraction based on Information Extraction</article-title>
          .
          <source>In: 2nd Web People Search Evaluation Workshop (WePS 2009), 18th WWW Conference</source>
          , April 20th-24th,
          <year>2009</year>
          , Madrid, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          14.
          <string-name>
            <given-names>György</given-names>
            <surname>Szarvas</surname>
          </string-name>
          , Richárd Farkas, and
          <string-name>
            <given-names>András</given-names>
            <surname>Kocsor</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>A multilingual named entity recognition system using boosting and C4.5 decision tree learning algorithms</article-title>
          .
          <source>DS2006</source>
          , LNAI,
          <volume>4265</volume>
          :
          <fpage>267</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>