<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Models for automatic retrieval of health information on the Web</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana Marilza Pernas</string-name>
          <email>ana.pernas@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonas Bulegon Gassen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Palazzo M. de Oliveira</string-name>
          <email>palazzo@inf.ufrgs.br</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas</institution>
          ,
          <addr-line>RS</addr-line>
          ,
          <country country="BR">Brasil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto de Informática, Universidade Federal do Rio Grande do Sul</institution>
          ,
          <addr-line>Porto Alegre</addr-line>
          ,
          <country country="BR">Brasil</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituto de Informática, Universidade Federal do Rio Grande do Sul</institution>
          ,
          <addr-line>Porto Alegre</addr-line>
          ,
          <country country="BR">Brasil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Evaluating the data available in web pages is necessary in order to suggest adequate content to a specific public; one approach is to develop a mechanism that automatically detects the quality of web page content. Owing to its rapid and non-standardized growth, the content freely available on the Web has reached very large proportions, and managing this mass of data automatically has become extremely difficult. One of the main reasons for this complexity is that no standards are applied in the construction of web pages, standards which would in turn allow standardized access. This article describes the development of models and techniques to locate, standardize and extract the content of web pages associated with health issues. The objective is then to provide appropriate content for evaluating the quality of a web page according to specific metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>Location and extraction models</kwd>
        <kwd>standardization</kwd>
        <kwd>quality metrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>One of the biggest challenges in Web systems is to deal automatically
with the large amount of data available on the Web, seen as a
large distributed system. As a consequence of its rapid,
non-standardized growth, the Web has reached a size that
is extremely difficult to control efficiently, which makes the
automatic management of its data very complex.</p>
      <p>Automatic management of web data is especially desirable when
the goal is to evaluate the quality of a given web page before
recommending it to a specific public. For health information,
the validity of the content must be checked carefully
before any recommendation is made.
Checking all the health information on the Web for
reliability is very difficult, but a number of indicators
present in a web page can be evaluated to try to guarantee a
minimum of quality. Examples of such indicators, or metrics,
are: the sponsors, if any, and who they are; the subject of the page; the
ways of contacting those responsible for the page; the
readability of the page; and whether the content is continuously updated
(freshness).</p>
      <p>
        To this end, some organizations grant quality certificates, in the
form of quality seals, to attest the
quality of the information presented on a website. This is the case of
Web Mèdica Acreditada<sup>1</sup> (WMA), an organization of the
Medical Association of Barcelona that offers “a quality programme
addressed to medical websites. Through a voluntarily
certification process, websites that follow that programme meet a set of
quality criteria, making possible a trustworthy virtual
community on the Internet for general public, patients and health
professionals” [1]. Other organizations that grant
quality seals to websites are the Internet Content Rating Association
(ICRA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and the Internet Quality Agency (IQUA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Although such organizations certify the quality of
health information on the Web upon formal request, a better
alternative would be to complement this practice with an
automated analysis of the content presented in health web pages.
Given the current size of the Web, however, this task is
complex. As an example, Yahoo! reports on the indexing of the
Web that “the production of the
Web Map, an index of the World Wide Web that is a critical
component of search {takes} (75 hours elapsed time, 500
terabytes of MapReduce intermediate data, 300 terabytes total
output)” using larger clusters of 3500 nodes [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. A post
published in July 2008 on the Google Blog estimated that the Web
already exceeded 1 trillion URLs (Uniform
Resource Locators) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>To enable the automatic retrieval of data from health web pages,
this article presents models and techniques that
support: (i) location of health web data, (ii) standardization and
(iii) extraction of this data, based on pre-established criteria.
Because dealing with the entire Web is too complex, the
location, standardization and extraction models presented here
are applied to specific search engines.</p>
      <p>The remainder of this article is structured as follows. Section 2
reviews related work. Section 3 gives the general
view of this work, showing our starting point and the data that are
relevant for estimating the quality of a web page. Section 4
addresses the automatic location of health web pages
and explains its general model. Section 5 presents the
automatic extraction of data from the located web pages, together with
problems related to the lack of standardization. Section 6
shows a case study developed on pages related to Alzheimer's
disease. Finally, Section 7 presents the conclusions and future
work.</p>
      <sec id="sec-1-1">
        <title>1 http://wma.comb.es/</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORKS</title>
      <p>
        One project concerned with the quality of content
on health web pages is the QUATRO Project (The Quality
Assurance and Content Description Project) and its successor,
QUATRO+ [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Its objective is to offer a
common machine-readable vocabulary for certifying the quality of health
information on the Web. One result of the QUATRO
project is a vocabulary that lists
descriptors and their definitions, to be used as a basis for creating
quality seals. The participants of the QUATRO project include
the aforementioned Web Mèdica Acreditada (WMA).
An older work that is still worth citing for its vision is the
so-called “Oh, yeah?” button, proposed by Tim Berners-Lee in
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In this proposal, the author stresses how important it is for the
Web to tell users something about the information being
presented. When the user is not confident in the content of a web page,
the “Oh, yeah?” button would show a
number of reasons to trust that information. In practice,
the button would access meta-information about the
content and show it to the user.
      </p>
      <p>
        Regarding information extraction, the AQUA project (Assisting
Quality Assessment) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposes to automate parts of the work
done manually by organizations that offer quality labels to
websites, making this process easier and, consequently,
increasing the number of sites with quality seals. The objective is to
crawl the Web to locate unlabelled health web resources,
suggest machine-readable labels for them according to predefined
labeling criteria, and monitor them. To do so, the crawling
mechanism uses the Google<sup>2</sup> and Yahoo!<sup>3</sup> search engines to perform a
meta-search on the Web, collecting and filtering the resulting URLs
(ignoring sub-paths of URLs already in the list and removing URLs
that already have a content label). This work is
very similar to ours, but it does not mention data related to authorship
or to inLinks and outLinks.
Another related work is described in [8], which aims to extract a
number of quality indicators defined by organizations such as
HONCode in order to establish the quality of the content of a health website.
That work tries to detect measurable indicators in the evaluated
website by searching for this information in HTML tags and
meta tags. In our view, however, such a simple analysis alone cannot
cover a large set of websites.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. GENERAL MODEL</title>
      <p>The general model presented in this work covers the
critical points that must be addressed to reach the final goal: evaluating the
quality of a health web page. In general terms, the following steps
must be automated accurately:
</p>
      <p>
        Localization – answers the question: given an item X
stored in some dynamic set of nodes in the system, how can it be
found? [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>2 http://www.google.com</title>
      </sec>
      <sec id="sec-3-2">
        <title>3 http://yahoo.com</title>
        <p>Standardization – concerns existing standards for the
presentation of data in web pages, which may indicate ways to
achieve automatic access to (automatic extraction of) this
data.</p>
        <p>Extraction – once the web page is located, how can its data be obtained
and properly managed for evaluation?</p>
        <p>Quality – the analysis of the collected data, in which metrics
are applied to define the level of quality of the
evaluated web page according to a specific user profile.</p>
        <p>
          As mentioned in Section 1, this work addresses only the phases of
Location, Standardization and Extraction. The
starting point for developing the models was the research described in
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which analyzed the main items to be collected from a web page
in order to determine its quality. That analysis resulted
in an ontology of quality, which was reduced to the model
depicted in Figure 1 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The reduction was made to simplify
the task of automatic retrieval, because obtaining this core data
about a health web page is already enough to produce a first
quality estimate.
        </p>
        <p>As Figure 1 shows, information about the web page itself is
necessary to define its quality. Information about the
author is very important: if the content was written or
revised by a specialist in the subject, its quality is likely to be
higher. It is also important to determine whether the web page is
sponsored by some organization and whether that organization is a recognized one,
for example a governmental organization, a university or an
industry.</p>
        <p>The E-R (Entity-Relationship) model of Figure 1 describes
the main classes for determining page quality: (i) the page
itself, with its title, language, authorship,
references and dates of creation and update; (ii) the web page author,
with contact data and expertise in the subject
covered by the page; (iii) the organization that sponsors the web
page (if any); (iv) the links from other pages or sites on the
Web that point to this page (inLinks); and (v) the links
from this page to other web pages (outLinks).</p>
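        <p>The classes of this model can be sketched as plain data structures. The following Python sketch is illustrative only: the class and field names are assumptions derived from the description above, not the actual schema of Figure 1, and the dates are kept as optional strings so that missing values can be recorded.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Author:
    name: str
    email: Optional[str] = None       # contact data
    expertise: Optional[str] = None   # expertise in the subject of the page


@dataclass
class Organization:
    name: str
    kind: Optional[str] = None        # e.g. "government", "university", "industry"


@dataclass
class WebPage:
    url: str
    title: Optional[str] = None
    language: Optional[str] = None
    created: Optional[str] = None     # date of creation, if it can be retrieved
    updated: Optional[str] = None     # date of last update, if it can be retrieved
    references: List[str] = field(default_factory=list)
    authors: List[Author] = field(default_factory=list)
    sponsor: Optional[Organization] = None
    in_links: List[str] = field(default_factory=list)   # links pointing to this page
    out_links: List[str] = field(default_factory=list)  # links from this page to others

    def missing_fields(self) -> List[str]:
        """Names of the scalar attributes that could not be retrieved."""
        names = ["title", "language", "created", "updated"]
        return [n for n in names if getattr(self, n) is None]
```
]]></preformat>
        <p>A record such as WebPage(url="http://example.org/alzheimer", language="en") then reports through missing_fields() which attributes are still to be collected; the URL is a placeholder.</p>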
        <p>While developing the techniques to collect the data of the
model in Figure 1, we found it difficult to obtain
the dates of creation and update of web pages. These dates are very
important in determining the quality of a web page, because they
express the freshness of its content. We observed that, in
general, this information is either absent or
presented in a format that is not machine readable, such as an image.
Thus, the data that could be extracted automatically
in this work relates to the site language and its
inLinks and outLinks. Regarding authorship, it was possible to
determine the author's expertise and contact (e-mail address).</p>
        <p>The following sections describe the models and techniques
developed to locate web pages automatically and to extract from them
the data described in the model of
Figure 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. AUTOMATIC LOCATION</title>
      <p>
        One of the main points in the operation of the system is the
automatic location of the health web pages to be evaluated and
extracted. The results must be returned to the evaluation
process in order of relevance, i.e. according to how well they satisfy
the search criteria [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. For more details on the quality criteria,
we refer the reader to
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>Given the estimated size of the Web, finding the page that best
matches the user's intention is not a trivial task. It requires a
sophisticated algorithm combined with massive computational
power. For these reasons, this work
uses search engines specifically oriented to the retrieval of
health pages. In this category of search engine, the search is
performed on a database of previously indexed and constantly
updated pages. Examples of such specialized search engines are
SearchMedica<sup>4</sup> and MedStory<sup>5</sup>, which are specific to the medical
field. The next section presents the model and prototype
developed to locate health web pages, without focusing on
specific search engines, since that choice is application dependent.
The specific configuration of this model for a practical
application is presented in Section 6, with a case study on
Alzheimer's disease.</p>
    </sec>
    <sec id="sec-5">
      <title>4.1 Location Model</title>
      <p>
        The model presented in Figure 2 is based on the interaction
among the application, a database and the Internet. It is
application dependent because it relies on the pattern in which
the search engine used by the application presents its results, i.e.,
the internal standard used to display the results to the user.
Initially, the system requires the data about the search engine
(its result pattern). Then the criteria to be used by the
search engine are retrieved from the database and, for each
criterion, the first URLs returned by the search
engine are stored. For more details about this model see [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The database
stores the identification of each search engine used, as well as
its result model. This is necessary because different engines return
their results in different forms. The system can thus be extended
by adding new search engine patterns to the database or removing them
from it. A prototype was developed to retrieve and
index the returned results and to store them in the database
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <sec id="sec-5-1">
        <title>4 http://www.searchmedica.com</title>
      </sec>
      <sec id="sec-5-2">
        <title>5 http://www.medstory.com</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. DATA EXTRACTION</title>
      <p>After the URLs have been retrieved from the search engines in
the location step, the system starts the next task: extracting
the content that will be used to evaluate the quality of the web
page. Since this task is intended to be automatic, a
search is performed for the specific attributes that are relevant to
determining page quality; these attributes are shown in
Figure 1. The initial assumption was that a standardization
scheme could be found that covers most of the
data present in web pages, making
data extraction easier and faster. However, as detailed in
Subsection 5.1, this approach proved unfeasible
because there are still no standard
models for naming tags or for placing data in pages.
Specific strategies were therefore proposed to meet the
requirements for finding each item of the
E-R model (Figure 1): site language, authorship data, inLinks
and outLinks.</p>
      <p>The next sections present each of these strategies, starting
with the difficulties found in, and the strategies adopted for, the
standardization of web page content and structure.</p>
    </sec>
    <sec id="sec-7">
      <title>5.1 Standardization</title>
      <p>
        The standardization considered here aims to define the
ways in which the content of a web page can be structured.
This study is later used as an information source for the
extraction phase. The structure and content of search engines, as
well as of web pages, are extremely dynamic. To
cope with these constant changes, the standardization model adopted
here uses terms defined in WordNet [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to find
synonyms for the keywords found in web pages. Examples of
synonyms that WordNet provides for the term "author"
are "writer", "generator" and "source".
      </p>
      <p>
        Other interesting approaches can also be applied as a
standardization strategy. One example is the use of
ontologies [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], which can support decisions about
which fields must be analyzed; these ontologies could also
be enriched with WordNet terms. WordNet terms are
useful for finding synonyms of the fields that could
appear in the pages and be used in the extraction. Unfortunately,
there is no guarantee that the website developer has used a
synonymous term to define the fields (such as author or writer for the
author's name field), so it is important to consider alternative
strategies.
      </p>
      <p>
        In this work, the recognition of HTML tags in web pages
received special attention. Techniques for this are applied in
areas such as Information Storage and Retrieval to identify the
relevant topic of a page, since that definition is very important to the
information filtering phase [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Some HTML tags
can by themselves provide good information for the definition of, or search for,
patterns; for example, the tags &lt;h1&gt;, &lt;h2&gt;, &lt;h3&gt; and &lt;h4&gt;
could indicate field patterns such as title, author's name and contact.
      </p>
    </sec>
    <sec id="sec-8">
      <title>5.2 Extracting the Site Language</title>
      <p>
        The objective of this step is to find out the language in which the text of
the web page is written. The system supplies the
URL of the health web page to be analyzed. The first
step is to search for meta tags that indicate the language of
the web page, as defined in the Dublin Core pattern: the language
meta tag (1) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>&lt;meta name="language" content="english"/&gt; (1)</p>
      <p>
        If the system does not find the meta tag, a
deeper analysis of the page content is necessary. In
this case, programs that analyze the text in order to detect
the language can be used. Examples of analyzers used in the
application are Fuzzums<sup>6</sup>, Applied Language<sup>7</sup> and Google
Language Detection<sup>8</sup>. After the page content is obtained, subsets of
the text, as well as the keyword used in the search, are extracted
from it. In most of the cases tested, the analysis
made by two of these programs was enough to detect the web
page language. When the two programs
disagreed, one extra program was used to reach a conclusion [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>6 http://www.fuzzums.nl/~joost/talenknobbel/index.php</p>
      <p>7 http://www.appliedlanguage.com</p>
      <p>8 http://www.google.com/uds/samples/language/detect.html</p>
    </sec>
    <sec id="sec-9">
      <title>5.3 Extraction of InLinks and OutLinks</title>
      <p>Given a website X, its inLinks are all the links on the
Web that point to X, and its outLinks are the links of website X that
point to sites hosted in other domains. The inLinks and
outLinks of each website are stored for later analysis.
InLinks are used by some search engines to build a
ranking of the results: websites that have many inLinks receive
a higher score and consequently appear in a better position in
the ranking.</p>
      <p>
        In this work, inLinks are recovered as follows [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
      </p>
      <p>A request is submitted to Google Web Search:
http://www.google.com.br/search?q=link%3A+&lt;url_target_page&gt;;</p>
      <p>After the results page is retrieved, the XPath expression //cite
is executed for each of the listed pages;</p>
      <p>All lists of URLs obtained in each page are stored.</p>
      <p>
        The steps to recover outLinks are [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
      </p>
      <p>Retrieve the page whose outLinks are wanted;</p>
      <p>Execute the XPath expression //a[not(contains(@href, 'DOMAIN'))]/@href,
where DOMAIN represents the domain of the desired page;</p>
      <p>The expression in the previous step removes the
links that are hosted in the same domain as the
page;</p>
      <p>When counting the number of outLinks, duplicates are
removed and links that point to the same domain are
counted as only one entry.
</p>
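      <p>The outLinks rule (keep only links whose target is hosted in another domain, and count each target domain once) can be sketched as follows. This is a minimal illustration, not the prototype's code: it uses Python's standard html.parser in place of an XPath engine, and the function name out_links and the choice to compare full host names are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def out_links(html, page_url):
    """Return one absolute URL per external domain, in first-seen order."""
    own_domain = urlparse(page_url).netloc.lower()
    collector = _AnchorCollector()
    collector.feed(html)

    seen = set()
    result = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)   # resolve relative links
        parts = urlparse(absolute)
        domain = parts.netloc.lower()
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar targets
        if domain == own_domain or domain in seen:
            continue  # internal link, or this domain was already counted
        seen.add(domain)
        result.append(absolute)
    return result
```
]]></preformat>
      <p>Comparing parsed host names instead of testing whether the href contains the domain string avoids discarding external URLs that merely mention the page's domain in their path or query.</p>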
    </sec>
    <sec id="sec-10">
      <title>5.4 Extraction of Author’s Information</title>
      <p>
        Data about the website author is important for determining the
quality of the presented content. If the author is considered a
specialist in the subject, the web page will have more credibility
than others written by people not recognized as such.
However, this kind of information is not easy to determine
manually, and treating the data automatically makes it
even more complex. Some techniques can be
applied to help analyze data about the author, such as
the h-index [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which is based on the number of the author's
publications and citations. Since this is a complex task that no
individual technique solves, this work
applies a solution described in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], in which techniques for extracting data
about page authorship are employed. The authorship
model is defined by combining vocabularies defined in
the Dublin Core pattern [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and in the FOAF (Friend Of A
Friend) ontology [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], for descriptions related to the author's
expertise. Another automatic extraction tool obtains the
author's organization, his or her e-mail address, the
website address, and the author's number of publications and h-index
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
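      <p>For reference, the h-index mentioned above is the largest number h such that the author has h publications with at least h citations each. The short sketch below computes it from a list of citation counts; the function name and the input format are assumptions, and how the counts are obtained is outside its scope.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the first `rank` papers all have >= rank citations
        else:
            break         # counts only decrease from here on
    return h
```
]]></preformat>
      <p>For example, an author whose papers were cited 10, 8, 5, 4 and 3 times has an h-index of 4.</p>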
    </sec>
    <sec id="sec-11">
      <title>6. CASE STUDY</title>
      <p>The models described in the previous sections were developed
as prototypes to evaluate the proposed approach. During the
tests, the search topic was "Alzheimer Disease". The
first step focused on the automatic location of data and
comprised the following tasks:</p>
      <p>Study of the existing search engines, analyzing
each of them in order to identify which ones are
appropriate for searching health data;</p>
      <p>Definition of the criteria to be used in the searches
related to the topic "Alzheimer Disease".</p>
      <p>Section 6.1 below presents details of the search engines.
As for the criteria, since no real population of users was available for
this case study, common criteria for the
topic were used. The applied criteria are described in more detail
in Subsection 6.2.</p>
    </sec>
    <sec id="sec-12">
      <title>6.1 Health and General Search Engines</title>
      <p>
        The queries return two kinds of results: texts appropriate
for the general public, and technical texts with more detailed information
on the topic. During the analysis of the search engines, three of them
were chosen. This was necessary for automating the
location of web pages, because the developed tool must be
prepared for the specific result pattern of each engine. Among the
engines analyzed, the following were adopted [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
      </p>
      <p>Medstory: searches the Web and sorts the results based
on several keywords, indicating which ones occur most
frequently. These healthcare keywords belong to a
predefined static list, grouped into categories such as drugs,
symptoms and procedures. This list, predefined by
the website, allows the results presented to the
user to be refined. After the search, several complementary categories of
data related to the topic are presented to the user. These
categories consist of pre-processed information stored in a
specific database.</p>
      <p>SearchMedica: allows the search scope to be set to
the whole Web or to
sites defined by the user. The search can also be restricted
to a given health area (e.g. cardiology, geriatrics,
dermatology). The results page has an area in which
related information is presented.</p>
      <p>
        Google: searches the whole Web, so it can find anything
from information for laymen to news or scientific work.
However, there is no guarantee about the reliability of
the data presented in the resulting websites, unlike
Medstory, which searches a recognized,
internally indexed source. At the same time, searches that
cover a larger number of pages can find more
relevant results for the user. In these cases, algorithms such as
PageRank™ [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], used by this search engine, try to rank the
results based on reliability.
      </p>
    </sec>
    <sec id="sec-13">
      <title>6.2 Search Criteria</title>
      <p>
        As described in the introduction of this case study, the search
criteria were chosen according to the topic "Alzheimer
Disease", so they are application dependent. An analysis
should therefore be carried out to choose good criteria that yield more
coherent results. This is a matter of business analysis [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] rather than
of development: the analysis should preferably be performed by an
expert in the area, in order to provide criteria close to those that
laymen and experts would use in their searches. The keywords
were separated into categories, listed below [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
      </p>
      <p>“General information” - introductory information.
Keywords: alzheimer and alzheimer introduction.</p>
      <p>“Treatment methods” - has two goals: for practitioners
interested in new treatment methods, and for laymen
interested in learning about treatment. Keyword:
alzheimer's treatment.</p>
      <p>“Diagnosis” - provides information about the first
symptoms of the disease. Keywords: alzheimer's
diagnostic and alzheimer's symptoms.</p>
      <p>“Drugs” - provides information about drug
interactions, allergic reactions and efficiency. Keywords:
alzheimer drugs interactions and alzheimer drug treatment.</p>
      <p>“Case study” - focuses on expert users, who may be
searching for information on which to base their research about a
disease, for example. Keyword: alzheimer case study.</p>
      <p>“Tips” - answers questions such as: how to
deal with a patient with Alzheimer's; how one can
help; what the risk factors of the disease are. Keyword:
alzheimer's practical tips.
      </p>
      <p>These search criteria allow the automatic location of websites:
searches are executed on the search engines described in
Subsection 6.1 and the first 10 URLs returned by
each search engine are stored. Once the URLs have been located and stored, the
system can proceed to the next step, data extraction.</p>
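      <p>The location step of the case study can be outlined as a loop over search engines, categories and keywords that keeps the first 10 URLs of each search. In the sketch below the categories and keywords come from the list above, whereas the engine names, the query URL templates and the in-memory store are hypothetical placeholders; a real system would use the result pattern of each engine and a database.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from urllib.parse import quote_plus

# Categories and keywords as listed in the case study.
CRITERIA = {
    "General information": ["alzheimer", "alzheimer introduction"],
    "Treatment methods": ["alzheimer's treatment"],
    "Diagnosis": ["alzheimer's diagnostic", "alzheimer's symptoms"],
    "Drugs": ["alzheimer drugs interactions", "alzheimer drug treatment"],
    "Case study": ["alzheimer case study"],
    "Tips": ["alzheimer's practical tips"],
}

# Hypothetical query templates; real engines define their own URL patterns.
ENGINES = {
    "engine_a": "http://engine-a.example/search?q={query}",
    "engine_b": "http://engine-b.example/find?term={query}",
}

MAX_URLS = 10  # the case study keeps the first 10 URLs per search


def build_queries(criteria, engines):
    """Yield (engine, category, keyword, query_url) for every combination."""
    for engine, template in engines.items():
        for category, keywords in criteria.items():
            for keyword in keywords:
                url = template.format(query=quote_plus(keyword))
                yield engine, category, keyword, url


def store_results(store, engine, keyword, urls):
    """Keep only the first MAX_URLS result URLs for one engine and keyword."""
    store[(engine, keyword)] = list(urls)[:MAX_URLS]
    return store
```
]]></preformat>
      <p>Fetching each query URL and parsing the result list is deliberately left out, since it depends on the result pattern of the chosen engine.</p>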
    </sec>
    <sec id="sec-14">
      <title>6.3 Extraction</title>
      <p>
        This step extracts the following attributes: page
language, authorship data, inLinks and outLinks. Each of these
attributes was described in the previous sections. To
develop the proposed solution as a prototype, two tools were
used to support data retrieval from the websites: Web-Harvest<sup>9</sup>
and XPather [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
      </p>
      <p>Web-Harvest provides an API (Application Programming
Interface) that makes it possible to query Web servers, retrieve the
HTML page, convert it into an XHTML (Extensible Hypertext
Markup Language) file and apply XML (Extensible Markup Language)
manipulation techniques
such as XPath, XQuery and XSLT (Extensible
Stylesheet Language Transformations) to extract the desired
information;</p>
      <p>XPather allows searching for patterns in websites and is
used in a semi-automatic manner.</p>
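      <p>As an illustration of the first attribute, the language check of Section 5.2 (language meta tag first, then two analyzers with a third as tie-breaker) can be sketched as below. This is an assumption-laden outline rather than the prototype: it uses Python's standard html.parser instead of Web-Harvest, and the detectors are hypothetical callables standing in for the external language analyzers, each taking the page text and returning a language name.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from html.parser import HTMLParser


class _LanguageMeta(HTMLParser):
    """Records the content of the first <meta name="language"> element."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.language is not None:
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name == "language" and attributes.get("content"):
            self.language = attributes["content"].strip().lower()


def detect_language(html, detectors):
    """Meta tag first; otherwise two detectors, with a third as tie-breaker."""
    parser = _LanguageMeta()
    parser.feed(html)
    if parser.language:
        return parser.language

    votes = [detect(html) for detect in detectors[:2]]
    if len(votes) == 2 and votes[0] != votes[1] and len(detectors) > 2:
        votes.append(detectors[2](html))   # extra program settles the disagreement
    if not votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    # Without a majority (all answers differ) no language is reported.
    return winner if count > 1 or len(votes) == 1 else None
```
]]></preformat>
      <p>The whole page is passed to the detectors here for brevity; Section 5.2 describes sending subsets of the text instead.</p>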
    </sec>
    <sec id="sec-15">
      <title>7. CONCLUSIONS AND FUTURE WORK</title>
      <p>This article described the development of models, techniques
and prototypes for the automatic retrieval of content
from web pages on health subjects. After the automatic
location, standardization and extraction of page data, the
objective is to deliver normalized data for evaluating the
quality of the health web page. Developing
the model for automatic data retrieval involved anticipating a
series of challenges, and the process was
directed at working around these problems. For location,
well-known search engines were used, as they rely on efficient algorithms
for Web search. The search for patterns in the presentation of data in
web pages led to the conclusion that, although
standards such as Dublin Core already exist, in practice they
are not applied, or are only partially applied, by the vast majority of pages.
Some kind of
standardization in the development of these pages is therefore clearly needed. As
future work, the process of
evaluating the quality of a health web page must be completed by automating the
quality evaluation task itself.</p>
      <sec id="sec-15-1">
        <title>9 http://web-harvest.sourceforge.net/</title>
      </sec>
    </sec>
    <sec id="sec-16">
      <title>8. ACKNOWLEDGMENTS</title>
      <p>This work was partially supported by Conselho Nacional de
Desenvolvimento Científico e Tecnológico - CNPq, Brazil,
Edital Universal - MCT/CNPQ - 14/2010. We would also like to
thank the students of the course CMP112,
PPGC/UFRGS, who developed the initial work on this subject.</p>
    </sec>
    <sec id="sec-17">
      <title>9. REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Web Medica Acreditada - WMA. Retrieved March 21,
          <year>2011</year>
          , from: &lt;http://wma.comb.es/&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Internet Content Rating Association - ICRA. Retrieved March 21,
          <year>2011</year>
          , from: &lt;http://www.fosi.org/icra/&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Internet Quality Agency - IQUA. Retrieved March 21,
          <year>2011</year>
          , from: &lt;http://www.iqua.net/&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Alpert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hajaj</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <article-title>We knew the Web was big...</article-title>
          <source>The Official Google Blog</source>
          . Retrieved March 20,
          <year>2011</year>
          , from: &lt;http://googleblog.blogspot.com/2008/07/we-knew-Webwas-big.html&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] The Quality Assurance and Content Description Project - QUATRO+. Retrieved March 28,
          <year>2011</year>
          , from: &lt;http://legacy.quatro-project.org/&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Cleaning up the User Interface</article-title>
          , Section - The "Oh, yeah?"-Button. Retrieved March 28,
          <year>2011</year>
          , from: &lt;http://www.w3.org/DesignIssues/UI.html&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Stamatakis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandrinos</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karkaletsis</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mayer</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzales</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pöllä</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>AQUA, a system assisting labeling experts assess health Web resources</article-title>
          .
          <source>In Proceedings of the Symposium on Health Information Management Research - ISHIMR 2007</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Automatic detecting indicators for quality of health information on the Web</article-title>
          ,
          <source>International Journal of Medical Informatics</source>
          ,
          <volume>76</volume>
          (
          <issue>8</issue>
          ),
          <fpage>575</fpage>
          -
          <lpage>582</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Balakrishnan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaashoek</surname>
            ,
            <given-names>M. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morris</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Looking up data in P2P systems</article-title>
          .
          <source>In Communications of the ACM</source>
          . DOI=http://doi.acm.org/10.1145/606272.606299.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Lichtnow</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jouris</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordignon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pernas</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levin</surname>
            ,
            <given-names>F. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nascimento</surname>
            ,
            <given-names>G. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silva</surname>
            ,
            <given-names>I. C. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gasparini</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teixeira</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rossi</surname>
            ,
            <given-names>L. H. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>O. E. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schreiner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomes</surname>
            ,
            <given-names>S. R. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>J. P. M. de</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Relato e Considerações sobre o Desenvolvimento de uma Ontologia para Avaliação de Sites da Área de Saúde</article-title>
          .
          <source>Cadernos de Informática (UFRGS)</source>
          , v. 4
          , p.
          <fpage>7</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Pernas</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palazzo</surname>
            ,
            <given-names>J. M. de O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donassolo</surname>
            ,
            <given-names>B.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bezerra</surname>
            ,
            <given-names>C.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manica</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalil</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>L.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Svoboda</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mesquita</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torres</surname>
            ,
            <given-names>P.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petry</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leithardt</surname>
            ,
            <given-names>V.R.Q.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Relato sobre o Desenvolvimento de Modelos para Obtenção Automática do Conteúdo de Sites sobre Saúde</article-title>
          .
          <source>Cadernos de Informática (UFRGS)</source>
          , v. 4
          , p.
          <fpage>47</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <article-title>WordNet: a lexical database for English</article-title>
          . Princeton University. Retrieved March 20,
          <year>2011</year>
          , from: &lt;http://wordnet.princeton.edu/wordnet/download/&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Tiun</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdullah</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kong</surname>
            ,
            <given-names>T.E.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Automatic Topic Identification Using Ontology Hierarchy</article-title>
          .
          <source>In Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing - CICLing '01</source>
          . Springer-Verlag, London, UK.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Mining Topic-Specific Concepts and Definitions on the Web</article-title>
          .
          <source>In The 12th International World Wide Web Conference - WWW 2003</source>
          , Budapest, Hungary, May 20-24.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Hirsch</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>An index to quantify an individual's scientific research output</article-title>
          .
          <source>PNAS</source>
          <volume>102</volume>
          (
          <issue>46</issue>
          ),
          <fpage>16569</fpage>
          -
          <lpage>16572</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Lichtnow</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pernas</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manica</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalil</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>J. P. M. de</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leithardt</surname>
            ,
            <given-names>V. R. Q.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Automatic Collection of Authorship Information for Web Publications</article-title>
          .
          <source>In: Proceedings of 6th International Conference on Web Information Systems and Technologies - WEBIST. v. 1</source>
          . p.
          <fpage>339</fpage>
          -
          <lpage>344</lpage>
          . Lisboa: INSTICC. Valencia.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] Dublin Core Metadata Initiative. Retrieved March 22,
          <year>2011</year>
          , from: &lt;http://dublincore.org/&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Brickley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <source>FOAF Vocabulary Specification 0.98</source>
          . Namespace Document 9 August 2010, Marco Polo Edition. Retrieved March 20,
          <year>2011</year>
          , from: &lt;http://xmlns.com/foaf/spec/&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shvachko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Radia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chansler</surname>
          </string-name>
          ,
          <article-title>The Hadoop Distributed File System</article-title>
          ,
          <source>IEEE 26th Symposium on Mass Storage Systems and Technologies</source>
          ,
          <year>2010</year>
          , ISBN: 978-1-4244-7152-2, p.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Page</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>The Anatomy of a Large-Scale Hypertextual Web Search Engine</article-title>
          . In
          <source>Computer Networks and ISDN Systems</source>
          . Elsevier Science Publishers, Amsterdam, The Netherlands.
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2005</year>
          .
          <source>Data Mining: Practical Machine Learning Tools and Techniques</source>
          . 2nd ed. Morgan Kaufmann. 629 p. ISBN-13: 978-0-12-088407-0.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>