Models for Automatic Retrieval of Health Information on the Web

Ana Marilza Pernas 1,2          Jonas Bulegon Gassen 2          José Palazzo M. de Oliveira 2

1 Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, RS, Brasil
2 Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brasil

ana.pernas@inf.ufrgs.br          jbgassen@inf.ufrgs.br          palazzo@inf.ufrgs.br


ABSTRACT
Evaluating the data available in Web pages is necessary to allow the suggestion of adequate content to a specific public. One approach is to develop a mechanism that automatically detects the quality of the content of Web pages. Due to its rapid and non-standardized growth, the content freely available on the Web has reached very large proportions, and it has become extremely difficult to manage this large mass of data automatically. One of the main reasons for this complexity is that no standards are applied in the construction of pages, which would otherwise lead to standardized access. This article describes the development of models and techniques to locate, standardize and extract the content of web pages associated with health issues. The objective is then to provide appropriate content for evaluating the quality of a web page according to specific metrics.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering; Retrieval models; Selection process.
J.3 [Life and Medical Sciences]: Health; Medical information systems.

General Terms
Management, Design, Experimentation, Standardization, Verification.

Keywords
Location and extraction models, standardization, quality metrics.

1. INTRODUCTION
One of the biggest challenges in Web systems is to deal automatically with the large amount of data available on the Web as a large distributed system. As a consequence of its rapid, non-standardized growth, the Web has reached such huge proportions that it became extremely difficult to control efficiently, making the automatic management of its data very complex.

Automatic web-data management is especially desired when the subject is the quality evaluation of a given web page for its subsequent recommendation to a specific public. In the case of health information content, a careful check of the validity of the information is necessary before an effective recommendation. We know that checking all health information on the Web to determine whether it is reliable is very difficult, but we can evaluate a number of indicators present in a web page to try to guarantee a minimum level of quality. Examples of indicators, or metrics, could be: the sponsors, whether they exist and who they are; the subject of the page; the ways of contacting the party responsible for the web page; the readability of the web page; and whether the content is continuously updated (freshness).

To achieve this objective, there are organizations that grant certificates of quality, in the form of quality seals, to attest to the quality of the information presented on a website. That is the case of Web Mèdica Acreditada (WMA, http://wma.comb.es/), an organization of the Medical Association of Barcelona that offers "a quality programme addressed to medical websites. Through a voluntarily certification process, websites that follow that programme meet a set of quality criteria, making possible a trustworthy virtual community on the Internet for general public, patients and health professionals" [1]. Other examples of organizations which grant quality seals to websites are the Internet Content Rating Association (ICRA) [2] and the Internet Quality Agency (IQUA) [3].

Even with this kind of organization ensuring, upon formal request, the quality of health information on the Web, the best alternative would be to orchestrate this practice by performing an automated analysis of the content presented in health web pages. However, given the current size of the Web, this task would be really complex and hard. As an example, regarding the indexing of Web content at Yahoo!, "the production of the Web Map, an index of the World Wide Web that is a critical component of search {takes} (75 hours elapsed time, 500 terabytes of MapReduce intermediate data, 300 terabytes total output)" using large clusters of 3500 nodes [19]. In a post published in July 2008 on the Google Blog, the size of the Web was evaluated as already exceeding 1 trillion URLs (Uniform Resource Locators) [4].

To enable the automatic retrieval of data from web pages about health, some models and techniques are presented here to support: (i) the location of health web-data, (ii) its standardization, and (iii) the extraction of this data, based on pre-established criteria. Conscious of the complexity of dealing with the entire Web, the location, standardization and extraction models presented here are applied to specific search engines.
The remainder of this article is structured as follows. Section 2 explores related work. Section 3 describes the general vision of this work, showing our starting point and the data relevant to estimating the quality of a web page. Section 4 addresses the task of automatically locating web pages about health, explaining its general model. Section 5 presents the task of automatically extracting data from the located web pages, discussing problems related to the lack of standardization. Section 6 shows a case study developed on pages related to Alzheimer's disease. Finally, section 7 presents the conclusions and future work.

2. RELATED WORK
An example of a project related to the analysis of the quality of content on health web pages is the QUATRO Project (The Quality Assurance and Content Description Project) and its successor, QUATRO+ [5]. The objective of this project is to offer a common machine-readable vocabulary to certify the quality of health information on the Web. One of the results of the QUATRO project is a vocabulary that presents a list of descriptors and their definitions to be used as a basis for the creation of quality seals. Among the participants of the QUATRO project is the aforementioned Web Mèdica Acreditada (WMA).

An older work, but one that can still be cited for its vision, is the so-called "Oh, yeah?" button proposed by Tim Berners-Lee in [6]. In this proposition, the author mentions how important it is for the Web to tell users something about the information being presented. When the user is not confident about the content of a web page, the "Oh, yeah?" button would be responsible for showing a number of reasons to trust that information. In a practical view, the button would access some meta-information about the content and show it to the user.

Related to information extraction, the AQUA project (Assisting Quality Assessment) [7] proposes to automate parts of the work done manually by the organizations that offer quality labels to websites, making this process easier and, consequently, increasing the number of sites with quality seals. The objective is to crawl the Web to locate unlabelled health web resources, suggest machine-readable labels for them according to predefined labeling criteria, and monitor them. To do that, the crawling mechanism uses the Google (http://www.google.com) and Yahoo! (http://yahoo.com) search engines as a meta-search engine over the Web, collecting and filtering the resulting URLs (ignoring sub-paths of URLs already in the list and removing URLs that already have a content label). This work is very similar to ours, but data related to authorship information, inLinks and outLinks are not mentioned.

Another related work is described in [8], which aims to extract a number of quality indicators defined by organizations such as HONcode to establish the quality of the content of a health website. That work tries to detect measurable indicators in the evaluated website, searching for this information in HTML tags and meta-tags. However, in our view, this simple analysis alone cannot cover a large set of websites.

3. GENERAL MODEL
The general model presented in this work intends to cover the critical points considered necessary to achieve the final goal: evaluating the quality of a Web page about health. In a general view, it is necessary to accurately automate the following steps:

• Localization – answers the question: given an item X stored in some dynamic set of nodes in the system, how to find it? [9].
• Standardization – relates to existing standards in the presentation of data in web pages, which may indicate ways to achieve automatic access to (automatic extraction of) this data.
• Extraction – once the web page is located, how to obtain and properly manage its data so that it can be evaluated?
• Quality – the analysis of the collected data, in which metrics are applied to define the level of quality present in the evaluated web page, according to a specific user profile.

As mentioned in section 1, in this work only the Location, Standardization and Extraction phases are treated. The starting point for developing the models was the research described in [10], in which the main points to be collected from a Web page were analyzed in order to determine its quality. This analysis resulted in an ontology of quality, which was reduced to the model depicted in Figure 1 [11]. This reduction was made to simplify the task of automatic retrieval, because by obtaining this main data about a health web page it is already possible to produce a first quality estimate.

As we can see in Figure 1, information about the web page itself is obviously necessary to define its quality. Information about the author is very important because, if the content was written or revised by a specialist in the subject, the quality will possibly be higher. It is also important to determine whether the web page is sponsored by some organization and whether it is a recognized organization, for example a governmental organization, a university or an industry.

The E-R (Entity-Relationship) model of Figure 1 describes the main classes for determining page quality: (i) the page itself, with data related to its title, language, authorship, references and dates of creation and update; (ii) the web page author, with data related to contact and expertise in the subject covered by the web page; (iii) the organization that sponsors the web page (if it exists); (iv) the links of other pages or sites on the Web pointing to this one (inLinks); and (v) the links from this page to other web pages (outLinks).

Figure 1. E-R model proposed to evaluate the content of health web pages [11].
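As a concrete reading of the entities just listed, the sketch below shows one possible in-memory representation of the model in Figure 1. It is only an illustration of the E-R model described above, not the schema used in the prototype; the class and field names are our own illustrative choices.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Author:
    name: str
    contact: Optional[str] = None     # e.g. e-mail address
    expertise: Optional[str] = None   # expertise in the subject covered by the page

@dataclass
class Organization:
    name: str
    kind: Optional[str] = None        # e.g. government, university, industry

@dataclass
class HealthPage:
    url: str
    title: Optional[str] = None
    language: Optional[str] = None
    created: Optional[str] = None     # creation date (often unavailable in practice)
    updated: Optional[str] = None     # last update date (freshness)
    authors: List[Author] = field(default_factory=list)
    sponsor: Optional[Organization] = None              # sponsoring organization, if any
    in_links: List[str] = field(default_factory=list)   # pages on the Web pointing to this one
    out_links: List[str] = field(default_factory=list)  # pages this one points to
```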
During the development of the techniques to collect the data in the model of Figure 1, we noted the difficulty of obtaining information about the dates of creation and update of web pages. This information is very important in determining the quality of a web page, because it expresses the freshness of the content. We observed that, in general, this happens either because this kind of information is absent or because it is presented in a format that is not machine readable, such as an image. Thus, the set of data that could be automatically extracted in this work is related to the site language and its inLinks and outLinks. Regarding authorship, it was possible to determine the author's expertise and contact (e-mail address).

The following sections describe the models and techniques developed to perform the automatic location and extraction of web pages in order to obtain the set of data described in the model presented in Figure 1.

4. AUTOMATIC LOCATION
One of the main points in the system's operation is the automatic location of the health web pages to be evaluated and extracted. The results achieved must be returned to the evaluation process in order of relevance, i.e. according to how well they satisfy the search criteria [11]. For more details regarding the quality criteria, we recommend reading the work developed in [10].

Given the estimated size of the Web, finding the page most appropriate to the user's intention is not a trivial task. A sophisticated algorithm, combined with massive computational power, is necessary to accomplish it. For these reasons, in this work we chose to use search engines specifically oriented towards the retrieval of health pages. In this category of search engine the search is performed on a database of previously indexed and constantly updated pages. Some of the specialized search engines are SearchMedica (http://www.searchmedica.com) and MedStory (http://www.medstory.com), which are specific to the medical field. The next sections present the model and prototype developed to locate health web pages, without focusing on specific search engines, as this choice is application-dependent. The specific configuration of this model for a practical application is presented in section 6, with a case study about Alzheimer's disease.

4.1 Location Model
The model presented in Figure 2 is based on the interaction among the application, a database and the Internet. It is application-dependent because it needs the existing pattern for presenting the results of the search engine used in the application, i.e., the internal standard used to display the results to the user.

Initially, the data about the search engine (its result pattern) is required by the system. Then, the criteria to be used in the search engine are recovered from the database and, for each criterion, the first URLs returned as answers by the search engine are stored. For more details about this model see [11]. The database stores the identification of the search engines used, as well as their models. This is necessary because different engines have specific return formats. Thus, it is possible to extend the system by adding new search-engine patterns to the database or removing them from it. A prototype was developed for the retrieval and indexing of the returned results, as well as for their storage in the database [11].

Figure 2. Model to locate data in the Web (modified from [11]).
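As a rough illustration of the location step described above, the sketch below keeps a per-engine result pattern (here an XPath expression for result links) and stores the first N URLs returned for each search criterion. The engine entry, its URLs, the XPath expression and the function names are illustrative assumptions, not the configuration used in the prototype.

```python
import requests
from lxml import html

# Hypothetical registry playing the role of the database of search-engine
# patterns described above; the values below are placeholders.
ENGINES = {
    "example-engine": {
        "query_url": "https://search.example.com/?q={query}",
        "result_xpath": "//a[@class='result']/@href",  # engine-specific result pattern
    },
}

def locate(criterion: str, engine: str, top_n: int = 10) -> list:
    """Run one search criterion on one engine and return its first top_n result URLs."""
    cfg = ENGINES[engine]
    response = requests.get(cfg["query_url"].format(query=criterion), timeout=30)
    tree = html.fromstring(response.content)
    return tree.xpath(cfg["result_xpath"])[:top_n]

# Usage sketch: store the top URLs per criterion for later extraction.
# located = {c: locate(c, "example-engine") for c in ["alzheimer treatment"]}
```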
5. DATA EXTRACTION
After recovering the URLs from the search engines in the location step, the system starts the next task, which extracts the content that will be used to evaluate the quality of the web page. As this task was designed to occur in an automatic way, a search is performed for the specific attributes that are relevant to determining page quality; those attributes can be found in Figure 1. The initial assumption was that it would be possible to find a form of standardization that encompasses the largest possible portion of the data existing in web pages, in order to make data extraction easier and faster. However, as presented in more detail in subsection 5.1, it was not possible to use this approach because, to date, there are no standard models for naming tags or placing data in pages. Therefore, specific strategies were proposed, trying to meet the requirements for retrieving each piece of data in the E-R model (Figure 1): site language, authorship data, inLinks and outLinks.

The next sections present each one of those strategies, starting with the problems found and the strategies adopted for the standardization of web page content and structure.

5.1 Standardization
The standardization considered here has the objective of defining the ways in which the content of a web page can be structured. This knowledge is applied later as an information source for the extraction phase. The structure and content of search engines, as well as of web pages, are extremely dynamic. Consequently, to cope with these constant changes, the standardization model adopted here applies terms defined in WordNet [12], trying to find synonyms for the keywords found in web pages. Examples of such WordNet terms are "writer", "generator" and "source".

There are other interesting approaches that can be applied as a standardization strategy. An example is the application of ontologies [13], which can be used to support decisions about which fields must be analyzed; these ontologies could also be enriched with WordNet terms. The use of terms from WordNet is interesting for finding synonyms of the fields that could appear in the pages and be used in the extraction. Unfortunately, there is no guarantee that the website developer has used some synonymous term to define a field (such as author or writer, for the author's name field). So, it is important to think of alternative strategies.
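A minimal sketch of the WordNet-based synonym expansion just described is shown below, using the NLTK interface to WordNet as a stand-in for the lookup used in the prototype; the seed keyword is taken from the examples above.

```python
from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

def field_synonyms(keyword: str) -> set:
    """Collect WordNet synonyms for a field keyword (e.g. 'author')."""
    synonyms = set()
    for synset in wordnet.synsets(keyword):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " "))
    return synonyms

# Usage sketch: expand the field names searched for during extraction.
# field_synonyms("author") is expected to include terms such as "writer".
print(field_synonyms("author"))
```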
In this work, the recognition of HTML tags in web pages received special attention. Techniques for this are applied in areas such as Information Storage and Retrieval to identify the relevant topic of a page, since that definition is very important for the information filtering phase [14]. Some HTML tags can by themselves provide good information for the definition of, and search for, patterns: for example, certain tags can indicate field patterns such as the title, the author's name and contact information.
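As an illustration of this tag-based strategy, the sketch below inspects a parsed page for a few common HTML elements and meta tags that typically carry the title, author name and contact fields. The tags used here (title, meta author, address) are common examples chosen for illustration, not necessarily the exact tags searched for by the prototype.

```python
from lxml import html

def tag_based_fields(page_source: str) -> dict:
    """Extract candidate title/author/contact fields from common HTML tags."""
    tree = html.fromstring(page_source)
    return {
        # <title> usually carries the page title
        "title": (tree.xpath("//title/text()") or [None])[0],
        # <meta name="author" content="..."> is a common, though optional, convention
        "author": (tree.xpath("//meta[@name='author']/@content") or [None])[0],
        # <address> often holds contact information for the page or its author
        "contact": (tree.xpath("//address//text()") or [None])[0],
    }
```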

5.2 Extracting the Site Language
This step aims to find out the language in which the text of the web page was written. In this project the system is given the URL of the health web page that must be analyzed. The first step is to search for meta tags that indicate the language of the web page as defined by the Dublin Core standard (the language meta tag) [11]. If the system does not find such a meta tag, a deeper analysis of the content of the web page is necessary. In this case, programs that analyze the text in order to detect its language can be used. Examples of analyzers used in the application are Fuzzums (http://www.fuzzums.nl/~joost/talenknobbel/index.php), Applied Language (http://www.appliedlanguage.com) and Google Language Detection (http://www.google.com/uds/samples/language/detect.html). After obtaining the page content, subsets of its text, as well as the keyword used in the search, are extracted from the content and submitted to the analyzers. In most of the cases tested, the analysis made by two of those programs was enough to detect the language of the web page. When the two programs disagreed, one extra program was used to reach a conclusion [11].
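A minimal sketch of this two-step strategy is shown below, assuming a Dublin Core style meta tag (DC.language) or a plain lang attribute, with an offline detector standing in for the web-based analyzers named above; running two detectors and breaking ties with a third, as the prototype does, would follow the same pattern.

```python
from lxml import html
from langdetect import detect  # stand-in for the web-based language analyzers

def page_language(page_source: str) -> str:
    """Return the page language from metadata if present, otherwise detect it from the text."""
    tree = html.fromstring(page_source)
    # 1) Look for declared language metadata (Dublin Core meta tag or html lang attribute).
    declared = tree.xpath("//meta[@name='DC.language']/@content | //html/@lang")
    if declared:
        return declared[0]
    # 2) Fall back to statistical detection over the visible text of the page.
    text = " ".join(tree.xpath("//body//text()"))
    return detect(text)
```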
5.3 Extraction of InLinks and OutLinks
Supposing a website X, inLinks are all the links existing on the Web that point to X, while outLinks are the links of website X that point to sites hosted in other domains. The inLinks and outLinks of each website should be stored for later analysis. InLinks are used by some search engines in order to build a ranking of the results: websites that have many inLinks receive a higher score and, consequently, appear in a better position in the ranking. In this work, inLinks are recovered as follows [11]:

• A request is submitted to the Google Web search interface, http://www.google.com.br/search?q=link%3A+ followed by the URL of the page of interest;
• After recovering the results page, for each one of the listed pages the XPath expression //cite is executed;
• All lists of URLs obtained from each page are stored.

For the recovery of outLinks the steps are [11]:

• Recover the page whose outLinks are wanted;
• Execute the XPath expression //a[not(contains(@href, 'DOMAIN'))]/@href, where DOMAIN represents the domain of the desired page;
• Through the expression cited in the previous step, duplicated links that are hosted in the same domain as the page are removed;
• When counting the number of outLinks, duplicates are removed and links that point to the same domain are counted as a single entry.
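A sketch of these two procedures is given below. The inLinks part assumes that the link: results page can still be fetched and parsed with the //cite expression quoted above, which may no longer hold for the current Google interface; the outLinks part applies the XPath filter and the de-duplication rule described in the last two steps.

```python
from urllib.parse import urlsplit, quote
import requests
from lxml import html

def inlinks(url: str) -> list:
    """Recover candidate inLinks via a Google 'link:' query and the //cite expression."""
    query = "http://www.google.com.br/search?q=" + quote("link:" + url)
    tree = html.fromstring(requests.get(query, timeout=30).content)
    return [cite.text_content() for cite in tree.xpath("//cite")]

def outlinks(url: str) -> list:
    """Recover outLinks: hrefs that do not point back into the page's own domain."""
    domain = urlsplit(url).netloc
    tree = html.fromstring(requests.get(url, timeout=30).content)
    hrefs = tree.xpath("//a[not(contains(@href, '%s'))]/@href" % domain)
    # Count links to the same external domain only once (de-duplication rule above).
    seen, unique = set(), []
    for href in hrefs:
        key = urlsplit(href).netloc or href
        if key not in seen:
            seen.add(key)
            unique.append(href)
    return unique
```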
5.4 Extraction of Author's Information
Data about the website author are important for determining the quality of the presented content. If the author is considered a specialist on the subject, the web page will have more credibility than others written by people not recognized as such. However, this kind of information is not easily determined even manually, and in the case of automatic treatment of the data it becomes even more complex. There are some techniques that could be applied to help in analyzing data about the author, as is the case of the h-index [15], which uses the number of the author's publications and citations. Since this is a complex task, in which an individual technique does not solve the problem, in this work we applied the solution described in [16], where techniques for extracting data about page authorship are employed. The authorship model is defined by the combination of vocabularies defined in the Dublin Core standard [17] and in the FOAF (Friend Of A Friend) ontology [18], for descriptions related to the author's expertise. Another tool for automatic extraction obtains the author's organization, his or her electronic mail address, the website address, the author's number of publications and the h-index [16].

6. CASE STUDY
The models described in the previous sections were developed as prototypes to evaluate the proposed approach. During the tests the topic used for the searches was "Alzheimer Disease". The first step was focused on the automatic location of data; the tasks considered were:

• Study of the existing search engines, as well as analysis of each one of them, in order to identify which ones were appropriate for searching data about health;
• Definition of the criteria to be used for the searches related to the topic "Alzheimer Disease".

Section 6.1 below presents details of the search engines. About the criteria, since there was no real population of users with which to develop this case study, we chose to use common criteria for the topic. A more detailed description of the applied criteria is presented in subsection 6.2.

6.1 Health and General Search Engines
The queries have two kinds of results: texts that are appropriate for the general public, and technical texts with more detailed information on the topic. During the search engine analysis, three engines were chosen. That was necessary for automating the location of web pages, because the developed tool had to be prepared for a specific pattern of each engine. Among the analyzed engines, the following were adopted [11]:

• Medstory: Searches the Web and sorts the results based on several keywords, indicating which ones occur more frequently. These keywords, about healthcare, belong to a predefined static list, grouped by categories such as drugs, symptoms, procedures and so on. This list predefined by the website allows refining the results presented to the user. After the search, several complementary categories of data related to the topic are presented to the user. These categories consist of pre-processed information, stored in a specific database.
• SearchMedica: Allows setting up the scope in which the search engine will search: the whole Web or sites defined by the user. Also, the search can be restricted to a specific health area (e.g. cardiology, geriatrics, dermatology, etc.). The results page has an area in which several related pieces of information are presented.
• Google: Allows searching the whole Web and can therefore find anything from information for laymen to news or scientific work. However, there are no guarantees about the reliability of the data presented in the resulting websites, unlike what happens with Medstory, which searches in recognized, internally indexed sources. At the same time, searches that encompass a bigger number of pages can find more relevant results for the user. In these cases, algorithms such as PageRank™ [20], used by this search engine, try to rank the results based on reliability.

6.2 Search Criteria
As described in the introduction of this case study, the search criteria applied were chosen according to the topic "Alzheimer Disease", so they are application-dependent. Thus, an analysis should be done in order to choose good criteria for more coherent results. This issue refers to business analysis [21] and not to a development issue, i.e., an analysis preferably carried out by an expert in the area, in order to provide the criteria closest to those that would be used by laymen and experts in their searches. The keywords were separated into categories; they are listed below [11]:

• "General information" - introductory information. Keywords: alzheimer and alzheimer introduction.
• "Treatment methods" - has two goals: for practitioners, interested in new methods of treatment; and for laymen, interested in knowing about its treatment. Keyword: alzheimer's treatment.
• "Diagnosis" - looks to provide information about the first symptoms of the disease. Keywords: alzheimer's diagnostic and alzheimer's symptoms.
• "Drugs" - intends to provide information about drug interactions, allergic reactions and efficiency. Keywords: alzheimer drugs interactions and alzheimer drug treatment.
• "Case study" - focuses on expert users, who may be searching for information on which to base their research about a disease, for example. Keyword: alzheimer case study.
• "Tips" – has the objective of answering questions such as: how to deal with a patient with Alzheimer's; how it is possible to help; what the risk factors of the disease are. Keyword: alzheimer's practical tips.

These search criteria allow the automatic location of websites by executing searches over the adopted search engines (subsection 6.1), as well as storing the first 10 URLs returned by each search engine. After locating and storing the URLs, the system can go further - data extraction.
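The sketch below ties the criteria above to the location step of section 4.1, storing the first 10 URLs per engine and per keyword. The dictionary simply transcribes the categories and keywords listed above; the locate parameter stands for a helper like the hypothetical one sketched in section 4.1.

```python
# Categories and keywords transcribed from subsection 6.2.
CRITERIA = {
    "General information": ["alzheimer", "alzheimer introduction"],
    "Treatment methods": ["alzheimer's treatment"],
    "Diagnosis": ["alzheimer's diagnostic", "alzheimer's symptoms"],
    "Drugs": ["alzheimer drugs interactions", "alzheimer drug treatment"],
    "Case study": ["alzheimer case study"],
    "Tips": ["alzheimer's practical tips"],
}

def run_case_study(engines: list, locate) -> dict:
    """Store the first 10 URLs returned by each engine for each keyword.

    locate: a function(criterion, engine, top_n) such as the one sketched in section 4.1.
    """
    located = {}
    for category, keywords in CRITERIA.items():
        for keyword in keywords:
            for engine in engines:
                located[(category, keyword, engine)] = locate(keyword, engine, top_n=10)
    return located
```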
6.3 Extraction
This step intends to extract the following attributes: page language, authorship data, inLinks and outLinks. Each one of those attributes was described in the previous sections. In order to develop the proposed solution as a prototype, some tools were used to support the retrieval of data from the websites: Web-Harvest (http://web-harvest.sourceforge.net/) and XPather [11]:

• Web-Harvest provides an API (Application Programming Interface) that allows the system to consult Web servers, recover the HTML page, convert it into an XHTML (Extensible Hypertext Markup Language) file and apply XML (Extensible Markup Language) manipulation techniques, such as XPath, XQuery and XSLT (Extensible Stylesheet Language Transformations), to extract the desired information;
• XPather allows searching for patterns in websites, being used in a semi-automatic manner.

7. CONCLUSIONS AND FUTURE WORK
This article described the development of models, techniques and prototypes aiming at the automatic retrieval of the content presented in web pages on health subjects. After the automatic location, standardization and extraction of page data, the objective is to deliver normalized data on which to perform evaluations of the quality of the associated health web page. The process of developing the model for automatic data retrieval anticipated a series of challenges and steps to overcome, and the process was directed at working around these problems. For location, well-known search engines were applied, as they use quite efficient algorithms for Web search. The search for patterns for presenting data in web pages led to the conclusion that, despite the existence of standards such as Dublin Core, in practice these standards are not strictly applied by the vast majority of pages, or are only partially applied. Thus, some kind of standardization for the development of these pages is clearly needed. As future work, it would be necessary to complete the process of evaluating the quality of a health web page by making the quality evaluation task itself automatic.

8. ACKNOWLEDGMENTS
This work was partially supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico - CNPq, Brazil, Edital Universal - MCT/CNPq - 14/2010. We would also like to thank the group of students of the discipline CMP112, PPGC/UFRGS, who developed the initial work on this subject.

9. REFERENCES
[1] Web Mèdica Acreditada – WMA. Retrieved March 21, 2011, from http://wma.comb.es/.
[2] Internet Content Rating Association – ICRA. Retrieved March 21, 2011.
[3] Internet Quality Agency – IQUA. Retrieved March 21, 2011.
[4] Alpert, J. and Hajaj, N. We knew the Web was big... The Official Google Blog. Retrieved March 20, 2011.
[5] The Quality Assurance and Content Description Project – QUATRO+. Retrieved March 28, 2011.
[6] Berners-Lee, T. 1997. Cleaning up the User Interface, Section—The "Oh, yeah?"-Button. Retrieved March 28, 2011, from http://www.w3.org/DesignIssues/UI.html.
[7] Stamatakis, K., Chandrinos, K., Karkaletsis, V., Mayer, M. A., Gonzales, D. V., Labsky, M., Amigo, E. and Pöllä, M. 2007. AQUA, a system assisting labelling experts assess health Web resources. In Proceedings of the Symposium on Health Information Management Research – ISHIMR 2007.
[8] Wang, Y. and Liu, Z. 2007. Automatic detecting indicators for quality of health information on the Web. International Journal of Medical Informatics, 76(8), 575-582.
[9] Balakrishnan, H., Kaashoek, M. F., Karger, D., Morris, R. and Stoica, I. 2003. Looking up data in P2P systems. Communications of the ACM. DOI=http://doi.acm.org/10.1145/606272.606299.
[10] Lichtnow, D., Jouris, A., Bordignon, A., Pernas, A. M., Levin, F. H., Nascimento, G. S., Silva, I. C. S., Gasparini, I., Teixeira, J. M., Rossi, L. H. L., Oliveira, O. E. D., Schreiner, P., Gomes, S. R. V. and Oliveira, J. P. M. de. 2009. Relato e Considerações sobre o Desenvolvimento de uma Ontologia para Avaliação de Sites da Área de Saúde. Cadernos de Informática (UFRGS), v. 4, p. 7-46.
[11] Pernas, A. M., Palazzo, J. M. de O., Santos, A. H., Donassolo, B. M., Bezerra, C. B., Manica, E., Kalil, F., Soares, L. S., Svoboda, L. H., Mesquita, M. P., Torres, P. R., Petry, R. L., Santos, R. L. and Leithardt, V. R. Q. 2009. Relato sobre o Desenvolvimento de Modelos para Obtenção Automática do Conteúdo de Sites sobre Saúde. Cadernos de Informática (UFRGS), v. 4, p. 47-91.
[12] WordNet: a lexical database for English. Princeton University. Retrieved March 20, 2011.
[13] Tiun, S., Abdullah, R. and Kong, T. E. 2001. Automatic Topic Identification Using Ontology Hierarchy. In Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing - CICLing '01. Springer-Verlag, London, UK.
[14] Liu, B., Chin, C. and Ng, H. 2003. Mining Topic-Specific Concepts and Definitions on the Web. In The 12th International World Wide Web Conference - WWW 2003, Budapest.
[15] Hirsch, J. E. 2005. An index to quantify an individual's scientific research output. PNAS 102 (46), 16569–16572.
[16] Lichtnow, D., Pernas, A. M., Manica, E., Kalil, F., Oliveira, J. P. M. de and Leithardt, V. R. Q. 2010. Automatic Collection of Authorship Information for Web Publications. In Proceedings of the 6th International Conference on Web Information Systems and Technologies – WEBIST, v. 1, p. 339-344. Lisboa: INSTICC. Valencia.
[17] Dublin Core Metadata Initiative. Retrieved March 22, 2011.
[18] Brickley, D. and Miller, L. FOAF Vocabulary Specification 0.98. Namespace Document 9 August 2010, Marco Polo Edition. Retrieved March 20, 2011.
[19] Shvachko, K., Kuang, H., Radia, S. and Chansler, R. 2010. The Hadoop Distributed File System. IEEE 26th Symposium on Mass Storage Systems and Technologies, p. 1-10. ISBN: 978-1-4244-7152-2.
[20] Brin, S. and Page, L. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems. Elsevier Science Publishers, Amsterdam, The Netherlands.
[21] Witten, I. H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. Morgan Kaufmann, 629 p. ISBN-13: 978-0-12-088407-0.