<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimedia information extraction from HTML product catalogues</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Labský</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Praks</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vojtěch Svátek</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ondřej Šváb</string-name>
        </contrib>
        <aff>Department of Information and Knowledge Engineering, University of Economics, Prague, W. Churchill Sq., Praha, Czech Republic</aff>
        <aff>Department of Mathematics and Descriptive Geometry, 17. listopadu, Ostrava-Poruba, Czech Republic, pavel.praks@vsb.cz</aff>
      </contrib-group>
      <fpage>84</fpage>
      <lpage>93</lpage>
      <abstract>
        <p>We describe a demo application of information extraction from company websites, focusing on bicycle product offers. A statistical approach (Hidden Markov Models) is used in combination with different ways of image classification, including latent semantic analysis of image collections. Ontological knowledge is used to group the extracted items into structured objects. The results are stored in an RDF repository and made available for structured search.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Tools and techniques for web information extraction (WIE) have recently been
recognised as one of the key enablers for semantic web (SW) scaling. In our long-term
project named Rainbow<sup>3</sup> we address several intertwined topics that we consider
important for efficient ‘WIE for SW’ applications:
1. Exploitation of multiple information modalities available in web documents
2. Synergy of learning and reuse of ontological information
3. Automated acquisition and labelling of training data for extractor learning
4. Bridging between automated acquisition of SW data and their usage
5. Support for easy design of WIE applications from components.</p>
      <p>In this paper, we focus on an ongoing demo application in the domain of
bicycle product offers. Section 2 presents the core method: automated HTML
annotation based on Hidden Markov Models. Section 3 extends the analysis of
HTML code with that of images. Section 4 describes the composition of product
offer instances with the help of a simple ontology. Section 5 outlines the
architecture of the demo application and the subsequent usage of extracted data in
an RDF repository. Finally, section 6 focuses on future work.</p>
      <p><fn><label>3</label><p>http://rainbow.vse.cz</p></fn></p>
    </sec>
    <sec id="sec-2">
      <title>Web Page Annotation Using HMMs</title>
      <p>For extracting product entries from web catalogues, we built a Hidden Markov
Model (HMM) tagger, which assigns a semantic tag to each token from a
document. Tokens are either words, formatting tags or images. In our experiments,
we evaluated the HMM performance on a diverse set of web pages, which come
from different web sites and have heterogeneous formatting.</p>
      <p>We manually annotated a set of 100 HTML documents chosen from the
Google Directory Sports-Cycling-BikeShops-Europe-UK-England. Each document
contains from 1 to 50 bicycle offers, and each offer consists of at least the bicycle
name and price. There are typically 3–4 documents from the same shop in the
data. Annotations for 15 bicycle characteristics were made using SGML tags<sup>4</sup>.
A sample of the annotated data is shown in Figure 1.</p>
      <p>
        To represent web documents, we employed extensive pre-processing. Similarly
to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we transform each document into XHTML and perform canonicalisation
of XML entities<sup>5</sup>. Certain HTML tags and tag groups are replaced by their
generalisations<sup>6</sup>. Since only words and images can be extracted, we dispose of
mark-up blocks that do not directly contain words or images.
      </p>
      <p>
        HMMs are probabilistic finite state machines, which represent text as a
sequence of tokens. An HMM consists of states which generate tokens, and of
transitions between these states. States are associated with token generation
probabilities, and transitions with transition probabilities. Both kinds of
probabilities are estimated from training data. For the purposes of information
extraction, states are typically associated with semantic tags to be extracted. To
annotate a document using a trained HMM, that document is assumed to have
been generated by that HMM. The most probable state sequence is then found
using the Viterbi algorithm [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
        <fn><label>4</label><p>The training data and a demo are available at http://rainbow.vse.cz.</p></fn>
        <fn><label>5</label><p>This step unifies different ways of writing the same characters in XML.</p></fn>
        <fn><label>6</label><p>Most tags are only represented using their names, disregarding any attributes. Frequently occurring design patterns, such as add-to-basket buttons, are identified using several manually authored rules, and replaced by dedicated tokens.</p></fn>
      </p>
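      <p>The decoding step described above can be illustrated with a minimal Viterbi sketch. It is not the tagger used in our experiments: the three states, the toy probabilities and the probability floor (a crude stand-in for proper smoothing) are assumptions chosen only to keep the example small.</p>
      <preformat>
```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-9):
    """Return the most probable state sequence for a token sequence."""
    def lg(p):
        # Log-probability with a small floor standing in for proper smoothing
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: lg(start_p.get(s, 0.0)) + lg(emit_p[s].get(tokens[0], 0.0)) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] + lg(trans_p[r].get(s, 0.0)))
            best[t][s] = (best[t - 1][prev] + lg(trans_p[prev].get(s, 0.0))
                          + lg(emit_p[s].get(tokens[t], 0.0)))
            back[t][s] = prev
    # Follow the back-pointers from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(tokens) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: background (B), product name and price states
STATES = ["B", "name", "price"]
START_P = {"B": 0.8, "name": 0.2}
TRANS_P = {
    "B": {"B": 0.5, "name": 0.4, "price": 0.1},
    "name": {"name": 0.5, "price": 0.4, "B": 0.1},
    "price": {"price": 0.6, "B": 0.4},
}
EMIT_P = {
    "B": {"our": 0.5, "offer": 0.45, "GBP": 0.05},
    "name": {"Trek": 0.5, "4500": 0.3, "Marin": 0.2},
    "price": {"399": 0.6, "GBP": 0.3, "4500": 0.1},
}
TOKENS = ["our", "offer", "Trek", "4500", "399", "GBP"]

print(viterbi(TOKENS, STATES, START_P, TRANS_P, EMIT_P))
```
      </preformat>
      <p>In this toy model the ambiguous token 4500 is tagged as part of the name because the name-to-name transition followed by name-to-price is more probable than the alternative path through the price state.</p>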
      <p>
        The structure of our HMM is inspired by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and is sketched in Figure 2.
Extracted slots are modelled using target states (denoted as T). Each target
state is accompanied by two types of helper states responsible for representing
the slot’s characteristic context – the prefix and suffix states (P and S). Irrelevant
tokens are modelled by a single background state (B). Contrary to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
which use independent HMMs trained for each slot separately, we train a single
composite HMM capable of extracting all slots at once. Our model thus contains
multiple target, prefix and suffix states. This approach, also used in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], captures
the ordering relations between nearby slots (e.g. a product image often follows
its name). We also experimented with other HMM architectures, with results
presented in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Impact of Image Classification</title>
      <p>For the purpose of extracting product images, we examined the impact of image
information available to the HMM tagger. As a baseline approach, we measured
the tagging performance when no image information was available for tagging.
In this case, all images were represented by the same token and product pictures
could only be distinguished based on the context in which they appeared.</p>
      <p>In order to provide our tagger with more information, we built image
classifiers to determine whether the extracted product is depicted in a particular
image. We used the following features for classification: image dimensions,
similarity to training product images, and whether there is more than one occurrence
of the same image in the containing document.</p>
      <p>For our domain, we modelled images of bicycles using a 2-dimensional normal
distribution, estimated only from positive training examples<sup>7</sup>. The dimensions
x, y of a new image I are first evaluated using the estimated normal density N.
The density value is then normalized to the interval (0,1) using the density’s
maximum value Nmax.</p>
      <p>
        <disp-formula><tex-math>\mathrm{Dim}(I) := \frac{N(x, y)}{N_{\max}}</tex-math></disp-formula>
An image I is then classified as Pos or Neg by comparing its Dim(I) score to
a threshold T<sub>Dim</sub>. This threshold was estimated by minimizing the classification
error rate on a separate heldout set of 150 images.</p>
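      <p>The dimension score can be sketched as follows. Because Nmax is the density value at the mean, the normalising constant of the 2-dimensional normal density cancels and Dim(I) reduces to the exponential of minus one half of the squared Mahalanobis distance. The training sizes and the threshold value below are invented for illustration; they are not taken from our data.</p>
      <preformat>
```python
import math

def fit_gaussian_2d(sizes):
    """Maximum-likelihood mean and covariance of (width, height) pairs."""
    n = len(sizes)
    mx = sum(w for w, _ in sizes) / n
    my = sum(h for _, h in sizes) / n
    sxx = sum((w - mx) ** 2 for w, _ in sizes) / n
    syy = sum((h - my) ** 2 for _, h in sizes) / n
    sxy = sum((w - mx) * (h - my) for w, h in sizes) / n
    return (mx, my), (sxx, sxy, syy)

def dim_score(size, mean, cov):
    """Dim(I) = N(x, y) / N_max; the normalising constants cancel."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    dx, dy = size[0] - mx, size[1] - my
    det = sxx * syy - sxy * sxy  # assumed non-zero (non-degenerate training data)
    # Squared Mahalanobis distance via the closed-form inverse of the 2x2 covariance
    m2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * m2)

def classify_by_dim(size, mean, cov, t_dim):
    return "Pos" if dim_score(size, mean, cov) >= t_dim else "Neg"

# Hypothetical positive examples (pixel sizes of bicycle pictures)
POSITIVE_SIZES = [(200, 150), (220, 160), (180, 140), (210, 170), (190, 130)]
MEAN, COV = fit_gaussian_2d(POSITIVE_SIZES)
T_DIM = 0.1  # illustrative threshold; in the text it is estimated on heldout images

print(classify_by_dim((205, 155), MEAN, COV, T_DIM))
print(classify_by_dim((16, 16), MEAN, COV, T_DIM))
```
      </preformat>
      <p>A small icon such as a 16 by 16 pixel bullet lies far from the mean of the positive examples and therefore receives a score close to zero.</p>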
      <p>
        Within our document collection, image dimensions appeared to be the best single
predictor, with an error rate of 6.6%. However, this is mainly due to our
collection being limited to relevant product catalogues only. When dealing with more
heterogeneous data, features describing the actual image content will become
necessary.
      </p>
      <p>
        We experimented with a latent semantic approach to measuring image similarity,
described in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This kind of image similarity has been applied to
image retrieval from collections, where the task often is to find the most similar
image to a query. We used this image-to-image similarity measure sim(I, J) to
compute simC(I), the similarity of an image I to a collection of images C. In our
experiments, C contained the training bicycle pictures (positive examples only).
To compute simC(I), we used the K nearest neighbor approach and averaged
the similarities of the K most similar images from the collection.
        <disp-formula><tex-math>\mathrm{sim}_C(I) = \frac{\sum_{K\ \text{best images}\ J \in C} \mathrm{sim}(I, J)}{K}</tex-math></disp-formula>
Experimentally, we set K = 20, since lower values of K led to a decrease in the
similarity’s robustness<sup>8</sup> and higher values did not bring further improvement. To
build a classifier, a similarity threshold T<sub>Sim</sub> was estimated on a heldout set in
the same way as for the dimension classifier above. The error rate of the classifier
was 26.7% on our document collection.
        <fn><label>7</label><p>The positive examples comprise all bicycle pictures found in the documents, not only those labeled as parts of bicycle offers. For information extraction, this increases the role of image context for correct tagging.</p></fn>
        <fn><label>8</label><p>With low values of K, simC(I) became too sensitive to individual images J with misleading values of sim(I, J).</p></fn>
      </p>
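      <p>The collection similarity simC(I) described above can be sketched as an average over the K best matches. The cosine similarity over small feature vectors used here is only a placeholder for the latent semantic image similarity, and the vectors are invented. If the collection holds fewer than K images, this sketch averages over the images available, which is an assumption not covered by the formula.</p>
      <preformat>
```python
import math

def cosine(u, v):
    """Placeholder image-to-image similarity on feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sim_to_collection(image, collection, sim=cosine, k=20):
    """Average similarity of `image` to its k most similar images in `collection`."""
    scores = sorted((sim(image, other) for other in collection), reverse=True)
    top = scores[:k]
    return sum(top) / len(top)

# Hypothetical feature vectors of three training pictures
COLLECTION = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(sim_to_collection((1.0, 0.0), COLLECTION, k=2))
```
      </preformat>
      <p>Averaging over several neighbours rather than taking the single best match reduces the influence of one misleading similarity value.</p>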
      <p><bold>Combined classifier</bold></p>
      <p>For the combined image classifier, we used the above described dimension score
Dim(I), similarity score Sim(I) and a binary feature indicating whether the
image occurs more than once in the document. We experimented with different
classifiers available in the Weka<sup>9</sup> environment, and the best error rate<sup>10</sup> of 4.8%
was achieved by the multilayer perceptron algorithm.</p>
      <p>Results for all three classifiers are compared in Table 1. All results were
measured using 10-fold cross-validation on a set of 1,507 occurrences of 999 unique
images taken from our training documents. The first two algorithms used an
additional 150 heldout images to estimate their decision thresholds. The
cross-validation splitting was done at the level of documents, so that all images from
a single document were either used for training or for testing.</p>
      <p>To improve extraction results, we need to communicate the image classifier’s
results to the HMM tagger. Currently we do this simply by substituting each
image occurrence in a document by its class. Since these binary decisions would
leave little room for the HMM tagger to fix incorrect classifications, we adapted
the above binary classifiers to classify into 3 classes: Pos, Neg, and Unk. In this
way, the HMM tagger learns to classify the Pos and Neg classes correspondingly,
and the tagging of the Unk class depends more strongly on the context.</p>
      <p>To build the ternary versions of the dimension- and similarity-based
classifiers, we introduced costs for the classifier’s decisions. Each wrong decision was
penalized by C<sub>Miss</sub> = 1 and the cost of each Unk decision was C<sub>Unk</sub> ∈ (0, 1).
We set C<sub>Unk</sub> manually such that the classifier produced 5–10% of Unk decisions
on the heldout set. While minimizing the sum of these costs on the heldout
set, two thresholds were estimated for each of the dimension- and similarity-based
classifiers, delimiting their Neg, Unk and Pos decisions.</p>
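      <p>One way to estimate the two thresholds is an exhaustive search over candidate pairs that minimises the summed decision cost on the heldout set. The sketch below assumes this search strategy, a cost of 0.3 for an Unk decision and a tiny invented heldout set; none of these values come from our experiments.</p>
      <preformat>
```python
from itertools import combinations_with_replacement

def ternary_class(score, t_low, t_high):
    if score >= t_high:
        return "Pos"
    if t_low > score:
        return "Neg"
    return "Unk"

def total_cost(scored, t_low, t_high, c_miss=1.0, c_unk=0.3):
    """Sum of decision costs over (score, is_product) pairs."""
    cost = 0.0
    for score, is_product in scored:
        decision = ternary_class(score, t_low, t_high)
        if decision == "Unk":
            cost += c_unk
        elif (decision == "Pos") != is_product:
            cost += c_miss
    return cost

def fit_thresholds(scored, c_unk=0.3):
    """Search all candidate pairs (t_low not above t_high) for the cheapest one."""
    # Observed scores plus infinity, so that "no image is Pos" is also reachable
    candidates = sorted({score for score, _ in scored} | {float("inf")})
    return min(combinations_with_replacement(candidates, 2),
               key=lambda pair: total_cost(scored, pair[0], pair[1], c_unk=c_unk))

# Hypothetical heldout set: (classifier score, is the image a product picture?)
HELDOUT = [(0.05, False), (0.10, False), (0.45, True),
           (0.50, False), (0.80, True), (0.95, True)]
print(fit_thresholds(HELDOUT))
```
      </preformat>
      <p>With these values, the two overlapping examples in the middle of the score range are cheaper to label Unk than to risk a wrong Pos or Neg decision, so they fall between the two thresholds.</p>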
      <p>For the combined ternary classifier, we achieved the best results with a
decision list shown in Table 2. The list combines image occurrence count with
the results of the dimension- and similarity-based ternary classifiers, denoted as
class3Dim and class3Sim respectively.</p>
      <p>We evaluated information extraction results with all three ternary classifiers
and compared the results to the case where no image information was available.</p>
      <sec id="sec-3-1">
        <p>
          The new image information from the combined classifier led to an increase of
19.1 percentage points in picture precision and also to subtle improvements for other tags.
Improvements in precision and recall for 3 chosen slots (product pictures, names
and prices), measured on a per-token basis, are shown in Table 3 for all three
classifiers.
          <fn><label>9</label><p>http://www.cs.waikato.ac.nz/~ml</p></fn>
          <fn><label>10</label><p>This error rate comes from 10-fold cross-validation without using heldout data.</p></fn>
        </p>
        <p>
          The semantic web is not about isolated tagged items but about complex and
interrelated entities; we thus need to group the labels produced by automated
annotation into instances. We currently use a simple sequential algorithm that
exploits constraints defined in a tiny presentation ontology<sup>11</sup> [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which partly
pertain to the generic domain (bike offers) and partly to the way of presenting
information in web catalogues. Figure 3 shows an experimental presentation
ontology containing the class ‘Bike offer’. The utilized constraints are uniqueness,
multiplicity and optionality of certain properties, the latter two indicated with
the * and ? symbols, respectively<sup>12</sup>. In addition, ‘sticky’ properties (indicated
with !) are distinguished: as soon as the value of a sticky property is discovered
on a page, it is filled into all objects extracted afterwards, until a new value is
discovered for this property.
          <fn><label>11</label><p>Similar to ‘extraction ontologies’ used by Embley [<xref ref-type="bibr" rid="ref5">5</xref>].</p></fn>
          <fn><label>12</label><p>Although not shown in the example, we can also use e.g. property value types or regular expressions.</p></fn>
        </p>
        <p>An annotated item is added to the currently assembled (bike offer) instance
unless it would cause an inconsistency; in that case, the current instance is saved and
a new instance is created to accommodate this item and the following ones.
Despite acceptable performance on error-free, hand-annotated training data, where
the algorithm correctly groups about 90% of names and prices, this ‘baseline’
approach achieves very poor results on automatically annotated data: on
average, less than 50% of corresponding annotations are grouped properly, often for
trivial reasons. The most critical problems are connected with missing or extra
annotations, multiple different references to a single slot, and with transposed
HTML tables.</p>
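        <p>The sequential grouping can be sketched as below. The property names, and the choice to pre-fill sticky values at the moment a new instance is started, are assumptions made for this illustration; optionality constraints are not modelled.</p>
        <preformat>
```python
def group_annotations(items, multi=frozenset(), sticky=frozenset()):
    """Group (property, value) annotations, in document order, into instances."""
    instances = []
    sticky_values = {}
    current = {}
    for prop, value in items:
        if prop in sticky:
            # Remember the value; it pre-fills every instance started afterwards
            sticky_values[prop] = value
            continue
        if prop in current and prop not in multi:
            # A second value of a unique property would be inconsistent:
            # save the current instance and start a new one
            instances.append(current)
            current = {}
        if not current:
            current = {p: [v] for p, v in sticky_values.items()}
        current.setdefault(prop, []).append(value)
    if current:
        instances.append(current)
    return instances

# Hypothetical annotation stream for one catalogue page
ITEMS = [("brand", "Trek"), ("name", "Trek 4500"), ("picture", "a.jpg"),
         ("picture", "b.jpg"), ("price", "399"), ("name", "Trek 6000"),
         ("price", "599")]
for offer in group_annotations(ITEMS, multi={"picture"}, sticky={"brand"}):
    print(offer)
```
        </preformat>
        <p>The sketch also shows why the approach is fragile: a single missing name annotation would merge two offers until some other unique property repeats.</p>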
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Result Transformation, Storage And Retrieval</title>
      <p>
        All components developed within the Rainbow project are wrapped as web
services. The WIE component itself is currently called by a simple control
routine (written in Java), which also optionally calls other analysis tools: in
the bicycle application, we have so far experimented with URL-based navigation over
the website, extraction of the content of selected META tags, and extraction
of ‘company profile sentences’ from free text<sup>13</sup>. The results are transformed to
RDF (with respect to a ‘bicycle-offer RDFS ontology’) and stored in a Sesame
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] repository. An end-user search interface to this repository<sup>14</sup> is shown in Fig. 4.
It relies on a collection of query templates expressed in SeRQL (the native query
language of Sesame) and enables a simple form of navigational retrieval [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
        <fn><label>13</label><p>These three approaches to website analysis, implemented independently of the bicycle demo application, are evaluated in [<xref ref-type="bibr" rid="ref13">13</xref>].</p></fn>
        <fn><label>14</label><p>Available at http://rainbow.vse.cz:8000/sesame.</p></fn>
      </p>
      <p>
        Most urgently, we need to replace the ‘toy’ implementation of ontology-based
instance composition with a version reasonably robust on automatically
annotated data. For some of the layout-oriented problems mentioned in section 4,
partial solutions recently suggested in IE research (e.g. [
        <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
        ]) could be reused.
We also consider introducing HMMs even to this phase of extraction; a modified
version of the Viterbi algorithm supporting domain constraints (such as those in our
presentation ontology) has already been described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Another aspect worth
investigating is the possibility of (semi-)automatic construction of presentation
ontologies from the corresponding domain ontologies.
      </p>
      <p>
        A critical bottleneck of ML-based IE methods (in particular of statistical
ones) is the volume of labelled training data required. In our experiments with
product catalogues, we noticed that the tagger often classifies most product
entries correctly but misses a few product names that are very different from the
training data. We developed a simple symbolic algorithm that identifies similar
structural patterns in a document. For example, the HTML tag sequence &lt;td&gt;
&lt;a&gt; &lt;font&gt; &lt;br/&gt; &lt;/font&gt; &lt;/a&gt; &lt;/td&gt; with arbitrary words in between
appears 34 times in one of our training documents: the tagger successfully
annotated 28 product names contained in these patterns between &lt;font&gt; and &lt;br/&gt;,
but missed the remaining 6. In such cases, we could collect the remaining
product names and use them to enrich the model’s training data. By learning novel
product names from these ‘easy’ pages, the model will learn to also recognise
them in less structured documents<sup>15</sup>. We also plan to bootstrap the method with
data picked from public resources related to product offering, following up with
our earlier experiments with Open Directory headings and references [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
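      <p>The idea of harvesting missed product names from repeated structural patterns can be sketched as follows. This is not the symbolic algorithm we developed; it is a simplified illustration that assumes table cells as pattern boundaries, represents each token as a (kind, text) pair with tag names written without angle brackets, and uses an invented support threshold and invented data.</p>
      <preformat>
```python
from collections import Counter, defaultdict

def split_blocks(tokens, open_tag="td", close_tag="/td"):
    """Return (signature, gaps) per block: its tag names and the words between tags."""
    blocks, tags, gaps = [], None, None
    for kind, text in tokens:
        if kind == "tag" and text == open_tag:
            tags, gaps = [text], [[]]
        elif tags is not None:
            if kind == "tag":
                tags.append(text)
                gaps.append([])
            else:
                gaps[-1].append(text)
            if kind == "tag" and text == close_tag:
                blocks.append((tuple(tags), [" ".join(g) for g in gaps[:-1]]))
                tags = None
    return blocks

def propose_names(blocks, tagged_names, min_support=0.5):
    """Suggest untagged strings found where sibling blocks hold tagged names."""
    by_signature = defaultdict(list)
    for signature, gaps in blocks:
        by_signature[signature].append(gaps)
    proposals = []
    for group in by_signature.values():
        # Per gap position, count how often the tagger labelled its content as a name
        hits = Counter(i for gaps in group
                       for i, text in enumerate(gaps) if text in tagged_names)
        if not hits:
            continue
        position, count = hits.most_common(1)[0]
        if count / len(group) >= min_support:
            proposals += [gaps[position] for gaps in group
                          if gaps[position] and gaps[position] not in tagged_names]
    return proposals

def offer(name, price):
    """Build one hypothetical catalogue cell: td a font NAME br/ PRICE /font /a /td."""
    words = [("word", w) for w in name.split()]
    return ([("tag", "td"), ("tag", "a"), ("tag", "font")] + words +
            [("tag", "br/"), ("word", price), ("tag", "/font"), ("tag", "/a"), ("tag", "/td")])

TOKENS = offer("Trek 4500", "399") + offer("Marin Muirwoods", "450") + offer("Kona Dew", "329")
TAGGED = {"Trek 4500", "Marin Muirwoods"}  # names the tagger is assumed to have found
print(propose_names(split_blocks(TOKENS), TAGGED))
```
      </preformat>
      <p>The proposed strings would then be candidates for enriching the training data, ideally after a plausibility check.</p>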
      <p>
        Another important task is to replace hard-coded control routines with
semiautomatically constructed, implementation-independent application models. A
knowledge modelling framework has already been introduced for this purpose
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; we are currently examining the adaptability of a PSM-based semantic web-service
configuration technique in connection with this framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p>
        Eventually, we plan to associate our efforts with the popular Armadillo
project [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], with which we share most of our above-mentioned research interests.
      </p>
      <p>
        The research is partially supported by grant no. 201/03/1318 of the Grant
Agency of the Czech Republic, “Intelligent analysis of the WWW content and
structure”.
        <fn><label>15</label><p>Similar bootstrapping strategies are shown in [<xref ref-type="bibr" rid="ref4">4</xref>].</p></fn>
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Borkar</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deshmukh</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarawagi</surname>
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automatic segmentation of text into structured records</article-title>
          .
          <source>In: SIGMOD Conference</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Broekstra</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kampman</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Sesame: An Architecture for Storing and Querying RDF and RDF Schema</article-title>
          .
          <source>In: Proc. ISWC</source>
          <year>2002</year>
          , Springer LNCS no.
          <volume>2342</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ciravegna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dingli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilks</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning to Harvest Information for the Semantic Web</article-title>
          . In: ESWS-04, Heraklion, Springer LNCS
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dingli</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciravegna</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guthrie</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilks</surname>
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Mining Web Sites Using Unsupervised Adaptive Information Extraction</article-title>
          . In: EACL,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Embley</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liddle</surname>
            ,
            <given-names>S.W.</given-names>
          </string-name>
          :
          <article-title>Automatically extracting ontologically specified data from HTML tables with unknown structure</article-title>
          .
          <source>In: ER2002, Tampere</source>
          <year>2002</year>
          ,
          <fpage>322</fpage>
          -
          <lpage>337</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Freitag</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCallum</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Information extraction with HMMs and shrinkage</article-title>
          .
          <source>In: Proceedings of the AAAI-99 Workshop on Machine Learning for IE</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Grover</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gearailt</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karkaletsis</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farmakiotou</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samaritakis</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petasis</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pazienza</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vindigni</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vichot</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolinski</surname>
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Multilingual XML-Based Named Entity Recognition for E-Retail Domains</article-title>
          . In: LREC Conference, Las Palmas,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kavalec</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Svátek, V.:
          <article-title>Information Extraction and Ontology Learning Guided by Web Directory</article-title>
          .
          <source>In: ECAI Workshop on NLP and ML for ontology engineering</source>
          , Lyon
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Labský</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Svátek, V., Šváb, O.:
          <article-title>Types and Roles of Ontologies in Web Information Extraction</article-title>
          .
          <source>In: ECML/PKDD04 Workshop on Knowledge Discovery and Ontologies</source>
          , Pisa
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Praks</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dvorský</surname>
            <given-names>J.</given-names>
          </string-name>
          , Snášel V.:
          <article-title>Latent semantic indexing for image retrieval systems</article-title>
          .
          <source>In: Proceedings of the SIAM Conference on Applied Linear Algebra (LA03)</source>
          , Williamsburg, USA, The College of William and Mary,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Praks</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Machala</surname>
            <given-names>L.</given-names>
          </string-name>
          , Snášel V.:
          <article-title>Iris Recognition Using the SVD-Free Latent Semantic Indexing</article-title>
          . In: MDM/KDD 2004 - Fifth International Workshop on Multimedia Data Mining, Seattle, USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Rabiner</surname>
            ,
            <given-names>L.R.</given-names>
          </string-name>
          :
          <article-title>A tutorial on hidden Markov models and selected applications in speech recognition</article-title>
          .
          <source>In: Proceedings of the IEEE</source>
          ,
          <volume>77</volume>
          (
          <issue>2</issue>
          ),
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Svátek, V.,
          <string-name>
            <surname>Berka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavalec</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Vávra, V.:
          <article-title>Discovering Company Descriptions on the Web by Multiway Analysis</article-title>
          .
          <source>In: Intelligent Information Processing and Web Mining</source>
          , IIPWM'03, Springer Verlag,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Svátek, V., Labský, M.,
          <string-name>
            <surname>Vacura</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Knowledge Modelling for Deductive Web Mining</article-title>
          .
          <source>In: Proc. EKAW</source>
          <year>2004</year>
          , Springer Verlag, LNCS,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Svátek, V.,
          <string-name>
            <surname>ten Teije</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vacura</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Web Service Composition for Deductive Web Mining: A Knowledge Modelling Approach</article-title>
          .
          <source>In: Proc. Znalosti</source>
          <year>2005</year>
          ,
          VSB-TU Ostrava, to appear
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. Šváb, O.,
          <string-name>
            <surname>Labský</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Svátek, V.:
          <article-title>RDF-Based Retrieval of Information Extracted from Web Product Catalogues</article-title>
          . In: SIGIR'04 Semantic Web Workshop, Sheffield.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Valarakos</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sigletos</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karkaletsis</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliouras</surname>
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A Methodology for Semantically Annotating a Corpus Using a Domain Ontology and Machine Learning</article-title>
          . In: RANLP Conference, Borovets,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>