J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 228–234
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 M. Kopecky, M. Vomlelova, P. Vojtas




                                   Repeatable Web Data Extraction and Interlinking

                                            M. Kopecky1, M. Vomlelova2, P. Vojtas1
                                              Faculty of Mathematics and Physics
                                                      Charles University
                                                   Malostranske namesti 25,
                                                    Prague, Czech Republic
                              1{kopecky|vojtas}@ksi.mff.cuni.cz, 2marta@ktiml.mff.cuni.cz


Abstract. We would like to make all web content usable in the same way as 5-star Linked (Open) Data. We face several challenges: either there is no LOD in the domain of interest, or the data project is no longer maintained, or something is even broken (links, SPARQL endpoint, etc.).

We propose a dynamic logic extension of the semantic model, in which data can also carry information about their creation process. We evaluate this on several movie datasets.

In this work in progress we also provide some preference learning experiments over the extracted and integrated data.

Keywords
Repeatable experiments; web data extraction, annotation, linking; dynamic logic; preference learning

1 Introduction, Motivation, Recent Work

For our decisions we often need automated processing of integrated web data. Linked (Open) Data are one possibility to achieve this vision. Still, there are some challenges.

Production URLs are sometimes subject to change [A]. Data migrate, data projects run out of their contracted sustainability period, and so on.

SPARQL endpoints are expensive for the server and not always available for all datasets. Downloadable dumps are expensive for clients and do not allow live querying on the Web [V+].

In some areas there are no corresponding Linked Data projects available at all. Imagine, e.g., a customer looking for a car who would like to aggregate all relevant web data. Our idea is to remember previous successful extractions in a given domain and to reuse them in the current situation. Social networks can also help with the evaluation of previous extractions. We first presented this idea in [PLEDVF]. We concentrated on one specific purpose: extracting object attributes and using these data in recommender systems. In this research we have tried to increase the degree of automation of web content processing. We presented several methods for mining web information and for assisted annotation.

1.1 Semantic Annotator for (X)HTML

A tool for assisted annotation is available at [F2]. Semantic Annotator allows both manual and assisted annotation of web pages directly in Google Chrome. It requires no complicated installation and is available on all platforms and devices where Google Chrome can be installed; semantic annotation is thus available to all current Internet users, not only to the authors of a site. The browser extension Semantic Annotator began as a prototype implementation in the thesis [F1].

The Google Chrome extension Semantic Annotator is used for manual semantic annotation of web sites. The goal of semantic annotation is to assign a meaning to each part of a page. The meanings of real-world objects and their properties and relationships are described in dictionaries, either self-created or imported. The annotation process then consists of selecting parts of the site and assigning meanings from the dictionary.

Figure 1: Illustrative figure for Semantic Annotator, more in [F1], [F2]

In Figure 1 we show an example of annotation of product pages in an e-shop. The user selects a web page and assigns to the product name the meaning "product name" from a dictionary describing products. Then, on the same page, the user selects the product price and gives it the meaning "price". Because the pages are similar and use templates, annotating a few pages is enough to train an annotator for the whole site. Search over an annotated website is consequently much more accurate.

1.2 Semantic Annotations for Texts

In the previous section we described a tool for annotation of (X)HTML pages. It is useful for extracting attributes of products on pages created by the
same template, even if the texts are a little different. Even if there are no templates, in a fixed domain like reports on traffic accidents there are still repetitions.

A tool for annotation of text was developed in our group in the PhD thesis [D1]. The tool is available at [D2]. It is built on GATE [BTMC], using UFAL dependency parsing [UFAL] and an Inductive Logic Programming extension. It represents a tool for information extraction and consequent annotation.

The thesis [D1] presents four relatively separate topics. Each topic represents one particular aspect of the information extraction discipline. The first topic focuses on new information extraction methods based on deep language parsing in combination with manually designed extraction rules. The second topic deals with a method for automated induction of extraction rules using inductive logic programming. The third topic of the thesis combines information extraction with rule-based reasoning. The core extraction method was experimentally reimplemented using semantic web technologies, which allows saving the extraction rules in so-called shareable extraction ontologies that do not depend on the original extraction tool. See Fig. 2 on traffic accident data.

The last topic of the thesis deals with document classification and fuzzy logic. The possibility of using information obtained by information extraction techniques for document classification is examined. For more see also [PLEDVF].

In this research proposal we concentrate on the synergy of annotation and integration of data for user preference learning, and consequently for recommendation.

Figure 2: Illustrative figure of an extraction pattern for a dependency tree, more in [D1], [D2]

It turns out that similarity and dynamic aspects of web data play a role here as well. We propose an appropriate dynamic logic model of web data dynamics and provide some experiments. We hope this can serve as preliminary experiments for a more extended research proposal.

2 Extraction Experiments

Our main goal is: try to remember information about the context of data creation in order to enable repetition of the extraction. In general we can remember algorithms, training data and metrics of success.

In this chapter we describe some extraction algorithms and the process of data integration in the movie domain. These data will be used in preference learning. By this we would like to illustrate our main goal.

2.1 Integration

We use Flix data (enriched Netflix competition data), RecSys 2014 challenge data [T] and RuleML Challenge data [K].

The datasets are quite different, but they still have a few things in common. Movies have a title and usually also the year of their production. Ratings are equipped with a timestamp that allows us to order the ratings of individual users chronologically.

One of the problems was that the different datasets use different MOVIEIDs, so the movies cannot be straightforwardly mapped across datasets. We therefore wanted to enhance every movie with its corresponding IMDb identifier.

We observed that the Twitter dataset uses as its internal MOVIEID the numeric part of the IMDb identifier. So the movie "Midnight Cowboy" with Twitter MOVIEID = 64665 corresponds to the IMDb record with ID equal to 'tt0064665'. Therefore, we construct the algorithm

τ1: Twitter-MOVIEID → IMDbId

which simply concatenates the prefix 'tt' with the MOVIEID left-padded by zeroes to seven positions. The success rate of this algorithm is shown in Table 1.

Table 1: Simple identifier transformation

  Algorithm    MovieLens    Flix    Twitter
  τ1()         0%           0%      100%

To be able to assign IMDb identifiers to movies from the other datasets, we had to use the search capabilities of the IMDb database. We used an HTTP interface for searching movies by their name. The request is sent as an HTTP request of the form http://www.imdb.com/find?q=movie+title&s=tt.

The other versions of the algorithm for assigning IMDb identifiers to movies can in general be formally described as

τ2i,j: TITLE × YEAR → TT

implemented in two independent steps

τ2i,j(TITLE, YEAR) = σj(ηi(TITLE), TITLE, YEAR)

where the algorithm ηi transforms a movie title into the query string needed for the IMDb search, while the algorithm σj then looks for the correct record in the returned table. The simplest implementations of these algorithms can be denoted as follows:

η1: TITLE → TABLE
σ1: TABLE × TITLE × YEAR → TT

where the algorithm η1 concatenates all words of the title with the plus sign, and the algorithm σ1 returns the TT identifier in case the resulting table contains exactly one record. The results of this combination of algorithms are shown in Table 2 (ratio of correct answers).
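A minimal sketch of these identifier-mapping steps, assuming Python (the function names are ours, not from the paper's implementation): τ1 pads the Twitter MOVIEID, η1 builds the IMDb query URL from a title, and normalize_title anticipates the incremental title-cleanup steps η2–η7 described in the next section.

```python
import re

def tau1(movieid: int) -> str:
    """τ1: concatenate 'tt' with the MOVIEID left-padded by zeroes to 7 digits."""
    return f"tt{movieid:07d}"

def normalize_title(title: str) -> str:
    """η2–η7: canonicalize a movie title before querying IMDb."""
    t = title.lower()                                  # η2: lower case
    t = re.sub(r"\s*\(\d{4}\)\s*$", "", t)             # η3: drop trailing "(1985)"
    t = re.sub(r"\s*\([^)]*\)\s*$", "", t)             # η4: drop trailing "(original name)"
    m = re.search(r",?\s+(the|a|an)$", t)              # η5: trailing article to the front
    if m:
        t = m.group(1) + " " + t[: m.start()]
    t = t.translate(str.maketrans("_.?,", "    "))     # η6: punctuation to spaces
    t = t.replace("&amp;", "and").replace("&", "and")  # η7: ampersands to "and"
    return " ".join(t.split())                         # collapse whitespace

def eta1(title: str) -> str:
    """η1: build the IMDb search URL by joining the title words with '+'."""
    return "http://www.imdb.com/find?q=" + "+".join(title.split()) + "&s=tt"
```

On the running example, normalize_title("Official Story, The (La Historia Oficial) (1985)") yields "the official story", and eta1 then produces the query URL shown in the text; the σj step would scan the returned table for the TT identifier.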
Table 2: The simplest version of IMDb search by title name

  Algorithm    MovieLens    Flix     Twitter
  σ1(η1())     42.7%        51.2%    not needed

To illustrate different algorithms for the same extraction task we describe another version. Here the algorithm is not learned but hand-crafted. One of the reasons for the relatively low effectiveness of the σ1(η1()) algorithm was the sub-optimal query string used for the IMDb search, due to quite different naming conventions for movie titles in the different datasets. To improve the results we enhanced the movie title transformation incrementally and produced new versions of it. Every new version added a new step to the transformation of the movie title:

η2: Convert all letters in the movie title to lower case.
η3: If the movie title ends with the year of production in brackets, remove it.
η4: If the movie title still ends with text in brackets, remove it. This text usually contained the original name of the movie in its original language.
η5: Move the word "the" (respectively "a"/"an") from the end of the title to the beginning.
η6: Translate the characters "_", ".", "?" and "," to spaces.
η7: Translate "&amp;" and "&" in titles to the word "and".

For example, the η7 transformation changes the title "Official Story, The (La Historia Oficial) (1985)" to its canonical form "the official story".

This version of the transformation then constructs the IMDb query in the form http://www.imdb.com/find?q=the+official+story&s=tt&… and then looks up the resulting table to find the identifier "tt0089276".

The results of this combination of algorithms were:

Table 3: The more complex version of IMDb search by title name

  Algorithm    MovieLens    Flix     Twitter
  σ1(η7())     45.4%        70.9%    not needed

In the optimal case, the table returned from the IMDb search contains exactly one row with the requested record. For this situation the algorithm σ1 behaves well and is able to retrieve the correct IMDb identifier. In many other cases the result contains more rows, and the correct one, or the best possible one, has to be identified. For this purpose we constructed further versions of the σj algorithm as well:

σ2: The correct record should be from the requested year, so search only for records from this year and ignore other records.
σ3: The IMDb search provides more levels of tolerance in title matching. Try three of them, from the most exact to the most general. If the matching record from the requested year cannot be found using a stricter search, the next search level is used.

Currently, we have 13 081 out of all 17 770 Flix movies mapped onto the IMDb database. Even all 27 278 movies from the MovieLens set are mapped to their equivalent IMDb records. So the current results provided by the combination of the most advanced versions of the algorithms are:

Table 4: The most complex version of IMDb search by title name

  Algorithm    MovieLens    Flix     Twitter
  σ3(η7())     100.0%       73.6%    not needed

Figure 3: Numbers of movies mapped onto the IMDb database

The diagram in Figure 3 shows the number of movies in the individual datasets and the number of movies assigned to their corresponding IMDb records. The numbers of movies associated with an IMDb record in the different intersections after the integration differ. For example, the MovieLens dataset contains in total 27 278 movies. Of these, 13 134 associated movies are unique to MovieLens, and it also contains 3 759 associated movies shared with the Flix dataset that do not exist in the Twitter dataset. The number of movies common to all three datasets is equal to 4 075. By summing 3 759 and 4 075 we get the total number of 7 654 associated movies belonging to both the MovieLens and Flix datasets, etc.

2.2 Extraction of Attributes

For each movie registered in the IMDb database we then retrieved XML data from the URL address

http://www.omdbapi.com/?i=ttNNNNNNN&plot=full&r=xml

and then retrieved the following movie attributes from the XML data:

  IMDb title        (/root/movie/@title)
  IMDb rating       (/root/movie/@imdbRating)
  IMDb rated        (/root/movie/@rated)
  IMDb awards       (/root/movie/@awards)
  IMDb metascore    (/root/movie/@metascore)
  IMDb year         (/root/movie/@year)
  IMDb country      (/root/movie/@country)
  IMDb language     (/root/movie/@language)
  IMDb genres       (/root/movie/@genre)
  IMDb director     (/root/movie/@director)
  IMDb actors       (/root/movie/@actors)

Similarly to the way the movies from the datasets are mapped onto IMDb movies, we implemented the mapping technique described in [K] and assigned DBpedia¹ identifiers and semantic data to IMDb movies.

The DBpedia identifier of a movie is a string, for example "The_Official_Story" or "The_Seventh_Seal". This identifier can then be used to access the DBpedia graph database directly, or to retrieve data in an XML format through a URL address of the form http://dbpedia.org/page/DbPediaIdentifier. From the data available on the DBpedia page, the movie attributes GENRE, GENRE1, ACTION, ADVENTURE, ANIMATION, CHILDRENS, COMEDY, CRIME, DOCUMENTARY, DRAMA, FANTASY, FILM_NOIR, HORROR, MYSTERY, MUSICAL, ROMANCE, SCI_FI, THRILLER, WAR, WESTERN, as well as the attributes CALIF, LA, NY, CAMERON, VISUAL, SEDIT, NOVELS, SMIX, SPIELBERG, MAKEUP, WILLIAMS and many others, can be extracted directly or indirectly.

3 A Dynamic Logic Model for Web Annotation

For effective use of changing and/or increasing information we have to evolve the tools (e.g. inductive methods) used for the creation of a specific web service (here, recommendation of movies). Our goal is to extend the semantic web foundations to enable describing creation, dynamics and similarities of data. To describe the reliability of extraction algorithms we propose a "half-way" extension of dynamic logic.

Our reference for dynamic logic is the book of D. Harel, D. Kozen and J. Tiuryn [HKT].

Dynamic logic has two types of symbols: propositions/formulas ϕ, ψ ∈ Φ and programs α, β ∈ Π. One can construct a program also from a formula by the test ϕ?, and formulas by the generalized modality operations [α], <α>. The expression <α>ϕ says that it is possible to execute α and halt in a state satisfying ϕ; the expression [α]ϕ says that whenever α halts, it does so in a state satisfying ϕ.

The main goal of dynamic logic is reasoning about programs, e.g. in program verification. In our case the programs will be extractors/annotators and can be kept propositional, as for now we are not interested in the procedural details of the extractors. The formulas will be more expressive in order to be able to describe the context of extraction.

Using the example above, let
ϕ be the statement that a Twitter data entry has title "Midnight Cowboy" and MOVIEID = 64665;
α be the algorithm concatenating the prefix 'tt' with the MOVIEID left-padded by zeroes to seven positions; and
ψ say that the movie "Midnight Cowboy" corresponds to the IMDb record with ID equal to 'tt0064665'.

The corresponding dynamic logic expression is

∀x(ϕ(x) → [α]ψ(x))

saying that whenever α starts in a state satisfying ϕ(m1), then whenever α halts, it does so in a state satisfying ψ(m1) – see the illustration in Fig. 4.

Programs (extractors) remain propositional; states correspond to different representations of content on the web. On each of the states the respective semantics is defined using an appropriate query language.

Our logic has expressions of two sorts, and each sort is, respectively can be, typed:

Statements about web data: these can be either atomic, e.g. Φ0RDF, Φ0FOL, Φ0RDB, Φ0XML, Φ0DOM, Φ0BoW, Φ0PoS, Φ0DepTree, etc., or more complex, e.g. ϕRDF, ψFOL, etc., with the semantics based on the corresponding data model and query language. All can be subject to uncertainty and probability extensions.

Programs (propositional): atomic Π0σ for subject extraction, Π0π for property extraction or Π0ω for object value extraction in the case of HTML, XHTML or XML data; Π0ner for named entity extraction in the case of text data, etc.; and more complex ασπω, βσπω, γσπω, etc. In this logic we do not prove any statements about programs depending on their code, so program names point to code one would reuse.

Statements are typically accompanied by information about program creation, like the data mining tool, training data and metrics (e.g. precision, recall). There is also a lot of reification describing the training and testing data and the metrics of learning. Our model is based on dynamic logic, calculates similarity of states and describes the uncertain/stochastic character of our knowledge.

Figure 4: Extraction data enriched

¹ http://wiki.dbpedia.org/
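The reification sketched above — an extractor's name, its pre- and postconditions, and the metrics of its creation — could be recorded, for example, as a small data structure. This is a sketch under our own naming assumptions, not the paper's implementation; the example values come from Table 4.

```python
from dataclasses import dataclass, field

@dataclass
class ExtractionRecord:
    """One remembered extraction: a program name plus the context of its creation."""
    program: str       # points to reusable extractor code, e.g. "sigma3(eta7())"
    precondition: str  # statement phi about the data before extraction
    postcondition: str # statement psi about the data after extraction
    confidence: float  # weight of the modality [alpha]_x, e.g. a success ratio
    training_data: str = ""                      # reification: what it was built on
    metrics: dict = field(default_factory=dict)  # e.g. precision, recall

# Example: the success ratio of sigma3(eta7()) on the Flix dataset (Table 4)
record = ExtractionRecord(
    program="sigma3(eta7())",
    precondition="MyData are similar to IRI3",
    postcondition="IMDbId is correct",
    confidence=0.736,
    training_data="Flix movie titles",
    metrics={"correct_ratio": 0.736},
)
```

Such records could be published alongside the extractor code, so that a later client can judge whether a remembered extraction is likely to succeed on similar data.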
         Hence we are able to express our extraction experience          Table 7: Computed variables for each pair of user and
      in statements like                                                                       movie
                               ϕ  [α]x ψ
                                                                              Variable                                      Description
      where ϕ is a statement about data D1 before extraction
      (preconditions), ψ is a statement about data/knowledge D2,       α5                         Equality of the most frequent
      K2 after extraction (postconditions), α is the program used     GENREMATCH                user’s genre and the movie’s genre
      for extraction. Modality [α]x can be weighted, describing
      uncertainty aspects of learning.
         Lot of additional reification about learning can be
      helpful.
         The main idea of this paper is that if there are some data
                                                                      4.2          Results
      D1’ similar to D1 and ϕ is true in some degree – e.g.
      because both resources were created using same template -          Based on these attributes, we learned a linear model of
      then after using α we can conclude with high                    RATING as a function of
      certainty/probability that the statement ψ will be true on      (CNT+BAYESAVG+MAVG+USERSHIFT+GENREMAT
      data D2’ (knowledge K2’).                                       CH).
         For instance the formula                                        Table 8 summarizes the train mean square error, test
                                                                      mean square error and for comparison the mean square
         “MyData are similar to IRI3”  [σ3η7]0.736 “IMDBId is        error of the 'zero' model predicting always the overall
      correct”                                                        average of training data. These are the uncertainty part of
                                                                      our dynamic logic.
         Experiments with extraction and integration of movie
      data can serve as an example of this. In the next chapter we                       Table 8: Result summarization
      would like illustrate how this influences recommendation.
                                                                             Dataset               MRSS             MRSS         MTSS
                                                                                                  train             test        train
      4          Preference Learning Experiments                            Flix                    0.690           0.714        1.100
         To show usability of extracted and annotated data, we              Twitter                 1.420           1.660        2.890
      provide experiments in area of recommender systems.                   MovieLens               0.607           0.633        1.070


      4.1        Data Preprocessing
                                                                          We can see that in all datasets, the difference between
         We selected all Twitter users with ratings less or equal     train and test error is very small compared with the results
to 10, random 3000 MovieLens users and random 3000 Flix
users.
   For these, we split the RATING data by assigning the last
(according to time stamp) 5 records of each user as test
data; the remaining data was used as train data.
   Based on the train data, we calculated aggregated variables:

             Table 5: Computed variables for each movie

   Variable          Description
   α1 CNT            Number of ratings for a movie
   α2 MAVG           Average rating for a movie
   α3 BAYESAVG       (GLOBAL.AVERAGE*50 + MAVG*CNT) / (CNT+50)

             Table 6: Computed variables for each user

   Variable          Description
   α4 USERSHIFT      The average of (user.rating - BAYESAVG)
                     over the user's rated movies

… of the zero model. This means the model is not overfitted.
   Since the Twitter dataset uses the scale 0 to 10, compared to
the scale 0 to 5 for Flix and MovieLens, the error
differences cannot be compared directly.
   The models may be compared by the R2 statistic, the
amount of variance explained by the model:

   R2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2)

Here R[i] is the ith residual, y[i] is the rating of the ith record,
and y* is the average rating over all train records in the dataset.
Its range is from 0 (useless model) to 1 (perfect prediction).
   In Table 9 we can see significant differences between the
datasets, ranging from 0.506 for Twitter (the best
improvement) through 0.356 for MovieLens to 0.262
for Flix.

                     Table 9: R2 statistics

   Dataset           R2
   Flix              0.262
   Twitter           0.506
   MovieLens         0.356
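The aggregated variables of Table 5 and the R2 statistic can be sketched as follows. This is a minimal illustration, not the authors' code; it assumes the ratings are available as a list of (user, movie, rating) tuples.

```python
from collections import defaultdict

def movie_variables(train, k=50):
    """Compute CNT, MAVG and BAYESAVG (Table 5) for each movie.
    `train` is a list of (user, movie, rating) tuples; k=50 is the
    damping constant from the BAYESAVG formula."""
    global_avg = sum(r for _, _, r in train) / len(train)
    by_movie = defaultdict(list)
    for _, movie, rating in train:
        by_movie[movie].append(rating)
    result = {}
    for movie, ratings in by_movie.items():
        cnt = len(ratings)                                    # α1 CNT
        mavg = sum(ratings) / cnt                             # α2 MAVG
        bayesavg = (global_avg * k + mavg * cnt) / (cnt + k)  # α3 BAYESAVG
        result[movie] = (cnt, mavg, bayesavg)
    return result

def r2_score(ratings, predictions, train_mean):
    """R2 = 1 - Sum(R[i]^2) / Sum((y[i] - y*)^2),
    where y* is the average rating over all train records."""
    ss_res = sum((y - p) ** 2 for y, p in zip(ratings, predictions))
    ss_tot = sum((y - train_mean) ** 2 for y in ratings)
    return 1.0 - ss_res / ss_tot
```

A prediction equal to the train mean everywhere yields R2 = 0, matching the "useless model" end of the scale.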




   In Table 10 we show preliminary results on testing
repeatability. We trained the model on the dataset in the
row and tested it on the test data in the column. Unsurprisingly,
in each column the best result lies on the diagonal.

                  Table 10: Repeatability tests

   Dataset           Flix test   Twitter test   MovieLens test
   Flix train          0.714        1.731          0.635
   Twitter train       0.723        1.658          0.639
   MovieLens train     0.715        1.679          0.633
   Zero model          1.100        2.890          1.070

   As the zero model we use the movie rating average MAVG.
   In this research proposal we do not evaluate the role of
similarity; we only illustrate the similarity of our datasets.
   In Figure 5 we show the MAVG function on a sample of
movies (with IMDB IDs). Table 11 shows the MAVG
distances below the diagonal. So far it is not clear to us which
metric to use to compute similarity – Euclidean or cosine?
Further experiments are required, as this can depend on the
domain.

     Figure 5: Towards calculating similarity – which vectors,
                    which metrics? Here MAVG.

   Maybe the right idea is to calculate similarity content
based. We illustrate this in Figure 6 with the behavior of
statistics on genres. Table 11 shows the genre-based distances
in the cells above the diagonal.

                Figure 6: Maybe genres play a role?

     Table 11: Distance of datasets. Below diagonal by MAVG,
                   above diagonal with Genres

                  Flix      Twitter   MovieLens
   Flix           0.00      0.10      0.14
   Twitter        0.51      0.00      0.06
   MovieLens      0.42      0.54      0.00

5         Proposal, Conclusions, Future Work

   We have provided preliminary experiments with the
reusability of our algorithms. The results are promising, but
we still need more extensive testing.

5.1          Proposal of Reusability Quality

   Similarly to Linked Data quality assessment, we can
imagine an analogous assessment for reusability.
   The main idea is that such an assessment will be less sensitive
to URL changes, migration and the "end-of-project" phenomenon.
One can imagine this information being published at
https://sourceforge.net/ or a similar service. What follows is
a vision we would like to discuss:
   6★ Reusability extraction describes the algorithm,
training data, metrics and results.
   7★ Reusability extraction extends a 6★ one by
additional similarity measures and thresholds for successful
reuse. A corresponding formula can look like

   “Data ≈0.2 IRI3” → [σ3η7]0.654 “IMDBId is correct”

   8★ Reusability extraction describes a 7★ one in a more
extensive way, with several different data examples and
similarities. This can increase the chance that for a given
domain you find a solution which fits your data.
   9★ Reusability extraction assumes an 8★ one in
a server/cloud implementation. You do not have to solve
the problem of the extractor not running properly in your
environment.
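The open Euclidean-vs-cosine question above can be made concrete with a small sketch. The toy MAVG vectors below are made up for illustration (they are not the paper's data); the example shows one relevant difference: cosine distance is insensitive to a uniform rating-scale change, which matters because Twitter uses a 0-10 scale while Flix and MovieLens use 0-5.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two MAVG vectors
    (movies aligned by a shared IMDB ID order)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; ignores magnitude, compares direction only."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy example: the same rating profile on a doubled scale.
flix_mavg = [3.1, 2.8, 4.0]
twitter_mavg = [6.2, 5.6, 8.0]
# cosine_distance(flix_mavg, twitter_mavg) is 0 (same direction),
# while euclidean_distance reports a large gap from the scale alone.
```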




   10★ Reusability extraction assumes a 9★ one in a more
user-friendly way: you just upload your data (or their URL),
the system finds a solution, and you can download the result.
One can also imagine this enhanced with some
social-network interaction.

5.2      Conclusions

   We have presented a research proposal for improving the
degree of automation of web information extraction and
annotation. We propose a formal dynamic logic model for
automated web annotation with similarity and reliability.
   We illustrated our approach by an initial prototype and
experiments on recommendation on movie data (annotated
and integrated).

5.3      Future work

   The challenge is twofold:
   - extend this to other domains
   - provide a deeper analysis of data mining and possible
similarities
   We can consider some more complex algorithms for
preference learning, e.g., based on spreading activation
[GG].

Acknowledgement

Research was supported by the Czech project Progres Q48.

References

[A] J. Arwe (IBM): Coping with Un-Cool URIs in the
Web of Linked Data. LEDP – Linked Enterprise Data
Patterns workshop: Data-driven Applications on the Web,
6-7 December 2011, Cambridge, MA. Hosted by W3C/MIT
[BTMC] K. Bontcheva, V. Tablan, D. Maynard,
H. Cunningham: Evolving GATE to Meet New
Challenges in Language Engineering. Natural Language
Engineering, 10(3/4):349-373, 2004
[D1] J. Dedek: Semantic Annotations, 2012,
http://www.ksi.mff.cuni.cz/~dedek/publications/Dedek_PhD_Semantic_Annotations.pdf
[D2] J. Dedek: Semantic Czech,
http://czsem.sourceforge.net/
[F1] D. Fiser: Semantic annotation of domain dependent
data (in Czech). Master thesis, Charles University, Prague
2011
[F2] D. Fiser: Semantic annotator.
https://chrome.google.com/webstore/detail/semantic-annotator/gbphidobolopkilhnpkbagodhalimojj,
http://www.doser.cz/projects/semantic-annotator/
(User Guide and Installation Guide (server part) in Czech)
[GG] L. Grad-Gyenge: Recommendations on a Knowledge
Graph, 2015
[HKT] D. Harel, D. Kozen, J. Tiuryn: Dynamic Logic
(Foundations of Computing). The MIT Press, 2000
[K] J. Kuchar: Augmenting a Feature Set of Movies
Using Linked Open Data. Proceedings of the RuleML 2015
Challenge, Berlin, Germany. Published by CEUR
Workshop Proceedings, 2015
[PLEDVF] L. Peska, I. Lasek, A. Eckhardt, J. Dedek,
P. Vojtas, D. Fiser: Towards web semantization and user
understanding. In EJC 2012, Y. Kiyoki et al. (Eds.), Frontiers
in Artificial Intelligence and Applications 251, IOS Press
2013, pp. 63-81
[T] Twitter data from the RecSys 2014 challenge,
http://2014.recsyschallenge.com/
[UFAL] S. Cinková, J. Hajič, M. Mikulová, L. Mladová,
A. Nedolužko, P. Pajas, J. Panevová, J. Semecký,
J. Šindlerová, J. Toman, Z. Urešová, Z. Žabokrtský:
Annotation of English on the tectogrammatical level.
Technical Report 35, UFAL MFF UK, 2006,
http://ufal.mff.cuni.cz/pedt/papers/TR_En.pdf
[V+] R. Verborgh et al.: Querying Datasets on the
Web with High Availability. In: P. Mika et al. (Eds.), The
Semantic Web – ISWC 2014, Lecture Notes
in Computer Science, vol. 8796, Springer, Cham, 2014