=Paper=
{{Paper
|id=Vol-1885/228
|storemode=property
|title=Repeatable Web Data Extraction and Interlinking
|pdfUrl=https://ceur-ws.org/Vol-1885/228.pdf
|volume=Vol-1885
|authors=Michal Kopecký,Marta Vomlelová,Peter Vojtáš
|dblpUrl=https://dblp.org/rec/conf/itat/KopeckyVV17
}}
==Repeatable Web Data Extraction and Interlinking==
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 228–234, CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 M. Kopecky, M. Vomlelova, P. Vojtas

M. Kopecky1, M. Vomlelova2, P. Vojtas1
Faculty of Mathematics and Physics, Charles University, Malostranske namesti 25, Prague, Czech Republic
1{kopecky|vojtas}@ksi.mff.cuni.cz, 2marta@ktiml.mff.cuni.cz

Abstract. We would like to make all web content usable in the same way as it is in 5-star Linked (Open) Data. We face several challenges: either there are no LOD datasets in the domain of interest, or the data project is no longer maintained, or something is broken (links, SPARQL endpoint etc.). We propose a dynamic logic extension of the semantic model, in which data can also carry information about their creation process. We evaluate this on several movie datasets. In this work in progress we provide some preference-learning experiments over extracted and integrated data.
Keywords: repeatable experiments; web data extraction, annotation, linking; dynamic logic; preference learning

1 Introduction, Motivation, Recent Work

For our decisions we often need automated processing of integrated web data. Linked (Open) Data are one possibility to achieve this vision. Still, there are some challenges. Production URLs are sometimes subject to change ([A]). Data migrate, data projects run out of their contracted sustainability period, and so on. SPARQL endpoints are expensive for the server and not always available for all datasets. Downloadable dumps are expensive for clients and do not allow live querying on the Web ([V+]). In some areas there are no corresponding Linked Data projects available at all. Imagine, e.g., a customer looking for a car who would like to aggregate all web data.

Our idea is to remember previous successful extractions in a given domain and use them in the current situation. Social networks can also help with the evaluation of previous extractions. We first presented this idea in [PLEDVF]. We concentrated on one specific purpose – extracting object attributes and using these data in recommender systems. In this research we have tried to increase the degree of automation of web content processing. We presented several methods for mining web information and assisted annotations.
1.1 Semantic Annotator for (X)HTML

A tool for assisted annotation is available in [F2]. Semantic Annotator allows both manual and assisted annotation of web pages directly in Google Chrome. It requires no complicated installation and is available on all platforms and devices where Google Chrome can be installed. Semantic annotation is thus available to all current users of the Internet, not only on the authors' site. The browser extension Semantic Annotator began as a prototype implementation in the thesis [F1].

The Google Chrome extension Semantic Annotator is used for manual semantic annotation of web sites. The goal of semantic annotation is to assign meaning to each part of the page. The meaning of real-world objects and their properties and relationships is described in dictionaries – either self-created or imported. The annotation process then consists of selecting parts of the site and assigning meaning from the dictionary.

Figure 1: Illustrative figure for Semantic Annotator, more in [F1], [F2]

In Figure 1 we show an example of annotation of product pages in an e-shop. The user selects a web page and assigns the product name the meaning "product name" from the dictionary describing products. Then on the same page the user selects a product price and gives it the meaning "price". Because of the similarity of pages and the usage of templates, annotating a few pages is enough to train an annotator for the whole site. Consequently, search over the annotated website is then much more accurate.

1.2 Semantic Annotations for Texts

In the previous section we described a tool for annotation of (X)HTML pages. It is useful for extraction of attributes of products on pages created by the same template, even if the texts differ a little. Even if there are no templates, in a fixed domain like reports on traffic accidents there are still repetitions.

A tool for annotation of text was developed in our group in the PhD thesis [D1]. The tool is available under [D2]. It is built on GATE [BTMC], using UFAL dependency parsing [UFAL] and an Inductive Logic Programming extension. It represents a tool for information extraction and consequent annotation.

The thesis [D1] presents four relatively separate topics. Each topic represents one particular aspect of the information extraction discipline. The first two topics are focused on new information extraction methods based on deep language parsing in combination with manually designed extraction rules. The second topic deals with a method for automated induction of extraction rules using inductive logic programming. The third topic combines information extraction with rule-based reasoning. The core extraction method was experimentally reimplemented using semantic web technologies, which allows saving the extraction rules in so-called shareable extraction ontologies that are not dependent on the original extraction tool. See Fig. 2 on traffic accident data. The last topic of the thesis deals with document classification and fuzzy logic; the possibility of using information obtained by information extraction techniques for document classification is examined. For more see also [PLEDVF].

Figure 2: Illustrative figure of an extraction pattern for a dependency tree, more in [D1], [D2]

In this research proposal we concentrate on the synergy of annotation and integration of data for user preference learning, and consequently for recommendation. It turns out that similarity and dynamic aspects of web data play a role here as well. We propose an appropriate dynamic logic model of web data dynamics and provide some experiments. We hope this can serve as preliminary experiments for a more extended research proposal.

2 Extraction Experiments

Our main goal is: try to remember information about the context of data creation in order to enable repetition of extraction. In general we can remember algorithms, training data and metrics of success. In this chapter we describe some extraction algorithms and the process of data integration in the movie domain. These data will be used in preference learning; by this we would like to illustrate our main goal.

2.1 Integration

We use Flix data (enriched Netflix competition data), RecSys 2014 challenge data [T] and RuleML Challenge data [K]. The datasets are quite different, but they still have a few things in common. Movies have their title and usually also the year of their production. Ratings are equipped with a timestamp that allows us to order the ratings of individual users chronologically.

One of the problems was that different datasets use different MOVIEIDs, so the movies cannot be straightforwardly mapped across datasets. To achieve this mapping we wanted to enhance every movie with the corresponding IMDb identifier.

We observed that the Twitter dataset uses as its internal MOVIEID the numeric part of the IMDb identifier. So the movie "Midnight Cowboy" with Twitter MOVIEID = 64665 corresponds to the IMDb record with ID equal to 'tt0064665'. Therefore, we construct the algorithm

τ1: Twitter-MOVIEID → IMDbId

which simply concatenates the prefix 'tt' with the MOVIEID left-padded by zeroes to seven positions. The success rate of this algorithm is shown in Table 1.

Table 1: Simple identifier transformation

Algorithm | MovieLens | Flix | Twitter
τ1()      | 0%        | 0%   | 100%

To be able to assign IMDb identifiers to movies from the other datasets, we had to use the search capabilities of the IMDb database. We used an HTTP interface for searching movies according to their name. The request is sent as an HTTP request in the form http://www.imdb.com/find?q=movie+title&s=tt. The other versions of the algorithm for assigning IMDb identifiers to movies can in general be formally described as

τ2i,j: TITLE × YEAR → TT

implemented by two independent steps

τ2i,j(TITLE, YEAR) = σj(ηi(TITLE), TITLE, YEAR)

where the algorithm ηi transforms a movie title to the query string needed for the IMDb search, while the algorithm σj then looks for the correct record in the returned table. The simplest implementation of the algorithms can be described as follows:

η1: TITLE → TABLE
σ1: TABLE × TITLE × YEAR → TT

where the η1 algorithm concatenates all words of the title with the plus sign and the σ1 algorithm returns TT in case the resulting table contains exactly one record. The results of this combination of algorithms are shown in Table 2 (ratio of correct answers).

Table 2: The simplest version of IMDb search by title name

Algorithm | MovieLens | Flix  | Twitter
σ1(η1())  | 42.7%     | 51.2% | Not needed

To illustrate different algorithms for the same extraction task we describe another version. Here the algorithm is not learned, but hand-crafted.
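The τ1, η1 and σ1 mappings described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code; in particular, representing the IMDb result table as a Python list is our assumption.

```python
def tau1(movie_id: int) -> str:
    """tau1: Twitter MOVIEID -> IMDb identifier.

    Concatenates the prefix 'tt' with the MOVIEID left-padded
    by zeroes to seven positions."""
    return "tt" + str(movie_id).zfill(7)


def eta1(title: str) -> str:
    """eta1: build the IMDb search query string by joining
    all words of the title with the plus sign."""
    return "+".join(title.split())


def sigma1(table, title, year):
    """sigma1: return the identifier only when the result table
    contains exactly one record, otherwise give up.
    (title and year are unused here; later sigma versions use them.)"""
    return table[0] if len(table) == 1 else None
```

For the running example, `tau1(64665)` yields `'tt0064665'` and `eta1("Midnight Cowboy")` yields the query fragment `Midnight+Cowboy`.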
One of the reasons for the relatively low effectiveness of the σ1(η1()) algorithm was the sub-optimal query string used for the IMDb search, due to quite different naming conventions for movie titles in the different datasets. To improve the results we enhanced the movie title transformation incrementally and produced new versions of it. Every new version added a new step of transformation of the movie title:

η2: Convert all letters in the movie title to lower case.
η3: If the movie title contains the year of production in brackets at its end, remove it.
η4: If the movie title still contains text in brackets at its end, remove it. This text usually contains the original name of the movie in its original language.
η5: Move the word "the", respectively "a"/"an", from the end of the title to the beginning.
η6: Translate the characters "_", ".", "?" and "," to spaces.
η7: Translate "&" and "&amp;" in titles to the word "and".

For example, the η7 transformation changes the title "Official Story, The (La Historia Oficial) (1985)" to its canonical form "the official story". This version of the transformation then constructs the IMDb query in the form http://www.imdb.com/find?q=the+official+story&s=tt&… and then looks up the resulting table to find the identifier "tt0089276". The results of this combination of algorithms were:

Table 3: The more complex version of IMDb search by title name

Algorithm | MovieLens | Flix  | Twitter
σ1(η7())  | 45.4%     | 70.9% | Not needed
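The steps η2–η7 can be chained into one canonicalization routine. A sketch follows; the paper does not specify the implementation, so the use of regular expressions here is our choice.

```python
import re


def canonicalize_title(title: str) -> str:
    """Chain of the title transformations eta2..eta7 (illustrative sketch)."""
    t = title.lower()                                  # eta2: lower case
    t = re.sub(r"\s*\(\d{4}\)\s*$", "", t)             # eta3: drop trailing "(1985)"
    t = re.sub(r"\s*\([^)]*\)\s*$", "", t)             # eta4: drop trailing "(...)"
    m = re.match(r"^(.*),\s*(the|a|an)\s*$", t)        # eta5: ", the" -> front
    if m:
        t = m.group(2) + " " + m.group(1)
    t = re.sub(r"[_.?,]", " ", t)                      # eta6: punctuation to spaces
    t = t.replace("&amp;", "and").replace("&", "and")  # eta7: ampersand to "and"
    return re.sub(r"\s+", " ", t).strip()              # collapse repeated spaces
```

On the example above, `canonicalize_title("Official Story, The (La Historia Oficial) (1985)")` produces `"the official story"`.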
In the optimal case, the table returned from the IMDb search contains exactly one row with the requested record. In this situation the algorithm σ1 behaves well and is able to retrieve the correct IMDb identifier. In many other cases the result contains more rows, and the correct one or the best possible one has to be identified. For this purpose we constructed more versions of the σj algorithm as well:

σ2: The correct record should be from the requested year, so search only for records from this year and ignore other records.
σ3: The IMDb search provides more levels of tolerance in title matching. Try to use three of them, from the most exact one to the most general. If a matching record from the requested year cannot be found using a stricter search, the next search level is used.

Currently, we have 13 081 out of all 17 770 Flix movies mapped onto the IMDb database. Even all 27 278 movies from the MovieLens set are mapped to the equivalent IMDb record. So the current results provided by the combination of the most advanced versions of the algorithms are:

Table 4: The most complex version of IMDb search by title name

Algorithm | MovieLens | Flix  | Twitter
σ3(η7())  | 100.0%    | 73.6% | Not needed

Figure 3: Numbers of movies mapped onto the IMDb database

The diagram in Figure 3 shows the number of movies in the individual datasets and the number of movies assigned to their corresponding IMDb record. The number of movies associated to an IMDb record differs between the intersections after the integration. For example, the MovieLens dataset contains in total 27 278 movies. Of these, 13 134 associated movies are unique to MovieLens, and 3 759 associated movies are common with the Flix dataset but do not exist in the Twitter dataset. The number of movies common to all three datasets is 4 075. By summing 3 759 and 4 075 we get a total of 7 654 associated movies belonging to both the MovieLens and Flix datasets, etc.
2.2 Extraction of Attributes

For each movie registered in the IMDb database we then retrieved XML data from the URL address http://www.omdbapi.com/?i=ttNNNNNNN&plot=full&r=xml and from the XML data retrieved the following movie attributes: IMDb title (/root/movie/@title), IMDb rating (/root/movie/@imdbRating), IMDb rated (/root/movie/@rated), IMDb awards (/root/movie/@awards), IMDb metascore (/root/movie/@metascore), IMDb year (/root/movie/@year), IMDb country (/root/movie/@country), IMDb language (/root/movie/@language), IMDb genres (/root/movie/@genre), IMDb director (/root/movie/@director), IMDb actors (/root/movie/@actors).

In a similar way as the movies from the datasets are mapped onto IMDb movies, we implemented the mapping technique described in [K] and assigned DbPedia1 identifiers and semantic data to IMDb movies. The DbPedia identifier of a movie is a string, for example "The_Official_Story" or "The_Seventh_Seal". This identifier can then be used to access the DbPedia graph database directly, or to retrieve data in an XML format through a URL address of the form http://dbpedia.org/page/DbPediaIdentifier. From the data available on the DbPedia page, the movie attributes GENRE, GENRE1, ACTION, ADVENTURE, ANIMATION, CHILDRENS, COMEDY, CRIME, DOCUMENTARY, DRAMA, FANTASY, FILM_NOIR, HORROR, MYSTERY, MUSICAL, ROMANCE, SCI_FI, THRILLER, WAR, WESTERN, or the attributes CALIF, LA, NY, CAMERON, VISUAL, SEDIT, NOVELS, SMIX, SPIELBERG, MAKEUP, WILLIAMS and many others can be extracted directly or indirectly.

1 http://wiki.dbpedia.org/
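The OMDb attribute extraction sketched above reads XML attributes addressed by the XPath-like paths /root/movie/@…. A minimal sketch with the standard library follows; the sample XML below is an assumed response shape (attribute values are illustrative), not a recorded OMDb reply.

```python
import xml.etree.ElementTree as ET

# Assumed response shape, inferred from the /root/movie/@... paths in the text;
# a real response would come from http://www.omdbapi.com/?i=tt0064665&plot=full&r=xml
xml_data = """<root response="True">
  <movie title="Midnight Cowboy" year="1969" rated="X"
         imdbRating="7.8" genre="Drama" country="USA"/>
</root>"""


def extract_attributes(xml_text: str) -> dict:
    """Read the movie attributes addressed as /root/movie/@... ."""
    movie = ET.fromstring(xml_text).find("movie")
    keys = ["title", "year", "rated", "imdbRating", "genre", "country"]
    return {k: movie.get(k) for k in keys}
```

`extract_attributes(xml_data)` then returns a dictionary with one entry per attribute, e.g. `"title": "Midnight Cowboy"`.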
3 A Dynamic Logic Model for Web Annotation

For effective use of changing and/or increasing information we have to evolve the tools (e.g. inductive methods) used for creation of a specific web service (here recommendation of movies). Our goal is to extend the semantic web foundations to enable describing the creation, dynamics and uncertain/stochastic character of our knowledge. To describe the reliability of extraction algorithms we propose a "half-way" extension of dynamic logic.

Our reference for dynamic logic is the book of D. Harel, D. Kozen and J. Tiuryn [HKT]. Dynamic logic has two types of symbols: propositions/formulas ϕ, ψ ∈ Φ and programs α, β ∈ Π. One can construct a program also from a formula by the test ϕ?, and formulas also by the generalized modality operations [α], <α>. The expression <α>ϕ says that it is possible to execute α and halt in a state satisfying ϕ; the expression [α]ϕ says that whenever α halts, it does so in a state satisfying ϕ. The main goal of dynamic logic is reasoning about programs, e.g. in program verification. In our case programs will be extractors/annotators and can be kept propositional, as for now we are not interested in the procedural details of extractors. Formulas will be more expressive in order to be able to describe the context of extraction.

Using the example above, let

ϕ be the statement that a Twitter data entry has title "Midnight Cowboy" and MOVIEID = 64665;
α be the algorithm concatenating the prefix 'tt' with the MOVIEID left-padded by zeroes to seven positions; and
ψ say that the movie "Midnight Cowboy" corresponds to the IMDb record with ID equal to 'tt0064665'.

The corresponding dynamic logic expression is

∀x(ϕ(x) → [α]ψ(x))

saying that whenever α starts in a state satisfying ϕ(m1), then whenever α halts, it does so in a state satisfying ψ(m1) – see the illustration in Fig. 4.

Figure 4: Extraction data enriched

Programs (extractors) remain propositional; states correspond to different representations of content on the web. On each state the respective semantics is defined using an appropriate query language. Our logic has expressions of two sorts, and each sort is, respectively can be, typed:

Statements about web data can be either atomic, e.g. Φ0RDF, Φ0FOL, Φ0RDB, Φ0XML, Φ0DOM, Φ0BoW, Φ0PoS, Φ0DepTree, etc., or more complex, e.g. ϕRDF, ψFOL, etc., with the corresponding data model and query-language-based semantics. All can be subject to uncertainty and probability extensions.

Programs (propositional) are atomic: Π0σ for subject extraction, Π0π for property extraction or Π0ω for object value extraction in the case of HTML, XHTML or XML data; Π0ner for named entity extraction in the case of text data; etc., and more complex: ασπω, βσπω, γσπω, etc. In this logic we do not prove any statements about programs depending on their code, so program names point to code one would reuse.

Statements are typically accompanied by information about program creation, like the data mining tool, training data, metrics (e.g. precision, recall), etc. There is also a lot of reification describing the training and testing data and the metrics of learning. Our model is based on dynamic logic, calculates similarity of states and describes dynamics and similarities of data.
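The reification described above (extractor name, training data, metrics, weighted modality) can be pictured as one record per extraction statement. The sketch below is purely illustrative; the field names are our assumptions, not a schema from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractionStatement:
    """One statement 'phi [alpha]_x psi' with reification about its learning.

    Field names are hypothetical, chosen only to mirror the text."""
    precondition: str          # phi: statement about data before extraction
    program: str               # alpha: name pointing to reusable extractor code
    postcondition: str         # psi: statement about data after extraction
    weight: float              # x: uncertainty weight of the modality [alpha]_x
    training_data: str = ""    # reification: what the extractor was trained on
    metrics: dict = field(default_factory=dict)  # e.g. precision, recall


# The formula instance from the text, stored as such a record:
stmt = ExtractionStatement(
    precondition='MyData are similar to IRI3',
    program='sigma3(eta7())',
    postcondition='IMDBId is correct',
    weight=0.736,
    metrics={'success_ratio': 0.736},
)
```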
Hence we are able to express our extraction experience in statements like

ϕ [α]x ψ

where ϕ is a statement about the data D1 before extraction (preconditions), ψ is a statement about the data/knowledge K2 after extraction (postconditions), and α is the program used for extraction. The modality [α]x can be weighted, describing the uncertainty aspects of learning. A lot of additional reification about learning can be helpful.

The main idea of this paper is that if there are some data D1' similar to D1 and ϕ is true to some degree – e.g. because both resources were created using the same template – then after using α we can conclude with high certainty/probability that the statement ψ will be true on the data D2' (knowledge K2'). For instance, the formula

"MyData are similar to IRI3" [σ3η7]0.736 "IMDBId is correct"

illustrates the uncertainty part of our dynamic logic. Experiments with extraction and integration of movie data can serve as an example of this. In the next chapter we illustrate how this influences recommendation.

4 Preference Learning Experiments

To show the usability of the extracted and annotated data, we provide experiments in the area of recommender systems.
4.1 Data Preprocessing

We selected all Twitter users with 10 or fewer ratings, a random 3000 MovieLens users and a random 3000 Flix users. For these, we split the RATING data by assigning the last (according to timestamp) 5 records of each user as test data; the remaining data were used as training data. Based on the training data, we calculated aggregated variables:

Table 5: Computed variables for each movie

Variable    | Description
α1 CNT      | Number of ratings for a movie
α2 MAVG     | Average rating for a movie
α3 BAYESAVG | (GLOBAL.AVERAGE*50 + MAVG*CNT) / (CNT+50)

Table 6: Computed variables for each user

Variable     | Description
α4 USERSHIFT | Average of (user.rating − BAYESAVG) over the user's rated movies

Table 7: Computed variables for each pair of user and movie

Variable      | Description
α5 GENREMATCH | Equality of the user's most frequent genre and the movie's genre

4.2 Results

Based on these attributes, we learned a linear model of RATING as a function of (CNT + BAYESAVG + MAVG + USERSHIFT + GENREMATCH). Table 8 summarizes the train mean square error, the test mean square error and, for comparison, the mean square error of the 'zero' model always predicting the overall average of the training data.

Table 8: Result summarization

Dataset   | MRSS train | MRSS test | MTSS train
Flix      | 0.690      | 0.714     | 1.100
Twitter   | 1.420      | 1.660     | 2.890
MovieLens | 0.607      | 0.633     | 1.070

We can see that in all datasets the difference between train and test error is very small compared with the results of the zero model. This means the model is not overfitted. Since the Twitter dataset uses a scale of 0 to 10, compared to the scale of 0 to 5 for Flix and MovieLens, the error differences cannot be compared directly.

The models may be compared by the R2 statistic, the amount of variance explained by the model:

R2 = 1 − Sum_i(R[i]^2) / Sum_i((y[i] − y*)^2)

Here R[i] is the i-th residual, y[i] is the rating of the i-th record, and y* is the average rating over all training records in the dataset. Its range is from 0 (useless model) to 1 (perfect prediction). In Table 9 we can see significant differences between the datasets, ranging from 0.506 for Twitter (best improvement) compared to 0.356 for MovieLens and 0.262 for Flix.

Table 9: R2 statistics

Dataset   | R2
Flix      | 0.262
Twitter   | 0.506
MovieLens | 0.356
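The BAYESAVG variable from Table 5 and the R2 statistic can be written down directly from their formulas. A minimal sketch, following the definitions in the text:

```python
def bayes_avg(global_average: float, mavg: float, cnt: int) -> float:
    """BAYESAVG = (GLOBAL.AVERAGE*50 + MAVG*CNT) / (CNT + 50)

    Shrinks a movie's average rating toward the global average
    when the movie has few ratings."""
    return (global_average * 50 + mavg * cnt) / (cnt + 50)


def r_squared(ratings, predictions, train_average):
    """R2 = 1 - sum(residual^2) / sum((y - y*)^2), where y* is the
    average rating over all training records in the dataset."""
    ss_res = sum((y - p) ** 2 for y, p in zip(ratings, predictions))
    ss_tot = sum((y - train_average) ** 2 for y in ratings)
    return 1 - ss_res / ss_tot
```

With CNT = 0, `bayes_avg` returns exactly the global average; a perfect prediction yields R2 = 1, the zero model yields R2 = 0.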
In Table 10 we show preliminary results on testing repeatability. We trained the model on the dataset in the row and tested on the test data in the column. It is no surprise that in each column the best result is on the diagonal.

Table 10: Repeatability tests

Dataset         | Flix test | Twitter test | MovieLens test
Flix train      | 0.714     | 1.731        | 0.635
Twitter train   | 0.723     | 1.658        | 0.639
MovieLens train | 0.715     | 1.679        | 0.633
Zero model      | 1.100     | 2.890        | 1.070

As the zero model we use the movie rating average MAVG. In this research proposal we do not evaluate the role of similarity; we just illustrate the similarity of our datasets. In Figure 5 we show the MAVG function on a sample of movies (with IMDb IDs). Table 11 shows MAVG distances below the diagonal. So far it is not clear to us which metric to use to compute similarity – Euclidean or cosine? Further experiments are required, as this can depend on the domain.

Figure 5: Towards calculating similarity – which vectors, which metrics? Here MAVG.

Maybe the right way to calculate similarity is content-based. We illustrate this in Fig. 6 with the behavior of statistics on genres. Table 11 shows genre-based distances in the cells above the diagonal.

Figure 6: Maybe genres play a role?

Table 11: Distance of datasets. Below the diagonal by MAVG, above the diagonal by genres.

          | Flix | Twitter | MovieLens
Flix      | 0.00 | 0.10    | 0.14
Twitter   | 0.51 | 0.00    | 0.06
MovieLens | 0.42 | 0.54    | 0.00
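The two candidate metrics for dataset similarity can be sketched side by side. The MAVG vectors below are hypothetical sample values for illustration only, not the vectors behind Table 11.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long rating vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for identical directions, 1 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm


# Hypothetical MAVG vectors over the same sample of movies in two datasets
flix_mavg = [3.1, 3.5, 2.9]
movielens_mavg = [3.3, 3.4, 3.0]
d_euc = euclidean(flix_mavg, movielens_mavg)
d_cos = cosine_distance(flix_mavg, movielens_mavg)
```

The two metrics can rank dataset pairs differently, which is exactly the open question raised above.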
5 Proposal, Conclusions, Future Work

We have provided preliminary experiments with the reusability of our algorithms. The results are promising, but we still need more extensive testing.

5.1 Proposal of Reusability Quality

Similarly to Linked Data quality assessment, we can imagine an analogous assessment for reusability. The main idea is that this will be less sensitive to URL changes, migration and the "end-of-project phenomenon". One can imagine that this information is published at https://sourceforge.net/ or a similar service. What follows is a vision we would like to discuss, extending the 5-star scheme of Linked (Open) Data:

6★ Reusability extraction describes the algorithm, training data, metrics and results.

7★ Reusability extraction extends a 6★ one by additional similarity measures and thresholds for successful reuse. A corresponding formula can look like "Data ≈0.2 IRI3" [σ3η7]0.654 "IMDBId is correct".

8★ Reusability extraction describes a 7★ one in a more extensive way, with several different data examples and similarities. This can increase the chance that for a given domain you find a solution which fits your data.

9★ Reusability extraction assumes an 8★ one in a server/cloud implementation. You do not have to solve problems with the extractor not running properly in your environment.

10★ Reusability extraction assumes a 9★ one in a more user-friendly way: you just upload your data (or their URL), the system finds a solution and you can download the result. It is also possible to imagine this enhanced with some social network interaction.

5.2 Conclusions

We have presented a research proposal for improving the degree of automation of web information extraction and annotation. We propose a formal dynamic logic model for automated web annotation with similarity and reliability. We illustrated our approach by an initial prototype and experiments on recommendation on movie data (annotated and integrated).
5.3 Future Work

The challenge is twofold:
- extend this to other domains,
- provide a deeper analysis of data mining and possible similarities.

We can consider some more complex algorithms for preference learning, e.g. based on spreading activation [GG].

Acknowledgement

This research was supported by the Czech project Progres Q48.

References

[A] J. Arwe: Coping with Un-Cool URIs in the Web of Linked Data. LEDP – Linked Enterprise Data Patterns workshop, Data-driven Applications on the Web, 6–7 December 2011, Cambridge, MA, hosted by W3C/MIT

[BTMC] K. Bontcheva, V. Tablan, D. Maynard, H. Cunningham: Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering, 10(3/4):349–373, 2004

[D1] J. Dedek: Semantic Annotations. PhD thesis, Charles University, Prague 2012, http://www.ksi.mff.cuni.cz/~dedek/publications/Dedek_PhD_Semantic_Annotations.pdf

[D2] J. Dedek: Semantic Czech, http://czsem.sourceforge.net/

[F1] D. Fiser: Semantic annotation of domain dependent data (in Czech). Master thesis, Charles University, Prague 2011

[F2] D. Fiser: Semantic Annotator, https://chrome.google.com/webstore/detail/semantic-annotator/gbphidobolopkilhnpkbagodhalimojj, http://www.doser.cz/projects/semantic-annotator/ (User Guide and Installation Guide (server part) in Czech)

[GG] L. Grad-Gyenge: Recommendations on a Knowledge Graph, 2015

[HKT] D. Harel, D. Kozen, J. Tiuryn: Dynamic Logic (Foundations of Computing). The MIT Press, 2000

[K] J. Kuchar: Augmenting a Feature Set of Movies Using Linked Open Data. Proceedings of the RuleML 2015 Challenge, Berlin, Germany, CEUR Workshop Proceedings, 2015

[PLEDVF] L. Peska, I. Lasek, A. Eckhardt, J. Dedek, P. Vojtas, D. Fiser: Towards Web Semantization and User Understanding. In: EJC 2012, Y. Kiyoki et al. (Eds.), Frontiers in Artificial Intelligence and Applications 251, IOS Press 2013, pp. 63–81

[T] Twitter data from the RecSys 2014 challenge, http://2014.recsyschallenge.com/

[UFAL] S. Cinková, J. Hajič, M. Mikulová, L. Mladová, A. Nedolužko, P. Pajas, J. Panevová, J. Semecký, J. Šindlerová, J. Toman, Z. Urešová, Z. Žabokrtský: Annotation of English on the Tectogrammatical Level. Technical Report 35, UFAL MFF UK, 2006, http://ufal.mff.cuni.cz/pedt/papers/TR_En.pdf

[V+] R. Verborgh et al.: Querying Datasets on the Web with High Availability. In: P. Mika et al. (Eds.), The Semantic Web – ISWC 2014, Lecture Notes in Computer Science, vol. 8796, Springer, Cham, 2014