Semantic Annotation to Support Description of the Art Market

Dominik Filipiak, Krzysztof Węcel, Agata Filipowska
Department of Information Systems, Poznań University of Economics
al. Niepodległości 10, 61-875 Poznań, Poland
{dominik.filipiak,krzysztof.wecel,agata.filipowska}@kie.ue.poznan.pl

ABSTRACT
The estimation of prices on the art market has been investigated as a research topic for many years, but new approaches to this problem have been applied only recently. One of these approaches concerns extending data on a work of art with data from the Internet to improve the quality of assessment. This, however, creates a lot of challenges, mostly regarding information extraction. Semantic annotation and enrichment of the crawled data enable additional reasoning and introduce new features into existing methods, resulting in a better estimation of indices for the art market.

The problem tackled by this paper is as follows: what kind of semantic enrichment of documents collected from the Internet can be introduced to extend the data on the artwork and influence the efficiency and quality of indices calculated for artworks?

Categories and Subject Descriptors
I.2 [Artificial Intelligence]: Natural Language Processing; J.5 [Arts and Humanities]

General Terms
Econometrics

Keywords
Art Market Analysis, Linked Data, Semantic Tagging, Econometrics

1. INTRODUCTION

1.1 Art Market
With all their invaluable qualities, artworks are often treated as a type of asset, just like stocks or bonds. The idea of considering art as a form of alternative investment can be perceived as controversial. Nonetheless, this approach to artworks has recently been gaining more and more attention¹. As a consequence, numerous studies have been carried out to explore the topic.

Surprisingly, artworks (such as paintings) traded in auction houses can be described with a decent number of variables, regardless of one's art history knowledge: the name of the author, the medium used, the size of the painting, and the initial and hammer prices, to name a few. This data is often provided by the institutions taking care of the sale. Sometimes, among other features, a long text description is associated with the painting or the artist. This is a valuable source of information, but as it is provided in natural language, it has to be processed before machine-based data processing is possible.

1.2 Problem Statement
Since auction houses have started to frequently publish sales results on the Internet, the perception of the art market has changed. Numerous services have started to collect data and even prepare market reports. Minimisation of information asymmetry was not the only consequence of this step. With a sufficient amount of high-quality data, research on the art market can finally be conducted in a data science manner. The use of hedonic regression on semantically enriched information is presented in this paper as a basis for building art market indices. The concept of using regression in art market analysis has been studied intensively but, to the best of our knowledge, the use of semantic data enrichment constitutes a new contribution to this field.

The problem tackled by this paper was defined as follows: what kind of semantic enrichment of data published on the Internet, e.g. by auction houses, can be introduced to extend the data on the artwork and influence the efficiency and quality of the calculation of artworks' indices?

2. SOLUTION

2.1 Architecture of the Solution
In order to tackle the presented problem, a solution offering a four-step processing pipeline has been designed. These steps concern data collection, data refinement, information extraction and data enrichment.
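The four steps can be read as a chain of functions over auction lot records. The Python sketch below is only an illustration under stated assumptions, not the implementation behind the paper: the record fields, the reference list `KNOWN_ARTISTS`, the lookup table `ARTIST_FACTS` and all function names are hypothetical, and the crawler is replaced by two invented records. Fuzzy matching of misspelled artist names and the detection of a signature or an edition number are the examples of refinement and extraction that the paper itself gives.

```python
import difflib
import re

# Hypothetical reference list of canonical artist names (assumption).
KNOWN_ARTISTS = ["Wojciech Kossak", "Jerzy Kossak", "Jan Matejko"]

# Hypothetical enrichment data; a real system would query an external source.
ARTIST_FACTS = {"Wojciech Kossak": {"nationality": "Polish"}}


def collect():
    """Step 1: data collection (stand-in for a crawler; records are invented)."""
    return [
        {"artist": "Wojciech Kosak", "description": "Oil on canvas, signed lower right."},
        {"artist": "Jan Matejko", "description": "Lithograph, edition 12/50."},
    ]


def refine(lot):
    """Step 2: data refinement, fixing misspelled artist names by fuzzy matching."""
    match = difflib.get_close_matches(lot["artist"], KNOWN_ARTISTS, n=1, cutoff=0.8)
    return {**lot, "artist": match[0] if match else lot["artist"]}


def extract(lot):
    """Step 3: information extraction from the free-text description."""
    text = lot["description"]
    edition = re.search(r"\b(\d+)\s*/\s*(\d+)\b", text)
    return {
        **lot,
        "signed": re.search(r"\bsigned\b", text, re.IGNORECASE) is not None,
        "edition": (int(edition.group(1)), int(edition.group(2))) if edition else None,
    }


def enrich(lot):
    """Step 4: data enrichment with attributes about the identified artist."""
    return {**lot, **ARTIST_FACTS.get(lot["artist"], {})}


def run_pipeline():
    """Run steps 1 to 4 in order and return the enriched lot records."""
    return [enrich(extract(refine(lot))) for lot in collect()]
```

Calling `run_pipeline()` returns one dictionary per lot, with the artist name normalised and the extracted and enriched attributes added as extra keys. Keeping each step a pure function over a single record makes it easy to swap in a real crawler or an external annotation service later.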
The solution may be perceived as a sort of framework, which is a base for future research.

A reasonable number of observations regarding sales in auction houses must be collected to facilitate building an effective prediction model. Therefore, the first step concerns data collection. Numerous services provide historical sales information. Artprice² and Artnet³ are the most prominent examples of data providers. However, these sites are often subscription-based and do not provide data in a parse-friendly format, not to mention various legal issues. As a consequence, the data collection must be performed by dedicated crawlers operating on the pages of numerous auction houses.

Data refinement and cleansing are indispensable in order to obtain robust results. For example, due to human error, some observations contain misspelled artist names. This issue may be resolved by applying various fuzzy string matching algorithms. According to the so-called "garbage in, garbage out" principle, this step is crucial to assure the quality of the experiment's results.

The third step concerns information extraction from the collected documents. Auction lots are often described with unstructured text which contains useful information. For instance, the presence of a signature on a painting or an edition number on a lithograph may carry important information influencing the hammer price. Due to the complexity of this process, possible approaches are described in detail in section 2.2.

Data enrichment, the final step, makes use of the annotated entities in order to provide complementary information. Although minimising information asymmetry on the art market is an obvious goal behind this approach, there are various other applications of enriched data. These possibilities are covered in section 2.3.

2.2 Annotation – From Text to Triples
Ontologies used for annotation make it easier for people and machines to understand the text. Document retrieval can be significantly improved when additional relations from an ontology are leveraged. For example, we can ask for documents containing information about impressionists without having to know the names of individual artists. Also, any document containing the phrase "oil painting" will be classified as a "document about art media".

DBpedia Spotlight⁴ is a tool for automatically annotating mentions of DBpedia resources in text. It classifies entities according to the DBpedia ontology. Two modes are available. In the first one (candidates), it spots the potential mentions (either statistically or based on a gazetteer) and retrieves the candidate DBpedia resources bound to Wikipedia. In the second mode (annotate), it additionally disambiguates the candidates and links each mention to the best one. One of the strong points of DBpedia Spotlight is the richness of language resources that can be used for indexing by the underlying engine. Depending on the language, additional processing is sometimes required to improve the recall of spotting. For example, for the Polish language an external morphology analyser is necessary to normalise various inflectional forms.

Several solutions also build on many ontologies at once. NERD (Named Entity Recognition and Disambiguation)⁵ unifies numerous named entity extractors using the NERD ontology, which provides a set of axioms aligning the various underlying taxonomies. Mappings are established manually. According to the documentation, several extractors are supported, including DBpedia Spotlight, OpenCalais and Zemanta.

A similar "meta-approach" is taken by Apache Stanbol, a general framework for semantic enhancement of unstructured text. DBpedia Spotlight can work as an EnhancementEngine for Stanbol. Stanbol also links to several other external services via enhancement modules, for example: Named Entity Linking Engine (suggests links to linked data sources), FST Linking Engine (links entities indexed in a Solr index), Geonames Enhancement Engine (links to geonames.org, with hierarchical links for locations), OpenCalais (both NER and entity linking), Zemanta Enhancement Engine (both NLP and entity linking).

One of the best solutions regarding disambiguation is the Dandelion⁶ service offered by Spaziodati. The integration is much deeper than in the case of NERD, where only the ontologies were aligned. It builds its own knowledge graph, which allows for much better ranking and thus more precise disambiguation of the mentions. Figure 1 presents a sample annotation of an artwork with the Dandelion API.

Figure 1: Named entity recognition by Dandelion API

2.3 Data Enrichment
Sometimes it is hard to distinguish data extraction from enrichment; very often these phases are combined. Once an entity has been identified in the text, additional data can be retrieved. Semantic enrichment sometimes covers phases three and four of the proposed approach, that is, information extraction and data enrichment stemming from semantic annotation. In the context of data mining, links to external information result in additional attributes, which can improve the quality of the predictive model [1].

There is a large number of open data sources that may enrich the data on artworks currently available on the Web. The domain of fine art requires a more thoughtful selection, as not all datasets contain relevant information. The obvious source is DBpedia⁷. The recent version of DBpedia (2014) provides information about more than 1,445,000 people and 411,000 creative works in its English edition alone. Other language chapters can potentially provide additional data, particularly when the DBpedia language matches the nationality of the artist.

An intensive effort has been observed to align artists' descriptions in Wikipedia with external authoritative sources, when possible. Therefore, we can expect more precise and more complete information about at least some of the artists. At the bottom of the Wikipedia article for some people one can find "Authority control" with links to external sources, e.g. VIAF (Virtual International Authority File), ISNI (International Standard Name Identifier), ULAN (Union List of Artist Names). The number of available references depends on the popularity of the artist. For example, there are 11 entries for the famous Polish painter Wojciech Kossak and only four for his lesser-known son (see Figure 2).

Figure 2: Authority control record for Wojciech Kossak

VIAF is a joint project of several national libraries. It is apparently the biggest dataset, with information on over 35 million names⁸. The provenance information is kept for each piece of data, which is useful as some discrepancies still exist. For example, Figure 3 presents different dates of birth of Wojciech Kossak.

Figure 3: VIAF records for Wojciech Kossak

ISNI is an ISO 27729 global standard number for identifying contributors to creative works. It currently (2015) holds information about more than 8 million individuals. ULAN⁹ is particularly interesting as it focuses on artists' names, holding information on about 120,000 of them. The basic information includes given names (in multiple languages), pseudonyms and variant spellings, i.e. various surface forms (almost 300,000). Such information is crucial for finding mentions of an artist in text. The search interface allows finding all artists with a given name (Figure 4), and the individual page also provides relations between artists (Figure 5).

Figure 4: ULAN search results for 'Kossak'

Figure 5: Information about Wojciech Kossak in ULAN

Another dataset offered by Getty is also relevant to our research: the Art & Architecture Thesaurus (AAT)¹⁰. It contains terms useful in the description of art techniques (see Figure 6). A set of Getty's vocabularies is available as linked open data (LOD)¹¹, making it a perfect fit for supplementing annotations in an ontological form.

Figure 6: AAT classification of cochinilin pigment

As we keep data in Open Refine, it would be convenient to use one of the extensions for named entity recognition. That would allow us to extend our data about a certain artwork with additional attributes, thus leading to better predictive models. In this context, the RDF extension (developed by DERI Galway), with such functionalities as reconciling against SPARQL endpoints or RDF dumps and exporting to RDF, might be used. The DBpedia extension (by Zemanta) added the possibility to extend reconciled data with data from DBpedia and to extract entities from full-text descriptions via the Zemanta API. Regarding integration, one of the most comprehensive solutions is LODGrefine, developed within the LOD2 project¹². Unfortunately, it is targeted at the English language, and we need to adapt it for Polish. It also does not contain domain-specific ontologies such as ULAN.

To conclude, the way various tools conduct analysis is very similar. In fact, only two aspects make these solutions different: the underlying dictionary and the ability to disambiguate entities. None of the solutions offers direct support for the Polish language. These aspects open a space for our improvements. Our solution will be based on DBpedia Spotlight with an index built from the Polish language resources of the Polish DBpedia, supplemented with domain-specific ontologies like ULAN.

3. MARKET INDICES PREDICTION
The semantically annotated data used while describing artworks may improve the process of creating indices for the art market. Art market indices are built to outline general trends and to measure the market's volatility and overall value. Comparison of artworks with more traditional forms of assets (like bonds) or searching for a correlation between various economic factors and the behaviour of the market complement the rationale behind constructing indices [2].

Currently, two ways to develop art market indices are the most popular: repeat-sales and hedonic regression. The first method takes into account all items sold at least twice and calculates indices based on the proportion of the first and the second sale prices. Probably the most notable example of this approach is the Mei & Moses Art Index [4]. Its weakness relates to the fact that artworks are considered a long-term form of investment, which results in a relatively small amount of data to build on. Therefore, many researchers have employed hedonic regression [3]. It is a form of linear regression which relates the auction lot hammer price (in this case) to various features of an artwork and, separately, to its year of sale. The Ordinary Least Squares method is used to estimate the coefficients. A simple example of this linearised model is presented in equation 1.

$$\ln P_{it} = \alpha + \sum_{j=1}^{z} \beta_j X_{ij} + \sum_{t=0}^{\tau} \gamma_t D_{it} + \varepsilon_{it}, \qquad (1)$$

where $\ln P_{it}$ represents the natural logarithm of the price of a given painting $i \in \{1, 2, ..., N\}$ at time $t \in \{1, 2, ..., \tau\}$; $\alpha$, $\beta$ and $\gamma$ are regression coefficients for estimated characteristics. $X_{ij}$ represents the hedonic variables included in the model, whereas $D_{it}$ stands for time dummy variables.

Considered hedonic variables are, for example, the artist's name, the painting's size, year of creation and other related features describing a given painting. Indirect information, such as the time of death of the artist, may also be included. In some cases information can be missing; therefore this method is considered to be prone to selection bias.

Having a wide range of relevant data is one of the most important steps in the index calculation process. This is therefore the place where the approach discussed in the paper can be used to yield more accurate indices. More complete data with extracted variables (such as the mentioned presence of a signature, or the edition in the case of lithographs) allows a more sophisticated representation of a painting to be built. Used in equation (1), it results in more accurate coefficients representing the various sales periods ($\gamma_t$). These coefficients are actually employed to construct indices:

$$\mathrm{Index}_{t+1} = \frac{e^{\gamma_t}}{e^{\gamma_{t+1}}}, \qquad (2)$$

which can be used to measure and visualise overall art market performance through different periods.

4. CONCLUSIONS
Nowadays, we deal with an increasing popularity of investment in artworks. This creates the need for various methods of estimating the prices of these artworks. Therefore, researchers and practitioners work on methods enabling market description and price estimation. This also relates to the indices developed for the art market that were addressed in the paper.

The paper presented an approach in which semantic processing enriches the data available to current methods of estimating indices for the art market. It discussed the data sources and proposed a semantically-based document processing pipeline. Currently, this approach is being implemented and the first results seem promising.

Having annotated descriptions of artists and artworks, it is possible to conduct further research. A complementary, detailed and semantically enriched catalogue raisonné obtained in the previously mentioned process could be a valuable source of information for performing art market analysis itself. In addition, well-structured data may pave the way towards the use of methods from graph theory, topic labelling or even machine learning.

5. REFERENCES
[1] C. d'Amato, P. Berka, V. Svátek, and K. Węcel, editors. Proc. of the International Workshop on Data Mining on Linked Data collocated with ECMLPKDD 2013, Prague, Czech Republic, Sep. 23, 2013, volume 1082 of CEUR Workshop Proceedings. CEUR-WS.org, 2013.
[2] V. Ginsburgh, J. Mei, and M. Moses. The Computation of Prices Indices. In Handbook of the Economics of Art and Culture, volume 1, pages 947–979. Elsevier, 2006.
[3] R. Kräussl and N. van Elsland. Constructing the True Art Market Index - A Novel 2-Step Hedonic Approach and its Application to the German Art Market. 2008.
[4] J. Mei and M. Moses. Art as an Investment and the Underperformance of Masterpieces. NYU Finance Working Paper, (FIN-01-012):1–23, 2001.

¹ An interesting case is provided by the National Bank of Hungary, which started to invest in artworks: http://blogs.wsj.com/emergingeurope/2014/03/31/hungary-central-bank-to-buy-art/
² http://www.artprice.com
³ https://www.artnet.com/price-database/
⁴ http://spotlight.dbpedia.org
⁵ http://nerd.eurecom.fr/
⁶ https://dandelion.eu/
⁷ http://dbpedia.org
⁸ VIAF Annual Report 2014, http://www.oclc.org/content/dam/oclc/viaf/OCLC-2014-VIAF-Annual-Report-to-VIAF-Council.pdf
⁹ http://www.getty.edu/research/tools/vocabularies/ulan/index.html
¹⁰ http://www.getty.edu/research/tools/vocabularies/aat/index.html
¹¹ http://www.getty.edu/research/tools/vocabularies/lod/
¹² http://lod2.eu

SEMANTiCS 2015, Vienna, Austria
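Equations (1) and (2) of Section 3 can be illustrated with a small, dependency-free Python sketch. It is a simplification under stated assumptions rather than the authors' implementation: size is the only hedonic variable, the sales data are supplied by the caller, and the dummy for period 0 is dropped so that the intercept and the time dummies are not collinear (a common convention, assumed here, which fixes $\gamma_0 = 0$). All function names are hypothetical. The function `index_step` follows equation (2) exactly as printed in the paper.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def hedonic_gammas(sales, periods):
    """Estimate the time-dummy coefficients gamma_t of equation (1).

    sales: list of (price, size, period) tuples, period in 0..periods-1.
    Returns [gamma_0, ..., gamma_{periods-1}] with gamma_0 fixed at 0.
    """
    rows = [
        [1.0, size] + [1.0 if period == t else 0.0 for t in range(1, periods)]
        for _, size, period in sales
    ]
    y = [math.log(price) for price, _, _ in sales]
    coef = ols(rows, y)  # [alpha, beta_size, gamma_1, ..., gamma_{periods-1}]
    return [0.0] + coef[2:]


def index_step(gamma_t, gamma_next):
    """Equation (2) as printed: Index_{t+1} = e^{gamma_t} / e^{gamma_{t+1}}."""
    return math.exp(gamma_t) / math.exp(gamma_next)
```

For noise-free synthetic prices generated from known coefficients, `hedonic_gammas` recovers those coefficients up to floating-point error, which is a convenient sanity check before applying it to real sales. Additional hedonic variables, such as a signature flag extracted from the lot description, would simply become further columns in `rows`.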