J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 23–26, CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 M. Linková, P. Gurský

Attributes Extraction from Product Descriptions on e-Shops

Michaela Linková, Peter Gurský

Institute of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, Jesenná 5, 040 01 Košice, Slovakia
michaela.linkova@student.upjs.sk, peter.gursky@upjs.sk

Abstract. Some e-shops present product attributes in structured form, but many others use a textual description only. Attributes of products are essential in automated product deduplication. We suggest methods for automated extraction of attributes and their values from product descriptions into a structured form. The structured data extracted from other e-shops are used as background knowledge.

1 Introduction

Nowadays there is an increasing interest in effective extraction of information from large amounts of data. The problem of searching for and obtaining relevant information is handled by several areas of computer science. Project Kapsa [1] deals with extraction and unification of information from web pages, focusing on products on e-shops. The aim of the project is the creation and management of a collection of products offered by e-shops. A crucial part of processing the e-shops' data is the deduplication of products, i.e. the decision whether any two products extracted from different e-shops are the same. To increase the precision of the deduplication, structured data about the products (product properties and their values) are essential.

Although some e-shops present attributes of products in table form, many other e-shops provide a textual product description only. The descriptions usually contain values of many product properties and are written in natural language.

This work-in-progress paper presents our current methods of automatic extraction of product attributes with their values from product descriptions. Products have attributes of 3 main types: String, number with unit and Boolean. Each type is presented differently in natural language. Therefore we propose a separate extraction method for each attribute type.

2 State of the Art

To extract a product attribute/property with its value from a text description, we need to recognize that the attribute and/or its value are mentioned in the text. Named-entity recognition (NER) is a research area close to our problem. NER is the information extraction task of identifying and classifying mentions of people, organizations, locations and other named entities within text. Approaches to NER are surveyed in [3]. The dominant technique for addressing the NER problem is supervised learning. A usual NER method consists of tagging words of a test corpus when they are annotated as entities in the (rather big) training corpus. Semi-supervised techniques decrease the amount of manual annotation needed to train a classifier. Typically, the sentences in Wikipedia articles are considered annotated, because they contain context links to other Wikipedia pages. The titles of such pages are then considered to be the names of the entities and their URLs become the identifiers. The common learning setting for supervised and semi-supervised techniques is the processing of annotated texts. The majority of learning models process entity names as well as surrounding words. Many learning approaches have been used to handle NER: Hidden Markov Models [4], decision trees [5], Support Vector Machines [6], Conditional Random Fields [7].

Another approach, similar to NER, is terminology / entity / term extraction. The goal of terminology extraction is to automatically extract relevant terms from text, typically based on a vocabulary of domain-relevant (possibly multi-word) terms. A typical approach is to extract term candidates using linguistic processors and filter them using statistical and/or machine learning methods. The C-value/NC-value method [8] is an example. To handle multi-word terms, the methods usually use n-grams, that is, combinations of n words appearing in the corpus.

3 Background Knowledge

Unlike general named entity recognition, as a part of natural language processing, we can profit from knowledge of the product domain and drastically reduce the number of possible entities to search for in a product description. The domain can be determined from the product web presentation, since it is usually presented at a specific position on every product detail page of the e-shop.

The second advantage is the structured and annotated data of the product domain in the background knowledge. These data are extracted from the e-shops that present attributes in a structured form, i.e. in tables. Therefore, we can use the dictionary of the attribute names in different languages (English, Slovak …) and variations (synonyms, abbreviations) for each attribute of a given product domain. Similarly, we can use various forms of units' names (e.g. kg, kilograms, kilos, kilogramov, kíl …). Our background database also contains unit conversions between convertible units (e.g. grams vs. kg). Finally, the attribute types and the list of extracted values of each attribute and product domain are stored in the background knowledge.

The annotation of attributes in Kapsa [1] is a semiautomatic process driven by an administrator in a web GUI. The input for the annotation is a list of attribute names and values in String form for each product extracted from e-shop web pages, possibly with some additional tags. The annotation produces a set of rules that determine the product domain, attribute identification (including attribute deduplication), attribute type, value and unit extraction, etc. If the product domain or attribute is already annotated for another e-shop, the annotator usually just plays the role of the validator of an automatic annotation.

4 Extraction Methods

Our analysis of products' descriptions showed that attributes and their values are presented differently for Boolean, String and number types in natural language. Since the background knowledge holds the type of each attribute we are searching for, we can apply a dedicated extraction method for each type. All of the presented methods are still works in progress and represent the baseline methods for future work. Some possible modifications that we believe can improve the quality of the methods are proposed at the ends of the following sections as well as in the experiments section.

4.1 Extraction of Boolean Attributes

Our method for extraction of Boolean attributes assumes that the presence of the attribute name in the product description implies that the value of the product's attribute is 'true'. The method searches for every variation of the attribute name (languages and synonyms) that is present in the background knowledge. If the attribute name is matched, the attribute with value 'true' is sent to the output result.

Since quite a lot of the attributes were misspelled or inflected in our test data, we have replaced the exact search of the attribute names by a fuzzified search using Levenshtein (editing) distance. The threshold for a positive result was set to a 75% match of the attribute name. The method uses the fuzzified search only if the exact match is not found.

We believe that our method can be improved by stemming or lemmatization of the attribute names and words of the product description, so that inflections are treated as equivalent to an exact match. Another improvement can be to include commonly misspelled attribute names in the background knowledge. However, it requires an administrator intervention and converts our automatic method to a semiautomatic one.

Our method does not recognize sentences as positive or negative. So the sentence "The mobile phone does not have thermosensor" induces the result that the product has the Boolean attribute thermosensor with value 'true'. The approaches known from sentiment analysis can be incorporated to cover this problem.

4.2 Extraction of Numeric Attributes

Figure 1: Extraction of attributes having number type

Attributes of number type have their values composed of a number and a unit (12 g, 15 cm, 42 '', 2 pieces…). Our method is composed of 4 main steps. First, the method searches for attribute names in the same way as the method extracting the Boolean attribute names. Next, all the variants of the attribute unit and the variants of units convertible to this unit are searched for in the sentence where the attribute name was found. The only exception is the unit pieces, because it is not common in natural language sentences (e.g. 2 shelves instead of 2 pieces of shelves). In our method the search for the units requires an exact match. If both the attribute name and the unit were found, then, in the third step, the numbers are searched for in the sentence using the regular expression "([0-9]+)(\\ )*(\\.)*(\\,)*([0-9]*)". If the sentence contains more than one number, the number closest to the attribute name is selected in the 4th step.

The extraction method can be extended to cover word variants of the numbers (i.e. one, two, twenty-three...), but it requires a new dictionary for each language. Stemming and lemmatization can also be used for the unit search (in the Slovak language there are 3 variants for singular and plural forms of units, e.g. kilogram, kilogramy, kilogramov).

4.3 Extraction of String Attributes

String attributes are the type most sensitive to the size of the background knowledge. The specialty of this type is that the attribute values are often self-explanatory and the attribute name isn't necessary. For example, in the sentence "This Candy GC41472D1S Washing Machine with stylish Silver finish looks great in any home." three String attributes can be found: producer (Candy), product name (Candy GC41472D1S) and color (Silver). If the washing machine was already extracted from another e-shop in structured form, all the String values are present in the background data and can be used to identify the attributes.

The extraction method for String attributes first searches for attribute names in the same way as the previous methods do. If the attribute name is found, values of the same attribute extracted from all products of the same domain are searched for in the same sentence. If the value is found, the attribute-value pair is sent to the result. String attributes of the product domain which were not found in the first step are searched for only by their known values. Since each value corresponds to some attribute in the background knowledge, it is easy to send the attribute-value pair to the result. The implemented method does not use the fuzzified search of the attribute values in product descriptions.

Similarly to the attribute name search, the attribute value search could be extended with stemming and lemmatization, so that inflections are treated as equivalent to an exact match.

5 Experiments

To verify the methods, we created test data containing real e-shop product descriptions of 2 domains: fridges and tablets. We have selected 20 products from each domain. 10 descriptions were in English and 10 were in Slovak. We have manually selected attributes and their values that appeared in the descriptions and typed them into the test table. Each product description was an input for our extraction methods and the results were compared to the manually selected ones.

Tablet descriptions contained 4 Boolean attributes, 1 String attribute and 5 number attributes. Fridge descriptions contained 4 Boolean attributes, 4 String attributes and 9 number attributes.

The background knowledge was created by extraction of structured data from 2 e-shops with table representation of attributes. The data contained 142 tablets and 41 fridges¹. All attributes found in the test data descriptions were present in the background knowledge.

¹ Dataset is available at: http://kapsa.sk/2017-itat-dataset.zip

The results of our tests are summed up in Tables 1 and 2, separately for English and Slovak descriptions.

Table 1. Precision (P), recall (R) and F-score (F) for English descriptions, in %

Domain         numeric                Boolean                String
               P      R      F        P      R      F        P      R      F
tablet         100    97.47  98.7     100    100    100      100    50     66.67
refrigerator   100    100    100      100    100    100      100    100    100
average        100    98.8   99.4     100    100    100      100    85.71  92.31

Table 2. Precision (P), recall (R) and F-score (F) for Slovak descriptions, in %

Domain         numeric                Boolean                String
               P      R      F        P      R      F        P      R      F
tablet         87.5   20.59  33.34    100    78.95  88.24    100    70     82.35
refrigerator   100    50     66.67    80     88.88  84.21    80     70.59  75
average        96.15  35.71  52.08    92     82.14  89.79    86.36  70.37  77.55

5.1 Results for Numeric Attributes

The method for attributes of numeric type correctly found 98.8% of all attribute name and value pairs in English descriptions, but only 35.71% of pairs in Slovak descriptions. Such a low recall in our test has several causes. We have analyzed the results and identified the following problems:

• the absence of synonymic names of the given attribute in the background dictionary,
• the absence of the synonymic unit of the attribute value,
• presence of an abbreviation instead of the full form of the attribute name, or missing words of the full multi-word terms,
• missing attribute name (just the value and units were present in the description),
• different order of words in the multi-word name of the attribute, and
• other words inserted into the multi-word name of the attribute.

The first three problems are caused by a small dictionary. After adding more e-shops to the background knowledge, they should become less important. Different e-shops can use different terminology and unit abbreviations, which expands the background knowledge dictionary.

The sentence "V chladničke je možné uchovávať 225 l potravín v 4 sklenených poličkách" (en. It is possible to store 225 l of groceries on 4 glass shelves) mentions the volume of the refrigerator and the number of shelves in the refrigerator, but because the full names of the attributes are not present in the sentence, the method for numeric types did not find these product properties in the sentence. A definite solution for the missing attribute name problem would probably not be easy. One approach can be to use the attribute values' units. If the unit found in the description is used by only one known attribute of the product domain, the value and unit can be assigned to that attribute.

The last two problems concern multi-word names. A possible solution is to search for each word of the term separately. If each word of the multi-word term is found in the same sentence, then we can declare a match. It is possible that automatic morphological analysis of the sentence can improve this approach, because it can reveal the connections between words and reduce false matches of such a method.

The precision of the method is decreased by fuzzy matches, where the 75% threshold on the editing distance was too generous and matched words with different meanings. We can improve the precision by using stemming or lemmatization instead of fuzzy matching with editing distance. Another improvement can be achieved by accepting fuzzy matched words only if they are not present in a standard dictionary of the language, i.e. they are probably misspelled.

5.2 Results for Boolean Attributes

The method for the Boolean type of attribute was the most successful in finding attributes. Using this method, all the required attributes were found in the English descriptions and 82.14% of the attributes in the Slovak descriptions. The reason for not finding attributes in our tests within Slovak descriptions was similar to the synonymic variations mentioned for the previous method. Concretely, the term in our dictionary had fewer words than the term in the description, because some words were split into two words there. Since we do fuzzy comparisons word-by-word, it made the match less than 75%.

For example, the sentence "Už žiadna námraza, Technológia No Frost zabraňuje vzniku námrazy a udržiava konštantnú teplotu v celej chladničke" (en. No more frost cover, the technology No Frost prevents frost creation and keeps constant temperature throughout the fridge) didn't match our two-word term Technológia NoFrost. The solution would be to add Technológia No Frost to the dictionary.

Since we used Levenshtein distance to search for a name, the method found two attributes in two descriptions that were not there. These were the Auto Defrost and NoFrost attributes.

5.3 Results for String Attributes

The method for attributes of String type is special, because it does not need the attribute name. This causes ambiguity in the attribute assignment.

For example, in the sentence "Farba kombinovanej chladničky Goddness je biela." (en. The color of the Goddness fridge is white.), the value biela (en. white) is appropriate for both the attribute color and the attribute color of the front of the refrigerator.

The second problem is again the small dictionary, this time the dictionary of known attribute values. For example, in the sentence "Pri hrúbke len 6,1 mm je vôbec najtenší iPad zároveň aj najschopnejší" (en. Having the depth of only 6.1 mm, it is the thinnest iPad as well as the most capable.), the method did not find the attribute "product name", since the value iPad is not in the value dictionary. Again, to remove the problem of the absence of an attribute value, it is sufficient to increase the set of attribute values in the dictionary.

The precision was decreased by false fuzzy matches of the attribute value with a different word. Again, we can improve the precision by using stemming or lemmatization instead of fuzzy matching with editing distance.

6 Conclusions

This work-in-progress paper presents our baseline algorithms for automatic extraction of attribute-value pairs from product descriptions on e-shops. We divided attributes into 3 main types: Boolean, String and numeric. Boolean attributes are matched if the name is found in the description. String attributes are searched for by matching the pair of attribute name and value, or by value only. Numeric attributes require three things to be found: attribute name, number and unit.

We have tested our methods against real world data, analyzed the results and proposed improvements that we plan to incorporate into our methods in the future.

This work was supported by the Agency of the Slovak Ministry of Education for the Structural Funds of the EU, under project CeZIS, ITMS: 26220220158.

References

[1] Project Kapsa, web page: http://kapsa.sk/
[2] J. Nothman et al.: Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence 194 (2013) 151–175
[3] D. Nadeau, S. Sekine: A survey of named entity recognition and classification. Lingvisticae Investigationes 30 (2007) 3–26
[4] D. M. Bikel et al.: Nymble: a High-Performance Learning Name-finder. In ANLP-97, Washington, D.C., pp. 194–201, 1997
[5] J. Cowie: Description of the CRL/NMSU System Used for MUC-6. In Proceedings of the Sixth Message Understanding Conference, Morgan Kaufmann, 1995
[6] J. M. Castillo et al.: Named Entity Recognition Using Support Vector Machine for Filipino Text Documents. International Journal of Future Computer and Communication, Vol. 2, No. 5, October 2013
[7] J. Lafferty, A. McCallum, F. Pereira: Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In proceedings of ICML, pp. 282–289, 2001
[8] K. Frantzi, S. Ananiadou, J. Tsujii: The C-value/NC-value Method of Automatic Recognition of Multi-word Terms. In proceedings of ECDL, pp. 585–604. ISBN 3-540-65101-2, 1998