J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 23–26
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 M. Linková, P. Gurský


                        Attributes Extraction from Product Descriptions on e-Shops
                                                 Michaela Linková, Peter Gurský
                                                   Institute of Computer Science
                                       Faculty of Science, P.J.Šafárik University in Košice
                                                Jesenná 5, 040 01 Košice, Slovakia
                             michaela.linkova@student.upjs.sk, peter.gursky@upjs.sk

Abstract. Some e-shops present product attributes in structured form, but many others provide
only a textual description. Product attributes are essential for automated product
deduplication. We propose methods for the automated extraction of attributes and their values
from product descriptions into a structured form. The structured data extracted from other
e-shops are used as background knowledge.

1 Introduction

There is an increasing interest in the effective extraction of information from large amounts
of data. The problem of searching for and obtaining relevant information is addressed by
several areas of computer science. Project Kapsa [1] deals with the extraction and unification
of information from web pages, focusing on products on e-shops. The aim of the project is to
create and manage a collection of products offered by e-shops. A crucial part of processing
the e-shops’ data is the deduplication of products, i.e. deciding whether two products
extracted from different e-shops are the same. To increase the precision of the deduplication,
structured data about the products (product properties and their values) are essential.

Although some e-shops present product attributes in table form, many others provide a textual
description only. The descriptions usually contain the values of many product properties and
are written in natural language.

This work-in-progress paper presents our current methods for the automatic extraction of
product attributes with their values from product descriptions. Products have attributes of
3 main types: String, number with unit, and Boolean. Each type is expressed differently in
natural language. Therefore we propose a separate extraction method for each attribute type.

2 State of the Art

To extract a product attribute/property with its value from a text description, we need to
recognize that the attribute and/or its value are mentioned in the text. Named-entity
recognition (NER) is a research area close to our problem. NER is the information extraction
task of identifying and classifying mentions of people, organizations, locations and other
named entities within text. Approaches to NER are surveyed in [3]. The dominant technique for
addressing the NER problem is supervised learning. A usual NER method tags words of a test
corpus when they are annotated as entities in the (rather big) training corpus.
Semi-supervised techniques decrease the amount of manual annotation needed to train
a classifier. Typically, the sentences in Wikipedia articles are considered annotated, because
they contain context links to other Wikipedia pages. The titles of such pages are then
considered to be the names of the entities and their URLs become the identifiers. The common
setting for supervised and semi-supervised techniques is the processing of annotated texts.
The majority of learning models process entity names as well as the surrounding words. Many
learning approaches have been used to handle NER: Hidden Markov Models [4], decision trees [5],
Support Vector Machines [6], Conditional Random Fields [7].

Another approach, similar to NER, is terminology / entity / term extraction. The goal of
terminology extraction is to automatically extract relevant terms from text, typically based
on a vocabulary of domain-relevant (possibly multi-word) terms. A typical approach is to
extract term candidates using linguistic processors and filter them using statistical and/or
machine learning methods. The C-value/NC-value method [8] is an example. To handle multi-word
terms, the methods usually use n-grams, that is, combinations of n words appearing in the
corpus.

3 Background Knowledge

Unlike general named entity recognition in natural language processing, we can profit from
knowledge of the product domain and drastically reduce the number of possible entities to
search for in a product description. The product domain can be determined from the product’s
web presentation, since it is usually shown at a specific position on every product detail
page of the e-shop.

The second advantage is the structured and annotated data of the product domain in the
background knowledge. These data are extracted from e-shops that present attributes in
structured form as tables. Therefore, we can use the dictionary of attribute names in
different languages (English, Slovak …) and variations (synonyms, abbreviations) for each
attribute of a given product domain. Similarly, we can use various forms of unit names
(e.g. kg, kilograms, kilos, kilogramov, kíl …). Our background database also contains unit
conversions between convertible units (e.g. grams vs. kg). Finally, the attribute types and
the list of extracted values for each attribute and product domain are stored in the
background knowledge.

The annotation of attributes in Kapsa [1] is a semiautomatic process driven by an
administrator in a web GUI. The input for the annotation is a list of attribute names and
values in String form for each product extracted from e-shop web pages, possibly with some
additional tags. The annotation produces a set of rules that determine the product domain,
attribute identification (including attribute deduplication), attribute type, value and unit
extraction, etc. If the product domain or attribute is already annotated for another e-shop,
the annotator usually just validates an automatic annotation.
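As an illustration of the Background Knowledge section, the following minimal sketch shows one way such a dictionary of attribute name variants, unit variants and unit conversions could be organized. The record layout, the names `Attribute`, `UNIT_VARIANTS`, `CONVERSIONS` and `convert`, and the sample entries are assumptions made for this example; the paper does not describe the actual Kapsa schema.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Attribute:
    """One attribute of a product domain (hypothetical record layout)."""
    name: str
    attr_type: str                                        # "boolean", "string" or "number"
    name_variants: Set[str] = field(default_factory=set)  # languages, synonyms, abbreviations
    unit: Optional[str] = None                            # canonical unit of number attributes
    known_values: Set[str] = field(default_factory=set)   # values seen on other e-shops


# Surface forms of unit names mapped to a canonical unit (sample entries only).
UNIT_VARIANTS = {
    "kg": "kg", "kilograms": "kg", "kilos": "kg", "kilogramov": "kg", "kíl": "kg",
    "g": "g", "grams": "g",
}

# Factors that convert a canonical source unit into a target unit.
CONVERSIONS = {("g", "kg"): 0.001, ("kg", "g"): 1000.0}


def convert(value: float, unit_text: str, target_unit: str) -> float:
    """Normalize a value written with any known unit variant to target_unit."""
    unit = UNIT_VARIANTS[unit_text.lower()]
    if unit == target_unit:
        return value
    return value * CONVERSIONS[(unit, target_unit)]


# Sample attribute; "hmotnosť" is the Slovak word for weight.
weight = Attribute("weight", "number", {"weight", "hmotnosť"}, unit="g")
print(convert(2, "kilos", weight.unit))  # 2000.0
```

Keeping unit variants separate from conversion factors means a new spelling of a unit needs only one dictionary entry, while conversions stay defined between canonical units.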
                                         Figure 1: Extraction of attributes having number type
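The four steps that Section 4.2 describes for numeric attributes can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the "75% match" is interpreted here as a normalized Levenshtein similarity of at least 0.75, fuzzy matching is done word by word and so only covers single-word names, and "closest" is measured in character offsets. The regular expression is the paper's pattern rewritten as a Python raw string.

```python
import re

# The paper's pattern "([0-9]+)(\\ )*(\\.)*(\\,)*([0-9]*)" as a Python raw string.
NUMBER_RE = re.compile(r"([0-9]+)( )*(\.)*(,)*([0-9]*)")


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed reading of the '75% match': 1 - distance / longer length."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a, b) / longest


def find_name(sentence, variants, threshold=0.75):
    """Step 1: offset of the attribute name; exact match first, fuzzy as fallback."""
    low = sentence.lower()
    for variant in variants:
        pos = low.find(variant.lower())
        if pos >= 0:
            return pos
    for word in re.finditer(r"\w+", low):
        if any(similarity(word.group(), v.lower()) >= threshold for v in variants):
            return word.start()
    return None


def extract_numeric(sentence, name_variants, unit_variants):
    """Return (number_text, unit) for one attribute, or None if a step fails."""
    name_pos = find_name(sentence, name_variants)
    if name_pos is None:
        return None
    # Step 2: a unit variant must occur as an exact, standalone token.
    unit = next((u for u in unit_variants
                 if re.search(r"(?<!\w)" + re.escape(u) + r"(?!\w)", sentence)), None)
    if unit is None:
        return None
    # Step 3: collect all numbers in the sentence.
    numbers = list(NUMBER_RE.finditer(sentence))
    if not numbers:
        return None
    # Step 4: keep the number closest to the attribute name.
    closest = min(numbers, key=lambda m: abs(m.start() - name_pos))
    return closest.group().strip(), unit


print(extract_numeric("The weight of the tablet is 450 g and it has 2 cameras.",
                      {"weight"}, ["g", "grams"]))  # ('450', 'g')
```

As in the paper, this sketch has no notion of negation, and a sentence that lacks the attribute name yields no result even when a value and unit are present.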

4 Extraction Methods

Our analysis of product descriptions showed that attributes and their values are presented
differently in natural language for Boolean, String and number types. Since the background
knowledge gives us the type of each attribute we are searching for, we can use a dedicated
extraction method for each type. All of the presented methods are still works in progress and
represent baseline methods for future work. Some possible modifications that we believe can
improve the quality of the methods are proposed at the ends of the following sections as well
as in the experiments section.

4.1 Extraction of Boolean Attributes

Our method for the extraction of Boolean attributes assumes that the presence of the attribute
name in the product description implies that the value of the product’s attribute is ‘true’.
The method searches for every variation of the attribute name (languages and synonyms) that is
present in the background knowledge. If the attribute name is matched, the attribute with
value ‘true’ is sent to the output.

Since quite a lot of the attributes were misspelled or inflected in our test data, we have
replaced the exact search for attribute names by a fuzzy search using Levenshtein (edit)
distance. The threshold for a positive result was set to a 75% match of the attribute name.
The method uses the fuzzy search only if no exact match is found.

We believe that our method can be improved by stemming or lemmatization of the attribute names
and of the words of the product description, so that inflections are treated as equivalent to
an exact match. Another improvement can be to include commonly misspelled attribute names in
the background knowledge. However, this requires administrator intervention and turns our
automatic method into a semiautomatic one.

Our method does not distinguish positive sentences from negative ones. Thus the sentence “The
mobile phone does not have thermosensor” yields the result that the product has the Boolean
attribute thermosensor with value ‘true’. Approaches known from sentiment analysis can be
incorporated to cover this problem.

4.2 Extraction of Numeric Attributes

Attributes of number type have values composed of a number and a unit (12 g, 15 cm, 42 ’’,
2 pieces …). Our method is composed of 4 main steps. First, the method searches for attribute
names in the same way as the method for Boolean attributes. Next, all variants of the
attribute unit and the variants of units convertible to this unit are searched for in the
sentence in which the attribute name was found. The only exception is the unit pieces, because
it is not common in natural language sentences (e.g. 2 shelves instead of 2 pieces of
shelves). In our method the search for units requires an exact match. If both the attribute
name and the unit were found, then, in the third step, numbers are searched for in the
sentence using the regular expression “([0-9]+)(\\ )*(\\.)*(\\,)*([0-9]*)”. If the sentence
contains several numbers, the number closest to the attribute name is selected in the fourth
step.

The extraction method can be extended to cover numbers written as words (e.g. one, two,
twenty-three ...), but this requires a new dictionary for each language. Stemming and
lemmatization can also be used for the unit search (in Slovak there are 3 variants for
singular and plural forms of units, e.g. kilogram, kilogramy, kilogramov).

4.3 Extraction of String Attributes

String attributes are the type most sensitive to the size of the background knowledge. The
specialty of this type is that
Table 1. Precision (P), recall (R) and F-score (F) for English descriptions, in %

Domain        |       numeric        |       Boolean        |        String
              |   P      R      F    |   P      R      F    |   P      R      F
tablet        |  100   97.47   98.7  |  100    100    100   |  100    50    66.67
refrigerator  |  100    100    100   |  100    100    100   |  100    100    100
average       |  100   98.8    99.4  |  100    100    100   |  100   85.71  92.31


Table 2. Precision (P), recall (R) and F-score (F) for Slovak descriptions, in %

Domain        |       numeric        |       Boolean        |        String
              |   P      R      F    |   P      R      F    |   P      R      F
tablet        |  87.5  20.59  33.34  |  100   78.95  88.24  |  100    70    82.35
refrigerator  |  100    50    66.67  |   80   88.88  84.21  |   80   70.59   75
average       | 96.15  35.71  52.08  |   92   82.14  89.79  | 86.36  70.37  77.55
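The F columns of Tables 1 and 2 are consistent with the balanced F-score, i.e. the harmonic mean of precision and recall. The paper does not state the formula, so the short sketch below rests on that assumption; `f_score` is a name chosen for this example.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean) of precision and recall, both in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Tablet domain, numeric attributes, English descriptions (Table 1).
print(round(f_score(100, 97.47), 2))  # 98.72
# Tablet domain, String attributes, English descriptions (Table 1).
print(round(f_score(100, 50), 2))     # 66.67
```

Applied to the average Boolean row of Table 2 (P = 92, R = 82.14), the same formula gives about 86.79 rather than the printed 89.79, so one of the values in that row may contain a typo.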

the attribute values are often self-explanatory and the attribute name is not necessary. For
example, in the sentence “This Candy GC41472D1S Washing Machine with stylish Silver finish
looks great in any home.” three String attributes can be found: producer (Candy), product name
(Candy GC41472D1S) and color (Silver). If the washing machine was already extracted from
another e-shop in structured form, all the String values are present in the background data
and can be used to identify the attributes.

The extraction method for String attributes first searches for attribute names, as the
previous methods do. If the attribute name is found, values of the same attribute extracted
from all products of the same domain are searched for in the same sentence. If a value is
found, the attribute-value pair is sent to the result. String attributes of the product domain
that were not found in the first step are searched for only by their known values. Since each
value corresponds to some attribute in the background knowledge, it is easy to send the
attribute-value pair to the result. The implemented method does not use the fuzzy search for
attribute values in product descriptions.

Similarly to the attribute name search, the attribute value search could be extended with
stemming and lemmatization, so that inflections are treated as equivalent to an exact match.

5 Experiments

To verify the methods, we created test data containing real e-shop product descriptions from
2 domains: fridges and tablets. We have selected 20 products from each domain. 10 descriptions
were in English and 10 were in Slovak. We have manually selected the attributes and their
values that appeared in the descriptions and entered them into the test table. Each product
description was an input for our extraction methods and the results were compared to the
manually selected ones.

Tablet descriptions contained 4 Boolean attributes, 1 String attribute and 5 number
attributes. Fridge descriptions contained 4 Boolean attributes, 4 String attributes and
9 number attributes.

The background knowledge was created by extracting structured data from 2 e-shops with a table
representation of attributes. The data contained 142 tablets and 41 fridges¹. All attributes
found in the test data descriptions were present in the background knowledge.

¹ Dataset is available at: http://kapsa.sk/2017-itat-dataset.zip

The results of our tests are summed up in Tables 1 and 2, separately for English and Slovak
descriptions.

5.1 Results for Numeric Attributes

The method for attributes of numeric type correctly found 98.8% of all attribute name and
value pairs in English descriptions, but only 35.71% of the pairs in Slovak descriptions. Such
a low recall in our test has several causes. We have analyzed the results and identified the
following problems:
   • the absence of synonymic names of the given attribute in the background dictionary,
   • the absence of the synonymic unit of the attribute value,
   • the presence of a shortened form instead of the full form of the attribute name, or
     missing words of the full multi-word term,
   • a missing attribute name (just the value and unit were present in the description),
   • a different order of words in the multi-word name of the attribute, and
   • other words inserted into the multi-word name of the attribute.

The first three problems are caused by a small dictionary. After adding more e-shops to the
background knowledge, they should become less important. Different e-shops can use different
terminology and unit abbreviations, which expands the background knowledge dictionary.

The sentence “V chladničke je možné uchovávať 225 l potravín v 4 sklenených poličkách” (en. It
is possible to store 225 l of groceries on 4 glass shelves) mentions the
volume of the refrigerator and the number of shelves in the refrigerator, but because the full
names of the attributes are not present in the sentence, the method for numeric types did not
find these product properties in the sentence. A definite solution to the missing attribute
name problem would probably not be easy. One approach can be to use the units of the attribute
values. If the unit found in the description is used by only one known attribute of the
product domain, the value and unit can be assigned to that attribute.

The last two reasons deal with multi-word names. A solution can be to search for each word of
the term separately. If each word of the multi-word term is found in the same sentence, we can
declare a match. It is possible that automatic morphological analysis of the sentence can
improve this approach, because it can reveal the connections between words and reduce the
false matches of such a method.

The precision of the method is decreased by fuzzy matches, when the 75% edit-distance
threshold was too generous and matched words with a different meaning. We can improve the
precision by using stemming or lemmatization instead of fuzzy matching with edit distance.
Another improvement can be achieved by accepting fuzzy-matched words only if they are not
present in a standard dictionary of the language, i.e. they are probably misspelled.

5.2 Results for Boolean Attributes

The method for the Boolean type of attribute was the most successful in finding attributes.
Using this method, all the required attributes were found in the English descriptions and
82.14% of the attributes in the Slovak descriptions. The reason for not finding attributes in
the Slovak descriptions was similar to the synonymic variations mentioned for the previous
method. Concretely, the term in our dictionary had fewer words, because in the descriptions
some words were split into two words. Since we do fuzzy comparisons word by word, the match
fell below 75%.

For example, the sentence Už žiadna námraza, Technológia No Frost zabraňuje vzniku námrazy
a udržiava konštantnú teplotu v celej chladničke (en. No more frost cover, the technology No
Frost prevents frost creation and keeps a constant temperature throughout the fridge) did not
match our two-word term Technológia NoFrost. The solution would be to add Technológia No Frost
to the dictionary.

Since we used Levenshtein distance to search for a name, the method found two attributes in
two descriptions that were not there. These were the Auto Defrost and NoFrost attributes.

5.3 Results for String Attributes

The method for attributes of String type is special, because it does not need the attribute
name. This causes ambiguity in the attribute assignment.

For example, in the sentence Farba kombinovanej chladničky Goddness je biela. (en. The color
of the Goddness fridge is white.), the value biela (en. white) is appropriate for the
attributes color and color of the front of the refrigerator.

The second problem is again the small dictionary, this time the dictionary of known attribute
values. For example, in the sentence Pri hrúbke len 6,1 mm je vôbec najtenší iPad zároveň aj
najschopnejší (en. Having the depth only 6.1 mm, it is the thinnest iPad as well as the most
capable.), the method did not find the attribute “product name”, since the value iPad is not
in the value dictionary. Again, to remove the problem of an absent attribute value, it is
sufficient to increase the set of attribute values in the dictionary.

The precision was decreased by false fuzzy matches of the attribute value with a different
word. Again, we can improve the precision by using stemming or lemmatization instead of fuzzy
matching with edit distance.

6 Conclusions

This work-in-progress paper presents our baseline algorithms for the automatic extraction of
attribute-value pairs from product descriptions on e-shops. We divided attributes into 3 main
types: Boolean, String and numeric. Boolean attributes are matched if the name is found in the
description. String attributes are searched for by matching the pair of attribute name and
value, or by value only. Numeric attributes require three things to be found: attribute name,
number and unit.

We have tested our methods on real-world data, analyzed the results and proposed improvements
that will be incorporated into our methods in the future.

This work was supported by the Agency of the Slovak Ministry of Education for the Structural
Funds of the EU, under project CeZIS, ITMS: 26220220158.

References
[1] Project Kapsa, web page: http://kapsa.sk/
[2] J. Nothman et al.: Learning multilingual named entity recognition from Wikipedia.
    Artificial Intelligence 194 (2013) 151–175
[3] D. Nadeau, S. Sekine: A survey of named entity recognition and classification.
    Lingvisticae Investigationes 30 (2007) 3–26
[4] D. M. Bikel et al.: Nymble: a High-Performance Learning Name-finder. In ANLP-97,
    Washington, D.C., pp. 194–201, 1997
[5] J. Cowie: Description of the CRL/NMSU System Used for MUC-6. In Proceedings of the Sixth
    Message Understanding Conference, Morgan Kaufmann, 1995
[6] J. M. Castillo et al.: Named Entity Recognition Using Support Vector Machine for Filipino
    Text Documents. International Journal of Future Computer and Communication, Vol. 2, No. 5,
    October 2013
[7] J. Lafferty, A. McCallum, F. Pereira: Conditional Random Fields: Probabilistic models for
    segmenting and labeling sequence data. In Proceedings of ICML, pp. 282–289, 2001
[8] K. Frantzi, S. Ananiadou, J. Tsujii: The C-value/NC-value Method of Automatic Recognition
    of Multi-word Terms. In Proceedings of ECDL, pp. 585–604, ISBN 3-540-65101-2, 1998