=Paper=
{{Paper
|id=Vol-1885/23
|storemode=property
|title=Attributes Extraction from Product Descriptions on e-Shops
|pdfUrl=https://ceur-ws.org/Vol-1885/23.pdf
|volume=Vol-1885
|authors=Michaela Linková,Peter Gurský
|dblpUrl=https://dblp.org/rec/conf/itat/LinkovaG17
}}
==Attributes Extraction from Product Descriptions on e-Shops==
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 23–26
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 M. Linková, P. Gurský
Michaela Linková, Peter Gurský
Institute of Computer Science
Faculty of Science, P.J.Šafárik University in Košice
Jesenná 5, 040 01 Košice, Slovakia
michaela.linkova@student.upjs.sk, peter.gursky@upjs.sk
Abstract. Some e-shops present product attributes in structured form, but many others use a textual description only. Attributes of products are essential in automated product deduplication. We suggest methods for automated extraction of attributes and their values from product descriptions into a structured form. The structured data extracted from other e-shops are used as background knowledge.

1 Introduction

Nowadays there is an increasing interest in effective extraction of information from large amounts of data. The problem of searching for and obtaining relevant information is handled by several areas of computer science. Project Kapsa [1] deals with extraction and unification of information from web pages, focusing on products on e-shops. The aim of the project is the creation and management of a collection of products which are offered by e-shops. A crucial part of processing the e-shops’ data is the deduplication of products, i.e. deciding whether any two products extracted from different e-shops are the same. To increase the precision of the deduplication, structured data about the products (product properties and their values) are essential.

Although some e-shops present attributes of products in table form, many other e-shops provide a textual description only. The descriptions usually contain values of many product properties and are written in natural language.

This work-in-progress paper presents our current methods of automatic extraction of product attributes with their values from product descriptions. Products have attributes of 3 main types: String, number with unit and Boolean. Each type is expressed differently in natural language. Therefore we propose a separate extraction method for each attribute type.

2 State of the Art

To extract a product attribute/property with its value from a text description, we need to recognize that the attribute and/or its value are mentioned in the text. Named-entity recognition (NER) is a research area close to our problem. NER is the information extraction task of identifying and classifying mentions of people, organizations, locations and other named entities within text. Approaches to NER are surveyed in [3]. The dominant technique for addressing the NER problem is supervised learning. A usual NER method consists of tagging words of a test corpus when they are annotated as entities in the (rather big) training corpus. Semi-supervised techniques decrease the amount of manual annotation needed to train a classifier. Typically, the sentences in Wikipedia articles are considered annotated, because they contain context links to other Wikipedia pages in sentences. The titles of such pages are then considered to be the names of the entities and their URLs become the identifiers. The common learning constellation for supervised and semi-supervised techniques is the processing of annotated texts. The majority of learning models process entity names as well as surrounding words. Many learning approaches have been used to handle NER: Hidden Markov Models [4], decision trees [5], Support Vector Machines [6], Conditional Random Fields [7].

Another approach, similar to NER, is terminology / entity / term extraction. The goal of terminology extraction is to automatically extract relevant terms from text, typically based on a vocabulary of domain-relevant (possibly multi-word) terms. A typical approach is to extract term candidates using linguistic processors and filter them using statistical and/or machine learning methods. The C-value/NC-value method [8] is an example. To handle multi-word terms, the methods usually use n-grams, that is, combinations of n words appearing in the corpus.

3 Background Knowledge

Unlike general named entity recognition, as a part of natural language processing, we can profit from knowledge of the product domain and drastically reduce the number of possible entities to search for in a product description. The product domain can be determined from the product web presentation, since it is usually presented at a specific position on every product detail page of the e-shop.

The second advantage is the structured and annotated data of the product domain in the background knowledge. These data are extracted from the e-shops with structured attribute presentation in the form of tables. Therefore, we can use the dictionary of the attribute names in different languages (English, Slovak …) and variations (synonyms, abbreviations) for each attribute of a given product domain. Similarly, we can use various forms of units’ names (e.g. kg, kilograms, kilos, kilogramov, kíl …). Our background database also contains unit conversions between convertible units (e.g. grams vs. kg). Finally, the attribute types and the list of extracted values of each attribute and product domain are stored in the background knowledge.

The annotation of attributes in Kapsa [1] is a semiautomatic process driven by an administrator in a web GUI. Input for the annotation is a list of attribute names and values in String form for each product extracted from e-shop web pages, possibly with some additional tags. The annotation produces a set of rules that determine the product domain, attribute identification (including attribute deduplication), attribute type, value and unit extraction, etc. If the product domain or attribute is already annotated for another e-shop, the annotator usually just plays the role of the validator of an automatic annotation.
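To make the content of the background knowledge from Section 3 more concrete, the following Python sketch shows one possible in-memory layout: per product domain, each attribute carries its type, its name variants and the values already seen in structured data. The class names, field names and sample entries are assumptions made for illustration; they do not describe the actual Kapsa database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Attribute:
    """One attribute of a product domain (hypothetical layout)."""
    canonical: str                  # identifier chosen for this sketch
    kind: str                       # "boolean", "string" or "number"
    names: Set[str]                 # variants: languages, synonyms, abbreviations
    unit: Optional[str] = None      # only meaningful for kind == "number"
    values: Set[str] = field(default_factory=set)  # values seen in structured data


class BackgroundKnowledge:
    """Maps a product domain to the attributes known for it."""

    def __init__(self) -> None:
        self._domains: Dict[str, List[Attribute]] = {}

    def add(self, domain: str, attribute: Attribute) -> None:
        self._domains.setdefault(domain, []).append(attribute)

    def attributes(self, domain: str, kind: Optional[str] = None) -> List[Attribute]:
        """All attributes of a domain, optionally restricted to one type."""
        found = self._domains.get(domain, [])
        return [a for a in found if kind is None or a.kind == kind]

    def find_by_name(self, domain: str, name: str) -> Optional[Attribute]:
        """Exact, case-insensitive lookup by any name variant."""
        wanted = name.casefold()
        for attribute in self.attributes(domain):
            if any(wanted == variant.casefold() for variant in attribute.names):
                return attribute
        return None


# Illustrative entries only; they are not taken from the Kapsa data.
bk = BackgroundKnowledge()
bk.add("refrigerator", Attribute("volume", "number",
                                 {"volume", "capacity", "objem"}, unit="l"))
bk.add("refrigerator", Attribute("color", "string", {"color", "colour", "farba"},
                                 values={"white", "silver", "biela"}))
bk.add("refrigerator", Attribute("no_frost", "boolean", {"No Frost", "NoFrost"}))
```

Restricting every lookup to one product domain mirrors the point made in Section 3: knowing the domain keeps the set of candidate attributes small.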
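The unit handling mentioned in Section 3 (name variants of a unit and conversions between convertible units) could be sketched as below. The kilogram variants are the ones listed in the text; the remaining entries, the table layout and the function names are assumptions made for this example.

```python
from typing import Dict, Optional, Set, Tuple

# Spelling variants per canonical unit (lower case).
UNIT_VARIANTS: Dict[str, Set[str]] = {
    "kg": {"kg", "kilograms", "kilos", "kilogramov", "kíl"},
    "g": {"g", "grams"},
    "cm": {"cm"},
}

# canonical unit -> (base unit, factor to that base unit).
# The choice of base units is an assumption of this sketch.
TO_BASE: Dict[str, Tuple[str, float]] = {
    "kg": ("g", 1000.0),
    "g": ("g", 1.0),
    "cm": ("cm", 1.0),
}


def canonical_unit(token: str) -> Optional[str]:
    """Map any known spelling of a unit to its canonical form."""
    wanted = token.casefold()
    for unit, variants in UNIT_VARIANTS.items():
        if wanted in variants:
            return unit
    return None


def convert(value: float, source: str, target: str) -> float:
    """Convert between two canonical units that share a base unit."""
    source_base, source_factor = TO_BASE[source]
    target_base, target_factor = TO_BASE[target]
    if source_base != target_base:
        raise ValueError(f"{source} and {target} are not convertible")
    return value * source_factor / target_factor
```

With such a table, a weight given as "1.5 kilos" on one e-shop and "1500 g" on another can be compared after `convert(1.5, canonical_unit("kilos"), "g")`.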
4 Extraction Methods

Our analysis of product descriptions showed that attributes and their values are presented differently for Boolean, String and number types in natural language. Since the background knowledge holds the type information for each attribute we are searching for, we can use a dedicated extraction method for each type. All of the presented methods are still works in progress and represent baseline methods for future work. Some possible modifications, which we believe can improve the quality of the methods, are proposed at the ends of the following sections as well as in the experiments section.

4.1 Extraction of Boolean Attributes

Our method for extraction of Boolean attributes assumes that the presence of the attribute name in the product description implies that the value of the product’s attribute is ‘true’. The method searches for every variation of the attribute name (languages and synonyms) that is present in the background knowledge. If the attribute name is matched, the attribute with value ‘true’ is sent to the output.

Since quite a lot of the attributes were misspelled or inflected in our test data, we have replaced the exact search of the attribute names by a fuzzy search using Levenshtein (edit) distance. The threshold for a positive result was set to a 75% match of the attribute name. The method uses the fuzzy search only if the exact match is not found.

We believe that our method can be improved by stemming or lemmatization of the attribute names and the words of the product description, so that inflections count as equivalent to an exact match. Another improvement can be to include commonly misspelled attribute names in the background knowledge. However, this requires administrator intervention and turns our automatic method into a semiautomatic one.

Our method does not recognize sentences as positive or negative. So the sentence “The mobile phone does not have thermosensor” induces the result that the product has the Boolean attribute thermosensor with value ‘true’. Approaches known from sentiment analysis can be incorporated to cover this problem.

4.2 Extraction of Numeric Attributes

Attributes of number type have their values composed of a number and a unit (12 g, 15 cm, 42 ’’, 2 pieces…). Our method is composed of 4 main steps. First, the method searches for attribute names in the same way as the method extracting Boolean attributes. Next, all the variants of the attribute unit and the variants of units convertible to this unit are searched for in the sentence where the attribute name was found. The only exception is the unit pieces, because it is not common in natural language sentences (e.g. 2 shelves instead of 2 pieces of shelves). In our method the search for the units requires an exact match. If both the attribute name and the unit were found, then, in the third step, the numbers are searched for in the sentence using the regular expression “([0-9]+)(\\ )*(\\.)*(\\,)*([0-9]*)”. If the sentence contains more numbers, the number closest to the attribute name is selected in the 4th step.

Figure 1: Extraction of attributes having number type

The extraction method can be extended to cover word variants of the numbers (e.g. one, two, twenty-three...), but it requires a new dictionary for each language. Stemming and lemmatization can also be used for the unit search (in the Slovak language there are 3 variants for singular and plural forms of units, e.g. kilogram, kilogramy, kilogramov).

4.3 Extraction of String Attributes

String attributes are the most sensitive type to the size of background knowledge. The specialty of this type is that
the attribute values are often self-explanatory and the attribute name isn’t necessary. For example, in the sentence “This Candy GC41472D1S Washing Machine with stylish Silver finish looks great in any home.” three String attributes can be found: producer (Candy), product name (Candy GC41472D1S) and color (Silver). If the washing machine was already extracted from another e-shop in structural form, all the String values are present in the background data and can be used to identify the attributes.

The extraction method for String attributes first searches for attribute names, as the previous methods do. If the attribute name is found, values of the same attribute extracted from all products of the same domain are searched for in the same sentence. If a value is found, the attribute-value pair is sent to the result. String attributes of the product domain which were not found in the first step are searched for only by their known values. Since each value corresponds to some attribute in the background knowledge, it is easy to send the attribute-value pair to the result. The implemented method does not use fuzzy search of the attribute values in product descriptions.

Similarly to the attribute name search, the attribute value search could be extended with stemming and lemmatization, so that inflections count as equivalent to an exact match.

5 Experiments

To verify the methods, we created test data containing real e-shop product descriptions of 2 domains: fridges and tablets. We have selected 20 products from each domain. 10 descriptions were in English and 10 were in Slovak. We have manually selected attributes and their values that appeared in the descriptions and typed them into the test table. Each product description was an input for our extraction methods and the results were compared to the manually selected ones.

Tablet descriptions contained 4 Boolean attributes, 1 String attribute and 5 number attributes. Fridge descriptions contained 4 Boolean attributes, 4 String attributes and 9 number attributes.

The background knowledge was created by extraction of structured data from 2 e-shops with table representation of attributes. The data contained 142 tablets and 41 fridges (the dataset is available at http://kapsa.sk/2017-itat-dataset.zip). All attributes found in the test data descriptions were present in the background knowledge.

The results of our tests are summed up in Tables 1 and 2, separately for English and Slovak descriptions.

Table 1. Precision (P), recall (R) and F-score (F) in % for English descriptions
Domain         Numeric                  Boolean                  String
               P       R       F        P       R       F        P       R       F
tablet         100     97.47   98.7     100     100     100      100     50      66.67
refrigerator   100     100     100      100     100     100      100     100     100
average        100     98.8    99.4     100     100     100      100     85.71   92.31

Table 2. Precision (P), recall (R) and F-score (F) in % for Slovak descriptions
Domain         Numeric                  Boolean                  String
               P       R       F        P       R       F        P       R       F
tablet         87.5    20.59   33.34    100     78.95   88.24    100     70      82.35
refrigerator   100     50      66.67    80      88.88   84.21    80      70.59   75
average        96.15   35.71   52.08    92      82.14   89.79    86.36   70.37   77.55

5.1 Results for Numeric Attributes

The method for attributes of numeric type correctly found 98.8% of all attribute name and value pairs in English descriptions, but only 35.71% of pairs in Slovak descriptions. Such a low recall in our test has several causes. We have analyzed the results and identified the following problems:
• the absence of synonymic names of the given attribute in the background dictionary,
• the absence of the synonymic unit of the attribute value,
• presence of an abbreviation instead of the full form of the attribute name, or missing words of the full multi-word term,
• missing attribute name (just the value and units were present in the description),
• different order of words in a multi-word attribute name, and
• other words inserted into a multi-word attribute name.

The first three problems are caused by a small dictionary. After adding more e-shops to the background knowledge, they should become less important. Different e-shops can use different terminology and unit abbreviations, which expands the background knowledge dictionary.

The sentence “V chladničke je možné uchovávať 225 l potravín v 4 sklenených poličkách” (en. It is possible to store 225 l of groceries on 4 glass shelves) mentions the
volume of the refrigerator and the number of shelves in the refrigerator, but because the full names of the attributes are not present in the sentence, the method for numeric types did not find these product properties in the sentence. A definite solution for the missing attribute name problem would probably not be easy. One approach can be to use the units of the attribute values. If the unit found in the description is used by only one known attribute of the product domain, the value and unit can be assigned to that attribute.

The last two reasons deal with multi-word names. A solution can be to search for each word of the term separately. If each word of the multi-word term is found in the same sentence, we can declare a match. It is possible that automatic morphological analysis of the sentence can improve this approach, because it can reveal the connections between words and reduce the false matches of such a method.

The precision of the method is decreased by fuzzy matches, where the 75% edit-distance threshold was too generous and matched words with a different meaning. We can improve the precision using stemming or lemmatization instead of fuzzy matching with edit distance. Another improvement can be achieved by accepting fuzzy matched words only if they are not present in a standard dictionary of the language, i.e. they are probably misspelled.

5.2 Results for Boolean Attributes

The method for the Boolean type of attribute was the most successful in finding attributes. Using this method, all the required attributes were found in the English descriptions and 82.14% of the attributes in the Slovak descriptions. The reason for not finding attributes in our tests within Slovak descriptions was similar to the synonymic variations mentioned for the previous method. Concretely, the term in our dictionary had fewer words than its occurrence in the description, because one word of the term was written there as two words. Since we do fuzzy comparisons word-by-word, it made the match less than 75%.

For example, the sentence “Už žiadna námraza, Technológia No Frost zabraňuje vzniku námrazy a udržiava konštantnú teplotu v celej chladničke” (en. No more frost cover, the technology No Frost prevents frost creation and keeps constant temperature throughout the fridge) didn’t match our two-word term Technológia NoFrost. The solution would be to add Technológia No Frost to the dictionary.

Since we used Levenshtein distance to search for a name, the method found two attributes in two descriptions that were not there. These were the Auto Defrost and NoFrost attributes.

5.3 Results for String Attributes

The method for attributes of String type is special, because it does not need the attribute name. This causes ambiguity in the attribute assignment.

For example, in the sentence “Farba kombinovanej chladničky Goddness je biela.” (en. The color of the Goddness fridge is white.), the value biela (en. white) is appropriate for the attributes color and color of the front of the refrigerator.

The second problem is again the small dictionary, this time the dictionary of known attribute values. For example, in the sentence “Pri hrúbke len 6,1 mm je vôbec najtenší iPad zároveň aj najschopnejší” (en. Having the depth only 6.1 mm, it is the thinnest iPad as well as the most capable.), the method did not find the attribute “product name”, since the value iPad is not in the value dictionary. Again, to remove the problem of the absence of an attribute value, it is sufficient to increase the set of attribute values in the dictionary.

The precision was decreased by false fuzzy matches of the attribute value with a different word. Again, we can improve the precision using stemming or lemmatization instead of fuzzy matching with edit distance.

6 Conclusions

This work-in-progress paper presents our baseline algorithms for automatic extraction of attribute-value pairs from product descriptions on e-shops. We divided attributes into 3 main types: Boolean, String and numeric. Boolean attributes are matched if the name is found in the description. String attributes are searched for by matching the pair of attribute name and value, or by value only. Numeric attributes require three things to be found: attribute name, number and unit.

We have tested our methods against real world data, analyzed the results and proposed improvements to be incorporated in our methods in the future.

This work was supported by the Agency of the Slovak Ministry of Education for the Structural Funds of the EU, under project CeZIS, ITMS: 26220220158.

References
[1] Project Kapsa, web page: http://kapsa.sk/
[2] J. Nothman et al.: Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence 194 (2013) 151–175
[3] D. Nadeau, S. Sekine: A survey of named entity recognition and classification. Lingvisticae Investigationes 30 (2007) 3–26
[4] D. M. Bikel et al.: Nymble: a High-Performance Learning Name-finder. In ANLP-97, Washington, D.C., pp. 194–201, 1997
[5] J. Cowie: Description of the CRL/NMSU System Used for MUC-6. In Proceedings of the Sixth Message Understanding Conference, Morgan Kaufmann, 1995
[6] J. M. Castillo et al.: Named Entity Recognition Using Support Vector Machine for Filipino Text Documents. International Journal of Future Computer and Communication, Vol. 2, No. 5, October 2013
[7] J. Lafferty, A. McCallum, F. Pereira: Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pp. 282–289, 2001
[8] K. Frantzi, S. Ananiadou, J. Tsujii: The C-value/NC-value Method of Automatic Recognition of Multi-word Terms. In Proceedings of ECDL, pp. 585-604. ISBN 3-540-65101-2, 1998
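As a supplementary illustration, the four steps of the numeric extraction described in Section 4.2 can be sketched in Python. This is a minimal sketch, not the authors' implementation: it works on a single sentence, uses only exact name matching (no fuzzy fallback), treats "." or "," between digits as a decimal separator, and measures closeness by character offset. The function names and the sample sentences are assumptions; only the regular expression is the one quoted in Section 4.2.

```python
import re
from typing import Iterable, Optional, Tuple

# The expression quoted in Section 4.2, taken out of its string-literal escaping.
NUMBER_RE = re.compile(r"([0-9]+)(\ )*(\.)*(\,)*([0-9]*)")


def to_number(match) -> float:
    """Interpret a NUMBER_RE match (decimal-separator reading is an assumption)."""
    whole, fraction = match.group(1), match.group(5)
    if fraction and (match.group(3) or match.group(4)):
        return float(f"{whole}.{fraction}")
    return float(whole + fraction)  # digits split only by spaces, e.g. "1 000"


def extract_numeric(sentence: str,
                    name_variants: Iterable[str],
                    unit_variants: Iterable[str]) -> Optional[Tuple[float, str]]:
    """Return (value, unit) for one numeric attribute in one sentence, or None."""
    lowered = sentence.lower()

    # Step 1: exact search for any variant of the attribute name.
    positions = [lowered.find(name.lower()) for name in name_variants]
    positions = [p for p in positions if p >= 0]
    if not positions:
        return None
    name_pos = min(positions)

    # Step 2: exact search for a unit variant, matched as a whole token.
    unit = None
    for candidate in unit_variants:
        pattern = r"(?<!\w)" + re.escape(candidate.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            unit = candidate
            break
    if unit is None:
        return None

    # Step 3: collect all numbers in the sentence.
    numbers = list(NUMBER_RE.finditer(lowered))
    if not numbers:
        return None

    # Step 4: keep the number closest to the attribute name.
    closest = min(numbers, key=lambda m: abs(m.start() - name_pos))
    return to_number(closest), unit


sentence = "The display size is 10.1 inch and the weight is 450 g."
weight = extract_numeric(sentence, ["weight"], ["g", "grams"])
display = extract_numeric(sentence, ["display size"], ["inch", "''"])
```

A sentence that contains a value and a unit but no attribute name, such as the refrigerator example in Section 5.1, yields `None` in step 1, which reproduces the missing-name problem reported there.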