Using information retrieval to evaluate trustworthiness assessment of eshops

Claudio Carpineto, Giovanni Romano, and Davide Lo Re
Fondazione Ugo Bordoni, Rome, Italy
{carpinet, romano, dlore}@fub.it

Abstract. To protect consumers from online counterfeiting, several systems have recently been made available that check whether an ecommerce website is trustworthy or not. In this paper we study how to evaluate and compare trust checkers, using an information retrieval methodology to gather suitable data and build a ground-truth test collection. The main findings of our experimental evaluation are that the inter-checker agreement was moderate and that the individual trust checkers presented relative advantages and disadvantages. Essentially, review-based systems were very precise but largely incomplete, whereas feature-based systems provided assessments for any ecommerce website submitted to them but were more prone to errors.

1 Introduction

Online counterfeiting continues to be a thorn in the side of both consumers and enterprises, and there are signs that the problem is worsening despite growing efforts to combat it [7] [6]. One particular but important aspect of this phenomenon is represented by ecommerce websites selling counterfeit goods. These fake websites can attract purchasers and enhance their attempts at deception by using a number of promotion and advertising channels such as email, social networking, web ads, and search engine optimization techniques.

On the other side, better technical countermeasures have begun to appear. As an anti-counterfeiting aid, some desktop web browsers now remove the full URL from view in the URL bar and display only the domain name, which is usually short and clear for legitimate websites (as opposed to the long and messy strings of fake ones). To better protect consumers from being led into a swindle, several research and commercial systems have recently been developed that explicitly assess whether a given website is trustworthy or not. Although these systems employ different paradigms, algorithms, and information sources, they can be roughly grouped into two main categories, namely those based on user reviews and white/black lists (e.g., WOT mywot.com, Trustpilot trustpilot.com, Webutation webutation.net, Scamvoid scamvoid.com), and those making use of website features (e.g., Scamadviser scamadviser.com, [8], [1]).

The relative availability of trust checkers raises the question of their evaluation and comparison. To the best of our knowledge, this issue has not been addressed so far. In this paper we focus on two main research questions: do current trust checkers agree with each other, and which is the best trust checker? To answer these questions, we present an information retrieval methodology consisting of three main steps. We first collect data retrieved by major search engines in response to search queries with brand names; we then identify ecommerce websites (whether legitimate or fake) in the search results with a suitable classifier and use them for measuring inter-checker agreement; and we finally build a ground-truth dataset containing manually labeled legitimate and fake ecommerce websites, which is used for measuring the accuracy of trust checkers.
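As a roadmap, the following is a minimal sketch of the three-step methodology in Python. Every component is an illustrative placeholder for what Sections 3-6 describe in detail, not an actual implementation; in particular, the simple unanimity measure used here is refined into a percent-agreement measure in Section 5.

# Minimal sketch of the three-step evaluation methodology; all
# arguments are illustrative stand-ins for the components of
# Sections 3-6.

def evaluate_checkers(urls, is_ecommerce, checkers, ground_truth):
    """urls: search results for brand-name queries (step 1);
    is_ecommerce: url -> bool, the ecommerce classifier (step 2);
    checkers: list of functions url -> 'legitimate' or 'fake';
    ground_truth: dict from a manually labeled subset of urls to
    their true class (step 3)."""
    eshops = [u for u in urls if is_ecommerce(u)]

    # Step 2: inter-checker agreement, here simply the fraction of
    # eshops on which all checkers return the same label.
    verdicts = {u: [check(u) for check in checkers] for u in eshops}
    agreement = sum(len(set(v)) == 1 for v in verdicts.values()) / len(eshops)

    # Step 3: classification accuracy of each checker on ground truth.
    accuracies = [
        sum(check(u) == label for u, label in ground_truth.items()) / len(ground_truth)
        for check in checkers
    ]
    return agreement, accuracies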
2 Why assessing trustworthiness of ecommerce websites is a difficult task

Assessing whether an ecommerce website is trustworthy or not may be difficult even for humans. In [1], it is shown that non-experts were inconsistent in discriminating between legitimate and fake ecommerce websites containing discounted offers.

To illustrate the difficulty of this task, in Figure 1 we show two websites selling products of two well known luxury brands, namely 'Hugo Boss' (a) and 'Iceberg' (b). At first glance, website (a) has a nice look and feel, offers products at reasonably discounted prices, and shows a well designed navigation system for finding products. It also presents other features that might reassure shoppers that the store is trustworthy: security seals are in place, a live chat is provided, and a Facebook label is displayed. Further, on a more technical side, the brand name (i.e., Hugo Boss) is in the URL's path and not in the domain name, as is customary for legitimate ecommerce retailers (except for the official website of the brand). However, on closer inspection, we realized that the security label had been appropriated without signing up with its vendor, that the store did not have its own Facebook page, and that the live chat did not work. Also, more importantly, the website did not provide any contact information except for a contact form to be filled in by shoppers. This is a clear sign that the website may be fake.

Turning to website (b), we see that it has several nice characteristics to recommend itself. In particular, very detailed contact information is displayed in the footer to increase shoppers' trust, including the telephone numbers and addresses of physical stores. However, we found that the provided contact information was contained in an image and, on further inquiry, turned out to consist of dummy contacts; the website did not even offer an email address. This was probably another scam website.

Trust checkers may use people's evaluations and reviews in their assessments, or detect specific mechanisms employed by fake websites such as cloaking [9] and search redirection [4]. A more comprehensive approach consists of training a classifier on a large set of learning features, possibly including the earlier mechanisms. Two recent trust classifiers are [8] and [1], the latter of which makes use of 33 learning features spanning product offer, merchant information, payment methods, website registration data, ecommerce-specific SEO, and relative behavior of the website. Unlike users, automatic checkers can take into account a large number of a website's features simultaneously, and they can easily access useful external information that may not be readily available to consumers; e.g., white/black lists, consumer reviews, WHOIS data, Alexa metrics, other checkers' assessments, etc. On the other hand, fake websites continue to take steps to increase their similarity to genuine ones, so that sometimes a deeper understanding of what is behind a feature may be required.
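To make the feature-based paradigm concrete, below is a toy sketch of a classifier in the spirit of [8] and [1]: an SVM trained on a handful of hand-crafted website features. The four features and the training examples are invented for illustration only, as a tiny stand-in for the 33 learning features of [1], and scikit-learn is an arbitrary choice of library.

# Toy feature-based trust classifier; the features and the data are
# invented for illustration and are not the actual feature set of [1].
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each row: [discount rate, has contact info, domain age (days), blacklisted]
X = [
    [0.70, 0, 30, 1],    # deep discounts, no contacts, young domain
    [0.10, 1, 4000, 0],  # modest discounts, full contacts, old domain
    [0.65, 0, 90, 1],
    [0.15, 1, 2500, 0],
]
y = ["fake", "legitimate", "fake", "legitimate"]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict([[0.60, 0, 45, 1]]))  # expected to lean towards 'fake'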
3 Gathering legitimate and fake ecommerce websites

First of all, we needed to collect data containing (legitimate and) fake ecommerce websites. One of the major promotion channels used by counterfeiters is the manipulation of search engine results. As brand names are extensively searched on the web, usually with a shopping intent [2], online sellers of counterfeits can use search engine optimization techniques to achieve high listings in search results. This way, they may increase traffic on their websites by fraudulently attracting purchasers seeking to buy genuine products or 'complicit' shoppers who are willing to buy replicas. Here we talk about entirely bogus websites, not legitimate sellers that may occasionally sell counterfeit products. Their massive presence in search-engine results has been observed in some recent studies [8] [1].

We collected search results for 39 major shoe brands (e.g., 'Armani', 'Christian Louboutin', 'Gucci', 'Prada', 'Valentino') through the following procedure. We used three types of queries: one neutral query consisting of the brand name followed by 'shoes', to help with ambiguity in brand names and focus on the same product for multi-product brands; one biased query where we added 'cheap' to the brand name and 'shoes', to emphasize that shoppers were seeking discounted offers; and one complicit query formed by adding 'replica' to the brand name and 'shoes', to clearly indicate that users were happy with counterfeit products. The queries were automatically submitted to three search engines (i.e., Bing, Google, and Yahoo!) set to the English language, and the first 100 results were saved. In all, we collected about 35000 search results. The collected URLs were then post-processed. We performed URL normalization and grouped together the URLs with the same domain name, because they usually refer to distinct offers from the same eshop. We kept one randomly selected group member as the group representative. After this operation, the number of URLs was reduced to about 24000.

Search results for brand queries usually contain many ecommerce websites but also many non-ecommerce websites such as shopbots, product catalogues, shop locators, ecommerce blogs, etc. The next step was to identify proper ecommerce websites in the search results. This is a research task in its own right. We used a specific classifier described in [1] that makes use of 24 classification features covering various aspects of the website, including product navigation and search, product display, purchase management, and customer service information. In an experimental evaluation, this classifier was able to discriminate between ecommerce and non-ecommerce websites with a classification accuracy of about 90%. By running this classifier on the search results we identified about 10000 ecommerce websites (whether legitimate or fake), with some approximation error.

Fig. 1. Screenshots of two presumably fake ecommerce websites, http://www.designershopp.com/designer/hugo-boss.html (a) and http://www.lowpricesfashion.com/en/31 iceberg (b), as of 14 March 2017.

The next step was to build a ground-truth dataset containing eshops labeled as legitimate or fake. As this is a time-consuming process that requires trained personnel, the size of the ground-truth dataset was small. Starting from the large dataset of eshops described above, we manually labeled randomly selected items until a balanced dataset containing 255 legitimate and fake ecommerce websites was obtained (by downsampling the class containing legitimate websites).
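A minimal sketch of the query construction and URL post-processing just described is given below. The exact word order of the biased and complicit query templates, and the use of Python's standard urllib for normalization, are illustrative assumptions rather than the actual procedure.

# Sketch of query construction and URL grouping; templates and
# normalization details are illustrative assumptions.
import random
from urllib.parse import urlparse

TEMPLATES = (
    "{brand} shoes",          # neutral
    "cheap {brand} shoes",    # biased
    "{brand} shoes replica",  # complicit
)

def build_queries(brands):
    """Three queries (neutral, biased, complicit) per brand."""
    return [t.format(brand=b) for b in brands for t in TEMPLATES]

def group_by_domain(urls, seed=0):
    """Group normalized URLs by domain name and keep one randomly
    selected representative per group."""
    groups = {}
    for url in urls:
        domain = urlparse(url.strip().lower()).netloc.removeprefix("www.")
        groups.setdefault(domain, []).append(url)
    rng = random.Random(seed)
    return [rng.choice(members) for members in groups.values()]

# 39 brands x 3 query types submitted to 3 engines, keeping the first
# 100 results each, yields roughly 39 * 3 * 3 * 100 = 35100 URLs.
queries = build_queries(["Armani", "Gucci", "Prada"])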
4 Trust checkers

We used four trust checkers: WOT, Trustpilot, Scamadviser, and RI.SI.CO. [1].¹ WOT and Trustpilot are mainly based on lists of websites and on user reviews. Scamadviser essentially uses WHOIS data, but the details of both the features and the algorithm are not disclosed. RI.SI.CO. is an SVM classifier with a large set of features. The interfaces to the four systems are shown in Figure 2. The three commercial systems take a URL as input and return a weighted assessment of trustworthiness, whereas the input to RI.SI.CO. is a brand name and its output consists of a list of fake eshops. In Figure 2, we show (top left) the output of RI.SI.CO. for the query 'Christian Louboutin' (i.e., a luxury shoe brand). We also show the outputs of Scamadviser (top right) and WOT (bottom left) for one (highlighted) result of RI.SI.CO. Scamadviser rated it as moderately unsafe, with a score of 52% (in a range from 0% to 100% safeness), while for WOT it was clearly untrustworthy; i.e., 9 out of 100. Trustpilot was not able to assess the website retrieved by RI.SI.CO.; we show the output of Trustpilot (bottom right) for Zalando, a legitimate website receiving four stars out of five.

¹ RI.SI.CO. is available at http://uibm-ici.fub.it/risico with password-protected access. It was developed for and in cooperation with the Directorate-General for the Fight against Counterfeiting - Italian Patent and Trademark Office.

Fig. 2. Screenshots of RI.SI.CO. (a) for the 'Christian Louboutin' shoe brand, Scamadviser (b) and WOT (c) for the fake ecommerce website 'www.christianlouboutinshoessaleinc.com', and Trustpilot (d) for the legitimate ecommerce website 'www.zalando.com'.

Strictly speaking, the assessment weights returned by WOT, Trustpilot, and Scamadviser are scores, not probabilities. In order to compare the four systems, we had to convert assessment weights into binary classification values. For WOT and Scamadviser we used a simple splitting criterion based on equal-width intervals; i.e., threshold = 0.5. While more powerful supervised methods are conceivable [3], we will see in Section 6 that this simple criterion yielded results close to the theoretical upper bound of performance. For Trustpilot, we converted any score into the class 'legitimate' (the rationale is explained below).

5 Inter-checker agreement

The first experiment was aimed at measuring the consistency of ratings across different checkers. One practical constraint concerned the time necessary to gather the ratings. Because APIs were available only for WOT, it was not possible to run all the checkers through the whole dataset containing about 10000 ecommerce websites. To keep the times manageable, we used a subset containing 632 randomly extracted URLs. We submitted these URLs to the checkers and collected the binarized assessments, which were used to measure the inter-checker agreement.

First of all, we noted that Trustpilot and WOT were able to assess only a subset of the URLs, respectively 81 and 351 URLs. We removed Trustpilot from this evaluation and considered the 351 URLs assessed by all three remaining systems. The percent agreement (calculated by averaging, for each URL, the number of agreeing scores divided by the total number of scores) was 88%, which can be seen as a moderate agreement for this simple measure of interrater reliability [5]. More specifically, we found that the three checkers returned the same class label for only 232 of the 351 URLs, which means that in more than one third of the cases they were not able to make a unanimous decision. An analysis of pairwise consistency showed that RI.SI.CO. and WOT agreed 323 times, while the agreement between RI.SI.CO. and Scamadviser was equal to that between WOT and Scamadviser and was much lower; i.e., 246 in both cases. From these findings, it is clear that Scamadviser made the largest number of unique decisions. On the whole, this experiment suggests that the predictions made by trust checkers are different, but it does not tell us whether they are correct. This question is answered in the next section.
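Since the description of the percent-agreement computation is compact, a short sketch is given below. The majority-label reading is one interpretation of the measure; with three checkers, it reproduces the reported 88% from the 232 unanimous and 119 split URLs.

# Sketch of the percent-agreement measure, under a majority-label
# reading: for each URL, the number of checkers agreeing on the
# majority label divided by the number of checkers, averaged over
# all URLs assessed by every checker.
from collections import Counter

def percent_agreement(labels_per_url):
    per_url = [
        max(Counter(labels).values()) / len(labels)
        for labels in labels_per_url
    ]
    return sum(per_url) / len(per_url)

# 232 unanimous URLs (agreement 3/3) and 119 two-vs-one splits
# (agreement 2/3) reproduce the reported value:
votes = [("fake",) * 3] * 232 + [("fake", "fake", "legitimate")] * 119
print(round(percent_agreement(votes), 3))  # 0.887, i.e. the 88% above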
6 Classification accuracy on the ground-truth dataset

Using the ground-truth dataset, we evaluated the classification accuracy of the four checkers. The results are shown in Table 1. For the systems returning trustworthiness scores, we show the performance values using both a predefined threshold value = 0.5 and the value that maximizes the (a posteriori) global classification accuracy.² The main findings of this experiment are that the review-based systems (Trustpilot and WOT) made their assessments on only a subset of the URLs, in proportions similar to those reported above in the inter-checker agreement analysis, but they were more precise than the feature-based systems (Scamadviser and RI.SI.CO.). In particular, Trustpilot (under the extensive interpretation) and WOT achieved an overall accuracy of, respectively, 100% and 93%, versus 77% for Scamadviser and 87% for RI.SI.CO.

² We noticed that the optimal threshold of Trustpilot is set assuming that whenever the system returns an assessment the website is trustworthy, regardless of the number of stars. In other words, it seems that Trustpilot assesses the convenience of purchase or the quality of service for otherwise legitimate ecommerce websites.

Table 1. Performance of WOT, Trustpilot, Scamadviser (with predefined and optimal threshold values) and RI.SI.CO. on the ground-truth dataset. Values in parentheses refer to the optimal threshold.

                                 WOT        Trustpilot   Scamadviser  RI.SI.CO.
Number of assessments            55%        14%          100%         100%
Accuracy of actual assessments:
  Legitimate and fake eshops     93% (94%)  66% (100%)   77% (81%)    87%
  Only legitimate eshops         96% (95%)  100% (100%)  92% (86%)    89%
  Only fake eshops               92% (93%)  0% (0%)      64% (77%)    85%

Table 1 also suggests that evaluation of trustworthiness was, in general, more difficult for fake than for legitimate eshops. One possible interpretation is that legitimate eshops usually have only legitimacy features in place, whereas fake websites may or may not have illegitimacy features.

Choosing a system-specific threshold value may have a great impact not only on Trustpilot, as already mentioned, but also on Scamadviser, whose global accuracy improved from 77% to 81% with a threshold value equal to 0.75 rather than 0.5. By contrast, it did not much affect the performance of WOT. Its score distribution was such that the classification accuracy with a threshold = 0.5 was nearly the same as that with the optimal threshold value (i.e., 0.52), with as many as 49 fake websites receiving a WOT score = 1 (in a range from 1 to 100). Note also that optimizing the threshold value for global accuracy will lower the recall on legitimate or fake eshops.
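For completeness, the a-posteriori threshold selection used in Table 1 can be sketched as follows. Variable names are illustrative, and scores are assumed to be normalized to [0, 1], with higher values meaning more trustworthy.

# Sketch of a-posteriori threshold selection: scan candidate
# thresholds and keep the one maximizing global accuracy on the
# ground-truth labels.

def best_threshold(scores, labels):
    """labels: True for legitimate, False for fake; a website is
    predicted legitimate when its score is at least the threshold."""
    def accuracy(t):
        return sum((s >= t) == y for s, y in zip(scores, labels)) / len(scores)
    return max(sorted(set(scores)), key=accuracy)

# E.g., for Scamadviser this procedure selected 0.75 instead of 0.5,
# improving global accuracy from 77% to 81% (see above).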
7 Conclusions

Using an evaluation methodology inspired by information retrieval, we found that trustworthiness assessments of eshops made by existing checkers are characterized by moderate inter-consistency and varying classification accuracy. This research suggests that there is much room for performance improvement and that a combination of existing methods holds potential for developing better solutions.

References

1. Claudio Carpineto and Giovanni Romano. Learning to detect and measure fake ecommerce websites in search-engine results. Submitted.
2. Jeffrey P. Dotson, Ruixue Rachel Fan, Elea McDonnell Feit, Jeffrey D. Oldham, and Yi-Hsin Yeh. Brand attitudes and search engine queries. Journal of Interactive Marketing, 37:105–116, 2017.
3. Elizabeth A. Freeman and Gretchen G. Moisen. A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecological Modelling, 217:48–58, 2008.
4. N. Leontiadis, T. Moore, and N. Christin. Measuring and analyzing search-redirection attacks in the illicit online prescription drug trade. In Proceedings of USENIX Security 2011, San Francisco, CA, USA, 2011.
5. Mary L. McHugh. Interrater reliability: the kappa statistic. Biochemia Medica, 22(3):276–282, 2012.
6. NetNames. The risks of the online counterfeit economy. Technical report, 2016.
7. OECD/EUIPO. Trade in Counterfeit and Pirated Goods: Mapping the Economic Impact. OECD Publishing, Paris, 2016.
8. John Wadleigh, Jake Drew, and Tyler Moore. The e-commerce market for "lemons": Identification and analysis of websites selling counterfeit goods. In WWW '15, pages 1188–1197, 2015.
9. D. Y. Wang, M. Der, M. Karami, L. Saul, D. McCoy, S. Savage, and G. M. Voelker. Search + seizure: The effectiveness of interventions on SEO campaigns. In IMC '14, New York, NY, USA, pages 359–372. ACM Press, 2014.