<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Software Quality Analysis, Monitoring, Improvement, and Applications, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards an automatic tool for detecting third-party data leaks on websites</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Robin Carlsson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panu Puhtila</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sampsa Rauti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Turku</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>0</fpage>
      <lpage>13</lpage>
      <abstract>
<p>Everyday tasks are increasingly completed with the help of various web-based services, and many users with little technical know-how are using these services. Due to this development, online privacy has emerged as a paramount concern when developing web services. One particular privacy concern involves third-party services, such as analytics services, that are nowadays commonplace on almost any website. In the current study, we explore the possibilities of automating the data collection in scientific research on personal data leaks related to third-party analytics tools, and build a proof-of-concept implementation of a tool that uses automated traffic analysis to record and analyze potential leaks of personal data to third-party services. The current implementation of the tool is intended to detect URL leaks, and to specifically inspect how this happens in the search functionalities found on the analyzed websites. Our findings indicate that the automation of this kind of data collection is very effective, and could potentially increase the quality of the research significantly, as it allows for faster and more widespread data collection.</p>
      </abstract>
      <kwd-group>
        <kwd>Data leaks</kwd>
        <kwd>online privacy</kwd>
        <kwd>web security</kwd>
        <kwd>robotic process automation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rapid advancement of digitalization, reliance on electronic services for everyday tasks
keeps growing. Web technologies in particular have become popular due to their accessibility,
scalability and ease of installation and maintenance. At the same time, there are various essential
web services which involve processing sensitive personal data, posing a risk of data disclosure
to third-party entities like analytics services. To address privacy concerns, the General Data
Protection Regulation (GDPR) has been established. The GDPR regulates the processing of
personal data and provides individuals with greater control over their sensitive information.
As per the requirements of the GDPR, users must always be properly informed about personal
data collection.</p>
      <p>
        Many users are likely aware of analytics services embedded into websites to some extent, but
often they do not fully understand what kind of personal, potentially very sensitive data is sent
to third-party services when visiting websites. Privacy policy documents that are supposed to
shed light on the nature of collected personal data regularly use vague and unclear expressions,
often failing to inform users correctly [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Many web service developers also seem to be
oblivious to the dangers of using excessive numbers of third-party services on their websites. The
situation calls for increased attention to the privacy of web services and the implementation of
effective tools to assess potential data leaks.
      </p>
      <p>This paper describes an automated tool for detecting data leaks on web pages (the tool was developed
by the first author in 2023 and is available on GitLab at https://gitlab.utu.fi/crcarl/idaselenium-search-bar-tool),
and discusses both its potential uses and possibilities for further development. The motivation
for developing this tool arose from the needs of the IDA (Intimacy in Data-Driven Culture,
https://www.dataintimacy.fi/en/) research project. As part of the project, we have conducted
several studies on third-party data leaks on numerous popular websites and their connection
to user-given consent to data collection, or the lack of it.
In such studies, the acquisition of datasets for analysis is often slow and tedious,
involving a great deal of manual and repetitive work. Large portions of this labor could be automated,
allowing larger datasets to be collected in the same timeframe and making the data collection
process more systematic, enhancing the quality of the data.</p>
      <p>The rest of the paper is structured as follows. In Section 2, we give an overview of
third-party services, mainly web analytics tools, the reasons why they are used on websites,
and how this impacts user privacy. Section 3 presents a brief survey of the related research
and development that precedes our own work. In Section 4, the conceptual design of the data leak detection
algorithm is explained and the details of the actual implementation of the tool are presented.
In Section 5, we take a look at the potential future challenges and possibilities related to the
development process. Finally, in Section 6, we present our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Third-Party Services and Online Privacy</title>
      <sec id="sec-2-1">
        <title>2.1. Data Collection by Third-Party Services</title>
        <p>
          There has been a shift towards digital platforms and online business models over the past few
decades, accelerating significantly in recent years. Organizations have turned to web analytics to
gain insights about their customers, optimize operations and make data-driven decisions. Today,
various different third-party analytics services are embedded into websites in order to analyze
the behavior of users, measure the performance of websites and provide demographic information
about website visitors [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ].
        </p>
        <p>
          One important use of web analytics is conversion tracking [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ]. In online marketing,
conversion refers to a desired action taken by a website visitor, and represents a successful
outcome of a marketing strategy or the desired engagement with a website – something that
adds value to the company. With web analytics, websites can set conversion goals and track
user actions that indicate successful conversions. Examples include successfully persuading a
user to submit a form or make a purchase.
        </p>
        <p>
          Many noncommercial pages use the same third-party analytics services and track
"conversions" in the same manner, although the term conversion takes on a more generic meaning
here. As Bekos et al. note, "actions the website has configured to be tracked are defined as
conversions." [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] Understood in this sense, using the search function can also be considered a
conversion. Visits to individual pages are also often tracked. They can be used as part of
a process called funneling, which refers to tracking and analyzing the steps that users take
on a website to achieve a certain goal or conversion. Landing on an important page can be
considered a conversion as well.
        </p>
        <p>
          The fact that these actions are being tracked, often by several analytics services, naturally
causes privacy concerns. Sensitive searches, visits to delicate web pages, and other private
actions (say, purchasing a specific prescription medicine in an online pharmacy) leak to third
parties. Figure 1 depicts this situation. Third-party analytics services, shown as red boxes,
capture identifying information on the user (such as IP addresses and device identifiers) as well
as previously mentioned contextual data on the user’s visit and actions on the website. The
collected data is then sent to external servers of analytics companies, such as Google or Meta. In
a way, this setting resembles a man-in-the-browser attack [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], in which a third party stealthily
spies on network traffic without the user noticing anything. Usually, the web developers and
the organizations operating websites do not fully understand that such serious data leaks are
taking place in their services.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. An Example Illustrating the Need for Automatic Data Leak Detection in Privacy Research</title>
        <p>
          The challenge in researching the personal data leaks described previously on a larger scale is that
suitable automatic tools for this purpose seem to be missing. To illustrate the need for automatic
data leak detection in more detail, we shall take a look at a previous study conducted on
personal data leaks in online pharmacies operating in Finland [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In that study, it was observed
that 70% of the studied pharmacies leaked sensitive information to third parties, and 35% leaked
highly sensitive health-related data such as the specific prescription medicine the customer was
ordering. The data leaks happened mostly through analytics tools, for example Google Analytics,
deployed on the pharmacy websites for the exact purposes described earlier.
        </p>
        <p>The data collection for the study was conducted as follows: first, a researcher navigated to the
selected online pharmacy website, opened the Google Chrome Developer Tools (devtools from
this point onwards) and cleared the cache, after which the page was reloaded. The devtools
were then set to record all traffic between the online pharmacy website and any
third parties. The researcher then proceeded to make a dummy purchase: they searched for a
given prescription medicine, followed the links given by the site&#8217;s search function to the specific
medicine&#8217;s page, added it to their cart and proceeded to the check-out. No actual purchases were
made in the experiments, so the data trail ends at the point where the customer would be required
to have the law-mandated chat with an online pharmacist before the medicine purchase.</p>
        <p>During the test sequence explained above, on every step of the way, sensitive personal data
could potentially be leaked to third parties. In the tests the search term (usually a medicine
name) was regularly leaked. Data on visiting the product page of the given medicine was leaked.
Moreover, data on actually initializing an order for a specific prescription medicine was leaked,
even without making any actual purchase.</p>
        <p>This example illustrates both the cumbersome way this kind of research data is gathered and
the need for this kind of research in the first place. The compromising of user anonymity is
an ongoing and widespread issue, and one which happens all too often hidden from public
scrutiny. Automating large parts of the procedure for detecting such behavior on websites allows
researchers to conduct essentially similar surveys at a much larger scope in much less time,
yielding better statistical information on how prevalent these kinds of data leakages are.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>
        There are other applications designed to perform functions similar to our tool, such as the Website
Evidence Collector (WEC), which was developed at the behest of the European Data Protection
Supervisor (EDPS). Another somewhat similar tool, the ERNIE browser extension, was developed and used in research by
Wesselkamp et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The big difference between what our application or
the WEC does and ERNIE is that ERNIE is ultimately
meant to be used by the proprietors of websites to help them understand how tracking
happens in their own domains, so as to better comply with data privacy regulations. We have
used the WEC in previous studies [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], but found it wanting in several respects. For one, WEC
alone cannot process multiple websites with one command. Although a third-party
distribution of it is available which can do this, the inability to analyze several pages was not the
only shortcoming that led to the development of a tool of our own. The primary feature our
research needed, and WEC lacked, was the ability to automate interaction with the website
being inspected, which is the whole idea of the tool discussed in this paper. Another important
reason was that while it is possible to make WEC move from one page to the next, the network
traffic taking place on individual pages is not differentiated with enough clarity in the report it
generates.
      </p>
      <p>
        Another roughly similar application is OpenWPM (short for Open Web Privacy
Measurement), which was developed by Englehardt as a part of his doctoral thesis [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] at
Princeton University. The tool he devised is, like ours, based on the idea of automatically
processing websites and detecting the user tracking systems in use. However,
OpenWPM is more concerned with categorizing the different forms of tracking technologies
used, especially the different fingerprinting techniques used to track the
user, unlike our tool, which is aimed at identifying the actual data leaks that happen on
websites, and specifically in their search functionalities. Englehardt was inspired in his research
by earlier similar tools such as FPDetective, which was developed by Acar et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. FPDetective
appears to be quite similar to the later OpenWPM, although perhaps slightly more
limited in scope and modularity. Another somewhat similar application is TrackingObserver
(https://chrome.google.com/webstore/detail/trackingobserver/obheeflpdipmaefcoefhimnaihmhpkao),
developed by the University of Washington. It is a modifiable Google Chrome extension
designed to automatically detect user tracking. It is designed to be customizable, and exposes
several APIs for tracking detection, measurement, and blocking. The tool can be further modified
with installable add-ons.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Tool Design &amp; Implementation</title>
      <sec id="sec-4-1">
        <title>4.1. Algorithm Design</title>
        <p>The algorithm implemented by this tool is designed to process a list of websites, inspect each
page in the list for certain elements (div tags, input fields and interactable elements which have
the term "search" in their HTML tag attributes), add found elements to a list, then loop through
this list and attempt to interact with each element. If the element is an input field, this means
inputting a pre-determined search term. If the element is clickable, it is clicked. If the interaction
succeeds, the algorithm records all trafic between the website and third parties, then moves
to the next website on the list. This process is presented in detail in the state diagram for the
algorithm, shown in Figure 2. The algorithm is presented below in the pseudo-code notation.
</p>
        <preformat>
INITIALIZE web browser
READ list I from input file
FOR each website URL in list I:
    # Search for input/interactive elements with specific attributes
    IF input/interactive element with id/name/class indicating search EXISTS:
        APPEND element to list A/B
    FOR each div element:
        IF div with class/id/name indicating search EXISTS:
            FOR each child of div:
                IF child is input/interactive element:
                    APPEND child to list A/B
                ELSE IF child is a div:
                    RECURSIVE call (inspect this div element)
    # Process lists A and B
    WHILE there are unprocessed items in list A OR list B:
        # Process input elements in list A
        FOR each element in list A:
            TRY:
                INPUT predetermined search term and COMMIT
            IF successful:
                BREAK
        ELSE:
            # Process interactive elements in list B
            FOR each element in list B:
                TRY:
                    INTERACT with element
                    WAIT for predetermined time
                    INPUT search term
                IF successful:
                    BREAK
    WAIT for predetermined time
    RECORD traffic
TERMINATE algorithm
        </preformat>
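        <p>As a concrete illustration, the core control flow of the pseudo-code above, trying the input fields in list A first and falling back to the clickable elements in list B, can be sketched in plain Python. The function and parameter names here are illustrative and not part of the actual tool.</p>

```python
# Sketch of the A-then-B processing order from the pseudo-code above.
# try_input and try_click stand in for the Selenium interactions; they
# return True when the interaction succeeded without raising an error.
def process_search_elements(inputs, buttons, try_input, try_click):
    """Try each input field, then each clickable element; stop at success."""
    for element in inputs:            # list A: input fields
        if try_input(element):
            return ("input", element)
    for element in buttons:           # list B: clickable elements
        if try_click(element):
            return ("button", element)
    return (None, None)               # no usable search element found
```

        <p>In the real tool the two callbacks would wrap Selenium calls such as send_keys() and click(); here they are kept abstract so that the ordering logic stands out.</p>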
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation</title>
        <p>
          The actual implementation of the algorithm is developed in the Python programming language
with Selenium, a popular technology for browser automation and web testing. Selenium
allows robotic process automation (RPA) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] in the web environment, which makes it possible
to automate interactions with web browsers, enabling developers to simulate user actions
such as navigating through web pages, clicking buttons, and filling in and submitting forms [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
Selenium&#8217;s core component is Selenium WebDriver, which provides a programming interface for
interacting with web browsers and allows developers to automate browser actions by writing
code in their preferred programming language. Selenium WebDriver, designed to facilitate the
development of browser automation applications, makes it quite effortless to build software that,
for example, loops through chosen web pages and performs actions within them, as Selenium
offers an ample package of classes and functions designed specifically for these purposes. It also
has a decent level of integration with the Google Chrome Developer Tools, which allows for
easy automation of network recording. In addition to Selenium, the application imports
the pychrome package, which is also used in recording the traffic. The current implementation
consists of only 298 lines of code in total, which speaks to the expressiveness of the chosen technology.
        </p>
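        <p>As a brief sketch of how this integration typically looks, the following configuration, written against the standard Selenium 4 API, enables Chrome performance logging so that DevTools network events can be read from the driver. This is an assumed minimal setup for illustration; the tool's actual configuration may differ.</p>

```python
# Minimal, assumed setup for capturing DevTools network events via Selenium.
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chrome to expose its performance log (DevTools events) to Selenium.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
driver = webdriver.Chrome(options=options)

driver.get("https://example.com")
# Each entry's "message" field holds a JSON-encoded DevTools event,
# e.g. Network.requestWillBeSent with the request URL and headers.
entries = driver.get_log("performance")
driver.quit()
```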
        <p>On startup, the application initializes the Selenium WebDriver to use Google
Chrome. It then configures the Google Chrome Developer Tools to enable performance logging, and
initiates the main loop, which consumes a list of website URLs provided in an input
.txt file. The application then goes through the list and attempts to open each website in the list
using the Selenium WebDriver&#8217;s get(url) function. If successful, the application attempts
to perform certain actions on each website in a certain sequence. First, it calls the function
find_inputs(), which attempts to look for input, form or div elements with a correct
identifying attribute (an id, name or class that contains the term "search", "query" or
"haku") in the web page and append them to a list. If the element it finds is a div, it creates a
list of its children, which is recursively looped through until it finds an input inside one of the list
items. It then attempts to run the function find_buttons(), which looks for interactive or div
elements, referred to as buttons from this point onwards, currently defined as button, a and
span tags (the three most commonly used clickable elements) with a proper identifying attribute
(the same as in the inputs part) in which the string "close" is not present, and
appends them to a list. If the element it finds is a div, it creates a list of its children, which
is recursively looped through until it finds an interactive element inside one of the list items.</p>
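        <p>To make the recursive descent described above concrete, the following self-contained sketch reproduces the matching rule (an id, name or class containing "search", "query" or "haku", with "close" excluded) on a simplified node model. The node representation and helper names are ours for illustration; the actual tool operates on Selenium WebElements.</p>

```python
SEARCH_HINTS = ("search", "query", "haku")

def is_search_hinted(attrs):
    """True if id/name/class contains a search keyword but not 'close'."""
    for key in ("id", "name", "class"):
        value = (attrs.get(key) or "").lower()
        if any(hint in value for hint in SEARCH_HINTS) and "close" not in value:
            return True
    return False

def collect_search_elements(node, inputs, buttons):
    """Gather candidate inputs and buttons, descending into hinted divs."""
    tag, attrs = node["tag"], node.get("attrs", {})
    if tag == "input" and is_search_hinted(attrs):
        inputs.append(node)
    elif tag in ("button", "a", "span") and is_search_hinted(attrs):
        buttons.append(node)
    elif tag == "div" and is_search_hinted(attrs):
        for child in node.get("children", []):
            collect_from_hinted_div(child, inputs, buttons)

def collect_from_hinted_div(node, inputs, buttons):
    """Inside a search-hinted div, any input or clickable child counts."""
    tag = node["tag"]
    if tag == "input":
        inputs.append(node)
    elif tag in ("button", "a", "span"):
        buttons.append(node)
    elif tag == "div":
        for child in node.get("children", []):
            collect_from_hinted_div(child, inputs, buttons)
```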
        <p>After this phase has been completed, the application attempts to process these lists in the
following manner: first, if there are only input fields but no buttons found, it calls
the function input_search_term(), which commits a search with a predetermined arbitrary
term. If there are both inputs and buttons, it attempts to first use input_search_term() and
then the function click_search_button(), which attempts to interact with the found button
elements by first clicking the element, then, if this succeeds, waiting a second (for
the search input to appear) and then inputting the same search term as in the previous step. If
there are only interactive elements, it attempts to use only click_search_button(). If either
input_search_term() or click_search_button() or both yield results, in other words
if activating them does not result in an error, the ensuing traffic is recorded by the Google Chrome
Developer Tools and then outputted in JSON format into a log file (see Figure 3). If there
are no suitable elements, the application moves to the next website in the parameter list. When all websites
in the parameter list have been processed in this fashion, the application execution ends.</p>
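        <p>The "no error means success" rule described above can be captured in a small helper: an interaction counts as successful only if every step completes without raising an exception, mirroring how Selenium signals failed clicks and inputs. The helper name is illustrative, not the tool's own.</p>

```python
def try_interaction(*steps):
    """Run interaction steps in order; succeed only if none raises."""
    try:
        for step in steps:
            step()   # e.g. a wrapped element.click() or send_keys() call
        return True
    except Exception:
        return False
```

        <p>For example, the click-then-type sequence would amount to try_interaction(click_element, wait_a_second, type_search_term), and traffic would be recorded only when the helper returns True.</p>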
        <p>The current implementation of the application has been developed and tested on macOS,
but it would be effortless to write a version of it for other operating systems and devices too, as the
difference in practice is just two lines of code which define the version of the WebDriver used.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Test Results</title>
        <p>The application was tested by conducting several runs with it. In these test runs, the application
was given a batch of websites, after which the test results, i.e. the log files the application
created, were analyzed. Various characteristics were determined, such as the success rate and
the reasons for occasional failures. In our initial test, the success rate of the application
was roughly 90% for a test batch of 64 websites (58/64). While 90% may not sound like a
high percentage at first, the tool still reduces the manual work in our data leak analysis by 90% and
has proven very useful in research. Here, success refers to cases in which the application
managed to commit a search using an input field or managed to click an interactive element and
then carry out the search. A test run was also considered a success if the application managed
to open the website but no search-related elements were found (because they did not exist on the
website). The time the application used in interacting with a website was around 15&#8211;40
seconds, depending on how many elements it had to process.</p>
        <p>In our subsequent tests, which were conducted with a batch of 309 websites, the success
rate of the application dropped somewhat, but was still 79.3% (245/309). The
drop in the success rate was due to several problems which did not arise during the initial tests,
mainly owing to the nature of the larger batch, which happened to contain many similarly
built websites that of course produced the same problems. In these later tests, the median
time the application used for processing one website was 43 seconds. The instances in which
the application did not work as expected had several causes:</p>
        <list list-type="bullet">
          <list-item><p>Pop-up elements. On some of the studied websites, there were pop-up elements, such as cookie consent banners or other notification windows, that were overlaid on the search elements of the web page, which prevented the application from clicking the search button.</p></list-item>
          <list-item><p>Empty searches. On some of the studied websites, the application proceeded to push buttons that initiated an empty search for some reason.</p></list-item>
          <list-item><p>Predetermined search terms. Some websites use input fields that allow the user to search only with predetermined terms, and inputting anything else simply causes the search not to commit. Regardless of whether the inputted term is hard-coded into the program or inputted by the user, this sometimes leads to situations in which the application cannot function correctly.</p></list-item>
          <list-item><p>Responsiveness of the web pages. On some of the tested websites, the responsive layout forced the website to open in mobile view, mainly because of the default size of the browser window opened by the application, which sometimes placed the search functionality outside the reach of the application.</p></list-item>
          <list-item><p>Unusually complex search functionality. Some websites used search functionalities that were more complex than others, and sometimes this caused the application to fail.</p></list-item>
        </list>
        <p>Possible solutions to these situations are further explored in Section 5.</p>
        <p>The log files outputted by the application are currently written only in JSON format, which
is not the best possible option for human readers, but is sufficient for the current needs of the
research. The application goes through the log file recorded by the Google Chrome Developer
Tools and picks out the values of the "method" key in each network object, which it then prints to the
JSON file. In addition to this, the JSON file contains several technical details, and notifications
of when the interception of data by third parties happened.</p>
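        <p>To illustrate the extraction step described above, the following sketch pulls the "method" values out of Chrome performance-log entries, whose "message" field is a JSON-encoded DevTools event. The helper name is ours; the tool's actual parsing may differ in detail.</p>

```python
import json

def extract_network_methods(entries):
    """Collect DevTools Network.* method names from performance-log entries."""
    methods = []
    for entry in entries:
        # Selenium wraps each DevTools event as JSON under the "message" key.
        event = json.loads(entry["message"])["message"]
        method = event.get("method", "")
        if method.startswith("Network."):
            methods.append(method)
    return methods
```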
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Challenges and Future Improvements</title>
      <p>The most obvious challenge in developing a data leak detection tool like the one we have described
is in defining what kinds of HTML tags it should inspect for the purpose of
finding interactable elements or input fields, as in the modern website ecosystem these are
not necessarily obvious. For example, many elements that appear as "buttons" on
websites are actually links, or even span-type elements. This is largely because the classical
button element comes with a predetermined appearance and functionality, which in the modern,
constantly evolving landscape of web development does not necessarily serve the purpose it had
in the past. Developers thus often choose more flexible alternatives, since it is easy
to make other elements look like "buttons" with CSS styling and to add interactivity to
them with JavaScript.</p>
      <p>
        Another major developmental challenge lies in the fact that the elements being looked for
might not, and indeed often do not, have a correct id, name or class attribute that
properly identifies the element in question as a "search" element. In some cases they might
not have any kind of verbal indicator of their nature, being represented, for example, only by a
magnifying glass icon that is not named descriptively. This is especially
challenging in cases where the search input field is hidden behind a collapsible element,
accessible only through an interactive element that lacks any identifiable attributes. In some
cases the elements sought are not present in the DOM (Document Object Model) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] at
all (in other words, they do not exist yet) until some interactive element is clicked, which
makes it very hard to detect the sought-after element if the connected interactive elements have
misleading attributes.
      </p>
      <p>While the application worked well in our testing, it had several functionality issues that
demand further development. On some of the tested websites the application could not click the
search button it detected, as there was another element, usually a cookie banner, overlaid above
it in some way. Another problem was that in some cases clicking the search button initiated
an empty search, which produced no results. The third issue was that the websites sometimes
deploy search fields that accept only certain inputs, and this cannot be detected properly
by automation. In addition, there were other, slightly more specific issues,
such as responsiveness features occasionally interfering with the application and complex
search functionalities which the application could not use. Several different solutions to these
challenges have been considered, and we plan to implement them in the future.</p>
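      <p>As one possible mitigation for the overlay problem, sketched here with a stand-in exception class (in real Selenium code one would catch ElementClickInterceptedException), a JavaScript click can be used as a fallback, since it is not blocked by elements overlaid on the target. This is an illustration of the idea, not the tool's current behavior.</p>

```python
class ClickIntercepted(Exception):
    """Stand-in for Selenium's ElementClickInterceptedException."""

def click_with_fallback(driver, element):
    """Try a normal click; if an overlay intercepts it, click via JavaScript."""
    try:
        element.click()
    except ClickIntercepted:
        # A JS click bypasses the hit-testing that overlays interfere with.
        driver.execute_script("arguments[0].click();", element)
```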
      <p>For the reasons above, we have considered adding machine learning aspects
to this application in the future. For example, elements whose search functionality is indicated
only by an icon could be identified with a properly trained neural network. Using artificial
intelligence could also help in avoiding some of the pitfalls mentioned in the preceding paragraph.
Another area where artificial intelligence technology could prove useful is in automating
the categorization and analysis of the results gathered from websites, thus hastening the
actual research considerably. Furthermore, another avenue of further development would be a
mechanism to detect the language of the website in question, and then add the word "search" in
this language to the criteria for elements to be looked for. Currently, the application outputs
the data it collects only in JSON format, which is human-readable but still onerous to analyze.
This could be developed further to produce an easily readable output like the Website Evidence
Collector mentioned in Section 3 does.</p>
      <p>The current implementation of the algorithm is very specifically meant to study only the
data leaks happening in the search functionality of the websites in question. However, it would
be quite easy to use this algorithm and codebase to, for example, make the application simply go
through the given websites and click all clickable elements, and then record what kind of data
was leaked to third parties while doing so. With minor alterations this tool could also be used
in other research projects, basically in any studies that demand large amounts of statistical
data about websites, such as how prevalent certain content types, elements or cookies are. The
basic structure of the algorithm, looping through a given list of websites
and performing actions on them, could easily be adapted to retrieve all kinds of information
from the sites, which could be put to use in many non-technical sciences that study
human behavior on the internet. For example, this concept could be used to scrape all posts
with certain keywords or topics from a list of discussion forums or social media platforms at
once, or to find all items with a specific keyword from several different webstores. It could also
be implemented in such a way as to access video streaming websites and search for specific
types of videos based on their tags. However, in order to be useful for research interests other than
computer science, the implementation should include a graphical user interface, as it
currently has none. Adding a GUI and some modularity, for example the parametrization
of the terms the algorithm uses in detecting the wanted types of elements to interact with, to
make the application more usable for other research interests has been proposed in our internal
discussions and will be implemented in the future.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we have presented a proof-of-concept implementation of a tool employing
automated traffic analysis to record and analyze potential leaks of personal data to third-party
services. This makes it easier to systematically gather large datasets for scientific research on
potential data leaks caused by third-party services embedded in websites. The algorithm in its
current form provides a good basis on which to improve further in the future, for example by
adding machine learning to aid in searching for the wanted elements and to empower the
analysis of the results. Currently, the test results indicate that the application is not infallible,
but is still quite capable of detecting the correct interactable elements on websites, activating them
and recording the traffic with third parties if it happens. We intend to continue the development of
the application in the future, improving its usability and performance. The objective is to make
it a useful tool not just for the needs of our data leak research project, but for other scientific
interests as well.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research has been funded by the Academy of Finland project 327397, IDA – Intimacy in
Data-Driven Culture.</p>
    </sec>
  </body>
  <back>
  </back>
</article>