<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pharmaguard WebApp: an application for the detection of illegal online pharmacies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matteo Contini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igino Corona</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Mulas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Giacinto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Ariu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIEE, University of Cagliari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a demo of PharmaGuard, a novel system for the automatic discovery of illegal online pharmacies. With its easy-to-use graphical user interface, its web application architecture and its use of automatic knowledge discovery, PharmaGuard can assist law enforcement agencies in identifying, blacklisting and shutting down illegal pharmacies.</p>
      </abstract>
      <kwd-group>
        <kwd>Detection of Illegal Pharmacies</kwd>
        <kwd>Search Engines</kwd>
        <kwd>Pattern Classification</kwd>
        <kwd>Human-Machine Interaction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>PharmaGuard has been designed for two kinds of users: law enforcement operators
and academic researchers. The average user is a computer-literate person with
a simple goal: to check a list of suspected web pages, inspect their content and
classify them. Web pages within the system are classified as belonging to one
and only one of these categories:
- ILLEGAL Pharmacy: the web page is considered to be a full-fledged
illegal pharmacy, or is at least involved in the illegal selling of drugs.
- Legal Pharmacy: the web page is either a legal online pharmacy, or there
is not enough "proof" to establish whether it is legal or not.
- Other: the web page is not involved in any kind of drug selling at all.
- ?: the web page is hard or impossible to evaluate because of the nature of
its content (e.g. the content is written in an unknown foreign language).
- Pharmacy Advertisement: the web page is not involved in drug selling
but rather in advertising similar content. Web pages that belong to
this category can link to real online pharmacies.</p>
      <p>The system classifies web pages using only three labels: ILLEGAL Pharmacy,
Legal Pharmacy and Other. Human users, when changing the classification
of a web page, can use all five labels. The system also uses an additional flag called
"useful/not useful". By default, the system sets this flag to "useful" for every classified
web page; users can change the flag's value at any time. The following paragraphs
explain the role of this flag in the actual discovery of potentially illegal online
pharmacies. Users access the PharmaGuard system through a Graphical User
Interface (GUI from now on). The system provides two different access levels:
Validator and Administrator; the former can classify and delete web pages, the
latter can also create new users.</p>
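      <p>The labeling rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the PharmaGuard code: the names Label, SYSTEM_LABELS, PageLabel, system_classify and human_relabel are assumptions introduced here. It shows that the automatic classifier is restricted to three labels, that a human validator may use all five, and that the "useful" flag defaults to true.</p>
      <preformat>
```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Label(Enum):
    """The five labels available to a human validator."""
    ILLEGAL_PHARMACY = "ILLEGAL Pharmacy"
    LEGAL_PHARMACY = "Legal Pharmacy"
    OTHER = "Other"
    UNKNOWN = "?"
    PHARMACY_ADVERTISEMENT = "Pharmacy Advertisement"


# The automatic classifier only ever emits these three labels.
SYSTEM_LABELS = frozenset(
    {Label.ILLEGAL_PHARMACY, Label.LEGAL_PHARMACY, Label.OTHER}
)


@dataclass
class PageLabel:
    """Hypothetical record holding one page's label and flag."""
    url: str
    label: Label
    useful: bool = True  # the system sets "useful" by default


def system_classify(url: str, label: Label) -> PageLabel:
    """Store an automatic classification, rejecting human-only labels."""
    if label not in SYSTEM_LABELS:
        raise ValueError(f"{label.value!r} can only be assigned by a human")
    return PageLabel(url=url, label=label)


def human_relabel(page: PageLabel, label: Label,
                  useful: Optional[bool] = None) -> PageLabel:
    """Apply a validator's decision; any of the five labels is allowed."""
    page.label = label
    if useful is not None:
        page.useful = useful
    return page


if __name__ == "__main__":
    page = system_classify("http://example.test/", Label.OTHER)
    human_relabel(page, Label.PHARMACY_ADVERTISEMENT, useful=False)
    print(page)
```
      </preformat>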
      <p>The following list describes a typical usage scenario for a Validator user:
1. The user opens the login page and logs in to the system.
2. A list of analyzed web pages is shown.
3. The user selects one web page from the list.
4. The system shows complete information regarding the selected web page.
5. The user can either change or confirm the current label.
6. The user presses the save button.
7. The user continues from point 2 or leaves the system.</p>
      <p>It is worth noting that the user is shown a complete list of all web pages: new
ones with a label suggested by the system, and previously labeled ones that have
already been validated by a human user (not necessarily the current user). Once
a web page is selected, the user can see the following information:
- Results of automatic classification: the labels assigned by the system.
- URLs: the web page's initial and final URLs (these differ in case of redirection).
- Download time: a timestamp of the web page's download.
- Network source: TTL, IP address and Autonomous System information.
- Label: the GUI element for changing the web page's classification.
- Checked by: the list of users that have previously checked the web page.
- Useful: the GUI element for changing the web page's "useful" flag.
- Snapshot: the web page's snapshot.
- Page: the web page's HTML.</p>
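      <p>The information listed above maps naturally onto a single record per web page. The following Python sketch is one possible shape for such a record; the class name PageDetails, its field names and the validate method are assumptions made for illustration and are not taken from the PharmaGuard code.</p>
      <preformat>
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class PageDetails:
    """Hypothetical per-page record mirroring the fields shown in the GUI."""
    initial_url: str
    final_url: str
    downloaded_at: datetime
    ip_address: str
    ttl: int
    autonomous_system: str
    system_label: str          # result of automatic classification
    label: str                 # current label, editable by a validator
    snapshot_path: str
    html: str
    useful: bool = True
    checked_by: List[str] = field(default_factory=list)

    @property
    def was_redirected(self) -> bool:
        """True when the initial and final URLs differ."""
        return self.initial_url != self.final_url

    def validate(self, username: str, label: str,
                 useful: Optional[bool] = None) -> None:
        """Record a validator's decision and remember who checked the page."""
        self.label = label
        if useful is not None:
            self.useful = useful
        if username not in self.checked_by:
            self.checked_by.append(username)


if __name__ == "__main__":
    page = PageDetails(
        initial_url="http://shop.example.test/",
        final_url="http://pills.example.test/home",
        downloaded_at=datetime.now(timezone.utc),
        ip_address="192.0.2.10",
        ttl=300,
        autonomous_system="AS64500",
        system_label="ILLEGAL Pharmacy",
        label="ILLEGAL Pharmacy",
        snapshot_path="snapshots/1.png",
        html="(downloaded markup)",
    )
    page.validate("validator1", "ILLEGAL Pharmacy")
    print(page.was_redirected, page.checked_by)
```
      </preformat>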
      <p>The user's manual classification is important for the system and heavily influences
its performance. Webpage Finder searches the web for new content according
to a similarity criterion, exploiting the ability of search engines to suggest
content similar to that of a given URL. PharmaGuard stores all previously
analyzed web pages; all those labeled as ILLEGAL Pharmacy and flagged as
Useful are used as input for search-engine-based queries. This is how the system
finds new web pages to analyze. The system's ability to find illegal pharmacies
depends on previous analyses: the more the currently classified web pages are samples of
actual illegal pharmacy pages, the better the system will be at finding potentially
illegal new ones. While the system's ability to discover new potentially illegal pages
is tied to the quality of previous analyses, feature extraction
and classification are not. In other words, users have an impact only on
the system's performance in discovering new content, while automatic classification
is unaffected. The system's behavior can be seen as a sequence of steps: download a
web page, extract information and features, classify, store the result, show the
result, create a list of web pages to download, repeat. Although currently not
completely divided into separate components, the system has been designed to be
modular. The system's functionalities can be grouped into blocks according to the
previous sequence and mapped to the components described in this short list:
- Scheduler: this component starts the workflow loop, providing the Browser
and Metadata Extractor components with a list of web pages to analyze.
- Browser: this component downloads a given web page's HTML and creates
a snapshot for visual inspection. Browser is based on the Selenium WebDriver
web automation tool and guarantees that the inspected web page's content is
exactly the same as a human user would see in their browser. The downloaded
HTML is the one actually "visible" from the browser and takes into account
all changes to the DOM, such as content injected by JavaScript (e.g. via AJAX). The downloaded
HTML is sent to the Feature Extraction component for further analysis.
- Metadata Extractor: given a web page's URL, this component collects
additional information such as IP address, TTL, Autonomous System Number, etc.
- Feature Extraction: this component takes an HTML document as input and provides
a list of extracted features as output.
- Classification: based on a sequence of two different classifiers, this
component classifies a given web page as either a legal online pharmacy,
an illegal one, or a totally unrelated web page that is not a pharmacy at
all. These labels are the result of the combined work
of the two classifiers: the first one, called PHARMA vs OTHER, decides whether a
web page is an online pharmacy or not; the second one, called PHARMA vs
PHARMA, decides whether an online pharmacy is legal or illegal. The
final result is sent to the Core component. Note that the second
classifier runs only if the first one has detected an online pharmacy.
- Webpage Finder: modern search engines offer many advanced search
functionalities, among them "related content search". Webpage Finder
takes all previously analyzed web pages that have been labeled as ILLEGAL
Pharmacy and are currently flagged as Useful, then feeds them to search
engines in order to find "related" web pages. Webpage Finder's output is
used by the Scheduler component.
- Core: this component is based on the Django framework, uses an MVC
architectural pattern and takes care of data persistence, communication between components
and GUI requirements. The Core component uses a relational database to store
web pages' classification labels, downloaded content and all the
information necessary for user login and session handling.</p>
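      <p>Two of the components above lend themselves to a short illustration: the two-stage Classification cascade and the seed selection performed by Webpage Finder. The Python sketch below is written under stated assumptions and is not the PharmaGuard implementation: the function names classify_page and select_seeds, the representation of a classifier as a callable returning a boolean, and the toy threshold classifiers are all hypothetical. The actual search-engine query is deliberately left out.</p>
      <preformat>
```python
from typing import Callable, Iterable, List, Sequence, Tuple

Features = Sequence[float]
# Assumption: each classifier is a callable returning True or False.
BinaryClassifier = Callable[[Features], bool]

ILLEGAL = "ILLEGAL Pharmacy"
LEGAL = "Legal Pharmacy"
OTHER = "Other"


def classify_page(features: Features,
                  pharma_vs_other: BinaryClassifier,
                  pharma_vs_pharma: BinaryClassifier) -> str:
    """Two-stage cascade: the second classifier runs only on pharmacies."""
    if not pharma_vs_other(features):
        return OTHER  # not a pharmacy, so the second stage is skipped
    return ILLEGAL if pharma_vs_pharma(features) else LEGAL


def select_seeds(pages: Iterable[Tuple[str, str, bool]]) -> List[str]:
    """Pick the URLs to feed to a related-content search.

    Each item is (url, label, useful); only pages labeled ILLEGAL Pharmacy
    and still flagged as useful are kept.
    """
    return [url for url, label, useful in pages
            if label == ILLEGAL and useful]


if __name__ == "__main__":
    # Toy stand-ins for trained classifiers (purely illustrative).
    def is_pharmacy(f: Features) -> bool:
        return f[0] > 0.5

    def is_illegal(f: Features) -> bool:
        return f[1] > 0.5

    print(classify_page([0.9, 0.8], is_pharmacy, is_illegal))
    stored = [
        ("http://a.example.test/", ILLEGAL, True),
        ("http://b.example.test/", ILLEGAL, False),
        ("http://c.example.test/", LEGAL, True),
    ]
    print(select_seeds(stored))
```
      </preformat>
      <p>In this sketch, setting a page's flag to "not useful" is enough to remove it from the seeds without changing its label, which matches the role of the flag described earlier.</p>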
      <p>We have already described the system's purpose and main components, and we have also
presented a typical usage scenario for a human user. We can finally describe a daily
"workflow" from the machine's point of view, a set of steps required in order to
provide the final user with a list of classified web pages:</p>
      <p>Illegal online pharmacies, despite the efforts of law enforcement agencies (LEAs), are still an
unsolved problem. The substances sold are not just a source of income for criminals but
also a real threat to the health of the buyers. In this work, we presented
PharmaGuard, a tool for LEAs that can be helpful in many ways:
1. Finding "never-seen-before" illegal pharmacies.
2. Keeping track of all already seen web pages as a reference.
3. Easing the collaboration of multiple agents.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>I.</given-names>
            <surname>Corona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Contini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giacinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Roli</surname>
          </string-name>
          , M. Lund, and G. Marinelli,
          <article-title>PharmaGuard: Automatic Identification of Illegal Search-Indexed Online Pharmacies</article-title>
          , IEEE International Conference on Cybernetics - Special session Cybersecurity (CYBERSEC),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>