Block and Roll: A Metric-based Evaluation of Reputation Block Lists

Siôn Lloyd, Carlos Hernandez-Gañán and Samaneh Tajalizadehkhoob
ICANN


Abstract

Reputation Block Lists (RBLs) serve as a common defense mechanism against harmful and unwanted internet content. These lists contain the IP addresses, domain names, or full URLs of known sources of spam, phishing, malicious sites or other unwanted content. By using RBLs, internet service providers, email providers, and other organizations can safeguard their users from online threats. RBLs are also used in academic research and as training sets for machine learning models. To help evaluate and understand the effectiveness of RBLs, this paper presents a set of metrics that can be used to evaluate and characterize them. These metrics include RBL focus, mechanics, metadata, volume, overlap, timeliness, and churn. We categorise the metrics into four groups: a general description; metrics that can be directly measured; metrics that can be indirectly measured; and metrics that can only be discovered second-hand. When it comes to RBLs there is no “one size fits all”. We argue that understanding the strengths and weaknesses of any one RBL, or set of multiple RBLs, is key to getting a good fit for a particular use-case. To maximize the benefit of RBLs, we suggest combining two or more to get a fuller picture than can be provided by any single RBL.

Keywords

DNS abuse, blocklists, phishing, malware


APWG.EU Tech 2023, June 21–22, 2023, Dublin, IE
sion.lloyd@icann.org (S. Lloyd); carlos.ganan@icann.org (C. Hernandez-Gañán); samaneh.tajali@icann.org (S. Tajalizadehkhoob)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Domain name and IP address reputation lists have been used for many years as a way to identify and block potentially harmful or unwanted traffic on the internet. The earliest known reputation list was created by Paul Vixie in the 1990s, and was called the “Real-time Blackhole List” or RBL [1]. This list contained the IP addresses of known spam sources and was used by mail servers to block incoming email from those sources. Over time, similar lists were created for other types of online activities, such as domain or URL reputation lists for identifying malicious or phishing websites, and IP address reputation lists for identifying sources of malware or other online threats. Today, these lists are widely used by internet service providers, email providers, and other organisations to help protect their users from online threats. They continue to evolve and improve as new threats emerge and new technologies are developed to combat them.

We refer to these sources as “Reputation Block Lists” or RBLs; others may call them by slightly different names like “threat intelligence”, “security feeds”, “abuse feed” or similar. They can contain different identifier types: domain names, IP addresses or full URLs, and in many cases a mixture of two or more identifier types. They can also specialise in particular threat types, like spam, phishing, malware, etc.; or they may contain a mixture of multiple threat types. They can differ in collection methodology, licensing, distribution method, intended use and almost every other conceivable way.

RBLs are used in many different scenarios, some more obvious than others. For example, services like Google Safe Browsing (https://developers.google.com/safe-browsing) can be thought of as an RBL protecting a browser user from known phishing sites. The academic community also utilises RBLs to understand the current and historical reputation of domain names in various types of analysis, to measure security threat concentrations within internet intermediaries such as TLDs, registries, registrars or hosting providers, and finally to assess the mitigation strategies of internet intermediaries [2, 3, 4, 5, 6, 7, 8, 9].

In many cases the use of this data is not necessarily aligned with how the producers intend it to be used, and so its suitability may not be clear. In other cases conclusions drawn from analysis based on this data do not necessarily reflect the specifications and limitations of the data. Moreover, for all use-cases it is hard to know if the RBL being used is the best fit, if there is a better option or if a combination of two or more RBLs would add enough benefit to justify any extra cost. Note that


the cost can be in terms of time and complexity as well as financial, so even free open-source feeds have some cost associated with them.

Misalignment with the intended use can have a significant impact on a project. For example, an RBL which contains low-confidence or unvetted entries could result in an appreciable number of incorrect entries, known as false positives. Such a data feed might be perfectly acceptable if used to protect a small network where the mitigation of incorrect entries has a low associated cost. However, the same RBL may not be suitable for an application where a false positive results in a time- and resource-consuming investigation.

2. Objectives

Given the problem introduced above, in this document we propose a method to evaluate and characterise an RBL; not just in isolation but also in how multiple RBLs complement one another. We’ll look at the general description of the RBL; things we can measure directly; things that we can make approximations of; and things that we can only discover second-hand. We’ll also discuss the implications and limitations of these measurements.

This work has been informed by earlier examples [10, 11, 12, 13, 14, 15, 16, 17], but we have kept or modified parts of their suggested methods to suit our requirements. As such, our approach is grounded in the projects that we have been involved with; other parties with other experiences may well have other metrics which they regard as important.

To move towards evaluating an RBL, or group of RBLs, we propose metrics that help measure multiple aspects of a list. We then demonstrate the methods by which the metrics can be measured. Recall though that this work is based on the sources that we are already familiar with; it is likely that other RBLs have features which will require modifications to this method.

We will not discuss here the steps required to read RBL data as this will vary between RBLs. We do show, in Appendix A, the database schema that we use to harmonise data into a single, consistent format. All of the RBL data we read is written into this structure, although it has had to evolve as new RBLs with new fields have been added.

Finally, there are some things that we are explicitly not trying to measure. We are not looking to put a score on an RBL or say that one is demonstrably better than another; we want to increase our understanding of RBL data used regularly by us and our community, so that we can either use them with confidence, or understand why they are not suitable for a particular project. We also do not consider cost or licensing terms here, although these could be significant factors in any decision on whether to use an RBL. Lastly, we are not aiming to evaluate the absolute effectiveness of our RBLs, as existing work has already looked at that aspect [10, 11, 12, 13, 14].

The metrics we use are listed below and described in more detail in section 3.

2.1. General RBL Description

These are characteristics that we will know before we start to ingest data into our system. It may be a feature that initially brought the RBL to our interest, maybe to fill an identified gap in our other RBLs. We also include details that we need to know during the integration of an RBL into our system, like how it is distributed and what data it contains.

   • RBL focus - what entry types does it contain (spam, phish, etc.)
   • RBL mechanics - how is the RBL disseminated, what format is the data in, etc.
   • Metadata - does the RBL provide more context on list entries, like malware family, phished brand, etc.

2.2. What We Can Measure Directly

These are the metrics we can measure directly based on the information provided in the RBLs.

   • Volume - how many entries are present
   • Overlap - how many entries in one RBL are in common with other RBLs
   • Timeliness - how quickly do entries appear (compared to other RBLs)
   • Churn - how dynamic are the entries

2.3. What We Can Measure Indirectly

These are metrics that can be measured indirectly from the data. This covers cases where we have to sample the data in order to get an approximation of the answer, or where we have surrogate measurements in lieu of the thing that we actually want to investigate.

   • Liveliness - how many entries are “active”
   • Purity - how many are potential false positives
   • Accuracy - what proportion of stated threat types match reality
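
As an illustration of the sampling idea behind the indirect metrics, the following is a minimal Python sketch, not the authors' implementation. It draws a simple random sample of entries, applies a caller-supplied check, and reports the observed share together with an approximate confidence interval. The function name, its parameters and the `is_active` callback (which could wrap a DNS resolution check, for instance) are assumptions made for this example, and the Wilson score interval is a choice made here rather than something the paper specifies.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def estimate_active_share(
    entries: Sequence[str],
    is_active: Callable[[str], bool],
    sample_size: int,
    seed: int = 0,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Estimate the share of "active" entries from a simple random sample.

    Returns (point_estimate, lower, upper). The bounds form a Wilson score
    interval; z=1.96 corresponds to roughly 95% confidence.
    """
    if not entries:
        raise ValueError("entries must not be empty")
    n = min(sample_size, len(entries))  # cannot sample more than we have
    if n <= 0:
        raise ValueError("sample_size must be positive")

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = rng.sample(list(entries), n)
    active = sum(1 for entry in sample if is_active(entry))

    p = active / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical data: pretend every fourth domain is still active.
    domains = [f"example-{i}.test" for i in range(10_000)]

    def fake_check(domain: str) -> bool:
        return int(domain.split("-")[1].split(".")[0]) % 4 == 0

    print(estimate_active_share(domains, fake_check, sample_size=400))
```

The interval width gives a rough sense of whether the sample is of "sufficient size": if the bounds are too far apart for the decision at hand, a larger sample is needed.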
Figure 1: Number of unique domains seen over a fixed period of time
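
Counts like those plotted in Figure 1 could be produced along the following lines. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `Observation` tuple layout and both function names are hypothetical, and the code counts distinct hostnames. Reducing hostnames to registrable domains would additionally need a Public Suffix List lookup, which is omitted here. The point it illustrates is that URL entries and bare domain entries must be reduced to the same unit before volumes are compared.

```python
from datetime import date
from typing import Dict, Iterable, Set, Tuple
from urllib.parse import urlsplit

# Hypothetical harmonised observation: (rbl_name, raw_entry, date_seen)
Observation = Tuple[str, str, date]


def hostname_of(entry: str) -> str:
    """Return the lower-cased hostname for a URL or a bare domain entry."""
    candidate = entry.strip()
    # urlsplit only fills in .hostname when a "//" authority marker is present.
    if "://" not in candidate:
        candidate = "//" + candidate
    try:
        host = urlsplit(candidate).hostname or ""
    except ValueError:
        # Malformed entries (e.g. broken IPv6 literals) are skipped.
        return ""
    return host.rstrip(".")


def unique_hosts_per_rbl(
    observations: Iterable[Observation], start: date, end: date
) -> Dict[str, Set[str]]:
    """Collect the distinct hostnames each RBL reported in [start, end]."""
    seen: Dict[str, Set[str]] = {}
    for rbl, entry, day in observations:
        if start <= day <= end:
            host = hostname_of(entry)
            if host:
                seen.setdefault(rbl, set()).add(host)
    return seen


if __name__ == "__main__":
    sample = [
        ("feed-a", "http://Example.com/login", date(2023, 1, 5)),
        ("feed-a", "https://example.com/other", date(2023, 1, 6)),
        ("feed-b", "example.com", date(2023, 1, 8)),
    ]
    hosts = unique_hosts_per_rbl(sample, date(2023, 1, 1), date(2023, 1, 31))
    for name, found in sorted(hosts.items()):
        print(name, len(found))
```

Keeping the sets rather than only the counts is deliberate: the same per-RBL sets can be reused for other set-based comparisons between feeds.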


2.4. What We Cannot Measure

These are characteristics which cannot be derived from the data itself, but are discovered based on second-hand information. For example, we look at the documentation for the RBL, consult FAQs or talk to the RBL providers to get this information.

   • Catchment - are there geographic blindspots, collection method gaps (e.g. no mobile data), etc.
   • Entry retesting - how frequently are entries retested to check if they should still be present on the RBL
   • Reliability - is the data always available or are there issues transferring

3. Method

Looking in more detail at the metrics outlined above, in the remainder of the paper we demonstrate what we measure and, where appropriate, how we might use visualisations, keeping to our four categories.

3.1. General RBL Description

RBL Focus. The first thing to consider is the threat types that the RBL contains. Does it focus on a single threat type or contain multiple types? How does this relate to any other RBLs in our set; does it fill a known gap?

RBL Mechanics. A prosaic but significant issue is how we can read the RBL and merge it into our larger dataset. We need to understand what delivery mechanism is used, whether there is an API, and whether we get the data formatted as CSV, JSON, etc. Also, when we read the RBL does it provide the whole current list or a stream of new entries (a list of “point in time” observations)? In the case of the latter, the decision on how long an entry remains active is made by us rather than the provider.

Metadata. It can be useful to have context around a particular entry, and some RBLs provide more information, like a timestamp for when the entry was added, the malware family seen, the brand being phished and so on. Another useful data point is whether the entry is believed to be a malicious registration or a compromised but otherwise legitimate registration. All of this forms the metadata of an observation.

3.2. What We Can Measure Directly

Volume. Possibly the easiest measurement to make is how many entries are present. Some care is needed, however, that the same thing is being measured in each case. For example, some RBLs contain just domains while others may contain URLs, but of course multiple URLs may well map to a single domain. We look at unique entries over a period of time, preferably a month or more, to give as good a representation as possible. This is particularly significant for those RBLs which provide a stream of new entries and so don’t have the concept of a “current list”. If we look at unique domains we see something like Figure 1. We can also produce similar figures showing unique hosts, URLs, domains broken down by different threat types, etc.

Higher volumes are, in general, desirable; this is not, however, the whole story. For example, the DGArchive [18] data is based on enumeration of domain generation algorithms, and so the majority of those entries may never be registered. It is therefore arguable that we are not comparing like with like when set against other RBLs; we look to address issues
like these later on. It is also true that some threats are more serious or active than others and so some entries offer more “value”.

Overlap. If we are looking to add a new RBL into our existing data, it is interesting to know how many entries are in common with our current data. Again we need to aggregate over a period of time and be careful to compare apples with apples. It may also be instructive to see different threat types separately. One simple measure is the overlap of unique domains shown in Figure 2.

Figure 2: Overlap of unique domain entries seen between RBLs over a fixed period of time

This shows how much of one RBL is contained within another (and vice versa). For example, if we look at SURBL and OpenPhish we can see that SURBL contains 0.85 (85%) of OpenPhish. However, OpenPhish contains just 0.015 (1.5%) of SURBL; while the absolute number of domains in common is the same, the difference is the underlying size of each RBL.

The view shows us some other interesting features; while the majority of overlaps are small, less than 5% or so, there are some which are much higher. This is where open sources are being read and incorporated into other RBLs, presumably after being validated to the required standard for that RBL. This could be significant if entries on multiple RBLs are being taken as multiple independent observations, when they may in fact stem from a single original source.

Timeliness. The view above is interesting, and shows some cross-pollination between RBLs, so the next question is: where two or more RBLs have the same entry, which gets it earlier and by how much? To this end we look at the time delta between an entry appearing on the “base feed” that we are considering and on any other RBL; this gives us visualisations like that shown in Figure 3.

Here entries with a negative time show the base feed leading other RBL entries, whereas a positive time indicates it lagging behind. So ideally we want to see more weight to the left of the graph, indicating that the RBL being considered is consistently getting entries earlier than others.

Figure 3: Where overlap is seen we can show if our considered RBL saw the domain earlier or later than the others

Churn. For the RBLs that provide their whole current set of entries on each read, it is also useful to know how dynamic the list is. If an RBL’s volume stays the same as the previous iteration, is it because the list is static, or is it because as many entries
are being removed as are being added? To this end we can consider a single RBL over a period of time and plot its volume along with the number of new entries and removed entries, as shown in Figure 4.

Figure 4: Volume over time for a single RBL along with the number of additions and deletions

Note that removing stale entries which are no longer active threats can be as important as adding new entries, but is often not considered. To this end we can also look at the histogram of the ages of entries; see Figure 5 and note the log y-scale. Figure 5 shows a healthy mix where the majority of entries have a short lifespan of days or weeks, with a small number being on the RBL for a year or more.

Figure 5: Histogram of entry ages for a single RBL, note the log scale

This analysis gives us more insight into how active the RBL is, how many new threats are being added and how many old threats are being removed. A higher churn reflects a more active RBL and so is seen as a positive feature. For those feeds which just provide “point in time” observations this analysis is not so relevant, although we can still look at the volumes of new threats being added.

3.3. What We Can Measure Indirectly

Liveliness. Above we measured the volume of entries on an RBL. However, it is also interesting to know how many of those entries are “active”. There may be entries which no longer resolve, or have been mitigated in other ways (for example, some registrars take control of the domain and “park” it).

We would struggle to capture this information for every entry on a sizeable RBL, and once we had finished we would need to start again to catch any new entries or changes in existing ones. One way to tackle this would be to pick a random sample of sufficient size to give us a measurement that is hopefully representative of the whole population.

Figure 6: Statuses of a sample of domains for two RBLs

If we see a large proportion of the entries not resolving then we need to think why this might be. While one reason may be that the RBL has stale information, there may be other explanations. For example, maybe the RBL includes the output from one or more domain generation algorithms (DGAs), many of which are never actually registered.

Purity. One of the more serious potential issues for RBLs is when they contain false positive reports, that is, they contain entries which are not, and never have been, malicious. These entries are nearly impossible to discover en masse; they will only really become apparent during use. However, can we try to discover potential issues ahead of time? One thing which we look at is the overlap between the RBL and a source of “known good” entities. We are not aware of such a list, so we use a surrogate source: a list of top domains, like the TRANCO top 1M. While these domains may still be malicious they are less likely to be. Also, for uses like blocking network traffic, any entry in the top 10,000 say
would potentially be very disruptive.

We obviously want this score to be as low as possible, and where we suspect false positives we’d like to understand if there are explanations or mitigations we can use. To take DGAs as an example again, short DGA domains may coincidentally overlap with real words and legitimate registrations. To make this less of an issue it may be that only DGA domains with seven or more characters are retained.

Accuracy. Where an RBL provides extra metadata, like threat types, do we believe that they are correct? Where we see entries in common between different RBLs, do they agree? This can be difficult to pin down as we do see the same entity reported for different threat types within the same RBL, so again we need to sample and check in order to get an idea of the scale of any issues.

We would like to be able to trust all the data that an RBL provides, not just the presence of entries, and the mis-classification of entries can have serious consequences in some cases. If an RBL has a low accuracy in terms of the metadata we may not be able to use it to generate statistics, for example.

3.4. What We Cannot Measure

Catchment. RBLs have different collection mechanisms, even though some are aggregates of multiple primary sources. This will end up giving the RBL strengths and blindspots, which could be geographic or delivery related (e.g. no mobile data), no visibility of threats targeted at specific countries, etc.

Understanding of these can sometimes be found from FAQs, whitepapers, conversations with the providers or other second-hand methods. In many cases, however, the amount of information is, for operational reasons, limited.

We may need this information to identify RBLs that fill gaps in our current set; for other uses it may be that data for a particular locale is essential.

Entry Retesting. We have seen that entries are removed from RBLs, but we cannot, from our measurements, definitively say why. Are statuses of entries being periodically reconfirmed, or are they just timed out? Some RBLs give this information but most do not, and deciding how long we trust entries for can be influenced by how this is being handled by the RBL.

Ideally all entries are frequently retested, but we appreciate that operationally this may not be possible.

Reliability. A metric that can only be determined with continued monitoring and use is whether the RBL data is always available, or whether there are sometimes issues transferring it. This can influence our confidence in using an RBL in a production environment: if we have our own SLAs then the RBL should have something at least similar but preferably better.

For open-source RBLs with no contract (and therefore no SLA) only our experience with the RBL can give us this confidence.

4. Conclusions

In order to understand which RBL(s) are suitable for which projects, we need to understand the project requirements, the RBL characteristics and how multiple RBLs interact with each other.

We cannot claim that certain RBLs are better than others, but it can be that some RBLs are more suited to some projects.

However, from what we have seen of the RBLs we have access to, adding multiple sources increases the number of unique entities included and hence the comprehensiveness of the data used.

While in this work we outlined our evaluation processes, we emphasize the fact that these are not meant to be complete or prescriptive as they are predicated on our current use cases. It is quite likely that future projects, or new RBLs, will suggest new measures and modifications to existing ones.

References

 [1] R. McMillan, What will stop spam?, http://sunsite.uakom.sk/sunworldonline/swol-12-1997/swol-12-vixie.html, 1997.
 [2] ICANN, Domain abuse activity reporting, https://www.icann.org/octo-ssr/daar, 2017.
 [3] Spamhaus, The World’s Most Abused TLDs, https://www.spamhaus.org/statistics/tlds/, 2023.
 [4] L. Interisle Consulting Group, Phishing landscape 2022, https://interisle.net/PhishingLandscape2022.pdf, 2022.
 [5] C. David Barnett, The highest threat tlds - part 1, https://circleid.com/posts/20230112-the-highest-threat-tlds-part-1, 2023.
 [6] J. Bayer, Y. Nosyk, O. Hureau, S. Fernandez, I. Paulovics, A. Duda, M. Korczyński, Study on domain name system (dns) abuse: Technical report, arXiv preprint arXiv:2212.08879 (2022).
 [7] S. Maroofi, M. Korczyński, C. Hesselman, B. Ampeau, A. Duda, Comar: classification of compromised versus maliciously registered domains, in: 2020 IEEE European Symposium on Security and Privacy (EuroS&P), IEEE, 2020, pp. 607–623.
 [8] M. Korczynski, M. Wullink, S. Tajalizadehkhoob, G. C. Moura, A. Noroozian, D. Bagley, C. Hesselman, Cybercrime after the sunrise: A statistical analysis of dns abuse in new gtlds, in: Proceedings of the 2018 on Asia Conference on Computer and Communications Security, 2018, pp. 609–623.
 [9] S. Tajalizadehkhoob, T. Van Goethem, M. Korczyński, A. Noroozian, R. Böhme, T. Moore, W. Joosen, M. Van Eeten, Herding vulnerable cats: a statistical approach to disentangle joint responsibility for web security in shared hosting, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 553–567.
[10] S. Sinha, M. Bailey, F. Jahanian, Shades of grey: On the effectiveness of reputation-based “blacklists”, in: 2008 3rd International Conference on Malicious and Unwanted Software (MALWARE), 2008, pp. 57–64. doi:10.1109/MALWARE.2008.4690858.
[11] J. Zhang, A. Chivukula, M. Bailey, M. Karir, M. Liu, Characterization of blacklists and tainted network traffic, in: M. Roughan, R. Chang (Eds.), Passive and Active Measurement, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 218–228.
[12] T. Vissers, P. Janssen, W. Joosen, L. Desmet, Assessing the effectiveness of domain blacklisting against malicious dns registrations, in: 2019 IEEE Security and Privacy Workshops (SPW), 2019, pp. 199–204. doi:10.1109/SPW.2019.00045.
[13] S. Ramanathan, J. Mirkovic, M. Yu, Blag: Improving the accuracy of blacklists, NDSS (2020). URL: https://par.nsf.gov/biblio/10205652. doi:10.14722/ndss.2020.24232.
[14] M. Kührer, C. Rossow, T. Holz, Paint it black: Evaluating the effectiveness of malware blacklists, in: Research in Attacks, Intrusions and Defenses: 17th International Symposium, RAID 2014, Gothenburg, Sweden, September 17-19, 2014. Proceedings 17, Springer, 2014, pp. 1–21.
[15] V. G. Li, M. Dunn, P. Pearce, D. McCoy, G. M. Voelker, S. Savage, Reading the tea leaves: A comparative analysis of threat intelligence, in: 28th USENIX Security Symposium (USENIX Security 19), USENIX Association, Santa Clara, CA, 2019, pp. 851–867. URL: https://www.usenix.org/conference/usenixsecurity19/presentation/li.
[16] A. Pitsillidis, C. Kanich, G. M. Voelker, K. Levchenko, S. Savage, Taster’s choice: A comparative analysis of spam feeds, in: Proceedings of the 2012 Internet Measurement Conference, IMC ’12, Association for Computing Machinery, New York, NY, USA, 2012, p. 427–440. URL: https://doi.org/10.1145/2398776.2398821. doi:10.1145/2398776.2398821.
[17] P. Vallina, V. Le Pochat, A. Feal, M. Paraschiv, J. Gamba, T. Burke, O. Hohlfeld, J. Tapiador, N. Vallina-Rodriguez, Mis-shapes, mistakes, misfits: An analysis of domain classification services, in: Proceedings of the ACM Internet Measurement Conference, IMC ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 598–618. URL: https://doi.org/10.1145/3419394.3423660. doi:10.1145/3419394.3423660.
[18] Fraunhofer, DGArchive website, https://dgarchive.caad.fkie.fraunhofer.de/welcome/, 2023.

A. Appendix 1: Database Schema
We write all of our RBL data to a single database table per month. Most sources are read daily, although
some are read more frequently. Our current schema is shown in Table 1; it has evolved as new RBLs and
requirements have been added.
   Most entries need some processing before they are written to these tables; for example, the domain,
TLD and suffix are extracted from URLs. This gives us a more consistent view across all of our RBLs,
including those that provide different fields, use different formats, or use slightly different
terminology.
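
   The extraction step described above might look something like the following minimal sketch. It is
an illustration only: the function name `normalise_entry` and the small `KNOWN_SUFFIXES` set are
hypothetical, and the hard-coded set stands in for the full public suffix list that a real pipeline
would need to consult.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the public suffix list (assumption for this
# sketch); a real pipeline would load the full list instead.
KNOWN_SUFFIXES = {"com", "org", "uk", "co.uk"}


def normalise_entry(identifier: str) -> dict:
    """Split a URL or bare host name into domain, suffix and TLD fields."""
    original = identifier
    # urlsplit only recognises the host when a scheme or leading "//" is present.
    if "://" not in identifier and not identifier.startswith("//"):
        identifier = "//" + identifier
    # .hostname is lower-cased and has any port or credentials removed.
    host = (urlsplit(identifier).hostname or "").rstrip(".")
    labels = host.split(".")

    # Scan from the longest candidate to the shortest and keep the first
    # (i.e. longest) suffix found in the known set.
    suffix = labels[-1]
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            suffix = candidate
            break

    # The stripped domain is the suffix plus one label to its left.
    n_suffix_labels = suffix.count(".") + 1
    domain = ".".join(labels[-(n_suffix_labels + 1):])

    return {
        "full_identifier": original,
        "domain": domain,
        "suffix": suffix,
        "tld": labels[-1],
    }


entry = normalise_entry("http://login.example.co.uk/path?x=1")
```

   Here `entry["domain"]` is `example.co.uk`, `entry["suffix"]` is `co.uk` and `entry["tld"]` is `uk`,
so a URL-based feed and a domain-based feed can be compared on the same `domain` column.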
   Note that each time we read from a feed we add new entries rather than updating existing rows. This
means that there will be duplicate entries when an entity is reported by an RBL on multiple days.
Duplicates also arise for RBLs that report URLs, because the same domain may appear under several URLs.


Table 1
RBL Data Schema
 Column Name       Type        Notes
 report_date       date        Some RBLs tell us; for others it is the date we read that RBL.
 domain            text        Stripped domain name.
 feed              text        Which source it came from.
 reason            text        Threat type: spam, phishing, etc.
 full_identifier   text        Some RBLs give URLs or include subdomains.
 score             int         Some RBLs give a confidence score.
 suffix            text        Suffix according to the public suffix list.
 tld               text        Top-level domain.
 tld_type          text        Country code (CC) or generic (gTLD) top-level domain.
 registrar         text        If known.
 reg_id            int         Registrar ID, if known.
 seen_since        timestamp   Initial report_date.
 url_shortener     boolean     Is it a known URL shortener (e.g. bit.ly); won't be reliable.
 sub_feed          text        Some RBLs aggregate other sources; if so, the original source is recorded here.
 notes             text        Any other info the RBL gave that might be useful; depends on the RBL.
 dga               boolean     Is the entry known to be from a domain generation algorithm.
 ip                boolean     Is the entry an IP address.
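
   Because rows are appended on every read, most analyses first need to collapse the duplicates. The
sketch below shows one way this could be done. It assumes SQLite purely for illustration (the database
engine is not specified above), creates only a subset of the Table 1 columns, and uses a hypothetical
table name (`rbl_2023_01`) and made-up feed names and rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Subset of the Table 1 columns; SQLite has no native date type, so the
# report date is stored as ISO 8601 text (assumption for this sketch).
conn.execute(
    """
    CREATE TABLE rbl_2023_01 (
        report_date     TEXT,
        domain          TEXT,
        feed            TEXT,
        reason          TEXT,
        full_identifier TEXT
    )
    """
)

# Each read of a feed appends rows; nothing is updated in place. The same
# domain therefore recurs across days and across distinct URLs.
rows = [
    ("2023-01-01", "example.com", "feed_a", "phishing", "http://example.com/login"),
    ("2023-01-02", "example.com", "feed_a", "phishing", "http://example.com/login"),
    ("2023-01-02", "example.com", "feed_a", "phishing", "http://example.com/verify"),
    ("2023-01-02", "example.org", "feed_b", "spam", "example.org"),
]
conn.executemany("INSERT INTO rbl_2023_01 VALUES (?, ?, ?, ?, ?)", rows)

# Collapse to one row per (feed, domain), keeping the first and last report
# dates, the number of distinct days listed, and the raw row count.
summary = conn.execute(
    """
    SELECT feed,
           domain,
           MIN(report_date)            AS first_seen,
           MAX(report_date)            AS last_seen,
           COUNT(DISTINCT report_date) AS days_listed,
           COUNT(*)                    AS raw_rows
    FROM rbl_2023_01
    GROUP BY feed, domain
    ORDER BY feed, domain
    """
).fetchall()

for row in summary:
    print(row)
```

   In this toy data `example.com` has three raw rows but was listed on only two distinct days, which
is why volume-style counts should be computed on the collapsed view rather than on raw rows.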