=Paper=
{{Paper
|id=Vol-3631/paper3
|storemode=property
|title=Block and Roll: A Metric-based Evaluation of Reputation Block Lists
|pdfUrl=https://ceur-ws.org/Vol-3631/paper3.pdf
|volume=Vol-3631
|authors=Siôn Lloyd,Carlos Hernandez-Gañán,Samaneh Tajalizadehkhoob
|dblpUrl=https://dblp.org/rec/conf/apwg-eu/LloydGT23
}}
==Block and Roll: A Metric-based Evaluation of Reputation Block Lists==
Siôn Lloyd, Carlos Hernandez-Gañán and Samaneh Tajalizadehkhoob
ICANN
Abstract
Reputation Block Lists (RBLs) serve as a common defense mechanism against harmful and unwanted internet
content. These lists contain the IP addresses, domain names, or full URLs of known spam sources, phishing,
malicious sites or other unwanted content. By using RBLs, internet service providers, email providers, and
other organizations can effectively safeguard their users from online threats. They are also used for more
academic research and as training sets for machine learning models. To help evaluate and understand the
effectiveness of RBLs, this paper covers a set of metrics that can be used to evaluate and characterize them.
These metrics include RBL focus, mechanics, metadata, volume, overlap, timeliness, and churn. We categorise
the metrics into four groups: a general description; metrics that can be directly measured; metrics that can
be indirectly measured and metrics that can only be discovered second-hand. When it comes to RBLs there is
no “one size fits all”. We argue that understanding the strengths and weaknesses of any one RBL, or set of
multiple RBLs, is key to getting a good fit for a particular use-case. To maximize the benefit of RBLs, we
suggest combining two or more to get a fuller picture than can be provided by any single RBL.
Keywords
DNS abuse, blocklists, phishing, malware
APWG.EU Tech 2023, June 21–22, 2023, Dublin, IE
sion.lloyd@icann.org (S. Lloyd); carlos.ganan@icann.org (C. Hernandez-Gañán);
samaneh.tajali@icann.org (S. Tajalizadehkhoob)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
1. Introduction
Domain name and IP address reputation lists have been used for many years as a way to identify and
block potentially harmful or unwanted traffic on the internet. The earliest known reputation list
was created by Paul Vixie in the 1990s, and was called the “Real-time Blackhole List” or RBL [1].
This list contained the IP addresses of known spam sources and was used by mail servers to block
incoming email from those sources. Over time, similar lists were created for other types of online
activities, such as domain or URL reputation lists for identifying malicious or phishing websites,
and IP address reputation lists for identifying sources of malware or other online threats. Today,
these lists are widely used by internet service providers, email providers, and other organisations
to help protect their users from online threats. They continue to evolve and improve as new threats
emerge and new technologies are developed to combat them.
We refer to these sources as “Reputation Block Lists” or RBLs; others may call them by slightly
different names like “threat intelligence”, “security feeds”, “abuse feeds” or similar. They can
contain different identifier types: domain names, IP addresses or full URLs, and in many cases a
mixture of two or more identifier types. They can also specialise in particular threat types, like
spam, phishing, malware, etc.; or they may contain a mixture of multiple threat types. They can
differ in collection methodology, licensing, distribution method, intended use and almost every
other conceivable way.
There are many examples of RBLs being used in many different scenarios, some more obvious than
others. For example, services like Google Safe Browsing (https://developers.google.com/safe-browsing)
can be thought of as an RBL protecting a browser user from known phishing sites. The academic
community also utilises RBLs to understand the current and historical reputation of domain names in
various types of analysis, to measure security threat concentrations within internet intermediaries
such as TLDs, registries, registrars or hosting providers, and finally to assess the mitigation
strategies of internet intermediaries [2, 3, 4, 5, 6, 7, 8, 9].
In many cases the use of this data is not necessarily aligned with how the producers intend it to be
used, and so its suitability may not be clear. In other cases conclusions drawn from analysis based
on this data do not necessarily reflect the specifications and limitations of the data. Moreover,
for all use-cases it is hard to know if the RBL being used is the best fit, if there is a better
option or if a combination of two or more RBLs would add enough benefit to justify any extra cost.
Note that
the cost can be in terms of time and complexity as well as financial, so even free open-source feeds
have some cost associated with them.
Misalignment with the intended use can have a significant impact on a project. For example, an RBL
which contains low-confidence or unvetted entries could result in an appreciable number of incorrect
entries, known as false positives. Such a data feed might be perfectly acceptable if used to protect
a small network where the mitigation of incorrect entries has a low associated cost. However, the
same RBL may not be suitable for an application where a false positive results in a time- and
resource-consuming investigation.
2. Objectives
Given the problem introduced above, in this document we propose a method to evaluate and
characterise an RBL; not just in isolation but also in how multiple RBLs complement one another.
We’ll look at the general description of the RBL; things we can measure directly; things that we can
make approximations of and things that we can only discover second-hand. We’ll also discuss the
implications and limitations of these measurements.
This work has been informed by earlier examples [10, 11, 12, 13, 14, 15, 16, 17]; but we have kept
or modified parts of their suggested methods to suit our requirements. As such, our approach is
grounded in the projects that we have been involved with; other parties with other experiences may
well have other metrics which they regard as important.
To move towards evaluating an RBL, or group of RBLs, we propose metrics that help measure multiple
aspects of a list. We then demonstrate the methods by which the metrics can be measured. Recall
though that this work is based on the sources that we are already familiar with; it is likely that
other RBLs have features which will require modifications to this method.
We will not discuss here the steps required to read RBL data as this will vary between RBLs. We do
show, in Appendix A, the database schema that we use to harmonise data into a single, consistent
format. All of the RBL data we read is written into this structure, although it has had to evolve as
new RBLs with new fields have been added.
Finally, there are some things that we are explicitly not trying to measure. We are not looking to
put a score on an RBL or say that one is demonstrably better than another; we want to increase our
understanding of RBL data used regularly by us and our community, so that we can either use them
with confidence, or understand why they are not suitable for a particular project. We also do not
consider cost or licensing terms here, although these could be significant factors in any decision
on whether to use an RBL. Lastly, we are not aiming to evaluate the absolute effectiveness of our
RBLs, as some existing work has already looked at that aspect [10, 11, 12, 13, 14].
The metrics we use are listed below and described in more detail in section 3.
2.1. General RBL Description
These are characteristics that we will know before we start to ingest data into our system. It may
be a feature that initially brought the RBL to our interest, maybe to fill an identified gap in our
other RBLs. We also include details that we need to know during the integration of an RBL into our
system, like how it is distributed and what data it contains.
• RBL focus - what entry types does it contain (spam, phish, etc.)
• RBL mechanics - how is the RBL disseminated, what format is the data in, etc.
• Metadata - does the RBL provide more context on list entries, like malware family, phished brand,
etc.
2.2. What We Can Measure Directly
These are the metrics we can measure directly based on the information provided in the RBLs.
• Volume - how many entries are present
• Overlap - how many entries in one RBL are in common with other RBLs
• Timeliness - how quickly do entries appear (compared to other RBLs)
• Churn - how dynamic are the entries
2.3. What We Can Measure Indirectly
These are metrics that can be measured indirectly from the data: cases where we may have to sample
the data in order to get an approximation of the answer, or where we have surrogate measurements in
lieu of the thing that we actually want to investigate.
• Liveliness - how many entries are “active”
• Purity - how many are potential false positives
• Accuracy - what proportion of stated threat types match reality
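The sampling idea behind the indirect metrics in 2.3 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' tooling: `estimate_proportion` and its `predicate` argument are hypothetical names, the predicate stands in for whatever check is wanted (for example "does this domain still resolve?"), and the margin of error uses a plain normal approximation that ignores the finite size of the list.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def estimate_proportion(
    entries: Sequence[str],
    predicate: Callable[[str], bool],
    sample_size: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Estimate the share of entries satisfying `predicate` from a random sample.

    Returns (estimate, half_width), where half_width is the approximate 95%
    margin of error. `predicate` is a hypothetical caller-supplied check.
    """
    if not entries:
        raise ValueError("entries must not be empty")
    # Never ask for more samples than there are entries.
    n = min(sample_size, len(entries))
    # A fixed seed keeps the sample reproducible between runs.
    rng = random.Random(seed)
    sample = rng.sample(list(entries), n)
    hits = sum(1 for entry in sample if predicate(entry))
    p = hits / n
    # Normal-approximation margin of error for a proportion.
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, half_width
```

Calling it with a list of unique domains and a resolver check would give an approximate liveliness figure; swapping in a different predicate gives the same kind of estimate for purity or accuracy checks.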
2.4. What We Cannot Measure
These are characteristics which cannot be derived from the data itself, but are discovered based on
second-hand information. For example we look at the documentation for the RBL, consult FAQs or talk
to the RBL providers to get this information.
• Catchment - are there geographic blindspots, collection method gaps (e.g. no mobile data), etc.
• Entry retesting - how frequently are entries retested to check if they should still be present on
the RBL
• Reliability - is the data always available or are there issues transferring
3. Method
Looking in more detail at the metrics outlined above, in the remainder of the paper we demonstrate
what we measure and, where appropriate, how we might use visualisations, sticking to our four
categories.
3.1. General RBL Description
RBL Focus. The first thing to consider is the threat types that the RBL contains. Does it focus on a
single threat type or contain multiple types? How does this relate to any other RBLs in our set;
does it fill a known gap?
RBL Mechanics. A prosaic but significant issue is how we can read the RBL and merge it into our
larger dataset. We need to understand what delivery mechanism is used: is there an API, do we get
the data formatted in CSV, JSON, etc.? Also, when we read the RBL does it provide the whole current
list or a stream of new entries (a list of “point in time” observations)? In the case of the latter,
the decision on how long an entry remains active is made by us rather than the provider.
Metadata. It can be useful to have context around a particular entry, and some RBLs provide more
information, like a timestamp of when the entry was added, the malware family seen, the brand being
phished and so on. Another useful data point is whether the entry is believed to be a malicious
registration or a compromised but otherwise legitimate registration. All of this forms the metadata
of an observation.
3.2. What We Can Measure Directly
Volume. Possibly the easiest measurement to make is how many entries are present, although some care
needs to be taken that the same thing is being measured in each case. For example, some RBLs contain
just domains while others may contain URLs, but of course multiple URLs may well map to a single
domain. We look at unique entries over a period of time, preferably a month or more, to give as good
a representation as possible. This is particularly significant for those RBLs which provide a stream
of new entries and so don’t have the concept of a “current list”. If we look at unique domains we
see something like Figure 1. We can also produce similar figures showing unique hosts, URLs, domains
broken down by different threat types, etc.
Figure 1: Number of unique domains seen over a fixed period of time
Higher volumes are, in general, desirable; this is not, however, the whole story. For example the
DGArchive [18] data is based on enumeration of domain generation algorithms, and so the majority of
those entries may never be registered. It is therefore arguable that we are not comparing like with
like against other RBLs; we look to address issues
like these later on. It is also true that some threats are more serious or active than others and so
some entries offer more “value”.
Overlap. If we are looking to add a new RBL into our existing data, it is interesting to know how
many entries are in common with our current data. Again we need to aggregate over a period of time
and be careful to compare apples with apples. It may also be instructive to see different threat
types separately. One simple measure is the overlap of unique domains shown in Figure 2.
Figure 2: Overlap of unique domain entries seen between RBLs over a fixed period of time
This shows how much of one RBL is contained within another (and vice versa). For example, if we look
at SURBL and openphish we can see that SURBL contains 0.85 (85%) of openphish. However, openphish
contains just 0.015 (1.5%) of SURBL; while the absolute number of domains in common is the same, the
difference comes from the underlying sizes of the two RBLs.
The view shows us some other interesting features; while the majority of overlaps are small, less
than 5% or so, there are some which are much higher. This is where open sources are being read and
incorporated into other RBLs, presumably after being validated to the required standard for that
RBL. This could be significant if entries on multiple RBLs are being taken as multiple independent
observations, when they may in fact stem from a single original source.
Timeliness. The view above is interesting, and shows some cross-pollination between RBLs, so the
next question is: where two or more RBLs have the same entry, which gets it earlier and by how much?
To this end we look at the time delta between an entry appearing on the “base feed” that we are
considering and any other RBL; this gives us visualisations like that shown in Figure 3.
Figure 3: Where overlap is seen we can show if our considered RBL saw the domain earlier or later
than the others
Here entries with a negative time show the base feed leading other RBL entries, whereas a positive
time indicates it lagging behind. So ideally we want to see more weight to the left of the graph,
indicating that the RBL being considered is consistently getting entries earlier than others.
Churn. For the RBLs that provide their whole current set of entries on each read, it is also useful
to know how dynamic the list is. If an RBL’s volume stays the same as the previous iteration, is it
because the list is static, or is it because as many entries
are being removed as are being added? To this end we can consider a single RBL over a period of time
and plot its volume along with the number of new entries and removed entries, as shown in Figure 4.
Figure 4: Volume over time for a single RBL along with the number of additions and deletions
Note that removing stale entries which are no longer active threats can be as important as adding
new entries, but is often not considered. To this end we can also look at the histogram of the ages
of entries, see Figure 5 (note the log y-scale). Figure 5 shows a healthy mix where the majority of
entries have a short lifespan of days or weeks, with a small number being on the RBL for a year or
more.
Figure 5: Histogram of entry ages for a single RBL, note the log scale
This analysis gives us more insight into how active the RBL is, how many new threats are being added
and how many old threats are being removed. A higher churn reflects a more active RBL and so is seen
as a positive feature. For those feeds which just provide “point in time” observations this analysis
is not so relevant, although we can still look at the volumes of new threats being added.
3.3. What We Can Measure Indirectly
Liveliness. Above we measured the volume of entries on an RBL. However, it is also interesting to
know how many of those entries are “active”. There may be entries which no longer resolve, or have
been mitigated in other ways (for example, some registrars take control of the domain and “park”
it).
We would struggle to capture this information for every entry on a sizeable RBL, and once we had
finished we would need to start again to catch any new entries or changes in existing ones. One way
to tackle this would be to pick a random sample of sufficient size to give us a measurement
hopefully representative of the whole population.
Figure 6: Statuses of a sample of domains for two RBLs
If we see a large proportion of the entries not resolving then we need to think why this might be.
While one reason may be that the RBL has stale information, there may be other explanations. For
example, maybe the RBL includes the output from one or more domain generation algorithms (DGAs),
many of which are never actually registered.
Purity. One of the more serious potential issues for RBLs is when they contain false positive
reports, that is, entries which are not, and never have been, malicious. These entries are nearly
impossible to discover en masse; they will only really become apparent during use. However, can we
try to discover potential issues ahead of time? One thing which we look at is the overlap between
the RBL and a source of “known good” entities. We are not aware of such a list, so we use a
surrogate source: a list of top domains, like the Tranco top 1M. While these domains may still be
malicious they are less likely to be. Also, for uses like blocking network traffic, any entry in the
top 10,000 say
would potentially be very disruptive.
We obviously want this score to be as low as possible, and where we suspect false positives we’d
like to understand if there are explanations or mitigations we can use. To take DGAs as an example
again, short DGA domains may coincidentally overlap with real words and legitimate registrations. To
make this less of an issue it may be that only DGA domains with seven or more characters are
retained.
Accuracy. Where an RBL provides extra metadata, like threat types, do we believe that they are
correct? Where we see entries in common between different RBLs, do they agree? This can be difficult
to pin down as we do see the same entity reported for different threat types within the same RBL, so
again we need to sample and check in order to get an idea of the scale of any issues.
We would like to be able to trust all the data that an RBL provides, not just the presence of
entries, and the mis-classification of entries can have serious consequences in some cases. If an
RBL has a low accuracy in terms of the metadata we may not be able to use it to generate statistics,
for example.
3.4. What We Can’t Measure
Catchment. RBLs have different collection mechanisms, even though some are aggregates of multiple
primary sources. This will end up giving the RBL strengths and blindspots, which could be geographic
(e.g. no visibility of threats targeted at specific countries) or delivery related (e.g. no mobile
data).
An understanding of these can sometimes be found from FAQs, whitepapers, conversations with the
providers or other second-hand methods. In many cases, however, the amount of information is, for
operational reasons, limited.
We may need this information to identify RBLs that fill gaps in our current set; for other uses it
may be that data for a particular locale is essential.
Entry Retesting. We have seen that entries are removed from RBLs; but we cannot, from our
measurements, definitively say why. Are statuses of entries being periodically reconfirmed, or are
they just timed out? Some RBLs give this information but most do not, and deciding how long we trust
entries for can be influenced by how this is being handled by the RBL.
Ideally all entries are frequently retested, but we appreciate that operationally this may not be
possible.
Reliability. A metric that can only be determined with continued monitoring and use is whether the
RBL data is always available, or whether there are sometimes issues transferring it. This can
influence our confidence in using an RBL in a production environment: if we have our own SLAs then
the RBL should offer something at least similar, and preferably better.
For open-source RBLs with no contract (and therefore no SLA) only our experience with the RBL can
give us this confidence.
4. Conclusions
In order to understand which RBL(s) are suitable for which projects, we need to understand the
project requirements, the RBL characteristics and how multiple RBLs interact with each other.
We cannot claim that certain RBLs are better than others; but it can be that some RBLs are more
suited to some projects.
However, from what we have seen of the RBLs we have access to, adding multiple sources increases the
number of unique entities included and hence the comprehensiveness of the data used.
While in this work we outlined our evaluation processes, we emphasize that these are not meant to be
complete or prescriptive, as they are predicated on our current use cases. It is quite likely that
future projects, or new RBLs, will suggest new measures and modifications to existing ones.
References
[1] R. McMillan, What will stop spam?,
http://sunsite.uakom.sk/sunworldonline/swol-12-1997/swol-12-vixie.html, 1997.
[2] ICANN, Domain abuse activity reporting, https://www.icann.org/octo-ssr/daar, 2017.
[3] Spamhaus, The World’s Most Abused TLDs, https://www.spamhaus.org/statistics/tlds/, 2023.
[4] L. Interisle Consulting Group, Phishing landscape 2022,
https://interisle.net/PhishingLandscape2022.pdf, 2022.
[5] C. David Barnett, The highest threat tlds - part 1,
https://circleid.com/posts/20230112-the-highest-threat-tlds-part-1, 2013.
[6] J. Bayer, Y. Nosyk, O. Hureau, S. Fernandez, I. Paulovics, A. Duda, M. Korczyński, Study on
domain name system (dns) abuse: Technical report, arXiv preprint arXiv:2212.08879 (2022).
[7] S. Maroofi, M. Korczyński, C. Hesselman, B. Ampeau, A. Duda, Comar: classification of
compromised versus maliciously registered domains, in: 2020 IEEE European Symposium on Security and
Privacy (EuroS&P), IEEE, 2020, pp. 607–623.
[8] M. Korczynski, M. Wullink, S. Tajalizadehkhoob, G. C. Moura, A. Noroozian, D. Bagley,
C. Hesselman, Cybercrime after the sunrise: A statistical analysis of dns abuse in new gtlds, in:
Proceedings of the 2018 on Asia Conference on Computer and Communications Security, 2018,
pp. 609–623.
[9] S. Tajalizadehkhoob, T. Van Goethem, M. Korczyński, A. Noroozian, R. Böhme, T. Moore, W. Joosen,
M. Van Eeten, Herding vulnerable cats: a statistical approach to disentangle joint responsibility
for web security in shared hosting, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer
and Communications Security, 2017, pp. 553–567.
[10] S. Sinha, M. Bailey, F. Jahanian, Shades of grey: On the effectiveness of reputation-based
“blacklists”, in: 2008 3rd International Conference on Malicious and Unwanted Software (MALWARE),
2008, pp. 57–64. doi:10.1109/MALWARE.2008.4690858.
[11] J. Zhang, A. Chivukula, M. Bailey, M. Karir, M. Liu, Characterization of blacklists and tainted
network traffic, in: M. Roughan, R. Chang (Eds.), Passive and Active Measurement, Springer Berlin
Heidelberg, Berlin, Heidelberg, 2013, pp. 218–228.
[12] T. Vissers, P. Janssen, W. Joosen, L. Desmet, Assessing the effectiveness of domain
blacklisting against malicious dns registrations, in: 2019 IEEE Security and Privacy Workshops
(SPW), 2019, pp. 199–204. doi:10.1109/SPW.2019.00045.
[13] S. Ramanathan, J. Mirkovic, M. Yu, Blag: Improving the accuracy of blacklists, NDSS (2020).
URL: https://par.nsf.gov/biblio/10205652. doi:10.14722/ndss.2020.24232.
[14] M. Kührer, C. Rossow, T. Holz, Paint it black: Evaluating the effectiveness of malware
blacklists, in: Research in Attacks, Intrusions and Defenses: 17th International Symposium, RAID
2014, Gothenburg, Sweden, September 17-19, 2014. Proceedings 17, Springer, 2014, pp. 1–21.
[15] V. G. Li, M. Dunn, P. Pearce, D. McCoy, G. M. Voelker, S. Savage, Reading the tea leaves: A
comparative analysis of threat intelligence, in: 28th USENIX Security Symposium (USENIX Security
19), USENIX Association, Santa Clara, CA, 2019, pp. 851–867. URL:
https://www.usenix.org/conference/usenixsecurity19/presentation/li.
[16] A. Pitsillidis, C. Kanich, G. M. Voelker, K. Levchenko, S. Savage, Taster’s choice: A
comparative analysis of spam feeds, in: Proceedings of the 2012 Internet Measurement Conference, IMC
’12, Association for Computing Machinery, New York, NY, USA, 2012, p. 427–440. URL:
https://doi.org/10.1145/2398776.2398821. doi:10.1145/2398776.2398821.
[17] P. Vallina, V. Le Pochat, A. Feal, M. Paraschiv, J. Gamba, T. Burke, O. Hohlfeld, J. Tapiador,
N. Vallina-Rodriguez, Mis-shapes, mistakes, misfits: An analysis of domain classification services,
in: Proceedings of the ACM Internet Measurement Conference, IMC ’20, Association for Computing
Machinery, New York, NY, USA, 2020, p. 598–618. URL: https://doi.org/10.1145/3419394.3423660.
doi:10.1145/3419394.3423660.
[18] Fraunhofer, DGArchive website, https://dgarchive.caad.fkie.fraunhofer.de/welcome/, 2023.
A. Appendix 1: Database Schema
We write all of our RBL data to a single database table per month; most sources are read daily although
some more frequently. Our current schema is shown in Table 1 although this has evolved with new RBLs
and requirements.
Some processing is required for most entries to be written to these tables, for example domains are
extracted from URLs, as are the TLD and suffix. This means we can get a more consistent view across all
of our RBLs, coping with those which provide different fields in different formats, or use slightly different
terminology.
Note that each time we read from a feed we add new entries rather than updating existing rows. This
means that there will be duplicate entries when an entity is reported by an RBL for multiple days.
This is also true for RBLs which report on URLs, and so may have the same domain multiple times.
Table 1: RBL Data Schema
Column Name     | Type      | Notes
report_date     | date      | Some RBLs tell us, for others it’s when we read that RBL.
domain          | text      | Stripped domain name
feed            | text      | Which source it came from
reason          | text      | Threat type - spam, phishing, etc.
full_identifier | text      | Some RBLs give URLs or include subdomains.
score           | int       | Some RBLs give a confidence score
suffix          | text      | Suffix according to the public suffix list
tld             | text      | Top-level domain
tld_type        | text      | Country code (CC) or generic (gTLD) top-level domain
registrar       | text      | If known
reg_id          | int       | Registrar ID, if known
seen_since      | timestamp | Initial report_date
url_shortener   | boolean   | Is it a known URL shortener (e.g. bit.ly); won’t be reliable
sub_feed        | text      | Some RBLs aggregate other sources, if this is the case the original source will be here
notes           | text      | Any other info the RBL gave that might be useful. Will depend on the RBL
dga             | boolean   | Is the entry known to be from a domain generation algorithm
ip              | boolean   | Is the entry an IP address
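
Because rows are appended on every read, any count taken from this table has to deduplicate first. The sketch below shows one way the volume, overlap and timeliness measurements from section 3.2 could be derived from such rows. It is an illustration only: rows are assumed to be dictionaries keyed by the `feed`, `domain` and `report_date` columns of Table 1, and the function names are hypothetical rather than part of the authors' system.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Mapping, Set, Tuple

Row = Mapping[str, object]


def unique_domains(rows: Iterable[Row]) -> Dict[str, Set[str]]:
    """Group rows by feed, collapsing repeated daily reports into sets of domains."""
    per_feed: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        per_feed[str(row["feed"])].add(str(row["domain"]).lower())
    return dict(per_feed)


def containment(per_feed: Mapping[str, Set[str]], container: str, contained: str) -> float:
    """Fraction of `contained`'s unique domains that also appear in `container`."""
    target = per_feed[contained]
    if not target:
        return 0.0
    return len(per_feed[container] & target) / len(target)


def first_seen(rows: Iterable[Row]) -> Dict[Tuple[str, str], date]:
    """Earliest report_date for each (feed, domain) pair."""
    earliest: Dict[Tuple[str, str], date] = {}
    for row in rows:
        key = (str(row["feed"]), str(row["domain"]).lower())
        seen = row["report_date"]
        if key not in earliest or seen < earliest[key]:
            earliest[key] = seen
    return earliest


def lead_days(rows: Iterable[Row], base_feed: str, other_feed: str) -> Dict[str, int]:
    """Days between first sighting on base_feed and on other_feed, per shared domain.

    Negative values mean the base feed listed the domain first.
    """
    earliest = first_seen(rows)
    deltas: Dict[str, int] = {}
    for (feed, domain), seen in earliest.items():
        if feed != base_feed:
            continue
        other = earliest.get((other_feed, domain))
        if other is not None:
            deltas[domain] = (seen - other).days
    return deltas
```

The containment measure is deliberately directional: the same set of shared domains yields different fractions depending on which RBL is the denominator, which is the asymmetry described for SURBL and openphish. The sign convention in `lead_days` follows the one used for Figure 3.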