Block and Roll: A Metric-based Evaluation of Reputation Block Lists

Siôn Lloyd¹, Carlos Hernandez-Gañán¹ and Samaneh Tajalizadehkhoob¹
¹ ICANN

Abstract
Reputation Block Lists (RBLs) serve as a common defense mechanism against harmful and unwanted internet content. These lists contain the IP addresses, domain names, or full URLs of known spam sources, phishing sites, malicious sites or other unwanted content. By using RBLs, internet service providers, email providers, and other organizations can effectively safeguard their users from online threats. They are also used in academic research and as training sets for machine learning models. To help evaluate and understand the effectiveness of RBLs, this paper covers a set of metrics that can be used to evaluate and characterize them. These metrics include RBL focus, mechanics, metadata, volume, overlap, timeliness, and churn. We categorise the metrics into four groups: a general description; metrics that can be directly measured; metrics that can be indirectly measured; and metrics that can only be discovered second-hand. When it comes to RBLs there is no “one size fits all”. We argue that understanding the strengths and weaknesses of any one RBL, or set of multiple RBLs, is key to getting a good fit for a particular use-case. To maximize the benefit of RBLs, we suggest combining two or more to get a fuller picture than can be provided by any single RBL.

Keywords
DNS abuse, blocklists, phishing, malware

1. Introduction

Domain name and IP address reputation lists have been used for many years as a way to identify and block potentially harmful or unwanted traffic on the internet. The earliest known reputation list was created by Paul Vixie in the 1990s, and was called the “Real-time Blackhole List” or RBL [1]. This list contained the IP addresses of known spam sources and was used by mail servers to block incoming email from those sources. Over time, similar lists were created for other types of online activities, such as domain or URL reputation lists for identifying malicious or phishing websites, and IP address reputation lists for identifying sources of malware or other online threats. Today, these lists are widely used by internet service providers, email providers, and other organisations to help protect their users from online threats. They continue to evolve and improve as new threats emerge and new technologies are developed to combat them.

We refer to these sources as “Reputation Block Lists” or RBLs; others may call them by slightly different names like “threat intelligence”, “security feeds”, “abuse feed” or similar. They can contain different identifier types: domain names, IP addresses or full URLs, and in many cases a mixture of two or more identifier types. They can also specialise in particular threat types, like spam, phishing, malware, etc.; or they may contain a mixture of multiple threat types. They can differ in collection methodology, licensing, distribution method, intended use and almost every other conceivable way.

There are many examples of RBLs being used in many different scenarios, some more obvious than others. For example, services like Google Safe Browsing (footnote 1) can be thought of as an RBL protecting a browser user from known phishing sites. The academic community also utilises RBLs to understand the current and historical reputation of domain names in various types of analysis, to measure security threat concentrations within internet intermediaries such as TLDs, registries, registrars or hosting providers, and finally to assess the mitigation strategies of internet intermediaries [2, 3, 4, 5, 6, 7, 8, 9].

In many cases the use of this data is not necessarily aligned with how the producers intend it to be used, and so its suitability may not be clear. In other cases conclusions drawn from analysis based on this data do not necessarily reflect the specifications and limitations of the data.
Moreover, for all use-cases it is hard to know if the RBL being used is the best fit, if there is a better option or if a combination of two or more RBLs would add enough benefit to justify any extra cost. Note that the cost can be in terms of time and complexity as well as financial, so even free open-source feeds have some cost associated with them.

Misalignment with the intended use can have a significant impact on a project. For example, an RBL which contains low-confidence or unvetted entries could result in an appreciable number of incorrect entries, known as false positives. Such a data feed might be perfectly acceptable if used to protect a small network where the mitigation of incorrect entries has a low associated cost. However, the same RBL may not be suitable for an application where a false positive results in a time- and resource-consuming investigation.

Footnote 1: https://developers.google.com/safe-browsing

APWG.EU Tech 2023, June 21–22, 2023, Dublin, IE
sion.lloyd@icann.org (S. Lloyd); carlos.ganan@icann.org (C. Hernandez-Gañán); samaneh.tajali@icann.org (S. Tajalizadehkhoob)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Objectives

Given the problem introduced above, in this document we propose a method to evaluate and characterise an RBL; not just in isolation but also in how multiple RBLs complement one another. We’ll look at the general description of the RBL; things we can measure directly; things that we can make approximations of; and things that we can only discover second-hand. We’ll also discuss the implications and limitations of these measurements.

This work has been informed by earlier examples [10, 11, 12, 13, 14, 15, 16, 17]; but we have kept or modified parts of their suggested methods to suit our requirements. As such, our approach is grounded in the projects that we have been involved with; other parties with other experiences may well have other metrics which they regard as important.

To move towards evaluating an RBL, or group of RBLs, we propose metrics that help measure multiple aspects of a list. We then demonstrate the methods by which the metrics can be measured. Recall though that this work is based on the sources that we are already familiar with; it is likely that other RBLs have features which will require modifications to this method.

We will not discuss here the steps required to read RBL data as this will vary between RBLs. We do show, in Appendix A, the database schema that we use to harmonise data into a single, consistent format. All of the RBL data we read is written into this structure, although it has had to evolve as new RBLs with new fields have been added.

Finally, there are some things that we are explicitly not trying to measure. We are not looking to put a score on an RBL or say that one is demonstrably better than another; we want to increase our understanding of RBL data used regularly by us and our community, so that we can either use them with confidence, or understand why they are not suitable for a particular project. We also do not consider cost or licensing terms here, although these could be significant factors in any decision on whether to use an RBL. Lastly, we are not aiming to evaluate the absolute effectiveness of our RBLs, as existing work has already looked at that aspect [10, 11, 12, 13, 14].

The metrics we use are listed below and described in more detail in Section 3.

2.1. General RBL Description

These are characteristics that we will know before we start to ingest data into our system. It may be a feature that initially brought the RBL to our interest, maybe to fill an identified gap in our other RBLs. We also include details that we need to know during the integration of an RBL into our system, like how it is distributed and what data it contains.

• RBL focus - what entry types does it contain (spam, phish, etc.)
• RBL mechanics - how is the RBL disseminated, what format is the data in, etc.
• Metadata - does the RBL provide more context on list entries, like malware family, phished brand, etc.

2.2. What We Can Measure Directly

These are the metrics we can measure directly based on the information provided in the RBLs.

• Volume - how many entries are present
• Overlap - how many entries in one RBL are in common with other RBLs
• Timeliness - how quickly do entries appear (compared to other RBLs)
• Churn - how dynamic are the entries

2.3. What We Can Measure Indirectly

These are metrics that can be measured indirectly from the data. This covers cases where we may have to sample the data in order to get an approximation of the answer, or where we have surrogate measurements in lieu of the thing that we actually want to investigate.

• Liveliness - how many entries are “active”
• Purity - how many are potential false positives
• Accuracy - what proportion of stated threat types match reality

2.4. What We Cannot Measure

These are characteristics which cannot be derived from the data itself, but are discovered based on second-hand information. For example, we look at the documentation for the RBL, consult FAQs or talk to the RBL providers to get this information.

• Catchment - are there geographic blindspots, collection method gaps (e.g. no mobile data), etc.
• Entry retesting - how frequently are entries retested to check if they should still be present on the RBL
• Reliability - is the data always available or are there issues transferring

3. Method

Looking in more detail at the metrics outlined above, in the remainder of the paper we demonstrate what we measure and, where appropriate, how we might use visualisations, sticking to our four categories.

3.1. General RBL Description

RBL Focus. The first thing to consider is the threat types that the RBL contains. Does it focus on a single threat type or contain multiple types? How does this relate to any other RBLs in our set; does it fill a known gap?

RBL Mechanics. A prosaic but significant issue is how we can read the RBL and merge it into our larger dataset. We need to understand what delivery mechanism is used: is there an API, do we get the data formatted in CSV, JSON, etc. Also, when we read the RBL does it provide the whole current list or a stream of new entries (a list of “point in time” observations)? In the case of the latter, the decision on how long an entry remains active is made by us rather than the provider.

Metadata. It can be useful to have context around a particular entry, and some RBLs provide more information, like a timestamp for when the entry was added, the malware family seen, the brand being phished and so on. Another useful data point is whether the entry is believed to be a malicious registration or a compromised but otherwise legitimate registration. All of this forms the metadata of an observation.

3.2. What We Can Measure Directly

Volume. Possibly the easiest measurement to make is how many entries are present, although some care needs to be taken that the same thing is being measured in each case. For example, some RBLs contain just domains while others may contain URLs, but of course multiple URLs may well map to a single domain. We look at unique entries over a period of time, preferably a month or more, to give as good a representation as possible. This is particularly significant for those RBLs which provide a stream of new entries and so don’t have the concept of a “current list”. If we look at unique domains we see something like Figure 1. We can also produce similar figures showing unique hosts, URLs, domains broken down by different threat types, etc.

Figure 1: Number of unique domains seen over a fixed period of time

Higher volumes are, in general, desirable; this is not, however, the whole story. For example, the DGArchive [18] data is based on enumeration of domain generation algorithms, and so the majority of those entries may never be registered. It is therefore arguable that we are not comparing like with like to other RBLs; we look to address issues like these later on. It is also true that some threats are more serious or active than others and so some entries offer more “value”.

Overlap. If we are looking to add a new RBL into our existing data, it is interesting to know how many entries are in common with our current data. Again we need to aggregate over a period of time and be careful to compare apples with apples. It may also be instructive to see different threat types separately. One simple measure is the overlap of unique domains shown in Figure 2. This shows how much of one RBL is contained within another (and vice versa). For example, if we look at SURBL and OpenPhish we can see that SURBL contains 0.85 (85%) of OpenPhish. However, OpenPhish contains just 0.015 (1.5%) of SURBL; while the absolute number of domains in common is the same, the difference is the underlying size of each RBL.

Figure 2: Overlap of unique domain entries seen between RBLs over a fixed period of time

The view shows us some other interesting features; while the majority of overlaps are small, less than 5% or so, there are some which are much higher. This is where open sources are being read and incorporated into other RBLs, presumably after being validated to the required standard for that RBL. This could be significant if entries on multiple RBLs are being taken as multiple independent observations, when they may in fact stem from a single original source.

Timeliness. The view above is interesting, and shows some cross-pollination between RBLs, so the next question is: where two or more RBLs have the same entry, which gets it earlier and by how much? To this end we look at the time delta between an entry appearing on the “base feed” that we are considering and on any other RBL; this gives us visualisations like that shown in Figure 3.

Figure 3: Where overlap is seen we can show if our considered RBL saw the domain earlier or later than the others

Here entries with a negative time show the base feed leading other RBL entries, whereas a positive time indicates it lagging behind. So ideally we want to see more weight to the left of the graph, indicating that the RBL being considered is consistently getting entries earlier than others.

Churn. For the RBLs that provide their whole current set of entries on each read, it is also useful to know how dynamic the list is. If an RBL’s volume stays the same as the previous iteration, is it because the list is static, or is it because as many entries are being removed as are being added? To this end we can consider a single RBL over a period of time and plot its volume along with the number of new entries and removed entries, as shown in Figure 4.

Figure 4: Volume over time for a single RBL along with the number of additions and deletions

Note that removing stale entries which are no longer active threats can be as important as adding new entries, but is often not considered. To this end we can also look at the histogram of the ages of entries, see Figure 5 (note the log y-scale). Figure 5 shows a healthy mix where the majority of entries have a short lifespan of days or weeks, with a small number being on the RBL for a year or more.

Figure 5: Histogram of entry ages for a single RBL, note the log scale

This analysis gives us more insight into how active the RBL is, how many new threats are being added and how many old threats are being removed. A higher churn reflects a more active RBL and so is seen as a positive feature. For those feeds which just provide “point in time” observations this analysis is not so relevant, although we can still look at the volumes of new threats being added.

3.3. What We Can Measure Indirectly

Liveliness. Above we measured the volume of entries on an RBL. However, it is also interesting to know how many of those entries are “active”. There may be entries which no longer resolve, or have been mitigated in other ways (for example, some registrars take control of the domain and “park” it). We would struggle to capture this information for every entry on a sizeable RBL, and once we had finished we would need to start again to catch any new entries or changes in existing ones. One way to tackle this would be to pick a random sample of sufficient size to give us a measurement that is hopefully representative of the whole population.

Figure 6: Statuses of a sample of domains for two RBLs

If we see a large proportion of the entries not resolving then we need to think why this might be. While one reason may be that the RBL has stale information, there may be other explanations. For example, maybe the RBL includes the output from one or more domain generation algorithms (DGAs), many of whose domains are never actually registered.

Purity. One of the more serious potential issues for RBLs is when they contain false positive reports, that is, they contain entries which are not, and never have been, malicious. These entries are nearly impossible to discover en masse; they will only really become apparent during use. However, can we try to discover potential issues ahead of time? One thing which we look at is the overlap between the RBL and a source of “known good” entities. We are not aware of such a list, so we use a surrogate source: a list of top domains, like the Tranco top 1M. While these domains may still be malicious, they are less likely to be. Also, for uses like blocking network traffic, any entry in the top 10,000, say, would potentially be very disruptive.

We obviously want this score to be as low as possible, and where we suspect false positives we’d like to understand if there are explanations or mitigations we can use. To take DGAs as an example again, short DGA domains may coincidentally overlap with real words and legitimate registrations. To make this less of an issue it may be that only DGA domains with seven or more characters are retained.

Accuracy. Where an RBL provides extra metadata, like threat types, do we believe that they are correct? Where we see entries in common between different RBLs, do they agree? This can be difficult to pin down as we do see the same entity reported for different threat types within the same RBL, so again we need to sample and check in order to get an idea of the scale of any issues.

We would like to be able to trust all the data that an RBL provides, not just the presence of entries, and the mis-classification of entries can have serious consequences in some cases. If an RBL has a low accuracy in terms of the metadata we may not be able to use it to generate statistics, for example.

3.4. What We Cannot Measure

Catchment. RBLs have different collection mechanisms, even though some are aggregates of multiple primary sources. This will end up giving the RBL strengths and blindspots, which could be geographic (e.g. no visibility of threats targeted at specific countries) or delivery related (e.g. no mobile data).

Understanding of these can sometimes be found from FAQs, whitepapers, conversations with the providers or other second-hand methods. In many cases, however, the amount of information is, for operational reasons, limited.

We may need this information to identify RBLs that fill gaps in our current set; for other uses it may be that data for a particular locale is essential.

Entry Retesting. We have seen that entries are removed from RBLs; but we cannot, from our measurements, definitively say why. Are statuses of entries being periodically reconfirmed, or are they just timed out? Some RBLs give this information but most do not, and deciding how long we trust entries for can be influenced by how this is being handled by the RBL.

Ideally all entries are frequently retested, but we appreciate that operationally this may not be possible.

Reliability. A metric that can only be determined with continued monitoring and use is whether the RBL data is always available, or whether there are sometimes issues transferring it. This can influence our confidence in using an RBL in a production environment because, if we have our own SLAs, then the RBL should have something at least similar but preferably better.

For open-source RBLs with no contract (and therefore no SLA) only our experience with the RBL can give us this confidence.

4. Conclusions

In order to understand which RBL(s) are suitable for which projects, we need to understand the project requirements, the RBL characteristics and how multiple RBLs interact with each other.

We cannot claim that certain RBLs are better than others; but it can be that some RBLs are more suited to some projects.

However, from what we have seen of the RBLs we have access to, adding multiple sources increases the number of unique entities included and hence the comprehensiveness of the data used.

While in this work we outlined our evaluation processes, we emphasize the fact that these are not meant to be complete or prescriptive as they are predicated on our current use cases. It is quite likely that future projects, or new RBLs, will suggest new measures and modifications to existing ones.

References

[1] R. McMillan, What will stop spam?, http://sunsite.uakom.sk/sunworldonline/swol-12-1997/swol-12-vixie.html, 1997.
[2] ICANN, Domain abuse activity reporting, https://www.icann.org/octo-ssr/daar, 2017.
[3] Spamhaus, The World’s Most Abused TLDs, https://www.spamhaus.org/statistics/tlds/, 2023.
[4] L. Interisle Consulting Group, Phishing landscape 2022, https://interisle.net/PhishingLandscape2022.pdf, 2022.
[5] C. David Barnett, The highest threat tlds - part 1, https://circleid.com/posts/20230112-the-highest-threat-tlds-part-1, 2013.
[6] J. Bayer, Y. Nosyk, O. Hureau, S. Fernandez, I. Paulovics, A. Duda, M. Korczyński, Study on domain name system (dns) abuse: Technical report, arXiv preprint arXiv:2212.08879 (2022).
[7] S. Maroofi, M. Korczyński, C. Hesselman, B. Ampeau, A. Duda, Comar: classification of compromised versus maliciously registered domains, in: 2020 IEEE European Symposium on Security and Privacy (EuroS&P), IEEE, 2020, pp. 607–623.
[8] M. Korczynski, M. Wullink, S. Tajalizadehkhoob, G. C. Moura, A. Noroozian, D. Bagley, C. Hesselman, Cybercrime after the sunrise: A statistical analysis of dns abuse in new gtlds, in: Proceedings of the 2018 on Asia Conference on Computer and Communications Security, 2018, pp. 609–623.
[9] S. Tajalizadehkhoob, T. Van Goethem, M. Korczyński, A. Noroozian, R. Böhme, T. Moore, W. Joosen, M. Van Eeten, Herding vulnerable cats: a statistical approach to disentangle joint responsibility for web security in shared hosting, in: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 553–567.
[10] S. Sinha, M. Bailey, F. Jahanian, Shades of grey: On the effectiveness of reputation-based “blacklists”, in: 2008 3rd International Conference on Malicious and Unwanted Software (MALWARE), 2008, pp. 57–64. doi:10.1109/MALWARE.2008.4690858.
[11] J. Zhang, A. Chivukula, M. Bailey, M. Karir, M. Liu, Characterization of blacklists and tainted network traffic, in: M. Roughan, R. Chang (Eds.), Passive and Active Measurement, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 218–228.
[12] T. Vissers, P. Janssen, W. Joosen, L. Desmet, Assessing the effectiveness of domain blacklisting against malicious dns registrations, in: 2019 IEEE Security and Privacy Workshops (SPW), 2019, pp. 199–204. doi:10.1109/SPW.2019.00045.
[13] S. Ramanathan, J. Mirkovic, M. Yu, Blag: Improving the accuracy of blacklists, NDSS (2020). URL: https://par.nsf.gov/biblio/10205652. doi:10.14722/ndss.2020.24232.
[14] M. Kührer, C. Rossow, T. Holz, Paint it black: Evaluating the effectiveness of malware blacklists, in: Research in Attacks, Intrusions and Defenses: 17th International Symposium, RAID 2014, Gothenburg, Sweden, September 17-19, 2014. Proceedings 17, Springer, 2014, pp. 1–21.
[15] V. G. Li, M. Dunn, P. Pearce, D. McCoy, G. M. Voelker, S. Savage, Reading the tea leaves: A comparative analysis of threat intelligence, in: 28th USENIX Security Symposium (USENIX Security 19), USENIX Association, Santa Clara, CA, 2019, pp. 851–867. URL: https://www.usenix.org/conference/usenixsecurity19/presentation/li.
[16] A. Pitsillidis, C. Kanich, G. M. Voelker, K. Levchenko, S. Savage, Taster’s choice: A comparative analysis of spam feeds, in: Proceedings of the 2012 Internet Measurement Conference, IMC ’12, Association for Computing Machinery, New York, NY, USA, 2012, p. 427–440. URL: https://doi.org/10.1145/2398776.2398821. doi:10.1145/2398776.2398821.
[17] P. Vallina, V. Le Pochat, A. Feal, M. Paraschiv, J. Gamba, T. Burke, O. Hohlfeld, J. Tapiador, N. Vallina-Rodriguez, Mis-shapes, mistakes, misfits: An analysis of domain classification services, in: Proceedings of the ACM Internet Measurement Conference, IMC ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 598–618. URL: https://doi.org/10.1145/3419394.3423660. doi:10.1145/3419394.3423660.
[18] Fraunhofer, DGArchive website, https://dgarchive.caad.fkie.fraunhofer.de/welcome/, 2023.

A. Appendix 1: Database Schema

We write all of our RBL data to a single database table per month; most sources are read daily, although some more frequently. Our current schema is shown in Table 1, although this has evolved with new RBLs and requirements.

Some processing is required for most entries to be written to these tables; for example, domains are extracted from URLs, as are the TLD and suffix. This means we can get a more consistent view across all of our RBLs, coping with those which provide different fields in different formats, or use slightly different terminology.

Note that each time we read from a feed we add new entries rather than updating existing rows. This means that there will be duplicate entries when an entity is reported by an RBL for multiple days. This is also true for RBLs which report on URLs, and so may have the same domain multiple times.

Table 1: RBL Data Schema

Column Name      Type       Notes
report_date      date       Some RBLs tell us, for others it’s when we read that RBL.
domain           text       Stripped domain name
feed             text       Which source it came from
reason           text       Threat type - spam, phishing, etc.
full_identifier  text       Some RBLs give URLs or include subdomains.
score            int        Some RBLs give a confidence score
suffix           text       Suffix according to the public suffix list
tld              text       Top-level domain
tld_type         text       Country code (CC) or generic (gTLD) top-level domain
registrar        text       If known
reg_id           int        Registrar ID, if known
seen_since       timestamp  Initial report_date
url_shortener    boolean    Is it a known URL shortener (e.g. bit.ly); won’t be reliable
sub_feed         text       Some RBLs aggregate other sources, if this is the case the original source will be here
notes            text       Any other info the RBL gave that might be useful. Will depend on the RBL
dga              boolean    Is the entry known to be from a domain generation algorithm
ip               boolean    Is the entry an IP address
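As an illustration of how the directly measurable metrics could be derived from rows shaped like this schema, the following is a minimal Python sketch and not the implementation used for the figures in this paper. It assumes only three of the columns (report_date, domain, feed); the feed names, domains, dates and function names are all hypothetical. Because rows are appended on every read, duplicates are collapsed to a first-seen date per feed before anything is counted.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows using three of the Table 1 columns; the feed names and
# domains are invented for illustration only.
ROWS = [
    {"report_date": date(2023, 5, 1), "domain": "a.example", "feed": "feed_x"},
    {"report_date": date(2023, 5, 2), "domain": "a.example", "feed": "feed_x"},  # repeat report
    {"report_date": date(2023, 5, 1), "domain": "b.example", "feed": "feed_x"},
    {"report_date": date(2023, 5, 2), "domain": "c.example", "feed": "feed_x"},
    {"report_date": date(2023, 5, 3), "domain": "d.example", "feed": "feed_x"},
    {"report_date": date(2023, 5, 3), "domain": "a.example", "feed": "feed_y"},
    {"report_date": date(2023, 5, 1), "domain": "e.example", "feed": "feed_y"},
]


def first_seen(rows):
    """Map feed -> {domain: earliest report_date}, collapsing duplicate rows."""
    seen = defaultdict(dict)
    for row in rows:
        per_feed = seen[row["feed"]]
        day = row["report_date"]
        if row["domain"] not in per_feed or day < per_feed[row["domain"]]:
            per_feed[row["domain"]] = day
    return dict(seen)


def volume(seen):
    """Number of unique domains per feed over the period covered by the rows."""
    return {feed: len(domains) for feed, domains in seen.items()}


def overlap(seen, container, contained):
    """Fraction of `contained`'s unique domains that also appear in `container`."""
    inner = set(seen[contained])
    if not inner:
        return 0.0
    return len(inner & set(seen[container])) / len(inner)


def timeliness(seen, base, other):
    """Day deltas (base minus other) for shared domains; negative means base led."""
    shared = set(seen[base]) & set(seen[other])
    return {d: (seen[base][d] - seen[other][d]).days for d in sorted(shared)}


seen = first_seen(ROWS)
print(volume(seen))                           # unique domains per feed
print(overlap(seen, "feed_x", "feed_y"))      # share of feed_y inside feed_x
print(overlap(seen, "feed_y", "feed_x"))      # share of feed_x inside feed_y
print(timeliness(seen, "feed_x", "feed_y"))   # negative: feed_x saw it first
```

With these sample rows, feed_x holds four unique domains and feed_y two. The single shared domain makes up 0.5 of feed_y but only 0.25 of feed_x, which mirrors the asymmetry described for Figure 2, and its delta of -2 days means feed_x listed it first.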