Anomaly Detection in Certificate Transparency Logs

Richard Ostertág, Martin Stanek

Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia

ITAT’24: Workshop on Applied Security, September 20–24, 2024, Drienica, SK
richard.ostertag@fmph.uniba.sk (R. Ostertág); martin.stanek@fmph.uniba.sk (M. Stanek)
ORCID: 0000-0002-6560-1515 (R. Ostertág)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

Abstract

We propose an anomaly detection technique for X.509 certificates utilizing Isolation Forest. This method can be beneficial when compliance testing with X.509 linters proves unsatisfactory, and we seek to identify anomalies beyond standards compliance. The technique is validated on a sample of 120,000 certificates from one of the largest public Certificate Transparency (CT) logs, Xenon 2024, which is operated by Google.

Keywords

Anomaly Detection, Certificate Transparency Logs, Isolation Forest

1. Introduction

Digital certificates, or public key certificates, issued by trusted certification authorities play an essential role in facilitating trust in security protocols. They bind the identity of a subject to a specific public key. Certificates that are issued mistakenly or with malicious intent pose a significant security threat, with impacts related to identity spoofing.

Certificate Transparency (CT) is a standard designed to mitigate this threat. The main idea behind CT is to collect and store all issued certificates in publicly available CT logs with verifiable authenticity. These logs allow anyone, such as domain owners, to monitor issued certificates and detect misissued certificates. The details of CT operation, including participants, data structures, protocol, etc., are specified in RFC 9162 [1].

Certificate Transparency is gradually gaining popularity, and browsers like Chrome (Chromium) and Safari are now requiring Transport Layer Security (TLS) certificates to contain proof of CT log inclusion. This requirement is achieved by adding signed certificate timestamps (SCTs) into the certificate. The SCT serves as a signed promise that the CT log operator will append the certificate to the CT log.

The most prominent public CT logs are operated by Google, Cloudflare, and certification authorities themselves, such as DigiCert, Let’s Encrypt, and Sectigo. Since all relevant certification authorities support CT, as of May 2024, over 460,000 certificates are published in CT logs every hour [2].

The HTTP-based API that allows direct access to a CT log is specified in RFC 9162 [1] (version 2.0) or RFC 6962 [3] (version 1). However, the API is focused on monitoring CT log entries and there is no method to search for entries based on domain names or other attributes. To satisfy the demand for advanced queries and monitoring, there are various free and commercial services available. Notable free search services are crt.sh¹ operated by Sectigo and Entrust Certificate Search². Commercial offerings allow outsourcing monitoring tasks for domain owners and provide automated checks and notifications when events that require owner attention are observed.

¹ https://crt.sh, a direct SQL access to the database is also available
² https://ui.ctsearch.entrust.com/ui/ctsearchui

In the world of ubiquitous Transport Layer Security (TLS) communication, CT logs have become a rich source of information regarding domain names. Passive reconnaissance regularly employs searches through CT logs to enumerate subdomains during penetration testing. Example tools that use this technique, among other methods, are OWASP Amass³, subfinder⁴, and reconFTW⁵.

³ https://owasp.org/www-project-amass/
⁴ https://github.com/projectdiscovery/subfinder
⁵ https://github.com/six2dez/reconftw

Anomaly detection. Anomalous certificates may indicate various issues, such as misissued certificates, unintended defects, or operational problems of domain owners. They can raise suspicions and warrant an investigation. Certificates in CT logs can even be abused for unidirectional covert communication [4]. There might be other abuses of CT logs and unknown problems as well. It is much more efficient to detect misissued certificates using exact tests when we know what we are looking for. However, the detection of anomalous certificates can help identify potential, yet unknown, issues that may require further investigation.

Another application of anomaly detection is when anomalies initially identified by a model are no longer rare. This might indicate changes in the use of certificates, reflected in their structure or content characteristics. Moreover, the model can be trained on certificates issued for specific domains, and anomalies detected in newly issued certificates can indicate an internal problem that needs to be addressed.

In our paper, we use the term “anomaly” to refer to certificates that are significantly different from those usually observed. We do not test certificates for compliance with X.509 standards like linters do⁶. However, in future work, it might be interesting to include linter results as additional attributes for anomaly detection, providing a more comprehensive analysis of certificate structures and content.

Our contribution. We evaluate selected statistical information about certificates in CT logs, focusing on attributes defined by domain owners, such as Subject Alternative Name (SAN) in Section 2.
We propose a method for anomaly detection in certificates using Isolation Forest [5, 6], an unsupervised machine learning technique, in Section 3. We select suitable certificate attributes and train the model on a sampled set of certificates obtained from CT logs. The results of our Isolation Forest model are presented in Section 4.

⁶ It is important to note that X.509 linters check certificates against a specific set of rules, ensuring they conform to established standards. Some well-known tools are ZLint and pkilint.

2. Statistics and attributes selection

We created a random sample of 120,000 records from one of the largest public Certificate Transparency (CT) logs, Xenon 2024, which is operated by Google. In CT logs, there are two types of records: precertificates and certificates. Since all the features we want to extract are already available in precertificates, we do not discriminate between these types in our analysis.

According to the statistics presented by Cloudflare on their Merkle Town webpage [2], the issuance rate of new certificates across all monitored CT logs is more than 460 thousand per hour (as of May 2024). Therefore, the probability of sampling both the corresponding precertificate and certificate is negligible. The sample size and variety of records in our experiment are sufficient to effectively investigate anomalous certificates within CT logs.

Let us discuss what attributes we considered and selected for feature extraction. We group them in several categories – subject, subject’s public key, issuer, signature, validity, and X.509 extensions.

Subject. A distinguished name (DN) consists of a set of attributes that identify a subject. In the case of domain validated certificates, it usually contains just the common name (CN). For organization validated certificates, however, it may contain a set of attributes such as:

    CN=multimedia-academy.tudelft.nl,
    O=Technische Universiteit Delft,
    ST=Zuid-Holland,
    C=NL

The presence and number of attributes in a DN can vary. For instance, we found that approximately 2.28% of our sample certificates did not contain a CN attribute. To extract quantitative features from the subject section of a certificate, we considered the following characteristics (see also a boxplot visualization in Figure 1):

• The length of a DN – this refers to the number of characters in the DN string representing a subject. In our sample, DN lengths range from 0 to 278 characters with an average length of 33.0.
• The number of attributes in a DN – this represents the inner structure of the DN and indicates how many relative Distinguished Names (RDNs) are present. The maximum value in our sample is 12 attributes, while the mean is 1.4 attributes, and only 14.0% of records have an attribute count that is not equal to 1.
• The length of a CN – this attribute focuses on the most important and most frequently present part of a DN. The maximal allowed length of 64 characters [7] is observed in 1.8% of records.
• Number of subdomains in a CN – this represents the inner structure of CN. In our sample, the number of subdomains ranges from 0 to 15.
• Wildcard CN – a boolean value indicating whether the CN contains a ‘*’ character. Wildcard CNs are observed in 12.0% of records.

Figure 1: Selected characteristics of subjects in the dataset

Certainly, there might exist qualitative anomalies in certificates based on small differences or variances that are not captured by quantitative characteristics alone. For example, some uncommon semantics may be used for DN attributes. These anomalies will not be detected with methods trained only on quantitative features. However, we do not attempt to analyze these anomalies in this paper as it would require interpreting different parts and attributes of the certificate beyond the scope of our experiment. This approach, focusing on quantitative characteristics, is also used for feature extraction in the rest of this section.

Subject’s public key. A public key is another attribute that is fully controlled by the subject. The certification authority can restrict the types and supported lengths of public keys for issued certificates, but the value is ultimately generated by the subject. We extract two features from the public key: its type and length.

• Public Key Type: There are only two types of subject public keys – RSA and Elliptic Curve Digital Signature Algorithm (ECDSA). Our sample shows a dominant position of RSA keys (73.5%). We do not extract the type of elliptic curve used in ECDSA keys. A numeric encoding of public key type is performed as follows: ECDSA ↦ 0, RSA ↦ 1.
• Public Key Length: Bit length of the public key, depending on the modulus length for RSA or chosen curve for ECDSA. The observed variability of this attribute is presented in Table 1.

           256      384      2048     3072     4096     8192
    RSA                      64.8%    0.4%     8.4%     0.0%*
    ECDSA  24.4%    2.1%
    * exactly one 8192-bit RSA key in the sample

Table 1: Distribution of subject’s public key lengths in the sample

Issuer. Let’s Encrypt is the most prevalent certification authority, accounting for over 52% of certificates in our sample. The total number of distinct certification authorities, identified by unique DN, is 176. For anomaly detection, we will use the rarity of CA as a feature:

• CA rarity: A float number computed as a fraction of certificates in the sample with the same issuer (DN).

It is assumed that more common certification authorities have better practices and stricter certification policies in place, so their certificates are less likely to be anomalous. Therefore, we will not analyze other aspects of the issuer further.

Signature. We do not extract any features from a signature algorithm used by certification authorities to sign (pre)certificates. This is entirely at their discretion, and we assume that CA rarity, see above, covers unusual certification authorities sufficiently in our experiment. However, if someone wants to consider signatures in anomaly detection, both types (algorithms) as well as key lengths should be considered. Useful online statistics covering signature algorithms are presented in [2], with RSA-SHA256 being used in 90% of (pre)certificates.

Another set of attributes that might be considered in the future are embedded SCTs (Signed Certificate Timestamps) in certificates. A certification authority can decide in which CT logs it wants to include a certificate. This decision is usually uniform across different certificates, taking into account their expiry date. An unusual combination or SCT count can indicate an anomaly.

Validity. Although the validity period depends on the CA’s certification policy, our sample demonstrates significant variability within this attribute, ranging from one day to approximately 50 months. This feature is extracted for use in anomaly detection.

• Validity period: the number of days a certificate is valid, calculated as the difference between “not before” and “not after” dates. Approximately 70% of certificates in our sample are issued for a validity period of three months, predominantly due to Let’s Encrypt’s certification policy. About 19.3% of the certificates have a validity period of approximately one year.

X.509 extensions. There are various extensions that can be part of a certificate. Our experiment with anomaly detection is focused mostly on attributes chosen by the subject. Therefore, special attention is given to the features of the Subject Alternative Name (SAN) extension. According to RFC 5280 [7], the SAN entry can contain DNS names, IP addresses, internet electronic mail addresses, Uniform Resource Identifiers (URIs), and other options exist as well. The sample shows an overwhelming prevalence of DNS names, where almost all certificates have at least one DNS name in the SAN extension. Other entry types appear in negligible fractions of records: IP addresses are present in less than 0.03% of records, and other types are absent altogether. We extract the following features for anomaly detection:

• The count of SAN entries: Our sample shows an average number of SAN entries as 2.1, with a minimum of 1 and a maximum of 238. The number of certificates with 10 or more SANs is below 1.5%.
• Average length of SAN entries: The average length of SAN entries in a certificate ranges from 5 to 239 with an average value of 27.3. Given the observation of SAN count, the value primarily depends on certificates with a small number of SAN entries. The distribution of average SAN length values along with the distribution of SAN count values is presented in Figure 2.
• The number of wildcard domain names: Approximately 65% of the certificates do not contain any wildcard names in their CN and SAN attributes, while 31.1% of certificates have just one wildcard name. Other counts are significantly less represented (less than 3.9%).
• Average number of subdomains: The average number of subdomains for CN and SAN attributes is calculated by counting all substrings separated by periods (“.”) in a domain name. For example, “www.uniba.sk” has three subdomains: “www”, “uniba”, and “sk”. As expected, the average number of subdomains is generally within the range of 2 to 4, as shown in Figure 3.
• Validation type: We assign each certificate a numerical representation of its validation type, with 0 representing missing or unavailable information, 1 for Domain Validation (DV), 2 for Organizational Validation (OV), and 3 for Extended Validation (EV). This representation orders validation types from the least strict policy to the most strict validation policy. For comprehensive global statistics, see Merkle Town’s webpage [2]. In our sample, we found that 88.3% of certificates were DV, and 11.7% were OV, while other types occurred negligibly.

Figure 2: SAN count and average length in the dataset

Figure 3: Average number of subdomains in the dataset

We decided not to analyze other extensions separately despite their potential interest, such as Key Usage, CRL, OCSP, and various constraints. Although problems or anomalies can be hidden in any of them, we selected a subset of attributes more related to the subject, because these attributes can help detect incorrect configurations when requesting certificates or possible covert communication. For other anomaly detection applications, it might be important to include specific X.509 extensions in the set of selected features. Our experiment focuses on the following summary characteristics:

• Extensions count: The number of X.509 extensions in a certificate. The dataset shows this parameter ranging from 5 to 13 with 97.3% of records having 9 or 10 extensions.
• Extensions size: The length of X.509 extensions in a certificate excluding SAN, since the related SAN characteristics – number and average length – are represented separately. The average size in our sample is 2306 bytes, while minimum and maximum sizes are 815 and 3506 bytes, respectively.

3. Anomaly detection

Isolation Forest is an unsupervised anomaly detection technique proposed by Liu, Ting, and Zhou [5, 6]. It builds a collection of binary trees, similar to binary search trees, by randomly selecting branching features and thresholds. The anomaly score for a data point is based on the average depth at which it is isolated across multiple trees. The main idea behind Isolation Forest is that, on average, anomalies are isolated at smaller depths than non-anomalous data.

The Isolation Forest algorithm was selected for our experiment due to its ability to detect anomalies without relying on complex distance metrics or density estimation. Furthermore, Isolation Forest performs well in high-dimensional problems containing a large number of irrelevant attributes. Additionally, it can effectively train the model even when the anomalies are not present in the training sample. The technique also has low time and memory complexity.

We utilize an implementation of the Isolation Forest provided in the PyOD library [8] for anomaly detection in multivariate data. We set the following parameters for this technique:

• Number of estimators (trees): 200
• Number of samples drawn from the data to train each estimator: 256
• Number of features drawn from the data to train each estimator: 16 (all available features)
• Sampling from the data is performed without replacement.

The contamination of the data, i.e. the proportion of anomalies in the dataset, is irrelevant for the discussion in Section 4. The reason is that the contamination is only used to set an anomaly score threshold. Instead, we examine which data, specifically precertificates and certificates, have the highest anomaly scores. From these observations, conclusions can be drawn without requiring knowledge of the exact contamination value for our dataset.

4. Results

We document the types of precertificates and certificates that are detected as the most anomalous in our exploratory experiment. A general observation is that some cloud services and their internal components are the most frequent outliers in our dataset.

Azure infrastructure. The most anomalous certificates in our experiment are those issued by Microsoft for the components of Azure infrastructure. The issuing CAs are:

• Microsoft Azure TLS Issuing CA XX – several authorities, where XX denotes number 01, 02, etc.;
• Microsoft Azure RSA TLS Issuing CA XX – again several authorities issuing certificates;
• Microsoft RSA TLS CA XX – significantly smaller number of certificates in comparison to the above two sets of authorities.

Table 2 summarizes basic characteristics for each CA. It shows above-average values, particularly for the first two CAs. Besides a higher than usual number of SAN domain names, longer domain names and extensions, other factors contribute to the anomaly of detected certificates as well. Top anomalous certificates show various deviations, such as a slightly odd validity period, the number of wildcard domain names, and other attributes, combined with the relative rarity of the issuing CA. In this regard, the anomaly detection works as intended. For example, the most anomalous certificate in the dataset according to our trained model shows the following characteristics:

• Common Name: CN=*.table.preprod.core.windows.net
• Issuer: Microsoft Azure TLS Issuing CA 06
• Validity period: 282 days
• SAN count: 52, the number of wildcard domain names: 52
• The average number of subdomains: 7
• The number of extensions: 12, overall extension size: 3206 bytes

    CA                                   DN                  CN       SAN            extensions
                                         attributes/length   length   count/length   count/size
    Microsoft Azure TLS Issuing CA       5.0/89.6            41.6     3.5/45.0       12.0/3218
    Microsoft Azure RSA TLS Issuing CA   5.0/97.3            49.3     3.4/54.5       12.0/3232
    Microsoft RSA TLS CA                 1.0/38.5            35.5     21.2/38.2      10.7/3031

Table 2: Averages for selected characteristics of issued certificates

Other CAs and ZeroSSL. After filtering out certificates issued by the CAs mentioned above, we examined the top 100 anomalous items in greater detail. Among these, we observed:

• Two certificates issued by DigiCert: one for a Chinese cloud service provider and one for a legitimate IT company.
• Two certificates issued by Amazon for its AWS components.
• All remaining 96 certificates were issued by ZeroSSL CA, specifically by ZeroSSL ECC Domain Secure Site CA. These have an unusual structure: empty subject (DN), and questionable SAN attributes containing a large number of repetitive subdomains. Two examples are:

    – www.www.www.www.www.www.pay.
      avito.sber.avito.avito.www.www.
      www.www.www.www.www.www.yandex.
      avito.yandex.pay.portalswebmail.
      blumebwww.od3.10cekub2b.k.
      webmail.ultagkhanub2b.k.webmemo.
      m.phpmyadmin.wokemtutankhanub2b.
      k.webmail.ultagame.com
    – www.www.www.www.www.www.www.www.
      www.www.www.www.www.www.www.www.
      www.www.www.www.www.www.www.www.
      www.www.www.www.www.www.www.www.
      www.www.www.www.www.www.www.www.
      www.www.www.www.www.www.www.www.
      www.www.calendario.
      panel-fiveheberg.fr

  These might indicate an operational problem with an automation script that issues and renews certificates and adds a www prefix to a domain name. In the case of the domain panel-fiveheberg.fr, this might be combined with a wildcard DNS record that responds positively to any DNS query. We checked both domains in VirusTotal, and no security vendor flagged them as malicious (as of April 2024). Not all precertificates and certificates issued by ZeroSSL ECC Domain Secure Site CA in our dataset have the mentioned structure. However, the fraction with an empty subject is rather large: 41.3%. Table 3 summarizes some characteristics of records issued by this CA. It’s an interesting observation that free certificates issued by Let’s Encrypt CA do not exhibit such anomalies⁷.

⁷ There are only 7 (pre)certificates with empty subject out of 62,424 issued by any Let’s Encrypt CA in our dataset.

    set                      DN                  CN       SAN            extensions
                             attributes/length   length   count/length   count/size
    all (pre)certificates    0.6/18.9            17.1     1.0/61.2       9.0/2307
    empty subject            0.0/0.0             0.0      1.0/106.7      9.0/2307

Table 3: Averages for (pre)certificates issued by ZeroSSL ECC Domain Secure Site CA

The problem with unusual length of the CN attribute is not unique to ZeroSSL CA. Similar certificates are issued by Let’s Encrypt. Again, a possible explanation might be an error in certificate management automation. Domain owners are probably not aware or simply do not care, since both CAs offer free certificates. An example of such a CN (certificate issued by Let’s Encrypt) is

    gitlab.gitlab.gitlab.gitlab.gitlab.git.testing.yikj.work

Other observations. Ignoring ZeroSSL-issued certificates and various additional infrastructure certificates by Apple, Cisco, Google, and other well-known companies, we have found several more entries that look interesting. For example, a certificate issued by Let’s Encrypt CA with the following set of SANs:

    *.ajptzd.com, *.amklvv.com, *.aqcssg.com,
    *.ccjytp.com, *.doeigp.com, *.egfnjv.com,
    *.eydqoa.com, *.fvrnlf.com, *.guuzxk.com,
    *.hgmwfy.com, *.iwhqyn.com, *.kldcuc.com,
    *.lfmdnj.com, *.lloond.com, *.naktki.com,
    *.nmklqi.com, *.npwpbz.com, *.nxezmi.com,
    *.ojdger.com, *.psfqpu.com, *.ptgreh.com,
    *.raclbc.com, *.rvaajo.com, *.spikfh.com,
    *.swwoyd.com, *.tnuntp.com, *.xfcpkw.com,
    *.xnrsre.com, *.xuvvdq.com, *.yyiosx.com

Most of these domains are unresolvable by public DNS (NXDOMAIN) as of May 2024. We did not investigate this certificate further to determine whether it represents a legitimate use-case, misconfiguration, business malpractice, or other malicious intent. However, based on the experience documented in [9], these domains may result from a domain generation algorithm (DGA) [10], indicating a likely malicious intent⁸. The other attributes of such certificates are rather normal, e.g., 90 days validity, 2048-bit RSA key or nine X.509 extensions, dictated mostly by the certification policy of Let’s Encrypt CA.

⁸ A common tactic employed by large threat actors involves creating a script that randomly generates numerous domain names, purchasing them, and subsequently switching between domains as needed once one is blocked or otherwise compromised.

5. Conclusion

We proposed an anomaly detection technique for certificates using Isolation Forest. This approach can be beneficial when compliance testing with X.509 linters is unsatisfactory, and we seek anomalies beyond compliance. We demonstrated the feasibility of this method; however, further exploration is necessary. Some potential directions are:

• Training the model on certificates for a specific domain or domains owned by a single entity, allowing anomalies to serve as early internal warnings of potential issues.
• Identifying certificates from large cloud providers and excluding them from the model and evaluation. The CT logs contain a vast quantity of these precertificates and certificates, which can distort parameters of the model.
• Analyzing the results of identified anomalies in greater detail, such as those described in the previous section, to find explanations for the anomalous certificates.

Acknowledgments

This publication is the result of support under the Operational Program Integrated Infrastructure for the project: Advancing University Capacity and Competence in Research, Development and Innovation (ACCORD, ITMS2014+:313021X329), co-financed by the European Regional Development Fund.

References

[1] B. Laurie, A. Langley, E. Kasper, E. Messeri, R. Stradling, Certificate Transparency Version 2.0, RFC 9162, 2021. URL: https://www.rfc-editor.org/info/rfc9162. doi:10.17487/RFC9162.
[2] Cloudflare, Merkle town, 2023. URL: https://ct.cloudflare.com/.
[3] B. Laurie, A. Langley, E. Kasper, Certificate Transparency, RFC 6962, 2013. URL: https://www.rfc-editor.org/info/rfc6962. doi:10.17487/RFC6962.
[4] M. Jurčák, Using Certificates and CT Logs for communication, Bachelor’s thesis, Comenius University, 2023. In Slovak.
[5] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 413–422. doi:10.1109/ICDM.2008.17.
[6] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation-based anomaly detection, ACM Trans. Knowl. Discov. Data 6 (2012). doi:10.1145/2133360.2133363.
[7] S. Boeyen, S. Santesson, T. Polk, R. Housley, S. Farrell, D. Cooper, Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile, RFC 5280, 2008. URL: https://www.rfc-editor.org/info/rfc5280. doi:10.17487/RFC5280.
[8] Y. Zhao, Z. Nasrullah, Z. Li, Pyod: A python toolbox for scalable outlier detection, Journal of Machine Learning Research 20 (2019) 1–7. URL: http://jmlr.org/papers/v20/19-011.html.
[9] J. Terrill, Analyzing a Wordpress PHP malware campaign and reverse engineering C2 communications, 2022. URL: https://hacked.codes/2022/december-2022-php-wordpress-malware-analysis/, [Online; accessed June 2024].
[10] Wikipedia contributors, Domain generation algorithm, 2023. URL: https://en.wikipedia.org/wiki/Domain_generation_algorithm, [Online; accessed June 2024].
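
As an illustration of the name-based features and the CA rarity feature described in Section 2, the following Python sketch shows one way they could be computed. It is not the authors' implementation: the function names (subdomain_count, name_features, ca_rarity) are hypothetical, a wildcard label "*" is assumed to count as a subdomain like any other label, and the CN is assumed to be de-duplicated against the SAN entries before wildcard names are counted, which the paper does not state explicitly.

```python
from __future__ import annotations

from collections import Counter


def subdomain_count(name: str) -> int:
    """Count dot-separated labels in a domain name ('www.uniba.sk' -> 3)."""
    # Assumption: a wildcard label '*' is counted like any other label.
    return len([label for label in name.split(".") if label])


def name_features(cn: str, sans: list[str]) -> dict[str, float]:
    """Derive name-based features from a CN and a list of SAN DNS names."""
    # Assumption: the CN usually repeats one SAN entry, so names are
    # de-duplicated before counting wildcards and averaging subdomains.
    names = list(dict.fromkeys(([cn] if cn else []) + list(sans)))
    return {
        "cn_length": len(cn),
        "cn_subdomains": subdomain_count(cn),
        "wildcard_cn": int("*" in cn),
        "san_count": len(sans),
        "avg_san_length": sum(map(len, sans)) / len(sans) if sans else 0.0,
        "wildcard_names": sum("*" in name for name in names),
        "avg_subdomains": (
            sum(subdomain_count(name) for name in names) / len(names)
            if names
            else 0.0
        ),
    }


def ca_rarity(issuers: list[str]) -> dict[str, float]:
    """Map each issuer DN to the fraction of sample records it issued."""
    counts = Counter(issuers)
    return {issuer: count / len(issuers) for issuer, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical inputs, not records from the Xenon 2024 sample.
    print(name_features("*.example.org", ["*.example.org", "www.uniba.sk"]))
    print(ca_rarity(["CA A", "CA A", "CA B", "CA A"]))
```

An empty subject, as observed for many ZeroSSL records in Section 4, can be passed as an empty CN string; the CN features then evaluate to zero while the SAN features are still computed.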