Anomaly Detection in Certificate Transparency Logs
                                Richard Ostertág, Martin Stanek
                                Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia


                                                                          Abstract
                                                                          We propose an anomaly detection technique for X.509 certificates utilizing Isolation Forest. This method can be beneficial when
                                                                          compliance testing with X.509 linters proves unsatisfactory, and we seek to identify anomalies beyond standards compliance.
                                                                          The technique is validated on a sample of 120,000 certificates from one of the largest public Certificate Transparency (CT)
                                                                          logs, Xenon 2024, which is operated by Google.

                                                                          Keywords
                                                                          Anomaly Detection, Certificate Transparency Logs, Isolation Forest


                                1. Introduction                                                                                                                    6962 [3] (version 1). However, the API is focused on mon-
                                                                                                                                                                   itoring CT log entries and there is no method to search
                                Digital certificates, or public key certificates, issued by                                                                        for entries based on domain names or other attributes. To
                                trusted certification authorities play an essential role in                                                                        satisfy the demand for advanced queries and monitoring,
                                facilitating trust in security protocols. They bind the                                                                            there are various free and commercial services available.
                                identity of a subject to a specific public key. Certificates                                                                       Notable free search services are crt.sh 1 operated by
                                that are issued mistakenly or with malicious intent pose                                                                           Sectigo and Entrust Certificate Search2 . Commercial of-
                                a significant security threat, with impacts related to iden-                                                                       ferings allow outsourcing monitoring tasks for domain
                                tity spoofing.                                                                                                                     owners and provide automated checks and notifications
                                   Certificate Transparency (CT) is a standard designed                                                                            when events that require owner attention are observed.
                                to mitigate this threat. The main idea behind CT is to col-                                                                           In the world of ubiquitous Transport Layer Security
                                lect and store all issued certificates in publicly available                                                                       (TLS) communication, CT logs have become a rich source
                                CT logs with verifiable authenticity. These logs allow                                                                             of information regarding domain names. Passive recon-
                                anyone, such as domain owners, to monitor issued cer-                                                                              naissance regularly employs searches through CT logs to
                                tificates and detect misissued certificates. The details                                                                           enumerate subdomains during penetration testing. Ex-
                                of CT operation, including participants, data structures,                                                                          ample tools that use this technique, among other meth-
                                protocol, etc., are specified in RFC 9162 [1].                                                                                     ods, are OWASP Amass3 , subfinder4 , and reconFTW5 .
                                   Certificate Transparency is gradually gaining popular-
                                ity, and browsers like Chrome (Chromium) and Safari are                                                                            Anomaly detection. Anomalous certificates may in-
                                now requiring Transport Layer Security (TLS) certificates                                                                          dicate various issues, such as misissued certificates, unin-
                                to contain proof of CT log inclusion. This requirement is                                                                          tended defects, or operational problems of domain own-
                                achieved by adding signed certificate timestamps (SCTs)                                                                            ers. They can raise suspicions and warrant an investi-
                                into the certificate. The SCT serves as a signed promise                                                                           gation. Certificates in CT logs can even be abused for
                                that the CT log operator will append the certificate to                                                                            unidirectional covert communication [4]. There might be
                                the CT log.                                                                                                                        other abuses of CT logs and unknown problems as well.
                                   The most prominent public CT logs are operated by                                                                               It is much more efficient to detect misissued certificates
                                Google, Cloudflare, and certification authorities them-                                                                            using exact tests when we know what we are looking
                                selves, such as DigiCert, Let’s Encrypt, and Sectigo. Since                                                                        for. However, the detection of anomalous certificates can
                                all relevant certification authorities support CT, as of May                                                                       help identify potential, yet unknown, issues that may
                                2024, over 460,000 certificates are published in CT logs                                                                           require further investigation.
                                every hour [2].                                                                                                                        Another application of anomaly detection is when
                                   The HTTP-based API that allows direct access to a                                                                               anomalies initially identified by a model are no longer
                                CT log is specified in RFC 9162 [1] (version 2.0) or RFC                                                                           rare. This might indicate changes in the use of certifi-
                                                                                                                                                                   cates, reflected in their structure or content characteris-
                                ITAT’24: Workshop on Applied Security, September 20–24, 2024,                                                                      tics. Moreover, the model can be trained on certificates
                                Drienica, SK
                                richard.ostertag@fmph.uniba.sk (R. Ostertág);                                                                                      1
                                                                                                                                                                     https://crt.sh, a direct SQL access to the database is also available
                                martin.stanek@fmph.uniba.sk (M. Stanek)                                                                                            2
                                                                                                                                                                     https://ui.ctsearch.entrust.com/ui/ctsearchui
                                ORCID: 0000-0002-6560-1515 (R. Ostertág)                                                                                           3
                                                                                                                                                                     https://owasp.org/www-project-amass/
                                                                    © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License   4
                                                                    Attribution 4.0 International (CC BY 4.0).                                                       https://github.com/projectdiscovery/subfinder
                                                                    CEUR Workshop Proceedings (CEUR-WS.org)                                                        5
                                 CEUR
                                 Workshop
                                 Proceedings
                                               http://ceur-ws.org
                                               ISSN 1613-0073
                                                                                                                                                                     https://github.com/six2dez/reconftw


issued for specific domains, and anomalies detected in
newly issued certificates can indicate an internal problem
that needs to be addressed.
   In our paper, we use the term “anomaly” to refer to
certificates that are significantly different from those usu-
ally observed. We do not test certificates for compliance
with X.509 standards like linters do6 . However, in future
work, it might be interesting to include linter results as
additional attributes for anomaly detection, providing
a more comprehensive analysis of certificate structures
and content.

Our contribution. We evaluate selected statistical in-
formation about certificates in CT logs, focusing on at-
tributes defined by domain owners, such as Subject Alter-
native Name (SAN) in Section 2. We propose a method
for anomaly detection in certificates using Isolation For-
est [5, 6], an unsupervised machine learning technique,
in Section 3. We select suitable certificate attributes and
train the model on a sampled set of certificates obtained
from CT logs. The results of our Isolation Forest model
are presented in Section 4.


2. Statistics and attributes selection
                                                                              Figure 1: Selected characteristics of subjects in the dataset

We created a random sample of 120,000 records from
one of the largest public Certificate Transparency (CT)                           CN=multimedia-academy.tudelft.nl,
logs, Xenon 2024, which is operated by Google. In CT                              O=Technische Universiteit Delft,
logs, there are two types of records: precertificates and                         ST=Zuid-Holland,
certificates. Since all the features we want to extract are                       C=NL
already available in precertificates, we do not discrimi-
nate between these types in our analysis.                                        The presence and number of attributes in a DN can
   According to the statistics presented by Cloudflare on                     vary. For instance, we found that approximately 2.28% of
their Merkle Town webpage [2], the issuance rate of new                       our sample certificates did not contain a CN attribute. To
certificates across all monitored CT logs is more than 460                    extract quantitative features from the subject section of
thousand per hour (as of May 2024). Therefore, the                            a certificate, we considered the following characteristics
probability of sampling both the corresponding precertificate                 (see also the boxplot visualization in Figure 1):
and certificate is negligible. The sample size and variety
of records in our experiment are sufficient to effectively                         • The length of a DN – this refers to the number of
investigate anomalous certificates within CT logs.                                   characters in the DN string representing a subject.
   Let us discuss what attributes we considered and se-                              In our sample, DN lengths range from 0 to 278
lected for feature extraction. We group them in several                              characters with an average length of 33.0.
categories – subject, subject’s public key, issuer, signa-                         • The number of attributes in a DN – this represents
ture, validity, and X.509 extensions.                                                the inner structure of the DN and indicates how
                                                                                     many relative Distinguished Names (RDNs) are
Subject. A distinguished name (DN) consists of a set                                 present. The maximum value in our sample is 12
of attributes that identify a subject. In the case of domain                         attributes, while the mean is 1.4 attributes, and
validated certificates, it usually contains just the com-                            only 14.0% of records have an attribute count that
mon name (CN). For organization validated certificates,                              is not equal to 1.
however, it may contain a set of attributes such as:                               • The length of a CN – this attribute focuses on
6
    It is important to note that X.509 linters check certificates against a          the most important and most frequently present
    specific set of rules, ensuring they conform to established standards.           part of a DN. The maximal allowed length of 64
    Some well-known tools are ZLint and pkilint.                                     characters [7] is observed in 1.8% of records.
              256     384     2048    3072    4096     8192         • CA rarity: A float number computed as a fraction
RSA                          64.8%    0.4%     8.4%    0.0%*          of certificates in the sample with the same issuer
ECDSA       24.4%    2.1%                                             (DN).
  * exactly one 8192-bit RSA key in the sample                    It is assumed that more common certification authori-
                                                               ties have better practices and stricter certification policies
Table 1                                                        in place, so their certificates are less likely to be anoma-
Distribution of subject’s public key lengths in the sample     lous. Therefore, we will not analyze other aspects of the
                                                               issuer further.
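
The CA rarity feature can be illustrated with a minimal Python sketch that computes, for each certificate, the fraction of certificates in the sample sharing its issuer DN. The function name `ca_rarity` and the issuer strings are hypothetical and serve only as an example; this is not the extraction code used in the experiment.

```python
from collections import Counter

def ca_rarity(issuer_dns):
    """For each certificate, return the fraction of the sample with the same issuer DN."""
    counts = Counter(issuer_dns)
    total = len(issuer_dns)
    return [counts[dn] / total for dn in issuer_dns]

# Hypothetical issuer DNs: three certificates from one CA, one from another.
issuers = ["CN=Common CA", "CN=Common CA", "CN=Common CA", "CN=Rare CA"]
rarity = ca_rarity(issuers)
```

Under this definition, a low value marks a rarely seen issuer, so the feature is small for uncommon certification authorities.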

     • Number of subdomains in a CN – this represents        Signature. We do not extract any features from a sig-
       the inner structure of CN. In our sample, the num-    nature algorithm used by certification authorities to sign
       ber of subdomains ranges from 0 to 15.                (pre)certificates. This is entirely at their discretion, and
     • Wildcard CN – a boolean value indicating              we assume that CA rarity, see above, covers unusual
       whether the CN contains a ‘*’ character. Wildcard     certification authorities sufficiently in our experiment.
       CNs are observed in 12.0% of records.                 However, if someone wants to consider signatures in
                                                             anomaly detection, both types (algorithms) as well as
                                                             key lengths should be considered. Nice online statistics
                                                             covering signature algorithms are presented in [2], with
                                                             RSA-SHA256 being used in 90% of (pre)certificates.
                                                                Another set of attributes that might be considered in
                                                             the future are embedded SCT (Signed Certificate
                                                             Timestamps) in certificates. A certification authority can decide
                                                             in which CT logs it wants to include a certificate. This
                                                             decision is usually uniform across different certificates,
                                                             taking into account their expiry date. An unusual
                                                             combination or SCT count can indicate an anomaly.

   Certainly, there might exist qualitative anomalies in
certificates based on small differences or variances that
are not captured by quantitative characteristics alone. For
example, some uncommon semantics may be used for
DN attributes. These anomalies will not be detected with
methods trained only on quantitative features. However,
we do not attempt to analyze these anomalies in this
paper as it would require interpreting different parts
and attributes of the certificate beyond the scope of our
experiment. This approach, focusing on quantitative
characteristics, is also used for feature extraction in the
rest of this section.
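
To make the quantitative subject features concrete (DN length, number of DN attributes, CN length, number of subdomains in the CN, and the wildcard flag), the following Python sketch derives them from a DN string. It assumes a comma-separated string representation without escaped commas or multi-valued RDNs, and it counts subdomains as period-separated labels; the function name `subject_features` and the dictionary keys are hypothetical.

```python
def subject_features(dn):
    """Extract simple quantitative features from a subject DN string."""
    # Naive split: assumes no escaped commas and one attribute per RDN.
    rdns = [part.strip() for part in dn.split(",") if part.strip()]
    attrs = dict(part.split("=", 1) for part in rdns if "=" in part)
    cn = attrs.get("CN", "")
    return {
        "dn_length": len(dn),
        "dn_attribute_count": len(rdns),
        "cn_length": len(cn),
        "cn_subdomains": len(cn.split(".")) if cn else 0,
        "cn_wildcard": "*" in cn,
    }

dn = "CN=multimedia-academy.tudelft.nl,O=Technische Universiteit Delft,ST=Zuid-Holland,C=NL"
features = subject_features(dn)
empty = subject_features("")
```

A production extractor would parse the DER-encoded name with an X.509 library rather than splitting a string, since attribute values may legitimately contain commas.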
                                                             Validity. Although the validity period depends on the
                                                             CA’s certification policy, our sample demonstrates
                                                             significant variability within this attribute, ranging from
                                                             one day to approximately 50 months. This feature is
                                                             extracted for use in anomaly detection.

Subject’s public key. A public key is another attribute
that is fully controlled by the subject. The certification
authority can restrict the types and supported lengths of
public keys for issued certificates, but the value is
ultimately generated by the subject. We extract two features       • Validity period: the number of days a certificate
from the public key: its type and length.                            is valid, calculated as the difference between “not
     • Public Key Type: There are only two types of                    before” and “not after” dates. Approximately 70%
       subject public keys – RSA and Elliptic Curve Dig-               of certificates in our sample are issued for a valid-
       ital Signature Algorithm (ECDSA). Our sample                    ity period of three months, predominantly due to
       shows a dominant position of RSA keys (73.5%).                  Let’s Encrypt’s certification policy. Nearly 19.3%
       We do not extract the type of elliptic curve used               of the certificates have a validity period of ap-
       in ECDSA keys. A numeric encoding of public                     proximately one year.
       key type is performed as follows: ECDSA ↦ 0,
       RSA ↦ 1.                                            X.509 extensions. There are various extensions that
     • Public Key Length: Bit length of the public key,    can be part of a certificate. Our experiment with anomaly
       depending on the modulus length for RSA or cho-     detection is focused mostly on attributes chosen by the
       sen curve for ECDSA. The observed variability of    subject. Therefore, special attention is given to the fea-
       this attribute is presented in Table 1.             tures of Subject Alternative Name (SAN) extension. Ac-
                                                           cording to RFC 5280 [7], the SAN entry can contain DNS
                                                           names, IP addresses, internet electronic mail addresses,
Issuer. Let’s Encrypt is the most prevalent certification   Uniform Resource Identifiers (URIs), among other
authority, accounting for over 52% of certificates in       options. The sample shows an overwhelming prevalence
our sample. The total number of distinct certification      of DNS names: almost all certificates have
authorities, identified by unique DN, is 176. For anomaly at least one DNS name in the SAN extension. Other en-
detection, we will use the rarity of CA as a feature:      try types appear in negligible fractions of records: IP
addresses are present in less than 0.03% of records, and           5 to 239 with an average value of 27.3. Given the
other types are absent altogether. We extract the follow-          observation of SAN count, the value primarily
ing features for anomaly detection:                                depends on certificates with a small number of
                                                                   SAN entries. The distribution of average SAN
                                                                   length values along with the distribution of SAN
                                                                   count values is presented in Figure 2.
                                                                 • The number of wildcard domain names: Approx-
                                                                   imately 65% of the certificates do not contain any
                                                                   wildcard names in their CN and SAN attributes,
                                                                   while 31.1% of certificates have just one wildcard
                                                                   name. Other counts are significantly less repre-
                                                                   sented (less than 3.9%).
                                                                 • Average number of subdomains: The average
                                                                   number of subdomains for CN and SAN attributes
                                                                   is calculated by counting all substrings separated
                                                                   by periods (“.”) in a domain name. For example,
                                                                   “www.uniba.sk” has three subdomains: “www”,
                                                                   “uniba”, and “sk”. As expected, the average num-
                                                                   ber of subdomains is generally within the range
                                                                   of 2 to 4, as shown in Figure 3.
                                                                 • Validation type: We assign each certificate a nu-
                                                                   merical representation of its validation type, with
                                                                   0 representing missing or unavailable informa-
                                                                   tion, 1 for Domain Validation (DV), 2 for Orga-
                                                                   nizational Validation (OV), and 3 for Extended
                                                                   Validation (EV). This representation orders val-
                                                                   idation types from the least strict policy to the
                                                                   most strict validation policy. For comprehensive
                                                                   global statistics, see Merkle Town’s webpage [2].
                                                                   In our sample, we found that 88.3% of certificates
                                                                   were DV, and 11.7% were OV, while other types
                                                                   occurred negligibly.
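
The SAN-related features can be sketched in Python as follows. Several details are assumptions rather than facts from the experiment: CN and SAN names are pooled and de-duplicated before wildcard names and subdomains are counted, the validation type is supplied as a ready-made label, and the names `san_features` and `VALIDATION_CODES` are hypothetical.

```python
VALIDATION_CODES = {"DV": 1, "OV": 2, "EV": 3}  # 0 is used for missing or unknown

def san_features(cn, san_dns_names, validation=None):
    """Compute SAN count, average SAN length, wildcard count, and average subdomains."""
    san = list(san_dns_names)
    # Assumption: CN and SAN names are pooled and de-duplicated before counting.
    names = list(dict.fromkeys(([cn] if cn else []) + san))
    # Subdomains are counted as period-separated labels, so "www.uniba.sk" has three.
    labels = [len(name.split(".")) for name in names]
    return {
        "san_count": len(san),
        "san_avg_length": sum(len(n) for n in san) / len(san) if san else 0.0,
        "wildcard_count": sum(1 for n in names if n.startswith("*.")),
        "avg_subdomains": sum(labels) / len(labels) if labels else 0.0,
        "validation_type": VALIDATION_CODES.get(validation, 0),
    }

features = san_features("www.uniba.sk", ["www.uniba.sk", "uniba.sk", "*.uniba.sk"], "DV")
```

The ordinal encoding of the validation type mirrors the ordering from the least strict to the most strict policy described in the list.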
                                                               We decided not to analyze other extensions separately
                                                            despite their potential interest, such as Key Usage, CRL,
                                                            OCSP, and various constraints. Although problems or
                                                            anomalies can be hidden in any of them, we selected a
                                                            subset of attributes more related to the subject, because
                                                            these attributes can help detect incorrect configurations
                                                            when requesting certificates or possible covert commu-
                                                            nication. For other anomaly detection applications, it
                                                            might be important to include specific X.509 extensions
                                                            in the set of selected features. Our experiment focuses
                                                            on the following summary characteristics:

Figure 2: SAN count and average length in the dataset            • Extensions count: The number of X.509 exten-
                                                                   sions in a certificate. The dataset shows this
                                                                   parameter ranging from 5 to 13 with 97.3% of
                                                                   records having 9 or 10 extensions.
     • The count of SAN entries: Our sample shows                • Extensions size: The length of X.509 extensions
       an average number of SAN entries as 2.1, with               in a certificate excluding SAN, since the related
       a minimum of 1 and a maximum of 238. The                    SAN characteristics – number and average length
       number of certificates with 10 or more SANs is              – are represented separately. The average size in
       below 1.5%.                                                 our sample is 2306 bytes, while minimum and
     • Average length of SAN entries: The average                  maximum sizes are 815 and 3506 bytes, respec-
       length of SAN entries in a certificate ranges from          tively.
                                                                  • Sampling from the data is performed without re-
                                                                    placement.
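
The isolation mechanism and the chosen parameters (200 trees, 256 samples per tree drawn without replacement, all features available for splitting) can be illustrated with a from-scratch Python sketch. This is not the PyOD implementation used in the experiment; all function names are hypothetical and the data are synthetic. In PyOD, the same settings would presumably be passed through the `n_estimators`, `max_samples`, `max_features`, and `bootstrap` arguments of `IForest`.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Expected path length of an unsuccessful binary search tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Split rows recursively on a random feature and a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # a constant feature cannot be split
        return {"size": len(rows)}
    threshold = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < threshold]
    right = [r for r in rows if r[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }

def path_length(row, node, depth=0):
    """Depth at which row reaches a leaf, plus a correction for unsplit leaves."""
    if "feature" not in node:
        return depth + c_factor(node["size"])
    side = "left" if row[node["feature"]] < node["threshold"] else "right"
    return path_length(row, node[side], depth + 1)

def fit_forest(data, n_trees=200, sample_size=256, seed=0):
    """Train n_trees trees, each on sample_size rows drawn without replacement."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))  # assumes len(data) >= 2
    max_depth = math.ceil(math.log2(max(size, 2)))
    trees = [build_tree(rng.sample(data, size), 0, max_depth, rng) for _ in range(n_trees)]
    return trees, size

def anomaly_score(row, trees, size):
    """Score between 0 and 1; shorter average paths give scores closer to 1."""
    mean_depth = sum(path_length(row, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(size))

# Synthetic data (not from the experiment): a tight cluster plus one distant point.
data_rng = random.Random(1)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(300)]
outlier = (10.0, 10.0)
trees, size = fit_forest(cluster + [outlier])
outlier_score = anomaly_score(outlier, trees, size)
typical_score = sum(anomaly_score(p, trees, size) for p in cluster) / len(cluster)
```

The distant point is separated from the rest after only a few random splits, so its average path is short and its score is higher than the average score of the cluster.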
                                                                The contamination of the data, i.e., the proportion of
                                                             anomalies in the dataset, is irrelevant for the discussion
                                                             in Section 4, because the contamination is only used to
                                                             set the anomaly score threshold. Instead, we examine
                                                             which records (precertificates and certificates) have
                                                             the highest anomaly scores. From these observations,
                                                             conclusions can be drawn without requiring knowledge
                                                             of the exact contamination value for our dataset.
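
The point that contamination only moves a threshold, while the ranking of records stays the same, can be shown with a small Python sketch. The scores are made up for illustration, the function names are hypothetical, and the thresholding rule is a simplification that may differ in detail from the percentile-based rule a library applies.

```python
def rank_by_score(scores, k):
    """Indices of the k highest anomaly scores, most anomalous first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def flag_with_contamination(scores, contamination):
    """Flag roughly the top `contamination` fraction of records as anomalous."""
    k = max(1, round(contamination * len(scores)))
    threshold = sorted(scores, reverse=True)[k - 1]
    return [s >= threshold for s in scores]

# Made-up anomaly scores for ten records.
scores = [0.41, 0.39, 0.72, 0.45, 0.68, 0.40, 0.43, 0.38, 0.44, 0.42]
top3 = rank_by_score(scores, 3)
flags_10 = flag_with_contamination(scores, 0.10)
flags_20 = flag_with_contamination(scores, 0.20)
```

Changing the contamination from 0.10 to 0.20 changes how many records are flagged, but the order returned by `rank_by_score` does not depend on it.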


                                                             4. Results
                                                             We document the types of precertificates and certifi-
                                                             cates that are detected as the most anomalous in our
                                                             exploratory experiment. A general observation is that
                                                             some cloud services and their internal components are
                                                             the most frequent outliers in our dataset.

Figure 3: Average number of subdomains in the dataset        Azure infrastructure. The most anomalous certifi-
                                                             cates in our experiment are those issued by Microsoft
                                                             for the components of Azure infrastructure. The issuing
                                                             CAs are:
                                                                  • Microsoft Azure TLS Issuing CA XX – several
                                                                    authorities, where XX denotes number 01, 02,
                                                                    etc.;
                                                                  • Microsoft Azure RSA TLS Issuing CA XX – again
                                                                    several authorities issuing certificates;
                                                                  • Microsoft RSA TLS CA XX – significantly smaller
                                                                    number of certificates in comparison to the above
                                                                    two sets of authorities.

3. Anomaly detection
Isolation Forest is an unsupervised anomaly detection
technique proposed by Liu, Ting, and Zhou [5, 6]. It
builds a collection of binary trees, similar to binary
search trees, by randomly selecting branching features
and thresholds. The anomaly score for a data point is
based on the average depth at which it is isolated across
multiple trees. The main idea behind Isolation Forest is
that, on average, anomalies are isolated in lower depths
than non-anomalous data.                                        Table 2 summarizes basic characteristics for each CA. It
   The Isolation Forest algorithm was selected for our       shows above-average values, particularly for the first two
experiment due to its ability to detect anomalies with-      CAs. Besides higher than usual number of SAN domain
out relying on complex distance metrics or density esti-     names, longer domain names and extensions, the other
mation. Furthermore, Isolation Forest performs well in       factors contribute to anomaly of detected certificates as
high-dimensional problems containing a large number          well. Top anomalous certificates show various devia-
of irrelevant attributes. Additionally, it can effectively   tions, such as slightly odd validity period, the number
train the model even when the anomalies are not present      of wildcard domain names, and other attributes, com-
in the training sample. The technique also has low time      bined with relative rarity of issuing CA. In this regard,
and memory complexity.                                       the anomaly detection works as intended. For example,
   We utilize an implementation of the Isolation Forest      the most anomalous certificate in the dataset according
provided in PyOD library [8] for anomaly detection in        our trained model shows the following characteristics:
multivariate data. We set the following parameters for            • Common Name:
this technique:                                                     CN=*.table.preprod.core.windows.net
                                                                  • Issuer: Microsoft Azure TLS Issuing CA 06
     • Number of estimators (trees): 200                          • Validity period: 282
     • Number of samples drawn from the data to train             • SAN count: 52, the number of wildcard domain
       each estimator: 256                                          names: 52
     • Number of features drawn from the data to train            • The average number of subdomains: 7
       each estimator: 16 (all available features)                • The number of extensions: 12, overall extension
                                                                    size: 3206
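The isolation idea and the parameters from Section 3 (200 trees, 256 samples per tree, all 16 features available for splitting) can be illustrated with a small from-scratch sketch. This is not the PyOD implementation used in the experiment; the function names (c_factor, build_tree, path_length, fit_forest, anomaly_score) are hypothetical, and the score normalization follows the formula given by Liu, Ting, and Zhou [5], where values closer to 1 indicate points that are isolated at shallow depths.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature and a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    # Only features that still vary can separate the remaining rows.
    usable = [f for f in range(len(rows[0]))
              if min(r[f] for r in rows) < max(r[f] for r in rows)]
    if not usable:
        return ("leaf", len(rows))
    feature = rng.choice(usable)
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    threshold = rng.uniform(low, high)
    left = [r for r in rows if r[feature] < threshold]
    right = [r for r in rows if r[feature] >= threshold]
    return ("node", feature, threshold,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x, depth=0):
    """Depth at which x reaches a leaf, adjusted for the unsplit leaf size."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, feature, threshold, left, right = tree
    child = left if x[feature] < threshold else right
    return path_length(child, x, depth + 1)


def fit_forest(data, n_trees=200, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample of the data."""
    if len(data) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    psi = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(data, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, psi


def anomaly_score(forest, x):
    """Score in (0, 1]; a shorter average path gives a higher score."""
    trees, psi = forest
    mean_depth = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / c_factor(psi))
```

Here `data` is assumed to be a list of equal-length numeric feature vectors, one per certificate. In PyOD, the three parameters listed above would presumably correspond to the n_estimators, max_samples, and max_features arguments of pyod.models.iforest.IForest.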
CA                                   DN attributes/length   CN length   SAN count/length   Extensions count/size
Microsoft Azure TLS Issuing CA       5.0/89.6               41.6        3.5/45.0           12.0/3218
Microsoft Azure RSA TLS Issuing CA   5.0/97.3               49.3        3.4/54.5           12.0/3232
Microsoft RSA TLS CA                 1.0/38.5               35.5        21.2/38.2          10.7/3031

Table 2: Averages for selected characteristics of issued certificates
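Several of the characteristics reported for the Azure certificates, such as the SAN count, the number of wildcard names, and the average SAN length, can be derived from the list of SAN names alone. The following is a minimal sketch under stated assumptions: the helper name san_summary is hypothetical, and the "average number of labels" computed here simply counts dot-separated labels per name, which may differ from the exact definition of the "average number of subdomains" feature used in the experiment.

```python
def san_summary(san_names):
    """Summarize a list of SAN DNS names (hypothetical helper, simplified definitions)."""
    count = len(san_names)
    if count == 0:
        return {"san_count": 0, "wildcard_count": 0,
                "avg_length": 0.0, "avg_labels": 0.0}
    return {
        # Number of SAN entries in the certificate.
        "san_count": count,
        # Names whose leftmost label is the wildcard "*".
        "wildcard_count": sum(1 for name in san_names if name.startswith("*.")),
        # Average length of a SAN name in characters.
        "avg_length": sum(len(name) for name in san_names) / count,
        # Assumption: labels are counted as dot-separated parts, wildcard included.
        "avg_labels": sum(len(name.split(".")) for name in san_names) / count,
    }
```

Averaging such per-certificate values over all certificates of one issuing CA would give per-CA figures of the kind shown in Table 2.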


Other CAs and ZeroSSL. After filtering out certificates issued by the CAs mentioned above,
we examined the top 100 anomalous items in greater detail. Among these, we observed:
     • Two certificates issued by DigiCert: one for a Chinese cloud service provider and one
       for a legitimate IT company.
     • Two certificates issued by Amazon for its AWS components.
     • All remaining 96 certificates were issued by ZeroSSL CA, specifically by ZeroSSL ECC
       Domain Secure Site CA. These have an unusual structure: an empty subject (DN) and
       questionable SAN attributes containing a large number of repetitive subdomains. Two
       examples (each a single name, wrapped across lines) are:
            – www.www.www.www.www.www.pay.
                avito.sber.avito.avito.www.www.
                www.www.www.www.www.www.yandex.
                avito.yandex.pay.portalswebmail.
                blumebwww.od3.10cekub2b.k.
                webmail.ultagkhanub2b.k.webmemo.
                m.phpmyadmin.wokemtutankhanub2b.
                k.webmail.ultagame.com
            – www.www.www.www.www.www.www.www.
                www.www.www.www.www.www.www.www.
                www.www.www.www.www.www.www.www.
                www.www.www.www.www.www.www.www.
                www.www.www.www.www.www.www.www.
                www.www.www.www.www.www.www.www.
                www.www.calendario.
                panel-fiveheberg.fr
       These might indicate an operational problem with an automation script that issues and
       renews certificates and adds a www prefix to a domain name. In the case of the domain
       panel-fiveheberg.fr, this is combined with a wildcard DNS record that responds
       positively to any DNS query. We checked both domains in VirusTotal, and no security
       vendor flagged them as malicious (as of April 2024).
       Not all precertificates and certificates issued by ZeroSSL ECC Domain Secure Site CA
       in our dataset have the mentioned structure. However, the fraction with an empty
       subject is rather large: 41.3%. Table 3 summarizes some characteristics of records
       issued by this CA. Interestingly, free certificates issued by Let’s Encrypt CA do not
       exhibit such anomalies.⁷

⁷ There are only 7 (pre)certificates with an empty subject out of 62,424 issued by any
  Let’s Encrypt CA in our dataset.

Set                     DN attributes/length   CN length   SAN count/length   Extensions count/size
all (pre)certificates   0.6/18.9               17.1        1.0/61.2           9.0/2307
empty subject           0.0/0.0                0.0         1.0/106.7          9.0/2307

Table 3: Averages for (pre)certificates issued by ZeroSSL ECC Domain Secure Site CA

   The problem with an unusual length of the CN attribute is not unique to ZeroSSL CA.
Similar certificates are issued by Let’s Encrypt. Again, a possible explanation might be an
error in certificate management automation. Domain owners are probably not aware or simply
do not care, since both CAs offer free certificates. An example of such a CN (certificate
issued by Let’s Encrypt) is

       gitlab.gitlab.gitlab.gitlab.gitlab.git.testing.yikj.work

Other observations. Ignoring ZeroSSL-issued certificates and various additional
infrastructure certificates by Apple, Cisco, Google, and other well-known companies, we have
found several more entries that look interesting. For example, a certificate issued by Let’s
Encrypt CA with the following set of SANs:

       *.ajptzd.com, *.amklvv.com, *.aqcssg.com,
       *.ccjytp.com, *.doeigp.com, *.egfnjv.com,
       *.eydqoa.com, *.fvrnlf.com, *.guuzxk.com,
       *.hgmwfy.com, *.iwhqyn.com, *.kldcuc.com,
       *.lfmdnj.com, *.lloond.com, *.naktki.com,
       *.nmklqi.com, *.npwpbz.com, *.nxezmi.com,
       *.ojdger.com, *.psfqpu.com, *.ptgreh.com,
       *.raclbc.com, *.rvaajo.com, *.spikfh.com,
       *.swwoyd.com, *.tnuntp.com, *.xfcpkw.com,
       *.xnrsre.com, *.xuvvdq.com, *.yyiosx.com

Most of these domains are unresolvable by public DNS (NXDOMAIN) as of May 2024. We did not
investigate this certificate further to determine whether it represents a legitimate use
case, misconfiguration, business malpractice, or other malicious intent. However, based on
the experience documented in [9], these domains may result from a domain generation
algorithm (DGA) [10],
indicating a likely malicious intent.⁸ The other attributes of such certificates are rather
normal, e.g., a 90-day validity period, a 2048-bit RSA key, and nine X.509 extensions,
dictated mostly by the certification policy of Let’s Encrypt CA.

⁸ A common tactic employed by large threat actors involves creating a script that randomly
  generates numerous domain names, purchasing them, and subsequently switching between
  domains as needed once one is blocked or otherwise compromised.

5. Conclusion

We proposed an anomaly detection technique for certificates using Isolation Forest. This
approach can be beneficial when compliance testing with X.509 linters is unsatisfactory, and
we seek anomalies beyond compliance. We demonstrated the feasibility of this method;
however, further exploration is necessary. Some potential directions are:
     • Training the model on certificates for a specific domain or domains owned by a single
       entity, allowing anomalies to serve as early internal warnings of potential issues.
     • Identifying certificates from large cloud providers and excluding them from the model
       and evaluation. The CT logs contain a vast quantity of these precertificates and
       certificates, which can distort the parameters of the model.
     • Analyzing the results of identified anomalies in greater detail, such as those
       described in the previous section, to find explanations for the anomalous
       certificates.

Acknowledgments

This publication is the result of support under the Operational Program Integrated
Infrastructure for the project: Advancing University Capacity and Competence in Research,
Development and Innovation (ACCORD, ITMS2014+:313021X329), co-financed by the European
Regional Development Fund.

References

 [1] B. Laurie, A. Langley, E. Kasper, E. Messeri, R. Stradling, Certificate Transparency
     Version 2.0, RFC 9162, 2021. URL: https://www.rfc-editor.org/info/rfc9162.
     doi:10.17487/RFC9162.
 [2] Cloudflare, Merkle town, 2023. URL: https://ct.cloudflare.com/.
 [3] B. Laurie, A. Langley, E. Kasper, Certificate Transparency, RFC 6962, 2013.
     URL: https://www.rfc-editor.org/info/rfc6962. doi:10.17487/RFC6962.
 [4] M. Jurčák, Using Certificates and CT Logs for communication, Bachelor’s thesis,
     Comenius University, 2023. In Slovak.
 [5] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: 2008 Eighth IEEE International
     Conference on Data Mining, 2008, pp. 413–422. doi:10.1109/ICDM.2008.17.
 [6] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation-based anomaly detection, ACM Trans. Knowl.
     Discov. Data 6 (2012). doi:10.1145/2133360.2133363.
 [7] S. Boeyen, S. Santesson, T. Polk, R. Housley, S. Farrell, D. Cooper, Internet X.509
     Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile,
     RFC 5280, 2008. URL: https://www.rfc-editor.org/info/rfc5280. doi:10.17487/RFC5280.
 [8] Y. Zhao, Z. Nasrullah, Z. Li, Pyod: A python toolbox for scalable outlier detection,
     Journal of Machine Learning Research 20 (2019) 1–7.
     URL: http://jmlr.org/papers/v20/19-011.html.
 [9] J. Terrill, Analyzing a Wordpress PHP malware campaign and reverse engineering C2
     communications, 2022.
     URL: https://hacked.codes/2022/december-2022-php-wordpress-malware-analysis/,
     [Online; accessed June 2024].
[10] Wikipedia contributors, Domain generation algorithm, 2023.
     URL: https://en.wikipedia.org/wiki/Domain_generation_algorithm,
     [Online; accessed June 2024].