Revisiting Monte Carlo Strength Evaluation

Martin Stanek
Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University

Abstract
The Monte Carlo method, proposed by Dell'Amico and Filippone, estimates a password's rank within a probabilistic model for password generation, i.e., it determines the password's strength according to this model. We propose several ideas to improve the precision or speed of the estimation. Through experimental tests, we demonstrate that improved sampling can yield slightly better precision. Moreover, additional precomputation results in faster estimations with a modest increase in memory usage.

Keywords
Password, Monte Carlo, Strength Evaluation

ITAT'24: Workshop on Applied Security, September 20–24, 2024, Drienica, SK. martin.stanek@fmph.uniba.sk (M. Stanek). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Passwords remain a frequently used authentication method, despite numerous initiatives, technologies, and implementations aiming for passwordless authentication. Although the popularity of methods such as Windows Hello, Passkey, and WebAuthn has increased, the security of passwords continues to be a significant topic in many application areas.

Evaluating the strength of a password is useful for providing users with feedback on their chosen passwords. This feedback can assist users in selecting stronger passwords. Often, the strength is calculated as a password's rank, i.e., how many passwords will be generated by some chosen algorithm until our password is produced. There are various tools that calculate the strength of a password, for example zxcvbn [1], or the password scorer tool in PCFG cracker [2].

Dell'Amico and Filippone proposed a Monte Carlo algorithm that estimates a password's rank within a probabilistic model [3]. The algorithm works for any probabilistic password generation model, and the authors proved that the estimated results converge to the actual ranks.

The Monte Carlo estimator is also used to evaluate and compare different probabilistic models for password generation. The original paper compares n-gram models [4], the PCFG model using a probabilistic context-free grammar [5], and the Backoff model [6]. A recent example of using the Monte Carlo estimator is the evaluation of a password guessing method that employs a random forest [7].

Our contribution. We propose three ideas for improving the precision or speed of the Monte Carlo estimator. The first idea is to interpolate a password's rank within the sampled interval it belongs to, according to its probability. The second idea aims to reduce probability overlap in sampled passwords. Both these ideas, presented in Section 3.2, seek to improve the estimator's precision. The estimation speed for a password, originally based on binary search, can be enhanced with some additional data computed in advance (the third idea, see Section 3.3). All ideas have been tested experimentally to assess their merit. The results are presented in Section 4. Our experiments demonstrate that improved sampling can yield slightly better precision. However, the effect of interpolation on precision is inconclusive, and we cannot rely on this technique to improve precision.

We utilize the reference implementation of the Monte Carlo estimator, which was published by one of the authors of the original paper on GitHub [8], and we employ the RockYou dataset for our experiments. Given that our focus lies on the estimator itself, the choice of dataset is relatively unimportant.

2. How the Monte Carlo Estimator works

We mostly follow [3] in this section. Let Γ be the set of all allowed passwords. A probabilistic password model aims to capture how humans select passwords, assigning higher probabilities to more frequently chosen passwords and lower probabilities to less common ones. Let p(α) denote the probability assigned to password α by the model, such that Σ_{α∈Γ} p(α) = 1. Different models yield different probability distributions.

When the model is used for an attack, it enumerates passwords in descending order of probability. Therefore, the strength of a password α is the number of passwords with a higher probability:

    S_p(α) = |{β ∈ Γ; p(β) > p(α)}|.

Remark. In this context, the authors do not address the possibility that the model may assign identical probabilities to multiple passwords, resulting in a non-monotonic p. The definition of S_p(α) assigns all passwords that share the same probability the lowest rank in their group. This approach can be considered prudent from a security standpoint.

Computing the exact value of S_p(α), for a random α, has prohibitively large time complexity. The Monte Carlo estimator uses sampling and approximation to provide an efficient and sufficiently accurate estimation. It relies on two properties of the underlying model:

• The model allows for efficiently computing p(α) for any password α.
• There is an efficient sampling method that generates a password according to the model's distribution.

Precomputation. The estimator generates a sample Θ of n passwords (sampling with replacement). The sample Θ = {β_1, ..., β_n} is sorted by descending probability, i.e., p(β_1) ≥ ... ≥ p(β_n). The cumulative ranks of sampled passwords are calculated as follows:

    c_i = (1/n) Σ_{j=1}^{i} 1/p(β_j),   for i = 1, ..., n.

The estimator needs to store the probabilities. The cumulative ranks can be easily recomputed. However, both these arrays are usually significantly smaller than the representation of the model, see Section 3.

Remark. The implementation [8] uses negative log2 probabilities, i.e., it stores −log2 p(β_j) instead of p(β_j).

Estimation. In order to estimate S_p(α) for some password α, the probability p(α) is computed first. Then binary search is used to find the largest index j such that p(β_j) > p(α). The result, the estimated rank of α, is S_p(α) ≈ c_j. Hence, the time complexity of the estimator is O(log n).
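For concreteness, the following sketch mirrors the precomputation and estimation steps described above. It is not the reference implementation [8]: it assumes a hypothetical model object with methods sample() (draw a password from the model's distribution) and logprob(pw) (return −log2 p(pw)); these names are our own choices, not part of any existing API.

```python
import bisect


class MonteCarloEstimator:
    """Sketch of the estimator from [3]: precompute a sorted sample and
    cumulative ranks, then estimate ranks by binary search."""

    def __init__(self, model, n=10_000):
        self.model = model
        # Sample n passwords (with replacement) and keep their -log2
        # probabilities sorted ascending, i.e., by descending probability.
        self.logprobs = sorted(model.logprob(model.sample()) for _ in range(n))
        # Cumulative ranks: c_i = (1/n) * sum_{j<=i} 1/p(beta_j),
        # where 1/p(beta_j) = 2 ** (-log2 p(beta_j)).
        self.ranks = []
        acc = 0.0
        for lp in self.logprobs:
            acc += 2.0 ** lp
            self.ranks.append(acc / n)

    def estimate_rank(self, password):
        lp = self.model.logprob(password)
        # Number of sampled passwords with strictly higher probability,
        # i.e., the largest index j with -log2 p(beta_j) < -log2 p(alpha).
        j = bisect.bisect_left(self.logprobs, lp)
        return self.ranks[j - 1] if j > 0 else 0.0
```

Storing negative log2 probabilities follows the convention of [8]; the binary search then costs O(log n) per estimated password.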
3. Areas for improvement

3.1. Memory requirement

The RockYou dataset contains more than 14 million unique passwords. The more passwords are used to train a model, the better and more precise results we can expect, such as in our case of password strength estimation. However, there is a point beyond which additional training data provide only negligible improvement, while further increasing the model's size. Notably, even the set of the 10,000 most frequent passwords generates models of substantial size: 3.17 MB for 4-gram, 7.45 MB for 5-gram, 43.5 MB for Backoff, and 0.99 MB for PCFG. An attempt to use up to 10% of the RockYou dataset for training leads to unacceptable model sizes, with the Backoff model being the largest, as shown in Figure 1.

Figure 1: Size of the model [MB] reflecting the number of passwords [thousands] in a training dataset, for the 4-gram, 5-gram, PCFG, and Backoff models. The graph on the right excludes the Backoff model to show the other three models more clearly.

The model defines how passwords are represented, generated, and how their probabilities are calculated. Since these methods are specific to each model, we do not aim to improve the model size. However, the Monte Carlo estimator utilizes additional arrays, where the probabilities and ranks of sampled passwords are precomputed. The original paper [3] experiments with various sample sizes up to 100,000 (having a "relative error 1%"), but mostly uses the default sample size of 10,000. The default sample size requires 160 kB of memory (real numbers are represented as the numpy.float64 datatype), which is dominated by the memory required for any model trained on a dataset of reasonable size.

3.2. Precision

The estimator assigns the same rank c_j to any password α for which the probability falls within the range p(β_j) > p(α) ≥ p(β_{j+1}). Intuitively, passwords with distinct probabilities should not get the same numeric estimate. Certainly, this is not an issue when the password strength is presented on a reduced scale using descriptive characteristics like weak – medium – strong – very strong, or using a traffic lights metaphor red – amber – green.

Idea 1. Interpolate rank values within intervals using an appropriate function. The most basic approach, without additional parameters, is linear interpolation. This has no impact on memory complexity and a negligible impact on time complexity. Figure 2 shows a graph of password ranks, on a logarithmic scale, for various models and the sample size of 10,000. It appears that linear interpolation on the logarithmic scale should perform well for these models.

Figure 2: Password ranks (log scale) corresponding to the position in the sample, for the 4-gram, 5-gram, PCFG, and Backoff models.
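A minimal sketch of Idea 1, reusing the arrays of the estimator sketched in Section 2: the rank is interpolated linearly between the two adjacent cumulative ranks, according to where −log2 p(α) falls inside the sampled interval. An interpolation on the logarithmic rank scale suggested by Figure 2 would interpolate log10 of the ranks instead. The estimator attributes are the hypothetical names assumed in the earlier sketch.

```python
import bisect


def interpolated_rank(est, password):
    """Idea 1 (sketch): linear interpolation between adjacent cumulative ranks."""
    lp = est.model.logprob(password)
    j = bisect.bisect_left(est.logprobs, lp)
    if j == 0:
        return 0.0                      # more probable than every sampled password
    if j == len(est.logprobs):
        return est.ranks[-1]            # beyond the sampled probability range
    lo_lp, hi_lp = est.logprobs[j - 1], est.logprobs[j]
    lo_c, hi_c = est.ranks[j - 1], est.ranks[j]
    if hi_lp == lo_lp:                  # duplicate probabilities: nothing to interpolate
        return lo_c
    t = (lp - lo_lp) / (hi_lp - lo_lp)  # relative position inside the interval
    return lo_c + t * (hi_c - lo_c)
```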
The precision of the estimator depends on the sample size. More specifically, it depends on the number of unique probabilities in the set P = {p(β_1), ..., p(β_n)}. We define the overlap of Θ as the fraction of probability values that are already in the set, and therefore do not contribute to the estimator's precision: 1 − |P|/n. Table 1 shows the average overlap for different models and sample sizes. Surprising differences in overlap are observed among different models. An expected increase in overlap is observed with an increasing sample size; the overlap depends substantially on the password probability distribution given by the model from which the passwords are generated. On the other hand, a larger training dataset results in a greater diversity of passwords, leading to slightly lower overlap.

training set   sample size   4-gram   5-gram   Backoff   PCFG
500,000        10,000        13.6%    16.5%    20.4%     44.6%
               30,000        20.6%    25.8%    31.6%     60.8%
               50,000        24.8%    30.5%    37.3%     67.1%
1,000,000      10,000        12.4%    14.7%    17.2%     43.1%
               30,000        19.1%    22.4%    27.6%     58.5%
               50,000        22.4%    26.7%    33.2%     64.8%

Table 1: Overlap percentage for different models and sample sizes. Models are trained on 500,000 and 1,000,000 of the most frequent passwords from the RockYou dataset. Every number is an average of 3 experiments.

Idea 2. The estimator samples random passwords for Θ until it gets n unique probabilities. It compresses the sample by discarding duplicate probabilities in such a way that preserves the cumulative sum of the entry with the largest index. Hence, the rank calculation remains intact, and the overlap of the resulting Θ is 0. Since the sampling is done in the precomputation phase, it does not impact the estimation time or memory complexity in any way.

Table 2 shows how many passwords must be sampled using a trained model to achieve the target size of the sample with distinct probabilities.

target         sampled passwords
sample size    4-gram   5-gram   Backoff   PCFG
10,000         11,689   12,239   12,894    23,483
30,000         38,795   42,032   47,178    122,865
50,000         68,358   75,953   90,123    258,489

Table 2: Average number of sampled passwords required to achieve the desired sample size with distinct probabilities. Models are trained on 500,000 of the most frequent passwords from the RockYou dataset. Every number is an average of 10 experiments, rounded to the nearest integer.
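A sketch of the sampling from Idea 2, under the same hypothetical model interface as above. Duplicate probability values are collapsed so that only the last entry of each group survives, carrying its cumulative sum; normalising by the total number of draws is our reading of the construction and is not spelled out in the text.

```python
def sample_unique(model, n):
    """Idea 2 (sketch): sample until n distinct -log2 probabilities are seen,
    then drop duplicates, keeping the cumulative sum of the last entry in
    each group of equal probabilities."""
    draws, seen = [], set()
    while len(seen) < n:
        lp = model.logprob(model.sample())
        draws.append(lp)
        seen.add(lp)
    draws.sort()

    m = len(draws)                                  # total number of draws
    logprobs, ranks, acc = [], [], 0.0
    for i, lp in enumerate(draws):
        acc += 2.0 ** lp                            # 1 / p(beta_i)
        if i + 1 == m or draws[i + 1] != lp:        # last duplicate of its group
            logprobs.append(lp)
            ranks.append(acc / m)
    return logprobs, ranks                          # exactly n entries, overlap 0
```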
3.3. Estimation speed

The binary search employed in the original estimator is fast enough for assessing individual passwords. However, when the estimator is used to evaluate or compare different models and their variants, the ranks of a large number of passwords need to be estimated. An optimization can be relevant in these scenarios.

Idea 3. Divide the range of possible probability values p(β_i), in our case expressed as negative log2 values, into t intervals (bins): [0, τ_1), [τ_1, τ_2), ..., [τ_{t−1}, ∞), where 0 < τ_1 < ... < τ_{t−1}. For each interval, for 1 ≤ i ≤ t, we calculate minimal and maximal look-up indices that narrow the interval for binary search (we use τ_0 = 0 in the following equations):

    LU_min(i) = min{1 ≤ j ≤ n | −log2 p(β_j) ≥ τ_{i−1}},
    LU_max(i) = max{1 ≤ j ≤ n | −log2 p(β_j) < τ_i}.

The estimator is adapted accordingly. Given a password α, we determine the interval such that −log2 p(α) ∈ [τ_{i−1}, τ_i). Then, the binary search is performed within the set of indices {LU_min(i), ..., LU_max(i)}, instead of the full set {1, ..., n}. We expect to narrow the interval for the binary search substantially, so the benefit of fewer comparisons will be measurable. Trivially, the precision of the estimator remains unchanged.

The price paid is the cost of computing the LU_min and LU_max arrays, which is a simple one-time precomputation, and the small amount of memory needed to store these arrays in the estimator (for example, 100 intervals "cost" approximately 7.8 kB, even with a wasteful representation using Python's int objects for the stored indices and lists for the arrays).
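A sketch of Idea 3: the bin boundaries and look-up ranges are precomputed once, and each estimation then binary-searches only the indices of its bin. It assumes the estimator sketched in Section 2; the boundaries τ_1, ..., τ_{t−1} are passed as a list, and the upper index is stored as an exclusive bound (one past LU_max) because that is what bisect expects.

```python
import bisect
import math


def build_lookup(logprobs, taus):
    """Idea 3 (sketch): per-bin index ranges over the sorted -log2 probabilities."""
    bounds = [0.0] + list(taus) + [math.inf]                  # tau_0 = 0, tau_t = infinity
    ranges = []
    for i in range(len(bounds) - 1):
        lu_min = bisect.bisect_left(logprobs, bounds[i])      # first index >= tau_{i-1}
        lu_max = bisect.bisect_left(logprobs, bounds[i + 1])  # one past last index < tau_i
        ranges.append((lu_min, lu_max))
    return bounds, ranges


def estimate_rank_binned(est, password, bounds, ranges):
    """Binary search restricted to the precomputed range of the password's bin."""
    lp = est.model.logprob(password)
    i = min(bisect.bisect_right(bounds, lp) - 1, len(ranges) - 1)
    lo, hi = ranges[i]
    j = bisect.bisect_left(est.logprobs, lp, lo, hi)
    return est.ranks[j - 1] if j > 0 else 0.0
```

For the configurations used in Section 4.2, taus would be [1, 2, ..., 99] for 100 bins and [0.1, 0.2, ..., 99.9] for 1,000 bins.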
4. Experiments

We implement the ideas presented in the previous section and report the results of our experiments.

4.1. Precision

The ideas aimed at improving precision apply to the Monte Carlo estimator, regardless of the underlying model. We do not attempt to modify the models. For example, if the password -1-1-1-1 is assigned inf (Python's float('inf') value) as its negative log2 probability in the PCFG model, because the pattern is outside of the trained grammar, we do not try to "fix" this. Moreover, we do not compare the performance of the models to each other.

We assess the impact of our ideas on the real ranks of passwords generated by the models. Similarly to the original paper [3], we generate all passwords up to some probability threshold. The rank of a password is its position in the list sorted by the probabilities assigned by the model to the passwords.

The first experiment uses the PCFG model trained on 10 million passwords from the RockYou dataset. The threshold for password generation was set at 20, i.e., all passwords with probability at least 2^−20 were generated – there were 91,693 passwords in this dataset (let's denote it T). We consider various combinations of the proposed ideas:

• original – the reference implementation of the estimator [8];
• interpolation – interpolated rank calculation within the interval between two adjacent probabilities (Idea 1);
• sampling – improved sampling with n unique probabilities (Idea 2);
• all – a combination of interpolation and sampling.

Let rr(α) denote the real rank of password α ∈ T, and let er(α) denote the rank estimated by a particular variant of the estimator. The weighted error of the estimator on the password set T is calculated as follows:

    Σ_{α∈T} p(α) · |er(α) − rr(α)|.

The weighted error assumes that the estimators are used to assess passwords chosen by humans, following the original distribution. We also consider a simple error for completeness:

    (1/|T|) Σ_{α∈T} |er(α) − rr(α)|.
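The two error measures translate directly into code. The sketch below assumes dictionaries p, rr, and er mapping each password in T to its model probability, real rank, and estimated rank; these names are hypothetical, chosen to match the formulas.

```python
def weighted_error(T, p, rr, er):
    """Sum over T of p(alpha) * |er(alpha) - rr(alpha)|."""
    return sum(p[a] * abs(er[a] - rr[a]) for a in T)


def simple_error(T, rr, er):
    """Mean absolute difference between estimated and real ranks over T."""
    return sum(abs(er[a] - rr[a]) for a in T) / len(T)
```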
Table 3 shows the results of our experiment. We performed 100 experiments. We have to warn the reader – the reported errors are sensitive to the particular password distribution sampled into Θ. Unsurprisingly, the sampling (Idea 2) helps to reduce estimation errors in general. The situation with interpolation (Idea 1) is mixed, with a substantial fraction of experiments showing worse statistics. The reason is that the interpolation makes the error worse when passwords in Θ already "overshoot" their true ranks; taking the same rank without interpolation compensates for this. Therefore, interpolation cannot be recommended for improving the precision of the estimator. On the other hand, it helps with the "same rank" problem, when different passwords are assigned the same rank by the estimator.

variant          weighted error   simple error
original         16.54            101.11
interpolation    15.33            90.63
sampling         11.79            70.63
all              10.86            63.10

Table 3: Weighted and simple errors of various estimator variants. Every number is an average of 100 experiments.

Figure 3 compares the original variant with the "all" variant visually. It illustrates the simple difference between the calculated rank and the estimated rank, as well as the relative error of the estimators. As expected, based on the convergence proof in [3], the relative error is rather small in both cases.

Figure 3: The simple difference error (on the left) and the relative error (on the right) of the original and the "all" strategy. Both graphs display results for the 50,000 most probable passwords from the PCFG model. The numbers are the average values from 100 experiments.

4.2. Estimation speed

We tested two configurations: the first one with 100 intervals (bins), and the second one with 1,000 intervals. Negative log probabilities are divided into the fixed intervals [0, 1), [1, 2), ..., [99, ∞) in the first case, and into [0, 0.1), [0.1, 0.2), ..., [99.9, ∞) in the second case. Both configurations were tested with four different sizes of Θ. Table 4 shows the relative speed of the different variants with respect to the baseline, which is the original algorithm with |Θ| = 10,000. The results confirm a moderate speed-up for 100 intervals and a substantial speed-up for 1,000 intervals.

               relative estimation time
sample size    original   100 bins   1000 bins
10,000         1.00       0.92       0.37
30,000         1.08       1.00       0.39
50,000         1.13       1.04       0.40
100,000        1.20       1.10       0.40

Table 4: Average relative estimation time, where the baseline of 1.00 is the estimation time of the original binary search for the sample size 10,000. The experiment uses 10^6 passwords randomly generated by the PCFG model. Every number is an average of 10 experiments, rounded to two decimal places.

5. Additional observation and conclusion

Since passwords in Θ are generated according to their probability, with a sufficiently large sample size we expect that, for some k, the top-k most probable passwords will be in the correct order at the beginning of Θ. Therefore, simply reporting the order of these top-k passwords by the estimator can be beneficial to the precision. Figure 4 illustrates this phenomenon for the PCFG model and the sample size of 10,000, where approximately the top 180 passwords have exactly the same rank as their position in Θ. However, further down, the precision quickly deteriorates.

Figure 4: Comparison of the original password rank estimate and the estimate according to the position in the sampled passwords (denoted as fixed); absolute error versus password rank.

An interesting question is whether we can improve the estimator's precision by compensating for unusually large or small jumps (differences) between adjacent probabilities in the sampled passwords.

An area outside this paper that deserves further focus is the precision of the estimator for low-probability passwords. The estimator's precision worsens for passwords with high ranks. A potential approach might use different or additional sampling methods that focus on less probable passwords, so that we can cover this part of the probability space better.

References

[1] D. L. Wheeler, zxcvbn: Low-budget password strength estimation, in: Proceedings of the 25th USENIX Conference on Security Symposium, SEC'16, USENIX Association, 2016, pp. 157–173. URL: https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/wheeler.
[2] PCFG cracker, 2024. URL: https://github.com/lakiw/pcfg_cracker.
[3] M. Dell'Amico, M. Filippone, Monte Carlo strength evaluation: Fast and reliable password checking, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS '15, Association for Computing Machinery, 2015, pp. 158–169. URL: https://doi.org/10.1145/2810103.2813631.
[4] A. Narayanan, V. Shmatikov, Fast dictionary attacks on passwords using time-space tradeoff, in: Proceedings of the 12th ACM Conference on Computer and Communications Security, CCS '05, Association for Computing Machinery, 2005, pp. 364–372. URL: https://doi.org/10.1145/1102120.1102168.
[5] M. Weir, S. Aggarwal, B. de Medeiros, B. Glodek, Password cracking using probabilistic context-free grammars, in: 30th IEEE Symposium on Security and Privacy, 2009, pp. 391–405. URL: https://doi.org/10.1109/SP.2009.8.
[6] J. Ma, W. Yang, M. Luo, N. Li, A study of probabilistic password models, in: IEEE Symposium on Security and Privacy, 2014, pp. 689–704.
[7] D. Wang, Y. Zou, Z. Zhang, K. Xiu, Password guessing using random forest, in: 32nd USENIX Security Symposium (USENIX Security 23), USENIX Association, 2023, pp. 965–982. URL: https://www.usenix.org/conference/usenixsecurity23/presentation/wang-ding-password-guessing.
[8] M. Dell'Amico, montecarlopwd: Monte Carlo password checking, 2016. URL: https://github.com/matteodellamico/montecarlopwd.