<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Revisiting Monte Carlo Strength Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Stanek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Faculty of Mathematics</institution>
          ,
          <addr-line>Physics and Informatics</addr-line>
          ,
          <institution>Comenius University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Monte Carlo method, proposed by Dell'Amico and Filippone, estimates a password's rank within a probabilistic model for password generation, i.e., it determines the password's strength according to this model. We propose several ideas to improve the precision or speed of the estimation. Through experimental tests, we demonstrate that improved sampling can yield slightly better precision. Moreover, additional precomputation results in faster estimations with a modest increase in memory usage.</p>
      </abstract>
      <kwd-group>
        <kwd>Password</kwd>
        <kwd>Monte Carlo</kwd>
        <kwd>Strength Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Passwords remain a frequently used authentication method, despite numerous initiatives, technologies, and implementations aiming for passwordless authentication. Although the popularity of methods such as Windows Hello, Passkey, and WebAuthn has increased, the security of passwords continues to be a significant topic in many application areas.</p>
      <p>
        Evaluating the strength of a password is useful for providing users with feedback on their chosen passwords. This feedback can assist users in selecting stronger passwords. Often, the strength is calculated as a password’s rank, i.e., how many passwords will be generated by some chosen algorithm until our password is produced. There are various tools that calculate the strength of a password, for example zxcvbn [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or the password scorer tool in the PCFG cracker [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Dell’Amico and Filippone proposed a Monte Carlo algorithm that estimates a password’s rank within a probabilistic model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The algorithm works for any probabilistic password generation model, and the authors proved that the estimated results converge to the actual ranks.
      </p>
      <p>
        The Monte Carlo estimator is also used to evaluate and compare different probabilistic models for password generation. The original paper compares n-gram models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the PCFG model using a probabilistic context-free grammar [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and the Backoff model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A recent example of using the Monte Carlo estimator is the evaluation of a password guessing method that employs a random forest [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Our contribution. We propose three ideas for improving the precision or speed of the Monte Carlo estimator. The first idea is to interpolate a password’s rank within the sampled interval it belongs to, according to its probability. The second idea aims to reduce probability overlap in sampled passwords. Both these ideas, presented in Section 3.2, seek to improve the estimator’s precision. The estimation speed for a password, originally based on binary search, can be enhanced with some additional data computed in advance (the third idea, see Section 3.3). All ideas have been tested experimentally to assess their merit. The results are presented in Section 4. Our experiments demonstrate that improved sampling can yield slightly better precision. However, the effect of interpolation on precision is inconclusive, and we cannot rely on this technique to improve precision.
      </p>
      <p>
        We utilize the reference implementation of the Monte Carlo estimator, which was published by one of the authors of the original paper on GitHub [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and we employ the RockYou dataset for our experiments. Given that our focus lies on the estimator itself, the choice of dataset is relatively unimportant.
      </p>
      <p>ITAT’24: Workshop on Applied Security, September 20–24, 2024, Drienica, SK. Contact: martin.stanek@fmph.uniba.sk (M. Stanek)</p>
      <p>2. How the Monte Carlo Estimator Works</p>
      <p>
        We mostly follow [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] in this section. Let Γ be the set of all allowed passwords. A probabilistic password model aims to capture how humans select passwords, assigning higher probabilities to more frequently chosen passwords and lower probabilities to less common ones. Let p(x) denote the probability assigned to password x by the model, such that ∑_{x ∈ Γ} p(x) = 1. Different models yield different probability distributions.
      </p>
      <p>When the model is used for an attack, it enumerates passwords in descending order of probability. Therefore, the strength of a password x is the number of passwords with a higher probability:</p>
      <p>s(x) = |{y ∈ Γ; p(y) &gt; p(x)}|.</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>Remark. In this context, the authors do not address the possibility that the model may assign identical probabilities to multiple passwords, resulting in a non-monotonic ranking. The definition of s(x) assigns all passwords that share the same probability the lowest rank in their group. This approach can be considered prudent from a security standpoint.</p>
      <p>Computing the exact value of s(x), for a random x, has a prohibitively large time complexity. The Monte Carlo estimator uses sampling and approximation to provide efficient and sufficiently accurate estimation. It relies on two properties of the underlying model:</p>
      <p>• The model allows for efficiently computing p(x) for any password x.</p>
      <p>• There is an efficient sampling method that generates a password according to the model’s distribution.</p>
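      <p>Both properties are easy to satisfy for a toy model. The following sketch (the class name and structure are our own, not from the paper; real models are n-grams, PCFG, or Backoff) shows a minimal word-list model exposing efficient p(x) and sampling:</p>

```python
import random

class ToyModel:
    """Tiny word-list model with the two properties the estimator needs:
    efficient computation of p(x) and sampling from the distribution."""

    def __init__(self, counts):
        total = float(sum(counts.values()))
        self.p = {w: c / total for w, c in counts.items()}
        self.words = list(self.p)
        self.weights = [self.p[w] for w in self.words]

    def prob(self, password):
        # p(x) lookup; unseen passwords get probability 0
        return self.p.get(password, 0.0)

    def sample(self):
        # draw one password according to the model's distribution
        return random.choices(self.words, weights=self.weights, k=1)[0]
```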
      <p>Precomputation. The estimator generates a sample Θ of n passwords (sampling with replacement). The sample Θ = {x_1, . . . , x_n} is sorted by descending probability, i.e., p(x_1) ≥ . . . ≥ p(x_n). The cumulative ranks of sampled passwords are calculated as follows:</p>
      <p>c_i = (1/n) · ∑_{j=1}^{i} 1/p(x_j), for i = 1, . . . , n.</p>
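      <p>The precomputation step can be sketched as follows. This is our own minimal rendering of the formula above, not the reference code:</p>

```python
def precompute(sample_probs):
    """Sort a sample of password probabilities in descending order and
    compute cumulative rank estimates c_i = (1/n) * sum_{j=1..i} 1/p(x_j)."""
    probs = sorted(sample_probs, reverse=True)
    n = len(probs)
    cum_ranks, acc = [], 0.0
    for p in probs:
        acc += 1.0 / (n * p)
        cum_ranks.append(acc)
    return probs, cum_ranks
```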
      <sec id="sec-1-1">
        <title>3.2. Precision</title>
        <p>Estimation. In order to estimate s(x) for some password x, the probability p(x) is computed first. Then binary search is used to compute the largest index i such that p(x_i) &gt; p(x). The result, the estimated rank of x, is s(x) ≈ c_i. Hence, the time complexity of the estimator is O(log n).</p>
        <p>
          Remark. The implementation [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] uses negative log2 probabilities, i.e., scaling p(x_i) to −log2 p(x_i).
        </p>
        <p>The estimator assigns the same rank c_i to any password x for which the probability falls within the range p(x_i) &gt; p(x) ≥ p(x_{i+1}). Intuitively, passwords with distinct probabilities should not get the same numeric estimate. Certainly, this is not an issue when the password strength is presented on a reduced scale using descriptive characteristics like weak – medium – strong – very strong, or using a traffic-lights metaphor red – amber – green.</p>
        <p>The estimator needs to store the probabilities. The cumulative ranks can be easily recomputed. However, both these arrays are usually significantly smaller than the representation of the model, see Section 3.</p>
        <p>Idea 1. Interpolate rank values within intervals using an appropriate function. The most basic approach, without additional parameters, is linear interpolation. This has no impact on memory complexity and a negligible impact on time complexity. Figure 2 shows a graph of password ranks, on a logarithmic scale, for various models and the sample size of 10,000. It appears that linear interpolation on the logarithmic scale should perform well for these models.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Areas for improvement</title>
      <sec id="sec-2-1">
        <title>3.1. Memory requirement</title>
        <p>The RockYou dataset contains more than 14 million unique passwords. The more passwords are used to train a model, the better and more precise results we can expect, such as in our case for password strength estimation. However, there is a point beyond which additional training data provide only negligible improvement, while further increasing the model’s size. Notably, even the set of the 10,000 most frequent passwords generates models of substantial size: 3.17 MB for 4-gram, 7.45 MB for 5-gram, 43.5 MB for Backoff, and 0.99 MB for PCFG. An attempt to use up to 10% of the RockYou dataset for training leads to unacceptable model sizes, with the Backoff model being the largest, as shown in Figure 1.</p>
        <p>
          The model defines how passwords are represented, generated, and how their probabilities are calculated. Since these methods are specific to each model, we do not aim to improve the model size. However, the Monte Carlo estimator utilizes additional arrays, where the probabilities and ranks of sampled passwords are precomputed. The original paper [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] experiments with various sample sizes up to 100,000 (having “relative error 1%”), but mostly uses the default sample size of 10,000. The default sample size requires 160 kB of memory¹, and it is dominated by the memory required for any model trained on a dataset of reasonable length.
        </p>
        <p>¹Real numbers are represented as the numpy.float64 datatype.</p>
        <p>The precision of the estimator depends on the sample size. More specifically, it depends on the number of unique probabilities in the set P = {p(x_1), . . . , p(x_n)}.</p>
        <p>We define the overlap of Θ as the fraction of probability values that are already in the set, and therefore do not contribute to the estimator’s precision: 1 − |P|/n. Table 1 shows the average overlap for different models and sample sizes. Surprising differences in overlap are observed among different models. An expected increase in overlap is observed with an increasing sample size, since the overlap depends substantially on the password probability distribution, given by the model from which the passwords are generated. On the other hand, a larger training dataset results in greater diversity of passwords, leading to slightly lower overlap.</p>
        <p>[Figure 2: password ranks (log scale) versus position in the sample, for the 4-gram, 5-gram, PCFG, and Backoff models.]</p>
        <p>Idea 2. The estimator will sample random passwords for Θ until it gets n unique probabilities. It compresses the sample by discarding duplicate probabilities in such a way that it preserves the cumulative sum of the entry with the largest index. Hence, the rank calculation remains intact, and the overlap of the resulting Θ is 0. Since the sampling is done in the precomputation phase, it does not impact the estimation time or memory complexity in any way.</p>
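        <p>The compression part of Idea 2 can be sketched as follows. This is our own illustrative code, assuming the sample is already sorted by descending probability; for each group of equal probabilities only the last entry is kept, so the cumulative sum at the kept entry is unchanged (the sample-until-n-unique loop is omitted):</p>

```python
def compress_sample(sorted_probs):
    """Keep only the last entry of every group of equal probabilities;
    the cumulative rank at the kept entry is preserved, so the rank
    calculation stays intact and the overlap of the result is 0."""
    n = len(sorted_probs)
    uniq_probs, uniq_ranks, acc = [], [], 0.0
    for j, p in enumerate(sorted_probs):
        acc += 1.0 / (n * p)
        if j + 1 == n or sorted_probs[j + 1] != p:  # last of its group
            uniq_probs.append(p)
            uniq_ranks.append(acc)
    return uniq_probs, uniq_ranks
```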
      </sec>
      <sec id="sec-2-2">
        <title>3.3. Estimation speed</title>
        <p>The binary search employed in the original estimator is fast enough for assessing individual passwords. However, when the estimator is used to evaluate or compare different models and their variants, the ranks of a large number of passwords need to be estimated. An optimization can be relevant in these scenarios.</p>
        <p>[Figure 1: model size as a function of the number of training passwords (thousands), for the 4-gram, 5-gram, PCFG, and Backoff models.]</p>
        <p>Idea 3. Divide the interval of possible probability values p(x), in our case expressed as negative log2 values, into t intervals (bins): [0, b_1), [b_1, b_2), . . . , [b_{t−1}, ∞), where 0 &lt; b_1 &lt; . . . &lt; b_{t−1}. For each interval, for 1 ≤ k ≤ t, we calculate minimal and maximal look-up indices that narrow the interval for the binary search (we use b_0 = 0 in the following equations):</p>
        <p>LUmin(k) = min{1 ≤ i ≤ n | −log2 p(x_i) ≥ b_{k−1}},</p>
        <p>LUmax(k) = max{1 ≤ i ≤ n | −log2 p(x_i) &lt; b_k}.</p>
        <p>The estimator is adapted accordingly. Given a password x, we calculate the appropriate interval k such that −log2 p(x) ∈ [b_{k−1}, b_k). Then, the binary search is performed within the set of indices {LUmin(k), . . . , LUmax(k)}, instead of the full set {1, . . . , n}. We expect to narrow the interval for the binary search substantially, so the benefit of fewer comparisons will be measurable. Trivially, the precision of the estimator remains unchanged.</p>
        <p>The price paid is the cost of computing the LUmin and LUmax arrays, which is a simple one-time precomputation, and the small amount of memory needed to store these arrays in the estimator².</p>
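        <p>Idea 3 can be sketched as follows. The code and names are our own illustration; the bin boundaries b_1 .. b_{t−1} are given in negative log2 units, and the sample is sorted ascending by negative log2 probability:</p>

```python
import bisect

def build_lookup(sample_nlp, boundaries):
    """Precompute, for every bin [b_{k-1}, b_k), the index range of the
    ascending sample_nlp array that a binary search has to examine."""
    ranges, prev = [], 0.0
    for b in list(boundaries) + [float("inf")]:
        lo = bisect.bisect_left(sample_nlp, prev)  # first index with nlp ≥ b_{k-1}
        hi = bisect.bisect_left(sample_nlp, b)     # first index with nlp ≥ b_k
        ranges.append((lo, hi))
        prev = b
    return ranges

def estimate_with_bins(neg_logp, sample_nlp, cum_ranks, boundaries, ranges):
    """Binary search restricted to the precomputed index range of x's bin;
    the result is identical to searching the full array."""
    k = bisect.bisect_right(boundaries, neg_logp)  # bin with nlp in [b_{k-1}, b_k)
    lo, hi = ranges[k]
    i = bisect.bisect_left(sample_nlp, neg_logp, lo, hi)
    return cum_ranks[i - 1] if i > 0 else 0.0
```

        <p>Entries below the bin's lower boundary are always more probable than x and entries above its upper boundary never are, so restricting the search window cannot change the answer, only the number of comparisons.</p>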
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <p>We implement the ideas presented in the previous section and present the results of our experiments.</p>
      <p>²For example, 100 intervals “cost” approximately 7.8 kB, even with a wasteful representation using Python’s int objects for the stored indices and lists for the arrays.</p>
      <sec id="sec-3-3">
        <title>Backoff</title>
        <p>Let rr(x) denote the real rank of a password x ∈ D, and let er(x) denote the rank estimated by a particular variant of the estimator. The weighted error of the estimator on the password set D is calculated as follows:</p>
        <p>∑_{x ∈ D} p(x) · |er(x) − rr(x)|.</p>
        <sec id="sec-3-3-1">
          <title>4.1. Precision</title>
          <p>The weighted error assumes that the estimators are used to assess passwords chosen by humans, following the original distribution. We also consider a simple error for completeness:</p>
          <p>(1/|D|) · ∑_{x ∈ D} |er(x) − rr(x)|.</p>
          <p>The ideas aimed at improving precision apply to the Monte Carlo estimator, regardless of the underlying model. We do not attempt to modify the models. For example, if the password -1-1-1-1 is assigned inf³ as its negative log2 probability in the PCFG model, because the pattern is outside of the trained grammar, we do not try to “fix this”. Moreover, we do not compare the performance of the models to each other.</p>
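          <p>Both error measures are straightforward to compute. A small sketch (our own helpers, assuming per-password tuples of model probability, estimated rank, and real rank):</p>

```python
def weighted_error(samples):
    """Sum of p(x) * |er(x) - rr(x)| over the password set."""
    return sum(p * abs(er - rr) for p, er, rr in samples)

def simple_error(samples):
    """Mean absolute rank error, ignoring the password probabilities."""
    return sum(abs(er - rr) for _, er, rr in samples) / len(samples)
```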
          <p>
            We assess the impact of our ideas on the real ranks of passwords generated by the models. Similarly to the original paper [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], we generate all passwords up to some probability threshold. The rank of a password is its position in the list sorted by the probabilities assigned by the model to the passwords.
          </p>
          <p>Table 3: Weighted and simple errors of various estimator variants; every number is an average of 100 experiments. variant (weighted error, simple error): original (16.54, 101.11); interpolation (15.33, 90.63); sampling (11.79, 70.63); all (10.86, 63.10).</p>
          <p>The first experiment uses the PCFG model trained on 10 million passwords from the RockYou dataset. The threshold for password generation was set at 20, i.e., all passwords with probability at least 2^{−20} were generated – there were 91,693 passwords in this dataset (let us denote it D). We consider various combinations of the proposed ideas:</p>
          <p>
            • original – the reference implementation of the estimator [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]; • interpolation – the estimator with Idea 1 applied; • sampling – the estimator with Idea 2 applied; • all – the estimator with both ideas applied.
          </p>
          <p>Table 3 shows the results of our experiment. We performed 100 experiments. We have to warn the reader – the reported errors are sensitive to the particular password distribution sampled into Θ. Unsurprisingly, the sampling (Idea 2) helps to reduce estimation errors in general. The situation with interpolation (Idea 1) is mixed, with a substantial fraction of experiments showing worse statistics. The reason is that the interpolation makes the error worse when passwords in Θ already “overshoot” their true ranks. Taking the same rank without interpolation compensates for this. Therefore, interpolation cannot be recommended for improving the precision of the estimator. On the other hand, it helps with the “same rank” problem, when different passwords are assigned the same rank by the estimator.</p>
          <p>³Python’s float(’inf’) value.</p>
          <p>
            Figure 3 compares visually the original variant with the “all” variant. It illustrates the simple difference of the calculated rank and the estimated rank. It also shows the relative error of the estimators. As expected, based on the convergence proof in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], the relative error is rather small in both cases.
          </p>
        </sec>
        <sec id="sec-3-3-2">
          <title>4.2. Estimation speed</title>
          <p>We tested two configurations: the first one with 100 intervals (bins), and the second with 1,000 intervals. Negative log probabilities are divided into the fixed intervals [0, 1), [1, 2), . . . , [99, ∞) in the first case, and into [0, 0.1), [0.1, 0.2), . . . , [99.9, ∞) in the second case, respectively. Both configurations were tested with four different sizes of Θ. Table 4 shows the relative speed of the different variants with respect to the baseline, which is the original algorithm with |Θ| = 10000. The results confirm a moderate speed-up for 100 intervals and a substantial speed-up for 1,000 intervals.</p>
          <p>5. Additional observation and conclusion</p>
          <p>Since passwords in Θ are generated according to their probability, with a sufficiently large sample size we expect that, for some k, the top-k most probable passwords will be in the correct order at the beginning of Θ. Therefore, simply reporting the order of these top-k passwords by the estimator can be beneficial to the precision. Figure 4 illustrates this phenomenon for the PCFG model and the sample size of 10,000, where approximately the top 180 passwords have the exact rank as their position in Θ. However, further down the precision quickly deteriorates.</p>
          <p>An interesting question is if we can improve the estimator’s precision by compensating for unusually large or small jumps (differences) between adjacent probabilities in the sampled passwords.</p>
          <p>An area outside this paper that deserves further focus is the precision of the estimator for low-probability passwords. The estimator’s precision worsens for passwords with high ranks. A potential approach might use different or additional sampling methods that focus on less probable passwords, so that we can cover this part of the probability space better.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Wheeler</surname>
          </string-name>
          ,
          <article-title>Zxcvbn: low-budget password strength estimation</article-title>
          ,
          <source>in: Proceedings of the 25th USENIX Conference on Security Symposium</source>
          , SEC'16,
          <string-name>
            <given-names>USENIX</given-names>
            <surname>Association</surname>
          </string-name>
          ,
          <year>2016</year>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>173</lpage>
          . URL: https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/wheeler.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] PCFG cracker,
          <year>2024</year>
          . URL: https://github.com/lakiw/pcfg_cracker.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dell'Amico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Filippone</surname>
          </string-name>
          ,
          <article-title>Monte carlo strength evaluation: Fast and reliable password checking</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security</source>
          , CCS '15, Association for Computing Machinery,
          <year>2015</year>
          , pp.
          <fpage>158</fpage>
          -
          <lpage>169</lpage>
          . URL: https://doi.org/10.1145/2810103.2813631.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shmatikov</surname>
          </string-name>
          ,
          <article-title>Fast dictionary attacks on passwords using time-space tradeoff</article-title>
          ,
          <source>in: Proceedings of the 12th ACM Conference on Computer and Communications Security</source>
          , CCS '05, Association for Computing Machinery,
          <year>2005</year>
          , pp.
          <fpage>364</fpage>
          -
          <lpage>372</lpage>
          . URL: https://doi.org/10.1145/1102120.1102168.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Weir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Medeiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Glodek</surname>
          </string-name>
          ,
          <article-title>Password cracking using probabilistic context-free grammars</article-title>
          ,
          <source>in: 30th IEEE Symposium on Security and Privacy</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>391</fpage>
          -
          <lpage>405</lpage>
          . URL: https://doi.org/10.1109/SP.2009.8.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          , W. Yang,
          <string-name>
            <given-names>M.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A study of probabilistic password models</article-title>
          ,
          <source>in: IEEE Symposium on Security and Privacy</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>689</fpage>
          -
          <lpage>704</lpage>
          . URL: https://doi.org/10.1109/SP.2014.50.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , K. Xiu,
          <article-title>Password guessing using random forest</article-title>
          ,
          <source>in: 32nd USENIX Security Symposium (USENIX Security 23)</source>
          , USENIX Association,
          <year>2023</year>
          , pp.
          <fpage>965</fpage>
          -
          <lpage>982</lpage>
          . URL: https://www.usenix.org/conference/usenixsecurity23/presentation/wang-ding-password-guessing.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dell'Amico</surname>
          </string-name>
          ,
          <article-title>montecarlopwd: Monte Carlo password checking</article-title>
          ,
          <year>2016</year>
          . URL: https://github.com/matteodellamico/montecarlopwd.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>