Revisiting Monte Carlo Strength Evaluation

Martin Stanek
Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University

Abstract
The Monte Carlo method, proposed by Dell'Amico and Filippone, estimates a password's rank within a probabilistic model for password generation, i.e., it determines the password's strength according to this model. We propose several ideas to improve the precision or speed of the estimation. Through experimental tests, we demonstrate that improved sampling can yield slightly better precision. Moreover, additional precomputation results in faster estimations with a modest increase in memory usage.

Keywords
Password, Monte Carlo, Strength Evaluation

ITAT'24: Workshop on Applied Security, September 20–24, 2024, Drienica, SK. martin.stanek@fmph.uniba.sk (M. Stanek). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Passwords remain a frequently used authentication method, despite numerous initiatives, technologies, and implementations aiming for passwordless authentication. Although the popularity of methods such as Windows Hello, Passkey, and WebAuthn has increased, the security of passwords continues to be a significant topic in many application areas.

Evaluating the strength of a password is useful for providing users with feedback on their chosen passwords. This feedback can assist users in selecting stronger passwords. Often, the strength is calculated as a password's rank, i.e., how many passwords will be generated by some chosen algorithm until our password is produced. There are various tools that calculate the strength of a password, for example zxcvbn [1], or the password scorer tool in PCFG cracker [2].

Dell'Amico and Filippone proposed a Monte Carlo algorithm that estimates a password's rank within a probabilistic model [3]. The algorithm works for any probabilistic password generation model, and the authors proved that the estimated results converge to the actual ranks.

The Monte Carlo estimator is also used to evaluate and compare different probabilistic models for password generation. The original paper compares n-gram models [4], the PCFG model using a probabilistic context-free grammar [5], and the Backoff model [6]. A recent example of using the Monte Carlo estimator is the evaluation of a password guessing method that employs a random forest [7].

Our contribution. We propose three ideas for improving the precision or speed of the Monte Carlo estimator. The first idea is to interpolate a password's rank within the sampled interval it belongs to, according to its probability. The second idea aims to reduce probability overlap in sampled passwords. Both these ideas, presented in Section 3.2, seek to improve the estimator's precision. The estimation speed for a password, originally based on binary search, can be enhanced with some additional data computed in advance (the third idea, see Section 3.3). All ideas have been tested experimentally to assess their merit. The results are presented in Section 4. Our experiments demonstrate that improved sampling can yield slightly better precision. However, the effect of interpolation on precision is inconclusive, and we cannot rely on this technique to improve precision.

We utilize the reference implementation of the Monte Carlo estimator, which was published by one of the authors of the original paper on GitHub [8], and we employ the RockYou dataset for our experiments. Given that our focus lies on the estimator itself, the choice of dataset is relatively unimportant.

2. How the Monte Carlo Estimator works

We mostly follow [3] in this section. Let Γ be the set of all allowed passwords. A probabilistic password model aims to capture how humans select passwords, assigning higher probabilities to more frequently chosen passwords and lower probabilities to less common ones. Let p(α) denote the probability assigned to password α by the model, such that Σ_{α∈Γ} p(α) = 1. Different models yield different probability distributions.

When the model is used for an attack, it enumerates passwords in descending order of probability. Therefore, the strength of a password α is the number of passwords with a higher probability:

    S_p(α) = |{β ∈ Γ; p(β) > p(α)}|.

Remark. In this context, the authors do not address the possibility that the model may assign identical probabilities to multiple passwords, resulting in a non-monotonic p. The definition of S_p(α) assigns all passwords that share the same probability the lowest rank in their group. This approach can be considered prudent from a security standpoint.

Computing the exact value of S_p(α), for a random α, has prohibitively large time complexity. The Monte Carlo estimator uses sampling and approximation to provide an efficient and sufficiently accurate estimation. It relies on two properties of the underlying model:

• The model allows for efficiently computing p(α) for any password α.
• There is an efficient sampling method that generates a password according to the model's distribution.

Precomputation. The estimator generates a sample Θ of n passwords (sampling with replacement). The sample Θ = {β_1, ..., β_n} is sorted by descending probability, i.e., p(β_1) ≥ ... ≥ p(β_n). The cumulative ranks of sampled passwords are calculated as follows:

    c_i = (1/n) Σ_{j=1}^{i} 1/p(β_j),   for i = 1, ..., n.

The estimator needs to store the probabilities. The cumulative ranks can be easily recomputed. However, both these arrays are usually significantly smaller than the representation of the model, see Section 3.

Remark. The implementation [8] uses negative log2 probabilities, i.e., it stores −log2 p(β_j) instead of p(β_j).

Estimation. In order to estimate S_p(α) for some password α, the probability p(α) is computed first. Then binary search is used to find the largest index j such that p(β_j) > p(α). The result, the estimated rank of α, is S_p(α) ≈ c_j. Hence, the time complexity of the estimator is O(log n).
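For concreteness, the following sketch mirrors the precomputation and estimation steps described above. It is not the reference implementation [8]: it assumes a hypothetical model object with methods sample() (draw a password from the model's distribution) and logprob(pw) (return −log2 p(pw)); these names are our own choices, not part of any existing API.

```python
import bisect


class MonteCarloEstimator:
    """Sketch of the estimator from [3]: precompute a sorted sample and
    cumulative ranks, then estimate ranks by binary search."""

    def __init__(self, model, n=10_000):
        self.model = model
        # Sample n passwords (with replacement) and keep their -log2
        # probabilities sorted ascending, i.e., by descending probability.
        self.logprobs = sorted(model.logprob(model.sample()) for _ in range(n))
        # Cumulative ranks: c_i = (1/n) * sum_{j<=i} 1/p(beta_j),
        # where 1/p(beta_j) = 2 ** (-log2 p(beta_j)).
        self.ranks = []
        acc = 0.0
        for lp in self.logprobs:
            acc += 2.0 ** lp
            self.ranks.append(acc / n)

    def estimate_rank(self, password):
        lp = self.model.logprob(password)
        # Number of sampled passwords with strictly higher probability,
        # i.e., the largest index j with -log2 p(beta_j) < -log2 p(alpha).
        j = bisect.bisect_left(self.logprobs, lp)
        return self.ranks[j - 1] if j > 0 else 0.0
```

Storing negative log2 probabilities follows the convention of [8]; the binary search then costs O(log n) per estimated password.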
3. Areas for improvement

3.1. Memory requirement

The RockYou dataset contains more than 14 million unique passwords. The more passwords are used to train a model, the better and more precise results we can expect, such as in our case of password strength estimation. However, there is a point beyond which additional training data provide only negligible improvement, while further increasing the model's size. Notably, even the set of the 10,000 most frequent passwords generates models of substantial size: 3.17 MB for 4-gram, 7.45 MB for 5-gram, 43.5 MB for Backoff, and 0.99 MB for PCFG. An attempt to use up to 10% of the RockYou dataset for training leads to unacceptable model sizes, with the Backoff model being the largest, as shown in Figure 1.

Figure 1: Size of the model [MB] reflecting the number of passwords [thousands] in a training dataset, for the 4-gram, 5-gram, PCFG, and Backoff models. The graph on the right excludes the Backoff model to show the other three models more clearly.

The model defines how passwords are represented, generated, and how their probabilities are calculated. Since these methods are specific to each model, we do not aim to improve the model size. However, the Monte Carlo estimator utilizes additional arrays, where the probabilities and ranks of sampled passwords are precomputed. The original paper [3] experiments with various sample sizes up to 100,000 (having a "relative error 1%"), but mostly uses the default sample size of 10,000. The default sample size requires 160 kB of memory (real numbers are represented as the numpy.float64 datatype), which is dominated by the memory required for any model trained on a dataset of reasonable size.

3.2. Precision

The estimator assigns the same rank c_j to any password α for which the probability falls within the range p(β_j) > p(α) ≥ p(β_{j+1}). Intuitively, passwords with distinct probabilities should not get the same numeric estimate. Certainly, this is not an issue when the password strength is presented on a reduced scale using descriptive characteristics like weak – medium – strong – very strong, or using a traffic lights metaphor red – amber – green.

Idea 1. Interpolate rank values within intervals using an appropriate function. The most basic approach, without additional parameters, is linear interpolation. This has no impact on memory complexity and a negligible impact on time complexity. Figure 2 shows a graph of password ranks, on a logarithmic scale, for various models and the sample size of 10,000. It appears that linear interpolation on the logarithmic scale should perform well for these models.

Figure 2: Password ranks (log scale) corresponding to the position in the sample, for the 4-gram, 5-gram, PCFG, and Backoff models.
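A minimal sketch of Idea 1, reusing the arrays of the estimator sketched in Section 2: the rank is interpolated linearly between the two adjacent cumulative ranks, according to where −log2 p(α) falls inside the sampled interval. An interpolation on the logarithmic rank scale suggested by Figure 2 would interpolate log10 of the ranks instead. The estimator attributes are the hypothetical names assumed in the earlier sketch.

```python
import bisect


def interpolated_rank(est, password):
    """Idea 1 (sketch): linear interpolation between adjacent cumulative ranks."""
    lp = est.model.logprob(password)
    j = bisect.bisect_left(est.logprobs, lp)
    if j == 0:
        return 0.0                      # more probable than every sampled password
    if j == len(est.logprobs):
        return est.ranks[-1]            # beyond the sampled probability range
    lo_lp, hi_lp = est.logprobs[j - 1], est.logprobs[j]
    lo_c, hi_c = est.ranks[j - 1], est.ranks[j]
    if hi_lp == lo_lp:                  # duplicate probabilities: nothing to interpolate
        return lo_c
    t = (lp - lo_lp) / (hi_lp - lo_lp)  # relative position inside the interval
    return lo_c + t * (hi_c - lo_c)
```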
The precision of the estimator depends on the sample size. More specifically, it depends on the number of unique probabilities in the set P = {p(β_1), ..., p(β_n)}. We define the overlap of Θ as the fraction of probability values that are already in the set, and therefore do not contribute to the estimator's precision: 1 − |P|/n. Table 1 shows the average overlap for different models and sample sizes. Surprising differences in overlap are observed among different models. An expected increase in overlap is observed with an increasing sample size; the overlap depends substantially on the password probability distribution given by the model from which the passwords are generated. On the other hand, a larger training dataset results in a greater diversity of passwords, leading to slightly lower overlap.

training set   sample size   4-gram   5-gram   Backoff   PCFG
500,000        10,000        13.6%    16.5%    20.4%     44.6%
               30,000        20.6%    25.8%    31.6%     60.8%
               50,000        24.8%    30.5%    37.3%     67.1%
1,000,000      10,000        12.4%    14.7%    17.2%     43.1%
               30,000        19.1%    22.4%    27.6%     58.5%
               50,000        22.4%    26.7%    33.2%     64.8%

Table 1: Overlap percentage for different models and sample sizes. Models are trained on 500,000 and 1,000,000 of the most frequent passwords from the RockYou dataset. Every number is an average of 3 experiments.

Idea 2. The estimator samples random passwords for Θ until it gets n unique probabilities. It compresses the sample by discarding duplicate probabilities in such a way that preserves the cumulative sum of the entry with the largest index. Hence, the rank calculation remains intact, and the overlap of the resulting Θ is 0. Since the sampling is done in the precomputation phase, it does not impact the estimation time or memory complexity in any way.

Table 2 shows how many passwords must be sampled using a trained model to achieve the target size of the sample with distinct probabilities.

target         sampled passwords
sample size    4-gram   5-gram   Backoff   PCFG
10,000         11,689   12,239   12,894    23,483
30,000         38,795   42,032   47,178    122,865
50,000         68,358   75,953   90,123    258,489

Table 2: Average number of sampled passwords required to achieve the desired sample size with distinct probabilities. Models are trained on 500,000 of the most frequent passwords from the RockYou dataset. Every number is an average of 10 experiments, rounded to the nearest integer.
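A sketch of the sampling from Idea 2, under the same hypothetical model interface as above. Duplicate probability values are collapsed so that only the last entry of each group survives, carrying its cumulative sum; normalising by the total number of draws is our reading of the construction and is not spelled out in the text.

```python
def sample_unique(model, n):
    """Idea 2 (sketch): sample until n distinct -log2 probabilities are seen,
    then drop duplicates, keeping the cumulative sum of the last entry in
    each group of equal probabilities."""
    draws, seen = [], set()
    while len(seen) < n:
        lp = model.logprob(model.sample())
        draws.append(lp)
        seen.add(lp)
    draws.sort()

    m = len(draws)                                  # total number of draws
    logprobs, ranks, acc = [], [], 0.0
    for i, lp in enumerate(draws):
        acc += 2.0 ** lp                            # 1 / p(beta_i)
        if i + 1 == m or draws[i + 1] != lp:        # last duplicate of its group
            logprobs.append(lp)
            ranks.append(acc / m)
    return logprobs, ranks                          # exactly n entries, overlap 0
```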
3.3. Estimation speed

The binary search employed in the original estimator is fast enough for assessing individual passwords. However, when the estimator is used to evaluate or compare different models and their variants, the ranks of a large number of passwords need to be estimated. An optimization can be relevant in these scenarios.

Idea 3. Divide the range of possible probability values p(β_i), in our case expressed as negative log2 values, into t intervals (bins): [0, τ_1), [τ_1, τ_2), ..., [τ_{t−1}, ∞), where 0 < τ_1 < ... < τ_{t−1}. For each interval, for 1 ≤ i ≤ t, we calculate minimal and maximal look-up indices that narrow the interval for binary search (we use τ_0 = 0 in the following equations):

    LU_min(i) = min{1 ≤ j ≤ n | −log2 p(β_j) ≥ τ_{i−1}},
    LU_max(i) = max{1 ≤ j ≤ n | −log2 p(β_j) < τ_i}.

The estimator is adapted accordingly. Given a password α, we determine the interval such that −log2 p(α) ∈ [τ_{i−1}, τ_i). Then, the binary search is performed within the set of indices {LU_min(i), ..., LU_max(i)}, instead of the full set {1, ..., n}. We expect to narrow the interval for the binary search substantially, so the benefit of fewer comparisons will be measurable. Trivially, the precision of the estimator remains unchanged.

The price paid is the cost of computing the LU_min and LU_max arrays, which is a simple one-time precomputation, and the small amount of memory needed to store these arrays in the estimator (for example, 100 intervals "cost" approximately 7.8 kB, even with a wasteful representation using Python's int objects for the stored indices and lists for the arrays).
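A sketch of Idea 3: the bin boundaries and look-up ranges are precomputed once, and each estimation then binary-searches only the indices of its bin. It assumes the estimator sketched in Section 2; the boundaries τ_1, ..., τ_{t−1} are passed as a list, and the upper index is stored as an exclusive bound (one past LU_max) because that is what bisect expects.

```python
import bisect
import math


def build_lookup(logprobs, taus):
    """Idea 3 (sketch): per-bin index ranges over the sorted -log2 probabilities."""
    bounds = [0.0] + list(taus) + [math.inf]                  # tau_0 = 0, tau_t = infinity
    ranges = []
    for i in range(len(bounds) - 1):
        lu_min = bisect.bisect_left(logprobs, bounds[i])      # first index >= tau_{i-1}
        lu_max = bisect.bisect_left(logprobs, bounds[i + 1])  # one past last index < tau_i
        ranges.append((lu_min, lu_max))
    return bounds, ranges


def estimate_rank_binned(est, password, bounds, ranges):
    """Binary search restricted to the precomputed range of the password's bin."""
    lp = est.model.logprob(password)
    i = min(bisect.bisect_right(bounds, lp) - 1, len(ranges) - 1)
    lo, hi = ranges[i]
    j = bisect.bisect_left(est.logprobs, lp, lo, hi)
    return est.ranks[j - 1] if j > 0 else 0.0
```

For the configurations used in Section 4.2, taus would be [1, 2, ..., 99] for 100 bins and [0.1, 0.2, ..., 99.9] for 1,000 bins.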
4. Experiments

We implement the ideas presented in the previous section and report the results of our experiments.

4.1. Precision

The ideas aimed at improving precision apply to the Monte Carlo estimator, regardless of the underlying model. We do not attempt to modify the models. For example, if the password -1-1-1-1 is assigned inf (Python's float('inf') value) as its negative log2 probability in the PCFG model, because the pattern is outside of the trained grammar, we do not try to "fix" this. Moreover, we do not compare the performance of the models to each other.

We assess the impact of our ideas on the real ranks of passwords generated by the models. Similarly to the original paper [3], we generate all passwords up to some probability threshold. The rank of a password is its position in the list sorted by the probabilities assigned by the model to the passwords.

The first experiment uses the PCFG model trained on 10 million passwords from the RockYou dataset. The threshold for password generation was set at 20, i.e., all passwords with probability at least 2^−20 were generated – there were 91,693 passwords in this dataset (let's denote it T). We consider various combinations of the proposed ideas:

• original – the reference implementation of the estimator [8];
• interpolation – interpolated rank calculation within the interval between two adjacent probabilities (Idea 1);
• sampling – improved sampling with n unique probabilities (Idea 2);
• all – a combination of interpolation and sampling.

Let rr(α) denote the real rank of password α ∈ T, and let er(α) denote the rank estimated by a particular variant of the estimator. The weighted error of the estimator on the password set T is calculated as follows:

    Σ_{α∈T} p(α) · |er(α) − rr(α)|.

The weighted error assumes that the estimators are used to assess passwords chosen by humans, following the original distribution. We also consider a simple error for completeness:

    (1/|T|) Σ_{α∈T} |er(α) − rr(α)|.
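The two error measures translate directly into code. The sketch below assumes dictionaries p, rr, and er mapping each password in T to its model probability, real rank, and estimated rank; these names are hypothetical, chosen to match the formulas.

```python
def weighted_error(T, p, rr, er):
    """Sum over T of p(alpha) * |er(alpha) - rr(alpha)|."""
    return sum(p[a] * abs(er[a] - rr[a]) for a in T)


def simple_error(T, rr, er):
    """Mean absolute difference between estimated and real ranks over T."""
    return sum(abs(er[a] - rr[a]) for a in T) / len(T)
```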
Table 3 shows the results of our experiment. We performed 100 experiments. We have to warn the reader – the reported errors are sensitive to the particular password distribution sampled into Θ. Unsurprisingly, the sampling (Idea 2) helps to reduce estimation errors in general. The situation with interpolation (Idea 1) is mixed, with a substantial fraction of experiments showing worse statistics. The reason is that the interpolation makes the error worse when passwords in Θ already "overshoot" their true ranks; taking the same rank without interpolation compensates for this. Therefore, interpolation cannot be recommended for improving the precision of the estimator. On the other hand, it helps with the "same rank" problem, when different passwords are assigned the same rank by the estimator.

variant          weighted error   simple error
original         16.54            101.11
interpolation    15.33            90.63
sampling         11.79            70.63
all              10.86            63.10

Table 3: Weighted and simple errors of various estimator variants. Every number is an average of 100 experiments.

Figure 3 compares the original variant with the "all" variant visually. It illustrates the simple difference between the calculated rank and the estimated rank, as well as the relative error of the estimators. As expected, based on the convergence proof in [3], the relative error is rather small in both cases.

Figure 3: The simple difference error (on the left) and the relative error (on the right) of the original and the "all" strategy. Both graphs display results for the 50,000 most probable passwords from the PCFG model. The numbers are the average values from 100 experiments.

4.2. Estimation speed

We tested two configurations: the first one with 100 intervals (bins), and the second one with 1,000 intervals. Negative log probabilities are divided into the fixed intervals [0, 1), [1, 2), ..., [99, ∞) in the first case, and into [0, 0.1), [0.1, 0.2), ..., [99.9, ∞) in the second case. Both configurations were tested with four different sizes of Θ. Table 4 shows the relative speed of the different variants with respect to the baseline, which is the original algorithm with |Θ| = 10,000. The results confirm a moderate speed-up for 100 intervals and a substantial speed-up for 1,000 intervals.

               relative estimation time
sample size    original   100 bins   1000 bins
10,000         1.00       0.92       0.37
30,000         1.08       1.00       0.39
50,000         1.13       1.04       0.40
100,000        1.20       1.10       0.40

Table 4: Average relative estimation time, where the baseline of 1.00 is the estimation time of the original binary search for the sample size 10,000. The experiment uses 10^6 passwords randomly generated by the PCFG model. Every number is an average of 10 experiments, rounded to two decimal places.

5. Additional observation and conclusion

Since passwords in Θ are generated according to their probability, with a sufficiently large sample size we expect that, for some k, the top-k most probable passwords will be in the correct order at the beginning of Θ. Therefore, simply reporting the order of these top-k passwords by the estimator can be beneficial to the precision. Figure 4 illustrates this phenomenon for the PCFG model and the sample size of 10,000, where approximately the top 180 passwords have exactly the same rank as their position in Θ. However, further down, the precision quickly deteriorates.

Figure 4: Comparison of the original password rank estimate and the estimate according to the position in the sampled passwords (denoted as fixed); absolute error versus password rank.

An interesting question is whether we can improve the estimator's precision by compensating for unusually large or small jumps (differences) between adjacent probabilities in the sampled passwords.

An area outside this paper that deserves further focus is the precision of the estimator for low-probability passwords. The estimator's precision worsens for passwords with high ranks. A potential approach might use different or additional sampling methods that focus on less probable passwords, so that we can cover this part of the probability space better.

References

[1] D. L. Wheeler, zxcvbn: Low-budget password strength estimation, in: Proceedings of the 25th USENIX Conference on Security Symposium, SEC'16, USENIX Association, 2016, pp. 157–173. URL: https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/wheeler.
[2] PCFG cracker, 2024. URL: https://github.com/lakiw/pcfg_cracker.
[3] M. Dell'Amico, M. Filippone, Monte Carlo strength evaluation: Fast and reliable password checking, in: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, CCS '15, Association for Computing Machinery, 2015, pp. 158–169. URL: https://doi.org/10.1145/2810103.2813631.
[4] A. Narayanan, V. Shmatikov, Fast dictionary attacks on passwords using time-space tradeoff, in: Proceedings of the 12th ACM Conference on Computer and Communications Security, CCS '05, Association for Computing Machinery, 2005, pp. 364–372. URL: https://doi.org/10.1145/1102120.1102168.
[5] M. Weir, S. Aggarwal, B. de Medeiros, B. Glodek, Password cracking using probabilistic context-free grammars, in: 30th IEEE Symposium on Security and Privacy, 2009, pp. 391–405. URL: https://doi.org/10.1109/SP.2009.8.
[6] J. Ma, W. Yang, M. Luo, N. Li, A study of probabilistic password models, in: IEEE Symposium on Security and Privacy, 2014, pp. 689–704.
[7] D. Wang, Y. Zou, Z. Zhang, K. Xiu, Password guessing using random forest, in: 32nd USENIX Security Symposium (USENIX Security 23), USENIX Association, 2023, pp. 965–982. URL: https://www.usenix.org/conference/usenixsecurity23/presentation/wang-ding-password-guessing.
[8] M. Dell'Amico, montecarlopwd: Monte Carlo password checking, 2016. URL: https://github.com/matteodellamico/montecarlopwd.