=Paper= {{Paper |id=Vol-1885/93 |storemode=property |title=Breaking CAPTCHAs with Convolutional Neural Networks |pdfUrl=https://ceur-ws.org/Vol-1885/93.pdf |volume=Vol-1885 |authors=Martin Kopp,Matej Nikl,Martin Holeňa |dblpUrl=https://dblp.org/rec/conf/itat/KoppNH17 }} ==Breaking CAPTCHAs with Convolutional Neural Networks== https://ceur-ws.org/Vol-1885/93.pdf
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 93–99
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 M. Kopp, M. Nikl, M. Holeňa



                       Breaking CAPTCHAs with Convolutional Neural Networks

                                                Martin Kopp1,2 , Matěj Nikl1 , and Martin Holeňa1,3
                                     1 Faculty of Information Technology, Czech Technical University in Prague
                                                             Thákurova 9, 160 00 Prague
                                                 2 Cisco Systems, Cognitive Research Team in Prague
                                    3 Institute of Computer Science, Academy of Sciences of the Czech Republic

                                                       Pod Vodárenskou věží 2, 182 07 Prague

Abstract: This paper studies reverse Turing tests to distinguish humans and computers, called CAPTCHA. Contrary to classical Turing tests, in this case the judge is not a human but a computer. The main purpose of such tests is securing user logins against dictionary or brute-force password guessing, avoiding automated usage of various services, preventing bots from spamming on forums, and many others.

Typical approaches to solving text-based CAPTCHAs automatically are based on a scheme-specific pipeline containing hand-designed pre-processing, denoising, segmentation, post-processing and optical character recognition. Only the last part, optical character recognition, is usually based on some machine learning algorithm. We present an approach using neural networks and a simple clustering algorithm that consists of only two steps, character localisation and recognition. We tested our approach on 11 different schemes selected to present very diverse security features. We experimentally show that using convolutional neural networks is superior to multi-layered perceptrons.

Keywords: CAPTCHA, convolutional neural networks, network security, optical character recognition

1 Introduction

The acronym CAPTCHA¹ stands for Completely Automated Public Turing test to tell Computers and Humans Apart, and was coined in 2003 by von Ahn et al. [20]. The fundamental idea is to use hard AI problems easily solved by most humans, but unfeasible for current computer programs. Captcha is widely used to distinguish human users from computer bots and automated scripts. Nowadays, it is an established security mechanism to prevent automated posting on internet forums, voting in online polls, downloading files in large amounts and many other abusive usages of web services.

¹ The acronym captcha will be written in lowercase for better readability.

There are many available captcha schemes, ranging from classical text-based over image-based to many unusual custom-designed solutions, e.g. [3, 4]. Because most of the older schemes have already been proven vulnerable to attacks and thus found unsafe [7, 19], new schemes are being invented. Despite that trend, there are still many places where the classical text-based schemes are used as the main or at least a fallback solution. For example, Google uses text-based schemes when you fail in their newer image-based ones.

This paper is focused on automatic character recognition from multiple text-based CAPTCHA schemes using artificial neural networks (ANNs) and clustering. The ultimate goal is to take a captcha challenge as an input while outputting a transcription of the text presented in the challenge. Contrary to most prior art, our approach is general and can solve multiple schemes without modification of any part of the algorithm.

The experimental part compares the performance of shallow (only one hidden layer) and deep (multiple hidden layers) ANNs and shows the benefits of convolutional neural networks (CNNs) over multi-layered perceptrons (MLPs).

The rest of this paper is organised as follows. Related work is briefly reviewed in the next section. Section 3 surveys the current captcha solutions. Section 4 presents our approach to breaking captcha challenges. The experimental evaluation is summarised in Section 5, followed by the conclusion.

2 Related Work

Most papers about breaking captchas heavily focus on one particular scheme. As an example may serve [11], with preprocessing, text alignment and everything else fitted for the 2011 reCaptcha scheme. To our knowledge, the most general approach was presented in [7]. This approach is based on an effective selection of the best segmentation cuts and presenting them to a k-NN classifier. It was tested on many up-to-date text-based schemes with better results than specialized solutions.

The most recent approaches use neural networks [19]. Their results are not yet as impressive as those of the previous approaches, but the neural-net-based approaches improve very quickly. Our work is based on CNNs, being motivated by their success in pattern recognition, e.g. [6, 14].

The Microsoft researcher Chellapilla, who intensively studied human interaction proofs, stated that, depending on the cost of the attack, automated scripts should not be more successful than 1 in 10,000 attempts, while the human success rate should approach 90% [10]. It is generally considered a too ambitious goal after the publication of [8] showing
the human success rate in completing captcha challenges and [9] showing that random guesses can be successful. Consequently, a captcha is considered compromised when the attacker success rate surpasses 1%.

3 Captcha Schemes Survey

This section surveys the currently available captcha schemes and the challenges they present.

3.1 Text-Based

The first ever use of a captcha was in 1997 by the software company Alta-Vista, which sought a way to prevent automated submissions to their search engine. It was a simple text-based test, which was sufficient for that time, but it was quickly proven ineffective when computer character recognition success rates improved. The most commonly used techniques to prevent automatic recognition can be divided into two groups, called anti-recognition features and anti-segmentation features.

Anti-recognition features, such as different sizes and fonts of characters or rotation, were a straightforward first step towards more sophisticated captcha schemes. All those features are well accepted by humans, as we learn several shapes of letters since childhood, e.g. the handwritten alphabet, small letters, capitals. An effective way of reducing the classifier accuracy is distortion, a technique in which ripples and warp are added to the image. But excessive distortion can make reading very difficult even for humans, and thus the usage of this feature is slowly vanishing, being replaced by anti-segmentation features.

Anti-segmentation features are not designed to complicate single-character recognition; instead, they try to make the automated segmentation of the captcha image unmanageable. The first two features used for this purpose were added noise and a confusing background. But it turned out that both of them are a bigger obstacle for humans than for computers, and therefore they were replaced by occlusion lines; an example can be seen in Figure 1. The most recent anti-segmentation feature is called negative kerning: neighbouring characters are moved so close to each other that they can eventually overlap. It turned out that humans are still able to read the overlapping text with only a small error rate, but for computers it is almost impossible to find the right segmentation.

Figure 1: Older Google reCaptcha with the occlusion line.

3.2 Audio-Based

From the beginning, the adoption of captcha schemes was problematic. Users were annoyed with captchas that were hard to solve and had to try multiple times. The people affected the most were those with visual impairments or various reading disorders such as dyslexia. Soon, an alternative emerged in the form of audio captchas. Instead of displaying images, a voice reading letters and digits is played. In order to remain effective and secure, the captcha has to be resistant to automated sound analysis. For this purpose, various background noises and sound distortions are added. Generally, this scheme is now a standard alternative option on major websites that use captchas.

3.3 Image-Based

Currently, the most prominent design is the image-based captcha. A series of images showing various objects is presented to the user, and the task is to select the images with a topic given by a keyword or by an example image. For example, the user is shown a series of images of various landscapes and is asked to select those with trees, as in Figure 2. This type of captcha has gained huge popularity, especially on touchscreen devices, where tapping the screen is preferable to typing. In the case of Google reCaptcha, there are nine images of which 4–6 are the correct answer. In order to successfully complete the challenge, a user is allowed to have one wrong answer.

Figure 2: Current Google reCaptcha with image recognition challenge.

A relatively new but fast-spreading type of image captcha combines the pattern recognition task presented above with object localisation. Also, the number of squares was increased from 9 to 16.
3.4 Other Types

In parallel with the image-based captchas developed by Google and other big players, many alternative schemes appeared. They are different variations of text-based schemes hidden in a video instead of a distorted image, or some simple logical games or puzzles. As an example of an easy-to-solve logical game we selected noughts and crosses, Figure 3. All of those have recently been dominated by Google's noCaptcha button. It uses browser cookies, user profiles and history to track users' behaviour and distinguish real users from bots.

Figure 3: A noughts and crosses game used as a captcha.

4 Our Approach

Our algorithm has two main stages, localisation and recognition. The localisation can be further divided into heat map generation and clustering. Consequently, our algorithm consists of three steps:

1. Create a heat map using a sliding window with an ANN that classifies whether there is a character in the center or not.

2. Use the k-means algorithm to determine the most probable locations of characters from the heat map.

3. Recognize the characters using another, specifically trained ANN.

4.1 Heatmap Generation

We decided to use the sliding window technique to localize characters within a CAPTCHA image. This approach is well known in the context of object localization [16]. A sliding window is a rectangular region of fixed width and height that slides across an image. Each of those windows serves as an input for a feed-forward ANN with a single output neuron. Its output value is the probability of its input image having a character in the center. Figure 4 shows an example of such a heat map. To enable character localization even at the very edge of an image, one can expand each input image with black pixels.

Figure 4: Example of a heat map for a challenge generated by scheme s16.

4.2 Clustering

When a heat map is complete, all points with a value greater than 0.5 are added to the list of points to be clustered. As this is still work in progress, we simplified the situation by knowing the number of characters within the image in advance and therefore, knowing the correct number of clusters k, we decided to use k-means clustering to determine windows with characters close to their center. But almost an arbitrary clustering algorithm can be used, preferably one that can determine the correct number of clusters.

The k centroids are initialized uniformly from left to right, vertically in the middle, as this provides a good initial estimate. Figure 5 illustrates the whole idea.

Figure 5: Heat map clustering on random character locations. (a) Initial centroids, (b) final centroids.

4.3 Recognition

Assuming that the character localization part worked well, windows containing characters are now ready to be recognized. This task is known to be easy for computers to solve; in fact, they are even better than humans [10].

Again, a feed-forward ANN is used, this time with an output layer consisting of 36 neurons to estimate the probability distribution over classes: numbers 0–9 and uppercase letters A–Z. Finally, a CAPTCHA transcription is created by writing the recognized characters in the ascending order of their x-axis coordinates.
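The three-step pipeline can be sketched compactly. The following is a minimal pure-Python illustration, not the paper's implementation: the heat-map network is stubbed by a `net` callable, images are nested lists, and `heat_map`/`localize` are hypothetical helper names.

```python
def heat_map(image, net, win=32):
    """Slide a win x win window over a black-padded image; `net` returns
    the probability that a character is centred in the window."""
    pad = win // 2
    h, w = len(image), len(image[0])
    # pad with black (zero) pixels so characters at the very edge are found too
    padded = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for y in range(h):
        for x in range(w):
            padded[y + pad][x + pad] = image[y][x]
    return [[net([row[x:x + win] for row in padded[y:y + win]])
             for x in range(w)] for y in range(h)]


def localize(hmap, k, iters=20):
    """k-means over heat-map points above 0.5; centroids start spread
    uniformly left to right, vertically in the middle."""
    h, w = len(hmap), len(hmap[0])
    pts = [(x, y) for y in range(h) for x in range(w) if hmap[y][x] > 0.5]
    cents = [((i + 0.5) * w / k, h / 2) for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in cents]
        for px, py in pts:
            j = min(range(k), key=lambda c: (px - cents[c][0]) ** 2
                                            + (py - cents[c][1]) ** 2)
            groups[j].append((px, py))
        cents = [(sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
                 if g else c for g, c in zip(groups, cents)]
    return sorted(cents)   # left-to-right order = transcription order
```

A trained recognition network would then classify the window around each returned centroid, and the transcription is the sequence of predicted characters in that left-to-right order.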
5 Experimental Evaluation

This section describes the selection of a captcha suite and the generation of the labelled database, followed by a detailed description of the artificial neural networks used in our experiments. The last part of this section presents the results of the experiments.

5.1 Experimental Setup

Training an ANN usually requires a lot of training examples (in the order of millions in the case of a very deep CNN). It is advised to have at least several times as many examples as there are parameters in the network [13]. Manually downloading, cropping and labelling such a high number of examples is infeasible. Therefore, we tested three captcha providers with obtainable source code to be able to generate large enough datasets: Secureimage PHP Captcha [5], captchas.net [2] and BotDetect captcha [1]. We selected the last one as it provides the most variable set of schemes.

BotDetect CAPTCHA is a paid, up-to-date service used by many government institutions and companies all around the world [1]. They offer a free licence with access to obfuscated source codes. We selected 11 very diverse schemes out of the 60 available, see Figure 6 for example images, and generated 100,000 images cropped to one character for each scheme. The cropping is done to 32x32 pixel windows, which is the size of a sliding window. Cropped images are then used for training of the localization as well as the recognition ANN. The testing set consists of 1000 whole captcha images with 5 characters each.

Figure 6: Schemes generated by the BotDetect captcha: (a) Snow (s04), (b) Stitch (s08), (c) Circles (s10), (d) Mass (s14), (e) BlackOverlap (s16), (f) Overlap2 (s18), (g) FingerPrints (s25), (h) ThinWavyLetters (s30), (i) Chalkboard (s31), (j) Spiderweb (s41), (k) MeltingHeat2 (s52).

The schemes display various security features, such as random lines and other objects occluding the characters, jagged or translucent character edges and global warp. The scheme s10 (Circles) stands out with its colour-inverting, randomly placed circles. This property could make it harder to recognize than the others, because the solver needs to account for random parts of characters and their background switching colours.

5.2 Artificial Neural Networks

The perceptron with a single hidden layer (SLP), the perceptron with three hidden layers (MLP) and convolutional neural networks were tested in both localization and recognition. In all ANNs, rectified linear units were used as activation functions.

The first experiment tested the influence of the number of hidden neurons of an SLP. The number of hidden neurons used for the localization network was lns = {15, 30, 60, 90} and the number of neurons for the recognition network was rns = {30, 60, 120, 180, 250}. The results depicted in Figure 7 show the recognition rate for 1000 whole captcha images (all characters have to be correctly recognized) on the scheme s10. The scheme s10 was selected because we consider it the most difficult one.

Figure 7: Comparison of SLP recognition rate on the scheme s10, depending on the number of neurons used by the localization network (lns) and the recognition network (rns). [Bar chart: accuracy [%] for rns = 30, 60, 120, 180, 250 and lns = 15, 30, 60, 90.]

The next experiment was the same, but the MLP with three hidden layers was used instead of the SLP. The results, depicted in Figure 8, suggest that adding more hidden layers improves the accuracy of neither the localization nor the recognition. Therefore, the remaining experiments were done using the SLP, as it can be trained faster.

Both CNN architectures resemble LeNet-5, presented in [17] for handwritten digit recognition. The localization CNN consists of two convolutional layers with six and sixteen 5x5 kernels, each of them followed by the


[Bar chart "Accuracy comparison on the s10 scheme": accuracy [%] for rns = 3x30, 3x60, 3x120, 3x180, 3x250 and lns = 3x15, 3x30, 3x60, 3x90.]
Figure 8: Comparison of MLP recognition rate on the scheme s10, depending on the number of neurons used by the localization network (lns) and the recognition network (rns).

Table 1: Results of the statistical test of Friedman [12] and the correction for simultaneous hypotheses testing by Holm [15] and Shaffer [18]. The rejection thresholds are computed for the family-wise significance level p = 0.05 for a single scheme.
Algorithms              p          Holm     Shaffer
SLP+SLP vs. CNN+CNN     7.257e-7   0.0083   0.0083
SLP+SLP vs. SLP+CNN     1.456e-4   0.01     0.0166
CNN+SLP vs. CNN+CNN     5.242e-4   0.0125   0.0166
CNN+SLP vs. SLP+CNN     0.020      0.0166   0.0166
SLP+SLP vs. CNN+SLP     0.137      0.025    0.025
SLP+CNN vs. CNN+CNN     0.247      0.05     0.05
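The Holm column in Table 1 follows the standard step-down rule: sort the m p-values in ascending order and compare the i-th smallest against 0.05/(m − i + 1), rejecting until the first comparison fails (the table prints truncated thresholds, e.g. 0.0166 for 0.05/3). A quick sketch, with `holm` a hypothetical helper name:

```python
def holm(pvals, alpha=0.05):
    """Holm step-down procedure: per-hypothesis thresholds and rejection
    flags, returned in the original order of `pvals`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    thresholds = [0.0] * m
    reject = [False] * m
    rejecting = True
    for rank, i in enumerate(order):
        thresholds[i] = alpha / (m - rank)
        if rejecting and pvals[i] < thresholds[i]:
            reject[i] = True
        else:
            rejecting = False   # step-down stops at the first failure
    return thresholds, reject


# p-values from Table 1, already in ascending order
pvals = [7.257e-7, 1.456e-4, 5.242e-4, 0.020, 0.137, 0.247]
thr, rej = holm(pvals)
```

Only the first three comparisons fall below their thresholds, so only those differences are significant at the family-wise 0.05 level.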


2x2 max pooling layers, and finally, the last layer of the network is a fully connected output layer. The recognition CNN contains an additional fully connected layer with 120 neurons right before the output layer, as illustrated in Figure 9.

Figure 9: The architecture of a character recognition CNN. [Diagram: input 1@32x32 → 5x5 convolution → feature maps 6@28x28 → 2x2 max-pooling → 6@14x14 → 5x5 convolution → 16@10x10 → 2x2 max-pooling → 16@5x5 → flatten, fully connected → 120 hidden units → fully connected → 36 outputs.]
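The layer sizes above can be verified with simple shape bookkeeping; a sketch assuming 'valid' convolutions and non-overlapping 2x2 max pooling, as in LeNet-5:

```python
def conv_valid(size, kernel=5):
    # a 'valid' convolution shrinks each spatial dimension by kernel - 1
    return size - kernel + 1

def max_pool(size, kernel=2):
    # non-overlapping pooling divides each spatial dimension by kernel
    return size // kernel

size = 32                  # 1@32x32 input window
size = conv_valid(size)    # 6 feature maps @ 28x28
size = max_pool(size)      # 6 @ 14x14
size = conv_valid(size)    # 16 feature maps @ 10x10
size = max_pool(size)      # 16 @ 5x5
flat = 16 * size * size    # 400 values flattened into the 120-neuron
                           # hidden layer, followed by the 36 outputs
```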

5.3 Results
After choosing the right architectures, we proceeded by testing the accuracy of captcha transcription on each scheme separately, where both the training and testing sets were generated by the same scheme. All images in the test set contained 5 characters, and only the successful transcription of all of them was accepted as a correct answer. The results, depicted in Figure 10, show appealing performance of all tested configurations. In most cases it does not matter whether the localization network was an SLP or a CNN, but the CNN clearly outperforms the SLP in the role of the recognition network. This observation is also confirmed by the statistical test of Friedman [12] with corrections for simultaneous hypothesis testing by Holm [15] and Shaffer [18], see Table 1.

[Bar chart "Single Scheme Accuracy": accuracy [%] of SLP+SLP, CNN+SLP, SLP+CNN and CNN+CNN for schemes s04–s52.]
Figure 10: The accuracy of captcha image transcription separately for each scheme.
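Because a transcription counts as correct only when all five characters are right, per-character errors compound in these whole-image numbers. A small illustration under an independence assumption (which the experiments do not measure directly):

```python
def whole_captcha_accuracy(per_char, n_chars=5):
    # probability that all n_chars characters are recognized correctly,
    # assuming independent per-character errors
    return per_char ** n_chars

high = whole_captcha_accuracy(0.99)   # ~0.95 whole-image accuracy
low = whole_captcha_accuracy(0.90)    # ~0.59 whole-image accuracy
```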
        A subsequent experiment tested the accuracy of captcha
     transcription when training and testing sets consist of im-
                                         All Schemes Accuracy                                                Leave-one-out Scheme Accuracy
                    100                                                                     100
                         SLP+SLP                                                                 SLP+SLP
                     90 CNN+SLP                                                              90 CNN+SLP
                     80 SLP+CNN                                                              80 SLP+CNN
                        CNN+CNN                                                                 CNN+CNN
                     70                                                                      70
     accuracy [%]




                                                                             accuracy [%]
                     60                                                                      60
                     50                                                                      50
                     40                                                                      40
                     30                                                                      30
                     20                                                                      20
                     10                                                                      10
                      0                                                                       0
                           s04 s08 s10 s14 s16 s18 s25 s30 s31 s41 s52                             s04 s08 s10 s14 s16 s18 s25 s30 s31 s41 s52
                                             scheme                                                               left-out scheme

                                                                                                                                                         ‘
     Figure 11: The accuracy of captcha image transcription                  Figure 12: The accuracy of captcha image transcription in
     when example images generated by all schemes were                       leave-one-scheme-out scenario.
     available in the training and test sets.

                                                                             perceptron and the recognition CNN is the best. Over-
     ages generated by all schemes. Both training and testing
                                                                             all, the accuracy may seem relatively low, especially for
     set contained examples generated by all schemes. The re-
                                                                             schemes s10, s30, s31 and s41, but lets recall that recog-
     sults are depicted in Figure 11. In this experiment the CNN
                                                                             nition rate of 1% is already considered enough to compro-
     outperformed the SLP not only in the recognition but even
                                                                             mise the scheme. The failure of CNNS on scheme s41 is
     in the localization accuracy. The most visible difference
                                                                             understandable as the spiderweb background confuses the
     is on schemes s08, s18, s41. Overall performance is again
                                                                             convolutional kernels learned on other schemes.
     compared by the statistical test with results summarized
     in Table 2. All accuracies are lower than in the previous                  This is the most important experiment showing the abil-
     experiment, as the data set complexity grown (data were                 ity to solve yet unseen captcha .The ranking of all algo-
     generated by multiple schemes), but the number of train-                rithms is summarized in Table 3 and the statical tests in
     ing examples remained the same.                                         Table 4.

     Table 2: Results of the statistical test of Friedman [12]                                 Table 3: Average Rankings of the algorithms
     and the correction for simultaneous hypotheses testing by
     Holm [15] and Shaffer [18]. The rejection thresholds are                                             Algorithm        Ranking
     computed for the family-wise significance level p = 0.05                                             CNN+CNN           1.27
     for all schemes.                                                                                     SLP+CNN           2.00
                                                                                                          CNN+SLP           3.27
              Algorithms                          p      Holm      Shaffer                                 SLP+SLP          3.45
              SLP+SLP vs. CNN+CNN             1.259e-7   0.0083    0.0083
              CNN+SLP vs. CNN+CNN             2.799e-4    0.01     0.0166
              SLP+SLP vs. SLP+CNN             9.569e-4   0.0125    0.0166
              SLP+CNN vs. CNN+CNN               0.047    0.0166    0.0166    Table 4: Results of the statistical test of Friedman [12]
              SLP+SLP vs. CNN+SLP               0.098     0.025     0.025    and the correction for simultaneous hypotheses testing by
              CNN+SLP vs. SLP+CNN               0.098      0.05      0.05    Holm [15] and Shaffer [18]. The rejection thresholds are
                                                                             computed for the family-wise significance level p = 0.05
        The last experiment tested the accuracy of captcha tran-             for the leave-one-scheme-out scenario.
     scription in leave-one-scheme-out scenario. The training
     set contained images generated by only 10 schemes and                            Algorithms                           p         Holm      Shaffer
     the images used for testing were all generated by the last                       SLP+SLP vs. CNN+CNN              7.386e-5      0.0083    0.0083
     yet unseen scheme. Trying to recognize characters from                           CNN+SLP vs. CNN+CNN              2.799e-4        0.01    0.0166
                                                                                      SLP+SLP vs. SLP+CNN                0.008       0.0125    0.0166
     images generated by an unknown scheme is a challeng-
                                                                                      CNN+SLP vs. SLP+CNN                0.020       0.0166    0.0166
     ing task, furthermore the schemes were selected to differ                        SLP+CNN vs. CNN+CNN                0.186        0.025     0.025
     form each other as much as possible. The results are de-                         SLP+SLP vs. CNN+SLP                0.741        0.05      0.05
     picted in Figure 12. All configurations using a perceptron
     as the recognition classifier fail in all except the most sim-
     ple schemes, e.g. s12 and s16. The combination of two                     The above experiments show that most of current
     CNNs is the best in all cases, with only exception being                schemes can be compromised using two convolutional net-
     the scheme s30, where the combination of the localization               works or a localization perceptron and a recognition CNN.
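The evaluation protocol used throughout these experiments, a challenge counting as solved only when every character is transcribed correctly, and each scheme being held out of training in turn, can be sketched as follows. This is a minimal sketch of the protocol only; `train` and `transcribe` are hypothetical placeholders for the localization-plus-recognition pipeline, not the paper's actual code.

```python
def captcha_accuracy(pairs, transcribe):
    """Strict whole-captcha accuracy in percent: a challenge counts
    as solved only if ALL of its characters match the label."""
    solved = sum(1 for image, label in pairs if transcribe(image) == label)
    return 100.0 * solved / len(pairs)

def leave_one_scheme_out(datasets, train):
    """datasets: dict mapping scheme name -> list of (image, label) pairs.
    train: callable taking training pairs and returning a transcribe
    function. Each scheme is in turn held out of training and used
    only for testing."""
    results = {}
    for held_out in datasets:
        # Train on every scheme except the held-out one.
        train_pairs = [p for scheme, pairs in datasets.items()
                       if scheme != held_out for p in pairs]
        model = train(train_pairs)
        results[held_out] = captcha_accuracy(datasets[held_out], model)
    return results

# Toy usage with a trivial stand-in "model" that uppercases its input.
datasets = {"s04": [("ab1", "AB1"), ("cd2", "CD2")],
            "s08": [("ef3", "EF3")]}
toy_train = lambda pairs: (lambda image: image.upper())
print(leave_one_scheme_out(datasets, toy_train))
```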

6    Conclusion

In this paper, we presented a novel captcha recognition approach, which can fully replace the state-of-the-art scheme-specific pipelines. Our approach not only consists of fewer steps, but it is also more general, as it can be applied to a wide variety of captcha schemes without modification. We were able to compromise 10 out of 11 schemes using two CNNs or a localization perceptron and a recognition CNN without previously seeing any example image generated by that particular scheme. Furthermore, we were able to break all 11 captcha schemes using a CNN for the localization as well as for the recognition, with an accuracy higher than 50%, when we included example images of each character generated by the particular scheme into the training set. Let us recall that a 1% recognition rate is enough for a scheme to be considered compromised.

We experimentally compared the ability of an SLP, an MLP and a CNN to transcribe characters from captcha images. According to our experiments, CNNs perform much better in both localization and recognition.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 17-01251 and student grant SGS17/210/OHK3/3T/18.

References

 [1] BotDetect captcha generator [online], 2017. www.captcha.com [Cited 2017-06-01].
 [2] Free captcha-service [online], 2017. www.captchas.net [Cited 2017-06-01].
 [3] Metal captcha [online], 2017. www.heavygifts.com/metalcaptcha [Cited 2017-06-01].
 [4] Resisty captcha [online], 2017. www.wordpress.org/plugins/resisty [Cited 2017-06-01].
 [5] Securimage PHP captcha [online], 2017. www.phpcaptcha.org [Cited 2017-06-01].
 [6] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In International Conference on Learning Representations, 2015.
 [7] Elie Bursztein, Jonathan Aigrain, Angelika Moscicki, and John C. Mitchell. The end is nigh: Generic solving of text-based captchas. In 8th USENIX Workshop on Offensive Technologies (WOOT 14), 2014.
 [8] Elie Bursztein, Steven Bethard, Celine Fabry, John C. Mitchell, and Dan Jurafsky. How good are humans at solving captchas? A large scale evaluation. In 2010 IEEE Symposium on Security and Privacy, pages 399–413. IEEE, 2010.
 [9] Elie Bursztein, Matthieu Martin, and John Mitchell. Text-based captcha strengths and weaknesses. In Proceedings of the 18th ACM Conference on Computer and Communications Security, pages 125–138. ACM, 2011.
[10] Kumar Chellapilla, Kevin Larson, Patrice Simard, and Mary Czerwinski. Designing human friendly human interaction proofs (HIPs). In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 711–720. ACM, 2005.
[11] Claudia Cruz-Perez, Oleg Starostenko, Fernando Uceda-Ponga, Vicente Alarcon-Aquino, and Leobardo Reyes-Cabrera. Breaking reCAPTCHAs with unpredictable collapse: Heuristic character segmentation and recognition. In Pattern Recognition, pages 155–165. Springer, 2012.
[12] Milton Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701, 1937.
[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[14] Ian Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. In International Conference on Learning Representations, 2014.
[15] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65–70, 1979.
[16] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR 2008, pages 1–8, Los Alamitos, CA, USA, 2008. IEEE Computer Society.
[17] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[18] Juliet Popper Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46(1):561–584, 1995.
[19] F. Stark, C. Hazırbaş, R. Triebel, and D. Cremers. Captcha recognition with active deep learning. In GCPR Workshop on New Challenges in Neural Computation, 2015.
[20] Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John Langford. CAPTCHA: Using hard AI problems for security. In Advances in Cryptology—EUROCRYPT 2003, pages 294–311. Springer, 2003.