<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Detection Forgot About Document OCR</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krzysztof Olejniczak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Milan Šulc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>26th Computer Vision Winter Workshop</institution>
          ,
          <addr-line>Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rossum.ai</institution>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The work was done when Krzysztof Olejniczak was an intern at Rossum</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Oxford</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn id="fn1">
          <p>The work was done when Krzysztof Olejniczak was an intern at Rossum.</p>
        </fn>
      </author-notes>
      <conference>
        <conf-name>26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.)</conf-name>
        <conf-loc>Krems, Lower Austria, Austria</conf-loc>
        <conf-date>Feb. 15-17, 2023</conf-date>
      </conference>
      <abstract>
        <p>Detection and recognition of text from scans and other images, commonly denoted as Optical Character Recognition (OCR), is a widely used form of automated document processing with a number of methods available. Yet OCR systems still do not achieve 100% accuracy, requiring human corrections in applications where correct readout is essential. Advances in machine learning enabled even more challenging scenarios of text detection and recognition "in-the-wild" - such as detecting text on objects from photographs of complex scenes. While the state-of-the-art methods for in-the-wild text recognition are typically evaluated on complex scenes, their performance in the domain of documents is typically not published, and a comprehensive comparison with methods for document OCR is missing. This paper compares several methods designed for in-the-wild text recognition and for document text recognition, and provides their evaluation on the domain of structured documents. The results suggest that state-of-the-art methods originally proposed for in-the-wild text detection also achieve competitive results on document text detection, outperforming available OCR methods. We argue that the application of document OCR should not be omitted in evaluation of text detection and recognition methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Detection</kwd>
        <kwd>Text Recognition</kwd>
        <kwd>OCR</kwd>
        <kwd>Optical Character Recognition</kwd>
        <kwd>Text In The Wild</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Optical Character Recognition (OCR) is a classic problem
in machine learning and computer vision with standard
methods [1, 2] and surveys [
        <xref ref-type="bibr" rid="ref10">3, 4, 5, 6</xref>
        ] available. Recent
advances in machine learning and its applications, such as
autonomous driving, scene understanding or large-scale
image retrieval, shifted the attention of Text
Recognition research towards the more challenging in-the-wild
text scenarios, with arbitrarily shaped and oriented
instances of text appearing in complex scenes. Spotting
text in-the-wild poses challenges such as extreme aspect
ratios, curved or otherwise irregular text, complex
backgrounds and clutter in the scenes. Recent methods [
        <xref ref-type="bibr" rid="ref14">7, 8</xref>
        ]
achieve impressive results on challenging text in-the-wild
datasets like TotalText [9] or CTW-1500 [
        <xref ref-type="bibr" rid="ref56">10</xref>
        ], with F1
reaching 90% and 87%, respectively. Although automated
document processing remains one of the major
applications of OCR, to the best of our knowledge, the results of
in-the-wild text detection models were never
comprehensively evaluated on the domain of documents and
compared with methods developed for document OCR. This
paper reviews several recent Text Detection methods
developed for the in-the-wild scenario [
        <xref ref-type="bibr" rid="ref14">11, 12, 13, 7, 14, 8</xref>
        ],
evaluates their performance (out of the box and
fine-tuned) on benchmark document datasets [15, 16, 17], and
compares their scores against popular engines for end-to-end document OCR: Tesseract [2], Google Document AI [19] and AWS Textract [18].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Document OCR</title>
        <sec id="sec-2-1-1">
          <title>OCR engines designed for the "standard" application do</title>
          <p>main of documents range from open-source projects such
as TesseractOCR [2] and PP-OCR [1] to commercial
services, including AWS Textract [18] or Google Document
AI [19]. Despite Document OCR being a classic problem
with many practical applications, studied for decades
[22, 23], it still cannot be considered ’solved’ – even the
best engines struggle to achieve perfect accuracy. The
methodology behind the commercial cloud services is
typically not disclosed. The most popular1 open-source
OCR engine at the time of publication, Tesseract [2] (v4
and v5), uses a Long Short-Term Memory (LSTM) neural
network as the default recognition engine.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. In-the-wild Text Detection</title>
        <sec id="sec-2-2-1">
          <title>2.2.1. Regression-based Methods</title>
          <p>Regression-based methods follow the object classification approach, reduced to a single-class problem. TextBoxes [25] and TextBoxes++ [26] locate text instances of various lengths by using sets of anchors with different aspect ratios. Various regression-based methods utilize an iterative refinement strategy, iteratively enhancing the quality of the detected boundaries. LOMO [27] uses an Iterative Refinement Module, which in every step regresses the coordinates of each corner of the predicted boundary with an attention mechanism. PCR [28] proposes a top-down approach, starting with predictions of the centres and sizes of text instances and iteratively improving the bounding boxes using its Contour Localisation Mechanism. TextBPN++ [<xref ref-type="bibr" rid="ref8">8</xref>] introduces an Iterative Boundary Deformation Module, utilizing a Transformer encoder with multi-head attention [29] and a multi-layer perceptron decoder to iteratively adjust the vertices of detected instances. Instead of considering the vertices of the bounding boxes, DCLNet [12] predicts quadrilateral boundaries by locating the four lines restricting the corresponding area, representing them in a polar coordinate system. To address the problem of arbitrarily-shaped text detection and accurately model the boundaries of irregular text regions, more sophisticated bounding box representations have been developed. ABCNet [30] adapts cubic Bezier curves to parametrize curved text instances, gaining the ability to fit non-polygonal shapes. FCENet [31] proposes the Fourier Contour Embedding method, predicting the Fourier signature vectors corresponding to the representation of the boundary in the Fourier domain and using them to generate the shape of the instance with the Inverse Fourier Transformation.</p>
instances. Instead of considering vertices of the bound- embedding vectors, for each instance locates the central
ing boxes, DCLNet [12] predicts quadrilateral boundaries pixel and retrieves the whole shapes by measuring the
by locating four lines restricting the corresponding area, similarities in embedding vectors with scalar product.
representing them in polar coordinates system. To ad- Vast majority of segmentation-based methods generate
dress the problem of arbitrary-shaped text detection and probability maps, representing how likely pixels are to be
accurately model the boundaries of irregular text regions, contained in some text region, and using certain
binarizamore sophisticated bounding boxes representation ideas tion mechanism (e.g. by applying thresholding) convert
have been developed. ABCNet [30] adapts cubic Bezier them into binary pixel maps. However, the thresholds
curves to parametrize curved text instances, gaining the are often determined empirically, and uncareful choice
possibility of fitting non-polygon shapes. FCENet [ 31] of them may lead to drastic decrease in performance. To
proposes Fourier Contour Embedding method, predict- solve this problem, DBNet [13] proposes a Diferentiable
ing the Fourier signature vectors corresponding to the Binarization Equation, making the step between
probarepresentation of the boundary in Fourier domain, and bility and classification maps end-to-end trainable and
uses them to generate the shape of the instance with therefore letting the network learn how to accurately
Inverse Fourier Transformation. binarise predictions. DBNet++ [7] further improves on
the baseline by extending the backbone network with an
2.2.2. Segmentation-based Methods Adaptive Scale Fusion attention module, enhancing the
upscaling process and obtaining deeper features.
TextSegmentation-based Methods aim to classify each pixel FuseNet [35] generates features on three diferent levels:
as either text or non-text, and generate bounding boxes global-, word- and character-level, and fuses them to gain
relevant context and deeper insight into the image
structure. Instead of detecting words, CRAFT [11] locates text
on character-level, predicting the areas covered by single
letters, and links characters of each instance with respect
to the generated afinity map.
          </p>
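          <p>For reference, the differentiable binarization proposed in DBNet [13] approximates hard thresholding with a steep sigmoid, where P is the probability map, T the learned threshold map, and k an amplifying factor (set to 50 in the paper):</p>
          <disp-formula>
            <tex-math><![CDATA[\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}]]></tex-math>
          </disp-formula>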
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Text Detection</title>
        <sec id="sec-3-1-1">
          <title>To cover a wide range of text detectors, we selected</title>
          <p>methods from Section 2.2 with diferent approaches: for
regression-based methods, we included TextBPN++ as a
vertex-focused algorithm and DCLNet as an edge-focused
approach. From segmentation-based methods, we
selected DBNet and DBNet++ as pure segmentation and
PAN as an approach linking text pixels to corresponding
kernels. Finally, CRAFT was chosen as a character-level
method.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Text Recognition</title>
        <p>For text recognition, we selected the SAR [20] and MASTER [36] models from the MMOCR 0.6.2 Model Zoo [46] and the CRNN [21] model from docTR [45], combined with the detectors in a two-stage manner (see Section 4.2).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Metric</title>
        <p>To measure both detection and end-to-end performance, we used the CLEval [37] metric. Contrary to metrics such as Intersection over Union (IoU), which perceive text at the word level, CLEval measures precision and recall at the character level. As a consequence, it slightly reduces the punishment for splitting or merging problematic instances (e.g. dates), providing a reliable and intuitive comparison of the quality of detection and recognition. Additionally, the Recognition Score evaluated by CLEval, approximately corresponding to the precision of character recognition, informs about the quality of the recognition engine specifically on the detected bounding boxes.</p>
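        <p>As a toy illustration of why character-level scoring is more forgiving to splits and merges (a simplified sketch; the actual CLEval algorithm [37] matches pseudo-character centre points and applies explicit split/merge penalties):</p>
        <preformat>
# Simplified character-level precision/recall; NOT the real CLEval [37],
# which matches pseudo-character centre points with explicit penalties.
def char_scores(gt_chars: set, det_chars: set) -> tuple:
    """Score detections by covered characters instead of whole-word matches."""
    hits = len(gt_chars &amp; det_chars)
    precision = hits / len(det_chars) if det_chars else 0.0
    recall = hits / len(gt_chars) if gt_chars else 0.0
    return precision, recall

# Ground truth: one 7-character word, e.g. "name(s)".
gt = {("w0", i) for i in range(7)}
# A detector that splits it into "name" + "(s)" still covers every character,
# so character-level precision and recall stay at 1.0, while a strict
# word-level IoU match would count the split word as a miss.
det = {("w0", i) for i in range(4)} | {("w0", i) for i in range(4, 7)}
print(char_scores(gt, det))  # (1.0, 1.0): no penalty for the split itself
        </preformat>
      </sec>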
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Training Strategies</title>
        <p>
          DBNet [13], DBNet++ [7] and PAN [14] were fine-tuned
for 100 epochs (600 epochs in the case of FUNSD) with a batch size of 8 and an initial learning rate of 0.0001, decreasing by a factor of 10 at the 60th and 80th epochs (200th
and 400th for FUNSD). Baselines, pre-trained on
SynthText [38] (DBNet, DBNet++) or ImageNet [39] (PAN),
were downloaded from the MMOCR 0.6.2 Model Zoo [40].
DCLNet [12] was fine-tuned from a pre-trained model
[41] on each dataset for 150 epochs with a batch size of 4 and an initial learning rate of 0.001, decaying to 0.0001. For each
dataset, TextBPN++ [
          <xref ref-type="bibr" rid="ref14">8</xref>
          ] was fine-tuned from a pre-trained
model [42] for 50 epochs with a batch size of 4, a learning rate of 0.0001 and data augmentations consisting of
flipping, cropping and rotations. Since no training scripts for CRAFT are publicly available, we used the MLT model from its GitHub repository [43] without fine-tuning during the experiments. All experiments were performed using the Adam optimizer with momentum 0.9, on a single GPU with 11 GB of VRAM (GeForce GTX-1080Ti).
        </p>
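        <p>For illustration, a minimal sketch of the schedule above (assuming a PyTorch-style loop; the detector module and dummy data are hypothetical stand-ins, not the actual MMOCR training code):</p>
        <preformat>
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained detector such as DBNet.
detector = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# Adam with momentum (beta1) 0.9 and initial learning rate 1e-4,
# decayed by a factor of 10 at epochs 60 and 80 (200/400 for FUNSD).
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 80], gamma=0.1)

for epoch in range(100):
    # One dummy step per epoch; a real loop iterates a DataLoader (batch size 8).
    images = torch.randn(8, 3, 640, 640)
    target = torch.zeros(8, 1, 640, 640)
    optimizer.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(detector(images), target)
    loss.backward()
    optimizer.step()
    scheduler.step()
        </preformat>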
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Detection Results</title>
        <sec id="sec-4-2-1">
          <title>The ultimate goal of text detection, especially in the case</title>
          <p>of document processing, is to recognize the text within
the detected instances. Therefore, to evaluate the
suitability of popular in-the-wild detectors for document OCR,
we perform end-to-end measurements with the following
text recognition engines: SAR [20], MASTER [36] and
CRNN [21]. The open-source engines were combined
with the detection methods in a two-stage manner: the
input image was initially processed by a detector, which
returned bounding boxes. Afterwards, the corresponding
cropped instances were passed to recognition models. As
a point of reference, we compare both the detection and
end-to-end recognition results of the selected methods
with predictions of three common engines for end-to-end
document OCR: Tesseract [2], Google Document AI [19]
and AWS Textract [18].</p>
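        <p>For clarity, a minimal sketch of this two-stage combination (the detect and recognize callables are hypothetical stand-ins for the models above, not the actual MMOCR or docTR APIs):</p>
        <preformat>
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), axis-aligned

def two_stage_ocr(image, detect: Callable[..., Sequence[Box]],
                  recognize: Callable[..., str]) -> List[Tuple[Box, str]]:
    """Detect text instances, crop each box, and pass the crop to recognition."""
    results = []
    for (x0, y0, x1, y1) in detect(image):
        crop = image[y0:y1, x0:x1]  # numpy-style (H, W, C) slicing
        results.append(((x0, y0, x1, y1), recognize(crop)))
    return results
        </preformat>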
        <p>Results of the text detection methods selected in Section 3.1 on the datasets from Table 1 are presented in Table 2.</p>
        <p>On the FUNSD dataset, DBNet++ achieves both the highest detection recall (97.40%) and F1-score (97.42%). The highest precision, 97.84%, was scored by CRAFT. PAN performed the weakest of all the considered in-the-wild algorithms, scoring just an 81.44% F1-score. Despite having achieved better results on FUNSD, segmentation-based approaches were outperformed by regression-based methods on CORD and XFUND. TextBPN++ proved to be the best performing algorithm on CORD in terms of recall and F1-score, scoring 99.74% and 99.19%, respectively. DCLNet, for which the best precision on CORD (98.67%) was recorded, achieved superior results on XFUND, outperforming the remaining methods with respect to all three measures: precision (98.22%), recall (98.17%) and F1-score (98.20%). Of the considered popular engines for end-to-end document OCR, AWS Textract presented the best performance on the domain of scans of structured documents – FUNSD and XFUND – scoring 96.69% and 92.65% F1-score, respectively. Google Document AI generalized remarkably better to the distorted photos of receipts from the CORD dataset, achieving a 93.30% F1-score and surpassing the scores of AWS Textract and Tesseract.</p>
        <sec id="sec-4-2-3">
          <title>OCR engines on the domain of structured documents in terms of the CLEval detection metric. However, the results for the predictions of pre-trained detectors may not</title>
          <p>be fully representative due to diferences in splitting rules. Recognition Score for AWS Textract reached almost 96%,
E.g. Document AI creates separate instances for special surpassing CRNN’s scores by c.a. 2%. This suggests that
symbols, e.g. brackets, leading to undesired splitting the recognition engine used in AWS Textract,
performof words like "name(s)" into several fragments, lower- ing much more accurately on FUNSD than the CRNN
ing precision and recall. On all experimented datasets, model, may have been a crucial reason for the good
all fine-tuned in-the-wild text detection models reached results. When evaluated on CORD, models with
Difhigh prediction scores, proving themselves capable of ferentiable Binarization scored the highest marks in all
handling text in structured documents. Qualitative anal- end-to-end measures: recall (DBNet++), precision and
ysis of detectors’ predictions revealed that the major F1-score (DBNet); significantly surpassing the remaining
sources of error were incorrect splitting of long text frag- methods. Interestingly, despite obtaining the best recall
ments (e.g e-mail addresses), merging instances in dense rate, DBNet++ did not beat the simpler DBNet in terms
text regions and missing short stand-alone text, such as of end-to-end F1-score. The predictions of
regressionsingle-digit numbers. based approaches, better than segmentation-based ones
when pure detection scores were measured, appeared to
4.3. Recognition Results combine slightly worse with CRNN. TextBPN++,
however, remained competitive, achieving similar results
End-to-end text recognition results combining fine-tuned to DBNet and DBNet++. Recognition Scores of CRNN,
in-the-wild detectors with SAR [20] and MASTER [36] regardless the choice of in-the-wild detector, exceeded
models from MMOCR 0.6.2 Model Zoo [46], and CRNN 93% on FUNSD and 98.5% on CORD, once again
demon[21] from docTR [45] are listed in Table 3. The XFUND strating the suitability of applying these algorithms to
dataset was skipped for this experiment since it contains document text recognition. SAR model, not specifically
Chinese and Japanese characters, for which the recog- trained on documents, presented poorer performance:
nition models were not trained. On FUNSD, the end-to- the highest measured F1-scores on FUNSD and CORD
end measurement outcomes followed the patterns from were 86.36% and 85.25%, respectively, both obtained by
detection: equipped with CRNN as the recognition en- the combination with TextBPN++. Fine-tuned SAR
modgine, DBNet++ proved to be the best tuned model in els achieved slightly higher F1-scores reaching 89.49%
terms of CLEval end-to-end Recall (93.52%) and F1-score on FUNSD (equipped with DBNet++ as the detector) and
(92.23%), losing only to CRAFT in terms of precision. 93.77% on CORD (combined with TextBPN++ detections).
Much higher F1-score (+2%) was measured for AWS Tex- Despite gaining a noticeable advantage over the
basetract, whose end-to-end results outperformed all of the line, fine-tuned SAR models did not surpass the
perforconsidered algorithms. It is important to note that the mance of the pre-trained CRNN. Similarly to SAR, the
pre-trained MASTER model [46] worked the best in com- cess. In particular, fine-tuning models such as DBNet++
bination with TextBPN++, achieving F1 score of 83.00% or TextBPN++ yielded over 96% detection F1-score on
on FUNSD and 93.26% on CORD. FUNSD, over 98% detection F1-score on CORD and over
96% detection F1-score on XFUND, with respect to the
CLEval metric, outperforming Google Document AI and
5. Conclusions AWS Textract. Moreover, combining these detectors with
a publicly-available CRNN recognition model in a
twostage manner consistently achieves over 90% CLEval
end-to-end F1-score, even without explicit fine-tuning
of CRNN. We hope the results will bring more attention
to evaluating future Text Detection methods not only in
the text-in-the-wild scenario, but also on the domain of
documents.</p>
          <p>Text detection research has witnessed great progress in
recent years, thanks to advancements in deep machine
learning. The recently introduced methods widened the
range of possible applications of text detectors, making
them viable for in-the-wild text spotting. This shifted
the attention towards more complex scenarios, including
arbitrarily-shaped text or instances with non-orthogonal
orientations. With automated document processing
remaining one of the most relevant commercial OCR Acknowledgement
applications, we stress the importance of determining
whether the state-of-the-art methods for scene text spot- We acknowledge the help of Bohumír Zámečník, an
exting can also improve document OCR. Our experiments pert on OCR systems, who helped with the supervision
prove that detectors designed for in-the-wild text spot- of Krzysztof’s internship project.
ting can indeed be applied to documents with great
suc</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>in: Proc. AAAI</source>
          ,
          <year>2020</year>
          . [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          , [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          , T. Lu,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          abs/
          <year>2009</year>
          .09941 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/ work, CoRR abs/
          <year>1908</year>
          .05900 (
          <year>2019</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <year>2009</year>
          .09941. arXiv:
          <year>2009</year>
          .09941. //arxiv.org/abs/
          <year>1908</year>
          .05900. arXiv:
          <year>1908</year>
          .
          <volume>05900</volume>
          . [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kay</surname>
          </string-name>
          ,
          <string-name>
            <surname>Tesseract:</surname>
          </string-name>
          <article-title>An open-source optical character</article-title>
          [15]
          <string-name>
            <surname>J.-P. T. Guillaume</surname>
            <given-names>Jaume</given-names>
          </string-name>
          , Hazim Kemal Ekenel,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>recognition engine</article-title>
          ,
          <source>Linux J</source>
          .
          <year>2007</year>
          (
          <year>2007</year>
          )
          <article-title>2</article-title>
          .
          <string-name>
            <surname>Funsd</surname>
          </string-name>
          :
          <article-title>A dataset for form understanding in noisy</article-title>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hamad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mehmet</surname>
          </string-name>
          ,
          <article-title>A detailed analysis of optical scanned documents, in: Accepted to ICDAR-OST,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>character recognition technology</article-title>
          ,
          <year>International 2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>Journal of Applied Mathematics Electronics</source>
          <volume>and</volume>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Surh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Computers</surname>
          </string-name>
          (
          <year>2016</year>
          )
          <fpage>244</fpage>
          -
          <lpage>249</lpage>
          . Cord:
          <article-title>A consolidated receipt dataset for post-</article-title>
          ocr [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hegghammer</surname>
          </string-name>
          ,
          <article-title>Ocr with tesseract, amazon tex- parsing (</article-title>
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>tract</surname>
            , and google document ai: A benchmarking [17]
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
          </string-name>
          , D. Flo-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>experiment</surname>
          </string-name>
          ,
          <year>2021</year>
          . URL: osf.io/preprints/socarxiv/ rencio, C. Zhang,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          , XFUND: A bench-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          6zfvs. doi:
          <volume>10</volume>
          .31235/osf.io/6zfvs.
          <article-title>mark dataset for multilingual visually rich form [5</article-title>
          ]
          <string-name>
            <given-names>N.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Noor</surname>
          </string-name>
          ,
          <article-title>A survey on opti- understanding</article-title>
          , in: Findings of the Asso-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>arXiv:1710.05703</source>
          (
          <year>2017</year>
          ).
          <year>2022</year>
          , Association for Computational Linguistics, [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Memon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Uddin</surname>
          </string-name>
          , Hand- Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>3214</fpage>
          -
          <lpage>3224</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>written optical character recognition (ocr): A com-</article-title>
          //aclanthology.org/
          <year>2022</year>
          .findings-acl.
          <volume>253</volume>
          . doi: 10.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>prehensive systematic literature review (slr)</article-title>
          ,
          <source>IEEE</source>
          <volume>18653</volume>
          /v1/
          <year>2022</year>
          .findings-acl.
          <volume>253</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <issue>Access 8</issue>
          (
          <year>2020</year>
          )
          <fpage>142642</fpage>
          -
          <lpage>142668</lpage>
          . [18]
          <string-name>
            <surname>Amazon</surname>
            , Amazon textract, https://aws.amazon. [7]
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Zou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Wan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Bai</surname>
          </string-name>
          ,
          <article-title>Real-time com/textract, 2022</article-title>
          . Accessed:
          <fpage>2022</fpage>
          -09-25.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>scene text detection with diferentiable binarization [19] Google, Google cloud document ai</article-title>
          , https://cloud.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>and adaptive scale fusion</article-title>
          ,
          <source>IEEE Transactions on google.com/document-ai</source>
          ,
          <year>2022</year>
          . Accessed:
          <fpage>2022</fpage>
          -09-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>Pattern Analysis and Machine Intelligence</source>
          (
          <year>2022</year>
          ).
          <fpage>25</fpage>
          . [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          , Adap- [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          , G. Zhang, Show, attend and
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>text detection</article-title>
          , in: 2021 IEEE/CVF International recognition, CoRR abs/
          <year>1811</year>
          .00751 (
          <year>2018</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>Conference on Computer Vision</source>
          , ICCV 2021, Mon- //arxiv.org/abs/
          <year>1811</year>
          .00751. arXiv:
          <year>1811</year>
          .00751.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>treal</surname>
          </string-name>
          , QC, Canada,
          <source>October 10-17</source>
          ,
          <year>2021</year>
          , IEEE,
          <year>2021</year>
          , [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>An end-to-end trainable</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          pp.
          <fpage>1285</fpage>
          -
          <lpage>1294</lpage>
          .
          <article-title>neural network for image-based sequence recogni</article-title>
          [9]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Ch'ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Chan</surname>
          </string-name>
          , C. Liu,
          <article-title>Total-text: To- tion and its application to scene text recognition,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <article-title>wards orientation robustness in scene text detec</article-title>
          -
          <source>CoRR abs/1507</source>
          .05717 (
          <year>2015</year>
          ). URL: http://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
          </string-name>
          ,
          <source>International Journal on Document Analysis abs/1507</source>
          .05717. arXiv:
          <volume>1507</volume>
          .
          <fpage>05717</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>and Recognition (IJDAR) 23</source>
          (
          <year>2020</year>
          )
          <fpage>31</fpage>
          -
          <lpage>52</lpage>
          . doi:10. [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nishida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yamada</surname>
          </string-name>
          , Optical character
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <volume>1007</volume>
          /s10032-019-00334-z. recognition, John Wiley &amp; Sons, Inc.,
          <year>1999</year>
          . [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Curved [23]
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Schantz</surname>
          </string-name>
          ,
          <article-title>The history of ocr, optical character</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>sequence connection</article-title>
          ,
          <source>Pattern Recognition 90 Technologies Users Association</source>
          (
          <year>1982</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          (
          <year>2019</year>
          )
          <fpage>337</fpage>
          -
          <lpage>345</lpage>
          . URL: https://www.sciencedirect. [24]
          <string-name>
            <surname>S. W.</surname>
          </string-name>
          et al.,
          <article-title>Tesseract open source ocr engine</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>com/science/article/pii/S0031320319300664. (main repository), https://github.com/tesseract-ocr/</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          doi:https://doi.org/10.1016/j.patcog. tesseract,
          <year>2022</year>
          . Accessed:
          <fpage>2022</fpage>
          -10-14.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <year>2019</year>
          .
          <volume>02</volume>
          .002. [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Liu, Textboxes: [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          , D. Han,
          <string-name>
            <surname>S</surname>
          </string-name>
          . Yun,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Character A fast text detector with a single deep neural net-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <article-title>region awareness for text detection</article-title>
          , in: Proceedings work, in: AAAI,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>of the IEEE Conference on Computer Vision</source>
          and [26]
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Minghui Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          , TextBoxes++: A single-
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Pattern</given-names>
            <surname>Recognition</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>9365</fpage>
          -
          <lpage>9374</lpage>
          .
          <article-title>shot oriented scene text detector</article-title>
          , IEEE Transactions [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <source>Disentangled contour learning for on Image Processing</source>
          <volume>27</volume>
          (
          <year>2018</year>
          )
          <fpage>3676</fpage>
          -
          <lpage>3690</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>quadrilateral text detection</article-title>
          , in: Proceedings of the https://doi.org/10.1109/TIP.
          <year>2018</year>
          .
          <volume>2825107</volume>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <source>IEEE/CVF Winter Conference on Applications of 1109/TIP</source>
          .
          <year>2018</year>
          .
          <volume>2825107</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <given-names>Computer</given-names>
            <surname>Vision</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>909</fpage>
          -
          <lpage>918</lpage>
          . [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>En</surname>
          </string-name>
          , J. Han, [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <surname>Real-time</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>Look more than once: An accu-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          abs/
          <year>1904</year>
          .06535 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/ database, in: 2009 IEEE conference on computer
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <year>1904</year>
          .06535. arXiv:
          <year>1904</year>
          .
          <article-title>06535. vision and pattern recognition</article-title>
          , Ieee,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>[</lpage>
          28]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cao</surname>
          </string-name>
          , Progressive
          <volume>255</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <article-title>contour regression for arbitrary-shape scene text</article-title>
          [40]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          detection, in: 2021 IEEE/CVF Conference on Com
          <string-name>
            <surname>- H. Wei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , K. Chen,
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <source>puter Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          , Text detection models - mmocr
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          pp.
          <fpage>7389</fpage>
          -
          <lpage>7398</lpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR46437.
          <year>2021</year>
          .
          <volume>0</volume>
          .
          <issue>6</issue>
          .2 documentation, https://mmocr.readthedocs.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          00731. io/en/latest/textdet_models.html,
          <year>2022</year>
          . Accessed: [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkor- 2022
          <string-name>
            <surname>-</surname>
          </string-name>
          10-14.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <surname>eit</surname>
          </string-name>
          , L.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Kaiser</surname>
            , I. Polo- [41]
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Bi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Pytorch implementation of dclnet "dis-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <source>abs/1706</source>
          .03762 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/ tection", https://github.com/SakuraRiven/DCLNet,
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          1706.03762. arXiv:
          <volume>1706</volume>
          .
          <fpage>03762</fpage>
          .
          <year>2021</year>
          . Accessed:
          <fpage>2022</fpage>
          -10-13. [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          , L. Jin, [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          abs/
          <year>2002</year>
          .10200 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/ TextBPN-Plus-Plus,
          <year>2022</year>
          . Accessed:
          <fpage>2022</fpage>
          -09-29.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          <year>2002</year>
          .10200. arXiv:
          <year>2002</year>
          .
          <volume>10200</volume>
          . [43]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lee</surname>
          </string-name>
          , D. Han,
          <string-name>
            <surname>S</surname>
          </string-name>
          . Yun,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Oficial [31]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jin</surname>
          </string-name>
          , W. Zhang, implementation
          <article-title>of character region awareness for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <article-title>text detection</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2021</year>
          . CRAFT-pytorch,
          <year>2019</year>
          . Accessed:
          <fpage>2022</fpage>
          -10-13. [32]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          , S. Shao, [44]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>nition</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>9336</fpage>
          -
          <lpage>9345</lpage>
          . arXiv preprint arXiv:
          <volume>2108</volume>
          .06543 (
          <year>2021</year>
          ). [33]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lian</surname>
          </string-name>
          , Centripetaltext: An [45]
          <article-title>Mindee, doctr: Document text recognition, https:</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <article-title>eficient text instance representation for scene text //github</article-title>
          .com/mindee/doctr,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          detection, in: Thirty-Fifth Conference on Neural [46]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <source>Information Processing Systems</source>
          ,
          <year>2021</year>
          . H.
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , K. Chen, [34]
            <given-names>S.-X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-B. Hou</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>X.-C.</given-names>
          </string-name>
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Text recognition models - mmocr</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <article-title>Kernel proposal network for arbitrary shape text de- 0.6.2 documentation</article-title>
          , https://mmocr.readthedocs.io/
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <string-name>
            <surname>tection</surname>
          </string-name>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2203.06410. en/latest/textrecog_models.html,
          <year>2021</year>
          . Accessed:
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <source>doi:10.48550/ARXIV.2203.06410</source>
          . 2022-
          <volume>10</volume>
          -14. [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Du</surname>
          </string-name>
          , Textfusenet: Scene
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          <source>Conference on Artificial Intelligence, IJCAI-20</source>
          , In-
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          <source>gence Organization</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>516</fpage>
          -
          <lpage>522</lpage>
          . [36]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , MAS-
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          recognition, CoRR abs/
          <year>1910</year>
          .02562 (
          <year>2019</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          //arxiv.org/abs/
          <year>1910</year>
          .02562. arXiv:
          <year>1910</year>
          .
          <volume>02562</volume>
          . [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shin</surname>
          </string-name>
          , J. Baek,
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          abs/
          <year>2006</year>
          .06244 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          <year>2006</year>
          .06244. arXiv:
          <year>2006</year>
          .
          <volume>06244</volume>
          . [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , Synthetic data
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          <string-name>
            <surname>nition</surname>
          </string-name>
          ,
          <year>2016</year>
          . [39]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Fei-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>