=Paper=
{{Paper
|id=Vol-3361/paper4
|storemode=property
|title=NonDisclosureGrid: A Multimodal Privacy-Preserving Document Representation for Automated Document Processing
|pdfUrl=https://ceur-ws.org/Vol-3361/paper4.pdf
|volume=Vol-3361
|authors=Claudio Paonessa
|dblpUrl=https://dblp.org/rec/conf/swisstext/Paonessa22
}}
==NonDisclosureGrid: A Multimodal Privacy-Preserving Document Representation for Automated Document Processing==
Claudio Paonessa
Institute for Data Science, FHNW University of Applied Sciences and Arts Northwestern Switzerland, School of Engineering, Bahnhofstrasse 6, CH-5210 Windisch, Switzerland
claudio.paonessa@fhnw.ch

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland

Abstract

We propose a novel type of document representation that preserves textual, visual, and spatial information without containing any sensitive data. We achieve this by transforming the original visual and textual data into simplified encodings. These pieces of non-sensitive information are combined into a tensor to form the NonDisclosureGrid (NDGrid). We demonstrate its capabilities on information extraction tasks and show that our representation matches the performance of state-of-the-art representations and even outperforms them in specific cases.

1. Introduction

Automated document processing is a pivotal element towards successful digitization for many businesses worldwide. The goal is to transform unstructured or semi-structured information into a structured form for various downstream tasks, hence streamlining administrative procedures in banking, medicine, and many other domains.

Because of the typically sensitive nature of the data used for document processing, the data available to train state-of-the-art systems is limited. Private companies are often obliged to delete documents collected from customers after a certain time period and may not share the data with providers specialized in automated document processing at all. This restriction prevents them from training and continuously improving machine learning models.

2. Related Work

A lot of the current progress in the field of document understanding builds upon the combination of spatial and textual information into a common document representation. This is often achieved using grid-based methods, which preserve the 2D layout of the document and directly embed textual information into the representation. The models can use textual information in an embedded form and still take advantage of the 2D correlations of the document. These methods encode the text in embedding vectors and transpose these vectors into the corresponding pixels of the grid.

The Chargrid [1] encodes text at the character level. A mapping function assigns an integer value to each character (i.e., alphabetic letters, numeric characters, special characters). The location occupied by a character on the grid is given the corresponding integer value. Before being fed into a deep learning model, the Chargrid is one-hot encoded.

BERTgrid [2] is a special case of the Wordgrid [1]. It uses contextualized word embeddings from BERT [3]. Because BERT acts on a word-piece level, the text in the document needs to be tokenized into word pieces first. A line-by-line serialized version of the document can then be fed into a pre-trained BERT language model.

3. NonDisclosureGrid

Based on the assumption that simplified encodings can replace the original information in documents and still retain utility for model training, we define components that transform the original data into non-sensitive informational pieces. Some components are based on textual information, and some represent purely visual parts of the document.

3.1. Textual Features

State-of-the-art grid-based representations embed the text more or less directly into the grid. Because the character-level encoding and the word or subword embeddings can potentially contain sensitive information, we need to develop other approaches to incorporate textual information.

Layout-only is a binary text mask and the simplest component in our novel representation. This layer encodes whether text is present at a given position in the 2D grid, based on the token bounding boxes detected by OCR. This reduces the document to its spatial layout structure. In Kerroumi et al. [4] this is called the Layout Approach and uses three channels, i.e., (1, 1, 1) for foreground and (0, 0, 0) for background. Our one-channel version $L \in \mathbb{R}^{H \times W \times 1}$ forms a grid with height $H$ and width $W$. Let $t_k$ be the $k$-th token on the page and $b_k = (x_k, y_k, h_k, w_k)$ its associated bounding box. Then

$$L_{ij} = \begin{cases} 1, & \text{if } \exists k \text{ such that } (i, j) \in b_k \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

where $(i, j) \in b_k$ means the point $(i, j)$ lies within the bounding box $b_k$, formally $x_k \le i \le x_k + w_k \,\land\, y_k \le j \le y_k + h_k$.
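As an illustration of Eq. (1), here is a minimal sketch of how such a layout mask could be rasterized from OCR token bounding boxes. The function name and the assumption that boxes arrive as integer pixel tuples in the paper's (x, y, h, w) order are ours; this is not code from the paper.

```python
import numpy as np

def layout_mask(boxes, height, width):
    """Rasterize OCR token bounding boxes into a one-channel binary mask (Eq. 1).

    boxes: iterable of (x, y, h, w) tuples in pixel coordinates, one per token
    (the paper's b_k order is assumed). Returns an array of shape
    (height, width, 1) with 1 wherever some token box covers the pixel.
    """
    mask = np.zeros((height, width, 1), dtype=np.uint8)
    for x, y, h, w in boxes:
        # Clip the box to the grid and mark the covered region as foreground.
        x0, x1 = max(0, int(x)), min(width, int(x + w))
        y0, y1 = max(0, int(y)), min(height, int(y + h))
        mask[y0:y1, x0:x1, 0] = 1
    return mask

# Example: two tokens on a small 100x200 page.
grid = layout_mask([(10, 20, 12, 40), (60, 20, 12, 55)], height=100, width=200)
```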
Alphanumeric Categorization is a strongly simplified text encoding. In our approach, we encode a token into a three-dimensional binary vector. These three components are 1 if the token contains at least one alphabetic character, at least one numeric character, and at least one other non-alphanumeric character, respectively. This encoding is summarized in Table 1.

Table 1: Definition of the alphanumeric categorization.

  Category                          Value
  Contains alphabetic (a-z, A-Z)    (1, _, _)
  Contains numeric (0-9)            (_, 1, _)
  Contains non-alphanumeric         (_, _, 1)

The idea behind this approach is that the key information relevant for tasks like information extraction often has consistent underlying character properties. For example, the extraction of monetary values from invoices can be supported if we know which tokens contain numbers; e.g., 65.90 or 23.– would, with our approach, most of the time be encoded as (0, 1, 1) (no alphabetic but both numeric and non-alphanumeric characters).
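A minimal sketch of the categorization from Table 1; the function name and the tuple return format are illustrative assumptions rather than the paper's implementation.

```python
def alphanumeric_categorization(token: str) -> tuple:
    """Encode a token as the 3-dim binary vector of Table 1:
    (contains alphabetic, contains numeric, contains non-alphanumeric)."""
    has_alpha = int(any(c.isalpha() for c in token))
    has_digit = int(any(c.isdigit() for c in token))
    has_other = int(any(not c.isalnum() for c in token))
    return (has_alpha, has_digit, has_other)

# Monetary values typically map to (0, 1, 1), plain words to (1, 0, 0).
assert alphanumeric_categorization("65.90") == (0, 1, 1)
assert alphanumeric_categorization("Invoice") == (1, 0, 0)
```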
Locality-sensitive hashing (LSH) [5, 6] is a family of hashing techniques with a high chance of the hashes of similar inputs being the same. These techniques can be used for data clustering and efficient nearest-neighbor search.

One possible implementation is LSH based on hyperplanes. For this, we randomly sample n hyperplanes in the original input space. For each sample in the original space we determine whether it lies on the left or the right of each hyperplane, resulting in an n-dimensional boolean vector which forms our hash. We thus reduce word embeddings to an n-dimensional binary vector, as every hyperplane randomly splits the embedding space into two categories. The idea behind this hashing is to retain textual information without the possibility of reconstructing the original text in the document.

In our experiments we apply this method to BERT embeddings [3] with n set to 10 and 100, respectively. One could argue that, depending on the number of hyperplanes, this hash could enable the reconstruction of the original text. We do not expect this to be an issue with the number of hyperplanes chosen significantly lower than the original embedding dimensions. Nevertheless, this is still an outstanding matter and needs further investigation.
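A sketch of hyperplane-based LSH as described above. The hyperplanes are drawn here as random normal vectors through the origin, which is one common instantiation; the paper does not specify how the hyperplanes are sampled, so treat this as an assumption.

```python
import numpy as np

def lsh_hash(embedding: np.ndarray, hyperplanes: np.ndarray) -> np.ndarray:
    """Project an embedding onto n hyperplane normals and keep only the side
    of each hyperplane (the sign), yielding an n-dim binary vector."""
    return (embedding @ hyperplanes.T >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
n, dim = 10, 768                         # e.g. 10 hyperplanes over BERT-base embeddings
planes = rng.standard_normal((n, dim))   # one random normal vector per hyperplane

word_embedding = rng.standard_normal(dim)  # stand-in for a BERT word embedding
code = lsh_hash(word_embedding, planes)    # n-dimensional binary hash, e.g. [1, 0, 1, ...]
```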
3.2. Visual Features

Visually-rich documents contain valuable information outside of the detected textual information. Visual elements are incorporated into documents to increase their readability for humans. Hence, downstream tasks in automatic document processing can benefit from these visual features.

Line mask is a method to incorporate line segments into a one-channel binary mask. A line in a document can be part of a rectangular box around textual elements or a dividing line that separates content or tabular structures. To find lines in document scans, we use the line segment detector implementation from OpenCV [7], which follows the algorithm described in Gioi et al. [8]. To prevent disclosing textual information, we only include lines with a length of at least 10% of the document width. With mathematical rounding, the detected lines are incorporated into a binary mask.
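A hedged sketch of how such a line mask could be computed with OpenCV's line segment detector. Note that createLineSegmentDetector is missing from some OpenCV 4.x builds, and the rasterization details (1-pixel line thickness, rounding of endpoints) are assumptions on our part.

```python
import cv2
import numpy as np

def line_mask(gray_page: np.ndarray, min_frac: float = 0.1) -> np.ndarray:
    """Detect line segments with OpenCV's LSD and rasterize those longer than
    min_frac * page width into a one-channel binary mask."""
    height, width = gray_page.shape
    lsd = cv2.createLineSegmentDetector()
    lines, _, _, _ = lsd.detect(gray_page)   # lines: (N, 1, 4) array of x1, y1, x2, y2
    mask = np.zeros((height, width), dtype=np.uint8)
    if lines is None:
        return mask
    for x1, y1, x2, y2 in lines.reshape(-1, 4):
        # Keep only lines spanning at least 10% of the document width.
        if np.hypot(x2 - x1, y2 - y1) >= min_frac * width:
            p1 = (int(round(x1)), int(round(y1)))
            p2 = (int(round(x2)), int(round(y2)))
            cv2.line(mask, p1, p2, color=1, thickness=1)
    return mask
```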
4. Key Information Extraction

The automated extraction of key-value information from document scans such as invoices, receipts, or forms can decrease the manual labor needed for many business workflows. We use this task to compare our novel approach to state-of-the-art representations.

4.1. Datasets

Our work is evaluated on three public datasets covering forms and invoices, the two most common applications for document understanding systems.

FUNSD [9] is an English dataset for form understanding. The dataset contains noisy scanned form pages and consists of 149 training samples and 50 test samples. Each token in the documents is labeled with one of four different classes: Header, Question, Answer, Other.

XFUND [10] is a multilanguage form understanding benchmark with classes matching the FUNSD dataset. The underlying dataset contains human-labeled forms with key-value pairs in 7 languages: Chinese, Japanese, Spanish, French, Italian, German, Portuguese. Because of the different character sets, we do not use the Chinese and Japanese samples from this dataset. We end up with 745 training samples and 250 test samples.

RVL-CDIP Layout [11] is derived from the RVL-CDIP classification dataset [12] and consists of 520 scanned invoice document pages in English. Each token on a page is labeled with one of 6 different classes. We split the dataset into 416 training samples and 104 test samples. We focus on the fields Receiver, Supplier, Invoice info, and Total.

4.2. Model Architecture

We replicate the chargrid-net architecture from Katti et al. [1]. This model is a fully convolutional neural network with an encoder-decoder architecture, using downsampling in the encoder and a reversal of the downsampling based on stride-2 transposed convolutions in the decoder. In contrast to the two parallel decoders in the original model, we only use the semantic segmentation decoder, which concludes in a pixel-level classification over the target classes. Replicating the loss used in Katti et al. [1], we counter the strong class imbalance between background pixels and actual relevant pixels with static aggressive class weighting following Paszke et al. [13].

4.3. Evaluation Measure

To evaluate the model performance for the key information extraction task, we use the Word Accuracy Rate (WAR) [1, 4]. Similar to the Levenshtein distance, it counts the number of substitutions, insertions, and deletions between the ground truth and the prediction. We report the WAR for each given field and overall. The instances are pooled across the entire test set. WAR is defined as follows:

$$\mathrm{WAR} = 1 - \frac{\#[\text{ins.}] + \#[\text{del.}] + \#[\text{sub.}]}{N} \quad (2)$$

where $N$ is the total number of tokens of a specific field in the ground truth instance.
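A small sketch of Eq. (2), assuming the insertion, deletion, and substitution counts have already been obtained from a Levenshtein-style alignment of predicted and ground-truth field tokens; the function name is ours.

```python
def word_accuracy_rate(n_ins: int, n_del: int, n_sub: int, n_ref: int) -> float:
    """Word Accuracy Rate (Eq. 2): 1 minus the total number of edit operations
    divided by the number N of ground-truth tokens for the field. The value can
    be negative when the number of edits exceeds N."""
    return 1.0 - (n_ins + n_del + n_sub) / n_ref

# Example: 3 edit operations against 20 ground-truth tokens -> WAR = 0.85
print(word_accuracy_rate(n_ins=1, n_del=1, n_sub=1, n_ref=20))
```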
5. Experiments and Results

In the following we show quantitative results for our proposed document representation. First we show the impact of single components by carrying out an ablation study, followed by a comparison with Chargrid [1] and BERTgrid [2]. As a baseline we use the 3-channel RGB image as input. We report the average metrics over 5-fold cross-validation.

5.1. Ablation Study

We report the results of the ablation study in Figure 1. We experimented with different combinations of our developed components: Layout-only (LA), Alphanumeric Categorization (AN), Locality-sensitive hashing (LSH), and the Line mask (LI). For the FUNSD and XFUND datasets, we additionally compare LSH components with 10 and 100 hyperplanes, denoted as LSH(10) and LSH(100), respectively.

Figure 1: Ablation study to investigate the impact of the different non-sensitive components in the NDGrid. Experiments to estimate the impact of LSH (5, 6, 7) are only carried out for FUNSD and XFUND.

5.2. Comparison

We show the quantitative comparison in Tables 2, 3, and 4. Our Chargrid implementation distinguishes between 54 different characters and is case-insensitive. Besides the 26 alphabetic characters and the 10 numeric characters, we include 18 additional other characters.

For the BERTgrid, we use the pre-trained BERT base model bert-base-uncased from the Hugging Face transformers library [14]. Before being fed into the tokenizer, the words are ordered by the top and left coordinates of their corresponding bounding boxes.

Figure 2: Comparison of model performances between the different document representations.

Table 2: Comparison of WAR metrics on FUNSD. Fields: Header (HED.), Question (QST.), Answer (ANS.).

             All      HED.     QST.     ANS.
  Image      47.1%   -32.2%    33.3%    26.7%
  Chargrid   52.3%    10.8%    38.5%    31.5%
  BERTgrid   56.5%   -13.3%    43.3%    41.9%
  NDGrid     58.3%   -15.0%    43.0%    47.1%

Table 3: Comparison of WAR metrics on XFUND. Fields: Header (HED.), Question (QST.), Answer (ANS.).

             All      HED.     QST.     ANS.
  Image      52.7%    -6.8%    20.2%    14.2%
  Chargrid   61.3%    20.8%    33.1%    30.6%
  BERTgrid   64.7%    21.7%    36.3%    39.3%
  NDGrid     64.5%     6.2%    36.1%    40.0%

Table 4: Comparison of WAR metrics on RVL-CDIP Layout. Fields: Receiver (REC.), Supplier (SUP.), Invoice info (INF.), Total (TOT.).

             All      REC.     SUP.     INF.     TOT.
  Image      67.7%    38.5%    19.4%     3.2%    -8.2%
  Chargrid   69.7%    41.1%    31.0%     4.0%    -5.3%
  BERTgrid   73.2%    43.8%    41.8%    20.9%     1.7%
  NDGrid     72.9%    48.8%    39.3%    14.9%    -6.7%

6. Discussion

In the ablation study, we show how we can increase the model performance by combining our non-sensitive components. The impacts of the different components are not consistent across the analyzed datasets. Different dataset sizes seem to be influential when it comes to the impact of the components. When using the Line mask (LI) in combination with Layout-only (LA) and the Alphanumeric Categorization (AN), the model performance increases significantly. The Line mask (LI) without the Alphanumeric Categorization seems to be less effective or, in the case of the RVL-CDIP Layout dataset, even worse than Layout-only (LA) by itself. Combining all components, including LSH with 100 hyperplanes, yields the best model performance on all three datasets. The LSH component with 100 hyperplanes performs worse when not combined with the other components.

Except for the header fields in the FUNSD dataset, the BERTgrid outperforms the image and the Chargrid on all key fields. Our approach often matches and, on specific fields, even outperforms the BERTgrid performance. The results support the hypothesis that a generalized and fundamentally simplified representation still contains enough information to be used in automated document processing.

7. Conclusion

NonDisclosureGrid is a privacy-preserving document representation including textual, visual, and spatial information. Reducing multimodal information into simplified and non-sensitive encodings is very effective for key information extraction. Our NDGrid produces matching or even better results than models trained with state-of-the-art representations. The performance on other document understanding tasks has yet to be shown, and this work is only a first step towards a versatile privacy-preserving document representation. There is still room for more information-inducing components.

References

[1] A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, J. B. Faddoul, Chargrid: Towards understanding 2D documents, CoRR abs/1809.08799 (2018). URL: http://arxiv.org/abs/1809.08799. arXiv:1809.08799.
[2] T. I. Denk, C. Reisswig, BERTgrid: Contextualized embedding for 2D document representation and understanding, CoRR abs/1909.04948 (2019). URL: http://arxiv.org/abs/1909.04948. arXiv:1909.04948.
[3] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[4] M. Kerroumi, O. Sayem, A. Shabou, VisualWordGrid: Information extraction from scanned documents using a multimodal approach, CoRR abs/2010.02358 (2020). URL: https://arxiv.org/abs/2010.02358. arXiv:2010.02358.
[5] P. Indyk, R. Motwani, Approximate nearest neighbors: Towards removing the curse of dimensionality, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, Association for Computing Machinery, New York, NY, USA, 1998, pp. 604–613. URL: https://doi.org/10.1145/276698.276876. doi:10.1145/276698.276876.
[6] A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proceedings of the 25th International Conference on Very Large Data Bases, VLDB '99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999, pp. 518–529.
[7] G. Bradski, The OpenCV Library, Dr. Dobb's Journal of Software Tools (2000).
[8] R. Gioi, J. Jakubowicz, J.-M. Morel, G. Randall, LSD: A line segment detector, Image Processing On Line 2 (2012) 35–55. doi:10.5201/ipol.2012.gjmr-lsd.
[9] G. Jaume, H. K. Ekenel, J.-P. Thiran, FUNSD: A dataset for form understanding in noisy scanned documents, in: Accepted to ICDAR-OST, 2019.
[10] Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, F. Wei, LayoutXLM: Multimodal pre-training for multilingual visually-rich document understanding, 2021. arXiv:2104.08836.
[11] P. Riba, A. Dutta, L. Goldmann, A. Fornés, O. Ramos, J. Lladós, Table detection in invoice documents by graph neural networks, in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 122–127. doi:10.1109/ICDAR.2019.00028.
[12] A. W. Harley, A. Ufkes, K. G. Derpanis, Evaluation of deep convolutional nets for document image classification and retrieval, in: International Conference on Document Analysis and Recognition, 2015.
[13] A. Paszke, A. Chaurasia, S. Kim, E. Culurciello, ENet: A deep neural network architecture for real-time semantic segmentation, 2016. arXiv:1606.02147.
[14] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.