NonDisclosureGrid: A Multimodal Privacy-Preserving
Document Representation for Automated Document
Processing
Claudio Paonessa1
1 Institute for Data Science, FHNW University of Applied Sciences and Arts Northwestern Switzerland, School of Engineering, Bahnhofstrasse 6, CH-5210 Windisch, Switzerland


Abstract
We propose a novel type of document representation that preserves textual, visual, and spatial information without containing any sensitive data. We achieve this by transforming the original visual and textual data into simplified encodings. These pieces of non-sensitive information are combined into a tensor to form the NonDisclosureGrid (NDGrid). We demonstrate its capabilities on information extraction tasks and show that our representation matches the performance of state-of-the-art representations and even outperforms them in specific cases.


1. Introduction

Automated document processing is a pivotal element towards successful digitization for many businesses worldwide. The goal is to transform unstructured or semi-structured information into a structured form for various downstream tasks, hence streamlining administrative procedures in banking, medicine, and many other domains.

Because of the typically sensitive nature of the data used for document processing, the data available to train state-of-the-art systems is limited. Private companies are often obliged to delete documents collected from customers after a certain time period and may not share the data with providers specialized in automated document processing at all. This restriction prevents them from training and continuously improving machine learning models.

2. Related Work

A lot of the current progress in the field of document understanding builds upon the combination of spatial and textual information into a common document representation. This is often achieved using grid-based methods, which preserve the 2D layout of the document and directly embed textual information into the representation. The models can use textual information in an embedded form and still take advantage of the 2D correlations of the document. These methods encode the text in embedding vectors and transpose these vectors into corresponding pixels of the grid.

The Chargrid [1] encodes text on character level. A mapping function assigns an integer value to each character (i.e., alphabetic letters, numeric characters, special characters). The location occupied by a character on the grid will have the corresponding integer value. Before being fed into a deep learning model, the Chargrid is one-hot encoded.

BERTgrid [2] is a special case of the Wordgrid [1]. It uses contextualized word embeddings from BERT [3]. Because BERT acts on a word-piece level, the text in the document needs to be tokenized into word pieces first. A line-by-line serialized version of the document can then be fed into a pre-trained BERT language model.

3. NonDisclosureGrid

Based on the assumption that simplified encodings can replace the original information in documents and still retain utility for model training, we define components to transform the original data into non-sensitive informational pieces. Some components are based on textual information, and some represent purely visual parts of the document.

3.1. Textual Features

State-of-the-art grid-based representations embed the text more or less directly into the grid. Because the character-level encoding and the word or subword embeddings can potentially contain sensitive information, we need to develop other approaches to incorporate textual information.

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
claudio.paonessa@fhnw.ch (C. Paonessa)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Layout-only is a binary text mask and the simplest component in our novel representation. This layer contains the information whether text is present at a given position in the 2D grid, based on the token bounding boxes detected by OCR. This reduces the document to its spatial layout structure. In Kerroumi et al. [4] this is called the Layout Approach and uses three channels, i.e., (1, 1, 1) for foreground and (0, 0, 0) for background. Our one-channel version L ∈ ℕ^(H×W×1) forms a grid with height H and width W. Let t_k be the k-th token on the page and b_k = (x_k, y_k, h_k, w_k) the associated bounding box of the k-th token:

    L_ij = { 1, if ∃k such that (i, j) ≺ b_k
           { 0, otherwise                          (1)

where ≺ means the point (i, j) lies within the bounding box b_k (formally: (i, j) ≺ b_k ⟺ x_k ≤ i ≤ x_k + w_k ∧ y_k ≤ j ≤ y_k + h_k).
Alphanumeric Categorization is a strongly simplified text encoding. In our approach, we encode a token into a three-dimensional binary vector. These three components are 1 if the token contains at least one alphabetic character, at least one numeric character, and at least one non-alphanumeric character, respectively. This encoding is summarized in Table 1.

    Category                          Value
    Contains alphabetic (a–z, A–Z)    (1, _, _)
    Contains numeric (0–9)            (_, 1, _)
    Contains non-alphanumeric         (_, _, 1)

Table 1: Definition of the alphanumeric categorization.

The idea behind this approach is that the key information relevant for tasks like information extraction often has consistent underlying character properties. For example, the extraction of monetary values from invoices can be supported if we know which tokens contain numbers; e.g., 65.90 or 23.– would, with our approach, most of the time be encoded with (0, 1, 1) (no alphabetic but both numeric and non-alphanumeric characters).
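The categorization of Table 1 reduces to three membership tests per token. A minimal sketch (note that Python's `str.isalpha`/`str.isdigit` also match non-ASCII letters and digits, so a strict a–z/0–9 check would need explicit character ranges):

```python
def alphanumeric_category(token):
    """Three-bit encoding of Table 1:
    (contains alphabetic, contains numeric, contains non-alphanumeric)."""
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    has_other = any(not c.isalnum() for c in token)
    return (int(has_alpha), int(has_digit), int(has_other))
```

A monetary token such as "65.90" thus maps to (0, 1, 1), while "Invoice" maps to (1, 0, 0).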
Locality-sensitive hashing (LSH) [5, 6] is a family of hashing techniques with a high chance of the hashes of similar inputs being the same. These techniques can be used for data clustering and efficient nearest neighbor search.

One possible implementation is LSH based on hyperplanes. For this, we randomly sample n hyperplanes in the original input space. For each sample in the original space we determine whether it lies on the left or right of each hyperplane, resulting in an n-dimensional boolean vector which forms our hash. We thus reduce word embeddings to an n-dimensional binary vector, as every hyperplane randomly splits the embedding space into two categories. The idea behind this hashing is to retain textual information without the possibility of reconstructing the original text in the document.

In our experiments we apply this method to BERT embeddings [3] with n set to 10 and 100, respectively.

One could argue that, depending on the number of hyperplanes, this hash could enable the reconstruction of the original text. We do not expect this to be an issue with the number of hyperplanes chosen significantly lower than the original embedding dimensions. Nevertheless, this is still an outstanding matter and needs further investigation.
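Hyperplane-based LSH as described above can be sketched as follows. Gaussian sampling of the hyperplane normals and a sign test on the dot product are standard choices; the dimensions and seed are illustrative (a real setup would hash BERT embeddings with n = 10 or n = 100 hyperplanes):

```python
import random

def sample_hyperplanes(n, dim, seed=0):
    """Sample n random hyperplane normals in the original embedding space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

def lsh_hash(embedding, hyperplanes):
    """n-dimensional binary hash: bit k is 1 iff the embedding lies on the
    positive side of the k-th hyperplane (positive dot product)."""
    return tuple(
        int(sum(e * h for e, h in zip(embedding, plane)) > 0)
        for plane in hyperplanes
    )
```

Embeddings pointing in similar directions fall on the same side of most hyperplanes and therefore tend to share hash bits, while the original vector cannot be recovered from the n sign bits alone.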
3.2. Visual Features

Visually-rich documents contain valuable information outside of the detected textual information. Visual elements are incorporated into documents to increase their readability for humans. Hence, downstream tasks in automatic document processing can benefit from these visual features.

Line mask is a method to incorporate line segments into a one-channel binary mask. A line in a document can be part of a rectangular box around textual elements or a dividing line to separate content or tabular structures. To find lines in document scans, we use the line segment detector implementation from OpenCV [7], which follows the algorithm described in Gioi et al. [8]. To prevent disclosing textual information, we only include lines with a length of at least 10% of the document width. With mathematical rounding, the detected lines are incorporated into a binary mask.

4. Key Information Extraction

The automated extraction of key-value information from document scans such as invoices, receipts, or forms can decrease the manual labor needed for many business workflows. We use this task to compare our novel approach to state-of-the-art representations.

4.1. Datasets

Our work is evaluated on three public datasets covering forms and invoices, the two most common applications for document understanding systems.

FUNSD [9] is an English dataset for form understanding. The dataset contains noisy scanned form pages and consists of 149 training samples and 50 test samples. Each token in the documents is labeled with one of four different classes: Header, Question, Answer, Other.

XFUND [10] is a multilanguage form understanding benchmark with classes matching the FUNSD dataset. The underlying dataset contains human-labeled forms with key-value pairs in 7 languages: Chinese, Japanese, Spanish, French, Italian, German, Portuguese. Because of the different character sets we do not use the Chinese and Japanese samples from this dataset. We end up with 745 training samples and 250 test samples.

RVL-CDIP Layout [11] is derived from the RVL-CDIP classification dataset [12] and consists of 520 scanned invoice document pages in English. Each token on a page is labeled with one of 6 different classes. We split the dataset into 416 training samples and 104 test samples. We focus on the fields Receiver, Supplier, Invoice info, and Total.

4.2. Model Architecture

We replicate the chargrid-net architecture from Katti et al. [1]. This model is a fully convolutional neural network with an encoder-decoder architecture, using downsampling in the encoder and a reversal of the downsampling based on stride-2 transposed convolutions in the decoder. In contrast to the two parallel decoders in the original model, we only use the semantic segmentation decoder, which concludes in a pixel-level classification over the number of target classes.

Replicating the loss used in Katti et al. [1], we counter the strong class imbalance between background pixels and actually relevant pixels with static aggressive class weighting following Paszke et al. [13].

4.3. Evaluation Measure

To evaluate the model performance on the key information extraction task, we use the Word Accuracy Rate (WAR) [1, 4]. Similar to the Levenshtein distance, it counts the number of substitutions, insertions, and deletions between the ground truth and the prediction. We report the WAR for each given field and overall. The instances are pooled across the entire test set. WAR is defined as follows:

    WAR = 1 − (#[ins.] + #[del.] + #[sub.]) / N          (2)

where N is the total number of tokens of a specific field in the ground truth instance.

5. Experiments and Results

In the following we show quantitative results for our proposed document representation. First we show the impact of single components by carrying out an ablation study, followed by a comparison with Chargrid [1] and BERTgrid [2]. As a baseline we use the 3-channel RGB image as input. We report the average metrics over 5-fold cross-validation.

5.1. Ablation Study

We report the results of the ablation study in Figure 1. We experimented with different combinations of our developed components: Layout-only (LA), Alphanumeric Categorization (AN), Locality-sensitive hashing (LSH), and the Line mask (LI). For the FUNSD and XFUND datasets, we additionally compare LSH components with 10 and 100 hyperplanes, denoted as LSH(10) and LSH(100), respectively.

Figure 1: Ablation study to investigate the impact of the different non-sensitive components in the NDGrid. Experiments to estimate the impact of LSH (5, 6, 7) are only carried out for FUNSD and XFUND.

5.2. Comparison

We show the quantitative comparison in Tables 2, 3, and 4. Our Chargrid implementation distinguishes between 54 different characters and is case-insensitive. Besides the 26 alphabetic characters and the 10 numeric characters, we include 18 additional other characters.

For the BERTgrid, we use the pre-trained BERT base model bert-base-uncased from the Hugging Face transformers library [14]. Before being fed into the tokenizer, we order the words by their corresponding bounding boxes' top and left coordinates.

6. Discussion

In the ablation study, we show how we can increase the model performance by combining our non-sensitive components. The impacts of the different components are not consistent over the analyzed datasets.
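Pooled over a field, WAR can be computed from a token-level Levenshtein alignment. The sketch below uses a plain dynamic-programming edit distance (a hypothetical helper, not the authors' implementation) and also shows why WAR can drop below zero when the edit operations outnumber the ground-truth tokens:

```python
def edit_distance(ref, hyp):
    """Token-level Levenshtein distance: minimal number of substitutions,
    insertions, and deletions turning ref into hyp (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def word_accuracy_rate(ref_tokens, pred_tokens):
    """WAR per Eq. (2); N is the ground-truth token count of the field.
    The value is negative when the error count exceeds N."""
    return 1.0 - edit_distance(ref_tokens, pred_tokens) / len(ref_tokens)
```

For instance, a prediction with one substitution and one deletion against four ground-truth tokens yields a WAR of 0.5.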
Different dataset sizes seem to be influential when it comes to the impact of the components. When using the Line mask (LI) in combination with Layout-only (LA) and the Alphanumeric Categorization (AN), the model performance increases significantly. The Line mask (LI) without the Alphanumeric Categorization seems to be less effective or, in the case of the RVL-CDIP Layout dataset, even worse than the Layout-only (LA) by itself. Combining all components, including LSH with 100 hyperplanes, yields the best model performance for all three datasets. The LSH component with 100 hyperplanes does perform worse when not combined with the other components.

Except for the header fields in the FUNSD dataset, the BERTgrid outperforms the image and the Chargrid on all key fields. Our approach often matches and, on specific fields, even outperforms the BERTgrid performance. The results support the hypothesis that a generalized and fundamentally simplified representation still contains enough information to be used in automated document processing.

Figure 2: Comparison of model performances between the different document representations.

                 All      HED.     QST.     ANS.
    [Image]      47.1%   -32.2%    33.3%    26.7%
    [Chargrid]   52.3%    10.8%    38.5%    31.5%
    [BERTgrid]   56.5%   -13.3%    43.3%    41.9%
    [NDGrid]     58.3%   -15.0%    43.0%    47.1%

Table 2: Comparison of WAR metrics on FUNSD. Fields: Header (HED.), Question (QST.), Answer (ANS.).

                 All      HED.     QST.     ANS.
    [Image]      52.7%    -6.8%    20.2%    14.2%
    [Chargrid]   61.3%    20.8%    33.1%    30.6%
    [BERTgrid]   64.7%    21.7%    36.3%    39.3%
    [NDGrid]     64.5%     6.2%    36.1%    40.0%

Table 3: Comparison of WAR metrics on XFUND. Fields: Header (HED.), Question (QST.), Answer (ANS.).

                 All      REC.     SUP.     INF.    TOT.
    [Image]      67.7%    38.5%    19.4%     3.2%   -8.2%
    [Chargrid]   69.7%    41.1%    31.0%     4.0%   -5.3%
    [BERTgrid]   73.2%    43.8%    41.8%    20.9%    1.7%
    [NDGrid]     72.9%    48.8%    39.3%    14.9%   -6.7%

Table 4: Comparison of WAR metrics on RVL-CDIP Layout. Fields: Receiver (REC.), Supplier (SUP.), Info (INF.), Total (TOT.).

7. Conclusion

NonDisclosureGrid is a privacy-preserving document representation including textual, visual, and spatial information. Reducing multimodal information into simplified and non-sensitive encodings is very effective for key information extraction. Our NDGrid produces matching or even better results than models trained with state-of-the-art representations. The performance on other document understanding tasks has yet to be shown, and this work is only the first step towards a versatile privacy-preserving document representation. There is still room for more information-inducing components.

References

[1] A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, J. B. Faddoul, Chargrid: Towards understanding 2d documents, CoRR abs/1809.08799 (2018). URL: http://arxiv.org/abs/1809.08799. arXiv:1809.08799.
[2] T. I. Denk, C. Reisswig, Bertgrid: Contextualized embedding for 2d document representation and understanding, CoRR abs/1909.04948 (2019). URL: http://arxiv.org/abs/1909.04948. arXiv:1909.04948.
[3] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[4] M. Kerroumi, O. Sayem, A. Shabou, Visualwordgrid: Information extraction from scanned documents using a multimodal approach, CoRR abs/2010.02358 (2020). URL: https://arxiv.org/abs/2010.02358. arXiv:2010.02358.
[5] P. Indyk, R. Motwani, Approximate nearest neighbors: Towards removing the curse of dimensionality, in: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, Association for Computing Machinery, New York, NY, USA, 1998, pp. 604–613. URL: https://doi.org/10.1145/276698.276876. doi:10.1145/276698.276876.
[6] A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proceedings of the 25th International Conference on Very Large Data Bases, VLDB '99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999, pp. 518–529.
[7] G. Bradski, The OpenCV Library, Dr. Dobb's Journal of Software Tools (2000).
[8] R. Gioi, J. Jakubowicz, J.-M. Morel, G. Randall, LSD: A line segment detector, Image Processing On Line 2 (2012) 35–55. doi:10.5201/ipol.2012.gjmr-lsd.
[9] G. Jaume, H. K. Ekenel, J.-P. Thiran, FUNSD: A dataset for form understanding in noisy scanned documents, in: Accepted to ICDAR-OST, 2019.
[10] Y. Xu, T. Lv, L. Cui, G. Wang, Y. Lu, D. Florencio, C. Zhang, F. Wei, Layoutxlm: Multimodal pre-training for multilingual visually-rich document understanding (2021). arXiv:2104.08836.
[11] P. Riba, A. Dutta, L. Goldmann, A. Fornés, O. Ramos, J. Lladós, Table detection in invoice documents by graph neural networks, in: 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 122–127. doi:10.1109/ICDAR.2019.00028.
[12] A. W. Harley, A. Ufkes, K. G. Derpanis, Evaluation of deep convolutional nets for document image classification and retrieval, in: International Conference on Document Analysis and Recognition, 2015.
[13] A. Paszke, A. Chaurasia, S. Kim, E. Culurciello, Enet: A deep neural network architecture for real-time semantic segmentation, 2016. arXiv:1606.02147.
[14] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.