Towards Visual Explanations for Document Table Detection using Coarse Localization Maps

Arnab Ghosh Chowdhury1, Martin Atzmueller1,2
1 Osnabrück University, Semantic Information Systems (SIS) Group, Osnabrück, Germany
2 German Research Center for Artificial Intelligence (DFKI), Osnabrück, Germany

Abstract
Computer-vision-based methods using deep neural networks offer considerable opportunities to extract tabular information from richly-structured documents. However, it is extremely challenging to build a unified framework for tabular information extraction, for example, due to the variety of document templates as well as the diversity of document table templates. Earlier, we proposed a transfer learning based table detection approach [1] and a supervised table detection framework initialized with pre-trained self-supervised image classification model weights [2] on domain-specific document images. In this paper, we investigate different document table detection techniques with respect to explainability issues. These enable, e. g., diagnostics and method refinement, towards a complete tabular data extraction pipeline and tool. In particular, we present visual explanation approaches for the previously proposed table detection models on domain-specific document images in order to enhance the explainability of the applied Convolutional Neural Network (CNN) based models. We discuss first experimental results for visual explanations of those models and outline several challenges in this context.

Keywords
Tabular Information Extraction, Barlow Twins, Grad-CAM, Grad-CAM++, Ablation-CAM

1. Introduction

Document tables commonly offer essential information in a systematic, structured way. Document table detection is a challenging task due to the diverse layouts and formats of document templates as well as table templates. In the age of digitalization, documents are often provided in digital form, such as Portable Document Format (PDF) documents, LaTeX documents or scanned documents. Open-source tools such as Camelot1 or Tabula2 are not yet suitable to support all possible PDF documents for extracting tabular information, e. g., due to the diverse layouts of PDF document templates. In this context, computer vision based object detection approaches have emerged for enabling document layout analysis and for table detection. Benchmark datasets such as TableBank [3] and PubLayNet [4], along with pre-trained object detection models3, are readily available for document layout analysis. However, some domain-specific information extraction tasks still suffer from the absence of manually annotated benchmark datasets as well as pre-trained object detection models.

LWDA'22: Lernen, Wissen, Daten, Analysen. October 05–07, 2022, Hildesheim, Germany
arnab.ghosh.chowdhury@uos.de (A. Ghosh Chowdhury); martin.atzmueller@uos.de (M. Atzmueller)
1 https://github.com/camelot-dev/camelot
2 https://github.com/chezou/tabula-py
3 https://github.com/Layout-Parser/layout-parser/blob/main/src/layoutparser/models/detectron2/catalog.py

Figure 1: Overview: Document table detection methods and explanation approaches (PDF pages are split and converted into document images; a transfer learning based table detection model and a supervised table detection model initialized with pre-trained Barlow Twins image classification model weights are applied to unseen document images; their decisions are visually explained using Grad-CAM, Grad-CAM++ and Ablation-CAM, respectively).

Previously, we extracted tabular information directly from document images using Optical Character Recognition (OCR) [1]. In this paper, we specifically investigate these approaches towards their explainability, which can then be applied for assessment, diagnostics, and method refinement. Ultimately, such approaches can enable flexible tabular data extraction, i. e., by mapping the bounding box coordinates of predicted tables on document images to PDF document pages in order to identify the table region, or region of interest (ROI), on the respective document page. We discuss this in the context of a prototypical implementation that uses the ROI information along with the Camelot tool to extract tabular data from PDF documents, in order to facilitate knowledge management, e. g., [5, 6, 7, 8, 9]. Regarding explainability, we apply coarse localization maps to offer visual explanations for the decisions of the two table detection models on the domain-specific (so-called) Di-Plast dataset, leveraging the Grad-CAM [10], Grad-CAM++ [11] and Ablation-CAM [12] methods. Figure 1 depicts the methods and structure.

In previous work, we proposed a transfer learning based table detection approach [1] on the given domain-specific Di-Plast dataset4 on the basis of a pre-trained TableBank model, which is trained on the TableBank dataset [1]. Such pre-trained TableBank or PubLayNet models generally follow a supervised object detection framework, for example, Faster Region-based Convolutional Neural Network (Faster R-CNN) [3, 4], which is primarily initialized with pre-trained ImageNet supervised image classification model weights. In another previous experiment [2], we utilized pre-trained self-supervised image classification model weights instead of the pre-trained ImageNet supervised image classification model weights, and obtained substantial table detection results in comparison with the transfer learning approach.

4 https://github.com/cslab-hub/MatrixDataExtractor/tree/main/tabledetection

We leveraged the architecture of Barlow Twins, a redundancy-reduction based self-supervised learning method [13], to build an image classifier, which is trained on subsets of the PubTabNet [14] and DVQA [15] datasets. Subsequently, we exploited the supervised Faster R-CNN object detection framework, which is primarily initialized with this Barlow Twins image classification model weight [2]. However, the transfer learning based table detection model achieves better performance than this model on our domain-specific Di-Plast dataset.
Therefore, it is beneficial to obtain visual explanations for the decisions made by our transfer learning based table detection model and by the respective supervised Faster R-CNN table detection model. In general, this can enhance the transparency of the respective models, which also provides important options for assessment, tuning, and further analysis.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 summarizes the applied table detection methods, with an outlook on tabular data extraction. Section 4 presents visual explanations for the decisions of our table detection models using Grad-CAM, Grad-CAM++ and Ablation-CAM. Finally, Section 5 concludes the paper with a summary and outlines interesting directions for future work.

2. Related Work

Below, we briefly discuss related work concerning document layout analysis, before we summarize visual explanation methods for deep Convolutional Neural Network (CNN) based models.

2.1. Barlow Twins in Document Layout Analysis

Barlow Twins is a redundancy-reduction based self-supervised learning method. It operates on a joint embedding of two augmented views of all images of a batch sampled from a training dataset, learning representations that are invariant under different image augmentations. It estimates the cross-correlation between the embeddings produced by two identical networks fed with the augmented views of a batch of samples, aiming to make the cross-correlation matrix close to the identity matrix [13]. [16], for example, conducts a document image classification task using Barlow Twins on the RVL-CDIP [17] and the Tobacco-3482 [18] datasets.

2.2. Class Activation Mapping (CAM)

To build trustworthy CNN models, it is important to explain their decisions, for example, why these table detection models predict what they predict. This transparency helps to comprehend failure scenarios and to debug CNN based models, along with identifying and eliminating potential biases in the training data [10]. [19] demonstrates that the convolutional units of different layers of a Convolutional Neural Network (CNN) act as object detectors, even though no supervision on the location of the objects is provided. This ability to localize objects is lost when fully-connected (FC) layers are used for image classification. Class activation mapping (CAM) is therefore proposed for CNNs with global average pooling, enabling a classification-trained CNN to perform object localization without utilizing bounding box annotations. Towards object localization, class activation maps facilitate visualizing the predicted class scores on the image by highlighting the discriminative object parts detected by the CNN [20].

Gradient-weighted Class Activation Mapping (Grad-CAM) is proposed to offer visual explanations for the decisions of a large class of CNN based models, in order to make those models more transparent and explainable. In image classification, it uses the gradients of a target concept, for instance, a class label, flowing into the final convolutional layer to generate a coarse localization map highlighting the important regions in the image for predicting the concept or class label [10].
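Concretely, the Grad-CAM formulation from [10] can be summarized as follows, where $A^k$ denotes the $k$-th feature map of the final convolutional layer, $y^c$ the score for class $c$, and $Z$ the number of spatial locations:

```latex
% Channel weights: global-average-pooled gradients of the class score w.r.t. A^k
\alpha_k^c \;=\; \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k}
% Coarse localization map: ReLU-gated weighted combination of the feature maps
L^c_{\mathrm{Grad\text{-}CAM}} \;=\; \mathrm{ReLU}\!\Big( \sum_k \alpha_k^c A^k \Big)
```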
Grad-CAM has some limitations: its performance falls when multiple occurrences of the same class have to be localized, and Grad-CAM heatmaps commonly do not capture the entire object for single-object images. To address these issues, the Grad-CAM++ method is proposed to offer better visual explanations of CNN model predictions [11]. Furthermore, Grad-CAM suffers from the gradient saturation problem: the backpropagating gradients diminish, which adversely affects the quality of the visualizations and the detection of multiple occurrences of the same object in an image. Unlike gradient based methods (e. g., Grad-CAM, Grad-CAM++), the gradient-free visualization method Ablation-CAM is proposed to produce visual explanations for interpreting CNN models; it avoids the use of gradients and simultaneously offers high-quality class-discriminative localization maps [12].

In this context, [21] proposes a CNN architecture for handwritten Chinese character recognition and leverages the CAM method for visual explanations. [22] studies numerous weakly supervised object localization and detection approaches along with CAM methods. [23] investigates a pseudo-label-based semi-supervised object detection system and applies Grad-CAM to give further evidence for their proposed Mix/UnMix (MUM) data augmentation method. A novel weakly supervised object detection approach is introduced in [24] that uses both proposal-level and semantic-level relationships, and generates object proposals based on heatmaps extracted by Grad-CAM. [25] leverages Grad-CAM for generating and selecting high-quality proposals for weakly supervised object detection.

3. Methods

In this section, we first summarize our earlier proposed table detection methods before presenting first results of our experimentation below. We also sketch a prototypical tabular data extraction approach that maps the bounding box coordinates of the predicted tables on document images to PDF document pages and then leverages the Camelot tool for data extraction.

3.1. Barlow Twins based Table Detection Model

The encoder of the Barlow Twins model consists of a ResNet-50 network (without the final classification layer, with 2048 output units), which is followed by a projector network. The projector network consists of three linear layers, each with 8192 output units. The first two layers of the projector are followed by a batch normalization layer and rectified linear units. The output of the encoder is denoted as the representation, while the output of the projector is denoted as the embedding [13]. We exploited the learned representations for table detection; the obtained embeddings are fed to the loss function of the Barlow Twins model. After image classification model training, the ResNet-50 encoder (without the final classification layer and with 2048 output units) is fed into the ResNet-FPN (Feature Pyramid Network) architecture as the backbone of the Faster R-CNN table detection framework [2]. A minimal sketch of this setup is given below.
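The following sketch is not the training code from [2]; it assumes PyTorch/torchvision, the function names are illustrative, and the off-diagonal weight default follows the Barlow Twins paper [13]. It outlines the encoder-projector structure and the redundancy-reduction loss described above.

```python
# Minimal sketch of the Barlow Twins setup from Section 3.1: a ResNet-50 encoder
# (2048-d representation) followed by a projector of three 8192-unit linear layers,
# trained so that the cross-correlation matrix of two embeddings approaches identity.
import torch
import torch.nn as nn
import torchvision

def make_encoder_and_projector(proj_dim: int = 8192):
    encoder = torchvision.models.resnet50(weights=None)
    encoder.fc = nn.Identity()  # keep the 2048-d representation, drop the classifier
    projector = nn.Sequential(
        nn.Linear(2048, proj_dim, bias=False), nn.BatchNorm1d(proj_dim), nn.ReLU(inplace=True),
        nn.Linear(proj_dim, proj_dim, bias=False), nn.BatchNorm1d(proj_dim), nn.ReLU(inplace=True),
        nn.Linear(proj_dim, proj_dim, bias=False),
    )
    return encoder, projector

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambda_offdiag: float = 5e-3):
    # Normalize each embedding dimension across the batch, then compute the
    # cross-correlation matrix between the embeddings of the two augmented views.
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    n = z_a.shape[0]
    c = (z_a.T @ z_b) / n
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()   # push diagonal entries towards 1
    off_diag = (c.fill_diagonal_(0) ** 2).sum()      # push off-diagonal entries towards 0
    return on_diag + lambda_offdiag * off_diag

# After self-supervised training, the encoder weights can be reused as a ResNet-FPN
# backbone for the Faster R-CNN table detection framework, c. f. Section 3.1 and [2].
```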
Table 1
Train and test dataset distribution for image classification.

Dataset (randomly selected) | Train dataset         | Test dataset
PubTabNet                   | 5000 table images     | 100 table images
DVQA                        | 5000 bar chart images | 100 bar chart images

Table 2
The consolidated result of table detection models using transfer learning and Barlow Twins architecture on the Di-Plast validation dataset – with different Intersection over Union (column IoU) thresholds w.r.t. overlap/intersection and union of the respective regions, c. f., [1, 2].

Metric | IoU         | area   | maxDets | ValueTL | ValueBT
AP     | [0.50:0.95] | all    | 100     | 0.900   | 0.770
AP     | 0.50        | all    | 100     | 1.000   | 0.987
AP     | 0.75        | all    | 100     | 1.000   | 0.933
AP     | [0.50:0.95] | small  | 100     | -1.000  | -1.000
AP     | [0.50:0.95] | medium | 100     | -1.000  | -1.000
AP     | [0.50:0.95] | large  | 100     | 0.900   | 0.770
AR     | [0.50:0.95] | all    | 1       | 0.542   | 0.447
AR     | [0.50:0.95] | all    | 10      | 0.909   | 0.811
AR     | [0.50:0.95] | all    | 100     | 0.909   | 0.811
AR     | [0.50:0.95] | small  | 100     | -1.000  | -1.000
AR     | [0.50:0.95] | medium | 100     | -1.000  | -1.000
AR     | [0.50:0.95] | large  | 100     | 0.909   | 0.811

In our experimentation in [2], we considered 5,100 random samples each from the PubTabNet [14] and DVQA [15] datasets to create the train and test datasets for the Barlow Twins image classification model, c. f., Table 1. We trained the Barlow Twins model on a training dataset consisting of 5,000 table images and 5,000 bar chart images. As there are no labels in self-supervised learning, we took the encoder (ResNet-50) of the Barlow Twins model and froze its weights. For the evaluation of the model, we added a final classification layer on top of the encoder and trained only that layer on the same training dataset (i. e., on 5,000 table images and 5,000 bar chart images). We evaluated the image classification model and obtained 93% accuracy on the test dataset, which consists of 100 table images and 100 bar chart images, c. f., [2] for a detailed discussion.

Subsequently, we performed supervised Faster R-CNN based table detection initialized with the pre-trained Barlow Twins image classification model weights on the Di-Plast training dataset. We evaluated the table detection model on the Di-Plast validation dataset and obtained nearly 77% mAP (mean average precision) over IoU (Intersection over Union) thresholds, as presented in the ValueBT column of Table 2 [2]. The ValueTL column presents the evaluation results of our transfer learning based table detection method [1].

Commonly, three typical table detection errors are observed: partial-detection, un-detection, and mis-detection. In partial-detection, only some part of the ground-truth table is predicted and some information is missing. In un-detection, an entire ground-truth table is not predicted. In mis-detection, other components such as text blocks, figures or bar charts on document images are predicted as tables [3].

Figure 2: Inference results (with red-colored bounding box) of the Faster R-CNN based table detection model with pre-trained Barlow Twins image classification model weight on two document images (panels (a) and (b)).

For each predicted table, we compute the IoU w.r.t. each ground-truth bounding box of the table class on the document images. Thereafter, Average Precision (AP) and Average Recall (AR) values are averaged over multiple IoU thresholds. AP is averaged over all categories (here, there is only one category, the table class), which is referred to as mean average precision (mAP). No distinction is made between AP and mAP, and similarly between AR and mean average recall (mAR), in the COCO (Common Objects in Context) evaluation metrics5. Values of -1.000 indicate that the metric cannot be computed, since no predictions are made for small objects (with an area of less than 32 × 32 pixels) and medium objects (with an area between 32 × 32 and 96 × 96 pixels), because the area of each table is larger than 96 × 96 pixels in our Di-Plast dataset [1]. The sketch below illustrates the underlying IoU computation.
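For illustration, a minimal IoU computation over axis-aligned boxes (not the actual COCO evaluation code) looks as follows:

```python
# Illustrative sketch of the IoU measure underlying the COCO-style AP/AR values in
# Table 2. Boxes are given as (x1, y1, x2, y2) in pixel coordinates, origin top-left.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

# A prediction counts as correct at, e. g., the 0.75 threshold if
# iou(predicted_box, ground_truth_box) >= 0.75.
```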
To exemplify our analysis and the respective predictions of the Faster R-CNN based table detection model initialized with the pre-trained Barlow Twins image classification model weight [2], some exemplary instances are shown in Figure 2. Two tables are predicted in Figure 2(a). In Figure 2(b), we observe that only one table is predicted while the second table is not predicted. The shown exemplary instances sketch the structure of the documents contained in the domain-specific Di-Plast dataset. We notice that the Barlow Twins based table detection model suffers from partial-detection and un-detection problems on the Di-Plast test dataset. Partial table detection is observed in Figure 2(a), where a part of the left side of the top table is not accurately predicted; hence, textual information of the table could be missed during tabular data extraction. On the other hand, Figure 2(b) shows the un-detection problem, where the top table is not predicted at all.

5 https://cocodataset.org/#detection-eval

Figure 3: Overview of coordinate systems of a document image (a) and a PDF page (b).

3.2. Application: From Document Table Detection to Tabular Data Extraction

In general, after extracting the bounding box information, we can apply tabular data extraction in order to enable information extraction afterwards, e. g., using knowledge-based approaches [26, 27, 28]. This ultimately also enables knowledge management, since we can then apply standardized representations for the extracted information.

For tabular data extraction based on the transfer learning based document table detection approach, we have implemented a prototypical approach using the Camelot open-source tool to extract tabular data from PDF documents. Below, we sketch the basic idea of exploiting the region of interest information, which is used for enabling data and information extraction. Essentially, we create a mapping between the bounding box pixel values of the document images and the corresponding coordinate values of the PDF document pages. For this mapping, we have to align the coordinate systems of a PDF document page and of a document image. In the 2-dimensional coordinate system of a PDF page, the coordinate value (0,0) is located at the bottom-left corner.6 In contrast, the pixel value (0,0) of a document image is located at the top-left corner. Figure 3 depicts the respective coordinate systems. Dots per inch (dpi) measures the number of printed dots contained within one inch of a printed image.7 During pre-processing, when we convert a PDF document to (potentially a set of) document images, we use a resolution of 72 dpi. Afterwards, the transfer learning based table detection model is used for inference on the unseen document images to predict tables on them. We obtain the bounding box coordinate values of the predicted tables on the document images, following the coordinate system of Figure 3(a). Here, we also need to take into account that each document image used in our tabular data extraction process has a width of 612 pixels and a height of 792 pixels, whereas the image shown in Figure 3(a) has a width of 4250 pixels and a height of 5500 pixels. A sketch of this coordinate mapping is given below.

6 https://www.pdfscripting.com/public/PDF-Page-Coordinates.cfm
7 https://www.sony.com/electronics/support/articles/00027623
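The following sketch illustrates this mapping under stated assumptions: the paper does not name the PDF rendering tool, so pdf2image is used here as one common choice; at 72 dpi a 612 x 792 point page maps to a 612 x 792 pixel image (scale 1.0); and the Camelot call reflects its documented stream-flavor interface, where table_areas expects "x1,y1,x2,y2" strings with top-left and bottom-right corners in PDF coordinates (details may differ across versions). The file name and bounding box values are placeholders.

```python
# Minimal sketch of the pixel-to-PDF coordinate mapping described in Section 3.2.
import camelot                              # illustrative usage of the Camelot tool
from pdf2image import convert_from_path     # one possible PDF-to-image converter

def bbox_pixels_to_pdf(bbox, page_height, scale=1.0):
    """Map an image bounding box (x1, y1, x2, y2), origin top-left,
    to PDF coordinates, origin bottom-left."""
    x1, y1, x2, y2 = bbox
    pdf_x1, pdf_x2 = x1 * scale, x2 * scale
    pdf_y_top = page_height - y1 * scale     # top edge in PDF coordinates
    pdf_y_bottom = page_height - y2 * scale  # bottom edge in PDF coordinates
    return pdf_x1, pdf_y_top, pdf_x2, pdf_y_bottom

# Render one page at 72 dpi (612 x 792 pixels for a letter-sized page) ...
images = convert_from_path("report.pdf", dpi=72, first_page=1, last_page=1)
# ... run the table detection model on images[0] (omitted here), obtaining e.g.:
predicted_bbox = (60, 120, 550, 400)         # illustrative pixel values

x1, y_top, x2, y_bottom = bbox_pixels_to_pdf(predicted_bbox, page_height=792)
tables = camelot.read_pdf("report.pdf", pages="1", flavor="stream",
                          table_areas=[f"{x1},{y_top},{x2},{y_bottom}"])
print(tables[0].df)                          # extracted table as a pandas DataFrame
```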
4. Results: Visual Explanations

In this section, we focus on different explanatory visualization methods in our context of coarse localization maps by leveraging Grad-CAM, Grad-CAM++ and Ablation-CAM. In Figure 4 and Figure 5, we focus on the structure of the documents and the localization maps.

In general, deep residual networks such as ResNets exhibit state-of-the-art performance in several challenging computer vision tasks, but their depth and complexity make these models difficult to interpret. CAM is proposed to identify discriminative regions used by a restricted class of image classification CNN models that do not contain any fully-connected layer [20]. In contrast, Grad-CAM makes existing state-of-the-art deep neural network models interpretable without altering their architecture. A good visual explanation for an image classification model should, for any target category or class label, be (1) class-discriminative, i. e., localize the category in the image, and (2) high-resolution, i. e., capture fine-grained details [10].

4.1. Transfer Learning based Model: Grad-CAM and Grad-CAM++

Unlike an image classification model, a table detection model does not only provide label information, but also bounding box and score information. We leverage the Grad-CAM and Grad-CAM++ methods8 to visualize coarse localization maps for the decisions made by our transfer learning based table detection model on the Di-Plast test dataset. Our transfer learning based table detection model follows the Faster R-CNN architecture initialized with the pre-trained TableBank table detection model referred to in [1], which uses the PyTorch Detectron2 library9. Figure 4(a) and Figure 4(b) exhibit coarse localization maps produced by Grad-CAM and Grad-CAM++ for our transfer learning based table detection model on the same document image. Red color indicates the areas in which the heatmaps have a higher intensity10, which are expected to match the location of the objects corresponding to the table class [29, 30, 31]. Our first visualization results indicate that the Grad-CAM++ method seems to perform better than the Grad-CAM method in visually explaining the decisions of our transfer learning based table detection model: in contrast to Grad-CAM, most of the red regions of the localization maps produced by Grad-CAM++ remain within the predicted bounding box of the table class on the Di-Plast test dataset. We use the JET colormap in Matplotlib [32] for the Grad-CAM and Grad-CAM++ methods during visualization11; a minimal overlay sketch is given below.

8 https://github.com/alexriedel1/detectron2-GradCAM
9 https://github.com/facebookresearch/detectron2
10 https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html
11 https://matplotlib.org/stable/tutorials/colors/colormaps.html

Figure 4: Visualization of coarse localization maps of the transfer learning based table detection model on the same document image: (a) table detection with Grad-CAM, (b) table detection with Grad-CAM++.
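As a minimal illustration (not the paper's plotting code), such a JET-colormap overlay can be produced with Matplotlib roughly as follows; the array names and values used here are placeholders:

```python
# Overlay a coarse localization map on a document image using Matplotlib's JET
# colormap, as used for the Grad-CAM / Grad-CAM++ figures. `image` is the document
# image and `cam` is a localization map normalized to [0, 1] at the image resolution.
import numpy as np
import matplotlib.pyplot as plt

def overlay_cam(image: np.ndarray, cam: np.ndarray, alpha: float = 0.5):
    """Show the document image with a semi-transparent JET heatmap on top."""
    plt.imshow(image, cmap="gray" if image.ndim == 2 else None)
    plt.imshow(cam, cmap="jet", alpha=alpha)   # red = high-intensity regions
    plt.axis("off")
    plt.show()

# Example with dummy data of the document image size used in Section 3.2:
image = np.ones((792, 612))                              # placeholder white page
cam = np.zeros((792, 612)); cam[120:400, 60:550] = 1.0   # placeholder table region
overlay_cam(image, cam)
```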
4.2. Barlow Twins based Model: Ablation-CAM

The Grad-CAM and Grad-CAM++ methods rely on gradients backpropagating from the output class nodes during visualization. Grad-CAM fails to offer reliable visual explanations for highly confident decisions due to gradient saturation. Moreover, Grad-CAM often highlights relatively incomplete and imperfect regions when detecting multiple occurrences of the same object in an image, which may not be sufficient for the trustworthiness of CNN based models. Ablation-CAM, a gradient-free method, is therefore proposed to visualize class-discriminative localization maps for explaining the decisions of CNN based models [12].

We use the Ablation-CAM method12 to get a visual explanation of the decisions made by the Faster R-CNN table detection model initialized with the pre-trained Barlow Twins image classification model weight on the Di-Plast test dataset; this model uses the PyTorch TorchVision library13. Figure 5(a) and Figure 5(b) show the table detection model inference result (with red-colored bounding box) and the corresponding coarse localization map produced by Ablation-CAM. The red-colored region on the localization map indicates where the heatmap has a higher intensity, which is expected to match the location of the objects corresponding to the table class [30, 33]. Here, it appears that the visual explanation for the table detection model produced by Ablation-CAM is not satisfactory. The reason might be that this table detection model obtains only about 77% mAP over IoU thresholds, compared to the transfer learning based table detection model, which obtains nearly 90% mAP, as reported in [1]. We use the JET colormap in OpenCV [34] for the Ablation-CAM method during visualization14. A simplified sketch of the ablation idea is given below.

12 https://github.com/jacobgil/pytorch-grad-cam
13 https://pytorch.org/vision/stable/index.html
14 https://docs.opencv.org/4.x/d3/d50/group__imgproc__colormap.html

Figure 5: Exemplary inference results (with red-colored bounding box) and coarse localization map visualization for the Faster R-CNN table detection model with pre-trained Barlow Twins model weight: (a) table detection with bounding box, (b) table detection with Ablation-CAM.
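For illustration, the core ablation idea of [12] can be sketched for a plain classifier as follows; this is a simplified, self-contained variant and not the pytorch-grad-cam implementation referenced in footnote 12, which additionally handles the Faster R-CNN detection head. The weight of each feature map is the relative drop of the target score when that map is ablated (set to zero); no gradients are used.

```python
# Simplified sketch of Ablation-CAM for a ResNet-50 classifier (illustrative only).
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet50(weights=None).eval()

def ablation_cam(x: torch.Tensor, target_class: int) -> torch.Tensor:
    """x: input batch of shape (1, 3, H, W); returns a coarse localization map."""
    with torch.no_grad():
        feats = torch.nn.Sequential(*list(model.children())[:-2])(x)   # (1, K, h, w)
        head = lambda f: model.fc(torch.flatten(model.avgpool(f), 1))  # classifier head
        base_score = head(feats)[0, target_class]
        weights = torch.zeros(feats.shape[1])
        for k in range(feats.shape[1]):            # ablate one feature map at a time
            ablated = feats.clone()
            ablated[:, k] = 0.0
            score_k = head(ablated)[0, target_class]
            weights[k] = (base_score - score_k) / (base_score + 1e-8)
        cam = F.relu((weights[None, :, None, None] * feats).sum(dim=1))  # weighted sum
        cam = cam / (cam.max() + 1e-8)             # normalize to [0, 1]
    return cam[0]

# Usage: cam = ablation_cam(torch.rand(1, 3, 224, 224), target_class=0)
```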
5. Conclusions

In this paper, we focused on techniques for tabular data extraction from PDF documents with the help of computer vision based deep neural networks. In particular, we investigated those approaches towards their explainability, which can then be applied for assessment, diagnostics, method refinement and tuning. Regarding the methods, we discussed inference results of the Faster R-CNN based table detection model on document images, initialized with the pre-trained Barlow Twins image classification model weight, building on our previous research [1, 2].

We explored visual explanations for the decisions made by the Barlow Twins based table detection model, along with the previously analyzed transfer learning based table detection model, using the Grad-CAM, Grad-CAM++ and Ablation-CAM methods. For the Faster R-CNN based table detection model initialized with the pre-trained Barlow Twins image classification model weight, we applied the gradient-free Ablation-CAM method to visualize coarse localization maps. In our first experiments, we observe the trend that adequate visual explanations of the decisions of this model are not yet possible, due to the lower mAP value as well as the partial-detection and un-detection problems in our context. For the transfer learning based table detection model, we utilized the gradient based Grad-CAM and Grad-CAM++ methods. The red-colored regions of the localization maps indicate areas of higher heatmap intensity, which are expected to match the location of the objects corresponding to the table class. The Grad-CAM++ method seems to offer better visual explanations of the decisions made by the transfer learning based table detection model compared to the Grad-CAM method.

For future work, we aim to explore semi-supervised and active learning based document table detection approaches for minimizing the manual image annotation effort on domain-specific datasets. We also aim to concentrate on further methods for providing visual explanations of the decisions of such CNN based models for table detection. Furthermore, we intend to explore further benchmark datasets, e. g., the PubLayNet [4] and TableBank [3] datasets. In addition, combining the tabular data extraction approach with further knowledge-based techniques, e. g., [26, 27, 28], indicates promising directions for future research.

Acknowledgments

This work has been funded by the Interreg North-West Europe program (Interreg NWE), project Di-Plast - Digital Circular Economy for the Plastics Industry (NWE729).

References

[1] A. G. Chowdhury, N. Schut, M. Atzmueller, A hybrid information extraction approach using transfer learning on richly-structured documents, in: T. Seidl, M. Fromm, S. Obermeier (Eds.), Proc. LWDA 2021 Workshops: FGWM, KDML, FGWI-BIA, and FGIR, volume 2993 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 13–25.
[2] A. Ghosh Chowdhury, M. b. Ahmed, M. Atzmueller, Towards Tabular Data Extraction From Richly-Structured Documents Using Supervised and Weakly-Supervised Learning, in: Proc. IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), IEEE, 2022.
[3] M. Li, L. Cui, S. Huang, F. Wei, M. Zhou, Z. Li, Tablebank: A benchmark dataset for table detection and recognition, arXiv preprint arXiv:1903.01949 (2019).
[4] X. Zhong, J. Tang, A. J. Yepes, Publaynet: largest dataset ever for document layout analysis, in: Proc. International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2019, pp. 1015–1022.
[5] M. Alavi, D. E. Leidner, Knowledge management and knowledge management systems: Conceptual foundations and research issues, MIS Quarterly (2001) 107–136.
[6] Q. Zhu, X. Cheng, The opportunities and challenges of information extraction, in: Proc. International Symposium on Intelligent Information Technology Application Workshops, IEEE, 2008, pp. 597–600.
[7] P. Năstase, D. Stoica, F. Mihai, A. Stanciu, From document management to knowledge management, Annales Universitatis Apulensis Series Oeconomica 11 (2009) 325–334.
[8] S. Furth, J. Baumeister, Semantification of large corpora of technical documentation, in: Enterprise Big Data Engineering, Analytics, and Management, IGI Global, 2016, pp. 171–200.
[9] S. A. Siddiqui, M. I. Malik, S. Agne, A. Dengel, S. Ahmed, Decnt: Deep deformable cnn for table detection, IEEE Access 6 (2018) 74151–74161.
[10] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: Proc. IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[11] A. Chattopadhay, A. Sarkar, P. Howlader, V. N. Balasubramanian, Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks, in: Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 839–847.
[12] H. G. Ramaswamy, et al., Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization, in: Proc. IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 983–991.
[13] J. Zbontar, L. Jing, I. Misra, Y. LeCun, S. Deny, Barlow twins: Self-supervised learning via redundancy reduction, in: Proc. International Conference on Machine Learning, PMLR, 2021, pp. 12310–12320.
[14] X. Zhong, E. ShafieiBavani, A. Jimeno Yepes, Image-based table recognition: data, model, and evaluation, in: European Conference on Computer Vision, Springer, 2020, pp. 564–580.
[15] K. Kafle, B. Price, S. Cohen, C. Kanan, Dvqa: Understanding data visualizations via question answering, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5648–5656.
[16] S. A. Siddiqui, A. Dengel, S. Ahmed, Self-supervised representation learning for document image classification, IEEE Access 9 (2021) 164358–164367.
[17] A. W. Harley, A. Ufkes, K. G. Derpanis, Evaluation of deep convolutional nets for document image classification and retrieval, in: Proc. International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2015, pp. 991–995.
[18] M. Z. Afzal, A. Kölsch, S. Ahmed, M. Liwicki, Cutting the error by half: Investigation of very deep cnn and advanced training strategies for document image classification, in: Proc. IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, IEEE, 2017, pp. 883–888.
[19] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Object detectors emerge in deep scene cnns, arXiv preprint arXiv:1412.6856 (2014).
[20] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[21] P. Melnyk, Z. You, K. Li, A high-performance cnn method for offline handwritten chinese character recognition and visualization, Soft Computing 24 (2020) 7977–7987.
[22] D. Zhang, J. Han, G. Cheng, M.-H. Yang, Weakly supervised object localization and detection: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[23] J. Kim, J. Jang, S. Seo, J. Jeong, J. Na, N. Kwak, Mum: Mix image tiles and unmix feature tiles for semi-supervised object detection, in: Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14512–14521.
[24] D. Zhang, W. Zeng, J. Yao, J. Han, Weakly supervised object detection using proposal- and semantic-level relationships, IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).
[25] G. Cheng, J. Yang, D. Gao, L. Guo, J. Han, High-quality proposals for weakly supervised object detection, IEEE Transactions on Image Processing 29 (2020) 5794–5804.
[26] M. Atzmueller, P. Kluegl, F. Puppe, Rule-Based Information Extraction for Structured Data Acquisition using TextMarker, in: Proc. LWA 2008 (KDML Track), University of Wuerzburg, Wuerzburg, Germany, 2008, pp. 1–7.
[27] P. Kluegl, M. Atzmueller, F. Puppe, Meta-Level Information Extraction, in: The 32nd Annual Conference on Artificial Intelligence, Springer, Berlin, 2009, pp. 233–240.
[28] P. Kluegl, M. Atzmueller, F. Puppe, TextMarker: A Tool for Rule-Based Information Extraction, in: Proc. Unstructured Information Management Architecture (UIMA) 2nd UIMA@GSCL Workshop, Conference of the GSCL, 2009.
[29] M. Lerma, M. Lucas, Grad-cam++ is equivalent to grad-cam with positive gradients, arXiv preprint arXiv:2205.10838 (2022).
[30] C. Molnar, Interpretable Machine Learning, Lulu.com, 2020.
[31] J. VanderPlas, Python Data Science Handbook: Essential Tools for Working with Data, O'Reilly Media, 2016.
[32] N. Rougier, Matplotlib tutorial, Ph.D. thesis, INRIA, 2012.
[33] I. Culjak, D. Abram, T. Pribanic, H. Dzapo, M. Cifrek, A brief introduction to opencv, in: Proc. 35th International Convention MIPRO, IEEE, 2012, pp. 1725–1730.
[34] T. T. Santos, Scipy and opencv as an interactive computing environment for computer vision, Revista de Informática Teórica e Aplicada 1 (2015) 154–189.