Patterns Recognition Approach for Newspapers Analytics: Case of Algerian PDF Newspapers

Dihia LANASRI1
1 PROXYLAN EPE/SPA, Benaknoun, Algiers, Algeria


Abstract
Nowadays, companies are convinced that putting people at the heart of their business is vital for their competitiveness and success. To achieve this goal, a deep understanding of the content people publish is mandatory in order to evaluate their satisfaction, frustration, and interests, and eventually to recommend other items and services to them.
Media analytics, or news analytics, is the main solution used to collect, process, and analyze the news and content published by people in the media, on social networks, etc., in order to provide the insights and metrics that support data-driven decision making.
Press newspapers, and more precisely PDF newspapers, are a valuable and rich type of generated content that should be collected and analyzed by companies and organizations because of the wealth of information they contain. Newspaper analytics is a promising field of study; however, only a small number of works address this topic.
The lack of work dealing with this kind of content (PDF newspapers) for data analytics motivates us to propose: (1) an end-to-end approach for conducting newspaper analytics on PDF newspapers; (2) a detailed approach for PDF newspaper pattern recognition, for both Latin-script and Arabic PDFs.
This approach consists of identifying the different blocks of press articles in a PDF page and their associated metadata (authors, title). Once the boundaries of each article are detected, the recognized articles and metadata can be extracted using OCR techniques (which are out of the scope of this paper).
The approach is validated through experiments on our own constructed dataset, which contains more than 1,500 PDFs collected from different Algerian newspapers in Latin and Arabic scripts over five months. The results obtained are promising and allowed us to develop a tool that presents the results of PDF newspaper analytics to end users through dashboards and KPIs (key performance indicators), so that they can keep an eye on their presence in the media and their reputation.

Keywords
Newspaper Analytics, Pattern Recognition, Computer Vision, Deep Learning




1. Introduction
News analytics, media analytics, or media intelligence is a promising field that deserves more attention given its added value for companies, industries, organizations, and governments. News analytics refers to the tools, solutions, metrics, and indicators used to analyze the huge amount of published news, comments, posts, reviews, etc., in their different formats, mainly on websites and social networks. The advances brought by Web 2.0 have encouraged people to share more content on the web.

RIF’23: The 12th Seminary of Computer Science Research at Feminine, March 09, 2023, Constantine, Algeria
dihia.lanasri@proxylan.dz (D. LANASRI)
ORCID: 0000-0002-3794-844X (D. LANASRI)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




According to a Datareportal1 special report published in July 2020, a significant rise in social media activity was observed at the beginning of the COVID-19 pandemic.
   These active social media users have become more influential, and companies have become more attentive to their content. This content expresses user sentiments such as satisfaction, frustration, or delight [1] and has a strong influence on customers’ decisions [2]. As a consequence, companies spend a lot of effort on building and improving their business by deeply analyzing and understanding this content [3].
   Gathering and analyzing this kind of data is very important mainly for top management to
keep an eye on the evolution of the market, build competitive advantages, analyze the evolution
of business, understand the requirements of their customers or their prospects, etc.
   News analytics relies on complex natural language processing techniques and advanced linguistic analysis to process and understand text in context, and then to derive interesting qualitative and quantitative insights such as audience targeting, opinion analysis, fake news detection, context analysis, etc. This information is valuable for helping managers build data-driven strategies, improve their risk management tools, and make better business decisions.
   Newspaper analytics is a sub-field of news analytics which focuses on the analysis of data published in press newspapers by journalists in the form of articles. For many years, dedicated units have been created inside companies and organizations to collect, read, and analyze the articles published in a wide range of newspapers. Dedicated staff are hired just to perform the burdensome task of manually selecting and then cutting out the portions of newspapers that may interest managers. The emergence of the web and the development of technology have reduced the difficulty of this task thanks to the availability of electronic newspapers in HTML or PDF formats.
   Many solutions exist to collect, process, and analyze the content of web news using advanced analytics techniques such as machine learning, deep learning, and NLP. Hundreds of paid solutions are provided by vendors to facilitate this task, such as Refinitiv Machine Readable News Analytics2. However, a lack of works and solutions dealing with newspapers, mainly in PDF format, is observed in both industry and academia. Some open source and paid solutions (such as the pdf2text Python library) can extract the whole content of such documents, but only as raw, shuffled text, which is not really suitable for PDF newspapers. The latter are characterized by a specific layout of article blocks: in each page, many blocks are laid out, and each block holds the text of an article, its authors, and its title.
   Identifying the parts and the boundaries of each article, as illustrated in Figure 1, together with its related metadata such as title and author, is an imperative requirement in order to later extract articles one by one in text format from each PDF page. Once extracted, different NLP and text processing techniques can be applied to the text to understand and analyze it. Plenty of use cases can then be proposed, such as sentiment analysis of a given article [4], opinion mining, topic modeling, etc.
   The lack of academic and industrial works dealing with this issue motivates our proposal.

    1 https://datareportal.com/reports/digital-2020-july-global-statshot
    2 https://www.refinitiv.com/en/financial-data/financial-news-coverage/political-news-feeds-analysis/news-analytics
Figure 1: Newspaper PDF Blocks -Akhbar Alyoum; May 08, 2021-
In this paper, we aim to propose: (1) an end-to-end approach for conducting newspaper analytics on PDF newspapers; (2) a detailed approach for PDF newspaper pattern recognition, for both Latin-script and Arabic PDFs. To achieve these goals, we propose a complete approach which identifies, in each PDF newspaper page, the different articles together with their authors and titles. Deep learning models are trained to recognize these patterns in PDFs.
   To validate our proposal, experiments are conducted on our own constructed dataset, composed of more than 1,500 PDF newspapers in Arabic and Latin scripts and collected from many Algerian newspapers such as Echourouk, Ennahar, El Watan, Jeune Indépendant, etc. Interesting results are achieved, which encouraged us to integrate this solution into our newspaper analytics tool, which returns many KPIs and graphs via an interactive dashboard used by many companies and institutions.
   This paper is organized as follows: Section 2 summarizes the related work, Section 3 presents the detailed approach and the deep learning model, Section 4 details the conducted experiments and the achieved results, and Section 5 concludes the paper.


2. Related Work
The literature abounds with works on news and media analytics [5] and on computer vision and image pattern recognition [6], in both academia and industry. However, only a restricted number of works address press newspaper data analysis despite its added value [7]. In this section, we highlight the main works and solutions related to press newspaper analysis approaches (news discourse analysis is out of the scope of this paper) and the main pattern recognition solutions in use.

2.1. Newspaper analytics approaches
Press news analysis goes through many steps that belong principally to the NLP discipline. Many frameworks and approaches have been proposed to analyze news as a type of text [8] characterized by its language, structure, and context. Linguistic and grammatical operators are applied to press news to extract valuable semantic properties [9] that can be used for different analytics purposes such as sentiment analysis [4], financial reporting [10], or crime analytics [11], where data mining and lexicon-based techniques are used.
In general, analyzing news requires the creation of a rich dataset [12]. Such a dataset results from the manual or automatic collection of articles, headlines [13], or content [14] from archives, to be used for different analytics purposes. The direct use of these datasets is in most cases not advised: text preprocessing is an imperative step to enhance the quality of the text, extract features and tokens, and vectorize the extracted text to facilitate machine understanding.
[13] proposed a complete text mining approach to analyze headlines extracted from newspaper front pages; word cloud, word sentiment analysis, and word clustering case studies are performed after manual collection and preprocessing of the headlines. Newspaper articles have also been used to visually explore social networks [14], where the article texts were extracted manually from a German article corpus.
[15] proposed a complete tool to analyze and visualize newspaper front pages. This tool downloaded images of newspaper front pages, imported them into a local software application, and hand-coded them for coverage measures. However, it did not extract the content and could not identify the blocks of articles across the whole newspaper; it focused on coloring areas and computing their surfaces to show the coverage rate.
We found one work [7] which proposed a mechanism to identify photo areas and text areas in electronic newspapers. However, this work did not focus on identifying the block of each article and its metadata; it colored the whole text area with one color and the image area with another.

2.2. Pattern recognition solutions
Image pattern recognition is a field of computer vision that studies how automated systems can process and understand digital images or videos, with the aim of making them work as similarly as possible to humans [16]. Many solutions have been proposed for image pattern recognition. On the one hand, traditional approaches, called feature descriptors, extract image features such as edges, corners, and colors; examples are SIFT (Scale-Invariant Feature Transform) [17], SURF (Speeded-Up Robust Features) [18], and BRIEF (Binary Robust Independent Elementary Features) [19]. They use a series of mathematical approximations to learn a representation of the image. However, these solutions are complex and require more expertise [20]. On the other hand, many deep learning based solutions have been proposed in the literature, where artificial neural networks (ANN) and convolutional neural networks (CNN) are designed to train models that automatically extract image features.
   To summarize, many works proposed approaches to analyze news or media content. However,
few works are interested in press newspapers for data analytics. Moreover, the analysis of
the literature shows the absence of works dealing with PDF newspapers or proposing pattern
recognition solutions to identify the blocks of press articles in each PDF page. These findings
motivate our proposal.


3. Newspaper Patterns Detection Approach
To fill the gap identified in both academia and industry, we propose in this section a complete approach that identifies, in each PDF newspaper page, the different articles together with their authors and titles. This solution is widely required in the context of newspaper analytics.
   The approach allows us to extract and analyze the content of press PDF newspapers, understand their purpose, and then use metrics to present a dashboard that helps top management make decisions. This solution focuses on press newspaper pattern recognition applied to PDF newspapers in Latin or Arabic scripts.
   Our approach is illustrated in Figure 2, where the different steps are defined as follows:

Figure 2: Newspaper Analytics Approach

   1- Requirements Analysis: In any data-driven decision making or data analytics solution, defining and understanding the business requirements is an imperative step. This step allows us to define the needs and expectations of the product. The developers should analyze the analytics requirements expressed by the end users. Generally, these requirements are given as:
(1) the list of newspapers to be analyzed (e.g. Echourouk, Ennahar, El Watan, etc.);
(2) the frequency of dashboard updates, i.e. how often the dashboard should be refreshed with new articles (e.g. update the dashboard every day at midnight). This information is important because it allows us to define the automatic jobs which will run at that time of day to bring in the new articles and new PDFs;
(3) the list of metrics, KPIs, and graphs that should be included in the dashboard for an effective data analysis;
(4) the target keywords to look for (e.g. gather articles talking about Algeria and Economy).
The newspaper collection is a targeted action: we collect only what we need in order to optimize processing and storage.

2- PDF Newspapers Selection: Once the requirements are well established, a list of the different newspapers to be collected is defined and ranked by priority. At this step, developers are asked to prepare the list of links to these PDF newspapers, such as Echourouk (https://www.echoroukonline.com/), Ennahar (https://www.ennaharonline.com/), etc. Defining these links requires a small amount of manual work by the engineers.

3- PDF Newspapers Scraping: At this step, we define a Python script which scrapes and downloads the different PDF newspapers and then saves them in a local repository. The script is executed automatically (via a scheduled job). Scraping is an effective way to collect data, and many open-source libraries are available for this purpose; Selenium is one well-known library widely used by data engineers.
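As a minimal sketch of this step, assuming the direct PDF links prepared in step 2 are available (the URL and file naming below are illustrative; a Selenium-driven browser would be used instead when the link has to be reached through a dynamic page):

    import os
    import datetime
    import requests

    # Hypothetical mapping from newspaper name to the direct link of its daily PDF edition.
    PDF_LINKS = {
        "echourouk": "https://www.echoroukonline.com/path/to/daily.pdf",  # placeholder URL
    }

    def download_daily_pdfs(output_dir="pdf_repository"):
        os.makedirs(output_dir, exist_ok=True)
        today = datetime.date.today().isoformat()
        for name, url in PDF_LINKS.items():
            response = requests.get(url, timeout=60)
            response.raise_for_status()
            # Save each edition under a date-stamped file name in the local repository.
            with open(os.path.join(output_dir, f"{name}_{today}.pdf"), "wb") as f:
                f.write(response.content)

    if __name__ == "__main__":
        download_daily_pdfs()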

4- PDF to Image Conversion: The collected PDFs are automatically converted into images: each page of a PDF is converted into a JPG image and stored in a local repository for the next step.
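A minimal sketch of this conversion, assuming the open-source pdf2image library (a wrapper around the poppler utilities) is used; the paths and DPI value are illustrative:

    import os
    from pdf2image import convert_from_path  # requires the poppler utilities to be installed

    def pdf_to_jpg(pdf_path, output_dir="image_repository"):
        os.makedirs(output_dir, exist_ok=True)
        pages = convert_from_path(pdf_path, dpi=200)  # one PIL image per PDF page
        base = os.path.splitext(os.path.basename(pdf_path))[0]
        for i, page in enumerate(pages, start=1):
            # Store each page as a JPG image named after the PDF and the page number.
            page.save(os.path.join(output_dir, f"{base}_page{i:02d}.jpg"), "JPEG")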

5- Newspaper Patterns Recognition: This step is the main one in the process, where complex deep learning development is required to identify the different article blocks with their metadata (authors and title). To achieve this goal, we define the approach illustrated in Figure 3. This step is split into the sub-steps detailed below, and it is an iterative solution: at each cycle, many enhancements and improvements can be added, and the whole cycle is executed again.




Figure 3: Pattern Recognition applied to newspaper images

   5.1. Image Collection: in order to develop a deep learning model in the computer vision field, we need a voluminous dataset of annotated images. The different boundaries of articles, titles, and authors have to be drawn manually and their positions extracted.
5.2. Image Preprocessing: to achieve the best accuracy and precision, the collected images used in the training phase should be of good quality. The image quality criteria that we defined are clearness, luminosity, contrast, and zoom. To enhance image quality, we developed a Python script based on the PIL library which curates the images according to these criteria.
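The exact curation rules of our script are not reproduced here; the sketch below only illustrates the kind of PIL-based checks and enhancements such a script can apply, with placeholder thresholds and factors:

    from PIL import Image, ImageEnhance, ImageStat

    def curate_image(path, min_width=1200, brightness_factor=1.1, contrast_factor=1.2):
        img = Image.open(path).convert("RGB")
        # Luminosity check: brighten very dark scans (the threshold is a placeholder).
        if ImageStat.Stat(img.convert("L")).mean[0] < 60:
            img = ImageEnhance.Brightness(img).enhance(brightness_factor)
        # Contrast enhancement to make text regions clearer.
        img = ImageEnhance.Contrast(img).enhance(contrast_factor)
        # "Zoom": upscale small pages so that article blocks remain readable.
        if img.width < min_width:
            ratio = min_width / img.width
            img = img.resize((min_width, int(img.height * ratio)))
        return img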
5.3. Image Annotation: once the quality has been handled, the images are annotated using an image annotation tool, which allows drawing the boundaries of the article content, title, and authors (as described in Figure 1). YOLO files are then exported to be used in the next step; each YOLO file contains the four coordinates of every drawn boundary for each image extracted from the newspaper.
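As an illustration, assuming the exported files follow the standard YOLO text format (one line per box with a class index and normalised center/size values; some annotation tools export pixel coordinates directly, in which case the scaling below is unnecessary), the annotations can be loaded as follows. The class names are assumptions:

    def load_yolo_annotations(txt_path, img_width, img_height,
                              class_names=("article", "title", "author")):
        # Parse a YOLO annotation file into pixel-coordinate bounding boxes.
        boxes = []
        with open(txt_path) as f:
            for line in f:
                class_id, xc, yc, w, h = line.split()
                xc, yc = float(xc) * img_width, float(yc) * img_height
                w, h = float(w) * img_width, float(h) * img_height
                boxes.append({
                    "label": class_names[int(class_id)],
                    "x_min": xc - w / 2, "y_min": yc - h / 2,
                    "x_max": xc + w / 2, "y_max": yc + h / 2,
                })
        return boxes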
5.4. Deep Learning Model Development: this is the core of the solution. At this phase, Python code is written to train, evaluate, and tune the parameters of the deep learning model used to learn the article patterns obtained from the previous step. The deep learning model is based on Keras/TensorFlow, where a CNN model is trained on the annotated dataset, and we use metrics such as precision, recall, and loss to evaluate it. The CNN model we developed is composed of many layers. Three main layer types should be present in any CNN model: (1) the convolutional layer, the main building block of a CNN, which contains a set of filters (or kernels) whose parameters are learned throughout training; the size of the filters is usually smaller than the actual image, and each filter convolves with the image to create an activation map [21]; (2) the pooling layer, which reduces the dimensionality of the data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer; and (3) the fully connected layer, which connects every neuron in one layer to every neuron in another layer.
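The network trained in the project is not reproduced here; the following Keras/TensorFlow sketch only illustrates the three layer types listed above on a toy classifier (input size, layer widths, and class names are placeholders; an actual block detector also needs a regression output for the bounding-box coordinates):

    from tensorflow.keras import layers, models

    def build_toy_cnn(input_shape=(416, 416, 3), num_classes=3):
        # Illustrative architecture only, not the model used in this work.
        model = models.Sequential([
            layers.Input(shape=input_shape),
            layers.Conv2D(32, 3, activation="relu"),   # (1) convolutional layer: learned filters
            layers.MaxPooling2D(2),                    # (2) pooling layer: reduces spatial dimensions
            layers.Conv2D(64, 3, activation="relu"),
            layers.MaxPooling2D(2),
            layers.Flatten(),
            layers.Dense(128, activation="relu"),      # (3) fully connected layer
            layers.Dense(num_classes, activation="softmax"),  # e.g. article / title / author
        ])
        model.compile(optimizer="adam",
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])  # precision and recall can be computed from the predictions
        return model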
5.5. Model Pickling: once we achieve good results, the model is packaged or pickled so that it can be used to identify the patterns of any PDF newspaper.
5.6. Model Deployment: the model is deployed and consumed as an API. We send an image of a newspaper PDF page to the API, which identifies the patterns and returns the positions of each boundary.

6- OCR applied on each pattern: Once these patterns are detected by the previous model, OCR techniques can be used to convert the patterns into text. At this level, some available libraries can be used; otherwise, we develop a deep learning model which recognizes the characters in the images and transcribes them into text.
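As an example of an off-the-shelf option, the pytesseract library (a wrapper around the Tesseract engine) can be applied to each detected block; the Arabic ("ara") and French ("fra") language packs must be installed, and the box structure below matches the hypothetical format used in the earlier sketches:

    from PIL import Image
    import pytesseract  # requires the Tesseract engine and its 'ara'/'fra' language data

    def ocr_block(page_image_path, box, lang="ara+fra"):
        # Crop the detected block (coordinates in pixels) and transcribe it into text.
        page = Image.open(page_image_path)
        block = page.crop((int(box["x_min"]), int(box["y_min"]),
                           int(box["x_max"]), int(box["y_max"])))
        return pytesseract.image_to_string(block, lang=lang)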

7- Text News Analytics: The text extracted from each page is then analyzed: we check whether the keywords defined at step 1 appear in the text. If the text contains one of the keywords, it is stored in a database along with other metadata (publication date, authors, title, location of the image, page number, name of the newspaper, newspaper size). Moreover, we developed an Excel macro which highlights the keywords on the image.
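A minimal sketch of this filtering and storage step, assuming a MongoDB database (as used in our experiments); the connection string, database and collection names, and keyword list are illustrative:

    from pymongo import MongoClient

    KEYWORDS = ["Algeria", "Economy"]  # the keywords defined at step 1 (placeholder values)

    client = MongoClient("mongodb://localhost:27017")      # connection string is an assumption
    articles = client["newspaper_analytics"]["articles"]   # database/collection names are illustrative

    def store_if_relevant(text, metadata):
        # Keep only articles that mention at least one target keyword (simple case-insensitive match).
        matched = [kw for kw in KEYWORDS if kw.lower() in text.lower()]
        if matched:
            articles.insert_one({**metadata, "text": text, "matched_keywords": matched})
        return matched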

8- Data VIZ: This last step consumes the data stored in the database and presents it in the dashboard. We define a set of interesting graphs and metrics for the newspaper analytics field. Moreover, using this dashboard, the end user can click on a given image location to display the complete newspaper page with the keywords highlighted.


4. Experiments and Results
Our motivation for these experiments is guided by a real project of newspaper analytics in
the R&D&I of our company. We have been working on this project for more than three years.
Many deep experiments were conducted in these years before obtaining the accepted results
presented in this paper.
We collected our newspapers from 10 Arabic and Latin-script Algerian PDF newspapers, such as Echourouk, Ennahar, El Watan, El Massa, Dépêche de Kabylie, etc., over 5 months. In the end, we obtained more than 1,500 PDFs. Each PDF was converted into a list of JPG images to construct a dataset of more than 150,000 images (each newspaper contains on average 10 pages).
These images had to be manually annotated to train our pattern detection model. To this end, we hired a group of 4 annotators and obtained 10,000 annotated images. The annotation process is detailed in the previous section. In our case, we used an online tool which helped us annotate the boundaries of the articles, authors, and titles of each image and convert these annotations into YOLO files containing the pixel positions of the boundaries.
   We developed our CNN model using TensorFlow to detect patterns in newspapers. This model is composed of several layers, mainly convolutional, pooling, and fully connected layers. The obtained YOLO files were split into training, testing, and validation datasets.
To obtain better processing performance, we used our internal servers equipped with several GPUs; the training phase takes around 16 hours.
We obtained a model with a training accuracy of 85% and an inference accuracy of 81%. We used this model to infer results from unseen images; examples of detected patterns on new images are given in Figure 4 and Figure 5.
   The obtained results are promising. More experiments are being conducted to improve the accuracy of the model, and new images are being annotated for better results. The aim of this work is that, after these blocks of data (article, title, authors) are detected, an OCR module (out of the scope of this paper) transforms them into text, which is then analyzed using sentiment analysis models, topic modeling, opinion mining, etc. Moreover, the obtained results helped us create dashboards given to end users to evaluate their presence in the media and their reputation.

Figure 4: Newspaper PDF detected patterns using our solution -Jeune independant; April 21, 2021- (to zoom the image click on: https://bit.ly/3GrPkEq)

   A limitation: even though our work returned good results in terms of precision, many enhancements are still needed. In some cases the boundaries are not detected exactly, which degrades the OCR performance. In the case of complex articles where the text is split across several positions, our model cannot detect the article as one block. All these issues are the subject of our ongoing improvements.
   Using an OCR model, we extracted the text from the images to feed a newspaper database implemented in MongoDB. The latter is used to feed a dashboard containing many metrics and graphs useful for newspaper analytics in different companies.
5. Conclusion
News and press data are valuable sources of information which are widely considered in both academia and industry. The correct exploitation of these data helps improve many services and satisfy end users. PDF newspapers are one of these important sources and deserve more attention for effective analytics.
   The exploitation and analysis of PDF newspapers provide companies and institutions with much information about their brand image, their presence in the media, their coverage by newspapers, etc. It may also bring other advantages such as security management, risk analysis, competitor tracking, etc. All these data support proactive decision-making and preventive actions.
   With these motivations in mind, and because of the gap identified in academia and industry, we propose a complete end-to-end approach for newspaper analytics. Moreover, a special focus is given to the newspaper pattern recognition module, which consists of recognizing, in each PDF newspaper page, the different blocks of articles and their associated metadata (authors and title) using advanced deep learning techniques. This information is required for future use, after applying OCR techniques to transform the detected patterns into text.
   Interesting experiments were conducted on our own constructed dataset, which is composed of thousands of images extracted from the collected Algerian Latin-script and Arabic PDF newspapers. The results allowed us to trust our approach and to use it in the tool we developed for our customers. Our newspaper analytics tool provides an interactive dashboard with very interesting KPIs required by end users to analyze their coverage by newspapers.
   As a perspective, we are working on the OCR module, which extracts the text of each detected article block and its associated metadata. Interesting metrics can then be derived from the text analysis of the extracted articles, for a richer newspaper analytics experience.


Acknowledgments
We would like to thank Mr Farid GHANEM and Mr Samir TAGZOUT for their efforts towards the success of this project and the achievement of its objectives.
We would also like to thank our intern students, ABOUD Ibrahim, KETFI Hibet Allah, and BOUGUESSA Wail, who helped us prepare and annotate the image dataset.


References
 [1] K. S. Kumar, J. Desai, J. Majumdar, Opinion mining and sentiment analysis on online
     customer review, in: 2016 IEEE International Conference on Computational Intelligence
     and Computing Research (ICCIC), IEEE, 2016, pp. 1–4.
 [2] H. Zhang, L. Zhao, S. Gupta, The role of online product recommendations on customer
     decision making and loyalty in social shopping communities, International Journal of
     Information Management 38 (2018) 150–166.
 [3] J. Krumm, N. Davies, C. Narayanaswami, User-generated content, IEEE Pervasive Com-
     puting 7 (2008) 10–11.
 [4] S. Taj, B. B. Shaikh, A. F. Meghji, Sentiment analysis of news articles: a lexicon based ap-
     proach, in: 2019 2nd international conference on computing, mathematics and engineering
     technologies (iCoMET), IEEE, 2019, pp. 1–5.
 [5] D. Zeng, H. Chen, R. Lusch, S.-H. Li, Social media analytics and intelligence, IEEE
     Intelligent Systems 25 (2010) 13–16.
 [6] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, S. Yan, Sparse representation for computer
     vision and pattern recognition, Proceedings of the IEEE 98 (2010) 1031–1044.
 [7] P. Fyfe, Q. Ge, Image analytics and the nineteenth-century illustrated newspaper, Journal
     of Cultural Analytics 3 (2018) 11032.
 [8] T. A. Van Dijk, News as discourse, Routledge, 2013.
 [9] T. A. Van Dijk, News analysis: Case studies of international and national news in the press,
     Routledge, 2013.
[10] G. Mitra, L. Mitra, The handbook of news analytics in finance, John Wiley & Sons, 2011.
[11] I. Jayaweera, C. Sajeewa, S. Liyanage, T. Wijewardane, I. Perera, A. Wijayasiri, Crime
     analytics: Analysis of crimes through newspaper articles, in: 2015 Moratuwa Engineering
     Research Conference (MERCon), IEEE, 2015, pp. 277–282.
[12] M. Reason, B. García, Approaches to the newspaper archive: content analysis and press
     coverage of glasgow’s year of culture, Media, Culture & Society 29 (2007) 304–331.
[13] A. Hossain, M. Karimuzzaman, M. M. Hossain, A. Rahman, Text mining and sentiment
     analysis of newspaper headlines, Information 12 (2021) 414.
[14] A. Kochtchi, T. v. Landesberger, C. Biemann, Networks of names: Visual exploration
     and semi-automatic tagging of social networks from newspaper articles, in: Computer
     graphics forum, volume 33, Wiley Online Library, 2014, pp. 211–220.
[15] S. Costanza-Chock, P. Rey-Mazon, Pageonex: New approaches to newspaper front page
     analysis, International Journal of Communication 10 (2016) 28.
[16] C. Orhei, M. Mocofan, S. Vert, R. Vasiu, End-to-end computer vision framework, in: 2020
     International Symposium on Electronics and Telecommunications (ISETC), IEEE, 2020, pp.
     1–4.
[17] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International
     journal of computer vision 60 (2004) 91–110.
[18] H. Bay, T. Tuytelaars, L. V. Gool, Surf: Speeded up robust features, in: European conference
     on computer vision, Springer, 2006, pp. 404–417.
[19] M. Calonder, V. Lepetit, C. Strecha, P. Fua, Brief: Binary robust independent elementary
     features, in: European conference on computer vision, Springer, 2010, pp. 778–792.
[20] S. Khan, H. Rahmani, S. A. A. Shah, M. Bennamoun, A guide to convolutional neural
     networks for computer vision, Synthesis lectures on computer vision 8 (2018) 1–207.
[21] S. Mostafa, F.-X. Wu, Diagnosis of autism spectrum disorder with convolutional autoen-
     coder and structural mri images, in: Neural Engineering Techniques for Autism Spectrum
     Disorder, Elsevier, 2021, pp. 23–38.
Figure 5: Newspaper PDF detected patterns using our solution -Le Jeune Indépendant; April 18, 2021-
(to zoom the image click on: https://bit.ly/3QlHDEm)