An approach for processing and document flow
automation for Microsoft Word and LibreOffice
Writer file formats
Pavlo V. Zahorodko1 , Pavlo V. Merzlykin1
1
    Kryvyi Rih State Pedagogical University, 54 Gagarin Ave., Kryvyi Rih, 50086, Ukraine


                                         Abstract
                                         The rapid growth of modern information technologies influences all aspects of human life. Companies all
                                         over the world are adopting new approaches to solve business problems, such as diverse automation, by
                                         using information technologies. Automation substitutes routine human work and noticeably increases
                                         efficiency. This research examines different approaches to document automation. Basic concepts of
                                         document processing using XML and existing solutions have been reviewed and a library based on
                                         LibreOffice UNO API has been designed and implemented. The library contains different helpers,
                                         wrappers, and processing tools to create an additional layer of abstraction. Moreover, the library is aimed
                                         at simplifying processing, working, and converting documents, which might considerably optimize a
                                         process of creating document reports generators.

                                         Keywords
                                         document processing, automation, library, OpenDocument, Office Open XML


1. Introduction
A significant amount of organizations, companies, and educational institutions deal frequently
with different document-related processes. Eventually, the growth of a company causes a demand
on optimizing processes. Documentation generators are one of the earliest and substantial
stages of business processes automation [1].
   According to McKinsey Global Institute [2], which is a part of the worldwide management-
consulting firm McKinsey&Company, from 9 to 26 percent of working hours could be saved by
automation. Additionally, with a midpoint of 15 percent, about 30 percent of working places
could be displaced by 2030, which is equivalent to 400 million full-time working days. In
addition, the research admits that about 50 percent of working time, which is spent on different
types of work, might be optimized with automation.
   Hospitals, as well as other organizations, work with an immense amount of documents.
According to Steve Wilson’s paper on Electronic Health Reporter website [3], every day doctors
have to deal with a large amount of different documents, starting from physician agreements

CS&SE@SW 2021: 4th Workshop for Young Scientists in Computer Science & Software Engineering, December 18, 2021,
Kryvyi Rih, Ukraine
" mongolzzz21@gmail.com (P. V. Zahorodko); ipmcourses@gmail.com (P. V. Merzlykin)
~ https://kdpu.edu.ua/personal/pvmerzlykin.html (P. V. Merzlykin)
 0000-0002-4017-7172 (P. V. Merzlykin)
                                       © 2022 Copyright for this paper by its authors.
                                       Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                                         66
and credential documents to time sheets and other organizational forms. Undoubtedly, it is hard
to handle or search through such a number of paper documents in comparison with digital ones.
Another, surely important, reason to use automation is working with patients. Digital forms
help to avoid human interaction, which has become crucial due to the COVID-19 pandemic. In
addition, digital forms might help simplify the whole process of requesting prescriptions.
   Whilst the described problem seems completely explored, it is not exactly so. Many existing
implementations are proprietary, that is to say you could not obtain their source code easily.
This leads to the fact that it is hardly possible to launch software locally for your company or
set it up preferably, for instance, choose a web-server or database. Moreover, the assortment of
the supported documents is usually meager and often includes only Microsoft Office formats.
Another hot topic is privacy. If processed documents contain users’ sensitive or corporate data,
you could not trust proprietary cloud services you are not able to control. Moreover, it could
be simply considered illegal in some countries to transfer personal data to 3rd parties servers.
Consequently, it is critical for document automation systems to allow users to have control over
their data. The aforesaid leads us to the reasons why we decided to develop our own document
management system as an attempt to solve the mentioned problems.


2. Overview
A review of scientific literature [4, 5, 6, 7] on the topic of document flow automation showed
that the topic is relevant. But due to the lack of access to the source code, we will examine only
those implementations that are open or provide, at least partially, free trial access to the service.
   Let’s take a look at the proprietary document processing systems.
   Hypatos [8, 9] is a workflow automation system which uses artificial intelligence, namely
Cognitive Process Automation (CPA) technology. It is a fairly high-quality and professional
tool. It supports AWS and cloud storage. Both API and free version are available.
   DocuPhase [10] is a system for automating business processes. It supports web forms that
allow one to generate ready-made PDFs. It also features a document management system with
user-friendly interface for processing and managing files shared among different departments.
   Docupilot [11] is an automation and documentation generation system. It supports working
with cloud services such as Zapier, DropBox, Docusign. It has a good templating engine with
conditional statements, tables and loops support. It could handle docx, pptx, pdf or a custom,
created with a WYSIWYG editor, template. It also supports email messages sending. There is
documentation and examples of using the internal API.
   Contactbook [12] is a platform for organizing, storing and processing documents. The service
supports docx and pdf files. An integration with 3000+ programs has been implemented. A
public API as available as well.
   Now let’s take a look at the open-source applications.
   One of these is Docassemble [13], an open-source system for working with web forms and
documents. The system is implemented with Python, YAML and Markdown. It is focused on
“Interview” questions. That is to say, one web form is divided into several questions and at the
end you can get a result. It supports YAML code in configuration files. With Markdown, one
could dynamically create PDF, RTF, and DOCX files.


                                                 67
  M2Doc [14] is an open-source plug-in for automating MS Word files processing. There
are add-ons for MS Word and Eclipse IDE. The generator takes input data from a generator
configuration .genconf. One is able to work with the original Java API.
  Summarizing this section, the reviewed systems are competitive and powerful tools. But,
they have the following disadvantages:

   1. Static patterns. Most tools use only one proposed pattern for fields filling. It means that
      only system prefix and suffix ought be used in templates. For instance, with the prefix {{
      and the suffix }}, field definition would look like {{field}}.
   2. Solely Microsoft Word formats support. Most mentioned systems don’t support LibreOf-
      fice file format or other similar formats. However Microsoft products usage is not always
      possible or acceptable by some companies.
   3. No internal converters. Sometimes it is needed to convert a document into different
      format than docx or pdf.

  Thus, it was decided to design our own system for documents processing that would satisfy
our needs.


3. Approaches in document processing
Document management system needs a core document processing tool. There are a few different
approaches in Microsoft Office and LibreOffice documents processing. We will overview the
most popular: XML processing and frameworks.
   Microsoft Office and LibreOffice documents are basically archives with all content inside.
Most of the files inside are XML files. They represent document’s structures, styles, metadata,
settings, and other configurations.
   Microsoft Office documents (doc and docx) have their own XML-based file format developed
by Microsoft, which is called Office Open XML (OOXML). Its structure is shown on the figure 1.
   The actual content of the document is stored in the word folder in the document.xml file.
   LibreOffice documents also have their own XML-based format called Open Document Format
(ODF) also known as OpenDocument [15]. ODF is developed by The Document Foundation.
The structure of the document is shown on the figure 2.
   In this case, the actual content of the document is stored in the content.xml file. Depicted
structures may vary depending on the complexity of a document. In comparison to docx
document, which has three folders files hierarchy, an odt document has a similar structure but
contains additional folders such as configurations.
   In the case of document generating on the basis of a template with custom keywords, the
keywords might be split by office software into different tags. Therefore, this approach needs
additional validation and handling of the keywords parts to merge them together.
   To inquire the issue let us look at a simple document that contains the following text:

1
${KeyWord}${KeyWord2} and ${KeyWord3}
${KeyWord4} some text.


                                               68
Figure 1: Docx file structure.


  ${ and } statements indicate the beginning and the ending of a keyword. All the paragraphs
have the same style family, namely Calibri 11 pt. However, things appear to be more surprising
in the content file. Figure 3 shows the XML representation of the first keyword.
   Microsoft Word splits text into different w:r elements called runs. Inside each run, we can
see a w:r tag that represents a text element. So one keyword in this example has 3 different
runs with different parts of the keyword. The second keyword is shown on figure 4.
   In this case, we have four different runs. The number of runs depends on the length of the
keyword and different special symbols. The same issue may be found in LibreOffice documents.
   For the LibreOffice document, we will use the same font family and font size. Right after
document creation, we get the solid not split paragraphs. The XML representation of the text is
shown on the figure 5.
   A problem may appear after editing the document with LibreOffice editor. Let’s change the
KeyWord2 keyword to KeyWord_New. The result of this replacement is depicted on the figure 6.
   As a result of a slight document editing, the XML changed significantly. New elements were
added and the keyword split into 2 parts, even though the keyword still has the same style. At
first glance, it may seem that the problem is in using the underscore character. However, to
dispel this assumption, we will return the original value to the keyword. The result is shown
on the figure 7.
   Even after original value recovery, we still have the XML code which is different from the
initial one. Moreover, two extra text:span elements appeared. In the case of the LibreOffice


                                              69
Figure 2: ODT file structure.


documents, text:span elements may be added as a consequence of updating or text changing
within the document.
   Another approach is using LibreOffice UNO API. LibreOffice provides Universal Network
Objects, which allows using this API in different programming languages, such as C++, Java,
Python, Perl, C#, JavaScript, and many others. This API supports working with different formats,
originally LibreOffice applications, but partly including support of Microsoft Office applications.
As a matter of fact, LibreOffice UNO API is almost completely compatible with OpenOffice.
   LibreOffice has a Frame-Controller-Model paradigm (FCM) that is similar to the Model-View-
Controller paradigm (MVC) [16]. The model contains the document data and methods to change
them. The controller views the status of the documents and manipulates screen presentations.
The frame contains the controller and knows which windows are being used. This approach
allows interacting easily with the application’s GUI and its functionality.
   LibreOffice UNO API is extremely functional and useful in document manipulation. However,
API documentation is bulky and might be time-consuming to read [17]. Due to this fact, we
decided to develop a library as a layer over the LibreOffice UNO API.


                                                70
Figure 3: XML representation of the first keyword.


   Returning to the split issue in XML documents, LibreOffice UNO API allows one to use GUI
and work with text in a simpler manner. It handles text as though it had been edited by user. In
addition, in comparison with the XML approach, this API provides access to styles and other
functionality, like pictures, converters and other GUI functions.
   We have chosen the Java programming language to work with the LibreOffice UNO API. Our
library provides an abstraction to process documents easier in comparison with UNO API, and
it does not require knowledge of the LibreOffice UNO API. As a part of this library, we have
implemented classes for XML manipulations. In more detail, this library will be discussed in
the next section.


4. Documents processing Java library implementation
The easier document processing approach is XML processing. It allows developers to implement
a simple keywords replacement. On the other hand, LibreOffice UNO API provides a rich set
of functionality for document manipulation. Nevertheless, it does not nullify the usefulness
of the XML approach. A combination of two different approaches allows choosing developers
which one is the most appropriate for their application. Usually, one ought to use two different
libraries or frameworks to implement it, but our library provides a simple interfaces to interact
with both solutions simultaneously.
   Our library’s purpose is to simplify access to the documents and their handling by providing


                                                71
Figure 4: The XML representation of the second keyword.


Figure 5: The XML representation of the text in the LibreOffice document.


an additional abstraction. The library has been implemented using the Java programming
language. The source code may be found at [18].
   The XML approach is quite simple to use. The main class is OdtDocumentPatternsAdjust.
It has two constructors. The first one is empty, and the second one with a Pattern parameter.
The Pattern class is a JavaBean class with two fields, the start of the pattern and the end of
the pattern.
   The OdtDocumentPatternsAdjust class implements the DocumentPatternsAdjust
interface which has two methods for adjusting the XML content. The methods are the following:

String adjustPatterns(File archive)


                                                72
Figure 6: The XML document after editing.


Figure 7: The document XML after return the original value.


String adjustPatterns(File archive, Pattern pattern)

   The actual processor of the XML content is the OdtXmlPatternAdjustProcessor class.
It contains different methods for content processing, most of which are private. One of the
public method is processXml. The algorithm of XML content processing is the following:

   1. Get the position of the start and the end of the pattern.
   2. Set offset to the position of the start of the pattern.
   3. Get text before the next tag. It is needed to get the part of the pattern before there will be
      the next tag like w:r or text:span.
   4. Move offset by adding the length of the found part of the pattern.
   5. Look for the next possible part of the pattern meanwhile skipping tags without actual
      text inside.
   6. When the next part of the pattern is found, get the text. At this step, the text will be
      extracted and inserted into the beginning of the pattern in the XML content.
   7. Check whether the offset is less than the position of the end of the pattern; if it is, then
      repeat every action starting with step 5, otherwise the next step.
   8. If the next part of the pattern could be found, repeat every action starting with step 1,
      and add the earlier found pattern into the ArrayList, otherwise return the list of the
      pattern.

  The LibreOffice UNO API part is larger and offers richer functionality. There are a few
essential classes. First of all, consider the DocumentManagerProvider class. This class is a
Factory and provides the implementation of corresponding DocumentManager depending


                                                73
on file extension. This class contains one static method called createDocumentManager and
has the following signature:

DocumentManager createDocumentManager(File file)

  DocumentManager is an interface that provides an ability to open a passed document. It has
the following method:

Document openDocument(File file);

  The openDocument method returns a Document instance, which is also an interface. This ap-
proach allows avoiding specific implementations for a developer. The Document class contains
the following set of methods:

void saveDocument(File file);
void saveDocument(String filepath);
void saveDocument();
void saveDocumentAs(File file, DocumentConvertTypes convertTo);
void saveDocumentAs(String filepath, DocumentConvertTypes convertTo);
void saveDocumentAs(DocumentConvertTypes convertTo);
void replace(String search, String replace);
void close();

   These methods allow converting documents to any supported format and replacing a particu-
lar value in a document. The LibreOffice UNO API supports a considerable amount of formats
to convert. All of them are described in Apache OpenOffice Wiki [19]. Partly, those types had
been moved to our library and stored in an enum called DocumentConvertTypes. We decided
to use enums to simplify the usage of constants that can be used as properties. In comparison
with final static variables, enums make it easier to specify what should be passed there.
   To provide more functionality, a lower abstraction layer is available.                       The
LibreOfficeUnoManager contains most of the implemented methods in the Document class.
This class provides basic methods to interact with documents without direct work with UNO
API. As return values, it uses API’s objects, so it may be considered an additional functionality
layer.
   There are small utility classes which might help in working with documents. Nevertheless, de-
velopers will rarely use them because most of the LibreOfficeUnoManager methods already
have been optimized with the use of those utility classes. Let us look into two useful classes. The
OdtDocumentProperties provides a wrapper for the PropertyValue class to simplify work-
ing with document properties. The OdtFilePathHandler helps to convert the initial File class
into an understandable LibreOffice UNO API string. The reason for OdtFilePathHandler
class existence is that LibreOffice UNO API works with Uniform Resource Identifier (URI). This
means that the file path should be started with the file:/// prefix and all backslash characters
should be replaced with the slash character.
   The next example demonstrates a basic usage of our library to replace keywords in the
document and convert it into an appropriate format:


                                                74
File file = ResourcesManager.getResourceFile("Document.odt");
DocumentManager documentManager =
         DocumentManagerProvider.createDocumentManager(file);
Document document = documentManager.openDocument(file);
document.replace("{Search}", "Value");
document.saveDocumentAs(new File(
    "C:/Users/hp/IdeaProjects/XmlDocumentProcessing/File.docx"),
    DocumentConvertTypes.MS_WORD_2007_XML);

  In order to work with text, we implemented a few specific classes.       The
LibreOfficeUnoManager class supports working with text using the findAllAsText
method. The method’s signature is the following:

public List<Text> findAllAsText(String search);

  This method returns a list of Text classes. The Text class supports text editing, creating
cursor, getting all paragraphs, setting font weight, and paragraph adjustment. The list of the
methods is shown on the Figure 8.
  An example of getting a text and performing some basic operations is shown below.

Text allDocumentText = libreOfficeUnoManager.findAllAsText("and").get(0);
allDocumentText.createCursor().gotoStartOfTheSentence(true);
allDocumentText.setCenteredAdjustment();

  The Cursor class is basically usual graphic cursor. In order to move through the text,
LibreOffice UNO API implements cursor as a main mechanism for this purpose. But, considering
the fact of complexity of some original UNO API methods, we have implemented a simplified
wrapper class. The list of its methods is shown on the figure 9.
  The names of most methods, such as gotoNextSentence, are intuitively recognizable. Every
type of goto moving has two different implementations. One does not have parameters and
another one has a Boolean parameter. The Boolean parameter is used for telling the LibreOffice
UNO API, whether should we stop and select current word or go to next one. Methods without
parameters basically just use methods with parameters by passing false to them.
  Also, to implement a more convenient way of Cursor class methods usage, goto methods
take advantage of Builder design pattern. The example of such use is shown below.

allDocumentText.createCursor()
.gotoStartOfTheSentence()
.gotoNextSentence()
.gotoNextWord()
.gotoPreviousWord();

  It is impossible to predict different components usage due to LibreOffice UNO API complexity
and massiveness. So, to simplify it for developers, all the classes contain corresponding methods


                                               75
Figure 8: The list of Text class methods.


which return the original LibreOffice UNO API objects. For instance, the Cursor class has
getTextCursor, which returns a XTextCursor object.
  In order to demonstrate the developed library usage, we implemented a cloud-based system
which aims to automate document flow.
  The application is divided onto frontend and backend parts. The development stack is shown
below:

    • Server development stack: Spring (Spring boot, Spring Security, Spring WebFlux, Spring
      JPA), jjwt (Java JWT: JSON Web Token for Java and Android), Connector/J (Mysql Java
      Connector).
    • Client development stack: Vue.js 3 (Vue Cli, Vue Router, Vuex, Vue i18n, Vue Class Com-
      ponent, Vue FontAwesome, SFC, Element Plus), Typescript, Javascript, Babel, Webpack.

  The backend has microservice architecture. In order to minimize the application load, we
have implemented 3 different microservices:

   1. Microservice for login and token generation.


                                             76
Figure 9: The list of Cursor class methods.


   2. Microservice for document processing (storage and document management).
   3. Microservice for generating documents according to the data.

  As a matter of application security and microservice communication, we have used the JWT
token as the most eligible.
  For signing up and signing in into the application, the login page may be used (figure 10).
  After this procedure, the user goes to the main page for handling documents, which is called
Document Management (figure 11).


                                              77
Figure 10: The login page.


   This page contains all the document information. To create a document, one should push a
side bar button which leads to a document adding page (figure 12). The index of the documents
is shown as a list, and each item has two different buttons:

   1. Generate Form. This button is responsible for form generating. These forms may be used
      as data origins for producing documents from templates.
   2. Delete. Remove the entry.

   Furthermore, our system supports custom template patterns, which means that documents
may contain any kind of keyword distinguishers.
   The forms are common way of document generating from an uploaded template. All created
forms are displayed and might be changed in the Form page (figure 13).
   The actual form page, which may be accessed by using the View Form button, contains all of
the extracted from the template document keywords. The example of a form is shown on the
figure 14.
   Considering the fact that key words are not always named human-friendly, it is also possible
to change their display name using the Edit button on the table. After submitting a form, the
user automatically receives the document.


                                              78
Figure 11: The Document Management page.


  At the moment, the following features have been implemented:

   1. The XML adjuster.
   2. Document handling without working with UNO API directly.
   3. Simplified classes for working with UNO API and its components.
   4. Utilities for working with URI.
   5. Converters.
   6. Constants classes implemented as an enum class.
   7. A cloud-based system for documents automation.
   8. Interfaces and classes which help to add custom implementations of most mechanisms of
      the library.

  We are planning to implement a cloud-based interface for working with documents without
coding. In addition, we have intention to provide the richest functionality for working with
LibreOffice UNO API.


5. Conclusion
Document processing may be complicated and confusing. The XML processing is more compli-
cated and limited. The reason is that handling raw XML is difficult, especially when a document
is massive.
   LibreOffice UNO API is one of the richest open-source APIs for processing documents. It
provides the necessary functionality to edit and process documents. In comparison with XML


                                              79
Figure 12: The document adding page.


Figure 13: The Form page.


Figure 14: The generated form page.


                                       80
processing, this approach is more advantageous. Moreover, the LibreOffice UNO API solves the
keyword splitting issue, or to be more precise, allows avoiding it.
  The developed library allows one to handle both of the described processing approaches. It is
easier to combine them regardless of whether you only need to process patterns or additionally
edit the inner structure of the document. Looking at the future, we are planning to complete
the development of this library. Converters of the library are useful tools because there are
not many solutions that could manage all the major formats, such as doc, docx, odt, html, and
others. As an application of this library, we are currently working on creating a cloud-based
document management system that will be able to help in storing, handling, and processing
documents. It is going to be discussed in the further reports.


References
 [1] IBM Corporation, The evolution of process automation, 2018. URL: https://www.ibm.com/
     downloads/cas/QAQMRGVN.
 [2] McKinsey&Company, Jobs lost, jobs gained:                  Workforce transitions in
     a time of automation,               2017. URL: https://www.mckinsey.com/~/media/
     BAB489A30B724BECB5DEDC41E9BB9FAC.ashx.
 [3] S. Wilson, How document automation is changing the healthcare industry, 2017. URL: https:
     //electronichealthreporter.com/document-automation-changing-healthcare-industry/.
 [4] M. Bhanja, N. Barik, Library automation: problems and prospect, in: 10th National Conven-
     tion of MANLIBNET organized by KIIT University, 2009, pp. 199–201. URL: https://www.
     researchgate.net/publication/323219596_Library_Automation_problems_and_prospect.
 [5] H.-Y. Hsueh, C.-N. Chen, K.-F. Huang, Generating metadata from web documents: a
     systematic approach, Human-centric Computing and Information Sciences 3 (2013).
     doi:10.1186/2192-1962-3-7.
 [6] S. T. Rosenbloom, W. Kiepek, J. Belletti, P. Adams, K. Shuxteau, K. B. Johnson, P. L.
     Elkin, E. K. Shultz, Generating complex clinical documents using structured entry and
     reporting, Studies in health technology and informatics 107 (2004) 683–687. URL: https:
     //pubmed.ncbi.nlm.nih.gov/15360900/.
 [7] M. J. A. Salomi, R. F. Maciel, Document management and process automation in a paperless
     healthcare institution, Technology and Investment 08 (2017) 167–178. doi:10.4236/ti.
     2017.83015.
 [8] Hypatos, Hypatos document hyperautomation for e2e doc processing, 2021. URL: https:
     //hypatos.ai/en.
 [9] C. Dilmegani, The ultimate guide to document automation in 2021, 2021. URL: https:
     //research.aimultiple.com/document-automation/.
[10] Docuphase, Enterprise automation software, 2021. URL: https://www.docuphase.com/.
[11] Flackon Inc., Document automation software, 2021. URL: https://docupilot.app/.
[12] Contractbook, Better contracts, 2021. URL: https://contractbook.com/.
[13] Docassemble, Docassemble, 2021. URL: https://docassemble.org/.
[14] Obeo, M2doc, 2021. URL: https://www.m2doc.org/.


                                              81
[15] Wikipedia, Opendocument, 2021. URL: https://en.wikipedia.org/w/index.php?title=
     OpenDocument&oldid=1025760709.
[16] Apache, Frame-controller-model paradigm in apache openoffice, 2021. URL: https://wiki.
     openoffice.org/wiki/Documentation/DevGuide/OfficeDev/Frame-Controller-Model_
     Paradigm_in_OpenOffice.org.
[17] A. Davison, Java libreoffice programming, 2021. URL: https://fivedots.coe.psu.ac.th/~ad/
     jlop/.
[18] CodePsi, GitHub - CodePsi/Lycorse-DPL: Lycorse Document Processing Library, 2021.
     URL: https://github.com/CodePsi/Lycorse-DPL.
[19] Apache, Framework/article/filter/filterlist ooo 3 0, 2021. URL: https://wiki.openoffice.org/
     wiki/Framework/Article/Filter/FilterList_OOo_3_0.


                                               82