Semantic Annotation of Office Documents Boheng Fan, Ning Li and Yingai Tian Beijing Information Science & Technology University, Beijing, China Abstract Semantic annotations can be used to add semantic metadata to documents, which enables their effective, automated analyses. However, only a few document formats (e.g., HTML, PDF) currently support semantic annotations. In office document formats represented by OOXML or ODF, semantic annotations are not yet supported. On the basis of an in-depth study of office document formats, this paper draws on the mainstream semantic metadata identification method of HTML and proposes OOXML-oriented semantic annotation rules. These rules support the addition of semantic metadata to office documents in a standardized way. In addition, pre-processing and post-processing methods are presented, which enable existing office software to read, edit, and save office documents with semantic metadata without modification. Experimental results demonstrate that the proposed approach can add semantic metadata to office documents, and the added semantic metadata can be parsed into RDF triple- structured data by mainstream validators such as W3C or Google. This research can provide a good foundation for tasks such as office document classification, office document information retrieval, and information extraction. Keywords document format, semantic annotation, RDFa 1. Introduction 1 With the increasing number and widespread dissemination of various types of documents, the need to automatically analyze documents continues to grow. Their automated analyses generally require semantic annotation, which is a core document technology [1]. Semantic annotations use an Ontology or Vocabulary to add semantic markups to the specific content of a document, identify the corresponding concepts or entities, and establish the relationships between the tagged content and the vocabulary. They enable documents to have semantic metadata, which enables their automated analyses; in turn, this facilitates the sharing and effective use of document information through semantic retrieval, content extraction, and automatic classification of documents. Recently, semantic annotations have emerged that use markup languages (e.g., Microdata [2], RDFa [3]) together with the Schema.org [4] vocabulary. RDFa is a standard developed by the W3C. It can add the following attributes to elements: • @vocab: The default vocabulary currently in use. • @typeof: The concept (type) of the resource. • @property: A property that a type has. Search engines such as Google, Yahoo!, and Yandex optimize search results based on structured markups embedded in web pages, and they can extract structured data from HTML to understand the semantics of their content. One particular area witnessing the boom in semantic annotations is e- commerce, where online shops increasingly embed Schema.org annotations in HTML-pages to describe products. This enables search engines to easily identify products, drive traffic to appropriate websites [5], and present related advertisements. Such structured product data on the web create opportunities for new services such as product classification [6], product matching [7], and recommender systems [8]; emerging research areas include product knowledge graphs [9]. ICBASE2022@3rd International Conference on Big Data & Artificial Intelligence & Software Engineering, October 21- 23, 2022, Guangzhou, China © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) 150 Beyond web applications, text is also present in a wide variety of documents including office, fixed- layout, and complex mixed formats. A limited number of studies are available that focus on the semantic annotation of office documents. Tallis [10] provides a set of semi-automatic document annotation system Semantic Word for the DOC format. Carr [11] proposes the WICKOffice system. Fink [12] develops a Word plug-in that allows users to manually annotate professional terms in the biomedical field in documents. However, these annotation systems cannot support standard office document formats such as Office Open XML (OOXML) [13], Open Document Format (ODF) [14], Unified Office document Format (UOF) [15]. And they can not support multi-source semantic annotations. When semantically annotating an office document, the following requirements need to be met: 1) Ability to record semantic metadata in document formats; 2) Ability to support document format standards such as OOXML, ODF, UOF; 3) Ability to withstand the semantic metadata editing and modification; 4) Ability to maintain consistency between document content and semantic metadata during document editing; 5) Ability to support multi-source semantic annotation, allowing the same annotation content to correspond to multiple ontologies or vocabularies. These complex requirements have impeded the development of mature tools for embedding semantic annotations in office documents. To address the requirements summarized above, this paper proposes a method to support the semantic annotation of office documents. The method includes the extension method ofoffice document, the design of semantic annotation rules and support for semantic annotations in mainstream office software. The results of this paper have been adopted by the group standard T/CESA 1176-2021 "Information technology—Method for embedding semantic metadata into the electronic documents [16]". 2. Related Work Although the studies has limited results on the semantic annotation of office documents, there are some related research results that can be used for reference. For example, text in HTML is stored in a similar way to office documents, and there are mature semantic annotation technologies for PDF. HTML mainly adopts the method of embedding annotations, and the results are recorded as attributes of related elements. Tittel [17] embeds the vocabulary of RDFa-marked in web page content on medieval French. Beno [18] developed Doc2RDFa, an HTML rich text processor capable of automatically and manually annotating content. However, the system only targets documents in the legal field. Salem [19] enhances original online content with semantic annotations automatically derived from different knowledge sources. It is the transformation from a web document without semantic annotations to another document with RDFa semantics. Albukhitan [20]、Mbouadeu [21] use deep learning to automatically annotate HTML documents, and the results are stored in the original document using JSON-LD or Microdata. PDF mainly uses a separate annotation method, and the results are stored in separate data blocks in the document. Kim [22]、Eriksson [23] establish the mapping relationship between the content and an Ontology in PDF documents by using Extensible Metadata Platform (XMP) [24]. In the above research, the embedded annotation results are easy to manage and can ensure that the semantic metadata are consistent with the document content. However, they have a certain impact on the original document format. In contrast, separate annotation results need to be stored separately, which makes it difficult to ensure that the semantic metadata are consistent with the document content, but have less impact on the original document format. Among the embedded semantic annotation techniques, RDFa has stronger expressiveness and broader applicability; it can support multiple vocabularies [25]. Therefore, in view of the many advantages of RDFa application in HTML, this paper extends the office document format OOXML based on RDFa so that semantic metadata can be added to office documents. At present, the OOXML format is the most widely used in office documents, and it is often considered as a research object. Other XML-based office document format standards such as ODF and UOF are similar, and the method in this paper is also applicable. 151 3. Document Format Extension Before using RDFa for the semantic annotation of OOXML documents, it is first necessary to analyze the document structure of HTML and OOXML. The news in Figure 1 is used as an example to illustrate the analysis. At least 22 dead, more than 1,200 injured in Turkey earthquake By Isil Sariyuc, Hamdi Alkhshali and Amir Vera, CNN Updated: Sat, 25 Jan 2020 15:19:20 GMT Istanbul (CNN) At least 22 people died and hundreds injured in eastern Turkey after an earthquake rattled the region on Friday evening, according to authorities.The 6.7-magnitude quake struck near the town of Sivrice, in eastern Elazig province, collapsing at least 10 buildings, Turkish Interior Minister Sulyman Soylu said. Figure 1: A case of news text. For an HTML document, the main content is displayed by combining different block-level elements and inline elements. Block-level elements such as 'div', 'p', etc. usually represent paragraphs or fragments of text. Inline elements such as 'span' are usually used to refine the segmentation of local text. Using RDFa does not affect HTML browsing. The semantic annotation of HTML using RDFa is shown in Figure 2. There are two annotations, "Elazig province" and "Sulyman Soylu", which are the two entity names representing a place (i.e., typeof="Place") and a person (i.e., typeof="Person"). The vocabulary used is "http://schema.org/".
Elazig province,collapsing at least 10 buildings, Turkish Interior Minister Sulyman Soylu
Figure 2: Example of RDFa annotations in HTML. For an OOXML document, the storage is based on the ZIP compressed packaging format [26]. The main content of the office documents is recorded in the document.xml file in the document package. Paragraphs are stored as paragraph elements 'p' in the OOXML format; these are the basic units that constitute a document. The child elements of 'p' include the paragraph attribute element 'pPr' and run element 'r'. The text in the paragraph forms multiple 'r' according to different styles. The child elements of 'r' contain the sentence attribute element 'rPr' and the text element 't'. The text in the sentence is recorded in the 't' element. Semantic metadata are mainly annotated at the 'r' or 't' level. RDFa attributes can be embedded into 't' or other elements as shown in Figure 3. Elazig province ,collapsing at least 10 buildings, Turkish Interior Minister Sulyman Soylu Figure 3: Example of RDFa annotations in OOXML. For the general situation that is described above, this way of extending OOXML based on RDFa is feasible. However, OOXML is different from HTML. OOXML is mainly used for editing office documents, and its document data and semantic information have a relatively complex relationship. This complexity leads to the following issues: 1. The scope of semantic annotations in OOXML may be inconsistent with the scope of text elements in two cases: 152 a) A single text may correspond to multiple semantic annotations. In the above example, entities such as "Istanbul", "Turkey" appear under the same 't'. Since 't' is already the smallest unit of text, it is impossible to assign different RDFa properties to different entities. This is illustrated in Figure 4. Istanbul (CNN) At least 22 people died and hundreds injured in eastern Turkey after an earthquake rattled the region on Friday evening, according to authorities. Figure 4: Text element contains OOXML source for multiple entities. b) The content to be annotated may be scattered among multiple run elements 'r' or text elements 't'. In office software such as Microsoft Word, the texts of different formats are automatically divided into different run elements 'r'. It is difficult to obtain a complete annotation. "Isil Sariyuc" in the example illustrated in Figure 5 is placed in multiple run elements 'r' or text elements 't' Isil Sariyuc Figure 5: OOXML source code for entities scattered into multiple run and text elements. 2. HTML is used for viewing, there is no need for readers to support editing functions. In contrast, OOXML supports editing, so documents with embedded semantic metadata need to be able to be opened and edited by office software. In addition, the embedded semantic metadata need to be robust and withstand repeated editing. In summary, semantic annotation rules and office software that support semantic annotations for office documents are needed. 4. Semantic Annotation Rules In order to ensure the consistency of semantic annotations in OOXML within the scope of text elements, this paper proposes a model for office document annotations. The annotation model is shown in Figure 6. Figure 6: Annotation model. 153 The annotation model is described by an XML schema. The root element is "metadata", which has three attributes: @ID, @Seq, and @Begin. Any RDFa property can also be used. For the case where a single text corresponds to multiple semantic annotation content, multiple content to be annotated in the text element 't' can be placed into multiple 'metadata' element nodes for description; the relevant text is set as 'metadata' in the content of the element. Taking "Istanbul", "Turkey", and "earthquake" in Figure 4 as examples, the results of their annotations are shown in Figure 7. Istanbul (CNN) At least 22 people died and hundreds injured in eastern Turkey after an earthquake rattled the region on Friday evening, according to authorities. Figure 7: Annotation method in which a single text corresponds to multiple semantic annotation contents. In Figure 7, "Istanbul" and "Turkey" are place (rdfa:typeof="Place") names (rdfa:property="name"); "earthquake" is the name (rdfa:property="name") of the event (rdfa:typeof="Event"). These annotations utilize the vocabulary "http://schema.org/". For the case in which the content to be annotated is scattered in multiple run elements 'r' or text elements 't', multiple semantic annotations can be combined in sequence by specifying the @ID and @Seq attributes in the 'metadata' element, where @ID is the number of the marked entity and @Seq is the sequence number of the annotation. @ID combined with @Seq can realize multi-segment annotations of the same entity. @Begin indicates the start or end of an annotation. For example, "Isil Sariyuc" is the editor of this news. In some vocabularies, name is further refined into first and last names. Therefore, "Isil Sariyuc" is annotated as two entities, namely "Isil" and "Sariyuc". The annotation results are shown in Figure 8. 1 2 Is 3 il 4 5 6 Sari 7 yuc 8 Figure 8: An annotation method in which the content to be annotated is scattered among multiple run elements and texts. The two metadata element nodes described in lines 1-4 and 5-8 in Figure 8 have the same @ID to indicate that they collectively annotate one person entity. The annotations are delineated with @Begin="true" in the metadata element node of line 1 and @Begin="false" in the metadata element node of line 4. They have the same @Seq attribute (Seq="1") to indicate that they are for one person entity in the vocabulary "http://schema.org/" (rdfa:vocab="http://schema.org/" rdfa:typeof="Person" rdfa:property="name"). The first part of the annotation corresponds to the text "Is", "il" in the two sentence elements 'r'. Similarly, lines 5-8 indicate that the annotation content of the second part of the person entity corresponds to the text "Sari" and "yuc". 154 5. Semantic Document Editing and Processing The OOXML document that is extended by adding RDFa attributes to the document in OOXML format is called a semantic document. However, this extension prevents word processing software from opening and editing semantic documents properly. This paper proposes a solution to achieve seamless switching between semantic documents and ordinary office documents through pre-processing and post-processing methods. 5.1. Semantic Document Pre-processing and Conversion During pre-processing, the comment mechanism in the office document is used as the carrier for recording semantic metadata in the word processing software. The semantic metadata marked in the semantic document are stored in comments. Comments in office documents are annotation information attached to document content fragments. Comments are displayed independently and associated with the original text, pictures, and other content in office documents. The comments do not impact the typesetting style of the original document; they are convenient for people to observe and operate. At the same time, users can directly edit and modify comments when editing the original text of the document, which provides a robust editing environment. Therefore, using comments to store semantic metadata allows users to edit and modify semantic metadata in the editing process. It is also possible to keep the semantic metadata consistent during editing. In document.xml in the OOXML packaged file, comments are described by three elements: 'commentRangeStart', 'commentRangeEnd', and 'commentReference'. Among them, 'commentRangeStart' and 'commentRangeEnd' determine the starting position and ending position of the comment, indicating the range of the comment (comment.Range). The 'commentReference' element is associated with the content of the comment element in comment.xml; it is used to display the content of the comment in word processing software. In a comment, the above four elements have the same attribute @id. For example, after annotating "Istanbul" in Figure 1, document.xml appears as shown in Figure 9. Istanbul Figure 9: Annotation reference in document.xml. The annotation is associated with the comment.xml file structure as shown in Figure 10. comment content Figure 10: Annotation reference in comment.xml. To distinguish between general comments and comments containing semantic metadata, a special username "dsm" is used for semantic metadata comments. The assumption is that the special username is only associated with semantic comments. In the future, a special type for document comments can 155 be investigated to record semantic metadata in office documents. However, this requires revising the relevant standards through standards development organizations. The core idea of the pre-processing algorithm is to identify the scope of each semantic annotation. That is, the scope of the semantic elements’ 'metadata' determines the scope of the office documents comment. According to the recorded semantic metadata, the content of the office documents comment is determined. When determining the scope of comments in an office document, two cases can occur: (1) There are multiple semantic annotation entities under a single text element 't'; (2) An annotation entity is described by multiple run elements 'r' or text elements 't'. For case 1, the document structure needs to be adjusted in order to comply with the standard format of OOXML. Specifically, it is necessary to copy multiple 'r' and 't' element nodes and ensure that each 't' contains only one element 'metadata'. For case 2, the comment range can be determined directly according to the two element nodes’ 'metadata'. In this paper, the following algorithms are used to pre-process semantic documents. Algorithm1:CreateCommentNode Input:m as metadata, r as range, u as user Output:c as the comment node 1 Function CreateCommentNode(m, r, u) 2 Create a new comment node c; 3 c.content ← n.metadata; // n is the semantic node in r 4 c.range ← Range(t); // t is the text node in r; Range(t) means get range of t 5 c.user ← "dsm"; // Set user of c as "dsm" 6 Output c; 7 End Function Algorithm2:ForwardTransform Input:D as the semantic document Output:D as the transformed document in the standard format 1 Function ForwardTransform(D) 2 Let c-list be the comment list to be added into D; 3 c-list ← null; 4 For each paragraph p in D 5 For each run node r in p 6 If r contains multiple semantic nodes Then 7 Split r into multiple nodes r-list, each of which contains a single semantic node; 8 End If 9 End For 10 End For 11 For each semantic node n in D 12 If n spans multiple text nodes Then 13 Group the text nodes of same semantics into one segment; 14 For each text segment s which n spans 15 c ← CreateCommentNode(n.metadata, Range(s), "dsm"); 16 Add c into c-list; 17 End For 18 Else // n spans single text node t 19 c ← CreateCommentNode(n.metadata, Range(t), "dsm"); 20 Add c into c-list; 21 End If 22 End for 23 Add c-list into D; 24 Output D; 25End Function 156 Algorithm 1 is used to generate comments, including the range and content of comments. Algorithm 2 is the flow of the pre-processing algorithm. In Algorithm 2, step 2-3 declares the set of comments to be added and assigns the value to be empty. Steps 4-22 are the process of generating semantic metadata comments. Among them, steps 4-10 are used to adjust the document structure, which corresponds to the first case of determining the range of comments described above. In steps 11-22, if the annotated entity is described by multiple run elements 'r' or text elements 't', thenperform steps 13-17. After determining the start and end positions of annotated entity, call algorithm 1 to generate semantic comment; otherwise, according to the document structure generated in steps 4-10, Algorithm 1 is directly invoked to generate semantic comment. Steps 23-25 output the converted office documents conforming to the OOXML standard. 5.2. Semantic Document Post-processing and Conversion The purpose of post-processing is opposite to that of pre-processing. The core idea is to find each annotation with semantic metadata and its scope in office documents, so as to determine the content and scope of semantic annotations. There are also two cases: (1) the entity is described by a single text element 't'; (2) the entity is described by multiple run elements 'r' and text elements 't'. For case 1, the entity is placed into the 'metadata' element node for description, and the content of the annotation is placed into the 'metadata' attribute. For case 2, two 'metadata' element nodes need to be added. This is accomplished by setting the corresponding @Begin and inserting the nodes after the 'commentRangeStart' and before the 'commentRangeEnd'. The content of the comment is placed in the 'metadata' attribute whose @Begin value is true. In this paper, the following algorithm is used for post- processing semantic documents. Algorithm3:BackwardTransform Input:D as the document in the standard format Output:D as the semantic document 1 Function BackwardTransform(D) 2 For each comment node c in D 3 If c.user is "dsm" Then 4 Let n be the semantic node; 5 If c spans multiple text segment Then 6 n.id ← GenerateID(); // Generate ID 7 n.metadata ← c.content; 8 Insert n at the first text node; 9 Add end tag of n with n.id at the end of c.range; 10 Else // c spans single text node t 11 n.id ← GenerateID(); 12 n.metadata ← c.content; 13 Insert n at t; 14 End If 15 Remove c; 16 End If 17 End For 18 Output D; 19End Function In Algorithm 3, steps 2-17 are the process of adding the semantic element 'metadata'. In step 3, it is determined whether comments is a semantic metadata comments according to comments user name. If it is determined to be true, perform steps 4-15. Steps 5-14 are used to add the positions and content of semantic element 'metadata'. Among them, if the entity is described by multiple texts, perform steps 6- 9 to assign the comment content to the attribute of the semantic element 'metadata' whose @Begin value is "true"; otherwise, perform steps 11-13, assign the comment content to the attribute of semantic element metadata, and insert 'metadata' under the text element 't'. Steps 18-19 output the semantic document generated by conversion. 157 6. Experiment A semantic annotation tool is implemented for office documents based on C# and the winform designer. The tool has three functions: semantic annotation, pre-processing conversion, and post- processing conversion. Taking the news in Figure 1 as an example, it is annotated according to the afore-mentioned semantic annotation rules for office documents. This example involves multiple entity objects, such as "Istanbul", "Elazig province", "earthquake", and so on. When applying the proposed method for entity annotation, it is necessary to select the appropriate type from the Schema.org vocabulary according to the nature of the object, and also select the appropriate attribute. Taking the first paragraph of the news example as an example, the annotation results are shown in Figure 11. Is il Sari yuc Istanbul ( CNN) At least 22 people died and hundreds injured in eastern Turkey after an earthquake rattled the region on Friday evening- , according to authorities. Figure 11: Example of an extended OOXML document with semantic annotations. The above-mentioned extended OOXML document is pre-processed and converted into a standard OOXML document. Figure 12 shows the editing interface after the office software Microsoft Word opens the document. As can be seen in Figure 12, the seman-tic metadata in the semantic document are presented in the form of comments in the Word document. Users can readily view and edit these comments. The semantic user name is "dsm" and each semantic comment corresponds to a named entity. 158 Figure 12: Open preprocessed document in Microsoft Word. The semantic document structure obtained by the post-processing conversion of the office document shown in Fig 12 is consistent with that in Fig11. Semantic documents can be checked for RDFa syntax correctness using the Google Structured Testing Tool or the RDFa.info[27] validator. The semantic metadata of the RDF triple structure can also be extracted using GRDDL [28] or W3C's RDFa parser to generate Turtle, RDF/XML, or N-Triples files. The parsed structured data in Figure 13 shows that the semantic document can be verified by the RDFa.info testing tool. @prefix rdf: . @prefix schema: . _:1 rdf:type schema:Person; schema:name "Isil Sariyuce" . _:2 rdf:type schema:Place; schema:name "Istanbul" . _:3 rdf:type schema:Organization; schema:name "CNN" . _:4 rdf:type schema:Place; schema:name " eastern Turkey " . rdf:type schema:Event; schema:name " earthquake "; schema:startDate "Friday evening". Figure 13: Document structured data. Based on RDF data, knowledge graphs can be further generated. Figure 14 is a knowledge graph corresponding to part of the semantic annotation content of Figure 11. 159 Figure 14: Knowledge graph bsed on document semantic metadata. The above results indicate that the proposed method successfully realizes the semantic annotation of office documents, which supports the effective extraction of knowledge from documents. 7. Conclusion This paper takes OOXML as the research object, and analyzes in detail the differences between the OOXML format and the HTML format for text descriptions. On this basis, a method for the semantic annotation of office documents is proposed. According to the format characteristics of office documents, specific semantic annotation rules are designed. At the same time, an approach to achieve seamless switching between semantic documents and office documents through pre-processing and post- processing methods is proposed. This ensures the marked documents can be supported by word processing software after pre-processing. After post-processing, office documents with semantic annotations can be reformed into semantic documents. Experiments show that the use of the proposed annotation method can enhance the semantic representation ability of office documents; in turn, this can support the automated analysis of annotated documents. This paper mainly focuses on semantic annotation of office documents in OOXML format. The approach can be extended to ODF, UOF, and other formats in the future, which can broaden support for including semantic annotations in a wider variety of office document formats. In addition, the annotation object in this paper is text; the semantic annotation of other objects such as images, tables, formulas, and so on has not been considered. As their structure is more complex, further research and improvement of the annotation method are needed. Natural language processing technology can also be considered in the future to realize automatic or interactive semantic annotation and improve the efficiency of annotation. 8. Acknowledgements Project supported by National Natural Science Foundation of China: The Intelligent Analysis and Optimization Method for Reflowable Documents (61672105). 9. References [1] Fu Zhu. (2014) A Review of Semantic Annotation for Text Documents. Journal of the China Society for Scientific and Technical Information, 4:439-448. 160 [2] Sikos, Leslie. (2015) Mastering structured data on the Semantic Web: From HTML5 microdata to linked open data. Apress. [3] W3C. (2015) XHTML+RDFa 1.1 — Third Edition. https://www.w3.org/TR/xhtml-rdfa/. [4] Guha R V., Brickley D. and Macbeth S. (2016) Schema.org: Evolution of Structured Data on the Web. Communications of the ACM, 59,(2): 44-51. [5] Zhang Z., Bizer C., Peeters R. and Primpeli A. (2020) MWPD2020: Semantic Web Challenge on Mining the Web of HTML-embedded Product Data. Proceedings of CEUR Workshop, RWTH,2020. [6] Zhang Z. and Paramita M. (2019) Product Classification Using Microdata Annotations. International Semantic Web Conference, Springer, Cham, 2019, 716-732. [7] Peeters R., Primpeli A., Wichtlhuber B. and Bizer C. (2020) Using Schema.org Annotations for Training and Maintaining Product Matchers. Proc of the 10th International Conference on Web Intelligence, Mining and Semantics, 2020, 195-204. [8] Bai P., Ge Y., Liu F., and Lu H. (2019) Joint interaction with context operation for collaborative filtering. Pattern Recognition, 88: 729-738. [9] Dong X L. (2018) Challenges and Innovations in Building a Product Knowledge Graph. Proc of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, 2869-2869. [10] Tallis M. Semantic Word processing for content authors. Proceedings of the Knowledge Markup & Semantic Annotation Workshop, Florida, USA, 2003. [11] Carr L., Miles-Board T., Wills G., Woukeu A. and Hall W. (2004) Towards a knowledge-aware office environment. International Conference on Practical Aspects of Knowledge Management, Springer, Berlin, Heidelberg, 2004, 129-140. [12] Fink JL., Fernicola P., Chandran R., Parastatidis S., Wade A. AND Naim O. (2010) Word add-in for ontology recognition: semantic enrichment of scientific literature. BMC bioinformatics, 11, (1): 1-8. [13] ISO/IEC 29500-1:2016 Information technology — Document description and processing languages — Office Open XML File Formats. (2004) . [14] ISO/IEC 26300: 2015 Information technology — Open Document Format for Office Applications (OpenDocument) v1.2. (2015). [15] GB/T 20916 — 2007: UOF--Unified Office document Format (2007). China Standards Press, Beijing. [16] T/CESA 1176-2021 Information Technology Electronic Document Semantic Metadata Embedding Method. 2021. [17] Tittel S., Bermyudez-Sabel H. and Chiarcos C. (2018) Using RDFa to link text and dictionary data for Medieval French. Proc of the Eleventh International Conference on Language Resources and Evaluation, Miyazaki, Japan, 2018, 7-12. [18] Beno M., Filtz E. and Kirrane S. (2019) Doc2RDFa: semantic annotation for web documents. [19] Salem H., Mazzara M. and Elnaffar S. (2021) Automatically Injecting Semantic Annotations into Online Articles. International Conference on Advanced Information Networking and Applications, Springer, Cham, 2021, 617-624. [20] Mbouadeu S F., Keshtkar F. and Bukhari S A C. (2021) Semantically: A Framework for Structured Biomedical Content Authoring and Publishing. [21] Albukhitan S., Alnazer A. and Helmy T. (2018) Semantic annotation of arabic web documents using deep learning. Procedia Computer Science, 130, 589-596. [22] Kim H L., Kim H G. and Decker S. (2006) Semantic documentation using semantic web technologies and social web services. International Conference on Next Generation Web Services Practices, IEEE, 2006, 27-32. [23] Eriksson H. (2007) The semantic-document approach to combining documents and ontologies. International journal of human-computer studies, 65, (7): 624-639. [24] IX-ISO. (2012) ISO 16684-1-2012. Graphics Technology Extensible Metadata Platform (XMP) Specification Part 1: Data Modeling, Serialization and Core Properties. [25] Song Yu., Zheng Yi. and Wu Yan. (2009) Overview of RDFa semantic annotation techniques. Proc of National Conference on Computer Networks and Communications, 300-306. 161 [26] ISO/IEC. (2021) ISO/IEC 29500-2. 2021Document description and processing languages — Office Open XML file formats — Part 2: Open packaging conventions. [27] Buraga S C. and Panu A. (2013) A web tool for extracting and viewing the semantic markups. International Conference on Knowledge Science,Engineering and Management, Berlin, Springer, 2013, 570-579. [28] Hazayul-Massieux D. and Connolly D. Gleaning resource descriptions from dialects of languages (grddl). World Wide Web Consortium, W3C Coordination Group Note NOTE-grddl-20040413. 162