DMS-XT: a blockchain-based document management system for secure and intelligent archival

Edlira Martiri
Lecturer / Blockchain architect
Department of “Statistics and Applied Informatics”, FE, UT, Tirana, Albania
edlira.martiri@unitir.edu.al

Gentjana Muça
Blockchain developer
Ambrogio, sh.p.k., Tirana, Albania
gentjanamuca@yahoo.com

Abstract

The first areas where blockchain technology dominated were the financial sectors, for the secure trading, exchange or supply chain of assets. Then cryptocurrencies started to exchange not only money, but also objects. They developed the concept of DApps (decentralized applications), introducing the third blockchain generation. Despite the many areas where blockchain can be used today, in this paper we focus on secure document management. The idea of the system we present, called DMS-XT, is not to store the whole content of a document, but to obtain an extract from the unstructured content of pdf documents using Information Extraction techniques, encrypt it, and then store it in the blockchain. Whoever wants to verify ownership and content can do so by retrieving and decrypting this information-view stored in the blockchain. To test the system's accuracy and performance we suggest applying it in Education, for the secure storage and quality assurance of diplomas, for the protection of authorship rights, and against plagiarized content.

1. Introduction

We continuously feel we live in times of fake news or alternative truths. Cases of stolen or misused intellectual property are also frequent in the news we read from different sources in our everyday life. More precisely, fake certifications, degrees and other documents can easily be found with one simple “search” over the internet. For example, one such case is a “diploma mill” centered in Pakistan, from which thousands of British nationals bought fake degrees. Around 3,000 qualifications, including master's and doctorate degrees, were sold during 2013-2014.

The above problem stands for diplomas issued on paper, and it is valid for digital attestations too [Bbc18]. Up to now, the education system worldwide has tried to protect its documents and make them trustworthy by applying cryptographic mechanisms, such as digital signatures. This mechanism guarantees integrity (the document has not been tampered with), authenticity (the owner can be easily verified), and non-repudiation (the owner cannot deny being the owner). These are all necessary properties of a system that stores, manages and protects documents, but attestations require very long-term functionality, which digital signatures cannot offer without a high degree of technical and procedural complexity, with the additional disadvantage of relying heavily on central authorities.

One possible solution to all these issues is the introduction of blockchain in Document Management Systems (DMS). The blockchain can be considered a distributed ledger, or a database, containing a list of continuous records, called blocks, linked to one another and secured by cryptographic algorithms (hash functions) [Gat17]. The underlying mechanisms of the blockchain rely strongly on cryptographic apparatus and mathematical mechanisms. We briefly describe some of them in Section 2.

The main goal of this paper is to provide the architecture of a system under development for diploma management, backed by a blockchain-based solution for document content verification. Among the main features of the system are a plagiarism tool and a statistics module. The idea is presented in the following main steps: (1) unstructured information is extracted from documents in .pdf format; (2) the information is converted to a structured form, resulting in a compact table that includes the necessary fields from the documents; (3) the table data is encrypted and stored in the blockchain. Further details are given in Section 3. Section 4 presents conclusions and further work.

1.1. State-of-art

Being a distributed ledger shared among all the nodes in a network, the blockchain was first used in the financial sector for exchanging and trading assets securely and efficiently, thanks to the short execution time of the transactions. The variety of applications ranges from currency exchange, payments, remittances, loans and crowdfunding to stocks and shares, digital bonds, gold, etc. Other sectors include healthcare, insurance, communications, peer-to-peer storage and identity management, and every day new areas explore the inclusion of blockchain in their information systems.

In 2014 blockchain developers made possible the rise of a new network where everyone could enter the global economy and exchange without intermediaries. Since then the interest in blockchain has become important at the government level, starting with Next in 2016 from the Russian Federation and Singapore in 2016 in collaboration with IBM; in November 2016 the World Economic Forum discussed government models developed in blockchain, and in 2017 Harvard pointed to blockchain as a groundbreaking technology. Blockchain and cryptocurrencies give developing countries a great opportunity to advance their economies. It would be a great opportunity for our country and similar ones too, even though blockchain has not yet been seriously considered by their respective governments [Mar18].

The first HEI (Higher Education Institution) to store academic certificates on a blockchain was the University of Nicosia, Cyprus, which uses the bitcoin network [Sha16]. Malta has also adopted blockchains for academic and professional certifications [Csm18]. Its project relies on a successful initiative, Blockcerts, developed by the MIT Media Lab Learning Initiative. Blockcerts is an open-source ecosystem for creating, sharing, and verifying blockchain-based educational certificates [Mit18].

Blockchain is extending fast in the Education area. A blockchain system based on Ethereum has been implemented at the University of Glasgow, Scotland, UK, to store student grades [Roo17]. Other solutions include TrueRec from SAP, which manages certificates of online courses [Trr18].

All these solutions store validated transactions in the blockchain, but not the data itself. In fact, storing large amounts of data in the blockchain would be very costly for the data owner. For that reason, many storage solutions were created to offer storage that is cheaper, faster, more secure, more distributed and more independent than cloud solutions. Some of the most popular solutions are: (1) StorJ: a blockchain-based cloud storage that secures documents by encrypting them [Sto18]. Documents are split into partitions, and each part is held by a peer in the network. (2) IPFS: a file system distributed across all peers of the network [Ipf18]. It offers a file exchange mechanism similar to BitTorrent and versioning similar to Git.

2. Generic blockchain characteristics and technicalities

Some important characteristics of blockchain technology show its importance and future perspective. Not only decentralization, but also smart contracts, consensus, unchangeability, open source and peer-to-peer operation are at the root of the algorithms on which blockchain is built. All these characteristics are essential in document management systems. Some definitions of the characteristics are [Swa15]:

1. Distribution: the design allows the distribution of blocks and their synchronization in the network.
2. Smart contracts: pieces of code executed on the blockchain, consisting of instructions written in a programming language, that determine the rights of each party in the network.
3. Consensus: before a transaction is executed, the parties reach consensus that the transaction is valid.
4. Data unchangeability: after a transaction is recorded it cannot be changed afterward.
5. Transparency: blockchain has added a new layer to the internet stack, the layer of trust, a characteristic that can be coded and included in the algorithms of this technology. Basically, trust can be achieved in a trusted network of nodes, guaranteeing its security.
6. Integrity: file or data integrity is a very important security principle, and in the blockchain world it allows users to verify that a data version is unchanged.

2.1. Hash functions

A hash function (H) is a mathematical function that has three attributes [Gat17], [Wan18]. First, it takes any string as input. Second, it produces a fixed-size output. Third, it is efficiently computable: given a string, one can compute the output in a reasonable length of time. In our blockchain explanation we need hash functions that are cryptographically secure. The cryptographic properties of hash functions are many, but we mention three in particular: (i) the function is collision-free; (ii) it has the hiding property; (iii) it is puzzle-friendly [Men96].

The first property we need from a cryptographic hash function is that it is collision-free. This means that it is infeasible for anyone to find values x and y such that x and y are different and yet the hash of x is equal to the hash of y [Mat18]. Collisions in fact exist in every hash function, because the input space is larger than the output space, but finding two different inputs with the same hash value is computationally infeasible within a reasonable time on regular machines. For a sufficiently large output space, no practical way of finding collisions in such a function is known.

The second property of hash functions is that they are hiding, i.e. given the output of the hash function H(x), there is no feasible way to find x [Bak95]. This means the function is irreversible: the output is completely different from the original data, which in turn is hidden and safe even if its hash is exposed.

The binary tree structure is called a Merkle tree after Ralph Merkle, who invented it. An important feature of Merkle trees is that membership of a data block can be proved by checking that the hashes match from the given block to its parent, from that parent to its parent, and so on until the root is reached. This way it can be verified that the block belongs to the tree. This membership verification can be done in logarithmic time (O(log n)).

The third property is to be puzzle-friendly. This means unpredictability and randomness, i.e. given any data there is no way to tell the value of the hash without calculating it [Tha17].

2.2. Hash pointers

A hash pointer is a data structure used extensively in systems implemented on blockchain. It is a simple structure that contains the address of the data it points to together with a hash of that data. Whereas a regular pointer gives a way to retrieve the information, a hash pointer allows one to get the information back and verify that the information has not changed. So, a hash pointer tells us where something is and what its value was [Med18].

All kinds of data structures can be built with hash pointers, from linked lists to binary trees, as long as they have no cycles. It suffices to substitute hash pointers for the regular pointers. A linked list built this way is the data structure called a “blockchain”; a binary tree using hash pointers is shown in Fig. 1.

Fig. 1. Diagram of hash pointers. (Source: [Cou18])

It can easily be seen that data can be added at the leaves of the tree. If anybody tampers with the data at earlier levels of the tree, it will be immediately detectable. That is why blockchains are “tamper-evident”.

3. DMS-XT: system architecture and component functionalities

The DMS-XT system aims to make the electronic management of student diplomas (in the first and second cycle of studies) more coordinated and simpler, and to increase the quality of their content by detecting possible plagiarized content from previously stored theses. The system is conceived to help especially universities in countries with a low level of informatization in higher education.

For example, in Albania the lack of an anti-plagiarism tool in some of the main universities has become a heavy load for lecturers in their attempt to assess the originality of the submitted material. Other issues are the frequency with which certain topics recur, an observed phenomenon (even though it is not documented), and the quality of the references on these topics. Moreover, the number of centers that provide ready-made works is on the rise and is a problematic phenomenon, evidenced by various media chronicles. As with the easy-to-find “diploma mills”, theses can be found on social networks for an affordable price.

The system is designed not to archive the whole material (thesis book) but to serve as a common platform between lecturers and students, to better coordinate the mentorship process and the quality assurance of the thesis. Students can read the topics suggested by department lecturers and then decide which research topic to pick. Lecturers can add, delete or edit research topics; check submitted drafts for plagiarized content; and approve a final thesis prior to defense day. Another role in the system is the department secretary, who assigns mentorship to students and administers the process.

3.1. System flow

Fig. 2 shows the process of thesis registration. The steps are as follows:

(1) File Upload. The secretary uploads a thesis file in .pdf format. It is assumed that, even though such files represent unstructured data, they all follow the same template, as approved by the internal regulatory bodies of each university. This policy is the main driver of the system logic.

(2) PDF Parser. The file is analyzed by this component, which tries to find the fields defined by the policy. The tool is grammar-free and is able to detect certain field names such as: Student Name, Mentor Name, Year, Program degree, Abstract, Keywords, and References.

(3) Information Extractor. The fields found by the parser help the IE module retrieve the correct information and store it in a table. The module automatically creates the information-view by means of Natural Language Processing (NLP).

(4) Information-view Builder. The module processes the extracted information and stores it in a structured form. The module is responsible for passing the view to the browser. The secretary confirms the correct creation of the view and, by selecting the name of the mentor, transfers this content to the mentor's account.

(5) Database. The view is stored in a central database. This module serves not only for the storage of the extracted information, but also during the plagiarized content check. Plagiarism is checked based on three fields: abstract, keywords, and reference list.

(6) Information-view encryption. The view is encrypted, and only this encrypted content is stored in the blockchain, after a student successfully defends and passes the thesis. The blockchain serves as a backup system every time a verification is requested. In this case, the view is decrypted and compared with the view in the database, and a decision is made on whether it is verified or not.

Fig. 2. Thesis registration and information-view creation, storage and encryption.

For the process of verification, a similar flow is followed. We do not present the full picture in this paper, but the difference lies in the database logic. The view of the requested student thesis is retrieved from the blockchain and decrypted; then a search in the database index provides the record, which is compared with the decrypted view. A positive match means the student is verified as the owner of the document.

4. Conclusions and future work

Managing and verifying the authenticity of documents and diplomas is very difficult, and because of their high technical complexity digital signatures cannot be a long-term solution. Blockchain technology offers a very stable solution for document management. It offers stability and security as it relies strongly on cryptographic apparatus and mathematical mechanisms. Many blockchain characteristics make it a good match for document management systems, such as: distribution, permanence and immutability, open source, consensus among the interested parties, etc.

The system includes a traditional approach, using a database, and a distributed approach, using blockchain. The main feature of the system is the plagiarism tool, which checks the source of the information by extracting three core parts: abstract, keywords and references. DMS-XT is a system that aims to manage the diplomas of bachelor and master students. The whole process follows a simple flow to allow document verification. Only the encrypted information-view is stored in the blockchain, and it is later used for ownership verification. The view is created automatically by means of Natural Language Processing (NLP).

We provided the architecture of the system and described the main components in the registration scenario. The system has completed the design phase of the SDLC. The implementation and testing of smart contracts is the first step of the implementation phase, followed by the other components of the intelligent processing module shown in Fig. 2: the PDF parser, the information extractor, the information-view builder and the encryption module.

As the generic literature suggests, and given the rapid developments of blockchain technology, we are optimistic about the integration of DMS-XT with other systems. After successful testing of the system in the context of HEI diplomas, we will further adapt the solution to other areas in need of a secure and transparent document management process, such as public administration, human resource management, etc.

A final word: the blockchain is not without its warts. New networks are growing, but current mechanisms have slowed their transaction speed. Blockchain as a technology still needs to be appropriately regulated in both Europe and the USA. When these regulations are fully developed and introduced, the costs will probably increase. Nevertheless, despite all the skepticism it has evoked, blockchain is an important achievement and a milestone in technology development.

References

[Bak95] Bakhtiari, Shahram, Reihaneh Safavi-Naini, and Josef Pieprzyk. Cryptographic hash functions: A survey. Vol. 4. Technical Report 95-09, Department of Computer Science, University of Wollongong, 1995.
[Bbc18] https://www.bbc.co.uk/news/uk-42579634, accessed September 2018.
[Cou18] “Bitcoin and Cryptocurrency Technologies”, Princeton University online course. Accessed March 2018. https://www.coursera.org/learn/cryptocurrency/home/welcome
[Csm18] Case Study Malta | Learning Machine. https://www.learningmachine.com/casestudies-malta.
[Gat17] Gates, Mark. Blockchain: Ultimate guide to understanding blockchain, bitcoin, cryptocurrencies, smart contracts and the future of money. CreateSpace Independent Publishing Platform, 2017.
[Ipf18] IPFS: https://ipfs.io/, accessed August 2018.
[Mar18] Martiri, Edlira; Muca, Gentjana. “A blockchain eco-system analysis for the Western Balkans countries and an economic perspective”, 7th International Conference on Computer Science and Communication Engineering, UBT, Kosovo, October 2018.
[Mat18] Mathworld website, accessed June 2018. http://mathworld.wolfram.com/Collision-FreeHashFunction.html.
[Med18] Medium, “Hash pointers and data structures”, accessed online, June 2018. https://medium.com/@zhaohuabing/hash-pointers-and-data-structures-f85d5fe91659
[Men96] Menezes, Alfred J.; van Oorschot, Paul C.; Vanstone, Scott A. Handbook of Applied Cryptography. CRC Press, 1996. ISBN 0849385237.
[Mit18] Certificates, Reputation, and the Blockchain, MIT Media Lab. http://certificates.media.mit.edu/
[Roo17] Rooksby, John, and K. Dimitrov. “Trustless Education? A Blockchain System for University Grades.” New Value Transactions Understanding and Designing for Distributed Autonomous Organisations Workshop at DIS 2017.
[Sha16] Sharples, Mike et al. Innovating pedagogy 2016: Open University innovation report. 2016.
[Sto18] StorJ: https://storj.io/, accessed August 2018.
[Swa15] Swan, Melanie. Blockchain: Blueprint for a new economy. O'Reilly Media, Inc., 2015.
[Tha17] Thakur, Mukesh. “Authentication, Authorization and Accounting with Ethereum Blockchain.” Master thesis, University of Helsinki, 2017.
[Trr18] Meet TrueRec by SAP: Trusted Digital Credentials Powered by Blockchain. Retrieved March 22, 2018 from https://news.sap.com/meet-truerec-by-sap-trusteddigital-credentials-powered-by-blockchain/
[Wan18] Wang, Maoning, Meijiao Duan, and Jianming Zhu. “Research on the Security Criteria of Hash Functions in the Blockchain.” Proceedings of the 2nd ACM Workshop on Blockchains, Cryptocurrencies, and Contracts. ACM, 2018.
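
As an illustration of the Merkle tree membership verification described in Section 2.1, the following is a minimal sketch and not part of the DMS-XT implementation. It assumes SHA-256 as the hash function H, and the function names (h, build_levels, membership_proof, verify) and the sample leaf values are hypothetical. Odd levels are padded by repeating the last node, a simplification that a production design would need to harden.

```python
import hashlib


def h(data: bytes) -> bytes:
    """SHA-256 digest; an assumed stand-in for the hash function H."""
    return hashlib.sha256(data).digest()


def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, from the hashed leaves up to the root."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [h(leaf) for leaf in leaves]
    levels = []
    while True:
        # Pad odd levels by repeating the last node (illustrative choice).
        if len(level) > 1 and len(level) % 2 == 1:
            level = level + [level[-1]]
        levels.append(level)
        if len(level) == 1:
            return levels
        # Each parent stores the hash of its two children.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def membership_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1  # the other child of the same parent
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path to the root using only the leaf and its proof."""
    digest = h(leaf)
    for sibling, sibling_is_left in proof:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == root


# Hypothetical sample data: five records, so the proof has 3 entries.
records = [b"thesis-A", b"thesis-B", b"thesis-C", b"thesis-D", b"thesis-E"]
levels = build_levels(records)
root = levels[-1][0]
proof = membership_proof(levels, 2)
print(verify(b"thesis-C", proof, root))           # unchanged record
print(verify(b"thesis-C (edited)", proof, root))  # tampered record
```

The verifier needs only the leaf, the root, and one sibling hash per level, which is why the check takes O(log n) hash computations, and why any change to a stored record is detected when the recomputed root no longer matches.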