
An Ontology-Based Model of Technical Documentation Fuzzy Structuring

Alexey M. Namestnikov

nam@ulstu.ru 1

Alexey A. Filippov

al.filippov@ulstu.ru 1

Valeria S. Avvakumova

valeria.avvakumova73@gmail.com 0

0 FRPC JSC 'RPA 'Mars''
1 Ulyanovsk State Technical University

63 74

The article addresses a method for structuring an electronic archive of technical documentation on the basis of a domain-specific ontology. It presents the formal model of the ontology, the model of a technical document, and an algorithm for clustering the archive content that is derived from a modified FCM method. The authors are the first to formalize a measure of distance between ontological representations of the archive technical documents based on comparing the complexity of hierarchy transformations, taking into account the different types of semantic relations between ontology concepts. The article also reports experimental results obtained on a subset of the technical documentation from the electronic archive of a large project organization.

Keywords: ontology, clustering, technical document, fuzzy model, graph

Introduction

A modern large project organization possesses a sizable electronic archive of design and engineering documentation. The greater part of it is stored as unstructured text files. In fact, such an electronic archive contains the entire experience and knowledge of a great number of highly trained specialists who have been developing and designing complex systems over many years. As the electronic archive grows, difficulties ensue in analyzing documents on the basis of predetermined properties. In addition, those involved in designing complex technical systems need skills in the semantic processing of large volumes of technical documentation and an intimate knowledge of the subject area. As a result, the important experience of previous developments fixed in electronic archives often goes unused, and the duration of the R&D cycle increases.

The solution to this problem can be based on intelligent methods and algorithms for the analysis of text documents, used to create the navigation structure of the electronic archive of technical documentation. The paper [1] suggests using ontologies in intelligent document analysis.

Taking into account the specific character of project knowledge leads to the necessity of forming a project organization ontology with a special structure that includes the features of a project process in the form of a system of subject area concepts, relations between these concepts, and interpretation functions. In this way the electronic archive should possess the properties of an intelligent system. At the moment, mathematical methods and algorithms that provide the means for structuring an electronic archive of technical documentation with consideration for its content and the specific character of the subject area of a project organization are not available.

Consequently, the central problems at present include the development of models, methods, and algorithms for constructing the navigation structure of the electronic archive of technical documentation on the basis of domain-specific clustering of partially formalized information resources.

In Section 1, the authors describe the formal model of the electronic archive ontology structure. Section 2 considers a technical document as an electronic archive resource and presents its ontological model. Section 3 proposes the algorithm for ontology-oriented indexing of technical documents. The measure of distance in the context of the ontology, relating to the level of designing standards, is formalized in Section 4. Section 5 offers some experimental results.

1 The structural model of an electronic archive ontology

The subject area of complex system designing places some constraints on the structure of an applied ontology. The rigid binding to standards and to the system life cycle models applied at different stages of designing implies that the ontology must consist of several levels, as indicated by Fig. 1.

Formally, the electronic archive ontology consists of two applied ontologies and may be written as equation (1):

$$O = \langle O^{D}, O^{LC}, R^{A} \rangle, \qquad (1)$$

where $O^{D}$ is the subject area ontology component, $O^{LC}$ is the ontology of designing systems life cycles, and $R^{A}$ is a unidirectional association relation between the ontology components. Let us consider the components of the electronic archive ontology (1) in more detail.

Let us write the domain-specific ontology as the following tuple:

$$O^{D} = \langle C, W, R^{D}, F^{D} \rangle,$$

where $C$ is the set of electronic archive concepts that makes up the bulk of the conceptual apparatus of automated system designing; $W = W^{S} \cup W^{P}$ is the set of subject area terms, where $W^{S}$ is the set of terms on the level of standards and $W^{P}$ is the set of terms on the project level; and $R^{D}$ is a set of relations:

$$R^{D} = \{R^{D}_{G}, R^{D}_{C}, R^{D}_{A}\},$$

where $R^{D}_{G}$ is an antisymmetric, transitive, irreflexive binary generalization relation ('subclass of'), $R^{D}_{C}$ is a binary transitive composition relation ('part of'), and $R^{D}_{A}$ is a binary relation of unidirectional association.

The set of concepts $C$ is defined by the following equation:

$$C = C^{S_1} \cup C^{S_2} \cup \dots \cup C^{S_k} \cup C^{P},$$

where $C^{S_i}$, $i = \overline{1,k}$, is the set of subject area concepts for the standards of the $i$th group, and $C^{P}$ is the set of subject area concepts extracted from the technical documentation of completed projects.

The set of interpretation functions is denoted as follows:

$$F^{D} = \{F^{D}_{WC^{P}}, F^{D}_{C^{P}C^{S}}\},$$

where $F^{D}_{WC^{P}}: \{W\} \to \{C^{P}\}$ is a function that maps the set of terms to the set of subject area concepts, and $F^{D}_{C^{P}C^{S}}: \{C^{P}\} \to \{C^{S}\}$ is an interpretation function on the set of concepts that makes it possible to pass to the level of concepts defined in standards.

The life cycle ontology, as a component of the tuple (1), consists of three sets and is denoted by the following equation:

$$O^{LC} = \langle M^{LC}, St^{LC}, R^{LC} \rangle,$$

where $M^{LC}$ is the set of models of designing systems life cycles and $St^{LC}$ is the set of life cycle stages.
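The tuples above map naturally onto plain records. The following Python sketch shows one possible in-memory layout; it is not the authors' implementation. The class and field names are hypothetical, relations are stored as sets of ordered pairs, and each interpretation function is simplified to a dictionary that maps one term (or project concept) to a single target concept, which is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DomainOntology:
    """Sketch of O^D = <C, W, R^D, F^D> (all names are illustrative)."""
    concepts_by_standard_group: dict[str, set[str]]   # C^{S_1} ... C^{S_k}
    project_concepts: set[str]                        # C^P
    terms: set[str]                                   # W
    # R^D: each relation is a set of (source, target) pairs.
    generalization: set[tuple[str, str]] = field(default_factory=set)  # 'subclass of'
    composition: set[tuple[str, str]] = field(default_factory=set)     # 'part of'
    association: set[tuple[str, str]] = field(default_factory=set)     # unidirectional
    # F^D, simplified to one-to-one lookups (an assumption).
    term_to_project_concept: dict[str, str] = field(default_factory=dict)
    project_to_standard_concept: dict[str, str] = field(default_factory=dict)

    @property
    def concepts(self) -> set[str]:
        """C as the union of all standard-level groups and the project level."""
        result = set(self.project_concepts)
        for group in self.concepts_by_standard_group.values():
            result |= group
        return result


@dataclass
class LifeCycleOntology:
    """Sketch of O^LC = <M^LC, St^LC, R^LC>."""
    models: set[str]
    stages: set[str]
    relations: set[tuple[str, str]] = field(default_factory=set)


@dataclass
class ArchiveOntology:
    """Sketch of O = <O^D, O^LC, R^A>."""
    domain: DomainOntology
    life_cycle: LifeCycleOntology
    associations: set[tuple[str, str]] = field(default_factory=set)  # R^A
```

Keeping the standard-level concepts grouped by standard series makes it straightforward to attach group-specific settings later, for example the relation weights used in the distance measure.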

Definition 1. The terminological environment of a concept is the set of terms (layers) from the technical documentation of completed projects in the electronic archive.

According to the paper [1], the semantic distance between a concept and the terms in a technical document should be defined on the basis of the idea of a semantic relation. This idea involves the use of a 'distance' between words.

The semantic coefficient of the relation between the concept and the term (the semantic distance) is defined by the following equation:

$$S\left(c_i^{P(S)}, w_j\right) = \frac{\displaystyle\sum_{occur\left(c_i^{P(S)},\, w_j\right)} \frac{1}{\exp\left(sentence \cdot (paragraph + 1)\right)}}{num_{occur}\left(c_i^{P(S)}, w_j\right)} \cdot \frac{num_{paragraph\ cooccur}\left(c_i^{P(S)}, w_j\right)}{num(totalparagraph)},$$

where $c_i^{P(S)}$ and $w_j$ are the $i$th concept on the level of projects (standards) of the ontology and the $j$th term; $sentence$ is the distance between the concept and the term expressed as a number of sentences; $paragraph$ is the distance between the concept and the term expressed as a number of paragraphs; $num_{paragraph\ cooccur}\left(c_i^{P(S)}, w_j\right)$ is the number of paragraphs in which $c_i^{P(S)}$ and $w_j$ co-occur; $num_{occur}\left(c_i^{P(S)}, w_j\right)$ is the number of co-occurrences of $c_i^{P(S)}$ and $w_j$; and $num(totalparagraph)$ is the number of paragraphs in the document.
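Under one reading of the formula above, the proximity term $1/\exp(sentence \cdot (paragraph + 1))$ is averaged over all co-occurrences of the concept and the term and then scaled by the share of paragraphs in which the pair co-occurs. The Python sketch below follows that reading. The function name, the input layout (one tuple per co-occurrence), and the use of a product inside the exponent are assumptions made for illustration.

```python
import math


def semantic_coefficient(cooccurrences, total_paragraphs):
    """Semantic coefficient between one concept and one term.

    cooccurrences: list of (sentence_distance, paragraph_distance, paragraph_index)
        tuples, one per co-occurrence of the concept and the term.
    total_paragraphs: number of paragraphs in the document.
    """
    if not cooccurrences or total_paragraphs <= 0:
        return 0.0
    # Sum of proximity scores over all co-occurrences.
    proximity = sum(
        1.0 / math.exp(sentences * (paragraphs + 1))
        for sentences, paragraphs, _ in cooccurrences
    )
    mean_proximity = proximity / len(cooccurrences)
    # Number of distinct paragraphs in which the pair co-occurs.
    paragraphs_with_pair = len({index for _, _, index in cooccurrences})
    return mean_proximity * paragraphs_with_pair / total_paragraphs
```

A pair that always appears in the same sentence (both distances equal to zero) gets the maximal proximity of 1, so its coefficient reduces to the share of paragraphs that contain the pair.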

After defining the semantic distances between the concept and the document terms, it is necessary to define the subset of terms that are appreciably semantically close to the concept. To define the terminological environment, we use, following the paper [2], the hypothesis of $\lambda$-compactness, which relies on the $\lambda$-distance. This distance takes into account a normalized distance $d$ between terms and the characteristics of the local density of terms around these elements.

Once the semantic distances between all pairs of terms within the terminological environment are defined, the graph connecting all terms can be built. After that, the longest edge (the graph diameter $D$) should be found. Consider two terms $w_i$ and $w_j$ and denote the length of the edge connecting them (the semantic distance) by $\alpha(w_i, w_j)$. We obtain the normalized distance between the terms, $d = \alpha / D$.

Further, let us find the shortest of the edges adjacent to the edge $(w_i, w_j)$. Its length is denoted by $\beta_{min}$. The ratio between the lengths of the adjacent intervals is denoted by $\tau^{*} = \alpha / \beta_{min}$. In order to normalize this value, let us find the largest value $\tau^{*}_{max}$ in the entire graph. The value $\tau = \tau^{*} / \tau^{*}_{max}$ is a normalized characteristic of the local density inhomogeneity of the set around the ontology terms $w_i$ and $w_j$. The quantity $\lambda = f(\tau, d)$ is the $\lambda$-distance between the terms $w_i$ and $w_j$. The paper [2] suggests using $\lambda = \tau^{2} d$ as such a distance measure.
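The construction above can be sketched in a few lines of Python. The sketch assumes that the graph is given as a non-empty dictionary of edges with positive lengths, that two edges are adjacent when they share an endpoint, and that the distance measure is the square of the normalized ratio multiplied by the normalized length. The function name and the input layout are hypothetical.

```python
def lambda_distances(edges):
    """Return the lambda-distance of every edge of a term graph.

    edges: dict mapping a pair of terms (u, v) to the positive edge length.
    """
    diameter = max(edges.values())  # length of the longest edge
    raw_ratio = {}
    for (u, v), length in edges.items():
        # Lengths of the edges that share an endpoint with (u, v).
        adjacent = [
            other_length
            for (a, b), other_length in edges.items()
            if (a, b) != (u, v) and {a, b} & {u, v}
        ]
        # An edge without neighbours gets a ratio of 1 (an assumption).
        shortest_adjacent = min(adjacent, default=length)
        raw_ratio[(u, v)] = length / shortest_adjacent
    largest_ratio = max(raw_ratio.values())
    return {
        edge: (raw_ratio[edge] / largest_ratio) ** 2 * (length / diameter)
        for edge, length in edges.items()
    }
```

For a chain of terms a, b, c, d with edge lengths 1, 2, and 4, the long edge that follows much shorter neighbours receives the largest value, which is what makes it a candidate boundary between groups of terms.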

In order to define the terminological environment of an ontology concept on the level of completed projects, it is necessary to mark an edge $(w_i, w_j)$ that can serve as the boundary between the terms related to the ontology concept and the terms that are not included in its terminological environment. With the use of the $\lambda$-KRAB algorithm, the final criterion characterizing the quality of such a partition of terms is given by the following equation:

$$F = h^{4} \tau^{2} d \to \max,$$

where $h = 2\,\frac{m_{+}}{m} \cdot \frac{m_{-}}{m}$ is the criterion of equal size of the two classes of terms. Here $m_{+}$ is the number of terms included in the terminological environment of the concept, and $m_{-}$ is the number of the other terms.

Thus, with the use of the $\lambda$-compactness hypothesis, the subset of terms that is included in the terminological environment of the concept under consideration is defined.

Every terminological environment $W_k$ of the concept $C_k^{P(S)}$ can be denoted by the following equation:

$$W_k = \{(w_{1k}, f_{1k}), (w_{2k}, f_{2k}), \dots, (w_{ik}, f_{ik}), \dots, (w_{l_k k}, f_{l_k k})\},$$

where $w_{ik}$ is the $i$th term of the $k$th ontology concept, $l_k$ is the total number of terms associated with the $k$th concept, and $f_{ik}$ is the normalized semantic weight of the $i$th term in the terminological environment of the $k$th concept (the normalized semantic distance between the term and the concept within a single ontology environment).

2 The ontology model of the technical document as an electronic archive resource

A technical document in the context of an electronic archive is considered as an information resource. Any technical document can be regarded as a container of partially structured information. On the one hand, we deal with a natural language text; on the other hand, a technical document is properly structured. The structure is defined in different standards.

We compare the frequency of occurrence of terms in one technical document with the frequency of occurrence of the same terms in the whole set of documents. It is assumed that a term is more significant if its frequency in the analyzed document far exceeds its frequency in the whole set of documents. Symbolically, this dependence can be written as follows:

$$f_i = \text{tf-idf}_i = tf_i \cdot \log \frac{N}{df(w_i)},$$

where $\text{tf-idf}_i$ is the relative importance of the term $w_i$ in a document, $tf_i$ is the normalized frequency of occurrence of the term $w_i$, $N$ is the number of documents, and $df(w_i)$ is the number of documents containing the term $w_i$.
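As a minimal illustration of this weighting, the Python sketch below computes the weights for a small tokenized collection. It assumes that the normalized term frequency is the term count divided by the document length and that the logarithm is natural; the text fixes neither choice, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf weights for a list of tokenized documents.

    documents: list of token lists.
    Returns one dict per document that maps each term to its weight.
    """
    n_docs = len(documents)
    # Number of documents that contain each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document gets a zero weight because $\log(N / df) = 0$, so it does not help to tell documents apart.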

An ontological model of a technical document is a document representation that corresponds to the state of the applied ontology of an electronic archive. It follows from [3] that the notion of an electronic document passport that includes a semantic index can serve as an analog of such a model.

A section of a technical document can be represented as follows:

$$s_i^{d} = \langle ch_{s_i^{d}}, C^{P}_{s_i^{d}}, C^{S}_{s_i^{d}} \rangle,$$

where $s_i^{d}$ is the $i$th section of a technical document $d$, $ch_{s_i^{d}}$ is the unique name of the $i$th section of the technical document $d$, and $C^{P}_{s_i^{d}}$, $C^{S}_{s_i^{d}}$ are subsets of subject area concepts defined in the context of the $i$th section of the technical document $d$.

Let us denote the $j$th term of the $i$th section of a technical document $d$ by $w_j^{s_i^{d}}$; then the set of terms of the $i$th section of the technical document $d$ can be defined as

$$W_{s_i^{d}} = \{w_1^{s_i^{d}}, w_2^{s_i^{d}}, \dots, w_{l_{s_i^{d}}}^{s_i^{d}}\},$$

where $l_{s_i^{d}}$ is the number of terms of the $i$th section of the technical document $d$.

With the use of the ontology interpretation function $F^{D}_{WC^{P}}: \{W\} \to \{C^{P}\}$ at the stage of technical document indexing, we obtain the ontological representation of the document section:

$$oV^{d}_{s_i^{d}} = \langle ch_{s_i^{d}}, C^{P}_{s_i^{d}}, C^{S}_{s_i^{d}} \rangle, \quad C^{P}_{s_i^{d}} \subseteq C^{P}, \quad C^{S}_{s_i^{d}} \subseteq C^{S} \mid St_k^{LC}.$$

The condition $C^{S}_{s_i^{d}} \subseteq C^{S} \mid St_k^{LC}$ means that the ontological representation of the document includes only those ontology concepts of the subset $C^{S}$ (on the level of the standards used in automated systems designing) that correspond to the $k$th stage of designing $St_k^{LC}$.

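With the use of the function $F^{D}_{C^{P}C^{S}}: \{C^{P}\} \to \{C^{S}\}$, we can get the final representation of a technical document section that takes into account the state of the applied ontology of the electronic archive:

$$oV^{d}_{s_i^{d}} = \langle ch_{s_i^{d}}, \{C^{P}_{s_i^{d}} \cup C^{S}_{s_i^{d}}\} \rangle, \quad C^{P}_{s_i^{d}} \subseteq C^{P}, \quad C^{S}_{s_i^{d}} \subseteq C^{S} \mid St_k^{LC}.$$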

A formal ontology model of a technical document can be defined as follows:

$$oV^{d} = \langle S_d, \{C_d^{P} \cup C_d^{S}\} \rangle.$$

Two main parts can be distinguished in the above equation: a structural one ($S_d$) and a conceptual one ($\{C_d^{P} \cup C_d^{S}\}$), the latter in the context of the completed projects of the archive and of the standards applied in the process of automated system designing with regard to the stage of a life cycle.

3 Ontology-oriented indexing of technical documents

The ontological indexing of a technical document is based on the following function:

$$F_{oV^{d}}: s_i^{d} \to oV^{d}_{s_i^{d}},$$

where $s_i^{d}$ is the $i$th section of a technical document $d$ and $oV^{d}_{s_i^{d}}$ is the ontological representation of the $i$th section of the technical document $d$.

Note that the normalized weight of a term $w_j^{s_i^{d}}$ in the $i$th section of a technical document $d$ is computed by the following equation:

$$f_j^{s_i^{d}} = \left(1 + \log tf_{w_j^{s_i^{d}}}\right) \cdot \log \frac{N}{d_t} \cdot \frac{1}{\sqrt{tf^{2}_{w_1^{s_i^{d}}} + tf^{2}_{w_2^{s_i^{d}}} + \dots + tf^{2}_{w_n^{s_i^{d}}}}}, \quad 1 \le j \le n,$$

where $f_j^{s_i^{d}}$ is the normalized weight of the term $w_j^{s_i^{d}}$ in the $i$th section of the technical document $d$, $tf_{w_j^{s_i^{d}}}$ is the frequency of occurrence of the term $w_j^{s_i^{d}}$, $N$ is the total number of documents, $d_t$ is the number of documents that include the term $w_j^{s_i^{d}}$, and $n$ is the number of terms in the $i$th section of the technical document $d$.
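The following Python sketch shows one way to compute these section-level weights. It assumes that the logarithmic term frequency is multiplied by the inverse document frequency and that the product is divided by the Euclidean norm of the raw term frequencies of the section; natural logarithms are also an assumption. The function and parameter names are hypothetical.

```python
import math


def section_term_weights(term_counts, doc_freq, n_docs):
    """Normalized weights of the terms of one document section.

    term_counts: dict term -> number of occurrences in the section.
    doc_freq: dict term -> number of documents that contain the term.
    n_docs: total number of documents in the archive.
    """
    # Euclidean norm of the raw term frequencies of the section.
    norm = math.sqrt(sum(count * count for count in term_counts.values()))
    if norm == 0:
        return {}
    return {
        term: (1 + math.log(count)) * math.log(n_docs / doc_freq[term]) / norm
        for term, count in term_counts.items()
        if count > 0
    }
```

Dividing by the norm makes weights comparable between short and long sections, while the logarithm dampens the effect of a term that is repeated many times in one section.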

Definition 2. The degree of manifestation of an electronic archive ontology concept is the degree of conjunction between its terminological environment and the set of concepts of a technical document fragment, provided that the terminological environment includes terms that are semantically close to the concept.

The degrees of manifestation of ontology concepts are computed for every section of a technical document with the use of the apparatus of fuzzy relevance [4]. The fuzzy relevance between a set $W$ (the set of ontology terms on the level of projects (standards) included in the terminological environment of a concept) and a set $C^{P(S)}$ (the set of concepts of the applied ontology on the level of projects (standards)) is denoted by $\tilde{\Gamma} = \langle W, C^{P(S)}, \tilde{O} \rangle$, where $W$ and $C^{P(S)}$ are crisp sets and $\tilde{O}$ is a fuzzy set in $W \times C^{P(S)}$. The set $W$ is the domain, the set $C^{P(S)}$ is the range, and $\tilde{O}$ is the fuzzy graph of the fuzzy relevance.

The crisp relevance $\Gamma = \langle W, C^{P(S)}, O \rangle$, whose graph $O$ is the carrier of the fuzzy graph $\tilde{O}$, is called the carrier of the fuzzy relevance $\tilde{\Gamma} = \langle W, C^{P(S)}, \tilde{O} \rangle$. In the context of an ontology, the graph $O$ defines the unidirectional associations $R^{D}_{A}$ between project concepts and terms in the ontology.

In order to find the value of concept dominance, a method is used that compares the terminological environment of every concept of the subject area ontology on the project level with the analyzed text. Note that the minimal fragment of the analyzed text is a sentence and the maximal one is the whole document, since different fragments of the text emphasize different concepts of the subject area [5].

The algorithm for computing the degree of dominance of a concept in a text fragment consists of the following steps:

Step 1. Defining the maximal degree of manifestation of ontology concepts in the text fragment $frp^{d}$ of a technical document $d$:

$$\hat{f}_{frp^{d}}\left(c^{P(S)}\right) = \max_{c} f_{frp^{d}}\left(c^{P(S)}\right).$$

Step 2. Defining the mean degree of manifestation of ontology concepts, excluding the concept with the maximal degree of manifestation (found at the previous step):

$$\tilde{f}_{frp^{d}}\left(c^{P(S)}\right) = \frac{1}{n-1} \sum_{i=1}^{n-1} f_{frp^{d}}\left(\hat{c}_i^{P(S)}\right),$$

where $\hat{c}_i^{P(S)} \in c^{P(S)} \setminus c_{max}^{P(S)}$, $c_{max}^{P(S)} = \arg\max_{c^{P(S)}} f_{frp^{d}}\left(c^{P(S)}\right)$, and $n$ is the number of concepts with a non-zero degree of manifestation for the text fragment $frp^{d}$.

Step 3. Defining the degree of manifestation of a concept in the text fragment $frp^{d}$:

$$f_{frp^{d}}\left(c^{P(S)}\right) = \hat{f}_{frp^{d}}\left(c^{P(S)}\right) - \tilde{f}_{frp^{d}}\left(c^{P(S)}\right). \qquad (2)$$

Equation (2) defines the quality of selecting a text fragment in a technical document in order to constrain the subject area concept that is fixed in the electronic archive ontology.
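The three steps can be sketched in Python as follows. The sketch assumes that the degrees of manifestation for one text fragment are already available as a dictionary and that the result of Step 3 is the difference between the maximum and the mean of the remaining non-zero degrees. Returning the maximum itself when only one concept is manifested is an additional assumption, since the mean is undefined in that case. The function name is hypothetical.

```python
def dominance_degree(manifestation):
    """Find the dominant concept of a text fragment and its degree.

    manifestation: dict concept -> degree of manifestation in the fragment.
    Returns (dominant_concept, degree); (None, 0.0) if nothing is manifested.
    """
    nonzero = {c: value for c, value in manifestation.items() if value > 0}
    if not nonzero:
        return None, 0.0
    # Step 1: the concept with the maximal degree of manifestation.
    top_concept = max(nonzero, key=nonzero.get)
    top_value = nonzero[top_concept]
    # Step 2: mean degree over the remaining n - 1 concepts.
    rest = [value for c, value in nonzero.items() if c != top_concept]
    mean_rest = sum(rest) / len(rest) if rest else 0.0
    # Step 3: the margin of the top concept over the rest.
    return top_concept, top_value - mean_rest
```

A fragment in which one concept clearly stands out yields a value close to its own degree of manifestation, whereas a fragment with several equally manifested concepts yields a value close to zero.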

Having applied the ontology interpretation function $F^{D}_{WC^{P}}: \{W\} \to \{C^{P}\}$, we obtain an initial ontological representation of each segment. The representation consists of initial sets of concepts on the levels of projects and standards that require correction.

The results of the experiments on extracting text fragments on the basis of genetic optimization show that, on average, 30% of the concepts account for 70% of the total degree of manifestation of all the concepts of a text fragment.

The final step of forming the ontological representation of a technical document is the use of the interpretation function $F^{D}_{C^{P}C^{S}}: \{C^{P}\} \to \{C^{S}\}$, which makes it possible to specify a set of concepts on the level of standards on the basis of the subset of ontology concepts found in the technical document. These concepts correspond to the completed projects.

After performing the above procedures, we get the final ontological representation for every $i$th section of a technical document.

4 The ontological measure of distance between documents

Let us consider the formal measure of distance between documents in the context of ontology concepts relating to the level of designing standards. Every ontological representation can be depicted as a tree (a hierarchy) of subject area concepts. Such a hierarchy can be defined by finding a minimal tree that includes all concepts from the ontological representation [2].

The Levenshtein distance between hierarchies can be defined on the basis of the cost of edit operations, which should be found for each type of semantic relation. Thus, the cost of an edit operation for a generalization relation is denoted by $\omega^{S_i}_{R^{D}_{G}}$ and that for a 'part of' relation by $\omega^{S_i}_{R^{D}_{C}}$. The index $S_i$ shows that the value of an edit operation belongs to the $i$th group of standards. In the case of clustering, the cost of an edit operation is in fact defined as the weight of the corresponding relation. The weight lies in the range between 0 and 1 and takes different values within every group of standards.

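The total edit distance between the hierarchies is defined by the following equation:

$$\rho_{oV} = \max_i \left( \sum_{s=1}^{m} \omega^{S_i}_{R^{D}_{G},\,s} + \sum_{l=1}^{n} \omega^{S_i}_{R^{D}_{C},\,l} \right),$$

where $\omega^{S_i}_{R^{D}_{G}}$ and $\omega^{S_i}_{R^{D}_{C}}$ are the costs of edit operations for the generalization and 'part of' relations within the $i$th group of standards, $i$ is the number of a group of standards, $s$ is the number of an added generalization relation, and $l$ is the number of an added 'part of' relation. The total edit distance is thus computed as the maximum of the edit distances defined for every group of standards.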

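A normalization coefficient $T_{oV}$ is defined on the basis of all semantic relations of a generalized hierarchy. Thus, the measure of distance between the ontological representations of technical documents can be defined as the ratio of the total edit distance $\rho_{oV}$ between the hierarchies to this coefficient:

$$\left\| oV^{d_1} - oV^{d_2} \right\| = \frac{\rho_{oV}}{T_{oV}}.$$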
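To make the measure concrete, the Python sketch below computes a simplified version of it; it is not the authors' algorithm. Each hierarchy is assumed to be a set of typed edges, the edit operations are taken to be the relations present in one hierarchy but not in the other, and the normalization coefficient is assumed to be the largest per-group total weight of all relations in the union of the two hierarchies. The function name and the data layout are hypothetical.

```python
from collections import defaultdict


def ontological_distance(hierarchy_a, hierarchy_b, weights):
    """Normalized edit distance between two concept hierarchies.

    hierarchy_a, hierarchy_b: sets of (child, parent, relation, group) tuples,
        where relation is 'subclass' or 'part' and group names a standards group.
    weights: dict (group, relation) -> edit cost in the range [0, 1].
    """
    def cost_per_group(relations):
        totals = defaultdict(float)
        for _child, _parent, relation, group in relations:
            totals[group] += weights[(group, relation)]
        return totals

    # Relations to add or remove to turn one hierarchy into the other.
    edit_cost = cost_per_group(hierarchy_a ^ hierarchy_b)
    # All relations of the merged (generalized) hierarchy.
    union_cost = cost_per_group(hierarchy_a | hierarchy_b)
    normalizer = max(union_cost.values(), default=0.0)
    if normalizer == 0:
        return 0.0
    # Maximum over the groups of standards, scaled to [0, 1].
    return max(edit_cost.values(), default=0.0) / normalizer
```

With this normalization, identical hierarchies are at distance 0 and hierarchies without common relations are at distance 1, so the values can be fed directly into a clustering algorithm that expects bounded distances.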

In order to create the navigation structure in the form of a nested set of clusters of technical documents, it is necessary to solve the problem of setting the weights of semantic relations between ontology concepts on the level of standards. As noted above, the weight coefficients are denoted by $\omega^{S_i}_{R^{D}_{G}}$ and $\omega^{S_i}_{R^{D}_{C}}$ for the generalization and 'part of' relations respectively.

Since these relations are used in the ontology concepts for different groups of standards, let us suppose that their optimal values for each group (in the context of their concept hierarchies) are generally different. Let us formulate the principle of the best value for the weight coefficients of ontology semantic relations.

Let $\{oV^{d}\}'$ be the set of ontological representations of the documents included in the model sample (the expert division of documents between classes). The following holds: $\{oV^{d}\}' \subset \{oV^{d}\}$, where $\{oV^{d}\}$ is the full set of ontological representations of the electronic archive technical documents. The ontology is defined by equation (1). On the level of standards, the generalization and 'part of' relations are defined on the basis of concepts with the corresponding weight coefficients $\omega^{S_i}_{R^{D}_{G}}$ and $\omega^{S_i}_{R^{D}_{C}}$, where $S_i$ is the $i$th group of designing standards used in ontology creation.

The set $\{oV^{d}\}'$ consists of two subsets, $\{oV^{d}\}^{+} \cup \{oV^{d}\}^{-}$, that correspond to the expert division of documents between two predetermined classes. The optimization problem for the weight coefficients of semantic relations consists in finding a set of coefficients

$$\{\langle \omega^{S_1}_{R^{D}_{G}}, \omega^{S_1}_{R^{D}_{C}} \rangle, \langle \omega^{S_2}_{R^{D}_{G}}, \omega^{S_2}_{R^{D}_{C}} \rangle, \dots, \langle \omega^{S_n}_{R^{D}_{G}}, \omega^{S_n}_{R^{D}_{C}} \rangle\}$$

for which the clustering coefficient defined by equation (3) is as low as possible.

$$F = \frac{\max\left(|K^{+}| + |K^{-}|,\ |\hat{K}^{+}| + |\hat{K}^{-}|\right)}{N} \to \min, \qquad (3)$$

where $K^{-}$ and $\hat{K}^{-}$ are the sets of absent documents in the first and the second clusters respectively, $K^{+}$ and $\hat{K}^{+}$ are the sets of redundant documents in the first and the second clusters respectively, and $N$ is the number of documents.

5 The analysis of computational experiment results on the basis of the FRPC JSC 'RPA 'Mars'' electronic archive documentation

The domain-specific ontology was used in the analysis of the results of computational experiments on the documentation of the FRPC JSC 'RPA 'Mars'' electronic archive. The ontology covers two series of standards used at the enterprise:

1. GOST 34. Information technologies. Open systems interconnections. (It consists of 108 ontology concepts at the level of standards.)
2. GOST 19. Unified system for design documentation. (It consists of 111 ontology concepts at the level of standards.)

The ontology level appropriate to the realized projects is based on the selection of FRPC JSC 'RPA 'Mars' electronic archive documentation that includes 5017 technical documents. The level consists of 81 concepts and 10078 unique terms comprising the terminological environment of concepts.

Thus, the domain-specific ontology consists of 300 concepts. They include 219 concepts at the level of standards used at the enterprise and 81 concepts, with 10078 unique terms, at the level of completed projects.

An expert of FRPC JSC 'RPA 'Mars'' prepared a sample of 5017 technical documents and grouped it into two main sections:

- the section based on the documentation type, which consists of 52 groups (GOST 2.601, 2.602, 2.102, 2.701, 3.1201);
- the section based on work sectors, which consists of 28 groups (products discussed in the documents).

In order to evaluate the quality of structuring the FRPC JSC 'RPA 'Mars'' electronic archive documentation, an index containing both ontological and traditional representations of technical documents (sets of 'term-frequency' pairs) was used. The indices were then structured in several variants, each followed by quality evaluation:

- structuring the traditional representations of technical documents with the use of Oracle Text tools;
- structuring the traditional representations of technical documents with the use of the modified FCM clustering algorithm;
- structuring the ontological representations of technical documents with the use of the modified FCM clustering algorithm;
- structuring the ontological representations of technical documents with the use of the modified FCM clustering algorithm with regard to the life cycle models of the designing system.

As indicated by Fig. 2, the best values of the evaluation function for the ontological representations with regard to the life cycle models of the designing system were obtained when structuring the sample of technical documentation by work sectors, since this structuring relies on the content of individual documents. When structuring by document type, Oracle Text outperforms the other variants.

The documentation structuring function of Oracle Text is based on a clustering algorithm that considers the frequency of term occurrence in documents. This algorithm works well for structuring by document type, where Oracle Text gives the best results. The modified FCM algorithm for clustering ontological representations of technical documents with regard to the life cycle models of the designing system provides the highest quality of structuring by work sectors, that is, with regard to content.

Conclusion

The computational experiments show that the results of structuring the ontological representations of technical documents with regard to the life cycle models of the designing system are 40% better than the results of structuring with Oracle Text. The time spent on indexing and structuring the ontological representations of technical documentation is, on average, 7% less than the total time spent on indexing and structuring its traditional representations. The ontological approach to indexing and structuring technical documentation thus makes it possible to structure the electronic archive in less time. Most of the time is spent on documentation indexing.

Acknowledgments

The research was carried out within the state assignment No. 2014/232 for the accomplishment of state works in the sphere of scientific affairs of the Ministry of Education and Science of the Russian Federation (theme 'Developing a new approach to the analysis of partially structured information resources').

1. Serrano-Guerrero, Olivas J.A., de la Mata J., Garces: Physical and Semantic Relations to Build Ontologies for Representing Documents. Fuzzy Logic, Soft Computing and Computational Intelligence (Eleventh International Fuzzy Systems Association World Congress IFSA), Beijing, China. Tsinghua University Press, 2005, vol. 1, pp. 503-508.

2. Zagoruyko N.G.: Prikladnye metody analiza dannykh i znanii [Applied Approaches to Data and Knowledge Analysis]. Novosibirsk, IM SO RAN Publ., 1999. 270 p.

3. Zagorulko Yu.A., Kononenko I.S., Sidorova E.A.: Semanticheskii podkhod k analizu dokumentov na osnove ontologii predmetnoi oblasti [A Semantic Approach to the Document Analysis on the Basis of Domain Ontology]. Available at: www.dialog21.ru/digests/dialog2006/materials/html/SidorovaE.html.

4. Bershtein L.S., Bozhenuk A.V.: Nechetkie grafy i gipergrafy [Fuzzy Graphs and Hypergraphs]. Moscow, Nauchnyi Mir Publ., 2005. 256 p.

5. Namestnikov A.M., Filippov A.A.: Metod geneticheskoi optimizatsii ontologicheskikh predstavlenii proektnykh dokumentov v zadache indeksirovaniia [The Method of Genetic Optimization of Project Documentation Ontological Representations in Case of Indexing]. Trinadtsataia nats. konf. po iskusstvennomu intellektu s mezhd. uchastiem KII-2012 [The XIII National Artificial Intelligence Conference with International Participation KII-2012]. Belgorod, BSTU Publ., 2012, pp. 84-91.