Using UsingBMH Algorithm BMH algorithm to Solve to solve Subset subset of queries of XPath XPath Queries David Toth1 1 David Toth Dept. of Computer Science and Engineering, FEE, Czech Technical University, Dept. of Computer Science and Engineering, Karlovo Náměstí 13, 121 35 Praha 2, Czech Republic FEE, Czech Technical University, tothd1@fel.cvut.cz Karlovo Náměstı́ 13, 121 35 Praha 2, Czech Republic tothd1@fel.cvut.cz Abstract. Boyer-Moore-Horspool (BMH) algorithm is commonly used to solve text searching problems. In this paper is used to solve the constraint subset of XPath queries offering effective algorithm to resolve such queries. XML can be grasp as text file contains tags and its content; that kind of view to XML is used in this work. We constraint XML document content and possible XPath que- ries. This work focus on key ideas and problems appertaining XPath queries execution using text search algorithm BMH. 1 Introduction Many papers were published appertaining the XPath query effective execution issues. Many authors concern many aspects of XML and XPath. In this paper we look in detail how to solve part of XPath queries involving only path of nodes in the query but not any functions. Another extensions are possible and in future works can be exploited in more details; see into Conclusions and future works for more informa- tion. Why bother with XML and XPath? XML is special text file format, meta- language, markup language (using tags instead of commands); XML can be repre- sented by a tree where nodes represent tags and connections between nodes represent relationship super-element and embedded element. XML can be used for many rea- sons. Especially useful is using XML as data interchange format in internet environ- ment. It is conceived as a tool assuring compatibility between different applications possibly running on different platforms and using different inherent data paradigm and formats. Today in B2B applications is XML widely used. Of course XML has many other purposes. For example in XML documents can be stored content of any magazine or generally paper. Such kind of XML documents are known as document oriented XML; i.e. there is huge amount of text contrary to data oriented XML documents where this is not generally true. In data oriented XML documents are many tags and often as much as data itself. Finally the tag information is itself also valuable information. Document oriented XML are typically intended to use by humans in opposite to data oriented XML documents which are normally designated for computers. J. Pokorný, V. Snášel, K. Richta (Eds.): Dateso 2007, pp. 81–88, ISBN 80-7378-002-X. 82 David Toth Here we try to prove that whenever data oriented XML documents are supposed we can treat them as plain text files finding queried node(s) using BMH algorithm. Only when the amount of text representing structure information (for simplicity let us consider just tags) in XML file be the same order of magnitude as amount of text representing stored data itself. This is big assumption which has to be preserved to can legally speak about effective XPath query resolution. The next section introduce basic BMH algorithm principle and show example of using this algorithm. Another section acquaint XPath queries and selected subset further elaborated. After come section named BMH in XPath where are introduced and explained all considered XPath query execution algorithms based basically on BMH from concrete queries to generic ones. Finally in Conclusions and future works section assess gained results and procedures and outline extensions and future works directions. 2 BMH Algorithm After short introduction here we will treat BMH algorithm basics. Here we deal with BMH (Boyer-Moore-Horspool) algorithm commonly used to solve text searching problems. The basis of this approach inhere in a little preprocess- ing of searching pattern and then searching the text forward but backwards from pattern. That is the trick. That is how can be raise effectivity. There can be hopped (or jumped) many irelevant letters (or generally parts of input stream). The number of reading is therefore typically much more lesser than whole length of stream.. 2.1 Preprocessing issues What must be done during preprocessing phase? For any pattern there is a need for create hopping table where must be pre-counted how many letters (or input symbols) can be jumped after input sign and the appropriate sign from pattern has been read and they do not match. Let us show an example. Suppose input stream as: „Where are you going young man?“ Furthermore assume we trying to find pattern: „you“. The answer obviously is that the pattern is presented 2 times 1st at position 11 and 2nd at position 21. Naive approach lead us to read whole text i.e. 31 characters and do at least the same number of comparisons. When we try to use BMH we have to resolve how much can be hopped when some letter is achieved. The preprocessed table should look like this: youx 2133 Which means that whenever compare of input letter and pattern letter do not match we should jump ahead of 2, 1 or 3 characters depending on what letter is on input Using BMH Algorithm to Solve Subset of XPath Queries 83 what is talking about the 1st line in the above table. Note that x mean any other sym- bol than enumerated before x – in our case whatever symbol but y, o, u. So the above example will be, using BMH algorithm, take just 16 readings. 2.2 How BMH algorithm operate? First necessary variables: I position in input stream pat pattern 2.3 Semiformal BMH description (BMH algorithm) 1. preprocessing – the tab is calculated 2. begin 3. Go to the I where I = len(pat) (initial position in the input text is set to the length of pattern) 4. repeat (until in input stream the end is not reached) 5. compare letters backwards from position I with pattern until they fit each other 6. when pattern is matched then 1. save the position of 1st matched occurence – store I into the array 2. go to the position of possible next occurence which is: I = I + 2*len(pat) 7. (ELSE) when pattern is not matched then 1. go ahead to the fartest position possible which is I = I + tab[K], where K is letter of input text where was expected any different symbol 8. end. Algorithm notes. len means function returning length of its string argument and tab preprocessed table containing length of possible hops for all letters. Useful source where formal BMH algorithm description is introduced, explained and described using an example is for example [2]. After BMH algorithm basics were explained next section deal with XPath query which we will subject of next investigation. 3 XPath queries Now after the slim repetition of how BMH algorithm runs we will focus on XPath queries. This section deal with investigated XPath query subset which is briefly intro- duced. XPath is derived from XML Path which suggests us it is about determining paths in XML documents. In fact it is more complicated. Functions and other features must be considered furthermore. But for the purpose of this paper we focus on simple que- ries involving just nodes and paths. 84 David Toth Generally there are two possibilities how to determine path using XPath. So called relative paths and absolute paths. Absolute path always syntactically begins with the sign for root of the XML document which is denoted as '/'. Absolute path is inevitably derived from the root of the document. Result of such query can be none node found, exactly one node in specified path or set of nodes of same name in specified path. Relative queries on the other side mean that somewhere in the query this sequence '//' is placed which means no matter of deep where the node will be found. Such kind of a query is practically eminently useful but it is this kind of queries which means prob- lems with effectivity. XPath queries examples follow. First absolute queries: /rootnode/node1/node3 – the result are all nodes appertaining elements which are descendants of element which are descendants of element. /rootnode/node6 – the result of this XPath query are all elements which are descendants of element. Let us see now relative queries: //node2 – such XPath query gives all elements in valid XML document. //node4/node5 – the answer is: all elements whichs parents are . And finally we will look at mixed queries: /rootnode/node7//node8/node9 – the return all elements whichs parents are elements placed however deep inside the element which has to be direct descendant of element. Many useful properties of XPath were omitted. We will treat about extensions in part Conclusions and future works. XPath query somehow specify the path in tree struc- ture of XML document. We will not need this imagination anymore. In this paper linear conception of XML (XML viewed as a text file) will be sufficient. All about XPath can be found in [4]. After XPath query subset was delimited we will in next section focus on how to this subset can be implemented using BMH algorithm. 4 BMH in XPath Now after we briefly examined BMH algorithm and supposed XPath queries we will look at how to use BMH when evaluating XPath queries. First the easiest case. Let us consider the following XPath query: //nodeX. So the result of such a XPath query should be all elements named nodeX wherever inside the XML document. Of course we suppose the input XML document is well-formed XML document as this term is commonly understand according to [3]. Furthermore we do not consider CDATA section inside the XML document but it is not hard to imagine its subse- quent incorporation. But here it would break our deliberations and diverge to another Using BMH Algorithm to Solve Subset of XPath Queries 85 than key problems. Except CDATA we also do not consider comments in XML. Both CDATA and comments can be easily incorporated to the proposed algorithms but it is not the aim of this work to focus on every aspects. Rather we focus on key principles. See the section Conclusions and future work for summarizing involving issues. 4.1 Semiformal simple algorithm description (simpleBMH) 1. use BMH algorithm to find asked pattern on input stream 2. for all found patterns must be verified the context 1. just a word in text inside an element but do not denote an element (if it is the case then nothing happen, go further without writing into the re- sults; this test can be done comparing symbols before the found pattern and symbol '/' or '