A Framework for Understanding Web Publishing Applications Sonia Guéhis Paris-Dauphine University, Place du Marechal de Lattre de Tassigny, 75775 Paris, France sonia.guehis@dauphine.fr Abstract. We address in this paper the reverse engineering issue in the Web applications. The maintenance process of this kind of applications is often hardly performed due to the lack of documentation. In most cases, the documentation associated to the Web application does not exist or rarely complete and up-to-date. We aim to present a solution which describes the structure of Web application in order to gain their better understanding and so facilitate their maintenance. We describe a method to infer Web publishing programs, specifically defined as database-driven programs producing dynamic documents. We address a typical reverse engineering situation where the program is a “black box” that takes a database instance (the input) and produces a dynamic document (the output). Our method attempts to understand and describe the program. Keywords: reverse engineering, Web publishing programs, dynamic doc- uments, canonical instances 1 Introduction 1.1 Context and motivations The production of dynamic (X)HTML documents from relational databases is probably one of the most common techniques used in Web applications devel- opment. Such documents combine static parts that correspond to free text and (X)HTML rendering instructions (e.g., tags), and dynamic parts which are re- trieved at run time from a relational instance. Many specialized languages (i.e., Java/JSP, PHP, ASP) and development frameworks (Struts, .NET, PHP/MVC) make the production of dynamic Web sites a relatively easy task, and this con- tributes to the richness and accessibility of the Web. A downside is that pub- lishing programs are often poorly written, and tend to be quite difficult to un- derstand and maintain. The situation even degrades as the application evolves through maintenance and evolutions. In the present paper we describe the main aspects of a method to address this situation. Our goal is to derive useful information on the structure and behavior of publishing programs without having to delve into the source code. The main idea is that we can infer how a program P accesses the underlying Proceedings of WISM 2009 Canonical Dynamic instances documents Publishing Program P D1 111111 000000 I1 000000 111111 D2 I2 ... ... In Dn Data graph Analysis 10 0 01 1 0 1 Mapping 00 110 1 0 11 0 0 1 0 11 0 0 1 1 01 0 0 1 1 0 0 1 0 10 1 1 0 0 10 1 1 00 101 10 01 1 1 00 1 Document graph 0 01 1010 Program description Fig. 1. Overview of the reverse engineering process database and merges the dynamic and static parts by just examining the input and output. Moreover this re-ingeneering process needs only an access to the database schema and the right to run P on instances of this schema. We do not require an access to the actual database instance, nor do we need the code of the application. This preserves the privacy of business data, and allows to cope with situations where the source code is no longer available. 1.2 Process overview Basically our method produces carefully chosen instances of the database, runs P, and makes some inferences of the program behavior by analysing the dynamic document produced as output. Figure 1 illustrates the main components involved in the process. Let us briefly describe their role before entering into details. We apply P to canonical instances of the database schema and obtain dy- namic documents. The concept of canonical instance denotes both complete and unambiguous instances [1]. Intuitively, a canonical instance enjoys suitable prop- erties for the analysis of the dynamic document produced by a program. For instance, given a set of values, one can determine without ambiguity the set of tuples (if such a set exists) of the instance that contain these values, as well as their dependencies. In order to model conveniently such structured sets of tu- ples, we view the relational instance as a data graph where tuples and values are nodes, and edges represent dependencies. The data graph model and canonical instances are presented in Section 2. Next, dynamic documents are analysed in order to distinguish the static parts from the dynamic ones. This is done through an iteration of program executions that produce dynamic documents from as many canonical instances as necessary. The dynamic part is modeled as a graph of values, constructed from both the document structure and the database schema. This analysis process is described in Section 3. Proceedings of WISM 2009 Finally, we construct a mapping between the graph of values of a document Di and a subgraph of the canonical instance Ii . This mapping constitutes an interpretation of the program P that carries out a navigation in Ii , retrieves some values and merge these values with static text to create Di . We can then produce a description of P at a suitable abstraction level independent from specific details such as, for instance, the programming language. This final step is presented in Section 4. The rest of the paper develops this brief overview, and discusses the perspec- tive of the approach, as well as related work (Section 5). Due to space limitations, the presentation is mostly driven by examples based on a simple database that represents movies with their (unique) director and their (many) actors (Fig- ure 2). The interested reader is referred to[1] and [2] for formal definitions and technical details on the DocQL language and our concept of complete instance. (see http://www.lamsade.dauphine.fr/∼guehis/Protos.htm). id last_name first_name title year id_director genre 20 Eastwood Clint Unforgiven 1992 20 Western 21 Hackman Gene Van Gogh 1990 29 Drama 29 Pialat Maurice Kagemusha 1980 68 Drama 30 Dutronc Jacques Absolute Power 1997 20 Crime 68 Kurosawa Akira Movie Artist title id_actor character Unforgiven 20 William Munny Unforgiven 21 Little Bill Dagget Van Gogh 30 Van Gogh Absolute Power 21 President Allen Richmond Cast Fig. 2. An instance of the Movies database 2 Modeling the input: canonical instances As mentioned previously, we model a database instance as a labeled directed graph GI , and rely on a query language on this graph which constitutes a syntax- neutral (declarative) specification of a publishing program written in Java/JSP or in any other programming framework. Data model and query language. Our reverse-engineering process operates on a view of the relational instance where tuples are seen as internal nodes, values as leaf nodes, and edges represent either tuple-to-tuple dependencies or tuple-to-attribute dependencies. Figure 3 shows the data graph of the instance of Figure 2. We distinguish functional dependencies between nodes (e.g., between a movie node and its director node) Proceedings of WISM 2009 Unforgiven Clint 1992 first_name Eastwood title Western last_name year id Director genre 20 (Artist) Cast Direct Little Bill Dagget Actor Cast character Cast (Movie) Movie Cast Movie (Cast) (Artist) (Cast) id 21 Actor last_name character first_name Hackman William Munny Gene Fig. 3. A subset of the data graph of our sample instance and multivalued dependencies (e.g., between a movie node and its actor nodes). The former are shown with white-headed arrows, the latter with black ones. We associate to this model a query language, called DocQL, which com- bines navigation in the data graph with instantiation of the textual fragments that contribute to the final document. A DocQL query is essentially a tree of path expressions which denote the part of the graph that must be visited in order to retrieve the data to include in the final document. Path expressions use an XPath-like syntax. An expression p is interpreted with respect to an initial node Ni (unless it begins with db which plays the role of / in XPath), and delivers a set of nodes, called the terminal nodes of p (with respect to Ni ). Each path is associated to a fragment which is instantiated for each termi- nal node. Path and fragments are syntactically organized in rules of the form @path[condition]{fragment}, with a path expression, a node condition and fragment is the fragment instantiated for each instance of path. The following example shows a DocQL query over our Movies database. It produces a (rough) document showing the movie Unforgiven along with its director and actors. @db.Movie[title=’Unforgiven’]{ @title{}, @year{}, directed by @director.first_name{} @director.last_name{} Featuring: @Cast{ - @artist.first_name{} @artist.last_name{}as @character{} } } Proceedings of WISM 2009 The semantics of the language corresponds to nested loops that explore the data graph, one loop per rule. Looking at the previous example, we first search for the node Movie with title Unforgiven. Taking this node as an initial one, the value of each (unique) path title, year, etc., is evaluated. The multiple path Cast leads to all the nodes that represent one of the characters of Unforgiven. Applied to the data graph of Figure 3, one obtains the following document as result of the previous example: Unforgiven, 1992, directed by Clint Eastwood, Featuring: - Clint Eastwood as William Munny - Gene Hackman as Little Bill Dagget Aggregation and negation cannot be directly expressed in DocQL, but ag- gregated values can be obtained via the mapping that transforms the relational instance to the virtual data graph (an even simpler solution is to define SQL views with group by clauses, which can then be exported in the data graph). We shall discuss these limitations in Section 5. Canonical instances. Our method relies on the creation of specific instances satisfying two condi- tions: completeness and non-ambiguity. In short, the first condition ensures that the program always finds query results during its navigation in the database. The second condition is meant to allow the identification of a unique subgraph of the instance, isomorphic to the set of values found in the dynamic document. Completeness. An instance is said complete if, for each one-to-many depen- dency E 1 →∗ E 0 and each tuple e instance of E, there exists an instance e0 of E 0 associated to e. Let us take the example of the dependency directed by that links a director and its movies. In terms of the relational schema, there is an integrity constraint that ensures that each movie refers to a director (see Figure 2). The completeness constraint states, in addition, that each tuple in table Artist is referred to by a tuple in table Movie. Therefore, in a complete instance of our schema, each artist is the director of a movie. This is by no way a realistic constraint. It is only intended to ensure that a publishing program that wishes to show a film, its actors, and for each actor, the list of films possibly directed by this actor, will produce the most complete result document. In other words a complete instance allows to obtain complete documents, and thus a complete view of the program output. The instance of Figure 3 is not complete. If we remove the node squared with dashed lines (and the corresponding Artist subgraph), the instance be- comes complete, because the only remaining artist is Clint Eastwood who turns out to be both an actor and a director. Note the cycle in the data graph that corresponds to a cyclic dependency in the graph schema. Non-ambiguity. The non-ambiguity condition can be informally stated as fol- lows: if N and N 0 are two nodes in the data graph, then the path that links N to N 0 can be uniquely determined. The instance on Figure 3 is ambiguous, even after removal of the node that corresponds to Gene Hackman. Indeed, if Proceedings of WISM 2009 Clint Eastwood Woody Allen Sidney Pollack Robert Redford Director Cast Cast Director Unforgiven Husbands and Wives Jeremiah Johnson ... a. Minimal cycle (2 edges) b. Cycle of size k*2 Fig. 4. Generating cycles in a canonical instance we are given the values ’Eastwood’ and ’Unforgiven’ found in a dynamic docu- ment, there is an ambiguity on the meaning of the Artist node, which can be interpreted either as the director or an actor of the film. Ambiguity is a consequence either of two distinct nodes sharing the same value, or of cycles in the database instance. The first problem can easily be avoided by generating distinct values. The second problem is trickier because simply removing cycles would contradict the completeness property. As men- tioned above, the database needs to represent by cycles in the instance the cycles of the schema in order to obtain a solution for any path chosen by the program from any node in the graph. A trade-off is here necessary. Note first that the cycle size in the instance is proportional to the cycle size in the schema. Figure 4.a shows a minimal cycle in our sample instance, of size 2, and Figure 4.b its generalization to a cycle of length 2 × k, with k > 1. The value of k is a parameter of the instance construction which represents the upper-bound on the number of tables joined by a single SQL query of the program (or equivalently, k is the longuest path used by the program during its navigation in the data graph). k must be chosen large enough so that no ambiguity can arise when a link must be created between two values extracted from a dynamic document. In the following we call a complete and non-ambiguous instance a canonical instance. An algorithm to create canonical instances is given in[1]. 3 Modeling the output: dynamic documents Assume now that we obtain a document D from the execution of the program P on a canonical instance I. We need to distinguish static parts from dynamic parts. From the dynamic part we will be able to construct the mapping that associates the document structure to a database subgraph. For the sake of clarity, we illustrate the mechanism with the following document D, obtained by running P on a canonical instance I. Unforgiven, 1992, directed by Clint Eastwood, Featuring: