<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Web Forms and XML Processing: Some Quality Factors of Process and Product. Mário Amado Alves, Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa, maa@di.fct.unl.pt</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Artificial Intelligence, Universidade Nova de Lisboa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1999</year>
      </pub-date>
      <volume>25</volume>
      <issue>3</issue>
      <fpage>11</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>"The web is bad; really bad." observed Jakob Nielsen three years ago about The Web Usage Paradox [1]. The true paradox today is that science and technology institutions have bad sites! Programmes devoted to the information society itself have bad sites!! Hopefully this international conference will make a difference, at least show Nielsen and myself are not just "fools on the hill" and, hopefully, help management make wiser decisions regarding web development strategies and, correlatively, better technical staff selection. The current paper contributes to this goal by exposing a method for the development of complex web services which embodies a number of quality assurance items and has passed the test of real web service deployment, featuring user authentication, multiple forms, recorded data, and automated page creation. The method and the case are described with incursions into selected technical details. Quality factors are explicitly or implicitly associated with each described item.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Web Quality Manifesto</title>
      <p>The web is bad; really bad. [1]</p>
      <p>And it is getting worse. Already three years have passed since web usage specialist Jakob Nielsen's article [1] appeared, and still his observations are right on the mark: "90% of all commercial websites are overly difficult to use due to bloated page design that takes forever to download, internally focused design that hypes products without giving real info, obscure site structures, lack of navigation support, narrative writing style optimised for print, not for the way users read online, etc." ([1] abridged, original emphasis maintained).</p>
      <p>In the current paper I add a couple of items to this list.</p>
      <p>Why is it even worse, today? Well, for one, institutional sites are bad. In fact, the true paradox today is that science and technology institutions have bad sites! Research programmes devoted to the information society itself have bad sites!! The 2000 Olympics site was bad. Oracle's site is bad. I submit a Law of Inverse Quality: the greater the institution, the worse the site.</p>
      <p>I hope this international conference, in particular panel PN1.2 entitled Qualidade nos Sistemas de Informação da Administração Pública: o Início duma Cruzada (Quality in Public Administration Information Systems: the Start of a Quest), in regard to institutional sites, will make a difference. At least it will show Nielsen and myself are not simply "fools on the hill" anymore. And, hopefully, it will help educated management make wiser decisions regarding web development strategies and, correlatively, better technical staff selection.</p>
      <p>The current paper contributes to this honourable goal by exposing a method for the development of complex web services which embodies a number of quality assurance items and has passed the test of real service case deployment.</p>
      <p>Is the Web really worse today than three years ago? Yes, definitely. Here is a recent (2000) observation from the same author of [1]: "If you are going to go and buy something on a new website, you will fail. If you go to a new website, you will not be able to use it." (http://www.wired.com/news/business/0,1367,40155,00.html)</p>
      <p>Web quality is a twofold problem: technical and
social.</p>
      <p>Technical. A veritable plethora of techniques and methods exists today to develop web services. Judging from the results, most of them are bad. The current paper presents a method that emphasises some quality factors of process and product. These factors are explicitly or implicitly associated with each described technical item. Bottom line: it is a good method. It is proven. The rest of this paper will deal with the technical aspect only.</p>
      <p>Social. The social problem is to convince people to use good methods and techniques. To be quality-aware. People like Jakob Nielsen [1] have been trying to pass the message for some years now. The message is simple: User: demand web quality. Web service provider: provide quality (or else die). But seemingly the word is not getting through. Users are not demanding. Perhaps they simply do not know the Web could be much better. Perhaps they simply do not want to: one way for a site to be better is to be simpler; perhaps most users prefer complicated, slow sites. This social aspect is not addressed further in the current paper.</p>
      <p>2. Quality Factors Overview</p>
      <p>HTTP, CGI are the GOTOs of the 1990s.</p>
      <p>[2]</p>
      <p>We present a method for the development of complex web services. The method was tested with a service case featuring:
. user authentication
. multiple forms
. recorded data
. automated page creation</p>
      <p>The method emphasises quality at two stages: development and execution (meaning runtime execution of the service). It does this by scoring high on quality factors of process and product respectively; mostly of product, but high scores here are justified by process factors implicit in the method, as illustrated in Table 1.</p>
      <p>Table 1. Quality factors and their justification.
Factor | Justification
Correctness | High traceability: rich messages. Operationalised completeness checks.
Reliability | All errors handled. Standard technology. Simple page design.
Maintainability | Good choice of programming language (Ada). Separation of HTML code and service logic.</p>
      <p>The product also scores high on efficiency, usability, portability, and interoperability. It scores less on testability (test data must be prepared for each case, and it is not operationalised), and integrity (no access control tool). These scores and their justification are further supported by the items detailed in the rest of the paper.</p>
      <p>The method comprises selected and created "open source" software tools and components: package CGI by David Wheeler (modified version included in [3]), package XML_Parser by the author [3], and GNAT by GNU, NYU and ACT (vd. adapower.com).</p>
      <p>We use the word safety as a synonym of reliability, and we use the words method and safety in a wide sense, viz. with method ranging from architecture to coding, and safety including effectiveness and efficiency both in development (cost safety) and execution.</p>
      <p>In this paper the method is presented with examples from the real development case, and with incursions into the detail of selected aspects.</p>
      <p>The method is continually evolving, due to both external technological change and internal planned increments. Some of these planned increments are also exposed in this paper, as a means of obtaining feedback from the software engineering community. This trait in particular puts the method on the top level of the CMM (Capability Maturity Model, vd. http://www.sei.cmu.edu). Other well known software process references associated with the current method are vanilla frameworks, extreme programming, futurist programming (vd. Internet). The precise form of these associations is left implicit in the paper.</p>
      <p>3. The Case</p>
      <p>The most recent application of the method was in the implementation of an official inquiry to schools via Internet. This was in Portugal, in the year 2000. The purpose of the inquiry was to evaluate a recent reform in school administration. The inquirer, and my client, was CEESCOLA3, a state-funded research centre in education, henceforth simply the Centre.</p>
      <p>The inquirees were 350 secondary schools and
school groups randomly chosen out of a nation-wide
universe of 1472 such entities.</p>
      <p>The service was required to be accessible only by the selected schools, so these were previously given, via surface mail, private access elements (identifier and password). A time window for answering the inquiry was fixed, and the system was required to make the answers available to the Centre as soon as they were submitted.</p>
      <p>The inquiry itself took the form of a number of long and complex questionnaires: each questionnaire had hundreds of questions, and the answer to certain questions determines the existence of other questions or their domain of possible answers.</p>
      <p>Note that this case is very similar to electronic commerce services in complexity and safety issues.</p>
      <p>4. The Method</p>
      <p>The top-level features of the method are:
. HTML
. CGI
. separation of HTML code (documents) and service logic (program)
. HTML extended internally
. documents prepared through XML transformations
. both the service logic and the transformations written in Ada
. session state maintained in the served pages
. a single meta-HTML unit
. a single service procedure</p>
      <p>3 Centro de Estudos da Escola = Centre for School Studies, Faculty of Psychology and Education Sciences of the University of Lisbon.</p>
      <p>The separation of HTML code and service logic is a crucial design premise. Our rationale for this converges for the most part with that described in [2]. In order to attain separation, the stored pages are written in a slightly extended HTML, call it "meta-HTML", which is transformed by the service, upon each request, into the served pages in standard HTML.</p>
      <p>Also, minimalized HTML was preferred as a basis for meta-HTML, because minimalized HTML is more readable by humans than its non-minimalized counterpart or XHTML, and the ultimate reviewers of the meta-document are human.</p>
      <p>Now, HTML, minimalized HTML, XHTML, and the designed meta-HTML are all subsumed by a slightly relaxed XML, notably one not requiring pairing end tags. This may seem nonsensical to XML formalist eyes and sound of heresy to XML purist ears, but in practice such a "dirty" version of XML is very convenient. With a robust XML processor one can easily control that one dirty aspect of not requiring pairing end tags. Package XML_Parser has such a robustness feature. The gains include:
. a single processing component for all "dirty" XML instances (HTML, minimalized HTML, meta-HTML, XHTML)
. increased readability of the input units
. an easy path to proper XML representations (not taken, but the current trend from HTML towards XML in the Web was a concern)</p>
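      <p>For illustration only (the paper's tools are Ada packages; this sketch is in Python, and all names in it are invented), a "dirty"-XML scanner in the above sense can be as small as a tokenizer that reports start tags, end tags, and text without demanding that they pair up:</p>

```python
import re

# Minimal tokenizer for "dirty" XML: it never requires a start tag to have a
# matching end tag, so one component can read HTML, minimalized HTML,
# meta-HTML, and XHTML alike.
TAG = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)((?:[^>'\"]|'[^']*'|\"[^\"]*\")*)>")

def tokenize(source):
    """Yield ('text', s) and ('start'|'end', name, attrs) tokens."""
    pos = 0
    for m in TAG.finditer(source):
        if m.start() > pos:
            yield ("text", source[pos:m.start()])
        kind = "end" if m.group(1) else "start"
        yield (kind, m.group(2).lower(), m.group(3).strip())
        pos = m.end()
    if pos < len(source):
        yield ("text", source[pos:])

# the <p> and <br> below are unclosed, as minimalized HTML allows
tokens = list(tokenize("<p>unclosed paragraph<br><b>bold</b>"))
```

      <p>A consumer that does care about pairing (e.g. for proper XML output) can enforce it on top of this token stream; the tokenizer itself stays permissive.</p>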
      <p>So, in this paper, we take the liberty of calling all of that simply XML, hence the pervasive use of the term and the inclusion of XML tools in the method.</p>
      <p>XML processing happens at two stages: data preparation and service execution.</p>
      <p>Data preparation. The questionnaires are created by client staff using WYSIWYG editors like Microsoft FrontPage and Word. Then these items are transformed into the final, static meta-HTML items. The major part of this transformation is automated, by means of Ada procedures utilising package XML_Parser. The transformation consists of:
. rectify the messy HTML emitted by Microsoft tools
. rectify and insert control elements and attributes
. structure the items into identified groups</p>
      <p>Because the necessary ad hoc transformation programs are small (c. 1k lines), and the compiler is fast and easily installable on any site, Ada can also be used here, instead of the usual unsafe scripting languages.</p>
      <p>Service execution. The pages are not served directly: they have a number of markers that must be replaced by the definitive values. This is done at runtime by the main service procedure, again utilising XML_Parser. The rest of this section focuses on this.</p>
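      <p>As a sketch of this replacement step (Python for illustration; the real procedure is Ada, and the %name% marker syntax here is an assumption, not the paper's actual notation):</p>

```python
import re

def instantiate(meta_page, values):
    """Replace every %name% marker with its definitive value; an unknown
    marker fails loudly rather than serving a half-finished page."""
    def definitive(m):
        name = m.group(1)
        if name not in values:
            raise KeyError("no value for marker: " + name)
        return values[name]
    return re.sub(r"%(\w+)%", definitive, meta_page)

page = instantiate("<p>School: %school%, session %sid%</p>",
                   {"school": "Escola X", "sid": "42"})
```

      <p>Failing loudly on an unknown marker is in the spirit of the method's traceability goal: a bad page is reported, never silently served.</p>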
      <p>Input values from one form are relayed onto the
next as hidden input elements. This provides for:
. data communication between session points, or
forms-this implements sessions
. general tests on input values to be run on any
session point-this increases safety</p>
      <p>All input values are relayed, so careful naming of
input elements is required (in order to avoid collision).
The localisation of all forms in a single meta-unit
promotes this.</p>
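      <p>A minimal sketch of the relay (Python for illustration; the field names are invented): every received value is re-emitted as a hidden input, so the session state travels inside the served page itself, with no server-side session store:</p>

```python
import html

def hidden_inputs(form_values):
    """Re-emit every input value as a hidden input for the next form."""
    return "\n".join(
        '<input type=hidden name="%s" value="%s">'
        % (html.escape(k, quote=True), html.escape(v, quote=True))
        for k, v in sorted(form_values.items()))

relayed = hidden_inputs({"school_id:I": "1047", "q17:a": "yes"})
```

      <p>Careful naming matters exactly because these relayed names share one flat namespace with the next form's own inputs.</p>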
      <p>The method evidently relies on the usual external pieces of web technology: an HTTP/CGI server and web browsers. The service was deployed with the Apache server running on a Linux system. Some problems were felt here, notably an access security hole: the service internal database files, in order to be accessible by the main procedure, had to be configured in such a way that they were also accessible by all local Linux system users! This problem is perhaps corrigible with the proper Apache settings; but this server's documentation is hardly comprehensible.</p>
      <p>4.1 The service procedure</p>
      <p>The service procedure is a non-reactive program, i.e. it terminates, as usual CGI procedures are. It is designed as the sequence of blocks sketched in Figure 1.</p>
      <p>The computation is data-driven by form input values, meta-HTML markers, and system database files (users, passwords, etc.). The form input values and the files are totally case-dependent, so we focus on the meta-HTML markers, and dedicate the next section to them.</p>
      <p>The exception handling is crucial. All errors are captured in a report page served to the user with a wealth of useful information, including instructions for error recovery, illustrated in Figure 2.4 This happens even during development and testing, facilitating these tasks greatly.</p>
      <p>4 The original data in Portuguese are shown in the figures because they have formal identifiers in Portuguese (sometimes in English, e.g. when they emanate from the compiler), and we wanted to ensure referential consistency between all data items shown in this paper and at its presentations.</p>
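      <p>The block structure can be pictured as follows (a Python stand-in for the Ada procedure; the authentication check and page text are invented stubs):</p>

```python
import io
import traceback

def serve(form, out):
    # stand-in for the real blocks: authenticate, validate, serve next page
    if form.get("_id") != "school42":
        raise ValueError("authentication failed for " + repr(form.get("_id")))
    out.write("Content-Type: text/html\n\n<p>ok</p>")

def main(form, out):
    """Catch-all wrapper: every error becomes a served report page."""
    try:
        serve(form, out)
    except Exception:
        out.write("Content-Type: text/html\n\n<h1>Error report</h1><pre>"
                  + traceback.format_exc() + "</pre>")

buf = io.StringIO()
main({"_id": "nobody"}, buf)   # wrong credentials: report page, no crash
report = buf.getvalue()
```

      <p>Because the same wrapper runs during development and testing, a failed test case documents itself in the served report.</p>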
      <p>5. Meta-HTML</p>
      <p>This section describes the meta-HTML used in the example case. Other HTML extensions are possible. In fact this possibility is a major plus of the method: it provides applicability to a wide range of possible web services, through case-by-case adaptation of the meta documentary language. It can even go beyond XML eventually, but that is another story.</p>
      <p>5.1 Input field types</p>
      <p>The names of the form/input fields are extended with a type suffix of the form :t, where t is a single letter as described in Table 2.</p>
      <p>Table 2. Input field types.
t | Description
i | integer
a | alphanumeric
e | subject to verification (special)</p>
      <p>The upper case versions of t (I, A, E) additionally require a non-null value. The set is easily extended with more basic types, e.g. float and date. Type e (from the Portuguese word especial) requires a case-by-case treatment in the main procedure. The relevant section in the procedure is structured as a case construct: any e type value falling back to the others case raises a System_Error (or something similar). This together with the proper test data set increases safety in the development stage.</p>
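      <p>A sketch of the suffix-driven checks (Python for illustration; the real checks are Ada code, and the exact per-letter rules shown here are assumptions):</p>

```python
import re

CHECKS = {
    "i": lambda v: v == "" or v.lstrip("-").isdigit(),  # integer
    "a": lambda v: True,                                # alphanumeric, free
    "e": lambda v: True,                                # special: case-by-case
}

def check_field(name, value):
    """Validate one form value against the type suffix in its name.
    Upper case letters (I, A, E) additionally require a non-null value."""
    m = re.search(r":([iaeIAE])$", name)
    if not m:
        raise ValueError("untyped field: " + name)
    t = m.group(1)
    if t.isupper() and value == "":
        return False
    return CHECKS[t.lower()](value)
```

      <p>Because the general test runs on every session point, a required value missed in one form is still caught when its relayed copy reaches the next.</p>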
      <p>5.2 Conditional inclusion</p>
      <p>Meta-element if provides conditional selection of parts of the meta-document to be included in the served page. The selected part is the enclosed content of this element. This is similar to the C preprocessor directive #if. The condition is expressed in the element attributes</p>
      <p>Name1="Value11|...|Value1m"
...
NameN="ValueN1|...|ValueNm"</p>
      <p>which contain references to form/input element names and values. The set of attributes is a conjunction of disjunctions. The (positive) Boolean value of the set determines inclusion of the element content. Figure 3 shows an excerpt of the example meta-document with heavy use of conditional inclusion, and Figure 4 shows the corresponding HTML result for a particular session.</p>
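      <p>The condition semantics can be stated in a few lines (illustrative Python; the attribute and form names are invented):</p>

```python
def include(attrs, form):
    """Conjunction of disjunctions: every attribute must match, and an
    attribute matches when the form value equals any one of its
    |-separated alternatives."""
    return all(form.get(name) in values.split("|")
               for name, values in attrs.items())

form = {"regime": "TEIP", "nivel": "secundario"}
```

      <p>An attribute whose name is absent from the form simply evaluates false, so the enclosed content is left out.</p>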
      <p>Note the pervasive use of E suffixes in the meta-text: this was very helpful in assuring completeness of treatment of all cases, and therefore the correctness of the service.</p>
      <p>5.3 Session control</p>
      <p>A special hidden input element named _Seguinte:e (Portuguese for next) specifies the next meta-HTML unit to be processed. This is nontrivial at the start of the session, when moving from an authentication form to the main set.</p>
      <p>Also, the absence of this element may be used to
signal to the main procedure that the session is in its
final step, usually submission of the combined data of
all forms.</p>
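      <p>The dispatch on this element is then essentially (Python sketch; the .mhtml unit-file extension is an invented detail):</p>

```python
def next_unit(form):
    """The hidden _Seguinte:e field names the next meta-HTML unit to
    process; its absence signals the final step, i.e. submission of the
    combined data of all forms."""
    name = form.get("_Seguinte:e")
    if name is None:
        return ("final", None)
    return ("serve", name + ".mhtml")
```
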
      <p>A small number (circa five) of other special elements were found necessary to control very specific aspects of the service. It was technically easy to implement them in the same vein, notably with the CGI and XML processing resources already available.</p>
      <p>6. The tools and components</p>
      <p>To see the next transactional "transfer" happen, ignore the XML (and SOAP) hype and watch for actual XML implementations. (Mike Radow, in [3])</p>
      <p>A modified version of package CGI by David Wheeler served well as the CGI component. The modifications, done by myself, included:
. elimination of auxiliary overloading which caused ambiguity problems to the GNAT compiler. I suspect GNAT's complaints were legitimate, language-wise; perhaps Wheeler used another, non-validated, compiler; or the problem was not detected until my use of the package.
. redesign of the output format of procedure Put_Variables.</p>
      <p>The modified version is now in [3]. Further modifications are planned and described there.</p>
      <p>Package XML_Parser by myself, also in [3], was used to transform the HTML emitted by the nontechnical staff into extended HTML and then into the served HTML pages. Although XML_Parser served well as the (extended) HTML component of the current project case, it has severe limitations with respect to XML proper, noticeable in its documentation; it has also some design drawbacks, viz. the finite state device is entangled with the rest of the code.</p>
      <p>To overcome these limitations, I have already developed a new XML processing package, XML_Automaton. This package properly encapsulates the finite state device. A new XML parser package, XML_Parser_2, will use XML_Automaton as its engine, in order to produce a more localised interpretation of the XML input. XML_Parser_2 is designed after XML_Parser with respect to the (internal) treatment of XML element containment, and I am trying to make the expression of this containment generic, probably with an array of packages drawing on XML_Parser_2, each dedicated to a certain expression: an Ada linked list, Prolog facts, a DOM structure (Document Object Model, vd. w3.org), etc.</p>
      <p>A rather specific but interesting point is the character-by-character vs. chunking way of processing XML input. XML elements may span over more than one text line. In chunk-based parsers, the chunk is normally the line. These parsers, especially if also based on character string pattern matching libraries, have a real problem here. XML_Automaton does not.</p>
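      <p>The point is easy to demonstrate (Python sketch of a two-state character automaton; this is not the actual XML_Automaton interface):</p>

```python
def tag_names(chars):
    """Collect tag names by reading one character at a time; a tag that
    spans several text lines is handled exactly like any other."""
    state, buf, names = "text", [], []
    for c in chars:
        if state == "text" and c == "<":
            state, buf = "tag", []
        elif state == "tag" and c == ">":
            names.append("".join(buf).split()[0])
            state = "text"
        elif state == "tag":
            buf.append(c)
    return names

# this input element spans two lines; a line-chunked pattern matcher would
# need special handling for it, the character automaton does not
doc = '<input name="q1:I"\n value="3">'
```

      <p>The newline inside the tag is just one more character to the automaton; to a line-based matcher it splits the tag in two.</p>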
      <p>XML_Parser_2's design includes an unbounded array of stacks. Currently I am choosing between two bases for the implementation of this structure: GNAT.Table or Unbounded_Array. I am inclined to the latter because it is compiler-independent.</p>
      <p>7. Evaluation and some remarks</p>
      <p>The software metrics available for the example case are…</p>
      <p>Note the cost. We are missing precise comparison data with other experiments, but our experience and intuition tells us that it is a very good number, given the degree of correctness attained in the final service; notably, no fatal defects were found. I have also worked recently with a team developing a service similar to the example in intrinsic complexity but with much less form data, implemented with inter-calling PERL (www.perl.com) scripts (essentially a Great Ball of Mud, vd. slashdot.org/articles/00/04/29/0926241.shtml); it required much more work and delivered much less correctness. The service is still plagued with detected bugs that no one rectifies anymore.</p>
      <p>Why not use PHP (www.php.net)? Our reasons include:
. our method offers more control over the design and processing of the meta-language
. PHP documentation is incomprehensible</p>
      <p>Why not use MawL [2]?
. it is not extensible
. it is not maintained
. it seems to be very hard to achieve a working installation</p>
      <p>I am particularly fond of the inevitable conclusion that Ada is a good choice for programming in the small. So, there is a real small software engineering after all, and it is not confined to the unadjusted Personal Software Process [4] we read about, but never practice.</p>
      <p>Acknowledgements</p>
      <p>I wish to thank my research advisor at CENTRIA5, Doctor Gabriel Pereira Lopes. His correct envisionment of research in informatics as a rich network of diversified competencies and interests has made possible the degree of reusability seen here, notably of the XML tools which were firstly developed for our research projects in information retrieval and natural language processing.6 I am also indebted to Professor João Barroso of CEESCOLA for providing such an interesting case of Internet usage as the one described here. Thanks to my colleagues Pablo Otero and Alexandre Agustini, and to the QUATIC'2001 reviewers, for their good comments. Thanks to my family, for letting our home be also a software house. And to Our Lord, for everything.</p>
      <p>References</p>
      <p>[1] The Web Usage Paradox [webpage]: Why Do People Use Something This Bad? / Jakob Nielsen. Alertbox for August 9, 1998. (http://www.useit.com/alertbox/980809.html)</p>
      <p>[3] AdaLIB: the software process and programming library [web site] / by Mário Amado Alves. (http://lexis.di.fct.unl.pt/ADaLIB)</p>
      <p>[4] Results of applying the personal software process / P. Ferguson; W. S. Humphrey; S. Khajenoori; S. Macke; A. Matvya. pp. 24-32. In: IEEE Computer, 30(5), 1997. (description apud [5])</p>
      <p>6 Projects Corpora de Português Medieval, PGR, }GM, and, in great part, my post-graduation scholarship PRAXIS XXI/BM/20800/99, granted by the Fundação para a Ciência e Tecnologia of Portugal.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>