Literate Sources for Content Dictionaries: a Progress Report

Lars Hellström
Division of Applied Mathematics, The School of Education, Culture and Communication, Mälardalen University, Box 883, 721 23 Västerås, Sweden; lars.hellstrom@residenset.net

Abstract

At OM2013, the author suggested and sketched a system that would use LaTeX documents as Literate Programming sources for content dictionaries. This paper reports on the progress that has since been made with this system. One important milestone that has been reached is that valid .ocd files with CDDefinitions, FMPs, CMPs, and Examples are being generated.

1 Overview

In [1], the idea of applying literate programming techniques to the task of creating OpenMath content dictionaries was raised, various possibilities were analysed, and a concrete design was sketched. The primary characteristic of this design is that all machine-readable material of the content dictionary is encoded using special-purpose markup in a LaTeX document. Typesetting said document using standard LaTeX will on the one hand produce a printed representation of the information so encoded, and on the other hand generate those XML documents that formally define the content dictionary.

In March 2014, development had reached the point where the system could generate an .ocd file that would validate against the omcd2.rng schema, and in June it was used for an original content dictionary submitted to the OpenMath 2014 workshop.

Concretely, the system consists of a LaTeX 2ε package called openmathcd. This package defines a number of commands and environments that constitute the special-purpose markup for encoding content dictionaries. Among these are the OpenMathCD environment, which encloses the material for a content dictionary, and the CDDefinition environment, which encloses the material for one symbol within a content dictionary. There are also numerous commands and environments for encoding OpenMath objects.
In more detail, what the package currently can do is:

• generate every kind of element allowed in an .ocd file,
• generate FMPs and Examples with embedded OpenMath objects,
• typeset OpenMath objects (with or without simultaneously writing them to file).

Some envisioned things it cannot yet do, but which would only take a small amount of programming, are:

• generate OME, OMB, OMF, OMSTR, OMR, or OMFOREIGN elements (so far, there has been no need for them in the content dictionaries generated),
• generate .sts files.

One envisioned thing which may require a bit more thought, mostly to design a sensible user interface, is:

• generate file(s) defining notation for symbols.

Progress report    Lars Hellström

2 Details

A file archive containing the state as of 2014-06-07 of the openmathcd package (.dtx sources as well as ready-to-use .sty files) can be downloaded from

http://www.mdh.se/polopoly_fs/1.60502!/Menu/general/column-content/attachment/list4.zip

This archive also contains an example document (list4.tex) and the content dictionary generated from it (list4.ocd), which is the submitted original content dictionary mentioned above.

2.1 Object markup

The openmathcd markup for objects is patterned after the XML encoding for these, but with some basic adjustments to fit LaTeX syntax. Compound objects are marked up as environments, whereas basic objects are expressed as commands.
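As a small illustration of this division, the sum x + 2 might be marked up as follows. This is only a sketch: the commands and the environment are those of the base grammar stated next, the choice of the arith1 symbol plus is an assumption made for the example, and the fragment is meant to appear inside one of the object environments (such as OMOBJ) in a document that loads openmathcd.

```latex
% Sketch only: assumes the openmathcd package is loaded and that this
% fragment appears inside an enclosing object environment such as OMOBJ.
\begin{OMA}              % compound object: an environment
  \OMS{arith1}{plus}     % basic object: a symbol, given by cd and name
  \OMV{x}                % basic object: a variable
  \OMI{2}                % basic object: an integer
\end{OMA}
```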
The base grammar for an ⟨omel⟩ may be stated as

    ⟨omel⟩ → \OMV{⟨name⟩}
           | \OMI{⟨optional sign⟩⟨digits⟩}
           | \OMS[⟨cdbase⟩]{⟨cd⟩}{⟨name⟩}
           | \begin{OMA} ⟨omel⟩+ \end{OMA}
           | \begin{OMBIND} ⟨omel⟩ \begin{OMBVAR} ⟨omel⟩+ \end{OMBVAR} ⟨omel⟩ \end{OMBIND}
           | \begin{OMATTR} \begin{OMATP} (⟨omel⟩ ⟨omel⟩)+ \end{OMATP} ⟨omel⟩ \end{OMATTR}

which is fewer characters than the XML encoding for the leaf ⟨omel⟩s, but a few more for the compound ones; the primary gain is not in providing a significantly more compact encoding, but rather in switching from a format known chiefly by computer scientists (XML) to a format known by most mathematicians (LaTeX).

The LaTeX code fragments conforming to this grammar for an ⟨omel⟩ may be used to two ends, which typically happen in parallel: they may be transformed to valid XML encoding OpenMath objects written to a generated file, and they may be typeset to become part of the printed material in the document. For typesetting, there are currently two styles available: XML code style, which will be used inside an OMOBJ environment, and a “semiformula” style, which will be used inside sfOMOBJ and semiformulae environments. The OMOBJ and sfOMOBJ environments will cause the ⟨omel⟩ to be written to a generated file (where appropriate), whereas the semiformulae environment is more for facilitating discussions of OpenMath objects. Examples of such ‘discussions’ can be nonformalised proofs of mathematical theorems where details in the formal encoding of something as an OpenMath object are important.

A practical extension of the markup, which saves quite some typing, is the OMAS environment.
Technically, it extends the above grammar for ⟨omel⟩ with the alternative

    \begin{OMAS}[⟨cdbase⟩]{⟨cd⟩}{⟨name⟩} ⟨omel⟩* \end{OMAS}

that (as far as encoding an OpenMath object is concerned) is equivalent to

    \begin{OMA} \OMS[⟨cdbase⟩]{⟨cd⟩}{⟨name⟩} ⟨omel⟩* \end{OMA}

Using this, the formula 2 + 2 = 4 may be encoded as the ⟨omel⟩

    \begin{OMAS}{relation1}{eq}
      \begin{OMAS}{arith1}{plus} \OMI{2} \OMI{2} \end{OMAS}
      \OMI{4}
    \end{OMAS}

and an OMOBJ environment typesets that as

    <OMOBJ>
      <OMA>
        <OMS cd="relation1" name="eq"/>
        <OMA>
          <OMS cd="arith1" name="plus"/>
          <OMI>2</OMI>
          <OMI>2</OMI>
        </OMA>
        <OMI>4</OMI>
      </OMA>
    </OMOBJ>

whereas the sfOMOBJ environment may typeset it as

    application(relation1.eq, application(arith1.plus, 2, 2), 4)

(although what should be the defaults in this latter style is at the time of writing very much in flux). The generated XML code is in both cases the same as that typeset by the OMOBJ environment, except that the generated code also has an xmlns="http://www.openmath.org/OpenMath" attribute on the OMOBJ element.

LaTeX parses ⟨omel⟩s by executing them, so the standard range of LaTeX programming tricks is available for further streamlining of markup.

2.2 Funny characters

One of the great challenges when generating code, especially code that may embed arbitrary strings (as is the case with, for example, CMPs), is to make sure that all characters are correctly encoded. The difficulty increases further when the source format has its own syntax rules that differ from those of the target format; an incomplete translation could result in a situation where users would have to know and counteract idiosyncrasies of both the source and the target format. But thanks to using the harmless LaTeX package (included in the above archive) for handling character strings, users of openmathcd need only worry about handling LaTeX syntax, and may even use LaTeX markup for accented letters (which for mathematicians may be less confusing than locating them on the keyboard).
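To make the point concrete, here is a hedged sketch of a name typed with LaTeX accent markup. That such markup is accepted follows from the text above; the ASCII-only XML form shown in the comment is an assumption about the output, since XML permits decimal and hexadecimal numeric character references alike and which of them is emitted is not stated here.

```latex
% Sketch only: a name as a user might type it in a CMP or similar
% string, using LaTeX accent markup instead of a literal U+00F6.
Lars Hellstr\"om
% In an ASCII-only XML file the accented letter has to be written as a
% numeric character reference, for example:  Lars Hellstr&#246;m
```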
Generated XML files are always pure ASCII because that is all TeX can do portably, but the character set supported by openmathcd is full Unicode; numeric character references are used extensively in the generated XML files, which will be well-formed. (Getting LaTeX to typeset unusual characters as appropriate glyphs can however be nontrivial.) openmathcd does not check that names only contain valid characters, but that is a trivial matter to verify using XML validation.

2.3 Canned strings

One design goal has been that users should not have to write long strings that are anyway fixed beforehand. One class of such strings is the XML namespaces, which are inserted automatically where needed. The longest canned string is however the copyright licence; the single command \StandardOMLicence will insert the full 29 lines of the standard licence (wrapped up in a CDComment element) into the generated file(s).

It would be a minor modification to also put cdbase and/or version attributes on each generated OMOBJ. The author would be interested to hear arguments for or against.

3 Moving forward

3.1 The importance of brevity

The XML encoding of an OpenMath object can be hard to read because the information is very spread out; there can be a lot of text between the name of a function and the name of the variable it is being applied to. The semiformula style is much closer to ordinary mathematical formulae, but it too does not achieve the same togetherness of formula elements as ordinary mathematical formulae do. One reason for this might be that many of the tokens in semiformulae are still too long to allow the eye to behold groups of them as units; reducing common tokens to single glyphs could overcome this. Changing the long application token to a simple @ makes a significant difference (because it is very common), but then it is instead the names of the symbols which stand out as being long.
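A minimal sketch of how shorter presentations might be declared for selected symbols, in plain LaTeX, is given below. The macros \DeclareSymbolGlyph and \SymbolGlyph are hypothetical names invented for this illustration and are not part of openmathcd; symbols without a declaration fall back to the long cd.name form.

```latex
\documentclass{article}
\makeatletter
% Hypothetical helper (not part of openmathcd): associate a glyph
% with the symbol identified by content dictionary #1 and name #2.
\newcommand*{\DeclareSymbolGlyph}[3]{%
  \@namedef{omglyph@#1@#2}{#3}%
}
% Hypothetical lookup: typeset the declared glyph if there is one,
% otherwise fall back to the long form cd.name.
\newcommand*{\SymbolGlyph}[2]{%
  \@ifundefined{omglyph@#1@#2}%
    {\texttt{#1.#2}}%
    {\@nameuse{omglyph@#1@#2}}%
}
\makeatother

% Declarations for two common symbols.
\DeclareSymbolGlyph{relation1}{eq}{$=$}
\DeclareSymbolGlyph{quant1}{forall}{$\forall$}

\begin{document}
% The first two use the declared glyphs; arith1.plus has no
% declaration and is therefore typeset in its long form.
\SymbolGlyph{relation1}{eq},
\SymbolGlyph{quant1}{forall},
\SymbolGlyph{arith1}{plus}
\end{document}
```

Keeping the table of glyphs in control sequences named after cd and name means that an undeclared symbol needs no special handling beyond the fallback branch.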
When written semiformulae (or something very similar to them) have been hand-crafted, as for example in [2], it is typical that symbols too (particularly the common ones) are given single-glyph presentations: ∀ for quant1#forall, = for relation1#eq, etc. Doing this for an explicit set of declared symbols is within the realm of what LaTeX macros can achieve, so it should probably be added as a feature to openmathcd.

3.2 Relation to standard enhancement

In several cases, it is hard to tell exactly how to further develop the openmathcd markup, because the correct direction depends on how the OpenMath standard will evolve. Some open tickets in the OpenMath Trac database,¹ and aspects of openmathcd they would affect, are:

    ticket  title                                            affects
    #5      FMP type=defining                                FMP environment arguments
    #128    Make CDSignatures or Signature use cdbase        STS generation
    #138    CD’s CDBase declaration is mandatory             \CDBase command
    #139    Symbol’s default cdbase not specified correctly  OMOBJ attributes
    #144    Add Notation Definitions to OpenMath             Notation specification
    #152    Revising the Simple Type System                  STS generation

It should however be observed that even a partial resolution of some of these issues (for example defining a partial notation definition system, or defining a system abstractly even if not with a formal syntax) would be a great help, as it could allow development to take a few steps forward.

¹ https://trac.mathweb.org/OM3/query

References

[1] Lars Hellström. Literate sources for content dictionaries. Paper 22 in MathUI, OpenMath, PLMMS and ThEdu Workshops and Work in Progress at the Conference on Intelligent Computer Mathematics, CEUR Workshop Proceedings 1010, 2013. http://ceur-ws.org/Vol-1010/paper-22.pdf

[2] Fulya Horozal, Michael Kohlhase, and Florian Rabe. Extending OpenMath with Sequences, pp. 58–72 in: Intelligent Computer Mathematics, Work-in-Progress Proceedings, Technical Reports of University of Bologna UBLCS-2011-04, 2011. http://kwarc.info/frabe/Research/HKR_sequences_11.pdf