ICAT Job Portal: a generic job submission system built on a scientific data catalog

Stephen M Fisher
Scientific Computing Department
Rutherford Appleton Laboratory
Didcot, OX11 0QX, UK
Email: dr.s.m.fisher@gmail.com

Kevin Phipps
Scientific Computing Department
Rutherford Appleton Laboratory
Didcot, OX11 0QX, UK
Email: kevin.phipps@stfc.ac.uk

Daniel J Rolfe
Central Laser Facility
Research Complex at Harwell
Rutherford Appleton Laboratory
Didcot, OX11 0QX, UK
Email: daniel.rolfe@stfc.ac.uk

Abstract—The value of metadata to the scientist is well known: with the right choice of metadata, data files can be selected very quickly without having to scan through huge volumes of data. The ICAT metadata catalog[1] (which is part of the ICAT project[2]) allows the scientist to store and query information about individual data files and sets of data files, as well as storing provenance information. This paper explains how a generic job management system, exposed as a web portal, has been built on top of ICAT. This gives the scientist easy access to a high performance computing infrastructure without allowing the complexities of that infrastructure to impede progress.

The aim was to build a job and data management portal capable of dealing with batch and interactive work that would be simple to use and that was based on tried and tested, scalable, and preferably open source technologies. For the team operating the portal, it needed to be generic and configurable enough that they can, without too much effort, modify their software to run within the portal, add new software, and create new dataset types and parameters. Modifications to existing software should be limited to saving and loading datasets in a slightly different way, so that instead of just being saved to disk they are registered within the system, along with any provenance information.

I. INTRODUCTION

The ICAT Job Portal (IJP)[3] builds upon the tried and tested ICAT data catalog, an existing component written specifically to catalog datasets produced by scientific facilities. It uses ICAT as the central database component, which also provides authorization via a flexible rules based system. This means that users will only be shown datasets readable by them, and any datasets produced whilst using the Job Portal will also be protected by relevant permissions.

While developing a prototype portal to meet the needs of one group, it became apparent that it could be made generic and configurable enough to be used by a wide range of teams within the scientific community.

A. ICAT the metadata catalog

ICAT is a data catalog specifically aimed at scientific facilities, into which data are stored according to the following hierarchy of entities: Facility, Investigation, Dataset and Datafile. The "Facility" produces the data for a group of users associated with an "Investigation". Within the investigation, "Datafiles" are grouped into "Datasets".

Each entity has a small agreed set of attributes. To make the system extensible, parameter types can be defined and associated with one or more of the entity types. Actual parameters of those types can then be associated with the corresponding entities. For example, a parameter type could be defined for current measured in milliamps or for elapsed time measured in seconds.

Further entities (Application, Job, InputDataset and OutputDataset) allow the provenance of datasets to be stored within the catalog, such that it is possible to trace a derived dataset back through a chain of applications and intermediate datasets to the original raw dataset.

ICAT is implemented as a SOAP based web service using the mechanisms provided by the Java Persistence Architecture (JPA) to connect to a relational database. ICAT has rule based authorization and a powerful query language which is translated into the JPA query language (JPQL).
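To make the entity hierarchy and the parameter mechanism described above concrete, the sketch below models them as plain Python classes. It is purely illustrative: the real entities are JPA classes inside ICAT, and the names here are simplifications rather than the ICAT API.

    # Illustrative model of the ICAT entity hierarchy and its parameter
    # mechanism; not the real ICAT API.
    class ParameterType:
        def __init__(self, name, units, value_type):
            self.name = name              # e.g. "current"
            self.units = units            # e.g. "mA"
            self.value_type = value_type  # e.g. float

    class Parameter:
        def __init__(self, ptype, value):
            assert isinstance(value, ptype.value_type)
            self.type, self.value = ptype, value

    class Dataset:
        def __init__(self, investigation, name, dataset_type):
            self.investigation = investigation  # the owning Investigation
            self.name, self.type = name, dataset_type
            self.datafiles = []   # Datafiles are grouped into Datasets
            self.parameters = []  # Parameters of pre-defined ParameterTypes

    # A parameter type defined once by the facility...
    current = ParameterType("current", "mA", float)
    # ...can be attached to any entity type it is associated with.
    ds = Dataset("investigation-42", "run-0001", "raw")
    ds.parameters.append(Parameter(current, 3.2))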
The data files are not stored within ICAT itself, but within the ICAT Data Service (IDS)[4], as explained below.

B. IDS the ICAT Data Service

The IDS is a component, defined by its interface, which is able to store files and register their metadata in ICAT. It makes use of ICAT for authorization: if ICAT allows the file metadata to be written, then the IDS will allow the file to be written. Control of who can read follows the same pattern.

II. BACKGROUND

A. Use Case

This work was motivated by a request from the Lasers for Science Facility (LSF) of the UK Science and Technology Facilities Council (STFC) to help them with their data. The LSF operates the OCTOPUS imaging cluster[5], a central core of lasers coupled to a set of advanced interconnected microscopy stations that can be used to image samples from single molecules to whole cells and tissues. The facility had accumulated a large number of data files stored in a directory structure. It had both a range of applications to process and visualise those data[6] and an interactive program with an easy to use GUI that would scan through a selected part of the file system, collect information in memory about the data, then offer lists of raw and processed datasets and the ability to process those datasets with a fixed set of interactive jobs. The main problem with this solution was that it was not scalable: the user had to restrict himself¹ to a small part of the available data each time the GUI was launched, as the program had to scan the data afresh each time it was started, which took time proportional to the volume of data. In addition, the user needed a machine allocated to him, with a personal account on that machine, to allow him to run his work. This machine was hidden from off site users by a firewall, requiring his presence on site or the use of a VPN. Relieving the bottlenecks of data, job and user management would enable a significant improvement to the user experience and more effective exploitation of the OCTOPUS facility.

After development of a prototype solution, it was realised that a generic solution was needed, so that the LSF could quickly and easily add new dataset types and job types without going back to the developers for coding changes. Our funders also favoured a generic solution that could be deployed for other facilities, which led to the formulation of a set of requirements, some of which are listed in the next section.

¹ Gender specific terminology should be interpreted as non-gender specific throughout this paper.

B. Requirements

Following analysis of the prototype the requirements were refined. Some of the key requirements are listed below.

1) System accessible via both GUI and command line, and from both on and off site.
2) All the systems should have automated installation of OS and software updates.
3) Centralised user/group management.
4) A file server must be able to store raw data from microscopes, analysed data and other user data. All data must be backed up and "old" data migrated, with an easy mechanism to restore it when needed.
5) All data should be managed, with a single point at which to consult the metadata to find out what is where.
6) Ability to upload and download data.
7) The ability to submit batch jobs to a set of Linux nodes, some with CUDA GPU capability. Listing, cancelling and retrieving output from jobs must also be supported.
8) The ability to run interactive GUI based analysis/visualisation jobs able to access data.
9) The ability to select and submit multiple datasets for processing through applications. This must cover both multiple jobs with one dataset per job, and a single job which processes all selected datasets.
10) Any menus must be configurable, as must the types of datasets that can be stored, the jobs that can be run and the job parameters associated with a job type.
C. Possible solutions

Consideration was given to OMERO[7]; however, OMERO is more suited to viewing and performing simple analysis of images than to the specialised analysis codes developed by the LSF, and it does not meet requirement 10.

IBM's Platform Application Center[8] provides a means to describe jobs in XML and submit them; however, though it meets requirement 10, it fails requirements 4 and 5.

The Galaxy portal[9] is quite close to meeting our requirements and also provides workflow support. Its main drawback is that it describes itself as a genomics workbench and as such is too focused on one discipline. The Galaxy paper also contains interesting comparisons with other genomics workbenches.

As we have a good metadata catalog, ICAT, and a matched data service, the IDS, we decided to build directly on those components.

III. SYSTEM ARCHITECTURE

The architecture shown in Fig. 1 is based around a single head node, acting as a central point for all communications, and an extensible number of worker nodes, which can be added in the future to increase the job handling capacity of the system.

[Fig. 1. Architecture overview. The user's PC runs a web browser (connecting over https to the ICAT Job Portal webapp on the head node) and a Remote Desktop client (connecting over rdp to an assigned worker node). The head node runs the Torque batch server, which submits batch jobs to the worker nodes running the facility software.]

The head node, which is shown in more detail in Fig. 2, hosts an application server (currently Glassfish) running the Job Portal, ICAT and IDS software, and acts as the head node for a batch system (currently Torque[10]).

[Fig. 2. The head node: a JEE application server hosting ICAT (with its metadata database), the IDS (with its file storage) and the Job Portal (with its jobs database, XML job descriptions and job dataset parameters), alongside the Torque batch server.]

Worker nodes have this role within the batch system, but may also be assigned temporarily to a user for interactive work. They should be capable of running all the facility software that users require, and they are able to communicate with ICAT for metadata and with the IDS for data, both of which run on the head node.

A. Batch jobs

It is essential that a batch job belonging to one user cannot access the account of any other user. To achieve this, a batch job is submitted to run under an account chosen randomly from a pool. Each worker node is configured to run a very small number of concurrent jobs. The job has a prologue, which is run before the user's job; it tries to take a lock by creating a directory, which ensures that two jobs cannot run simultaneously under the same account. If it fails, it issues a return code that causes the job to be requeued. The epilogue, which is run after the job, frees the lock if it is run by the same job that created it. The pool of accounts should be sufficiently large that requeuing is rare. There is a mechanism to tidy up if things go wrong.
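A minimal sketch of this locking scheme follows. It assumes the Torque convention that a prologue exit status greater than 1 causes the job to be requeued; the lock directory location is invented for illustration, and the IJP's actual scripts may differ.

    # Sketch of the prologue/epilogue lock; LOCK_ROOT is hypothetical.
    # Torque passes the job id and the execution user name as the first
    # two arguments to both the prologue and the epilogue.
    import os
    import sys

    LOCK_ROOT = "/var/lib/ijp/locks"

    def prologue(job_id, account):
        lock = os.path.join(LOCK_ROOT, account)
        try:
            os.mkdir(lock)  # atomic: fails if another job holds the lock
        except OSError:
            return 2        # exit status > 1 asks Torque to requeue the job
        with open(os.path.join(lock, "owner"), "w") as f:
            f.write(job_id)  # record which job took the lock
        return 0             # lock taken: let the user's job run

    def epilogue(job_id, account):
        lock = os.path.join(LOCK_ROOT, account)
        owner_file = os.path.join(lock, "owner")
        # Free the lock only if this job is the one that created it.
        if os.path.exists(owner_file) and open(owner_file).read() == job_id:
            os.remove(owner_file)
            os.rmdir(lock)
        return 0

    if __name__ == "__main__":
        sys.exit(prologue(sys.argv[1], sys.argv[2]))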
B. Interactive Jobs

Although most batch systems do have some kind of interactive job capability, we found it convenient to provide the desired functionality outside the batch system. For these jobs, the most lightly loaded worker node is found, any running batch jobs are suspended, the node is made temporarily unavailable for new batch jobs, and the user is given exclusive use of that node to run the interactive job. To achieve this, the user is supplied with a username taken from a pool and a temporary password, allowing a remote desktop connection to be established to the worker node via the RDP protocol. This will typically be done either via the Remote Desktop Connection application on Windows or using the rdesktop command on Linux systems. The account will have been configured such that the interactive job the user requested starts automatically. The user is given only a short time to connect to the worker node before the password is removed. Once the user has logged out, the system removes the account, along with any local files that may be left, releases any suspended jobs on the machine and makes the machine available to the batch system again.

C. Ganglia monitoring

All nodes within the system are configured to make use of the Ganglia Monitoring System. Currently this is used to select the most lightly loaded machine in the cluster when an interactive job is requested. It allows a single XML stream from the Ganglia host on the head node to be parsed, giving an instant overview of the loading of each machine. Nagios monitoring is also installed, but it is not an essential part of the system.
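The selection itself can be done in a few lines, as in the sketch below: Ganglia serves the cluster state as a single XML document over a TCP connection, with one HOST element per machine carrying METRIC children. The host name and the choice of the load_one metric (the one minute load average) are illustrative assumptions about a default Ganglia setup rather than details taken from the IJP sources.

    # Sketch: pick the least loaded worker from the Ganglia XML stream.
    import socket
    import xml.etree.ElementTree as ET

    def least_loaded(host="head-node", port=8649):  # 8649 is gmond's default
        sock = socket.create_connection((host, port))
        chunks = []
        while True:  # gmond writes the whole document, then closes
            data = sock.recv(8192)
            if not data:
                break
            chunks.append(data)
        sock.close()
        root = ET.fromstring(b"".join(chunks))
        loads = {}
        for h in root.iter("HOST"):
            for m in h.iter("METRIC"):
                if m.get("NAME") == "load_one":
                    loads[h.get("NAME")] = float(m.get("VAL"))
        return min(loads, key=loads.get)  # name of the least loaded machine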
D. Job Status Information

The batch system is not well suited to holding job status information for an extended period. In addition, the portal needs to hold information about jobs that are not known to the batch system. The portal therefore maintains its own records and periodically harvests information from the batch system.

E. Command Line Interface

With the addition of a RESTful web service on the server, a Python client has been provided to allow interaction with the Job Portal via a command line interface. Both of these are very thin layers, totalling only a few hundred lines of code. This provides an alternative to the GUI which may prove to be the preferred way for more proficient users to interact with the portal, and it would be the interface of choice for anyone looking to write a script to handle their data processing.
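A script against such an interface might look like the sketch below. The endpoint path, parameter names and response format are hypothetical stand-ins for whatever the IJP's REST service actually exposes; only the overall shape (authenticate, submit over https, receive the batch id) is taken from the text.

    # Hypothetical scripting example; the /submit endpoint and its
    # parameters are invented for illustration.
    import json
    import urllib.parse
    import urllib.request

    BASE = "https://head-node:8181/ijp"  # hypothetical base URL

    def submit(session_id, job_type, dataset_ids):
        payload = urllib.parse.urlencode({
            "sessionId": session_id,  # obtained by logging in first
            "jobType": job_type,      # one of the configured job types
            "datasetIds": ",".join(str(i) for i in dataset_ids),
        }).encode()
        with urllib.request.urlopen(BASE + "/submit", payload) as resp:
            return json.load(resp)["jobId"]  # id assigned by the batch system

    # e.g. submit(session, "track", [17, 42])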
F. Automated Configuration

The installation, configuration and upgrading of the software have been set up using the Puppet Open Source[11] framework. This means that, starting with computers that have an operating system and are configured to use the network, it is possible to install the head node within an hour, and each worker node can be added in a few minutes. The result is a working system including the Java Development Kit, a Glassfish application server (running ICAT, the IDS and the Job Portal software), database servers and the required databases, the batch system, monitoring (Ganglia and Nagios) and the scientific software provided by the team operating the portal.

IV. CREATION AND USE OF METADATA

The use of metadata is essential to the operation of the IJP. Because it is a generic tool, the portal itself is not able to look inside domain specific datasets. It is entirely reliant on the metadata inserted into the ICAT database, and uses only this metadata for searching and displaying information.

When an instrument produces data, these are typically written to a local file store from which they can be ingested into the IJP system. The best people to define the metadata to associate with this raw data are the team conducting the experiment. An IJP job can be submitted each time data need to be ingested. This job must be able to derive the metadata from the available information and upload the data files themselves to the IDS, as well as creating entries in the ICAT database for the metadata.

When data are processed by an IJP job, this results in new data and metadata being stored. It is the responsibility of the job to identify useful pieces of metadata to allow datasets to be selected subsequently. As it is difficult to identify all the metadata that might eventually be useful, jobs can also be written to look at the data and add metadata to ICAT to hold more information about existing datasets.

The three categories of job described here (ingestion, derivation of processed data and augmentation of metadata) are all just jobs to the IJP, and must be installed by the facility for its users.
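As an illustration of the first category, an ingestion job might follow the pattern sketched below. The facility_lib module and all of its functions are hypothetical stand-ins for the wrapper libraries discussed in Section VI; only the pattern (derive metadata, register the dataset in ICAT, upload the files to the IDS) comes from the text.

    # Sketch of an ingestion job; facility_lib is hypothetical.
    import glob
    import os

    import facility_lib

    def ingest(raw_dir, investigation):
        files = sorted(glob.glob(os.path.join(raw_dir, "*.hdf")))
        # Derive metadata from the available information, e.g. file
        # headers or the facility's naming conventions.
        params = {
            "numHdfFiles": len(files),
            "numChannels": facility_lib.channels_in(files[0]),
        }
        # Create the Dataset entry, with its parameters, in ICAT...
        dataset = facility_lib.create_dataset(
            investigation,
            name=os.path.basename(raw_dir.rstrip("/")),
            dataset_type="raw",
            parameters=params,
        )
        # ...then store each file in the IDS, which registers a Datafile.
        for path in files:
            facility_lib.upload_to_ids(dataset, path)
        return dataset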
V. A USER'S VIEW OF THE PORTAL

Users access the job portal via a web browser, as shown in Fig. 3. The interface was developed in Java using the Google Web Toolkit[12] and communicates with a number of servlets running on the application server on the head node. Once logged in, the user is presented with a number of search options tailored to the user base of the portal, and a generic search widget listing all of the dataset parameters that are searchable. The widget provides the relevant search operators for each parameter (=, !=, >, >=, <, <=, LIKE and BETWEEN) depending on its type: string, numeric or date/time. The list of parameters and their types is read from the underlying ICAT database, so that the portal software remains generic.

[Fig. 3. Screenshot of IJP]

Within ICAT, all datasets have to be of a type which has been pre-defined before the dataset is registered. This allows for easier searching of datasets. Once a user has selected the type of dataset in which they are interested, they can narrow down their search if they wish using the search options, then click search. A list of matching datasets then appears in the central panel. When one of these datasets is selected, all of its dataset parameters are displayed in the lower panel. Having selected a dataset, the central Options select box lists all of the jobs that can be run on that dataset type. After selecting the desired job, a Job Options Form is displayed, allowing the user to pass particular parameters to the job if required. This form is automatically generated from an XML file defining the job within the system, as shown in Fig. 4. The options displayed can also be tailored so that only options relevant to the chosen dataset are offered.

Submitting the form results in the job being submitted to the server and the user receiving a response containing the ID assigned to the job in the batch system. The user can then use the Job Status tab to follow the progress of the job through the batch system, checking the output and error logs if required and monitoring the status until the job is complete.

As well as handling interactive and batch jobs, the portal is able to handle jobs that take either a single dataset or multiple datasets as input. Users are able to select multiple datasets, and the portal uses the job definition to work out whether to submit multiple jobs, each with a single dataset as input, or a single job with all the selected datasets as input. Where this is ambiguous, the user is asked to confirm what was intended.

Datasets remain registered within ICAT and available via the IDS. They are protected via a rule based permission system, which should have been configured to ensure that users can at least read the data they have created. These data will remain within the system. Should a user wish to download a copy of their data, this is possible via "Download" in the Options select box. There is also an option to display a URL from which the dataset can be obtained from the IDS.

VI. AN ADMINISTRATOR'S VIEW OF THE PORTAL

Configuration of the portal is defined by XML files. Each team using the portal to run their software needs at least one person who is familiar enough with the team's software, and the datasets it uses, to be able to set up each piece of software so that it can be run as a "job" by the portal. Firstly, there are two fairly straightforward tasks which need carrying out:

• picking out the characteristics of each dataset type which lead to different options being made available in the Job Options Form;

• creating an XML file describing each piece of software: whether it runs as an interactive or batch job, which type of datasets it needs, whether it accepts multiple input datasets, along with all of the various command line options that it accepts.

These two tasks are linked by the concept of Job Dataset Parameters. For each type of dataset, an XML file is set up allowing the administrator to define a named quantity and how it may be derived from an ICAT query. While the query can span all information the logged in user is allowed to see, a query might reasonably take into account information from the metadata associated with the dataset or any of its files, and it might make use of the JPQL aggregate functions SUM, AVG, MIN, MAX and COUNT.

The administrator has thus defined named quantities specific to a dataset type and derivable by an ICAT query; examples include the number of files of a particular type, or the size of the largest file in a dataset. When a dataset is selected within the browser, the server runs the database queries specified within the relevant XML file, generates a map of name-value pairs and sends it back to the browser to control what appears in the Job Options Form.

[Fig. 4. Configuration of job options]

Within the XML specifying each of the command line options for a job, as shown in Fig. 4, a condition can be specified in terms of the named quantities defined in the XML which, if met, causes the option to appear on the Job Options Form. This takes the form of a logical expression such as numChannels == 3 && numHdfFiles > 500. If multiple datasets are selected, only the options that are common to all of those datasets are offered to the user.
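A minimal sketch of evaluating such a condition against the name-value map is given below. It assumes only what the text states: the quantities arrive as a map and the condition uses C-style logical operators. A production implementation would use a proper expression parser rather than eval on a restricted namespace.

    # Sketch: decide whether a job option is visible, given the map of
    # named quantities computed by the Job Dataset Parameter queries.
    def option_visible(condition, quantities):
        # Translate the C-style operators into Python, then evaluate
        # the expression against the quantities map only.
        expr = condition.replace("&&", " and ").replace("||", " or ")
        return bool(eval(expr, {"__builtins__": {}}, dict(quantities)))

    # The example condition from the text:
    assert option_visible("numChannels == 3 && numHdfFiles > 500",
                          {"numChannels": 3, "numHdfFiles": 750})
    assert not option_visible("numChannels == 3 && numHdfFiles > 500",
                              {"numChannels": 2, "numHdfFiles": 750})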
In addition to setting up these XML descriptor files, a certain amount of work needs to be done to make the team's existing software compatible with the job portal. This can be done either by modifying the existing applications or by providing job wrappers to perform tasks such as obtaining data from the IDS and laying it out as the program expects, storing resulting datasets back in the IDS, and recording provenance information. Python libraries are being established to simplify these operations: there is a generic library, and we recommend using a facility specific library that knows the facility's conventions for the layout of data.
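A wrapper for an unmodified legacy program might then look like the sketch below. As with the ingestion sketch in Section IV, facility_lib and the legacy_tracker program are hypothetical; the sketch only illustrates the tasks listed above.

    # Sketch of a job wrapper; facility_lib and legacy_tracker are
    # hypothetical.
    import os
    import subprocess
    import tempfile

    import facility_lib

    def run_wrapped(dataset_id, extra_args):
        workdir = tempfile.mkdtemp()
        # Fetch the input dataset from the IDS and lay it out on disk
        # the way the legacy program expects.
        inputs = facility_lib.download_from_ids(dataset_id, workdir)
        subprocess.check_call(["legacy_tracker"] + extra_args + inputs)
        # Store the results as a new dataset via the IDS...
        out = facility_lib.store_dataset(os.path.join(workdir, "out"),
                                         dataset_type="tracked")
        # ...and record provenance (Job, InputDataset, OutputDataset).
        facility_lib.record_provenance(application="legacy_tracker",
                                       inputs=[dataset_id], outputs=[out])
        return out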
VII. CURRENT STATUS

Having developed a prototype to prove the concept and to help the users define the features that they need from the IJP, we are currently completing the work and plan to have a first deployment for production use in a few months' time.

VIII. FUTURE DEVELOPMENTS

We anticipate that requirements will be clarified further once we receive feedback from users of the deployed production system. Based on existing feedback we are already planning the following enhancements.

A. Visualisation of Provenance

Provenance information is stored within ICAT when a new dataset is stored, but there is currently no way to visualise this information within the portal. A new panel will be added to represent the provenance information in a graphical format. This will allow the user to select the dataset they are interested in and expand it to see the input and output datasets and files associated with it. Those datasets and files can, in turn, be selected and expanded to follow the chain of provenance.

A further development would be the addition of a provenance based search facility, which would allow searches such as all datasets derived from a given dataset, or all datasets produced directly or indirectly by a specific version of an application.

B. Workflow Support

It would be a particularly useful feature to have the Job Portal integrated with a Workflow Management System. This would make it possible to set off a chain of data processing jobs, with the output of the first job becoming the input to later jobs, and so on. As the job relating to each stage of the process completes, the next job in the workflow is automatically submitted on behalf of the user.

One workflow management system of particular interest is Taverna[13]. It is open source, domain independent and written in Java, and therefore should integrate well with the server side of the portal software, which is also written in Java. Taverna has already been used behind a portal[14] by a number of projects, which demonstrates its suitability for use in this way.

C. Software as Data

Initially our preferred solution for deploying facility software was to use the native packaging system of the operating system, typically RPMs or DEBs. While convenient for the IJP developers, this may not meet the needs of some of our users, who would like more freedom to run software of their own choosing without arranging to have it officially installed, and who want to have multiple versions of the software available. Following a suggestion[15], we are considering the implications of storing a job as a dataset known to ICAT. A job wrapper would then first download the application software before setting up the data and running the downloaded application. This would probably require some kind of caching mechanism, and it would require a means of specifying software dependencies to ensure that the correct packages are available for the desired software. This solution, which will require some operations to be run as root to install dependencies, needs careful evaluation.

D. Administration console

The system is currently rather opaque to the administrator and requires the use of the native batch system commands to find out what is going on. We plan to provide a browser based web application allowing administrators to monitor and control the system and the jobs it is running. This will support common tasks such as monitoring job distribution and loading on the worker nodes, pausing and terminating jobs, taking worker nodes offline and bringing them back online, user and group administration, modification of authorization rules and removal of unwanted datasets.

E. Alternative remote desktop mechanism

A possible alternative to using either Remote Desktop Connection on Windows or rdesktop on a Linux system is to have the remote desktop session also run within the browser. Currently, the RDP server port needs to be accessible on each of the worker nodes, which is not a problem within the local site network. The system is, however, intended to be used remotely from other institutions, where exposing this port may contravene security policies. Running the remote desktop session over https within a browser may be the solution. One solution of interest is Guacamole[16], an HTML5 clientless remote desktop. It supports remote desktop protocols such as VNC and RDP, and is able to deliver a remote desktop within a web browser without the need for any browser plugins or client software installation.

F. Alternative batch system

We currently support only Torque as a batch system. We plan to include Maui as a scheduler, because the inbuilt Torque scheduler (pbs_sched) is very basic. Maui would enable scheduling policies to be defined, allowing more control over which job is selected to run when a slot becomes free.

We also plan to make the choice of batch system configurable. The batch system might even act as a front-end to a grid or cloud solution. We already have a request to support IBM Platform LSF[17].

G. Portability

The Puppet configuration is only available for Ubuntu[18] and has only been tested on version 12.04 (64 bit). This is a concern for existing infrastructures which cannot easily accommodate these choices. We plan to make the system easy to install on other platforms and to support alternative subcomponents where practical.

We have a request to support Red Hat Enterprise Linux[19] version 6 (64 bit) and will probably include CentOS[20] version 6.4 (64 bit) at the same time.

IX. CONCLUSION

We have successfully built a job portal for ICAT users on top of the basic metadata catalog and the IDS. The initial prototype was very valuable, as it allowed us to get something out quickly, to ensure that we were on the right track and to understand what needed generalising.

Though the generalisation was not a trivial task, the result is a tool that we believe is now very easy to configure for many scientific disciplines.

The IJP allows rapidly changing, mature and wrapped "legacy" software to be made available, side by side, with a uniform and modern style of interface to a scientific community.

We already have a number of groups from the existing ICAT community interested in the project, and we anticipate a good uptake of the software.

ACKNOWLEDGMENT

The authors would like to thank Dave Clarke from STFC's Lasers for Science Facility for supporting this work and attracting funding.

We would also like to acknowledge the assistance and funding from STFC's Harwell Imaging Partnership, which has supported this development from inception (http://www.stfc.ac.uk/hip).

The diagrams in this paper were produced by Noris Nyamekye.

Finally we thank our colleagues Brian Mathews, Alistair Mills and Erica Yang, who all provided helpful comments on drafts of this paper.

REFERENCES

[1] The ICAT Metadata Catalog website. [Online]. Available: http://code.google.com/p/icatproject/
[2] The ICAT project website. [Online]. Available: http://www.icatproject.org/
[3] The ICAT Job Portal website. [Online]. Available: http://code.google.com/p/icat-job-portal/
[4] The ICAT Data Service website. [Online]. Available: http://code.google.com/p/icat-data-service/
[5] D. T. Clarke, S. W. Botchway, B. C. Coles, S. R. Needham, S. K. Roberts, D. J. Rolfe, C. J. Tynan, A. D. Ward, S. E. D. Webb, R. Yadav, L. Zanetti-Domingues, and M. L. Martin-Fernandez, "Optics clustered to output unique solutions: A multi-laser facility for combined single molecule and ensemble microscopy," Review of Scientific Instruments, vol. 82, no. 9, p. 093705, 2011. [Online]. Available: http://link.aip.org/link/?RSI/82/093705/1
[6] D. Rolfe, C. McLachlan, M. Hirsch, S. Needham, C. Tynan, S. Webb, M. Martin-Fernandez, and M. Hobson, "Automated multidimensional single molecule fluorescence microscopy feature detection and tracking," European Biophysics Journal, vol. 40, no. 10, pp. 1167–1186, 2011. [Online]. Available: http://dx.doi.org/10.1007/s00249-011-0747-7
[7] The OMERO website. [Online]. Available: http://www.openmicroscopy.org/site/products/omero
[8] The IBM Platform Application Center. [Online]. Available: http://www.ibm.com/support/entry/portal/documentation_expanded_list/software/platform_computing/platform_application_center
[9] J. Goecks, A. Nekrutenko, J. Taylor, and The Galaxy Team, "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences," Genome Biology, vol. 11, no. 8, p. R86, 2010. [Online]. Available: http://genomebiology.com/2010/11/8/R86
[10] Adaptive Computing's Torque website. [Online]. Available: http://www.adaptivecomputing.com/products/open-source/torque/
[11] The Puppet Open Source website. [Online]. Available: https://puppetlabs.com/puppet/puppet-open-source/
[12] The Google Web Toolkit website. [Online]. Available: https://developers.google.com/web-toolkit/
[13] P. Missier, S. Soiland-Reyes, S. Owen, W. Tan, A. Nenadic, I. Dunlop, A. Williams, T. Oinn, and C. Goble, "Taverna, reloaded," in SSDBM 2010, M. Gertz, T. Hey, and B. Ludaescher, Eds., Heidelberg, Germany, June 2010. [Online]. Available: http://www.taverna.org.uk/pages/wp-content/uploads/2010/04/T2Architecture.pdf
[14] Taverna: Behind a portal. [Online]. Available: http://prototype.taverna.org.uk/introduction/taverna-in-use/portal/
[15] Rich Wareham, Cambridge, private communication, 2012.
[16] The Guacamole website. [Online]. Available: http://guac-dev.org/
[17] IBM Platform LSF. [Online]. Available: http://www.ibm.com/systems/technicalcomputing/platformcomputing/products/lsf/index.html
[18] The Ubuntu website. [Online]. Available: http://www.ubuntu.com/
[19] Red Hat Enterprise Linux. [Online]. Available: http://www.redhat.com/products/enterprise-linux/
[20] CentOS: The Community ENTerprise Operating System. [Online]. Available: http://www.centos.org/