A Model of Data Processing Pipeline for Space Weather Analysis and Forecast*

Minh-Duc Nguyen [0000-0002-5003-3623]

Skobeltsyn Institute of Nuclear Physics, Lomonosov Moscow State University, Moscow, Russia
nguyendmitri@gmail.com

* Supported by the Russian Science Foundation, grant #16-17-00098. Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Space weather is a branch of space physics that studies various factors in near-Earth space, such as the solar wind, magnetosphere disturbances, and solar proton events, which have a massive impact on the Earth. In practice, data measured by different satellite instruments must be gathered and appropriately transformed before they can be used in space weather analysis and forecasting. The data processing pipeline involves a large number of programs and requires in-depth technical knowledge of both the satellite instruments and the programming tools so that data are processed correctly. Building such a pipeline is time-consuming and error-prone, and the correctness of its output is one of the critical factors that define the success of an analysis or forecast model. This work proposes a model that describes how a data processing pipeline can be organized and how to build a distributed data processing system based on the proposed model.

Keywords: Data processing · ETL · Space weather analysis and forecast · Space physics

1 Introduction

Space weather is a branch of space physics that studies complex processes, so-called space weather factors, happening in near-Earth space. The main driving force of these processes is high-energy particles (protons, electrons, and alpha particles), mostly ejected during solar events, that head from the Sun toward the Earth and directly impact the heliosphere, the magnetosphere, and satellite and ground systems. Some of these processes, such as quasi-stationary solar wind fluxes, solar proton events, and fluences of outer radiation belt electrons, are well known and broadly studied. Monitoring these factors in real time and forecasting their impact on satellite and ground systems are critical missions. Over the last several decades, many space experiments have been launched to accomplish these missions. Hundreds of satellites orbit the Earth, collecting data measured by various instruments in real time. The data are later used to analyze space weather factors and to develop operational models that describe and predict the behavior of these factors and their impact.

Scientists encounter three significant challenges each time a new space weather study starts. The first is searching for datasets that cover a period when specific space weather events happened; in practice, datasets are not always complete, and missing data are a common issue. The second is finding alternative datasets that cover an interval for which the primary datasets have missing data. The third is transforming datasets presented in various formats into a common form and normalizing them so that they can be used together. Solutions to these challenges are still an active research topic. To date, no solution provides a smooth data acquisition experience that meets the demands of the space weather community.
Several ongoing projects address these challenges, such as the Planetary Data System [1] and Euro Planet [2]. However, due to the complexity and scale of these projects, it is still hard for individual researchers to benefit from their results. While the number of data products provided by these projects is vast, the available search engines and APIs lack the flexibility that would allow researchers to search and retrieve data without specific technical knowledge.

To fulfill the needs of individual researchers, a Satellite Data Downloading System (SDDS) [3] has been created at the Skobeltsyn Institute of Nuclear Physics, Lomonosov Moscow State University. The system collects data from the satellites most used in space weather research and provides a user-friendly API that allows researchers to search and retrieve data with ease. The system is based on a data processing pipeline model that is considered in detail in this paper.

2 The data processing pipeline model

The data processing pipeline for a satellite instrument is the process of reconstructing instrument and payload data at full resolution while removing any and all communications artifacts, such as synchronization frames and communication headers. During this process, data products are produced at various levels, ranging from Level 0 to Level 4 [4]. Level 0 products are raw data at full instrument resolution and are not used in research due to communication artifacts; calibrated data products of Level 1A or higher are used instead. A data provider might offer data products at various levels. To use data from different instruments together, they must be processed to the appropriate level, applying all radiometric and geometric coefficients and georeferencing parameters.

Despite the diversity of satellites and their instruments, they all share common procedures in the data processing pipeline. The data processing pipeline model of the SDDS system (the Model) describes these common procedures and the actions that should be taken in each of them. The Model involves the following entities:

– the data processing system (the system);
– the data source;
– the gateway that connects the data source to the Internet;
– the source file provided by the data source (which can be at various levels);
– the satellite;
– the instrument that is set up on board the satellite;
– the instrument file that contains the scientific payload;
– the local server where the data processing system runs;
– the data storage where source files, instrument files, and processing results are stored;
– the database.

The Model splits a typical data processing pipeline into seven stages:

1. connecting to a data source;
2. checking for new source files;
3. downloading the new source files;
4. extracting instrument files from the source files;
5. processing the instrument files;
6. loading the processing result into the database;
7. moving all files to the data storage.

In the connecting stage, the data processing system establishes a connection to the data source. If the data source is behind a gateway on a private subnet, the system creates a VPN connection to the gateway, adds the necessary routes to the routing table, and establishes another connection to the data source. If required, the system also authenticates itself against the data source.

In the checking stage, the system searches for new source files, either by comparing the remote file list with the local one or by using a receipt file that contains links to the new source files. A source file is considered new if it does not exist in the local data storage or if its last modification time or file size differs from the local copy.
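As an illustration, the newness check can be sketched in Python as follows. This is a minimal sketch under assumed names (RemoteFile, is_new, and select_new_files are illustrative, not part of the actual SDDS code, and the real system may compare metadata differently).

import os
from dataclasses import dataclass

@dataclass
class RemoteFile:
    name: str
    size: int        # size in bytes reported by the data source
    mtime: float     # last modification time as a POSIX timestamp

def is_new(remote: RemoteFile, local_dir: str) -> bool:
    """A source file is new if no local copy exists or if its size
    or last modification time differs from the local copy."""
    local_path = os.path.join(local_dir, remote.name)
    if not os.path.exists(local_path):
        return True
    stat = os.stat(local_path)
    return stat.st_size != remote.size or stat.st_mtime != remote.mtime

def select_new_files(remote_list: list, local_dir: str) -> list:
    # Stage 2: compare the remote file list with the local one.
    return [f for f in remote_list if is_new(f, local_dir)]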
In the downloading stage, the system downloads the new source files to the local server. After downloading, the system calculates checksums of the files to verify their correctness. Network issues are handled by the system in this stage: if the connection drops during a download session, the system tries to reconnect to the data source and resume the download.

In the extracting stage, the system reconstructs the instrument files from the source files. If a source file is a zip archive, the system uncompresses it first. If several instrument files are packed into a single source file in a custom binary format, the system unpacks the instrument files using the format specification.

In the processing stage, the system executes special programs, so-called decoders, to transform the payload from a lower to a higher level and store the result in CSV format. The system might perform additional post-processing routines to produce higher-level data, such as Levels 2, 3, and 4. A set of instrument files can be processed in parallel. If there is a dependency between files of different instruments, the system executes the processing routines in a strict order.

In the loading stage, the system loads the processing result in CSV format into the database. The schema and table structure depend on the hierarchy of instruments and data channels. In this stage, data at different resolutions are calculated inside the database from the original resolution.

In the moving stage, the system moves all files to the long-term data storage. The file and directory structure of each satellite reflects the hierarchy of its instruments. Depending on their size, files can be split according to a time-based period: by year, by month, or by day.

3 Technical Implementation

The components of the SDDS system responsible for the data processing pipeline were implemented based on the Model described above. Most of them were designed using the microkernel pattern [5], widely used in operating system design. The idea of the microkernel pattern is that the primary business logic is implemented in a core component, while everything else is implemented as pluggable modules that can be loaded and executed dynamically at run time. The interface between the core component and the modules is well-defined, so different versions of a module, or modules with similar features, can be used interchangeably. This pattern ensures the flexibility and scalability of the resulting system as well as the isolation of its components.

The common logic of the data processing pipeline is implemented in a base class representing an abstract satellite controller. The varying logic is described in configuration files, each of which belongs to a specific satellite controller. Data sources and instrument data decoders are described in the configuration file in JSON format, along with other parameters such as the order in which files from multiple data sources are processed. Each decoder has its own configuration file. The stages of the data processing pipeline are also defined in the configuration file. For example, if the data source already provides instrument files, the extracting stage can be omitted, so only six stages are defined in the configuration file.
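A configuration file along these lines might look as follows. This is a hypothetical sketch: the key names, the URL, the decoder entry, and the stage identifiers are illustrative assumptions, not the actual SDDS schema. Note that the extracting stage is omitted here, leaving six stages, as described above.

{
  "satellite": "SDO",
  "data_sources": [
    {
      "name": "sdo-primary",
      "protocol": "https",
      "url": "https://example.org/sdo/data/",
      "order": 1
    }
  ],
  "decoders": {
    "solar_image": "decoders/solar_image.json"
  },
  "stages": [
    "connect", "check", "download",
    "process", "load", "move"
  ]
}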
The abstract base controller has an interface consisting of a set of methods, each of which performs the common actions of one stage. Each method has a set of standard-type input parameters and two function-type parameters. The first function-type parameter is a pre-processing function called before any common actions in the method; the second is a post-processing function called after all common actions in the method. A specific satellite controller is implemented as a class derived from the base class. In the derived class, the base methods might be reused as-is. The derived class might also define its own methods that are passed to the base methods as pre- and post-processing functions. If the logic of the data processing pipeline is complex, the base methods can be redefined completely in the derived class.

[Figure: the abstract base controller and its microkernels, e.g. config loading (JsonConfig), connection (HTTPS/HTTP connector), new-data checking (catalog crawler), downloading (python-requests), data extraction (FITS extractor), processing (solar image decoder), DB loading (PostgreSQL), and storage loading (rsync via ssh).]
Fig. 1. The base controller and microkernels implementing the data processing pipeline of the SDO controller

The actions performed inside a base method are implemented as microkernels that can be loaded and called dynamically, depending on the metadata of the satellite described in a configuration file. For example, if the data source uses HTTPS as the connection protocol, then when the base method responsible for establishing a connection is called, it executes the microkernel method for HTTPS connections. The same approach is used in the implementation of the other stages. The controller class implementing the data processing pipeline of the Solar Dynamics Observatory (SDO) [6] satellite is shown in Fig. 1.

The microkernel approach also fits the implementation of the instrument data processing stage. When the base method is called, it in turn calls the specific decoder used to process the instrument data. If several instruments use the same data format, a single decoder that processes that format can be reused.

The satellite controller can run in automatic mode or in interactive mode through the command-line interface. In automatic mode, the controller passes each source file through the whole data processing pipeline. In interactive mode, the actions of a specific stage can be executed against the input file manually when the corresponding argument is passed. The interactive mode makes it possible to adapt the controller to complex processing scenarios. For example, when a number of files need to be reprocessed, instead of passing them one by one through the pipeline, one can run the controller in interactive mode to download all files to a temporary buffer first and only then start processing them.
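The pre-/post-hook interface described above can be sketched in Python as follows. This is a minimal sketch under assumed names: BaseController, SdoController, and the method names are illustrative, not the actual SDDS classes.

from typing import Callable, Optional

# A hook receives the path of the file being handled in this stage.
Hook = Optional[Callable[[str], None]]

class BaseController:
    """Abstract satellite controller: each stage method performs the
    common actions and accepts optional pre-/post-processing hooks."""

    def __init__(self, config: dict):
        self.config = config

    def download(self, source_file: str,
                 pre: Hook = None, post: Hook = None) -> None:
        if pre is not None:
            pre(source_file)    # called before any common actions
        # ... common downloading logic shared by all satellites ...
        if post is not None:
            post(source_file)   # called after all common actions


class SdoController(BaseController):
    """A derived controller reuses the base methods and passes its own
    functions as pre- and post-processing hooks."""

    def _verify_checksum(self, source_file: str) -> None:
        pass  # satellite-specific post-processing, e.g. a checksum check

    def run_download(self, source_file: str) -> None:
        self.download(source_file, post=self._verify_checksum)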
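Likewise, dynamic loading of microkernels by protocol can plausibly be done with Python's importlib; the sketch below assumes a hypothetical microkernels package whose module names encode the stage and the protocol, which is an assumption about the layout rather than the actual SDDS implementation.

import importlib

def load_microkernel(stage: str, kind: str):
    """Load the microkernel for a stage from satellite metadata,
    e.g. stage='connect', kind='https' maps to the module
    'microkernels.connect_https' (hypothetical package layout)."""
    module = importlib.import_module(f"microkernels.{stage}_{kind}")
    return module.run  # each microkernel is assumed to expose run()

# Usage (assuming such a package exists):
#   connect = load_microkernel("connect", "https")
#   connect(data_source_config)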
4 Conclusion

The data processing pipeline model described in this paper has been used as a baseline to design and implement the SDDS system. Applying the microkernel pattern reduces the development time required to support new satellite instruments, especially when they share features or data formats with instruments already supported. The flexible yet well-defined interface of the abstract base controller makes it possible to maintain compatibility across components.

Currently, data from the twenty most used satellites, as well as geomagnetic indices, are processed by the SDDS system. Processing pipelines are executed on a regular basis, with frequencies ranging from every 5 minutes to once a day.

References

1. The Planetary Data System, https://pds.nasa.gov. Last accessed 29.06.2020
2. Euro Planet, http://www.europlanet-vespa.eu. Last accessed 29.06.2020
3. Nguyen, M.-D.: A Scientific Workflow System for Satellite Data Processing with Real-Time Monitoring. EPJ Web of Conferences 173, 05012 (2018). https://doi.org/10.1051/epjconf/201817305012
4. National Aeronautics and Space Administration: Earth Observing System Data and Information System (EOSDIS) Handbook, https://cdn.earthdata.nasa.gov/conduit/upload/5980/EOSDISHandbookWebFinal1.pdf. Last accessed 29.06.2020
5. Richards, M.: Software Architecture Patterns. O'Reilly Media, Sebastopol, CA (2015)
6. National Aeronautics and Space Administration: Solar Dynamics Observatory, https://sdo.gsfc.nasa.gov. Last accessed 29.06.2020