Matrix Completion for Storm Damages Prediction

                   Quang-Khai Tran1, 2, Jung-Ho Um1, Sa-kwang Song1, 2 (*)
1
    Korea Institute of Science and Technology Information (KISTI). Daejeon, Republic of Korea
             2
               University of Science and Technology (UST). Daejeon, Republic of Korea
                        {khai.tran, jhum, esmallj}@kisti.re.kr


         Abstract. Forecasting weather disasters is very important, but still remains a big
         challenge for science. Aiming to tackle this issue, our study attempts to predict
         storm damages by using Semantic Web data (SRBench) and techniques (matrix
         completion methods and Statistical Unit Node Set framework). Preliminary expe-
         riments try predicting which regions are likely to be hit by the most deadly storms
         like hurricane Katrina (in USA, 2005). Result shows that, even with incomplete
         data, the approach can determine highly threatened locations at different time-
         steps. It also hints the ability to forecast different storm damage scenarios.
         Key words: matrix completion, statistical unit node set, storm damages prediction.


1 Introduction

In 2005, hurricane Katrina hit the US, killed more than 1800 people and caused about
$108 billion of property loss [4]. Unfortunately, prediction of storm's intensity and
track is still a big challenge for modern meteorological models. Approaching the issue
differently to such models, which often require good and sufficient data, and inspired
by work in [6] and [2], we propose an alternative to estimate the likelihood that storm
damages may happen to some locations, based-on Semantic Web (SW) technologies.
   We consider 5 hurricanes in the US provided in the SRBench data set [7]: Charley
(2004), Katrina (2005), Wilma (2005), Gustav (2008) and Ike (2008). For learning
and predicting, Matrix Completion (MC) methods, used in the Statistical Unit Node
Set (SUNS) framework [6], are employed as the multivariate regression approach.
Usually, SW data is incomplete, but multivariate learning has shown its strength in
dealing with information-missing data [2]. In [2, 3, 6], the same research group com-
bined SPARQL with MC under SUNS framework and applied it successfully with
several SW data sets (such as friend-of-a-friend and gene-disease relationships).


2 Methodology

Known triples of storms' data are trans-coded into matrices by using SUNS approach,
and MC process based-on Singular Value Decomposition (SVD) technique is used for
filling the missing value of unknown triples.

(*) Corresponding author
2.1 Constructing data matrix from SRBench

We consider “Storm” and “Location” as the center concepts, and their relationship is
the objective, of the learning and predicting processes. Under the SUNS framework,
they become statistical units of interest, with relationship represented by the triple
form (subject, predicate, object) of the Resource Description Framework. Hence, a
triple              is to indicate that storm- may “hit” location- . Considering the
time-series of streaming data, the triple is extended as               . Value of this
extended triple is 1 if is occurring over at time , and 0 otherwise.


                Fig. 1. Example: 6 locations on the track of hurricane Katrina.


           Wind-   Wind-   Press-                   Press-     Wind-
                                    Lat     Long
            dir    speed    ure                      ure       speed
           Location's Attributes          Storm's Attributes
               (Observed)                    (Observed)

           Aggregated Matrix
           New observed data                                           Unknown triples

                    Fig. 2. Aggregated matrix for learning and prediction.

   Fig. 1 shows an example of a grid of 6 locations hit by hurricane Katrina. At tim e
  , truth values of 6 hitting triples corresponding to 6 locations form the i-th row of
the hitting matrix. So that, an -to-6 matrix is generated for time-steps of the storm.
To perform multivariate regression, data of weather attributes in each location (wind-
direction, wind-speed, pressure) and attributes of the storm (latitude, longitude, pres-
sure, wind-speed) is also used to form the columns of co-variate matrices in the same
way, but filling the matrices with observed values rather than truth values.
   Fig. 2 represents the data matrix aggregated from the above co-variate matrices
and hitting triple matrix, with new data of observation (input data of prediction) added
as new rows. For predicting the occurrence of the storm in the future, for example at
next 6 hrs, hitting data of the next 6hrs will replace the matrix of current hitting triple,
or it can be used to expand the aggregated matrix (column-expansion).
2.2 Matrix Completion

SUNS framework uses MC methods based-on SVD factorization for filling unknown
entries in the aggregated matrix. Low-rank SVD decomposition is defined as formula
(1), with    is a data matrix,   and are orthonormal matrices, and    is a diagonal
matrix formed from the -biggest eigenvalues:

                                                                                           (1)

   With a training data matrix , MC is to find a matrix model , which can be con-
sidered as a generalization of , via low-rank SVD decomposition. In [2], authors
introduced Reduced-rank Penalized Regression (RRPR1) algorithm:

                                                                                           (2)


     where         is an approximation of   ,    is derived from SVD factorization of     and
                      is    (with    is the -th eigenvalue and    is the balance parameter).
  In Table 1, two algorithms used in our study are presented. They are adapted from
SVD-Impute algorithm [1] and SOFT-Impute algorithm [5].

                  Table 1. Two matrix completion algorithms: “Naive” SVD and RRPR
                        ( is the set of known entries (non-zero) in the data set).

         “Naive” SVD (SVD-Impute)                                      RRPR
    Step 0: set     =0                              Step 0: set   =0
    Step 1:                                         Step 1:

    Step 2:    ,     and     are derived from       Step 2:    ,     and     are derived from
             SVD( ), with                                    SVD( ), with
    Step 3: is reconstructed by formula (1)         Step 3: is reconstructed by formula (2)
    Step 4: if is not converged, go to Step 1       Step 4: if is not converged, go to Step 1


3 Experiment

In preliminary experiments, data of hurricane Katrina [4] is tested, with 6 locations in
Fig. 1, and 3 attributes per location and 4 attributes of the storm (like in Fig. 2) over
31 6-hr time-steps (31 rows and 22 columns). However, there are just 80 observed
entries of 6 locations (558 entries in total), together with 124 observed entries of the
hurricane, to form a sparse co-variate matrix (density ~29.9%). It is aggregated with a
31x6 matrix of current-hitting-triple and a 31x6 matrix of next-6-hr-hitting-triple to

1   The authors used the abbreviation name “RRPP”.
form a 31x34 training matrix (density ~25.8%). For testing, two observations of new
locations' attributes and current hitting statements are expanded as new rows (entries
of the next-6-hr-hitting-triple of each row are set to 0 (unknown)).
   This data is limited, but still meaningful for testing our idea. Results show that two
algorithms fill the missing entries with similar patterns of other similar training obser-
vations. Both predict (0.1, 0.7, 0.1, 0.0, 0.0, 0.0) for expected pattern (0, 1, 0, 0, 0, 0)
and (0.0, -0.1, -0.2, 0.0, -0.1, 0.1) for (0, 0, 0, 0, 0, 0). This means that the pattern of
hurricane Katrina is reflected well, despite of missing information. In comparing root
mean square error and relative error of the training process, RRPR performs slightly
better than “Naive” SVD (6.687x10 -2 and 2.245x10-3 comparing to 6.697x10-2 and
2.248x10-3, respectively). However, as the data is simple, the difference is very small,
and two algorithms result in the same predictions.


4 Conclusion and Discussion

Even though the tested data is limited and incomplete, the recognition of hurricane
Katrina's occurrence pattern indicates strongly that MC algorithms can be used for
forecasting storm damages. Moreover, using SUNS approach, other types of data in
SW can be used to investigate other types of disaster damages.
   So far, research on bridging SW resources and weather disasters prediction seems
to be undiscovered, and we can not find other similar works. In next stage, data of 5
storms will be combined for predicting storm damage index related to population of
some counties in the US, and we target to add new algorithm and heuristics for impro-
ving SUNS approach. It can be applied for forecasting damages of different disaster
scenarios, which is very meaningful in the context of global climate change situation.


References

1. Hastie, T., Tibshirani, R., Sherlock, G., Eisen, M., Brown, P., & Botstein, D. (1999).
   Imputing missing data for gene expression arrays.
2. Huang, Y., Tresp, V., Bundschus, M., Rettinger, A., & Kriegel, H. P. (2011). Multivariate
   prediction for learning on the semantic web. In: Inductive Logic Programming (pp. 92-104).
   Springer Berlin Heidelberg.
3. Jiang, X., Huang, Y., Nickel, M., & Tresp, V. (2012). Combining information extraction,
   deductive reasoning and machine learning for relation prediction. In The Semantic Web:
   Research and Applications (pp. 164-178). Springer Berlin Heidelberg.
4. Knabb, R. D., Rhome, J. R., & Brown, R. D. (2005). Tropical Cyclone Report: Hurricane
   Katrina. Technical Report, NOAA National Hurricane Center, 23–30 August.
5. Mazumder, R., Hastie, T., & Tibshirani, R. (2010). Spectral regularization algorithms for
   learning large incomplete matrices. Journal of Machine Learning Research 11, (2287-2322).
6. Tresp, V., Huang, Y., Bundschus, M., & Rettinger, A. (2009). Materializing and querying
   learned knowledge. Proc. of IRMLeS, 2009.
7. Zhang, Y., Duc, P. M., Corcho, O., & Calbimonte, J. P. (2012). SRBench: A Streaming
   RDF/SPARQL Benchmark. In: The Semantic Web–ISWC 2012 (pp. 641-657). Springer
   Berlin Heidelberg.