Data driven case base construction for prediction of success of marine operations

Bjørn Magnus Mathisen, Agnar Aamodt, Helge Langseth
Norwegian University of Science and Technology
bjornmm@ntnu.no, agnar@ntnu.no, helge.langseth@ntnu.no

Abstract. It is common for an organization to hold large amounts of recorded data that it wants to use to improve a process or to provide new services or products. Starting with one primary data set, we describe a system that enhances this data set to a level where it can be used by a deep learning system. The deep learning system then creates a model based on this data set, trying to predict operational windows for marine operations. Using this model, the system extracts cases for use in a CBR system aimed at providing operational support. This paper describes the partial implementation and results of this system.

Keywords: Data Science, Deep Neural Networks, Data Analytics, Case-based Reasoning

1 Introduction

Critical operations are often meticulously planned and subject to many parameters that decide if and how these operations are performed. Some of these parameters are called operational time windows, which in marine environments are often connected to external factors such as weather.

This paper uses machine learning to predict favorable operational time windows, or to warn of unfavorable ones, so that critical operations can be planned with better accuracy, e.g. when the operation should ideally take place. One way of doing this is to look at historical data of previously executed operations. By combining data on successful and unsuccessful operations with the relevant context of each operation, we create a data set that can be used to find indicators for success or failure in advance.
Which context is relevant depends on the nature of the operational window: wind and fog are important contexts for aviation, while waves and current are important for marine operations but not for aviation. This paper focuses on marine operations, and we analyze event data captured from boats moving in and out of zones connected to aquaculture installations. Next, we calculate the duration of these events and connect them to the relevant context and the associated success or failure classification.

The data used in this analysis is gathered as part of the EXPOSED project (footnote 1). This project aims to develop enabling and applied technologies for exposed aquaculture operations. The work we describe aims to improve planning of operations on aquaculture installations at exposed locations.

Footnote 1: http://exposedaquaculture.no/en/

Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purpose. In Proceedings of the ICCBR 2017 Workshops. Trondheim, Norway

The data is a subset of records of boats moving across geofences attached to aquaculture installations. The geofencing system consists of two zones around every aquaculture installation in Norway. The outer zone extends 400 meters from the outer points of the structures holding the fish themselves (not including the control building or fishfeed silos). The inner zone extends 20 meters from the structure. These limits follow government regulations stating that no boat should fish within the outer zone and no boat should move within the inner zone unless it is there to operate on the installation. An example of geofencing zones is shown in Fig. 1.

Fig. 1: The green line shows the outer geofence zone; the red line shows the inner geofence zone.

An event is created each time a boat crosses any of the geofence zones, marking the time. Table 1 below shows an example of a typical event.
Event ID  Location-ID  Vessel Name  Time                 LocationZone  EventType
81766     12966        Vessel A     2014-09-02 21:39:32  1             1
81767     12966        Vessel A     2014-09-02 21:40:11  1             2

Table 1: An example of two events of a vessel entering (EventType=1) and leaving (EventType=2) the outer zone (LocationZone=1) of location 12966.

In data gathered in the EXPOSED project, the aquaculture industry reports several possible problems with fish feed carriers interacting with aquaculture installations: approaching the feed barges, which are often placed in shallow waters; knowing which barge container to fill with what feed; planning according to weather and route so that the installation crew can attend the operation; and the fact that impact and currents from the boat can damage the installation.

As our data only gives us the time spent in two different proximities to the aquaculture installation, there are limits to which types of operational problems we can detect, and it will be very hard to discern between different causes of any detected problem (other than bad weather, which is very general).

The architecture of the full decision support system for EXPOSED is illustrated in Fig. 2. In this paper we only present results from parts of the system. Future work will integrate these results with the other modules (e.g. knowledge models) to complete the system to a state where it can be verified in the field.

[Figure 2: block diagram. Box labels: Sensor Reading, Pre-processing, Simulators/Numerical Models, Current State, Future State, Machine Learning Models, Knowledge Models, Knowledge and Experience Case Base, Decision Support System.]

Fig. 2: The architecture of the planned systems.
The parts implemented are highlighted; the case base and the future state are highlighted in red as being the current target for development.

Our main hypothesis is that, given enough contextual weather data, a deep neural network should be able to predict the length of a maritime operation at an aquaculture installation, enabling us to predict favorable operational windows. The main contribution of this paper is to show the reader the process of gathering, collating and filtering data and subjecting this data to an analysis.

This paper is structured as follows: Section 2 introduces related work and places our work in the light of it. Section 3 describes the methods and the data sources used in our work. Section 4 shows the results of our experiments, while Section 5 presents the conclusion along with a discussion of the results.

2 Related work

In this work we aim to extract cases from a time series of events. CBR research has addressed several aspects of automatic case authoring. There has been a lot of focus in CBR on how to measure the competence and utility of a case base [1,2]. The authors of [3] do this by reversing the deletion policies constructed in [4], which try to improve case base utility without degrading competence.

Several works [5,6,7] use NLP to extract cases from structured and unstructured [8,9,10] text. More specifically connected to the task of extracting cases from time series is the work by Bach et al. [11], who employ clustering of time-series events in time and space in combination with other detection methods. Funk et al. [12] use different models of how predictive (or discriminatory) different time-series patterns are for different medical diagnoses of stress. For more insight into time-series analysis connected to CBR research, we suggest chapter 3.3 in [13].

The work presented in this paper shares the approach of Bach et al.
[11] in that we try to extract the useful data points from the time series via clustering and filtering. Our work differs from the previous work in that we have very few verified cases a priori or during learning. In other words, the time series is for all practical purposes unlabeled for our use. We will try to apply common knowledge about how long an operation usually takes to perform. We can then extract failed operations from the event time series to create cases that exemplify failed operations.

3 Method

To enable the deep learning system to correctly model and predict the time spent at an installation, we need to provide it with as much context data as possible for each of the event data points. In addition, we need the data to be as noise free as possible, so we want to filter away operations that naturally have a high degree of variation in time spent at the location. We address these two requirements by combining the primary data set with other data sets, which provide both filtering and context. An illustration of this process can be seen in Fig. 3. Below we describe each of the data sets.

[Figure 3: flow diagram of the data pipeline. Recoverable step labels: merge events with boat type; reduce to fishfeed carriers; convert enter and exit events to one event with duration; group related events; merge with weather data; merge with site exposure data. Record counts shown along the pipeline: n = 3188925, 92910, 28260, 3099, 2717, 2717.]

Fig. 3: This figure illustrates how the different data sources are combined and filtered to provide the deep learning system with as much context as possible.

Boat data set. As mentioned in the introduction we do not want to analyze all the traffic data of all of the boats. To verify that our method is usable in at least one instance, we want to look at a specific type of boat that has stable characteristics when it comes to the parameters (e.g. the time spent and how stable it is) of the operations it executes on the installation. We chose fishfeed boats in this case, as they only do one type of operation.
That way we do not need to deduce the type of operation from the event data (one less hidden variable). In addition, this operation should be stable in the time it takes to execute. To filter the data accordingly we need to combine the event data set with a data source that describes the boats. We can then easily extract the fishfeed boats.

NORA10 data set. NORA10 [14,15] is a data set that describes the output of a precise weather model (hindcast) that is validated by measurements. It has a higher resolution (10 km) than most other models (e.g. the much used ERA model (footnote 2) with 80 km resolution), as it is re-sampled for this specific region around Norway. We sample this model for each of the installations at the time of each event (for long events we use the median time of the event). We sample every data type that we think will have an impact on the time spent on an operation: wind speed, wave direction, wind direction, significant swell wave height and significant wave height.

Exposure data set. SINTEF EXPOSED has produced a data set [16] that describes the degree of exposure for a large number of the installations in the event data set. This data set provides a level of exposure for 360 degrees around the installation (from 0 to max, where max is no land in sight). We combine this with the weather data described above: the wind direction is combined with how exposed the location is in that direction, using a filter that combines the exposure levels within +/- 10 degrees of the wind direction.

3.1 Extracting time spent in zones

The data set needs to contain the time spent in the zones around the aquaculture installations. The raw data only contains events of entering and exiting the zones. To extract the time spent, we sequentially find each exit from a zone, search backwards for the entry to that zone by the same boat, and compute the time between the two.
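The pairing of enter and exit events can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the `Event` and `Stay` records, their field names and the function `pair_events` are hypothetical, chosen to mirror the columns of Table 1 (EventType 1 = enter, 2 = exit). Instead of scanning backwards from each exit, the sketch remembers the most recent unmatched entry per (vessel, location, zone), which yields the nearest preceding entry when events are processed in time order. Exits without a preceding entry are skipped, which is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ENTER, EXIT = 1, 2  # EventType codes as shown in Table 1


@dataclass(frozen=True)
class Event:  # hypothetical record mirroring Table 1
    vessel: str
    location_id: int
    zone: int
    event_type: int
    time: datetime


@dataclass(frozen=True)
class Stay:  # hypothetical record: one event with a duration
    vessel: str
    location_id: int
    zone: int
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def pair_events(events):
    """Pair each exit with the latest unmatched entry of the same vessel, location and zone."""
    open_entries = {}  # (vessel, location_id, zone) -> entry time
    stays = []
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.vessel, ev.location_id, ev.zone)
        if ev.event_type == ENTER:
            open_entries[key] = ev.time
        elif ev.event_type == EXIT and key in open_entries:
            stays.append(Stay(*key, start=open_entries.pop(key), end=ev.time))
    return stays


# The two events from Table 1
events = [
    Event("Vessel A", 12966, 1, ENTER, datetime(2014, 9, 2, 21, 39, 32)),
    Event("Vessel A", 12966, 1, EXIT, datetime(2014, 9, 2, 21, 40, 11)),
]
stays = pair_events(events)
print(stays[0].duration.total_seconds())  # 39.0
```

Applied to the two events in Table 1, the sketch produces a single stay of 39 seconds in the outer zone of location 12966.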
Footnote 2: http://www.ecmwf.int/en/research/climate-reanalysis/era-interim

3.2 Grouping events close in time

After converting all discrete events into events with a duration, we still ended up with a lot of extremely short events. This is most probably caused by boats trying to stay close to the installation while the dynamic positioning system moves them in and out of the inner or outer zones. To counter this, we grouped all events with the same boat at the same location within 1 hour into one event.

However, after this grouping 63% (or 244) of the events still fall within the first 10 minute window. These are events within a zone that are less than ten minutes in duration and without another event at the same location within one hour of the original event. There are three possible explanations for these strange events:

1. The boat is passing through the location and not returning for at least one hour, or otherwise briefly enters and exits the zone, without this having any effect on the operation.
2. The boat tries to perform an operation at the location but has to abort and leaves within ten minutes.
3. The event was not registered correctly when the data was gathered.

The most probable cause for most of these events is boats that travel through the zone heading for another location. This hypothesis can be tested by removing outer zone events from the distribution. As the inner zone is small, very few of these big fishfeed carrier boats would drive through the inner zone of an aquaculture installation when heading somewhere else. We can still see 244 events of duration 10 minutes or less within the inner zone of an aquaculture installation. Figure 4 looks at the 1 minute distribution within the first 10 minutes to try to find the causes for the high number of short stay events. Once again we can see that many of the events are very short, with very few events lasting more than 3 minutes. This further supports our first hypothesis.
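The one-hour grouping step described at the start of this subsection can be sketched as follows. This is only an illustration under assumptions: the function name `group_stays` and the tuple layout (vessel, location, start, end) are hypothetical, and "within 1 hour" is interpreted here as a gap of at most one hour between the end of one stay and the start of the next, which the text does not specify.

```python
from datetime import datetime, timedelta


def group_stays(stays, max_gap=timedelta(hours=1)):
    """Merge stays of the same vessel at the same location separated by at most max_gap.

    Each stay is a (vessel, location_id, start, end) tuple (hypothetical layout).
    """
    grouped = []
    last_index = {}  # (vessel, location_id) -> index of the latest group in `grouped`
    for vessel, loc, start, end in sorted(stays, key=lambda s: s[2]):
        key = (vessel, loc)
        i = last_index.get(key)
        if i is not None and start - grouped[i][3] <= max_gap:
            # Close enough to the previous group: extend its end time
            v, l, s0, e0 = grouped[i]
            grouped[i] = (v, l, s0, max(e0, end))
        else:
            # Too far apart (or first stay for this key): start a new group
            last_index[key] = len(grouped)
            grouped.append((vessel, loc, start, end))
    return grouped


day = datetime(2014, 9, 2)
stays = [
    ("Vessel A", 12966, day.replace(hour=10, minute=0), day.replace(hour=10, minute=2)),
    ("Vessel A", 12966, day.replace(hour=10, minute=30), day.replace(hour=10, minute=31)),
    ("Vessel A", 12966, day.replace(hour=13, minute=0), day.replace(hour=13, minute=20)),
]
grouped = group_stays(stays)
print(len(grouped))  # 2
```

In this made-up example the first two short stays are merged into one stay from 10:00 to 10:31, while the stay at 13:00 remains separate because it starts more than one hour after the previous one ended.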
[Figure 4: histogram of inner-zone stays of up to 10 minutes, in one-minute bins. Counts per bin: 75, 59, 49, 17, 17, 8, 3, 7, 7, 2 (31%, 24%, 20%, 7%, 7%, 3%, 1%, 3%, 3%, 1%).]

Fig. 4: Distribution of events over length of stay in all inner zones after grouping all events within a 1 hour time window. Zoomed into the first 10 minutes.

One problem with our approach so far is that some events are very far apart in time as well as having different zone types. In one example, a boat had a 0 second stay in the inner zone of location 31437 at 18:23 on the 28th of November; however, the boat entered the outer zone of the same site at 17:04 the same day and exited zone 1 of that location at 18:24. We can then conclude that the boat spent approximately 1 hour and 20 minutes in the outer zone of the location, then very briefly entered the inner zone before leaving. This again supports the first hypothesis.

From this we can see that including the inner zone when analyzing fishfeed carrier operations adds very little information, as the fishfeed carriers do not enter the inner zone when transferring fishfeed. As a consequence we discard the inner zone data.

We are still left with 2401 events with a duration shorter than 10 minutes. Fig. 5 shows the distribution of the length of stay for these events. We can see that most of them are shorter than 5 minutes and most probably do not represent actual maritime operations (or failed attempts), but rather travel through the zone. Thus we discard events shorter than 10 minutes, giving us the final distribution shown in Fig. 6.

3.3 Predicting the operational time using Deep Learning

To extract cases that exemplify instances where the weather conditions stop a fishfeed operation from being successful, we are currently building a deep learning model aimed at predicting the time spent at the installation, given the weather and level of exposure at the time and location.
The input to the model is: draft and length of the boat, wind speed (footnote 3), distance between the model grid point and the actual site coordinate, wave direction (footnote 3), wind direction (footnote 3), maximum level of exposure at the location, significant swell wave height (footnote 3), month, hour, wind effect (wind speed combined with exposure levels in the wind direction +/- 10 degrees) and significant wave height. The output of the model is the amount of time spent at the installation.

Footnote 3: Measured at the closest grid point in NORA10.

[Figure 5: histogram of outer-zone stays of up to 10 minutes, in one-minute bins. Counts per bin: 519, 620, 494, 375, 200, 110, 43, 24, 9, 7 (22%, 26%, 21%, 16%, 8%, 5%, 2%, 1%, 0%, 0%).]

Fig. 5: Distribution of events over length of stay in all outer zones after grouping all events within a 1 hour time window. Zoomed into the first 10 minutes.

[Figure 6: histogram of outer-zone stays from 0 to 180 minutes, in ten-minute bins. Counts per bin: 0, 42, 33, 62, 78, 114, 79, 82, 59, 46, 28, 15, 16, 15, 4, 4, 7, 4 (0%, 6%, 5%, 9%, 11%, 16%, 11%, 12%, 8%, 7%, 4%, 2%, 2%, 2%, 1%, 1%, 1%, 1%).]

Fig. 6: Distribution of events over length of stay in all outer zones after grouping all events within a 1 hour time window, with all stays shorter than 10 minutes removed.

The regression was implemented in Python. We used sklearn for preprocessing and scaling (MinMax scaling) of the input data (including the regression target). The Keras library for deep learning was used for the regression itself, with an input layer of (number of input columns + 1) = 14 nodes, 3 hidden layers with 13 nodes each and an output layer of 1 node. All nodes used the ReLU activation function.

4 Results

The current results show that there is little information in the gathered data (through the NORA10 model and exposure levels) that accounts for the variance shown in the time spent at the locations.
The neural network model presented in Section 3.3 achieves very low accuracy (0.11%, which means the predictor is only very slightly better than just outputting the average) when predicting how long a fish feed boat stays at an aquaculture installation. Figure 7 shows the length of all of the events in chronological order in blue and the predicted length in orange. The "Time Spent" axis shows normalized values of the time spent near an installation, where y = 1.0 represents the longest stay recorded in the training data. There are obvious differences between predicted and true values: the model consistently returns too high values and fails to predict short stays. A cross-validated (cv = 5) hyperparameter grid search was performed and showed no better performance, even with 10 hidden layers of 56 nodes each.

Fig. 7: The DNN model's prediction of the time spent at an installation (orange) and the actual time spent (blue). The x-axis is simply the record number, where the records are ordered along the time axis.

After receiving these disappointing results, we created scatter plots of two weather variables in relation to the length of stay at the installations. One would typically expect some correlation between the weather and the length of stay. However, Figure 8 shows that neither wind (8a) nor waves (8b) reveal any obvious correlation pattern against time spent at installations.

In addition we did a principal component analysis of the data to discover whether there were any clear principal components that could contain the variance in the data. The components returned were

C = (0.127, 0.117, 0.109, 0.099, 0.091, 0.039, 0.034, 0.028, 0.020, 0.011, 0.008, 0.004, 0.002, 0.000),

where the sum of the components, sum(C) = 0.6967, indicates that the components in total could account for little of the variance.
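As a quick arithmetic check on the component values listed above, the running total can be computed directly. The sketch below uses the values exactly as printed (rounded to three decimals), so its total differs slightly from the reported sum, presumably because of that rounding; the variable names are illustrative only.

```python
from itertools import accumulate

# Component values as printed in the text (rounded to three decimals)
components = [0.127, 0.117, 0.109, 0.099, 0.091, 0.039, 0.034,
              0.028, 0.020, 0.011, 0.008, 0.004, 0.002, 0.000]

# Running total after each component
cumulative = list(accumulate(components))
for k, total in enumerate(cumulative, start=1):
    print(f"first {k:2d} components: {total:.3f}")
```

With these rounded values, the first five components together account for about 0.543, and all fourteen for about 0.689.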
Finally we tried a standard method for non-linear regression as a baseline to measure the DNN against. We tried Epsilon-Support Vector Regression (SVR), which scored a coefficient of determination of R² = −0.83. This is worse than constantly predicting the mean of the target (which would give R² = 0.0). Taken in the context of the other results listed above, this final result shows us that the data set may not contain the features needed to predict the length of the stay at an installation.

[Figure 8: two scatter plots, both titled "Histogram cutoff at time < 30 and time > 300", with "Time spent" on the y-axis. Panel (a) Wind correlation, x-axis "Wind with exposure"; panel (b) Wave correlation, x-axis "Wave height".]

Fig. 8: Scatter plots illustrating the correlation between the weather and the time spent at the installation.

5 Conclusions and future work

We started the work with the hypothesis that whether or not a fishfeed boat operation (loading of fishfeed from boat to barge) succeeded depended on the weather, and that a failure could be detected from the length of time the fishfeed boat stayed at the aquaculture installation. Our analysis did not find any deterministic correlation between the weather and location data and the length of the stay at the installation. There can be many reasons for this; below we list those we think are most probable.

The first possibility is that despite our efforts to remove noise from the data, the data still contains noise. This includes the three factors listed in the introduction section and other possibilities we have not considered.

Second, given the size of the boats and their stability, they can operate during harsh conditions. In addition, these boats are expensive to operate, and even more expensive if they fail to deliver feed at the appointed time, possibly starving the fish at the installation.
Thus these boats are already subject to careful operational planning. It may therefore be that there are no or very few failed fishfeed operations in the data captured. An additional consequence is that the time spent during operations has very low variance.

Extending this work would start with confirming these possible explanations for the lack of correlation found in our data. We would also like to gather further data, extending the number of events beyond the current 2700. This would enable us to train and test our models with more rigor and less uncertainty.

6 Acknowledgements

None of the work done in this paper would have been possible without the support of the EXPOSED project. Special thanks to ANTEO (http://anteo.no/) for providing data to this experiment and working with us to make use of this data.

References

1. Barry Smyth and Elizabeth McKenna. Modelling the competence of case-bases, pages 208–220. Lecture Notes in Computer Science. Springer Nature, 1998.
2. Barry Smyth and Elizabeth McKenna. Building Compact Competent Case-Bases, pages 329–342. Case-Based Reasoning Research and Development. Springer Nature, 1999.
3. Jun Zhu and Qiang Yang. Remembering to add: competence-preserving case-addition policies for case-base maintenance. In IJCAI, volume 99, pages 234–241, 1999.
4. B Smyth and M Keane. Remembering to forget: A competence-preserving deletion policy for cbr. In Proceedings IJCAI-95, 1995.
5. Chunsheng Yang, Benoit Farley, and Bob Orchard. Automated case creation and management for diagnostic cbr systems. Applied Intelligence, 28(1):17–28, Feb 2007.
6. Qiang Yang and Hong Cheng. Case mining from large databases. Lecture Notes in Computer Science, pages 691–702.
7. Marvin Zaluski, Nathalie Japkowicz, and Stan Matwin. Case authoring from text and historical experiences. Lecture Notes in Computer Science, pages 222–236, 2003.
8. Kerstin Bach, Klaus-Dieter Althoff, Régis Newo, and Armin Stahl.
A Case-Based Reasoning Approach for Providing Machine Diagnosis from Service Reports, pages 363–377. Case-Based Reasoning Research and Development. Springer Nature, 2011.
9. Valmi Dufour-Lussier, Florence Le Ber, Jean Lieber, and Emmanuel Nauer. Automatic case acquisition from texts for process-oriented case-based reasoning. Information Systems, 40:153–167, 2014.
10. Benoit Farley. From free-text repair action messages to automated case generation. In Proceedings of AAAI 1999 Spring Symposium: AI in Equipment Maintenance Service & Support, Technical Report SS-99-02, Menlo Park, CA, AAAI Press, pages 109–118, 1999.
11. Kerstin Bach, Odd Erik Gundersen, Christian Knappskog, and Pinar Öztürk. Automatic case capturing for problematic drilling situations. In International Conference on Case-Based Reasoning, pages 48–62. Springer, 2014.
12. Peter Funk and Ning Xiong. Case-based reasoning and knowledge discovery in medical applications with time series. Computational Intelligence, 22(3-4):238–253, Aug 2006.
13. Odd Erik Gundersen. Enhancing the Situation Awareness of Decision Makers by Applying Case-Based Reasoning on Streaming Data. PhD thesis, NTNU, 2014.
14. Øyvind Breivik, Magnar Reistad, and Hilde Haakenstad. A high-resolution hindcast study for the north sea, the norwegian sea and the barents sea. In 10th International Workshop on Wave Hindcasting and Forecasting, 2007.
15. Magnar Reistad, Øyvind Breivik, Hilde Haakenstad, Ole Johan Aarnes, Birgitte R Furevik, and Jean-Raymond Bidlot. A high-resolution hindcast of wind and waves for the north sea, the norwegian sea, and the barents sea. Journal of Geophysical Research: Oceans, 116(C5), 2011.
16. Pål Lader, David Kristiansen, Morten Alver, Hans. V Bjelland, and Dag Myrhaug. Classification of aquaculture locations in norway with respect to wind wave exposure. In Proceedings of the ASME 2017 36th International Conference on Ocean, Offshore and Arctic Engineering OMAE2017, 2017.