=Paper= {{Paper |id=Vol-3762/509 |storemode=property |title=SAI4EO: Symbiotic Artificial Intelligence for Earth Observation |pdfUrl=https://ceur-ws.org/Vol-3762/509.pdf |volume=Vol-3762 |authors=Nicolò Taggio,Sergio Samarelli,Matteo Simone |dblpUrl=https://dblp.org/rec/conf/ital-ia/TaggioSS24 }} ==SAI4EO: Symbiotic Artificial Intelligence for Earth Observation== https://ceur-ws.org/Vol-3762/509.pdf
                                SAI4EO: Symbiotic Artificial Intelligence for Earth
                                Observation
                                Nicolò Taggio1,*,† , Sergio Samarelli1,† and Matteo Simone1,†
                                1
                                    Planetek Italia, via Massaua, Bari, 70123, Italy


                                                 Abstract
                                                 Symbiotic Artificial Intelligence (SAI) refers to the symbiotic relationship between artificial intelligence systems and human
                                                 users, characterized by mutual interaction and cooperation. Earth Observation (EO) involves collecting data about the Earth’s
                                                 surface and atmosphere using technologies like satellites and aerial sensors. Leveraging the synergies between these domains,
                                                 this work explores the convergence of SAI and EO through the implementation of an EO assistant-chatbot system. In particular,
                                                 using natural language and remote sensing data, two primary tasks will be investigated: image captioning for object counting
                                                 and detection, and scene description for image classification. This integration promises to revolutionize automated
                                                 analysis and interpretation of EO data, with significant implications for the evolution of smart cities, environmental
                                                 monitoring, land use planning, and related fields.

                                                 Keywords
                                                 Symbiotic AI, Earth Observation, Artificial Intelligence, Smart Cities



1. Introduction

Symbiotic Artificial Intelligence (SAI) explores the complex aspects of the relationship between humans and artificial intelligence, including scientific, societal, economic, legal, and ethical considerations. With the pervasive integration of AI systems into our daily routines, addressing the existing limitations and constraints of human-machine cooperation has gained paramount importance. While the effectiveness and precision of an autonomously operating AI agent remain central concerns, the landscape becomes more intricate within a collaborative framework where humans and intelligent AI systems work towards shared objectives. The AI system must possess the capacity to comprehend not only human actions but also the cognitive frameworks behind them. Symbiotic AI holds the potential to revolutionize human-machine interaction, fostering partnerships that amplify and enrich human cognitive capabilities rather than replacing them.

Earth Observation (EO) encompasses a broad spectrum of technologies and methodologies aimed at collecting comprehensive insights into the Earth's surface and atmosphere. Through remote sensing technologies, including satellites, aerial sensors, and ground-based instruments, EO attempts to capture and interpret various aspects of the Earth environment. These observations span a diverse array of phenomena, ranging from natural processes such as weather patterns, land cover changes, and geological formations, to human-induced activities like urbanization, deforestation, and agricultural practices. EO plays a pivotal role in facilitating our understanding of global dynamics, enabling scientists, policymakers, and stakeholders to monitor environmental changes, assess natural hazards, and make informed decisions regarding resource management, disaster mitigation, and sustainable development. Moreover, EO data serves as a valuable resource for a multitude of applications across disciplines, including climate research, biodiversity conservation, and urban planning.

However, the intricacy of satellite-acquired data, coupled with the vast volume produced by various sensor types (optical, SAR (Synthetic Aperture Radar), multispectral, hyperspectral, among others), presents a significant challenge in swiftly translating this wealth of information into actionable insights for end-users. Given these considerations, AI, and SAI in particular, emerges as an important link between human requirements and the wealth of information derived from remote sensing technologies. For this purpose, we highlight the tasks of "Image Captioning" and "Scene Description" applied to the EO domain. Both tasks focus on describing the contents of the images: the former by detecting the objects that are present, the latter by assigning a class to them from a pre-defined set.

In this project, starting from a comprehensive review of existing literature (Section 2), two distinct tasks will be investigated (Section 3), describing how SAI can facilitate the development of an EO assistant chatbot tailored for EO applications (Section 4). Lastly, the discussion will delve into the challenges encountered and outline future avenues of exploration (Section 5).

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† These authors contributed equally.
✉ taggio@planetek.it (N. Taggio); samarelli@planetek.it (S. Samarelli); simone@planetek.it (M. Simone)
ORCID: 0009-0003-6392-3099 (N. Taggio); 0009-0004-6822-231X (M. Simone)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
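The two tasks can be made concrete with a small sketch: given a list of detected objects (a hypothetical detector output; the class names and dictionary layout are illustrative, not the system described in this paper), Task 1 amounts to counting the detections per class and phrasing the result in natural language:

```python
from collections import Counter

def caption_from_detections(detections):
    """Build a short natural-language caption by counting
    detected object classes (Task 1: captioning as counting)."""
    counts = Counter(d["class"] for d in detections)
    parts = [f"{n} {cls}{'s' if n > 1 else ''}"
             for cls, n in sorted(counts.items())]
    return "The scene contains " + ", ".join(parts) + "."

# Illustrative detector output (bounding boxes in pixel coordinates)
dets = [
    {"class": "airplane", "bbox": [10, 20, 50, 60]},
    {"class": "airplane", "bbox": [80, 15, 120, 55]},
    {"class": "helicopter", "bbox": [200, 100, 240, 140]},
]
print(caption_from_detections(dets))
# The scene contains 2 airplanes, 1 helicopter.
```

Task 2 would instead map the whole scene (or each region) to one class from a pre-defined set.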




We highlight the possibility of leveraging Multimodal Large Language Models (MLLMs) to solve EO tasks and underline the lack of state-of-the-art approaches to tackle the domain challenges.

2. State of the Art

In the following, a state-of-the-art review is detailed, exploring the latest developments, methodologies, and applications of algorithms and datasets in Symbiotic AI for EO.

2.1. SAI Methods

Although recent advances in deep learning applied to the EO field have demonstrated promising results in visual analysis tasks, e.g. object detection and instance segmentation, current methodologies typically rely on task-specific architectures. These approaches, while effective for individual tasks, often struggle to handle the complexities of multi-sensor remote sensing data, accommodate multiple tasks, and generalize to open-set reasoning scenarios.

In contrast, the emergence of MLLMs has gathered attention in the domain of natural images, showcasing impressive multi-task reasoning abilities in real-world settings. Unlike domain-specific models tailored for specific tasks, MLLMs exhibit versatile performance and can generalize effectively to new situations, enabling zero-shot capabilities across open-set tasks.

Key architectures employed in current MLLMs include the Vision Transformer (ViT) [1], Large Language Models (LLMs), and Contrastive Language-Image Pre-training (CLIP) [2]. ViT stands out for its departure from traditional convolutional neural network (CNN) approaches, relying solely on self-attention mechanisms to process images as sequences of tokens. This methodology enables ViT to achieve remarkable performance across various vision tasks, highlighting its adaptability and versatility. CLIP, developed by OpenAI, leverages contrastive learning principles to jointly understand images and text, representing them in a shared embedding space. Large Language Models, in turn, are cutting-edge linguistic tools crafted to give computers the capability to understand human language. The LLaMA (Large Language Model Meta AI) [3] family stands out for its remarkable ability to capture intricate contextual connections, marking a significant leap forward in natural language processing. These foundational models enhance the transformer architecture's capacity for comprehending natural language, owing to their extensive set of trainable parameters. Some open-source models, like LLaMAntino [4], have been released for the Italian language using Language Adaptation strategies.

Through the integration of these recent architectures, symbiotic AI systems can process and synthesize information from heterogeneous sources such as text, images, and other sensory data. This capability allows systems to perceive and interpret the surrounding world in a more comprehensive and multidimensional manner. Furthermore, leveraging MLLMs in symbiotic AI provides opportunities for natural interaction between humans and machines: the models can understand human requests in a more contextual manner, interpreting not only the text but also associated images, enabling a more seamless and intuitive interaction.

However, applying MLLMs to EO data poses significant challenges due to the substantial differences between natural and remote-sensing images, including variations in imaging conditions, environmental scales, and acquisition angles. Consequently, there is a scarcity of studies in the literature focused on the application of MLLMs to EO data, highlighting an important area for future research and innovation. Among these research works, GeoChat [5] and EarthGPT [6] are noteworthy, despite their significant constraints and their lack of suitability for the Italian language.

2.2. EO Datasets

EO datasets serve as the foundation of intelligent systems capable of extracting valuable insights from images. Within this domain, extensive research has been conducted, considering the utility of these datasets for both scientific and commercial purposes. Indeed, the literature presents numerous datasets crucial for significant EO tasks such as object detection and semantic segmentation. For instance, the DOTA dataset [7] is curated explicitly for object detection, featuring satellite images prepared for oriented bounding box detection. Additionally, xView [8] stands out as one of the largest datasets available, boasting one million object instances across 60 diverse classes. However, these datasets are often encumbered by licenses limiting their usage to non-commercial applications. Despite these restrictions, the demand for freely accessible datasets remains imperative, particularly from a service-oriented perspective.

In this context, datasets like SODA [9] and RarePlanes [10] emerge as viable options for Task 1, which involves image captioning and object detection. Alternatively, for Task 2, datasets such as the EEA Coastal Zone and OpenStreetMap are preferred due to their comprehensive coverage and unrestricted access. Specifically, the SODA dataset serves as a comprehensive benchmark for small object detection. SODA-A, a subset of SODA, comprises 2513 high-resolution images meticulously annotated with oriented rectangles across nine distinct classes.
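Oriented-rectangle annotations of this kind are typically stored as a centre, a size, and a rotation angle, from which the four corner points can be recovered with a rotation matrix. A minimal sketch (the `(cx, cy, w, h, angle)` layout is an assumption for illustration; the exact encoding varies per dataset and should be checked against its documentation):

```python
import math

def obb_to_corners(cx, cy, w, h, angle_rad):
    """Return the 4 corner (x, y) points of an oriented bounding box
    given its centre, width, height, and rotation angle in radians."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each half-extent offset, then translate to the centre
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]

corners = obb_to_corners(100.0, 50.0, 40.0, 20.0, math.pi / 2)
# With a 90-degree rotation, width and height swap roles around the centre
```

This polygon form is what GIS tools usually expect when the annotations are later re-projected into geographical coordinates.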
SODA-A is distributed under the MIT license [9]. Furthermore, to recognize the significance of particular objects in military applications and to enable the corresponding analyses, the RarePlanes dataset has been incorporated into the project. This dataset encompasses diverse examples of both real and synthetically generated satellite imagery, categorized into various classes based on airplane characteristics such as propulsion type, length, civilian or military designation, number of tail fins, and more.

On the other hand, in the task of land cover/land use classification, identifying the classes present in a buffer along the coastal zone is crucial. In fact, the European Environment Agency (EEA) has dedicated an entire project to the first Copernicus Land-VHR Coastal Zone hotspot thematic mapping produced for the European coastal zones. A consortium of EO service providers, spearheaded by Planetek Italia, has been commissioned by the EEA to develop a novel product focusing on Coastal Zones (CZ). This initiative aims to enhance the Thematic Hotspot Mapping (THM) category of the Copernicus Land Monitoring Service (CLMS). THM activities within the CLMS complement the broader wall-to-wall mapping efforts by furnishing specific and comprehensive land cover/land use (LC/LU) data to tackle environmental challenges effectively. The upcoming products will encompass the entire European coastal region up to an inland depth of 10 km, spanning approximately 730,000 km². These products will feature a minimum mapping unit of 0.5 hectares and delineate approximately 71 LC/LU classes. The project delivers a comprehensive and highly precise LC/LU map of the entire European coastline, with an accuracy rate exceeding 90%. This serves as an important illustration of how SAI can expedite the extraction of valuable insights from vast amounts of data, leveraging approximately 10 terabytes of remote sensing images within the realm of EO. Finally, for Task 2, other available layers like OSM will be investigated to extract useful information in terms of buildings, streets, industrial zones, and more.

3. Tasks description

In this section, two crucial tasks in EO are detailed where SAI, particularly MLLMs, could bridge the gap between EO applications and human interaction, facilitating direct access to insightful information derived from remote sensing data.

3.1. Image Caption

In the context of EO, conducting a thorough analysis of a scene is a valuable effort for EO experts. Various scenarios present themselves to EO experts when examining images at Very High Resolution (VHR). For instance, monitoring the activity within strategic areas such as airports, including the counting of airplanes, helicopters, and similar entities, serves to detect unexpected occurrences. Conversely, understanding the multitude of objects within a scene, ranging from vehicles to bus routes and sports facilities, holds relevance within the context of smart cities. Moreover, having this information represented in terms of bounding boxes, ideally with geographical coordinates or as oriented bounding boxes, can significantly expedite standard Geographic Information System (GIS) processes for domain experts. In light of these requirements, the development of a chatbot tailored for EO becomes both necessary and highly desirable, enabling optimal interrogation of VHR data.

Nevertheless, certain challenges warrant attention. Firstly, the scarcity of readily available ground truth data in an AI-ready format is noteworthy: within the EO domain, data scarcity is a serious constraint, particularly for deep learning methodologies. While some ground truth datasets exist, their accessibility is restricted by commercial licensing. For instance, datasets like DOTA or xView, which encompass diverse objects, cannot be utilized for commercial purposes due to licensing constraints, as previously mentioned. Secondly, obtaining commercial Very High Resolution (VHR) images poses a significant obstacle.

Therefore, Task 1, called image caption - object detection, aims to detect and count objects in remote sensing images using very high resolution data (i.e. from 30 to 50 cm) with 3 or 4 bands.

3.2. Scene Description

In the same context, the task of analyzing a scene in terms of LC/LU, potentially on a global scale, presents a significant challenge in the EO domain. This challenge stems from various factors, including the complexities involved in classifying numerous types of LC/LU due to their spectral similarities. Additionally, resolving the issue requires addressing the limitations of satellite image resolution, both spatial and spectral. While certain vegetation classes may benefit from spectral information (e.g. tree cover, shrubland, grassland), discerning others, such as urban or anthropic areas (e.g. dense or sparse urban), requires recognizing intricate patterns and, consequently, higher spatial resolution.

Another crucial aspect lies in the potential of utilizing open data to enhance the EO chatbot's capability in describing LC/LU. For instance, various freely available layers such as OSM, Microsoft Open Buildings, and similar resources can be leveraged without additional training for the computer vision step. The EO chatbot needs the capability to use geographical information to extract the data and crop it to the area of interest.
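Leveraging such open layers without extra training can be as simple as spatially filtering features against the area of interest. A sketch using plain point-in-bbox filtering (the GeoJSON-like structure, field names, and coordinates are assumptions for illustration; a real pipeline would use a spatial library and proper geometry intersection):

```python
def crop_features_to_aoi(features, aoi):
    """Keep only point features whose coordinates fall inside the
    area of interest, given as (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = aoi
    kept = []
    for f in features:
        lon, lat = f["geometry"]["coordinates"]
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            kept.append(f)
    return kept

# Illustrative OSM-like point features (building centroids)
features = [
    {"geometry": {"coordinates": [16.87, 41.12]}, "properties": {"building": "yes"}},
    {"geometry": {"coordinates": [12.49, 41.89]}, "properties": {"building": "yes"}},
]
aoi_bari = (16.7, 41.0, 17.0, 41.3)  # rough bounding box, illustrative
print(len(crop_features_to_aoi(features, aoi_bari)))  # 1
```

The cropped features can then be attached to the chatbot's answer as auxiliary context for the area in question.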
The chatbot must also be able to conduct inquiries to retrieve additional information.

Task 2, called scene description - image semantic segmentation, aims to describe the LC/LU following the classes defined in the Coastal Zone dataset, using auxiliary information to integrate available pre-computed classes (e.g. residential and non-residential buildings), in remote sensing images at high resolution (i.e. from 2 to 4 m) with 3 or 4 bands.

4. EO assistant

The objective of the project is to use the theory behind LLMs and MLLMs to assist EO experts in the creation of operational services. As part of this work, an EO assistant is being developed, tailored to address the aforementioned tasks efficiently. To provide a practical perspective, the tasks have been delineated into specific questions that need to be addressed from an operational standpoint. This approach ensures that the EO assistant is equipped to tackle the essential aspects of EO workflows effectively.

Figure 1 shows an example of the expected questions and answers in Task 1.

Figure 1: An example of questions and answers between an end-user (blue lines) and the EO chatbot (red lines). On the right side, an example of a VHR image [6].

In particular, EO experts need information about the precise spatial positioning of objects within EO imagery. Specifically, they require detailed information regarding the location of objects depicted in the images. This entails not only identifying the objects themselves but also extracting valuable statistical insights related to their dimensions. Additionally, experts often aim to identify the largest object within the scene, as it may hold significant relevance for various analyses and applications. Moreover, the chosen dataset plays a critical role in fulfilling these requirements. It is essential that the dataset provides not only visual representations of objects but also accompanying metadata that offer contextual information about each object's characteristics and spatial attributes. One important aspect is the capability to interpret metadata effectively: EO experts rely on metadata to understand the context of the data and extract meaningful insights. Additionally, the ability to transform bounding box annotations, such as those formatted in COCO (Common Objects in Context) style, into a geographical context is crucial. This transformation facilitates the integration of EO data with GIS and other spatial analysis tools, enabling a more comprehensive understanding of the landscape and its features. Thus, EO experts require not only visual representations of objects in imagery but also detailed statistical information and spatial coordinates. The seamless integration of metadata interpretation and bounding box transformation into a geographical context enhances the usability and relevance of EO data for various applications, ranging from environmental monitoring to urban planning and beyond.

However, the ability to recognize LC/LU on a global scale, or at least in Europe, remains a challenge even with traditional deep learning approaches. Generative AI, specifically SAI, supported by LLMs and MLLMs, can assist EO experts in interpreting remote sensing data and aid end-users in extracting information using natural and straightforward language. Figure 2 illustrates potential use cases that could significantly expedite the replication of the European Environment Agency (EEA) Coastal Zone project for a new product next year. Given that the EEA updates CZ products every six years, there is an expectation for a new product within a relatively shorter timeframe.

Figure 2: An illustration of a dialogue between an end-user (shown in blue) and the EO chatbot (displayed in red). On the right, there is a depiction of layers beneficial for the EO chatbot. Specifically, the base map is sourced from the Pléiades constellation, while the colored and transparent polygons originate from OSM (red polygons) and the CZ dataset.

It is clear how an EO expert can benefit from an EO chatbot that "comprehends" LC/LU. In fact, it enables the monitoring of changes over time, identifying trends like urban expansion, deforestation, or agricultural intensification. It also aids in supporting urban planning by pinpointing suitable areas for development, infrastructure planning, and land zoning. Additionally, leveraging auxiliary information extracted from sources like OSM could expedite analysis. Importantly, ensuring the information is available in geographical coordinates facilitates direct use by EO experts, not just end-users.

5. Conclusion

In this study, symbiotic artificial intelligence methods have been employed to aid EO experts and end-users in extracting valuable insights related to land use/land cover and image captioning tasks, addressing tangible challenges in the advancement of smart cities, environmental monitoring, land use planning, and related domains. Specifically, an exhaustive review of the current state of the art has been conducted, elucidating the most viable algorithms for two practical scenarios within the EO domain: image captioning and scene description.

The ongoing work has involved the selection of MLLMs based on existing LLMs, such as LLaMAntino, alongside diverse EO datasets with open licensing. This combination aims to develop an EO chatbot tailored for both EO experts and end-users. The delineated tasks underscore the necessity for seamless interaction with end-users through natural language and for the system's proficiency in retrieving information from EO data, encapsulating the essence of Symbiotic AI for EO (SAI4EO).

Subsequently, the project will embark on creating ground truth data comprising "questions-answers-EO data", followed by a feasibility assessment prior to establishing an end-to-end service catered to both end-users and EO experts.

References

[1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[2] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International conference on machine learning, PMLR, 2021, pp. 8748–8763.
[3] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[4] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in italian language, arXiv preprint arXiv:2312.09993 (2023).
[5] K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, F. S. Khan, Geochat: Grounded large vision-language model for remote sensing, 2023. arXiv:2311.15826.
[6] W. Zhang, M. Cai, T. Zhang, Y. Zhuang, X. Mao, Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain, arXiv preprint arXiv:2401.16822 (2024).
[7] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, L. Zhang, Dota: A large-scale dataset for object detection in aerial images, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3974–3983.
[8] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, B. McCord, xview: Objects in context in overhead imagery, arXiv preprint arXiv:1802.07856 (2018).
[9] R. Duan, H. Deng, M. Tian, Y. Deng, J. Lin, Soda: A large-scale open site object detection dataset for deep learning in construction, Automation in Construction 142 (2022) 104499.
[10] J. Shermeyer, T. Hossler, A. Van Etten, D. Hogan, R. Lewis, D. Kim, Rareplanes: Synthetic data takes flight, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 207–217.