=Paper=
{{Paper
|id=Vol-3762/509
|storemode=property
|title=SAI4EO: Symbiotic Artificial Intelligence for Earth Observation
|pdfUrl=https://ceur-ws.org/Vol-3762/509.pdf
|volume=Vol-3762
|authors=Nicolò Taggio,Sergio Samarelli,Matteo Simone
|dblpUrl=https://dblp.org/rec/conf/ital-ia/TaggioSS24
}}
==SAI4EO: Symbiotic Artificial Intelligence for Earth Observation==
Nicolò Taggio¹,*,†, Sergio Samarelli¹,† and Matteo Simone¹,†
¹ Planetek Italia, via Massaua, Bari, 70123, Italy
Abstract
Symbiotic Artificial Intelligence (SAI) refers to the symbiotic relationship between artificial intelligence systems and human
users, characterized by mutual interaction and cooperation. Earth Observation (EO) involves collecting data about the Earth’s
surface and atmosphere using technologies like satellites and aerial sensors. Leveraging the synergies between these domains,
this work explores the convergence of SAI and EO through the implementation of an EO assistant-chatbot system. In particular,
using natural language and remote sensing data, two primary tasks will be investigated: image captioning for object counting
and detection, and scene description for image classification. This integration promises to revolutionize automated analysis
and interpretation of EO data, with significant implications for the evolution of smart cities, environmental monitoring,
land use planning, and related fields.
Keywords
Symbiotic AI, Earth Observation, Artificial Intelligence, Smart Cities
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† These authors contributed equally.
taggio@planetek.it (N. Taggio); samarelli@planetek.it (S. Samarelli); simone@planetek.it (M. Simone)
ORCID: 0009-0003-6392-3099 (N. Taggio); 0009-0004-6822-231X (M. Simone)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Symbiotic Artificial Intelligence (SAI) explores the complex aspects of the relationship between humans and artificial intelligence, including scientific, societal, economic, legal, and ethical considerations. With the pervasive integration of AI systems into our daily routines, the imperative to address existing limitations and constraints in human-machine cooperation has gained paramount importance. While the effectiveness and precision of an autonomously operating AI agent remain central concerns, the landscape becomes more intricate within a collaborative framework where humans and intelligent AI systems work towards shared objectives. The AI system must possess the capacity to comprehend not only human actions, but also their cognitive frameworks. Symbiotic AI holds the potential to revolutionize human-machine interaction, fostering symbiotic partnerships that amplify and enrich human cognitive capabilities rather than replacing them.

Earth Observation (EO) encompasses a broad spectrum of technologies and methodologies aimed at collecting comprehensive insights into the Earth's surface and atmosphere. Through the utilization of remote sensing technologies, including satellites, aerial sensors, and ground-based instruments, EO attempts to capture and interpret various aspects of the Earth environment. These observations span a diverse array of phenomena, ranging from natural processes such as weather patterns, land cover changes, and geological formations, to human-induced activities like urbanization, deforestation, and agricultural practices. EO plays a pivotal role in facilitating our understanding of global dynamics, enabling scientists, policymakers, and stakeholders to monitor environmental changes, assess natural hazards, and make informed decisions regarding resource management, disaster mitigation, and sustainable development efforts. Moreover, EO data serves as a valuable resource for a multitude of applications across disciplines, including climate research, biodiversity conservation, and urban planning.

However, the intricacy of satellite-acquired data, coupled with its vast volume across various sensor types, such as optical, SAR (Synthetic Aperture Radar), multispectral, and hyperspectral, presents a significant challenge in swiftly translating this wealth of information into actionable insights for end-users. Given these considerations, AI, particularly SAI, emerges as an important link to bridge human requirements with the wealth of information derived from remote sensing technologies. For this purpose, we highlight the tasks of "Image Captioning" and "Scene Description" applied to the EO domain. Both tasks focus on describing the contents of images: the former by detecting the objects that are present, the latter by assigning them a class from a pre-defined set.

In this project, starting from a comprehensive review of existing literature (Section 2), an exploration of two distinct tasks will be investigated (Section 3), describing how SAI can facilitate the development of an EO assistant chatbot tailored for EO applications (Section 4). Lastly, the discussion will delve into the challenges encountered and outline future avenues of exploration (Section 5).
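As a toy illustration of the scene-description task just outlined, a CLIP-style model scores an image embedding against text embeddings of candidate land-cover labels and picks the best match. This is a hypothetical sketch, not part of the system described in this paper: the function name `zero_shot_scene_label`, the label set, and the mock random embeddings (standing in for a real vision/text encoder pair) are all illustrative assumptions.

```python
import numpy as np

def zero_shot_scene_label(image_emb: np.ndarray,
                          text_embs: dict) -> str:
    """Return the land-cover label whose text embedding is closest
    (by cosine similarity) to the image embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(text_embs, key=lambda label: cos(image_emb, text_embs[label]))

# Mock embeddings stand in for the outputs of a real encoder pair (e.g. CLIP).
rng = np.random.default_rng(0)
labels = ["dense urban", "sparse urban", "tree cover", "grassland"]
text_embs = {l: rng.normal(size=512) for l in labels}
# A fake image embedding lying close to the "tree cover" text embedding.
image_emb = text_embs["tree cover"] + 0.1 * rng.normal(size=512)

print(zero_shot_scene_label(image_emb, text_embs))  # → tree cover
```

In a real pipeline the embeddings would come from the model's image and text encoders; the nearest-label rule itself stays exactly this simple, which is what enables the zero-shot, open-set behavior discussed in Section 2.1.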
We highlight the possibility of leveraging Multimodal Large Language Models (MLLMs) to solve EO tasks and underline the lack of state-of-the-art approaches to tackle the domain challenges.

2. State of the Art

In the following, a state-of-the-art review is detailed, exploring the latest developments, methodologies, and applications of algorithms and datasets in Symbiotic AI for EO.

2.1. SAI Methods

Although recent advances in deep learning applied to the EO field have demonstrated promising results in visual analysis tasks, e.g. object detection and instance segmentation, current methodologies typically rely on task-specific architectures. These approaches, while effective for individual tasks, often struggle to handle the complexities of multi-sensor remote sensing data, accommodate multiple tasks, and generalize to open-set reasoning scenarios.

In contrast, the emergence of MLLMs has garnered attention in the domain of natural images, showcasing impressive multi-task reasoning abilities in real-world settings. Unlike domain-specific models tailored for specific tasks, MLLMs exhibit versatile performance and can generalize effectively to new situations, enabling zero-shot capabilities across open-set tasks.

Key architectures employed in trending MLLMs include the Vision Transformer (ViT) [1], Large Language Models (LLMs), and Contrastive Language-Image Pre-training (CLIP) [2]. ViT stands out for its departure from traditional convolutional neural network (CNN) approaches, relying solely on self-attention mechanisms to process images as sequences of tokens. This innovative methodology enables ViT to achieve remarkable performance across various vision tasks, highlighting its adaptability and versatility. CLIP, developed by OpenAI, leverages contrastive learning principles to jointly understand images and text, representing them in a shared embedding space. Large Language Models, in turn, are cutting-edge linguistic tools crafted to empower computers with the capability to understand human language. The LLaMA (Large Language Model Meta AI) [3] family stands out for its remarkable ability to capture intricate contextual connections, marking a significant leap forward in natural language processing. These foundational models enhance the transformer architecture's capacity for comprehending natural language, owing to their extensive set of trainable parameters. Some open-source models, like LLaMAntino [4], have been released for the Italian language using Language Adaptation strategies.

Through the integration of these recent architectures, symbiotic AI systems can process and synthesize information from heterogeneous sources such as text, images, and other sensory data. This capability allows systems to perceive and interpret the surrounding world in a more comprehensive and multidimensional manner. Furthermore, leveraging MLLMs in symbiotic AI provides opportunities for natural interaction between humans and machines. The models can understand human requests in a more contextual manner, interpreting not only the text but also associated images, enabling a more seamless and intuitive interaction.

However, applying MLLMs to EO data poses significant challenges due to the substantial differences between natural and remote-sensing images, including variations in imaging conditions, environmental scales, and acquisition angles. Consequently, there is a scarcity of studies in the literature focused on the application of MLLMs to EO data, highlighting an important area for future research and innovation. Among these research works, it is noteworthy to highlight GeoChat [5] and EarthGPT [6], despite their significant constraints and lack of suitability for the Italian language.

2.2. EO Datasets

EO datasets serve as the base of intelligent systems capable of extracting valuable insights from images. Within this domain, extensive research has been conducted, considering the utility of these datasets for both scientific and commercial purposes. Indeed, the literature presents numerous datasets crucial for significant EO tasks such as object detection and semantic segmentation. For instance, the DOTA dataset [7] is curated explicitly for object detection, featuring satellite images prepared for oriented bounding box detection. Additionally, xView [8] stands out as one of the largest datasets available, boasting one million object instances across 60 diverse classes. However, it is important to highlight that these datasets are often encumbered by licenses limiting their usage to non-commercial applications. Given these restrictions, the demand for freely accessible datasets remains imperative, particularly from a service-oriented perspective.

In this context, datasets like SODA [9] and RarePlanes [10] emerge as viable options for Task 1, which involves image captioning and object detection. Alternatively, for Task 2, datasets such as the EEA Coastal Zone product and OpenStreetMap are preferred due to their comprehensive coverage and unrestricted access. Specifically, the SODA dataset serves as a comprehensive benchmark for small object detection. SODA-A, a subset of SODA, comprises 2513 high-resolution images meticulously annotated with oriented rectangles across nine distinct classes, and is
distributed under the MIT license [9]. Furthermore, to recognize the significance of particular objects in military applications and to be able to perform analysis, the RarePlanes dataset has been incorporated into the project. This dataset encompasses diverse examples of both real and synthetically generated satellite imagery, categorized into various classes based on airplane characteristics such as propulsion type, length, civilian or military designation, number of tail fins, and more.

On the other hand, in the task of land cover/land use classification, identifying the classes present in a buffer along the coastal zone is a crucial task. In fact, the European Environment Agency (EEA) has dedicated an entire project to the first Copernicus Land-VHR Coastal Zone hotspot thematic mapping produced on the European coastal zones. A consortium of EO service providers, spearheaded by Planetek Italia, has been commissioned by the EEA to develop a novel product focusing on Coastal Zones (CZ). This initiative aims to enhance the Thematic Hotspot Mapping (THM) category of the Copernicus Land Monitoring Service (CLMS). THM efforts within the CLMS complement the broader wall-to-wall mapping efforts by furnishing specific and comprehensive land cover/land use (LC/LU) data to tackle environmental challenges effectively. The upcoming products will encompass the entire European coastal region up to an inland depth of 10 km, spanning approximately 730,000 km². These products will feature a minimum mapping unit of 0.5 hectares and delineate approximately 71 LC/LU classes. The project delivers a comprehensive and highly precise LC/LU map encompassing the entire European coastline, boasting an impressive accuracy rate exceeding 90%. This serves as an important illustration of how SAI can expedite the extraction of valuable insights from vast amounts of data, leveraging approximately 10 terabytes of remote sensing images within the realm of EO. Finally, for Task 2, other available layers like OSM will be investigated to extract useful information in terms of buildings, streets, industrial zones and more.

3. Tasks description

In this section, two crucial tasks in EO are detailed where SAI, particularly MLLMs, could bridge the gap between EO applications and human interaction, facilitating direct access to insightful information derived from remote sensing data.

3.1. Image Caption

In the context of EO, conducting a thorough analysis of a scene emerges as a valuable effort for EO experts. Various scenarios present themselves to EO experts when examining images representing Very High Resolution (VHR) data. For instance, monitoring the activity within strategic areas such as airports, including the counting of airplanes, helicopters, and similar entities, serves to detect unexpected occurrences. Conversely, understanding the multitude of objects within a scene, ranging from vehicles to bus routes and sports facilities, holds relevance within the context of smart cities. Moreover, having this information represented in terms of bounding boxes, ideally with geographical coordinates or oriented bounding boxes, can significantly expedite standard Geographic Information System (GIS) processes for domain experts. In light of these requirements, the development of a chatbot tailored for EO becomes both necessary and highly desirable, enabling optimal interrogation of VHR data.

Nevertheless, certain challenges warrant attention. Firstly, the scarcity of readily available ground truth data in an AI-ready format is noteworthy. Within the EO domain, it is crucial to emphasize the scarcity of data, particularly for deep learning methodologies. While some ground truth datasets exist, their accessibility is restricted by commercial licensing restrictions. For instance, datasets like DOTA or xView, which encompass diverse objects, cannot be utilized for commercial purposes due to licensing constraints, as previously mentioned. Secondly, obtaining commercial Very High Resolution (VHR) images poses a significant obstacle.

Therefore, Task 1, called image captioning - object detection, aims to detect and count objects in remote sensing images using very high resolution data (i.e. from 30 to 50 cm) and with 3 or 4 bands.

3.2. Scene Description

In the same context, the task of analyzing a scene in terms of LC/LU, potentially on a global scale, presents a significant challenge in the EO domain. This challenge stems from various factors, including the complexities involved in classifying numerous types of LC/LU due to their spectral similarities. Additionally, resolving the issue requires addressing the limitations of satellite image resolution, both spatial and spectral. While certain vegetation classes may benefit from spectral information (e.g. tree cover, shrubland, grassland), discerning others, such as urban or anthropic areas (e.g. dense or sparse urban), necessitates intricate patterns and, consequently, higher spatial resolution.

Another crucial aspect lies in the potential of utilizing open data to enhance the EO chatbot's capability in describing LC/LU. For instance, various freely available layers such as OSM, Microsoft Open Buildings, and similar resources can be leveraged without additional training for the computer vision step. The EO chatbot needs to possess the capability to utilize geographical information for data extraction, crop it to the area of interest,
and conduct inquiries to retrieve additional information.

Task 2, called scene description - image semantic segmentation, aims to describe the LC/LU following the classes defined in the coastal zone dataset, using auxiliary information to integrate available pre-computed classes (i.e. residential and non-residential buildings), in remote sensing images with high resolution (i.e. from 2 to 4 m) and with 3 or 4 bands.

4. EO assistant

The objective of the project is to use the theory behind LLMs and MLLMs to assist EO experts in the creation of operational services. As part of this work, an EO assistant is being developed, tailored to address the aforementioned tasks efficiently. To provide a practical perspective, the tasks have been delineated into specific questions that need to be addressed from an operational standpoint. This approach ensures that the EO assistant is equipped to tackle the essential aspects of EO workflows effectively.

Figure 1 shows an example of the expected questions and answers in Task 1. In particular, EO experts need information about the precise spatial positioning of objects within EO imagery. Specifically, they require detailed information regarding the location of objects depicted in the images. This entails not only identifying the objects themselves but also extracting valuable statistical insights related to their dimensions. Additionally, experts often aim to identify the largest object within the scene, as it may hold significant relevance for various analyses and applications. Moreover, the chosen dataset plays a critical role in fulfilling these requirements. It is essential that the dataset provides not only visual representations of objects but also accompanying metadata that offer contextual information about each object's characteristics and spatial attributes. One important aspect is the capability to interpret metadata effectively. EO experts rely on metadata to understand the context of the data and extract meaningful insights. Additionally, the ability to transform bounding box annotations, such as those formatted in COCO (Common Objects in Context), into a geographical context is crucial. This transformation facilitates the integration of EO data with GIS and other spatial analysis tools, enabling a more comprehensive understanding of the landscape and its features. As a result, EO experts require not only visual representations of objects in imagery but also detailed statistical information and spatial coordinates. The seamless integration of metadata interpretation and bounding box transformation into a geographical context enhances the usability and relevance of EO data for various applications, ranging from environmental monitoring to urban planning and beyond.

Figure 1: An example of questions and answers between an end-user (blue lines) and the EO chatbot (red lines). On the right side, an example of a VHR image [6].

However, the ability to recognize LC/LU on a global scale, or at least in Europe, remains a challenge even with traditional deep learning approaches. Generative AI, specifically SAI, supported by LLMs and MLLMs, can assist EO experts in interpreting remote sensing data and aid end-users in extracting information using natural and straightforward language. Figure 2 illustrates potential use cases that could significantly expedite the replication of the European Environment Agency (EEA) Coastal Zone project for a new product next year. Given that the EEA updates CZ products every six years, there is an expectation for a new product within a relatively shorter timeframe.

Figure 2: An illustration of a dialogue between an end-user (shown in blue) and the EO chatbot (displayed in red). On the right, there is a depiction of layers beneficial for the EO chatbot. Specifically, the base map is sourced from the Pléiades constellation, while the colored and transparent polygons originate from the OSM (red polygons) and the CZ dataset.

It is clear how an EO expert can benefit from an EO chatbot that "comprehends" LC/LU. In fact, it enables the monitoring of changes over time, identifying trends like urban expansion, deforestation, or agricultural intensification. It also aids in supporting urban planning by pinpointing suitable areas for development, infrastructure planning, and land zoning. Additionally, leveraging auxiliary information extracted from sources like OSM could expedite analysis. Importantly, ensuring the information is available in geographical coordinates facilitates direct use by EO experts, not just end-users.

5. Conclusion

In this study, symbiotic artificial intelligence methods have been employed to aid EO experts and end-users in extracting valuable insights related to land use/land cover and image captioning tasks, addressing tangible challenges in the advancement of smart cities, environmental monitoring, land use planning, and related domains. Specifically, an exhaustive review of the current state of the art has been conducted, elucidating the most viable algorithms for two practical scenarios within the EO domain: image captioning and scene description.

The ongoing work has involved the selection of MLLMs based on existing LLMs, such as LLaMAntino, alongside diverse EO datasets with open licensing. This amalgamation aims to develop an EO chatbot tailored for both EO experts and end-users. The delineated tasks underscore the necessity for seamless interaction with end-users through natural language and the system's proficiency in retrieving information from EO data, encapsulating the essence of Symbiotic AI for EO (SAI4EO).

Subsequently, the project will embark on creating ground truth data comprising "questions-answers-EO data", followed by a feasibility assessment prior to establishing an end-to-end service catered to both end-users and EO experts.

References

[1] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929 (2020).
[2] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[3] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[4] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, LLaMAntino: LLaMA 2 models for effective text generation in Italian language, arXiv preprint arXiv:2312.09993 (2023).
[5] K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, F. S. Khan, GeoChat: Grounded large vision-language model for remote sensing, 2023. arXiv:2311.15826.
[6] W. Zhang, M. Cai, T. Zhang, Y. Zhuang, X. Mao, EarthGPT: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain, arXiv preprint arXiv:2401.16822 (2024).
[7] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, L. Zhang, DOTA: A large-scale dataset for object detection in aerial images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974–3983.
[8] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, B. McCord, xView: Objects in context in overhead imagery, arXiv preprint arXiv:1802.07856 (2018).
[9] R. Duan, H. Deng, M. Tian, Y. Deng, J. Lin, SODA: A large-scale open site object detection dataset for deep learning in construction, Automation in Construction 142 (2022) 104499.
[10] J. Shermeyer, T. Hossler, A. Van Etten, D. Hogan, R. Lewis, D. Kim, RarePlanes: Synthetic data takes flight, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 207–217.
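Section 4 notes that transforming COCO-style bounding box annotations into a geographical context is crucial for GIS integration. The sketch below, a hypothetical illustration and not the project's implementation, shows the underlying arithmetic: applying a GDAL-style affine geotransform (origin_x, pixel_width, 0, origin_y, 0, -pixel_height) to a pixel-space box [x, y, width, height]. The function name `bbox_pixel_to_geo` and the sample coordinates are assumptions for the example.

```python
def bbox_pixel_to_geo(bbox, geotransform):
    """Convert a COCO-style pixel bbox [x, y, width, height] into map
    coordinates (x_min, y_min, x_max, y_max) via an affine geotransform."""
    x, y, w, h = bbox
    ox, px_w, _, oy, _, px_h = geotransform  # px_h is negative for north-up rasters
    x_min = ox + x * px_w
    y_max = oy + y * px_h          # image y grows downward, map y grows upward
    x_max = ox + (x + w) * px_w
    y_min = oy + (y + h) * px_h
    return (x_min, y_min, x_max, y_max)

# Example: a 0.5 m VHR image whose top-left corner sits at
# easting 500000, northing 4650000 (illustrative values).
gt = (500000.0, 0.5, 0.0, 4650000.0, 0.0, -0.5)
print(bbox_pixel_to_geo([100, 200, 40, 20], gt))
# → (500050.0, 4649890.0, 500070.0, 4649900.0)
```

In an operational setting the geotransform would be read from the raster's metadata (e.g. via a library such as rasterio or GDAL), which is exactly the metadata-interpretation capability Section 4 argues the EO assistant must have.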