A Pilot Study Comparing ChatGPT and Google Search in Supporting Visualization Insight Discovery

Chen He1,*, Robin Welsch2 and Giulio Jacucci1
1 University of Helsinki, Finland
2 Aalto University, Finland

Abstract
The popularity of large language models (LLMs) provides new possibilities for deriving visualization insights by integrating human and machine intelligence. However, we do not yet understand how a contextualized LLM compares with traditional search in supporting visualization insight discovery. To this end, we conducted a between-subjects study with 25 participants to compare user insight generation with chat/search on a CO2 Explorer. In the Chat condition, ChatGPT was contextualized with the data, user tasks, and user interactions via programmed system prompts. Results show that both systems have their merits and demerits: ChatGPT lets users ask more diverse questions but can produce wrong answers; Search provides information sources, making answers more reliable, but users can fail to find the answer at all. This study prompts us to synthesize the two in a future study for reliable and efficient information retrieval.

Keywords
Information Visualization, Large Language Models, Google Search, Empirical Study

1. Introduction

Discovering insights is considered the main purpose of visual data exploration (VDE) [1]. Compared with data tables, visualization reveals data patterns and trends, facilitating insight discovery. Still, deriving insights requires visualization literacy and cognitive effort [2]. Imagine that an AI system could provide insights into the data you are exploring right now, instead of you meticulously looking for them. Prior work proposed techniques to (semi-)automate insights, such as data trends and clusters (e.g., [3, 4, 5]); however, researchers have pointed out the superficiality of automated insights: 1) Automated insights are limited to the data and lose the context of the domain. For instance, automatically discovered data clusters and patterns might not be meaningful to the domain under exploration and thus may not improve the viewer's understanding of it [6]. 2) Deriving knowledge from collected insights cannot be automated [7]. Analysts often need to gather evidence from multiple perspectives to build new knowledge. However, the advent of large language models (LLMs) may update our views on how visualization insights can be generated.

Joint Proceedings of the ACM IUI Workshops 2024, March 18-21, 2024, Greenville, South Carolina, USA
* Corresponding author.
Email: chen.he@helsinki.fi (C. He); robin.welsch@aalto.fi (R. Welsch); giulio.jacucci@helsinki.fi (G. Jacucci)
ORCID: 0000-0003-2055-4468 (C. He); 0000-0002-7255-7890 (R. Welsch); 0000-0002-9185-7928 (G. Jacucci)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

The introduction of ChatGPT popularized LLMs thanks to its strong performance on a wide range of tasks and its simple conversational interface, despite limitations such as hallucinations [8]. LLMs can benefit visualization in multiple ways, such as generating chart titles and recognizing patterns [9].
LLMs' large reservoir of information, and the modeling thereof, provides the potential to level up visualization insight generation, including but not limited to linking data to external evidence to build plausible insights and using the models' reasoning capabilities to derive hypotheses and generalizations. This research investigates the use of a contextualized LLM (ChatGPT 4 with Vision [10]) to facilitate visualization insight generation, compared with using the traditional Google search interface, to answer the research question (RQ): What are the similarities and differences between contextualized ChatGPT and Google search in supporting visualization insight generation? We provide ChatGPT with contextual information by supplying the data under exploration, the user tasks, and real-time user interactions with their resulting visualization states as system messages. To investigate the RQ, we conducted a between-subjects study with 25 participants, asking them to explore an existing CO2 Explorer and discover data insights with external evidence, focusing on either insight quantity or quality. The CO2 Explorer has a chat/search interface next to the visualization as the two conditions for comparison. Results show that both have their own strengths and weaknesses, prompting us to integrate them in future studies.

2. Related Work

2.1. Natural Language Support for Data Insights

Manually generating data insights can be time-consuming and opportunistic; prior work developed techniques to systematically discover data insights, like averages and extremes, and to generate texts and visualizations communicating those insights (e.g., [11, 12]). The process can be conversational, in question-and-answer mode: the user queries the data in natural language, and the system provides textual and/or visual answers (e.g., [13, 14, 15]). On the other hand, prior research explored computationally linking a visualization and its textual annotations for visual storytelling/presentation (e.g., [16, 17]). In contrast, our study explores how users use a contextualized LLM to generate visualization insights, compared with using Google search. The work closest to ours, to the best of our knowledge, is DataTales [18], which supports users in authoring data stories with LLMs; the generated narrative is linked to the chart components. However, DataTales does not use the conversational feature of LLMs but generates a narrative with predetermined prompts, and it was not compared with a baseline system.

2.2. ChatGPT vs. Google Search

Researchers have compared ChatGPT and Google search in supporting medical information retrieval [19, 20, 21] and learning [22]. Results show that readability is low for both platforms [20], indicating that the information provided is not easy for general audiences to understand, with ChatGPT being more difficult to read and comprehend [19, 20]. ChatGPT provides more relevant responses without citing sources, while Google is more reliable as it often attaches the date and source of retrieved information [20].

Figure 1: Screenshot of the interface of the ChatGPT-empowered CO2 Explorer for insight discovery. Users can select a year from the top list (A) to view that year's CO2 emissions of various countries on the map (B), select countries from the map to view their historical CO2 emissions in the line chart (C), chat with the chatbot (E) to gain more information about the data, such as news and events, and compose a note recording their discoveries (D).
Ayoub et al. [21] found that ChatGPT is better at providing general medical information but worse at medical recommendations compared with Google search. For solving programming exercises, Arias Sosa and Godow [22] found that students using ChatGPT had a higher success rate in less time but a poorer understanding of the topic when tested with questionnaires. We compare the two platforms in supporting visualization insight generation and contrast our conclusions with these prior results.

3. Study Design

To investigate the RQ, we conducted a between-subjects study comparing ChatGPT and Google search in assisting VDE. We developed a prototype integrating a CO2 Explorer with the chatbot/search. The CO2 Explorer, studied in prior work [23, 24], shows various countries' CO2 emission data in tons per capita from 1960 to 2021. It consists of a choropleth map and a line chart (Figure 1). Users can select a year from the top list to view that year's CO2 emissions of various countries on the choropleth map (Figure 1B); mousing over a country on the map displays a tooltip with the country name and its emission value. Users can also select countries from the map to view these countries' CO2 emission histories in the line chart (Figure 1C). Mousing over the line chart displays a black vertical reference line marking the year nearest to the mouse pointer, together with that year's emission values for the selected countries. The red vertical line in the line chart indicates the year chosen in the map view. To capture user insights during VDE, users can input and post written texts as notes (Figure 1D).

3.1. ChatGPT to Assist Visualization Insight Discovery

We added the chat function to the visualization's right side to assist users in VDE. Through the ChatGPT 4 with Vision API (parameter settings in Appendix A), we feed in two types of contextual information as system prompts: 1) A description of the situation the user is in: the initial system prompt conveys the data in CSV format, describes the visualization and the user task, and instructs the chatbot to assist with the user task (Appendix B1). 2) To assist real-time insight discovery, the Explorer transforms user interactions into prompts to retrieve relevant information. The Explorer prompts three types of user interactions, namely selection of a year and selection/de-selection of a country, with text describing the user interaction (Appendix B2) and the resulting visualization as an image prompt. In the study, users were unaware of the initial system message and the image prompts of their interactions, but they could see each interaction as a textual prompt together with the response from the chatbot (Figure 1). So, every time they clicked a country or year, they had to wait for the answer to complete before they could click another one. Users could also prompt freely using the input box at the bottom right.

3.2. Baseline with Search Engine

As a baseline, we put a search component powered by the Google Search API [25] next to the visualization instead of the chat interface, so users could use the search engine to assist with insight discovery. Unlike the chatbot, the search engine had no information about the visualization or user interactions.
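For concreteness, the following minimal sketches show how the two conditions could be wired up; the paper does not publish code, so the function names, variable names, and the snapshot mechanism are illustrative assumptions. The first sketch forwards an interaction-description prompt and a snapshot of the current charts to the OpenAI Chat Completions endpoint with the parameters from Appendix A. The paper describes interaction prompts as system messages; the image is attached here via a user-role message, which is where the vision API accepts image parts.

```typescript
// Minimal sketch (not the Explorer's actual code): forwarding a user
// interaction plus a chart snapshot to GPT-4 with Vision.
// Parameter values follow Appendix A; names like `chartSnapshotDataUrl`
// are illustrative assumptions.
async function promptInteraction(
  history: object[],            // prior system/user/assistant messages
  interactionText: string,      // e.g., "The user selects the year 2021 in the choropleth map."
  chartSnapshotDataUrl: string, // base64 data URL of the current visualization
): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4-vision-preview",
      temperature: 0.5,
      max_tokens: 1000,
      top_p: 1,
      frequency_penalty: 0.3,
      presence_penalty: 0.3,
      messages: [
        ...history,
        {
          role: "user", // image parts are accepted in user-role messages
          content: [
            { type: "text", text: interactionText },
            { type: "image_url", image_url: { url: chartSnapshotDataUrl } },
          ],
        },
      ],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content; // answer shown in the chat panel
}
```

The baseline's search panel can be backed by the Programmable Search Engine's Custom Search JSON API [25]; a corresponding sketch, assuming API key and engine-ID environment variables:

```typescript
// Minimal sketch: querying the Custom Search JSON API for the baseline.
// GOOGLE_API_KEY and SEARCH_ENGINE_ID are assumed environment variables.
async function searchWeb(
  query: string,
): Promise<{ title: string; link: string; snippet: string }[]> {
  const url = new URL("https://www.googleapis.com/customsearch/v1");
  url.searchParams.set("key", process.env.GOOGLE_API_KEY!);
  url.searchParams.set("cx", process.env.SEARCH_ENGINE_ID!);
  url.searchParams.set("q", query);
  const res = await fetch(url.toString());
  const data = await res.json();
  // Each result carries a title, link, and snippet to render in the panel.
  return (data.items ?? []).map((it: any) => ({
    title: it.title,
    link: it.link,
    snippet: it.snippet,
  }));
}
```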
3.3. Participants

We recruited 25 international students from a large university through mailing lists. They completed the study either on-site or remotely via Zoom. Upon completion, each received a 10-euro gift card from a local supermarket chain, plus a bonus. They were randomly assigned to one of the two study conditions. Of the 12 participants in the Search condition (age range: 21-53, median: 25.5; female: 10; on-site: 4), five originally came from Asia, four from Europe, and three from North America. Of the 13 in the Chat condition (age range: 21-45, median: 25; female: 6; on-site: 6), one was from North America, while six each came from Asia and Europe. Except for two participants from the Search condition and one from the Chat condition, who reported an intermediate level of English proficiency, all self-reported a native or advanced level of English. We examined their familiarity with the techniques on 5-point Likert scales. Both groups were familiar with heatmaps (Median search: 5, chat: 4) and line charts (Median search: 5, chat: 5). However, for the two test technologies, a Wilcoxon rank-sum test shows a statistically significant difference: the Search group was familiar with Google search (median: 5), whereas the Chat group was less familiar with ChatGPT (median: 3; Wilcoxon effect size: 0.76, p < 0.001).

3.4. Procedure and Tasks

The study consisted of three stages: an interactive tutorial, two visual exploration tasks, and a questionnaire. The tutorial, built on top of the interface using the intro.js library [26], introduced the charts, note posting, and the chat/search component in six steps. To capture user behavior in generating different types of insights for comparison, we created two tasks: one quantitative and one qualitative. The description of the quantitative task reads:

Freely explore the CO2 emission data of [a Country Group]^a. Post as many notes as possible, recording your data discoveries. Your data discoveries must be linked to external evidence as references, such as events, policies, and news. Please use the Search function (or 'chat with the ChatBot' for the Chat condition) to assist with your task. You will receive a maximum bonus of €5 for this task based on the number of correct notes you have posted.

^a There are two country groups: 1. the USA, Italy, and Finland; 2. China, India, and Turkey.

For the qualitative task, we replaced the italicized part of the description above with:

Post one note with the following requirements: 1. The note records a hypothesis or generalization you have made from your data analysis; 2. The note includes the rationale behind it, that is, how you derived the hypothesis or generalization; 3. The rationale must link your data analysis with external evidence as references, such as events, policies, and news; 4. The note must be logical and correct. You will receive a maximum bonus of €5 for this task if your note satisfies the above requirements. If you post multiple notes, only the last one will be evaluated.

The order of the two country groups and the two tasks was randomized to control for carryover effects. Finally, the questionnaire collected subjects' backgrounds, System Usability Scale (SUS [27]) answers, and free-form comments. Participants went through the whole study at their own pace; the experimenter was present to answer any questions. They were encouraged to think aloud during the tasks, and we recorded the screen and voice for analysis. The whole study generally took less than an hour.

3.5. Data Collection and Analysis

We recorded the screen and voice during the tasks; mouse interactions, user notes, query/prompt inputs, and search results/chat answers were logged with timestamps. Mouse hover actions were recorded if they lasted for over 3 seconds.
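To illustrate the hover rule, here is a minimal sketch of a logger that records a hover only after the 3-second threshold (our own reconstruction, not the study's actual logging code; the log entry shape is an assumption):

```typescript
// Minimal sketch (our reconstruction): log a hover only if the pointer
// rests on a chart element for more than 3 seconds.
const HOVER_THRESHOLD_MS = 3000;

function attachHoverLogger(
  element: Element,
  log: (entry: { type: string; target: string; timestamp: number }) => void,
) {
  let hoverStart = 0;
  element.addEventListener("mouseenter", () => {
    hoverStart = Date.now();
  });
  element.addEventListener("mouseleave", () => {
    const duration = Date.now() - hoverStart;
    if (duration >= HOVER_THRESHOLD_MS) {
      log({
        type: "hover",
        target: element.id || element.tagName,
        timestamp: hoverStart, // when the hover began
      });
    }
  });
}
```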
We analyzed the time participants spent on the tasks, the number of notes for the quantity task, the number of VDE actions, and their questionnaire answers. We assessed the overall grading of the notes on a 5-point Likert scale (grading criteria in Table 1). One author went through the notes, created the grading criteria, and graded the rest of the notes.

Table 1: Note grading criteria.
Grade  Interpretation
5      Better than Grade 4, with a novel hypothesis/generalization.
4      Clear hypothesis/generalization; well-thought-out rationale; the logic makes the text flow well.
3      Well-thought-out discovery of the data with multi-aspect evidence.
2      Simple discovery of the data with external evidence.
1      Unclear note, missing data references or external evidence.

For the quantity task, we averaged each participant's note grades for statistical analysis; for the quality task, each participant had one note, so no averaging was needed. Since two users in the Chat condition spontaneously used search engines, we removed them from the analysis, except for the questionnaire part. For statistical analysis, we used Wilcoxon rank-sum tests (for unpaired samples) to compare performance between the two conditions and Wilcoxon signed-rank tests (for paired samples) to compare the two tasks within a condition. We report Wilcoxon effect sizes and p-values of the tests. Moreover, we examined participants' queries/prompts and went through the video recordings to understand several action patterns in how participants posted notes.
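The paper does not spell out the effect size definition; assuming the conventional r statistic for Wilcoxon tests (e.g., as computed by R's rstatix::wilcox_effsize), the reported effect sizes correspond to

r = |Z| / √N,

where Z is the standardized test statistic and N is the total number of observations; values around 0.1, 0.3, and 0.5 are customarily read as small, moderate, and large effects.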
4. Results

Figure 2 shows that users spent more time on the quantity task than on the quality task (Search effect size = 0.30, p = 0.38; Chat effect size = 0.69, p = 0.03), while the same task took a similar amount of time across the conditions. The number of notes recorded in the quantity task was also alike in both conditions (Median search: 4.5, chat: 3.5; Range search: [1, 11], chat: [1, 12]).

Figure 2: Time participants spent on the tasks.

In the questionnaire, we asked about participants' confidence in the notes they posted and whether they learned something new during the tasks, on 5-point Likert scales. While the two conditions showed no difference in user confidence in their notes (Median search: 4, chat: 4), the Chat condition demonstrated moderately more learning experiences for the users (Median search: 4.5, chat: 5; effect size = 0.34, p = 0.11).

The SUS scores reveal that the Search condition was considered slightly more user-friendly (Median search: 85, chat: 77.5; effect size = 0.27, p = 0.18), while both conditions were rated above the general average score of 68. User comments on the Chat condition show that five users complained about the VDE in that they needed to wait for the chat answers after clicks; two users mentioned trust issues with the answers from ChatGPT; two had difficulties writing prompts to get the expected answers. Four pointed out that the chatbot gave pertinent, context-aware answers. With Search, most comments concerned the visualization instead, while one user mentioned that the search results seemed repetitive.

To understand more about VDE in the two conditions, we counted the number of effective clicks and mouseovers on the charts. Results show that the Search condition had more data exploration actions than the Chat condition (Figure 3), especially for the quality task (Quality effect size = 0.59, p = 0.01; Quantity effect size = 0.28, p = 0.22). We can presume that, had the Chat condition not blocked user interaction, participants would have interacted more with the charts. In both conditions, the quantity task had more data exploration actions than the quality task (Search effect size = 0.47, p = 0.16; Chat effect size = 0.73, p = 0.02).

Figure 3: Number of VDE actions.

The quality task produced notes with higher grades than the quantity task within the conditions, with large effect sizes (Search effect size = 0.55, p = 0.08; Chat effect size = 0.81, p = 0.02). On the other hand, the two conditions did not show much difference in note grades on either task (Figure 4). Five of the 12 people (42%) in the Search condition put external links as evidence in notes in both tasks. The two users in the Chat condition who used search also included external links in their notes. Some users made notes without mentioning a year or country, for example using the phrase 'this is..', supposing the note to be linked to the visualization state.

Figure 4: Note gradings.

Video analysis shows that in both conditions, almost all users pasted texts from websites/chats into notes. In the Chat condition, one user copied large amounts of text into notes without reading the chats; another directly put a wrong answer from the chat into a note (an answer to which year had the biggest decrease in CO2 emissions). In the Search condition, three out of 12 users failed to find the answer they were looking for. The Search condition allowed users to explore many external charts, infographics, and scientific articles.

Queries in Search and prompts in Chat showed similar qualities. Both were iterative; users often drilled down to retrieve more concrete information. Both asked for facts, like events and policies, and causalities, such as the impact of renewable energy. However, with Chat, questions were more diverse, including how-much-, how-, and when-type questions, such as "How much in absolute terms did the emissions of China go up from 2002 to 2011?" and "When can we say that the Kyoto protocol had a sure effect on the decrease on the emissions?". Moreover, queries were most often phrases, while prompts were complete questions.

5. Discussion

To summarize, the results showed no significant differences between the two conditions in the time taken or the grades of notes for the tasks, nor in the number of notes generated for the quantity task. This result could partially stem from participants' unfamiliarity with the then-new ChatGPT technology. In both conditions, the quantity task took more time than the quality task, while the quality task produced notes with higher grades. In both conditions, users were confident in the notes they posted, while the Chat condition exhibited more learning gain for users; however, we did not use an additional questionnaire to test or validate this claim. In contrast, Arias Sosa and Godow [22] found that ChatGPT hindered learning, potentially due to students' inexperience with the technology.

The Search system had better usability scores; the probable reason is that the click-to-wait-for-answers feature in the Chat condition is ill-suited, as it blocks VDE. Search allows users to attach sources to notes, which makes the insight more reliable, as also shown by Hristidis et al. [20]. Moreover, search results contained diverse content for exploration, such as charts and publications, besides texts, which may contribute to their better readability, as discussed in prior work [19, 20]. However, during information retrieval, users can fail to find the answer with Search or get a wrong answer with Chat.
User queries in both conditions had similarities, such as being iterative and drilling down into topics, as well as differences: besides asking for facts and reasons, queries in Chat also included more diverse when- and how-type questions.

We conclude that both platforms have their merits and demerits: users can fail at information retrieval with Search, and they can retrieve unreliable information without sources with Chat. We suggest combining search and chatbot (e.g., [28]) so the two complement each other and overcome each other's weaknesses, so as to 1) avoid failures in information seeking, 2) enable users to retrieve the correct answer, and 3) obtain more reliable answers with sources.

5.1. Limitations

The number of participants in this pilot study is small, which keeps us from drawing firm conclusions, but the results illuminate the complementarity of the two platforms. Moreover, as LLM tools become more user-friendly and familiar to the general public, study results may change considerably. Follow-up studies with other data/visualizations and a larger, more general population could expand on this investigation.

6. Conclusion

This research compares ChatGPT 4 with Vision and Google search in supporting visualization insight generation involving external evidence. We conducted a between-subjects study with 25 participants and asked them to use chat/search to complete quantitative and qualitative insight tasks with the CO2 Explorer. Results showed no significant differences between the two conditions in task time or in the number/grading of generated insights. Qualitative analysis revealed that the two systems had their own advantages and disadvantages, such as possibly wrong or unreliable answers from ChatGPT and less efficient information retrieval with Search. In the future, combining the two platforms could help improve both the reliability and the efficiency of information retrieval.

Acknowledgments

This research is funded by the Strategic Research Council at the Research Council of Finland [Grant Number 358247].

References

[1] R. Chang, C. Ziemkiewicz, T. M. Green, W. Ribarsky, Defining insight for visual analytics, IEEE Computer Graphics and Applications 29 (2009) 14–17.
[2] P. Law, A. Endert, J. T. Stasko, Characterizing automated data insights, in: IEEE Visualization Conference - Short Papers, IEEE, 2020, pp. 171–175.
[3] R. Ding, S. Han, Y. Xu, H. Zhang, D. Zhang, QuickInsights: Quick and automatic discovery of insights from multi-dimensional data, in: the International Conference on Management of Data, ACM, 2019, pp. 317–332.
[4] P. Ma, R. Ding, S. Han, D. Zhang, MetaInsight: Automatic discovery of structured knowledge for exploratory data analysis, in: the International Conference on Management of Data, ACM, 2021, pp. 1262–1274.
[5] Y. Chen, S. Barlowe, J. Yang, Click2Annotate: Automated insight externalization with rich semantics, in: IEEE Conference on Visual Analytics Science and Technology, IEEE, 2010, pp. 155–162.
[6] B. Karer, H. Hagen, D. J. Lehmann, Insight beyond numbers: The impact of qualitative factors on visual data analysis, IEEE Trans. Vis. Comput. Graph. 27 (2021) 1011–1021.
[7] D. Sacha, A. Stoffel, F. Stoffel, B. C. Kwon, G. P. Ellis, D. A. Keim, Knowledge generation model for visual analytics, IEEE Trans. Vis. Comput. Graph. 20 (2014) 1604–1613.
[8] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. Xu, P. Fung, A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity, CoRR abs/2302.04023 (2023).
[9] W. Yang, M. Liu, Z. Wang, S. Liu, Foundation models meet visualizations: Challenges and opportunities, CoRR abs/2310.05771 (2023).
[10] OpenAI, ChatGPT 4 with Vision, https://platform.openai.com/docs/guides/vision, 2024.
[11] Y. Wang, Z. Sun, H. Zhang, W. Cui, K. Xu, X. Ma, D. Zhang, DataShot: Automatic generation of fact sheets from tabular data, IEEE Trans. Vis. Comput. Graph. 26 (2020) 895–905.
[12] Z. Cui, S. K. Badam, M. A. Yalçin, N. Elmqvist, DataSite: Proactive visual data exploration with computation of insight-based recommendations, Inf. Vis. 18 (2019).
[13] C. Liu, Y. Han, R. Jiang, X. Yuan, ADVISor: Automatic visualization answer for natural-language question on tabular data, in: IEEE Pacific Visualization Symposium, IEEE, 2021, pp. 11–20.
[14] D. J. L. Lee, A. Quamar, E. Kandogan, F. Özcan, Boomerang: Proactive insight-based recommendations for guiding conversational data analysis, in: International Conference on Management of Data, ACM, 2021, pp. 2750–2754.
[15] K. Kafle, B. L. Price, S. Cohen, C. Kanan, DVQA: Understanding data visualizations via question answering, in: IEEE Conference on Computer Vision and Pattern Recognition, Computer Vision Foundation / IEEE Computer Society, 2018, pp. 5648–5656.
[16] S. Latif, Z. Zhou, Y. Kim, F. Beck, N. W. Kim, Kori: Interactive synthesis of text and charts in data documents, IEEE Trans. Vis. Comput. Graph. 28 (2022) 184–194.
[17] R. Brath, C. Hagerman, Automated insights on visualizations with natural language generation, in: International Conference Information Visualisation, 2021, pp. 278–284.
[18] N. Sultanum, A. Srinivasan, DataTales: Investigating the use of large language models for authoring data-driven articles, in: IEEE Visualization and Visual Analytics, IEEE, 2023, pp. 231–235.
[19] J. R. Bellinger, J. S. De La Chapa, M. W. Kwak, G. A. Ramos, D. Morrison, B. W. Kesser, BPPV information on Google versus AI (ChatGPT), Otolaryngology–Head and Neck Surgery (2023).
[20] V. Hristidis, N. Ruggiano, E. L. Brown, S. R. R. Ganta, S. Stewart, ChatGPT vs Google for queries related to dementia and other cognitive decline: Comparison of results, J. Med. Internet Res. 25 (2023).
[21] N. F. Ayoub, Y.-J. Lee, D. Grimm, V. Divi, Head-to-head comparison of ChatGPT versus Google search for medical knowledge acquisition, Otolaryngology–Head and Neck Surgery (2023).
[22] E. Arias Sosa, M. Godow, Comparing Google and ChatGPT as assistive tools for students in solving programming exercises, Bachelor thesis, https://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-330994, 2023.
[23] M. Feng, C. Deng, E. M. Peck, L. Harrison, HindSight: Encouraging exploration through direct encoding of personal interaction history, IEEE Trans. Vis. Comput. Graph. 23 (2017) 351–360.
[24] J. Boy, F. Détienne, J. Fekete, Storytelling in information visualizations: Does it engage users to explore data?, in: the CHI Conference on Human Factors in Computing Systems, ACM, 2015, pp. 1449–1458.
[25] Google, Programmable search engine, https://developers.google.com/custom-search/v1/overview, 2024.
[26] Intro.js team, Intro.js, https://introjs.com/, 2024.
[27] J. Brooke, SUS: A quick and dirty usability scale, Usability Eval. Ind. 189 (1995).
[28] Perplexity team, Perplexity, https://www.perplexity.ai/, 2023.

A. API Parameter Settings

The settings were tested to ensure a certain amount of diversity and novelty in the chatbot's answers.
The model we used is gpt-4-vision-preview with temperature 0.5, max tokens 1000, top p 1, frequency penalty 0.3, and presence penalty 0.3.

B. System Prompts

1. The initial system message:

This is a visual data exploration task. The user explores CO2 emission data for [a Country Group] from 1960 to 2021, measured in tons per capita. Here is the data delimited by triple backticks in CSV format. ```data in CSV format``` The visualization displays the data with a choropleth map, showing the geographic areas of the three countries color-coded by their CO2 emission values of a selected year, and a line chart, depicting user-selected countries' CO2 emissions from 1960 to 2021. The choropleth map uses green to red colors to code CO2 emissions from 0 to 24 tons per capita. Initially, the latest year, 2021, is selected. The line chart shows years on the x-axis and CO2 emission values on the y-axis. Initially, no country is selected.

***If it is a quantitative task*** The user can post notes about data discoveries. The data discovery must be linked to external evidence, such as events, policies, and news. The user's task is to post as many notes about such discoveries as possible.

***For a qualitative task*** The user can post notes. The user's task is to analyze the CO2 emission data of the three countries, coupled with the analysis of external evidence, such as events, policies, and news, to compose a hypothesis or generalization logically and correctly as a note.

Your task is to assist the user with their task using the information provided above and your knowledge database on actual national or international news or events. Be concise with your answers.

2. The system prompts when the user selects/de-selects a country or selects a year:

The user selects [a country name] in the line chart.
The user de-selects [a country name] in the line chart.
The user selects the year [year] in the choropleth map.
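For illustration, these interaction prompts could be templated from UI events as in the following minimal sketch (our own reconstruction; the event type names are assumptions):

```typescript
// Minimal sketch (our reconstruction): templating the Appendix B2
// interaction prompts from UI events. Event kind names are illustrative.
type Interaction =
  | { kind: "selectCountry"; country: string }
  | { kind: "deselectCountry"; country: string }
  | { kind: "selectYear"; year: number };

function interactionPrompt(event: Interaction): string {
  switch (event.kind) {
    case "selectCountry":
      return `The user selects ${event.country} in the line chart.`;
    case "deselectCountry":
      return `The user de-selects ${event.country} in the line chart.`;
    case "selectYear":
      return `The user selects the year ${event.year} in the choropleth map.`;
  }
}

// Example: interactionPrompt({ kind: "selectYear", year: 2021 })
// -> "The user selects the year 2021 in the choropleth map."
```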