<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Prompting LLMs in Italian language for Text-to-SQL translation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Ranaldi</string-name>
          <email>federico.ranaldi@alumni.uniroma2.eu</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Sofia Ruzzetti</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <email>leonardo.ranaldi@idiap.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Venditti</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Giannone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Favalli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raniero Romagnoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Almawave S.p.A.</institution>
          ,
          <addr-line>Via di Casal Boccone 188-190, 00137 Rome, IT</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Idiap Research Institute</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Rome Tor Vergata</institution>
          ,
          <addr-line>Rome, IT</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fine-tuning Large Language Models (LLMs) on tasks with instructions has demonstrated potential in boosting zero-shot generalization to unseen tasks. Inspired by studies on the reasoning skills of Instruction-tuned LLMs (It-LLMs), we investigate reading-comprehension, reasoning, and production over symbolic tasks. In particular, we propose an iterative reading-comprehension and reasoning approach to solve question-answering tasks based on structured data, i.e., the Text-to-SQL task. In our approach, we define a specialized procedure to provide the relevant evidence from structured data and natural language queries in order to stimulate the It-LLMs to focus on the production task and on reasoning. Hence, we propose a prompt-generation procedure that allows It-LLMs to reason about the structural information and natural language queries and to produce symbolic output, i.e., the SQL queries. Extensive experiments in zero-shot scenarios, with different types of structured data, demonstrate the remarkable abilities of It-LLMs in comprehension and in producing astonishing answers. However, hallucinations and misleading answers are also produced; this still shows the shortcomings of the instructed LLMs and, thus, their partial unreliability.</p>
      </abstract>
      <kwd-group>
        <kwd>Text-to-SQL</kwd>
        <kwd>It-LLMs</kwd>
        <kwd>prompt</kwd>
        <kwd>zero-shot</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Natural Language Query</kwd>
        <kwd>Natural Language Understanding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR</p>
      <p>ceur-ws.org
1. Introduction
shown to overcome the limits of zero-shot performance,
CEUR
htp:/ceur-ws.org
ISN1613-073
© 2023 Copyright for this paper by its authors. Use permitted under Creative</p>
      <p>CEUR Workshop Proceedings C(EUR-WS.org)
version. Then, we define a specialized procedure to
provide the relevant evidence from structured data and query
the It-LLMs in natural language. In this way, we direct
the models to focus on understanding the prompt,
reasoning based on the information provided, and producing
the output, the SQL code that solves Text-to-SQL task.
ent types of structured data demonstrate the remarkable
abilities of It-LLMs in understanding and producing
astonishing responses in the presence of various levels of
information. However, we have observed errors as the2.3. Text-to-SQL task
information given to It-LLMs decreases. The results of</p>
      <p>The ability to translate natural language queries into SQL
the zero-shot scenarios still show shortcomings of the</p>
      <p>or other ontological formal languag27e,s2[8] is a
valuIt-LLMs and, thus, their partial unreliability when the</p>
      <p>able tool because it allows one to interact with databases
harder queries and less informative databases are
consid</p>
      <p>using a natural language without having to learn SQL.
ered. There are several approaches to the problem of
translation from natural language to SQL. The earliest methods
2. Background &amp; Related Works were totally rule-base2d9[, 30]; later, with the arrival of
statistical learners, a common approach became learning
2.1. Large Language Models the mapping between SQL queries and command1s5][.</p>
      <p>Database schema and queries, rich in terms of
relationBrown et al.,2[] with GPT3 were the forerunners of theships, are often encoded in graphs – and processed by
many Large Language Models (LLMs). Among the welgl-raph neural networks31[] or self-attention mechanisms
famous LLMs are OPT [16], FLAN [17], and LLaMA [18]. [32] – or translated into intermediate representations
Compared to the smaller language models, LLMs ha[v3e3]. Recently, the Text-to-SQL task has been interpreted
several emergent abilitie1s9][, including zero-shot multi-as a sequence-to-sequence, and transformer-based
modtask solving6[] and few-shot in-context learning witehls are applied3[4, 35]. However, a critical aspect is the
chain-of-thought reasonin2g0][. amount of input information, i.e., database schemas and
relationships encoding. In this paper, we move forward
2.2. Instruction-tuned LLMs and propose a new Text-to-SQL approach by exploiting
the potential of It-LLMs models. In particular, after an
LLMs generate texts following certain formats andeixnt-ensive prompt-tuning phase, we analyze two It-LLMs
structions from examples in their prompts. Ouyang aml.o,dels’ reasoning and generalization abilities in
solv[5] trained GPT3 with instruction-response corpora tinog the Text-to-SQL task with less informative database
make LLMs more scalable and improve zero-shot perforr-epresentations and harder queries. Our contribution is
mance. As a result, InstructGPT, ChatGPT, and GPTu4nafected by LLMs’ prior knowledge after pre-training
perform well on a wide range of tasks without seeinags we test a collection of definitely unseen databases.
any examples. Recent research has also found that
GPTgenerated instructions and outputs to follow instructions
[21] can improve LLMs’ ability to follow instructio3ns.. Methods
Wang et al.2,[2] proposed a semi-supervised method to
generate diferent instructions from an NLP task-baseIdn order to test the reading-comprehension abilities
seed instruction7[]. However, these models are not fullyof Instruction-tuned Large Language Models (It-LLMs)
open-source, and it is often possible to use them for freein the Text-to-SQL translation task, we organized the
as black-boxes2[3]. Recent open-sourcing eforts include prompting phase into two parts. In the first phase, we
several competitive model2s4[,25] but cannot match thedefined diferent prompts for studying how the presence
performance of closed-source models26[]. of Structural Information and data afects the behavior of
models (Section3.1). In the second phase, we defined
pos</p>
    </sec>
    <sec id="sec-2">
      <title>3.1. Prompting Structural Information</title>
    </sec>
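      <p>For illustration only, the following minimal sketch builds an invented single-table schema (not one of the databases used in our experiments) and derives its UGLY∼SCHEMA and UGLY &amp; INSERT variants; the vowel-removal rule is our reading of the description above, and all names and values are hypothetical.</p>
      <preformat>
import re

# Hypothetical example: a single-table schema. The databases used in the
# paper are intentionally unpublished, so every name and value is invented.
SOLO_SCHEMA = """
CREATE TABLE utenti (
    id INTEGER PRIMARY KEY,
    nome TEXT,
    cognome TEXT,
    eta INTEGER
);
"""

def devowel(identifier: str) -> str:
    """Degrade a table/attribute name by removing vowels (UGLY~SCHEMA)."""
    return re.sub(r"[aeiouAEIOU]", "", identifier)

assert devowel("utenti") == "tnt" and devowel("cognome") == "cgnm"

# UGLY~SCHEMA: identical structure, devoweled table and attribute names.
UGLY_SCHEMA = """
CREATE TABLE tnt (
    d INTEGER PRIMARY KEY,
    nm TEXT,
    cgnm TEXT,
    t INTEGER
);
"""

# UGLY &amp; INSERT: the degraded schema plus a small amount of real data.
UGLY_AND_INSERT = UGLY_SCHEMA + """
INSERT INTO tnt VALUES (1, 'Mario', 'Rossi', 34);
INSERT INTO tnt VALUES (2, 'Anna', 'Bianchi', 27);
"""
      </preformat>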
    <sec id="sec-3">
      <title>4.1. Datasets</title>
    </sec>
    <sec id="sec-4">
      <title>3.2. Prompting Natural Language Query</title>
      <sec id="sec-4-1">
        <title>In order to analyze the generalization abilities, we have</title>
        <p>Regarding the Natural Language Query (NLQ), i.e., thfeed dumps of three SQL databases that are definitely
unqueries we wish to translate SQL, inspired by the work osefen, thus not found on the Web, and never seen in the
[36], we considered three hardness-levels: easy, mediump,re-training corpora of Large Language Models.
Moreand hard. A given NLQ is assigned to a certain levoevler, databases difer in topic, topology, and size as shown
if the best corresponding SQL translation has speciificn Table1.
hardness characteristics. The hardness-levels are defined
as follows: 4.2. Experimental Settings
1. EASY: values are selected only from one tablBeehind describing the data (Secti4o.1n) and prompting
(there is no join). methodologies (Sectio3n), we tested our proposals on
2. MEDIUM: values are selected by joining two taG-PT-3.5 and Claude Instant. Hence, we provided
Strucbles. tural Information, defined in Sectio3n.1, in three
difer3. HARD: values are selected by joining more thaennt ways, in each of which we requested the translation
two tables. of four Natural Language Queries (NLQ) for each
hardFurthermore, in all levels, an arbitrary number of conndei-ss level. We conducted experiments on three diferent
tions is allowed, and aggregation functions are includdeadt.abases to study phenomena in diferent scenarios. The
NLQs were in Italian and, as described in Sect3i.o2nwere
of the type:”Traduci in sql la seguente query
’nomi,cog3.3. Prompting Phase nomi,età degli utenti...ordinati per età’”.</p>
      </sec>
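      <p>As a purely illustrative example (over an invented schema of users, orders, and products tables, not taken from our databases), one plausible query per hardness-level:</p>
      <preformat>
# Illustrative only: an invented schema with users(id, name, age),
# orders(id, user_id, total) and products(id, order_id, price).

# EASY: values selected from one table, no join (conditions are allowed).
easy = "SELECT name, age FROM users WHERE age > 30 ORDER BY age;"

# MEDIUM: values selected by joining two tables.
medium = """SELECT users.name, SUM(orders.total)
FROM users JOIN orders ON orders.user_id = users.id
GROUP BY users.name;"""

# HARD: values selected by joining more than two tables
# (here with an aggregation function, which is allowed at every level).
hard = """SELECT users.name, AVG(products.price)
FROM users
JOIN orders ON orders.user_id = users.id
JOIN products ON products.order_id = orders.id
GROUP BY users.name;"""
      </preformat>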
      <sec id="sec-4-2">
        <title>We conducted the Text-to-SQL task using two It-LLMs:</title>
        <p>GPT-3.5 [37] and Claude-instant38[]. In a zero-shot 5. Results &amp; Discussion
scenario, we considered the three diferent approaches
(as described in Section3.1), behind which we asked the 5.1. The reading-comprehension
models to translate a small number of NLQ per
hardnesslevel on three diferent databases. In particular, except for Challenge
feeding the SQL dump of the database as input, requesItts-LLMs are amazing understanders; in fact, in presence
such as”Traduci la seguente query NL in SQL” were made of structured information, they perform very well in
overwithout any further prompt engineering steps. coming complex challenges and generating good
translations from Text-to-SQL. In Tabl2ewe can observe that
4. Experiments both GPT-3.5 and Claude Instant perform very well in
theSOLO∼SCHEMA approach. In particular, both
GPTIn order to observe the real abilities of Intructio3.n5-and Claude Instant produce an accurate translation
tuned Large Language Models (It-LLMs) in readingfo-r all the EASY queries. Moreover, Claude Instant
procomprehension on heterogeneous inputs and the redau-ces very good results on average also on the MEDIUM
queries. Hence, the It-LLMs showed good abilities in
soning abilities behind output generation, we selected a
comprehending natural language and the structural iInn- fact, we can observe that as we degrade the
strucformation of databases in SQL language. tural information of the database by removing vocals
from the table and attribute namUeGsL(Y∼SCHEMA),
5.2. The reasoning-generation Challenge the models tend to make errors with a high frequency.</p>
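      <p>A minimal sketch of this zero-shot setup is given below; the call_model helper and the file name are hypothetical placeholders for an It-LLM chat endpoint, while the request wording is the one quoted above.</p>
      <preformat>
# Minimal sketch of the zero-shot prompting phase; call_model is a
# hypothetical stand-in for an It-LLM chat endpoint (GPT-3.5 or Claude
# Instant in the paper) and db1_dump.sql is a placeholder file name.

def build_prompt(db_dump: str, nlq: str) -> str:
    """Concatenate the Structural Information (the SQL dump) with the plain
    translation request quoted above: no further prompt engineering."""
    return f"{db_dump}\n\nTraduci la seguente query NL in SQL: '{nlq}'"

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to an It-LLM chat API")

if __name__ == "__main__":
    dump = open("db1_dump.sql", encoding="utf-8").read()
    nlq = "nomi, cognomi, età degli utenti ordinati per età"
    print(call_model(build_prompt(dump, nlq)))
      </preformat>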
    </sec>
    <sec id="sec-6">
      <title>4. Experiments</title>
      <p>In order to observe the real abilities of Instruction-tuned Large Language Models (It-LLMs) in reading-comprehension on heterogeneous inputs, and the reasoning abilities behind output generation, we selected a collection of definitely unseen databases.</p>
      <sec id="sec-6-1">
        <title>4.1. Datasets</title>
        <p>In order to analyze the generalization abilities, we fed the models dumps of three SQL databases that are definitely unseen, thus not found on the Web and never seen in the pre-training corpora of Large Language Models. Moreover, the databases differ in topic, topology, and size, as shown in Table 1.</p>
      </sec>
      <sec id="sec-6-2">
        <title>4.2. Experimental Settings</title>
        <p>After describing the data (Section 4.1) and the prompting methodologies (Section 3), we tested our proposals on GPT-3.5 and Claude Instant. Hence, we provided the Structural Information, defined in Section 3.1, in three different ways, and in each of them we requested the translation of four Natural Language Queries (NLQ) for each hardness-level; the sketch below makes the resulting grid of requests explicit. We conducted experiments on three different databases to study the phenomena in different scenarios. The NLQs were in Italian and, as described in Section 3.2, were of the type: "Traduci in sql la seguente query 'nomi, cognomi, età degli utenti ... ordinati per età'" ("Translate into SQL the following query 'names, surnames, age of the users ... ordered by age'").</p>
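        <p>The experimental grid, as we read it from the numbers above (the database labels DB1-DB3 are an assumption; only DB3 is named explicitly in Section 5.2):</p>
        <preformat>
from itertools import product

# Our reading of the setup: 3 approaches x 3 databases x 3 hardness-levels
# x 4 NLQs per level; the database labels DB1-DB3 are assumed.
APPROACHES = ["SOLO~SCHEMA", "UGLY~SCHEMA", "UGLY &amp; INSERT"]
DATABASES = ["DB1", "DB2", "DB3"]
LEVELS = ["EASY", "MEDIUM", "HARD"]
NLQS_PER_LEVEL = 4

requests = list(product(APPROACHES, DATABASES, LEVELS, range(NLQS_PER_LEVEL)))
assert len(requests) == 108  # 36 translation requests per approach, per model
        </preformat>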
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Results &amp; Discussion</title>
      <sec id="sec-7-1">
        <title>5.1. The reading-comprehension Challenge</title>
        <p>It-LLMs are amazing understanders; in fact, in the presence of structured information, they perform very well in overcoming complex challenges and generating good translations from text to SQL. In Table 2 we can observe that both GPT-3.5 and Claude Instant perform very well in the SOLO∼SCHEMA approach. In particular, both GPT-3.5 and Claude Instant produce an accurate translation for all the EASY queries. Moreover, Claude Instant produces very good results on average also on the MEDIUM queries. Hence, the It-LLMs showed good abilities in comprehending natural language and the structural information of databases expressed in the SQL language.</p>
      </sec>
      <sec id="sec-7-2">
        <title>5.2. The reasoning-generation Challenge</title>
        <p>The It-LLMs' reasoning and SQL query generation skills are strongly related to the hardness of the required queries. Indeed, the It-LLMs could generate intriguing output even in zero-shot and low-resource scenarios (with limited structural information). However, they could not generate exhaustive translations when the types of SQL queries required were hard. In fact, in Table 2 it is possible to observe a marked decrease in the SOLO∼SCHEMA rows of the HARD columns compared to the EASY and MEDIUM columns. In particular, for DB3 queries, performances fall by half, or worse, going from the EASY level to the HARD level.</p>
      </sec>
      <sec id="sec-7-3">
        <title>5.3. Effects of degradation of structural information</title>
        <p>Both the reading-comprehension and the reasoning-generation abilities of It-LLMs are negatively affected by degrading the database information. In fact, we can observe that as we degrade the structural information of the database by removing the vowels from the table and attribute names (UGLY∼SCHEMA), the models tend to make errors with a high frequency. Looking at Table 2, GPT-3.5 and Claude Instant performances deteriorate at all hardness-levels. Moreover, GPT-3.5 always fails to translate HARD queries. This means that both models find it more challenging to understand what is asked in the NL query and to reason over the database structure with deteriorated names.</p>
        <p>However, some points can be recovered by providing the database with a small amount of real data (UGLY &amp; INSERT). This phenomenon can be observed by noting that the TOT obtained with the UGLY &amp; INSERT approach never worsens compared to the UGLY∼SCHEMA approach, regardless of the hardness-level of the queries. Hence, we can conclude that degrading information quality has negative effects on both models, affecting the reliability of their reasoning skills.</p>
        <p>Finally, we want to quantify how model performance is affected by the amount of information available on a database compared to the amount of information needed to effectively resolve queries. We hence define this quantity of information as the Information Level i, as follows: i = a / h, where a is the Approach Score and h is the Hardness Score. The Approach Score assigns a score to each approach, ranging from 1 to 2: the highest value, 2, is assigned to the SOLO∼SCHEMA approach and the lowest, 1, to UGLY∼SCHEMA; the UGLY &amp; INSERT approach is assigned an intermediate score of 1.5. To calculate the Information Level, we smooth this information with the actual hardness of the query, given by the Hardness Score h: it ranges from 1 (for the EASY level) to 3 (for the HARD level).</p>
        <p>As shown in Figure 2, GPT-3.5 and Claude Instant performances correlate with the Information Level. For GPT-3.5 (Figure 2a), a large Pearson correlation coefficient (0.88) is observed, which is statistically significant with a p-value of 0.001. Claude Instant performance (Figure 2b) is still positively correlated with the Information Level, although the Pearson correlation coefficient is lower (0.5) and has a higher p-value (0.1).</p>
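        <p>For concreteness, the following sketch tabulates the Information Level for every approach/hardness pair under the definition above (our own tabulation; assigning the MEDIUM level the intermediate Hardness Score of 2 is our assumption):</p>
        <preformat>
# Information Level i = a / h, tabulated for every approach/hardness pair.
APPROACH_SCORE = {"SOLO~SCHEMA": 2.0, "UGLY &amp; INSERT": 1.5, "UGLY~SCHEMA": 1.0}
HARDNESS_SCORE = {"EASY": 1, "MEDIUM": 2, "HARD": 3}  # MEDIUM = 2 assumed

for approach, a in APPROACH_SCORE.items():
    for level, h in HARDNESS_SCORE.items():
        print(f"{approach:13s} {level:6s} i = {a / h:.2f}")

# The extremes: SOLO~SCHEMA on an EASY query gives i = 2.00 (most information
# relative to need); UGLY~SCHEMA on a HARD query gives i = 0.33 (least).
        </preformat>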
      </sec>
      <sec id="sec-7-4">
        <title>5.4. Errors Analysis</title>
        <p>In this section, we focus on the characterization of the errors that are made by the analyzed models. We investigate two types of errors: semantic errors and syntactic errors. Semantic errors are queries mistranslated by the system that, if executed, result in the selection of information other than what was initially requested in natural language. On the other hand, syntactic errors are errors that make the query not executable by an engine: these queries are characterized by incorrect use of SQL syntax (e.g., they contain a field in the HAVING statement that is not present in the SELECT) or contain references to tables and fields that do not exist in the database in question. In Figure 3, we can observe the effect of the different approaches on the number of errors in the two cases.</p>
        <p>As expected, as the information available to a system decreases, the number of semantic errors tends to increase. We can observe that both GPT-3.5 (Figure 3a) and Claude Instant (Figure 3b) tend to make a limited number of semantic errors in the SOLO∼SCHEMA approach, while the UGLY∼SCHEMA approach leads to the largest number of errors. We can also observe that the UGLY &amp; INSERT approach, with a limited set of realistic data, seems to reduce the number of semantic errors.</p>
        <p>On the other hand, the trend in the number of syntactic errors differs between the two models. In GPT-3.5, the decrease in the informativeness of the dumps leads to more errors. Manual inspection found that only one error was due to incorrect use of SQL syntax: in most cases, GPT-3.5 has difficulty identifying the tables and columns to be used in the given database and therefore proposes SQL queries that make use of arbitrary tables. In this case, the syntactic errors are definitely examples of hallucinations and need to be further explored. Claude Instant, instead, tends to retain more information about the dump, and its number of syntactic errors is more constant across the different approaches.</p>
        <p>Figure 3: Numbers of semantic and syntactic errors for GPT-3.5 and Claude Instant across approaches, ordered from most informative to least informative. (a) GPT-3.5; (b) Claude Instant.</p>
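        <p>One way to operationalize the semantic/syntactic distinction defined above (a sketch of ours, not the evaluation code used in this work) is to execute the predicted query: a query rejected by the engine counts as a syntactic error, while a query that runs but returns a result set different from the gold query's counts as a semantic error.</p>
        <preformat>
import sqlite3

def classify(db_path: str, predicted_sql: str, gold_sql: str) -> str:
    """Classify a predicted translation as 'correct', a 'semantic error'
    (executes, but selects information other than what was requested),
    or a 'syntactic error' (not executable: bad SQL syntax or references
    to non-existent tables/fields). Assumes gold_sql itself is valid."""
    conn = sqlite3.connect(db_path)
    try:
        try:
            predicted_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return "syntactic error"
        gold_rows = conn.execute(gold_sql).fetchall()
        return "correct" if predicted_rows == gold_rows else "semantic error"
    finally:
        conn.close()
        </preformat>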
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6. Conclusion</title>
      <p>In this paper, we propose an iterative reading-comprehension and reasoning approach to solve question-answering challenges of the Text-to-SQL task. The results obtained from the experiments conducted in this work witness the potential of Instruction-tuned Large Language Models (It-LLMs). However, despite their promising performance, certain limitations have emerged. We discovered that, even with minimal information about the database, It-LLMs can generate natural language query translations that yield correct and executable SQL queries by just prompting them. Nevertheless, it became evident that reducing the amount of information provided could lead to the generation of incorrect queries. Expanding the scope of our investigation, we believe it would be worthwhile to conduct similar experiments with other It-LLMs. Such comparisons could help determine whether the common phenomena observed in both tested models result from a coincidence or represent aspects to further investigate in studying these new technologies.</p>
      <p>In conclusion, this research underscores the substantial advancements offered by It-LLMs in the realm of Text-to-SQL translation, while also highlighting the implications of choosing whether to provide more or less information during the prompting process.</p>
    </sec>
  </body>
  <back>
    <ack>
      <p>This work was conducted within the DATALAKE Giustizia project; we acknowledge the partners and the scientific committee for their support.</p>
    </ack>
    <ref-list>
      <ref id="ref1"><mixed-citation>[1] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.</mixed-citation></ref>
      <ref id="ref2"><mixed-citation>[2] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, et al., Language models are few-shot learners, 2020. arXiv:2005.14165.</mixed-citation></ref>
      <ref id="ref3"><mixed-citation>[3] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, S. Presser, C. Leahy, The Pile: An 800GB dataset of diverse text for language modeling, 2020. arXiv:2101.00027.</mixed-citation></ref>
      <ref id="ref4"><mixed-citation>[4] S. Mishra, D. Khashabi, C. Baral, H. Hajishirzi, Cross-task generalization via natural language crowdsourcing instructions, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3470-3487. URL: https://aclanthology.org/2022.acl-long.244. doi:10.18653/v1/2022.acl-long.244.</mixed-citation></ref>
      <ref id="ref5"><mixed-citation>[5] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human feedback, 2022. arXiv:2203.02155.</mixed-citation></ref>
      <ref id="ref6"><mixed-citation>[6] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. Bers, S. Biderman, L. Gao, T. Wolf, A. M. Rush, Multitask prompted training enables zero-shot task generalization, 2022. arXiv:2110.08207.</mixed-citation></ref>
      <ref id="ref7"><mixed-citation>[7] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, H. Hajishirzi, Self-Instruct: Aligning language models with self-generated instructions, 2023. arXiv:2212.10560.</mixed-citation></ref>
      <ref id="ref8"><mixed-citation>[8] Y. K. Dwivedi, N. Kshetri, L. Hughes, E. L. Slade, A. Jeyaraj, A. K. Kar, A. M. Baabdullah, et al., Opinion Paper: "So what if ChatGPT wrote it?" Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy, International Journal of Information Management 71 (2023) 102642. URL: https://www.sciencedirect.com/science/article/pii/S0268401223000233. doi:10.1016/j.ijinfomgt.2023.102642.</mixed-citation></ref>
      <ref id="ref9"><mixed-citation>[9] J. Jiang, K. Zhou, J.-R. Wen, X. Zhao, Great truths are always simple: A rather simple knowledge encoder for enhancing the commonsense reasoning capacity of pre-trained models, in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 1730-1741. URL: https://aclanthology.org/2022.findings-naacl.131. doi:10.18653/v1/2022.findings-naacl.131.</mixed-citation></ref>
      <ref id="ref10"><mixed-citation>[10] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, T. Scialom, Toolformer: Language models can teach themselves to use tools, 2023. arXiv:2302.04761.</mixed-citation></ref>
      <ref id="ref11"><mixed-citation>[11] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. Xu, P. Fung, A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity, 2023. arXiv:2302.04023.</mixed-citation></ref>
      <ref id="ref12"><mixed-citation>[12] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, J. Ba, Large language models are human-level prompt engineers, 2022. arXiv:2211.01910.</mixed-citation></ref>
      <ref id="ref13"><mixed-citation>[13] J. Jang, S. Ye, M. Seo, Can large language models truly understand prompts? A case study with negated prompts, 2022. arXiv:2209.12711.</mixed-citation></ref>
      <ref id="ref14"><mixed-citation>[14] S. Arora, A. Narayan, M. F. Chen, L. Orr, N. Guha, K. Bhatia, I. Chami, F. Sala, C. Ré, Ask me anything: A simple strategy for prompting language models, 2022. arXiv:2210.02441.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] T. Wolfson, D. Deutch, J. Berant, Weakly supervised text-to-SQL parsing through question decomposition, in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 2528-2542. URL: https://aclanthology.org/2022.findings-naacl.193. doi:10.18653/v1/2022.findings-naacl.193.</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, L. Zettlemoyer, OPT: Open pre-trained transformer language models, 2022. arXiv:2205.01068.</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, 2022. arXiv:2109.01652.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, W. Fedus, Emergent abilities of large language models, 2022. arXiv:2206.07682.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, 2023. arXiv:2201.11903.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] B. Peng, C. Li, P. He, M. Galley, J. Gao, Instruction tuning with GPT-4, 2023. arXiv:2304.03277.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, 2023. arXiv:2203.11171.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] Z. Lin, S. Trivedi, J. Sun, Generating with confidence: Uncertainty quantification for black-box large language models, 2023. arXiv:2305.19187.</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford Alpaca: An instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca, 2023.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. URL: https://vicuna.lmsys.org.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] A. Gudibande, E. Wallace, C. Snell, X. Geng, H. Liu, P. Abbeel, S. Levine, D. Song, The false promise of imitating proprietary LLMs, 2023. arXiv:2305.15717.</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] P. Atzeni, R. Basili, D. Hansen, P. Missier, P. Paggio, M. Pazienza, F. Zanzotto, Ontology-based question answering in a Federation of University Sites: The MOSES case study, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2004). URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-35048854325&amp;doi=10.1007%2f978-3-540-27779-8_40&amp;partnerID=40&amp;md5=7545b9abe40e6ac9d64b47d45e71b78c. doi:10.1007/978-3-540-27779-8_40.</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] R. Basili, D. H. Hansen, P. Paggio, M. T. Pazienza, F. M. Zanzotto, Ontological resources and question answering, in: Proceedings of the Workshop on Pragmatics of Question Answering at HLT-NAACL 2004, Association for Computational Linguistics, Boston, Massachusetts, USA, 2004, pp. 78-84. URL: https://aclanthology.org/W04-2510.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] F. Li, H. V. Jagadish, Constructing an interactive natural language interface for relational databases, Proceedings of the VLDB Endowment 8 (2014) 73-84.</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] T. Mahmud, K. M. Hasan, M. Ahmed, T. Chak, A rule based approach for NLP based query processing, 2015, pp. 78-82. doi:10.1109/EICT.2015.7391926.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] B. Bogin, J. Berant, M. Gardner, Representing schema structure with graph neural networks for text-to-SQL parsing, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4560-4565. URL: https://aclanthology.org/P19-1448. doi:10.18653/v1/P19-1448.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] B. Wang, R. Shin, X. Liu, O. Polozov, M. Richardson, RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7567-7578. URL: https://aclanthology.org/2020.acl-main.677. doi:10.18653/v1/2020.acl-main.677.</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] I. Sucameli, A. Bondielli, L. Passaro, E. Annunziata, G. Lucherini, A. Romei, A. Lenci, Mate, a meta layer between natural language and database, in: Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with the 21st International Conference of the Italian Association for Artificial Intelligence (AI*IA 2022), 2022.</mixed-citation></ref>
      <ref id="ref34"><mixed-citation>[34] T. Scholak, N. Schucher, D. Bahdanau, PICARD: Parsing incrementally for constrained auto-regressive decoding from language models, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 9895-9901. URL: https://aclanthology.org/2021.emnlp-main.779. doi:10.18653/v1/2021.emnlp-main.779.</mixed-citation></ref>
      <ref id="ref35"><mixed-citation>[35] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, et al., UnifiedSKG: Unifying and multi-tasking structured knowledge grounding with text-to-text language models, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 2022, pp. 602-631. URL: https://aclanthology.org/2022.emnlp-main.39.</mixed-citation></ref>
      <ref id="ref36"><mixed-citation>[36] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, D. Radev, Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3911-3921. URL: https://aclanthology.org/D18-1425. doi:10.18653/v1/D18-1425.</mixed-citation></ref>
      <ref id="ref37"><mixed-citation>[37] OpenAI, ChatGPT, 2022. URL: https://chat.openai.com/.</mixed-citation></ref>
      <ref id="ref38"><mixed-citation>[38] Anthropic, Claude-instant, 2022. URL: https://poe.com/.</mixed-citation></ref>
    </ref-list>
  </back>
</article>