<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Bringing Rome to Life: Evaluating Historical Image Generation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Phillip</forename><forename type="middle">B</forename><surname>Ströbel</surname></persName>
							<email>phillip.stroebel@uzh.ch</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computational Linguistics</orgName>
								<orgName type="institution">University of Zurich</orgName>
								<address>
									<addrLine>Andreasstrasse 15</addrLine>
									<postCode>8050</postCode>
									<settlement>Zurich</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Department of History</orgName>
								<orgName type="institution">University of Zurich</orgName>
								<address>
									<addrLine>Karl Schmid-Strasse 4</addrLine>
									<postCode>8006</postCode>
									<settlement>Zurich</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zejie</forename><surname>Guo</surname></persName>
							<email>zejie.guo@uzh.ch</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computational Linguistics</orgName>
								<orgName type="institution">University of Zurich</orgName>
								<address>
									<addrLine>Andreasstrasse 15</addrLine>
									<postCode>8050</postCode>
									<settlement>Zurich</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ülkü</forename><surname>Karagöz</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computational Linguistics</orgName>
								<orgName type="institution">University of Zurich</orgName>
								<address>
									<addrLine>Andreasstrasse 15</addrLine>
									<postCode>8050</postCode>
									<settlement>Zurich</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Eva</forename><forename type="middle">Maria</forename><surname>Willi</surname></persName>
							<email>evamaria.willi@uzh.ch</email>
							<affiliation key="aff1">
								<orgName type="department">Department of History</orgName>
								<orgName type="institution">University of Zurich</orgName>
								<address>
									<addrLine>Karl Schmid-Strasse 4</addrLine>
									<postCode>8006</postCode>
									<settlement>Zurich</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Felix</forename><forename type="middle">K</forename><surname>Maier</surname></persName>
							<email>felix.maier@hist.uzh.ch</email>
							<affiliation key="aff1">
								<orgName type="department">Department of History</orgName>
								<orgName type="institution">University of Zurich</orgName>
								<address>
									<addrLine>Karl Schmid-Strasse 4</addrLine>
									<postCode>8006</postCode>
									<settlement>Zurich</settlement>
									<country key="CH">Switzerland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Bringing Rome to Life: Evaluating Historical Image Generation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">8DD773794F2D18502FDF7B009E3F212B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:49+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Digital Humanities</term>
					<term>image generation</term>
					<term>human evaluation</term>
					<term>automatic evaluation</term>
					<term>history</term>
					<term>image dataset</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This study evaluates the potential of AI image generation for visualising historical events, focusing on two ancient Roman scenarios: the Roman triumph and the Lupercalia festival. Using DALL-E 3, we generated 600 images based on 100 prompts derived from scientific texts. We then conducted a two-part evaluation: (1) a human evaluation by 21 history students, who compared image pairs and rated individual images on accuracy and prompt alignment, and (2) two automated analyses, one modelled after the human evaluation protocol and one using visual question-answering (VQA) techniques.</p><p>Our results reveal both the promise and limitations of AI in historical visualisation. While DALL-E 3 produced many convincing images, there were notable discrepancies between human and automated assessments. We found that Large Language Models tend to rate images more favourably than human evaluators.</p><p>We contribute a novel dataset for historical image generation, initial human and automated evaluation protocols, and insights into the challenges of using AI for historical visualisation, a capability that is important for historians seeking to reconstruct past events. Our findings highlight the need for refined evaluation methods and underscore the complexity of assessing historical accuracy in AI-generated imagery. This study lays the groundwork for future research on improving AI models for historical visualisation and developing more robust evaluation frameworks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Historians, akin to criminologists, analyse primary sources and eyewitness accounts to extract meaning and understand the motives and circumstances of historical events. However, unlike criminologists, who can re-enact events, historians face the challenge of studying occurrences that cannot be replicated or reproduced in experiments. This presents a significant challenge in their work.</p><p>Criminologists have developed methods to mitigate the uncertainties involved. Re-enacting crucial moments of an action or crime using real people or AI-based simulations has become</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Our Contribution</head><p>Our study focuses on these two events due to their significance in Roman culture and the varying levels of textual and visual documentation available for each. The Roman triumph, a well-documented celebration of military victory, provides a rich base of textual descriptions. In contrast, the Lupercalia, an ancient fertility festival, offers a more challenging scenario with fewer detailed contemporary accounts.</p><p>To assess DALL-E 3's capabilities in this domain, we generated 600 images: 450 for the triumph and 150 for the Lupercalia (see Section 3). Our evaluation process is twofold:</p><p>1. Human evaluation: We conducted a comprehensive review involving 21 advanced history students to assess the images' historical accuracy. 2. Automated analysis: We employed computer vision techniques to analyse the images for prompt alignment.</p><p>This dual approach allows us to measure the generated images' subjective impact on human viewers and their objective alignment with historical data. Our research contributes to the broader discussion of AI's potential and limitations in historical visualisation and comprises the following items:</p><p>1. A novel, automatically generated dataset comprising 100 prompts and 600 images for historical image generation. 2. An initial human evaluation of a subset of these automatically generated images.</p><p>3. An initial automatic evaluation of the same subset.</p><p>4. An assessment of how well human and automatic evaluation correlate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>The evaluation of automatically generated images has recently gained traction, mainly due to the increasingly sophisticated image generation models. Otani, Togashi, Sawai, Ishigami, Nakashima, Rahtu, Heikkilä, and Satoh <ref type="bibr" target="#b23">[24]</ref> concluded, based on an extensive analysis of 37 papers, that human evaluation protocols are often not reproducible and lack a clear description. Moreover, evaluation usually relies on automatic measures that poorly align with human scores.</p><p>The advantage of human feedback is that it can improve text-to-image models, e.g., with reinforcement learning from human feedback (as used in Natural Language Processing <ref type="bibr" target="#b27">[28]</ref>). Xu, Liu, Wu, Tong, Li, Ding, Tang, and Dong <ref type="bibr" target="#b32">[33]</ref> exploited a dataset of 8,878 prompts and 136,892 image comparisons to fine-tune a reward model that aligns more closely with human preferences. Liang, He, Li, Li, Klimovskiy, Carolan, Sun, Pont-Tuset, Young, Yang, Ke, Dvijotham, Collins, Luo, Li, Kohlhoff, Ramachandran, and Navalpakkam <ref type="bibr" target="#b16">[17]</ref> used human feedback concerning Plausibility, Aesthetics, Text-image Alignment, and an Overall impression to predict human feedback scores. Due to the successful integration of human feedback into model fine-tuning by Xu, Liu, Wu, Tong, Li, Ding, Tang, and Dong <ref type="bibr" target="#b32">[33]</ref>, we created an evaluation scenario which allows us to integrate such feedback directly in future work (see Section 4.1).</p><p>While Xu, Liu, Wu, Tong, Li, Ding, Tang, and Dong <ref type="bibr" target="#b32">[33]</ref> focused on prompt-to-image alignment, other image properties are open for evaluation. 
Lee, Yasunaga, Meng, Mai, Park, Gupta, Zhang, Narayanan, Teufel, Bellagente, Kang, Park, Leskovec, Zhu, Li, Wu, Ermon, and Liang <ref type="bibr" target="#b15">[16]</ref> worked on holistic image evaluation and identified twelve aspects, including Alignment, Quality, Aesthetics, and Originality. Evaluating each aspect calls for different measures, some of them human, some of them automated. They created a holistic image evaluation benchmark for existing datasets and reported scores for all aspects and 26 models. While such an evaluation effort is valuable and provides a helpful overview, we focus on prompt-to-image alignment evaluation in this work.</p><p>The research mentioned above has had access to large and heterogeneous datasets and results from extensive evaluation campaigns. In the context of historical image generation, such work does not yet exist. One exception is the investigation of Fareed, Bou Nassif, and Nofal <ref type="bibr" target="#b7">[8]</ref> who tested the usage of Leonardo<ref type="foot" target="#foot_0">1</ref> for teaching purposes in the field of "History of Architecture". They evaluated the usability of Leonardo with a questionnaire after a workshop, which indicated a general need for evaluating AI-generated images for use in the historical domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Data Collection with DALL-E 3</head><p>Next, we outline the methodology for data collection using DALL-E 3 to generate images related to triumphal processions and the Lupercalia, which included the following steps:</p><p>1. Collecting Historical Documents: We collected resources (i.e., academic papers, books, and other relevant documents) about the triumph and the Lupercalia in ancient Rome. Specifically, we included five documents related to the Lupercalia <ref type="bibr" target="#b31">[32,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b9">10]</ref> and 15 documents focused on triumphal processions <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b29">30]</ref>. 2. Creating Prompts from Documents: For each document, we manually derived five prompts. Each prompt was designed to capture a specific scene described in the texts. E.g., a document on triumphal processions could include prompts about the attire Romans wore, the types of vehicles used, or the procession sequence. In total, we created 100 prompts. 3. Image Generation with DALL-E 3: We used each prompt to generate six images using DALL-E 3 <ref type="bibr" target="#b3">[4]</ref> via the OpenAI API. 
<ref type="foot" target="#foot_1">2</ref> The 100 prompts resulted in 150 generated images for the Lupercalia and 450 for the triumphal processions. <ref type="foot" target="#foot_2">3</ref> Note that we did not force the model to produce realistic images. This led to a great variety of image styles, some of which are indeed life-like, while others are more in the style of a Renaissance painting or a black-and-white pencil sketch. All prompts, however, are based on scientific literature. See Figure <ref type="figure" target="#fig_0">1</ref> for example images and prompts from the dataset. <ref type="foot" target="#foot_3">4</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluating Automatically Generated Data</head><p>The following sections focus on the different evaluation scenarios employing human annotators and automatic evaluation measures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Human Evaluation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">Human Evaluation Setup</head><p>We generated two evaluation scenarios to obtain feedback from human annotators.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Image Comparison (IC)</head><p>The first scenario asks annotators to decide which of two images better reflects the prompt. This is the cognitively easier task. Much in the manner of Xu, Liu, Wu, Tong, Li, Ding, Tang, and Dong <ref type="bibr" target="#b32">[33]</ref>, we plan to use these ratings for fine-tuning models to produce more faithful images. The participants are instructed not to judge the image style. We only compared images generated with the same prompt, which, based on the formula 𝑛(𝑛−1)/2 for the number of unique pairings, results in 15 pairs per prompt (as mentioned in the previous section, we generated six images per prompt). Multiplied by the 100 prompts in the dataset, this yields 1,500 comparisons.</p></div>
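The pair arithmetic above can be sanity-checked with a few lines (an illustrative sketch; the function name is our own):

```python
from itertools import combinations


def unique_pairs(n_images: int) -> int:
    """Number of unordered image pairs per prompt, i.e. n(n-1)/2."""
    return len(list(combinations(range(n_images), 2)))


# Six images per prompt give 15 pairs; with 100 prompts, 1,500 comparisons.
pairs_per_prompt = unique_pairs(6)
total_comparisons = pairs_per_prompt * 100
```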
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Image Rating (IR)</head><p>The second task requires the participants to rate an image on a 5-point Likert scale with the following options:</p><p>1. The image does not match the prompt at all. 2. The image barely contains aspects of the prompt. 3. The image catches some aspects of the prompt, but it is not very accurate. 4. The image catches most of the aspects of the prompt. 5. The image completely matches the prompt.</p><p>Additionally, we asked the users to describe in a text field which aspects of the image did not correspond to the prompt. In this scenario, which demands more time and effort, one complete annotation of the dataset requires 600 ratings.</p><p>We set up a Prodigy interface, <ref type="foot" target="#foot_4">5</ref> which we used to obtain the annotators' assessments. See Figure <ref type="figure" target="#fig_1">2</ref> for an impression of the annotation environment. We recruited 21 advanced history students for the annotations. We did not ask the participants to annotate a specific number of pairs. They were compensated with book vouchers worth $30. An online meeting was organised to explain the guidelines, emphasising that in the first scenario, they should judge based on the alignment of the images with the prompts rather than their visual appeal. They should consider visual features only if the two images reflect the prompts equally well. The students spent approximately one afternoon annotating the data in both scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Results of Human Evaluation</head><p>Table <ref type="table" target="#tab_0">1</ref> gives an overview of the results from the human evaluation. In the IC setting, we received 1,569 comparisons, of which 103 samples were annotated more than once. For unknown reasons, 64 data points did not contain the human assessment, so we excluded them from further analysis. On average, each participant compared 74.71 (SD 43.32) image pairs. The IR scenario received less feedback, since the participants provided written feedback in a text field in addition to their rating. We obtained 568 ratings, of which 29 were duplicates; 24 submissions without scores had to be excluded.</p><p>We must note here that, due to a wrong parameter setting of Prodigy in both scenarios, the data samples to be evaluated were presented to the participants in sequential rather than random order. This led to only marginal annotation overlap. For this reason, we cannot compute inter-annotator agreements (IAA) yet. However, since we plan to improve the models with the feedback obtained from the participants, we will have further evaluation rounds during which we can address this limitation. Still, to the best of our knowledge, this is the first "large-scale" evaluation campaign dedicated to historical image generation. We can still analyse and compare the results obtained with these limitations in mind (see Section 4.1.3).</p><p>However, since previous studies reported low IAA in human evaluation scenarios (cf. <ref type="bibr" target="#b15">[16]</ref>), we hypothesise a similar outcome on our dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3.">Comparison of Human Results with Large Language Model (LLM) Evaluation</head><p>To mitigate the missing information on IAA and to evaluate the suitability of multimodal LLMs for scoring tasks, we employed GPT-4o <ref type="bibr" target="#b20">[21]</ref>, Gemini 1.5 Pro <ref type="bibr" target="#b25">[26]</ref> and Claude 3.5 Sonnet. <ref type="foot" target="#foot_5">6</ref> We let the LLMs solve the same tasks as the annotators, i.e., we applied them to the IC (only GPT-4o) and IR (all three) evaluation scenarios. <ref type="foot" target="#foot_6">7</ref> For IC, Table <ref type="table" target="#tab_1">2</ref> shows the agreements of the human comparisons with GPT-4o's comparisons. We see that in 57.41% of the cases, human annotators and GPT-4o agree on which of the two images better corresponds to the prompt.</p><p>Figure <ref type="figure">3</ref> summarises the results for the IR setting. The left graph shows the differences between the human and the LLM ratings. The tendency is that LLMs rate images higher than human annotators. The right graph shows the LLMs' deviations from the human scores. E.g., in 164 (30.15%) ratings, GPT-4o agrees with the human scores. In 169 (31.07%) cases, GPT-4o scores one point higher on the Likert scale than the human annotators (i.e., GPT-4o rated an image a 3 where the human annotator rated it a 2). Claude, in particular, tends to rate images higher. Overall, the deviations seem normally distributed, a fact that might be exploited for future evaluations.</p><p>Choosing two scenarios to evaluate allows us to test for differences in assessing images  between the triumph and the Lupercalia scenario. Our null hypothesis 𝐻 0 is that there is no difference in the ratings of human annotators and, e.g., GPT-4o in the two historical scenarios. 
Table <ref type="table" target="#tab_2">3</ref> shows the results of Welch's t-tests <ref type="bibr" target="#b30">[31]</ref>, which we chose because of (i) unequal variances and (ii) unequal sample sizes. For the human evaluation (unifying the assessment results but excluding invalid samples), the p-value does not allow us to reject 𝐻 0 . The GPT and Gemini ratings show another picture. The p-values show a highly significant difference between ratings of the triumph and the Lupercalia images. Claude's p-value is on the brink of showing a statistically significant difference. The on-average lower LLM ratings of the Lupercalia images could indicate DALL-E's difficulties in generating adequate imagery. Firstly, since the Lupercalia is far less thoroughly described and illustrated, it is plausible that images portraying the festival do not reach the standard of those generated for the triumphal procession. Secondly, the automatic evaluation poses problems for LLMs because they do not "know" as much about the Lupercalia as they do about the triumph.</p><p>Although we cannot provide IAA scores for the human evaluation yet, we can do so for the automatically generated ratings by the LLMs. Table <ref type="table" target="#tab_3">4</ref> shows the results when we compare the ratings of the LLMs (again split into triumph- and Lupercalia-related scores). The scores are all around 0, indicating low agreement. Unifying all human scores and comparing them against the ratings obtained via GPT-4o also shows low agreement. These results hint at the very different rating "strategies" of the LLMs. We need further evaluation to shed more light on the origins of the discrepancies.</p></div>
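Welch's t statistic and the Welch-Satterthwaite degrees of freedom can be recomputed from the summary statistics in Table 3 (a sketch under our own naming; the p-values additionally require the t-distribution CDF, e.g. from scipy.stats):

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    appropriate for samples with unequal variances and unequal sizes."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Human ratings from Table 3: triumph (mean 3.46, SD 1.12, n=404)
# vs. Lupercalia (mean 3.31, SD 1.14, n=140)
t, df = welch_t(3.46, 1.12, 404, 3.31, 1.14, 140)
```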
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Automatic Evaluation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">Automatic Evaluation Setup</head><p>For a further fully automatic evaluation procedure, we employed the Question Generation and Answering (QG/A) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b5">6]</ref> framework for automatic image evaluation. The first step in this framework involves using a pre-trained language model to generate a set of questions based on a given prompt and question-generation instructions via few-shot learning. In the second step, a pre-trained multimodal model generates answers given the image and the generated set of questions.</p><p>Question Generation (QG) In our study, we utilised GPT-3.5 <ref type="bibr" target="#b4">[5]</ref> for QG employing the Davidsonian Scene Graph (DSG) <ref type="bibr" target="#b5">[6]</ref> method. DSG serves as an evaluation framework grounded in formal semantics. This method's main advantage is its ability to generate atomic and unique questions structured in dependency graphs, which (i) ensure comprehensive semantic coverage and (ii) avoid inconsistencies in responses. Cho, Hu, Garg, Anderson, Krishna, Baldridge, Bansal, Pont-Tuset, and Wang <ref type="bibr" target="#b5">[6]</ref> empirically demonstrated that DSG addresses the challenges of hallucinations, duplications, and omissions in QG.</p><p>Visual Question Answering (VQA) We employed GPT-4o for the VQA task. The following prompt instruction guides the model: "You are a helpful assistant. Please answer the question only with 'Yes' or 'No'. Do not give other outputs. Question: {question}." To ensure precise control over the output, specifically responding with either 'Yes' or 'No', we set the parameter logit_bias to 100 for both 'Yes' and 'No' tokens. Logit bias modifies the likelihood of specified tokens appearing in the model-generated output. 
We also set the top_p (nucleus sampling) parameter to 0.1 to restrict the model's consideration to a subset of tokens (the nucleus) whose cumulative probability mass reaches a designated threshold (top-p). In the context of a 0.1 top_p setting, the model exclusively considers tokens constituting the top 10% of the probability mass for the subsequent token. The combination of logit_bias and top_p configurations enables the outputs to adhere to predefined patterns ('Yes' and 'No'), rendering the model more deterministic and particularly suitable for our image evaluation task. <ref type="foot" target="#foot_7">8</ref> We assign a score of 1 for 'Yes' and 0 for 'No' and then compute an average score for each image. We observe that GPT occasionally generates questions such as "Is there an image?" or "Can you visualize a scene?" which are invalid in our context, as the input consistently includes an image and a set of questions. We excluded the scores of these invalid questions from our analysis.</p></div>
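The scoring step described above can be sketched as follows (the function name and the set of degenerate questions are our own illustration; the paper does not specify an implementation):

```python
# Degenerate questions we exclude; the two examples come from the text.
INVALID_QUESTIONS = {
    "Is there an image?",
    "Can you visualize a scene?",
}


def vqa_score(qa_pairs):
    """Average of binary answers per image: 'Yes' -> 1, 'No' -> 0,
    skipping questions that are invalid for image input."""
    valid = [(q, a) for q, a in qa_pairs if q not in INVALID_QUESTIONS]
    if not valid:
        return None
    return sum(1 if a == "Yes" else 0 for _, a in valid) / len(valid)
```

For example, an image whose valid questions receive one 'Yes' and one 'No' scores 0.5.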
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Results of VQA</head><p>Figure <ref type="figure" target="#fig_2">4</ref> shows a histogram of the VQA scores for all 600 images. Most scores lie between 0.5 and 0.9, with over 60 images obtaining a perfect score of 1, meaning that every 'Yes-or-No' question was answered with 'Yes'. Looking at the three results presented in Figure <ref type="figure" target="#fig_3">5</ref> in Appendix A, we find that VQA attributes a low score of 0.05 to image a). The human evaluator and GPT, however, scored this image a 4 in the IR scenario. In b), we have a medium VQA score of 0.61, a human score of 5 and a GPT score of 4. Lastly, c) shows an image with a VQA score of 1, but a human annotator scored this image a 3 and GPT a 4. These three examples alone already reveal discrepancies between the different scores. A comparison of the VQA scores between the 450 images from the triumphal procession and the 150 images from the Lupercalia based on Welch's t-test shows no significant difference between the two ratings (𝑝 = 0.88). From this, we conclude that ratings based on VQA produce more reliable results than those produced with a Likert scale.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Limitations and Outlook</head><p>The most significant limitation of our work is the missing IAA scores. For future evaluation rounds, we will set up the evaluation to allow for their computation. In this way, we will obtain reliable measures of how demanding the task of assessing the alignment between historical images and the prompts that produced them is. However, we argue that the results we obtained from the human evaluation are still valuable and allow for fine-tuning models based on human feedback (preferences in the IC and textual input in the IR scenario), albeit in a low-resource setting.</p><p>Moreover, we will employ more models to generate images in future experiments. This approach enables us to decide which models are the most suitable for historical image generation. The stable prompt base also allows for comparable results. Still, the significant number of images we will generate in future endeavours also calls for automatic evaluation methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In conclusion, our study provides valuable insights into the potential and challenges of using AI for historical image generation. The evaluation of 600 AI-generated images of triumphal processions and the Lupercalia revealed both promising capabilities and significant limitations.</p><p>Our findings hint at the discrepancies between human and automated assessments, underscoring the complexity of evaluating historical accuracy in AI-generated imagery. Ultimately, this study serves as a stepping stone towards more sophisticated use of AI in historical recreation and education while cautioning against over-reliance on automated systems for historical interpretation. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Additional Figures</head></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Four example images for the two scenarios generated with DALL-E 3. Top row (a and a'), triumphal procession, prompt: Generate an image of Trajan's Triumph as it passes through the Circus Maximus from the point of view of one of the around 150,000 to 250,000 spectators. Bottom row (b and b'), Lupercalia, prompt: Create a historical image of a group of Luperci running about naked and holding thongs made of goat hides during the Lupercalia ritual in 44 BCE at the foot of the Palatine Hill. As they run past people they strike them with the thongs. They are laughing, larking about and exchanging obscenities with those who attended the ritual. People seem to be happy with what's going on.</figDesc><graphic coords="5,198.57,204.86,95.63,95.63" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Parts of the Prodigy interface to obtain human assessments: a) the interface for image comparison with the side panel with an overview of how many image pairs have been annotated, b) the interface for the rating scenario with the 5-point Likert scale and a text comment field.</figDesc><graphic coords="6,326.46,191.04,111.82,70.87" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Histogram of scores obtained with the VQA evaluation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Examples of the VQA ratings. a) from the triumphal procession based on Mittag [19], scored 0.05 based on 22 questions, prompt: "There exist coins minted in 326 CE which show Emperor Constantinus I. on an elephant quadriga during the celebrations of his viceannalia (20 years on the throne). Although textual sources do not confirm that elephant quadrigas were in use, create an image that shows Constantinus I. together with his son Constantius II. on a chariot pulled by four elephants during the vicennalia in Nicomedia. The chariot is accompanied by two lictores. The elephants are guided by Mahouts and Constantinus the I. wears the laurel wreath. ", scored a 4 by both human evaluators and GPT, b) from the Lupercalia based on Erker [7], scored 0.61 based on 18 questions, prompt: "Create an image that shows high-ranking magistrates of ancient Rome, dressed in loincloths. They are emerging from a cave of the Paletine Hill to start the traditional run of the Lupercalian festival. They are running on a rugged terrain under a blue sky. ", scored a 5 by a human annotator and a 4 by GPT, c) from the triumphal procession based on Madsen [18], scored 1.00 based on 19 questions, prompt: "Create a historical image of the spectacle of Pompey's triumph in 61 BC. Pompey adorned in triumphal regalia, parades through the streets of Rome atop his chariot, with captured treasures and defeated foes on display. Imagine the jubilation among the crowds as they celebrate Pompey's military prowess and the expansion of Roman territories under his command. ", scored a 3 by a human annotator and a 4 by GPT.</figDesc><graphic coords="14,354.90,122.48,105.71,105.71" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Overview of results in the IC and the IR scenarios.</figDesc><table><row><cell>Evaluation scenario</cell><cell cols="4">Total assessments Multiple annotations Excluded After exclusions</cell></row><row><cell>Image Comparison (IC)</cell><cell>1,569</cell><cell>103</cell><cell>64</cell><cell>1,505</cell></row><row><cell>Image Rating (IR)</cell><cell>568</cell><cell>29</cell><cell>24</cell><cell>544</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Agreement of human evaluation with GPT-4o's assessment.</figDesc><table><row><cell>Score Agreement</cell><cell>Count</cell><cell>Percentage</cell></row><row><cell>TRUE</cell><cell>864</cell><cell>57.41%</cell></row><row><cell>FALSE</cell><cell>641</cell><cell>42.59%</cell></row><row><cell>Total</cell><cell>1,505</cell><cell></cell></row></table><note>Left: Aggregation and comparison of scores of human ratings vs. LLM ratings. Right: Deviation of LLM scores from human ratings.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Data statistics and results of Welch's t-test.</figDesc><table><row><cell>Ratings</cell><cell cols="2">Human</cell><cell cols="2">GPT</cell><cell cols="2">Gemini</cell><cell cols="2">Claude</cell></row><row><cell></cell><cell>Triumph</cell><cell>Lupercalia</cell><cell>Triumph</cell><cell>Lupercalia</cell><cell>Triumph</cell><cell>Lupercalia</cell><cell>Triumph</cell><cell>Lupercalia</cell></row><row><cell># of samples</cell><cell>404</cell><cell>140</cell><cell>404</cell><cell>140</cell><cell>404</cell><cell>140</cell><cell>404</cell><cell>140</cell></row><row><cell>Average score</cell><cell>3.46</cell><cell>3.31</cell><cell>4.09</cell><cell>3.68</cell><cell>3.60</cell><cell>3.03</cell><cell>3.91</cell><cell>3.81</cell></row><row><cell>SD</cell><cell>1.12</cell><cell>1.14</cell><cell>0.70</cell><cell>0.82</cell><cell>0.90</cell><cell>0.74</cell><cell>0.46</cell><cell>0.52</cell></row><row><cell>p-value</cell><cell cols="2">0.18</cell><cell cols="2">0.0000002</cell><cell cols="2">0.00004</cell><cell cols="2">0.052</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Inter-annotator agreement between different groups using Krippendorff's alpha, computed on the same 544 images used for the t-test.</figDesc><table><row><cell></cell><cell cols="2">GPT vs. Gemini vs. Claude</cell><cell cols="2">GPT vs. human</cell></row><row><cell></cell><cell>Triumph</cell><cell>Lupercalia</cell><cell>Triumph</cell><cell>Lupercalia</cell></row><row><cell>𝛼</cell><cell>0.079</cell><cell>-0.044</cell><cell>-0.008</cell><cell>-0.005</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">See https://leonardo.ai.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">See https://openai.com/index/openai-api.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">The image generation costs amount to $48.06.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The whole dataset (images and prompts) is available on GitHub. See https://github.com/AncientHistory-UZH/CHR2024_prompt-and-image-dataset.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">See https://prodi.gy.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">See https://www.anthropic.com/news/claude-3-5-sonnet.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">This generated costs of $19.14 for GPT-4o, $2.49 for Gemini and $5.27 for Claude.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">This evaluation scenario cost us $7.69. The whole experiment, i.e., image generation and LLM evaluation in the two scenarios from Section 4.1.3 plus the one mentioned in this section, totalled $80.67.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This research contributes a novel dataset and evaluation framework to the field, enabling future studies. As AI continues to evolve, our work suggests that while it holds promise for enhancing historical visualisation and understanding, it still requires careful human oversight and interpretation.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The Roman Triumph: Participation, Historiography and Remembrance</title>
		<author>
			<persName><forename type="first">A</forename><surname>Algül</surname></persName>
		</author>
		<ptr target="https://www.academia.edu/43295099/The_Roman_Triumph_Participation_Historiography_and_Remembrance" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Claiming Victory: The Early Roman Triumph</title>
		<author>
			<persName><forename type="first">J</forename><surname>Armstrong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Rituals of triumph in the Mediterranean world. Culture and History of the Ancient Near East 63</title>
				<editor>
			<persName><forename type="first">Jeremy</forename><surname>Armstrong</surname></persName>
		</editor>
		<meeting><address><addrLine>Leiden u.a</addrLine></address></meeting>
		<imprint>
			<publisher>Brill</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="7" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">The Roman Triumph</title>
		<author>
			<persName><forename type="first">M</forename><surname>Beard</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
			<publisher>Harvard University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Improving Image Generation with Better Captions</title>
		<author>
			<persName><forename type="first">J</forename><surname>Betker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brooks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guo</surname></persName>
		</author>
		<ptr target="https://cdn.openai.com/papers/dall-e-3.pdf" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Language Models are Few-Shot Learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Garg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Baldridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pont-Tuset</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/2310.18235" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Das Lupercalia-Fest im augusteischen Rom: Performativität, Raum und Zeit</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Š</forename><surname>Erker</surname></persName>
		</author>
		<idno type="DOI">10.1515/9783110208962.2.145</idno>
	</analytic>
	<monogr>
		<title level="j">Archiv für Religionsgeschichte</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="145" to="178" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Exploring the Potentials of Artificial Intelligence Image Generators for Educating the History of Architecture</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Fareed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Nassif</surname></persName>
		</author>
		<author>
			<persName><surname>Nofal</surname></persName>
		</author>
		<idno type="DOI">10.3390/heritage7030081</idno>
	</analytic>
	<monogr>
		<title level="j">Heritage</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="1727" to="1753" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Der römische Triumph in Prinzipat und Spätantike: Probleme -Paradigmen -Perspektiven</title>
		<idno type="DOI">10.1515/9783110448009-003</idno>
	</analytic>
	<monogr>
		<title level="m">Der römische Triumph in Prinzipat und Spätantike</title>
				<editor>
			<persName><forename type="first">F</forename><surname>Goldbeck</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Wienand</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Boston</addrLine></address></meeting>
		<imprint>
			<publisher>De Gruyter</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="26" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Augustus, the Lupercalia and the Roman identity</title>
		<author>
			<persName><forename type="first">D</forename><surname>Guarisco</surname></persName>
		</author>
		<idno type="DOI">10.1556/068.2015.55.1-4.16</idno>
		<ptr target="https://akjournals.com/view/journals/068/55/1-4/article-p223.xml" />
	</analytic>
	<monogr>
		<title level="j">Acta Antiqua Academiae Scientiarum Hungaricae</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="issue">1-4</biblScope>
			<biblScope unit="page" from="223" to="228" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kasai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ostendorf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/2303.11897" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Mock the Triumph: Cassius Dio, Triumph and Triumph-Like Celebrations</title>
		<author>
			<persName><forename type="first">C</forename><surname>Lange</surname></persName>
		</author>
		<idno type="DOI">10.1163/9789004335318_007</idno>
	</analytic>
	<monogr>
		<title level="m">Cassius Dio. Brill&apos;s Historiography of Rome and Its Empire Series</title>
				<imprint>
			<publisher>Brill</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="92" to="114" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The Late Republican Triumph: Continuity and Change</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Lange</surname></persName>
		</author>
		<idno type="DOI">10.1515/9783110448009-004</idno>
		<ptr target="https://doi.org/10.1515/9783110448009-004" />
	</analytic>
	<monogr>
		<title level="m">Der römische Triumph in Prinzipat und Spätantike</title>
				<meeting><address><addrLine>Berlin, Boston</addrLine></address></meeting>
		<imprint>
			<publisher>De Gruyter</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="29" to="58" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The Triumph outside the City: Voices of Protest in the Middle Republic</title>
		<author>
			<persName><forename type="first">C</forename><surname>Lange</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Roman Republican Triumph</title>
				<editor>
			<persName><forename type="first">C</forename><forename type="middle">Hjort</forename><surname>Lange</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Vervaet</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="67" to="81" />
		</imprint>
	</monogr>
	<note>Quasar</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Triumph and Civil War in the Late Republic</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Lange</surname></persName>
		</author>
		<idno type="DOI">10.1017/s0068246213000056</idno>
	</analytic>
	<monogr>
		<title level="j">Papers of the British School at Rome</title>
		<imprint>
			<biblScope unit="volume">81</biblScope>
			<biblScope unit="page" from="67" to="90" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Holistic Evaluation of Text-to-Image Models</title>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yasunaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Narayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Teufel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bellagente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leskovec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F.-F</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ermon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Liang</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2023/file/dd83eada2c3c74db3c7fe1c087513756-Paper-Datasets_and_Benchmarks.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Oh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="69981" to="70011" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Rich Human Feedback for Text-to-Image Generation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Klimovskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Carolan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pont-Tuset</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">D</forename><surname>Dvijotham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Collins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">J</forename><surname>Kohlhoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ramachandran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Navalpakkam</surname></persName>
		</author>
		<ptr target="https://openaccess.thecvf.com/content/CVPR2024/papers/Liang_Rich_Human_Feedback_for_Text-to-Image_Generation_CVPR_2024_paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR</meeting>
		<imprint>
			<biblScope unit="volume">2024</biblScope>
			<biblScope unit="page" from="19401" to="19411" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The Loser&apos;s Prize: Roman Triumphs and Political Strategies during the Mithridatic Wars</title>
		<author>
			<persName><forename type="first">J</forename><surname>Madsen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Analecta Romana Instituti Danici. Supplementa XLV. Quasar</title>
		<imprint>
			<biblScope unit="page" from="117" to="130" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note>The Roman Republican Triumph Beyond the Spectacle</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Die Triumphatordarstellung auf Münzen und Medaillons in Prinzipat und Spätantike</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Mittag</surname></persName>
		</author>
		<idno type="DOI">10.1515/9783110448009-017</idno>
	</analytic>
	<monogr>
		<title level="m">Der römische Triumph in Prinzipat und Spätantike</title>
				<meeting><address><addrLine>Berlin, Boston</addrLine></address></meeting>
		<imprint>
			<publisher>De Gruyter</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="419" to="452" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Caesar at the Lupercalia</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>North</surname></persName>
		</author>
		<idno type="DOI">10.3815/007543508786239210</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Roman Studies</title>
		<imprint>
			<biblScope unit="volume">98</biblScope>
			<biblScope unit="page" from="144" to="160" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Hello GPT-4o</title>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<ptr target="https://openai.com/index/hello-gpt-4o" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Östenberg</surname></persName>
		</author>
		<idno type="DOI">10.1093/acprof:oso/9780199215973.001.0001</idno>
		<title level="m">Staging the World: Spoils, Captives, and Representations in the Roman Triumphal Procession</title>
				<meeting><address><addrLine>Oxford</addrLine></address></meeting>
		<imprint>
			<publisher>Oxford University Press</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Triumph and spectacle. Victory celebrations in the Late Republican civil wars</title>
		<author>
			<persName><forename type="first">I</forename><surname>Östenberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Roman Republican Triumph Beyond the Spectacle</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="181" to="193" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Otani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Togashi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sawai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ishigami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Nakashima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Rahtu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heikkilä</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Satoh</surname></persName>
		</author>
		<ptr target="https://cvpr2023.thecvf.com/virtual/2023/poster/22014" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="14277" to="14286" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">The Architecture of the Roman Triumph: Monuments, Memory, and Identity</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Popkin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context</title>
		<author>
			<persName><forename type="first">M</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Savinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Teplyashin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lepikhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lillicrap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Alayrac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Soricut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lazaridou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schrittwieser</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2403.05530</idno>
		<idno type="arXiv">arXiv:2403.05530</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2403.05530" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Wege des Triumphes. Zum Verlauf der Triumphzüge im spätrepublikanischen und augusteischen Rom</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Schipporeit</surname></persName>
		</author>
		<ptr target="https://zenon.dainst.org/Record/001069375" />
	</analytic>
	<monogr>
		<title level="m">Triplici invectus triumpho : Der römische Triumph in augusteischer Zeit</title>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="95" to="136" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Learning to Summarize with Human Feedback</title>
		<author>
			<persName><forename type="first">N</forename><surname>Stiennon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lowe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Christiano</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/1f89885d556929e98d3ef9b86448f951-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
		<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="3008" to="3021" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">The Lupercalia and the Romulus and Remus Legend</title>
		<author>
			<persName><forename type="first">P</forename><surname>Tennant</surname></persName>
		</author>
		<ptr target="http://www.jstor.org/stable/24591847" />
	</analytic>
	<monogr>
		<title level="j">Acta Classica</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="81" to="93" />
			<date type="published" when="1988">1988</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Gendering the Roman Triumph: Elite Women and the Triumph in the Republic and Early Empire</title>
		<author>
			<persName><forename type="first">L</forename><surname>Webb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Brännstedt</surname></persName>
		</author>
		<idno type="DOI">10.1163/9789004524774_005</idno>
	</analytic>
	<monogr>
		<title level="m">Gendering Roman Imperialism</title>
		<meeting><address><addrLine>Leiden, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>Brill</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="58" to="95" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">The Generalization of Student&apos;s Problem when Several Different Population Variances are Involved</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">L</forename><surname>Welch</surname></persName>
		</author>
		<idno type="DOI">10.1093/biomet/34.1-2.28</idno>
		<ptr target="https://doi.org/10.1093/biomet/34.1-2.28" />
	</analytic>
	<monogr>
		<title level="j">Biometrika</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="28" to="35" />
			<date type="published" when="1947">1947</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Das Angebot des Diadems an Caesar und das Luperkalienproblem</title>
		<author>
			<persName><forename type="first">K.-W</forename><surname>Welwei</surname></persName>
		</author>
		<ptr target="http://www.jstor.org/stable/4434966" />
	</analytic>
	<monogr>
		<title level="j">Historia: Zeitschrift für Alte Geschichte</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="44" to="69" />
			<date type="published" when="1967">1967</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2023/file/33646ef0ed554145eab65f6250fab0c9-Paper-Conference.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Oh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="15903" to="15935" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
