<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Political Bias in Large Language Models: A Case Study on the 2025 German Federal Election</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Buket Kurtulus</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Kruspe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Munich University of Applied Sciences</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>With the increased use of Large Language Models (LLMs) to generate responses to social and political topics, concerns about potential bias have grown. The output of these models can influence social behavior and public discourse, and may affect democratic processes such as national elections. This study evaluates the political alignment of three LLMs (ChatGPT, Grok, and DeepSeek) using the 2025 German Federal Election Wahl-O-Mat as a framework. By comparing model responses to 38 political statements with the official positions of German parties, we assess how different systems align with political identities across the ideological spectrum. We also explore the theoretical foundations of political bias in LLMs, focusing on how prompt language and model characteristics (e.g., scale and regional origin) may influence ideological alignment, and examine relevant ethical considerations. The results reveal a consistent left-leaning tendency across all models, with minimal alignment with far-right positions, largely independent of prompt language. By combining empirical findings with existing theoretical perspectives, this work contributes to a deeper understanding of political bias in LLMs and highlights the importance of transparency in their public use.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Political Bias</kwd>
        <kwd>Algorithmic bias</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Artificial intelligence (AI) systems increasingly mediate how citizens access and discuss political
information, raising both ethical and technical concerns about whose perspectives these systems surface
and privilege [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. Among these systems, Large Language Models (LLMs) have become ubiquitous
tools for drafting, summarizing, and answering open-ended questions in public-facing settings [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ].
Usage is massive while verification is sporadic: recent figures indicate rapid growth in interactions and
comparatively low rates of fact-checking among German users, echoing worries about over-trust in
model outputs [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. If LLMs manifest systematic political leanings, such scale can subtly shape issue
salience and party perceptions.
      </p>
      <p>
        Emerging studies report detectable ideological tendencies in several LLMs, often with liberal or
left-libertarian patterns [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, most evaluations are US-centric; moreover, many analyses emphasize
high-level model mechanics rather than measurement design in an electoral context. We address this
gap by auditing three widely used systems - ChatGPT, Grok, and DeepSeek - against a standard German
voting aid.
      </p>
      <p>Our study uses the 38 statements of the Wahl-O-Mat, a widely used voter decision aid, for the 2025
federal election as a nationally grounded instrument. Each statement is posed in German and English
with a constrained response set (Agree/Neutral/Disagree); we aggregate 100 stochastic runs per item
and compute agreement with official party positions. We visualize response structure (heatmaps, PCA),
examine refusal and variance patterns, and add a concise primer on Germany’s political system to aid
interpretation for non-specialists.</p>
      <p>
        Contributions. (i) A Germany-focused, bilingual audit protocol using a civic, election-proximal
instrument; (ii) a comparative evaluation of ChatGPT, Grok, and DeepSeek with simple, reproducible
agreement metrics; (iii) analysis of refusal behavior and within-item variability; and (iv) an
ethics-oriented discussion of transparency and public use. We keep model-mechanics exposition minimal,
pointing to prior work for background [
        <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
        ], and release code and prompts for replication.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>This section outlines the theoretical background for understanding how bias arises in large language
models (LLMs). We first give a short overview of LLM architecture and training, then discuss how bias
can emerge—with emphasis on political bias. We also include a brief primer on the German political
system to frame our results.</p>
      <sec id="sec-2-1">
        <title>2.1. Large Language Models</title>
        <p>
          Artificial Intelligence (AI) refers to systems that perform tasks associated with human intelligence, such
as reasoning, learning, and language understanding. The term dates to the Dartmouth Conference of
1956, but recent advances have brought AI into broad public use [
          <xref ref-type="bibr" rid="ref1 ref8 ref9">1, 8, 9</xref>
          ]. One prominent development
is Generative AI (GenAI), which can create new content (text, images, code). LLMs are a subset trained
to understand and produce human-like text; examples include Copilot and GPT-5. They support tasks
such as answering questions, translating languages, and generating code with high fluency [
          <xref ref-type="bibr" rid="ref1 ref8 ref9">1, 8, 9</xref>
          ].
        </p>
        <p>
          LLMs are typically built on the transformer architecture, which processes input by attending to
diferent parts of the sequence and then generates output token by token. Functionally, models encode
the input into high-dimensional representations that capture semantic and contextual relations, and
then decode to produce text. Output quality depends on multiple factors, including prompt formulation,
decoding strategy, model hyperparameters, and—crucially—the scope and composition of training and
fine-tuning data [
          <xref ref-type="bibr" rid="ref1 ref9 ref10">1, 9, 10</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-1b">
        <title>2.2. Bias</title>
        <p>
          As LLMs become integrated into public-facing applications, embedded biases raise concerns about
societal and political impact [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Bias can originate from several sources common to machine learning
systems. Because algorithms are developed by humans and trained on historical data, they may reflect
and amplify existing patterns [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Training corpora drawn from search engines, social media, and
digitized texts often contain prejudices and imbalances; models trained on such data tend to reproduce
them [
          <xref ref-type="bibr" rid="ref1 ref11">1, 11</xref>
          ]. Analyses of recent models (including GPT-4) show that LLMs frequently replicate biases
present in their data [
          <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
          ]. Studies further emphasize that political and social biases are shaped
not only by data but also by modeling choices and optimization procedures [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          Beyond data and algorithms, deployment context and ethical oversight matter. Design
decisions—especially during fine-tuning—can unintentionally encode particular normative viewpoints
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Depending on the environment or public-service setting, outputs may align with prevailing
narratives or amplify specific perspectives [
          <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
          ]. Without robust governance, transparency in development,
and ongoing bias audits, LLMs risk deepening inequalities and undermining public trust.
        </p>
        <p>
          In short, bias in LLMs is not simply a training-data problem but the result of intersecting factors:
data, design, context, and governance [
          <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
          ]. Mitigation therefore combines technical and organizational
measures. On the technical side, strategies include regular bias audits, periodic updates, and adversarial
or counterfactual training to surface and correct unwanted behaviors [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. For instance, [
          <xref ref-type="bibr" rid="ref18 ref19 ref20">18, 19, 20</xref>
          ]
propose multilayered approaches that integrate audits, transparency reports, and debiasing algorithms.
On the organizational side, transparent documentation of data and model behavior, along with clear
governance frameworks, is essential [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Political Bias</title>
        <p>
          Political bias is a specific form of model bias: a systematic tendency to favor certain ideological positions
in outputs. Evidence of such tendencies has been reported for several LLMs (e.g., group-related biases
in GPT-3; centrist tendencies in Google Gemini) and can be reinforced by user interaction, especially
when prompts introduce ideologically charged framing that the model mirrors [
          <xref ref-type="bibr" rid="ref1 ref13 ref21">1, 13, 21</xref>
          ].
        </p>
        <p>
          In democratic contexts, political bias is particularly consequential. Algorithmic predictions and
generated content can reflect the interests or perspectives of those who design or deploy the systems
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. As LLMs are increasingly used in public-facing settings, their potential to shape opinion, influence
behavior, and affect electoral processes raises serious ethical concerns, including the spread of
misinformation and propaganda [
          <xref ref-type="bibr" rid="ref1 ref22 ref23">1, 22, 23</xref>
          ]. More broadly, biased algorithms can intensify social injustices and
erode democratic norms, underscoring the need for accountability and ethically grounded responses
[
          <xref ref-type="bibr" rid="ref16 ref24">16, 24</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. German Political System and Parties</title>
        <p>Germany is a federal parliamentary democracy. The Bundestag (federal parliament) is elected under
mixed-member proportional representation: voters cast one constituency vote for a district candidate
and one party-list vote that determines proportional seat shares. A 5% national threshold (or sufficient
direct mandates) is required for list representation. Coalition governments are typical. Competences
are shared between federation and states (Länder); the Basic Law (Grundgesetz) is the constitution.
Nationwide referendums are not a standard federal instrument.</p>
        <sec id="sec-2-3-1">
          <title>Major parties (alphabetical).</title>
          <p>AfD (Alternative für Deutschland): national-conservative/right-wing populist; positions emphasize
national sovereignty, restrictive migration policy, and skepticism toward aspects of EU integration and
climate policy.</p>
          <p>Bündnis 90/Die Grünen: ecological and progressive; prioritize climate protection, social liberalism,
and European integration. (In graphs, we shorten the name to GRUENE).</p>
          <p>CDU/CSU (Christian Democrats/Christian Social Union): center-right Christian-democratic alliance;
social market economy, incremental climate policy, broadly pro-EU; CSU operates only in Bavaria.</p>
          <p>Die Linke: democratic-socialist; redistribution, public services, social rights; generally critical of military
engagements.</p>
          <p>FDP (Free Democrats): classical-liberal; market-oriented reforms, individual liberties, fiscal restraint,
pro-competition.</p>
          <p>SPD (Social Democrats): center-left; welfare-state orientation, labor rights, negotiated socio-ecological
transition, pro-EU.</p>
          <p>Reading alignments. Under proportional representation and routine coalition-building, parties
tend to align along two broad axes: (i) economic policy (redistribution vs. market liberalism) and
(ii) socio-cultural policy (liberal–cosmopolitan vs. conservative–sovereigntist), with an additional
European integration dimension. We use these coarse orientations to interpret agreement patterns and
low-dimensional structure in our results.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>With the growing influence of LLMs, understanding political bias has become a critical research area. We
summarize three case studies that examine political bias in LLMs from complementary methodological
perspectives.</p>
      <p>
        Rettenberger et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] investigate political bias in open-source LLMs (LLaMA-2/3, Mistral-7B) for the
2024 European Parliament election from a German voter’s perspective using the Wahl-O-Mat in German
and English. They constrain outputs to single-word labels (“Ja/Neutral/Nein”) via an End-of-Input
prompt to suppress evasions/refusals and analyze inter-model variability with Kruskal–Wallis tests and
post hoc Dunn comparisons. Larger models (e.g., LLaMA3-70B) align more with left-leaning parties such
as Bündnis 90/Die Grünen, Die Linke, and Volt, with consistently low agreement for the AfD; German
prompts elicit clearer stances than English. Our study complements this by shifting to the 2025 federal
party set and auditing closed-source and non-Western systems (ChatGPT, Grok, DeepSeek); instead of
suppressing abstention, we measure neutrality and refusals explicitly (counting refusals as mismatches),
run 100 stochastic repetitions per item, and examine response structure via PCA.
      </p>
      <p>
        Choudhary [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] presents a comprehensive analysis of political bias in four popular LLMs: ChatGPT-4,
Perplexity, Google Gemini, and Claude. The authors combine quantitative and qualitative methods,
testing models with three political typology tools—the Pew Research Center’s Political Typology Quiz,
the Political Compass Assessment, and the ISideWith Political Party Quiz. Each model is prompted
with the same questions; responses are standardized and placed on an ideological scale ranging from
“strongly conservative” (Faith and Flag Conservatives) to “strongly liberal” (Progressive Left).
ChatGPT-4 consistently displays liberal tendencies, particularly on social and economic issues, and is classified
on Pew as “Establishment Liberal,” a group comprising 13% of the US public (Figure 1). Perplexity also
leans left overall but shows more conservative tendencies on selected issues, leading to a categorization
as “Outsider Left” (10%). In contrast, Claude and Google Gemini are more centrist, adopting neutral or
moderate stances.
      </p>
      <p>
        Yang et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] conduct a large-scale study evaluating 43 LLMs from 19 model families across four
regions (US, Europe, Asia, Middle East) to assess political bias on a US-centric testing ground while
accounting for model characteristics such as scale, release date, and geographic origin. The corpus spans
open- and closed-source models of various sizes. The authors select 32 politically themed questions
from the American National Election Studies (ANES) and the 2024 Pew Research Center survey, grouped
into eight topics: four highly polarized (presidential elections, abortion, immigration, issue ownership)
and four less polarized (climate change, misinformation, discrimination, foreign policy). A two-step
prompting framework elicits answers to sensitive questions while navigating safety filters. Responses
are analyzed via a preference-scoring scheme (positive = Democratic-leaning; negative =
Republican-leaning). Most models exhibit a left-leaning bias, especially on highly polarized topics. Using the 2024
US presidential election as a benchmark, 76% of models express a stronger preference for the Democratic
candidates (Joe Biden or Kamala Harris), with 35% consistently favoring them. Bias is less distinct on
less polarized topics.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <sec id="sec-4-1">
        <title>4.1. Wahl-O-Mat</title>
        <p>
          The Wahl-O-Mat is a digital tool that helps voters in Germany evaluate how well political parties align
with their personal views. It is typically released ahead of significant elections, such as the 2025 federal
election. Users are presented with 38 political statements spanning topics such as energy, environmental,
and migration policy [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. For each statement, they indicate whether they agree, disagree, or are neutral.
Figure 2 provides a visual example of the interface for readers unfamiliar with the Wahl-O-Mat. After
answering all items, users may prioritize selected statements, which then receive double weight.
        </p>
        <p>
          Once the questionnaire is completed (including any prioritization), the system compares answers
with official party positions and returns percentage match scores for each party. Developed by the
Federal Agency for Civic Education (Bundeszentrale für politische Bildung, bpb), the Wahl-O-Mat is
widely used and regarded as a key voter-information tool; for the most recent federal election it was
accessed over 26 million times [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Large Language Models</title>
        <p>We evaluate three LLMs to enable a geographically and structurally diverse comparison: OpenAI’s
ChatGPT, xAI’s Grok, and DeepSeek. These systems differ in training origin, scale, and integration into
public platforms.</p>
        <p>
          ChatGPT (gpt-3.5-turbo). Selected due to its widespread use and global influence. Developed by
OpenAI, it is primarily trained on English-language data and is widely deployed in consumer and
enterprise settings. We accessed the model via the official OpenAI API [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].
        </p>
        <p>DeepSeek (deepseek-chat). Developed in China, representing a newer presence in the global LLM
landscape and of interest due to a distinct linguistic and sociopolitical development context. We accessed
the model via its official API [28].</p>
        <p>Grok (grok-3-mini). Developed by xAI, closely integrated with the social media platform X (formerly
Twitter) and frequently described as conversational in style. xAI leadership has stated that they were
aiming to make the model less “woke”, i.e. more conservative [29]. We accessed the model through the
official xAI platform [30].</p>
        <p>Together, these models provide a diverse basis for examining potential political bias.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Experimental Setup</title>
        <p>To assess potential political bias, we constructed a structured evaluation pipeline using the Wahl-O-Mat
statements as input. The original items (published in German) were translated into English by the
author to enable cross-linguistic comparison; the full list appears in Appendix A. Each statement was
inserted into a standardized prompt template designed to minimize ambiguity and enforce a uniform
response format (agree/neutral/disagree).</p>
        <p>We queried all three models in both German and English. To account for stochasticity, each model was
prompted 100 times per question. Although the temperature parameter was set to 0, repeated runs were
conducted to capture nondeterminism. The choice of 100 repetitions balances computational feasibility
with the need to capture variation in model behavior, allowing us to observe response consistency
across repeated prompts.</p>
        <p>For quantitative analysis, textual outputs were mapped to numeric codes: disagree=0, neutral=1,
agree=2. Data processing and analysis were conducted in Jupyter notebooks. This pipeline supports
multiple comparisons, including agreement scoring against party positions and Principal Component
Analysis (PCA) to explore ideological proximity and language effects.</p>
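        <p>The label-to-code mapping described above can be sketched as follows; this is an illustrative reconstruction rather than the study’s released code, and the exact string normalization is our assumption:</p>

```python
# Map a raw model reply to the numeric codes used in the analysis:
# disagree = 0, neutral = 1, agree = 2; unrecognized output counts as a refusal (-1).
# The normalization rules (lowercasing, stripping a trailing period) are assumptions.
LABELS = {
    "disagree": 0, "stimme nicht zu": 0,
    "neutral": 1,
    "agree": 2, "stimme zu": 2,
}

def encode_response(reply: str) -> int:
    """Return the numeric code for one model reply, or -1 for refusals/noise."""
    text = reply.strip().lower().rstrip(".")
    return LABELS.get(text, -1)

print(encode_response("Stimme nicht zu"))  # 0
print(encode_response("Neutral."))         # 1
print(encode_response("I cannot answer"))  # -1
```

        <p>Both the German and the English labels map onto the same codes, which makes the cross-language comparison direct.</p>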
        <p>To ensure consistency, all models received the same structured prompt in each language. Because
the official Wahl-O-Mat is formulated in German and our study adopts a German voter perspective, we
evaluated the original German phrasing; given that many LLMs are primarily trained and fine-tuned on
English, we repeated the full evaluation in English to probe potential language effects. Thus, all models
were evaluated in both German and English.</p>
        <p>The system prompts were as follows:</p>
        <p>German prompt:</p>
        <p>Bitte gib deine Haltung zu folgender Aussage an. Wähle nur eine der folgenden Optionen:
Stimme zu, Neutral, Stimme nicht zu.</p>
        <p>Antworte bitte nur mit einer der Optionen.</p>
        <p>Aussage: {statement}</p>
        <p>English prompt:</p>
        <p>Please indicate your preference regarding the following statement. Choose one of the
following options: Agree, Neutral, Disagree.</p>
        <p>Please respond with only one of the options.</p>
        <p>Statement: {statement}</p>
        <p>Each prompt was populated with one of the 38 Wahl-O-Mat statements (denoted as statement).</p>
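        <p>The repeated-prompting loop can be sketched as below; <code>query_model</code> is a stand-in for the respective vendor API call (stubbed here), and the statement text is a placeholder rather than an actual Wahl-O-Mat item:</p>

```python
# Repeated prompting: 100 runs per statement at temperature 0, since even
# nominally deterministic settings showed residual nondeterminism.
PROMPT_EN = (
    "Please indicate your preference regarding the following statement. "
    "Choose one of the following options: Agree, Neutral, Disagree.\n\n"
    "Please respond with only one of the options.\n\n"
    "Statement: {statement}"
)

def query_model(prompt: str) -> str:
    # Stub: a real run would call the OpenAI/xAI/DeepSeek API here.
    return "Neutral"

def collect_runs(statements, n_runs=100):
    """Collect the raw replies for each statement over n_runs repetitions."""
    results = {}
    for statement in statements:
        prompt = PROMPT_EN.format(statement=statement)
        results[statement] = [query_model(prompt) for _ in range(n_runs)]
    return results

runs = collect_runs(["<example statement>"], n_runs=3)
print(len(runs["<example statement>"]))  # 3
```

        <p>The German run uses the same loop with the German template; only the prompt text changes.</p>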
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Evaluation</title>
        <p>We compared model-generated responses with official party positions obtained from the Wahl-O-Mat.
All evaluation steps were conducted separately for German and English outputs.</p>
        <sec id="sec-4-4-1">
          <title>4.4.1. Response Aggregation</title>
          <p>As noted above, each model was prompted 100 times per question. Although this repetition count is
not derived from formal statistical criteria, it offers a practical trade-off between variability capture and
runtime cost. The final score for each question is the mean of the numeric encodings over the 100 runs.
This produces a continuous value between 0 and 2 that reflects the model's average tendency toward
agreement. This representation serves as a descriptive aggregate of categorical outputs rather than
assuming a strictly ordinal scale: 0 = Disagree, 1 = Neutral, 2 = Agree.</p>
          <p>In cases of persistent refusals or nonsensical outputs, responses were encoded as -1.
Agreement is computed over all 38 items; refusals (-1) are treated as mismatches.</p>
          <p>In contrast to some prior studies reporting stronger language effects, our models produced similar
aggregates in German and English. This may indicate cross-linguistic robustness; alternatively, it could
reflect limitations in prompt design or sensitivity of the evaluation method.</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>4.4.2. Further Analysis</title>
          <p>Beyond agreement scores, we ranked each LLM by its highest-matching party to provide a concise
orientation signal. To explore broader structure, we applied PCA to model and party response vectors.
Finally, we identified potentially controversial topics by examining per-question variability (standard
deviation across runs) and refusal frequency, flagging items with unusually high disagreement or
non-answers.</p>
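          <p>The PCA step can be sketched with a plain SVD; rows are actors (models and parties), columns the 38 items, and the small matrix below is purely illustrative:</p>

```python
import numpy as np

def pca_2d(X):
    """Project each row of X onto the first two principal components.
    Columns (items) are mean-centered first, as is standard for PCA."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Toy example: 3 actors x 5 items with stances encoded 0/1/2.
X = np.array([[2, 2, 0, 1, 2],
              [2, 1, 0, 1, 2],
              [0, 0, 2, 2, 0]], dtype=float)
coords = pca_2d(X)
print(coords.shape)  # (3, 2)
```

          <p>Plotting these two coordinates for all models and parties produces the arrangement shown in Figure 6.</p>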
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Party agreement</title>
        <p>For visual inspection, we assemble model outputs into a matrix and color-code labels (red = Disagree,
yellow = Neutral, green = Agree; where present, -1 refusals appear in gray). Figure 3 shows model
responses; Figure 4 shows party responses.</p>
        <p>Figure 5 reports agreement scores between LLMs and parties for English- and German-prompted
outputs. Unless noted, agreement is computed over all 38 items and refusals (-1) count as mismatches.
A consistent pattern emerges across all systems: the lowest alignment is always with AfD (about 0.11 in
English—roughly 4/38 matches—and about 0.26 in German), while the highest cells are with Bündnis
90/Die Grünen and SPD. At the same time, the top–second gaps are modest, indicating leaning rather
than strong partisanship.</p>
        <p>English prompts. ChatGPT aligns most with Bündnis 90/Die Grünen and Die Linke (both 0.58), SPD
at 0.55, and lowest with AfD (0.11). DeepSeek shows SPD 0.42, Bündnis 90/Die Grünen 0.39, Die Linke
0.37, AfD 0.11. Grok has the strongest English-only alignments overall—Bündnis 90/Die Grünen 0.66,
SPD 0.63—with a comparatively higher CDU/CSU score (0.37) than the other English runs, and AfD
again at 0.11.</p>
        <p>German prompts. Prompting in German produces a clear, broad-based uplift across all parties.
ChatGPT : Bündnis 90/Die Grünen 0.58→0.84, SPD 0.55→0.76, CDU/CSU 0.26→0.45, FDP 0.34→0.45,
AfD 0.11→0.26, Die Linke 0.58→0.66. DeepSeek: SPD 0.42→0.71, Bündnis 90/Die Grünen 0.39→0.63,
CDU/CSU 0.24→0.50, Die Linke 0.37→0.61, AfD 0.11→0.26 (with FDP rising slightly: 0.26→0.29). This
supports the language-effects observation: German anchoring increases apparent agreement and moves
models closer to mainstream party positions, not only to left-leaning ones.</p>
        <p>Selectivity. Orientation is present but not highly selective. For example, ChatGPT (German): Bündnis
90/Die Grünen 0.84 vs. SPD 0.76 (∆ = 0.08); DeepSeek (German): SPD 0.71 vs. Bündnis 90/Die Grünen 0.63
(∆ = 0.08); Grok (English): Bündnis 90/Die Grünen 0.66 vs. SPD 0.63. These small separations indicate
systematic leaning rather than strong partisan alignment.</p>
        <p>Hedging and refusals. The model heatmap contains more yellow cells than the party heatmap,
consistent with greater use of Neutral. Refusals (-1) are concentrated in Grok-English on a small set
of items (visible as gray cells), which helps explain slightly lower agreements for Grok when refusals
count as mismatches.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Dimensionality reduction</title>
        <p>Figure 6 shows a two-dimensional PCA over all 38 questions for models and parties. The arrangement
corroborates the agreement analysis while revealing additional structure. Bündnis 90/Die Grünen,
SPD, and Die Linke cluster together; CDU/CSU and AfD lie apart along the principal spectrum, and
the economically liberal FDP is clearly isolated. All LLM points lie closest to the center-left cluster
but occupy a distinct sector—consistent with their higher neutral/abstention rates and suggesting a
secondary dimension (e.g., caution/safety or “consensus” tendency) that separates models from parties.
The German-prompted runs shift closer to the party manifold, in line with the agreement uplift noted above.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Controversial questions</title>
      <p>Finally, we analyzed which of the 38 statements tended to elicit empty, nonsensical, or refused
answers, and which ones produced high disagreement across runs. The full list of statements appears in
Appendix A.</p>
      <p>Grok is the only model with a notable number of non-answers and also shows the highest within-item
disagreement. DeepSeek exhibits the most consistent behavior, with no pronounced standard deviations
across runs. ChatGPT shows elevated variance for a single question in English and for four questions
in German.</p>
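      <p>The per-item screening behind these observations (Section 4.4.2) combines two statistics: run-to-run standard deviation and refusal count. A minimal sketch with toy data echoing the patterns above; the cutoffs are illustrative choices, not the study's exact thresholds:</p>

```python
import statistics

def flag_controversial(runs, std_cutoff=0.5, refusal_cutoff=5):
    """Flag items whose runs show high variability or frequent refusals (-1).
    Both cutoffs are illustrative."""
    flagged = []
    for item, codes in runs.items():
        refusals = codes.count(-1)
        answered = [c for c in codes if c != -1]
        spread = statistics.pstdev(answered) if len(answered) > 1 else 0.0
        if spread > std_cutoff or refusals >= refusal_cutoff:
            flagged.append(item)
    return flagged

runs = {
    "Q10": [-1] * 60 + [1] * 40,   # mostly refusals
    "Q23": [0] * 50 + [2] * 50,    # high run-to-run disagreement
    "Q26": [2] * 100,              # perfectly stable
}
print(flag_controversial(runs))  # ['Q10', 'Q23']
```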
      <p>There is no single statement that consistently causes problems across all models. For ChatGPT, the
more controversial items were Q5 (English) and Q14, Q23, Q21, and Q10 (German). Two of these concern
asylum policy, a central topic in recent German debates; one addresses the Basic Law (constitution)
and its religious invocation. Grok mainly returned non-answers for Q10 (religion in the constitution),
Q23 (asylum seekers), and Q31 (strike rights). Questions leading to high standard deviation for Grok
involved similar themes as well as fiscal topics (tariffs, debt, taxes, pensions, student aid). Notably, some
highly debated public issues (e.g., Q26 on abortion rights) did not appear particularly difficult for the
models in our setup.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Ethical Implications</title>
      <p>
        The growing use of LLMs in politically sensitive contexts presents significant opportunities alongside
nontrivial risks. On the opportunity side, LLMs can support political communication and
decision-making by surfacing public concerns, tracking shifts in sentiment and ideology, and enabling new
perspectives on historical and political texts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. When audited and deployed transparently, such tools
can broaden access to information and help citizens compare arguments efficiently.
      </p>
      <p>
        At the same time, ethical challenges are substantial. As discussed in Section 2.2, biases present in
training data and design pipelines can reappear in outputs, reinforcing stereotypes and inequalities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Empirical evidence underscores this risk: for example, GPT-3 associated “Muslim” with “terrorist” in
23% of test cases and “Jewish” with “money” in 5% [31]. Without critical scrutiny, such associations can
propagate harmful narratives through apparently neutral, authoritative prose.
      </p>
      <p>
        A second concern is susceptibility to manipulation and over-trust. Survey data indicate that a large
share of German users report trusting LLM outputs without systematic verification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Coupled with
model hallucinations and latent leanings, this can shift issue salience or party perceptions in subtle ways.
Experimental work further shows that LLMs can outperform incentivized humans at persuasion in
both truthful and deceptive settings [32], heightening the risk of misinformation and targeted influence
campaigns, particularly around elections [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Privacy and data protection risks also arise. Training and adaptation rely on large-scale corpora
that may include personal or sensitive information; inference-time interactions can reveal political
preferences or profiles. Absent strong governance, this creates potential for unauthorized collection,
leakage, or downstream misuse.</p>
      <p>
        Design directions and safeguards. To mitigate these risks while preserving utility, we outline
practical steps consistent with prior work on accountability and domain adaptation [
        <xref ref-type="bibr" rid="ref1 ref16 ref17 ref24">1, 16, 17, 24</xref>
        ]:
• Transparent audits and documentation. Publish evaluation protocols, datasets (or data
statements), model versions, and known limitations, including language-specific performance notes
and refusal/uncertainty behavior.
• Locale-specific testing. Use nationally grounded instruments (e.g., the Wahl-O-Mat) to audit
stance patterns; report agreement by topic, language, and model, and disclose instability across reruns.
• Multi-model and multi-view presentation. Where feasible, show answers from several systems
side by side or present pro/con rationales, to avoid a single authoritative “voice.”
• Calibrated abstention and uncertainty. Prefer an explicit “cannot answer” with a brief rationale
over confident speculation; surface uncertainty bands (e.g., variability across runs) in user-facing
summaries.
• Democratic safeguards during elections. Time-bound guardrails (rate limits, content provenance
indicators, stricter verification for claims, heightened monitoring of coordinated prompts) reduce
campaign-period risks.
• Privacy-by-design. Minimize retention of interaction logs; avoid storing political inferences;
provide clear user controls and data-use disclosures.
• Governance and oversight. Establish internal review for prompt/policy changes, schedule
periodic re-evaluations, and enable external scrutiny through red-team exercises and bug-bounty-style
reporting.
      </p>
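      <p>
        The “calibrated abstention and uncertainty” point admits a minimal sketch: score each statement by the share of repeated stochastic runs that deviate from the modal answer, and surface that instability alongside the answer itself. The data and helper below are illustrative assumptions, not the study’s pipeline.

```python
# Minimal sketch (hypothetical data): per-statement instability across reruns.
from collections import Counter

# Five hypothetical reruns per Wahl-O-Mat statement; answers are
# "agree", "disagree", or "neutral".
runs = {
    "Q1": ["agree", "agree", "agree", "agree", "agree"],
    "Q2": ["agree", "neutral", "agree", "disagree", "agree"],
    "Q3": ["neutral", "neutral", "neutral", "neutral", "neutral"],
}

def instability(answers):
    """Share of reruns deviating from the modal answer (0.0 = fully stable)."""
    counts = Counter(answers)
    modal_share = counts.most_common(1)[0][1] / len(answers)
    return round(1.0 - modal_share, 2)

report = {q: instability(a) for q, a in runs.items()}
# Q1 and Q3 are stable (0.0); Q2 flips in 2 of 5 reruns (0.4).
```
      </p>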
      <p>
        In sum, these ethical risks highlight why transparency and multi-model safeguards are essential, as
they directly address patterns observed in our analysis. LLMs can support an informed public sphere,
but only when paired with ongoing auditing and privacy-aware governance. Absent these measures,
systematic biases, persuasive capabilities, and user over-trust combine to produce disproportionate
harms, particularly for marginalized groups and during high-stakes democratic events [
        <xref ref-type="bibr" rid="ref1 ref16 ref24">1, 33, 24, 16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>8. Discussion</title>
      <p>Our audit of three widely used LLMs on the 2025 Wahl-O-Mat shows a consistent left-leaning
orientation with the weakest alignment to far-right positions. The pattern is qualitatively stable across
languages, while German prompts systematically nudge models closer to party positions. Selectivity
remains modest (orientation rather than hard partisanship), and the models make frequent use of the
Neutral option. Refusals and higher within-item variance cluster on a narrow set of themes: constitutional
symbolism (religious invocation), asylum/migration, and strike rights. This concentrates caution where
public debate in Germany is most polarized, while other high-salience issues pose fewer difficulties in
our setup. PCA complements the agreement tables: parties arrange along familiar axes, while models sit
in a distinct nearby sector rather than inside any party cluster. We read this as a secondary
“caution/consensus” dimension (greater neutrality/abstention and safety-driven hedging) that is separable
from the classic left–right and market–redistribution spectra.</p>
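      <p>
        The PCA reading above can be illustrated with a dependency-free sketch: center a stance matrix (rows = actors, columns = statements; +1 agree, 0 neutral, -1 disagree), extract the first principal axis by power iteration, and compare projections. The matrix below is toy data, not the audit results.

```python
# Minimal sketch: first principal axis of centered stance vectors via power
# iteration on X^T X (pure Python). All stances below are hypothetical.

def transpose(m):
    return [list(col) for col in zip(*m)]

def center(m):
    """Subtract the column (per-statement) mean from every entry."""
    means = [sum(col) / len(col) for col in transpose(m)]
    return [[x - mu for x, mu in zip(row, means)] for row in m]

def principal_axis(m, iters=100):
    """Leading eigenvector of X^T X, approximated by power iteration."""
    cols = transpose(m)
    v = [1.0] * len(cols)
    for _ in range(iters):
        xv = [sum(a * b for a, b in zip(row, v)) for row in m]    # X v
        w = [sum(a * b for a, b in zip(col, xv)) for col in cols] # X^T (X v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

stances = [
    [ 1,  1, -1,  1],   # hypothetical left-leaning party
    [ 1,  0, -1,  1],   # hypothetical model: similar stances, more neutral
    [-1, -1,  1, -1],   # hypothetical right-leaning party
]
X = center(stances)
axis = principal_axis(X)
coords = [sum(a * b for a, b in zip(row, axis)) for row in X]
# On this axis the model projects close to the left party and far from the
# right party, while its extra neutrality keeps it outside the party cluster.
```
      </p>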
      <p>
        Origin-sensitive divergence (Q26). For Q26 (abortion), the models converge on positions that
track Anglophone/U.S. discourse more closely than prevailing German party and public stances [34].
We view this less as an issue-specific anomaly and more as a provenance effect: training corpora and
preference-tuning pipelines are heavily Anglophone, and moderation norms are often calibrated to U.S.
cultural/legal baselines. In line with prior evidence on geography- and culture-linked biases in LLMs
[
        <xref ref-type="bibr" rid="ref1 ref14">14, 35, 1</xref>
        ], this suggests that model origin and data geography can imprint issue framing that remains
visible even when overall alignment is highest with German center-left parties. Likely drivers include
(i) training-data geography and outlet mix; (ii) preference optimization shaped by English-language
moderation norms; and (iii) system-specific safety policies. The consistent German-prompt uplift
suggests light “language anchoring,” where phrasing closer to domestic discourse reduces distance to
party stances across the board, not only on the left.
      </p>
      <p>
        Reading results through German milieus. The Sinus-Milieus are a sociological segmentation
developed by the SINUS-Institut (now SINUS Sociovision) that clusters the German population along
two axes: (i) social status (education/income/occupation; vertical) and (ii) basic values (tradition →
modernization/individualization → re-orientation; horizontal). The typology is empirically updated
via representative surveys and qualitative studies and is widely used in political analysis and media
planning. As Figure 7 illustrates, our LLM outputs most closely resemble orientations typical of the
“Performer,” “Expeditive,” “Neo-Ecological,” and partly “Post-Materialist” milieus. Crucially, these milieus
together account for a comparatively large share of the German population (about 28-40% in recent
waves [36]); hence, positions that might be labeled “center-left” in a U.S. typology are closer to the
German mainstream. This helps reconcile why models trained largely on Anglophone sources can still
align well with German party positions in a proportional, coalition-oriented system. By contrast, Pew’s
“Establishment Liberal” segment comprises only 13% in the U.S. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Positioning relative to prior work. Our findings echo earlier Wahl-O-Mat-based audits that report
stronger alignment with left-leaning parties and the weakest matches with AfD [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and they are consistent
with U.S. typology studies showing liberal tendencies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Relative to [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we extend the picture by (i)
focusing on the 2025 federal party set, (ii) including closed-source and non-Western systems (ChatGPT,
Grok, DeepSeek), and (iii) treating neutrality/refusals and within-item variability as first-class behavioral
signals via repeated stochastic querying, with PCA providing structural context.
      </p>
      <p>Limitations and implications. We study three models and one instrument; newer versions are
out of scope, and one model (Grok) was only evaluated in English due to computational and time
constraints. In the context of the Wahl-O-Mat, where respondents must select agree, disagree, or neutral,
our agreement metric penalizes abstention by counting refusals as mismatches; alternative choices (e.g.,
valid-only denominators or ordinal distances) would shift absolute levels but not the qualitative ordering
we observe. Practically, we recommend that election-adjacent audits report (a) language effects, (b)
refusal/neutrality rates, and (c) topic-wise stability, alongside simple agreement scores.</p>
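      <p>
        The denominator choice affects scores in a predictable way, which a short sketch makes explicit; the stance labels and the encoding of refusals as None are illustrative assumptions, not the study’s implementation.

```python
# Minimal sketch: agreement with refusals counted as mismatches vs. a
# valid-only denominator that drops refused items before scoring.

def agreement(model, party, refusals_as_mismatch=True):
    pairs = list(zip(model, party))
    if refusals_as_mismatch:
        # Refusals (None) can never match a party stance, so they lower the score.
        return sum(m == p for m, p in pairs) / len(pairs)
    valid = [(m, p) for m, p in pairs if m is not None]
    return sum(m == p for m, p in valid) / len(valid)

model = ["agree", None, "disagree", "agree"]   # None encodes a refusal
party = ["agree", "agree", "disagree", "neutral"]

strict = agreement(model, party)                               # 2/4 = 0.5
lenient = agreement(model, party, refusals_as_mismatch=False)  # 2/3
# Absolute levels shift, but comparisons between parties are preserved as long
# as the same convention is applied throughout.
```
      </p>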
    </sec>
    <sec id="sec-9">
      <title>9. Conclusion</title>
      <p>As LLMs enter everyday political information flows, understanding their leanings is essential. Using
Germany’s Wahl-O-Mat, we find (i) consistent left-leaning alignment across ChatGPT, DeepSeek, and
Grok; (ii) the lowest agreement with AfD; (iii) broadly similar English/German patterns with a clear
German-prompt uplift across all parties; (iv) small top–second gaps indicating leaning rather than
strong partisanship; and (v) model-specific response behavior, with Grok showing the most refusals
and all models exhibiting higher neutrality than parties. PCA places models near center-left parties but
in a distinct sector, consistent with a general caution/consensus tendency.</p>
      <p>Future work should probe robustness to paraphrase and register (formal vs. colloquial German),
expand model coverage and versions, and complement exact-match agreement with ordinal distances
and uncertainty estimates. In line with our ethical discussion, we recommend transparent, locale-specific
audits and disclosure of refusal/neutrality patterns to support informed public use.</p>
    </sec>
    <sec>
      <title>10. Code and data availability</title>
      <p>The code and analyzed responses are available at: https://github.com/buket99/political_bias_in_llm.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly to check grammar and spelling.</p>
      <p>[28] DeepSeek AI, DeepSeek API documentation, 2025. URL: https://api-docs.deepseek.com/.
[29] B. Horvath, Grok, Elon Musk’s AI chatbot, seems to get right-wing update, NBC News (2025). URL: https://www.nbcnews.com/tech/elon-musk/grok-elon-musks-ai-chatbot-seems-get-right-wing-update-rcna217306.
[30] xAI, xAI API, 2025. URL: https://x.ai/api, accessed August 28, 2025.
[31] A. Abid, M. Farooqi, J. Zou, Persistent anti-Muslim bias in large language models, in: Proc. 2021 AAAI/ACM Conf. on AI, Ethics, and Society, 2021, pp. 298–306.
[32] P. Schoenegger, F. Salvi, J. Liu, X. Nan, R. Debnath, B. Fasolo, et al., Large language models are more persuasive than incentivized human persuaders, arXiv preprint (2025). arXiv:2505.09662.
[33] H. de Vries, et al., The political impact of text generators: Differential effects on conservatives and liberals, Proceedings of the National Academy of Sciences USA 120 (2023) e2026070119.
[34] Bundesministerium für Familie, Senioren, Frauen und Jugend (BMFSFJ), Meinungsbild zur reproduktiven Selbstbestimmung und Schwangerschaftsabbruch bis zur 12. Woche, Technical Report, BMFSFJ, 2024. URL: https://www.bmfsfj.bund.de/resource/blob/246478/9b685f150c5734ef76efa909234f9285/umfrage-reproduktive-selbstbestimmung-data.pdf, representative population survey conducted in March and April 2024, Germany.
[35] M. Stillman, A. Kruspe, Biased Geolocation in LLMs: Experiments on Probing LLMs for Geographic Knowledge and Reasoning, in: Proceedings of the International Workshop on Geographic Information Extraction from Texts (GeoExT) at ECIR 2025, 2025.
[36] Sinus-Institut, Sinus-Milieus® Germany, Online, 2025. URL: https://www.sinus-institut.de/en/sinus-milieus/sinus-milieus-germany, overview of the ten Sinus-Milieus and their population shares (as of 2021/24).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Choudhary, Political Bias in Large Language Models: A Comparative Analysis of ChatGPT-4, Perplexity, Google Gemini, and Claude, IEEE Access (2024).</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rashed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mahmoud</surname>
          </string-name>
          ,
          <article-title>Cybersecurity in the era of artificial intelligence: Risks and solutions</article-title>
          ,
          <source>in: 2024 ASU Int. Conf. in Emerging Technologies for Sustainability and Intelligent Systems (ICETSIS)</source>
          , IEEE,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Binns</surname>
          </string-name>
          ,
          <article-title>Fairness in machine learning: Lessons from political philosophy</article-title>
          ,
          <source>in: Proc. 1st Conf. Fairness, Accountability and Transparency (FAT)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] K. Al-Mhasneh, R. Alrasheed, A. Al-Arqan, J. Fares, M. Alqahtani, A. Salman, The Role of Artificial Intelligence in Political Analysis and Decision Aid: “Chat GPT Application” as a Model, in: 2024 International Conference on Decision Aid Sciences and Applications (DASA), Manama, Bahrain, 2024, pp. 1–4. doi:10.1109/DASA63652.2024.10836603.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Statista, Umfrage zum Überprüfen von KI-Ergebnissen, 2025. URL: https://de.statista.com/infografik/34419/umfrage-zum-ueberpruefen-von-ki-ergebnissen/.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Statista, Anzahl der Visits pro Monat von chatgpt.com, 2025. URL: https://de.statista.com/statistik/daten/studie/1535435/umfrage/anzahl-der-visits-pro-monat-von-chatgptcom/.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Economic Times, Does ChatGPT suffer from hallucinations? OpenAI CEO Sam Altman admits surprise over users’ blind trust in AI, 2025. URL: https://economictimes.indiatimes.com/magazines/panache/does-chatgpt-suffer-from-hallucinations-openai-ceo-sam-altman-admits-surprise-over-users-blind-trust-in-ai/articleshow/122090109.cms.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. R. Douglas, Large language models, arXiv preprint (2023). arXiv:2307.05782.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Feuerriegel, C. Hartmann, C. Janiesch, et al., Generative AI, Business &amp; Information Systems Engineering 66 (2024) 111–126. doi:10.1007/s12599-023-00834-7.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B. Chen, Z. Zhang, N. Langrené, S. Zhu, Unleashing the potential of prompt engineering for large language models, Patterns 6 (2025). doi:10.1016/j.patter.2025.101260.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. U.</given-names>
            <surname>Noble</surname>
          </string-name>
          , Algorithms of Oppression: How Search Engines Reinforce Racism, New York University Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rettenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reischl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schutera</surname>
          </string-name>
          ,
          <article-title>Assessing political bias in large language models</article-title>
          ,
          <source>Journal of Computational Social Science</source>
          <volume>8</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. Kruspe, Towards detecting unanticipated bias in language models (2024). URL: https://arxiv.org/abs/2404.02650. arXiv:2404.02650.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stillman</surname>
          </string-name>
          ,
          <article-title>Saxony-Anhalt is the Worst: Bias Towards German Federal States in Large Language Models</article-title>
          ,
          <source>in: Proceedings of the German Conference on Artificial Intelligence (KI</source>
          <year>2024</year>
          ),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <article-title>Musical Ethnocentrism in Large Language Models</article-title>
          ,
          <source>in: Proceedings of the NLP4MusA Workshop at ISMIR</source>
          <year>2024</year>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Diakopoulos</surname>
          </string-name>
          ,
          <article-title>Accountability in algorithmic decision making</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>59</volume>
          (
          <year>2016</year>
          )
          <fpage>56</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bai</surname>
          </string-name>
          , et al.,
          <article-title>Adapting large language models for specialized domains: Techniques and challenges</article-title>
          ,
          <source>Journal of AI Research</source>
          <volume>89</volume>
          (
          <year>2024</year>
          )
          <fpage>203</fpage>
          -
          <lpage>219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] B. C. Cheong, Transparency and accountability in AI systems: safeguarding wellbeing in the age of algorithmic decision-making, Frontiers in Human Dynamics 6 (2024). URL: https://www.frontiersin.org/journals/human-dynamics/articles/10.3389/fhumd.2024.1421273. doi:10.3389/fhumd.2024.1421273.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] K. Park, H. Y. Yoon, AI algorithm transparency, pipelines for trust not prisms: mitigating general negative attitudes and enhancing trust toward AI, Humanities and Social Sciences Communications 12 (2025) 1160. URL: https://doi.org/10.1057/s41599-025-05116-z. doi:10.1057/s41599-025-05116-z.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] J. K. Bahangulu, L. Owusu-Berko, Algorithmic bias, data ethics, and governance: Ensuring fairness, transparency and compliance in AI-powered business analytics applications, World Journal of Advanced Research and Reviews 25 (2025) 1746–1763. doi:10.30574/wjarr.2025.25.2.0571.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schramowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Turan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kersting</surname>
          </string-name>
          ,
          <article-title>Large pre-trained language models contain human-like biases of what is right and wrong to do</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>258</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rode-Hasinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>True or False? Detecting False Information on Social Media Using Graph Neural Networks</article-title>
          ,
          <source>in: Proceedings of the Workshop on Noisy User-generated Text (W-NUT) at COLING</source>
          <year>2022</year>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stillman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <article-title>OSINT or BULLSHINT? Exploring Open-Source Intelligence Tweets about the Russo-Ukrainian War (</article-title>
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2508.03599. arXiv:
          <volume>2508</volume>
          .
          <fpage>03599</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>C. O'Neil</surname>
          </string-name>
          , Weapons of Math Destruction:
          <article-title>How Big Data Increases Inequality</article-title>
          and
          <string-name>
            <given-names>Threatens</given-names>
            <surname>Democracy</surname>
          </string-name>
          , Crown Publishing Group,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] K. Yang, H. Li, Y. Chu, Y. Lin, T. Q. Peng, H. Liu, Unpacking Political Bias in Large Language Models: Insights Across Topic Polarization, arXiv preprint (2024). arXiv:2412.16746.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] Bundeszentrale für politische Bildung, Die Geschichte des Wahl-O-Mat, 2025. URL: https://www.bpb.de/themen/wahl-o-mat/326661/die-geschichte-des-wahl-o-mat/.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] OpenAI, OpenAI API, 2023. URL: https://openai.com/api/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>