<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Towards Mechanistic Interpretability for Autoencoder compression of EEG signals</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Leon</forename><surname>Hegedić</surname></persName>
							<email>leon.hegedic@fer.hr</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Electrical Engineering and Computing</orgName>
								<orgName type="institution">University of Zagreb</orgName>
								<address>
									<addrLine>Unska ulica 3</addrLine>
									<settlement>Zagreb</settlement>
									<country>Republic of Croatia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luka</forename><surname>Hobor</surname></persName>
							<email>luka.hobor@fer.hr</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Electrical Engineering and Computing</orgName>
								<orgName type="institution">University of Zagreb</orgName>
								<address>
									<addrLine>Unska ulica 3</addrLine>
									<settlement>Zagreb</settlement>
									<country>Republic of Croatia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nikola</forename><surname>Marić</surname></persName>
							<email>nikola.maric@fer.hr</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Electrical Engineering and Computing</orgName>
								<orgName type="institution">University of Zagreb</orgName>
								<address>
									<addrLine>Unska ulica 3</addrLine>
									<settlement>Zagreb</settlement>
									<country>Republic of Croatia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><forename type="middle">Ante</forename><surname>Rogošić</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Electrical Engineering and Computing</orgName>
								<orgName type="institution">University of Zagreb</orgName>
								<address>
									<addrLine>Unska ulica 3</addrLine>
									<settlement>Zagreb</settlement>
									<country>Republic of Croatia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mario</forename><surname>Brcic</surname></persName>
							<email>mario.brcic@fer.hr</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Electrical Engineering and Computing</orgName>
								<orgName type="institution">University of Zagreb</orgName>
								<address>
									<addrLine>Unska ulica 3</addrLine>
									<settlement>Zagreb</settlement>
									<country>Republic of Croatia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Towards Mechanistic Interpretability for Autoencoder compression of EEG signals</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D02D3211A7E6DB5E4165B89A1281F08B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Mechanistic Interpretability</term>
					<term>EEG</term>
					<term>Convolutional Variational Autoencoder</term>
					<term>Iteratively Shaped Search</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Convolutional Variational Autoencoders (VAEs) have found extensive application in dimensionality reduction, data compressibility, assessment, and signal analysis. However, a comprehensive understanding of their internal mechanisms remains elusive. This study aims to use mechanistic interpretability to elucidate the inner workings of VAEs. By training VAEs on images generated by interpolating EEG signals from the human brain and analyzing the resulting latent space, as well as the signal propagation through the network layers, we aim to construct an explanation of how these specific models internally analyze the generated images. Since we work under hardware constraints, we devised an iterative approach that breaks big task into easier, more manageable steps.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Electroencephalogram (EEG) is a non-invasive tool for measuring brain activity by placing electrodes on the human scalp, which detect neuronal discharge voltage. While EEG technology possesses limitations such as a poor signal-to-noise ratio and capturing only surface brain activity, it remains a reliable method for diagnosing conditions like epilepsy and sleep disorders <ref type="bibr" target="#b0">[1]</ref>. Autoencoders <ref type="bibr" target="#b1">[2]</ref> are a specialized class of neural networks functioning as encoder-decoder pairs. The encoder compresses input data into a condensed representation, known as the latent space, by progressively reducing neuron count across layers, culminating in a bottleneck layer. Conversely, the decoder reconstructs input data from this compressed form by gradually increasing neuron count in subsequent layers. This compression and reconstruction process enables the network to capture salient features of the input data effectively. Convolutional Variational Autoencoders (CVAEs) <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref> extend this framework by incorporating convolutional layers, making them particularly adept at processing image data. Unlike standard autoencoders, CVAEs generate a probabilistic latent space. This probabilistic approach facilitates learning robust features and enhances the model's capability to generate new data instances resembling the training data. Leveraging convolutional layers, CVAEs exploit spatial hierarchies within data, enhancing their ability to analyze and reconstruct complex patterns and textures inherent in image data. Consequently, CVAEs find extensive application in tasks demanding detailed analysis and synthesis of image content, offering significant improvements in both data reconstruction quality and interpretability of learned representations. 
Mechanistic interpretability <ref type="bibr" target="#b4">[5]</ref> involves eliciting a simple algorithm from a learned ML model, under the assumption that there exists a human-understandable algorithm with low complexity that closely approximates the ML model <ref type="bibr" target="#b5">[6]</ref>. This technique enhances the interpretability of ML models and aids in understanding the underlying mechanisms driving their predictions <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>. It originates from the field of AI safety <ref type="bibr" target="#b8">[9]</ref>, but is increasingly finding application across various domains.</p><p>The motivation of this paper is to build upon the work in <ref type="bibr" target="#b9">[10]</ref>, where the authors used a Conditional Variational Autoencoder (CVAE) on EEG data. Our goal is to use the technique of mechanistic interpretability to uncover the basic algorithmic approach learned within the neural network. Given that this technique requires extensive manual work, intuition, hardware, and domain knowledge in neuroscience (the latter two of which we lack), we introduce a stepwise guiding procedure through gradually relaxing constraints. We begin with a highly constrained CVAE designed to plausibly approximate only one known algorithm. Our objective is to understand how the neural network implements this algorithm. Subsequently, we relax these constraints incrementally and monitor how the learned underlying algorithm adapts. We expect that tracking this evolution from a known origin is easier than mechanistically interpreting the unconstrained system outright. This paper is organized as follows: Section 2 reviews related and relevant work. Section 3 presents our hypothesis, describes the stepwise guided approach to mechanistic interpretability, and introduces the tools we developed for this process. In Section 4, we detail our experimental results. 
Finally, Section 5 offers our conclusions and ideas for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Neuroscience is a multidisciplinary field which aims to explain the functioning of the brain and the nervous system in general. The place where neuroscience meets machine learning is in trying to explain the computations which the brain preforms. The authors of <ref type="bibr" target="#b9">[10]</ref> have used CVAEs in order to analyze EEG images of the brain. They took EEG signals from 32 electrodes placed on the human scalp, this way each data point represents a 32 dimensional vector, afterwards they used geometric transformations to project the positions of the electrodes onto a 40 by 40 grid. Finally, they performed cubic interpolation upon the grid which resulted in the images which comprised the dataset. Then they trained the models, one for each person. After training, the models showed significant capabilities in reproducing the images, showing that the data has an underlying structure. <ref type="foot" target="#foot_0">1</ref> They did this in an attempt to filter out spikes of neural activity from blinks, head movement etc. which in this context amount to noise.</p><p>Mechanistic interpretability is an emerging field in machine learning aimed at understanding the algorithms discovered by neural networks to solve specific problems. The papers <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b10">11]</ref> were the first to explore the concept of "grokking" and explaining it, linking it to uncovering the underlying algorithm. Another paper ( <ref type="bibr" target="#b11">[12]</ref>) subsequently demonstrated that models do not consistently uncover the same algorithm. This finding highlights that the algorithm's nature is highly reliant not only on the model architecture and learning process but also on the inductive bias added by the initial chosen weights. 
The most significant results of mechanistic interpretability have been achieved on simple logical tasks using unimodal language models <ref type="bibr" target="#b4">[5]</ref>, e.g. modular arithmetic <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b11">12]</ref> and Othello gameplay <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>. There are attempts at automating mechanistic interpretability, such as automatic circuit detection <ref type="bibr" target="#b14">[15]</ref> and attribution patching <ref type="bibr" target="#b15">[16]</ref>, but this line of work is still in its early stages.</p></div>
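The electrode-to-image pipeline of [10] described above (project 32 electrode positions onto a 40x40 grid, then interpolate cubically) can be sketched as follows. The electrode coordinates here are random placeholders rather than real 10-20 positions, and `scipy.interpolate.griddata` stands in for whatever cubic-interpolation routine the authors used:

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Illustrative 2D positions for 32 electrodes projected onto a 40x40 grid
# (placeholders; the true projected 10-20 coordinates differ).
electrode_xy = rng.uniform(5, 35, size=(32, 2))
electrode_values = rng.normal(size=32)  # one EEG sample, 32 channels

# Target 40x40 grid of pixel coordinates.
gx, gy = np.meshgrid(np.arange(40), np.arange(40))

# Cubic interpolation of the 32 values over the grid; pixels outside the
# convex hull of the electrodes are filled with 0 (non-informative area).
image = griddata(electrode_xy, electrode_values, (gx, gy),
                 method="cubic", fill_value=0.0)
print(image.shape)  # (40, 40)
```

One such image per EEG sample, for each person, would then form the training dataset.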
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Hypotheses, Approach, and Tools</head><p>We build upon the work in <ref type="bibr" target="#b9">[10]</ref> by examining the performance of the Convolutional Variational Autoencoder (CVAE) in reconstructing images. Unlike the previous study, we also aggregate data from multiple individuals. To ensure mechanistic interpretability, we first conduct experiments to get grokking on each model, a process that may require extensive training. The task at hand is significantly more complex than previous problems addressed by mechanistic interpretability. While earlier works focused on discrete tasks such as arithmetic and board-game playing, our challenge involves the continuous approximation of EEG readouts. This necessitates an iterative approach to reduce and constrain the complexity of the analysis. To tackle this, we propose bootstrapping from a simple target algorithm, the pick&amp;interpolate method. This method replicates how the 40x40 EEG images are generated from electrode values. Specifically, we constrain the neural network to select electrode positions from the input image and interpolate these values, mimicking the known method used to produce the input images. This approach allows us to incrementally increase complexity while tracking the underlying algorithm at each step, starting from the initial, known algorithm implemented in a neural manner.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Hypotheses</head><p>After analyzing the underlying mechanisms, we aim to gradually relax constraints and track changes in mechanisms. This iterative process enables us to break down the complex task into simpler steps conducive to analysis with more limited hardware resources. While the original EEG signal comprises 32 dimensions, the interpolation process expands it to 40x40 (1600 dimensions), indicating redundancy beyond the original 32 dimensions. We identify two factors complicating the learned algorithms:</p><p>1. Bottleneck layer size below 32: we simplify by setting the layer size to 32.</p><p>2. Biases of convolutional neural networks: Specifically, the bias away from locating objects.</p><p>To mitigate this bias, we introduce an injection layer at the input to inject helpful artifacts into the image. Additionally, the network architecture may not be powerful or expressive enough to learn the direct interpolation algorithm -we will check this with an experiment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Tools</head><p>The injection layer introduces helpful artifacts into the input image, including reference watermarks and occlusion of non-informative points. We have created 20 reference 2x2-pixel watermarks that uniquely locate 20 out of 32 electrodes. These watermarks are positioned 1px below its pertinent point. The remaining 12 electrodes are at the edge of head, so can be located by the local curvature. One-hot signals are synthetic signals with one electrode set to 1 and others to 0, to investigate signal propagation through the VAE. This is a further simplification from the complex mixtures on which VAE is trained.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Approach</head><p>Our iterative approach involves the following steps:</p><p>E1 Test the decoder with pure electrode signals to assess its capacity to reproduce interpolated images faithfully.</p><p>E2 Utilize a latent layer size of 32 and bias VAE in a way to target approximating pick&amp;interpolate algorithm. We initially mark the inner electrodes and the edges of the skull, then occlude all positions that do not correspond to the marked areas to guide the VAE's focus. Special constant watermarks were added to the inner electrodes to provide the model with relative information about the position of the electrodes. We train this occlusion&amp;mark version of the VAE using original images from DEAP and a loss function that ignores pixels outside the circle. We investigate the signal propagation first with one-hot signals, and if necessary expand to original signals.</p><p>E3 Remove occlusion to allow the VAE more freedom in selecting where to focus, which may steer from picking the electrode pixels. Again, we investigate the signal propagation first with one-hot signals, and if necessary expand to original signals. The focus is on changes with respect from the mechanisms in the previous step.</p><p>E4 Further relax constraints by reducing the latent layer size to 27, as in <ref type="bibr" target="#b9">[10]</ref>, and track differences with respect to previous step.</p><p>E5 Analyze the original setting in <ref type="bibr" target="#b9">[10]</ref> without marking.</p><p>This iterative approach allows us to systematically explore variations and uncover the mechanisms underlying the CVAE's image reconstruction capabilities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiment</head><p>We build upon the basic experimental setup from <ref type="bibr" target="#b9">[10]</ref>. That means we are using publicly available DEAP dataset where EEG data were collected from 32 persons who watched 40 one-minute music video clips <ref type="bibr" target="#b16">[17]</ref>. Therein standard 10-20 systems were applied with the following 32 electrode positions: 'Fp1', 'AF3', 'F7', 'F3', 'FC1', 'FC5', 'T7', 'C3', 'CP1', 'CP5', 'P7', 'P3', 'Pz', 'PO3', 'O1', 'Oz', 'O2', 'PO4', 'P4', 'P8', 'CP6', 'CP2', 'C4', 'T8', 'FC6', 'FC2', 'F4', 'F8', 'AF4', 'Fp2', 'Fz', 'Cz'. We conducted our experiments in the Google Colab environment with V100 GPU. Prior to training, we preprocessed the dataset by clipping values to the lower and upper 5 quantiles and normalizing them to address issues related to blinks. Our implementation was based on the Keras framework, employing the AdamW optimizer with weight decay set to 1, as suggested in <ref type="bibr" target="#b5">[6]</ref>. We initialized the learning rate (LR) to 10 −3 and applied LR reduction on plateau with a patience of 30 epochs and factor of 0.3.</p><p>The VAE encoder architecture consists of three convolutional layers (kernel=4, stride=2), followed by three LeakyReLU layers, and concludes with two fully connected layers, one for the mean and one for the variance of the distribution. The number of filters is set to 32 for the first layer, 64 for the second layer, and 128 for the third layer. Conversely, the decoder comprises one fully connected layer, one reshape layer, and three Conv2DTranspose layers (kernel=4, stride=2). These layers collaborate to reconstruct the images, with a sigmoid function applied at the end to ensure proper output scaling. We have conducted the three initial experimental iterations, with the rest left as continuation. The preliminary results are given below. 
E1 After the initial epoch, the decoder displayed an MSE of 1.803e-04 for the training set and 2.6914e-04 for the validation set, with a slight yet consistent decline thereafter. Upon analyzing the outcomes in Figure 1, it becomes evident that the reconstruction shows promise, exhibiting minor discrepancies around the brighter areas of the image. The metrics for the test set were SSIM = 0.980, MSE = 2.807e-4 and MAE = 0.007. As indicated by both the metrics and Figure 1, the decoder demonstrated high accuracy in reproducing the images, indicating that the architecture possesses ample capacity to learn the interpolation algorithm. This enables our study, which initially targets biasing our model towards the pick&amp;interpolate algorithm.</p></div>
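The clipping-and-normalization preprocessing from the setup above can be sketched as follows. Min-max scaling to [0, 1] (matching the sigmoid output range) is an assumption, as the exact normalization scheme is not specified:

```python
import numpy as np

def preprocess(x, q=0.05):
    # Clip to the lower/upper 5% quantiles (suppresses blink spikes),
    # then min-max normalize to [0, 1] to match the sigmoid output range.
    lo, hi = np.quantile(x, [q, 1.0 - q])
    x = np.clip(x, lo, hi)
    return (x - lo) / (hi - lo)

signal = np.random.default_rng(1).normal(size=10_000)  # toy EEG channel
z = preprocess(signal)
print(z.min(), z.max())  # 0.0 1.0
```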
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E3</head><p>The model exhibited solid performance after several hundred iterations and continued to improve gradually even at the final iteration of 2000. Throughout the iterations, the model consistently maintained a validation loss approximately twice the MSE of the training set. Upon analyzing the materials presented in 3, the image reconstruction appeared nearly flawless, with only minute differences observed around the bright and dark spots, which would not be noticeable without a closer examination. The obtained metrics, though smaller than expected, were as follows: SSIM = 0.899, MSE = 0.004 and MAE = 0.046. This indicates that the model was indeed capable of performing well and reconstructing the image accurately. Subsequently, we proceeded with the analysis using one-hot signals to hypothesize of the algorithm. Upon observing the activations transitioning from layer to layer, we made the following observations:</p><p>1. The encoder, spanning from the 1st to the 3rd convolutional layer, transforms the input image into feature maps. Before reaching the fully connected layer for the latent space, we obtain 128 maps with spatial dimensions of 5x5. Each map is activated to some extent, reflecting the intensity at a relative position of the input signal, akin to resizing the input image to 5x5 and introducing some noise. Despite slight variations, these activation maps serve to accurately encode and differentiate different input signals. Consequently, the information at the output of the encoder comprises two key components: the activation position, providing a rough estimate of the input signal's position, and subtle differences surrounding the activation, offering a finer estimate of which signal is active and to what extent.</p><p>2. 
When mapping into the latent space, two main factors stand out: the sigma, or standard deviation, is consistently very small (1e-7) and therefore insignificant, and the bias of the fully connected layer is also close to zero (around 1e-2). Consequently, the combination of convolutions, LeakyReLU activation and a linear layer with a negligible bias suggests an overall linear transformation.</p><p>3. Activation maps at the deconvolutional layers are very similar, with only small differences in scale. This suggests that the latent space exhibits a high degree of similarity across all inputs, which is something we aimed to avoid in order to simplify the interpretation. Consequently, the decoder relies on minor differences in its initial layers to extract information. Already at the 1st deconvolutional layer, the identity of the signal being processed becomes evident.</p><p>The gap from E1 to E2 could in future be circumvented by stage-wise (instead of end-to-end) training, where we would take the decoder from E1 and teach the encoder to pick points. Another, easier alternative is to initialize the VAE close to the pick&amp;interpolate algorithm and then track the evolution of the underlying algorithm.</p></div>
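The 5x5 spatial size of the 128 encoder feature maps noted in observation 1 follows from three stride-2 convolutions applied to a 40x40 input, assuming 'same'-style padding (the padding mode is not stated explicitly, but this is the only standard choice consistent with 40 reducing to 5 over three stride-2 layers):

```python
import math

def conv_out(size, stride=2):
    # Spatial output size of a stride-2 convolution with 'same'-style
    # padding: ceil(input / stride), independent of kernel size.
    return math.ceil(size / stride)

size = 40  # input image side length
for layer in (1, 2, 3):
    size = conv_out(size)
    print(f"after conv layer {layer}: {size}x{size}")
# 40 -> 20 -> 10 -> 5: the 128 maps before the latent layer are 5x5
```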
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and future work</head><p>Our future work involves finalizing our experimental plan through all its steps, as only the initial three are partially completed. Meanwhile, based on our current experiments, we have gained an understanding and experience with mechanistic interpretability that may also benefit other researchers. Grokking <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b10">11]</ref> may take a long time, if it happens at all, as it depends on hyperparameter values. Whether the model will grok under certain settings is not evident beforehand (maybe even not decidable <ref type="bibr" target="#b17">[18]</ref>), and extensive experimentation is necessary to find suitable values. Training is often unstable, with many oscillations, and sensitivity to hyperparameters is high. For instance, the authors in <ref type="bibr" target="#b5">[6]</ref> could not achieve grokking using L1 normalization. Considering the above, searching for good configurations for grokking is computationally expensive. Due to our modest hardware resources, we adopted an iterative approach. Initially, we shaped and constrained the initial step solution, which we could interpret anchored on the initial target algorithm pick &amp; interpolate. Then, we gradually allowed more freedom to the model until it matched the architecture of interest. We also observed that domain expertise is necessary to facilitate easier interpretability. However, this was a hindrance to our team, as none of us is well-versed in neuroscience. Therefore, we resorted to more abstract and basic algorithmic features. Additionally, previous work in mechanistic interpretability focused on nicely structured domains in arithmetic and logic, while the problem addressed in this paper is qualitatively more challenging. 
Looking further ahead, the automation of mechanistic interpretability is a valuable goal to pursue <ref type="bibr" target="#b4">[5]</ref>. It aims to circumvent the manual work currently performed, which is subject to all human limitations. Such discovery algorithms would mine the inner workings of black-box systems, searching for patterns with low algorithmic complexity and for reformulations resembling code fragments available in public repositories.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The image on the left is the original interpolated image and the image on the right is the output of the decoder.</figDesc><graphic coords="5,193.46,312.92,208.35,97.61" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The image on the left is the original interpolated image and the image on the right is the output of the occluded model.</figDesc><graphic coords="5,193.47,549.18,208.35,93.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The image on the left is again the original EEG topographic map placed into the model and the image on the right is the model output.</figDesc><graphic coords="7,193.46,84.19,208.35,100.17" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">It is important to note that in<ref type="bibr" target="#b9">[10]</ref> the dimensionality of the latent space was set to 27.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Electroencephalography</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Binnie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Prior</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electroencephalography, Journal of Neurology, Neurosurgery and Psychiatry</title>
		<imprint>
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Efficient learning of sparse representations with an energy-based model</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Poultney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chopra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Auto-encoding variational bayes</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Welling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">An overview of variational autoencoders for source separation, finance, and bio-signal applications</title>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ogunfunmi</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>MDPI</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Bereska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gavves</surname></persName>
		</author>
		<idno>arXiv e-prints</idno>
		<ptr target="https://arxiv.org/abs/2404.14082v1.arXiv:2404.14082" />
		<title level="m">Mechanistic interpretability for ai safety: A review</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Nanda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lieberum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<title level="m">Progress measures for grokking via mechanistic interpretability</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Explainable artificial intelligence: A survey</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">K</forename><surname>Došilović</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brčić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hlupić</surname></persName>
		</author>
		<idno type="DOI">10.23919/MIPRO.2018.8400040</idno>
	</analytic>
	<monogr>
		<title level="m">2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics</title>
				<meeting><address><addrLine>MIPRO)</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="210" to="0215" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Explainable artificial intelligence (xai) 2.0: A manifesto of open challenges and interdisciplinary research directions</title>
		<author>
			<persName><forename type="first">L</forename><surname>Longo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brcic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cabitza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Confalonieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Ser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Guidotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hayashi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Herrera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Holzinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Khosravi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Lecue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Malgieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Páez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Samek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Speith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Stumpf</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.inffus.2024.102301</idno>
		<ptr target="https://doi.org/10.1016/j.inffus.2024.102301" />
	</analytic>
	<monogr>
		<title level="j">Information Fusion</title>
		<imprint>
			<biblScope unit="volume">106</biblScope>
			<biblScope unit="page">102301</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">AI safety: state of the field through quantitative lens</title>
		<author>
			<persName><forename type="first">M</forename><surname>Juric</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Šandić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brčić</surname></persName>
		</author>
		<idno type="DOI">10.23919/MIPRO48935.2020.9245153</idno>
	</analytic>
	<monogr>
		<title level="m">2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1254" to="1259" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Interpreting disentangled representations of person-specific convolutional variational autoencoders of spatially preserving EEG topographic maps via clustering and visual plausibility</title>
		<author>
			<persName><forename type="first">T</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Longo</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
			<publisher>MDPI</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon</title>
		<author>
			<persName><forename type="first">V</forename><surname>Thilak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Littwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Saremi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Paiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Susskind</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">The clock and the pizza: Two stories in mechanistic explanation of neural networks</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tegmark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Andreas</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Hopkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Viégas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pfister</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wattenberg</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.13382</idno>
		<title level="m">Emergent world representations: Exploring a sequence model trained on a synthetic task</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qiu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.12201</idno>
		<ptr target="https://arxiv.org/abs/2402.12201" />
		<title level="m">Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on Othello-GPT</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Conmy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Mavor-Parker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lynch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Heimersheim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Garriga-Alonso</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.14997</idno>
		<ptr target="https://arxiv.org/abs/2304.14997" />
		<title level="m">Towards automated circuit discovery for mechanistic interpretability</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Syed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Rager</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Conmy</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2310.10348</idno>
		<idno type="arXiv">arXiv:2310.10348</idno>
		<ptr target="https://arxiv.org/abs/2310.10348" />
		<title level="m">Attribution patching outperforms automated circuit discovery</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">DEAP: A database for emotion analysis; using physiological signals</title>
		<author>
			<persName><forename type="first">S</forename><surname>Koelstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Muhl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Soleymani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-S</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yazdani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ebrahimi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Pun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nijholt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Patras</surname></persName>
		</author>
		<idno type="DOI">10.1109/T-AFFC.2011.15</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="18" to="31" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Impossibility results in AI: A survey</title>
		<author>
			<persName><forename type="first">M</forename><surname>Brčić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">V</forename><surname>Yampolskiy</surname></persName>
		</author>
		<idno type="DOI">10.1145/3603371</idno>
		<ptr target="https://doi.org/10.1145/3603371" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
