<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Filming the sound: Anomaly Detection on Audio Tape Recordings using Computer Vision Algorithms</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Zafer</forename><surname>Çınar</surname></persName>
							<email>zafer.cinar@studenti.unipd.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="laboratory">Centro di Sonologia Computazionale (CSC)</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<addrLine>Via Giovanni Gradenigo, 6b</addrLine>
									<postCode>35131</postCode>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessandro</forename><surname>Russo</surname></persName>
							<email>alessandro.russo@dei.unipd.it</email>
							<idno type="ORCID">0000-0001-6691-759X</idno>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="laboratory">Centro di Sonologia Computazionale (CSC)</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<addrLine>Via Giovanni Gradenigo, 6b</addrLine>
									<postCode>35131</postCode>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Matteo</forename><surname>Spanio</surname></persName>
							<email>spanio@dei.unipd.it</email>
							<idno type="ORCID">0000-0002-2436-7208</idno>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="laboratory">Centro di Sonologia Computazionale (CSC)</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<addrLine>Via Giovanni Gradenigo, 6b</addrLine>
									<postCode>35131</postCode>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Niccolò</forename><surname>Pretto</surname></persName>
							<email>niccolo.pretto@unibz.it</email>
							<idno type="ORCID">0000-0002-3742-7150</idno>
							<affiliation key="aff1">
								<orgName type="department">Faculty of Engineering</orgName>
								<orgName type="laboratory">Media Interaction Lab</orgName>
								<orgName type="institution">Free University of Bozen-Bolzano</orgName>
								<address>
									<addrLine>Via Bruno Buozzi, 1</addrLine>
									<postCode>39100</postCode>
									<settlement>Bozen</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sergio</forename><surname>Canazza</surname></persName>
							<email>sergio.canazza@unipd.it</email>
							<idno type="ORCID">0000-0001-7083-4615</idno>
							<affiliation key="aff0">
								<orgName type="department">Department of Information Engineering</orgName>
								<orgName type="laboratory">Centro di Sonologia Computazionale (CSC)</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<addrLine>Via Giovanni Gradenigo, 6b</addrLine>
									<postCode>35131</postCode>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Filming the sound: Anomaly Detection on Audio Tape Recordings using Computer Vision Algorithms</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0E10AFBF309ADAA0BF605E5C24FF2FA7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>open reel audio tapes</term>
					<term>irregularities detection</term>
					<term>computer vision</term>
					<term>preservation</term>
					<term>restoration</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The preservation of open-reel audio tapes is critical for maintaining valuable cultural and historical audio archives, yet current digitisation and analysis operations are often error-prone due to tape degradation and the long duration of the recordings. Considering the analog nature of this kind of recording, anomaly detection algorithms, applied to the video of the tape flowing on the playback head, can be used to detect errors and details with musicological value. This paper presents a new dataset of high-quality videos and a new algorithm for anomaly detection on audio tapes. Experimental results show notable improvements in detection performance, though false positives remain a challenge at higher speeds. Additionally, the new algorithm supports a wider range of playback speeds, improving its flexibility. This improvement is an important step towards a reliable implementation of the IEEE/MPAI CAE ARP standard (3302-2022).</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Nowadays, audiovisual archives are increasingly facing the challenge of preserving their collections from deterioration <ref type="bibr" target="#b0">[1]</ref>. Digitization is a key solution, converting analog materials like photos, films, videos, and audio recordings into digital formats to mitigate physical degradation. However, the digitization process must be based on a scientific methodology to ensure minimal information loss. The Centro di Sonologia Computazionale (CSC) of the University of Padua has been working on audio document preservation over the last decade <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>, carrying out its research activity to develop a preservation methodology for audio documents <ref type="bibr" target="#b3">[4]</ref>. At CSC, digitization goes beyond simply migrating the audio content; it includes gathering metadata and contextual information, such as photos and video documentation. This approach became essential when working with archives of electronic music composers like Luciano Berio and Luigi Nono, who left markings and notes on tapes, also as indications for live performances when tapes were used almost like instruments on stage. Furthermore, the video documentation captures important information about tape conditions and mechanical issues that may affect playback, such as dirt, loss of magnetic paste, or deformations. Since reviewing hours of video can be time-consuming and prone to errors, artificial intelligence may assist archivists and researchers by automatically detecting points of interest on the surface of the tapes and, therefore, the corresponding moment on the digitized audio recording. The methodology proposed by CSC has been the core reference during the implementation of the IEEE/MPAI CAE ARP standard, approved in December 2022<ref type="foot" target="#foot_0">1</ref>. 
The implementation of a standard is often a long-term process that requires continuous updates and refinements to accommodate evolving needs and technological advancements. In the case of the IEEE/MPAI CAE ARP standard, the development of effective tools has required several iterations of improvements since its first version <ref type="bibr" target="#b4">[5]</ref>. As new technologies emerge, software must be re-evaluated and enhanced to address previous limitations, improve overall performance, and meet the evolving expectations of both the archival and technical communities. This is the case of the Video Analyzer, a component described in the standard and developed to detect anomalies (also referred to as irregularities), such as splices, tape degradation, or notes on the surface of the digitized tapes. This module analyzes a video framing a close-up of the tape flowing in front of the magnetic head during the digitization process. This paper contributes to this ongoing effort by presenting a series of significant improvements to the anomaly detection process for open-reel audio tapes. The key contributions of this work include:</p><p>1. a new dataset of high-quality videos that can be used for testing new algorithms; 2. a new detection algorithm that shows a strong improvement over the initial algorithm implemented in the Video Analyzer component; 3. an extension of the playback speeds supported by the algorithm.</p><p>These improvements result in a more robust and accurate system for identifying irregularities on audio tapes, which will improve the reliability of the overall implementation of the IEEE/MPAI CAE ARP standard and, therefore, foster correct preservation and reduce restoration and analysis efforts. The next section provides an overview of the background and related work underpinning this paper. Section 3 provides more information about the standard. Then, the proposed algorithm is described in Section 4. 
The experiment and dataset used to test the algorithm are reported in Section 5, along with its performance compared with the previous version. Finally, Section 6 concludes the article with a discussion of the results and further opportunities for development.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head><p>Anomaly detection has been a long-standing yet active research area across various research communities for several decades <ref type="bibr" target="#b5">[6]</ref>. Traditional methods often rely on machine learning models, such as convolutional neural networks (CNNs) and support vector machines (SVMs), to recognize patterns in images or video frames that deviate from normal behaviour. Studies on the application of deep learning models to anomaly detection are still ongoing, with efforts concentrated on capturing high-dimensional representations of normal data <ref type="bibr" target="#b6">[7]</ref>. More recently, unsupervised approaches, including autoencoders and Generative Adversarial Networks (GANs), have emerged as effective alternatives for detecting abnormalities, particularly in cases where labeled data is limited <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. However, in this specific context, using deep learning models poses several challenges. One major issue is the need for large-scale datasets, which are difficult to obtain due to the time-consuming and error-prone nature of annotating frames for tape anomalies. The lack of sufficient labeled data makes it impractical to train deep learning models effectively. In the specific domain of anomaly detection in time-based media, such as videos, frame-by-frame comparisons are usually adopted to detect subtle visual differences <ref type="bibr" target="#b9">[10]</ref>. The process typically involves analyzing pixel-level changes and identifying unusual patterns such as signal damage or inconsistencies. For example, background subtraction methods have been used in video surveillance to detect anomalies, though they suffer from false positives when exposed to subtle variations like lighting changes <ref type="bibr" target="#b10">[11]</ref>. 
These techniques have also been adopted for other applications, such as medical image analysis and industrial monitoring <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Overview of the Video Analyzer</head><p>The Video Analyzer is one of the main components of the IEEE/MPAI CAE ARP standard (an overall description of the standard will be provided in Section 3). This component implements an anomaly detection algorithm on the video of the tape flowing on the playback head of the tape recorder. The frame of each irregularity identified by the algorithm and related metadata are the output of this component. The first version of the Video Analyzer featured an innovative method for detecting irregularities on the surface of open-reel audio tapes by analyzing videos produced during the digitization process <ref type="bibr" target="#b13">[14]</ref>. This approach enabled a detailed frame-by-frame inspection, allowing for the identification of physical issues such as splices, scratches, and deformations. The system employed advanced computer vision algorithms, notably the Generalized Hough Transform and Speeded Up Robust Features (SURF), to identify regions of interest (ROIs) on the tapes and detect potential anomalies. The detection process involved several key steps. First, the system focused on detecting ROIs in the section of the tape beneath the reading head, using fixed elements like the pinch roller, a rotating rubber wheel which accompanies and controls the movement of the tape, as reference points to ensure consistency across frames. In the next phase, consecutive video frames were compared to identify significant differences. This was done through a pixel-level analysis, where changes in pixel intensity were monitored. If the number of differing pixels between two frames exceeded a set threshold, the system flagged the frame as containing an anomaly. The system generated a difference image for each pair of frames, representing potential irregularities in the tape. After the anomaly detection phase, the identified irregularities were not classified immediately. 
Instead, the output from this process was passed to the Tape Irregularity Classifier, a module within the IEEE/MPAI CAE ARP standard. At this stage, a convolutional neural network was used to categorize the detected irregularities, identifying specific issues such as splices, dirt, or other forms of damage on the tape's surface. One of the primary limitations of the original methodology was its reliance on the PAL (720x576) video format, which was the standard for most of the archived video recordings. The use of PAL video introduced several challenges. For instance, the low resolution and interlaced nature of PAL video (25 interlaced frames per second) affected the accuracy of anomaly detection, particularly during the image classification stage. The misalignment between odd and even lines due to interlacing reduced the precision of the convolutional neural network in identifying anomalies, as it disrupted the visual clarity required to detect subtle tape irregularities <ref type="bibr" target="#b13">[14]</ref>. To overcome these issues, high-definition (HD) video is necessary. By increasing the video resolution up to Full HD (1920x1080) and frame rate to 50 fps (progressive), the enhanced method aims at reducing motion blur by using a shutter speed of 1/100 of a second and improving the detail captured in each frame. This allows for a more precise identification of anomalies, such as small scratches or splices that may go unnoticed in lower-quality videos. This improvement also allows for an overall better performance of machine learning models, which can leverage the additional details to detect more complex irregularities. Thus, the transition to FHD video is crucial to refining the anomaly detection process and achieving more accurate and reliable results.</p></div>
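The pixel-counting criterion used by the first Video Analyzer, as described above, can be sketched in a few lines. This is an illustrative reconstruction, not the standard's reference code: the function names and the specific tolerance and count thresholds are our own assumptions, and numpy stands in for the original computer vision pipeline.

```python
import numpy as np

def differing_pixel_count(prev_frame: np.ndarray, curr_frame: np.ndarray,
                          tolerance: int = 25) -> int:
    """Count pixels whose absolute intensity change exceeds `tolerance`."""
    diff = np.abs(prev_frame.astype(np.int16) - curr_frame.astype(np.int16))
    return int(np.count_nonzero(diff > tolerance))

def is_anomalous(prev_frame: np.ndarray, curr_frame: np.ndarray,
                 tolerance: int = 25, count_threshold: int = 500) -> bool:
    """Flag a frame pair when the number of differing pixels is too large."""
    return differing_pixel_count(prev_frame, curr_frame, tolerance) > count_threshold
```

As the section notes, a fixed count threshold of this kind is what made the original method insensitive to small irregularities such as shadows or annotations.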
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">IEEE/MPAI CAE ARP</head><p>MPAI is an international, independent, non-profit organization dedicated to developing standards for AI-based data coding. The MPAI Context-based Audio Enhancement (MPAI-CAE) standard aims to enhance the user experience in audio-related applications such as entertainment, communication, teleconferencing, gaming, post-production, and restoration. The MPAI-CAE international standard was approved in May 2022 and subsequently adopted by the IEEE Standards Association as 3302-2022 in December of the same year. The presented work falls within the Audio Recording Preservation (ARP) use case, a part of the MPAI-CAE standard. The IEEE/MPAI CAE ARP standard provides precise software references for audio document preservation. Its technical specifications adopt the preservation methodology developed at the CSC, incorporating AI-based computational tools to extract information from digitized audio/video of analog open reel tapes. The use of AI enables the automatic detection of irregularities on the surface of the tape, improving precision and speed in selecting and extracting irregularities, such as splices, marks, and loss of magnetic paste. The technical architecture of the standard includes five modules that target and process different digital inputs: the Audio Analyzer, the Video Analyzer, the Tape Irregularity Classifier, the Tape Audio Restoration, and the Packager. The initial version of the Video Analyzer was implemented by training the system using a dataset of videos created by the CSC during numerous digitization projects. These videos were recorded during the A/D transfer of the signal to provide a visual record of magnetic tapes, most of which were preparatory tapes containing electronic music recordings. 
Documenting the presence of splices, different tape segments, possible alterations, annotations, and marks added by the composer can be extremely useful for reconstructing the philological history of a given work, as well as for detecting parts that could require audio restoration <ref type="bibr" target="#b14">[15]</ref>. Moreover, the video provides valuable information regarding the preservation conditions of audio documents, making it possible to keep track of splices, marks, and other surface irregularities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Algorithm Description</head><p>The proposed method maintains the same framework as the existing approach, where consecutive frame pairs are compared as the video plays. The logic behind this method is to treat the problem as motion detection. While there is constant movement in the Region of Interest (ROI), areas without irregularities appear static, like still images. This allows for the use of a traditional motion detection algorithm, with irregularities seen as moving objects and regular regions as the background. The modification introduced in this work is based on frame differencing with additional filtering steps. This approach is chosen to improve the accuracy of detecting irregularities by focusing on meaningful changes between consecutive frames while reducing the impact of noise and irrelevant variations. By adding filtering, the method becomes better at distinguishing real irregularities from minor fluctuations that do not indicate actual issues. The flowchart in Figure <ref type="figure" target="#fig_0">1</ref> shows the frame differencing process with additional steps that incorporate Otsu's method <ref type="bibr" target="#b15">[16]</ref> for thresholding and determining irregularities. The first step in this method is identifying the ROI, building on previously established techniques: the reading head is detected using the Generalized Hough Transform, while Speeded-Up Robust Features (SURF) is used to detect the position of the pinch roller. In this implementation, the ROI of the tape area is reduced by half to decrease the computational load and improve accuracy. While this method reduces computational overhead, it may increase the number of irregularities detected in the case of problems that extend over long portions of the tape (for example, writing on the tape or a large scratch) or tapes running at very low speeds. 
While this solution creates a certain amount of "noise", later modules of the standard are able to handle the extra frames, and this approach also reduces information loss. This step helps the method focus only on the relevant parts of the video,  making it more efficient and accurate. Figure <ref type="figure" target="#fig_2">2b</ref> shows the reduced ROI.</p><p>Next, the method calculates the absolute difference between pairs of consecutive frames to create a difference image. This image shows the intensity variations between frames, with significant changes indicating possible irregularities. By focusing on these intensity differences instead of just color changes, the method can better detect motion and irregularities, ensuring that subtle but important differences are not missed. This approach treats differences more selectively, unlike the previous method, which treated all differences the same. Figure <ref type="figure" target="#fig_3">3c</ref> shows the result of the absolute difference between the two consecutive frames Figure <ref type="figure" target="#fig_3">3a</ref> and Figure <ref type="figure" target="#fig_3">3b</ref>. After creating the difference image, the standard deviation is calculated to decide if a frame needs further evaluation. For a frame to be evaluated, the standard deviation must exceed the threshold referred to as deviation limit in Figure <ref type="figure" target="#fig_0">1</ref>. Frames with a standard deviation below this limit are considered regular and are excluded from further processing. This step reduces the computational load by ensuring that only frames with significant intensity variations are processed. The deviation limit varies based on the tape speed, with values set as follows: 2.25 for 30 inches per second (ips), 2.5 for 15 ips, 2.6 for 7.5 ips, and 2.75 for 3.75 ips. This separation is made to address the intensity change with different speeds, as higher speeds result in greater motion blur and reduced intensity. 
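The deviation-limit gate described above can be sketched as follows. This is a minimal sketch assuming grayscale ROI crops; numpy stands in for the OpenCV pipeline, the function name is ours, and the dictionary mirrors the per-speed limits given in the text.

```python
import numpy as np

# Per-speed standard deviation limits from the text: higher tape speeds
# produce more motion blur and lower intensity differences, so they get
# a lower gate.
DEVIATION_LIMITS = {30.0: 2.25, 15.0: 2.5, 7.5: 2.6, 3.75: 2.75}

def needs_evaluation(prev_roi: np.ndarray, curr_roi: np.ndarray,
                     tape_speed_ips: float) -> bool:
    """Return True when the difference image of two consecutive ROI crops
    varies enough (std above the speed-dependent limit) to be inspected."""
    diff = np.abs(prev_roi.astype(np.int16) - curr_roi.astype(np.int16))
    return float(diff.std()) > DEVIATION_LIMITS[tape_speed_ips]
```

Frames rejected by this gate are classified as regular without any further processing, which is where the computational saving comes from.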
For frames that pass the standard deviation check, Otsu's method is used to find a suitable threshold for binarizing the difference image. Depending on the result, either a global threshold or Otsu's threshold is selected to create a binary motion image. To improve the accuracy of irregularity detection, upper and lower limits are set on the threshold values. The upper limit is 15, preventing the threshold from being so high that portions of an irregularity are filtered out. The lower limit is 5, preventing the threshold from being so low that insignificant portions of the frame are highlighted, which would cause false positives. This approach adapts the thresholding process to each frame and makes irregularity detection more robust and reliable. Once the threshold is set, it is applied to the difference image to create a binary motion image. This image shows potential irregularities as white pixels on a black background, making it easier to identify significant differences between frames, and serves as a key intermediate step. Figure <ref type="figure">4a</ref> shows the binarized image obtained by thresholding the difference image in Figure <ref type="figure" target="#fig_3">3c</ref>. To further improve the binary motion image, an opening operation is applied with a 3x3 kernel. This process, which involves erosion followed by dilation, removes small, irrelevant artifacts caused by tape vibration or other external factors while keeping the main irregularities intact. This step improves the clarity of potential irregularities, making them easier to detect and analyze in the final evaluation phase. Finally, the method counts the number of white pixels in the processed image and compares it to a threshold of 5% of the total pixels. If the count exceeds this threshold, the frame is marked as irregular; otherwise, it is classified as regular. 
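The thresholding and filtering chain just described (Otsu's method clamped to [5, 15], a 3x3 morphological opening, and the 5% white-pixel test) can be sketched end to end. In the sketch below, numpy implementations stand in for OpenCV's `cv2.threshold` and `cv2.morphologyEx`, and the helper names are our own.

```python
import numpy as np

def otsu_threshold(img: np.ndarray) -> int:
    """Otsu's method on an 8-bit image: pick the threshold that maximizes
    the between-class variance of the intensity histogram."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    total = img.size
    cum_count = np.cumsum(hist)
    cum_sum = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = cum_count[t - 1], total - cum_count[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_sum[t - 1] / w0
        m1 = (cum_sum[-1] - cum_sum[t - 1]) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def _shifts(b: np.ndarray):
    """The nine 3x3-neighbourhood shifts of a boolean image."""
    p = np.pad(b, 1, constant_values=False)
    h, w = b.shape
    return [p[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def opening_3x3(binary: np.ndarray) -> np.ndarray:
    """Morphological opening (erosion then dilation) with a 3x3 kernel."""
    eroded = np.logical_and.reduce(_shifts(binary))
    return np.logical_or.reduce(_shifts(eroded))

def is_irregular(diff_img: np.ndarray, lower: int = 5, upper: int = 15,
                 white_ratio: float = 0.05) -> bool:
    """Clamp Otsu's threshold to [lower, upper], binarize, clean up with an
    opening, and flag the frame when white pixels exceed `white_ratio`."""
    t = min(max(otsu_threshold(diff_img), lower), upper)
    opened = opening_3x3(diff_img > t)
    return float(opened.mean()) > white_ratio
```

The opening removes isolated specks caused by tape vibration while leaving compact irregularities intact, so only regions large enough to pass the 5% test trigger a detection.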
Figure <ref type="figure">4b</ref> shows the image after morphological opening. Figure <ref type="figure">5</ref> and Figure <ref type="figure" target="#fig_5">6</ref> illustrate the processes applied to an annotation and a shadow, respectively. A further improvement concerns the quality of the videos provided as input to the algorithm. The analysis was previously conducted on videos in PAL format at 25 fps interlaced with a resolution of 720x576. The new algorithm works on high-definition videos with 50 fps (progressive), a fixed shutter speed of 1/100 of a second and a resolution of 1920x1080. This enhancement in video quality allows for more precise detection of irregularities, as finer details and smaller anomalies can be captured and analyzed more effectively. Since the video documentation began before the development of this project, many videos were of lower quality and interlaced. One approach to handle the misalignment caused by interlacing was to separate the even and odd fields, which mitigated the misalignment but reduced the resolution to 720x288. To make effective use of these existing resources, the new method provides a solution by employing interpolation-based deinterlacing. This approach ensures that the captured irregularities are not limited to the reduced resolution of 720x288 but are instead maintained at the original full resolution. By preserving the original dimensions, the method aims to enhance the accuracy of irregularity detection and also addresses the challenge of misaligned lines, which could impact classification in the future. Figure <ref type="figure" target="#fig_7">7a</ref> shows a frame from the original interlaced video and Figure <ref type="figure" target="#fig_7">7b</ref> shows the resulting image of deinterlacing. In this new version, the speed options have been expanded. The previous version of the Video Analyzer was specifically calibrated for detecting the anomalies on tapes recorded at 7.5 ips and 15 ips. 
However, the updated version also supports lower and higher speeds (3.75 ips and 30 ips). This change makes the method more flexible and adaptable, allowing it to handle a wider range of tape playback scenarios.</p></div>
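One way to realize the interpolation-based deinterlacing described in this section is to split the frame into its two fields and rebuild each at full height by linear interpolation between adjacent field lines. The sketch below is a minimal illustration under that assumption; a production pipeline would typically rely on an established deinterlacer (e.g. FFmpeg's yadif), and the function name is ours.

```python
import numpy as np

def deinterlace(frame: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split an interlaced frame into its two fields and rebuild each at the
    full frame height via linear interpolation, so a 720x576 frame yields two
    progressive 720x576 images instead of two 720x288 half-height fields."""
    def upscale(field: np.ndarray) -> np.ndarray:
        f = field.astype(np.float64)
        out = np.empty((f.shape[0] * 2,) + f.shape[1:], dtype=np.float64)
        out[0::2] = f                        # keep the recorded field lines
        out[1:-1:2] = (f[:-1] + f[1:]) / 2   # interpolate the missing lines
        out[-1] = f[-1]                      # replicate the last line
        return out.astype(frame.dtype)
    return upscale(frame[0::2]), upscale(frame[1::2])
```

Each output image keeps the temporal position of its field while restoring the full vertical resolution, which avoids the misaligned-lines problem described above.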
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiment and evaluation</head><p>To evaluate the performance of the improved anomaly detection method, an experiment was conducted using videos of open-reel tapes featuring various manually annotated irregularities (mainly splices, as they are the most common ones). The test aimed to compare the improved detection method with the original one across different playback speeds, providing a comprehensive analysis of precision and recall. The videos were captured at four distinct speeds: 3.75 ips, 7.5 ips, 15 ips, and 30 ips, with tape durations varying accordingly: 10 minutes for 3.75 ips, 34 minutes for 7.5 ips, 10 minutes for 15 ips, and 9 minutes for 30 ips. At 3.75 ips, there were 14 splices present. At 7.5 ips, the tape had 66 splices, along with 4 annotations, 3 shadows, and 3 end-of-tape markers (a full description of these kinds of irregularities can be found in <ref type="bibr" target="#b3">[4]</ref>). For 15 ips, the tape featured 55 splices and 1 shadow. At 30 ips, the tape had 93 splices, 2 end-of-tape markers, and 2 annotations. For the sake of completeness, the authors also modified the previous version of the algorithm to support the two additional speeds. The experiment focused on two key metrics: precision (the ability to correctly identify anomalies without generating false positives) and recall (the ability to detect all present anomalies). Both metrics were measured for each playback speed to compare the results of the old and new methods. The results, summarized in Table <ref type="table" target="#tab_0">1</ref>, show a notable difference between the two approaches. The new method successfully detected all irregularities at every speed but produced some false positives, while the old method demonstrated inconsistent performance, particularly struggling at lower speeds.</p><p>The new method detected all irregularities across the given speeds. 
However, in some cases, it generated some false positives. This was especially noticeable at 30 ips, where the motion blur caused by the higher speed required increasing the sensitivity used for detecting frame differences, which led to more false positives in the presence of lighting changes and vibrations. It should be noted that some of the false positives are duplicates of the same irregularities. The method was designed to ensure that closely occurring irregularities are not missed, which sometimes results in duplicates. In future releases, the method can be improved by merging consecutive irregularities into one after classification. This change could help reduce false positives and enhance the overall detection precision. The old method faced significant challenges, particularly at 3.75 ips, where it failed to detect any irregularity. This shortcoming is likely due to its reliance on detecting large pixel differences between frames, which are less noticeable at slower speeds. Consequently, the old method performed better at the highest speed (30 ips), where pixel variations are more pronounced, although its recall remained lower than that of the new method. Additionally, due to its reliance on pixel count, the old method struggled to detect irregularities other than splices. This limitation arose because irregularities such as shadows and annotations often affect only a smaller area of the tape, making them less noticeable in comparison with splices. The new method addresses these issues more effectively: by focusing on intensity differences between frames and using filtering techniques, it provides more accurate detection. As a result, the overall performance of the novel method is better than that of the previous one, providing more accurate and consistent irregularity detection across different speeds and various video conditions.</p></div>
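Precision and recall as used in this evaluation follow the standard definitions. The sketch below computes both; the specific counts in the usage line are a hypothetical reconstruction consistent with the 15 ips "old" row of Table 1 (27 of 56 irregularities found, with 1 false positive), not figures reported by the authors.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).
    Returns 0.0 for an undefined ratio (empty denominator)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts consistent with the 15 ips "old" row of Table 1:
p, r = precision_recall(tp=27, fp=1, fn=29)   # ~0.9643 precision, ~0.4821 recall
```

Note that, since duplicate detections of the same irregularity are counted as false positives here, the duplicate-merging step proposed above would directly raise precision without affecting recall.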
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>This work introduced significant improvements in the automatic detection of superficial irregularities on open-reel audio tapes using computer vision techniques. The key contributions include the development of a new dataset of high-quality videos<ref type="foot" target="#foot_2">2</ref>, an enhanced detection algorithm with improved accuracy, and an expanded range of supported playback speeds. Despite these advancements, the system still faces certain limitations. The increased recall of the new algorithm, especially at higher playback speeds, led to false positives, particularly due to lighting changes and vibrations. Additionally, double detections of the same irregularities highlighted the need for further refinements, such as post-processing techniques that can merge closely occurring irregularities into a single detection. Future work will focus on addressing these limitations. Possible improvements include refining the filtering process to reduce false positives and optimizing the classification module to handle more complex irregularities. Furthermore, expanding the algorithm's capabilities to work with other tape formats and video resolutions could enhance its applicability to a broader range of archival materials, supporting better preservation and restoration efforts in audiovisual archives.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Flowchart of the frame differencing method with additional filtering.</figDesc><graphic coords="4,72.00,420.73,451.28,182.97" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Comparison of original and new Regions of Interest (ROIs). The green rectangles highlight the ROIs while red rectangles highlight the reading heads.</figDesc><graphic coords="5,139.45,65.61,144.41,84.24" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Comparison of previous and current frames, with the calculated difference image.</figDesc><graphic coords="5,78.53,341.41,144.41,72.21" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :Figure 5 :</head><label>45</label><figDesc>Figure 4: Progression from the difference image to thresholding, followed by morphological opening.</figDesc><graphic coords="6,148.48,298.79,135.39,67.36" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Progression from the current frame to final image for a shadow.</figDesc><graphic coords="6,148.48,618.60,135.39,67.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>(a) Original interlaced frame (b) Deinterlaced frame</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Comparison between an interlaced frame with misaligned lines and the deinterlaced frame.</figDesc><graphic coords="7,139.45,253.71,144.42,71.49" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Evaluation for different speed options</figDesc><table><row><cell>Tape Speed</cell><cell>Precision - New</cell><cell>Recall - New</cell><cell>Precision - Old</cell><cell>Recall - Old</cell></row><row><cell>3.75 ips</cell><cell>0.9333</cell><cell>1.0000</cell><cell>0.0000</cell><cell>0.0000</cell></row><row><cell>7.5 ips</cell><cell>0.9268</cell><cell>1.0000</cell><cell>0.4156</cell><cell>0.8421</cell></row><row><cell>15 ips</cell><cell>0.9655</cell><cell>1.0000</cell><cell>0.9643</cell><cell>0.4821</cell></row><row><cell>30 ips</cell><cell>0.8818</cell><cell>1.0000</cell><cell>0.9655</cell><cell>0.8660</cell></row></table></figure>
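The precision and recall values in Table 1 follow the standard definitions precision = TP/(TP+FP) and recall = TP/(TP+FN). A minimal sketch; the TP/FP/FN counts below are illustrative assumptions (chosen so the result matches the 3.75 ips row of the new method), not figures reported in the paper.

```python
# Standard detection metrics from true-positive, false-positive,
# and false-negative counts.

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts: 14 correct detections, 1 spurious, 0 missed.
p, r = precision_recall(14, 1, 0)
print(round(p, 4), round(r, 4))  # -> 0.9333 1.0
```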
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Link to the standard https://standards.ieee.org/ieee/3302/11006/ (Last Accessed: November 4, 2024)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2">The dataset presented in this article is available on Zenodo. The assigned DOI is 10.5281/zenodo.14028922. Please refer to this repository to access data and additional details.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work is partially supported by the SYCURI Project, funded by the University of Padova in the Program "World Class Research Infrastructure".</p></div>
			</div>


			<div type="availability">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>M. Spanio: https://matteospanio.github.io/; N. Pretto: https://www.unibz.it/it/faculties/engineering/academic-staff/person/47860-niccolo-pretto; S. Canazza: https://www.dei.unipd.it/~canazza/</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Will you be mine forever? Audio archiving, multitracks, and 90s digital</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rumsey</surname></persName>
		</author>
		<ptr target="https://aes2.org/publications/elibrary-page/?id=20736" />
	</analytic>
	<monogr>
		<title level="j">Journal of the Audio Engineering Society</title>
		<imprint>
			<biblScope unit="volume">68</biblScope>
			<biblScope unit="page" from="304" to="307" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Four decades of music research, creation, and education at Padua&apos;s Centro di Sonologia Computazionale</title>
		<author>
			<persName><forename type="first">S</forename><surname>Canazza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">De</forename><surname>Poli</surname></persName>
		</author>
		<idno type="DOI">10.1162/comj_a_00537</idno>
	</analytic>
	<monogr>
		<title level="j">Computer Music Journal</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="page" from="58" to="80" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Gesture, music and computer: The Centro di Sonologia Computazionale at Padova University, a 50-year history</title>
		<author>
			<persName><forename type="first">S</forename><surname>Canazza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>De Poli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vidolin</surname></persName>
		</author>
		<idno type="DOI">10.3390/s22093465</idno>
	</analytic>
	<monogr>
		<title level="j">Sensors</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Computing methodologies supporting the preservation of electroacoustic music from analog magnetic tape</title>
		<author>
			<persName><forename type="first">N</forename><surname>Pretto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fantozzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Micheloni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Burini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Canazza</surname></persName>
		</author>
		<idno type="DOI">10.1162/comj_a_00487</idno>
	</analytic>
	<monogr>
		<title level="j">Computer Music Journal</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="59" to="74" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Sound and music computing using AI: Designing a standard</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bosi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Pretto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Guarise</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Canazza</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.5045003</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Sound and Music Computing Conference, Virtual</title>
				<meeting>the 18th Sound and Music Computing Conference, Virtual</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="215" to="218" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Deep learning for anomaly detection: A review</title>
		<author>
			<persName><forename type="first">G</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">V D</forename><surname>Hengel</surname></persName>
		</author>
		<idno type="DOI">10.1145/3439950</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Chalapathy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chawla</surname></persName>
		</author>
		<idno>ArXiv abs/1901.03407</idno>
		<ptr target="https://api.semanticscholar.org/CorpusID:57825713" />
		<title level="m">Deep learning for anomaly detection: A survey</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Unsupervised anomaly detection with generative adversarial networks to guide marker discovery</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schlegl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Seeböck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Waldstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Schmidt-Erfurth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Langs</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-319-59050-9_12</idno>
	</analytic>
	<monogr>
		<title level="m">Information Processing in Medical Imaging</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Niethammer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Styner</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Aylward</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Oguz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P.-T</forename><surname>Yap</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Shen</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="146" to="157" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">GAN-based anomaly detection: A review</title>
		<author>
			<persName><forename type="first">X</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.neucom.2021.12.093</idno>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">493</biblScope>
			<biblScope unit="page" from="497" to="535" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Real-world anomaly detection in surveillance videos</title>
		<author>
			<persName><forename type="first">W</forename><surname>Sultani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shah</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2018.00678</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="6479" to="6488" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Performance evaluation of background subtraction techniques for video frames</title>
		<author>
			<persName><forename type="first">S</forename><surname>Qasim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">N</forename><surname>Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Khan</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICAI52203.2021.9445253</idno>
	</analytic>
	<monogr>
		<title level="m">2021 International Conference on Artificial Intelligence (ICAI)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="102" to="107" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Leveraging motion saliency via frame differencing for enhanced object detection in videos</title>
		<author>
			<persName><forename type="first">L</forename><surname>Nans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mediavilla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Marez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Parameswaran</surname></persName>
		</author>
		<idno type="DOI">10.1117/12.2678373</idno>
	</analytic>
	<monogr>
		<title level="m">International Society for Optics and Photonics</title>
				<editor>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Alam</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><forename type="middle">K</forename><surname>Asari</surname></persName>
		</editor>
		<imprint>
			<publisher>SPIE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">12527</biblScope>
			<biblScope unit="page">125270V</biblScope>
		</imprint>
	</monogr>
	<note>Pattern Recognition and Tracking XXXIV</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Moving object recognition on production line based on adaptive frame differencing algorithm</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yan</surname></persName>
		</author>
		<idno type="DOI">10.1109/CCDC62350.2024.10587928</idno>
	</analytic>
	<monogr>
		<title level="m">2024 36th Chinese Control and Decision Conference (CCDC)</title>
				<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="966" to="971" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Enhancing preservation and restoration of open reel audio tapes through computer vision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Russo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Spanio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Canazza</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-51026-7_26</idno>
	</analytic>
	<monogr>
		<title level="m">Image Analysis and Processing -ICIAP 2023 Workshops</title>
				<editor>
			<persName><forename type="first">G</forename><forename type="middle">L</forename><surname>Foresti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Fusiello</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Hancock</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Nature Switzerland</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="297" to="308" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Multimedia archives: New digital filters to correct equalization errors on digitized audio tapes</title>
		<author>
			<persName><forename type="first">N</forename><surname>Pretto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Micheloni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chmiel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Pozza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Marinello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Canazza</surname></persName>
		</author>
		<idno type="DOI">10.1155/2021/5410218</idno>
	</analytic>
	<monogr>
		<title level="j">Advances in Multimedia</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A threshold selection method from gray-level histograms</title>
		<author>
			<persName><forename type="first">N</forename><surname>Otsu</surname></persName>
		</author>
		<idno type="DOI">10.1109/TSMC.1979.4310076</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Systems, Man, and Cybernetics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="62" to="66" />
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
