<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Sudoku Puzzle Classification: A Suite of Collective Neuro-Symbolic Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eriq Augustine</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Connor Pryor</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles Dickens</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jay Pujara</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William Yang Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lise Getoor</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California</institution>
          ,
          <addr-line>Santa Barbara</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>Santa Cruz</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Neuro-symbolic computing (NeSy) is an emerging field that has the goal of integrating the low-level representational power of deep neural networks with high-level symbolic reasoning. Due to the youth of the field and the complexity of neuro-symbolic integration, there are few benchmarks that showcase the strengths of NeSy, and even fewer built specifically with NeSy in mind. To address the lack of NeSy benchmarks, we introduce Visual Sudoku Puzzle Classification (ViSudo-PC). ViSudo-PC is a new NeSy benchmark dataset combining visual perception with relational constraints. The goal of the benchmark is to both highlight opportunities and elicit challenges. In addition to providing a new NeSy benchmark suite, we also provide an exploratory analysis that showcases ViSudo-PC's difficulty and possibilities.</p>
      </abstract>
      <kwd-group>
        <kwd>Benchmark</kwd>
        <kwd>Dataset</kwd>
        <kwd>Neuro-Symbolic Integration</kwd>
        <kwd>Relational Data</kwd>
        <kwd>Structured Prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Integrating neural and symbolic reasoning is a long-standing challenge in the machine
learning community. Neuro-symbolic computing (NeSy), which combines low-level neural
perception and logic-based reasoning [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ],
is a promising area of research that aims to integrate these concepts in a seamless
fashion. NeSy systems have shown the advantages of incorporating neural and logical
reasoning, including the ability to learn with less data, robustness to noise, the ability to
perform joint reasoning (structured prediction), and more. Unfortunately, there is a dearth
of NeSy benchmarks that are both challenging and realistic. Building a comprehensive NeSy test
suite designed with the ability to vary the structural constraints and perceptual difficulty
remains a challenge facing the community.
      </p>
      <p>
        There are a variety of tasks and datasets commonly used in NeSy research. Many involve
visual reasoning. Examples include identifying the subject or context of the image such as
Visual Relationship Detection [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Semantic Image Interpretation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and Visual Genome [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
These datasets support interesting and complex visual tasks. However, the complex nature of
the task can lead to ambiguous answers (e.g., even humans may misunderstand the context of an
arbitrary image). Additional NeSy test domains involve reasoning with knowledge graphs such
as FB15k and WN18 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These datasets often lack direct subsymbolic information, requiring
that information to be generated from other sources (e.g., from word embeddings). Finally,
MNIST Addition [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is another popular NeSy dataset. MNIST Addition uses MNIST digit images
as operands in addition equations, with the goal of predicting the sum of the digit images.
MNIST Addition is an excellent NeSy testbed; however, it is limited by the ease of MNIST image
classification. With MNIST classifiers that can achieve over 99.9% accuracy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], MNIST Addition
(both single- and multi-digit variants) is possible using little symbolic reasoning.
      </p>
      <p>Our goal is to create a comprehensive NeSy benchmark that can be used to further NeSy
research. Such a benchmark needs to take into account the nature of neuro-symbolic computing
at each step. Specifically, a benchmark made for the NeSy community should: 1) include
self-contained symbolic and subsymbolic information, both of which are necessary to solve the
problem; 2) contain settings/tasks with varying degrees of hardness; 3) include entities that can
be collectively reasoned over; and 4) have unambiguous labels.</p>
      <p>
        In this work, we introduce a novel benchmark specifically designed for NeSy systems, Visual
Sudoku Puzzle Classification (ViSudo-PC). ViSudo-PC expands on the concept of visual Sudoku
puzzles introduced in Wang et al. (2019) and visual Sudoku puzzle classification introduced in
Pryor et al. (2022). Given a Sudoku puzzle constructed from images as input, the classification
task is to determine whether the Sudoku puzzle is correctly solved. Performing well on the
classification of visual Sudoku puzzles requires systems that are able to reason about the
perceptual information in the images as well as the collective information from Sudoku constraints.
ViSudo-PC expands upon the perceptual challenge of previous MNIST compositional tasks
[
        <xref ref-type="bibr" rid="ref10 ref11 ref8">8, 10, 11</xref>
        ] by drawing images from four different sources. Additionally, ViSudo-PC includes a
collection of progressively harder tasks.
      </p>
      <p>Our key contributions are as follows: 1) We construct ViSudo-PC, an extensive NeSy
benchmark that integrates four canonical visual datasets into five tasks of varying difficulty requiring
symbolic and sub-symbolic reasoning to solve, 2) We perform an exploratory evaluation over
two ViSudo-PC tasks to quantify the difficulty of these tasks in different settings, and 3) We
discuss ways that the data and tasks of ViSudo-PC can be extended and improved.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Benchmark Data</title>
      <p>
        Visual Sudoku Puzzle Classification (ViSudo-PC) expands upon the classification task proposed
in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The data includes completed Sudoku puzzles, along with their classification as “correct”
(solved) or “incorrect”. The data is available at
https://linqs-data.soe.ucsc.edu/public/datasets/ViSudo-PC/v01/, and ViSudo-PC also provides
a data generation tool, discussed in Appendix B. For those unfamiliar with Sudoku, Sudoku is a
puzzle game in which there is a 9x9 grid, called a “puzzle” or “board”, in which each cell is
populated with the numbers 1–9. A puzzle is correct if no row, column, or non-overlapping 3x3
subgrid (or “block”) contains the same number more than once. Classifying whether a Sudoku
puzzle is solved correctly simply involves checking these three types of constraints. Following [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], rather than providing symbolic information (e.g., labels) for the cell content, we can
complicate the problem by providing subsymbolic information in the form of images. The
images can be of digits (or, as we’ll see later, other objects) for each cell, as in Figure 1. Note
that no information is provided about the label of each cell, so a classification system must
learn to identify or distinguish between the cell labels at the same time as it checks whether
the puzzle is solved.
      </p>
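      <p>To make the symbolic half of the task concrete, the following is a minimal Python sketch of
the correctness check described above (the function and variable names are ours for illustration
and are not part of the released benchmark code):</p>
      <preformat>
# A minimal sketch: check whether a completed d x d grid of cell labels
# satisfies the Sudoku constraints. Labels may be any hashable values
# (digits, letters, clothing classes, ...).
import math

def is_correct(grid):
    d = len(grid)
    b = math.isqrt(d)  # block size; d must have an integer square root
    assert b * b == d

    def all_unique(cells):
        return len(set(cells)) == len(cells)

    rows = [list(row) for row in grid]
    cols = [[grid[r][c] for r in range(d)] for c in range(d)]
    blocks = [[grid[br * b + r][bc * b + c] for r in range(b) for c in range(b)]
              for br in range(b) for bc in range(b)]
    return all(all_unique(group) for group in rows + cols + blocks)

# Example: a correct 4x4 puzzle over arbitrary labels.
assert is_correct([['A', 'B', 'C', 'D'],
                   ['C', 'D', 'A', 'B'],
                   ['B', 'A', 'D', 'C'],
                   ['D', 'C', 'B', 'A']])
      </preformat>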
      <sec id="sec-2-1">
        <title>2.1. Data Sources</title>
        <p>
          We build upon existing image classification work [
          <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
          ] to create Sudoku puzzles where
the cell contents and labels originate from multiple data sources. Each data source provides
28-pixel by 28-pixel grayscale images covering a different domain of objects; examples of these
images are displayed in Figure 9. Table 1 summarizes the data sources used by ViSudo-PC to
construct Sudoku puzzles.
        </p>
        <p>Table 1: A summary of the data sources used by ViSudo-PC (columns: Dataset, Train
Examples, Test Examples, Labels, Subject; rows: MNIST, EMNIST-ML, FMNIST, KMNIST).</p>
        <p>
          MNIST MNIST is one of the most well-known and widely used image classification datasets
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. It is composed of 70,000 examples of handwritten digits distributed roughly evenly across
the ten digit classes.
        </p>
        <p>
          Extended MNIST (EMNIST) extends MNIST by introducing additional examples of handwritten
digits as well as examples of handwritten English letters [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. EMNIST provides several different
ways to group/classify the data, including by author, by class (arranged into 62 classes denoting
[0-9], [a-z], and [A-Z]), and by merged classes. The merged classes setting combines similar
uppercase and lowercase letters, e.g., “c” &amp; “C”, “m” &amp; “M”, and “o” &amp; “O”, into 47 total classes.
ViSudo-PC uses the merged classes, but removes digits to avoid overlap with the MNIST digits.
We refer to this subset of EMNIST as EMNIST Merged Letters (EMNIST-ML). The images provided
in EMNIST are flipped horizontally and rotated 90 degrees anticlockwise. To maintain
consistency with the other data sources, each EMNIST-ML image is adjusted to an upright
position.
        </p>
        <sec id="sec-2-1-1">
          <title>1Data is available at https://linqs-data.soe.ucsc.edu/public/datasets/ViSudo-PC/v01/. 2ViSudo-PC also provides a data generation tool discussed in Appendix B.</title>
          <p>
            Fashion-MNIST (FMNIST) uses images of fashion products [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] instead of digits. Classes
include items such as “Coats", “Bags", and “Trousers". Samples of each FMNIST class are
illustrated in Figure 9c. The complex nature of fashion items and the wide range of variants make
classifying FMNIST images considerably harder than standard MNIST [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ].
Kuzushiji-MNIST (KMNIST) uses Kuzushiji Japanese characters [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. Kuzushiji is a cursive
form of Japanese writing rarely used today. To reduce the 49-character Japanese alphabet down
to 10 classes, KMNIST chooses one character from each of the 10 Hiragana rows (representing
different consonant sounds). The stylistic nature of a cursive writing variant, like Kuzushiji,
makes KMNIST intrinsically more difficult than standard MNIST.
          </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Puzzle Construction</title>
        <p>ViSudo-PC puzzles are square and can be of any dimension d with an integer square root, as long
as enough cell labels are provided by the data sources (e.g., MNIST, with 10 classes, alone can
only support Sudoku puzzles with a dimension of 9 or less). To construct ViSudo-PC puzzles,
cell labels are first selected from the relevant data sources. The exact method of choosing data
sources and cell labels varies between tasks and is discussed in detail in Section 3. Cell labels are
randomly selected for each cell, ensuring that no Sudoku constraint is violated. Once cell labels
are determined, images of those labels are assigned to each cell. To create a pool of images for
each cell label, train and test splits from each data source are merged, shuffled, partitioned by
label, and split into train, test, and validation image pools. Images are never shared between
splits.</p>
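        <p>The following Python sketch illustrates this pool construction and image assignment; the
helper names and the split fractions are our assumptions for illustration, not the released
generator:</p>
        <preformat>
import random
from collections import defaultdict

def build_pools(images, labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Merge, shuffle, partition by label, and split into train/test/
    validation image pools; no image is shared between splits."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for img, lab in zip(images, labels):
        by_label[lab].append(img)
    pools = {split: {} for split in ('train', 'test', 'valid')}
    for lab, imgs in by_label.items():
        rng.shuffle(imgs)
        n_tr = int(fractions[0] * len(imgs))
        n_te = int(fractions[1] * len(imgs))
        pools['train'][lab] = imgs[:n_tr]
        pools['test'][lab] = imgs[n_tr:n_tr + n_te]
        pools['valid'][lab] = imgs[n_tr + n_te:]
    return pools

def render_puzzle(label_grid, pool, rng=random):
    """Replace each cell label with a randomly drawn image of that label."""
    return [[rng.choice(pool[lab]) for lab in row] for row in label_grid]
        </preformat>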
        <p>To create negative puzzle examples (incorrectly solved puzzles), existing correct puzzles are
corrupted. Our method of corruption allows ViSudo-PC to contain incorrect puzzles that are
only slightly incorrect, instead of incorrect puzzles that are constructed at random and would
likely contain many mistakes. Puzzles are corrupted in one of two ways: via replacement or via
substitution. Replacement corruptions randomly choose a location in a puzzle and an alternate
label, and then replace that cell with an image uniformly sampled from the split’s pool of
images for that label. Substitution corruptions swap two random cells in the same puzzle. After
each corruption is made, a coin with a configurable bias is flipped to see if another corruption
of the same type is performed. Finally, each corrupted puzzle is checked to ensure that multiple
corruptions did not create a correct puzzle.</p>
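        <p>A sketch of the corruption procedure follows, matching the description above (helper names
are hypothetical, and the released generator may differ in detail):</p>
        <preformat>
import random

def corrupt(puzzle, labels, pool, label_set, p_continue=0.5, rng=random):
    """Corrupt a correct puzzle in place; after each corruption, flip a
    biased coin to decide whether to corrupt again (same type each time)."""
    d = len(puzzle)
    method = rng.choice(['replacement', 'substitution'])
    while True:
        if method == 'replacement':
            # Pick a random cell and an alternate label, then replace the
            # cell with an image drawn from this split's pool for that label.
            r, c = rng.randrange(d), rng.randrange(d)
            alt = rng.choice([l for l in label_set if l != labels[r][c]])
            labels[r][c] = alt
            puzzle[r][c] = rng.choice(pool[alt])
        else:
            # Swap two random cells within the same puzzle.
            r1, c1 = rng.randrange(d), rng.randrange(d)
            r2, c2 = rng.randrange(d), rng.randrange(d)
            puzzle[r1][c1], puzzle[r2][c2] = puzzle[r2][c2], puzzle[r1][c1]
            labels[r1][c1], labels[r2][c2] = labels[r2][c2], labels[r1][c1]
        if rng.random() >= p_continue:
            break
    # The caller must still verify the result is not accidentally correct,
    # e.g.: assert not is_correct(labels)
    return puzzle, labels
        </preformat>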
        <p>To increase the connectivity of puzzles, ViSudo-PC allows for the introduction of overlap
into the data. Overlap is when the same image is used multiple times when generating puzzles.
Adding overlap gives the predictor an opportunity to recognize the same entity being used in
the same or diferent puzzles. A predictor employing joint reasoning can take advantage of this
opportunity to improve performance.</p>
        <p>The degree of overlap is controlled by a parameter λ. From a collection of images A, λ·|A|
images are uniformly sampled with replacement. The sampled images are added to A, which is
then shuffled, forming a new collection of |A| + λ·|A| images.</p>
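        <p>As a minimal sketch (the parameter and collection names follow the reconstruction above):</p>
        <preformat>
import random

def add_overlap(images, lam, rng=random):
    """Duplicate lam * |A| images, sampled uniformly with replacement,
    then shuffle, yielding |A| + lam * |A| images in total."""
    extras = [rng.choice(images) for _ in range(int(lam * len(images)))]
    combined = images + extras
    rng.shuffle(combined)
    return combined
        </preformat>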
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Benchmark Tasks</title>
      <p>Five different tasks are provided as a part of the ViSudo-PC benchmark, each providing
increasingly difficult problems. In all settings, the goal is to classify Sudoku puzzles as correct or
incorrect. Cell labels are provided for debugging, but should never be used in any official task.
Basic In this task, a single data source is specified and the first d cell labels from the data source
are used, where d is the dimension of the puzzles. These are the cell labels for the puzzles, and
then images for the cell labels are chosen randomly. This extends the original Visual-Sudoku-
Classification problem from Pryor et al. (2022) by including alternate data sources.</p>
      <p>Random Label Per Split (PerSplit) This task builds upon the Basic task by randomizing
the cell labels used in each split. PerSplit randomly selects d cell labels from the specified data
sources to use throughout all train, test, and validation puzzles. Any non-empty subset of data
sources can be specified. The challenge posed by PerSplit is that any model/architecture used
must be effective on several types of images, and not specific to one label set. For example, an
architecture specialized to classify MNIST digits may fail to classify the shoes and purses in
FMNIST. Any architecture that performs well on all variations of this task must be able to deal
with the digits, English letters, clothes, and Japanese characters that may appear in a single
puzzle.
Random Label Per Puzzle (PerPuzzle) This task increases the difficulty by re-sampling
the d cell labels for each individual puzzle. So for each puzzle, a set of d cell labels is uniformly
sampled from the specified data sources. Note that this task becomes considerably more difficult
when more data sources are used, since the pool of possible cell labels is larger. All cell labels
present in the test and validation puzzles are guaranteed to be used in train puzzles. For this task
(and the following tasks), methods that rely solely on subsymbolic information (image pixels)
will likely have great difficulty. Because of the larger pool of cell labels, each cell label may be
represented by fewer examples than in previous tasks. To perform well in this task, models
may need to distinguish between cell labels for each puzzle and use the symbolic information
in Sudoku constraints, instead of learning an image classifier on the train split.</p>
      <p>
        Random Label Per Cell (PerCell) Instead of limiting each puzzle to using d cell labels, this
task randomly chooses a cell label for each cell in each puzzle. Thus it can use as many as d²
cell labels (limited by the puzzle size and provided data sources). The number of cell labels used
per puzzle is randomly chosen and that information is not provided to the predictor outside of
the train puzzles. This is the only task that violates the full rules of Sudoku, as more than d cell
labels are potentially used. In this task, a puzzle is considered correct as long as the row, column,
and block constraints of Sudoku are not violated, i.e., no duplicate cell labels appear in any row,
column, or block. Again, we guarantee that all cell labels present in the test and validation
puzzles are also present in train puzzles. The potentially large and unknown number of cell
labels makes this task extremely challenging for any system that relies on an image classifier.
To perform well on this task, a predictor needs to be able to discriminate between cell labels
without seeing many examples of each and without knowledge of the number of cell labels.
Transfer The final task is a transfer learning task. The same process used for Basic is used
here, except that two disjoint sets of cell labels are chosen, one for training and another
for test and validation. For example, when MNIST is used as a data source with d = 4, the cell
labels [0-3] are present in the train set, while [4-7] are present in the test and validation sets.
      </p>
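      <p>The following sketch contrasts how the tasks select cell labels (function names are ours for
illustration):</p>
      <preformat>
import random

def labels_basic(source_labels, d):
    # Basic: the first d labels of a single data source.
    return source_labels[:d]

def labels_per_split(all_labels, d, rng=random):
    # PerSplit: one random d-label set shared by train/test/validation.
    return rng.sample(all_labels, d)

def labels_per_puzzle(all_labels, d, rng=random):
    # PerPuzzle: a fresh d-label set drawn for every individual puzzle.
    return rng.sample(all_labels, d)

def labels_per_cell(all_labels, d, rng=random):
    # PerCell: an independent label per cell, so up to d*d distinct labels;
    # only the row/column/block uniqueness constraints are enforced later.
    return [[rng.choice(all_labels) for _ in range(d)] for _ in range(d)]

def labels_transfer(source_labels, d):
    # Transfer: disjoint label sets for train vs. test/validation.
    return source_labels[:d], source_labels[d:2 * d]
      </preformat>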
    </sec>
    <sec id="sec-4">
      <title>4. Exploratory Evaluation</title>
      <p>We provide an initial exploratory evaluation of ViSudo-PC using the Basic and Random Label
Per Split tasks. We investigate the following questions: 1) Is there a difference in the difficulty
of each data source? 2) How does the use of multiple data sources affect performance? 3) How
does overlap affect performance?</p>
      <sec id="sec-4-1">
        <title>4.1. Models</title>
        <p>We evaluate over three models from Pryor et al. (2022) using hyperparameters specified in the
paper; all unspecified parameters were left at their default values.</p>
        <p>Baseline-Digit This model takes as input the cell labels of a Sudoku puzzle and outputs a
probability of the puzzle being valid. Note that this can be seen as the best possible scenario (or
cheating), where the neural model is able to correctly identify every cell image. This model uses
a feedforward multi-layer perceptron trained to minimize the cross-entropy loss. Formally, this
neural baseline consists of 3 fully connected dense layers of sizes 16, 512, and 256, each with a
ReLU activation, and a final dense output layer of size 1 with a softmax activation.
Baseline-Visual This model takes as input the pixels for each cell in a Sudoku puzzle and
outputs the probability of the puzzle being valid. This model uses a convolutional neural network
followed by a multi-layer perceptron, trained to minimize the cross-entropy loss. Formally, this
neural baseline consists of 3 convolutional layers with a kernel size of 3, where each is followed
by a max pooling layer of size 2 with stride 2. This then feeds into the same model as
Baseline-Digit.</p>
        <p>NeuPSL A NeSy model that has distinct neural perception and symbolic reasoning components.
The NeuPSL neural model takes as input the pixels for each cell in a Sudoku puzzle and outputs
a probability distribution for each class, which it then feeds into a symbolic model that verifies
the Sudoku constraints. Formally, the NeuPSL neural model is a simple image classifier first
mentioned in Manhaeve et al. (2021). This neural model consists of 2 convolutional layers with
a kernel size of 5, where each is followed by a max pooling layer of size 2 with stride 2. This then
feeds into fully connected dense layers of sizes 256, 120, and 84, each with a ReLU activation, and
a final dense output layer of size k with a softmax activation (where k is the number of classes).
The symbolic PSL model implements the rules of Sudoku as described in Pryor et al. (2022),
i.e., no duplicate digits in any row, column, or block.</p>
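        <p>A sketch of the NeuPSL neural component follows; the text fixes the kernel and dense-layer
sizes, while the filter counts (6 and 16, following the classic LeNet-5) are an assumption:</p>
        <preformat>
import tensorflow as tf
from tensorflow.keras import layers

def cell_classifier(k):
    # Input: one 28x28 cell image; output: a distribution over k classes,
    # which the symbolic PSL program consumes to check Sudoku constraints.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(6, kernel_size=5, activation='relu'),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(16, kernel_size=5, activation='relu'),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dense(120, activation='relu'),
        layers.Dense(84, activation='relu'),
        layers.Dense(k, activation='softmax'),
    ])
        </preformat>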
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data Source Dificulty</title>
        <p>
          To assess the difficulty of each data source for the tasks presented by ViSudo-PC, we first look
at the difficulty of each task in the simpler context of image classification. Table 2 shows the
state-of-the-art image classification performance for each data source at the time of writing
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. All data sources achieve accuracy in the 90s, with MNIST and KMNIST performing the
best and both achieving more than 99% accuracy, while EMNIST is the hardest with an accuracy
of only 91.59%.
        </p>
        <p>By examining each data source’s image classification performance, we can get an idea of
the best possible performance a naive ViSudo-PC model can achieve, where a naive model
simply attempts to classify each cell in a puzzle independently (assuming cell labels are supplied
to the model). ViSudo-PC provides both 4x4 and 9x9 puzzles (with the ability to generate
larger puzzles); therefore, the expected accuracy of a naive model is (acc)^16 and (acc)^81,
respectively, where acc is the data source’s image classification accuracy. (This expected
accuracy only includes performance on positive examples, and additionally excludes cases
where multiple classification mistakes result in an accidentally correct classification.)</p>
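        <p>As a quick numerical check of this bound, using the per-image accuracies cited in the text:</p>
        <preformat>
# Naive upper bounds: (acc)^16 for 4x4 puzzles, (acc)^81 for 9x9 puzzles.
for name, acc in [('MNIST (~99.9%)', 0.999), ('EMNIST-Merged', 0.9159)]:
    print(name, round(acc ** 16, 3), round(acc ** 81, 3))
# MNIST: ~0.984 and ~0.922; EMNIST-Merged: ~0.245 and ~0.001.
        </preformat>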
        <p>Table 2: State-of-the-art image classification accuracy for each data source: MNIST (CNN
Ensemble [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), EMNIST-Merged (WaveMix [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]), FMNIST (DARTS [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]), and KMNIST (SpinalNet [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]). Digits are included in the EMNIST-Merged setting, but excluded for ViSudo-PC.</p>
        <p>Additionally, we assess the performance of the three models on the Basic task using 50
4x4 training puzzles from each data source. Table 3 shows the results of the three models on
each data source. The Baseline-Visual is unable to generalize over any of the data sources.
Baseline-Digit, however, performs approximately the same over all data sources, as it does not
use any perceptual information. Unsurprisingly, NeuPSL performs the best on MNIST, which
has the simplest image cell labels. Despite being the most difficult data source for the
state-of-the-art image classifiers, NeuPSL performed second best on EMNIST-ML.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Performance across Multiple Data Sources</title>
        <p>To examine the impact of using multiple data sources for puzzle generation, we ran the Random
Label Per Split task using 4x4 puzzles generated with data from one or more data sources.
As shown by Figure 7, NeuPSL performs well in almost all settings, outperforming
Baseline-Digit, but struggling more whenever KMNIST is used. This result is consistent with NeuPSL’s
previous performance with KMNIST. Baseline-Digit has very consistent performance and is
almost agnostic to the diferent data sources. As expected, Baseline-Visual fails to generalize
and produces consistently poor performance.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Overlap Performance</title>
        <p>To determine the effect of overlap on performance, we ran the three models on the Basic task
using differing amounts of overlapping images in 4x4 puzzles from each data source. As shown in
Figure 8, both the Baseline-Digit and NeuPSL models show a benefit from increasing the amount
of overlap. NeuPSL, which is able to collectively reason, shows much larger improvements as
the amount of overlap is increased than Baseline-Digit, which may just be benefiting from fewer
unique images. Here, Baseline-Visual also fails to generalize and cannot beat random guessing.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>There are several interesting ways in which both the data and tasks in ViSudo-PC can be extended
by including additional structural information. Cell labels can be expanded to hierarchies of
labels, e.g., an MNIST zero may be a part of the cell label hierarchy: 0 → Digit → Alphanumeric
→ Glyph. These hierarchies can then be used to create a variety of new tasks at different
abstraction levels. For example, one could define a task where cell labels sharing a common
hierarchical ancestor are considered the same. Another possible new task involves adding
additional constraints on the Sudoku problem, for example, requiring specific blocks to contain
cells from different label classes, e.g., all numbers or all letters. Recall that in the current set of
tasks, no cell labels are given, and results are not evaluated over cell labels, only over puzzle
classification. Another set of new tasks can be introduced by including cell labels in either
the inference or evaluation process. Finally, an additional interesting direction is introducing
confounding information. For example, following [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], confounding information in the form of
color could be explicitly added to the data generation process.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the National Science Foundation grants CCF-1740850 and
CCF-2023495.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Data Sources</title>
      <p>Examples of each data source are provided in Figure 9. MNIST, FMNIST, and KMNIST have all
their provided classes displayed. EMNIST-ML is shown with 10 of its 47 classes.</p>
      <p>Figure 9: Example images from each data source: (a) MNIST, (b) EMNIST-ML, (c) FMNIST,
(d) KMNIST.</p>
    </sec>
    <sec id="sec-8">
      <title>B. Data Generation</title>
      <p>ViSudo-PC provides a data generation tool that allows users to construct their own ViSudo-PC
datasets (code is available at https://github.com/linqs/visual-sudoku-puzzle-classification). The
settings in Table 4 are used to create the data provided with ViSudo-PC (available at
https://linqs-data.soe.ucsc.edu/public/datasets/ViSudo-PC/v01/). Here, we provide a brief
description of each parameter.</p>
      <p>Table 4: Data generation settings.
Dimension: {4, 9}.
Data Sources: all non-empty subsets of {MNIST, EMNIST-ML, FMNIST, KMNIST}.
Train Count: {1, 2, 5, 10, 20, 30, 40, 50, 100}.
Test Count: 100.
Valid Count: 100.
Overlap: {0.0, 0.5, 1.0, 2.0}.
Corrupt Chance: 0.5.</p>
      <p>Dimension Dimension determines the size of the Sudoku puzzles generated. All Sudoku
puzzles in ViSudo-PC are square, and the dimension d is the number of cells on each side of
the puzzle. Additionally, because Sudoku puzzles require square blocks within each puzzle, the
puzzle dimension must have an integer square root.</p>
      <p>Data Sources Data sources determine the cell labels and images. For many of the tasks
discussed in Section 3, data may come from more than one source.</p>
      <p>Train/Test/Valid Counts The number of correct puzzles to generate for each split. During
the corruption process, the same number of incorrect puzzles will also be generated.</p>
      <p>Overlap The constant λ controls the number of examples that are duplicated, as discussed in
Section 2.2.</p>
      <p>Corruption Chance While generating negative instances as described earlier, the chance
of continuing the corruption process after each corruption is made. This controls how many
mistakes are in the negative examples.</p>
      <sec id="sec-8-1">
        <title>5Code is available at https://github.com/linqs/visual-sudoku-puzzle-classification. 6Data is available at https://linqs-data.soe.ucsc.edu/public/datasets/ViSudo-PC/v01/.</title>
        <p>{ MNIST }
{ MNIST }
Images /
Instance
9 × 9
60,000</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] T. R. Besold, A. S. d'Avila Garcez, S. Bader, H. Bowman, P. M. Domingos, P. Hitzler,
          K. Kühnberger, L. C. Lamb, D. Lowd, P. M. V. Lima, L. de Penning, G. Pinkas, H. Poon,
          G. Zaverucha,
          <article-title>Neural-symbolic learning and reasoning: A survey and interpretation</article-title>
          , arXiv (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>d'Avila Garcez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Serafini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spranger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning</article-title>
          ,
          <source>Journal of Applied Logics</source>
          <volume>6</volume>
          (
          <year>2019</year>
          )
          <fpage>611</fpage>
          -
          <lpage>632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Raedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumančić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manhaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marra</surname>
          </string-name>
          ,
          <article-title>From statistical relational to neurosymbolic artificial intelligence</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>4943</fpage>
          -
          <lpage>4950</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>Visual relationship detection with language priors</article-title>
          ,
          <source>in: European Conference on Computer Vision (ECCV)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>852</fpage>
          -
          <lpage>869</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Donadello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Serafini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Garcez</surname>
          </string-name>
          ,
          <article-title>Logic tensor networks for semantic image interpretation</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          , et al.,
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          (IJCV)
          <volume>123</volume>
          (
          <year>2017</year>
          )
          <fpage>32</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Durán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          ,
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          ,
          <source>in: International Conference on Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>2787</fpage>
          -
          <lpage>2795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Manhaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumancic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kimmig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Demeester</surname>
          </string-name>
          , L. De Raedt,
          <article-title>Deepproblog: Neural probabilistic logic programming</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3753</fpage>
          -
          <lpage>3763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <article-title>An ensemble of simple convolutional neural network models for mnist digit recognition</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.-W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Donti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wilder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <article-title>SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver</article-title>
          ,
          <source>in: International Conference on Machine Learning (ICML)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6545</fpage>
          -
          <lpage>6554</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Pryor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dickens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Augustine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Albalak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          ,
          <article-title>NeuPSL: Neural probabilistic soft logic</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haffner</surname>
          </string-name>
          ,
          <article-title>Gradient-based learning applied to document recognition</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>86</volume>
          (
          <year>1998</year>
          )
          <fpage>2278</fpage>
          -
          <lpage>2324</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Afshar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tapson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>van Schaik</surname>
          </string-name>
          ,
          <article-title>EMNIST: Extending MNIST to handwritten letters</article-title>
          ,
          <source>in: International Joint Conference on Neural Networks (IJCNN)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2921</fpage>
          -
          <lpage>2926</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms</article-title>
          , arXiv (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Clanuwat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bober-Irizar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kitamoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yamamoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <article-title>Deep learning for classical japanese literature</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Manhaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumančić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kimmig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Demeester</surname>
          </string-name>
          , L. De Raedt,
          <article-title>Neural probabilistic logic programming in deepproblog</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>298</volume>
          (
          <year>2021</year>
          )
          <fpage>103504</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          Papers With Code, Image classification, https://paperswithcode.com/task/image-classification,
          <year>2022</year>
          . Accessed: 2022-05-27.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Jeevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sethi</surname>
          </string-name>
          ,
          <article-title>WaveMix: Resource-efficient token mixing for images</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] M. S. Tanveer, M. U. K. Khan, C.-M. Kyung,
          <article-title>Fine-tuning DARTS for image classification</article-title>
          ,
          <source>in: International Conference on Pattern Recognition (ICPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4789</fpage>
          -
          <lpage>4796</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] H. Kabir, M. Abdar, S. M. J. Jalali, A. Khosravi, A. F. Atiya, S. Nahavandi, D. Srinivasan,
          <article-title>SpinalNet: Deep neural network with gradual input</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] B. Kim, H. Kim, K. Kim, S. Kim, J. Kim,
          <article-title>Learning not to learn: Training deep neural networks with biased data</article-title>
          ,
          <source>in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>9012</fpage>
          -
          <lpage>9020</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>