Using Camera-Drones and Artificial Intelligence to Automate Warehouse Inventory

René Kessler (a), Christian Melching (a), Ralph Goehrs (b) and Jorge Marx Gómez (a)
(a) University of Oldenburg, Ammerländer Heerstr. 114-118, Oldenburg, 26129, Germany
(b) abat AG, An der Reeperbahn 10, Bremen, 28127, Germany

Abstract
Inventory is a very important, but also very time-consuming, manual process in warehouse logistics. This paper presents an approach to automating manual inventory using a camera-drone and various AI procedures. Sensor technology such as RFID is thereby avoided, and only the visual representation of the products and goods is used. We developed a custom dataset that was used for the training of an object detection model to extract and count all relevant objects based on an image of the warehouse. Furthermore, we show that different pre-processing steps, and especially image augmentation methods, can significantly influence the performance of such models.

Keywords
inventory, logistics, artificial intelligence, object detection, drones

1. Motivation and problem statement

Logistics plays a crucial role in today's global economy: every company depends on reliable and intact logistics processes to create value. At the same time, logistics is subject to very high margin pressure. On the one hand, improved service is expected from logistics service providers; on the other hand, customers do not want to pay extra for it [1]. If the margin is to be maintained or even increased, this can only be achieved through savings in internal processes. For many companies, digital transformation and the use of so-called smart technologies can be the solution for optimizing processes and procedures [2, 3, 4, 5], since many logistics processes still involve a lot of manual effort, including inventory, which is essential for every company [6]. In practice and science, this is often referred to as Industry 4.0. Four main design principles apply to use cases in this area: networking, information transparency, technical assistance, and decentralized decisions [7]. The approach pursued in this work can also be classified under these principles. Through the combined use of camera-drones and methods for processing the resulting data, such as AI, a human ability can be imitated: the processing of visual signals. As a result, previously manual activities, such as the inventory process in a warehouse, can be automated.

The goal of the inventory is to record the type and number of internal goods and to quantify this inventory with an exact value [6]. The actual inventory process differs not only in company-specific factors but also in the fact that two types of inventories are common. A distinction can be made between physical inventory, i.e.,
counting physical goods, and non-physical inventory, where, for example, financial goods or bank balances are recorded. In this paper, the focus is on physical inventory. According to the German Commercial Code (HGB), German companies are obliged to carry out an inventory at least once a year (Section 240 HGB, https://www.gesetze-im-internet.de/hgb/__240.html, accessed 20.11.2020). While this cycle may be sufficient for companies with little movement of goods, it makes sense to keep shorter cycles, especially for companies in the retail sector. However, the inventory can also be understood as an instrument of quality management concerning transparency, in order to be able to monitor the business goals and their achievement. With the help of an inventory, deficits, faulty processes, non-optimal flows, or even theft can be detected. An inventory always involves a large amount of personnel effort. Depending on the inventory's size and scope, several people are often exclusively occupied with the manual counting of the goods. Inventory not only causes personnel costs, but also disrupts operational processes. Therefore, companies find themselves in a dichotomy between transparency and costs, which is why they often work with samples whose findings are then extrapolated to the entire inventory. This business conflict can be resolved by automating the manual localization and identification of products and goods during inventory, resulting in greater transparency at lower costs [8]. Existing approaches are based on the use of sensor technology and often only consider a very specific sub-area (e.g., individual industries). In this paper, a generic approach is followed, which relies exclusively on the visual representation of the products and is based on deep learning methods. To also automate the acquisition of the images, and thus to evaluate a vehicle for the operationalization of the approach, a drone is used, which has already proven itself in comparable applications [9, 4, 10, 11]. Combined with the current problems in inventory processes described above and the great potential of automating these process steps, this leads to the research question: Which AI procedures are suitable to recognize products on images in order to count them for an inventory?

To solve the described problem, a data-driven approach was followed, oriented towards the established CRISP-DM model [12, 13]. During implementation, we cooperated intensively with a North German beverage distributor, which has several storage locations and a high volume of trade in the B2B sector. Initially, domain knowledge about the storage locations and the inventory process was collected in several workshops and discussions. We were also granted access to the different warehouses in order to record the custom dataset (Section 3). Finally, the results of the experiments were presented to the practice partner, and practical implications were derived.

The paper is structured as follows: In the next section, Related Work (2), an overview of the current research work is given. Section three, Dataset (3), is dedicated to the structure of the specially compiled dataset and its annotations. The fourth section, Experiment and Results (4), represents the core of the work and describes the methods used and their results. In the concluding section, Discussion and Future Work (5), the results are summarized, and an outlook on further work is given.
2. Related work

A search for related work identified numerous publications that deal with drones in logistics and with the use of digital technologies to optimize or even automate the inventory process. Two main lines of research have been identified and are presented below.

Localization and identification using sensor technology: Radio-Frequency Identification (RFID) tags are a widespread and established method to track products and load carriers in a warehouse [14, 15, 16] and to achieve an increase in transparency regarding warehouse movements [17]. Often, RFID readers are attached to a drone, and the drone flies over storage locations and tracks the individual goods [18, 15, 19, 9, 3, 20, 8]. Drones can reach storage locations that would otherwise be difficult to access (e.g., very high storage locations in a high rack) and minimize the risk for the employees [16, 21]. In summary, the identification of loads using RFID, especially in combination with a drone, has proven to save costs and automate inventory processes [17]. However, this always presupposes that the loads are equipped with RFID tags, which represents a further cost factor in logistics.

Reading optical product features or characteristics: There is broad consensus that image processing in logistics can be a great advantage for the traceability and monitoring of goods [22]. If camera-drones are used in inventory management, they are often used as a medium for reading optical product annotations, such as one-dimensional barcodes or QR codes [23, 24, 8]. However, these approaches assume that such annotations exist. For packaged products, barcodes are often available and do not pose a problem. But with unpackaged goods or empties, optical annotations are rarely present. There is thus a need for solutions that focus on the optical representation of the goods themselves. Although both AI-based image processing and the use of drones are considered to have great potential in logistics [25, 10, 8], there are hardly any publications available that combine these two approaches. Freistetter and Hummel (2019) outlined an approach to drone-based inventory in libraries. They flew along bookshelves and identified book spines using computer vision techniques. As soon as a book is in the center of the image, the title of the book is read [26]. This is a purely indoor use case. Despite the fundamental similarity, a use case from a library cannot necessarily be transferred to larger industrial warehouses. Especially concerning disruptive factors (e.g., environmental influences such as changing weather, which can lead to different lighting), recordings in libraries are less affected. Dörr et al. show an approach that deals with a similar use case in the warehouse area. The goal of their approach is product structure recognition based on image data, using several convolutional neural networks stacked on top of each other. For the training of the models, a separate dataset was built [27]. These very recent publications show that the combination of drone-based inventory and AI image processing is currently a subject of research [16]. Especially the approach of Dörr et al. shows many similarities to the approach presented here, e.g., the hierarchical structure with several deep learning models [27]. However, there are also several differences. The approaches differ in their place of use, since our work addresses an outdoor use case. Also, the type of objects to be identified differs.

Figure 1: Exemplary sample from the final dataset
3. Dataset

The use case considered here focuses on the automated recognition of beverage pallets on images. Since there is no public dataset for this specific problem, we developed a custom dataset. The structure and further processing of the dataset are described in the following.

Data Acquisition: A DJI Phantom 4 Pro v2 drone (https://www.dji.com/de/phantom-4-pro-v2/specs) was used to make the video recordings in the warehouses of the practice partner. The drone was selected because it was already available for the research project, so there was no need to purchase a new drone. In addition, the characteristics of this drone are very common, which is why the findings are equally transferable to other drone models. The video recordings were made in Full-HD resolution, at a frame rate of 30 frames per second, and using the built-in image stabilization. To create the greatest possible variance in the data, the recordings were made over four days and at two locations of the beverage dealer. The aim was to capture the weather's influence on the images (e.g., brightness) by recording the images under different conditions. During the project, two days with sunny weather and one day each with cloudy and very cloudy weather were used for the recording. Special attention was also paid to the recorded scenes. We tried to capture all pallet locations of the outdoor warehouse. It could not be avoided that not all types of pallets (e.g., different manufacturers of beverages) are represented equally often in the data, because the stock of the different pallet types varies considerably in reality. After finishing the video recordings, the video data had to be processed. For this purpose, the filmed sequences were viewed manually, and irrelevant scenes were removed (e.g., the drone's starting and landing sequences). Since there are only marginal differences between subsequent frames at 30 frames per second, only every 30th frame of the video clips was transferred to the final dataset when the training dataset was created, to avoid potential overfitting when training the neural network. After the automated extraction of the frames, they were manually reviewed, and faulty or low-quality images were removed. The final dataset consists of 336 images, separated into a train and a test set. The test set only contains images from pallet stacks that are not included in the train set, to avoid memorization. This includes pallets with beverages of previously unseen brands and breweries.

Data Annotation: After image acquisition, annotation was applied using the tool label-studio from Heartex (https://labelstud.io). This tool was used because it offers many possibilities for annotating images and was already used in another context within the research project. In this process, each pallet that was largely visible was annotated using polygons instead of only bounding boxes, in order to make use of the annotation masks later on. Additionally, each polygon was given a class to differentiate between two types of pallets: pallets containing cases of beer and pallets containing other beverages. The resulting annotations were exported and converted to the COCO dataset format [28], as it is one of the standard formats for object detection and segmentation in images and is widely supported by most frameworks. This results in a training set containing 284 images with 5261 annotated polygons and a test set containing 52 images and 1471 polygons.
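The frame-sampling step described above (keeping only every 30th frame of the cleaned video sequences) can be sketched roughly as follows. This is a minimal illustration assuming OpenCV and hypothetical file paths, not the exact tooling used in the project.

    import cv2
    from pathlib import Path

    def extract_frames(video_path: str, out_dir: str, step: int = 30) -> int:
        """Save every `step`-th frame of a video as a JPEG image.

        Mirrors the sampling described in Section 3: at 30 fps, keeping
        every 30th frame yields roughly one image per second of footage.
        """
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        cap = cv2.VideoCapture(video_path)
        saved, index = 0, 0
        while True:
            ok, frame = cap.read()
            if not ok:              # end of video reached
                break
            if index % step == 0:   # keep only every `step`-th frame
                name = f"{Path(video_path).stem}_{index:06d}.jpg"
                cv2.imwrite(str(out / name), frame)
                saved += 1
            index += 1
        cap.release()
        return saved

    # Hypothetical usage on one cleaned warehouse clip:
    # extract_frames("warehouse_clip_01.mp4", "dataset/raw_frames", step=30)

The extracted frames would then still be reviewed manually and annotated as described above.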
4. Experiment and Results

The experiment was conducted as follows. First, a baseline model was used to test the impact of various modifications of the input data on the prediction accuracy. Then, several models using different architectures, selected based on defined criteria, were trained to evaluate their performance when applying the modifications identified with the baseline model. Finally, the best-performing model was used to perform a qualitative evaluation and to identify possible errors.

4.1. Baseline Model

The model used during the following experiments is a Mask R-CNN with a ResNet50 backbone, implemented using Detectron2 [29, 30]. The model was pre-trained on the MSCOCO17 dataset to compensate for the low number of images in the dataset described in Section 3. It was then trained for up to 4000 iterations, where each iteration used a batch of twelve images. Evaluations of the model were performed every 250 iterations during training, and once training was completed. During training and testing, the images were resized to 1000x750 and not modified further. To validate the results of each experiment, it was repeated multiple times; the following metrics are averages over all runs. The same train-test split was used for all experiments.

4.2. Initial Results and Adjustments

The initial baseline model achieved an average precision of 27.06 and 27.59 evaluating bounding boxes and segmentation, respectively, on the test portion of the dataset. As these results are far from sufficient to accurately predict the pallets' positions on images, significant adjustments were necessary to improve the performance of the model.

Figure 2: Example of an image (left), its simplified version (middle) and its annotations (right).

4.2.1. Merging of classes

During the first tests, it became clear that the model had difficulties in classifying the pallets given the annotated classes (beer and other beverages). Not only was the per-class precision much higher on the pallets marked as containing cases of beer (33.647 compared to 20.463, evaluating bounding boxes), but some pallets were also often wrongly classified. While the origin of this problem likely lies within the training data, which contained more annotations and variants of pallets of beer cases, it is challenging to balance the classes to reduce this spread, since nearly all images contain pallets of both categories. To circumvent this problem, the classes were merged to create a model that predicts only the bounding box and segmentation mask of a pallet, without further classifying its contents. If applied like this, the classification task must be processed by a different system, possibly using classifiers or brand detectors. This work has not yet pursued the creation of such a solution further. A model trained on the classless dataset achieves an average precision of 45.95 and 46.70 on bounding boxes and masks, as noted in Table 1. A subsequent manual inspection of the predictions also confirmed the increase in the quantity and quality of the predictions.
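As a rough illustration of this class-merging step, the exported COCO annotation file can be rewritten so that every polygon maps to a single generic "pallet" category. File names and the category id used here are assumptions for the sketch, not the project's actual values.

    import json

    def merge_coco_classes(in_path: str, out_path: str) -> None:
        """Collapse all annotated categories into one generic 'pallet' class."""
        with open(in_path) as f:
            coco = json.load(f)

        # Replace the category list with a single class ...
        coco["categories"] = [{"id": 1, "name": "pallet", "supercategory": "pallet"}]
        # ... and point every annotation at it.
        for ann in coco["annotations"]:
            ann["category_id"] = 1

        with open(out_path, "w") as f:
            json.dump(coco, f)

    # Hypothetical usage on the exported label-studio annotations:
    # merge_coco_classes("annotations_train.json", "annotations_train_merged.json")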
4.2.2. Reduction of image area

Another problem of the model was the detection of small pallets at the edges of the images. This is likely caused by the low number of small objects in the training data and by distortion of the camera lens at the edges of the image. The problem can be addressed by cutting off parts of the left and right edges of the recordings. This should not harm the quality of the inventory process, as the drone flight is planned so that each row appears at least once near the center of the recorded images and is therefore not lost in this process. An example of such a reduction of the image contents is visible in Figure 2, where 50% of the image has been removed, in equal parts on each side. During the training and evaluation process, the input images were resized to 750x750 instead of 1000x750, to better match the aspect ratio of the modified images. The resulting model achieves values far better than before the removal. This is expected, as the problem is simplified significantly. The model achieves a mean Average Precision (mAP) of 47.68 and 46.70 evaluating predicted bounding boxes and segmentation, respectively.

4.2.3. Image augmentation

Image augmentation is widely used in many research projects [31, 32, 33], but can have different effects depending on the problem [27]. Therefore, different image augmentation methods were tested and evaluated based on the performance of the models trained using augmented images. The augmentation of the original images was applied using imgaug (https://github.com/aleju/imgaug) during the loading of the image batch of each iteration. Each image of the batch was augmented individually with random parameters within given boundaries. As imgaug offers a wide variety of methods to augment images, a subset was selected to evaluate their performance on the given problem. We chose augmentations that simulate variations of real recordings, such as MotionBlur, PerspectiveTransform, Contrast, and JpegCompression, to accommodate movement of the drone and varying lighting and weather conditions, while also considering traditionally used methods to adapt the image: Rotation, ScaleXY, FlipLR, and CropAndPad. The following methods were chosen for comparison; a code sketch of such a pipeline is given after the list.

(a) FlipLR - Performs a horizontal flip with a probability of 0.5.
(b) ScaleXY{150,125} - Scales the width and the height of the images using random values from [0.5, 1.5] for ScaleXY150 and [0.75, 1.25] for ScaleXY125.
(c) Rotate{10,20,30} - Rotates the image around its center using random angles of up to 10 degrees for Rotate10, 20 for Rotate20, and 30 for Rotate30. Rotations can be applied in both directions.
(d) Contrast - Increases or decreases the contrast of the image using random values from [0.5, 1.4].
(e) JpegCompression - Reduces the quality of the image by applying JPEG compression using a random degree from [0.7, 0.95].
(f) MotionBlur - Creates a motion blur effect with a kernel of size 7x7.
(g) CropPad{25,50} - Removes a random percentage of the image from all edges and pads the image back to its original size, using random values from [-0.25, 0.25] for CropAndPad25 and from [-0.5, 0.5] for CropAndPad50.
(h) PerspectiveTransform - Transforms the image as if the camera had a different perspective, using random scales from [0.01, 0.1].
(i) FlipLR-ScaleXY150 - Combines FlipLR and ScaleXY150.
(j) CropPad25-ScaleXY150 - Combines CropAndPad25 and ScaleXY150.
(k) Rotate20-ScaleXY150 - Combines Rotate20 and ScaleXY150.

An overview of the effect of these augmentations is displayed in Figure 3. Some of the tested augmentations were omitted in the figure, as their effects are barely visible due to the size of the individual images, or because they are variations or combinations of already displayed effects.
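As an illustration of how such a pipeline can be assembled with imgaug, the sketch below builds the combination used in variant (i), FlipLR plus ScaleXY150, and applies it per image batch. The sampling ranges mirror the list above (JPEG compression is given on imgaug's 0-100 scale), but the integration into the actual training loop is an assumption of this sketch.

    import imgaug.augmenters as iaa
    import numpy as np

    # Variant (i): horizontal flip with p=0.5 combined with scaling of width
    # and height by random factors from [0.5, 1.5] (ScaleXY150).
    augmenter = iaa.Sequential([
        iaa.Fliplr(0.5),
        iaa.Affine(scale={"x": (0.5, 1.5), "y": (0.5, 1.5)}),
    ])

    # Further methods from the list, used in separate comparison runs:
    # iaa.LinearContrast((0.5, 1.4))               # (d) Contrast
    # iaa.JpegCompression(compression=(70, 95))    # (e) JpegCompression
    # iaa.MotionBlur(k=7)                          # (f) MotionBlur
    # iaa.CropAndPad(percent=(-0.25, 0.25))        # (g) CropPad25
    # iaa.PerspectiveTransform(scale=(0.01, 0.1))  # (h) PerspectiveTransform

    def augment_batch(images):
        """Augment each image of a batch individually with random parameters."""
        return [augmenter(image=img) for img in images]

    # Hypothetical usage with a dummy batch of twelve 750x750 RGB frames:
    # batch = [np.zeros((750, 750, 3), dtype=np.uint8) for _ in range(12)]
    # augmented = augment_batch(batch)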
It has to be noted that after the application of one or multiple augmentations, the bounding boxes were recomputed according to the transformed mask, to make them a minimal fit to the object again. Otherwise, some augmentations, such as Rotation, would derive the bounding box from the transformed bounding box, which could be too large to accurately locate the object. In some situations, this recomputation results in a small, but measurable, performance boost for models using the affected augmentations.

4.2.4. Baseline results

To evaluate the different possible modifications, we trained several models using the same settings and evaluated them on the shared test set. For the evaluation of the different methods, we compared both the mAP of the predicted bounding boxes and the mAP of the segmentations, even though they are very similar; the results are displayed in Table 1.

Figure 3: Examples of results using various selected augmentation methods, applied to "Original".

The best-performing model without the use of image augmentation was the model trained using the simplest variation of the images, utilizing both the cutting of edges and the merging of the different classes. It achieved a precision of more than 70 and therefore performs better than most other models, including many models trained using image augmentation. Once image augmentation (Table 1) is considered, the model trained on unaugmented images is outperformed by several different models. While nearly all models using merged classes and cut images perform better, only a few models using a different image base produce comparable or better results. Especially interesting are the effects of specific augmentation methods. While the Rotation augmentation decreased the accuracy of models using images with uncut edges, it increased the precision on images scaled to a 1:1 aspect ratio. Some methods seem to almost always reduce the prediction quality, such as Contrast (d), JpegCompression (e), and MotionBlur (f). While the idea behind using these methods was to make images slightly more corrupted in order to increase the ability to learn from realistic variations of these images, they mostly hurt performance. Other augmentation methods, such as ScaleXY (b), FlipLR (a), and CropPad (g), seem to always improve the results of the trained model, whether used alone or in combination with other methods, contrary to the observation by Dörr et al. This is supported by the fact that the almost always best-performing configuration used the combination of ScaleXY (b) and FlipLR (a).
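To make the baseline setup behind these numbers more concrete, the sketch below shows how a comparable Detectron2 configuration could be assembled: Mask R-CNN with a ResNet50-FPN backbone pre-trained on MS COCO, a single merged class, batches of twelve images, up to 4000 iterations, and evaluation every 250 iterations. Dataset names, file paths, and the output directory are placeholders, not the project's actual configuration, and the exact training script used by the authors may differ.

    import os
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.data.datasets import register_coco_instances
    from detectron2.engine import DefaultTrainer
    from detectron2.evaluation import COCOEvaluator

    # Register the COCO-format pallet dataset (hypothetical paths and names).
    register_coco_instances("pallets_train", {}, "annotations_train_merged.json", "images/train")
    register_coco_instances("pallets_test", {}, "annotations_test_merged.json", "images/test")

    class PalletTrainer(DefaultTrainer):
        """DefaultTrainer with a COCO-style evaluator for the periodic evaluations."""
        @classmethod
        def build_evaluator(cls, cfg, dataset_name, output_folder=None):
            return COCOEvaluator(dataset_name, output_dir=output_folder or cfg.OUTPUT_DIR)

    cfg = get_cfg()
    # Mask R-CNN with a ResNet50-FPN backbone, pre-trained on MS COCO.
    cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")

    cfg.DATASETS.TRAIN = ("pallets_train",)
    cfg.DATASETS.TEST = ("pallets_test",)
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1   # single merged "pallet" class (Section 4.2.1)
    cfg.SOLVER.IMS_PER_BATCH = 12         # batch of twelve images per iteration
    cfg.SOLVER.MAX_ITER = 4000            # up to 4000 iterations
    cfg.TEST.EVAL_PERIOD = 250            # evaluate every 250 iterations
    cfg.OUTPUT_DIR = "./output_pallets"

    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = PalletTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()

The image augmentation discussed in Section 4.2.3 would additionally require a custom dataset mapper that applies the imgaug pipeline while loading each batch; this is omitted here for brevity.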
4.3. Evaluation of other architectures

While the results provided by the different Mask R-CNN models certainly provide valuable information, the architecture itself is no longer state-of-the-art in terms of precision. Therefore, we selected three further models and tested their performance using the results gained with the previous model. The first model we additionally tested is DetectoRS [34], whose innovative characteristic is the use of Recursive Feature Pyramids. It achieves near state-of-the-art performance on the MSCOCO17 dataset and was implemented and trained using MMDetection [35]. The second model selected is Yolact [36], an architecture that is able to generate predictions for the recordings in near real time; it was trained using the MMDetection framework as well. While real-time predictions are not necessary during the stocktaking process, Yolact was selected since such models could easily be used to serve different purposes within the same domain. The third and last model evaluated is DETR [37], due to its innovative approach. DETR utilizes the transformer architecture introduced in the domain of NLP to generate instance segmentations. It was chosen to evaluate whether new and innovative approaches can be applied to the domain of pallets, and it was trained using the code provided by the authors (https://github.com/facebookresearch/detr).

Table 1
Results of various experiments and combinations of possible modifications. Each entry "A/B" denotes the mAP of the bounding boxes and the segmentation, respectively.

Augmentation                Original      Class Merging   LeftRight-Cut   Merging+Cut
No augmentation             27.06/27.59   45.95/46.70     47.68/48.93     70.39/72.19
(a) FlipLR                  30.03/31.13   46.58/47.48     53.58/55.44     73.94/76.01
(b) ScaleXY125              45.01/46.27   67.57/69.28     54.62/56.45     78.97/81.06
    ScaleXY150              49.83/51.00   73.35/75.54     56.01/57.49     78.86/80.93
(c) Rotate10                25.48/26.23   41.89/42.81     52.87/53.46     77.07/77.63
    Rotate20                25.31/26.06   42.96/43.95     50.11/51.35     76.08/76.71
    Rotate30                24.95/25.83   43.56/44.99     47.49/48.43     75.57/76.04
(d) Contrast                26.35/26.89   40.22/40.71     48.97/50.05     68.69/70.15
(e) JpegCompression         26.12/27.06   42.21/43.29     49.35/50.65     67.83/69.35
(f) MotionBlur              25.27/26.30   43.61/44.76     44.89/46.30     71.19/73.01
(g) CropPad25               47.22/48.51   70.81/72.89     55.63/57.39     78.68/80.58
    CropPad50               46.80/47.91   68.97/70.92     52.47/54.07     75.50/77.28
(h) PerspectiveTransform    31.54/32.10   53.13/53.99     54.10/55.42     78.11/79.30
(i) FlipLR-ScaleXY150       51.33/52.71   73.84/75.78     57.09/58.85     81.53/83.41
(j) CropPad25-ScaleXY150    47.08/48.30   71.25/73.24     57.99/59.28     78.13/79.67
(k) Rotate20-ScaleXY150     46.71/47.76   70.92/72.57     56.12/56.50     79.51/79.93

Table 2
Performance values of different models using Merging+Cut and augmentation (i).

Model        mAP (bbox/segm)   Inference time (GPU)   Inference time (CPU)   Parameters
Mask R-CNN   81.5/83.4         0.04 s                 1.72 s                 43,937,313
DetectoRS    88.3/85.6         0.13 s                 Not supported          131,648,615
Yolact       50.3/60.6         0.05 s                 0.59 s                 34,727,123
DETR         74.8/81.8         0.12 s                 2.10 s                 42,835,552

4.3.1. Results

Each of the additional models was tested and evaluated on the test dataset, using the merged and cut variant and the augmentation method (i), which had been shown to increase performance the most. The results are displayed in Table 2. It is clear that DetectoRS outperforms all other models by a significant margin in terms of precision. It achieves a mAP of 88.3 and 85.6 while taking 0.13 s per frame on an NVIDIA® GeForce® RTX 2080 Ti, which also makes it slower than all other models. In contrast, Yolact achieves the lowest mAP and, contrary to expectations, is not the fastest model on the GPU, but is 0.01 s slower than Mask R-CNN. However, Yolact is by far the fastest model when using a CPU. DETR, with its new approach, is both slower and less precise than Mask R-CNN and does not stand out in any metric.

Figure 4: Exemplary predictions using the best performing model, DetectoRS, according to Table 2

In addition to the evaluation based on metrics, timings, and model size, a manual qualitative evaluation was performed. It showed that the mAP values were consistent with the visual impression. In terms of both bounding boxes and segmentation, DetectoRS provides the best results. Mask R-CNN also delivers satisfactory results, while the quality of DETR and Yolact in particular falls off sharply. To give an impression of the quality of the predictions of DetectoRS, some of its predictions are visualized in Figure 4. The generated predictions are predominantly of high to very high quality.
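For qualitative checks like the one just described, predictions of the MMDetection-trained models can be generated and rendered along the following lines. This is a sketch assuming the MMDetection 2.x-style inference API; config and checkpoint paths are placeholders, and the exact visualization behind Figure 4 is not specified in the paper.

    from mmdet.apis import init_detector, inference_detector

    # Hypothetical paths to the DetectoRS config and trained weights.
    config_file = "configs/detectors/detectors_pallets.py"
    checkpoint_file = "work_dirs/detectors_pallets/latest.pth"

    # Build the model once and reuse it for all test frames.
    model = init_detector(config_file, checkpoint_file, device="cuda:0")

    # Run inference on a single warehouse frame.
    result = inference_detector(model, "images/test/frame_000123.jpg")

    # Render boxes and masks above a score threshold for manual review.
    model.show_result(
        "images/test/frame_000123.jpg",
        result,
        score_thr=0.5,
        out_file="predictions/frame_000123_pred.jpg",
    )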
In a few cases, however, there are (partly) incorrect predictions (Figure 5). Three typical errors can be described as follows:

1. Recognition of side views of pallets: Despite the labeling strategy and the pre-processing steps, for images with a very specific acquisition angle, namely whenever the side views occupy a large image area, isolated incorrect identifications of pallets occur, in which side views are given a bounding box and mask.

2. Individual pallets are not recognized: The evaluation has shown that in rare cases individual pallets are not recognized. The special feature here is that this error always affects at most two pallets standing next to each other; all other pallets in these images were recognized completely and without errors. Further optimization of the detector parameters (e.g., thresholding) will most likely resolve this error.

3. Strongly overlapping bounding boxes: Pallets with boxes of different colors sometimes have overlapping bounding boxes. This has no consequences for the pallet recognition itself, but it could lead to problems in subsequent steps, such as the classification of the pallets. To solve this problem, further methods could be applied in the preprocessing of the images.

Figure 5: Visualization of Predictions with Errors

5. Discussion and Future Work

In this work, we present an initial step towards automated inventory using images recorded by a drone and an AI-based object detector to identify the location of pallets on the recorded images. Various modifications have been tested to increase the accuracy of the predicted bounding boxes and segmentation masks, partly without compromising the quality and direct usability of the results. In doing so, we also showed that image augmentation methods can increase the precision of models significantly, contrary to the observations in related projects [27]. In summary, it has been shown that horizontal flipping and image scaling as image augmentation techniques can have a positive impact on performance during model training. In particular, the model architecture DetectoRS showed very good results in the experiments, measured by mAP (bounding boxes and segmentation), although it was not the fastest in the comparison. While already delivering promising results, there are certain factors limiting the usage of the developed models in practice.

Firstly, a solution for the classification of the type or class of beverages on the pallets must be developed and integrated. While it would have been preferable to predict the class using the same model that also predicts the location, tests showed that this impacted localization performance significantly. Secondly, while the detection of the front of the pallets gives valuable information to automate the inventory process, additional information is needed to complete it. This mainly includes the length-wise number of stacks of pallets in their row, which could be recorded from above. Finally, regardless of the technical implementation, discussions and tests with the practice partners have shown that the organization in the warehouse must also be changed if drones are to be used. Even though drones are very flexible and can reach storage areas that are difficult for humans to reach, processes in the warehouse must be adapted to ensure that drones can be used safely and efficiently [38].

References

[1] M. Hompel, B. Otto, Essay zur Logistik 4.0 (2015). doi:10.13140/RG.2.1.2857.4245.
[2] pwc, Five forces transforming transport logistics, 2019. URL: https://www.pwc.pl/pl/pdf/publikacje/2018/transport-logistics-trendbook-2019-en.pdf.
[3] C. Cimini, A. Lagorio, F. Pirola, R. Pinto, Exploring human factors in logistics 4.0: empirical evidence from a case study, IFAC-PapersOnLine 52 (2019) 2183-2188. URL: http://www.sciencedirect.com/science/article/pii/S2405896319315137. doi:10.1016/j.ifacol.2019.11.529. 9th IFAC Conference on Manufacturing Modelling, Management and Control MIM 2019.
[4] M. Maslarić, S. Nikolicic, D. Mirčetić, Logistics response to the industry 4.0: The physical internet, Open Engineering 6 (2016). doi:10.1515/eng-2016-0073.
[5] E. Companik, M. J. Gravier, M. T. F. II, Feasibility of warehouse drone adoption and implementation, Journal of Transportation Management 28(2) (2018) 33-50.
[6] H. Gleissner, J. C. Femerling, IT in Logistics, Springer International Publishing, Cham, 2013, pp. 189-223. URL: https://doi.org/10.1007/978-3-319-01769-3_9. doi:10.1007/978-3-319-01769-3_9.
[7] M. Hermann, T. Pentek, B. Otto, Design principles for industrie 4.0 scenarios, in: 2016 49th Hawaii International Conference on System Sciences (HICSS), 2016, pp. 3928-3937. doi:10.1109/HICSS.2016.488.
[8] T. Fernández-Caramés, O. Blanco-Novoa, I. Froiz-Míguez, P. Fraga-Lamas, Towards an autonomous industry 4.0 warehouse: A UAV and blockchain-based system for inventory and traceability applications in big data-driven supply chain management, Sensors 19 (2019) 2394.
[9] C. S. Tang, L. P. Veelenturf, The strategic role of logistics in the industry 4.0 era, Transportation Research Part E: Logistics and Transportation Review 129 (2019) 1-11. URL: http://www.sciencedirect.com/science/article/pii/S1366554519306349. doi:10.1016/j.tre.2019.06.004.
[10] DHL Trend Research, The logistics trend radar (5th edition), 2020. URL: https://www.dhl.com/content/dam/dhl/global/core/documents/pdf/glo-core-logistics-trend-radar-5thedition.pdf.
[11] A. Lotz, Drones in logistics: A feasible future or a waste of effort (2015). URL: https://scholarworks.bgsu.edu/honorsprojects/204.
[12] V. Grover, K. Lyytinen, New state of play in information systems research: The push to the edges, MIS Q. 39 (2015) 271-296.
[13] C. Shearer, The CRISP-DM model: The new blueprint for data mining, Journal of Data Warehousing 5 (2000) 13-22.
[14] E. Ilie-Zudor, Z. Kemény, F. van Blommestein, L. Monostori, A. van der Meulen, A survey of applications and requirements of unique identification systems and RFID techniques, Computers in Industry 62 (2011) 227-252. URL: http://www.sciencedirect.com/science/article/pii/S0166361510001521. doi:10.1016/j.compind.2010.10.004.
[15] P. Thanapal, J. Prabhu, M. Jakhar, A survey on barcode RFID and NFC, IOP Conference Series: Materials Science and Engineering 263 (2017) 042049. doi:10.1088/1757-899X/263/4/042049.
[16] D. Cristiani, F. Bottonelli, A. Trotta, M. Di Felice, Inventory management through mini-drones: Architecture and proof-of-concept implementation, in: 2020 IEEE 21st International Symposium on "A World of Wireless, Mobile and Multimedia Networks" (WoWMoM), 2020, pp. 317-322. doi:10.1109/WoWMoM49955.2020.00060.
[17] Y. Y. Cheung, K. L. Choy, C. W. Lau, Y. K. Leung, The impact of RFID technology on the formulation of logistics strategy, in: PICMET '08 - 2008 Portland International Conference on Management of Engineering Technology, 2008, pp. 1673-1680. doi:10.1109/PICMET.2008.4599787.
[18] P. Jhunjhunwala, M. Shriya, E. Rufus, Development of hardware based inventory management system using UAV and RFID, in: 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN), 2019, pp. 1-5. doi:10.1109/ViTECoN.2019.8899488.
[19] B. Rahmadya, R. Sun, S. Takeda, K. Kagoshima, M. Umehira, A framework to determine secure distances for either drones or robots based inventory management systems, IEEE Access 8 (2020) 170153-170161. doi:10.1109/ACCESS.2020.3024963.
[20] M. Beul, D. Droeschel, M. Nieuwenhuisen, J. Quenzel, S. Houben, S. Behnke, Fast autonomous flight in warehouses for inventory applications, IEEE Robotics and Automation Letters 3 (2018) 3121-3128. doi:10.1109/LRA.2018.2849833.
[21] J. P. Škrinjar, P. Škorput, M. Furdić, Application of unmanned aerial vehicles in logistic processes, in: I. Karabegović (Ed.), New Technologies, Development and Application, Springer International Publishing, Cham, 2019, pp. 359-366.
[22] H. Borstell, A short survey of image processing in logistics - how image processing contributes to efficiency of logistics processes through intelligence, 2018. doi:10.13140/RG.2.2.11060.76168.
[23] I. Kalinov, A. Petrovsky, V. Ilin, E. Pristanskiy, M. Kurenkov, V. Ramzhaev, I. Idrisov, D. Tsetserukou, WareVision: CNN barcode detection-based UAV trajectory optimization for autonomous warehouse stocktaking, IEEE Robotics and Automation Letters 5 (2020) 6647-6653. doi:10.1109/LRA.2020.3010733.
[24] S. Hong-ying, The application of barcode technology in logistics and warehouse management, in: 2009 First International Workshop on Education Technology and Computer Science, volume 3, 2009, pp. 732-735. doi:10.1109/ETCS.2009.698.
[25] L. Wawrla, O. Maghazei, T. Netland, Application of drones in warehouse operations, 2019. URL: www.pom.ethz.ch, whitepaper from ETH Zurich (Chair of Production and Operations Management).
[26] A. Freistetter, K. A. Hummel, Human-drone teaming: Use case bookshelf inventory, in: Proceedings of the 9th International Conference on the Internet of Things, IoT 2019, Association for Computing Machinery, New York, NY, USA, 2019. URL: https://doi.org/10.1145/3365871.3365913. doi:10.1145/3365871.3365913.
[27] L. Dörr, F. Brandt, M. Pouls, A. Naumann, Fully-automated packaging structure recognition in logistics environments, in: 2020 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), volume 1, 2020, pp. 526-533. doi:10.1109/ETFA46521.2020.9212152.
[28] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, CoRR abs/1405.0312 (2014). URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.
[29] K. He, G. Gkioxari, P. Dollár, R. B. Girshick, Mask R-CNN, CoRR abs/1703.06870 (2017). URL: http://arxiv.org/abs/1703.06870. arXiv:1703.06870.
[30] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, Detectron2, https://github.com/facebookresearch/detectron2, 2019.
[31] M. Tan, R. Pang, Q. V. Le, EfficientDet: Scalable and efficient object detection, CoRR abs/1911.09070 (2019). URL: http://arxiv.org/abs/1911.09070. arXiv:1911.09070.
[32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, A. C. Berg, SSD: Single shot multibox detector, CoRR abs/1512.02325 (2015). URL: http://arxiv.org/abs/1512.02325. arXiv:1512.02325.
[33] X. Chen, R. Girshick, K. He, P. Dollár, TensorMask: A foundation for dense object segmentation, 2019. arXiv:1903.12174.
[34] S. Qiao, L.-C. Chen, A. Yuille, DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution, 2020. arXiv:2006.02334.
[35] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, D. Lin, MMDetection: Open MMLab detection toolbox and benchmark, arXiv preprint arXiv:1906.07155 (2019).
[36] D. Bolya, C. Zhou, F. Xiao, Y. J. Lee, YOLACT: Real-time instance segmentation, in: ICCV, 2019.
[37] M. Zheng, P. Gao, X. Wang, H. Li, H. Dong, End-to-end object detection with adaptive clustering transformer, 2020. arXiv:2011.09315.
[38] E. H. C. Harik, F. Guérin, F. Guinand, J. Brethé, H. Pelvillain, Towards an autonomous warehouse inventory scheme, in: 2016 IEEE Symposium Series on Computational Intelligence (SSCI), 2016, pp. 1-8. doi:10.1109/SSCI.2016.7850056.