
Many real-life machine and computer vision applications focus on object detection and recognition. In recent years, deep learning-based approaches have gained increasing interest due to their high accuracy. License Plate (LP) detection and classification have been studied extensively over the last decades. However, more accurate and language-independent approaches are still required. This paper presents a new approach to detect LPs and recognize their country, language, and layout. Furthermore, a new LP dataset for both multi-national and multi-language detection, with either one-line or two-line layouts, is presented. The YOLOv2 detector with a ResNet feature extraction core was utilized for LP detection, and a new low-complexity convolutional neural network architecture was proposed to classify LPs. Results show that the proposed approach achieves an average detection precision of 99.57%, whereas the country, language, and layout classification accuracy is 99.33%.


INTRODUCTION
Object detection and classification have attracted a lot of research in recent years, driven by advances in vision technology, computer technology, and deep learning algorithms [1]. Object detection aims to estimate the location of objects of interest contained in an image, while object classification aims to assign an object to one of a fixed number of categories [2]. Traditional object detection and classification approaches have three steps, namely informative region selection, feature extraction, and classification. In region selection, the entire image can be scanned using a multiscale sliding window, as numerous objects may appear in different locations with various sizes and aspect ratios [1]. Feature extraction aims to obtain visual features providing a semantic and robust representation. Some popular feature extraction methods used in the literature are Haar-like features [3], Scale-Invariant Feature Transform (SIFT) [4], Histogram of Oriented Gradients (HOG) [5], and hybrid feature selection techniques [6]. Classification aims to assign a target object to one of many categories. Traditional classification approaches include Support Vector Machine (SVM) [7], AdaBoost [8], and Deformable Part-based Models (DPM) [9]. Recent breakthroughs in Convolutional Neural Network (CNN)-based approaches [10] attracted researchers to use Regions with CNN (R-CNN) features for object detection [11]. CNN-based methods can learn complex features with deeper architectures and use training algorithms to learn informative object representations without the need to design the features manually [12]. Furthermore, researchers have extensively studied various CNN models such as AlexNet [10], VGG [13], GoogLeNet [14], ResNet [15], and FDREnet [16] to improve the accuracy of classification and regression problems in machine learning. Generic object detection refers to detecting objects from predefined classes and obtaining their spatial location (e.g. bounding box) inside an image. It can typically be categorized into two types, namely regression/classification-based and region-based methods [17]. Region-based methods include R-CNN [11], Fast R-CNN [18], Faster R-CNN [19], and Mask R-CNN [20]. On the other hand, regression/classification-based methods include YOLO (You Only Look Once) [21], SSD [22], YOLOv2 [23], and YOLOv3 [24].
Automatic License Plate Recognition (ALPR) is a group of techniques that use License Plate Detection (LPD), character segmentation, and character recognition on images to identify vehicle LP numbers. ALPR is also referred to as License Plate Detection and Recognition (LPDR). ALPR is used in various real-life applications such as parking systems, electronic toll collection, and traffic security and control [25]. State-of-the-art object detection algorithms based on deep learning have provided promising results for LP country and layout classification. However, the multi-orientation and multi-scale nature of LPs, in addition to distortion and illumination issues, makes LPD a challenging task [26]. LPD using deep learning has been extensively studied over the last decade. Authors in [27] proposed the use of a CNN-based Multi-Directional (MD)-YOLO framework for LPD, but their method does not successfully detect small LPs. In [28] a Faster R-CNN approach was presented, which first detects vehicle regions and then locates the LP in each vehicle region. Its performance evaluation results showed 98.39% precision and 96.83% recall. A new approach was proposed in [29], referred to as YOLO-L, where the prospective number and size of LP candidate boxes are selected using "k-means++" clustering with a modified YOLOv2 model and pre-identification to distinguish LPs from similar objects. This method achieved a precision of 98.86% and a recall of 98.86%. Researchers in [30] introduced the largest Brazilian LP dataset, referred to as the UFPR dataset, and proposed a four-stage LPDR system comprising vehicle detection, LPD, character segmentation, and character recognition. The LPD stage, which used Fast-YOLO with a CR-CNN core, obtained a recall of 98.33%.
Furthermore, researchers in [31] introduced a large and comprehensive Chinese LP dataset called CCPD, and proposed an end-to-end LPDR system using RPnet in the LPD phase, comparing its detection Average Precision (AP) results with those of the SSD, YOLOv2, and Faster R-CNN detection techniques using 250k unique car LPs.
On the other hand, little research has been performed on multi-language and multi-national LP detection, mostly due to the lack of international LP datasets. Nevertheless, a few recent studies focused on developing a global end-to-end ALPR system, as reported in [32]. Authors in [32] proposed an approach for multi-national license plate detection for images with complex backgrounds, in which the YUV color space was initially used for detecting the rear vehicle lights, and the LP area was detected using a histogram-based approach on the edge energy map. The utilized dataset comprised LPs from America, China, Serbia, Italy, Pakistan, United Arab Emirates (UAE), and Hungary. The dataset contained only single-line LPs, and the approach obtained a detection accuracy of 90%. Researchers in [33] used VGG with LSTM to classify the registration country of LPs from Latvia, Lithuania, Estonia, Russia, Sweden, Poland, Germany, Finland, and Belarus. A recent study used tiny YOLOv3 to detect LPs from South Korea, Taiwan, Greece, USA, and Croatia [34]. Several approaches expressed interest in multi-national LPs, but they tested their detectors on each country's dataset separately, rather than accumulating them into one dataset [35][36][37][38]. Moreover, multi-language LPs were addressed in a few approaches. Authors in [38] proposed a Mask R-CNN detector for LPs with English and Arabic characters from USA and Tunisia. In [39] Korean and English LPs were targeted, using the term multi-style detection to refer to different country, language, and one or two-line LP styles. Most of the reported studies addressed the LP Classification (LPC) problem inside the LPD stage. In these cases, the detector determines the bounding box and, at the same time, gives the class label of an LP. However, in [32,37] multi-national LPD was presented by just detecting LPs, without providing any other information on nation, language, or layout.
In [33] the classification of detected LPs by the issuing country was studied, reporting a classification accuracy of 92.8%. On the other hand, authors in [39] proposed a module to classify the detected LPs as single-line or double-line, reporting only the results of the entire system and not the accuracy of this module.
In this paper, multi-national LPs from USA, Europe (EU), Turkey (TR), UAE, and Kingdom of Saudi Arabia (KSA) are targeted, using a YOLOv2 detector with ResNet50 feature extraction for LPD. For this purpose, a new dataset, named LPDC2020, was constructed and presented. After the segmentation of the detected LPs, a CNN was used to classify the country, language, and one or two-line layout of the LP. The proposed detector and classifier were also tested on several benchmark datasets from those countries, in addition to LPDC2020. The proposed approach aims to close the gap in the multi-national, multi-language, and multi-layout LP detection problem by utilizing a single unified system, and to the best of our knowledge it is the first and only study incorporating LPs from North and South America, Europe, and the Middle East (TR, UAE, and KSA).

A. LP Datasets Available in the Literature
Most of the LP datasets frequently used in previous research are available online, and their details are summarized in Table I. Private datasets that are not publicly accessible are disregarded.

B. LPDC2020 Dataset
This paper introduces a new LP dataset, named LPDC2020, which was collected manually using mobile cameras in Turkey. It has two image sets: vehicular images to train the LPD module, and cropped LP images to train the LPC module. In addition, due to the lack of publicly available Arabic LP datasets, images of KSA and UAE LPs available on the internet were used. All images were processed and annotated manually in a labor-intensive process. Table II shows the number of LPD images collected for each country. Some sample LPs from different countries with one and two-line layouts included in the dataset are shown in Figure 1. Table III shows the structure of the LPDC2020 classification dataset. Taking one and two-line layouts into account, the LPC dataset incorporates 11 different classes. The total number of cropped LP images is 29030, containing LP images from the previously mentioned countries.

III. FUNDAMENTALS OF CNN
The fundamental building blocks of any CNN are the convolutional layers, which consist of learnable filters with a small spatial size and a specific depth. For an input image I and kernel K, the general equation of 2D convolution [46] used in computer vision and machine learning is defined as:
$$S(i,j) = (I * K)(i,j) = \sum_{m}\sum_{n} I(m,n)\,K(i-m,\,j-n) \tag{1}$$
with i and m being the row indexes, while j and n are the column indexes. The activation layer produces the output value of a neuron by applying a certain activation function to a given input value. An example is the Rectified Linear Unit (ReLU) [10], where the output is zero for negative input values and equal to the input in any other case. The second important part of a CNN is the pooling layer, which is responsible for reducing the input's spatial size by keeping the most important activations. This reduces the amount of computation and the number of learnable parameters. A dropout layer is used to combat overfitting by randomly omitting some neurons in each training step, setting their activation values to zero.
As a result, the network can learn using a random combination of neurons. The Fully Connected (FC) layer, also called a dense layer [47], is the third important part of CNNs. Each neuron in the input layer is connected to all output neurons of this layer. The purpose of the FC layer is to learn non-linear combinations of features. For an input vector x, learnable weight matrix W, and learnable bias vector b, the output of the fully connected layer y can be expressed as:
$$y = Wx + b \tag{2}$$
At the end of the architecture, i.e. after the last fully connected layer, a Softmax layer is used. This layer is used for classification problems, providing a probabilistic interpretation of the input with respect to the sum of all input exponentials, declared as:
$$\sigma(x)_k = \frac{e^{x_k}}{\sum_{j} e^{x_j}} \tag{3}$$
This layer is also called the loss function layer, since during training a loss function is applied at the end of the CNN. In general, for N samples, the Mean Square Error (MSE) can be used in object detection as in (4) and the cross-entropy function is used for classification problems as in (5) [47]:
$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \tag{4}$$
$$CE = -\sum_{i=1}^{N} y_i \log\left(\hat{y}_i\right) \tag{5}$$
where $y_i$ is the i-th actual output, and $\hat{y}_i$ is the i-th predicted output.
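The building blocks described in this section can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work; the function names are hypothetical, the convolution covers only fully overlapping positions (stride 1, no padding), and the cross-entropy is written for a single one-hot target without averaging, which is an assumption.

```python
import math


def conv2d_valid(image, kernel):
    """2D convolution (flipped kernel) over fully overlapping positions only."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for m in range(kh):
                for n in range(kw):
                    # Flipping the kernel indices is what distinguishes
                    # convolution from cross-correlation.
                    acc += image[i + m][j + n] * kernel[kh - 1 - m][kw - 1 - n]
            row.append(acc)
        output.append(row)
    return output


def relu(values):
    """Zero for negative inputs, identity otherwise."""
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """Fully connected layer: y = W x + b (weights given as a list of rows)."""
    return [sum(w * v for w, v in zip(w_row, x)) + b
            for w_row, b in zip(weights, bias)]


def softmax(scores):
    """Exponentials normalized by their sum; the max is subtracted for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_square_error(actual, predicted):
    """Mean of the squared differences between actual and predicted outputs."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def cross_entropy(actual, predicted):
    """Negative sum of actual * log(predicted); zero-valued targets are skipped."""
    return -sum(a * math.log(p) for a, p in zip(actual, predicted) if a > 0)
```

Note that many deep learning frameworks implement the convolutional layer as cross-correlation (without the kernel flip); since the kernels are learned, the two are equivalent in practice.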
IV. PROPOSED APPROACH
This research addresses two problems: the detection of an LP in an image, and the classification of the detected LP's country, language, and layout.

A. License Plate Detection
The proposed approach uses the YOLOv2 detector with the ResNet50 [15] network as the core CNN of the LP detector. The utilized ResNet50 architecture is displayed in Table IV. The input layer size of ResNet50 was redesigned to be 672×672 instead of the original 224×224 pixels, because the original size did not provide adequate features for LPD: when the original vehicular image is small, it is difficult to detect the LP region after reducing its resolution. Naturally, there is a restriction on the minimum LP size required inside the detector's input image, due to the network forward propagation size of ResNet50 (the ratio of the input size to the output feature map size), which is 224/7=32. Hence, LPs sized 32×32 pixels will correspond to a single point in the output feature map and, consequently, any smaller regions will vanish. The proposed detector core network was designed to have a forward propagation size of 672/42=16. To this end, the first 40 layers of ResNet50 were used in the proposed YOLOv2 core CNN, so that the 672×672 pixel input yields a 42×42 output feature map. The minimum LP size was accordingly set to 16×16 pixels. It should be noted that smaller LPs can still be detected, but with lower precision. In addition, the proposed approach can detect LPs sized up to 670×670 pixels. Figure 2 shows the block diagram of the proposed approach. The proposed detector had 27992604≈28M total learnable parameters.
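The size arithmetic above can be checked with a short sketch. The helper names are hypothetical; the only inputs are the input and feature map sizes quoted in the text.

```python
def downsampling_factor(input_size, feature_map_size):
    """Input pixels represented by one feature-map cell along one axis."""
    if input_size % feature_map_size != 0:
        raise ValueError("input size must be a multiple of the feature map size")
    return input_size // feature_map_size


def cells_spanned(lp_size, factor):
    """Number of feature-map cells covered by an LP side of lp_size pixels."""
    return lp_size / factor


original = downsampling_factor(224, 7)    # ResNet50 at its native input size
proposed = downsampling_factor(672, 42)   # truncated core used by the detector

print(original, proposed)                 # 32 16
print(cells_spanned(32, original))        # 1.0: a 32-pixel LP maps to one cell
print(cells_spanned(16, original))        # 0.5: a 16-pixel LP is smaller than a cell
print(cells_spanned(16, proposed))        # 1.0: the same LP maps to one cell
```

Under this reading, halving the factor from 32 to 16 is what allows 16×16 pixel LPs to still occupy a full cell of the output feature map.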
The YOLOv2 detector divides the input image into an S×S grid, where S is the output feature map size of the YOLOv2 core ResNet40 (i.e. the output of the Conv4 layer), and S was set to 42. The LP sizes in LPDC2020 were analyzed to select the anchor boxes, using the pyramid of anchors method of Faster R-CNN [19]. As shown in Figure 3, LP sizes span a range of 10 to 670 pixels. Hence, in order to select anchor boxes with a high Intersection over Union (IOU), six minimum LP sizes were used. These sizes were defined as 10×10, 10×20, 10×30, 10×40, 10×50, and 30×14 pixels, with 15 pyramid levels and an anchor box pyramid scale of 1.3. As a result, 6×15=90 anchor boxes with a minimum IOU of 0.625 and a mean IOU of 0.85 were obtained. According to (6), the proposed last YOLOv2 layer had 540 filters, which is consistent with 90 anchor boxes that each predict four box coordinates, one objectness score, and one class score.

B. License Plate Classification
A simple CNN was designed for LP classification, and its accuracy is compared to that of VGG [13]. The input image size is set to 224×224 pixels, the same as the input size of the VGG network, for a fair comparison. The classification CNN construction is shown in Table V. The proposed classifier design has a total of 2635773≈2.64M learnable parameters, far fewer than the 138M learnable parameters of VGG. Both a Batch Normalization (BN) [48] layer and a ReLU non-linear activation layer [10] follow each convolutional layer. BN normalizes the input batch mean and standard deviation, and then performs scaling and shifting based on learnable scale and shift parameters [48]. All convolution kernels have a size of 5×5 with stride 1 and no padding. Hence, each convolutional layer results in a dimension shrinkage of 4 rows/columns. The dimension of the output feature map was computed according to (7):
$$W_{out} = \frac{W_{in} - W_{k} + 2P}{W_{s}} + 1 \tag{7}$$
where $W_{out}$ is the output feature map width, $W_{in}$ is the input feature map width, $W_{k}$ is the kernel width, $P$ is the padding, and $W_{s}$ is the kernel stride in the horizontal direction. For the input/output height relation, (7) can be applied using $H$ instead of $W$.
The input size is 224×224×3. After 4 pooling and 8 convolutional layers, the output size is reduced to 6×6×128. After that, the Conv9 and Conv10 layers shrink the output to 1×1×512 neurons. Using this design, the input image is convolved to a single neuron with 512 channels. Afterwards, these neurons are fitted to 11 classes in the FC layer by applying (2). This layer weights all input neurons and forwards them to the Softmax layer, which provides a score for the 11 classes and performs the classification task as described in (3). It is worth noting that the proposed design is a simple stacked CNN with a low number of learnable parameters.

Table V. Classification CNN construction

| Layer | Filter size | Output size | Learnable parameters |
|---|---|---|---|
| Input | - | 224×224×3 | - |
| Conv1 | 5×5×32 | 220×220×32 | 2496 |
| Conv2 | 5×5×32 | 216×216×32 | 25696 |
| Maxpooling | 2×2 | 108×108×32 | - |
| Conv3 | 5×5×64 | 104×104×64 | 51392 |
| Conv4 | 5×5×64 | 100×100×64 | 102592 |
| Maxpooling | 2×2 | 50×50×64 | - |
| Conv5 | 5×5×96 | 46×46×96 | 153888 |
| Conv6 | 5×5×96 | 42×42×96 | 230688 |
| Maxpooling | 2×2 | 21×21×96 | - |
| Conv7 | 5×5×128 | 17×17×128 | 307584 |
| Conv8 | 5×5×128 | 13×13×128 | 409984 |
| Maxpooling | 2×2 | 6×6×128 | - |
| Conv9 | 5×5×256 | 2×2×256 | 819968 |
| Conv10 | 2×2×512 | 1×1×512 | 525824 |
| Fully connected | 11 | 1×1×11 | 5643 |
| Softmax | - | 1×1×11 | - |
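The output sizes and per-layer parameter counts listed in Table V can be cross-checked with a short sketch. It assumes that each convolutional layer carries one bias per filter plus two learnable BN parameters (scale and shift) per channel, and that max pooling uses a stride equal to its window with the result rounded down; the names below are hypothetical.

```python
def output_size(size, kernel, stride=1, padding=0):
    """Output width (or height) of a convolution or pooling layer."""
    return (size - kernel + 2 * padding) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Weights + biases + BN scale and shift (assumed: 2 per channel)."""
    return kernel * kernel * in_channels * filters + filters + 2 * filters


def fc_params(inputs, outputs):
    """Weights + biases of a fully connected layer."""
    return inputs * outputs + outputs


# (kind, kernel size, number of filters) in the order given in Table V.
LAYERS = [
    ("conv", 5, 32), ("conv", 5, 32), ("pool", 2, None),
    ("conv", 5, 64), ("conv", 5, 64), ("pool", 2, None),
    ("conv", 5, 96), ("conv", 5, 96), ("pool", 2, None),
    ("conv", 5, 128), ("conv", 5, 128), ("pool", 2, None),
    ("conv", 5, 256), ("conv", 2, 512),
]


def walk(input_size=224, channels=3):
    """Return the spatial size after each layer and the conv parameter counts."""
    sizes, params = [], []
    size = input_size
    for kind, kernel, filters in LAYERS:
        if kind == "conv":
            params.append(conv_params(kernel, channels, filters))
            size = output_size(size, kernel)
            channels = filters
        else:
            # Pooling window equals its stride; odd sizes are rounded down.
            size = output_size(size, kernel, stride=kernel)
        sizes.append(size)
    return sizes, params


sizes, params = walk()
print(sizes)
print(params, fc_params(512, 11))
```

With these assumptions, the computed sizes and counts reproduce the per-layer entries of Table V, including the reduction from 13×13 to 6×6 in the last pooling layer.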

C. Practical Aspects
The training process used Stochastic Gradient Descent with Momentum (SGDM) [46]. SGDM training was carried out for 10 epochs, with the initial Learning Rate (LR) reduced by a drop factor of 0.5 every 2 epochs. The training set was shuffled every epoch. In YOLOv2 training, the mini-batch size was only six images, due to memory constraints, and the LR was set to 1×10⁻⁵. For the LP classification CNN, the mini-batch size was 120 images and the LR was set to 2.5×10⁻². After the first results, model parameter tuning was applied to continue training, using Adam adaptive learning rate optimization [46]. With Adam, the batch size was doubled and the LR was halved every 10 epochs, as long as the final error showed improvement.
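The SGDM learning rate schedule described above can be sketched as follows, assuming that the rate is multiplied by the drop factor after every 2 completed epochs and that epochs are counted from 1. The function name is hypothetical.

```python
def piecewise_lr(initial_lr, epoch, drop_factor=0.5, drop_period=2):
    """Learning rate in a given (1-based) epoch under a piecewise drop schedule."""
    if epoch < 1:
        raise ValueError("epoch numbering starts at 1")
    return initial_lr * drop_factor ** ((epoch - 1) // drop_period)


# Detector (initial LR 1e-5) and classifier (initial LR 2.5e-2) over 10 epochs.
for epoch in range(1, 11):
    print(epoch, piecewise_lr(1e-5, epoch), piecewise_lr(2.5e-2, epoch))
```

Under this assumption, the rate is halved four times over the 10 epochs, so the last two epochs run at 1/16 of the initial value.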

V. RESULTS AND DISCUSSION
A MATLAB environment was used to evaluate the proposed approach. A GeForce GTX 1060 GPU with 6 GB of memory and a compute capability of 6.1 was used for training and testing. The next subsections describe the evaluation criteria and results for both LPD and LPC.

A. LPD
The LP detection performance was evaluated using Precision (P), Recall (R), and Average Precision (AP) values. Any detected LP bounding box having an IOU greater than 0.5 with the ground truth bounding box is considered a correct detection. P is the ratio of the number of correctly detected LPs to the total number of detected LPs. R is the ratio of the number of correctly detected LPs to the total number of ground truth LPs. AP is the area under the precision-recall curve. P and R are calculated by (8) and (9), where TP is the number of true positive, FP of false positive, and FN of false negative detections:
$$P = \frac{TP}{TP + FP} \tag{8}$$
$$R = \frac{TP}{TP + FN} \tag{9}$$
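The evaluation criteria above can be illustrated with a minimal sketch. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates, and each ground truth LP is matched greedily to at most one detection; both choices, as well as the function names, are assumptions. AP is not computed here.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truths, threshold=0.5):
    """Precision and recall with greedy one-to-one matching (assumed strategy)."""
    matched = set()
    tp = 0
    for det in detections:
        best, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(det, gt)
            if overlap > best:
                best, best_idx = overlap, idx
        # A detection is correct only if its overlap exceeds the threshold.
        if best_idx is not None and best > threshold:
            matched.add(best_idx)
            tp += 1
    fp = len(detections) - tp      # detections without a matching ground truth
    fn = len(ground_truths) - tp   # ground truth LPs that were missed
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall


detections = [(0, 0, 10, 10), (100, 100, 110, 110)]
ground_truths = [(0, 0, 10, 10), (50, 50, 60, 60)]
print(precision_recall(detections, ground_truths))  # (0.5, 0.5)
```

In this toy example one detection matches a ground truth LP exactly, one detection is spurious, and one ground truth LP is missed, giving TP=1, FP=1, and FN=1.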
Table VI shows the proposed detector's AP performance compared to the previous approaches presented in [32,33,37]. The proposed detector outperforms these approaches in terms of AP. It should be noted that in [32] only the accuracy, i.e. the ratio of detected LPs to all LPs in a private dataset, is evaluated, while the authors in [37] evaluated only the LPD precision, without presenting any AP values. These approaches were selected because they evaluated performance using images from all the countries of interest together in one dataset. Hence, they can be considered multi-national and multi-language LPD methods. Furthermore, some studies trained and tested detectors on different datasets separately, in order to evaluate the performance on each dataset. Table VII provides a comparison in terms of P, R, and AP performance for these methods. In order to conduct a fair comparison, the proposed detector was trained on every dataset separately. Under this setting, the proposed detector achieved a higher R and AP on all datasets. This is attributed partly to the large number of different LPs used in LPDC2020 and partly to the detector architecture. It is noted that one and two-line LP layout classification was studied in [34], with the classification results combined in the character recognition stage for multi-national Korean, Taiwanese, Chinese, and Latin LPs. Table VIII shows the proposed method's AP results per country. The performance is similar across countries, with slightly lower results for KSA LPs.

B. LPC
The proposed CNN for classifying the LP's issuing country, language, and layout was evaluated in terms of overall accuracy. Table IX shows the classification accuracy of the proposed CNN. The proposed CNN is only 0.38% less accurate than VGG16, which is regarded as state-of-the-art, but has significantly fewer learnable parameters: only 1.9% of the parameters used in VGG16. As a result, the proposed CNN is faster and less complex, with a small penalty in classification accuracy. Table X shows the misclassification rates of the proposed approach. It is noted that Turkish and European Union LPs have a higher classification error, as they share the same LP style standard. In contrast, BR and UAE LPs each have a unique style, and USA LPs can include object shapes that differ from standard LP characters, making them easy to classify with a small error.

VI. CONCLUSION
Detecting country and language is important for building a global ALPR system, while correct layout classification is essential in order to read the detected characters in the right order. This paper focused on the detection and classification of multi-national and multi-language LPs with different layouts from BR, USA, EU, TR, KSA, and UAE, proposing a method that can detect LPs regardless of their country of origin, language, or layout. Furthermore, a second classification stage was used to recognize the LP's issuing country, language, and layout. In addition, a new multi-national, multi-language, and multi-layout LP dataset was introduced in order to enable benchmarking and to close the gap in this field. The developed detection and classification approach was based on deep learning. The results were promising: the LP detection average precision was 99.57%, while the LP classification accuracy was 99.33%. The current study paves the way for designing a global ALPR system. In the future, an end-to-end training process could be developed to test the whole system as a unified ALPR model.