An effective feature set for enhancing printed Tamil character recognition

Abstract: The selection of features for extraction and classification is an essential factor in achieving high performance in character recognition. The feature extraction process produces feature vectors that define the shape and characteristics of a pattern so that it can be identified uniquely. Many feature extraction and classification approaches are available for Tamil and other languages, but there is still room to identify a better set of features to obtain a higher recognition rate in Optical Character Recognition (OCR) of printed Tamil text. This research aims to produce an efficient set of features that increases the accuracy and reduces the runtime of the best available OCR system for classifying isolated printed Tamil characters. The proposed set of features is tested on a large dataset using a One-versus-All (OVA) Support Vector Machine (SVM). Two pools of feature vectors are created from the features used in this study: basic, density, histogram of oriented gradients (HOG), and transition features. In comparison with the current best approach, Pool 1 gives a better recognition accuracy of 94.87 % for the OVA SVM and 97.07 % for the Unbalanced Decision Tree (UDT) SVM algorithm, but does not improve the recognition speed. Pool 2 gives not only a better recognition accuracy of 94.30 % for the OVA SVM and 96.35 % for the UDT SVM algorithm, but also a smaller feature vector, and hence a potentially higher recognition speed, than the selected best OCR approach. The proposed set of features improves the recognition rate by 2.57–3.14 % on OVA SVM and 3.22–3.94 % on UDT SVM.


Optical Character Recognition (OCR)
Information sharing is necessary to increase the knowledge level of people. This sharing can be carried out using any kind of medium such as books, documents, or manuscripts. Millions of books, documents and manuscripts have been produced in this world from the ancient era. Most of the old documents are not available today, and some are scarce. We are unable to make use of them because they are not readily available.
Most documents are hard copies, produced either as printed or as handwritten documents. Converting these documents into digital format preserves them, makes editing possible and allows many copies to be produced. Digital copies of materials can be stored in digital libraries and shared readily over the internet. They are also easy to reproduce. Electronic documents can be prepared in two ways. One way is to scan the printed documents using a scanner or a digital camera. The other is to create the document by retyping it from the start. Most of the organisations that maintain digital libraries, such as the World Digital Library (World Digital Library Home, n.d.), Hathi Trust Digital Library (HathiTrust Digital Library, n.d.) and Noolaham ("நூலகம்," n.d.), convert their collections into digital form by scanning. This type of digitisation produces un-editable images of documents, which require a significant amount of storage space. Retyping a whole document needs more time and human resources. If a material could be prepared as editable, searchable, and portable with less storage space and preparation time, it could be made accessible to many people at the same time efficiently.
Journal of the National Science Foundation of Sri Lanka 49 (2), June 2021
OCR is used to accomplish this purpose. It uses pattern recognition techniques to identify the patterns in images and converts them into an editable text format. These converted documents can be utilised easily by copying, searching or dividing into sub-documents.
Character recognition can be divided into two broad categories: optical (printed) and handwritten character recognition. Handwritten character recognition is further divided into two types based on the input device. Online character recognition uses digitizer tablets or PDA screens to get the input and identifies the characters in real time. Offline character recognition uses scanners and cameras to get the image inputs and determines the characters by applying image processing algorithms. In the case of optical character recognition, the input may be a good quality document or a degraded one. This research relates to printed character recognition. A sophisticated, well-integrated optical character recognition system should comprise three main parts: pre-processing and segmentation, character recognition, and post-processing. The pre-processing and segmentation step contains binarization, noise removal, skew detection and correction, text or non-text classification and segmentation procedures. The character recognition step comprises feature extraction and classification procedures. Finally, post-processing identifies and corrects linguistic misspellings introduced by OCR in the recognised text.
The implementation methods of OCR differ from language to language. Several approaches have been proposed so far for many languages, including Tamil, but they have shortcomings and lack accuracy. In this study, the available approaches to Tamil OCR were thoroughly reviewed and the state-of-the-art among them was identified. The overall aim is to suggest a set of features that improves the accuracy and the recognition speed of the state-of-the-art OCR for the Tamil language. The main focus was on feature selection. Based on the literature review, the approach proposed in Ramanan (2015) was found to have the best ideas and the highest accuracy among the available strategies. Therefore, the method explained in Ramanan (2015) was chosen as the state-of-the-art for printed Tamil character recognition. This work intends to suggest a better approach to feature selection, capable of increasing the accuracy and reducing the runtime of the system.

Tamil language
Tamil is a Dravidian language descended from Proto-Dravidian. The Tamil language has a history of more than 2000 years, starting from 300 BC. It is categorised into three periods: Old Tamil (300 BC-700 AD), Middle Tamil (700 AD-1600 AD) and Modern Tamil (1600 AD-present). It is an official language of two countries, Sri Lanka and Singapore. It also impacts the educational career in many countries. Tamil has official status in the Indian State of Tamil Nadu (Tamil Language, n.d.).

Characters in Tamil Script
There are 247 characters in Tamil script.
The objective of this paper is to propose a set of features that enhances the accuracy of printed Tamil character recognition with an increased recognition speed. Moreover, this study identifies the best approach among the existing OCR approaches through a thorough analysis of their performance, in order to show that the feature sets proposed in this work give better results than the best available at present.

Pre-processing and segmentation
An OCR system goes through three main phases: pre-processing and segmentation, character recognition and post-processing. To achieve higher recognition rates, it is necessary to decrease the variation in the input image, which reduces the recognition rate and increases complexity. Therefore, OCR systems use pre-processing and segmentation phases to overcome this problem. Moreover, the input image can be a scanned image, an image captured by a camera or merely a print-screen of a page. The input may contain not just text but also pictures, equations, tables, etc. Pre-processing and segmentation are required to extract the text and convert the input image of the text into segmented components. These processes transform the raw data into a format that can be processed more readily and efficiently.
The binarization process converts grey scale or colour images to binary images. Thresholding is the technique used in binarization; it separates the foreground of an image from its background. Median filtering is performed to reduce the noise, as in Ramanan (2015). For skew detection and correction, the algorithm proposed in Shafii (2014) was applied. Two datasets were used to test this approach: one (Tobacco-800 Complex Document Image Dataset and Groundtruth, n.d.) is the dataset used in Shafii (2014), and the other is UJTDdocF (Datasets of printed Tamil characters and printed documents -the University of Jaffna, n.d.). For page layout analysis, the approach proposed by Shafii (2014) was used. Shafii proposed an effective algorithm for the segmentation of Persian scripts, and the same algorithm was applied to the segmentation process.
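As an illustration of the thresholding and median filtering steps, a minimal Python sketch is given here. It assumes a fixed global threshold of 128 and a 3 × 3 filter window, neither of which is stated in the source, and it follows the convention of background = 1 and foreground = 0. The function names are hypothetical.

```python
from statistics import median

def binarize(gray, threshold=128):
    """Global thresholding: dark ink becomes 0 (foreground), paper becomes 1 (background).
    The threshold value is an assumption made for this sketch."""
    return [[0 if px < threshold else 1 for px in row] for row in gray]

def median_filter3(img):
    """3x3 median filter; border pixels are copied unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out
```

On a binary image, the median of a 3 × 3 window is a majority vote, so isolated speckles are removed while solid strokes survive.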

The State-of-the-art of OCR for Tamil Languages
The state-of-the-art was found through an analysis of existing works. This study used a list of factors to find the state of the art, such as accuracy, coverage of characters, extracted features, classifiers used, and the size of the dataset used for the training and testing phases. A summary of this review is listed in Table 1.
There are a total of 247 characters in Tamil script. This full set of characters can be formed using a total of 124 unique symbols. Among the existing approaches, some authors prepared their system to recognise the characters, and others prepared their system to recognise the unique symbols. Some approaches cover more than 124 symbols in their scope (Ramakrishnan & Mahata, 2000; Aparna & Ramakrishnan, 2002; Aparna & Chakravarthy, 2003). Those symbol sets include isolated Tamil characters, English numerals, and punctuation marks. Among the existing approaches, four studies achieved an accuracy of 98 % or above. Ramakrishnan and Mahata (2000) reached 99.1 % efficiency and covered the full set of character symbols, but they used only around 4000 samples to train and test their system. Aparna and Ramakrishnan (2002) reached 98 % accuracy with the scope of recognising the full set of character symbols. They also used just 4000 samples to train and test their system. Karthieswari et al. (2014) proposed an approach that reached an accuracy of 98.70 %, but they did not specify the number of characters covered or the amount of data used for training and testing. The approach proposed by Ramanan et al. (2016) reached a better accuracy of 98.80 % with the scope of a 124 symbol set (the full set of Tamil character symbols). They used a vast dataset with 12,400 data samples, the largest used in a Tamil character recognition system. However, they did not specify the recognition speed of their system.
Through the analysis it was clear that none of the existing work meets the following requirement: to identify Tamil characters, including the full set of symbols (124 or more), tested using a reasonable amount of data (to examine all the symbols), with speed and reasonable recognition accuracy.
However, among these existing works, the best approach is the one proposed by Ramanan et al. (2016), which is a part of Ramanan (2015). The failure rate of this approach is 1.2 %. Typically, an A4 sized printed page contains an average of 2000 characters, so this failure rate might cause 24 errors per page. Such an error rate is not acceptable for a working recognition system. Therefore, an efficient set of features is proposed to increase the recognition accuracy.
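The per-page estimate is simple arithmetic; as an illustration, a minimal Python sketch follows. The function name is hypothetical, and the default of 2000 characters per page is the average quoted in the text.

```python
def expected_errors_per_page(accuracy_percent, chars_per_page=2000):
    """Expected number of misrecognised characters on one page."""
    failure_rate = (100.0 - accuracy_percent) / 100.0
    return failure_rate * chars_per_page

# A reported accuracy of 98.80 % corresponds to a failure rate of 1.2 %.
print(round(expected_errors_per_page(98.80)))
```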

METHODOLOGY
Character recognition in a scanned image is carried out using pattern recognition techniques. The character recognition phase can be divided into two main steps, namely feature extraction and classification. Feature extraction is the process of generating feature vectors that can be used to represent the characters uniquely. The selection of features to extract plays a vital role in character recognition, and the feature extraction process has a primary role in improving recognition accuracy as well. Figure 1 depicts the steps of character recognition. It consists of pre-processing, feature extraction and classification steps.

Pre-processing
The images are pre-processed in two steps: removing the noise and resizing the images. Morphological operations or noise filtering can be used to remove the noise in a binary image. After removing noise, each character image is enclosed in a tight-fitting rectangular bounding box, and the portion of the image outside the box is discarded using horizontal and vertical projections. Finally, the image is scaled to 64 × 64 pixels.
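The cropping and scaling steps can be sketched as follows. This is a minimal illustration in Python, assuming foreground pixels are 0 and using nearest-neighbour scaling; the interpolation method is not given in the source, and the function names are hypothetical.

```python
def crop_to_bounding_box(img):
    """Keep only the rows and columns whose projection contains foreground (0) pixels."""
    rows = [y for y, row in enumerate(img) if 0 in row]
    cols = [x for x in range(len(img[0])) if any(row[x] == 0 for row in img)]
    if not rows:
        return img  # blank image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]

def resize_nearest(img, size=64):
    """Scale to size x size pixels with nearest-neighbour sampling (an assumption)."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]
```

A character image would then be normalised with `resize_nearest(crop_to_bounding_box(img))`.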

Feature extraction
Most Tamil characters have similar shapes or differ only by slight variations, so characters may be misclassified because of this similarity. Therefore, the features selected for extraction should be able to identify different characters uniquely. The pre-processed images are the inputs for this process. Table 2 summarises the printed Tamil characters misclassified by the hybrid decision tree proposed in Ramanan (2015) when applied to UJTDchar. In each block, the first column gives the actual character, and the second column gives the number of misclassifications along with the predicted characters.
The contribution of this study is to propose a set of new features that can correctly classify the characters misclassified by Ramanan (2015), listed in Table 2, and improve the performance of the current best approach. First, the font shapes of the actual characters were carefully compared with the font shapes of the predicted characters (which were misclassified by Ramanan, 2015) to find the differences among them. Seven new features were identified in addition to the features of Ramanan (2015). The new features are explained in the following sections, which describe the full set of extracted features of this work: the combination of the features newly added through this study and the features proposed in Ramanan (2015).

Density Features (DF)
Density features are computed from the density of each zone of the image.
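A zone density computation can be sketched in Python as follows. The sketch assumes the common definition of density (foreground pixel count divided by zone area), a square image, and a regular grid of zones; the grid size and the function name are illustrative assumptions rather than the configuration used in this study.

```python
def zone_densities(img, zones_per_side=8):
    """Fraction of foreground (0) pixels in each zone of a square binary image.
    Assumes the image side is divisible by zones_per_side."""
    step = len(img) // zones_per_side
    features = []
    for zy in range(zones_per_side):
        for zx in range(zones_per_side):
            foreground = sum(
                1
                for y in range(zy * step, (zy + 1) * step)
                for x in range(zx * step, (zx + 1) * step)
                if img[y][x] == 0
            )
            features.append(foreground / (step * step))
    return features
```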

Basic features (BF)
Two sets of basic features are extracted, namely 'Basic Feature Type 1 (BF1)' and 'Basic Feature Type 2 (BF2)'.
One of the basic features represents the total count of right diagonal lines in the character [Figure 3(a)]. In the same way, the total count of left diagonal lines is also extracted for the BF1 feature set. For these two features, if the diagonal lines are sharp, without aliasing effects [Figure 3(a)], the count is accurate. Otherwise, if the lines have aliased edges [Figure 3(f)], the count is inaccurate. Therefore, these two features are omitted from the BF2 feature set, which is used to classify characters that have aliased edges.
k. Euler Number: This feature represents the total number of objects in the image minus the total number of holes in those objects.
The new features can be applied to correctly classify the printed Tamil characters misclassified by Ramanan (2015) that are listed in Table 2. For example, consider the last pair of characters, னீ and ளீ. These characters have different numbers of loops, different numbers of end points, and different values for the density of loops. The first character (னீ) has three loops and two end points, and its total density of loops differs from that of the second character (ளீ), which has only two loops and three end points. Because of these differences, these two characters can be discriminated clearly. Likewise, all misclassified characters can be discriminated clearly from one another.
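The Euler number defined above (objects minus holes) can be sketched in Python with a flood fill. The sketch assumes foreground = 0 and background = 1, 8-connectivity for objects and 4-connectivity for holes; the connectivity choice and all names are assumptions, not details taken from the source.

```python
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def count_components(img, value, neighbours, skip_border=False):
    """Count connected regions of `value`; optionally ignore regions touching the border."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if img[sy][sx] != value or seen[sy][sx]:
                continue
            stack, touches_border = [(sy, sx)], False
            seen[sy][sx] = True
            while stack:
                y, x = stack.pop()
                if y in (0, h - 1) or x in (0, w - 1):
                    touches_border = True
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and img[ny][nx] == value:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if not (skip_border and touches_border):
                count += 1
    return count

def euler_number(img):
    """Objects (foreground regions) minus holes (enclosed background regions)."""
    objects = count_components(img, 0, N8)
    holes = count_components(img, 1, N4, skip_border=True)
    return objects - holes
```

A background region counts as a hole only if it does not reach the image border, which is why a one-pixel background margin around the character is assumed.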

Histogram of Oriented Gradients (HOG)
Four kinds of HOG features are extracted in the same way as in Ramanan (2015). They are described below.
i. snHOG
• Pre-processed character images are divided into nine sub-images.
eHOG with 225 dimensions is created by combining these five HOG features (dHOG + tlHOG + blHOG + brHOG + trHOG).
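As background for the HOG variants, the basic building block of any HOG descriptor, a gradient orientation histogram over one cell, can be sketched in Python as below. This is a generic illustration only: the nine unsigned orientation bins, the central-difference gradients and the function name are assumptions, and the sketch does not reproduce the specific snHOG, dHOG or eHOG layouts of Ramanan (2015).

```python
import math

def orientation_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0 to 180 degrees)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // (180.0 / bins)) % bins] += magnitude
    return hist
```

A full descriptor would concatenate such histograms over several cells or sub-images and normalise them.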

Transition Features (TF)
In pre-processed character images, the number of transitions from background (1) to foreground (0) is computed in four directions (Saba et al., 2011), in the same way as in Ramanan (2015). Feature sets are extracted from the dataset using Algorithm 1.
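The transition count described above can be sketched in Python as follows. The sketch counts background (1) to foreground (0) transitions along every row and column in both scan directions; the arrangement of the resulting values into a vector is an assumption and need not match the transition feature layout of Ramanan (2015). The function names are hypothetical.

```python
def transitions(seq):
    """Number of background (1) to foreground (0) transitions along a scan line."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a == 1 and b == 0)

def transition_features(img):
    """Transition counts per row and per column, scanned in four directions."""
    cols = [list(c) for c in zip(*img)]
    features = []
    features += [transitions(r) for r in img]         # left to right
    features += [transitions(r[::-1]) for r in img]   # right to left
    features += [transitions(c) for c in cols]        # top to bottom
    features += [transitions(c[::-1]) for c in cols]  # bottom to top
    return features
```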

Input: Set of images (Tamil characters)
Output: Different feature vectors (concatenation of basic, density, HOG, and transition features)
Step 1: Binarize the input image
Step 2: Pre-process the binarized image (remove noise and resize it to 64 × 64 pixels)
Step 3: Extract basic, density, HOG, and transition features
Step 4: Repeat steps 1 to 3 until reaching the last image of the image set
Step 5: End of program
After the extraction of all of these features, ten types of feature sets, FS1, FS2, FS3, …, FS10, are formed from different concatenations of features. The details of these feature sets are specified in Table 3.

Classification
The extracted feature vectors are analysed using the OVA (One-Versus-All) SVM, and the result is compared with the accuracy of Ramanan (2015). This study uses the same classes of Tamil characters (124 unique classes) as Ramanan (2015), shown in Table 4.
All of the experiments were carried out on a PC running with an Intel Core i5 2.4GHz processor and 2GB RAM under Matlab R2017a platform. The LIBSVM tool (Chih-Chung & Chih-Jen, n.d.) was used.
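The One-Versus-All decision rule can be illustrated with a short Python sketch: one binary scorer per class, and the class with the largest decision value wins. The linear scorers below are stand-ins for trained binary SVM decision functions; they are not LIBSVM calls, and all names are hypothetical.

```python
def linear_scorer(weights, bias):
    """Stand-in for a trained binary SVM: returns a real-valued decision score."""
    return lambda x: sum(w * v for w, v in zip(weights, x)) + bias

def ova_predict(x, scorers):
    """Pick the class whose 'this class versus all others' scorer is most confident."""
    return max(scorers, key=lambda label: scorers[label](x))

# Toy example with two classes and two-dimensional feature vectors.
scorers = {
    "class_a": linear_scorer([1.0, 0.0], 0.0),
    "class_b": linear_scorer([0.0, 1.0], 0.0),
}
print(ova_predict([0.2, 0.9], scorers))
```

With 124 character classes, the dictionary would hold 124 scorers, each trained to separate one class from the remaining 123.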

Experimental setup
Dataset
Ramanan (2015) prepared a vast dataset as part of his research. Two of the datasets he prepared have been made publicly available (Datasets of printed Tamil characters and printed documents -University of Jaffna, n.d.). The same datasets were used in this study. They are described below.
• UJTDchar consists of 12400 individual printed Tamil character samples cropped from different scanned printed Tamil documents. It contains 124 unique symbols with 100 images per symbol.
• UJTDdocF is prepared from scanned printed Tamil documents in 20 different font faces. It consists of a total of 30 pages, and each page has an average of 275 words. Each page is made complex by different font styles and font sizes, with bold and italic formatting.
We used 30 % of UJTDchar for testing and 70 % for training the system, similar to the current best approach (Ramanan, 2015).
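A 70/30 split of this kind can be sketched in Python as below. The sketch assumes the split is made within each class, so that every symbol contributes 70 training and 30 testing images; whether the original split was per class is not stated in the source. The function name and the fixed seed are illustrative.

```python
import random

def split_per_class(samples_by_class, train_fraction=0.7, seed=0):
    """Shuffle each class separately and split it into training and testing parts."""
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_class.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(round(len(shuffled) * train_fraction))
        train += [(s, label) for s in shuffled[:cut]]
        test += [(s, label) for s in shuffled[cut:]]
    return train, test
```

For 124 symbols with 100 images each, this yields 8680 training and 3720 testing samples.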
Two (2) pools of feature sets were created using these feature vectors, namely, feature set Pool 1 and feature set Pool 2.

Evaluation criteria
The recognition rate is computed as the percentage of correctly recognised characters against the number of characters in the document.
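The recognition rate defined in this section can be written as a short Python function. The name and the list-based interface are illustrative assumptions.

```python
def recognition_rate(predicted, actual):
    """Percentage of characters whose predicted label matches the actual label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)
```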

Testing Results
In the classification part, 10-fold cross validation (SVM classifier, MATLAB crossval) was used for training and testing the data of each feature set. Cross validation is a repeated experiment, and the average accuracy over the folds was calculated for each feature set. Table 6 shows the calculated average accuracy of each feature set.
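The averaging over folds can be illustrated with a minimal Python sketch. It uses deterministic interleaved folds rather than the random partition that MATLAB's crossval produces, and `evaluate` is a hypothetical callback that trains on the training indices and returns the accuracy on the testing indices.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k interleaved folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

def cross_validated_accuracy(n, evaluate, k=10):
    """Average the per-fold accuracy returned by the caller-supplied evaluate()."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)
```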
The result gained for the feature sets of Pool 1 of this study is compared with the outcome of Ramanan (2015). The proposed approach through this study gives 94.87 % recognition rate for the algorithm of OVA SVM, but the method in Ramanan (2015) gave only 91.73 % of accuracy for this same dataset.
The resulting accuracy proves that the proposed feature selection achieves an increased accuracy over the accuracy of Ramanan (2015). In addition to this, an analysis was carried out on the outputs to observe the result that would be obtained if the feature sets of Pool 1 were processed with the algorithm of UDT SVM proposed in Ramanan (2015). The result of Ramanan (2015) was 93.13 %, but the proposed feature sets through this study would result in 97.07 % of accuracy.
The misclassified character classes of Ramanan (2015) are compared with the results of this study. More than twenty classes give better results than in Ramanan (2015); they are included in Table 5.
The feature sets of Pool 1 include a total of 2868 feature values (basic features: 12, density features: 88, HOG: 2394, transition features: 374), whereas Ramanan (2015) used only 1619 feature values (basic features: 5, density features: 88, HOG: 1154, transition features: 372). The total count of features used is larger than that of the selected approach, so the recognition speed of the present system might be lower than that of Ramanan (2015). Since the research question is to propose an efficient set of features to improve performance, both accuracy and speed must be considered. The feature sets of Pool 1 improve only the accuracy, not the speed of recognition.
Among the extracted features, HOG has the largest number of vector dimensions. The total count of features used can be reduced by removing from the pool the feature vectors that are created by concatenation with HOG. Therefore, another pool (Pool 2) of feature vectors, containing only FS1, FS4, FS5, FS6, FS7 and FS10, was prepared. Pool 2 reached an accuracy of 94.30 % with the OVA classification algorithm, which is also higher than the accuracy of the selected approach. Moreover, the total count of features used in Pool 2 (1164) is smaller than that of the selected approach (1619). Therefore, the recognition speed could potentially be higher than that of the selected approach.
The UDT SVM classification algorithm gave an accuracy of 96.35 % for Pool 2.
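The feature counts and accuracy gains reported in this section can be checked with a few lines of Python. The numbers are taken from the text; the variable and function names are illustrative.

```python
# Feature dimensions per feature type, as reported in the text.
pool1 = {"basic": 12, "density": 88, "hog": 2394, "transition": 374}
baseline = {"basic": 5, "density": 88, "hog": 1154, "transition": 372}

def total_dimensions(counts):
    """Total number of feature values in a pool."""
    return sum(counts.values())

def accuracy_gain(proposed, reference):
    """Difference in recognition accuracy, in percentage points."""
    return round(proposed - reference, 2)

# Reference accuracies of Ramanan (2015): 91.73 % (OVA SVM) and 93.13 % (UDT SVM).
gains = {
    "Pool 1, OVA SVM": accuracy_gain(94.87, 91.73),
    "Pool 1, UDT SVM": accuracy_gain(97.07, 93.13),
    "Pool 2, OVA SVM": accuracy_gain(94.30, 91.73),
    "Pool 2, UDT SVM": accuracy_gain(96.35, 93.13),
}
print(total_dimensions(pool1), total_dimensions(baseline), gains)
```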

CONCLUSION
This paper has presented a better set of features, which can be used to recognise the printed Tamil characters misclassified in Ramanan (2015) with higher accuracy. The proposed work includes concatenated feature sets of basic, density, transition, and HOG features of printed Tamil character images.
In the pre-processing stage, skew angle detection and correction is performed using an algorithm composed of an axes-parallel bounding box, which works regardless of the content of the documents. Therefore, this algorithm works in the presence of graphical images, tables, charts, etc., with no angle limitation. Testing with the dataset UJTDdocF achieved 100 % accuracy. In the segmentation stage, the proposed algorithm produces error-free segmentation for the UJTDdocF test documents that do not contain overlapping or touching Tamil characters.
Furthermore, in the process of finding the state-of-the-art, a thorough review and analysis of the proposed approaches to feature selection for printed Tamil character recognition was conducted. All the experiments of this study were carried out with large, publicly available datasets (Tobacco-800 Complex Document Image Dataset and Groundtruth, n.d.; Datasets of printed Tamil characters and printed documents -the University of Jaffna, n.d.). Moreover, in the classification stage, the proposed feature vectors of Pool 1 were tested using the OVA technique. The recognition rate was 94.87 %, an increase of 3.14 % over the results of Ramanan (2015). Thus, the current approach can avoid about 63 character misclassifications per page. The analysis results for the UDT-based SVM give an accuracy of 97.07 %, an increase of 3.94 % over the results of Ramanan (2015). Therefore, about 80 character misclassifications per page can be avoided by this proposed work using the feature sets of Pool 1. Even though the accuracy is high, the drawback of Pool 1 is that the recognition speed might be lower than that of the selected approach because of the increased total count of features used.
Moreover, the proposed feature sets of Pool 2 were classified with the OVA algorithm. The recognition rate was 94.30 %, an increase of 2.57 % over the results of Ramanan (2015). The analysis results for the UDT-based SVM give an accuracy of 96.35 %, an increase of 3.22 % over the selected approach. Due to these increases in accuracy, about 52 character misclassifications per page with OVA and about 65 with UDT SVM can be avoided compared with Ramanan (2015). The feature vectors of Pool 2 also offer a potential improvement in recognition speed: the total count of feature vector dimensions used in Pool 2 is lower than that of the selected approach, so the recognition speed is potentially higher than in Ramanan (2015).
Finally, the results show that the feature sets proposed in this research give better accuracy, and potentially better speed, than the feature sets of Ramanan (2015). The performance of printed Tamil character recognition was improved through the proposed set of features. According to the study, better character recognition accuracy can be achieved if the proposed feature sets are processed with the hybrid decision tree proposed in Ramanan (2015). With this approach, the accuracy can be higher than the accuracy of 98.80 % reported in Ramanan (2015). Future work will focus on extending the approach to recognise handwritten Tamil characters.