Multi-label classification of computer science documents using fuzzy logic

Classification has already been used to predict predefined topics in many diverse domains, including research paper classification. A research paper may belong to one or more topics (classes). The state-of-the-art techniques in this area have the following limitations: (1) most techniques assign a document to at most one principal topic and do not identify all of the topic associations of a research paper; (2) they treat the classification of research documents as a problem in the discrete domain, and their accuracy remains low when multiple classes are considered for a single document. These limitations led us to explore the fuzzy domain for the classification of Computer Science documents, because it is not known in advance whether a document belongs to one category or to more than one. Furthermore, fuzzy classification helps to identify the degree to which a paper belongs to different topics. To validate the findings of our research, we need a comprehensive dataset. Such a dataset has been made available by the scientific community for the Computer Science domain. Therefore, in this paper, we restrict our focus to the Computer Science domain. Key features are extracted from the Title and


Journal of the National Science Foundation of Sri Lanka 44 (2), June 2016

soft set based classifier, rough set based classifier, artificial neural network and approaches based on natural language processing (Salton & McGill, 1983; Sebastiani, 2002).
However, research paper classification has some limitations: 1) most of the above techniques assign a document to a single category out of multiple categories (Gerstl et al., 2001), whereas Computer Science documents often belong to more than one category; 2) the state-of-the-art techniques treat the classification of Computer Science documents as a problem in the discrete domain, and accuracy remains very low for the few systems that consider multiple topics for classification (Salton et al., 1981; Salton, 1990; Gerstl et al., 2001; Kok-Chin & Choo-Yee, 2006). These limitations led us to explore the fuzzy domain for a proper classification of research documents. Evaluating such a system requires a benchmark or user judgments, and building a benchmark that contains a comprehensive set of documents is a challenging task. Such a comprehensive benchmark dataset is available in the domain of Computer Science; therefore, the proposed technique has been validated on this dataset.
Fuzzy logic was introduced by the mathematician Lotfi A. Zadeh (Zadeh, 1965), who also founded fuzzy sets and fuzzy based systems. Fuzzy logic and fuzzy sets have been applied to a variety of problems in areas such as pattern recognition, decision support, medicine, law, information retrieval, taxonomy and topology (Perry, 1995). Fuzziness is about uncertainty and indicates the degree to which something is true. In information retrieval, it has been used to account for the data itself (Gershon, 1992), for result visualisation (Deller et al., 2007), for ontologies that support matching (Zhai et al., 2008) and for methods of matching (Ji & Yao, 2007).
In this paper, we propose a fuzzy classifier for the classification of Computer Science papers. We used research papers (documents) from the dataset of the Journal of Universal Computer Science (J.UCS). This dataset contains research papers from different areas of Computer Science. The reason for selecting this dataset is twofold: 1) J.UCS covers all areas of Computer Science; 2) its authors come from diverse domains, which allows a fair evaluation of the proposed technique. Both of these helped us in the comprehensive evaluation of the proposed approach. We extracted key feature terms from the Title and Keywords of the papers. We selected the Title and Keywords of scientific publications because they usually contain the theme of the work and are easily available online, so acquiring this metadata does not require extensive effort.

The set of documents $D$ is represented by key features (terms) extracted from the documents, and $C_1, C_2, C_3, \ldots, C_n$ are the categories of the documents in $D$. On the basis of this representation, a set of rules is built over terms $x_{ij}$ and categories $C_{ij}$. In $x_{ij}$, $i$ represents the document and $j$ (1, 2, 3, ...) represents the terms of document $i$. In $C_{ij}$, $i$ represents the document and $j$ (1, 2, 3, ...) represents the categories of document $i$, as the document may belong to more than one category.

In this study, we first extracted the Title and Keywords from research papers and evaluated our approach using the documents (research papers) in the J.UCS dataset. We trained our framework on 80 % of the papers in the dataset and used the remaining 20 % for testing. A fuzzy approach was used because it was not known in advance whether a document belonged to one category or more, so one or more categories had to be assigned to each document. During training, we first generated one rule for each paper of a category.
Then, rules belonging to the same category were merged by the fuzzy based rules merger algorithm. We assigned weights to each merged rule for predicting the category. The rule weights of a test document were then compared with these weights, and the fuzzy classifier algorithm predicted the category of the test document. Details of our algorithms and framework are given in the proposed framework section (Figure 4). We also assigned weights to terms that appeared repeatedly in the Title or Keywords of a document (research paper), using the term frequency technique to calculate the weight of each term in a rule. Our rules were also updated whenever a new test document was classified automatically, in order to improve the performance of our approach for classifying Computer Science documents. Our results for category prediction were better than those of the existing techniques.
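As an illustration of how a single per-paper rule with term frequency weights might be built from the Title and Keywords, the following minimal sketch can be considered. The function name `term_frequency_rule`, the whitespace tokenisation and the normalisation of counts to relative frequencies are assumptions made for illustration only; the paper states only that term frequency is used.

```python
from collections import Counter


def term_frequency_rule(title: str, keywords: str) -> dict[str, float]:
    """Build one rule for a paper: each term of Title + Keywords with a weight.

    Assumption: the weight is the relative term frequency (count / total terms).
    """
    terms = (title + " " + keywords).lower().split()
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


# Hypothetical paper: "fuzzy" occurs 3 times among 8 terms, "routing" twice.
rule = term_frequency_rule("Fuzzy routing in fuzzy networks", "fuzzy logic routing")
print(rule["fuzzy"], rule["routing"])
```

Terms that recur across the Title and Keywords receive a larger weight, which is the effect described above.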

PROPOSED FRAMEWORK
Document classification of Computer Science papers has been done using a number of techniques and datasets. The datasets used are normally the content and the metadata of the papers. The content gives better precision because it offers a rich set of features (Dendek et al., 2014); however, the content of scientific documents is not always openly available. Therefore, some authors have tried to classify papers based on metadata. Metadata is often defined as data about data, or a description of the actual data. In the domain of research papers, it describes the creation, context or content of the actual documents. With metadata, inconsistency or redundancy can be identified easily because the metadata dataset is not too large. The metadata of a scientific document include the title, authors, keywords, etc. However, metadata provide a limited number of features, which does not give very accurate classification. The objective of this research is to use freely available metadata and to test which metadata features are better suited for classification using a number of innovative approaches. This research has proposed, developed and tested a technique on metadata and reports the results achieved so far. Another important finding from the literature is that most works focus only on single-label classification of research papers, meaning that a paper is associated with only one topic. However, research papers often belong to more than one topic. This phenomenon (multi-label classification) is also addressed in this research.
The reason for selecting fuzzy classification is that research papers do not belong to only one category. It is quite possible that a paper is partially associated with one topic and partially related to other topics. For example, a paper on 'Network Routing Algorithm' has two associations: one with the network topic and the other with the algorithm topic. Fuzzy based systems identify such overlaps with great accuracy and flexibility (Dehzangi et al., 2007; Yaguinuma et al., 2014).
To solve these types of problems, we used fuzzy logic, which has been used to deal with the uncertainty and ambiguity of real world problems (Gershon, 1992). We propose a framework for the categorisation of papers into one or more categories. First, we apply some preprocessing techniques to prepare our dataset as input to the framework. Then, we propose an algorithm, 'fuzzy based rules merger (FBRM)', to merge the generated rules. Next, we propose a second algorithm, 'fuzzy classifier', to classify the papers into one or more categories. Finally, to increase the performance of our approach, a rules updater is used to enrich our knowledge base (training set) for document classification. The details of the proposed framework are described below.

Preprocessing
Feature selection is an important part of document classification, and classification performance may be affected as the number of features increases, so some preprocessing steps are necessary. For this purpose, we take three tables (papers, papers_category and categories) from the J.UCS dataset, which are shown in Figure 1.
From these tables, we generate the training dataset for our approach. A sample of the training dataset is shown in Figure 2. Each row in Figure 2 represents a rule, so the number of rules, R1, R2, R3, ..., Rn, equals the number of rows in the dataset. Figure 2 shows that some papers belong to more than one category, which is why we used a fuzzy approach in this paper. We then merged the papers that belonged to the same category. Because some papers may belong to two or more categories, we apply fuzzy logic to identify the most relevant categories of each paper. The relevance of documents to categories can be represented by means of linguistic terms. In addition, expressing the importance of the document categories via linguistic variables allows the generation of fuzzy rules that can be used for identifying the most relevant category for a particular paper (Senthamarai & Ramaraj, 2008). For this purpose, we developed a formula: we first find the membership (here, the term frequency weights) of a paper with respect to its categories and then apply an alpha-cut $\phi$ (threshold) to that membership to identify the most relevant categories for the paper. The formal representation is as follows:

$$\mu_{C_i}(P) \geq \phi \;\Rightarrow\; P : C_i$$

where $P$ is the paper (document), $C_i$ is the set of categories, $\phi$ is an alpha-cut (threshold) whose value can be determined by domain experts, $\mu_{C_i}(P)$ is the membership (term frequency weight) of $P$ in category $C_i$, and $P : C_i$ denotes that paper $P$ belongs to category set $C_i$.
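The alpha-cut step described above can be sketched in a few lines. This is only an illustration: the function name `alpha_cut`, the example membership values and the choice of an inclusive comparison (membership greater than or equal to the threshold) are assumptions, not details taken from the paper.

```python
def alpha_cut(memberships: dict[str, float], phi: float) -> list[str]:
    """Return the categories whose membership reaches the threshold phi.

    Assumption: the comparison is inclusive (membership >= phi).
    """
    return sorted(c for c, mu in memberships.items() if mu >= phi)


# Hypothetical memberships of one paper in three categories.
memberships = {"Networks": 0.55, "Algorithms": 0.40, "Databases": 0.05}
selected = alpha_cut(memberships, phi=0.3)
print(selected)
```

With a threshold of 0.3, the paper is kept in both Networks and Algorithms, while the weak association with Databases is discarded. A higher threshold yields fewer, more confident category assignments.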
In Figure 3, rules such as R3, R5, R7, R9, R12, R14 and R18 represent papers belonging to the same category. Such papers are merged into a single rule, such as R35 for the papers of category A. Similarly, all rules that represent the same category are merged into a single rule, so that in the end each category has only one rule. All this is done by our fuzzy based rules merger (FBRM) algorithm. Once all the training papers (documents) are assigned to their respective categories as shown in Figure 3, we remove unrelated, unnecessary and meaningless words from the Keywords and Title by removing stop words, and apply a stemming algorithm (Porter, 1997) to reduce words to their stems. After that, we apply our FBRM algorithm to calculate the term frequencies of Keywords, Title and Keywords + Title for each category. Our proposed framework is shown in Figure 4. It has two main components: the FBRM algorithm and the fuzzy classifier.
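The stop word removal and stemming steps can be illustrated with the sketch below. It is deliberately simplified: the stop word list is a small illustrative subset, and `naive_stem` is a crude suffix stripper used as a stand-in for the Porter stemmer, not an implementation of it. All names are hypothetical.

```python
import re

# Illustrative subset only; a real stop word list would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "using", "with"}


def naive_stem(word: str) -> str:
    """Strip a few common suffixes (stand-in for the Porter stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Lowercase, tokenise on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("Routing Algorithms for the Networks")
print(tokens)
```

In practice, an established Porter stemmer implementation would replace `naive_stem`; the overall pipeline shape stays the same.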

Fuzzy based rules merger (FBRM) algorithm
The FBRM algorithm merges rules that belong to the same category. Initially, in preprocessing, we assigned a rule to each document (e.g. R3, R5, R7, R9, R12, R14, R18) and then combined the rules that belonged to the same category (e.g. into R35). This algorithm extracts the Keywords and Title from research papers and concatenates them for each category.
We concatenate the Keywords and the Title separately for each category, and also concatenate Keywords and Title together for each category. We then calculate the term frequency (TF) of the resulting Keywords string, the resulting Title string and the resulting combined Keywords and Title string. The FBRM algorithm is shown in Figure 5.
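The merging described above can be sketched as follows. This is not the algorithm of Figure 5, which is not reproduced here; it is a minimal illustration under stated assumptions: papers are given as dictionaries with hypothetical `title`, `keywords` and `categories` fields, text is already preprocessed, and term frequency is stored as raw counts.

```python
from collections import Counter, defaultdict


def merge_rules(papers: list[dict]) -> dict[str, dict[str, Counter]]:
    """Merge per-paper rules into one rule per category.

    For each category, the Title terms and Keywords terms of all its papers
    are concatenated, and term frequencies are computed for the Title string,
    the Keywords string and both together.
    """
    merged = defaultdict(lambda: {"title": [], "keywords": []})
    for paper in papers:
        # A paper with several categories contributes to each of them.
        for category in paper["categories"]:
            merged[category]["title"].extend(paper["title"].lower().split())
            merged[category]["keywords"].extend(paper["keywords"].lower().split())

    rules = {}
    for category, fields in merged.items():
        rules[category] = {
            "title": Counter(fields["title"]),
            "keywords": Counter(fields["keywords"]),
            "both": Counter(fields["title"] + fields["keywords"]),
        }
    return rules


# Hypothetical training papers.
papers = [
    {"title": "fuzzy routing", "keywords": "network fuzzy", "categories": ["A", "B"]},
    {"title": "fuzzy sets", "keywords": "logic", "categories": ["A"]},
]
rules = merge_rules(papers)
print(rules["A"]["both"]["fuzzy"], rules["B"]["both"]["fuzzy"])
```

After merging, each category has exactly one rule, with three term frequency tables that can be compared against a test document.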

Fuzzy classifier
When a user submits a test document for classification, the preprocessing steps discussed above are performed and the term frequency weights of the test document are computed. Comparing the term weights of the test document with the rule weights of each category yields a membership value for each category. For that test document, the fuzzy classifier predicts the most relevant category or categories on the basis of these memberships by applying the alpha-cut $\phi$ (threshold). The fuzzy classifier algorithm is shown in Figure 6. In the selected dataset, the topics manually selected by the authors of the papers are available, and the proposed system was evaluated against those predefined topics. The comparisons are shown in Figure 7. A sample output of the fuzzy classifier algorithm is presented in Figure 8.
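A possible shape of the classification step is sketched below. The paper does not spell out how test document weights are compared with rule weights, so the scoring used here (a weighted overlap of shared terms, normalised across categories to give a membership between 0 and 1) is an assumption, as are the function name `classify` and the example values.

```python
def classify(
    test_terms: dict[str, float],
    rules: dict[str, dict[str, float]],
    phi: float,
) -> tuple[dict[str, float], list[str]]:
    """Score a test document against each category rule and apply the alpha-cut.

    Assumption: the score of a category is the sum, over shared terms, of
    (rule weight * test weight); memberships are scores normalised to sum to 1.
    """
    scores = {
        category: sum(weights.get(term, 0.0) * w for term, w in test_terms.items())
        for category, weights in rules.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {}, []  # no shared terms with any category
    memberships = {c: s / total for c, s in scores.items()}
    predicted = sorted(c for c, mu in memberships.items() if mu >= phi)
    return memberships, predicted


# Hypothetical merged rules and a test paper on network routing algorithms.
rules = {
    "Networks": {"routing": 0.5, "network": 0.5},
    "Algorithms": {"algorithm": 0.6, "routing": 0.4},
    "Databases": {"query": 1.0},
}
test_terms = {"network": 1 / 3, "routing": 1 / 3, "algorithm": 1 / 3}
memberships, predicted = classify(test_terms, rules, phi=0.3)
print(predicted)
```

Here the test paper is assigned to both Networks and Algorithms, mirroring the 'Network Routing Algorithm' example, and receives zero membership in Databases.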

Rules updater
When the fuzzy classifier assigns a category or categories to a test document, we update the rule weights of those categories. In this way, we enrich our knowledge base (training set) for document classification. By doing this, the performance of our classification approach increases as the training rule weights grow.
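One simple way to realise such an update is to add the term counts of the classified document to the rule of each predicted category. The sketch below assumes rules are stored as raw term counts; the function name `update_rules` and the data are hypothetical, and the paper's actual update formula may differ.

```python
from collections import Counter


def update_rules(
    rules: dict[str, Counter], test_terms: Counter, predicted: list[str]
) -> None:
    """Add the test document's term counts to every predicted category rule."""
    for category in predicted:
        # Counter.update adds counts rather than replacing them.
        rules.setdefault(category, Counter()).update(test_terms)


# Hypothetical knowledge base and a newly classified document.
rules = {"A": Counter({"fuzzy": 3}), "B": Counter({"graph": 2})}
update_rules(rules, Counter({"fuzzy": 1, "logic": 1}), predicted=["A"])
print(rules["A"])
```

Only the rules of the predicted categories change, so a misclassified document would also reinforce the wrong rule; this is a property of the sketch worth keeping in mind.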
The working of our framework is explained in the following steps:

RESULTS AND DISCUSSION
To evaluate the proposed scheme, we calculated precision and accuracy on the Journal of Universal Computer Science (J.UCS) dataset. Related features of the J.UCS dataset and the number of research papers used for training and testing are provided in Tables 1 and 2. Figure 9 shows the category-wise distribution of papers in the J.UCS dataset.
In Table 3, 'YES' and 'NO' represent the crisp decision given for document classification, where document $d_i$ is assigned to category (or categories) $C_i$. Each entry in the table indicates the number of documents of each type (YES or NO).
The entries of the contingency table are as follows: True Positive (TP) is the number of documents that the system assigns to category $C_i$ and that actually belong to $C_i$; False Positive (FP) is the number of documents that the system assigns to $C_i$ but that do not actually belong to $C_i$; False Negative (FN) is the number of documents that the system does not assign to $C_i$ although they actually belong to $C_i$; and True Negative (TN) is the number of documents that the system does not assign to $C_i$ and that do not actually belong to $C_i$. Based on these parameters, the standard performance measures of precision and recall are computed. Precision is the proportion of documents assigned to $C_i$ that actually belong to it, and recall is the proportion of documents belonging to $C_i$ that the system assigns to it:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

After detailed analysis of our results, the precision and recall of our approach are 93 % and 96 %, respectively. We calculated the precision and recall for each paper with respect to each category, counted the papers in each category and added their precision and recall percentages. We then determined the average precision and recall for each category, which are shown in Figures 10 and 11.
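The contingency counts and the two measures can be computed per category as in the sketch below. It uses the standard definitions of precision and recall; the function name `precision_recall` and the toy label sets are illustrative and are not taken from the J.UCS experiments.

```python
def precision_recall(
    predicted: list[set[str]], actual: list[set[str]], category: str
) -> tuple[float, float]:
    """Compute precision and recall of one category over a list of documents."""
    tp = fp = fn = 0
    for pred, act in zip(predicted, actual):
        in_pred, in_act = category in pred, category in act
        if in_pred and in_act:
            tp += 1  # assigned and correct
        elif in_pred:
            fp += 1  # assigned but wrong
        elif in_act:
            fn += 1  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy multi-label example with four documents.
predicted = [{"A", "B"}, {"A"}, {"B"}, {"A"}]
actual = [{"A"}, {"A", "B"}, {"B"}, {"B"}]
p_a, r_a = precision_recall(predicted, actual, "A")
p_b, r_b = precision_recall(predicted, actual, "B")
print(p_a, r_a, p_b, r_b)
```

Averaging these per-category values over all categories gives a macro-averaged precision and recall, which is one common way to summarise multi-label results.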
We compared our approach with other document classification approaches, including techniques for text document classification on the basis of similarity (Senthamarai & Ramaraj, 2008). In Table 5, the comparison is made on the basis of accuracy. We conclude that our approach performs better than the other document classification approaches mentioned. The accuracy of all these approaches is shown in Figure 12.

CONCLUSION
This paper proposed, implemented and evaluated a framework for fuzzy based classification of Computer Science documents. Both algorithms, the fuzzy based rules merger and the fuzzy classifier, worked well for Computer Science document classification. The rule updating mechanism increased the performance of our approach. We tested the proposed framework on the comprehensive J.UCS dataset against the ACM categorisation hierarchy. In comparison with state-of-the-art classification systems, the accuracy of the proposed approach proved to be better.