A novel mutual dependence measure in structure learning

Mutual dependence between features plays an important role in the formulation of classifiers, clustering and other machine intelligence techniques. In this study a novel measure of mutual information, known as integration to segregation (I2S), which explains the relationship between two features, is proposed. Some important characteristics of the proposed measure were investigated and its performance in terms of class imbalance measures was compared. It was shown that I2S possesses characteristics that are useful in controlling overfitting problems. In structure learning techniques such as Bayesian belief networks, conventional measures of the dependency relationship cope with the overfitting problem by restricting the number of parents for a node; however, this remedy is not fully satisfactory because overfitting is not completely eliminated. In contrast, I2S is capable of significantly maximizing the discriminant function with better control of overfitting in the formulation of structure learning.


INTRODUCTION
Various computational techniques have produced large amounts of data dealing with multifarious complexities and noticeable heterogeneity, yielding uncertainties and risks. Machine learning and data mining techniques have enabled researchers to extract useful patterns from large datasets. Classification is a notable technique in machine learning and data mining. A classifier can be defined as a function that assigns a class label to objects described by a set of attributes. The dataset over the attributes X contains N labelled instances, and the objective of the learning phase of a classifier is to correctly predict the class label of a new data instance. Among the many classification systems introduced, a Bayesian belief network (BBN) is considered a robust technique by virtue of its ability to decompose complex probabilistic models into brief and tractable elements (Jensen & Neilson, 2007). The data mining community has used it extensively in knowledge discovery tools due to its solid statistical foundation and its capability for inference (Cooper & Herskovits, 1992; Chen et al., 2008; Etminani et al., 2010; Carvalho et al., 2011). The BBN is a strong probabilistic model for knowledge representation.
A BBN is drawn as a directed acyclic graph (DAG) representing a set of conditional probability distributions, one for each stochastic node of the DAG, whereas each arc between two nodes represents the direction of inference or induction. A node (child) that is directly pointed to by another node (parent) receives inference from its parent node(s), while the parent node obtains induction from the child node in terms of the probabilistic distribution. These concepts of inference and induction are helpful in formulating BBN classifiers.
The mutual dependence and correlation between two attributes of a dataset is a key problem in the sphere of structure learning. Numerous pairwise measures have been introduced to explain a particular or general relationship (Gibbons & Subhabrata, 2003; Wasserman, 2007; Corder & Foreman, 2009; Bagdonavicius et al., 2011). However, it has been described that correlation and dependence are intrinsically different phenomena. Although wide application of correlation in various domains of interest has been reported, a careful examination of correlation measures highlights two problems in structure learning. The first issue is its incapability of describing the nonlinear structure between random variables. It has been pointed out that two variables being uncorrelated does not imply that they are independent of each other (Grimmett & Stirzaker, 2001). The second problem is its inability to provide circumscribed knowledge about the true nature of the underlying dependence (Grimmett & Stirzaker, 2001). Thus arises the dictum that "correlation is unable to imply causation", emphasizing that correlation is not well suited to classification problems for the sake of establishing causal relationships between variables (Aldrich, 1995). Jensen & Neilson (2007) elaborated two important characteristics for scoring functions used in the belief network: (a) the ability of a scoring metric to balance the accuracy of a structure against the complexity of that structure and (b) the computational tractability of the scoring function (metric). Bayesian information criterion (BIC) (Schwarz, 1978), Bayesian Dirichlet equivalence uniform (BDeu) (Buntine, 1991), Akaike information criterion (AIC) (Akaike, 1974), entropy and minimum description length (MDL) (Lam & Bacchus, 1994; Suzuki, 1996), and factorized conditional log-likelihood (fCLL) (Carvalho et al., 2011) have been reported to satisfy these characteristics. Among these scoring functions, BIC, AIC, BDeu and MDL are based on the log-likelihood (LL) of a directed acyclic graph G given the dataset D. The three other counters, n, q_i and r_i, represent the number of cases, the number of distinct states of a feature variable and the number of distinct states of a parent of the i-th feature variable. The log-likelihood tends to increase as the number of features increases. This occurs because every added edge tends to contribute to the log-likelihood of the final structure. This process can be controlled considerably by introducing some penalty factor or otherwise restricting the number of parents for every node in the graph.
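The tendency of the log-likelihood to grow with every added edge can be illustrated with a minimal sketch. It assumes the usual decomposable form, in which each (parent configuration, child state) count is multiplied by the logarithm of its conditional relative frequency. The function name `log_likelihood`, the dictionary-based data layout and the toy dataset are hypothetical and are not taken from the study.

```python
import math
from collections import Counter

def log_likelihood(data, parents):
    """Log-likelihood of a structure under maximum-likelihood parameters.

    data:    list of dicts mapping variable name -> discrete value (assumed layout)
    parents: dict mapping variable name -> tuple of its parent variable names
    """
    total = 0.0
    for var, pa in parents.items():
        joint = Counter()   # counts of (parent configuration, child state)
        config = Counter()  # counts of parent configuration alone
        for row in data:
            key = tuple(row[p] for p in pa)
            joint[(key, row[var])] += 1
            config[key] += 1
        for (key, _state), count in joint.items():
            total += count * math.log(count / config[key])
    return total

# Toy dataset with two binary features (illustrative only).
data = [
    {"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 0, "B": 1},
    {"A": 1, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 0},
]

ll_empty = log_likelihood(data, {"A": (), "B": ()})     # no edges
ll_edge = log_likelihood(data, {"A": (), "B": ("A",)})  # edge A -> B added
print(ll_empty, ll_edge)
```

In this toy case the structure with the extra edge scores higher than the empty one, which is why a penalty term or a limit on the number of parents is needed to avoid favouring ever denser graphs.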
AIC and BIC are usually applied under the hypothesis that the regression orders k and i are identical. This assumption brings extra computation and also yields erroneous estimation of information-theoretic measures in structure learning (Yang et al., 2013). Yang & Lee (2012) demonstrated a linear improvement in model quality when the BIC scoring function is used in K2 (Cooper & Herskovits, 1992). However, it is arguable that there must be an intelligent heuristic to extrapolate the optimized size of the training data. We are of the view that an optimized solution can be achieved by exploiting various intelligent algorithms for trees and graphs.

METHODS AND MATERIALS
In the previous section a brief notion was given of the decomposability of various scoring measures into a frequency counting problem in structure learning. This frequency counting problem leads to a deficiency in discriminative approaches when a sink node has to be identified correctly. An improved measure of approximation based on joint and marginal probability is proposed, with the following hypothesis.
Hypothesis H1: I2S is a tractable approximation to correctly identify the topology between a pair of nodes in a DAG for structure learning.
I2S details the relationship between two features such that the states of a dependent feature can be explained as a result of the states of the independent feature. Two assumptions are essential for defining I2S mathematically. The first is the discrete nature of the dataset features. The second is that each case of the dataset holds an independent probabilistic nature. I2S can be expressed by means of definitions 1 and 2.
Definition 1: Given two features F1 and F2, I2S can be expressed mathematically in terms of conditional probability (CP), which is a function of joint probability (JP) and marginal probability (MP). The terms m and n denote the vector lengths of the 1st feature F1 and the 2nd feature F2, respectively, in equation 3. There are four terms involved in the mathematical equation of I2S. This characteristic is most important to correctly identify the true order of the two nodes in structure learning for decision making.
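The building blocks named in Definition 1 can be illustrated with a short sketch that estimates joint, marginal and conditional probabilities from two discrete feature vectors by frequency counting. It relies only on the standard identity CP = JP / MP and does not reproduce the I2S formula itself. The function name `probability_tables` and the toy feature values are hypothetical.

```python
from collections import Counter

def probability_tables(f1, f2):
    """Estimate JP(f1, f2), MP(f1) and CP(f2 | f1) by relative frequencies."""
    if len(f1) != len(f2):
        raise ValueError("both features must have one value per case")
    n = len(f1)
    joint = {pair: c / n for pair, c in Counter(zip(f1, f2)).items()}
    marginal = {a: c / n for a, c in Counter(f1).items()}
    # Conditional probability as joint divided by the marginal of the conditioning feature.
    conditional = {(a, b): jp / marginal[a] for (a, b), jp in joint.items()}
    return joint, marginal, conditional

# Toy discrete features (illustrative values only).
f1 = ["x", "x", "y", "y"]
f2 = ["p", "p", "p", "q"]

jp, mp, cp = probability_tables(f1, f2)            # F2 conditioned on F1
jp_rev, mp_rev, cp_rev = probability_tables(f2, f1)  # F1 conditioned on F2
print(cp)
print(cp_rev)
```

The two conditional tables differ when the roles of the features are swapped, which is the kind of asymmetry a directional measure between a parent and a sink node can exploit.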
I2S Network: It has been reported that the greedy approach is more popular in building and learning belief networks (Carvalho et al., 2011). Moreover, it has been described that K2 (Cooper & Herskovits, 1992) is one of the most optimized search algorithms for Bayesian networks. In the K2 algorithm, the ordering of the features is known a priori, which helps in selecting the most suitable set of parents for each feature. Its input is a set of topologically sorted nodes. Every node in this set is scanned, and the preceding nodes are added as parents repeatedly until the resulting score, given by the joint probability of the data and the network structure, no longer increases. Some notation based on well-known and relevant concepts of discrete belief networks is introduced, and these concepts are formulated into a structure learner devised on the basis of I2S. I2S is a measure of the dependency (explanation) of one feature on another feature. It is a direct measurement of the cardinal relationship, in the sense that if any distinct value of feature 2 is addressed by only a single value of feature 1, the value of I2S increases, where I2S is normalized between 0 and 1. It is described formally as Î: I2S(F1 → F2). The notation Î is useful in defining the value of I2S from the 1st feature (F1) to the 2nd feature (F2). For a dataset D, a pairwise matrix of Î can be defined;
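The greedy scan described for K2 can be sketched as follows. This is a simplified illustration, not the original K2 implementation: the scoring function is passed in as a parameter, the cap `max_parents` and the toy score table are assumptions, and the names `k2_search` and `toy_score` are hypothetical.

```python
def k2_search(order, score, max_parents=2):
    """Greedy parent selection over a fixed node ordering.

    order: list of node names, assumed to be topologically sorted
    score: callable (node, parents_tuple) -> float, higher is better
    """
    structure = {}
    for idx, node in enumerate(order):
        parents = []
        best = score(node, tuple(parents))
        candidates = list(order[:idx])  # only predecessors may become parents
        improved = True
        while improved and candidates and len(parents) < max_parents:
            improved = False
            scored = [(score(node, tuple(parents + [c])), c) for c in candidates]
            top_score, top = max(scored)
            if top_score > best:  # keep the parent only if the score increases
                best = top_score
                parents.append(top)
                candidates.remove(top)
                improved = True
        structure[node] = tuple(parents)
    return structure

# Hypothetical per-edge gains, with a fixed charge per parent as a penalty.
toy_gain = {("B", "A"): 0.4, ("C", "A"): 0.1, ("C", "B"): 0.5}

def toy_score(node, parents):
    return sum(toy_gain.get((node, p), 0.0) for p in parents) - 0.2 * len(parents)

result = k2_search(["A", "B", "C"], toy_score)
print(result)
```

Because the score is a parameter, any decomposable measure (for example BIC, BDeu or a pairwise dependence measure) could be plugged in without changing the search loop.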

I2S Network Classifiers:
An ordered list of the features was developed using I2S. Let M be a matrix in which each element corresponds to the measurement of I2S from the i-th feature to the j-th feature. A sorted list is derived from M according to a sorting criterion, which results in an ordered list of the features. An I2S-based network classifier is a network over X = (X1, X2, X3, …, Xn, C) where the feature C is considered as the class; hence the goal is to classify the instances (X1, X2, X3, …, Xn) in terms of the distinct states of the class. In the literature, it is common to restrict the search to augmented naive Bayes classifiers for the sake of computational efficiency (Carvalho et al., 2011), where the class feature is placed at the top of the graph with no parents. This relaxation is based on the assumption that the goal is to retrieve the best possible structure, which truly represents the underlying dataset. All query variables that have a parent node within the DAG must have various instances of unique states, where C represents the unique values of the feature. We shall introduce notation related to non-augmented naïve Bayes models. Let the i-th parent variable of any feature possess distinct values denoted as m_ij, where j is the number of unique values that the i-th parent holds. The number of possible configurations of the parent set of any feature can then be described in terms of these counts. This description is useful in defining the conditional probability table (CPT). The generation of the CPT turns the network into a classifier. A given instance of data can be tested against this conditional probability table for its inference or induction.
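The role of the parent configurations and of the CPT can be sketched in a few lines. The sketch assumes that the number of configurations is the product of the parents' numbers of distinct values and that CPT entries are estimated by relative frequencies; both are standard choices rather than details confirmed by the text. The names `parent_configurations` and `build_cpt` and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict
from math import prod

def parent_configurations(cardinalities):
    """Number of joint configurations of a parent set (1 for no parents)."""
    return prod(cardinalities)

def build_cpt(rows, child, parents):
    """Estimate P(child | parents) by frequency counting over the rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[tuple(row[p] for p in parents)][row[child]] += 1
    cpt = {}
    for config, counter in counts.items():
        total = sum(counter.values())
        cpt[config] = {state: c / total for state, c in counter.items()}
    return cpt

# Toy cases with a class C and two features (illustrative values only).
rows = [
    {"C": "yes", "X1": "a", "X2": "lo"},
    {"C": "yes", "X1": "a", "X2": "lo"},
    {"C": "yes", "X1": "a", "X2": "hi"},
    {"C": "no", "X1": "b", "X2": "hi"},
]

cpt = build_cpt(rows, child="X2", parents=["C", "X1"])
n_configs = parent_configurations([2, 2])  # C and X1 each have two states
print(n_configs, cpt)
```

Only the parent configurations that occur in the data receive a CPT row here; a practical classifier would also need a smoothing rule for unseen configurations.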

RESULTS
This section presents the results with their empirical validation in detail. The performance of the proposed measure used in the introduced classifiers is measured by accuracy, which is a function of the true positive (TP) rate and the false positive (FP) rate. Experimentation was performed on 29 datasets obtained from UCI (Blake & Merz, 1998), which were preprocessed into the weka (Hall et al., 2009) arff file format. No further preprocessing was done on these datasets except on five datasets marked by (*) (Table 1), in which the class feature was placed as the last attribute (this is a mandatory requirement of weka). The flags dataset was a class-less dataset, so the feature 'religion' was fixed as its class attribute. These datasets contain nominal, continuous and discrete features, while some datasets also contain missing cases, which were ignored by default in weka. It is evident from Table 1 that the datasets vary widely in the number of classes, cases and attributes, which limits the risk of bias.
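How accuracy relates to the TP rate and the FP rate can be shown with a small sketch. It assumes the standard binary confusion-matrix definitions, which may differ in form from the expression used in the study; the function names and the counts are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified cases."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total

def tp_rate(tp, fn):
    """True positive rate: correctly detected share of actual positives."""
    return tp / (tp + fn)

def fp_rate(fp, tn):
    """False positive rate: wrongly flagged share of actual negatives."""
    return fp / (fp + tn)

# Illustrative counts only.
tp, tn, fp, fn = 40, 45, 5, 10

acc = accuracy(tp, tn, fp, fn)
tpr = tp_rate(tp, fn)
fpr = fp_rate(fp, tn)

# Accuracy rewritten through the two rates and the share of positive cases.
positive_share = (tp + fn) / (tp + tn + fp + fn)
acc_from_rates = positive_share * tpr + (1 - positive_share) * (1 - fpr)
print(acc, tpr, fpr, acc_from_rates)
```

The last expression makes the dependence explicit: for a fixed class distribution, accuracy rises with the TP rate and falls with the FP rate.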
Figure 1, which is a stacked cylindrical graph, compares the accuracy of the six scoring functions with that of the introduced measure. Each cylinder is shown in three colours. The blue colour indicates the percentage of datasets in which the performance of I2S was significantly better than that of the other scoring function. The red colour indicates the number of datasets where the proposed measure delivers neither better nor poorer classification accuracy. The green colour indicates the number of datasets in which I2S failed to yield better results. A careful examination of Figure 1 shows that the accuracy of I2S was higher than that of AIC and entropy: I2S delivers improved accuracy over 22 and 21 datasets, while it does not give better results over 3 and 5 datasets, respectively. The recently introduced scoring function fCLL gives comparatively better accuracy than the other five scoring functions when competing with I2S.
Apart from the results shown in Figure 1, one may argue that the number of datasets with higher accuracy is less informative than the percentage improvement in accuracy. This motivates the presentation of the results from another perspective in Figure 2, which indicates the average percentage improvement in accuracy achieved by using the I2S classifier in the K2 search algorithm. In the case of the entropy measure, the average increase in accuracy was more than 7.5 %, while it was 1.19 % in comparison to BDeu.
To roughly characterize the computational complexity of the proposed scoring measure, it was noted that the time complexity of I2S was more or less equivalent to that of BDeu and BIC. However, the time complexity of entropy was slightly better than that of I2S. Moreover, the time complexity of AIC and MDL was significantly better than that of I2S in many of the datasets.

CONCLUSION AND FUTURE WORK
In classification, structure prediction from Bayesian inference models is a common practice for retrieving hidden rules from masses of data. This process broadly consists of two steps. The first step deals with the construction of the most suitable structure from the data and the second with inference from this structure. This study focused on the first step, the construction of the most suitable network structure. The core part in the design of a BBN classifier is to introduce a discriminant function within the vector space of attributes through the utilization of a priori knowledge. The effectiveness of the Bayesian belief network using greedy heuristics such as the K2 search mechanism has earned it an excellent place in the domain of classification systems. Arguments were presented about various scoring functions, including BDeu, AIC, entropy, BIC, MDL and the recently introduced fCLL, on the ground of overfitting, while a new dependency measure was introduced in the domain of structure learning. Theoretically, the application of mutual information in structure learning is not a novel idea, as it was introduced several decades ago (Chow & Liu, 1968; Pearl, 1988). In this study a novel decomposable scoring function was introduced for the task of structure learning. The introduced measure, known as integration to segregation (I2S), is characterized by mutual dependence approximated by marginal and joint probability. The novel measure is particularly designed for discriminative learning because it is decomposable and score-equivalent, and it permits efficient estimation in structure learning. The accuracy of I2S was evaluated and compared to common state-of-the-art scoring measures on a reasonably sized set of benchmark datasets obtained from the UCI repository and preprocessed in weka. I2S performed better than generatively-trained Bayesian network classifiers using the K2 search algorithm and numerous scoring functions. The proposed measure is expected to generate a realistic network, which is likely to tally with the practical thinking of field experts in the domain of knowledge. Although the asymptotic complexity of the proposed measure is almost of the same order as that of the conventional BIC and BDeu scoring metrics, it is still poorer in computational complexity than MDL in particular.

Acknowledgement
We are very grateful to the anonymous reviewers, who offered numerous insightful comments during the revision of this article.

Definition 2:
CP among all of the states of the 2nd feature. The term )CP of all of the states of the second feature. The factor m / (m-1) and MP_i are used for scaling and normalizing, so that the final value of I2S always lies between 0 and 1. In the forthcoming section, the results of various feature selection techniques are compared with the technique based on the proposed measure I2S. Given a directed acyclic graph (DAG), I2S is sensitive to the order of a sink node and its parent node. A swap changes the value of I2S, which is useful in the development of structure learning.
arff file format. No further preprocessing was done on these datasets except on
September 2013 Journal of the National Science Foundation of Sri Lanka 41(3)

Table 1: Statistical information about the datasets used in this study