## Research Software

**[1] Tree-Augmented Naïve (TAN) Bayes Network (MATLAB Implementation)**

The TAN Bayes network generalizes the naïve Bayes network by allowing more flexibility in the network structure, and, given the structural restrictions, the optimal structure can still be learned in polynomial time. It is an attractive *middle path* between a naïve Bayes network and a full Bayes network. To the best of my knowledge, this model and algorithm were proposed in the following paper, which serves as an excellent source for the related technicalities and theorems.

· N. Friedman, D. Geiger, and M. Goldszmidt, **Bayesian Network Classifiers**, *Machine Learning*, 29(2-3):131-163, 1997. **[Externally Hosted PDF]**

I found it hard to find *easy-to-use* MATLAB implementations for training and testing the TAN Bayes network, so I wrote my own. Note that the software only works with *binary-valued feature vectors* and *binary class variables*, although it is straightforward to extend, especially to discrete/continuous-valued features. Here you can download the code and use it __for research purposes only__. The training part is an implementation of the **Construct-TAN** procedure described in the paper above. The links are given below. To obtain these files, right-click on them and choose "Save Link As". Unzip all the files into the same directory/folder. When using the programs with MATLAB, make sure you either (a) change the *current directory* to this directory, or (b) add it to your MATLAB path (recommended).
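For readers without MATLAB, the core idea of the **Construct-TAN** procedure from the paper above — compute the conditional mutual information between every pair of features given the class, build a maximum-weight spanning tree over the features, and direct its edges away from an arbitrary root — can be sketched in Python. This is an illustrative re-implementation, not the package itself; the function names (`cond_mutual_info`, `construct_tan_tree`) and the choice of feature 0 as root are my own.

```python
import numpy as np

def cond_mutual_info(x, y, c):
    """Empirical conditional mutual information I(X; Y | C) for binary 0/1 arrays."""
    mi = 0.0
    for cv in (0, 1):
        pc = np.mean(c == cv)
        for xv in (0, 1):
            for yv in (0, 1):
                pxyc = np.mean((x == xv) & (y == yv) & (c == cv))
                pxc = np.mean((x == xv) & (c == cv))
                pyc = np.mean((y == yv) & (c == cv))
                if pxyc > 0 and pxc > 0 and pyc > 0:
                    mi += pxyc * np.log(pxyc * pc / (pxc * pyc))
    return mi

def construct_tan_tree(avar, cvar):
    """Return a D x D adjacency matrix over the features, with tree edges
    directed away from feature 0 (an arbitrary root), Construct-TAN style."""
    n, d = avar.shape
    # Pairwise conditional mutual information serves as the edge weight.
    w = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            w[i, j] = w[j, i] = cond_mutual_info(avar[:, i], avar[:, j], cvar)
    # Maximum-weight spanning tree via Prim's algorithm, grown from feature 0.
    in_tree = {0}
    adj = np.zeros((d, d), dtype=int)
    while len(in_tree) < d:
        i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                   key=lambda e: w[e])
        adj[i, j] = 1  # direct the new edge away from the root
        in_tree.add(j)
    return adj
```

In the full TAN model the class variable is then added as an extra parent of every feature; it is left out of the adjacency matrix here, mirroring the *graph* output described below.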

· **TAN Package** *(sorry, this is no longer available for download)*

The package consists of the following (main) programs of interest:

· __Training:__ [prob, graph] = TAN_learn(cvar, avar). **Input:** *avar* is an (N×D) matrix of N training samples, one per row, each a D-dimensional binary vector over D features, and *cvar* is an (N×1) vector of binary classes corresponding to the N samples. **Output:** *prob* is the learned model, to be used as input for the testing module, and *graph* is the adjacency matrix of the resulting directed graph (minus the class variable), which can be used as input for the *graph_layout()* module (written by a group at the University of Nijmegen) that allows you to visualize the graph in MATLAB (described below).

· __Testing:__ [res] = TAN_classify(prob, tdata). **Input:** *prob* is the model learned by the *TAN_learn()* procedure, and *tdata* is an (L×D) matrix containing the test data: L test cases, one per row, each a D-dimensional binary vector as in training. **Output:** *res* is an (L×2) matrix, one row per test vector, where the (:,1) values are proportional to the probability that *class=0* and the (:,2) values are proportional to the probability that *class=1*, conditional on the corresponding vectors. The priors over classes 0 and 1 are taken to be equal, but you can multiply these numbers by any choice of priors as a post-processing step. The odds ratio is one way to use these numbers to assign class labels to the test vectors.
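To make the equal-priors scoring concrete, here is a hedged Python sketch of what parameter estimation and scoring for a binary-feature TAN classifier can look like, mirroring the roles of *prob* and *res* described above. The helper names (`tan_fit_cpts`, `tan_scores`) and the Laplace smoothing are my own assumptions, not the package's actual internals.

```python
import numpy as np

def tan_fit_cpts(avar, cvar, adj, alpha=1.0):
    """Estimate P(x_j = 1 | parent value, class) with Laplace smoothing.
    The root feature's only parent is the (implicit) class variable."""
    n, d = avar.shape
    parent = [None] * d
    for i in range(d):
        for j in range(d):
            if adj[i, j]:
                parent[j] = i
    cpt = {}
    for j in range(d):
        for cv in (0, 1):
            if parent[j] is None:
                sel = cvar == cv
                cpt[(j, None, cv)] = (avar[sel, j].sum() + alpha) / (sel.sum() + 2 * alpha)
            else:
                for pv in (0, 1):
                    sel = (cvar == cv) & (avar[:, parent[j]] == pv)
                    cpt[(j, pv, cv)] = (avar[sel, j].sum() + alpha) / (sel.sum() + 2 * alpha)
    return parent, cpt

def tan_scores(tdata, parent, cpt):
    """Unnormalized class scores under equal priors: an (L x 2) matrix where
    column cv is proportional to P(class = cv | test vector)."""
    L, d = tdata.shape
    res = np.ones((L, 2))
    for cv in (0, 1):
        for j in range(d):
            if parent[j] is None:
                p1 = np.full(L, cpt[(j, None, cv)])
            else:
                pv = tdata[:, parent[j]]
                p1 = np.where(pv == 1, cpt[(j, 1, cv)], cpt[(j, 0, cv)])
            res[:, cv] *= np.where(tdata[:, j] == 1, p1, 1 - p1)
    return res
```

As described above, `res[:, 1] / res[:, 0]` gives the odds ratio, and either column can be rescaled by non-uniform class priors as a post-processing step.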

· __Visualizing the TAN structure:__ graph_layout(graph). __Developed by a group at the University of Nijmegen__, this program takes as input the *graph* generated by *TAN_learn()* and displays the learned TAN structure, omitting the implicit class variable and all its directed edges to the other variables (to avoid clutter). However, adding a row and column for the class variable to the *graph* matrix will generate the full structure. Please refer to the University of Nijmegen's documentation on the software for further technical details.

Feel free to e-mail me with questions. Please re-send if I do not respond.