DMA/L/z1

From WikiZMSI


In class

  • From the UCI Machine Learning Repository download the 'Wine' data set (for classification). Get familiar with it: how many examples and attributes does it have? What are the attributes? What is the goal of classification for this set?
  • Use the MATLAB script wineDataSet.zip, which preprocesses the data and prepares it for further work: it reads the data from the wine.data file, discretizes each attribute (except the decision attribute) into 5 intervals of equal width, and moves the decision attribute (containing the class of the wine) to the last position (originally it is the first). The script returns a matrix of size 178 x 14, in which examples are rows and attributes are columns.
  • Write a script that randomly partitions the data into two subsets, a training subset and a testing subset, according to a given proportion (e.g. 70% for the training data) passed as an argument to the script.
  • Write a script constructing a naive Bayes classifier: it should take the training data as an argument and build suitable matrices containing the conditional probability distributions of the input attributes, together with a vector of a priori probabilities of the classes. The distributions are returned as the results and will be used later at the classification stage.
  • Write a classifying script. It should take as input the object (wine) to be classified, as a vector, and the probability distributions from the previous script. The script should return the class predicted for the object (wine) by the naive Bayes classifier.
  • Write a script calculating the classification accuracy (the frequency of correctly classified examples) for a given data set (either training or testing).
  • Equip your classifier-building script with a switch turning the Laplace correction on or off. With the Laplace correction turned on, repeat the previous operations (build the classifier and test it).
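The steps above could be sketched roughly as follows. This is an illustrative Python/NumPy version (the lab itself is done in MATLAB), and all names here (split_data, train_nb, classify, accuracy) are my own, not part of the assignment:

```python
import numpy as np

def split_data(data, train_frac=0.7, rng=None):
    # Randomly permute the rows and split them into training / testing subsets.
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(data))
    cut = int(train_frac * len(data))
    return data[idx[:cut]], data[idx[cut:]]

def train_nb(train, n_values=5, laplace=False):
    # train: examples as rows; the last column is the class label.
    # Attribute values are assumed to be discretized into 1..n_values.
    X, y = train[:, :-1].astype(int), train[:, -1].astype(int)
    alpha = 1 if laplace else 0          # Laplace correction switch
    # A priori probability of each class.
    priors = {c: np.mean(y == c) for c in np.unique(y)}
    # cond[c][j, v-1] = P(attribute j has value v | class c).
    cond = {}
    for c in priors:
        Xc = X[y == c]
        m = np.empty((X.shape[1], n_values))
        for j in range(X.shape[1]):
            for v in range(1, n_values + 1):
                m[j, v - 1] = (np.sum(Xc[:, j] == v) + alpha) \
                              / (len(Xc) + alpha * n_values)
        cond[c] = m
    return priors, cond

def classify(x, priors, cond):
    # argmax over classes of P(c) * prod_j P(x_j | c), computed in log space.
    best, best_lp = None, -np.inf
    for c, m in cond.items():
        with np.errstate(divide='ignore'):  # log(0) = -inf without Laplace
            lp = np.log(priors[c]) \
                 + np.sum(np.log(m[np.arange(len(x)), np.asarray(x) - 1]))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

def accuracy(data, priors, cond):
    # Frequency of correctly classified examples in the given data set.
    X, y = data[:, :-1].astype(int), data[:, -1].astype(int)
    return sum(classify(x, priors, cond) == t for x, t in zip(X, y)) / len(data)
```

With laplace=True every count is increased by 1 and every denominator by the number of attribute values, so no conditional probability is ever exactly zero and a single unseen attribute value cannot veto a class.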

At home

  • Find another data set for yourself (it can also be taken from UCI, but does not have to be) for which a classification task can be formulated. The data set should be suitably large: at least 1000 examples (instances) and at least 20 attributes.
  • Continuous attributes should be discretized (according to a scheme of your own, but without losing too much information). Your program should not impose the restriction that all attributes have domains of the same size, i.e. the same number of attainable values (that was a simplification for the laboratory).
  • Write a script performing the K-fold cross-validation.
  • Build a naive Bayes classifier for your data and test it using K-fold cross-validation (e.g. for K = 3, K = 5, K = 10 and K = number of instances).
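The K-fold procedure could be sketched as below, again in Python/NumPy for illustration. The training and accuracy routines are passed in as functions so the sketch stays independent of the classifier; the name kfold_cv and the callback signatures are my own assumptions:

```python
import numpy as np

def kfold_cv(data, k, train_fn, acc_fn, rng=None):
    # Shuffle the rows, split them into k (nearly) equal folds; each fold
    # serves once as the test set while the remaining k-1 folds are used
    # for training. Returns the mean accuracy over the k runs.
    # With k = len(data) this degenerates to leave-one-out cross-validation.
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(len(data)), k)
    scores = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(data[train_idx])         # e.g. the naive Bayes trainer
        scores.append(acc_fn(data[folds[i]], model))
    return np.mean(scores)
```

For the assignment, train_fn would wrap the naive Bayes building script and acc_fn the accuracy script, and the whole call would be repeated for K = 3, 5, 10 and K = number of instances.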