ItAI/LS/z3

Classroom

Write a Python program for an NBC (discrete variant) that performs classification on the "wine" data set (recognition of wines based on their chemical composition).

  • From the UCI repository, download the "wine" data set and get familiar with it. Attention: check which of the variables is the decision variable.
  • Read the data from the text file wine.data into a numpy array (take advantage of the numpy.genfromtxt function), and split that array into two arrays: X (178 x 13, the features) and y (178 x 1, the class labels).
  • Write a function that splits the data into a training part and a testing part (according to a proportion given as an argument).
  • Write a function that discretizes the continuous data into a specified number of bins (subintervals) of equal width. (A sketch of these data-handling helpers is given after this list.)
  • Write a class that represents the NBC (discrete variant) and is compatible with the scikit-learn library. That means: inherit from the BaseEstimator and ClassifierMixin classes (both available in the sklearn.base package) and implement the key methods: fit (to learn from data), predict (to classify data), and predict_proba as a helper function. Reflect on choosing convenient data structures to store the needed probability distributions: the a priori distribution (one-dimensional) and all the conditional distributions (three-dimensional). These could be arrays, lists, dictionaries, or suitable combinations of them. Remember that the NBC needs to know and store the domain sizes of the discrete input variables. (A possible class skeleton is sketched after this list.)
  • Attention: at the lab stage, the computations of the NBC's responses in the predict_proba function can be performed according to the definition, i.e. as a product of probabilities (without taking logarithms).
  • Fit the classifier and establish its accuracy on both the training and the testing data (see the usage example after this list).
  • Introduce a parameter that allows the Laplace correction to be turned on/off. Repeat the computations with the Laplace correction turned on.
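
A minimal sketch of the data-handling helpers described above is given below. It assumes the UCI layout of wine.data (class label in the first column, comma-separated values); the function names, the default 75/25 proportion, and the default number of bins are only suggestions, not part of the assignment.

    import numpy as np

    def load_wine_data(path="wine.data"):
        # In wine.data the first column is the class label (1, 2 or 3),
        # the remaining 13 columns are the chemical measurements.
        data = np.genfromtxt(path, delimiter=",")
        y = data[:, 0].astype(int)   # shape: (178,)
        X = data[:, 1:]              # shape: (178, 13)
        return X, y

    def train_test_split(X, y, train_fraction=0.75, seed=0):
        # Shuffle the indices and cut them according to the given proportion.
        rng = np.random.default_rng(seed)
        indices = rng.permutation(len(y))
        n_train = int(round(train_fraction * len(y)))
        train_idx, test_idx = indices[:n_train], indices[n_train:]
        return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

    def discretize(X, n_bins=5, mins=None, maxes=None):
        # Equal-width discretization: every continuous column is mapped to
        # integer bin indices 0 .. n_bins-1. The mins/maxes computed on the
        # training data should be reused for the testing data.
        if mins is None:
            mins = X.min(axis=0)
        if maxes is None:
            maxes = X.max(axis=0)
        widths = (maxes - mins) / n_bins
        bins = np.floor((X - mins) / widths).astype(int)
        return np.clip(bins, 0, n_bins - 1), mins, maxes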
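One possible skeleton of the classifier follows (a sketch, not a reference solution). It assumes the inputs have already been discretized to integer bin indices, stores the a priori distribution in a 1-D array and the conditional distributions in a 3-D array of shape (classes x features x bins), and exposes a laplace constructor flag; the class and attribute names are arbitrary.

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin

    class DiscreteNBC(BaseEstimator, ClassifierMixin):
        def __init__(self, domain_sizes=None, laplace=False):
            self.domain_sizes = domain_sizes   # number of bins per feature
            self.laplace = laplace

        def fit(self, X, y):
            self.classes_ = np.unique(y)
            n_classes = len(self.classes_)
            n_features = X.shape[1]
            max_domain = int(max(self.domain_sizes))
            # A priori distribution P(Y = c), one-dimensional.
            self.priors_ = np.zeros(n_classes)
            # Conditional distributions P(X_j = v | Y = c), three-dimensional.
            self.cond_ = np.zeros((n_classes, n_features, max_domain))
            for ci, c in enumerate(self.classes_):
                Xc = X[y == c]
                self.priors_[ci] = Xc.shape[0] / X.shape[0]
                for j in range(n_features):
                    counts = np.bincount(Xc[:, j], minlength=max_domain).astype(float)
                    if self.laplace:
                        # Laplace correction: add 1 to every count.
                        self.cond_[ci, j] = (counts + 1) / (Xc.shape[0] + self.domain_sizes[j])
                    else:
                        self.cond_[ci, j] = counts / Xc.shape[0]
            return self

        def predict_proba(self, X):
            # Responses computed directly from the definition: a product of
            # probabilities (no logarithms at this stage).
            scores = np.zeros((X.shape[0], len(self.classes_)))
            for i, x in enumerate(X):
                for ci in range(len(self.classes_)):
                    p = self.priors_[ci]
                    for j, v in enumerate(x):
                        p *= self.cond_[ci, j, v]
                    scores[i, ci] = p
            # Normalize each row to obtain posterior probabilities.
            sums = scores.sum(axis=1, keepdims=True)
            return np.divide(scores, sums, out=np.zeros_like(scores), where=sums > 0)

        def predict(self, X):
            return self.classes_[np.argmax(self.predict_proba(X), axis=1)]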
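A short usage example tying the helpers and the class together (the 5 bins and the 75/25 split are arbitrary choices; score is inherited from ClassifierMixin and reports accuracy):

    X, y = load_wine_data("wine.data")
    X_train, y_train, X_test, y_test = train_test_split(X, y, train_fraction=0.75, seed=0)

    n_bins = 5
    X_train_d, mins, maxes = discretize(X_train, n_bins)
    X_test_d, _, _ = discretize(X_test, n_bins, mins, maxes)

    clf = DiscreteNBC(domain_sizes=[n_bins] * X.shape[1], laplace=True)
    clf.fit(X_train_d, y_train)
    print("train accuracy:", clf.score(X_train_d, y_train))
    print("test accuracy: ", clf.score(X_test_d, y_test))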

Homework

  • Prepare an NBC for a new data set of your own choice (at least 1000 examples and at least 20 features).
  • Get familiar with the variables in the new data set and suitably discretize only its continuous variables. Carry out experiments and report the obtained testing accuracy. Check the influence of the Laplace correction and of the number of discretization bins on the accuracy. (An example experiment loop is sketched after this list.)
  • Introduce a parameter that allows numerically safe computations to be turned on/off. To achieve this, modify the implementation suitably and apply logarithms where needed. That means: be able to compute the responses via sums of logarithms of probabilities (rather than via products of probabilities). Hint: instead of computing the logarithms on the fly in the predict_proba function, you may compute and memorize them already at the fit stage. Try to arrange a numerically dangerous situation (hundreds of features present) and compare the numerically safe and unsafe computations. (See the logarithm-based sketch after this list.)
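
For the experiments, a simple loop over the number of bins and the Laplace flag may look as follows (a sketch assuming the helpers and the DiscreteNBC class from the classroom part; the bin counts are arbitrary):

    for n_bins in (3, 5, 10, 20):
        X_train_d, mins, maxes = discretize(X_train, n_bins)
        X_test_d, _, _ = discretize(X_test, n_bins, mins, maxes)
        for laplace in (False, True):
            clf = DiscreteNBC(domain_sizes=[n_bins] * X_train.shape[1], laplace=laplace)
            clf.fit(X_train_d, y_train)
            print(f"bins={n_bins:2d} laplace={laplace}: "
                  f"test accuracy = {clf.score(X_test_d, y_test):.3f}")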
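A possible shape of the numerically safe variant: the logarithms are memorized at the fit stage and the responses are computed as sums of log-probabilities, normalized with the log-sum-exp trick. The attribute names log_priors_ and log_cond_ and the function name below are assumptions, not part of the assignment; with the Laplace correction on, no zero probabilities (and hence no -inf logarithms) appear.

    import numpy as np

    # Inside fit(), after the probability tables are filled in, the logarithms
    # can be memorized once (guarded by the new on/off parameter):
    #     with np.errstate(divide="ignore"):     # log(0) -> -inf is acceptable here
    #         self.log_priors_ = np.log(self.priors_)
    #         self.log_cond_ = np.log(self.cond_)

    def predict_proba_safe(self, X):
        # Numerically safe responses: sums of log-probabilities instead of
        # products, then the log-sum-exp trick to normalize without underflow.
        log_scores = np.zeros((X.shape[0], len(self.classes_)))
        for i, x in enumerate(X):
            for ci in range(len(self.classes_)):
                s = self.log_priors_[ci]
                for j, v in enumerate(x):
                    s += self.log_cond_[ci, j, v]
                log_scores[i, ci] = s
        # Subtract the row maximum before exponentiating to avoid underflow.
        log_scores -= log_scores.max(axis=1, keepdims=True)
        scores = np.exp(log_scores)
        return scores / scores.sum(axis=1, keepdims=True)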