DAaML/z2

Classification And Regression Trees (CART)

Prepare for Olivetti data a decision column "glasses" that indicates whether a person wears glasses or not. The file with indexes of people with glasses is here: (olivetti_glasses.txt) and can be read via: glasses = np.genfromtxt('olivetti_glasses.txt', delimiter=',') .astype(int)

Prepare data to recognize glasses (with the train/test split) using as features 50 first principal components (from PCA).

Start implementation of CART classifier as a Python class based on the scikit-learn scheme. I.e. the class should extend BaseEstimator and ClassifierMixin classes, and contain in particular the methods: fit and predict. Suggestion: the built tree can be stored as a numpy array, where rows represent tree nodes and columns represent such information as: parent node index, index of feature to split on, split value, indexes of children nodes (left, right), node response (if a leaf), node depth.

Implement three variants of impurity function (misclassification error, entropy, Gini index) and an auxiliary function that finds the conditional probability distribution within a given subset of data (e.g. subset of data indexes).

Implement a recursive function building the CART tree. The function should be called from within the fit function.

Add suitable extra parameters to the constructor that allow for earlier stoppage of recursion based on: tree depth, minimum percentage of data points at a node.

Add the possibility to prune tree via browsing (exhaustive) and scoring subtrees of the full tree. The score should be calculated as: training error + no. of leaves * penalty for each leaf. Suitable recursive procedure for that function can constitute the second stage of fit function (after the full tree becomes built). Browsed subtrees and their scores can be stored in dictionaries (hash map). Hint 1: instead of building smaller numpy arrays for subtrees and rearranging parent-children links, the advised simpler approach is to copy the full tree array as a subtree (with its size kept unchanged) and the to suitably dettach/attach nodes by placing zeros or non-zeros in children's columns; this appraoch can be accompanied with keeping boolean flags of nodes that participate in a given subtree. Hint 2: boolean flags of participating nodes can be used as keys in the dictionaries (for scores and subtrees) but first need to be transformed into tuples or strings to make them hashable in Python.

Check how often subtrees-based pruning improves testing accuracy on Olivetti data ("glasses recognizer") when: using 50 or 100 features (resulting from PCA) and for three randomization seeds (0, 1, 2) to check different train/test splits.

Implement an algorithm with automatic selection of penalty for each leaf (using cross-validation technique).

Carry out suitable experiments and prepare a report (pdf up to 5 pages) on CART classifiers applied as a "glasses recognizer" for Olivetti data. Report the influence on testing accuracy of the following elements: number of applied features, impurity functions, tree reductions / pruning techniques. All experiments should be repeated (in an external loop) over 10 different randomization seeds (influencing the train / test split of data) and averaged.