DAaML/z1

From WikiZMSI

Principal Components Analysis (PCA)

Classroom

  • Download the "Olivetti Faces" data set (40 people x 10 photos x 4096 pixels) using the scikit-learn function sklearn.datasets.fetch_olivetti_faces().
  • Try to display several selected images (useful commands: reshape from numpy and imshow from pyplot).
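The two steps above can be sketched as follows. The download needs network access, so this sketch falls back to random stand-in data of the same shape when fetching fails; the plotting lines are commented out so the snippet also runs headless.

```python
import numpy as np

# Fetch the Olivetti faces; fall back to a random stand-in array offline.
try:
    from sklearn.datasets import fetch_olivetti_faces
    data = fetch_olivetti_faces().data       # shape: (400, 4096)
except Exception:
    data = np.random.rand(400, 4096)         # stand-in with the same shape

print(data.shape)  # (400, 4096): 40 people x 10 photos, 64x64 pixels each

# Reshape one flat 4096-vector back into a 64x64 image for display:
image = data[0].reshape(64, 64)
# import matplotlib.pyplot as plt
# plt.imshow(image, cmap="gray")
# plt.show()
```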
  • Write a helper function (e.g. show_some_images) that visualizes a given number of images or vectors representing images. The function will be useful on several occasions: to visualize input images, eigenfaces, and image reconstructions (with a reduced number of features). Suggested arguments: two-dimensional images or flat vectors with images (the function should handle both variants automatically), indexes (which images to display; all by default), a logical flag deciding whether images are shown in a grid (or linearly, one after another), and a title for the window.
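One possible sketch of such a helper, under an extra assumption that the images are square: it assembles the selected images into a single montage array, so one plt.imshow call can display them all. The matplotlib lines are commented out so the function also works headless.

```python
import math
import numpy as np

def show_some_images(images, indexes=None, as_grid=True, title=None):
    """Visualize selected images given as 2-D arrays or flat vectors.

    Sketch under the assumption of square images; returns the montage
    as one 2-D array that can be handed to plt.imshow.
    """
    images = np.asarray(images)
    if indexes is None:
        indexes = range(len(images))        # all images by default
    selected = [np.asarray(images[i]) for i in indexes]
    # Flat vectors are reshaped to square images automatically.
    side = int(round(math.sqrt(selected[0].size)))
    selected = [im.reshape(side, side) if im.ndim == 1 else im
                for im in selected]
    n = len(selected)
    cols = math.ceil(math.sqrt(n)) if as_grid else n   # grid or one row
    rows = math.ceil(n / cols)
    montage = np.zeros((rows * side, cols * side))
    for k, im in enumerate(selected):
        r, c = divmod(k, cols)
        montage[r * side:(r + 1) * side, c * side:(c + 1) * side] = im
    # import matplotlib.pyplot as plt
    # plt.figure(title)
    # plt.imshow(montage, cmap="gray")
    # plt.title(title)
    # plt.show()
    return montage

# Example call with stand-in data:
montage = show_some_images(np.random.rand(12, 4096), title="input images")
```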
  • Split the data randomly into train and test parts (e.g. in the proportion 80% / 20%) using the function sklearn.model_selection.train_test_split. Use the 'stratify' option (it preserves the distribution of the decision variable in both parts). To keep the partitioning reproducible, use the 'random_state' argument (it sets the randomization seed). From now on, perform all subsequent actions only on the train part of the data; keep the test data unused until the moment of performing image reconstructions and classification.
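A minimal sketch of the split, using random stand-in data with labels shaped like the Olivetti ones (40 classes, 10 samples each):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the Olivetti data: 400 samples, 40 people with 10 photos each.
X = np.random.rand(400, 4096)
y = np.repeat(np.arange(40), 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,       # 80% / 20% split
    stratify=y,          # preserve the class distribution in both parts
    random_state=0)      # reproducible partitioning

print(X_train.shape, X_test.shape)   # (320, 4096) (80, 4096)
```

With stratification every person ends up with exactly 8 training and 2 test photos, which can be verified with np.bincount(y_test).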
  • Get familiar with the following numpy commands: mean, var, std, cov, corrcoef, and eig (from numpy.linalg).
  • Write a function that performs PCA. Suggested arguments: data array, wanted number of dimensions (allowing a reduction). Suggested outputs: eigenvectors and eigenvalues (suitably sorted). Measure the running time of your function. Keep only the real parts of the eigenvectors (for numerical reasons imaginary parts do appear, but they are very close to zero). To enforce the correct order of eigenvalues and eigenvectors, use numpy.argsort.
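A sketch of such a function via eigendecomposition of the covariance matrix (the argument names are only suggestions; the demo uses a small random stand-in instead of the 4096-feature faces to keep it fast):

```python
import time
import numpy as np

def pca(X, components=None):
    """PCA sketch: eigendecomposition of the covariance matrix.

    Returns eigenvalues and eigenvectors (as columns) sorted by
    decreasing eigenvalue; 'components' optionally truncates both.
    """
    Xc = X - X.mean(axis=0)                  # center the data
    C = np.cov(Xc, rowvar=False)             # covariance of the features
    evals, evecs = np.linalg.eig(C)          # may return complex values
    evals, evecs = evals.real, evecs.real    # imaginary parts are ~0
    order = np.argsort(-evals)               # indices of decreasing eigenvalues
    evals, evecs = evals[order], evecs[:, order]
    if components is not None:
        evals, evecs = evals[:components], evecs[:, :components]
    return evals, evecs

# Measure the working time on stand-in data:
X = np.random.rand(200, 50)
t0 = time.time()
evals, evecs = pca(X, components=10)
print(f"pca took {time.time() - t0:.3f} s")
```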
  • For future convenience, and because of the long running time of the eig function, it is suggested to save results in a binary file, e.g. using pickle.dump and pickle.load. For this purpose, write a helper function taking a file name as an argument and (for saving) the list of objects to be dumped.
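One way to sketch this helper is a single function that saves when given a list of objects and loads otherwise:

```python
import pickle

def pickle_objects(fname, objects=None):
    """Dump a list of objects to a binary file, or load it back.

    Sketch of the suggested helper:
        pickle_objects("pca.bin", [evals, evecs])   # save
        evals, evecs = pickle_objects("pca.bin")    # load
    """
    if objects is not None:
        with open(fname, "wb") as f:
            pickle.dump(objects, f)
    else:
        with open(fname, "rb") as f:
            return pickle.load(f)
```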
  • Visualize a certain number of the first eigenvectors (eigenfaces) as images (use your show_some_images function).
  • Calculate the 'mean face image' from data and visualize it.
  • Project the data onto a certain number (e.g. 100) of the first eigenvectors, i.e. calculate new data with new feature values. Before projecting, remember to subtract the mean image. Compare correlations between the old and the new features (corrcoef command).
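The mean image, the projection, and the correlation check can be sketched as below. Random stand-in data with 100 features replaces the training images, and numpy.linalg.eigh (the symmetric-matrix variant of eig) stands in for the PCA function suggested above; the eigenvectors are assumed to be columns sorted by decreasing eigenvalue.

```python
import numpy as np

# Stand-in for the training images (use X_train with the real data).
rng = np.random.default_rng(0)
X = rng.random((320, 100))

mean_image = X.mean(axis=0)          # the 'mean face' (visualize it too)
Xc = X - mean_image                  # subtract it before projecting

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue:
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
evecs = evecs[:, np.argsort(-evals)]

k = 10                               # e.g. 100 for the real faces
Z = Xc @ evecs[:, :k]                # new data: projections on first k PCs

# Unlike the original pixels, the new features are uncorrelated:
corr_new = np.corrcoef(Z, rowvar=False)
off_diag = corr_new - np.diag(np.diag(corr_new))
print(np.abs(off_diag).max())        # close to 0
```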

Homework

  • Write a function that performs a sequence of reconstructions (approximations) of a given input image, increasing the number of applied features (principal components / dimensions). Suggested arguments: a vector representing the original image, a matrix with eigenvectors (possibly reduced), a list with feature counts for successive reconstructions (e.g. dims=[10, 20, 100, 200, 1000, 2000]), and a vector with the so-called mean image. Suggested output: the obtained reconstructions stored as a data array (i.e. successive reconstructions written as rows of a numpy array).
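A sketch of this reconstruction function, demonstrated on small random stand-in data (20 features) so it runs quickly; eigenvectors are assumed to be columns sorted by decreasing eigenvalue:

```python
import numpy as np

def reconstructions(image, evecs, dims, mean_image):
    """Successive PCA reconstructions of one image.

    image: flat vector; evecs: eigenvectors as columns (sorted);
    dims: list of feature counts; mean_image: the mean training image.
    Returns one reconstruction per row of the output array.
    """
    centered = image - mean_image
    out = np.empty((len(dims), image.size))
    for i, d in enumerate(dims):
        V = evecs[:, :d]                          # first d components
        out[i] = mean_image + V @ (V.T @ centered)
    return out

# Tiny demonstration with stand-in data:
rng = np.random.default_rng(1)
X = rng.random((50, 20))
mean_image = X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(X - mean_image, rowvar=False))
evecs = evecs[:, np.argsort(-evals)]
R = reconstructions(X[0], evecs, dims=[2, 5, 20], mean_image=mean_image)
```

With all 20 components the reconstruction reproduces the original exactly, and the reconstruction error shrinks as components are added.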
  • Calculate mean absolute errors (MAE) between the reconstructions and the original image for different feature counts (e.g. 100, 200, 500, 1000).
  • Prepare a visualization function that displays pairs of images (the original and one of its reconstructions). Above each pair, display the MAE.
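The MAE calculation and the pairwise display can be sketched together. The function name and signature are only suggestions, and the matplotlib part is commented out so the snippet also runs headless; it assumes square images and reconstructions stored as rows.

```python
import numpy as np

def show_pairs(original, recons, dims):
    """Display (original, reconstruction) pairs with MAE above each pair.

    original: flat image vector; recons: reconstructions as rows;
    dims: the feature counts corresponding to those rows.
    Returns the MAE values (one per reconstruction).
    """
    maes = np.mean(np.abs(recons - original), axis=1)   # one MAE per row
    # import matplotlib.pyplot as plt
    # side = int(np.sqrt(original.size))
    # fig, axes = plt.subplots(len(recons), 2, squeeze=False)
    # for ax_row, rec, d, mae in zip(axes, recons, dims, maes):
    #     ax_row[0].imshow(original.reshape(side, side), cmap="gray")
    #     ax_row[1].imshow(rec.reshape(side, side), cmap="gray")
    #     ax_row[1].set_title(f"{d} features, MAE = {mae:.4f}")
    # plt.show()
    return maes
```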
  • Extend your function that performs PCA with an additional optional argument named e.g. 'variance_sum_ratio'. This argument is a number from the (0, 1] interval and reduces dimensionality by selecting the smallest number of dimensions (first principal components) that 'explains' the wanted fraction of the total variance. E.g. when variance_sum_ratio = 0.95, one has to find the smallest number k such that (lambda_1 + ... + lambda_k) / (lambda_1 + ... + lambda_n) >= 0.95. This argument should work independently of the existing 'components' argument (which decides about the number of components explicitly); therefore, in the implementation, one of these arguments must take priority (in case the user accidentally specifies both).
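The extension can be sketched as below (again via eigendecomposition of the covariance matrix). Here 'variance_sum_ratio' is arbitrarily given priority over 'components'; the opposite choice is equally valid as long as it is made deliberately.

```python
import numpy as np

def pca(X, components=None, variance_sum_ratio=None):
    """PCA sketch extended with the optional 'variance_sum_ratio'.

    When both arguments are given, 'variance_sum_ratio' takes priority
    here (an arbitrary design choice).
    """
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eig(np.cov(Xc, rowvar=False))
    evals, evecs = evals.real, evecs.real
    order = np.argsort(-evals)
    evals, evecs = evals[order], evecs[:, order]
    if variance_sum_ratio is not None:
        # Smallest k with (lambda_1+...+lambda_k) / (lambda_1+...+lambda_n)
        # >= variance_sum_ratio; the cumulative ratios are non-decreasing,
        # so searchsorted finds the first index meeting the threshold.
        ratios = np.cumsum(evals) / np.sum(evals)
        k = int(np.searchsorted(ratios, variance_sum_ratio)) + 1
        evals, evecs = evals[:k], evecs[:, :k]
    elif components is not None:
        evals, evecs = evals[:components], evecs[:, :components]
    return evals, evecs
```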