DMA/L/z2p

In class

  • Download the dataset of 7501 shopping transactions collected over one week at a French retail store: store_data.csv.
  • Additional information on this data set and the Apriori algorithm can be found at: https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/.
  • Write a Python function that reads this .csv file and prepares data structures suitable for further work. Suggested outputs: a list of distinct items (sorted alphabetically), a list of transactions (each transaction being a numpy array of integer indexes), a binary array representing the transactions, and two mapping dictionaries (strings to integers, integers to strings) for translating item names to integer indexes and vice versa. Note: the original .csv file contains a few typographic errors (e.g. extra spaces around names); correct them automatically inside the function.
  • Write a function that implements stage 1 of the Apriori algorithm, i.e. finds all frequent itemsets in the data for a given min_supp parameter (minimum support). Suggested output: a list whose k-th element is a dictionary storing the frequent itemsets of length k (keys could be strings representing itemsets; values should be their supports). Efficiency tip: when calculating supports for a large collection of candidates, take advantage of the binary array representing the transactions.
  • Prepare additional functions for pickling the frequent itemsets to binary files and unpickling them from those files.
  • For the given data set (shopping transactions from a French retail store) generate (and pickle) frequent itemsets for the following minimum supports: 0.01, 0.005, 0.0025, 0.00125.
  • Write a function that implements stage 2 of the Apriori algorithm, i.e. generates association rules from the frequent itemsets for a given min_conf parameter (minimum confidence). For the time being (at the lab) the function does not have to return any results; the rules can simply be printed to the screen (and can be printed using integers, e.g. [2, 25, 39] -> [3, 14]).
  • Using the largest collection of frequent itemsets (the one obtained for min_supp = 0.00125), generate (and observe) association rules for the following values of the min_conf parameter: 0.5, 0.75, 0.9.
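
The efficiency tip for stage 1 can be illustrated with a minimal sketch. It is one possible approach, not a reference solution. It assumes the binary array has one row per transaction and one column per item, and it uses sorted tuples of item indexes as dictionary keys instead of the suggested strings. The names support and frequent_itemsets are hypothetical.

```python
from itertools import combinations

import numpy as np


def support(binary, itemset):
    """Fraction of transactions (rows) that contain every item in itemset."""
    return binary[:, list(itemset)].all(axis=1).mean()


def frequent_itemsets(binary, min_supp):
    """Return a list whose element k-1 maps frequent k-itemsets to supports.

    Assumption: itemsets are stored as sorted tuples of column indexes.
    """
    n_items = binary.shape[1]
    # Level 1: keep the single items that reach the minimum support.
    level = {}
    for i in range(n_items):
        s = support(binary, (i,))
        if s >= min_supp:
            level[(i,)] = s
    levels = []
    while level:
        levels.append(level)
        prev = sorted(level)
        candidates = set()
        # Join step: merge two k-itemsets that share their first k-1 items.
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                # Prune step: every k-subset of the candidate must be frequent.
                if all(sub in level
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
        # Count the supports of the surviving candidates on the binary array.
        level = {}
        for cand in candidates:
            s = support(binary, cand)
            if s >= min_supp:
                level[cand] = s
    return levels
```

Selecting the candidate's columns and calling all(axis=1) checks every transaction at once, which avoids a Python loop over the 7501 transactions for each candidate.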

At home

  • Write a helper function that prints a rule in a human-friendly format (using a suitable integers-to-strings mapping) and apply it in the function that generates rules.
  • Decide which results the rule-generating function should return so that they are useful for the following points, e.g. rule premises, consequents, supports and confidences.
  • Write code (it could be placed in the main function) that draws a scatter plot with rules represented as points (matplotlib.pyplot.scatter). In the plot title, state the values of the min_supp and min_conf parameters.
  • Write a function that identifies the Pareto-optimal rules among the rules found (the function can return e.g. the indexes of those rules).
  • Print Pareto-optimal rules and mark (distinguish) them in the plot.
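
The Pareto-optimality check could be sketched as follows. The task does not state which criteria are compared, so the sketch assumes that each rule is described by its support and confidence and that both are to be maximized. The name pareto_optimal is hypothetical.

```python
import numpy as np


def pareto_optimal(supports, confidences):
    """Return the indexes of rules not dominated in (support, confidence).

    Assumption: both criteria are maximized. Rule q dominates rule p if q is
    at least as good on both criteria and strictly better on at least one.
    """
    pts = np.column_stack([supports, confidences]).astype(float)
    keep = []
    for i, p in enumerate(pts):
        at_least_as_good = np.all(pts >= p, axis=1)
        strictly_better = np.any(pts > p, axis=1)
        if not np.any(at_least_as_good & strictly_better):
            keep.append(i)
    return keep
```

The returned indexes can then be used to select the points drawn in a second matplotlib.pyplot.scatter call with a different marker or colour. The pairwise comparison is quadratic in the number of rules, which may become slow for very large rule sets.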