
Examenvragen Datamining 2018-2019



Exam questions data mining

Questions about the data challenge:

Why did you choose technique X? Did you consider preprocessing step X?

What were the most important variables? (circumstances & region)

How did you choose a training, validation and test set? (split sample)

What step improved your performance most?

Can you explain what you wrote on page X?

Is comprehensibility of the model important?

Explain the supervised ratio. (A nominal variable with more than 100 values is a high-cardinality variable.)

Did you perform a grid search for every class?

1. What is data mining?

The automatic extraction of patterns from large amounts of data (via tools/technologies that incorporate the principles of data science). Goal: find non-obvious patterns.

2. Explain how a decision tree is built. Clearly explain what the three steps are that need to be defined. (large)

Recursive partitioning: I) find the most important variable II) split into subsamples III) repeat until the prediction is sufficiently strong. Three steps:

1. Splitting rule: calculate all possible splits, calculate the goodness of each split (highest weighted mean decrease in impurity, where impurity is either entropy or the Gini index), and choose the best split (the gain ratio is also possible).

2. Stopping rule: 1) early stopping: stop growing the tree before it becomes too large and overfits, i.e. when the accuracy on the validation set starts to decrease (while the accuracy on the training set keeps increasing because of the overfitting). 2) Pruning: grow a large tree first, then select the pruned subtree with the lowest misclassification rate on a validation sample.

3. Assignment rule: two ways, 1) assign the class with the largest share in the leaf, 2) take the misclassification cost into account.

DT +: interpretability, handles both continuous and categorical variables, robust to outliers

DT -: may be unstable (similar goodness of split but very dissimilar consequences), may be too complex
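A minimal sketch of how the three rules surface in practice, assuming scikit-learn is available (the parameter names below are sklearn's, not the course's):

```python
# Minimal sketch, assuming scikit-learn: the three rules map onto parameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",      # splitting rule: impurity measure ("entropy" also possible)
    ccp_alpha=0.01,        # stopping rule: cost-complexity pruning of the grown tree
    class_weight=None,     # assignment rule: misclassification costs could be set here
    random_state=0,
).fit(X_train, y_train)
print(tree.score(X_test, y_test))
```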

3. You have Facebook information of 7000 people. Your goal is to predict political preference (large) (2x)


What is the target variable? Political preference (0 = Democrat; 1 = Republican)

What are the features? Likes of Facebook pages

What is the model? Naïve Bayes or evidence lift (= "naive naive Bayes")

Evaluation and performance measurement method?

For which research domain can you use this method as well?

4. Why is kNN a lazy learner? What does this mean for the computing efficiency? (2x)

Issues with a lazy learner: I) comprehensibility (how do you justify the decision and the model? "We declined your loan because some other guy with your characteristics defaulted") II) dimensionality (too many irrelevant features) III) computational efficiency (training time = 0, WHICH IS WHY IT IS A LAZY LEARNER; testing time: compute the nearest neighbours by comparing each new instance with ALL stored instances → high testing time and high costs!) IV) nature of the attributes (scaling of attributes and dummy encoding → lots of attributes)

Lazy learner: generalization of the training set is delayed until a query is made to the system. You keep a table with all the training instances; only when a new item has to be classified do you compare it with its nearest neighbours.

Eager learning: trying to generalize before receiving queries. For example, with linear regression the coefficients that fit the model are already estimated from the training set; when a new item has to be classified, you just plug it into the formula!
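A from-scratch sketch (assuming only numpy) that makes the lazy trade-off explicit: fit merely stores the data, while predict does all the distance work against every stored instance:

```python
# Illustrative sketch: a lazy learner stores the training set at fit time
# and defers all computation to prediction time.
import numpy as np

class LazyKNN:
    def fit(self, X, y, k=3):
        self.X, self.y, self.k = np.asarray(X), np.asarray(y), k  # "training" = storing
        return self

    def predict(self, X_new):
        preds = []
        for x in np.asarray(X_new):                      # each query...
            d = np.linalg.norm(self.X - x, axis=1)       # ...is compared with ALL instances
            nearest = self.y[np.argsort(d)[:self.k]]
            preds.append(np.bincount(nearest).argmax())  # majority vote of the neighbours
        return np.array(preds)

clf = LazyKNN().fit([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1])
print(clf.predict([[0.9, 0.8]]))  # -> [1]
```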

5. Can you use a discrete classifier to build I) a confusion matrix II) a lift curve III) a ROC curve? If yes, how?

Discrete classifier vs ranking classifier [4.29] → confusion matrix: yes! For the lift curve and the ROC curve you need a ranking classifier, so that the threshold can be varied step by step to draw the curve.

6. What is overfitting and how can you avoid it? (large question)

Fitting the noise = finding patterns that do not generalize = the model memorizes. Draw the graph [3.46] (trade-off of error (%) vs complexity, with holdout data and training data) → aim: generalize to unseen data points. (Logistic regression is more prone to overfitting than SVM.)

How to avoid it: 1. Set aside a validation set and use the fitting graph (error (or accuracy) vs complexity). 2. Use the training set to train the model until the validation set shows the minimum error (or maximum accuracy). 3. Then test the generalization behavior on the test data. (Divide the data by split sample or N-fold cross-validation.)
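A hedged sketch of that procedure, assuming scikit-learn: sweep a complexity parameter (here tree depth) and keep the setting where validation accuracy peaks:

```python
# Fitting-graph sketch: train at increasing complexity, select on validation data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_acc = None, 0.0
for depth in range(1, 15):                       # the complexity axis of the graph
    m = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    acc = m.score(X_val, y_val)                  # holdout accuracy
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(best_depth, best_acc)
```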


7. You have a model with AUC of 85% but zero true positives. Is this possible? Explain true positives and AUC. (large question)

This is possible if the threshold is very high. The threshold does not change the AUC; however, all instances will be classified as negative, so there are no true positives. True positive = classified as positive while the actual class is also positive. AUC = Area Under the (ROC) Curve = the probability that any randomly chosen positive instance is ranked above any randomly chosen negative instance. It equals 1 if all positive instances are ranked above all negative instances (and therefore perfect prediction).
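A short sketch of this point, assuming scikit-learn and numpy: AUC depends only on the ranking of the scores, while true positives depend on the chosen threshold:

```python
# AUC is threshold-free; true positives are not.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true   = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.8])

print(roc_auc_score(y_true, y_scores))   # ranking quality, no threshold involved

y_pred = (y_scores >= 0.95).astype(int)  # a very high threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp)                                # 0 true positives, yet the AUC is unchanged
```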

8. Explain collaborative recommendation

Content-based recommendation (based on user c's profile contents and the contents of specific items S) vs collaborative recommendation (based on other users' ratings): the latter recommends content that similar persons rated highly. Overspecialization (no new genres are recommended) is not an issue, but popularity bias, the new-item problem (an item not yet rated by others) and the new-user problem (the system does not know which other users are similar to you) are.

9. What determines your performance when using a kNN model?

k: not too high → that smooths out the prediction; not too low → overfitting.

Dimensionality (too many irrelevant features): low-dimensional data works better than high-dimensional data → the nature of dummy encoding etc. also influences this.

Computational efficiency (training time = 0, which is why it is a lazy learner; testing time: compute the nearest neighbours by comparing each new instance with ALL stored instances → high testing time and high costs!). Nature of the attributes (scaling of attributes and dummy encoding → lots of attributes).

10. Explain ROC, with AUC, advantages and disadvantages, and why is it used so often in practice? (large) Also: compare the advantages and disadvantages with respect to the profit curve. Side question for 'worse than random': what can you say about the predictive power? How can we solve this? → invert the predictions (1 − prediction).

ROC = Receiver Operating Characteristics curve (Y: TP rate, X: FP rate); AUC = area under this curve. Often used in practice because it is a relative, threshold-free measure that remains usable even when the class priors are wrong. Disadvantage of the profit curve: you need cost and benefit information; advantage: when you have those, it is the best way to evaluate a classifier!

Adv: you can make choices without taking the threshold into account. Many models already compute scores. When ROC is calculated, you can also calculate the AUC, which is a handy metric.

Disadv: does not take the benefits and profits into account like the profit curve.

11. How do you know that one classification technique works better than another? (large)

1. Accuracy, but beware of unbalanced classes & unequal costs (maximizing accuracy is usually not an appropriate goal) → 2. Confusion matrix (you can also consider the expected value) 3. Compare with a baseline model (simple or majority)


Visualizing:

I) Profit curve (Y: profit, X: % of people targeted; starts and ends in the same point; the preferred one when you have cost and benefit information) → the best classifier is the highest one, given constraints on the % of people you can target.

II) ROC (Receiver Operating Characteristics) curve (Y: TP rate; X: FP rate) → the best classifier goes most to the top-left corner.

III) AUC (Area Under the Curve; 0.5 for a random model → the best classifier has the highest AUC) = the probability that a randomly chosen positive instance is ranked above (has a higher score than) a randomly chosen negative instance (= 1 when all positives are ranked above the negatives, which means perfect prediction).

IV) Cumulative response curve (Y: TP rate, X: % targeted): best = most to the top-left corner.

V) Lift curve (Y: lift; X: % targeted): the advantage a classifier has over the random model; the highest lift is best. Lift = TP rate of the classifier / TP rate of a random classifier (e.g. 80%/40% = lift 2).
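A rough sketch of the lift computation from item V, assuming numpy; lift_at is an illustrative helper, not a library function:

```python
# Lift at targeting depth frac: share of positives captured vs random targeting.
import numpy as np

def lift_at(y_true, scores, frac):
    n = int(len(scores) * frac)
    top = np.argsort(scores)[::-1][:n]                 # target the top-scored fraction
    tp_rate_model = y_true[top].sum() / y_true.sum()   # share of all positives captured
    tp_rate_random = frac                              # random targeting captures frac
    return tp_rate_model / tp_rate_random

y = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
s = np.array([0.9, 0.2, 0.8, 0.7, 0.4, 0.1, 0.3, 0.6, 0.5, 0.05])
print(lift_at(y, s, 0.4))  # top 4 targeted, all 4 positives found -> lift 2.5
```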

12. Bigram

"White House", "United States", etc. Extension of the Bag of Words text-mining technique in which word order matters.

13. How does an ANN work and what is the link with deep learning? Application to ASOS? (large) → side questions: I) also draw the multi-layer perceptron. II) What is the function of the training set, validation set, and test set? (Determine the weights; determine the number of hidden layers & number of neurons; determine the performance of the model, respectively.)

Mimics how our brain works, with interconnected neurons exchanging information. The output is a function of the weighted sum of many inputs. 1. Input nodes 2. Sum of the weighted inputs 3. Activation function f at the output node 4. Output (classification). Cf. the perceptron drawing → XOR problem (a single perceptron is limited to linear decision boundaries) → MLP (multi-layer perceptron) = organize neurons into layers (with hidden layers)! The model then estimates the weights (cf. regression coefficients): they start randomly and are adapted by the neural network. The link with deep learning is that deep learning is also a neural network, but it uses a huge number of layers in order to classify (mostly voice recognition and image recognition).

ASOS

1. Product characterization (learning product attributes to recommend fashion products):

A) image embeddings (unsupervised convolutional neural network: convolution + pooling, then a fully connected encoding step with hidden layers and neurons, then deconvolution and unpooling to reconstruct the input)

+ B) text descriptions

+ C) metadata: product type (dress, earring, etc.), brand, division (menswear, womenswear)

→ Machine learning: combined into a neural network → product attributes

By using these neural networks, product attributes are learned. The attributes then help to build a hybrid recommender system to better recommend products for customers.


2. Customer profiling (learning the Customer Lifetime Value of the customer base)

14. What influences the quality of k-means (clustering)?

The number of clusters (k) and whether or not it converges. The method has several loopholes, such as getting trapped in local minima (no guarantee of a global optimum) and the large number of distance calculations, which ultimately leads to high time complexity. Both the accuracy and the complexity of the algorithm depend on the selection of the initial centroids and on the strategy used for the distance calculations from each data object to the different cluster centers.
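A short sketch of one common remedy for the initial-centroid problem, assuming scikit-learn: run several random initializations and keep the lowest-inertia solution (k-means++ seeding is another standard option):

```python
# Multiple random restarts mitigate local minima in k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)
print(km.inertia_)   # within-cluster sum of squares of the best of the 10 runs
```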

15. What is ROC? (small)

Receiver Operating Characteristics. A graph with the true positive rate on the Y-axis and the false positive rate on the X-axis.

16. Explain these 5 things about ANN: I) perceptron, II) architecture ANN, III) backpropagation learning IV) how to optimize the architecture?, V) link with deep learning? (large)

Perceptron: drawing of x1 → input node → weight → output node (activation function) → y

Architecture of an ANN: x1 → input layer with input nodes → weights → to each hidden-layer node → weights → to the output-layer node → y (answer to IV: this architecture is optimized by determining the optimal number of neurons and hidden layers using the validation set)

Backpropagation: provide the network with training examples one by one: correct → no correction of the weights; incorrect → backpropagate the error through the network and adapt the weights according to the delta learning rule (steepest descent: minimize E = 0.5 * Σ(o_i − d_i)²) → W_new = W_old − μ·∂E/∂W_old, with μ = the learning rate. This is done until ∂E/∂W is zero (a minimum). The iteration (epoch): 1. set the weights randomly 2. forward pass (just run the model: calculate the network outputs) 3. backward pass (adapt the weights based on the error) 4. repeat steps 2 and 3 until the stop criterion is met.
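A toy sketch of the delta rule for a single sigmoid neuron, assuming numpy (a real MLP backpropagates through the hidden layers as well; the AND target here is just a linearly separable example):

```python
# Steepest descent on E = 0.5 * sum((o - d)^2) for one sigmoid neuron.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 0., 0., 1.])          # targets (AND function)
w = rng.normal(size=2); b = 0.0         # 1. set weights randomly
mu = 0.5                                # learning rate

for epoch in range(2000):
    o = 1 / (1 + np.exp(-(X @ w + b)))  # 2. forward pass: network outputs
    err = o - d
    grad = err * o * (1 - o)            # 3. backward pass: dE/dnet per example
    w -= mu * X.T @ grad                # W_new = W_old - mu * dE/dW
    b -= mu * grad.sum()
print(np.round(o))                      # ~ [0, 0, 0, 1] after convergence
```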

Problems with ANN: due to steepest descent you may end up in a local minimum of the error (instead of the global minimum) → solution: another algorithm such as conjugate gradient, or just use SVM. And how to choose the learning rate μ? Too high: you might miss the minimum; too low: slow convergence. Or adaptive: start with a large learning rate and decrease it gradually.

Deep learning: it is a neural network but with many layers, primarily for image recognition and voice recognition → it automatically learns shapes without supervision.

Advantages: networks with one hidden layer are universal approximators; very good generalization capability (noise resistant); they effectively deal with high-dimensional, sparse input spaces.

Disadvantages: NNs are black-box techniques (cf. comprehensibility); how to choose the network topology? (hidden neurons, hidden layers, activation functions, etc.); local minima (cf. SVM or another algorithm for error backpropagation)

17. What is the difference between the training, validation, and test set? How is this used with an ANN? Idem, but how is this used with an SVM?


Training = train the model; validation = select the model (optimal tree size, optimal C with SVM); test set = obtain unbiased evaluation of the model on unseen data.

ANN: training set to determine the weights of the inputs (by backpropagating), validation set to determine the optimal number of hidden layers & number of neurons (too many neurons may overfit the data), test set to deploy the model on to see the performance.

18. How can data be represented? Chapter 7? [4]

19. Explain the formula of naïve Bayes for P(C=c|E), given a dataset with the Facebook likes of persons, where you wish to predict whether someone is highly educated or not.

P(C=c|E) = posterior probability = [P(E|C=c), the probability that the evidence is observed when the instance is of class c, * P(C=c), the prior probability, i.e. the probability that someone is of that class without seeing any evidence] / P(E), the probability of the evidence (often left out → it is the same for all classes + the classes are mutually exclusive and exhaustive).

What is a datapoint? A Facebook profile/person

What is the target variable? Highly educated (1) or not (0)

What are the features? Facebook page likes

Give a feature that would have a lift > 1. Science pages
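A toy sketch of the computation with made-up probabilities (all numbers are hypothetical), showing the naive factorization of P(E|C=c) over the individual likes and the normalization by P(E):

```python
# Naive Bayes by hand: prior * product of per-like likelihoods, then normalize.
p_c = {1: 0.3, 0: 0.7}                        # prior P(C=c): highly educated or not
p_like_given_c = {                            # P(like_i | C=c), hypothetical values
    "science": {1: 0.6, 0: 0.1},
    "reality_tv": {1: 0.2, 0: 0.5},
}
evidence = ["science", "reality_tv"]          # the observed Facebook likes

score = {c: p_c[c] for c in (0, 1)}
for page in evidence:                         # naive assumption: likes independent
    for c in (0, 1):
        score[c] *= p_like_given_c[page][c]

posterior_1 = score[1] / (score[0] + score[1])  # dividing by P(E) = normalization
print(posterior_1)
```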

20. Explain stacking (large)

Ensemble method combined by learning. Not majority voting but meta learner will combine predictions of base learners. So input for level-1 meta learner are the predictions (meta instances) of the several level-0 base learners (usually different learning schemes, e.g. one decision tree, one SVM, one ANN).

+ draw process
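A minimal sketch with scikit-learn's StackingClassifier (assumed available): level-0 base learners with different learning schemes, and a logistic regression as the level-1 meta learner:

```python
# Stacking: level-0 predictions become the inputs of the level-1 meta learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)

base_learners = [                              # level-0: different learning schemes
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("svm", SVC(random_state=0)),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(),      # level-1 meta learner
).fit(X, y)
print(stack.score(X, y))
```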

21. How do recommender systems work? (large)

Utility functions:

CBR: u(c, S) = score(ContentBasedProfile(c), Content(S))

CR: u(c, S) = aggregate of u(cj, S) over the users cj that are similar to c

Content-based vs collaborative recommendation, with their respective problems: see question 8.


Bag-of-words: every document is a mere collection of words (ignore word order, grammar, structure etc)

N-gram method = an extension of BoW → word order is important!! E.g. White House! However, it greatly increases the size of the feature set → use co-occurrences only (words that occur together more often than separately, e.g. United States, Single Resolution Mechanism)

Another extension: named entity recognition, which puts elements into categories such as Location, Time, Person, Organization, Date, etc. → you need pre-defined categories (you can train them on a labeled dataset)

23. What is information gain?

The IG of a variable/attribute measures how much the entropy will decrease when splitting on it. Highest IG = most informative.

24. What is confidence?

Apriori goal: detecting frequently occurring patterns between items, given a minimal level of support and confidence. How: find frequent item sets → derive association rules from these sets.

Support(X → Y) = # transactions containing X ∪ Y / total # transactions = P(X ∪ Y)

Confidence(X → Y) = support(X ∪ Y) / support(X) = P(Y|X)
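A direct transcription of these two formulas over a toy transaction list (standard library only; the items and transactions are made up):

```python
# Support and confidence for association rules over a toy basket dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "beer"},
    {"bread", "milk", "beer"},
    {"milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):                       # conf(X -> Y) = supp(X u Y) / supp(X)
    return support(X | Y) / support(X)

print(support({"bread", "milk"}))           # 2/4 = 0.5
print(confidence({"bread"}, {"milk"}))      # 0.5 / 0.75 ≈ 0.67 = P(milk | bread)
```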

25. What is overfitting?

Fitting the noise = finding patterns that do not generalize = the model memorizes.

26. What is entropy? [3.16]

Measure of impurity, chaos. Perfectly mixed group: entropy 1; perfectly separated group: entropy 0. Formula: entropy = -[p(°)*log2 p(°) + p(*)*log2 p(*)]
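The formula as a small function, assuming numpy; the two prints reproduce the perfectly mixed and perfectly separated cases:

```python
# Entropy of a node from its class proportions.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))           # perfectly mixed -> 1.0
print(entropy([1.0, 0.0]))           # perfectly separated -> 0.0
```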

27. What is the difference between document clustering and document classification?

Document classification (supervised learning): experts set up a training set for the classifier to learn on and establish classification rules. Massive dimensionality, in order to automatically classify huge numbers of online documents. Mostly regression, SVM, or naïve Bayes.

Document clustering (unsupervised learning): automatically group related documents based on their content, without a training set. Three main steps: I) preprocessing (stemming, stop words, normalization, etc.) II) hierarchical clustering to compute similarities III) slicing to flatten the tree into the desired number of levels (e.g. slice into three categories: science, music and sports)

Another one: sentiment analysis/classification (can be both supervised (label a training set) and unsupervised (use an existing sentiment lexicon)).

28. Explain how the SVM relates to regularization. (on validation set [3.61])

Regularization: penalizing large weights/complexity (a mechanism to avoid overfitting). A high C means a high importance of the errors, so probably a complex model (the effect of the regularization fades away). A low C means a high importance of minimizing the weights, and therefore a low-complexity model and no overfitting.
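A sketch of the C trade-off with scikit-learn's SVC (assumed available): as C grows, training accuracy tends to rise while validation accuracy can drop, which is the regularization effect fading away; as the answer says, C is then chosen on the validation set:

```python
# Grid over C: small C = strong regularization, large C = errors dominate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for C in (0.01, 1, 100):
    svm = SVC(C=C, kernel="rbf").fit(X_tr, y_tr)
    print(C, svm.score(X_tr, y_tr), svm.score(X_val, y_val))  # train vs holdout
```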


29. What is the confusion matrix and how is it generated? If you have a ranking classifier, how could you apply the confusion matrix? Explain the importance of distinguishing between the different kinds of errors you can make.

Confusion matrix: [TP FP; FN TN]. With a ranking classifier → a confusion matrix per threshold is used to set up a ROC curve. TPR (hit rate = sensitivity), FPR (false alarm rate), TNR (specificity), FNR, precision, F statistic.

30. Draw the Cumulative Response Curve of the data and give it for the baseline model if that model is random. Side question (oral): now give the CRC for a perfect model.

For the perfect model, all positives have to be ranked at the top, so that as the threshold is decreased step by step all positives are observed first: the curve reaches Y = 1 (all true positives observed) after, for example, only 20% of the dataset has been examined (if 20% of the actuals are positive).

31. What is the Apriori trick? (This is not the method!! It is not in the course text; it was explained in the lecture.)

Every subset of a frequent item set must also be a frequent item set → less computation.

32. A statistician says you should not use data mining because you will always find some pattern. How do you refute this?

You will indeed always find a pattern, but this does not mean that the pattern will predict future unseen instances accurately. What is needed is a pattern that generalizes to the entire population with high predictive power.

Overfitting?

33. Text mining exercise: entity, 2 bigrams, preprocessing steps, tf-idf table, sensitivity analysis.

Preprocessing steps: normalization (lower case), stemming, stop-words.

IDF(t) = 1 + log(total number of documents / number of docs containing t): the more documents contain t → the lower the IDF. TF-IDF(t, d) = TF(t, d) (term frequency) * IDF(t) → calculate similarity with cosine similarity.
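A small sketch of this IDF variant plus cosine similarity, assuming numpy and already-preprocessed (lower-cased, stemmed, stop-word-free) tokens; the documents are made up:

```python
# TF-IDF with IDF(t) = 1 + log(N / df(t)), then cosine similarity.
from math import log
import numpy as np

docs = [["white", "house", "garden"], ["white", "wine"], ["house", "music"]]
vocab = sorted({w for d in docs for w in d})
N = len(docs)
df = {t: sum(t in d for d in docs) for t in vocab}     # document frequency
idf = {t: 1 + log(N / df[t]) for t in vocab}

def tfidf(doc):
    return np.array([doc.count(t) * idf[t] for t in vocab])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

v = [tfidf(d) for d in docs]
print(cosine(v[0], v[1]))   # similarity driven by the shared term "white"
```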

34. Is AUC always the best metric to choose the optimal model? Or should you also look at accuracy? "The highest accuracy does not always guarantee the right model, but the highest AUC does."

Accuracy is not a good metric when classes are unbalanced.

35. What is pruning in practice? Is it good to do with a Decision Tree but not with a Random Forest?

In practice: using fewer variables to keep splitting the data points (because eventually you would end up with nodes holding only a few instances → overfitting).

36. What techniques are lazy learners besides kNN? Examples in text mining (bag-of-words, N-gram)? Naïve Bayes?

37. RF: What is the use and meaning of m? Why build a full tree n times? What does “fully grown and not pruned” mean?


m is the number of features that are used to split the nodes (usually around 20% of M). n trees are built because it is an ensemble method, in order to increase accuracy. The trees are not pruned because the features are chosen randomly, which already decreases potential overfitting significantly!

38. What does a SVM effectively do with the datapoints? How does this work in practice?

39. Give the definition of Ethics in a data mining context

Ethics is about right versus wrong. It gets interesting when it is hard to determine right from wrong. Also, just because we can do something, does not mean we should do something. In data mining context it is about data gathered on humans, which has an impact on humans, and how this is ethical. Informed consent is an important factor and everyone must have the right to withdraw consent.

I. Data gathering: bias & experimentation

II. Data preprocessing: privacy: the right to be forgotten on the web? Data asymmetry. Privacy principles → data minimization & informed consent! Also: re-identification from so-called anonymized data.

III. GDPR: a good thing? (Informed consent? Even more asymmetric?)

IV. Trolley problem

V. Ethics differ across cultures (younger vs older: high in the Southern cluster, low in the Eastern cluster)

VI. Model evaluation: FATE (Fair, Accountable, Transparent, Explainable)

40. Explain the random forest technique as an Ensemble method (large)

Ensemble method combined by consensus. A Random Forest model, as the name implies, consists of multiple decision trees. It is an ensemble learning method that outputs the mode of the classes of the individual trees. The "random" part refers to the fact that the algorithm selects a random subset of the features (m < M, typically 20%) at each split in the tree. By choosing a random subset, the correlation between the different trees is lower. There is one important parameter to set: the number of trees (T). You can run a grid search to find the optimal number of trees (validate on a validation set).

For each tree: I) choose the training set by sampling n times with replacement II) at each node, randomly choose m features and calculate the best split III) fully grown and not pruned (so no stopping rule) → majority voting among all the trees.
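A sketch of the grid search over T with scikit-learn (assumed available); max_features plays the role of m, and bootstrap sampling gives each tree its own training set:

```python
# Tune the number of trees T on a validation set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=25, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for T in (10, 50, 100, 200):                   # grid over the number of trees
    rf = RandomForestClassifier(
        n_estimators=T,
        max_features=0.2,      # m: a random 20% of the features per split
        bootstrap=True,        # each tree trained on a bootstrap sample
        random_state=0,
    ).fit(X_tr, y_tr)
    print(T, rf.score(X_val, y_val))
```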

41. Explain: associative learning

42. Explain: regularization

43. Explain: the difference between kNN and k-means clustering

Both use similarity functions, but kNN also uses a combining function to determine which value a given instance gets; k-means clustering just assigns the instance to the nearest (most similar) cluster. Also, clustering is unsupervised whereas kNN is supervised learning.

44. Give a description of the following ensemble models + how they work

I) Bagging


Ensemble method combined by consensus. Step 1: bootstrap the original dataset into many different datasets ("bags"). Step 2 is the aggregation of the bootstraps (bagging): train a classifier on each bootstrap sample → majority voting (= consensus) to determine the class.

Bootstrap Aggregating

II) Boosting

Ensemble method combined by learning (from labeled data). Boost a set of weak learners (make misclassified instances more important) into a strong learner. Adaboost: create a bootstrap sample based on weights (misclassified instances appear more often in the sample, so they are boosted) → train a classifier and apply it to the original dataset → wrongly classified: increase the weight; correctly classified: decrease the weight → again draw a bootstrap sample according to the weights (the misclassified records will be picked more often and will therefore be classified better in the future).

→ The final prediction is a weighted average of all the classifiers, with the weight representing the training accuracy.
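A minimal AdaBoost sketch with scikit-learn (assumed available), boosting decision stumps as the weak learners; in recent scikit-learn versions the first argument is the `estimator` parameter:

```python
# Boosting decision stumps into a strong learner with AdaBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)   # a weak learner
boost = AdaBoostClassifier(stump, n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(boost.score(X_te, y_te))                # weighted vote of all weak learners
```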

III) Stacking

See question 20: a level-1 meta learner combines the predictions of the level-0 base learners.

Do you know another one? Why the name "bagging"?

IV) Random forest

See question 40.

→ Conclusion on ensemble methods: base models are combined by learning from labeled data or by their consensus → higher accuracy.

