Exam Data Mining Date: 5-11-2015 Time: 13.30-16.30


General Remarks

1. You are allowed to consult 1 A4 sheet with notes written on both sides.

2. You are allowed to use a pocket calculator. Use of mobile phones is not allowed.

3. Always show how you arrived at the result of your calculations. Otherwise you cannot receive partial credit if the final answer is incorrect.

4. There are five questions, with which you can earn 100 points.

Question 1: Short Questions (20 points)

Answer the following questions:

(a) Consider the association rule X → Y, where X and Y denote disjoint item sets. Prove that if we move an item from Y to X, then the confidence of the rule either increases or stays the same, but it cannot decrease.

(b) Consider the following claim: Two directed independence graphs are equivalent if they have the same moral graph. Show that this claim is incorrect by giving a counterexample.

(c) In frequent tree mining, what is the difference between an induced subtree and an embedded subtree?


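(d) Give the essential graph of the following directed independence graph:

[Figure: directed independence graph on the nodes A, B, C and D; the edges are not recoverable from the extracted text.]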
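As a reminder of the definitions behind part (a), the confidence of X → Y is support(X ∪ Y) / support(X). The following is a minimal sketch that evaluates this on a small toy database; the transactions and the item names a, b, c are hypothetical and do not come from the exam. It illustrates the claim on one example only and is not a proof.

```python
# Hypothetical toy transactions (not from the exam), used only to illustrate the definitions.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def confidence(x, y, db):
    """conf(X -> Y) = support(X union Y) / support(X)."""
    return support(x | y, db) / support(x, db)

before = confidence({"a"}, {"b", "c"}, transactions)  # rule {a} -> {b, c}
after = confidence({"a", "b"}, {"c"}, transactions)   # item b moved from Y to X
print(before, after)
```

Moving an item from Y to X leaves the numerator unchanged, because X ∪ Y stays the same, while the denominator can only shrink or stay equal.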

Question 2: Classification Trees (25 points)

The tree given below, denoted by Tmax, has been constructed on the training sample:

t1 (90, 10)
    t2 (30, 5)
        t4 (10, 5)
            [t8 (6, 4)]
            [t9 (4, 1)]
        [t5 (20, 0)]
    t3 (60, 5)
        t6 (20, 5)
            [t10 (10, 5)]
            [t11 (10, 0)]
        [t7 (40, 0)]

In each node, the number of observations with class 0 is given first and the number of observations with class 1 second. Indentation shows the parent-child relation, and leaf nodes are enclosed in square brackets.

(a) Compute the impurity of nodes t1, t2 and t3 using the gini-index.

(b) Give the impurity reduction achieved by the first split.

(c) Compute T1, the smallest minimizing subtree of Tmax for α = 0.

(d) Compute the cost-complexity pruning sequence T1 > T2 > . . . > {t1}. For each tree in the sequence, give the interval of α values for which it is the smallest minimizing subtree of Tmax.
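The calculations asked for in parts (a) and (b) can be sketched as follows. This is a minimal sketch, assuming the gini index is defined as 1 − Σ p_j²; for two classes some course notes use p(0|t)·p(1|t) instead, which is exactly half of this value, and the exam text does not say which form is intended. The function name `gini` is an illustrative choice.

```python
def gini(counts):
    """Gini index 1 - sum_j p_j^2 for a node with the given class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Class counts (class 0, class 1) of the root and its two children in Question 2.
t1, t2, t3 = (90, 10), (30, 5), (60, 5)

i1, i2, i3 = gini(t1), gini(t2), gini(t3)

# Impurity reduction of splitting t1 into t2 and t3:
# i(t1) - pi(left) * i(t2) - pi(right) * i(t3),
# where pi is the fraction of t1's observations sent to each child.
n1 = sum(t1)
reduction = i1 - sum(t2) / n1 * i2 - sum(t3) / n1 * i3

print(i1, i2, i3, reduction)
```

If the p(0|t)·p(1|t) form is used instead, all four numbers are halved, so the ranking of candidate splits is unaffected.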


Question 3: Frequent Sequence Mining (15 points)

Consider the following database of sequences:

sid  sequence
1    ABBA
2    ABACAB
3    BADAD

Use the GSP algorithm to find all frequent sequences with minsup=2. Visualize the search process as a prefix tree. Write the support between brackets next to a candidate sequence if and only if it needs to be counted on the database.

(It is advised to rotate your answer sheet 90° to draw the tree in landscape mode.)
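The level-wise search of Question 3 can be sketched in code. This is a simplified sketch of the GSP idea under the assumption that every sequence element is a single item, as in the database above, so sequences can be stored as strings; the function names `contains`, `support` and `gsp` are illustrative. It uses the usual join step, prune step and support count, and it counts only candidates that survive pruning.

```python
from itertools import product

def contains(seq, cand):
    """True if `cand` is a (not necessarily contiguous) subsequence of `seq`."""
    it = iter(seq)
    return all(item in it for item in cand)

def support(cand, db):
    """Number of database sequences that contain `cand`."""
    return sum(contains(seq, cand) for seq in db)

def gsp(db, minsup):
    """Level-wise search for frequent sequences of single items."""
    items = sorted(set("".join(db)))
    frequent = {}
    level = {i: support(i, db) for i in items}
    level = {c: s for c, s in level.items() if s >= minsup}
    while level:
        frequent.update(level)
        prev = set(level)
        # Join: s1 and s2 combine if s1 without its first item
        # equals s2 without its last item.
        cands = {s1 + s2[-1] for s1, s2 in product(prev, repeat=2)
                 if s1[1:] == s2[:-1]}
        # Prune: every subsequence obtained by dropping one item must be frequent.
        cands = {c for c in cands
                 if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}
        # Only the surviving candidates are counted on the database.
        level = {c: support(c, db) for c in cands}
        level = {c: s for c, s in level.items() if s >= minsup}
    return frequent

db = ["ABBA", "ABACAB", "BADAD"]
result = gsp(db, minsup=2)
print(sorted(result.items(), key=lambda kv: (len(kv[0]), kv[0])))
```

Candidates removed in the prune step never reach `support`, which matches the instruction to write a support next to a candidate only if it has to be counted on the database.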

Question 4: Undirected Graphical Models (25 points)

The following data concerns an outbreak of food poisoning after the traditional Christmas Lunch of the personnel of the Department of Information and Computing Sciences of our University. This time the theme was Dutch cuisine. Of the food eaten, interest focused on the “Berenhap” and “Frikandel”. The variables are:

1. Berenhap eaten (1) or not eaten (0) (B)
2. Frikandel eaten (1) or not eaten (0) (F)
3. Sick (1) or not (0) (S)

Questionnaires were completed by 100 of the 114 persons attending. The table of observed counts is given below.

n(B, F, S)

B  F  S = 0  S = 1
0  0  22     4
0  1  3      12
1  0  8      1
1  1  12     38

For example, n(0, 1, 1) = 12 is the number of people that did not have a Berenhap, but did eat the Frikandel, and became sick.

(a) Estimate P(S = 1 | B = 1) and P(S = 1 | B = 0).

(b) Based on the estimates computed at (a), would you say there is an association between eating a “Berenhap” and becoming sick? Explain.

(c) Draw the undirected independence graph of the graphical model expressing the constraint B ⊥⊥ S | F, and state the corresponding independence assumption(s) in words.


(d) Compute the fitted counts n̂(B, F, S) for the model given under (c).

(e) Perform a statistical test to check whether the model you fitted under (d) gives an adequate fit of the data, using α = 0.05. To perform the test, you may consult the following table with critical values:

degrees of freedom (ν)         1     2     3     4     5     6     7     8
critical value (χ²_{ν;0.05})   3.84  6.00  7.82  9.50  11.1  12.6  14.1  15.5

Clearly state whether or not the model is rejected, and explain how you made that decision.
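The computations in parts (a), (d) and (e) can be sketched as follows. This is a minimal sketch under two assumptions: the probabilities in (a) are estimated by relative frequencies, and the test in (e) uses the deviance G² = 2 Σ n log(n / n̂), since the exam does not name the test statistic here. The fitted counts use the standard closed form for conditional independence, n̂(b, f, s) = n(b, f, +) · n(+, f, s) / n(+, f, +). The names `p_sick_given_b`, `fitted` and `n_hat` are illustrative.

```python
from math import log

# Observed counts n[(b, f, s)] from the table in Question 4.
n = {
    (0, 0, 0): 22, (0, 0, 1): 4,
    (0, 1, 0): 3,  (0, 1, 1): 12,
    (1, 0, 0): 8,  (1, 0, 1): 1,
    (1, 1, 0): 12, (1, 1, 1): 38,
}

def p_sick_given_b(b):
    """Relative frequency of S = 1 among respondents with B = b."""
    sick = sum(c for (bb, f, s), c in n.items() if bb == b and s == 1)
    total = sum(c for (bb, f, s), c in n.items() if bb == b)
    return sick / total

def fitted(b, f, s):
    """Fitted count under B independent of S given F."""
    n_bf = sum(n[(b, f, ss)] for ss in (0, 1))                    # n(b, f, +)
    n_fs = sum(n[(bb, f, s)] for bb in (0, 1))                    # n(+, f, s)
    n_f = sum(n[(bb, f, ss)] for bb in (0, 1) for ss in (0, 1))   # n(+, f, +)
    return n_bf * n_fs / n_f

n_hat = {cell: fitted(*cell) for cell in n}

# Deviance; every observed count is positive here, so log is defined.
deviance = 2 * sum(c * log(c / n_hat[cell]) for cell, c in n.items())

print(p_sick_given_b(1), p_sick_given_b(0))
print(n_hat)
print(deviance)
```

For three binary variables this model has (2 − 1)(2 − 1) · 2 = 2 degrees of freedom, so the deviance is compared with the critical value listed for ν = 2 in the table.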

Question 5: Bayesian Networks (15 points)

We perform a greedy hill-climbing search to find a good Bayesian network structure on 4 variables denoted A, B, C, and D. Neighbour models are obtained by adding, deleting, or reversing an edge. We start the search process from the following initial graph:

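[Figure: initial directed graph on the nodes A, B, C and D; the edges are not recoverable from the extracted text.]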

In step 1 of the search we find that deleting the edge B → D gives the biggest improvement in the BIC score.

(a) For which operations (addition, deletion, reversal of an edge) do we need to compute the change in score in step 2 of the search? Note: assume that scores of operations computed in previous iterations that are still valid are not recomputed.

(b) Why do we have the reversal operator, even though the same change to the model could be achieved by first deleting the edge, and subsequently adding the edge in the opposite direction in the next step?
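The neighbourhood used in Question 5 (add, delete, or reverse one edge, keeping the graph acyclic) can be sketched as below. The starting graph in the code is hypothetical and is NOT the graph from the exam figure; it only shares the node names and the edge B → D mentioned in the text. The function names `is_acyclic` and `neighbours` are illustrative, and the sketch only enumerates the moves without computing any BIC scores.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Check acyclicity by repeatedly removing nodes without incoming edges."""
    remaining, left = set(nodes), set(edges)
    while remaining:
        sources = {v for v in remaining if not any(t == v for _, t in left)}
        if not sources:
            return False  # every remaining node has a parent, so there is a cycle
        remaining -= sources
        left = {(s, t) for s, t in left if s in remaining}
    return True

def neighbours(nodes, edges):
    """All single-edge moves (add, delete, reverse) that keep the graph a DAG."""
    result = []
    for s, t in permutations(nodes, 2):
        if (s, t) in edges:
            result.append(("delete", s, t))
            reversed_edges = (edges - {(s, t)}) | {(t, s)}
            if is_acyclic(nodes, reversed_edges):
                result.append(("reverse", s, t))
        elif (t, s) not in edges and is_acyclic(nodes, edges | {(s, t)}):
            result.append(("add", s, t))
    return result

# Hypothetical starting graph, NOT the graph from the exam figure.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"), ("B", "D"), ("C", "D")}
moves = neighbours(nodes, edges)
for move in moves:
    print(move)
```

Because the BIC score decomposes into one term per node given its parents, a move only changes the terms of the nodes whose parent sets it alters, which is the fact part (a) relies on.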
