Well, truth be told. I'm very worried about decision trees..
not even mentioning the exam in general.
Looking at the solutions to Assignment 3 we have roughly 12 pages as an answer to the decision tree question. Worried about the time it takes to do this and the chances of making mistakes through the process.
My summary on the concept (for those who need another perspective / who would like to comment and correct me)
1. Start with the whole example set and define S = [number for each category eg. 5l,3m,6h]
ie. we have a total of 14 values, of which 5 are low, 3 are medium and 6 are high
2. Calculate the Entropy for the Complete set defined as S (log is actually log base 2)
- (4/14 log 5/14) - (3/14 log 3/14) - (6/14 log 6/14) = 1.531
Now we determine the gain of each of the other attributes still left
3. For each attribute (eg here will be Credit History)
a) Define it in terms of the possible values for S where S = [5l,3m,6h] as defined above
Thus we have: 3 possible values for Credit History (bad, good, unknown)
- in terms of S's possible values, for Bad Credit History we will have 0 Low, 1 Medium and 3 High.
Thus for bad: Sb = [0l, 1m, 3h] Notice that it contains only 4 out of the 14 values (4/14)
Also, Sg = [3l,1m,1h] (5/15 values)
Su = [2l,1m,2h] (5/14 values)
We calculate Entropy for each of these - why? cause we want to deduct it from entropy of the set S to get the information GAIN.
Entrypy(Sb) = -(0/4 log 0/4) - (1/4 log 1/4) - (3/4 log 3/4) = 0.811
E(Sg) = 1.371
E(Su) = 1.522
Now Gain = E(S) - 4/14(E(Sb)) - 5/15(E(Sg)) - 5/15(S(Su)) = 0.266
This is the information gain for the first attribute. NOw we do it for the other three and pick the highest valued attribute.This is our first node. (InCOME)
b) Now notice that INcome in turn has three possible values. Each of these will form a branch to the next level in your tree. Thus for each of these
you will have to decide the next node. Thus for each of the 3 possible values, filter out all lines that aren't relevent (ie.start with value <R15k and work with those lines only - there are 4)
Now you again define S. S = [4H]. Because we only have 1 type of value, this will be the result of this branch.
second take the value R15k - R35K
Filter out all irrelevant lines.
We have to value types of our goal node (ie. 2x High and 2x Medium).
Define S = [2m, 2H]
Get Entropy (S) = -(2/4log2/4) - (2/4 log 2/4)
Now for each of the left over attributes (Credit History, Debt, Collateral), find its entropy and information gain. Pick the best 1 and use that as the next node in your tree
eg.
Credit History: For unknown we have 1 High and 1 Medium goal value Su = [1h, 1m] Fraction of total S = 2/4
For Good we only have 1 Medium goal value Sg =[0h, 1m] Fraction 1/4
For Bad we have 1 High goal value Sb = [1h, 0m] Fraction 1/4
Work out eaches entropy eg(unknown = - (1/2 log 1/2) - (1/2 log 1/2))
Then deduct it from entropy of S (with each attribute as a fraction)
Pick the highest gain and insert the node in your tree.
The moment your example list gets filtered and ends up with a single goal value (also not that if you end up with a single line - you will always have 1 goal value), then just insert the goal value as the node in your tree. The moment you have more than 1 optio for a goal value, you need to do the calc above.
Dan
Danie van Eeden
------------------------