C4.5 is a standard benchmark in machine learning. For this reason, it is incorporated in Orange, although Orange has its own implementation of decision trees.
C45Learner uses Quinlan's original code for learning, so the tree you get is exactly like the one that would be built by the standalone C4.5. Upon return, however, the original tree is copied into Orange components that contain exactly the same information plus what is needed to make them accessible from Python. To be sure that the algorithm behaves just like the original, we use a dedicated class, orange.C45TreeNode, instead of reusing the components used by Orange's own tree inducer (i.e., orange.TreeNode). This could be done, though, and probably will be in the future; we shall still retain orange.C45TreeNode, but offer a transformation to orange.TreeNode so that routines that work on Orange trees will also be usable for C4.5 trees.
C45Learner and C45Classifier behave like any other Orange learner and classifier. Unlike most Orange learning algorithms, however, C4.5 does not accept weighted examples.
We haven't been able to obtain the legal rights to distribute C4.5, so we couldn't link it statically into Orange. Instead, it is incorporated as a plug-in which you'll need to build yourself. The procedure is trivial, but you'll need a C compiler. On Windows, the scripts we provide work with MS Visual C, and the files CL.EXE and LINK.EXE must be on the PATH. On Linux you're almost certainly equipped with gcc. Mac OS X comes without gcc, but you can download it for free from Apple.
Orange must be installed prior to building C4.5. (This is because the build script will copy the created file next to Orange, which it obviously can't if Orange isn't there yet.)
After building, check that everything works by importing orange and constructing orange.C45Learner(). If this fails, something went wrong; see the diagnostic messages from buildC45.py and read the paragraph below.

If the process fails, here's what buildC45.py really does: it creates .h files that wrap Quinlan's .i files and ensure that they are not included twice, and it modifies the C4.5 sources to include the .h's instead of the .i's. This step can hardly fail. Then follows the platform-dependent step, which compiles ensemble.c (which in turn includes all of Quinlan's .c files it needs) into c45.dll or c45.so and puts it next to Orange. If this step fails, but you do have a C compiler and linker and you know how to use them, you can compile ensemble.c into a dynamic library yourself; the compile and link steps in buildC45.py may help. Afterwards, check that the built C4.5 gives the same results as the original.
C45Learner's attributes have double names: those that you know from the C4.5 command line, and the corresponding names of C4.5's internal variables. All defaults are set as in C4.5; if you change nothing, you are running C4.5.
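The dual naming can be pictured with a plain Python class that aliases a single-letter option to its long attribute name. This is only a sketch of the idea (Orange implements it in C++, not like this), and the option table below is a hypothetical excerpt:

```python
class Options:
    """Toy options object with C4.5-style dual naming: the long
    attribute name is the storage, the single letter is an alias."""

    # hypothetical excerpt of the letter-to-name table
    _aliases = {"g": "gainRatio", "s": "subset", "m": "minObjs"}

    def __init__(self):
        self.gainRatio = False   # -g: use gain ratio instead of gain
        self.subset = False      # -s: value subsetting
        self.minObjs = 2         # -m: minimal examples in a leaf

    def __getattr__(self, name):
        # called only when normal lookup fails, i.e. for the short names
        if name in self._aliases:
            return getattr(self, self._aliases[name])
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # route short names to their long counterparts
        object.__setattr__(self, self._aliases.get(name, name), value)

opts = Options()
opts.m = 100            # set via the command-line letter...
print(opts.minObjs)     # ...visible under the internal name: 100
```

Either spelling reads and writes the same underlying value, which is the behaviour the paragraph above describes.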
Attributes

gainRatio (g): use gain (false, default) or gain ratio for selection of attributes (true)
batch (b): batch mode, as opposed to C4.5's iterative windowing (default: true)
subset (s): use value subsetting (default: false, no subsetting)
probThresh (p): use probabilistic thresholds for continuous attributes (default: false)
minObjs (m): minimal number of examples in leaves (default: 2)
window (w): initial window size in iterative mode
increment (i): number of examples added to the window at each iteration
cf (c): pruning confidence level (default: 25%)
trials (t): number of trials in iterative mode (default: 10)
prune: return a pruned tree (true, default) or an unpruned one (false); this is not an original C4.5 option

C45Learner also offers another way of setting the arguments: it provides a method commandline which is given a string and parses it the same way as C4.5 would parse its command line.
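What commandline does can be pictured in plain Python: split the string as a shell would and map each switch onto the corresponding attribute. This is a sketch only, and the option letters handled below are a hypothetical excerpt, not the full C4.5 set:

```python
import getopt
import shlex

def parse_commandline(line):
    """Parse a C4.5-style option string into a dict of settings.
    Sketch only; the option letters are a hypothetical excerpt."""
    settings = {}
    # -g and -s are flags; -m and -c take a value
    opts, _args = getopt.getopt(shlex.split(line), "gsm:c:")
    for opt, val in opts:
        if opt == "-g":
            settings["gainRatio"] = True
        elif opt == "-s":
            settings["subset"] = True
        elif opt == "-m":
            settings["minObjs"] = int(val)
        elif opt == "-c":
            settings["cf"] = float(val)
    return settings

print(parse_commandline("-m 1 -s"))   # {'minObjs': 1, 'subset': True}
```

The real method sets the learner's attributes directly instead of returning a dictionary, but the translation step is the same.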
C45Classifier contains a faithful reimplementation of the classification function from Quinlan's C4.5. The only difference (and the only reason it was rewritten) is that it uses a tree composed of orange.C45TreeNodes instead of C4.5's original tree structure.
Attributes

tree: the root of the induced tree, as a C45TreeNode; the whole tree is built of C45TreeNodes.
This class is a reimplementation of the corresponding struct from Quinlan's C4.5 code.
Attributes

nodeType: the type of the node, one of C45TreeNode.Leaf (0), C45TreeNode.Branch (1), C45TreeNode.Cut (2) and C45TreeNode.Subset (3). "Leaves" are leaves; "branches" split examples on the values of a discrete attribute; "cuts" cut them according to a threshold value of a continuous attribute; and "subsets" use discrete attributes, but with subsetting, so that several values can go into the same branch.
leaf: the value returned by the leaf; the field is defined for internal nodes as well.
items: the number of (learning) examples in the node.
classDist: the class distribution for the node (a DiscDistribution).
tested: the attribute used in the node's test. For leaves, tested is None; if the node is of type Branch or Subset, tested is a discrete attribute; and if the node is of type Cut, tested is a continuous attribute.
cut: the threshold for continuous attributes, if the node is of type Cut; undefined otherwise.
mapping: defined only for nodes of type Subset. Element mapping[i] gives the index of the branch for an example whose value of tested is i. Here, i denotes an index of a value, not a Value.
branch: a list of the node's subtrees (C45TreeNodes).

The simplest way to use C45Learner is to call it. This script constructs the same learner as you would get by calling the usual C4.5.
part of c45.py (uses lenses.tab)
Arguments can be set by the usual mechanism (the two lines below do the same thing, except that one uses the command-line symbols and the other the internal variable names)
Veteran C4.5 users might prefer to set the options through the method commandline.
There's nothing special about using C45Classifier - it's just like any other classifier. To demonstrate what the structure of C45TreeNodes looks like, we will show a script that prints a tree out in the same format as C4.5 does. (You can find the script in the module orngC45.)
def printTree0(node, classvar, lev):
    var = node.tested
    if node.nodeType == 0:
        print "%s (%.1f)" % (classvar.values[int(node.leaf)], node.items),
    elif node.nodeType == 1:
        for i, val in enumerate(var.values):
            print ("\n"+"| "*lev + "%s = %s:") % (var.name, val),
            printTree0(node.branch[i], classvar, lev+1)
    elif node.nodeType == 2:
        print ("\n"+"| "*lev + "%s <= %.1f:") % (var.name, node.cut),
        printTree0(node.branch[0], classvar, lev+1)
        print ("\n"+"| "*lev + "%s > %.1f:") % (var.name, node.cut),
        printTree0(node.branch[1], classvar, lev+1)
    elif node.nodeType == 3:
        for i, branch in enumerate(node.branch):
            inset = filter(lambda a: a[1]==i, enumerate(node.mapping))
            inset = [var.values[j[0]] for j in inset]
            if len(inset)==1:
                print ("\n"+"| "*lev + "%s = %s:") % (var.name, inset[0]),
            else:
                print ("\n"+"| "*lev + "%s in {%s}:") % (var.name, reduce(lambda x,y: x+", "+y, inset)),
            printTree0(branch, classvar, lev+1)

def printTree(tree):
    printTree0(tree.tree, tree.classVar, 0)
    print
Leaves are the simplest. We just print out the value contained in node.leaf. Since this is not a qualified value (i.e., C45TreeNode does not know to which attribute it belongs), we need to convert it to a string through classVar, which is passed as an extra argument to the recursive part of printTree.
For discrete splits without subsetting, we print out all attribute values and recursively call the function for all branches. Continuous splits are equally easy to handle.
For discrete splits with subsetting, we iterate through the branches, gather into inset the values that go into each branch, turn the values into strings and print them out, treating separately the case when only a single value goes into the branch.
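The role of mapping can be checked with a few lines of plain Python; the attribute values and the mapping below are made up for illustration:

```python
# Hypothetical subset split: four attribute values, three branches.
values = ["young", "pre-presbyopic", "presbyopic", "unknown"]
mapping = [0, 1, 0, 2]   # value index -> branch index

# Invert the mapping: which value names go into each branch?
branches = {}
for value_index, branch_index in enumerate(mapping):
    branches.setdefault(branch_index, []).append(values[value_index])

print(branches)   # {0: ['young', 'presbyopic'], 1: ['pre-presbyopic'], 2: ['unknown']}
```

This inversion is exactly what the inset computation in printTree0 does with filter over enumerate(node.mapping), one branch at a time.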