In machine learning, a decision tree is a predictive model that maps observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining and machine learning.
Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels.
In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data (but the resulting classification tree can be an input for decision making).
TensorFlow is an open source software library for machine learning in various kinds of perceptual and language understanding tasks. This post is based on one of many YouTube videos by Josh Gordon (Google Developers) on Machine Learning.
It is currently used for both research and production by multiple different teams in dozens of commercial Google products, such as speech recognition, Gmail, Google Photos, and search. TensorFlow was originally developed by the Google Brain team for Google’s research and production purposes and later released under the Apache 2.0 open source license.
I am mentioning TensorFlow because the motivation for this post was based on a couple of courses on machine learning using TensorFlow. Reviewing and learning is a good way to improve (see Feynman Learning Technique in this blog).
I have installed Anaconda3 on my Windows developer machine. That install includes a version of Python and many of the libraries used in machine learning tasks. For writing TensorFlow programs I am using Spyder and PTVS (Python Tools for Visual Studio) with Visual Studio Professional 2013.
Following is a simple example of a decision tree classifier written in Python using the scikit-learn library (does not make use of TensorFlow):
# using scikit-learn
# supervised learning:
# - collect data
# - train classifier (decision tree in this case)
# - make predictions
# in practice, training and actual data should be read from a file
# and labels should be associated with the data
# **** import libraries ****
from sklearn import tree
#import tensorflow as tf
# **** check we are up and running ****
# **** features: [weight in grams, texture] ****
# **** change "smooth" to 1 and "bumpy" to 0 ****
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
print(" features:", features)
# **** change "apple" to 0 and "orange" to 1 ****
labels = [0, 0, 1, 1]
print(" labels:", labels)
# **** pick a decision tree classifier ****
clf = tree.DecisionTreeClassifier()
# **** train the classifier ****
clf = clf.fit(features, labels)
# **** predict what type of fruit this is (160g and bumpy) ****
label = clf.predict([[160, 0]])
# **** display the label returned by the classifier ****
print(" label:", label)
if label == 1:
    print(" label: orange")
else:
    print(" label: apple")
The idea is to build a classifier based on a decision tree to be able to differentiate between apples and oranges. The training data follows:

Weight (g)  Texture  Label
140         smooth   apple
130         smooth   apple
150         bumpy    orange
170         bumpy    orange
In practice one would need many more samples. In general, the more samples, the better the classifier should do on actual data. We will touch on this point in a future post.
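As a small sketch of why more samples matter (the extra data points below are my own made-up additions, not part of the original exercise), with a larger sample one can hold out a test set and measure the classifier's accuracy:

```python
# a minimal sketch (hypothetical data) showing how a larger sample
# lets us hold out a test set and measure classifier accuracy
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# hypothetical fruit data: [weight in grams, texture (1 = smooth, 0 = bumpy)]
features = [[140, 1], [130, 1], [150, 0], [170, 0],
            [135, 1], [145, 1], [160, 0], [175, 0]]
labels = [0, 0, 1, 1, 0, 0, 1, 1]  # 0 = apple, 1 = orange

# hold out half of the samples for testing,
# keeping both classes in each split
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.5, random_state=0, stratify=labels)

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

With real data the accuracy on held-out samples is the number to watch; a classifier that memorizes four points tells us little.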
Following is a representation of the decision tree classifier automatically built by the scikit-learn library:
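If you want to generate a text representation of the learned tree yourself, one way (assuming a scikit-learn version that provides tree.export_text, available since 0.21; older versions offer tree.export_graphviz instead) is:

```python
from sklearn import tree

# same training data as in the example above
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

# print the learned tree as indented text, one line per node
print(tree.export_text(clf, feature_names=["weight", "texture"]))
```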
Following is a capture of the console for this exercise:
features: [[140, 1], [130, 1], [150, 0], [170, 0]]
labels: [0, 0, 1, 1]
The interesting thing to note is that (with the only exception of printing a different label) we could use the same program with the following data:
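The specific alternate data from the original post is not reproduced here, but as a made-up stand-in, the same program could just as easily separate two dog breeds by height and hair length (all values and labels below are hypothetical):

```python
# hypothetical stand-in data: [height in inches, hair length (1 = long, 0 = short)]
from sklearn import tree

features = [[24, 0], [26, 0], [9, 1], [10, 1]]
labels = [0, 0, 1, 1]  # made-up labels: 0 = greyhound, 1 = poodle

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

# predict the breed of a 25-inch, short-haired dog
label = clf.predict([[25, 0]])
print(" label:", label)
if label == 1:
    print(" label: poodle")
else:
    print(" label: greyhound")
```

Only the data and the printed label names change; the classifier code itself is untouched.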
If you have comments or questions regarding this post or any other entry in this blog, please do not hesitate to send me a message via email. I will reply as soon as possible and will not use your name unless you explicitly tell me to do so.
Follow me on Twitter: @john_canessa