What is a decision tree?
A decision tree is a map of the possible outcomes of a series of related choices. It allows an individual or organization to weigh possible actions against one another based on their costs, probabilities, and benefits. Decision trees can be used either to drive informal discussion or to map out an algorithm that predicts the best choice mathematically.
A decision tree typically starts with a single node, which branches into possible outcomes. Each of those outcomes leads to additional nodes, which branch off into other possibilities. This gives it a treelike shape.
There are three different types of nodes: chance nodes, decision nodes, and end nodes. A chance node, represented by a circle, shows the probabilities of certain results. A decision node, represented by a square, shows a decision to be made, and an end node shows the final outcome of a decision path.
Decision trees can also be drawn with flowchart symbols, which some people find easier to read and understand.
Decision tree symbols
The most important symbols used to draw a decision tree are the square (decision node), the circle (chance node), and the triangle (end node).
Other symbols exist and can be found with a quick web search. Next, we look at the terminology used with decision trees:
Important Terminology related to Decision Trees
Root Node: It represents the entire population or sample, which then gets divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: A node that does not split is called a leaf or terminal node.
Pruning: Removing the sub-nodes of a decision node is called pruning. It is the opposite of splitting.
Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
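The terminology above can be made concrete with a small data structure. The following is a minimal sketch in Python; the `Node` class, its methods, and the sample labels are hypothetical names chosen for illustration and do not come from any library.

```python
class Node:
    """One node of a decision tree (hypothetical illustration class)."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent      # None for the root node
        self.children = []        # empty for a leaf / terminal node

    def split(self, *labels):
        """Splitting: divide this node into two or more sub-nodes."""
        self.children = [Node(label, parent=self) for label in labels]
        return self.children

    def prune(self):
        """Pruning: remove the sub-nodes, turning this node back into a leaf."""
        self.children = []

    def is_leaf(self):
        """A leaf / terminal node is a node that does not split."""
        return not self.children

    def leaves(self):
        """Collect the leaf nodes of the sub-tree rooted at this node."""
        if self.is_leaf():
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


# Root node: represents the whole sample.
root = Node("all students")
boys, girls = root.split("boys", "girls")    # root is now a parent node
tall, short = boys.split("tall", "short")    # "boys" is now a decision node

leaf_labels = [leaf.label for leaf in root.leaves()]
boys.prune()                                 # undo the second split
leaf_labels_after_pruning = [leaf.label for leaf in root.leaves()]
```

Before pruning, the leaves are "tall", "short", and "girls"; after pruning, "boys" is a leaf again.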
Types of Decision Trees
The type of a decision tree depends on the type of target variable. There are two types:
1. Categorical Variable Decision Tree: A decision tree that has a categorical target variable is called a categorical variable decision tree. E.g., in a problem where the target variable is “Student will play cricket or not”, the target takes the values YES or NO.
2. Continuous Variable Decision Tree: A decision tree that has a continuous target variable is called a continuous variable decision tree.
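The difference between the two types shows up in how a leaf turns the training records that reach it into a prediction. The following is a minimal sketch, assuming the common convention of a majority vote for categorical targets and the mean for continuous targets. The function names and the sample values are made up for illustration.

```python
from collections import Counter
from statistics import mean


def categorical_leaf_prediction(targets):
    """Categorical target: predict the most common class in the leaf."""
    return Counter(targets).most_common(1)[0][0]


def continuous_leaf_prediction(targets):
    """Continuous target: predict the average target value in the leaf."""
    return mean(targets)


# Made-up target values of the training records that ended up in one leaf.
plays_cricket = ["YES", "YES", "NO", "YES"]
hours_played = [2.0, 3.0, 4.0]

class_prediction = categorical_leaf_prediction(plays_cricket)
value_prediction = continuous_leaf_prediction(hours_played)
```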
Assumptions while creating Decision Tree
These are some of the assumptions we make when using a decision tree:
In the beginning, the whole training set is considered as the root.
Feature values are preferred to be categorical. If the values are continuous then they are discretized prior to building the model.
Records are distributed recursively on the basis of attribute values.
The order in which attributes are placed as the root or as internal nodes of the tree is decided using a statistical approach.
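One common statistical approach for deciding which attribute goes at the root is to pick the split that leaves the purest sub-nodes. The sketch below uses Gini impurity as that measure, which is an assumption, since the text does not name a specific criterion. The tiny data set and the helper names are invented for illustration.

```python
from collections import Counter, defaultdict


def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed classes."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def weighted_gini_after_split(records, attribute, target):
    """Average impurity of the sub-nodes created by splitting on `attribute`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record[target])
    total = len(records)
    return sum(len(group) / total * gini(group) for group in groups.values())


def best_root_attribute(records, attributes, target):
    """Pick the attribute whose split gives the lowest weighted impurity."""
    return min(attributes,
               key=lambda a: weighted_gini_after_split(records, a, target))


# Invented records: the whole training set starts at the root.
records = [
    {"gender": "boy", "height": "tall", "plays": "YES"},
    {"gender": "boy", "height": "short", "plays": "YES"},
    {"gender": "girl", "height": "tall", "plays": "NO"},
    {"gender": "girl", "height": "short", "plays": "NO"},
]

root_attribute = best_root_attribute(records, ["gender", "height"], "plays")
```

In this made-up data, splitting on "gender" produces two pure sub-nodes, while splitting on "height" leaves both sub-nodes evenly mixed, so "gender" is chosen for the root.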
Advantages of Decision Tree:
Easy to Understand: Decision tree output is very easy to understand, even for people from a non-analytical background. No statistical knowledge is required to read and interpret it. Its graphical representation is very intuitive, and users can easily relate it to their hypotheses.
Useful in Data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relation between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. For example, when we are working on a problem where information is available in hundreds of variables, a decision tree will help to identify the most significant ones.
Decision trees implicitly perform variable screening or feature selection.
Decision trees require relatively little effort from users for data preparation.
Less data cleaning required: It requires less data cleaning than some other modeling techniques. To a fair degree, it is not influenced by outliers and missing values.
The data type is not a constraint: It can handle both numerical and categorical variables. It can also handle multi-output problems.
Non-Parametric Method: A decision tree is considered to be a non-parametric method. This means that decision trees make no assumptions about the space distribution or the classifier structure.
Disadvantages of Decision Tree:
Overfitting: Decision-tree learners can create over-complex trees that do not generalize well from the training data. This is called overfitting, and it is one of the most common practical difficulties with decision tree models. It can be addressed by setting constraints on model parameters and by pruning.
Not fit for continuous variables: When working with continuous numerical variables, the decision tree loses information when it groups their values into categories.
Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This is called variance, and it can be lowered by ensemble methods such as bagging and boosting.
The greedy algorithms used to build trees cannot guarantee that they return the globally optimal decision tree. This can be mitigated by training multiple trees, where the features and samples are randomly sampled with replacement.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the data set prior to fitting with the decision tree.
Information gain in a decision tree with categorical variables is biased in favor of attributes with a greater number of categories.
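The last point can be seen on a toy example. An attribute with a unique value per record (such as an ID) splits the data into perfectly pure single-record groups, so it gets the highest information gain even though it is useless for prediction. The sketch below also computes the gain ratio, a correction used by the C4.5 algorithm that divides the gain by the entropy of the attribute itself. The data and the function names are invented for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of values."""
    total = len(values)
    return -sum(c / total * log2(c / total) for c in Counter(values).values())


def information_gain(records, attribute, target):
    """Entropy of the target minus the weighted entropy after the split."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record[target])
    total = len(records)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy([r[target] for r in records]) - remainder


def gain_ratio(records, attribute, target):
    """Information gain divided by the entropy of the attribute's own values."""
    split_info = entropy([r[attribute] for r in records])
    if split_info == 0:           # attribute has a single value: no real split
        return 0.0
    return information_gain(records, attribute, target) / split_info


# Invented data: every boy plays, most girls do not; "id" is unique per record.
records = [
    {"id": 1, "gender": "boy", "plays": "YES"},
    {"id": 2, "gender": "boy", "plays": "YES"},
    {"id": 3, "gender": "boy", "plays": "YES"},
    {"id": 4, "gender": "boy", "plays": "YES"},
    {"id": 5, "gender": "girl", "plays": "NO"},
    {"id": 6, "gender": "girl", "plays": "NO"},
    {"id": 7, "gender": "girl", "plays": "NO"},
    {"id": 8, "gender": "girl", "plays": "YES"},
]

attributes = ["id", "gender"]
best_by_gain = max(attributes, key=lambda a: information_gain(records, a, "plays"))
best_by_ratio = max(attributes, key=lambda a: gain_ratio(records, a, "plays"))
```

On this data, plain information gain prefers the eight-valued "id" attribute, while the gain ratio prefers "gender".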
How to draw a decision tree
To draw a decision tree, first pick a medium. You can draw it by hand on paper or a whiteboard, or you can use special decision tree software. In either case, here are the steps to follow:
1. Start with the main decision. Draw a small box to represent this point, then draw a line from the box to the right for each possible solution or action. Label them accordingly.
2. Add chance and decision nodes to expand the tree as follows:
If another decision is necessary, draw another box.
If the outcome is uncertain, draw a circle (circles represent chance nodes).
If the problem is solved, leave it blank (for now).
From each decision node, draw possible solutions. From each chance node, draw lines representing possible outcomes. If you intend to analyze your options numerically, include the probability of each outcome and the cost of each action.
3. Continue to expand until every line reaches an endpoint, meaning that there are no more choices to be made or chance outcomes to consider. Then, assign a value to each possible outcome. It could be an abstract score or a financial value. Add triangles to signify endpoints.
With a complete decision tree, you’re now ready to begin analyzing the decision you face.
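One common way to analyze a finished tree numerically is to roll back from the endpoints: an end node is worth its assigned value, a chance node is worth the probability-weighted average of its outcomes, and a decision node is worth its best option after subtracting the cost of that action. The following is a minimal sketch, assuming the goal is to maximize expected monetary value. The node classes, option names, and all figures are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class End:
    """Triangle: an endpoint with an assigned value."""
    value: float


@dataclass
class Chance:
    """Circle: uncertain outcomes as (probability, node) pairs."""
    outcomes: List[Tuple[float, "Node"]]


@dataclass
class Decision:
    """Square: named options, each mapped to a (cost, node) pair."""
    options: Dict[str, Tuple[float, "Node"]]


Node = Union[End, Chance, Decision]


def expected_value(node):
    """Roll the tree back from its endpoints to this node."""
    if isinstance(node, End):
        return node.value
    if isinstance(node, Chance):
        return sum(p * expected_value(child) for p, child in node.outcomes)
    return max(net_value(option) for option in node.options.values())


def net_value(option):
    """Expected value of one option minus the cost of taking it."""
    cost, child = option
    return expected_value(child) - cost


def best_option(decision):
    """Name of the option with the highest net expected value."""
    return max(decision.options, key=lambda name: net_value(decision.options[name]))


# Invented figures: launching costs 50 and succeeds half of the time.
tree = Decision({
    "launch": (50.0, Chance([(0.5, End(200.0)), (0.5, End(0.0))])),
    "do nothing": (0.0, End(0.0)),
})

choice = best_option(tree)
value = expected_value(tree)
```

With these made-up numbers, launching has an expected payoff of 100 and a cost of 50, so it beats doing nothing, which is worth 0.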