Project 1
Data and Model Analytics using SAS Enterprise Miner
MIS 6334 Advanced BA with SAS
Group 11
Chaithanya Sai Srinivas Sabnaveesu – cxs163030
Ritesh Kunisseri Puliyakote – rxk166230
Sindhuja Masiragani – sxm164031
Tejasvi Ramadas Sagar – txs161330
Vinod Prem Kumar – vmp160030
Table of Contents
Objective
Part I. Data Pre-processing
1. Edit Variables
2. StatExplore
3. Dataset Partition
Part II. Building Decision Trees
1. Automatically pruned decision tree
2. Decision Tree using Logworth (Interactive decision tree)
Part III. Building Neural Networks and a Regression Model
1. Imputation
2. Transform Variables
3. Regression Model
4. Ensemble Model
5. Neural Network
6. Auto Neural Network
Part IV. Model Comparison and Champion Model Selection
Part V. Improving Model Performance
Managerial Implications and Learning
Objective
In this project we worked on the Expedia dataset and built multiple predictive models. The objective of this project is to find the best-fitting model, along with managerially relevant findings, to predict a customer's decision to book a flight.
Part I. Basic Data Pre-processing
Edit Variables:
While adding the data source, in the Advanced Data Source Wizard we changed the role of depend to Target, and X12, X32, and X38 were changed to Rejected. We also changed the level of X1, X6, X15, X33, and X37 to Binary, and of X3, X5, and X7 to Ordinal.
StatExplore:
We explored the raw dataset further and observed the following (a Base SAS sketch of the equivalent checks follows):
· The variables X32 and X38 have more than half of their values missing in the raw data, so these variables were rejected.
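Although StatExplore is a GUI node, the same checks can be reproduced in Base SAS. A minimal sketch, assuming the raw data sit in a dataset named EXPEDIA with interval inputs X2 and X4 and class inputs X1 and X6 (these variable names are illustrative, not the full input list):

proc means data=expedia n nmiss mean std skewness;
   var x2 x4;                /* interval inputs: missing counts and skewness */
run;

proc freq data=expedia;
   tables x1 x6 / missing;   /* class inputs: levels, with missing shown */
run;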
Interval Variable Summary Statistics:
Class Variable Summary Statistics:
Data Partition:
Training: 55% of the data was allocated for training.
Validation: 45% of the data was allocated for validation.
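The Data Partition node performs this split inside Enterprise Miner; a rough Base SAS equivalent is sketched below, assuming the dataset EXPEDIA and the binary target DEPEND (names and seed are illustrative):

proc sort data=expedia; by depend; run;

/* Stratified 55% training sample; the rest becomes validation */
proc surveyselect data=expedia out=expedia_split samprate=0.55
                  outall seed=12345;
   strata depend;   /* preserve the target's class proportions */
run;

data train valid;
   set expedia_split;
   if selected then output train;   /* SELECTED flag comes from OUTALL */
   else output valid;
run;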
Part II. Building Decision Trees
Automatically pruned Decision Tree:
Pruning parameters (see the sketch after this list):
· Maximum Depth: 10
· Leaf Size: 2
· Number of Surrogate Rules: 2
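The Decision Tree node is configured through its properties panel, but PROC HPSPLIT accepts analogous settings; a minimal sketch with the same depth and leaf size (input names are illustrative, and surrogate rules remain a node property with no direct counterpart here):

proc hpsplit data=train maxdepth=10 minleafsize=2;
   class depend x1 x6;
   model depend = x1 x6 x2 x4;
   grow entropy;            /* split-search criterion */
   prune costcomplexity;    /* automatic pruning */
run;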
Results:
Misclassification Rate:
Train: 0.05131
Validation: 0.09602
Fit Statistics:
Tree Diagram:
Interactive Decision Tree:
Considering the logworth values of the candidate splits, we pruned the tree explicitly and ran the model, keeping the remaining node settings the same as in the decision tree model above.
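For reference, Enterprise Miner computes a split's logworth from the p-value of its significance test:

logworth = -log10(p-value)

so a logworth of 3 corresponds to p = 0.001, and splits with higher logworth are preferred when growing the tree interactively.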
Pruning parameters:
· Maximum Depth: 4
· Leaf Size: 5
· Number of Surrogate Rules: 4
Results:
Misclassification Rate:
Train: 0.12128
Validation: 0.13656
Fit Statistics:
Tree Diagram:
Part III. Building Neural Networks and a Regression Model
Imputation:
We used the Impute node to impute the missing values (a sketch of the equivalent Base SAS logic follows the list).
· For class variables, we replaced the missing values, which appear in the dataset as a dot (.).
· For interval variables, we imputed the missing values with the mean.
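The Impute node generates logic similar to the following Base SAS sketch (dataset and variable names are illustrative; treating missing class values as a level of their own is one reasonable reading of the setting described above):

/* Interval inputs: replace missing values with the mean */
proc stdize data=train out=train_imp reponly method=mean;
   var x2 x4;
run;

/* Class inputs: give missing values an explicit level of their own */
data train_imp;
   set train_imp;
   if x1 = . then x1 = -1;   /* assumes a numeric class input; -1 marks "missing" */
run;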
Transform Variables:
After partitioning the data, the next step in data pre-processing is transforming the input variables to reduce the skewness in the data. In the Transform Variables node, we applied a log10 transformation to the interval variables whose skewness is close to 5 or greater.
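The node's log10 transform corresponds to a one-line DATA step. A sketch, assuming X2 is one of the right-skewed, non-negative interval inputs (the +1 offset avoids log10(0)):

data train_tf;
   set train_imp;
   x2_log10 = log10(x2 + 1);   /* compress the long right tail */
run;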
Regression Model:
Since the target variable in our dataset is binary, we used logistic regression.
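The Regression node's binary logistic model corresponds to PROC LOGISTIC in Base SAS; a sketch with illustrative input names:

proc logistic data=train_tf;
   class x1 x6 / param=ref;                   /* class inputs, reference coding */
   model depend(event='1') = x1 x6 x2_log10 x4;
   score data=valid out=valid_scored;         /* score the validation partition */
run;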
Results:
Misclassification Rate:
Train: 0.12070
Validation: 0.13514
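These rates can be verified from the confusion matrix in Part IV, since the misclassification rate is (false negatives + false positives) / total observations:

Train: (186 + 21) / (186 + 1466 + 21 + 42) = 207 / 1715 ≈ 0.12070
Validation: (164 + 26) / (164 + 1191 + 26 + 25) = 190 / 1406 ≈ 0.13514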
Fit Statistics:
Ensemble Model:
We designed an ensemble model by combining bagging and boosting with the optimal decision tree, adding Start Groups and End Groups nodes around the Decision Tree node (see the sketch below).
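Under the hood, the Start Groups node (Mode = Bagging or Boosting) resamples the training data and fits one tree per group, and the Ensemble node combines the member models, for example by averaging their posterior probabilities. A minimal sketch of that averaging step, assuming three row-aligned scored datasets SCORED1-SCORED3 that each carry a posterior P_DEPEND1 (all names hypothetical):

data ensemble_scored;
   merge scored1(rename=(p_depend1=p1))
         scored2(rename=(p_depend1=p2))
         scored3(rename=(p_depend1=p3));   /* assumes identical row order */
   p_ensemble = mean(p1, p2, p3);          /* average the member posteriors */
   i_depend   = (p_ensemble >= 0.5);       /* default 0.5 decision threshold */
run;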
Fit Statistics:
Results:
Misclassification Rate:
Train: 0.05131
Validation: 0.09602
Neural Network:
Fit Statistics:
Results:
Misclassification Rate:
Train: 0.13003
Validation: 0.13869
Auto Neural Network:
Fit Statistics:
Results:
Misclassification Rate:
Train: 0.13294
Validation: 0.13442
Part IV. Model Comparison and Champion Model Selection
We assigned a cost of 5 for misclassifying a 1 as a 0 and a cost of 1 for misclassifying a 0 as a 1.
Decision Function and Weights:
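With these weights, the decision function classifies a customer as a booker whenever the expected cost of predicting 0 exceeds that of predicting 1; a quick check of where the threshold lands:

Predict 1 when 5 × P(depend = 1) > 1 × P(depend = 0) = 1 − P(depend = 1),
i.e. when P(depend = 1) > 1/6 ≈ 0.167.

So the cost matrix effectively lowers the decision threshold from 0.5 to about 0.17, trading extra false positives for fewer, more costly false negatives.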
Models:
Fit Statistics:
ROC Curve:
Champion Model: Decision Tree
Part V. Improving Model Performance
With the Decision Tree as the champion model, we decided to improve it with an ensemble method, i.e., ensembling the Decision Tree using bagging and boosting. We used several variations of the Interactive Decision Tree, Logistic Regression, Neural Network, Support Vector Machine, and Auto Neural Network models to compare their misclassification rates.
Model Node | Model Description | Data Role | Target | False Negative | True Negative | False Positive | True Positive
Reg | Regression | TRAIN | depend | 186 | 1466 | 21 | 42
Reg | Regression | VALIDATE | depend | 164 | 1191 | 26 | 25
Neural | Neural Network | TRAIN | depend | 223 | 1487 | . | 5
Neural | Neural Network | VALIDATE | depend | 188 | 1210 | 7 | 1
Tree | Decision Tree | TRAIN | depend | 150 | 1477 | 10 | 78
Tree | Decision Tree | VALIDATE | depend | 141 | 1182 | 35 | 48
Tree3 | Interactive Decision Tree | TRAIN | depend | 202 | 1481 | 6 | 26
Tree3 | Interactive Decision Tree | VALIDATE | depend | 170 | 1195 | 22 | 19
Ensmbl | Ensemble | TRAIN | depend | 87 | 1486 | 1 | 141
Ensmbl | Ensemble | VALIDATE | depend | 128 | 1210 | 7 | 61
AutoNeural | AutoNeural | TRAIN | depend | 228 | 1487 | 0 | 0
AutoNeural | AutoNeural | VALIDATE | depend | 189 | 1217 | 0 | 0
HPSVM | HP SVM | TRAIN | depend | 206 | 1457 | 30 | 22
HPSVM | HP SVM | VALIDATE | depend | 173 | 1190 | 27 | 16
Results:
Misclassification Rate:
Train: 0.05131
Validation: 0.09602
Managerial Implications and Learning
Managerial Implications:
In the course of our analysis and findings, we can recommend the following managerial insights:
1. Targeting potential customers based on their online behaviour
· Use customers' browsing data (cookie information) and their past activity on other websites.
· Design tactics based on the user information that is received.
· We can leverage the following variables to support our insights: X33 (whether the user has booked at any site up to this point) and X25 (% of total bookings that are to this site).
2. Using the variable that signifies whether the user has spent substantial time on our website without proceeding to book, we can apply this information to remarketing: if the user is trying to book the ticket on a competitor's website, we can flash our ticket price (with a discount) and try to bring the customer back to our website.
3. Going one step further, we can combine age and gender with the time the user spends on the site. If we have predicted that the user is a potential customer, we can apply demographic targeting: segmenting people by important variables such as age and gender and targeting them with suitable discounts and coupons to encourage them to book on our site.
Learnings:
1. The StatExplore node helped us look for missing class and interval values, skewness in the data, outliers, and inconsistent data.
2. Multiple decision trees are generated in a Random Forest, and it is always better to reduce the data's skewness to get better results.
3. Changing the maximum number of leaf nodes in the Decision Tree node's configuration helps us overcome overfitting.
4. We can use a combination of boosting and bagging as a plausible substitute for a Random Forest.