Wednesday, November 1, 2017

Data and Model Analytics using SAS Enterprise Miner











Project 1
MIS 6334 Advanced BA with SAS








Group 11
Chaithanya Sai Srinivas Sabnaveesu – cxs163030
Ritesh Kunisseri Puliyakote – rxk166230
Sindhuja Masiragani – sxm164031
Tejasvi Ramadas Sagar – txs161330
Vinod Prem Kumar – vmp160030

Table of Contents

Objective
Part I. Data Pre-processing
    1. Edit Variables
    2. StatExplore
    3. Data Partition
Part II. Building Decision Trees
    1. Automatically Pruned Decision Tree
    2. Interactive Decision Tree (Logworth-Based)
Part III. Building Neural Networks and a Regression Model
    1. Imputation
    2. Transform Variables
    3. Regression Model
    4. Ensemble Model
    5. Neural Network
    6. AutoNeural Network
Part IV. Model Comparison and Champion Model Selection
Part V. Improving Model Performance
Managerial Implications and Learnings






In this project we worked with the Expedia dataset and built multiple predictive models. The objective is to find the best-fitting model, along with managerially relevant findings, for predicting a customer's decision to book a flight.
Part I – Basic Data Pre-processing
Edit Variables:

While adding the data source, on the Advanced Data Source Wizard window, we changed the role of depend to Target and set X12, X32, and X38 to Rejected. We also changed the level of X1, X6, X15, X33, and X37 to Binary, and of X3, X5, and X7 to Ordinal.



StatExplore:
We explored the raw dataset further and observed the following:
·        X32 and X38 have more than half of their values missing in the raw data, so these variables were rejected.



Interval Variable Summary Statistics:



Class Variable Summary Statistics:


Data Partition:

Training:  55% of the data was allocated for training
Validation: 45% of the data was allocated for validation




Part II. Building Decision Trees
Automatically pruned Decision Tree:
Pruning parameters:
·        Maximum Depth: 10
·        Leaf Size: 2
·        Number of Surrogate rules: 2
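
These settings map loosely onto scikit-learn's tree parameters (surrogate rules have no direct sklearn counterpart, and the synthetic data below is only a stand-in for the Expedia inputs):

```python
# Rough sklearn analogue of the EM tree settings: max depth 10, leaf size 2.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=2, random_state=0)
tree.fit(X, y)

# Misclassification rate = 1 - accuracy
print(round(1 - tree.score(X, y), 5))
```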

Results:
Misclassification Rate:
Train: 0.05131
Validation: 0.09602



Fit Statistics:



Tree Diagram:




Interactive Decision Tree
Guided by the logworth values of the candidate splits, we pruned the tree explicitly and reran the model, keeping all other settings the same as the decision tree model above.
Pruning parameters:
·        Maximum Depth: 4
·        Leaf Size: 5
·        Number of Surrogate rules: 4
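
EM ranks candidate splits by logworth, i.e. -log10 of the p-value of the split's chi-square test against the target. A small illustration with made-up branch counts:

```python
# Logworth of a candidate split: -log10(p) from the chi-square test of
# the 2x2 table (split branch vs. depend). Counts below are illustrative.
from math import log10
from scipy.stats import chi2_contingency

# rows: left/right branch, cols: depend = 0/1
table = [[400, 50], [300, 150]]
chi2, p, dof, _ = chi2_contingency(table)
logworth = -log10(p)
print(round(logworth, 2))  # higher logworth = stronger split
```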

Results:
Misclassification Rate:
Train: 0.12128
Validation: 0.13656



Fit Statistics:



Tree Diagram:




Part III. Building Neural Networks and a Regression Model
Imputation:
We used the Impute node to impute the missing values:
·        For class variables, we replaced the dot (.) that SAS uses to mark missing values, treating "missing" as its own level.
·        For interval variables, we imputed missing values with the mean.
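
A pandas sketch of the same imputation logic (column names are illustrative stand-ins for the Expedia variables):

```python
# Mean imputation for interval variables; a distinct "missing" level for
# class variables (SAS displays missing values as a dot).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "X2": [10.0, np.nan, 30.0],   # interval variable
    "X5": ["a", None, "b"],       # class variable
})

df["X2"] = df["X2"].fillna(df["X2"].mean())   # mean of observed values
df["X5"] = df["X5"].fillna("missing")         # missing as its own level
print(df["X2"].tolist())  # [10.0, 20.0, 30.0]
```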
Transform Variables:
After partitioning the data, the next pre-processing step is transforming the input variables to reduce skewness in the data. In the Transform Variables node, we applied a log10 transformation to interval variables whose skewness is close to or greater than 5.
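
The effect of the transform can be sketched in pandas (the +1 offset to avoid log10(0) is our assumption here, not taken from the report):

```python
# Log10 transform of a right-skewed interval variable reduces skewness.
import numpy as np
import pandas as pd

s = pd.Series(np.geomspace(1, 1e6, 50))  # strongly right-skewed values
print(round(s.skew(), 2))                # high positive skewness
t = np.log10(s + 1)                      # log10 transform with +1 offset
print(round(t.skew(), 2))                # much closer to zero
```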





Regression Model:
Since the target variable in our dataset is binary, we used logistic regression.
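
A minimal scikit-learn analogue of EM's Regression node (the synthetic data stands in for the imputed and transformed inputs):

```python
# Logistic regression on a binary target; misclassification = 1 - accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)
misclassification = 1 - model.score(X, y)
print(round(misclassification, 4))
```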
Results:
Misclassification Rate:
Train: 0.12070
Validation: 0.13514



Fit Statistics:



Ensemble Model:
We designed an ensemble model by combining bagging and boosting with the optimal decision tree, adding Start Groups and End Groups nodes around the Decision Tree node.





Fit Statistics:




Results:
Misclassification Rate:
Train: 0.05131
Validation: 0.09602
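
A sketch of what the bagging/boosting group loop does, using sklearn's BaggingClassifier and AdaBoostClassifier as stand-ins and averaging the two posterior probabilities (averaging posteriors is one of the Ensemble node's combination methods):

```python
# Combine a bagged and a boosted tree by averaging P(depend = 1).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=2)
base = DecisionTreeClassifier(max_depth=10, min_samples_leaf=2, random_state=2)

bagged = BaggingClassifier(base, n_estimators=10, random_state=2).fit(X, y)
boosted = AdaBoostClassifier(n_estimators=10, random_state=2).fit(X, y)

# Average posterior P(depend = 1), then threshold at 0.5.
posterior = (bagged.predict_proba(X)[:, 1] + boosted.predict_proba(X)[:, 1]) / 2
pred = (posterior >= 0.5).astype(int)
rate = (pred != y).mean()
print(round(rate, 4))  # ensemble misclassification rate
```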
Neural Network:





Fit Statistics:



Results:
Misclassification Rate:
Train: 0.13003
Validation: 0.13869
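
A rough MLP counterpart to EM's Neural Network node (the single hidden layer with three units is an assumption for illustration, not taken from the report):

```python
# Small multilayer perceptron on scaled inputs; NNs train better when
# the interval inputs are standardized.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=3)
X = StandardScaler().fit_transform(X)

nn = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=3)
nn.fit(X, y)
print(round(1 - nn.score(X, y), 4))  # misclassification rate
```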

Auto Neural Network:


Fit Statistics:



Results:
Misclassification Rate:
Train: 0.13294
Validation: 0.13442

Part IV.  Model Comparison and Champion Model Selection
We assigned a misclassification cost of 5 for predicting 0 when the true class is 1, and a cost of 1 for predicting 1 when the true class is 0.
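
This cost matrix shifts the optimal decision threshold: predicting 1 minimizes expected cost whenever 5·p > 1·(1 − p), i.e. whenever the posterior p exceeds 1/6:

```python
# Cost-minimizing decision rule for asymmetric misclassification costs.
def decide(p_event, fn_cost=5.0, fp_cost=1.0):
    """Return 1 if predicting 0 would be costlier in expectation."""
    return int(fn_cost * p_event > fp_cost * (1 - p_event))

print(decide(0.10))  # 0: below the 1/6 threshold
print(decide(0.20))  # 1: above the 1/6 threshold
```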
Decision Function and Weights:



Models:




Fit Statistics:


ROC Curve:




Champion Model: Decision Tree
Part V.  Improving Model Performance
With the Ensemble model being the champion, we pursued the ensemble approach further, i.e., ensembling the Decision Tree with bagging and boosting. We also tried several variations of the Interactive Decision Tree, Logistic Regression, Neural Network, Support Vector Machine, and AutoNeural models and compared their misclassification rates.


Event Classification Table:

Model Node   Model Description           Data Role   Target   False Neg.   True Neg.   False Pos.   True Pos.
Reg          Regression                  TRAIN       depend   186          1466        21           42
Reg          Regression                  VALIDATE    depend   164          1191        26           25
Neural       Neural Network              TRAIN       depend   223          1487        .            5
Neural       Neural Network              VALIDATE    depend   188          1210        7            1
Tree         Decision Tree               TRAIN       depend   150          1477        10           78
Tree         Decision Tree               VALIDATE    depend   141          1182        35           48
Tree3        Interactive Decision Tree   TRAIN       depend   202          1481        6            26
Tree3        Interactive Decision Tree   VALIDATE    depend   170          1195        22           19
Ensmbl       Ensemble                    TRAIN       depend   87           1486        1            141
Ensmbl       Ensemble                    VALIDATE    depend   128          1210        7            61
AutoNeural   AutoNeural                  TRAIN       depend   228          1487        0            0
AutoNeural   AutoNeural                  VALIDATE    depend   189          1217        0            0
HPSVM        HP SVM                      TRAIN       depend   206          1457        30           22
HPSVM        HP SVM                      VALIDATE    depend   173          1190        27           16

Results:
Misclassification Rate:
Train: 0.05131
Validation: 0.09602
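
These rates can be recomputed directly from the Ensemble rows of the event classification table, as (FN + FP) / total:

```python
# Misclassification rate from confusion counts: (FN + FP) / total.
def misclassification(fn, tn, fp, tp):
    return (fn + fp) / (fn + tn + fp + tp)

# Ensemble confusion counts from the table above.
print(round(misclassification(87, 1486, 1, 141), 5))   # 0.05131 (train)
print(round(misclassification(128, 1210, 7, 61), 5))   # 0.09602 (validation)
```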

Managerial Implications and Learning
Managerial Implications:
In course of the analysis and findings we can recommend the below managerial insights:
1.      Targeting potential customers based on their online behaviour
·        Customers' browsing data (cookie information) and their past activity on other websites
·        Designing tactics based on the user information received
·        We can leverage the variables X33 (whether the user has booked at any site up to this point) and X25 (percentage of total bookings made at this site) to support these insights.

2.      Using the variable that indicates a user has spent substantial time on our website without completing a booking, we can apply remarketing: if the user is trying to book a ticket on a competitor's website, we can display our ticket prices (with a discount) and try to bring the customer back to our website.

3.      Going one step further, we can combine age and gender with the time the user spends on the site. Once we have predicted that a user is a potential customer, we can apply demographic targeting: segmenting people by important variables such as age and gender and targeting them with suitable discounts and coupons to encourage them to book with our site.
Learnings:
1.      The StatExplore node helped us look for missing class and interval values, skewness, outliers, and inconsistent data.
2.      A Random Forest generates multiple decision trees, and it is generally better to reduce data skewness first to get better results.
3.      Changing the maximum number of leaf nodes in the decision tree configuration helps overcome overfitting.
4.      A combination of boosting and bagging can serve as a plausible substitute for Random Forest.

