Project 1
Data and Model Analytics using SAS Enterprise Miner
MIS 6334 Advanced BA with SAS
Group 11
Chaithanya Sai Srinivas Sabnaveesu – cxs163030
Ritesh Kunisseri Puliyakote – rxk166230
Sindhuja Masiragani – sxm164031
Tejasvi Ramadas Sagar – txs161330
Vinod Prem Kumar – vmp160030
Table of Contents
Objective
Part I. Data Pre-processing
1. Edit Variables
2. StatExplore
3. Dataset Partition
Part II. Building Decision Trees
1. Automatically pruned decision tree
2. Decision Tree using Logworth (Interactive decision tree)
Part III. Building Neural Networks and a Regression Model
1. Imputation
2. Transform Variables
3. Regression Model
4. Ensemble Model
5. Neural Network
6. Auto Neural Network
Part IV. Model Comparison and Champion Model Selection
Part V. Improving Model Performance
Managerial Implications and Learning
Objective
In this project we worked on the Expedia dataset and built multiple predictive models. The objective of this project is to find the best-fitting model, along with managerially relevant findings, to predict a customer's decision to book a flight.
Part I. Basic Data Pre-processing
Edit Variables:
While adding the data source, in the Advanced Data Source Wizard we changed the role of depend to Target, and X12, X32, and X38 were changed to Rejected. We also changed the level of X1, X6, X15, X33, and X37 to Binary, and of X3, X5, and X7 to Ordinal.
StatExplore:
We explored the raw dataset further and observed the following (a Base SAS sketch of the equivalent checks follows):
· The variables X32 and X38 have more than half of their values missing in the raw data, so these variables were rejected.
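Although StatExplore is a GUI node, the same checks can be reproduced in Base SAS. A minimal sketch, assuming the raw data sit in a dataset named EXPEDIA with interval inputs X2 and X4 and class inputs X1 and X6 (these variable names are illustrative, not the full input list):

proc means data=expedia n nmiss mean std skewness;
   var x2 x4;                /* interval inputs: missing counts and skewness */
run;

proc freq data=expedia;
   tables x1 x6 / missing;   /* class inputs: levels, with missing shown */
run;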
Interval Variable Summary Statistics:
Class Variable Summary Statistics:
Data Partition:
Training: 55% of the data was allocated for training.
Validation: 45% of the data was allocated for validation.
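The Data Partition node performs this split inside Enterprise Miner; a rough Base SAS equivalent is sketched below, assuming the dataset EXPEDIA and the binary target DEPEND (names and seed are illustrative):

proc sort data=expedia; by depend; run;

/* Stratified 55% training sample; the rest becomes validation */
proc surveyselect data=expedia out=expedia_split samprate=0.55
                  outall seed=12345;
   strata depend;   /* preserve the target's class proportions */
run;

data train valid;
   set expedia_split;
   if selected then output train;   /* SELECTED flag comes from OUTALL */
   else output valid;
run;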
Part II. Building Decision Trees
Automatically pruned Decision Tree:
Pruning parameters (see the sketch after this list):
· Maximum Depth: 10
· Leaf Size: 2
· Number of Surrogate Rules: 2
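The Decision Tree node is configured through its properties panel, but PROC HPSPLIT accepts analogous settings; a minimal sketch with the same depth and leaf size (input names are illustrative, and surrogate rules remain a node property with no direct counterpart here):

proc hpsplit data=train maxdepth=10 minleafsize=2;
   class depend x1 x6;
   model depend = x1 x6 x2 x4;
   grow entropy;            /* split-search criterion */
   prune costcomplexity;    /* automatic pruning */
run;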
Results:
Misclassification Rate:
Train: 0.05131
Validation: 0.09602
Fit Statistics:
Tree Diagram:
Interactive Decision Tree:
Considering the logworth values of the candidate splits, we pruned the tree explicitly and ran the model, keeping the remaining node settings the same as in the decision tree model above.
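For reference, Enterprise Miner computes a split's logworth from the p-value of its significance test:

logworth = -log10(p-value)

so a logworth of 3 corresponds to p = 0.001, and splits with higher logworth are preferred when growing the tree interactively.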
Pruning parameters:
· Maximum Depth: 4
· Leaf Size: 5
· Number of Surrogate Rules: 4
Results:
Misclassification Rate:
Train: 0.12128
Validation: 0.13656
Fit Statistics:
Tree Diagram:
Part III. Building Neural Networks and a Regression Model
Imputation:
We used the Impute node to impute the missing values (a sketch of the equivalent Base SAS logic follows the list).
· For class variables, we replaced the missing values, which appear in the dataset as a dot (.).
· For interval variables, we imputed the missing values with the mean.
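The Impute node generates logic similar to the following Base SAS sketch (dataset and variable names are illustrative; treating missing class values as a level of their own is one reasonable reading of the setting described above):

/* Interval inputs: replace missing values with the mean */
proc stdize data=train out=train_imp reponly method=mean;
   var x2 x4;
run;

/* Class inputs: give missing values an explicit level of their own */
data train_imp;
   set train_imp;
   if x1 = . then x1 = -1;   /* assumes a numeric class input; -1 marks "missing" */
run;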
Transform Variables:
After partitioning the data, the next step in data pre-processing is transforming the input variables to reduce the skewness in the data. In the Transform Variables node, we applied a log10 transformation to the interval variables whose skewness is close to 5 or greater.
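The node's log10 transform corresponds to a one-line DATA step. A sketch, assuming X2 is one of the right-skewed, non-negative interval inputs (the +1 offset avoids log10(0)):

data train_tf;
   set train_imp;
   x2_log10 = log10(x2 + 1);   /* compress the long right tail */
run;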
Regression Model:
Since the target variable in our dataset is binary, we used logistic regression.
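The Regression node's binary logistic model corresponds to PROC LOGISTIC in Base SAS; a sketch with illustrative input names:

proc logistic data=train_tf;
   class x1 x6 / param=ref;                   /* class inputs, reference coding */
   model depend(event='1') = x1 x6 x2_log10 x4;
   score data=valid out=valid_scored;         /* score the validation partition */
run;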
Results:
Misclassification Rate:
Train: 0.12070
Validation: 0.13514
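These rates can be verified from the confusion matrix in Part IV, since the misclassification rate is (false negatives + false positives) / total observations:

Train: (186 + 21) / (186 + 1466 + 21 + 42) = 207 / 1715 ≈ 0.12070
Validation: (164 + 26) / (164 + 1191 + 26 + 25) = 190 / 1406 ≈ 0.13514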
Fit Statistics:
Ensemble Model:
We designed an ensemble model by combining bagging and boosting with the optimal decision tree, adding Start Groups and End Groups nodes around the Decision Tree node (see the sketch below).
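Under the hood, the Start Groups node (Mode = Bagging or Boosting) resamples the training data and fits one tree per group, and the Ensemble node combines the member models, for example by averaging their posterior probabilities. A minimal sketch of that averaging step, assuming three row-aligned scored datasets SCORED1-SCORED3 that each carry a posterior P_DEPEND1 (all names hypothetical):

data ensemble_scored;
   merge scored1(rename=(p_depend1=p1))
         scored2(rename=(p_depend1=p2))
         scored3(rename=(p_depend1=p3));   /* assumes identical row order */
   p_ensemble = mean(p1, p2, p3);          /* average the member posteriors */
   i_depend   = (p_ensemble >= 0.5);       /* default 0.5 decision threshold */
run;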
Fit Statistics:
Results:
Misclassification Rate:
Train: 0.05131
Validation: 0.09602
Neural Network:
Fit Statistics:
Results:
Misclassification Rate:
Train: 0.13003
Validation: 0.13869
Auto Neural Network:
Fit Statistics:
Results:
Misclassification Rate:
Train: 0.13294
Validation: 0.13442
Part IV. Model Comparison and Champion Model Selection
We assigned a cost of 5 for misclassifying a 1 as a 0 and a cost of 1 for misclassifying a 0 as a 1.
Decision Function and Weights:
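With these weights, the decision function classifies a customer as a booker whenever the expected cost of predicting 0 exceeds that of predicting 1; a quick check of where the threshold lands:

Predict 1 when 5 × P(depend = 1) > 1 × P(depend = 0) = 1 − P(depend = 1),
i.e. when P(depend = 1) > 1/6 ≈ 0.167.

So the cost matrix effectively lowers the decision threshold from 0.5 to about 0.17, trading extra false positives for fewer, more costly false negatives.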
Models:
Fit Statistics:
ROC Curve:
Champion Model: Decision Tree
Part V. Improving Model Performance
With the Decision Tree as the champion model, we decided to improve it with an ensemble method, i.e., ensembling the Decision Tree using bagging and boosting. We used several variations of the Interactive Decision Tree, Logistic Regression, Neural Network, Support Vector Machine, and Auto Neural Network models to compare their misclassification rates.
Model Node | Model Description | Data Role | Target | False Negative | True Negative | False Positive | True Positive
Reg | Regression | TRAIN | depend | 186 | 1466 | 21 | 42
Reg | Regression | VALIDATE | depend | 164 | 1191 | 26 | 25
Neural | Neural Network | TRAIN | depend | 223 | 1487 | . | 5
Neural | Neural Network | VALIDATE | depend | 188 | 1210 | 7 | 1
Tree | Decision Tree | TRAIN | depend | 150 | 1477 | 10 | 78
Tree | Decision Tree | VALIDATE | depend | 141 | 1182 | 35 | 48
Tree3 | Interactive Decision Tree | TRAIN | depend | 202 | 1481 | 6 | 26
Tree3 | Interactive Decision Tree | VALIDATE | depend | 170 | 1195 | 22 | 19
Ensmbl | Ensemble | TRAIN | depend | 87 | 1486 | 1 | 141
Ensmbl | Ensemble | VALIDATE | depend | 128 | 1210 | 7 | 61
AutoNeural | AutoNeural | TRAIN | depend | 228 | 1487 | 0 | 0
AutoNeural | AutoNeural | VALIDATE | depend | 189 | 1217 | 0 | 0
HPSVM | HP SVM | TRAIN | depend | 206 | 1457 | 30 | 22
HPSVM | HP SVM | VALIDATE | depend | 173 | 1190 | 27 | 16
Results:
Misclassification Rate:
Train: 0.05131
Validation: 0.09602
Managerial Implications and Learning
Managerial Implications:
In the course of our analysis and findings, we can recommend the following managerial insights:
1. Targeting potential customers based on their online behaviour
· Use customers' browsing data (cookie information) and their past activity on other websites.
· Design tactics based on the user information that is received.
· We can leverage the following variables to support our insights: X33 (whether the user has booked at any site up to this point) and X25 (% of total bookings that are to this site).
2. Using the variable that signifies whether the user has spent substantial time on our website without proceeding to book, we can apply this information to remarketing: if the user is trying to book the ticket on a competitor's website, we can flash our ticket price (with a discount) and try to bring the customer back to our website.
3. Going one step further, we can combine age and gender with the time the user spends on the site. If we have predicted that the user is a potential customer, we can apply demographic targeting: segmenting people by important variables such as age and gender and targeting them with suitable discounts and coupons to encourage them to book on our site.
Learnings:
1. The StatExplore node helped us look for missing class and interval values, skewness in the data, outliers, and inconsistent data.
2. Multiple decision trees are generated in a Random Forest, and it is always better to reduce the data's skewness to get better results.
3. Changing the maximum number of leaf nodes in the Decision Tree node's configuration helps us overcome overfitting.
4. We can use a combination of boosting and bagging as a plausible substitute for a Random Forest.