GEO 6166
Multivariate Statistics
Assignment 4 Part A (Classification Trees)
For this part of the assignment, you will submit:
An edited version of this document responding where prompted. When submitting this edited
document, add your initials at the end of the name (e.g. Assignment4A_SR.docx). When asked to write
something in this document, use italic font so it stands out.
The second will be a script file that contains the code you use as you progress thru the tasks. Name
your script file similarly (e.g. Assignment4A_SR.R).
You have four data frames for this part of the assignment: OJTrain, OJTrain01, OJTest, and OJTest01. They
have been derived from the same original data set that comprised 1070 observations, and represent training
data sets of 800 observations and test data sets of 270 observations. Note that OJTrain and OJTrain01 are
identical except that the response variable is coded as a factor in OJTrain and is in column 1 of the data frame,
whereas in OJTrain01 the response variable is coded as a 0/1 integer and is the last column of the data frame.
The same difference applies to OJTest and OJTest01.
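The two codings of the response described above differ only in type and column position. A minimal sketch of the conversion on a toy factor follows. Which brand maps to 1 is an assumption here (the worksheet does not say), and the toy vector is hypothetical.

```r
# Toy response vector; stands in for the factor column 'brand' in OJTrain
brand <- factor(c("CH", "MM", "CH", "MM", "MM"))

# ASSUMPTION: MM is coded 1 and CH is coded 0; check OJTrain01 for the actual coding
Brand <- as.integer(brand == "MM")

# Cross-tabulate to confirm the two codings line up
table(brand, Brand)
```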
The data sets relate to individual purchases of two brands of orange juice:
Brand of Orange Juice Purchases (F = factor)

Variable     Description
brand (F)    Response. CH: Citrus Hill, MM: Minute Maid
             Predictors (14)
PCH          Price of CH ($/c)
PMM          Price of MM ($/c)
DCH          Discount for CH ($/c)
DMM          Discount for MM ($/c)
SCH (F)      Was CH on Special (1 Yes, 0 No)
SMM (F)      Was MM on Special (1 Yes, 0 No)
LCH          Measure of Customer Brand Loyalty to CH
SPMM         Sale Price of MM ($/c)
SPCH         Sale Price of CH ($/c)
PD           Price Difference
%DMM         % Discount for MM
%DCH         % Discount for CH
LPD          List Price Difference
STR (F)      One of 5 Stores (0,1,2,3,4)
Begin by producing a classification tree object (name it OJtree) by estimating a classification tree for the
OJTrain data set and using just the defaults for this function in terms of controlling the depth of the tree. Plot
the classification tree object and add text to the plot. Copy/Paste the plot below.
1. Response
As you look at the plot you should see 8 terminal nodes (or leaves). If we mentally number them 1 to 8 looking
left to right, provide a description below of what the observations in node number 6 (a CH node) represent in
terms of the predictor variables.
2. Response
Create an object, dfOJtree, from retrieving the data frame from your OJtree object. Copy and paste that data
frame below.
3. Response
Looking at the data frame, notice that the last column gives you a probability. In this case this is the probability
of the response being CH. Using the tree plot and again mentally numbering the terminal nodes from 1 to 8
looking left to right, figure out which of the terminal nodes is the purest and which one is the least pure, and
respond with their numbers below.
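One way to rank leaves by purity is to take, for each leaf, the larger of the two class probabilities. The probabilities below are made up for illustration and are not the values from OJtree.

```r
# HYPOTHETICAL P(CH) for terminal nodes 1 to 8 (not the worksheet's values)
p_ch <- c(0.05, 0.40, 0.70, 0.55, 0.20, 0.85, 0.90, 0.98)

# Purity of a two-class leaf: the probability of its majority class
purity <- pmax(p_ch, 1 - p_ch)

purest     <- which.max(purity)  # leaf closest to probability 0 or 1
least_pure <- which.min(purity)  # leaf closest to probability 0.5
c(purest = purest, least_pure = least_pure)
```

A leaf with P(CH) near 0 is as pure as one with P(CH) near 1; it is simply a pure MM leaf.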
4. Response
From the data frame you will also notice that one column informs how many observations there are at each
split or at each terminal node. Which terminal node has the lowest number of observations, and how many?
5. Response
Using a summary of your OJtree object, report the misclassification error rate (MER) for this tree, as a %,
in the appropriate cell of Table 1 below. (Other cells will be filled in as you work thru this
assignment).
Table 1
                                Training Data    Training Data
Tree Object    Size of Tree     MER (%)          OOB (%)
OJtree         8                                 NA
OJtree2        14                                NA
OJdeep         75                                NA
OJtreeT6       6                                 NA
OJtreeT7       7                                 NA
OJbag          NA
OJrf1          NA
OJrf2          NA
OJrf3          NA
The depth (or complexity) of the tree that is grown using the tree() function is governed by its control
argument, which in turn uses the function tree.control(). By default, and using our OJTrain data, these would be as
follows:
control = tree.control(nobs = 800, mincut = 5, minsize = 10, mindev = 0.01)
The argument minsize is the smallest number of observations that can be in a node. If you look at the data
frame you posted above then you will notice that all our terminal nodes were above this number, so the
OJtree did not stop growing because it hit this limit. In fact, it stopped because it hit the threshold for
improvement in deviance given by the argument mindev. This is set, by default, to 1/100th of the null (or root)
deviance. Now you will create another classification tree object, name it OJtree2, by setting this mindev value
lower:
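The worksheet's own code is not reproduced in this copy. As a rough sketch of what lowering mindev does, the example below fits a default tree and a lower-mindev tree on the built-in iris data, which is only a stand-in for OJTrain. The value 0.001 and the helper n_leaves() are illustrative assumptions, and the tree package is assumed to be installed.

```r
library(tree)  # assumes the 'tree' package is installed

# iris is only a stand-in for OJTrain, which is not included here
default_fit <- tree(Species ~ ., data = iris)

# Same model with a lower mindev (0.001 is an illustrative value, not the worksheet's)
lower_fit <- tree(Species ~ ., data = iris,
                  control = tree.control(nobs = nrow(iris), mindev = 0.001))

# Hypothetical helper: count terminal nodes, i.e. frame rows whose 'var' is "<leaf>"
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

c(default = n_leaves(default_fit), lower = n_leaves(lower_fit))
```

Because mincut and minsize are unchanged, lowering mindev can only allow more splits, so the second tree has at least as many leaves as the first.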
Now we'll use the prune.tree() function on this object to see how deviance changes with different sizes of tree,
and we'll plot that change. Execute the code below and then copy/paste your plot below it.
6. Response
You will see a plot where the change in deviance as size of tree grows becomes smaller and smaller. A red line
for the size of tree (8) that came from using the default setting of mindev=0.01 (i.e. your original OJtree
object) is plotted too. The null or root deviance is 1062, so 0.01 * 1062 would be 10.62. So our original tree
(OJtree) stopped at 8 terminal nodes because the change in deviance was less than 10.62. To see this, in
numbers, return the pt object in the console and copy/paste its output below:
7. Response
Look at the $dev component of the pt object. You'll notice the null deviance is the last number. If you count
back 8 places (i.e. 8 splits) from that number you'll reach the value 578.79. Notice that the value below that is
570.64. Now you can see why our original tree stopped at 8 terminal nodes: the extra decrease in
deviance to go further (578.79-570.64) was less than 10.62. By the way, if you look at the $k component of the
pt object you will notice it reports these marginal decreases in the deviance for you.
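The arithmetic quoted above can be checked directly. The sketch below uses only the numbers given in the text; the variable names are hypothetical.

```r
root_dev <- 1062   # null (root) deviance quoted in the text
mindev   <- 0.01   # default mindev

# Threshold as described in the text: 1/100th of the root deviance
threshold <- mindev * root_dev

dev_at_8 <- 578.79  # deviance quoted for the 8-leaf tree
dev_next <- 570.64  # the next value in pt$dev

# Marginal decrease in deviance from growing one step further
gain <- dev_at_8 - dev_next

c(threshold = threshold, gain = gain)
gain < threshold  # the extra split does not clear the threshold
```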
Before moving on, do a summary of the OJtree2 object and report its misclassification error rate in the
appropriate cell of Table 1 above.
However, instead of using a rather arbitrary threshold value for change in deviance, we can use cross-validation
methods to help determine the size of tree. To do this we first generate a deep tree. One way to do
this is to set the mindev argument to 0 so that the tree only stops splitting when the size of the nodes
produced would be less than the argument minsize (by default this is 10, see above).
Create a deep tree object based on the guideline above, using the OJTrain data, and naming it OJdeep. Do a
summary of this object and report its misclassification error rate in Table 1 above.
Extract the data frame from this object and look at it. Do all the leaves now have 10 or fewer observations? If
no, then explain why.
8. Response
Plot your deep tree (without text!) and copy and paste it below.
9. Response
Now conduct a cross-validation analysis based on the deep tree that will estimate the best size of tree. Rather
than just do this once though, perform the cross-validation analysis 1000 times and output a summary of how
many times, from the 1000 runs, different sizes of trees were deemed the best. Report the result below. (Note:
the 1000 cross-validation runs may take some time on your computer, so take a coffee break!)
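A minimal sketch of a repeated cross-validation tally is shown below, under several assumptions: iris stands in for OJTrain, only 20 runs are used instead of 1000 to keep it quick, cross-validation is scored by misclassification via prune.misclass (the worksheet may intend deviance instead), and ties are broken in favour of the smallest tree.

```r
library(tree)  # assumes the 'tree' package is installed

set.seed(1)

# Deep tree: mindev = 0, so growth stops only on the node-size limits
deep_fit <- tree(Species ~ ., data = iris,
                 control = tree.control(nobs = nrow(iris), mindev = 0))

n_runs <- 20                 # the worksheet asks for 1000
best_size <- integer(n_runs)

for (i in seq_len(n_runs)) {
  cv <- cv.tree(deep_fit, FUN = prune.misclass)  # default 10-fold CV
  # Smallest tree size among those with the lowest CV error
  best_size[i] <- min(cv$size[cv$dev == min(cv$dev)])
}

# How often each tree size was judged best
table(best_size)
```

Each call to cv.tree() draws new random folds, which is why the preferred size can change from run to run.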
10. Response
So, if you feel like a longer break, repeat the 1000 runs a few more times!
My best guess is that you'll likely find that a tree of size 6 or 7 splits is what is deemed the best based on your
cross-validation analysis.
Assuming you found 6 or 7 splits as the most prevalent (and even if not, though unlikely), prune your original
deep tree to each of these two sizes to produce two new objects, naming them OJtreeT6 and OJtreeT7. Plot
them, adding text and copy/paste them below.
11. Response
Perform a summary on the size 6 and size 7 trees and report their misclassification error rates in Table 1.
Now, again using the data OJTrain, estimate a bagged classification tree object, named OJbag, based on 500
trees.
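Bagging can be expressed as a random forest in which every predictor is a candidate at every split. The sketch below assumes the randomForest package is installed and uses iris as a stand-in for OJTrain; the object names are hypothetical.

```r
library(randomForest)  # assumes the 'randomForest' package is installed

set.seed(1)

p <- ncol(iris) - 1  # number of predictors (all columns except the response)

# Bagging: mtry equals the full number of predictors
bag_fit <- randomForest(Species ~ ., data = iris,
                        mtry = p, ntree = 500, importance = TRUE)

# OOB error rate (%) after the final tree
oob_pct <- 100 * bag_fit$err.rate[bag_fit$ntree, "OOB"]
oob_pct

# Importance table: accuracy-based and node-purity (Gini) based measures
importance(bag_fit)
```

varImpPlot(bag_fit) draws the corresponding importance plots.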
Then use the following code:
From the output of this, calculate the misclassification error rate and enter it into Table 1. You should notice
that the error rate is very low: you are essentially taking the majority vote on each observation over 500
trees, so this is not too surprising. However, it's not the fit to the training data we are concerned about; it's the
test error rate that is important.
One way to assess the test error rate is to use the out-of-bag (OOB) error rate. Enter the value of this for the
object OJbag into the fourth column of Table 1.
Copy/paste the table of predictor importance below, and then, below that, the plots of predictor importance.
What are the top 4 predictors in terms of a) producing more accurate predictions, and b) reproducing node
purity?
12. Response
You will now produce a random forest object, named OJrf1, for the training data set OJTrain based on 500
trees. In terms of the number of predictors to use at each split, base this on the rule-of-thumb guide typically
used for this type of tree, rounding up to the nearest integer.
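For classification forests the usual rule of thumb is the square root of the number of predictors. A quick sketch of that calculation with the 14 predictors listed earlier, rounding up as instructed (variable names are hypothetical):

```r
p <- 14                        # number of predictors in the OJ data

mtry_rule <- ceiling(sqrt(p))  # sqrt(14) is about 3.74, rounded up
mtry_rule

mtry_bag <- p                  # for comparison, bagging considers all predictors
```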
Then use the following code:
From the output of this, calculate the misclassification error rate and enter into Table 1. You should notice
that the error rate is again quite low. Compare this error rate to the similar one produced from the bagged
model and comment below on the difference and why it exists.
13. Response
However, again, it's not the fit to the training data we are concerned about; it's the test error rate that is
important, and one way to estimate this, as above, is thru the OOB error rate. Enter this for the object OJrf1 in
Table 1. Compare the OOB error rate for the bagged and random forest trees and comment below as to which
gives a better test fit.
14. Response
Rather than just go with the rule-of-thumb suggestion of the number of predictors to consider at each split,
we can use the tuneRF() function to give us an estimate of the best value to use. Run this function ten times
for the training data OJTrain (using 500 trees per try) and note which value gives you the lowest OOB error
rate the most often.
My best guess is that the value returned above was 3. So now estimate a new random forest object, OJrf2,
that uses this value and then complete columns 3 and 4 of Table 1 for this object. Comment below on how the
OJrf1 and OJrf2 models compare on the MER and the OOB error rate.
15. Response
We can also investigate the best size of tree to fit using cross-validation. Perform a random tree
cross-validation procedure using the OJTrain training data set, producing a plot. Repeat this 10 times. You'll
probably notice that tree sizes of 4 and 7 frequently give inflection points in the plots.
So, now estimate a new random forest object, OJrf3, that uses the value 7 for the number of predictors to use
at each split. Complete columns 3 and 4 of Table 1 for this object.
An alternative to bootstrap-based methods for tree models is to use boosting. You will use the gbm() function
from the gbm library for this, but this function requires that the response variable be coded as an integer. So,
use the OJTrain01 data for this instead of OJTrain. (Note that the response variable is now Brand instead of
brand and is positioned in the last column of the data frame.) Run the following code to create two boosted
models, OJbst1 and OJbst2:
Finally, you will compare the various models that you have estimated by assessing their performance on
separate test data sets. Your results can be summarized in Table 2 below. You will calculate the
misclassification error rate directly by comparing predicted values of the response to the observed.
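The direct calculation described above amounts to cross-tabulating observed against predicted classes and taking the share that disagree. A minimal sketch with made-up labels follows; mer_pct() is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: misclassification error rate as a percentage
mer_pct <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  100 * mean(as.character(observed) != as.character(predicted))
}

# Made-up labels for ten test observations (not from OJTest)
observed  <- factor(c("CH", "CH", "MM", "MM", "CH", "MM", "CH", "CH", "MM", "CH"))
predicted <- factor(c("CH", "MM", "MM", "MM", "CH", "CH", "CH", "CH", "MM", "CH"))

# Confusion table: off-diagonal cells are the misclassified observations
table(predicted = predicted, observed = observed)

mer_pct(observed, predicted)  # 2 of 10 disagree
```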
Table 2
               Test Data (OJTest)
Tree Object    MER (%)
OJtree
OJtree2
OJdeep
OJtreeT6
OJtreeT7
OJbag
OJrf1
OJrf2
OJrf3
OJbst1
OJbst2
For the first 9 of your tree objects in Table 2, the form of code you will need is as follows:
Once you have used the resultant table to calculate the misclassification error rate, you can replace the tree
object that is the first argument to the predict() function and repeat.
For the last 2 tree objects (the boosted models), the code you will need is as follows:
Again, just change the first argument in the predict() function to use it for the second tree object.
Discuss the results of fitting these different trees to the test data sets.
16. Response
End of Assignment 4 Part A
Assignment 4 Part B
For this part of the assignment, you will submit:
An edited version of this document responding where prompted. When submitting this edited
document, add your initials at the end of the name (e.g. Assignment4B_SR.docx). When asked to write
something in this document, use italic font so it stands out.
The second will be a script file that contains the code you use as you progress thru the tasks. Name
your script file similarly (e.g. Assignment4B_SR.R).
You have two data frames for this part of the assignment: saltrain and saltest. One is a data set for training
your models (n=200), the other is a test data set (n=63) for testing them. This data is about the salaries of
baseball hitters just for Nathan (!) – and the variables for this data are described below:
Variable     Description
Salary       Response. Logarithm of annual salary on opening day, 1987, in US dollars
             Predictors (18)
AtBat        Number of times at bat in 1986
Hits         Number of hits in 1986
HmRun        Number of home runs in 1986
Runs         Number of runs in 1986
RBI          Number of runs batted in in 1986
Walks        Number of walks in 1986
Years        Number of years in the major leagues
CAtBat       Career at bats
CHits        Career hits
CHmRun       Career home runs
CRuns        Career runs
CRBI         Career runs batted in
CWalks       Career walks
League       National or American League
Division     East or West Division
PutOuts      Number of put outs in 1986
Assists      Number of assists in 1986
Errors       Number of errors in 1986
Begin by producing a regression tree object (name it Stree) by estimating a regression tree for the training
data set and using just the defaults for this function in terms of controlling the depth of the tree. Plot the
regression tree object and add text to the plot. Copy/Paste the plot below.
17. Response
As you look at the plot you should see 9 terminal nodes (or leaves).
The summary on the Stree object reports the residual mean deviance (RMD) for this tree. However, to be
consistent in the type of error being used, we can derive the MSE (mean of the squared residuals) directly. To
do this, use the code below:
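The worksheet's own code is not reproduced in this copy. A generic sketch of the calculation is given below with made-up numbers; mse() is a hypothetical helper. The second calculation reflects the usual definition of residual mean deviance for a regression tree, which divides by n minus the number of leaves rather than by n.

```r
# Hypothetical helper: mean of the squared residuals
mse <- function(observed, fitted) mean((observed - fitted)^2)

# Made-up log-salary values and fitted values (not from saltrain)
observed <- c(5.2, 6.1, 4.8, 6.9)
fitted   <- c(5.0, 6.4, 5.1, 6.5)

train_mse <- mse(observed, fitted)
round(train_mse, 3)

# For contrast, with an assumed 2 leaves the residual mean deviance would be
n_leaves <- 2
rmd <- sum((observed - fitted)^2) / (length(observed) - n_leaves)
rmd
```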
Enter this value into the appropriate cell of Table 1 below. (Other cells will be filled-in as you work thru this
assignment).
Table 1
               Size of Tree or    Training Data       Training Data
Tree Object    No. Iterations     MSE (3 decimals)    OOB
Stree          9                                      NA
Sdeep                                                 NA
StreeCV                                               NA
Sbag           NA
Srf1           NA
Srf2           NA
Srf3           NA
Sbst1                                                 NA
Sbst2                                                 NA
Sbst3                                                 NA
Sbst4                                                 NA
Sbst5                                                 NA
However, rather than using the default settings for controlling the depth of tree in the tree() function, we can
use cross-validation methods to help determine the optimum size of tree. To do this we first generate a deep
tree. One way to do this is to set the mindev argument of tree.control() (the function supplied to tree()'s
control argument) to 0, so that the tree only stops splitting when other default settings are reached at a potential split.
Create a deep tree object based on the guideline above, using the saltrain training data, and naming it Sdeep.
Report its size of tree and MSE in Table 1 above. Also report its size in the second column of Table 3.
Plot your deep tree (without text!) and copy and paste it below.
18. Response
Now conduct a cross-validation analysis based on the deep tree that will estimate the best size of tree. Rather
than just do this once though, perform the cross-validation analysis 1000 times and output a summary of how
many times, from the 1000 runs, different sizes of trees were deemed the best. Report the result below using a
copy/paste. (Note: the 1000 cross-validation runs may take some time on your computer.)
19. Response
In this case, the likely result is that the best fitting size of tree according to cross-validation is, in fact, the deep
tree. However, take the next most suggested size of tree and create a new tree object, name it StreeCV, based
on that size. Plot this new tree below (with text) and also report its size and MSE in Table 1, and its size in
Table 3.
20. Response
You will now create an object, Sbag, that is a bagged model for the saltrain data, using the default number of
trees (500). To obtain its MSE on the training data you can use the code below:
Enter this value into the third column of Table 1. However, it's not the fit to the training data we are
concerned about; it's the test error that is important. One way to assess the test error is to use the out-of-bag
(OOB) errors. Based on the Sbag object, report the MSE that is based on the OOB errors in the fourth column
of Table 1.
Copy/paste the table of predictor importance based on the Sbag object below, and then, below that, the plots
of predictor importance. What are the top 4 predictors in terms of a) producing more accurate predictions,
and b) producing node purity?
21. Response
Explain what node purity means in the context of a regression tree:
22. Response
You will now produce a random forest object, named Srf1, for the training data set saltrain based on 500
trees. In terms of the number of predictors to use at each split, base this on the rule-of-thumb guide typically
used for this type of tree. Copy/paste the plots of predictor importance for the Srf1 object below and then
compare and discuss with the same plots based on the bagged tree model above.
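For regression forests the usual rule of thumb is about one third of the predictors. A quick sketch with the 18 predictors listed above (variable names are hypothetical):

```r
p <- 18                      # number of predictors in the salary data

mtry_rule <- floor(p / 3)    # 18 / 3 divides exactly, so no rounding is needed
mtry_rule

mtry_bag <- p                # for comparison, the bagged model considers all 18
```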
23. Response
Using the code below, enter the MSE of the training data for the Srf1 object into Table 1. Also enter the MSE
based on OOB errors into Table 1.
Rather than just go with the rule-of-thumb suggestion of the number of predictors to consider at each split,
we can use the tuneRF() function to give us an estimate of the best value to use. Run this function ten times
for the training data saltrain (using 500 trees per try) and note which value gives you the lowest OOB error
rate the most often.
My best guess is that the value returned above was 3. So now estimate a new random forest object, Srf2, that
uses this value and then complete columns 3 and 4 of Table 1 for this object.
We can also investigate the best size of random forest tree to fit using cross-validation. Perform a random tree
cross-validation procedure using the saltrain training data set, producing a plot (use the step=0.9 argument
when doing this). Repeat this 10 times. You'll probably notice that tree sizes of 8 or 9 often mark the tree sizes
where the cross-validation errors no longer decrease appreciably.
So, now estimate a new random forest object, Srf3, that uses the value 9 for the number of predictors to use at
each split. Complete columns 3 and 4 of Table 1 for this object.
You will now turn to estimating some boosted models. Create a boosted tree object, named Sbst1, using the
following code:
Notice how the shrinkage parameter is set using the variable sp. This object is generated using 1000 learning
trees of depth 4 splits, with a shrinkage parameter equal to 0.025, and using a bag fraction of 0.5 (default).
One way to estimate the optimum size of boosted tree is to look at the OOB squared errors (remembering
that at each iteration only half the data is being used). To do this use the code below:
This will produce 2 plots, so you will need to use the back arrow in RStudio's plot pane to see the first one.
These plots show you:
Plot 1: The decline in the squared error loss for the training data as the model learns (black line) and the
improvement in the squared error loss (blue line).
Plot 2: The change in the OOB squared error loss, with the vertical blue line indicating the suggested
optimum number of iterations for the boosted model.
You will also notice that when you produce these plots you also get a warning about how using the OOB
generally underestimates the optimal number of iterations and recommending cross-validation methods
instead.
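The boosted-model settings and the OOB check described above can be sketched as follows. This is not the worksheet's code, which is not reproduced in this copy: the data are simulated as a stand-in for saltrain, the object names are hypothetical, and the gbm package is assumed to be installed. Only the settings named in the text (1000 trees, depth 4, shrinkage 0.025 set via sp, bag fraction 0.5) are taken from the worksheet.

```r
library(gbm)  # assumes the 'gbm' package is installed

set.seed(1)

# Simulated regression data; a stand-in for saltrain
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 - toy$x2^2 + rnorm(n, sd = 0.5)

sp <- 0.025  # shrinkage parameter, as named in the text

bst <- gbm(y ~ ., data = toy, distribution = "gaussian",
           n.trees = 1000, interaction.depth = 4,
           shrinkage = sp, bag.fraction = 0.5)

# OOB-based estimate of the number of iterations (expect the note about
# OOB underestimating the optimum)
best_iter <- gbm.perf(bst, method = "OOB", plot.it = FALSE)
best_iter

# Training MSE using all 1000 trees
train_pred <- predict(bst, newdata = toy, n.trees = 1000)
train_mse  <- mean((toy$y - train_pred)^2)
train_mse
```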
Regar…