Big Data Analytics: Applying Regression Analysis to a Business Situation
Questions
Q1. After reading Chapter 6, consider how we can apply regression analysis to a business situation. Let's take company sales as an example. We know that the sales performance of products and services is affected by marketing, demand, the economy, reputation, and so on.
Does this mean that company leaders should apply statistical analysis to determine the right mix of components (i.e., marketing, demand, and so on) that affect sales performance?
In addition, discuss the importance of regression analysis in statistics.
Requirement:
200 – 300 words
Deadline: 1 day
Q2. For your midterm assignment, please make sure to read Chapter 6 of your textbook.
Please provide comprehensive responses to the following:
(a) In the use of a categorical variable with n possible values, explain the following:
1. Why only n - 1 binary variables are necessary
2. Why using n variables would be problematic
(b) Describe how logistic regression can be used as a classifier.
(c) Discuss how the ROC curve can be used to determine an appropriate threshold value for a classifier
(d) If the probability of an event occurring is 0.4, then
1. What is the odds ratio?
2. What is the log odds ratio?
Requirements:
– Typed in a Word document.
– Please write in APA style and include at least five (5) reputable sources.
– The complete paper should be between 800 and 1,000 words.
Deadline: 4 days.
Chapter 6 PDF attached.
Chapter 6
Advanced Analytical Theory and Methods: Regression
Key Concepts
1. Categorical Variable
2. Linear Regression
3. Logistic Regression
4. Ordinary Least Squares (OLS)
5. Receiver Operating Characteristic (ROC) Curve
6. Residuals
In general, regression analysis attempts to explain the influence that a set of variables has
on the outcome of another variable of interest. Often, the outcome variable is called a
dependent variable because the outcome depends on the other variables. These additional
variables are sometimes called the input variables or the independent variables.
Regression analysis is useful for answering the following kinds of questions:
What is a person's expected income?
What is the probability that an applicant will default on a loan?
Linear regression is a useful tool for answering the first question, and logistic regression is
a popular method for addressing the second. This chapter examines these two regression
techniques and explains when one technique is more appropriate than the other.
Regression analysis is a useful explanatory tool that can identify the input variables that
have the greatest statistical influence on the outcome. With such knowledge and insight,
environmental changes can be attempted to produce more favorable values of the input
variables. For example, if it is found that the reading level of 10-year-old students is an
excellent predictor of the students' success in high school and a factor in their attending
college, then additional emphasis on reading can be considered, implemented, and
evaluated to improve students' reading levels at a younger age.
6.1 Linear Regression
Linear regression is an analytical technique used to model the relationship between several
input variables and a continuous outcome variable. A key assumption is that the
relationship between an input variable and the outcome variable is linear. Although this
assumption may appear restrictive, it is often possible to properly transform the input or
outcome variables to achieve a linear relationship between the modified input and
outcome variables. Possible transformations will be covered in more detail later in the
chapter.
The physical sciences have well-known linear models, such as Ohm's Law, which states
that the electrical current flowing through a resistive circuit is linearly proportional to the
voltage applied to the circuit. Such a model is considered deterministic in the sense that if
the input values are known, the value of the outcome variable is precisely determined. A
linear regression model is a probabilistic one that accounts for the randomness that can
affect any particular outcome. Based on known input values, a linear regression model
provides the expected value of the outcome variable based on the values of the input
variables, but some uncertainty may remain in predicting any particular outcome. Thus,
linear regression models are useful in physical and social science applications where there
may be considerable variation in a particular outcome based on a given set of input values.
After presenting possible linear regression use cases, the foundations of linear regression
modeling are provided.
6.1.1 Use Cases
Linear regression is often used in business, government, and other scenarios. Some
common practical applications of linear regression in the real world include the following:
Real estate: A simple linear regression analysis can be used to model residential
home prices as a function of the home's living area. Such a model helps set or
evaluate the list price of a home on the market. The model could be further improved
by including other input variables such as number of bathrooms, number of
bedrooms, lot size, school district rankings, crime statistics, and property taxes.
Demand forecasting: Businesses and governments can use linear regression models
to predict demand for goods and services. For example, restaurant chains can
appropriately prepare for the predicted type and quantity of food that customers will
consume based upon the weather, the day of the week, whether an item is offered as a
special, the time of day, and the reservation volume. Similar models can be built to
predict retail sales, emergency room visits, and ambulance dispatches.
Medical: A linear regression model can be used to analyze the effect of a proposed
radiation treatment on reducing tumor sizes. Input variables might include duration of
a single radiation treatment, frequency of radiation treatment, and patient attributes
such as age or weight.
6.1.2 Model Description
As the name of this technique suggests, the linear regression model assumes that there is a
linear relationship between the input variables and the outcome variable. This relationship
can be expressed as shown in Equation 6.1.
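$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p-1} x_{p-1} + \varepsilon \qquad (6.1)$$
where:
1. $y$ is the outcome variable
2. $x_j$ are the input variables, for j = 1, 2, ..., p - 1
3. $\beta_0$ is the value of $y$ when each $x_j$ equals zero
4. $\beta_j$ is the change in $y$ based on a unit change in $x_j$, for j = 1, 2, ..., p - 1
5. $\varepsilon$ is a random error term that represents the difference between the linear model and a
particular observed value for $y$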
Suppose it is desired to build a linear regression model that estimates a person's annual
income as a function of two variables, age and education, both expressed in years. In
this case, income is the outcome variable, and the input variables are age and education.
Although it may be an overgeneralization, such a model seems intuitively correct in the
sense that people's income should increase as their skill set and experience expand with
age. Also, the employment opportunities and starting salaries would be expected to be
greater for those who have attained more education.
However, it is also obvious that there is considerable variation in income levels for a
group of people with identical ages and years of education. This variation is represented
by $\varepsilon$ in the model. So, in this example, the model would be expressed as shown in
Equation 6.2.
$$Income = \beta_0 + \beta_1 Age + \beta_2 Education + \varepsilon \qquad (6.2)$$
In the linear model, the $\beta_j$ represent the p unknown parameters. The estimates for these
unknown parameters are chosen so that, on average, the model provides a reasonable
estimate of a person's income based on age and education. In other words, the fitted model
should minimize the overall error between the linear model and the actual observations.
Ordinary Least Squares (OLS) is a common technique to estimate the parameters.
To illustrate how OLS works, suppose there is only one input variable, x, for an outcome
variable y. Furthermore, n observations of (x,y) are obtained and plotted in Figure 6.1.
Figure 6.1 Scatterplot of y versus x
The goal is to find the line that best approximates the relationship between the outcome
variable and the input variables. With OLS, the objective is to find the line through these
points that minimizes the sum of the squares of the difference between each point and the
line in the vertical direction. In other words, find the values of $\beta_0$ and $\beta_1$ such that the
summation shown in Equation 6.3 is minimized.
$$\sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2 \qquad (6.3)$$
The n individual distances to be squared and then summed are illustrated in Figure 6.2.
The vertical lines represent the distance between each observed y value and the line
$y = \beta_0 + \beta_1 x$.
Figure 6.2 Scatterplot of y versus x with vertical distances from the observed points to a
fitted line
In Figure 3.7 of Chapter 3, "Review of Basic Data Analytic Methods Using R," the
Anscombe's Quartet example used OLS to fit the linear regression line to each of the four
datasets. OLS for multiple input variables is a straightforward extension of the one input
variable case provided in Equation 6.3.
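As a rough illustration of the OLS idea for one input variable, the following R sketch compares the standard closed-form least squares estimates with the output of lm(). The data are simulated, and the true intercept, slope, and noise level are assumptions chosen only for illustration; the helper sse() is a hypothetical name for the sum in Equation 6.3.

```r
# Simulated data (assumed values for illustration; not the chapter's data).
set.seed(42)
n <- 50
x <- runif(n, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(n, mean = 0, sd = 1.5)

# Closed-form OLS estimates for a single input variable.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Sum of squared vertical distances between the points and a candidate line.
sse <- function(beta0, beta1) sum((y - (beta0 + beta1 * x))^2)

# lm() uses OLS, so its coefficients should match b0 and b1.
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))

# Shifting the fitted line away from the OLS solution increases the sum of squares.
print(sse(b0, b1) < sse(b0 + 0.1, b1))
```

Perturbing either estimate and recomputing sse() is a quick way to see that the fitted line is the minimizer.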
The preceding discussion provided the approach to find the best linear fit to a set of
observations. However, by making some additional assumptions on the error term, it is
possible to provide further capabilities in utilizing the linear regression model. In general,
these assumptions are almost always made, so the following model, built upon the earlier
described model, is simply called the linear regression model.
Linear Regression Model (with Normally Distributed Errors)
In the previous model description, there were no assumptions made about the error term;
no additional assumptions were necessary for OLS to provide estimates of the model
parameters. However, in most linear regression analyses, it is common to assume that the
error term is a normally distributed random variable with mean equal to zero and constant
variance. Thus, the linear regression model is expressed as shown in Equation 6.4.
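$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_{p-1} x_{p-1} + \varepsilon \qquad (6.4)$$
where:
1. $y$ is the outcome variable
2. $x_j$ are the input variables, for j = 1, 2, ..., p - 1
3. $\beta_0$ is the value of $y$ when each $x_j$ equals zero
4. $\beta_j$ is the change in $y$ based on a unit change in $x_j$, for j = 1, 2, ..., p - 1
5. $\varepsilon \sim N(0, \sigma^2)$ and the $\varepsilon$ are independent of each other
This additional assumption yields the following result about the expected value of y, E(y),
for given $(x_1, x_2, \ldots, x_{p-1})$:
$$E(y) = E(\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1} + \varepsilon) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1} + E(\varepsilon) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}$$
Because the $\beta_j$ and the given $x_j$ are constants, E(y) is the value of the linear regression model for the
given $(x_1, x_2, \ldots, x_{p-1})$. Furthermore, the variance of y, V(y), for given
$(x_1, x_2, \ldots, x_{p-1})$ is this:
$$V(y) = V(\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1} + \varepsilon) = 0 + V(\varepsilon) = \sigma^2$$
Thus, for a given $(x_1, x_2, \ldots, x_{p-1})$, y is normally distributed with mean
$\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}$ and variance $\sigma^2$. For a regression model with just one input variable, Figure 6.3 illustrates
the normality assumption on the error terms and the effect on the outcome variable, y, for
a given value of x.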
Figure 6.3 Normal distribution about y for a given value of x
For the illustrated value of x, one would expect to observe a value of y near 20, but a value of y from 15 to 25
would appear possible based on the illustrated normal distribution. Thus, the regression
model estimates the expected value of y for the given value of x. Additionally, the
normality assumption on the error term provides some useful properties that can be
utilized in performing hypothesis testing on the linear regression model and providing
confidence intervals on the parameters and the mean of y given $(x_1, x_2, \ldots, x_{p-1})$. The
application of these statistical techniques is demonstrated by applying R to the earlier
linear regression model on income.
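The results for E(y) and V(y) can be checked informally by simulation. The sketch below fixes a single input value and draws many normally distributed errors; the parameter values beta0, beta1, and sigma and the input value x0 are hypothetical, chosen only so that the expected value of y is 20.

```r
# Hypothetical parameter values (assumptions for illustration only).
set.seed(1)
beta0 <- 4
beta1 <- 2
sigma <- 2.5
x0 <- 8  # a fixed, assumed input value

# Draw many error terms and form the corresponding outcomes at x0.
eps <- rnorm(100000, mean = 0, sd = sigma)
y <- beta0 + beta1 * x0 + eps

# The sample mean should be close to beta0 + beta1 * x0,
# and the sample variance should be close to sigma^2.
print(mean(y))
print(var(y))
print(c(expected_mean = beta0 + beta1 * x0, expected_variance = sigma^2))
```

A histogram of y (for example, hist(y)) would show the bell shape centered on the regression line's value at x0.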
Example in R
Returning to the Income example, in addition to the variables age and education, the
person's gender, female or male, is considered an input variable. The following code reads
a comma-separated-value (CSV) file of 1,500 people's incomes, ages, years of education,
and gender. The first 10 rows are displayed:
income_input = as.data.frame( read.csv("c:/data/income.csv") )
income_input[1:10,]
ID Income Age Education Gender
1 1 113 69 12 1
2 2 91 52 18 0
3 3 121 65 14 0
4 4 81 58 12 0
5 5 68 31 16 1
6 6 92 51 15 1
7 7 75 53 15 0
8 8 76 56 13 0
9 9 56 42 15 1
10 10 53 33 11 1
Each person in the sample has been assigned an identification number, ID. Income is
expressed in thousands of dollars. (For example, 113 denotes $113,000.) As described
earlier, Age and Education are expressed in years. For Gender, a 0 denotes female and a 1
denotes male. A summary of the imported data reveals that the incomes vary from $14,000
to $134,000. The ages are between 18 and 70 years. The education experience for each
person varies from a minimum of 10 years to a maximum of 20 years.
summary(income_input)
       ID             Income            Age          Education
 Min.   :   1.0   Min.   : 14.00   Min.   :18.00   Min.   :10.00
 1st Qu.: 375.8   1st Qu.: 62.00   1st Qu.:30.00   1st Qu.:12.00
 Median : 750.5   Median : 76.00   Median :44.00   Median :15.00
 Mean   : 750.5   Mean   : 75.99   Mean   :43.58   Mean   :14.68
 3rd Qu.:1125.2   3rd Qu.: 91.00   3rd Qu.:57.00   3rd Qu.:16.00
 Max.   :1500.0   Max.   :134.00   Max.   :70.00   Max.   :20.00
     Gender
 Min.   :0.00
 1st Qu.:0.00
 Median :0.00
 Mean   :0.49
 3rd Qu.:1.00
 Max.   :1.00
As described in Chapter 3, a scatterplot matrix is an informative tool to view the pair-wise
relationships of the variables. The basic assumption of a linear regression model is that
there is a linear relationship between the outcome variable and the input variables. Using
the lattice package in R, the scatterplot matrix in Figure 6.4 is generated with the
following R code:
library(lattice)
splom(income_input[c(2:5)], groups=NULL, data=income_input,
axis.line.tck = 0,
axis.text.alpha = 0)
Figure 6.4 Scatterplot matrix of the variables
Because the dependent variable is typically plotted along the y-axis, examine the set of
scatterplots along the bottom of the matrix. A strong positive linear trend is observed for
Income as a function of Age. Against Education, a slight positive trend may exist, but the
trend is not quite as obvious as is the case with the Age variable. Lastly, there is no
observed effect on Income based on Gender.
With this qualitative understanding of the relationships between Income and the input
variables, it seems reasonable to quantitatively evaluate the linear relationships of these
variables. Utilizing the normality assumption applied to the error term, the proposed linear
regression model is shown in Equation 6.5.
$$Income = \beta_0 + \beta_1 Age + \beta_2 Education + \beta_3 Gender + \varepsilon \qquad (6.5)$$
Using the linear model function, lm(), in R, the income model can be applied to the data
as follows:
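results <- lm(Income ~ Age + Education + Gender, income_input)
summary(results)
[...]
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.26299    1.95575   3.714 0.000212 ***
Age          0.99520    0.02057  48.373  < 2e-16 ***
Education    1.75788    0.11581  15.179  < 2e-16 ***
Gender      -0.93433    0.62388  -1.498 0.134443
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1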
Residual standard error: 12.07 on 1496 degrees of freedom
Multiple R-squared: 0.6364, Adjusted R-squared: 0.6357
F-statistic: 873 on 3 and 1496 DF, p-value: < 2.2e-16
The intercept term, $\beta_0$, is implicitly included in the model. The lm() function performs the
parameter estimation for the parameters $\beta_j$ (j = 0, 1, 2, 3) using ordinary least squares and
provides several useful calculations and results that are stored in the variable called
results in this example.
After the stated call to lm(), a few statistics on the residuals are displayed in the output.
The residuals are the observed values of the error term for each of the n observations and
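are defined for i = 1, 2, ..., n, as shown in Equation 6.6.
$$e_i = y_i - (b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \ldots + b_{p-1} x_{i,p-1}) \qquad (6.6)$$
where $b_j$ denotes the estimate for parameter $\beta_j$ for j = 0, 1, 2, ..., p - 1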
From the R output, the residuals vary from approximately -37 to +37, with a median close
to 0. Recall that the residuals are assumed to be normally distributed with a mean near
zero and a constant variance. The normality assumption is examined more carefully later.
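The definition of the residuals can be verified directly in R. Because the income.csv file is not available here, the sketch below uses simulated data with column names and parameter values that are assumptions for illustration; it computes the residuals from the fitted coefficients and compares them with the values returned by residuals().

```r
# Simulated stand-in for the income data (all values are assumptions).
set.seed(7)
n <- 200
age <- runif(n, min = 18, max = 70)
education <- sample(10:20, n, replace = TRUE)
income <- 7 + 1.0 * age + 1.8 * education + rnorm(n, mean = 0, sd = 12)
sim <- data.frame(income, age, education)

fit <- lm(income ~ age + education, data = sim)
b <- coef(fit)  # b[1] = intercept, b[2] = age, b[3] = education

# Residual = observed value minus the fitted linear combination.
e_manual <- sim$income - (b[1] + b[2] * sim$age + b[3] * sim$education)

# The manual calculation should agree with R's built-in residuals.
print(all.equal(unname(e_manual), unname(residuals(fit))))

# Five-number summary plus mean, similar to what summary(fit) reports.
print(summary(residuals(fit)))
```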
The output provides details about the coefficients. The column Estimate provides the
OLS estimates of the coefficients in the fitted linear regression model. In general, the
(Intercept) corresponds to the estimated response variable when all the input variables
equal zero. In this example, the intercept corresponds to an estimated income of $7,263 for
a newborn female with no education. It is important to note that the available dataset does
not include such a person. The minimum age and education in the dataset are 18 and 10
years, respectively. Thus, misleading results may be obtained when using a linear
regression model to estimate outcomes for input values not representative within the
dataset used to train the model.
The coefficient for Age is approximately equal to one. This coefficient is interpreted as
follows: For every one unit increase in a person's age, the person's income is expected to
increase by $995. Similarly, for every unit increase in a person's years of education, the
person's income is expected to increase by about $1,758.
Interpreting the Gender coefficient is slightly different. When Gender is equal to zero, the
Gender coefficient contributes nothing to the estimate of the expected income. When
Gender is equal to one, the expected Income is decreased by about $934.
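To make the interpretation of the coefficients concrete, the following sketch plugs the estimates reported in the first model's output into the fitted equation. The function name estimate_income and the example person (age 40, 16 years of education) are hypothetical choices for illustration.

```r
# Coefficient estimates copied from the reported output of the first model.
b <- c(intercept = 7.26299, age = 0.99520, education = 1.75788, gender = -0.93433)

# Hypothetical helper: expected income (in thousands of dollars)
# for a given age, years of education, and gender (0 = female, 1 = male).
estimate_income <- function(age, education, gender) {
  unname(b["intercept"] + b["age"] * age +
           b["education"] * education + b["gender"] * gender)
}

female_40 <- estimate_income(age = 40, education = 16, gender = 0)
male_40 <- estimate_income(age = 40, education = 16, gender = 1)

print(female_40)            # expected income when Gender = 0
print(male_40)              # expected income when Gender = 1
print(male_40 - female_40)  # difference equals the Gender coefficient
```

The two estimates differ by exactly the Gender coefficient, which is the sense in which Gender "contributes nothing" when it equals zero.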
Because the coefficient values are only estimates based on the observed incomes in the
sample, there is some uncertainty or sampling error for the coefficient estimates. The Std.
Error column next to the coefficients provides the sampling error associated with each
coefficient and can be used to perform a hypothesis test, using the t-distribution, to
determine if each coefficient is statistically different from zero. In other words, if a
coefficient is not statistically different from zero, the coefficient and the associated
variable in the model should be excluded from the model. In this example, the associated
hypothesis tests' p-values, Pr(>|t|), are very small for the Intercept, Age, and
Education parameters. As seen in Chapter 3, a small p-value corresponds to a small
probability that such a large t value would be observed under the assumptions of the null
hypothesis. In this case, for a given j = 0, 1, 2, ..., p - 1, the null and alternate hypotheses
follow:
$$H_0: \beta_j = 0 \qquad H_A: \beta_j \neq 0$$
For small p-values, as is the case for the Intercept, Age, and Education parameters, the
null hypothesis would be rejected. For the Gender parameter, the corresponding p-value is
fairly large at 0.13. In other words, at a 90% confidence level, the null hypothesis would
not be rejected. So, dropping the variable Gender from the linear regression model should
be considered. The following R code provides the modified model results:
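results2 <- lm(Income ~ Age + Education, income_input)
summary(results2)
[...]
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.75822    1.92728   3.507 0.000467 ***
Age          0.99603    0.02057  48.412  < 2e-16 ***
Education    1.75860    0.11586  15.179  < 2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1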
Residual standard error: 12.08 on 1497 degrees of freedom
Multiple R-squared: 0.6359, Adjusted R-squared: 0.6354
F-statistic: 1307 on 2 and 1497 DF, p-value: < 2.2e-16
Dropping the Gender variable from the model resulted in a minimal change to the
estimates of the remaining parameters and their statistical significances.
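The t value and p-value columns in the coefficient tables can be reproduced from the Estimate and Std. Error columns. The sketch below does this for the Gender coefficient of the first model, using the reported estimate, standard error, and residual degrees of freedom; the variable names are chosen here for illustration.

```r
# Values copied from the reported output of the first model.
estimate <- -0.93433
std_error <- 0.62388
df <- 1496  # residual degrees of freedom

# The t value is the estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t-distribution.
p_value <- 2 * pt(-abs(t_value), df)

print(round(t_value, 3))
print(p_value)

# At a 0.10 significance level, the null hypothesis is rejected only if p < 0.10.
print(p_value < 0.10)
```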
The last part of the displayed results provides some summary statistics and tests on the
linear regression model. The residual standard error is the standard deviation of the
observed residuals. This value, along with the associated degrees of freedom, can be used
to examine the variance of the assumed normally distributed error terms. R-squared ($R^2$) is
a commonly reported metric that measures the variation in the data that is explained by the
regression model. Possible values of $R^2$ vary from 0 to 1, with values closer to 1
indicating that the model is better at explaining the data than values closer to 0. An $R^2$ of
exactly 1 indicates that the model explains perfectly the observed data (all the residuals
are equal to 0). In general, the $R^2$ value can be increased by adding more variables to the
model. However, just adding more variables to explain a given dataset but not to improve
the explanatory nature of the model is known as overfitting. To address the possibility of
overfitting the data, the adjusted $R^2$ accounts for the number of parameters included in the
linear regression model.
The F-statistic provides a method for testing the entire regression model. In the previous
t-tests, individual tests were conducted to determine the statistical significance of each
parameter. The provided F-statistic and corresponding p-value enable the analyst to test
the following hypotheses:
$$H_0: \beta_1 = \beta_2 = \ldots = \beta_{p-1} = 0 \qquad H_A: \beta_j \neq 0 \text{ for at least one } j = 1, 2, \ldots, p-1$$
In this example, the reported p-value (< 2.2e-16) is small, which indicates that the null hypothesis
should be rejected.
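The adjusted R-squared and the F-statistic can both be recovered from the reported R-squared, the number of observations, and the number of parameters, using the standard formulas for a linear model with an intercept. The sketch below applies them to the first model (three input variables); because the reported R-squared is rounded, the results match the output only approximately.

```r
# Values taken from the reported output of the first model.
r2 <- 0.6364  # multiple R-squared (rounded in the output)
n <- 1500     # number of observations
p <- 4        # number of parameters, including the intercept

# Adjusted R-squared penalizes the number of parameters.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p)

# F-statistic for testing that all non-intercept parameters are zero.
f_stat <- (r2 / (p - 1)) / ((1 - r2) / (n - p))

# Upper-tail probability of the F-distribution with (p - 1, n - p) degrees of freedom.
p_value <- pf(f_stat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)

print(round(adj_r2, 4))
print(round(f_stat, 1))
print(p_value < 2.2e-16)
```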
Categorical Variables
In the previous example, the variable Gender was a simple binary variable that indicated
whether a person is female or male. In general, these variables are known as categorical
variables. To illustrate how to use categorical variables properly, suppose it was decided
in the earlier Income example to include an additional variable, State, to represent the
U.S. state where the person resides. Similar to the use ...