MUSA508-HW4-Housing Subsidy Algorithm

palette5 <- c( "#ff4500",'#e3eec2'  , '#853353' ,'#2f8fb1', '#f4ac94')

palette2 <- c("#1f95ac","#ffb026")

subsidy <- read.csv("C:/Users/Yiming Ma/Documents/UPenn/MUSA508/hw/hw4/housingSubsidy.csv")

# View(subsidy)
# y.numeric = 1: take the program
# y.numeric = 0: not take the programs

# colnames(subsidy)
# table(subsidy$education)
# table(subsidy$y)

# 451/4119

I. Introduction and Motivation

The home repair tax credit program can be seen as an incentive for homeowners to renovate the house and increase the home value. The Department of Housing and Community Development (HCD) tries to reach out to eligible homeowners every year, but in a random manner, the uptake of the credit is only 11%. Thus, a more targeted campaign is needed.

II. Exploratory Analysis

1. Develop and interpret data visualizations that describe feature importance/correlation.

As shown in Figure 2.1, only 451 homeowners (10.95%) of the total 4,119 homeowners reached out to take the credit.

Through the graphs below, higher inflation rate, higher expense on home repairs, homeowner without previous contact or fewer previous contact, with university degree, with mortgage, taking administration or blue-collar related jobs, previously contacted via cellular, no previous market campaign, full time residence out of Philly, married homeowners tend to have a higher likelihood of not taking the credit. Homeowners experiencing higher employment rate, lower consumer price index, lower consumer confidence index tend to have a higher likelihood of taking the credit.

The behavior of taking the credit was not significantly affected by lien against the owner’s property (taxLien), homeowners’ age (age).

Note: Feasures age and pdays are regarded as continuous variables here for better virualization though they labeled a categorical variables in the codebook provided.

ggplot(data = subsidy) +
  geom_bar(aes(y,fill = y),width = 0.4) +
          scale_fill_manual(values = palette2,
                          name = 'Took Credit') +
  labs(x="Took Credit", y="Value", 
           title = "Taking credit vs. No-taking credit\n",
       caption = 'Figure 2.1') +
      theme(legend.position = "none") +
  plotTheme()

subsidy %>%
  dplyr::select(y,previous, unemploy_rate,cons.price.idx,cons.conf.idx,inflation_rate, spent_on_repairs, age, pdays) %>%
  gather(Variable, value, -y) %>%
    ggplot(aes(y, value, fill=y)) + 
      geom_bar(position = "dodge", stat = "summary", fun.y = "mean") + 
      facet_wrap(~Variable, ncol = 3, scales = "free") +
        scale_fill_manual(values = palette2,
                          name = 'Took Credit') +
      labs(x="Took Credit", y="Value", 
           title = "Feature associations with the likelihood of taking credit",
           subtitle = "(continous outcomes on average)\n",
       caption = 'Figure 2.2') +
      theme(legend.position = "none") +
  plotTheme()

subsidy %>%
  dplyr::select(y,previous, unemploy_rate,cons.price.idx,cons.conf.idx,inflation_rate, spent_on_repairs, age, pdays) %>%
    gather(Variable, value, -y) %>%
    ggplot() + 
    geom_density(aes(value, color=y), fill = "transparent") + 
    facet_wrap(~Variable, ncol = 2, scales = "free") +
        scale_color_manual(values = palette2,
                          name = 'Took Credit') +
    labs(x="Took Credit", y="Value", 
           title = "Feature associations with the likelihood of taking credit",
           subtitle = "(continous outcomes in density)\n",
       caption = 'Figure 2.3') +
      theme(legend.position = "right") +
  plotTheme()

subsidy %>%
    dplyr::select(y,job, marital,education,taxLien,mortgage) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, ncol = 5, scales="free") +
        scale_fill_manual(values = palette2,
                          name = 'Took Credit') +
        labs(x="Took Credit", y="Value",
             title = "Feature associations with the likelihood of taking credit",
             subtitle = "(Categorical Features)\n",
       caption = 'Figure 2.4') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme()

subsidy %>%
    dplyr::select(y,taxbill_in_phl,contact, campaign,poutcome) %>%
    gather(Variable, value, -y) %>%
    count(Variable, value, y) %>%
      ggplot(., aes(value, n, fill = y)) +   
        geom_bar(position = "dodge", stat="identity") +
        facet_wrap(~Variable, ncol = 4, scales="free") +
        scale_fill_manual(values = palette2,
                          name = "Took Credit") +
        labs(x="Took Credit", y="Value",
             title = "Feature associations with the likelihood of taking credit",
             subtitle = "(Categorical Features, cont'd)\n",
       caption = 'Figure 2.5') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme()

III. Initial Model Building

In this section, this project tries to build a Logistic Regression model with all existing features.

3.1 Split dataset into a 65/35 training/test set

3.2 Build a model with all features

The first model was built with all features in the dataset included. Only few features (contacttelephone, monthdec, campaign, etc.)are significantly related to whether an individual took the credit or not, indicting that the model needs to be refined.

subsidyreg1 <- glm(y_numeric ~ .,
                  data=subsidyTrain %>% dplyr::select(-X,
                                                    -y,-taxLien),
                  family="binomial" (link="logit"))

summary(subsidyreg1)

## 
## Call:
## glm(formula = y_numeric ~ ., family = binomial(link = "logit"), 
##     data = subsidyTrain %>% dplyr::select(-X, -y, -taxLien))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4312  -0.4130  -0.3410  -0.2743   2.7971  
## 
## Coefficients:
##                                 Estimate  Std. Error z value Pr(>|z|)  
## (Intercept)                  -64.4569819 131.5866187  -0.490   0.6242  
## age                            0.0040401   0.0085015   0.475   0.6346  
## jobblue-collar                -0.2019782   0.2695338  -0.749   0.4536  
## jobentrepreneur               -0.4398813   0.4574926  -0.962   0.3363  
## jobhousemaid                  -0.6081370   0.5533354  -1.099   0.2718  
## jobmanagement                 -0.2958843   0.3058101  -0.968   0.3333  
## jobretired                    -0.0770689   0.3703018  -0.208   0.8351  
## jobself-employed              -0.7201679   0.4588905  -1.569   0.1166  
## jobservices                   -0.3785310   0.2992370  -1.265   0.2059  
## jobstudent                    -0.5827998   0.4749019  -1.227   0.2197  
## jobtechnician                 -0.0815229   0.2332106  -0.350   0.7267  
## jobunemployed                 -0.0036131   0.4282215  -0.008   0.9933  
## jobunknown                    -0.3449106   0.6996710  -0.493   0.6220  
## maritalmarried                 0.0025538   0.2390554   0.011   0.9915  
## maritalsingle                 -0.0272334   0.2743888  -0.099   0.9209  
## maritalunknown               -11.7954967 329.2148401  -0.036   0.9714  
## educationbasic.6y              0.3670326   0.3999401   0.918   0.3588  
## educationbasic.9y             -0.0754439   0.3385100  -0.223   0.8236  
## educationhigh.school           0.1334629   0.3225799   0.414   0.6791  
## educationilliterate          -13.1856118 882.7435574  -0.015   0.9881  
## educationprofessional.course   0.1791336   0.3502604   0.511   0.6091  
## educationuniversity.degree     0.2875891   0.3230265   0.890   0.3733  
## educationunknown               0.3203488   0.4125826   0.776   0.4375  
## mortgageunknown               -0.1105733   0.5672672  -0.195   0.8455  
## mortgageyes                   -0.2088628   0.1428249  -1.462   0.1436  
## taxbill_in_phlyes              0.1817119   0.2008873   0.905   0.3657  
## contacttelephone              -0.6417183   0.2850122  -2.252   0.0244 *
## monthaug                       0.0119168   0.4435987   0.027   0.9786  
## monthdec                       1.8188174   0.7795533   2.333   0.0196 *
## monthjul                       0.0987037   0.3739052   0.264   0.7918  
## monthjun                       0.0601515   0.4719817   0.127   0.8986  
## monthmar                       1.3910066   0.5690012   2.445   0.0145 *
## monthmay                      -0.4100960   0.3159273  -1.298   0.1943  
## monthnov                      -0.6154623   0.4249091  -1.448   0.1475  
## monthoct                       0.0704836   0.5478438   0.129   0.8976  
## monthsep                      -0.9263980   0.6826431  -1.357   0.1748  
## day_of_weekmon                -0.3172493   0.2213597  -1.433   0.1518  
## day_of_weekthu                -0.1405554   0.2150320  -0.654   0.5133  
## day_of_weektue                -0.1571005   0.2166871  -0.725   0.4684  
## day_of_weekwed                -0.2054917   0.2312499  -0.889   0.3742  
## campaign                      -0.1080497   0.0434241  -2.488   0.0128 *
## pdays                         -0.0007349   0.0007506  -0.979   0.3276  
## previous                       0.2429603   0.2130561   1.140   0.2541  
## poutcomenonexistent            0.7425384   0.3500727   2.121   0.0339 *
## poutcomesuccess                1.1095148   0.7477278   1.484   0.1378  
## unemploy_rate                 -0.9391549   0.4866028  -1.930   0.0536 .
## cons.price.idx                 0.9077695   0.8614268   1.054   0.2920  
## cons.conf.idx                  0.0229236   0.0292698   0.783   0.4335  
## inflation_rate                 0.4997422   0.4479791   1.116   0.2646  
## spent_on_repairs              -0.0044715   0.0107313  -0.417   0.6769  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1853.7  on 2678  degrees of freedom
## Residual deviance: 1489.0  on 2629  degrees of freedom
## AIC: 1589
## 
## Number of Fisher Scoring iterations: 13

3.3 Make Prediction and Plot Outcomes

A dataframe of predictions for the 1440 observations in the test set was created, called testProbs.

These predictions are the estimated probabilities of taking the credit for these out-of-sample subjects. We can compare them to the observed outcome. From Figure 3.3 we can see that although “not took” is clustering at 1, there is no clear pattern of “took” since it takes up only 10% in the dataset.

# subsidyreg1$xlevels$taxLien <- union(subsidyreg1$xlevels$taxLien, 
# levels(as.factor(subsidyTest$taxLien)))
testProbs1 <- data.frame(Outcome1 = as.factor(subsidyTest$y_numeric),
                        Probs1 = predict(subsidyreg1, newdata = subsidyTest, na.rm = TRUE, type= "response" ))

# glimpse(testProbs1)

ggplot(testProbs1, aes(x = Probs1, fill = as.factor(Outcome1))) + 
  geom_density() +
  facet_grid(Outcome1 ~ .) +
  scale_fill_manual(values = palette2,
                    name = "Took Credit") +
  labs(x = "Take Credit", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome\n",
       caption = 'Figure 3.3') +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "right") +
  plotTheme()

3.4 Good of Fit: Confusion Matrix

A “confusion matrix” for the threshold of 50% shows us the rate at which we got True Positives (aka Sensitivity), False Positives, True Negatives (aka Specificity) and False Negatives for that threshold.

True Positives: Predicted correctly homeowner would take the credit.
False Positives: Predicted incorrectly homeowner would take the credit.
True Negatives: Predicted correctly homeowner would not take the credit.
False Negatives: We predicted that a homeowner would not take the credit but they did.

It is obvious that this initial model is suffering low sensitivity (0.24), indicating we need to re-build the model or optimizing it.

testProbs1 <- 
  testProbs1 %>%
  mutate(predOutcome1  = as.factor(ifelse(testProbs1$Probs1 > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs1$predOutcome1, testProbs1$Outcome1, 
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1258  119
##          1   25   38
##                                               
##                Accuracy : 0.9                 
##                  95% CI : (0.8833, 0.915)     
##     No Information Rate : 0.891               
##     P-Value [Acc > NIR] : 0.1449              
##                                               
##                   Kappa : 0.3019              
##                                               
##  Mcnemar's Test P-Value : 0.000000000000009189
##                                               
##             Sensitivity : 0.24204             
##             Specificity : 0.98051             
##          Pos Pred Value : 0.60317             
##          Neg Pred Value : 0.91358             
##              Prevalence : 0.10903             
##          Detection Rate : 0.02639             
##    Detection Prevalence : 0.04375             
##       Balanced Accuracy : 0.61128             
##                                               
##        'Positive' Class : 1                   
##

IV. Model Optimization

4.1 Feature Engineering

Based on the exploratory analysis above, taxLien is ruled out from the feature set. To refine the model, precontact and contactInterval were converted to binary variables. education was divided into four categories (High School, Others, Professional, University) and added to the model as new variables.

subsidy$edu.level[subsidy$education == "basic.4y" |subsidy$education == "basic.6y"|subsidy$education == "basic.9y"] <- "Basic"
subsidy$edu.level[subsidy$education == "unknown" |subsidy$education == "illiterate"] <- "Others"
subsidy$edu.level[subsidy$education == "high.school"]  <- "High School"
subsidy$edu.level[subsidy$education == "professional.course"]  <- "Professional"
subsidy$edu.level[subsidy$education == "university.degree"]  <- "University"

subsidy <- subsidy %>%
  dplyr::mutate(precontact = ifelse(previous == "0","no","yes"),
                contactInterval = ifelse(pdays > 7,"Long","Short") )

4.2 Model Building

set.seed(3456)
trainIndex <- createDataPartition(subsidy$y, p = .65,
                                  list = FALSE,
                                  times = 1)
subsidyTrain <- subsidy[ trainIndex,]
subsidyTest  <- subsidy[-trainIndex,]

subsidyreg2 <- glm(y_numeric ~ .,
                  data=subsidyTrain %>% 
                    dplyr::select(-X,-y,-age,-education,-taxLien,-pdays,-previous,-month,-job),
                  family="binomial" (link="logit"))


stargazer(subsidyreg1, type = "html", 
          title = "Table 4.2.1 Summary Statistics of Model 1 on Training Set",
          header = FALSE,
          single.row = TRUE)

**Table 4.2.1 Summary Statistics of Model 1 on Training Set**

	Dependent variable:

	y_numeric

age	0.004 (0.009)
jobblue-collar	-0.202 (0.270)
jobentrepreneur	-0.440 (0.457)
jobhousemaid	-0.608 (0.553)
jobmanagement	-0.296 (0.306)
jobretired	-0.077 (0.370)
jobself-employed	-0.720 (0.459)
jobservices	-0.379 (0.299)
jobstudent	-0.583 (0.475)
jobtechnician	-0.082 (0.233)
jobunemployed	-0.004 (0.428)
jobunknown	-0.345 (0.700)
maritalmarried	0.003 (0.239)
maritalsingle	-0.027 (0.274)
maritalunknown	-11.795 (329.215)
educationbasic.6y	0.367 (0.400)
educationbasic.9y	-0.075 (0.339)
educationhigh.school	0.133 (0.323)
educationilliterate	-13.186 (882.744)
educationprofessional.course	0.179 (0.350)
educationuniversity.degree	0.288 (0.323)
educationunknown	0.320 (0.413)
mortgageunknown	-0.111 (0.567)
mortgageyes	-0.209 (0.143)
taxbill_in_phlyes	0.182 (0.201)
contacttelephone	-0.642^** (0.285)
monthaug	0.012 (0.444)
monthdec	1.819^** (0.780)
monthjul	0.099 (0.374)
monthjun	0.060 (0.472)
monthmar	1.391^** (0.569)
monthmay	-0.410 (0.316)
monthnov	-0.615 (0.425)
monthoct	0.070 (0.548)
monthsep	-0.926 (0.683)
day_of_weekmon	-0.317 (0.221)
day_of_weekthu	-0.141 (0.215)
day_of_weektue	-0.157 (0.217)
day_of_weekwed	-0.205 (0.231)
campaign	-0.108^** (0.043)
pdays	-0.001 (0.001)
previous	0.243 (0.213)
poutcomenonexistent	0.743^** (0.350)
poutcomesuccess	1.110 (0.748)
unemploy_rate	-0.939^* (0.487)
cons.price.idx	0.908 (0.861)
cons.conf.idx	0.023 (0.029)
inflation_rate	0.500 (0.448)
spent_on_repairs	-0.004 (0.011)
Constant	-64.457 (131.587)

Observations	2,679
Log Likelihood	-744.481
Akaike Inf. Crit.	1,588.962

Note:	p<0.1; p<0.05; p<0.01

stargazer(subsidyreg2, type = "html", 
          title = "Table 4.2.2 Summary Statistics of Model 2 on Training Set",
          header = FALSE,
          single.row = TRUE)

**Table 4.2.2 Summary Statistics of Model 2 on Training Set**

	Dependent variable:

	y_numeric

maritalmarried	-0.018 (0.230)
maritalsingle	-0.061 (0.245)
maritalunknown	-11.947 (329.038)
mortgageunknown	-0.023 (0.526)
mortgageyes	-0.169 (0.139)
taxbill_in_phlyes	0.156 (0.195)
contacttelephone	-0.900^*** (0.216)
day_of_weekmon	-0.253 (0.215)
day_of_weekthu	-0.151 (0.211)
day_of_weektue	-0.162 (0.212)
day_of_weekwed	-0.237 (0.225)
campaign	-0.100^** (0.043)
poutcomenonexistent	0.383^* (0.212)
poutcomesuccess	1.483^*** (0.573)
unemploy_rate	-0.891^*** (0.225)
cons.price.idx	1.465^*** (0.357)
cons.conf.idx	0.065^*** (0.021)
inflation_rate	-0.004 (0.286)
spent_on_repairs	0.004 (0.005)
edu.levelHigh School	0.129 (0.205)
edu.levelOthers	0.348 (0.330)
edu.levelProfessional	0.244 (0.231)
edu.levelUniversity	0.369^** (0.187)
precontactyes
contactIntervalShort	0.337 (0.588)
Constant	-156.582^*** (54.618)

Observations	2,679
Log Likelihood	-769.184
Akaike Inf. Crit.	1,588.367

Note:	p<0.1; p<0.05; p<0.01

4.3 Make Prediction and Plot Outcomes

testProbs2 <- data.frame(Outcome2 = as.factor(subsidyTest$y_numeric),
                        Probs2 = predict(subsidyreg2, newdata = subsidyTest, na.rm = TRUE, type= "response" ))



ggplot(testProbs2, aes(x = Probs2, fill = as.factor(Outcome2))) + 
  geom_density() +
  facet_grid(Outcome2 ~ .) +
  scale_fill_manual(values = palette2,
                          name = "Took Credit") +
  labs(x = "Take Credit", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome\n",
       caption = 'Figure 4.3') +
  theme(strip.text.x = element_text(size = 18),
        legend.position = "right") +
  plotTheme()

4.4 Good of Fit: Confusion Matrix

Unfortunately, the steps in data engineering did not improve the sensitivity of the model. The sensitivity dropped from 0.24 to 0.21.

testProbs2 <- 
  testProbs2 %>%
  mutate(predOutcome2  = as.factor(ifelse(testProbs2$Probs2 > 0.5 , 1, 0)))

caret::confusionMatrix(testProbs2$predOutcome2, testProbs2$Outcome2, 
                       positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1272  124
##          1   11   33
##                                              
##                Accuracy : 0.9062             
##                  95% CI : (0.89, 0.9208)     
##     No Information Rate : 0.891              
##     P-Value [Acc > NIR] : 0.0325             
##                                              
##                   Kappa : 0.2947             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.21019            
##             Specificity : 0.99143            
##          Pos Pred Value : 0.75000            
##          Neg Pred Value : 0.91117            
##              Prevalence : 0.10903            
##          Detection Rate : 0.02292            
##    Detection Prevalence : 0.03056            
##       Balanced Accuracy : 0.60081            
##                                              
##        'Positive' Class : 1                  
##

4.5 Cross Validation on Both Models

The same result is shown in Figure 4.1 and 4.2, With sensitivity decreased slightly after adjusting features.

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)


cvFit.reg1 <- train(y ~ ., data = subsidy %>%
                  dplyr::select(-y_numeric, -X) %>%
                  dplyr::mutate(y = ifelse(y=="no","c1.yes","c1.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)

cvFit.reg2 <- train(y ~ ., data = subsidy %>%
                  dplyr::select(-X,-y_numeric,-age,-education,-taxLien,-pdays,-previous) %>%
                  dplyr::mutate(y = ifelse(y=="no","c2.yes","c2.no")), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)


cvFit.reg1

## Generalized Linear Model 
## 
## 4119 samples
##   22 predictor
##    2 classes: 'c1.no', 'c1.yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4077, 4077, 4078, 4078, 4078, 4077, ... 
## Resampling results:
## 
##   ROC        Sens    Spec     
##   0.7691055  0.2195  0.9809009

cvFit.reg2

## Generalized Linear Model 
## 
## 4119 samples
##   17 predictor
##    2 classes: 'c2.no', 'c2.yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 4078, 4078, 4077, 4077, 4078, 4077, ... 
## Resampling results:
## 
##   ROC        Sens    Spec    
##   0.7658097  0.2175  0.982545

dplyr::select(cvFit.reg1$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit.reg1$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", 
         title="CV Goodness of Fit Metrics, Model 1",
         subtitle = "Across-fold mean reprented as dotted lines\n",
         caption = "Figure 4.1") +
  plotTheme()

dplyr::select(cvFit.reg2$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit.reg2$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", 
         title="CV Goodness of Fit Metrics, Model 2",
         subtitle = "Across-fold mean reprented as dotted lines\n",
         caption = "Figure 4.2") +
    plotTheme()

V. ROC Curve for New Model

According to the ROC Curve, a threshold that predicts credit acceptance correctly 80% of the time, will predict credit acceptance incorrectly >20% of the time. For every additional improvement in true positive rate, the model will make a greater proportion of false positive errors. Moving from a 85% to a 95% true positive rate dramatically increases the false positive rate.

auc(testProbs2$Outcome2, testProbs2$Probs2)

## Area under the curve: 0.8015

ggplot(testProbs2, aes(d = as.numeric(testProbs2$Outcome2), m = Probs2)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = '#a6dbbc') +
  labs(title = "ROC Curve - Model 2\n",
       caption = 'Figure 5') +
  plotTheme()

VI. Cost-Benefit Analysis

This has all been building to an examination of the model in the context of our credit campaign. Let’s set this up to estimate the revenues associated with using this model under the following scenario:

6.1 Cost/Benefit Equation

True Positive - Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit: -$5000 - $2850 + ($10,000 + $56,000) * 0.08 return for 25% of cases that take credit, but only $2850 lose for 75% of cases receive the marketing allocation but do not take the credit.
True Negative - Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated: $0.
False Positive - Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated: - $2850
False Negative - We predicted that a homeowner would not take the credit but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. Thus, we ‘0 out’ this category, assuming the cost/benefit of this is $0.

6.2 Create the “Cost/Benefit Table”

The table below presents the count and the revenue of the four categories under the default threshold of 0.5. Optimizing the threshold might generate a higher overall revenue.

cost_benefit_table <-
   testProbs2 %>%
      count(predOutcome2, Outcome2) %>%
      summarize(True_Negative = sum(n[predOutcome2==0 & Outcome2==0]),
                True_Positive = sum(n[predOutcome2==1 & Outcome2==1]),
                False_Negative = sum(n[predOutcome2==0 & Outcome2==1]),
                False_Positive = sum(n[predOutcome2==1 & Outcome2==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               ifelse(Variable == "True_Negative", 0 *Count,
               ifelse(Variable == "True_Positive",((-5000 - 2850 + (10000 + 56000) * 0.08) * Count * 0.25 +
                                                    (- 2850) * Count * 0.75),
               ifelse(Variable == "False_Negative", 0 * Count,
               ifelse(Variable == "False_Positive", (- 2850) * Count, 0))))) %>%
    bind_cols(data.frame(Description = c(
              "Predicted correctly homeowner would not take the credit",
              "Predicted correctly homeowner would take the credit",
              "Predicted incorrectly homeowner would not take the credit",
              "Predicted incorrectly homeowner would take the credit")))


kable(cost_benefit_table) %>% 
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 6.2 Cost/Benefit Table\n",
           general_title= '\n')

Variable	Count	Revenue	Description
True_Negative	1272	0	Predicted correctly homeowner would not take the credit
True_Positive	33	-91740	Predicted correctly homeowner would take the credit
False_Negative	124	0	Predicted incorrectly homeowner would not take the credit
False_Positive	11	-31350	Predicted incorrectly homeowner would take the credit

Table 6.2 Cost/Benefit Table

6.3 Optimize Thresholds

This section selects optimal thresholds by plot the confusion metric outcomes for each threshold.

whichThreshold_subsidy <- iterateThresholds(testProbs2)

whichThreshold_subsidy_revenue <- 
whichThreshold_subsidy %>% 
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue))

whichThreshold_subsidy  %>%
  filter(Variable != 'False_Negative') %>%
  ggplot(.,aes(Threshold, Count, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Count by confusion matrix type and threshold\n",
       y = "Count",
       caption = 'Figure 6.2') +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix"))

whichThreshold_subsidy  %>%
  filter(Variable != 'False_Negative') %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Revenue by confusion matrix type and threshold\n",
       y = "Revenue",
       caption = 'Figure 6.3') +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix"))

6.4 Threshold as a Function of Total Revenue and Total Count

As shown in Table 6.4, the total revenue increased as the threshold goes up, and reached the highest value at the threshold of 0.85.

whichThreshold_subsidy_revenue <- 
  whichThreshold_subsidy %>% 
    mutate(total_count =
               ifelse(Variable == "True_Positive", Count * 0.25, 
               ifelse(Variable == 'False_Negative', Count ,0 ))) %>% 
    group_by(Threshold) %>% 
    summarize(Total_Revenue = sum(Revenue),
              Total_Count_of_Credits = sum(total_count))

whichThreshold_subsidy_revenue %>%
  dplyr::select(Threshold, Total_Revenue, Total_Count_of_Credits) %>%
  gather(Variable, Value, -Threshold) %>%
  ggplot(aes(Threshold, Value, colour = Variable)) +
    geom_point() +
    geom_vline(xintercept = pull(arrange(whichThreshold_subsidy_revenue, -Total_Revenue)[1,1])) +
    scale_y_continuous( limits=c(-4064571, 2000) )+ 
    scale_colour_manual(values = palette2) + 
    
    plotTheme() + 
    labs(title = "Total Revenue and Total Credits by Threshold\n",
         caption = 'Figure 6.4')

6.5 Table of Total Revenue and Total Count

Optimizing the threshold turned the revenue from negative to positive and increased the count of credit acceptance.

output_table <- rbind(whichThreshold_subsidy_revenue[50,],whichThreshold_subsidy_revenue[85,])


kable(output_table) %>%
  kable_styling(font_size = 12, full_width = F,
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general = "Table 6.5 Total Revenue and Credit by Threshold\n",
           general_title= '\n')

Threshold	Total_Revenue	Total_Count_of_Credits
0.50	-121818	132.25
0.85	1283	157.00

Table 6.5 Total Revenue and Credit by Threshold

VII. Conclusion

The current model cannot be put into production because the sensitivity, indicating the correct credit acceptance prediction and the catogory with the highest revenue, is far from satisfying. One possible way of improving the model is to include more features such as the annual income of the homeowner. With the goal of accurately identifying more homeowners and offering them tax credit, it might be helpful to contact the homeowners at a more frequent basis. Also, it’s recommended to contact over email or mail instead of over phone, since it has been proven to be a negative factor in the current model.