MUSA508 Midterm Project: Hedonic Home Price Prediction

I. Introduction & Motivation

Zillow is one of the most popular real estate databases online. One of Zillow’s key features is its Zestimates, a popular consumer tool for seeing how much homes are worth. These estimates are based on information from sources such as comparable sales and public data. According to Zillow, most Zestimates are “within 10 percent of the selling price of the home.” Although we cannot expect perfectly accurate estimates from any algorithm, it is important to improve the accuracy in many possible ways. This project focuses on one possible improvement: incorporating enough local factors to improve the accuracy of this predictive algorithm.

Miami is a global tourist destination, well known for the local beaches, such as Miami Beach and South Beach, which best represent the uniqueness of Miami. The tourism and real estate industry have been dominating Miami’s economy and mutually boosting each other. There are other aspects that make the real estate market unique and make sales price prediction in Miami trickier than in other cities.

Renters form 70% of the population. In Miami, renters form a large percentage of the population. Partly due to the mobility of work. Most of the people work in temporary or seasonal jobs specifically in the tourism industry. Moreover, rental housing is more affordable than buying a house due to the overall high property sale prices in Miami. Rental market can affect the sales market. And that is why a lot of investors are buying single-family homes and then renovating them into multiple rentable units.

Foreign buyers pricing out locals out of Miami. And as South Florida continues to grow into an international destination beyond vacations and beaches, locals must compete with cash-rich foreign and out-of-state buyers, whose growing presence keeps driving up land and housing prices while local wages remain stagnant.

Florida flooding could devalue real estate. According to the Jupiter Intelligence report, by 2050, annual flooding damage county-wide in Miami-Dade County is expected to roughly double, leading to shortages in affordable insurance coverage and real estate market instability. On the list of the 20 urban areas in America that will suffer the most from rising seas, Florida has five: St Petersburg, Tampa, Miami, Miami Beach and Panama City. In 2016, Zillow predicted that one out of eight homes in Florida would be underwater by 2100, a loss of $413bn in property.

However, with the understanding of incorporating local intelligence into the model, this is no easy task to do. On the one hand, some features can be hard to quantify. On the other hand, open datasets might not be available. For example, case-level crime data is unavailable on the Miami Open Data Portal, and it is almost intuitive to take crime into account when predicting housing value of every city. Thus, seeking other ways to illustrate this feature is crucial for carrying on with our analysis on the Miami housing market.

This project adopted geospatial machine learning techniques, split housing sale price dataset into training and test set, incorporated local intelligence features from Census data and Miami Open Data portal, and trained the final model to achieve the lowest possible error rate.

After several rounds of testing, the final model includes important features such as school location, hospital location, and vacant rate in census tract. Applying the final model to the test set, the Mean Absolute Errors (MAE) of the model is 59645, and the Mean Absolute Percent Errors (MAPE) is 0.12

# 1. load Libraries
library(sf)
library(tidyverse)
# install.packages('mapview')
library(mapview)
library(spdep)
library(caret)
library(ckanr) # for opening data APIs built on CKAN technology
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools)     # for regression model plots
library(stargazer) # for creating table
library(broom)
library(tufte)
library(rmarkdown)
library(kableExtra)
library(tidycensus)

# 2. Identify functions
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 15,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.text.x = element_text(size = 14))
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(color = "darkred", size=15, face="bold"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("#E5E5E5", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2),
    strip.background = element_rect(fill = "#E5E5E5", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

#3. Load Quantile break functions

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]]), digits = 3),
                 c(.01,.2,.4,.6,.8), na.rm=T)
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}


# Load hexadecimal color palette

palette <- c('#feedde', '#fdbe85', '#fd8d3c', '#e6550d', '#a63603')

# for calculating average nearest neighbor distance.

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    dplyr::summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull() # pull() is similar to $. It's mostly useful because it looks a little nicer in pipes, it also works with remote data frames, and it can optionally name the output.
  
  return(output)  
}

II. Data Manipulation and Visualization

2.0 Setup

In this section, we loaded necessary libraries, created plot theme options and map theme options, and identified functions of quintile breaks, and average nearest neighbor distance for further analysis.

2.1 Data Wrangling

In this section, we loaded Miami data and other complementary datasets, conducted feature selection and engineering. Besides, for other analysis, we separate the whole dataset into miami.training and miami.test in advance.

2.1.1 Data loading

Five categorical datasets were used in this project:

Miami House Price and Internal Characteristics: basic geometric dataset provided by the course in advance.
Miami Base Data: geometric data containing the geographic information of Miami and Miami Beach.
Miami Neighborhood Shapefiles: digital geographic dataset contains polygon features representing neighborhood areas within the City of Miami as well as Miami Beach, Florida.
Miami Beach Neighborhood: As an alternative way to using neighborhood data of Miami Beach, we used Miami Municipal Boundary data that is a polygon feature class of municipal boundaries within Miami-Dade County, and specifically filter geometric data in Miami Beach.
Census Data: demographic variables from the ACS 2017 for census tracts in Miami-Dade County. We specifically selected 8 variables in the analysis:
- TotalPop: Total population in each census tract
- Whites: White population in each census tract
- MedHHInc: Median household income in each census tract
- MedRent: Median Rent for properties in each census tract
- TotalPoverty: Population living under the level of poverty in each census tract
- TotalVacant: Total vacant unit in each census tract
- TotalUnit: Total housing units in each census tract
- RenterOccpied: Total householder lived in renter-occupied housing units
With variables above, we created the four features to be used in the model building.
- pctWhite: white population proportion in each census tract
- pctPoverty: Poverty population proportion in each census tract
- pctTotalVacant: Vacant unites proportion in each census tract
- pctRenterOccupied: Proportion of householder living in renter-occupied housing units in each census tract
Notice: Those vacant related variables are vital features we explored that would significantly improve our model and powerfully predict price. Detailed information will be provided in the following analysis.
Amenity Data: compilatory data chosen for describing amenities and public service of Miami-Dade County. The datasets we chose encompasses crime rate, education opportunity, healthcare service and entertainment accessibility. See each dataset below:
1. Sexual Predator: A point feature class of Sexual Offenders and Predators within Miami-Dade County that is used by the Sexual Offender/Predator Residence Search Internet application.
2. Public School: A point feature class of the Miami-Dade County Public Schools facilities.
3. College, University: A point feature class of colleges and universities within Miami-Dade County.
4. Hospital: A point feature class of the Hospital facilities within Miami-Dade County.
5. Landmark: A point feature class of landmarks within Miami-Dade County.
6. Hotel, Motel, Inn: A point feature class of Hotel, Motel and Inn facilities within Miami-Dade County.
7. Shopping Mall: A point feature class that contains the locations of major shopping malls within Miami-Dade County. Shopping Mall is considered a large retail complex containing a variety of stores and often restaurants in which one or more buildings form a complex of shops representing merchandisers with interconnecting walkways that enable customers to walk from unit to unit.

# I. Base data
## House Price & internal characteristics
# miami.sf <- st_read('C:/Users/Yiming Ma/Documents/UPenn/MUSA508/hw/hw2/studentsData.geojson')

miami.sf <- st_read('/Users/penguin/Box Sync/GitHub/MUSA508-Midterm/studentsData.geojson')


## Miami base data
miami.base <-st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson") %>%
  filter(NAME %in% c("MIAMI", "MIAMI BEACH"))

## Neighborhood data
miami.neigh <- st_read('https://opendata.arcgis.com/datasets/2f54a0cbd67046f2bd100fb735176e6c_0.geojson')
# plot(miami.neigh)

## Miami beach neighborhood data
miami.beach <- st_read('https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson') %>%
  filter(NAME == "MIAMI BEACH") 

# Combine Miami beach neighborhood data and Miami city neighborhood data together
index <- miami.sf %>%
  rowid_to_column(var = "rowIndex") 

index.neigh <-st_join(index,miami.neigh) %>%
  distinct(rowIndex, .keep_all = TRUE) %>%
  select(-Shape_STAr,-Shape_STLe,-Shape__Area,-Shape__Length)


index.beach <- st_join(index,miami.beach) %>%
  filter(NAME == "MIAMI BEACH") %>%
  distinct(rowIndex, .keep_all = TRUE) %>%
  rename(LABEL = NAME) %>% 
  select(-MUNICID,-MUNICUID,-CREATEDBY,-CREATEDDATE,-MODIFIEDBY,
  -MODIFIEDDATE,-SHAPE_Length,-SHAPE_Area,-FIPSCODE)

### Create final data we will work on!
miami.sf <- rbind(index.neigh, index.beach) %>%
  distinct(rowIndex, .keep_all = TRUE) 


### check neighborhood information
miami.neigh.num <- as.data.frame(table(miami.sf$LABEL)) 

## Census Tract data
## View(load_variables(2017,'acs5',cache = TRUE))
tracts17 <- 
  get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E",
                                             "B19013_001E","B25058_001E",
                                             "B06012_002E", 
                                             # vacant variables
                                             "B25002_003E", 
                                             "B25004_002E","B25004_003E",
                                             "B25004_004E","B25004_005E",
                                             # total housing unit
                                             "B25001_001E",
                                             # renter occupied 
                                             'B08137_003E'), 
          year=2017, state= 12, county= 086, geometry=T, output="wide") %>%
  st_transform('EPSG:2236') %>%
  rename(TotalPop = B25026_001E, 
         Whites = B02001_002E,
         MedHHInc = B19013_001E, 
         MedRent = B25058_001E,
         TotalPoverty = B06012_002E,
         TotalVacant = B25002_003E,
         ForRent = B25004_002E,
         ForRentVac = B25004_003E,
         ForSale = B25004_004E,
         ForSaleVac = B25004_005E,
         TotalUnit = B25001_001E,
         RenterOccupied = B08137_003E
         ) %>%
  dplyr::select(-NAME, -starts_with("B")) %>% #-starts_with("B") awesome!
  mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop,0),
         pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
         pctTotalVacant = ifelse(TotalUnit > 0, TotalVacant / TotalUnit * 100, 0),
         TotalOccupied = TotalUnit - TotalVacant,
         pctRenterOccupied = ifelse(TotalOccupied >0, RenterOccupied/TotalOccupied, 0)) %>%
  dplyr::select(-Whites, -TotalPoverty) 


projected.tracts17 <- 
  tracts17 %>% 
  st_transform(st_crs(miami.sf))

miami.sf <-st_join(miami.sf, projected.tracts17) %>%
  distinct(rowIndex, .keep_all = TRUE)
  


## Amenity Data
## 1. Miami landmark
miami.landmark <- st_read('https://opendata.arcgis.com/datasets/70a14825e66f4f0eb28d2a9cceba1761_0.geojson')
miami.landmark.sf <- miami.landmark %>%
  select(geometry) %>%
  na.omit() %>%
  distinct()

## 2. Miami Shopping Mall
miami.mall <- st_read('https://opendata.arcgis.com/datasets/cb24d578246647a9a4c57bbd80c1caa8_0.geojson')
miami.mall.sf <- miami.mall %>%
  select(geometry) %>%
  na.omit() %>%
  distinct()

## 3. Miami Sexual Predator
miami.sexual <- st_read('https://opendata.arcgis.com/datasets/f8759d722aeb4198bfe7c4ad780604d2_0.geojson')

miami.sexual.sf <- miami.sexual %>%
  select(geometry) %>%
  na.omit() %>%
  distinct()

## 4. School data
miami.school <- st_read('https://opendata.arcgis.com/datasets/d3db0fce650d4e40a5949b0acae6fe3a_0.geojson')
miami.school.sf <- miami.school %>%
  select(geometry) %>%
  na.omit() %>%
  distinct()


## 5. Hotel data
miami.hotel <- st_read('https://opendata.arcgis.com/datasets/d37bbc15e7304b4ca4607783283147b7_0.geojson')
miami.hotel.sf <- miami.hotel%>%
  select(geometry) %>%
  na.omit() %>%
  distinct()

## 6. College data
miami.college <- st_read('https://opendata.arcgis.com/datasets/7db056c406b943dc8f3f377b99d77588_0.geojson')
miami.college.sf <- miami.college%>%
  select(geometry) %>%
  na.omit() %>%
  distinct()

## 7. Hospital data
miami.hospital <- st_read('https://opendata.arcgis.com/datasets/0067a0e8b40644f980afa23ad34c32c4_0.geojson')
miami.hospital.sf <- miami.hospital %>%
  select(geometry) %>%
  na.omit() %>%
  distinct()

2.1.2 Feature Engineering

In this section, we created some new features to describe the amenities and public services in Miami-Dade County, as well as to each home sale observation’s internal characteristics.

First of all, we were aimed at creating features to measure the extent to which buyers can assess multiple public services, including hospitals, shopping malls, landmarks, hotels, motels and inns, and education resources. To utmost avoid scale bias triggered by setting arbitrary areal unit or fixed buffer distance of each home sale observation, we chose to calculate the average nearest neighbor distance from each home sale to its k nearest neighbor public services. Accordingly, the smaller the value is, the more possibility that the observation can access specific service. To better compare each feature’s impact, we set the k from 3 to 5.

By leveraging the exactly same approach, we also created a new feature to measure the extent to which buyers can expose sexual assaults.

Secondly, we added four more features to indicate the duration each house was built, effectively used, and if the house was built with a pool or a patio.

## 1. Landmark
miami.sf <-
  miami.sf %>% 
  mutate(
    landmark_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 1),
    landmark_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 2), 
    landmark_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 3), 
    landmark_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 4), 
    landmark_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 5)) %>%
  distinct(rowIndex, .keep_all = TRUE) 
  


## 2. Shopping mall
miami.sf <-
  miami.sf %>% 
  mutate(
    mall_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 1),
    mall_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 2), 
    mall_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 3), 
    mall_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 4), 
    mall_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 5)) %>%
  distinct(rowIndex, .keep_all = TRUE) 

## 3. Sexual Assaults
miami.sf <-
  miami.sf %>% 
  mutate(
    sexual_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 1),
    sexual_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 2), 
    sexual_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 3), 
    sexual_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 4), 
    sexual_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 5)) %>%
  distinct(rowIndex, .keep_all = TRUE) 


## 4. Hotel Access
miami.sf <-
  miami.sf %>% 
  mutate(
    hotel_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 1),
    hotel_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 2), 
    hotel_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 3), 
    hotel_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 4), 
    hotel_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 5)) %>%
  distinct(rowIndex, .keep_all = TRUE) 


## 5. School Access

miami.sf <-
  miami.sf %>% 
  mutate(
    school_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 1),
    school_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 2), 
    school_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 3), 
    school_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 4), 
    school_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 5)) %>%
  distinct(rowIndex, .keep_all = TRUE) 


## 6. College Access
miami.sf <-
  miami.sf %>% 
  mutate(
    college_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 1),
    college_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 2), 
    college_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 3), 
    college_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 4), 
    college_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 5)) %>%
  distinct(rowIndex, .keep_all = TRUE) 

## 7. Hospital Access
miami.sf <-
  miami.sf %>% 
  mutate(
    hospital_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 1),
    hospital_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 2), 
    hospital_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 3), 
    hospital_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 4), 
    hospital_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 5)) %>%
  distinct(rowIndex, .keep_all = TRUE) 

## 8. Internal characteristics
miami.sf <-
  miami.sf %>% 
  mutate(pool_house = ifelse(str_detect(XF1, "Pool"), 1, 0),
         patio_house = ifelse(str_detect(XF1, "Patio"), 1, 0),
         Age = 2020- YearBuilt,
         Age.effective = 2020- EffectiveYearBuilt)

2.1.3 Split dataset by ‘toPredict’

As indicated earlier, we separated the whole dataset (3503 observations in total) into miami.training and miami.test in advance for model building in next section. So far, we have 2,627 home sale observations for working on, while we used the left 876 observations to test model for competition.

# Split dataset 
miami.training <- miami.sf %>%
  filter(toPredict == '0')

# summary(miami.training)

miami.test <- miami.sf %>%
  filter(toPredict == '1') 

# summary(miami.test)

2.2 Summary Statistics of Features

As required, we provided three tables of summary statistics with variable descriptions for sorting features based on their categories below.

In total, we took 62 features into consideration in the initial stage.

19 features (including SalePrice came from the category “internal characteristics”36 features came from the category “amenity/public service”.For spatial features, we decided to continue to stick on neighborhood features as well as 6 more features extracted from census data, and summarize house number in each neighborhood for better understanding. In this analysis, we regarded census data as indicators to show the house unit spatial environment.

Detailed summary statistics are provided below.

## All potential features
all.feature.list <- miami.training %>%
  dplyr:: select(-saleDate,-saleType,-saleQual,-saleYear,-Property.Address,
         -Year,-WVDB,-HEX,-GPAR,-County.2nd.HEX,-County.Senior,-County.LongTermSenior,
         -County.Other.Exempt,-City.2nd.HEX,-City.Senior,-City.LongTermSenior,
         -City.Other.Exempt,-MillCode,-Land.Use,-Owner1,-Owner2,-Mailing.Address,
         -Mailing.City,-Mailing.State,-Mailing.Zip,-Mailing.Country,
         -starts_with("Legal"), -YearBuilt, -EffectiveYearBuilt, -X, -FID, -starts_with("Shape_"), -Folio,
         -XF1, -XF2, -XF3) %>%
  st_drop_geometry()

## 2.1 Feature: Internal Characteristics
internal.feature.list <- all.feature.list %>%
  dplyr:: select(SalePrice, Land, Bldg, Total, Assessed,County.Taxable,City.Taxable,AdjustedSqFt,
                 LotSize, Bed, Bath, Stories, Units, LivingSqFt, ActualSqFt, Age,Age.effective,
                 pool_house, patio_house) 

stargazer(internal.feature.list, type = "html", 
          title = "Table DATA 2.1 Summary Statistics of Internal Characteristics ",
          header = FALSE,
          single.row = TRUE)

**Table DATA 2.1 Summary Statistics of Internal Characteristics**

Statistic	N	Mean	St. Dev.	Min	Pctl(25)	Pctl(75)	Max

SalePrice	2,627	974,540.600	2,047,999.000	12,500	299,000	826,250	27,750,000
Land	2,627	485,089.200	985,245.200	27,050	134,957	457,257.5	20,972,640
Bldg	2,627	289,575.800	838,520.000	1,000	79,941	203,242	15,175,769
Total	2,627	774,665.000	1,627,564.000	55,159	233,916	687,787	21,974,519
Assessed	2,627	737,515.200	1,600,859.000	29,137	217,670.5	628,332.5	21,974,519
County.Taxable	2,627	713,294.700	1,603,350.000	0	192,542.5	607,352	21,974,519
City.Taxable	2,627	713,294.700	1,603,350.000	0	192,542.5	607,352	21,974,519
AdjustedSqFt	2,627	2,044.029	1,492.860	331	1,248	2,251	18,006
LotSize	2,627	7,681.807	4,748.604	1,250	5,400	7,931.2	80,664
Bed	2,627	2.992	1.100	0	2	3	13
Bath	2,627	2.054	1.316	0	1	2	12
Stories	2,627	1.192	0.432	0	1	1	4
Units	2,627	1.006	0.078	1	1	1	2
LivingSqFt	2,627	1,971.459	1,429.946	288	1,194.5	2,215	18,006
ActualSqFt	2,627	2,324.683	1,809.884	388	1,383	2,544.5	20,192
Age	2,627	68.772	22.720	1	67	82	115
Age.effective	2,627	47.199	26.669	1	24	71	115
pool_house	2,627	0.255	0.436	0	0	1	1
patio_house	2,627	0.178	0.383	0	0	0	1

## 2.2 Feature: Amenities/Public Services
amenity.feature.list <- all.feature.list %>%
  dplyr:: select(starts_with('landmark_'),starts_with('mall_'),starts_with('hospital_'),
                 starts_with('college_'),starts_with('school_'),starts_with('hotel_'),
                 pctTotalVacant, pctRenterOccupied, pctWhite,pctPoverty, MedRent ,MedHHInc)



stargazer(amenity.feature.list, type = "html", 
          title = "Table DATA 2.2 Summary Statistics of Amenities/Public Services ",
          header = FALSE,
          single.row = TRUE)

**Table DATA 2.2 Summary Statistics of Amenities/Public Services**

Statistic	N	Mean	St. Dev.	Min	Pctl(25)	Pctl(75)	Max

landmark_nn1	2,627	0.015	0.009	0.001	0.009	0.019	0.052
landmark_nn2	2,627	0.019	0.011	0.002	0.012	0.023	0.053
landmark_nn3	2,627	0.022	0.012	0.003	0.014	0.026	0.058
landmark_nn4	2,627	0.024	0.012	0.004	0.016	0.027	0.062
landmark_nn5	2,627	0.025	0.012	0.006	0.017	0.029	0.065
mall_nn1	2,627	0.023	0.010	0.001	0.015	0.031	0.051
mall_nn2	2,627	0.031	0.012	0.009	0.022	0.039	0.058
mall_nn3	2,627	0.037	0.012	0.012	0.029	0.046	0.065
mall_nn4	2,627	0.043	0.012	0.020	0.033	0.052	0.071
mall_nn5	2,627	0.047	0.012	0.028	0.036	0.057	0.077
hospital_nn1	2,627	0.019	0.010	0.001	0.012	0.024	0.056
hospital_nn2	2,627	0.025	0.011	0.002	0.018	0.030	0.069
hospital_nn3	2,627	0.029	0.013	0.005	0.020	0.034	0.076
hospital_nn4	2,627	0.033	0.014	0.005	0.023	0.037	0.083
hospital_nn5	2,627	0.036	0.015	0.006	0.026	0.038	0.089
college_nn1	2,627	0.013	0.008	0.0004	0.007	0.016	0.053
college_nn2	2,627	0.016	0.009	0.002	0.010	0.019	0.053
college_nn3	2,627	0.018	0.009	0.002	0.013	0.022	0.054
college_nn4	2,627	0.020	0.009	0.002	0.015	0.024	0.056
college_nn5	2,627	0.022	0.010	0.002	0.016	0.025	0.058
school_nn1	2,627	0.007	0.004	0.0004	0.004	0.009	0.027
school_nn2	2,627	0.009	0.004	0.001	0.006	0.011	0.028
school_nn3	2,627	0.010	0.005	0.002	0.007	0.013	0.028
school_nn4	2,627	0.012	0.005	0.003	0.008	0.014	0.028
school_nn5	2,627	0.013	0.006	0.004	0.009	0.015	0.032
hotel_nn1	2,627	0.010	0.006	0.0001	0.005	0.013	0.030
hotel_nn2	2,627	0.011	0.007	0.0004	0.006	0.014	0.030
hotel_nn3	2,627	0.012	0.007	0.001	0.007	0.015	0.031
hotel_nn4	2,627	0.013	0.007	0.001	0.007	0.016	0.032
hotel_nn5	2,627	0.013	0.007	0.001	0.008	0.017	0.033
pctTotalVacant	2,627	15.068	9.325	2.237	8.509	19.776	48.308
pctRenterOccupied	2,627	0.657	0.271	0.142	0.460	0.773	1.480
pctWhite	2,627	0.735	0.298	0.058	0.611	0.957	1.172
pctPoverty	2,627	0.216	0.111	0.066	0.123	0.304	0.582
MedRent	2,540	1,099.507	402.345	245.000	836.000	1,185.000	2,271.000
MedHHInc	2,627	56,581.220	43,518.950	14,699	30,099	60,057	172,750

## 2.3 Spatial Structure: Neighborhood
spatial.feature.list <- all.feature.list %>%
  dplyr:: select(LABEL) 


as.data.frame(table(spatial.feature.list)) %>%
  rename(`Neighborhood List` = spatial.feature.list,
         Number =Freq) %>%
  kable(caption = 'Table DATA 2.3 Spatial Feature: Neighborhood') %>%
  kable_styling("striped", full_width = F)

Table DATA 2.3 Spatial Feature: Neighborhood
Neighborhood List	Number
79th Street	1
Allapattah Industrial District	4
Auburndale	47
Bay Heights	13
Baypoint	23
Bayside	33
Belle Island	5
Belle Meade	48
Belle Meade West	16
Bird Grove East	10
Bird Grove West	11
Biscayne Island	2
Biscayne Plaza	2
Brentwood	3
Buena Vista Heights	19
Buena Vista West	48
Citrus Grove	49
Civic Center	3
Coral Gate	26
Curtis Park	13
Douglas Park	39
East Grove	53
East Little Havana	17
Edgewater	3
Edison	65
Fair Isle	15
Flagami	330
Flora Park	49
Grove Center	7
Hadley Park	128
Haynesworth	1
Highland Park	4
Historic Buena Vista East	10
King Heights	29
La Pastorita	7
Latin Quarter	4
Le Jeune Gardens	1
Lemon City/Little Haiti	24
Liberty Square	40
Little River Central	14
Little River Gardens	2
Melrose	19
Miami Avenue	15
Morningside	49
North Grapeland Heights	43
North Grove	43
North Sewell Park	12
Northeast Overtown	4
Northwestern Estates	32
Oakland Grove	1
Old San Juan	22
Orange Bowl	3
Orchard Villa	28
Palm Grove	4
Parkdale North	16
Parkdale South	17
Roads	88
San Marco Island	3
Santa Clara	44
Shenandoah North	95
Shenandoah South	93
Shorecrest	64
Silver Bluff	93
South Grapeland Heights	21
South Grove	128
South Grove Bayside	27
South Sewell Park	19
Spring Garden	3
Vizcaya	1
West Grapeland Heights	13
West Grove	32

census.feature.list <- all.feature.list %>%
  dplyr:: select(pctTotalVacant, pctRenterOccupied, pctWhite,pctPoverty, MedRent ,MedHHInc)


stargazer(census.feature.list, type = "html", 
          title = "Table DATA 2.4 Spatial Feature: Census Data ",
          header = FALSE,
          single.row = TRUE)

**Table DATA 2.4 Spatial Feature: Census Data**

Statistic	N	Mean	St. Dev.	Min	Pctl(25)	Pctl(75)	Max

pctTotalVacant	2,627	15.068	9.325	2.237	8.509	19.776	48.308
pctRenterOccupied	2,627	0.657	0.271	0.142	0.460	0.773	1.480
pctWhite	2,627	0.735	0.298	0.058	0.611	0.957	1.172
pctPoverty	2,627	0.216	0.111	0.066	0.123	0.304	0.582
MedRent	2,540	1,099.507	402.345	245.000	836.000	1,185.000	2,271.000
MedHHInc	2,627	56,581.220	43,518.950	14,699	30,099	60,057	172,750

2.3 Correlation Matrix

For better visualization, features were divided into internal characteristics and amenity again when creating correlation matrix. Based on the graduated color, some features have strong relation with Sale Price, such asLand, Total, Taxable and Actual SqFt.

# DATA - 3 Correlation Matrix
miami.numericVars.internal <- 
  select_if(internal.feature.list, is.numeric) %>% 
  na.omit()

amenity.feature.list.with.price <- all.feature.list %>%
  dplyr:: select(SalePrice, starts_with('landmark_'),starts_with('mall_'),starts_with('hospital_'),
                 starts_with('college_'),starts_with('school_'),starts_with('hotel_'),
                 pctTotalVacant, pctRenterOccupied, pctWhite,pctPoverty, MedRent ,MedHHInc)

miami.numericVars.amenity <- 
  select_if(amenity.feature.list.with.price, is.numeric) %>% 
  na.omit()


ggcorrplot(
  round(cor(miami.numericVars.internal), 1), 
  p.mat = cor_pmat(miami.numericVars.internal),
  colors = c('#05A167', "white", '#6897BB'),
  lab_size = 1,
  tl.cex = 8,
  type="lower",
  insig = c("pch", "blank"), pch = 1, pch.col = "black", pch.cex =1) +  
  labs(title = "Correlation across numeric variables\n(internal features)\n",
       caption = 'Figure DATA 3.1') +
  plotTheme()

ggcorrplot(
  round(cor(miami.numericVars.amenity), 1), 
  p.mat = cor_pmat(miami.numericVars.amenity),
  colors = c('#05A167', "white", '#6897BB'),
  lab_size = 1,
  tl.cex = 5,
  type="lower",
  insig = c("pch", "blank"), pch = 1, pch.col = "black", pch.cex =1) +  
  labs(title = "Correlation across numeric variables\n(amenity/public service features)\n",
       caption = 'Figure DATA 3.2') +
  plotTheme()

2.4 Home Price Correlation Scatterplots

In this section, we specifically explored the correlation between sale price and four interesting and meaningful features: pctTotalVacant, AdjustSqFt, school_nn4 (average distance of four nearest public schools) and hospital_nn1 (the accessibility to nearest hospital).

Through graphs below, it is obvious that the first three of four features are positively correlated to sale price, while the distance to nearest hospital is negatively correlated to the sale price that is consistent with our intuition.

# DATA - 4 Scatter plot  
## 1. Two House Internal characteristics
features <- c('Adjusted Square Feet', 'Percentage of Vacant Property')
names(features) <- c("AdjustedSqFt", "pctTotalVacant")
price.house.plot <- 
  st_drop_geometry(miami.training) %>% 
  dplyr::select(SalePrice, pctTotalVacant, AdjustedSqFt) %>%
  filter(SalePrice <= 1000000) %>%
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(size = .5) + 
  geom_smooth(method = "lm", se=F, colour = "darkred") +
  facet_wrap(~Variable, ncol = 2, scales = "free",
             labeller = labeller(Variable = features)) +
  plotTheme() + 
  labs(title = "Price as a Function of House Internal Characteristics Variables:\nVacant Property, Adjusted Housing Area\n",
       caption = 'Figure DATA 4.1') 

price.house.plot

## 2. Average Distance of Nearest Four Public Schools Correlation 
price.school_nn4.plot <-
  st_drop_geometry(miami.training) %>% 
  dplyr::select(SalePrice, school_nn4) %>%
  filter(SalePrice <= 1000000) %>%
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(size = .5) + 
  geom_smooth(method = "lm", se=F, colour = "darkred") +
  labs(title = "Price as a Function of Average Distance of\nNearest Four Public Schools Correlation\n",
       caption = 'Figure DATA 4.2') +
  plotTheme()

price.school_nn4.plot

## 3. Hospital Correlation 
price.hospital_nn1.plot <-
  st_drop_geometry(miami.training) %>% 
  dplyr::select(SalePrice, hospital_nn1) %>%
  filter(SalePrice <= 1000000) %>%
  gather(Variable, Value, -SalePrice) %>% 
  ggplot(aes(Value, SalePrice)) +
  geom_point(size = .5) + 
  geom_smooth(method = "lm", se=F, colour = "darkred") +
  labs(title = "Price as a Function of Average Distance of\nNearest Four Hospitals Correlation\n",
       caption = 'Figure DATA 4.3') +
  plotTheme()

price.hospital_nn1.plot

2.5 Map of Sale Price per Square Feet

The sale price per SqFt in Miami shows a clear spatial pattern. The sale price in Miami Beach and Downtown Miami is significantly higher than other areas.

# DATA - 5 Map of Sale Price per Square Feet

miami.training <- miami.training %>%
  mutate(priceFt = SalePrice/LivingSqFt)

map.priceFt <- ggplot() + 
  geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
  geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
  geom_sf(data = st_centroid(miami.training),aes(color = q5(priceFt)),size = .5) +
  scale_color_manual(values = palette,
                     labels = qBr(miami.training,'priceFt'),
                     name = "Price/ft^2") +
  labs(title = 'Sale Price Per Square Foot\n',
       caption = 'Figure DATA 5') +
  mapTheme() + 
  plotTheme()

map.priceFt

2.6 Maps of Independent Variables

In this section, we were aimed at mapping the vacant units proportion by each census tract where certain home sale observation is located at, the adjusted area of each home sale observation and the distribution of average distance of each home sale observation to the nearest hospital.

# DATA - 6 

## 1. Vacant units
map.pctV <- ggplot() + 
  geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
  geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
  geom_sf(data = st_centroid(miami.training),aes(color = q5(pctTotalVacant)),size = .5) +
  scale_color_manual(values = palette,
                     labels = qBr(miami.training,'pctTotalVacant'),
                     name = "Percentage of Vacant Units") +
  labs(title = 'Percentage of Vacant Units',
       subtitle = 'Note: dots indicating the vacant unts proportion in certain census tract\nwhere certain home sale observation is located at.\n',
       caption = 'Figure DATA 6.1') +
  mapTheme() + 
  plotTheme()

map.pctV

## 2. Adjusted Square Feet 

map.Adjust.SqFt <- ggplot() + 
  geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
  geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
  geom_sf(data = st_centroid(miami.training),aes(color = q5(AdjustedSqFt)),size = .5) +
  scale_color_manual(values = palette,
                     labels = qBr(miami.training,'AdjustedSqFt'),
                     name = "Adjusted Square Feet") +
  labs(title = 'Distribution of Adjusted Square Feet\n',
       caption = 'Figure DATA 6.2') +
  mapTheme() + 
  plotTheme()

map.Adjust.SqFt

## 3. Hospital

hospital_nn1_expanded <- miami.training %>%
  mutate(hospital_nn1.expanded = hospital_nn1 * 1000)

map.hospital <- ggplot() + 
  geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
  geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
  geom_sf(data = st_centroid(hospital_nn1_expanded),aes(color = q5(hospital_nn1.expanded)),size = .5) +
  scale_color_manual(values = palette,
                     labels = qBr(hospital_nn1_expanded,'hospital_nn1.expanded'),
                     name = "Average Distrance\nto Nearest Hospital\n(timed by 1000 \nfor better visualization)\n") +
  labs(title = 'Distribution of Average Distrance to Nearest Hospital\nof Each Home Sale Observation\n',
       caption = 'Figure DATA 6.3') +
  mapTheme() + 
  plotTheme()

map.hospital

III. Methods

The mean objective of this analysis is to train a robust model to predict house price as accurately as possible. Accordingly, this section started to provide a detailed interpretation on the process of the model building.

Data and Sample In order to accurately build and test models, we randomly split the whole miami.training into two parts: model.miami.training with 60% of observations for model training and model.miami.test with 40% of observations for model testing.
Dependent Variable The dependent variable is home sale price: SalePrice.
Independent Variables Based on the hedonic home price model, potential features should encompass at least three components: internal characteristics, like the number of bathrooms; neighborhood amenities/public services, like hospital accessibility; and the underlying spatial process of prices. After creating features and engineering and checking correlation coefficients, 54 features are eligible for the model and obviously or slightly correlated with SalePrice that were the initial independent variables for model building. We gradually screened out variables and eventually selected only 16 robust features included in the final model
Procedure The primitive approach used in this analysis for model building is Ordinary Least Squares Regression (OLS) which is aimed at predicting the SalePrice of houses as a function of several statistical parameters such as intercept, slope and selected features. In order to determine which features were able to significantly increase the accuracy and generalizability of the model, we systematically compared, plotted, and mapping the mean absolute errors (MAE), mean absolute percent errors (MAPE), as well as root mean square errors (RMSE, the standard deviation of the residuals) between original SalePrice and predicted SalePrice of both training set and test set, conducted the cross-validation on 100 folds, explored the underlying spatial pattern of price and errors caused by problematic prediction. To address the problem of neglecting the underlying spatial correlation caused by neighborhood, we then compared, plotted, and mapped the mean absolute errors (MAE), and mean absolute percent errors (MAPE) on the test set between baseline model without neighborhood features and updated model with neighborhood features, and then determined our final model based on series model examinations.

IV. Result

4.1 Dataset Splitting and Model Building

First of all, as indicated in the III. Methods, the dataset was divided into training set and test set, with the ratio of 6:4.

# 1. Split dataset
set.seed(31357)

# get index for training sample
inTrain <- caret::createDataPartition(
  y = paste(miami.training$LABEL,miami.training$SalePrice ), 
  p = .60, list = FALSE)
# split data into training and test
model.miami.training <- miami.training[inTrain,] 
model.miami.test <- miami.training[-inTrain,]

Secondly, we built three models with different features to examine which features should be included.

Model 1 included all the features we were interested in.
Model 2 excluded some of features which were not significantly correlated with SalePrice in Model 1.
Model 3 continued to exclude some of features which were not significantly correlated with SalePrice in Model 2 but remained the ones with comparatively large correlation coefficients. And Model 3 was the baseline model in this analysis.

The baseline model contained 9 internal characteristic features: Land, Bldg, Assessed, County.Taxable, AdjustedSqFt, LotSize, Bed, Bath, and ActualSqFt,as well as 7 amenity features: college_nn1, school_nn3, school_nn4, school_nn5, pctTotalVacant, MedRent, MedHHInc.

Model summaries were all shown below.

## First model: All internal features, amenity features
M1 <- lm(SalePrice ~ ., data = st_drop_geometry(model.miami.training) %>% 
           dplyr::select(colnames(internal.feature.list), 
                         colnames(amenity.feature.list)
))

stargazer(M1, type = "html", 
          title = "Table 4.1.1 Summary Statistics of Model 1 ",
          header = FALSE,
          single.row = TRUE)

**Table 4.1.1 Summary Statistics of Model 1**

	Dependent variable:

	SalePrice

Land	0.991^*** (0.021)
Bldg	1.093^*** (0.023)
Total
Assessed	0.517^*** (0.112)
County.Taxable	-0.303^*** (0.108)
City.Taxable
AdjustedSqFt	-159.901^*** (37.232)
LotSize	-13.966^*** (1.849)
Bed	-11,249.950 (6,997.370)
Bath	17,335.080^** (7,506.400)
Stories	-10,399.910 (13,662.860)
Units	-53,679.120 (55,759.680)
LivingSqFt	33.200 (23.408)
ActualSqFt	134.091^*** (19.856)
Age	252.681 (243.817)
Age.effective	-63.247 (188.533)
pool_house	14,246.280 (13,530.970)
patio_house	-4,738.318 (11,416.990)
landmark_nn1	-2,792,163.000 (2,351,722.000)
landmark_nn2	16,876,972.000^** (7,583,331.000)
landmark_nn3	-31,296,437.000^*** (12,121,768.000)
landmark_nn4	23,404,861.000 (19,773,533.000)
landmark_nn5	-6,786,591.000 (12,738,600.000)
mall_nn1	176,783.400 (1,845,897.000)
mall_nn2	-4,287,708.000 (4,432,823.000)
mall_nn3	5,844,359.000 (5,697,162.000)
mall_nn4	496,030.000 (8,900,686.000)
mall_nn5	-2,054,266.000 (5,713,360.000)
hospital_nn1	-1,818,032.000 (1,888,218.000)
hospital_nn2	4,421,767.000 (4,345,964.000)
hospital_nn3	-4,954,592.000 (8,681,228.000)
hospital_nn4	-1,725,413.000 (13,318,517.000)
hospital_nn5	3,390,553.000 (8,335,588.000)
college_nn1	-4,720,098.000^* (2,629,573.000)
college_nn2	5,923,026.000 (5,873,775.000)
college_nn3	-7,233,703.000 (12,571,717.000)
college_nn4	4,070,657.000 (17,754,491.000)
college_nn5	669,506.300 (10,417,329.000)
school_nn1	-687,049.800 (3,421,098.000)
school_nn2	3,194,391.000 (6,890,515.000)
school_nn3	-23,921,011.000^** (11,137,220.000)
school_nn4	40,568,356.000^** (17,839,289.000)
school_nn5	-18,838,280.000 (12,109,364.000)
hotel_nn1	-1,220,964.000 (4,501,937.000)
hotel_nn2	9,678,923.000 (12,125,473.000)
hotel_nn3	-2,871,359.000 (21,318,716.000)
hotel_nn4	-27,767,343.000 (31,010,023.000)
hotel_nn5	21,457,163.000 (18,334,682.000)
pctTotalVacant	963.610 (863.743)
pctRenterOccupied	6,812.634 (24,174.830)
pctWhite	19,574.130 (38,267.100)
pctPoverty	26,776.040 (81,935.820)
MedRent	-61.026^** (27.140)
MedHHInc	0.592^* (0.311)
Constant	118,214.900 (91,423.930)

Observations	2,378
R²	0.992
Adjusted R²	0.991
Residual Std. Error	196,411.800 (df = 2325)
F Statistic	5,280.829^*** (df = 52; 2325)

Note:	p<0.1; p<0.05; p<0.01

## Second model: 
M2 <- lm(SalePrice ~ ., data = st_drop_geometry(model.miami.training) %>% 
           dplyr::select(# internal features
                         SalePrice, Land, Bldg, Assessed, County.Taxable,
                         AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt, 
                         # amenity features
                         landmark_nn2, landmark_nn3, 
                         hospital_nn1, hospital_nn3, hospital_nn4,
                         college_nn1, school_nn3, school_nn4, school_nn5,
                         pctTotalVacant, pctRenterOccupied, MedRent, MedHHInc) 
)

stargazer(M2, type = "html", 
          title = "Table 4.1.2 Summary Statistics of Model 2 ",
          header = FALSE,
          single.row = TRUE)

**Table 4.1.2 Summary Statistics of Model 2**

	Dependent variable:

	SalePrice

Land	0.985^*** (0.021)
Bldg	1.082^*** (0.022)
Assessed	0.507^*** (0.110)
County.Taxable	-0.288^*** (0.106)
AdjustedSqFt	-118.404^*** (23.316)
LotSize	-13.279^*** (1.726)
Bed	-13,051.200^* (6,836.993)
Bath	17,439.450^** (7,183.896)
ActualSqFt	127.631^*** (18.325)
landmark_nn2	1,799,788.000 (3,559,741.000)
landmark_nn3	-2,635,938.000 (3,763,173.000)
hospital_nn1	-2,196,553.000^** (933,375.700)
hospital_nn3	3,016,274.000 (3,704,837.000)
hospital_nn4	-1,101,879.000 (3,527,266.000)
college_nn1	-1,973,853.000^*** (731,082.300)
school_nn3	-28,006,369.000^*** (7,261,776.000)
school_nn4	45,941,758.000^*** (15,793,927.000)
school_nn5	-17,237,124.000^* (9,831,561.000)
pctTotalVacant	669.337 (634.989)
pctRenterOccupied	14,496.700 (21,090.990)
MedRent	-46.688^** (20.841)
MedHHInc	0.541^** (0.244)
Constant	74,872.530^*** (28,563.910)

Observations	2,378
R²	0.992
Adjusted R²	0.991
Residual Std. Error	196,282.500 (df = 2355)
F Statistic	12,497.180^*** (df = 22; 2355)

Note:	p<0.1; p<0.05; p<0.01

# Third Model
M3 <- lm(SalePrice ~ ., data = st_drop_geometry(model.miami.training) %>% 
             dplyr::select(# internal features
               SalePrice, Land, Bldg, Assessed, County.Taxable,
               AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt, 
               # amenity features
               college_nn1,
               school_nn3, school_nn4, school_nn5,
               pctTotalVacant, MedRent, MedHHInc) 
)

stargazer(M3, type = "html", 
          title = "Table 4.1.3 Summary Statistics of Model 3 ",
          header = FALSE,
          single.row = TRUE)

**Table 4.1.3 Summary Statistics of Model 3**

	Dependent variable:

	SalePrice

Land	0.985^*** (0.021)
Bldg	1.081^*** (0.022)
Assessed	0.499^*** (0.110)
County.Taxable	-0.281^*** (0.106)
AdjustedSqFt	-117.790^*** (23.306)
LotSize	-13.032^*** (1.713)
Bed	-12,423.740^* (6,833.118)
Bath	18,002.580^** (7,169.642)
ActualSqFt	126.495^*** (18.312)
college_nn1	-2,236,855.000^*** (707,102.700)
school_nn3	-28,465,266.000^*** (6,919,330.000)
school_nn4	44,475,526.000^*** (15,632,117.000)
school_nn5	-15,908,357.000^* (9,572,297.000)
pctTotalVacant	1,037.946^* (585.099)
MedRent	-38.722^** (18.400)
MedHHInc	0.479^** (0.200)
Constant	67,019.810^*** (22,540.550)

Observations	2,378
R²	0.991
Adjusted R²	0.991
Residual Std. Error	196,340.900 (df = 2361)
F Statistic	17,172.940^*** (df = 16; 2361)

Note:	p<0.1; p<0.05; p<0.01

## Below is our baseline regression model
m2.training <- M3

4.2 Training Set Summary Results

The results of the regression on the training set are in the following table. The current model embraced 16 independent features and reached 0.992 R-square, indicating that 99% of the variance for SalePrice can be explained by this model.

# 2. Table of model summary on my training set
stargazer(m2.training, type = "html", 
          title = "Table 4.2.1 Summary Statistics of Baseline Model on Training Set",
          header = FALSE,
          single.row = TRUE)

**Table 4.2.1 Summary Statistics of Baseline Model on Training Set**

	Dependent variable:

	SalePrice

Land	0.985^*** (0.021)
Bldg	1.081^*** (0.022)
Assessed	0.499^*** (0.110)
County.Taxable	-0.281^*** (0.106)
AdjustedSqFt	-117.790^*** (23.306)
LotSize	-13.032^*** (1.713)
Bed	-12,423.740^* (6,833.118)
Bath	18,002.580^** (7,169.642)
ActualSqFt	126.495^*** (18.312)
college_nn1	-2,236,855.000^*** (707,102.700)
school_nn3	-28,465,266.000^*** (6,919,330.000)
school_nn4	44,475,526.000^*** (15,632,117.000)
school_nn5	-15,908,357.000^* (9,572,297.000)
pctTotalVacant	1,037.946^* (585.099)
MedRent	-38.722^** (18.400)
MedHHInc	0.479^** (0.200)
Constant	67,019.810^*** (22,540.550)

Observations	2,378
R²	0.991
Adjusted R²	0.991
Residual Std. Error	196,340.900 (df = 2361)
F Statistic	17,172.940^*** (df = 16; 2361)

Note:	p<0.1; p<0.05; p<0.01

broom::glance(m2.training) %>%
  kable(caption = 'Table 4.2.2 Summary of Model Evaluation Parameters') %>%
  kable_styling("striped", full_width = F)

Table 4.2.2 Summary of Model Evaluation Parameters
r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.9914805	0.9914227	196340.9	17172.93	0	16	-32347.84	64731.67	64835.61	9.101599e+13	2361	2378

4.3 MAE, MAPE on Test Set

The Mean Absolute Errors (MAE) and Mean Absolute Percent Errors (MAPE) are statistical measures of how accurate a forecast system is.Based on the following table, the MAE is 59645.31, the errors are accounted for 12.36%

# 3. MAE, MAPE on my test set

m2.miami.test <-
  model.miami.test %>% 
  mutate(Regression = "Baseline Regression",
         SalePrice.Predict = predict(m2.training, newdata = model.miami.test),
         SalePrice.Error = SalePrice - SalePrice.Predict,
         SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice) %>%
  filter(SalePrice < 5000000) 
  
test_GoodnessFit<-as.data.frame(cbind(mean(m2.miami.test$SalePrice.AbsError,na.rm = T),
                                      mean(m2.miami.test$SalePrice.APE,na.rm = T)))
colnames(test_GoodnessFit)<-c("Mean Absolute Errors (MAE)","MAPE")
kable(test_GoodnessFit,
      caption = 'Table RESULT 3. MAE and MAPE for Test Set') %>%
  kable_styling("striped", full_width = F)

Table RESULT 3. MAE and MAPE for Test Set
Mean Absolute Errors (MAE)	MAPE
59645.31	0.1236198

4.4 Cross-Validation

This section conducted cross-validation to assess how our model would generalize to other independent data sets, in which we could flag problem of overfitting or feature selection bias.

Both a summary table and distribution plots of R-square, RMSE, and MAE in each of 100 fold are shown below.

The variation of 100 folds can also be visualized with a histogram of across-fold MAE. Although most of the errors are clustering tightly together, some errors are scattered, indicating the inconsistency of the model prediction.

# 4. Cross-Validation Test on Model 2
## 4.1 use caret package cross-validation method
fitControl <- caret:: trainControl(method = "cv", 
                                   number = 100,
                                   savePredictions = TRUE)

set.seed(717)

m2.cv <- 
  caret::train(SalePrice ~ ., data = st_drop_geometry(miami.training) %>% 
                 dplyr::select(# internal features
                   SalePrice, Land, Bldg, Assessed, County.Taxable,
                   AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt, 
                   # amenity features
                   college_nn1,
                   school_nn3, school_nn4, school_nn5,
                   pctTotalVacant, MedRent, MedHHInc) ,
               method = "lm", 
               trControl = fitControl, 
               na.action = na.pass)

## 4.2 show RMSE, MAE, R^2
kable(m2.cv$resample,
          caption = 'Table RESULT 4. Cross-validation Test: Summary of RMSE, R Squared and MAE') %>%
  kable_styling("striped", full_width = F)

Table RESULT 4. Cross-validation Test: Summary of RMSE, R Squared and MAE
RMSE	Rsquared	MAE	Resample
198139.61	0.9853120	119795.03	Fold001
60086.39	0.9960435	48432.08	Fold002
122823.07	0.9900892	76756.16	Fold003
207097.01	0.9704187	96194.43	Fold004
162566.98	0.9895799	98348.59	Fold005
199639.82	0.9986516	95787.21	Fold006
731604.37	0.9949180	308338.47	Fold007
127088.59	0.9976789	72499.21	Fold008
196071.62	0.9895360	97195.16	Fold009
239853.33	0.9599551	110998.29	Fold010
186646.45	0.9458751	104977.37	Fold011
82553.65	0.9910639	60294.86	Fold012
77518.37	0.9974416	58943.17	Fold013
412016.70	0.9864821	150361.53	Fold014
205209.63	0.9710427	104502.83	Fold015
90037.17	0.9876723	61543.14	Fold016
138328.50	0.9949880	84990.71	Fold017
84309.07	0.9945610	60266.43	Fold018
104292.57	0.9485969	65590.16	Fold019
139049.40	0.9931974	84661.85	Fold020
213424.92	0.8892981	98464.62	Fold021
97822.97	0.9947580	64024.16	Fold022
129640.96	0.9617291	79505.35	Fold023
133704.88	0.9962616	70747.15	Fold024
178009.60	0.9994841	115080.18	Fold025
141117.31	0.9957801	87553.03	Fold026
596292.62	0.9927644	182950.18	Fold027
371061.62	0.8329185	136399.18	Fold028
154420.55	0.9829284	84976.60	Fold029
94905.69	0.9803618	59535.60	Fold030
92959.86	0.9825883	68315.44	Fold031
110161.30	0.9994238	73494.07	Fold032
187838.59	0.9973274	111943.09	Fold033
194452.12	0.9974584	102532.76	Fold034
137632.83	0.9875680	98998.44	Fold035
86774.46	0.9632703	70697.19	Fold036
139760.85	0.9947746	81720.61	Fold037
103178.62	0.9834567	76393.19	Fold038
74894.92	0.9997241	50002.23	Fold039
151937.04	0.9637194	83080.49	Fold040
105624.15	0.9840339	82284.21	Fold041
362108.14	0.9456223	155438.69	Fold042
136263.54	0.9755882	98364.83	Fold043
166337.59	0.9475920	112916.77	Fold044
105441.35	0.9991632	76498.76	Fold045
144510.48	0.9942572	91785.92	Fold046
105413.46	0.9966984	70204.28	Fold047
300114.49	0.9960530	134121.91	Fold048
171303.30	0.9976148	85137.17	Fold049
107595.16	0.9365308	81378.68	Fold050
122186.91	0.9986492	81565.15	Fold051
169460.14	0.9375060	108010.98	Fold052
215624.41	0.9653245	125246.01	Fold053
123012.37	0.9997520	79336.06	Fold054
80035.26	0.9946228	57943.16	Fold055
162030.20	0.9565660	91569.13	Fold056
141797.61	0.9892701	90537.01	Fold057
242441.20	0.9926873	93936.52	Fold058
94436.06	0.9777007	58001.11	Fold059
270521.03	0.9977142	107391.70	Fold060
85328.28	0.9926478	64009.44	Fold061
112068.01	0.9963280	67084.45	Fold062
70130.07	0.9647300	50401.73	Fold063
90925.92	0.9983749	64357.41	Fold064
110789.51	0.9687090	64734.13	Fold065
80713.56	0.9877941	62576.56	Fold066
250286.38	0.9825317	123288.27	Fold067
72278.11	0.9994246	57037.43	Fold068
225347.54	0.9413927	117897.92	Fold069
168131.35	0.9934126	80147.15	Fold070
152362.98	0.9412925	81758.69	Fold071
367494.49	0.9844215	128898.98	Fold072
91232.84	0.9859311	57671.03	Fold073
86841.99	0.9902169	60353.74	Fold074
101498.28	0.9646062	74243.32	Fold075
217979.38	0.9986084	132059.32	Fold076
120466.03	0.9970362	80202.54	Fold077
158587.14	0.9678984	89297.14	Fold078
121976.12	0.9976592	83994.63	Fold079
175696.23	0.9867851	102838.63	Fold080
109927.21	0.9970085	71705.27	Fold081
210105.20	0.9416603	119954.07	Fold082
313337.30	0.9949994	110378.88	Fold083
331910.00	0.9987171	157354.12	Fold084
367224.18	0.9917536	173486.56	Fold085
112781.76	0.9995143	82741.16	Fold086
141576.60	0.9709638	86515.32	Fold087
127408.69	0.9980478	80797.91	Fold088
160226.58	0.9642931	100449.11	Fold089
205442.75	0.9604512	113785.68	Fold090
75554.76	0.9893433	61585.62	Fold091
113371.01	0.9836034	72270.88	Fold092
233922.94	0.9916611	115189.21	Fold093
147770.83	0.9955132	85004.00	Fold094
270069.39	0.9994397	129313.17	Fold095
190135.49	0.9961020	128383.47	Fold096
159917.75	0.9963515	76691.39	Fold097
407381.55	0.9961371	143143.95	Fold098
78599.00	0.9729700	53994.23	Fold099
125683.41	0.9740752	86213.44	Fold100

m2.cv$resample %>% 
  pivot_longer(-Resample) %>% 
  mutate(name = as.factor(name)) %>% 
  ggplot(., aes(x = name, y = value, color = name)) +
  geom_jitter(width = 0.1) +
  facet_wrap(~name, ncol = 3, scales = "free") +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title = 'Cross-validation Test: Distribution of MAE, RMSE, R Squared\n',
       caption = "Figure RESULT 4.1") +
  plotTheme()

ggplot(data = m2.cv$resample) +
  geom_histogram(aes(x = m2.cv$resample$MAE), fill = 'orange') +
  labs(title="Distribution of Cross-validation MAE",
       subtitle = "K = 100\n",
       caption = "Figure RESULT 4.2") +
  xlab('MAE of Model 2') +
  ylab('Count') +
  plotTheme()

# If the model generalized well, the distribution of errors would cluster tightly together. 
# Instead, this range of errors suggests the model predicts inconsistently, and would likely be unreliable for predicting houses that have not recently sold.

4.5 Plot Predicted Prices

The Figure RESULT 5.1 below shows the predicted prices as a function of observed price. This plot indicates that our model can predict price perfectly since the regression line (shown in dark-blue) and the reference line (shown in orange) are completely overlapped, and almost all the house predictions are closely clustered along with the regression line.

# 5. Plot predicted prices as a function of observed prices
preds.train <- data.frame(pred   = predict(m2.training, model.miami.training),
                          actual = model.miami.training$SalePrice,
                          source = "Training Set")
preds.test  <- data.frame(pred   = predict(m2.training,newdata = model.miami.test),
                          actual = model.miami.test$SalePrice,
                          source = "Test Set")
preds <- rbind(preds.train, preds.test)

## Overall Plotting
ggplot(preds, aes(x = actual, y = pred, color = source)) +
  geom_point() +
  geom_smooth(method = "lm", color = "darkblue") +
  geom_abline(color = "orange") +
  coord_equal() +
  theme_bw() +
  labs(title = "Predicted Prices as a\nFunction of Observed Prices",
       subtitle = 'Blue line is model regression line\nwhile orange line is reference line\n',
       caption = 'Figure RESULT 5.1',
       x = "Observed Prices",
       y = "Predicted Prices") +
  theme(
    legend.position = "none"
  ) +
  plotTheme()

ggplot(preds, aes(x = actual, y = pred, color = source)) +
  geom_point() +
  geom_smooth(method = "lm", color = "darkblue") +
  geom_abline(color = "orange") +
  coord_equal() +
  theme_bw() +
  facet_wrap(~source, ncol = 2) +
  labs(title = "Predicted Prices as a Function of Observed Prices, by Sets",
       subtitle = 'Blue line is model regression line while orange line is reference line\n',
       caption = 'Figure RESULT 5.2',
       x = "Observed Prices",
       y = "Predicted Prices") +
  theme(
    legend.position = "none"
  ) +
  plotTheme()

4.6 Map of Test Set Residuals

In this section, we leveraged three different tools to examine if prediction errors display certain spatial patterns, and accordingly to determine whether we need to add more spatial features in the existing model.

Mapping residuals: In the map of the residuals on the test set, most of the errors are clustering in the Downtown Miami and Miami Beach area. There is no clear pattern of the error directions.
Plotting spatial lags: A spatial lag is a variable that averages the neighboring values of a location. The spatial lag of error plot shows that as home price errors increase, the nearby home price errors slightly decrease, indicating that our model errors are rarely spatially autocorrelated.
Moran’s I: Another approach for measuring spatial autocorrelation is Moran’s I. In this model Moran’s I is around 0.196, indicating a certain level of clustering on model errors. The frequency of all 999 randomly permuted I were plotted as a histogram with the Observed I indicated by the orange line (see Figure Result 6.3). That the observed I is higher than all of the 999 randomly generated I’s provides visual confirmation of spatial autocorrelation despite being slight. This suggests that variation in price is likely related to the spatial process has been omitted from our current model.

Therefore, we thought it was worthy to include neighborhood features in the model for the further test.

# 6. Residual Exploration
## 6.1 A map of residuals for test set
palette.res <- c('#ff0000', '#ff7878', '#ffffff', '#74d680', '#378b29')
map.res.test <- ggplot() + 
  geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
  geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
  geom_sf(data = m2.miami.test,aes(color = q5(SalePrice.Error)), show.legend = "point",size = 2) +
  scale_color_manual(values = palette.res,
                     labels = qBr(m2.miami.test,'SalePrice.Error'),
                     name = "Residuals") +
  labs(title = 'Map of residuals on test set\n',
       caption = 'Figure RESULT 6.1') +
  mapTheme() + 
  plotTheme()
map.res.test

## 6.2 Spatial lag in errors: average errors in five nearest neighbors
k_nearest_neighbors = 5

# average errors in five nearest neighbors
coords.test <- st_centroid(st_geometry(m2.miami.test), of_largest_polygon=TRUE)
neighborList.test <- knn2nb(knearneigh(coords.test, k_nearest_neighbors))
spatialWeights.test <- nb2listw(neighborList.test, style="W") 
m2.miami.test$lagPriceError <- lag.listw(spatialWeights.test,m2.miami.test$SalePrice.Error)

error.spatial.lag <- ggplot(m2.miami.test, aes(x = lagPriceError, y = SalePrice.Error)) +
  geom_point(colour = "black") +
  geom_smooth(method = "lm", se = FALSE, colour = "darkred") +
  labs(title = "Error on Test Set as a function of the Spatial Lag of Price\n",
       caption = "Figure RESULT 6.2",
       x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
       y = "Errors of Sale Price") +
  plotTheme()

error.spatial.lag

## 6.3 Moran's I Test
moranTest <- moran.mc(m2.miami.test$SalePrice.AbsError, 
                      spatialWeights.test, nsim = 999)

moran.plot <- ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange, I = 0.196",
       x="Moran's I",
       y="Count",
       caption="Figure RESULT 6.3") +
  plotTheme()

moran.plot

4.7 Map of Predicted Prices on the Whole Dataset

Applying the predictive model on the whole dataset, the distribution of sales price is shown in the map below. Sale prices are generally higher in Downtown Miami, Miami Beach and the North area. There is no significant difference between the pattern of the distribution of price predictions in the two datasets (topredict == 0 and topredict == 1).

# 7. Map of predicted values for the whole dataset
miami.sf$Predicted.Price<- predict(m2.training, newdata = miami.sf)

feature <- c("toPredict = 0", "toPredict = 1")
names(feature) <- c(0,1)

map.predicted.price <- ggplot() + 
  geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
  geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
  geom_sf(data = st_centroid(miami.sf),aes(color = q5(Predicted.Price)),size = .5) +
  scale_color_manual(values = palette,
                     labels = qBr(miami.sf,'Predicted.Price'),
                     name = "Price, $") +
  facet_wrap(~as.factor(toPredict),
             labeller = labeller(toPredict = feature)) +
  labs(title = 'Map of Predicted Sale Price\n',
       subtitle = '',
       caption = 'Figure RESULT 7') +
  mapTheme() + 
  plotTheme()


map.predicted.price

4.8 Map of MAPE by Neighborhood

Indicated in the summary table as well as the choropleth map below, MAPE varies across neighborhoods (only 22 neighborhoods left in the test set). In the test set, the model exhibited larger MAPE in Downtown and North Miami.

# 8. Map of MAPE by neighborhood
## 8.1 Statistic summary by neighborhood group
nhood_sum <- m2.miami.test %>% 
  group_by(LABEL) %>%
  summarize(meanPrice = mean(SalePrice, na.rm = T),
            meanPrediction = mean(SalePrice.Predict, na.rm = T),
            meanMAE = mean(SalePrice.AbsError, na.rm = T),
            meanMAPE = mean(SalePrice.APE, na.rm = T)) 

nhood_sum %>%
  st_drop_geometry %>%
  arrange(desc(meanMAPE)) %>% 
  kable(caption = 'Table RESULT 8. Map of MAPE by Neighborhood') %>% 
  kable_styling("striped", full_width = F)

Table RESULT 8. Map of MAPE by Neighborhood
LABEL	meanPrice	meanPrediction	meanMAE	meanMAPE
Belle Meade	525000.0	693827.2	168827.210	0.3215756
Flora Park	235000.0	173195.4	61804.585	0.2629982
Hadley Park	218181.8	220490.7	40196.438	0.2629211
Auburndale	315000.0	394983.8	79983.783	0.2515129
King Heights	247500.0	186553.2	60946.762	0.2426961
Douglas Park	330000.0	395485.9	65485.918	0.1984422
South Grove	1089285.7	1065505.8	196290.348	0.1943366
Edison	225000.0	245860.4	39177.641	0.1635606
Buena Vista West	282500.0	241985.2	40514.803	0.1379306
Flagami	348550.7	341875.8	41095.196	0.1176901
Shenandoah South	435000.0	414730.8	40233.809	0.0980084
Shenandoah North	425000.0	447634.1	35817.841	0.0978073
Silver Bluff	437500.0	468627.9	39720.576	0.0870146
NA	1542931.0	1573447.3	109346.042	0.0858543
Citrus Grove	316666.7	290469.0	26197.670	0.0856725
La Pastorita	380000.0	348617.1	31382.939	0.0825867
Liberty Square	220000.0	201958.8	18041.217	0.0820055
Northwestern Estates	190000.0	202313.3	12313.323	0.0648070
West Grove	550000.0	584857.9	34857.907	0.0633780
Roads	600000.0	565369.8	34630.229	0.0577170
North Grapeland Heights	290000.0	301803.1	11803.125	0.0407004
Shorecrest	638333.3	656533.8	18200.425	0.0315013
Santa Clara	220000.0	226119.9	6119.898	0.0278177
Melrose	230000.0	224225.4	5774.562	0.0251068
Baypoint	1800000.0	1783610.7	16389.297	0.0091052

## 8.2 Map of MAPE by neighborhood

miami.neigh.sum <- miami.neigh %>%
  left_join(st_drop_geometry(nhood_sum), by = "LABEL") %>%
  mutate(meanMAPE.expanded = meanMAPE * 100)
  
            
map.MAPE.nhood <- ggplot() + 
  geom_sf(data = miami.neigh.sum,
          aes(fill = q5(meanMAPE.expanded))) + 
  geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
  scale_fill_manual(values = palette,
                     labels = qBr(miami.neigh.sum,'meanMAPE.expanded'),
                     name = "MAPE\n(timed by 100\nfor better display)") +
  labs(title = 'Map of MAPE on Test Set, by Neighborhood\n',
       subtitle = '',
       caption = 'Figure RESULT 8') +
  mapTheme() + 
  plotTheme()

map.MAPE.nhood

4.9 Plot MAPE by Neighborhood

This section plotted MAPE by neighborhood as a function of mean price by neighborhood. Neighborhoods with mean sale price below or around 500,000 have higher MAPE, while MAPE in neighborhoods with higher mean sale price is generally lower. This finding indicates that most errors occurred in the neighborhoods where house price was comparatively moderate.

# 9. Scatter plot of MAPE by neighborhood as a function of mean price by neighborhood

MAPE.nhood.plot <-ggplot()+
  geom_point(data = nhood_sum, aes(x = meanPrice, y = meanMAPE)) +
  stat_smooth(data = nhood_sum, aes(x = meanPrice, y = meanMAPE), method = "loess", se = FALSE, size = 1, colour="red") + 
  labs(title = "MAPE by Neighborhood as a Function of\nMean Price by Neighborhood\n",
       caption = 'Figure RESULT 9') +
  xlab('Mean Price by Neighborhood') +
  ylab('Mean MAPE by Neighborhood') +
  plotTheme()

MAPE.nhood.plot

4.10 Generalizability Examination

This section gathered Census data to test how well baseline model generalizes to different group contexts, such as race group and income level.

4.10.1 Context Group Building

First of all, we selected four census variables to display the spatial pattern of each group: 1. Majority White vs. Majority Non-White: divided whether the proportion of white population is accounted for more than 50% of total population or not. 2. High Income vs. Low Income: divided whether the median household income in certain census tract is above or below the average level. 3. Majority Vacant vs. Majority Non-Vacant: divided whether the proportion of vacant unit is accounted for more than 50% of total vacant unit or not. 4. Majority Renter Occupied vs. Majority Non-Renter Occupied: divided whether the proportion of renter occupied units is accounted for more than 50% of total units or not.

Based on the maps shown below, race group and income level have displayed the obvious spatial pattern, and thus, we determined to generalize our model on the two context groups.

# 10. Model generalizability on census data
## 10.1 Group Plotting
census <-   miami.training %>% # miami.training contains all features we used including census data and geometry information
   mutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
          incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High Income", "Low Income"),
          vacantContext = ifelse(pctTotalVacant > .5, "Majority Vacant", "Majority Non-Vacant"),
          pctRenterContext = ifelse(pctPoverty > .5, "Majority Renter Occupied", "Majority Non-Renter Occupied")) %>%
  select(raceContext,incomeContext,vacantContext,pctRenterContext)


census <- census %>%
  st_transform(st_crs(tracts17))


grid.arrange(ncol = 2,
             ggplot() + 
               geom_sf(data = st_union(miami.neigh), fill = '#E5E5E5') + 
               geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
               geom_sf(data = na.omit(census), aes(color = raceContext),
                       show.legend = "point", size= 1) +
               scale_color_manual(values = c("#05A167", "#6897BB"), name="Race Context") +
               labs(title = "Race Context of House\n",
                    caption = 'Figure RESULT 10.1') +
               mapTheme() + 
               theme(legend.position="bottom") +
               plotTheme(),
             
             ggplot() + 
               geom_sf(data = st_union(miami.neigh), fill = '#E5E5E5') + 
               geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
               geom_sf(data = na.omit(census), aes(color = incomeContext),
                       show.legend = "point", size= 1) +
               scale_fill_manual(values = c("#05A167", "#6897BB"), name="Income Context") +
               labs(title = "Income Context of House\n",
                    caption = 'Figure RESULT 10.2') +
               mapTheme() + 
               theme(legend.position="bottom") +
                plotTheme())

grid.arrange(ncol = 2,
             ggplot() + 
               geom_sf(data = st_union(miami.neigh), fill = '#E5E5E5') + 
               geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
               geom_sf(data = na.omit(census), aes(color = vacantContext),
                       show.legend = "point", size= 1) +
               scale_color_manual(values = c("#05A167", "#6897bb"), name="Vacant Context") +
               labs(title = "Vacant Context of House\n",
                    caption = 'Figure RESULT 10.3') +
               mapTheme() + 
               theme(legend.position="bottom") +
               plotTheme(),
             ggplot() + 
               geom_sf(data = st_union(miami.neigh), fill = '#E5E5E5') + 
               geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
               geom_sf(data = na.omit(census), aes(color = pctRenterContext),
                       show.legend = "point", size= 1) +
               scale_fill_manual(values = c("#05A167", "#6897BB"), name="Renter Proportion Context") +
               labs(title = "Renter Proportion Context\n",
                    caption = 'Figure RESULT 10.4') +
               mapTheme() + 
               theme(legend.position="bottom") +
                plotTheme())

4.10.2 Model Modifying with Neighborhood Effects

Existing findings have indicated that neighborhood could also be a potential feature for model building. Hence, this section built a new model called Neighborhood Effects model based on the baseline model by adding neighborhood feature LABEL.

MAE, and MAPE of both regression models are shown below. The finding was not intuitive that the new model performed much better in terms of MAE but slightly worse in terms of MAPE.

Hence, it is imperative to further examine how well both models could be generalized in race context and income level context, and then to determine which model should be remained.

### Based on the group distribution, we decided to select race group and income group

## 10.2 Neighborhood Fixed Effect
### 10.2.1 Model building
m2.nhood <- lm(SalePrice ~ ., data = as.data.frame(model.miami.training) %>% 
     dplyr::select(# spatial features
                   LABEL,
                   # internal features
                   SalePrice, Land, Bldg, Assessed, County.Taxable,
                   AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt, 
                   # amenity features
                   college_nn1,
                   school_nn3, school_nn4, school_nn5,
                   pctTotalVacant, MedRent, MedHHInc) 
)

m2.miami.test.nhood <-
  model.miami.test %>% 
  mutate(Regression = "Neighborhood Effects",
         SalePrice.Predict = predict(m2.nhood, model.miami.test),
         SalePrice.Error = SalePrice - SalePrice.Predict,
         SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
         SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice) %>%
  filter(SalePrice < 5000000)

### 10.2.2. Model combination and comparison
bothRegressions <- 
  rbind(
    dplyr::select(m2.miami.test, starts_with("SalePrice"), Regression, LABEL) ,
    dplyr::select(m2.miami.test.nhood, starts_with("SalePrice"), Regression, LABEL) )

st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -LABEL) %>%
  filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
  group_by(Regression, Variable) %>%
  summarize(meanValue = mean(Value, na.rm = T)) %>%
  spread(Variable, meanValue) %>%
  kable(caption = 'Table RESULT 10.1.\nSummary of MAE, and MAPE by Regression Models') %>%
  kable_styling("striped", full_width = F)

Table RESULT 10.1. Summary of MAE, and MAPE by Regression Models
Regression	SalePrice.AbsError	SalePrice.APE
Baseline Regression	59645.31	0.1236198
Neighborhood Effects	47095.86	0.1246893

4.10.3 Generalizability on Context Groups

Regardless of any context group, our new model with neighborhood features do not perform better than the baseline model in terms of MAPE. Hence, we determined to exclude neighborhood features in the model.

Despite higher MAPE in the Majority Non-White group, the baseline model is generalized well across context groups with MAPE close to the one generated on the test set (MAPE on test set is 12.362% for your reference).

### 10.2.3 Generalizability in Different Groups
census <-   miami.training %>% # miami.training contains all features we used including census data and geometry information
   mutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
          incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High Income", "Low Income"),
          vacantContext = ifelse(pctTotalVacant > .5, "Majority Vacant", "Majority Non-Vacant"),
          pctRenterContext = ifelse(pctPoverty > .5, "Majority Renter Occupied", "Majority Non-Renter Occupied")) %>%
  select(raceContext,incomeContext,vacantContext,pctRenterContext)

race.group <- st_join(bothRegressions, census) %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Table  RESULT 10.2.\nTest set MAPE by Neighborhood Racial Context") %>% 
  kable_styling("striped", full_width = F)

race.group

Table RESULT 10.2. Test set MAPE by Neighborhood Racial Context
Regression	Majority Non-White	Majority White
Baseline Regression	19%	11%
Neighborhood Effects	20%	11%

income.group <- st_join(bothRegressions, census) %>% 
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption  = " RESULT 10.3.\nTest set MAPE by Neighborhood Income Context") %>% 
  kable_styling("striped", full_width = F)

income.group

RESULT 10.3. Test set MAPE by Neighborhood Income Context
Regression	High Income	Low Income
Baseline Regression	11%	13%
Neighborhood Effects	20%	13%

4.10.4 Final Model

Eventually, we settled down with a model built with the original 16 features used in the baseline model, and conducted prediction on the secret data set. Summary statistics of predicted sale price is attached below.

Table RESULT 10.4 Summary Statistics

N	Mean	St.Dev.	Min.	Max.
876	1044617	1217748	69398	13970170

MODEL <- lm(SalePrice ~ ., data = as.data.frame(miami.training) %>% 
                 dplyr::select(
   
                   # internal features
                   SalePrice, Land, Bldg, Assessed, County.Taxable,
                   AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt, 
                   # amenity features
                   college_nn1,
                   school_nn3, school_nn4, school_nn5,
                   pctTotalVacant, MedRent, MedHHInc) 
)


secret_data <- miami.test%>% 
                 dplyr::select(Folio,
                   # internal features
                   SalePrice, Land, Bldg, Assessed, County.Taxable,
                   AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt, 
                   # amenity features
                   college_nn1,
                   school_nn3, school_nn4, school_nn5,
                   pctTotalVacant, MedRent, MedHHInc)
MedRent.no.na <- mean(secret_data$MedRent,na.rm = TRUE)
secret_data[is.na(secret_data)] <- MedRent.no.na


secret_preds <- predict(MODEL, newdata = secret_data)
output_preds <- data.frame(prediction = secret_preds, Folio = secret_data$Folio, team_name = "MySun") 

# write.csv(output_preds, "MySun.csv")

V. Discussion

Accuracy and Generalizability: Overall, the predictive model is effective in terms of predicting sale price in the City of Miami and Miami Beach. The model successfully captured more than 99% of variation in prices we were required to predict that indicated a big victory! For the sake of accuracy, the MAE on the test set is less than 60k with 12.36% MAPE. As for the generalizability, MAPEs are no larger than 13% regardless of context groups to which the model was applied. A potential weakness of the model should be addressed is if the generalizability test result was overestimated or manipulated since proportion of white population and median household income were both features in the model.
Feature Effect: The model is built with 9 internal characteristic features: Land, Bldg, Assessed, County.Taxable, AdjustedSqFt, LotSize, Bed, Bath, and ActualSqFt, 4 amenity features: college_nn1, school_nn3, school_nn4, andschool_nn5, and 3 spatial featurespctTotalVacant, MedRent, and MedHHInc. The highlight of this model is including vacant unit proportion on census tract level as one of the features to predict sale price. We regarded it as an interesting addition to display the spatial environment the house unit is located at. The accessibility of education resources is an easy-ignored but powerful feature to show how amenities affect the local house price. Rather than only focus on the entertainment demand, we cared more about if the local residents’requirements on education have been refilled. Evidence has indicated that such a consideration is worthy. Secondly, it seems counterintuitive that we did not include the distance to coastline as a feature in the model, but we think proximity to water is just the surface outcome; it is unable to explain why high price is cluttered inside the City of Miami boundary despite being far from water. Besides, although we eventually excluded neighborhood features from the model, this spatial pattern still deserves attention when it comes to larger dataset size. Additionally, although we included far more amenity features in the initial stage, such as the proximity to landmarks, hotels, shopping malls for entertainment, the proximity to hospitals for healthcare service, and the exposure to certain crime, those features were eventually excluded from the model based on the series of rigorous model evaluation. However, we still believe it is worthy taking those features into consideration and they might be helpful when being used in other regression models.
Errors and Spatial Variations: Generally speaking, larger MAEs occurred in the Downtown Miami, Miami Beach area and north region of Miami City, while the model performed better in the rest of areas. One guess is potential spatial features that are directly related to the three areas have been somehow missed out. But we still feel confident about the model based on the examinations above.

VI. Conclusion

If all the model examination and interpretation have been conducted correctly as shown above, we would like to recommend this model to Zillow as it incorporates some Miami’s local intelligent features that were absent in the previous models and performed comparatively well on either accuracy or generalizability.

To further reduce the error rate, more amenity features such as local restaurants and bars (data unavailable so far), financial related agencies, public transportation and spatial features such as zoning should be taken into consideration. And more context studies on Miami downtown and north region are also highly needed since we would like to extract some similarities between the two regions that would be the magic ingredients of cooking models in the future.