Zillow is one of the most popular real estate databases online. One of Zillow’s key features is its Zestimates, a popular consumer tool for seeing how much homes are worth. These estimates are based on information from sources such as comparable sales and public data. According to Zillow, most Zestimates are “within 10 percent of the selling price of the home.” Although we cannot expect perfectly accurate estimates from any algorithm, it is important to improve the accuracy in many possible ways. This project focuses on one possible improvement: incorporating enough local factors to improve the accuracy of this predictive algorithm.
Miami is a global tourist destination, well known for the local beaches, such as Miami Beach and South Beach, which best represent the uniqueness of Miami. The tourism and real estate industry have been dominating Miami’s economy and mutually boosting each other. There are other aspects that make the real estate market unique and make sales price prediction in Miami trickier than in other cities.
Renters form 70% of the population. In Miami, renters form a large percentage of the population. Partly due to the mobility of work. Most of the people work in temporary or seasonal jobs specifically in the tourism industry. Moreover, rental housing is more affordable than buying a house due to the overall high property sale prices in Miami. Rental market can affect the sales market. And that is why a lot of investors are buying single-family homes and then renovating them into multiple rentable units.
Foreign buyers pricing out locals out of Miami. And as South Florida continues to grow into an international destination beyond vacations and beaches, locals must compete with cash-rich foreign and out-of-state buyers, whose growing presence keeps driving up land and housing prices while local wages remain stagnant.
Florida flooding could devalue real estate. According to the Jupiter Intelligence report, by 2050, annual flooding damage county-wide in Miami-Dade County is expected to roughly double, leading to shortages in affordable insurance coverage and real estate market instability. On the list of the 20 urban areas in America that will suffer the most from rising seas, Florida has five: St Petersburg, Tampa, Miami, Miami Beach and Panama City. In 2016, Zillow predicted that one out of eight homes in Florida would be underwater by 2100, a loss of $413bn in property.
However, with the understanding of incorporating local intelligence into the model, this is no easy task to do. On the one hand, some features can be hard to quantify. On the other hand, open datasets might not be available. For example, case-level crime data is unavailable on the Miami Open Data Portal, and it is almost intuitive to take crime into account when predicting housing value of every city. Thus, seeking other ways to illustrate this feature is crucial for carrying on with our analysis on the Miami housing market.
This project adopted geospatial machine learning techniques, split housing sale price dataset into training and test set, incorporated local intelligence features from Census data and Miami Open Data portal, and trained the final model to achieve the lowest possible error rate.
After several rounds of testing, the final model includes important features such as school
location, hospital
location, and vacant rate
in census tract. Applying the final model to the test set, the Mean Absolute Errors (MAE) of the model is 59645, and the Mean Absolute Percent Errors (MAPE) is 0.12
# 1. load Libraries
library(sf)
library(tidyverse)
# install.packages('mapview')
library(mapview)
library(spdep)
library(caret)
library(ckanr) # for opening data APIs built on CKAN technology
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools) # for regression model plots
library(stargazer) # for creating table
library(broom)
library(tufte)
library(rmarkdown)
library(kableExtra)
library(tidycensus)
# 2. Identify functions
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 15,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 14))
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(color = "darkred", size=15, face="bold"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("#E5E5E5", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "#E5E5E5", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
#3. Load Quantile break functions
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]],0),
c(.01,.2,.4,.6,.8), na.rm=T))
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]]), digits = 3),
c(.01,.2,.4,.6,.8), na.rm=T)
}
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
# Load hexadecimal color palette
palette <- c('#feedde', '#fdbe85', '#fd8d3c', '#e6550d', '#a63603')
# for calculating average nearest neighbor distance.
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <- as.matrix(measureFrom)
measureTo_Matrix <- as.matrix(measureTo)
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
dplyr::summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull() # pull() is similar to $. It's mostly useful because it looks a little nicer in pipes, it also works with remote data frames, and it can optionally name the output.
return(output)
}
In this section, we loaded necessary libraries, created plot theme options and map theme options, and identified functions of quintile breaks, and average nearest neighbor distance for further analysis.
In this section, we loaded Miami data and other complementary datasets, conducted feature selection and engineering. Besides, for other analysis, we separate the whole dataset into miami.training
and miami.test
in advance.
Five categorical datasets were used in this project:
Miami House Price and Internal Characteristics: basic geometric dataset provided by the course in advance.
Miami Base Data: geometric data containing the geographic information of Miami and Miami Beach.
Miami Neighborhood Shapefiles: digital geographic dataset contains polygon features representing neighborhood areas within the City of Miami as well as Miami Beach, Florida.
Miami Beach Neighborhood: As an alternative way to using neighborhood data of Miami Beach, we used Miami Municipal Boundary data that is a polygon feature class of municipal boundaries within Miami-Dade County, and specifically filter geometric data in Miami Beach.
Census Data: demographic variables from the ACS 2017 for census tracts in Miami-Dade County. We specifically selected 8 variables in the analysis:
TotalPop
: Total population in each census tractWhites
: White population in each census tractMedHHInc
: Median household income in each census tractMedRent
: Median Rent for properties in each census tractTotalPoverty
: Population living under the level of poverty in each census tractTotalVacant
: Total vacant unit in each census tractTotalUnit
: Total housing units in each census tractRenterOccpied
: Total householder lived in renter-occupied housing unitsWith variables above, we created the four features to be used in the model building.
pctWhite
: white population proportion in each census tractpctPoverty
: Poverty population proportion in each census tractpctTotalVacant
: Vacant unites proportion in each census tractpctRenterOccupied
: Proportion of householder living in renter-occupied housing units in each census tractNotice: Those vacant related variables are vital features we explored that would significantly improve our model and powerfully predict price. Detailed information will be provided in the following analysis.
Amenity Data: compilatory data chosen for describing amenities and public service of Miami-Dade County. The datasets we chose encompasses crime rate, education opportunity, healthcare service and entertainment accessibility. See each dataset below:
Sexual Predator: A point feature class of Sexual Offenders and Predators within Miami-Dade County that is used by the Sexual Offender/Predator Residence Search Internet application.
Public School: A point feature class of the Miami-Dade County Public Schools facilities.
College, University: A point feature class of colleges and universities within Miami-Dade County.
Hospital: A point feature class of the Hospital facilities within Miami-Dade County.
Landmark: A point feature class of landmarks within Miami-Dade County.
Hotel, Motel, Inn: A point feature class of Hotel, Motel and Inn facilities within Miami-Dade County.
Shopping Mall: A point feature class that contains the locations of major shopping malls within Miami-Dade County. Shopping Mall is considered a large retail complex containing a variety of stores and often restaurants in which one or more buildings form a complex of shops representing merchandisers with interconnecting walkways that enable customers to walk from unit to unit.
# I. Base data
## House Price & internal characteristics
# miami.sf <- st_read('C:/Users/Yiming Ma/Documents/UPenn/MUSA508/hw/hw2/studentsData.geojson')
miami.sf <- st_read('/Users/penguin/Box Sync/GitHub/MUSA508-Midterm/studentsData.geojson')
## Miami base data
miami.base <-st_read("https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson") %>%
filter(NAME %in% c("MIAMI", "MIAMI BEACH"))
## Neighborhood data
miami.neigh <- st_read('https://opendata.arcgis.com/datasets/2f54a0cbd67046f2bd100fb735176e6c_0.geojson')
# plot(miami.neigh)
## Miami beach neighborhood data
miami.beach <- st_read('https://opendata.arcgis.com/datasets/5ece0745e24b4617a49f2e098df8117f_0.geojson') %>%
filter(NAME == "MIAMI BEACH")
# Combine Miami beach neighborhood data and Miami city neighborhood data together
index <- miami.sf %>%
rowid_to_column(var = "rowIndex")
index.neigh <-st_join(index,miami.neigh) %>%
distinct(rowIndex, .keep_all = TRUE) %>%
select(-Shape_STAr,-Shape_STLe,-Shape__Area,-Shape__Length)
index.beach <- st_join(index,miami.beach) %>%
filter(NAME == "MIAMI BEACH") %>%
distinct(rowIndex, .keep_all = TRUE) %>%
rename(LABEL = NAME) %>%
select(-MUNICID,-MUNICUID,-CREATEDBY,-CREATEDDATE,-MODIFIEDBY,
-MODIFIEDDATE,-SHAPE_Length,-SHAPE_Area,-FIPSCODE)
### Create final data we will work on!
miami.sf <- rbind(index.neigh, index.beach) %>%
distinct(rowIndex, .keep_all = TRUE)
### check neighborhood information
miami.neigh.num <- as.data.frame(table(miami.sf$LABEL))
## Census Tract data
## View(load_variables(2017,'acs5',cache = TRUE))
tracts17 <-
get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E",
"B19013_001E","B25058_001E",
"B06012_002E",
# vacant variables
"B25002_003E",
"B25004_002E","B25004_003E",
"B25004_004E","B25004_005E",
# total housing unit
"B25001_001E",
# renter occupied
'B08137_003E'),
year=2017, state= 12, county= 086, geometry=T, output="wide") %>%
st_transform('EPSG:2236') %>%
rename(TotalPop = B25026_001E,
Whites = B02001_002E,
MedHHInc = B19013_001E,
MedRent = B25058_001E,
TotalPoverty = B06012_002E,
TotalVacant = B25002_003E,
ForRent = B25004_002E,
ForRentVac = B25004_003E,
ForSale = B25004_004E,
ForSaleVac = B25004_005E,
TotalUnit = B25001_001E,
RenterOccupied = B08137_003E
) %>%
dplyr::select(-NAME, -starts_with("B")) %>% #-starts_with("B") awesome!
mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop,0),
pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
pctTotalVacant = ifelse(TotalUnit > 0, TotalVacant / TotalUnit * 100, 0),
TotalOccupied = TotalUnit - TotalVacant,
pctRenterOccupied = ifelse(TotalOccupied >0, RenterOccupied/TotalOccupied, 0)) %>%
dplyr::select(-Whites, -TotalPoverty)
projected.tracts17 <-
tracts17 %>%
st_transform(st_crs(miami.sf))
miami.sf <-st_join(miami.sf, projected.tracts17) %>%
distinct(rowIndex, .keep_all = TRUE)
## Amenity Data
## 1. Miami landmark
miami.landmark <- st_read('https://opendata.arcgis.com/datasets/70a14825e66f4f0eb28d2a9cceba1761_0.geojson')
miami.landmark.sf <- miami.landmark %>%
select(geometry) %>%
na.omit() %>%
distinct()
## 2. Miami Shopping Mall
miami.mall <- st_read('https://opendata.arcgis.com/datasets/cb24d578246647a9a4c57bbd80c1caa8_0.geojson')
miami.mall.sf <- miami.mall %>%
select(geometry) %>%
na.omit() %>%
distinct()
## 3. Miami Sexual Predator
miami.sexual <- st_read('https://opendata.arcgis.com/datasets/f8759d722aeb4198bfe7c4ad780604d2_0.geojson')
miami.sexual.sf <- miami.sexual %>%
select(geometry) %>%
na.omit() %>%
distinct()
## 4. School data
miami.school <- st_read('https://opendata.arcgis.com/datasets/d3db0fce650d4e40a5949b0acae6fe3a_0.geojson')
miami.school.sf <- miami.school %>%
select(geometry) %>%
na.omit() %>%
distinct()
## 5. Hotel data
miami.hotel <- st_read('https://opendata.arcgis.com/datasets/d37bbc15e7304b4ca4607783283147b7_0.geojson')
miami.hotel.sf <- miami.hotel%>%
select(geometry) %>%
na.omit() %>%
distinct()
## 6. College data
miami.college <- st_read('https://opendata.arcgis.com/datasets/7db056c406b943dc8f3f377b99d77588_0.geojson')
miami.college.sf <- miami.college%>%
select(geometry) %>%
na.omit() %>%
distinct()
## 7. Hospital data
miami.hospital <- st_read('https://opendata.arcgis.com/datasets/0067a0e8b40644f980afa23ad34c32c4_0.geojson')
miami.hospital.sf <- miami.hospital %>%
select(geometry) %>%
na.omit() %>%
distinct()
In this section, we created some new features to describe the amenities and public services in Miami-Dade County, as well as to each home sale observation’s internal characteristics.
First of all, we were aimed at creating features to measure the extent to which buyers can assess multiple public services, including hospitals, shopping malls, landmarks, hotels, motels and inns, and education resources. To utmost avoid scale bias triggered by setting arbitrary areal unit or fixed buffer distance of each home sale observation, we chose to calculate the average nearest neighbor distance from each home sale to its k nearest neighbor public services. Accordingly, the smaller the value is, the more possibility that the observation can access specific service. To better compare each feature’s impact, we set the k from 3 to 5.
By leveraging the exactly same approach, we also created a new feature to measure the extent to which buyers can expose sexual assaults.
Secondly, we added four more features to indicate the duration each house was built, effectively used, and if the house was built with a pool or a patio.
## 1. Landmark
miami.sf <-
miami.sf %>%
mutate(
landmark_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 1),
landmark_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 2),
landmark_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 3),
landmark_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 4),
landmark_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.landmark.sf)), 5)) %>%
distinct(rowIndex, .keep_all = TRUE)
## 2. Shopping mall
miami.sf <-
miami.sf %>%
mutate(
mall_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 1),
mall_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 2),
mall_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 3),
mall_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 4),
mall_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.mall.sf)), 5)) %>%
distinct(rowIndex, .keep_all = TRUE)
## 3. Sexual Assaults
miami.sf <-
miami.sf %>%
mutate(
sexual_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 1),
sexual_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 2),
sexual_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 3),
sexual_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 4),
sexual_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.sexual.sf)), 5)) %>%
distinct(rowIndex, .keep_all = TRUE)
## 4. Hotel Access
miami.sf <-
miami.sf %>%
mutate(
hotel_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 1),
hotel_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 2),
hotel_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 3),
hotel_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 4),
hotel_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hotel.sf)), 5)) %>%
distinct(rowIndex, .keep_all = TRUE)
## 5. School Access
miami.sf <-
miami.sf %>%
mutate(
school_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 1),
school_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 2),
school_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 3),
school_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 4),
school_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.school.sf)), 5)) %>%
distinct(rowIndex, .keep_all = TRUE)
## 6. College Access
miami.sf <-
miami.sf %>%
mutate(
college_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 1),
college_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 2),
college_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 3),
college_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 4),
college_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.college.sf)), 5)) %>%
distinct(rowIndex, .keep_all = TRUE)
## 7. Hospital Access
miami.sf <-
miami.sf %>%
mutate(
hospital_nn1 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 1),
hospital_nn2 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 2),
hospital_nn3 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 3),
hospital_nn4 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 4),
hospital_nn5 = nn_function(st_coordinates(st_centroid(miami.sf)), st_coordinates(st_centroid(miami.hospital.sf)), 5)) %>%
distinct(rowIndex, .keep_all = TRUE)
## 8. Internal characteristics
miami.sf <-
miami.sf %>%
mutate(pool_house = ifelse(str_detect(XF1, "Pool"), 1, 0),
patio_house = ifelse(str_detect(XF1, "Patio"), 1, 0),
Age = 2020- YearBuilt,
Age.effective = 2020- EffectiveYearBuilt)
As indicated earlier, we separated the whole dataset (3503 observations in total) into miami.training
and miami.test
in advance for model building in next section. So far, we have 2,627 home sale observations for working on, while we used the left 876 observations to test model for competition.
As required, we provided three tables of summary statistics with variable descriptions for sorting features based on their categories below.
In total, we took 62 features into consideration in the initial stage.
19 features (including SalePrice
came from the category “internal characteristics”36 features came from the category “amenity/public service”.For spatial features, we decided to continue to stick on neighborhood features as well as 6 more features extracted from census data, and summarize house number in each neighborhood for better understanding. In this analysis, we regarded census data as indicators to show the house unit spatial environment.
Detailed summary statistics are provided below.
## All potential features
all.feature.list <- miami.training %>%
dplyr:: select(-saleDate,-saleType,-saleQual,-saleYear,-Property.Address,
-Year,-WVDB,-HEX,-GPAR,-County.2nd.HEX,-County.Senior,-County.LongTermSenior,
-County.Other.Exempt,-City.2nd.HEX,-City.Senior,-City.LongTermSenior,
-City.Other.Exempt,-MillCode,-Land.Use,-Owner1,-Owner2,-Mailing.Address,
-Mailing.City,-Mailing.State,-Mailing.Zip,-Mailing.Country,
-starts_with("Legal"), -YearBuilt, -EffectiveYearBuilt, -X, -FID, -starts_with("Shape_"), -Folio,
-XF1, -XF2, -XF3) %>%
st_drop_geometry()
## 2.1 Feature: Internal Characteristics
internal.feature.list <- all.feature.list %>%
dplyr:: select(SalePrice, Land, Bldg, Total, Assessed,County.Taxable,City.Taxable,AdjustedSqFt,
LotSize, Bed, Bath, Stories, Units, LivingSqFt, ActualSqFt, Age,Age.effective,
pool_house, patio_house)
stargazer(internal.feature.list, type = "html",
title = "Table DATA 2.1 Summary Statistics of Internal Characteristics ",
header = FALSE,
single.row = TRUE)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
SalePrice | 2,627 | 974,540.600 | 2,047,999.000 | 12,500 | 299,000 | 826,250 | 27,750,000 |
Land | 2,627 | 485,089.200 | 985,245.200 | 27,050 | 134,957 | 457,257.5 | 20,972,640 |
Bldg | 2,627 | 289,575.800 | 838,520.000 | 1,000 | 79,941 | 203,242 | 15,175,769 |
Total | 2,627 | 774,665.000 | 1,627,564.000 | 55,159 | 233,916 | 687,787 | 21,974,519 |
Assessed | 2,627 | 737,515.200 | 1,600,859.000 | 29,137 | 217,670.5 | 628,332.5 | 21,974,519 |
County.Taxable | 2,627 | 713,294.700 | 1,603,350.000 | 0 | 192,542.5 | 607,352 | 21,974,519 |
City.Taxable | 2,627 | 713,294.700 | 1,603,350.000 | 0 | 192,542.5 | 607,352 | 21,974,519 |
AdjustedSqFt | 2,627 | 2,044.029 | 1,492.860 | 331 | 1,248 | 2,251 | 18,006 |
LotSize | 2,627 | 7,681.807 | 4,748.604 | 1,250 | 5,400 | 7,931.2 | 80,664 |
Bed | 2,627 | 2.992 | 1.100 | 0 | 2 | 3 | 13 |
Bath | 2,627 | 2.054 | 1.316 | 0 | 1 | 2 | 12 |
Stories | 2,627 | 1.192 | 0.432 | 0 | 1 | 1 | 4 |
Units | 2,627 | 1.006 | 0.078 | 1 | 1 | 1 | 2 |
LivingSqFt | 2,627 | 1,971.459 | 1,429.946 | 288 | 1,194.5 | 2,215 | 18,006 |
ActualSqFt | 2,627 | 2,324.683 | 1,809.884 | 388 | 1,383 | 2,544.5 | 20,192 |
Age | 2,627 | 68.772 | 22.720 | 1 | 67 | 82 | 115 |
Age.effective | 2,627 | 47.199 | 26.669 | 1 | 24 | 71 | 115 |
pool_house | 2,627 | 0.255 | 0.436 | 0 | 0 | 1 | 1 |
patio_house | 2,627 | 0.178 | 0.383 | 0 | 0 | 0 | 1 |
## 2.2 Feature: Amenities/Public Services
amenity.feature.list <- all.feature.list %>%
dplyr:: select(starts_with('landmark_'),starts_with('mall_'),starts_with('hospital_'),
starts_with('college_'),starts_with('school_'),starts_with('hotel_'),
pctTotalVacant, pctRenterOccupied, pctWhite,pctPoverty, MedRent ,MedHHInc)
stargazer(amenity.feature.list, type = "html",
title = "Table DATA 2.2 Summary Statistics of Amenities/Public Services ",
header = FALSE,
single.row = TRUE)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
landmark_nn1 | 2,627 | 0.015 | 0.009 | 0.001 | 0.009 | 0.019 | 0.052 |
landmark_nn2 | 2,627 | 0.019 | 0.011 | 0.002 | 0.012 | 0.023 | 0.053 |
landmark_nn3 | 2,627 | 0.022 | 0.012 | 0.003 | 0.014 | 0.026 | 0.058 |
landmark_nn4 | 2,627 | 0.024 | 0.012 | 0.004 | 0.016 | 0.027 | 0.062 |
landmark_nn5 | 2,627 | 0.025 | 0.012 | 0.006 | 0.017 | 0.029 | 0.065 |
mall_nn1 | 2,627 | 0.023 | 0.010 | 0.001 | 0.015 | 0.031 | 0.051 |
mall_nn2 | 2,627 | 0.031 | 0.012 | 0.009 | 0.022 | 0.039 | 0.058 |
mall_nn3 | 2,627 | 0.037 | 0.012 | 0.012 | 0.029 | 0.046 | 0.065 |
mall_nn4 | 2,627 | 0.043 | 0.012 | 0.020 | 0.033 | 0.052 | 0.071 |
mall_nn5 | 2,627 | 0.047 | 0.012 | 0.028 | 0.036 | 0.057 | 0.077 |
hospital_nn1 | 2,627 | 0.019 | 0.010 | 0.001 | 0.012 | 0.024 | 0.056 |
hospital_nn2 | 2,627 | 0.025 | 0.011 | 0.002 | 0.018 | 0.030 | 0.069 |
hospital_nn3 | 2,627 | 0.029 | 0.013 | 0.005 | 0.020 | 0.034 | 0.076 |
hospital_nn4 | 2,627 | 0.033 | 0.014 | 0.005 | 0.023 | 0.037 | 0.083 |
hospital_nn5 | 2,627 | 0.036 | 0.015 | 0.006 | 0.026 | 0.038 | 0.089 |
college_nn1 | 2,627 | 0.013 | 0.008 | 0.0004 | 0.007 | 0.016 | 0.053 |
college_nn2 | 2,627 | 0.016 | 0.009 | 0.002 | 0.010 | 0.019 | 0.053 |
college_nn3 | 2,627 | 0.018 | 0.009 | 0.002 | 0.013 | 0.022 | 0.054 |
college_nn4 | 2,627 | 0.020 | 0.009 | 0.002 | 0.015 | 0.024 | 0.056 |
college_nn5 | 2,627 | 0.022 | 0.010 | 0.002 | 0.016 | 0.025 | 0.058 |
school_nn1 | 2,627 | 0.007 | 0.004 | 0.0004 | 0.004 | 0.009 | 0.027 |
school_nn2 | 2,627 | 0.009 | 0.004 | 0.001 | 0.006 | 0.011 | 0.028 |
school_nn3 | 2,627 | 0.010 | 0.005 | 0.002 | 0.007 | 0.013 | 0.028 |
school_nn4 | 2,627 | 0.012 | 0.005 | 0.003 | 0.008 | 0.014 | 0.028 |
school_nn5 | 2,627 | 0.013 | 0.006 | 0.004 | 0.009 | 0.015 | 0.032 |
hotel_nn1 | 2,627 | 0.010 | 0.006 | 0.0001 | 0.005 | 0.013 | 0.030 |
hotel_nn2 | 2,627 | 0.011 | 0.007 | 0.0004 | 0.006 | 0.014 | 0.030 |
hotel_nn3 | 2,627 | 0.012 | 0.007 | 0.001 | 0.007 | 0.015 | 0.031 |
hotel_nn4 | 2,627 | 0.013 | 0.007 | 0.001 | 0.007 | 0.016 | 0.032 |
hotel_nn5 | 2,627 | 0.013 | 0.007 | 0.001 | 0.008 | 0.017 | 0.033 |
pctTotalVacant | 2,627 | 15.068 | 9.325 | 2.237 | 8.509 | 19.776 | 48.308 |
pctRenterOccupied | 2,627 | 0.657 | 0.271 | 0.142 | 0.460 | 0.773 | 1.480 |
pctWhite | 2,627 | 0.735 | 0.298 | 0.058 | 0.611 | 0.957 | 1.172 |
pctPoverty | 2,627 | 0.216 | 0.111 | 0.066 | 0.123 | 0.304 | 0.582 |
MedRent | 2,540 | 1,099.507 | 402.345 | 245.000 | 836.000 | 1,185.000 | 2,271.000 |
MedHHInc | 2,627 | 56,581.220 | 43,518.950 | 14,699 | 30,099 | 60,057 | 172,750 |
## 2.3 Spatial Structure: Neighborhood
spatial.feature.list <- all.feature.list %>%
dplyr:: select(LABEL)
as.data.frame(table(spatial.feature.list)) %>%
rename(`Neighborhood List` = spatial.feature.list,
Number =Freq) %>%
kable(caption = 'Table DATA 2.3 Spatial Feature: Neighborhood') %>%
kable_styling("striped", full_width = F)
Neighborhood List | Number |
---|---|
79th Street | 1 |
Allapattah Industrial District | 4 |
Auburndale | 47 |
Bay Heights | 13 |
Baypoint | 23 |
Bayside | 33 |
Belle Island | 5 |
Belle Meade | 48 |
Belle Meade West | 16 |
Bird Grove East | 10 |
Bird Grove West | 11 |
Biscayne Island | 2 |
Biscayne Plaza | 2 |
Brentwood | 3 |
Buena Vista Heights | 19 |
Buena Vista West | 48 |
Citrus Grove | 49 |
Civic Center | 3 |
Coral Gate | 26 |
Curtis Park | 13 |
Douglas Park | 39 |
East Grove | 53 |
East Little Havana | 17 |
Edgewater | 3 |
Edison | 65 |
Fair Isle | 15 |
Flagami | 330 |
Flora Park | 49 |
Grove Center | 7 |
Hadley Park | 128 |
Haynesworth | 1 |
Highland Park | 4 |
Historic Buena Vista East | 10 |
King Heights | 29 |
La Pastorita | 7 |
Latin Quarter | 4 |
Le Jeune Gardens | 1 |
Lemon City/Little Haiti | 24 |
Liberty Square | 40 |
Little River Central | 14 |
Little River Gardens | 2 |
Melrose | 19 |
Miami Avenue | 15 |
Morningside | 49 |
North Grapeland Heights | 43 |
North Grove | 43 |
North Sewell Park | 12 |
Northeast Overtown | 4 |
Northwestern Estates | 32 |
Oakland Grove | 1 |
Old San Juan | 22 |
Orange Bowl | 3 |
Orchard Villa | 28 |
Palm Grove | 4 |
Parkdale North | 16 |
Parkdale South | 17 |
Roads | 88 |
San Marco Island | 3 |
Santa Clara | 44 |
Shenandoah North | 95 |
Shenandoah South | 93 |
Shorecrest | 64 |
Silver Bluff | 93 |
South Grapeland Heights | 21 |
South Grove | 128 |
South Grove Bayside | 27 |
South Sewell Park | 19 |
Spring Garden | 3 |
Vizcaya | 1 |
West Grapeland Heights | 13 |
West Grove | 32 |
census.feature.list <- all.feature.list %>%
dplyr:: select(pctTotalVacant, pctRenterOccupied, pctWhite,pctPoverty, MedRent ,MedHHInc)
stargazer(census.feature.list, type = "html",
title = "Table DATA 2.4 Spatial Feature: Census Data ",
header = FALSE,
single.row = TRUE)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Pctl(75) | Max |
pctTotalVacant | 2,627 | 15.068 | 9.325 | 2.237 | 8.509 | 19.776 | 48.308 |
pctRenterOccupied | 2,627 | 0.657 | 0.271 | 0.142 | 0.460 | 0.773 | 1.480 |
pctWhite | 2,627 | 0.735 | 0.298 | 0.058 | 0.611 | 0.957 | 1.172 |
pctPoverty | 2,627 | 0.216 | 0.111 | 0.066 | 0.123 | 0.304 | 0.582 |
MedRent | 2,540 | 1,099.507 | 402.345 | 245.000 | 836.000 | 1,185.000 | 2,271.000 |
MedHHInc | 2,627 | 56,581.220 | 43,518.950 | 14,699 | 30,099 | 60,057 | 172,750 |
For better visualization, features were divided into internal characteristics and amenity again when creating correlation matrix. Based on the graduated color, some features have strong relation with Sale Price, such asLand
, Total
, Taxable
and Actual SqFt
.
# DATA - 3 Correlation Matrix
miami.numericVars.internal <-
select_if(internal.feature.list, is.numeric) %>%
na.omit()
amenity.feature.list.with.price <- all.feature.list %>%
dplyr:: select(SalePrice, starts_with('landmark_'),starts_with('mall_'),starts_with('hospital_'),
starts_with('college_'),starts_with('school_'),starts_with('hotel_'),
pctTotalVacant, pctRenterOccupied, pctWhite,pctPoverty, MedRent ,MedHHInc)
miami.numericVars.amenity <-
select_if(amenity.feature.list.with.price, is.numeric) %>%
na.omit()
ggcorrplot(
round(cor(miami.numericVars.internal), 1),
p.mat = cor_pmat(miami.numericVars.internal),
colors = c('#05A167', "white", '#6897BB'),
lab_size = 1,
tl.cex = 8,
type="lower",
insig = c("pch", "blank"), pch = 1, pch.col = "black", pch.cex =1) +
labs(title = "Correlation across numeric variables\n(internal features)\n",
caption = 'Figure DATA 3.1') +
plotTheme()
ggcorrplot(
round(cor(miami.numericVars.amenity), 1),
p.mat = cor_pmat(miami.numericVars.amenity),
colors = c('#05A167', "white", '#6897BB'),
lab_size = 1,
tl.cex = 5,
type="lower",
insig = c("pch", "blank"), pch = 1, pch.col = "black", pch.cex =1) +
labs(title = "Correlation across numeric variables\n(amenity/public service features)\n",
caption = 'Figure DATA 3.2') +
plotTheme()
In this section, we specifically explored the correlation between sale price and four interesting and meaningful features: pctTotalVacant
, AdjustSqFt
, school_nn4
(average distance of four nearest public schools) and hospital_nn1
(the accessibility to nearest hospital).
Through graphs below, it is obvious that the first three of four features are positively correlated to sale price, while the distance to nearest hospital is negatively correlated to the sale price that is consistent with our intuition.
# DATA - 4 Scatter plot
## 1. Two House Internal characteristics
features <- c('Adjusted Square Feet', 'Percentage of Vacant Property')
names(features) <- c("AdjustedSqFt", "pctTotalVacant")
price.house.plot <-
st_drop_geometry(miami.training) %>%
dplyr::select(SalePrice, pctTotalVacant, AdjustedSqFt) %>%
filter(SalePrice <= 1000000) %>%
gather(Variable, Value, -SalePrice) %>%
ggplot(aes(Value, SalePrice)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se=F, colour = "darkred") +
facet_wrap(~Variable, ncol = 2, scales = "free",
labeller = labeller(Variable = features)) +
plotTheme() +
labs(title = "Price as a Function of House Internal Characteristics Variables:\nVacant Property, Adjusted Housing Area\n",
caption = 'Figure DATA 4.1')
price.house.plot
## 2. Average Distance of Nearest Four Public Schools Correlation
price.school_nn4.plot <-
st_drop_geometry(miami.training) %>%
dplyr::select(SalePrice, school_nn4) %>%
filter(SalePrice <= 1000000) %>%
gather(Variable, Value, -SalePrice) %>%
ggplot(aes(Value, SalePrice)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se=F, colour = "darkred") +
labs(title = "Price as a Function of Average Distance of\nNearest Four Public Schools Correlation\n",
caption = 'Figure DATA 4.2') +
plotTheme()
price.school_nn4.plot
## 3. Hospital Correlation
price.hospital_nn1.plot <-
st_drop_geometry(miami.training) %>%
dplyr::select(SalePrice, hospital_nn1) %>%
filter(SalePrice <= 1000000) %>%
gather(Variable, Value, -SalePrice) %>%
ggplot(aes(Value, SalePrice)) +
geom_point(size = .5) +
geom_smooth(method = "lm", se=F, colour = "darkred") +
labs(title = "Price as a Function of Average Distance of\nNearest Four Hospitals Correlation\n",
caption = 'Figure DATA 4.3') +
plotTheme()
price.hospital_nn1.plot
The sale price per SqFt in Miami shows a clear spatial pattern. The sale price in Miami Beach and Downtown Miami is significantly higher than other areas.
# DATA - 5 Map of Sale Price per Square Feet
miami.training <- miami.training %>%
mutate(priceFt = SalePrice/LivingSqFt)
map.priceFt <- ggplot() +
geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = st_centroid(miami.training),aes(color = q5(priceFt)),size = .5) +
scale_color_manual(values = palette,
labels = qBr(miami.training,'priceFt'),
name = "Price/ft^2") +
labs(title = 'Sale Price Per Square Foot\n',
caption = 'Figure DATA 5') +
mapTheme() +
plotTheme()
map.priceFt
In this section, we were aimed at mapping the vacant units proportion by each census tract where certain home sale observation is located at, the adjusted area of each home sale observation and the distribution of average distance of each home sale observation to the nearest hospital.
# DATA - 6
## 1. Vacant units
map.pctV <- ggplot() +
geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = st_centroid(miami.training),aes(color = q5(pctTotalVacant)),size = .5) +
scale_color_manual(values = palette,
labels = qBr(miami.training,'pctTotalVacant'),
name = "Percentage of Vacant Units") +
labs(title = 'Percentage of Vacant Units',
subtitle = 'Note: dots indicating the vacant unts proportion in certain census tract\nwhere certain home sale observation is located at.\n',
caption = 'Figure DATA 6.1') +
mapTheme() +
plotTheme()
map.pctV
## 2. Adjusted Square Feet
map.Adjust.SqFt <- ggplot() +
geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = st_centroid(miami.training),aes(color = q5(AdjustedSqFt)),size = .5) +
scale_color_manual(values = palette,
labels = qBr(miami.training,'AdjustedSqFt'),
name = "Adjusted Square Feet") +
labs(title = 'Distribution of Adjusted Square Feet\n',
caption = 'Figure DATA 6.2') +
mapTheme() +
plotTheme()
map.Adjust.SqFt
## 3. Hospital
hospital_nn1_expanded <- miami.training %>%
mutate(hospital_nn1.expanded = hospital_nn1 * 1000)
map.hospital <- ggplot() +
geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = st_centroid(hospital_nn1_expanded),aes(color = q5(hospital_nn1.expanded)),size = .5) +
scale_color_manual(values = palette,
labels = qBr(hospital_nn1_expanded,'hospital_nn1.expanded'),
name = "Average Distrance\nto Nearest Hospital\n(timed by 1000 \nfor better visualization)\n") +
labs(title = 'Distribution of Average Distrance to Nearest Hospital\nof Each Home Sale Observation\n',
caption = 'Figure DATA 6.3') +
mapTheme() +
plotTheme()
map.hospital
The mean objective of this analysis is to train a robust model to predict house price as accurately as possible. Accordingly, this section started to provide a detailed interpretation on the process of the model building.
Data and Sample In order to accurately build and test models, we randomly split the whole miami.training
into two parts: model.miami.training
with 60% of observations for model training and model.miami.test
with 40% of observations for model testing.
Dependent Variable The dependent variable is home sale price: SalePrice
.
Independent Variables Based on the hedonic home price model, potential features should encompass at least three components: internal characteristics, like the number of bathrooms; neighborhood amenities/public services, like hospital accessibility; and the underlying spatial process of prices. After creating features and engineering and checking correlation coefficients, 54 features are eligible for the model and obviously or slightly correlated with SalePrice
that were the initial independent variables for model building. We gradually screened out variables and eventually selected only 16 robust features included in the final model
Procedure The primitive approach used in this analysis for model building is Ordinary Least Squares Regression (OLS) which is aimed at predicting the SalePrice
of houses as a function of several statistical parameters such as intercept, slope and selected features. In order to determine which features were able to significantly increase the accuracy and generalizability of the model, we systematically compared, plotted, and mapping the mean absolute errors (MAE), mean absolute percent errors (MAPE), as well as root mean square errors (RMSE, the standard deviation of the residuals) between original SalePrice
and predicted SalePrice
of both training set and test set, conducted the cross-validation on 100 folds, explored the underlying spatial pattern of price and errors caused by problematic prediction. To address the problem of neglecting the underlying spatial correlation caused by neighborhood, we then compared, plotted, and mapped the mean absolute errors (MAE), and mean absolute percent errors (MAPE) on the test set between baseline model without neighborhood features and updated model with neighborhood features, and then determined our final model based on series model examinations.
First of all, as indicated in the III. Methods, the dataset was divided into training set and test set, with the ratio of 6:4.
# 1. Split dataset
set.seed(31357)
# get index for training sample
inTrain <- caret::createDataPartition(
y = paste(miami.training$LABEL,miami.training$SalePrice ),
p = .60, list = FALSE)
# split data into training and test
model.miami.training <- miami.training[inTrain,]
model.miami.test <- miami.training[-inTrain,]
Secondly, we built three models with different features to examine which features should be included.
SalePrice
in Model 1.SalePrice
in Model 2 but remained the ones with comparatively large correlation coefficients. And Model 3 was the baseline model in this analysis.The baseline model contained 9 internal characteristic features: Land
, Bldg
, Assessed
, County.Taxable
, AdjustedSqFt
, LotSize
, Bed
, Bath
, and ActualSqFt
,as well as 7 amenity features: college_nn1
, school_nn3
, school_nn4
, school_nn5
, pctTotalVacant
, MedRent
, MedHHInc
.
Model summaries were all shown below.
## First model: All internal features, amenity features
M1 <- lm(SalePrice ~ ., data = st_drop_geometry(model.miami.training) %>%
dplyr::select(colnames(internal.feature.list),
colnames(amenity.feature.list)
))
stargazer(M1, type = "html",
title = "Table 4.1.1 Summary Statistics of Model 1 ",
header = FALSE,
single.row = TRUE)
Dependent variable: | |
SalePrice | |
Land | 0.991*** (0.021) |
Bldg | 1.093*** (0.023) |
Total | |
Assessed | 0.517*** (0.112) |
County.Taxable | -0.303*** (0.108) |
City.Taxable | |
AdjustedSqFt | -159.901*** (37.232) |
LotSize | -13.966*** (1.849) |
Bed | -11,249.950 (6,997.370) |
Bath | 17,335.080** (7,506.400) |
Stories | -10,399.910 (13,662.860) |
Units | -53,679.120 (55,759.680) |
LivingSqFt | 33.200 (23.408) |
ActualSqFt | 134.091*** (19.856) |
Age | 252.681 (243.817) |
Age.effective | -63.247 (188.533) |
pool_house | 14,246.280 (13,530.970) |
patio_house | -4,738.318 (11,416.990) |
landmark_nn1 | -2,792,163.000 (2,351,722.000) |
landmark_nn2 | 16,876,972.000** (7,583,331.000) |
landmark_nn3 | -31,296,437.000*** (12,121,768.000) |
landmark_nn4 | 23,404,861.000 (19,773,533.000) |
landmark_nn5 | -6,786,591.000 (12,738,600.000) |
mall_nn1 | 176,783.400 (1,845,897.000) |
mall_nn2 | -4,287,708.000 (4,432,823.000) |
mall_nn3 | 5,844,359.000 (5,697,162.000) |
mall_nn4 | 496,030.000 (8,900,686.000) |
mall_nn5 | -2,054,266.000 (5,713,360.000) |
hospital_nn1 | -1,818,032.000 (1,888,218.000) |
hospital_nn2 | 4,421,767.000 (4,345,964.000) |
hospital_nn3 | -4,954,592.000 (8,681,228.000) |
hospital_nn4 | -1,725,413.000 (13,318,517.000) |
hospital_nn5 | 3,390,553.000 (8,335,588.000) |
college_nn1 | -4,720,098.000* (2,629,573.000) |
college_nn2 | 5,923,026.000 (5,873,775.000) |
college_nn3 | -7,233,703.000 (12,571,717.000) |
college_nn4 | 4,070,657.000 (17,754,491.000) |
college_nn5 | 669,506.300 (10,417,329.000) |
school_nn1 | -687,049.800 (3,421,098.000) |
school_nn2 | 3,194,391.000 (6,890,515.000) |
school_nn3 | -23,921,011.000** (11,137,220.000) |
school_nn4 | 40,568,356.000** (17,839,289.000) |
school_nn5 | -18,838,280.000 (12,109,364.000) |
hotel_nn1 | -1,220,964.000 (4,501,937.000) |
hotel_nn2 | 9,678,923.000 (12,125,473.000) |
hotel_nn3 | -2,871,359.000 (21,318,716.000) |
hotel_nn4 | -27,767,343.000 (31,010,023.000) |
hotel_nn5 | 21,457,163.000 (18,334,682.000) |
pctTotalVacant | 963.610 (863.743) |
pctRenterOccupied | 6,812.634 (24,174.830) |
pctWhite | 19,574.130 (38,267.100) |
pctPoverty | 26,776.040 (81,935.820) |
MedRent | -61.026** (27.140) |
MedHHInc | 0.592* (0.311) |
Constant | 118,214.900 (91,423.930) |
Observations | 2,378 |
R2 | 0.992 |
Adjusted R2 | 0.991 |
Residual Std. Error | 196,411.800 (df = 2325) |
F Statistic | 5,280.829*** (df = 52; 2325) |
Note: | p<0.1; p<0.05; p<0.01 |
## Second model:
M2 <- lm(SalePrice ~ ., data = st_drop_geometry(model.miami.training) %>%
dplyr::select(# internal features
SalePrice, Land, Bldg, Assessed, County.Taxable,
AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt,
# amenity features
landmark_nn2, landmark_nn3,
hospital_nn1, hospital_nn3, hospital_nn4,
college_nn1, school_nn3, school_nn4, school_nn5,
pctTotalVacant, pctRenterOccupied, MedRent, MedHHInc)
)
stargazer(M2, type = "html",
title = "Table 4.1.2 Summary Statistics of Model 2 ",
header = FALSE,
single.row = TRUE)
Dependent variable: | |
SalePrice | |
Land | 0.985*** (0.021) |
Bldg | 1.082*** (0.022) |
Assessed | 0.507*** (0.110) |
County.Taxable | -0.288*** (0.106) |
AdjustedSqFt | -118.404*** (23.316) |
LotSize | -13.279*** (1.726) |
Bed | -13,051.200* (6,836.993) |
Bath | 17,439.450** (7,183.896) |
ActualSqFt | 127.631*** (18.325) |
landmark_nn2 | 1,799,788.000 (3,559,741.000) |
landmark_nn3 | -2,635,938.000 (3,763,173.000) |
hospital_nn1 | -2,196,553.000** (933,375.700) |
hospital_nn3 | 3,016,274.000 (3,704,837.000) |
hospital_nn4 | -1,101,879.000 (3,527,266.000) |
college_nn1 | -1,973,853.000*** (731,082.300) |
school_nn3 | -28,006,369.000*** (7,261,776.000) |
school_nn4 | 45,941,758.000*** (15,793,927.000) |
school_nn5 | -17,237,124.000* (9,831,561.000) |
pctTotalVacant | 669.337 (634.989) |
pctRenterOccupied | 14,496.700 (21,090.990) |
MedRent | -46.688** (20.841) |
MedHHInc | 0.541** (0.244) |
Constant | 74,872.530*** (28,563.910) |
Observations | 2,378 |
R2 | 0.992 |
Adjusted R2 | 0.991 |
Residual Std. Error | 196,282.500 (df = 2355) |
F Statistic | 12,497.180*** (df = 22; 2355) |
Note: | p<0.1; p<0.05; p<0.01 |
# Third Model
M3 <- lm(SalePrice ~ ., data = st_drop_geometry(model.miami.training) %>%
dplyr::select(# internal features
SalePrice, Land, Bldg, Assessed, County.Taxable,
AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt,
# amenity features
college_nn1,
school_nn3, school_nn4, school_nn5,
pctTotalVacant, MedRent, MedHHInc)
)
stargazer(M3, type = "html",
title = "Table 4.1.3 Summary Statistics of Model 3 ",
header = FALSE,
single.row = TRUE)
Dependent variable: | |
SalePrice | |
Land | 0.985*** (0.021) |
Bldg | 1.081*** (0.022) |
Assessed | 0.499*** (0.110) |
County.Taxable | -0.281*** (0.106) |
AdjustedSqFt | -117.790*** (23.306) |
LotSize | -13.032*** (1.713) |
Bed | -12,423.740* (6,833.118) |
Bath | 18,002.580** (7,169.642) |
ActualSqFt | 126.495*** (18.312) |
college_nn1 | -2,236,855.000*** (707,102.700) |
school_nn3 | -28,465,266.000*** (6,919,330.000) |
school_nn4 | 44,475,526.000*** (15,632,117.000) |
school_nn5 | -15,908,357.000* (9,572,297.000) |
pctTotalVacant | 1,037.946* (585.099) |
MedRent | -38.722** (18.400) |
MedHHInc | 0.479** (0.200) |
Constant | 67,019.810*** (22,540.550) |
Observations | 2,378 |
R2 | 0.991 |
Adjusted R2 | 0.991 |
Residual Std. Error | 196,340.900 (df = 2361) |
F Statistic | 17,172.940*** (df = 16; 2361) |
Note: | p<0.1; p<0.05; p<0.01 |
The results of the regression on the training set are in the following table. The current model embraced 16 independent features and reached 0.992 R-square, indicating that 99% of the variance for SalePrice
can be explained by this model.
# 2. Table of model summary on my training set
stargazer(m2.training, type = "html",
title = "Table 4.2.1 Summary Statistics of Baseline Model on Training Set",
header = FALSE,
single.row = TRUE)
Dependent variable: | |
SalePrice | |
Land | 0.985*** (0.021) |
Bldg | 1.081*** (0.022) |
Assessed | 0.499*** (0.110) |
County.Taxable | -0.281*** (0.106) |
AdjustedSqFt | -117.790*** (23.306) |
LotSize | -13.032*** (1.713) |
Bed | -12,423.740* (6,833.118) |
Bath | 18,002.580** (7,169.642) |
ActualSqFt | 126.495*** (18.312) |
college_nn1 | -2,236,855.000*** (707,102.700) |
school_nn3 | -28,465,266.000*** (6,919,330.000) |
school_nn4 | 44,475,526.000*** (15,632,117.000) |
school_nn5 | -15,908,357.000* (9,572,297.000) |
pctTotalVacant | 1,037.946* (585.099) |
MedRent | -38.722** (18.400) |
MedHHInc | 0.479** (0.200) |
Constant | 67,019.810*** (22,540.550) |
Observations | 2,378 |
R2 | 0.991 |
Adjusted R2 | 0.991 |
Residual Std. Error | 196,340.900 (df = 2361) |
F Statistic | 17,172.940*** (df = 16; 2361) |
Note: | p<0.1; p<0.05; p<0.01 |
broom::glance(m2.training) %>%
kable(caption = 'Table 4.2.2 Summary of Model Evaluation Parameters') %>%
kable_styling("striped", full_width = F)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.9914805 | 0.9914227 | 196340.9 | 17172.93 | 0 | 16 | -32347.84 | 64731.67 | 64835.61 | 9.101599e+13 | 2361 | 2378 |
The Mean Absolute Errors (MAE) and Mean Absolute Percent Errors (MAPE) are statistical measures of how accurate a forecast system is.Based on the following table, the MAE is 59645.31, the errors are accounted for 12.36%
# 3. MAE, MAPE on my test set
m2.miami.test <-
model.miami.test %>%
mutate(Regression = "Baseline Regression",
SalePrice.Predict = predict(m2.training, newdata = model.miami.test),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice) %>%
filter(SalePrice < 5000000)
test_GoodnessFit<-as.data.frame(cbind(mean(m2.miami.test$SalePrice.AbsError,na.rm = T),
mean(m2.miami.test$SalePrice.APE,na.rm = T)))
colnames(test_GoodnessFit)<-c("Mean Absolute Errors (MAE)","MAPE")
kable(test_GoodnessFit,
caption = 'Table RESULT 3. MAE and MAPE for Test Set') %>%
kable_styling("striped", full_width = F)
Mean Absolute Errors (MAE) | MAPE |
---|---|
59645.31 | 0.1236198 |
This section conducted cross-validation to assess how our model would generalize to other independent data sets, in which we could flag problem of overfitting or feature selection bias.
Both a summary table and distribution plots of R-square, RMSE, and MAE in each of 100 fold are shown below.
The variation of 100 folds can also be visualized with a histogram of across-fold MAE. Although most of the errors are clustering tightly together, some errors are scattered, indicating the inconsistency of the model prediction.
# 4. Cross-Validation Test on Model 2
## 4.1 use caret package cross-validation method
fitControl <- caret:: trainControl(method = "cv",
number = 100,
savePredictions = TRUE)
set.seed(717)
m2.cv <-
caret::train(SalePrice ~ ., data = st_drop_geometry(miami.training) %>%
dplyr::select(# internal features
SalePrice, Land, Bldg, Assessed, County.Taxable,
AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt,
# amenity features
college_nn1,
school_nn3, school_nn4, school_nn5,
pctTotalVacant, MedRent, MedHHInc) ,
method = "lm",
trControl = fitControl,
na.action = na.pass)
## 4.2 show RMSE, MAE, R^2
kable(m2.cv$resample,
caption = 'Table RESULT 4. Cross-validation Test: Summary of RMSE, R Squared and MAE') %>%
kable_styling("striped", full_width = F)
RMSE | Rsquared | MAE | Resample |
---|---|---|---|
198139.61 | 0.9853120 | 119795.03 | Fold001 |
60086.39 | 0.9960435 | 48432.08 | Fold002 |
122823.07 | 0.9900892 | 76756.16 | Fold003 |
207097.01 | 0.9704187 | 96194.43 | Fold004 |
162566.98 | 0.9895799 | 98348.59 | Fold005 |
199639.82 | 0.9986516 | 95787.21 | Fold006 |
731604.37 | 0.9949180 | 308338.47 | Fold007 |
127088.59 | 0.9976789 | 72499.21 | Fold008 |
196071.62 | 0.9895360 | 97195.16 | Fold009 |
239853.33 | 0.9599551 | 110998.29 | Fold010 |
186646.45 | 0.9458751 | 104977.37 | Fold011 |
82553.65 | 0.9910639 | 60294.86 | Fold012 |
77518.37 | 0.9974416 | 58943.17 | Fold013 |
412016.70 | 0.9864821 | 150361.53 | Fold014 |
205209.63 | 0.9710427 | 104502.83 | Fold015 |
90037.17 | 0.9876723 | 61543.14 | Fold016 |
138328.50 | 0.9949880 | 84990.71 | Fold017 |
84309.07 | 0.9945610 | 60266.43 | Fold018 |
104292.57 | 0.9485969 | 65590.16 | Fold019 |
139049.40 | 0.9931974 | 84661.85 | Fold020 |
213424.92 | 0.8892981 | 98464.62 | Fold021 |
97822.97 | 0.9947580 | 64024.16 | Fold022 |
129640.96 | 0.9617291 | 79505.35 | Fold023 |
133704.88 | 0.9962616 | 70747.15 | Fold024 |
178009.60 | 0.9994841 | 115080.18 | Fold025 |
141117.31 | 0.9957801 | 87553.03 | Fold026 |
596292.62 | 0.9927644 | 182950.18 | Fold027 |
371061.62 | 0.8329185 | 136399.18 | Fold028 |
154420.55 | 0.9829284 | 84976.60 | Fold029 |
94905.69 | 0.9803618 | 59535.60 | Fold030 |
92959.86 | 0.9825883 | 68315.44 | Fold031 |
110161.30 | 0.9994238 | 73494.07 | Fold032 |
187838.59 | 0.9973274 | 111943.09 | Fold033 |
194452.12 | 0.9974584 | 102532.76 | Fold034 |
137632.83 | 0.9875680 | 98998.44 | Fold035 |
86774.46 | 0.9632703 | 70697.19 | Fold036 |
139760.85 | 0.9947746 | 81720.61 | Fold037 |
103178.62 | 0.9834567 | 76393.19 | Fold038 |
74894.92 | 0.9997241 | 50002.23 | Fold039 |
151937.04 | 0.9637194 | 83080.49 | Fold040 |
105624.15 | 0.9840339 | 82284.21 | Fold041 |
362108.14 | 0.9456223 | 155438.69 | Fold042 |
136263.54 | 0.9755882 | 98364.83 | Fold043 |
166337.59 | 0.9475920 | 112916.77 | Fold044 |
105441.35 | 0.9991632 | 76498.76 | Fold045 |
144510.48 | 0.9942572 | 91785.92 | Fold046 |
105413.46 | 0.9966984 | 70204.28 | Fold047 |
300114.49 | 0.9960530 | 134121.91 | Fold048 |
171303.30 | 0.9976148 | 85137.17 | Fold049 |
107595.16 | 0.9365308 | 81378.68 | Fold050 |
122186.91 | 0.9986492 | 81565.15 | Fold051 |
169460.14 | 0.9375060 | 108010.98 | Fold052 |
215624.41 | 0.9653245 | 125246.01 | Fold053 |
123012.37 | 0.9997520 | 79336.06 | Fold054 |
80035.26 | 0.9946228 | 57943.16 | Fold055 |
162030.20 | 0.9565660 | 91569.13 | Fold056 |
141797.61 | 0.9892701 | 90537.01 | Fold057 |
242441.20 | 0.9926873 | 93936.52 | Fold058 |
94436.06 | 0.9777007 | 58001.11 | Fold059 |
270521.03 | 0.9977142 | 107391.70 | Fold060 |
85328.28 | 0.9926478 | 64009.44 | Fold061 |
112068.01 | 0.9963280 | 67084.45 | Fold062 |
70130.07 | 0.9647300 | 50401.73 | Fold063 |
90925.92 | 0.9983749 | 64357.41 | Fold064 |
110789.51 | 0.9687090 | 64734.13 | Fold065 |
80713.56 | 0.9877941 | 62576.56 | Fold066 |
250286.38 | 0.9825317 | 123288.27 | Fold067 |
72278.11 | 0.9994246 | 57037.43 | Fold068 |
225347.54 | 0.9413927 | 117897.92 | Fold069 |
168131.35 | 0.9934126 | 80147.15 | Fold070 |
152362.98 | 0.9412925 | 81758.69 | Fold071 |
367494.49 | 0.9844215 | 128898.98 | Fold072 |
91232.84 | 0.9859311 | 57671.03 | Fold073 |
86841.99 | 0.9902169 | 60353.74 | Fold074 |
101498.28 | 0.9646062 | 74243.32 | Fold075 |
217979.38 | 0.9986084 | 132059.32 | Fold076 |
120466.03 | 0.9970362 | 80202.54 | Fold077 |
158587.14 | 0.9678984 | 89297.14 | Fold078 |
121976.12 | 0.9976592 | 83994.63 | Fold079 |
175696.23 | 0.9867851 | 102838.63 | Fold080 |
109927.21 | 0.9970085 | 71705.27 | Fold081 |
210105.20 | 0.9416603 | 119954.07 | Fold082 |
313337.30 | 0.9949994 | 110378.88 | Fold083 |
331910.00 | 0.9987171 | 157354.12 | Fold084 |
367224.18 | 0.9917536 | 173486.56 | Fold085 |
112781.76 | 0.9995143 | 82741.16 | Fold086 |
141576.60 | 0.9709638 | 86515.32 | Fold087 |
127408.69 | 0.9980478 | 80797.91 | Fold088 |
160226.58 | 0.9642931 | 100449.11 | Fold089 |
205442.75 | 0.9604512 | 113785.68 | Fold090 |
75554.76 | 0.9893433 | 61585.62 | Fold091 |
113371.01 | 0.9836034 | 72270.88 | Fold092 |
233922.94 | 0.9916611 | 115189.21 | Fold093 |
147770.83 | 0.9955132 | 85004.00 | Fold094 |
270069.39 | 0.9994397 | 129313.17 | Fold095 |
190135.49 | 0.9961020 | 128383.47 | Fold096 |
159917.75 | 0.9963515 | 76691.39 | Fold097 |
407381.55 | 0.9961371 | 143143.95 | Fold098 |
78599.00 | 0.9729700 | 53994.23 | Fold099 |
125683.41 | 0.9740752 | 86213.44 | Fold100 |
m2.cv$resample %>%
pivot_longer(-Resample) %>%
mutate(name = as.factor(name)) %>%
ggplot(., aes(x = name, y = value, color = name)) +
geom_jitter(width = 0.1) +
facet_wrap(~name, ncol = 3, scales = "free") +
theme_bw() +
theme(legend.position = "none") +
labs(title = 'Cross-validation Test: Distribution of MAE, RMSE, R Squared\n',
caption = "Figure RESULT 4.1") +
plotTheme()
The Figure RESULT 5.1 below shows the predicted prices as a function of observed price. This plot indicates that our model can predict price perfectly since the regression line (shown in dark-blue) and the reference line (shown in orange) are completely overlapped, and almost all the house predictions are closely clustered along with the regression line.
# 5. Plot predicted prices as a function of observed prices
preds.train <- data.frame(pred = predict(m2.training, model.miami.training),
actual = model.miami.training$SalePrice,
source = "Training Set")
preds.test <- data.frame(pred = predict(m2.training,newdata = model.miami.test),
actual = model.miami.test$SalePrice,
source = "Test Set")
preds <- rbind(preds.train, preds.test)
## Overall Plotting
ggplot(preds, aes(x = actual, y = pred, color = source)) +
geom_point() +
geom_smooth(method = "lm", color = "darkblue") +
geom_abline(color = "orange") +
coord_equal() +
theme_bw() +
labs(title = "Predicted Prices as a\nFunction of Observed Prices",
subtitle = 'Blue line is model regression line\nwhile orange line is reference line\n',
caption = 'Figure RESULT 5.1',
x = "Observed Prices",
y = "Predicted Prices") +
theme(
legend.position = "none"
) +
plotTheme()
ggplot(preds, aes(x = actual, y = pred, color = source)) +
geom_point() +
geom_smooth(method = "lm", color = "darkblue") +
geom_abline(color = "orange") +
coord_equal() +
theme_bw() +
facet_wrap(~source, ncol = 2) +
labs(title = "Predicted Prices as a Function of Observed Prices, by Sets",
subtitle = 'Blue line is model regression line while orange line is reference line\n',
caption = 'Figure RESULT 5.2',
x = "Observed Prices",
y = "Predicted Prices") +
theme(
legend.position = "none"
) +
plotTheme()
In this section, we leveraged three different tools to examine if prediction errors display certain spatial patterns, and accordingly to determine whether we need to add more spatial features in the existing model.
Mapping residuals: In the map of the residuals on the test set, most of the errors are clustering in the Downtown Miami and Miami Beach area. There is no clear pattern of the error directions.
Plotting spatial lags: A spatial lag is a variable that averages the neighboring values of a location. The spatial lag of error plot shows that as home price errors increase, the nearby home price errors slightly decrease, indicating that our model errors are rarely spatially autocorrelated.
Moran’s I: Another approach for measuring spatial autocorrelation is Moran’s I. In this model Moran’s I is around 0.196, indicating a certain level of clustering on model errors. The frequency of all 999 randomly permuted I were plotted as a histogram with the Observed I indicated by the orange line (see Figure Result 6.3). That the observed I is higher than all of the 999 randomly generated I’s provides visual confirmation of spatial autocorrelation despite being slight. This suggests that variation in price is likely related to the spatial process has been omitted from our current model.
Therefore, we thought it was worthy to include neighborhood features in the model for the further test.
# 6. Residual Exploration
## 6.1 A map of residuals for test set
palette.res <- c('#ff0000', '#ff7878', '#ffffff', '#74d680', '#378b29')
map.res.test <- ggplot() +
geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = m2.miami.test,aes(color = q5(SalePrice.Error)), show.legend = "point",size = 2) +
scale_color_manual(values = palette.res,
labels = qBr(m2.miami.test,'SalePrice.Error'),
name = "Residuals") +
labs(title = 'Map of residuals on test set\n',
caption = 'Figure RESULT 6.1') +
mapTheme() +
plotTheme()
map.res.test
## 6.2 Spatial lag in errors: average errors in five nearest neighbors
k_nearest_neighbors = 5
# average errors in five nearest neighbors
coords.test <- st_centroid(st_geometry(m2.miami.test), of_largest_polygon=TRUE)
neighborList.test <- knn2nb(knearneigh(coords.test, k_nearest_neighbors))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
m2.miami.test$lagPriceError <- lag.listw(spatialWeights.test,m2.miami.test$SalePrice.Error)
error.spatial.lag <- ggplot(m2.miami.test, aes(x = lagPriceError, y = SalePrice.Error)) +
geom_point(colour = "black") +
geom_smooth(method = "lm", se = FALSE, colour = "darkred") +
labs(title = "Error on Test Set as a function of the Spatial Lag of Price\n",
caption = "Figure RESULT 6.2",
x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
y = "Errors of Sale Price") +
plotTheme()
error.spatial.lag
## 6.3 Moran's I Test
moranTest <- moran.mc(m2.miami.test$SalePrice.AbsError,
spatialWeights.test, nsim = 999)
moran.plot <- ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "orange",size=1) +
scale_x_continuous(limits = c(-1, 1)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in orange, I = 0.196",
x="Moran's I",
y="Count",
caption="Figure RESULT 6.3") +
plotTheme()
moran.plot
Applying the predictive model on the whole dataset, the distribution of sales price is shown in the map below. Sale prices are generally higher in Downtown Miami, Miami Beach and the North area. There is no significant difference between the pattern of the distribution of price predictions in the two datasets (topredict
== 0 and topredict
== 1).
# 7. Map of predicted values for the whole dataset
miami.sf$Predicted.Price<- predict(m2.training, newdata = miami.sf)
feature <- c("toPredict = 0", "toPredict = 1")
names(feature) <- c(0,1)
map.predicted.price <- ggplot() +
geom_sf(data = st_union(miami.neigh),fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = st_centroid(miami.sf),aes(color = q5(Predicted.Price)),size = .5) +
scale_color_manual(values = palette,
labels = qBr(miami.sf,'Predicted.Price'),
name = "Price, $") +
facet_wrap(~as.factor(toPredict),
labeller = labeller(toPredict = feature)) +
labs(title = 'Map of Predicted Sale Price\n',
subtitle = '',
caption = 'Figure RESULT 7') +
mapTheme() +
plotTheme()
map.predicted.price
Indicated in the summary table as well as the choropleth map below, MAPE varies across neighborhoods (only 22 neighborhoods left in the test set). In the test set, the model exhibited larger MAPE in Downtown and North Miami.
# 8. Map of MAPE by neighborhood
## 8.1 Statistic summary by neighborhood group
nhood_sum <- m2.miami.test %>%
group_by(LABEL) %>%
summarize(meanPrice = mean(SalePrice, na.rm = T),
meanPrediction = mean(SalePrice.Predict, na.rm = T),
meanMAE = mean(SalePrice.AbsError, na.rm = T),
meanMAPE = mean(SalePrice.APE, na.rm = T))
nhood_sum %>%
st_drop_geometry %>%
arrange(desc(meanMAPE)) %>%
kable(caption = 'Table RESULT 8. Map of MAPE by Neighborhood') %>%
kable_styling("striped", full_width = F)
LABEL | meanPrice | meanPrediction | meanMAE | meanMAPE |
---|---|---|---|---|
Belle Meade | 525000.0 | 693827.2 | 168827.210 | 0.3215756 |
Flora Park | 235000.0 | 173195.4 | 61804.585 | 0.2629982 |
Hadley Park | 218181.8 | 220490.7 | 40196.438 | 0.2629211 |
Auburndale | 315000.0 | 394983.8 | 79983.783 | 0.2515129 |
King Heights | 247500.0 | 186553.2 | 60946.762 | 0.2426961 |
Douglas Park | 330000.0 | 395485.9 | 65485.918 | 0.1984422 |
South Grove | 1089285.7 | 1065505.8 | 196290.348 | 0.1943366 |
Edison | 225000.0 | 245860.4 | 39177.641 | 0.1635606 |
Buena Vista West | 282500.0 | 241985.2 | 40514.803 | 0.1379306 |
Flagami | 348550.7 | 341875.8 | 41095.196 | 0.1176901 |
Shenandoah South | 435000.0 | 414730.8 | 40233.809 | 0.0980084 |
Shenandoah North | 425000.0 | 447634.1 | 35817.841 | 0.0978073 |
Silver Bluff | 437500.0 | 468627.9 | 39720.576 | 0.0870146 |
NA | 1542931.0 | 1573447.3 | 109346.042 | 0.0858543 |
Citrus Grove | 316666.7 | 290469.0 | 26197.670 | 0.0856725 |
La Pastorita | 380000.0 | 348617.1 | 31382.939 | 0.0825867 |
Liberty Square | 220000.0 | 201958.8 | 18041.217 | 0.0820055 |
Northwestern Estates | 190000.0 | 202313.3 | 12313.323 | 0.0648070 |
West Grove | 550000.0 | 584857.9 | 34857.907 | 0.0633780 |
Roads | 600000.0 | 565369.8 | 34630.229 | 0.0577170 |
North Grapeland Heights | 290000.0 | 301803.1 | 11803.125 | 0.0407004 |
Shorecrest | 638333.3 | 656533.8 | 18200.425 | 0.0315013 |
Santa Clara | 220000.0 | 226119.9 | 6119.898 | 0.0278177 |
Melrose | 230000.0 | 224225.4 | 5774.562 | 0.0251068 |
Baypoint | 1800000.0 | 1783610.7 | 16389.297 | 0.0091052 |
## 8.2 Map of MAPE by neighborhood
miami.neigh.sum <- miami.neigh %>%
left_join(st_drop_geometry(nhood_sum), by = "LABEL") %>%
mutate(meanMAPE.expanded = meanMAPE * 100)
map.MAPE.nhood <- ggplot() +
geom_sf(data = miami.neigh.sum,
aes(fill = q5(meanMAPE.expanded))) +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
scale_fill_manual(values = palette,
labels = qBr(miami.neigh.sum,'meanMAPE.expanded'),
name = "MAPE\n(timed by 100\nfor better display)") +
labs(title = 'Map of MAPE on Test Set, by Neighborhood\n',
subtitle = '',
caption = 'Figure RESULT 8') +
mapTheme() +
plotTheme()
map.MAPE.nhood
This section plotted MAPE by neighborhood as a function of mean price by neighborhood. Neighborhoods with mean sale price below or around 500,000 have higher MAPE, while MAPE in neighborhoods with higher mean sale price is generally lower. This finding indicates that most errors occurred in the neighborhoods where house price was comparatively moderate.
# 9. Scatter plot of MAPE by neighborhood as a function of mean price by neighborhood
MAPE.nhood.plot <-ggplot()+
geom_point(data = nhood_sum, aes(x = meanPrice, y = meanMAPE)) +
stat_smooth(data = nhood_sum, aes(x = meanPrice, y = meanMAPE), method = "loess", se = FALSE, size = 1, colour="red") +
labs(title = "MAPE by Neighborhood as a Function of\nMean Price by Neighborhood\n",
caption = 'Figure RESULT 9') +
xlab('Mean Price by Neighborhood') +
ylab('Mean MAPE by Neighborhood') +
plotTheme()
MAPE.nhood.plot
This section gathered Census data to test how well baseline model generalizes to different group contexts, such as race group and income level.
First of all, we selected four census variables to display the spatial pattern of each group: 1. Majority White vs. Majority Non-White: divided whether the proportion of white population is accounted for more than 50% of total population or not. 2. High Income vs. Low Income: divided whether the median household income in certain census tract is above or below the average level. 3. Majority Vacant vs. Majority Non-Vacant: divided whether the proportion of vacant unit is accounted for more than 50% of total vacant unit or not. 4. Majority Renter Occupied vs. Majority Non-Renter Occupied: divided whether the proportion of renter occupied units is accounted for more than 50% of total units or not.
Based on the maps shown below, race group and income level have displayed the obvious spatial pattern, and thus, we determined to generalize our model on the two context groups.
# 10. Model generalizability on census data
## 10.1 Group Plotting
census <- miami.training %>% # miami.training contains all features we used including census data and geometry information
mutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High Income", "Low Income"),
vacantContext = ifelse(pctTotalVacant > .5, "Majority Vacant", "Majority Non-Vacant"),
pctRenterContext = ifelse(pctPoverty > .5, "Majority Renter Occupied", "Majority Non-Renter Occupied")) %>%
select(raceContext,incomeContext,vacantContext,pctRenterContext)
census <- census %>%
st_transform(st_crs(tracts17))
grid.arrange(ncol = 2,
ggplot() +
geom_sf(data = st_union(miami.neigh), fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = na.omit(census), aes(color = raceContext),
show.legend = "point", size= 1) +
scale_color_manual(values = c("#05A167", "#6897BB"), name="Race Context") +
labs(title = "Race Context of House\n",
caption = 'Figure RESULT 10.1') +
mapTheme() +
theme(legend.position="bottom") +
plotTheme(),
ggplot() +
geom_sf(data = st_union(miami.neigh), fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = na.omit(census), aes(color = incomeContext),
show.legend = "point", size= 1) +
scale_fill_manual(values = c("#05A167", "#6897BB"), name="Income Context") +
labs(title = "Income Context of House\n",
caption = 'Figure RESULT 10.2') +
mapTheme() +
theme(legend.position="bottom") +
plotTheme())
grid.arrange(ncol = 2,
ggplot() +
geom_sf(data = st_union(miami.neigh), fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = na.omit(census), aes(color = vacantContext),
show.legend = "point", size= 1) +
scale_color_manual(values = c("#05A167", "#6897bb"), name="Vacant Context") +
labs(title = "Vacant Context of House\n",
caption = 'Figure RESULT 10.3') +
mapTheme() +
theme(legend.position="bottom") +
plotTheme(),
ggplot() +
geom_sf(data = st_union(miami.neigh), fill = '#E5E5E5') +
geom_sf(data = st_union(miami.beach),fill = '#E5E5E5') +
geom_sf(data = na.omit(census), aes(color = pctRenterContext),
show.legend = "point", size= 1) +
scale_fill_manual(values = c("#05A167", "#6897BB"), name="Renter Proportion Context") +
labs(title = "Renter Proportion Context\n",
caption = 'Figure RESULT 10.4') +
mapTheme() +
theme(legend.position="bottom") +
plotTheme())
Existing findings have indicated that neighborhood could also be a potential feature for model building. Hence, this section built a new model called Neighborhood Effects model based on the baseline model by adding neighborhood feature LABEL
.
MAE, and MAPE of both regression models are shown below. The finding was not intuitive that the new model performed much better in terms of MAE but slightly worse in terms of MAPE.
Hence, it is imperative to further examine how well both models could be generalized in race context and income level context, and then to determine which model should be remained.
### Based on the group distribution, we decided to select race group and income group
## 10.2 Neighborhood Fixed Effect
### 10.2.1 Model building
m2.nhood <- lm(SalePrice ~ ., data = as.data.frame(model.miami.training) %>%
dplyr::select(# spatial features
LABEL,
# internal features
SalePrice, Land, Bldg, Assessed, County.Taxable,
AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt,
# amenity features
college_nn1,
school_nn3, school_nn4, school_nn5,
pctTotalVacant, MedRent, MedHHInc)
)
m2.miami.test.nhood <-
model.miami.test %>%
mutate(Regression = "Neighborhood Effects",
SalePrice.Predict = predict(m2.nhood, model.miami.test),
SalePrice.Error = SalePrice - SalePrice.Predict,
SalePrice.AbsError = abs(SalePrice - SalePrice.Predict),
SalePrice.APE = (abs(SalePrice - SalePrice.Predict)) / SalePrice) %>%
filter(SalePrice < 5000000)
### 10.2.2. Model combination and comparison
bothRegressions <-
rbind(
dplyr::select(m2.miami.test, starts_with("SalePrice"), Regression, LABEL) ,
dplyr::select(m2.miami.test.nhood, starts_with("SalePrice"), Regression, LABEL) )
st_drop_geometry(bothRegressions) %>%
gather(Variable, Value, -Regression, -LABEL) %>%
filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
group_by(Regression, Variable) %>%
summarize(meanValue = mean(Value, na.rm = T)) %>%
spread(Variable, meanValue) %>%
kable(caption = 'Table RESULT 10.1.\nSummary of MAE, and MAPE by Regression Models') %>%
kable_styling("striped", full_width = F)
Regression | SalePrice.AbsError | SalePrice.APE |
---|---|---|
Baseline Regression | 59645.31 | 0.1236198 |
Neighborhood Effects | 47095.86 | 0.1246893 |
Regardless of any context group, our new model with neighborhood features do not perform better than the baseline model in terms of MAPE. Hence, we determined to exclude neighborhood features in the model.
Despite higher MAPE in the Majority Non-White group, the baseline model is generalized well across context groups with MAPE close to the one generated on the test set (MAPE on test set is 12.362% for your reference).
### 10.2.3 Generalizability in Different Groups
census <- miami.training %>% # miami.training contains all features we used including census data and geometry information
mutate(raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"),
incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High Income", "Low Income"),
vacantContext = ifelse(pctTotalVacant > .5, "Majority Vacant", "Majority Non-Vacant"),
pctRenterContext = ifelse(pctPoverty > .5, "Majority Renter Occupied", "Majority Non-Renter Occupied")) %>%
select(raceContext,incomeContext,vacantContext,pctRenterContext)
race.group <- st_join(bothRegressions, census) %>%
group_by(Regression, raceContext) %>%
summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(raceContext, mean.MAPE) %>%
kable(caption = "Table RESULT 10.2.\nTest set MAPE by Neighborhood Racial Context") %>%
kable_styling("striped", full_width = F)
race.group
Regression | Majority Non-White | Majority White |
---|---|---|
Baseline Regression | 19% | 11% |
Neighborhood Effects | 20% | 11% |
income.group <- st_join(bothRegressions, census) %>%
group_by(Regression, incomeContext) %>%
summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
st_drop_geometry() %>%
spread(incomeContext, mean.MAPE) %>%
kable(caption = " RESULT 10.3.\nTest set MAPE by Neighborhood Income Context") %>%
kable_styling("striped", full_width = F)
income.group
Regression | High Income | Low Income |
---|---|---|
Baseline Regression | 11% | 13% |
Neighborhood Effects | 20% | 13% |
Eventually, we settled down with a model built with the original 16 features used in the baseline model, and conducted prediction on the secret data set. Summary statistics of predicted sale price is attached below.
N | Mean | St.Dev. | Min. | Max. |
---|---|---|---|---|
876 | 1044617 | 1217748 | 69398 | 13970170 |
MODEL <- lm(SalePrice ~ ., data = as.data.frame(miami.training) %>%
dplyr::select(
# internal features
SalePrice, Land, Bldg, Assessed, County.Taxable,
AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt,
# amenity features
college_nn1,
school_nn3, school_nn4, school_nn5,
pctTotalVacant, MedRent, MedHHInc)
)
secret_data <- miami.test%>%
dplyr::select(Folio,
# internal features
SalePrice, Land, Bldg, Assessed, County.Taxable,
AdjustedSqFt, LotSize, Bed,Bath, ActualSqFt,
# amenity features
college_nn1,
school_nn3, school_nn4, school_nn5,
pctTotalVacant, MedRent, MedHHInc)
MedRent.no.na <- mean(secret_data$MedRent,na.rm = TRUE)
secret_data[is.na(secret_data)] <- MedRent.no.na
secret_preds <- predict(MODEL, newdata = secret_data)
output_preds <- data.frame(prediction = secret_preds, Folio = secret_data$Folio, team_name = "MySun")
# write.csv(output_preds, "MySun.csv")
Accuracy and Generalizability: Overall, the predictive model is effective in terms of predicting sale price in the City of Miami and Miami Beach. The model successfully captured more than 99% of variation in prices we were required to predict that indicated a big victory! For the sake of accuracy, the MAE on the test set is less than 60k with 12.36% MAPE. As for the generalizability, MAPEs are no larger than 13% regardless of context groups to which the model was applied. A potential weakness of the model should be addressed is if the generalizability test result was overestimated or manipulated since proportion of white population and median household income were both features in the model.
Feature Effect: The model is built with 9 internal characteristic features: Land
, Bldg
, Assessed
, County.Taxable
, AdjustedSqFt
, LotSize
, Bed
, Bath
, and ActualSqFt
, 4 amenity features: college_nn1
, school_nn3
, school_nn4
, andschool_nn5
, and 3 spatial featurespctTotalVacant
, MedRent
, and MedHHInc
. The highlight of this model is including vacant unit proportion on census tract level as one of the features to predict sale price. We regarded it as an interesting addition to display the spatial environment the house unit is located at. The accessibility of education resources is an easy-ignored but powerful feature to show how amenities affect the local house price. Rather than only focus on the entertainment demand, we cared more about if the local residents’requirements on education have been refilled. Evidence has indicated that such a consideration is worthy. Secondly, it seems counterintuitive that we did not include the distance to coastline as a feature in the model, but we think proximity to water is just the surface outcome; it is unable to explain why high price is cluttered inside the City of Miami boundary despite being far from water. Besides, although we eventually excluded neighborhood features from the model, this spatial pattern still deserves attention when it comes to larger dataset size. Additionally, although we included far more amenity features in the initial stage, such as the proximity to landmarks, hotels, shopping malls for entertainment, the proximity to hospitals for healthcare service, and the exposure to certain crime, those features were eventually excluded from the model based on the series of rigorous model evaluation. However, we still believe it is worthy taking those features into consideration and they might be helpful when being used in other regression models.
Errors and Spatial Variations: Generally speaking, larger MAEs occurred in the Downtown Miami, Miami Beach area and north region of Miami City, while the model performed better in the rest of areas. One guess is potential spatial features that are directly related to the three areas have been somehow missed out. But we still feel confident about the model based on the examinations above.
If all the model examination and interpretation have been conducted correctly as shown above, we would like to recommend this model to Zillow as it incorporates some Miami’s local intelligent features that were absent in the previous models and performed comparatively well on either accuracy or generalizability.
To further reduce the error rate, more amenity features such as local restaurants and bars (data unavailable so far), financial related agencies, public transportation and spatial features such as zoning should be taken into consideration. And more context studies on Miami downtown and north region are also highly needed since we would like to extract some similarities between the two regions that would be the magic ingredients of cooking models in the future.