Take Home Ex 3

Published

March 11, 2023

Modified

March 25, 2023

Getting Started

In this take-home exercise, you are tasked to predict HDB resale prices at the sub-market level (i.e.HDB 3-room, HDB 4-room and HDB 5-room) for the month of January and February 2023 in Singapore. The predictive models must be built by using by using conventional OLS method and GWR methods. You are also required to compare the performance of the conventional OLS method versus the geographical weighted methods.

The study should focus on either three-room, four-room or five-room flat and transaction period should be from 1st January 2021 to 31st December 2022. The test data should be January and February 2023 resale prices.

The Data

Data sets will be used in this model building exercise, they are:

  1. MP 2014 Subzone Boundary

  2. Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020

  3. HDB Resale Price

Loading of packages

pacman::p_load(olsrr, corrplot, ggpubr, sf, spdep, GWmodel, tmap, tidyverse, gtsummary,mapview,leaflet,RColorBrewer,tidygeocoder,jsonlite,httr,onemapsgapi,reporter,magrittr,readxl,gdata, units,matrixStats, SpatialML,rsample, Metrics)

Importing Geospatial Data

After I imported the subzone layer and analysed below, I’ve found out that there are some subzone area which can be removed because the area are mostly industrial/campsite and would not have any residents living there.

mpsz19 <- st_read(dsn = "data/geospatial", layer = "MPSZ-2019") %>% 
  st_transform(3414) %>% filter(SUBZONE_N != "NORTH-EASTERN ISLANDS" & SUBZONE_N != "SOUTHERN GROUP"& SUBZONE_N != "SEMAKAU"& SUBZONE_N != "SUDONG")
Reading layer `MPSZ-2019' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
st_crs(mpsz19)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
st_bbox(mpsz19)
     xmin      ymin      xmax      ymax 
 2667.538 21448.473 55941.942 50256.334 

After plotting a map, we can see that there is an island called NORTH-EASTERN ISLAND, SOUTHERN GROUP, SEMAKAU and SUDONG which is out of analysis area. we should remove them for the purpose of this takehome exercise. We can re-import the data with out this names by doing a filter(). Re run the map and look! its clean now. We can move on to our analysis.

Population Analysis

Importing Aspatial Data Population

In this section, I will be importing Singapore population data. In our Hands-on Ex03 we have use a choropleth map to portray the spatial distribution of aged population of Singapore by Master Plan 2014 Subzone Boundary. Let’s further analyse this data and code further for our analysis.

popdata <- read_csv("data/aspatial/population_2011to2020.csv")
popdata2020 <- popdata %>%
  filter(Time == 2020) %>%
  group_by(PA, SZ, AG) %>%
  summarise(`POP` = sum(`Pop`)) %>%
  ungroup()%>%
  pivot_wider(names_from=AG, 
              values_from=POP) %>%
  mutate(YOUNG = rowSums(.[3:6])
         +rowSums(.[12])) %>%
mutate(`ECONOMY ACTIVE` = rowSums(.[7:11])+
rowSums(.[13:15]))%>%
mutate(`AGED`=rowSums(.[16:21])) %>%
mutate(`TOTAL`=rowSums(.[3:21])) %>%  
mutate(`DEPENDENCY` = (`YOUNG` + `AGED`)
/`ECONOMY ACTIVE`) %>%
  select(`PA`, `SZ`, `YOUNG`, 
       `ECONOMY ACTIVE`, `AGED`, 
       `TOTAL`, `DEPENDENCY`)

As we look at the raw data, we can create a new data frame to combine the number of population and group by PA and SZ.

Joining the attribute data and geospatial data

For master plan and Pop data frame, there are 2 common variable which we can join them together. PLN_AREA_N = PA & SUBZONE_N = SZ but we have to convert them to upper case first

popdata2020 <- popdata2020 %>%
  mutate_at(.vars = vars(PA, SZ), 
          .funs = funs(toupper)) %>%
  filter(`ECONOMY ACTIVE` > 0)

left join both master plan and population data by their sub zone level.

mpsz_pop2020 <- left_join(mpsz19, popdata2020,
                          by = c("SUBZONE_N" = "SZ"))

Exploratory Data Analysis (EDA)

tm_shape(mpsz_pop2020)+
  tm_fill("DEPENDENCY",
          style = "quantile",
          palette = "Greens") +
  tm_borders(alpha = 0.5)

tmap_options(check.and.fix = TRUE)
tm_shape(mpsz_pop2020)+
  tm_fill("TOTAL",
          style = "quantile",
          palette = "Blues") +
  tm_borders(alpha = 0.5)

Based on the chart above, we can see that in general, the east area is more dense as compared to the other region.

Lets compare between 3 groups.

tmap_mode("view")
tmap_options(check.and.fix = TRUE)
youngmap <- tm_shape(mpsz_pop2020)+ 
  tm_polygons("YOUNG", 
              style = "quantile", 
              palette = "Reds",
              )+ 
  tm_view(set.zoom.limits = c(9,14))

tmap_options(check.and.fix = TRUE)
activemap <- tm_shape(mpsz_pop2020)+ 
  tm_polygons("ECONOMY ACTIVE", 
              style = "quantile", 
              palette = "Blues",
              )+ 
  tm_view(set.zoom.limits = c(9,14))

agedmap <- tm_shape(mpsz_pop2020)+ 
  tm_polygons("AGED", 
              style = "quantile", 
              palette = "Oranges") +
  tm_view(set.zoom.limits = c(9,14))

tmap_arrange(youngmap,activemap, agedmap, asp=1, ncol=3, sync = TRUE)
tmap_mode("plot")

Based on the interactive graph above, we can say that the east area has a more dense aged group as compared to the economy active group. In general, I can also say that; the east area has a higher aging population as compared to other areas. Furthermore, we can also conclude that there are more resident staying in the east side than other region.

Existing Dwelling Analysis

Notice that in the excel sheet, there is a column called dwelling as shown below. We can also further analysis the current type of dwelling to narrow down which type of housing are most common/popular.

dwelling2020 <- popdata %>%
  filter(Time == 2020 & 
           TOD == "HDB 3-Room Flats" |
           TOD == "HDB 4-Room Flats" |
           TOD == "HDB 5-Room and Executive Flats") %>%
  group_by(PA, SZ, TOD) %>%
  summarise(`POP` = sum(`Pop`))

Joining the attribute data and geospatial data

dwelling2020 <- dwelling2020 %>%
  mutate_at(.vars = vars(PA, SZ), 
          .funs = funs(toupper)) %>%
  filter(`POP` > 0)
mpsz_dwelling2020 <- left_join(mpsz19, dwelling2020,
                          by = c("SUBZONE_N" = "SZ"))

Exploratory Data Analysis (EDA)

 tmap_mode("view")
HDB3room <- tm_shape(mpsz_dwelling2020 %>% filter( TOD == "HDB 3-Room Flats"))+
  tm_fill("POP",
          style = "quantile",
          palette = "Greens") +
  tm_view(set.zoom.limits = c(10,14))

HDB4room <- tm_shape(mpsz_dwelling2020 %>% filter( TOD == "HDB 4-Room Flats"))+ 
  tm_polygons("POP", 
              style = "quantile", 
              palette = "Blues",
              )+ 
  tm_view(set.zoom.limits = c(10,14))

HDB5room <- tm_shape(mpsz_dwelling2020 %>% filter( TOD == "HDB 5-Room and Executive Flats"))+ 
  tm_polygons("POP", 
              style = "quantile", 
              palette = "Reds") +
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(HDB3room, HDB4room, HDB5room, asp=1, ncol=3, sync = TRUE)

Mentioned above that the study should focus on either three-room, four-room or five-room flat. Based on the chart above, we can see that 4-room flat has a higher results as compared to 3-room and 5-room due to its scale of 536,920 for 4-room, next highest is 495,380 for 5-room and 30,880 for 3-room. With this dwelling analysis, I will be focusing on 4-Room flat cause of its popularity in Singapore.

Importing Locational factors

The data mostly are taken from data.gov or LTA data mall factors include:

  1. Bus stop location LTA data mall

  2. Eldercare data.gov

  3. Foodarea

  4. schools data.gov

  5. childcare (taken from our inclass exercise)

  6. train station kaggel

  7. malls from kaggle

  8. supermarket geofabrik

  9. hospital (collated manually) SGhospital

As there are many views about elite schools in Singapore as attending an elite primary school in Singapore is important depends on an individual’s unique circumstances and goals. In this take home exercise, I will be picking the top 10 most common popular primary schools. These are the top 10 popular schools according to this website.

busstops <- st_read(dsn = "data/aspatial/BusStopLocation", layer = "BusStop")
Reading layer `BusStop' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\BusStopLocation' 
  using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
eldercare <- st_read(dsn = "data/aspatial/eldercare-services-shp", layer = "ELDERCARE")
Reading layer `ELDERCARE' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\eldercare-services-shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 14481.92 ymin: 28218.43 xmax: 41665.14 ymax: 46804.9
Projected CRS: SVY21
poi <- st_read(dsn = "data/aspatial/poi_singapore", layer = "Singapore_POIS")
Reading layer `Singapore_POIS' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\poi_singapore' 
  using driver `ESRI Shapefile'
Simple feature collection with 18687 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3699.08 ymin: 22453.07 xmax: 52883.86 ymax: 50153.8
Projected CRS: SVY21 / Singapore TM

After we see the poi datafame, lets check for unique values under fclass

unique(poi$fclass)
  [1] "bakery"              "viewpoint"           "atm"                
  [4] "attraction"          "toilet"              "fast_food"          
  [7] "convenience"         "supermarket"         "food_court"         
 [10] "bank"                "post_office"         "restaurant"         
 [13] "cafe"                "monument"            "shelter"            
 [16] "tourist_info"        "bench"               "college"            
 [19] "telephone"           "memorial"            "bar"                
 [22] "pub"                 "pharmacy"            "cinema"             
 [25] "theatre"             "department_store"    "school"             
 [28] "playground"          "water_tower"         "fire_station"       
 [31] "vending_any"         "dentist"             "post_box"           
 [34] "stadium"             "hospital"            "police"             
 [37] "recycling"           "artwork"             "camera_surveillance"
 [40] "library"             "car_dealership"      "laundry"            
 [43] "observation_tower"   "lighthouse"          "sports_centre"      
 [46] "fountain"            "gift_shop"           "doityourself"       
 [49] "toy_shop"            "florist"             "hairdresser"        
 [52] "clothes"             "beverages"           "jeweller"           
 [55] "vending_machine"     "kindergarten"        "greengrocer"        
 [58] "butcher"             "battlefield"         "museum"             
 [61] "waste_basket"        "pitch"               "community_centre"   
 [64] "drinking_water"      "clinic"              "market_place"       
 [67] "general"             "picnic_site"         "bookshop"           
 [70] "travel_agent"        "camp_site"           "embassy"            
 [73] "town_hall"           "doctors"             "bicycle_rental"     
 [76] "zoo"                 "bicycle_shop"        "comms_tower"        
 [79] "tower"               "guesthouse"          "nightclub"          
 [82] "university"          "computer_shop"       "sports_shop"        
 [85] "arts_centre"         "swimming_pool"       "furniture_shop"     
 [88] "theme_park"          "veterinary"          "beauty_shop"        
 [91] "outdoor_shop"        "video_shop"          "ice_rink"           
 [94] "motel"               "public_building"     "car_rental"         
 [97] "stationery"          "newsagent"           "chemist"            
[100] "mobile_phone_shop"   "ruins"               "hotel"              
[103] "park"                "shoe_shop"           "optician"           
[106] "mall"                "hostel"              "car_wash"           
[109] "kiosk"               "chalet"              "water_well"         
[112] "prison"              "wayside_shrine"      "recycling_paper"    
[115] "wayside_cross"       "car_sharing"         "garden_centre"      
[118] "recycling_glass"     "dog_park"           

We notice that there are a lot of category, lets only select supermarket for our analysis as there are some inaccuracy in certain classes.

supermarkets <- st_read(dsn = "data/aspatial/poi_singapore", layer = "Singapore_POIS") %>% 
  filter(fclass == "supermarket")
Reading layer `Singapore_POIS' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\poi_singapore' 
  using driver `ESRI Shapefile'
Simple feature collection with 18687 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3699.08 ymin: 22453.07 xmax: 52883.86 ymax: 50153.8
Projected CRS: SVY21 / Singapore TM
foodarea <- st_read(dsn = "data/aspatial/poi_singapore", layer = "Singapore_POIS") %>% 
  filter(fclass == "food_court"| fclass == "fast_food"
         |fclass =="restaurant" |fclass =="cafe")
Reading layer `Singapore_POIS' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial\poi_singapore' 
  using driver `ESRI Shapefile'
Simple feature collection with 18687 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3699.08 ymin: 22453.07 xmax: 52883.86 ymax: 50153.8
Projected CRS: SVY21 / Singapore TM
trainstation <- read_csv("data/aspatial/mrt_lrt_data.csv")
childcare <- st_read(dsn = "data/aspatial", layer = "ChildcareServices")
Reading layer `ChildcareServices' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\aspatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 1545 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 11203.01 ymin: 25667.6 xmax: 45404.24 ymax: 49300.88
z_range:       zmin: 0 zmax: 0
Projected CRS: SVY21 / Singapore TM
hospital <- read_excel("data/aspatial/hospital.xlsx")
good_prischool <- read_csv("data/aspatial/school-directory-and-information/general-information-of-schools.csv") %>% filter(
  school_name == "NANYANG PRIMARY SCHOOL" | 
    school_name == "CATHOLIC HIGH SCHOOL" |
    school_name == "TAO NAN SCHOOL" |
    school_name == "NAN HUA PRIMARY SCHOOL" |
    school_name == "ST. HILDA'S PRIMARY SCHOOL" |
    school_name == "HENRY PARK PRIMARY SCHOOL" |
    school_name == "ANGLO-CHINESE SCHOOL (PRIMARY)" |
    school_name == "RAFFLES GIRLS' PRIMARY SCHOOL" |
    school_name == "PEI HWA PRESBYTERIAN PRIMARY SCHOOL" |
     school_name == "CHIJ ST. NICHOLAS GIRLS' SCHOOL")
primaryschool <- read_csv("data/aspatial/school-directory-and-information/general-information-of-schools.csv") %>% filter(mainlevel_code =="PRIMARY")
shoppingmalls <- read_csv("data/aspatial/shopping_mall_coordinates.csv")

Change dataframe to sf format

busstop.sf <- st_transform(busstops, crs=3414)
eldercare.sf <- st_transform(eldercare, crs=3414)
foodarea.sf <- st_transform(foodarea, crs=3414)
supermarkets.sf <- st_transform(supermarkets, crs=3414)
childcare.sf <- st_transform(childcare, crs=3414)
hospital.sf <- st_as_sf(hospital,
                            coords = c("Long", "Lat"),
                            crs=4326) %>%
  st_transform(crs=3414)
trainstation.sf <- st_as_sf(trainstation,
                            coords = c("lng", "lat"),
                            crs=4326) %>%
  st_transform(crs=3414)

We note that there are no Lat and Long in the primary school data frame. we need to tidy the data for both primary and good primary schools. To prevent it from rendering the website for too long, I will be exporting and importing the tidied files.

# primaryschool[c('block', 'street')] <- str_split_fixed(primaryschool$address, ' ', # 2)
# primaryschool$street<- trim(primaryschool$street)
# primaryschool$street<- toupper(primaryschool$street)
# good_prischool[c('block', 'street')] <- str_split_fixed(good_prischool$address, ' '# , 2)
# good_prischool$street<- trim(good_prischool$street)
# good_prischool$street<- toupper(good_prischool$street)
#library(httr)
#geocode <- function(block, streetname) {
#  base_url <- "https://developers.onemap.sg/commonapi/search"
#  address <- paste(block, streetname, sep = " ")
#  query <- list("searchVal" = address, 
#                "returnGeom" = "Y",
#                "getAddrDetails" = "N",
#                "pageNum" = "1")
#  
#  res <- GET(base_url, query = query)
#  restext<-content(res, as="text")
#  
#  output <- fromJSON(restext)  %>% 
#    as.data.frame %>%
#    select(results.LATITUDE, results.LONGITUDE)
#
#  return(output)
#}
#good_prischool$LATITUDE <- 0
#good_prischool$LONGITUDE <- 0
#
#for (i in 1:nrow(good_prischool)){
#  temp_output <- geocode(good_prischool[i, 32], good_prischool[i, 33])
#  
#  good_prischool$LATITUDE[i] <- temp_output$results.LATITUDE
#  good_prischool$LONGITUDE[i] <- temp_output$results.LONGITUDE
#}
# primaryschool$LATITUDE <- 0
# primaryschool$LONGITUDE <- 0
# 
# for (i in 1:nrow(primaryschool)){
#   temp_output <- geocode(primaryschool[i, 32], primaryschool[i, 33])
#   
#   primaryschool$LATITUDE[i] <- temp_output$results.LATITUDE
#   primaryschool$LONGITUDE[i] <- temp_output$results.LONGITUDE
# }

Export tidy data for schools

#write.csv(primaryschool,"data/exported/primaryschool.csv")
#write.csv(good_prischool,"data/exported/good_prischool.csv")

Import tidy data for schools

primaryschool <- read_csv("data/exported/primaryschool.csv")
good_prischool <- read_csv("data/exported/good_prischool.csv")

After we have done, we can proceed to convert the df to sf.

shoppingmalls.sf <- st_as_sf(shoppingmalls,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

good_prischool.sf <- st_as_sf(good_prischool,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

primaryschool.sf <- st_as_sf(primaryschool,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

Exploratory Data Analysis (EDA)

tmap_mode("plot")

PLOT_BUS <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(busstop.sf) +
  tm_dots(col="red", size=0.05) +
  tm_layout(main.title = "Bus Stops",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_TRAIN <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(trainstation.sf) +
  tm_dots(col="blue", size=0.05) +
  tm_layout(main.title = "Train Station",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_BUS, PLOT_TRAIN, 
             asp=1, ncol=2,
             sync = FALSE)

Notice for bus stop, there are a few points that is out of Singapore, lets remove those points namely “LARKIN TER”, “KOTARAYA II TER”, “JOHOR BAHRU CHECKPT”, “JB SENTRAL”. After that, we can re-run the map above.

busstop.sf <- busstop.sf  %>%
  filter(LOC_DESC != "JOHOR BAHRU CHECKPT" & LOC_DESC != "LARKIN TER"& LOC_DESC != "KOTARAYA II TER"& LOC_DESC != "JB SENTRAL")

Lets looks at other data.

tmap_mode("plot")

PLOT_CHILD <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(childcare.sf) +
  tm_dots(col="orange", size=0.05) +
  tm_layout(main.title = "Child Care",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_ELDER <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(eldercare.sf) +
  tm_dots(col="green", size=0.05) +
  tm_layout(main.title = "Elder Care",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_CHILD, PLOT_ELDER, 
             asp=1, ncol=2,
             sync = FALSE)

tmap_mode("plot")

PLOT_PRISCHOOL <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(primaryschool.sf) +
  tm_dots(col="#009999", size=0.05) +
  tm_layout(main.title = "Primary school",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_GOODPRISCHOOL <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(good_prischool.sf) +
  tm_dots(col="red", size=0.05) +
  tm_layout(main.title = "Good Primary School",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_PRISCHOOL, PLOT_GOODPRISCHOOL, 
             asp=1, ncol=2,
             sync = FALSE)

tmap_mode("plot")
PLOT_FOOD <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(foodarea.sf) +
  tm_dots(col="#009999", size=0.05) +
  tm_layout(main.title = "Food Area",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_SUPERMART <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(supermarkets.sf) +
  tm_dots(col="#0000FF", size=0.05) +
  tm_layout(main.title = "Supermarket",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_FOOD, PLOT_SUPERMART, 
             asp=1, ncol=2,
             sync = FALSE)

After plotting, we saw a point which is not within SG, let’s remove them and re run the map.

foodarea.sf <- foodarea.sf  %>%
  filter(osm_id  != "4493618264")
tmap_mode("plot")

PLOT_HOST <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(hospital.sf) +
  tm_dots(col="red", size=0.05) +
  tm_layout(main.title = "Hospital",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))


PLOT_MALL <- tm_shape(mpsz19) +
  tm_borders(alpha = 0.5) +
  tmap_options(check.and.fix = TRUE) +
tm_shape(shoppingmalls.sf) +
  tm_dots(col="orange", size=0.05) +
  tm_layout(main.title = "Shopping mall",
          main.title.position = "center",
          main.title.size = 1.2,
          frame = TRUE)+
  tm_view(set.zoom.limits = c(10,14))

tmap_arrange(PLOT_HOST, PLOT_MALL, 
             asp=1, ncol=2,
             sync = FALSE)

Looks like all data points are within Singapore! Next, we will be preparing data for the HDB resale price.

Importing Aspatial Data HDB resale price

We will import the HDB resale prices and also filter the data to 4-room flat from 1st January 2021 to 31st December 2022 fro training purposes

hdb_resale <- read_csv("data/aspatial/resale_flat_price_full.csv")  %>% 
  filter(flat_type == "4 ROOM") %>%
  filter(month >= "2021-01" & month <= "2022-12")
glimpse(hdb_resale)
Rows: 23,656
Columns: 11
$ month               <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", …
$ block               <chr> "547", "414", "509", "467", "571", "134", "204", "…
$ street_name         <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 10", "ANG MO …
$ storey_range        <chr> "04 TO 06", "01 TO 03", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm      <dbl> 92, 92, 91, 92, 92, 98, 92, 92, 92, 92, 92, 109, 9…
$ flat_model          <chr> "New Generation", "New Generation", "New Generatio…
$ lease_commence_date <dbl> 1981, 1979, 1980, 1979, 1979, 1978, 1977, 1978, 19…
$ remaining_lease     <chr> "59 years", "57 years 09 months", "58 years 06 mon…
$ resale_price        <dbl> 370000, 375000, 380000, 385000, 410000, 410000, 41…

Next, we will be importing test data. The test data should be January and February 2023 resale prices. We will be filtering January to Feb 2023 and 4-room flat.

hdb_resale_test <- read_csv("data/aspatial/resale_flat_price_full.csv")  %>% 
  filter(flat_type == "4 ROOM") %>%
  filter(month >= "2023-01" & month <= "2023-02")

In the below data prep, we should clean the data for both test and train.

Data Preparation

Since there is no Long and Lat in the data, we need to create one. But before that we can combine the block with the street name, create category column representing story range, create Lat Long column and reformat remaining lease.

  1. Combine block and street name
hdb_resale$address <-  paste(hdb_resale$block, hdb_resale$street_name, sep=" ")
hdb_resale_test$address <-  paste(hdb_resale_test$block, hdb_resale_test$street_name, sep=" ")
  1. Looking at megan’s Takehome, she created duplicate values representing the story range. I was thinking of something different, so I decided we can categories in 4 category Low, Mid, High, very High instead.
  • Low: 01-06

  • Middle: 07-12

  • High: 13-24

  • Very High: >= 25

unique(hdb_resale$storey_range)
 [1] "04 TO 06" "01 TO 03" "07 TO 09" "10 TO 12" "13 TO 15" "16 TO 18"
 [7] "19 TO 21" "22 TO 24" "28 TO 30" "25 TO 27" "31 TO 33" "43 TO 45"
[13] "34 TO 36" "37 TO 39" "40 TO 42" "46 TO 48" "49 TO 51"

Low

hdb_resale$story_level_low <- ifelse(hdb_resale$storey_range=="01 TO 03"|hdb_resale$storey_range=="04 TO 06", 1, 0)
hdb_resale_test$story_level_low <- ifelse(hdb_resale_test$storey_range=="01 TO 03"|hdb_resale_test$storey_range=="04 TO 06", 1, 0)

Middle

hdb_resale$story_level_mid <- ifelse(hdb_resale$storey_range=="07 TO 09"|hdb_resale$storey_range=="10 TO 12", 1, 0)
hdb_resale_test$story_level_mid <- ifelse(hdb_resale_test$storey_range=="07 TO 09"|hdb_resale_test$storey_range=="10 TO 12", 1, 0)

High

hdb_resale$story_level_high <- ifelse(hdb_resale$storey_range=="13 TO 15"|hdb_resale$storey_range=="16 TO 18"|hdb_resale$storey_range=="19 TO 21"|hdb_resale$storey_range=="22 TO 24", 1, 0)
hdb_resale_test$story_level_high <- ifelse(hdb_resale_test$storey_range=="13 TO 15"|hdb_resale_test$storey_range=="16 TO 18"|hdb_resale_test$storey_range=="19 TO 21"|hdb_resale_test$storey_range=="22 TO 24", 1, 0)

Very High

hdb_resale$story_level_veryhigh <- ifelse(hdb_resale$storey_range>="25 TO 27", 1, 0)
hdb_resale_test$story_level_veryhigh <- ifelse(hdb_resale_test$storey_range>="25 TO 27", 1, 0)
  1. Create Lat Long column

The below code, i have copied the from our senior Megan website. The below code extracts the Latitude and Longitude of the address through OneMap API. I have commented after I ran the code once, the process took me about 1 hour to finish extracting the longlat.

I will be exporting it into XLSX file and read in the file.

#library(httr)
#geocode <- function(block, streetname) {
#  base_url <- "https://developers.onemap.sg/commonapi/search"
#  address <- paste(block, streetname, sep = " ")
#  query <- list("searchVal" = address, 
#                "returnGeom" = "Y",
#                "getAddrDetails" = "N",
#                "pageNum" = "1")
#  
#  res <- GET(base_url, query = query)
#  restext<-content(res, as="text")
#  
#  output <- fromJSON(restext)  %>% 
#    as.data.frame %>%
#    select(results.LATITUDE, results.LONGITUDE)
#
#  return(output)
#}
#hdb_resale$LATITUDE <- 0
#hdb_resale$LONGITUDE <- 0
#
#for (i in 1:nrow(hdb_resale)){
#  temp_output <- geocode(hdb_resale[i, 4], hdb_resale[i, 5])
#  
#  hdb_resale$LATITUDE[i] <- temp_output$results.LATITUDE
#  hdb_resale$LONGITUDE[i] <- temp_output$results.LONGITUDE
#}
#hdb_resale_test$LATITUDE <- 0
#hdb_resale_test$LONGITUDE <- 0
#
#for (i in 1:nrow(hdb_resale_test)){
#  temp_output <- geocode(hdb_resale_test[i, 4], hdb_resale_test[i,5])
#  
#  hdb_resale_test$LATITUDE[i] <- temp_output$results.LATITUDE
#  hdb_resale_test$LONGITUDE[i] <- temp_output$results.LONGITUDE
#}
#write.csv(hdb_resale,"data/exported/hdb_resale_latlong.csv")
#write.csv(hdb_resale_test,"data/exported/hdb_resale_test_latlong.csv")

Read in the hdb resale price with latlong column.

hdb_resale <- read_csv("data/exported/hdb_resale_latlong.csv")
hdb_resale_test <- read_csv("data/exported/hdb_resale_test_latlong.csv")

Lets check for any NA values

sum(is.na(hdb_resale$LATITUDE))
[1] 0
sum(is.na(hdb_resale$LONGITUDE))
[1] 0
sum(is.na(hdb_resale_test$LATITUDE))
[1] 0
sum(is.na(hdb_resale_test$LONGITUDE))
[1] 0

There is no missing value and we can proceed to the next steps.

  1. Convert Remaining lease format
str_list <- str_split(hdb_resale$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      hdb_resale$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    hdb_resale$remaining_lease[i] <- year
  }
}
str_list <- str_split(hdb_resale_test$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
      year <- as.numeric(unlist(str_list[i])[1])
      month <- as.numeric(unlist(str_list[i])[3])
      hdb_resale_test$remaining_lease[i] <- year + round(month/12, 2)
  }
  else {
    year <- as.numeric(unlist(str_list[i])[1])
    hdb_resale_test$remaining_lease[i] <- year
  }
}

Converting aspatial data dataframe into a sf object.

Currently, the hdb_resale tibble data frame is aspatial. We will convert it to a sf object and the output will be in point feature form. The code chunk below converts hdb_resale data frame into a simple feature data frame by using st_as_sf() of sf packages.

hdb_resale.sf <- st_as_sf(hdb_resale,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

hdb_resale_test.sf <- st_as_sf(hdb_resale_test,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)
head(hdb_resale.sf)
Simple feature collection with 6 features and 17 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 28960.32 ymin: 38439.17 xmax: 30770.07 ymax: 39578.64
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 18
   ...1 month   town       flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶
  <dbl> <chr>   <chr>      <chr>   <chr> <chr>   <chr>     <dbl> <chr>     <dbl>
1     1 2021-01 ANG MO KIO 4 ROOM  547   ANG MO… 04 TO …      92 New Ge…    1981
2     2 2021-01 ANG MO KIO 4 ROOM  414   ANG MO… 01 TO …      92 New Ge…    1979
3     3 2021-01 ANG MO KIO 4 ROOM  509   ANG MO… 01 TO …      91 New Ge…    1980
4     4 2021-01 ANG MO KIO 4 ROOM  467   ANG MO… 07 TO …      92 New Ge…    1979
5     5 2021-01 ANG MO KIO 4 ROOM  571   ANG MO… 07 TO …      92 New Ge…    1979
6     6 2021-01 ANG MO KIO 4 ROOM  134   ANG MO… 07 TO …      98 New Ge…    1978
# … with 8 more variables: remaining_lease <chr>, resale_price <dbl>,
#   address <chr>, story_level_low <dbl>, story_level_mid <dbl>,
#   story_level_high <dbl>, story_level_veryhigh <dbl>, geometry <POINT [m]>,
#   and abbreviated variable names ¹​flat_type, ²​street_name, ³​storey_range,
#   ⁴​floor_area_sqm, ⁵​flat_model, ⁶​lease_commence_date

Let’s visualize the 4-room price in a map view.

tmap_mode("view")
tmap_options(check.and.fix = TRUE)
tm_shape(hdb_resale.sf) +  
  tm_dots(col = "resale_price",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))

By looking at the map, it becomes apparent that the central & east region of Singapore has a higher concentration of flats that command greater resale values.

Let’s find out the top 10 area that has the highest price.

town_mean <- aggregate(hdb_resale.sf[,"resale_price"], list(hdb_resale.sf$town), mean)
top10_town = top_n(town_mean, 10, `resale_price`) %>%
  arrange(desc(`resale_price`))
top10_town
Simple feature collection with 10 features and 2 fields
Geometry type: MULTIPOINT
Dimension:     XY
Bounding box:  xmin: 19536.43 ymin: 28217.39 xmax: 35958.92 ymax: 41493.47
Projected CRS: SVY21 / Singapore TM
           Group.1 resale_price                       geometry
1     CENTRAL AREA     856354.6 MULTIPOINT ((28640.73 29932...
2       QUEENSTOWN     780134.9 MULTIPOINT ((22133.07 32910...
3      BUKIT MERAH     711179.0 MULTIPOINT ((25024.14 28462...
4  KALLANG/WHAMPOA     702049.1 MULTIPOINT ((19536.43 41493...
5        TOA PAYOH     651544.0 MULTIPOINT ((29021.66 34720...
6         CLEMENTI     647851.6 MULTIPOINT ((19863.73 32474...
7      BUKIT TIMAH     634049.4 MULTIPOINT ((21224.71 35657...
8           BISHAN     605247.6 MULTIPOINT ((27638.13 37842...
9          GEYLANG     592893.1 MULTIPOINT ((32749.43 33252...
10      ANG MO KIO     546980.0 MULTIPOINT ((28070.74 38987...

Comparing to megan’s previous analysis the HDB resale flat from 2019-01 to 2020-10 with the current 2021-01 to 2022-12. We can see the increase in price just over 3 years. From my code chunk above, we can see that BUKIT TIMAH’S resale price has decrease from 4th position to 7th position. MARINE PARADE is not seen in the 10th position and it’s over taken by ANG MO KIO.

Megan’s 2019-01 to 2020-10

We can find out more insights by picking variety of factors that deem fit.

Exploratory Data Analysis (EDA)

ggplot(data=hdb_resale.sf, aes(x=`resale_price`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

The figure above reveals a right skewed distribution. This means that more hdb units were transacted at relative lower prices at around 500k

Proximity Distance Calculation

In this section, we need to find the proximity to particular facilities - which we can compute with st_distance(), and find the closest facility (shortest distance) with the rowMins() function of our matrixStats package. The values will be appended to the data frame as a new column. (the below code will be credited to our senior Megan)

#library(units)
#library(matrixStats)
#proximity <- function(df1, df2, varname) {
#  dist_matrix <- st_distance(df1, df2) %>%
#    drop_units()
#  df1[,varname] <- rowMins(dist_matrix)
#  return(df1)
#}
#hdb_resale_train.sf <- 
#  # the columns will be truncated later on when viewing 
#  # so we're limiting ourselves to two-character columns for ease of #viewing #etween
#  proximity(hdb_resale.sf, busstop.sf, "PROX_BS") %>%
#  proximity(.,childcare.sf, "PROX_CHILDCARE") %>%
#  proximity(., eldercare.sf, "PROX_ELDERCARE") %>%
#  proximity(., foodarea.sf, "PROX_FOOD") %>%
#  proximity(., trainstation.sf, "PROX_MRT") %>%
#  proximity(., good_prischool.sf, "PROX_TOPPRISCH") %>%
#  proximity(., shoppingmalls.sf, "PROX_MALL") %>%
#  proximity(., supermarkets.sf, "PROX_SPRMKT") %>%
#  proximity(., hospital.sf, "PROX_HOST") 

#hdb_resale_test <- 
#  # the columns will be truncated later on when viewing 
#  # so we're limiting ourselves to two-character columns for ease of #viewing #etween
#  proximity(hdb_resale_test.sf, busstop.sf, "PROX_BS") %>%
#  proximity(.,childcare.sf, "PROX_CHILDCARE") %>%
#  proximity(., eldercare.sf, "PROX_ELDERCARE") %>%
#  proximity(., foodarea.sf, "PROX_FOOD") %>%
#  proximity(., trainstation.sf, "PROX_MRT") %>%
#  proximity(., good_prischool.sf, "PROX_TOPPRISCH") %>%
#  proximity(., shoppingmalls.sf, "PROX_MALL") %>%
#  proximity(., supermarkets.sf, "PROX_SPRMKT") %>%
#  proximity(., hospital.sf, "PROX_HOST") 

After we ran the proximity distance, lets run the number of radius as well.

#num_radius <- function(df1, df2, varname, radius) {
#  dist_matrix <- st_distance(df1, df2) %>%
#    drop_units() %>%
#    as.data.frame()
#  df1[,varname] <- rowSums(dist_matrix <= radius)
#  return(df1)
#}
#hdb_resale_train_final <- 
#  num_radius(hdb_resale_train.sf, foodarea.sf, "NUM_FOOD", 350) %>%
#  num_radius(., childcare.sf, "NUM_CHILDCARE", 350) %>%
#  num_radius(., busstop.sf, "NUM_BUS_STOP", 350) %>%
#  num_radius(., supermarkets.sf, "NUM_SPMKT", 350) %>%
#  num_radius(., primaryschool.sf, "NUM_SCHOOL", 1000)

#hdb_resale_test_final <- 
#  num_radius(hdb_resale_test, foodarea.sf, "NUM_FOOD", 350) %>%
#  num_radius(., childcare.sf, "NUM_CHILDCARE", 350) %>%
#  num_radius(., busstop.sf, "NUM_BUS_STOP", 350) %>%
#  num_radius(., supermarkets.sf, "NUM_SPMKT", 350) %>%
#  num_radius(., primaryschool.sf, "NUM_SCHOOL", 1000)
#st_write(hdb_resale_train_final, "data/exported/hdb_resale_train_final.shp")
#st_write(hdb_resale_test_final, "data/exported/hdb_resale_test_final.shp")
#write.csv(hdb_resale_train_final, "data/exported/hdb_resale_train.csv")
#write.csv(hdb_resale_test_final, "data/exported/hdb_resale_test.csv")

Read in the exported file

train_resale <- st_read(dsn = "data/exported", layer ="hdb_resale_train_final")
Reading layer `hdb_resale_train_final' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\exported' 
  using driver `ESRI Shapefile'
Simple feature collection with 23656 features and 31 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11519.79 ymin: 28217.39 xmax: 42645.18 ymax: 48741.06
Projected CRS: SVY21 / Singapore TM
test_resale <- st_read(dsn = "data/exported", layer ="hdb_resale_test_final")
Reading layer `hdb_resale_test_final' from data source 
  `C:\yifei-alpaca\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\exported' 
  using driver `ESRI Shapefile'
Simple feature collection with 1848 features and 31 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11655.33 ymin: 28287.8 xmax: 42444.75 ymax: 48675.05
Projected CRS: SVY21 / Singapore TM

Computing Correlation Matrix

Lets examine if there is a sign of multicolinearity.

notice that we wanna check the correlation for remaining lease as well. we can swap the column position.

train_resale<- train_resale[c(1,2,3,4,5,6,7,8,9,10,12,13,11,14:32)]
train_resale$rmnng_l <- as.numeric(as.character(train_resale$rmnng_l))
train_resale_nogeom <- train_resale %>%
  st_drop_geometry()
corrplot::corrplot(cor(train_resale_nogeom[,13:31]), 
                   diag = FALSE, 
                   order = "AOE",
                   tl.pos = "td", 
                   tl.cex = 0.5, 
                   method = "number", 
                   type = "upper")

In the correlation matrix, we can see that story_level_low AND story_level_middle has a negative correlation between each other. Negative correlation can deduce that when x increases, y decreases. This shows that when resident bought a higher level(decrease), the occupancy of lower level is high(increases). Hence, they show a negative correlation.

we can see that NUM_SPM AND PROX_SP has a negative correlation between each other. Negative correlation can deduce that when x increases, y decreases. This shows that when a shopping mall distance is further (increase), the number of shopping mall resident would go is lower (decreases). Hence, they show a negative correlation.

Even though the two variables may have a moderate negative correlation, this does not necessarily imply that the behavior of one has any causal influence on the other. Hence, I decided to keep them.

Building a multiple linear regression

set.seed(99)
resale_train_mlr <- lm(rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m + stry_lvl_h + stry_lvl_v +
                  PROX_BS + PROX_CH + PROX_EL +
                  PROX_FO + PROX_MR + PROX_TO + 
                  PROX_MA + PROX_SP + PROX_HO + NUM_FOO +
                  NUM_CHI + NUM_BUS +
                  NUM_SPM + NUM_SCH,
                data=train_resale)
summary(resale_train_mlr)

Call:
lm(formula = rsl_prc ~ flr_r_s + rmnng_l + stry_lvl_l + stry_lvl_m + 
    stry_lvl_h + stry_lvl_v + PROX_BS + PROX_CH + PROX_EL + PROX_FO + 
    PROX_MR + PROX_TO + PROX_MA + PROX_SP + PROX_HO + NUM_FOO + 
    NUM_CHI + NUM_BUS + NUM_SPM + NUM_SCH, data = train_resale)

Residuals:
    Min      1Q  Median      3Q     Max 
-352809  -52221   -5835   47721  455069 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.828e+05  1.011e+04  27.975  < 2e-16 ***
flr_r_s      2.346e+03  7.884e+01  29.756  < 2e-16 ***
rmnng_l      4.457e+03  4.632e+01  96.226  < 2e-16 ***
stry_lvl_l  -2.064e+05  2.987e+03 -69.085  < 2e-16 ***
stry_lvl_m  -1.798e+05  2.962e+03 -60.715  < 2e-16 ***
stry_lvl_h  -1.505e+05  2.986e+03 -50.410  < 2e-16 ***
stry_lvl_v          NA         NA      NA       NA    
PROX_BS     -1.409e+01  9.991e+00  -1.410 0.158526    
PROX_CH      2.501e+01  6.921e+00   3.613 0.000303 ***
PROX_EL     -2.326e+01  9.092e-01 -25.584  < 2e-16 ***
PROX_FO     -7.040e+01  4.673e+00 -15.067  < 2e-16 ***
PROX_MR     -1.903e+01  1.345e+00 -14.150  < 2e-16 ***
PROX_TO     -1.879e+01  2.960e-01 -63.493  < 2e-16 ***
PROX_MA      2.402e+01  1.720e+00  13.968  < 2e-16 ***
PROX_SP     -1.769e+01  4.708e+00  -3.757 0.000172 ***
PROX_HO      7.465e-01  4.329e-01   1.724 0.084672 .  
NUM_FOO      1.704e+03  3.807e+01  44.758  < 2e-16 ***
NUM_CHI     -3.734e+03  2.962e+02 -12.608  < 2e-16 ***
NUM_BUS     -7.465e+02  1.998e+02  -3.736 0.000188 ***
NUM_SPM      1.557e+03  6.302e+02   2.471 0.013472 *  
NUM_SCH     -8.290e+03  3.987e+02 -20.791  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 78910 on 23636 degrees of freedom
Multiple R-squared:  0.629, Adjusted R-squared:  0.6287 
F-statistic:  2109 on 19 and 23636 DF,  p-value: < 2.2e-16

The R-squared of 0.629 reveals that the simple regression model built is able to explain about 63% of the resale prices.

Since p-value is much smaller than 0.0001, we will reject the null hypothesis that mean is a good estimator of resale price. This will allow us to infer that simple linear regression model above is a good estimator of resale price.

Looking at p-value <0.05, we can see that not all the independent variables are statistically significant, and said variables should be removed. The variables I’ve identified as insignificant are surprisingly, PROX_BS and PROX_HO. We will revised the model by removing those variables which are not statistically significant and are ready to calibrate the revised model by using the code chunk below.

resale_train_mlr_revised <- lm(rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m + stry_lvl_h  +
                  PROX_CH + PROX_EL +
                  PROX_FO + PROX_MR + PROX_TO + 
                  PROX_MA + PROX_SP + NUM_FOO +
                  NUM_CHI + NUM_BUS +
                  NUM_SPM + NUM_SCH,
                data=train_resale)
ols_regress(resale_train_mlr_revised)
                            Model Summary                              
----------------------------------------------------------------------
R                       0.793       RMSE                    78916.388 
R-Squared               0.629       Coef. Var                  15.000 
Adj. R-Squared          0.629       MSE                6227796316.198 
Pred R-Squared          0.628       MAE                     61293.737 
----------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                    Sum of                                                      
                   Squares           DF       Mean Square       F          Sig. 
--------------------------------------------------------------------------------
Regression    2.495146e+14           17      1.467733e+13    2356.745    0.0000 
Residual      1.472126e+14        23638    6227796316.198                       
Total         3.967272e+14        23655                                         
--------------------------------------------------------------------------------

                                          Parameter Estimates                                            
--------------------------------------------------------------------------------------------------------
      model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
--------------------------------------------------------------------------------------------------------
(Intercept)     281742.067     10015.130                  28.132    0.000     262111.767     301372.367 
    flr_r_s       2360.038        78.360        0.125     30.118    0.000       2206.447       2513.628 
    rmnng_l       4444.765        45.883        0.470     96.872    0.000       4354.832       4534.698 
 stry_lvl_l    -206633.466      2984.270       -0.773    -69.241    0.000    -212482.828    -200784.104 
 stry_lvl_m    -180118.191      2958.808       -0.678    -60.875    0.000    -185917.646    -174318.736 
 stry_lvl_h    -150784.027      2984.073       -0.463    -50.530    0.000    -156633.001    -144935.052 
    PROX_CH         24.233         6.901        0.015      3.512    0.000         10.707         37.759 
    PROX_EL        -22.906         0.890       -0.109    -25.727    0.000        -24.651        -21.161 
    PROX_FO        -70.566         4.667       -0.069    -15.119    0.000        -79.714        -61.418 
    PROX_MR        -18.937         1.344       -0.062    -14.089    0.000        -21.572        -16.303 
    PROX_TO        -18.584         0.273       -0.335    -68.081    0.000        -19.119        -18.049 
    PROX_MA         23.417         1.687        0.066     13.884    0.000         20.111         26.722 
    PROX_SP        -17.711         4.704       -0.021     -3.765    0.000        -26.931         -8.490 
    NUM_FOO       1690.104        37.357        0.207     45.242    0.000       1616.883       1763.326 
    NUM_CHI      -3746.632       296.140       -0.057    -12.652    0.000      -4327.085      -3166.180 
    NUM_BUS       -632.164       188.244       -0.014     -3.358    0.001      -1001.136       -263.193 
    NUM_SPM       1563.971       629.696        0.013      2.484    0.013        329.727       2798.215 
    NUM_SCH      -8479.605       388.360       -0.105    -21.834    0.000      -9240.815      -7718.395 
--------------------------------------------------------------------------------------------------------
tbl_regression(resale_train_mlr_revised, intercept = TRUE)
Characteristic Beta 95% CI1 p-value
(Intercept) 281,742 262,112, 301,372 <0.001
flr_r_s 2,360 2,206, 2,514 <0.001
rmnng_l 4,445 4,355, 4,535 <0.001
stry_lvl_l -206,633 -212,483, -200,784 <0.001
stry_lvl_m -180,118 -185,918, -174,319 <0.001
stry_lvl_h -150,784 -156,633, -144,935 <0.001
PROX_CH 24 11, 38 <0.001
PROX_EL -23 -25, -21 <0.001
PROX_FO -71 -80, -61 <0.001
PROX_MR -19 -22, -16 <0.001
PROX_TO -19 -19, -18 <0.001
PROX_MA 23 20, 27 <0.001
PROX_SP -18 -27, -8.5 <0.001
NUM_FOO 1,690 1,617, 1,763 <0.001
NUM_CHI -3,747 -4,327, -3,166 <0.001
NUM_BUS -632 -1,001, -263 <0.001
NUM_SPM 1,564 330, 2,798 0.013
NUM_SCH -8,480 -9,241, -7,718 <0.001
1 CI = Confidence Interval
ols_vif_tol(resale_train_mlr_revised)
    Variables Tolerance      VIF
1     flr_r_s 0.9091436 1.099936
2     rmnng_l 0.6659340 1.501650
3  stry_lvl_l 0.1258227 7.947692
4  stry_lvl_m 0.1265487 7.902098
5  stry_lvl_h 0.1872945 5.339186
6     PROX_CH 0.8888416 1.125060
7     PROX_EL 0.8803703 1.135886
8     PROX_FO 0.7511644 1.331266
9     PROX_MR 0.8075724 1.238279
10    PROX_TO 0.6491926 1.540375
11    PROX_MA 0.6915140 1.446102
12    PROX_SP 0.5257319 1.902110
13    NUM_FOO 0.7519249 1.329920
14    NUM_CHI 0.7615254 1.313154
15    NUM_BUS 0.8869960 1.127401
16    NUM_SPM 0.5345563 1.870710
17    NUM_SCH 0.6735171 1.484743

Since the VIF of the independent variables are less than 10. We can safely conclude that there are no sign of multicollinearity among the independent variables.

Test for Non Linearity

ols_plot_resid_hist(resale_train_mlr_revised)

The figure reveals that the residual of the multiple linear regression model is resemble normal distribution.

Test for Spatial Autocorrelation

mlr.output <- as.data.frame(resale_train_mlr_revised$residuals)
resale.res.sf <- cbind(hdb_resale.sf, 
                        resale_train_mlr_revised$residuals)
resale.sp <- as_Spatial(resale.res.sf)
resale.sp
class       : SpatialPointsDataFrame 
features    : 23656 
extent      : 11519.79, 42645.18, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 18
names       :  ...1,   month,       town, flat_type, block,   street_name, storey_range, floor_area_sqm,    flat_model, lease_commence_date, remaining_lease, resale_price,          address, story_level_low, story_level_mid, ... 
min values  :     1, 2021-01, ANG MO KIO,    4 ROOM,     1,  ADMIRALTY DR,     01 TO 03,             70, Adjoined flat,                1967,            44.5,       250000,   1 CHAI CHEE RD,               0,               0, ... 
max values  : 23656, 2022-12,     YISHUN,    4 ROOM,    9B, YUNG SHENG RD,     49 TO 51,            145,       Type S1,                2019,           97.33,      1370000, 9B BOON TIONG RD,               1,               1, ... 

Now, we will display the distribution of the residuals on an interactive map.

tm_shape(mpsz19)+
  tmap_options(check.and.fix = TRUE) +
  tm_polygons(alpha = 0.4) +
tm_shape(resale.res.sf) +  
  tm_dots(col = "resale_train_mlr_revised.residuals",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode("plot")
train_data_sp <- as_Spatial(train_resale)
train_data_sp
class       : SpatialPointsDataFrame 
features    : 23656 
extent      : 11519.79, 42645.18, 28217.39, 48741.06  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
variables   : 31
names       : X___1,   month,       town, flt_typ, block,       strt_nm,  stry_rn, flr_r_s,       flt_mdl, ls_cmm_, rsl_prc,          address, rmnng_l, stry_lvl_l, stry_lvl_m, ... 
min values  :     1, 2021-01, ANG MO KIO,  4 ROOM,     1,  ADMIRALTY DR, 01 TO 03,      70, Adjoined flat,    1967,  250000,   1 CHAI CHEE RD,    44.5,          0,          0, ... 
max values  : 23656, 2022-12,     YISHUN,  4 ROOM,    9B, YUNG SHENG RD, 49 TO 51,     145,       Type S1,    2019, 1370000, 9B BOON TIONG RD,   97.33,          1,          1, ... 

The weights have a very large influence on the parameter estimation of the geographically weighted regression (GWR). The weights show the relationship between observations or locations in the model. Types of weights that are often used in GWR are Gaussian kernels. This weighting can also be arranged into two forms. There are the fixed Gaussian kernel and the adaptive Gaussian kernel. Fixed is used when each location has the same bandwidth value. Adaptive is used when each location has a different bandwidth value. Adaptive is appropriate if points are irregularly spread – it ensures that there are enough points to calibrate the regression. In this take home, I’ve decided to use adaptive.

#bw_adaptive <- bw.gwr(rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m + #stry_lvl_h + stry_lvl_v  +
#                  PROX_CH + PROX_EL +
#                  PROX_FO + PROX_MR + PROX_TO + 
#                  PROX_MA + PROX_SP + NUM_FOO +
#                  NUM_CHI + NUM_BUS +
#                  NUM_SPM + NUM_SCH,
#                  data=train_data_sp,
#                  approach="CV",
#                  kernel="gaussian",
#                  adaptive=TRUE,
#                  longlat=FALSE)

#saveRDS(bw_adaptive, "bwadaptive.rds")

Read in RDS

bw_adaptive <- read_rds("bwadaptive.rds")

Now, we can go ahead to calibrate the gwr-based hedonic pricing model by using adaptive bandwidth and Gaussian kernel as shown in the code chunk below.

#gwr_adaptive <- gwr.basic(formula = rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + #stry_lvl_m + stry_lvl_h  + stry_lvl_v  +
#                  PROX_CH + PROX_EL +
#                  PROX_FO + PROX_MR + PROX_TO + 
#                  PROX_MA + PROX_SP + NUM_FOO +
#                  NUM_CHI + NUM_BUS +
#                  NUM_SPM + NUM_SCH,
#                         data=train_data_sp,
#                          bw=bw_adaptive, 
#                          kernel = 'gaussian', 
#                          adaptive=TRUE,
#                          longlat = FALSE)
#saveRDS(gwr_adaptive, "gwradaptive.rds")
gwr_adaptive <- read_rds("gwradaptive.rds")
gwr_adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2023-03-20 01:08:29 
   Call:
   gwr.basic(formula = rsl_prc ~ flr_r_s + rmnng_l + stry_lvl_l + 
    stry_lvl_m + stry_lvl_h + PROX_CH + PROX_EL + PROX_FO + PROX_MR + 
    PROX_TO + PROX_MA + PROX_SP + NUM_FOO + NUM_CHI + NUM_BUS + 
    NUM_SPM + NUM_SCH, data = train_data_sp, bw = bw_adaptive, 
    kernel = "gaussian", adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  rsl_prc
   Independent variables:  flr_r_s rmnng_l stry_lvl_l stry_lvl_m stry_lvl_h PROX_CH PROX_EL PROX_FO PROX_MR PROX_TO PROX_MA PROX_SP NUM_FOO NUM_CHI NUM_BUS NUM_SPM NUM_SCH
   Number of data points: 23656
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
    Min      1Q  Median      3Q     Max 
-353910  -52373   -5979   47973  454155 

   Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
   (Intercept)  2.817e+05  1.002e+04  28.132  < 2e-16 ***
   flr_r_s      2.360e+03  7.836e+01  30.118  < 2e-16 ***
   rmnng_l      4.445e+03  4.588e+01  96.872  < 2e-16 ***
   stry_lvl_l  -2.066e+05  2.984e+03 -69.241  < 2e-16 ***
   stry_lvl_m  -1.801e+05  2.959e+03 -60.875  < 2e-16 ***
   stry_lvl_h  -1.508e+05  2.984e+03 -50.530  < 2e-16 ***
   PROX_CH      2.423e+01  6.901e+00   3.512 0.000446 ***
   PROX_EL     -2.291e+01  8.903e-01 -25.727  < 2e-16 ***
   PROX_FO     -7.057e+01  4.667e+00 -15.119  < 2e-16 ***
   PROX_MR     -1.894e+01  1.344e+00 -14.089  < 2e-16 ***
   PROX_TO     -1.858e+01  2.730e-01 -68.081  < 2e-16 ***
   PROX_MA      2.342e+01  1.687e+00  13.884  < 2e-16 ***
   PROX_SP     -1.771e+01  4.704e+00  -3.765 0.000167 ***
   NUM_FOO      1.690e+03  3.736e+01  45.242  < 2e-16 ***
   NUM_CHI     -3.747e+03  2.961e+02 -12.652  < 2e-16 ***
   NUM_BUS     -6.322e+02  1.882e+02  -3.358 0.000786 ***
   NUM_SPM      1.564e+03  6.297e+02   2.484 0.013010 *  
   NUM_SCH     -8.480e+03  3.884e+02 -21.834  < 2e-16 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 78920 on 23638 degrees of freedom
   Multiple R-squared: 0.6289
   Adjusted R-squared: 0.6287 
   F-statistic:  2357 on 17 and 23638 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 1.472126e+14
   Sigma(hat): 78889.69
   AIC:  600649.7
   AICc:  600649.8
   BIC:  577338.5
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 786 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                     Min.     1st Qu.      Median     3rd Qu.       Max.
   Intercept  -1.7389e+06 -1.2155e+05  7.0193e+03  9.9607e+04 3.3940e+05
   flr_r_s     8.6710e+02  1.9806e+03  2.8876e+03  4.1149e+03 1.3994e+04
   rmnng_l    -1.1919e+03  3.3808e+03  4.1239e+03  5.4525e+03 8.4989e+03
   stry_lvl_l -2.3074e+05 -1.2338e+05 -6.7234e+04 -2.9001e+04 1.7166e+06
   stry_lvl_m -2.0529e+05 -8.3954e+04 -4.6785e+04  5.0384e+02 1.8210e+06
   stry_lvl_h -1.7942e+05 -5.6567e+04 -3.1012e+04  1.3712e+04 1.7773e+06
   PROX_CH    -1.3259e+02 -4.0739e+01 -1.6158e+01  7.0767e+00 8.4255e+01
   PROX_EL    -9.3297e+01 -1.1051e+01 -3.5346e+00  3.5163e+00 3.6019e+01
   PROX_FO    -1.3438e+02 -7.3820e+01 -1.3345e+01  1.4598e+01 1.1559e+02
   PROX_MR    -1.5674e+02 -5.4680e+01 -2.5458e+01  3.2652e-01 8.8030e+01
   PROX_TO    -5.4015e+01 -1.9658e+01 -7.3801e+00 -4.4153e-02 3.2364e+01
   PROX_MA    -8.3612e+01 -1.3898e+01  4.9998e+00  2.5821e+01 1.1674e+02
   PROX_SP    -1.4199e+02 -2.6149e+01  9.8640e+00  3.3325e+01 8.8046e+01
   NUM_FOO    -2.7866e+02  2.9871e+02  8.7951e+02  2.3560e+03 5.8807e+03
   NUM_CHI    -9.6189e+03 -3.8823e+03 -2.1064e+03  4.0805e+02 7.2433e+03
   NUM_BUS    -5.1294e+03 -1.1419e+03 -1.1020e+02  1.0291e+03 4.8399e+03
   NUM_SPM    -1.1477e+04 -4.2146e+03 -1.6747e+02  2.7902e+03 1.2560e+04
   NUM_SCH    -1.9326e+04 -4.0084e+03  1.7104e+03  5.7388e+03 1.7434e+04
   ************************Diagnostic information*************************
   Number of data points: 23656 
   Effective number of parameters (2trace(S) - trace(S'S)): 368.6028 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 23287.4 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 575401.7 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 575108.7 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 554029.2 
   Residual sum of squares: 4.949118e+13 
   R-square value:  0.8752514 
   Adjusted R-square value:  0.8732767 

   ***********************************************************************
   Program stops at: 2023-03-20 01:16:42 

Visualizing GWR Output

Condition Number, LocalR2, Predicted, Residual, Coefficient standard error are all stored in a SpatialPointsDataFrame or SpatialPolygonsDataFrame object integrated with fit.points, GWR coefficient estimates, y value, predicted values, coefficient standard errors and t-values in its “data” slot in an object called SDF of the output list. Chapter 13

resale.sf.adaptive <- st_as_sf(gwr_adaptive$SDF) %>%
  st_transform(crs=3414)
 gwr_adaptive.output <- as.data.frame(gwr_adaptive$SDF)
resale.sf.adaptive <- cbind(resale.res.sf, as.matrix(gwr_adaptive.output))

Local R2

tmap_mode("view")
tm_shape(mpsz19)+
  tm_polygons(alpha = 0.1) +
tm_shape(resale.sf.adaptive) +  
  tm_dots(col = "Local_R2",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(11,14))

We can observe that majority of the HDB flats have Local R-squared values are within the range of 0.6 to 1. A high local R2 value indicates that the independent variables explain a large proportion of the variance in the dependent variable within a specific spatial unit, while a low local R2 value indicates that the independent variables are not very good at explaining the variation within the spatial unit. Refer to the map, the central area has a higher range and the east area has a lower range. This shows that the independent variable are not as good at explaining the variation in the central region.

With curiosity, lets narrow down to the east and north east region, to find out why there is a poor variation. But before that, we can extract out the area first.

tmap_mode("plot")
tm_shape(mpsz19[mpsz19$REGION_N=="NORTH-EAST REGION", ])+
  tm_polygons()+
tm_shape(resale.sf.adaptive) + 
  tm_bubbles(col = "Local_R2",
           size = 0.15,
           border.col = "gray60",
           border.lwd = 1)

tm_shape(mpsz19[mpsz19$REGION_N=="EAST REGION", ])+
  tm_polygons()+
tm_shape(resale.sf.adaptive) + 
  tm_bubbles(col = "Local_R2",
           size = 0.15,
           border.col = "gray60",
           border.lwd = 1)

train_resale_narrow <- st_join(train_resale,mpsz19 ,
                          by = c("PLN_AREA_N" = "town")) %>% filter (REGION_N =="NORTH-EAST REGION" | REGION_N=="EAST REGION")
test_resale_narrow <- st_join(test_resale,mpsz19 ,
                          by = c("PLN_AREA_N" = "town")) %>% filter (REGION_N =="NORTH-EAST REGION" | REGION_N=="EAST REGION")

Drop columns

train_resale_narrow <- train_resale_narrow[ -c(32:37) ]
test_resale_narrow <- test_resale_narrow[ -c(32:37) ]

Preparing Coordinates data

#coords_train <- st_coordinates(train_resale)
#coords_test <- st_coordinates(test_resale)

coords_train <- st_coordinates(train_resale_narrow)
coords_test <- st_coordinates(test_resale_narrow)
#write_rds(coords_train, "coords_train.rds" )
#write_rds(coords_test, "coords_test.rds" )

Dropping geometry field

train_data <- train_resale_narrow %>% 
  st_drop_geometry()

Calibrating Random Forest Model

set.seed(999)
rf <- ranger(rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m + stry_lvl_h + stry_lvl_v  +
                  PROX_CH + PROX_EL +
                  PROX_FO + PROX_MR + PROX_TO + 
                  PROX_MA + PROX_SP + NUM_FOO +
                  NUM_CHI + NUM_BUS +
                  NUM_SPM + NUM_SCH,
             data=train_data)
print(rf)
Ranger result

Call:
 ranger(rsl_prc ~ flr_r_s + rmnng_l + stry_lvl_l + stry_lvl_m +      stry_lvl_h + stry_lvl_v + PROX_CH + PROX_EL + PROX_FO + PROX_MR +      PROX_TO + PROX_MA + PROX_SP + NUM_FOO + NUM_CHI + NUM_BUS +      NUM_SPM + NUM_SCH, data = train_data) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      10569 
Number of independent variables:  18 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       1158840275 
R squared (OOB):                  0.8292815 

Calibrating Geographical Random Forest Model

Calculating Bandwidth

#gwRF_bw <- grf.bw(formula = rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m + #stry_lvl_h  +
#                  PROX_CH + PROX_EL +
#                  PROX_FO + PROX_MR + PROX_TO + 
#                  PROX_MA + PROX_SP + NUM_FOO +
#                  NUM_CHI + NUM_BUS +
#                  NUM_SPM + NUM_SCH,
#                  trees = 50,
#                  nthreads = 1,
#                 data = train_data,
#                 kernel = "adaptive",
#                 coords = coords_train)

After letting it run for more than 2 days , I decided to stop and pick the highest R-square of the bandwidth. Below are the image that I have captured what I have ran previously.

The above output, the highest R2 of the Local model would be Bandwidth of 1202. Hence, I will be using this value for the random forest predict.

The code chunk below calibrate a geographic ranform forest model by using grf() of SpatialML package. Previously, we have calculated the adjusted bandwidth, we will use that value for our grf() function. We decided to use adaptive and should be consistent for the kernel.

#set.seed(99)
#gwRF_adaptive <- grf(formula = rsl_prc ~ flr_r_s+ rmnng_l + stry_lvl_l + stry_lvl_m #+ stry_lvl_h +
#                  PROX_CH + PROX_EL +
#                  PROX_FO + PROX_MR + PROX_TO + 
#                  PROX_MA + PROX_SP + NUM_FOO +
#                  NUM_CHI + NUM_BUS +
#                  NUM_SPM + NUM_SCH,
#                     dframe=train_data,
#                      ntree=50,
#                     bw = 1202,
#                     kernel="adaptive",
#                     coords=coords_train
#                )

#write_rds(gwRF_adaptive, "gwRF_adaptive_narrow.rds")
gwRF_adaptive <- read_rds("gwRF_adaptive_narrow.rds")

Using the grf function gives us information about the model fit, including the number of trees in the forest and the out-of-bag (OOB) error rate. The OOB error rate is an estimate of the generalization error of the model, which is how well it is expected to perform on new, unseen data. In this model, an R-squared value of 0.66 indicates a moderate-to-strong relationship between the independent variables and the dependent variable, as it implies that 66% of the variance in the dependent variable is explained by the independent variables.

Predicting by using test data

test_data <- cbind(test_resale_narrow, coords_test) %>%
  st_drop_geometry()
#gwRF_pred <- predict.grf(gwRF_adaptive, 
#                           test_data, 
#                           x.var.name="X",
#                           y.var.name="Y", 
#                           local.w=1,
#                           global.w=0)
#write_rds(gwRF_pred, "GRF_pred3.rds")

Converting the predicting output into a data frame

GRF_pred <- read_rds("GRF_pred3.rds")
GRF_pred_df <- as.data.frame(GRF_pred)
test_data_p <- cbind(test_data, GRF_pred_df)

Calculating Root Mean Square Error

rmse(test_data_p$rsl_prc, 
     test_data_p$gwRF_pred)
[1] NaN

Visualizing the predicted values

ggplot(data = test_data_p,
       aes(x = GRF_pred,
           y = rsl_prc)) +
  geom_point()

Conclusion

From this analysis, we can see that the prices of 4-room resale flats in Singapore have generally increased over the years due to a variety of factors. The growing demand for housing in Singapore, which has led to higher prices for all types of homes, including 4-room resale flats. Additionally, the location of the flat can also impact its price, with flats located in more desirable areas commanding higher prices.

We also derived from the result that BUKIT TIMAH’S resale price has decrease from 4th position to 7th position. MARINE PARADE is not seen in the 10th position and it’s over taken by ANG MO KIO.

Surprisingly, The variables insignificant are PROX_BS and PROX_HO and has been removed.Based on our locational factors under milticolinearity, we can see that all the values are below 0.8 hence, showing no signs of multicolinearity.

The most important factor based on the gwRF_adaptive would be PROX_TO (top schools), followed by NUm_SPM (number of shopping malls within 350m) .

Interestingly, as we compare the story level, we can see that middle floors has the highest importance score than low, high and v high. We can also say that resale price for middle level are also an important factor to price.

We had our R-squared value of 0.66 indicates a moderate-to-strong relationship between the independent variables and the dependent variable, as it implies that 66% of the variance in the dependent variable is explained by the independent variables.

However, the acceptability of an R-squared value may vary depending on the research field, the type of data being analyzed, and the level of complexity of the model. In some fields, an R-squared value of 0.6 may be considered quite good, while in others it may be considered relatively weak. In our case, I think that it is still fine.

Reason being, I think that it’s also important to note R-squared should not be used as the sole metric for evaluating the performance of a model. Other factors, such as the significance of the independent variables, the appropriateness of the model assumptions, and the predictive accuracy of the model, should also be considered.

There are a some independent variable that are unable to measure, such as:

  1. Market conditions such as supply and demand, interest rates, and economic factors can also influence resale price. When the demand for properties is high and the supply is low, resale prices tend to be higher, while in a sluggish market, resale prices may be lower.
  2. The reputation of the property developer can also impact the resale price. Developers with a good reputation for building high-quality properties tend to command higher resale values.
  3. Properties with desirable views, such as waterfront or city skyline views, can command higher resale prices.
  4. The presence of upcoming developments such as new MRT stations, shopping centers, or other amenities can impact the resale value of nearby properties.
  5. The quality of finishes and materials used in the property can also affect resale value. Properties with high-end finishes such as marble flooring or designer fixtures tend to command higher resale values or properties that have room for improvement or expansion may be more attractive to buyers and command higher resale values.
  6. Seller’s price decision can also affect the resale value of a property. When a seller prices a property too high, it may take longer to sell or may not sell at all. This can lead to a perception in the market that the property is overpriced, which can negatively impact the property’s resale value.

Suggestion for future work

We realized that there are a higher population staying in the east area, and most of the top schools are located around the central area. Where top schools are being the most important variable a few ways we can boost the reputation of the school. However increasing the number of top schools in Singapore can be a complex a issue that involves various stakeholders, including the government, educators, parents, and students. Strategies that could potentially help increase the number of top schools.

  1. Increasing funding for education: Providing sufficient funding for schools can help ensure that they have access to the resources and facilities necessary to provide high-quality education. This could include funding for teacher salaries, classroom resources, and technology infrastructure.

  2. Encouraging parents to be actively involved in their children’s education can help create a more supportive learning environment and improve student outcomes. This could involve initiatives such as parent-teacher conferences, volunteering opportunities, and parent education programs.

  3. Creating a culture that values academic excellence and rewards high performance can motivate students to strive for academic success. This could include recognizing and celebrating student achievements, offering scholarships and other incentives for academic excellence, and encouraging healthy competition among students.

  4. Encouraging collaboration and knowledge-sharing among schools can help promote best practices and improve overall educational outcomes. This could involve initiatives such as teacher exchanges, joint research projects, and shared training opportunities. Example: Hwa Chong X Tampines