Improving Zillow's Zestimate with 36 Lines of Code
Eduardo Ariño de la Rubia | 2017-06-08 | 3 min read
Zillow and Kaggle recently started a $1 million competition to improve the Zestimate. We used H2O’s AutoML to generate a solution.
The new Kaggle Zillow Prize competition received a significant amount of press, and for good reason: Zillow has put $1 million on the line for anyone who can improve the accuracy of its Zestimate feature, Zillow's estimate of a home's value. As the contest description states, improving this estimate would more accurately reflect the value of nearly 110 million homes in the US!
We built a project as a quick and easy way to leverage some of the amazing technologies being built by the data science community. The project contains a script, take_my_job.R, which uses the amazing H2O AutoML framework.
H2O’s machine learning library is an industry leader, and their latest foray into bringing AI to the masses is the AutoML functionality. With a single function call, it trains many models in parallel, ensembles them together, and builds a powerful predictive model.
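If you have never seen the API, here is a minimal sketch of what that single call looks like on a toy dataset (the mtcars frame and the 60-second cap are my own illustration, not part of the contest script):

library(h2o)
h2o.init()

# A small regression problem just to exercise the API.
df <- as.h2o(mtcars)

# One call searches over many algorithms; max_runtime_secs bounds the search.
aml <- h2o.automl(y = "mpg", training_frame = df, max_runtime_secs = 60)
print(aml@leaderboard)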
The script is just 36 lines:
library(data.table)
library(h2o)

data_path <- Sys.getenv("DOMINO_EARINO_ZILLOW_HOME_VALUE_PREDICTION_DATA_WORKING_DIR")

properties_file <- file.path(data_path, "properties_2016.csv")
train_file <- file.path(data_path, "train_2016.csv")

properties <- fread(properties_file, header = TRUE, stringsAsFactors = FALSE,
                    colClasses = list(character = 50))
train <- fread(train_file)
properties_train <- merge(properties, train, by = "parcelid", all.y = TRUE)
In these first 12 lines, we set up our environment and import the data as R data.table objects. On line 4 we use Domino's environment variable functionality so we don't have to hardcode any paths in the script; hardcoded paths tend to break the moment code moves to another machine.
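As a side note, Sys.getenv takes an unset argument, so a hedged variant of line 4 could fall back to a local path when the Domino variable is absent (the fallback path below is hypothetical):

# Falls back to a hypothetical local directory when not running on Domino.
data_path <- Sys.getenv("DOMINO_EARINO_ZILLOW_HOME_VALUE_PREDICTION_DATA_WORKING_DIR",
                        unset = "~/data/zillow")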
On line 12, we are creating the training set by merging the properties file with the training dataset, which contains the logerror column we will be predicting.
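To make the all.y = TRUE semantics concrete, here is a tiny, self-contained example (toy columns of my own invention): only parcels that appear in the training data survive the join.

library(data.table)

props  <- data.table(parcelid = 1:3, sqft = c(900, 1200, 1500))
labels <- data.table(parcelid = c(2, 3), logerror = c(0.01, -0.02))

# all.y = TRUE keeps every labeled row, dropping unlabeled parcels.
merge(props, labels, by = "parcelid", all.y = TRUE)
#    parcelid sqft logerror
# 1:        2 1200     0.01
# 2:        3 1500    -0.02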
h2o.init(nthreads = -1)

Xnames <- names(properties_train)[which(names(properties_train) != "logerror")]
Y <- "logerror"

dx_train <- as.h2o(properties_train)
dx_predict <- as.h2o(properties)

md <- h2o.automl(x = Xnames, y = Y,
                 stopping_metric = "RMSE",
                 training_frame = dx_train,
                 leaderboard_frame = dx_train)
This block of code is all it takes to leverage H2O’s AutoML infrastructure!
On line 14 we initialize H2O to use as many threads as the machine has cores. Lines 16 and 17 set up the names of the predictor and response variables. On lines 19 and 20 we upload our data.table objects to H2O (a step we could have avoided by loading the files directly with h2o.importFile in the first place). On lines 22-25 we tell H2O to build the best model it can on the training dataset, using RMSE as the early-stopping metric.
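Once the run finishes, a couple of standard H2O accessors let you see what AutoML actually built (this inspection step is mine; it is not part of the 36-line script):

print(md@leaderboard)                      # every model trained, ranked by the leaderboard metric
best <- md@leader                          # the top-ranked model
h2o.performance(best, newdata = dx_train)  # metrics on the training frame, including RMSE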
properties_target <- h2o.predict(md@leader, dx_predict)
predictions <- round(as.vector(properties_target$predict), 4)

result <- data.frame(cbind(properties$parcelid, predictions, predictions * .99,
                           predictions * .98, predictions * .97, predictions * .96,
                           predictions * .95))
colnames(result) <- c("parcelid", "201610", "201611", "201612", "201710", "201711", "201712")

options(scipen = 999)
write.csv(result, file = "submission_automl.csv", row.names = FALSE)
Lines 27-36 are our final bit of prediction and bookkeeping. On line 27 we predict our responses using the trained AutoML leader. We then round the predictions to 4 decimal places, build the result data.frame, set the column names, and write it out as a CSV.
The only bit of trickery I added was to shrink the predicted logerror by an additional 1% for each later month's column, on the assumption that Zillow's team is continually making its models a little better.
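For what it's worth, the same decay trick can be written as a loop over the six month columns; this vectorized variant (my own rewrite, not the original script) makes the (1 - 0.01 * k) pattern explicit:

# Column k (k = 0..5) scales the predictions by (1 - 0.01 * k).
decay <- sapply(0:5, function(k) round(predictions * (1 - 0.01 * k), 4))
result <- data.frame(properties$parcelid, decay)
colnames(result) <- c("parcelid", "201610", "201611", "201612",
                      "201710", "201711", "201712")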
With no input from me whatsoever, this script builds a model that earns a public leaderboard score of 0.0673569. Not amazing, but remarkable considering I haven't even looked at the data. Bringing together H2O's algorithms with flexible, scalable compute and easy environment configuration on Domino made this project quick and easy!
Wrap Up
While hand-built solutions are scoring significantly better than this one on the Kaggle leaderboard, it’s still exciting that a fully automated solution does reasonably well. The future of fully automated data science is exciting, and we can’t wait to keep supporting the amazing tools the community develops!
Eduardo Ariño de la Rubia is a lifelong technologist with a passion for data science who thrives on effectively communicating data-driven insights throughout an organization. A student of negotiation, conflict resolution, and peace building, Ed is focused on building tools that help humans work with humans to create insights for humans.