Home price prediction can range from extremely simple, such as rough ballpark figures for homes in your neighbourhood, to extremely complex, such as predicting the national trend for home prices in a country. It is a classic Machine Learning problem, one used by Real Estate professionals the world over.
Most of the factors that affect home prices in the USA have already been covered in a previous article. This is a summary and description of the factors that can be used to build the prediction model, and their relevance, separated into supply-side and demand-side factors. Some of the measures contribute to prices via both routes, supply and demand. The model accounts for both effects by definition.
The bigger concern in this modelling is the availability of historic data (in the public domain) for all factors over the same period and at the same frequency (monthly or quarterly). As such, the factors below are the most relevant ones that passed this filter and were selected for building the prediction model.
Supply Side Factors That Affect Home Prices
Various factors that affect the supply of homes available for sale are discussed below:
Months of Supply
Months of supply is the basic measure of supply itself in the real estate market (not a factor as such). Houses for sale is another measure of the same.
Months of Supply: https://fred.stlouisfed.org/graph/?g=zneA
Homes For Sale: https://www.census.gov/construction/nrs/historical_data/index.html
Differential Migration Across Cities
The differential migration across cities could possibly be measured directly via change-of-address requests, but since that data is not readily available, the total number of residence moves can be used instead. What this does not reflect, however, is the change in the pattern of movement. A move can be from rural or suburban areas to cities, or the other way round, and the two have very different impacts on the housing market. So, net domestic migration into or out of metropolises is a better measure of differential migration, and hence it has been taken as a parameter along with the total number of movers.
Unemployment can also affect both demand and supply in the real estate industry. A high unemployment rate can mean that people simply do not have the money to spend on houses. It can also mean that there is lower investment in the industry and hence lower supply.
Data Source: https://fred.stlouisfed.org/series/UNRATE
Mortgage rates are a huge factor in deciding how well the real estate market will perform. They plug into both the supply and the demand side of the equation. They affect the availability of financing options to buyers, as well as the ease of financing new construction. They also affect the delinquency rate and the number of mortgage refinances. People are more likely to default on a higher mortgage rate!
Data Source: https://fred.stlouisfed.org/graph/?g=zneW
Federal Funds Rate
Although the mortgage rate and the Federal Funds Rate are usually closely related, sometimes they are not. Historically, there have been occasions when the Fed lowered the Fed Funds Rate, but the banks did not lower mortgage rates, or not in the same proportion. Moreover, the Federal Funds Rate influences multiple factors in the economy beyond the real estate market, and many of those factors indirectly influence the real estate market in turn. It is a key lever for changing the way an economy is performing.
Data Source: https://fred.stlouisfed.org/series/DFF#0
The GDP is a measure of output of the economy overall, and the health of the economy. An economy that is doing well usually implies more investment and economic activity, and more buying.
The number of building permits issued is a measure of not just the health of the real estate industry, but effectively also of how free the real estate market is. It is an indicator of the extent of regulation or de-regulation of the market. It affects supply through the ease of putting a new property on the market.
Data Source: https://www.census.gov/construction/bps/
Housing starts are a measure of the number of units of new housing projects started in a given period. Sometimes they are also measured as the valuation of housing projects started in a given period.
Construction spending (in millions of USD, seasonally adjusted) is a measure of the activity in the construction industry, and an indicator of supply for future months. It can also be taken as a measure of confidence, since home builders will spend money on construction only if they expect the industry to do well in the coming months.
Demand Side Factors That Affect Home Prices
Demand for housing, and specifically, home ownership, is affected by many factors, some of which are closely inter-related. Many of these factors also affect the supply in housing market. Below are a few factors that are prominent in influencing the demand for home buying:
Affordability: Wages & Disposable Personal Income
The “weekly earnings” are taken as a measure of the overall wages and earnings of all employed persons.
The other measure is disposable personal income: how much of the earning is actually available to an individual for expenditure. This is an important measure as well, as it takes into account other factors like taxes etc.
Median usual weekly nominal earnings, Wage and salary workers 25 years and over: https://fred.stlouisfed.org/series/LEU0252887700Q#0
Real Disposable Personal Income: https://fred.stlouisfed.org/series/DSPIC96#0
Availability of Finance: Mortgage Rates
Delinquency Rate on Mortgages
The delinquency rate on housing mortgages is an indicator of the number of foreclosures in real estate. This is an important factor in both demand and supply. A higher delinquency rate (higher than the credit card delinquency rate) in the last economic recession was a key indicator of the recession, of the poorly performing industry and of the economy as a whole. It also indicates how feasible it is for a prospective homeowner to buy a house at a certain point in time, and is an indicator of the overall demand in the industry.
Data Source: https://fred.stlouisfed.org/series/DRSFRMACBS#0
The extent to which people are utilizing their personal income for savings matters in overall investments and capital availability, and the interest rate for loans (and not just the mortgage rate). It is also an indicator of how much the current population is inclined to spend their money, vs save it for future use. This is an indicator of the demand for home ownership as well.
Personal Saving: https://fred.stlouisfed.org/series/PMSAVE
Behavioural Changes & Changes in Preferences
Changes in home ownership indicate a combination of factors including change in preferences and attitudes of people towards home buying. Change in cultural trends can only be captured by revealed preferences, and this metric can be taken as a revealed metric for propensity for home buying.
The other metric to track changes in preferences is personal consumption expenditure. For example, if expenditure is increasing, but there is no corresponding increase in homeownership, it would indicate a change in preferences towards home buying and ownership. Maybe people prefer to rent a home rather than buy one. Hence, both of these parameters are used.
Building The Model
The S&P Case-Shiller Housing Price Index is taken as the y variable, or dependent variable, as an indicator of change in prices. All the above factors make up the independent variables X1, X2, X3 … XN, which together form the X vector in a multiple linear regression model with N predictors. In a typical prediction framework, how a variable affects the dependent variable is not relevant; what matters is only whether it does or does not affect it. As such, many of these factors are possibly endogenous and/or affect the prices in multiple ways: that mechanism is not the subject of study in this model. We are currently looking only at the overall effect of the variables on price trends.
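As a sketch of the model described above, the regression can be written out as follows. The coefficient symbols (beta) and the error term (epsilon) are notation assumed here for illustration; they do not appear in the original description.

```latex
y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_N X_N + \varepsilon
```

Here y is the Housing Price Index, each X_i is one of the factors listed above, and fitting the model means estimating the intercept and the N coefficients from the training data.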
Data Cleaning With Pandas
One of the challenges of dealing with this acquired data is that some of the metrics are reported as monthly averages, whereas many others are computed as quarterly averages.
For my dataset,
A) In one set, I have run the regression with fewer parameters, using only monthly data.
B) In another set, I have down-sampled the monthly data to quarterly, and up-sampled the annually computed data to quarterly. So, in the final computation, all data is quarterly.
The data density is better in set A, but there are more variables in set B. So it is worthwhile to check both the models, and check the results.
The data we are working with is time series data. So, all the time-date data in every variable must be converted to Python’s datetime format, for it to be read as time-date data and processed as such when up-sampling or down-sampling. This also helps with merging the different series together. The regression itself does not run on time-series data, so the datetime columns are removed in the final data for the regression.
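The steps above can be sketched on a toy example. The values below are made up, and the column names only mimic the ones used later in this article; this is a minimal illustration of parsing dates, down-sampling to quarters, merging and dropping the date columns, not the exact pipeline used for the real data.

```python
import pandas as pd

# Hypothetical monthly series with day-first date strings (values are made up).
monthly = pd.DataFrame({
    "Period": ["01-01-2020", "01-02-2020", "01-03-2020",
               "01-04-2020", "01-05-2020", "01-06-2020"],
    "UNEMP": [3.0, 4.0, 5.0, 14.0, 13.0, 12.0],
})
# Hypothetical quarterly series keyed by the first day of each quarter.
quarterly = pd.DataFrame({
    "Quarter Starting": ["01-01-2020", "01-04-2020"],
    "Real GDP": [100.0, 90.0],
})

# Convert the date strings to datetime values so pandas can resample them.
monthly["Period"] = pd.to_datetime(monthly["Period"], format="%d-%m-%Y")
quarterly["Quarter Starting"] = pd.to_datetime(
    quarterly["Quarter Starting"], format="%d-%m-%Y"
)

# Down-sample the monthly values to quarter-start averages.
monthly_q = monthly.set_index("Period").resample("QS").mean().reset_index()

# Merge the two series on the quarter start date.
merged = pd.merge(
    monthly_q, quarterly, left_on="Period", right_on="Quarter Starting"
)

# Drop the datetime columns before handing the data to the regression.
features = merged.drop(columns=["Period", "Quarter Starting"])
print(features)
```

Merging on the parsed dates, rather than on raw strings, avoids mismatches caused by differently formatted date columns in the source files.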
Dealing with missing values is another common problem in data cleaning. In my dataset, most of the data is official government data for key economic indicators, and hence there are very few “missing values”. The ones that are missing are replaced with the average of the values immediately before and after them. In addition, I have taken the training data only for the period in which all the variables have data available, namely 2000 to 2020. To up-sample yearly data, the value for the year is divided by 4 and applied to each quarter. This applies mostly to the migration data, and it seems to be a fair way to approximate the quarterly value, even if it is not very accurate.
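Both cleaning steps can be illustrated with a short sketch. The numbers are invented, and linear interpolation is used here as a stand-in for neighbour averaging, since for a single missing point the two give the same result; the actual cleaning may have been done differently.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly series with one missing observation.
hpi = pd.Series(
    [100.0, np.nan, 104.0, 106.0],
    index=pd.date_range("2019-01-01", periods=4, freq="QS"),
)
# For a single gap, linear interpolation equals the mean of the two neighbours.
hpi_filled = hpi.interpolate(method="linear")

# Hypothetical annual totals (for example, number of movers), keyed by year.
annual_movers = pd.Series({2019: 400.0, 2020: 360.0})

# Spread each annual total evenly over the four quarters of that year.
rows = []
for year, total in annual_movers.items():
    for quarter_start in pd.date_range(f"{year}-01-01", periods=4, freq="QS"):
        rows.append({"Quarter": quarter_start, "Interpolated Movers": total / 4})
movers_q = pd.DataFrame(rows)

print(hpi_filled)
print(movers_q)
```

Dividing by 4 suits totals that accumulate over the year, such as counts of movers, because the four quarterly values then add back up to the annual figure.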
Multiple Linear Regression for Prediction
Exploratory Data Analysis is required to assess what type of regression is needed for building the model. Although using Linear or Multiple Linear Regression is traditional, it is still important to first look at the data and how it is spread out. It also helps eliminate variables which are closely correlated and redundant.
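One way to spot closely correlated variables is to scan the correlation matrix for pairs above a threshold. The sketch below uses synthetic data and an arbitrary cut-off of 0.9; neither the data nor the threshold comes from this article.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature table; values are illustrative only.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
x = pd.DataFrame({
    "Permits-Number": base,
    "Housing Starts": base + rng.normal(scale=0.05, size=200),  # near-duplicate
    "Mortgage Rate": rng.normal(size=200),
})

# Absolute pairwise correlations between all columns.
corr = x.corr().abs()

# Keep only the upper triangle so that each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# 0.9 is an arbitrary threshold chosen for this sketch.
high = pairs[pairs > 0.9]
print(high)
```

Any pair that shows up in the result is a candidate for dropping one of the two variables, or at least for a closer look in the pair plots.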
These are the correlation plots and the code for the multivariate linear regression model for Set A, with monthly data and fewer variables:
The code for the model is below:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

date_col1 = ['Period']
date_col2 = ['Quarter Starting']
date_col3 = ['Year']
date_col4 = ['Quarter']

x_monthly = pd.read_csv('x_monthly.csv', parse_dates=date_col1, dayfirst=True)  # Monthly Data
y_monthly = pd.read_csv('y.csv', parse_dates=date_col1, dayfirst=True)  # Monthly y Data

print(x_monthly.dtypes)
print(y_monthly.dtypes)

x = x_monthly[['UNEMP', 'CONST', 'Months of Supply', 'Mortgage Rate',
               'Permits-Number', 'Permits-Valuation', 'Housing Starts',
               'Consumption', 'Disposable Income', 'Savings', 'Fed Funds Rate',
               'Homes for Sale', 'Homes Sold']]  # Removing date column
y = y_monthly['HPI']  # Removing date column

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.35, shuffle=False, stratify=None)

reg = LinearRegression()
reg.fit(x_train, y_train)
y_predict = reg.predict(x_test)
These are the correlation plots and the code for the multivariate linear regression model for Set B, with quarterly data and more variables:
The code for the model is below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

date_col1 = ['Period']
date_col2 = ['Quarter Starting']
date_col3 = ['Year']
date_col4 = ['Quarter']

x_monthly = pd.read_csv('x_monthly.csv', parse_dates=date_col1, dayfirst=True)  # Monthly Data
x_quarterly = pd.read_csv('x_quarterly.csv', parse_dates=date_col2, dayfirst=True)  # Quarterly Data
x_q = pd.read_csv('x_annually_quarterwise_interpolated.csv', parse_dates=date_col4, dayfirst=True)  # Upsampled yearly data
y_monthly = pd.read_csv('y.csv', parse_dates=date_col1, dayfirst=True)  # Monthly y Data

# Use the date column as the index (and drop it as a column) so that
# resample() averages only the numeric columns
x_monthly.set_index('Period', inplace=True)
y_monthly.set_index('Period', inplace=True)

print(x_monthly.dtypes)
print(x_quarterly.dtypes)
print(x_q.dtypes)
print(y_monthly.dtypes)
print(x_monthly.head())
print(x_quarterly.head())

x1 = x_monthly.loc['2000-01-01':'2020-06-01']
y1 = y_monthly.loc['2000-01-01':'2020-06-01']

x_monthly_downsampled = x1.resample('QS').mean()  # Resampling monthly data to quarterly
y_downsampled = y1.resample('QS').mean()  # Resampling y data to quarterly
print(y_downsampled.tail(15))

y_downsampled.to_csv('y_q.csv', index=False)  # Removing date field
y = pd.read_csv('y_q.csv')
print(y.tail(15))  # Checking the last few values

# 'Period' is the index level of the down-sampled monthly data
x2 = pd.merge(x_monthly_downsampled, x_quarterly, left_on='Period', right_on='Quarter Starting')
x_merged = pd.merge(x2, x_q, left_on='Quarter Starting', right_on='Quarter')
print(x_merged.tail(15))  # Checking the last few values

x = x_merged[['UNEMP', 'CONST', 'Months of Supply', 'Mortgage Rate',
              'Permits-Number', 'Permits-Valuation', 'Housing Starts',
              'Consumption', 'Disposable Income', 'Savings', 'Fed Funds Rate',
              'Homes for Sale', 'Homes Sold', 'Real GDP', 'Earning',
              'Interpolated Movers', 'Interpolated Migration']]
print(x.head(15))

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.35, shuffle=False, stratify=None)

sb.set_style("darkgrid")
sb.set_palette("colorblind")
sb.pairplot(x_train)
plt.figure()  # New figure so the heatmap is not drawn onto the pair plot
sb.heatmap(x_train.corr())

reg = LinearRegression()
reg.fit(x_train, y_train)
y_predict = reg.predict(x_test)

print(reg.score(x_train, y_train))
print(reg.score(x_test, y_test))

plt.figure(figsize=(8, 8))
sb.regplot(x=y_test, y=y_predict, ci=None, color="Blue")
plt.xlabel("Actual HPI")
plt.ylabel('Predicted HPI')
plt.title("Actual vs Predicted HPI: Quarterly Data")

print(reg.coef_)
print('Training Score: ', reg.score(x_train, y_train))
print("Testing Score: ", reg.score(x_test, y_test))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_predict))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_predict))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_predict)))
print(y_test.describe())
This model has fewer data points, but has more variance. The code can be run and checked here.
Predicted Prices Vs Actual Prices
There are a few diagnostic measures to check whether the model has worked well: the R² coefficient and the root mean squared error are two such measures.
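For reference, these two measures have the standard definitions below, where y_i are the actual values, ŷ_i the predicted values, ȳ the mean of the actual values and n the number of test observations. The symbols are standard notation introduced here, not taken from the article.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```

In scikit-learn, the score method of LinearRegression returns R², and the RMSE is the square root of the value returned by metrics.mean_squared_error.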
Checking key parameters for diagnostics:
print(reg.coef_)
print(reg.intercept_)
print(reg.score(x_train, y_train))
print(reg.score(x_test, y_test))
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_predict))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_predict))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_predict)))
print(y_test.describe())
The output of the code is below:
[ 2.38632297e+00  6.57105504e-05  5.75060890e-01 -1.87444947e+00
  7.55007994e-04  4.52413371e-05 -5.14923238e-03  7.51528073e-03
  6.50880558e-03 -1.10356135e-02  5.74753885e-01  1.02823591e-01
  7.83698294e-03]
-58.42103376093107
0.9947681575487934
0.88282685264113
Mean Absolute Error: 3.8662503231695493
Mean Squared Error: 45.648884779397825
Root Mean Squared Error: 6.75639584241464
count     88.000000
mean     188.701477
std       19.851016
min      155.610000
25%      170.757500
50%      187.760000
75%      206.117500
max      229.410000
Name: HPI, dtype: float64
The Root Mean Squared Error is less than 10% of the mean value of HPI (about 3.6%), and much less than the standard deviation of the HPI in the test data. This indicates that the model is fairly accurate in predicting values. The R² value is 0.883 for the testing data, which is normally considered a good value.
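The comparison above is simple arithmetic on the reported numbers, and can be checked with a few lines. The variable names are illustrative; the values are copied from the Set A output.

```python
# Values copied from the Set A output above.
rmse = 6.75639584241464
hpi_mean = 188.701477
hpi_std = 19.851016

# RMSE relative to the mean HPI in the test data.
rmse_share_of_mean = rmse / hpi_mean
# RMSE relative to the spread (standard deviation) of HPI in the test data.
rmse_share_of_std = rmse / hpi_std

print(f"RMSE / mean: {rmse_share_of_mean:.1%}")
print(f"RMSE / std:  {rmse_share_of_std:.2f}")
```

A ratio well below 1 against the standard deviation means the model's typical error is much smaller than the natural variation in the index over the test period.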
Some of the discrepancy in the predictions of the few months of 2020 is expected, since many factors beyond normal reckoning affected home prices, often disproportionate to usual behaviour. From the plot of actual vs predicted values, the prediction seems to be fairly accurate.
Checking key parameters for diagnostics:
The code is the same (variables are named the same way in both sets). Output of the code for set B is as below:
[[ 2.32029236e+00  5.53239900e-05  9.25831693e-01 -1.84288542e+00
   4.79069632e-03  2.28540865e-04 -6.40410532e-03 -1.08344186e-04
   9.18498039e-03 -1.28937506e-02  1.04188222e+00  9.61878477e-02
   7.64612172e-03  3.02200726e-03  1.05100115e-01 -5.23794649e-05
   1.90974370e-02]]
Training Score:  0.9974713059879058
Testing Score:  0.9600042054312886
Mean Absolute Error: 3.173772313159812
Mean Squared Error: 14.8379529523124
Root Mean Squared Error: 3.8520063541370746
              HPI
count   29.000000
mean   185.831264
std     19.601974
min    152.863333
25%    169.236667
50%    184.776667
75%    203.663333
max    217.820000
There are fewer outliers in this model, the R² coefficient is higher, and the RMS error is lower. So, even though the data density of set B is lower, it proves to be the better model, as it contains more variables that impact home prices.
This is a very simple multiple linear model, which uses real world time-series data. For more sophisticated prediction, complex models can be used, with other estimators. However, in this case, the MLR seems to work as required.
Agarwal, Arti, 2021, “Economic Factors”, https://doi.org/10.7910/DVN/RWEWKL, Harvard Dataverse, V1, UNF:6:i4EQ5K1eanomEUZHZTokgA== [fileUNF]