I have a dataset of house prices from 2000 to 2016 from several U.S. cities. From this I want to predict the price of similar houses at the current date. Is this better addressed as a regression problem or as a time series problem? An important point is that the data I have so far is rather sparse: just a few thousand records.
Regression vs Time Series – Is House Price Prediction a Regression or Time Series Problem?
python, regression, time-series
Related Solutions
Here is a simple recipe that may help you get started writing code and testing ideas...
Let's assume you have monthly data recorded over three years, so you have 36 values. Let's also assume that you only care about predicting one month (one value) in advance.
- Exploratory data analysis: Apply some of the traditional time series analysis methods to estimate the lag dependence in the data (e.g. auto-correlation and partial auto-correlation plots, transformations, differencing). Let's say that you find a given month's value is correlated with the past three months' data but not much beyond that.
- Partition your data into training and validation sets: Take the first 24 points as your training values and the remaining points as the validation set.
- Create the neural network layout: You'll take the past three months' values as inputs and you want to predict the next month's value. So, you need a neural network with an input layer containing three nodes and an output layer containing one node. You should probably have a hidden layer with at least a couple of nodes. Unfortunately, picking the number of hidden layers, and their respective numbers of nodes, is not something for which there are clear guidelines. I'd start small, like 3:2:1.
- Create the training patterns: Each training pattern will consist of four values, with the first three corresponding to the input nodes and the last one defining the correct value for the output node. For example, if your training data are values $$x_1, x_2, \dots, x_{24}$$ then $$\text{pattern } 1: x_1, x_2, x_3, x_4$$ $$\text{pattern } 2: x_2, x_3, x_4, x_5$$ $$\dots$$ $$\text{pattern } 21: x_{21}, x_{22}, x_{23}, x_{24}$$
- Train the neural network on these patterns
- Test the network on the validation set (months 25-36): Here you will pass in the three values the neural network needs for the input layer and see what the output node produces. So, to see how well the trained network predicts month 32's value, you'll pass in the values for months 29, 30, and 31. The sketch below puts these steps together.
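To make the recipe concrete, here is a minimal sketch in Python. The placeholder series, the choice of numpy and scikit-learn's MLPRegressor, and the tiny network are my assumptions, not part of the original recipe; swap in your own data and tooling.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

series = np.random.default_rng(0).normal(size=36).cumsum()  # placeholder for your 36 monthly values
LAGS = 3  # past three months as inputs, per the exploratory step

# Build (input, target) patterns: (x_t, x_{t+1}, x_{t+2}) -> x_{t+3}
X = np.array([series[t:t + LAGS] for t in range(len(series) - LAGS)])
y = series[LAGS:]

# The first 24 points yield 21 training patterns; the rest form the validation set
n_train = 24 - LAGS
X_train, y_train, X_val, y_val = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

# A small network, roughly the 3:2:1 layout suggested above
net = MLPRegressor(hidden_layer_sizes=(2,), max_iter=5000, random_state=0)
net.fit(X_train, y_train)

# E.g., month 32 is predicted from the values for months 29, 30, and 31
print("validation RMSE:", np.sqrt(np.mean((net.predict(X_val) - y_val) ** 2)))
```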
This recipe is obviously high level, and you may scratch your head at first when trying to map your context into different software libraries/programs. But hopefully this sketches out the main point: you need to create training patterns that reasonably capture the correlation structure of the series you are trying to forecast. And whether you do the forecasting with a neural network or an ARIMA model, the exploratory work to determine that structure is often the most time-consuming and difficult part.
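If it helps, that exploratory step can be sketched with statsmodels' ACF/PACF plots; the placeholder series and the choice of library are mine, not the answer's:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

series = np.random.default_rng(0).normal(size=36).cumsum()  # placeholder monthly series

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=12, ax=axes[0])   # significant spikes indicate lag dependence
plot_pacf(series, lags=12, ax=axes[1])  # a cutoff after lag 3 would support three inputs
plt.show()
```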
In my experience, neural networks can provide great classification and forecasting functionality, but setting them up can be time consuming. In the example above, you may find that 21 training patterns are not enough; different input data transformations lead to better or worse forecasts; varying the number of hidden layers and hidden layer nodes greatly affects forecasts; etc.
I highly recommend looking at the neural_forecasting website, which contains tons of information on neural network forecasting competitions. The Motivations page is especially useful.
You're on the right track by acknowledging that ARIMA modeling is what you should be looking into.
I've seen ARIMA modeling applied to cases involving inventory stock, business sales, levels of production of particular goods, and various other business-related time-series. Without access to the data, I can only speculate that the data you're working with falls into this same sort of category.
Of course, ARIMA modeling is univariate, so any forecasts you produce will be forecasts for the time-series under investigation. For example, if you model prices then you will derive forecasts for prices, not gross profits. That said, you can use price forecasts to build forecasts for profits, so choose the data you work with carefully if you have a choice among many time-series.
It is common to see ARIMA models used as benchmarks, so even if you believe that more complex models (multiple-series and econometric models) may give superior forecasts, ARIMA modeling is nevertheless a worthwhile pursuit: if you build a number of models, you have something to compare them against, which also helps you decide whether the extra complexity is necessary.
The reason ARIMA models make good benchmarks is that, if correctly built, ARIMA forecasts are optimal (in the sense of minimum mean-squared forecast error) among forecasts from univariate, linear, fixed-coefficient models.
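To illustrate the benchmarking idea, here is a minimal sketch with Python's statsmodels; the placeholder series and the (1, 1, 1) order are assumptions for illustration, not recommendations:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

prices = 100 + np.random.default_rng(1).normal(loc=0.5, size=60).cumsum()  # placeholder price series

benchmark = ARIMA(prices, order=(1, 1, 1)).fit()
print(benchmark.forecast(steps=6))  # six-step-ahead benchmark forecasts
print(benchmark.aic)                # a figure to compare more complex models against
```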
Analysis of your data may lead you to develop other models such as multivariate models, non-linear models or even time-varying parameter models, but starting with the simpler class of ARIMA models is a wise choice in itself because ARIMA analysis can later on complement econometric analysis. For a short discussion on this see Zellner (1978).
Obviously, the classic text to consult for ARIMA modeling (and the closely related Transfer Function models) is Box & Jenkins (1970). A good alternative is Pankratz (1983) which is basically a shorter and simpler version of Box & Jenkins' work - all of the main points are retained in Pankratz's book too.
As already mentioned, ARIMA analysis involves looking at a single time-series of past observations. At some stage, you may want to introduce other independent variables in addition to past observations of the dependent variable. This brings you into the territory of distributed lag models, which may or may not be autoregressive. Extending the framework once more, these models can be single-equation or multi-equation (vector equation) models.
One of the factors to consider when deciding between single and vector equations is whether there are possible lagged feedback effects among the various variables. These issues are addressed further in Pankratz (1991), which focuses on dynamic regression models.
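For a flavour of what such a model looks like in code, here is a hedged sketch of a simple dynamic regression using statsmodels' SARIMAX with one exogenous regressor; the variables are invented placeholders and the order is illustrative only:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(2)
income = rng.normal(loc=1.0, size=60).cumsum()          # hypothetical explanatory series
prices = 2.0 * income + rng.normal(scale=0.5, size=60)  # hypothetical dependent series

# An exogenous regressor combined with ARIMA errors: a simple dynamic regression
model = SARIMAX(prices, exog=income, order=(1, 0, 0)).fit(disp=False)
print(model.summary())
```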
Lastly, an excellent online time-series forecasting textbook is Rob Hyndman's Forecasting: principles and practice. Furthermore, if you are an R user (or would consider becoming one) then it would be worth your time to familiarize yourself with the R forecast package (again, thanks to Rob Hyndman).
References:
Box, George and Jenkins, Gwilym (1970) Time series analysis: Forecasting and control, San Francisco: Holden-Day.
Hyndman, R.J. and Athanasopoulos, G. (2013) Forecasting: principles and practice. http://otexts.com/fpp/. Accessed on 17 June 2013.
Pankratz, Alan (1983) Forecasting with univariate Box–Jenkins models: concepts and cases, New York: John Wiley & Sons.
Pankratz, Alan (1991) Forecasting with Dynamic Regression Models, New York: John Wiley & Sons.
Zellner, Arnold (1978) "Folklore versus Fact in Forecasting with Econometric Methods," The Journal of Business, University of Chicago Press, vol. 51(4), pages 587-593, October.
Best Answer
While the other answer is correct that the response variable can be modelled with linear regression, you are dealing with house prices, and as such your dataset will likely suffer from what is called time-series-induced heteroscedasticity.
What this basically means is that since your houses will vary in age (some could be one year old, others over thirty years old), you will have non-constant variance across your residuals.
If you look at the abstract titled "Heteroscedasticity in hedonic house price models", you will note that Generalised Least Squares was found to remove the heteroscedasticity, yielding forecast errors with a lower standard deviation than would be obtained through standard Ordinary Least Squares.
In summary, your data can be modelled using regression analysis, but you do need to watch out for heteroscedasticity and also serial correlation. Moreover, you might find that your distribution (run a qqPlot to check) is not normal, and your analysis might be better served by first transforming your data toward normality using a Box-Cox transformation.
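As a rough illustration of those diagnostics, here is a sketch in Python using statsmodels and scipy; the column names and the data-generating process are invented placeholders for your dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "sqft": rng.uniform(800, 3500, size=500),
    "age": rng.uniform(1, 40, size=500),
})
# Noise that grows with age, mimicking time-series-induced heteroscedasticity
df["price"] = 100 * df["sqft"] - 500 * df["age"] + rng.normal(scale=20 * df["age"])

X = sm.add_constant(df[["sqft", "age"]])
ols = sm.OLS(df["price"], X).fit()

# Breusch-Pagan test: a small p-value indicates heteroscedastic residuals
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)

# QQ plot of the residuals, analogous to qqPlot in R
sm.qqplot(ols.resid, line="s")

# Box-Cox requires a strictly positive response
price_bc, lam = stats.boxcox(df["price"])
print("Box-Cox lambda:", lam)
```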