Predicting Sales: Site Selection with Python
Sales prediction is a huge part of marketing data science. I am no stranger to sales forecasts for product lines and various kinds of time series forecasting; however, predicting sales for retail site selection is something I have not touched on before. The site selection problem interests me because it usually involves data sets with more explanatory variables than there are stores. Each potential location, geocoded by longitude and latitude, represents a point on a map and may be associated with multiple variables including, but not limited to, population, housing, traffic, and economic conditions.
Since it’s my first time working with site sales prediction, I’ll use a sample data set with fewer variables. The data describe a hypothetical restaurant chain: gross sales volume and the number of direct competitors within a two-mile radius are recorded for each existing location. We also have the population and average income of people living within a three-mile radius of each restaurant location.
We already have the actual sales data for these sites; our goal today is to build a model that predicts sales from these location data. If successful, this model can then be trusted to yield predictions used to pick new locations.
Usually in site selection, we ignore the spatial aspects of the data and employ cross-sectional regression, which is why we don’t have any time-series data this time. We start with the usual setup and then specify the regression model.
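To make the model specification concrete, here is a minimal sketch of a cross-sectional ordinary least squares fit with NumPy. The data set itself is not reproduced here, so the numbers below are synthetic and generated with a fixed seed. The variable names (`competitors`, `population`, `income`, `sales`) and the coefficients used to simulate sales are my own assumptions for illustration.

```python
import numpy as np

# Synthetic stand-in for the restaurant data: every value below is made up.
rng = np.random.default_rng(0)
n_sites = 30
competitors = rng.integers(0, 6, size=n_sites)    # direct competitors near each site
population = rng.uniform(20, 150, size=n_sites)   # nearby population, in thousands
income = rng.uniform(30, 90, size=n_sites)        # nearby average income, in thousands
noise = rng.normal(0, 10, size=n_sites)
sales = 100 - 8 * competitors + 0.5 * population + 1.2 * income + noise

# Design matrix: an intercept column followed by the three explanatory variables.
X = np.column_stack([np.ones(n_sites), competitors, population, income])

# Ordinary least squares: choose coefficients minimizing the squared residuals.
coef, _, rank, _ = np.linalg.lstsq(X, sales, rcond=None)

# R-squared summarizes how much of the variation in sales the model explains.
fitted = X @ coef
ss_res = np.sum((sales - fitted) ** 2)
ss_tot = np.sum((sales - sales.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

for name, value in zip(["intercept", "competitors", "population", "income"], coef):
    print(f"{name:>12}: {value: .3f}")
print(f"R-squared: {r_squared:.3f}")
```

With real data, the three synthetic arrays would simply be replaced by the columns of the restaurant data set; the rest of the sketch stays the same.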
Next we compare the fitted values from our model with the actual sales data for the current restaurants. Now, if we have the same set of location data for new restaurant sites, we can plug them into our model to predict sales. If not, we can make up data for three potential sites just to test our model.
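The two steps just described, checking the fit against actual sales and then scoring candidate sites, might look roughly like this. Again, this is only a sketch on synthetic data: the three candidate sites, the helper `design_matrix`, and all numbers are hypothetical and stand in for whatever the real data set contains.

```python
import numpy as np

# Synthetic stand-in for the existing restaurants: every value is made up.
rng = np.random.default_rng(1)
n_sites = 30
competitors = rng.integers(0, 6, size=n_sites)
population = rng.uniform(20, 150, size=n_sites)   # thousands of residents
income = rng.uniform(30, 90, size=n_sites)        # average income, in thousands
sales = (100 - 8 * competitors + 0.5 * population + 1.2 * income
         + rng.normal(0, 10, size=n_sites))


def design_matrix(competitors, population, income):
    """Stack an intercept column with the three explanatory variables."""
    competitors = np.asarray(competitors, dtype=float)
    return np.column_stack(
        [np.ones_like(competitors), competitors, population, income]
    )


# Fit the model on the existing sites.
X = design_matrix(competitors, population, income)
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)

# Compare fitted values with actual sales for the current restaurants.
fitted = X @ coef
rmse = np.sqrt(np.mean((sales - fitted) ** 2))
print(f"In-sample RMSE: {rmse:.2f}")

# Three hypothetical candidate sites (competitors, population, income).
new_sites = design_matrix([1, 3, 5], [120.0, 80.0, 40.0], [75.0, 60.0, 45.0])
predicted_sales = new_sites @ coef

# Rank the candidates by predicted sales.
best_site = int(np.argmax(predicted_sales))
for i, value in enumerate(predicted_sales):
    print(f"Site {i}: predicted sales {value:.1f}")
print(f"Highest predicted sales: site {best_site}")
```

The in-sample error is optimistic by construction, so with real data it would be worth holding out a few existing restaurants before trusting the ranking.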
Now we have our sales predictions, and they can be used to select new restaurant sites! This is a small selection problem, but it shows what is possible with ordinary least squares regression. I might move on to larger site selection problems when I have time. With larger site selection problems, there is the fun of partitioning explanatory variables into meaningful groups. Some variables relate to demographics, others to site characteristics. Some measures are taken within a two-mile radius of the site, others within 10 miles.
Working with these groups of explanatory variables, we can then use tree-structured regression and random forests to obtain lists of the explanatory variables that best predict the sales response. After repeating the subset selection and best-possible regression search for each group, we finally have an ensemble predictor.
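As a rough illustration of the per-group search, here is a sketch that runs an exhaustive best-subset regression inside each group and then averages the group models. It is one possible reading of the idea, not a definitive recipe: it uses plain OLS with adjusted R-squared instead of trees or random forests, the simple averaging step is my own assumption, and the group names, variable names, and data are all invented.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(2)
n_sites = 60

# Hypothetical groups of explanatory variables; names and values are invented.
groups = {
    "demographic": ["population", "income", "median_age"],
    "site": ["competitors", "parking_spaces", "visibility"],
}
data = {name: rng.normal(size=n_sites)
        for names in groups.values() for name in names}

# Simulated sales depend on only three of the six variables.
sales = (3.0 * data["population"] + 2.0 * data["income"]
         - 2.5 * data["competitors"] + rng.normal(scale=1.0, size=n_sites))


def fit_ols(columns):
    """Fit OLS of sales on the named columns plus an intercept."""
    X = np.column_stack([np.ones(n_sites)] + [data[c] for c in columns])
    coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
    return X, coef


def adjusted_r2(columns):
    """Adjusted R-squared, which penalizes adding weak predictors."""
    X, coef = fit_ols(columns)
    ss_res = np.sum((sales - X @ coef) ** 2)
    ss_tot = np.sum((sales - sales.mean()) ** 2)
    k = len(columns)
    return 1 - (ss_res / (n_sites - k - 1)) / (ss_tot / (n_sites - 1))


def best_subset(candidates):
    """Try every non-empty subset of one group and keep the best-scoring one."""
    subsets = [s for r in range(1, len(candidates) + 1)
               for s in combinations(candidates, r)]
    return max(subsets, key=adjusted_r2)


# Divide and conquer: run the subset search separately within each group.
selected = {group: best_subset(names) for group, names in groups.items()}
print(selected)

# A very simple ensemble: average the fitted values of the per-group models.
group_predictions = []
for columns in selected.values():
    X, coef = fit_ols(columns)
    group_predictions.append(X @ coef)
ensemble_prediction = np.mean(group_predictions, axis=0)
```

Exhaustive search is only feasible because each group is small; with dozens of variables per group, the number of subsets grows too quickly, which is part of what makes the larger problem tedious.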
It is the divide-and-conquer aspect of site selection problems that fascinates me, but it is also this same aspect that is so tedious and time-consuming that it terrifies me. I’ll see if I have time next weekend to work on it :).