Identifying Consumer Segments with Python
Very few businesses can survive for long without finding new customers. The process of identifying new customers begins by learning as much as we can about current consumer segments. We can identify segments by psychographic, geographic, demographic and behavioural characteristics.
One of the most traditional methods for market segmentation is cluster analysis. Cluster analysis methods are multivariate techniques for grouping consumers based on their similarity to one another. Distance metrics or measures of agreement between consumers guide the segmentation process.
Personally, when conducting a cluster analysis, I begin by seeing what happens when the data is divided into just two clusters, then three, then four, and work my way up. Usually, we are looking for solutions involving ten or fewer clusters. It is also important that every cluster (segment) be large enough to warrant marketing attention, so it’s worth noting the proportion of consumers in each cluster when evaluating solutions.
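To make that routine concrete, here is a minimal sketch of it. It assumes k-means as the clustering method and runs on synthetic data standing in for a prepared, numeric consumer matrix; the helper name `cluster_proportions` and the data itself are illustrative, not taken from the bank data set.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a prepared (numeric, standardized) consumer matrix.
# These values are made up for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 3)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 3)),
])


def cluster_proportions(X, k, seed=0):
    """Fit k-means with k clusters and return each cluster's share of the rows."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    counts = np.bincount(labels, minlength=k)
    return counts / counts.sum()


# Try two clusters, then three, and so on up to ten.
proportions_by_k = {k: cluster_proportions(X, k) for k in range(2, 11)}

for k, props in proportions_by_k.items():
    # A very small share flags a segment that may be too small to market to.
    print(k, np.round(props, 3))
```

Reading the printed shares side by side makes it easy to spot solutions where one cluster holds only a sliver of the consumers.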
Today, I have chosen to use data from marketing campaigns conducted by a Portuguese bank between May 2008 and November 2010. The bank was interested in identifying factors that affect client responses to new term deposit offerings, which are the focus of the marketing campaigns. What is the most effective way to segment the market?
The data include, among other factors, age, job type, marital status, client banking history, housing loan status and education. The data set also records the date and duration of each phone call, as well as summary information about all calls with a specific client.
There are several ways to conduct a cluster analysis. For this study, I chose to use a partitioning method that requires that input variables have meaningful magnitude or be binary categorical variables. Except for age, the demographic variables in this data set are multi-category variables, so we need to perform an extra step to convert multi-category variables into binary categorical variables. (It is also appropriate to standardize all input variables prior to clustering.)
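The conversion and standardization steps might look something like the sketch below. The tiny table, its column names (`age`, `marital`, `education`) and its values are hypothetical stand-ins, and using `pandas.get_dummies` with `StandardScaler` is one reasonable choice rather than the only one.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature of the demographic columns; names and values are
# illustrative and not copied from the bank data file.
clients = pd.DataFrame({
    "age": [30, 45, 52, 28, 39, 61],
    "marital": ["single", "married", "divorced", "single", "married", "married"],
    "education": ["secondary", "tertiary", "primary", "tertiary", "secondary", "primary"],
})

# Expand each multi-category column into one 0/1 indicator column per category.
encoded = pd.get_dummies(clients, columns=["marital", "education"], dtype=float)

# Standardize every input column to mean 0 and unit variance before clustering.
X = StandardScaler().fit_transform(encoded)

print(encoded.columns.tolist())
print(X.shape)
```

After this step every column is numeric and on a comparable scale, so no single variable (such as age) dominates the distance calculations just because of its units.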
The most important tools to import for this analysis are KMeans and silhouette_score, both from scikit-learn. Once all the setup is done, we begin to examine each demographic variable.
At this point, we have almost completed our cluster analysis, but I’ll go ahead and evaluate the clusters with the silhouette score to make sure that the segments we identified are well defined and few in number. After all, there is no point in segmenting if the clusters aren’t distinct enough from one another.
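For readers who want to see the shape of that check, here is a small sketch. The silhouette score ranges from -1 to 1, and higher values mean the clusters are tighter and better separated. The data below is synthetic and deliberately built from two well-separated groups, so it only demonstrates the mechanics and says nothing about the bank data itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with two clearly separated groups (an assumption made purely
# for illustration; real consumer data is rarely this clean).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=-4.0, scale=1.0, size=(150, 4)),
    rng.normal(loc=4.0, scale=1.0, size=(150, 4)),
])

# Score each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Keep the solution whose clusters are the most distinct.
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(k, round(score, 3))
print("best k:", best_k)
```

Comparing the scores across values of k, alongside the cluster sizes, is what lets us settle on a solution with few, well-defined segments.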
The selected clustering solution in this example suggests that two segments are best. Follow the code and you’ll be able to see which consumers fall into each of the two clusters.
Side note: In response to some of the comments I’ve received, these blog entries are just my personal practice journey with modelling, which focuses more on business literacy, thought process and logic. They aren’t meant to be technical step-by-step instructions. If you’re interested in playing around with this data set or have a more interesting method to share, feel free to message me for the data file :).