This was going to be a magic post about clustering and its amazing ability to uncover hidden gems: gaining insight from the data, reducing a problem to a manageable size, or better explaining a situation to business partners who need to focus on an appropriately segmented market.
The problems I usually deal with mix categorical and continuous variables, often demographic and geographical information tied to customer satisfaction surveys. Those surveys are one of the tools a business partner has to answer very specific questions about their planning process and the evolution of their offers.
Clustering is a great way to group these existing customers, identify common characteristics, and adjust offers and planning processes to satisfy current and expected demand. But the categorical data puts a dent in the clustering process: I can’t properly calculate distances between observations without defining a metric or otherwise deciding what a distance between categories means. Many people have pointed out that it makes no sense to turn, say, the days of the week into numbers, because what exactly does 3.3 days mean, half a Wednesday? What does it mean to have an ethnicity of 3.5? We have to find another way to measure that: working with the frequency of categories and with medoids.
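To make the distance problem concrete, here is a minimal sketch of one common workaround, a Gower-style distance: numeric columns contribute a range-scaled absolute difference, and categorical columns contribute a simple 0/1 mismatch, so nobody ever has to average a Wednesday. The medoid is then just the real observation closest to everyone else. The column names, the toy customers, and the helper names `mixed_distance` and `medoid` are all made up for illustration; this is not necessarily the metric used later in the series.

```python
def mixed_distance(a, b, numeric_ranges):
    """Gower-style distance between two records stored as dicts.

    Numeric features: absolute difference divided by the feature's range.
    Categorical features: 0 if the categories match, 1 otherwise.
    The result is the average over all features, so it lies in [0, 1].
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            span = numeric_ranges[key]
            # A zero range means the column is constant: no contribution.
            total += abs(a[key] - b[key]) / span if span else 0.0
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def medoid(records, numeric_ranges):
    """Return the record with the smallest total distance to all records."""
    return min(
        records,
        key=lambda r: sum(
            mixed_distance(r, other, numeric_ranges) for other in records
        ),
    )


# Hypothetical survey rows, invented purely for this example.
customers = [
    {"age": 25, "satisfaction": 4, "region": "north", "day": "Wed"},
    {"age": 31, "satisfaction": 5, "region": "north", "day": "Wed"},
    {"age": 58, "satisfaction": 2, "region": "south", "day": "Sat"},
    {"age": 29, "satisfaction": 4, "region": "north", "day": "Thu"},
]

# Observed range of each numeric column, used to scale its differences.
numeric_ranges = {
    key: max(c[key] for c in customers) - min(c[key] for c in customers)
    for key in ("age", "satisfaction")
}

print(mixed_distance(customers[0], customers[1], numeric_ranges))
print(medoid(customers, numeric_ranges))
```

Notice that the medoid is always one of the actual customers, so it never ends up with a fractional day of the week. Weighting every feature equally is a simplifying assumption; in practice you may want to weight columns differently.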
Next: datasets that feature categorical and continuous variables.