Determining the Validity of Clustering for Data Fusion

  • Kate A. Smith
  • Sheldon Chuan
  • Peter van der Putten
Conference paper
Part of the Advances in Soft Computing book series (AINSC, volume 14)


For many direct marketing activities, organisations frequently find that their customer databases do not contain enough information. Additional databases, such as socio-economic databases constructed from census and survey data, can be purchased to supplement them. One of the difficulties in fusing separate databases, however, is that the information is based on two different samples, and a unique individual can rarely be identified in both. Usually a set of variables common to both samples is used to determine the similarities between customers, and various methods have been proposed for predicting the information missing from one sample based on the information contained in the other. While some complicated methods have been proposed for data fusion, in this paper we investigate the validity of a simple clustering approach. Using a set of variables common to both samples, clusters are generated with the k-means algorithm. The likely values of missing variables are then inferred from the average values within the relevant cluster. An out-of-sample test set is used to demonstrate the accuracy of the fused variable predictions.
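The clustering approach described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that the clusters are fitted on the donor sample only, that the fused variable is numeric, and that accuracy is measured by the mean absolute percentage error (the metric named in the keywords). The function names (`k_means`, `fuse`, `mape`), the farthest-first initialisation, and the toy data are all hypothetical choices made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def k_means(points, k, iterations=100):
    """Plain k-means with a deterministic farthest-first start (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        far = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members; keep empty ones.
        updated = [[sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
                   for j, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def fuse(donor_common, donor_target, recipient_common, k):
    """Predict the missing variable for each recipient from its cluster mean."""
    centroids = k_means(donor_common, k)
    totals, counts = [0.0] * k, [0] * k
    for x, y in zip(donor_common, donor_target):
        j = nearest(x, centroids)
        totals[j] += y
        counts[j] += 1
    overall = sum(donor_target) / len(donor_target)  # fallback for empty clusters
    means = [totals[j] / counts[j] if counts[j] else overall for j in range(k)]
    return [means[nearest(x, centroids)] for x in recipient_common]


def mape(actual, predicted):
    """Mean absolute percentage error; actual values must be non-zero."""
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)


# Toy data (invented): two common variables, one variable to be fused.
donor_common = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
donor_target = [10.0, 12.0, 11.0, 50.0, 52.0, 48.0]
recipient_common = [[1.1, 1.0], [8.1, 8.0]]
recipient_actual = [10.0, 55.0]  # held out, used only for evaluation

predicted = fuse(donor_common, donor_target, recipient_common, k=2)
print(predicted, mape(recipient_actual, predicted))
```

In practice the common variables would normally be scaled before clustering so that no single variable dominates the distance, and the number of clusters would be chosen by comparing out-of-sample error across several values of `k`.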


Common Variable, Data Fusion, Mean Absolute Percentage Error, Absolute Percentage Error, Donor Sample



Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Kate A. Smith
    • 1
  • Sheldon Chuan
    • 1
  • Peter van der Putten
    • 2
  1. School of Business Systems, Monash University, Australia
  2. Leiden Institute of Advanced Computer Science, Leiden, The Netherlands
