Duplicate detection in adverse drug reaction surveillance
The WHO Collaborating Centre for International Drug Monitoring in Uppsala, Sweden, maintains and analyses the world’s largest database of reports on suspected adverse drug reaction (ADR) incidents that occur after drugs are on the market. Duplicate case reports are an important data quality problem, and detecting them remains a formidable challenge, especially in the WHO drug safety database, where reports are anonymised before submission. In this paper, we propose a duplicate detection method based on the hit-miss model for statistical record linkage described by Copas and Hilton. The model copes well with the limited amount of training data and suits the available data, which are categorical and numerical rather than free text. We propose two extensions of the standard hit-miss model: a hit-miss mixture model for errors in numerical record fields and a new method to handle correlated record fields. We demonstrate the method's effectiveness both at identifying the most likely duplicate for a given case report (94.7% accuracy) and at discriminating true duplicates from random matches (63% recall with 71% precision). The proposed method allows for more efficient data cleaning in post-marketing drug safety data sets, and perhaps in other knowledge discovery applications as well.
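To make the scoring idea concrete, the following is a minimal sketch of a simplified hit-miss style match score for categorical fields. It is an illustration under stated assumptions, not the method evaluated in the paper: it assumes only two outcomes per field for a duplicate pair (a "hit", where both reports carry the same value, or a "miss", where the two values are independent draws from the field's value distribution), it treats missing values as uninformative, and it assumes fields are independent. The names `field_weight`, `pair_score`, `relative_frequencies`, the parameter `miss_prob`, and the toy reports are all hypothetical. Under these assumptions the log-likelihood ratio is log2(1 - c(1 - β_j)) - log2(β_j) for two reports agreeing on value j with relative frequency β_j, and log2(c) for a disagreement, where c is the miss probability.

```python
import math
from collections import Counter


def relative_frequencies(values):
    """Estimate the relative frequency of each non-missing value of one field."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {v: n / total for v, n in counts.items()}


def field_weight(x, y, freq, miss_prob):
    """Log2 likelihood ratio (duplicate vs. unrelated) for one categorical field.

    Assumed simplified model: with probability 1 - miss_prob the two reports
    share one value (hit); with probability miss_prob their values are
    independent draws from `freq` (miss).
    """
    if x is None or y is None:
        return 0.0  # assumption: a missing value carries no evidence
    if x == y:
        beta = freq[x]
        # Agreement on a rare value is stronger evidence than on a common one.
        return math.log2(1 - miss_prob * (1 - beta)) - math.log2(beta)
    # Disagreement can only arise from a miss under this model.
    return math.log2(miss_prob)


def pair_score(rec_a, rec_b, freqs, miss_probs):
    """Sum the field weights, assuming the fields are independent."""
    return sum(
        field_weight(rec_a.get(f), rec_b.get(f), freqs[f], miss_probs[f])
        for f in freqs
    )


# Toy, made-up reports (hypothetical field names and values).
reports = [
    {"sex": "F", "country": "SE", "drug": "drug_a"},
    {"sex": "F", "country": "SE", "drug": "drug_a"},
    {"sex": "M", "country": "NO", "drug": "drug_b"},
    {"sex": "F", "country": "NO", "drug": "drug_c"},
    {"sex": "M", "country": "SE", "drug": None},
]
fields = ["sex", "country", "drug"]
freqs = {f: relative_frequencies(r[f] for r in reports) for f in fields}
miss_probs = {f: 0.1 for f in fields}  # hypothetical, not estimated from data

score_same = pair_score(reports[0], reports[1], freqs, miss_probs)
score_diff = pair_score(reports[0], reports[2], freqs, miss_probs)
print(f"identical pair: {score_same:.2f}, differing pair: {score_diff:.2f}")
```

In this sketch a pair would be flagged as a suspected duplicate when its score exceeds some threshold; how that threshold is chosen, and how numerical and correlated fields are handled, is outside what the sketch covers.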
Keywords: Data cleaning, Duplicate detection, Hit-miss model