Quality Assurance Tool Suite for Error Detection in Digital Repositories
Digitization workflows for automatic acquisition of image collections are susceptible to errors and require quality assurance. This paper presents automated quality assurance tools that detect possible quality issues and support decision making for document image collections. The main contribution of this research is the implementation of image processing tools for different error detection scenarios and their combination into a single tool suite. The tool suite includes: (1) The matchbox tool for accurate near-duplicate detection in document image collections, based on SIFT feature extraction. (2) The finger detection tool, which automatically detects fingers that mistakenly appear in scans from digitized image collections; it uses edge detection and the extraction and analysis of local image information to reason about scan quality. (3) The cropping error detection tool, which supports the detection of common cropping problems such as text shifted to the edge of the image, unwanted page borders, or unwanted text from a previous page on the image. Another important contribution of this work is the definition of a quality assurance workflow and its automatic execution for error detection in digital document collections. The presented tool suite detects the described errors and presents them for additional manual analysis and collection cleaning. A statistical overview of the evaluated data is provided, together with characteristics such as performance and accuracy. The results of the analysis confirm our hypothesis that an automated approach is able to detect errors reliably, thus making quality control for large digitization projects a feasible and affordable process.
Keywords: digital library, digital preservation, quality assurance, image processing