Abstract
This paper shows a novel approach of two-tier machine learning to locate bibliographic references in HTML and separate them into fields. First it is demonstrated, how Conditional Random Fields (CRFs) with constraints can be used to split bibliographic references into fields e.g. authors and title. Therefore a unique feature set, constraints and a method for automatic keyword extraction are introduced. The output of this CRF for tagging bibliographic references, Part Of Speech (POS) analysis and Named Entity Recognition (NER) build the first tier and their output is used to locate the bibliographic reference section in the first place. For this the documents are split into blocks, which are then used for classification. For this task a Support Vector Machines (SVM) approach is compared with another one using a CRF. We demonstrate this two-tier approach archives very good results, while the reference tagging approach is able to compete with other state-of-the-art approaches.
Keywords
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Bollacker, K.D., Lawrence, S., Giles, C.L.: CiteSeer: An autonomous web agent for automatic retrieval and identification of interesting publications. In: Proceedings of the Second International Conference on Autonomous Agents, pp. 116–123. ACM (1998)
Zou, J., Le, D., Thoma, G.R.: Locating and parsing bibliographic references in HTML medical articles. Int. J. Doc. Anal. Recogn. 2, 107–119 (2010)
Hetzner, E.: A simple method for citation metadata extraction using hidden markov models. In: Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 280–284. ACM (2008)
Gao, L., Qi, X., Tang, Z., Lin, X., Liu, Y.: Web-based citation parsing, correction and augmentation. In: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 295–304. ACM (2012)
Park, S.H., Ehrich, R.W., Fox, E.A.: A hybrid two-stage approach for discipline-independent canonical representation extraction from references. In: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2012, pp. 285–294. ACM, New York (2012)
Sutton, C., McCallum, A.: Introduction to Conditional Random Fields for Relational Learning. MIT Press, Cambridge (2006)
Mann, G.S., McCallum, A.: Generalized expectation criteria for semi-supervised learning with weakly labeled data. J. Mach. Learn. Res. 11, 955–984 (2010)
Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probablistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001), pp. 282–289 (2001)
McCallum, A.: Mallet: A machine learning for language toolkit (2002). http://mallet.cs.umass.edu
Councill, I.G., Giles, C.L., Kan, M.Y.: ParsCit: An open-source CRF reference string parsing package. In: International Language Resources and Evaluation. European Language Resources Association (2008)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
Lindner, S., Höhn, W.: Parsing and maintaining bibliographic references. In: International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012) (2012)
Zhai, Y., Liu, B.: Structured data extraction from the web based on partial tree alignment. IEEE Trans. Knowl. Data Eng. 18(12), 1614–1628 (2006)
Fontan, L., Lopez-Garcia, R., Alvarez, M., Pan, A.: Automatically extracting complex data structures from the web. In: International Conference on Knowledge Discovery and Information Retrieval (KDIR 2012) (2012)
Ha, J., Haralick, R.M., Phillips, I.T.: Recursive XY cut using bounding boxes of connected components. In: Proceedings of the Third International Conference on Document Analysis and Recognition, vol. 2, pp. 952–955. IEEE (1995)
Jain, A.K., Yu, B.: Document representation and its application to page decomposition. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 294–308 (1998)
Finkel, J.R.: Named entity recognition and the stanford NER software (2007)
Toutanova, K., Klein, D., Manning, C.D., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol. 1, pp. 173–180. Association for Computational Linguistics (2003)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., Murthy, K.R.K.: Improvements to platt’s SMO algorithm for SVM classifier design. Neural Comput. 13(3), 637–649 (2001)
Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Thirteenth International Joint Conference on Articial Intelligence, vol. 2, pp. 1022–1027. Morgan Kaufmann Publishers (1993)
McCallum, A., Nigam, K., Rennie, J., Seymore, K.: Automating the contruction of internet portals with machine learning. Inf. Retrieval J. 3, 127–163 (2000)
Chang, M.W., Ratinov, L., Roth, D.: Guiding semi-supervision with constraint-driven learning. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 280–287 (2007)
Ganchev, K., Graca, J., Gillenwater, J., Taskar, B.: Posterior regularization for structured latent variable models. J. Mach. Learn. Res. 11, 2001–2049 (2010)
Swain, M., Fawcett, S.: Accounting system implications of TOC. In: Swamidass, P. (ed.) Encyclopedia of Production and Manufacturing Management. Springer, Heidelberg (2000). http://www.springerreference.com January 31 2011
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lindner, S. (2015). Two-Tier Machine Learning Using Conditional Random Fields with Constraints. In: Fred, A., Dietz, J., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2013. Communications in Computer and Information Science, vol 454. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46549-3_6
Download citation
DOI: https://doi.org/10.1007/978-3-662-46549-3_6
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46548-6
Online ISBN: 978-3-662-46549-3
eBook Packages: Computer ScienceComputer Science (R0)