Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods

Passonneau, Rebecca J.; Rudin, Cynthia; Radeva, Axinia; Liu, Zhi An

doi:10.1007/978-3-642-00382-0_7

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Abstract

This paper illustrates how a combination of information extraction, machine learning, and NLP corpus annotation practice was applied to a problem of ranking vulnerability of structures (service boxes, manholes) in the Manhattan electrical grid. By adapting NLP corpus annotation methods to the task of knowledge transfer from domain experts, we compensated for the lack of operational definitions of components of the model, such as "serious event". The machine learning depended on the ticket classes, but classifying tickets was not the end goal. Rather, our rule-based document classification determines both the labels of examples and their feature representations. Changes in our classification of events led to improvements in our model, as reflected in the AUC scores for the full ranked list of over 51K structures. The improvements for the very top of the ranked list, which is of most importance for prioritizing work on the electrical grid, affected one in every four or five structures.
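
As a reading aid for the evaluation measure mentioned above: AUC for a ranked list can be interpreted as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below is a minimal illustration on made-up scores and labels, not the authors' code; the function name `ranking_auc` and the toy data are assumptions introduced here.

```python
from typing import Sequence


def ranking_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    `labels` uses 1 for a positive example (hypothetically, a structure with
    a serious event) and 0 for a negative one. Tied scores count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # tie: half credit
    return correct / (len(pos) * len(neg))


# Toy data (hypothetical): five structures with model scores and labels.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = ranking_auc(toy_scores, toy_labels)  # 5 of 6 pairs are ordered correctly
```

The quadratic pairwise loop is chosen for readability; for a list of tens of thousands of structures, a rank-based formulation would be the more practical choice.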

Author information

Authors and Affiliations

Columbia University, New York, NY 10027, USA
Rebecca J. Passonneau, Cynthia Rudin, Axinia Radeva & Zhi An Liu

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, Mexico
Alexander Gelbukh

Cite this paper

Passonneau, R.J., Rudin, C., Radeva, A., Liu, Z.A. (2009). Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_7

DOI: https://doi.org/10.1007/978-3-642-00382-0_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)
