Abstract
Opinion mining applies scientific methods to find, extract, and systematically analyze subjective information. Performing opinion mining on Web content raises challenges that usually do not occur in laboratory environments, where prepared and preprocessed texts are used. This paper discusses preprocessing approaches that help to cope with the problems that emerge when sentiment analysis is applied in real-world situations. After outlining the identified shortcomings and presenting a general process model for opinion mining, the paper discusses promising solutions for language identification, content extraction, and the handling of Internet slang.
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Petz, G. et al. (2012). On Text Preprocessing for Opinion Mining Outside of Laboratory Environments. In: Huang, R., Ghorbani, A.A., Pasi, G., Yamaguchi, T., Yen, N.Y., Jin, B. (eds) Active Media Technology. AMT 2012. Lecture Notes in Computer Science, vol 7669. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35236-2_62
DOI: https://doi.org/10.1007/978-3-642-35236-2_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35235-5
Online ISBN: 978-3-642-35236-2
eBook Packages: Computer Science, Computer Science (R0)