Improved Document Feature Selection with Categorical Parameter for Text Classification

Wang, Fen; Li, Xiaoxuan; Huang, Xiaotao; Kang, Ling

doi:10.1007/978-3-319-50463-6_8

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 10026)

Included in the following conference series:

International Conference on Mobile, Secure, and Programmable Networking

Abstract

Social networks are developing rapidly, and thousands of new data items appear on the Internet every day. Classification technology is key to organizing big data. Feature selection (FS) is a direct way to improve classification efficiency: it reduces the size of the feature subset while preserving classification accuracy, based on a score that an FS method computes for each feature. Most previous studies of FS emphasized precision, while time efficiency was commonly ignored. In this study, we first propose a method named CDFDC, which combines CDF and Category-Frequency. Second, we compare DF, CDF, CHI, IG, CDFP_VM and CDFDC to determine the relationships among algorithm complexity, time efficiency and classification accuracy. The experiments use the 20-news-group data set and an NB classifier. The performance of the FS methods is evaluated on seven aspects: precision, Micro F1, Macro F1, feature-selection time, document-conversion time, training time and classification time. The results show that the proposed method performs well in both efficiency and accuracy when the size of the feature subset is greater than 3,000. We also find that an FS algorithm's complexity is unrelated to accuracy, but that complexity can ensure time stability and predictability.
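
As a rough illustration of score-based feature selection in general (not the CDFDC method, whose scoring formula is not given in this abstract), the sketch below ranks terms by document frequency (DF), one of the baselines named above, and keeps the k highest-scoring terms. The function names and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): a document counts each term once
    return df


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def project(tokens, vocabulary):
    """Keep only the tokens that survived feature selection."""
    keep = set(vocabulary)
    return [t for t in tokens if t in keep]


# Toy corpus (hypothetical, not taken from the paper).
docs = [
    "the game ended in a draw".split(),
    "the team won the game".split(),
    "the court ruled on the case".split(),
    "the case went to court".split(),
]

df = document_frequency(docs)
vocabulary = select_top_k(df, k=3)
reduced = [project(d, vocabulary) for d in docs]
```

Any other scoring function that maps terms to numbers could be passed to `select_top_k` in place of the DF counts. Note that plain DF favors very common words such as "the", which is one reason category-aware scores are of interest.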

Acknowledgments

I am deeply indebted to the many people who guided me in writing this paper. I would like to express my heartfelt gratitude to my tutor, Prof. Wang, for her warm-hearted encouragement and valuable advice, and especially for her insightful comments and suggestions on the draft of this paper. Without her help, encouragement and guidance, I could not have completed this paper.

I would also like to thank my family and friends for their valuable encouragement and moral support during my studies.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Huazhong University of Science and Technology, Hubei, China
Fen Wang, Xiaoxuan Li & Xiaotao Huang
Department of Hydropower and Information Engineering, Huazhong University of Science and Technology, Hubei, China
Ling Kang

Corresponding author

Correspondence to Xiaoxuan Li.

Editor information

Editors and Affiliations

CNAM/CEDRIC, Paris, France
Selma Boumerdassi
Institut Mines-Télécom – Télécom SudParis, Évry, France
Éric Renault
CNAM/CEDRIC, Paris, France
Samia Bouzefrane

About this paper

Cite this paper

Wang, F., Li, X., Huang, X., Kang, L. (2016). Improved Document Feature Selection with Categorical Parameter for Text Classification. In: Boumerdassi, S., Renault, É., Bouzefrane, S. (eds) Mobile, Secure, and Programmable Networking. MSPN 2016. Lecture Notes in Computer Science, vol 10026. Springer, Cham. https://doi.org/10.1007/978-3-319-50463-6_8

DOI: https://doi.org/10.1007/978-3-319-50463-6_8
Published: 10 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50462-9
Online ISBN: 978-3-319-50463-6
eBook Packages: Computer Science, Computer Science (R0)
