SC Spectra: A Linear-Time Soft Cardinality Approximation for Text Comparison

Jiménez Vargas, Sergio; Gelbukh, Alexander

doi:10.1007/978-3-642-25330-0_19

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7095)

Abstract

Soft cardinality (SC) is a softened version of the classical cardinality of set theory. Because computing it exactly has a prohibitive (exponential) cost, an approximation that is quadratic in the number of terms in the text was proposed previously. SC Spectra is a new linear-time approximation method for text strings, which divides each string into consecutive substrings (i.e., q-grams) of different sizes. Combining SC with resemblance coefficients yields a family of similarity functions for text comparison. These similarity measures were previously applied to an entity resolution (name matching) problem, where they outperformed the SoftTFIDF measure. The SC Spectra method improves on those results, taking less time and achieving better performance. This allows the new method to be used with relatively large documents, such as those included in classic information retrieval collections. The SC Spectra method exceeded the SoftTFIDF and cosine tf-idf baselines with an approach that requires no term weighting.
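
The idea outlined in the abstract can be illustrated with a minimal Python sketch. It is one plausible reading, not the authors' implementation: it assumes the cardinality of a string is approximated by averaging the number of distinct q-grams over a range of sizes, that the intersection is obtained by inclusion-exclusion, and that the Dice coefficient serves as the resemblance coefficient. The function names and the default range `q_start=2`, `q_end=3` are hypothetical choices made for illustration.

```python
def qgrams(text: str, q: int) -> set[str]:
    """Return the set of consecutive substrings of length q (the q-grams)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def spectra(text: str, q_start: int, q_end: int) -> dict[int, set[str]]:
    """Map each q in [q_start, q_end] to the q-gram set of the text."""
    return {q: qgrams(text, q) for q in range(q_start, q_end + 1)}


def spectra_cardinality(grams_by_q: dict[int, set[str]]) -> float:
    """Assumed approximation: mean number of distinct q-grams across sizes."""
    return sum(len(grams) for grams in grams_by_q.values()) / len(grams_by_q)


def dice_similarity(a: str, b: str, q_start: int = 2, q_end: int = 3) -> float:
    """Dice resemblance coefficient computed on the approximated cardinalities."""
    spec_a = spectra(a, q_start, q_end)
    spec_b = spectra(b, q_start, q_end)
    # Union of the two texts, taken separately for each q-gram size.
    spec_union = {q: spec_a[q] | spec_b[q] for q in spec_a}

    card_a = spectra_cardinality(spec_a)
    card_b = spectra_cardinality(spec_b)
    card_union = spectra_cardinality(spec_union)
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.
    card_inter = card_a + card_b - card_union

    if card_a + card_b == 0:
        return 0.0  # both texts are shorter than the smallest q
    return 2 * card_inter / (card_a + card_b)


if __name__ == "__main__":
    print(dice_similarity("Sergio Jimenez", "Sergio Jiménez Vargas"))
```

With a fixed range of q, each q-gram set is built in a single pass over the string, so the cost of this sketch grows linearly with the text length, and no term weights are involved.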

Author information

Authors and Affiliations

Intelligent Systems Research Laboratory (LISI), Systems and Industrial Engineering Department, National University of Colombia, Bogota, Colombia
Sergio Jiménez Vargas
Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
Alexander Gelbukh

Editor information

Editors and Affiliations

Mexican Petroleum Institute (IMP), Eje Central Lazaro Cardenas Norte, 152, Col. San Bartolo Atepehuacan, CP 07730, Mexico DF, Mexico
Ildar Batyrshin
National Polytechnic Institute (IPN), Center for Computing Research (CIC), Av. Juan Dios Bátiz, s/n, Col. Nueva Industrial Vallejo, CP 07738, Mexico D.F., Mexico
Grigori Sidorov

About this paper

Cite this paper

Jiménez Vargas, S., Gelbukh, A. (2011). SC Spectra: A Linear-Time Soft Cardinality Approximation for Text Comparison. In: Batyrshin, I., Sidorov, G. (eds) Advances in Soft Computing. MICAI 2011. Lecture Notes in Computer Science, vol 7095. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25330-0_19

DOI: https://doi.org/10.1007/978-3-642-25330-0_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25329-4
Online ISBN: 978-3-642-25330-0
eBook Packages: Computer Science, Computer Science (R0)
