Understanding Short Texts

  • Haixun Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7808)


Many applications handle short texts, and enabling machines to understand short texts is a big challenge. For example, in ad selection, it is difficult to evaluate the semantic similarity between a search query and an ad. Clearly, string similarity based on edit distance does not work. Moreover, statistical methods that learn latent topic models from text also fall short, because ads and search queries are too short to provide enough statistical signal.
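To make the edit-distance point concrete, here is a minimal sketch using a standard Levenshtein distance. The query and the two ad strings are hypothetical examples chosen for illustration; they do not come from any real ad system.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical query and ads, invented for this illustration.
query = "apple ipad"
irrelevant_ad = "apple pie"        # close as a string, unrelated in meaning
relevant_ad = "tablet computer"    # far as a string, related in meaning

for ad in (irrelevant_ad, relevant_ad):
    print(ad, levenshtein(query, ad))
```

Under this measure the unrelated ad ranks as the closer match, which is the failure described above: character overlap says nothing about whether two short texts refer to the same kind of thing.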

In this tutorial, I will talk about a knowledge-empowered approach to text understanding. When the input is sparse, noisy, and ambiguous, knowledge is needed to fill the gap in understanding. I will introduce the Probase project at Microsoft Research Asia, whose goal is to enable machines to understand human communications. Probase is a universal, probabilistic taxonomy more comprehensive than any current taxonomy. It contains more than 2 million concepts, harnessed automatically from a corpus of 1.68 billion web pages and two years' worth of search-log data. It enables probabilistic interpretations of search queries, document titles, ad keywords, etc. The probabilistic nature also enables it to incorporate heterogeneous information naturally. I will explain how the core taxonomy, which contains hypernym-hyponym relationships, is constructed and how it models knowledge's inherent uncertainty, ambiguity, and inconsistency.
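As a rough illustration of what a probabilistic interpretation over hypernym-hyponym pairs could look like, the sketch below stores counts for (concept, instance) pairs and turns them into P(concept | instance). Everything here is an assumption made for illustration: the counts are invented, the names `ISA_COUNTS`, `concept_given_instance`, and `interpret` are hypothetical, and the product-with-smoothing scoring is one simple choice, not necessarily how Probase scores concepts.

```python
# Hypothetical isA counts n(concept, instance); the numbers are invented.
ISA_COUNTS = {
    ("fruit", "apple"): 60,
    ("company", "apple"): 40,
    ("company", "microsoft"): 95,
    ("brand", "microsoft"): 5,
    ("fruit", "pear"): 80,
    ("tree", "pear"): 20,
}


def concept_given_instance(instance: str) -> dict:
    """Estimate P(concept | instance) by relative frequency of isA pairs."""
    counts = {c: n for (c, e), n in ISA_COUNTS.items() if e == instance}
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def interpret(terms, smoothing: float = 1e-3) -> dict:
    """Score each concept by the product of smoothed P(concept | term)
    over all terms, then normalize the scores to sum to 1."""
    concepts = {c for (c, _e) in ISA_COUNTS}
    scores = {}
    for c in concepts:
        score = 1.0
        for t in terms:
            score *= concept_given_instance(t).get(c, 0.0) + smoothing
        scores[c] = score
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


for text in (["apple", "microsoft"], ["apple", "pear"]):
    ranked = interpret(text)
    print(text, max(ranked, key=ranked.get))
```

In this toy setup the ambiguous term "apple" is read as a company next to "microsoft" and as a fruit next to "pear": the surrounding terms shift the probability mass, which is the kind of disambiguation a probabilistic taxonomy makes possible for short texts.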


Keywords: Semantic Similarity, Search Query, Probabilistic Interpretation, Latent Topic, Probabilistic Nature
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Haixun Wang (1)
  1. Microsoft Research Asia, China
