Text Analytics: The Dark Data Frontier
Text is everywhere. Analysts at Gartner estimate that upward of 80 percent of enterprise data today is unstructured. Our everyday interactions generate torrents of it: tweets, blog posts, advertisements, news articles, research papers, product descriptions, emails, YouTube comments, Yelp reviews, surveys from your insurance company, and call transcripts. The majority of this unstructured data is text. Another way to describe this large and mostly monetizable body of data (except YouTube comments; those are toxic!) is to classify it as dark data. The origin of the term is not well known, but it was popularized by Stanford's Dr. Chris Ré, who founded the DeepDive program for extracting valuable information from dark data. The term refers to the mountains of raw information that organizations collect in various ways but that remain difficult to analyze.