Filtering out the filth
If you think you get a lot of salacious spam in your email, you should see what turns up when you’re downloading the web. The proliferation of online publishing tools has resulted in pages and pages of sales links, random advertising gibberish, and incoherent babble akin to word salad.
If your company or product brand gets a random mention on one of these computer-generated pages, is that something you really want to know about? Worse, do those mentions make you think your online presence is larger than it is? As humans, we have an uncanny ability to quickly sort the relevant opinions from the splogs (spam blogs), but how do we get a computer to do the same?
One possibility is to have the computer filter out every web page containing words that occur frequently in splogs. These words are the usual suspects: the four-letter ones, the ones you might hear in an adult video, and even the kind of pills you might take before starring in one. Unfortunately, people legitimately use some of these words when expressing their opinions. Don’t believe us? Just check out those dirty Twitterers at Cursebird, but don’t say we didn’t warn you.
Our approach is to adopt a keyword filter that removes splogs containing the worst of the worst: words and word combinations you’d probably never even think of, let alone post online. To accommodate the range of human expression, certain kinds of profanity have to make it through this first level of filtering.
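To make that first level concrete, here is a minimal sketch of a keyword filter that rejects a page only when it contains a blocked word or word combination. The blocklist entries and the function names are hypothetical placeholders, not our production list or code.

```python
import re

# Hypothetical placeholder blocklist; the real list is not reproduced here.
# Entries can be single words or multi-word combinations.
BLOCKED_TERMS = {"blockedword", "cheap pills online"}

def normalize(text):
    """Lowercase the text and reduce it to space-separated word tokens."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def passes_keyword_filter(text, blocked_terms=BLOCKED_TERMS):
    """Return False if any blocked word or word combination appears in the text."""
    # Pad with spaces so terms match whole words only, never substrings.
    padded = f" {normalize(text)} "
    return not any(f" {normalize(term)} " in padded for term in blocked_terms)
```

Because matching is on whole words, everyday profanity that is not on the list passes through untouched and is left for the second, statistical stage.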
We then allow our statistical algorithm to learn what an irrelevant opinion looks like by feeding it irrelevant samples, including splogs with some kinds of profanity. This algorithm doesn’t just look for individual words; it summarizes each document as a set of numbers. Under this numerical representation, splogs look vastly different from relevant opinions and can be safely discarded. It’s an effective way to get to the real, human opinion and ditch the mentions that don’t, and shouldn’t, matter.
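As a rough illustration of learning from labeled samples, the sketch below turns each document into word counts and trains a small naive Bayes classifier on them. This is a stand-in technique chosen for brevity, not a description of our actual algorithm or features, and the function names and labels are assumptions made for the example.

```python
import math
import re
from collections import Counter

def featurize(text):
    """Summarize a document as numbers: a count for each word it contains."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def train(samples):
    """Build per-label word counts and document counts from (text, label) pairs."""
    word_counts, doc_counts = {}, Counter()
    for text, label in samples:
        word_counts.setdefault(label, Counter()).update(featurize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts

def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, doc_counts = model
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    features = featurize(text)
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the log prior for this label.
        score = math.log(doc_counts[label] / total_docs)
        for word, n in features.items():
            # Add-one (Laplace) smoothing so unseen words never give log(0).
            score += n * math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Trained on a handful of examples labeled "splog" and "relevant", `classify` picks whichever label's word statistics better match a new page. A real system would need far more samples and richer features than raw word counts.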