On the CrowdChat platform, we have classified over 66M Twitter accounts and counting. An important precursor step is finding the real humans among those millions of accounts. This may seem like an intuitive task for the average person, but it is not so for machines. This post describes how we approached the problem. We will introduce the concepts involved as briefly as possible; the reader can refer elsewhere for more depth.
The problem of identifying whether a Twitter account belongs to a real human can be translated into the problem of identifying whether its self-described bio belongs to a language class H.
Let's assume for now that H is the language class of all self-described bios that sound like they were written by humans, and B is the language class of all bios written by bots, organizational accounts, or other non-humans.
There are always more humans than non-humans on popular social networks like Twitter. Twitter itself states that the percentage of bot accounts is insignificant and that it constantly works to keep it as low as possible. Hence, it is valid to assume that in a truly randomized sample, the percentage of human accounts will always exceed the percentage of non-human accounts.
From a basic study, we identify that humans tend to use words like "I" and "my" in their bios, while organizational accounts use words like "official", "news", and "we". We call these seed rules.
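A minimal sketch of such seed rules might look like the following; the word lists here are illustrative assumptions, not our production rule set.

```python
# Illustrative seed-word lists (assumptions, not the full production rules).
HUMAN_SEEDS = {"i", "my", "me"}
NON_HUMAN_SEEDS = {"official", "news", "we"}

def seed_label(bio):
    """Return 1 (human), 0 (non-human), or None if no seed rule fires."""
    words = set(bio.lower().split())
    if words & HUMAN_SEEDS:
        return 1
    if words & NON_HUMAN_SEEDS:
        return 0
    return None
```

Bios matching neither list get no label at this stage; they are handled by the bootstrap described later.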
Next, we know that a bio is made of natural English words.
Let's say an account has the following bio: "I live in California and work at Crowdchat". This bio can be split into n-grams:

unigrams = ["i", "live", "in", "california", "and", "work", "at", "crowdchat"]
bigrams = ["i live", "live in", "in california", "california and", "and work", "work at", "at crowdchat"]
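The splitting above can be sketched with a small helper (a generic word n-gram function, not our exact tokenizer):

```python
def ngrams(text, n):
    """Split a bio into lowercase word n-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

bio = "I live in California and work at Crowdchat"
unigrams = ngrams(bio, 1)
bigrams = ngrams(bio, 2)
```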
Now it's a 0-1 classification problem. We were inspired by Edwin Chen's blog post about a similar problem, filtering non-English tweets from English tweets (Link), and decided to use the same technique.
A given tweet (in English) is composed of a set of character n-grams. We use the EM algorithm.
Let's recall the naive Bayes algorithm: given a tweet (a set of character n-grams), we estimate its language to be the language L that maximizes P(L | tweet) ∝ P(L) × Π P(n-gram | L).
This would be easy if we knew the language of each tweet, since we could estimate each P(n-gram | L) by counting n-gram occurrences in the tweets known to be in language L.
Or, it would also be easy if we knew the n-gram probabilities for each language, since we could use Bayes’ theorem to compute the language probabilities for each tweet, and then take a weighted variant of the previous paragraph.
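The scoring step can be sketched as follows, assuming we already have class priors and per-class n-gram probabilities (the argument names and the smoothing constant are illustrative assumptions); log-probabilities are used to avoid underflow:

```python
import math

def naive_bayes_class(grams, priors, gram_probs, smoothing=1e-6):
    """Pick the class c maximizing log P(c) + sum of log P(gram | c).

    priors: dict mapping class -> P(c)
    gram_probs: dict mapping class -> {gram: P(gram | c)}
    Unseen n-grams fall back to a small smoothing probability (an assumption).
    """
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for g in grams:
            score += math.log(gram_probs[c].get(g, smoothing))
        if score > best_score:
            best, best_score = c, score
    return best
```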
Thus, we bootstrap using a two-step process:
- Using the seed rules, assign human (label = 1) or non-human (label = 0).
- If no seed rule fires, assign a label randomly.
Then we iterate: estimate the class priors and n-gram probabilities from the current labels, relabel each bio with naive Bayes, and repeat until the labels converge.
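Putting the pieces together, the bootstrap can be sketched as a hard-assignment EM loop over word unigrams. This is a simplified illustration under stated assumptions (uniform smoothing, unigrams only, hard labels), not our production pipeline:

```python
import math
import random
from collections import Counter

def em_bootstrap(bios, seed_label, n_iters=20, smoothing=1e-6):
    """Seed labels where rules fire, random labels otherwise, then
    alternate estimating probabilities and relabeling until convergence."""
    labels = [seed_label(b) if seed_label(b) is not None else random.randint(0, 1)
              for b in bios]
    for _ in range(n_iters):
        # M-step: estimate P(c) and word counts per class from current labels.
        counts = {0: Counter(), 1: Counter()}
        class_n = Counter(labels)
        for bio, lab in zip(bios, labels):
            counts[lab].update(bio.lower().split())
        priors = {c: class_n[c] / len(bios) for c in (0, 1)}
        # E-step: relabel every bio with naive Bayes (hard assignment).
        new_labels = []
        for bio in bios:
            scores = {}
            for c in (0, 1):
                total = sum(counts[c].values()) or 1
                s = math.log(priors[c] or smoothing)
                for w in bio.lower().split():
                    s += math.log(counts[c][w] / total or smoothing)
                scores[c] = s
            new_labels.append(max(scores, key=scores.get))
        if new_labels == labels:
            break  # labels stopped changing: converged
        labels = new_labels
    return labels
```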