MTurk Really Adds Up

Posted on 14 Oct 2012

Sorry for the lack of updates, nonexistent audience. Moving cross-country put a damper on my ability to update my blog with any frequency.

Today’s issue: MTurk is kind of expensive.

About a month ago, a friend and I decided to hack together some Python to scrape Twitter for tweets containing certain keywords and use Bayesian classification to determine whether each tweet fit into certain (VERY SECRET) categories. Everything worked out until we wanted to actually tag a set of tweets to use as a training set.


RT @CaliforniaBelle: Never underestimate the power of prayer and remember to trust God's timing. #LifeLessons 

Interesting? 3 - Yes, 2 - Maybe, 1 - No: 1
RT @iiDepressed: What do you want for christmass? A gun, boxes of pills, loads of alcohol, razor blades and a rope please. 

Interesting? 3 - Yes, 2 - Maybe, 1 - No: 1
@tranceforge emmm sbnrny waktu aku dtg disanapun sdh punya logic flownya..tetapi knp sptny smuany lbh ke ... http://t.co/BuQOCdBk 

Interesting? 3 - Yes, 2 - Maybe, 1 - No: 1
RT @Orlandomendez7: #Escogido hoy Guzmán LF, Lugo 2B, Kieschnick RF, Gómez DH, Green 1B, Mesa CF, Castillo C, Tatis 3B, Florimón SS. Cab ... 

Interesting? 3 - Yes, 2 - Maybe, 1 - No: 1
Ever Clean Multiple Cat Premium Clumping Cat Litter | Litter Boxes For Cats http://t.co/YX0V9ZUG 

Interesting? 3 - Yes, 2 - Maybe, 1 - No: 1
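For context, the kind of Bayesian classification these hand-tagged tweets would feed can be sketched roughly as below. This is not our actual code (that was Python and stays secret); it is a minimal multinomial naive Bayes with add-one smoothing, written in JavaScript, and the `train`/`classify` helpers, the labels, and the training sentences are all made up for illustration.

```javascript
// Minimal multinomial naive Bayes with add-one (Laplace) smoothing.
// All names and sample data here are hypothetical.

function tokenize(text) {
  // Lowercase and keep runs of letters, digits, '#', '@' and apostrophes.
  return text.toLowerCase().match(/[a-z0-9#@']+/g) || [];
}

function train(examples) {
  const model = { labels: new Map(), vocab: new Set(), totalDocs: 0 };
  for (const { text, label } of examples) {
    if (!model.labels.has(label)) {
      model.labels.set(label, { docs: 0, words: 0, counts: new Map() });
    }
    const stats = model.labels.get(label);
    stats.docs += 1;
    model.totalDocs += 1;
    for (const word of tokenize(text)) {
      stats.counts.set(word, (stats.counts.get(word) || 0) + 1);
      stats.words += 1;
      model.vocab.add(word);
    }
  }
  return model;
}

function classify(model, text) {
  let bestLabel = null;
  let bestScore = -Infinity;
  for (const [label, stats] of model.labels) {
    // Log prior plus the sum of smoothed log likelihoods for each word.
    let score = Math.log(stats.docs / model.totalDocs);
    for (const word of tokenize(text)) {
      const count = stats.counts.get(word) || 0;
      score += Math.log((count + 1) / (stats.words + model.vocab.size));
    }
    if (score > bestScore) {
      bestScore = score;
      bestLabel = label;
    }
  }
  return bestLabel;
}

// Made-up training set standing in for hand-tagged tweets.
const model = train([
  { text: 'new python library for parsing tweets', label: 'interesting' },
  { text: 'bayesian classifier tutorial with code', label: 'interesting' },
  { text: 'cat litter boxes for cats on sale', label: 'boring' },
  { text: 'premium clumping cat litter deal', label: 'boring' },
]);

console.log(classify(model, 'clumping cat litter'));    // boring
console.log(classify(model, 'python classifier code')); // interesting
```

The catch is the same one described above: the classifier is only as good as the pile of labeled examples handed to `train`, and somebody has to produce that pile.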

Hand-classifying tweets is really, really boring. Faced with the prospect of having to manually tag tweets for the next five hours, I decided to finally check out Amazon’s Mechanical Turk. It’s a really nice service that’s exactly what I wanted and very easy to use. There’s only one problem:

MTurk Total

Two cents doesn’t seem like much, but when each task can take as little as a second to complete and there are tens of thousands of them, it really starts to add up. So much so that doing all of the tagging myself is much more cost-effective than paying MTurk and doing real work with the time it would save me.
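To put rough numbers on that, here is a back-of-the-envelope sketch. The two cents per task and one second per task come from above; the task count of 18,000 is an assumption picked to match roughly five hours of tagging, and any fees on top of the per-task price are ignored.

```javascript
// Back-of-the-envelope labeling cost. TASK_COUNT is an assumed figure,
// not the real size of the data set.
const CENTS_PER_TASK = 2;
const SECONDS_PER_TASK = 1;
const TASK_COUNT = 18000;

function labelingCost(taskCount) {
  const totalCents = taskCount * CENTS_PER_TASK; // integer cents avoid float drift
  const hours = (taskCount * SECONDS_PER_TASK) / 3600;
  const dollars = totalCents / 100;
  return { dollars, hours, dollarsPerHour: dollars / hours };
}

const estimate = labelingCost(TASK_COUNT);
console.log(estimate); // { dollars: 360, hours: 5, dollarsPerHour: 72 }
```

Under those assumptions the bill comes to $360 for five hours of labeling, which works out to $72 per hour of tagging time.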

So I Got A Tweet (Part 1)

Posted on 08 Sep 2012

One day I received a message on Twitter. It looked something like this:

Twitter Spam

All tweets pictured were to other users. His message to me was pretty far back and I’m lazy.

Intrigued, I immediately clicked on it and got my computer rooted and my identity stolen after taking proper precautions. What followed was a fantastic, surreal journey into the land of unnecessarily obfuscated JavaScript.

The link sent to me led to a free site hosted by Yandex, the Russian equivalent of Google. (I hope this doesn’t sound dismissive. It’s a massive, popular multimedia search engine that offers maps and site hosting and makes most of its money through the display of ads. It’s pretty Googly.)

The page contained the string “Redirecting” and some code for a view counter.

Also this:

eval(function (p, a, c, k, e, r) {
	    e = function (c) {
	        return (c < a ? '' : e(parseInt(c / a))) + ((c = c % a) > 35 ? String.fromCharCode(c + 29) : c.toString(36))
	    };
	    if (!''.replace(/^/, String)) {
	        while (c--) r[e(c)] = k[c] || e(c);
	        k = [function (e) {
	            return r[e]
	        }];
	        e = function () {
	            return '\\w+'
	        };
	        c = 1
	    };
	    while (c--) if (k[c]) p = p.replace(new RegExp('\\b' + e(c) + '\\b', 'g'), k[c]);
	    return p
	}([a bunch of packed js and the building blocks of the regex needed to put it together]))  

Really suspicious packed JavaScript!
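One way to look inside a packer like this without running its payload is to keep the unpacking function but drop the eval around it, so the unpacked source comes back as a plain string. The sketch below is a named copy of the function above with a toy payload; the name `unpack` and the three-word dictionary are hypothetical, since the real packed arguments are elided here.

```javascript
// Named copy of the packer's unpacking function. It returns the unpacked
// source as a string instead of handing it to eval.
function unpack(p, a, c, k, e, r) {
  // Encode an index as a short base-`a` token: "0".."9", "a".."z", then "A".."Z".
  e = function (c) {
    return (c < a ? '' : e(parseInt(c / a))) +
      ((c = c % a) > 35 ? String.fromCharCode(c + 29) : c.toString(36));
  };
  // Fast path for engines where String.replace accepts a function.
  if (!''.replace(/^/, String)) {
    // Build a token -> word dictionary.
    while (c--) r[e(c)] = k[c] || e(c);
    k = [function (token) { return r[token]; }];
    e = function () { return '\\w+'; };
    c = 1;
  }
  // Swap every token in the packed text for its dictionary word.
  while (c--) {
    if (k[c]) p = p.replace(new RegExp('\\b' + e(c) + '\\b', 'g'), k[c]);
  }
  return p;
}

// Toy payload (made up): tokens 0, 1, 2 stand for "var", "x", "42".
const revealed = unpack('0 1=2;', 62, 3, ['var', 'x', '42'], 0, {});
console.log(revealed); // var x=42;
```

Printing the result instead of evaluating it is the whole trick; each layer that prints another eval call just gets the same treatment.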

But what could it be? Painstaking analysis of the code (use of console.log) produced this:

 
eval(unescape(unescape(unescape("eval(function(p%25252Ca%25252Cc%25252Ck%25252Ce%25252Cr)%25257B[snip]

Which, when unescaped, yields a packed function similar to the one above. One more spin of console.log yields (ASCII-decoded from \x## form for your convenience):

 
var _0x5162 = ["1N T=["\j\E\1o\N\1f\y\I...[3][T[2]](T[1]),0,{}));", "|", "split", "|||||||||||||||||x5C|x7...w|RegExp|x25|x5F|62|113", "", "fromCharCode", "replace", "\w+", "\b", "g"]]
eval(function (b, c, d, e, f, g) {
    f = function (a) {
        return (a < c ? _0x5162[4] : f(parseInt(a / c))) + ((a = a % c) > 35 ? String[_0x5162[5]](a + 29) : a.toString(36))
    };
    if (!_0x5162[4][_0x5162[6]](/^/, String)) {
        while (d--) {
            g[f(d)] = e[d] || f(d)
        };
        e = [function (a) {
            return g[a]
        }];
        f = function () {
            return _0x5162[7]
        };
        d = 1
    };
    while (d--) {
        if (e[d]) {
            b = b[_0x5162[6]](new RegExp(_0x5162[8] + f(d) + _0x5162[8], _0x5162[9]), e[d])
        }
    };
    return b
}(_0x5162[0], 62, 121, _0x5162[3][_0x5162[2]](_0x5162[1]), 0, {}));

So, pretty much exactly the same thing, but with some elements pulled out into a variable. This packing strategy is used two more times.
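The nested unescape calls from a couple of steps back each strip one layer of percent-encoding, so `%25252C` becomes `%252C`, then `%2C`, then a comma. Here is a minimal sketch of peeling such layers until the text stops changing. It uses decodeURIComponent as a stand-in for the deprecated unescape, which is an assumption that only holds for plain ASCII input like the fragment shown, and `decodeLayers` is a hypothetical helper name.

```javascript
// Repeatedly percent-decode a string until it stops changing.
// decodeURIComponent throws on malformed sequences, so this is only a sketch
// for clean ASCII input; unescape itself is more forgiving.
function decodeLayers(text, maxLayers = 10) {
  let current = text;
  let layers = 0;
  while (layers < maxLayers) {
    const next = decodeURIComponent(current);
    if (next === current) break; // nothing left to decode
    current = next;
    layers += 1;
  }
  return { text: current, layers };
}

// The visible start of the triple-escaped string from the page.
const sample = 'eval(function(p%25252Ca%25252Cc%25252Ck%25252Ce%25252Cr)%25257B';
const decoded = decodeLayers(sample);
console.log(decoded.layers); // 3
console.log(decoded.text);   // eval(function(p,a,c,k,e,r){
```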

Seven steps down (and after a little extra cleanup), our mystery is solved!

 
var RUN = true;
var t = navigator['userAgent']['toLowerCase']();
var url = 'hXXp://traffichouse.ru/?2';
var notAllow = ['http', 'craw', 'bot', 'surf', 'spid', 'tweet'];
var min = 25;
if (RUN) {
    locate(url, allow(t, notAllow, min))
};

function allow(a, b, c) {
    if (a['length'] < c) {
        return false
    };
    for (var d = 0; d < b['length']; d++) {
        if (a['indexOf'](b[d]) != -1) {
            return false
        }
    };
    return true
};

function locate(a, b) {
    if (b) {
        location['href'] = a
    }
};

The plot thickens! The script checks the browser’s userAgent string, rejecting the browser if its userAgent is shorter than 25 characters or contains strings indicating that it might be a bot or Twitter app.

And then, after seven layers of obfuscation and a userAgent check, it triggers a JavaScript redirect. Great.
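To see what that check lets through, here is a tidied re-implementation of the allow function run against a few user agents. The blocklist and the minimum length of 25 are from the script above; the three sample user agent strings are my own picks for illustration, and nothing here performs the redirect.

```javascript
// Values taken from the deobfuscated script.
const notAllow = ['http', 'craw', 'bot', 'surf', 'spid', 'tweet'];
const minLength = 25;

// Same logic as the script's allow(): reject short user agents and any
// that contain a blocked fragment.
function allow(userAgent, blocked, min) {
  if (userAgent.length < min) return false;
  return !blocked.some((fragment) => userAgent.includes(fragment));
}

// Sample user agents (chosen for illustration only).
const samples = {
  desktopBrowser: 'Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0',
  crawler: 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
  shortAgent: 'curl/7.26.0',
};

const verdicts = {};
for (const [name, ua] of Object.entries(samples)) {
  // The script lowercases the user agent before checking it.
  verdicts[name] = allow(ua.toLowerCase(), notAllow, minLength);
}
console.log(verdicts);
// { desktopBrowser: true, crawler: false, shortAgent: false }
```

So an ordinary desktop browser gets redirected, while crawlers and bare command-line clients just see the "Redirecting" page.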

TO BE CONTINUED


A Bat

Posted on 16 Aug 2012

Imgur

It’s a tiny bat. It’s inside, not outside where it belongs.