Updated Sep 22, 2009 by tokland
Labels: Phase-Implementation
BreakingMegauploadCaptcha  
How to break the Megaupload captcha

Note: The captcha discussed in this document was only active for a couple of months in 2009; I am keeping this page for historical reasons.

Introduction

In the good old days Megaupload used to have one of the simplest captchas around, just three letters with no distortion, no rotation, nothing at all. Obviously any decent OCR was able to decode it and plenty of automated downloading applications worked smoothly.

However, one day (curiously, not long after Rapidshare had dropped their long-hated dogs/cats captcha) this changed. In fact, they changed the captcha one, two, three (!), four (!!) times in a month. Finally (as of April 2009) they have given us a rest with a not-so-easy captcha: four characters, rotated and overlapping.

Studying the captcha

At first glance it does not seem a bad captcha: it is not too hard for a human to fill in the blanks and answer it, while the overlapping makes the task difficult for a bot. That's what a good captcha should be. Fortunately, it has some serious weaknesses which make automatic decoding doable:

  • The rotation applied to the characters is fixed.
  • There is no distortion: the characters are absolutely clean (except for the overlapping).

One simple approach would be to correlate the image with clean reference characters and take the closest candidates. This method worked while the overlapping was not significant, but now the overlap is far too high for it.
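As an illustration of that correlation idea, here is a minimal sketch in pure Python. It is not the code Plowshare used: the 3x3 "font", the function names and the scoring rule (fraction of agreeing pixels) are all assumptions made for the example.

```python
def match_score(glyph, template):
    """Fraction of pixels on which two equally sized binary grids agree."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def closest_characters(glyph, templates, top=2):
    """Return the `top` template names, best match first."""
    ranked = sorted(
        templates,
        key=lambda name: match_score(glyph, templates[name]),
        reverse=True,
    )
    return ranked[:top]


# Toy 3x3 "font", purely illustrative (1 = ink, 0 = background).
TEMPLATES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# An "I" with one stray pixel, as if a neighbouring character touched it.
noisy_i = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
print(closest_characters(noisy_i, TEMPLATES))  # ['I', 'T']
```

A single stray pixel barely changes the score, but once a large part of a character is covered by its neighbour the scores of different templates converge, which is why this approach stops working with heavy overlapping.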

Breaking the captcha

We are going to clean the image as much as we can to help an OCR, but we are not going to implement the (complicated) OCR algorithm (KISS, which in plain English means "I feel lazy today"). We will employ Tesseract, which is the best FOSS OCR I am aware of (well, to be honest its command-line interface is simply awful but the accuracy is pretty good).

First of all, our solution identifies three different kinds of pixels in the image (making intensive use of the flood fill function):

  1. Background: Flood-fill the image starting at pixel (0, 0); every pixel painted by the fill is marked as background (0).
  2. Characters: Black pixels are characters (1). The only problem is that we don't know for sure to which character each zone belongs. We must infer this from the position of each zone, so that we extract only four groups (corresponding to the four characters).
  3. Uncertain zones: Transparent pixels in the image (which we cannot be sure are background) can be either real background (0) or overlapping characters (1), so we need to try every possibility. For example, if we have identified eight uncertain zones, we will have 2^8 = 256 OCR processes to launch.
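The classification above can be sketched on a toy grid as follows. This is only an approximation under stated assumptions: the image is a list of character rows instead of a PIL image, "." stands for any non-black pixel, every non-black region not reached from (0, 0) is treated as an uncertain zone, and the function names are hypothetical (they are not those of megaupload_captcha.py).

```python
from itertools import product


def flood_fill(grid, start, new_value):
    """Repaint the 4-connected region containing `start` (iterative, no recursion).

    Returns the list of (row, col) cells that were repainted.
    """
    rows, cols = len(grid), len(grid[0])
    old_value = grid[start[0]][start[1]]
    if old_value == new_value:
        return []
    stack, painted = [start], []
    while stack:
        r, c = stack.pop()
        if 0 <= r < rows and 0 <= c < cols and grid[r][c] == old_value:
            grid[r][c] = new_value
            painted.append((r, c))
            stack.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return painted


def find_uncertain_zones(grid, blank=".", background="0", zone_mark="?"):
    """Mark the outer background, then collect each remaining blank region."""
    flood_fill(grid, (0, 0), background)
    zones = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value == blank:
                zones.append(flood_fill(grid, (r, c), zone_mark))
    return zones


def candidate_images(grid, zones):
    """Yield one grid per way of setting each zone to background (0) or ink (1)."""
    for choice in product("01", repeat=len(zones)):
        image = [row[:] for row in grid]
        for zone, value in zip(zones, choice):
            for r, c in zone:
                image[r][c] = value
        yield image


# Toy image: "1" is a black pixel, "." is anything else.
CAPTCHA = [
    "........",
    ".111111.",
    ".1.11.1.",
    ".111111.",
    "........",
]
grid = [list(row) for row in CAPTCHA]
zones = find_uncertain_zones(grid)
images = list(candidate_images(grid, zones))
print(len(zones), len(images))  # 2 4
```

Each extra uncertain zone doubles the number of candidate images, so eight zones give the 256 OCR runs mentioned above.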

For every iteration we build the image by filling the character gaps, (de)rotate the characters and finally OCR the image with Tesseract. The resulting text is then filtered and only the values fitting the expected pattern (letter-letter-letter-digit) are taken into account. Finally, we pick the character with the most occurrences at each position et voilà, this must be the correct captcha! At this moment (version 0.6) the script achieves a 40% success rate.
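The filtering and voting step can be sketched like this. The function name, the normalisation to upper case and the sample OCR outputs are assumptions for the example; only the letter-letter-letter-digit pattern and the per-position majority vote come from the description above.

```python
import re
from collections import Counter

# Expected captcha shape: three letters followed by one digit.
PATTERN = re.compile(r"[A-Z]{3}[0-9]")


def vote(ocr_results):
    """Keep the OCR outputs matching the pattern and pick the most
    frequent character at each of the four positions."""
    cleaned = (text.strip().upper() for text in ocr_results)
    valid = [text for text in cleaned if PATTERN.fullmatch(text)]
    if not valid:
        return None  # no OCR run produced a plausible captcha
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*valid)
    )


# Made-up outputs, one per candidate image.
results = ["ABC1", "ABG1", "A8C1", "", "ABC7", "HBC1x"]
print(vote(results))  # ABC1
```

Here "A8C1", the empty string and "HBC1x" are discarded by the pattern, and the three remaining strings outvote the misread "G" and "7".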

Show me the code

Implemented in Python (dependencies: PIL), the decoding script is part of Plowshare, but you can also use it on its own, either stand-alone from the command line:

http://code.google.com/p/plowshare/source/browse/trunk/src/modules/extras/megaupload_captcha.py

$ python megaupload_captcha.py captcha.gif

or as a Python module:

#!/usr/bin/python
import megaupload_captcha as mc

# Read the image as raw bytes (binary mode avoids newline translation)
imagedata = open("captcha.gif", "rb").read()
captcha = mc.decode_captcha(imagedata)

Of course, you are free to use this module in your free software projects.

Misc

Shaun Friedle made his own implementation (based on neural networks) and compared its performance with Plowshare's:

http://hg.herecomethelizards.co.uk/mu_autocaptcha

http://herecomethelizards.co.uk/mu_captcha/test.html


Comment by crak.otaku, Apr 11, 2009

I am the developer of Tucan Manager ( http://cusl3-tucan.forja.rediris.es/ ). As a courtesy, I am here to thank you for this algorithm to solve the Megaupload captcha, which will be used in Tucan.

Regards, Crak.

Comment by sfriedle, May 10, 2009

Hi, I've been working on megaupload captchas for a while, and came up with a similar method to you for the new captchas (identifying blocks of adjacent same-coloured pixels and then assembling them into characters).

I have to credit you for pointing me towards Eric S. Raymond's flood fill algorithm (although I don't actually use it for filling), I was previously using my own inefficient monstrosity which I had to increase the recursion limit to 3000 to use. Also somehow I didn't think of enlarging the image with a 1 pixel border to make all of the background contiguous, before that I was just checking every white pixel around the edge.

The two main ways my method differs are how it reassembles the characters and the OCR. You attempt to sort of brute force the correct combination by checking every possible combination of zones, whereas I just guess based on the location of the blocks relative to the four largest black blocks. The upside of this is it's significantly faster than your method and I think it's at least as accurate.

I also implemented the character recognition myself rather than relying on a general OCR utility, which means it's a lot more specialised and knows exactly what each character is supposed to look like. The downside is my method is a lot less tolerant to any changes megaupload might make, I'd have to retrain the neural network for any change in the font. I tested my program against yours, and I was pretty pleased with the results ( http://herecomethelizards.co.uk/mu_captcha/test.html ) - note the neural net has not been trained with those captchas, I keep my training and test data separate. If you want to check out my code the repo is at http://hg.herecomethelizards.co.uk/mu_autocaptcha/ .

Shaun

