BreakingMegauploadCaptcha
How to break the Megaupload captcha
Note: The captcha discussed in this document was only active for a couple of months in 2009; I am keeping it for historical reasons.

Introduction

In the good old days Megaupload used to have one of the simplest captchas around: just three letters with no distortion, no rotation, nothing at all. Obviously any decent OCR was able to decode it, and plenty of automated downloading applications worked smoothly. However, one day (curiously, not long after Rapidshare had dropped their long-hated dogs/cats captcha) this changed. In fact, they changed the captcha one, two, three (!), four (!!) times in a month. Finally (as of April 2009) they have given us a rest with a not-so-easy four-character rotated and overlapped captcha.

Studying the captcha

At first glance it does not seem a bad captcha: while it is not so hard for a human to fill in the blanks and answer it, the overlapping makes the task difficult for a bot. That's what a good captcha should be. Fortunately, it has some serious issues which make automatic decoding doable:
One simple approach would be to correlate the image with known-good characters and take the closest candidates. This method did work while the overlapping was not significant, but now the overlap is far too high.

Breaking the captcha

We are going to clean the image as much as we can to help an OCR, but we are not going to implement the (complicated) OCR algorithm ourselves (KISS, which in plain English means "I feel lazy today"). We will employ Tesseract, which is the best FOSS OCR I am aware of (well, to be honest its command-line interface is simply awful, but the accuracy is pretty good). First of all, our solution begins by identifying three different kinds of pixels in the image (making intensive use of the flood fill function):
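As a minimal sketch of the flood-fill building block only (this is not the code of megaupload_captcha.py): the image is assumed to be a list of rows with 0 for white and 1 for ink, and the names `add_border` and `flood_fill` are hypothetical. It also illustrates the 1-pixel border trick mentioned in the comments below, which makes all of the outer background one contiguous region, so that whatever stays white after the fill must be a gap enclosed by a character.

```python
from collections import deque


def add_border(grid, value=0):
    """Return a copy of grid surrounded by a 1-pixel border of `value`."""
    width = len(grid[0])
    blank = [value] * (width + 2)
    body = [[value] + list(row) + [value] for row in grid]
    return [blank[:]] + body + [blank[:]]


def flood_fill(grid, start, new_value):
    """Fill the 4-connected region containing start (x, y), in place.

    Iterative (queue-based), so it does not hit the recursion limit.
    Returns the number of pixels that were changed.
    """
    x0, y0 = start
    old_value = grid[y0][x0]
    if old_value == new_value:
        return 0
    height, width = len(grid), len(grid[0])
    grid[y0][x0] = new_value
    queue = deque([(x0, y0)])
    filled = 1
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] == old_value:
                grid[ny][nx] = new_value
                filled += 1
                queue.append((nx, ny))
    return filled


# A tiny "O" shape: 1 = ink, 0 = white (outer background and the inner hole).
image = [
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
]
padded = add_border(image)
background_pixels = flood_fill(padded, (0, 0), 2)  # mark outer background as 2
enclosed = [(x, y) for y, row in enumerate(padded)
            for x, v in enumerate(row) if v == 0]
```

After the fill, `enclosed` holds only the white pixels that the background could not reach. Without the border, the white columns to the left and right of the "O" would not be connected and would need separate fills.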
For every iteration we build the image by filling the character gaps, (de)rotate the characters and finally OCR the image with Tesseract. The resulting text is then filtered and only the values fitting the expected pattern (letter-letter-letter-digit) are taken into account. Finally, we pick the character with the most occurrences at each position et voilà, this one must be the correct captcha! At this moment (version 0.6) the script achieves a 40% success rate.

Show me the code

Implemented in Python (dependencies: PIL), the decoding script is part of Plowshare, but you can also run it stand-alone from the command line:

http://code.google.com/p/plowshare/source/browse/trunk/src/modules/extras/megaupload_captcha.py

$ python megaupload_captcha.py captcha.gif

or as a Python module:

#!/usr/bin/python
import megaupload_captcha as mc
# Read the GIF in binary mode so the image bytes are not altered
imagedata = open("captcha.gif", "rb").read()
captcha = mc.decode_captcha(imagedata)

Of course, you are free to use this module in your free software projects.

Misc

Shaun Friedle made his own implementation (based on neural networks) and compared its performance with Plowshare (see his comment below).
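Returning to the filtering and voting step described under "Breaking the captcha", here is a minimal sketch of how it might look. It is not taken from megaupload_captcha.py: the function name `vote`, the removal of all whitespace from the OCR output and the acceptance of both upper- and lower-case letters are assumptions made for illustration.

```python
import re
from collections import Counter

# Expected shape of a decoded captcha: letter-letter-letter-digit.
PATTERN = re.compile(r"[A-Za-z]{3}[0-9]")


def vote(ocr_results):
    """Keep only well-formed OCR results and pick, for each of the four
    positions, the character that occurs most often.

    Returns None when no result fits the expected pattern.
    """
    valid = []
    for text in ocr_results:
        cleaned = re.sub(r"\s+", "", text)  # assumption: whitespace is noise
        if PATTERN.fullmatch(cleaned):
            valid.append(cleaned)
    if not valid:
        return None
    # zip(*valid) yields one tuple of characters per position.
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*valid))


# Hypothetical OCR outputs, one per iteration:
results = ["ABC1\n", "A8C1", "ABC7", "XBC1", "AB C1", "garbage"]
decoded = vote(results)
```

Here "A8C1" and "garbage" are discarded by the filter, and the remaining four candidates are combined position by position.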
I am the developer of Tucan Manager ( http://cusl3-tucan.forja.rediris.es/ ). As a courtesy, I am here to thank you for this algorithm for solving the Megaupload captcha, which will be used in Tucan.
Regards, Crak.
Thank you. I will use it in megaupbash ( http://pablo777.wordpress.com/2008/09/17/automatizar-descargas-de-megaupload-megaupbash/ )
Hi, I've been working on megaupload captchas for a while, and came up with a method similar to yours for the new captchas (identifying blocks of adjacent same-coloured pixels and then assembling them into characters).
I have to credit you for pointing me towards Eric S. Raymond's flood fill algorithm (although I don't actually use it for filling), I was previously using my own inefficient monstrosity which I had to increase the recursion limit to 3000 to use. Also somehow I didn't think of enlarging the image with a 1 pixel border to make all of the background contiguous, before that I was just checking every white pixel around the edge.
The two main ways in which my method differs are how it reassembles the characters and how it does the OCR. You attempt to sort of brute force the correct combination by checking every possible permutation, whereas I just guess based on the location of the blocks relative to the four largest black blocks. The upside of this is that it's significantly faster than your method, and I think it's at least as accurate.
I also implemented the character recognition myself rather than relying on a general OCR utility, which means it's a lot more specialised and knows exactly what each character is supposed to look like. The downside is my method is a lot less tolerant to any changes megaupload might make, I'd have to retrain the neural network for any change in the font. I tested my program against yours, and I was pretty pleased with the results ( http://herecomethelizards.co.uk/mu_captcha/test.html ) - note the neural net has not been trained with those captchas, I keep my training and test data separate. If you want to check out my code the repo is at http://hg.herecomethelizards.co.uk/mu_autocaptcha/ .
Shaun