Defeating Digg’s CAPTCHA

While using digg.com, I was surprised to see such an obviously weak CAPTCHA challenge. Within a couple of hours, using nothing but free software, I was able to create a script that defeats it with 88% accuracy. (If you are looking for code, forget it. This is almost too much information.)

Digg’s CAPTCHA Weaknesses:

  1. Dictionary Words
  2. Same background
  3. Same font
  4. No deformations
  5. All lowercase letters
  6. Constant colors

Tools

  • gocr – a GPL Optical Character Recognition program
  • ImageMagick – for command-line image editing
  • Perl – to tie everything together

Sample Size
100 images containing 95 different words, with an average word length of 5.3 letters.

First Test
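Just dumping all the images through gocr yielded 26% correct responses. Not too shabby. The raw output is easy to manipulate (each line below was a CAPTCHA image followed by what gocr read from it):

  • [image] = groUDS
  • [image] = single
  • [image] = t’ o.l,i.c,e ‘. . .’.’. ”,
  • [image] = be.cause.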

Looking at this output, I’m sure we can improve the results with a little string manipulation.
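A minimal Python sketch of what such a baseline run might look like. This is not the script behind the numbers above (that one was written in Perl and is not published). The function names are hypothetical, and it assumes the images have already been converted to a format gocr reads directly, such as PNM.

```python
import subprocess
from typing import Callable, Dict


def gocr_text(image_path: str) -> str:
    """Run gocr on one image and return its raw text output.

    Assumes the `gocr` binary is on PATH and the image is in a format
    it can read directly (for example PNM).
    """
    result = subprocess.run(
        ["gocr", "-i", image_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def accuracy(expected: Dict[str, str],
             recognize: Callable[[str], str] = gocr_text) -> float:
    """Fraction of images whose recognized text equals the expected word."""
    if not expected:
        return 0.0
    correct = sum(1 for path, word in expected.items()
                  if recognize(path) == word)
    return correct / len(expected)
```

Passing the recognizer in as a parameter makes it easy to swap in a cleaned-up or spell-checked version later and compare the scores on the same sample.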

Tweaking output
We can clean up the output to yield better results:

  • Convert all output to lowercase
  • Remove non-letter characters
  • Run the result through a spell checker

The first two steps yield 53% correct responses; with just this simple tweak we get more correct guesses than incorrect ones. Taking the spell checker’s first suggestion bumps the accuracy to 67%.
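The three output tweaks can be sketched in a few lines of Python. This is an illustration only, not the original Perl. The function names are hypothetical, and `difflib` from the standard library stands in for a real spell checker, whose first suggestion may well differ.

```python
import difflib
import re
from typing import Sequence


def clean(raw: str) -> str:
    """Lowercase the OCR output and drop everything that is not a letter a-z."""
    return re.sub(r"[^a-z]", "", raw.lower())


def first_guess(word: str, dictionary: Sequence[str]) -> str:
    """Return the word itself if it is in the dictionary, otherwise the
    closest dictionary entry, or the word unchanged if nothing is close.

    difflib is an assumed stand-in for the spell checker.
    """
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else word
```

For instance, `clean("be.cause.")` gives `because` with no spell checking at all, which is the kind of result that accounts for the jump from 26% to 53%.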

Tweaking input

  • Removing the border
  • Adjusting contrast and brightness
  • Using edge detection

So [original CAPTCHA image] becomes [processed image]
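As a rough sketch, the three input adjustments map onto ImageMagick’s `convert` options `-shave`, `-brightness-contrast` and `-edge`. The helper below only builds the argument list. Its name, its parameters and every numeric value are assumptions chosen for illustration, not the settings actually used.

```python
from typing import List


def convert_command(src: str, dst: str, border: int = 1,
                    brightness: int = 0, contrast: int = 0,
                    edge_radius: int = 0) -> List[str]:
    """Build an ImageMagick `convert` argument list for one input variant."""
    cmd = ["convert", src]
    if border > 0:
        # Trim `border` pixels from every side of the image.
        cmd += ["-shave", f"{border}x{border}"]
    if brightness or contrast:
        # Percentage adjustments in the form BRIGHTNESSxCONTRAST.
        cmd += ["-brightness-contrast", f"{brightness}x{contrast}"]
    if edge_radius > 0:
        # Edge detection with the given radius.
        cmd += ["-edge", str(edge_radius)]
    cmd.append(dst)
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. On ImageMagick 7 the entry point is `magick`, with `convert` kept as a legacy name.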

Since we are already more than two-thirds accurate, we don’t need to adjust the input of every image, just the ones whose results aren’t dictionary words. Part of the problem is that while one adjustment will improve the results for one image, it will degrade the results for another. My solution was to try 10 variations of the image, run each through the OCR, and then spell check the output. I then had the program pick the answer that showed up in more than one result; in the case of a tie or no duplicates, I had the program pick the one with the fewest number of variations. This method resulted in the final accuracy of 88%.
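The selection step might look like the Python sketch below. It rests on one loudly stated assumption: the candidates are ordered from the least-modified variation to the most-modified, so "fewest variations" is read here as "earliest in the list". That reading, and the function name, are mine and are not confirmed by the description above.

```python
from collections import Counter
from typing import Sequence


def pick_answer(candidates: Sequence[str]) -> str:
    """Pick the answer that the most variations agree on.

    `candidates` holds one spell-checked OCR result per variation,
    assumed to be ordered from least-modified to most-modified image.
    On a tie, or when no result is duplicated, fall back to the first
    (least-modified) candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    counts = Counter(candidates)
    best = max(counts.values())
    winners = [word for word in counts if counts[word] == best]
    if best > 1 and len(winners) == 1:
        return winners[0]
    return candidates[0]
```

Voting like this sidesteps the problem of choosing a single best adjustment: an adjustment that hurts one image is simply outvoted by the ones that help it.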

Problems with this technique
While these quick results come close to being usable, they are still a far cry from 100% accuracy. Since digg uses a consistent font, I could train gocr on problematic letters (such as p). Also, given that I received 5 sets of duplicate words in 100 images, I would estimate that their dictionary holds only a couple thousand words, so I could hand-tweak the results.
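One way to sanity-check a dictionary-size guess is a birthday-problem estimate. The sketch below assumes words are drawn uniformly and independently, and that each of the 5 duplicate sets is a single pair (consistent with 95 distinct words in 100 images). Neither assumption is confirmed, and the function name is hypothetical.

```python
def estimate_dictionary_size(samples: int, duplicate_pairs: int) -> float:
    """Birthday-style estimate of dictionary size N.

    With n uniform draws the expected number of matching pairs is about
    n * (n - 1) / (2 * N); this solves that relation for N.
    """
    if duplicate_pairs <= 0:
        raise ValueError("need at least one duplicate pair to estimate")
    return samples * (samples - 1) / (2 * duplicate_pairs)


size = estimate_dictionary_size(100, 5)  # 100 * 99 / 10
```

With only 5 pairs observed, the figure is very noisy, so it shows little more than that the dictionary is small: on the order of a thousand words, not tens of thousands.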

Disclaimer
I did contact digg last week to let them know I would be publishing this, and offered them the opportunity to have it delayed while they upgraded their CAPTCHA. I haven’t heard from them as of now. I still offer them the opportunity to contact me, and I will temporarily remove this article.

Update:
~30 hours after I posted this, they changed the nature of their CAPTCHA. I will be writing a follow-up soon … here is my response

19 thoughts on “Defeating Digg’s CAPTCHA”

  1. kai,

    I didn’t notify them that their version 2(?) was partly unreadable, but I’m sure others did.

  2. Hi, I was very interested in this,
    as I’m working on a small project that requires programming skills, which I have no background in outside of QuickBasic.

    If you would email me, I would like to have a chat.
    thanks
    Chris.

  3. Hello,
    Thank you for this clear explanation. I was looking for a CAPTCHA script to include in the registration system for
    http://www.blogje.eu/ This is a free online blog system where users can create their own website with a nice domain name.
    Recently we found out we needed some form of CAPTCHA to prevent malicious hackers from automatically creating accounts in our system by brute force.
    I was thinking about EZ-Gimpy, but after reading this tutorial I am not sure anymore…
    I wonder, is there really a form of CAPTCHA which isn’t crackable?

  4. Hello,
    You say you used free software to create your script.

    If you mean the GPL, you have to provide your source code in order to fulfill the license agreement.

    Please be fair and respect open-source licenses!!

  5. reminder,

    you are mistaken, I don’t have to release source code for a program that I’m not distributing.

    From:
    http://www.fsf.org/licensing/licenses/gpl-faq.html#GPLRequireSourcePostedPublic
    “You are free to make modifications and use them privately, without ever releasing them. This applies to organizations (including companies), too; an organization can make a modified version and use it internally without ever releasing it outside the organization.”

  6. Quite correct, Brent. Several organizations use and modify GPL’d software internally, e. g. the US Dept. of Defense. They don’t release their code outside of the organization. That’s one of the freedoms that the GPL provides: the keeping of your tweaks to yourself if you want.

    Thank you for presenting this article. Security through obscurity is not security at all, and as the Microsofts and Apples of the world continue to prove again and again, it never has been. The more that we share information, the better armed we all are to defend ourselves against the baddies out there. The OpenBSD team’s work is an excellent example of this, as is the book “Applied Cryptography”. You did a good thing here.

  7. Look at this CAPTCHA – it’s animated. The whole code is never visible, only a part of it. A visitor can solve the CAPTCHA, but no robot can.

  8. You guys don’t get it. Spend this time writing a better CAPTCHA or something else that will provide the same functionality. I see no talent here in cracking an existing one. Help out a little and build something useful, instead of feeling all big because you can crack something you know is obviously crackable.

  9. Zork,

    This wasn’t about feeling all big; this was to showcase one method someone might use to crack a CAPTCHA, hopefully providing some insight into how to design better CAPTCHAs.

  10. Actually, a better thing to do is ask questions, such as “What is the 3rd letter in this sentence?” or “Write the following word below ‘y’”

    These are harder for bots to defeat (since you can personalise them), as well as being solvable by people who are blind and require a screen reader.

  11. Greetings & thanks for writing this article … it was an excellent read. It’s sad to see how easily captcha security can be defeated.

    I just spent an afternoon or so writing an animated gif captcha (check the web site). I am assuming that decoding a series of pictures would be more difficult than just a flat one, but I don’t know that for sure.

    Any chance of getting you to take a look at it from a security perspective and point out some of my weak points? Source is included just in case you need to look around for a break-in point.

    Thanks in advance

  12. Pingback: T=Machine » Austin GDC: Vote for your conference

  13. How about combining the comment from ‘someone’ with a CAPTCHA, effectively a question written in a CAPTCHA?

    The answer isn’t what’s written in the image, but the answer to the question.

    e.g. “What is the 3rd letter in this sentence?” displayed in a CAPTCHA image.
