While using digg.com, I was surprised to see such an obviously weak CAPTCHA challenge. Within a couple of hours, using nothing but free software, I was able to create a script that defeats it with 88% accuracy. (If you are looking for code, forget it. This is almost too much information.)
Digg’s CAPTCHA Weaknesses:
- Dictionary Words
- Same background
- Same Font
- No deformations
- All lowercase letters
- Constant colors
Tools used:
- gocr – a GPL Optical Character Recognition program
- ImageMagick – for command-line image editing
- Perl – to tie everything together
Test set: 100 images containing 95 different words, with an average word length of 5.3 letters.
Just dumping all the images through gocr yielded 26% correct responses. Not too shabby. The raw output is often easy to clean up:
- [image] = groUDS
- [image] = single
- [image] = t’ o.l,i.c,e ‘. . .’.’. ”,
- [image] = be.cause.
Looking at the results, I’m sure that we could improve them with a little string manipulation.
We can clean up the output in a few ways:
- Convert all output to lowercase
- Remove non letter characters
- Spell checker
The first two yield 53% correct responses; with just this simple tweak we get more correct guesses than incorrect ones. Adding the first guess of a spell checker bumps the accuracy to 67%.
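The three cleanup steps can be illustrated with a short sketch. This is not the original script (which was written in Perl); it is a minimal Python illustration. The word list is hypothetical, and `difflib` from the standard library is only an assumed stand-in for whatever spell checker was actually used.

```python
import difflib
import re

# Hypothetical stand-in for a real dictionary; the actual word list is unknown.
WORDS = ["because", "groups", "police", "single"]

def normalize(raw):
    """Lowercase the OCR output and drop every non-letter character."""
    return re.sub(r"[^a-z]", "", raw.lower())

def first_guess(text, words=WORDS):
    """Return text if it is a known word, else the closest known word.

    If nothing is close enough, the text is returned unchanged.
    """
    if text in words:
        return text
    matches = difflib.get_close_matches(text, words, n=1, cutoff=0.6)
    return matches[0] if matches else text

# Sample raw OCR strings taken from the examples earlier in the article.
for raw in ["groUDS", "single", "be.cause."]:
    print(raw, "->", first_guess(normalize(raw)))
```

The closest-match step only helps when the dictionary resembles the one the CAPTCHA draws from, which is an assumption here.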
We can also manipulate the input images:
- Removing the border
- Adjusting contrast and brightness
- Using edge detection
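As a rough illustration of these manipulations, the sketch below only builds ImageMagick command lines; it does not run them. The specific option values are assumptions, since the article does not give the settings used. `convert` is the ImageMagick 6 command name (newer releases use `magick`), and the file names are hypothetical.

```python
def variant_commands(src, out_prefix):
    """Build (but do not run) ImageMagick command lines for a few image variants."""
    # Hypothetical settings; the exact adjustments used are not given in the article.
    options = [
        [],                                # unmodified image
        ["-shave", "1x1"],                 # trim a 1-pixel border from every side
        ["-brightness-contrast", "0x30"],  # leave brightness alone, raise contrast
        ["-edge", "1"],                    # edge detection with radius 1
    ]
    return [
        ["convert", src, *opts, f"{out_prefix}_{i}.pnm"]
        for i, opts in enumerate(options)
    ]

for cmd in variant_commands("captcha.png", "variant"):
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run` on a machine with ImageMagick installed.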
Since we are already over two-thirds accurate, we don’t need to adjust the input of every image, just the ones whose results aren’t dictionary words. Part of the problem is that while one adjustment will improve the results for one image, it will degrade them for another. My solution was to try 10 variations, run each through the OCR, and then spell check the output. I then had the program pick the solution with duplicate results; in the case of a tie or no duplicates, I had the program pick the one with the fewest number of variations. This method resulted in the final accuracy of 88%.
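The selection step can be sketched as a simple vote. This is a minimal Python illustration, not the original script. It assumes the candidate guesses arrive ordered from the least-adjusted image variant to the most-adjusted one, so that "fewest variations" becomes "earliest in the list"; that reading of the tie-break rule is an assumption.

```python
from collections import Counter

def pick_answer(candidates):
    """Pick the most frequent guess; on a tie or no duplicates, the earliest wins.

    candidates: guesses ordered from least- to most-adjusted image variant
    (an assumed ordering, used here as the tie-break).
    """
    candidates = [c for c in candidates if c]  # ignore empty OCR results
    if not candidates:
        return None
    counts = Counter(candidates)
    best = max(counts.values())
    # The first candidate that reaches the top count wins.
    for guess in candidates:
        if counts[guess] == best:
            return guess

print(pick_answer(["polite", "police", "police", "palace"]))
```

Keeping the candidates ordered makes the tie-break explicit and avoids a second data structure for tracking how heavily each variant was modified.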
Problems with this technique
While these quick results come close to being usable, they are still a far cry from 100% accuracy. Since digg uses a consistent font, I could train gocr on problematic letters (such as p). Also, given that I received 5 sets of duplicate words in 100 images, I would estimate their dictionary holds only a couple thousand words, so I could hand-tweak the results.
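The dictionary-size guess can be sanity-checked with a rough occupancy calculation. The Python sketch below assumes every word is drawn uniformly at random and that the test set held 95 distinct words out of 100 draws; both the uniformity assumption and the function names are illustrative.

```python
def expected_distinct(dict_size, draws):
    """Expected number of distinct words after uniform draws with replacement."""
    return dict_size * (1 - (1 - 1 / dict_size) ** draws)

def estimate_dictionary_size(draws, distinct):
    """Bisect for the dictionary size whose expected distinct count matches."""
    lo, hi = float(distinct), 1e7
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_distinct(mid, draws) < distinct:
            lo = mid  # too many collisions expected: dictionary must be larger
        else:
            hi = mid
    return (lo + hi) / 2

print(round(estimate_dictionary_size(100, 95)))
```

Under these assumptions the estimate lands around a thousand words, the same order of magnitude as the guess above. With only five duplicates observed, the estimate is loose.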
Related links:
- PWNtcha – a project to build a captcha decoder
- Breaking a Visual CAPTCHA – the breaking of EZ-Gimpy CAPTCHAs
I did contact digg last week to let them know I would be publishing this and offered them the opportunity to have it delayed while they upgraded their CAPTCHA. I haven’t heard from them as of now. I still offer them the opportunity to contact me and I will temporarily remove this article.
About 30 hours after I posted this, they changed the nature of their CAPTCHA. I will be writing a follow-up soon … here is my response