How Investigators Developed a CAPTCHA Solver to Assist Dark Web Analysis

A team of investigators at the Universities of Arizona, South Florida, and Georgia has developed a machine-learning-based CAPTCHA solver that they claim can overcome around 94.4% of real challenges on dark websites. The study's goal is to create a system that can streamline the collection of cyber threat intelligence, which currently requires people to solve CAPTCHAs manually.

Cybercrime costs are rising sharply, with cyberattacks and data breaches happening every day. As such, having a way to make the dark web more transparent for research is key to taking targeted preventive action.

What are Dark Web CAPTCHAs?

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is used by websites to differentiate between real users and bots. These challenges are omnipresent on the dark web to protect platforms from the constant DDoS attacks that competing platforms launch against each other.

These DDoS attacks are carried out by botnets, and thus having a strong layer of CAPTCHA at the login page can keep the threat under control.

However, each website implements its own custom CAPTCHA challenge, making it practically impossible to develop a single tool that can solve most of them. As a result, collecting cyber-threat intelligence from illicit dark web markets and forums becomes difficult and expensive, because employees have to be involved in the CAPTCHA-solving step.

All about the Machine-learning approach

To address this practical problem, the researchers have developed a system that relies on interpreting rasterized images, which is qualitatively different from other recent studies that also used generative adversarial network-based approaches.

The new solver distinguishes letters and numbers by denoising the image, identifying the borders between letters, and segmenting the content into individual characters, which it then examines one by one.
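The three steps described above (denoise, find borders, segment) can be sketched on a toy binary image. This is only an illustration under simple assumptions, not the researchers' pipeline: the function names `denoise`, `find_borders`, and `segment` are hypothetical, noise is assumed to be isolated single pixels, and characters are assumed to be separated by fully empty columns.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = ink, 0 = background


def parse(rows: List[str]) -> Image:
    """Turn rows of '#' and '.' into a binary image."""
    return [[1 if c == "#" else 0 for c in row] for row in rows]


def denoise(img: Image) -> Image:
    """Clear ink pixels that have no ink among their 8 neighbours."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1:
                around = sum(
                    img[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))
                ) - 1  # subtract the pixel itself
                if around == 0:
                    out[y][x] = 0
    return out


def find_borders(img: Image) -> List[Tuple[int, int]]:
    """Return (start, end) column spans containing ink; end is exclusive."""
    w = len(img[0])
    has_ink = [any(row[x] for row in img) for x in range(w)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, w))
    return spans


def segment(img: Image) -> List[Image]:
    """Denoise, then cut the image into one sub-image per character span."""
    clean = denoise(img)
    return [[row[a:b] for row in clean] for a, b in find_borders(clean)]


# Toy CAPTCHA: three "characters" plus one stray noise pixel in column 4.
captcha = parse([
    "##....#...#.#",
    "#.....##..###",
    "##..#.#...#.#",
])

characters = segment(captcha)
print(len(characters))                 # number of character segments
print(find_borders(denoise(captcha)))  # column spans of each segment
```

Without the denoising step, the stray pixel would be cut out as a fourth "character", which is why cleaning the image comes before border detection in this sketch.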

As such, the size of the CAPTCHA challenge doesn’t affect the effectiveness of the solver much, especially when measuring cumulative performance over three attempts.

For character recognition, the solver uses samples extracted across multiple local regions to identify fine-grained features such as lines and edges, so it can’t be “fooled” by character rotation, font size changes, or color mixups.
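To make "local regions" and "edges" concrete, the sketch below slides a small filter over every 3x3 region of an image and records how strongly each region responds. It assumes a convolution-style filter of the kind used in image-recognition networks; the article does not describe the solver's actual layers, and the names `filter_response` and `VERTICAL_EDGE` are hypothetical.

```python
from typing import List

Grid = List[List[int]]

# Responds when pixels get brighter from left to right (a vertical edge).
VERTICAL_EDGE: Grid = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def filter_response(img: Grid, kernel: Grid) -> Grid:
    """Apply the kernel to every local region it fits in (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    return [
        [
            sum(
                img[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(w - kw + 1)
        ]
        for y in range(h - kh + 1)
    ]


# Left half dark, right half bright: a vertical stroke boundary.
stroke = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Uniform image: no edges anywhere.
flat = [[1, 1, 1, 1] for _ in range(3)]

print(filter_response(stroke, VERTICAL_EDGE))  # strong response at the edge
print(filter_response(flat, VERTICAL_EDGE))    # zero response everywhere
```

Because the filter only looks at a small neighbourhood at a time, the same edge is detected wherever it appears in the image, which is one reason local features tend to be less sensitive to changes in character position or size.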

Why Is Real-world Testing Required?

The researchers tested their most optimized solving model, DW-GAN, against Yellow Brick, a now-defunct dark web market that hosted illicit content listings.

As the paper explains:

Using a crawler enhanced by our DW-GAN, we were able to collect 1,831 illegal products from Yellow Brick. Among these products, there were 286 cybersecurity-related items, including 102 stolen credit cards, 131 stolen accounts, 9 forged document scans, 44 hacking tools, and 1,223 drug-related products.

Overall, collecting “Yellow Brick” market intelligence with DW-GAN took about 5 hours without human involvement. In particular, each HTTP request took 8.8 seconds for loading a new webpage; therefore crawling 1,831 pages took 268.5 minutes. Solving the recurring CAPTCHA challenges (per 15 HTTP requests) took our DW-GAN crawler 18.6 seconds.

Overall, the proposed framework could automatically break CAPTCHA with no more than three attempts. Breaking all CAPTCHA images takes about 76 minutes [sic] in total for all 1,831 product pages, a fully automated process.

Of course, this testing data concerns one particular dark web market, but the researchers expect a similar level of performance on any site that employs word CAPTCHAs.
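The timing figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the quote and makes two assumptions of its own: one CAPTCHA is counted for every full block of 15 requests, and each challenge is solved in a single 18.6-second pass.

```python
# Figures taken from the quoted passage.
PAGES = 1831
SECONDS_PER_REQUEST = 8.8
REQUESTS_PER_CAPTCHA = 15
SECONDS_PER_CAPTCHA = 18.6

# Time spent loading pages.
crawl_minutes = PAGES * SECONDS_PER_REQUEST / 60

# Assumption: one challenge per full block of 15 requests, solved once.
captcha_count = PAGES // REQUESTS_PER_CAPTCHA
captcha_minutes = captcha_count * SECONDS_PER_CAPTCHA / 60

total_hours = (crawl_minutes + captcha_minutes) / 60

print(f"crawl time:     {crawl_minutes:.1f} min")    # 268.5 min
print(f"CAPTCHA count:  {captcha_count}")            # 122
print(f"CAPTCHA time:   {captcha_minutes:.1f} min")  # 37.8 min
print(f"total:          {total_hours:.1f} h")        # 5.1 h
```

Under these assumptions the crawl time matches the quoted 268.5 minutes and the total lands at roughly 5 hours, as stated. The estimated CAPTCHA time of about 38 minutes does not match the quoted 76 minutes, which is consistent with the [sic] in the passage; the sketch makes no attempt to explain that gap.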

What Is the Possible Impact?

Intelligent and highly capable CAPTCHA solvers like this one can potentially disrupt the space, at least on the dark web, where less sophisticated challenges are used.

The authors have published the final version of their solver on GitHub, but not the training data set of 50,000 CAPTCHA images. Someone could presumably build on this model to derive something that also works on weak clearnet CAPTCHA implementations.

As the paper emphasizes regarding this possibility: “while this study is mainly focused on dark-web CAPTCHA as a more challenging problem, the proposed method in this study is expected to be applicable to other types of CAPTCHA without loss of generality.”

This novel solver may have been developed for the noble purpose of tackling cybercrime, but it still holds the potential to impact those who use the dark web for obscurity and safe exchange of information.
