
There are many websites on the internet that are known to receive regular traffic from hackers, including a number of public forums, which are often used to release their stolen information. Hackers may release some of this information to take credit for a breach, attract buyers for the rest of the stolen information, or increase the damages to the target organization.

In order to maximize exposure, the data is usually released somewhere where people are meant to find it. However, hackers prefer to maintain their cover, so the location must also allow for anonymity. These criteria reduce the number of websites that are likely to be used to release personal or otherwise sensitive information.

By monitoring some of the websites that do fit these criteria, it is possible to gain visibility into the data being released and then use this data to create actionable intelligence.

Some of the most popular sites subject to this type of nefarious traffic are paste-style sites. These sites get their name from their basic function: you copy text and paste it onto the page. When you submit that text, the website publishes it to the internet and gives you your very own URL to share with people.

They are designed to let programmers quickly and easily share code across an ocean without adding another layer, such as email. To keep these services easy to use, no login or signup is required. Simply navigate to the webpage, paste your text, and click submit.

It’s easy to see how this sort of service could be abused to share stolen information. To better understand just how much information is being released, we have built a platform to scrape the data from some of the most popular paste-style sites.

Web scraping is a method of collecting data with little to no human interaction. Rather than using a web browser to “point and click,” it’s possible to leverage other networking tools to retrieve and parse data from the internet. By replacing manual browsing with automated scraping, it is possible to retrieve data from multiple locations on the internet, perform an analysis on that data, and store it locally, all in one automated process.
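As a rough illustration of this kind of automated retrieve-and-parse loop, the sketch below polls a paste site’s archive page and pulls down the raw text of each listed paste. The site name, URL patterns, and link format here are assumptions for illustration only; a real paste service publishes its own archive and raw-text endpoints, and this is not the actual OCD Tech platform.

```python
import re
import time
import urllib.request

# Hypothetical endpoints -- placeholders, not a real paste service.
ARCHIVE_URL = "https://paste.example.com/archive"
RAW_URL = "https://paste.example.com/raw/{paste_id}"


def raw_paste_url(paste_id: str) -> str:
    """Build the raw-text URL for a given paste ID."""
    return RAW_URL.format(paste_id=paste_id)


def fetch(url: str) -> str:
    """Retrieve a page as text -- no browser, no pointing and clicking."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def list_recent_paste_ids(archive_html: str) -> list:
    """Parse paste IDs out of the archive page's links.

    Assumes links of the form href="/<8-char id>" -- an illustrative
    convention, not any particular site's real markup.
    """
    return re.findall(r'href="/([A-Za-z0-9]{8})"', archive_html)


def poll_once() -> dict:
    """One scrape cycle: fetch the archive, then each paste's raw text."""
    pastes = {}
    for paste_id in list_recent_paste_ids(fetch(ARCHIVE_URL)):
        pastes[paste_id] = fetch(raw_paste_url(paste_id))
        time.sleep(1)  # be polite: rate-limit requests to the site
    return pastes
```

Running `poll_once()` on a schedule, and remembering which paste IDs have already been seen, is enough to turn this into a continuous monitoring process.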

Over the course of the past year, we have found some surprising results. By applying this technology and limiting the scope to popular paste-style sites, we have collected over 1.5 million email and password combinations.

While large and well-publicized breaches, such as those at Dropbox or Yahoo!, may affect tens of millions of users in a single leak, the breaches being captured by our scraper platform are generally much smaller. These small information leaks are occurring all the time and receive no real attention from the media because they do not affect a huge number of people.

However, because people are likely to reuse corporate credentials on external sites, breaches of these external systems may put the organization itself at risk. This is what our scraping platform excels at: collecting potentially valid corporate credentials from external third-party breaches.
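Pastes of breach data often contain lines in an "email:password" form, so spotting potentially reused corporate credentials can be as simple as extracting those pairs and filtering by the watched domain. The regex and domain check below are illustrative assumptions, not OCD Tech’s actual matching logic.

```python
import re

# Hypothetical credential pattern: an email address, a separator
# character, then a password token. Real dumps vary widely in format.
CRED_RE = re.compile(r"([\w.+-]+@[\w.-]+\.\w+)[:;,|]\s*(\S+)")


def extract_credentials(paste_text: str) -> list:
    """Pull (email, password) pairs out of raw paste text."""
    return CRED_RE.findall(paste_text)


def corporate_matches(creds: list, domain: str) -> list:
    """Keep only credentials whose email belongs to the watched domain."""
    suffix = "@" + domain.lower()
    return [(e, p) for e, p in creds if e.lower().endswith(suffix)]
```

Any match on the corporate domain would then feed the alerting step: notify the security team so the affected user’s password can be changed immediately.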

We use this data to monitor for, and alert on, the release of our own business-sensitive information or that of our clients, partners and vendors. Alerts are generated within minutes of the information being made public on any of the monitored resources. In the case of a leaked email address and password, steps can be taken to immediately change the user’s password. Beyond these functional objectives, our growing repository of paste data also gives us visibility into what has been released, even if it doesn’t affect our environment directly.

For example, in the past year, the scraper platform has passively collected credentials for NASA, the FBI and the Federal Reserve Bank, as well as countless private companies and organizations. Because these data dumps are occurring on external information systems, there is no way for an organization to know that it occurred unless they are proactively monitoring places where their information might be released.

OCD Tech will be discussing this web-scraping platform at BSides Boston on April 15, 2017. The event will be held at Harvard University. Hope to see you there!


About the Author: Scott Goodwin is an experienced IT Security Analyst with OCD Tech. He graduated with a Bachelor of Science in Physics from the University of Massachusetts-Boston in May of 2015. His primary engagements are IT vulnerability assessments, NIST 800-53 and 800-171 assessments, and security advisory services. He is also currently working on several research projects related to open source intelligence and penetration testing. You can follow Scott on Twitter, LinkedIn or on OCD Tech’s blog.

Editor’s Note: The opinions expressed in this guest author article are solely those of the contributor, and do not necessarily reflect those of Tripwire, Inc.