TechMediaToday

Web Scraping: Still Very Much a Gray Area

Organizations face a wide range of cyber threats, which is why employees are regularly trained about phishing and the risks of giving sensitive data away to potential attackers.

However, businesses commonly post potentially sensitive and valuable data on their websites for public consumption. Both business competitors and cybercriminals can use web scraping to collect this data for their use.

While the legality of scraping sites for financial gain is still very much a legal gray area, the right of an organization to protect itself and its data from automated threats is not.

A Quick Introduction to Web Scraping

Most organizations place a great deal of information on their public web presence. A visitor can explore the organization’s products, potentially make purchases, contact customer support, and maybe even learn details about how the organization works.

Much of this publicly available data is valuable to third parties for a variety of different purposes. For example, Google wants to know the contents of every website on the Internet to index them for its search engine.

However, having an employee manually visit a website and collect the desired and relevant data is inefficient and unscalable.


This is where web scraping comes in. A web scraper downloads a copy of a webpage using an ordinary HTTP request. The text of this page can then be analyzed locally to extract the data that the scraper is after (e.g., keywords for Google's search index).
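The download-and-parse step can be sketched in a few lines of Python using only the standard library. The HTML below is an inline sample standing in for a page fetched over HTTP (a real scraper would obtain it with something like `urllib.request.urlopen(url).read()`); the site name and email address are invented for illustration.

```python
from html.parser import HTMLParser

# Inline sample standing in for a page downloaded via HTTP.
# "Acme Corp" and the email address are hypothetical.
SAMPLE_HTML = """
<html><head><title>Acme Corp</title></head>
<body><h1>Products</h1><p>Contact sales@acme.example for pricing.</p></body>
</html>
"""

class TextExtractor(HTMLParser):
    """Collects the visible text chunks from an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

parser = TextExtractor()
parser.feed(SAMPLE_HTML)
page_text = " ".join(parser.chunks)
print(page_text)  # the page's visible text, ready for keyword extraction
```

Once the text is extracted locally, the scraper can search it for whatever it wants: keywords, email addresses, prices, and so on.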

If a web scraper wants to explore an organization’s entire website instead of a single webpage, it will look for internal links within each page and add them to its queue. By scraping each page, then finding and visiting every link it contains, a scraper can collect all the available information on an organization’s website.
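The queue-based crawl described above is essentially a breadth-first traversal of the site's link graph. Here is a minimal sketch; the page-to-links map is a hypothetical stand-in for what a real crawler would discover by fetching and parsing each page.

```python
from collections import deque

# Hypothetical site map: each page maps to the internal links it contains.
# A real crawler would build this by parsing each fetched page's HTML.
SITE = {
    "/": ["/products", "/about"],
    "/products": ["/products/widget", "/"],
    "/products/widget": ["/products"],
    "/about": ["/", "/about/team"],
    "/about/team": ["/about"],
}

def crawl(start):
    """Breadth-first traversal of internal links, visiting each page once."""
    queue = deque([start])
    visited = set()
    order = []
    while queue:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for link in SITE.get(page, []):
            if link not in visited:
                queue.append(link)
    return order

pages = crawl("/")
print(pages)  # every reachable page, visited exactly once
```

The `visited` set is what keeps the crawler from looping forever on sites whose pages link back to each other, as almost all do.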

Malicious Applications of Web Scraping

Web scraping can be used for good purposes, like Google’s collection of data for search engine indexing. However, web scraping can also be used for malicious purposes.

For example, another company may scrape your website and take information for competitive advantage.

Cybercriminals also use web scraping for a variety of different purposes. Scraping an organization’s website can be very useful for a cybercriminal performing a spear-phishing attack.

Data placed on the website can provide information about organizational structure, internal email addresses, vendors, and other intelligence that can make a spear-phishing attack look more plausible.

Alternatively, data on a business’s website can be used for reconnaissance before launching a cyberattack. Looking at job postings, error pages, etc. can tell an attacker about the systems and software that an organization is currently running.

By collecting this information and comparing it against lists of known vulnerabilities, a hacker can identify potential attack vectors without taking any action that could tip off the organization’s security team.
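Conceptually, this cross-referencing step is a simple lookup. The sketch below uses entirely made-up component names, versions, and advisory descriptions; a real attacker would match scraped version strings against public vulnerability databases.

```python
# Hypothetical: software versions inferred from job postings and error
# pages, cross-referenced against an invented advisory list.
observed_stack = {
    "webserver": "2.4.1",
    "cms": "5.0",
    "database": "10.3",
}

known_vulnerable = {
    ("webserver", "2.4.1"): "remote code execution advisory (hypothetical)",
    ("cms", "4.9"): "authentication bypass advisory (hypothetical)",
}

def find_attack_vectors(stack, advisories):
    """Return observed components whose exact version appears in the advisory list."""
    return {
        (name, version): advisories[(name, version)]
        for name, version in stack.items()
        if (name, version) in advisories
    }

vectors = find_attack_vectors(observed_stack, known_vulnerable)
print(vectors)
```

Note that nothing in this step touches the target's network: all the matching happens offline against data the website gave away, which is exactly why this kind of reconnaissance is so hard to detect.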

Web Scraping: A Legal Gray Area

For many businesses, protecting against web scraping may be important to protect their sources of revenue. Many organizations, like airlines and hotels, prefer that customers book directly with them since this allows them to control rates.

On the other hand, there are sites devoted to providing travellers with the best available travel deals, which may collect data about available locations and rates through web scraping. The legality of web scraping for financial gain impacts both types of organizations.

However, as mentioned, the legality of web scraping is still very much a gray area. On the one hand, any data placed on an organization’s website is intended for public consumption. So, theoretically, anyone can access it.

On the other, many businesses forbid scraping in their Terms of Service (ToS), though such terms are often difficult to enforce legally, and some argue that the contents of their websites are protected by copyright.

Protecting Websites Against Automated Threats


No matter how hard a human tries, they can’t quickly move a mouse in a perfectly straight line across the page. Bots, on the other hand, have no problem with this.

Additionally, the behaviour of a human user browsing a site and that of a bot performing web scraping differ dramatically. These features, among others, are used to differentiate a bot from a legitimate customer.
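The straight-line mouse movement mentioned above can serve as a simple behavioural signal. The sketch below flags a pointer path as bot-like when every sampled point sits almost exactly on the line between its start and end; the tolerance value and scoring approach are illustrative assumptions, not a production detector.

```python
# Illustrative heuristic: human mouse paths wobble, scripted ones often
# don't. The 1-pixel tolerance is an assumed threshold for the sketch.

def max_deviation(points):
    """Largest perpendicular distance of any point from the start-to-end line."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    length = (dx * dx + dy * dy) ** 0.5
    if length == 0:
        return 0.0
    return max(
        abs(dy * (x - x0) - dx * (y - y0)) / length
        for x, y in points[1:-1]
    )

def looks_automated(points, tolerance=1.0):
    """Flag paths that deviate less than `tolerance` pixels from a straight line."""
    return max_deviation(points) < tolerance

bot_path = [(0, 0), (10, 10), (20, 20), (30, 30)]    # perfectly straight
human_path = [(0, 0), (12, 7), (19, 24), (30, 30)]   # wobbly
print(looks_automated(bot_path), looks_automated(human_path))
```

Real bot-management products combine many such signals (timing, request patterns, browser fingerprints) rather than relying on any single heuristic, since sophisticated bots can add artificial jitter to their movements.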

As data grows more and more valuable, cybercriminals are increasingly using web scrapers and other automated tools to explore and exploit organizations’ web presences. The ability to identify and block traffic from bots can be invaluable in protecting an organization’s security and sensitive data.
