WEB SCRAPING & GENERATIVE AI: the Italian Data Protection Authority urges websites and online platforms to consider adopting countermeasures

Bergs&More
Jun 11, 2024

By decision no. 329 of 20th May 2024, the Italian Data Protection Authority adopted the guidance document "Web scraping and generative artificial intelligence: guidance document and possible countermeasures" (translated from Italian), which is attached to the decision itself. The guidance document takes into account the contributions received by the Italian Data Protection Authority in the fact-finding investigation on web scraping launched by order of 21st December 2023.

It should be noted that “web scraping” refers to the activity of mass and indiscriminate collection of data (including personal data) by bots for the purpose of storing and retaining the collected data for subsequent targeted analysis, processing, and use. Web scraping can be carried out for various purposes, both lawful and unlawful: the guidance document under consideration focuses on web scraping for the purpose of training generative artificial intelligence algorithms.

The guidance document is not addressed to those who carry out web scraping; it does not assess the lawfulness of web scraping for the purpose of training generative artificial intelligence algorithms.[1] Instead, it is aimed at public and private entities that, as data controllers, publish personal data on their websites or online platforms. Thus, the guidance document is particularly interesting as it provides practical guidance to those who manage websites and online platforms in their capacity as data controllers. Indeed, many of the actions proposed in the guidance document are useful for countering or at least mitigating web scraping activity, even though the Italian Data Protection Authority focuses specifically on web scraping for the purpose of training generative artificial intelligence algorithms.

Before presenting possible actions against web scraping, the Italian Data Protection Authority sets out several preliminary premises:

pursuant to the accountability principle, the countermeasures proposed by the Italian Data Protection Authority are not to be considered mandatory. Each data controller must assess, on a case-by-case basis, whether to implement measures to prevent or mitigate web scraping and, if so, which ones, taking into account, inter alia, the nature, context and purposes of the personal data published, as well as the protection afforded by other legislation (e.g. copyright law);
the countermeasures envisaged by the Italian Data Protection Authority cannot be considered sufficient to prevent web scraping entirely, but are precautions against uses of published personal data by third parties that are deemed unauthorised;
the guidance document does not address the security measures to be implemented to protect personal data from malicious web scraping that exploits vulnerabilities in IT systems.

The individual countermeasures envisaged by the Italian Data Protection Authority can be summarised as follows.

1. Creation of reserved areas

Reserved areas remove data from public availability, indirectly contributing to greater protection of personal data from web scraping. On the other hand, such a measure must not result in excessive data processing, for example by imposing excessive and/or unjustified registration obligations.

2. Inclusion of ad hoc clauses in the terms of service

Including an express prohibition on the use of web scraping techniques in the terms of service of a website or online platform is a legal precaution that operates ex post and may act as a deterrent; if the clause is not complied with, the managers of websites and online platforms may take legal action to have the counterparty's breach of contract declared.

3. Network traffic monitoring

The monitoring of HTTP requests makes it possible to detect abnormal flows of incoming and outgoing data and to implement appropriate protective countermeasures. This precaution may also be accompanied by rate limiting, a technical measure that restricts network traffic and the number of requests by selecting only those from certain IP addresses, thereby preventing excessive data traffic a priori.
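By way of illustration only, rate limiting per IP address could be sketched as follows. The guidance document does not prescribe any particular algorithm: the sliding-window approach, the class name SlidingWindowRateLimiter, its parameters and the thresholds are all assumptions made for this example, and the IP addresses are taken from ranges reserved for documentation.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Hypothetical helper: allow at most `max_requests` per client IP
    within any period of `window_seconds`."""

    def __init__(self, max_requests=100, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # client IP -> timestamps of accepted requests

    def allow(self, client_ip):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self._clock()
        hits = self._hits[client_ip]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # too many recent requests from this address
        hits.append(now)
        return True
```

In a real deployment a check of this kind would normally sit in a reverse proxy, a web application firewall or the application's middleware, with rejected requests answered by an HTTP 429 (Too Many Requests) status; the appropriate threshold depends on the traffic a site legitimately expects.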

4. Intervention on bots

Since web scraping relies on the use of bots, any technique that limits bot access helps to prevent or mitigate web scraping. Examples of bot interventions proposed by the Italian Data Protection Authority are summarised as follows:

the use of CAPTCHA checks;
periodically modifying the HTML markup[2] to hinder web scraping by bots or make it more complicated, e.g. by nesting HTML elements[3] or modifying other aspects of the markup, even randomly;
embedding the content or data to be shielded from web scraping within multimedia objects (on the other hand, such a measure could be an obstacle for users pursuing legitimate purposes, preventing them, for instance, from copying content from the website);
monitoring log files in order to block unwanted user-agents, where identifiable;
intervening in the robots.txt file, i.e. the text file that allows website and online platform managers to indicate whether or not the entire website or parts of it may be subject to indexing and web scraping (it is important to note, however, that the robots.txt file cannot force bots to follow its instructions, so compliance relies solely on web scrapers voluntarily honouring it).
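The last two interventions can be illustrated with a short sketch. It is an example under stated assumptions, not a recommended configuration: the robots.txt content, the /members/ path, the helper names build_parser and is_blocked_user_agent, and the token "examplescraper" are hypothetical; "GPTBot" is the user-agent token that OpenAI publishes for its crawler and is used here only as an example. The sketch relies on the urllib.robotparser module of the Python standard library.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: one named crawler is excluded from the whole
# site, every other bot only from a (hypothetical) members area.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /members/
"""


def build_parser(robots_txt):
    """Parse robots.txt text the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical deny-list for server-side blocking, written in lower case.
BLOCKED_AGENT_TOKENS = ("gptbot", "examplescraper")


def is_blocked_user_agent(user_agent_header):
    """Return True if the User-Agent header contains a deny-listed token."""
    ua = user_agent_header.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


parser = build_parser(ROBOTS_TXT)
```

Here parser.can_fetch("GPTBot", "/articles/1") shows what a compliant crawler would conclude from the file, whereas is_blocked_user_agent is the kind of check a server could apply to its own logs or incoming requests. Neither is conclusive: a bot may ignore robots.txt altogether, and the User-Agent header is declared by the client and can be falsified, which is why such blocking works only where the user-agent is identifiable.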

The countermeasures listed by the Italian Data Protection Authority are not mandatory in all cases. Each data controller, as previously mentioned, must assess, on a case-by-case basis, whether to implement measures to prevent and/or mitigate web scraping for the purpose of training generative artificial intelligence algorithms and, if so, which ones. These assessments may be particularly complex, as they involve evaluating whether the purposes of web scraping carried out by third parties to train generative artificial intelligence algorithms are compatible with the purposes and legal basis on which the data controllers make personal data publicly available on their websites or online platforms. To this end, data controllers must reconcile data protection regulations with numerous other legal frameworks, such as those concerning copyright, transparency obligations of public administration, and data reuse.

[1] With regard to the lawfulness of web scraping itself, the following documents are worth mentioning. The “Report of the work undertaken by the ChatGPT Taskforce”, published by the EDPB on 23rd May 2024, contains some preliminary results of the investigations coordinated by the ChatGPT Taskforce into the processing of personal data carried out through ChatGPT; it states that the assessments of the lawfulness of the web scraping carried out by OpenAI are still ongoing.

Furthermore, on 1st May 2024 the Dutch Data Protection Authority published guidelines on web scraping, addressed to those who carry out web scraping, regardless of whether it is carried out for the purpose of training artificial intelligence algorithms. The document states that legitimate interest is the only legal basis that can, in the abstract, be invoked to carry out web scraping under data protection law. However, given the characteristics of web scraping, it is often difficult, if not impossible, for the conditions of legitimate interest to be fulfilled. The guidelines also include a paragraph addressing web scraping carried out for the purpose of training artificial intelligence systems. The Dutch Data Protection Authority points out that this type of web scraping poses additional risks that can undermine fundamental rights, beyond a potential violation of the GDPR. In fact, a lot of incorrect, misleading, or biased information can be found on the web, which, if used to train artificial intelligence systems, may cause them to return wrong information or lead to discriminatory effects in the future.

[2] Markups can be defined as the means by which a particular interpretation of a text is made explicit.

[3] It seems that the Italian Data Protection Authority is referring here to the use of multiple tags, which are formatting codes that help determine markups.

Author: Avv. Lorenzo Balestra

Contact: Avv. Luisa Romano l.romano@bergsmore.com