Ossi Mätäsaho
The Regulation of Web Scraping
A Brief Literature Review on Legal Frameworks and Access Control Mechanisms

Vaasa 2025
School of Technology and Innovations
Bachelor's thesis in Technology
Data Architecture

Statement of AI Usage
In accordance with academic integrity standards and institutional guidelines regarding artificial intelligence use in academic work, I hereby declare the following utilisation of AI tools in the preparation of this bachelor's thesis:
Artificial Intelligence Assistant Used
- Provider: Anthropic
- AI Model: Claude 3.7 Sonnet
Areas of Application:
- AI was used as a sounding board when interpreting EU directives and CJEU rulings.
- Questions critical for filtering the source material were asked of the AI.
- Some of the source material was given to the AI with the task of summarising its content.
- AI translated one Russian research article into English.
Responsibility Statement
I acknowledge full responsibility for:
- The accuracy and validity of all content presented in this thesis.
- The originality of the analysis.
- The scholarly contribution and academic integrity of this work in its entirety.
- All conclusions, interpretations, and recommendations.
This declaration has been prepared in compliance with the University of Vaasa guidelines for the use of artificial intelligence in teaching and learning.
Legal Disclaimer
This thesis examines web scraping regulation for academic purposes only. The legal analysis does not constitute legal advice and reflects the author's understanding as of May 2025. Interpretations of legislation, case law, and technical implementations should not guide decisions about specific web scraping activities. Consult qualified legal counsel before engaging in web scraping. Neither the author nor the University of Vaasa accepts liability for actions based on this thesis.
UNIVERSITY OF VAASA
School of …
Author: Ossi Mätäsaho
Title of the Thesis: The Regulation of Web Scraping: A Brief Literature Review on Legal Frameworks and Access Control Mechanisms
Degree: Bachelor of Science in Technology
Programme: Data Architecture
Supervisor: Maarit Välisuo
Year: 2025
Number of pages: 58

ABSTRACT:
Web scraping has become an established and essential means of digital data collection, but its legal status remains unclear. This thesis examines the regulation of web scraping from two interacting perspectives by means of a literature review. The material consists of academic literature, legal sources, and technical reports, which form the basis for an overall picture of the current regulatory environment and its challenges.

The examination focuses on the research questions of what legal challenges are associated with web scraping and what methods are commonly used to restrict the automated use of websites. The aim of the study is to increase understanding of the current state of web scraping regulation and of the relationship between legal and technical solutions.

The legal framework is built in particular on the European Union's regulation of database protection, such as the database directives and the text and data mining exceptions of Directive 2019/790. Attention is also paid to the fragmented interpretations of different jurisdictions and to how these affect the assessment of the legality of web scraping. In addition to the legal framework, the access control mechanisms found to be in common use are presented, divided into technical and administrative methods.

The study finds that the legal status of web scraping is in most cases open to interpretation, and that the legal bindingness of individual mechanisms also varies from case to case. Technical access control mechanisms do not always effectively stop sophisticated automated systems, and the legal weight of administrative methods, such as terms of service, depends on their technical implementation. The regulation of web scraping is therefore in its current state highly ambiguous, and interpretations of its legality can vary considerably.

The thesis helps in understanding the complex regulatory environment of automated data collection and illustrates how legislation and access control mechanisms are fundamentally intertwined. Suggested topics for further research are: the effects on web scraping of the EU Data Act, which becomes applicable in autumn 2025; the impact of AI development on the effectiveness of technical restrictions; and the formation of robust ethical frameworks for the practice of web scraping.

KEYWORDS: web scraping, access control methods, law, database protection, text and data mining

Contents
1 Introduction 7
1.1 Research Objectives and Scope 8
1.2 Research Methodology 10
1.3 Research Structure 11
2 Fundamentals of Web Scraping 13
3 Legislative Framework regarding Web Scraping 15
3.1 The Legal Protection of Databases 15
3.2 Text and Data Mining Exceptions in the Digital Single Market 17
3.3 Nature of websites – databases or not 19
3.3.1 Case Fixtures Marketing Ltd v Organismos prognostikon agonon podosfairou 19
3.3.2 Case Ryanair Ltd v.
PR Aviation BV 20
4 Access Control Mechanisms 22
4.1 Robots Exclusion Protocol 22
4.2 Terms of Service 23
4.3 CAPTCHA 24
4.3.1 Text-based CAPTCHAs 25
4.3.2 Image-based CAPTCHAs 27
4.3.3 Behaviour-based CAPTCHAs 28
4.3.4 Integrating third-party CAPTCHA systems 29
4.4 IP-based access control 32
5 Literature review 35
5.1 Legal Challenges acknowledged in literature 36
5.1.1 Legal status of Web Scraping 36
5.1.2 Unauthorised access 37
5.1.3 Enforceability of Terms of Service 37
5.1.4 Legal Protection of Websites and their Content 39
5.2 Technical Access Control Mechanisms 41
5.2.1 Ethical implications of robots.txt 41
5.2.2 Current Challenges of CAPTCHA systems 41
5.2.3 Role of IP-based access control in preventing web scraping 42
6 Conclusions 44
6.1 Summary of Key Findings and Research Contribution 44
6.2 Future Considerations 47
References 48
Appendices 55
Appendix 1. Taxonomy of text-based CAPTCHAs (Guerar et al., 2022) 55
Appendix 2. Taxonomy of image-based CAPTCHAs (Guerar et al., 2022) 56

Figures
Figure 1 Traditional model of the consumer buying process (Stankevich, 2017, p. 10). 7
Figure 2 The narrative literature review process by Juntunen & Lehenkari (2021). 10
Figure 3 Example of a GIMPY CAPTCHA (von Ahn et al., 2004, p. 58). 25
Figure 4 Two-word system used in reCAPTCHA v1; it also includes an audio-based option for the visually impaired (von Ahn et al., 2000). 26
Figure 5 Example of a DotCHA CAPTCHA, where each letter must be individually identified (Suzi & Sunghee, 2019). 26
Figure 6 First implementation of selection-based CAPTCHA in No captcha reCAPTCHA (Shet, 2014b). 27
Figure 7 Different variations of a selection-based CAPTCHA (NopeCHA, 2025). 28
Figure 8 General framework for third-party CAPTCHAs (modified from Jin et al., 2023, p. 5). 30
Figure 9 reCAPTCHA request flow diagram showing signal collection and reCAPTCHA v3 scoring system (Pathum, 2023).
30
Figure 10 hCaptcha request flow diagram showing passcode generation and verification process (hCaptcha, 2025). 31
Figure 11 Dynamic model for token bucket algorithm (Ahmed et al., 2002, p. 267). 34

Abbreviations
REP - Robots Exclusion Protocol
ToS - Terms of Service
CAPTCHA - Completely Automated Public Turing test to tell Computers and Humans Apart
IP - Internet Protocol
HTML - HyperText Markup Language
XML - eXtensible Markup Language
HTTP - HyperText Transfer Protocol
CJEU - Court of Justice of the European Union

1 Introduction
The digital marketplace has fundamentally transformed consumer decision-making processes, with price comparison services emerging as critical tools for informed purchasing decisions. Stankevich (2017, p. 8) observes that consumers increasingly seek solutions that integrate multiple functionalities. In response, the digital marketplace is seeing an increase in services that have adapted to this change in customer requirements. Specifically, price comparison services have evolved from simple price aggregators into comprehensive platforms that encompass various stages of the traditional consumer decision-making process (model illustrated in Figure 1). The three middle stages of this process – Information Search, Alternatives Evaluation, and Purchase Decision – are also particularly crucial for the successful implementation of a digital marketplace solution. For example, market leaders such as Skyscanner and Booking.com both command significant market share (Curry, 2025) by providing efficient solutions to customers through integrating these three middle stages of the customer decision-making process.

Figure 1 Traditional model of the consumer buying process (Stankevich, 2017, p. 10).

However, the implementation of services like Skyscanner and Booking.com is a process characterized by complexity, as it involves, in addition to technical challenges, legal and ethical perspectives that must be addressed.
Data sourcing for these services can be done either with APIs (application programming interfaces) or through web scraping, and usually both are used. Sourcing data by web scraping poses legislative and ethical challenges, as it is not a transaction between a service provider and a customer, which is the case with APIs.

The digital marketplace is just one example of the various fields where web scraping can be utilized. As Luscombe et al. (2022, p. 1024) note, some of the fields where web scraping is now crucial include:
 Criminology
 Communication Science
 Economics
 Organization Studies
 Policy Studies
 Political Science
 Psychology
 Sociology

The growing importance of web scraping extends beyond these rather academic fields, finding critical applications in journalism, market research, and public policy analysis. Its ability to extract vast amounts of digital data has revolutionized information gathering across both scholarly and professional domains. In contrast to the growing importance of web scraping, its legal landscape has not been addressed thoroughly. The motivation of this research is to synthesize information about some of the legal considerations regarding web scraping and relevant access control mechanisms.

1.1 Research Objectives and Scope
This thesis aims to explore how web scraping is regulated, focusing on how legislative frameworks address this practice and what mechanisms are used to limit or permit it. The central objective is to examine how current legislation interprets and governs automated data collection, particularly highlighting the legal interpretations of the explored access control mechanisms. The legislative perspective specifically considers directives of the European Parliament and the Council, as they establish a harmonised legal foundation across EU member states. Additionally, select U.S. legal papers are included to provide insights into the enforceability of website terms of use.
The research objective will be accomplished through answering the following questions:
- What are the primary legal challenges associated with web scraping under current legislative frameworks and judicial interpretations, and how do these challenges affect the permissibility of automated data collection?
- What are the general techniques used to control automated access to website data, and what legal implications arise from circumventing them?

To provide a clearer understanding of the research context and findings, this thesis includes an overview of the fundamentals of web scraping. The focus is limited to defining web scraping, outlining its general implementation process, explaining its relevance in digital marketplace environments, and giving an overview of other application areas. A comprehensive technical analysis of various implementation technologies is beyond the scope of this thesis. However, selected examples of commonly used tools will be presented to illustrate different aspects of the foundational concepts.

1.2 Research Methodology
This thesis is conducted as a narrative literature review that applies the theoretical process presented by Juntunen & Lehenkari (2021, p. 336) in practice. The research methodology used in this thesis is illustrated in Figure 2.

Figure 2 The narrative literature review process by Juntunen & Lehenkari (2021).

The primary literature search was conducted using three databases:
- Tritonia-Finna: A multidisciplinary academic database offering access to the collections and information services of the Tritonia Academic Library, primarily supporting higher education and research institutions in Finland.
- Google Scholar: A freely accessible search engine that indexes scholarly articles across a wide range of disciplines and sources, providing broad international coverage to complement specialized databases.
- ResearchGate: An academic networking platform where researchers share publications, collaborate, and often provide open access to their work, including preprints and full-text articles.

In addition to the primary literature search, snowball sampling was employed to ensure comprehensive coverage of the relevant literature within the research domain. Snowball sampling involves the systematic identification of additional relevant sources through examination of the reference lists of previously identified literature, thereby expanding the collection of relevant texts through bibliographic networks. Zotero was used to manage all the source material for this thesis, including its browser extension for efficient integration with the database.

1.3 Research Structure
This thesis is structured into six main chapters, each of which builds progressively towards answering the research questions defined above. The structure listed below ensures a comprehensive approach, starting from a foundational understanding of web scraping and then advancing into the relevant legislation and related literature. A brief explanation of the structure and contents of each chapter:
 Chapter 1: Introduction
o This chapter presents the background, motivation, and relevance of the topic. It also outlines the research objectives, scope, and methodology used to conduct this study.
 Chapter 2: Fundamentals of Web Scraping
o This chapter outlines the core concepts, definitions, and general principles of web scraping, providing a foundation for understanding the technology and its typical use cases across various domains.
 Chapter 3: Legislative Framework regarding Web Scraping
o This chapter explores the European Union's laws on web scraping, analysing directives like Directive 96/9 on database protection and Directive 2019/790 on text and data mining, with case law to illustrate legal implications.
 Chapter 4: Access Control Mechanisms
o This chapter discusses various methods used to control automated access to websites, including administrative measures like REP and ToS, and technical measures such as CAPTCHA, IP blocking, and rate limiting.
 Chapter 5: Literature Review
o Synthesises academic perspectives on legal challenges and access control mechanisms.
 Chapter 6: Conclusions
o Summarises key findings through structured tables, discusses relationships between legal frameworks and technical implementations, and identifies future research directions.

Each chapter is designed to build on the previous one, ensuring a coherent and logical progression of ideas from introduction to conclusion. The structure supports the thesis aim by balancing technical explanation with legal analysis and academic synthesis.

2 Fundamentals of Web Scraping
Web scraping is a sophisticated way of gathering and structuring data systematically from the web. According to Zhao (2017, p. 1), it is the process of gathering data from the World Wide Web using HTTP protocols or browsers, then transforming the data into a dataset that is easily analysed. The process uses software agents, also known as web robots, to simulate the act of browsing and pull information from websites (Glez-Peña et al., 2014, p. 789).

In general, the web scraping process can be divided into two sequential phases (Zhao, 2017, p. 1):
1. HTTP Request Phase: The initial communication with the website occurs through either:
a. A URL (Uniform Resource Locator) containing a GET query, or
b. An HTTP message containing a POST query.
During this phase, the requested resource is retrieved from the website and transmitted back to the web scraping application.
2. Data Extraction Phase: Once content is retrieved, the scraper proceeds to:
a. Parse the HTML/XML structure
b. Identify and extract relevant data points
c. Reformat and organise the extracted data into a structured format (e.g., CSV, JSON, database records).

Modern web scraping implementations achieve these processes through two well-defined software modules:
1. Request and Interaction Module: Responsible for composing the HTTP requests and controlling the web interactions (Glez-Peña et al., 2014, pp. 789–790), such as:
a. urllib2, which defines a set of functions for dealing with HTTP requests, such as authentication, redirections, cookies, and session management (Zhao, 2017, p. 1).
b. Selenium, which is a web browser automation framework that enables programmatic control of browsers to interact with dynamic websites using JavaScript (Zhao, 2017, p. 1).
2. Parsing and Extraction Module: Responsible for parsing and extracting the data from the fetched content (Glez-Peña et al., 2014, p. 790), such as:
a. Beautiful Soup, which is designed to scrape HTML and other XML documents (Zhao, 2017, p. 2).
b. pyquery, which provides jQuery-like functionality for parsing XML documents (Zhao, 2017, p. 2).

The significance of web scraping is highlighted by its effectiveness in handling the rapidly growing amount of data available across the internet in various formats. As Khder (2021, p. 165) concludes, it is an essential tool for companies in various fields in their attempts to maintain market position in today's digital market. Singrodia et al. (2019, p. 5) have identified distinctive application areas for web scraping:
- Data mining: Extracting patterns and insights from large datasets collected across multiple websites.
- Research: Gathering data for academic studies, market analysis, and trend identification.
- Marketing: Monitoring competitor pricing, product offerings, and promotional strategies.
- Competitive Intelligence: Tracking industry developments and competitor activities to inform business decisions.
- Personal Tools: Creating customized alerts, dashboards, and aggregators for individual use.
- Data Integration: Combining information from multiple sources into unified datasets.

In addition, one of the major application areas is consumer-targeted digital marketplaces, such as the already mentioned Skyscanner and Booking.com.

3 Legislative Framework regarding Web Scraping
Web scraping exists in a complex legal framework that intersects multiple areas of law and their interpretations. The legal complexity surrounding web scraping stems from its technological nature, which often outpaces existing legal frameworks. While the scraping of publicly available content in Europe is regulated variably depending on the country (Fontana, 2025, p. 203), the direction of national regulations is coordinated at the European Union level with various directives. The decisions made by the Court of Justice of the European Union in various web scraping related cases also shape the legislative framework. This chapter will examine web scraping related directives and some examples of case law to foster a basic judicial knowledge of the matter.

3.1 The Legal Protection of Databases
This chapter will examine the contents of Directive 96/9 of the European Parliament and of the Council of 11 March 1996 on the Legal Protection of Databases, which provides two distinct layers of protection to databases. Article 1(2) defines databases in the context of Directive 96/9 as systematically or methodically arranged collections of independent works, data or other materials individually accessible by electronic or other means.

The first layer of protection is copyright protection for original databases.
Article 3(1) of Directive 96/9 grants copyright to "databases which, by reason of the selection or arrangement of their contents, constitute the author's own intellectual creation".

The sui generis right acts as the second layer of protection, as it provides the maker of the database the right to "prevent extraction and/or re-utilization of the whole or of a substantial part, evaluated qualitatively and/or quantitatively, of the contents of that database" (Directive 96/9 art 7 para 1). This protection is only provided if it is shown that there has been a substantial investment in creating the database. Together, copyright protection of the database structure and the sui generis right create a protection system where either the intellectual originality or the substantial investment put into creating the database, or both, may be protected by law.

To understand the effects of the sui generis right on web scraping, it is imperative to examine the definitions of extraction and re-utilization. Article 7(2) of Directive 96/9 defines them as follows:
(a) 'extraction' shall mean the permanent or temporary transfer of all or a substantial part of the contents of a database to another medium by any means or in any form;
(b) 're-utilization' shall mean any form of making available to the public all or a substantial part of the contents of a database by the distribution of copies, by renting, by on-line or other forms of transmission.

If the database has been made available to the public, in whatever manner, the maker does not have the right to prevent a lawful user from "extracting and/or re-utilizing insubstantial parts of its contents, evaluated qualitatively and/or quantitatively, for any purposes whatsoever" (Directive 96/9 art 8 para 1). In cases where a lawful user is authorized to use only a part of the database, this provision applies only to that part.
However, Article 7(5) still protects the database from unreasonable exploitation, as it defines the repeated and systematic extraction of insubstantial parts of the database as a non-permitted action (Directive 96/9).

Although providing strong protection to databases, Directive 96/9 also explicitly recognizes that this protection does not extend to their contents. Article 3(2) separates these two entities – a database and its contents – in such a way that the protection of a database does not extend to its contents, while the contents preserve all existing rights (Directive 96/9). This creates a system where different legal protections can coexist at different levels:
- The database structure may have copyright protection.
- Individual content may have its own separate protection.

In addition to the separation of entities, recital 45 of Directive 96/9 clarifies that the sui generis right does not in any way extend copyright protection to data or mere facts.

3.2 Text and Data Mining Exceptions in the Digital Single Market
Directive 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market represents a clear advancement in the clarity of legislation related to web scraping, as it establishes a legal framework for text and data mining activities, which are inherently connected to web scraping. The directive includes important exceptions to copyright and database protection rights.

To lay the foundation for the examination of this directive, it is important to cover its key definitions. Article 2(1) defines research organisations as non-profit or public interest entities primarily conducting scientific research or education, where research results cannot be preferentially accessed by any controlling commercial entity (Directive 2019/790).
It is also crucial to note Article 2(2), as it gives a clear definition of text and data mining (TDM):

'text and data mining' means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations;

The directive makes a distinction between general TDM and TDM conducted for scientific research. Research organisations enjoy broader exceptions, as they have a mandatory exception to the following rights (Directive 2019/790 art 3 para 1):
- From Directive 96/9 (Database Directive):
o Article 5(a): the reproduction right for databases protected by copyright
o Article 7(1): the sui generis extraction right for databases
- From Directive 2001/29 (Information Society Directive):
o Article 2: the reproduction right for copyright works
- From Directive 2019/790 (Digital Single Market Directive):
o Article 15(1): the new press publishers' rights for online uses of press publications

Article 4 is applicable generally to any entity and provides, in its first paragraph, the following exceptions "for reproductions and extractions of lawfully accessible works and other subject matter for the purposes of text and data mining":
- From Directive 96/9 (Database Directive):
o Article 5(a): the reproduction right for copyright-protected databases
o Article 7(1): the sui generis right preventing extraction from databases
- From Directive 2001/29 (Information Society Directive):
o Article 2: the reproduction right for copyright works
- From Directive 2009/24 (Software Directive):
o Article 4(1)(a): the permanent or temporary reproduction of computer programs
o Article 4(1)(b): the translation, adaptation, arrangement or other alteration of computer programs
- From Directive 2019/790 (Digital Single Market Directive):
o Article 15(1): the press publishers' rights for online uses of press publications

While providing exceptions very similar to those of Article 3, Article 4 does not provide as strong a protection, as the exception is applicable only on condition that the rightholders have not reserved their rights in an appropriate manner (Directive 2019/790 art 4 para 3).

3.3 Nature of websites – databases or not
Whether or not websites qualify as databases has significant implications for web scraping. The definition of a database in Directive 96/9 (art 1 para 2) establishes three key criteria:
- A collection of independent works, data or other materials
- Arranged in a systematic or methodical way
- Individually accessible elements

Recital 17 of the same directive clarifies that databases should be understood to include literary, artistic, musical or other collections of works, or collections of other material such as texts, sound, images, numbers, facts, and data.

Although not addressing this question directly, the Court of Justice of the European Union has ruled on related cases, which have given clear indications of the judicial interpretation of European Union law.

3.3.1 Case Fixtures Marketing Ltd v Organismos prognostikon agonon podosfairou
In Case C-444/02 (2004) Fixtures Marketing Ltd v Organismos prognostikon agonon podosfairou AE (OPAP), it was stated that the definition of a database in Article 1(2) of Directive 96/9 refers to any "collection of works, data or other materials, separable from one another without the value of their contents being affected, including a method or system of some sort for the retrieval of each of its constituent materials" (Case C-444/02 para 32). The specifics of this case were that:
- Fixtures Marketing Ltd represented the English and Scottish football leagues, who claimed intellectual property rights in their fixture lists.
- OPAP used information from these fixture lists for organizing betting games.
- Fixtures Marketing sued OPAP, claiming violation of the sui generis right.
While this case did not directly address whether websites are databases, as it interpreted Directive 96/9 from a general perspective and specifically with regard to football fixture lists, it gives perspective on how the directives are interpreted by judicial bodies. The preliminary ruling of the CJEU in this case, in addition to elaborating the definition of a database, was that:
- A fixture list for a football league constitutes a database within the meaning of Article 1(2) of Directive 96/9.
- The expression "investment in … the obtaining … of the contents" of a database in Article 7(1) of Directive 96/9 must be understood to refer to the resources invested in seeking out existing data, not to the investment in creating the data.

Therefore, although a fixture list for a football league constitutes a database, such lists are not protected by the sui generis right, as the court ruled: "In the context of drawing up a fixture list for the purpose of organising football league fixtures, therefore, it does not cover the resources used to establish the dates, times and the team pairings for the various matches in the league." The court's ruling emphasises the implications of its elaboration on the definition of a database, which seems to imply a very broad applicability of the term.

3.3.2 Case Ryanair Ltd v. PR Aviation BV
The ruling and implications of Case C-444/02 were further affirmed in Case C-30/14 (2015) Ryanair Ltd v PR Aviation BV, where the CJEU issued a preliminary ruling on whether contractual restrictions can be imposed on databases not protected by copyright or sui generis rights (Case C-30/14 para 28). Ryanair had brought an action against PR Aviation for extracting flight data from its website for use in a price comparison service, in breach of Ryanair's terms of service, which prohibited such commercial "screen scraping". After mixed rulings in the Dutch courts, the Dutch Supreme Court referred the question to the CJEU.
The Court held that Directive 96/9 does not apply to databases lacking copyright or sui generis protection. Consequently, Articles 6(1), 8, and 15 do not prevent the creators of such databases from imposing contractual restrictions on third-party use. This led to the paradoxical outcome that unprotected databases may offer their creators more contractual freedom than protected ones, allowing Ryanair to enforce its terms despite lacking legislative protection for its database.

In paragraph 33 of Case C-30/14, the Court emphasized that Article 1(2) of Directive 96/9 gives a broad definition of "database", unconstrained by technical or formal considerations, and cited the judgment in Case C-444/02 in support of this interpretation. In essence, Case C-30/14 further established the interpretation that the definition of a database is in fact very wide, but only for determining what Directive 96/9 might potentially cover – not what it actually protects.

4 Access Control Mechanisms
To mitigate automated access and data sourcing, websites implement various mechanisms that can be categorized as either administrative or technical. Administrative access control refers to non-technical mechanisms that establish boundaries through policy-based approaches rather than technological barriers. These mechanisms rely on voluntary compliance but may carry legal implications despite lacking technical enforcement. In contrast, technical access controls implement actual barriers that must be overcome for scraping to occur.

This chapter examines both types, beginning with administrative measures, namely the Robots Exclusion Protocol (REP) and Terms of Service (ToS), followed by technical implementations, with CAPTCHA systems and IP-based controls as examples.
4.1 Robots Exclusion Protocol
Robots.txt is a fundamental implementation of REP, which represents a standardized methodology for website administrators to communicate their preferences regarding automated client access to their web resources (Koster et al., 2022, p. 1). The protocol, initially conceptualized in 1994, has evolved into a de facto standard for the advisory regulation of web scraping.

The architecture of robots.txt comprises two primary structures:
1. User-Agent Specification: The protocol employs User-Agent directives to identify specific crawlers or crawler groups. As detailed in RFC 9309 (Koster et al., 2022, pp. 4–5), these identifiers must conform to precise formatting requirements:
a. Case-insensitive matching for crawler identification.
b. Permissible characters limited to letters (a-z, A-Z), underscores (_), and hyphens (-).
c. Universal wildcard (*) functionality for comprehensive crawler designation.
2. Access Control Directives: The protocol implements two primary directive types:
a. Allow: Explicitly permits access to specified Uniform Resource Identifier (URI) paths.
b. Disallow: Restricts crawler access to designated URI paths.

These directives operate within a hierarchical precedence framework, where rule specificity determines application priority (Koster et al., 2022, p. 6). The following example demonstrates the standard syntax for access control implementation, illustrating the hierarchical relationship between general and specific rules (Koster et al., 2022, p. 10):

User-Agent: *
Allow: /example/page/
Disallow: /example/page/disallowed.gif

In this example, the initial User-Agent specification indicates universal applicability of the following directives to all automated clients. The access control directives establish a hierarchical relationship, where Allow: /example/page/ gives baseline permission for URI paths beginning with the specified prefix.
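To make the precedence framework concrete, the rules of this example can be evaluated with a short Python sketch. The snippet below is purely illustrative and not a complete robots.txt parser: the two rules are hard-coded, User-Agent matching is omitted, and the function name is_allowed is introduced here only for demonstration. It applies the longest-match rule of RFC 9309, under which the most specific matching rule wins and ties are resolved in favour of Allow.

```python
# Illustrative sketch of RFC 9309 rule precedence (not a complete robots.txt parser).
# The two rules below correspond to the example discussed in the text.
RULES = [
    ("allow", "/example/page/"),
    ("disallow", "/example/page/disallowed.gif"),
]

def is_allowed(path: str, rules=RULES) -> bool:
    """Return True if the path may be crawled under the given rules."""
    # Collect every rule whose path is a prefix of the requested path.
    matches = [(len(rule_path), kind) for kind, rule_path in rules
               if path.startswith(rule_path)]
    if not matches:
        return True  # No matching rule: access is allowed by default.
    # Longest (most specific) match wins; on equal length, "allow" wins.
    matches.sort(key=lambda m: (m[0], m[1] == "allow"), reverse=True)
    return matches[0][1] == "allow"

print(is_allowed("/example/page/index.html"))      # -> True (general Allow rule applies)
print(is_allowed("/example/page/disallowed.gif"))  # -> False (more specific Disallow wins)
```

For consuming real robots.txt files, Python's standard library also provides the urllib.robotparser module.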
Disallow: /example/page/disallowed.gif implements a specific restriction for a single resource. While robots.txt is the de facto standard, it is not in any way a security measure for websites, as complying with it is entirely voluntary; as Koster (1994) states in his initial definition, it is not enforced at all, nor can any guarantee be given that any current or future robot will respect it.

4.2 Terms of Service

Website Terms of Use constitute contractual documents that establish the conditions under which website resources may be accessed and used. While these terms, like the robots.txt file, operate on a voluntary basis, they carry distinctive legal implications, as the website owner can accuse the user of breach of contract. According to Dreyer and Stockton (2013), a viable way to prevent scraping is to include prohibitions against it in the website’s terms of use, but the implementation of the user agreement determines how legally binding the terms of use are, i.e. whether the owner of the website can accuse the user of breach of contract.

Generally, the implementation of the user agreement is done through either a clickwrap or a browsewrap agreement (Dreyer & Stockton, 2013). A clickwrap agreement consists of a pop-up notification that presents the user with buttons to agree or disagree with the terms of service, which must be available to the user before making the decision (Dreyer & Stockton, 2013; Krotov & Johnson, 2023; Specht v. Netscape Communications Corp., 2001; Zynda, 2004). A clickwrap agreement thus ensures that to access the website the user must agree to the terms of service.

Browsewrap is a non-intrusive way to constitute a contractual agreement over the terms of service, as it does not restrict the use of the website in any way. It is usually implemented as a hyperlink to a separate web page containing the terms of service (Dreyer & Stockton, 2013; Krotov & Johnson, 2023; Specht v. Netscape Communications Corp., 2001; Zynda, 2004).
The difference from clickwrap is that the user is not required to agree to the terms, or even view them, to access the website.

4.3 CAPTCHA

The acronym CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart, first introduced in the early 2000s by researchers at Carnegie Mellon University (von Ahn et al., 2000). CAPTCHAs are challenge-response tests that protect websites by distinguishing between humans and automated programs (Shi et al., 2022; von Ahn et al., 2000), and they have become a central security technique for restricting automated access to website content (Shi et al., 2022).

CAPTCHAs represent a variation of the Turing test (Turing, 1950), where rather than a human judge attempting to identify a computer, a computer system now generates tests that humans can pass but current computer programs cannot (von Ahn et al., 2004, p. 58). Paradoxically, a CAPTCHA is a program that generates challenges that other computer programs cannot solve. Unlike the purely conversational original Turing test, CAPTCHAs can be based on a variety of human sensory and cognitive abilities (von Ahn et al., 2004).

Modern research recognises text-, image-, audio-, video-, human cognition-, and moving object-based CAPTCHAs (Saha et al., 2015; Shi et al., 2022; Xu et al., 2012). Despite numerous variations, most deployed CAPTCHAs are hidden from users, collecting behavioural and environmental data to classify or score them (BuiltWith®, 2025; Jin et al., 2023). This chapter focuses on text-, image-, and behaviour-based CAPTCHAs, as the first two are often used alongside the latter. A brief overview of third-party CAPTCHA provision is also included.

4.3.1 Text-based CAPTCHAs

Text-based CAPTCHAs were among the first implementations of a CAPTCHA, with GIMPY establishing early standards for the approach. GIMPY challenged users to recognize three out of seven distorted dictionary words presented in an image (von Ahn et al., 2004, p. 59), as illustrated in Figure 3.
Figure 3 Example of a GIMPY CAPTCHA (von Ahn et al., 2004, p. 58).

Since then, text-based CAPTCHAs have evolved significantly, with the first major evolution being reCAPTCHA v1, which used a two-word system (von Ahn et al., 2008) shown in Figure 4. This system presented users with two words—one known to the system and one unknown—requiring users to type both correctly to pass. The unknown words came from OCR (Optical Character Recognition) systems that had failed to confidently digitize text from books and printed materials. The approach served a dual purpose: protecting websites while simultaneously crowdsourcing the verification of text that automated OCR systems struggled to recognize due to poor print quality, unusual fonts, or damaged source materials.

The evolution of text-based CAPTCHAs has been extensive, with Guerar et al. (2022, p. 5) providing a comprehensive taxonomy (see Appendix 1). More recent innovations include DotCHA (Suzi Kim & Sunghee Choi, 2019), a 3D scatter-type CAPTCHA where users must identify letters composed of small spheres by rotating a 3D model (example shown in Figure 5). To succeed, the user must rotate the 3D model multiple times, as each letter is twisted around a horizontal axis, ensuring that not all of them are readable from the same rotation angle.

Figure 4 Two-word system used in reCAPTCHA v1; it also includes an audio-based option for the visually impaired (von Ahn et al., 2000).

Figure 5 Example of a DotCHA CAPTCHA, where each letter must be individually identified (Suzi Kim & Sunghee Choi, 2019).

4.3.2 Image-based CAPTCHAs

Already in 2014, Google revealed that 99.8 % of distorted text variants could be solved using AI technology (Shet, 2014a), and AI capabilities have only grown since then. This has resulted in a shift of focus from text-based to image-based designs, on the assumption that visual challenges are harder for computers than character recognition (Dinh & Hoang, 2023).
Generally, the challenge includes written text describing a task that requires visual recognition capabilities, e.g. completing an image classification task (Guerar et al., 2022, p. 7). The interaction with the challenge varies depending on the design, which Guerar et al. (2022, p. 8-10) have classified into six different types (see Appendix 2).

Google's reCAPTCHA is the predominant technology for website protection, accounting for approximately 89% of all identified CAPTCHA implementations (BuiltWith®, 2025). As a result, selection-based designs are the most used image-based CAPTCHAs, as this is the design that Google’s No CAPTCHA reCAPTCHA utilises alongside behaviour analysis (Shet, 2014b). If the risk analysis cannot confidently predict that the user is human, it will prompt the user with a visual recognition task. The first implementation (shown in Figure 6) presented the user with a sample image and nine candidate images, from which the user then had to select the ones similar to the sample (Guerar et al., 2022, p. 11).

Figure 6 First implementation of selection-based CAPTCHA in No CAPTCHA reCAPTCHA (Shet, 2014b).

Since then, the implementation has shifted focus from recognising similar images to label recognition. In this version the CAPTCHA prompts the user with a label and nine candidate images, from which the user must select the ones matching the label. This implementation has some variations, which are illustrated in Figure 7. One of the key differences between these implementations is that the modern version replaces the selected images with new ones.

Figure 7 Different variations of a selection-based CAPTCHA (NopeCHA, 2025).

4.3.3 Behaviour-based CAPTCHAs

Because reCAPTCHA and hCaptcha are the two predominant CAPTCHA providers for websites, with reCAPTCHA covering 89% and hCaptcha 7% of CAPTCHA usage (see BuiltWith®, 2025), the usage of CAPTCHAs other than behaviour-based ones on the modern internet is quite limited.
Google’s reCAPTCHA has had multiple iterations throughout its lifetime, but in recent years the focus has shifted from traditional designs to a behaviour-based design. The second iteration, No CAPTCHA reCAPTCHA (later known as reCAPTCHA v2), required the user to interact with a checkbox to invoke the risk analysis, which then decided whether further security measures were needed (see Chapter 4.3.2). Alongside this, a different variation was rolled out, known as reCAPTCHA v2 Invisible, which does not require a checkbox to be presented to the user (Jin et al., 2023). In this variation the risk analysis is either invoked programmatically or bound to a clickable element on the website. The latest version, reCAPTCHA v3, differs from the previous versions in that rather than simply classifying the user as human or computer, it gives the user a score (Google, 2024b).

The only other ‘big player’ in the CAPTCHA market is hCaptcha, which is very similar to reCAPTCHA. The main difference is that hCaptcha addresses some of the privacy concerns of reCAPTCHA (Cloudflare, 2020b). Another notable difference is that hCaptcha is the only provider allowing the website administrator to manually set the difficulty of passing the CAPTCHA (Jin et al., 2023).

4.3.4 Integrating third-party CAPTCHA systems

Jin et al. (2023, p. 4) suggest that most websites use third-party CAPTCHA providers instead of self-developed CAPTCHAs. Usage distribution data of CAPTCHAs from BuiltWith® (2025) supports this suggestion, and the scope of this thesis is therefore limited to the integration framework of third-party-provided CAPTCHAs. Specifically, the focus will be on a general framework introduced by Jin et al. (2023) and the request flows of the two largest CAPTCHA providers – reCAPTCHA and hCaptcha.

The framework presented by Jin et al. (2023, p. 4-5) describes how third-party CAPTCHA systems work as follows: 1.
The website registers with the CAPTCHA provider to obtain public and private keys.
2. The user visits a webpage containing a CAPTCHA.
3. A user action triggers a request to load the CAPTCHA content; the public site key is used to invoke the CAPTCHA.
4. The CAPTCHA content loads from the third-party provider.
5. The user solves the CAPTCHA.
6. The provider validates the solution and generates a token.
7. The token is sent to the website server via the user’s client.
8. The website server verifies the token with the provider using its private key.

This sequence is also illustrated in Figure 8 to further clarify the communication between systems. Step one of the sequence is assumed and not shown in the figure.

Figure 8 General framework for third-party CAPTCHAs (modified from Jin et al., 2023, p. 5).

While both reCAPTCHA and hCaptcha follow the general third-party CAPTCHA framework, they implement it with distinct operational nuances. Google's reCAPTCHA has evolved through multiple versions, each with significant changes to how the framework is implemented.

Figure 9 reCAPTCHA request flow diagram showing signal collection and reCAPTCHA v3 scoring system (Pathum, 2023).

As illustrated in Figure 9, reCAPTCHA's approach has shifted dramatically from early versions to the latest implementations. reCAPTCHA v1 used distorted text challenges (see Chapter 4.3.1) but was discontinued in 2018 (Google, 2024a) as AI reached 99.8% accuracy in solving them (Shet, 2014a). reCAPTCHA v2 introduced the familiar "I'm not a robot" checkbox that silently analyses mouse movements; humans move their cursors in "wiggly and imperfect ways" compared to bots (Pathum, 2023). The invisible variant of v2 eliminated the checkbox requirement by binding challenges to existing buttons.
Most significantly, reCAPTCHA v3 transforms the framework by removing explicit challenges entirely, instead analysing user behaviour in the background and providing a score between 0 and 1 to indicate human likelihood, fundamentally changing steps 3-5 of the general framework.

In contrast, hCaptcha adheres more closely to the general framework, as it maintains a more traditional challenge-response model rather than invisible behavioural analysis. The hCaptcha system distinctly separates its Client API from its Siteverify component, with the Client API handling challenge delivery and the Siteverify service managing server-side verification (see Figure 10).

Figure 10 hCaptcha request flow diagram showing passcode generation and verification process (hCaptcha, 2025).

When a user encounters an hCaptcha widget, they receive a challenge or puzzle that, once solved, generates what hCaptcha terms a "passcode" rather than a token (hCaptcha, 2025). This passcode can be delivered through multiple technical pathways – either embedded directly within HTML forms or returned via JavaScript callbacks – providing developers with integration options suited to different application architectures. One notable technical distinction is hCaptcha's explicit support for XHR (XMLHttpRequest) submissions alongside traditional form submissions (hCaptcha, 2025), allowing for more dynamic frontend implementations while maintaining the security principles of the general framework.

Both service providers retain the critical security architecture of the general framework – keeping verification keys private on the server side – while adapting the user interaction flow to their specific technical ecosystems. The key difference between these examples is that the reCAPTCHA variations illustrated in Figure 9 invisibly assess and score the user, while the highlighted version of hCaptcha utilises a traditional challenge-response design.
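The security principle behind steps 6-8 of the general framework – an unforgeable token minted by the provider and checked against a server-held secret – can be sketched offline in Python. Everything below is a simplification for illustration: sign_token and verify_token are hypothetical helpers, real providers expose an HTTP verification endpoint rather than a local function, and actual token formats are opaque.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"server-side-private-key"  # hypothetical stand-in for the provider-held secret


def sign_token(challenge_id: str, issued_at: float) -> str:
    """Step 6 (sketch): the provider signs a solved challenge into a token."""
    payload = f"{challenge_id}:{issued_at}"
    mac = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{mac}"


def verify_token(token: str, max_age_s: float = 120.0) -> bool:
    """Step 8 (sketch): server-side check that the token is genuine and fresh."""
    try:
        challenge_id, issued_at, mac = token.rsplit(":", 2)
    except ValueError:
        return False                                    # malformed token
    expected = hmac.new(SECRET_KEY, f"{challenge_id}:{issued_at}".encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return False                                    # forged or tampered token
    return time.time() - float(issued_at) <= max_age_s  # stale tokens are rejected


token = sign_token("challenge-42", time.time())
print(verify_token(token))        # True: fresh, untampered token
print(verify_token(token + "x"))  # False: any tampering breaks the MAC
```

In a real deployment the local verify_token call is replaced by a server-to-server request to the provider's verification endpoint (the Siteverify service in both reCAPTCHA and hCaptcha), passing the private key and the token received from the client.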
Both providers have multiple variations of CAPTCHA systems available to customers, but for example hCaptcha’s No-CAPTCHA mode is only available to their enterprise service customers (hCaptcha, 2025).

4.4 IP-based access control

In addition to application-level mechanisms such as the ones discussed in the previous sections, websites can also enforce restrictions at the network level. Two commonly employed measures in this category are IP address blocking and IP-based rate limiting, both of which control access based on the user’s IP address.

These mechanisms require monitoring server access logs, which can be done either manually or through various firewall applications (Gheorghe et al., 2018, p. 68). Every device connected to the internet is assigned a unique IP address, which allows it to communicate with other devices across networks regardless of physical location (Jyväskylän Yliopisto & Maanpuolustuskoulutusyhdistys, 2024). This address serves as a key identifier that network-level controls can act upon.

One of the most widely used implementations of IP-based access control is the use of access-control lists (ACLs) within network firewalls. These lists define specific IP addresses or ranges that are either allowed or denied access to particular services. A commonly recommended best practice is to adopt a “deny by default” posture, allowing only explicitly permitted IPs to access the system (Scarfone & Hoffman, 2009, p. 4-1). This approach ensures that unauthorized or unexpected traffic is blocked unless a rule explicitly allows it. In practical terms, IP-based access control operates by matching packet header fields against static or dynamic allowlists and blocklists.

While IP blocking is a binary control – either permitting or denying traffic – rate limiting introduces a more nuanced approach by regulating the frequency or volume of requests from each IP address.
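The deny-by-default ACL matching described above can be sketched with Python's standard ipaddress module. The rule set below is invented for illustration and uses reserved documentation address ranges, not any real deployment's rules.

```python
import ipaddress

# Hypothetical allowlist: only these source networks may reach the service.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),      # documentation range standing in for an office LAN
    ipaddress.ip_network("198.51.100.17/32"),  # a single explicitly permitted host
]


def is_allowed(client_ip: str) -> bool:
    """Deny by default: traffic passes only if an explicit rule matches."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


print(is_allowed("192.0.2.55"))   # True: inside an allowed range
print(is_allowed("203.0.113.9"))  # False: no matching rule, denied by default
```

Rate limiting, by contrast, moves beyond this binary allow-or-deny decision.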
Rate limiting is particularly effective in mitigating abusive or excessive behaviour, such as denial-of-service attempts or automated scraping (42crunch, 2019; Cloudflare, 2020a; Radware, 2023). A widely adopted mechanism for implementing rate limiting is the token bucket algorithm (Raghavan et al., 2007, p. 339), which implements rate limiting through a “metaphorical bucket of tokens that refills at a constant rate” (Manoharan, 2024, p. 1791).

Figure 11 Dynamic model for token bucket algorithm (Ahmed et al., 2002, p. 267).

A practical implementation of this concept is the dynamic model proposed by Ahmed et al. (2002, p. 267), shown above. The notations used in the figure are as follows:
- u(tk): the number of tokens added during the time interval [tk-1, tk] (i.e. the refill rate).
- V(tk): the size of the packet arriving at time tk.
- G(tk): conforming traffic.
- R(tk): non-conforming traffic.

The logic of this model can be simplified to the following sequence:
1. A packet of size V(tk) arrives at the token bucket.
2. If the bucket contains enough tokens, the packet is conforming, and the equivalent number of tokens is removed.
3. If the bucket lacks enough tokens, the packet is non-conforming:
– It may be delayed until tokens are available (then reclassified as conforming), or
– It may be rejected outright, depending on the system's policy.

In essence, the token bucket algorithm allows traffic to pass freely as long as sufficient tokens are available, thus supporting short bursts of activity while limiting the long-term average rate (Manoharan, 2024, p. 1791). Once the bucket is empty, the system enforces limits by slowing or dropping incoming requests. Traditionally, a token bucket is implemented at a single choke point such as a firewall or web server.
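The conforming/non-conforming sequence above can be sketched as a minimal token bucket in Python. The class below is an illustrative reduction rather than the dynamic model of Ahmed et al.: the rate and capacity values are arbitrary, and the non-conforming branch simply rejects instead of optionally delaying the packet.

```python
import time


class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity           # the bucket starts full, so short bursts pass
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost          # conforming: consume tokens and pass
            return True
        return False                     # non-conforming: reject (or delay, per policy)


bucket = TokenBucket(rate=1.0, capacity=5.0)
burst = [bucket.allow() for _ in range(7)]
print(burst)  # the first five requests conform; the rest are rate limited
```

Because the bucket starts full, a burst of up to `capacity` requests passes immediately, while the sustained rate converges to `rate` requests per second – the burst-tolerant behaviour described by Manoharan (2024).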
5 Literature review

The preceding chapters have established the foundational elements of web scraping (Chapter 2), the relevant European Union legislative frameworks (Chapter 3), and the technical implementation of various access control mechanisms (Chapter 4). Building upon this foundation, this chapter synthesises academic perspectives to address the central research questions posed in Chapter 1:
- What are the primary legal challenges associated with web scraping under current legislative frameworks and judicial interpretations, and how do these challenges affect the permissibility of automated data collection?
- What are the general techniques used to control automated access to website data, and what legal implications arise from circumventing them?

While the previous chapters provided descriptive analyses of legislation and technical implementations in isolation, this literature review examines their interrelationship through the lens of scholarly discourse. Rather than organising this synthesis by individual sources, the chapter employs a thematic approach, examining how various legal frameworks interact with technical controls to create a complex regulatory landscape for web scraping activities.

The first section explores the legal challenges identified in existing literature, categorising and evaluating scholarly perspectives on issues such as the legal status of web scraping, unauthorised access, the enforceability of Terms of Service, and legal protections for website content. The second section examines how the literature addresses the access control mechanisms discussed in Chapter 4, with particular emphasis on their legal implications and effectiveness. This thematic structure facilitates a comprehensive understanding of the regulatory environment governing web scraping activities across multiple jurisdictions, with particular attention to the European context established in Chapter 3.
5.1 Legal Challenges acknowledged in literature

5.1.1 Legal status of Web Scraping

Recent literature reveals a complex and evolving legal landscape for web scraping. As Possler (2019, p. 3903) highlights, the legal framework is often unspecified, which characterises it as a “grey area”. Fontana (2025, p. 205) came to the same conclusion after examining the jurisprudence and legal doctrines of web scraping. In his research he found that “there is no specific legislation in the United States, Europe or Asia that explicitly prohibits web scraping.” This has resulted in the fragmentation of legal approaches across jurisdictions, which is widely acknowledged in literature (Brown et al., 2024; Fontana, 2025; Krotov & Johnson, 2023; Pagallo & Ciani Sciolla, 2023).

Krotov & Johnson (2023, p. 11) suggest that the legal status of web scraping activities should be considered on a case-by-case basis, which is also supported by Brown et al. (2024, p. 18). The fragmentation of legal approaches is highlighted by Pagallo & Ciani Sciolla (2023, p. 11-13), as they imply an inconsistency in judicial interpretations of Database Protection laws across EU member states, citing contradictory Ryanair cases:
- The Spanish Supreme Court ruled that Ryanair’s system was software generating information, not a database.
- The Italian Supreme Court and the Regional Court of Hamburg have declared that Ryanair does in fact have a database under Article 1(2).

As Possler (2019, p. 3903) concludes: “these legal uncertainties render ethical considerations on the part of researchers all the more relevant for scraping”. The literature search conducted in this thesis confirms this relevance, as multiple academic papers addressing the ethical considerations of web scraping were found (e.g.
Brown et al., 2024; Gold & Latonero, 2017; Krotov & Johnson, 2023; Logos et al., 2023; Luscombe et al., 2022; Pagallo & Ciani Sciolla, 2023; Paige, 2024; Thelwall & Stuart, 2006; Xiao, n.d.).

5.1.2 Unauthorised access

Literature addresses multiple perspectives on whether scraping constitutes unlawful access. Brown et al. (2024, p. 11) suggest that “scraping that involves breaking into online spaces that are not otherwise available to the public will create higher legal risks than scraping only publicly accessible spaces.” Breaking into online spaces can be interpreted as not respecting the access control mechanisms employed by the website in question. This interpretation seems valid, as Fontana (2025, p. 203) considers the presence of either a ToS or a robots.txt file potentially enough to exclude the TDM exceptions covered in Chapter 3.2. He also addresses how the cited case law suggests that even the presence of a protective measure against scraping could be sufficient to deem the scraping of a website unauthorised. This would consequently imply that circumventing, programmatically solving, or otherwise bypassing a CAPTCHA system could be regarded as unauthorised access to the website content.

Krotov & Johnson (2023, p. 10-11) highlight how the Digital Services Act (2022) of the European Union requires “certain ‘very large online platforms’ and ‘very large search engines’ to” allow researchers to access publicly available data – potentially via scraping. They also highlight that the DSA might allow “vetted researchers” to access the private data of these online entities.

5.1.3 Enforceability of Terms of Service

Literature identifies ToS agreements as a central mechanism for establishing protection against unwanted web scraping (Brown et al., 2024; Fontana, 2025; Krotov & Johnson, 2023; Logos et al., 2023).
At the same time, it is acknowledged in literature that certain contractual agreements over ToS are more likely to be legally binding than others, specifically highlighting the distinction between the legal enforceability of clickwrap versus browsewrap agreements.

Dreyer & Stockton (2013) argue, through references to court cases, that clickwrap agreements are generally legally enforceable, as they require the user to take notice of the terms of use through a pop-up text box that requires agreement before proceeding to the website. On the other hand, browsewrap agreements do not require the user to read the terms of use, as the website usually gives notice of them only by placing a hyperlink on the page, which makes their enforceability more subjective. Tarra Zynda (2004, p. 507) points out, in an article about the case Ticketmaster Corp. v. Tickets.com, Inc., that “so far, courts have held browsewrap agreements enforceable if the website provides sufficient notice of the license”. The article also brings up that the few courts that have examined the validity of browsewrap agreements have established criteria for sufficient notice, requiring the terms of use to be on the landing page of the website, visible without scrolling to the bottom of the page, and presented clearly as a hyperlink. Frolova and Berman (2024) continue from this and present a list of concrete recommendations for improving the legal enforceability of ToS agreements:
1. Visibility and Clarity:
a. Terms must be placed in a conspicuous location on the website or app (e.g., login or checkout page).
2. Unambiguous Consent Mechanism:
a. The way users accept the terms must be unmistakable, such as a checkbox or clearly labelled action button.
3. Explicit Statement of Legal Effect:
a. The agreement should state that by accepting the terms, the user becomes a party to a legally binding agreement.
4. Consent at Registration:
a.
If site use requires registration, consent should be obtained at the registration stage before the account is created.
5. Notification of Changes:
a. If any terms are changed – even minor ones – each user should be informed of the changes explicitly.

Even though the enforceability of ToS can be strengthened by properly implemented contractual agreements, Fontana (2025, p. 205) highlights that it is a fragile instrument, as enforcing ToS can be “difficult, costly and sometimes only ensures a low likelihood of success.”

5.1.4 Legal Protection of Websites and their Content

In addition to the contractual protections offered by properly implemented Terms of Service, website content may also benefit from legal protection under European Union law. As discussed in Chapter 3, where a website qualifies as a database, its content can be shielded by multiple layers of protection:
- Protection of the website as a database through copyright and/or the sui generis right.
- Protection of individual website content through copyright.

This layered protection framework is particularly relevant when addressing web scraping, which may infringe copyright or database rights depending on the nature of the website and its contents.

5.1.4.1 Copyright infringement

Scraping can lead to copyright infringement in two ways: first, if the target website is protected as a copyrighted database; and second, if the individual content on the site is itself copyrighted. Krotov et al. (2020) highlight the role of the robots.txt file in the copyright status of website content. They argue, through relevant case law, that the failure to include a robots.txt file with sufficient instructions could create an implied license to use website data. On the other hand, Pagallo & Ciani Sciolla (2023, p. 10-11) place more emphasis on how scraping copyrighted content may be lawful under a set of exceptions, but highlight the pessimism related to this.
They note how some interpret current legislation so that lawful scraping in Europe is only possible for research or cultural organisations for research purposes. This perspective is challenging, as interpretations of the law suggest that commercial uses face stricter scrutiny (see e.g. Chapter 3.2), yet as Pagallo & Ciani Sciolla (2023, p. 10) also point out, there are cases where scraping has been considered lawful.

5.1.4.2 Sui generis infringement

Alongside copyright, databases are protected by the sui generis right (see Chapter 3.1). As Pagallo & Ciani Sciolla (2023, p. 11) note, there are two distinct legal thresholds that trigger the sui generis protection for a database: the substantiality of the investment and the substantiality of the extraction, both of which are evaluated quantitatively and/or qualitatively.

Across different authors (Fontana, 2025; Oesch et al., 2017; Pagallo & Ciani Sciolla, 2023) the legal concept of substantial investment is acknowledged to be very ambiguous, with Oesch et al. describing it as very unclear, and Pagallo & Ciani Sciolla highlighting how there are different interpretations across EU national courts. Fontana also cites relevant court cases as examples of different reasonings for excluding the sui generis right. Despite the ambiguity of interpretations, Oesch et al. (2017, p. 74-77, 112-118) synthesise from CJEU cases that the concept of substantial investment should be understood as specifically targeting the formation and compilation of the database itself, rather than the creation of the original data contained within it. In all cited cases the rulings of the CJEU were similar, along the lines of “the investment aimed at the contents of the database can’t be taken into consideration when evaluating the substantiality of the investment”. As a result, only investments targeted at the collection, verification, or presentation of the contents are considered.
The second threshold of sui generis protection – the substantiality of the extraction – is also ambiguous. Pagallo & Ciani Sciolla (2023, p. 12-13) suggest that the substantiality of the extraction must be determined by courts on a case-by-case basis. Their interpretation of legal perspectives is that the extraction of limited or partial data is considered admissible in any case, but that so-called ‘diachronic extraction’ should be considered based on how it affects the target. As an example of unjust harm caused by diachronic extraction, they present the case QVC Inc. v. Resultly, in which the extraction was considered substantial as it crashed the target website for two days.

5.2 Technical Access Control Mechanisms

5.2.1 Ethical implications of robots.txt

While not designed as a security measure for websites against scraping (see Koster, 1994), the increased adoption rate of REP (Chang & He, 2025, p. 1124) has resulted in the perception that honouring robots.txt is a matter of ethical web behaviour, establishing an informal but widely respected norm against unauthorized or aggressive data collection (see e.g. Chang & He, 2025, pp. 4, 13; Gold & Latonero, 2017, p. 281; HTTPProxyOkeyProxy, 2024; Krotov et al., 2020).

5.2.2 Current Challenges of CAPTCHA systems

Existing literature identifies text- and image-based CAPTCHAs as the most widely adopted designs (Gutub & Kheshaifaty, 2023; Saha et al., 2015; Shi et al., 2022; Xu et al., 2012). However, studies also show that the traditional CAPTCHA models might not be sufficient to combat the rapid evolution of AI (e.g. Guerar et al., 2022; Plesner et al., 2024). In response, leading third-party providers such as reCAPTCHA transitioned towards behaviour-based CAPTCHA systems (see Chapter 4.3.3); however, these solutions continue to lag behind the evolving capabilities of CAPTCHA solvers.
This claim is supported by both academic analyses and the development of solver technologies, as relevant studies and publications have demonstrated the consistent success of solvers in overcoming most CAPTCHA systems (e.g. Jin et al., 2023; Plesner et al., 2024; Sivakorn et al., 2016; xHossein, 2021). The security test conducted by Jin et al. (2023, p. 15) on four third-party CAPTCHA providers – Google reCAPTCHA, Geetest, Arkose Labs, and hCaptcha – revealed worrying results against both human-solver relay attacks and automated attacks. A 20-year survey of CAPTCHA technologies by Guerar et al. (2022, pp. 25–26) concluded that, while none of the popular conventional and behaviour-based designs had been extensively broken at the time, emerging solver capabilities posed a significant threat. But even the ones mentioned as not yet broken (e.g. reCAPTCHA v2 Invisible) have since been broken, thus proving their prediction that:

Invisible reCAPTCHA and other academic proposals have not been broken yet, however with the advent of the fourth-generation bots that rotate through thousands of different IP addresses and mimic accurately the human behaviour, it would be difficult to design a secure CAPTCHA based solely on the user behaviour data that can be gathered in a normal (i.e., with no additional sensors or special hardware) environment.

Although sensor- and behaviour-based CAPTCHA systems offer promising avenues to counter solver advancements, they introduce significant challenges, particularly regarding user privacy, that require careful consideration. Both Dinh and Hoang (2023) and Guerar et al. (2022) emphasize the privacy concerns raised by these CAPTCHA designs, particularly regarding the behavioural and sensor data of users, as well as the extraction of demographic attributes.
5.2.3 Role of IP-based access control in preventing web scraping

Web scraping literature consistently identifies IP-based access control mechanisms as significant technical barriers for researchers and developers. Luscombe et al. (2022, p. 1034) include both IP blocking and the rate limiting of IP requests in their comprehensive inventory of defensive strategies employed to prevent automated data extraction. This is corroborated by other researchers, with Shelar (2024, p. 1636) and Tabaku (2021, p. 2) similarly acknowledging these mechanisms as established methods that website administrators deploy to restrict scraping activities.

Luscombe et al. (2022, p. 1035) also address technical countermeasures, detailing workarounds for IP-based access control mechanisms, such as randomising the IP address used in order to mask the scraping activity as multiple unique users accessing the website. Significantly, they do not treat these workarounds as mere implementation details but place them within a broader legal and ethical framework, emphasising that circumventing intentionally placed access control mechanisms requires navigating legal ambiguities and ethical considerations. The legal implications of utilising such a workaround are similar to those of ignoring the contents of robots.txt and/or the ToS, or of circumventing a CAPTCHA (see Chapter 5.1.2), in that it could constitute unauthorised access, thus rendering the scraping activity unlawful.

6 Conclusions

This literature review has examined the complex intersection of legal frameworks and technical access control mechanisms related to web scraping. The analysis reveals a fragmented legal landscape where interpretation varies across jurisdictions, creating significant uncertainty for researchers, businesses, and website operators. The ambiguity of the relevant law has resulted in uncertainty regarding the legal implications of circumventing both administrative and technical access control mechanisms.
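The request rate limiting identified in Section 5.2.3 is, in server-side practice, commonly realised with a token-bucket scheme of the kind analysed by Ahmed et al. (2002). The sketch below is purely illustrative and not drawn from any cited implementation; the class name, parameter values, and example IP address are invented:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: each client gets `capacity`
    request tokens that refill at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the previous request.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client IP: bursts beyond the capacity are rejected.
buckets: dict[str, TokenBucket] = {}

def handle_request(ip: str) -> bool:
    bucket = buckets.setdefault(ip, TokenBucket(rate=1.0, capacity=3))
    return bucket.allow()

results = [handle_request("203.0.113.7") for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

Unlike outright IP blocking, which rejects every request from a listed address, this scheme tolerates normal browsing rates while rejecting the rapid bursts typical of automated extraction, which is why the literature treats the two as distinct defensive strategies.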
6.1 Summary of Key Findings and Research Contribution

The research identified two key dimensions that define the regulatory environment for web scraping: legal challenges and access control mechanisms. These dimensions are summarised in Tables 1 and 2. This research has highlighted the interrelationship between legal frameworks and technical implementations. The effectiveness of technical measures is often enhanced by how courts interpret attempts to circumvent them, while the legal enforceability of administrative measures such as Terms of Service depends on their technical implementation (e.g., clickwrap vs. browsewrap).

Table 1. Key Legal Challenges identified in Web Scraping.

Legal Challenge: Ambiguous Legal Status
Description: Web scraping exists in a "grey area" with no specific legislation explicitly regulating it in major jurisdictions.
Key Implications: Case-by-case assessment of lawfulness is required. Ethical considerations are especially relevant.

Legal Challenge: Database Protection
Description: The EU's sui generis right protects databases representing substantial investment, independently of copyright protection.
Key Implications: Two thresholds determine protection: substantiality of investment and substantiality of extraction.

Legal Challenge: Contract Infringement
Description: Terms of Service can establish contractual restrictions on web scraping, with varying degrees of enforceability.
Key Implications: Clickwrap agreements are generally considered more enforceable than browsewrap agreements, though enforcement remains challenging.

Legal Challenge: Unauthorised Access
Description: Circumventing access control mechanisms could potentially constitute unauthorized access under various legal frameworks.
Key Implications: Breaking into protected spaces creates significantly higher legal risks than scraping publicly accessible content.

Legal Challenge: Copyright Infringement
Description: Web scraping may infringe copyright if it involves copying protected content or database structures.
Key Implications: TDM exceptions may apply but are limited, particularly for commercial purposes, and can be overridden by technical protection measures.

Table 2. Key Access Control Mechanisms for preventing Web Scraping.

Mechanism: Robots Exclusion Protocol
Type: Administrative
General Implementation Method(s): A robots.txt file specifying allowed/disallowed paths and user-agents.
Legal/Ethical Implications: Honouring it is widely considered part of ethical web behaviour; ignoring it may influence legal interpretations.

Mechanism: Terms of Service
Type: Administrative
General Implementation Method(s): Contractual agreements; most commonly either clickwrap or browsewrap.
Legal/Ethical Implications: Ignoring them may constitute a breach of contract; clickwrap agreements are more legally enforceable than browsewrap.

Mechanism: CAPTCHA systems
Type: Technical
General Implementation Method(s): Challenge-response tests and/or invisible behavioural analysis.
Legal/Ethical Implications: Circumvention may constitute unauthorized access; behavioural data collection raises privacy concerns.

Mechanism: IP-based Access Controls
Type: Technical
General Implementation Method(s): IP blocking and request rate limiting.
Legal/Ethical Implications: Workarounds may be viewed as unauthorised access.

6.2 Future Considerations

The regulatory landscape for web scraping continues to evolve, with several areas warranting further research and attention:

- Implications of the EU Data Act (2024): Set to apply from September 2025, the Act may increase legal certainty around data reuse, address contractual imbalances that restrict scraping, and trigger reforms to the Database Directive, potentially altering how database protection, specifically the sui generis right, applies to scraping.
- Evolution of Technical Measures: As demonstrated by the rapid advancement of AI capabilities in overcoming CAPTCHA systems, technical protection measures are engaged in an ongoing arms race with scraping technologies. Future research should monitor how this dynamic affects the balance between data accessibility and protection.
- Ethical Frameworks: Given the legal ambiguities, the development of robust ethical frameworks for web scraping becomes increasingly important. The work of Krotov and colleagues (Krotov et al., 2020; Krotov & Johnson, 2023) could be developed further to provide a more comprehensive cross-jurisdictional framework.

References

42crunch. (2019). Rate limiting by IP address. Retrieved 1 May 2025 from https://docs.42crunch.com/latest/content/extras/protection_rate_limiting_ip.html

Ahmed, N. U., Wang, Q., & Barbosa, L. O. (2002). Systems approach to modeling the Token Bucket algorithm in computer networks. Mathematical Problems in Engineering, 8(3), 591831. https://doi.org/10.1080/10241230215282

Brown, M. A., Gruen, A., Maldoff, G., Messing, S., Sanderson, Z., & Zimmer, M. (2024). Web Scraping for Research: Legal, Ethical, Institutional, and Scientific Considerations (arXiv:2410.23432). arXiv. https://doi.org/10.48550/arXiv.2410.23432

BuiltWith®. (2025, April 21). CAPTCHA Usage Distribution on the Entire Internet. Retrieved 24 April 2025 from https://trends.builtwith.com/widgets/captcha/traffic/Entire-Internet

Case C-30/14 (15 January 2015). Ryanair Ltd v. PR Aviation BV. Retrieved 8 April 2025 from https://curia.europa.eu/juris/document/document.jsf?text=&docid=161388&pageIndex=0&doclang=EN&mode=lst&dir=&occ=first&part=1&cid=333767

Case C-444/02 (9 November 2004). Fixtures Marketing Ltd v. Organismos Prognostikon Agonon Podosfairou AE (OPAP). Retrieved 8 April 2025 from https://curia.europa.eu/juris/showPdf.jsf;jsessionid=B7FE966C88DA4389CB38F5F7B90C159A?text=&docid=49635&pageIndex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=316016

Chang, C., & He, X. (2025). The Liabilities of Robots.txt (arXiv:2503.06035). arXiv. https://doi.org/10.48550/arXiv.2503.06035

Cloudflare. (2020a). What is rate limiting? | Rate limiting and bots. Retrieved 1 June 2025 from https://www.cloudflare.com/learning/bots/what-is-rate-limiting/

Cloudflare. (2020b, April 8).
Moving from reCAPTCHA to hCaptcha. Retrieved 24 April 2025 from https://blog.cloudflare.com/moving-from-recaptcha-to-hcaptcha/

Curry, D. (2025, January 22). Travel App Revenue and Usage Statistics (2025). Business of Apps. Retrieved 3 February 2025 from https://www.businessofapps.com/data/travel-app-market/

Data Act (2024). Retrieved 1 May 2025 from https://digital-strategy.ec.europa.eu/en/policies/data-act

Digital Services Act (2022). Retrieved 26 April 2025 from https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32022R2065

Dinh, N. T., & Hoang, V. T. (2023). Recent advances of Captcha security analysis: A short literature review. Procedia Computer Science, 218, 2550–2562. https://doi.org/10.1016/j.procs.2023.01.229

Directive 96/9 of the European Parliament and of the Council of 11 March 1996 on the Legal Protection of Databases. Retrieved 28 March 2025 from http://data.europa.eu/eli/dir/1996/9/oj/eng

Directive 2019/790 of the European Parliament and of the Council of 17 April 2019 on Copyright and Related Rights in the Digital Single Market and Amending Directives 96/9 and 2001/29. Retrieved 28 March 2025 from https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32019L0790

Dreyer, A. J., & Stockton, J. (2013). A Primer for Counseling Clients. Retrieved 14 February 2025 from https://www.skadden.com/-/media/files/publications/2014/01/070071319-skadden.pdf

Fontana, G. (2025). Web scraping: Jurisprudence and legal doctrines. The Journal of World Intellectual Property, 28(1), 197–212. https://doi.org/10.1111/jwip.12331

Frolova, E. E., & Berman, A. M. (2024). Expression of the Parties' Will in Context of Digital Transformation: Current Trends in Law Enforcement. Law Journal of the Higher School of Economics, 3, 57–83. https://doi.org/10.17323/2072-8166.2024.3.57.83

Gheorghe, M., Mihai, F.-C., & Dârdal, M. (2018). Modern techniques of web scraping for data scientists.
Retrieved 28 April 2025 from https://rochi.utcluj.ro/rrioc/articole/RRIOC-11-1-Gheorghe.pdf

Glez-Peña, D., Lourenço, A., López-Fernández, H., Reboiro-Jato, M., & Fdez-Riverola, F. (2014). Web scraping technologies in an API world. Briefings in Bioinformatics, 15(5), 788–797. https://doi.org/10.1093/bib/bbt026

Gold, Z., & Latonero, M. (2017). Robots Welcome: Ethical and Legal Considerations for Web Crawling and Scraping. Washington Journal of Law, Technology & Arts, 13(4), 275–312. Retrieved 28 March 2025 from https://heinonline.org/HOL/Page?handle=hein.journals/washjolta13&id=283&div=&collection=

Google. (2024a, July 10). Choosing the type of reCAPTCHA | Google for Developers. Retrieved 25 April 2025 from https://developers.google.com/recaptcha/docs/versions

Google. (2024b, July 10). reCAPTCHA v3. Google for Developers. Retrieved 24 April 2025 from https://developers.google.com/recaptcha/docs/v3

Guerar, M., Verderame, L., Migliardi, M., Palmieri, F., & Merlo, A. (2022). Gotta CAPTCHA 'Em All: A Survey of 20 Years of the Human-or-computer Dilemma. ACM Computing Surveys, 54(9), 1–33. https://doi.org/10.1145/3477142

Gutub, A., & Kheshaifaty, N. (2023). Practicality analysis of utilizing text-based CAPTCHA vs. Graphic-based CAPTCHA authentication. Multimedia Tools and Applications, 82(30), 46577–46609. https://doi.org/10.1007/s11042-023-15586-5

hCaptcha. (2025). Developer Guide | hCaptcha. https://docs.hcaptcha.com/

HTTPProxyOkeyProxy. (2024, August 9). The Ethical Implications of Robots.txt in Web Scraping—Okey proxy. Retrieved 28 April 2025 from https://mirror.xyz/0xC82a668EBF772623a441eEC2f817B482634a26eb/ZgIsU7VZy4LuxLlo7algTMrhtU0XOsx0S46qByzJkJE

Jin, R., Huang, L., Duan, J., Zhao, W., Liao, Y., & Zhou, P. (2023). How Secure is Your Website? A Comprehensive Investigation on CAPTCHA Providers and Solving Services (arXiv:2306.07543). arXiv. https://doi.org/10.48550/arXiv.2306.07543

Juntunen, M., & Lehenkari, M. (2021).
A narrative literature review process for an academic business research thesis. Studies in Higher Education, 46(2), 330–342. https://doi.org/10.1080/03075079.2019.1630813

Jyväskylän Yliopisto & Maanpuolustuskoulutusyhdistys. (2024, April 4). 2.2 Verkon osoitteet. Kansalaisen kyberturvallisuus -verkkokurssi. Retrieved 1 May 2025 from https://m3.jyu.fi/jyumv/ohjelmat/it/panu/kansalaisen-kyberturvallisuus-1/tekstiosuudet-aanena/recording-23-10-2021-14.44

Khder, M. (2021). Web Scraping or Web Crawling: State of Art, Techniques, Approaches and Application. International Journal of Advances in Soft Computing and Its Applications, 13(3), 145–168. https://doi.org/10.15849/IJASCA.211128.11

Koster, M. (1994, July 30). A Standard for Robot Exclusion. Retrieved 14 February 2025 from https://www.robotstxt.org/orig.html#author

Koster, M., Illyes, G., Zeller, H., & Sassman, L. (2022). Robots Exclusion Protocol (Request for Comments RFC 9309). Internet Engineering Task Force. https://doi.org/10.17487/RFC9309

Krotov, V., & Johnson, L. (2023). Big web data: Challenges related to data, technology, legality, and ethics. Business Horizons, 66(4), 481–491. https://doi.org/10.1016/j.bushor.2022.10.001

Krotov, V., Johnson, L., & Silva, L. (2020). Legality and Ethics of Web Scraping. Communications of the Association for Information Systems, 47, 539–563. https://doi.org/10.17705/1CAIS.04724

Logos, K., Brewer, R., Langos, C., & Westlake, B. (2023). Establishing a framework for the ethical and legal use of web scrapers by cybercrime and cybersecurity researchers: Learnings from a systematic review of Australian research. International Journal of Law and Information Technology, 31(3), 186–212. https://doi.org/10.1093/ijlit/eaad023

Luscombe, A., Dick, K., & Walby, K. (2022). Algorithmic thinking in the public interest: Navigating technical, legal, and ethical hurdles to web scraping in the social sciences. Quality & Quantity, 56(3), 1023–1044.
https://doi.org/10.1007/s11135-021-01164-0

Manoharan, M. (2024). API Rate Limiting Mechanisms in SaaS Applications: A Systematic Analysis of DDoS Protection Strategies. International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 10(6), Article 6. https://doi.org/10.32628/CSEIT241061223

NopeCHA. (2025). Google reCAPTCHA V3. Retrieved 24 April 2025 from https://developers.nopecha.com/token/recaptcha3/

Oesch, R., Eloranta, M., & Heino, M. (2017). Immateriaalioikeudet ja yleinen etu. Alma Talent Oy.

Pagallo, U., & Ciani Sciolla, J. (2023). Anatomy of web data scraping: Ethics, standards, and the troubles of the law. European Journal of Privacy Law & Technologies, 2, 1–19. https://doi.org/10.57230/ejplt232PS

Paige, J. (2024). The Legality and Ethics of Web Scraping in Archaeology. Advances in Archaeological Practice, 12(2), 98–106. https://doi.org/10.1017/aap.2023.42

Pathum, U. (2023, March 26). reCAPTCHA: How it works. Medium. Retrieved 25 April 2025 from https://medium.com/@hwupathum/recaptcha-how-it-works-4031eae74a8b

Plesner, A., Vontobel, T., & Wattenhofer, R. (2024). Breaking reCAPTCHAv2. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), 1047–1056. https://doi.org/10.1109/COMPSAC61105.2024.00142

Possler, D., Bruns, S., & Niemann-Lenz, J. (2019). Data Is the New Oil—But How Do We Drill It? Pathways to Access and Acquire Large Data Sets in Communication Science. International Journal of Communication, 13. Retrieved 16 April 2025 from https://ijoc.org/index.php/ijoc/article/viewFile/10737/2763

Radware. (2023). What is rate limiting and how does it work? | Radware. Retrieved 1 May 2025 from https://www.radware.com/cyberpedia/bot-management/rate-limiting/

Raghavan, B., Vishwanath, K., Ramabhadran, S., Yocum, K., & Snoeren, A. C. (2007). Cloud control with distributed rate limiting. ACM SIGCOMM Computer Communication Review, 37(4), 337–348.
https://doi.org/10.1145/1282427.1282419

Saha, S. K., Nag, A. K., & Dasgupta, D. (2015). Human-Cognition-Based CAPTCHAs. IT Professional, 17(5), 42–48. https://doi.org/10.1109/MITP.2015.79

Shet, V. (2014a, April 16). Street View and reCAPTCHA technology just got smarter. Google Online Security Blog. Retrieved 16 April 2025 from https://security.googleblog.com/2014/04/street-view-and-recaptcha-technology.html

Shet, V. (2014b, December 3). Are you a robot? Introducing "No CAPTCHA reCAPTCHA". Google Online Security Blog. Retrieved 24 April 2025 from https://security.googleblog.com/2014/12/are-you-robot-introducing-no-captcha.html

Shi, C., Xu, X., Ji, S., Bu, K., Chen, J., Beyah, R., & Wang, T. (2022). Adversarial CAPTCHAs. IEEE Transactions on Cybernetics, 52(7), 6095–6108. https://doi.org/10.1109/TCYB.2021.3071395

Singrodia, V., Mitra, A., & Paul, S. (2019). A Review on Web Scrapping and its Applications. 2019 International Conference on Computer Communication and Informatics (ICCCI), 1–6. https://doi.org/10.1109/ICCCI.2019.8821809

Sivakorn, S., Polakis, J., & Keromytis, A. D. (2016). I'm not a human: Breaking the Google reCAPTCHA. Black Hat Asia 2016. Retrieved 16 April 2025 from https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf

Specht v. Netscape Communications Corp. (5 July 2001). Retrieved 16 April 2025 from https://law.justia.com/cases/federal/district-courts/FSupp2/150/585/2468233/

Stankevich, A. (2017). Explaining the Consumer Decision-Making Process: Critical Literature Review. Journal of International Business Research and Marketing, 2(6), 7–14. https://doi.org/10.18775/jibrm.1849-8558.2015.26.3001

Thelwall, M., & Stuart, D. (2006). Web crawling ethics revisited: Cost, privacy, and denial of service. Journal of the American Society for Information Science and Technology, 57(13), 1771–1779. https://doi.org/10.1002/asi.20388

Turing, A. M. (1950).
Computing Machinery and Intelligence. Mind, 59(236), 433–460. Retrieved 16 April 2025 from https://www.jstor.org/stable/2251299

von Ahn, L., Blum, M., Hopper, N., & Langford, J. (2000). The Official CAPTCHA Site. Retrieved 16 April 2025 from http://www.captcha.net/

von Ahn, L., Blum, M., & Langford, J. (2004, February). Telling Humans and Computers Apart Automatically. Communications of the ACM, 47(2), 57–60. Retrieved 16 April 2025 from http://www.captcha.net/captcha_cacm.pdf

von Ahn, L., Maurer, B., McMillen, C., Abraham, D., & Blum, M. (2008). reCAPTCHA: Human-Based Character Recognition via Web Security Measures. Science, 321(5895), 1465–1468. https://doi.org/10.1126/science.1160379

xHossein. (2021). PyPasser [Python]. Retrieved 28 April 2025 from https://github.com/xHossein/PyPasser

Xiao, G. (2021). Bad Bots: Regulating the Scraping of Public Personal Information. Harvard Journal of Law & Technology, 34(2). Retrieved 26 April 2025 from https://jolt.law.harvard.edu/assets/articlePDFs/v34/6.-Xiao-Bad-Bots-Regulating-the-Scraping-of-Public-Personal-Information-edit.pdf

Xu, Y., Reynaga, G., Chiasson, S., Frahm, J.-M., Monrose, F., & van Oorschot, P. (2012). Security and Usability Challenges of Moving-Object CAPTCHAs: Decoding Codewords in Motion. https://www.usenix.org/system/files/conference/usenixsecurity12/sec12-final118.pdf

Zhao, B. (2017). Web Scraping. In L. A. Schintler & C. L. McNeely (Eds.), Encyclopedia of Big Data (pp. 326–328). Springer International Publishing. https://doi.org/10.1007/978-3-319-32001-4_6-1

Zynda, T. (2004). Ticketmaster Corp. V. Tickets.com, Inc. - Preserving Minimum Requirements of Contract on the Internet. Berkeley Technology Law Journal, 19(1), 495–518. https://doi.org/10.15779/Z38Q965

Appendices

Appendix 1. Taxonomy of text-based CAPTCHAs (Guerar et al., 2022)

Appendix 2. Taxonomy of image-based CAPTCHAs (Guerar et al., 2022)