    Brief Outage at Internet Archive: Scraping Incident Causes Temporary Disruption

    Scrapers briefly caused an outage at the Internet Archive.

    The largest digital library on the Internet was briefly unavailable over the weekend after an individual used an Amazon-owned web service to rapidly extract thousands of files from the website.

    Brewster Kahle, the founder of the Internet Archive, said on Monday that the website was offline for about an hour after an individual used virtual hosts on Amazon Web Services to launch tens of thousands of requests to download optical character recognition (OCR) files.

    OCR is a technology that lets computers recognize text and characters within digital images. The Internet Archive is one of the largest repositories of digital files, including PDFs, e-books, and images containing text.

    On Sunday, the individual used 64 virtual hosts on Amazon Web Services to launch tens of thousands of downloads in a short period, degrading the Internet Archive’s ability to serve other users worldwide.

    Kahle wrote that tens of thousands of requests per second is an extraordinary volume, even by web standards.

    The archive restored service by blocking numerous IP addresses associated with the activity. However, the person or group responsible for the first wave of download requests tried again a few hours later. According to Kahle, this second attempt knocked the archive’s services offline for another hour.

    Kahle also thanked the engineers who worked on a Sunday afternoon during a holiday weekend to address the issue.

    The Internet Archive did not name those responsible. An initial tweet from the organization said the scraping requests were associated with an AI company rapidly collecting Internet Archive texts. A follow-up message suggested that the culprit may have been an enthusiastic user rather than an AI company.

    The Internet Archive emphasized its support for individuals and groups seeking to access and preserve its content. However, it urged users to start their projects slowly and scale up gradually. The organization asked those planning large projects to reach out directly, providing an email address ([email protected]) for correspondence.

    Kahle asked users who run into obstacles not to restart the process but to contact the organization instead. He encouraged use of the Internet Archive while urging users not to disrupt its services.
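
    The advice to start slowly and to stop rather than restart can be illustrated with a short sketch. The Python below is a hypothetical example, not anything published by the Internet Archive: the function name `polite_fetch_all`, its parameters, the one-second default delay, and the three-failure limit are all assumptions chosen for illustration. It downloads one item at a time, pauses between requests, doubles the pause after each failure, and gives up after repeated failures instead of retrying.

```python
import time
from typing import Callable, Iterable, List


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], int],
    delay_seconds: float = 1.0,          # assumed default, not an official limit
    max_consecutive_failures: int = 3,   # assumed threshold for giving up
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch URLs sequentially with a pause between requests.

    `fetch` is a caller-supplied function that requests one URL and
    returns its HTTP status code. Returns the URLs fetched successfully.
    """
    fetched: List[str] = []
    failures = 0
    wait = delay_seconds
    for url in urls:
        status = fetch(url)
        if status == 200:
            fetched.append(url)
            failures = 0
            wait = delay_seconds          # back to the normal pace
        else:
            failures += 1
            if failures >= max_consecutive_failures:
                # Stop instead of hammering the server; at this point the
                # operator should contact the site rather than restart.
                break
            wait *= 2                     # slow down after each failure
        sleep(wait)
    return fetched
```

    A single sequential loop like this is the opposite of the 64 parallel hosts described in the incident. In practice, `fetch` could wrap any HTTP client, and the `sleep` parameter exists only so the pacing can be checked without actually waiting.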

    In addition to its extensive collection of files, the Internet Archive is widely known for the Wayback Machine, which has preserved static web pages since the mid-1990s. The archive is based in the Richmond District of San Francisco.
