Your Library's Content Is Being Used to Train AI. Here's What You Should Know.
Quick question: Have you checked whether AI companies are scraping your library's digital collections for training data?
If your answer is "We don't have a way to check that," you're not alone. Most libraries still don't.
Here's the short version:
- Libraries have become data sources for AI training, without compensation or consent. Data held in vendor systems (catalog records, patron data, usage patterns) gets incorporated into training datasets and sold back to libraries as "AI features."
- There's a contractual loophole: vendors classify data use as "service improvement" rather than "AI training," avoiding disclosure and licensing obligations. Libraries have no visibility into what data is used for model training.
- Patron privacy takes the hit: circulation histories, search patterns, and demographic data become training data for commercial AI systems (discovery, recommendation, behavior prediction).
- What libraries should do: audit vendor contracts for AI training clauses, require explicit opt-in/opt-out language, and insist that patron data is never used without explicit consent and compensation to the library.
But here's the uncomfortable truth: If your digital collections are publicly accessible on the web, they've been used to train AI. Past tense. It's already happened.
And you might not have any say in it.
How AI Training Actually Works
AI models like ChatGPT, Claude, and Google's Gemini are trained on massive datasets. Billions of documents, images, and web pages scraped from the internet.
The companies building these models use web crawlers (bots) that systematically visit websites, download content, and add it to training datasets.
Your library's website? Fair game. Your digital collections portal? Fair game. Your institutional repository? Fair game.
Unless you've specifically blocked these crawlers (and most libraries haven't), your content is getting scraped.
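To make that concrete, here's a minimal sketch in Python of what a crawler's core loop looks like: fetch a page, keep the text, collect the links to visit next. The URL and bot name are hypothetical placeholders, not any real crawler; actual training crawlers just run this loop billions of times over.
# Minimal sketch of a crawler's core loop; the URL and bot name are hypothetical.
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags so the crawler knows where to go next."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

url = "https://yourLibrary.org/digital-collections/"  # hypothetical collections page
request = urllib.request.Request(url, headers={"User-Agent": "ExampleBot/1.0"})
with urllib.request.urlopen(request) as response:
    page = response.read().decode("utf-8", errors="replace")

extractor = LinkExtractor()
extractor.feed(page)
# The page text goes into the training corpus; the links feed the next round of fetches.
print(f"Captured {len(page)} characters and found {len(extractor.links)} links to crawl next.")
That's the whole trick. Scale it across millions of sites and you have a training dataset.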
Why This Matters
You might think: "Who cares? Our digital collections are public anyway. What's the harm?"
Here's the harm:
1. Copyright Infringement Risk
If your digital collections include copyrighted material (digitized books, journal articles, photographs, archival documents), and AI companies are training on that content without permission, that's potentially copyright infringement.
Now, the AI companies claim it's "fair use." Courts haven't decided that yet. But if courts rule against fair use (like they did with the Internet Archive), you could be caught in the middle.
Rightsholders might sue the AI company. The AI company might say "We got this content from [your library]'s website." Suddenly you're part of a legal fight you didn't ask for.
2. Metadata and Privacy Issues
Your digital collections don't just contain content. They contain metadata. Catalog records, subject headings, usage statistics, timestamps, user-generated tags.
That metadata is valuable. It helps AI understand context, relationships, and structure. And if your metadata includes information about who accessed what and when... that's potentially privacy-sensitive.
Most libraries don't realize their metadata is being scraped alongside their content.
3. Loss of Control Over Your Content
Once your content is in an AI training dataset, you can't get it back.
Even if you later decide "We don't want our content used for AI," it's too late. The AI has already learned from it. The model is trained. The dataset is archived.
You've permanently lost control over how your content is used.
4. Ethical Concerns
Many libraries have ethical commitments around how they share and preserve cultural heritage. You've spent years building digital collections that respect:
- Indigenous knowledge protocols
- Donor restrictions
- Privacy of individuals in archival materials
- Cultural sensitivities
AI companies don't care about any of that. They're scraping everything indiscriminately, with no regard for context or ethical obligations.
Your carefully curated collection becomes just another data dump in a corporate AI model.
The Common Crawl Problem
You've probably never heard of Common Crawl. But it's one of the largest sources of AI training data.
Common Crawl is a nonprofit organization that regularly crawls the entire web and releases the data as open datasets. AI companies use Common Crawl data extensively because it's free, comprehensive, and legally (somewhat) defensible.
If your library's website is indexed by Common Crawl, your content is in datasets being used to train AI.
Want to check? Search for your library's domain in the Common Crawl URL index at index.commoncrawl.org.
I'll wait.
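If you'd rather check from a script than a browser, here's a hedged sketch in Python against Common Crawl's public URL index at index.commoncrawl.org. The domain is a placeholder, and the crawl ID is only an example; Common Crawl publishes the current list of crawl IDs on its site.
# Sketch: query the Common Crawl URL index for captures of your domain.
# The domain and crawl ID below are placeholders; pick a recent crawl ID from commoncrawl.org.
import json
import urllib.parse
import urllib.request

domain = "yourLibrary.org"        # placeholder; substitute your own domain
crawl_id = "CC-MAIN-2024-33"      # example crawl ID; newer ones exist

query = urllib.parse.urlencode({"url": f"{domain}/*", "output": "json", "limit": "10"})
index_url = f"https://index.commoncrawl.org/{crawl_id}-index?{query}"

with urllib.request.urlopen(index_url) as response:
    lines = response.read().decode("utf-8").splitlines()

records = [json.loads(line) for line in lines if line.strip()]
for record in records:
    # Each record is a page from your site captured in the open dataset.
    print(record.get("timestamp"), record.get("url"))
print(f"{len(records)} captures listed (capped at 10 for this query).")
If the query errors out, your domain may simply not appear in that particular crawl; try a couple of crawl IDs before concluding you're in the clear.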
What AI Companies Are Saying (And Why It's Nonsense)
When confronted about scraping content without permission, AI companies usually say:
"It's publicly accessible, so it's fair game."
Wrong. "Publicly accessible" doesn't mean "free to use for commercial AI training." Your library makes content publicly accessible for research, education, and preservation. Not to train corporate AI models.
"We're following robots.txt rules."
Robots.txt is a file that tells web crawlers which parts of your site they can and can't access. But it's voluntary. Crawlers can ignore it. And many AI companies do ignore it or interpret it selectively.
"It's fair use."
Maybe. Courts haven't decided yet. And if they decide it's not fair use, you're left dealing with the consequences.
"We anonymize and aggregate the data."
Great. But you didn't ask permission. And "anonymized" doesn't mean "ethical."
What You Can Do Right Now
If you want to prevent (or limit) AI companies from scraping your digital collections, here's what you can do:
1. Update Your robots.txt File
Robots.txt is a text file at the root of your website (e.g., yourLibrary.org/robots.txt) that tells web crawlers what they can and can't access.
Add these lines to block common AI crawlers:
# Block OpenAI (ChatGPT)
User-agent: GPTBot
Disallow: /
# Block Google AI
User-agent: Google-Extended
Disallow: /
# Block Anthropic (Claude)
User-agent: anthropic-ai
Disallow: /
# Anthropic's newer crawler identifies as ClaudeBot
User-agent: ClaudeBot
Disallow: /
# Block Common Crawl
User-agent: CCBot
Disallow: /
# Block OpenAI's ChatGPT browsing agent
User-agent: ChatGPT-User
Disallow: /
This won't stop all AI scraping (some crawlers don't identify themselves, and others ignore robots.txt), but it's a start.
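Once the file is in place, it's worth confirming the rules resolve the way you expect. Here's a small sketch using Python's built-in robots.txt parser; the domain and test page are placeholders, and the user-agent strings match the crawlers blocked above.
# Sketch: confirm your robots.txt disallows the AI crawlers listed above.
import urllib.robotparser

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://yourLibrary.org/robots.txt")   # placeholder domain
robots.read()

ai_bots = ["GPTBot", "Google-Extended", "anthropic-ai", "ClaudeBot", "CCBot", "ChatGPT-User"]
test_page = "https://yourLibrary.org/digital-collections/item-123"   # placeholder page
for bot in ai_bots:
    status = "blocked" if not robots.can_fetch(bot, test_page) else "still allowed"
    print(f"{bot}: {status}")
Run it after every robots.txt change; a typo in a User-agent line silently leaves that crawler allowed.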
2. Add a Terms of Service Page
Create a clear Terms of Service that states:
- Content on this site is provided for research and educational purposes
- Commercial use, including AI training, is prohibited without written permission
- Automated scraping for AI training is explicitly forbidden
Will this legally stop AI companies? No. But it establishes your intent and could be useful if you ever need to take legal action.
3. Monitor Your Server Logs
Check who's accessing your digital collections. Look for:
- High-volume automated access patterns
- User agents associated with AI crawlers (GPTBot, CCBot, Bytespider, etc.)
- Unusual traffic spikes
If you see suspicious activity, block those IPs or user agents.
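If you want a starting point, here's a rough sketch in Python that counts requests from known AI crawler user agents in an access log. The log path and the exact agent list are assumptions; adjust both for your server.
# Sketch: count requests from known AI crawler user agents in a web server access log.
# The log path and agent list are assumptions; adjust for your environment.
from collections import Counter

AI_AGENTS = ["GPTBot", "CCBot", "Google-Extended", "anthropic-ai",
             "ClaudeBot", "ChatGPT-User", "Bytespider"]

hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        lowered = line.lower()
        for agent in AI_AGENTS:
            if agent.lower() in lowered:
                hits[agent] += 1

for agent, count in hits.most_common():
    print(f"{agent}: {count} requests")
High counts from any of these are your cue to tighten robots.txt or block at the firewall.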
4. Use Metadata to Assert Rights
If you're publishing digital collections, add rights statements to your metadata that explicitly address AI use:
"This content is made available for research and education. Use for commercial purposes, including training artificial intelligence models, is prohibited without permission."
This won't stop AI companies from scraping, but it makes your position clear.
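If your metadata pipeline is script-driven, adding that statement can be automated. Here's a hedged sketch in Python that attaches a rights statement to a simple Dublin Core record; the element names follow the usual dc conventions, and the record itself is a made-up example.
# Sketch: attach a rights statement to a Dublin Core record.
# The record contents are a made-up example; the rights text mirrors the statement above.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
ET.SubElement(record, f"{{{DC}}}title").text = "Example Digitized Photograph"
ET.SubElement(record, f"{{{DC}}}rights").text = (
    "This content is made available for research and education. Use for commercial "
    "purposes, including training artificial intelligence models, is prohibited "
    "without permission."
)

print(ET.tostring(record, encoding="unicode"))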
5. Participate in Collective Action
Individual libraries can't fight Big Tech alone. But collective action matters.
Join organizations like:
- The Authors Alliance (advocating for fair use and author rights)
- Creative Commons (working on AI-specific licensing frameworks)
- ALA's Intellectual Freedom Committee (addressing AI and libraries)
The more libraries speak up, the more pressure there is for regulation and ethical AI practices.
The Questions Nobody's Asking (But Should Be)
Here are the uncomfortable questions libraries need to grapple with:
1. If we make content publicly accessible, do we have any right to control how it's used?
Legally, maybe not (depending on copyright and fair use). Ethically, absolutely. Libraries exist to serve the public good, not to fuel corporate AI development.
2. Should we treat AI companies like we treat other commercial entities?
You wouldn't let a for-profit publisher scrape your digital collections and republish them without permission. Why is AI training different?
3. Are we complicit in AI harms if we allow our content to be scraped?
If an AI trained on your library's content generates biased, harmful, or false information, do you bear any responsibility? Even if you didn't intend for it to be used that way?
4. What do our donors and rightsholders think?
If you digitized materials donated to your library, do the donors know their content might be training corporate AI? Did they consent to that? Should they get a say?
The HathiTrust Paradox
Remember HathiTrust? I mentioned it in my post about the Internet Archive lawsuit.
HathiTrust successfully argued that building a searchable database of scanned books was fair use. The court said libraries could make digital copies for preservation and search purposes.
But here's the paradox: HathiTrust kept those scanned books in a controlled, nonprofit environment, and the court ruled in its favor precisely because HathiTrust wasn't commercially exploiting the content.
Now AI companies are doing the opposite, scraping content from libraries and using it for commercial products. And they're claiming the same "fair use" defense HathiTrust used.
If courts allow that, it undermines the whole point of the HathiTrust decision. Libraries lose control over digital preservation and access.
What Some Libraries Are Already Doing
More libraries are taking action:
The Internet Archive (despite losing their CDL lawsuit) has been vocal about AI scraping issues. They've documented which AI companies are accessing their collections and advocated for regulations.
Multiple university libraries added AI-specific language to their digital collection licenses and donor agreements in 2025. They're being upfront: "This content might be used to train AI. Here's what we're doing about it."
Library consortia are collaborating on shared policies for blocking AI crawlers and asserting rights over digital collections. Some state library associations published model language for contracts and terms of service in late 2025.
A few libraries have sued AI companies for scraping their digital collections without permission. These cases are in early stages, but they signal that some institutions are willing to fight back.
These are still small steps, but momentum is building.
My Recommendation: Don't Stay Silent
You have three options:
Option 1: Do nothing. Accept that your digital collections will be used to train AI. Hope for the best.
Option 2: Block AI scrapers. Update robots.txt, monitor server logs, assert your rights. Make it harder for AI companies to take your content.
Option 3: Engage and negotiate. Reach out to AI companies and say "If you want our content, let's talk." Set terms. Require transparency. Demand ethical use.
I recommend a combination of 2 and 3.
Blocking AI scrapers sends a message: You don't have free rein over our content. And engaging with AI companies (or advocating for regulation) ensures libraries have a voice in how AI is built.
Doing nothing is the worst option. Silence is consent. And right now, AI companies are treating library silence as permission.
See Also (Related Legal Timeline Posts)
- Internet Archive Lawsuit: What It Means for AI and Your Library - Understand the legal landscape that AI companies are betting against
- Internet Archive Lawsuit Overview - The precedent affecting fair use arguments for AI training
- AI Clauses in Vendor Contracts - How vendors hide the same data practices they're using on your collections
Further Reading:
- Dark Patterns in Web Scraping and AI Training
- Common Crawl Search Tool
- Blocking AI Bots: A Practical Guide
- Columbia Journal: The Rise of Digital Books and Libby's Impact on Libraries
- ARL: Library Copyright Alliance Statement on AI
Want help auditing your digital collections for AI scraping risks? Reach out.
Want updates (or backup)?
Get new posts by email, or book a free 30-minute call if you're facing a contract, AI policy, or vendor decision.