Is it legal to scrape a government tender portal if its robots.txt disallows everything?

It depends on the jurisdiction and on what you do with the data, but a blanket disallow raises the risk. robots.txt isn't a technical barrier; it's a good-faith request from the site owner that automated agents not access any part of the site. Ignoring it for systematic, large-scale crawling is legally riskier than honoring it, especially if the extracted data is reused commercially. Crawling is usually unnecessary anyway: most procurement authorities that block crawlers on their main portal still publish the same data as an open dataset elsewhere, specifically intended for automated reuse and updated daily.

What format does Spain's open procurement data use, and why does it matter?

Spain's national open data feed uses ATOM/CODICE-XML — CODICE being the Spanish public procurement schema, based on UBL (Universal Business Language). It matters because it's the official, supported way to get the full list of Spanish tenders automatically, without touching the main portal at all — and it's updated daily.

Do I need rotating IP proxies to monitor government tenders?

In practice, no. Government procurement portals generally don't run Cloudflare-level anti-bot protection or aggressive fingerprinting — the kind of defenses that justify rotating proxies elsewhere. The failures you'll see when automating these portals are almost always platform instability — slow forms, expiring sessions — solved with retries and a fresh browser context, not a new IP.

Does the SAM.gov API cover state and local government contracts too?

No — SAM.gov's API covers federal opportunities. State and local government contracts are published on agency-specific portals and aggregators, which vary widely in whether they offer an API or open data feed. This is the same fragmentation problem seen in the EU, where TED covers above-threshold EU-wide contracts but national and regional portals handle everything else.

How does Nomos gather the tender data it analyzes?

Nomos combines automated searches against the public portal — using a headless browser configured to simulate a normal user session, without needing proxies — with AI-based extraction and analysis of each tender's documents. The result is a 0–10 relevance score per company speciality and, where relevant, an automatically generated draft proposal.

Can AI Scrape Government Tender Sites? The Real Data Sources

Can AI really scrape a government tender portal? Sometimes — but usually that's the wrong question. This guide covers the official APIs and open data feeds almost nobody uses, what a robots.txt that blocks everything actually means, and the truth about rotating proxies.

What "Scraping Tender Data" Actually Means

When someone asks whether AI can "go and pull tender data" from a government procurement site, they're usually picturing a bot that logs into a portal every morning, reads new notices, and emails a summary. The end result is correct — that's exactly what tender monitoring software does — but the "how" has more nuance than most guides admit.

This guide covers, with verifiable facts and no filler: which tender data sources actually exist, why some government portals explicitly tell robots not to enter, which official APIs are available (and almost never used), and whether you genuinely need rotating IP proxies for any of this.

The Official APIs Almost Nobody Uses

SAM.gov (United States)

Every US federal solicitation (pre-solicitation notices, solicitations, award notices, sole-source notices) is published on SAM.gov (formerly FedBizOpps). SAM.gov offers a free public "Get Opportunities" API. Registered users can request an API key from their Account Details page, with a default limit of 1,000 requests per day. The responses cover solicitation numbers, set-aside type, contact information, place of performance, NAICS/PSC codes and full descriptions.

This is the single most underused resource for US government contractors. Most companies still browse SAM.gov manually with saved searches and email alerts, when the same data — structured, in JSON — is available through an API that takes an afternoon to integrate.
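As a rough sketch of how small that integration is, the snippet below only builds a search URL for a date window and an optional NAICS code; it makes no network call. The endpoint path and the parameter names (api_key, postedFrom, postedTo, ncode, limit, offset) are assumptions based on the public API description and should be checked against the current SAM.gov documentation. The function name build_sam_query is hypothetical.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint for the "Get Opportunities" API; verify before use.
SAM_SEARCH_URL = "https://api.sam.gov/opportunities/v2/search"


def build_sam_query(
    api_key: str,
    posted_from: date,
    posted_to: date,
    naics: Optional[str] = None,
    limit: int = 100,
    offset: int = 0,
) -> str:
    """Return a search URL for opportunities posted in a date window."""
    params = {
        "api_key": api_key,
        # Assumed date format: MM/dd/yyyy.
        "postedFrom": posted_from.strftime("%m/%d/%Y"),
        "postedTo": posted_to.strftime("%m/%d/%Y"),
        "limit": limit,
        "offset": offset,
    }
    if naics:
        # Assumed parameter name for the NAICS code filter.
        params["ncode"] = naics
    return f"{SAM_SEARCH_URL}?{urlencode(params)}"


url = build_sam_query("YOUR_API_KEY", date(2025, 1, 15), date(2025, 1, 16), naics="541512")
```

Fetching the URL with any HTTP client and paging with offset is all that remains. With a daily request limit, one narrow date window per scheduled run is the natural design.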

TED (European Union)

For contracts above EU thresholds, the relevant source isn't a national portal but TED (Tenders Electronic Daily). TED's API allows anonymous access for searching and retrieving already-published notices, with no API key required. Notices use the eForms standard (Implementing Regulation (EU) 2019/1780), which could be used from 14 November 2022 and became mandatory on 25 October 2023; it is a far more consistent structured format than the legacy schema. TED also offers a SPARQL endpoint for open data and bulk XML downloads.

Find a Tender / Contracts Finder (United Kingdom)

The UK's Find a Tender service (the enhanced version launched 24 February 2025 under the Procurement Act 2023) and Contracts Finder both publish notices in Open Contracting Data Standard (OCDS) format via a public API, downloadable as JSON, Excel or CSV under the Open Government Licence. Like SAM.gov and TED, this is structured data published specifically for reuse — not something that needs to be scraped.
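To illustrate what "structured data published for reuse" looks like, here is a minimal sketch that flattens an OCDS release package into rows. The field names (releases, ocid, tender.title, tender.value, tender.tenderPeriod.endDate) follow the OCDS release schema. The sample package is invented for illustration and is not real output from Find a Tender or Contracts Finder, and summarize_releases is a hypothetical helper name.

```python
import json

# Invented sample in the shape of an OCDS release package.
SAMPLE_PACKAGE = json.loads("""
{
  "releases": [
    {
      "ocid": "ocds-example-0001",
      "tender": {
        "title": "Road resurfacing framework",
        "value": {"amount": 250000, "currency": "GBP"},
        "tenderPeriod": {"endDate": "2025-06-30T12:00:00Z"}
      }
    },
    {
      "ocid": "ocds-example-0002",
      "tender": {"title": "School catering services"}
    }
  ]
}
""")


def summarize_releases(package: dict) -> list:
    """Flatten each release into a row, tolerating missing optional fields."""
    rows = []
    for release in package.get("releases", []):
        tender = release.get("tender", {})
        value = tender.get("value", {})
        rows.append({
            "ocid": release.get("ocid"),
            "title": tender.get("title"),
            "amount": value.get("amount"),
            "currency": value.get("currency"),
            "deadline": tender.get("tenderPeriod", {}).get("endDate"),
        })
    return rows


rows = summarize_releases(SAMPLE_PACKAGE)
```

Using .get() throughout matters in practice: publishers fill in optional OCDS fields unevenly, so a missing value or deadline should produce None rather than an exception.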

What About National Portals That Block Crawlers?

Not every country's procurement portal is built with reuse in mind. Spain's national platform, for instance, has a robots.txt file that simply says: apply to all robots, disallow everything. No partial restriction — a blanket "no" to any automated access to the site itself.
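A blanket rule like that can be checked programmatically before any automated request is made. The sketch below uses Python's standard urllib.robotparser on an inline robots.txt; the file content, the portal.example URL and the tender-monitor-bot user agent are illustrative placeholders, not the real portal's values.

```python
from urllib.robotparser import RobotFileParser

# Illustrative blanket-deny file: applies to all robots, disallows everything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = not is_allowed(
    ROBOTS_TXT, "tender-monitor-bot", "https://portal.example/tenders/123"
)
```

In a real crawler, RobotFileParser can also load the file from the site with set_url() and read(); the inline string here keeps the example free of network access.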

That doesn't mean the data is unavailable — it means the portal itself isn't the right access point. In Spain's case, the same data is published separately as an open dataset in ATOM/CODICE-XML format (CODICE being the Spanish public procurement XML schema, based on UBL), updated daily and explicitly intended for automated reuse. The lesson generalizes: if a portal's robots.txt says no, check whether the same authority publishes an open data feed elsewhere — it usually does, because public-sector reuse obligations require it somewhere.
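Reading such a feed needs nothing beyond a standard XML parser. The sketch below extracts the basic Atom fields (id, title, updated, link) from an inline sample using xml.etree.ElementTree. The sample feed is invented, and the CODICE payload that a real entry carries is deliberately left out, since its element names would need to be taken from the published schema.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample feed; real entries also embed CODICE-XML, omitted here.
SAMPLE_FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example tender feed</title>
  <entry>
    <id>https://opendata.example/tenders/1001</id>
    <title>Street lighting maintenance</title>
    <updated>2025-03-01T08:00:00Z</updated>
    <link href="https://opendata.example/tenders/1001"/>
  </entry>
  <entry>
    <id>https://opendata.example/tenders/1002</id>
    <title>IT support services</title>
    <updated>2025-03-01T09:30:00Z</updated>
    <link href="https://opendata.example/tenders/1002"/>
  </entry>
</feed>
"""


def parse_entries(feed_xml: str) -> list:
    """Return one dict per Atom entry with id, title, updated and link."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "id": entry.findtext("atom:id", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", namespaces=ATOM_NS),
            "link": link.get("href") if link is not None else None,
        })
    return entries


entries = parse_entries(SAMPLE_FEED)
```

Because the feed is updated daily, comparing each entry's updated timestamp against the last processed one is enough to pick up only new or changed tenders.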

So Can AI Read a Tender Portal Directly?

Honestly: it depends on what for.

For bulk discovery — "what new tenders were published today across the country" — the right answer is the official API or open dataset (SAM.gov, TED, OCDS feeds, or national open data). It's faster, more reliable, designed for this, and doesn't run into robots.txt restrictions.

For fetching the actual tender document of a specific opportunity you've already identified — opening its detail page, downloading the RFP/solicitation PDF, checking an updated deadline — a headless browser that loads a public, no-login page and does exactly what a person would do with a link is a one-off, low-volume operation. It's a fundamentally different kind of access than crawling an entire site.

The combination that actually works in production: official APIs and open data feeds for "what exists", plus targeted, one-off document fetches for "give me the full RFP for this specific opportunity and analyze it".

The Truth About Rotating Proxies and IP Rotation

This is where most of the myths live. The common assumption is "scraping = you need rotating proxies or you'll get blocked." For many commercial sites, that's true: one industry estimate is that around 78% of the top 10,000 websites detect basic scraping within 100 requests, using browser and TLS fingerprinting, JavaScript challenges, and behavioral analysis, which is the standard Cloudflare/Akamai stack.

Government procurement portals generally aren't behind that kind of protection. They're built on legacy enterprise portal software (WebSphere Portal, Liferay and similar), without Cloudflare or CAPTCHAs in front.

In practice, a headless browser with a normal desktop user agent, a consistent locale and reasonable headers receives no IP-based blocks, even from a single fixed IP, across repeated runs. This is real experience from running Nomos's tender monitoring service in production, which accesses a national procurement portal recurrently.

The real failure mode isn't "we got blocked for scraping" — it's that the portal itself is slow and occasionally flaky: forms that take time to fully render, searches that return zero results because of a temporary server-side hiccup, sessions that need to be re-established. The fix for that isn't rotating IPs — it's retrying with a fresh browser context (new cookies, new session) after a short backoff.
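The retry pattern described above can be sketched in a few lines. This is an illustration under assumptions, not Nomos's actual code: run_search is a hypothetical callable that is expected to open its own fresh browser context (new cookies, new session) on every call, and the attempt count and delays are arbitrary defaults.

```python
import time
from typing import Callable, List


def search_with_retries(
    run_search: Callable[[], List[str]],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call run_search until it returns results or attempts run out.

    An empty result or an exception is treated as a transient portal
    glitch. The wait doubles after each failed attempt.
    """
    for attempt in range(attempts):
        try:
            results = run_search()
        except Exception:
            # Broad catch for brevity; narrow this to timeout and
            # navigation errors in real code.
            results = []
        if results:
            return results
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return []
```

The sleep function is injectable so the backoff can be tested without waiting. If every attempt comes back empty, the function returns an empty list, which the caller can reasonably treat as "no matching tenders today".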

When do rotating proxies actually make sense? When you're scraping commercial third-party aggregators that do run active anti-bot protection. But in that case, those same aggregators almost always sell a paid API — which is the option that actually makes sense, both for reliability and because it solves the underlying problem instead of fighting it.

How Nomos Gathers Tender Data

In practice, Nomos's tender monitoring combines a few pieces:

Configurable searches by contract type and territory against the public portal, using a headless Chromium browser (Playwright) that simulates a normal session: realistic desktop user agent, consistent locale, coherent language headers.
Retries with a fresh browser context when a search returns zero results — because, as explained above, that's almost always a temporary portal glitch, not a block.
Downloading and extracting text from the tender's PDF documents (technical and administrative specifications), including OCR for scanned documents.
Analysis by a language model that reads the full tender document and produces a 0–10 relevance score per company speciality — not just a match on the title.
From there, automatic generation of a draft proposal with Nomos.
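For the first item in the list above, a "normal session" mostly comes down to keeping a few settings coherent with one another. The sketch below builds keyword arguments that could be passed to Playwright's browser.new_context(); user_agent, locale, viewport and extra_http_headers are real options of that call. The specific values, including the user agent string and the es-ES default, are illustrative assumptions rather than Nomos's actual configuration, and session_options is a hypothetical helper name.

```python
# Illustrative desktop Chrome user agent; keep it in line with the browser build in use.
DESKTOP_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def session_options(locale: str = "es-ES") -> dict:
    """Return context options where locale and language headers agree."""
    language = locale.split("-")[0]
    return {
        "user_agent": DESKTOP_USER_AGENT,
        "locale": locale,
        "viewport": {"width": 1366, "height": 768},
        # Accept-Language derived from the locale so the two never contradict.
        "extra_http_headers": {"Accept-Language": f"{locale},{language};q=0.9"},
    }


options = session_options("es-ES")
# With Playwright installed: context = browser.new_context(**options)
```

Creating a new context from these options for each retry gives every attempt clean cookies and a clean session while the visible browser profile stays the same.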

If you want to see this working with your own search criteria, you can configure the tender monitor with your specialities and territories.

Conclusion

"Can AI scrape tender data" is the wrong question. The right one is: where does the data actually come from, and what do you need each source for?

For bulk discovery: official APIs and open data feeds (SAM.gov, TED, OCDS, national open data) — built for this, updated regularly, and outside any robots.txt conflict.
For a specific opportunity's detail: a one-off, targeted fetch of the public document, not a crawl of the whole site.
For all of this: rotating IP proxies are a solution to a problem you most likely don't have. Government procurement portals don't run the kind of anti-bot protection that would justify them — the real challenge is portal reliability, not security.

Once the sourcing question is settled, what actually moves the needle is what happens to the data next: the difference between getting a list of 50 contract titles a day and getting 5 contracts scored by relevance with a draft proposal already generated.