Robots.txt Generator
Master your crawl budget. Create directives to guide search engines, block AI scrapers, and secure your site architecture.
Robots.txt for Technical SEO
Imagine your website as a massive, sprawling library. Search engines like Google and Bing are the librarians, sending automated software programs (called crawlers, spiders, or bots) to read every book, catalog every page, and organize the information for the public. But what if there are rooms in your library—like an administration office, a private archive, or a messy construction zone—that you don't want the public to see?
This is where the robots.txt file comes in. It acts as the bouncer at the front door of your website: a simple text file placed in your root directory that sets out a strict list of rules about which bots are allowed in and exactly which directories or files they are forbidden from accessing.
How Search Engines Crawl the Web (And Where Robots.txt Fits)
To understand why this file is so crucial for SEO, you must understand the three phases of how a search engine operates:
- Discovery & Crawling: The bot finds a link to your site. Before it looks at a single page of HTML, the very first thing it does is request your robots.txt file (e.g., https://yoursite.com/robots.txt). It reads the rules to see whether it is allowed to proceed.
- Indexing: If allowed, the bot downloads the page content and stores it in a massive database.
- Ranking: When a user types a query, algorithms sort through the index to provide the most relevant answers.
If your robots.txt file is misconfigured, you can accidentally sever the process at Step 1, rendering the greatest content and backlinks in the world entirely useless because Google literally cannot view the page.
The Core Syntax and Directives
The robots.txt protocol relies on a very specific, rigid syntax. A standard rule block is built from two main components, plus optional pattern-matching wildcards:
1. User-agent
This specifies who the rule applies to. You can target specific bots by name, or use a wildcard.
- User-agent: * (Applies the following rules to every bot on the internet).
- User-agent: Googlebot (Applies rules only to Google's standard web crawler).
- User-agent: Bingbot (Applies rules only to Microsoft Bing).
2. Allow and Disallow Directives
These commands tell the specified User-agent which paths are off-limits or explicitly permitted.
- Disallow: /wp-admin/ (Blocks crawling of the WordPress admin dashboard).
- Disallow: /checkout/ (Blocks eCommerce checkout pages to prevent duplicate content and index bloat).
- Allow: /wp-admin/admin-ajax.php (Overrides a higher-level disallow to explicitly allow a specific necessary file).
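Putting the two components together, here is a minimal sketch of a complete rule block. It reuses the WordPress and checkout paths from the list above; treat them as placeholders for your own site structure.

```
# Applies to every crawler
User-agent: *
Disallow: /wp-admin/
Disallow: /checkout/
# Exception: this file is required by many WordPress front-end features
Allow: /wp-admin/admin-ajax.php
```

A crawler obeys the single most specific group that matches its name; the * group applies only to bots that are not matched by a named group elsewhere in the file.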
3. Pattern Matching (Wildcards)
Googlebot supports complex pattern matching using asterisks (`*`) and dollar signs (`$`).
- Asterisk (*): Represents any sequence of characters. `Disallow: /*?sort=` blocks any URL containing a sort parameter, regardless of what comes before it.
- Dollar Sign ($): Designates the absolute end of a URL. `Disallow: /*.pdf$` tells the bot to block crawling of any file ending exactly with ".pdf".
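As a sketch, both wildcard types can be combined in a single block. The ?sort= parameter and .pdf extension are just the examples from above; substitute the patterns your own URLs actually generate.

```
User-agent: *
# Block any URL containing a ?sort= query parameter
Disallow: /*?sort=
# Block any URL that ends exactly in .pdf
Disallow: /*.pdf$
```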
The Rise of AI Bots (And How to Block Them)
In the generative AI era, a new breed of web crawler has emerged. Companies like OpenAI, Anthropic, and Google send massive swarms of bots to scrape the public internet, using your copyrighted articles, blog posts, and data to train their Large Language Models (LLMs).
Many publishers and SEO professionals choose to opt out of this unregulated data harvesting. Our generator includes a one-click "Block AI Scrapers" preset, which automatically targets the User-agents of the biggest LLM trainers. Here are the primary bots you should block if you wish to protect your intellectual property:
- User-agent: GPTBot (OpenAI's primary web crawler for training future GPT models).
- User-agent: ChatGPT-User (The bot used when a ChatGPT user asks the AI to browse the web for a real-time answer).
- User-agent: CCBot (Common Crawl, one of the largest open-source datasets used to train AI).
- User-agent: anthropic-ai (Crawls data to train Claude).
- User-agent: Google-Extended (Google's user-agent token that controls whether your content is used to train Bard/Gemini, separate from standard Googlebot, which handles search indexing).
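The one-click preset generates blocks along these lines, shown here as a sketch with each AI crawler given a site-wide disallow (the exact output may include additional or newer user-agents):

```
# Opt out of AI / LLM training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /
```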
Crawl Budget Optimization
Google does not have infinite computing power. When it discovers your website, it assigns a "Crawl Budget": a limit on how many pages it is willing to crawl in a given timeframe, based on your site's authority and server speed.
If you have an eCommerce site with 10,000 products, and faceted navigation that creates 50,000 parameter URLs (e.g., ?color=red&size=medium), Googlebot might waste its entire crawl budget scanning useless filter variations instead of indexing your new, highly profitable product pages.
By using Disallow directives in your robots.txt file to block these dynamic parameter paths, internal site-search result pages, and tag archives, you force Googlebot to focus its limited budget on the pages that actually matter for your SEO.
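For the faceted-navigation example above, a hedged sketch might look like this. The color and size parameters and the /search/ and /tag/ paths are placeholders; block only the patterns your own site actually produces, and double-check that you are not cutting off pages you want indexed.

```
User-agent: *
# Faceted navigation parameters (?color=red&size=medium and similar)
Disallow: /*?color=
Disallow: /*&color=
Disallow: /*?size=
Disallow: /*&size=
# Internal site-search results and tag archives
Disallow: /search/
Disallow: /tag/
```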
Crucial Mistake: Disallow vs. Noindex
This is the most common and dangerous misconception in technical SEO:
Robots.txt Disallow does NOT remove a page from Google's index.
If you put Disallow: /secret-page/ in your robots.txt, you are telling Google not to crawl it. However, if another website links to /secret-page/, Google can still index the bare URL because it knows the page exists. The search result will just look ugly, displaying the URL with a message stating, "No information is available for this page."
If you want a page completely removed from Google, you must allow Google to crawl it, and place a <meta name="robots" content="noindex"> tag in the HTML head of that specific page. Googlebot crawls the page, reads the noindex tag, and deletes the page from its database.
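As a concrete sketch, using a hypothetical /secret-page/ path:

```
# WRONG: this blocks crawling, but the bare URL can still be indexed
# through external links, because Googlebot never sees the page itself.
User-agent: *
Disallow: /secret-page/
```

The correct setup is the reverse: leave /secret-page/ out of robots.txt entirely so it stays crawlable, and place <meta name="robots" content="noindex"> in that page's HTML head; on the next crawl, Google reads the tag and drops the page from its index.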
Security Warning: Robots.txt is Public
Your robots.txt file is publicly accessible to anyone on the internet. You can view Google's by simply typing google.com/robots.txt.
Because of this, you should never use robots.txt to hide sensitive information. If you add Disallow: /my-secret-admin-login-url/, you are literally giving hackers a roadmap to your hidden login pages. Security through obscurity is not security. To protect private directories, you must use server-side authentication or .htaccess password protection.
Frequently Asked Questions (FAQ)
Where do I upload the generated robots.txt file?
The file must be named exactly robots.txt (all lowercase) and uploaded to the absolute highest-level root directory of your website. It must be accessible at https://www.yourdomain.com/robots.txt. If you put it in a subfolder, search engines will not find it and will assume you have no crawl restrictions.

Why should I add my Sitemap to this file?
Adding a Sitemap directive points crawlers straight to your XML sitemap, helping them discover every URL you want indexed even if your internal linking misses some pages. Google and Bing both support the Sitemap line, and it can appear anywhere in the file.
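A minimal sketch, assuming your sitemap lives at the conventional /sitemap.xml location (adjust the URL to wherever yours is actually served):

```
# The Sitemap line is independent of any User-agent group
Sitemap: https://www.yourdomain.com/sitemap.xml
```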
What is the Crawl-delay directive?
Crawl-delay: 10 instructs bots to wait 10 seconds between each page request. Historically, this was used to prevent aggressive bots from crashing cheap shared hosting servers. Note: Googlebot completely ignores the crawl-delay directive. If you need to slow down Googlebot, you must do it via the Google Search Console settings. Bingbot and Yandex, however, still respect crawl-delay.
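A sketch that throttles only Bingbot, using the 10-second value from the answer above (tune it to your server's capacity):

```
# Ask Bingbot to wait 10 seconds between requests; Googlebot ignores this rule
User-agent: Bingbot
Crawl-delay: 10
```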
How can I test if my robots.txt is working?
First, load https://www.yourdomain.com/robots.txt in a browser to confirm the file is live and readable. Google Search Console also reports whether Googlebot could fetch and parse your robots.txt, and its indexing reports flag pages that are blocked by it.

Can I have a different robots.txt for a subdomain?
Yes. Subdomains like blog.yoursite.com and store.yoursite.com are treated as entirely different entities by search engines. You will need a separate robots.txt file deployed in the root directory of each respective subdomain.

Explore More Technical SEO & Server Tools
Controlling crawl access is just one pillar of a sound technical SEO architecture. Enhance your site's indexing, routing, and meta data with our suite of free developer utilities.