What is Robots.txt? Complete Guide with Examples
Robots.txt is a plain text file at the root of a website (example.com/robots.txt) that tells web crawlers which URLs they may and may not request. It follows the Robots Exclusion Protocol (standardized as RFC 9309) and uses User-agent and Disallow directives to control crawling behavior. Robots.txt is a request, not an enforcement mechanism, but all major search engines honor it. Note that it controls crawling, not indexing: a blocked URL can still appear in search results if other pages link to it, so pages that must stay out of the index need a noindex directive or authentication instead. Robots.txt remains a critical tool for managing crawl budget, keeping crawlers out of private areas, and guiding crawler behavior.
How Does Robots.txt Work?
When a search engine crawler visits a website, it first fetches /robots.txt to read the crawling rules. The file contains one or more User-agent blocks, each naming the crawler the rules apply to, followed by Disallow (block) and Allow (permit) directives with URL path patterns. The crawler matches its name against the User-agent lines and follows the rules in the most specific matching group; when both an Allow and a Disallow rule match a URL, the most specific (longest) rule wins. Wildcards (*) match any sequence of characters, and the $ anchor matches the end of a URL. The Sitemap directive points crawlers to the XML sitemap.
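A quick way to experiment with these rules is Python's standard-library urllib.robotparser. A caveat for this sketch: that parser evaluates simple prefix rules with first-match semantics and does not implement the * and $ extensions, so the Allow line is listed before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# A minimal rule set: block /admin/ for all crawlers,
# but carve out /admin/public/. The Allow line comes first
# because this parser applies the first matching rule.
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://example.com/admin/login"))        # False
print(rp.can_fetch("Googlebot", "https://example.com/admin/public/help"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))          # True: no rule matches, default is allow
```

Search engines' own matchers are stricter and richer (longest-match precedence, wildcard support), so treat this only as a convenient local check.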
Key Features
- User-agent targeting for specific crawlers (Googlebot, Bingbot) or all crawlers (*)
- Disallow directive blocking crawlers from specific URL paths or patterns
- Allow directive permitting access to subdirectories within disallowed paths
- Wildcard (*) and end-of-string ($) pattern matching for flexible rules
- Sitemap directive pointing crawlers to the XML sitemap location
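A single file combining all of these features might look like the following (the paths and sitemap URL are illustrative):

```
# Rules for Google's crawler only
User-agent: Googlebot
Disallow: /*.pdf$

# Rules for every other crawler
User-agent: *
Disallow: /search
Allow: /search/help

Sitemap: https://example.com/sitemap.xml
```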
Common Use Cases
Admin Area Protection
Websites disallow crawling of /admin/, /dashboard/, and /internal/ paths to keep search engine crawlers out of administrative interfaces that shouldn't appear in search results.
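Such a policy can be expressed directly with the paths named above:

```
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /internal/
```

Remember that robots.txt is publicly readable, so it advertises these paths; sensitive areas still need authentication.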
Crawl Budget Optimization
Large sites use robots.txt to prevent crawlers from wasting crawl budget on low-value pages like search results, filtered views, and paginated archives.
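A sketch of such rules, assuming hypothetical query parameters named filter and sort:

```
User-agent: *
Disallow: /search
Disallow: /*?filter=
Disallow: /*?sort=
Disallow: /archive/page/
```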
Staging Environment Protection
Staging and development sites use a blanket rule (User-agent: * followed by Disallow: /) that blocks all compliant crawlers from every page, preventing accidental indexing.
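The blanket rule reads, in full:

```
User-agent: *
Disallow: /
```

Be sure to remove or replace this file at launch, since it also blocks crawlers from the production site if carried over.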