How to Write robots.txt Exclusion Rules

2026-01-17 • Internet

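`robots.txt` is a plain-text file that tells search engine crawlers (such as those run by Google and Bing) how they may crawl a site. It must sit at the root of the site (for example, `https://www.example.com/robots.txt`); crawlers do not look for it anywhere else. With `robots.txt`, you can tell crawlers which pages they may fetch and which they should skip.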

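### I. Basic Syntax

A `robots.txt` file consists of one or more groups of rules. Each group starts with a `User-agent` line naming the crawler it applies to, followed by `Disallow` or `Allow` directives.

#### 1. Allow all crawlers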
```txt
User-agent: *
Allow: /
```

#### 2. Block all crawlers
```txt
User-agent: *
Disallow: /
```

#### 3. Block a specific crawler
```txt
User-agent: Googlebot
Disallow: /
```
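
To sanity-check how a per-crawler group behaves, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `Bingbot` agent name, the `is_allowed` helper, and the example.com URL are illustrative assumptions, and real crawlers may apply their own matching logic.

```python
from urllib.robotparser import RobotFileParser

# The rule set from the example above: only Googlebot is blocked.
RULES = """\
User-agent: Googlebot
Disallow: /
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse in-memory text, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.example.com/index.html"
    print(is_allowed(RULES, "Googlebot", url))  # False: the group names Googlebot
    print(is_allowed(RULES, "Bingbot", url))    # True: no group applies to Bingbot
```

A crawler that matches no group and finds no `User-agent: *` group is unrestricted, which is why the second check passes.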

#### 4. Block a specific path
```txt
User-agent: *
Disallow: /private/
```

This blocks all crawlers from everything under the `/private/` directory.

#### 5. Block a specific file type
```txt
User-agent: *
Disallow: /*.pdf$
```

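This blocks every URL that ends in `.pdf`. Note that `*` and `$` are wildcard characters, not regular expressions, and not every crawler supports them; some only do plain prefix matching.

#### 6. Block a specific URL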
```txt
User-agent: *
Disallow: /contact-us.html
```

#### 7. Multiple rules
```txt
User-agent: *
Disallow: /private/
Disallow: /admin/
Disallow: /tmp/
```

#### 8. Wildcards (partial support)
```txt
User-agent: *
Disallow: /*.php
```

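Note: not every crawler supports wildcards (Google does, but some others may not), so prefer precise path prefixes where you can. Without a trailing `$`, this rule matches any URL whose path contains `.php`, including URLs with a query string such as `/index.php?id=1`.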
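
To make the wildcard semantics concrete, the following sketch translates a path pattern into a regular expression, assuming the Google-style rules where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The helper names `pattern_to_regex` and `matches` are hypothetical, and this is not any search engine's actual implementation.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    `*` matches any run of characters; a trailing `$` anchors the end.
    Everything else is literal and matched as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with ".*" where "*" appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if `path` falls under the rule `pattern`."""
    return pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(matches("/*.pdf$", "/docs/a.pdf"))      # True
    print(matches("/*.pdf$", "/docs/a.pdf?x=1"))  # False: "$" requires the URL to end in .pdf
    print(matches("/*.php", "/index.php?id=1"))   # True: no "$", so a prefix match is enough
    print(matches("/private/", "/private/a"))     # True: plain prefix rule
```

A crawler that only does prefix matching would treat `/*.pdf$` as a literal string and effectively ignore the rule, which is why precise path prefixes are the safer choice.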

---

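### II. Caveats

1. **Do not rely on robots.txt to protect sensitive information**  
   It is only a polite request that crawlers stay away from certain content, not a security measure, and the file itself is publicly readable. Protect sensitive data with server-side access control.

2. **robots.txt controls crawling, not indexing**  
   A page that is already indexed can still appear in search results even when `robots.txt` blocks it. To keep a page out of the index, use a `noindex` robots `meta` tag (or an `X-Robots-Tag` response header) and leave the page crawlable, because a crawler that is blocked from fetching the page never sees the `noindex`.

3. **Test your robots.txt**  
   You can use [Google's Robots.txt Tester](https://search.google.com/search/console/robots-txt) to check that your configuration behaves as intended. Google has since folded this tool into the robots.txt report in Search Console, so the link may no longer open the standalone tester.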
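
Because `noindex` has to be delivered with the page itself, it lives in your application or server configuration rather than in `robots.txt`. As a minimal sketch, assuming a plain WSGI application (the `app` name and the HTML body are hypothetical), the signal can be sent both as a `meta` tag and as an `X-Robots-Tag` response header:

```python
def app(environ, start_response):
    """Hypothetical WSGI app that asks search engines not to index its page."""
    body = (
        b"<!doctype html><html><head>"
        # In-page form of the signal, for HTML documents.
        b'<meta name="robots" content="noindex">'
        b"</head><body>Internal page</body></html>"
    )
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Header form of the same signal; also usable for non-HTML files.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, you could serve it with the standard library, for example `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. Either form alone is enough; the page must not be blocked in `robots.txt`, or the crawler will never see it.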

---

### III. Example File

```txt
User-agent: *
Disallow: /private/
Disallow: /admin/
Disallow: /tmp/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /*.pdf$
Disallow: /*.zip$
```

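This configuration blocks crawlers from some common admin directories and file types. Be aware that blocking `/wp-content/` and `/wp-includes/` also blocks the CSS, JavaScript, and images stored there, which can stop search engines from rendering your pages properly; drop those two lines if that is a concern.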
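
Before deploying a file like this, it can help to run a quick smoke test against a handful of paths. The sketch below uses Python's standard-library `urllib.robotparser`; the `blocked_paths` helper, the `ExampleBot` agent name, and the sample paths are assumptions for illustration. It only checks the directory rules, since the standard-library parser may not interpret `*` and `$` the way Google does.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, kept as in-memory text.
EXAMPLE = """\
User-agent: *
Disallow: /private/
Disallow: /admin/
Disallow: /tmp/
Disallow: /wp-admin/
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /*.pdf$
Disallow: /*.zip$
"""


def blocked_paths(rules, paths, user_agent="ExampleBot"):
    """Return the subset of `paths` that `user_agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


if __name__ == "__main__":
    sample = ["/", "/blog/post.html", "/admin/login", "/wp-admin/options.php"]
    print(blocked_paths(EXAMPLE, sample))
    # ['/admin/login', '/wp-admin/options.php']
```

Keeping a list like `sample` alongside the file makes it easy to spot a rule that accidentally blocks public pages.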
