php实现基于确定有穷自动机算法的铭感词过滤

封面图 • 2025-10-21 • PHP

在 PHP 中实现基于**确定有穷自动机（DFA）**的敏感词过滤，是一种高效、稳定的方式。DFA 的优势在于可以在一次扫描中完成整个文本的匹配，时间复杂度为 O(n)，其中 n 是文本长度。

---

## ✅ 一、实现思路

1. **构建敏感词字典树（Trie）**  
   将所有敏感词构建成一个 Trie 树结构，每个节点代表一个字符，路径表示一个词语。

2. **构建 DFA 表**  
   将 Trie 树转换为 DFA 表，使得在扫描过程中，可以快速跳转到下一个状态，避免回溯。

3. **文本扫描**  
   使用 DFA 表对输入文本进行扫描，一旦到达某个终止状态（即敏感词结尾），就标记该位置为敏感词。

---

## ✅ 二、PHP 实现代码

### 1. 定义敏感词列表

```php
$keywords = ['敏感词1', '敏感词2', '测试词', '违法信息'];
```

---

### 2. 构建 Trie 树并生成 DFA 表

```php
class DfaFilter {
    private $trie; // Trie 树
    private $dfa;  // DFA 表

public function __construct($keywords) {
        $this->buildTrie($keywords);
        $this->buildDfa();
    }

private function buildTrie($keywords) {
        $this->trie = [];
        foreach ($keywords as $keyword) {
            $node = &$this->trie;
            for ($i = 0; $i < strlen($keyword); $i++) {
                $char = $keyword[$i];
                if (!isset($node[$char])) {
                    $node[$char] = [];
                }
                $node = &$node[$char];
            }
            $node['is_end'] = true; // 标记为敏感词结尾
        }
    }

private function buildDfa() {
        $this->dfa = [];

// 从根节点开始构建 DFA
        $queue = [&$this->trie];
        $visited = [];

while (!empty($queue)) {
            $nodeRef = array_shift($queue);
            $node = $nodeRef;

// 如果当前节点已经被处理过，跳过
            if (in_array($node, $visited, true)) continue;
            $visited[] = $node;

// 遍历当前节点的所有子节点
            foreach ($node as $char => $child) {
                if ($char === 'is_end') continue;

$nextState = $child;
                $stateId = spl_object_hash($node); // 用对象哈希作为状态 ID

// 计算转移后的状态
                $nextStateId = spl_object_hash($nextState);

// 设置 DFA 表：当前状态 -> 字符 -> 下一状态
                if (!isset($this->dfa[$stateId][$char])) {
                    $this->dfa[$stateId][$char] = $nextStateId;
                }

// 如果子节点是结束节点，则记录该状态为敏感词结尾
                if (isset($child['is_end'])) {
                    $this->dfa[$stateId]['is_end'] = true;
                }

// 将子节点加入队列
                $queue[] = &$nextState;
            }
        }
    }

public function filter($text) {
        $result = '';
        $currentState = spl_object_hash($this->trie); // 初始状态
        $currentWord = '';

for ($i = 0; $i < strlen($text); $i++) {
            $char = $text[$i];
            $currentWord .= $char;

// 查找 DFA 表中当前状态和字符对应的下一状态
            if (isset($this->dfa[$currentState][$char])) {
                $currentState = $this->dfa[$currentState][$char];
            } else {
                // 没有匹配到，重置状态
                $currentState = spl_object_hash($this->trie);
                $currentWord = '';
            }

// 检查是否是敏感词结尾
            if (isset($this->dfa[$currentState]['is_end'])) {
                $result .= '***'; // 替换为***
                $currentWord = ''; // 重置当前单词
                $currentState = spl_object_hash($this->trie); // 重置状态
            } else {
                $result .= $char;
            }
        }

return $result;
    }
}
```

---

### 3. 使用示例

```php
$keywords = ['敏感词1', '敏感词2', '测试词', '违法信息'];

$filter = new DfaFilter($keywords);
$text = "这是一段包含敏感词1和测试词的文本。";
$result = $filter->filter($text);

echo $result;
// 输出: 这是一段包含***和***的文本。
```

---

## ✅ 三、优化建议

- 可以将 `spl_object_hash` 替换为更稳定的唯一 ID，比如使用 `uniqid()` 或者手动编号。
- 对于大规模关键词库，可考虑使用 `array` 结构代替对象引用，提升性能。
- 支持多语言（如中文、英文等）的敏感词过滤。

---

## ✅ 四、总结

通过 DFA 算法实现的敏感词过滤具有以下优点：

- **高效**：一次遍历即可完成全文扫描。
- **准确**：能识别任意长度的敏感词。
- **可扩展**：支持动态添加/删除敏感词。

如果你需要支持更复杂的规则（如正则表达式、模糊匹配等），可以结合其他算法（如 Aho-Corasick）来增强功能。

---

如需我帮你封装成类或提供完整项目结构，请告诉我！