MySQL全文索引实现简单版搜索引擎实例代码

来源：华拓科技网

使用ngram分词解析器创建全文索引

1、对title字段建立全文索引（该字段没有固定的stopwords 分词，使用ngram分词解析器）

需先在my.cnf 配置文件中设置ngram_token_size（默认为2，2个字符作为ngram 的关键词），并重启mysql服务
这里使用默认的 2

select title from article limit 10;
+------------------------------------------------------------------------------+
| title |
+------------------------------------------------------------------------------+
| worth IT |
|Launchpad 江南皮革厂小show |
|Raw 幕后罕见一刻 “疯子”被抬回后台 |
|Raw:公子大骂老爸你就是个绿茶 公子以一打四 |
|四组30平米精装小户型，海量图片，附户型图 |
|夜店女王性感烟熏猫眼妆 |
|大秀哥重摔“巨石”强森 |
|少女时代 崔秀英 服饰科普 林允儿 黄美英 金泰妍 郑秀晶 | 
|德阳户外踏青，花田自助烧烤 |
+------------------------------------------------------------------------------+

2、对title字段创建全文索引

alter table article add fulltext index ft_index_title(title) with parser ngram;
Query OK, 0 rows affected (3 min 29.22 sec)
Records: 0 Duplicates: 0 Warnings: 0

3、会创建倒排索引（title字段越长长，创建的倒排索引越大）

112M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_1.ibd
28M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_2.ibd
20M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_3.ibd
140M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_4.ibd
128M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_5.ibd
668M Jul 5 21:46 FTS_00000000000000a7_00000000000000cd_INDEX_6.ibd

4、不建立全文索引搜索title的某个关键词

select count(*) from article where title like '%户外%';
+----------+
| count(*) |
+----------+
| 22058 |
+----------+
1 row in set (8.60 sec)

select count(*) from article where title like '%后台%';
+----------+
| count(*) |
+----------+
| 1142 |
+----------+

5、使用全文索引搜索某个关键词

响应时间有很大的提升

select count(*) from article where match(title) against('户外');
+----------+
| count(*) |
+----------+
| 22058 |
+----------+
1 row in set (0.07 sec)

select count(*) from article where title like '%后台%';
+----------+
| count(*) |
+----------+
| 1142 |
+----------+
1 row in set (8.31 sec)

6、注意当搜索的关键词字符数大于2 （ngram_token_size定义大小）会出现不一致问题

普通搜索，实际中出现该关键词的记录数为6
select count(*) from article where title like '%公子大%';
+----------+
| count(*) |
+----------+
| 6 |
+----------+
1 row in set (8.40 sec)

全文搜索，出现关键字的记录数为9443
select count(*) from article where match(title) against('公子大');
+----------+
| count(*) |
+----------+
| 9443 |
+----------+
1 row in set (0.06 sec)

实际出现该关键字的记录数为1
select count(*) from article where title like '%花田自助%';
+----------+
| count(*) |
+----------+
| 1 |
+----------+
1 row in set (8.33 sec)

全文搜索出现该关键词的记录数为3202
select count(*) from article where match(title) against('花田自助');
+----------+
| count(*) |
+----------+
| 3202 |
+----------+
1 row in set (0.06 sec)

结论

当mysql 某字段中有固定的stopword 分词（英文的空格符，中文的“,”"-"等），对该字段建立全文索引，能快速搜索出现某个关键词的相关记录信息，实现简单搜索引擎的效果

当mysql 某字段没有固定的stopword 分词，使用内置解析器ngram 可将字段值分成固定数量（ngram_token_size定义大小）的关键词快速进行搜索；当搜索的关键词的字符数量不等于ngram_token_size定义大小时，会出现与实际情况不一致的问题

全文索引能快速搜索，也存在维护索引的开销；字段长度越大，创建的全文索引也越大，会影响DML语句的吞吐量，可用专门的全文搜索引擎ES来做这件事

参考

InnoDB FULLTEXT Indexes

总结