Elasticsearch 2.2.0 Analysis: Chinese Word Segmentation


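Elasticsearch ships with many built-in analyzers, but none of the defaults handle Chinese well, so a separate plugin is needed. Two common choices are smartcn, which is based on ICTCLAS from the Chinese Academy of Sciences, and IKAnalyzer; both give reasonably good results. IKAnalyzer does not yet support the latest Elasticsearch release, 2.2.0. smartcn, by contrast, is officially supported: it provides an analyzer for Chinese or mixed Chinese-English text and works with 2.2.0. It does not support custom dictionaries, however, so treat it as a quick way to test. The second half of this article shows how to get IKAnalyzer working on 2.2.0.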

smartcn

Install the analyzer: plugin install analysis-smartcn

Uninstall: plugin remove analysis-smartcn

Test:

Request: POST http://127.0.0.1:9200/_analyze/

{
  "analyzer": "smartcn",
  "text": "联想是全球最大的笔记本厂商"
}

Response:

{
  "tokens": [
    {"token": "联想", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0},
    {"token": "是", "start_offset": 2, "end_offset": 3, "type": "word", "position": 1},
    {"token": "全球", "start_offset": 3, "end_offset": 5, "type": "word", "position": 2},
    {"token": "最", "start_offset": 5, "end_offset": 6, "type": "word", "position": 3},
    {"token": "大", "start_offset": 6, "end_offset": 7, "type": "word", "position": 4},
    {"token": "的", "start_offset": 7, "end_offset": 8, "type": "word", "position": 5},
    {"token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "word", "position": 6},
    {"token": "厂商", "start_offset": 11, "end_offset": 13, "type": "word", "position": 7}
  ]
}
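
The same request can be issued from a script. The following is a minimal Python sketch using only the standard library, assuming the node listens on http://127.0.0.1:9200 as in the example above. The helper names build_analyze_request, token_texts and analyze are made up for illustration and are not part of any Elasticsearch client.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://127.0.0.1:9200"):
    """Build (but do not send) a POST /_analyze/ request for the given analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(response):
    """Return only the token strings from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(analyzer, text, base_url="http://127.0.0.1:9200"):
    """Send the request; needs a running node with the analyzer installed."""
    req = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(req) as resp:
        return token_texts(json.load(resp))
```

Calling analyze("smartcn", "联想是全球最大的笔记本厂商") against a node with the analysis-smartcn plugin installed should return the token strings listed in the response above.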

For comparison, let's look at the result of the standard analyzer: in the request, replace smartcn with standard.

The response is then:

{
  "tokens": [
    {"token": "联", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0},
    {"token": "想", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1},
    {"token": "是", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2},
    {"token": "全", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3},
    {"token": "球", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4},
    {"token": "最", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5},
    {"token": "大", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6},
    {"token": "的", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7},
    {"token": "笔", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8},
    {"token": "记", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9},
    {"token": "本", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10},
    {"token": "厂", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11},
    {"token": "商", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12}
  ]
}

As you can see, the standard analyzer is essentially unusable for Chinese: every single character becomes its own token.
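
To make the observation concrete, the short Python sketch below reproduces the shape of that result for pure Chinese text: one token per character, with offsets and positions that simply count characters. It only imitates the observed output; it is not how the standard analyzer is implemented, and the function name per_character_tokens is hypothetical.

```python
def per_character_tokens(text):
    """Imitate the observed standard-analyzer output for pure Chinese text.

    Illustration only: one token per ideograph, offsets counted in characters.
    This is not the actual tokenizer implementation.
    """
    return [
        {
            "token": ch,
            "start_offset": i,
            "end_offset": i + 1,
            "type": "<IDEOGRAPHIC>",
            "position": i,
        }
        for i, ch in enumerate(text)
    ]


tokens = per_character_tokens("联想是全球最大的笔记本厂商")
print(len(tokens))         # 13: one token for each of the 13 characters
print(tokens[0]["token"])  # 联
```

A search for 笔记本 against such an index has to match three unrelated single-character tokens, which is why a dictionary-based analyzer is preferable.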

This article is an original work by 赛克蓝德 (secisland). Please credit the author and source when reposting.

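IKAnalyzer support for 2.2.0

The latest version on GitHub currently supports only Elasticsearch 2.1.1; the repository is at https://github.com/medcl/elasticsearch-analysis-ik. Since the latest Elasticsearch release is now 2.2.0, the plugin needs a small change before it will work.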

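1. Download the source code and extract it to any directory, then edit the pom.xml file in the elasticsearch-analysis-ik-master directory. Find the <elasticsearch.version> line and change the version number to 2.2.0.

2. Build the code: mvn package.

3. When the build finishes, the file elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.

4. Extract the archive into the Elasticsearch/plugins directory.

5. Add one line to the configuration file: index.analysis.analyzer.ik.type : "ik"

6. Restart Elasticsearch.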
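
Step 1 can be scripted if you rebuild the plugin often. The Python sketch below replaces the value of the <elasticsearch.version> element in the text of a pom.xml. The function name set_elasticsearch_version and the one-line sample pom fragment are assumptions made for illustration; the real pom.xml has more content around this element.

```python
import re


def set_elasticsearch_version(pom_text, version):
    """Return pom_text with the <elasticsearch.version> value replaced."""
    pattern = r"(<elasticsearch\.version>)[^<]*(</elasticsearch\.version>)"
    new_text, count = re.subn(
        pattern,
        lambda m: m.group(1) + version + m.group(2),
        pom_text,
    )
    if count == 0:
        raise ValueError("no <elasticsearch.version> element found")
    return new_text


# Hypothetical fragment standing in for the real pom.xml.
sample = "<properties><elasticsearch.version>2.1.1</elasticsearch.version></properties>"
print(set_elasticsearch_version(sample, "2.2.0"))
```

To apply it to the real file, read pom.xml as text, pass it through the function, write the result back, and then run mvn package as in step 2.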

Test: send the same request as above, but replace the analyzer with ik.

Response:

{
  "tokens": [
    {"token": "联想", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
    {"token": "全球", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 1},
    {"token": "最大", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 2},
    {"token": "笔记本", "start_offset": 8, "end_offset": 11, "type": "CN_WORD", "position": 3},
    {"token": "笔记", "start_offset": 8, "end_offset": 10, "type": "CN_WORD", "position": 4},
    {"token": "笔", "start_offset": 8, "end_offset": 9, "type": "CN_WORD", "position": 5},
    {"token": "记", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 6},
    {"token": "本厂", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7},
    {"token": "厂商", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 8}
  ]
}

As you can see, the two analyzers segment the same text differently.
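
The difference is easier to see as a set comparison. The Python sketch below uses only the token strings from the smartcn and ik responses shown earlier; the function name compare_tokens is hypothetical.

```python
def compare_tokens(tokens_a, tokens_b):
    """Return (tokens only in a, tokens only in b) as two sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return set_a - set_b, set_b - set_a


# Token strings copied from the two responses in this article.
smartcn_tokens = ["联想", "是", "全球", "最", "大", "的", "笔记本", "厂商"]
ik_tokens = ["联想", "全球", "最大", "笔记本", "笔记", "笔", "记", "本厂", "厂商"]

only_smartcn, only_ik = compare_tokens(smartcn_tokens, ik_tokens)
print("only in smartcn:", only_smartcn)
print("only in ik:", only_ik)
```

In this sample, smartcn keeps single-character words such as 是 and 的 and splits 最大 into two tokens, while ik drops 是 and 的, keeps 最大 whole, and additionally emits overlapping sub-words such as 笔记 and 本厂.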

To extend the dictionary, add the words you need to mydict.dic under config\ik\custom and restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
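
Some editors silently add a BOM when saving UTF-8, so it can be safer to write the dictionary from a script. The Python sketch below assumes the dictionary holds one word per line, and it writes to a temporary directory rather than the real config\ik\custom path. The helper names write_dictionary and has_bom are made up for illustration.

```python
import codecs
import os
import tempfile


def write_dictionary(path, words):
    """Write one word per line as UTF-8 without a byte order mark."""
    # The "utf-8" codec never emits a BOM; "utf-8-sig" would, so it is avoided.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for word in words:
            f.write(word + "\n")


def has_bom(path):
    """Return True if the file starts with the UTF-8 byte order mark."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


# Demo in a temporary directory; point the path at mydict.dic for real use.
demo_path = os.path.join(tempfile.mkdtemp(), "mydict.dic")
write_dictionary(demo_path, ["赛克蓝德"])
print(has_bom(demo_path))  # False
```

has_bom can also be run against an existing mydict.dic to check a file that was saved by hand.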

For example, after adding the word 赛克蓝德, run the query again:

Request: POST http://127.0.0.1:9200/_analyze/

Body:

{
  "analyzer": "ik",
  "text": "赛克蓝德是一家数据安全公司"
}

Response:

{
  "tokens": [
    {"token": "赛克蓝德", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0},
    {"token": "克", "start_offset": 1, "end_offset": 2, "type": "CN_WORD", "position": 1},
    {"token": "蓝", "start_offset": 2, "end_offset": 3, "type": "CN_WORD", "position": 2},
    {"token": "德", "start_offset": 3, "end_offset": 4, "type": "CN_CHAR", "position": 3},
    {"token": "一家", "start_offset": 5, "end_offset": 7, "type": "CN_WORD", "position": 4},
    {"token": "一", "start_offset": 5, "end_offset": 6, "type": "TYPE_CNUM", "position": 5},
    {"token": "家", "start_offset": 6, "end_offset": 7, "type": "COUNT", "position": 6},
    {"token": "数据", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 7},
    {"token": "安全", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 8},
    {"token": "公司", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 9}
  ]
}