构建停用词
添加索引
构建停用词,注意stopwords.txt必须在每个es的config目录下。
PUT http://<ip>/my_index_new_new
{
"settings": {
"analysis": {
"filter": {
"stop_filter": {
"type": "stop",
"stopwords_path": "stopwords.txt"
}
},
"analyzer": {
"my_analyzer": {
"tokenizer": "jieba_index",
"filter": [
"stop_filter"
]
}
}
}
}
}
mapping使用停用词
POST http://<ip>/my_index_new_new/testtext/_mapping?include_type_name=true
{
"properties": {
"content": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "my_analyzer"
}
}
}
验证
构建几篇文档
POST http://<ip>/my_index_new_new/testtext/1
{"content":"除此以外守护星停售了吗"}
POST http://<ip>/my_index_new_new/testtext/3
{"content":"公安部:各地校车将享最高路权"}
POST http://<ip>/my_index_new_new/testtext/6
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
查询:
不带停用词的:
POST http://<ip>/my_index2/testtext/_search
{
"query": {
"match": {
"content": "所以守护星停售了吗"
}
},
"highlight": {
"pre_tags": ["<font color='red'>", "<tag2>"],
"post_tags": ["</font>", "</tag2>"],
"fields": {
"content": {}
}
}
}
返回:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.6499447,
"hits": [
{
"_index": "my_index2",
"_type": "testtext",
"_id": "7",
"_score": 1.6499447,
"_source": {
"content": "所以守护星停售了吗"
},
"highlight": {
"content": [
"<font color='red'>所以</font><font color='red'>守护</font><font color='red'>星</font><font color='red'>停售</font><font color='red'>了</font><font color='red'>吗</font>"
]
}
}
]
}
}
带停用词的:
POST http://<ip>/my_index_new_new/testtext/_search
{
"query": {
"match": {
"content": "所以守护星停售了吗"
}
},
"highlight": {
"pre_tags": ["<font color='red'>", "<tag2>"],
"post_tags": ["</font>", "</tag2>"],
"fields": {
"content": {}
}
}
}
返回:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.48773503,
"hits": [
{
"_index": "my_index_new_new",
"_type": "testtext",
"_id": "7",
"_score": 0.48773503,
"_source": {
"content": "所以守护星停售了吗"
},
"highlight": {
"content": [
"所以<font color='red'>守护</font><font color='red'>星</font><font color='red'>停售</font>了吗"
]
}
},
{
"_index": "my_index_new_new",
"_type": "testtext",
"_id": "6",
"_score": 0.48773503,
"_source": {
"content": "除此以外守护星停售了吗"
},
"highlight": {
"content": [
"除此以外<font color='red'>守护</font><font color='red'>星</font><font color='red'>停售</font>了吗"
]
}
}
]
}
}
可以看到,加了停用词的,highlight部分 “除此之外”和“了吗”都没有被匹配上了。
PS:切词测试:
POST http://<ip>/_analyze
{
"analyzer":"jieba_index",
"text":"除此而外守护星停售了吗"
}
response:
{
"tokens": [
{
"token": "除此",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "除此而外",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "此而",
"start_offset": 1,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "而外",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "守护",
"start_offset": 4,
"end_offset": 6,
"type": "word",
"position": 2
},
{
"token": "守护星",
"start_offset": 4,
"end_offset": 7,
"type": "word",
"position": 2
},
{
"token": "停售",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 3
},
{
"token": "了",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 4
},
{
"token": "吗",
"start_offset": 10,
"end_offset": 11,
"type": "word",
"position": 5
}
]
}
refer:https://my.oschina.net/wyn365/blog/3198190
https://www.cnblogs.com/dengzhizhong/p/6373333.html
https://www.elastic.co/guide/cn/elasticsearch/guide/current/using-stopwords.html
https://qbox.io/blog/how-to-use-elasticsearch-remove-stopwords-from-query
https://github.com/sing1ee/elasticsearch-jieba-plugin