Adding Stopwords to Elasticsearch

Build the stopword list

stopwords.txt must be present in the config directory of every Elasticsearch node; the stopwords_path setting below is resolved relative to that directory.
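
The file itself is plain UTF-8 text with one stopword per line. Its exact contents are not shown here; judging from the query results further down, it contains at least entries along these lines:

所以
除此以外
了
吗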

Create the index

PUT http://<ip>/my_index_new_new
{
	"settings": {
		"analysis": {
			"filter": {
				"stop_filter": {
					"type": "stop",
					"stopwords_path": "stopwords.txt"
				}
			},
			"analyzer": {
				"my_analyzer": {
					"tokenizer": "jieba_index",
					"filter": [
						"stop_filter"
					]
				}
			}
		}
	}
}
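
Before wiring the analyzer into a mapping, you can sanity-check it with the _analyze API. Assuming 所以, 了 and 吗 are listed in stopwords.txt, only the content-bearing tokens (守护, 守护星, 停售) should come back:

POST http://<ip>/my_index_new_new/_analyze
{
	"analyzer": "my_analyzer",
	"text": "所以守护星停售了吗"
}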

Use the analyzer in the mapping

POST http://<ip>/my_index_new_new/testtext/_mapping?include_type_name=true
{
	"properties": {
		"content": {
			"type": "text",
			"analyzer": "my_analyzer",
			"search_analyzer": "my_analyzer"
		}
	}
}
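
To confirm the field is actually using the custom analyzer, fetch the mapping back:

GET http://<ip>/my_index_new_new/_mapping?include_type_name=true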

Verification

Index a few documents

POST http://<ip>/my_index_new_new/testtext/6
{"content":"除此以外守护星停售了吗"}
POST http://<ip>/my_index_new_new/testtext/7
{"content":"所以守护星停售了吗"}
POST http://<ip>/my_index_new_new/testtext/3
{"content":"公安部:各地校车将享最高路权"}
POST http://<ip>/my_index_new_new/testtext/1
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}

Queries:

Without stopwords (my_index2, an index built without the stop filter):

POST http://<ip>/my_index2/testtext/_search
{
	"query": {
		"match": {
			"content": "所以守护星停售了吗"
		}
	},
	"highlight": {
		"pre_tags": ["<font color='red'>", "<tag2>"],
		"post_tags": ["</font>", "</tag2>"],
		"fields": {
			"content": {}
		}
	}
}

Response:

{
    "took": 12,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.6499447,
        "hits": [
            {
                "_index": "my_index2",
                "_type": "testtext",
                "_id": "7",
                "_score": 1.6499447,
                "_source": {
                    "content": "所以守护星停售了吗"
                },
                "highlight": {
                    "content": [
                        "<font color='red'>所以</font><font color='red'>守护</font><font color='red'>星</font><font color='red'>停售</font><font color='red'>了</font><font color='red'>吗</font>"
                    ]
                }
            }
        ]
    }
}

With stopwords (my_index_new_new):

POST http://<ip>/my_index_new_new/testtext/_search
{
	"query": {
		"match": {
			"content": "所以守护星停售了吗"
		}
	},
	"highlight": {
		"pre_tags": ["<font color='red'>", "<tag2>"],
		"post_tags": ["</font>", "</tag2>"],
		"fields": {
			"content": {}
		}
	}
}

Response:

{
    "took": 25,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 2,
            "relation": "eq"
        },
        "max_score": 0.48773503,
        "hits": [
            {
                "_index": "my_index_new_new",
                "_type": "testtext",
                "_id": "7",
                "_score": 0.48773503,
                "_source": {
                    "content": "所以守护星停售了吗"
                },
                "highlight": {
                    "content": [
                        "所以<font color='red'>守护</font><font color='red'>星</font><font color='red'>停售</font>了吗"
                    ]
                }
            },
            {
                "_index": "my_index_new_new",
                "_type": "testtext",
                "_id": "6",
                "_score": 0.48773503,
                "_source": {
                    "content": "除此以外守护星停售了吗"
                },
                "highlight": {
                    "content": [
                        "除此以外<font color='red'>守护</font><font color='red'>星</font><font color='red'>停售</font>了吗"
                    ]
                }
            }
        ]
    }
}

As you can see, with the stopword filter in place, "所以", "除此以外" and "了吗" are no longer matched in the highlighted output. Because my_analyzer is used at both index and search time, those tokens are stripped from the documents and from the query, so they can neither match nor be highlighted.
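
One caveat: stopwords.txt is read when the analyzer is built, so an open index will not pick up later edits to the file. The usual way to reload it (a sketch; closing an index makes it briefly unavailable, and documents indexed before the change keep their old tokens until reindexed):

POST http://<ip>/my_index_new_new/_close
POST http://<ip>/my_index_new_new/_open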

PS: a tokenization test with the raw jieba_index tokenizer (no stop filter applied):

POST http://<ip>/_analyze
{
	"analyzer":"jieba_index",
	"text":"除此而外守护星停售了吗"
}

Response:

{
    "tokens": [
        {
            "token": "除此",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "除此而外",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "此而",
            "start_offset": 1,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "而外",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 1
        },
        {
            "token": "守护",
            "start_offset": 4,
            "end_offset": 6,
            "type": "word",
            "position": 2
        },
        {
            "token": "守护星",
            "start_offset": 4,
            "end_offset": 7,
            "type": "word",
            "position": 2
        },
        {
            "token": "停售",
            "start_offset": 7,
            "end_offset": 9,
            "type": "word",
            "position": 3
        },
        {
            "token": "了",
            "start_offset": 9,
            "end_offset": 10,
            "type": "word",
            "position": 4
        },
        {
            "token": "吗",
            "start_offset": 10,
            "end_offset": 11,
            "type": "word",
            "position": 5
        }
    ]
}

References:
https://my.oschina.net/wyn365/blog/3198190
https://www.cnblogs.com/dengzhizhong/p/6373333.html
https://www.elastic.co/guide/cn/elasticsearch/guide/current/using-stopwords.html
https://qbox.io/blog/how-to-use-elasticsearch-remove-stopwords-from-query
https://github.com/sing1ee/elasticsearch-jieba-plugin
