
Elasticsearch in Plain Language 23: Deep Dive into Search Techniques, Implementing Index-Time Search Suggestions with ngram Tokenization

Date: 2021-03-21 04:16:44


Contents

Overview
Official documentation
What is an ngram
What is an edge ngram
How ngrams enable index-time search suggestions
Example

Overview

Continuing to learn ES from the instructor 中华石杉; this is part 23.

Course link: /view/55

Official documentation

NGram Tokenizer:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

NGram Token Filter:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenfilter.html

Edge NGram Tokenizer:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html

Edge NGram Token Filter:

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html

What is an ngram

Take the word quick. Its ngrams at the five possible lengths are:

ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick

Each of the fragments produced by this splitting is called an ngram.
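The splitting above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements its ngram tokenizer; the helper name `ngrams` is made up for this example.

```python
from typing import List


def ngrams(word: str, n: int) -> List[str]:
    """Return every contiguous substring of `word` that has length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Print the ngrams of "quick" for lengths 1 through 5.
for n in range(1, 6):
    print(n, ngrams("quick", n))
```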

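What is an edge ngram

Take quick again, but anchor every ngram at the first letter:

q

qu

qui

quic

quick

This way of splitting is called edge ngram.

With edge ngram, every word is split further at index time, and the resulting ngrams are what make prefix-based search suggestions possible.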
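As a rough sketch of the same idea in Python: an edge ngram is simply a prefix of the word. The function name `edge_ngrams` and its default bounds are assumptions chosen to mirror the `min_gram` and `max_gram` settings used later in this article; this is not Elasticsearch's own code.

```python
from typing import List


def edge_ngrams(word: str, min_gram: int = 1, max_gram: int = 20) -> List[str]:
    """Return the prefixes of `word` whose length lies in [min_gram, max_gram]."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("quick"))
```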

For example, take two docs:

doc1 hello world

doc2 hello we

Split them with edge ngram:

h

he

hel

hell

hello -------> matches doc1, doc2

w -------> matches doc1, doc2

wo

wor

worl

world

we ---------> matches doc2

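Now search with hello w:

hello --> hello, found in doc1

w --> w, found in doc1

doc1 contains both hello and w, and their positions match as well, so doc1 (hello world) is returned. doc2 (hello we) qualifies for the same reason, since we also produces the edge ngram w.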
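The walk-through above can be reproduced with a toy inverted index that stores word positions. This is a simplified sketch under stated assumptions: words are split on whitespace, every edge ngram of a word shares that word's position, and the helper names (`build_index`, `phrase_search`) are invented for illustration.

```python
from collections import defaultdict


def edge_ngrams(word, min_gram=1, max_gram=20):
    """Prefixes of `word` with length between min_gram and max_gram."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


def build_index(docs):
    """Map each edge ngram to {doc_id: set of word positions}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            for gram in edge_ngrams(word):
                index[gram][doc_id].add(position)
    return index


def phrase_search(index, query):
    """Return doc ids in which the query terms sit at consecutive positions."""
    terms = query.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Term k of the query must appear at position start + k.
            if all(start + k in index[term].get(doc_id, ())
                   for k, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {"doc1": "hello world", "doc2": "hello we"}
index = build_index(docs)
print(phrase_search(index, "hello w"))
print(phrase_search(index, "hello wo"))
```

In this sketch, hello w finds both docs because world and we each contribute the edge ngram w at position 1, while the longer prefix hello wo narrows the result to doc1 only.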

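How ngrams enable index-time search suggestions

At search time there is no longer any need to take a prefix and scan the whole inverted index for it. The prefix is simply looked up in the inverted index as an ordinary term; if it is there, the document matches, exactly as in a full-text match query.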
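The difference between the two approaches can be sketched as follows. The term list is a hypothetical example, and both functions are simplified stand-ins rather than Elasticsearch internals: the first one walks the term dictionary at query time, the second pays the cost once at index time and then does a single dictionary lookup.

```python
terms = ["hello", "help", "we", "world"]  # hypothetical term dictionary


def query_time_prefix(term_list, prefix):
    """Query-time approach: scan every term and test the prefix."""
    return [t for t in term_list if t.startswith(prefix)]


def build_ngram_index(term_list):
    """Index-time approach: store every edge ngram of every term up front."""
    ngram_index = {}
    for t in term_list:
        for n in range(1, len(t) + 1):
            ngram_index.setdefault(t[:n], []).append(t)
    return ngram_index


def index_time_lookup(ngram_index, prefix):
    """A single exact lookup replaces the scan."""
    return ngram_index.get(prefix, [])


ngram_index = build_ngram_index(terms)
print(query_time_prefix(terms, "hel"))
print(index_time_lookup(ngram_index, "hel"))
```

The trade-off is the usual one: the ngram index is larger, but each lookup no longer depends on the size of the term dictionary.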

Example

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  }
}

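To see what min_gram and max_gram do, take the word helloworld with these settings:

min_gram = 1
max_gram = 3

With edge_ngram, the word is split into the following:

h
he
hel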
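The chain configured in the autocomplete analyzer (tokenize, lowercase, then edge_ngram) can be approximated in Python. This is only a sketch: splitting on whitespace is an assumed stand-in for the standard tokenizer, which does considerably more, and `autocomplete_analyze` is a hypothetical name.

```python
def autocomplete_analyze(text, min_gram=1, max_gram=20):
    """Approximate the analyzer chain: split, lowercase, then edge ngrams."""
    tokens = []
    for word in text.split():          # stand-in for the standard tokenizer
        word = word.lower()            # lowercase filter
        longest = min(max_gram, len(word))
        for n in range(min_gram, longest + 1):  # edge_ngram filter
            tokens.append(word[:n])
    return tokens


print(autocomplete_analyze("helloworld", min_gram=1, max_gram=3))
print(autocomplete_analyze("Hello World", min_gram=1, max_gram=3))
```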

Key point: autocomplete

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "hello world"
}

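Set the mapping. The field is indexed with the autocomplete analyzer, while queries still use the standard analyzer. The mapping has to target content, the field used by the documents and the query in this example:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "standard"
    }
  }
}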

Create some test data

PUT /my_index/my_type/1
{
  "content": "hello Jack"
}

PUT /my_index/my_type/2
{
  "content": "hello John"
}

PUT /my_index/my_type/3
{
  "content": "hello Jose"
}

Query

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "content": "hello J"
    }
  }
}

Response:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "content": "hello John"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "content": "hello Jack"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "content": "hello Jose"
        }
      }
    ]
  }
}

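With match, documents that contain only hello are returned as well, because it is an ordinary full-text search; they just score lower.

match_phrase is recommended instead: it requires every term to be present and each term's position to be exactly one after the previous term's, which is the behavior we want here.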
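A minimal sketch of what the recommended request could look like, built as a plain Python dictionary so it can be sent with any HTTP client. The index, type, and field names are taken from the example above; the helper `suggest_query` is hypothetical, and the request is only printed here, not executed.

```python
import json


def suggest_query(field, text):
    """Build a match_phrase search body for the given field and input text."""
    return {"query": {"match_phrase": {field: text}}}


body = suggest_query("content", "hello J")

# Print the request in the same console style used in this article.
print("GET /my_index/my_type/_search")
print(json.dumps(body, indent=2))
```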
