ES_API集合原创

# 基本的API

# 查看集群的健康状况

GET _cluster/health

当颜色为green，主分片与副本都正常分配；当颜色为Yellow，主分片全部正常分配，有副本分片未能正常分配；当颜色为Red，有主分片未能分配。

# 查看所有index

GET -cat/indices
#查看indices
GET /_cat/indices/kibana*?v&s=index
#查看状态为绿的索引
GET /_cat/indices?v&health=green
#按照文档个数排序
GET /_cat/indices?v&s=docs.count:desc
#查看具体的字段
GET /_cat/indices/kibana*?pri&v&h=health,index,pri,rep,docs.count,mt
#How much memory is used per index?
GET /_cat/indices?v&h=i,tm&s=tm:desc

1
2
3
4
5
6
7
8
9
10
11

# 查看索引的mapping和setting设置

GET kibana_sample_data_ecommerce
#单独看mapping
GET mapping_test/_mapping

1
2
3

# 查看索引中文档的总数

GET kibana_sample_data_ecommerce/_count

# 查看前10条文档

POST kibana_sample_data_ecommerce/_search
{
}

1
2
3

# 显式mapping设置

# 自定义mapping的建议

可以参考API手册，纯手写
为了减少输入的工作量，减少出错概率，可以依照以下步骤
- 创建一个临时的index，写入一些样本数据
- 通过访问mapping api获得该临时文件的动态mapping定义
- 修改后用该配置创建我们需要的索引
- 删除临时索引

# 控制当前字段是否被索引

mapping中的index字段是用来控制当前字段是否被索引。默认为true，如果设置成false，该字段不可被索引。

PUT users
{
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text"
        },
        "lastName" : {
          "type" : "text"
        },
        "mobile" : {
          "type" : "text",
          "index": false
        }
      }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

# 控制倒排索引记录的内容

mapping中的index options字段，有四种不同级别的设置，用来控制倒排索引记录的内容

docs 记录doc id
freqs 记录doc id和term frequencies
positions 记录doc id / term frequenies / term position
offsets 记录doc id / term frequenies / term position / character offects

Text类型默认记录positions，其他默认为docs

记录内容越多，占用存储空间越大

PUT users
{
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text"
        },
        "lastName" : {
          "type" : "text"
        },
        "mobile" : {
          "type" : "text",
          "index_options": "offsets"
        }
      }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

# 实现对NULL值进行搜索

mapping设置中，可以设置某个字段，"null_value":"NULL"，这样的话，就能实现对NULL值实现搜索。

注意的是，只有keyword类型支持设定 null_value

PUT users
{
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text"
        },
        "lastName" : {
          "type" : "text"
        },
        "mobile" : {
          "type" : "keyword",
          "null_value": "NULL"
        }

      }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

# copy_to设置

可以满足一些场景，这个场景是输入的查询的字段的值，是已有的各个字段的结合的场景。例如输入的fullname的值，要匹配到mapping中的firstname和lastname。

_all在7中被copy_to所替代
满足一些特定的搜索需求
copy_to将字段的数值拷贝到目标字段，实现类似_all的作用
copy_to的目标字段不出现在_source中

PUT users
{
  "mappings": {
    "properties": {
      "firstName":{
        "type": "text",
        "copy_to": "fullName"
      },
      "lastName":{
        "type": "text",
        "copy_to": "fullName"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

# 数组类型

es中不提供专门的数组类型。但是任何字段，都可以包含多个相同类型的数值。

PUT users/_doc/1
{
  "name":"twobirds",
  "interests":["reading","music"]
}

1
2
3
4
5

# 定义Index Alias

通过设置索引的别名，是的零停机运维。

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "movies-2019",
        "alias": "movies-latest"
      }
    }
  ]
}

1
2
3
4
5
6
7
8
9
10
11

设置alias创建不同查询的视图

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "movies-2019",
        "alias": "movies-lastest-highrate",
        "filter": {
          "range": {
            "rating": {
              "gte": 4
            }
          }
        }
      }
    }
  ]
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

# 常见错误返回

问题	原因
无法连接	网络故障或集群挂了
连接无法关闭	网络故障或节点出错
429	集群过于繁忙
4XX	请求体格式有错
500	集群内部错误

# _analyzer API

analyzer api是ES自带的API，目的是用来分析和测试研究ES如何进行分词的。有如下三种使用的方式：

第一种方式：通过GET直接指定Analyzer来进行测试

GET /_analyze
{
  "analyzer": "standard"
  "text": "Mastering Elasticsearch, elasticsearch is Action"
}

1
2
3
4
5

第二种方式：通过POST索引名上的某个字段来进行测试

POST books/_analyze
{
  "field": "title",
  "text": "Mastering Elasticsearch"
}

1
2
3
4
5

第三种方式：通过POST自定义分词器来进行测试

POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Mastering Elasticsearch"
}

1
2
3
4
5
6

# Create 一个文档

# Index的方法与create方法

# index方法与create方法的联系

都可以用于创建一个新的文档

# index方法与create方法的区别

put和post上面的区别

index只有put方法，并且一定要带上文档id
create可以使用put 或post

是否需要指定文档id

需要指定文档id，只能用PUT
不指定文档id的情况下，只能用POST

对待已经存在文档id的情况(指定了文档ID，只能用PUT)

index方法中，如果文档的ID不存在，那么久创建新的文档。否则会先删除现有的文档，再创建新的文档，版本会增加。
create方法中，如果文档的ID已经存在，会创建失败。

是否想要自动生成文档id的情况

index方法无法自动生成文档id
create方法可以自动生成文档id

# index方法的API(只有PUT)

PUT my_index/_doc/1
{
    "user" : "Jack",
    "post_date" : "2019-05-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

1
2
3
4
5
6

# create方法的API(PUT带上文档ID)

PUT my_index/_create/1
{
    "user" : "Jack",
    "post_date" : "2019-05-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
PUT users/_doc/1?op_type=create
{
    "user" : "Jack",
    "post_date" : "2019-05-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

1
2
3
4
5
6
7
8
9
10
11
12

# create方法的API(POST不带上文档ID)

POST users/_doc
{
	"user" : "Mike",
    "post_date" : "2019-04-15T14:12:12",
    "message" : "trying out Kibana"
}

1
2
3
4
5
6

# Read一个文档

GET my_index/_doc/1

找到文档，会返回HTTP 200；找不到文档，会返回HTTP 404

# Update一个文档

#Update 指定 ID  (先删除，在写入)
#文档必须已经存在，更新只会对相应字段做增量修改
#POST方法中，payload需要包含在"doc"中
POST users/_update/1
{
	"doc":
	{
	  "user" : "user1",
      "post_date" : "2019-04-15T14:12:12",
      "message" : "trying out Kibanadddddd"
	}
}

1
2
3
4
5
6
7
8
9
10
11
12

# Delete一个文档

### Delete by Id
# 删除文档
DELETE users/_doc/1

1
2
3

# Bulk API

支持在一次API调用中，对不同的索引进行操作
支持四种类型操作：index \ create \ update \ delete
可以在URL中指定index，也可以在请求的payload中进行
操作中单条操作失败，并不会影响其他操作
返回结果包括了每一条操作执行的结果

POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test2", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

1
2
3
4
5
6
7
8

其中"index"/"delete"/"create"/"update"是CRUD中的CUD。

"filed1"和"filed2"是相应的列名，后面跟的是value值。

# mget批量读取

批量操作，可以减少网络连接所产生的开销，提高性能

### mget 操作
GET /_mget
{
    "docs" : [
        {
            "_index" : "test",
            "_id" : "1"
        },
        {
            "_index" : "test",
            "_id" : "2"
        }
    ]
}
#URI中指定index
GET /test/_mget
{
    "docs" : [
        {

            "_id" : "1"
        },
        {

            "_id" : "2"
        }
    ]
}
GET /_mget
{
    "docs" : [
        {
            "_index" : "test",
            "_id" : "1",
            "_source" : false
        },
        {
            "_index" : "test",
            "_id" : "2",
            "_source" : ["field3", "field4"]
        },
        {
            "_index" : "test",
            "_id" : "3",
            "_source" : {
                "include": ["user"],
                "exclude": ["user.location"]
            }
        }
    ]
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51

# msearch 批量查询

POST kibana_sample_data_ecommerce/_msearch
{}
{"query" : {"match_all" : {}},"size":1}
{"index" : "kibana_sample_data_flights"}
{"query" : {"match_all" : {}},"size":2}

1
2
3
4
5

# mget与msearch的区别

mget是通过文档ID列表得到文档信息，而msearch是根据查询条件，搜索到相应文档。

# 单次批量操作注意事项

单次批量操作，数据量不宜过大，以免引发性能问题。

一般建议是1000-5000个文档，如果文档很大，可以适当减少队列，大小建议为5-15MB，默认不能超过100M，会报错。

# Search API

返回体中的字段意思

took	花费的时间
total	符合条件的总文档数
hints	结果集，默认是前10个文档
_index	索引名
_id	文档的ID
_score	相关度评分
_source	文档原始信息

# URI Search

也就是在URL中使用查询参数

# URI Search基本查询

#基本查询
GET /movies/_search?q=2012&df=title&sort=year:desc&from=0&size=10&timeout=1s
{
  "profile":"true"
}

1
2
3
4
5

q指定查询语句，使用Query String Syntax
df默认字段，是指查询的字段filed，不指定时就是对所有字段
Sort排序
from 和size用于分页
Profile可以查看查询是如何被执行的（查看使用了什么查询）

# URI Search指定字段查询

GET /movies/_search?q=2012&df=title
{
	"profile":"true"
}
#或者没有df字段，使用title:2012的方式
GET /movies/_search?q=title:2012
{
	"profile":"true"
}

1
2
3
4
5
6
7
8
9

就是查询title字段是2012的文档

# URI Search泛查询

#泛查询，正对_all,所有字段
GET /movies/_search?q=2012
{
	"profile":"true"
}

1
2
3
4
5

# URI Search Term查询

简而言之就是指定了term单词的查询，和上面指定了title是2012就是term query。和上面的指定字段查询类似

GET /movies/_search?q=title:2012
{
	"profile":"true"
}

1
2
3
4

# URI Search Phrase查询

也就是带多个单词，词组的查询。这里要考虑要引号和括号(分组)的不同含义。

# 不加引号和括号

例如查找美丽心灵的影片，使用q=title:Beautiful Mind，通过profile的查询过程可以看出使用的是BooleanQuery，等效于tile字段中包含Beatiful，其他字段(也包含title字段)。

GET /movies/_search?q=title:Beautiful Mind
{
	"profile":"true"
}

1
2
3
4

# 加上引号

"Beautiful Mind",等效于Beautiful AND Mind.这是PhraseQuery，在查询的时候要求这两个单词要同时出现，并且还要求前后的顺序保持一致。

#使用引号，Phrase查询
GET /movies/_search?q=title:"Beautiful Mind"
{
	"profile":"true"
}

1
2
3
4
5

# 加上括号

和前面不加上括号的相比，这是布尔查询，意思是在title字段中，或者出现Beautiful或者出现Mind，或者都可以出现。

#分组，Bool查询
GET /movies/_search?q=title:(Beautiful Mind)
{
	"profile":"true"
}

1
2
3
4
5

# 布尔操作

在单词之间加入AND/OR/NOT或者&& / || !

必须要大写

AND：意思是必须包含Beautiful和Mind，但是对之间的顺序没有要求

# 查找美丽心灵
GET /movies/_search?q=title:(Beautiful AND Mind)
{
	"profile":"true"
}

1
2
3
4
5

OR：意思是或者包含Beautiful，或者包含Mind，或者包含这两个，但是对这两个词的顺序没有要求

GET /movies/_search?q=title:(Beautiful OR Mind)
{
	"profile":"true"
}

1
2
3
4

NOT：意思是包含Beautiful，不包含Mind

GET /movies/_search?q=title:(Beautiful NOT Mind)
{
	"profile":"true"
}

1
2
3
4

# 分组

+：表示must，也就是一定要有，must后面跟的单词是必须要存在的。

%2B：表示的是+的意思。

-：表示must_not，也就是后面跟的单词一定要不存在。

注意：

单独使用+的时候，没有任何含义，和没有+一个意思。

单独使用-的时候，是有意义的。

GET /movies/_search?q=title:(+Beautiful -Mind)
{
	"profile":"true"
}

1
2
3
4

# 范围查询

区间表示：[]闭区间，{}开区间

year:{2019 TO 2018]
year:[* TO 2018]
year:[2002 TO 2018%7D 等同于[2002 TO 2018}

GET /movies/_search?q=title:beautiful AND year:[2002 TO 2018]
{
	"profile":"true"
}

1
2
3
4

# 算数符号

year:>2010
year:(>2010 && <=2018)
year:(+>2010 +<=2018)

GET /movies/_search?q=year:(>2010 AND  <=2018)
{
	"profile":"true"
}

1
2
3
4

# 通配符查询

通配符查询(通配符查询效率低，占用内存大，不建议使用。特别是放在最前面)

? 代表1个字符，* 代表0或多个字符

title:mi?d
title:be*

#通配符查询
GET /movies/_search?q=title:b*
{
	"profile":"true"
}

1
2
3
4
5

# 模糊匹配

title:befutifl~1

就是指当有单词输入错误的时候，做一个模糊的匹配

GET /movies/_search?q=title:beautifl~1
{
	"profile":"true"
}

1
2
3
4

# 近似匹配

当输入的多个单词的时候，不要求他们一定要保证相应的顺序出现，但是两个都必须存在。

GET /movies/_search?q=title:"Lord Rings"~2
{
	"profile":"true"
}

1
2
3
4

# Request Body Search

使用的ES内置的，也就是说在查询体中带上JSON格式，使用特定的DSL的语言方法来查询，使用最为广泛。

分页查询/排序/source过滤字段/脚本字段，这些配置都是在定义query的查询体的上面来定义的。

# 分页查询

在不添加任何"from"和"size"的时候，from默认是从0开始，size是10。也就是从每个主分片上捞取from+size的数据汇聚，然后捞取前10条数据返回。

"from"的意思就是定义了要查询数据的起始位置，如果from是10，那么就是说从第10条开始数据返回。

"size"的意思是起始位置开始要返回的数据量的大小，如果size是10，那么就要从from的起始位置开始，返回10条数据。

获取靠后的翻页成本较高。

POST /kibana_sample_data_ecommerce/_search
{
  "from":10,
  "size":20,
  "query":{
    "match_all": {}
  }
}

1
2
3
4
5
6
7
8

# 排序

最好在"数字型"与"日期型"字段上排序

因为对于多值类型或分析过的字段排序，系统会选一个值，无法得知该值。

#对日期排序
POST kibana_sample_data_ecommerce/_search
{
  "sort":[{"order_date":"desc"}],
  "query":{
    "match_all": {}
  }
}

1
2
3
4
5
6
7
8

# source字段过滤

就是针对查询出来的结果过滤到相应的字段的查询。

如果_source没有存储，那就只返回匹配的文档的元数据

_source支持使用通配符，例如_source["name*","desc*"]

#source filtering
POST kibana_sample_data_ecommerce/_search
{
  "_source":["order_date"],
  "query":{
    "match_all": {}
  }
}

1
2
3
4
5
6
7
8

# 脚本字段

针对于当订单中有不同的汇率，需要结合汇率对订单价格进行排序的情况。

GET kibana_sample_data_ecommerce/_search
{
  "script_fields": {
    "new_field": {
      "script": {
        "lang": "painless",
        "source": "doc['order_date'].value+'hello'"
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14

# 全文查询

# match_all的查询

在query下面使用match_all来查询，默认返回10条数据。

#ignore_unavailable=true，可以忽略尝试访问不存在的索引“404_idx”导致的报错
#查询movies分页
POST /movies,404_idx/_search?ignore_unavailable=true
{
  "profile": true,
	"query": {
		"match_all": {}
	}
}

1
2
3
4
5
6
7
8
9

# match的查询

查询单个单词的时候像是URI Search中的term search，如果查询多个单词的时候又有点像URI Search中的phrase search。

#使用查询表达式 match，默认是OR的方式
POST movies/_search
{
  "query": {
    "match": {
      "title": "last christmas"
    }
  }
}
可以使用AND的方式
POST movies/_search
{
  "query": {
    "match": {
      "title": {
        "query": "last christmas",
        "operator": "and"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

# match phrase查询

专门用于短语的搜索，感觉功能上比match query要更灵活一些。

#加入了slop的时候，就是允许在这些短语的中间加入一个其他的单词
POST movies/_search
{
  "query": {
    "match_phrase": {
      "title":{
        "query": "one love",
        "slop": 1
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12

# Query String Query

query string query和下面的simple query string query使用场景比较少。

query string 比simple query string使用起来更加灵活。

query string 有点类似于URI Query

#里面也有和URI Search中的DF字段
POST users/_search
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Ruan AND Yiming"
    }
  }
}
#也支持多个字段的查询，和URI Search中一样，括号代表分组
POST users/_search
{
  "query": {
    "query_string": {
      "fields":["name","about"],
      "query": "(Ruan AND Yiming) OR (Java AND Elasticsearch)"
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

# Simple Query String Query

类似Query String，但是会忽略错误的语法，同时只支持部分查询语法
不支持AND OR NOT，会当作字符串处理
Term之间默认的关系是OR，可以指定operator
支持部分逻辑
- +代替AND
- |代替OR
- -代替NOT

#Simple Query 默认的operator是 Or
POST users/_search
{
  "query": {
    "simple_query_string": {
      "query": "Ruan AND Yiming",
      "fields": ["name"]
    }
  }
}
POST users/_search
{
  "query": {
    "simple_query_string": {
      "query": "Ruan Yiming",
      "fields": ["name"],
      "default_operator": "AND"
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

# 精确值查询

结构化数据，日期，数字，布尔，term 。。。。

# Term query

可以理解为是对精确值的查询。

但是Term 查询的特点是不会对查询所带的字段进行分词处理，而索引文档中的text类型的字段值，在输入的时候会对text类型进行分词，小写化处理。

要不查询的时候，字段值全用小写；要么使用text的keyword来进行精确匹配，也就是查询的时候输入插入的时候的值的大小写。

# 改为小写查询，如果多个单词在一起的，使用起来还是要注意
POST /products/_search
{
  "query": {
    "term": {
      "desc": {
        //"value": "iPhone"
        "value":"iphone"
      }
    }
  }
}
#使用keyword进行精确匹配
POST /products/_search
{
  //"explain": true,
  "query": {
    "term": {
      "productID.keyword": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

# constant score转为filter

将query转为filter过滤，忽略掉了TF-IDF的算分过程，避免了相关性算分的开销。并且Filter可以有效利用缓存

POST /products/_search
{
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "productID.keyword": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13

# 布尔值

#对布尔值 match 查询，有算分
POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "term": {
      "avaliable": {
        "value": "false"
      }
    }
  }
}
#一般对布尔值查询会结合上面的过滤到算法
#对布尔值，通过constant score 转成 filtering，没有算分
POST products/_search
{
  "profile": "true",
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "avaliable": {
            "value": "false"
          }
        }
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

# 查找多个精确值

#字符类型 terms
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "productID.keyword": [
            "QQPX-R-3956-#aD8",
            "JODL-X-1937-#pV7"
          ]
        }
      }
    }
  }
}
#处理多值字段，term 查询是包含，而不是等于
#针对的是这个term字段的值是一个数值，而并非是唯一的值的情况下
#可以通过添加一个genre count字段对每个文档的该字段的数组里面的值多个统计
#在查询的时候，通过去定义这个字段的应该有的值的个数去精确匹配
POST movies/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "genre.keyword": "Comedy"
        }
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32

# Range

# 数字Range

#数字 Range 查询
GET products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : {
                    "price" : {
                        "gte" : 20,
                        "lte"  : 30
                    }
                }
            }
        }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

gt大于

lt小于

gte大于等于

lte小于等于

# 日期Range

# 日期 range
POST products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "range" : {
                    "date" : {
                      "gte" : "now-1y"
                    }
                }
            }
        }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

y 年、M 月、w 周、d 天、H或h 小时、m 分钟、s 秒

# exists查询

使用exist查询来处理非空NULL值

#查询存在相应列的文档
#exists查询
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "date"
        }
      }
    }
  }
}
#查询不存在相应列的文档
POST products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must_not": {
            "exists": {
              "field": "date"
            }
          }
        }
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

# bool查询

bool查询是复合查询。must / should是query context贡献算分；must_not / filter是filter context不贡献算分。

子查询可以任意顺序出现；

可以嵌套多个查询；

在bool查询中，没有must条件的时候，should中必须至少满足一条查询。

#基本语法
POST /products/_search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "price" : "30" }
      },
      "filter": {
        "term" : { "avaliable" : "true" }
      },
      "must_not" : {
        "range" : {
          "price" : { "lte" : 10 }
        }
      },
      "should" : [
        { "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
        { "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
      ],
      "minimum_should_match" :1
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

# 解决term query包含而不是相等问题

针对term查询的字段，如果是一个数组类型的时候，term查询是包含而不是相等的问题，通过bool查询来解决，需要新增一个数组里面值的个数的字段。

#must，有算分
POST /newmovies/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"genre.keyword": {"value": "Comedy"}}},
        {"term": {"genre_count": {"value": 1}}}

      ]
    }
  }
}
#Filter。不参与算分，结果的score是0
POST /newmovies/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"genre.keyword": {"value": "Comedy"}}},
        {"term": {"genre_count": {"value": 1}}}
        ]

    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

# bool嵌套查询

#嵌套，实现了 should not 逻辑
POST /products/_search
{
  "query": {
    "bool": {
      "must": {
        "term": {
          "price": "30"
        }
      },
      "should": [
        {
          "bool": {
            "must_not": {
              "term": {
                "avaliable": "false"
              }
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

# 改变相关度算分

通过嵌套的bool查询中的问题，来改变相关度的算分的问题。

在同一层级下竞争字段，具有相同的权重；通过嵌套bool查询，可以改变对算分的影响。

#同级查询
POST /animals/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "brown" }},
        { "term": { "text": "red" }},
        { "term": { "text": "quick"   }},
        { "term": { "text": "dog"   }}
      ]
    }
  }
}
#嵌套多级查询
POST /animals/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }},
        { "term": { "text": "dog"   }},
        {
          "bool":{
            "should":[
               { "term": { "text": "brown" }},
                 { "term": { "text": "brown" }},
            ]
          }

        }
      ]
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35

# Boosting Query

# 控制字段的boosting

在bool查询中，通过指定boost的值来实现影响算分，控制相关度。

参数boost的含义：当boost>1的时候，打分的相关度相对性提升；当0<boost<1的时候，打分的权重相对性降低;当boost<0的时候，贡献负分。

POST blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {
          "title": {
            "query": "apple,ipad",
            "boost": 1.1
          }
        }},
        {"match": {
          "content": {
            "query": "apple,ipad",
            "boost": 2
          }
        }}
      ]
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

# Not Quite Not

意思是查询的时候，正确的值要第一个返回(例如这里我要求查询的苹果公司产品要优先出现)，其他相关的(苹果汁，苹果派的)等信息也要返回，但是是靠后面返回。

提高了查准率和查全率。

使用了boosting query的方式来实现。positive的值是对查询算法加分的，negative的值是对查询算分减分的，negative_boost是指减多少分。

POST news/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": "apple"
        }
      },
      "negative": {
        "match": {
          "content": "pie"
        }
      },
      "negative_boost": 0.5
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

# Disjunction Max Query

Disjunction Max Query应用于单字符串多字段查询的场景，很多搜索引擎都是这样的。允许你只是输入一个字符串，要多所有的字段进行查询。

POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ]
        }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12

# 使用tie breaker参数

有一些情况下，同时匹配title和body字段的文档比只与一个字段匹配的文档的相关度更高。而disjunction max query 查询只会简单地使用单个最佳匹配语句的评分_score作为整体评分。这个时候，可以使用tie_breaker参数来调整。

获取最佳匹配语句的评分_score;
将其他匹配语句的评分与tie_breaker相乘；
对以上评分求和并规范化。
Tier Breaker是一个介于0-1之间的浮点数。0代表使用最佳匹配；1代表所有语句同等重要。

POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
            "tie_breaker": 0.2
        }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12

# Multi Match

也是运用于单字符串多字段的场景下查询。

best_fields的场景，最佳字段，单字段之间相互竞争，又相互关联的时候。

POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields",
      "query": "Quick pets",
      "fields": ["title","body"],
      "tie_breaker": 0.2,
      "minimum_should_match": "20%"
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12

most fields:多数字段的场景

PUT /titles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {"std": {"type": "text","analyzer": "standard"}}
      }
    }
  }
}
GET /titles/_search
{
   "query": {
        "multi_match": {
            "query":  "barking dogs",
            "type":   "most_fields",
            "fields": [ "title", "title.std" ]
        }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

cross_fields:用于跨字段的查询

GET address/_search
{
   "query": {
        "multi_match": {
            "query":  "Poland Street W1V",
            "type":   "cross_fields",
            "operator": "and",
            "fields": [ "street", "city","country" ]
        }
    }
}

1
2
3
4
5
6
7
8
9
10
11

# 高亮显示

POST tmdb/_search
{
  "_source": ["title","overview"],
  "query": {
    "multi_match": {
      "query": "basketball with cartoon aliens",
      "fields": ["title","overview"]
    }
  },
  "highlight" : {
        "fields" : {
            "overview" : { "pre_tags" : ["\\033[0;32;40m"], "post_tags" : ["\\033[0m"] },
            "title" : { "pre_tags" : ["\\033[0;32;40m"], "post_tags" : ["\\033[0m"] }

        }
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

# Search Template

ES中的search template是用来解耦程序的开发和ES搜索DSL的

POST _scripts/tmdb
{
  "script": {
    "lang": "mustache",
    "source": {
      "_source": [
        "title","overview"
      ],
      "size": 20,
      "query": {
        "multi_match": {
          "query": "{{q}}",
          "fields": ["title","overview"]
        }
      }
    }
  }
}
#调用template
POST tmdb/_search/template
{
    "id":"tmdb",
    "params": {
        "q": "basketball with cartoon aliens"
    }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

# Suggester API

搜索引擎中类似的功能，在ES中是通过Suggester API实现的。
原理：将输入的文本分解为Token，然后在搜索的字典里查找相似的Term并返回。
根据不同的使用场景，ES设计了4种类别的Suggesters
- Term & Phrase Suggester
- Complete & Context Suggester

# Term Suggester

有三种suggestion Mode:

Missing：如索引中已经存在，就不提供建议
Popular：推荐出现频率更加高的词
Always：无论是否存在，都提供建议。

每一个建议都包含一个算分，相似性是通过levenshtein edit distance的算法实现的。核心思想就是一个词改动多少字符就可以和另外一个词一致。提供了很多可选参数来控制相似性的模糊程度。例如"max_edits"

# Missing Mode

Suggester就是一个特殊类型的搜索。"text"里是调用时候提供的文本，通常来自于用户界面上用户输入的内容。

用户输入的"lucen"是一个错误的拼写。

回到指定的字段"body"上搜索，当无法搜索到结果时(missing)，返回建议的词。

POST /articles/_search
{
  "size": 1,
  "query": {
    "match": {
      "body": "lucen rock"
    }
  },
  "suggest": {
    "term-suggestion": {
      "text": "lucen rock",
      "term": {
        "suggest_mode": "missing",
        "field": "body"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

# Popular Mode

推荐出现频率更多高的词

POST /articles/_search
{

  "suggest": {
    "term-suggestion": {
      "text": "lucen rock",
      "term": {
        "suggest_mode": "popular",
        "field": "body"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13

# Always Mode

无论是否存在，都提供建议

POST /articles/_search
{

  "suggest": {
    "term-suggestion": {
      "text": "lucen rock",
      "term": {
        "suggest_mode": "always",
        "field": "body",
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13

sorting by frequentcy和prefix length

默认按照score排序，也可以按照"frequency"

默认首字母不一致就不会匹配推荐，但是如果将prefix_length设置为0，就会为hock建议rock

POST /articles/_search
{

  "suggest": {
    "term-suggestion": {
      "text": "lucen hocks",
      "term": {
        "suggest_mode": "always",
        "field": "body",
        "prefix_length":0,
        "sort": "frequency"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

# Phrase Suggester

Phrase Suggester在term suggester上增加了一些额外的逻辑。

例如下面的参数：

Suggest Mode: missing，popular，always
Max Errors：最多可以拼错的Terms数
Confidence：限制返回结果数，默认是1

POST /articles/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "lucne and elasticsear rock hello world ",
      "phrase": {
        "field": "body",
        "max_errors":2,
        "confidence":0,
        "direct_generator":[{
          "field":"body",
          "suggest_mode":"always"
        }],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

# Completion Suggester

completion Suggester提供了"自动完成"(auto complete)的功能。用户每输入一个字符，就需要即时发送一个查询请求到后端查找匹配项。
对性能要求比较苛刻。ES采用了不同的数据结构，并非通过倒排索引来完成的。而是将Analyze的数据编码成FST和索引一起存放。FST会被ES整个加载进内存，速度很快。
FST只能用于前缀查找

# 使用自动补全suggester步骤

定义mapping，使用"completion" type

索引数据

运行"suggest"查询，得到搜索建议

#定义mapping
PUT articles
{
  "mappings": {
    "properties": {
      "title_completion":{
        "type": "completion"
      }
    }
  }
}
#索引数据
POST articles/_bulk
{ "index" : { } }
{ "title_completion": "lucene is very cool"}
{ "index" : { } }
{ "title_completion": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "title_completion": "Elasticsearch rocks"}
{ "index" : { } }
{ "title_completion": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "title_completion": "Elk stack rocks"}
{ "index" : {} }
#使用自动补全
POST articles/_search?pretty
{
  "size": 0,
  "suggest": {
    "article-suggester": {
      "prefix": "elk ",
      "completion": {
        "field": "title_completion"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37

# Context Suggester

上下文建议搜索

context suggester是completion suggester的扩展，

可以在搜索中加入更多的上下文信息，例如，输入"star"

咖啡相关：建议"starbucks"
电影相关的："star wars"

# 实现上下文suggester

可以定义两种类型的context
- category 任意的字符串
- geo 地理位置信息
实现context suggester的具体步骤
- 定制一个mapping
- 索引数据，并且为每个文档加入context信息
- 结合context进行suggestion查询

#定义mapping
PUT comments/_mapping
{
  "properties": {
    "comment_autocomplete":{
      "type": "completion",
      "contexts":[{
        "type":"category",
        "name":"comment_category"
      }]
    }
  }
}
#插入数据
POST comments/_doc
{
  "comment":"I love the star war movies",
  "comment_autocomplete":{
    "input":["star wars"],
    "contexts":{
      "comment_category":"movies"
    }
  }
}
POST comments/_doc
{
  "comment":"Where can I find a Starbucks",
  "comment_autocomplete":{
    "input":["starbucks"],
    "contexts":{
      "comment_category":"coffee"
    }
  }
}
#带上下文搜索
POST comments/_search
{
  "suggest": {
    "MY_SUGGESTION": {
      "prefix": "sta",
      "completion":{
        "field":"comment_autocomplete",
        "contexts":{
          "comment_category":"coffee"
        }
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49

# Metric Aggregation

单值分析：只输出一个分析结果

min,max,avg,sum
cardinality (类似于distincct count)

多值分析：输出多个分析结果

stats, exended stats
percentile, percentile rank
top hits (排在前面的示例)

# 多个 Metric 聚合，找到最低最高和平均工资
POST employees/_search
{
  "size": 0,
  "aggs": {
    "max_salary": {
      "max": {
        "field": "salary"
      }
    },
    "min_salary": {
      "min": {
        "field": "salary"
      }
    },
    "avg_salary": {
      "avg": {
        "field": "salary"
      }
    }
  }
}
# 一个聚合，输出多值
POST employees/_search
{
  "size": 0,
  "aggs": {
    "stats_salary": {
      "stats": {
        "field":"salary"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34

# Bucket Aggregation

按照一定的规则，将文档分配到不同的桶中，从而达到分类的目的。ES提供的一些常见的bucket aggregation。

terms
数字类型，有range/data range，或者histogram / data histogram
支持嵌套：也就是在桶里再做分桶

# Terms Aggregation

在对term进行聚合的时候，如果是text类型。有两种方法，要么指定keyword字段，要么就需要打开fieldata，但是fieldata会查找出对应分词后的term。才可以进行terms aggregation

keyword默认支持doc_values
text需要咋mapping中enable，会按照分词后的结果进行分

# 对 Text 字段进行 terms 聚合查询，失败
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job"
      }
    }
  }
}
# 对keword 进行聚合
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
      }
    }
  }
}
# 对 Text 字段打开 fielddata，支持terms aggregation
PUT employees/_mapping
{
  "properties" : {
    "job":{
       "type":     "text",
       "fielddata": true
    }
  }
}
# 对 Text 字段进行 terms 分词。分词后的terms,结果和上面指定keyword的方式是不一样的，这是分词后的结果
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46

# Cardinality

类似于SQL中的Distinct

# 对job.keyword 和 job 进行 terms 聚合，分桶的总数并不一样
POST employees/_search
{
  "size": 0,
  "aggs": {
    "cardinate": {
      "cardinality": {
        "field": "job"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12

# 指定SIZE分桶

#指定 bucket 的 size
POST employees/_search
{
  "size": 0,
  "aggs": {
    "ages_5": {
      "terms": {
        "field":"age",
        "size":3
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13

# 嵌套聚合

# 指定size，不同工种中，年纪最大的3个员工的具体信息
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword"
      },
      "aggs":{
        "old_employee":{
          "top_hits":{
            "size":3,
            "sort":[
              {
                "age":{
                  "order":"desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

# 优化terms聚合性能

通过指定mapping中，打开eager_global_ordinals为true，这会让索引数据的时候就开始对数据文档进行预先处理，这样对于要求频繁使用聚合的场景，或者说是要对聚合实时数据有更高的性能的时候来使用。

put index
{
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "eager_global_ordinals": true
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11

# Range 聚合

自己自定义range

#Salary Ranges 分桶，可以自己定义 key
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field":"salary",
        "ranges":[
          {
            "to":10000
          },
          {
            "from":10000,
            "to":20000
          },
          {
            "key":">20000",
            "from":20000
          }
        ]
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

# Histogram聚合

通过指定分桶的间隔，来进行分桶

#Salary Histogram,工资0到10万，以 5000一个区间进行分桶
POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_histrogram": {
      "histogram": {
        "field":"salary",
        "interval":5000,
        "extended_bounds":{
          "min":0,
          "max":100000
        }
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

# Pipeline 聚合分析

pipeline管道的概念：支持对聚合分析的结果，再次进行聚合分析。

Pipeline的分析结果会输出到原结果中，根据位置的不同，分为两类：

Sibling 结果和现有分析结果同级
- max，min，avg & sum bucket
- stats , extended status buccket
- percentiles bucket
Parent 结果内嵌到现有的聚合分析结果之中
- derivative (求导)
- cumultive sum（累计求和）
- moving function (滑动窗口)

# Sibling Pipeline

查看到平均工资最低的工作类型

# 平均工资最低的工作类型
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "min_salary_by_job":{
      "min_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

平均工资的统计分析

# 平均工资的统计分析
POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "stats_salary_by_job":{
      "stats_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

# Parent Pipeline

按照年龄，对工资进行求导(看工资发展的趋势)

#按照年龄对平均工资求导
POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "min_doc_count": 1,
        "interval": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "derivative_avg_salary":{
          "derivative": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26

编辑

#ELK

上次更新: 2022/02/13, 22:40:21

← ES架构规划 ES_常见问题列表→