Elasticsearch (6): Elasticsearch Aggregation Analysis

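We have finally reached the last business requirement: allow managers to run analytics over the employee directory. Elasticsearch has a feature called aggregations, which lets us generate sophisticated analytics over our data. Aggregations are similar to GROUP BY in SQL, but much more powerful.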


Basic aggregation

For example, let's find the most popular interests among our employees:
GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}
Ignore the syntax for now and just look at the result:
{
  "hits": { ... },
  "aggregations": {
    "all_interests": {
      "buckets": [
        {
          "key": "music",
          "doc_count": 2
        },
        {
          "key": "forestry",
          "doc_count": 1
        },
        {
          "key": "sports",
          "doc_count": 1
        }
      ]
    }
  }
}
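We can see that two employees are interested in music, one in forestry, and one in sports. These aggregations are not precalculated; they are generated on the fly from the documents that match the current query. To find the most popular interests of employees named Smith, simply combine the aggregation with an appropriate query, as covered later in this post.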
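To make the bucket counts concrete, here is a minimal Java sketch of what a terms aggregation computes conceptually: for each distinct value of a field, the number of documents that contain it. It runs on an in-memory list rather than against Elasticsearch. The class name, method names, and sample data (modeled on the three employees behind the result above) are illustrative assumptions, and the sketch says nothing about how Elasticsearch implements the aggregation internally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class TermsAggSketch {
    /** Hypothetical sample data: one list of interests per employee document. */
    static List<List<String>> sampleDocs() {
        return Arrays.asList(
                Arrays.asList("sports", "music"),
                Arrays.asList("music"),
                Arrays.asList("forestry"));
    }

    /** Counts, per term, how many documents contain it (the doc_count of each bucket). */
    static Map<String, Long> countTerms(List<List<String>> docs) {
        return docs.stream()
                // distinct() so a term repeated inside one document is counted once
                .flatMap(interests -> interests.stream().distinct())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Sorted by key here, whereas Elasticsearch orders buckets by doc_count.
        System.out.println(countTerms(sampleDocs())); // prints {forestry=1, music=2, sports=1}
    }
}
```

The counts match the buckets in the response: music 2, forestry 1, sports 1.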

Client program demo

Add a method:

import java.util.Map;

import org.elasticsearch.action.search.SearchRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.Aggregation;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

/**
 * Find the most popular interests among employees (search using aggregations).
 * @param client the Elasticsearch client
 */
private static void findInterestHobby(Client client) {
    SearchRequestBuilder request = client.prepareSearch("megacorp1")
            .setTypes("employee1")
            .addAggregation(
                    AggregationBuilders.terms("agg1").field("interests")
            );
    SearchResponse response = request.get();
    // Walk every top-level aggregation in the response by name.
    Map<String, Aggregation> aggs = response.getAggregations().asMap();
    for (Map.Entry<String, Aggregation> entry : aggs.entrySet()) {
        System.out.println("agg name=" + entry.getKey());
        // A terms aggregation is returned as a Terms instance.
        Terms terms = (Terms) entry.getValue();
        System.out.println("DocCountError=" + terms.getDocCountError());
        System.out.println("SumOfOtherDocCounts=" + terms.getSumOfOtherDocCounts());
        // Each bucket holds one term and the number of documents containing it.
        for (Terms.Bucket bucket : terms.getBuckets()) {
            System.out.println(bucket.getKeyAsString() + "=" + bucket.getDocCount());
        }
    }
}

Add the call to the main method:

// 8. Find the most popular interests among employees (search using aggregations)
findInterestHobby(client);

Running it fails with the following error:

Caused by: RemoteTransportException[[111][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: IllegalArgumentException[Fielddata is disabled on text fields by default. Set fielddata=true on [interests] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.];  Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default. ...

fielddata

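Let's take a look at fielddata:
Most fields are indexed by default, which makes them searchable. Sorting, aggregations, and accessing field values in scripts, however, require a different access pattern from search.

Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?"

Most fields can use index-time, on-disk doc_values for this access pattern, but text fields do not support doc_values.
Instead, text fields use an in-memory data structure called fielddata. It is built by reading the entire inverted index of each segment, inverting the term-to-document relationship, and storing the result in memory.

Because it can consume a lot of memory, fielddata is disabled by default. Do not enable it unless you really need it.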
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html
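The difference between the two access patterns can be illustrated with a small Java sketch. It takes a toy inverted index (term to document ids) and "uninverts" it into a per-document view (document id to terms), which is the shape sorting and aggregations need. This is only a conceptual model: the class name, method name, and data are hypothetical, and the real fielddata structures in Elasticsearch are considerably more compact and complex.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class UninvertSketch {
    /** Toy inverted index: answers "which documents contain this term?". */
    static Map<String, List<Integer>> sampleInvertedIndex() {
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        index.put("music", Arrays.asList(1, 2));
        index.put("sports", Arrays.asList(1));
        index.put("forestry", Arrays.asList(3));
        return index;
    }

    /** Builds the reverse view: answers "what are the values of this field for this document?". */
    static Map<Integer, Set<String>> uninvert(Map<String, List<Integer>> invertedIndex) {
        Map<Integer, Set<String>> byDoc = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : invertedIndex.entrySet()) {
            for (Integer docId : entry.getValue()) {
                byDoc.computeIfAbsent(docId, id -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return byDoc;
    }

    public static void main(String[] args) {
        // The whole index has to be read to build this view, which is why it costs memory.
        System.out.println(uninvert(sampleInvertedIndex()));
        // prints {1=[music, sports], 2=[music], 3=[forestry]}
    }
}
```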

Here we set fielddata: true on the interests field.
To enable fielddata on an existing text field:

[Screenshot of the mapping update request; image not reproduced here.]
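The request itself is not reproduced in the text. According to the fielddata reference page linked above, fielddata is enabled on an existing text field with a put-mapping request whose body sets "type": "text" and "fielddata": true for that field under "properties". The sketch below only builds that body as a plain Java map. The class and method names are hypothetical, and sending it with the transport client (for example via client.admin().indices().preparePutMapping("megacorp1").setType("employee1").setSource(mapping).get()) is an assumption about the client version in use that this sketch does not exercise.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FielddataMappingSketch {
    /** Builds the mapping body that enables fielddata on one text field. */
    static Map<String, Object> fielddataMapping(String fieldName) {
        Map<String, Object> field = new LinkedHashMap<>();
        field.put("type", "text");
        field.put("fielddata", true);

        Map<String, Object> properties = new LinkedHashMap<>();
        properties.put(fieldName, field);

        Map<String, Object> mapping = new LinkedHashMap<>();
        mapping.put("properties", properties);
        return mapping;
    }

    public static void main(String[] args) {
        System.out.println(fielddataMapping("interests"));
        // prints {properties={interests={type=text, fielddata=true}}}
    }
}
```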
Calling the method again gives:
agg name=agg1
DocCountError=0
SumOfOtherDocCounts=0
music=11
sports=8
forestry=2

Head plugin example

The results are long, so only the final aggregation output is shown; the data returned in hits is omitted (the same applies below).
[Head plugin screenshots; images not reproduced here.]

Aggregation with a query condition

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}
The all_interests aggregation now includes only the documents that match the query:

"all_interests": {
  "buckets": [
    {
      "key": "music",
      "doc_count": 2
    },
    {
      "key": "sports",
      "doc_count": 1
    }
  ]
}
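The effect of adding a query can be sketched the same way: filter the documents first, then count terms over what remains. The following in-memory Java example is a conceptual illustration only. The Employee class, the method names, and the sample data are hypothetical, and the case-insensitive comparison is a rough stand-in for how a match query on an analyzed text field treats "smith" and "Smith" alike.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class FilteredAggSketch {
    /** Hypothetical stand-in for an employee document. */
    static final class Employee {
        final String lastName;
        final List<String> interests;

        Employee(String lastName, String... interests) {
            this.lastName = lastName;
            this.interests = Arrays.asList(interests);
        }
    }

    static List<Employee> sampleEmployees() {
        return Arrays.asList(
                new Employee("Smith", "sports", "music"),
                new Employee("Smith", "music"),
                new Employee("Fir", "forestry"));
    }

    /** Applies the "query" first, then aggregates only the matching documents. */
    static Map<String, Long> interestsFor(List<Employee> employees, String lastName) {
        return employees.stream()
                .filter(e -> e.lastName.equalsIgnoreCase(lastName))
                .flatMap(e -> e.interests.stream().distinct())
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(interestsFor(sampleEmployees(), "smith")); // prints {music=2, sports=1}
    }
}
```

The forestry bucket disappears because the only employee with that interest does not match the query.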

Client program demo

Add the query condition to the request part of the previous method, just as we learned earlier:

SearchRequestBuilder request = client.prepareSearch("megacorp1")
        .setTypes("employee1")
        // Restrict the aggregation to documents whose last_name matches "Smith".
        .setQuery(QueryBuilders.matchQuery("last_name", "Smith"))
        .addAggregation(
                AggregationBuilders.terms("agg1").field("interests")
        );

The rest of the method is unchanged.
Result of the call:
agg name=agg1
DocCountError=0
SumOfOtherDocCounts=0
music=2
sports=1

Head plugin example

[Head plugin screenshots; images not reproduced here.]

Aggregations support hierarchical rollups

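Aggregations also support hierarchical rollups. For example, let's find the average age of the employees who share a particular interest: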
GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" },
      "aggs": {
        "avg_age": {
          "avg": { "field": "age" }
        }
      }
    }
  }
}

The aggregation results are a bit more complicated, but still easy to understand:

"all_interests": {
  "buckets": [
    {
      "key": "music",
      "doc_count": 2,
      "avg_age": {
        "value": 28.5
      }
    },
    {
      "key": "forestry",
      "doc_count": 1,
      "avg_age": {
        "value": 35
      }
    },
    {
      "key": "sports",
      "doc_count": 1,
      "avg_age": {
        "value": 25
      }
    }
  ]
}

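The output is basically an enriched version of the first aggregation. We still have a list of interests and their counts, but each interest now has an additional avg_age property: the average age of all employees who share that interest.
Even if you do not fully understand the syntax yet, it is easy to see that Elasticsearch handles complex aggregations and groupings well. There is practically no limit to the kinds of data you can extract.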
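A sub-aggregation can be pictured as a second computation run inside each bucket of the parent. The Java sketch below first buckets in-memory employees by interest, then computes the average age per bucket. The Employee class, the method names, and the sample ages (25, 32, 35, chosen so the averages agree with the result above) are illustrative assumptions rather than anything taken from the Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SubAggSketch {
    /** Hypothetical stand-in for an employee document. */
    static final class Employee {
        final int age;
        final List<String> interests;

        Employee(int age, String... interests) {
            this.age = age;
            this.interests = Arrays.asList(interests);
        }
    }

    static List<Employee> sampleEmployees() {
        return Arrays.asList(
                new Employee(25, "sports", "music"),
                new Employee(32, "music"),
                new Employee(35, "forestry"));
    }

    /** Parent aggregation: one bucket of employees per distinct interest. */
    static Map<String, List<Employee>> bucketsByInterest(List<Employee> employees) {
        Map<String, List<Employee>> buckets = new TreeMap<>();
        for (Employee e : employees) {
            for (String interest : new LinkedHashSet<>(e.interests)) {
                buckets.computeIfAbsent(interest, k -> new ArrayList<>()).add(e);
            }
        }
        return buckets;
    }

    /** Sub-aggregation: the average age inside a single bucket. */
    static double avgAge(List<Employee> bucket) {
        return bucket.stream().mapToInt(e -> e.age).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Buckets are printed in key order here; Elasticsearch orders them by doc_count.
        bucketsByInterest(sampleEmployees()).forEach((key, bucket) ->
                System.out.println(key + " doc_count=" + bucket.size() + " avg_age=" + avgAge(bucket)));
        // prints:
        // forestry doc_count=1 avg_age=35.0
        // music doc_count=2 avg_age=28.5
        // sports doc_count=1 avg_age=25.0
    }
}
```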

Client program demo

For this part, see https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/_structuring_aggregations.html

Put simply, you can nest another aggregation under an aggregation.

Add a method:

/**
 * Sub-aggregation: average age per interest.
 * @param client the Elasticsearch client
 */
private static void findAvgInterestHobby(Client client) {
    SearchRequestBuilder request = client.prepareSearch("megacorp1")
            .setTypes("employee1")
            .addAggregation(
                    AggregationBuilders.terms("agg1").field("interests")
                            // Nested under the terms aggregation, so it runs once per bucket.
                            .subAggregation(AggregationBuilders.avg("avg_age").field("age"))
            );
    SearchResponse response = request.execute().actionGet();
    // For convenience, print the whole response as a string; it can be walked like the first example.
    System.out.println(response.toString());
}

Add the call to the main method:

// 9. Sub-aggregation
findAvgInterestHobby(client);

Result:
{"took":8,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":13,"max_score":1.0,"hits":[{"_index":"megacorp1","_type":"employee1","_id":"5","_score":1.0,"_source":{"first_name":"John","last_name":"Smith","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"8","_score":1.0,"_source":{"first_name":"Jane","last_name":"1 Smith","age":"32","about":"I like to collect rock albums","interests":["music"]}},{"_index":"megacorp1","_type":"employee1","_id":"9","_score":1.0,"_source":{"first_name":"John","last_name":"SmithSmithSmith","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"10","_score":1.0,"_source":{"first_name":"John","last_name":"冬瓜核桃","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"12","_score":1.0,"_source":{"first_name":"John","last_name":"蜂蜜","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"2","_score":1.0,"_source":{"first_name":"Jane","last_name":"Smith","age":"32","about":"I like to collect rock albums","interests":["music"]}},{"_index":"megacorp1","_type":"employee1","_id":"4","_score":1.0,"_source":{"first_name":"Douglas1","last_name":"Fir","age":35,"about":"I like to build cabinets","interests":["forestry"]}},{"_index":"megacorp1","_type":"employee1","_id":"6","_score":1.0,"_source":{"first_name":"John","last_name":"Smith 1","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"1","_score":1.0,"_source":{"first_name":"John","last_name":"Smith1","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}},{"_index":"megacorp1","_type":"employee1","_id":"7","_score":1.0,"_source":{"first_name":"Jane","last_name":"1Smith","age":"32","about":"I like to collect rock albums","interests":["music"]}}]},"aggregations":{"agg1":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[{"key":"music","doc_count":11,"avg_age":{"value":26.90909090909091}},{"key":"sports","doc_count":8,"avg_age":{"value":25.0}},{"key":"forestry","doc_count":2,"avg_age":{"value":35.0}}]}}}
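The avg_age of the music bucket can be checked by hand. The response lists only 10 of the 13 hits (the default page size is 10), but the aggregation covers all matching documents. Assuming, from the bucket counts and the hits that are shown, that the music bucket consists of 8 documents with age 25 (the ones that also list sports) and 3 documents with age 32, the average is (8 × 25 + 3 × 32) / 11 = 296 / 11 ≈ 26.909. The breakdown is an inference rather than something the response states directly, and the helper below is a hypothetical illustration of the arithmetic.

```java
class AvgAgeCheck {
    /** Weighted average: each age contributes once per document that has it. */
    static double weightedAverage(int[] ages, int[] counts) {
        long sum = 0;
        long total = 0;
        for (int i = 0; i < ages.length; i++) {
            sum += (long) ages[i] * counts[i];
            total += counts[i];
        }
        return (double) sum / total;
    }

    public static void main(String[] args) {
        // Assumed breakdown of the music bucket: 8 documents aged 25, 3 documents aged 32.
        System.out.println(weightedAverage(new int[] {25, 32}, new int[] {8, 3}));
    }
}
```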

Head plugin example

[Head plugin screenshots; images not reproduced here.]


healthsun Author
