Remove duplicate documents from a search in Elasticsearch
Written by Bastien Donjon, Posted in Uncategorized
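If a search returns many documents that share the same value for a given field, the query below returns only one document per distinct value. It relies on two aggregations: a terms aggregation to group the documents by that field, and a nested top_hits aggregation to pick one document from each group.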
My index:
- Doc 1 {domain: 'domain1.fr', name: 'name1', date: '01-01-2014'}
- Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
- Doc 3 {domain: 'domain2.fr', name: 'name2', date: '01-03-2014'}
- Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
- Doc 5 {domain: 'domain3.fr', name: 'name3', date: '01-05-2014'}
- Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
My result (deduplicated on the domain field):
- Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
- Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
- Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
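To picture what "one document per domain" means before looking at the query, here is a minimal plain-Python sketch of the same grouping on the six sample documents. It is only an illustration, not how Elasticsearch works internally. The `id` key and the `dedup_latest` function are hypothetical names added for the example, and the sketch assumes the dates are DD-MM-YYYY and that the most recent document of each group is the one to keep.

```python
from datetime import datetime

# The six sample documents from the index above.
# "id" is a hypothetical key added here only to tell the documents apart.
docs = [
    {"id": 1, "domain": "domain1.fr", "name": "name1", "date": "01-01-2014"},
    {"id": 2, "domain": "domain1.fr", "name": "name1", "date": "01-02-2014"},
    {"id": 3, "domain": "domain2.fr", "name": "name2", "date": "01-03-2014"},
    {"id": 4, "domain": "domain2.fr", "name": "name2", "date": "01-04-2014"},
    {"id": 5, "domain": "domain3.fr", "name": "name3", "date": "01-05-2014"},
    {"id": 6, "domain": "domain3.fr", "name": "name3", "date": "01-06-2014"},
]


def dedup_latest(documents, field="domain"):
    """Keep one document per distinct value of `field`: the most recent one."""
    kept = {}
    for doc in documents:
        key = doc[field]
        # Assumption: dates are written DD-MM-YYYY.
        when = datetime.strptime(doc["date"], "%d-%m-%Y")
        if key not in kept or when > kept[key][0]:
            kept[key] = (when, doc)
    return [doc for _, doc in kept.values()]


result = dedup_latest(docs)
for doc in result:
    print(doc["id"], doc["domain"], doc["date"])
```

Each distinct domain plays the role of a bucket, and the document kept for that domain plays the role of the single top hit.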
Elasticsearch query:
POST http://localhost:9200/test/dedup/_search?search_type=count&pretty=true
{
  "aggs": {
    "dedup": {
      "terms": { "field": "domain" },
      "aggs": {
        "dedup_docs": {
          "top_hits": { "size": 1 }
        }
      }
    }
  }
}
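The deduplicated documents are not in the usual hits list of the response: they sit inside each bucket of the dedup aggregation. The sketch below shows one way to build the request body and pull the documents out in Python. The function names `build_dedup_query` and `extract_deduplicated` are hypothetical, and `sample_response` is a hand-written, trimmed illustration of the response shape, not real output from a cluster.

```python
import json


def build_dedup_query(field, size=1):
    """Build the terms + top_hits aggregation body from the query above."""
    return {
        "aggs": {
            "dedup": {
                "terms": {"field": field},
                "aggs": {"dedup_docs": {"top_hits": {"size": size}}},
            }
        }
    }


def extract_deduplicated(response):
    """Return the _source of the first top hit found in each bucket."""
    buckets = response["aggregations"]["dedup"]["buckets"]
    return [b["dedup_docs"]["hits"]["hits"][0]["_source"] for b in buckets]


# Hand-written sample with the shape of an aggregation response.
# Illustrative only: trimmed, and not copied from a real cluster.
sample_response = {
    "aggregations": {
        "dedup": {
            "buckets": [
                {
                    "key": "domain1.fr",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"hits": [
                        {"_source": {"domain": "domain1.fr", "name": "name1",
                                     "date": "01-02-2014"}}
                    ]}},
                },
                {
                    "key": "domain2.fr",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"hits": [
                        {"_source": {"domain": "domain2.fr", "name": "name2",
                                     "date": "01-04-2014"}}
                    ]}},
                },
            ]
        }
    }
}

body = build_dedup_query("domain")
print(json.dumps(body, indent=2))

deduped = extract_deduplicated(sample_response)
for doc in deduped:
    print(doc)
```

A few points to keep in mind. The query as written gives top_hits no sort, so the document returned for each bucket is chosen by relevance score; if the most recent document is wanted, adding something like "sort": [{"date": {"order": "desc"}}] inside top_hits is one option, assuming date is mapped as a date field. The terms aggregation returns only 10 buckets by default, so a larger "size" inside terms is needed when there are more distinct domains. Finally, search_type=count was removed in Elasticsearch 5.0; on those versions, putting "size": 0 in the request body has the same effect.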
I originally asked a question here.