Remove duplicate documents from a search in Elasticsearch
Written by Bastien Donjon
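If a search returns many documents that share the same value for a given field, you can collapse them to a single document per value. The technique requires two aggregations: a terms aggregation to group the documents by that field, and a top_hits aggregation to pick one document from each group.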
My index:
- Doc 1 {domain: 'domain1.fr', name: 'name1', date: '01-01-2014'}
- Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
- Doc 3 {domain: 'domain2.fr', name: 'name2', date: '01-03-2014'}
- Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
- Doc 5 {domain: 'domain3.fr', name: 'name3', date: '01-05-2014'}
- Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
My result (deduplicated on the domain field):
- Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
- Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
- Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
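Elasticsearch query:
POST http://localhost:9200/test/dedup/_search?search_type=count&pretty=true
{
  "aggs": {
    "dedup": {
      "terms": { "field": "domain" },
      "aggs": {
        "dedup_docs": {
          "top_hits": { "size": 1 }
        }
      }
    }
  }
}
The deduplicated documents are returned under aggregations.dedup.buckets, one bucket per domain, each holding its document in dedup_docs.hits.hits. Two caveats: search_type=count was deprecated in Elasticsearch 2.0 and removed in 5.0, so on later versions put "size": 0 in the request body instead. And top_hits orders by score by default, so to reliably get the most recent document per domain, as in the result above, you would probably add a sort on date inside top_hits (assuming date is mapped as a date field).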
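To make the shape of the request and of the response concrete, here is a minimal Python sketch. It only builds the request body and reads a response that has already been parsed into a dict; it does not talk to a cluster. The function names build_dedup_query and extract_deduplicated are my own for illustration, not part of any Elasticsearch client, and the sketch uses "size": 0 in the body in place of search_type=count. The optional sort argument assumes the field you sort on is mapped so that it orders correctly.

```python
import json


def build_dedup_query(field, docs_per_value=1, sort=None):
    """Build a search body that keeps `docs_per_value` documents
    for each distinct value of `field`."""
    top_hits = {"size": docs_per_value}
    if sort is not None:
        # Optional: controls which document wins inside each bucket.
        top_hits["sort"] = sort
    return {
        "size": 0,  # no regular hits, only the aggregation result
        "aggs": {
            "dedup": {
                "terms": {"field": field},
                "aggs": {"dedup_docs": {"top_hits": top_hits}},
            }
        },
    }


def extract_deduplicated(response):
    """Flatten the top_hits of every bucket into a list of _source dicts."""
    docs = []
    for bucket in response["aggregations"]["dedup"]["buckets"]:
        for hit in bucket["dedup_docs"]["hits"]["hits"]:
            docs.append(hit["_source"])
    return docs


if __name__ == "__main__":
    # Print the body you would send with POST /test/dedup/_search
    print(json.dumps(build_dedup_query("domain"), indent=2))
```

Keep in mind that a terms aggregation returns only the ten largest buckets by default, so if the index holds more than ten distinct domains you would need to raise the "size" of the terms aggregation as well.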
I originally asked a question here.