Bastien Donjon

Web developer in Bordeaux


Wednesday 27 August 2014


Remove duplicate documents from a search in Elasticsearch

Posted in Uncategorized

If many of your documents share the same value for a given field, this query returns a single document per distinct value. It relies on a terms aggregation with a nested top hits aggregation.

My index:

  • Doc 1 {domain: 'domain1.fr', name: 'name1', date: '01-01-2014'}
  • Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
  • Doc 3 {domain: 'domain2.fr', name: 'name2', date: '01-03-2014'}
  • Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
  • Doc 5 {domain: 'domain3.fr', name: 'name3', date: '01-05-2014'}
  • Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}

My result (deduplicated on the domain field):

  • Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
  • Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
  • Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
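
To make the expected result concrete, here is a minimal Python sketch of the same deduplication done in memory. It is only an illustration, not how Elasticsearch does it. The `id` key and the `dedup_by_field` helper are hypothetical names, and the sketch assumes two things the post does not state: that dates are in DD-MM-YYYY format, and that the document to keep in each group is the most recent one (which is what the result list above happens to show).

```python
from datetime import datetime

# The six sample documents; "id" is a hypothetical key mirroring "Doc 1".."Doc 6".
docs = [
    {"id": 1, "domain": "domain1.fr", "name": "name1", "date": "01-01-2014"},
    {"id": 2, "domain": "domain1.fr", "name": "name1", "date": "01-02-2014"},
    {"id": 3, "domain": "domain2.fr", "name": "name2", "date": "01-03-2014"},
    {"id": 4, "domain": "domain2.fr", "name": "name2", "date": "01-04-2014"},
    {"id": 5, "domain": "domain3.fr", "name": "name3", "date": "01-05-2014"},
    {"id": 6, "domain": "domain3.fr", "name": "name3", "date": "01-06-2014"},
]


def dedup_by_field(documents, field, date_format="%d-%m-%Y"):
    """Keep one document per distinct value of `field`: the most recent one.

    The DD-MM-YYYY date format is an assumption.
    """
    latest = {}  # field value -> (parsed date, document)
    for doc in documents:
        key = doc[field]
        when = datetime.strptime(doc["date"], date_format)
        if key not in latest or when > latest[key][0]:
            latest[key] = (when, doc)
    # Most recent first, matching the order of the result list above.
    ordered = sorted(latest.values(), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ordered]


result = dedup_by_field(docs, "domain")
print([doc["id"] for doc in result])  # → [6, 4, 2]
```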

Elasticsearch query:

POST http://localhost:9200/test/dedup/_search?search_type=count&pretty=true
{
  "aggs": {
    "dedup": {
      "terms": {
        "field": "domain"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
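
As a rough sketch of how this query might be used from code, the Python below builds the same request body and pulls the deduplicated documents out of the aggregation part of a response. The helper names `build_dedup_query` and `extract_dedup_docs` are hypothetical, and `sample_response` is a hand-written, trimmed stand-in for a response, not real Elasticsearch output. The optional `sort_field` argument adds a `sort` clause to the top hits aggregation; that is an addition to the original query, and it assumes the field is mapped so that it sorts meaningfully (for example as a date).

```python
def build_dedup_query(field, size=1, sort_field=None):
    """Build the aggregation body: one bucket per value of `field`,
    with the top `size` hits kept in each bucket."""
    top_hits = {"size": size}
    if sort_field is not None:
        # Assumption: `sort_field` is mapped with a sortable type (e.g. date).
        top_hits["sort"] = [{sort_field: {"order": "desc"}}]
    return {
        "aggs": {
            "dedup": {
                "terms": {"field": field},
                "aggs": {"dedup_docs": {"top_hits": top_hits}},
            }
        }
    }


def extract_dedup_docs(response):
    """Collect the `_source` of every top hit from every `dedup` bucket."""
    docs = []
    for bucket in response["aggregations"]["dedup"]["buckets"]:
        for hit in bucket["dedup_docs"]["hits"]["hits"]:
            docs.append(hit["_source"])
    return docs


# Hand-written, trimmed example of the response shape (not real output).
sample_response = {
    "aggregations": {
        "dedup": {
            "buckets": [
                {
                    "key": "domain3.fr",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"hits": [
                        {"_source": {"domain": "domain3.fr", "name": "name3",
                                     "date": "01-06-2014"}}
                    ]}},
                },
                {
                    "key": "domain2.fr",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"hits": [
                        {"_source": {"domain": "domain2.fr", "name": "name2",
                                     "date": "01-04-2014"}}
                    ]}},
                },
            ]
        }
    }
}

body = build_dedup_query("domain")
unique_docs = extract_dedup_docs(sample_response)
```

The body returned by `build_dedup_query` can be serialized with `json.dumps` and sent to the `_search` URL above with any HTTP client. Without a `sort` clause, top hits ranks documents by relevance score, so which duplicate is kept in each bucket is not guaranteed to be the most recent one.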

I originally asked a question here.