
doc_count in aggregation result is not the actual number of documents #2721

@rdettai-sk


I noticed that in term aggregations, doc_count is not the number of documents that match the term but the number of times the term appears. For example, in this test, some documents contain the term twice (text_field => "Hello Hello") and are therefore counted twice.

In Lucene, I think you can get the right doc_count by deduplicating, thanks to SortedSetDocValues, which stores the multi-valued terms of a document in sorted order. My understanding is that ES uses this to display actual document counts.
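
To make the difference concrete, here is a minimal sketch in plain Rust. It uses neither Tantivy's nor Lucene's API: `occurrence_counts` and `document_counts` are hypothetical helpers, and each document is modeled as a plain list of terms. The first function counts every occurrence (the behavior described above), the second sorts and deduplicates the terms of each document before counting, which is the idea behind the deduplication suggested here.

```rust
use std::collections::BTreeMap;

/// Counts every term occurrence: a document that contains a term twice
/// contributes 2 to that term's count.
fn occurrence_counts(docs: &[Vec<&str>]) -> BTreeMap<String, u64> {
    let mut counts = BTreeMap::new();
    for terms in docs {
        for &term in terms {
            *counts.entry(term.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

/// Counts each term at most once per document by sorting the document's
/// terms and removing adjacent duplicates before counting.
fn document_counts(docs: &[Vec<&str>]) -> BTreeMap<String, u64> {
    let mut counts = BTreeMap::new();
    for terms in docs {
        let mut unique: Vec<&str> = terms.clone();
        unique.sort_unstable();
        unique.dedup();
        for term in unique {
            *counts.entry(term.to_string()).or_insert(0) += 1;
        }
    }
    counts
}
```

With three documents `["Hello", "Hello"]`, `["Hello"]` and `["World"]`, `occurrence_counts` reports 3 for "Hello" while `document_counts` reports 2.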

Could something similar be achieved in Tantivy? Do we want to? If not, I think we should document doc_count to clarify this.
