Incorrect BM25Encoder values returned for query and document.  #85

@clive-eltropy

Description

Is this a new bug?

  • I believe this is a new bug
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

When I compute sparse vectors for the same piece of text using encode_documents and encode_queries, I get different values.

piece of text: "the lazy dog"
encode_documents values : 0.58, 0.58
encode_queries: 0.5

Expected Behavior

encode_documents and encode_queries return different values for the same piece of text. I expected the values to be 0.5 for both, but there is a difference of about 0.08. Am I missing something?

Steps To Reproduce

    from pinecone_text.sparse import BM25Encoder
    
    corpus = ["The quick brown fox jumps over the lazy dog", "The lazy dog is brown"]

    bm25 = BM25Encoder()
    bm25.fit(corpus)

    print(bm25.encode_documents("the lazy dog")) 
    ### Output: {'indices': [226376294, 2982218203], 'values': [0.5882352941176472, 0.5882352941176472]}
    
    print(bm25.encode_queries("the lazy dog"))
    ### Output: {'indices': [226376294, 2982218203], 'values': [0.5, 0.5]}
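
A possible explanation, sketched in plain Python rather than taken from the library source: both reported outputs are consistent with a BM25 split in which the document side carries the length-normalized term-frequency component and the query side carries IDF weights rescaled to sum to 1. The sketch below assumes the standard parameters k1 = 1.2 and b = 0.75, and assumes the tokenizer lowercases, stems, and drops the stop words "the", "over", and "is", which leaves 6 and 3 tokens in the two corpus documents. The names `document_weights` and `query_weights` are hypothetical helpers, not part of pinecone-text.

```python
import math

# Assumed BM25 parameters and assumed token counts after lowercasing,
# stop-word removal and stemming; none of this is read from the library.
K1, B = 1.2, 0.75
doc_lengths = [6, 3]  # "quick brown fox jump lazi dog", "lazi dog brown"
avgdl = sum(doc_lengths) / len(doc_lengths)  # 4.5
n_docs = len(doc_lengths)
doc_freq = {"lazi": 2, "dog": 2}  # both terms occur in both documents

tokens = ["lazi", "dog"]  # "the lazy dog" with the stop word "the" dropped


def document_weights(tokens):
    """Term-frequency side of BM25 with document-length normalization."""
    norm = K1 * (1.0 - B + B * len(tokens) / avgdl)
    unique = list(dict.fromkeys(tokens))  # unique terms, original order
    return [tokens.count(t) / (norm + tokens.count(t)) for t in unique]


def query_weights(tokens):
    """IDF side of BM25, rescaled so the weights sum to 1."""
    unique = list(dict.fromkeys(tokens))
    idf = [math.log((n_docs + 1) / (doc_freq.get(t, 1) + 0.5)) for t in unique]
    total = sum(idf)
    return [w / total for w in idf]


print(document_weights(tokens))  # both values are 1 / 1.7, about 0.588
print(query_weights(tokens))     # equal IDFs normalize to 0.5 each
```

If these assumptions hold, the two methods are not meant to return the same numbers for the same text: the BM25 score comes from the dot product of a query vector and a document vector, so each side holds a different factor of the formula.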

Relevant log output

No response

Environment

OS: Ubuntu 20.04
Python 3.9.12
pinecone-text==0.9.0

Additional Context

No response

Labels: bug
