Skip to content

possible bug in optimization of qat wake signals #306

@sharksforarms

Description

@sharksforarms

Hi all,

I believe I have encountered a bug with the optimization around the reduction of wake signals for qat offload which causes unneeded latency. The issue seems to be around the usage of thread-local.

Commit which added this optimization: 32f3710

This optimization seems to make the assumption that a offload request will be resumed on the same thread that started it, see this pseudo-example:

This example considers a single epoll instance, with multiple worker threads waiting for events.

int global_epoll_inst = epoll_create()

Thread 1:
- app: bssl_qat_async_start_job -> ASYNC_WAIT_CTX_get_all_fds -> epollctl(global_epoll_inst. EPOLL_CTL_ADD, fd)
- engine: tlv->localOpsInFlight = 0
- engine: qat_rsa_decrypt -> QAT_INC_IN_FLIGHT_REQS -> sem_post
- engine: tlv->localOpsInFlight = 1

Intel QAT Polling Thread:
- sem_timedwait -> WAKE
- poll

Thread 2:
- app: epoll_wait(global_epoll_inst) --> resumed
- app: bssl_qat_async_ctx_copy_result(ctx)
- tlv->localOpsInFlight = 0
- engine: QAT_DEC_IN_FLIGHT_REQS
- tlv->localOpsInFlight = -1

Thread 1:
- app: bssl_qat_async_start_job -> ASYNC_WAIT_CTX_get_all_fds -> epollctl(global_epoll_inst. EPOLL_CTL_ADD, fd)
- engine: tlv->localOpsInFlight = 1
- engine: qat_rsa_decrypt -> QAT_INC_IN_FLIGHT_REQS (*bug* no call to sem_post!)
- engine: tlv->localOpsInFlight = 2

Intel QAT Polling Thread:
- sem_timedwait -> TIMEOUT
- poll

...

The side-effect is that the TLV get's into a state where localOpsInFlight it will never == 1 and so sem_post never get's called anymore and then the polling thread's sem times out.

The relevant code section seems to be:

QAT_Engine/qat_hw_rsa.c

Lines 324 to 327 in ba2035c

QAT_INC_IN_FLIGHT_REQS(num_requests_in_flight, tlv);
if (qat_use_signals()) {
if (tlv->localOpsInFlight == 1) {
if (sem_post(&hw_polling_thread_sem) != 0) {

I'm wondering if it would be possible to use num_requests_in_flight == 1 (atomic) in the if ? I did not test this, but the use of a TLV here seems like it could be problematic to consumers

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions