[Question] How to map input bytes to the functions that accessed them in v4.0.0? #6581

@rayansiddique9

Description

I'm analyzing parser behavior and need to determine which functions in an instrumented program accessed specific input bytes. This would help understand:

  • Which parsing functions process which byte ranges
  • How input bytes flow through the call stack
  • Which code paths are triggered by specific input patterns

Environment

  • PolyTracker version: 4.0.0 (Docker: trailofbits/polytracker:latest)
  • Platform: macOS ARM64 (using --platform linux/amd64)
  • Python version (in container): 3.10

What I've Successfully Extracted

I've made significant progress exploring the v4.0.0 API and can extract:

1. Function Names

from polytracker.taint_dag import TDFunctionsSection, TDStringSection

# Extract function ID → name mapping
for func_id, fn_header in enumerate(functions_section):
    func_name = string_section.read_string(fn_header.name_offset)

Results: Successfully extracted all function names: main, parse_expr, program, statement, expr, test, sum, term, etc.

2. Function Call Trace

from polytracker.taint_dag import TDEventsSection

# Iterate through execution trace
for event in events_section:
    function_id = event.fnidx      # Function being called
    event_type = event.kind        # ENTRY (0) or EXIT (1)

Results: Complete function call trace with proper nesting:

ENTER program
  ENTER statement
    ENTER expr
      ENTER term

3. Input Byte Offsets

# Extract byte offsets from taint forest
for node in trace.taint_forest.nodes():
    if node.source is not None:
        offset = trace.file_offset(node).offset  # Byte offset in input
        affected_cf = node.affected_control_flow # Whether byte influenced branching

Results: All input bytes tracked with control flow information.
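Since the goal is to reason about byte ranges rather than single bytes, the collected offsets can be collapsed into contiguous ranges. This is a sketch under the assumption that the offsets have already been gathered into a list of plain integers; `collapse_offsets` is a hypothetical helper, not part of PolyTracker.

```python
from typing import Iterable, List, Tuple


def collapse_offsets(offsets: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse byte offsets into inclusive (start, end) ranges."""
    ranges: List[Tuple[int, int]] = []
    for off in sorted(set(offsets)):
        if ranges and off == ranges[-1][1] + 1:
            # Offset directly follows the current range: extend it.
            ranges[-1] = (ranges[-1][0], off)
        else:
            # Gap found: start a new range.
            ranges.append((off, off))
    return ranges


# Hypothetical offsets, as they might be collected from the loop above.
byte_ranges = collapse_offsets([2, 0, 1, 5, 6, 9, 5])
print(byte_ranges)
```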

What's Missing: The Correlation

I cannot find a way to correlate these pieces together to answer:
"Which function accessed byte X?"

For example, given:

  • Input file: {i=1;}\n
  • Target byte: the = character at offset 2 (0-indexed)

I need to determine: "The expr function accessed byte 2"
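To make the desired result concrete, here is a sketch of the mapping I am after. It assumes a single ordered stream that interleaves function entry/exit with byte reads. That stream is an assumption: I have not found anything like it in v4.0.0, and the record shapes, the `bytes_to_functions` helper, and the sample records below are all made up for illustration, not actual trace output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple, Union

# Hypothetical record shapes: ("enter", name), ("exit", name), ("read", offset).
Record = Tuple[str, Union[str, int]]


def bytes_to_functions(records: Iterable[Record]) -> Dict[int, List[str]]:
    """Attribute each byte read to the function on top of the call stack."""
    stack: List[str] = []
    accessed: Dict[int, List[str]] = defaultdict(list)
    for kind, value in records:
        if kind == "enter":
            stack.append(str(value))
        elif kind == "exit":
            if stack:
                stack.pop()
        elif kind == "read" and stack:
            fn = stack[-1]
            if fn not in accessed[int(value)]:
                accessed[int(value)].append(fn)
    return dict(accessed)


data = b"{i=1;}\n"  # the example input; data[2:3] is b"="

# Made-up records for illustration only.
records: List[Record] = [
    ("enter", "program"), ("read", 0),
    ("enter", "statement"), ("read", 1),
    ("enter", "expr"), ("read", 2),
    ("exit", "expr"), ("exit", "statement"), ("exit", "program"),
]
mapping = bytes_to_functions(records)
print(mapping[2])
```

Attributing a read to the innermost active function is only one possible policy; attributing it to every function on the stack would instead answer "which call paths saw this byte".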

What I've Tried

Approach 1: TDEvent attributes

for event in events_section:
    # event has: fnidx, kind
    # event does NOT have: label, taints, bytes_accessed

Result: Events know which function, but not which bytes it accessed.

Approach 2: Taint nodes

for node in forest.nodes():
    # node has: label, source, affected_control_flow
    # node does NOT have: function, event, accessed_by

Result: Nodes know which byte, but not which function accessed it.

Approach 3: Documented methods

# These raise NotImplementedError:
trace.access_sequence()  # NotImplementedError
trace.function_trace()   # NotImplementedError
for event in trace:      # NotImplementedError (via __iter__)

Result: Documented API methods are not implemented in v4.0.0.

Approach 4: Control Flow Log

from polytracker.taint_dag import TDControlFlowLogSection

# CF log has function_id_mapping but unclear how to correlate with taints

Result: function_id_mapping exists, but it is a method rather than an attribute, and calling it returns an empty result.

Questions

  1. Is there an API I'm missing?
    Is there a method/property that links taint labels to the events/functions that accessed them?

  2. Should I use a different trace format?
    Issue #6534 ("Emitting and loading a DBProgramTrace instead of a TDProgramTrace") discusses DBProgramTrace vs. TDProgramTrace. Can I generate .db files for which access_sequence() actually works?

  3. Is this data available internally but not exposed?
    If the correlation exists internally but isn't exposed via Python API, would you accept a PR to add it?

  4. Alternative approach?
    Is there a recommended way to achieve this byte-to-function mapping with the current v4.0.0 API?

Minimal Reproduction

# Instrument a C program
docker run --rm --platform linux/amd64 \
    -v $(pwd):/workdir -w /workdir \
    trailofbits/polytracker bash -c \
    "polytracker build clang program.c -o program && \
     polytracker instrument-targets --taint --ftrace program"

# Execute with stdin tracking
docker run --rm --platform linux/amd64 \
    -v $(pwd):/workdir -w /workdir \
    -e POLYDB=polytracker.tdag \
    -e POLYTRACKER_STDIN_SOURCE=1 \
    trailofbits/polytracker \
    bash -c "./program.instrumented < input.txt"

# Analyze the trace
docker run --rm --platform linux/amd64 \
    -v $(pwd):/workdir -w /workdir \
    trailofbits/polytracker python3 -c "
from polytracker import PolyTrackerTrace
from polytracker.taint_dag import TDFunctionsSection, TDEventsSection, TDStringSection

trace = PolyTrackerTrace.load('polytracker.tdag')

# Can extract functions and events separately,
# but cannot correlate which functions accessed which bytes
"

Use Case

This mapping would enable:

  • Parser debugging: Identify which function mishandled a specific byte
  • Security analysis: Find which code paths process attacker-controlled bytes
  • Execution visualization: Create diagrams showing byte flow through functions
  • Performance analysis: Identify hot paths for specific input patterns
