-
Notifications
You must be signed in to change notification settings - Fork 468
fix(profiling): fix absurdly high lock hold times in Lock Profiler [backport 3.19] #15463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 3.19
Are you sure you want to change the base?
Conversation
…15408) ## Fix lock profiler generating release samples for non-sampled acquires ### Problem As reported by @dchayes-dd in our Slack channel: Lock Profiler displays unrealistic lock hold times (either too large or even negative). The negative values are legitimate int64 overflows, which in itself is a red herring. The root cause is described below. **Example 1** * [Flame graph](https://app.datadoghq.com/profiling/explorer?query=service%3Awatchdog_explorer%20datacenter%3Aus1.prod.dog&agg_m=%40prof_core_cpu_cores&agg_m_source=base&agg_q=service&agg_q_source=base&agg_t=sum&fromUser=true&my_code=enabled&profile_type=lock-hold-time&profiling-timeline__stack_order=top-down&profiling-timeline__summary_tab=breakdown&profiling-timeline__tb=1763580147405000000&profiling-timeline__tf_f_0=0&profiling-timeline__tf_f_1=121233000000&profiling-timeline__tf_us_0=0&profiling-timeline__tf_us_1=121233000000&refresh_mode=paused&selected_tf=1763579722252.7107%2C1763580268638&top_n=100&top_o=top&viz=flame_graph&x_missing=true&from_ts=1763568998228&to_ts=1763583398228&live=false) with negative lock hold time: `-970 days / min` <img width="1546" height="322" alt="Screenshot 2025-11-24 at 10 08 09 PM" src="https://github.com/user-attachments/assets/e2f2e38b-3282-493f-bd17-8f7545776d51" /> **Example 2** * [Flame graph](https://app.datadoghq.com/profiling/explorer?query=service%3Awatchdog_explorer%20datacenter%3Aus1.prod.dog%20runtime-id%3Adc67db679fe041f1b1d04dbc8a903e22&agg_m=%40prof_python_cpu_cores&agg_m_source=base&agg_q=service&agg_q_source=base&agg_t=sum&extra_search_fields=%7B%22filters_query%22%3A%22%22%2C%22sample_type%22%3A%22cpu-time%22%7D&fromUser=true&my_code=enabled&profile_type=lock-hold-time&profiling-timeline__stack_order=top-down&profiling-timeline__summary_tab=breakdown&profiling-timeline__tb=1763580092026000000&profiling-timeline__tf_f_0=0&profiling-timeline__tf_f_1=121249000000&profiling-timeline__tf_us_0=0&profiling-timeline__tf_us_1=121249000000&refresh_mode=paused&selected_tf=1763580092026%2C1763580213275&top_n=100&top_o=top&viz=flame_graph&x_missing=true&from_ts=1763579722252&to_ts=1763580268638&live=false) with absurdly high lock hold time: `19,000 days / min` <img width="1532" height="313" alt="Screenshot 2025-11-24 at 10 16 05 PM" src="https://github.com/user-attachments/assets/c7e056ed-faf5-4707-9ddd-002eb3b2234a" /> ### Impact _**This affected virtually ALL customers, unless they set the sampling rate very high (close to 100%).**_ With default `capture_pct = 1.0` (1%): - **99% of acquires** → ARE NOT sampled (`capture()` returns `False`) - **99% of releases** → ARE sampled with `duration = system_uptime` (hours, days, months?) ==> Only **1% of logged releases samples** are legitimate In general, the lower the sampling rate, the worse the problem. ### Root Cause When `capture_sampler.capture()` returned `False` (due to sampling rate), the acquire event was correctly skipped, but the release event was still being sampled; and that sample's lock hold time was equal to `system_uptime` (hours, days, months?) ### Fix Initialize `acquire_time` to None, instead of 0, because `0 is not None` and is a valid timestamp (a buggy check in `_release` that was letting through fake samples.) Non-goal: removed the try / catch guarding deletion of sample's `acquire_time` attribute, since there is a race condition if multiple threads try to release one lock (which is allowed in Python.) Instead, we now explicitly reset the value to None, which is not subject to races. ### Testing #### Unit test * test commit - no fix yet - FAIL ``` $ git checkout ec010c3 $ scripts/ddtest riot run --pass-env 116bda6 -- -k test_release_not_sampled_when_acquire_not_sampled ... # release samples should NOT be generated when acquire wasn't sampled > assert len(release_samples) == 0, ( f"Expected no release samples when acquire wasn't sampled, got {len(release_samples)}" ) E AssertionError: Expected no release samples when acquire wasn't sampled, got 1 E assert 1 == 0 ``` * fix commit - PASS ``` $ git checkout ffeb5c6 $ scripts/ddtest riot run --pass-env 116bda6 -- -k test_release_not_sampled_when_acquire_not_sampled ... collected 142 items / 140 deselected / 2 selected tests/profiling/collector/test_threading.py::TestThreadingLockCollector::test_release_not_sampled_when_acquire_not_sampled[py3.13] PASSED tests/profiling/collector/test_threading.py::TestThreadingRLockCollector::test_release_not_sampled_when_acquire_not_sampled[py3.13] PASSED ``` #### Manual Validation Tested with [reproduction script (`repro_lock_profiler.py`)](https://github.com/DataDog/dd-trace-py/compare/vlad/lockprof-fix-absurdly-inflated-lock-hold-times-test-script) running 2000 lock operations with 1ms hold time at 1% sampling rate: **Expected Profile** The script acquires locks for 1ms each with a 1% sampling rate. 2000 lock ops × 1ms = 2 seconds total hold time. With 1% sampling rate = ~20ms of lock hold samples. **BEFORE (buggy):** **1.04 seconds/min** - [Profile link](https://app.datadoghq.com/profiling/explorer?query=host%3ACOMP-LR7JK0FKW1%20service%3Alockprof-repro-before&profile_type=lock-hold-time) - Flamegraph dominated by 809ms in `threading.py:522` and other bogus samples - Expected ~20ms of legitimate samples, got **1040ms total** (52x inflation) <img width="1615" height="690" alt="high_hold_times_repro_before" src="https://github.com/user-attachments/assets/0baf8041-2b7d-4324-ad21-8db178d15dda" /> **AFTER (fixed):** **23 milliseconds/min** - [Profile link](https://app.datadoghq.com/profiling/explorer?query=host%3ACOMP-LR7JK0FKW1%20service%3Alockprof-repro-after&profile_type=lock-hold-time) - Clean flamegraph showing only actual lock hold at `repro_lock_profiler.py:56` - Expected ~20ms, got **23ms** - **~45x reduction** in bogus lock time <img width="1611" height="584" alt="high_hold_times_repro_after" src="https://github.com/user-attachments/assets/68c7811a-a00a-4eec-84ab-0f4d875d80d5" /> (cherry picked from commit 2cfb922)
|
|
Bootstrap import analysisComparison of import times between this PR and base. SummaryThe average import time from this PR is: 249 ± 2 ms. The average import time from base is: 251 ± 3 ms. The import time difference between this PR and base is: -2.0 ± 0.1 ms. Import time breakdownThe following import paths have shrunk:
|
Performance SLOsComparing candidate backport-15408-to-3.19 (5d7bec9) with baseline 3.19 (ea3cfae) 📈 Performance Regressions (3 suites)📈 iastaspects - 118/118✅ add_aspectTime: ✅ 0.411µs (SLO: <10.000µs 📉 -95.9%) vs baseline: +2.2% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +4.5% ✅ add_inplace_aspectTime: ✅ 0.407µs (SLO: <10.000µs 📉 -95.9%) vs baseline: +1.0% Memory: ✅ 39.774MB (SLO: <41.500MB -4.2%) vs baseline: +4.7% ✅ add_inplace_noaspectTime: ✅ 0.317µs (SLO: <10.000µs 📉 -96.8%) vs baseline: -0.8% Memory: ✅ 39.813MB (SLO: <41.500MB -4.1%) vs baseline: +4.9% ✅ add_noaspectTime: ✅ 0.280µs (SLO: <10.000µs 📉 -97.2%) vs baseline: +1.2% Memory: ✅ 39.911MB (SLO: <41.500MB -3.8%) vs baseline: +5.1% ✅ bytearray_aspectTime: ✅ 1.290µs (SLO: <10.000µs 📉 -87.1%) vs baseline: -0.6% Memory: ✅ 39.774MB (SLO: <41.500MB -4.2%) vs baseline: +4.7% ✅ bytearray_extend_aspectTime: ✅ 1.533µs (SLO: <10.000µs 📉 -84.7%) vs baseline: -0.5% Memory: ✅ 39.754MB (SLO: <41.500MB -4.2%) vs baseline: +4.8% ✅ bytearray_extend_noaspectTime: ✅ 0.615µs (SLO: <10.000µs 📉 -93.9%) vs baseline: +0.9% Memory: ✅ 39.754MB (SLO: <41.500MB -4.2%) vs baseline: +4.3% ✅ bytearray_noaspectTime: ✅ 0.481µs (SLO: <10.000µs 📉 -95.2%) vs baseline: -0.4% Memory: ✅ 39.754MB (SLO: <41.500MB -4.2%) vs baseline: +4.9% ✅ bytes_aspectTime: ✅ 1.287µs (SLO: <10.000µs 📉 -87.1%) vs baseline: -0.4% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +4.7% ✅ bytes_noaspectTime: ✅ 0.489µs (SLO: <10.000µs 📉 -95.1%) vs baseline: -1.2% Memory: ✅ 39.852MB (SLO: <41.500MB -4.0%) vs baseline: +5.0% ✅ bytesio_aspectTime: ✅ 1.314µs (SLO: <10.000µs 📉 -86.9%) vs baseline: -0.7% Memory: ✅ 39.793MB (SLO: <41.500MB -4.1%) vs baseline: +4.8% ✅ bytesio_noaspectTime: ✅ 0.500µs (SLO: <10.000µs 📉 -95.0%) vs baseline: +0.8% Memory: ✅ 39.774MB (SLO: <41.500MB -4.2%) vs baseline: +4.9% ✅ capitalize_aspectTime: ✅ 0.740µs (SLO: <10.000µs 📉 -92.6%) vs baseline: +1.1% Memory: ✅ 39.793MB (SLO: <41.500MB -4.1%) vs baseline: +4.9% ✅ capitalize_noaspectTime: ✅ 0.442µs (SLO: <10.000µs 📉 -95.6%) vs baseline: +1.8% Memory: ✅ 39.793MB (SLO: <41.500MB -4.1%) vs baseline: +4.7% ✅ casefold_aspectTime: ✅ 0.733µs (SLO: <10.000µs 📉 -92.7%) vs baseline: -0.2% Memory: ✅ 39.774MB (SLO: <41.500MB -4.2%) vs baseline: +5.0% ✅ casefold_noaspectTime: ✅ 0.368µs (SLO: <10.000µs 📉 -96.3%) vs baseline: -1.1% Memory: ✅ 39.734MB (SLO: <41.500MB -4.3%) vs baseline: +4.6% ✅ decode_aspectTime: ✅ 0.729µs (SLO: <10.000µs 📉 -92.7%) vs baseline: -0.1% Memory: ✅ 39.852MB (SLO: <41.500MB -4.0%) vs baseline: +5.1% ✅ decode_noaspectTime: ✅ 0.419µs (SLO: <10.000µs 📉 -95.8%) vs baseline: +0.7% Memory: ✅ 39.754MB (SLO: <41.500MB -4.2%) vs baseline: +4.6% ✅ encode_aspectTime: ✅ 0.712µs (SLO: <10.000µs 📉 -92.9%) vs baseline: ~same Memory: ✅ 39.734MB (SLO: <41.500MB -4.3%) vs baseline: +4.4% ✅ encode_noaspectTime: ✅ 0.401µs (SLO: <10.000µs 📉 -96.0%) vs baseline: -0.5% Memory: ✅ 39.734MB (SLO: <41.500MB -4.3%) vs baseline: +4.4% ✅ format_aspectTime: ✅ 3.346µs (SLO: <10.000µs 📉 -66.5%) vs baseline: +0.3% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +4.8% ✅ format_map_aspectTime: ✅ 4.086µs (SLO: <10.000µs 📉 -59.1%) vs baseline: 📈 +15.1% Memory: ✅ 39.852MB (SLO: <41.500MB -4.0%) vs baseline: +4.9% ✅ format_map_noaspectTime: ✅ 0.770µs (SLO: <10.000µs 📉 -92.3%) vs baseline: -0.2% Memory: ✅ 39.892MB (SLO: <41.500MB -3.9%) vs baseline: +5.0% ✅ format_noaspectTime: ✅ 0.594µs (SLO: <10.000µs 📉 -94.1%) vs baseline: -0.6% Memory: ✅ 39.793MB (SLO: <41.500MB -4.1%) vs baseline: +5.0% ✅ index_aspectTime: ✅ 0.359µs (SLO: <10.000µs 📉 -96.4%) vs baseline: -0.3% Memory: ✅ 39.872MB (SLO: <41.500MB -3.9%) vs baseline: +5.4% ✅ index_noaspectTime: ✅ 0.275µs (SLO: <10.000µs 📉 -97.3%) vs baseline: ~same Memory: ✅ 39.774MB (SLO: <41.500MB -4.2%) vs baseline: +4.6% ✅ join_aspectTime: ✅ 1.391µs (SLO: <10.000µs 📉 -86.1%) vs baseline: -0.9% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +5.0% ✅ join_noaspectTime: ✅ 0.491µs (SLO: <10.000µs 📉 -95.1%) vs baseline: +0.1% Memory: ✅ 39.675MB (SLO: <41.500MB -4.4%) vs baseline: +4.6% ✅ ljust_aspectTime: ✅ 2.538µs (SLO: <20.000µs 📉 -87.3%) vs baseline: -0.4% Memory: ✅ 39.774MB (SLO: <41.500MB -4.2%) vs baseline: +4.7% ✅ ljust_noaspectTime: ✅ 0.409µs (SLO: <10.000µs 📉 -95.9%) vs baseline: +1.1% Memory: ✅ 39.892MB (SLO: <41.500MB -3.9%) vs baseline: +5.3% ✅ lower_aspectTime: ✅ 2.219µs (SLO: <10.000µs 📉 -77.8%) vs baseline: +0.5% Memory: ✅ 39.813MB (SLO: <41.500MB -4.1%) vs baseline: +4.9% ✅ lower_noaspectTime: ✅ 0.371µs (SLO: <10.000µs 📉 -96.3%) vs baseline: +1.3% Memory: ✅ 39.793MB (SLO: <41.500MB -4.1%) vs baseline: +4.9% ✅ lstrip_aspectTime: ✅ 2.236µs (SLO: <20.000µs 📉 -88.8%) vs baseline: -0.2% Memory: ✅ 39.754MB (SLO: <41.500MB -4.2%) vs baseline: +4.8% ✅ lstrip_noaspectTime: ✅ 0.384µs (SLO: <10.000µs 📉 -96.2%) vs baseline: +1.7% Memory: ✅ 39.852MB (SLO: <41.500MB -4.0%) vs baseline: +5.1% ✅ modulo_aspectTime: ✅ 0.993µs (SLO: <10.000µs 📉 -90.1%) vs baseline: +0.4% Memory: ✅ 39.872MB (SLO: <41.500MB -3.9%) vs baseline: +5.2% ✅ modulo_aspect_for_bytearray_bytearrayTime: ✅ 1.536µs (SLO: <10.000µs 📉 -84.6%) vs baseline: ~same Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +5.2% ✅ modulo_aspect_for_bytesTime: ✅ 0.973µs (SLO: <10.000µs 📉 -90.3%) vs baseline: -0.5% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +4.8% ✅ modulo_aspect_for_bytes_bytearrayTime: ✅ 1.196µs (SLO: <10.000µs 📉 -88.0%) vs baseline: -1.3% Memory: ✅ 39.852MB (SLO: <41.500MB -4.0%) vs baseline: +5.1% ✅ modulo_noaspectTime: ✅ 0.629µs (SLO: <10.000µs 📉 -93.7%) vs baseline: +0.4% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +5.0% ✅ replace_aspectTime: ✅ 4.899µs (SLO: <10.000µs 📉 -51.0%) vs baseline: +0.4% Memory: ✅ 39.931MB (SLO: <41.500MB -3.8%) vs baseline: +5.3% ✅ replace_noaspectTime: ✅ 0.459µs (SLO: <10.000µs 📉 -95.4%) vs baseline: -0.3% Memory: ✅ 39.734MB (SLO: <41.500MB -4.3%) vs baseline: +4.5% ✅ repr_aspectTime: ✅ 0.910µs (SLO: <10.000µs 📉 -90.9%) vs baseline: -0.6% Memory: ✅ 39.892MB (SLO: <41.500MB -3.9%) vs baseline: +5.1% ✅ repr_noaspectTime: ✅ 0.423µs (SLO: <10.000µs 📉 -95.8%) vs baseline: +1.1% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +5.1% ✅ rstrip_aspectTime: ✅ 1.876µs (SLO: <20.000µs 📉 -90.6%) vs baseline: -1.2% Memory: ✅ 39.892MB (SLO: <41.500MB -3.9%) vs baseline: +5.1% ✅ rstrip_noaspectTime: ✅ 0.379µs (SLO: <10.000µs 📉 -96.2%) vs baseline: ~same Memory: ✅ 39.734MB (SLO: <41.500MB -4.3%) vs baseline: +4.9% ✅ slice_aspectTime: ✅ 0.497µs (SLO: <10.000µs 📉 -95.0%) vs baseline: +0.1% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +5.1% ✅ slice_noaspectTime: ✅ 0.447µs (SLO: <10.000µs 📉 -95.5%) vs baseline: -0.7% Memory: ✅ 39.793MB (SLO: <41.500MB -4.1%) vs baseline: +5.0% ✅ stringio_aspectTime: ✅ 1.596µs (SLO: <10.000µs 📉 -84.0%) vs baseline: +3.4% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +4.8% ✅ stringio_noaspectTime: ✅ 0.721µs (SLO: <10.000µs 📉 -92.8%) vs baseline: ~same Memory: ✅ 39.754MB (SLO: <41.500MB -4.2%) vs baseline: +4.9% ✅ strip_aspectTime: ✅ 2.259µs (SLO: <20.000µs 📉 -88.7%) vs baseline: +2.3% Memory: ✅ 39.715MB (SLO: <41.500MB -4.3%) vs baseline: +4.5% ✅ strip_noaspectTime: ✅ 0.386µs (SLO: <10.000µs 📉 -96.1%) vs baseline: +0.9% Memory: ✅ 39.813MB (SLO: <41.500MB -4.1%) vs baseline: +5.1% ✅ swapcase_aspectTime: ✅ 2.399µs (SLO: <10.000µs 📉 -76.0%) vs baseline: -0.6% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +4.8% ✅ swapcase_noaspectTime: ✅ 0.539µs (SLO: <10.000µs 📉 -94.6%) vs baseline: +0.3% Memory: ✅ 39.754MB (SLO: <41.500MB -4.2%) vs baseline: +4.7% ✅ title_aspectTime: ✅ 2.333µs (SLO: <10.000µs 📉 -76.7%) vs baseline: +0.2% Memory: ✅ 39.793MB (SLO: <41.500MB -4.1%) vs baseline: +4.7% ✅ title_noaspectTime: ✅ 0.505µs (SLO: <10.000µs 📉 -95.0%) vs baseline: +0.5% Memory: ✅ 39.793MB (SLO: <41.500MB -4.1%) vs baseline: +5.1% ✅ translate_aspectTime: ✅ 3.259µs (SLO: <10.000µs 📉 -67.4%) vs baseline: +1.4% Memory: ✅ 39.852MB (SLO: <41.500MB -4.0%) vs baseline: +5.0% ✅ translate_noaspectTime: ✅ 1.041µs (SLO: <10.000µs 📉 -89.6%) vs baseline: -0.2% Memory: ✅ 39.833MB (SLO: <41.500MB -4.0%) vs baseline: +5.0% ✅ upper_aspectTime: ✅ 2.199µs (SLO: <10.000µs 📉 -78.0%) vs baseline: -0.2% Memory: ✅ 39.813MB (SLO: <41.500MB -4.1%) vs baseline: +4.9% ✅ upper_noaspectTime: ✅ 0.371µs (SLO: <10.000µs 📉 -96.3%) vs baseline: +0.6% Memory: ✅ 39.734MB (SLO: <41.500MB -4.3%) vs baseline: +4.7% 📈 iastaspectsospath - 24/24✅ ospathbasename_aspectTime: ✅ 4.261µs (SLO: <10.000µs 📉 -57.4%) vs baseline: -1.5% Memory: ✅ 39.774MB (SLO: <41.000MB -3.0%) vs baseline: +4.8% ✅ ospathbasename_noaspectTime: ✅ 1.076µs (SLO: <10.000µs 📉 -89.2%) vs baseline: -1.0% Memory: ✅ 39.695MB (SLO: <41.000MB -3.2%) vs baseline: +4.5% ✅ ospathjoin_aspectTime: ✅ 6.150µs (SLO: <10.000µs 📉 -38.5%) vs baseline: -1.1% Memory: ✅ 39.813MB (SLO: <41.000MB -2.9%) vs baseline: +5.3% ✅ ospathjoin_noaspectTime: ✅ 2.283µs (SLO: <10.000µs 📉 -77.2%) vs baseline: -0.3% Memory: ✅ 39.656MB (SLO: <41.000MB -3.3%) vs baseline: +4.4% ✅ ospathnormcase_aspectTime: ✅ 3.651µs (SLO: <10.000µs 📉 -63.5%) vs baseline: +1.0% Memory: ✅ 39.852MB (SLO: <41.000MB -2.8%) vs baseline: +4.8% ✅ ospathnormcase_noaspectTime: ✅ 0.576µs (SLO: <10.000µs 📉 -94.2%) vs baseline: +0.3% Memory: ✅ 39.734MB (SLO: <41.000MB -3.1%) vs baseline: +4.8% ✅ ospathsplit_aspectTime: ✅ 4.821µs (SLO: <10.000µs 📉 -51.8%) vs baseline: -0.4% Memory: ✅ 39.872MB (SLO: <41.000MB -2.8%) vs baseline: +5.2% ✅ ospathsplit_noaspectTime: ✅ 1.598µs (SLO: <10.000µs 📉 -84.0%) vs baseline: +0.1% Memory: ✅ 39.793MB (SLO: <41.000MB -2.9%) vs baseline: +5.1% ✅ ospathsplitdrive_aspectTime: ✅ 4.257µs (SLO: <10.000µs 📉 -57.4%) vs baseline: 📈 +13.3% Memory: ✅ 39.833MB (SLO: <41.000MB -2.8%) vs baseline: +4.9% ✅ ospathsplitdrive_noaspectTime: ✅ 0.706µs (SLO: <10.000µs 📉 -92.9%) vs baseline: +0.8% Memory: ✅ 39.734MB (SLO: <41.000MB -3.1%) vs baseline: +4.3% ✅ ospathsplitext_aspectTime: ✅ 4.567µs (SLO: <10.000µs 📉 -54.3%) vs baseline: -0.8% Memory: ✅ 39.754MB (SLO: <41.000MB -3.0%) vs baseline: +4.9% ✅ ospathsplitext_noaspectTime: ✅ 1.370µs (SLO: <10.000µs 📉 -86.3%) vs baseline: -0.6% Memory: ✅ 39.675MB (SLO: <41.000MB -3.2%) vs baseline: +4.6% 📈 telemetryaddmetric - 30/30✅ 1-count-metric-1-timesTime: ✅ 3.322µs (SLO: <20.000µs 📉 -83.4%) vs baseline: 📈 +11.8% Memory: ✅ 34.406MB (SLO: <35.500MB -3.1%) vs baseline: +4.7% ✅ 1-count-metrics-100-timesTime: ✅ 203.839µs (SLO: <220.000µs -7.3%) vs baseline: ~same Memory: ✅ 34.544MB (SLO: <35.500MB -2.7%) vs baseline: +4.9% ✅ 1-distribution-metric-1-timesTime: ✅ 3.293µs (SLO: <20.000µs 📉 -83.5%) vs baseline: +0.4% Memory: ✅ 34.505MB (SLO: <35.500MB -2.8%) vs baseline: +4.8% ✅ 1-distribution-metrics-100-timesTime: ✅ 215.548µs (SLO: <220.000µs -2.0%) vs baseline: +0.9% Memory: ✅ 34.465MB (SLO: <35.500MB -2.9%) vs baseline: +4.9% ✅ 1-gauge-metric-1-timesTime: ✅ 2.506µs (SLO: <20.000µs 📉 -87.5%) vs baseline: 📈 +15.4% Memory: ✅ 34.505MB (SLO: <35.500MB -2.8%) vs baseline: +4.8% ✅ 1-gauge-metrics-100-timesTime: ✅ 136.971µs (SLO: <150.000µs -8.7%) vs baseline: +0.1% Memory: ✅ 34.465MB (SLO: <35.500MB -2.9%) vs baseline: +4.9% ✅ 1-rate-metric-1-timesTime: ✅ 3.479µs (SLO: <20.000µs 📉 -82.6%) vs baseline: 📈 +12.2% Memory: ✅ 34.485MB (SLO: <35.500MB -2.9%) vs baseline: +4.9% ✅ 1-rate-metrics-100-timesTime: ✅ 216.560µs (SLO: <250.000µs 📉 -13.4%) vs baseline: -0.8% Memory: ✅ 34.446MB (SLO: <35.500MB -3.0%) vs baseline: +4.6% ✅ 100-count-metrics-100-timesTime: ✅ 20.336ms (SLO: <22.000ms -7.6%) vs baseline: +0.1% Memory: ✅ 34.485MB (SLO: <35.500MB -2.9%) vs baseline: +4.8% ✅ 100-distribution-metrics-100-timesTime: ✅ 2.220ms (SLO: <2.300ms -3.5%) vs baseline: -1.2% Memory: ✅ 34.524MB (SLO: <35.500MB -2.7%) vs baseline: +5.0% ✅ 100-gauge-metrics-100-timesTime: ✅ 1.404ms (SLO: <1.550ms -9.4%) vs baseline: ~same Memory: ✅ 34.505MB (SLO: <35.500MB -2.8%) vs baseline: +5.0% ✅ 100-rate-metrics-100-timesTime: ✅ 2.241ms (SLO: <2.550ms 📉 -12.1%) vs baseline: +1.2% Memory: ✅ 34.485MB (SLO: <35.500MB -2.9%) vs baseline: +5.0% ✅ flush-1-metricTime: ✅ 4.532µs (SLO: <20.000µs 📉 -77.3%) vs baseline: +1.1% Memory: ✅ 34.505MB (SLO: <35.500MB -2.8%) vs baseline: +4.9% ✅ flush-100-metricsTime: ✅ 175.998µs (SLO: <250.000µs 📉 -29.6%) vs baseline: -0.5% Memory: ✅ 34.819MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +4.6% ✅ flush-1000-metricsTime: ✅ 2.117ms (SLO: <2.500ms 📉 -15.3%) vs baseline: -0.6% Memory: ✅ 35.291MB (SLO: <36.500MB -3.3%) vs baseline: +4.8% 🟡 Near SLO Breach (2 suites)🟡 djangosimple - 30/30✅ appsecTime: ✅ 20.515ms (SLO: <22.300ms -8.0%) vs baseline: ~same Memory: ✅ 67.884MB (SLO: <70.500MB -3.7%) vs baseline: +4.5% ✅ exception-replay-enabledTime: ✅ 1.339ms (SLO: <1.450ms -7.7%) vs baseline: ~same Memory: ✅ 66.238MB (SLO: <67.500MB 🟡 -1.9%) vs baseline: +4.8% ✅ iastTime: ✅ 20.446ms (SLO: <22.250ms -8.1%) vs baseline: -0.2% Memory: ✅ 67.955MB (SLO: <70.000MB -2.9%) vs baseline: +4.8% ✅ profilerTime: ✅ 15.300ms (SLO: <16.550ms -7.6%) vs baseline: -0.8% Memory: ✅ 56.244MB (SLO: <57.500MB -2.2%) vs baseline: +4.9% ✅ resource-renamingTime: ✅ 20.588ms (SLO: <21.750ms -5.3%) vs baseline: ~same Memory: ✅ 67.891MB (SLO: <70.500MB -3.7%) vs baseline: +4.8% ✅ span-code-originTime: ✅ 25.413ms (SLO: <28.200ms -9.9%) vs baseline: ~same Memory: ✅ 69.238MB (SLO: <71.000MB -2.5%) vs baseline: +4.7% ✅ tracerTime: ✅ 20.502ms (SLO: <21.750ms -5.7%) vs baseline: +0.1% Memory: ✅ 67.983MB (SLO: <70.000MB -2.9%) vs baseline: +4.6% ✅ tracer-and-profilerTime: ✅ 22.350ms (SLO: <23.500ms -4.9%) vs baseline: -0.7% Memory: ✅ 69.481MB (SLO: <71.000MB -2.1%) vs baseline: +4.9% ✅ tracer-dont-create-db-spansTime: ✅ 19.347ms (SLO: <21.500ms 📉 -10.0%) vs baseline: -0.3% Memory: ✅ 67.927MB (SLO: <70.000MB -3.0%) vs baseline: +4.7% ✅ tracer-minimalTime: ✅ 16.625ms (SLO: <17.500ms -5.0%) vs baseline: ~same Memory: ✅ 67.972MB (SLO: <70.000MB -2.9%) vs baseline: +4.9% ✅ tracer-nativeTime: ✅ 20.611ms (SLO: <21.750ms -5.2%) vs baseline: +0.6% Memory: ✅ 68.011MB (SLO: <72.500MB -6.2%) vs baseline: +4.8% ✅ tracer-no-cachesTime: ✅ 18.468ms (SLO: <19.650ms -6.0%) vs baseline: -0.4% Memory: ✅ 67.980MB (SLO: <70.000MB -2.9%) vs baseline: +4.7% ✅ tracer-no-databasesTime: ✅ 18.778ms (SLO: <20.100ms -6.6%) vs baseline: -0.3% Memory: ✅ 67.850MB (SLO: <70.000MB -3.1%) vs baseline: +5.0% ✅ tracer-no-middlewareTime: ✅ 20.078ms (SLO: <21.500ms -6.6%) vs baseline: -0.2% Memory: ✅ 67.970MB (SLO: <70.000MB -2.9%) vs baseline: +4.7% ✅ tracer-no-templatesTime: ✅ 20.357ms (SLO: <22.000ms -7.5%) vs baseline: -0.2% Memory: ✅ 68.028MB (SLO: <70.500MB -3.5%) vs baseline: +4.9% 🟡 packagespackageforrootmodulemapping - 4/4✅ cache_offTime: ✅ 344.303ms (SLO: <354.300ms -2.8%) vs baseline: +0.5% Memory: ✅ 40.680MB (SLO: <41.500MB 🟡 -2.0%) vs baseline: +5.9% ✅ cache_onTime: ✅ 0.386µs (SLO: <10.000µs 📉 -96.1%) vs baseline: +0.1% Memory: ✅ 38.681MB (SLO: <41.000MB -5.7%) vs baseline: +4.0%
|
Backport 2cfb922 from #15408 to 3.19.
Fix lock profiler generating release samples for non-sampled acquires
Problem
As reported by @dchayes-dd in our Slack channel: Lock Profiler displays unrealistic lock hold times (either too large or even negative). The negative values are legitimate int64 overflows, which in itself is a red herring. The root cause is described below.
Example 1
-970 days / minExample 2
19,000 days / minImpact
This affected virtually ALL customers, unless they set the sampling rate very high (close to 100%).
With default
capture_pct = 1.0(1%):capture()returnsFalse)duration = system_uptime(hours, days, months?)==> Only 1% of logged releases samples are legitimate
In general, the lower the sampling rate, the worse the problem.
Root Cause
When
capture_sampler.capture()returnedFalse(due to sampling rate), the acquire event was correctly skipped, but the release event was still being sampled; and that sample's lock hold time was equal tosystem_uptime(hours, days, months?)Fix
Initialize
acquire_timeto None, instead of 0, because0 is not Noneand is a valid timestamp (a buggy check in_releasethat was letting through fake samples.)Non-goal: removed the try / catch guarding deletion of sample's
acquire_timeattribute, since there is a race condition if multiple threads try to release one lock (which is allowed in Python.) Instead, we now explicitly reset the value to None, which is not subject to races.Testing
Unit test
Manual Validation
Tested with reproduction script (
repro_lock_profiler.py) running 2000 lock operations with 1ms hold time at 1% sampling rate:Expected Profile
The script acquires locks for 1ms each with a 1% sampling rate.
2000 lock ops × 1ms = 2 seconds total hold time.
With 1% sampling rate = ~20ms of lock hold samples.
BEFORE (buggy): 1.04 seconds/min
threading.py:522and other bogus samplesAFTER (fixed): 23 milliseconds/min
repro_lock_profiler.py:56