Audio statistics #3833

Samoed · 2026-01-03T13:22:10Z

I’ve started integrating audio statistics. For now, I’ve come up with this format. Do you have any suggestions?

class AudioStatistics(TypedDict):
    """Class for descriptive statistics for audio.

    Attributes:
        total_audio_seconds_length: Total length of all audio clips in seconds
        min_audio_seconds_length: Minimum length of audio clip in seconds
        average_audio_seconds_length: Average length of audio clip in seconds
        max_audio_seconds_length: Maximum length of audio clip in seconds
        unique_audios: Number of unique audio clips
        average_sampling_rate: Average sampling rate
        sampling_rates: Dict of unique sampling rates and their frequencies
    """

    total_audio_seconds_length: float

    min_audio_seconds_length: float
    average_audio_seconds_length: float
    max_audio_seconds_length: float

    unique_audios: int

    average_sampling_rate: float
    sampling_rates: dict[int, int]
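
A minimal sketch of how these fields could be filled in, for illustration only. It assumes each audio item is a dict with an "array" of mono samples and a "sampling_rate" in Hz, as in the diff discussed further down. The helper name sketch_audio_statistics and the exact-match rule for uniqueness are assumptions, not the code in this PR.

```python
from collections import Counter
from typing import TypedDict


class AudioStatistics(TypedDict):
    """Same fields as the format proposed above."""

    total_audio_seconds_length: float
    min_audio_seconds_length: float
    average_audio_seconds_length: float
    max_audio_seconds_length: float
    unique_audios: int
    average_sampling_rate: float
    sampling_rates: dict[int, int]


def sketch_audio_statistics(audios: list[dict]) -> AudioStatistics:
    """Hypothetical helper, not the function added in this PR."""
    if not audios:
        raise ValueError("need at least one audio clip")

    # Duration in seconds = number of samples / sampling rate (Hz).
    # Assumes each "array" is a 1-D (mono) sequence of samples.
    durations = [len(a["array"]) / a["sampling_rate"] for a in audios]
    rates = [a["sampling_rate"] for a in audios]

    # Assumption: two clips count as the same if rate and samples match exactly.
    unique = {(a["sampling_rate"], tuple(a["array"])) for a in audios}

    return AudioStatistics(
        total_audio_seconds_length=sum(durations),
        min_audio_seconds_length=min(durations),
        average_audio_seconds_length=sum(durations) / len(durations),
        max_audio_seconds_length=max(durations),
        unique_audios=len(unique),
        average_sampling_rate=sum(rates) / len(rates),
        sampling_rates=dict(Counter(rates)),
    )
```

Dividing the sample count by the sampling rate keeps every length field in seconds, which matches the field names.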

isaac-chung · 2026-01-03T13:38:36Z

When I see length, I think in seconds. I like the frames approach too, and I'd like it spelled out explicitly (num_frames or whatever). I'd like to see:

- the max/min/total number of seconds
- the unique set of sampling rates (specify unit)

Would love to hear other feedback as well while I read into it a bit more.

Samoed · 2026-01-03T15:44:04Z

Added seconds and sampling rates

isaac-chung

Sorry for adding more. Revisited some papers and maybe we should use the standard measure of audio dataset size.

mteb/types/statistics.py

mteb/types/_encoder_io.py

isaac-chung

Just wanted to align with HF notation, plus some questions.

isaac-chung · 2026-01-03T20:07:23Z

mteb/abstasks/_statistics_calculation.py

+
+    for audio in audios:
+        array = audio["array"]
+        sampling_rate = audio["sampling_rate"]


This line assumes there is the sampling_rate key. Based on what you mentioned, this will fail for some datasets then?

Yes, but it's better to fix them to improve benchmark quality overall

Could you please open an issue to track this?

I'm not sure what to open. Possible missing sampling rate?

Yeah, all audio should have sampling_rate. So if you say that's not true, then it's an issue.
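
One way to find the affected datasets before opening an issue is a small pre-check like the sketch below. The function name find_missing_sampling_rate is hypothetical, and it assumes audio items are plain dicts as in the diff above.

```python
def find_missing_sampling_rate(audios: list[dict]) -> list[int]:
    """Hypothetical check: indices of items without a usable sampling rate."""
    missing = []
    for index, audio in enumerate(audios):
        # .get() avoids the KeyError that audio["sampling_rate"] would raise.
        # A value of None or 0 is also treated as missing.
        if not audio.get("sampling_rate"):
            missing.append(index)
    return missing
```

An empty result means the direct audio["sampling_rate"] lookup is safe for that dataset.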

mteb/types/statistics.py

isaac-chung · 2026-01-03T20:13:30Z

mteb/types/statistics.py

+    unique_audios: int
+
+    average_sampling_rate: float
+    sampling_rates: dict[int, int]


Could this just be a unique set of sampling rates? OK either way.

Suggested change

-    sampling_rates: dict[int, int]
+    sampling_rates: list[int]
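
For comparison, a short sketch of the two representations under discussion. The sample values are made up and the rates are assumed to be in Hz.

```python
from collections import Counter

# Made-up sampling rates (Hz) for five clips.
rates_hz = [16000, 16000, 8000, 44100, 16000]

# dict[int, int]: each unique rate mapped to how many clips use it.
as_counts: dict[int, int] = dict(Counter(rates_hz))

# list[int]: only the unique rates, sorted for a stable order.
as_list: list[int] = sorted(set(rates_hz))

# The list can always be recovered from the dict, but not the reverse.
assert sorted(as_counts) == as_list
```

The dict keeps the per-rate counts that the list drops, at the cost of a slightly larger statistics file.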

Co-authored-by: Isaac Chung <[email protected]>

KennethEnevoldsen

Minor things - generally think this looks good (of course Isaac's comments still apply, but nothing more to add)

pyproject.toml

mteb/abstasks/_statistics_calculation.py

Co-authored-by: Kenneth Enevoldsen <[email protected]>

init

c2faf79

Samoed requested review from AdnanElAssadi56, KennethEnevoldsen and isaac-chung January 3, 2026 13:22

Samoed added the maeb Audio extension label Jan 3, 2026

update statistics

b801a17

isaac-chung reviewed Jan 3, 2026

mteb/types/statistics.py (outdated)

mteb/types/_encoder_io.py (outdated)

update statistics

972497c

isaac-chung reviewed Jan 3, 2026


Myahr208 mentioned this pull request Jan 3, 2026

<Ai/help/gemini> #3839

Closed


Update mteb/types/statistics.py

ff5e93a

Co-authored-by: Isaac Chung <[email protected]>

KennethEnevoldsen approved these changes Jan 4, 2026

pyproject.toml (outdated)

mteb/abstasks/_statistics_calculation.py (outdated)

Apply suggestions from code review

1f3d3ff

Co-authored-by: Kenneth Enevoldsen <[email protected]>
