This repository holds the official implementation of the SwissGPC (Swiss German Podcast Corpus) pipeline used to weakly label data collected from YouTube and the Swiss Broadcasting Corporation (SRG/SRF). As we do not hold any rights to the collected data, we cannot publish the annotated dataset itself. Instead, we publish the data pipeline that downloads, transcribes, and prepares the data for use in, for example, fine-tuning a model for Voice Adaptation TTS.
If you are interested in how we applied this data using the XTTSv2 architecture, check out our fork of the coqui-tts library here.
The podcasts in this dataset are listed below, including links to their host websites and their raw and cleaned audio sizes in hours. As outlined above, we do not hold rights to or ownership of these podcasts, so any changes on the platforms hosting them are out of our control. Keeping up with changing hyperlinks, partial or complete removals, and similar changes therefore falls outside the scope of this repository; we will try to provide a general overview of availability, but cannot guarantee to do so in real time. The podcasts were downloaded over a period spanning September 2024 to March 2025, so the listed hours reflect each podcast's length at its time of download and may not match its current length.

| SRF Podcast Name | Raw (h) | Clean (h) | vSwissGPC |
|---|---|---|---|
| #SRFglobal | 36.97 | 33.63 | v1.0 |
| 100 Sekunden Wissen | 186.75 | 152.12 | v1.0 |
| BuchZeichen | 365.10 | 305.62 | v2.0 |
| Debriefing 404 | 243.15 | 195.29 | v1.0 |
| Digital Podcast | 434.56 | 396.59 | v1.0 |
| Dini Mundart | 39.28 | 34.84 | v1.0 |
| Einfach Politik | 40.69 | 38.07 | v2.0 |
| Espresso | 565.84 | 500.50 | v2.0 |
| Focus | 807.08 | 630.22 | v2.0 |
| Gast am Mittag | 34.07 | 30.43 | v1.0 |
| Geek-Sofa | 314.01 | 267.16 | v1.0 |
| Input | 714.13 | 602.91 | v2.0 |
| SRF-Wissen | 44.78 | 39.17 | v1.0 |
| Krimi | 240.80 | 176.05 | v2.0 |
| Kultur-Talk | 55.57 | 51.33 | v1.0 |
| Literaturclub - Zwei mit Buch | 31.65 | 28.04 | v1.0 |
| Medientalk | 68.77 | 62.16 | v1.0 |
| Persönlich | 763.15 | 637.87 | v2.0 |
| Pipifax | 9.04 | 7.66 | v1.0 |
| Podcast am Pistenrand | 18.16 | 15.37 | v1.0 |
| Ratgeber | 574.46 | 445.64 | v2.0 |
| Rehmann | 213.87 | 182.79 | v2.0 |
| Samstagsrundschau | 414.45 | 382.33 | v1.0 |
| Sternstunde Philosophie | 158.67 | 136.70 | v1.0 |
| Sternstunde Religion | 60.58 | 53.90 | v1.0 |
| Sykora Gisler | 149.49 | 125.80 | v1.0 |
| Tagesgespräch | 1688.26 | 1557.43 | v1.0 |
| Ufwärmrundi | 60.72 | 54.95 | v1.0 |
| Vetters Töne | 25.37 | 20.13 | v1.0 |
| Wetterfrage | 65.52 | 59.02 | v1.0 |
| Wirtschaftswoche | 126.23 | 115.31 | v1.0 |
| Wissenschaftsmagazin | 403.10 | 347.52 | v1.0 |
| Zivadiliring | 49.80 | 42.55 | v1.0 |
| Zytlupe | 45.66 | 36.61 | v1.0 |
| Total | 9041.28 | 7765.72 | |

| YouTube Podcast Name | Raw (h) | Clean (h) | vSwissGPC |
|---|---|---|---|
| Auf Bewährung - Leben mit Gefängnis | 3.00 | 2.70 | v1.0 |
| Berner Jugendtreff | 127.80 | 89.61 | v1.0 |
| Ein Buch Ein Tee | 3.73 | 3.26 | v1.0 |
| expectations - geplant und ungeplant kinderfrei | 16.84 | 14.80 | v1.0 |
| Fadegrad | 49.95 | 42.40 | v1.0 |
| Feel Good Podcast | 319.60 | 261.43 | v1.0 |
| Finanz Fabio | 58.44 | 49.29 | v1.0 |
| Scho ghört | 23.45 | 20.47 | v1.0 |
| Sexologie - Wissen macht Lust | 15.41 | 13.57 | v1.0 |
| SRF Dokumentationen | 398.73 | 284.01 | v2.0 |
| SRF Reportagen | 196.39 | 148.10 | v2.0 |
| Über den Bücherrand | 14.53 | 12.59 | v1.0 |
| Ungerwegs Daheim | 38.67 | 31.08 | v1.0 |
| Wir müssen reden - Public Eye spricht Klartext | 17.52 | 15.54 | v1.0 |
| Total | 1277.47 | 988.85 | |

The data from YouTube is downloaded using pytubefix, while the SRF data was sourced via the official SRF API. For YouTube, the code expects a playlist of videos rather than a single video link, so that all episodes can be downloaded at once; SRF podcasts only require the podcast name, without any additional information. The pipeline downloads and transcribes the podcasts sequentially, i.e., one podcast after another; changing the code to run every step in batches should not take much effort.

The pipeline is controlled via config.yaml, where you set which podcast should be downloaded from which source and which pipeline steps should run; see the table below for more information about the parameters. Our setup uses HDF5 files, so all data is written to HDF5 files during segmentation; you can adapt this to your own setup.
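To make the config-driven control flow concrete, here is a rough sketch of how a config.yaml might be consumed. The key names used here (podcasts, source, playlist_url, name, steps) are illustrative assumptions, not the repository's actual schema; consult config.yaml and the parameter table for the real options.

```python
# Illustrative sketch only: the key names used here are assumptions, not
# the repository's actual schema. See config.yaml and the parameter table.
import yaml  # PyYAML

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

for podcast in cfg.get("podcasts", []):
    if podcast["source"] == "youtube":
        # YouTube entries point at a playlist so all episodes are fetched
        target = podcast["playlist_url"]
    else:
        # SRF entries only need the podcast name; the API resolves the rest
        target = podcast["name"]
    for step in cfg.get("steps", ["download", "transcribe", "segment"]):
        print(f"running {step} for {target}")  # stand-in for the real step
```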
Create the environment and install dependencies with uv:

```bash
uv venv   # create a virtual environment
uv sync   # install dependencies defined in pyproject.toml
```

Run the pipeline:
```bash
uv run python main.py --config config.yaml
```

A lightweight CLI helper is provided that loads the saved logistic-regression pipeline and runs it on either phoneme transcripts or raw audio files. To use it you first need to have the uv environment active (see above); the necessary requirements (whisperx, phonemizer, scikit-learn, etc.) are already listed in pyproject.toml.
```bash
# classify two phoneme transcripts
uv run python -m src.classification_i4ds.classify_dialect path/utt1.phon path/utt2.phon

# classify audio files (requires whisperx + phonemizer)
uv run python -m src.classification_i4ds.classify_dialect audio1.wav audio2.wav \
    --output predictions.csv
```

The script prints a tab-separated summary and optionally writes a CSV if `--output` is supplied. Internally it calls `src.classification_i4ds.best_dialect_model.load_model` and feeds the input through the same preprocessing used during training.
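For programmatic use, a minimal sketch is shown below. It assumes that load_model returns a scikit-learn-style pipeline whose predict method accepts phoneme transcripts directly; this is an assumption based on the description above, not a documented contract.

```python
# Minimal sketch of programmatic use. Assumption: load_model() returns a
# scikit-learn-style pipeline whose predict() accepts phoneme transcripts.
from pathlib import Path

from src.classification_i4ds.best_dialect_model import load_model

model = load_model()

paths = ["path/utt1.phon", "path/utt2.phon"]
transcripts = [Path(p).read_text(encoding="utf-8") for p in paths]

for path, label in zip(paths, model.predict(transcripts)):
    print(f"{path}\t{label}")  # mirrors the CLI's tab-separated summary
```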