This project fine-tunes the Wav2Vec2 model for automatic speech recognition (ASR) using the Vercellotti dataset. The pipeline involves data preprocessing, model training, and evaluation.
- `asr_wav2vec.py` - Main script for fine-tuning the ASR model.
- `preprocessing.py` - Handles downloading and processing of the dataset.
- `vocab.json` - Vocabulary file for tokenization.
- `run_config.json` - Configuration file saved during training.
- `output/` - Directory where trained models and logs are saved.
The dataset is sourced from TalkBank. The preprocessing script:
- Downloads and extracts transcripts and audio files.
- Cleans transcripts and aligns them with corresponding audio segments.
- Splits data into training, validation, and test sets.
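A minimal sketch of the split step, assuming the preprocessing stage collects aligned audio/transcript pairs into a list of dicts; the helper name, file names, and split ratios are illustrative, not the exact implementation in `preprocessing.py`:

```python
import random

def split_dataset(samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle aligned (audio, transcript) samples and split them
    into train/validation/test subsets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = samples[:n_test]
    val = samples[n_test:n_test + n_val]
    train = samples[n_test + n_val:]
    return train, val, test

# Example: each sample pairs an audio segment with its cleaned transcript.
samples = [{"audio": "clip_001.wav", "text": "hello there"},
           {"audio": "clip_002.wav", "text": "good morning"}]
train, val, test = split_dataset(samples)
```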
Run the preprocessing script to download and process the dataset:
```bash
python preprocessing.py
```

Run the main script to fine-tune the model:

```bash
python asr_wav2vec.py --data_dir <path_to_data> --output_dir <path_to_output> --use_cuda True --finetune True
```

- `--data_dir`: Directory containing the dataset.
- `--output_dir`: Directory to save the trained model.
- `--use_cuda`: Whether to use GPU for training.
- `--finetune`: If `True`, the model is trained on new data.
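The flags above map naturally onto an `argparse` parser. A sketch of how `asr_wav2vec.py` might declare them (the defaults and boolean parsing are assumptions, not necessarily the script's exact code):

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Fine-tune Wav2Vec2 for ASR")
    parser.add_argument("--data_dir", type=str, required=True,
                        help="Directory containing the preprocessed dataset")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Directory to save the trained model and logs")
    # Boolean flags are passed as strings ("True"/"False") on the command line.
    parser.add_argument("--use_cuda", type=lambda s: s.lower() == "true",
                        default=True, help="Whether to use GPU for training")
    parser.add_argument("--finetune", type=lambda s: s.lower() == "true",
                        default=True, help="If True, the model is trained on new data")
    return parser.parse_args()

args = parse_args()
```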
The script logs results using Weights & Biases, including:
- Word Error Rate (WER) computation.
- Sample predictions from the test set.
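A hedged sketch of how the WER and sample predictions could be computed and logged, using the `jiwer` package and the `wandb` client; the metric keys, project name, and helper function are assumptions for illustration:

```python
import jiwer
import wandb

def log_eval_results(references, predictions):
    """Compute corpus-level WER and log it, along with a few
    sample predictions, to Weights & Biases."""
    wer = jiwer.wer(references, predictions)
    wandb.log({"test/wer": wer})

    # Log a small table of reference/prediction pairs for manual inspection.
    table = wandb.Table(columns=["reference", "prediction"])
    for ref, pred in zip(references[:10], predictions[:10]):
        table.add_data(ref, pred)
    wandb.log({"test/sample_predictions": table})

wandb.init(project="wav2vec2-asr")  # project name is illustrative
log_eval_results(["hello there"], ["hello their"])
```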
This implementation uses the Hugging Face Wav2Vec2 model and the Vercellotti dataset from TalkBank.
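For reference, a minimal sketch of loading the Hugging Face Wav2Vec2 components for CTC fine-tuning with the project's `vocab.json`; the base checkpoint name and special-token choices are assumptions, not necessarily what `asr_wav2vec.py` uses:

```python
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC)

# Tokenizer built from the project's vocab.json (character-level CTC vocabulary).
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0, do_normalize=True,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# "facebook/wav2vec2-base" is an assumed base checkpoint; its CTC head is
# newly initialized with the size of the project vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
```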