This project focuses on robust understanding for customer-service outbound voice robots in complex real-world conditions.
In practical outbound-call scenarios, the SLU system is easily affected by:
- environmental noise
- background human speech
- conversation segments not directed at the system
- colloquial and non-standard expressions
These disturbances can lead to invalid responses and false triggering. To address them, we design an invalid-speech rejection solution for complex scenarios that improves rejection robustness while maintaining intent-understanding performance.
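One common way to implement this kind of rejection, shown here as a minimal sketch rather than the project's actual model code in `slu_models_e2e.py`, is to pair the intent classifier's output with a binary valid/invalid decision and a confidence threshold (the function and threshold names below are illustrative assumptions):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def decide(intent_logits, reject_logits, conf_threshold=0.5):
    """Return the predicted intent index, or -1 to reject the utterance.

    Rejects when either the binary reject head says 'invalid'
    (P(invalid) > 0.5) or the best intent is low-confidence.
    """
    reject_p = softmax(reject_logits)[1]  # P(invalid)
    probs = softmax(intent_logits)
    intent_id = max(range(len(probs)), key=probs.__getitem__)
    if reject_p > 0.5 or probs[intent_id] < conf_threshold:
        return -1
    return intent_id
```

In this sketch a rejected utterance maps to -1 instead of an intent label, which is what prevents false triggering on noise and off-task speech.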
```
ICDM-SLU/
├── eval.sh                          # Main evaluation entry script
├── speech_intent_slot_eval.py       # Inference + intent/reject metric computation
├── slu_models_e2e.py                # Core end-to-end SLU model
├── e2e_cl_fusion_data.py            # Clean/noisy data fusion and data pipeline
├── fusionLayerWrapper.py            # Fusion layer wrapper
├── contrastive_pretraining.py       # Contrastive pretraining
├── val_ic_cf_from_eval.py           # Validation helper example
├── tokenizer.sh                     # Tokenizer training script
├── requirements.txt                 # Python dependency list
├── configs/
│   └── parakeet_transformer_large_bpe_SE_fusion.yaml  # Main model/training config
├── ckpt/
│   └── Intent/                      # Model checkpoints and prediction outputs (e.g., predictions.json)
├── encoders/
│   ├── parakeet-tdt_ctc-110m/       # Pretrained ASR encoder resources
│   ├── whisper-m/                   # Whisper-related resources
│   └── t5_tiny/                     # Lightweight text encoder/LM resources
├── eval_utils/
│   ├── inference.py                 # Inference runner
│   ├── evaluator.py                 # SLURP evaluator
│   └── evaluation/metrics/          # Metric utilities
├── SE_modules/                      # Speech enhancement modules (including Matcha-related components)
└── scripts/                         # Utility scripts (export, tokenizer, ASR/LM, etc.)
```
The evaluation input paths used in `eval.sh` are:
- `dataset_manifest`: `./data/test/manifest.json`
- `asr_transcripts_filepath`: `./data/test/asr_transcripts.jsonl`
You can:
- run directly with this path layout, or
- place the data in your own location and update the paths in `eval.sh` accordingly.
Recommended in-repo organization:
```
ICDM-SLU-open-ver/
├── data/
│   └── test/
│       ├── manifest.json
│       └── asr_transcripts.jsonl
```
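For orientation, both files are JSON Lines (one object per line). The field names below are an assumption based on common NeMo-style manifests; check `speech_intent_slot_eval.py` and `eval.sh` for the exact keys your copy expects:

```python
import json

# Assumed manifest entry (one line of manifest.json); field names are
# illustrative, not confirmed by this repository.
manifest_entry = {
    "audio_filepath": "data/test/audio/example_0001.wav",  # path to the wav
    "duration": 3.2,                                       # seconds
    "text": "set an alarm for seven a m",                  # reference transcript
}

# Assumed ASR transcript entry (one line of asr_transcripts.jsonl).
asr_entry = {
    "audio_filepath": "data/test/audio/example_0001.wav",
    "pred_text": "set an alarm for seven a m",             # ASR hypothesis
}

manifest_line = json.dumps(manifest_entry)
asr_line = json.dumps(asr_entry)
```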
SLURP can be obtained from: https://github.com/pswietojanski/slurp
Data construction uses SLURP as the base corpus, then mixes in VoiceBank human-speech noise and ESC-50 environmental sounds in equal proportion at SNRs of 0, 5, and 10 dB (SNRS_DB = [0, 5, 10]), keeping noisy samples with WER greater than 25%.
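The SNR-controlled mixing step can be sketched as follows. This is a minimal NumPy illustration of the technique, not the project's actual pipeline in `e2e_cl_fusion_data.py`:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` scaled so the speech-to-noise power
    ratio equals `snr_db`. Noise is tiled/truncated to the speech length."""
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    # Solve p_speech / (scale^2 * p_noise) = 10^(snr_db / 10) for scale.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

At 0 dB the scaled noise has the same average power as the speech; at 10 dB it has one tenth of it.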
```bash
conda create -n slu python=3.10.12 -y
conda activate slu
pip install -r requirements.txt
```

Run the evaluation:

```bash
bash eval.sh
```

If `output_filename` is not explicitly set, `speech_intent_slot_eval.py` writes predictions to:
- ckpt/Intent/predictions.json
The terminal prints:
- scenario/action/intent-related metrics
- reject precision, recall, F1, and accuracy
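For reference, the reject metrics printed above can be reproduced from binary gold/predicted reject flags as in the sketch below (treating "reject" as the positive class); the official numbers come from `speech_intent_slot_eval.py`:

```python
def reject_metrics(gold, pred):
    """Precision, recall, F1, and accuracy for the binary reject decision.

    `gold` and `pred` are equal-length sequences of booleans where
    True means the utterance is invalid and should be rejected.
    """
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold)
    return precision, recall, f1, accuracy
```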
Under the above noise-augmented SLURP setting, this solution achieves:
- Intent recognition accuracy > 90%
- Reject accuracy > 95%
- NVIDIA Parakeet:
- Whisper medium:
- T5 small:
- Matcha-TTS:
- NVIDIA NeMo:
- GTCRN: