This project focuses on robust understanding for customer-service outbound voice robots in complex real-world conditions.
In practical outbound-call scenarios, the SLU system is easily affected by:
- environmental noise
- background human speech
- conversation segments not directed at the system
- colloquial and non-standard expressions
These disturbances can lead to invalid responses and false triggering. To address them, we design an invalid-speech rejection solution for complex scenarios that improves rejection robustness while maintaining intent-understanding performance.
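One common way to implement this kind of rejection, shown here as a minimal sketch rather than the project's actual model code in `slu_models_e2e.py`, is to pair the intent classifier's output with a binary valid/invalid decision and a confidence threshold (the function and threshold names below are illustrative assumptions):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def decide(intent_logits, reject_logits, conf_threshold=0.5):
    """Return the predicted intent index, or -1 to reject the utterance.

    Rejects when either the binary reject head says 'invalid'
    (P(invalid) > 0.5) or the best intent is low-confidence.
    """
    reject_p = softmax(reject_logits)[1]  # P(invalid)
    probs = softmax(intent_logits)
    intent_id = max(range(len(probs)), key=probs.__getitem__)
    if reject_p > 0.5 or probs[intent_id] < conf_threshold:
        return -1
    return intent_id
```

In this sketch a rejected utterance maps to -1 instead of an intent label, which is what prevents false triggering on noise and off-task speech.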
```
ICDM-SLU/
├── eval.sh                          # Main evaluation entry script
├── speech_intent_slot_eval.py       # Inference + intent/reject metric computation
├── slu_models_e2e.py                # Core end-to-end SLU model
├── e2e_cl_fusion_data.py            # Clean/noisy data fusion and data pipeline
├── fusionLayerWrapper.py            # Fusion layer wrapper
├── contrastive_pretraining.py       # Contrastive pretraining
├── val_ic_cf_from_eval.py           # Validation helper example
├── tokenizer.sh                     # Tokenizer training script
├── requirements.txt                 # Python dependency list
├── configs/
│   └── parakeet_transformer_large_bpe_SE_fusion.yaml  # Main model/training config
├── ckpt/
│   └── Intent/                      # Model checkpoints and prediction outputs (e.g., predictions.json)
├── encoders/
│   ├── parakeet-tdt_ctc-110m/       # Pretrained ASR encoder resources
│   ├── whisper-m/                   # Whisper-related resources
│   └── t5_tiny/                     # Lightweight text encoder/LM resources
├── eval_utils/
│   ├── inference.py                 # Inference runner
│   ├── evaluator.py                 # SLURP evaluator
│   └── evaluation/metrics/          # Metric utilities
├── SE_modules/                      # Speech enhancement modules (including Matcha-related components)
└── scripts/                         # Utility scripts (export, tokenizer, ASR/LM, etc.)
```
The evaluation input paths used in `eval.sh` are:
- `dataset_manifest`: `./data/test/manifest.json`
- `asr_transcripts_filepath`: `./data/test/asr_transcripts.jsonl`
You can:
- run directly with this path layout, or
- place the data in your own location and update the paths in `eval.sh` accordingly.
Recommended in-repo organization:
```
ICDM-SLU-open-ver/
├── data/
│   └── test/
│       ├── manifest.json
│       └── asr_transcripts.jsonl
```
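For orientation, both files are JSON Lines (one object per line). The field names below are an assumption based on common NeMo-style manifests; check `speech_intent_slot_eval.py` and `eval.sh` for the exact keys your copy expects:

```python
import json

# Assumed manifest entry (one line of manifest.json); field names are
# illustrative, not confirmed by this repository.
manifest_entry = {
    "audio_filepath": "data/test/audio/example_0001.wav",  # path to the wav
    "duration": 3.2,                                       # seconds
    "text": "set an alarm for seven a m",                  # reference transcript
}

# Assumed ASR transcript entry (one line of asr_transcripts.jsonl).
asr_entry = {
    "audio_filepath": "data/test/audio/example_0001.wav",
    "pred_text": "set an alarm for seven a m",             # ASR hypothesis
}

manifest_line = json.dumps(manifest_entry)
asr_line = json.dumps(asr_entry)
```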
SLURP can be obtained from: https://github.com/pswietojanski/slurp
Data construction uses SLURP as the base corpus, then mixes in VoiceBank human-speech noise and ESC-50 environmental sounds in equal proportion at SNRs of 0, 5, and 10 dB (SNRS_DB = [0, 5, 10]), keeping noisy samples with WER greater than 25%.
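The SNR-controlled mixing step can be sketched as follows. This is a minimal NumPy illustration of the technique, not the project's actual pipeline in `e2e_cl_fusion_data.py`:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` scaled so the speech-to-noise power
    ratio equals `snr_db`. Noise is tiled/truncated to the speech length."""
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    # Solve p_speech / (scale^2 * p_noise) = 10^(snr_db / 10) for scale.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

At 0 dB the scaled noise has the same average power as the speech; at 10 dB it has one tenth of it.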
```bash
conda create -n slu python=3.10.12 -y
conda activate slu
pip install -r requirements.txt
```

Run the evaluation:

```bash
bash eval.sh
```

If `output_filename` is not explicitly set, `speech_intent_slot_eval.py` writes predictions to:
- ckpt/Intent/predictions.json
The terminal prints:
- scenario/action/intent-related metrics
- reject precision, recall, F1, and accuracy
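For reference, the reject metrics printed above can be reproduced from binary gold/predicted reject flags as in the sketch below (treating "reject" as the positive class); the official numbers come from `speech_intent_slot_eval.py`:

```python
def reject_metrics(gold, pred):
    """Precision, recall, F1, and accuracy for the binary reject decision.

    `gold` and `pred` are equal-length sequences of booleans where
    True means the utterance is invalid and should be rejected.
    """
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold)
    return precision, recall, f1, accuracy
```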
Under the above noise-augmented SLURP setting, this solution achieves:
- Intent recognition accuracy > 90%
- Reject accuracy > 95%
- NVIDIA Parakeet:
- Whisper medium:
- T5 small:
- Matcha-TTS:
- NVIDIA NeMo:
- GTCRN: