Problem
Real-world multilingual text often mixes languages within sentences:
# Current behavior: fails on mixed-language text
text = "C'est vraiment amazing!"  # French-English
guardrail.validate(text)  # Incorrect results
text = "Das ist really gut"  # German-English
guardrail.validate(text)  # Fails
This is extremely common in:
- Social media (majority of multilingual posts)
- Chat applications
- Informal communication
- Global communities
Proposed Solution
# Enhanced API
result = guardrail.validate(
    "C'est un deepfake, right?",
    handle_code_mixing=True
)
print(result.explanation)
# {
#     'languages_detected': ['fr', 'en'],
#     'code_mixed': True,
#     'primary_language': 'fr',
#     'mixing_ratio': {'fr': 0.7, 'en': 0.3}
# }
Technical Requirements
- Token-level language detection
- Multi-language embedding spaces
- Smooth handling of script switches
- Consistent detection across mixed segments
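As a rough illustration of how token-level detection could feed the explanation fields proposed above, here is a minimal sketch. It is not the library's implementation: the `LEXICON` word lists, the `detect_token_language` and `explain` helpers, and the rule that unknown tokens inherit the previous token's language are all assumptions made for this example. A real system would use a trained token-level language identifier.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, tiny function-word lists used only for this sketch.
LEXICON = {
    "fr": {"c'est", "un", "une", "vraiment", "le", "la", "et"},
    "de": {"das", "ist", "gut", "und", "nicht", "der", "die"},
    "en": {"really", "right", "totally", "cool", "amazing", "the", "is"},
}


def detect_token_language(token, fallback):
    """Guess a language code for one token (illustrative heuristic only)."""
    word = token.lower()
    # Script check: in this sketch any Cyrillic letter is treated as Russian.
    if any("CYRILLIC" in unicodedata.name(ch, "") for ch in word):
        return "ru"
    for lang, words in LEXICON.items():
        if word in words:
            return lang
    # Unknown token: assume it continues the current language run.
    return fallback


def explain(text):
    """Build a dict shaped like the proposed result.explanation."""
    # Words made of Unicode letters, allowing inner apostrophes (e.g. "C'est").
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text)
    langs = []
    previous = "und"  # "undetermined" until a token is recognized
    for token in tokens:
        previous = detect_token_language(token, previous)
        langs.append(previous)
    counts = Counter(langs)
    total = sum(counts.values())
    return {
        "languages_detected": list(dict.fromkeys(langs)),  # first-seen order
        "code_mixed": len(counts) > 1,
        "primary_language": counts.most_common(1)[0][0] if counts else None,
        "mixing_ratio": {lang: round(n / total, 2) for lang, n in counts.items()},
    }
```

Calling `explain("Das ist really gut")` with these toy word lists labels three tokens as German and one as English. The ratio here is a plain token count, which is one possible definition of `mixing_ratio`; the issue does not specify how it should be computed.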
Implementation Approach
class CodeMixedProcessor:
    def process(self, text):
        # Segment by language
        segments = self.segment_by_language(text)
        # Process each segment with appropriate model
        results = []
        for segment in segments:
            model = self.get_model(segment.language)
            results.append(model.process(segment.text))
        # Aggregate results
        return self.aggregate(results)
Why This Matters
- Real-world usage: Majority of casual multilingual communication is code-mixed
- Current failure: Guardrails give incorrect results on mixed text
- Growing trend: Code-mixing is increasing with global communication
- Safety critical: Malicious content often uses code-mixing to evade detection
Test Cases
test_cases = [
    ("C'est totally bizarre", ['fr', 'en']),
    ("Das ist really gut", ['de', 'en']),
    ("Это очень cool", ['ru', 'en']),
]
Note
This is separate from Unicode/UA compliance. Even with perfect Unicode support, code-mixed text needs special handling for:
- Language model selection
- Tokenization boundaries
- Semantic understanding across languages
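To make the model selection and tokenization boundary points concrete, the following is a minimal, self-contained sketch of a segment, select, aggregate pipeline. Everything in it is an assumption for illustration: segments are split on script changes only (so French, German, and English all fall into one coarse `"latin"` bucket), the "models" are plain keyword-scoring callables rather than objects with a `process` method, and aggregation takes the highest segment score so that one flagged segment flags the whole text.

```python
import unicodedata
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Segment:
    language: str
    text: str


def script_language(token):
    """Map a token to a coarse bucket by script (assumption: Cyrillic means 'ru')."""
    for ch in token:
        if ch.isalpha():
            return "ru" if "CYRILLIC" in unicodedata.name(ch, "") else "latin"
    return "latin"


def keyword_model(keywords):
    """Hypothetical stand-in for a per-language guardrail model."""
    def score(text):
        return 1.0 if any(k in text.lower() for k in keywords) else 0.0
    return score


class CodeMixedProcessor:
    def __init__(self, models, default_model):
        self.models = models                # language bucket -> scoring callable
        self.default_model = default_model  # used when no bucket-specific model exists

    def segment_by_language(self, text):
        # Consecutive tokens in the same script form one segment.
        tokens = text.split()
        return [
            Segment(lang, " ".join(group))
            for lang, group in groupby(tokens, key=script_language)
        ]

    def get_model(self, language):
        return self.models.get(language, self.default_model)

    def aggregate(self, results):
        # Conservative choice: the text is as risky as its riskiest segment.
        return max(results, default=0.0)

    def process(self, text):
        segments = self.segment_by_language(text)
        results = [self.get_model(s.language)(s.text) for s in segments]
        return self.aggregate(results)


processor = CodeMixedProcessor(
    models={"ru": keyword_model({"очень"})},
    default_model=keyword_model({"deepfake"}),
)
```

Script-based segmentation only covers the script-switch case such as the Russian-English test input. Same-script pairs like French-English would still need token-level language identification before a language-specific model can be chosen.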