[Feature] Support Code-Mixed Text #110

@anivar

Problem

Real-world multilingual text often mixes languages within sentences:

# Current behavior: validation gives incorrect results on mixed-language text
text = "C'est vraiment amazing!"  # French-English
guardrail.validate(text)  # Incorrect results

text = "Das ist really gut"  # German-English
guardrail.validate(text)  # Incorrect results

This is extremely common in:

  • Social media (majority of multilingual posts)
  • Chat applications
  • Informal communication
  • Global communities

Proposed Solution

# Enhanced API
result = guardrail.validate(
    "C'est un deepfake, right?",
    handle_code_mixing=True
)

print(result.explanation)
# {
#   'languages_detected': ['fr', 'en'],
#   'code_mixed': True,
#   'primary_language': 'fr',
#   'mixing_ratio': {'fr': 0.7, 'en': 0.3}
# }
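As a rough illustration of how the fields in `result.explanation` could be derived, the sketch below builds the same dictionary shape from per-token language labels. The function name `summarize_mixing` and the idea of computing the ratio from token counts are assumptions for illustration, not part of the existing guardrail API.

```python
from collections import Counter


def summarize_mixing(token_languages):
    """Build an explanation dict from per-token language labels.

    `token_languages` is a list such as ['fr', 'fr', 'en']. Tokens that
    could not be identified are passed as None and ignored.
    """
    counts = Counter(lang for lang in token_languages if lang is not None)
    total = sum(counts.values())
    if total == 0:
        return {
            "languages_detected": [],
            "code_mixed": False,
            "primary_language": None,
            "mixing_ratio": {},
        }
    # most_common() orders languages by token count, highest first
    ranked = [lang for lang, _ in counts.most_common()]
    return {
        "languages_detected": ranked,
        "code_mixed": len(ranked) > 1,
        "primary_language": ranked[0],
        "mixing_ratio": {lang: round(counts[lang] / total, 2) for lang in ranked},
    }


explanation = summarize_mixing(["fr", "fr", "fr", "en"])
```

Counting tokens is only one possible definition of the ratio; counting characters or weighting by detector confidence would give different numbers for the same text.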

Technical Requirements

  1. Token-level language detection
  2. Multi-language embedding spaces
  3. Smooth handling of script switches
  4. Consistent detection across mixed segments
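To make requirements 1 and 3 concrete, here is a toy sketch of token-level detection that first checks the script of a token and then falls back to a word list. The `LEXICON` contents, the function names, and the rule that Cyrillic implies Russian are all assumptions made for this example; a real implementation would presumably use a trained language-identification model.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny word lists, for illustration only.
LEXICON = {
    "fr": {"c'est", "un", "vraiment"},
    "de": {"das", "ist", "gut"},
    "en": {"really", "amazing", "right", "cool", "totally"},
}


def script_of(token):
    """Return the script name (e.g. 'LATIN', 'CYRILLIC') of the first letter."""
    for ch in token:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "UNKNOWN"


def detect_token_language(token):
    """Return a language code for one token, or None if it is unknown."""
    word = token.lower()
    if script_of(word) == "CYRILLIC":
        return "ru"  # assumption: Cyrillic means Russian in this toy example
    for lang, words in LEXICON.items():
        if word in words:
            return lang
    return None


def tag_tokens(text):
    """Split text into word tokens and label each with a language code."""
    # Letters only, allowing an internal apostrophe as in "C'est"
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text)
    return [(tok, detect_token_language(tok)) for tok in tokens]


tagged = tag_tokens("Это очень cool")
```

The script check handles switches such as Cyrillic to Latin cheaply, while same-script pairs such as French and English still need a lexical or statistical signal.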

Implementation Approach

class CodeMixedProcessor:
    """Sketch: route each single-language segment to a language-specific model."""

    def process(self, text):
        # Split the input into single-language segments
        segments = self.segment_by_language(text)

        # Run each segment through the model for its language
        results = []
        for segment in segments:
            model = self.get_model(segment.language)
            results.append(model.process(segment.text))

        # Combine the per-segment results into one overall result
        return self.aggregate(results)
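The segmentation and aggregation steps are left undefined in the class. One possible shape for them is sketched below, assuming tokens have already been labelled with a language code. The `Segment` dataclass, the grouping of consecutive tokens, and the "every segment must pass" aggregation policy are illustrative assumptions rather than a proposed final design.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Segment:
    language: str
    text: str


def segment_by_language(tagged_tokens):
    """Merge consecutive (token, language) pairs that share a language."""
    segments = []
    for language, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        segments.append(Segment(language, " ".join(tok for tok, _ in group)))
    return segments


def aggregate(results):
    """Assumed policy: the text passes only if every segment passes."""
    return all(results)


segments = segment_by_language(
    [("Das", "de"), ("ist", "de"), ("really", "en"), ("gut", "de")]
)
```

Validating segments in isolation loses meaning that spans a language switch, so a production version would probably also need a pass over the full text.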

Why This Matters

  • Real-world usage: The majority of casual multilingual communication is code-mixed
  • Current failure: Guardrails give incorrect results on mixed text
  • Growing trend: Code-mixing is increasing with global communication
  • Safety-critical: Malicious content often uses code-mixing to evade detection

Test Cases

test_cases = [
    ("C'est totally bizarre", ['fr', 'en']),
    ("Das ist really gut", ['de', 'en']),
    ("Это очень cool", ['ru', 'en']),
]
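The cases above could be driven by a small harness like the following, which accepts any detector callable and reports mismatches. `check_language_detection` and its `detect_languages` parameter are hypothetical names, and comparing languages as sets (ignoring order) is an assumption about what the test should check.

```python
def check_language_detection(detect_languages, cases):
    """Return the cases where the detected languages differ from the expected ones.

    `detect_languages` is any callable mapping a string to an iterable of
    language codes. Order is ignored when comparing.
    """
    failures = []
    for text, expected in cases:
        detected = detect_languages(text)
        if set(detected) != set(expected):
            failures.append((text, expected, sorted(detected)))
    return failures


test_cases = [
    ("C'est totally bizarre", ["fr", "en"]),
    ("Das ist really gut", ["de", "en"]),
    ("Это очень cool", ["ru", "en"]),
]
```

An empty return value means every case matched, so the harness can sit behind a single assertion in whatever test framework the project uses.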

Note

This is separate from Unicode/UA compliance. Even with perfect Unicode support, code-mixed text needs special handling for:

  • Language model selection
  • Tokenization boundaries
  • Semantic understanding across languages
