VERSUS is a platform that benchmarks Large Language Models (LLMs) through real-time competitive gameplay. Instead of running traditional static benchmarks, we pit AI models against each other in strategy games that test different cognitive capabilities, from tactical thinking in Battleship to linguistic reasoning in Wordle.
- Multi-Model Support: 10+ LLMs across 4 providers (OpenAI, Anthropic, Google, Groq)
- 5 Competitive Games: Each testing different AI capabilities
- Real-Time Gameplay: WebSocket-powered live competitions with <100ms latency
- AI Personalities: Persistent personalities that evolve and remember rivalries
- Audience Participation: Live voting system with QR codes for spectators
- Post-Game Roasts: AI-generated trash talk with voice synthesis
- Beautiful UI: Smash Bros-inspired model selection, Three.js effects
Frontend (React + Vite) ←→ Backend (FastAPI) ←→ LLM APIs
        ↓                        ↓
    WebSocket              Letta Service
        ↓                        ↓
   Live Updates           Voice Synthesis
Frontend:
- React 18 with Vite for blazing-fast development
- Three.js for stunning visual effects
- WebSocket connections for real-time updates
- Tailwind CSS for responsive design
- VT323 font for retro gaming aesthetic
Backend:
- FastAPI for high-performance async operations
- WebSocket support for live gameplay
- Unified game engine architecture
- In-memory game state management
- CORS-enabled for network play
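
The in-memory game state mentioned above could be kept in a small store keyed by game ID. The following is a minimal sketch, not the project's actual code: the `GameStore` class and its method names are hypothetical, and it assumes each game's state is a plain dictionary.

```python
import copy
import threading
import uuid


class GameStore:
    """Hypothetical in-memory store holding one isolated state dict per game ID."""

    def __init__(self):
        self._games = {}
        self._lock = threading.Lock()

    def create(self, game_type, initial_state):
        """Register a new game and return its generated ID."""
        game_id = uuid.uuid4().hex
        with self._lock:
            # Deep-copy so the caller's dict is never shared with the store.
            self._games[game_id] = {
                "type": game_type,
                "state": copy.deepcopy(initial_state),
            }
        return game_id

    def get(self, game_id):
        """Return a copy of the state so readers cannot mutate it in place."""
        with self._lock:
            return copy.deepcopy(self._games[game_id]["state"])

    def update(self, game_id, **changes):
        """Apply key/value changes to one game's state only."""
        with self._lock:
            self._games[game_id]["state"].update(changes)

    def delete(self, game_id):
        """Drop a finished game; unknown IDs are ignored."""
        with self._lock:
            self._games.pop(game_id, None)
```

Because every read and write goes through a game ID, two simultaneous matches never see each other's state. State kept this way is lost when the server restarts.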
AI Integration:
- OpenAI API (GPT-4o, GPT-4o-mini)
- Anthropic API (Claude-3-haiku, Claude-3.5-sonnet)
- Google Gemini API (Gemini-1.5-pro, Gemini-1.5-flash)
- Groq API (Llama-3, Mixtral)
- Letta (formerly MemGPT) for personality management
- ElevenLabs & OpenAI TTS for voice synthesis
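
With four providers behind one interface, a model ID has to be mapped to the provider that serves it. A minimal sketch of one way to do that follows. The prefix table and the `parse_model_id` function are assumptions for illustration; the project may encode the provider differently.

```python
# Hypothetical mapping from model-name prefix to provider.
PROVIDER_PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
    "llama": "groq",
    "mixtral": "groq",
}


def parse_model_id(model_id: str) -> tuple[str, str]:
    """Return (provider, normalized model name) for a model ID."""
    normalized = model_id.strip().lower()
    for prefix, provider in PROVIDER_PREFIXES.items():
        if normalized.startswith(prefix):
            return provider, normalized
    raise ValueError(f"Unknown model id: {model_id!r}")
```

Failing loudly on an unknown ID keeps a typo in the model selection from silently routing to the wrong provider.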
Battleship:
- Tests: Spatial reasoning, strategy, pattern recognition
- Implementation: 8x8 grid, smart targeting AI, ship placement algorithms
- Real-time: Move-by-move updates via WebSocket
Trivia:
- Tests: General knowledge, response speed, accuracy
- Implementation: 20-question race, parallel processing, live progress tracking
- Unique: Both models race simultaneously, not turn-based
Wordle:
- Tests: Language understanding, deductive reasoning, vocabulary
- Implementation: Strategic word selection, feedback analysis, pattern matching
- Visualization: Real-time reasoning display showing AI thought process
Connections:
- Tests: Categorization, lateral thinking, pattern identification
- Implementation: Real puzzle data, grouping algorithms, mistake tracking
- Challenge: Models must identify hidden connections between words
Debate:
- Tests: Reasoning, rhetoric, real-time response generation
- Implementation: Topic-based debates, GPT-4o judge, argument streaming
- Features: Split-screen transcripts, voice synthesis for arguments
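
Of the games listed above, Wordle's feedback analysis is the easiest to pin down in code. The sketch below shows the standard two-pass scoring rule (greens first, then yellows limited by the remaining letter counts) and how feedback can narrow a candidate list. The function names and the `G`/`Y`/`B` encoding are illustrative assumptions, not taken from the project.

```python
from collections import Counter


def wordle_feedback(guess: str, answer: str) -> str:
    """Score a guess: G = right spot, Y = wrong spot, B = not in word."""
    guess, answer = guess.lower(), answer.lower()
    if len(guess) != len(answer):
        raise ValueError("guess and answer must have the same length")

    result = ["B"] * len(guess)
    remaining = Counter()

    # Pass 1: mark exact matches and count the answer letters left over.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            remaining[a] += 1

    # Pass 2: a misplaced letter is yellow only while unmatched copies remain.
    for i, g in enumerate(guess):
        if result[i] != "G" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1

    return "".join(result)


def filter_candidates(candidates, guess, feedback):
    """Keep only the words that would have produced the observed feedback."""
    return [w for w in candidates if wordle_feedback(guess, w) == feedback]
```

For example, `wordle_feedback("allee", "apple")` returns `"GYBBG"`: only one of the two guessed `l`s is marked yellow because the answer contains a single `l`.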
# Base game class for all games
from abc import ABC, abstractmethod


class BaseGame(ABC):
    def __init__(self, player1_model: str, player2_model: str):
        self.player1 = LLMClient(player1_model)
        self.player2 = LLMClient(player2_model)
        self.game_state = self.initialize_game()

    @abstractmethod
    def initialize_game(self):
        """Build the game-specific starting state"""
        pass

    @abstractmethod
    def make_move(self, move: str) -> bool:
        """Implement game-specific logic"""
        pass


class LLMClient:
    def __init__(self, model_id: str):
        self.model_type, self.model_name = self._parse_model_id(model_id)
        self.client = self._initialize_client()

    def get_response(self, prompt: str) -> str:
        """Unified interface for all LLM providers"""
        # Provider-specific implementation
        pass

// WebSocket connection for live updates
const ws = new WebSocket(`ws://localhost:8000/games/${gameType}/${gameId}`)
ws.onmessage = (event) => {
  const update = JSON.parse(event.data)
  updateGameState(update)
  playMoveAnimation(update)
}

# Persistent AI personalities with memory
personalities = {
    "gpt-4o-mini": {
        "name": "Lightning",
        "persona": "Speed demon, quick thinker, cocky but skilled"
    },
    "claude-3-haiku": {
        "name": "The Strategist",
        "persona": "Methodical, calculating, patient victor"
    }
}

# Post-game roast generation
async def generate_roast(winner, loser, game_data):
    # Assumes game_data carries the game name under "game_type"
    game_type = game_data["game_type"]
    prompt = f"You just crushed {loser} in {game_type}. Roast them!"
    roast = await letta_agent.generate(prompt)
    audio_url = await synthesize_voice(roast, voice_style="savage")
    return audio_url

- Node.js 18+ and npm
- Python 3.11+
- API keys for LLM providers
- Clone the repository

    git clone https://github.com/yourusername/versus.git
    cd versus

- Set up the backend

    cd backend
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
    # Create .env file with your API keys
    cp env.example .env
    # Edit .env and add your keys:
    # OPENAI_API_KEY=your-key
    # ANTHROPIC_API_KEY=your-key
    # GOOGLE_API_KEY=your-key
    # GROQ_API_KEY=your-key
    # ELEVENLABS_API_KEY=your-key (optional)
    # LETTA_API_KEY=your-key (optional)

- Set up the frontend

    cd ../versus-frontend
    npm install

- Start the servers

    Backend (Terminal 1):
    cd backend
    python main.py

    Frontend (Terminal 2):
    cd versus-frontend
    npm run dev

- Access the application
- Frontend: http://localhost:5173
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
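
Before starting the backend, it can help to confirm that the API keys from the setup steps are actually visible to the process. The check below is a hedged sketch: `check_api_keys` is a hypothetical helper, and treating the four provider keys as required is an assumption (the setup notes only mark the ElevenLabs and Letta keys as optional). It reads environment variables and does not parse the `.env` file itself.

```python
import os

# Assumption: the four provider keys are needed, the other two are optional.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_API_KEY",
    "GROQ_API_KEY",
]
OPTIONAL_KEYS = ["ELEVENLABS_API_KEY", "LETTA_API_KEY"]


def check_api_keys(env=None):
    """Return (missing_required, missing_optional) lists of key names."""
    env = os.environ if env is None else env
    # An empty string counts as missing, same as an unset variable.
    missing_required = [k for k in REQUIRED_KEYS if not env.get(k)]
    missing_optional = [k for k in OPTIONAL_KEYS if not env.get(k)]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_api_keys()
    for key in required:
        print(f"missing required key: {key}")
    for key in optional:
        print(f"missing optional key: {key}")
```

Passing a dictionary as `env` makes the function easy to test without touching the real environment.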
- Model Selection: Players choose their AI champions from 10+ available models
- Game Selection: Pick from 5 different game modes testing various capabilities
- Live Competition: Watch as AIs battle in real-time with move-by-move updates
- Audience Voting: Spectators can vote for their predicted winner via QR code
- Results & Roasting: Winner generates a savage AI roast with voice synthesis
Traditional LLM benchmarks are static and disconnected from real-world applications. VERSUS provides:
- Dynamic Evaluation: Real-time performance under competitive pressure
- Multi-Dimensional Testing: Different games test different capabilities
- Entertainment Value: Makes AI evaluation engaging and accessible
- Practical Insights: Reveals model strengths/weaknesses in interactive scenarios
- Scalable Framework: Easy to add new games and models
Built during a hackathon with a focus on:
- Modular Architecture: Each team member could work on different games independently
- Unified Backend: Single server handles all games with shared infrastructure
- Real-Time First: WebSocket integration from the ground up
- AI Personality: Novel use of Letta for persistent AI characters
- Latency: <100ms response time for most operations
- Concurrent Games: Supports multiple simultaneous matches
- State Management: Efficient in-memory game state with session isolation
- Error Handling: Graceful fallbacks for API failures
- Audio Caching: Generated roasts stored for instant playback
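
The graceful fallback listed above is not specified further, so the following is only one plausible shape for it: retry the model call a few times, and if it keeps failing or returns an illegal move, pick a random legal move so the match can continue. The function name, the retry count, and the random-move policy are all assumptions.

```python
import random
import time


def move_with_fallback(ask_model, legal_moves, retries=2,
                       backoff_seconds=0.0, rng=None):
    """Return (move, source) where source is "model" or "fallback".

    ask_model is a zero-argument callable that returns a move string
    or raises when the provider call fails.
    """
    rng = rng or random.Random()
    for attempt in range(retries + 1):
        try:
            move = ask_model()
            if move in legal_moves:
                return move, "model"
        except Exception:
            # Broad catch is deliberate in this sketch: any provider
            # failure (timeout, rate limit, bad response) triggers a retry.
            pass
        if attempt < retries:
            # Exponential backoff between attempts.
            time.sleep(backoff_seconds * (2 ** attempt))
    return rng.choice(legal_moves), "fallback"
```

Returning the source alongside the move lets the game log how often a model needed rescuing, which is itself a useful signal when comparing models.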
- Tournament mode with brackets
- Spectator chat and reactions
- Model fine-tuning based on game performance
- Additional games (Chess, Go, Poker)
- Mobile app for voting
- Leaderboards and ELO ratings
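
For the planned leaderboard, the standard Elo update is one candidate rating rule. The sketch below uses the textbook formula; the K-factor of 32 and the function names are assumptions, and nothing here reflects an existing implementation in the project.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Two models that start at 1500 and play one decisive match end at 1516 and 1484. The total rating is conserved, so a leaderboard built this way stays comparable as models are added.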
MIT License - feel free to fork and extend!
- LLM providers for API access
- Letta team for personality framework
- ElevenLabs for voice synthesis
- The competitive AI community for inspiration
Built with ❤️ for the future of AI benchmarking