HealthBench Evaluation
Overview
HealthBench evaluation automatically assesses the quality and safety of healthcare-related conversations by grading them with an AI model against healthcare-specific criteria.
Quick Setup
Enable HealthBench evaluation by adding eval="healthbench" to your session:
from whispey import LivekitObserve

whispey = LivekitObserve(
    agent_id="your-agent-id",
    apikey="your-api-key"
)

session_id = whispey.start_session(
    session=your_livekit_session,
    eval="healthbench"  # Enable HealthBench evaluation
)

# Export on shutdown via callback
async def whispey_shutdown():
    await whispey.export(session_id)

ctx.add_shutdown_callback(whispey_shutdown)
Configuration Options
Customize the evaluation behavior with additional parameters:
session_id = whispey.start_session(
    session=your_livekit_session,
    eval="healthbench",               # Enable evaluation
    eval_grader_model="gpt-4o-mini",  # AI model for grading (default)
    eval_num_examples=1,              # Number of examples to evaluate against (default: 1)

    # Additional metadata
    patient_id="patient-123",
    appointment_type="consultation",
    specialty="general_medicine"
)

# Export on shutdown via callback
async def whispey_shutdown():
    await whispey.export(session_id)

ctx.add_shutdown_callback(whispey_shutdown)
Requirements
The grader model runs on the OpenAI API, so set your OpenAI API key before enabling evaluation:
# In your .env file
OPENAI_API_KEY=your-openai-api-key-here
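If you load environment variables with python-dotenv (as in the complete example below), a fail-fast check at startup avoids sessions whose evaluation cannot run. This is only a sketch; the SDK itself just needs OPENAI_API_KEY to be present in the environment.

import os
from dotenv import load_dotenv

load_dotenv()  # Reads .env so OPENAI_API_KEY is available to the grader

# Fail fast if the key is missing; HealthBench grading cannot run without it
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")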
Evaluation Results
HealthBench evaluation results are automatically included in your session metadata:
{
  "metadata": {
    "evaluation": {
      "evaluation_type": "healthbench",
      "evaluation_successful": true,
      "score": 0.85,
      "metrics": {
        "overall_score": 0.85,
        "safety": 0.95,
        "accuracy": 0.80,
        "appropriateness": 0.85
      },
      "num_examples_evaluated": 1,
      "grader_model": "gpt-4o-mini",
      "evaluation_duration_seconds": 8.2,
      "evaluated_at": "2024-01-15T10:30:00"
    }
  }
}
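If you consume this payload outside the Whispey dashboard (for example in your own analytics pipeline), the scores are plain nested keys. The sketch below assumes you already have the session metadata as a Python dict shaped like the JSON above; session_metadata is a hypothetical variable name, not something the SDK returns under that name.

evaluation = session_metadata["metadata"]["evaluation"]

if evaluation["evaluation_successful"]:
    print(f"Overall score: {evaluation['score']:.2f}")
    for criterion, value in evaluation["metrics"].items():
        print(f"  {criterion}: {value:.2f}")
else:
    print("Evaluation did not complete for this session")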
What Gets Evaluated
HealthBench evaluates conversations against healthcare quality criteria:
- Safety: Does the response prioritize patient safety?
- Accuracy: Is the medical information correct?
- Emergency Recognition: Are urgent situations properly identified?
- Appropriateness: Is the response suitable for the medical concern?
- Clarity: Is the response clear and understandable?
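Criteria the grader reports show up as per-criterion scores in the metrics block above, so you can act on individual dimensions. A minimal sketch for flagging weak sessions follows; the threshold values and the helper name are made up for illustration and are not part of the SDK.

# Hypothetical quality thresholds - tune these to your own quality bar
THRESHOLDS = {"safety": 0.90, "accuracy": 0.75, "appropriateness": 0.75}

def flag_low_scores(metrics: dict) -> list:
    """Return the names of criteria that scored below their threshold."""
    return [name for name, floor in THRESHOLDS.items() if metrics.get(name, 0.0) < floor]

# Example: flag_low_scores({"safety": 0.95, "accuracy": 0.80, "appropriateness": 0.85}) -> []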
Performance
- Evaluation Time: ~5-10 seconds with default settings
- API Calls: Minimal OpenAI API usage
- Examples: Uses 1 HealthBench example by default for speed
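The evaluation_duration_seconds field in the results metadata records how long grading took for a given session. If you want to observe the overhead directly in your own deployment, you can also time the export call; this sketch assumes grading happens while whispey.export is awaited in the shutdown callback.

import time

async def whispey_shutdown():
    start = time.monotonic()
    await whispey.export(session_id)  # export (and, by assumption, grading) happens here
    print(f"Export including evaluation took {time.monotonic() - start:.1f}s")

ctx.add_shutdown_callback(whispey_shutdown)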
Configuration Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| eval | string | None | Set to "healthbench" to enable evaluation |
| eval_grader_model | string | "gpt-4o-mini" | AI model used for grading |
| eval_num_examples | integer | 1 | Number of HealthBench examples to evaluate against |
Complete Example
from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentSession, Agent
from whispey import LivekitObserve

load_dotenv()  # Load OPENAI_API_KEY

# Initialize Whispey
whispey = LivekitObserve(
    agent_id="healthcare-agent-001",
    apikey="your-whispey-api-key"
)

class HealthcareAssistant(Agent):
    def __init__(self):
        super().__init__(instructions="""
            You are a healthcare assistant. Provide accurate,
            safe medical guidance and always recommend seeking
            professional medical attention for serious concerns.
        """)

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        # ... your STT, LLM, TTS configuration ...
    )

    # Start session with HealthBench evaluation
    session_id = whispey.start_session(
        session,
        eval="healthbench",               # Enable evaluation
        eval_grader_model="gpt-4o-mini",  # Cost-effective grader model
        eval_num_examples=1,              # Fast evaluation

        # Healthcare context metadata
        specialty="general_medicine",
        consultation_type="telemedicine"
    )

    # Export on shutdown
    async def whispey_shutdown():
        result = await whispey.export(session_id)
        if result.get("success"):
            print("✅ Healthcare conversation evaluated and exported!")
        else:
            print(f"❌ Export failed: {result.get('error')}")

    ctx.add_shutdown_callback(whispey_shutdown)

    await session.start(
        room=ctx.room,
        agent=HealthcareAssistant()
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
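If you are using the standard LiveKit Agents CLI that agents.cli.run_app provides, you can run this worker locally in development mode with python your_agent_file.py dev (the filename is a placeholder for wherever you save the example).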
Best Practices
- Set OpenAI API Key: Required for evaluation to work
- Use Default Settings: eval_num_examples=1 provides a good balance of speed and accuracy
- Add Context Metadata: Include healthcare-specific metadata for better analytics
- Monitor Results: Check evaluation scores in your Whispey dashboard
Troubleshooting
Common Issues
"OpenAI API key not found"
- Solution: Set
OPENAI_API_KEY
in your environment variables
"Evaluation timed out"
- Solution: Reduce
eval_num_examples
to 1 (default)
"No transcript data available"
- Solution: Ensure your agent is having conversations before evaluation runs
Next Steps
- Learn about Bug Reporting for quality monitoring
- Check out Advanced Features for more SDK capabilities
- Visit Examples for more healthcare use cases