Training data that makes agents smarter
Turn browser agent sessions into high-quality training datasets. Label, verify, and export data for fine-tuning and RLHF.
From raw sessions to training-ready data
An automated pipeline that transforms agent sessions into clean, labeled datasets.
Capture
SDK automatically captures sessions with full context
Auto-Label
ML models label success, failure, and edge cases
Verify
Human annotators review uncertain labels
Export
Download in your preferred training format
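As a rough sketch, the Capture step of this pipeline might look like the snippet below. Note that SessionRecorder and its methods are hypothetical placeholders, not the documented SDK; the Export step is shown in the real code sample further down the page.
# Hypothetical sketch of the Capture step. SessionRecorder and its methods
# are illustrative placeholders, not a documented surfs API.
from surfs import SessionRecorder  # hypothetical import

recorder = SessionRecorder(api_key="your-api-key")

# Wrap an agent run so actions, DOM snapshots, and the final outcome are
# recorded with full context and queued for auto-labeling on exit.
with recorder.session(task_type="checkout", agent_id="checkout-bot-v3") as session:
    session.log_action(action="click", selector="#add-to-cart")
    session.log_action(action="fill", selector="#email", value="shopper@example.com")
    session.set_outcome("success", metadata={"order_id": "A-1042"})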
Data for every training approach
Whether you're doing imitation learning, running RLHF, or building custom models.
Behavioral Cloning
Learn from successful agent trajectories. Capture action sequences, DOM states, and decision contexts for imitation learning.
RLHF Data
Preference pairs for reinforcement learning from human feedback. Compare agent behaviors and rank outcomes.
Failure Analysis
Labeled failure modes with root cause annotations. Build datasets for failure detection and recovery.
Task Completion
End-to-end task demonstrations with step-by-step breakdowns. Full context from goal to completion.
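To make the distinction concrete, a behavioral-cloning example is a single trajectory to imitate, while an RLHF example is a ranked pair of behaviors on the same task. The record shapes below are illustrative only; the field names are not the platform's actual export schema.
# Illustrative record shapes -- field names are hypothetical, not the
# platform's actual export schema.

# Behavioral cloning: one successful trajectory to imitate.
bc_example = {
    "goal": "Complete checkout with a saved payment method",
    "steps": [
        {"dom_state": "<snapshot 1>", "action": {"type": "click", "selector": "#cart"}},
        {"dom_state": "<snapshot 2>", "action": {"type": "click", "selector": "#checkout"}},
    ],
    "outcome": "success",
}

# RLHF: a preference pair comparing two behaviors on the same task.
rlhf_example = {
    "goal": "Complete checkout with a saved payment method",
    "chosen": {"session_id": "sess_123", "outcome": "success", "steps": 7},
    "rejected": {"session_id": "sess_456", "outcome": "failure", "steps": 19},
}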
Enterprise-grade data infrastructure
Automatic Labeling
Sessions are automatically labeled based on outcomes, DOM state changes, and action sequences. Define custom rules for your specific success criteria; a sketch of one such rule appears below.
Human-in-the-Loop
Route uncertain labels to human annotators. Built-in quality control with agreement tracking, conflict resolution, and annotator performance metrics.
Flexible Export
Export to OpenAI, Anthropic, or custom formats. JSONL, Parquet, or direct integration with your training pipeline via API.
Active Learning
Identify which examples will most improve your model. Prioritize labeling for high-impact, uncertain, or edge-case sessions.
Data Quality
Automatic PII detection and redaction. Deduplication, outlier detection, and consistency checks ensure clean training data.
Version Control
Track dataset versions with full lineage. Compare model performance across dataset iterations. Roll back to any previous version.
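For instance, the custom success rules mentioned under Automatic Labeling could be expressed as declarative conditions on URLs, selectors, and API responses. The add_labeling_rule call and rule schema below are a hypothetical sketch, not documented API.
# Hypothetical sketch of a custom auto-labeling rule. add_labeling_rule and
# the rule schema are illustrative, not a documented surfs API.
from surfs import TrainingData

client = TrainingData(api_key="your-api-key")
client.add_labeling_rule(  # hypothetical method
    name="checkout-success",
    label="success",
    when={
        "url_matches": "*/order-confirmation*",        # final page URL
        "selector_present": "#order-confirmation-id",  # DOM state check
        "api_response": {"path": "/api/orders", "status": 200},
    },
)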
Export with a single API call
from surfs import TrainingData

# Initialize client
client = TrainingData(api_key="your-api-key")

# Build dataset with filters
dataset = client.create_dataset(
    name="checkout-flow-v2",
    filters={
        "task_type": "checkout",
        "outcome": "success",
        "min_steps": 5,
        "date_range": "last_30_days"
    },
    labeling={
        "auto_label": True,
        "human_review": "uncertain",  # Review uncertain cases
        "quality_threshold": 0.95
    }
)

# Export to OpenAI fine-tuning format
dataset.export(
    format="openai_jsonl",
    output="training_data.jsonl",
    include_context=True  # Include DOM snapshots
)

print(f"Exported {dataset.size:,} examples")
# Output: Exported 12,847 examples
Frequently asked questions
What formats can I export training data to?
Export to JSON, JSONL, Parquet, or custom formats. We support the OpenAI fine-tuning format, Anthropic's format, and custom schemas for your training pipeline.
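For reference, the OpenAI chat fine-tuning format is one JSON object per line with a messages array; the snippet below shows what a single exported line might look like for a browser-agent step (the content itself is an invented example).
import json

# One line of OpenAI chat fine-tuning JSONL. The messages structure is
# OpenAI's documented format; the browser-agent content is invented.
record = {
    "messages": [
        {"role": "system", "content": "You are a browser agent. Reply with the next action as JSON."},
        {"role": "user", "content": "Goal: complete checkout. DOM: <button id=\"checkout\">Checkout</button>"},
        {"role": "assistant", "content": "{\"action\": \"click\", \"selector\": \"#checkout\"}"},
    ]
}

with open("training_data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")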
How does automatic session labeling work?
Our system analyzes session outcomes, DOM changes, and action sequences to automatically label success/failure. You can define custom labeling rules based on selectors, URLs, or API responses.
Can I use human annotators to verify labels?
Yes. Our human-in-the-loop workflow lets you send uncertain labels to annotators for verification. Built-in quality control tracks annotator agreement and flags inconsistencies.
How much training data do I need for fine-tuning?
It depends on your task complexity. Most teams see improvements with 1,000-10,000 high-quality examples. Our platform helps you identify which examples add the most value to your dataset.
Can I filter training data by success rate or task type?
Absolutely. Filter by outcome, duration, cost, error type, or any custom metadata. Build focused datasets for specific behaviors or edge cases.
Start building training datasets
Join teams using surfs.dev to create high-quality training data for their browser agents.