Best Practices
Guidelines for effective multi-agent orchestration with Colony.
Agent Configuration
Sizing Your Colony
Start Small, Scale Up
# Good: Start with 2-3 agents
agents:
- id: main-dev
- id: reviewer
- id: tester
# Avoid: Too many agents initially
# agents: [10+ agents on day one]
Recommended Colony Sizes:
- Small Project (1-2 developers): 2-3 agents
- Medium Project (3-5 developers): 4-6 agents
- Large Project (6+ developers): 6-10 agents
- Maximum Practical: 12-15 agents (beyond this, coordination overhead tends to outweigh the added throughput)
Agent Specialization
Do: Clear, Focused Roles
agents:
  - id: backend-api
    role: Backend Engineer
    focus: REST API development and database design
  - id: frontend-ui
    role: Frontend Engineer
    focus: React components and user interface
  - id: security-auditor
    role: Security Specialist
    focus: Security review and vulnerability scanning
Don't: Vague, Overlapping Roles
# Avoid this:
agents:
  - id: dev-1
    role: Developer
    focus: Do stuff
Startup Prompts
Effective Startup Prompts Include:
- Primary Responsibility
- Specific Focus Areas
- Communication Guidelines
- Quality Standards
startup_prompt: |
  You are the Backend API Engineer for this project.
  PRIMARY RESPONSIBILITIES:
  - Design and implement REST API endpoints
  - Manage database schema and migrations
  - Write API documentation
  FOCUS AREAS:
  - RESTful design principles
  - Database optimization
  - API security (auth, validation, rate limiting)
  COMMUNICATION:
  - Coordinate with frontend-ui for API contracts
  - Notify security-auditor before merging
  - Report blockers immediately
  QUALITY STANDARDS:
  - All endpoints must have tests
  - Follow project naming conventions
  - Document all public APIs
Task Management
Task Design
Good Task Characteristics:
- Atomic: One clear objective
- Testable: Success criteria defined
- Sized Right: 2-4 hours of work
- Well-Scoped: Clear boundaries
# Good Task
colony tasks create api-users-endpoint \
"Implement /api/users endpoint" \
"Create GET /api/users with pagination, filtering by role, and proper auth. Include unit tests and update API docs." \
--priority high
# Too Vague
colony tasks create backend-work \
"Do backend stuff" \
"Work on backend"
Task Dependencies
Use Blockers for Sequencing:
# Step 1: Database schema
colony state task add "Create users table schema" \
--description "Design and implement users table with proper indexes"
# Step 2: Depends on schema
colony state task add "Implement user CRUD" \
--description "Create, Read, Update, Delete operations for users" \
--blockers "Create users table schema"
# Step 3: Depends on CRUD
colony state task add "Add user authentication" \
--description "JWT-based authentication for user endpoints" \
--blockers "Implement user CRUD"
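The blocker chain above implies an execution order. As a minimal sketch, assuming the three tasks are held in a plain dictionary (the `blockers` mapping and `execution_order` helper are illustrative and not part of Colony), the order can be derived with Python's standard `graphlib` module:

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory view of the tasks above: task title -> titles that block it.
blockers = {
    "Create users table schema": [],
    "Implement user CRUD": ["Create users table schema"],
    "Add user authentication": ["Implement user CRUD"],
}


def execution_order(blockers: dict[str, list[str]]) -> list[str]:
    """Return tasks ordered so that every blocker comes before the task it blocks."""
    # TopologicalSorter expects node -> predecessors and raises
    # graphlib.CycleError if the blockers form a cycle.
    return list(TopologicalSorter(blockers).static_order())


print(execution_order(blockers))
```

A circular blocker chain raises `CycleError`, which is a useful early warning that no agent could ever start those tasks.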
Task Priority Guidelines
| Priority | When to Use | Examples |
|---|---|---|
| critical | Blocking production, security issues | "Fix auth bypass vulnerability", "Restore down service" |
| high | Core features, important bugs | "Implement login flow", "Fix data loss bug" |
| medium | Standard features, improvements | "Add user profile page", "Optimize query performance" |
| low | Nice-to-haves, refactoring | "Update dependencies", "Improve logging" |
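One way to apply the table is to rank pending work before handing it to agents. The sketch below assumes tasks are simple dictionaries with a `priority` field; the record shape and the `sort_by_priority` helper are hypothetical, not Colony's own data model.

```python
# Rank order follows the priority table above; a lower rank is handled first.
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def sort_by_priority(tasks):
    """Return tasks ordered critical -> low, keeping input order within a level."""
    # sorted() is stable, so tasks with equal priority keep their relative order.
    return sorted(tasks, key=lambda task: PRIORITY_RANK[task["priority"]])


# Hypothetical task records using example titles from the table.
pending = [
    {"title": "Update dependencies", "priority": "low"},
    {"title": "Implement login flow", "priority": "high"},
    {"title": "Fix auth bypass vulnerability", "priority": "critical"},
    {"title": "Fix data loss bug", "priority": "high"},
]

for task in sort_by_priority(pending):
    print(task["priority"], "-", task["title"])
```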
Communication
Message Types
Broadcasts (colony broadcast or b in TUI):
# Good uses:
colony broadcast "🚨 Critical: Security patch required in auth module"
colony broadcast "✅ Sprint complete - all tests passing"
colony broadcast "📢 New API contract available in docs/"
# Avoid:
colony broadcast "Working on stuff" # Too vague
colony broadcast "Hey reviewer-1..." # Use direct message instead
Direct Messages:
# Coordinate specific work
colony messages send backend-dev "API contract ready for /users endpoint"
# Request help
colony messages send security-audit "Please review auth changes in PR #123"
# Report blockers
colony messages send team-lead "Blocked on database access permissions"
Communication Patterns
1. Pull Request Workflow
# Developer finishes work
colony broadcast "PR #123 ready for review: User authentication"
# Reviewer claims
colony messages send developer-1 "Reviewing PR #123, will have feedback in 30min"
# Review complete
colony messages send developer-1 "PR #123 approved with minor suggestions"
2. Blocker Resolution
# Agent hits blocker
colony broadcast "🚫 Blocked: Need database credentials for integration tests"
# Coordinator responds
colony messages send blocked-agent "DB creds in 1Password vault 'Dev Credentials'"
# Agent unblocks
colony tasks unblock task-123
colony broadcast "✅ Unblocked, resuming work on integration tests"
Shared State
When to Enable Shared State
Enable for:
- Multi-session work (resume after breaks)
- Cross-session coordination
- Long-running projects
- Distributed teams
Skip for:
- Quick experiments
- Single-session work
- Prototype/spike projects
State Hygiene
Regular Sync:
# Before starting work
colony state pull
# After significant progress
colony state push
# End of session
colony state sync
Clean Completed Tasks:
# Weekly cleanup
colony state task list --status completed |
grep "2024-01" | # Old tasks
xargs -I {} colony state task delete {}
Commit Message Guidelines:
# Good commit messages
git commit -m "state: Add task for user profile feature"
git commit -m "state: Mark authentication tasks as complete"
git commit -m "state: Update blockers for API integration"
# To apply this convention automatically, add a prepare-commit-msg hook under .git/hooks/
Workflows
Workflow Design Principles
1. Single Responsibility: Each workflow should handle one logical process.
# Good: Focused workflow
workflow:
  name: code-review-workflow
  steps:
    - name: lint
    - name: test
    - name: security-scan
    - name: manual-review
# Avoid: Kitchen sink workflow
workflow:
  name: do-everything
  steps: [50+ steps]
2. Idempotent Steps: Steps should be safely re-runnable.
# Good: Can retry
steps:
  - name: run-tests
    agent: test-runner
    retry:
      max_attempts: 3
      backoff: exponential
# Avoid: Non-idempotent
steps:
- name: increment-counter # Don't do this
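To illustrate why idempotency matters for the `retry` settings above, here is a minimal sketch of retry with exponential backoff. It is not Colony's implementation; `run_with_retry`, `base_delay`, and the injectable `sleep` parameter are assumptions made for the example.

```python
import time


def run_with_retry(step, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call step() until it succeeds or max_attempts is reached.

    Waits base_delay * 2 ** (attempt - 1) seconds after each failed attempt,
    so the step may run several times and must be safe to repeat.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a hypothetical step that fails twice, then succeeds.
attempts = {"count": 0}
recorded_delays = []


def flaky_tests():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


# Record the delays instead of actually sleeping.
print(run_with_retry(flaky_tests, sleep=recorded_delays.append), recorded_delays)
```

A step like `increment-counter` would run its side effect once per attempt under this scheme, which is why non-idempotent steps and retries do not mix.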
3. Clear Dependencies
steps:
  - name: build
    agent: builder
  - name: test
    depends_on: [build] # Explicit dependency
    agent: tester
  - name: deploy
    depends_on: [test] # Sequential
    agent: deployer
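Explicit dependencies are only useful if every name in `depends_on` refers to a real step. A small pre-flight check might look like the following sketch, which assumes the workflow has already been loaded into a list of dictionaries; the `undefined_dependencies` helper is hypothetical.

```python
# Hypothetical parsed form of the steps above.
steps = [
    {"name": "build", "agent": "builder"},
    {"name": "test", "depends_on": ["build"], "agent": "tester"},
    {"name": "deploy", "depends_on": ["test"], "agent": "deployer"},
]


def undefined_dependencies(steps):
    """Return the sorted names listed in depends_on that match no defined step."""
    defined = {step["name"] for step in steps}
    referenced = {dep for step in steps for dep in step.get("depends_on", [])}
    return sorted(referenced - defined)


print(undefined_dependencies(steps))
```

An empty result means every dependency resolves; a typo such as `depends_on: [biuld]` would be reported by name.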
Error Handling
Graceful Degradation:
workflow:
  steps:
    - name: primary-build
      agent: builder
      on_failure: try-backup-build
    - name: try-backup-build
      agent: backup-builder
      on_failure: notify-team
    - name: notify-team
      agent: coordinator
      instructions: "Send alert about build failure"
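The `on_failure` links above form a fallback chain. The sketch below shows how such a chain could be walked, under the assumption that each step either succeeds or hands off to its `on_failure` target; `run_with_fallback` and the `run` callback are illustrative names, not Colony APIs.

```python
# Hypothetical parsed form of the workflow above: step name -> settings.
steps = {
    "primary-build": {"on_failure": "try-backup-build"},
    "try-backup-build": {"on_failure": "notify-team"},
    "notify-team": {},
}


def run_with_fallback(steps, start, run):
    """Run `start`; while a step fails, move to its on_failure target.

    `run(name)` returns True on success. Returns the step names visited in order.
    """
    visited = []
    current = start
    # Stop at the end of the chain, or if a misconfigured chain loops back.
    while current is not None and current not in visited:
        visited.append(current)
        if run(current):
            break
        current = steps[current].get("on_failure")
    return visited


# Usage: both builds fail, so the chain ends at the notification step.
print(run_with_fallback(steps, "primary-build", lambda name: name == "notify-team"))
```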
Monitoring
What to Monitor
Critical Metrics:
- Agent Health: Running vs. failed agents
- Task Velocity: Tasks completed per hour
- Message Flow: Communication patterns
- Error Rate: Failed tasks and retries
Using the TUI Effectively
Daily Workflow:
# Morning: Check overnight progress
colony tui
# 1. Review Agents tab for any failures
# 2. Check Tasks tab for completion rate
# 3. Scan Messages for any issues
# During Work: Monitor in real-time
# Keep TUI open in dedicated terminal
# Watch for agent failures
# Monitor task queue depth
# Evening: Final check
# 1. Verify all tasks are claimed or completed
# 2. Check that no agents are blocked
# 3. Sync state before shutdown
Log Review
Regular Log Patterns:
# Daily error scan
colony logs --level error --last 24h
# Agent-specific debugging
colony logs problematic-agent --pattern "error|warning"
# Performance monitoring
colony logs --pattern "slow|timeout|retry"
Performance Optimization
Resource Management
CPU and Memory:
- Limit concurrent agents based on system resources
- Use claude-sonnet for routine tasks (faster, cheaper)
- Use claude-opus only for complex reasoning
agents:
  - id: code-reviewer
    model: claude-sonnet-4 # Sufficient for reviews
  - id: architect
    model: claude-opus-4 # Complex system design
Git Worktree Strategy
Shared vs. Isolated:
# Isolated: Parallel independent work
agents:
  - id: feature-a-dev
    worktree_branch: feature/payment-system
  - id: feature-b-dev
    worktree_branch: feature/notification-service
# Shared: Collaborative work
agents:
  - id: backend-dev
    worktree: shared-api-work
    worktree_branch: feature/api-refactor
  - id: api-tester
    worktree: shared-api-work # Same worktree
    worktree_branch: feature/api-refactor
Security
Secrets Management
Do:
# Use environment variables
agents:
  - id: deploy-agent
    env:
      AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID # From environment
      AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY
Don't:
# Never hardcode secrets
agents:
  - id: deploy-agent
    startup_prompt: "Use API key: sk-1234abcd..." # NEVER DO THIS
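A simple guard against the mistake above is to scan prompts for strings that look like credentials before committing `colony.yml`. This is a rough sketch, not a complete secret scanner: the two patterns and the `find_hardcoded_secrets` helper are assumptions chosen for illustration, and a real setup would likely rely on a dedicated scanning tool.

```python
import re

# Illustrative patterns only; extend for the credential formats you actually use.
SECRET_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9]{8,}"),   # keys shaped like the example above
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),   # AWS access key ID format
]


def find_hardcoded_secrets(text):
    """Return substrings of `text` that match any known secret pattern."""
    return [
        match.group(0)
        for pattern in SECRET_PATTERNS
        for match in pattern.finditer(text)
    ]


# Usage: the first prompt embeds a key, the second references the environment.
print(find_hardcoded_secrets("Use API key: sk-1234abcd5678"))
print(find_hardcoded_secrets("Read credentials from $AWS_ACCESS_KEY_ID"))
```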
Access Control
Principle of Least Privilege:
agents:
  - id: readonly-reviewer
    # Give minimal permissions
    # Can read code, can't push
  - id: deployer
    # Only this agent can deploy
    env:
      DEPLOY_KEY: $DEPLOY_KEY
Team Collaboration
Multi-User Colonies
Personal Colonies:
# alice/colony.yml
name: alice-dev-colony
agents:
- id: alice-main-dev
- id: alice-reviewer
# bob/colony.yml
name: bob-dev-colony
agents:
- id: bob-main-dev
- id: bob-tester
Shared State Coordination:
# Alice
colony state pull # Get latest tasks
colony state task claim task-123 alice-main-dev
colony state push
# Bob
colony state pull # Sees Alice claimed task-123
colony state task claim task-456 bob-main-dev
colony state push
Code Review Process
Automated Review Colony:
agents:
  - id: developer
    focus: Implement features
  - id: auto-reviewer
    template: code-reviewer
    startup_prompt: |
      Review all PRs for:
      - Code quality
      - Test coverage
      - Security issues
      - Performance concerns
      Comment inline and request changes if needed.
  - id: security-reviewer
    template: security-auditor
    startup_prompt: |
      Security-focused review:
      - Check for OWASP Top 10
      - Validate input sanitization
      - Review authentication/authorization
Troubleshooting
Common Issues
Agents Not Starting:
# Check colony health
colony health
# Verify configuration
cat colony.yml
# Check tmux session
tmux list-sessions
colony attach # See what's happening
Tasks Not Being Claimed:
# Check task status
colony tasks list --status pending
# Verify dependencies
colony tasks show task-id
# Check agent assignment
colony tasks agent agent-id
State Sync Conflicts:
# Pull latest
colony state pull
# Resolve conflicts manually
# Edit .colony/state/ files
# Push resolved state
colony state push
Debug Mode
Verbose Logging:
# colony.yml
observability:
  logging:
    level: debug # Detailed logs
    output: both # File + stdout
Cost Optimization
Efficient API Usage
Model Selection:
# Use cheaper models where appropriate
agents:
  - id: linter
    model: claude-sonnet-4 # Simple tasks
  - id: architect
    model: claude-opus-4 # Complex decisions only
Task Batching:
# Instead of many small tasks
colony tasks create review-file-1 "Review file1.js"
colony tasks create review-file-2 "Review file2.js"
# ... (100 tasks)
# Batch similar work
colony tasks create review-batch-1 \
"Review all files in src/components/" \
"Check all .js files for code quality issues"
Resource Monitoring
# Track colony costs
colony metrics show api_calls --hours 24
colony metrics show token_usage --hours 24
# Budget alerts (in your monitoring)
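# Assumes the command prints a bare integer; adjust the parsing if the output differs
usage=$(colony metrics show token_usage --hours 24)
if [ "${usage:-0}" -gt 1000000 ]; then
  echo "High token usage - review colony efficiency"
fi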
Maintenance
Regular Cleanup
Weekly:
# Clean old tasks (matches the whole calendar month of 7 days ago; `date -d` requires GNU date)
colony tasks list --status completed |
grep "$(date -d '7 days ago' +%Y-%m)" |
xargs -I {} colony tasks delete {}
# Clean old logs
colony logs --clean --older-than 7d
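Matching on a year-month string is coarse: it can select tasks completed yesterday if they fall in the same month. Where completion dates are available, a true age cutoff is more precise. The sketch below assumes a mapping of task IDs to completion dates has already been extracted; `tasks_older_than` and that mapping are hypothetical.

```python
from datetime import date, timedelta


def tasks_older_than(completed, days, today):
    """Return sorted IDs of tasks completed more than `days` days before `today`.

    `completed` maps task ID -> completion date.
    """
    cutoff = today - timedelta(days=days)
    return sorted(task_id for task_id, done in completed.items() if done < cutoff)


# Usage with fixed, made-up dates so the result is reproducible.
completed = {
    "task-1": date(2024, 1, 2),
    "task-2": date(2024, 1, 20),
    "task-3": date(2024, 1, 27),
}
print(tasks_older_than(completed, days=7, today=date(2024, 1, 30)))
```

Here the cutoff is 2024-01-23, so `task-3` is kept even though it shares a month with the two older tasks.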
Monthly:
# Review agent performance
colony metrics export --output metrics.json
# Analyze which agents are most effective
# Update dependencies
cd ~/.colony/plugins && git pull
colony plugin update --all
# Review and update templates
colony template list
colony template update --all
Backup Strategy
Critical Data:
# Backup colony configuration
cp colony.yml colony.yml.backup
# Backup shared state
git clone .colony/state/.git state-backup
# Backup custom templates
tar -czf templates-backup.tar.gz .colony/templates/
Summary
Golden Rules:
- Start small - 2-3 specialized agents
- Clear roles - Each agent knows its job
- Communicate - Use broadcasts and messages effectively
- Monitor - Watch the TUI, review logs
- Sync state - Pull before work, push after
- Optimize costs - Right model for the task
- Document - Keep colony.yml well-commented
- Iterate - Adjust based on what works
Success Metrics:
- Agents rarely fail or block
- Tasks flow smoothly through stages
- Communication is clear and purposeful
- State stays synchronized
- Team velocity increases
When in Doubt:
- Check the TUI for current state
- Review logs for errors
- Consult this guide
- Ask in community discussions