Feat/add agentic training example #83

Open
KunWuLuan wants to merge 6 commits into verl-project:main from alibaba:feat/add-agentic-training-example

@KunWuLuan

Summary

Implement RemoteAgentLoop for VeRL, enabling RL training of external agents (e.g., SWE-agent, Claude Code) running in separate environments. An HTTP proxy server captures all LLM traffic and records token_ids and logprobs for PPO updates.
Closes #5737

Commits

| Commit | Description |
|---|---|
| 177581f | Add initial agentic RL training example with RemoteAgentLoop and Ray-based proxy |
| 374c1d4 | Add standalone proxy mode support with WebSocket relay |
| 60a24c7 | Add Kubernetes deployment manifest for standalone proxy |

Architecture


┌─────────────────────────────────────────────────────────────────────────┐
│                         VeRL Training Cluster                            │
│                                                                          │
│  ┌──────────────────────┐                                               │
│  │  AgentLoop Worker    │                                               │
│  │  (Ray Actor)         │                                               │
│  │                      │                                               │
│  │  ┌────────────────┐  │   HTTP (OpenAI API)                          │
│  │  │ RemoteAgent    │◄─┼──────────────────────────────────────┐       │
│  │  │ Loop           │  │                                      │       │
│  │  └───────┬────────┘  │                                      │       │
│  │          │           │                                      │       │
│  │          ▼           │                                      │       │
│  │  ┌────────────────┐  │   ┌─────────────────────────────┐   │       │
│  │  │ LLMProxyServer │  │   │  Third-Party Agent Cluster  │   │       │
│  │  │ (Ray Singleton │  │   │  (Ray or Kubernetes)        │   │       │
│  │  │  or Standalone)│  │   │                             │   │       │
│  │  │                │  │   │  ┌──────────────────────┐  │   │       │
│  │  │ - FastAPI      │  │   │  │ Third-Party Agent    │  │   │       │
│  │  │ - Session Rec. │  │   │  │ (e.g., SWE-agent)    │  │   │       │
│  │  │ - Tokenizer    │  │   │  │                      │  │   │       │
│  │  └───────┬────────┘  │   │  │ OpenAI SDK           │  │   │       │
│  │          │           │   │  │ base_url =           │  │   │       │
│  │          │           │   │  │ http://proxy/{id}/v1 │  │   │       │
│  │          ▼           │   │  └──────────────────────┘  │   │       │
│  │  ┌───────────────┐   │   └─────────────────────────────┘   │       │
│  │  │ vLLM Rollout  │   │                                     │       │
│  │  │ Servers       │   │                                     │       │
│  │  └───────────────┘   │                                     │       │
│  └──────────────────────┘                                     │       │
└─────────────────────────────────────────────────────────────────────────┘

Key Components

RemoteAgentLoop

Extends AgentLoopBase to:

  • Generate unique trial_id for each rollout trajectory
  • Discover proxy URL via Ray actor or environment variable (PROXY_SERVER_URL)
  • Submit tasks to remote agent server with proxy URL configuration
  • Reconstruct AgentLoopOutput from recorded session data (token_ids, logprobs)
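
The per-trajectory routing in the list above can be sketched roughly as follows. This is a minimal illustration rather than the PR's code: `new_trial_id` and `build_agent_base_url` are hypothetical helper names, the standard-library `uuid` module stands in for whatever ID generator the PR uses, and the `http://<proxy>/<trial_id>/v1` layout is taken from the architecture diagram.

```python
import os
import uuid


def new_trial_id() -> str:
    """Return a unique identifier for one rollout trajectory (hypothetical helper)."""
    return uuid.uuid4().hex


def build_agent_base_url(proxy_url: str, trial_id: str) -> str:
    """Build the OpenAI-compatible base URL that the remote agent should call.

    Embedding the trial_id in the path lets the proxy attribute every
    request to the session it belongs to.
    """
    return f"{proxy_url.rstrip('/')}/{trial_id}/v1"


# PROXY_SERVER_URL is the environment variable named in the PR; the fallback
# address below is only a placeholder for this sketch.
proxy_url = os.environ.get("PROXY_SERVER_URL", "http://127.0.0.1:8080")
trial_id = new_trial_id()
print(build_agent_base_url(proxy_url, trial_id))
```

The remote agent would then pass this URL as `base_url` to its OpenAI SDK client, so its requests reach the proxy without any other change to the agent.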

LLMProxyServer

Supports two operating modes:

  • Local mode (Ray actor): Direct LiteLLM + vLLM integration
  • Relay mode (standalone): WebSocket relay to framework-side inference workers

InferenceWorkerClient

WebSocket client that:

  • Connects to standalone proxy's /ws/worker endpoint
  • Bridges vLLM inference requests from proxy to VeRL's rollout servers
  • Supports automatic reconnection with exponential backoff
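
Reconnection with exponential backoff, as mentioned in the last bullet, might look something like the sketch below. It is not the PR's implementation: `connect_with_backoff`, its parameters, and the jitter range are assumptions chosen for illustration, and the actual WebSocket connect call is passed in as a plain coroutine function.

```python
import asyncio
import random
from typing import Awaitable, Callable


async def connect_with_backoff(
    connect: Callable[[], Awaitable[object]],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> object:
    """Call `connect` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # 0.5s, 1s, 2s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out reconnects when many workers fail at once.
            await sleep(delay * random.uniform(0.5, 1.0))
    raise ValueError("max_attempts must be at least 1")
```

Capping the delay keeps a long outage from pushing the retry interval to minutes, and injecting `sleep` makes the loop easy to exercise in tests without real waiting.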

Files Changed

| File | Description |
|---|---|
| agentic/__init__.py | Package marker |
| agentic/agent_loop/__init__.py | Register remote_agent loop |
| agentic/agent_loop/config.py | RemoteAgentConfig with env parsing |
| agentic/agent_loop/remote_agent_loop.py | RemoteAgentLoop implementation |
| agentic/agentic-qwen2.5-3b.sh | Example training script |
| agentic/agentic_main.py | TaskRunner integration |
| agentic/config/agentic_trainer.yaml | Trainer config |
| agentic/mcp-tools.sh | MCP tools launcher |
| agentic/swe-agent.yaml | SWE-agent config |
| agentic/proxyserver/__init__.py | Package marker |
| agentic/proxyserver/models.py | Data models |
| agentic/proxyserver/recorder.py | Session recording |
| agentic/proxyserver/server.py | Core proxy server (dual-mode) |
| agentic/proxyserver/vllm_provider.py | LiteLLM custom provider for vLLM |
| agentic/proxyserver/proxy_server.py | Standalone entry point |
| agentic/proxyserver/ray_actor.py | Ray actor wrapper |
| agentic/proxyserver/relay.py | WebSocket inference relay |
| agentic/proxyserver/worker_client.py | Framework-side WebSocket client |
| agentic/proxyserver/Dockerfile | Standalone proxy image |
| agentic/proxyserver/deploy.yaml | Kubernetes deployment |
| agentic/serversdk/__init__.py | SDK package |
| agentic/serversdk/client.py | Harbor server client |
| agentic/serversdk/models.py | SDK models |

Usage

Ray actor mode (default):

./agentic/agentic-qwen2.5-3b.sh

Standalone proxy mode:

# Deploy proxy
kubectl apply -f agentic/proxyserver/deploy.yaml

# Set environment variable
export PROXY_SERVER_URL=http://llm-proxy-server:8080
./agentic/agentic-qwen2.5-3b.sh

Test Plan

- Run agentic/agentic-qwen2.5-3b.sh with Ray actor mode
- Deploy standalone proxy to Kubernetes
- Test PROXY_SERVER_URL mode with inference worker client
- Verify token_ids and logprobs capture in session data
- End-to-end SWE-bench training run

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a RemoteAgentLoop and an associated LLM proxy server to enable PPO training with remote third-party agents. The implementation includes a standalone proxy mode using WebSockets and a Ray-based actor mode, along with an SDK for submitting agent runs to a Harbor server. Feedback focuses on improving robustness and efficiency, specifically regarding error handling for environment variable parsing, optimizing HTTP session management, fixing potential crashes in the agent loop logic, and leveraging modern Ray features for asynchronous operations.

if v := os.getenv("REMOTE_MODEL_NAME"):
    kwargs["model_name"] = v
if v := os.getenv("REMOTE_AGENT_KWARGS"):
    kwargs["agent_kwargs"] = json.loads(v)

medium

json.loads() is called directly on environment variables without error handling. If REMOTE_AGENT_KWARGS (or other JSON-based env vars on lines 84 and 86) contains invalid JSON, the application will crash with a JSONDecodeError during configuration loading. It is safer to wrap these calls in a try-except block and provide a descriptive error message.
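
One way to apply this suggestion is sketched below. `load_json_env` is a hypothetical helper name, not something from the PR, and raising `ValueError` is just one reasonable choice of error type.

```python
import json
import os
from typing import Any, Optional


def load_json_env(name: str) -> Optional[Any]:
    """Parse a JSON-valued environment variable, or return None if unset or empty."""
    raw = os.getenv(name)
    if not raw:
        return None
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Name the offending variable so the failure is easy to trace.
        raise ValueError(
            f"Environment variable {name} must contain valid JSON: {exc}"
        ) from exc


kwargs = {}
if (v := load_json_env("REMOTE_AGENT_KWARGS")) is not None:
    kwargs["agent_kwargs"] = v
```

The configuration still fails on bad input, but the error now says which variable is malformed instead of surfacing a bare `JSONDecodeError`.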

Comment on lines +222 to +248
    async def _register_session(self, proxy_url: str, trial_id: str) -> None:
        """POST /sessions/{trial_id} to register a new session."""
        async with aiohttp.ClientSession() as session:
            async with session.post(f"{proxy_url}/sessions/{trial_id}") as resp:
                resp.raise_for_status()

    async def _get_session_data(self, proxy_url: str, trial_id: str) -> SessionRecord | None:
        """GET /sessions/{trial_id} to retrieve recorded session data."""
        async with aiohttp.ClientSession() as session:
            async with session.get(f"{proxy_url}/sessions/{trial_id}") as resp:
                if resp.status == 404:
                    return None
                resp.raise_for_status()
                data = await resp.json()
                return SessionRecord(**data)

    async def _complete_session(self, proxy_url: str, trial_id: str) -> None:
        """POST /sessions/{trial_id}/complete to mark session completed."""
        async with aiohttp.ClientSession() as session:
            async with session.post(f"{proxy_url}/sessions/{trial_id}/complete") as resp:
                resp.raise_for_status()

    async def _delete_session(self, proxy_url: str, trial_id: str) -> None:
        """DELETE /sessions/{trial_id} to remove session data."""
        async with aiohttp.ClientSession() as session:
            async with session.delete(f"{proxy_url}/sessions/{trial_id}") as resp:
                resp.raise_for_status()

medium

Creating a new aiohttp.ClientSession for every helper method call (e.g., _register_session, _get_session_data, etc.) is inefficient. aiohttp sessions are designed to be reused to take advantage of connection pooling and reduce overhead. Consider creating a single session instance in the __init__ method and reusing it throughout the lifecycle of the RemoteAgentLoop instance.
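
A possible shape for the shared-session approach is sketched below. `ProxySessionClient` and its `session_factory` parameter are hypothetical; the endpoint paths come from the quoted code. As a variation on the comment, the session is created lazily on first use rather than in `__init__`, on the assumption that the object may be constructed before an event loop is running.

```python
def _default_session_factory():
    # Imported lazily so the sketch can be exercised without aiohttp installed.
    import aiohttp

    return aiohttp.ClientSession()


class ProxySessionClient:
    """Hypothetical helper that reuses one HTTP session for all proxy calls."""

    def __init__(self, proxy_url: str, session_factory=_default_session_factory):
        self._proxy_url = proxy_url.rstrip("/")
        self._session_factory = session_factory
        self._session = None

    def _get_session(self):
        # Create the session on first use, then hand back the same instance
        # so that connections are pooled across calls.
        if self._session is None or self._session.closed:
            self._session = self._session_factory()
        return self._session

    async def register_session(self, trial_id: str) -> None:
        """POST /sessions/{trial_id} to register a new session."""
        session = self._get_session()
        async with session.post(f"{self._proxy_url}/sessions/{trial_id}") as resp:
            resp.raise_for_status()

    async def delete_session(self, trial_id: str) -> None:
        """DELETE /sessions/{trial_id} to remove session data."""
        session = self._get_session()
        async with session.delete(f"{self._proxy_url}/sessions/{trial_id}") as resp:
            resp.raise_for_status()

    async def close(self) -> None:
        """Release pooled connections; call once when the loop shuts down."""
        if self._session is not None and not self._session.closed:
            await self._session.close()
```

The trade-off is that the owner now has to call `close()` explicitly, whereas the per-call `async with aiohttp.ClientSession()` pattern cleans up automatically.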

logprobs captured by the proxy.
"""
import shortuuid
messages = list(kwargs["raw_prompt"])

medium

The messages variable is initialized here but is never used in the rest of the run method. Additionally, accessing kwargs["raw_prompt"] directly will raise a KeyError if the key is missing. Since the reconstruction logic later uses session.turns[0].request_messages, this line appears to be redundant and potentially unsafe.

Comment thread agentic/agent_loop/remote_agent_loop.py Outdated
    "LOCAL_IP is not set, falling back to 0.0.0.0. "
    "The remote agent may not be able to reach the proxy."
)
agent_base_url = f"http://{local_ip}:{proxy_port}/{trial_id}/v1"

medium

urlparse(proxy_url).port returns None if the port is not explicitly specified in the URL (e.g., http://proxy-server). This results in an invalid agent_base_url like http://10.0.30.11:None/.... You should handle the case where the port is missing by defaulting to the standard port for the scheme (80 for http, 443 for https).

Suggested change
- agent_base_url = f"http://{local_ip}:{proxy_port}/{trial_id}/v1"
+ proxy_port = parsed.port or (443 if parsed.scheme == "https" else 80)
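
The behaviour described in this comment can be checked with the standard library alone. In the sketch below, `resolve_port` and `DEFAULT_PORTS` are hypothetical names introduced for illustration; only `urllib.parse.urlparse` is real API.

```python
from urllib.parse import urlparse

# Well-known default ports for the two schemes the proxy URL may use.
DEFAULT_PORTS = {"http": 80, "https": 443}


def resolve_port(proxy_url: str) -> int:
    """Return the explicit port of proxy_url, or the scheme's default port."""
    parsed = urlparse(proxy_url)
    if parsed.port is not None:
        return parsed.port
    try:
        return DEFAULT_PORTS[parsed.scheme]
    except KeyError:
        raise ValueError(f"Cannot infer a port for URL {proxy_url!r}") from None


print(urlparse("http://proxy-server").port)          # None: no explicit port
print(resolve_port("http://proxy-server"))           # 80
print(resolve_port("https://proxy-server"))          # 443
print(resolve_port("http://llm-proxy-server:8080"))  # 8080
```

Raising on an unknown scheme is a deliberate choice here: silently assuming port 80 would hide a misconfigured URL.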

)
agent_base_url = f"http://{local_ip}:{proxy_port}/{trial_id}/v1"

agent_kwargs = dict(self.remote_agent_kwargs)

medium

dict(self.remote_agent_kwargs) creates a shallow copy. If the agent configuration contains nested dictionaries or lists, they will be shared across all rollout executions. Mutations to these nested structures (e.g., by the agent SDK) could lead to subtle bugs or race conditions. Use copy.deepcopy to ensure each run has a completely isolated set of arguments.
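
The difference is easy to reproduce in isolation. The configuration below is a made-up example; its keys and values are placeholders, not the PR's actual agent kwargs.

```python
import copy

# Hypothetical nested configuration; the keys are illustrative only.
remote_agent_kwargs = {"model": "qwen", "tools": ["bash"]}

shallow = dict(remote_agent_kwargs)
shallow["tools"].append("editor")    # mutates the list shared with the original
print(remote_agent_kwargs["tools"])  # ['bash', 'editor']

remote_agent_kwargs = {"model": "qwen", "tools": ["bash"]}
deep = copy.deepcopy(remote_agent_kwargs)
deep["tools"].append("editor")       # only the copy changes
print(remote_agent_kwargs["tools"])  # ['bash']
```

With `dict(...)`, only the top-level mapping is new; the nested list is the same object in every rollout, which is what makes concurrent mutation risky.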

Comment thread agentic/proxyserver/vllm_provider.py Outdated

import ray

output = await asyncio.to_thread(ray.get, ref)

medium

In modern Ray versions, ObjectRef is awaitable. Instead of using asyncio.to_thread(ray.get, ref), which blocks a thread from the pool, you can directly await ref. This is more efficient and follows the recommended pattern for asynchronous Ray applications.

Suggested change
- output = await asyncio.to_thread(ray.get, ref)
+ output = await ref

if not log_probs:
    return None
content = []
for tid, lp in zip(token_ids, log_probs):

medium

Calling self.tokenizer.decode([tid]) in a loop for every token is inefficient, especially for long sequences. Consider using self.tokenizer.convert_ids_to_tokens or decoding the entire sequence once and mapping tokens to their positions to improve performance when building the logprobs metadata.
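
A batched version might look like the sketch below. `build_logprob_content` and the `{"token": ..., "logprob": ...}` entry shape are assumptions made for illustration; the only tokenizer method used is `convert_ids_to_tokens`, which Hugging Face tokenizers provide.

```python
def build_logprob_content(tokenizer, token_ids, log_probs):
    """Pair each generated token with its logprob using one tokenizer call."""
    if not log_probs:
        return None
    # One batched lookup instead of one decode() call per token.
    tokens = tokenizer.convert_ids_to_tokens(list(token_ids))
    return [
        {"token": tok, "logprob": lp}
        for tok, lp in zip(tokens, log_probs)
    ]
```

One caveat: `convert_ids_to_tokens` returns raw vocabulary pieces (for byte-level BPE vocabularies these carry marker characters such as `Ġ`), so the strings can differ from what `decode([tid])` produces. Whether that matters depends on how the logprobs metadata is consumed downstream.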

KunWuLuan added 3 commits May 14, 2026 16:13
Signed-off-by: KunWuLuan <30817980+KunWuLuan@users.noreply.github.com>
Signed-off-by: KunWuLuan <30817980+KunWuLuan@users.noreply.github.com>
Signed-off-by: KunWuLuan <30817980+KunWuLuan@users.noreply.github.com>
@KunWuLuan

@wuxibin89 Hi, would like to hear your thoughts, thank you!

- Promote REMOTE_AGENT_* / LOCAL_IP knobs into yaml; inject via
  runtime_env.env_vars with os.environ.setdefault to keep
  CLI > shell env > yaml precedence.
- Rename proxy_server.local_ip -> llm_proxy_ip (LOCAL_IP -> LLM_PROXY_IP)
  to match its actual meaning (the LLM proxy's externally reachable IP).
- Add recipe/agentic/dataset/local_harbor.py: scan local Harbor task
  dirs (task.toml + instruction.md) into a cached verl parquet; driver
  overrides data.train_files / data.val_files when
  data.train_harbor_dir / val_harbor_dir is set.
- Rows emit top-level instance_id / local_task_path; RemoteAgentLoop
  resolves task_path as local_task_path -> <harbor_root>/<instance_id>
  -> legacy task_path_template.
- Add README, rewrite agentic-qwen2.5-3b.sh to pure hydra overrides,
  drop obsolete mcp-tools.sh.

Assisted-by: Qoder
Signed-off-by: kunwuluan@gmail.com
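
The shell-env-over-yaml half of the precedence rule in the first bullet of the commit message above follows directly from how `os.environ.setdefault` behaves. The sketch below is illustrative only: `inject_yaml_defaults` is a hypothetical name, the values are placeholders, and the CLI layer is not modelled.

```python
import os


def inject_yaml_defaults(yaml_env: dict[str, str]) -> None:
    """Apply yaml-provided values without overriding the existing environment."""
    for key, value in yaml_env.items():
        # setdefault only writes when the key is absent, so a value exported
        # in the shell wins over the yaml default.
        os.environ.setdefault(key, value)


os.environ["LLM_PROXY_IP"] = "10.0.0.5"      # pretend this came from the shell
os.environ.pop("REMOTE_MODEL_NAME", None)    # pretend this one is unset

inject_yaml_defaults({
    "LLM_PROXY_IP": "0.0.0.0",               # yaml default, ignored
    "REMOTE_MODEL_NAME": "example-model",    # yaml default, applied
})

print(os.environ["LLM_PROXY_IP"])       # 10.0.0.5
print(os.environ["REMOTE_MODEL_NAME"])  # example-model
```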
@KunWuLuan KunWuLuan force-pushed the feat/add-agentic-training-example branch from 4f15d7c to e510cca Compare May 15, 2026 02:02