Merged

Dev #24

58 changes: 51 additions & 7 deletions README.en.md
@@ -43,7 +43,7 @@ graph TB
Nginx --> Express["Express Backend<br/>31 Routes | 20+ Services | JWT Auth"]
React <-->|"WebSocket Real-time"| Express
Express --> SQLite[("SQLite Database<br/>39 Tables | AES-256 Encryption")]
Express --> LLM["🤖 LLM API<br/>Doubao | OpenAI"]
Express --> LLM["🤖 LLM Model Pool<br/>Doubao | DeepSeek | Qwen<br/>OpenAI | ZhiPu | Local Models"]
Express --> SSH["🖥️ SSH Remote Servers"]
Express --> Webhook["🚨 Alert Webhook<br/>Prometheus | Zabbix"]
Express --> Notify["📬 Notifications<br/>Email | WeCom | DingTalk"]
@@ -55,6 +55,8 @@ graph TB

- **Multi-Agent Collaboration** — 9 preset IT operations Agents with custom creation support, covering alerts, diagnostics, inspection, compliance, and more
- **Visual Workflow** — Drag-and-drop orchestration with serial/parallel/conditional branching and real-time WebSocket progress pushing
- **HITL Human Approval** — Workflows support approval nodes that pause execution until a human confirms; on timeout a request is either auto-rejected or kept waiting, and approval requests are pushed automatically to WeCom/DingTalk/Email
- **AI Remediation Loop** — Alerts trigger AI analysis → structured remediation commands are generated automatically → an approval node confirms them → the remediation is executed automatically → results are verified and fed back
- **Web SSH Terminal** — Interactive remote terminal based on xterm.js with real-time I/O, window auto-resize, and bidirectional communication
- **Host Management Enhancement** — Multi-level group tree structure, CSV/JSON bulk import, automatic SSH information collection (CPU/Memory/Disk/OS)
- **Data Import/Export** — Bulk server import via CSV/JSON, export alerts, audit logs, and report data
@@ -95,13 +97,36 @@ Multi-layered security design to protect your servers and data:

## Supported AI Models

| Type | Provider/Framework | Support Status | Recommended Scenario |
|------|------------|---------|---------|
| **Domestic Cloud API** | VolcEngine · Doubao | ✅ Fully Supported | Recommended for users in China |
| **International Cloud API** | OpenAI (GPT-4o, etc.) | ✅ Fully Supported | Users with external network access |
| **Local Deployment** | Ollama / LM Studio / vLLM | ✅ Fully Supported | High data security requirements |
The project supports most mainstream large language models worldwide, managed through a unified AI model pool with primary/fallback model chains.

**Recommended Local Models**: Qwen2.5, Llama3, DeepSeek-Coder, Yi, ChatGLM, Phi-3, etc. (OpenAI-compatible).
| Type | Provider/Model | Integration | Recommended Scenario |
|------|------------|---------|---------|
| **Domestic Cloud API** | VolcEngine · Doubao | Native API | Recommended for users in China |
| **Domestic Cloud API** | Alibaba Cloud · Qwen | OpenAI Compatible | Enterprise applications in China |
| **Domestic Cloud API** | DeepSeek | OpenAI Compatible | Strong code generation and reasoning |
| **Domestic Cloud API** | ZhiPu AI (GLM-4) | OpenAI Compatible | Excellent Chinese understanding |
| **Domestic Cloud API** | Moonshot · Kimi | OpenAI Compatible | Long text processing |
| **Domestic Cloud API** | Baidu · Wenxin | OpenAI Compatible | Enterprise applications in China |
| **Domestic Cloud API** | 01.AI (Yi) | OpenAI Compatible | Open source models |
| **Domestic Cloud API** | Baichuan | OpenAI Compatible | Open source models |
| **International Cloud API** | OpenAI (GPT-4o, o1, o3) | Native API | Users with external network access |
| **International Cloud API** | Anthropic Claude | OpenAI Compatible Layer | Complex reasoning tasks |
| **International Cloud API** | Meta Llama | Ollama/vLLM | Open source models |
| **International Cloud API** | Mistral | OpenAI Compatible | Open source models |
| **Local Deployment** | Ollama | OpenAI Compatible | High data security requirements |
| **Local Deployment** | LM Studio | OpenAI Compatible | Desktop local models |
| **Local Deployment** | vLLM | OpenAI Compatible | High-performance inference |
| **Local Deployment** | Other OpenAI Compatible | OpenAI Compatible | Any compatible service |

**Recommended Local Models**: Qwen2.5, Llama3, DeepSeek-Coder, Yi, ChatGLM, Phi-3, Mistral, etc.

**Features**:
- ✅ Unified AI model pool management, supporting any number of models
- ✅ Primary/fallback model chain (primary_model_id + fallback_model_id)
- ✅ Independent circuit breaker per provider to prevent a single point of failure
- ✅ Drag-and-drop sorting to define priority
- ✅ Model connectivity testing
- ✅ API Key inheritance mechanism to reduce duplicate configuration
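The primary/fallback chain and the per-provider circuit breaker listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `ModelConfig`, `CircuitBreaker`, `completeWithFallback`, the failure threshold and the cooldown are all hypothetical, and the actual LLM call is injected as a function so the sketch stays self-contained.

```typescript
// Hypothetical model-pool entry; fallbackModelId mirrors the fallback_model_id column.
interface ModelConfig {
  id: string;
  provider: string;
  fallbackModelId?: string;
}

// The real provider call is injected, so this sketch has no network dependency.
type CallModel = (model: ModelConfig, prompt: string) => Promise<string>;

// Opens after `threshold` consecutive failures; once `cooldownMs` has passed it
// lets a trial request through (half-open).
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold = 3, cooldownMs = 60_000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  canRequest(now: number = Date.now()): boolean {
    if (this.openedAt === null) return true;
    if (now - this.openedAt < this.cooldownMs) return false;
    // Half-open: a single further failure re-opens the circuit.
    this.openedAt = null;
    this.failures = this.threshold - 1;
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}

// One breaker per provider, so an outage at one provider does not block the others.
const breakers = new Map<string, CircuitBreaker>();

function breakerFor(provider: string): CircuitBreaker {
  let breaker = breakers.get(provider);
  if (!breaker) {
    breaker = new CircuitBreaker();
    breakers.set(provider, breaker);
  }
  return breaker;
}

// Walks primary -> fallback -> ..., skipping providers whose circuit is open.
// The `visited` set guards against a cyclic fallback configuration.
async function completeWithFallback(
  pool: Map<string, ModelConfig>,
  primaryId: string,
  prompt: string,
  call: CallModel,
): Promise<string> {
  const visited = new Set<string>();
  let current = pool.get(primaryId);
  let lastError: unknown = new Error(`model ${primaryId} not found in pool`);

  while (current && !visited.has(current.id)) {
    visited.add(current.id);
    const breaker = breakerFor(current.provider);
    if (breaker.canRequest()) {
      try {
        const result = await call(current, prompt);
        breaker.recordSuccess();
        return result;
      } catch (err) {
        breaker.recordFailure();
        lastError = err;
      }
    } else {
      lastError = new Error(`circuit open for provider ${current.provider}`);
    }
    current = current.fallbackModelId ? pool.get(current.fallbackModelId) : undefined;
  }
  throw lastError;
}
```

Keeping the breaker per provider rather than per model means that two models hosted by the same failing provider are skipped together, and an open breaker simply makes the walk move on to the next model in the chain.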

## Tech Stack

@@ -345,6 +370,25 @@ System overview displaying servers, alerts, tasks, and other core metrics.
- Report template management
- Report viewing and download

### Approval Center (HITL)

- Workflows support human approval nodes, which can be dragged into the workflow editor
- Approval node configuration: approval description, timeout, timeout behavior (auto-reject/continue waiting)
- Unified Approval Center page showing pending, approved, and rejected approval requests
- Approval actions: approve/reject (a reason is required)
- Approval requests automatically push notifications (WeCom, DingTalk, Email)
- Supports quick approval from mobile devices
- WebSocket real-time push of approval status changes
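The approve/reject and timeout rules above can be read as a small state transition. The sketch below assumes the status strings `pending`, `approved` and `rejected` (matching the three views of the Approval Center), assumes the required reason applies to rejections (as the `reject_reason` column of the `approval_requests` table suggests), and uses `"wait"` as a hypothetical name for the "continue waiting" timeout behavior. `decidedBy` stands in for the `approved_by` column. The real service persists these changes to the database instead of returning new objects.

```typescript
type ApprovalStatus = "pending" | "approved" | "rejected";
// "reject" matches the migration's default; "wait" is an assumed name for "continue waiting".
type TimeoutAction = "reject" | "wait";

interface ApprovalRequest {
  id: string;
  status: ApprovalStatus;
  timeoutAtMs: number | null; // epoch milliseconds; null means no timeout
  timeoutAction: TimeoutAction;
  decidedBy?: string;
  rejectReason?: string;
}

type Decision =
  | { kind: "approve"; user: string }
  | { kind: "reject"; user: string; reason: string };

// Only pending requests can be decided; a rejection needs a non-blank reason.
function decide(req: ApprovalRequest, decision: Decision): ApprovalRequest {
  if (req.status !== "pending") {
    throw new Error(`approval ${req.id} is already ${req.status}`);
  }
  if (decision.kind === "reject") {
    if (decision.reason.trim() === "") throw new Error("a reject reason is required");
    return { ...req, status: "rejected", decidedBy: decision.user, rejectReason: decision.reason };
  }
  return { ...req, status: "approved", decidedBy: decision.user };
}

// Called by a periodic checker: an expired request is rejected or left pending.
function applyTimeout(req: ApprovalRequest, nowMs: number): ApprovalRequest {
  if (req.status !== "pending" || req.timeoutAtMs === null || nowMs < req.timeoutAtMs) {
    return req;
  }
  if (req.timeoutAction === "reject") {
    return { ...req, status: "rejected", rejectReason: "approval timed out" };
  }
  return req; // "wait": keep the request pending
}
```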

### AI Remediation Records

- After analyzing an alert, the AI automatically generates structured remediation commands (JSON format)
- Automatically creates a remediation workflow: [Approval Node] → [Execute Remediation Agent Node]
- Automatically sets the approval timeout based on risk level (low: 30 min, medium: 1 hr, high: 2 hr)
- Displays the complete remediation process: diagnosis report, remediation commands, risk level, execution status
- Supports viewing execution results and error messages
- Integrated with the alert and task systems
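The risk-based timeouts and the two-node remediation workflow described above could be wired together roughly like this. Only the risk-to-timeout mapping (30 min / 1 hr / 2 hr) and the node order come from the list; the JSON shape (`riskLevel`, `commands`), the node and edge types, `parseRemediationPlan` and `buildRemediationWorkflow` are hypothetical.

```typescript
type RiskLevel = "low" | "medium" | "high";

// Timeouts from the README: low 30 min, medium 1 hr, high 2 hr.
const APPROVAL_TIMEOUT_MINUTES: Record<RiskLevel, number> = { low: 30, medium: 60, high: 120 };

// Hypothetical shape of the AI's structured remediation output.
interface RemediationPlan {
  riskLevel: RiskLevel;
  commands: string[];
}

interface WorkflowNode {
  id: string;
  type: "approval" | "agent";
  config: Record<string, unknown>;
}

interface Workflow {
  nodes: WorkflowNode[];
  edges: Array<{ from: string; to: string }>;
}

function isRiskLevel(value: unknown): value is RiskLevel {
  return value === "low" || value === "medium" || value === "high";
}

// Validate model output before it can reach an executor; reject anything malformed.
function parseRemediationPlan(raw: string): RemediationPlan {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) throw new Error("plan must be a JSON object");
  const { riskLevel, commands } = data as { riskLevel?: unknown; commands?: unknown };
  if (!isRiskLevel(riskLevel)) throw new Error(`invalid riskLevel: ${String(riskLevel)}`);
  if (!Array.isArray(commands) || commands.length === 0 || !commands.every((c) => typeof c === "string")) {
    throw new Error("commands must be a non-empty array of strings");
  }
  return { riskLevel, commands: commands as string[] };
}

// [Approval node] -> [Execute remediation agent node], with a risk-based timeout.
function buildRemediationWorkflow(plan: RemediationPlan, nowMs: number): Workflow {
  const timeoutMs = APPROVAL_TIMEOUT_MINUTES[plan.riskLevel] * 60_000;
  return {
    nodes: [
      {
        id: "approval",
        type: "approval",
        config: {
          description: `Risk: ${plan.riskLevel}\n${plan.commands.join("\n")}`,
          timeoutAtMs: nowMs + timeoutMs,
          timeoutAction: "reject", // assumed, matching the migration's timeout_action default
        },
      },
      { id: "execute", type: "agent", config: { commands: plan.commands } },
    ],
    edges: [{ from: "approval", to: "execute" }],
  };
}
```

Validating the model's JSON before building the workflow keeps a malformed or unexpected risk level from silently receiving a default timeout.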

## Project Structure

```
64 changes: 54 additions & 10 deletions README.md
@@ -42,7 +42,7 @@ graph TB
Nginx --> Express["Express Backend<br/>31 Routes | 20+ Services | JWT Auth"]
React <-->|"WebSocket Real-time Communication"| Express
Express --> SQLite[("SQLite Database<br/>39 Tables | AES-256 Encryption")]
Express --> LLM["🤖 LLM API<br/>Doubao | OpenAI"]
Express --> LLM["🤖 LLM Model Pool<br/>Doubao | DeepSeek | Qwen<br/>OpenAI | ZhiPu | Local Models"]
Express --> SSH["🖥️ SSH Remote Servers"]
Express --> Webhook["🚨 Alert Webhook<br/>Prometheus | Zabbix"]
Express --> Notify["📬 Notification Channels<br/>Email | WeCom | DingTalk"]
@@ -55,6 +55,8 @@ graph TB

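- **Multi-Agent Collaboration** — 9 preset IT operations Agents with support for custom creation, covering alerts, diagnostics, inspection, change management, and more
- **Visual Workflow** — Drag-and-drop orchestration with serial/parallel/conditional branching and real-time execution progress pushed over WebSocket
- **HITL Human Approval** — Workflows support approval nodes that pause execution until a human confirms; on timeout a request is either auto-rejected or kept waiting, and approval requests are pushed automatically to WeCom/DingTalk/Email
- **AI Remediation Loop** — Alerts automatically trigger AI analysis → structured remediation commands are generated automatically → an approval node confirms them → the remediation is executed automatically → results are verified and fed back
- **Web SSH Terminal** — Interactive remote terminal based on xterm.js with real-time input/output, adaptive window sizing, and bidirectional real-time communication
- **Host Management Enhancement** — Multi-level group tree structure, CSV/JSON bulk import, automatic host information collection over SSH (CPU/Memory/Disk/OS)
- **Data Import/Export** — Bulk server import in CSV/JSON format; export of alerts, audit logs, and report data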
@@ -95,13 +97,36 @@ graph TB

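## Supported AI Models

| Type | Provider/Framework | Support Status | Recommended Scenario |
|------|------------|---------|---------|
| **Domestic Cloud API** | VolcEngine · Doubao | ✅ Fully Supported | Recommended for users in China; stable and fast |
| **International Cloud API** | OpenAI (GPT-4o, etc.) | ✅ Fully Supported | Users with external network access |
| **Local Deployment** | Ollama / LM Studio / vLLM | ✅ Fully Supported | High data security requirements, intranet deployment |
The project supports the vast majority of mainstream large models from China and abroad, managed centrally through the AI model pool with primary/fallback model chains.

**Recommended Local Models**: Qwen2.5, Llama3, DeepSeek-Coder, Yi, ChatGLM, Phi-3, and other open-source large models (OpenAI-compatible API).
| Type | Provider/Model | Integration | Recommended Scenario |
|------|------------|---------|---------|
| **Domestic Cloud API** | VolcEngine · Doubao | Native API | Recommended for users in China; stable and fast |
| **Domestic Cloud API** | Alibaba Cloud · Tongyi Qianwen (Qwen) | OpenAI Compatible | Enterprise applications in China |
| **Domestic Cloud API** | DeepSeek | OpenAI Compatible | Strong code generation and reasoning |
| **Domestic Cloud API** | ZhiPu AI (GLM-4) | OpenAI Compatible | Excellent Chinese understanding |
| **Domestic Cloud API** | Moonshot · Kimi | OpenAI Compatible | Long text processing |
| **Domestic Cloud API** | Baidu · Wenxin Yiyan (ERNIE Bot) | OpenAI Compatible | Enterprise applications in China |
| **Domestic Cloud API** | 01.AI (Yi) | OpenAI Compatible | Open-source models |
| **Domestic Cloud API** | Baichuan Intelligence (Baichuan) | OpenAI Compatible | Open-source models |
| **International Cloud API** | OpenAI (GPT-4o, o1, o3) | Native API | Users with external network access |
| **International Cloud API** | Anthropic Claude | OpenAI Compatible Layer | Complex reasoning tasks |
| **International Cloud API** | Meta Llama | Ollama/vLLM | Open-source models |
| **International Cloud API** | Mistral | OpenAI Compatible | Open-source models |
| **Local Deployment** | Ollama | OpenAI Compatible | High data security requirements, intranet deployment |
| **Local Deployment** | LM Studio | OpenAI Compatible | Desktop local models |
| **Local Deployment** | vLLM | OpenAI Compatible | High-performance inference service |
| **Local Deployment** | Other OpenAI-compatible endpoints | OpenAI Compatible | Any compatible service |

**Recommended Local Models**: Qwen2.5, Llama3, DeepSeek-Coder, Yi, ChatGLM, Phi-3, Mistral, and other open-source large models.

**Features**:
- ✅ Unified AI model pool management, supporting any number of models
- ✅ Primary/fallback model chain (primary_model_id + fallback_model_id)
- ✅ Independent circuit breaker per provider to prevent a single point of failure
- ✅ Drag-and-drop sorting to define priority
- ✅ Model connectivity testing
- ✅ API Key inheritance mechanism to reduce duplicate configuration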

## Tech Stack

@@ -350,22 +375,41 @@ npm run dev
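- Report template management
- Report viewing and download

### Approval Center (HITL)

- Workflows support human approval nodes, which can be added by drag-and-drop in the workflow editor
- Approval node configuration: approval description, timeout, timeout behavior (auto-reject/continue waiting)
- Unified approval center page showing pending, approved, and rejected approval requests
- Approval actions: approve/reject (a reason is required)
- Approval requests automatically push notifications (WeCom, DingTalk, Email)
- Supports quick approval from mobile phones
- Real-time WebSocket push of approval status changes

### AI Remediation Records

- After the AI analyzes an alert, it automatically generates structured remediation commands (JSON format)
- Automatically creates a remediation workflow: [Approval Node] → [Execute Remediation Agent Node]
- Automatically sets the approval timeout based on risk level (low: 30 minutes, medium: 1 hour, high: 2 hours)
- Displays the complete remediation process: diagnosis report, remediation commands, risk level, execution status
- Supports viewing execution results and error information
- Integrated with the alert and task systems

## Project Structure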

```
├── backend/
│ └── src/
│ ├── app.ts # Express application entry point
│ ├── models/database.ts # SQLite database initialization and seed data
│ ├── routes/ # API routes (31 modules)
│ ├── services/ # Business logic (20+ services)
│ ├── routes/ # API routes (32 modules)
│ ├── services/ # Business logic (20+ services, including aiRemediationService)
│ ├── middleware/ # Middleware (6: auth, errorHandler, rateLimiter, validation, trace, commandFilter)
│ ├── websocket/ # WebSocket real-time communication
│ └── utils/ # Utility functions
├── frontend/
│ └── src/
│ ├── App.tsx # React application entry point
│ ├── pages/ # Page components (27)
│ ├── pages/ # Page components (28, including Approvals and AiRemediations)
│ ├── components/ # Shared components
│ ├── contexts/ # React Context
│ ├── hooks/ # Custom Hooks
37 changes: 37 additions & 0 deletions backend/src/app.ts
@@ -47,6 +47,8 @@ import sshKeyRoutes from './routes/sshKeyRoutes';
import topologyRoutes from './routes/topologyRoutes';
import changeRoutes from './routes/changeRoutes';
import aiModelRoutes from './routes/aiModelRoutes';
import approvalRoutes from './routes/approvalRoutes';
import aiRemediationRoutes from './routes/aiRemediationRoutes';
import { schedulerService } from './services/schedulerService';
import { reportService } from './services/reportService';
import { copilotService } from './services/copilotService';
@@ -71,6 +73,7 @@ import { alertAutoAnalyzer } from './services/alertAutoAnalyzer';
import { alertCorrelationService } from './services/alertCorrelationService';
import { setServerInstances } from './services/restartService';
import { checkDbskiterAvailability } from './services/dbskiterService';
import { timeoutApproval } from './services/workflowExecutor';
import { queueService } from './services/queueService';
import importExportRouter from './routes/importExportRoutes';
import alertAutoRouter from './routes/alertAutoRoutes';
@@ -153,6 +156,7 @@ async function initializeApp() {

initTokenBlacklist();
startCircuitBreakerCleanup();
startApprovalTimeoutChecker();

logger.info('✅ Application initialization complete');
}
@@ -259,6 +263,8 @@ app.use('/api/ssh-keys', rateLimiter, sshKeyRoutes);
app.use('/api/topology', rateLimiter, topologyRoutes);
app.use('/api/changes', rateLimiter, changeRoutes);
app.use('/api/ai-models', rateLimiter, aiModelRoutes);
app.use('/api/approvals', rateLimiter, approvalRoutes);
app.use('/api/ai-remediations', rateLimiter, aiRemediationRoutes);
app.use('/api', rateLimiter, alertAutoRouter);
app.use('/api', rateLimiter, linkageRouter);
app.use('/api', rateLimiter, networkDiscoveryRouter);
@@ -270,6 +276,32 @@ app.use(errorHandler);
const PORT = env.PORT;
const HOST = process.env.HOST || '0.0.0.0';

// Approval timeout checker
let approvalTimeoutInterval: NodeJS.Timeout | null = null;

function startApprovalTimeoutChecker() {
// Check for timed-out approval requests every 30 seconds
approvalTimeoutInterval = setInterval(async () => {
try {
const expiredApprovals = db.prepare(`
SELECT id FROM approval_requests
WHERE status = 'pending'
AND timeout_at IS NOT NULL
AND timeout_at < datetime('now', 'localtime')
`).all() as Array<{ id: string }>;

for (const approval of expiredApprovals) {
logger.info(`⏰ Approval ${approval.id} timed out, processing...`);
await timeoutApproval(approval.id);
}
} catch (error) {
logger.error('Error in approval timeout checker:', error);
}
}, 30000);

logger.info('✅ Approval timeout checker started (checking every 30s)');
}

// Wait for database initialization to finish before starting the HTTP server, to avoid a race condition
async function startServer() {
await initializeApp();
@@ -294,6 +326,11 @@ const gracefulShutdown = async (signal: string) => {
process.exit(1);
}, 30000);

// Stop the approval timeout checker
if (approvalTimeoutInterval) {
clearInterval(approvalTimeoutInterval);
approvalTimeoutInterval = null;
}

try {
await Promise.all([
new Promise<void>((resolve) => httpServer.close(() => {
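One caveat with the approval timeout checker above: `setInterval` does not wait for an `async` callback to finish, so if a pass ever takes longer than 30 seconds, the next pass can start while the previous one is still running and pick up the same still-pending approvals. A minimal guard, sketched here with the hypothetical helper name `startGuardedInterval`, skips a tick while the previous pass is in progress. Making the status update itself conditional (for example an `UPDATE ... WHERE status = 'pending'`) would be a complementary safeguard, but `timeoutApproval`'s internals are not shown in this diff.

```typescript
// Runs `task` every `intervalMs` but never lets two runs overlap.
// Returns a stop function that a shutdown handler can call.
function startGuardedInterval(
  task: () => Promise<void>,
  intervalMs: number,
  onError: (error: unknown) => void,
): () => void {
  let running = false;
  const handle = setInterval(async () => {
    if (running) return; // previous pass still in progress: skip this tick
    running = true;
    try {
      await task();
    } catch (error) {
      onError(error);
    } finally {
      running = false;
    }
  }, intervalMs);
  return () => clearInterval(handle);
}
```

In `app.ts`, the existing query-and-`timeoutApproval` loop could be passed as `task` and `logger.error` as `onError`, with the returned stop function called from `gracefulShutdown`.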
2 changes: 2 additions & 0 deletions backend/src/models/migrations/index.ts
@@ -13,6 +13,7 @@ import v013NetworkDiscovery from './v013_network_discovery';
import v014AlertCorrelation from './v014_alert_correlation';
import v015NotificationColumns from './v015_notification_columns';
import v016DatabasesTable from './v016_databases_table';
import v017ApprovalRequests from './v017_approval_requests';

// v009 / v010 do not export Migration objects, so wrap them manually
const v009NetworkCompleteCoverage: Migration = {
@@ -48,6 +49,7 @@ export const ALL_MIGRATIONS: Migration[] = [
v014AlertCorrelation,
v015NotificationColumns,
v016DatabasesTable,
v017ApprovalRequests,
];

export function createMigrationManager(db: any): MigrationManager {
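Registering `v017ApprovalRequests` in `ALL_MIGRATIONS` is what makes the new table get created. The project's `MigrationManager` is not shown in this diff, so the following is only a hypothetical, in-memory sketch of the usual pattern: refuse duplicate version numbers, skip versions already recorded as applied, and run the rest in ascending `version` order. The `Migration` interface mirrors the fields used in `v017_approval_requests.ts`; `runPendingMigrations` and the `applied` set are assumptions.

```typescript
// Mirrors the fields used by v017_approval_requests.ts; `db` is left opaque here.
interface Migration {
  id: string;
  version: number;
  name: string;
  description: string;
  up: (db: unknown) => Promise<void>;
  down: (db: unknown) => Promise<void>;
}

// Applies pending migrations in ascending version order and returns the versions run.
// `applied` stands in for whatever table records completed migrations (an assumption).
async function runPendingMigrations(
  migrations: Migration[],
  applied: Set<number>,
  db: unknown,
): Promise<number[]> {
  const versions = new Set<number>();
  for (const m of migrations) {
    if (versions.has(m.version)) throw new Error(`duplicate migration version ${m.version}`);
    versions.add(m.version);
  }

  const pending = migrations
    .filter((m) => !applied.has(m.version))
    .sort((a, b) => a.version - b.version);

  const ran: number[] = [];
  for (const m of pending) {
    await m.up(db); // a failure stops the run; later migrations may depend on this one
    applied.add(m.version);
    ran.push(m.version);
  }
  return ran;
}
```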
40 changes: 40 additions & 0 deletions backend/src/models/migrations/v017_approval_requests.ts
@@ -0,0 +1,40 @@
import { Migration } from './migrationFramework';

const v017ApprovalRequests: Migration = {
id: '20260614000017',
version: 17,
name: 'approval_requests',
description: 'Add approval_requests table for HITL workflow',

up: async (db: any) => {
db.exec(`
CREATE TABLE IF NOT EXISTS approval_requests (
id TEXT PRIMARY KEY,
task_id TEXT NOT NULL,
node_id TEXT NOT NULL,
node_label TEXT NOT NULL,
description TEXT,
status TEXT NOT NULL DEFAULT 'pending',
requested_by TEXT,
approved_by TEXT,
approved_at DATETIME,
reject_reason TEXT,
timeout_at DATETIME,
timeout_action TEXT DEFAULT 'reject',
created_at DATETIME DEFAULT (datetime('now','localtime')),
updated_at DATETIME DEFAULT (datetime('now','localtime')),
FOREIGN KEY (task_id) REFERENCES tasks(id) ON DELETE CASCADE
);

CREATE INDEX IF NOT EXISTS idx_approval_task ON approval_requests(task_id);
CREATE INDEX IF NOT EXISTS idx_approval_status ON approval_requests(status);
CREATE INDEX IF NOT EXISTS idx_approval_created ON approval_requests(created_at DESC);
`);
},

down: async (db: any) => {
db.exec(`DROP TABLE IF EXISTS approval_requests`);
}
};

export default v017ApprovalRequests;
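The timeout checker in `app.ts` compares `timeout_at < datetime('now', 'localtime')` as text, which only works if `timeout_at` is written in the same `YYYY-MM-DD HH:MM:SS` local-time format that SQLite's `datetime()` returns. An ISO string such as `2026-06-14T09:00:00.000Z` would compare incorrectly, both because `'T'` sorts after the space character and because it is UTC rather than local time. The code that writes `timeout_at` is not part of this diff, so the helpers below (hypothetical names `toSqliteLocalDatetime` and `computeTimeoutAt`) only sketch a format consistent with that query, assuming Node and SQLite share the same local time zone, as they do when SQLite runs in-process.

```typescript
// Formats a Date as "YYYY-MM-DD HH:MM:SS" in the process's local time zone,
// the same text format SQLite's datetime('now', 'localtime') produces.
function toSqliteLocalDatetime(date: Date): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  return (
    `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())} ` +
    `${pad(date.getHours())}:${pad(date.getMinutes())}:${pad(date.getSeconds())}`
  );
}

// Value to store in approval_requests.timeout_at for a timeout of `minutes`.
function computeTimeoutAt(now: Date, minutes: number): string {
  return toSqliteLocalDatetime(new Date(now.getTime() + minutes * 60_000));
}
```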
47 changes: 47 additions & 0 deletions backend/src/routes/aiRemediationRoutes.ts
@@ -0,0 +1,47 @@
import { Router, Request, Response } from 'express';
import { aiRemediationService } from '../services/aiRemediationService';
import { authenticateToken } from '../middleware/auth';

const router = Router();

// List all AI remediation records
router.get('/', authenticateToken, (req: Request, res: Response) => {
try {
const limit = parseInt(req.query.limit as string) || 50;
const records = aiRemediationService.listRecords(limit);
res.json({ success: true, data: records });
} catch (error) {
console.error('Failed to list AI remediations:', error);
res.status(500).json({ success: false, message: 'Failed to list AI remediations' });
}
});

// Get an AI remediation record by ID
router.get('/:id', authenticateToken, (req: Request, res: Response) => {
try {
const record = aiRemediationService.getRecord(req.params.id);
if (!record) {
return res.status(404).json({ success: false, message: 'AI remediation not found' });
}
res.json({ success: true, data: record });
} catch (error) {
console.error('Failed to get AI remediation:', error);
res.status(500).json({ success: false, message: 'Failed to get AI remediation' });
}
});

// Get the AI remediation record for an alert ID
router.get('/alert/:alertId', authenticateToken, (req: Request, res: Response) => {
try {
const record = aiRemediationService.getByAlertId(req.params.alertId);
if (!record) {
return res.status(404).json({ success: false, message: 'AI remediation not found for this alert' });
}
res.json({ success: true, data: record });
} catch (error) {
console.error('Failed to get AI remediation by alert:', error);
res.status(500).json({ success: false, message: 'Failed to get AI remediation' });
}
});

export default router;
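In the list route, `parseInt(req.query.limit as string) || 50` accepts negative and arbitrarily large values. Whether that matters depends on how `listRecords` uses the number, which this diff does not show; if it ends up in a SQLite `LIMIT` clause, a negative value means no upper bound at all. A defensive parser might look like the sketch below, where `parseLimit` and the cap of 200 are hypothetical choices.

```typescript
// Normalizes a `?limit=` query value: non-strings (including repeated params,
// which Express parses as arrays), non-numbers and values below 1 fall back
// to the default, and anything above `max` is clamped.
function parseLimit(raw: unknown, fallback = 50, max = 200): number {
  if (typeof raw !== "string") return fallback;
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n) || n < 1) return fallback;
  return Math.min(n, max);
}
```

In the route this would read `const limit = parseLimit(req.query.limit);`, dropping the `as string` cast.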