diff --git a/CHAIBOT_QUICKSTART.md b/CHAIBOT_QUICKSTART.md new file mode 100644 index 0000000000000..faed5e930c216 --- /dev/null +++ b/CHAIBOT_QUICKSTART.md @@ -0,0 +1,218 @@ +# Chaibot Quick Start Guide + +## What is Chaibot? + +An AI-powered Slack workflow that automatically triages test failures in #opp-discussion and posts analysis in threads. + +## Files Created + +``` +core-services/ci-chat-bot/ +├── triage-config.yaml # Main config (source of truth) +└── CHAIBOT.md # Quick reference + +clusters/app.ci/ci-chat-bot/ +├── chaibot-configmap.yaml # Kubernetes ConfigMap +└── chaibot-deployment-patch.yaml # Deployment updates + alerts + +core-services/ci-secret-bootstrap/ +└── chaibot-secret-config.yaml # Secret management + +docs/ +└── chaibot-test-failure-triage.md # Full documentation +``` + +## How to Deploy + +### Step 1: Get Credentials + +```bash +# 1. Get OpenAI API key from https://platform.openai.com/api-keys + +# 2. Get Slack channel ID: +# - Right-click #opp-discussion in Slack +# - View channel details +# - Copy Channel ID (looks like C01234ABCD) + +# 3. 
Update clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml +# Replace REPLACE_WITH_CHANNEL_ID with actual ID +``` + +### Step 2: Configure Secrets + +```bash +# Add to core-services/ci-secret-bootstrap/_config.yaml: +# (Follow pattern in chaibot-secret-config.yaml) + +# Store OpenAI key in Vault +# Reference: https://docs.ci.openshift.org/docs/how-tos/adding-a-new-secret-to-ci/ +``` + +### Step 3: Update ci-chat-bot Deployment + +Edit `clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml` and add: + +```yaml +# Add to spec.template.spec.volumes: +- name: triage-config + configMap: + name: ci-chat-bot-triage-config +- name: chaibot-secrets + secret: + secretName: ci-chat-bot-chaibot-secrets + +# Add to spec.template.spec.containers[name=bot].volumeMounts: +- name: triage-config + mountPath: /etc/triage-config + readOnly: true +- name: chaibot-secrets + mountPath: /etc/chaibot-secrets + readOnly: true + +# Add to spec.template.spec.containers[name=bot].env: +- name: CHAIBOT_ENABLED + value: "true" +- name: OPENAI_API_KEY + valueFrom: + secretKeyRef: + name: ci-chat-bot-chaibot-secrets + key: openai-api-key + +# Add to spec.template.spec.containers[name=bot].args: +--enable-triage=true \ +--triage-config-path=/etc/triage-config/triage-config.yaml \ +``` + +### Step 4: Deploy + +```bash +# From openshift/release repo root: +make update + +# Apply ConfigMap +oc apply -f clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml + +# Apply updated deployment (after editing) +oc apply -f clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml + +# Watch rollout +oc rollout status deployment/ci-chat-bot -n ci +``` + +### Step 5: Test + +``` +# In Slack #opp-discussion: +Post a message with a Prow job URL: + +"This job failed: https://prow.ci.openshift.org/view/gs/origin-ci-test/..." 
+ +# Chaibot should respond in thread within 30-60 seconds +``` + +## Example Output + +``` +Chaibot [BOT]: :cloud: Test Failure Analysis + +Job: pull-ci-openshift-installer-master-e2e-aws +Status: Failed after 2h 15m +Root Cause: Infrastructure - AWS EC2 Capacity (85% confidence) + +Analysis: +Cluster provisioning failed due to AWS InsufficientInstanceCapacity error. + +Evidence: +Error: creating EC2 Instance: InsufficientInstanceCapacity (us-east-1c) + +Historical: +8 similar failures in last 24h (transient AWS issue) + +Recommendations: +1. Retest - likely to succeed on retry +2. Check AWS Service Health Dashboard + +Classification: Transient Infrastructure Issue +``` + +## Configuration + +Edit `core-services/ci-chat-bot/triage-config.yaml`: + +```yaml +# Add channels +monitored_channels: + - name: "opp-discussion" + channel_id: "C01234567" + +# Adjust AI settings +analysis: + ai_provider: "openai" + model: "gpt-4" # or "gpt-3.5-turbo" for lower cost + +# Rate limiting +rate_limiting: + max_analyses_per_hour: 100 +``` + +## Monitoring + +```bash +# Check logs +oc logs -n ci deployment/ci-chat-bot -c bot | grep chaibot + +# View metrics +curl http://ci-chat-bot.ci.svc:9090/metrics | grep chaibot + +# Grafana dashboard +https://grafana.ci.openshift.org/d/chaibot/ +``` + +## Troubleshooting + +**Not responding?** +```bash +oc get pods -n ci -l app=ci-chat-bot +oc logs -n ci -l app=ci-chat-bot -c bot --tail=50 +``` + +**Wrong channel ID?** +```bash +oc get configmap ci-chat-bot-triage-config -n ci -o yaml +# Update and reapply +``` + +**API errors?** +```bash +# Check secret exists +oc get secret ci-chat-bot-chaibot-secrets -n ci + +# View metrics for errors +curl http://ci-chat-bot.ci.svc:9090/metrics | grep chaibot_api_errors +``` + +## Cost + +- GPT-4: ~$0.03/analysis (~$90/month at 100 failures/day) +- GPT-3.5-turbo: ~$0.003/analysis (~$9/month at 100 failures/day) + +Rate limiting prevents cost overruns. 
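The cost figures above are simple arithmetic, which can be sketched as follows. This is a minimal illustration, not part of the bot: the per-analysis prices are the approximate numbers quoted in this guide (not authoritative OpenAI pricing), and the function names `monthly_cost` and `hourly_cap_worst_case` are hypothetical helpers introduced only for this example.

```python
# Rough monthly cost estimate for Chaibot analyses.
# Prices are the approximate per-analysis figures quoted in this guide;
# they are assumptions, not authoritative OpenAI pricing.
COST_PER_ANALYSIS_USD = {
    "gpt-4": 0.03,
    "gpt-3.5-turbo": 0.003,
}


def monthly_cost(model: str, analyses_per_day: int, days: int = 30) -> float:
    """Estimated monthly spend in USD, rounded to cents."""
    return round(COST_PER_ANALYSIS_USD[model] * analyses_per_day * days, 2)


def hourly_cap_worst_case(model: str, max_analyses_per_hour: int, days: int = 30) -> float:
    """Upper bound on spend if the hourly rate limit were saturated around the clock."""
    return round(COST_PER_ANALYSIS_USD[model] * max_analyses_per_hour * 24 * days, 2)


if __name__ == "__main__":
    for model in COST_PER_ANALYSIS_USD:
        print(model, monthly_cost(model, analyses_per_day=100))
    # max_analyses_per_hour: 100 is the value shown in triage-config.yaml
    print("gpt-4 worst case:", hourly_cap_worst_case("gpt-4", max_analyses_per_hour=100))
```

At 100 analyses per day this reproduces the guide's estimates of about $90 (GPT-4) and $9 (GPT-3.5-turbo) per month. A fully saturated limit of 100 analyses per hour would be 24 times that, so the hourly cap bounds spend but does not by itself hold it at the daily estimate.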
+ +## Support + +- **Questions**: #forum-ocp-testplatform +- **ci-chat-bot team**: #forum-ocp-crt +- **Full docs**: docs/chaibot-test-failure-triage.md +- **Issues**: https://github.com/openshift/ci-tools/issues + +## Important Note + +⚠️ This configuration requires **code implementation** in the ci-tools repo (openshift/ci-tools cmd/ci-chat-bot) to function. The configs are ready, but the bot logic needs development to: + +1. Parse triage-config.yaml +2. Listen to Slack events +3. Fetch job logs from GCS +4. Call OpenAI API +5. Format and post responses + +See `docs/chaibot-test-failure-triage.md` for implementation details. diff --git a/DEPLOY_CHAIBOT.md b/DEPLOY_CHAIBOT.md new file mode 100644 index 0000000000000..e98e79beb907c --- /dev/null +++ b/DEPLOY_CHAIBOT.md @@ -0,0 +1,406 @@ +# Chaibot Deployment Guide + +## Status + +✅ Configuration files created +✅ ci-chat-bot deployment updated with Chaibot volumes and mounts +⚠️ Requires: Slack channel ID and OpenAI API key +⚠️ Requires: Code implementation in openshift/ci-tools + +## What's Ready + +All configuration and deployment files are prepared: + +``` +✓ core-services/ci-chat-bot/triage-config.yaml +✓ clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml +✓ clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml +✓ clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml (UPDATED) +✓ docs/chaibot-test-failure-triage.md +``` + +## Prerequisites + +### 1. Get Slack Channel ID for #opp-discussion + +In Slack: +1. Right-click `#opp-discussion` channel +2. Select "View channel details" +3. Scroll down in the About section +4. Copy the Channel ID (format: `C` followed by alphanumeric, e.g., `C01234ABCD`) + +### 2. Get OpenAI API Key + +Option A - New Key: +1. Go to https://platform.openai.com/api-keys +2. Create new secret key +3. Copy the key (starts with `sk-`) +4. **Important**: Save it securely - you can't view it again + +Option B - Use Existing: +If your organization already has a key in Vault, confirm the path. + +### 3. 
Verify Cluster Access + +```bash +# Login to app.ci cluster +oc login https://api.ci.l2s4.p1.openshiftapps.com:6443 + +# Verify access to ci namespace +oc get pods -n ci +``` + +## Deployment Steps + +### Step 1: Update Slack Channel ID + +```bash +# Edit the ConfigMap with the actual channel ID +vi clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml + +# Find this line (around line 12): +# channel_id: "REPLACE_WITH_CHANNEL_ID" +# Replace with actual ID: +# channel_id: "C01234ABCD" # Your actual channel ID +``` + +### Step 2: Create Secret for OpenAI API Key + +**Option A: Via `oc` (for testing/dev)** + +```bash +# Create the secret directly +oc create secret generic ci-chat-bot-chaibot-secrets \ + --from-literal=openai-api-key="sk-YOUR-ACTUAL-KEY-HERE" \ + -n ci \ + --dry-run=client -o yaml | oc apply -f - + +# Verify +oc get secret ci-chat-bot-chaibot-secrets -n ci +``` + +**Option B: Via ci-secret-bootstrap (for production)** + +```bash +# 1. Store the key in Vault (ask DPTP team for path) + +# 2. Add to core-services/ci-secret-bootstrap/_config.yaml: +- from: + openai-api-key: + path: + to: + - cluster: app.ci + namespace: ci + name: ci-chat-bot-chaibot-secrets + +# 3. Submit PR to openshift/release +# 4. 
After merge, ci-secret-bootstrap will sync the secret +``` + +### Step 3: Apply ConfigMap + +```bash +# Apply the Chaibot configuration ConfigMap +oc apply -f clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml + +# Verify +oc get configmap ci-chat-bot-triage-config -n ci -o yaml +``` + +### Step 4: Apply Prometheus Alerts + +```bash +# Extract and apply just the PrometheusRule from the patch file +cat > /tmp/chaibot-alerts.yaml << 'EOF' +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: chaibot-alerts + namespace: ci +spec: + groups: + - name: chaibot + interval: 30s + rules: + - alert: ChaibotHighErrorRate + expr: | + rate(chaibot_api_errors_total[5m]) > 0.1 + for: 10m + labels: + severity: warning + team: test-platform + annotations: + summary: "Chaibot experiencing high error rate" + description: "Chaibot has {{ $value }} errors per second over the last 5 minutes." + + - alert: ChaibotAnalysisTimeout + expr: | + histogram_quantile(0.95, rate(chaibot_analysis_duration_seconds_bucket[5m])) > 120 + for: 15m + labels: + severity: warning + team: test-platform + annotations: + summary: "Chaibot analysis taking too long" + description: "95th percentile analysis duration is {{ $value }}s, exceeding 120s timeout." + + - alert: ChaibotDown + expr: | + up{job="ci-chat-bot"} == 0 + for: 5m + labels: + severity: critical + team: test-platform + annotations: + summary: "Chaibot service is down" + description: "ci-chat-bot service (including Chaibot) has been down for 5 minutes." 
+EOF + +oc apply -f /tmp/chaibot-alerts.yaml + +# Verify +oc get prometheusrule chaibot-alerts -n ci +``` + +### Step 5: Deploy Updated ci-chat-bot + +```bash +# The deployment YAML has already been updated with: +# - Chaibot volumes (triage-config, chaibot-secrets) +# - Volume mounts in bot container +# - Environment variables (CHAIBOT_ENABLED, OPENAI_API_KEY) +# - Command args (--enable-triage, --triage-config-path) + +# Review the changes +git diff clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml + +# Apply the updated deployment +oc apply -f clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml + +# Watch the rollout (this will restart the pod) +oc rollout status deployment/ci-chat-bot -n ci --timeout=5m +``` + +### Step 6: Verify Deployment + +```bash +# Check pod status +oc get pods -n ci -l app=ci-chat-bot + +# Check logs for Chaibot initialization +oc logs -n ci deployment/ci-chat-bot -c bot --tail=100 | grep -i chaibot + +# Should see something like: +# INFO: Chaibot triage enabled +# INFO: Monitoring channels: [opp-discussion] +# INFO: AI provider: openai (model: gpt-4) +``` + +### Step 7: Test Functionality + +**Method 1: Post a test message in Slack** + +In `#opp-discussion`: +``` +Test failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/12345/... +``` + +Wait 30-60 seconds for Chaibot to respond in thread. 
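Before posting a test message, it can help to check whether its text would match the `failure_detection` rules in `triage-config.yaml`. The sketch below is a hypothetical illustration only: the bot logic is not implemented yet, the function name `would_trigger` is invented for this example, and plain case-insensitive substring matching is an assumption about how the patterns and keywords would be applied.

```python
# Hypothetical sketch of the failure_detection rule in triage-config.yaml.
# This is NOT the ci-chat-bot implementation; substring matching is an assumption.
PROW_JOB_PATTERNS = [
    "https://prow.ci.openshift.org/view/gs/",
    "https://prow.ci.openshift.org/?pr=",
    "https://deck-internal-ci.apps.ci.l2s4.p1.openshiftapps.com/",
]
FAILURE_KEYWORDS = ["test failed", "job failed", "failure", "flaky", "regression"]


def would_trigger(message: str, require_job_url: bool = False) -> bool:
    """Return True if the message contains a Prow job URL, or (when a URL
    is not required) any of the configured failure keywords."""
    if any(pattern in message for pattern in PROW_JOB_PATTERNS):
        return True
    if require_job_url:
        return False
    lowered = message.lower()
    return any(keyword in lowered for keyword in FAILURE_KEYWORDS)


if __name__ == "__main__":
    sample = "Test failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/12345/..."
    print(would_trigger(sample))
```

With `require_job_url: false`, as in the committed config, a message containing only a broad keyword such as "failure" would also match, which is worth keeping in mind when choosing a test message.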
+ +**Method 2: Check metrics** + +```bash +# Port-forward to access metrics +oc port-forward -n ci deployment/ci-chat-bot 9090:9090 & + +# Query metrics +curl http://localhost:9090/metrics | grep chaibot + +# Look for: +# chaibot_messages_processed_total +# chaibot_failures_detected_total +# chaibot_analyses_completed_total +``` + +**Method 3: Monitor logs** + +```bash +# Follow logs for Chaibot activity +oc logs -n ci deployment/ci-chat-bot -c bot -f | grep -i "chaibot\|triage" +``` + +## Troubleshooting + +### Pod won't start + +```bash +# Check events +oc describe pod -n ci -l app=ci-chat-bot + +# Common issues: +# - Missing secret: ci-chat-bot-chaibot-secrets +# - Missing configmap: ci-chat-bot-triage-config +# - Invalid YAML syntax in configmap +``` + +### Chaibot not responding in Slack + +```bash +# 1. Check if feature is enabled +oc exec -n ci deployment/ci-chat-bot -c bot -- env | grep CHAIBOT_ENABLED +# Should output: CHAIBOT_ENABLED=true + +# 2. Check config is mounted +oc exec -n ci deployment/ci-chat-bot -c bot -- cat /etc/triage-config/triage-config.yaml + +# 3. Check for errors in logs +oc logs -n ci deployment/ci-chat-bot -c bot --tail=200 | grep -i error + +# 4. Verify channel ID is correct +oc get configmap ci-chat-bot-triage-config -n ci -o jsonpath='{.data.triage-config\.yaml}' | grep channel_id +``` + +### OpenAI API errors + +```bash +# Check if API key is set +oc exec -n ci deployment/ci-chat-bot -c bot -- env | grep OPENAI_API_KEY +# Should show: OPENAI_API_KEY=sk-... 
+ +# Check rate limits +curl http://localhost:9090/metrics | grep chaibot_api_errors_total + +# Common issues: +# - Invalid API key +# - Rate limit exceeded +# - No credits remaining in OpenAI account +``` + +### Wrong Slack channel + +```bash +# Update the channel ID in ConfigMap +oc edit configmap ci-chat-bot-triage-config -n ci + +# Find the channel_id line and update it +# Save and exit + +# Restart the deployment to pick up changes +oc rollout restart deployment/ci-chat-bot -n ci +``` + +## Important Notes + +### 1. Code Implementation Required + +⚠️ **CRITICAL**: This deployment assumes the ci-chat-bot binary in the container already has Chaibot support. The code needs to be implemented in the `openshift/ci-tools` repository (`cmd/ci-chat-bot`). + +If the code doesn't exist yet, the bot will start but ignore the `--enable-triage` flag and related configs. + +To check if Chaibot code exists: +```bash +# Check the container image source +# Look in https://github.com/openshift/ci-tools/tree/master/cmd/ci-chat-bot +# Search for "triage" or "chaibot" functionality +``` + +### 2. Slack App Permissions + +Ensure the ci-chat-bot Slack app has these OAuth scopes: +- `channels:history` - Read channel messages +- `channels:read` - View channel info +- `chat:write` - Post messages +- `files:read` - Access logs +- `reactions:write` - Add reactions + +And subscribed to these events: +- `message.channels` +- `app_mention` + +Check/update at: https://api.slack.com/apps (find ci-chat-bot app) + +### 3. 
Cost Management + +Monitor OpenAI API usage to control costs: + +```bash +# Check number of analyses +curl http://localhost:9090/metrics | grep chaibot_analyses_completed_total + +# At $0.03 per analysis (GPT-4): +# 100/day = $3/day = ~$90/month +# +# To reduce costs: +# - Use GPT-3.5-turbo (~$0.003/analysis) +# - Adjust rate_limiting in config +# - Increase cooldown_seconds +``` + +Edit ConfigMap to switch models: +```yaml +analysis: + model: "gpt-3.5-turbo" # Change from "gpt-4" +``` + +### 4. Production Readiness Checklist + +Before enabling in production: + +- [ ] OpenAI API key stored in Vault (not hardcoded) +- [ ] Correct Slack channel ID configured +- [ ] Slack app permissions verified +- [ ] PrometheusRules deployed and alerting configured +- [ ] Grafana dashboard created +- [ ] Rate limits tuned appropriately +- [ ] Cost monitoring set up +- [ ] Team trained on Chaibot usage +- [ ] Runbook created for oncall +- [ ] Code implementation verified in ci-tools + +## Rollback + +If you need to disable Chaibot: + +```bash +# Method 1: Disable via environment variable +oc set env deployment/ci-chat-bot CHAIBOT_ENABLED=false -n ci + +# Method 2: Remove volumes and mounts +# Revert clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml to previous version +git checkout HEAD~1 -- clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml +oc apply -f clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml + +# Method 3: Delete ConfigMap (caution: the volume is not marked optional, so new pods will stay in ContainerCreating until it is restored) +oc delete configmap ci-chat-bot-triage-config -n ci +``` + +## Next Steps + +After successful deployment: + +1. **Monitor initial performance** + - Watch metrics and logs for 24-48 hours + - Collect feedback from #opp-discussion users + +2. **Tune configuration** + - Adjust confidence thresholds based on accuracy + - Add/remove failure patterns + - Optimize AI prompts + +3. **Expand coverage** + - Add more monitored channels + - Create team-specific configurations + - Integrate with retester for auto-retry + +4. 
**Documentation** + - Update team wiki with usage examples + - Create runbook for DPTP oncall + - Add to CI documentation site + +## Support + +- **Documentation**: `docs/chaibot-test-failure-triage.md` +- **Quick Reference**: `core-services/ci-chat-bot/CHAIBOT.md` +- **Questions**: #forum-ocp-testplatform +- **ci-chat-bot team**: #forum-ocp-crt +- **Issues**: https://github.com/openshift/ci-tools/issues diff --git a/OWNERS_ALIASES b/OWNERS_ALIASES index 5f21b57be933f..c361db28d8531 100644 --- a/OWNERS_ALIASES +++ b/OWNERS_ALIASES @@ -332,6 +332,7 @@ aliases: - danalanerh - sg-rh - etirta + - chaclark1974 openstack-k8s-operators-approvers: - abays - arxcruz diff --git a/clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml b/clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml new file mode 100644 index 0000000000000..c86dc8a4f7dda --- /dev/null +++ b/clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml @@ -0,0 +1,135 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: ci-chat-bot-triage-config + namespace: ci +data: + triage-config.yaml: | + # Chaibot Test Failure Triage Configuration + enabled: true + + monitored_channels: + - name: "opp-discussion" + channel_id: "C04TMLC6DRV" + auto_respond: true + response_mode: "thread" + + failure_detection: + prow_job_patterns: + - "https://prow.ci.openshift.org/view/gs/" + - "https://prow.ci.openshift.org/?pr=" + - "https://deck-internal-ci.apps.ci.l2s4.p1.openshiftapps.com/" + + failure_keywords: + - "test failed" + - "job failed" + - "failure" + - "flaky" + - "regression" + + require_job_url: false + + analysis: + timeout: 120 + ai_provider: "openai" + model: "gpt-4" + + analyze_components: + - job_metadata + - failure_logs + - historical_data + - infrastructure + - known_issues + + failure_categories: + infrastructure: + patterns: + - "InsufficientInstanceCapacity" + - "RequestLimitExceeded" + - "could not create instance" + - "timeout waiting for" + confidence_threshold: 0.7 + + flaky_test: + patterns: + - "race condition" 
+ - "intermittent" + - "timeout.*eventually" + confidence_threshold: 0.6 + + product_bug: + patterns: + - "panic:" + - "nil pointer" + - "assertion failed" + confidence_threshold: 0.8 + + configuration: + patterns: + - "missing environment" + - "invalid configuration" + - "secret.*not found" + confidence_threshold: 0.75 + + response: + include_sections: + - summary + - root_cause + - evidence + - historical + - recommendations + - related_issues + + use_emojis: true + emoji_map: + infrastructure: ":cloud:" + flaky_test: ":game_die:" + product_bug: ":bug:" + configuration: ":wrench:" + unknown: ":question:" + + include_actions: + - label: "View Full Logs" + action: "open_url" + - label: "Mark Flaky" + action: "mark_flaky" + + integrations: + sippy: + enabled: true + base_url: "https://sippy.dptools.openshift.org" + lookback_days: 7 + min_occurrences: 2 + + jira: + enabled: true + endpoint: "https://redhat.atlassian.net" + search_projects: + - "OCPBUGS" + - "DPTP" + max_results: 5 + + prow: + enabled: true + gcs_bucket: "gs://origin-ci-test" + max_log_size_mb: 50 + fetch_artifacts: + - "build-log.txt" + - "junit*.xml" + + ai_api: + enabled: true + secret_name: "chaibot-openai-key" + secret_namespace: "ci" + rate_limit_rpm: 50 + + rate_limiting: + max_analyses_per_hour: 100 + max_analyses_per_user_per_hour: 10 + max_concurrent_analyses: 5 + cooldown_seconds: 30 + + monitoring: + metrics_enabled: true + metrics_port: 9090 + log_level: "info" diff --git a/clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml b/clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml new file mode 100644 index 0000000000000..2b5202b6abd85 --- /dev/null +++ b/clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml @@ -0,0 +1,109 @@ +# This file contains the necessary additions to ci-chat-bot deployment +# to enable Chaibot test failure triage functionality +# +# Apply these changes to clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml + +--- +# Additional volume for triage config +# Add 
to spec.template.spec.volumes in the Deployment: + +# - name: triage-config +# configMap: +# name: ci-chat-bot-triage-config + +# - name: chaibot-secrets +# secret: +# secretName: ci-chat-bot-chaibot-secrets +# items: +# - key: openai-api-key +# path: openai-api-key + +--- +# Additional volumeMount for the bot container +# Add to spec.template.spec.containers[name=bot].volumeMounts: + +# - name: triage-config +# mountPath: /etc/triage-config +# readOnly: true + +# - name: chaibot-secrets +# mountPath: /etc/chaibot-secrets +# readOnly: true + +--- +# Additional command-line arguments +# Add to spec.template.spec.containers[name=bot].args: + +# --enable-triage=true \ +# --triage-config-path=/etc/triage-config/triage-config.yaml \ + +--- +# Additional environment variables +# Add to spec.template.spec.containers[name=bot].env: + +# - name: CHAIBOT_ENABLED +# value: "true" +# - name: OPENAI_API_KEY +# valueFrom: +# secretKeyRef: +# name: ci-chat-bot-chaibot-secrets +# key: openai-api-key + +--- +apiVersion: v1 +kind: Secret +metadata: + name: ci-chat-bot-chaibot-secrets + namespace: ci +type: Opaque +stringData: + # Replace with actual OpenAI API key + # This should be managed via ci-secret-bootstrap + openai-api-key: "REPLACE_WITH_ACTUAL_KEY" +--- +# ServiceMonitor update to include new metrics +# The existing ServiceMonitor already scrapes port 9090, +# so chaibot metrics will be automatically collected +apiVersion: monitoring.coreos.com/v1 +kind: PrometheusRule +metadata: + name: chaibot-alerts + namespace: ci +spec: + groups: + - name: chaibot + interval: 30s + rules: + - alert: ChaibotHighErrorRate + expr: | + rate(chaibot_api_errors_total[5m]) > 0.1 + for: 10m + labels: + severity: warning + team: test-platform + annotations: + summary: "Chaibot experiencing high error rate" + description: "Chaibot has {{ $value }} errors per second over the last 5 minutes." 
+ runbook_url: "https://github.com/openshift/release/blob/main/docs/dptp-triage-sop/chaibot.md" + + - alert: ChaibotAnalysisTimeout + expr: | + histogram_quantile(0.95, rate(chaibot_analysis_duration_seconds_bucket[5m])) > 120 + for: 15m + labels: + severity: warning + team: test-platform + annotations: + summary: "Chaibot analysis taking too long" + description: "95th percentile analysis duration is {{ $value }}s, exceeding 120s timeout." + + - alert: ChaibotDown + expr: | + up{job="ci-chat-bot"} == 0 + for: 5m + labels: + severity: critical + team: test-platform + annotations: + summary: "Chaibot service is down" + description: "ci-chat-bot service (including Chaibot) has been down for 5 minutes." diff --git a/clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml b/clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml index f0564d67a3a69..70ac802160464 100644 --- a/clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml +++ b/clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml @@ -284,6 +284,15 @@ spec: path: rosa-admin-ocm - name: runtimedir emptyDir: { } + - name: triage-config + configMap: + name: ci-chat-bot-triage-config + - name: chaibot-secrets + secret: + secretName: ci-chat-bot-chaibot-secrets + items: + - key: openai-api-key + path: openai-api-key initContainers: - name: git-sync-init command: @@ -371,6 +380,12 @@ spec: readOnly: true - mountPath: /runtimedir name: runtimedir + - name: triage-config + mountPath: /etc/triage-config + readOnly: true + - name: chaibot-secrets + mountPath: /etc/chaibot-secrets + readOnly: true env: - name: BOT_TOKEN valueFrom: @@ -410,6 +425,13 @@ spec: value: us-east-1 - name: XDG_RUNTIME_DIR value: /runtimedir + - name: CHAIBOT_ENABLED + value: "true" + - name: OPENAI_API_KEY + valueFrom: + secretKeyRef: + name: ci-chat-bot-chaibot-secrets + key: openai-api-key command: ["/bin/sh"] args: - -c @@ -432,4 +454,6 @@ spec: --rosa-cluster-limit=30 \\ --rosa-subnetlist-path=/etc/subnet-ids/rosa-subnet-ids \\ 
--rosa-oidcConfigId-path=/etc/oidc-config-id/rosa-oidc-config-id \\ - --rosa-billingAccount-path=/etc/billing-account-id/rosa-billing-account-id + --rosa-billingAccount-path=/etc/billing-account-id/rosa-billing-account-id \\ + --enable-triage=true \\ + --triage-config-path=/etc/triage-config/triage-config.yaml diff --git a/core-services/ci-chat-bot/CHAIBOT.md b/core-services/ci-chat-bot/CHAIBOT.md new file mode 100644 index 0000000000000..4d890d9a1b0e6 --- /dev/null +++ b/core-services/ci-chat-bot/CHAIBOT.md @@ -0,0 +1,259 @@ +# Chaibot Test Failure Triage Extension + +This directory contains configuration for the **Chaibot** test failure triage feature, an AI-powered extension to ci-chat-bot. + +## What is Chaibot? + +Chaibot automatically monitors Slack channels (like `#opp-discussion`) for test failure messages, analyzes the failures using AI, and posts detailed triage analysis in threads. + +## Files + +- `triage-config.yaml` - Main configuration for Chaibot (source of truth) +- `workflows-config.yaml` - Cluster provisioning workflows (existing ci-chat-bot config) + +## Quick Start + +### 1. Prerequisites + +- OpenAI API key (stored in ci-secret-bootstrap) +- Slack channel ID for #opp-discussion +- ci-chat-bot deployment with Chaibot support (requires ci-tools update) + +### 2. Configuration + +The `triage-config.yaml` file is mounted as a ConfigMap in the ci-chat-bot deployment. + +To enable Chaibot: + +```yaml +enabled: true + +monitored_channels: + - name: "opp-discussion" + channel_id: "YOUR_CHANNEL_ID" # Get from Slack + auto_respond: true +``` + +### 3. Get Slack Channel ID + +In Slack: +1. Right-click the `#opp-discussion` channel +2. Select "View channel details" +3. Copy the Channel ID from the About section +4. Update `channel_id` in `triage-config.yaml` + +### 4. 
Deploy + +The configuration is automatically deployed when you: + +```bash +# Update from this directory +make update + +# Apply ConfigMap (done automatically by postsubmit) +oc apply -f ../../clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml +``` + +## How It Works + +1. **Detection**: Monitors configured Slack channels for messages containing: + - Prow job URLs + - Failure keywords ("test failed", "job failed", etc.) + +2. **Analysis**: When a failure is detected: + - Fetches job logs from GCS + - Analyzes with AI (OpenAI GPT-4) + - Categorizes failure (infrastructure, flaky, bug, config) + - Searches Sippy for historical patterns + - Looks up related JIRA issues + +3. **Response**: Posts analysis in thread: + - Root cause with confidence level + - Evidence from logs + - Historical context + - Actionable recommendations + +## Example + +User posts in #opp-discussion: +``` +Job failed again: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/... +``` + +Chaibot responds in thread: +``` +:cloud: Test Failure Analysis + +Root Cause: Infrastructure - AWS Capacity (85% confidence) +Analysis: Instance launch failed due to InsufficientInstanceCapacity in us-east-1c +Evidence: "Error: creating EC2 Instance: InsufficientInstanceCapacity..." 
+Historical: 8 similar failures in last 24h (transient issue) +Recommendation: Retest - likely to succeed + +Classification: Transient Infrastructure Issue +``` + +## Configuration Options + +### Monitored Channels + +Add or remove channels: + +```yaml +monitored_channels: + - name: "opp-discussion" + channel_id: "C01234567" + auto_respond: true # Auto-analyze or require @mention + response_mode: "thread" # thread, channel, or dm +``` + +### Analysis Settings + +Adjust AI provider and timeout: + +```yaml +analysis: + timeout: 120 # seconds + ai_provider: "openai" # openai or anthropic + model: "gpt-4" # or gpt-3.5-turbo for lower cost +``` + +### Failure Categories + +Customize or add categories: + +```yaml +failure_categories: + infrastructure: + patterns: + - "InsufficientInstanceCapacity" + - "RequestLimitExceeded" + confidence_threshold: 0.7 +``` + +### Rate Limiting + +Prevent abuse: + +```yaml +rate_limiting: + max_analyses_per_hour: 100 + max_analyses_per_user_per_hour: 10 + cooldown_seconds: 30 # Min time between analyses for same job +``` + +## Integrations + +### Sippy + +Enabled by default, shows historical failure patterns: + +```yaml +integrations: + sippy: + enabled: true + base_url: "https://sippy.dptools.openshift.org" + lookback_days: 7 +``` + +### JIRA + +Searches for related issues: + +```yaml +integrations: + jira: + enabled: true + endpoint: "https://redhat.atlassian.net" + search_projects: ["OCPBUGS", "DPTP"] +``` + +### OpenAI + +AI analysis requires API key: + +```yaml +integrations: + ai_api: + enabled: true + secret_name: "chaibot-openai-key" + secret_namespace: "ci" + rate_limit_rpm: 50 +``` + +## Monitoring + +Metrics exposed on port 9090 (same as ci-chat-bot): + +- `chaibot_messages_processed_total` +- `chaibot_failures_detected_total` +- `chaibot_analyses_completed_total` +- `chaibot_analysis_duration_seconds` +- `chaibot_api_errors_total` + +Alerts configured in `clusters/app.ci/ci-chat-bot/chaibot-deployment-patch.yaml` + +## 
Troubleshooting + +### Chaibot not responding + +```bash +# Check if enabled +oc get configmap ci-chat-bot-triage-config -n ci -o yaml | grep enabled + +# Check logs +oc logs -n ci deployment/ci-chat-bot -c bot | grep -i chaibot + +# Verify secrets exist +oc get secret ci-chat-bot-chaibot-secrets -n ci +``` + +### Analysis timeout + +- Check `max_log_size_mb` - reduce if logs are too large +- Increase `analysis.timeout` value +- Check OpenAI API status + +### Wrong analysis + +- Review and tune `failure_categories` patterns +- Adjust `confidence_threshold` values +- Update AI prompts in `ai_prompts` section + +## Cost Management + +OpenAI API costs (approximate): +- GPT-4: ~$0.03 per analysis +- GPT-3.5-turbo: ~$0.003 per analysis + +At 100 analyses/day: +- GPT-4: ~$90/month +- GPT-3.5-turbo: ~$9/month + +Control costs with: +- Rate limiting +- Cooldown periods +- Switching to GPT-3.5-turbo + +## Development + +To add new features: + +1. Update `triage-config.yaml` schema +2. Implement in [openshift/ci-tools](https://github.com/openshift/ci-tools) cmd/ci-chat-bot +3. Add tests +4. 
Update this documentation + +## Support + +- Questions: `#forum-ocp-testplatform` +- ci-chat-bot team: `#forum-ocp-crt` +- Issues: https://github.com/openshift/ci-tools/issues +- Docs: https://docs.ci.openshift.org/tools/chaibot/ + +## Related + +- [ci-chat-bot README](README.md) - Cluster provisioning workflows +- [Chaibot Full Documentation](../../docs/chaibot-test-failure-triage.md) +- [Sippy](https://sippy.dptools.openshift.org/) - Test analysis platform +- [ci-tools](https://github.com/openshift/ci-tools) - Source code diff --git a/core-services/ci-chat-bot/triage-config.yaml b/core-services/ci-chat-bot/triage-config.yaml new file mode 100644 index 0000000000000..584fabcb7d29d --- /dev/null +++ b/core-services/ci-chat-bot/triage-config.yaml @@ -0,0 +1,205 @@ +# Chaibot Test Failure Triage Configuration +# This config enables ci-chat-bot to monitor Slack channels for test failures +# and provide automated triage analysis + +# Feature flag to enable/disable triage functionality +enabled: true + +# Slack channels to monitor for test failures +monitored_channels: + - name: "opp-discussion" + channel_id: "C01234567" # Replace with actual channel ID + auto_respond: true + response_mode: "thread" # Options: thread, channel, dm + + # Additional channels can be added + # - name: "forum-ocp-testplatform" + # channel_id: "C98765432" + # auto_respond: false # Require @mention to trigger + +# Patterns to detect test failure messages +failure_detection: + # URL patterns that indicate Prow job failures + prow_job_patterns: + - "https://prow.ci.openshift.org/view/gs/" + - "https://prow.ci.openshift.org/?pr=" + - "https://deck-internal-ci.apps.ci.l2s4.p1.openshiftapps.com/" + + # Keywords that indicate test failures + failure_keywords: + - "test failed" + - "job failed" + - "failure" + - "test timeout" + - "flaky test" + - "regression" + - "broken test" + + # Message must contain job URL OR (keyword + context) + require_job_url: false + +# Analysis configuration +analysis: + # 
Maximum time to spend analyzing a single failure (seconds) + timeout: 120 + + # AI provider configuration + ai_provider: "openai" # Options: openai, anthropic, none + model: "gpt-4" + + # What to analyze + analyze_components: + - job_metadata # Job name, duration, timestamp + - failure_logs # Pod logs, junit output + - historical_data # Sippy integration for past failures + - infrastructure # Cloud provider issues, cluster state + - known_issues # JIRA search for similar failures + + # Categorization rules + failure_categories: + infrastructure: + patterns: + - "InsufficientInstanceCapacity" + - "RequestLimitExceeded" + - "could not create instance" + - "timeout waiting for" + - "connection refused" + confidence_threshold: 0.7 + + flaky_test: + patterns: + - "race condition" + - "intermittent" + - "sometimes fails" + - "timeout.*eventually" + confidence_threshold: 0.6 + + product_bug: + patterns: + - "panic:" + - "nil pointer" + - "assertion failed" + - "unexpected error" + confidence_threshold: 0.8 + + configuration: + patterns: + - "missing environment" + - "invalid configuration" + - "could not find image" + - "secret.*not found" + confidence_threshold: 0.75 + +# Response formatting +response: + # Template for Slack message response + include_sections: + - summary # Brief one-line summary + - root_cause # Identified root cause with confidence + - evidence # Key log excerpts and patterns + - historical # Similar past failures from Sippy + - recommendations # Suggested actions + - related_issues # JIRA issues or documentation + + # Emoji indicators for quick visual parsing + use_emojis: true + emoji_map: + infrastructure: ":cloud:" + flaky_test: ":game_die:" + product_bug: ":bug:" + configuration: ":wrench:" + unknown: ":question:" + + # Add interactive buttons + include_actions: + - label: "View Full Logs" + action: "open_url" + - label: "Retest" + action: "trigger_retest" + - label: "Create JIRA" + action: "create_issue" + - label: "Mark Flaky" + action: 
"mark_flaky" + +# Integration settings +integrations: + # Sippy integration for historical failure data + sippy: + enabled: true + base_url: "https://sippy.dptools.openshift.org" + lookback_days: 7 + min_occurrences: 2 # Minimum failures to show pattern + + # JIRA integration for known issues + jira: + enabled: true + endpoint: "https://redhat.atlassian.net" + search_projects: + - "OCPBUGS" + - "DPTP" + max_results: 5 + + # Prow/GCS access for log fetching + prow: + enabled: true + gcs_bucket: "gs://origin-ci-test" + max_log_size_mb: 50 + fetch_artifacts: + - "build-log.txt" + - "junit*.xml" + - "e2e-events*.json" + + # OpenAI/Anthropic API + ai_api: + enabled: true + secret_name: "chaibot-openai-key" # Kubernetes secret + secret_namespace: "ci" + rate_limit_rpm: 50 # Requests per minute + +# Rate limiting and abuse prevention +rate_limiting: + max_analyses_per_hour: 100 + max_analyses_per_user_per_hour: 10 + max_concurrent_analyses: 5 + cooldown_seconds: 30 # Min time between analyses for same job + +# Monitoring and observability +monitoring: + metrics_enabled: true + metrics_port: 9090 + log_level: "info" # Options: debug, info, warn, error + + # Prometheus metrics to export + metrics: + - chaibot_messages_processed_total + - chaibot_failures_detected_total + - chaibot_analyses_completed_total + - chaibot_analysis_duration_seconds + - chaibot_api_errors_total + - chaibot_category_detections_total + +# Prompt template for AI analysis +ai_prompts: + system_prompt: | + You are Chaibot, an expert CI/CD test failure analyst for OpenShift. + Analyze test failures and provide concise, actionable triage information. + Focus on root cause identification and practical next steps. + Categorize failures as: infrastructure, flaky_test, product_bug, or configuration. + Be confident but acknowledge uncertainty when appropriate. 
+ + analysis_prompt: | + Analyze this OpenShift CI test failure: + + Job: {job_name} + Status: {status} + Duration: {duration} + + Error Logs: + {error_excerpt} + + Provide: + 1. Root Cause (category + confidence %) + 2. Brief Analysis (2-3 sentences) + 3. Key Evidence (specific log excerpts) + 4. Recommendations (numbered action items) + 5. Classification (transient vs persistent issue) diff --git a/core-services/ci-secret-bootstrap/_config.yaml b/core-services/ci-secret-bootstrap/_config.yaml index fcdf93992551f..d0a44ade19371 100644 --- a/core-services/ci-secret-bootstrap/_config.yaml +++ b/core-services/ci-secret-bootstrap/_config.yaml @@ -4218,6 +4218,14 @@ secret_configs: - cluster: core-ci name: pj-rehearse namespace: ci +- from: + openai-api-key: + field: openai-api-key + path: selfservice/cspi-qe/chaibot-openai-key + to: + - cluster: app.ci + name: ci-chat-bot-chaibot-secrets + namespace: ci - from: sa.ci-chat-bot.build01.config: field: sa.ci-chat-bot.build01.config diff --git a/core-services/ci-secret-bootstrap/chaibot-secret-config.yaml b/core-services/ci-secret-bootstrap/chaibot-secret-config.yaml new file mode 100644 index 0000000000000..4433c1658ac5c --- /dev/null +++ b/core-services/ci-secret-bootstrap/chaibot-secret-config.yaml @@ -0,0 +1,44 @@ +# Secret configuration for Chaibot +# This file should be added to ci-secret-bootstrap configuration +# to manage the OpenAI API key and Slack channel ID securely + +# Add this entry to core-services/ci-secret-bootstrap/_config.yaml: + +# - from: +# openai-api-key: +# field: openai-api-key +# path: selfservice/cspi-qe/chaibot-openai-key +# to: +# - cluster: app.ci +# namespace: ci +# name: ci-chat-bot-chaibot-secrets + +--- +# Instructions for setting up secrets: +# +# 1. OpenAI API Key: +# - Obtain API key from OpenAI dashboard (https://platform.openai.com/api-keys) +# - Store in Vault at the appropriate path +# - Reference: https://docs.ci.openshift.org/docs/how-tos/adding-a-new-secret-to-ci/ +# +# 2. 
Slack Channel ID for #opp-discussion: +# - In Slack, right-click the #opp-discussion channel +# - Select "View channel details" +# - Copy the Channel ID from the bottom of the modal +# - Update the channel_id in the triage-config.yaml ConfigMap +# +# 3. Alternative AI Providers (optional): +# - For Anthropic Claude: Store ANTHROPIC_API_KEY +# - Update triage-config.yaml ai_provider to "anthropic" +# +# 4. Required Slack App Permissions: +# The ci-chat-bot-slack-app secret should include these OAuth scopes: +# - channels:history (read messages in public channels) +# - channels:read (view basic channel info) +# - chat:write (post messages) +# - files:read (access uploaded failure logs) +# - reactions:write (add emoji reactions to indicate processing) +# +# 5. Slack Event Subscriptions: +# Subscribe to these events in the Slack App configuration: +# - message.channels (receive messages from monitored channels) +# - app_mention (respond when @chaibot is mentioned) diff --git a/docs/chaibot-test-failure-triage.md b/docs/chaibot-test-failure-triage.md new file mode 100644 index 0000000000000..a6a3eaf01dffe --- /dev/null +++ b/docs/chaibot-test-failure-triage.md @@ -0,0 +1,433 @@ +# Chaibot - Automated Test Failure Triage for Slack + +## Overview + +Chaibot is an AI-powered extension to the ci-chat-bot service that automatically triages and analyzes OpenShift CI test failures posted in Slack channels. It provides intelligent root cause analysis and actionable recommendations directly in Slack threads. 
+ +## Features + +### Automatic Detection +- Monitors configured Slack channels (e.g., `#opp-discussion`) +- Detects Prow job failure messages and URLs +- Identifies test failure keywords and patterns + +### Intelligent Analysis +- Fetches job logs and artifacts from GCS +- Analyzes failure patterns using AI (OpenAI GPT-4 or Anthropic Claude) +- Categorizes failures into: + - **Infrastructure Issues**: Cloud provider capacity, networking, timeouts + - **Flaky Tests**: Race conditions, intermittent failures + - **Product Bugs**: Panics, assertion failures, regressions + - **Configuration Issues**: Missing secrets, invalid config + +### Historical Context +- Integrates with Sippy to show historical failure patterns +- Searches JIRA for related known issues +- Identifies if this is a new or recurring failure + +### Actionable Output +- Posts analysis in Slack threads to keep channels clean +- Provides confidence levels for root cause identification +- Suggests specific next steps (retest, file bug, investigate) +- Includes direct links to logs, Sippy, and related issues + +## Example Output + +When a test failure is posted in `#opp-discussion`: + +``` +User: The e2e-aws job is failing again 😞 +https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/12345/pull-ci-openshift-installer-master-e2e-aws/678901 + +--- + +Chaibot [BOT]: :mag: Analyzing failure... (typically takes 30-60 seconds) + +[30 seconds later] + +Chaibot [BOT]: :cloud: **Test Failure Analysis** + +**Job:** `pull-ci-openshift-installer-master-e2e-aws` +**Status:** Failed after 2h 15m +**Root Cause:** Infrastructure - AWS EC2 Capacity (Confidence: 85%) + +**Analysis:** +Test failed during cluster provisioning when attempting to launch EC2 instances in us-east-1c. +AWS returned "InsufficientInstanceCapacity" error after multiple retry attempts over 45 minutes. 
+ +**Evidence:** +    ``` +    Error: creating EC2 Instance (i-0a1b2c3d4e5f): InsufficientInstanceCapacity +    status code: 500, request id: xyz-123 +    ``` + +**Historical Pattern:** +This failure has occurred 8 times in the last 24 hours across multiple jobs, all in us-east-1c AZ. +Sippy shows this as a known transient infrastructure issue. + +**Recommendations:** +1. ✅ **Retest** - This is a transient AWS issue, likely to succeed on retry +2. 📊 Check AWS Service Health Dashboard for us-east-1 incidents +3. 🔔 If failures persist >6 hours, escalate to infrastructure team + +**Related:** +- +- : Similar AWS capacity issues + +**Classification:** Transient Infrastructure (Not a product bug) + +[Buttons: View Logs | Retest | Mark as Known Issue] +``` + +## Setup and Configuration + +### Prerequisites + +1. **OpenAI API Key** (or Anthropic API Key) + - Required for AI-powered analysis + - Store securely via ci-secret-bootstrap + +2. **Slack Channel ID** + - Get the channel ID for `#opp-discussion` + - Update it in the ConfigMap configuration + +3. **ci-chat-bot Deployment** + - Chaibot runs as part of the ci-chat-bot service + - Requires a deployment update to enable + +### Installation Steps + +#### 1. Configure Secrets + +Add the OpenAI API key via ci-secret-bootstrap: + +```bash +# Edit core-services/ci-secret-bootstrap/_config.yaml +# Add entry for chaibot-openai-key pointing to the Vault path +``` + +#### 2. Get Slack Channel ID + +```bash +# In Slack: +# Right-click #opp-discussion → View channel details → Copy Channel ID +# Update clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml +``` + +#### 3. Deploy Configuration + +```bash +# Create the ConfigMap +oc apply -f clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml + +# Create the secrets +# (Managed via ci-secret-bootstrap after PR merge) + +# Update ci-chat-bot deployment +# Edit clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml +# Add volumes, volumeMounts, and env vars from chaibot-deployment-patch.yaml +``` + +#### 4. 
Update ci-chat-bot Deployment + +Apply the changes from `chaibot-deployment-patch.yaml`: + +```yaml +# Add to volumes: +- name: triage-config + configMap: + name: ci-chat-bot-triage-config + +- name: chaibot-secrets + secret: + secretName: ci-chat-bot-chaibot-secrets + +# Add to volumeMounts (bot container): +- name: triage-config + mountPath: /etc/triage-config + readOnly: true + +- name: chaibot-secrets + mountPath: /etc/chaibot-secrets + readOnly: true + +# Add to env (bot container): +- name: CHAIBOT_ENABLED + value: "true" +- name: OPENAI_API_KEY + valueFrom: + secretKeyRef: + name: ci-chat-bot-chaibot-secrets + key: openai-api-key + +# Add to args (bot container): +--enable-triage=true \ +--triage-config-path=/etc/triage-config/triage-config.yaml \ +``` + +#### 5. Configure Slack App Permissions + +Ensure the ci-chat-bot Slack app has these OAuth scopes: + +- `channels:history` - Read messages in public channels +- `channels:read` - View channel information +- `chat:write` - Post messages and replies +- `files:read` - Access uploaded logs +- `reactions:write` - Add reactions to indicate processing + +Subscribe to these events: +- `message.channels` - Receive channel messages +- `app_mention` - Respond to @chaibot mentions + +#### 6. Deploy and Verify + +```bash +# Apply changes +make update +oc apply -f clusters/app.ci/ci-chat-bot/ci-chat-bot.yaml + +# Watch deployment +oc rollout status deployment/ci-chat-bot -n ci + +# Check logs +oc logs -f deployment/ci-chat-bot -n ci -c bot | grep -i chaibot + +# Test in Slack +# Post a test failure message in #opp-discussion with a Prow URL +``` + +## Usage + +### Automatic Triggering + +Chaibot automatically responds to messages in monitored channels that contain: +- Prow job URLs (`https://prow.ci.openshift.org/view/gs/...`) +- Failure keywords + context + +### Manual Triggering + +Mention `@chaibot analyze` with a job URL: + +``` +@chaibot analyze https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/... 
+``` + +### Response Modes + +**Thread Mode (default):** +- Posts analysis in a thread reply +- Keeps channels clean and organized + +**Reaction Mode:** +- Adds 👀 emoji when processing starts +- Adds ✅ when complete, ❌ if failed + +## Configuration + +### Adding Channels + +Edit `clusters/app.ci/ci-chat-bot/chaibot-configmap.yaml`: + +```yaml +monitored_channels: + - name: "opp-discussion" + channel_id: "C01234567" + auto_respond: true + response_mode: "thread" + + - name: "forum-ocp-testplatform" # Add new channel + channel_id: "C98765432" + auto_respond: false # Require @mention + response_mode: "thread" +``` + +### Adjusting Analysis + +**Timeout:** +```yaml +analysis: + timeout: 120 # seconds +``` + +**AI Model:** +```yaml +analysis: + ai_provider: "openai" # or "anthropic" + model: "gpt-4" # or "claude-3-opus-20240229" +``` + +**Failure Categories:** +```yaml +analysis: + failure_categories: + custom_category: + patterns: + - "specific error pattern" + - "another pattern" + confidence_threshold: 0.75 +``` + +### Rate Limiting + +```yaml +rate_limiting: + max_analyses_per_hour: 100 + max_analyses_per_user_per_hour: 10 + cooldown_seconds: 30 +``` + +## Monitoring + +### Metrics + +Chaibot exposes Prometheus metrics on port 9090: + +- `chaibot_messages_processed_total` - Messages evaluated +- `chaibot_failures_detected_total` - Failures identified +- `chaibot_analyses_completed_total` - Analyses finished +- `chaibot_analysis_duration_seconds` - Analysis latency +- `chaibot_api_errors_total` - API errors (Slack, OpenAI, etc.) +- `chaibot_category_detections_total{category="..."}` - Failure categories + +### Alerts + +PrometheusRules are configured for: +- High error rate (>10% over 5 minutes) +- Analysis timeouts (>120 seconds) +- Service down + +View alerts: https://prometheus.ci.openshift.org/ + +### Dashboards + +Grafana dashboard: https://grafana.ci.openshift.org/d/chaibot/ + +## Troubleshooting + +### Chaibot Not Responding + +1. 
**Check service status:** + ```bash + oc get pods -n ci -l app=ci-chat-bot + oc logs -n ci -l app=ci-chat-bot -c bot --tail=100 | grep chaibot + ``` + +2. **Verify configuration:** + ```bash + oc get configmap ci-chat-bot-triage-config -n ci -o yaml + ``` + +3. **Check secrets:** + ```bash + oc get secret ci-chat-bot-chaibot-secrets -n ci + ``` + +4. **Review metrics:** + ```bash + curl http://ci-chat-bot.ci.svc:9090/metrics | grep chaibot + ``` + +### Analysis Timeout + +If analyses are timing out: +- Check the `chaibot_analysis_duration_seconds` metric +- Increase `analysis.timeout` in the config +- Reduce `max_log_size_mb` if log fetching is slow +- Check OpenAI API rate limits + +### Inaccurate Analysis + +- Review AI prompts in `triage-config.yaml` +- Adjust confidence thresholds for categories +- Add more specific patterns to failure categories +- Consider switching AI models or providers + +### Rate Limiting Issues + +- Check `chaibot_api_errors_total{reason="rate_limit"}` +- Increase OpenAI rate limits +- Adjust `rate_limiting.max_analyses_per_hour` + +## Cost Considerations + +### OpenAI API Usage + +Estimated costs (GPT-4, assuming list prices of $0.03 per 1K input tokens and $0.06 per 1K output tokens; verify against current OpenAI pricing): +- ~$0.36 per analysis (8K input tokens, 2K output tokens) +- 100 analyses/day = ~$36/day = ~$1,080/month +- Adjust by configuring rate limits + +### Optimization + +- Use GPT-3.5-turbo for lower cost (at least an order of magnitude cheaper per analysis; check current pricing) +- Limit `max_log_size_mb` to reduce input tokens +- Configure cooldown to prevent duplicate analyses +- Set per-user rate limits + +## Security + +### API Keys +- Never commit API keys to git +- Use ci-secret-bootstrap and Vault +- Rotate keys regularly + +### Log Access +- Chaibot has read access to GCS buckets +- Only fetches publicly accessible job artifacts +- Does not access private/embargoed job logs + +### Slack Permissions +- Only monitors configured public channels +- Cannot read DMs or private channels +- Rate limited to prevent abuse + +## Development + +### Local Testing + +```bash +# Clone ci-tools repo +git clone 
https://github.com/openshift/ci-tools +cd ci-tools/cmd/ci-chat-bot + +# Add chaibot feature flag +# Implement triage module + +# Run locally +go run . \ + --triage-config-path=/path/to/triage-config.yaml \ + --enable-triage=true \ + --dry-run +``` + +### Adding Features + +1. Update `triage-config.yaml` schema +2. Implement in ci-tools codebase +3. Add tests +4. Update documentation +5. Submit PR to openshift/ci-tools + +## Support + +### Documentation +- This guide: https://docs.ci.openshift.org/tools/chaibot/ +- ci-chat-bot docs: https://docs.ci.openshift.org/architecture/ci-chat-bot/ + +### Slack Channels +- `#forum-ocp-testplatform` - Ask questions +- `#forum-ocp-crt` - ci-chat-bot team + +### Issues +- Report bugs: https://github.com/openshift/ci-tools/issues +- Feature requests: Same, label with `chaibot` + +## Roadmap + +Planned features: +- [ ] Multi-turn conversation for deep analysis +- [ ] Automatic JIRA ticket creation for bugs +- [ ] Integration with retester for auto-retry +- [ ] Flaky test database population +- [ ] Weekly failure summary reports +- [ ] Support for analyzing multiple jobs in one request +- [ ] Custom analysis templates per team
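
As an illustration of the "Automatic Triggering" rule (a Prow job URL, or a failure keyword when `require_job_url` is `false`), here is a minimal Python sketch. It is not the ci-chat-bot implementation, which lives in the Go ci-tools repository. The pattern and keyword lists are copied from the `failure_detection` section of `triage-config.yaml`; the function names are hypothetical, and because the config does not define what "context" means in "keyword + context", this sketch assumes a keyword match alone is sufficient.

```python
"""Hypothetical sketch of the failure_detection decision in triage-config.yaml."""

# Copied from failure_detection.prow_job_patterns.
PROW_JOB_PATTERNS = (
    "https://prow.ci.openshift.org/view/gs/",
    "https://prow.ci.openshift.org/?pr=",
    "https://deck-internal-ci.apps.ci.l2s4.p1.openshiftapps.com/",
)

# Copied from failure_detection.failure_keywords.
FAILURE_KEYWORDS = (
    "test failed",
    "job failed",
    "failure",
    "test timeout",
    "flaky test",
    "regression",
    "broken test",
)


def has_job_url(text: str) -> bool:
    """Return True if the message contains any configured Prow URL prefix."""
    return any(pattern in text for pattern in PROW_JOB_PATTERNS)


def has_failure_keyword(text: str) -> bool:
    """Return True if the message contains any keyword, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in FAILURE_KEYWORDS)


def should_analyze(text: str, require_job_url: bool = False) -> bool:
    """Decide whether a Slack message should trigger an analysis.

    A job URL always triggers. Without a URL, a keyword match triggers
    only when require_job_url is False (assumption: no extra "context"
    check, since the config leaves that term undefined).
    """
    if has_job_url(text):
        return True
    if require_job_url:
        return False
    return has_failure_keyword(text)


if __name__ == "__main__":
    for message in ("Flaky test in e2e-aws again", "good morning"):
        print(message, "->", should_analyze(message))
```

Plain substring matching keeps the rule easy to audit, but broad keywords such as `failure` will match many unrelated messages, so setting `require_job_url: true` or `auto_respond: false` is the safer choice for busy channels.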