# 🔍 Bot Intelligence Degradation - Root Cause Analysis

**Date:** 2026-02-05
**Tested Model:** gemini-3-pro-high (via local proxy)
**Alternative Model Tested:** gemini-2.0-flash-exp (404 NOT FOUND)

---

## 📊 Executive Summary

**The LLM is COMPLETELY BROKEN** - not degraded, but **non-functional** for tool calling.

- **Test Result:** 0/9 passed (0.0% success rate)
- **Root Cause:** Model `gemini-3-pro-high` returns **empty responses** when presented with tools
- **Impact:** ALL tool-dependent functionality fails
- **Current Workaround:** SAFETY OVERRIDE (217 lines of keyword matching) compensates by **forcing** tool calls through pattern matching

---

## 🧪 Experimental Evidence

### Test Script: `experiments/test_llm_intelligence.py`

Direct LLM test WITHOUT dispatcher overhead or keyword matching.

#### Results for `gemini-3-pro-high`:

| Test Case | Query | Expected Tool | Result |
|-----------|-------|--------------|--------|
| daily_summary | 今天的健康数据怎么样？ | get_daily_detailed_stats | ❌ FAIL (no tools, empty text) |
| yesterday_sleep | 昨晚睡眠怎么样 | get_daily_detailed_stats | ❌ FAIL (no tools, empty text) |
| hrv_trend | 过去60天的hrv变化 | get_metric_history | ❌ FAIL (no tools, empty text) |
| food_simple | 晚上吃了白切鸡、花菜、红烧肉和猪血 | log_diet | ❌ FAIL (no tools, empty text) |
| confirmation | 好的，可以记录 (with context) | log_diet | ❌ FAIL (no tools, empty text) |
| sync | 同步一下garmin数据 | sync_garmin | ❌ FAIL (no tools, empty text) |
| activity_analysis | 今早椭圆机运动请深入分析 | get_activity_history | ❌ FAIL (no tools, empty text) |
| causal_analysis | 喝酒对我的睡眠有什么影响？ | analyze_driver | ❌ FAIL (no tools, empty text) |
| web_search | 搜索一下最新的NAD+研究 | search_web | ❌ FAIL (no tools, empty text) |

**Conclusion:** The model receives tool schemas but **NEVER calls tools** and returns empty text.

#### Results for `gemini-2.0-flash-exp`:

```
Error code: 404 - {'error': {'code': 404, 'message': 'Requested entity was not found.', 'status': 'NOT_FOUND'}}
```

**Conclusion:** The proxy at `http://127.0.0.1:8045` does NOT support this model.
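The failing scenario can be reproduced with a minimal standalone probe. This is a sketch assuming the same OpenAI-compatible proxy and env vars the bot uses; `build_request` and `send_probe` are hypothetical names, not part of the test script:

```python
# Minimal reproduction of the tool-calling probe, independent of the bot's
# dispatcher. The tool schema mirrors the search_web example in the payload
# analysis. send_probe() performs the real network call and is meant to be
# run manually against the proxy.
SEARCH_WEB_TOOL = {
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web for real-time information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def build_request(model: str, user_message: str) -> dict:
    """Assemble the chat-completion payload the bot would send."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [SEARCH_WEB_TOOL],
        "tool_choice": "auto",
    }

def send_probe(model: str = "gemini-3-pro-high") -> None:
    """Network call: run manually to reproduce the empty-response failure."""
    import os
    from openai import OpenAI  # the bot already uses this client

    client = OpenAI(api_key=os.environ["GEMINI_API_KEY"],
                    base_url=os.environ["GEMINI_BASE_URL"] + "/v1")
    resp = client.chat.completions.create(
        **build_request(model, "搜索一下最新的NAD+研究"))
    msg = resp.choices[0].message
    print("tool_calls:", msg.tool_calls, "| text:", repr(msg.content))
```

A healthy model should populate `msg.tool_calls`; the observed failure is `tool_calls=None` with empty `content`.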

---

## 🔧 Technical Analysis

### 1. Request Payload (from `debug_failed_payload.json`)

The payload is **structurally correct**:
- ✅ System prompt is well-formed
- ✅ Tool schemas use standard OpenAI format
- ✅ User message is clear in Chinese
- ✅ `tool_choice: "auto"` is set correctly

**Example Tool Schema:**
```json
{
  "type": "function",
  "function": {
    "name": "search_web",
    "description": "Search the web for real-time information...",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "Search query (e.g. 'latest study on NAD+ effects')"
        }
      },
      "required": ["query"]
    }
  }
}
```
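Before blaming the model, the schemas can be checked mechanically against the OpenAI function-tool shape. The validator below is a hypothetical helper, not part of the codebase:

```python
# Hypothetical sanity check: verify a tool entry follows the OpenAI
# function-tool shape before concluding the model is ignoring it.
def validate_tool_schema(tool: dict) -> list[str]:
    """Return a list of problems found in one tool entry (empty = OK)."""
    problems = []
    if tool.get("type") != "function":
        problems.append("type must be 'function'")
    fn = tool.get("function", {})
    if not fn.get("name"):
        problems.append("function.name is missing")
    params = fn.get("parameters", {})
    if params.get("type") != "object":
        problems.append("parameters.type must be 'object'")
    for req in params.get("required", []):
        if req not in params.get("properties", {}):
            problems.append(f"required field '{req}' not in properties")
    return problems
```

Running this over every entry in `debug_failed_payload.json` would confirm the "structurally correct" claim above with code rather than eyeballing.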

### 2. LLM Implementation (`slack_bot/llm/gemini.py`)

The code path:
- Uses the `OpenAI` client for proxy compatibility (Lines 98-128)
- Sends requests to `http://127.0.0.1:8045/v1` (Line 128)
- Parses tool calls from `response.choices[0].message.tool_calls` (Lines 296-308)

**Problem:** No errors logged, but `message.tool_calls` is consistently **None or empty**.
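One way to make this failure mode visible instead of silent is to classify each response explicitly. This helper is a hypothetical sketch, not the actual parsing code in `gemini.py`:

```python
# Hypothetical helper: label what the model actually returned, instead of
# silently treating tool_calls=None as "the model chose not to use tools".
def classify_response(message) -> str:
    """Label an OpenAI-style message: 'tool_call', 'text', or 'empty'."""
    if getattr(message, "tool_calls", None):
        return "tool_call"
    if (getattr(message, "content", None) or "").strip():
        return "text"
    return "empty"  # the pathological case observed in all 9 tests
```

Logging the label per request would have distinguished "model declined to call a tool" from "model returned nothing at all" from day one.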

### 3. Proxy Configuration

```env
GEMINI_API_KEY=sk-457cbcd2e0a4467e90db1af0ae65748e
GEMINI_BASE_URL=http://127.0.0.1:8045
GEMINI_MODEL=gemini-3-pro-high
```

**Issue:** Model `gemini-3-pro-high` does NOT exist in the official Gemini model catalog (as of Jan 2025).

**Valid Gemini Models (1.5 generation):**
- gemini-1.5-flash
- gemini-1.5-pro
- gemini-1.5-pro-exp-0827

**Valid Gemini Models (2.0 generation):**
- gemini-2.0-flash-exp (experimental, not found on proxy)

**Hypothesis:** The proxy accepts unknown model names but does not route tool calls properly. It may be:
1. Stripping tool schemas before forwarding to Google
2. Falling back to a model that doesn't support function calling
3. Pointing at a misconfigured API endpoint
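These hypotheses can be narrowed by asking the proxy which models it actually advertises via the standard OpenAI `/v1/models` listing. Function names here are illustrative, not part of the bot's codebase:

```python
# Sketch: compare the configured model against what the endpoint advertises.
def find_missing_models(available_ids, wanted):
    """Return the wanted model IDs that the endpoint does not advertise."""
    have = set(available_ids)
    return [m for m in wanted if m not in have]

def probe_proxy_models(wanted):
    """Network call: run manually against the proxy at GEMINI_BASE_URL."""
    import os
    from openai import OpenAI

    client = OpenAI(api_key=os.environ["GEMINI_API_KEY"],
                    base_url=os.environ["GEMINI_BASE_URL"] + "/v1")
    ids = [m.id for m in client.models.list().data]
    return find_missing_models(ids, wanted)
```

If `probe_proxy_models(["gemini-3-pro-high", "gemini-1.5-flash"])` reports `gemini-3-pro-high` missing while chat requests against it still "succeed" with empty output, the proxy is silently accepting unknown model names.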

---

## 💡 Why SAFETY OVERRIDE "Works"

The 217-line SAFETY OVERRIDE (dispatcher.py Lines 97-313) bypasses the LLM entirely:

```python
# Example: Food logging fallback
food_items = ["鸡", "肉", "菜", "蛋", "饭", ...]
meal_indicators = ["晚餐", "午餐", "早餐", ...]

has_ate_pattern = any(k in lower_msg for k in ["吃了", "ate", ...])
has_food_item = any(k in lower_msg for k in food_items)

if (has_ate_pattern and has_food_item):
    # FORCE tool call
    tool_calls = [{
        "name": "log_diet",
        "args": {"description": original_message, ...}
    }]
```

**It's not using the LLM for reasoning - it's a hardcoded decision tree.**

This explains why:
- The user perceived degraded intelligence (the LLM never actually made decisions)
- Simple queries work (keyword matching covers common patterns)
- Edge cases keep failing (new patterns need new keywords)

---

## 🎯 Recommended Actions

### ✅ IMMEDIATE (High Priority)

1. **Fix Model Configuration:**
   - Replace `gemini-3-pro-high` → `gemini-1.5-flash` or `gemini-1.5-pro`
   - Verify proxy supports these models
   - Test tool calling with updated model

2. **Test Alternatives:**
   ```bash
   # Try Gemini 1.5 Flash (fast, cheap, reliable)
   GEMINI_MODEL=gemini-1.5-flash python experiments/test_llm_intelligence.py

   # Try Gemini 1.5 Pro (smarter, slower)
   GEMINI_MODEL=gemini-1.5-pro python experiments/test_llm_intelligence.py
   ```

3. **If Proxy Fails:**
   - Switch to direct Google API (remove `GEMINI_BASE_URL`)
   - Or switch provider (e.g., OpenRouter with verified Gemini models)
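The proxy-to-direct switch can be sketched as a small config helper. The direct endpoint below is Google's documented OpenAI-compatible endpoint; the helper name is hypothetical:

```python
import os

# Hypothetical config helper: when GEMINI_BASE_URL is unset, fall back to
# Google's OpenAI-compatible endpoint instead of the local proxy.
DIRECT_ENDPOINT = "https://generativelanguage.googleapis.com/v1beta/openai/"

def client_kwargs(env=os.environ) -> dict:
    """Build OpenAI-client kwargs from the bot's existing env vars."""
    base = env.get("GEMINI_BASE_URL", "").rstrip("/")
    return {
        "api_key": env["GEMINI_API_KEY"],
        "base_url": (base + "/v1") if base else DIRECT_ENDPOINT,
    }
```

With this shape, removing `GEMINI_BASE_URL` from `.env` is the only change needed to bypass the proxy.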

### ⚠️ SECONDARY (After Model Fix)

4. **Re-run Intelligence Tests:**
   - If new model achieves >80% success → REMOVE SAFETY OVERRIDE
   - If 50-80% → Simplify SAFETY OVERRIDE to minimal fallbacks
   - If <50% → Investigate system prompt over-guidance

5. **Cleanup:**
   - Remove 217 lines of keyword matching hell
   - Trust LLM to do its job
   - Add monitoring/logging for tool call failures
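The monitoring item above could be as small as a counter that alerts on consecutive empty responses, so a broken model is caught by a log alert instead of a user complaint. Class and method names are illustrative:

```python
import logging

# Sketch of the suggested monitoring: track consecutive empty LLM responses
# and raise an error-level log once a threshold is crossed.
class ToolCallMonitor:
    def __init__(self, alert_after: int = 5):
        self.alert_after = alert_after
        self.consecutive_empty = 0

    def record(self, tool_calls, text: str) -> None:
        """Call once per LLM response with its tool_calls and text content."""
        if tool_calls or (text or "").strip():
            self.consecutive_empty = 0  # any real output resets the streak
            return
        self.consecutive_empty += 1
        if self.consecutive_empty >= self.alert_after:
            logging.error(
                "LLM returned %d consecutive empty responses; "
                "suspect model/proxy misconfiguration",
                self.consecutive_empty)
```

Had something like this been in place, the `gemini-3-pro-high` failure would have surfaced as an alert rather than as "the bot feels dumber".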

---

## 📝 Conclusion

**The "intelligence degradation" was never a degradation.**

The LLM was NEVER smart because:
1. The model (`gemini-3-pro-high`) is broken/non-existent for tool calling
2. SAFETY OVERRIDE was a **band-aid** that masked the underlying API failure
3. Each new edge case required new keywords because the **LLM never worked**

**Action:** Fix model configuration → Validate tool calling → Remove keyword hell.

---

## 🧪 Appendix: Test Commands

```bash
# Run direct LLM intelligence test
python experiments/test_llm_intelligence.py

# Test with specific model
python experiments/test_llm_intelligence.py --model gemini-1.5-flash

# Run full dispatcher comparison
python experiments/compare_dispatchers.py
```

**Test Results Location:**
- Raw logs: `logs/slack_bot.log`
- Failed payload: `debug_failed_payload.json`
- Test scripts: `experiments/`
