# HW2: Multi-Agent Search Report

## Part 1: Minimax Search (10%)

### Implementation

The `MinimaxAgent` implements the minimax algorithm for adversarial search with multiple ghosts.
```python
# Inside MinimaxAgent.getAction(self, gameState):
def minimax(s, d, agent):
    # Terminal state or search depth exhausted: evaluate the state.
    if s.isWin() or s.isLose() or d == 0:
        return self.evaluationFunction(s), None

    nxt = (agent + 1) % s.getNumAgents()
    d2 = d - 1 if nxt == 0 else d   # decrement depth only when the turn wraps back to Pacman
    actions = s.getLegalActions(agent)

    if agent == 0:   # Pacman: maximize
        best = (float('-inf'), None)
        for a in actions:
            v = minimax(s.getNextState(agent, a), d2, nxt)[0]
            if v > best[0]:
                best = (v, a)
        return best
    else:            # Ghosts: minimize
        best = (float('inf'), None)
        for a in actions:
            v = minimax(s.getNextState(agent, a), d2, nxt)[0]
            if v < best[0]:
                best = (v, a)
        return best

return minimax(gameState, self.depth, 0)[1]
```
**Key Design Points:**

1. **Base Case**: Returns the evaluation function value when reaching a terminal state (win/lose) or when the remaining depth is 0.
2. **Agent Cycling**: Agents take turns in order (Pacman = 0, then ghosts 1, 2, ...). The depth is only decremented after all agents have moved, i.e., when the turn wraps back to Pacman (see the sketch below).
3. **Maximizing (Pacman)**: Selects the action with the highest value.
4. **Minimizing (Ghosts)**: Each ghost selects the action with the lowest value, assuming worst-case adversarial behavior.
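
To make design point 2 concrete, the short sketch below traces how the agent index cycles and when the depth counter drops. It is illustrative only: `num_agents = 3` (Pacman plus two ghosts) and `depth = 2` are hypothetical values, not taken from any particular layout.

```python
# Trace the (agent, remaining_depth) turn order using the same cycling rule as minimax above.
num_agents = 3   # hypothetical: Pacman + 2 ghosts
depth = 2        # hypothetical search depth

agent, d, turns = 0, depth, []
while d > 0:
    turns.append((agent, d))
    nxt = (agent + 1) % num_agents
    d = d - 1 if nxt == 0 else d   # depth drops only when the turn returns to Pacman
    agent = nxt

print(turns)
# [(0, 2), (1, 2), (2, 2), (0, 1), (1, 1), (2, 1)]  -> one ply = one full round of agents
```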
### Question 1

**Q: In the initial state of the `minimaxClassic` layout, the Minimax evaluation values are a stable 9, 8, and 7 at search depths 1, 2, and 3, respectively. However, why does the evaluation value suddenly plummet to a dismal -492 at depth 4? What specific event is the algorithm foreseeing at depth 4 that causes this drastic drop?**

**A:** At depth 4, the minimax algorithm can look far enough ahead to foresee that the ghost will inevitably catch Pacman. The `minimaxClassic` layout has a narrow corridor structure in which Pacman cannot escape the ghost's pursuit.

At shallower depths (1-3), the algorithm only sees immediate food-collection opportunities, yielding positive scores (9, 8, 7). At depth 4, however, the search tree extends deep enough to reveal that:

1. The ghost can cut off Pacman's escape routes.
2. Under optimal ghost play (which minimax assumes), Pacman will be caught.
3. The -492 score reflects the death penalty (-500), offset by food collected before dying and reduced slightly by the per-move time penalty.

The algorithm is essentially "seeing its own death" at depth 4: it realizes that no matter what action Pacman takes, an optimally playing ghost will catch it within the next few moves.

---
## Part 2: Alpha-Beta Pruning (10%)

### Implementation

The `AlphaBetaAgent` extends minimax with alpha-beta pruning to eliminate branches that cannot affect the final decision.
```python
# Inside AlphaBetaAgent.getAction(self, gameState):
def ab(s, d, agent, a, b):
    # Terminal state or search depth exhausted: evaluate the state.
    if s.isWin() or s.isLose() or d == 0:
        return self.evaluationFunction(s), None

    nxt = (agent + 1) % s.getNumAgents()
    d2 = d - 1 if nxt == 0 else d   # decrement depth only when the turn wraps back to Pacman
    actions = s.getLegalActions(agent)

    if agent == 0:   # Pacman: maximize with alpha-beta
        best = (float('-inf'), None)
        for act in actions:
            v = ab(s.getNextState(agent, act), d2, nxt, a, b)[0]
            if v > best[0]:
                best = (v, act)
            if best[0] > b:          # prune: MIN already has a better alternative above
                return best
            a = max(a, best[0])
        return best
    else:            # Ghosts: minimize with alpha-beta
        best = (float('inf'), None)
        for act in actions:
            v = ab(s.getNextState(agent, act), d2, nxt, a, b)[0]
            if v < best[0]:
                best = (v, act)
            if best[0] < a:          # prune: MAX already has a better alternative above
                return best
            b = min(b, best[0])
        return best

return ab(gameState, self.depth, 0, float('-inf'), float('inf'))[1]
```
The algorithm maintains alpha (MAX's best guarantee) and beta (MIN's best guarantee). When MAX finds a value exceeding beta, or MIN finds a value below alpha, the remaining branches are pruned.
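
As an illustration (not part of the submitted agent), the self-contained sketch below runs plain minimax and alpha-beta on the same hand-built two-ply tree and counts how many leaves each evaluates. The tree values and the helper names `minimax_root` and `alphabeta_root` are made up for this example; alpha-beta returns the same root value while skipping leaves that cannot change the decision.

```python
# Hypothetical two-ply tree: MAX chooses among three MIN nodes, each with three leaf values.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]

def minimax_root(tree):
    evaluated = 0
    best = float('-inf')
    for child in tree:
        value = min(child)           # MIN node evaluates every leaf
        evaluated += len(child)
        best = max(best, value)
    return best, evaluated

def alphabeta_root(tree):
    evaluated = 0
    alpha = float('-inf')            # best value MAX can guarantee so far
    for child in tree:
        value = float('inf')
        for leaf in child:           # MIN node
            evaluated += 1
            value = min(value, leaf)
            if value < alpha:        # MAX already has something better: prune the rest
                break
        alpha = max(alpha, value)
    return alpha, evaluated

print(minimax_root(tree))    # (3, 9)  -> all 9 leaves evaluated
print(alphabeta_root(tree))  # (3, 7)  -> two leaves pruned, same root value
```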
### Question 2

**Q: Explain the differences between Alpha-Beta pruning and standard Minimax when performing a depth 3 search, and discuss why these differences occur.**

Running both agents on `minimaxClassic` with depth 3 for 100 games:

- MinimaxAgent: Win Rate ~20%, Average Score ~-293
- AlphaBetaAgent: Win Rate ~40%, Average Score ~-92

(The gap in observed win rates does not come from different root decisions; it most likely reflects the randomness of the ghosts' actual in-game behavior across 100 games and tie-breaking among equally valued actions.)

**Key Differences:**

1. **Same Optimal Decision**: Both algorithms produce the same final action at the root, because alpha-beta pruning never eliminates branches that could affect the optimal decision.

2. **Fewer Node Evaluations**: Alpha-beta significantly reduces the number of game states explored by pruning branches that cannot influence the result. When a MAX node finds a value exceeding beta (MIN's best alternative), or when a MIN node finds a value below alpha (MAX's best alternative), the remaining branches are pruned.

3. **Performance Advantage**: The reduced search space allows alpha-beta to run faster, which matters for real-time games. In the best case (perfect move ordering), alpha-beta can effectively double the search depth reachable in the same computation time.

4. **Why the Difference Occurs**: Alpha-beta pruning works because:
   - If a MAX node's value already exceeds beta, the MIN player above it would never allow play to reach this node, so its remaining children can be skipped
   - If a MIN node's value already falls below alpha, the MAX player above it would never allow play to reach this node, so its remaining children can be skipped
   - Skipping these branches therefore cannot change the final decision

---
## Part 3: Expectimax Search (10%)

### Implementation

The `ExpectimaxAgent` models ghosts as choosing actions uniformly at random rather than optimally.
```python
# Inside ExpectimaxAgent.getAction(self, gameState):
def expmax(s, d, agent):
    # Terminal state or search depth exhausted: evaluate the state.
    if s.isWin() or s.isLose() or d == 0:
        return self.evaluationFunction(s), None

    nxt = (agent + 1) % s.getNumAgents()
    d2 = d - 1 if nxt == 0 else d   # decrement depth only when the turn wraps back to Pacman
    actions = s.getLegalActions(agent)

    if agent == 0:   # Pacman: maximize
        best = (float('-inf'), None)
        for a in actions:
            v = expmax(s.getNextState(agent, a), d2, nxt)[0]
            if v > best[0]:
                best = (v, a)
        return best
    else:            # Ghosts: chance node, uniform expectation over legal actions
        vals = [expmax(s.getNextState(agent, a), d2, nxt)[0] for a in actions]
        return sum(vals) / len(vals), None

return expmax(gameState, self.depth, 0)[1]
```
Pacman maximizes as usual, but ghosts are modeled as chance nodes that average over all legal actions uniformly.
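
To show the difference this makes, the sketch below evaluates the same ghost node once as a MIN node and once as a chance node. The leaf values are hypothetical, chosen only to illustrate the contrast.

```python
# Hypothetical values of Pacman's position after each of a ghost's three legal moves.
outcomes = [-500.0, 80.0, 120.0]   # one move catches Pacman, the other two do not

min_value = min(outcomes)                        # minimax: assume the worst ghost move
expected_value = sum(outcomes) / len(outcomes)   # expectimax: uniform average

print(min_value)       # -500.0 -> the branch looks hopeless under optimal ghost play
print(expected_value)  # -100.0 -> the branch looks survivable if the ghost moves randomly
```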
### Question 3

**Q: Why does the `ExpectimaxAgent` achieve an approximate 50% win rate on the `trappedClassic` layout, while the `AlphaBetaAgent` consistently loses 100% of the time under the exact same conditions?**

Running on `trappedClassic` with depth 3:

- AlphaBetaAgent: Win Rate 0/10 (0%)
- ExpectimaxAgent: Win Rate 2/10 (20%)

**Explanation:**

The `trappedClassic` layout places Pacman in a position where it appears trapped: there is seemingly no escape from the ghost.

**AlphaBetaAgent (0% win rate):**

- Assumes ghosts play optimally (worst case)
- From the starting position, optimal ghost play leads to certain death
- Since all paths look equally fatal under optimal ghost play, Pacman effectively gives up; it may even head toward the nearest ghost to end the game quickly and avoid accumulating the per-move penalty
- The agent essentially "knows it is doomed" and makes no attempt to exploit suboptimal ghost behavior

**ExpectimaxAgent (~20-50% win rate):**

- Models ghosts as random, not optimal
- Recognizes there is a *chance* the ghost will make a mistake
- Pacman takes calculated risks, hoping the ghost chooses poorly
- If the ghost randomly moves away from the optimal pursuit path, Pacman can escape
- The actual ghost behavior has randomness, so expectimax's model is more accurate

**Key Insight:** When opponents are suboptimal (random), minimax/alpha-beta is overly pessimistic. Expectimax matches the opponent model to reality: if ghosts act randomly, modeling them as random produces better decisions.
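
The same point can be made numerically. The toy calculation below uses hypothetical payoffs (not measured from the game) for a "give up" action versus a risky escape attempt against a ghost that picks between two moves uniformly at random.

```python
# Hypothetical payoffs: giving up ends the game quickly with a sure loss,
# while the risky move wins only if the (random) ghost happens to step the wrong way.
give_up = -501.0                      # certain loss
escape_outcomes = [532.0, -502.0]     # win if the ghost blunders, loss otherwise

expected_escape = sum(escape_outcomes) / len(escape_outcomes)

print(expected_escape)                # 15.0  > -501.0: expectimax tries to escape
print(min(escape_outcomes))           # -502.0 < -501.0: minimax prefers giving up
```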
---

## Part 4: Evaluation Function (10%)

### Implementation

The `betterEvaluationFunction` evaluates game states using multiple weighted features.
```python
from util import manhattanDistance

def betterEvaluationFunction(currentGameState):
    pos = currentGameState.getPacmanPosition()
    foodList = currentGameState.getFood().asList()
    ghosts = currentGameState.getGhostStates()

    # Start from the actual game score, then add shaped feature terms.
    score = currentGameState.getScore()
    score -= 10 * len(foodList)                         # penalty for food still on the board
    score -= 20 * len(currentGameState.getCapsules())   # penalty for uneaten capsules

    if foodList:
        # Reciprocal distance to the nearest food pulls Pacman toward it.
        score += 1.0 / min(manhattanDistance(pos, f) for f in foodList)

    for g in ghosts:
        d = manhattanDistance(pos, g.getPosition())
        if g.scaredTimer > 0:
            score += 200 / (d + 1)      # chase scared ghosts
        else:
            score -= 10000 / (10 ** d)  # exponentially growing danger near active ghosts

    return score
```
### Question 4

**Q: What specific factors are incorporated into your custom Evaluation Function, why are they fundamentally important for guiding the agent, and how do they collectively contribute to a win rate of more than half the time?**

Test results on `smallClassic`:

- Without better evaluation: Win Rate 3/10, Average Score 142.2
- With better evaluation: Win Rate 9/10, Average Score 1430.3

**Factors Incorporated:**

1. **Game Score** (`currentGameState.getScore()`)
   - Foundation of the evaluation
   - Directly reflects food eaten (+10 per pellet), eating scared ghosts (+200), and Pacman's death (-500)
   - Importance: Aligns the evaluation with the actual game objective

2. **Food Distance** (reciprocal of the distance to the nearest food)
   - Importance: Encourages Pacman to move toward food when not in immediate danger
   - Using a reciprocal (rather than a linear term) creates a stronger pull when food is close

3. **Food Remaining** (`-10 * len(foodList)`)
   - Penalty for each uneaten food pellet
   - Importance: Motivates Pacman to actually clear food rather than just approach it
   - The larger weight compared to the food-distance term ensures eating is prioritized

4. **Ghost Avoidance** (`-10000 / (10 ** dist)`)
   - Exponential decay: -10000 at dist=0, -1000 at dist=1, -100 at dist=2, -10 at dist=3 (see the sketch after this list)
   - Importance: Survival is critical; even a good food position is worthless if Pacman dies
   - The penalty becomes negligible at safe distances, letting the agent focus on food

5. **Scared Ghost Hunting** (`200 / (dist + 1)`)
   - Bonus for being close to scared ghosts
   - Importance: Eating a scared ghost gives +200 points, a significant score boost

6. **Capsule Consideration** (`-20 * len(capsules)`)
   - Slight penalty for uneaten capsules
   - Importance: Encourages eating capsules to create ghost-hunting opportunities
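
As a quick check on the relative weights, the sketch below prints the danger and scared-ghost terms over a range of distances. It uses the same formulas as the code above; the distance range is only illustrative.

```python
# How the ghost-related terms fall off with Manhattan distance d.
for d in range(6):
    danger = 10000 / (10 ** d)   # active-ghost penalty magnitude
    scared = 200 / (d + 1)       # scared-ghost bonus
    print(f"d={d}: danger penalty = -{danger:g}, scared bonus = +{scared:.1f}")

# d=0: danger penalty = -10000, scared bonus = +200.0
# d=1: danger penalty = -1000, scared bonus = +100.0
# d=2: danger penalty = -100, scared bonus = +66.7
# d=3: danger penalty = -10, scared bonus = +50.0
# d=4: danger penalty = -1, scared bonus = +40.0
# d=5: danger penalty = -0.1, scared bonus = +33.3
```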
**Why This Achieves >50% Win Rate:**

The evaluation function creates a hierarchy of priorities:

1. **Survive** (ghost avoidance has the highest weight)
2. **Hunt scared ghosts** (high reward when safe)
3. **Eat food** (food-remaining penalty + distance bonus)
4. **Collect capsules** (strategic advantage)

This hierarchy matches sound Pac-Man play: stay alive, capitalize on power-ups, and steadily clear the board. The reciprocal distance terms create smooth gradients that guide the search toward good positions, while the strong ghost penalties prevent reckless behavior.

---

## Question 5: AI (Mis)Alignment and Reward Hacking (4%)

**Example: LLM Sycophancy from RLHF Training**

ChatGPT and similar models are trained with reinforcement learning from human feedback (RLHF), in which human raters score responses. Researchers found that these models become "sycophantic", agreeing with users even when the users are factually wrong (e.g., responding "You make a good point..." to "2+2=5, right?").

**Why this is reward hacking:**

The designers wanted accurate, helpful assistants. But human raters tend to score agreeable responses higher than corrections, so the model learned to optimize for agreement rather than truth. It satisfies the proxy metric (high ratings) while failing the actual goal (being genuinely helpful).

This mirrors Question 3: just as the AlphaBetaAgent's assumption of optimal adversaries leads it to "give up," RLHF models' learned association between agreement and reward leads to sycophancy.

**Reference:** Sharma et al. (2023). "Towards Understanding Sycophancy in Language Models." arXiv:2310.13548.