Palace Card Game with AI

Training and Test Run of the Palace Card Game

1. State Representation and Encoding

The game state is encoded as a numerical vector of dimensionality 91 that captures the information the AI agent needs for decision-making. Specifically, the state vector comprises:

15 dimensions for each of Player 1's card zones (In Hand, Face Up, Face Down).
15 dimensions for each of Player 2's card zones (In Hand, Face Up, Face Down).
15 dimensions for the top card of the pile.
1 dimension indicating the status of special rules (e.g., Seven Rule).

The state encoding follows this structure:

State vector = [
    P1_hand_encoding[15],     // Player 1 hand cards
    P1_faceup_encoding[15],   // Player 1 face-up cards
    P1_facedown_encoding[15], // Player 1 face-down cards
    P2_hand_encoding[15],     // Player 2 hand cards
    P2_faceup_encoding[15],   // Player 2 face-up cards
    P2_facedown_encoding[15], // Player 2 face-down cards
    pile_top[1],               // Top card rank
    seven_rule_active[1]       // Active status of Seven Rule
]
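
The layout above can be sketched in plain Python as follows. This is a minimal illustration, not the script's actual `get_state`: it assumes each card has already been mapped to an integer slot in the range 0 to 14 and that each 15-wide block holds per-slot card counts. The names `encode_zone` and `build_state` and the sample slot indices are hypothetical. The sketch follows the bracketed layout literally, so its length is 6 × 15 + 1 + 1; adjust the pile encoding if the script uses a different width.

```python
from typing import List, Sequence, Tuple

ZONE_SLOTS = 15  # width of each per-zone block, taken from the layout above

Zones = Tuple[Sequence[int], Sequence[int], Sequence[int]]  # hand, face-up, face-down


def encode_zone(card_slots: Sequence[int]) -> List[float]:
    """Count how many cards fall into each slot (assumed encoding scheme)."""
    counts = [0.0] * ZONE_SLOTS
    for slot in card_slots:
        counts[slot] += 1.0
    return counts


def build_state(p1: Zones, p2: Zones, pile_top: int, seven_rule_active: bool) -> List[float]:
    """Concatenate the blocks in the order given by the layout above."""
    state: List[float] = []
    for zone in (*p1, *p2):  # P1 hand, face-up, face-down, then the same for P2
        state.extend(encode_zone(zone))
    state.append(float(pile_top))                    # top card rank
    state.append(1.0 if seven_rule_active else 0.0)  # Seven Rule flag
    return state


# Hypothetical slot indices for each zone.
p1 = ([0, 5, 11], [3, 3, 12], [1, 7, 9])
p2 = ([2, 4, 6], [8, 10, 13], [0, 1, 14])
state = build_state(p1, p2, pile_top=4, seven_rule_active=False)
print(len(state))
```

Count-based blocks keep the vector length fixed regardless of how many cards a player holds, which is what a fixed-size network input requires.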

2. Neural Network Architecture

The Deep Q-Network (DQN) employed in this implementation is a three-layer neural network that approximates the Q-value function, which estimates the expected cumulative discounted reward of taking a given action in a given state. The architecture is as follows:

Input Layer: 91 neurons corresponding to the state vector dimensions.
Hidden Layer 1: 128 neurons utilizing the Rectified Linear Unit (ReLU) activation function to introduce non-linearity.
Hidden Layer 2: 64 neurons with ReLU activation to further capture complex patterns.
Output Layer: 3 neurons representing the Q-values for each possible action (e.g., play a card from hand, face-up, or face-down).
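
To make the layer sizes concrete, the sketch below runs a forward pass through a 91 → 128 → 64 → 3 network and counts its weights and biases. Plain Python stands in for PyTorch here so the example has no dependencies; the script itself is described as using PyTorch. The weight initialization range and the helper names `init_layer` and `forward` are assumptions made for illustration.

```python
import random
from typing import List, Tuple

LAYER_SIZES = [91, 128, 64, 3]  # input, hidden 1, hidden 2, output (from the text)

Layer = Tuple[List[List[float]], List[float]]  # (weights[n_out][n_in], biases[n_out])


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small random weights and zero biases; the init scheme is an assumption."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layers: List[Layer], x: List[float]) -> List[float]:
    """Apply each affine layer, with ReLU after every layer except the last."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU
    return x


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
q_values = forward(layers, [0.0] * 91)
print(n_params, len(q_values))
```

With these sizes the network has 91·128 + 128 + 128·64 + 64 + 64·3 + 3 = 20,227 trainable parameters, and the output is one Q-value per action.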

3. Reinforcement Learning Parameters

The training process is governed by several hyperparameters critical to the efficacy and stability of the learning algorithm:

Learning Rate (α): Set at 0.001, this parameter sets the step size of each gradient update, controlling how strongly new experience shifts the network's existing Q-value estimates.
Discount Factor (γ): With a value of 0.99, this determines the importance of future rewards versus immediate rewards.
Initial Exploration Rate (ε): Initialized at 1.0 to prioritize exploration in the early stages of training.
Exploration Decay Rate: A decay rate of 0.995 gradually reduces exploration in favor of exploitation as the agent learns.
Minimum Exploration Rate: Floored at 0.01, so ε never decays below this value and a baseline level of exploration is maintained.
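
The exploration schedule implied by these three values can be worked out directly. The sketch below assumes that ε is multiplied by the decay rate once per decay step; the text does not say whether a step is an episode or a replay call, so `n_decays` is a neutral, hypothetical counter.

```python
import math

EPSILON_START = 1.0
EPSILON_DECAY = 0.995
EPSILON_MIN = 0.01


def epsilon_after(n_decays: int) -> float:
    """Exploration rate after n multiplicative decay steps, clamped at the minimum."""
    return max(EPSILON_MIN, EPSILON_START * EPSILON_DECAY ** n_decays)


# Smallest n for which 0.995 ** n <= 0.01.
steps_to_floor = math.ceil(math.log(EPSILON_MIN / EPSILON_START) / math.log(EPSILON_DECAY))

for n in (100, 500, 1000):
    print(n, round(epsilon_after(n), 3))
print("floor reached after", steps_to_floor, "decay steps")
```

Under that assumption ε is roughly 0.61 after 100 steps and 0.08 after 500, and it reaches the 0.01 floor after 919 steps.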

4. Action Space and Decision Making

In each turn, the AI agent selects an action from the action space, which consists of playing a card from one of three categories: In Hand, Face Up, or Face Down. Whether an action is valid depends on the game's rules, such as matching or exceeding the rank of the top pile card or playing a special card like '2', '7', or 'Joker'. The DQN predicts a Q-value for each possible action, and when the agent is exploiting rather than exploring, it selects the action with the highest predicted Q-value.
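
A minimal sketch of this selection step, combined with ε-greedy exploration, might look like the following. The function name `select_action` and the action labels are illustrative rather than taken from the script, and the Q-values in the usage line are made up.

```python
import random
from typing import Optional, Sequence

ACTIONS = ("hand", "face_up", "face_down")  # illustrative labels for the 3 outputs


def select_action(q_values: Sequence[float], epsilon: float,
                  rng: Optional[random.Random] = None) -> int:
    """Return an action index: random with probability epsilon, else the argmax."""
    rng = rng or random
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


greedy = select_action([0.2, 1.5, -0.3], epsilon=0.0)
print(ACTIONS[greedy])  # prints "face_up"
```

With ε = 0 the choice is purely greedy; with ε = 1 every action is equally likely regardless of the predicted Q-values.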

5. Reward Structure and Learning Objectives

The reinforcement learning framework is augmented with a reward system designed to incentivize desirable behaviors and discourage suboptimal actions:

+10 points: Awarded for winning the game by successfully playing all cards.
+1 point: Granted for making a valid play, encouraging consistent adherence to game rules.
-5 points: Applied for attempting an invalid play, promoting rule compliance.
-10 points: Applied for having to pick up the pile, discouraging play that leads to this setback.
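
The reward values above can be collected in a small lookup table. The sketch below also computes the discounted return of a short, made-up sequence of events using γ = 0.99; the event names and the helper `episode_return` are hypothetical and only illustrate how the individual rewards combine over time.

```python
from typing import Dict, Sequence

REWARDS: Dict[str, float] = {
    "win": 10.0,
    "valid_play": 1.0,
    "invalid_play": -5.0,
    "pick_up_pile": -10.0,
}


def episode_return(events: Sequence[str], gamma: float = 0.99) -> float:
    """Discounted sum of rewards for a sequence of event names."""
    total = 0.0
    for t, event in enumerate(events):
        total += (gamma ** t) * REWARDS[event]
    return total


# A made-up episode: one valid play, a pile pickup, another valid play, then a win.
print(episode_return(["valid_play", "pick_up_pile", "valid_play", "win"]))
```

Because a single pile pickup costs as much as a win earns, an agent maximizing this return has a strong incentive to keep a playable card available.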

6. The Bellman Equation and Q-Learning

The Bellman Equation serves as the foundation for updating the Q-values within the DQN. It formalizes the relationship between the current Q-value and the expected future rewards:

Q(state, action) = reward + γ * max(Q(next_state, all actions))

This recursive formula allows the agent to iteratively update its value estimates, balancing immediate rewards with long-term gains. For instance, playing a '2' might yield an immediate reward of +1 and also set up future rewards by resetting the pile, resulting in a higher Q-value than the immediate reward alone suggests.
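
As a worked instance of the formula, the sketch below computes the update target for a single transition. Dropping the future term when the game has ended is standard DQN practice and is assumed here rather than confirmed by the text; the function name `td_target` and the sample Q-values are illustrative.

```python
from typing import Sequence

GAMMA = 0.99  # discount factor from the hyperparameter list


def td_target(reward: float, next_q_values: Sequence[float], done: bool) -> float:
    """Bellman target: reward plus the discounted best next Q-value (none if terminal)."""
    if done:
        return reward
    return reward + GAMMA * max(next_q_values)


# Playing a '2': +1 now, and the best follow-up action is valued at 2.0 (made-up numbers).
print(td_target(1.0, [0.5, 2.0, -1.0], done=False))
# Winning move: the episode ends, so only the terminal reward counts.
print(td_target(10.0, [0.0, 0.0, 0.0], done=True))
```

In the first case the target is 1 + 0.99 × 2.0, which is how a play worth only +1 immediately can still receive a noticeably higher Q-value.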

7. Deep Q-Network (DQN) Implementation

The `palace_dqn.py` script encapsulates the entire AI implementation using reinforcement learning principles. Below is a detailed breakdown of its components:

7.1. Environment Setup

The `CardGameEnv` class models the game environment, managing the state transitions, rule enforcement, and reward assignments. Key functionalities include:

State Encoding: Transforms the current game state into a numerical vector suitable for input into the neural network.
Step Function: Executes an action, updates the game state, assigns rewards, and determines if the game has concluded.
Special Card Handling: Implements the effects of playing special cards (e.g., '10' burns the pile, '2' resets it).
Player Management: Manages turn-taking between players and handles the distribution of cards.

7.2. Agent Design

The `DQNAgent` class represents the AI agent, responsible for selecting actions, learning from experiences, and optimizing its policy:

Neural Network: Built using PyTorch, the network approximates the Q-value function based on the current state.
Experience Replay: Utilizes a memory buffer (`deque`) to store experiences and sample random minibatches for training, breaking the correlation between sequential data.
Action Selection: Implements an ε-greedy policy, balancing exploration and exploitation based on the current value of ε.
Training: The `replay` method updates the network weights using stochastic gradient descent and backpropagation, minimizing the mean squared error between predicted and target Q-values.
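
The experience replay component can be sketched with a bounded `deque`. The class name `ReplayMemory`, the `capacity` values, and the placeholder states are assumptions for illustration; only the use of a `deque` and random minibatch sampling comes from the description above.

```python
import random
from collections import deque
from typing import Any, List, Tuple

Experience = Tuple[Any, int, float, Any, bool]  # state, action, reward, next_state, done


class ReplayMemory:
    """Fixed-size buffer that discards the oldest experience when full."""

    def __init__(self, capacity: int = 2000) -> None:  # capacity is a hypothetical value
        self.buffer: deque = deque(maxlen=capacity)

    def remember(self, state, action: int, reward: float, next_state, done: bool) -> None:
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int) -> List[Experience]:
        """Draw a random minibatch without replacement."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.remember(step, 0, 1.0, step + 1, False)  # integers stand in for state vectors
batch = memory.sample(2)
print(len(memory), len(batch))
```

Sampling at random from the buffer, rather than training on consecutive turns, is what breaks the correlation between sequential experiences.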

7.3. Training Process

The main execution block orchestrates the training of two agents (Player 1 and Player 2) over a specified number of episodes. During each episode:

The environment is reset, initializing a new game state.
Agents take turns selecting and executing actions based on their current policy.
Rewards are assigned based on the outcomes of actions, and experiences are stored in memory buffers.
Periodically, agents sample minibatches from their memories to perform learning updates, refining their Q-value approximations.
Post-training, the agents engage in a testing phase where exploration is minimized, and their learned policies are evaluated against each other.
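
The control flow of this loop can be sketched with stand-in classes. `StubEnv` and `StubAgent` below are hypothetical placeholders, not the script's `CardGameEnv` and `DQNAgent`: the stub environment simply ends after a fixed number of turns, and the stub agent acts randomly and counts replay calls instead of training a network. Calling `replay` once per episode is also an assumption.

```python
import random
from typing import List


class StubEnv:
    """Stand-in environment: alternates players and ends after max_turns (no real rules)."""

    def __init__(self, max_turns: int = 10) -> None:
        self.max_turns = max_turns
        self.turn = 0
        self.current_player = 0

    def reset(self) -> List[float]:
        self.turn = 0
        self.current_player = 0
        return [0.0] * 91

    def step(self, action: int):
        self.turn += 1
        done = self.turn >= self.max_turns
        reward = 10.0 if done else 1.0
        self.current_player = 1 - self.current_player  # switch turns
        return [float(self.turn)] * 91, reward, done


class StubAgent:
    """Stand-in agent: random actions, counts replay calls instead of learning."""

    def __init__(self) -> None:
        self.memory: list = []
        self.epsilon = 1.0
        self.replay_calls = 0

    def act(self, state: List[float]) -> int:
        return random.randrange(3)

    def remember(self, *experience) -> None:
        self.memory.append(experience)

    def replay(self, batch_size: int) -> None:
        if len(self.memory) >= batch_size:
            self.replay_calls += 1
            self.epsilon = max(0.01, self.epsilon * 0.995)


def train(env: StubEnv, agents: List[StubAgent], episodes: int, batch_size: int = 4) -> None:
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            agent = agents[env.current_player]  # the player whose turn it is
            action = agent.act(state)
            next_state, reward, done = env.step(action)
            agent.remember(state, action, reward, next_state, done)
            state = next_state
        for agent in agents:  # learning update once per episode (assumed schedule)
            agent.replay(batch_size)


agents = [StubAgent(), StubAgent()]
train(StubEnv(), agents, episodes=5)
print(len(agents[0].memory), agents[0].replay_calls)
```

Each agent stores only the transitions from its own turns, so the two memory buffers stay separate even though both agents share one environment.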

8. Game Play Example

An illustrative example demonstrates the AI's decision-making process during gameplay:

Turn 1:
- Top card: 6
- Computer's hand: [King, 7, 2]
- Computer evaluates:
  * King → +1 point now, but gives up a high-value card that could be useful later
  * 7 → +1 point now, imposes constraints on the next player's move
  * 2 → +1 point now, resets the pile, potentially ending the turn
- Computer selects: 2 (strategic choice to reset the pile and potentially gain more future rewards)

Turn 2:
- Fresh pile, computer can play any card
- Computer plays: King (sheds a high-value card that the opponent must now match or beat)

9. Training Phases

The AI undergoes an extensive training regimen to hone its strategic capabilities:

Initial Phase (First 100 Games): The agents take predominantly random actions, allowing them to explore the state and action spaces without bias.
Intermediate Phase (Around 500 Games): Agents are expected to begin recognizing and adopting fundamental strategies based on accumulated experience and learned rewards.
Advanced Phase (Around 1000 Games): Agents are expected to play proficiently, relying mainly on their learned Q-values rather than on random exploration.

Technical Overview of `palace_dqn.py`

The `palace_dqn.py` script is the core component that enables the AI agents to learn and play the Palace Card Game effectively. Below is a comprehensive examination of its structure and functionalities:

1. Libraries and Dependencies

Standard libraries and frameworks integral to the implementation include:

NumPy: Facilitates numerical operations and state vector manipulations.
PyTorch: Provides tools for constructing and training neural networks.
Collections (deque): Implements memory buffers for experience replay.
Random: Enables stochastic processes such as shuffling the deck and action selection.
JSON: Handles the loading of card data from external files.

2. Constants and Utility Functions

The script defines several constants and helper functions to manage game logic:

Card Types: Enumerates the different categories of cards (In Hand, Face Up, Face Down, Pile).
Rank Order: A dictionary mapping card ranks to their corresponding numerical values, facilitating comparison and validation of plays.
`get_playable_cards`: Identifies and retrieves the set of playable cards based on their type and current game rules.
`is_valid_play`: Determines the legitimacy of a proposed card play in the context of the game's current state.
`handle_special_card`: Executes the specific effects associated with special cards, such as clearing the pile or enforcing play restrictions.
`distribute`: Allocates cards to players, categorizing them appropriately and managing the deck.
`pick_up_pile`: Manages the action of a player picking up the pile, transferring pile cards to the player's hand.
`pprint_distributed_cards`: Provides a formatted display of distributed cards for debugging and verification purposes.
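
A simplified sketch of the rank comparison behind the validity check is shown below. The rank names and values, the treatment of '2', '7', and 'Joker' as always playable, and the two-argument signatures are assumptions; the script's own functions take the card type and game state into account, and the Seven Rule is omitted here.

```python
from typing import Dict, List, Optional

# Hypothetical rank table: numeric ranks 2-10, then face cards, Ace, and Joker.
RANK_ORDER: Dict[str, int] = {str(n): n for n in range(2, 11)}
RANK_ORDER.update({"Jack": 11, "Queen": 12, "King": 13, "Ace": 14, "Joker": 15})

ALWAYS_PLAYABLE = {"2", "7", "Joker"}  # special cards named in the text (assumed rule)


def is_valid_play(card: str, top_card: Optional[str]) -> bool:
    """A card is valid on an empty pile, if it is special, or if it matches or beats the top."""
    if top_card is None:
        return True
    if card in ALWAYS_PLAYABLE:
        return True
    return RANK_ORDER[card] >= RANK_ORDER[top_card]


def get_playable_cards(cards: List[str], top_card: Optional[str]) -> List[str]:
    """Filter a list of cards down to those that may legally be played."""
    return [card for card in cards if is_valid_play(card, top_card)]


print(get_playable_cards(["King", "7", "2"], "6"))   # ['King', '7', '2']
print(get_playable_cards(["5", "9", "2"], "Queen"))  # ['2']
```

When the second list comes back empty, the player has no legal play and must pick up the pile.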

3. Environment Class: `CardGameEnv`

This class encapsulates the game environment, handling state management, action execution, and game progression:

Initialization: Sets up the game with distributed cards, managing the deck, pile, current player, and game status.
State Retrieval (`get_state`): Encodes the current game state into a numerical vector suitable for neural network input.
Step Function (`step`): Processes an action taken by a player, updates the game state, assigns rewards, and determines if the game has concluded.
Card Playing (`play_card`): Executes the logic for playing a card, including updating the pile and handling special card effects.
Player Switching (`switch_player`): Alternates turns between players.
Reset Function (`reset`): Initializes a new game, shuffling and distributing cards, and resetting game status flags.

4. Agent Class: `DQNAgent`

The `DQNAgent` class defines the AI agent's behavior, encompassing action selection, memory management, and learning:

Initialization: Sets up neural network parameters, memory buffers, and exploration-exploitation dynamics.
Model Building (`build_model`): Constructs the neural network architecture using PyTorch, adhering to the predefined layer structure.
Memory Management (`remember`): Stores experiences in a deque to facilitate experience replay during training.
Action Selection (`act`): Implements an ε-greedy policy to choose actions based on current Q-value predictions or random exploration.
Training (`replay`): Processes minibatches from memory to perform gradient descent updates on the neural network, minimizing prediction errors.

5. Main Execution Flow

The script's main section orchestrates the interaction between the environment and the agents:

Environment and Agent Initialization: Sets up the game environment and instantiates two DQN agents representing the players.
Training Loop: Iterates over a defined number of episodes, during which agents play games, collect experiences, and update their neural networks through replay.
Testing Phase: Post-training, the agents play a test game with exploration minimized, so that their moves follow the learned policies, to evaluate their strategies against each other.
Outcome Reporting: Displays the results of the test game, including the winner, reasons for victory, and the final state of each player's cards.

The design of `palace_dqn.py` is intended to let the AI agents progressively improve their gameplay through iterative learning, using neural networks to approximate effective strategies within the Palace Card Game's framework.