Scaling Up to Casino Blackjack

The toy blackjack of Notebook 1 was a clean learning environment: a single deck, only Hit or Stand, and reshuffling every hand. But real casino blackjack is more complex. Players face multiple decks, expanded actions like Double Down, and payout rules that tilt the game slightly further in the house’s favor.

This second experiment asks whether an RL agent can still learn realistic play when the rules reflect what happens at a casino table.

Casino Rule Extensions

The environment now includes:

  • Six decks in a shoe, reshuffled only after exhaustion.
  • Dealer stands on soft 17.
  • Blackjack pays 3:2.
  • Action space expanded to:
    • Hit
    • Stand
    • Double Down (double the bet, one more card, then forced to stand)
# Action space: 0 = Hit, 1 = Stand, 2 = Double
action = env.action_space.sample()
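The Double Down mechanics above (double the bet, take exactly one card, then stand) can be sketched in isolation. This is an illustrative sketch, not the notebook's actual environment; `resolve_double`, `draw_card`, and `hand_value` are hypothetical names, with `hand_value` assumed to return the best blackjack total for a list of cards.

```python
def resolve_double(player_cards, dealer_total, draw_card, bet, hand_value):
    """Double the bet, draw exactly one card, then stand and settle."""
    bet *= 2                             # stake is doubled before the draw
    player_cards.append(draw_card())     # exactly one more card
    total = hand_value(player_cards)
    if total > 21:
        return -bet                      # bust loses the doubled bet
    if dealer_total > 21 or total > dealer_total:
        return bet                       # dealer bust or higher total wins
    return 0 if total == dealer_total else -bet
```

The key design point is that the doubled stake is committed before the card is seen, which is why the agent should only learn to double on totals where one card is very likely to produce a strong hand.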

Q-Learning

The state and action spaces are much larger now. Monte Carlo control, which waits until the end of each episode to update values, becomes inefficient. Instead, Q-learning is used: values are updated after every step, bootstrapping from the current estimate of the next state.

At each step, the Q-value update is:

import numpy as np

# One-step bootstrapped target: reward plus discounted value of the best next action
target = reward + gamma * np.max(Q[next_state])
Q[state][action] += alpha * (target - Q[state][action])

This incremental update lets the agent converge on useful strategies more quickly, even in a complex environment.
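Putting the update inside a full training loop looks roughly like the following. This is a generic sketch, not the notebook's code: the environment is abstracted into hypothetical `env_reset`/`env_step` callables, and exploration uses a fixed epsilon rather than whatever schedule the notebook actually uses.

```python
import numpy as np
from collections import defaultdict

def q_learning(env_step, env_reset, n_actions, episodes=500,
               alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    env_reset() -> state
    env_step(state, action) -> (next_state, reward, done)
    """
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            # Explore with probability epsilon, otherwise act greedily
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env_step(state, action)
            # Terminal states contribute no bootstrapped future value
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```

Because each transition updates the table immediately, the agent starts refining its estimates for common states (like hard 12–16 against a dealer 10) long before a full pass through the shoe.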

Learned Strategy

The learned policy is summarized in strategy grids (player total versus dealer upcard), broken down by whether the player has a usable ace.
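A grid like that can be read straight off the Q-table by taking the greedy action in each state. The sketch below assumes states are `(player_total, dealer_upcard, usable_ace)` tuples and actions follow the 0 = Hit, 1 = Stand, 2 = Double encoding used earlier; `policy_grid` is a hypothetical helper, not the notebook's exact code.

```python
import numpy as np

ACTION_SYMBOLS = {0: "H", 1: "S", 2: "D"}

def policy_grid(Q, usable_ace):
    """Map (player_total, dealer_upcard) -> greedy action symbol."""
    grid = {}
    for (total, upcard, ace), values in Q.items():
        if ace == usable_ace:
            grid[(total, upcard)] = ACTION_SYMBOLS[int(np.argmax(values))]
    return grid
```

Comparing the two grids (usable ace versus not) against a published basic strategy chart is the quickest sanity check of what the agent actually learned.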

EV and Outcomes

After training on hundreds of thousands of simulated hands, the agent’s estimated expected value (EV) converges around -4% per initial bet.

  • Win rate ≈ 42%
  • Push rate ≈ 8–9%
  • Loss rate ≈ 49%

These results are several percentage points worse than the theoretical house edge of roughly 0.5% under comparable rules. The gap comes from approximation error and the limitations of tabular Q-learning, but the broad shape of play remains realistic.
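The headline numbers above come from running the frozen policy over many hands and averaging. A minimal sketch of that evaluation pass, assuming a hypothetical `play_hand(policy)` that returns the hand's net return per unit of initial bet (e.g. +1.5 for a natural, ±2 for doubled hands, 0 for a push):

```python
import numpy as np

def evaluate(play_hand, policy, n_hands=100_000):
    """Estimate EV and outcome rates from simulated hands."""
    returns = np.array([play_hand(policy) for _ in range(n_hands)])
    return {
        "ev_per_initial_bet": returns.mean(),
        "win_rate": float(np.mean(returns > 0)),
        "push_rate": float(np.mean(returns == 0)),
        "loss_rate": float(np.mean(returns < 0)),
    }
```

Note that EV is measured against the initial bet, so doubled hands count double in the returns but not in the denominator, which is the convention behind the −4% figure.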

Takeaways

  • The RL agent adapts to casino-level rules, correctly incorporating Double Down into its policy.
  • Even with the added complexity of six decks and payout rules, the policy closely resembles published basic strategy charts.
  • The agent does not find a winning strategy — consistent with the fact that, under flat betting, blackjack remains a negative-EV game.

Continue to Notebook 3: Card Counting & Adaptive Betting →
