Scaling Up to Casino Blackjack

The toy blackjack of Notebook 1 was a clean learning environment: a single deck, only Hit or Stand, and reshuffling every hand. But real casino blackjack is more complex. Players face multiple decks, expanded actions like Double Down, and payout rules that tilt the game slightly further in the house’s favor.

This second experiment asks whether an RL agent can still learn realistic play when the rules reflect what happens at a casino table.

Casino Rule Extensions

The environment now includes:

  • Six decks in a shoe, reshuffled only after exhaustion.
  • Dealer stands on soft 17.
  • Blackjack pays 3:2.
  • Action space expanded to:
    • Hit
    • Stand
    • Double Down (double the bet, one more card, then forced to stand)
# Action space: 0 = Hit, 1 = Stand, 2 = Double
action = env.action_space.sample()
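The Double Down mechanics above (double the bet, take exactly one card, then stand) can be sketched in isolation. This is an illustrative sketch, not the notebook's actual environment; `resolve_double`, `draw_card`, and `hand_value` are hypothetical names, with `hand_value` assumed to return the best blackjack total for a list of cards.

```python
def resolve_double(player_cards, dealer_total, draw_card, bet, hand_value):
    """Double the bet, draw exactly one card, then stand and settle."""
    bet *= 2                             # stake is doubled before the draw
    player_cards.append(draw_card())     # exactly one more card
    total = hand_value(player_cards)
    if total > 21:
        return -bet                      # bust loses the doubled bet
    if dealer_total > 21 or total > dealer_total:
        return bet                       # dealer bust or higher total wins
    return 0 if total == dealer_total else -bet
```

The key design point is that the doubled stake is committed before the card is seen, which is why the agent should only learn to double on totals where one card is very likely to produce a strong hand.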

Q-Learning

The state and action spaces are much larger now. Monte Carlo control, which waits until the end of each episode to update values, becomes inefficient. Instead, Q-learning is used: values are updated after every step, bootstrapping from the current estimate of the next state.

At each step, the Q-value update is:

import numpy as np

# One-step bootstrapped target: reward plus discounted value of the best next action
target = reward + gamma * np.max(Q[next_state])
Q[state][action] += alpha * (target - Q[state][action])

This incremental update lets the agent converge on useful strategies more quickly, even in a complex environment.
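Putting the update inside a full training loop looks roughly like the following. This is a generic sketch, not the notebook's code: the environment is abstracted into hypothetical `env_reset`/`env_step` callables, and exploration uses a fixed epsilon rather than whatever schedule the notebook actually uses.

```python
import numpy as np
from collections import defaultdict

def q_learning(env_step, env_reset, n_actions, episodes=500,
               alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    env_reset() -> state
    env_step(state, action) -> (next_state, reward, done)
    """
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(n_actions))
    for _ in range(episodes):
        state, done = env_reset(), False
        while not done:
            # Explore with probability epsilon, otherwise act greedily
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env_step(state, action)
            # Terminal states contribute no bootstrapped future value
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```

Because each transition updates the table immediately, the agent starts refining its estimates for common states (like hard 12–16 against a dealer 10) long before a full pass through the shoe.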

Learned Strategy

The learned policy is summarized in strategy grids (player total versus dealer upcard), broken down by whether the player has a usable ace.
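A grid like that can be read straight off the Q-table by taking the greedy action in each state. The sketch below assumes states are `(player_total, dealer_upcard, usable_ace)` tuples and actions follow the 0 = Hit, 1 = Stand, 2 = Double encoding used earlier; `policy_grid` is a hypothetical helper, not the notebook's exact code.

```python
import numpy as np

ACTION_SYMBOLS = {0: "H", 1: "S", 2: "D"}

def policy_grid(Q, usable_ace):
    """Map (player_total, dealer_upcard) -> greedy action symbol."""
    grid = {}
    for (total, upcard, ace), values in Q.items():
        if ace == usable_ace:
            grid[(total, upcard)] = ACTION_SYMBOLS[int(np.argmax(values))]
    return grid
```

Comparing the two grids (usable ace versus not) against a published basic strategy chart is the quickest sanity check of what the agent actually learned.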

EV and Outcomes

After training on hundreds of thousands of simulated hands, the agent’s estimated expected value (EV) converges around -4% per initial bet.

  • Win rate ≈ 42%
  • Push rate ≈ 8–9%
  • Loss rate ≈ 49%

These results are several percentage points worse than the theoretical house edge of roughly 0.5% under comparable rules. The gap comes from approximation error and the limitations of tabular Q-learning, but the broad shape of play remains realistic.
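The headline numbers above come from running the frozen policy over many hands and averaging. A minimal sketch of that evaluation pass, assuming a hypothetical `play_hand(policy)` that returns the hand's net return per unit of initial bet (e.g. +1.5 for a natural, ±2 for doubled hands, 0 for a push):

```python
import numpy as np

def evaluate(play_hand, policy, n_hands=100_000):
    """Estimate EV and outcome rates from simulated hands."""
    returns = np.array([play_hand(policy) for _ in range(n_hands)])
    return {
        "ev_per_initial_bet": returns.mean(),
        "win_rate": float(np.mean(returns > 0)),
        "push_rate": float(np.mean(returns == 0)),
        "loss_rate": float(np.mean(returns < 0)),
    }
```

Note that EV is measured against the initial bet, so doubled hands count double in the returns but not in the denominator, which is the convention behind the −4% figure.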

Takeaways

  • The RL agent adapts to casino-level rules, correctly incorporating Double Down into its policy.
  • Even with the added complexity of six decks and payout rules, the policy closely resembles published basic strategy charts.
  • The agent does not find a winning strategy — consistent with the fact that, under flat betting, blackjack remains a negative-EV game.

Continue to Notebook 3: Card Counting & Adaptive Betting →
