Foul Play: A Competitive Pokémon Showdown Battle Bot

Foul Play is a Pokémon Showdown battle bot capable of playing a variety of singles formats in many generations of Pokémon. It has achieved high rankings in several formats, including Random Battles, OU, and Battle Factory.

Foul Play utilizes a custom battle engine to perform root-parallelized Monte Carlo Tree Search on many battle states. Foul Play’s advanced client tracks the battle state, predicts opponent Pokémon attributes, and can intelligently infer when information is revealed throughout the battle.

Search

Foul Play uses a bespoke engine (poke-engine) and the Monte Carlo Tree Search (MCTS) algorithm to explore many game trees, being guided by a custom evaluation function rather than rollouts. MCTS is effective at playing Pokémon as it naturally prioritizes exploring promising moves rather than searching all possible lines. Previous versions of the bot used expectiminimax to select a move, but without aggressive move-pruning, this approach would exhaustively search every branch of the game tree to a certain depth, meaning search beyond about 5 turns ahead was practically impossible without timing out. When performing MCTS, the most promising lines are usually searched to a depth of 10 or more, while the uninteresting parts of the tree may only be searched to a depth of 2 or 3.

Expectiminimax also suffered from the problem of having the opponent’s move being selected after the bot’s move, which is not how Pokémon works: both players select their moves at the same time and without knowledge of their opponent’s move. To deal with simultaneous moves, MCTS is modified to use Decoupled Upper Confidence Bound applied to Trees (DUCT), allowing moves to be selected simultaneously. This paper has a good explanation of how MCTS can be modified to work with simultaneous move games and how DUCT works.

Randomness is dealt with by sampling a single outcome of a move pair during the MCTS selection phase. Since poke-engine is able to generate all possible outcomes of a move pair along with their likelihoods, the search tree can be built out with all of this information stored in the nodes. For example, if a move has a 10% chance to miss, there will be 2 child nodes to choose from: a 90% node where the move hits and a 10% node where the move misses. These nodes are selected based on their likelihoods, so the search prioritizes paths that are more likely to occur.

Engine

Instruction Generation

To enable fast search during battles, poke-engine was built as a Rust rewrite of the original Python engine. poke-engine operates by taking a battle state S, a move pair (m1,m2), and generating a list of every possible outcome. However, instead of returning a list of new states (S1, S2, ...), it returns a list of instructions that can be applied to and reversed from the state to mutate it. This is important as Pokémon states can become quite large and costly to copy, especially in later generations where game mechanics are more complicated.

Here is a simplified version of what instruction generation looks like for two moves.

> generate-instructions shadowball tackle

Index: 0
StateInstruction: 
	Percentage: 80.00
	Instructions:
		Damage SideTwo: 184
		Damage SideOne: 48

Index: 1
StateInstruction: 
	Percentage: 20.00
	Instructions:
		Damage SideTwo: 184
		Boost SideTwo SpecialDefense: -1
		Damage SideOne: 48

poke-engine generated instructions for the (shadowball, tackle) move pair. Shadow ball is a 100% accurate move that has a 20% chance of lowering the target’s special-defense by one stage and tackle is a 100% accurate move with no additional side effects. Two outcomes are generated: one when only damage happens and the other for when the special-defense drop occurs.

Damage Roll Grouping

The above example omits damage rolls and critical hits for simplicity, but in practice they are quite important. In Pokémon, every move has 16 equally likely damage rolls - one for each value between 85% and 100% of the calculated damage. Furthermore, a move has an approximate 5% chance of being a critical hit, meaning the damage dealt can have up to 32 unique values. It is impractical to generate instructions for every possible damage roll and critical hit during search, as each node would have too many children and the search would be quite shallow.

poke-engine’s solution is damage roll grouping. When considering damage rolls, what is often important is whether a move will cause the opponent to faint. Rather than generating instructions for every possible damage roll, it inspects the target Pokémon’s remaining health and checks which damage rolls will and will not result in them fainting. The resulting instructions are grouped by whether they cause a faint, each group’s damage is averaged, and the likelihoods are summed.

For example, let’s say that the shadowball in the previous example has a 75% chance to faint its target and the target has 191 HP remaining. This example ignores critical hits for simplicity.

> generate-instructions shadowball tackle

Index: 0
StateInstruction: 
	Percentage: 75.00
	Instructions:
		Damage SideTwo: 191  # faint

Index: 1
StateInstruction: 
	Percentage: 20.00
	Instructions:
		Damage SideTwo: 187  # non-faint
		Damage SideOne: 48

Index: 2
StateInstruction: 
	Percentage: 5.00
	Instructions:
		Damage SideTwo: 187  # non-faint and spdef drop
		Boost SideTwo SpecialDefense: -1
		Damage SideOne: 48

At least 191 damage is dealt 75% of the time, and in that case no more instructions are generated due to the target fainting. The non-fainting damage rolls of (184, 186, 188, 190) in total have a 25% chance of occurring and have an average of 187 damage, so we group them together. Since there is (usually) no practical difference between non-fainting damage rolls they can be grouped together and explored as a single branch during MCTS. Note that branching due to the special-defense drop still happens for the non-fainting case.

Client

The client is responsible for communicating with Pokémon Showdown during a battle, keeping track of the battle state, and calling poke-engine to perform MCTS.

Hidden Information Extraction

Pokémon is a game of hidden information, and to perform well the client needs to do more than just parse the Pokémon Showdown Protocol accurately into a battle state. It needs to understand how to identify hidden information and use it to make better decisions. These are some ways that Foul Play does this:

  • Every time damage is dealt or taken by the opponent’s Pokémon, Foul Play uses poke-engine to simulate the damage roll and get a better idea of the opponent’s Pokémon’s stats.
  • If move priorities match, the move-order reveals the min or max speed of the opponent’s Pokémon
  • If a weather was set by the opponent and lasts beyond 5 turns, the opponent’s Pokémon must have the corresponding weather extension item.
  • If a Pokémon uses a status move, it is definitely not holding an Assault Vest
  • If a Pokémon switches in and takes hazard damage, it definitely does not have Heavy-Duty Boots

All of these are extremely important for performance. The more information Foul Play has, the more accurately it can predict sets.

Set Prediction

For Foul Play set prediction is as important as, if not more important than, accurate and fast search. This becomes obvious when comparing Foul Play’s performance in formats with known sets (like random battles) to formats with open team building (like OU or UU).

Foul Play keeps a large list of possible sets for each opposing Pokémon and filters them down as the battle progresses using the techniques mentioned above. In random formats each Pokémon has a small number of possible sets, often fewer than a dozen, giving Foul Play a pretty good idea about what to expect.

In standard formats it gets more complicated, predictions rely on a combination of publicly available data:

When it needs to select a move, Foul Play samples several possible battle states by filling in the unknown attributes of the opponent’s team. Each data source provides the likelihood of each attribute, so Foul Play can prioritize more likely scenarios.

Performance & Results

Foul Play performs best when it has reliable information about the opponent’s team, which is why random formats tend to have stronger results.

While peak ELO and rank are listed, GXE* and Glicko ratings, once stabilized, are a more accurate metric for performance as competitive Pokémon can be a relatively high variance game. Each GXE shown was measured only after the Glicko deviation dropped below 50, to ensure a stable rating.

Standard Formats

FormatGXEPeak ELOPeak Ladder Rank
gen9ou80%1879Top 100
gen5ou76%~1600Top 200
gen3ou71%18154

Random Formats

FormatGXEPeak ELOPeak Ladder Rank
gen1randombattle75%~1450Top 500
gen2randombattle85%~1500Top 50
gen3randombattle82%~1600Top 20
gen4randombattle85%17283
gen5randombattle80%1600Top 30
gen9randombattle88%2341Top 50
gen9randombattleblitz84%20181
gen9battlefactory81%16241

*GXE is an interpretation of Glicko that estimates your chance to defeat a random opponent on the ladder

Screenshots

Random Battles

randbats.png

gen12randbats.png

Rank 3 Gen4 RandomBattle

gen4rb.png

Rank 1 Gen9 RandomBattleBlitz

rank1rbb.png

Standard Battles

Gen9 OU

top100-medal.png gen9ou.png

Gen 9 Battle Factory

rank1bf.png

Gen3 OU Rank 4

rank4gen3ou

Replays

Gen9OU Win
Gen9OU Loss
Gen3 OU Win
Gen3 OU Loss
Gen9 RandomBattle Blitz Win
Gen9 RandomBattle Blitz Loss (vs MichaelderBeste!)