This project implements a Deep Q-Network (DQN) agent designed to execute large financial trade orders optimally. The agent learns to balance the trade-off between market impact (slippage caused by trading too fast) and market risk (price volatility risk from holding inventory too long), targeting a superior Implementation Shortfall (IS) compared to standard TWAP strategies.
A trader needs to sell a large block of shares (e.g., 10,000) within a fixed time window (e.g., 1 hour).
- Selling too fast (dumping) crashes the price due to temporary market impact (liquidity consumption).
- Selling too slow exposes the portfolio to price volatility risk; the price might drift down naturally before the trade completes.
Standard algorithms like TWAP (Time-Weighted Average Price) sell at a constant rate, ignoring market conditions. Our RL Agent is adaptive:
- If the price is rising, it may accelerate selling to lock in favorable prices ("front-loading").
- If the price is falling (but not crashing), it may slow down to wait for mean reversion, provided enough time remains.
The system models the interaction between an Execution Agent (Trader) and a Simulated Market.
```mermaid
graph LR
    subgraph Market ["Market Environment (Gym)"]
        PriceProcess["Price Process (GBM)"]
        ImpactModel["Impact Model (Almgren-Chriss)"]
    end
    subgraph Trader ["RL Agent (Double DQN)"]
        PolicyNet["Policy Network"]
        TargetNet["Target Network"]
    end
    PriceProcess -- "State: (Time, Inventory, Price)" --> Trader
    Trader -- "Action: Sell k% of TWAP rate" --> ImpactModel
    ImpactModel -- "Reward: -Slippage - Risk" --> Trader
    ImpactModel -- "Execution Price" --> PriceProcess
```
- State Observation: The agent observes normalized time remaining ($t/T$), inventory remaining ($q/Q$), and recent price returns.
- Action Selection: The agent selects a discrete execution speed (e.g., $0.5\times$, $1.0\times$, or $2.0\times$ the baseline TWAP rate).
- Market Response: The environment calculates the execution price (penalized by impact) and evolves the "fair" mid-price for the next step (a minimal sketch of this loop follows the list).
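For orientation, here is a minimal, illustrative Gymnasium sketch of this loop. The class name, defaults, and reward shaping below are assumptions made for readability; the project's actual implementation lives in `src/rl_order_execution/environment.py` and may differ.

```python
"""Illustrative only: a stripped-down market environment mirroring the loop above.
The real environment is src/rl_order_execution/environment.py; the names and
defaults here (ToyExecutionEnv, lam, ...) are assumptions, not the project's code."""
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class ToyExecutionEnv(gym.Env):
    """Sell `total_shares` over `horizon` steps; actions multiply the TWAP rate."""

    def __init__(self, total_shares=1000, horizon=50, start_price=100.0,
                 sigma=0.002, beta=0.05, lam=1e-6,
                 multipliers=(0.0, 0.5, 1.0, 1.5, 2.0)):
        self.total_shares, self.horizon = total_shares, horizon
        self.start_price, self.sigma, self.beta, self.lam = start_price, sigma, beta, lam
        self.multipliers = multipliers
        self.action_space = spaces.Discrete(len(multipliers))
        # Observation: (fraction of time remaining, fraction of inventory remaining, last log-return)
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32)

    def _obs(self):
        return np.array([1.0 - self.t / self.horizon,
                         self.q / self.total_shares,
                         self.last_ret], dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.t, self.q = 0, float(self.total_shares)
        self.price, self.last_ret = self.start_price, 0.0
        return self._obs(), {}

    def step(self, action):
        twap_rate = self.total_shares / self.horizon
        n = min(self.q, self.multipliers[action] * twap_rate)   # shares sold this step
        exec_price = self.price - self.beta * n                  # temporary impact penalty
        reward = -(self.beta * n ** 2 + self.lam * self.sigma ** 2 * self.q ** 2)
        self.q -= n
        new_price = self.price * np.exp(self.sigma * self.np_random.normal())  # driftless GBM step
        self.last_ret = float(np.log(new_price / self.price))
        self.price, self.t = new_price, self.t + 1
        terminated = self.q <= 0 or self.t >= self.horizon
        return self._obs(), float(reward), terminated, False, {"exec_price": exec_price}
```

A random policy would simply loop `env.step(env.action_space.sample())` until `terminated`; the DQN agent replaces that sampling with an epsilon-greedy choice over its Q-values.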
The "fair" mid-price
-
$\mu$ (Drift): Assumed to be 0 (random walk). -
$\sigma$ (Volatility): Controls the "risk" component. Higher volatility forces the agent to sell faster to avoid uncertainty.
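For reference, a standard discrete-time GBM update consistent with these parameters is (a sketch; the exact discretization in `environment.py` may differ):

$$
S_{t+1} = S_t \exp\!\Big(\big(\mu - \tfrac{1}{2}\sigma^2\big)\,\Delta t + \sigma\sqrt{\Delta t}\;\epsilon_t\Big),
\qquad \epsilon_t \sim \mathcal{N}(0, 1).
$$

With $\mu = 0$ this reduces to a driftless log-normal random walk, so only $\sigma$ shapes the price path.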
We follow the Almgren-Chriss model logic, in which trading creates a temporary distortion in price: if we sell $n_t$ shares in a single step, the price we receive is pushed down in proportion to $n_t$. Two points matter here:

- $\beta$: The liquidity coefficient that scales this temporary impact.
- Consequence: The cost of trading is proportional to $n_t^2$. This quadratic cost creates a strong mathematical incentive to split orders into smaller chunks (the foundation of TWAP), as worked out below.
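To make the quadratic-cost argument concrete, here is the standard Almgren-Chriss temporary-impact form (a sketch; the exact expression in `environment.py` may differ). Selling $n_t$ shares at step $t$ fills at a penalized price:

$$
\tilde{S}_t = S_t - \beta\, n_t
\quad\Longrightarrow\quad
\text{cost}_t = \big(S_t - \tilde{S}_t\big)\, n_t = \beta\, n_t^2 .
$$

Under this form, splitting a parent order of $Q$ shares into $N$ equal slices cuts the total temporary-impact cost from $\beta Q^2$ to $N \cdot \beta (Q/N)^2 = \beta Q^2 / N$, which is exactly why TWAP-style slicing is a sensible baseline.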
To train the RL agent, we define a per-step reward as the negative of the execution cost. This is equivalent to minimizing a cost function with two components (a standard form is sketched after this list):

- Slippage Term: Penalizes trading too fast.
- Risk Term ($\lambda$): Penalizes holding inventory too long. If $\lambda = 0$, the agent approaches TWAP. If $\lambda$ is high, the agent sells ASAP.
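One standard way to write this objective, in the Almgren-Chriss mean-variance spirit (a sketch; the precise reward shaping is defined in `environment.py`):

$$
C \;=\; \underbrace{\sum_{t} \beta\, n_t^2}_{\text{slippage}} \;+\; \lambda \underbrace{\sum_{t} \sigma^2\, q_t^2}_{\text{inventory risk}},
\qquad
r_t = -\big(\beta\, n_t^2 + \lambda\,\sigma^2\, q_t^2\big),
$$

where $q_t$ is the inventory still held after step $t$. Setting $\lambda = 0$ leaves only the quadratic slippage term, whose minimizer is the even TWAP schedule; a large $\lambda$ makes holding inventory expensive, so the agent liquidates as fast as possible.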
```
rl-order-execution/
├── .github/workflows/
│   ├── changelog.yaml       # Auto-generate CHANGELOG.md
│   ├── ci.yaml              # CI pipeline (Lint, Test, Type-Check)
│   └── update-docs.yaml     # Auto-update README config table
├── config/
│   └── config.yaml          # Runtime configuration parameters
├── src/
│   └── rl_order_execution/
│       ├── agent.py         # DQN Agent & ReplayBuffer implementation
│       ├── settings.py      # Pydantic configuration & validation
│       ├── environment.py   # Custom Gymnasium Market Environment
│       ├── evaluation.py    # TWAP comparison & plotting logic
│       ├── optimize.py      # Optuna hyperparameter tuning script
│       └── training.py      # Core training loop with TensorBoard
├── tests/                   # Pytest suite
├── output/                  # Generated artifacts
├── db/                      # Optuna SQLite database storage
├── .pre-commit-config.yaml  # Git hooks configuration
├── CHANGELOG.md             # Auto-generated changelog history
├── cliff.toml               # Changelog configuration
├── Dockerfile               # Container definition
├── LICENSE                  # MIT License
├── Makefile                 # Automation commands
├── pyproject.toml           # Dependencies (uv)
├── README.md                # Documentation
└── main.py                  # Application entry point
```
- Python 3.11+
- uv (Highly Recommended for dependency resolution)
- Docker (Optional, for isolated execution)
- Clone the repository:

  ```bash
  git clone https://github.com/alex-is-busy-coding/rl-order-execution.git
  cd rl-order-execution

  # Install dependencies (production + dev)
  make install-dev

  # Set up Git pre-commit hooks (Optional, but recommended)
  make setup-hooks
  ```
- Install dependencies and run:

  ```bash
  make run
  ```

  This executes the training loop, compares the agent against a TWAP benchmark, and saves a trajectory plot to `execution_analysis.png`.

The training loop automatically logs loss, reward, and epsilon decay to TensorBoard:

```bash
make tensorboard
```

Open http://localhost:6006 to view the metrics.
We use Optuna to automatically find the best hyperparameters (Learning Rate, Batch Size, Gamma).
Run the Optimizer:

```bash
make optimize
```

- Optimization: Runs trials (TPE algorithm) to maximize the agent's improvement over TWAP.
- Early Pruning: Uses a `MedianPruner` to terminate bad trials early (checked every 50 episodes), saving compute.
- Persistence: Studies are saved to a local SQLite database (`db/optuna_study.db`), allowing you to pause and resume optimization.
- Configuration: Search ranges and trial counts are defined in `config/config.yaml` under the `optimization` section.
- Auto-Deployment: The best parameters are saved to `config/best_params.yaml`; subsequent `make run` calls automatically prioritize these values. A minimal Optuna setup illustrating this flow is sketched after this list.
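For reference, the persistence and pruning behavior described above corresponds to standard Optuna usage roughly along these lines (an illustrative sketch; `optimize.py` is authoritative, and the placeholder objective below does not train anything):

```python
import optuna

# Illustrative sketch of the study setup described above; optimize.py is authoritative.
study = optuna.create_study(
    study_name="rl_order_execution_v1",          # matches RL_OPTIMIZATION__STUDY_NAME
    storage="sqlite:///db/optuna_study.db",      # SQLite persistence -> pause/resume
    load_if_exists=True,
    direction="maximize",                        # maximize improvement over TWAP
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.MedianPruner(),        # stop weak trials early
)


def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    # The real objective trains the agent for the tuning-episode budget, calls
    # trial.report(...) periodically so the MedianPruner can act, and returns the
    # average savings vs. TWAP. Here we only return a meaningless placeholder.
    _ = (lr, batch_size, gamma)
    return 0.0


study.optimize(objective, n_trials=20)
```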
To run the simulation in a completely isolated environment:
- Build the image:

  ```bash
  make docker-build
  ```

- Run the container (mounts the local directory to capture output artifacts):

  ```bash
  make docker-run
  ```
The simulation output provides several metrics to assess agent performance against the TWAP benchmark (a computation sketch follows the table):

| Metric | Implementation | Interpretation |
|---|---|---|
| Shortfall (IS) | (Arrival Price * Shares) - Realized Revenue | The total cost of trading (slippage + missed opportunity). Lower is better. |
| Avg Savings | Avg(TWAP IS) - Avg(Agent IS) | The average dollar amount saved per episode by using the RL agent. Positive is good. |
| Avg Savings (bps) | (Avg Savings / Order Value) * 10,000 | Savings normalized by trade size. 1 bps = 0.01%. Standard industry metric for execution quality. |
| Information Ratio (IR) | Mean(Savings) / StdDev(Savings) | Measures risk-adjusted performance. High IR (>0.5) implies consistent outperformance, not just luck. |
| Win Rate | % of episodes where Agent Cost < TWAP Cost | Consistency of beating the benchmark. Ideally > 50%. |
| VaR 95% (Savings) | 5th Percentile of Savings Distribution | Tail risk. If negative, it means in the worst 5% of cases, the agent underperforms TWAP by this amount. |
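As a concrete reading of the formulas above, the summary statistics could be computed from per-episode shortfalls roughly as follows (a sketch; `evaluation.py` is authoritative, and the helper name is hypothetical):

```python
import numpy as np


def summarize(agent_is: np.ndarray, twap_is: np.ndarray, order_value: float) -> dict:
    """Hypothetical helper: agent_is / twap_is are per-episode Implementation Shortfalls."""
    savings = twap_is - agent_is                          # positive => agent cheaper than TWAP
    return {
        "avg_savings": float(savings.mean()),
        "avg_savings_bps": float(savings.mean() / order_value * 10_000),
        "information_ratio": float(savings.mean() / savings.std(ddof=1)),
        "win_rate": float((savings > 0).mean()),          # episodes where Agent Cost < TWAP Cost
        "var_95_savings": float(np.percentile(savings, 5)),  # 5th percentile of savings
    }
```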
We use make to standardize development tasks and ensure code quality.
| Command | Description |
|---|---|
| `make run` | Run the simulation |
| `make optimize` | Run Optuna hyperparameter tuning |
| `make tensorboard` | Launch TensorBoard server |
| `make check` | Recommended. Run all quality checks (lint + type-check + test) |
| `make test` | Run unit tests |
| `make lint` | Check code style |
| `make type-check` | Run static type checking with mypy |
| `make format` | Auto-format code |
| `make docs` | Locally update README config table |
| `make install` | Install base dependencies |
| `make install-dev` | Install all dev dependencies |
| `make docker-build` | Build the Docker image |
| `make docker-run` | Run the Docker container |
| `make clean` | Remove virtualenv, caches, and plots |
Run `make help` in your terminal to see the full list of available commands.
Configuration is managed via `pydantic-settings`. You can override defaults using environment variables or by editing `config/config.yaml`.
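The `RL_` prefix and double-underscore nesting used in the tables below follow the usual pydantic-settings convention, roughly like this (illustrative; `settings.py` is authoritative and its actual field layout may differ):

```python
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict


class RLParams(BaseModel):
    lr: float = 0.001
    batch_size: int = 64


class Settings(BaseSettings):
    # With this config, the env var RL_RL__LR=0.0005 overrides Settings().rl.lr.
    model_config = SettingsConfigDict(env_prefix="RL_", env_nested_delimiter="__")
    rl: RLParams = RLParams()
```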
| Name | Required | Default | Description |
|---|---|---|---|
| `RL_LOGGING__LOG_LEVEL` | No | `INFO` | Logging verbosity level. Possible values: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` |
| Name | Required | Default | Description |
|---|---|---|---|
| `RL_OPTIMIZATION__BATCH_SIZES` | No | `[32, 64, 128]` | List of batch sizes to test. |
| `RL_OPTIMIZATION__GAMMA_MAX` | No | `0.9999` | Maximum discount factor. |
| `RL_OPTIMIZATION__GAMMA_MIN` | No | `0.9` | Minimum discount factor. |
| `RL_OPTIMIZATION__LR_MAX` | No | `0.01` | Maximum learning rate to test. |
| `RL_OPTIMIZATION__LR_MIN` | No | `1e-05` | Minimum learning rate to test. |
| `RL_OPTIMIZATION__N_TRIALS` | No | `20` | Number of Optuna trials to run. |
| `RL_OPTIMIZATION__STUDY_NAME` | No | `rl_order_execution_v1` | Name of the Optuna study. Change this to start a new experiment. |
| `RL_OPTIMIZATION__TUNING_EPISODES` | No | `500` | Episodes per trial (shorter than production run). |
| Name | Required | Default | Description |
|---|---|---|---|
| `RL_RL__BATCH_SIZE` | No | `64` | Training batch size. |
| `RL_RL__EPISODES` | No | `500` | Total training episodes. |
| `RL_RL__EPSILON_DECAY` | No | `0.995` | Epsilon decay factor. |
| `RL_RL__EPSILON_END` | No | `0.01` | Minimum exploration rate. |
| `RL_RL__EPSILON_START` | No | `1.0` | Initial exploration rate. |
| `RL_RL__GAMMA` | No | `0.99` | Discount factor. |
| `RL_RL__LR` | No | `0.001` | Learning rate. |
| `RL_RL__MEMORY_SIZE` | No | `10000` | Replay buffer size. |
| `RL_RL__TARGET_UPDATE` | No | `10` | Episodes between target updates. |
| Name | Required | Default | Description |
|---|---|---|---|
| `RL_SIMULATION__ACTION_MULTIPLIERS` | No | `[0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0]` | Discrete multipliers of the average execution rate. |
| `RL_SIMULATION__DRIFT` | No | `0.0` | Price drift (mu). |
| `RL_SIMULATION__LIQUIDITY_PARAM` | No | `0.01` | Permanent market impact (alpha). |
| `RL_SIMULATION__SEED` | No | `42` | Random seed for reproducibility. |
| `RL_SIMULATION__START_PRICE` | No | `100.0` | Initial market price. |
| `RL_SIMULATION__TEMP_IMPACT_PARAM` | No | `0.05` | Temporary market impact (beta). |
| `RL_SIMULATION__TIME_HORIZON` | No | `50` | Total duration (time steps). |
| `RL_SIMULATION__TOTAL_SHARES` | No | `1000` | Total number of shares to liquidate. |
| `RL_SIMULATION__VOLATILITY` | No | `0.002` | Price volatility (sigma). |
While this project demonstrates a robust RL pipeline, it makes certain simplifying assumptions common in initial research but limiting for production deployment.
Limitation: The current agent uses a Deep Q-Network (DQN), which necessitates a discrete action space. Execution rates are quantized into specific bins (e.g., 0.5x, 1.0x, 2.0x TWAP). This lacks the granularity required for precise optimal control.
Future Improvement: Implement Proximal Policy Optimization (PPO) or Soft Actor-Critic (SAC). These algorithms natively support continuous action spaces, allowing the agent to output precise float values for execution rates.
Limitation: The environment utilizes Geometric Brownian Motion (GBM). While standard for theoretical derivatives pricing, GBM fails to capture the "stylized facts" of high-frequency market data, specifically Volatility Clustering, Fat Tails, and Mean Reversion.
Future Improvement:
- Implement an Ornstein-Uhlenbeck (OU) process to simulate mean-reverting price dynamics (a discretized form is sketched after this list).
- Develop a `HistoricalReplayEnv` to train agents on real minute-bar or tick-level data (L2/L3) to validate performance on historical scenarios.
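For reference, the OU dynamics mentioned above are typically discretized as (a sketch, not project code):

$$
S_{t+1} = S_t + \theta\,\big(\bar{S} - S_t\big)\,\Delta t + \sigma\sqrt{\Delta t}\;\epsilon_t,
\qquad \epsilon_t \sim \mathcal{N}(0, 1),
$$

where $\theta$ sets the speed of mean reversion toward the long-run level $\bar{S}$.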
Limitation: The current state observation includes only normalized time, inventory, and recent price trend.
Future Improvement: Enrich the state space with microstructure signals such as Order Book Imbalance (OBI), Volume Weighted Average Price (VWAP) deviation, and Bid-Ask Spread to give the agent deeper market visibility.
See CHANGELOG.md for a history of changes to this project.
This project is licensed under the MIT License.
See the LICENSE file for details.