This guide explains how to set up MongoDB replication using Docker containers and provides insights into the replication process, oplog management, failover handling, and best practices.
Launch the primary MongoDB node with the --replSet flag:
docker run -d --name BB-game-mongo1 -p 27018:27017 -v BB-game-mongo1:/data/db --network bb-network mongo:8.0 --replSet rs0

These nodes will act as secondary replicas:
docker run -d --name BB-game-mongo2 --network bb-network mongo:8.0 --replSet rs0
docker run -d --name BB-game-mongo3 --network bb-network mongo:8.0 --replSet rs0

Connect to the primary MongoDB container:
docker exec -it BB-game-mongo1 mongosh

Inside mongosh, run the following command to configure the replica set:
rs.initiate({
_id: "rs0",
members: [
{ _id: 0, host: "BB-game-mongo1:27017" },
{ _id: 1, host: "BB-game-mongo2:27017" },
{ _id: 2, host: "BB-game-mongo3:27017" }
]
});

If successful, you’ll see an output similar to:
{
"ok": 1,
"$clusterTime": { ... },
"operationTime": { ... }
}

MongoDB creates a replica set configuration and assigns roles to nodes.
- One of the nodes (BB-game-mongo1, BB-game-mongo2, or BB-game-mongo3) is elected as the Primary.
- The other two nodes become Secondary members.
- If the Primary node fails, one of the Secondaries is automatically promoted.
- The Primary node handles all write operations.
- Secondary nodes replicate data from the Primary using an oplog (operations log).
- The oplog is a special collection, local.oplog.rs, that stores all write operations from the Primary.
- Secondaries continuously read from the oplog and apply changes to their own copies of the database.
- If the Primary node crashes, the other members detect it via the heartbeat mechanism (a ping every 2 seconds).
- A new election is triggered, and one of the Secondary nodes is promoted to Primary.
- The clients automatically reconnect to the new Primary.
- By default, clients read from the Primary.
- However, we can configure read preferences to read from Secondaries for load balancing.
- If a node is temporarily disconnected, it re-syncs with the Primary when it rejoins.
- MongoDB handles rollback operations if inconsistencies occur.
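As a rough illustration of the failover and read-preference points above, the sketch below builds a replica-set connection string. Listing all members together with the replicaSet option is what lets a driver find the new Primary after an election, and readPreference controls where reads go. The helper name buildReplicaSetUri is hypothetical, and the example assumes the client runs on the same Docker network so that the container names resolve as hostnames.

```javascript
// Build a replica-set connection string (helper name is illustrative only).
// Assumption: the client is attached to bb-network, so container names resolve.
function buildReplicaSetUri(hosts, replicaSet, readPreference) {
  const params = new URLSearchParams({ replicaSet, readPreference });
  return `mongodb://${hosts.join(",")}/?${params.toString()}`;
}

const uri = buildReplicaSetUri(
  ["BB-game-mongo1:27017", "BB-game-mongo2:27017", "BB-game-mongo3:27017"],
  "rs0",
  "secondaryPreferred" // read from a Secondary when one is available
);

console.log(uri);
```

With secondaryPreferred, reads fall back to the Primary if no Secondary is available; keep in mind that reads from Secondaries can return slightly stale data.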
To check if all nodes are properly connected, run:
rs.status();

Expected output (simplified example):
{
"set": "rs0",
"myState": 1,
"members": [
{ "_id": 0, "name": "BB-game-mongo1:27017", "stateStr": "PRIMARY" },
{ "_id": 1, "name": "BB-game-mongo2:27017", "stateStr": "SECONDARY" },
{ "_id": 2, "name": "BB-game-mongo3:27017", "stateStr": "SECONDARY" }
],
"ok": 1
}

- PRIMARY → This node handles writes.
- SECONDARY → These nodes replicate data and handle reads if configured.
- ARBITER → If you have an even number of nodes, you may include an arbiter to avoid a tie during elections. Arbiters participate in elections but do not store any data.
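To make the roles and the tie-breaking remark more concrete, here is a minimal sketch that groups the members of an rs.status() result by state and computes how many votes form a majority. The function name summarizeReplicaSet is hypothetical, and the sketch assumes every member has one vote (the real vote counts live in rs.conf(), not in rs.status()).

```javascript
// Group members by state and compute the election majority.
// Assumption: every listed member is a voting member.
function summarizeReplicaSet(status) {
  const byState = {};
  for (const member of status.members) {
    byState[member.stateStr] = byState[member.stateStr] || [];
    byState[member.stateStr].push(member.name);
  }
  const total = status.members.length;
  const majority = Math.floor(total / 2) + 1; // votes needed to elect a Primary
  return { byState, majority, faultTolerance: total - majority };
}

// Sample data copied from the simplified rs.status() output above.
const status = {
  set: "rs0",
  members: [
    { _id: 0, name: "BB-game-mongo1:27017", stateStr: "PRIMARY" },
    { _id: 1, name: "BB-game-mongo2:27017", stateStr: "SECONDARY" },
    { _id: 2, name: "BB-game-mongo3:27017", stateStr: "SECONDARY" },
  ],
};

const summary = summarizeReplicaSet(status);
console.log(summary.majority);       // 2
console.log(summary.faultTolerance); // 1
```

With four voting members the majority becomes 3 and the set still tolerates only one failure, which is why an odd number of voting members is usually preferred.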
docker exec -it BB-game-mongo1 mongosh
db.getSiblingDB('local').oplog.rs.find().sort({$natural: -1}).limit(5).pretty();

The oplog (local.oplog.rs) is a special capped collection that records all write operations performed on the Primary node. The Secondary nodes use this log to replicate changes and keep themselves in sync with the Primary.
- Every write operation (insert, update, delete) is recorded in the oplog.
- Read operations are NOT logged in the oplog (only writes are).
- Secondary nodes continuously pull operations from the Primary’s oplog and apply them.
- For example, when a document is inserted, MongoDB records the operation in the oplog, and the Secondary nodes replicate this insertion:
{
"op": "i", // Operation type (insert)
"ns": "test.users", // Namespace (database.collection)
"o": { "_id": ObjectId("67deb9c494f2547fbf51e945") }, // Inserted document
"ts": Timestamp({ "t": 1742649796, "i": 1 }), // Timestamp
"wall": ISODate("2025-03-22T13:23:16.366Z") // Wall-clock time
}

| Field | Description |
|---|---|
| op | The operation type: i = insert, u = update, d = delete, c = command, n = noop |
| ns | The namespace database.collection where the operation occurred |
| o | The document that was written to the database |
| o2 | Used for updates (contains _id of the updated document) |
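| ts | Timestamp of the operation (used for replication) |

"op": "n" ➡️ These are no-op entries, not data changes. They are typically written periodically by the Primary while it is otherwise idle, so that Secondaries can keep advancing their replication timestamp.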
The oplog size in MongoDB determines how long operations remain available for replication before they are overwritten.
If the oplog is too small, Secondary nodes can fall behind (SECONDARY state → RECOVERING):
- If a Secondary node disconnects or slows down, it needs to catch up when it reconnects.
- If the required oplog entries have already been overwritten, the Secondary cannot sync using the oplog.
- A full resync is then required, which is expensive.

If the oplog is too large:
- Wastes Disk Space: The oplog is stored in the local database and does not shrink automatically.
- Longer Startup and Recovery Time: A larger oplog means longer recovery times when a Secondary node restarts.
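The trade-off above comes down to the oplog window, that is, the time span between the oldest and newest entries still in the oplog. The sketch below is a simplified model of that idea; the function names and the 6-hour figure are hypothetical, and the inputs are the seconds part ("t") of an oplog ts value. In mongosh, rs.printReplicationInfo() reports the real window for a member.

```javascript
// Oplog window: time span between the oldest and newest oplog entries.
// Both arguments are the seconds part ("t") of an oplog "ts" timestamp.
function oplogWindowSeconds(oldestT, newestT) {
  return newestT - oldestT;
}

// Simplified rule: a Secondary can resume from the oplog only if it was away
// for less time than the window; otherwise the entries it needs are gone.
function canCatchUpFromOplog(downtimeSeconds, windowSeconds) {
  return downtimeSeconds < windowSeconds;
}

// Hypothetical values: the newest entry is 6 hours after the oldest one.
const oldestT = 1742649796;
const newestT = oldestT + 6 * 3600;
const windowSeconds = oplogWindowSeconds(oldestT, newestT);

console.log(windowSeconds / 3600);                         // 6
console.log(canCatchUpFromOplog(2 * 3600, windowSeconds)); // true
console.log(canCatchUpFromOplog(8 * 3600, windowSeconds)); // false
```

A write-heavy workload fills the same oplog faster, so the window shrinks even though the configured size stays the same.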
db.getSiblingDB('local').oplog.rs.stats();

| Statistic | Description |
|---|---|
| size | The total size of the oplog in bytes |
| storageSize | The amount of storage space the oplog uses, including any wasted space, in bytes |
| count | The number of documents (operations) currently stored in the oplog |
| maxSize | The maximum size of the oplog in bytes. |
// Resize the oplog; size is specified in megabytes (5120 MB = 5 GB)
db.adminCommand({ replSetResizeOplog: 1, size: 5120 })
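Because stats() reports sizes in bytes while replSetResizeOplog takes its size in megabytes, a small conversion helper can prevent unit mix-ups. The following is a sketch only: the function names are hypothetical and the stats values are made up for illustration.

```javascript
const BYTES_PER_MB = 1024 * 1024;

// Convert the byte values from oplog.rs.stats() into MB and a usage percentage.
function oplogUsage(stats) {
  return {
    usedMB: stats.size / BYTES_PER_MB,
    maxMB: stats.maxSize / BYTES_PER_MB,
    percentUsed: (stats.size / stats.maxSize) * 100,
  };
}

// Build the command document for db.adminCommand(); size is in megabytes.
function resizeOplogCommand(sizeMB) {
  return { replSetResizeOplog: 1, size: sizeMB };
}

// Hypothetical stats: 512 MB used out of a 1024 MB oplog.
const usage = oplogUsage({ size: 512 * BYTES_PER_MB, maxSize: 1024 * BYTES_PER_MB });
console.log(usage.percentUsed); // 50

const resizeCommand = resizeOplogCommand(5 * 1024); // 5 GB expressed in MB
console.log(resizeCommand);
```

In mongosh, the resulting document could be passed to db.adminCommand(resizeCommand) on the member being resized. Note that a capped collection like the oplog is normally close to full once it has been running for a while, so the usage percentage is less informative than the time window the oplog covers.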
- Set the Right Size: Make sure your oplog is large enough to handle downtime without losing crucial operations. If a secondary falls behind, it will need to read older entries from the oplog to catch up.
- Monitor Oplog Lag: Use tools like rs.printSecondaryReplicationInfo() to check how far behind your secondaries are. High lag could be a sign of network or performance issues.
- Disk I/O: Oplogs can generate a lot of I/O, especially in write-heavy workloads. Make sure your disk can handle the write operations efficiently.
- Lost Oplogs: If a secondary node is offline for too long and the oplog runs out of space, it won’t be able to catch up. You’ll have to do a full data resync, which can take hours or even days with large datasets.
- Oplog Lag: If secondaries aren’t catching up fast enough, they can fall behind the primary. This is called oplog lag, and it can impact consistency in your replica set.
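For monitoring beyond the shell helpers, the lag can also be derived from the optimeDate field that rs.status() reports for each member. The sketch below assumes that field is present on every member; the function name replicationLagSeconds and the sample dates are hypothetical.

```javascript
// Compute each Secondary's lag (in seconds) relative to the Primary,
// using the optimeDate of each member from an rs.status() result.
function replicationLagSeconds(members) {
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null; // no Primary right now, e.g. during an election
  const lag = {};
  for (const m of members) {
    if (m.stateStr !== "SECONDARY") continue;
    lag[m.name] = (primary.optimeDate - m.optimeDate) / 1000; // ms -> s
  }
  return lag;
}

// Hypothetical sample: mongo3 is 5 seconds behind the Primary.
const members = [
  { name: "BB-game-mongo1:27017", stateStr: "PRIMARY", optimeDate: new Date("2025-03-22T14:47:48Z") },
  { name: "BB-game-mongo2:27017", stateStr: "SECONDARY", optimeDate: new Date("2025-03-22T14:47:48Z") },
  { name: "BB-game-mongo3:27017", stateStr: "SECONDARY", optimeDate: new Date("2025-03-22T14:47:43Z") },
];

const lag = replicationLagSeconds(members);
console.log(lag["BB-game-mongo2:27017"]); // 0
console.log(lag["BB-game-mongo3:27017"]); // 5
```

In mongosh the same function could be called as replicationLagSeconds(rs.status().members), and the result compared against an alert threshold of your choosing.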
Replication lag is the delay between the primary logging an operation in the oplog and a secondary applying it. You can monitor this lag using the command:
rs.printSecondaryReplicationInfo();
(rs.printSlaveReplicationInfo() is the older name for the same helper; it is deprecated, so prefer the command above in mongosh.)

It shows how far behind each secondary is:
source: BB-game-mongo1:27017
{
syncedTo: 'Sat Mar 22 2025 14:47:48 GMT+0000 (Coordinated Universal Time)',
replLag: '0 secs (0 hrs) behind the primary '
}
---
source: BB-game-mongo3:27017
{
syncedTo: 'Sat Mar 22 2025 14:47:48 GMT+0000 (Coordinated Universal Time)',
replLag: '0 secs (0 hrs) behind the primary '
}

If the Primary fails, MongoDB automatically elects a new Primary. Here’s how: