Summary
When multiple Gravity instances run the DHCP role, the same client lease can be created concurrently during DHCPDISCOVER.
If one node misses the lease in its local watcher cache, it may create the lease again and overwrite the existing key with the shorter leaseNegotiateTimeout TTL. This can cause the lease to disappear after the negotiate timeout even though the client completed DHCP successfully.
Impact
- the same lease key can be written by multiple nodes
- an existing lease can be overwritten by a later DISCOVER
- the overwritten lease may end up with the short negotiate TTL, for example 30s
- after that timeout, the lease disappears unexpectedly
Reproduction
- Run multiple Gravity instances with the DHCP role enabled.
- Use a DHCP client against the shared cluster, for example:
- Observe lease creation while more than one node processes the same client traffic.
- In some runs, the lease key is overwritten and ends up with the negotiate timeout instead of the normal scope TTL.
Expected behavior
- only the first node should create the lease for a client
- other nodes should reuse the existing lease instead of overwriting it
- DISCOVER must not downgrade an existing lease to leaseNegotiateTimeout
Actual behavior
A node can miss the existing lease in its local watcher state and then:
- see FindLease() == nil
- create a new lease object
- write it with leaseNegotiateTimeout
If another node already created the same lease key, this write overwrites the key and can shorten the lease lifetime unexpectedly.
Root cause
Lease existence was checked via local watcher state only. In a clustered setup, watcher lag means two nodes can both believe the lease does not yet exist.
Proposed fix
Use an atomic create-if-absent operation for new lease creation during DISCOVER and REQUEST:
- try to create the lease key only if it does not already exist
- if creation fails because the key already exists, fetch and reuse the existing lease
- never overwrite an existing lease with the short discover or negotiate TTL