x/net/http2: raciness in stream close prevents closing of failed connections, blocking subsequent requests

### What version of Go are you using (`go version`)?

<pre>
go version go1.20.3 linux/amd6
</pre>

### Does this issue reproduce with the latest release?
Yes


### What operating system and processor architecture are you using (`go env`)?

<details><summary><code>go env</code> Output</summary><br><pre>
$ go env

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.3"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2968508177=/tmp/go-build -gno-record-gcc-switches"

</pre></details>

### What did you do?

A connection is established to a remote server (using HTTP/2 where available) and requests are sent periodically over this connection - it's essentially a write pipeline working through a queue sequentially.

This being the internet, sometimes bad things happen to individual connections and packets go awry.

If a timeout occurs, the connection will (well, should) be closed:

```
resp, err := client.Do(req.WithContext(ctx))
if err != nil {
            // Attempt to tear down the connection
            fmt.Println("Timed out")
            client.CloseIdleConnections()
        }
```
(that `CloseIdleConnections()` call is sourced from https://github.com/golang/go/issues/36026)


### What did you expect to see?

On the next request (generally a retry), a new connection should be established (if possible) and writes should resume.


### What did you see instead?

The underlying TCP connection does not get closed - the kernel sits retransmitting the most recent packets (until `net.ipv4.tcp_retries2` is reached about 17mins later).

Subsequent requests all attempt to use the dead connection and so time out.


### Repro

The following can be used to help repro the behaviour
```go
package main

import (
    "context"
    "fmt"
    "net/http"
    "time"
    "strings"
)

func main() {

    // The url to place requests to
    url := "https://www.bentasker.co.uk/"

    // Create an empty context
    ctx := context.Background()

    // Create an io.Reader from a string to act as the request body
    bodystring := "foo\nbar\nsed"
    myReader := strings.NewReader(bodystring)

    // build client and request
    client := http.Client{
        Timeout: 5 * time.Second,
    }

    // Set up an infinite loop to place requests from
    for {
        req, err := http.NewRequest("POST", url, myReader)
        if err != nil {
            fmt.Println("boo")
        }
        req.Header.Set("Content-Type", "text/plain; charset=utf-8")

        resp, err := client.Do(req.WithContext(ctx))

        if err != nil {
            // Attempt to tear down the connection
            fmt.Println("Timed out")
            client.CloseIdleConnections()
        } else {
            fmt.Println(resp.StatusCode)
        }

        time.Sleep(time.Second)
    }
}
```

If you build and run that, you should see `403` printed on a new line every second or so (it being 403 isn't pertinent here, it's just easy/convenient to generate)

Check which source port is being used
```
root@ratchett:/home/ben/tmp/go_repro# netstat -anp | grep main
tcp        0   2185 192.168.3.217:50446     143.244.38.137:443      ESTABLISHED 2389361/./main   
```

and then block it in `iptables` to break the return path 

```
iptables -I INPUT -p tcp --dport 50446 -j DROP
```

The script should stop printing 403s and after around 5s will print `Timed out`.

Although the code calls `CloseIdleConnections()`, it'll silently fail to do anything to the connection (more below)

Subsequent attempts will also print `Time out` and re-running `netstat` will show that same connection still in an `ESTABLISHED` state. If you take a packet capture it'll show re-transmission of whatever packet we interrupted by adding the rule.

If you remove the `iptables` rule once the remote server starts sending resets, the kernel will tear down the connection and requests will go back to behaving as they should.

----

### Hacking it out

It is actually possible to work around this at the caller's level, although it's not exactly pleasant to look at.

```go
        if err != nil {
            // Attempt to tear down the connection
            fmt.Println("Timed out")

            // prompt the handler into marking the connection as not
           // being reusable
            req.Header.Set("Connection", "close")
            client.Do(req.WithContext(ctx))

            // this is no longer required
            // client.CloseIdleConnections()
        } else {
            fmt.Println(resp.StatusCode)
        }
```

This works because `net/http2` [specifically checks](https://github.com/golang/net/blob/master/http2/transport.go#L1344) for a `Connection: close` header before placing a request, and set's `doNotReuse` on the `ClientConn`, so when the next request comes around, the connection won't be used even if it is otherwise considered still active.

----

### Underlying Issue

The underlying *seems* to be that there's some raciness between normal stream closure and `closeIdleConnections()`.

The crucial check is [here](https://github.com/golang/net/blob/master/http2/transport.go#L1025)
```go
	if len(cc.streams) > 0 || cc.streamsReserved > 0 {
		cc.mu.Unlock()
		return
	}
```

If the map `cc.streams` contains any entries, the connection isn't considered idle and therefore can't be closed. 

Dropping some prints in as a quick test shows it's this conditional we're failing at
```
@@ -7984,6 +7990,8 @@ func (cc *http2ClientConn) forceCloseConn() {
 func (cc *http2ClientConn) closeIfIdle() {
        cc.mu.Lock()
        if len(cc.streams) > 0 || cc.streamsReserved > 0 {
+                fmt.Println("Refusing to close - streams:", len(cc.streams), " reserved:", cc.streamsReserved)
+
                cc.mu.Unlock()
                return
        }
```

Gives
```
Refusing to close - streams: 1  reserved: 0
```

`net/http2` does actually try to tidy up the streams, [`doRequest`](https://github.com/golang/net/blob/master/http2/transport.go#L1299) calls [`cleanupWriteRequest`](https://github.com/golang/net/blob/eb1572ce7f7a6e97ec44c27568286345c2a7748e/http2/transport.go#L1503) which ultimately drops the broken stream.

Adding some more prints though, shows that this only tends to happen **after** the caller has had opportunity to call `closeIdleConnections()`:
```
Refusing to close - streams: 1  reserved: 0
Doing cleanup context deadline exceeded
Cleanup stream 5 got err context deadline exceeded
Must close body:  false
Got err Sent headers
Writing reset
Forgetting ID  5
werr  <nil>
closing
```
The string `Forgetting ID` was added [here](https://github.com/golang/net/blob/eb1572ce7f7a6e97ec44c27568286345c2a7748e/http2/transport.go#L1560).

----

### Raciness

Timings definitely seem to be important here - if the `time.Sleep` call in the repro is adjusted to wait 5 seconds, you'll sometimes see a string of timeouts followed by writes recovering.

That's not the only raciness though.

Stream closures/timeouts etc are handled by a `select` statement [here](https://github.com/golang/net/blob/eb1572ce7f7a6e97ec44c27568286345c2a7748e/http2/transport.go#L1438)
```go
	for {
		select {
		case <-cs.peerClosed:
			return nil
		case <-respHeaderTimer:
			return errTimeout
		case <-respHeaderRecv:
			respHeaderRecv = nil
			respHeaderTimer = nil // keep waiting for END_STREAM
		case <-cs.abort:
			return cs.abortErr
		case <-ctx.Done():
			return ctx.Err()
		case <-cs.reqCancel:
			return errRequestCanceled
		}
	}
```

Sometimes this is triggered by `cs.abort` and sometimes by `ctx.Done()`. This appears to be as a result of an attempt to abort the stream when the context hits deadline - so there's an element of potluck in terms of which gets through (I've not dug into this in too much depth though).

----

### TL:DR

`CloseIdleConnections()` relies on `closeIfIdle()` which will close a connection **only** if there are no active streams on that connection.

However, when a timeout occurs, there may be a delay before the stream is closed (leaving it visible to `closeIfIdle()`). 

By the time the stream closure occurs, the timeout has already been bubbled up to the caller, who may then choose to call `CloseIdleConnections()` knowing that the connection has failed. 

Because the dead connection does not get closed by this call, it remains available for use by the next request.

The result is that requests will fail until the kernel itself gives up on the connection, or a RST or FIN makes it through from the remote end.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/net/http2: raciness in stream close prevents closing of failed connections, blocking subsequent requests #59690

What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (`go env`)?

What did you do?

What did you expect to see?

What did you see instead?

Repro

Hacking it out

Underlying Issue

Raciness

TL:DR

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

x/net/http2: raciness in stream close prevents closing of failed connections, blocking subsequent requests #59690

Description

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

Repro

Hacking it out

Underlying Issue

Raciness

TL:DR

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?