# From 20k to 60k req/sec: Building a High-Performance HTTP Server in Go

*A journey through C, Rust, and Go—learning what performance really means.*

## The Journey
Since December, I've been following a self-directed roadmap to master network programming. The plan was simple: build HTTP/1.1, then HTTP/2, then HTTP/3. The execution? Anything but.
My first HTTP/1.1 server in Go hit about 20k req/sec. Not terrible, but not satisfying. I spent weeks making it "production-ready"—handling edge cases, improving error handling, adding features. Performance barely budged.
So I deleted the repository and started fresh. Same result. I thought, "Maybe the net package is the bottleneck." I implemented my own network layer. Still 20k.
Frustrated, I moved to HTTP/2. But the 20k ceiling haunted me. If nginx could do 50k, the problem wasn't the protocol—it was me.
## The Spiral

- Tried C with epoll. Same benchmarks, more bugs.
- Tried Rust for memory safety. Same disappointing results.
- Stepped away, burnt out.
When I came back, I tried Go again—but this time with a different mindset. Instead of fighting the language, I decided to understand how Go's runtime actually works.
## The Breakthrough
The key wasn't implementing my own epoll or reactor pattern. Go's netpoller already does asynchronous I/O efficiently—I just wasn't using it correctly.
## Three Critical Optimizations

### 1. Per-Connection Buffers

The biggest performance gain came from removing sync.Pool. Instead of sharing buffers across goroutines and paying for the lock contention that entails, each connection owns its buffers:

```go
reader := bufio.NewReaderSize(conn, 8192)
responseBuf := bytes.NewBuffer(make([]byte, 0, 4096))
headers := make(map[string][]byte, 16)

for requestCount := 0; requestCount < 10000; requestCount++ {
	// Reuse the same buffers on every iteration: zero allocations after warmup.
}
```
Each connection handles up to 10,000 requests with the same buffers. Memory cost: ~13KB per connection. For 10,000 concurrent connections, that's only 130MB—a small price for 3x throughput.
Result: 35k → 60k req/sec (+71%)
### 2. Zero-Copy Headers

Instead of converting header values to strings (each conversion allocates and copies), I store them as byte slices that point directly into the read buffer:

```go
// Value slices point into the read buffer—nothing is copied.
Header: map[string][]byte
```

This works because the slices are only read while handling the current request, before the read buffer is refilled for the next one. No copying, no allocations.
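A sketch of what parsing one header line looks like under this scheme. `parseHeaderLine` is my name for illustration; note that the returned value slice aliases the input, while the key's string conversion is the one allocation left on this path:

```go
package main

import "bytes"

// parseHeaderLine splits one "Key: value" line without copying the
// value: the returned slice aliases the input buffer. The key is
// converted to a string so it can serve as a map key.
func parseHeaderLine(line []byte) (key string, value []byte, ok bool) {
	i := bytes.IndexByte(line, ':')
	if i < 0 {
		return "", nil, false
	}
	return string(line[:i]), bytes.TrimSpace(line[i+1:]), true
}
```

bytes.TrimSpace returns a subslice of its argument, so the value never leaves the read buffer.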
### 3. Long-Lived Connections

Connections handle 10,000 requests before closing, amortizing the TCP handshake cost across thousands of requests.

```go
const maxRequests = 10000 // up from 100
```

Result: 25k → 40k req/sec (+60%)
## Why Pooling Failed

Performance decreased as I added connections—the signature of lock contention:

```
 10 conn → 32k req/sec (3,200 per conn)
 50 conn → 40k req/sec   (800 per conn)
100 conn → 35k req/sec   (350 per conn)  ← going backwards!
200 conn → 23k req/sec   (115 per conn)
```

Every request was hitting the pool twice:

```go
headers := pool.Get().(map[string][]byte) // ← lock wait
// ... use headers ...
pool.Put(headers) // ← lock wait again
```
With 200 connections at roughly 100 req/sec each, that's 20,000 requests/sec; two pool operations per request means 40,000 lock acquisitions per second. Using pprof, I found 85% of the time was spent waiting in sync.Pool.Get().
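The two patterns side by side, as a sketch (function names are mine; the elided parsing is where the real work would go):

```go
package main

import "sync"

// Per-request pooling: every request pays two synchronized pool
// operations, and under many connections those serialize.
var headerPool = sync.Pool{
	New: func() any { return make(map[string][]byte, 16) },
}

func handlePooled() {
	headers := headerPool.Get().(map[string][]byte)
	// ... parse the request into headers ...
	for k := range headers {
		delete(headers, k)
	}
	headerPool.Put(headers)
}

// Per-connection ownership: the goroutine allocated the map once and
// never touches shared state on the hot path.
func handleOwned(headers map[string][]byte) {
	for k := range headers {
		delete(headers, k)
	}
	// ... parse the request into headers ...
}
```

The owned version does strictly less work per request: same map clearing, zero shared-state traffic.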
## Performance Evolution
| Optimization | Req/sec | Improvement |
|---|---|---|
| Initial implementation | 19k | baseline |
| Fixed defer + long connections | 25k | +32% |
| Removed sync.Pool | 60k | +216% |
## Final Results

```
$ wrk -t12 -c100 -d60s http://localhost:8080/
Running 1m test @ http://localhost:8080/
  12 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.76ms    1.64ms  21.08ms   78.71%
    Req/Sec    15.10k     1.63k   20.20k    70.42%
  3,333,561 requests in 1.00m, 352.88MB read
Requests/sec:  57,473.17
Transfer/sec:       5.87MB
```
## Hardware

HP ProBook x360 11 G1 EE (Intel Pentium N4200, 4 cores)
## What I Learned

### Performance isn't about the language

C didn't magically make things faster. Rust didn't either. The problem was never Go—it was my understanding of how to write performant Go.

### Common wisdom isn't always right

sync.Pool is recommended for reducing allocations. But in my case, it introduced lock contention that increased overhead. Benchmark your specific workload.

### Profile everything

I thought my parser was slow. It was the pools. Use pprof to find the actual bottleneck.
## What's Next
This server doesn't handle edge cases, TLS, or half the HTTP/1.1 spec. It's intentionally minimal—just enough to benchmark and learn.
But that's the point. I didn't set out to build a production web server. I wanted to understand why nginx is fast, why my first attempts weren't, and what makes a difference.
Now I'm ready to move on to HTTP/2. But this time, I'll carry forward what I learned about Go's runtime, memory allocation patterns, and the importance of benchmarking assumptions.
Network programming and concurrency in Go are fascinating—and I'm just getting started. If you're learning performance optimization, I hope this helps you avoid some of the dead ends I hit.