SMTP timeout while connected to <x> after sending data block

Thursday, December 10th, 2009

This is a problem that had been annoying me for a while.

On our outgoing SMTP servers, exim was queuing up messages and failing to deliver them with no clear reason why. It would get partway through the data phase and then the connection would drop. Size did not appear to be a factor.

For example:


Connection timed out: SMTP timeout while connected to
smtpin.blueyonder.virginmedia.com [62.254.123.242] after sending data
block (98268 bytes written)

The problem is also described here:

http://www.mail-archive.com/exim-users@exim.org/msg29283.html

Initially I was thrown off the scent, as this *only* affected my newer outgoing SMTP servers that use standard debs. The affected servers were in different datacenters, with no obvious common link. My old hand-rolled servers were unaffected. My temporary hack was to attempt direct delivery and, if that failed, pass the message to my old (t)rusty hand-rolled servers for delivery. This worked. For months.

Until the other day. Nagios alerted me that we had some messages queued.

When I retired an old Xen dom0, I picked up one of the unaffected SMTP servers and migrated it to a new Xen host. Suddenly it started exhibiting the same problem. The Xen dom0 was 3.3 instead of 3.0, networking had changed from bridged to routed, and I’d installed grub and a kernel to boot it under pygrub. Nothing else had changed.

A friend suggested it must be network related. I figured he was probably right, as that was the main thing that had changed, but what? So I googled "xen networking problems exim sending email" and found my own blog entry on tx checksumming!


  ethtool -K eth0 tx off

Problem solved.

It seems these servers were created before that became part of our standard build, and it had just never caused a problem on the old hardware/network setup. Argh.
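Since the root cause was a setting that silently differed between builds, a small check can catch it before mail starts queuing. Here is a rough sketch, assuming `ethtool -k <iface>` prints a `tx-checksumming:` line on your version; the function names `tx_checksum_state` and `check_tx_off` are made up for illustration, and the Nagios-style exit codes are just one way to wire it into monitoring.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -k <iface>` output on stdin and
# prints the state of TX checksum offload ("on" or "off").
tx_checksum_state() {
    awk -F': *' '/^[ \t]*tx-checksumming:/ { split($2, a, " "); print a[1]; exit }'
}

# Hypothetical check with Nagios-style exit codes: 0 = OK, 2 = CRITICAL.
# Anything other than "off" (including a missing ethtool) is CRITICAL.
check_tx_off() {
    iface="$1"
    state=$(ethtool -k "$iface" 2>/dev/null | tx_checksum_state)
    if [ "$state" = "off" ]; then
        echo "OK: tx checksumming is off on $iface"
        return 0
    fi
    echo "CRITICAL: tx checksumming is '${state:-unknown}' on $iface"
    return 2
}
```

Calling `check_tx_off eth0` on each guest would flag any server that missed the standard build step. Parsing is kept separate from the `ethtool` call so it can be exercised with canned output.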

I think it’s about time I wrote my post on Monitoring Driven Infrastructure to explain why this *should* never happen.