Screwed Server Checklist

Posted on Sat 28 April 2012 in Tech

My servers started getting unusably slow at peak hours recently, and mid-panic I decided to vaguely document the various things I went through to narrow down the problem. Anyway, I'm sticking them up here so a) I don't lose the list and b) it might be of use to someone else. Some are obvious, some less so. I've already forgotten half of what I did to fix it, so here's the other half before I forget anything else.

Note: this is only meant to give you a pragmatic, last-ditch attempt at roughly diagnosing problems. In essence, it's an idea of what to look for and where to dig when you come across signs of strangeness. I'm sure Nagios, Server Density, Munin or whatever else can diagnose these things far quicker and better, so just assume we're far beyond clever solutions. It's essentially for idiots like me who barely change any of their servers' configuration to handle scaling at all well.

If you have other handy tips I should be aware of, let me know here.

This is meant for a LAMP install (more specifically Ubuntu), so it may be slanted that way, but most of the tips should work regardless of your distro of choice.

Processes

top/htop -> An obvious one; this should be your first point of entry. Check the load average and see which processes are hogging CPU or memory.
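If you'd rather grab a quick snapshot than stare at top, ps can sort by usage directly. A minimal sketch:

ps aux --sort=-%cpu | head -n 11    # top 10 CPU hogs (first line is the header)
ps aux --sort=-%mem | head -n 11    # top 10 memory hogs
uptime                              # load averages over the last 1, 5 and 15 minutes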

Memory

free -m -> Free RAM, in megabytes. Running low? Then there could be a lot of IO to disk from swapping. Again, obvious.
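To confirm you're actually swapping rather than just running tight, vmstat is handy. A quick sketch:

vmstat 1 5    # five samples, one per second; non-zero si/so columns mean swap activity, high wa means IO wait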

Hard-Disk

df -h -> Another obvious one: is your disk full?
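If it is, du will point you at the culprit, and remember you can also run out of inodes while df -h still shows free space. A sketch (/var is just the usual suspect thanks to log files):

sudo du -sx /var/* 2>/dev/null | sort -n | tail    # biggest directories under /var, sizes in KB
df -i                                              # inode usage per filesystem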

Network

sudo iftop -> This one requires installation on Ubuntu, but it's in apt so it shouldn't be a big deal (unless, of course, your network is in fact screwed). Anyway, check those numbers, check your bandwidth restrictions, then do the math.
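For completeness, installing it and pointing it at the right interface looks like this (eth0 is an assumption; check ifconfig for your actual interface):

sudo apt-get install iftop
sudo iftop -i eth0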

DNS

Does banging the server's IP into your local hosts file speed things up? If so, it could be a DNS issue.
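A rough sketch of that test, with a placeholder IP and domain (use your own):

echo '203.0.113.10 example.com' | sudo tee -a /etc/hosts    # force the lookup locally
time curl -s -o /dev/null http://example.com/               # compare timings with the entry in and out

dig example.com will also print a query time at the bottom, which tells you how slow the lookup itself is.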

MySQL

watch -n 1 mysqladmin --user=mysql_user --password=mysql_password processlist

-> This command will watch live MySQL queries as they're being executed. Noticing any crap queries running insanely long? This could be your issue.
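If you'd rather not sit watching, the slow query log (MySQL 5.1 or newer) records the long-running stuff for you. A sketch, with the two-second threshold being my own arbitrary choice:

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;              -- log anything running over 2 seconds
SHOW VARIABLES LIKE 'slow_query_log_file';   -- then tail the file this reports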

And directly related to slow crap queries: do you have any large tables that might be causing a serious performance issue? Here's a MySQL query to grab the largest tables on your server.

SELECT CONCAT(table_schema, '.', table_name),
       CONCAT(ROUND(table_rows / 1000000, 2), 'M') rows,
       CONCAT(ROUND(data_length / (1024 * 1024 * 1024), 2), 'G') DATA,
       CONCAT(ROUND(index_length / (1024 * 1024 * 1024), 2), 'G') idx,
       CONCAT(ROUND((data_length + index_length) / (1024 * 1024 * 1024), 2), 'G') total_size,
       ROUND(index_length / data_length, 2) idxfrac
FROM information_schema.TABLES
ORDER BY data_length + index_length DESC
LIMIT 10;

Index your tables better, or truncate crap you don't need.
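On the indexing front, EXPLAIN tells you whether a slow query is doing a full table scan. A sketch with a made-up table and column:

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;        -- "type: ALL" means a full scan
ALTER TABLE orders ADD INDEX idx_customer (customer_id);    -- can lock a big table for a while, so do it off-peak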

Apache

ps ax | grep apache | wc -l

Let's assume you didn't change Ubuntu's default Apache configuration: if your MaxClients setting is at 150 and the value returned by the above command is stuck at over 150 (the grep matches its own entry in the ps output, so the count runs one high), then here's your problem. It may "look" like your network is slow, but in fact it's Apache being forced to queue requests.
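To avoid counting the grep itself, a couple of equivalent sketches (Ubuntu names the processes apache2):

ps ax | grep '[a]pache' | wc -l    # the [a] stops grep matching its own entry
pgrep -c apache2                   # or just count the apache2 processes directly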

Run ps aux | grep apache and see how much RAM Apache is taking on your server (you should see roughly what each concurrent connection is costing in terms of RAM percentage). Do some math and see if you have the free RAM to increase MaxClients; if you can, do so.
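A sketch of that math, assuming your processes are named apache2: average the resident memory per process, then divide your spare RAM (after MySQL and the OS take their share) by that figure to see how high MaxClients can safely go.

ps -C apache2 -o rss= | awk '{sum+=$1; n++} END {if (n) printf "%.1f MB average per apache2 process\n", sum/n/1024}'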

If not, reduce the KeepAlive timeout (this isn't as scary as it sounds; quite a few major sites run without keepalive at all), but this entirely depends on your site's usage. Lowering keepalive will lose you some speed per visitor, but you're essentially cycling through the concurrent requests faster, so request 151 doesn't have to wait as long. You need to decide whether less speed for more people is better than full speed with a hard concurrency limit.
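For reference, the relevant knobs live in /etc/apache2/apache2.conf on Ubuntu; the values below are illustrative, not recommendations:

KeepAlive On
MaxKeepAliveRequests 100
# Ubuntu's stock timeout is 5 seconds; lower frees workers for waiting clients sooner
KeepAliveTimeout 2

Then sudo service apache2 reload to pick up the change.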

You can also enable some of Apache's newer, cleverer keepalive handling (the event MPM, for instance, is built around serving keepalive connections more efficiently), but again that goes beyond the scope of the post.