I'm on StackOverflow podcast this week!

I'm the guest on the StackOverflow podcast with Joel Spolsky and Jeff Atwood.  w00t!

We talk about everything from early internet history, to IPv6, to how to scale a really really really large web site. Here's the episode page and link to the audio. (Tons of links on the episode page).

Check out their sysadmin Q&A website at http://serverfault.com

Thanks to Joel and Jeff for having me on the show!  I listen every week!

Posted by Tom Limoncelli at November 25, 2009 9:38 AM | Comments (1) | TrackBack

Can my SLA rule work for networks? Yes.

Last week I mentioned that if you have a service that requires a certain SLA, it can't depend on things with a lesser SLA.

My networking friends balked and said that this isn't a valid rule for networks. I think that violations of this rule are so rare they are hard to imagine. Or, better stated, networking people do this so naturally that it is hard to imagine violating this rule.

However, here are 3 from my experience:


  • Situation: A company whose internet connection is a DSL modem. The modem is in the hallway near the computer room, but not in the computer room. As a result, when someone knocks the modem over, the company's website is down. (The web site depends on the router.) Improvement: move the router into the computer room.
  • A computer room with excellent UPS and power infrastructure... but the router isn't on the UPS for weird historical reasons (it is depending on external power). Improvement: move the router onto the UPS.
  • An excellent computer room with fine ethernet switches... but the router is in the lab one room over. Each VLAN has its own physical cable that runs to that other room. I was told, "the researchers are doing some experiments on the router so they wanted it in their lab". Improvement: move the router into the computer room.

3 true stories.

Posted by Tom Limoncelli at November 19, 2009 12:44 PM | Comments (2) | TrackBack

Run, run, run, dead.

I assume you have some kind of automated monitoring system that watches over your servers, networks and services. Service monitoring is important to a functioning system. It isn't a service if it isn't monitored. If there is no monitoring then you're just running software.

Monitoring "is it down?" is reactive. It is better than no monitoring at all, but all it tells you is that there is already a problem. Monitoring is better when it predicts the future and prevents problems.

An analog radio (one with an old-fashioned vacuum tube) sounds great at first, but you hear more static when the tube starts to wear out. Then the tube dies and you hear nothing. If you change the tube when it starts to degrade, you'll never have a dead radio. (Assume, of course, you change the tube when your favorite radio show isn't on.)

A transistor radio, on the other hand, is digital. It plays and plays and plays and then stops. Now, during your favorite song, you have to repair it.

At Bell Labs someone called this the "run run run dead" syndrome of digital electronics.

How can we monitor computers and networks in a way that makes it more like analog electronics? There are some simple tricks we can use when monitoring to be "more like analog."

One trick is to stop monitoring "is it up?" and monitor "how fast is it?" instead. Don't measure "can I ping the server", measure "ping response time" and alert if the replies are very slow (where "no reply" is veeeerrrry slow). Better yet, don't measure ping, measure the service's performance directly: measure web-site latency, measure time-to-first-byte-received, send a test email to a relay and measure how long it takes to come back.
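Here is a minimal sketch in Python of what "how fast is it?" monitoring might look like. The thresholds, the function names, and the idea of passing in a `probe` callable are all my own assumptions for illustration, not part of any particular monitoring package. The key point is that a failed probe is not a special "down" state; it is simply recorded as a very large latency.

```python
import time
import urllib.request
from typing import Callable

# Hypothetical thresholds; tune them to your own service.
WARN_SECONDS = 0.5
CRIT_SECONDS = 2.0
TIMEOUT_SECONDS = 10.0  # "no reply" is recorded as this very slow value


def measure_latency(probe: Callable[[], None],
                    clock: Callable[[], float] = time.monotonic) -> float:
    """Run one probe and return the elapsed time in seconds.

    A probe that raises (timeout, refused connection, ...) is treated as
    "veeeerrrry slow" instead of as a separate up/down state.
    """
    start = clock()
    try:
        probe()
    except Exception:
        return TIMEOUT_SECONDS
    return min(clock() - start, TIMEOUT_SECONDS)


def classify(latency: float) -> str:
    """Map a latency in seconds to an alert level."""
    if latency >= CRIT_SECONDS:
        return "critical"
    if latency >= WARN_SECONDS:
        return "warning"
    return "ok"


def http_probe(url: str) -> Callable[[], None]:
    """Build a probe that requests a URL and reads the first byte of the body."""
    def probe() -> None:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            resp.read(1)
    return probe
```

Something like `classify(measure_latency(http_probe("http://www.example.com/")))` would then report "ok", "warning", or "critical". The same `measure_latency` could wrap a probe that sends a test email and waits for it to come back.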

We can also monitor things that portend (predict) future failures. For example, measure the QPS (Queries Per Second) our website is receiving. If there is a sharp increase, that's an indication that we might have problems if it increases much further. How many monitoring packages can do that kind of calculus?
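The "calculus" does not have to be fancy. As a rough sketch (the function name, the window size, and the 1.5x ratio are assumptions I picked for illustration), we can compare the average of the most recent QPS readings with the average of the readings just before them:

```python
from statistics import mean
from typing import Sequence


def qps_surge(samples: Sequence[float], window: int = 5,
              max_ratio: float = 1.5) -> bool:
    """Return True when recent QPS rose sharply compared with the baseline.

    samples: QPS readings, oldest first, taken at a fixed interval.
    The last `window` readings are compared with the `window` before them.
    """
    if len(samples) < 2 * window:
        return False  # not enough history to judge
    baseline = mean(samples[-2 * window:-window])
    recent = mean(samples[-window:])
    if baseline == 0:
        return recent > 0
    return recent / baseline >= max_ratio
```

Averaging over a window instead of comparing two single readings keeps one noisy sample from paging anyone.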

We can monitor our Internet connection and keep historical data. Draw a graph that shows usage over the last year and notice the trajectory as the line goes up and up. Anyone can eyeball the graph and predict that in 3 months we'll be out of capacity. Luckily it takes 2 months to get more capacity approved, paid for, ordered, and deployed. We have predicted the future. Without good monitoring, we find ourselves with an overloaded connection; and the only thing we can predict is 2 months of unhappy users complaining and complaining.
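Eyeballing the graph works, but the same prediction can be sketched in a few lines of Python. This assumes one usage reading per month and a straight-line trend, which is a simplification; the function name and the sample numbers below are made up for illustration.

```python
from typing import Optional, Sequence


def months_until_full(usage: Sequence[float],
                      capacity: float) -> Optional[float]:
    """Fit a straight line to monthly usage and project when it hits capacity.

    usage: one reading per month, oldest first (for example, peak Mbps).
    Returns months counted from the latest reading, or None if usage
    is flat or shrinking.
    """
    n = len(usage)
    if n < 2:
        return None
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(usage) / n
    # Ordinary least-squares slope and intercept.
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, usage))
             / sum((x - x_mean) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    full_at = (capacity - intercept) / slope  # month index where the line meets capacity
    return max(0.0, full_at - (n - 1))
```

With a year of readings that grow from 40 to 95 in steps of 5 and a capacity of 110, this projects about 3 months of headroom. If the answer ever drops below the 2 months it takes to get more capacity deployed, that is the time to alert.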

What is an outage? To me an outage is something that is customer-facing. The users felt pain. Either the VPN system went down, or the VPN system was too slow to be usable. More formally, an outage is any time we miss our SLA. The problem is that a lot of people don't work in an environment that has written SLAs so we must invent them ad hoc.

After an outage has been resolved and we've had time to calm down, take a moment to look back at the situation and figure out a monitoring rule that would have predicted this problem. Add (or update) two sets of rules. The first set detects the outage. If such a rule exists already, maybe it needs to be fine-tuned or somehow updated. Maybe the SLA changed and we didn't update the rule. Or we don't have a written SLA, and we need to make the rule tighter or looser to match customer expectations of what our SLA should be. The second set of rules should collect data that would help us prevent that particular outage in the future. This second group might be very specific to particular links, settings, or components.

Eventually we grow our monitoring system not based on what we think is good, but what we've learned over time is good. Write comments into the configuration to list what inspired the rule and document what you think should be done if this rule gets triggered. When it does get triggered, update this documentation with what you learned. This makes your monitoring system evolve and grow like a Wiki.
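Every monitoring system has its own configuration syntax, so take the following only as a sketch of what to record, written in Python. The field names, the rule, and the outage it mentions are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertRule:
    name: str
    condition: str       # expression in whatever syntax your monitoring system uses
    inspired_by: str     # the outage or lesson that led to this rule
    when_triggered: str  # what we think should be done if it fires
    lessons: List[str] = field(default_factory=list)

    def record_lesson(self, note: str) -> None:
        """Append what we learned the last time this rule fired."""
        self.lessons.append(note)


# Hypothetical example rule; the details are made up.
vpn_slow = AlertRule(
    name="vpn-login-latency",
    condition="vpn_login_seconds > 5 for 10 minutes",
    inspired_by="Users reported the VPN was too slow to be usable.",
    when_triggered="Check concentrator CPU first, then the upstream link.",
)
vpn_slow.record_lesson("Upstream link was saturated by backups.")
```

Whether this lives in comments, a structured field, or a linked wiki page matters less than keeping it next to the rule so it gets updated when the rule fires.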

Over time we'll have an optimal monitoring system for our environment.

As computers have become cheaper we use more redundancy to make them more reliable. We don't rely on a single disk, we put them in redundant (non-striped) RAID sets. We don't depend on one router, but we use VRRP so one can fail and packets still get through. We don't have one web server, we have a group of web servers behind a load balancer. We don't have one load balancer, we buy them in pairs and use an active-active or active-passive configuration.

When we have N+1 redundancy, things are more like analog than digital. If we have 20 machines behind a web load balancer, we don't panic if one goes down; we monitor that "at least 80% are up". That is more like an analog system. Two web servers going down is like static on the radio. Our monitoring should reflect this. We can measure the time a particular query takes to complete and alert depending on what we see, relative to a target of x ms: Well below x, fine. Lower than x but rising quickly, time to add more web servers. Nearly x for extended periods of time, better order more web servers. Higher than x for an extended amount of time, steal web servers from other services. Analog.
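A rough Python sketch of those graded responses might look like this. The function names and the `near` and `rising` factors are assumptions I chose for illustration; "extended period" is simply taken to mean "every reading in the list we were handed".

```python
from statistics import mean
from typing import Sequence


def pool_ok(up: int, total: int, min_fraction: float = 0.8) -> bool:
    """True when at least `min_fraction` of the servers in the pool are up."""
    return total > 0 and up / total >= min_fraction


def capacity_advice(latencies_ms: Sequence[float], x_ms: float,
                    near: float = 0.9, rising: float = 1.25) -> str:
    """Turn response times (oldest first) into one of the graded responses.

    x_ms is the response-time target. The list is assumed to cover an
    extended period, so "every reading" stands in for "extended".
    """
    if len(latencies_ms) < 2:
        return "not enough data"
    if min(latencies_ms) > x_ms:
        return "steal web servers from other services"
    if min(latencies_ms) >= near * x_ms:
        return "order more web servers"
    half = len(latencies_ms) // 2
    older = mean(latencies_ms[:half])
    newer = mean(latencies_ms[half:])
    if newer >= rising * older:
        return "add more web servers"
    return "fine"
```

Note that none of the answers is "up" or "down". Each one is a point on a dial.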

The beauty of N+1 redundancy is that it decouples component failure from outages. In the old days, a component failure equalled an outage. A disk died, and the file server was down all day as we restored data from backups. Now our disks are in a RAID set and a single one going south just means a hot spare will be used to get us back to N+1. We can stop monitoring "is it down?" but instead monitor that each RAID group is at N+1 redundancy, that the entire RAID chassis has at least X hot spares in the pool, and that the data is accessible. (X is based on how far we are from the hardware. If we are a consultant that visits the site once a month, more is better. If we work in the building, less is needed.)
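As a sketch, the three checks could be expressed like this in Python. How you collect disk counts and spare counts depends entirely on your RAID hardware, so the `RaidGroup` structure and the function inputs here are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class RaidGroup:
    name: str
    disks_needed: int   # N: disks required to hold the data
    disks_healthy: int  # disks currently working in the group


def raid_problems(groups: Sequence[RaidGroup], hot_spares: int,
                  min_hot_spares: int, data_accessible: bool) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for group in groups:
        if group.disks_healthy < group.disks_needed + 1:
            problems.append(f"{group.name}: below N+1 redundancy")
    if hot_spares < min_hot_spares:
        problems.append(
            f"only {hot_spares} hot spares, want at least {min_hot_spares}")
    if not data_accessible:
        problems.append("data is not accessible")
    return problems
```

A single dead disk shows up here as a spare count that dropped, not as an outage. `min_hot_spares` is the X from the text: set it higher the farther you are from the hardware.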

It can be difficult to adjust your thinking to be more analog than digital. Start with one small change, like monitoring average queue length on a router port, or time to complete a web query. Once you learn how to do this with your monitoring system for one aspect or service, doing it again and again becomes easier.

Then...you'll be better at predicting the future.

Posted by Tom Limoncelli at November 19, 2009 8:29 AM | Comments (5) | TrackBack

Time Management Wiki

http://wiki.everythingsysadmin.com is my wiki for my various books. The Time Management sub-wiki has a lot of useful tools and products.

There is a link to this on the front page of my web site but nobody seems to notice it. Maybe I need someone with amazing artistic skills to design a logo. Volunteers?

Posted by Tom Limoncelli at November 13, 2009 2:45 PM | Comments (0) | TrackBack

Thanks to everyone that attended my talks at LISA 2009!

Thanks to everyone that attended my tutorials.  They were the #1 and #3 most attended tutorials, topping out at nearly 80 people each. The BOFs I held had packed rooms too. Thanks for your attention. I hope you walked away smarter and feeling better about your time management and sysadmin skills.

I'm involved in 2 more events: Tonight I am the EmCee of the Google VendorBOF (9pm), and on Friday I am co-hosting a "guru" talk on job hunting (2pm).

If you are attending CHIMIT this weekend (or are considering on-site registration), I'm facilitating a panel on Saturday afternoon.  Hope to see you there!

Posted by Tom Limoncelli at November 5, 2009 3:19 PM | Comments (0) | TrackBack

Tom @ LISA '09, Nov 1-6, 2009, Baltimore, MD

Tom will be very busy at LISA 2009 both teaching and hosting various events:
  • Half-day Tutorial: "Time Management for System Administrators: A New Approach"
  • Half-day Tutorial: "Design Patterns for System Administrators"
  • Guru Session: Job Hunting (with Andy Lester from http://theworkinggeek.com, author of the new book Land The Tech Job You Love)
  • BoFs: I'll be hosting 3 BoFs: a 2-hour BoF on "Time Management", a BoF for the open-source project called Ganeti, and this year's LGBT and Allies BoF, where I'll be MC.
  • I'll be at the Google Vendor BoF on Thursday night to answer questions about what it is like to work at Google.
  • Book signing: I hope to arrange a Book Signing during the vendor show.
With all that happening, I've also set a goal for myself to read all the papers before the conference. I promise myself that I'll do this every year, but this time I'm actually setting aside time to do the reading. Let's see how that goes.

LISA is rarely in the Washington D.C. area. This gives locals an opportunity to attend LISA without having to pay for travel (I'll be taking Amtrak!). Hopefully this means there will be a lot of new faces. Hope to see you there! Registration is open.

Posted by Tom Limoncelli at November 1, 2009 10:10 AM | Comments (0) | TrackBack