hirez: (Armalite rifle)
[personal profile] hirez
(This will be incoherent and likely contain swearing. I am still mostly running on snot and bile.)

There's a montage sequence in 'The First of the Few' where R. J. Mitchell is smoking a pipe at a problem and the fellows in the brown coats on the shop-floor are working on lathes and vices in order to build one or other of his fine seaplanes. However, there's some confusion over the routing of the oil-lines, so one of the engineer types has to beetle off to the design shop and ask Mr Mitchell about it. Yer man taps the drawings with his pipe, admits that it's not at all clear and promises to have a new design by the AM.

Nothing particularly strange going on there. A new thing is carefully considered over pipes and pints of mild, mock-ups are tested and drawings filled with well-specified terms are made up such that many examples of the new thing can be made without (by and large) the designer being on hand to personally oversee each one.

So I think it would be really quite nice if system administration could perhaps consider getting a clue and using tools and practices that have been with us since the industrial revolution.

There are no particularly good reasons for machines that are hand-built and/or infrastructures that lack DNS, SSO, central logging, patch-management, security management, trivially repeatable machine instantiation or useful reporting/instrumentation.

And yet. The attitude that such things are a bit hard or strange is part of the background noise.

For instance. Puppet's got the makings of quite a useful machine-management tool. However, one of the early types and/or examples was for login management by hand-hacking the passwd file and copying around SSH keys. Which, what?

I believe we've had working Kerberos + LDAP for the thick end of a decade, and yet people still think that keyed SSH access is pretty swish? Jayzus.

Still, I suppose it's not a set of shared root passwords of different classes, depending on machine type. No-one's daft enough to use that any more...

And. Why are people still surprised when disks fill up? I can see that some Java job going bugfuck and filling the /var partition (Oh, wait, we're all on fucking Linux now so it's all one big / partition. D'oh!) might be something of a black swan, but taking some readings and spotting a change in disk-usage delta isn't entirely rocket science.
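Taking those readings really is the easy bit. A minimal sketch of spotting a change in disk-usage delta might look like this — the threshold, state-file path and the choice of watching `/` are all illustrative assumptions, not anyone's production monitoring:

```shell
#!/bin/sh
# Record today's usage of /, compare it with the last recorded reading,
# and shout if the partition grew by more than THRESHOLD points between
# runs. Run it from cron; paths and numbers here are made up.
THRESHOLD=5                                  # alert if / grows >5 points/run
STATE="${TMPDIR:-/tmp}/disk-usage.prev"      # where the last reading is kept

# Current usage of / as a bare percentage, e.g. "42"
today=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

if [ -f "$STATE" ]; then
    delta=$(( today - $(cat "$STATE") ))
    if [ "$delta" -gt "$THRESHOLD" ]; then
        echo "WARNING: / grew ${delta} points since last check" >&2
    fi
fi
echo "$today" > "$STATE"
```

On the first run it just records a baseline; every run after that it has a delta to compare.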

And. Machine specification and swapfile sizing is still bloody voodoo.

Date: 2010-09-18 03:18 pm (UTC)
From: [identity profile] solipsistnation.livejournal.com
Yeah, sometime last year one of the school of engineering admins posted to an internal mailing list asking how one would maintain configurations for his ten or so systems (not all the same: some Solaris, some various Linuxes) in case he had to restore them. The plan if a system died was, I guess, to reinstall the OS by hand and then, uh, copy the configs back into place?

I'm currently maintaining configs for sixty or so Linux and Solaris systems using cfengine and kickstart/jumpstart, and I had to try not to be snotty about being able to reinstall a system from bare metal with basically no interaction beyond doing whatever it takes to PXE-boot it...

I'm still surprised when even huge companies you'd think would know better obviously haven't bothered with any kind of management infrastructure. Not to name any names, but I worked for six months at a place you've heard of and spent WEEKS doing more-or-less by-hand system installs on production systems, which were then never patched or updated again. News of their financial difficulties was not unexpected, considering how much effort and expense they put into doing things in as difficult a way as possible.

Date: 2010-09-18 03:57 pm (UTC)
From: [identity profile] hirez.livejournal.com
I can see that there's a minimum number of boxes for which some measure of automation 'isn't worth the bother', but I think that's a moveable feast depending on the relevant automation and the number of times you've done it.

I mean, I run a nameserver & DHCP on my home network because both of those things are slightly simpler than falling off a log.
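For the home-network case it really is about a dozen lines of config: dnsmasq does both jobs in one daemon. Something like the following — the domain, address range and MAC address are all invented for illustration:

```
# Hypothetical /etc/dnsmasq.conf for a small home network.
domain=home.example                 # local domain handed to DHCP clients
local=/home.example/                # never forward these queries upstream
dhcp-range=192.168.1.100,192.168.1.200,12h
# Pin the file server to a fixed address by MAC (address is made up)
dhcp-host=aa:bb:cc:dd:ee:ff,fileserver,192.168.1.10
```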

Kerberos is worth the bother as soon as someone leaves and you have to change all the passwords.

Running a local repo/package-management rig is a right faff the first time, but...

Central logging happens the day after a box gets owned.
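And the client side of central logging is similarly little bother to start: with rsyslog it can be one forwarding rule, something like this (the log host name is invented):

```
# /etc/rsyslog.d/central.conf (hypothetical): forward everything over
# TCP to the central log host. A single '@' instead of '@@' would use UDP.
*.* @@loghost.home.example:514
```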

Etc.

Date: 2010-09-18 03:25 pm (UTC)
From: [identity profile] pir.livejournal.com
Indeed. I've been working on machine build automation and standardisation for about two thirds of my career. Of course the current bunch have solved these problems in rather different and interesting ways so it hasn't been a problem most of my time there.

Much of my time before then was spent running machines that had to keep working, and that you had to be able to log into no matter what else was broken, so centralised login stuff was out. It can be hacked up so things keep working through outages (cached creds), but mostly it's still more effort than it's worth unless you're at quite large scale or you really want to keep a bunch of machines identically useful as workstations and such.

Date: 2010-09-18 03:51 pm (UTC)
From: [identity profile] hirez.livejournal.com
I think HPLB was the first place I worked that had enough machines for this sort of thing to be a concern. Before that, you could remember each box and what it did.

Thus in some ways I'm a bit late to the party and this is the sound of me working things out in longhand.

It seems to me that the backup plan for SSO going bugfuck (which generally means that the network's expired and you have bigger problems than not being able to login as yourself) was OOB remote login and the big list of root p/ws that was kept in the fire safe (in a different building, obv). Certainly that approach Worked For Us when Slammer melted one network segment, but there are always corner cases.

Date: 2010-09-18 04:16 pm (UTC)
From: [identity profile] pir.livejournal.com
SSO going away can be caused by far more than just the network going away entirely: an odd bug in your service infrastructure, network partitioning, etc. OOB is emergency login; I'm talking about name servers and other infrastructure stuff keeping functioning and being accessible without any external machines running, since historically I deal with infrastructure.

Sure, centralised login is a lower risk to a machine than most services, but when you only have a small sysadmin team with accounts, and odd side effects happen, it can be sensible to keep those accounts reliant on nothing else (in that case I had a system to sync accounts out for the sysadmins), so on the odd occasion when everything goes down you can be sure the machines come up cleanly... which is important for central infrastructure.

Date: 2010-09-18 04:01 pm (UTC)
From: [personal profile] zotz
My debian box currently has its filesystem across eight partitions on four disks, mostly put in place by the installer. People not bothering isn't the OS's fault.

Date: 2010-09-18 04:09 pm (UTC)
From: [identity profile] hirez.livejournal.com
IIRC, the Beardian installer asks which layout you'd like, but goes 'all in one big partition' if you're not sure.

In theory, it shouldn't be a problem since we have resizeable filesystems on most useful kit. Also, if you've an automated build process, not including some basic machine-state monitoring is a bit careless.

Date: 2010-09-20 11:36 am (UTC)
From: [identity profile] aoakley.livejournal.com
One of the annoyances I have is that most of the time, I want /home and /var on one partition and everything else on another, but the installer makes it difficult to achieve that straight off the bat.

But in essence, all I'm trying to do is put /var on a separate partition and then ln -s /var/home /home

The problem is that whilst partitioning is a sensible thing to do, there aren't any good defaults other than "on a desktop install you probably just want everything on one partition". For any value of server... it depends.
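The symlink arrangement above can be sketched in a scratch directory, without touching a live root filesystem — here the mktemp dir stands in for `/`:

```shell
# Demonstrate the /home -> /var/home symlink trick in a throwaway dir.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/var/home"       # /var (its own partition) carries the homes
ln -s var/home "$ROOT/home"     # /home is just a relative symlink into /var
touch "$ROOT/home/testfile"     # writes through the link land under /var
```

On a real install the same two commands after mounting /var get you the two-partition layout with /home riding along on /var.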

Date: 2010-09-18 05:04 pm (UTC)
From: [identity profile] nalsa.livejournal.com
Fucking Solaris.

C'est tout.

Date: 2010-09-18 05:21 pm (UTC)
From: [personal profile] reddragdiva
I must confess I'm still trying to work out what to do about this sort of thing myself. I believe it was when we cracked sixty live front-end app instances that my boss (also a 43-year-old rocknerd with more than 0 small children) and I realised "our automation leaves something to be desired."

A lot of my work has been decrufting our chain in general. It's amazing the festering shite I've found. Like discovering the six month old hamburger that fell behind the cooker. "Ah, that's where the maggots are coming from!"

I was looking at Puppet and cfengine and couldn't quite work out from their descriptions if they would do what I was thinking of. I'm seriously this close to writing my own configuration pusher using svn, shell scripts and scp.

Date: 2010-09-18 06:21 pm (UTC)
From: [identity profile] solipsistnation.livejournal.com
What are you trying to do? I've gotten WAY into cfengine (which is, I guess, kind of old-school now, and all the COOL sysadmins are using Puppet these days) and I might be able to give you some idea...

The cfengine docs are kind of stupid.

Date: 2010-09-18 06:27 pm (UTC)
From: [personal profile] reddragdiva
I actually ... don't remember. I think I asked somewhere and got various suggestions. I do remember thinking "oh ghod, don't let me have to write my own."

At the moment I want to type "do this please" (whatever the command is) and have everything I need set up for me.

Date: 2010-09-18 10:27 pm (UTC)
From: [identity profile] hirez.livejournal.com
Puppet is so last year.

The cool kids are using Chef. The seriously cool ones are using Kokki. Probably.
Edited Date: 2010-09-18 10:30 pm (UTC)

Date: 2010-09-19 09:35 am (UTC)
From: [personal profile] reddragdiva
I think one reason we still build our own is that we look at our Windows brethren who have automated management out to here and there's still three times as many of them for the same number of boxes and they still spend their lives watching progress bars.

Date: 2010-09-19 09:58 am (UTC)
From: [identity profile] hirez.livejournal.com
Oh God Yes. I have watched other people use, um, many different ones, and they all appear differently rubbish. There seemed to be a daily loop round the machine rooms to kick the boxes that had failed to update properly at one point. I think that was M$ SMS.

However, I've gone well past the number of machines where I'm happy going 'for $machine in @list; do ssh $machine, etc' because it's shit.

.bash_history is not a substitute for documentation.
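For what it's worth, the loop being dismissed above usually ends up as something like this tidied sketch (host list, parallelism and the command template are all placeholders) — which shows exactly why it stops scaling: you get a pile of per-host logs and still nothing recording *why* anything was run:

```shell
# run_on_all: fan a command template out over a host list in parallel,
# keeping per-host output in logs/. '{}' in the template becomes the
# hostname. A tidier 'for machine in list; do ssh ...' -- and still no
# substitute for real configuration management (or documentation).
run_on_all() {
    hosts=$1; shift                  # file of hostnames, one per line
    mkdir -p logs
    # -P4: up to four hosts at once; -I{}: substitute each hostname
    xargs -P4 -I{} sh -c "$* > logs/{}.out 2>&1" < "$hosts"
}
```

Called as, say, `run_on_all hosts.txt 'ssh -o BatchMode=yes {} uptime'`, leaving each host's output in logs/&lt;host&gt;.out.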

Date: 2010-09-19 11:21 am (UTC)
From: [personal profile] reddragdiva
No indeed. I've been working on achieving mere documentation. The internal wiki is a combination of notes-to-self, hit-by-a-bus book and guide to where the bodies are buried and why.

Powered by Dreamwidth Studios