Similarly, I can see that considering the known issues with top-down vs. bottom-up city planning/evolution could be beneficial for software-centric organizations too; the issues with badly-fit top-down city plans seem to match very well with the pains of an ill-fit software architecture that's mandated from an ivory tower, complete with users using the planned cities "wrong".
I'm sure there are differences though. You have a lot more observability into your software systems, and at the end of the day they are orders of magnitude less complicated than cities, so you can comprehend more of the system at once and genuinely find common use cases to standardize around. This is in contrast to cities, where it's impossible to really know every citizen's unique needs, temperament, and usage patterns.
Worth thinking about more; given the relatively low cross-pollination rates between the fields, I suspect there are more lessons that software engineers could glean from architecture and city planning.
I think we might be able to get by with this perspective so long as we're seeing computers/systems only as inert tools. It's interesting to consider whether there's any motivation for that to change as we move towards more ubiquitous and intelligent computing. (E.g.: should IoT devices be thought of as akin to insects?)
You could argue that Netflix-style chaos engineering is an attempt to introduce more resilience into the system precisely by mimicking nature's "anything can die at any moment" principle, but even then it typically only applies to computers. Netflix is known for firing fast, but I don't think even they would consider randomly firing employees to make sure there are no single points of failure in the employee makeup. It would be interesting, though: tax filings need to be submitted next Tuesday, but the CFO was just fired; what is your recovery plan?
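The "anything can die at any moment" idea is simple enough to sketch. Here's a minimal, hypothetical chaos-monkey in Python; real tools like Netflix's Chaos Monkey integrate with the deployment platform and opt-in groups rather than operating on a plain list, so treat the names and the `kill_fraction` parameter as made up for illustration:

```python
import random

def chaos_monkey(instances, kill_fraction=0.1, rng=None):
    """Randomly pick instances to terminate, mimicking nature's
    'anything can die at any moment' principle. Anything that
    survives repeated runs of this has, by construction, no single
    point of failure among these instances."""
    rng = rng or random.Random()
    victims = [i for i in instances if rng.random() < kill_fraction]
    survivors = [i for i in instances if i not in victims]
    return victims, survivors

# Example: terminate roughly 30% of a ten-instance web tier.
victims, survivors = chaos_monkey(
    [f"web-{n}" for n in range(10)], kill_fraction=0.3,
    rng=random.Random(42))
```

The point of the exercise isn't the killing itself; it's forcing the rest of the system (and the team) to treat any individual instance as expendable.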
Mentioned here: https://www.cognitect.com/blog/2016/3/24/the-new-normal-embr...
I know I've been on teams that were significantly disrupted by jury duty, medical incidents, traffic accidents, etc. So it seems like a reasonable way to simulate this.
A really useful thing my team did (and I think it was a moderately successful trend on other SRE teams) was to role-play recent outages. The oncall who had seen a particularly interesting outage would DM the graphs, error messages, and logs they had encountered while debugging an alert to a chosen victim (ahem, role-player), who would have to choose which graphs, dashboards, and logs to look at and which remediation actions to take to track down and fix the actual problem. It was perfect for building metis: since it was done in a team setting, everyone benefited from the insights into the system architecture and behavior, and the role-player learned practical oncall skills. Things like escalating to other teams and running incident management were built into the RP.
We codify knowledge or information into explicit forms such as documentation, expert systems, or designs. Not all knowledge lends itself to this.
Implicit knowledge is that which often requires experience and learning by doing. It is hard to capture explicitly: on the one hand because the skilled individual might be unaware of the skill in action, on the other because they may be unable to express it.
Various hacks are then tried to pry this valuable asset out into the open, so it can be recorded on a corporate wiki.
I think that a consequence of "two sorts of knowledge: techne and metis" is that standardisation is good, but it only gets you so far. Past that point, you need to be familiar with the system.
This should not devalue our efforts to standardise, e.g. get systems to all log to the same aggregator, emit the same basic stats, and agree on naming and forwarding of the correlation ids that allow us to cross-reference related log entries.
But we should also recognise that those efforts will never cover everything.
e.g. If I changed over to working on an unfamiliar system in the same organisation, I would know where it should be logging to, and what the field naming and general structure of those log entries should be, but I would not know what healthy operation looks like in those logs.
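The standardisable half of this (shared field names, forwarded correlation ids) can be sketched with Python's stdlib `logging`. The field names and the `make_logger` helper here are my own invention, not any particular aggregator's convention; the point is only that they are the same across every service:

```python
import json
import logging
import uuid

class CorrelationIdFilter(logging.Filter):
    """Attach a correlation id to every record so related entries
    across services can be cross-referenced in the aggregator."""
    def __init__(self, correlation_id):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record):
        record.correlation_id = self.correlation_id
        return True

def make_logger(service, correlation_id):
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    # Agreed-on field names: every service emits this same structure,
    # so a newcomer knows the shape of the logs without knowing the system.
    handler.setFormatter(logging.Formatter(json.dumps({
        "service": "%(name)s",
        "level": "%(levelname)s",
        "correlation_id": "%(correlation_id)s",
        "message": "%(message)s",
    })))
    logger.addFilter(CorrelationIdFilter(correlation_id))
    logger.addHandler(handler)
    return logger

log = make_logger("checkout", str(uuid.uuid4()))
log.info("order accepted")
```

What this can't standardise is exactly the metis half: which messages, at what rate, in what mixture, constitute "healthy" for a given system.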
For the record, Corbusier's Ville Radieuse (Radiant City) predates the Cold War by a rather hot World War II (1930). Interestingly enough, it was a very Googly impulse -- "organize all the world's" bipeds -- that motivated the relatively young control freak aka architect. After WWII, Corbu mellowed. And his collective residential structures, Unité d'habitation, were the result of his synthesis of a generative measuring system and modularity. OP and fellow SREs have quite a lot to learn from the mature thoughts of Le Corbusier.
Over here in America, we had our own native genius, Frank Lloyd Wright, who devised his vision of an urbanism for a democracy - The Broadacre City:
But of course, the "high modernism" clique (run by the moneyed set of the East Coast (think MoMA) and the "ex-Nazi" Philip Johnson) did everything to marginalize Wright. And it was this clique, having (ironically) imported wholesale the leftist architects of Europe escaping Fascism, that gifted us the "high modernism" dystopia.
If you want to learn about modern architecture, I recommend Ken Frampton's Modern Architecture: A Critical History. He was one of the very few actual teachers I had in architectural school worthy of the designation.
https://en.wikipedia.org/wiki/Ludwig_Mies_van_der_Rohe (His own works were exquisite gems.)
> metis, is local, specific, and practical. It’s won from experience. It can’t be codified in the same way that techne can. The comparison that Scott gives is between navigation and piloting. Deepwater navigation is a general skill, but a pilot knows a specific port — a ‘local and situated knowledge,’ as Scott puts it, including tides, currents, seasonal changes, shifting sandbars, and wind patterns. A pilot cannot move to another port and expect to have the same level of skill and local knowledge.
Other expressions of it include the strategy/tactics distinction and the nomothetic/idiographic distinction. The idea is based on the very ancient observation that phenomena involve both general laws and specific circumstances.
The full picture of app behavior is invaluable to the new or learning engineer, or even experienced engineers learning some unfamiliar subsystem.
> I recently spent some time trying to write a set of general guidelines for what to monitor in a software system
Reframe it as "Shit That Needs To Run To Make The Customer Happy" and you get closer to what you want. Which is to say, it's completely product-specific. A general list of technical things to monitor is about as useful as monitoring the cotton thread fiber integrity of a pair of shoelaces. Is it the cotton thread fiber integrity you care about, or the general quality of the shoelaces? Are they shitty laces, or just decent, or great laces? Quantify that.
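"Quantify that" looks roughly like an SLO check on the customer-visible action rather than on internals. A hypothetical sketch; the event shape, the `checkout` action, and the 99% target are all made up for illustration:

```python
# Monitor "can the customer do the thing", not the thread-fiber
# internals. Events are assumed to look like
# {"action": "checkout", "ok": True}.

SLO = 0.99  # "are customers happy?", quantified (arbitrary target)

def checkout_success_rate(events):
    """Fraction of checkout attempts in a window that succeeded."""
    attempts = [e for e in events if e["action"] == "checkout"]
    if not attempts:
        # No traffic is not, by itself, a customer-facing failure;
        # a separate absence-of-traffic alert would cover that.
        return 1.0
    return sum(1 for e in attempts if e["ok"]) / len(attempts)

def should_page(events):
    """Page a human only when customers are actually unhappy."""
    return checkout_success_rate(events) < SLO
```

Everything else (CPU, queue depth, fiber integrity) is a diagnostic you reach for after this fires, not a thing to page on.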
> Typically, the former kind of PRR will take a quarter or more, because invariably, new large services have a significant amount of tasks to do to get them production-ready. The SRE team onboarding the service will spend significant time finding gaps, understanding what happens when dependencies of the service fail, improving monitoring and runbooks and automation.
I deal with these a lot, and I hate them because they are so stupid. We could make these reviews completely self-service and automated, and they'd move a lot faster; they could even be ongoing as the product is actually released to customers. But SRE and Architecture remain their own silos, and neither of them works closely enough with the product team or core engineering groups to find streamlined, agile ways of doing these things. Basically, none of them grok the concept of finding quicker and better ways to get this shit done. Or they just don't care to.
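"Self-service and automated" could be as simple as codifying the review as checks the product team runs on demand. A sketch under invented assumptions; the check names and the service-descriptor dict shape are hypothetical, not any real PRR tooling:

```python
# Each check inspects a service descriptor the product team maintains.
# Running this in CI makes the review ongoing rather than quarterly.
CHECKS = {
    "has_runbook": lambda svc: bool(svc.get("runbook_url")),
    "has_alerts": lambda svc: len(svc.get("alerts", [])) > 0,
    "deps_have_timeouts": lambda svc: all(
        d.get("timeout_ms") for d in svc.get("dependencies", [])),
}

def production_readiness(svc):
    """Return the names of failing checks; empty list means ready."""
    return [name for name, check in CHECKS.items() if not check(svc)]

svc = {
    "runbook_url": "https://wiki.example/checkout-runbook",
    "alerts": ["checkout_success_rate_slo"],
    "dependencies": [{"name": "db", "timeout_ms": 500}],
}
```

The hard, judgment-heavy parts of a first-time PRR (what happens when dependencies fail, are the runbooks any good) still need humans; it's the tick-box second kind that this replaces.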
> The second kind of PRR typically does not uncover much to be done, and devolves into a tick-box exercise where the developers attempt to demonstrate compliance with the organisation’s production standards.
Architecture and SRE don't explain to the product team WTF they are going on about, so of course they just tick boxes mindlessly. Nobody wants to stop and understand the whole picture, so you end up with empty formalism.
The way to "formalize" and "standardize" the operationalizing of a product is to make it clear what the fuck is going on at each stage of your product. Who the fuck are my customers? What the fuck are they doing with the product? How the fuck does the product work for them, and internally? What the fuck are the external dependencies and how do they work? You need simple, practical ways to express these things.
And you also need to train people as to why everyone needs to understand these things. Why you cannot just allow someone to sit in their little corner of a room and jerk off and collect a paycheck. I often hear it from developers ("I just want to write code") but literally everyone else in the organization does it too.
SRE has some good ideas in principle. But in practice, unless you are Google, it often leads to over-engineering.
If any organization is bold enough to write a book on it, that book becomes the de facto standard. (It helps that it's 10000% easier than buying and reading a million disparate ISO standards on Information Technology.)
So yes, it doesn't translate from 130K to 100 people, but the concepts do.