Saturday, November 30, 2024

After all, people regularly use www.google.com to check if their Internet connection is set up correctly.

— JC van Winkel, Site Reliability Engineering pg. 25

Site Reliability Engineering is a collection of essays written by senior engineers at Google describing how they run their production systems. It's mostly framed within the context of the Site Reliability Engineering (SRE) organization (SRE is an actual job title there). However, I found the subject matter to be quite wide-ranging, covering everything from people management to distributed consensus algorithms. The book doesn't focus strictly on the SRE discipline, which partly explains why it's 500 pages long.

The whole book is actually available for free online if you're interested in reading it. Or just parts of it, since each chapter is a separate topic and there's not much overlap between them.

In essence, the SRE organization is a specialized discipline within Google meant to promote and maintain system-wide reliability for their services and infrastructure.

![[Pasted image 20241128184537.png]]{.alignright}

Reliability is such a multi-faceted objective that the expertise and responsibilities required are wide-ranging. The end goal seems simple to explain: Ensure systems are operating as intended. But reaching that goal requires a combination of technical, operational, and organizational objectives. As a result, this book touches on basically every topic of interest for a software company.

I spent a couple years working in a nuclear power plant, so I've seen what a peak reliability and safety culture looks like. The consequences of errors there are so much higher than at most other companies, including Google. So it's no surprise that reliability and safety are the paramount objectives there, taking priority over everything else.

This safety culture permeated everything we did within the plant, and around it. There were lines painted on each row of the parking lot to indicate the path to the entrance. If you didn't follow them, you would get coached and written up by someone. It was intense. And don't even think about not holding the railing while taking the stairs either...

Any change you want to make to a system within the plant needs extensive documentation, review, and planning before being approved. As a result, the turnaround on any change is months, if not longer.

Contrast that with software companies like Google, where thousands of changes are made daily. The consequences of a mistake can still be serious, depending on the application. But instead of aiming for zero errors, errors are managed like a budget, and the rate at which that budget is spent determines how much change can be made in a given period of time:

In effect, the product development team becomes self-policing. They know the budget and can manage their own risk. (Of course, this outcome relies on an SRE team having the authority to actually stop launches if the SLO is broken).

— Marc Alvidrez, pg. 51
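
To make the error budget idea concrete, here's a back-of-the-envelope sketch with made-up numbers (these are not Google's actual SLOs or traffic figures):

```python
# Illustrative error budget math with invented numbers.
slo = 0.999                            # target: 99.9% of requests succeed
requests_this_quarter = 1_000_000_000  # hypothetical traffic volume

error_budget = (1 - slo) * requests_this_quarter   # ~1,000,000 allowed failures
failures_so_far = 250_000

remaining = error_budget - failures_so_far
print(f"budget remaining: {remaining:,.0f} requests ({remaining / error_budget:.0%})")
```

While there's budget left, the product team can launch as aggressively as they like; once the remaining budget hits zero, risky changes stop until the next window.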

Learning about Google's software development process was interesting. In the first few chapters, there was a lot of useful information on measuring risk, monitoring, alerting, and eliminating toil. These were some of the more insightful chapters in my opinion.

But there were also a few...less insightful chapters. Chapter 17 was about testing code and it was really just stating obvious things about writing tests; it wasn't specific to SRE at all. Then there was a lot of time spent on organizational stuff, like postmortem culture and how to have effective meetings. So much of the writing came off as anecdotal and rather useless advice that the author tried to generalize (or just make up) from past experiences.

So there were good and bad parts of the book. I wouldn't recommend reading it cover to cover like I did. It'd be better to just read a chapter on a topic that's relevant for you.

For instance, I found the section on load balancing to be really informative. Below is a summary of how Google does load balancing.

Balancing Act

Chapters 19 and 20 are about how Google handles their global traffic ingress. Google, by operating one of the largest distributed software systems in the world, definitely knows a thing or two about traffic load balancing. Or to put it in their words:

Google’s production environment is—by some measures—one of the most complex machines humanity has ever built.

— Dave Helstroom, pg. 216

Melodrama aside, I appreciated the clear and concise breakdown of their networking and traffic management in these chapters.

Load balancing needs to consider multiple measures of quality. Latency, throughput, and reliability are all important and are prioritized differently based on the type of request.

1. DNS

Chapter 19 is about load balancing across datacenters. Google runs globally replicated systems, so figuring out which datacenter to send a particular request to is the first step in traffic management. The main mechanism for configuring this is via DNS — a.k.a. the phone book of the internet.

The goals of this routing layer are twofold:

  • Balance traffic fairly across servers and deployment regions
  • Provide optimal latency for users

DNS responses can include multiple IP addresses for a single domain name, which is standard practice. This provides a rudimentary way of distributing traffic, as well as increasing service availability for clients. Most clients (e.g. browsers) will automatically try the different records in the DNS response until they successfully connect to something. The downside is that the service provider, Google, has little control over which IP address in the DNS response actually gets chosen. So DNS can't be relied on by itself to distribute traffic.
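
As a rough illustration of what a client does with a multi-record answer, here's a minimal Python sketch (using www.google.com only because the epigraph does; any multi-homed domain works):

```python
import socket

# A single lookup can return several A/AAAA records for the same name.
records = socket.getaddrinfo("www.google.com", 443, type=socket.SOCK_STREAM)

# Naive client behaviour: try each returned address until one accepts the connection.
for family, socktype, proto, _canonname, sockaddr in records:
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(2.0)
            sock.connect(sockaddr)
            print("connected to", sockaddr)
            break
    except OSError:
        continue  # this record failed; fall through to the next one
else:
    print("no address worked")
```

This is essentially what `socket.create_connection` and browsers do automatically, which is why Google has limited say over which of the returned addresses actually receives the connection.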

The second goal of DNS is to provide optimal latency to users, which means trying to route their requests to the geographically closest server available to them. This is accomplished by having different DNS name-servers set up in each region Google operates in, and then using anycast routing to ensure the client connects to the closest one. The DNS server can then serve a response tailored to that region.

This sounds great in theory, but in practice DNS resolution is hairier, and there are plenty of issues, particularly around the caching introduced by intermediary nameservers. I won't go into those details here.

Despite all of these problems, DNS is still the simplest and most effective way to balance load before the user’s connection even starts. On the other hand, it should be clear that load balancing with DNS on its own is not sufficient.

— Piotr Lewandowski, pg. 240

2. Reverse Proxy

The second layer of load balancing happens at the "front door" of the datacenter, using a Network Load Balancer (NLB), also known as a reverse proxy. The NLB handles all incoming requests by advertising a Virtual IP (VIP) address, then proxies each request to one of any number of actual application servers. In order to retain the originating client details after proxying a request, Google uses Generic Routing Encapsulation (GRE), which wraps the entire IP packet in another IP packet.
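
As a toy illustration of what that encapsulation looks like, here's a scapy sketch with made-up addresses (just the packet structure, not Google's actual NLB code):

```python
from scapy.all import GRE, IP, TCP

client_ip = "203.0.113.7"     # original requester
vip = "198.51.100.10"         # the Virtual IP advertised by the load balancer
lb_ip = "198.51.100.1"        # the load balancer's own address
backend_ip = "10.0.0.42"      # the application server chosen for this request

# The packet as it arrives at the load balancer: client -> VIP.
original = IP(src=client_ip, dst=vip) / TCP(sport=52144, dport=443)

# The NLB wraps the whole packet in a new IP header addressed to the backend,
# so the inner packet (including the client's source address) survives intact.
encapsulated = IP(src=lb_ip, dst=backend_ip) / GRE() / original

encapsulated.show()
```

The backend unwraps the GRE layer and still sees the original client and VIP addresses, which it needs for things like logging and building a proper response.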

There's some complexity here, of course, in terms of the actual routing algorithm used by the NLB. Supporting stateful protocols like WebSockets requires the NLB to keep track of connections and forward all requests to the same backend for a given client session.
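
The book's approach here involves connection tracking and consistent hashing. As a hypothetical sketch of just the affinity idea (the function below is made up for illustration), hashing the connection's 5-tuple maps every packet of a session to the same backend:

```python
import hashlib

backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

def pick_backend(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                 proto: str = "tcp") -> str:
    """Deterministically map a connection's 5-tuple to a backend, so every
    packet of the same session (e.g. a WebSocket) lands in the same place."""
    key = f"{proto}:{src_ip}:{src_port}:{dst_ip}:{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return backends[digest % len(backends)]

print(pick_backend("203.0.113.7", 52144, "198.51.100.10", 443))
```

Plain modulo hashing like this breaks sessions whenever the backend list changes, which is why real load balancers layer consistent hashing and connection tracking tables on top of the same basic idea.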

Once the request has reached an application server, a multitude of internal requests will likely be initiated in order to serve it.

In order to produce the response payloads, these applications often use these same algorithms in turn, to communicate with the infrastructure or complementary services they depend on. Sometimes the stack of dependencies can get relatively deep, where a single incoming HTTP request can trigger a long transitive chain of dependent requests to several systems, potentially with high fan-out at various points.

— Alejandro Forero Cuervo, pg. 243

And besides that, there is plenty of request traffic and computational work that doesn't originate from end users. Cron jobs, batch processes, queue workers, internal tooling, machine learning pipelines, and more are all different forms of load that must be balanced within the network. That's what Chapter 20 covers.

3. Connection Pool Subsets

The goal of internal load balancing is mostly the same as for external requests. Latency is still important, but the main focus is on optimizing compute and distributing work as efficiently as possible. Since there's only so much actual CPU capacity available, it's vital to ensure load is distributed as evenly as possible to prevent bottlenecks or the system falling over due to a single overloaded service.

Within Google, SRE has established a distinction between "backend tasks" and "client tasks" in their system architecture:

We call these processes backend tasks (or just backends). Other tasks, known as client tasks, hold connections to the backend tasks. For each incoming query, a client task must decide which backend task should handle the query.

— Cuervo, pg. 243

A backend service can be composed of hundreds or thousands of these processes (tasks), spread across many machines. Ideally, all backend tasks operate at roughly the same capacity and the total wasted CPU is minimized.

The client tasks will hold persistent connections to the backend tasks in a local connection pool. Due to the scale of these services, it would be inefficient for every single client to hold a connection to every single backend task, because connections cost memory and CPU to maintain.

So Google's job is to optimize an overlapping subset problem—which subset of backend tasks should each client connect to in order to evenly spread out work.

Using random subsetting didn't work. The graph below shows the least loaded backend at only 63% utilization and the most loaded at 121%.

![[srle_2003.png]]{.alignright}

Instead, Google uses deterministic subsetting, which balances connections across backends almost perfectly. It's an algorithm that deterministically shuffles the list of backends and hands each client an even slice of it.
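
The core of the algorithm is short enough to sketch; this is roughly the version described in the chapter, translated into Python 3. Clients are grouped into rounds, every client in a round shuffles the backend list identically, and each one takes its own disjoint slice.

```python
import random

def deterministic_subset(backends, client_id, subset_size):
    """Pick this client's subset of backends such that, across all clients,
    every backend ends up in roughly the same number of subsets."""
    subset_count = len(backends) // subset_size

    # Clients in the same round use the same seed, so they produce the same
    # shuffle and their slices tile the backend list without overlap.
    round_id = client_id // subset_count
    shuffled = list(backends)
    random.Random(round_id).shuffle(shuffled)

    subset_id = client_id % subset_count
    start = subset_id * subset_size
    return shuffled[start:start + subset_size]

# With 12 backends and subsets of 3, clients 0-3 form one round and together
# cover every backend exactly once.
backends = [f"backend-{i}" for i in range(12)]
for client_id in range(4):
    print(client_id, deterministic_subset(backends, client_id, subset_size=3))
```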

4. Weighted Routing

Once the pool of connections has been established for each client task, the final step is to build an effective load balancing policy for these backends.

Using a simple round robin algorithm didn't work, as evidenced by historical operational data. The main reason is that different clients issue requests to the same backends at vastly different rates, since they can be serving completely different downstream applications. There's also variation in the cost of different queries, diversity in backend machines, and unpredictable factors like antagonistic neighbours.

Instead, Google uses weighted round robin, which keeps track of each backend's current load and distributes work based on it. At first, the weights were based only on the count of active requests to each backend, but that doesn't tell the whole story of how healthy a particular backend is. So instead, each backend includes load information in every response: the active request count plus CPU and memory utilization. The client uses this data to distribute the flow of work optimally.
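
As a hypothetical sketch of the client side (the class, field names, and weighting formula below are all made up; the book doesn't publish the actual implementation), a "pick the least loaded" policy driven by backend-reported load captures the idea:

```python
from dataclasses import dataclass

@dataclass
class BackendState:
    address: str
    active_requests: int = 0     # refreshed from load info piggybacked on responses
    cpu_utilization: float = 0.0

def on_response(backend: BackendState, reported_active: int, reported_cpu: float) -> None:
    """Each response carries the backend's own view of its current load."""
    backend.active_requests = reported_active
    backend.cpu_utilization = reported_cpu

def pick_backend(subset: list[BackendState]) -> BackendState:
    """Send the next request to whichever backend currently looks least loaded,
    folding in-flight requests and reported CPU into a single score."""
    return min(subset, key=lambda b: b.active_requests + 10 * b.cpu_utilization)
```

The point is that the weights come from what the backends report about themselves, not just from what each client happens to observe locally.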

Here's a crappy diagram I made to visualize everything.

![[lb.png]]

Conclusions

Site Reliability Engineering offers many insights shared by senior engineers from one of the world's leading software companies. I particularly enjoyed the sections on alerting, load balancing, and distributed computing. But there were some chapters I found boring and without much useful, actionable advice.

Google has been a leader and innovator in tech for many years. They're known for building internal tools for basically every part of the production software stack and development life cycle. A lot of these tools have since been re-released as open source libraries, or have inspired new companies founded by ex-Googlers.

For instance, Google has been running containerized applications for over 20 years. As the scale of running services and jobs this way expanded, the manual orchestration and automation scripts used to administer these applications became unwieldy. Thus, around 2004 Google built Borg — a cluster operating system which abstracted these jobs away from physical machines and allowed for remote cluster management via an API. And then 10 years later, Google announced Kubernetes, the open source successor to Borg. Today, Kubernetes is the de-facto standard for container orchestration in the software industry.

All this to say: Google has encountered many unique problems over the years due to its sheer complexity and unprecedented scale, which has forced the company to develop novel solutions. As such, it's helpful to look to Google as a benchmark for the entire software industry. Understanding how they maintain their software systems is useful for anyone looking to improve their own.

Rating

Non-Fiction 

Value: 4/7
Interest: 3/7
Writing: 3/7

🥪️ If this book were a sandwich, it would be: california burrito with extra avocado

Sunday, October 6, 2024

It's time to update my website.

Over the last couple years I took a bit of a hiatus from posting new content, but this year I've rediscovered the motivation for writing. I'm also interested in writing about more subjects, not just book reviews. In particular, I'm going to start writing about software engineering and coding more, which will require some changes to the formatting of posts.

Because of these new requirements, I've decided to rebuild my writing "stack" and the platform that powers my blog.

The main things I don't like about my current writing flow and site:

  • The site's UI is outdated
    • While I'm proud of the handcrafted *artisanal* HTML I wrote for the original site, I want to redesign it to match my current tastes
  • There's too much manual HTML editing required for new posts
    • I write posts in markdown and then convert them to HTML programmatically. But then I usually have to modify the HTML output to finalize the formatting.
  • My deployment infrastructure is not cloud-optimized or properly decoupled.
    • Sure, hosting your Spring Boot App, MongoDB server, media assets and Jenkins server on a single EC2 instance is possible. Is it a good idea? No.

So there's nothing horribly wrong with the current site. It works. Which isn't surprising given it's a static blog that changes infrequently. But I think rebuilding the site will help invigorate my writing and make it easier to produce new content. Besides, any project is an opportunity to learn, so I'm excited to work with some tools I don't use often and try out some new technology.

My goals for this new site (code named flow2) are the following:

  • Refresh the UI
  • Containerize and use better cloud tooling for the infrastructure
  • Markdown files as the single source of truth for content. No manual HTML editing required
  • Streamline the entire process between writing and posting
  • Overclock the Lighthouse scores as much as possible for fun
  • Try out Ktor — an async web framework built in Kotlin with coroutines

And that's it! I'll probably get a new domain name too.

Looking forward to building. You can check out my progress on GitHub if you like.

Monday, August 19, 2024

When awake, we see only a narrow set of all possible memory interrelationships. The opposite is true, however, when we enter the dream state and start looking through the other end of the memory-surveying telescope. Using that wide-angle dream lens, we can apprehend the full constellation of stored information and their diverse combinatorial possibilities.

— Matthew Walker, pg. 203

I have a hunch that most people who identify as "book readers" like to read before bed. In fact, I'm confident that's where a lot of readers get most of their reading done. However, I admit I have no evidence or statistics to back this up; mostly because I'm not a scientist and I haven't done any research. Welcome to my blog.

Fortunately, there are people like Dr. Matthew Walker who IS a scientist and DOES do research before sharing his theories with the world. Walker is a professor of neuroscience at UC Berkeley and he has spent his career studying sleep. This book, Why We Sleep, is his magnum opus: a distillation of decades spent revealing the secrets of the strange, yet essential, nocturnal phase of our existence.

Why We Sleep book cover

Y'all Sleeping on Sleep

It feels like an injustice to this book to sum it up by saying "sleep is good for you". I was astounded by the breadth of topics covered by Dr. Walker regarding sleep and its effects on our bodies and health. Part 2 of the book is entitled "Why Should You Sleep" and, I kid you not, the following is an incomplete list of the benefits that adequate sleep has been shown to promote:

  • creativity
  • emotional regulation
  • learning efficacy
  • memory retention
  • expected lifespan
  • decreased psychiatric disorder risk
  • decreased injury risk
  • decreased cancer risk
  • decreased Alzheimer's disease risk
  • decreased type-2 diabetes risk
  • decreased car crash risk
  • increased testosterone levels
  • increased testicle size

Yes, even that last one. After reading the details of all the studies and methodologies behind these correlations, I felt the emotional impact of each lesson begin to dull after a while. Like hearing that the world's on fire every day in the news, being told that sleep is really good for you starts to get repetitive. So I think Part 2 sort of drags on a bit.

Thankfully, the remainder of Why We Sleep was more focused on how to get better sleep — and what the hell is going on when we dream! I found both of these topics to be much more interesting.

Like I mentioned before, I do a lot of reading in bed. This was both a good and a bad book to read before sleeping. On the one hand, learning about all the innumerable ways that sleep is good for me was a great headspace to end my day in, allowing my mind to drift off and start riding those rejuvenating REM and NREM brainwaves.

On the other hand, for the nights when I couldn't sleep, or was going to sleep late, or was in an uncomfortable environment where I wasn't in control of my sleeping space — knowing the exact reasons why I couldn't sleep ("ugh, my hands and feet are too hot") almost made it worse ("agh, I looked at a screen too recently") and tended to exacerbate my stress ("shouldn't have had that green tea at 4 PM").

But alas, I truly believe that knowledge is powerful and ignorance is not blissful. I chose to read this book because I wanted to know more about sleep and now I do. I took away a lot of interesting information and useful tips. I think this knowledge will help me sleep better, but more importantly I now have random factoids to drop into conversations whenever they turn to sleep.

In particular, I learned that humans naturally have a biphasic circadian rhythm, which is a really cool term to break out at parties. In English, it means we're biologically hardwired to nap once a day. That mid-afternoon lull you feel every day turns out to be totally natural and not just because your lunch was a big bowl of cheesy gnocchi.

Well, the gnocchi could be part of it honestly (reminder: not a scientist).

Not only is it natural, a short afternoon nap is apparently healthy for you too. Walker points to several studies, some from his own lab, that have illuminated the subtle but measurable ways that a short afternoon nap is beneficial.

Those who were awake throughout the day became progressively worse at learning, even though their ability to concentrate remained stable (determined by separate attention and response time tests). In contrast, those who napped did markedly better, and actually improved in their capacity to memorize facts.

— Walker, pg. 102

From a longitudinal study of Greece and the decline of its siesta culture over the late 20th century, there was a clear relationship between reduced napping and the risk of heart disease:

However, those that abandoned regular siestas went on to suffer a 37 percent increased risk of death from heart disease across the six-year period, relative to those who maintained regular daytime naps. The effect was especially strong in workingmen, where the ensuing mortality risk of not napping increased by well over 60 percent.

— Walker, pg. 69

Leading to the natural conclusion, in Walker's own words:

From a prescription written long ago in our ancestral genetic code, the practice of natural biphasic sleep, and a healthy diet, appear to be the keys to a long-sustained life.

— Walker, pg. 70

While reading about napping and all its great benefits in Why We Sleep, I was reminded of a headline I'd seen somewhere years ago that said something along the lines of:

BREAKING: New Scientific Study by Scientists Shows Napping Causes Bad Health Things to Happen and You Might Die Sooner, According to Science

That might not be verbatim, but you get the point.

Now I'm not one to believe everything I read on the internet. But I do like to base my entire worldview on a subject according to a single headline I skimmed over once and never looked into further. So needless to say, finding out there was conflicting evidence on the benefits of napping was rather shocking. I was losing sleep over it.

With such uncertainty circling in my head, I felt the need to do more investigation. So I decided to finally put on my scientist hat and do some good old fashioned research on the matter.

Naps: A Literature Review

There have indeed been several studies published over the last couple of decades which associated napping with increased risk of hypertension, cardiovascular disease, and diabetes, to name a few. According to that last one, the correlation with diabetes risk was "partly explained by adiposity". Adiposity is a really technical term for being fat; and being fat, interestingly enough, was also found to be correlated with regular daytime napping in a separate study! Which raises the question: what came first, the couch or the glucose intolerance?

It's hard to say since these long-term epidemiological studies are observational and can only suggest potential causes of a disease based on statistical evidence. The risk of confounding factors makes these studies hard to trust entirely, which is why the conclusions drawn must use wording like:

"increased daytime nap frequency may represent a potential causal risk factor for essential hypertension."

For us normies, these studies lead to clickbait headlines and articles touting the risks of this seemingly benign activity:

  • Long Naps May Be Bad For Your Health | Forbes
  • Napping regularly linked to high blood pressure and stroke, study finds | CNN
  • You snooze, you lose: why long naps can be bad for your health | The Guardian

With all these morbidities being linked to napping, it's easy to see how one could conclude that naps are indeed bad for you. I haven't found any explanation offered as to why napping might be bad for your health; so my research, along with the entire state of nap science, is apparently stalled for now.

But what does our sleep expert, Dr. Walker, say about all this? Walker makes no mention of any of these studies in Why We Sleep (granted, many of them were published after his book was written). Walker's only words of warning come in the Twelve Tips for Healthy Sleep Appendix:

  1. Don’t take naps after 3 p.m. Naps can help make up for lost sleep, but late afternoon naps can make it harder to fall asleep at night.

— Walker, pg. 325

So, according to Walker, naps are good for you as long as you take them early enough in the day.

Most studies I read as part of my research mentioned nap length as an important factor in these negative health correlations. For example, this study showed those who regularly take short naps (< 30 minutes) were less likely to have high blood pressure, and those who take long naps were more likely to have high blood pressure.

Putting it all together, I've decided my new worldview on naps boils down to the following:

  • Napping early in the afternoon for no more than 30 minutes is:
    • probably fine
    • possibly beneficial for you
  • Napping for longer than 30 minutes regularly is:
    • possibly bad for you
    • more likely a symptom of some other underlying health issue(s) or poor lifestyle habits.

And of course, my final research source was asking ChatGPT, which obviously told me exactly the same thing:

ChatGPT output
dude nobody likes a know-it-all...

Why bother researching the internet when ChatGPT has already read the whole thing? Oh well, now I know for next time.

Reading books is probably outdated by this point too, but it still helps me get to bed at night. Especially a book like Why We Sleep that evangelises the life-changing power of slumber. So until AI can start singing me lullabies*, I'm clinging to my books!

*editor's note: it turns out AI can already do this

Rating

Non-Fiction 

Value: 4/7
Interest: 5/7
Writing: 4/7

🥪️ If this book were a sandwich, it would be: sliced turkey with mashed potatoes and gravy on buttered white bread
