November 08, 2017

Famous Cloud Outages

Learning from failures is definitely painful (if you're the one failing) but also very informative.  I'm trying to compile a list of interesting cloud outages to learn from.  (May or may not have something to do with my own team's recent outage :P)

(I'm slowly updating this)

AWS - 2017 S3 Outage - Inadvertent removal of important servers in Northern Virginia
An engineer debugging a slow S3 billing system ran a command meant to remove a small number of servers, but inadvertently removed a larger set, including servers that supported two other S3 subsystems.  One of them, the index subsystem, managed the metadata and location information of all S3 objects in that region (Northern Virginia) and was needed to serve all GET, LIST, PUT, and DELETE requests.  The other, the placement subsystem, managed allocation of new storage and depended on the index subsystem.  Both subsystems required a full restart, during which S3 was unable to service requests.  Other AWS services in that region that relied on S3 for storage were impacted while the S3 APIs were unavailable.

Amazon stated that while the S3 subsystems were designed to tolerate losing some capacity, they were not prepared for a full restart of the index and placement subsystems.

Improvements to address this outage included (1) building more safeguards into the tool so that it removes capacity more slowly and more carefully, and (2) recovering services more quickly from failures by splitting them up into smaller units.  The smaller units also make it easier to thoroughly test recovery of large systems.

Azure - 2017 Outage - Overheated data center in Northern Europe caused sudden shutdown of machines
This one is dear to my heart because I remember this happening at work.

Due to human error, one of Azure's data centers overheated which caused some servers and storage systems to suddenly shut down.  This caused many dependent resources to fail - virtual machines were shut down to prevent data corruption; HDInsight, Azure Scheduler and Functions, and Azure Stream Analytics dropped jobs; Azure Monitor and Data Factory saw increased errors and latency in their pipelines.

Microsoft pointed out that customers who'd deployed to availability sets wouldn't have been affected by the outage.

AWS - 2012 EBS Outage - Memory leak caused by DNS propagation failures overwhelmed Elastic Block Store (EBS) volumes
Each EBS storage server reports data to a set of data collection servers.  While the data is important, it is not time sensitive, so the system is tolerant of late/missing data.  One of the data collection servers failed and had to be replaced, and as part of replacing it, a DNS record was updated to remove the failed server and add the replacement.  However, the DNS update didn't propagate to all the internal DNS servers, so a fraction of the storage servers never got the new address and kept trying to contact the server that had been taken out.  Because the data collection service was tolerant of missing data, this didn't raise any alarms.  However, the failed connections triggered a memory leak on the storage servers: rather than gracefully dealing with the failure, they kept retrying and slowly consumed memory.  The monitoring system failed to catch the leak, and eventually the affected storage servers had lost enough memory that they were unable to keep up with requests.

The number of stuck EBS volumes increased quickly.  The system began to fail over from unhealthy to healthy servers.  But because many of the servers failed at the same time, the system was unable to find enough healthy servers to fail over to, and eventually a large number of volumes in the Availability Zone were stuck.

As a result, the EC2 and EBS APIs were throttled, some RDS databases became hard to reach, and some Elastic Load Balancers (ELBs) had trouble routing traffic.

Amazon made changes to propagate DNS changes more reliably and to monitor/alert more aggressively.  They deployed resource limits to prevent low priority processes from consuming excess resources on EBS storage servers.  They relaxed the throttling policies for the APIs, improved the failover logic of the RDS databases (in particular the Multi Availability Zone databases, which were designed to handle this), and improved the reliability of the load balancers by adding more IP capacity, reducing the interdependency between EBS and ELB, and improving traffic shifting logic so that traffic is rerouted away from a degraded Availability Zone more quickly.

October 30, 2017

ELI5: Synchronous vs Asynchronous programming

Synchronous:  When you go to a restaurant and there's a HUGE line and a LONG wait but you have to stay in the waiting room until your table is ready.  Basically you're stuck in the restaurant for 30-45 minutes and can't go anywhere or do anything and your life screeches to a grinding halt.
(programming example: your UI freezes on a PUT until the call returns)

Asynchronous:  When you go to a restaurant and there's a HUGE line and a LONG wait but you can just put your phone number down and they'll call you when your table is almost ready.  You can go and get coffee or walk in the park or go to the bookstore and your life goes on as usual.  Sometimes you can even check on the status of your wait from your phone and see how many people are ahead of you.  And when your table is ready you can go to the restaurant and eat.
(programming example: you issue a PUT from the UI but you can still do stuff in the UI, and in the background the code continually calls for updates on the PUT until it successfully completes)
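
To make the two programming examples concrete, here is a minimal sketch using Python's asyncio.  The function names (put_sync, put_async) and the sleep-based "PUT" are made up for illustration; nothing here talks to a real server.

```python
import asyncio
import time

def put_sync(delay):
    """Synchronous: the caller is stuck until the (simulated) PUT returns."""
    time.sleep(delay)
    return "done"

async def put_async(delay):
    """Asynchronous: awaiting yields control so other work can run meanwhile."""
    await asyncio.sleep(delay)
    return "done"

async def main():
    events = []
    # Leave your phone number at the restaurant...
    put = asyncio.create_task(put_async(0.05))
    # ...and keep living your life, checking in now and then.
    while not put.done():
        events.append("ui still responsive")
        await asyncio.sleep(0.01)
    events.append(f"put {put.result()}")
    return events

events = asyncio.run(main())
```

With put_sync nothing else can happen during the wait; with put_async the loop keeps recording "ui still responsive" until the call finishes.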

From my experience, restaurants in Seattle and Columbus are synchronous and restaurants in Portland are asynchronous.

May 30, 2017

My debugging strategies

This isn't advice on how to use a debugger - it's more of a holistic approach to debugging.  I've been struggling with trying to debug really obscure issues but after watching some of my teammates figure out very edge case issues I know it's possible. Since then I've been trying to narrow down some things that help with figuring out problems.

- Reliably repro-ing is super helpful.  Sometimes that's very difficult (due to race conditions, or the customer getting into a state and I have no idea how they got there)
- Assumptions are bad (going into debugging convinced it's caused by some certain issue and dismissing other possibilities).  Making this mistake means you waste valuable time going down the wrong rabbit hole.  And very often, the root of the problem is extremely unexpected and something you never would've predicted.
- It pays to put in some investment up front to make things easier later on.  Even if it takes an hour to set up automation or a template, it'll actually save a lot of time down the road.  Even an AutoHotkey script that automates typing some text counts.
- Run repro cases with varying parameters to eliminate as many possibilities as possible (and make sure they can truly be eliminated).  Keep track of these - I use Excel for this.  (This is why reliably repro-ing is very useful)
- Make a mind map to keep track of investigations. There are so many different pieces of information to keep in mind while trying to debug a problem and keeping the information you find out in a graph/map form helps you understand your investigations at a glance and keeps track of the relationships between your new pieces of knowledge, or helps you see them in the first place.
- Be aware of recent changes and current known issues in the product, and of the date the bug started.  When querying our logs I've started including the build version as a vital piece of info, which really helps when I notice a bug only started happening recently.  In the ideal case you can narrow the bug down to an individual commit.
- Gathering as much data as possible - logs, stack traces, telemetry, anywhere you can glean information from.  Print statements outputting the values of all of your variables, and what you're going to do/what you just completed.  The more concrete info, the better.
- Organize or categorize logs to make them easier to understand.  I've taken to exporting our logs to Excel and coloring rows by type of data or type of action so I can easily see what's happening
- Don't be in denial - if the program is consistently not working then it's a waste of time to convince yourself it's a transient error or an external software bug.  Every extremely unlikely bug I've encountered that seemed impossible at the time ended up having a good logical reason that I'd overlooked, and I would've saved a lot of time if I'd just accepted that my program has a bug and I have to look for the reason.
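
The "include the build version in your log queries" tip can be sketched like this.  The log rows, the version format, and the helper names are all hypothetical; real logs would come from whatever telemetry store you query.

```python
from collections import Counter

# Hypothetical log rows: (build_version, message).
logs = [
    ("1.0.41", "ok"),
    ("1.0.42", "ok"),
    ("1.0.43", "NullReferenceException in CreateResource"),
    ("1.0.43", "ok"),
    ("1.0.44", "NullReferenceException in CreateResource"),
]

def version_key(build):
    """Sort '1.0.9' before '1.0.10' by comparing the numeric parts."""
    return tuple(int(part) for part in build.split("."))

def first_bad_build(rows, needle):
    """Return the earliest build whose logs contain `needle`, or None."""
    bad = {build for build, message in rows if needle in message}
    return min(bad, key=version_key) if bad else None

def errors_per_build(rows, needle):
    """Count matching log rows per build to see when an error took off."""
    return Counter(build for build, message in rows if needle in message)
```

If first_bad_build points at one build, the diff between that build and the previous one is a much smaller haystack to search.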

And of course, proficient use of the debugger and a comfortable knowledge of the code base help as well.

February 28, 2017

work vs school

A general list of problems we face at work that never came up in school (this list is not exhaustive).

Deployment issues - in school we don't really worry about pushing out code to production or maintaining different environments for development, or complex build systems or using software to actually release our code to customers which sometimes can have problems of its own.
- There's this whole idea of swapping slots (kind of like double buffering in graphics) where code is running on System 1 and you load your new code onto System 2 (like a staging site).  Then you swap them, so System 2 now serves the most recent code while System 1 becomes the staging site.
- We also have to be aware of releasing our server side code versus releasing our UI code versus releasing database code.  Usually our code isn't so coupled that we have to time these correctly, but it's still something to be aware of.  (But, feature flags...)
- There's the whole issue of hot fixes, where you have to quickly release a fix to solve a customer impacting bug.  Or rolling back releases when we see we've just released a big bug, and we have to be aware of which release is out to which region so we know which one to pull back.
- And remembering to hide features that are still in development, usually hidden by a flag but sometimes the feature can get leaked if a programmer forgets to code behind the flag.
- Ideally we have automatic, frequent releases.  Automatic because it's good to automate when possible.  Frequent because you tend to build up bugs when you let the code sit for a while.  At least with frequent releases, if we detect a bug in prod then we have a good guess of which release it was that caused it.  Recently during livesite I've been making sure to keep track of the build number for each row in the logs, and it's nice when an error starts occurring at a certain build number because it really narrows down the code path that caused that issue.
- Maybe idempotency falls into this category, which basically means that making the same call more than once produces the same result as making it once.  If a service restarts (as it sometimes may), you don't want to mess up any long-running operations that were in progress during the restart.  If your code is idempotent then this won't be a problem, but if it isn't, you'll lose data or the customer will notice weird behavior.
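
A rough sketch of hiding an in-development feature behind a flag.  The flag name and the FLAGS dictionary are invented for illustration; a real service would read flags from configuration.

```python
# Hypothetical flag store; in practice this would come from configuration.
FLAGS = {"new_dashboard": False}

def is_enabled(flag, flags=FLAGS):
    """Unknown flags default to off, so a typo hides a feature instead of leaking it."""
    return flags.get(flag, False)

def render_home(flags=FLAGS):
    sections = ["overview"]
    if is_enabled("new_dashboard", flags):
        sections.append("new_dashboard")  # still in development
    return sections
```

The leak described above happens when some code path reaches the new feature without going through a check like is_enabled.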
Different regions - we have customers in different parts of the world so Azure has data centers globally for a lot of reasons.
- Having multiple regions prevents a single point of failure.  So if a failure occurs in one region hopefully the traffic manager is able to route traffic to another region.
- Also provides lower latency for customers in those regions
- We have many different regions we deploy to, and it's good to be aware of what version is in a different region when tracking new changes or bug fixes.  Sometimes when customers email us about bugs, we keep them updated with the bug fix by knowing which region the customer is in and the current build making its way through the regions.
- We have low traffic regions and high traffic regions, need to be aware of that too
- It does make things more difficult when you have to call customers who are in, for example, a European time zone, so you have to set up a time that's terrible for both of you.  Or when you're on holiday but the rest of the world doesn't celebrate that holiday, so if you're on live site you still have to watch over the service.
Scalability - Our service is still somewhat small but we are steadily gaining more customers and that comes with more problems.  We have to be aware of limits we encounter (size & storage limits, request limits, resource limits, etc.) as when the product started we didn't need to code for this behavior
- And this will sometimes cause problems we don't encounter in our dev environment but they happen in prod, and these can be hard to replicate.
- Concurrency - becomes a larger problem the larger the service is.  In our case, specifically database concurrency and readers not reading the most current version, or writers overwriting important information which screws things up (not producing a correct number of resources, etc.)
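
One common way to stop writers from silently overwriting each other is optimistic concurrency: every row carries a version, and a write only succeeds if the version hasn't changed since it was read.  The sketch below uses an in-memory Store class as a stand-in for a database row; the class and function names are assumptions for illustration, not how any particular service does it.

```python
class ConflictError(Exception):
    pass

class Store:
    """In-memory stand-in for a database row with a version column."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def write(self, value, expected_version):
        # Compare-and-set: reject the write if someone else updated the row first.
        if expected_version != self.version:
            raise ConflictError(
                f"expected v{expected_version}, found v{self.version}"
            )
        self.value = value
        self.version += 1

def increment(store, retries=3):
    """Read-modify-write that retries from a fresh read on conflict."""
    for _ in range(retries):
        value, version = store.read()
        try:
            store.write(value + 1, version)
            return store.value
        except ConflictError:
            continue  # re-read the latest value and try again
    raise ConflictError("gave up after repeated conflicts")
```

A stale writer gets a ConflictError instead of clobbering newer data, which is the "writers overwriting important information" case.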
Localization - Usually not too complicated but every customer facing string has to be localized.
- We did have some issues with this for listing days of the week because we were trying to be clever and code a sentence programmatically based on days the user selected but different languages handle this differently
- And sometimes bugs come up where logic is executed depending on the string returned, but if the logic expects the non-localized (English) string and we didn't take that into account, the code will act differently in languages other than English
- And we have to be aware that some languages are a lot longer than others and may look different in the UI
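
The "logic depends on the string returned" bug can be sketched as follows.  The status codes and the translation table are made up for illustration.

```python
# Hypothetical translations keyed by a stable status code.
DISPLAY = {
    "en": {"running": "Running", "stopped": "Stopped"},
    "de": {"running": "Wird ausgeführt", "stopped": "Beendet"},
}

def display_status(code, locale):
    return DISPLAY[locale][code]

def can_stop_fragile(display_text):
    # Bug: only true when the UI language is English.
    return display_text == "Running"

def can_stop(code):
    # Branch on the language-independent code; localize only for display.
    return code == "running"
```

can_stop_fragile works in English and quietly returns False in German, while can_stop doesn't care what language the user sees.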
Backwards compatibility - This is something we always have to keep in mind so we don't break existing code that works.  Customers get angry (for good reason) if you suddenly change the way things work without telling them.
- Always have to keep this in mind when making a new feature or improving an old one
- Sometimes this means migrating a lot of old data for whatever reason, which is always error prone
- Don't change default behavior!  And for that matter make sure default behavior is always the "safer" route or the less dumb route
- I know I just said not to change default behavior but sometimes we need to upgrade old versions to new versions, ideally seamlessly (without the customer noticing, or worse, causing the upgraded version not to work anymore)
- Kinda ties in with API versioning as well.
Future proofing - When designing a feature we have to keep a good balance of keeping things open for extension but not burdening ourselves excessively with details that may not come to be
- Often in design discussions someone will have to remind the team that we are future proofing too much and should take a step back
- But we've all felt the pain of dealing with code that obviously wasn't well thought out, or of having to hack around an old implementation
Idempotency - This was a new topic - I knew the textbook definition of idempotency (making sure the same thing happens despite calling it multiple times) but I didn't really see how it fit in with the real world.
- Server restart is a thing - when we deploy or just have to restart our web jobs, then this isn't a problem if your service is idempotent.  Unfortunately if it's not, then a lot of things break.
- We had some other bugs where the underlying Azure service was making multiple Create calls on our resources, and our code wasn't idempotent because we'd randomly generate a number for each Create call which obviously isn't idempotent.  So basic things would break.
- A surprising thing to learn about was the trade off between concurrency and idempotency.  In trying to make our code idempotent, we sacrificed concurrency by overwriting some data we needed for concurrency to work.  But when we tried to solve concurrency by reducing conflicts (or taking care of them), we sacrificed idempotency.
- The way we're trying to handle idempotency is handling code in smaller chunks and enabling more requeueing.  Doing things like making sure messages are in the queue for a very short time (to minimize failure/risk), and chunking code into smaller pieces (so if the server restarts in the middle of something the code can restart at that one step instead of starting the entire operation over again - this means the operation has to keep track of which step it's at though).  And in each of these smaller pieces we save the state of the operation.  The work can be put back on to the queue at any time, but usually each time it's put back, it has made some progress, and we log that progress so next time we dequeue we know where to start
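
The random-number-per-Create bug is easy to picture with a sketch.  Here the resource ID is derived from a request ID instead of being generated randomly, so a repeated Create lands on the same resource.  The request_id parameter and the in-memory resources dictionary are assumptions for illustration.

```python
import hashlib

resources = {}  # stand-in for the backing store

def create_resource(request_id, name):
    """Idempotent create: retries with the same request_id return the same resource."""
    # Derive the ID from the request instead of generating a random one per call.
    resource_id = hashlib.sha256(request_id.encode()).hexdigest()[:12]
    # setdefault only inserts when the ID is new; otherwise it returns the existing entry.
    return resources.setdefault(resource_id, {"id": resource_id, "name": name})
```

Calling create_resource twice with the same request_id leaves exactly one resource behind, which is what a caller that retries Create needs.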
Queues - We definitely learned what queues were in school but we use them differently at work than we did in school
- We use queues basically to support asynchronous calls and idempotency.
- Asynchronous - so we can support long running operations without blocking other requests
- I explained idempotency earlier - we use queues so we can chunk an operation into smaller pieces so if the service ever messes up then it's not a big deal if the current code that was executing was thrown away - we've been saving its state as it gets dequeued and requeued so we can just start back from the last "checkpoint" next time we dequeue the message and process it
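
The dequeue, do one step, checkpoint, requeue loop can be sketched like this.  The step names, the message shape, and the crash_before parameter (which simulates a restart) are all hypothetical.

```python
from collections import deque

STEPS = ["allocate", "configure", "start"]  # hypothetical steps of one long operation

def process(queue, completed, crash_before=None):
    """Run one step per dequeue, saving progress in the message before requeueing."""
    while queue:
        message = queue.popleft()
        step_index = message["next_step"]
        if step_index == len(STEPS):
            continue  # operation finished; drop the message
        if STEPS[step_index] == crash_before:
            queue.appendleft(message)  # simulated restart: message goes back untouched
            return
        completed.append(STEPS[step_index])
        message["next_step"] = step_index + 1  # checkpoint
        queue.append(message)

queue = deque([{"next_step": 0}])
completed = []
process(queue, completed, crash_before="configure")  # "restart" before step 2
after_crash = list(completed)
process(queue, completed)                            # resume from the checkpoint
```

After the simulated restart only "allocate" has run, and the second call picks up at "configure" rather than starting over.  Each individual step still has to be idempotent, since a real crash can land in the middle of a step.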
Customers/live site - The most stressful part of the job I think, especially when on live site duty.  Even engineers need good people skills.
- I am still struggling with this.  Being super aware of new features added in code (or which bugs have recently been fixed), details of how big parts of the system are implemented, knowing which release is out to which region (or knowing how to find out), being able to sort through our logs and telemetry and formulating a good story of what the customer did or what code was executed, etc., etc., etc. are all important
- Nice thing about livesite is that you're forced to be exposed to a ton of different parts of your service at once.  I liken it to cramming for finals - it's a week (or two) where I'm super stressed but I'm forced to learn a lot in a short amount of time.
Importance of logging/telemetry - Logging is good and necessary, but sometimes too much is bad because the signal gets lost in the noise
- However we tend towards more logging as we wouldn't be able to solve many of our live site problems without logging.  Just like how beginners *cough* use print statements to debug their code, this is kind of an after the fact print statement that allows us to debug what the customer did and perhaps what went wrong and how to solve it
- It's also important to record telemetry to let us know which new features to implement, where to devote our time into, or perhaps which features to cut
API versioning/general versioning - Important if you publish your own SDK, or if other people are using your APIs
- Making sure old APIs are usable because some people still depend on them if they are using our old SDKs or using Fiddler to make calls
- That means freezing old REST models and adapters, and making sure logic works to provide the correct conversion between the REST and business layers regardless of API version.  Usually calls will come to the current (preview) version from the front end, but if people call old APIs through our old SDK or Fiddler, etc., then we have to make sure we route them to the correct version.  So that's why we must save all our old models and adapters somewhere.
- Different API versions don't need to be backwards compatible with each other, since they usually run side by side
- Eventually you need to implement deprecation logic for very old API versions
- Also being aware of when SDKs we depend on get upgraded, that sometimes breaks things
- When updating API versions, increment major version if there are breaking changes and update minor version for other changes.
- And for super major breaking changes, must announce it publicly (Azure notification or blog post) and since we are a small-ish service and know what our major customers are doing, even giving them an email before we roll out the feature.
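
A minimal sketch of routing a call to the frozen adapter for its API version.  The version strings, field names, and adapter functions are all invented for illustration.

```python
# Hypothetical frozen adapters: each API version keeps its own REST -> business mapping.
def to_business_v1(body):
    return {"name": body["name"], "size": body.get("size", "small")}

def to_business_v2(body):
    return {"name": body["displayName"], "size": body["sku"]["size"]}

ADAPTERS = {
    "2016-01-01": to_business_v1,
    "2017-01-01-preview": to_business_v2,
}
DEPRECATED = {"2015-01-01"}

def route(api_version, body):
    """Convert a REST body to the business model using that version's adapter."""
    if api_version in DEPRECATED:
        raise ValueError(f"api-version {api_version} is no longer supported")
    try:
        adapter = ADAPTERS[api_version]
    except KeyError:
        raise ValueError(f"unknown api-version {api_version}") from None
    return adapter(body)
```

Both versions produce the same business model, so the rest of the code never needs to know which API version the caller used.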
Importance of catching exceptions/null refs - So this is sometimes a big source of live site problems.  Anything can happen to data in the real world despite how implausible it seems while coding.  We always have to look out for resources that don't exist, for data that returns empty, etc., even though when writing the code you're thinking to yourself "there's no way this will ever be null...".  Customers have a way.
- Also, I did not know the difference between "throw" and "throw ex" before working here.  (It's important to retain stack trace information!  Hence use "throw" instead)
- And the importance of catching specific exceptions as often as we can - and not swallowing general exceptions we didn't expect, but deferring them to the global controller instead.
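
The "throw" vs "throw ex" point is specific to C#.  The sketch below shows the same idea in Python, where a bare raise re-raises the active exception with its traceback intact.  The ResourceNotFound class and the helper functions are hypothetical.

```python
class ResourceNotFound(Exception):
    pass

def load(store, key):
    if key not in store:
        raise ResourceNotFound(key)
    return store[key]

def describe(store, key):
    try:
        resource = load(store, key)
        return resource["name"].upper()
    except ResourceNotFound:
        # Expected failure: handle it here with a meaningful result.
        return "<missing>"
    except Exception:
        # Unexpected failure: don't swallow it.  A bare `raise` re-raises the
        # same exception, traceback included, for a global handler to deal with.
        raise
```

The missing-resource case is handled on the spot, while the "there's no way this will ever be null" case (a name that is None) still surfaces as an error instead of disappearing.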
User/admin permissions - We actually had someone who was IT support who somehow had access to one of our important resources and cleaned out his system, deleting our resource and wreaking havoc for about an hour.  So that was a good lesson why it's important to be careful whenever you have prod level access
- But we also have to be aware of permissions when coding - Azure is strict about user roles and user permissions (also for good reason as stated above).  I forget this sometimes - when writing and testing code I am always at admin level so I only test the admin code path.  But users and admins are usually treated differently, and we don't want to accidentally give users permissions that they shouldn't.
Testing - We in fact did have to write unit tests in school, but integration tests are another whole thing.  Unit tests can be nice when implementing code as well because it's quicker to run just that one snippet instead of building the solution, then deploying it, then logging in to the web site, then following the set of actions that will test your code.
- One benefit of interfaces I've never thought of is they help keep your tests up to date.  I'm working in an area of code right now that is very volatile and changes by the day, and some of our unit tests were already out of date after a week.  However, whenever the developer implemented an interface, it forced those unit tests to keep up with the code.
- Microsoft pretty much got rid of testers so we just test each other's code.  I've noticed that really good testers are ones who test things you didn't even think of - happy path as well as edge cases as well as weird things customers do
Interdependence - In school maybe we relied on some packages, but they were usually tested very thoroughly and not likely to have bugs.  Azure is a huge complicated system with everybody depending on everybody else and we can't be sure that their code is reliable but we have to be sure we don't f over teams depending on our code.  It makes me feel special being a part of this wonderful behemoth but at the same time it's pretty intimidating.
- The fact that our product is one system utilizing so many different parts of Azure can be confusing.  We have to tie together many different services to make one thing work - make sure the user has permissions, make sure we have permissions to do what we need to do, make sure we create resources in the correct order so we can hook them up to each other, make sure the resources we create are compatible (for example assigning a new vm to the right load balancer), make sure the resources are created in the same resource group, make sure we're aware of throttling limits between different services, hook up analytics so we get the logs for customers, make sure our multiple web services are up and running and correctly hooked up, etc.
- I also learned that before using another service to implement a part of your own framework, it's important to really read the documentation to understand how it works and how it is meant to be used, and be aware of the restrictions beforehand.
- In a much smaller scope, we've had issues where our business and data layers are way too intertwined and we're in the process of decoupling our layers.  Data layers should only know about CRUD: create, read, update, delete.
- (this also falls under Deployment): We want to be able to separate out services in our code so we can deploy them differently.  If you want to push out a hot fix and all of your code is all under one service, then that means you have to deploy EVERYTHING in master whereas if you separate your services then you can quickly hot fix one service without affecting others
- And circular dependencies and moving around projects/references can be super annoying.  Nugetizing code can help with this, but this gets annoying if you're in a project that gets updated a ton and you have to make changes to that code and then publish the nuget package and then in the other solution wait for that change to get published to be able to make your own change.  Probably making sure that your presentation, business, data, and database layers have clearly defined boundaries is a good start.

But I'd also like to give a shout out to my team - I really think this is the best job I'll ever have in terms of people and work given.  We're in the unique position of being a startup-y product in a huge corporation so we get to write a lot of new features with a lot of flexibility but have amazing resources.  And most of all, my team is amazing - not a day has gone by at work without me cracking up super hard.  I've heard so many horror stories about Microsoft being super political and cutthroat but it just feels like a group of friends here, a really smart and nerdy and hilarious group of friends.  In a way it sucks to have the bar set so high so early in my career but I'm thankful for this amazing opportunity while I still have it.

I still struggle with the sheer amount of new material to learn and digest, even after about 10 months.  I hope that's normal and not just me - but I guess with time and experience it'll all seem less overwhelming.

January 18, 2017

my code when i'm tired

var thingy = new Thing();
thingy.doTheThing();

console.log('hiiiii');
var thingy2 = {
propertyThing: thingy.x,
lol: thingy.lol,
hahah: thingy.lol2()
};

console.log('hello!!!');
var thingies = thingy2.makeThingies();

for (var i = 0; i < thingies.length; i++) {
var thingy3 = thingies[i].hi();
console.log('wtf: ', thingy3);
thingy3.lol = thingy2;
}

var lol = LOL.haHAhaha();
var lolol = lol.lol();
lolol.omg();

lol();
lol.lol();
lolWHatDoesThisDo();
omg();

console.log('lol');