A general list of problems we face at work that never came up in school (this list is not exhaustive).
Deployment issues - In school we don't really worry about pushing code out to production, maintaining different environments for development, dealing with complex build systems, or using software to actually release our code to customers, which can sometimes have problems of its own.
- There's this whole idea of swapping environments (kind of like double buffering in graphics, and often called blue-green deployment), where the live code runs on System 1 and you load your new code onto System 2 (like a staging site). Then you swap them, so System 2 serves the most recent code while System 1 becomes the staging site.
- We also have to be aware of releasing our server side code versus releasing our UI code versus releasing database code. Usually our code isn't so coupled that we have to time these correctly, but it's still something to be aware of. (But, feature flags...)
- There's the whole issue of hot fixes, where you have to quickly release a fix for a customer-impacting bug. There's also rolling back a release when we see we've just shipped a big bug, which means knowing which release is out in which region so we know which one to pull back.
- We also have to remember to hide features that are still in development. They're usually hidden behind a flag, but a feature can leak if a programmer forgets to put the code behind the flag.
- Ideally we have automatic, frequent releases. Automatic because it's good to automate when possible. Frequent because you tend to build up bugs when you let the code sit for a while. At least with frequent releases, if we detect a bug in prod then we have a good guess of which release caused it. Recently during livesite I've been making sure to keep track of the build number for each row in the logs, and it's nice when an error starts occurring at a certain build number because it really narrows down the code path that caused that issue.
- Maybe idempotency falls into this category, which basically means that repeating the same call produces the same result as making it once. If a service restarts (as it may sometimes), then you don't want to mess up any long-running operations that were in progress during the restart. If your code supports idempotency then this won't be a problem, but if it doesn't, you'll lose data or the customer will notice weird behavior.
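To make the feature flag point concrete, here is a minimal sketch in Python. The `FeatureFlags` class, the `new_reports` flag, and the menu items are all made up for illustration and are not how our service actually does it. The idea is only that an unknown or unset flag defaults to off, so a feature leaks only if someone forgets to put the code behind the check.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store; any flag not listed is off."""
    enabled: set = field(default_factory=set)

    def is_enabled(self, name: str) -> bool:
        return name in self.enabled


def list_menu_items(flags: FeatureFlags) -> list:
    items = ["Dashboard", "Settings"]
    # The in-development feature is only reachable behind its flag.
    if flags.is_enabled("new_reports"):
        items.append("Reports (preview)")
    return items
```

With no flags set, `list_menu_items(FeatureFlags())` returns only the two released items. The preview item shows up only when `new_reports` is explicitly enabled.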
Different regions - we have customers in different parts of the world so Azure has data centers globally for a lot of reasons.
- Having multiple regions prevents a single point of failure. So if a failure occurs in one region hopefully the traffic manager is able to route traffic to another region.
- Also provides lower latency for customers in those regions
- We have many different regions we deploy to, and it's good to be aware of what version is in a different region when tracking new changes or bug fixes. Sometimes when customers email us about bugs, we keep them updated with the bug fix by knowing which region the customer is in and the current build making its way through the regions.
- We have low traffic regions and high traffic regions, need to be aware of that too
- It does make things more difficult when you have to call customers who are in, for example, a European time zone, so you have to set up a time that's terrible for both of you. Or if you're on holiday but the rest of the world doesn't celebrate that holiday, then being on live site means you still have to watch over the service.
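Keeping track of which build is in which region can be sketched roughly like this. The region names and build numbers are invented, and the sketch assumes build numbers only ever increase, so a higher build always contains the fixes from lower ones.

```python
# Hypothetical rollout state: region name -> build number currently deployed.
deployed_builds = {"westus": 412, "northeurope": 410, "southeastasia": 409}


def regions_with_fix(deployed: dict, fix_build: int) -> list:
    """Regions whose deployed build is at or past the build containing the fix."""
    return sorted(region for region, build in deployed.items() if build >= fix_build)


def regions_to_roll_back(deployed: dict, bad_build: int) -> list:
    """Regions currently running exactly the build we want to pull back."""
    return sorted(region for region, build in deployed.items() if build == bad_build)
```

If a customer in `northeurope` reports a bug fixed in build 410, `regions_with_fix(deployed_builds, 410)` tells us whether the fix has reached them yet.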
Scalability - Our service is still somewhat small but we are steadily gaining more customers and that comes with more problems. We have to be aware of limits we encounter (size & storage limits, request limits, resource limits, etc.) as when the product started we didn't need to code for this behavior
- And this will sometimes cause problems we don't encounter in our dev environment but they happen in prod, and these can be hard to replicate.
- Concurrency - becomes a larger problem the larger the service is. In our case, specifically database concurrency and readers not reading the most current version, or writers overwriting important information which screws things up (not producing a correct number of resources, etc.)
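One common way to stop writers from silently overwriting each other is optimistic concurrency: every row carries a version, and a write only succeeds if the version is still the one the writer read. The sketch below uses a made-up in-memory `VersionedStore` and is not our actual database code. It only illustrates the check.

```python
class ConflictError(Exception):
    """Raised when a row changed between the caller's read and write."""


class VersionedStore:
    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # A missing row is reported as version 0 with no value.
        return self._rows.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            # Someone else wrote first; the caller must re-read and retry.
            raise ConflictError(f"{key}: expected v{expected_version}, found v{current_version}")
        self._rows[key] = (current_version + 1, value)
        return current_version + 1
```

If two writers both read version 0, the first write succeeds and the second gets a `ConflictError` instead of overwriting the first writer's data.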
Localization - Usually not too complicated but every customer facing string has to be localized.
- We did have some issues with this for listing days of the week, because we were trying to be clever and build a sentence programmatically from the days the user selected, but different languages handle this differently.
- And sometimes bugs come up where logic branches on a returned string. If that logic expects the non-localized (English) string and we didn't take localization into account, then the code will act differently in languages other than English.
- And we have to be aware that some languages are a lot longer than others and may look different in the UI
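A way to avoid the "logic depends on a display string" class of bug is to branch on a stable code and only localize at display time. This is a small sketch, where the `Status` enum and the translation table are hypothetical and the German strings are illustrative only.

```python
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    STOPPED = "stopped"


# Hypothetical translation table, used for display only.
DISPLAY = {
    "en": {Status.RUNNING: "Running", Status.STOPPED: "Stopped"},
    "de": {Status.RUNNING: "Wird ausgeführt", Status.STOPPED: "Beendet"},
}


def display_status(status: Status, locale: str) -> str:
    return DISPLAY[locale][status]


def can_stop(status: Status) -> bool:
    # Logic branches on the enum, never on the localized text.
    return status is Status.RUNNING
```

Here `can_stop` gives the same answer in every language, because the localized string never enters the decision.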
Backwards compatibility - This is something we always have to keep in mind so we don't break existing code that works. Customers get angry (for good reason) if you suddenly change the way things work without telling them.
- Always have to keep this in mind when making a new feature or improving an old one
- Sometimes this means migrating a lot of old data for whatever reason, which is always error prone
- Don't change default behavior! And for that matter make sure default behavior is always the "safer" route or the less dumb route
- I know I just said not to change default behavior but sometimes we need to upgrade old versions to new versions, ideally seamlessly (without the customer noticing, or worse, causing the upgraded version not to work anymore)
- Kinda ties in with API versioning as well.
Future proofing - When designing a feature we have to keep a good balance of keeping things open for extension but not burdening ourselves excessively with details that may not come to be
- Often in design discussions someone will have to remind the team that we are future proofing too much and should take a step back
- But we've all felt the pain of dealing with code that obviously wasn't well thought out, or of having to hack around an old implementation
Idempotency - This was a new topic - I knew the textbook definition of idempotency (making sure the same thing happens despite calling it multiple times) but I didn't really see how it fit in with the real world.
- Server restart is a thing - when we deploy or just have to restart our web jobs, then this isn't a problem if your service is idempotent. Unfortunately if it's not, then a lot of things break.
- We had some other bugs where the underlying Azure service was making multiple Create calls on our resources, and our code wasn't idempotent because we'd randomly generate a number for each Create call which obviously isn't idempotent. So basic things would break.
- A surprising thing to learn about was the trade off between concurrency and idempotency. In trying to make our code idempotent, we sacrificed concurrency by overwriting some data we needed for concurrency to work. But when we tried to solve concurrency by reducing conflicts (or taking care of them), we sacrificed idempotency.
- The way we're trying to handle idempotency is by handling code in smaller chunks and enabling more requeueing. That means making sure messages are in the queue for a very short time (to minimize failure/risk), and chunking code into smaller pieces (so if the server restarts in the middle of something, the code can restart at that one step instead of starting the entire operation over again - this means the operation has to keep track of which step it's at though). In each of these smaller pieces we save the state of the operation. The work can be put back onto the queue at any time, but usually each time it's put back it has made some progress, and we log that progress so that next time we dequeue we know where to start.
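The duplicate Create problem can be illustrated with a small sketch. It assumes the caller sends some stable request ID with each Create. That ID and the `ResourceService` class are hypothetical, not the real Azure contract. Instead of generating a random number per call, the resource ID is derived from the request ID, and a repeated call returns the resource that already exists.

```python
import uuid


class ResourceService:
    def __init__(self):
        self._by_request = {}  # request_id -> resource already created for it

    def create(self, request_id: str, name: str) -> dict:
        existing = self._by_request.get(request_id)
        if existing is not None:
            # Repeated Create for the same request: return the same resource.
            return existing
        resource = {
            # Deterministic ID derived from the request, not a random number.
            "id": str(uuid.uuid5(uuid.NAMESPACE_URL, request_id)),
            "name": name,
        }
        self._by_request[request_id] = resource
        return resource
```

Calling `create` twice with the same `request_id` yields one resource with one ID, which is the property the random-number version was missing.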
Queues - We definitely learned what queues were in school, but we use them differently at work than we did in school
- We use queues basically to support asynchronous calls and idempotency.
- Asynchronous - so we can support long running operations without blocking other requests
- I explained idempotency earlier - we use queues so we can chunk an operation into smaller pieces so if the service ever messes up then it's not a big deal if the current code that was executing was thrown away - we've been saving its state as it gets dequeued and requeued so we can just start back from the last "checkpoint" next time we dequeue the message and process it
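The dequeue, do one step, checkpoint, requeue loop can be sketched like this. The step names and the message shape are invented, and the "crash" is simulated by putting the message back without advancing its checkpoint. It also assumes each individual step is safe to run again.

```python
from collections import deque
from typing import Optional

STEPS = ["allocate", "configure", "start"]  # hypothetical step names


def run_worker(queue: deque, log: list, crash_once_at: Optional[str] = None) -> None:
    """Drain the queue, running one step per dequeue and checkpointing after it."""
    crashed = False
    while queue:
        message = queue.popleft()
        step = STEPS[message["next_step"]]
        if step == crash_once_at and not crashed:
            # Simulated restart: the step did not finish, so the checkpoint is
            # not advanced and the message goes back on the queue unchanged.
            crashed = True
            queue.append(message)
            continue
        log.append(step)               # stand-in for the real work of this step
        message["next_step"] += 1      # checkpoint: record the progress made
        if message["next_step"] < len(STEPS):
            queue.append(message)      # requeue to continue from the checkpoint
```

After a simulated restart during `configure`, the operation resumes at `configure` rather than redoing `allocate`, because the message carries its own checkpoint.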
Customers/live site - The most stressful part of the job I think, especially when on live site duty. Even engineers need good people skills.
- I am still struggling with this. Being super aware of new features added in code (or which bugs have recently been fixed), details of how big parts of the system are implemented, knowing which release is out to which region (or knowing how to find out), being able to sort through our logs and telemetry and formulating a good story of what the customer did or what code was executed, etc., etc., etc. are all important
- Nice thing about livesite is that you're forced to be exposed to a ton of different parts of your service at once. I liken it to cramming for finals - it's a week (or two) where I'm super stressed but I'm forced to learn a lot in a short amount of time.
Importance of logging/telemetry - Logging is good and necessary, but sometimes too much is bad because the signal gets lost in the noise
- However we tend towards more logging as we wouldn't be able to solve many of our live site problems without logging. Just like how beginners *cough* use print statements to debug their code, this is kind of an after the fact print statement that allows us to debug what the customer did and perhaps what went wrong and how to solve it
- It's also important to record telemetry to let us know which new features to implement, where to devote our time into, or perhaps which features to cut
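As a rough sketch of logging that stays searchable, each log row can be a structured record that always carries the build number and region. The field names and the sample error message below are hypothetical. With that in place, finding the earliest build that logged a given error is a simple filter.

```python
import json


def format_event(message: str, *, build: int, region: str, **fields) -> str:
    """Serialize one log row as JSON with build and region always present."""
    record = {"message": message, "build": build, "region": region, **fields}
    return json.dumps(record, sort_keys=True)


def first_build_with_error(lines: list, message: str):
    """Lowest build number that logged the given message, or None if never seen."""
    builds = [
        record["build"]
        for record in map(json.loads, lines)
        if record["message"] == message
    ]
    return min(builds) if builds else None
```

This is the "after the fact print statement" idea with just enough structure that the build number narrows down which release introduced a problem.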
API versioning/general versioning - Important if you publish your own SDK, or if other people are using your APIs
- Making sure old APIs are usable because some people still depend on them if they are using our old SDKs or using Fiddler to make calls
- That means freezing old REST models and adapters, and making sure the logic provides the correct conversion between REST and business models regardless of API version. Usually calls come to the current (preview) version from the front end, but if people call old APIs through our old SDK or Fiddler, etc., then we have to make sure we route them to the correct version. That's why we must save all our old models and adapters somewhere.
- Different API versions don't need to be backward compatible with each other, since they usually run side by side
- Eventually you need to implement deprecation logic for very old API versions
- Also being aware of when SDKs we depend on get upgraded, since that sometimes breaks things
- When updating API versions, increment major version if there are breaking changes and update minor version for other changes.
- And for super major breaking changes, we must announce them publicly (Azure notification or blog post), and since we are a small-ish service and know what our major customers are doing, we even send them an email before we roll out the change.
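Routing a call to the frozen adapter for its API version might look roughly like this. The version strings, field names, and adapter functions are all hypothetical. The point is only that each version keeps its own REST-to-business conversion, and that very old versions get an explicit deprecation error.

```python
class UnsupportedApiVersion(Exception):
    """Raised for retired or unknown API versions."""


def from_rest_v1(body: dict) -> dict:
    # Frozen v1 model: no instance count in the request, so default to 1.
    return {"name": body["name"], "instance_count": 1}


def from_rest_v2(body: dict) -> dict:
    # v2 added an explicit instanceCount field.
    return {"name": body["name"], "instance_count": body["instanceCount"]}


ADAPTERS = {"1.0": from_rest_v1, "2.0": from_rest_v2}
DEPRECATED = {"0.9"}


def to_business_model(api_version: str, body: dict) -> dict:
    if api_version in DEPRECATED:
        raise UnsupportedApiVersion(f"API version {api_version} has been retired")
    try:
        adapter = ADAPTERS[api_version]
    except KeyError:
        raise UnsupportedApiVersion(f"unknown API version {api_version}") from None
    return adapter(body)
```

Both versions produce the same business model shape, so the rest of the code does not need to know which API version the caller used.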
Importance of catching exceptions/null refs - So this is sometimes a big source of live site problems. Anything can happen to data in the real world despite how implausible it seems while coding. We always have to look out for resources that don't exist, for data that returns empty, etc., even though when writing the code you're thinking to yourself "there's no way this will ever be null...". Customers have a way.
- Also, I did not know the difference between "throw" and "throw ex" before working here. (In C#, "throw ex" resets the stack trace to the point of the rethrow, while a bare "throw" keeps the original one. It's important to retain stack trace information, hence use "throw" instead.)
- And the importance of catching specific exceptions as often as we can, and of not swallowing general exceptions we didn't expect but letting them propagate to the global controller.
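Here is a small Python sketch of "catch what you expect, let the rest propagate". The `ResourceNotFound` type, the store shape, and the status codes are invented for illustration. The expected failure is handled specifically, and anything unexpected is deliberately left uncaught so it reaches whatever global handler sits above with its traceback intact.

```python
class ResourceNotFound(Exception):
    """The requested resource does not exist."""


def load_display_name(store: dict, resource_id: str) -> str:
    try:
        resource = store[resource_id]
    except KeyError as err:
        # Expected case: translate to a specific error, keeping the cause.
        raise ResourceNotFound(resource_id) from err
    # Data can come back empty even when "it will never be null".
    return resource.get("name") or "(unnamed)"


def handle_request(store: dict, resource_id: str):
    try:
        return 200, load_display_name(store, resource_id)
    except ResourceNotFound:
        return 404, "not found"
    # No broad `except Exception` here: unexpected errors are not swallowed.
```

A missing resource becomes a clean 404, an empty name gets a safe default, and a genuinely unexpected error (say, a corrupt record) still surfaces instead of being hidden.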
User/admin permissions - We actually had someone who was IT support who somehow had access to one of our important resources and cleaned out his system, deleting our resource and wreaking havoc for about an hour. So that was a good lesson why it's important to be careful whenever you have prod level access
- But we also have to be aware of permissions when coding - Azure is strict about user roles and user permissions (also for good reason as stated above). I forget this sometimes - when writing and testing code I am always at admin level so I only test the admin code path. But users and admins are usually treated differently, and we don't want to accidentally give users permissions that they shouldn't.
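Since I tend to only exercise the admin path, one habit that helps is writing the check so both roles are easy to test. The role names and permission sets below are made up, and real Azure role-based access control is much richer. This only shows the shape of testing both the admin and the user path.

```python
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete"},
    "user": {"read"},
}


class PermissionDenied(Exception):
    """The caller's role does not allow the requested action."""


def delete_resource(role: str, resources: dict, name: str) -> None:
    # Unknown roles get no permissions at all.
    if "delete" not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionDenied(f"role {role!r} may not delete {name!r}")
    resources.pop(name, None)
```

A test for this should call `delete_resource` once as `admin` and once as `user`, and check that the second call raises and leaves the resource in place.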
Testing - We in fact did have to write unit tests in school, but integration tests are another whole thing. Unit tests can be nice when implementing code as well because it's quicker to run just that one snippet instead of building the solution, then deploying it, then logging in to the web site, then following the set of actions that will test your code.
- One benefit of interfaces I've never thought of is they help keep your tests up to date. I'm working in an area of code right now that is very volatile and changes by the day, and some of our unit tests were already out of date after a week. However, whenever the developer implemented an interface, it forced those unit tests to keep up with the code.
- Microsoft pretty much got rid of testers so we just test each other's code. I've noticed that really good testers are ones who test things you didn't even think of - happy path as well as edge cases as well as weird things customers do
Interdependence - In school maybe we relied on some packages, but they were usually tested very thoroughly and not likely to have bugs. Azure is a huge complicated system with everybody depending on everybody else and we can't be sure that their code is reliable but we have to be sure we don't f over teams depending on our code. It makes me feel special being a part of this wonderful behemoth but at the same time it's pretty intimidating.
- The fact that our product is one system utilizing so many different parts of Azure can be confusing. We have to tie together many different services to make one thing work - make sure the user has permissions, make sure we have permissions to do what we need to do, make sure we create resources in the correct order so we can hook them up to each other, make sure the resources we create are compatible (for example assigning a new vm to the right load balancer), make sure the resources are created in the same resource group, make sure we're aware of throttling limits between different services, hook up analytics so we get the logs for customers, make sure our multiple web services are up and running and correctly hooked up, etc.
- I also learned that before using another service to implement a part of your own framework, it's important to really read the documentation to understand how it works and how it is meant to be used, and to be aware of the restrictions beforehand.
- In a much smaller scope, we've had issues where our business and data layers are way too intertwined and we're in the process of decoupling our layers. Data layers should only know about CRUD: create, read, update, delete.
- (This also falls under Deployment.) We want to be able to separate out services in our code so we can deploy them independently. If you want to push out a hot fix and all of your code is under one service, then you have to deploy EVERYTHING in master, whereas if you separate your services then you can quickly hot fix one service without affecting the others
- And circular dependencies and moving around projects/references can be super annoying. Nugetizing code can help with this, but this gets annoying if you're in a project that gets updated a ton and you have to make changes to that code and then publish the nuget package and then in the other solution wait for that change to get published to be able to make your own change. Probably making sure that your presentation, business, data, and database layers have clearly defined boundaries is a good start.
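The "data layer only knows CRUD" boundary can be sketched with an interface. `ResourceRepository`, `InMemoryRepository`, and `rename_resource` are hypothetical names used only for illustration. The business rule lives outside the data layer, and an in-memory implementation of the same interface makes it quick to unit test.

```python
from typing import Optional, Protocol


class ResourceRepository(Protocol):
    """Data layer contract: create, read, update, delete and nothing else."""
    def create(self, key: str, value: dict) -> None: ...
    def read(self, key: str) -> Optional[dict]: ...
    def update(self, key: str, value: dict) -> None: ...
    def delete(self, key: str) -> None: ...


class InMemoryRepository:
    """Test double that satisfies the same contract without a database."""
    def __init__(self):
        self._rows = {}

    def create(self, key, value):
        self._rows[key] = dict(value)

    def read(self, key):
        return self._rows.get(key)

    def update(self, key, value):
        self._rows[key] = dict(value)

    def delete(self, key):
        self._rows.pop(key, None)


def rename_resource(repo: ResourceRepository, key: str, new_name: str) -> bool:
    """Business rule: only existing resources can be renamed."""
    record = repo.read(key)
    if record is None:
        return False
    repo.update(key, {**record, "name": new_name})
    return True
```

Because `rename_resource` depends only on the interface, swapping the real data layer for `InMemoryRepository` in a unit test needs no changes to the business code.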
But I'd also like to give a shout out to my team - I really think this is the best job I'll ever have in terms of people and work given. We're in the unique position of being a startup-y product in a huge corporation so we get to write a lot of new features with a lot of flexibility but have amazing resources. And most of all, my team is amazing - not a day has gone by at work without me cracking up super hard. I've heard so many horror stories about Microsoft being super political and cutthroat but it just feels like a group of friends here, a really smart and nerdy and hilarious group of friends. In a way it sucks to have the bar set so high so early in my career but I'm thankful for this amazing opportunity while I still have it.
Even after about 10 months, I still struggle with the sheer amount of new material to learn and digest. I hope that's normal and not just me - but I guess with time and experience it'll all seem less overwhelming.