Saturday, July 29, 2017

Approach to Designing a Distributed System

Most of the systems we work on or design nowadays are distributed. The data is distributed and partitioned, it may be served by many instances of identical components, different functionalities may be handled by different components running independently, and each component may run a different number of instances. The components will most likely run across different machines, and may even run in different data centers. So weaving such a system together and making it function correctly is genuinely complex.
No matter how many precautions we take, things may go wrong. But we should try to minimize the possibility of that happening. When we overlook something, saying that it is unlikely to happen in production so let us not waste time devising safeguards against it, we are fooling ourselves. Things that are unlikely to happen, or that don't happen during tests, will surely happen in production someday, and they may not be as infrequent as we hoped :(


Below are the principles I believe we should follow while designing such a system:
  • System must behave correctly as far as possible. In some cases we cannot afford any errors at all; those paths must be unit tested and integration tested to the maximum. In the other cases too, let us design for 99.99% correctness. Designing a system that overlooks error conditions is a recipe for disaster and instability; many capable engineers will have to spend hours or days chasing and correcting the resulting errors. So correctness is the first and most important consideration in our design.
  • System must be reliable. Different components may come up and die, and a component or service may run with a different number of instances at different times, but this should not hamper the reliability of the system. If a client pushed a document to our platform and we returned a 200 OK response, then we must ensure that every subsequent GET for that document, at least by that client (if the document is not public), also returns 200 OK.
  • System should be highly available. First let us make the system HA, and then think about scalability. There should not be any component or data store in our deployment which is not HA; otherwise things may turn ugly when the non-HA component goes down. We may end up losing data, and many services that depended on the component that went down may stop working. Don't be selective about making only some components HA; all the components must be HA. A deployment where some components are HA and some are not is a poor architecture to follow; avoid that at any cost. In many cases we don't really deal with "big data" anyway (e.g. handling 4 TB of data is not actually a big data problem :)).
  • Scalability is important, as we may have to serve a larger number of clients, and the data will keep growing. Our system should be able to grow with the number of requests and the increasing data.
  • The system must handle concurrent updates, and at the same time it should not unnecessarily lock records and degrade performance. Optimistic locking is the technique we may use here; most databases provide some facility for optimistic locking of records. I will give an example of where things may go wrong without it. Assume there is a website where users maintain their job profiles, and each job is kept as a separate record in some database. Now assume a user opens two browsers and reads the same job record from both. The user leisurely changes the job description in both browsers and submits the changes. The problem is that whichever browser submits later overrides the changes submitted earlier, even though both read the same version of the record. How do we prevent such lost updates efficiently? The answer is optimistic locking. Every record carries a version, and every update does a compare-and-swap of the version that was read against a new version (the new version may be "old version + 1", the current timestamp, etc.). The compare-and-swap must be atomic for optimistic locking to work correctly: it compares the stored version with the expected value (the version the client read), and if they don't match (meaning some other client changed the record in between), it returns an error. When the client gets the error, it should re-read the record and try to apply its change on the latest version. This simple technique prevents a lot of errors (see the first sketch after this list). People tend to ignore this scenario out of ignorance, or thinking the chance of it happening is small. But when there is a simple and robust solution to the problem, why should we ignore it?
  • Be careful when you cache records locally in application servers, especially when multiple instances of the application server are always running. A local cache is a cache within the memory of the application process. It is useful when we want to avoid network calls to fetch the cached data from Memcache, Redis, Hazelcast, etc.: the application server looks into its local cache for a record before calling the distributed cache or the database. But if the record may be updated by multiple copies of the service, use of a local cache will be incorrect. For example, say instance 2 of service A read the record for key "counter_x" at 10:25 am, when the value was VAL_X80, and cached that value locally. Then instance 1 of service A updated the same record to VAL_X100 at 10:30 am. Now a client's call reaches instance 2 of service A asking for the value of counter_x. If instance 2 returns the locally cached value, the client ends up reading a stale value (VAL_X80) for counter_x. Applying an expiry time to the keys in the local cache does not really solve the problem either; it only bounds how stale a read can be. So we should think carefully before caching anything locally in a distributed system; there are cases where local caching results in erroneous behavior (see the second sketch after this list).
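
To make the compare-and-swap concrete, here is a minimal sketch of optimistic locking in Java over plain JDBC. The table jobs(id, description, version), the retry limit, and all names are my own illustrative assumptions, not a prescription; the same pattern applies to conditional updates in most databases.

    // A minimal sketch of optimistic locking over JDBC, assuming a
    // hypothetical table jobs(id, description, version).
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class JobUpdater {

        // Tries to update the description; retries on version conflicts.
        public void updateDescription(Connection conn, long jobId, String newDesc)
                throws SQLException {
            final int maxRetries = 3; // arbitrary bound for this sketch
            for (int attempt = 0; attempt < maxRetries; attempt++) {
                long version;
                // 1. Read the current record along with its version.
                try (PreparedStatement read = conn.prepareStatement(
                        "SELECT version FROM jobs WHERE id = ?")) {
                    read.setLong(1, jobId);
                    try (ResultSet rs = read.executeQuery()) {
                        if (!rs.next()) {
                            throw new SQLException("job not found: " + jobId);
                        }
                        version = rs.getLong("version");
                    }
                }
                // 2. Compare-and-swap: the UPDATE matches a row only if
                //    nobody else bumped the version since we read it.
                try (PreparedStatement cas = conn.prepareStatement(
                        "UPDATE jobs SET description = ?, version = version + 1 "
                        + "WHERE id = ? AND version = ?")) {
                    cas.setString(1, newDesc);
                    cas.setLong(2, jobId);
                    cas.setLong(3, version);
                    if (cas.executeUpdate() == 1) {
                        return; // our write won; done
                    }
                    // 0 rows updated => concurrent modification; re-read and retry.
                }
            }
            throw new SQLException("gave up after " + maxRetries + " conflicting updates");
        }
    }

Note that the version check and the write happen in a single UPDATE statement, so the database itself guarantees the atomicity of the compare-and-swap; no explicit row lock is held while the user is "leisurely" editing.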
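
And here is a small runnable sketch of the stale-read problem with local caching. The in-memory "shared store" standing in for a real database or distributed cache, and all class and key names, are assumptions for illustration; the point is only that each instance has its own private cache.

    // A minimal sketch of why a per-instance local cache can serve stale
    // reads. Two "instances" share one authoritative store, but each keeps
    // its own local cache.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class LocallyCachedCounters {

        // Stand-in for the shared, authoritative store (database, Redis, etc.).
        private final Map<String, Long> sharedStore;

        // Per-process local cache: each service instance has its own copy.
        private final Map<String, Long> localCache = new ConcurrentHashMap<>();

        public LocallyCachedCounters(Map<String, Long> sharedStore) {
            this.sharedStore = sharedStore;
        }

        public long read(String key) {
            // Look locally first to avoid a network call; this is where
            // staleness creeps in once another instance updates the store.
            return localCache.computeIfAbsent(key, sharedStore::get);
        }

        public void write(String key, long value) {
            sharedStore.put(key, value); // visible to all instances
            localCache.put(key, value);  // only THIS instance's cache is refreshed
        }

        public static void main(String[] args) {
            Map<String, Long> shared = new ConcurrentHashMap<>();
            shared.put("counter_x", 80L); // VAL_X80 from the example above

            LocallyCachedCounters instance1 = new LocallyCachedCounters(shared);
            LocallyCachedCounters instance2 = new LocallyCachedCounters(shared);

            instance2.read("counter_x");        // instance 2 caches 80 ("10:25 am")
            instance1.write("counter_x", 100L); // instance 1 writes 100 ("10:30 am")

            // Instance 2 still answers from its local cache: stale!
            System.out.println(instance2.read("counter_x")); // prints 80, not 100
        }
    }

A TTL on the local cache would shrink the window in which instance 2 answers with 80, but there is always some window; that is why local caching only really suits data that is immutable or that the application can tolerate reading slightly stale.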