Friday, October 30, 2009

Notes on Memcached

Some notes about Memcached and its architecture.

How does it work?
Memcached is organized as a farm of N servers. The storage model can be thought of as one huge HashTable partitioned among these N servers.

Every API request takes a "key" parameter. The client lib performs a 2-step process ...
  • Given the key, locate the server
  • Forward the request to that server
The server receiving the request does a local lookup for that key. The servers within the farm don't talk to each other at all. Each server uses asynchronous, non-blocking I/O, so one thread can handle a large number of incoming TCP sockets. Actually a thread pool is used, but the number of threads is independent of the number of incoming sockets. This architecture is highly scalable for a large number of incoming network connections.
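As a rough sketch (not the actual client library code), the 2-step client-side lookup might look like this in Python, assuming a simple hash-mod-N scheme and a made-up server list:

```python
import hashlib

# Hypothetical farm of N servers (addresses are made up).
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def server_for(key, servers=SERVERS):
    # Step 1: hash the key to deterministically pick one server.
    digest = hashlib.md5(key.encode()).hexdigest()
    index = int(digest, 16) % len(servers)
    # Step 2: the client would forward the get/set request to this server.
    return servers[index]
```

Because the mapping is a pure function of the key, every client in the farm agrees on which server owns which key without any coordination.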

Memcached provides a HashTable-like interface, so it has ...
  • get(key)
  • set(key, value)
Memcached also provides a richer "multi-get" so that one read request can retrieve values for multiple keys. The client library will issue requests to multiple servers and do the lookups in parallel.
  • get_multi(["k1", "k2", "k3"])
Some client libs offer a "master-key" concept such that a key contains 2 parts, the prefix master-key and the suffix key. In this model, the client lib only uses the prefix to locate the server (rather than looking at the whole key) and then passes the suffix key to that server. So the user can group entries to be stored on the same server by giving them the same prefix key.
  • get_multi(["user1:k1", "user1:k2", "user1:k3"]) -- This request goes to the single server hosting all keys of "user1:*"

For updating data, Memcached provides a number of variations.
  • set(key, value, expiration) -- Memcached guarantees the item will no longer be served from the cache once the expiration time is reached. (Note that the item may be evicted before expiration when the cache is full)
  • add(key, value, expiration) -- Succeeds only when no entry for the key exists.
  • replace(key, value, expiration) -- Succeeds only when an entry for the key already exists.
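To make these semantics concrete, here is a toy in-memory model (not Memcached itself, just an illustration) of set/add/replace with expiration:

```python
import time

class MiniCache:
    """Toy in-memory model of Memcached's set/add/replace semantics."""
    def __init__(self):
        self._store = {}  # key -> (value, expire_at)

    def _live(self, key):
        entry = self._store.get(key)
        if entry is None:
            return False
        value, expire_at = entry
        if time.time() >= expire_at:   # expired entries are treated as absent
            del self._store[key]
            return False
        return True

    def set(self, key, value, expiration):
        self._store[key] = (value, time.time() + expiration)
        return True

    def add(self, key, value, expiration):
        if self._live(key):            # succeeds only if the key is absent
            return False
        return self.set(key, value, expiration)

    def replace(self, key, value, expiration):
        if not self._live(key):        # succeeds only if the key exists
            return False
        return self.set(key, value, expiration)

    def get(self, key):
        return self._store[key][0] if self._live(key) else None
```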
Server Crashes
When one of the servers crashes, all entries owned by that server are lost. Higher resilience can be achieved by storing redundant copies of data on different servers. Memcached has no support for data replication; this has to be taken care of by the application (or the client lib).

Note that the default server hashing algorithm doesn't handle growing or shrinking the number of servers very well. When the number of servers changes, the ownership equation (hash(key) mod N) gives a different answer for almost every key. In other words, if the crashed server needs to be taken out of the pool, the total number of servers decreases by one and all existing entries need to be redistributed to different servers. Effectively, the whole cache (across all servers) is invalidated even when just one server crashes.

So one approach to address this problem is to keep the number of Memcached servers constant across crashes. We can have a monitor server that detects the heartbeat of every Memcached server and, when a crash is detected, starts a new server with the same IP address as the dead one. In this case, although the new server will still have lost all its entries and has to repopulate the cache, the ownership of the keys is unchanged and data on the surviving nodes doesn't need to be redistributed.

Another approach is to run logical servers within a farm of physical machines. When a physical machine crashes, its logical servers are restarted on the surviving physical machines. In other words, the number of logical servers is unchanged even when crashes happen. This logical server approach is also good when the underlying physical machines have different memory capacities: we can start more Memcached processes on the machines with more memory and spread the cache proportionally to memory capacity.

We can also use a more sophisticated technique called "consistent hashing", which localizes the ownership change to just the neighbor of the crashed server. Under this scheme, each server is assigned an id in the same key space as the keys. The owner of a key is the server whose id is the first one encountered when walking from the key in the anti-clockwise direction around the ring. When a server crashes, its immediate upstream neighbor (walking along the anti-clockwise direction) adopts the key ownership of the dead server, while the key ranges owned by all other servers are unchanged.
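A minimal sketch of a consistent-hash ring in Python (the direction of the walk is just a convention; this version assigns a key to the first server id at or after the key's position, wrapping around):

```python
import bisect
import hashlib

def _hash(s):
    # Map any string onto the ring's key space.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: a key belongs to the first server id
    at or after the key's position, wrapping around the ring."""
    def __init__(self, servers):
        self._ring = sorted((_hash(s), s) for s in servers)

    def server_for(self, key):
        h = _hash(key)
        points = [p for p, _ in self._ring]
        i = bisect.bisect_left(points, h) % len(self._ring)
        return self._ring[i][1]

    def remove(self, server):
        # Only keys owned by the dead server move to its ring neighbor;
        # every other key keeps its owner.
        self._ring = [(p, s) for p, s in self._ring if s != server]
```

Real client libraries typically also place many virtual points per server on the ring to balance the load, which this sketch omits.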


Each request to Memcached is atomic by itself, but there is no direct support for atomicity across multiple requests. However, the app can implement its own locking mechanism using the "add()" operation provided by Memcached as follows ...
success = add("lock", null, 5.seconds)
if success
  set("key1", value1)
  set("key2", value2)
  delete("lock")   # release the lock
else
  raise "fail to get lock"
end

Memcached also supports a "check-and-set" mechanism that can be used for optimistic concurrency control. The basic idea is to obtain a version stamp when getting an object and to pass that version stamp in the set method. The system verifies the version stamp to make sure the entry hasn't been modified by someone else, and otherwise fails the update.
data, stamp = get_stamp(key)
check_and_set(key, value1, stamp)
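In Memcached's actual wire protocol this pair is spelled "gets" (which returns a cas token) and "cas"; the names above are pseudocode. Here is a toy Python model of the mechanism, including the usual retry loop:

```python
import itertools

class CasCache:
    """Toy model of Memcached's gets/cas pair for optimistic concurrency."""
    _token = itertools.count(1)   # monotonically increasing version stamps

    def __init__(self):
        self._store = {}  # key -> (value, stamp)

    def set(self, key, value):
        self._store[key] = (value, next(self._token))

    def gets(self, key):
        return self._store[key]   # returns (value, stamp)

    def cas(self, key, value, stamp):
        # The update succeeds only if nobody changed the entry in between.
        if self._store[key][1] != stamp:
            return False
        self.set(key, value)
        return True

def increment(cache, key):
    # Retry loop: re-read and re-apply the change until cas succeeds.
    while True:
        value, stamp = cache.gets(key)
        if cache.cas(key, value + 1, stamp):
            return
```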

What Memcached doesn't do
Memcached's design goal is centered at performance and scalability. By design, it doesn't deal with the following concerns.
  • Authentication for client request
  • Data replication between servers for fault resilience
  • Keys longer than 250 characters
  • Values larger than 1MB
  • Storing collection objects
Data Replication DIY
First of all, think carefully about whether you really need data replication at the cache level, given that cached data should always be recreatable from the original source (although at a higher cost).

The main purpose of a cache is performance. If your system cannot tolerate data loss at the cache level, rethink your design!

Although Memcached doesn't provide data replication, it can easily be done by the client lib or at the application level, along the lines described below.

On the client side, we can use multiple keys to represent different copies of the same data. A monotonically increasing version number is attached to the data. This version number identifies the most up-to-date copy and is incremented on each update.

When doing an update, we update all the copies of the same data via the different keys.
def reliable_set(key, versioned_value)
  key_array = [key+':1', key+':2', key+':3']
  new_value = versioned_value.value
  new_version = versioned_value.version + 1
  new_versioned_value = combine(new_value, new_version)

  for k in key_array
    set(k, new_versioned_value)
  end
end
For reading the data from the cache, use "multi-get" on the multiple keys (one per copy) and return the copy that has the latest version. If any discrepancy is detected (ie: some copies have a stale version, or some copies are missing), start a background thread to write the latest version back to all copies.
def reliable_get(key)
  key_array = [key+':1', key+':2', key+':3']
  value_array = get_multi(key_array)

  latest_version = 0
  latest_value = nil
  need_fix = false

  for v in value_array
    if v == nil
      need_fix = true                           # a copy is missing
    elsif v.version > latest_version
      need_fix = true if latest_version > 0     # copies disagree
      latest_version = v.version
      latest_value = v.value
    elsif v.version < latest_version
      need_fix = true                           # a copy is stale
    end
  end
  versioned_value = combine(latest_value, latest_version)

  if need_fix
    reliable_set(key, versioned_value)  # ideally done in a background thread
  end

  return versioned_value
end
When we delete the data, we don't actually remove it. Instead, we mark the data as deleted but keep it in the cache and let it expire.

User Throttling
An interesting use case other than caching is to throttle users that are too active. Basically you want to disallow requests from a user that come too frequently.
user = get(userId)
if user != nil
  # disallow request and warn user
else
  add(userId, anything, inactive_period)
  # handle request
end
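The same idea as a runnable Python sketch, using an expiring add() over a plain dict to stand in for the Memcached farm (names here are illustrative):

```python
import time

_seen = {}  # userId -> expiry time; stands in for the Memcached farm

def add(key, ttl):
    """Toy add(): succeeds only if the key is absent or already expired."""
    now = time.time()
    if key in _seen and _seen[key] > now:
        return False
    _seen[key] = now + ttl
    return True

def handle(user_id, inactive_period=1.0):
    # add() fails while the user's throttle entry is still live,
    # so a caller coming back too soon is rejected.
    if add(user_id, inactive_period):
        return "handled"
    return "throttled"
```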

Monday, October 26, 2009

Machine Learning: Association Rule

A typical example is the "market-basket" problem. Let's say a supermarket keeps track of all purchase transactions. Each purchase transaction is a subset of all the items available in the store, e.g. {beer, milk, diaper, butter}.

The problem is: by analyzing a large set of transactions, can we discover correlations between subsets? ie: people buying milk and butter have a high tendency to buy diapers, or people buying diapers tend to buy soda and ice-cream.

Such correlation is called an association rule, which has the following form:
A => B where A, B are disjoint subsets of U (a universal set)

This rule can be interpreted as: from the transaction statistics, people buying all items in set A tend to also buy all items in set B.

Note that people buying both set A AND set B is denoted as (A union B) rather than (A intersect B).

There are two concepts that need to be defined here ...

"Support" is defined with respect to a subset X as the % of total transactions that contain subset X. This can be written as P(contains X). e.g. The support of {beer, diaper} is P(contains {beer, diaper}), which means: if we randomly pick a transaction, how likely is it to contain both beer and diaper.

The "support" of an association rule A => B is defined as the support of (A union B).

"Confidence" is defined with respect to a rule (A => B): given that we know a transaction contains A, how likely is it that it also contains B.

P(contains B | contains A) = P(contains A union B) / P(contains A), which is the same as Support(A union B) / Support(A)
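These two definitions translate directly into code. A small sketch computing support and confidence over a list of transactions:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, a, b):
    """P(contains B | contains A) = support(A union B) / support(A)."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)
```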

Mining Association Rules

The problem is how we can discover the association rules that have high enough "support" and "confidence". First of all, thresholds for "support" and "confidence" are set according to domain-specific concerns. Then there are two phases.

1) Extract all subsets X where support(X) > thresholdOfSupport
2) For each extracted subset X, discover rules A => B where A is a subset of X and B is (X - A)

1) is also known as the "finding frequent subsets" problem. A naive implementation generates all possible subsets and checks their support values; this approach has exponential complexity O(2^N).

Apriori Algorithm to find frequent subsets

This algorithm exploits the fact that if X is not a frequent subset, then (X union Anything) can never be a frequent subset. So it starts by scanning small subsets and throws away those that don't have high enough support. In other words, it prunes the tree as it grows.

Let's say the universal set is {I1, I2, I3, I4, I5, I6}

First round:
  • Generate possible candidates of 1-item subset. ie: {I1}, {I2}, ... {I6}
  • Find out all supports of the candidate set. ie: support({I1}), support({I2}), .... support({I6})
  • Filter out those whose value < supportThreshold
Second round:
  • From the surviving 1-item subsets, generate possible 2-item subset candidates. ie: {I1, I2}, {I1, I4} ... Note that we can skip any subset containing an item that was filtered out in the first round (say I3).
  • Find out all supports of the 2-item candidate set.
  • Filter out those whose value < supportThreshold

K round: (repeat until no k-1 item subsets survive)
  • From the surviving k-1 item subsets, generate possible k-item subset candidates by adding one more item that is not already in the k-1 item subset. Skip any k-item subset that contains a thrown-away k-1 item candidate from the last round.
  • Calculate the support of k-item candidates
  • Throw away those whose support value < supportThreshold
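The rounds above can be sketched as a level-wise loop in Python (a minimal version: real Apriori implementations add many counting optimizations):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-subset mining: grow candidates one item at a
    time, pruning any candidate with an infrequent (k-1)-item subset."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    freq = {}
    current = [frozenset([i]) for i in items]   # round 1: 1-item subsets
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c for c, cnt in counts.items() if cnt / n >= min_support}
        freq.update({c: counts[c] / n for c in survivors})
        # Next round: merge surviving k-item subsets into (k+1)-item
        # candidates, keeping only those whose k-item subsets all survived.
        k += 1
        merged = {a | b for a in survivors for b in survivors if len(a | b) == k}
        current = [c for c in merged
                   if all(frozenset(s) in survivors for s in combinations(c, k - 1))]
    return freq   # itemset -> support
```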
Find association rules from frequent subsets

After finding a frequent k-item subset X, we want to find its subsets A such that the confidence value of (A => X-A) is higher than the confidence threshold.

Note that confidence = support(X) / support(A)

Since for a given X, support(X) is fixed, we start by trying to find the lowest support(A), because that gives the highest confidence. So we run the process in reverse, starting from the largest subsets A within X and shrinking A in each round.

Notice that if support(A) is not low enough, there is no need to try any subset A' of A, because support(A') can only be higher. Therefore we only spend effort on subsets of the surviving A's.

First round:
  • Within X, generate possible candidates of k-1 item subsets.
  • Find out the confidence of each candidate.
  • Filter out those whose confidence value < confidenceThreshold
  • For each surviving k-1 item subset A, emit the rule (A => X-A)
J round: (repeat until no j-1 item subsets survive)
  • Within each surviving j-item subset A, generate possible candidates of j-1 item subsets.
  • Find out the confidence of each candidate.
  • Filter out those whose confidence value < confidenceThreshold
  • For each surviving j-1 item subset A', emit the rule (A' => X-A')

Note that confidence is absolute but not relative.
When (A => B) has confidence = 75%, it is also possible that (!A => B) has confidence = 90%. In other words, some rules may contradict each other, and usually the one with higher support and confidence wins.
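The shrinking search over antecedents described above can be sketched as follows (support is passed in as a function, e.g. one backed by the transaction counts):

```python
from itertools import combinations

def rules_from(x, support, min_confidence):
    """From a frequent itemset x, emit every rule A => X-A whose
    confidence clears the threshold, shrinking A one item per round."""
    x = frozenset(x)
    rules = []
    candidates = {frozenset(c) for c in combinations(x, len(x) - 1)}
    while candidates:
        survivors = set()
        for a in candidates:
            conf = support(x) / support(a)   # confidence of A => X-A
            if conf >= min_confidence:
                rules.append((a, x - a, conf))
                survivors.add(a)
        # Only shrink surviving antecedents: a subset of A has support at
        # least as high as A, hence confidence at least as low.
        candidates = {frozenset(c)
                      for a in survivors if len(a) > 1
                      for c in combinations(a, len(a) - 1)}
    return rules
```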

Wednesday, October 7, 2009

Re-engineer Legacy Architecture

Legacy system may sound like a negative term, but in fact it is the result of a successful system. Most legacy systems lie at the core business operations of many enterprises. In fact, it is exactly because of their criticality to the business that no one dares to make changes, because a small bug can have a huge financial impact.

Most of these legacy systems started with a clean design, because the original problem they tried to solve was smaller and well-defined. However, as business competition and organizational evolution continuously demand new enhancements/features within a relatively short timeframe, these new features typically get implemented in a completely separate module so that existing working code won't be touched. Common functionality coded in the original system gets copied into the new module. This is a very common syndrome of what I call "reuse via copy and paste".

Here is an earlier blog about some common mindset that is causing the formation of bad code.

Basically, the idea of "not touching existing working code" encourages a "copy and paste code" culture, which over time causes a lot of code duplication across many places. Once a bugfix is developed, you need to make sure the fix is put into all the copies. Once you enhance a feature, you need to make sure to roll it into all the copies. When there are 20 different places in the code doing the same thing (but slightly differently), you start losing visibility of which piece of code holds the ultimate responsibility for a given piece of logic. Now the code is very hard to maintain and also hard to understand.

Because you cannot understand the code, you are even more scared of making changes to it (since it is working). This further encourages you to put new features in completely separate modules, which further worsens the situation. The cycle repeats.

Over a period of time, the code becomes so unmaintainable that adding any new feature takes a long time and usually breaks existing code in many places. The development team doesn't feel productive and works with low morale. In my past career, I was brought in to help in this situation.

At a high level, here are the key steps ...

1. Identify your target architecture

Define a "to-be" architecture that can support the business objectives for the next 5 years. It is important to purposely ignore the current legacy system at this stage, because otherwise you won't be able to think "outside the box".

Be careful not to give the impression that this exercise is going to suggest throwing the existing system away and starting everything from scratch. It is important to understand that the "to-be" architecture is primarily a thought exercise to define our target. And we should clearly separate our "vision" from the "execution", which we shouldn't be worrying about at this stage.

The long-term architecture establishes a vision of where we want the ultimate architecture to be and serves as our long-term target. A core vs non-core analysis is also necessary to decide which components should be built vs bought.

It is also important to get a sense of possible future changes and build enough flexibility into the architecture so that it can adapt when those changes happen. Knowing what you don't know is very important.

A top-down approach is typically used to design the to-be architecture. The level of detail is determined by how well the requirements are known and how likely they are to change in the future. Since the to-be architecture mainly serves as a guiding target, I usually won't go too deep into implementation details at this stage.

The next step is to get on to the ground to understand where you are now.

2: Understand your existing system

To get up to speed quickly, my first step is to talk to the people who understand the current code base as well as the pain points. I'd also skim through existing documents and presentation slides to get a basic understanding of the existing architecture.

If people who are knowledgeable about how the legacy system works are still available, conducting a formal architecture review process can be a very efficient way to start understanding the legacy system.

If these people have already left, a different reverse engineering process is needed.

3: Define your action plan

At this point, you have a clear picture of where you are and where you want to be. The next step is to figure out how to move from here to there. In fact, this stage is the hardest because many factors need to be taken into consideration.
  • Business priorities and important milestone dates
  • Risk factors and opportunity costs
  • Organization skill set distribution and culture
The next step is to construct an execution plan that optimizes business opportunities and minimizes cost and risk. Each risk also needs an associated contingency plan (plan B). In my experience, the action plan usually takes one of the following options.

Parallel development of a green-field project
A small team of the best developers works in parallel (alongside the legacy system) to create the architecture from scratch. The latest, best-of-breed technologies will typically be used, so that most of the infrastructure pieces can either be bought or fulfilled by open source technologies. The team focuses on just rewriting the core business logic. The green-field system is typically easier to understand and more efficient.

After the green-field system is sufficiently tested, the existing legacy system is swapped out (or kept as a contingency backup). Careful planning of data migration and traffic migration is important to make sure the transition is smooth.

One problem with this approach is development cost, because during the transition period you need to maintain two teams of developers working on two systems. New feature requirements may come in continuously and you may need to do the same thing twice in both systems.

Another problem is the morale of the developers who maintain the legacy system. They know the system is going away and that they may need to find another job, so you may end up losing the people who are knowledgeable about your legacy system even faster.

Incremental refactoring
Another approach is to refactor the current code base and incrementally bring it back into good shape. This involves repartitioning the responsibilities of existing components and breaking down complex components and long methods.

Sometimes I run into code that I cannot understand: it may be dead code that never gets exercised, or logic hidden behind many levels of indirection. What I typically do is add trace statements into the code and rerun the system to see when this code executes and who is calling it (by observing the stack trace). I will also put a wrapper around the code that I don't understand, then shrink the wrapper's perimeter to the point where I can safely swap out the component I don't understand.
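The tracing technique can be sketched as a small wrapper; here is a hypothetical Python version (the decorator and the `mystery_logic` routine are illustrative, not from any real system):

```python
import functools
import traceback

def traced(fn):
    """Wrap code we don't understand: log every call and its caller so a
    rerun of the system shows when the code executes and who invokes it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        caller = traceback.extract_stack(limit=2)[0]   # the calling frame
        print(f"TRACE {fn.__name__} called from {caller.name} "
              f"({caller.filename}:{caller.lineno})")
        return fn(*args, **kwargs)
    return wrapper

@traced
def mystery_logic(x):
    # stands in for a legacy routine whose purpose is unclear
    return x * 2
```

Because the wrapper preserves the wrapped function's behavior, it can be left in place while you narrow down what the component actually does.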

It is also quite common that legacy systems lack unit tests, so a fair amount of effort may need to be spent writing unit tests around the components.

One problem with the refactoring approach is that it is not easy to get management buy-in, because they don't see any new features coming out of the engineering effort being spent.

Architecture Review Process

If you are lucky enough to still have the engineers who are knowledgeable about the current system around, then conducting a formal "Architecture Review Process" is probably the most efficient way to understand how the existing system works.

However, if these people have already left the company, then a different reverse engineering effort needs to be taken instead.

The process usually involves a number of key people:
  • A facilitator who orchestrates the whole review process and controls the level of depth at different stages
  • A recorder who documents key points, ideas, observations and outstanding issues throughout the review process
  • Key architects and developers who collectively understand the details of the legacy system, and are able to get down to any level of detail (even a code walk-through) if necessary
  • A domain expert who understands every detail of how people use the system, what their current pain points are and what features will help them most. The domain expert also helps to set business priorities between conflicting goals

The architecture review process has the following steps.

Use Cases and Actors
With the business domain expert, we focus on the "actors" (who use the system) as well as the "use cases" (how they use it). We also look at the key metrics for measuring the efficiency of their tasks (e.g. how long it takes to complete a use case).

Activity Analysis
Drilling down from each use case, we identify the activities that each actor performs. We look at what data needs to be captured at each activity and how actors interact with each other as well as with the system.

At this point, we should establish a good external view of the system. Now we dig into the internals of it ...

Technology Stack
The purpose here is to understand the building blocks underlying the system and get a good sense of whether the build vs buy combination is right. Things like which programming language, Java vs .NET, which app server, what DB, ORM vs direct SQL, XML vs JSON, which IOC or AOP container, which messaging framework ... etc need to be discussed. We also need to distinguish the core features (which you mostly want to build) from the non-core features (where you mostly want to leverage 3rd-party code). By the end of this exercise, we'll have a very good understanding of the foundation on which we write our code, and perhaps we can also identify certain areas where we can swap in 3rd-party code.

Component Analysis
This is the portion where most of the time is spent. Here we dissect the whole system into components. It starts with the architect highlighting a list of major components of the current system. For each component, we look at
  • The responsibility of the component
  • The persistent data owned by the component and the life cycle of maintaining this data
  • The interface of the component
  • The thread model executing the logic of the component (ie: caller thread vs listening thread vs a newly spawned thread) as well as any concurrent access implications
  • The potential bottlenecks of this component and how we can remove them when they occur
  • How the component scales along different growth dimensions (e.g. more users, more data, higher traffic rate ... etc)
  • The impact if this component crashes, and how recovery happens
It is important to identify whether the component communicates across VM boundaries. If so,
  • What is the authentication and authorization mechanism ?
  • What is the message format being communicated ?
  • Is the data transfer in clear text or encrypted ? And where is the secret key being stored ?
  • What is the handshaking sequence (protocol) ?
  • Is the component stateful or stateless ?
Since we have already gone deep into the architecture, I usually take a further step and drill into the code by asking the developers to walk me through the code of some key components. This usually gives me a good sense of the code quality: whether the code is easy to read, whether the logic is easy to follow, whether methods are too long, whether there is duplicated logic scattered around ... etc, and more importantly, whether there are sufficient unit tests.

Maintainability Analysis
This focuses on the ongoing maintenance of the architecture, things like ...
  • Is the system sufficiently instrumented for monitoring purpose ?
  • When the problem happens, is there enough trace around to quickly identify what went wrong ?
  • Can the system continue working (with some tolerable degradation) when some components fail ?
Extensibility Analysis
Understand the parameters that affect the behavior of the system. When different change scenarios happen, how much code needs to change to accommodate them? Or can the system adapt just by changing configurable parameters? For example, does the system hard-code business rules or use some kind of rule engine? Does it hard-code the business flow or use some kind of workflow system? What if the system needs to serve a different UI device (like a mobile device)?


The output of an architecture review process is typically
  • A set of documents/diagrams of the key abstractions that give a better view of the overall system. These documents should help a newcomer get up to speed quicker, as well as communicate the architecture to a broader audience.
  • A set of recommended action plans on what can be done to improve the current architecture.