CHECKPOINT AND FINAL REPORT
Our goal is to design a parallel, distributed key-value store using consistent hashing on a cluster of Raspberry Pis.
Our key-value store supports the following operations:
- PUT: Insert/Update a new/existing key-value pair
- GET: Retrieve the value associated with a given key
- DEL: Remove a key-value pair from the database
Distributing the database across several nodes provides two main advantages:
- Several PUT/GET/DEL operations can be served in parallel
- Storage load is spread across multiple nodes
The basic architecture consists of a central coordinator and several storage nodes. Clients send requests to the coordinator, which routes them to the appropriate storage nodes; routing decisions are based on a consistent-hashing ring. The central coordinator knows about all the storage nodes present in the system. Each storage node stores its assigned key-value pairs in memory and serves the requests routed to it. We'll be programming our project in Go, as it enables parallelism via low-cost goroutines. The Go RPC package spawns a new goroutine for each incoming request, which enables parallelism across multiple requests.
Claim: A simple key-value store has no compute-intensive tasks, so it is wasteful to implement the system on beefy compute nodes, which are very power hungry.
We plan to build this system using wimpy, power-efficient compute nodes (Raspberry Pis). We want to benchmark the system and assess whether it is worthwhile to build a key-value store out of Raspberry Pis. We would also like to scale the storage nodes dynamically as the in-memory data structures on individual nodes start to bloat, using consistent hashing to redistribute the data among nodes.
A few challenges that we foresee:
- Setting up the cluster of Raspberry Pis: We need to understand the basic CPU and networking behaviour of the Raspberry Pis, and make sure they can talk to each other on a local network. For our prototyping, we'll connect the Pis on a local network through a switch/hub.
- Setting up the software framework so that all the nodes in the system have the required communication channels.
- Correctly implementing consistent hashing: This is the crux of our project. We need to ensure that every request is sent to the appropriate storage node based on its key, following the correct semantics of consistent hashing.
RESOURCES
- Five Raspberry Pi 2 boards, each with 1 GB RAM and 8 GB flash storage
- TP-link 8 port networking switch
- A whole lot of ethernet cables
- Golang, with a focus on the RPC package
GOALS AND DELIVERABLES
A basic system with all the communication channels set up between the nodes:
We want to build a basic system with storage nodes labelled such that they follow the semantics of consistent hashing. We will need to prototype and test the RPC functionality by passing known objects between different nodes in the system.
Benchmark the system and compare its performance per watt and performance per dollar with a beefy compute node:
We want to run a few workloads on this system and benchmark its absolute performance in terms of latency, along with its performance per watt and performance per dollar. Based on these benchmark numbers we want to identify the tradeoffs, if any, of using this cluster of wimpy nodes, and comment on whether such a system is worthwhile.
Identify any bottlenecks in the initial design through benchmarking, and optimize to overcome them:
Fine-tune parameters to find the right mix of nodes for different workloads. We want to reason about the observed behaviour of the system and, based on that reasoning, identify any bottlenecks in the design. If there are bottlenecks, we will optimize for them and benchmark again.
A few addons that we want to explore if time permits:
- Replication of data across multiple storage nodes for basic fault tolerance/improved throughput, using Paxos or two-phase commit
- Optimizing the system for large "value" objects like images, audio files
- Replicating the functionality of the single coordinator
- Persisting the in-memory key-value store on flash to make the data durable
- Comparing a C/C++ implementation of the same design against the Go one, to gauge the efficiency of Go's touted goroutines versus worker threads
We hope to demo a working model of this proposed small-scale key-value architecture, demonstrating all the operations our system guarantees, namely PUT, GET and DEL. We also hope to compare the performance of this distributed key-value store against a single-node version of it.
As of now it is difficult for us to predict performance specifics because:
- We are new to Raspberry Pis and therefore we would need to profile the system with our workloads.
- We think that the performance of the system will change as we increase the size of "values" associated with each key.
- We hope the performance of the N-storage-node version of the distributed key-value store will be K times better than the single-node version, for some K <= N
Raspberry Pis are cheap and power efficient, and provide fast access to persistent storage (flash vs. disks). Pis are readily accessible, so we hope our project can be used by other students to play around with simple distributed systems for academic purposes. We use Go as it is an effective systems programming language for designing thread-safe, efficient parallel distributed systems.
- April 4 - April 10: Familiarize ourselves with Go and read up on consistent hashing.
- April 11 - April 17: Implement the communication framework. Design software APIs.
- April 18 - April 20: Implement a correct version of consistent hashing.
- April 21 - April 24: Come up with workloads to test the system and set up the cluster of Raspberry Pis.
- April 25 - April 28: Implement scaling with data migration.
- April 29 - May 1: Benchmark the system, identify the bottlenecks in the system and optimize.
- May 2 - May 8: Make the system robust. Work on future goals. Make final presentation. Demo!!!
We tried running a workload of 1024 requests with 128-byte values. The workload was split across 5 to 100 clients and was run on both Raspberry Pis and Andrew machines in various configurations: 1 master + 2 workers, 1 master + 3 workers and 1 master + 4 workers. We observed that the Pis scaled poorly because of the single master, but the maximum latency was still below 100 ms. For any decent web application, this is more than acceptable. As a result, Pis could be an acceptable replacement for powerful disk-based compute servers for such applications.
Performance comparison between Raspberry Pis and Andrew Machines:
One of the main motivations behind using Raspberry Pis was their power efficiency. We modeled performance-per-watt graphs for both the Raspberry Pis and the Andrew machines to compare their relative power efficiencies. The results are as follows:
We also observed that increasing the size of the stored values gives similar performance up to values of 1 KB. Above that, especially at larger sizes like 1 MB values, continuously streaming data increases average latency to up to 2400 ms. This is definitely not acceptable, especially for a web application that requires latencies on the order of 100 ms. The cause is mainly the single master's NIC being overwhelmed by transferring 1 MB of data per client request, as all requests are streamed worker -> master -> client. Our findings are shown below:
To improve the latency for big values, we removed the master from the critical path of data movement. In the new system, the master just forwards the IP address and port number of the worker chosen to handle the request to the client. The client then establishes a new connection with the worker and gets the data directly from it. This optimization reduces data movement, as the master no longer needs to download data from the worker and stream it to the client. It also allows requests to different workers to be served in parallel by removing the master's NIC, which was a bottleneck, from the data path. We tried the same workload as before with 1 master and 2 workers, using 256 STORE requests of 1 MB values. We chose node IDs in advance such that 128 requests went to each worker. The readings are shown below in comparison to the previous latency values:
So far we have achieved the following:
We started the project by reading up on consistent hashing. We tried to understand how it works and what its advantages are, and identified some simple ideas that we can use in our project.
We are new to Go, so we read up on the language and a few presentations about it, and wrote a few simple programs to get used to it.
We have written simple programs in Go to test that we are able to communicate between nodes. For now, our communication assumes that the IP addresses of the participating nodes are already known. We plan to build a node-discovery framework so that we can communicate with the devices even when their IP addresses are dynamic. We will still assume that the IP address of the master node is known to all the slave nodes.
Having done this, we also experimented with the goRPC package to transfer data between nodes. We plan to use goRPC extensively for GET/PUT requests, so getting it to work correctly was crucial.
We don’t have access to the Raspberry Pis yet, so most of our development is being done on the GHC machines. We will still need some time to set up the cluster. In particular, we think most of that time will be spent installing and bringing up the Go compiler, assuming we don’t face much trouble with the default Linux distribution that comes with the Pi.
We are still deciding on the best type of value object to use to demo our key-value store, so that we can highlight the benefits of Raspberry Pis, if there are any. For now we are considering either small string phrases or 1 MB image files. We plan to experiment with both kinds of value objects and see how our system performs.
Our proposed goals were to implement a key-value store with GET, PUT and DEL API support. We were hoping to demonstrate dynamic scale-in/scale-out of the nodes, using consistent hashing, as the load on the system increases or decreases. This load can be described along two axes:
- Memory overhead on a single data node
- Number of requests seen by a single data node
As we scale-in/scale-out the cluster of Pis we were planning to compare its perf/watt and perf/dollar to a regular commercial machine.
Our goals for the project are mostly the same as initially proposed. The only change is that we had hoped to perform dynamic scaling of the system, but we don't think we can write an efficient implementation (without a big lock) in the time available. For the project presentation day, we think demoing a working model of our system is feasible. We plan to show graphs of average response time as the number of clients grows, along with performance/watt and performance/dollar. For performance/watt we currently plan to use the power numbers from the hardware specification documents of the machines involved. To perform a live demo, we will assume that the IP address of the master node is known to all the slave nodes. One addition to our initial goals is that we hope to implement a very basic slave-node discovery protocol that will run on the master. We don't know if we will be able to implement this in the time we have, but as a fallback we will use static IP addresses or a configuration file containing the IP addresses of all the nodes, so that communication between nodes is easy.