Mini-project 2: scalable key-value store

Due Oct 18 by 9pm
Points 20
Submitting a file upload
File Types pdf

Due date

You must share a working copy of your service with your partner group early enough that they can get it running by Monday, October 14th at midnight. The report is due Friday, October 18th at 9 pm in Canvas

Groups

This project must be done in groups of 3 or 4 people, using the same group as mini-project 1. If you want to change groups please let the instructor know as soon as possible.
You will be partnered with the same groups from mini-project 1 so you can use most of the interface you previously agreed upon.

Service

The service to provide is a simple key-value store running as a set of replicated instances, where keys and values are strings. You service should be available, meaning that it should continue to provide service when many failures happen. It should be scalable, so you can run 100s of instances of the service. It should be as consistent as possible, so it should make an effort to return the latest value put() when there are no failures, and soon after failures are resolved. If a crash happens during a write, then the data may have been written even if the client never learned of its success. Your service should recover from failures, so you will need to store data persistently.

You can assume your service will be configured with 100 or more nodes, even if only one of them (or zero) is alive at a given time.

There is no standard protocol between servers or between clients and servers. However, there is a standard client interface (API) that people can program against to access your store.

Your service must allow multiple clients to connect simultaneously and access the service. In addition, clients may only be able to connect to a subset of the instances at a given time (more details in the API).

You can choose any language to implement the service.

Client library

The client interface is the similar to mini-project 1 -- differences are marked in bold. You must implement a client library to access key/value data. The interface is detailed in the specification below.

You will provide a shared library named "lib739kv.so" implementing the interface to your server.

The client code must run on standard CS department workstations. Information on how to remotely access them is available here:
- Instructional Facilities | CSL Docs (wisc.edu)
- Connect to CS Workstation (wisc.edu)
You can assume that there are a fixed set of backend servers and that instances fail and recover, but they won't be added and removed from the system. You should implement multiple instances as
You need to handle nodes failing and then recovering

Here are the functions you must implement, callable from C or C++:

int kv739_init(char *config_file) - provide file name containing a list of service instances. Each instance has the format "host:port". This function initializes the client code. Returns 0 on success and -1 on failure. If host is numeric (with period separators) it is an IP address, otherwise a DNS name. This file must contain the complete set of instances
int kv739_shutdown(void) - shutdown the connection to a server and free state. After calling this, client code should be able to call kv739_init() again to the same or a different set of servers.
int kv739_get(char * key, char * value) - retrieve the value corresponding to the key. If the key is present, it should return 0 and store the value in the provided string. The string must be at least 1 byte larger than the maximum allowed value. If the key is not present, it should return 1. If there is a failure, it should return -1.
int kv739_put(char * key, char * value, char * old_value) - Perform a get operation on the current value into old_value and then store the specified value. Should return 0 on success if there is an old value, 1 on success if there was no old value, and -1 if the server had an internal error and is guaranteed to not have stored the value, and -2 if there was a communication failure, so the state of the server is ambiguous. The old_value parameter must be at least one byte larger than the maximum value size.

There are restrictions on keys and values:

Keys are valid printable ASCII strings without special characters and are 128 or less bytes in length. They cannot include the "[" or "]" characters.
Values are valid printable ASCII strings with neither special characters nor UUencoded characters and 2048 or less bytes in length. They cannot include the "[" or "]" characters.

For testing purposes, you will implement functions that artificially restrict terminate or restrict communication between replica instances:

int kv739_die(char * server_name, int clean) -- tell the server to terminate itself (call exit()). The server name has the format "host:port". If clean == 1, then the server can flush state and notify other machines it is failing -- e.g., a fail-stop failure. if clean = 0, it should call exit() and terminate immediately. This function returns 0 if it successfully contacts a server and the server initiates self-destruction, or -1 if there is any kind of failure. On failure, it is indeterminate whether the server will die.
You may need to add your own method, in conjunction with your partner group, for restarting services after a failure.

Specifications

Failures

Your service should be multiple processes on one more or computers, with failures simulated by terminating one or more of the processes (halting failures only) or using the testing interface.
The service should restart correctly after a failure and without losing any data was written before the crash
The service should provide some level of availability with most of the instances dead.

Scalability

You should allow up to 100 instances of the service.

Performance

Performance goal: you should seek maximum throughput and minimum latency for a variety of different client workload distributions, such as uniformly random or hot/cold distributions where 10% of keys are hot and receive 90% of requests
You should make an attempt to spread load across the instances.

Originality

You cannot use an existing distributed/networked key value service, such as Redis, to implement this assignment, or existing front-end proxies for load balancing. You must write the client library, decide how to handle remote communication, and the service code that responds to requests.
You can use existing storage engines for managing persistent data.
You may use communication packages such as gRPC to simplify communication.

Tests

You should write a test program that links against the shared library described above. You should write two kinds of tests:

Availability tests: Measure how many service instances need to be available for the service to be available.
Consistency tests: ensure that in the absence of failures, the service provides consistent results, and try to find cases when it can return inconsistent results.
Performance tests: measure the performance, as latency and throughput, for some workloads, including those with different popularity distributions. Also measure the load balance by looking at the CPU utilization of the service instances - they should be as even as possible.

In addition to testing your code, you will also run your tests against the servers & libraries from at least one other project group. This means that you will need to make it easy for someone else to install and run your services, and to use your library to connect to it.

You must test your code in two configurations

Your tests, your library, your server.
Your tests, your partner's library, your partner's server.

What to turn in

You will turn in a short report describing your effort that includes:

Your names, and the names of the members of your partner group
A short discussion of how you implemented your client and server, including availability and scalability guarantees. You don't need to describe in detail things you described in the report for project 1.
A specification of the protocol between client and server, such as what you used for communication (hand-crafted messages over sockets, gRPC, what the messages contain)
A description of the tests you wrote and an explanation why they are appropriate/suitable for testing your service. For example, how will they show the presence or absence of consistency guarantees? Please include the test methodology.
The results of the tests. Please use appropriate graphics/visualizations to present results.

Grading

Performance is important but not the primary concern of this project. We are more interested in the ability to return correct results under failure conditions.

For the writeup, we will look for these things:

Does the writeup adequately describe the project?
Does the design make sense?
Do the described tests adequately test the guarantees of the design?
Does the code interoperate correctly?

Rubric

Find a Rubric

Title:

Title

Criteria

Ratings

Pts

Edit criterion description Delete criterion row

Description of criterion

Range

threshold: 5 pts

Edit rating Delete rating

5 pts

Full Marks

Edit rating Delete rating

0 pts

No Marks

pts

5 pts

Total Points: 5 out of 5

Upload a file, or choose a file you've already uploaded.
File: remove empty attachment This file type is not allowed. Accepted file types are: pdf remove empty attachment This file type is not allowed. Accepted file types are: pdf


Send comment to instructor only Send comment to the whole group