Monday, June 24, 2013

Upload Strategy of Happiness

As I mentioned in the previous post, my GSoC project is to implement a new upload algorithm in Tahoe, called the "Upload Strategy of Happiness". This algorithm was developed by Kevan Carstensen, one of Tahoe's contributors, to satisfy the Servers of Happiness test by maximizing the distribution of file shares across the network. The relationship between shares and servers is modeled as a bipartite graph, where an edge exists from share n to server s if and only if s stores n.

The algorithm is as follows:

Let N be the set of all shares and S be the set of all servers.

  1. Construct a bipartite graph A from the existing allocations on the network such that an edge exists from share n to server s if and only if s stores n. 
  2. Calculate the maximum matching graph of A using the Edmonds-Karp algorithm and let this graph be A-max. 
  3. Let N-existing be the set of all shares in A-max and S-existing be the set of all servers in A-max. Construct the bipartite graph B whose shares are N - N-existing and whose servers are S - S-existing, where an edge exists from share n to server s if and only if s can hold n. 
  4. Calculate the maximum matching graph of B and let this graph be B-max.
  5. Merge A-max and B-max.

The product of merging A-max and B-max is the maximum distribution of shares over unique servers on the network. This technique is useful because it reuses as many existing shares as possible, reducing the amount of data that must be transferred. Once the shares have been distributed over the maximum number of unique servers, the remaining shares are distributed as evenly as possible by looping over the set of servers until there are no more shares to distribute. Ultimately, this solves edge cases in Tahoe that prevented certain server configurations from accepting new files.
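Sketched in Python, the five steps look roughly like this. This is a simplified illustration, not Tahoe's actual code: all function names are hypothetical, and a basic augmenting-path matcher stands in for Edmonds-Karp.

```python
def maximum_matching(shares, servers, has_edge):
    """Bipartite matching via augmenting paths.

    `has_edge(n, s)` says whether an edge exists from share n to
    server s. Returns a dict mapping each matched share to its server.
    """
    match = {}  # server -> share

    def augment(n, seen):
        for s in servers:
            if has_edge(n, s) and s not in seen:
                seen.add(s)
                # Take a free server, or re-route the share holding it.
                if s not in match or augment(match[s], seen):
                    match[s] = n
                    return True
        return False

    for n in shares:
        augment(n, set())
    return {n: s for s, n in match.items()}


def place_shares(shares, servers, stores, can_hold):
    """Steps 1-5: match existing placements first, then new ones."""
    # Steps 1-2: A-max over the existing allocations.
    a_max = maximum_matching(shares, servers, stores)
    # Step 3: B over the leftover shares and servers.
    left_shares = [n for n in shares if n not in a_max]
    left_servers = [s for s in servers if s not in a_max.values()]
    # Step 4: B-max.
    b_max = maximum_matching(left_shares, left_servers, can_hold)
    # Step 5: merge.
    a_max.update(b_max)
    return a_max
```

A real implementation would then hand any shares still unmatched to the round-robin pass described above, looping over the servers until every share has a home.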

Note: If you are unfamiliar with the maximum matching of bipartite graphs but interested in learning more, I highly recommend chapter 26 of Introduction to Algorithms, titled Maximum Flow.

Last week I was able to implement the algorithm and start refactoring the current uploader. While the new uploader isn't finished yet, some of the test cases that used to fail now succeed, which is pretty exciting. Hopefully I will be able to clean up my code and work out some of the bugs this week. To view my progress, check out the GitHub branch here.

Monday, June 17, 2013

File Upload in Tahoe-LAFS

Introduction

As a brief introduction, my name is Mark J. Berger and I'm a student at Stanford University studying computer science. As part of the Google Summer of Code (GSoC) program, I am improving Tahoe-LAFS on behalf of the Python Software Foundation. This blog will be dedicated to my work on Tahoe-LAFS and all other things GSoC-related. If you want to contact me or check out some of my personal projects, visit markjberger.com.

Tahoe-LAFS & Servers of Happiness

About Tahoe-LAFS

Tahoe-LAFS is an open source cloud-based storage system which distributes user data over multiple servers, similar to Dropbox and other cloud storage services. However, unlike other services, Tahoe is built with provider-independent security: all user data is encrypted in such a way that the service provider cannot access it, guaranteeing that the user's privacy is respected. For more information on what Tahoe-LAFS is and how it works, check out the about page on the Tahoe-LAFS website here.

Servers of Happiness

To provide data redundancy, Tahoe uses erasure encoding to split a file into n pieces, of which only k are required to reassemble the file. By default, Tahoe sets the encoding parameters to n = 10 and k = 3, so any file can be reassembled from three pieces even though ten pieces exist on the network.
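To make the default parameters concrete, here is a quick back-of-the-envelope calculation (plain arithmetic, not Tahoe code):

```python
# Tahoe's default encoding parameters.
k, n = 3, 10

# Each of the n shares is roughly 1/k the size of the file, so the
# total data stored on the network is n/k times the original size.
expansion_factor = n / k      # about 3.33x storage overhead

# Any k of the n shares reconstruct the file, so up to n - k shares
# can be lost before the file becomes unrecoverable.
tolerable_losses = n - k      # 7 shares
```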

While erasure encoding is great for redundancy purposes, it can be misleading. If all ten pieces of a given file sit on the same server, data redundancy hasn't actually increased. If that server fails, the data is no better off in ten pieces than it would be as a single file. Erasure encoding has only increased the size of the file without adding any system-wide benefit.

To mitigate this situation, Tahoe uses a criterion called "Servers of Happiness", which introduces a new parameter h. Tahoe ensures that the n shares are spread across at least h distinct servers, guaranteeing increased redundancy. However, Servers of Happiness is only a test, and the current upload algorithm was not built to pass it. As a result, there are situations where the Servers of Happiness test can be met, but the uploader does not know how to meet it, leading to upload failure. My project aims to fix this issue, and other issues that arise from the outdated uploader, by implementing an "Upload Strategy of Happiness" that knows how to meet the Servers of Happiness criterion. This will increase overall data redundancy, as well as remove inefficiencies in the uploader.
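As a toy illustration (not Tahoe's implementation), happiness can be computed for the simple case where each share lives on exactly one server. In that case it is just the number of distinct servers used, which equals the size of the maximum share-to-server matching:

```python
def happiness(placements):
    """Toy happiness measure for the one-server-per-share case.

    `placements` maps share number -> server name (a hypothetical
    helper, not Tahoe's API). With one server per share, the number
    of distinct servers equals the maximum matching size.
    """
    return len(set(placements.values()))


# All ten shares on one server: happiness is 1, so the upload
# fails the test for any h > 1.
assert happiness({i: "server-A" for i in range(10)}) == 1

# Shares spread over five servers: happiness is 5.
assert happiness({i: "server-%d" % (i % 5) for i in range(10)}) == 5
```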

Today is the first day of programming in GSoC and I've just started to refactor the uploader to implement the Upload Strategy of Happiness. In my next post I hope to explain the new strategy I will be implementing, and discuss any progress I have made this week. To follow my progress, subscribe to this blog, or follow me on GitHub and check out the project branch here.