Using AWS S3 as a database with PouchDB

While investigating options for a managed database on Amazon Web Services, I came up with an idea: what if we used Amazon’s S3 file hosting service as a database? My requirements were that it should be a document database and that it should run as cheaply as possible.

There were several obvious options already available: SimpleDB, which looks like it’s deprecated, and DynamoDB, which looks to be its successor. DynamoDB is a KKV (key-key-value) NoSQL database which is supposed to scale infinitely. The pricing is reasonable; however, you have to specify the read and write capacity and pay an ongoing cost based on what you have provisioned, even if you don’t use all of that capacity. This means that you essentially pay for more than your actual usage, and you have to keep monitoring usage levels and adjusting the capacity as necessary, which is a maintenance hassle.

There’s actually another NoSQL database on AWS, and that’s S3. S3 is basically a huge key-value store, sorted by key. Even though its main use case is storing files, the underlying architecture is hardly a file system. It automatically partitions the data inside a bucket and maintains a primary index on the key. Its design probably inspired DynamoDB. The cost of S3 is minuscule and pay-as-you-go: charges are essentially for storage, bandwidth and API calls. For small workloads with irregular demand, this would be a big cost saving compared to DynamoDB, where dedicated capacity needs to be provisioned.

To program effectively against S3 as a database, we need an API. I’ve previously used PouchDB, a small JavaScript-based NoSQL database inspired by CouchDB. PouchDB allows the creation of custom storage backends, so data can be stored in memory, in the browser’s IndexedDB or somewhere else. There isn’t a storage backend for S3, so I thought: why not make one?

LevelUP/LevelDOWN

There is an existing PouchDB storage backend that uses LevelUP, a high-level API that wraps LevelDB, a very basic NoSQL database developed by Google. You can plug in custom backends which implement the LevelDOWN API, allowing you to use the LevelUP API to write to different backends. It has a small API surface and is the basis of larger and more feature-rich databases. A number of PouchDB backends (like memory) are constructed by wrapping a custom LevelDOWN implementation with the LevelDB plugin (see pouchdb-adapter-leveldb-core).
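
To give a feel for how small that surface is, here is a minimal in-memory sketch of a store with LevelDOWN-like operations (put, get, delete, batch and an ordered iterator). It is written in Python purely for illustration. The class name TinyStore and its method signatures are made up for this example and are not the real LevelDOWN API, which is JavaScript and callback based.

```python
import bisect


class TinyStore:
    """Hypothetical in-memory store with a LevelDOWN-like surface."""

    def __init__(self):
        self._keys = []    # kept sorted, mirroring LevelDB's key ordering
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values[key]  # raises KeyError when the key is missing

    def delete(self, key):
        if key in self._values:
            del self._values[key]
            self._keys.remove(key)

    def batch(self, ops):
        # ops is a list of ("put", key, value) or ("del", key) tuples
        for op in ops:
            if op[0] == "put":
                self.put(op[1], op[2])
            else:
                self.delete(op[1])

    def iterator(self, gt=None):
        # Yield (key, value) pairs in key order, starting after `gt` if given
        start = 0 if gt is None else bisect.bisect_right(self._keys, gt)
        for key in self._keys[start:]:
            yield key, self._values[key]


store = TinyStore()
store.batch([("put", "b", "2"), ("put", "a", "1"), ("put", "c", "3"), ("del", "b")])
print(list(store.iterator(gt="a")))  # [('c', '3')]
```

Anything that can offer these few operations over sorted keys can, in principle, sit underneath LevelUP.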

Due to its small API surface, the easiest way for PouchDB to work with S3 would be to create an adapter for LevelDOWN, which I did and called S3LevelDOWN. The added bonus is that you can use it directly with LevelUP if you don’t need the full PouchDB functionality.

Implementing S3LevelDOWN

The most fundamental functionality is the ability to search from a start key (gt). Since S3 supports specifying a start key when listing, this maps neatly onto the calls LevelDB requires. However, due to the limited API provided by S3, there are inefficiencies in the workarounds needed to provide the full LevelDOWN API. The biggest is reverse ordering: since S3 always returns keys in ascending sorted order, the only way to reverse is to retrieve the keys in ascending order and then reverse the list.
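
As a rough illustration of the gt mapping, the sketch below pages through an ascending key listing that starts after a given key. It is not the S3LevelDOWN code: fake_list_objects is a hypothetical stand-in for an S3 listing call, a plain dict plays the bucket, and the page size is shrunk so the paging is visible.

```python
def fake_list_objects(bucket, start_after="", max_keys=1000):
    """Hypothetical stand-in for an S3 listing call: ascending keys, one page."""
    keys = sorted(k for k in bucket if k > start_after)
    page = keys[:max_keys]
    return {"keys": page, "truncated": len(keys) > max_keys}


def iterate_gt(bucket, gt="", page_size=1000):
    """Yield every key strictly greater than `gt`, following pages."""
    cursor = gt
    while True:
        page = fake_list_objects(bucket, start_after=cursor, max_keys=page_size)
        yield from page["keys"]
        if not page["truncated"]:
            return
        cursor = page["keys"][-1]  # resume after the last key seen


bucket = {"doc/001": b"a", "doc/002": b"b", "doc/003": b"c", "doc/004": b"d"}
print(list(iterate_gt(bucket, gt="doc/001", page_size=2)))
# ['doc/002', 'doc/003', 'doc/004']
```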

Performance considerations

At the end of the day, S3 is not designed as a database, hence it’s not going to yield the same performance as a normal database. It should however offer consistency in access times and scale as your storage grows.

One additional optimisation trick is to include data in the keys themselves. Since there’s no batch file retrieval, a separate call is needed to get the contents of each file, which means accessing large sets of records is slow. There is, however, a call to list the keys in a bucket, up to 1000 keys at a time, so including data as part of the keys allows batch retrieval. Keys are UTF-8 encoded and can be up to 1024 bytes long.
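
A minimal sketch of that trick, assuming a made-up key layout of "id/field/field" with each part percent-encoded so that slashes inside values don’t break parsing. The function names and the layout are hypothetical; the point is only that a key listing alone can carry small records.

```python
from urllib.parse import quote, unquote

MAX_KEY_BYTES = 1024  # key length limit, measured in UTF-8 bytes


def encode_key(doc_id, fields):
    """Build 'id/field1/field2...' with each part percent-encoded."""
    parts = [doc_id] + list(fields)
    key = "/".join(quote(p, safe="") for p in parts)
    if len(key.encode("utf-8")) > MAX_KEY_BYTES:
        raise ValueError("key too long; keep this data in the object body")
    return key


def decode_key(key):
    """Split a key produced by encode_key back into (id, fields)."""
    doc_id, *fields = [unquote(p) for p in key.split("/")]
    return doc_id, fields


# A listing returns keys only, yet each key already holds a small record
listed = sorted([
    encode_key("0002", ["Bob", "b/c"]),
    encode_key("0001", ["Alice", "active"]),
])
records = [decode_key(k) for k in listed]
print(records)  # [('0001', ['Alice', 'active']), ('0002', ['Bob', 'b/c'])]
```

The trade-off is that anything packed into the key counts against the 1024-byte limit, so this only suits small fields.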

S3 also partitions data based on the key prefix, as noted in their documentation, so they actually advise having more diverse prefixes in your bucket. This harks back to the KKV style which DynamoDB uses. You would still need a partition key, perhaps as your top-level folder, to ensure that the data is spread evenly across different S3 partitions so as not to affect the overall performance of your bucket. DynamoDB rules also apply in the sense that your prefixes should be high in cardinality. This, of course, may make searching by ranges harder.
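
One way to get high-cardinality prefixes is to derive the top-level folder from a hash of the document id. The sketch below assumes a two-character MD5 prefix; the function name, the hash choice and the prefix length are all illustrative, not something S3 or S3LevelDOWN prescribes.

```python
import hashlib


def partitioned_key(doc_id, prefix_len=2):
    """Prepend a short hash of the id so keys spread across many prefixes."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{doc_id}"


ids = [f"order-{n:04d}" for n in range(1000)]
keys = [partitioned_key(i) for i in ids]
prefixes = {k.split("/")[0] for k in keys}

print(keys[:3])
print(len(prefixes), "distinct prefixes for", len(keys), "keys")
```

The cost shows up in range queries: consecutive ids no longer sit next to each other in key order, so a scan over a range of ids has to visit every prefix.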

And as mentioned, when listing a bucket, S3 always returns keys in ascending sorted order. If you want descending order, then that’s too bad: you have to enumerate the list and do an in-memory reverse. If you only provide an upper bound (lt), we need to seek from the first key of the bucket up to that bound, then reverse the list in memory so that it is emitted in reverse order. This is okay for small subsets of data but may be impossible for large sets.
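
The reverse case can be sketched as follows, again with a hypothetical list_page helper standing in for the ascending-only S3 listing. Everything below the lt bound has to be buffered before the first result can be emitted, which is where the memory cost comes from.

```python
def list_page(sorted_keys, start_after="", max_keys=1000):
    """Hypothetical stand-in for an S3 listing call: ascending order only."""
    remaining = [k for k in sorted_keys if k > start_after]
    return remaining[:max_keys], len(remaining) > max_keys


def reverse_lt(sorted_keys, lt, page_size=1000):
    """Return keys strictly below `lt`, highest first, by scanning forward."""
    collected, cursor = [], ""
    while True:
        page, truncated = list_page(sorted_keys, cursor, page_size)
        for key in page:
            if key >= lt:
                # Passed the bound; everything needed is already buffered
                return collected[::-1]
            collected.append(key)
        if not truncated:
            return collected[::-1]
        cursor = page[-1]


keys = ["a", "b", "c", "d", "e"]
print(reverse_lt(keys, lt="d", page_size=2))  # ['c', 'b', 'a']
```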

Concurrency issues

A major deficiency of S3 is that there is little concurrency control. While all S3 API operations are atomic, some calls used in S3LevelDOWN, like batch and get, require several S3 API operations to perform. On top of that, PouchDB needs several LevelDOWN calls for its own operations. Hence two PouchDB instances accessing S3 simultaneously would result in data corruption. S3 does not support locking and offers only very limited optimistic concurrency control (through the use of file versioning).
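
The simplest form of the problem is a lost update. In the sketch below a dict stands in for the bucket and the interleaving of two writers is spelled out by hand; read and write are hypothetical helpers, each atomic on its own, yet the read-modify-write sequence built from them is not.

```python
bucket = {"counter": 0}


def read(key):
    return bucket[key]


def write(key, value):
    bucket[key] = value  # a single write is atomic, like one PUT


# Two writers each want to add 1, but their reads and writes interleave
a_seen = read("counter")       # writer A reads 0
b_seen = read("counter")       # writer B reads 0 before A has written
write("counter", a_seen + 1)   # A stores 1
write("counter", b_seen + 1)   # B also stores 1, overwriting A's update

print(bucket["counter"])  # 1, not 2: one increment was lost
```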

Conclusion

Unfortunately, this means we can’t use S3LevelDOWN with PouchDB. It can still be used with LevelUP, provided that you are prepared to handle errors which may arise during concurrent writes. If the documents you want to store are quite independent of each other and very rarely modified concurrently, then this should not be an issue.


Playing on the frontier

A log of random stuff that interests me.