Building Private Cloud Storage (Seafile, S3QL, Amazon S3)

Problem: I have about 50 GB of irreplaceable data that I would like to have regular access to over the Internet.

For the past couple of years I have been using a Linux box at home as a NAS, running a RAID-5 array with daily incremental backups to Amazon Glacier (an extremely cheap and reliable storage service that has a retrieval latency of around 5 hours). That gives me fast access at home (over my gigabit LAN), but access over the Internet has always been problematic. I usually use SFTP/SCP/SSHFS, but that really only works for small files. For big files it gets ridiculously slow, even though I have a pretty good 10 Mbps uplink at home. It’s even more painful from places where ping to the server is high, like across the Pacific Ocean. With 300 ms ping, directory browsing is almost impossible.

Since unreliable storage is so cheap now (that is, hard drives in PCs), it feels to me like cloud-based storage (for lack of a non-buzzword term) synced locally to each computer is the way to go. Storing multiple copies of data may even be an advantage when it comes to reliability.

The easiest way to do cloud-based storage is probably to just sign up for a Dropbox account and use that. For me it would be $10/month, which is quite reasonable, but there are a few reasons why I am reluctant to put all my data on Dropbox, and decided to build my own cloud instead:

  • What if someone hacks into Dropbox and deletes everything? Dropbox uses Amazon S3 for backend storage, which has very high reliability (Amazon says S3 standard redundancy storage can sustain simultaneous data loss in 2 data centers, and multiple copies of the data are stored at each data center), but backup systems usually do not guard against intentional sabotage, and if someone gets Dropbox’s access keys (e.g. a disgruntled ex-employee?), it’s very easy to wipe out their entire database, including all backups on Amazon (and judging by the amount of data they have, I highly doubt they have their own backup). What if my own cloud gets hacked? Well, that’s definitely possible, but it would have to be an attack directed specifically at me, and I have far fewer enemies than Dropbox for sure.
  • There is only 30 days of history. They charge a lot of money for infinite history (which is reasonable considering the potential for abuse). I can get infinite history for a lot less, because I know I am not going to abuse it.
  • Dropbox lies. I believe they have changed their wording since then, but they once said their employees cannot access your data since it’s encrypted and they don’t have the key. Well, it turned out that’s not true. They do have the encryption key. Data is stored with AES-256 encryption on their site, and they manage all the keys. That also means any attacker that gets hold of the keys can decrypt all your data.
  • Dropbox has already had a few very severe security breaches in its short existence.
  • It’s slow. I have seen performance drop to 100-200 KBytes/s when syncing big files. That’s very slow.

Ultimately, it’s just not a company I want to deal with, if I have other choices. And it turned out I do.

I also looked into Google Drive, which looks like one of the only 2 worthy competitors to Dropbox, but their Windows client does this weird thing with random file conversions to try to force you to open all your Office files using Google Docs… so that’s out. I want a simple file storage solution, not something that tries to decide how I should do things and what programs I should use. Also, no Linux client… which is really surprising.

The other one is Microsoft SkyDrive, which has most of the same problems as Dropbox and doesn’t have a Linux client either.

I’m not interested in smaller providers because they enter and leave the market in droves. I don’t want to have to migrate my data between providers every couple of months (assuming they don’t just disappear with all my data).

So then I looked into building my own solution.

I have a Linode VPS that I have been using for 5-6 years now, and it already runs my mail server and web server, so I’d like the solution to be based on that.

I also evaluated Amazon EC2 (micro instance), and that turned out to be a joke, mostly because of the very aggressive CPU throttling and weird “bursting” system, but also because of the pathetic disk IO performance. A micro instance and a ~50 GB EBS volume (Elastic Block Store, the Amazon term for “hard drive”) would cost about as much as a Linode, and a Linode is much more powerful.

Linode hosts are 8-core (16-thread) Sandy Bridge 2.6 GHz Xeons shared between several servers, but any server can use all 8 cores if no one else is using them. I’ve never had to fight anyone for CPU. The host has always been idle every time I needed to use the CPU (that’s not guaranteed of course… if you are unlucky you may get a host with someone using Linode for compute, but that’s unlikely, because Linodes aren’t very cost effective for that). BTW, if you have never tried “make -j 10”, you should. It’s a pretty liberating experience. You can compile the Linux kernel in 1 minute 41 seconds. Their network performance is pretty awesome, too. Usually 100 Mbps up and down, though it’s hard to find other servers that can talk to you that fast.

They also have extremely good support. I’ve opened a few support tickets over the years, and they have always been answered within 1-2 MINUTES, even on Sunday nights, by people who actually know what they are talking about, and no BS.

Highly recommended. They may not be the absolute cheapest, but the support alone is worth it.

However I don’t want my data to actually be hosted on Linode, because –

  • Linode storage is @#$@#%ing expensive for some reason. $20/48GB.
  • They don’t offer offsite backup. There was another VPS provider a while ago that lost a bunch of customer data because their data center got flooded (literally, by water). Definitely need offsite backup.

Linode doesn’t provide affordable reliable storage (I’m using Amazon’s definition of reliable – simultaneous loss of 2 facilities, or at least 4 storage devices), so it looks like I’d have to use network storage, which is not ideal, but we’ll see how that goes.

I decided to use S3 because the only worthy competitor is Google Storage, and I don’t like it for a variety of reasons. S3 is also much better supported in terms of available 3rd-party tools.

S3 is available in several regions (where the data centers are located), and it turned out that which region you choose and where your server is have a GREAT impact on what kind of network performance you’ll get.

I did a few benchmarks using Linode trial instances at a few data centers, to all the S3 regions in North America.

I found the best match to be Linode’s Fremont location, coupled with the S3 Northern California region, where I got 19 MBytes/s up (to S3) and 25 MBytes/s down (from S3). It’s definitely worth the time to do this experiment: many other combinations produced <1 MByte/s in upload or download. I asked Linode to transfer my server to their Fremont location. It was painless, and only took an hour and a half. I wouldn’t be surprised if Amazon’s S3 servers are actually in the same data center as Linode, or at least on the same ISP.
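The post does not say which tool produced these numbers. As a rough sketch of how such a comparison could be scripted, the helper below times an arbitrary transfer function and reports the best observed rate in MBytes/s. The name measure_throughput and the transfer callable are hypothetical; a real run would pass a function that uploads the payload to a bucket in the region under test.

```python
import io
import time


def measure_throughput(transfer, payload, repeats=3, clock=time.perf_counter):
    """Return the best observed rate, in MBytes/s, of calling transfer(payload).

    transfer is any callable that moves the payload somewhere (hypothetical:
    an upload to one S3 region). clock is injectable so the timing logic can
    be checked without a network.
    """
    best = 0.0
    for _ in range(repeats):
        start = clock()
        transfer(payload)
        elapsed = clock() - start
        if elapsed > 0:
            best = max(best, len(payload) / elapsed / 1e6)
    return best


if __name__ == "__main__":
    # Stand-in transfer: write into an in-memory buffer instead of a bucket.
    sink = io.BytesIO()
    rate = measure_throughput(sink.write, bytes(1_000_000))
    print(f"in-memory baseline: {rate:.1f} MBytes/s")
```

Taking the best of a few runs rather than the average is a design choice that reduces the effect of one-off stalls; for a sustained-rate estimate, a larger payload matters more than the number of repeats.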

For the actual hosting application, I tried all the popular open source offerings – SparkleShare, Owncloud, and Seafile.

SparkleShare was just too hard to set up and sometimes didn’t detect file changes… it’s too buggy.

Owncloud was very nice and polished with tons of cool features… but with just one fatal flaw: it’s @#@#%ing slow. I don’t know how they came to the conclusion that it’s a good idea to build a cloud storage platform ENTIRELY based on PHP. Obviously that didn’t work out very well. I have many small files, and it takes about 1 second per file. I believe it just does 1 WebDAV upload request per file with no threading. Ugh. I would get 10-20 KBytes/s from my house (10 Mbps upload). I also ran into a bug when I paused syncing to play a game of Age of Empires: it decided it just didn’t want to start again afterwards.
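The arithmetic behind that number is worth spelling out. With a fixed cost per file and one file in flight at a time, the effective rate is the file size divided by (overhead + size / link speed). The sketch below plugs in the 1 second per file from above, and two assumptions that are not from the measurements: an average file size of 15 KBytes, and a home uplink of 10 megabits per second (about 1.25 MBytes/s).

```python
def effective_rate(file_size_bytes, per_file_overhead_s, link_bytes_per_s):
    """Bytes per second achieved when files are sent strictly one at a time."""
    time_per_file = per_file_overhead_s + file_size_bytes / link_bytes_per_s
    return file_size_bytes / time_per_file


LINK = 10_000_000 / 8   # assumed 10 Mbps uplink, in bytes per second
SMALL_FILE = 15_000     # assumed average file size, in bytes

# 1 second of overhead per file: the link is idle almost all of the time.
slow = effective_rate(SMALL_FILE, 1.0, LINK)
# No per-file overhead: the link itself is the only limit.
ideal = effective_rate(SMALL_FILE, 0.0, LINK)

print(f"with 1 s per file: {slow / 1000:.1f} KBytes/s")
print(f"with no overhead:  {ideal / 1000:.1f} KBytes/s")
```

Under these assumptions the first figure lands inside the 10-20 KBytes/s range, so the per-file overhead alone is enough to explain the slowdown. Uploading several files in parallel would hide most of it.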

Seafile, while it doesn’t have a fancy interface, is by far the most functional. The frontend may not be very pretty, but it’s definitely good enough, and I haven’t been able to find anything wrong with the backend, even though I have been doing some pretty weird stuff to test it out. It’s also blazingly fast. I get an average of about 700-800 KBytes/s uploading a bunch of small files, which is pretty close to my theoretical maximum. I also talked to the developers about a few things (not problems with the program), and they were very responsive and helpful. This is in contrast to Owncloud, where most bug reports show no sign of having ever been looked at by a developer.

Seafile has a closed-source version that is free for personal use and supports S3 as a storage backend directly, but I decided against that and went with S3QL (a FUSE file system that stores its data in a bucket on S3 or a similar service) instead, because that would allow me to also put random things on there for backup (home directories, mail, MySQL dumps, etc.), and keep things simple. One of the developers actually suggested it for people who don’t want to use the proprietary version. They say it’s slower, but it still doesn’t bottleneck my upload or download speed (still at least a couple of MBytes/s), so I’m not worried about that. I did disable S3QL’s encryption, and have it use zlib instead of lzma for compression (lzma would definitely limit S3QL throughput on faster networks).
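The zlib versus lzma trade-off is easy to get a feel for with Python’s standard library. This is only a sketch: it uses the stdlib zlib and lzma modules at their default settings as stand-ins, not S3QL’s own compression path, and the sample data is made up. Timings and ratios will differ with real data and hardware.

```python
import lzma
import time
import zlib


def compare_compressors(data):
    """Compress data with zlib and lzma; return size ratio and time for each."""
    codecs = (
        ("zlib", zlib.compress, zlib.decompress),
        ("lzma", lzma.compress, lzma.decompress),
    )
    results = {}
    for name, compress, decompress in codecs:
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Round-trip check so a broken codec cannot report a misleading ratio.
        if decompress(packed) != data:
            raise ValueError(f"{name} round trip failed")
        results[name] = {"ratio": len(packed) / len(data), "seconds": elapsed}
    return results


# Made-up sample: repetitive text, which both codecs compress well.
sample = b"2013-08-01 12:00:00 sync ok /home/user/documents/report.txt\n" * 20_000
results = compare_compressors(sample)
for name, stats in results.items():
    print(f"{name}: ratio {stats['ratio']:.4f}, {stats['seconds'] * 1000:.1f} ms")
```

Dividing the input size by the measured seconds gives a compression throughput that can be compared directly against the network rate, which is the comparison that matters here.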

One of the many cool things S3QL does is data de-duplication. It computes a SHA-256 hash of every block, and stores identical blocks only once. That makes backing up stuff extremely simple. I just have to keep copying stuff in, and anything that hasn’t changed since the last backup won’t take up extra space on S3.
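The idea can be shown with a toy content-addressed store. This is a minimal in-memory sketch of hash-based block de-duplication in general, not S3QL’s actual code or on-disk format; the DedupStore class and the tiny 4-byte block size are made up for illustration.

```python
import hashlib


class DedupStore:
    """Toy block store: each distinct block is kept once, keyed by SHA-256."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # hex digest -> block bytes
        self.files = {}   # file name -> ordered list of digests

    def put(self, name, data):
        digests = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Only store the block if this content has not been seen before.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self.blocks.values())


store = DedupStore(block_size=4)
store.put("monday-backup", b"AAAABBBBAAAA")
size_after_first = store.stored_bytes()
store.put("tuesday-backup", b"AAAABBBBAAAA")  # unchanged data, copied in again
size_after_second = store.stored_bytes()
print(size_after_first, size_after_second)
```

The second copy adds only a new list of digests, not new blocks, which is why repeatedly copying unchanged data into such a file system costs almost nothing.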

Since Seafile keeps all file revisions already, by now the biggest risk of data loss is an application error corrupting my data. It can happen, for example, if there is a bug in Seafile or S3QL. S3 has a very nice way of protecting against that: versioning. I turned on versioning on my S3 bucket for S3QL, which means that if data corruption happens, I can always just revert the entire bucket to an earlier time, and since S3QL stores all its FS state in the bucket, the filesystem will be as good as it was at the time I reverted the bucket to. Since the S3 Northern California region provides immediate consistency, I don’t have to worry about weird problems with eventual consistency. Objects in versioned buckets cannot easily be permanently deleted. A special command needs to be sent to delete them, so there is no chance of that happening accidentally.
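Why versioning makes a whole-bucket revert possible can be seen in a toy model. The class below is an in-memory illustration only, not the S3 API: real S3 identifies versions by version IDs rather than the integer timestamps assumed here, and VersionedBucket and its methods are hypothetical names. What it does share with S3 versioning is that overwrites and deletes never destroy older versions; a delete just adds a marker on top.

```python
class VersionedBucket:
    """Toy versioned key-value store; writes must arrive in time order."""

    def __init__(self):
        self._history = {}  # key -> list of (timestamp, value or None)

    def put(self, key, value, ts):
        self._history.setdefault(key, []).append((ts, value))

    def delete(self, key, ts):
        # A delete is recorded as a marker (None); old versions stay intact.
        self._history.setdefault(key, []).append((ts, None))

    def get(self, key, as_of=None):
        """Return the newest value at or before as_of (None if absent/deleted)."""
        value = None
        for ts, candidate in self._history.get(key, []):
            if as_of is None or ts <= as_of:
                value = candidate
        return value

    def snapshot(self, as_of):
        """Reconstruct the bucket contents as they were at time as_of."""
        view = {}
        for key in self._history:
            value = self.get(key, as_of)
            if value is not None:
                view[key] = value
        return view


bucket = VersionedBucket()
bucket.put("fs_metadata", b"good", ts=1)
bucket.put("block_0001", b"data", ts=2)
bucket.put("fs_metadata", b"corrupted", ts=5)  # application bug strikes
bucket.delete("block_0001", ts=6)

before_bug = bucket.snapshot(as_of=3)
print(before_bug)
```

Because every object the file system depends on lives in the same bucket, restoring all of them to the same point in time gives back a consistent file system, not just individual files.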

That is my entire private cloud storage stack. Let’s see what happens.

My biggest risks of losing data now –

  • I forget to pay Linode or Amazon, and get my accounts terminated. This is the biggest risk, but there’s no way to work around that besides storing all the data myself.
  • Intentional malicious attack, which is unlikely because that person would have to target me specifically, and be able to tweak Seafile in a way that all my computers would sync to an empty directory instead of just stopping syncing, and also clear my S3 versioned bucket… that’s a lot of work. I’ll just try to not make too many very technically savvy enemies with a lot of UNIX experience.
  • I delete my own data in my sleep. Unlikely, because I don’t even KNOW what the command is to delete versioned objects. It’s unlikely that I’ll be able to figure it out in my sleep.

If I get even more paranoid, I can enable 2-factor authentication for permanent versioned object deletion on S3, but I think I’m ok with not having that right now. I’ll probably end up losing the authentication device, and not be able to delete that bucket… forever.

Note that high availability is not very important to me, which is why I don’t have things like automatic failover, etc. System design would have to be a lot more complicated if high availability were also a requirement. I don’t generally mind if I can’t sync for a day or 2, as long as the data is safe. For example, if Linode goes down, I will just set up another VPS or a dedicated server somewhere, and set up S3QL to mount the S3 filesystem, and Seafile to continue serving my data.

Summary of my current setup:

{home directory backup, emails backup, Seafile} -> S3QL -> Amazon S3

Amazon S3 is where everything goes in the end, and Amazon is responsible for data reliability against hardware failures and natural disasters. I am responsible for data reliability against application errors in Seafile and S3QL, and I’m combating that using S3 versioning. I am also responsible for things like a virus or “rm -rf” on a synced computer, and I’m combating that using Seafile’s infinite revision history (in addition to S3 versioning, which shouldn’t be needed unless Seafile corrupts its own database).

In the extremely unlikely event that S3 data is lost (e.g. due to forgetting to pay Amazon), I’ll still have recent copies of my data on my synced computers, and a master copy of my home directory and emails on my Linode server.

Cost: $4.75/month for 50GB ($20/month for Linode, but I needed that anyways, for other stuff).
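For comparison, the prices mentioned in this post can be put side by side. The per-GB S3 rate is not quoted anywhere above; the figure below is only what $4.75 for 50 GB implies, and it leaves out request and transfer charges, which are not broken out in the post.

```python
DATA_GB = 50

s3_monthly_usd = 4.75        # total quoted above for 50 GB
dropbox_monthly_usd = 10.00  # Dropbox plan price quoted earlier
linode_disk_usd = 20.00      # Linode storage price quoted earlier...
linode_disk_gb = 48          # ...per this many GB

implied_s3_per_gb = s3_monthly_usd / DATA_GB
linode_per_gb = linode_disk_usd / linode_disk_gb
linode_for_data = linode_per_gb * DATA_GB

print(f"S3 (implied):   ${implied_s3_per_gb:.3f} per GB-month")
print(f"Linode storage: ${linode_per_gb:.3f} per GB-month")
print(f"50 GB on S3:      ${s3_monthly_usd:.2f}/month")
print(f"50 GB on Linode:  ${linode_for_data:.2f}/month")
print(f"Dropbox plan:     ${dropbox_monthly_usd:.2f}/month")
```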

Book Review: The Cuckoo’s Calling

[Image: cover of The Cuckoo’s Calling]

No, I did not read this book just because it has a pretty girl on the cover.

One cool thing about the book is that it was written by J. K. Rowling, but under the pseudonym Robert Galbraith. Of course, it wasn’t long (3 months) before people figured out it was her. However, I still really admire her courage in trying to pull this one off. It was her experiment, an attempt to publish a book under some other name, to be able to get honest critical feedback, without her own hype getting in the way.

I knew she was the author going in because I Googled the book first, so I had very high expectations. I was in no way disappointed.

She is as skilled at writing crime novels as she is at writing the fantasy novels that we have all come to love (if you haven’t, you should hit yourself in the head with a hammer until you do). All the endless details, the plot twists, the characters and their widely varied flaws, and the subtle humour are here.

The plot is very juicy and definitely unpredictable, but it did require fairly long stretches of imagination to connect in a few places – something she clearly has way more than enough of.

Reading this book reminded me of Sherlock Holmes. Cuckoo’s Calling is not quite as twisted or complex in plot, but it does come REALLY close, and Rowling more than made up for it with her signature expressiveness and all the interesting minor plot elements.

Another interesting thing about the book is its portrayal of so many different kinds of prejudices – skin colour, wealth, reputation, beauty, and most interestingly, mental illnesses.

I wish she had expanded more on the theme of mental illness, but just the fact that she gave 2 of the main characters in the book bipolar disorder, and showed how people discredited them, is a very nice reminder of how prejudiced we are against people with mental illnesses, much more so than against people with physical illnesses, for no good reason really.

People wouldn’t stop trusting someone because they have a broken leg, but many would stop trusting someone because they have bipolar disorder, even though bipolar is purely a mood disorder and does not cause hallucinations and delusions (except in an extremely severe and rare form that’s basically bipolar combined with schizophrenia… IIRC).

The social stigma on people with mental disorders and the general taboo surrounding anything to do with mental disorders is really disheartening, and the fact that she made it a plot element is quite commendable. If nothing else, it would hopefully encourage some people to read up on mental illnesses, and understanding is often the first step to dissolving barriers.

On a lighter note, I have no idea what’s with J. K. Rowling and breasts. She seems to have a rather strange obsession with them: she can’t ever write about women without describing their breasts in great detail…

Strike absorbed the impact, heard the high-pitched scream and reacted instinctively: throwing out a long arm, he seized a fistful of cloth and flesh; a second shriek of pain echoed around the stone walls and then, with a wrench and a tussle, he had succeeded in dragging the girl back on to firm ground. Her shrieks were still echoing off the walls, and he realized that he himself had bellowed, “Jesus Christ!”

The girl was doubled up in pain against the office door, whimpering. Judging by the lopsided way she was hunched, with one hand buried deep under the lapel of her coat, Strike deduced that he had saved her by grabbing a substantial part of her left breast. A thick, wavy curtain of bright blonde hair hid most of the girl’s blushing face, but Strike could see tears of pain leaking out of one uncovered eye.

Book Review: The Hitchhiker’s Guide to the Galaxy

The Hitchhiker’s Guide to the Galaxy, the increasingly inaccurately named trilogy of five.

[Image: cover of The Hitchhiker’s Guide to the Galaxy]

“A towel, it says, is about the most massively useful thing an interstellar hitchhiker can have. Partly it has great practical value. You can wrap it around you for warmth as you bound across the cold moons of Jaglan Beta; you can lie on it on the brilliant marble-sanded beaches of Santraginus V, inhaling the heady sea vapors; you can sleep under it beneath the stars which shine so redly on the desert world of Kakrafoon; use it to sail a miniraft down the slow heavy River Moth; wet it for use in hand-to-hand combat; wrap it round your head to ward off noxious fumes or avoid the gaze of the Ravenous Bugblatter Beast of Traal (such a mind-bogglingly stupid animal, it assumes that if you can’t see it, it can’t see you); you can wave your towel in emergencies as a distress signal, and of course dry yourself off with it if it still seems to be clean enough.”

More importantly, a towel has immense psychological value. For some reason, if a strag (strag: non-hitchhiker) discovers that a hitchhiker has his towel with him, he will automatically assume that he is also in possession of a toothbrush, face flannel, soap, tin of biscuits, flask, compass, map, ball of string, gnat spray, wet-weather gear, space suit, etc., etc. Furthermore, the strag will then happily lend the hitchhiker any of these or a dozen other items that the hitchhiker might accidentally have ‘lost’. What the strag will think is that any man who can hitch the length and breadth of the galaxy, rough it, slum it, struggle against terrible odds, win through, and still know where his towel is is clearly a man to be reckoned with.

Adams, Douglas (2012-03-01). The Complete Hitchhiker’s Guide to the Galaxy: The Trilogy of Five (p. 31). Macmillan Publishers UK. Kindle Edition.

I seriously did not expect the ending to be so dark, but having briefly known Mr. Adams through the 250K or so words (I read all 5 books in one go, over a couple of weeks), I guess I should have expected that.

He wanted to write a 6th book with a more uplifting ending, but died before he could do that. There is a 6th book written by Eoin Colfer, the author of Artemis Fowl.

Don’t think I am going to read that, though. The Hitchhiker’s Guide to the Galaxy is all I know about Douglas Adams, and I want to keep the series purely his in my mind.

The Hitchhiker’s Guide to the Galaxy is really a work like no other. It’s a combination of sci-fi, original humour, sarcasm, ironies, and (lack of) story, in perfect stoichiometric ratio.

It’s a book that really only consists of one joke after another after another… yet somehow he manages to get the reader to think about some pretty deep issues, and grow quite fond of the characters, all while ROFLing. All the jokes are also completely original, and tie very well into their respective contexts.

There are so many mind-bendingly twisted and convoluted concepts in the story that it’s a miracle that the story still made sense, and I don’t think it would have, in anyone’s hands but Douglas Adams’.

It’s also a really nice read in the sense that you can easily stop, and pick it up later to keep reading, without having to backtrack much, because the amount of the elapsed story that you need to keep track of to make sense of the next little bit is incredibly short. There are no long and convoluted story lines that you have to keep in your head, so it doesn’t cause much mental stress. Of course, that’s only if you can actually put the book down.

‘OK,’ he said, ‘hear me, hear me. It’s, like, these guys, you know, are entitled to their own view of the Universe. And according to their view, which the Universe forced on them, right, they did right. Sounds crazy, but I think you’ll agree. They believe in . . .’ He consulted a piece of paper which he found in the back pocket of his Judicial jeans.

‘They believe in “peace, justice, morality, culture, sport, family life and the obliteration of all other life forms”.’ He shrugged. ‘I’ve heard a lot worse,’ he said.

Adams, Douglas (2012-03-01). The Complete Hitchhiker’s Guide to the Galaxy: The Trilogy of Five (p. 376). Macmillan Publishers UK. Kindle Edition.

In his own words, he is a “radical atheist”.

Many people believe the passage above is a satire on Islam, but I don’t think anyone got a confirmation from him, and now no one ever will because he is dead.

But hey, Islam is definitely not the only religion that believes in the obliteration of all non-believers, at least at some point in history.