42 notes &
Distributed file coding, Ruby, binary data, profiling, C extensions
Storage erasure codes
The fundamental idea here is to split a file into pieces, combine them with an invertible binary function to form new pieces according to some coding scheme, then distribute the pieces across your storage nodes according to the same coding scheme.
The advantage of such a scheme over straight replication is increased fault tolerance at a reduced storage cost. The key point is that any given node does not need to hold all of the information for a stored file. When combined with information from one or more other nodes, they may recombine missing pieces, whether during the repair of failed storage nodes, or the reassembly of the original file to serve a user request.
For a more complete introduction, and beyond, take a look at An Overview of Codes Tailor-made for Networked Distributed Data Storage, by Anwitaman Datta, who we were fortunate enough to have a guest lecture set from at KTH. Also this presentation video from Wuala.
http://en.wikipedia.org/wiki/Erasure_code may also fill in some blanks, though it is probably discussing the traditional communication erasure coding schemes, rather than storage erasure codes, which are relatively new.
Ruby, binary data, iteration, performance issues
It should be fairly obvious that Ruby is the wrong tool for the job when some set of the following constraints emerges:
- binary file processing — Array packing hell
- large files — in our case, even mp3s were big
- long iteration ranges
- execution time
However, it is the right tool for the job when:
- prototyping
- writing coordination code
Glossing over the performance issues we jumped to prototyping our lab exercise in Ruby, knowing that having only a few days to reach a solution we’d need to move quickly.
It became clear about midway, when working on the file processing portions of the program, that we had performance problems, particularly compared to the Java solutions being implemented by our colleagues (though their final codebases were significantly larger).
Not wanting to put out a shoddy product we set about finding the hot-points, or bottlenecks, in the program, to see what we could do to speed it up.
Profiling Ruby code
This is dead easy, just add
require 'profile'
and you’ll get a printout telling you where the hotpoints in your code are (where most time is spent). The output is not at all intuitive, but it is sorted worst-first, and you can sort-of figure out where these bits are in the program (look for your function names) & then rewrite these bits in C if applicable.
In our case it was file-io, with hundreds of thousands of message calls (which are inherently slow in a message passing / dynamic language), and the bytestring packing really wasting a lot of time & risking data corruption — Yehuda Katz has a good writeup on string encodings.
Combining Ruby and C
Having a C programmer on hand, I was able to give him a function to (re)write which was the significant hotpoint/bottleneck in our Ruby implementation. We had about an hour to write it, verify it worked, and find how to integrate it with Ruby.
backticks
The simplest way to call C from Ruby is to make a standalone C application and pass it your data on the commandline, in the form of stdin or filenames for main()’s varargs. This was great for us, as we just needed to pass it some filenames. It may be that this simple solution is enough, but it won’t fit every case.
This accelerated our program, for working on an MP3 file, from a 2 minute run to a 2 second run. This is a huge improvement, especially considering that we’re more interested in large binaries these days — video files, not stills.
In the end we actually reverted to this solution, due to ruby 1.9.2 compilation problems with the more elegant solution, RubyInline.
RubyInline
First of all I should mention that this is an optional tool, and not required for writing Ruby C extensions. Instead they can be done “the hard way” and compiled manually, as described by Aaron Patterson & Pickaxe.
Rubyinline is a nice trick from zenspider which enables you to enter blocks of C (or other) code into your Ruby source codes, dynamically compiling it on load (with caching) and handling some fiddly things like ruby datatype / c datatype conversion for you (strings -> char*, Integer -> int, etc).
Here’s a simple example: https://gist.github.com/1247601#file_rb_string_array_arg_example.rb
Performance
Running our program with an mp3 as input, execution times are comparable. Our backticks solution was 1 second slower over 10 runs, 15 vs 14 seconds. This is negligible, and it seems more important to consider the implications of the first run for RubyInline compiling the code at runtime, adding 5 seconds, vs the ease of deployment (no compilation to perform before execution). It’s best not to put too much value in these numbers, as they’re subjective & don’t examine the backticks / inline mechanisms in isolation — the Ruby happening around them differs. The speedup vs the native ruby version was enormous, and well worthwhile.
You can find the program here: Our program can be found here, Eraser on Github. Please accept my apologies for the code style, I am still fighting with proper OOP).
Prior works/reading: https://github.com/seattlerb/rubyinline/blob/master/tutorial/example2.rb http://segment7.net/projects/ruby/inline_optimization.html http://docs.seattlerb.org/RubyInline http://www.rubyinside.com/using-rubyinline-to-speed-up-code-by-10x-159.html http://tenderlovemaking.com/2009/12/18/writing-ruby-c-extensions-part-1/
Closing thoughts
It’s not that hard, give it a go. It’ll probably take a few hours to get a toy running (assuming you know a little bit of C), but once you’ve done it you’ll be prepared to use it if you need it in future.
While writing C code may be far from the mind of most Ruby programmers, I believe for some applications it is by far the more suitable & more pleasant language to work with.
That aside, writing a C extension, making use of the Ruby data struct access functions and macros, can give some small insight into what’s going on under-the-hood, which may be just enough, or it may encourage you to delve into the implementation code itself, possibly spinning off into JRuby, Rubinius, RubySpec, or wherever your curiosity takes you.