There is a nifty piece of software called zsync, which is kind of like rsync, except it is totally different.
Rsync
Rsync is mainly useful when you want to synchronize a list of files, or directories, between two servers. It will only download the new files and files which have changed. It will even delete or back up the files which have been removed at the original site. Nice.
For a project I was involved in at work until recently, we had a slightly different problem: we generate a huge file (an ISO image) which contains about 6 GB of data. This ISO image contains the daily build of our application. It contains only a handful of files. The problem is that some of them are generated and gigabytes in size, yet from day to day only maybe 100-150 MB change (and it would be even less were it not for this “feature” of .NET that never generates identical binaries even from exactly the same source code).
Rsync was not useful in this case: it would download the whole file, gigabytes! (some of the people downloading the ISO are on a slow link in India)
zsync
This is exactly the case zsync targets: zsync will only download the changed parts of the file thanks to the rolling checksum algorithm.
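To give an idea of what “rolling checksum” means, here is a minimal sketch in the style of the weak checksum described in the rsync technical report. It is not zsync's actual code: the names weak_sum, weak_sum_compute and weak_sum_roll are made up for illustration. The point is that once the checksum of one window is known, the checksum of the window shifted by one byte costs a couple of additions instead of a full recomputation.

```c
#include <stddef.h>
#include <stdint.h>

/* Weak checksum of one window: a is the plain byte sum, b the
   position-weighted sum, both kept modulo 2^16.
   Hypothetical type, for illustration only. */
typedef struct {
    uint16_t a;
    uint16_t b;
} weak_sum;

/* Compute the checksum of data[0..len-1] from scratch. */
weak_sum weak_sum_compute(const unsigned char *data, size_t len)
{
    weak_sum s = {0, 0};
    size_t i;
    for (i = 0; i < len; i++) {
        s.a = (uint16_t)(s.a + data[i]);
        s.b = (uint16_t)(s.b + (len - i) * data[i]);
    }
    return s;
}

/* Slide the window one byte to the right in constant time: `out` is the
   byte leaving on the left, `in` is the byte entering on the right. */
weak_sum weak_sum_roll(weak_sum s, size_t len, unsigned char out, unsigned char in)
{
    s.a = (uint16_t)(s.a - out + in);
    s.b = (uint16_t)(s.b - len * out + s.a);
    return s;
}
```

Rolling the window across a buffer and comparing each result against a precomputed table of block checksums is what lets the client find which blocks of the old file are still usable at any byte offset.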
Best of all: no need for an rsync server, opening TCP port 873 (which requires months of arguing with BOFHs in some companies), or anything special: HTTP over port 80 and you are done. Provided, that is, that you are not using Internet Information Server, which happens to support only 6 ranges in an HTTP request (hint: configure nginx in reverse proxy mode).
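For the curious, asking for several pieces of a file in one HTTP request is done with a multi-range Range header such as "Range: bytes=0-99,200-299". The sketch below shows one way a client could build that header while capping the number of ranges per request, which is the kind of thing a server with a low range limit forces on you. This is an illustration, not how zsync does it: byte_range, format_range_header and max_per_request are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>

/* One inclusive byte range, as used in an HTTP Range header.
   Hypothetical type, for illustration only. */
typedef struct {
    unsigned long long first;
    unsigned long long last;
} byte_range;

/* Write "Range: bytes=a-b,c-d,..." into out, using at most
   max_per_request ranges taken from the start of ranges[].
   Returns how many ranges were consumed, or 0 if the buffer was too
   small or there was nothing to write. */
size_t format_range_header(const byte_range *ranges, size_t count,
                           size_t max_per_request,
                           char *out, size_t out_size)
{
    size_t used;
    size_t i;
    int n = snprintf(out, out_size, "Range: bytes=");

    if (n < 0 || (size_t)n >= out_size)
        return 0;
    used = (size_t)n;

    for (i = 0; i < count && i < max_per_request; i++) {
        n = snprintf(out + used, out_size - used, "%s%llu-%llu",
                     i ? "," : "", ranges[i].first, ranges[i].last);
        if (n < 0 || (size_t)n >= out_size - used)
            return 0;
        used += (size_t)n;
    }
    return i;
}
```

A caller would loop, advancing through the list by the returned count and issuing one request per header, until all ranges have been fetched.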
But I’m digressing.
Cool. Great. Awesome. Zsync. The perfect tool for the problem.
Hello Windows
Except this project is for Windows, people work on Windows, they are horrified by anything non-Windows, and zsync is only available for Unix platforms.
Uh oh.
In addition to that, the Cygwin port suffers from many connection errors on Windows 7 and does not work from a cmd.exe prompt; it wants the Cygwin Bourne shell prompt.
So I started to port zsync to Windows natively.
Native port howto
The starting point was:
- C99 code
- autotools build system
- No external dependencies (totally self-contained)
- Heavy use of POSIX and Unix-only features (such as reading from a socket via file descriptors, renaming a file while open, deleting a file while open and replacing it with another file while still using the same file descriptor, etc.)
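To make the last item concrete, here is a small POSIX-only sketch of the “delete a file while it is open” idiom. It is not taken from zsync, and unlink_while_open_works is a hypothetical name. On Unix the file's data stays reachable through the open descriptor after unlink(); that is the behaviour a native Windows port cannot take for granted.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if data written before unlink() can still be read back
   through the open descriptor, 0 if not, -1 on a setup error. */
int unlink_while_open_works(const char *path)
{
    const char msg[] = "still here";
    char buf[sizeof msg] = {0};
    int ok;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, msg, sizeof msg) != (ssize_t)sizeof msg) {
        close(fd);
        unlink(path);
        return -1;
    }
    /* Remove the directory entry while the descriptor is still open. */
    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }
    /* The name is gone, but the contents are still there for us. */
    ok = lseek(fd, 0, SEEK_SET) == 0
         && read(fd, buf, sizeof buf) == (ssize_t)sizeof buf
         && memcmp(buf, msg, sizeof msg) == 0;
    close(fd);
    return ok;
}
```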
To avoid breaking too much, and because I wanted to contribute my changes upstream, my intention was to do the port step by step:
- Linux/gcc/autotools
- Linux/gcc/CMake
- Cygwin/gcc/CMake
- MSYS/MinGW gcc/CMake
- Visual C++/CMake
Autotools
Autotools was the first stone in the path.
With some work (calling MSYS from a DOS prompt, etc) it would have been possible to make it generate a Visual C++ Makefile but it would have been painful.
Plus the existing autotools build system did not detect the right configuration on MinGW.
Step 1: replace autotools with CMake. On Linux. This was relatively easy (although time-consuming) and did not require any change in the code.
Cygwin
The second step was to build zsync on Windows using Cygwin (which provides a POSIX compatibility layer) and CMake.
No code changes were required here either, only a few small adjustments to the CMake build system. I tested on Linux again, it worked fine.
At this point, I had only made Pyrrhic progress: zsync was still Unix-only, but with a cross-platform build system.
MinGW
My next step was a serious one: port zsync to use MinGW, which generates a native Windows application with gcc.
That means using Winsock where required.
And hitting Microsoft’s understanding of “POSIX-compliant”: the standard Windows POSIX C functions do not allow treating sockets as files or renaming open files, temporary files are created in C:\ (which fails on Windows Vista and newer), etc. And that’s when the functions do exist. In many cases (mkstemp, pread, gmtime_r…) those functions simply did not exist and I needed to provide an implementation.
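As an example of what “providing an implementation” can look like, here is a minimal sketch of a pread replacement built only on lseek and read, which MinGW does offer through its unistd.h. This is an assumption-laden illustration, not the code from the port: compat_pread is a hypothetical name, and unlike the real pread it is not atomic, because it moves the file position and then puts it back.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Read count bytes at the given offset without (visibly) changing the
   file position. Not safe if another thread uses the same descriptor. */
ssize_t compat_pread(int fd, void *buf, size_t count, off_t offset)
{
    off_t saved;
    ssize_t n;
    int read_errno;

    /* Remember where the file position currently is. */
    saved = lseek(fd, 0, SEEK_CUR);
    if (saved == (off_t)-1)
        return -1;
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;

    n = read(fd, buf, count);
    read_errno = errno;

    /* Put the file position back where the caller left it. */
    if (lseek(fd, saved, SEEK_SET) == (off_t)-1)
        return -1;
    errno = read_errno;
    return n;
}
```

With Visual C++ the same idea would need io.h and the underscore-prefixed equivalents instead of unistd.h.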
Plus adapting the build system. Fortunately, I was still using gcc and Qt Creator provides great support for MinGW and gdb on Windows, and decent support for CMake.
Some other “surprises” were large file support, a stupid “bug” and the difficulties of emulating all the file locking features of Unix on Windows.
Regarding large file support (LFS), I took the easy path: instead of using the 64-bit Windows API directly, I used the mingw-w64 flavor of gcc on Windows, which implements 64-bit off_t on 32-bit platforms transparently via _FILE_OFFSET_BITS.
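In practice this approach boils down to something like the following sketch. The macro has to be defined before any system header is included (or passed on the compiler command line); the typedef is just a compile-time check, and five_gib is a hypothetical helper showing an offset that does not fit in 32 bits.

```c
/* Must come before the first system header, or be given as
   -D_FILE_OFFSET_BITS=64 on the compiler command line. */
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>

/* Compile-time check: the array size is negative, and the build
   fails, if off_t is not 64 bits wide. */
typedef char off_t_must_be_64_bits[sizeof(off_t) == 8 ? 1 : -1];

/* An offset beyond the 4 GiB mark, which a 32-bit off_t cannot hold. */
off_t five_gib(void)
{
    return (off_t)5 << 30;
}
```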
Visual C++ misery
Porting to Visual C++ was the last step.
This was not strictly required. After all, all I had been asked for was a native version, not a native version that used Visual C++.
Yet I decided to give VC++2010 a try.
The main problems were the lack of C99 support (though you can partially work around that by compiling as C++) and importing symbols, because the shared library did not export them explicitly (attributes for symbol visibility were introduced in GCC 4.0, but many libraries do not use them because gcc does its “magic”, especially MinGW, which will “guess” the symbols).
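The usual cure for the export problem is a macro that expands to the right annotation for each compiler. The following is a generic sketch of that pattern, not what zsync ships: ZS_API, ZS_BUILDING_LIBRARY and zs_example_version are hypothetical names.

```c
/* Normally defined by the build system, and only while compiling the
   library itself. It is defined here so the sketch is self-contained. */
#define ZS_BUILDING_LIBRARY

#if defined(_WIN32)
#  ifdef ZS_BUILDING_LIBRARY
#    define ZS_API __declspec(dllexport)   /* building the DLL */
#  else
#    define ZS_API __declspec(dllimport)   /* using the DLL */
#  endif
#elif defined(__GNUC__) && __GNUC__ >= 4
#  define ZS_API __attribute__((visibility("default")))
#else
#  define ZS_API
#endif

/* Every public function gets the annotation in the header. */
ZS_API int zs_example_version(void);

int zs_example_version(void)
{
    return 1;
}
```

With gcc, building with -fvisibility=hidden makes the annotation meaningful on Unix too, since only the marked functions are then exported.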
Porting to Visual C++ 2010 required either giving up some of the C99 features in use (e.g. moving variable declarations to the beginning of functions) or adding a lot of C++-specific workarounds (extern “C”).
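To show the kind of change involved, here is the same made-up function written twice: once with C99-style declarations, which Visual C++ 2010 rejects when compiling as C, and once in the C89 style it accepts. Both function names are hypothetical.

```c
#include <stddef.h>

/* C99 style: declarations after statements and inside the for loop. */
size_t count_nonzero_c99(const unsigned char *p, size_t n)
{
    if (p == NULL)
        return 0;
    size_t count = 0;               /* declared after a statement */
    for (size_t i = 0; i < n; i++)  /* declared in the for header */
        if (p[i])
            count++;
    return count;
}

/* C89 style: every declaration sits at the top of the block. */
size_t count_nonzero_c89(const unsigned char *p, size_t n)
{
    size_t count = 0;
    size_t i;

    if (p == NULL)
        return 0;
    for (i = 0; i < n; i++)
        if (p[i])
            count++;
    return count;
}
```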
I was a bit worried upstream would not accept this code because it didn’t really provide any benefit for the application (for the developer, use of a great IDE and very powerful debugger), therefore I didn’t finish the Visual C++ port. Maybe some day if Microsoft decides to finally provide C99.
The result (so far) is available in the zsync-windows space in Assembla.
> Rsync was not useful in this case: it would download the whole file, gigabytes!
Something was wrong in that setup. I remember downloading Knoppix using a bad internet connection and rsync. After a faulty download (md5sums didn’t match, either), rsync only re-downloaded the bad parts. Later I could execute rsync again. Finally md5sums matched.
From the official “rsync algorithm” page ( https://rsync.samba.org/tech_report/ ):
“The algorithm identifies parts of the source file which are identical to some part of the destination file, and only sends those parts which cannot be matched in this way. Effectively, the algorithm computes a set of differences without having both files on the same machine. The algorithm works best when the files are similar, but will also function correctly and reasonably efficiently when the files are quite different. ”
From “http://everythinglinux.org/rsync/”:
“Diffs – Only actual changed pieces of files are transferred, rather than the whole file. This makes updates faster, especially over slower links like modems. FTP would transfer the entire file, even if only one byte changed.
Compression – The tiny pieces of diffs are then compressed on the fly, further saving you file transfer time and reducing the load on the network.”
OK! I’ve got a test case to demonstrate that rsync does not download the whole file:
1) I executed
rsync -avz rsync://ftp.uni-kl.de/knoppix/qemu-0.8.1/qemu.exe .
and my system downloaded about 7,27 MiB.
Then rsync wrote
sent 48 bytes received 7399136 bytes 30016.97 bytes/sec
total size is 7396548 speedup is 1.00
I executed
md5sum qemu.exe
and it answered
8ebdbc46620badb76972fabd3de25b1f qemu.exe
2) I executed again
rsync -avz rsync://ftp.uni-kl.de/knoppix/qemu-0.8.1/qemu.exe .
and my system downloaded about 0 MiB.
Then rsync wrote
sent 29 bytes received 62 bytes 20.22 bytes/sec
total size is 7396548 speedup is 81280.75
I executed
md5sum qemu.exe
and it answered
8ebdbc46620badb76972fabd3de25b1f qemu.exe
which was the same result as before, of course.
3) I launched Okteta and with it, I changed a byte of the “qemu.exe” file.
I executed
md5sum qemu.exe
and it answered
b0fecd0af32c0fe9681fc440ec92a7f2 qemu.exe
which was a different result than before, of course.
4) I executed again
rsync -avz rsync://ftp.uni-kl.de/knoppix/qemu-0.8.1/qemu.exe .
and my system downloaded about 0 MiB.
Then rsync wrote
sent 16416 bytes received 2821 bytes 4274.89 bytes/sec
total size is 7396548 speedup is 384.50
I executed
md5sum qemu.exe
and it answered
8ebdbc46620badb76972fabd3de25b1f qemu.exe
which was the correct result.
* * *
Note: rsync can also be run over ssh when security is a concern.
The good thing about zsync (which rsync does not support) is partial download of zipped files. Yup, you can zip it, download it, change a bit, recompress, and download only the changed part without uncompressing the whole file. Great for Java apps distributed as JAR/WAR/EAR, and for Rails apps running on JRuby, which are also shipped as WAR files. Now one just needs to find a way to get a glibc-free version of it on Linux, so one can distribute the binary with the app without worrying about binary API mismatches between different distros (I got lots of trouble between different CentOS versions).