In Spanish-speaking forums and websites a lot of people use Hacha (a win32-only app) to split a large file into several smaller chunks. English-speaking people prefer HJSplit, which has a Linux version called lxsplit.
On the one hand, I cannot understand why people keep using these programs, since you could just use a compressor (WinZip, WinRAR) and set the compression level to zero: it would be as fast as Hacha and HJSplit, and everybody already has WinZip and/or WinRAR. On the other hand, I cannot change people’s minds, and using Wine to run Hacha is a pain in the ass on my 64-bit Kubuntu (32-bit chroot, yadda, yadda).
I have tried to contact the author of Hacha to no avail. I suspected the algorithm was easy but I like to play nice: I kindly requested information about the algorithm Hacha is using to split files. After some weeks without an answer, tonight I gave KHexEdit a try and you know what? I was right: the split & join algorithm in Hacha 3.5 is extremely simple.
There is a variable-length header which consists of:
- 5 bytes set to 0x3f
- 4 bytes CRC. If no CRC was computed, the CRC field is 1 byte set to 0x07 followed by 3 bytes set to 0x00. If a CRC was computed, its 4 bytes are here. I have not discovered the CRC algorithm yet.
- 5 bytes set to 0x3f
- Variable number of bytes representing the filename of the large file (before splitting/after joining). This is plain ASCII, no Unicode involved.
- 5 bytes set to 0x3f
- Variable number of bytes representing an integer which is the size of the large file (before splitting/after joining). Let’s name it largeFileSize.
- 5 bytes set to 0x3f
- Variable number of bytes representing the size of each chunk except the first (the one which ends with ".0") and the last. Let’s call it chunkSize. The size of the first chunk is chunkSize + headerSize, where headerSize is the length of this header. The size of the last chunk is largeFileSize - (n-1)*chunkSize, where n is the total number of chunks.
- 5 bytes set to 0x3f
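The layout above can be sketched as a small Python parser. This is only a sketch under assumptions: the list does not say how the two integers are encoded, so the code guesses they are ASCII decimal digits, and it assumes the filename never contains five consecutive 0x3f bytes ("?????"). The names parse_hacha_header, DELIM and NO_CRC are made up for this example.

```python
DELIM = b"\x3f" * 5            # five 0x3f bytes, i.e. b"?????"
NO_CRC = b"\x07\x00\x00\x00"   # marker used when no CRC was computed


def parse_hacha_header(data: bytes) -> dict:
    """Parse the header at the start of the first chunk (the ".0" file)."""
    if not data.startswith(DELIM):
        raise ValueError("missing leading delimiter")
    pos = len(DELIM)

    # The CRC field is always 4 bytes, so read it by length, not by delimiter.
    crc = data[pos:pos + 4]
    pos += 4
    if data[pos:pos + len(DELIM)] != DELIM:
        raise ValueError("missing delimiter after CRC field")
    pos += len(DELIM)

    # Three variable-length fields, each terminated by the delimiter:
    # filename, largeFileSize, chunkSize.
    fields = []
    for _ in range(3):
        end = data.find(DELIM, pos)
        if end == -1:
            raise ValueError("truncated header")
        fields.append(data[pos:end])
        pos = end + len(DELIM)

    name, large_size, chunk_size = fields
    return {
        "crc": None if crc == NO_CRC else crc,
        "filename": name.decode("ascii"),
        # ASSUMPTION: integers are ASCII decimal digits (not confirmed).
        "large_file_size": int(large_size.decode("ascii")),
        "chunk_size": int(chunk_size.decode("ascii")),
        "header_size": pos,  # payload of the first chunk starts here
    }


# Hand-built sample header followed by some payload bytes.
sample = (DELIM + NO_CRC + DELIM + b"movie.avi" + DELIM
          + b"2500" + DELIM + b"1000" + DELIM + b"payload")
info = parse_hacha_header(sample)
print(info["filename"], info["large_file_size"], info["chunk_size"])
```

Reading the CRC by fixed length rather than searching for the delimiter matters because a real CRC value could itself contain 0x3f bytes.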
And that’s all you need to know if you want to implement the Hacha 3.5 algorithm. I will be doing that in the next few days and releasing this program under the GPL.
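To make the size arithmetic concrete, here is a short Python sketch that computes the on-disk size of every chunk from largeFileSize, chunkSize and headerSize. It assumes the number of chunks n is the ceiling of largeFileSize / chunkSize, and that a file small enough to fit in one chunk is stored as header plus data; neither point is confirmed by the analysis above. The function name chunk_sizes is hypothetical.

```python
def chunk_sizes(large_file_size: int, chunk_size: int, header_size: int) -> list[int]:
    """Return the expected on-disk size of each chunk, in order."""
    if large_file_size <= 0 or chunk_size <= 0:
        raise ValueError("sizes must be positive")

    # ASSUMPTION: n = ceil(largeFileSize / chunkSize).
    n = -(-large_file_size // chunk_size)

    # Size of the data carried by the last chunk.
    last = large_file_size - (n - 1) * chunk_size

    if n == 1:
        # ASSUMPTION: a single chunk holds the header plus all the data.
        return [header_size + last]

    # First chunk carries the header, middle chunks are exactly chunk_size.
    return [chunk_size + header_size] + [chunk_size] * (n - 2) + [last]


sizes = chunk_sizes(2500, 1000, 46)
print(sizes)  # [1046, 1000, 500]
```

Subtracting headerSize from the sum of these sizes should give back largeFileSize, which is a cheap sanity check before joining.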
Update: I had not realized there is CRC information. The information I had here corresponds to the trivial case (no CRC), and I have yet to find out the CRC algorithm. "Reversing CRC – Theory and Practice" seems a good starting point.