Monday, April 6, 2009

Reverse a 100 GB file.

How will you reverse the 100 GB file (word by word).
keep in mind the memory constraint. Its a ACSII file. Optimum solution ??

Here's an idea:

Assume the file has N entries. Assume your memory has space for only M, M << N entries.

Read the M/2 last entries of the file and the M/2 first entries. Read then in a order so you don't have to order them or anything. Now write the last M/2 entries at the beginning of the file and the first M/2 at the end of the file. Repeat the processes assuming your file now is smaller: missing M/2 entries at the start and missing M/2 entries at the end.

I know there's a better way to do this ... used to know it ... but it has been a couple years since I used it ... just search for file structures and stuff like that.

 have solution to this :- 
For reversing 100 GB file.... memory constraints will be there to handle... for this suppose at a time you can handle 1 GB of data.. divide the file in 100 sections of 1 GB each (using LSEEK) then reverse the words in each sections and make a Dictionary type (hash table) data structures in which make an entry of the word and its reverse value. When ever you try to reverse any word check the entry in DS if it is there take the value otherwise reverse it and make a entry in DS.

Do we have any other optimized solution to this ??


  1. I would just mmap the file and revert page-by-page. And let the operating system handle caching and flushing changes to disk.

  2. So file "one two three four five hippopotamus" -> "hippopotamus five four three two one"?

    You need two file offsets for tell() and seek(). You also need two buffers that are 2*largest_word_size.

    Seek to the end, and get its value as offset_omega. offset_alpha is zero.

    Seek each toward the middle, looking for spaces. Buffer each until you get spaces on each. When you do, you can write words from each buffer to the opposite end and shift the buffer to the max space position of the two buffers. Repeat until the offsets cross, then write the final word.

  3. Diskspace is cheap.
    Create a temp file.
    Goto end at source file.
    Write source file to temp file, seek source file -1, until you reach beginning of source file.
    Delete source file, and rename temp file to source file name

  4. Word by word? Doesn't that already imply O(n)? In which case, seek to end; advance backward, saving each word until space and then writing to a new file in that order.

    Assuming "memory constraint" refers to temporary memory and not disk space.

  5. Reverse the file then reverse each word.
    foobar baz -> zab raboof -> baz foobar.
    Works if no word is bigger than memory.

  6. Bentley J., Programming Pearls. Column 2. There you will find your solution.

  7. Didn't Knuth write a dozen variations of this in Volume 3? Know your magnetic tape sorting algorithms, young'uns...

    There now. I won't be able to sleep tonight until I've reread that chapter.

  8. By word do you mean natural language word or machine word?

    You don't need two buffers, just swap between your buffer and the end of the file as you write chunks from the buffer to the end of the file, then when done write them from the buffer to the beginning of the file. Repeat. Only slightly tricky parts are stopping on a word boundry (if using natural language words) and dealing with the middle of the file (partial buffer).

    Incremental performance gains will come from tuning of main buffer size and sizes of read/write chunks based on the OS (caching of disk reads, memory management) and disk hardware caching.

  9. Why would you not simply do:
    tac -s " " input.txt > output.txt

  10. Step 1. Copy to external USB drive
    Step 2. Invert the USB drive
    Step 3. Copy the file back.

    You could also pull your HDD out, invert it and put it back in. But then all your files will be reversed! Darn!

  11. Switch the HDD + (positive) and - (negative) wires and file is instantly reversed. :)

  12. Great advice, thank you man. I just reversed the whole harddisk in just a few minutes. Works like a charm.

  13. You need a stack and positional pointer (S as long datatype).

    Point S at position 0, then destructively push characters onto the stack starting from the end of the file until you reach a delineator (I will assume it's a space). The EOF character needs to be converted into the space delineator before being pushed as well.

    Then start popping characters at position S, incrementing S with every pop. Once the stack is empty, push another word (starting with the delineator) and repeat the pop+increment S procedure. Once S has reached the end of the file, the final delineator needs to be replaced by an EOF character and the procedure is finished.

    Works in my head at least, I might be missing something so feel free to tear this answer apart if I erred in some way.