Guess it is time to write my first post using Hugo.
Yesterday I downloaded a torrent containing two years' worth of 4chan posts; the plan was to mess with it and use the data to train a chatbot.
Dealing with big datasets is always fun because even the easiest tasks tend to get complicated: for example, extracting the data from a ~3 GB tar.gz archive was a challenge in itself.
Running “tar -xzvf archive.tar.gz” resulted in the Linux kernel filling all the available memory with page cache for the extracted files; when free RAM was down to ~200 MB my workstation started lagging so hard that even Xorg would freeze for a couple of seconds every 20 seconds or so.
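If you want to watch it happening, a rough sketch (using nothing more than the standard free and watch utilities) is to keep an eye on the memory stats from a second terminal while the extraction runs:


# refresh memory stats every second while tar runs in another terminal;
# "buff/cache" keeps growing while "available" shrinks
watch -n 1 free -h
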
To solve the issue I ran the following command:


sync && sudo sysctl vm.drop_caches=3

“vm.drop_caches=*” is especially interesting: replace * with 1 to free the pagecache, 2 to free dentries and inodes, or 3 to free both.
Running the command flushed the cache in just a few seconds, and after that Xorg stopped behaving weirdly.
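By the way, the same knob is exposed directly under /proc, so the cache can also be dropped by writing to it by hand; this is the standard kernel interface, not something specific to my setup:


sync
# writing 3 drops both the pagecache and the dentry/inode caches,
# just like the sysctl call above
echo 3 | sudo tee /proc/sys/vm/drop_caches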

Another funny thing about dealing with very small files is that hard drives are slow at reading and writing them: imagine running grep on 30 GB of data made of a literal ton of tiny files, with an average read speed of 10 MB/s.
One thing we can do to alleviate the issue is to group the data into a few bigger files; I suggest using a file size of at least 100 MB.
To have the data “compressed” into a few 512 MB files, run:


cd /destination/path/
# concatenate every file under /source/path/ and cut the stream into
# 512 MB chunks named data00000, data00001, ... in the current directory
find /source/path/ -type f | while IFS= read -r i; do cat "$i"; done | split -d -a 5 -b 512M - data

Now that the data is grouped into 512 MB chunks, grep, or any other read-heavy operation, will run at roughly the storage's sequential read speed.
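As a quick sanity check, a search like the one below (the pattern and paths are placeholders, adjust them to your data) can be timed before and after the repacking; forcing the C locale with LC_ALL=C also tends to make grep noticeably faster on plain ASCII data:


# time a search across the repacked 512 MB chunks;
# LC_ALL=C disables locale-aware matching, which is usually much faster
time LC_ALL=C grep -c "some pattern" /destination/path/data*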