Guess it is time to write my first post using Hugo.
Yesterday I downloaded a torrent containing two years' worth of 4chan posts; the plan was to mess with the data and use it to train a chatbot.
Dealing with big datasets is always fun because even the easiest tasks tend to get complicated.
Even extracting the data from a ~3 GB tar.gz compressed archive was a challenge in itself.
Running `tar -xzvf archive.tar.gz` made the Linux kernel eat all the available memory to use it as cache. When free RAM was down to ~200 MB, my workstation started lagging so hard that even Xorg froze for a couple of seconds every 20 seconds or so.
To solve the issue, I ran the following command:

```
sync && sudo sysctl vm.drop_caches=3
```
`vm.drop_caches=*` is especially interesting: replace * with 1 to free the pagecache, 2 to free dentries and inodes, or 3 to free both.
Running the command flushed the cache in just a few seconds, and after that Xorg stopped behaving weirdly.
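If you want to see how much RAM the kernel is actually holding as cache before (and after) dropping it, `/proc/meminfo` has the raw numbers; a quick sketch:

```shell
# See how much RAM the kernel is currently using for caches.
grep -E '^(MemFree|Buffers|Cached):' /proc/meminfo

# The sysctl above is equivalent to writing the value straight into procfs
# (needs root):
#   sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```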
Another funny thing about very small files is that hard drives are slow at reading and writing them: imagine running grep over 30 GB of data made of a literal ton of tiny files at an average read speed of 10 MB/s.
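To get a feel for how bad your dataset is in this respect, you can count the files and compute their average size; a minimal sketch, assuming GNU find/awk and using a throwaway directory under `/tmp` as a stand-in for the real data path:

```shell
# Quick sanity check: how many files, and how small are they on average?
# /tmp/avg-demo is a toy placeholder for the real dataset directory.
demo=/tmp/avg-demo
mkdir -p "$demo"
printf 'aaaa' > "$demo/x"
printf 'bb'   > "$demo/y"

find "$demo" -type f -printf '%s\n' \
  | awk '{n++; s+=$0} END {printf "%d files, avg %.0f bytes\n", n, s/n}'
# → 2 files, avg 3 bytes
```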
One thing we can do to alleviate the issue is to group the data into a few bigger files; I suggest a file size of at least 100 MB.
To have the data "compressed" into a few 512 MB files, run:

```
cd /destination/path/
find /source/path/ -type f | while IFS= read -r i; do cat "$i"; done | split -da5 -b 512M - data
```
Now that the data is grouped into 512 MB chunks, grep, or any other read operation, will run at roughly the drive's sequential read speed.
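The whole pipeline is easy to try on a toy dataset before pointing it at 30 GB of posts; the paths below are placeholders, and with such tiny input `split` produces a single chunk named `data00000`:

```shell
# Toy run of the same find | cat | split pipeline:
# many tiny files in, a handful of sequential chunks out.
mkdir -p /tmp/chunks-demo/src /tmp/chunks-demo/dst
printf 'post one\n' > /tmp/chunks-demo/src/a.txt
printf 'post two\n' > /tmp/chunks-demo/src/b.txt

cd /tmp/chunks-demo/dst
find /tmp/chunks-demo/src -type f | while IFS= read -r i; do cat "$i"; done \
  | split -da5 -b 512M - data

# grep now reads one big file sequentially instead of many small ones.
grep -h 'post' data*
```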