
Sometimes you come across a tool that everyone but you seems to have known about. I hit a wall recently when I wanted to search a massive 10GB text file for a list of terms stored in another file.

Usually a simple grep command would do the trick, but I quickly learned the limitations of grep when I let the command run overnight and came back in the morning to find my system still churning away.

For all its utility, grep has long been a powerful tool in the arsenal of many an IT professional, or anyone using a shell for that matter. Grep was created in 1973, before I was born, by Ken Thompson as an offshoot of the regular expression search in the “ed” editor. It is such an integral tool that it ships with pretty much every Unix-based system.

Being an older utility, it is a little stuck in its ways about how it goes about doing work, and although it gets the job done, it is not particularly efficient. Like many command line tools, grep was not designed to take advantage of processors with multiple cores; back in the day there was only one core, that’s the way it was and we liked it!

Enter GNU Parallel, a shell tool designed for executing tasks in parallel using one or more computers. For my purposes I just ran it on a single system, but wanted to take advantage of multiple cores.

Having enough memory on my system, I loaded the entire massive file into memory and piped it to GNU Parallel, along with a second file, “PATTERNFILE,” containing the thousands of strings I wanted to search for:

cat BIGFILE | parallel --pipe grep -f PATTERNFILE

A process that would have taken almost a day finished in a few hours. Almost immediately after I fired off the command, the fan in my laptop kicked into overdrive, a good sign that it was being put to work.
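GNU Parallel also lets you control how the work is carved up. Here is a minimal tuning sketch; --block sets the size of the chunk each grep receives and -j caps the number of simultaneous jobs, while 10M is just an arbitrary chunk size I picked for illustration:

cat BIGFILE | parallel --pipe --block 10M -j$(nproc) grep -f PATTERNFILE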

To really leverage the power of the tool, you can farm the work out to multiple systems, but for now I am just happy to be able to run shell commands using multiple cores.
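For the curious, farming out looks something like the sketch below. It assumes two hypothetical hosts, host1 and host2, reachable over passwordless SSH, each with GNU Parallel installed and a copy of PATTERNFILE already in place (GNU Parallel’s --basefile option can transfer that file for you):

cat BIGFILE | parallel --pipe --sshlogin host1,host2 grep -f PATTERNFILE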

 


Comments:
  • Note you may need the --line-buffered argument to grep to avoid the parallel processes interspersing their output

    There is similar functionality within coreutils. The following will send a line at a time
    to a grep command running on each processor. Note the stdbuf here is used to
    generally apply line buffering to any command using stdio, though you can use
    grep --line-buffered here instead.

    split -n r/$(nproc) --filter='stdbuf -oL grep -f PATTERNFILE' BIGFILE

    Note also that when processing in parallel, I/O is often the bottleneck.
    Therefore it can help to presplit BIGFILE into files on separate devices, like:

    split -n l/$(nproc) BIGFILE

    Even if you don't have separate devices, this might be beneficial, as you can
    then use the existing xargs -P functionality to process multiple files in
    parallel, like:

    find . -name 'x??' | xargs -P$(nproc) -n1 grep -f PATTERNFILE

    If you have many small files, then you would adjust -n1 above to batch them
    together to reduce the number of processes run.
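    For example, something like this hands each grep sixteen of the split files
    at a time (16 is an arbitrary batch size, and -h keeps the output free of
    filename prefixes, matching the single-file case):

    find . -name 'x??' | xargs -P$(nproc) -n16 grep -h -f PATTERNFILE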

    • Thanks, great information!

    • Ole Tange

      Unfortunately --line-buffered is deceptive: It does NOT guarantee against parallel processes interspersing their output. See example on http://www.gnu.org/software/parallel/man.html#DIF

      Pre-splitting will cost disk I/O. Parallel’s --pipepart will do that on the fly, thus saving disk I/O.
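      A minimal sketch of that approach (--pipepart reads its input from the file
      given with -a instead of stdin; 10M is an arbitrary block size):

      parallel -a BIGFILE --pipepart --block 10M grep -f PATTERNFILE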

  • Mohammed

    It is a very impressive and useful technology.