I’ve created this useful (I hope) tool https://github.com/sc0ty/grip.
It is grep-like file searcher, but unlike grep, it uses index to speed up the search. We need to generate index database first, and then we could grep super fast. Instruction is on the github site, here I would like to share some notes about implementation details and performance.
How it works
This project consist of two commands: gripgen and grip. First one is used to create index database for provided file list (e.g. using external tool like find), second one search for pattern in this database.
Database is based on trigram model (Google is also using this in some projects). Basically it stores every 3-character sequences that appear in the file matched with its path. E.g. world turtle consist of trigrams: tur, urt, rtl and tle. Index format is optimised to fast lookup every file that contains given trigram. To find given pattern, grip will generate every trigram that is part of this pattern and try to find files containing every one of them. It is worth noting that this technique would generate certain amount of false positives. Index database does not contain information about order of trigrams in file, thus grip must read selected files to confirm match. This step could be disabled with the –list switch.
Because gripgen need to be supplied with file list to process, I am using find to prepare such list on the fly. Gripgen is unable to scan files itself by design. This decision vastly simplified its code base, and the many configurable switches of find results in great flexibility of these two tools combined.
I’ve tested it on Intel Core i5 3470 machine with 5400 RPM HDD, scanning Android code base (precisely CyanogenMod for Motorola XT897).
sc0ty@ubuntu:~/projects/cyanogen-xt897$ find -type f | gripgen done - files: indexed 537489 (5.7 GB), skipped 259992, total 797481 - speed: 186.8 files/sec, 2.0 MB/sec - time: 2877.185 sec - database: 493.8 MB in 8 chunks (merged to 1)
It is worth noting, that CPU load was low during the scan (10 – 20%), performance was limited by the HDD bandwidth.
Next I’ve performed several test searches, measuring its execution time.
1.316 sec: grip -i 'hello word' 6.531 sec: grip class 0.357 sec: grip sctp_sha1_process_a_block 0.555 sec: grip -i sctp_sha1_process_a_block
As we can see here, command that produce fewer results will execute faster (sctp_sha1_process_a_block vs class). It is expected – fewer files must be read. There is also noticeable slowdown in case-insensitive search, as more trigrams is looked up in index (every possible case permutation of pattern).
Next test was performed on laptop with Intel Core i7 L620 (first generation i7) with fast SSD HDD. I’ve scanned several smaller open source projects.
sc0ty@lap:~/projects $find -type f -and -size -4M | gripgen done - files: indexed 52839 (576.1 MB), skipped 11848, total 64687 - speed: 281.6 files/sec, 3.1 MB/sec - time: 187.611 sec - database: 62.7 MB in 1 chunk
This time CPU speed was the limiting factor – one core was used at 100% during the scan. I’d like to repeat this test on machine with fast CPU and SSD, but unfortunately I have no access to such device.
In both tests database size was about 10% of indexed files size. This ratio should be proportional to the entropy of indexed files. More entropy means more different trigrams to store. For typical source code tree it is expected to retain this level.
0.444 sec: grip -i 'hello world' 1.174 sec: grip class 0.061 sec: grip extract_lumpname 0.043 sec: grip -i extract_lumpname
Order of magnitude smaller database with SSD drive results in way faster search. Last case insensitive query was faster than case sensitive probably because files used in previous test was buffered by the system.
Measured times are very promising – instead of wasting long minutes with grep, we could have results in mere seconds (or even in fraction of second), waiting time needed for classic grep only once – for index creation. Of course this approach is not immediately applicable to every situation, but I hope to provide useful tool for its job.
Does grip generate a trigram from the search query? Wouldn’t this method take 3 times longer to search for a query in a file? once in every trigram
Grip will generate trigrams, but only to lookup them in the database generated by gripgen. Maybe I wasn’t clear enough: gripgen will generate database that allows fast lookup files containing given trigrams. Using this database grip don’t have to read every file itself, only small subset that has potential hits based on this database.
And last step – grepping these narrowed down files is done simply by searching for strings, the same way as original grep command.
the windows example “dir /b /s /a:-d | gripgen” does not index anything, it starts and then the process hangs with zero activity for hours..
i also tried “dir /c: | gripgen” but with same result
Can you give a hint how to do it correctly e.g. indexing the dir “C:\Users\Ronald” and its subdirectories
Thank you in advance
@Ronald, does dir /b /s /a:-d produce list of files?
gripgen should print something like:
indexing file 4434+ (4233 files/sec): filename
Do you see this?