Grip updated and ported to Windows

Recently I have been working on my indexed grep. It can now be built with the Boost library as an alternative to POSIX. There is a build for Windows on the release page. This version may have issues when grepping files with lines longer than 16 kB, due to my simplified implementation of getline. I will probably replace it with Boost code eventually.

There is also a significant performance improvement in index generation (gripgen). Reading file names no longer blocks the indexer. The memory organisation was also changed to play nicer with the cache. As a result, indexing on an SSD is several times faster. Unfortunately, a mechanical drive is usually the bottleneck itself, so there will not be much difference there. Lower CPU load, maybe. And under Windows indexing is painfully slow due to Windows Defender (at least during my test on Windows 10). It looks like Defender scans every file that I open for indexing. Excluding binary files might help, I guess.

And since the file list is read asynchronously, I can now print a pretty percentage progress. Of course gripgen can only do this after it has read the entire list, so feeding it with find will not show you percentages until find ends.
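To illustrate why the percentage only appears once the whole list is known, here is a minimal Python sketch of the idea. It is not gripgen's actual code: the function names, the `state` dictionary and the progress strings are all made up for this example. A reader thread collects file names while the main loop indexes them, and a percentage is reported only after the reader has seen the end of the list.

```python
import queue
import threading


def read_names(lines, names, state):
    """Reader thread: queue file names as they arrive, then mark the list complete."""
    for line in lines:
        names.put(line.rstrip("\n"))
        state["total"] += 1
    state["complete"] = True
    names.put(None)  # sentinel: no more names will come


def index_all(lines, index_file):
    """Index every name from `lines`; return the progress messages produced."""
    names = queue.Queue()
    state = {"total": 0, "complete": False}
    reader = threading.Thread(target=read_names, args=(lines, names, state), daemon=True)
    reader.start()

    done = 0
    progress = []
    while True:
        name = names.get()
        if name is None:
            break
        index_file(name)  # hypothetical callback doing the real work
        done += 1
        if state["complete"]:
            # Total is final, so a percentage makes sense.
            progress.append("%d%%" % (100 * done // state["total"]))
        else:
            # List still growing (e.g. find is still running): only a counter.
            progress.append("%d files" % done)
    return progress


indexed = []
messages = index_all(["a.c\n", "b.c\n", "c.c\n"], indexed.append)
print(indexed)
print(messages)
```

Which messages are percentages and which are plain counters depends on how quickly the reader finishes, just as it depends on how quickly find finishes in the real tool.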

I was also cleaning up the code, refactoring the build system and adding unit tests. Further changes will be mainly polishing and bug fixing, I think. Or not, if I have some great feature idea. We will see.

grip – indexed grep

I’ve created this useful (I hope) tool https://github.com/sc0ty/grip.

It is a grep-like file searcher, but unlike grep, it uses an index to speed up the search. We need to generate the index database first, and then we can grep super fast. Instructions are on the GitHub site; here I would like to share some notes about implementation details and performance.

How it works

This project consists of two commands: gripgen and grip. The first one creates an index database for a provided file list (e.g. produced by an external tool like find), the second one searches for a pattern in this database.

The database is based on the trigram model (Google also uses this in some projects). Basically, it stores every 3-character sequence that appears in a file, together with the file’s path. E.g. the word turtle consists of the trigrams tur, urt, rtl and tle. The index format is optimised for fast lookup of every file that contains a given trigram. To find a pattern, grip generates every trigram that is part of the pattern and looks for files containing all of them. It is worth noting that this technique produces a certain number of false positives: the index database does not store the order of trigrams in a file, so grip must read the selected files to confirm a match. This step can be disabled with the --list switch.
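The lookup described above can be sketched in a few lines of Python. This is only a toy model under my own assumptions, not grip's real index format: the index is an in-memory dictionary, and `files` is a hypothetical path-to-content mapping standing in for files on disk.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character sequences in text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files):
    """Map each trigram to the set of paths whose content contains it."""
    index = defaultdict(set)
    for path, content in files.items():
        for tri in trigrams(content):
            index[tri].add(path)
    return index


def candidates(index, pattern):
    """Paths containing every trigram of pattern (may include false positives)."""
    tris = trigrams(pattern)
    if not tris:
        return set()  # pattern shorter than 3 characters: this toy index cannot help
    return set.intersection(*(index.get(tri, set()) for tri in tris))


def search(files, index, pattern):
    """Confirm candidates by checking the real content, dropping false positives."""
    return sorted(p for p in candidates(index, pattern) if pattern in files[p])


files = {
    "a.txt": "a turtle walks",
    "b.txt": "turbo hurtle",  # contains tur, urt, rtl and tle, but not "turtle"
    "c.txt": "nothing here",
}
index = build_index(files)
print(sorted(trigrams("turtle")))            # ['rtl', 'tle', 'tur', 'urt']
print(sorted(candidates(index, "turtle")))   # ['a.txt', 'b.txt']
print(search(files, index, "turtle"))        # ['a.txt']
```

Here b.txt is exactly the kind of false positive mentioned above: all four trigrams are present, but not in the right order, so only reading the file rules it out.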

Because gripgen needs to be supplied with a list of files to process, I am using find to prepare such a list on the fly. Gripgen is unable to search for files itself by design. This decision vastly simplified its code base, and the many configurable switches of find give the two tools combined great flexibility.

Performance

I’ve tested it on an Intel Core i5 3470 machine with a 5400 RPM HDD, scanning the Android code base (CyanogenMod for Motorola XT897, to be precise).

sc0ty@ubuntu:~/projects/cyanogen-xt897$ find -type f | gripgen
done
 - files: indexed 537489 (5.7 GB), skipped 259992, total 797481
 - speed: 186.8 files/sec, 2.0 MB/sec
 - time: 2877.185 sec
 - database: 493.8 MB in 8 chunks (merged to 1)

It is worth noting that CPU load was low during the scan (10 to 20%); performance was limited by the HDD bandwidth.

Next I performed several test searches, measuring their execution time.

1.316 sec:    grip -i 'hello word'
6.531 sec:    grip class
0.357 sec:    grip sctp_sha1_process_a_block
0.555 sec:    grip -i sctp_sha1_process_a_block

As we can see, commands that produce fewer results execute faster (sctp_sha1_process_a_block vs class). This is expected: fewer files must be read. There is also a noticeable slowdown in case-insensitive search, as more trigrams are looked up in the index (every possible case permutation of the pattern).
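A rough way to see where the extra lookups come from is to count them. The Python sketch below is a simplified model of my own, assuming ASCII patterns and one index lookup per case variant of each trigram; how grip really enumerates the variants may differ.

```python
from itertools import product


def case_variants(trigram):
    """All distinct upper/lower-case spellings of a trigram (ASCII assumed)."""
    choices = [sorted({ch.lower(), ch.upper()}) for ch in trigram]
    return ["".join(combo) for combo in product(*choices)]


def lookups(pattern, ignore_case):
    """Number of index lookups for pattern under this simplified model."""
    tris = {pattern[i:i + 3] for i in range(len(pattern) - 2)}
    if not ignore_case:
        return len(tris)
    return sum(len(case_variants(tri)) for tri in tris)


print(case_variants("a_1"))      # ['A_1', 'a_1']
print(lookups("class", False))   # 3
print(lookups("class", True))    # 24
```

A trigram made only of letters has 8 spellings, while digits and underscores have just one, so the cost grows with the number of letters in the pattern.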

The next test was performed on a laptop with an Intel Core i7 L620 (first generation i7) and a fast SSD. I scanned several smaller open source projects.

sc0ty@lap:~/projects$ find -type f -and -size -4M | gripgen
done
 - files: indexed 52839 (576.1 MB), skipped 11848, total 64687
 - speed: 281.6 files/sec, 3.1 MB/sec
 - time: 187.611 sec
 - database: 62.7 MB in 1 chunk

This time CPU speed was the limiting factor: one core was used at 100% during the scan. I’d like to repeat this test on a machine with a fast CPU and SSD, but unfortunately I have no access to such a device.

In both tests the database size was about 10% of the size of the indexed files. This ratio should be proportional to the entropy of the indexed files: more entropy means more different trigrams to store. For a typical source code tree I expect it to stay at this level.

0.444 sec:    grip -i 'hello world'
1.174 sec:    grip class
0.061 sec:    grip extract_lumpname
0.043 sec:    grip -i extract_lumpname

An order of magnitude smaller database on an SSD results in way faster searches. The last case-insensitive query was faster than the case-sensitive one, probably because the files read in the previous test were cached by the system.

The measured times are very promising: instead of wasting long minutes with grep, we can have results in mere seconds (or even a fraction of a second), paying the time needed for a classic grep only once, for index creation. Of course this approach is not applicable to every situation, but I hope the tool will be useful where it fits.

Vimview: Vim – gdb integration

Vimview is my new pet project. The goal was to follow the source code in Vim while using gdb. I wanted it done without heavy Vim scripting, so I wrote a single-file gdb plugin in Python. It makes Vim follow the current gdb frame (by opening files and moving the cursor to the corresponding lines) while Vim and gdb run in separate terminals.

Vim can be controlled by RPC, and gdb can be scripted in Python. That’s all we need. The plugin and instructions are on my GitHub.
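To give a feeling for how these two pieces can be glued together, here is a minimal sketch. It is not the Vimview source, only one possible approach under my assumptions: Vim was started as `vim --servername GDB` (the server name is arbitrary, and Vim must be built with the clientserver feature), and the script is loaded into gdb with the `source` command. It reacts only to stop events, so it does not cover manual frame changes.

```python
import subprocess

try:
    import gdb  # only available inside gdb's embedded Python
except ImportError:
    gdb = None

SERVER_NAME = "GDB"  # hypothetical; must match the name given to vim --servername


def vim_remote_command(path, line, server=SERVER_NAME):
    """Build the command asking a running Vim server to open path at line."""
    return ["vim", "--servername", server, "--remote", "+%d" % line, path]


def on_stop(event):
    """Called by gdb when the program stops; point Vim at the current frame."""
    sal = gdb.selected_frame().find_sal()
    if sal.symtab is not None:  # no source information for this frame otherwise
        subprocess.call(vim_remote_command(sal.symtab.fullname(), sal.line))


if gdb is not None:
    gdb.events.stop.connect(on_stop)

print(vim_remote_command("/tmp/main.c", 42))
```

Outside gdb the script only prints the command it would run, which is handy for checking that the Vim side works before wiring it into a debugging session.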

Enjoy.

Old stuff

Some of the projects I created in the past. They are published as is and presented in chronological order.

Robot manipulator simulation

Cylindrical manipulator simulation (C++, May 2008)
OpenGL robot manipulator simulation. Contains binary, source code and DevCpp project.

Super Marian Bros

Platform game (C++, December 2008)
Only in Polish language.
You are a turtle, your opponents are Marios. Written with the Allegro library. Contains a map editor and network multiplayer (which does not work well). Only binaries, without source code. Most of the maps were made by my brother Pawel. Very playable thanks to his maps.

Instant Messenger

Client-Server Instant Messenger (C#, January 2011)
Only in Polish language.
Written in C#, requires .NET Framework 2.0 or newer. Contains binary and Visual Studio solution/project files.

HEX Merge

HEX Merge (C#, April 2011)
Console application. Merges multiple Intel HEX files into one file. Contains binary and Visual Studio solution/project files.

Modified Emulator DSM-51

Modified Emulator DSM-51 (CIL, April 2011)
Only in Polish language.
Modified emulator from Poznan University of Technology. Added drag and drop functionality and some minor modifications.

Multithread environment for DSM-51

Multithread environment system for DSM-51 (8051 assembler, June 2011)
Only in Polish language.
Written in 8051 assembler using Keil uVision. Runs on DSM-51. Contains sources, hex file and uVision project files.

Brainfuck compiler for DSM-51

Brainfuck compiler (C, January 2012)
Only in Polish language.
Written using Keil uVision. Runs on DSM-51. Contains sources, hex file and uVision project files. Brainfuck sources are uploaded to DSM-51 by RS232, then compiled to native code and executed.

Blackberry VSMTools – modifying branding files

About a year ago I got my first BlackBerry device. Of course I did some research into what we can do with it. I mean, of course, more than RIM allows us to do.

First I got my phone debranded. That involves uploading a branding file to the device. When I did that, I started to wonder whether I could edit or create my own branding file, with my own splash screen image or other data changed. I found a description of the VSM file format (but an incomplete one). I did my own research (including disassembling software) and finally created VSMTools. It is an easy-to-use set of command line tools which allow you to extract, edit or create your own VSM file.

VSMTools v0.6 is my tool to extract, edit and create branding VSM files. I’ve also documented the VSM file structure. I got it all except the signature section. VSM files are digitally signed, and as far as I know this is done with an RSA-SHA1 private key and a public key (which I covered in this document). The signature is used to check the file integrity. I can’t tell any more about the signing mechanism; if you have any information about it, please contact me. The signature section may be absent from a VSM file. The BlackBerry 8800 accepts unsigned files, but maybe some newer devices don’t.

BlackBerry VSM file structure

Some useful links related to VSM files: a topic about hacking VSM files on GSM Forum, and the VSM resources description in the BlackBerry API.