April 28, 2018

Empowering 'git diff'

The problem

As a "power user" of git (like all of us developers are, aren't we?) I use 'git diff' really often to see what changes I have made to the code. While examining the diff, it is often a pain to visually parse small changes within a line, because you have to compare two or more adjacent lines of the diff. Now you may know (sure you know, as a power user…) about 'git diff --color-words', which helps in that regard. But 'git diff --color-words' ruins the other lines or blocks of lines, the ones that changed substantially: it shows the only slightly changed lines pretty well, but leaves the rest as an unreadable piece of junk.

As always, if something (in the software world) is not good enough for us developers, we strive to improve it. So I searched for a solution to this problem for a pretty long time. First I changed diff.wordRegex to improve things and tried to limit the word diffs to single lines. But the word diff of git just does not work this way; it cannot even work this way. It helped a bit (see below for my current wordRegex), but did not solve the problem. I gave up…

Fast forward some months: still bugged by this problem, I started searching for a solution again. Some days later I stumbled upon diff-highlight, which tries to solve exactly the problem described above. I was instantly hooked and tried it out immediately, only to see that its authors had not promised too much: "It's currently very simple and stupid about doing these tasks." But there I saw how it is possible to solve the problem, namely as a post-processing step for 'git diff'.

The solution or Welcome to git-diffw

So I took that approach and started to write that kind of post-processor. It takes the output of git diff and searches for only slightly changed lines (or lines with only a prefix or suffix added or removed). Those lines are processed once more by git diff, but this time with --color-words, to get the pretty in-line diff. I added a bit of block recognition and handling of diff's output, and the program was done (written in about 6 hours). The result is published on GitHub: nixn/git-diffw. The README in the repository shows exactly how to use it.
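The first step of such a post-processor, finding candidate blocks in the diff output, could look roughly like the following. This is a minimal sketch and not the code from git-diffw: the function name `change_blocks` is hypothetical, and it assumes plain (uncolored) unified diff output in which a run of removed lines is directly followed by a run of added lines.

```python
def change_blocks(diff_lines):
    """Yield (removed, added) for each run of '-' lines directly followed by '+' lines.

    Hypothetical helper: assumes uncolored unified diff output.
    """
    in_hunk = False
    removed, added = [], []
    for line in diff_lines:
        # File headers ("--- a/x", "+++ b/x") sit outside hunks, so track hunk state.
        if line.startswith("diff --git"):
            in_hunk = False
        elif line.startswith("@@"):
            in_hunk = True
        is_removed = in_hunk and line.startswith("-")
        is_added = in_hunk and line.startswith("+")
        if is_removed and not added:
            removed.append(line[1:])
        elif is_added and removed:
            added.append(line[1:])
        else:
            # Any other line closes the current block.
            if removed and added:
                yield removed, added
            removed, added = [], []
            if is_removed:
                removed.append(line[1:])
    if removed and added:
        yield removed, added
```

Each yielded pair is a candidate for the second pass with --color-words; pure additions and pure removals are skipped because there is nothing to compare them with.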

To recognize which lines changed only slightly, I use the Damerau-Levenshtein distance (a kind of "edit distance") and compare it with the lengths of the original lines (aside from handling prefix-only or suffix-only changes). After a lot of testing, the best threshold for declaring lines similar enough for the in-line diff seemed to be 55% (that threshold is a constant in the code). That is a matter of personal taste; you can adjust it to your liking.
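The similarity check described above can be sketched as follows. This is an illustration under stated assumptions, not the code from git-diffw: it uses the restricted variant of the Damerau-Levenshtein distance (optimal string alignment), it normalizes the distance by the longer of the two lines, and it reads the 55% threshold as a minimum similarity. The names `osa_distance`, `similarity` and `is_slightly_changed` are hypothetical.

```python
SIMILARITY_THRESHOLD = 0.55  # the 55% from the post, read here as a minimum similarity


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(old, new):
    """Assumed normalization: 1 - distance / length of the longer line."""
    longest = max(len(old), len(new))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(old, new) / longest


def is_slightly_changed(old, new):
    return similarity(old, new) >= SIMILARITY_THRESHOLD
```

With this reading, "int count = 0;" and "int count = 1;" count as slightly changed (one edit in 14 characters), while two unrelated lines of very different length do not.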

Now, some months later, I find myself using it as practically my only diff, because I like it very much. I have preserved access to the old 'git diff' (by using a git alias command for the new behaviour), but I don't really need it any more. The new alias command is called "diffw" (hence the repository name), so that it is clear in my mind which diff I am using. (Prior to this, "git diffw" was an alias for "git diff --color-words".)

A better diff.wordRegex

Now since git-diffw uses 'git diff --color-words', it respects the git setting 'diff.wordRegex'. As mentioned above, I played a lot with that setting at first, then again when writing git-diffw. So I have developed a regex which seems to me to be good for developers. Here it is:

[[:upper:]]+[[:lower:]]*('[[:lower:]]+)*|[[:lower:]]+('[[:lower:]]+)*|[+-]?[[:digit:]]+([\\.,][[:digit:]]{3,})*([\\.,][[:digit:]]+)?|[[:xdigit:]]{2,}|([[:punct:]])\\1*|[-=]>|!==?|[^[:space:]]

The structure is pretty easy (though it does not look so easy): it consists of several branches (separated by "|"), each of which handles its own type of possible input. The branches are:

[[:upper:]]+[[:lower:]]*('[[:lower:]]+)*

This branch handles common simple words with possible apostrophes in them, like "Space", "Don't", "Ben's", "BBC's". Those words must start with one or more upper-case letters (with none further inside) and have at least one letter on each side of every apostrophe.

[[:lower:]]+('[[:lower:]]+)*

Like the above, but with lower-case letters only, e.g. "don't".

[+-]?[[:digit:]]+([\\.,][[:digit:]]{3,})*([\\.,][[:digit:]]+)?

A number with possible separators (thousands or decimal) and an optional sign. A thousands group must have at least 3 digits between two separators. After the last separator (the decimal one) there may be only digits.

[[:xdigit:]]{2,}

Hexadecimal numbers with a minimum length of 2 chars.

([[:punct:]])\\1*

Punctuation chars and runs thereof, e.g. "=", "==", "===", "#######", "******".

[-=]>

"->" or "=>". We all like OOP.

!==?

"!=" or "!=="

[^[:space:]]

Any other char except whitespace, matched one char at a time: it cannot be a letter, so it is not part of a word (words are handled by the previous branches).
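To get a feeling for how the branches above split a line into words, here is a small Python sketch. It is an approximation under several assumptions: the POSIX classes are replaced by ASCII-only equivalents, the doubled backslashes in the regex are read as config-file escaping, and the function name `split_words` is hypothetical. Git presumably hands the regex to a POSIX-style engine, which prefers the longest match at a position, while Python's re takes the first alternative that matches; the sketch therefore tries every branch at each position and keeps the longest match. The punctuation run uses a named group, so its backreference does not depend on how the groups are numbered in a combined pattern.

```python
import re

# ASCII-only approximations of the branches, in the order used in the post.
BRANCHES = [
    r"[A-Z]+[a-z]*(?:'[a-z]+)*",                     # capitalized words
    r"[a-z]+(?:'[a-z]+)*",                           # lower-case words
    r"[+-]?[0-9]+(?:[.,][0-9]{3,})*(?:[.,][0-9]+)?", # numbers
    r"[0-9A-Fa-f]{2,}",                              # hexadecimal numbers
    r"(?P<p>[!-/:-@\[-`{-~])(?P=p)*",                # runs of one punctuation char
    r"[-=]>",                                        # "->" or "=>"
    r"!==?",                                         # "!=" or "!=="
    r"\S",                                           # any other non-space char
]
COMPILED = [re.compile(branch) for branch in BRANCHES]


def split_words(line):
    """Split a line into words, taking the longest branch match at each position."""
    words = []
    pos = 0
    while pos < len(line):
        best = ""
        for rx in COMPILED:
            m = rx.match(line, pos)
            if m and len(m.group()) > len(best):
                best = m.group()
        if best:
            words.append(best)
            pos += len(best)
        else:
            pos += 1  # whitespace separates words and is skipped
    return words
```

For example, `split_words("if (a !== b) x->y = don't;")` keeps "!==", "->" and "don't" together as single words, so a word diff would highlight them as a whole instead of char by char.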

Combined with git-diffw it makes viewing code changes really nice!
