5/26/2019

Git - How to validate commit messages?

Last time I wrote an introduction to Git for beginners. This time I would like to present a solution to a somewhat more advanced problem. I had to solve the following issue: all commit messages need to follow specific rules (maximum line length, etc.), and no commit should be pushable if its message does not fulfil them.
I thought there would be an easy solution and that plenty of people had already solved it, since many projects need commit message validation. In fact it was not so easy. Let me describe why!

Git commit hooks

In Git you can define so-called hooks. These are scripts that are called when specific actions happen. Some hooks run on the “client” side and others run on the “server” side. I’m using quotes because Git assigns no fixed server or client role to a repository; each repo can act as both. You can find all these hooks under .git/hooks in your Git repo.
There is a hook called commit-msg, which is exactly what we need. When you run git commit, this script is invoked with the path of a file containing the commit message as its parameter, and if it does not exit with 0 the commit is aborted. That sounds really good. The only problem is that this script runs on the “client” repo. If the client, the one who cloned the server repo, removes this hook from the .git/hooks folder (or commits with --no-verify), they can commit whatever they want. So this is good as a first check, but it still does not solve the problem in a secure way. Furthermore, hooks are not version controlled, so you need another way (maybe some additional scripts) to copy them to the client repos. Alternatively, you can create a symlink between the hooks folder and a version-controlled folder.
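To make this more concrete, here is a minimal sketch of what such a commit-msg hook could look like in Python. The 72-character limit and the function name find_violations are my own assumptions for illustration; replace them with whatever rules your project uses.

```python
#!/usr/bin/env python3
"""Sketch of a commit-msg hook: save as .git/hooks/commit-msg and make it executable."""
import sys

MAX_LINE_LENGTH = 72  # assumed limit, chosen only for this example


def find_violations(message, max_len=MAX_LINE_LENGTH):
    """Return a list of human-readable problems found in a commit message."""
    problems = []
    # Lines starting with '#' are editor comments that git normally strips.
    lines = [line for line in message.splitlines() if not line.startswith("#")]
    if not lines or not lines[0].strip():
        problems.append("subject line is empty")
    for number, line in enumerate(lines, start=1):
        if len(line) > max_len:
            problems.append(f"line {number} is longer than {max_len} characters")
    return problems


def main(argv):
    # git passes the path of the file holding the message as the first argument.
    with open(argv[1], encoding="utf-8") as handle:
        problems = find_violations(handle.read())
    for problem in problems:
        print(f"commit-msg: {problem}", file=sys.stderr)
    return 1 if problems else 0  # a non-zero exit code aborts the commit


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

Keeping the rules in a separate function has a practical benefit: the same find_violations can later be reused on the server side.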
If we want to be one hundred percent sure that no invalid commit message appears in the server repo, we need to check on the server side.
One possibility for server-side checking is the pre-receive hook. It is invoked whenever someone pushes to the server, and if it returns a non-zero value the push is refused.
The only problem is that this hook gets no direct information about exactly which commits were pushed. It reads its input from standard input, which contains only the changes to the Git references, one per line, in the following format: old value, new value, reference name.
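Reading that input could look roughly like the sketch below. The names RefUpdate and parse_updates are invented for this example; the only thing taken from Git is the three space-separated fields on each line.

```python
import sys
from typing import NamedTuple


class RefUpdate(NamedTuple):
    old: str  # hash the reference pointed to before the push
    new: str  # hash the reference should point to after the push
    ref: str  # full reference name, e.g. refs/heads/master


def parse_updates(lines):
    """Parse pre-receive input lines of the form '<old> <new> <ref>'."""
    updates = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        old, new, ref = line.split(" ", 2)
        updates.append(RefUpdate(old, new, ref))
    return updates


if __name__ == "__main__" and not sys.stdin.isatty():
    # Inside a real hook, git feeds one line per updated reference on stdin.
    for update in parse_updates(sys.stdin):
        print(f"{update.ref}: {update.old[:8]} -> {update.new[:8]}")
```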
So how do we figure out which commits are new?
First, let’s go a bit deeper into Git and learn about references.

Git branches and references

When you push commits in Git, you always push to a branch. By default you are on the master branch, but you can create new branches at any time, branching out from an existing one. git push sends your branch to its upstream branch if one exists. If you checked the branch out from the server, its upstream branch is set up automatically. The upstream branch is normally a branch on the server. If no upstream branch exists, or if you are in detached HEAD mode (you are not on any branch; HEAD points directly to a commit), Git asks you to specify where to push (for example git push origin master).
Let’s go one step back now. What is a Git ref? A ref is a named pointer to a specific commit in your repository. You can find the references under the .git/refs directory. Branches are nothing more than special references. They too are just pointers to a commit, but when you commit something new on a branch, the branch automatically moves to point to the latest commit. It is just a named pointer, nothing else.

What happens at git push?

Commits don’t know much. They know their own content and their parent commit. A merge commit has multiple parents; any other commit has only one. The very first “root” commit in the repo has no parent at all.
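You can see this for yourself by printing a commit with git cat-file -p: the header contains one "parent" line per parent, followed by a blank line and the message. The small sketch below extracts the parents from such text; parse_parents is a name I made up, and the hashes in the usage example are shortened placeholders.

```python
def parse_parents(raw_commit):
    """Return the parent hashes listed in the header of a raw commit object."""
    parents = []
    for line in raw_commit.splitlines():
        if not line:
            break  # the blank line ends the header; the message follows
        if line.startswith("parent "):
            parents.append(line.split(" ", 1)[1])
    return parents


# Placeholder data shaped like `git cat-file -p <commit>` output.
merge_commit = (
    "tree 1111\n"
    "parent 2222\n"
    "parent 3333\n"
    "author A <a@example.com> 0 +0000\n"
    "committer A <a@example.com> 0 +0000\n"
    "\n"
    "Merge branch 'feature'\n"
)
print(parse_parents(merge_commit))
```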
So when you call git push, you always push one or more (with git push --all) branches. First you tell the server which new commit each branch points to. This is the value you receive as input in the pre-receive hook. The push also checks whether the branch already exists on the server, and if it does, the pre-receive hook is told its previous value as well.
Conceptually, the server then checks whether it already has that commit (commits are stored under .git/objects). If not, it gets the commit from the client and checks its parent. If the parent is not on the server either, the parent commit is transferred as well. This continues until a parent commit is reached that is already on the server.

How to figure out in pre-receive hook which commits are new?
The biggest challenge is that the pre-receive hook only tells us which references changed from what to what, and nothing else. Our goal is to validate the messages of all newly pushed commits, and nothing more.
The first and easiest case is when someone pushes commits to a branch that already existed. Here we get the old and the new value of the reference, and git log old_hash..new_hash shows the commits between them.
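In a script I find git rev-list handier than git log, because it prints only the hashes for the same old_hash..new_hash range. The sketch below is one possible way to collect the messages; the helper names are my own, and it assumes git is on the PATH and the script runs inside the repository.

```python
import subprocess


def rev_list_command(old_hash, new_hash):
    """Command listing commits reachable from new_hash but not from old_hash."""
    return ["git", "rev-list", f"{old_hash}..{new_hash}"]


def commit_message_command(commit):
    """Command printing the raw message (%B = subject and body) of one commit."""
    return ["git", "log", "-n", "1", "--format=%B", commit]


def run(command):
    """Run a command and return its standard output as text."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


def messages_between(old_hash, new_hash):
    """Map each commit hash in old_hash..new_hash to its commit message."""
    hashes = run(rev_list_command(old_hash, new_hash)).split()
    return {commit: run(commit_message_command(commit)) for commit in hashes}
```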
There is one corner case where this method shows more commits than necessary: for a merge commit it lists the whole content of the merged branch, even though that branch may already have been pushed, at least partially.
I also need to mention the case where the reference (or branch) has been deleted. In this case the new hash is 40 zeros, which also means that no commit messages need to be validated.
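Telling these situations apart is simple once you know about the all-zero hash. A minimal sketch follows; the function names and the returned labels are just my choice for illustration.

```python
def is_null_hash(value):
    """True for the all-zero hash git uses for 'no commit' (40 zeros with SHA-1)."""
    return bool(value) and set(value) == {"0"}


def classify_update(old_hash, new_hash):
    """Classify one pre-receive input line as 'delete', 'create' or 'update'."""
    if is_null_hash(new_hash):
        return "delete"  # reference removed: nothing to validate
    if is_null_hash(old_hash):
        return "create"  # new reference: only the tip commit is known
    return "update"      # existing reference: old..new gives the range


print(classify_update("0" * 40, "a" * 40))
```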
The last case to cover is a newly pushed branch. Here the old hash of the reference is 40 zeros and we have only the new hash, that is, the hash of the latest commit on the branch. What now? After some investigation, my idea was to do the same as the push does: check the latest commit, then jump to its parent commit (for merge commits, do the same with all parents), and stop once we reach a commit that had already been pushed.
This idea sounds good, but how do we figure out whether a commit was already on the server? There are surely multiple solutions, but it took me some time to find one that works.
My solution is git branch --contains. This command returns the list of branches that contain a specific commit in their history. But pay attention! Since Git stores only a reference to the latest commit on a branch, all ancestors of that commit are on the branch too. So if I branch out from master at some point, all commits on master before that point are also part of my branch. There is one more thing to notice: the branches on the client and the branches on the server are not the same, and this is the key to our task.
In my experience, every commit on the server belongs to at least one branch, since it is not possible to push a detached commit. The pre-receive hook is called before the references are changed. That means commits that were not pushed before are not part of any currently existing branch, while all commits that were already there are part of at least one branch. This is the fact we can use here.

Summary

Let me summarize the solution for checking Git commit messages in a server-side hook.
Start at the latest commit of the branch and go parent by parent, checking whether git branch --contains returns an empty list for the commit. If it does, validate the commit message and move on to its parent; if not, the commit has already been pushed and there is nothing more to do on this path. Pay special attention to merge commits: every parent must be checked.
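The walk described above can be sketched in Python as follows. This is not my original hook script, only an illustration under some assumptions: git is on the PATH, the code runs inside the server repo, and all function names are invented for the example. The two git helpers are passed in as parameters so the walk itself can be tried out without a repository.

```python
import subprocess


def run_git(*args):
    """Run a git command and return its standard output as text."""
    result = subprocess.run(["git", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout


def git_parents(commit):
    """Parents of a commit; rev-list prints '<commit> <parent1> <parent2> ...'."""
    return run_git("rev-list", "--parents", "-n", "1", commit).split()[1:]


def git_is_on_branch(commit):
    """True if at least one existing branch contains the commit."""
    return bool(run_git("branch", "--contains", commit).strip())


def find_new_commits(tip, parents_of=git_parents, already_pushed=git_is_on_branch):
    """Walk back from tip and collect commits that are on no existing branch."""
    new_commits, seen, stack = [], set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue  # reached earlier through another parent of a merge
        seen.add(commit)
        if already_pushed(commit):
            continue  # this commit and its ancestors are already on the server
        new_commits.append(commit)
        stack.extend(parents_of(commit))  # every parent, so merges are covered
    return new_commits


# Toy history: e merges c and d, which both branch off b; a and b were pushed before.
history = {"e": ["c", "d"], "d": ["b"], "c": ["b"], "b": ["a"], "a": []}
pushed = {"a", "b"}
print(sorted(find_new_commits("e", history.__getitem__, pushed.__contains__)))
```

Each commit returned this way would then have its message validated, and the hook would exit with a non-zero code if any of them fails.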

I hope this solution is correct; so far it has passed all my test cases. I also hope it helps you solve your task.
