Everything you need to unlearn about Git

Git is perfectly simple to understand, once you are comfortable with the fundamentals of Git.  Unfortunately, resolving the circular dependency is complicated by the fact that most of the facts a reasonable user might initially assume about Git are utterly wrong.

This is an attempt to list the things about Git that seem obvious and are, sadly, wrong.  Once they are unlearned, Git will (might?) become far less inscrutable.

A Git commit is just like a diff.  It is born on one branch, but when branches get merged then each of these diffs kind of gets splatted across a lot of different places, so each can end up applying on a lot of code bases.

This would be a perfectly reasonable assumption about how revision control commits could (should?) work, but every bit of it is untrue in Git.  A Git commit is actually an immutable logical snapshot of the entire file hierarchy.  It's more like a tarball than a diff.  It also contains a little bit of metadata (author, timestamp, a brief message, and most importantly, zero or more pointers to parent commits---usually exactly one).  It has a name (the SHA-1 hash of the combination of the file hierarchy and the metadata), which is closely related to the fact that a commit is immutable (you can never change a commit; the closest you could get is to create a new and very similar one).
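
To make the "tarball with a name" idea concrete, here is a toy model in Python.  It is only a sketch: `make_commit` is a made-up helper, the JSON encoding is an assumption chosen for brevity, and the timestamp is left out so the example is deterministic.  Real Git stores file trees and commits as separate objects in its own format, so these hashes will not match anything Git produces.

```python
import hashlib
import json


def make_commit(files, message, author, parents=()):
    """Build a toy commit and return (commit_id, commit).

    The commit holds a full snapshot of every file (not a diff), a little
    metadata, and the ids of its parent commits.  Its id is the SHA-1 hash
    of all of that, so changing anything yields a different commit.
    """
    commit = {
        "files": dict(files),      # path -> complete file contents
        "message": message,
        "author": author,
        "parents": list(parents),  # zero or more parent commit ids
    }
    payload = json.dumps(commit, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest(), commit


first_id, first = make_commit({"README": "hello\n"}, "initial", "alice")

# The second commit repeats README in full; it does not store "what changed".
second_id, second = make_commit(
    {"README": "hello\n", "main.c": "int main(void) { return 0; }\n"},
    "add main",
    "alice",
    parents=[first_id],
)

# "Editing" the message cannot modify the second commit.  It creates a new,
# very similar commit with a different name.
amended_id, amended = make_commit(
    second["files"], "add main.c", "alice", parents=[first_id]
)
```

In this model `second_id` and `amended_id` differ even though the file contents are identical, which is the sense in which a commit can never be changed, only replaced.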

A commit doesn't really live on any branch at all.  (One or more branches can POINT to a commit... branches know about commits, but commits are completely ignorant of branches.  A sufficiently perverse user could do most of what Git can do with no branches at all, by using commits and tags only.)  Commits are generally closely related to other commits, though.  The "zero or more pointers to parent commits" mean that all the commits in a repository form an implied DAG (directed acyclic graph), with the commits as nodes and the parent pointers as edges.  This DAG records the entire history of everything in the repository.  (Note that the BRANCHES don't encode any history whatsoever.  In fact they don't encode anything except the hash of the commit they currently point to.)
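
As a rough illustration of history living in the parent pointers alone, the sketch below stores nothing but "commit id -> list of parent ids" and recovers a commit's whole history by walking those edges.  The one-letter ids and the `ancestors` helper are invented for the example; no branch appears anywhere in it.

```python
def ancestors(commits, start):
    """Return the set of commit ids reachable from `start` via parent pointers."""
    seen = set()
    stack = [start]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue  # already visited through another path in the DAG
        seen.add(commit_id)
        stack.extend(commits[commit_id])
    return seen


# Each commit knows only its parents.  "d" is a merge with two parents.
commits = {
    "a": [],          # no parents: a starting point
    "b": ["a"],
    "c": ["a"],
    "d": ["b", "c"],
}

history_of_d = ancestors(commits, "d")
history_of_b = ancestors(commits, "b")
```

Here `history_of_d` contains all four commits, while `history_of_b` contains only "a" and "b".  Note that "a" has no record of "b" or "c" existing; the edges only point backwards.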

Which brings us to...

A Git branch is a linear history---a sequence of commits.  However, when things get merged, branches reference other branches and then things get non-linear.

Also sounds plausible, and also utterly wrong.  A branch isn't a sequence at all.  It's a pointer to a commit... a moveable tag.  Nothing more than a human-readable branch name and the corresponding commit hash.  There's no history connected with the branch (other than whatever history you can transitively derive from its commit's parent(s)).  Branches are EXTREMELY cheap, because they're merely pointers.
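
A minimal sketch of why branches are so cheap, assuming the same toy representation of commits as "id -> list of parent ids".  The helpers `create_branch` and `commit_on` are hypothetical names for this example, not Git commands.  A branch here is one dictionary entry: a name and a commit id.

```python
commits = {"a": [], "b": ["a"]}
branches = {"main": "b"}  # the whole branch: a name and a commit id


def create_branch(branches, name, target):
    """Creating a branch only records a new name for an existing commit."""
    branches[name] = target


def commit_on(commits, branches, branch, new_id):
    """Add a commit whose parent is the branch's current commit, then
    move the branch pointer to the new commit."""
    commits[new_id] = [branches[branch]]
    branches[branch] = new_id


create_branch(branches, "topic", branches["main"])
commit_on(commits, branches, "topic", "c")
```

After this runs, "topic" points to "c" and "main" still points to "b".  Nothing about commits "a" or "b" changed, and no list of "commits on topic" is stored anywhere; it can only be derived by following parent pointers from "c".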

Lastly:

A Git repository is a tree, rooted at the very first commit.

Not really true.  There doesn't have to be just one "first" commit (you could have multiple commits without parents in the same repository, if you want), and parent commits know nothing about their children (not even whether they exist).  Everything will make far more sense if you think about it in the other direction... each tag or branch is the root of a sub-DAG of the repository, containing all the commits transitively reachable from whatever commit it points to.  The repository is the union of all those sub-DAGs (so it doesn't have a unique root, except in the case where all the tags and branches point to the same commit).
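
The "other direction" view can be sketched the same way.  In this toy model (the ids, the `refs` dictionary and the helper names are all assumptions for illustration), the repository is computed as the union of everything reachable from each tag and branch, and the parentless commits are found afterwards.

```python
def reachable(commits, start):
    """All commit ids transitively reachable from `start` via parent pointers."""
    seen = set()
    stack = [start]
    while stack:
        commit_id = stack.pop()
        if commit_id not in seen:
            seen.add(commit_id)
            stack.extend(commits[commit_id])
    return seen


def repository(commits, refs):
    """Union of the sub-DAGs hanging off every tag and branch."""
    result = set()
    for target in refs.values():
        result |= reachable(commits, target)
    return result


def parentless(commits, ids):
    """Commits in `ids` that have no parents at all."""
    return {commit_id for commit_id in ids if not commits[commit_id]}


# Two unrelated histories ("a"-"b" and "x"-"y") joined later by the merge "m".
commits = {"a": [], "b": ["a"], "x": [], "y": ["x"], "m": ["b", "y"]}
refs = {"main": "m", "v1.0": "b"}  # one branch and one tag, both just pointers

everything = repository(commits, refs)
first_commits = parentless(commits, everything)
```

Here `first_commits` has two members, "a" and "x", so there is no single "first" commit to root a tree at.  The only well-defined starting points are the tags and branches at the other end.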

None of the above will make any sense.  If held to any standard of intuition, it is nonsense.  But I conjecture that it is COMPLETE nonsense---necessary and sufficient nonsense, in fact.  If you unlearn enough reasonable (but wrong) facts about Git that everything here makes sense, I believe you have unlearned everything you need for ALL of Git to make sense.  (As much as it can.)