How Git Actually Works (Objects, Refs, Pack Files)

We use Git daily but rarely peek at what git commit actually writes to .git. Git's genius is its simplicity: four object types, plus refs that point at them, model every history, branch, merge, and tag. This guide walks through Git's internals (objects, refs, pack files, SHA-1 content addressing) and what merge / rebase / reset actually move around on disk.

.git — Git's database

.git/
├── HEAD                  ← "where are we?" (ref: refs/heads/main)
├── config                ← repo settings
├── objects/              ← all data
│   ├── 8a/
│   │   └── b4f1...       ← SHA-1 prefix (2 chars) + rest
│   ├── pack/             ← packed objects (after gc)
│   │   ├── pack-xxx.idx
│   │   └── pack-xxx.pack
│   └── info/
├── refs/
│   ├── heads/
│   │   ├── main          ← "main points to this commit SHA"
│   │   └── feat/x
│   ├── tags/
│   │   └── v1.0
│   └── remotes/origin/...
├── logs/                 ← reflog (every HEAD movement)
├── hooks/                ← pre-commit, etc.
└── index                 ← staging area (binary)

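Everything lives in .git/objects/ (the data) and .git/refs/ (the pointers). A branch is literally a one-line text file, at least until git pack-refs or git gc folds it into .git/packed-refs, a plain-text list of the same pointers.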

Four object types

1. Blob — file contents

$ echo "hello" | git hash-object --stdin
ce013625030ba8dba906f756967f9e9ca394464a

$ git cat-file -p ce0136...
hello

$ git cat-file -t ce0136...
blob

A blob is just bytes + length. No filename. The same "hello" in any file produces the same SHA-1.

On disk — .git/objects/ce/013625... (zlib compressed).
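The hash above can be reproduced without Git. The sketch below is a minimal illustration, assuming only Python's standard library; the helper names git_object_id and loose_object are made up for this example. Git hashes a short header ("blob", a space, the byte length, a NUL byte) followed by the content, and stores the zlib-compressed result under a path derived from the hash.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 that Git assigns to an object of the given type."""
    # Git hashes "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object(obj_type: str, content: bytes) -> tuple[str, bytes]:
    """Return (path relative to .git, zlib-compressed bytes) for a loose object."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    oid = hashlib.sha1(header + content).hexdigest()
    # The first two hex characters become the directory, the rest the file name.
    return f"objects/{oid[:2]}/{oid[2:]}", zlib.compress(header + content)


# `echo "hello"` emits the five letters plus a trailing newline.
print(git_object_id("blob", b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
print(loose_object("blob", b"hello\n")[0])
```

Note that the filename never enters the hash, which is why the same content yields the same blob regardless of where it lives.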

2. Tree — directory layout

$ git cat-file -p 8a3f...
100644 blob ce0136...    README.md
100644 blob 7a8b9c...    package.json
040000 tree 5b2a4f...    src
                         ↑ this is where filenames first appear

A tree is a list of (mode, type, SHA, name) entries. The blob gets its name from the tree it sits in. Trees can contain sub-trees, building a directory hierarchy.
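As a rough sketch of how a tree is serialized, the function below (tree_id, a name invented for this example) concatenates one record per entry: the mode in ASCII, a space, the name, a NUL byte, then the 20 raw bytes of the object ID. The object type is not stored; Git derives it from the mode. The sort here is simplified, as the comment notes.

```python
import hashlib


def tree_id(entries: list[tuple[str, str, str]]) -> str:
    """Hash a tree built from (mode, name, hex object id) entries."""
    body = b""
    # Simplification: plain byte-wise sort by name. Git additionally compares
    # directory names as if they ended in "/".
    for mode, name, oid in sorted(entries, key=lambda e: e[1].encode()):
        # On disk the directory mode is written "40000"; cat-file pads it to 040000.
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(oid)
    header = b"tree " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


hello = "ce013625030ba8dba906f756967f9e9ca394464a"  # the blob for "hello\n"
print(tree_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (Git's empty tree)
print(tree_id([("100644", "README.md", hello)]))
```

Renaming README.md changes the tree's ID but leaves the blob's ID untouched, since the name lives only in the tree.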

3. Commit — a snapshot in time

$ git cat-file -p HEAD
tree 8a3f...                          ← snapshot's root tree
parent f1a2...                        ← previous commit
author Alice <alice@example.com> 1700000000 +0900
committer Alice <alice@example.com> 1700000000 +0900

feat: add login

This commit adds login functionality.

A commit is one tree (the snapshot) + parent commits + author + committer + message. It does not store the diff; the diff is computed by comparing this tree to the parent's tree when you ask.

Merge commits have two or more parents (more than two in an octopus merge). Root commits have zero.
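A commit object is plain text, so its ID can be sketched the same way as a blob's. The example below is an illustration under stated assumptions: commit_id is a hypothetical helper, and for brevity it reuses one identity string for both the author and committer lines.

```python
import hashlib


def commit_id(tree: str, parents: list[str], who: str, message: str) -> str:
    """Hash a commit made of a tree id, parent ids, an identity line, and a message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero lines for a root commit
    # Assumption for brevity: author and committer are the same person and time.
    lines += [f"author {who}", f"committer {who}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = b"commit " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
WHO = "Alice <alice@example.com> 1700000000 +0900"

root = commit_id(EMPTY_TREE, [], WHO, "feat: add login")
child = commit_id(EMPTY_TREE, [root], WHO, "feat: add login")
print(root != child)  # True
```

The two commits share a tree and a message, yet their IDs differ because the parent list is part of the hashed text.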

4. Tag (annotated)

$ git cat-file -p v1.0
object f1a2...           ← commit it points to
type commit
tag v1.0
tagger Alice <alice@example.com> 1700000000 +0900

Release v1.0

Annotated tags are immutable bookmarks with message + tagger. Lightweight tags (git tag v1.0) are just refs — no object created.

SHA-1 — content-addressable storage

Each object's SHA-1 is a hash of a short header (type and size) followed by the contents. Same type and contents → same SHA-1. This is Git's central trick:

Automatic dedup — the same file across 100 commits stores one blob
Integrity — change a byte, the SHA changes, corruption surfaces immediately
Distributed sync: identical objects get identical SHAs in every clone, so a fetch or push only has to transfer the objects the other side lacks.
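The dedup and integrity properties can be sketched with a toy in-memory store. This is not Git's on-disk format; ObjectStore and its methods are names invented for this example.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store: the key is the hash of the stored bytes."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, obj_type: str, content: bytes) -> str:
        data = f"{obj_type} {len(content)}".encode() + b"\x00" + content
        oid = hashlib.sha1(data).hexdigest()
        self.objects[oid] = data  # identical content always lands on the same key
        return oid

    def verify(self, oid: str) -> bool:
        """Re-hash the stored bytes and compare the result with the key."""
        return hashlib.sha1(self.objects[oid]).hexdigest() == oid


store = ObjectStore()
first = store.put("blob", b"hello\n")
second = store.put("blob", b"hello\n")  # same file in another commit
print(first == second, len(store.objects))  # True 1

store.objects[first] += b"!"  # simulate a corrupted byte on disk
print(store.verify(first))  # False
```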

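SHA-1 collisions? SHAttered (2017) demonstrated a deliberately constructed collision. Git responded by adopting a collision-detecting SHA-1 implementation, chose SHA-256 as the successor hash in 2018, and has offered experimental SHA-256 repositories since Git 2.29 (2020). Day-to-day users aren't affected; the risk is limited to attackers with the budget to construct collisions.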

Refs — what branches and tags actually are

$ cat .git/refs/heads/main
f1a2b3c4d5e6f7...

$ cat .git/HEAD
ref: refs/heads/main

→ HEAD points to main, and main points to commit f1a2...

A branch is a one-line file containing a commit SHA. Adding a new commit just rewrites that file.

Detached HEAD = HEAD points at a commit SHA directly, not a branch (git checkout f1a2...). New commits made there belong to no branch, so once you switch away only the reflog still references them and git gc can eventually delete them.
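Resolving HEAD by hand is a short loop over text files. The sketch below assumes the layout shown earlier; resolve_ref is a hypothetical helper, and the fallback to .git/packed-refs covers refs that Git has packed into a single file instead of leaving them loose. It runs against a throwaway directory, not a real repository.

```python
import tempfile
from pathlib import Path


def resolve_ref(git_dir: Path, name: str = "HEAD") -> str:
    """Follow symbolic refs until an object id is found."""
    path = git_dir / name
    if path.is_file():
        text = path.read_text().strip()
        if text.startswith("ref: "):  # symbolic ref, e.g. HEAD on a branch
            return resolve_ref(git_dir, text[len("ref: "):])
        return text  # a raw id: a branch file or a detached HEAD
    packed = git_dir / "packed-refs"
    if packed.is_file():
        for line in packed.read_text().splitlines():
            if line.startswith(("#", "^")):  # header comment or peeled-tag line
                continue
            oid, _, ref = line.partition(" ")
            if ref == name:
                return oid
    raise KeyError(name)


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)  # stand-in for a real .git directory
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    (git_dir / "refs" / "heads" / "main").write_text("f1a2" * 10 + "\n")
    print(resolve_ref(git_dir))  # the id stored in refs/heads/main
```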

From file to commit — the staging dance

Working tree (files on disk) — what you edit
       │
       │ git add
       ↓
Staging area (.git/index, binary)
       │
       │ git commit
       ↓
Object database (.git/objects/)
       │
       │ git push
       ↓
Remote

The index is literally "the tree the next commit will produce." git diff = working tree vs index. git diff --cached = index vs HEAD.
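The index is binary, but its first 12 bytes are easy to read: the signature DIRC, a version number, and the entry count, each number a big-endian 32-bit integer. The sketch below parses just that header; read_index_header is a name made up for this example, and the sample bytes are hand-built, not taken from a real repository.

```python
import struct


def read_index_header(data: bytes) -> tuple[int, int]:
    """Return (version, entry count) from the first 12 bytes of an index file."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, count


# Hand-built header for illustration: version 2, three staged entries.
sample = b"DIRC" + struct.pack(">II", 2, 3)
print(read_index_header(sample))  # (2, 3)
```

In a real repository the same function could be fed the bytes of .git/index; the entry count should match the number of paths listed by git ls-files --stage.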

Merge vs Rebase — what they do to history

Merge — keeps history intact

Before:
main:    A → B → C
feat:        → D → E (branched from B)

git merge feat:
main:    A → B → C → M
                       ↑
              merge commit M (parents: C, E)
              tree: combines main's changes with feat's

Rebase — replay feat on top of main

Before:
main:    A → B → C
feat:        → D → E (branched from B)

git rebase main (on feat):
main:    A → B → C
feat:                → D' → E'  (new commits with different SHAs)

D' = D's changes reapplied atop C. Different SHA because parent differs.
Equivalent to a series of cherry-picks.

Merge preserves history. Rebase rewrites it. Never rebase a branch you've pushed to a shared remote: everyone else's clones still reference the old SHAs (D, E), so their work now sits on commits your branch no longer contains.
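Why rebased commits get new SHAs can be shown with a toy model. The sketch below is a deliberate simplification: commit_id hashes only a parent ID and a label for the change (real commits also hash the tree, author, and dates), and replay stands in for the series of cherry-picks.

```python
import hashlib


def commit_id(parent: str, change: str) -> str:
    """Toy commit id: hash of the parent id plus a label for the change."""
    return hashlib.sha1(f"{parent}\n{change}".encode()).hexdigest()


def replay(base: str, changes: list[str]) -> list[str]:
    """Reapply each change on top of base, one new commit per change."""
    ids, parent = [], base
    for change in changes:
        parent = commit_id(parent, change)
        ids.append(parent)
    return ids


a = commit_id("", "A")
b = commit_id(a, "B")
c = commit_id(b, "C")

d, e = replay(b, ["D", "E"])          # feat as originally branched from B
d_new, e_new = replay(c, ["D", "E"])  # feat after `git rebase main`

print(d != d_new, e != e_new)  # True True
```

The changes are identical in both runs; only the starting parent differs, and that difference propagates into every ID downstream.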

Reset — moving the branch pointer

Before:
main → C (HEAD)
parent chain: A → B → C

git reset --soft B:
main → B  (HEAD)
staging: keeps C's changes
working tree: keeps C's changes

git reset --mixed B (default):
main → B
staging: same as HEAD (B)
working tree: keeps C's changes

git reset --hard B:
main → B
staging: B
working tree: B (C's changes lost!)

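Reset just moves the branch pointer. C's object remains in .git/objects/, and git reflog lets you find it again: by default, reflog entries for commits no longer reachable from the branch tip are kept for 30 days (gc.reflogExpireUnreachable), reachable ones for 90 days (gc.reflogExpire).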
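The three modes differ only in how far the move propagates. The toy model below tracks which commit's snapshot each of the three areas holds; Repo and reset are names invented for this sketch, and a single letter stands in for a whole snapshot.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    branch: str    # commit the current branch points at
    index: str     # snapshot held in the staging area
    worktree: str  # snapshot of the files on disk


def reset(repo: Repo, target: str, mode: str = "mixed") -> Repo:
    """Move the branch to target; the mode decides what else follows it."""
    if mode not in ("soft", "mixed", "hard"):
        raise ValueError(f"unknown mode: {mode}")
    index = repo.index if mode == "soft" else target
    worktree = target if mode == "hard" else repo.worktree
    return Repo(branch=target, index=index, worktree=worktree)


start = Repo(branch="C", index="C", worktree="C")
print(reset(start, "B", "soft"))   # Repo(branch='B', index='C', worktree='C')
print(reset(start, "B", "mixed"))  # Repo(branch='B', index='B', worktree='C')
print(reset(start, "B", "hard"))   # Repo(branch='B', index='B', worktree='B')
```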

Pack files — efficient storage

New objects start as "loose" (.git/objects/ab/cdef...). Over time you get thousands of small files — wasteful on disk and slow.

git gc (manual or automatic) does two things:

Packs many objects into one file in .git/objects/pack/
Delta compression — similar blobs (e.g. a 1-line edit) get stored as deltas

$ du -sh .git/objects/
initial:  50 MB (loose)
after gc:  5 MB (packed)
          ↑ 10× reduction in this illustration; the actual ratio depends on the repo
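The idea behind delta compression can be sketched as a list of copy and insert instructions against a base object. The code below is an illustration only: it uses Python's difflib to find the shared ranges and keeps the instructions as tuples, which is not Git's binary delta encoding, and make_delta / apply_delta are hypothetical names.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list[tuple]:
    """Describe target as copies from base plus literal inserts."""
    ops: list[tuple] = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # reuse bytes from the base
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))  # bytes only the target has
    return ops


def apply_delta(base: bytes, ops: list[tuple]) -> bytes:
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"line 1\nline 2\nline 3\n"
target = b"line 1\nline two\nline 3\n"  # a one-line edit
delta = make_delta(base, target)
print(apply_delta(base, delta) == target)  # True
```

For a one-line edit, most of the delta is copy instructions, so only the changed bytes need to be stored alongside a reference to the base.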

Garbage collection

Unreferenced objects (unreachable from any ref) are GC candidates:

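git gc --prune=now → prunes unreachable loose objects immediately, regardless of age (objects still referenced by the reflog count as reachable and are kept)
Default → prunes only unreachable objects older than 2 weeks (gc.pruneExpire)
Commits dropped by git reset --hard stay referenced by the reflog for 30 days by default (gc.reflogExpireUnreachable); the 90-day default (gc.reflogExpire) applies to entries still reachable from the branch tip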
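Finding GC candidates is a graph walk: start from every ref, follow parent links, and whatever is never visited is unreachable. The sketch below models commits only (a real walk also follows trees and blobs), and the reachable function and the sample history are made up for illustration.

```python
def reachable(roots: list[str], parents: dict[str, list[str]]) -> set[str]:
    """Collect every commit reachable from the roots by following parent links."""
    seen: set[str] = set()
    stack = list(roots)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


# History A <- B <- C, plus D, which was dropped from its branch by a reset.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}
refs = {"refs/heads/main": "C"}
reflog = ["D"]  # the reflog still remembers D

print(sorted(set(parents) - reachable(list(refs.values()), parents)))           # ['D']
print(sorted(set(parents) - reachable(list(refs.values()) + reflog, parents)))  # []
```

The second call treats reflog entries as extra roots, which is why a dropped commit stays safe until its reflog entry expires.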

Common pitfalls

1. Force push danger

Alice force-pushes main:
local:  A → B → D
remote: A → B → C → E (Bob's commits)

After force push:
remote: A → B → D
→ C and E are gone from the remote branch. Bob's work survives only in his own clone.

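Fix: git push --force-with-lease, which overwrites the remote branch only if it still points where your remote-tracking ref says it does (that is, nobody has pushed since your last fetch).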
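Conceptually, --force-with-lease is a compare-and-swap on the remote ref. The sketch below models the remote as a dictionary; push_with_lease is a hypothetical function, not a Git API.

```python
def push_with_lease(remote: dict[str, str], ref: str, expected: str, new: str) -> bool:
    """Update ref only if the remote still holds the value we last fetched."""
    if remote.get(ref) != expected:
        return False  # someone pushed in the meantime, so refuse to overwrite
    remote[ref] = new
    return True


remote = {"refs/heads/main": "E"}  # Bob has already pushed C and E
ok = push_with_lease(remote, "refs/heads/main", expected="B", new="D")
print(ok, remote["refs/heads/main"])  # False E
```

A plain --force skips the comparison entirely, which is how C and E disappear in the scenario above.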

2. Losing commits on detached HEAD

git checkout abc123, commit, then git checkout main → the new commits seem gone. The SHA is still in the reflog: find it with git reflog, then attach a branch to it with git branch <new-branch> <sha>.
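The reflog itself is a text file (.git/logs/HEAD) with one line per HEAD movement: old ID, new ID, identity, timestamp, then a tab and a message. A minimal parser might look like the sketch below; the function names and the sample log (with made-up IDs) are illustrative assumptions.

```python
def parse_reflog_line(line: str) -> dict[str, str]:
    """Split one reflog line into old id, new id, identity, and message."""
    head, _, message = line.rstrip("\n").partition("\t")
    old, new, who = head.split(" ", 2)
    return {"old": old, "new": new, "who": who, "message": message}


def commits_made(lines: list[str]) -> list[str]:
    """Return the ids HEAD moved to through commits, oldest first."""
    entries = [parse_reflog_line(line) for line in lines]
    return [e["new"] for e in entries if e["message"].startswith("commit")]


WHO = "Alice <alice@example.com> 1700000000 +0900"
log = [  # made-up ids: detach, commit once, switch back to main
    f"{'a' * 40} {'b' * 40} {WHO}\tcheckout: moving from main to {'b' * 7}",
    f"{'b' * 40} {'c' * 40} {WHO}\tcommit: experiment",
    f"{'c' * 40} {'a' * 40} {WHO}\tcheckout: moving from {'c' * 40} to main",
]
print(commits_made(log)[-1][:7])  # ccccccc
```

The ID returned here is the one to hand to git branch when rescuing work from a detached HEAD.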

3. .gitignore ignored

A file that is already tracked stays tracked even after you add it to .gitignore. Run git rm --cached <file> to untrack it, then commit.

4. Submodule detached HEAD

To commit inside a submodule — cd into it, check out a branch, commit, push. Then commit the updated pointer in the parent repo.

5. Large binaries committed without LFS

Big binaries (images, video, ML models) bloat the repo. Once committed, only history rewrite removes them. Use Git LFS or gitignore them.

References

Pro Git book (Scott Chacon) — Internals chapter — git-scm.com
Git from the inside out (Mary Rose Cook) — maryrosecook.com
Git for Computer Scientists — eagain.net
Git SHA-256 transition — git-scm.com

Summary

Git = content-addressable file system. Four objects in .git/objects/ (blob / tree / commit / tag) plus pointers in .git/refs/.
SHA-1 of object contents → automatic dedup, integrity, painless distribution.
A commit is a tree (snapshot) + parents + message. Diffs are computed on demand.
A branch is a one-line text file. Cheap to create, cheap to delete.
Merge preserves history with a merge commit. Rebase rewrites history with new SHAs.
Reset moves the branch pointer. soft / mixed / hard differ on staging and working tree.
Pack files + delta compression shrink repos. git gc runs automatically.
Reflog (30 days by default for dropped commits, 90 for reachable ones) is your safety net for reset / checkout accidents.