Querying with SQL and search
Run custom SQL against your repository, search every version of every file, and write queries that stay fast on delta-compressed tables.
Querying with SQL and search
The built-in analyses cover the common questions. When you have an uncommon one, the whole repository is sitting in plain PostgreSQL tables, and pgit sql hands you a connection. This guide covers running queries, searching across history, and writing SQL that stays fast on the delta-compressed tables.
Running a query
pgit sql "SELECT id, author_name, message FROM pgit_commits ORDER BY seq DESC LIMIT 10"Results open in the same interactive viewer the analyses use. Add --json for a JSON array, --raw for tab-separated output, or --no-pager for a plain table. --timeout sets the query timeout in seconds (default 60), and --remote runs against a configured remote.
Reads are safe; writes are gated
By default pgit sql allows read-only queries only. Anything that starts with INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, or TRUNCATE is refused unless you add --write.
pgit sql --write lets you modify tables directly, which can corrupt a repository: the storage is append-only and the delta chains assume immutable history. Use it only when you know exactly what you are doing. For exploration, leave it off and you cannot break anything.
Discovering the schema
pgit documents its own tables, so you do not have to keep this page open:
pgit sql tables # list every pgit tablepgit sql schema # all tables with a one-line purposepgit sql schema pgit_commits # one table's columns and typespgit sql examples # ready-to-run example queriesThe database schema reference has the full column list if you prefer to read it here.
Example queries
A few starting points (these come from pgit sql examples):
-- Most recent commitsSELECT id, author_name, message, authored_atFROM pgit_commitsORDER BY seq DESCLIMIT 10;-- Files with the most versionsSELECT p.path, COUNT(*) AS versionsFROM pgit_file_refs rJOIN pgit_paths p ON p.path_id = r.path_idGROUP BY p.pathORDER BY versions DESCLIMIT 10;-- Deleted files (content_hash is NULL when a file was removed)SELECT DISTINCT p.pathFROM pgit_file_refs rJOIN pgit_paths p ON p.path_id = r.path_idWHERE r.content_hash IS NULLORDER BY p.path;Notice both file queries join the two heap tables (pgit_file_refs, pgit_paths) and never touch a compressed content table. That is the single most important habit for fast queries, and the next section explains why.
Writing queries that stay fast
pgit's tables come in two flavours, and they have very different performance characteristics:
- Heap tables (
pgit_paths,pgit_file_refs,pgit_commit_graph,pgit_refs,pgit_metadata,pgit_sync_state) are normal PostgreSQL tables. No decompression cost. Filter, join, and aggregate on these freely. - xpatch tables (
pgit_commits,pgit_text_content,pgit_binary_content) store delta chains. Reading a row may decompress part of a chain. Every rule below is about minimizing how much of a chain you touch.
Get everything you can from the heap tables first; only read an xpatch table when you need actual content or a commit message, and when you do, read it by primary key or front-to-back.
Rules for the content tables
The content tables are keyed by (group_id, version_id). Resolve the group_id for a path from pgit_paths.group_id, then:
- Always constrain
group_id. Without it, the planner may scan every group and decompress all of them. - Primary-key lookups are cheap.
WHERE group_id = $1 AND version_id = $2reconstructs exactly one row. - Front-to-back is fastest.
WHERE group_id = $1 ORDER BY version_id ASC LIMIT $2reads the chain in natural order, each row reusing the previous decompression. UseLIMITso you do not pull the whole group. - Avoid range and set filters on
version_id.version_id > $2orversion_id = ANY($2)push the planner into a bitmap scan that decompresses the whole group, then filters. - Avoid
OFFSET. It decompresses and discards rows before returning. Fetch a largerLIMITand skip in your application instead.
Rules for the commits table
pgit_commits is one big delta chain ordered by seq (not by timestamp, and not by version_id). So:
- Scan front-to-back with
ORDER BY seq ASCwhen you need to read many commits. This is exactly whatanalyze authorsdoes. - Look up a single commit by its primary key (
WHERE id = $1); that is cheap. - Do not
JOINontopgit_commits. A join can drive a sequential scan of the whole chain. Instead do a two-step lookup: read the id you want from a heap table, then fetch that one commit by id.
-- Avoid: the join can scan and decompress all of pgit_commitsSELECT c.* FROM pgit_commits cJOIN pgit_refs r ON r.commit_id = c.idWHERE r.name = 'HEAD';-- Prefer: two cheap stepsSELECT commit_id FROM pgit_refs WHERE name = 'HEAD'; -- heap tableSELECT * FROM pgit_commits WHERE id = $1; -- PK lookupCounts and sizes without a full scan
COUNT(*), MIN(), and MAX() over an xpatch table decompress every row. For counts and storage figures, ask xpatch directly:
SELECT * FROM xpatch.stats('pgit_commits');For a MIN/MAX over commit ids or versions, use the heap table pgit_file_refs instead, which carries those columns uncompressed.
Cheat sheet
| Pattern | Speed |
|---|---|
PK lookup (group_id, version_id) or id |
Fast |
WHERE group_id ... ORDER BY version_id ASC LIMIT n |
Fastest |
ORDER BY ... DESC LIMIT n |
Fast once the cache is warm, slow cold |
version_id > x or version_id = ANY(...) |
Slow (bitmap scan) |
JOIN onto an xpatch table |
Risky (may seq-scan) |
COUNT(*) / MIN / MAX on xpatch |
Very slow (full decompress) |
No group_id in WHERE |
Very slow (scans all groups) |
For the deeper version of all this, including the shared read cache, see pg-xpatch read performance and caching and performance.
Searching across history
pgit search runs a regular expression against file content stored in the database, which means it can search history, not just your working tree.
pgit search "TODO" # search the latest version of each file (at HEAD)pgit search "func.*Error" # regex (RE2 syntax)pgit search -i "fixme" # case-insensitivepgit search --path "*.go" "panic\(" # restrict to a path glob (escape regex metacharacters like the paren)By default search looks at HEAD. To go through history:
--allsearches every version of every file. Identical matches across versions are grouped into one result unless you add--no-group.--commit <ref>searches at one specific commit.
Other flags: --limit (-n, default 50) caps results, and --remote searches a remote database. pgit grep is an alias for pgit search.
--all reconstructs old file versions, so on a huge repo it does real work (tens of seconds on a multi-million-commit history). On the latest checkout it is quick.
Where to go next
Every table and column you can query.
Cache and parallelism settings that affect query speed.
Hat das auf deinem Setup funktioniert?
Noch nicht bewertet