Full-Text Search on File Metadata with PostgreSQL
To make uploaded files searchable by name, caption, and tags, store a tsvector column derived from your text fields, index it with GIN, and query it with tsquery ranked by ts_rank — all without leaving PostgreSQL. Combine it with JSONB filters and you get fast, relevance-ordered search over your media catalog.
This guide is part of Metadata Indexing & Search within Backend Validation & Cloud Storage Architecture. It builds on how to index file metadata in PostgreSQL.
When to use this approach
- Users need to search files by free text (filename, title, tags) not just exact match.
- You already store metadata in Postgres and want to avoid a separate search engine.
- You need relevance ranking and the ability to combine text search with structured filters.
Prerequisites
- PostgreSQL 12+ (generated columns require 12; earlier needs a trigger).
- A table holding one row per uploaded file with text fields and a JSONB metadata column.
- Permission to create columns and indexes on that table.
Implementation
Store a generated tsvector that concatenates your searchable fields with weights, index it, then query with websearch_to_tsquery for a friendly search syntax.
-- 1. Base table for uploaded file records.
CREATE TABLE files (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
filename text NOT NULL,
title text,
tags text[] DEFAULT '{}',
metadata jsonb NOT NULL DEFAULT '{}',
created_at timestamptz NOT NULL DEFAULT now()
);
-- 2. A generated tsvector combining fields with relevance weights.
ALTER TABLE files ADD COLUMN search_vec tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(filename, '')), 'B') ||
setweight(to_tsvector('english', array_to_string(tags, ' ')), 'C')
) STORED;
-- 3. A GIN index makes tsquery lookups fast.
CREATE INDEX files_search_idx ON files USING GIN (search_vec);
-- 4. A GIN index on JSONB enables containment filters alongside text.
CREATE INDEX files_metadata_idx ON files USING GIN (metadata jsonb_path_ops);
Line-by-line on the critical pieces
to_tsvector('english', ...)normalizes text into lexemes — lowercasing, stemming (“uploads” matches “upload”), and dropping stop words. The first argument is the text-search configuration; pick the language your content is in.setweight(..., 'A')tags lexemes with a weight class A–D. A title match should rank above a filename match, so title getsA, filenameB, tagsC.ts_ranklater turns these into a score.coalesce(..., '')guards againstNULLfields; concatenating aNULLtsvector would null the whole expression.STOREDgenerated column keepssearch_vecautomatically in sync on every insert and update, so you never maintain it by hand.USING GINis the index type built for set-membership lookups like “does this vector contain this lexeme”. A btree index cannot serve@@queries.jsonb_path_opsis a compact GIN operator class optimized for the@>containment operator you use to filter structured metadata.
Querying with ranking
websearch_to_tsquery accepts user-friendly input (quoted phrases, or, - to exclude) and the @@ operator matches it against the vector. ts_rank scores each row so you can order by relevance.
SELECT id, filename, title,
ts_rank(search_vec, query) AS rank
FROM files,
websearch_to_tsquery('english', $1) AS query
WHERE search_vec @@ query
AND metadata @> $2::jsonb -- structured filter, e.g. {"type":"video"}
ORDER BY rank DESC, created_at DESC
LIMIT 20;
Calling it from TypeScript with the pg driver:
import { Pool } from "pg";
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
export async function searchFiles(term: string, filter: Record<string, unknown>) {
const { rows } = await pool.query(
`SELECT id, filename, title, ts_rank(search_vec, query) AS rank
FROM files, websearch_to_tsquery('english', $1) AS query
WHERE search_vec @@ query
AND metadata @> $2::jsonb
ORDER BY rank DESC, created_at DESC
LIMIT 20`,
[term, JSON.stringify(filter)],
);
return rows;
}
// searchFiles("annual report -draft", { type: "document" })
// => rows ordered by relevance, documents only, excluding "draft".
How the pieces fit together
Combining with JSONB metadata
Upload metadata is often semi-structured — dimensions, codec, EXIF — and fits naturally in JSONB. The @> containment operator filters on it in the same query as the text search, and because both columns are GIN-indexed, Postgres can combine them efficiently. To search inside JSONB text values too, fold them into the tsvector:
-- Include a JSONB text field (e.g. metadata->>'description') in the vector.
ALTER TABLE files DROP COLUMN search_vec;
ALTER TABLE files ADD COLUMN search_vec tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(metadata->>'description', '')), 'B')
) STORED;
Configuration gotchas
Search returns nothing for partial words
Full-text search matches whole lexemes, not substrings, so “rep” will not match “report”. For prefix matching use to_tsquery('english', 'report:*'), or add a pg_trgm trigram index if you need true substring search.
Wrong language config silently hurts recall
Indexing English content with the simple configuration skips stemming, so “uploads” will not match “upload”. Use the configuration that matches your content’s language consistently in both the column and the query.
GIN index not used on small tables
On a tiny table the planner may sequential-scan instead of using the index; that is correct behavior. Verify with EXPLAIN on a realistically sized table, not a handful of test rows.
Verification
Confirm the index is used and matches behave as expected:
EXPLAIN ANALYZE
SELECT id FROM files, websearch_to_tsquery('english', 'annual report') AS query
WHERE search_vec @@ query;
-- Expected: a "Bitmap Index Scan on files_search_idx" node in the plan.
FAQ
Should I use a trigger or a generated column for the tsvector?
On PostgreSQL 12+, prefer a STORED generated column — it is declarative and cannot drift. Use a trigger only on older versions that lack generated columns.
When should I reach for a dedicated search engine instead?
Postgres full-text search comfortably handles millions of rows with ranking and filters. Move to a dedicated engine when you need fuzzy typo-tolerance, faceted aggregation at scale, or sub-millisecond latency across very large corpora.
Can I rank phrase proximity, not just term presence?
Yes. Use phraseto_tsquery or the <-> distance operator in a tsquery to require terms to appear adjacently, then rank with ts_rank_cd, which accounts for term proximity.