Vector Similarity Search with PostgreSQL and pgvector
pgvector adds vector similarity search to PostgreSQL.
In this post I cover my quick experiment creating embeddings for text and storing and searching them in PostgreSQL. This makes it easy to query and search structured and unstructured data in PostgreSQL. The code for my experiment is available at https://github.com/gclitheroe/exp.
Embeddings are used to capture how related information is. In this case text. pgvector lets us store the embeddings in a vector type in postgres and query how similar they are. In this case I’m using cosine; a smaller angle between a pair of vectors the closer together they are and the more likely they are to have similar meaning.
I used a couple of data sets from https://www.kaggle.com/
To create embeddings from the text I used Python and the gte-small model:
- Doing well in the Massive Text Embedding Benchmark (MTEB) Leaderboard.
- Trained on English text only.
- Input will be truncated to 512 tokens.
- Embeddings have 384 dimensions.
- Storing an embedding in a Postgres vector type uses
4 * dimensions + 8 bytes
. In this case 1544 bytes per embedding. - Other models are available that have been trained on multi-lingual input. They generate embeddings with more dimensions.
With the source text and embeddings stored in the database it is then easy to query them using SQL and the additional pgvector operators. I also stored embeddings for sample search queries testing easier. In an application these would be computed from user input.
The items
table holds 8469 support tickets with associated embeddings. Queries perform well without an index.
Results are about software problems even though the exact grammar and content of the phrases is different.
SELECT description, embedding <=> (SELECT embedding FROM search WHERE term = 'software problem') AS cos
FROM items ORDER BY cos ASC;
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| 1) If you want new (not already installed) software, you may need to use: +
| +
| 1.) Windows 7 Professional. +
| +
| 2.) This problem started occurring after the recent software update. I haven't made any other changes to the device.
cos | 0.10983020430459844
-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| +
| I have the product purchased as a full time job. I have used the software and it has worked so far and I am satisfied! A few months ago I need assistance as soon as possible because it's affecting my work and productivity.
cos | 0.11112324098668158
-[ RECORD 3 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| I want to give your company a free demo program. Please help me create this program. +
| +
| I want your support. Please add your name. I'm worried that the issue might be hardware-related and might require repair or replacement.
cos | 0.11770975873161138
-[ RECORD 4 ]----
...
Using a LIKE query on the raw text doesn’t yield any results. Although I didn’t spend any time trying to make this into a better phrase query.
SELECT description FROM items WHERE description LIKE '%software problem%';
(0 rows)
There is no specific mention of dogs laying in the sun but the search does find some possible matches including ‘pet’ in an email address.
SELECT description, embedding <=> (SELECT embedding FROM search WHERE term = 'dog laying in the sun') AS cos
FROM items ORDER BY cos ASC;
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| I want a picture of your dog. Please come and visit me soon. +
| +
| I'll keep the pictures. Please come to me soon. I've checked for any available software updates for my {product_purchased}, but there are none.
cos | 0.20796826662189327
-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. 1-800-859-7267 2 e-mail us at [email protected] for questions or to try out this product if you I've tried different settings and configurations on my {product_purchased}, but the issue persists.
cos | 0.2115174712052278
-[ RECORD 3 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| We have two customers: +
| +
| Carnivorous pet! +
| +
| Grenadine (Grizzly) +
| +
| Kelica ( I've tried different settings and configurations on my {product_purchased}, but the issue persists.
cos | 0.2263596774337866
-[ RECORD 4 ]
...
The issues
table holds 1,000,000 GitHub issues with associated embeddings. Query performance can be greatly
improved by adding an index to group the embeddings into lists that are probed at query time. This increases performance
but can reduce recall.
With no index a full table scan is needed and cosine is calculated for all rows.
EXPLAIN ANALYSE SELECT description, embedding <=> (SELECT embedding FROM search WHERE term = 'software problem') AS cos
FROM issues ORDER BY cos ASC LIMIT 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=217132.44..217144.11 rows=100 width=365) (actual time=735.190..743.605 rows=100 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on search (cost=0.00..20.12 rows=4 width=32) (actual time=0.031..0.032 rows=1 loops=1)
Filter: (term = 'software problem'::text)
Rows Removed by Filter: 3
-> Gather Merge (cost=217112.32..314241.30 rows=832476 width=365) (actual time=728.648..737.053 rows=100 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
-> Sort (cost=216112.29..217152.89 rows=416238 width=365) (actual time=702.454..702.461 rows=77 loops=3)
Sort Key: ((issues.embedding <=> $0))
Sort Method: top-N heapsort Memory: 108kB
Worker 0: Sort Method: top-N heapsort Memory: 115kB
Worker 1: Sort Method: top-N heapsort Memory: 108kB
-> Parallel Seq Scan on issues (cost=0.00..200203.97 rows=416238 width=365) (actual time=3.581..642.596 rows=333333 loops=3)
Planning Time: 0.220 ms
JIT:
Functions: 19
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 4.064 ms, Inlining 0.000 ms, Optimization 1.492 ms, Emission 15.255 ms, Total 20.811 ms
Execution Time: 746.185 ms
Adding an index to issues speeds up queries by using approximate nearest neighbor search. This trades some recall for performance. The index must be first created when there is already some data in the tabel
SET maintenance_work_mem TO '512 MB';
CREATE INDEX ON issues USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000);
With the index in place queries are a significantly faster Set search probes to ~sqrt the number of lists. If set equal the number of lists the index won’t be used.
SET ivfflat.probes = 35;
EXPLAIN ANALYSE SELECT description, embedding <=> (SELECT embedding FROM search WHERE term = 'software problem') AS cos
FROM issues ORDER BY cos ASC LIMIT 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=17835.12..17848.26 rows=100 width=365) (actual time=74.305..74.560 rows=100 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on search (cost=0.00..20.12 rows=4 width=32) (actual time=0.026..0.029 rows=1 loops=1)
Filter: (term = 'software problem'::text)
Rows Removed by Filter: 3
-> Index Scan using issues_embedding_idx on issues (cost=17815.00..149137.00 rows=1000000 width=365) (actual time=74.302..74.547 rows=100 loops=1)
Order By: (embedding <=> $0)
Planning Time: 0.260 ms
Execution Time: 74.620 ms
pgvector has made it easy to store and search embedding data in PostgreSQL. With the current explosion of advances in AI a new world of possibilities is opening up.