I've applied this to my use case (a noisy dataset), with much improved results:
CREATE OR REPLACE FUNCTION tsvectortokenize(txt TEXT) RETURNS TEXT[]
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
  -- Emit each lexeme once per recorded position, restoring repeated words.
  SELECT array_agg(a.lexeme)
  FROM (SELECT lexeme, positions FROM unnest(to_tsvector('swedish', txt))) a
  JOIN LATERAL unnest(a.positions) ON true;
END;
This uses the built-in text search system and configurations.
The base postgres Docker images (perhaps most installs?) seem to ship with some 29 text search configurations: one per supported language, plus a language-agnostic "simple" configuration.
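If you want to check what your own install has, the configurations are listed in the pg_ts_config system catalog (in psql, \dF shows the same thing). The exact count may differ between versions and builds:

```sql
-- List every text search configuration available in this database.
SELECT cfgname
FROM pg_ts_config
ORDER BY cfgname;
```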
The text search configuration applies stop word lists, punctuation removal, stemming, and possibly more that I'm not aware of (dictionary lookup?).
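One way to see what a configuration actually does to a piece of text is the built-in ts_debug function, which reports the token type, the dictionaries consulted, and the resulting lexemes for each token. The sample sentence here is just a made-up illustration, and I haven't checked its output in detail:

```sql
-- Show how the 'swedish' configuration handles each token.
-- An empty lexemes array means the token was dropped as a stop word;
-- NULL means no dictionary recognized that token type (e.g. punctuation).
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('swedish', 'Katterna springer, och hundarna sover.');
```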
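to_tsvector normally deduplicates: it returns each distinct lexeme once, together with the list of positions where it occurred (so effectively a word count). That could be used when computing word_freqs, but here it is detrimental, so the query reverses it by emitting each lexeme once per position. I haven't analyzed whether this could simplify the places where the counts are actually used, so this may be the wrong approach.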
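A small sketch of what that reversal looks like, using the "simple" configuration so that stemming and stop words don't get in the way. The ORDER BY on position is my own addition for illustration, not part of the function above; without it, array_agg makes no promise about token order:

```sql
-- The tsvector stores each distinct lexeme once, with its positions.
SELECT to_tsvector('simple', 'b a b');
-- 'a':2 'b':1,3

-- Unnesting the positions brings the repeated token back;
-- ordering by position also restores the original word order.
SELECT array_agg(t.lexeme ORDER BY p.pos)
FROM unnest(to_tsvector('simple', 'b a b')) AS t
CROSS JOIN LATERAL unnest(t.positions) AS p(pos);
-- {b,a,b}
```

Whether order matters depends on how the tokens are consumed downstream; for plain frequency counting it shouldn't.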
On my dataset, tsvectortokenize() is some 4.5 times faster than stopwordfilter(bm25simpletokenize()), but I think the absolute numbers make this relatively meaningless: the word score map dominates my indexing operations by two orders of magnitude.
Not sure if this is something you'd like to apply to this project, but I thought I'd share since it was a large improvement for me.