tsvector for tokenization? #10

@peterlindsten

I've applied this to my use case (a noisy dataset), with much improved results:

CREATE OR REPLACE FUNCTION tsvectortokenize(txt TEXT) RETURNS TEXT[]
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
    -- Emit each lexeme once per recorded position, so repeated words stay repeated.
    SELECT array_agg(a.lexeme)
    FROM (SELECT lexeme, positions FROM unnest(to_tsvector('swedish', txt))) a
    JOIN LATERAL unnest(a.positions) ON true;
END;

This uses the built-in text search system and configurations.
The base Postgres Docker images (perhaps most installs?) seem to ship with some 29 text search configurations: one per supported language, plus a language-agnostic "simple" configuration.
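To check which configurations a given install actually has, the system catalog can be queried directly. This is a small sketch; the exact list and count will depend on the PostgreSQL version and build.

```sql
-- List the text search configurations available in the current database.
SELECT cfgname FROM pg_ts_config ORDER BY cfgname;

-- Show the configuration used when to_tsvector() is called without one.
SHOW default_text_search_config;
```

In psql, `\dF` gives the same list.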

The text search configuration applies stop word lists, punctuation removal, stemming, and possibly more that I haven't looked into (dictionary lookup?).
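One way to see what a configuration does to each token is the built-in `ts_debug` function. The sample sentence below is only an illustrative assumption, not taken from my dataset.

```sql
-- For every token the parser finds, show its type, the dictionaries
-- consulted for that type, and the lexemes that came out.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('swedish', 'Katterna jagade inte de andra katterna!');
```

In the result, an empty `lexemes` array (`{}`) marks a token that a dictionary recognized as a stop word, and a NULL means no dictionary recognized the token, which is how whitespace and punctuation get dropped.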

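to_tsvector normally returns each distinct lexeme only once, together with the list of positions where it occurs. That position list could be used as a count when doing word_freqs, but here the deduplication is detrimental, so the query reverses it by emitting each lexeme once per position. I haven't analyzed whether the position lists could simplify the places where word counts are actually used, so this may be the wrong approach.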
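The reversal can be broken into steps as in the sketch below. It uses the 'english' configuration and a made-up sentence purely for readability, and the `ORDER BY` on position is an optional addition that is not part of the function above.

```sql
-- Step 1: the raw tsvector, one entry per distinct lexeme with its position list.
SELECT to_tsvector('english', 'cats chase cats');

-- Step 2: one row per (lexeme, position), which undoes the deduplication.
SELECT a.lexeme, p.pos
FROM unnest(to_tsvector('english', 'cats chase cats')) AS a
CROSS JOIN LATERAL unnest(a.positions) AS p(pos)
ORDER BY p.pos;

-- Step 3: aggregate back into an array, here in original word order.
SELECT array_agg(a.lexeme ORDER BY p.pos)
FROM unnest(to_tsvector('english', 'cats chase cats')) AS a
CROSS JOIN LATERAL unnest(a.positions) AS p(pos);
```

Without the `ORDER BY`, the order of the resulting array is not guaranteed, which should not matter for bag-of-words scoring. One caveat: according to the PostgreSQL documentation on text search limitations, a tsvector keeps at most 256 positions per lexeme, so counts recovered this way saturate for very frequent terms in long documents.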

On my dataset tsvectortokenize() is some 4.5 times faster than stopwordfilter(bm25simpletokenize()), but I think the absolute numbers make this relatively meaningless: building the word score map dominates my indexing operations by two orders of magnitude.

Not sure if this is something you'd like to apply to this project - but thought I'd share since it was a large improvement for me.


Labels: enhancement
