tsvector for tokenization?

I've applied this to my use case (a noisy dataset), with much improved results:

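```sql
CREATE OR REPLACE FUNCTION tsvectortokenize(txt TEXT) RETURNS TEXT[]
LANGUAGE SQL
IMMUTABLE PARALLEL SAFE
BEGIN ATOMIC
    -- Emit each lexeme once per recorded position, so repeated words stay repeated.
    -- Note: array_agg returns NULL (not an empty array) when txt yields no lexemes.
    SELECT array_agg(a.lexeme)
    FROM (SELECT lexeme, positions FROM unnest(to_tsvector('swedish', txt))) a
    JOIN LATERAL unnest(a.positions) AS p(pos) ON true;
END;
```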

This uses the built-in text search system and configurations.
The base postgres Docker images (perhaps most installs?) seem to ship with some 29 text search configurations: one for each of a number of languages, plus a language-neutral `simple` configuration.
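To check what a particular install actually ships, the configurations can be listed from the system catalog. This is only a quick sketch; the count will vary with the PostgreSQL version and build.

```sql
-- List the text search configurations available in this database.
-- pg_ts_config is the system catalog behind psql's \dF command.
SELECT cfgname
FROM pg_catalog.pg_ts_config
ORDER BY cfgname;

-- How many there are on this install.
SELECT count(*) AS n_configs
FROM pg_catalog.pg_ts_config;
```

Any name from the first query can be used in place of `'swedish'` in the function above.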

The text search configuration applies stop word lists, punctuation removal, stemming, and possibly more that I haven't looked into (dictionary lookup?).
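One way to see what a configuration does to a given input is the built-in `ts_debug` function, which reports, token by token, which dictionaries were consulted and what lexemes came out. A small sketch; the sample sentence is made up for illustration.

```sql
-- Inspect how the 'swedish' configuration handles each token.
-- alias:        token type assigned by the parser (word, blank, number, ...)
-- dictionaries: dictionaries configured for that token type
-- lexemes:      resulting lexemes; an empty array ({}) marks a stop word,
--               NULL means no dictionary recognized the token
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('swedish', 'Katterna sover, och hundarna springer!');
```

Punctuation and whitespace show up as separate tokens that produce no lexemes, which is where the "punctuation removal" comes from.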

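`to_tsvector` normally deduplicates: it returns each distinct lexeme once, together with the list of positions where it occurred, so the number of positions is effectively a word count. That could be useful when computing word_freqs, but here the deduplication is detrimental, so the query reverses it by emitting each lexeme once per position. I haven't analyzed whether keeping the counts could simplify the places where they are actually used, so this may be the wrong approach.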
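To make the deduplication and its reversal concrete, here is a small walk-through using the example sentence from the PostgreSQL documentation. It uses the `english` configuration rather than `swedish` purely so the sample text is easy to read; that substitution is an assumption of this sketch, not part of the function above.

```sql
-- 1. to_tsvector keeps each distinct lexeme once, with its positions.
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
-- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

-- 2. unnest(tsvector) turns that into one row per distinct lexeme.
SELECT lexeme, positions
FROM unnest(to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats'));

-- 3. Unnesting the positions as well yields one row per occurrence,
--    so 'fat' is aggregated twice. Element order is not guaranteed
--    because array_agg has no ORDER BY here.
SELECT array_agg(a.lexeme) AS tokens
FROM (
    SELECT lexeme, positions
    FROM unnest(to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats'))
) a
JOIN LATERAL unnest(a.positions) AS p(pos) ON true;
```

If token order ever matters, `array_agg(a.lexeme ORDER BY p.pos)` would return the lexemes in their original text order.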

On my dataset, `tsvectortokenize()` is about 4.5 times faster than `stopwordfilter(bm25simpletokenize())`, but I think the absolute numbers make this relatively meaningless: word score map dominates my indexing operations by two orders of magnitude.

Not sure if this is something you'd like to apply to this project, but I thought I'd share since it was a large improvement for me.
