feat: visitor classification for robot/machine/datacenter filtering #174

Open: slint wants to merge 1 commit into inveniosoftware:master from slint:refactor
slint (Member) commented Jun 11, 2026

Depends on inveniosoftware/counter-robots#20

Route the event preprocessors through a single counter-robots classifier, exposed
as the cached current_stats.visitor_classifier property on the extension state
(built lazily, like the other cached properties). It is built from
STATS_VISITOR_CLASSIFIER, which is either an import path or a callable of the form
app -> Classifier. The default factory, default_visitor_classifier in ext.py,
composes the COUNTER baseline with the extended preset and, when
STATS_VISITOR_ASN_DB points at a GeoLite2-ASN .mmdb file, adds a maxminddb-backed
ASN resolver. flag_robots / flag_machines keep setting is_robot / is_machine, now
through this classifier.
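
As a rough illustration of the lazy, config-driven construction described above, here is a minimal sketch. It does not use the real invenio-stats or counter-robots code: `StatsState`, `resolve_factory`, the `builds` counter, and the dotted-path convention are assumptions made for the example, and the factory receives a plain config dict rather than a Flask app.

```python
from functools import cached_property
from importlib import import_module


def resolve_factory(value):
    """Accept either a callable or a dotted import path (assumed convention)."""
    if callable(value):
        return value
    module_name, _, attr = value.rpartition(".")
    return getattr(import_module(module_name), attr)


class StatsState:
    """Hypothetical stand-in for the extension state object."""

    def __init__(self, config):
        self.config = config
        self.builds = 0  # only here to make the caching observable

    @cached_property
    def visitor_classifier(self):
        # Built on first access, then stored on the instance by cached_property.
        factory = resolve_factory(self.config["STATS_VISITOR_CLASSIFIER"])
        self.builds += 1
        return factory(self.config)
```

With this shape, a deployment could swap the classifier by changing one config value, and the factory would run at most once per state object.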

Add exclude_datacenter_browser, which drops events whose user agent looks like a
browser but whose IP resolves to a datacenter/hosting ASN (automation faking a
browser from cloud infrastructure). It only excludes: it returns None or the
document unchanged and writes nothing to the event. It must run before
anonymize_user, which removes ip_address.
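
The ordering constraint can be sketched as follows. This is an illustrative model only: `StubClassifier` and its two methods are hypothetical and not the counter-robots API, `run_preprocessors` is a simplified stand-in for the event pipeline, and the IP addresses come from the reserved documentation ranges.

```python
from functools import partial


class StubClassifier:
    """Hypothetical classifier; the real one comes from counter-robots."""

    DATACENTER_IPS = {"203.0.113.7"}  # made-up example address

    def looks_like_browser(self, user_agent):
        return user_agent.startswith("Mozilla/")

    def is_datacenter_ip(self, ip):
        return ip in self.DATACENTER_IPS


def exclude_datacenter_browser(doc, classifier):
    """Return None to drop the event, or the document unchanged."""
    ua, ip = doc.get("user_agent"), doc.get("ip_address")
    if ua and ip and classifier.looks_like_browser(ua) and classifier.is_datacenter_ip(ip):
        return None
    return doc  # nothing is written to the event


def anonymize_user(doc):
    """Simplified: only models the removal of ip_address."""
    doc.pop("ip_address", None)
    return doc


def run_preprocessors(doc, preprocessors):
    """Apply preprocessors in order; a None result drops the event."""
    for preprocess in preprocessors:
        doc = preprocess(doc)
        if doc is None:
            return None
    return doc


exclude = partial(exclude_datacenter_browser, classifier=StubClassifier())
```

In this model, placing `exclude` after `anonymize_user` lets the same event through, because the IP address is already gone by the time the check runs.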

invenio-stats holds no robot or ASN lists. Requires counter-robots>=2026.6.
