fix: update request to return raw data instead of encoded text by hudcap · Pull Request #421 · autoscrape-labs/pydoll

hudcap · 2026-06-07T15:29:11Z

Bug Fix Pull Request

Related Issue(s)

Fixes #419

Bug Description

tab.request sometimes returns corrupted binary content, with the Unicode replacement character mixed into the bytes.

Root Cause

The returned content is the decoded text re-encoded into bytes, not the original bytes. Decoding is lossy: any byte sequence with no valid character mapping is replaced during decoding, so re-encoding cannot recover the original data.
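
The lossy round trip can be reproduced outside the browser. The following is a minimal Python sketch, not pydoll's code: the real decoding happens in the JavaScript that runs in the page, and the sample bytes (a PNG signature) are chosen here only for illustration.

```python
# Illustrative sample: the 8-byte PNG signature. Its first byte (0x89)
# is not valid UTF-8 on its own.
original = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Lossy decode: the invalid byte becomes U+FFFD, the Unicode replacement character.
text = original.decode("utf-8", errors="replace")

# Re-encoding the decoded text does not bring the original byte back;
# U+FFFD is encoded as the three bytes EF BF BD instead.
round_tripped = text.encode("utf-8")

print(text.count("\ufffd"))        # number of replaced bytes
print(round_tripped == original)   # False: the data has changed
print(round_tripped[:3])           # the encoded replacement character
```

Once a byte has been replaced with U+FFFD there is no way to tell which byte it was, which is why the fix has to keep the raw bytes rather than repair the text afterwards.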

Solution

Return the original bytes instead of the encoded text.

Verification Steps

Performed a request that returns an .xls file (previously corrupted). The file is now returned uncorrupted.

Testing Checklist

Added regression test that would have caught this bug
Modified existing tests to account for this fix
All tests pass
Edge cases have been tested

Impact

Low (isolated fix with no side effects)
Medium (might affect closely related functionality)
High (affects multiple areas or changes core behavior)

Backwards Compatibility

This change is fully backward compatible
This change introduces backward incompatibilities (explain below)

Checklist before requesting a review

This change is so minimal, I don't think most of these are relevant. I don't think I could possibly have messed up any of the below, but if you really want me to, I'll double-check.

My code follows the style guidelines of this project
I have performed a self-review of my code
I have added test cases that prove my fix is effective
I have run poetry run task lint and fixed any issues
I have run poetry run task test and all tests pass
My commits follow the conventional commits style with message explaining the fix

Summary by CodeRabbit

Refactor
- Improved response handling so binary responses are preserved verbatim, ensuring images and other raw downloads are delivered consistently.
Tests
- Updated integration test to validate response body bytes exactly match expected binary fixtures, improving reliability of binary-response checks.

When the response is decoded, any bytes that cannot be converted to text are replaced with the unicode replacement character, and are therefore lost forever.

coderabbitai · 2026-06-07T15:29:25Z

Walkthrough

Reads fetch response body as an ArrayBuffer immediately (preserving original bytes), concurrently reads and prefix-strips text for JSON parsing, and updates tests to assert exact binary content bytes.

Changes

Response Content Handling

| Layer / File(s) | Summary |
| --- | --- |
| Early arrayBuffer acquisition and binary return: `pydoll/constants.py`, `tests/integration/test_tab_request_integration.py` | Clones the fetch Response and reads `responseClone.arrayBuffer()` into `content` while concurrently reading `response.text()`, strips anti-XSSI prefixes from `text`, and returns raw `content` bytes for non-JSON responses. Integration test updated to assert `response.content` equals the PNG fixture bytes exactly. |
| Content-Type detection and JSON parsing: `pydoll/constants.py` | Reads `contentType` from response headers, parses `text` into `jsonData` for JSON responses and reserializes `text` on success; parsing failures set `jsonData` to `null`. `content` is not rebuilt from `text` during this process. |
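
The flow summarized above (keep the raw bytes, strip an anti-XSSI prefix from the decoded text, parse JSON only for JSON content types) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the JavaScript in `pydoll/constants.py`: the function name `parse_body`, the prefix list, and the content-type check are all hypothetical, and the reserialization of `text` after a successful parse is left out.

```python
import json

# Assumed prefixes for illustration only; the list used by the project may differ.
XSSI_PREFIXES = (")]}',\n", ")]}'\n", ")]}'")


def parse_body(raw: bytes, content_type: str):
    """Return (content, text, json_data) without ever rebuilding content from text."""
    # Decoded view of the body; may contain U+FFFD for binary payloads.
    text = raw.decode("utf-8", errors="replace")

    # Strip a leading anti-XSSI prefix from the text only.
    for prefix in XSSI_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break

    json_data = None
    if "application/json" in content_type.lower():
        try:
            json_data = json.loads(text)
        except json.JSONDecodeError:
            # Parsing failed; json_data stays None and callers use content.
            json_data = None

    # content is always the untouched raw bytes.
    return raw, text, json_data


content, text, json_data = parse_body(b")]}'\n{\"a\": 1}", "application/json")
print(content, text, json_data)
```

The point of the sketch is the last line of the function: `content` is returned exactly as received, whatever happens to `text`.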

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

A rabbit nibbling bytes with care,
No lost pixels in the midnight air,
ArrayBuffer held, text trimmed and neat,
Pixels safe, no replacement char to meet,
Hooray — the bytes and tests all greet! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main change: fixing the request function to return raw data instead of encoded text, which directly addresses the corruption issue in the linked bug. |
| Description check | ✅ Passed | The description covers the bug, root cause, solution, and verification steps. Though some testing checklist items are incomplete, the author acknowledges the minimal nature of changes and provides sufficient context for review. |
| Linked Issues check | ✅ Passed | The PR addresses issue `#419` by modifying the JavaScript code to acquire raw response bytes early via arrayBuffer() and return original bytes instead of decoded/re-encoded text, directly fixing the Unicode replacement character corruption. |
| Out of Scope Changes check | ✅ Passed | All changes (updates to constants.py's JavaScript makeRequest and test file assertions) are directly in scope for fixing the binary response corruption issue in issue `#419`. |


coderabbitai

🧹 Nitpick comments (2)

pydoll/constants.py (2)

255-257: ⚡ Quick win

Clone and dual-read strategy correctly preserves binary content.

Reading arrayBuffer() from the cloned response ensures the raw bytes are captured before any decoding, which is exactly what's needed to fix the binary corruption issue described in #419. The original response's text() method leverages the browser's native charset detection.

⚡ Optional: parallelize the two reads for faster response handling

-            const responseClone = response.clone();
-            const content = await responseClone.arrayBuffer();
-            let text = await response.text();
+            const responseClone = response.clone();
+            const [content, text] = await Promise.all([
+                responseClone.arrayBuffer(),
+                response.text()
+            ]);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pydoll/constants.py` around lines 255 - 257, The current dual-read uses
sequential awaits on responseClone.arrayBuffer() then response.text(), which is
correct but slower; change to run them in parallel by creating the clone
(responseClone) and then await Promise.all([responseClone.arrayBuffer(),
response.text()]) so you populate content and text concurrently (keep using
responseClone for arrayBuffer() to preserve raw bytes and response.text() for
decoded text).

276-279: ⚡ Quick win

Clarify the comment to reflect what changed.

The comment "For non-JSON, keep original text handling" could be misinterpreted. The actual change here is that the previous code was re-encoding text into content (causing binary corruption), and that re-encode step has been removed. The text from line 257 is simply kept as-is, and content is already set from the arrayBuffer read on line 256.

📝 Suggested comment clarification

             } else {
-                // For non-JSON, keep original text handling
+                // Non-JSON: text is already decoded; content already captured as raw bytes
                 jsonData = null;
             }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pydoll/constants.py` around lines 276 - 279, Update the comment next to the
jsonData branch to explicitly state that we removed the prior re-encoding of
text into content (which caused binary corruption); note that content is now
assigned from the arrayBuffer read earlier and text from the upstream variable
is preserved as-is instead of being re-encoded. Reference the variables
jsonData, text, content, and the arrayBuffer read so it's clear the change: no
re-encoding occurs and original text is kept while content comes from the
arrayBuffer.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 316a1aa8-60ee-4403-bf5d-f95467ad64d8

📥 Commits

Reviewing files that changed from the base of the PR and between 59330ab and 84e595e.

📒 Files selected for processing (1)

pydoll/constants.py

hudcap · 2026-06-07T15:41:33Z

It's failing this line:

pydoll/tests/integration/test_tab_request_integration.py

Line 345 in 59330ab

assert response.content == response.text.encode('utf-8')

I think that test is incorrect (it should be checking response.content against the original bytes) and that's how this bug slipped past, but who am I? I'll wait for someone smarter than me to weigh in, since someone smarter obviously thought differently than me when writing that test.
I can also fix what AI requested if said smarter person agrees :)
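
To illustrate why an assertion of the form `response.content == response.text.encode('utf-8')` cannot catch this bug, here is a small Python sketch. `FakeResponse`, `old_behavior`, and `new_behavior` are hypothetical stand-ins that model the behavior before and after the fix; they are not pydoll's API, and the fixture bytes are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the object returned by tab.request."""
    content: bytes
    text: str


def old_behavior(raw: bytes) -> FakeResponse:
    # Before the fix: content is rebuilt from the lossily decoded text.
    text = raw.decode("utf-8", errors="replace")
    return FakeResponse(content=text.encode("utf-8"), text=text)


def new_behavior(raw: bytes) -> FakeResponse:
    # After the fix: content is the original bytes; text is still decoded.
    text = raw.decode("utf-8", errors="replace")
    return FakeResponse(content=raw, text=text)


fixture = b"\x89PNG\r\n\x1a\n"  # illustrative binary fixture
old = old_behavior(fixture)
new = new_behavior(fixture)

# The old assertion passes even though content is corrupted: it compares
# the re-encoded text with itself.
print(old.content == old.text.encode("utf-8"))  # True
# Comparing against the fixture bytes exposes the corruption...
print(old.content == fixture)                   # False
# ...and passes once the raw bytes are returned.
print(new.content == fixture)                   # True
# The old assertion now fails for binary bodies, as observed in this PR.
print(new.content == new.text.encode("utf-8"))  # False
```

Under these assumptions, comparing `content` with the fixture bytes is the only check of the four that distinguishes the corrupted behavior from the fixed one.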

hudcap · 2026-06-07T17:09:35Z

I guess there's no point in the PR if the tests are failing. I'll just change everything and await review

coderabbitai

🧹 Nitpick comments (1)

pydoll/constants.py (1)

274-280: 💤 Low value

Consider clarifying the comments.

The comments on lines 275 and 279 state "Return raw bytes," but raw bytes are always returned in the content field regardless of these code paths. The comments might be clearer if they explicitly stated what changes in these branches, e.g., "jsonData will be null" or "JSON parsing skipped; caller should use content field."

📝 Suggested comment refinement

                 try {
                     jsonData = JSON.parse(text);
                     text = JSON.stringify(jsonData);
                 }} catch (e) {{
-                    // Return raw bytes if parsing fails
+                    // JSON parsing failed; jsonData will be null
                     jsonData = null;
                 }}
             }} else {{
-                // For non-JSON, return raw bytes
+                // Non-JSON response; jsonData will be null
                 jsonData = null;
             }}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pydoll/constants.py` around lines 274 - 280, The comments in the try/catch
and else branches that say "Return raw bytes" are misleading because the code
always returns raw bytes in the content field; update those comments to clearly
state that jsonData is set to null and that JSON parsing is being skipped so
callers should use the content field (e.g., replace "Return raw bytes" with
"jsonData will be null — JSON parsing failed; caller should use content bytes"
in the catch block, and "JSON parsing skipped; caller should use content bytes"
in the non-JSON else branch). Ensure the comments reference the jsonData
variable and the content field near the catch/else around the JSON parsing
logic.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6c9e6d4e-0ab7-492d-bf4b-62f96315fbf2

📥 Commits

Reviewing files that changed from the base of the PR and between 84e595e and 1023bcd.

📒 Files selected for processing (2)

pydoll/constants.py
tests/integration/test_tab_request_integration.py

hudcap · 2026-06-08T00:02:24Z

Sorry, I'm out of my depth. Tests seem to have passed but there's a public key error. No idea what to do about that.

fix: update request to return raw data instead of encoded text

84e595e

When the response is decoded, any bytes that cannot be converted to text are replaced with the unicode replacement character, and are therefore lost forever.

coderabbitai Bot reviewed Jun 7, 2026

hudcap added 3 commits June 7, 2026 12:59

fix tab.request test to compare to original bytes

2deb763

fix comments

b609a0f

parallelize content reads

1023bcd

coderabbitai Bot reviewed Jun 7, 2026

hudcap wants to merge 4 commits into autoscrape-labs:main from hudcap:fix-raw-response
