fix: update request to return raw data instead of encoded text #421

Open
hudcap wants to merge 4 commits into autoscrape-labs:main from hudcap:fix-raw-response

Conversation

@hudcap hudcap commented Jun 7, 2026

Bug Fix Pull Request

Related Issue(s)

Fixes #419

Bug Description

tab.request sometimes returns corrupted binary content, with the Unicode replacement character mixed into the bytes.

Root Cause

The returned content is the decoded text re-encoded to bytes, not the original bytes. Decoding loses data whenever a byte sequence has no corresponding character, so the re-encoded result can differ from what the server sent.

Solution

Return the original bytes instead of the encoded text.
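
The data loss can be reproduced outside the browser. The following is a minimal Python sketch, assuming the lossy step behaves like a UTF-8 decode with replacement; the byte string is the 8-byte PNG signature, used here only as an arbitrary example of binary data.

```python
# First 8 bytes of any PNG file; 0x89 is not a valid UTF-8 start byte.
original = b"\x89PNG\r\n\x1a\n"

# Decoding with replacement swaps the invalid byte for U+FFFD.
decoded = original.decode("utf-8", errors="replace")

# Re-encoding cannot recover the byte that was replaced.
round_tripped = decoded.encode("utf-8")

print(decoded[0] == "\ufffd")     # True
print(round_tripped == original)  # False
print(round_tripped[:3])          # b'\xef\xbf\xbd'
```

The single byte 0x89 comes back as the three-byte UTF-8 encoding of U+FFFD, which matches the kind of corruption reported in #419.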

Verification Steps

  1. Performed a request that returns an .xls file (previously corrupted). The file is now returned uncorrupted.

Testing Checklist

  • Added regression test that would have caught this bug
  • Modified existing tests to account for this fix
  • All tests pass
  • Edge cases have been tested

Impact

  • Low (isolated fix with no side effects)
  • Medium (might affect closely related functionality)
  • High (affects multiple areas or changes core behavior)

Backwards Compatibility

  • This change is fully backward compatible
  • This change introduces backward incompatibilities (explain below)

Checklist before requesting a review

This change is so minimal, I don't think most of these are relevant. I don't think I could possibly have messed up any of the below, but if you really want me to, I'll double-check.

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have added test cases that prove my fix is effective
  • I have run poetry run task lint and fixed any issues
  • I have run poetry run task test and all tests pass
  • My commits follow the conventional commits style with message explaining the fix

Summary by CodeRabbit

  • Refactor
    • Improved response handling so binary responses are preserved verbatim, ensuring images and other raw downloads are delivered consistently.
  • Tests
    • Updated integration test to validate response body bytes exactly match expected binary fixtures, improving reliability of binary-response checks.

When the response is decoded, any bytes that cannot be converted to text are replaced with the Unicode replacement character, so the original bytes cannot be recovered.
coderabbitai Bot commented Jun 7, 2026

Walkthrough

Reads fetch response body as an ArrayBuffer immediately (preserving original bytes), concurrently reads and prefix-strips text for JSON parsing, and updates tests to assert exact binary content bytes.

Changes

Response Content Handling

| Layer / File(s) | Summary |
|---|---|
| Early arrayBuffer acquisition and binary return (pydoll/constants.py, tests/integration/test_tab_request_integration.py) | Clones the fetch Response and reads responseClone.arrayBuffer() into content while concurrently reading response.text(), strips anti-XSSI prefixes from text, and returns raw content bytes for non-JSON responses. Integration test updated to assert response.content equals the PNG fixture bytes exactly. |
| Content-Type detection and JSON parsing (pydoll/constants.py) | Reads contentType from response headers, parses text into jsonData for JSON responses and reserializes text on success; parsing failures set jsonData to null. content is not rebuilt from text during this process. |
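
The flow summarized above can be sketched in Python for illustration. This is not the project's code (the real logic is JavaScript inside pydoll/constants.py); the helper name `handle_body`, the `)]}'` prefix, and the `application/json` check are assumptions made for the example.

```python
import json

# Assumed anti-XSSI prefix for illustration; the prefixes actually handled
# by makeRequest are not shown in this conversation.
XSSI_PREFIX = ")]}'\n"


def handle_body(raw: bytes, content_type: str):
    """Hypothetical helper mirroring the described flow; not pydoll code."""
    content = raw  # raw bytes are captured first and never rebuilt from text
    text = raw.decode("utf-8", errors="replace")
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]

    json_data = None
    if "application/json" in content_type.lower():
        try:
            json_data = json.loads(text)
            text = json.dumps(json_data)  # reserialize on success
        except json.JSONDecodeError:
            json_data = None  # parsing failed; content is still intact
    return content, text, json_data


png = b"\x89PNG\r\n\x1a\n"
content, text, data = handle_body(png, "image/png")
print(content == png, data)  # True None
```

The point of the design is that `content` is assigned once from the raw bytes and no later branch touches it, so text handling can be as lossy as it likes without affecting binary downloads.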

🎯 3 (Moderate) | ⏱️ ~20 minutes

A rabbit nibbling bytes with care,
No lost pixels in the midnight air,
ArrayBuffer held, text trimmed and neat,
Pixels safe, no replacement char to meet,
Hooray — the bytes and tests all greet! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: fixing the request function to return raw data instead of encoded text, which directly addresses the corruption issue in the linked bug. |
| Description check | ✅ Passed | The description covers the bug, root cause, solution, and verification steps. Though some testing checklist items are incomplete, the author acknowledges the minimal nature of changes and provides sufficient context for review. |
| Linked Issues check | ✅ Passed | The PR addresses issue #419 by modifying the JavaScript code to acquire raw response bytes early via arrayBuffer() and return original bytes instead of decoded/re-encoded text, directly fixing the Unicode replacement character corruption. |
| Out of Scope Changes check | ✅ Passed | All changes (updates to constants.py's JavaScript makeRequest and test file assertions) are directly in scope for fixing the binary response corruption issue in issue #419. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
pydoll/constants.py (2)

255-257: ⚡ Quick win

Clone and dual-read strategy correctly preserves binary content.

Reading arrayBuffer() from the cloned response ensures the raw bytes are captured before any decoding, which is exactly what's needed to fix the binary corruption issue described in #419. The original response's text() method decodes the body as UTF-8.

⚡ Optional: parallelize the two reads for faster response handling
-            const responseClone = response.clone();
-            const content = await responseClone.arrayBuffer();
-            let text = await response.text();
+            const responseClone = response.clone();
+            const [content, text] = await Promise.all([
+                responseClone.arrayBuffer(),
+                response.text()
+            ]);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pydoll/constants.py` around lines 255 - 257, The current dual-read uses
sequential awaits on responseClone.arrayBuffer() then response.text(), which is
correct but slower; change to run them in parallel by creating the clone
(responseClone) and then await Promise.all([responseClone.arrayBuffer(),
response.text()]) so you populate content and text concurrently (keep using
responseClone for arrayBuffer() to preserve raw bytes and response.text() for
decoded text).

276-279: ⚡ Quick win

Clarify the comment to reflect what changed.

The comment "For non-JSON, keep original text handling" could be misinterpreted. The actual change here is that the previous code was re-encoding text into content (causing binary corruption), and that re-encode step has been removed. The text from line 257 is simply kept as-is, and content is already set from the arrayBuffer read on line 256.

📝 Suggested comment clarification
             } else {
-                // For non-JSON, keep original text handling
+                // Non-JSON: text is already decoded; content already captured as raw bytes
                 jsonData = null;
             }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pydoll/constants.py` around lines 276 - 279, Update the comment next to the
jsonData branch to explicitly state that we removed the prior re-encoding of
text into content (which caused binary corruption); note that content is now
assigned from the arrayBuffer read earlier and text from the upstream variable
is preserved as-is instead of being re-encoded. Reference the variables
jsonData, text, content, and the arrayBuffer read so it's clear the change: no
re-encoding occurs and original text is kept while content comes from the
arrayBuffer.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 59330ab and 84e595e.

📒 Files selected for processing (1)
  • pydoll/constants.py

hudcap commented Jun 7, 2026

It's failing this line:

assert response.content == response.text.encode('utf-8')

I think that test is incorrect (it should be checking response.content against the original bytes) and that's how this bug slipped past, but who am I? I'll wait for someone smarter than me to weigh in, since someone smarter obviously thought differently than me when writing that test.
I can also fix what AI requested if said smarter person agrees :)
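
A small sketch of why that assertion cannot catch this bug, assuming `text` is a lossy UTF-8 decode of the body. `FakeResponse` is a hypothetical stand-in, not pydoll's response class, and the fixture is just the PNG signature bytes.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the response object; not pydoll's class."""
    content: bytes
    text: str


fixture = b"\x89PNG\r\n\x1a\n"  # example binary payload
lossy_text = fixture.decode("utf-8", errors="replace")

# Old behavior: content rebuilt from text, so the bytes are corrupted.
old = FakeResponse(content=lossy_text.encode("utf-8"), text=lossy_text)
# New behavior: content holds the bytes as received.
new = FakeResponse(content=fixture, text=lossy_text)


def old_check(r: FakeResponse) -> bool:
    return r.content == r.text.encode("utf-8")


def new_check(r: FakeResponse) -> bool:
    return r.content == fixture


print(old_check(old), new_check(old))  # True False
print(old_check(new), new_check(new))  # False True
```

Under these assumptions the old assertion passes on corrupted content and fails on correct content, while comparing against the fixture bytes does the opposite.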

hudcap commented Jun 7, 2026

I guess there's no point in the PR if the tests are failing. I'll just change everything and await review

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pydoll/constants.py (1)

274-280: 💤 Low value

Consider clarifying the comments.

The comments on lines 275 and 279 state "Return raw bytes," but raw bytes are always returned in the content field regardless of these code paths. The comments might be clearer if they explicitly stated what changes in these branches, e.g., "jsonData will be null" or "JSON parsing skipped; caller should use content field."

📝 Suggested comment refinement
                 try {
                     jsonData = JSON.parse(text);
                     text = JSON.stringify(jsonData);
                 }} catch (e) {{
-                    // Return raw bytes if parsing fails
+                    // JSON parsing failed; jsonData will be null
                     jsonData = null;
                 }}
             }} else {{
-                // For non-JSON, return raw bytes
+                // Non-JSON response; jsonData will be null
                 jsonData = null;
             }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pydoll/constants.py` around lines 274 - 280, The comments in the try/catch
and else branches that say "Return raw bytes" are misleading because the code
always returns raw bytes in the content field; update those comments to clearly
state that jsonData is set to null and that JSON parsing is being skipped so
callers should use the content field (e.g., replace "Return raw bytes" with
"jsonData will be null — JSON parsing failed; caller should use content bytes"
in the catch block, and "JSON parsing skipped; caller should use content bytes"
in the non-JSON else branch). Ensure the comments reference the jsonData
variable and the content field near the catch/else around the JSON parsing
logic.

ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between 84e595e and 1023bcd.

📒 Files selected for processing (2)
  • pydoll/constants.py
  • tests/integration/test_tab_request_integration.py

hudcap commented Jun 8, 2026

Sorry, I'm out of my depth. Tests seem to have passed but there's a public key error. No idea what to do about that

Development

Successfully merging this pull request may close these issues.

[Bug]: Unicode replacement character in binary response

1 participant