Hello, I was trying to use the UniProt().mapping() function and I noticed two potential issues.
Unnecessary duplication of failed IDs
In this batching code:
    for x in tqdm.tqdm(range(size, total, size), disable=not progress):
        link = self._get_next_link(self.services.last_response.headers)
        batch = self.services.http_get(link, frmt="json")
        batches += batch["results"]
        fails += results.get("failedIds", [])  # <---------
    return {"results": batches, "failedIds": fails}
First, results is from outside the loop, so the code is just adding the same "failedIds" values to fails over and over.
Second, it looks like the initial results return ALL failed IDs, not just the first 25 of them, so adding to fails inside the loop is unnecessary.
- I tested in debug mode by sending 1000 Ensembl IDs, and the first results = self.services.http_get() call returned 106 failed IDs. The next call inside the loop (batch = self.services.http_get(link, frmt="json")) returned the same 106 IDs.
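To illustrate what I mean, here is a minimal sketch of a loop that reads "failedIds" only once, from the first response. This is not the library's code: collect_pages and fetch_next are hypothetical names, and it assumes (based on my test above) that the first response already lists every failed ID.

```python
from typing import Callable, Optional


def collect_pages(first_page: dict,
                  fetch_next: Callable[[], Optional[dict]]) -> dict:
    """Merge paginated mapping results (hypothetical helper, not bioservices API).

    first_page: the JSON body of the initial request.
    fetch_next: returns the next page's JSON body, or None when no pages remain.
    """
    results = list(first_page.get("results", []))
    # Assumption: the first response already contains ALL failed IDs,
    # so they are read once here and never again inside the loop.
    failed = list(first_page.get("failedIds", []))

    while True:
        page = fetch_next()
        if page is None:
            break
        # Only the per-page results are accumulated.
        results += page.get("results", [])

    return {"results": results, "failedIds": failed}
```

With this shape, later pages that repeat the same "failedIds" list no longer inflate the final output.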
Using the wrong URL to fetch results
I tried to send a set of 1000 Ensembl IDs (from "Ensembl", to "UniProtKB-Swiss-Prot"). That call stalled out after ~batch 25, ate up all my computer's memory, and killed the kernel. I've repeated this several times with the same results. Doing some investigating, it looks like using idmapping/status/{job_id} is returning a HUGE dictionary with every single piece of data for each matching protein from UniProt, including things like the protein sequence and lists of all alternative sequences and pieces of evidence. This turns out to be a massive amount of data and very quickly eats up all available memory.
On the other hand, using idmapping/results/{job_id} returns just the ID, like {'from': 'ENSG00000080815', 'to': 'P49768'}, which is what I would expect from an ID mapping call.
Looking at the UniProt website about the API, it looks like /status is intended to be used to check if the job is done and is supposed to return something like {"jobStatus":"FINISHED", ...}. In practice it does seem to be returning the first 25 results w/ all the available data in UniProt instead, so that might be an error on UniProt's end?
Either way... it seems like idmapping/results/{job_id} is supposed to be the URL used to retrieve results, and prevents the memory issues.
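As a rough sketch of the change I have in mind, the helpers below build the results URL and pull the next-page link out of a response's Link header. The function names and the BASE_URL constant are my own assumptions for illustration, not the library's internals, and I am assuming the next page is advertised as a standard rel="next" entry in the Link header.

```python
import re
from typing import Mapping, Optional

# Assumption: the UniProt REST API is served from this host.
BASE_URL = "https://rest.uniprot.org"


def results_url(job_id: str) -> str:
    """Build the URL for fetching mapping results (hypothetical helper)."""
    return f"{BASE_URL}/idmapping/results/{job_id}"


def next_link(headers: Mapping[str, str]) -> Optional[str]:
    """Return the rel="next" URL from a Link header, or None if absent."""
    match = re.search(r'<([^>]+)>;\s*rel="next"', headers.get("Link", ""))
    return match.group(1) if match else None
```

The first request would go to results_url(job_id), and each following request would go to whatever next_link() returns until it gives None.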
If you would like, I'd be happy to fork this repo and submit a PR with fixes. Otherwise I can leave it to your team, whichever is easiest for you.