
[Feature Request] Recurrent Depth Latent Reasoning #647

Description

@bitnom

This could have significant implications for scaling the performance of distributed inference, possibly more so than a naive implementation would (an initial thought/guess; citation needed). It is available in Transformers, with this caveat:

The model requires its own KV-cache implementation HuginnDynamicCache, otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones.

but I have no idea whether this makes sacrifices or leaves potential unrealized.
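To illustrate the overwrite problem the quote describes, here is a minimal sketch in plain Python. It is not the actual HuginnDynamicCache implementation; the class names, the `update` signature, and the string stand-ins for key/value tensors are all hypothetical. It assumes a recurrent block reuses the same layer indices on every recurrence step, so a cache keyed only by layer index keeps just the last step's entries.

```python
class NaiveLayerCache:
    """Hypothetical cache keyed only by layer index."""

    def __init__(self):
        self.store = {}

    def update(self, layer_idx, step, kv):
        # The step is ignored, so each later pass through the same
        # layer replaces the entry written by the earlier pass.
        self.store[layer_idx] = kv


class StepAwareCache:
    """Hypothetical cache keyed by (recurrence step, layer index)."""

    def __init__(self):
        self.store = {}

    def update(self, layer_idx, step, kv):
        # Every recurrence step gets its own slot per layer.
        self.store[(step, layer_idx)] = kv


def run_recurrent_block(cache, num_layers=2, num_steps=3):
    """Call the same stack of layers num_steps times, caching each call."""
    for step in range(num_steps):
        for layer_idx in range(num_layers):
            kv = f"kv(step={step}, layer={layer_idx})"  # stand-in for real K/V tensors
            cache.update(layer_idx, step, kv)
    return cache.store


naive = run_recurrent_block(NaiveLayerCache())
step_aware = run_recurrent_block(StepAwareCache())

print(len(naive))       # 2: only the final step survives per layer
print(len(step_aware))  # 6: one entry per (step, layer) pair
```

Whether a real implementation keys by step like this, or keeps only a subset of steps to save memory, is exactly the kind of trade-off the question above is asking about.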

Recently reading bigscience-workshop/petals#483 and listening to the pod got me curious about it. There are the obvious benefits, but I'm wondering more about distributing inference for a single request. It's a pipe dream until it isn't.

Papers

https://arxiv.org/abs/2502.05171
https://arxiv.org/abs/2402.14020

POC Model: https://huggingface.co/tomg-group-umd/huginn-0125

Code

https://github.com/seal-rg/recurrent-pretraining

https://github.com/gair-nlp/prox

Interview Pod: https://www.youtube.com/watch?v=dY90DXLi0vk

easy

Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
