
In evaluations, PWM outperforms SHAC and TD-MPC2, achieving higher rewards and exhibiting smoother optimization landscapes. On 30- and 80-task multi-task benchmarks, PWM reaches higher rewards with faster inference than TD-MPC2. Ablation studies further highlight PWM's robustness to stiff contact models and its higher sample efficiency.

The primary challenge of multi-task RL (Reinforcement Learning) is adapting across diverse morphologies while handling many tasks simultaneously, especially when leveraging large world models such as GAIA-1 and UniSim. Existing methods struggle with scalability and efficiency as model size increases, and they often require near-expert data.

PWM (Policy learning with large World Models) improves upon traditional RL methods by using pretrained world models for first-order gradient policy learning, enabling it to solve tasks with up to 152 action dimensions. This approach achieves up to 27% higher rewards than existing methods without costly online planning. PWM emphasizes that smooth, stable gradients over long horizons matter more than raw prediction accuracy, yielding better policies and faster training than traditional zeroth-order methods.
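To make the first-order idea concrete, below is a minimal sketch of policy learning through a frozen, differentiable world model: the policy is rolled out in imagination for a fixed horizon and the discounted return is backpropagated directly into the policy parameters. All names, network sizes, and dimensions here are illustrative assumptions, not PWM's actual architecture, and the sketch omits components such as the critic and the regularization PWM uses in practice.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; real values depend on the task.
OBS_DIM, ACT_DIM, HORIZON, GAMMA = 64, 16, 32, 0.99

class WorldModel(nn.Module):
    """Stand-in for a pretrained, differentiable world model:
    given (state, action), predict the next state and a reward."""
    def __init__(self):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 256), nn.ELU(), nn.Linear(256, OBS_DIM))
        self.reward = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 256), nn.ELU(), nn.Linear(256, 1))

    def forward(self, s, a):
        x = torch.cat([s, a], dim=-1)
        return self.dynamics(x), self.reward(x).squeeze(-1)

policy = nn.Sequential(
    nn.Linear(OBS_DIM, 256), nn.ELU(), nn.Linear(256, ACT_DIM), nn.Tanh())
world_model = WorldModel()
for p in world_model.parameters():
    p.requires_grad_(False)  # world model stays frozen during policy learning

optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def policy_update(init_states):
    """One first-order update: roll the policy through the frozen world model
    for HORIZON steps and backpropagate the negated discounted return."""
    s, total_return = init_states, 0.0
    for t in range(HORIZON):
        a = policy(s)
        s, r = world_model(s, a)  # gradients flow through the model's dynamics
        total_return = total_return + (GAMMA ** t) * r.mean()
    loss = -total_return  # maximize the imagined return
    optimizer.zero_grad()
    loss.backward()  # first-order gradients w.r.t. policy parameters only
    optimizer.step()
    return loss.item()

# Example usage: one update from a batch of initial states.
policy_update(torch.randn(128, OBS_DIM))
```

The key design point the sketch illustrates is that no zeroth-order estimator or online planner is involved: policy improvement comes entirely from analytic gradients through the (assumed smooth) learned dynamics, which is why gradient smoothness over long horizons matters more than raw prediction accuracy.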