RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling

The Hebrew University of Jerusalem

Text-to-3D

Abstract

Score Distillation Sampling (SDS) has emerged as a highly effective technique for leveraging 2D diffusion priors for a diverse set of tasks such as text-to-3D generation. While powerful, SDS still struggles to achieve fine-grained alignment with user intent. To overcome this limitation, we introduce RewardSDS, a novel approach that weights noise samples based on the alignment scores of a reward model, producing a weighted SDS loss. This loss prioritizes gradients from noise samples that yield aligned, high-reward outputs. Our approach is broadly applicable and can extend diverse SDS-based methods. In particular, we demonstrate its applicability to Variational Score Distillation (VSD) by introducing RewardVSD. We evaluate RewardSDS and RewardVSD on text-to-image, 2D editing, and text-to-3D generation tasks, demonstrating a significant improvement over SDS and VSD on a diverse set of metrics measuring generation quality and alignment to desired reward models, enabling state-of-the-art performance.
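The core idea above can be sketched in a few lines: compute a per-noise-sample SDS gradient, score each sample with a reward model, and combine the gradients with reward-derived weights. The sketch below is illustrative only; the softmax weighting with a `temperature` parameter, and the names `reward_weighted_grad`, `grads`, and `rewards`, are assumptions for exposition and may differ from the paper's exact weighting scheme (e.g. top-k selection).

```python
import numpy as np

def reward_weighted_grad(grads, rewards, temperature=0.1):
    """Combine per-noise-sample SDS gradients into one update.

    grads:   (k, d) array, one SDS gradient per noise sample.
    rewards: (k,) alignment scores from a reward model (e.g. ImageReward).
    Uses a numerically stable softmax over rewards; lower temperature
    concentrates the update on the highest-reward samples.
    """
    w = np.exp((rewards - rewards.max()) / temperature)
    w = w / w.sum()
    return (w[:, None] * grads).sum(axis=0)

# Toy demo: 3 noise samples, 4-dimensional gradients.
rng = np.random.default_rng(0)
grads = rng.normal(size=(3, 4))
rewards = np.array([0.1, 0.9, 0.2])  # hypothetical reward-model scores
g = reward_weighted_grad(grads, rewards)
```

As the temperature approaches zero, the weighted gradient reduces to that of the single highest-reward noise sample, while a large temperature recovers the uniform averaging of plain SDS.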


Method

Method Figure

Overview of RewardSDS. Given a text prompt, multiple noise samples are drawn, and each is scored by a pre-trained reward model according to how well the output it induces aligns with the prompt. The SDS loss is then computed as a reward-weighted combination over the noise samples, so that gradients from high-reward samples dominate the update. The same weighting applies directly to Variational Score Distillation, yielding RewardVSD.


RewardSDS vs. MVDream

NeRF-Based Scenes

"A cartoon cat eating a cheesecake" (RewardSDS vs. MVDream)
"A penguin with a brown bag in the snow" (RewardSDS vs. MVDream)
"A man with a beard, wearing a suit, holding a pink briefcase" (RewardSDS vs. MVDream)
"A bulldog wearing a black pirate hat" (RewardSDS vs. MVDream)

3DGS-Based Scenes

"Two dogs in the park" (RewardSDS vs. MVDream)
"Two computer monitors on a brown wooden table" (RewardSDS vs. MVDream)
"A DSLR photo of the Imperial State Crown of England" (RewardSDS vs. MVDream)
"A banana on the left of an apple" (RewardSDS vs. MVDream)

Reward Model Effect

"A skyscraper that reaches the clouds" (Aesthetic vs. ImageReward)
"Mini Chinatown" (Aesthetic vs. ImageReward)

BibTeX

@misc{chachy2025rewardsdsaligningscoredistillation,
      title={RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling}, 
      author={Itay Chachy and Guy Yariv and Sagie Benaim},
      year={2025},
      eprint={2503.09601},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.09601}, 
}