News

What I am reinforcing will become more prominent and that which I choose not to reinforce will become less prominent.
We include the training example used in our paper in data/train/one_shot_rlvr. For the 1(few)-shot RLVR dataset, we duplicate the data until it fills the training batch size (128 in our experiments). Prompt: "The ...
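
The duplication step described above can be sketched in a few lines of Python. This is only an illustration, not the repository's actual preprocessing script: the function name `duplicate_to_batch_size`, the record fields `prompt` and `answer`, and the placeholder values are all hypothetical, and the sketch assumes the examples are simply repeated in order until exactly one batch is filled.

```python
from itertools import cycle, islice


def duplicate_to_batch_size(examples, batch_size=128):
    """Repeat `examples` in order until the result has exactly `batch_size` items.

    Hypothetical helper: assumes "duplicate until training batch size" means
    cycling through the examples and stopping once one batch is full.
    """
    if not examples:
        raise ValueError("need at least one training example")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # cycle() repeats the same objects, so the copies share references;
    # copy each record first if later steps mutate them in place.
    return list(islice(cycle(examples), batch_size))


# Placeholder record; the field names and values are assumptions, not the real data.
one_shot = [{"prompt": "<training prompt>", "answer": "<ground truth>"}]
batch = duplicate_to_batch_size(one_shot, batch_size=128)
```

With a single example, every entry in `batch` is that example. With a few examples, they repeat in round-robin order, so when the batch size is not a multiple of the number of examples, the earlier examples appear one extra time.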