How to Align Large Language Models with Human Preferences Using Direct Preference Optimization, QLoRA, and UltraFeedback

In this tutorial, we walk through an end-to-end Direct Preference Optimization (DPO) workflow for aligning a large language model with human preferences without training a separate reward model. We combine TRL's DPOTrainer with QLoRA and PEFT so that preference-based fine-tuning fits on a single Colab GPU. We train directly on the binarized UltraFeedback dataset, where each prompt comes with a chosen and a rejected response, which lets us shape the model's behavior and style rather than just its factual recall.
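Before diving into the code, it helps to see the objective DPO optimizes. The snippet below is our own minimal, illustrative reimplementation of the loss (not TRL's internal code): given the summed log-probabilities of the chosen and rejected completions under the policy and a frozen reference model, DPO minimizes the negative log-sigmoid of the scaled margin between the two implicit rewards.
import torch
import torch.nn.functional as F

def dpo_loss_sketch(policy_chosen_logps, policy_rejected_logps, ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios between policy and reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # A larger margin means the policy prefers the chosen answer more strongly than the reference does.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy tensors standing in for per-sequence summed log-probabilities.
print(dpo_loss_sketch(torch.tensor([-10.0]), torch.tensor([-12.0]), torch.tensor([-11.0]), torch.tensor([-11.5])))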
!pip -q install -U "transformers>=4.45.0" "datasets>=2.19.0" "accelerate>=0.33.0" "trl>=0.27.0" "peft>=0.12.0" "bitsandbytes>=0.43.0" "sentencepiece" "evaluate"
import os
import math
import random
import torch
SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2-0.5B-Instruct")
DATASET_NAME = "HuggingFaceH4/ultrafeedback_binarized"
OUTPUT_DIR = "dpo_ultrafeedback_qlora"
MAX_TRAIN_SAMPLES = 8000
MAX_EVAL_SAMPLES = 200
MAX_PROMPT_LEN = 512
MAX_COMPLETION_LEN = 256
BETA = 0.1
LR = 2e-4
EPOCHS = 1
PER_DEVICE_BS = 2
GRAD_ACCUM = 8
LOGGING_STEPS = 10
SAVE_STEPS = 200
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device, "GPU:", torch.cuda.get_device_name(0) if device == "cuda" else "None")
We set up the environment and install all the libraries required for DPO, PEFT, and quantized training. We define all hyperparameters, dataset limits, and optimization settings in one place. We also seed the random number generators and check GPU availability to make runs reproducible.
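As a quick sanity check on these settings, we can compute the effective batch size and the approximate number of optimizer steps per epoch before launching anything. This is plain arithmetic on the constants above and assumes the full 8,000-sample cap survives filtering.
effective_bs = PER_DEVICE_BS * GRAD_ACCUM  # 2 * 8 = 16 sequences per optimizer step
approx_steps_per_epoch = math.ceil(MAX_TRAIN_SAMPLES / effective_bs)  # upper bound; filtering may drop a few examples
print("Effective batch size:", effective_bs, "| approx. optimizer steps per epoch:", approx_steps_per_epoch)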
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16,
    device_map="auto",
)
model.config.use_cache = False
We load the tokenizer and the base language model with 4-bit quantization to minimize memory usage. We configure bitsandbytes so that QLoRA-style compute works on Colab GPUs. We also disable the KV cache, which would otherwise conflict with gradient checkpointing during training.
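If you want to confirm that 4-bit loading actually took effect, an optional check is to print the model's memory footprint and inspect one of the projection layers; get_memory_footprint() is a standard Transformers utility, so this sketch should work as-is on the model loaded above.
# A ~0.5B-parameter model loaded in 4-bit should report well under 1 GB here.
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
# bitsandbytes swaps Linear layers for 4-bit variants; spot-check the class of one q_proj module.
for name, module in model.named_modules():
    if name.endswith("q_proj"):
        print(name, "->", type(module).__name__)
        break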
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
model.gradient_checkpointing_enable()
We attach LoRA adapters to the model's attention and feed-forward projection layers. We limit training to this small set of adapter parameters to keep fine-tuning efficient and stable. We also enable gradient checkpointing to further reduce GPU memory usage during training.
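The summary from print_trainable_parameters can also be reproduced by hand, which makes the effect of LoRA concrete: only the small adapter matrices require gradients while the quantized base weights stay frozen. This is an optional sketch over the model defined above; note that 4-bit weights are stored packed, so the raw total below may be lower than the model's true fp16 parameter count.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())  # packed 4-bit storage undercounts the frozen weights
print(f"Trainable params: {trainable:,} of {total:,} stored elements ({100 * trainable / total:.2f}%)")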
from datasets import load_dataset
ds = load_dataset(DATASET_NAME)
train_split = "train_prefs" if "train_prefs" in ds else ("train" if "train" in ds else list(ds.keys())[0])
test_split = "test_prefs" if "test_prefs" in ds else ("test" if "test" in ds else None)
train_raw = ds[train_split]
test_raw = ds[test_split] if test_split is not None else None
print("Splits:", ds.keys())
print("Using train split:", train_split, "size:", len(train_raw))
if test_raw is not None:
    print("Using test split:", test_split, "size:", len(test_raw))
def _extract_last_user_and_assistant(messages):
    last_user_idx = None
    last_asst_idx = None
    for i, m in enumerate(messages):
        if m.get("role") == "user":
            last_user_idx = i
        if m.get("role") == "assistant":
            last_asst_idx = i
    if last_user_idx is None or last_asst_idx is None:
        return None, None
    prompt_messages = messages[: last_user_idx + 1]
    assistant_text = messages[last_asst_idx].get("content", "")
    return prompt_messages, assistant_text
def format_example(ex):
    chosen_msgs = ex["chosen"]
    rejected_msgs = ex["rejected"]
    prompt_msgs_c, chosen_text = _extract_last_user_and_assistant(chosen_msgs)
    prompt_msgs_r, rejected_text = _extract_last_user_and_assistant(rejected_msgs)
    if prompt_msgs_c is None or prompt_msgs_r is None:
        return {"prompt": None, "chosen": None, "rejected": None}
    prompt_text = tokenizer.apply_chat_template(
        prompt_msgs_c, tokenize=False, add_generation_prompt=True
    )
    return {
        "prompt": prompt_text,
        "chosen": chosen_text.strip(),
        "rejected": rejected_text.strip(),
    }
train_raw = train_raw.shuffle(seed=SEED)
train_raw = train_raw.select(range(min(MAX_TRAIN_SAMPLES, len(train_raw))))
train_ds = train_raw.map(format_example, remove_columns=train_raw.column_names)
train_ds = train_ds.filter(lambda x: x["prompt"] is not None and len(x["chosen"]) > 0 and len(x["rejected"]) > 0)
if test_raw is not None:
    test_raw = test_raw.shuffle(seed=SEED)
    test_raw = test_raw.select(range(min(MAX_EVAL_SAMPLES, len(test_raw))))
    eval_ds = test_raw.map(format_example, remove_columns=test_raw.column_names)
    eval_ds = eval_ds.filter(lambda x: x["prompt"] is not None and len(x["chosen"]) > 0 and len(x["rejected"]) > 0)
else:
    eval_ds = None
print("Train examples:", len(train_ds), "Eval examples:", len(eval_ds) if eval_ds is not None else 0)
print(train_ds[0])
We load the binarized UltraFeedback dataset and dynamically select the appropriate train and test splits. We extract the prompt, the chosen response, and the rejected response from each conversation and format the prompt with the model's chat template. We then shuffle, filter, and subsample the data to build clean, compact training and evaluation sets.
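Before training, it is worth verifying that the truncation limits chosen earlier fit this data. The optional sketch below tokenizes a small sample of the formatted examples and reports rough length statistics; the 200-example sample size is an arbitrary choice of ours.
sample_n = min(200, len(train_ds))
prompt_lens = [len(tokenizer(train_ds[i]["prompt"]).input_ids) for i in range(sample_n)]
chosen_lens = [len(tokenizer(train_ds[i]["chosen"]).input_ids) for i in range(sample_n)]
print("Median prompt tokens:", sorted(prompt_lens)[sample_n // 2])
print("Median chosen tokens:", sorted(chosen_lens)[sample_n // 2])
print("Prompts longer than MAX_PROMPT_LEN:", sum(l > MAX_PROMPT_LEN for l in prompt_lens))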
from trl import DPOTrainer, DPOConfig
use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
use_fp16 = torch.cuda.is_available() and not use_bf16
training_args = DPOConfig(
    output_dir=OUTPUT_DIR,
    beta=BETA,
    per_device_train_batch_size=PER_DEVICE_BS,
    gradient_accumulation_steps=GRAD_ACCUM,
    num_train_epochs=EPOCHS,
    learning_rate=LR,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    logging_steps=LOGGING_STEPS,
    save_steps=SAVE_STEPS,
    save_total_limit=2,
    bf16=use_bf16,
    fp16=use_fp16,
    optim="paged_adamw_8bit",
    max_length=MAX_PROMPT_LEN + MAX_COMPLETION_LEN,
    max_prompt_length=MAX_PROMPT_LEN,
    report_to="none",
)
trainer = DPOTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print("Saved to:", OUTPUT_DIR)
We configure the DPO training objective with carefully chosen scheduling and optimization parameters. We use DPOTrainer to optimize directly on preference pairs without a reward model. We train the LoRA adapters and save the aligned model artifacts for later use.
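If you later want a single standalone checkpoint rather than base model plus adapter, a common optional follow-up is to reload the base model in half precision, attach the saved adapter, and merge it. The sketch below assumes the adapter saved to OUTPUT_DIR above; merging into a 4-bit quantized model is not supported, which is why the base is reloaded without quantization here, and the "_merged" output path is our own choice.
from peft import PeftModel
full_precision_base = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)  # CPU is fine for merging
merged_model = PeftModel.from_pretrained(full_precision_base, OUTPUT_DIR).merge_and_unload()
merged_model.save_pretrained(OUTPUT_DIR + "_merged")
tokenizer.save_pretrained(OUTPUT_DIR + "_merged")
print("Merged checkpoint saved to:", OUTPUT_DIR + "_merged")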
from peft import PeftModel
from transformers import pipeline
def generate_text(model_for_gen, prompt, max_new_tokens=180):
    model_for_gen.eval()
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_PROMPT_LEN).to(model_for_gen.device)
    with torch.no_grad():
        out = model_for_gen.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            top_p=0.95,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0], skip_special_tokens=True)
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16 if use_bf16 else torch.float16,
    device_map="auto",
)
base_model.config.use_cache = True
dpo_model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
dpo_model.config.use_cache = True
sample_pool = eval_ds if eval_ds is not None and len(eval_ds) > 0 else train_ds
samples = [sample_pool[i] for i in random.sample(range(len(sample_pool)), k=min(3, len(sample_pool)))]
for i, ex in enumerate(samples, 1):
    prompt = ex["prompt"]
    print("\n" + "="*90)
    print(f"Sample #{i}")
    print("- Prompt:\n", prompt)
    base_out = generate_text(base_model, prompt)
    dpo_out = generate_text(dpo_model, prompt)
    print("\n- Base model output:\n", base_out)
    print("\n- DPO (LoRA) output:\n", dpo_out)
print("\nDone.")
We reload the base model and attach the trained DPO LoRA adapters for inference. We generate responses from both the original and the aligned model on the same prompts so we can compare them side by side. We then qualitatively examine how preference tuning changes the model's behavior by inspecting each output.
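Beyond eyeballing generations, a simple quantitative check (not part of the original flow) is to score each evaluation pair with the DPO model and count how often the chosen completion receives a higher total log-probability than the rejected one. The helper below, sequence_logprob, is our own approximation: it scores the raw completion text without the chat template's assistant wrapping, and the prompt/completion token boundary is approximate.
def sequence_logprob(model_for_score, prompt, completion):
    # Sum of token log-probabilities of the completion given the prompt.
    full = tokenizer(prompt + completion, return_tensors="pt", truncation=True,
                     max_length=MAX_PROMPT_LEN + MAX_COMPLETION_LEN).to(model_for_score.device)
    prompt_len = len(tokenizer(prompt, truncation=True, max_length=MAX_PROMPT_LEN).input_ids)  # approximate boundary
    with torch.no_grad():
        logits = model_for_score(**full).logits
    logps = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full["input_ids"][:, 1:]
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum().item()

wins = 0
n_pairs = min(10, len(sample_pool))
for ex in [sample_pool[i] for i in range(n_pairs)]:
    chosen_lp = sequence_logprob(dpo_model, ex["prompt"], ex["chosen"])
    rejected_lp = sequence_logprob(dpo_model, ex["prompt"], ex["rejected"])
    wins += int(chosen_lp > rejected_lp)
print(f"DPO model assigns higher log-probability to the chosen response in {wins}/{n_pairs} pairs")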
In conclusion, we have shown how DPO provides a stable and efficient alternative to RLHF by optimizing directly on preference pairs with a simple, well-defined objective. Parameter-efficient fine-tuning with LoRA and 4-bit quantization makes realistic experiments possible even under tight compute constraints. Finally, we checked alignment qualitatively by comparing generations before and after DPO training, confirming that the model learns to favor higher-quality responses while remaining lightweight and easy to deploy.