
Fine-Tune NVIDIA Nemotron-3 Nano on a Customer Support Dataset

  • 9 hours ago
  • 16 min read


Introduction


NVIDIA Nemotron-3 is a family of open models built for reasoning, coding, chat, and agentic workflows. The Nano variant packs strong language understanding into a 4-billion-parameter model that can be fine-tuned on a single 24GB GPU, making it practical for teams who want to adapt a capable base model to their own domain without renting a large training cluster.


In this tutorial, we fine-tune Nemotron-3-Nano-4B on a customer support dataset. After training, the model should respond like a professional support agent: acknowledging the customer’s concern, providing a direct resolution, and maintaining a polite, empathetic tone throughout.


We use Low-Rank Adaptation (LoRA) via TRL’s SFTTrainer. LoRA freezes the original model weights and trains only a small set of low-rank matrices inserted into every linear layer, which is why a 4B model can be adapted in a few hours on a single GPU.
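To make the low-rank idea concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear layer. The shapes, values, and helper names (`matvec`, `lora_forward`) are illustrative assumptions, not PEFT internals. The sketch only shows that the frozen weight W is left untouched while the update (alpha / r) * B * A is added on top.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Return W @ x + (alpha / r) * B @ (A @ x).

    W is the frozen d_out x d_in weight, A is r x d_in and B is d_out x r.
    Only A and B would receive gradients during LoRA training.
    """
    base = matvec(W, x)                 # frozen path
    low_rank = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]


# Toy example: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]     # r x d_in
B_zero = [[0.0], [0.0]]   # d_out x r, zero-initialised as in LoRA
x = [1.0, 2.0, 3.0]

# With B at zero the adapted layer matches the frozen layer exactly.
print(lora_forward(W, A, B_zero, x, alpha=2, r=1))     # [1.0, 2.0]

# Once B moves away from zero, the output shifts without touching W.
B_trained = [[0.5], [0.0]]
print(lora_forward(W, A, B_trained, x, alpha=2, r=1))  # [7.0, 2.0]
```

Because B starts at zero, training begins from the base model's behaviour and drifts only as far as the adapter matrices allow.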






What We Are Building


A fine-tuned version of NVIDIA Nemotron-3-Nano-4B that handles customer support queries better than the base model. The pipeline:


  1. Load a customer support Q&A dataset from Hugging Face

  2. Format each example into a prompt-completion pair using a support-agent system prompt

  3. Evaluate the base model’s responses before any training

  4. Fine-tune with LoRA using TRL’s SFTTrainer

  5. Save the LoRA adapter locally and push it to Hugging Face Hub

  6. Compare base model vs fine-tuned model responses side by side




Tech Stack


Component             Tool
Base model            nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16
Dataset               Bitext/Bitext-customer-support-llm-chatbot-training-dataset
Fine-tuning method    LoRA via PEFT
Training framework    TRL SFTTrainer
Deep learning         PyTorch 2.7.1 + CUDA 12.8
Model hub             Hugging Face Hub
Environment           python-dotenv




Project Structure



nemotron_customer_support/
├── finetune.py       # full fine-tuning script: load, format, train, compare
├── .env              # HF_TOKEN
└── requirements.txt




Setting Up the Environment


Nemotron-3 uses a hybrid Mamba-Transformer architecture. The Mamba layers require mamba_ssm and causal_conv1d, which must be compiled against a specific version of PyTorch and CUDA. The safest approach is to start from a clean environment, install the correct PyTorch build first, then install the Mamba packages.



pip install -U packaging ninja

pip uninstall -y torch torchvision torchaudio triton

pip install "torch==2.7.1" "torchvision==0.22.1" "torchaudio==2.7.1" \
    --index-url https://download.pytorch.org/whl/cu128

pip install -U "transformers==4.56.2" tokenizers "trl==0.22.2" \
    accelerate datasets peft pandas tqdm huggingface_hub safetensors python-dotenv

pip install -U --no-build-isolation "mamba_ssm==2.2.5" "causal_conv1d==1.5.2"



Uninstalling the existing PyTorch stack first avoids version conflicts with the Mamba kernel pins. Installing mamba_ssm with --no-build-isolation ensures it compiles against the PyTorch headers we just installed rather than whatever happens to be in the build environment.
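Since the whole setup hinges on exact pins, it can help to check what actually got installed. The following is a small sketch using only the standard library. The helper names (`installed_version`, `find_mismatches`) are hypothetical, and it assumes the distribution names resolve as written here, which may vary with your Python version.

```python
from importlib import metadata

# Pins copied from the install commands above.
EXPECTED_VERSIONS = {
    "torch": "2.7.1",
    "transformers": "4.56.2",
    "trl": "0.22.2",
    "mamba_ssm": "2.2.5",
    "causal_conv1d": "1.5.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Local build tags such as "2.7.1+cu128" still count as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(EXPECTED_VERSIONS).items():
        print(f"{package}: expected {wanted}, found {found}")
```

An empty output means every pin matched. Anything printed is worth fixing before compiling or importing the Mamba kernels.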


After installing, check that CUDA is available and that your GPU has enough VRAM:



import os               # reads the HF_TOKEN environment variable
import platform         # prints the Python version for environment verification
import torch            # checks CUDA availability and GPU properties

from dotenv import load_dotenv  # reads .env and injects HF_TOKEN into the process environment

load_dotenv()  # call before any os.environ.get() — otherwise .env values are not visible

print(f"Python: {platform.python_version()}")         # confirm the Python version in use
print(f"PyTorch: {torch.__version__}")                # confirm the PyTorch version installed
print(f"PyTorch CUDA build: {torch.version.cuda}")    # confirm the CUDA version PyTorch was built with
print(f"CUDA available: {torch.cuda.is_available()}")  # True means a GPU is detected and usable

if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available. Run this script on a machine with a CUDA-capable GPU.")

for idx in range(torch.cuda.device_count()):  # loop over every GPU in the system
    props = torch.cuda.get_device_properties(idx)            # fetch name, VRAM, and compute capability
    total_gb = props.total_memory / 1024**3                  # convert bytes to gigabytes
    print(f"GPU {idx}: {props.name} ({total_gb:.1f} GB VRAM, capability {props.major}.{props.minor})")

if torch.cuda.get_device_properties(0).total_memory < 24 * 1024**3:  # warn if VRAM is below the recommended 24GB
    print("Warning: this 4B LoRA script is tuned for GPUs with at least 24GB VRAM. Reduce batch sizes on smaller GPUs.")

torch.backends.cuda.matmul.allow_tf32 = True  # use TF32 for matrix multiplications — faster on Ampere+ GPUs with minimal precision loss
torch.backends.cudnn.allow_tf32 = True         # use TF32 for cuDNN convolutions as well


torch.backends.cuda.matmul.allow_tf32 enables TF32 for matrix multiplications on Ampere GPUs and above. TF32 keeps FP32’s 8-bit exponent but rounds the inputs of each multiplication to a 10-bit mantissa, while the accumulation still happens in FP32. That is close enough to FP32 for training but significantly faster. The flags apply to operations that run in FP32; BF16 operations are unaffected. Enabling both settings together can cut training time by 20–30% on an RTX 3090 without visible impact on loss curves.
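To get a feel for what a 10-bit mantissa means, the sketch below emulates the precision loss in plain Python by zeroing the low 13 mantissa bits of a float32 value. This is an approximation for illustration only: it truncates rather than rounds, and `truncate_to_tf32` is a hypothetical helper, not a PyTorch function.

```python
import struct


def truncate_to_tf32(value):
    """Keep the sign, 8 exponent bits and top 10 mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]  # float32 bit pattern
    bits &= 0xFFFFE000                                       # zero the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for sample in (1.0, 1.5, 1.0001, 3.14159265):
    approx = truncate_to_tf32(sample)
    print(f"{sample!r:>12} -> {approx!r} (abs error {abs(sample - approx):.2e})")
```

Values near 1.0 keep roughly three decimal digits, which is why the effect on loss curves tends to be small.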


Set your Hugging Face token in .env:



HF_TOKEN=your_huggingface_token_here


Then authenticate:




from huggingface_hub import login  # authenticates with Hugging Face to download gated models and push adapters

hf_token = os.environ.get("HF_TOKEN")  # read the token from the environment — never hardcode it in source
if not hf_token:
    raise ValueError("Set HF_TOKEN in your .env file before running this script.")

login(token=hf_token)             # authenticate the current session with Hugging Face
print("Logged in to Hugging Face.")


The Nemotron-3 model repository on Hugging Face is gated, which means you need to accept the usage terms on the model page before your token will work. Visit nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 on Hugging Face, click Agree, and then your HF_TOKEN will grant download access.




Loading and Processing the Dataset


The Bitext/Bitext-customer-support-llm-chatbot-training-dataset contains 26,872 customer support query-response pairs covering 27 intents, grouped into broader categories such as account management, orders, refunds, shipping, and cancellations. Each row has an instruction column (the customer’s query) and a response column (the reference agent reply), alongside category, intent, and flags columns.




from datasets import DatasetDict, load_dataset  # load_dataset fetches from Hugging Face Hub; DatasetDict holds train/val/test splits

DATASET_REPO      = "Bitext/Bitext-customer-support-llm-chatbot-training-dataset"  # ~27k customer support query-response pairs covering 27 intents
MAX_TRAIN_SAMPLES = 8000   # cap training examples so the run stays manageable on a single GPU
MAX_EVAL_SAMPLES  = 800    # examples held out to measure loss during training
MAX_TEST_SAMPLES  = 300    # examples used only for the before/after response comparison
RANDOM_SEED       = 42     # fixed seed so shuffles and splits are reproducible

full_dataset   = load_dataset(DATASET_REPO)                           # downloads the dataset from Hugging Face Hub
shuffled_train = full_dataset["train"].shuffle(seed=RANDOM_SEED)      # shuffle before splitting so all categories are represented in every split

primary_split   = shuffled_train.train_test_split(test_size=0.15, seed=RANDOM_SEED)          # 85% train, 15% temp pool for val+test
secondary_split = primary_split["test"].train_test_split(test_size=0.33, seed=RANDOM_SEED)   # split the 15% pool into ~10% val and ~5% test


def cap_split(split, limit):
    if limit is None:      # no limit means use the full split
        return split
    return split.select(range(min(limit, len(split))))  # cap at the limit without going out of bounds


data_splits = DatasetDict({
    "train":      cap_split(primary_split["train"],   MAX_TRAIN_SAMPLES),  # up to 8000 training examples
    "validation": cap_split(secondary_split["train"], MAX_EVAL_SAMPLES),   # up to 800 validation examples
    "test":       cap_split(secondary_split["test"],  MAX_TEST_SAMPLES),   # up to 300 test examples
})

print(data_splits)                           # verify the split sizes and column names before proceeding
print(data_splits["train"].column_names)     # confirm instruction, response, category, intent, flags columns are present
print(data_splits["train"][0])               # inspect one row to verify the data looks correct


Shuffling before splitting makes it likely that every category appears in the training, validation, and test sets in roughly the same proportions, rather than having a split dominated by whichever categories happen to be at the top of the original file. The split is random rather than stratified, so the proportions are approximate.
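A quick way to confirm that a split is reasonably balanced is to compare category shares directly. The sketch below works on any iterable of dict-like rows. The helper names (`category_shares`, `max_share_gap`) and the toy rows are assumptions for illustration.

```python
from collections import Counter


def category_shares(rows, key="category"):
    """Return {category: fraction of rows} for an iterable of dict-like rows."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def max_share_gap(reference_rows, split_rows, key="category"):
    """Largest absolute difference in category share between two sets of rows."""
    ref = category_shares(reference_rows, key)
    got = category_shares(split_rows, key)
    return max(abs(ref.get(name, 0.0) - got.get(name, 0.0)) for name in set(ref) | set(got))


# Toy rows standing in for dataset records.
full = [{"category": "ORDER"}] * 6 + [{"category": "REFUND"}] * 4
split = [{"category": "ORDER"}] * 3 + [{"category": "REFUND"}] * 1

print(category_shares(split))
print(max_share_gap(full, split))
```

With the real data, the same functions should accept `data_splits["train"]` and the other splits, since iterating a split yields one dict per row. A large gap for any category would suggest using a stratified split instead.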




Formatting the Dataset for TRL Fine-Tuning


TRL’s SFTTrainer accepts conversational prompt-completion data, where each example has a prompt list of chat messages and a completion list. The system prompt is the key lever for shaping how the model responds: it instructs the model to act as a professional support agent.



SUPPORT_AGENT_PROMPT = """/no_think
You are a professional and empathetic customer support assistant.
Do not include hidden reasoning, thinking traces, <think> tags, or </think> tags in the final answer.
Respond clearly, politely, and concisely to every customer query.
Acknowledge the customer's concern, provide a direct solution or next steps, and maintain a friendly tone throughout.
Do not make promises you cannot keep. If the issue cannot be resolved immediately, guide the customer to the right resource or escalation path.
Keep your answer focused, specific, and directly relevant to the customer's question without being overly brief."""

TEMPLATE_OPTIONS = {"enable_thinking": False}  # disable the model's internal reasoning trace — we want clean, direct responses
QUERY_TEMPLATE   = "Customer Query:\n\n{query}"  # template that wraps the customer's message for the chat format


def normalize_text(value):
    return " ".join(str(value).strip().split())  # strip leading/trailing whitespace and collapse internal newlines and extra spaces


def format_chat_example(example):
    query  = normalize_text(example["instruction"])  # the customer's raw query — mapped to the user role
    answer = normalize_text(example["response"])     # the reference agent reply — mapped to the assistant role

    return {
        "prompt": [
            {"role": "system",  "content": SUPPORT_AGENT_PROMPT},                    # tells the model how to behave as a support agent
            {"role": "user",    "content": QUERY_TEMPLATE.format(query=query)},       # the customer's question
        ],
        "completion": [
            {"role": "assistant", "content": answer},  # the target response the model should learn to generate
        ],
        "chat_template_kwargs": TEMPLATE_OPTIONS,  # passed to apply_chat_template to suppress thinking tags
    }


formatted_ds = data_splits.map(
    format_chat_example,
    remove_columns=data_splits["train"].column_names,  # drop original columns — SFTTrainer only needs prompt and completion
)

print(formatted_ds["train"][0])  # verify the formatted example looks correct before training


The /no_think prefix in the system prompt is a Nemotron-specific directive that tells the model not to produce internal chain-of-thought reasoning before the answer. Without it, the model may wrap responses in <think>...</think> blocks, which is unhelpful for a customer-facing support tool.




Loading the Nemotron-3 Base Model


We load the tokenizer and model weights from the Hugging Face Hub. Nemotron-3 ships custom model code, so trust_remote_code=True is required for both. The weights are loaded in BF16 to match the model’s native precision and keep the full 4B parameter model within 24GB of VRAM. We also configure the generation defaults here: the KV cache is disabled (it conflicts with gradient checkpointing later), greedy decoding is selected for deterministic outputs, and a mild repetition penalty is set to keep responses natural.



from transformers import AutoModelForCausalLM, AutoTokenizer  # AutoTokenizer loads the tokenizer; AutoModelForCausalLM loads the weights

BASE_MODEL_ID    = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"                # the 4B BF16 base model on Hugging Face
ADAPTER_SAVE_DIR = "./nemotron-3-nano-4b-bf16-customer-support-lora"       # local folder where the LoRA adapter and tokenizer will be saved
MAX_TOKEN_LENGTH = 1024                                                     # maximum token length per training example; keeps memory usage stable on 24GB GPUs

tokenizer = AutoTokenizer.from_pretrained(
    BASE_MODEL_ID,
    token=hf_token,          # HuggingFace token to access the gated model repository
    trust_remote_code=True,  # required because Nemotron uses a custom tokenizer class
    use_fast=True,           # use the Rust-based fast tokenizer for faster encoding
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # if no pad token is defined, reuse the end-of-sequence token as padding

tokenizer.padding_side = "right"  # pad on the right side so the model attends to tokens from left to right consistently

nemotron_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    token=hf_token,
    trust_remote_code=True,         # Nemotron's Mamba hybrid architecture requires custom model code from the repo
    dtype=torch.bfloat16,           # load weights in BF16 to save VRAM while keeping sufficient numerical precision
    device_map="auto",              # automatically distribute layers across available GPUs
    attn_implementation="eager",    # use standard eager attention — flash attention is not compatible with Mamba hybrid layers
)

nemotron_model.config.use_cache                       = False  # disable KV cache during training — it conflicts with gradient checkpointing
nemotron_model.config.pad_token_id                    = tokenizer.pad_token_id  # keep model config and tokenizer in sync
nemotron_model.config.eos_token_id                    = tokenizer.eos_token_id  # same for end-of-sequence token
nemotron_model.generation_config.pad_token_id         = tokenizer.pad_token_id  # sync generation config too
nemotron_model.generation_config.eos_token_id         = tokenizer.eos_token_id
nemotron_model.generation_config.use_cache            = False    # also disable cache in the generation config
nemotron_model.generation_config.do_sample            = False    # use greedy decoding for deterministic, consistent outputs
nemotron_model.generation_config.top_p                = None     # not needed with greedy decoding
nemotron_model.generation_config.min_new_tokens       = None     # let the model stop naturally at the eos token
nemotron_model.generation_config.repetition_penalty   = 1.08    # mildly penalise repeated phrases to keep responses natural
nemotron_model.generation_config.no_repeat_ngram_size = 4       # prevent any 4-gram from appearing more than once in the output


attn_implementation="eager" is required for Nemotron-3. The model uses a hybrid Mamba-Transformer architecture where some layers are standard attention and others are Mamba state-space layers. Flash Attention is not compatible with this mixed architecture, so using it causes errors. Eager attention is slower but works correctly.
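The choice of BF16 is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below estimates memory for the weights alone, assuming a nominal 4 billion parameters. The exact count for this checkpoint will differ, and activations, gradients, and optimizer state come on top.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def weight_memory_gib(num_params, dtype="bf16"):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


params = 4_000_000_000  # nominal 4B parameters; the exact count will differ
for dtype in ("fp32", "bf16"):
    print(f"{dtype}: {weight_memory_gib(params, dtype):.1f} GiB")
```

Halving the bytes per parameter is what leaves headroom on a 24GB card for activations and the LoRA training state.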




Generation Helper Functions


Before training, we need a way to run inference so we can record a baseline and later compare it against the fine-tuned model. We define four helpers: flush_gpu_memory clears the CUDA cache and can be called between runs, make_chat_messages builds the prompt in the chat format the model expects, strip_reasoning_tags removes any <think>...</think> blocks the base model might produce, and run_inference uses the last two to turn a query into a clean text response.


A fifth helper, build_response_table, iterates over a list of examples, calls run_inference on each, and collects the results into a DataFrame for easy display and comparison.



import gc                        # used to force garbage collection before clearing the CUDA memory cache
import pandas as pd              # builds the before/after comparison table
from tqdm.auto import tqdm       # shows a progress bar while generating responses for multiple examples


def flush_gpu_memory():
    gc.collect()                        # free unreferenced Python objects before clearing GPU memory
    if torch.cuda.is_available():
        torch.cuda.empty_cache()        # release all unused cached GPU memory blocks back to the OS


def make_chat_messages(query, system_prompt=SUPPORT_AGENT_PROMPT):
    return [
        {"role": "system", "content": system_prompt},                                       # system instruction for the model
        {"role": "user",   "content": QUERY_TEMPLATE.format(query=normalize_text(query))},  # formatted customer query
    ]


def strip_reasoning_tags(text):
    text = text.strip()  # remove leading and trailing whitespace from the raw generated output
    while True:  # loop in case multiple thinking blocks were generated
        start = text.find("<think>")                                    # find the opening tag position
        close = text.find("</think>", start) if start != -1 else -1     # find the matching closing tag after it
        if close == -1:                                                 # no complete block left: stop instead of looping forever
            break
        text = (text[:start] + text[close + len("</think>"):]).strip()  # cut out the thinking block entirely

    if "</think>" in text:                        # handle a dangling closing tag with no matching opening tag
        text = text.split("</think>")[-1].strip()    # take everything after the last closing tag

    return text.replace("<think>", "").replace("</think>", "").strip()  # remove any remaining stray tags


def run_inference(model, tokenizer, query, system_prompt=SUPPORT_AGENT_PROMPT, max_new_tokens=180):
    messages = make_chat_messages(query, system_prompt)  # build the full chat prompt from the customer query
    device   = next(model.parameters()).device            # find which device (GPU/CPU) the model is on

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,                   # convert the messages to token IDs immediately
        **TEMPLATE_OPTIONS,              # suppress thinking mode
        add_generation_prompt=True,      # append the assistant turn start token so the model continues from there
        return_dict=True,                # return a dict with input_ids and attention_mask
        return_tensors="pt",             # return PyTorch tensors
    )

    inputs    = {key: value.to(device) for key, value in inputs.items()}  # move inputs to the same device as the model
    input_len = inputs["input_ids"].shape[-1]                              # record prompt length to slice it off the output later

    with torch.no_grad():  # disable gradient tracking during inference — saves memory and speeds up generation
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,      # limit the reply length to roughly 135 words
            do_sample=False,                    # greedy decoding for reproducible outputs
            use_cache=False,                    # keep cache disabled to match training configuration
            repetition_penalty=1.08,            # same penalty used in the model's generation config
            no_repeat_ngram_size=4,             # same ngram repetition guard
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    decoded = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True).strip()  # decode only the new tokens, not the prompt
    return strip_reasoning_tags(decoded)  # strip any thinking traces before returning the clean response


def build_response_table(model, tokenizer, examples, output_column):
    rows = []      # accumulates one dict per example
    model.eval()   # set the model to evaluation mode — disables dropout

    for ex in tqdm(examples, desc=f"Generating {output_column}", leave=False):  # iterate with a progress bar
        rows.append({
            "instruction":        normalize_text(ex["instruction"]),  # the original customer query
            "reference_response": normalize_text(ex["response"]),     # the reference answer from the dataset
            output_column:        run_inference(model, tokenizer, ex["instruction"]),  # model's generated answer
        })

    return pd.DataFrame(rows)  # return as a DataFrame for easy display and merging
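The tag clean-up can also be written more compactly with a regular expression. This is an alternative sketch, not the version used in the script above, and `strip_reasoning_regex` is a hypothetical name.

```python
import re

# Non-greedy match so each <think>...</think> pair is removed separately.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning_regex(text):
    """Remove complete <think>...</think> blocks, then any leftover tags."""
    text = THINK_BLOCK.sub("", text)
    if "</think>" in text:  # dangling close tag: keep only what follows it
        text = text.split("</think>")[-1]
    return text.replace("<think>", "").replace("</think>", "").strip()


print(strip_reasoning_regex("<think>plan the reply</think>Hello, happy to help."))
print(strip_reasoning_regex("leftover reasoning</think> Your refund is on its way."))
```

The regex version cannot loop indefinitely on malformed input such as a closing tag that appears before an opening tag, which makes it a reasonable drop-in if you prefer it.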




Pre-Fine-Tuning Baseline Evaluation


With the helpers in place, we run the base model on three test examples before any fine-tuning. This gives us a concrete snapshot of how the unmodified model responds to customer queries. The results are stored in baseline_table so we can merge them with the fine-tuned model’s answers in the Comparing Responses Before and After Fine-Tuning section and print a direct side-by-side comparison.



test_samples = [data_splits["test"][idx] for idx in range(min(3, len(data_splits["test"])))]  # pick the first 3 test examples as the baseline sample

baseline_table = build_response_table(
    nemotron_model,
    tokenizer,
    test_samples,
    "base_model_answer",  # column name for the base model's responses in the comparison table
)

print(baseline_table)  # display the baseline table: customer query, reference response, and base model answer




Configuring LoRA and Training Settings


Before training, we prepare the model for LoRA fine-tuning. First, we enable gradient checkpointing on nemotron_model to reduce VRAM usage during the backward pass. Instead of storing all intermediate activations in memory, PyTorch recomputes them on the fly. We then define adapter_config, which controls how LoRA modifies the model: the rank (r=32) sets the size of the low-rank matrices, lora_alpha scales their contribution, and target_modules="all-linear" tells PEFT to insert LoRA layers into every linear projection across both the attention and feed-forward blocks.





from peft import LoraConfig  # LoraConfig defines which layers to adapt and how large the low-rank matrices are

nemotron_model.gradient_checkpointing_enable()  # recompute activations during the backward pass instead of storing them — reduces VRAM usage significantly
nemotron_model.config.use_cache = False          # gradient checkpointing and KV cache are incompatible — disable cache again after enabling checkpointing

adapter_config = LoraConfig(
    r=32,                        # rank of the low-rank matrices — higher rank = more capacity but more parameters to train
    lora_alpha=64,               # scaling factor for the LoRA updates; alpha/r = 2 is a common stable ratio
    lora_dropout=0.1,            # dropout applied to LoRA layers during training to reduce overfitting
    bias="none",                 # do not train bias terms — keeps the adapter small and the base model unchanged
    task_type="CAUSAL_LM",       # tells PEFT this is a causal language modelling task
    target_modules="all-linear", # apply LoRA to every linear projection in the model — covers attention and MLP layers
)

from trl import SFTConfig, SFTTrainer  # SFTConfig holds all hyperparameters; SFTTrainer wraps the training loop

train_config = SFTConfig(
    output_dir=ADAPTER_SAVE_DIR,             # where checkpoints and the final adapter are saved
    per_device_train_batch_size=8,           # examples per GPU per forward pass
    per_device_eval_batch_size=8,            # same batch size for validation
    gradient_accumulation_steps=8,           # accumulate 8 batches before updating — effective batch size is 64
    learning_rate=5e-5,                      # starting learning rate for AdamW
    weight_decay=0.01,                       # L2 regularisation to prevent overfitting
    lr_scheduler_type="linear",              # decay the learning rate linearly from 5e-5 to 0 over training
    warmup_ratio=0.05,                       # warm up the learning rate for the first 5% of steps
    num_train_epochs=2,                      # two passes over the training data
    logging_steps=50,                        # log training loss every 50 steps
    eval_strategy="steps",                   # run validation every eval_steps steps
    eval_steps=50,                           # evaluate every 50 training steps
    save_strategy="steps",                   # save a checkpoint every save_steps steps
    save_steps=100,                          # save every 100 steps
    save_total_limit=2,                      # keep only the 2 most recent checkpoints to save disk space
    load_best_model_at_end=True,             # restore the checkpoint with the lowest validation loss after training
    metric_for_best_model="eval_loss",       # use validation loss to decide which checkpoint is best
    greater_is_better=False,                 # lower eval_loss is better
    gradient_checkpointing=True,             # trade compute for memory — recompute activations on the backward pass
    bf16=True,                               # train in BF16 — matches the model's weight dtype
    fp16=False,                              # do not use FP16 — BF16 is more numerically stable for this model
    tf32=True,                               # allow TF32 in matrix operations for extra speed on Ampere GPUs
    max_length=MAX_TOKEN_LENGTH,             # truncate sequences longer than 1024 tokens
    packing=False,                           # do not pack multiple short examples into one sequence
    completion_only_loss=True,               # compute loss only on the assistant's reply, not on the prompt
    remove_unused_columns=False,             # keep chat_template_kwargs column — SFTTrainer needs it for the template
    dataloader_num_workers=4,                # use 4 background workers to prefetch batches and keep the GPU busy
    optim="adamw_torch_fused",               # fused AdamW kernel — faster than the standard implementation
    report_to="none",                        # disable external logging for this run
    seed=RANDOM_SEED,                        # fix the random seed for reproducible training
)
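The batch settings above imply a fairly short run, which is worth working out before launching it. The sketch below derives the step counts, assuming the 8,000-example cap is reached and a single GPU is used. `training_schedule` is a hypothetical helper, and the trainer's own rounding may differ by a step.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, eval_steps, num_devices=1):
    """Derive optimizer-step counts from the batch settings."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "evaluations": total_steps // eval_steps,
    }


plan = training_schedule(
    num_examples=8000, per_device_batch=8, grad_accum=8,
    epochs=2, warmup_ratio=0.05, eval_steps=50,
)
print(plan)
```

With only a few hundred optimizer steps in total, evaluating every 50 steps gives a handful of validation points, which is enough to spot overfitting without slowing the run much.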


completion_only_loss=True is one of the most important settings. It tells SFTTrainer to compute the cross-entropy loss only on the assistant’s reply tokens, not on the system prompt or the user’s message. Training on the prompt tokens would push the model to memorise SUPPORT_AGENT_PROMPT rather than learning to generate better responses, which wastes training capacity and can hurt performance.
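Conceptually, completion-only loss comes down to masking the prompt positions in the label sequence. The sketch below shows the idea with made-up token IDs. It is not TRL's internal implementation, and `mask_prompt_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_labels(prompt_ids, completion_ids):
    """Build input_ids and labels so only completion tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


# Made-up token IDs: four prompt tokens followed by three reply tokens.
input_ids, labels = mask_prompt_labels([11, 12, 13, 14], [21, 22, 23])
print(input_ids)  # [11, 12, 13, 14, 21, 22, 23]
print(labels)     # [-100, -100, -100, -100, 21, 22, 23]
```

The model still sees the full prompt as context; the masked positions simply produce no gradient.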




Training and Saving the LoRA Adapter


We pass the model, configuration, and datasets to SFTTrainer and start the fine-tuning loop. Before calling train(), we print a trainable parameter count to confirm LoRA was attached correctly. If lora_param_count is zero, no adapter layers were inserted and there is nothing to train, so the script raises an error instead of continuing. After training completes, we switch the model to evaluation mode and save the adapter weights and tokenizer locally; the push to Hugging Face Hub follows as a separate step. Only the adapter is uploaded, not the full 4B base model, so the upload is a few hundred MB rather than several gigabytes.



sft_trainer = SFTTrainer(
    model=nemotron_model,                        # the base Nemotron-3 model with weights frozen except for LoRA layers
    args=train_config,                           # all hyperparameters defined above
    train_dataset=formatted_ds["train"],         # 8000 formatted customer support examples
    eval_dataset=formatted_ds["validation"],     # 800 examples for validation loss tracking
    peft_config=adapter_config,                  # attaches the LoRA adapter to all linear layers
    processing_class=tokenizer,                  # tokenizer used to encode the prompt-completion pairs
)

lora_param_count  = sum(param.numel() for param in sft_trainer.model.parameters() if param.requires_grad)  # count only the LoRA parameters being trained
total_param_count = sum(param.numel() for param in sft_trainer.model.parameters())                          # count all parameters including frozen base model

if lora_param_count == 0:
    raise RuntimeError("No trainable LoRA parameters were attached. Check target_modules before training.")

print(f"Trainable LoRA parameters: {lora_param_count:,}")    # should be a small fraction of total_param_count
print(f"All parameters:            {total_param_count:,}")
print(f"Trainable percentage:      {100 * lora_param_count / total_param_count:.4f}%")

training_stats = sft_trainer.train()  # start fine-tuning — logs loss every 50 steps, evaluates every 50 steps

sft_trainer.model.eval()                                   # switch to evaluation mode before inference
sft_trainer.model.config.use_cache             = False     # keep cache disabled to match training behaviour
sft_trainer.model.generation_config.use_cache  = False

print(training_stats)  # print final training metrics: total steps, runtime, samples per second

sft_trainer.model.save_pretrained(ADAPTER_SAVE_DIR)   # save the LoRA adapter weights locally
tokenizer.save_pretrained(ADAPTER_SAVE_DIR)           # save the tokenizer alongside the adapter so they can be loaded together


With r=32 and target_modules="all-linear", lora_param_count is typically 1–3% of total_param_count. This is what makes LoRA efficient: only a small fraction of the model is updated while the base weights stay frozen, so gradients and optimizer state are needed only for the adapter parameters rather than for all 4B weights.
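The small trainable fraction follows directly from the shapes of the LoRA matrices. As a rough sketch, assuming a hypothetical 4096 x 4096 projection (the real layer sizes in Nemotron-3 Nano will differ), the extra parameters per layer can be computed like this:

```python
def lora_params(d_in, d_out, r):
    """Extra trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    return r * (d_in + d_out)  # A is r x d_in, B is d_out x r


def dense_params(d_in, d_out):
    """Parameters in the frozen weight matrix of the same layer (bias ignored)."""
    return d_in * d_out


d_in, d_out, r = 4096, 4096, 32  # hypothetical square projection
added = lora_params(d_in, d_out, r)
frozen = dense_params(d_in, d_out)
print(f"LoRA adds {added:,} parameters to a layer with {frozen:,} frozen weights "
      f"({100 * added / frozen:.2f}%)")
```

Doubling the rank doubles the adapter size, which is the main cost to weigh when moving from r=32 to a higher rank.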


Push the adapter to Hugging Face:



HUB_REPO_PATH = "your-username/nemotron-3-nano-4b-bf16-customer-support-lora"  # replace with your Hugging Face username

sft_trainer.model.push_to_hub(HUB_REPO_PATH, private=False)  # upload the LoRA adapter to Hugging Face Hub
tokenizer.push_to_hub(HUB_REPO_PATH, private=False)           # upload the tokenizer as well

Only the LoRA adapter weights are pushed, not the full 4B base model. Anyone who wants to use the fine-tuned model loads the base model themselves and applies the adapter on top; merging the adapter into the base weights is optional.
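For someone consuming the adapter, loading could look roughly like the sketch below. It reuses the loading arguments from earlier in this tutorial and PEFT's `PeftModel.from_pretrained`. The function name `load_finetuned_model` is hypothetical, and it assumes the tokenizer was saved alongside the adapter as shown above.

```python
def load_finetuned_model(base_model_id, adapter_path, hf_token=None):
    """Load the base model and apply a saved LoRA adapter on top of it."""
    # Imports live inside the function so this file can be imported
    # on a machine that does not have the GPU stack installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # adapter_path can be a local folder or a Hub repo containing the adapter.
    tokenizer = AutoTokenizer.from_pretrained(
        adapter_path, token=hf_token, trust_remote_code=True
    )
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        token=hf_token,
        trust_remote_code=True,
        dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="eager",
    )
    model = PeftModel.from_pretrained(base_model, adapter_path)  # attach the adapter weights
    model.eval()
    return model, tokenizer


# Example call (downloads the base model, so it needs a GPU machine and a valid token):
# model, tokenizer = load_finetuned_model(
#     "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
#     "./nemotron-3-nano-4b-bf16-customer-support-lora",
# )
```

Calling `merge_and_unload()` on the returned model afterwards would fold the adapter into the base weights, which is only needed if you want a standalone model without the PEFT wrapper.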




Comparing Responses Before and After Fine-Tuning


We run the fine-tuned model on the same three test examples used for the baseline. The results go into finetuned_table, which we then merge with baseline_table on the instruction column. The final loop prints each example as a four-part block: the customer query, the reference response from the dataset, the base model’s answer before training, and the fine-tuned model’s answer after training. This makes it easy to see exactly what changed in tone, structure, and specificity.



finetuned_table = build_response_table(
    sft_trainer.model,
    tokenizer,
    test_samples,
    "fine_tuned_answer",  # column name for the fine-tuned model's responses
)

results_table = baseline_table[
    ["instruction", "reference_response", "base_model_answer"]  # keep these columns from the pre-training table
].merge(
    finetuned_table[["instruction", "fine_tuned_answer"]],  # add the fine-tuned answers by joining on the query
    on="instruction",
    how="left",
)

for idx, row in results_table.iterrows():  # print a formatted block for each example
    print("=" * 100)
    print(f"Sample {idx + 1}")
    print("=" * 100)
    print("\nCUSTOMER QUERY:\n")
    print(row["instruction"])            # the customer's original question
    print("\nREFERENCE RESPONSE:\n")
    print(row["reference_response"])     # the target answer from the dataset
    print("\nBASE MODEL ANSWER:\n")
    print(row["base_model_answer"])      # what the base model said before fine-tuning
    print("\nFINE-TUNED ANSWER:\n")
    print(row["fine_tuned_answer"])      # what the fine-tuned model says after LoRA training
    print("\n")


After fine-tuning, the model’s responses align more closely with the reference style: shorter, more direct, and consistently professional. The base model often gives verbose or overly generic answers; the fine-tuned model mirrors the dataset’s pattern of acknowledging the concern and immediately offering a resolution or next step.
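Reading three samples side by side is subjective, so a crude number can help as a second opinion. The sketch below computes a token-level F1 overlap between an answer and its reference. This metric is an assumption added here for illustration, not part of the original pipeline, and it measures word overlap rather than answer quality.

```python
from collections import Counter


def token_f1(candidate, reference):
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "I am sorry for the delay and will check your order now"
print(token_f1("I will check your order now", reference))
print(token_f1("Please restart your router", reference))
```

Applied to each row of results_table, once for base_model_answer and once for fine_tuned_answer against reference_response, a consistently higher score after fine-tuning would support the qualitative impression. Three samples are too few to draw firm conclusions.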




Final Thoughts


Nemotron-3 Nano requires more environment setup than a typical Hugging Face fine-tuning run. The Mamba dependency chain (pinned PyTorch version, specific CUDA build, --no-build-isolation compile) can easily break in an environment that already has other versions installed. A clean virtual environment or a fresh RunPod/Colab instance is the safest starting point.


Once the environment is working, the LoRA fine-tuning itself is straightforward. The model adapts quickly to the customer support response style in two epochs. If you need the model to match a more specific brand voice or handle a narrower set of intents, three to five epochs with a higher rank (r=64) is a reasonable next step.


The adapter file is small (a few hundred MB compared to the 8GB+ base model), which makes it easy to version, share, and swap. You can train multiple domain-specific adapters on the same base model and switch between them at inference time without reloading the base weights.




Who Can Benefit


  • Students learning NLP or applied machine learning can use this project to go beyond API calls and understand what actually happens when a language model is adapted to a new domain. Fine-tuning a real model on a real dataset is far more instructive than reading about it.

  • Startups can fine-tune a private model on their own support ticket history and deploy a support bot that matches their tone without sending customer data to a third-party API.

  • E-commerce teams can adapt the model to handle their specific product categories, return policies, and escalation paths, going beyond generic support responses.

  • Enterprises can run the fine-tuned model on-premise or on a private cloud instance, keeping customer conversations entirely within their infrastructure.

  • AI engineers can use this as a template for any instruction-response dataset. Swap the dataset repo, update the system prompt, and the rest of the pipeline works without changes.




How Codersarts Can Help


If you want to take this further, Codersarts offers hands-on support at every stage.


  • For learners: Live 1-to-1 sessions with an AI engineer who can walk through each step with you, explain LoRA and the Mamba architecture, and help you adapt the pipeline for your own dataset.

  • For teams: End-to-end development of custom fine-tuned models including dataset preparation, hyperparameter tuning, evaluation, and deployment.

  • For enterprises: Architecture consulting and implementation for private LLM deployments, domain-specific fine-tuning pipelines, and integration with existing support platforms.


Reach out at contact@codersarts.com or visit www.codersarts.com to get started.



