Introduction
In the evolving landscape of data science solutions for healthcare, fine-tuning Large Language Models (LLMs) has emerged as a critical technique in natural language processing (NLP), enabling developers to adapt pre-trained models for specialized tasks and domain-specific applications.
This process adjusts the parameters of a pre-trained LLM on task-specific data so that the model performs more accurately and effectively in the target application.
When developing a medical chatbot, fine-tuning ensures the model can accurately interpret medical terminology, respond with domain-specific expertise, and adhere to healthcare industry standards. Leveraging DeepSeek, an open-source alternative to industry-standard models, for fine-tuning offers several advantages:
- Chain of Thought (CoT) Reasoning: DeepSeek can break down complex medical inquiries step by step, improving diagnostic accuracy and reasoning transparency.
- Improved Medical Comprehension: Fine-tuning with curated healthcare data helps the model understand and respond to complex medical concepts with greater precision.
- Enhanced Patient Interaction: A well-tuned DeepSeek model provides structured, empathetic, and natural conversations, improving patient trust and engagement.
- Better Explainability and Transparency: CoT reasoning enables the chatbot to justify its recommendations, making it a valuable tool in medical applications where interpretability is critical.
- Domain-Specific Adaptability: Fine-tuning ensures the chatbot is optimized for medical diagnosis, symptom assessment, treatment guidance, and literature summarization, making it far more specialized than general-purpose models.
In this article, we will walk through parameter-efficient fine-tuning of DeepSeek, demonstrating how it can be adapted to medical chatbot applications. Among the distilled DeepSeek models [1], DeepSeek-R1-Distill-Qwen-14B offers a good balance between model size and performance. Given limited GPU resources, we can adopt this model for Parameter-Efficient Fine-Tuning (PEFT) to optimize its capabilities for medical reasoning while maintaining computational efficiency.

Understanding the Fine-Tuning Process
The journey begins with pretraining for completion. During this phase, the model learns to predict what comes next in a sequence of text by processing massive amounts of data from the internet—trillions of tokens, equivalent to millions of books. The model develops a statistical understanding of language patterns, learning which words will likely follow others in different contexts. For example, if you write “Common symptoms of diabetes include,” the model learns that words like “increased thirst” or “frequent urination” are more likely to follow than “headache” or “sore throat.”
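To make next-token prediction concrete, here is a minimal sketch using the Hugging Face transformers API with a small public model (gpt2), chosen purely for illustration and not part of the fine-tuning pipeline:
# Minimal illustration of next-token prediction (gpt2 is used only as a small example model)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = "Common symptoms of diabetes include"
input_ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    next_token_logits = lm(input_ids).logits[0, -1]  # scores for the very next token
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)
print([tok.decode([idx]) for idx in top.indices.tolist()])  # the most likely continuations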
This pretraining phase is incredibly resource-intensive, consuming about 98% of the overall compute and data resources. The resulting pretrained model has absorbed enormous knowledge but doesn’t yet know how to be helpful or follow instructions.
During the supervised fine-tuning (SFT) phase, the model is shown examples of how to respond appropriately to different types of prompts. These examples follow a format of (prompt, response) pairs created by highly educated human labelers or other powerful LLMs.
The final phase is preference alignment, where we further fine-tune the model using comparison pairs (prompt, accepted, rejected) that guide the model toward the preferred style of response for a given prompt. This increases the probability that, for prompt X, the model responds in way Y rather than way Z.
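As a simple illustration (hypothetical toy examples, not drawn from any dataset used later in this article), the two data formats look like this:
# Toy SFT example: a single (prompt, response) pair
sft_example = {
    "prompt": "What are common symptoms of diabetes?",
    "response": "Increased thirst, frequent urination, fatigue, and blurred vision.",
}
# Toy preference example: the same prompt with an accepted and a rejected response
preference_example = {
    "prompt": "What are common symptoms of diabetes?",
    "accepted": "Increased thirst, frequent urination, fatigue, and blurred vision.",
    "rejected": "Diabetes never causes noticeable symptoms.",
}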

In this article, we will take the DeepSeek-R1-Distill-Qwen-14B model and apply the second phase, supervised fine-tuning, to make it more capable at medical tasks.
To speed up fine-tuning, we will use the unsloth library [2]. The following are the steps to get started:
- Install the dependencies.
- Load the DeepSeek-R1-Distill-Qwen-14B model.
- Prepare the medical dataset for supervised fine-tuning.
- Configure PEFT for efficient training.
- Run the fine-tuning process and evaluate performance.
For the full notebook visit: github.com/ShahedSabab/mediseek
Step 1: Install the Dependencies
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from dotenv import load_dotenv
from datasets import load_dataset
import os
Step 2: Load the DeepSeek-R1-Distill-Qwen-14B model
For this step, we will use the unsloth/DeepSeek-R1-Distill-Qwen-14B model from Hugging Face [3]. We will need a Hugging Face API key (hf_api_key) to download the model to the local machine. To obtain the API key, follow the steps in [4]. Once you have the key, put it in a .env file as follows:
HUGGINGFACE_TOKEN=******************************
In the notebook, call load_dotenv(). This loads the Hugging Face API key from the .env file. Next, initialize the model parameters.
max_seq_length = 2048
dtype = None
MODEL = "unsloth/DeepSeek-R1-Distill-Qwen-14B"
# Load environment variables from a .env file
load_dotenv()
# Access environment variables
hf_api_key = os.getenv("HUGGINGFACE_TOKEN")
Now, we will load the DeepSeek-R1-Distill-Qwen-14B base model using Unsloth’s optimized loader. Here, we load the model with 4-bit quantization to reduce memory usage.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = True,
    token = hf_api_key,
)
Step 3: Prepare the medical dataset for supervised fine-tuning
We will use the FreedomIntelligence/medical-o1-reasoning-SFT dataset [5] for the fine-tuning step. This dataset is designed for advanced medical reasoning and was constructed using GPT-4o, which searches for solutions to verifiable medical problems and validates them through a medical verifier. The dataset helps initialize models with the reasoning chain from DeepSeek-R1, making it highly suitable for fine-tuning medical chatbots.
To design a prompt style for the model, we will add a system prompt with placeholders for the question and the response generation. This prompt will direct the model to reason systematically and deliver a logical, accurate answer.
train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.
### Instruction:
You are a medical expert with specialized expertise in clinical reasoning, diagnostics, and treatment planning. Please answer the following medical question.
### Question:
{}
### Response:
{}
{}"""
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train", trust_remote_code=True)  # load the train split so the trainer receives a Dataset rather than a DatasetDict
dataset = dataset.map(formatting_prompts_func, batched=True)
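To sanity-check the formatting, you can print the beginning of one mapped example (this assumes the train split loaded above):
# Optional sanity check: inspect the first formatted training example
print(dataset["text"][0][:500])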
Step 4: Configure PEFT for efficient training
Before configuring PEFT (Parameter-Efficient Fine-Tuning), a few key concepts are worth covering. There are three main approaches to fine-tuning LLMs: full fine-tuning, LoRA, and QLoRA. The last two fall under PEFT. Each has its strengths and trade-offs, making it suitable for different scenarios.
- Full Fine-Tuning: This method retrains every parameter in the base model, often delivering the best results. However, it is highly resource-intensive, requiring hundreds of GBs of VRAM, making it impractical for most setups. Additionally, it can cause catastrophic forgetting, where the model loses prior knowledge.
- LoRA (Low-Rank Adaptation): Instead of modifying the original weight matrix (W), LoRA introduces two smaller matrices (A and B) that form a low-rank update: W’ = W + BA (a quick parameter-count sketch follows this list). This approach:
- Reduces memory usage while preserving the pre-trained knowledge.
- Requires tuning only a fraction of the model’s parameters.
- Works efficiently on a single GPU with 14-18GB VRAM.
- Uses key hyperparameters: rank (r), alpha (α), and dropout.
- QLoRA: This method combines LoRA with quantization to optimize memory further. By reducing precision to a custom 4-bit format (NF4) and employing double quantization, it minimizes memory overhead, making it ideal for low-resource setups. However, training is roughly 30% slower compared to LoRA.
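To make the memory argument concrete, here is a rough parameter-count sketch for a single weight matrix; the 4096x4096 size is an illustrative assumption, not a property of any specific layer in DeepSeek-R1-Distill-Qwen-14B:
# Back-of-the-envelope LoRA parameter count for one hypothetical 4096x4096 weight matrix
d, r = 4096, 16
full_params = d * d            # parameters touched by full fine-tuning: 16,777,216
lora_params = d * r + r * d    # parameters in B (d x r) and A (r x d): 131,072
print(f"LoRA trains {lora_params / full_params:.2%} of this matrix's parameters")  # ~0.78%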
In the following, r represents the rank for LoRA. A smaller value of r means fewer parameters will be modified. Additionally, lora_alpha controls the strength of the applied LoRA, while lora_dropout specifies the dropout rate for LoRA. For further details on these techniques, refer to [6].
FastLanguageModel.for_training(model)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=7,
    use_rslora=False,
    loftq_config=None,
)
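To confirm how small the trainable fraction actually is, you can print a parameter summary; this assumes the object returned by get_peft_model exposes the standard PEFT helper:
# Optional: report trainable vs. total parameters of the LoRA-wrapped model
model.print_trainable_parameters()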
Next, we’ll configure the training arguments and set up the trainer by specifying the model, tokenizer, dataset, and other key training parameters to optimize the fine-tuning process. Due to time constraints, we cap the run at max_steps=60, so training stops after 60 optimizer steps rather than completing a full epoch.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        warmup_ratio=0.1,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
To begin training, execute the following:
trainer_stats = trainer.train()
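The object returned by trainer.train() includes a small metrics summary (runtime, samples per second, final training loss) that you can inspect:
# Inspect the training summary returned by trainer.train()
print(trainer_stats.metrics)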
Once the training is complete, save the adapter, full model, and tokenizer locally to ensure they can be reused in future projects.
new_model_local = "sabab05/DeepSeek-R1-Medical-COT-Qwen-14B"
model.save_pretrained(new_model_local) # Local saving
tokenizer.save_pretrained(new_model_local)
model.save_pretrained_merged(new_model_local, tokenizer, save_method = "merged_16bit",)
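To reuse the saved model later, you can reload the merged checkpoint from the local directory with the same Unsloth loader; a minimal sketch, assuming the local path saved above:
# Reload the merged model from the local directory (4-bit quantization on load to save memory)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = new_model_local,
    max_seq_length = max_seq_length,
    load_in_4bit = True,
)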
Testing:
To test the model’s responses, we use a hypothetical case study based on Hodgkin lymphoma [7]. The case study centers around Emily, a 42-year-old woman who presents with a persistent, enlarging lump in her neck that has been present for the past two months. Initially painless, the swelling has gradually increased in size without any signs of infection. Over the past six months, she has experienced unexplained weight loss, approximately 12% of her body weight, along with recurrent episodes of fever exceeding 103°F, which come and go. Additionally, she reports heavy night sweats that are severe enough to soak her sheets.
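To generate the responses below, switch the model to inference mode and prompt it with the case study; a minimal sketch, assuming the question text is stored in a case_study string and reusing the training prompt template with empty placeholders for the reasoning and answer:
# Minimal inference sketch: query the model with the case study
case_study = "A 42-year-old woman presents with a painless, enlarging neck lump..."  # abbreviated
FastLanguageModel.for_inference(model)  # switch the Unsloth model to optimized inference mode
inputs = tokenizer([train_prompt_style.format(case_study, "", "")], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])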
Non-fine-tuned LLM response:

Fine-tuned LLM response:

The response from the fine-tuned model is more aligned with the likely refined diagnosis, as suggested by [7]. It demonstrates greater precision and a clearer chain of thought compared to the non-fine-tuned model. The following is a portion of the fine-tuned model’s chain of thought:

A few things to try as next steps:
- Injecting medical real-time context through Retrieval-Augmented Generation (RAG).
- Agentic design to cite different sources based on question type (diagnosis, treatment, medication) to bring transparency.
- Expanding the dataset by incorporating additional medical literature and case studies.
- Evaluating the chatbot on real-world medical queries to benchmark its accuracy and usability.
References:
[1] “DeepSeek-R1/DeepSeek_R1.pdf at main · deepseek-ai/DeepSeek-R1.” Accessed: Mar. 03, 2025. [Online].
[2] “unslothai/unsloth: Finetune Llama 3.3, DeepSeek-R1 & Reasoning LLMs 2x faster with 70% less memory! 🦥.” Accessed: Mar. 07, 2025. [Online].
[3] “unsloth/DeepSeek-R1-Distill-Qwen-14B · Hugging Face.” Accessed: Mar. 07, 2025. [Online].
[4] “How to Access HuggingFace API key? – GeeksforGeeks.” Accessed: Mar. 07, 2025. [Online].
[5] “FreedomIntelligence/medical-o1-reasoning-SFT · Datasets at Hugging Face.” Accessed: Mar. 07, 2025. [Online].
[6] “In-depth guide to fine-tuning LLMs with LoRA and QLoRA.” Accessed: Mar. 09, 2025. [Online].
[7] “Hodgkin lymphoma (Hodgkin disease) – Symptoms and causes – Mayo Clinic.” Accessed: Mar. 09, 2025. [Online].