Fine-Tuning LLaMA3 with a Custom Docker Commands Dataset Using Unsloth

Discover a comprehensive guide on fine-tuning the LLaMA3 model with a custom Docker commands dataset using Unsloth. This detailed blog covers environment setup, model and tokenizer loading, configuration with LoRA, dataset preparation, prompt design, model fine-tuning, and testing. Learn how to create a Gradio interface for your model and deploy it effectively.

Adnan Writes
10 min read · Aug 4, 2024


1. The Importance of Fine-Tuning

Fine-tuning adjusts a pre-trained model to perform specific tasks by further training it on task-specific data. This process leverages the general knowledge the model has gained and adapts it to the nuances of the new task.

2. The Theory Behind Fine-Tuning

Fine-tuning is a critical aspect of machine learning, especially when working with complex models like LLaMA3. It allows you to adapt a pre-trained model to specific tasks or domains, leveraging existing knowledge while tailoring the model to perform optimally on new, specialized data. This process involves several core concepts, including transfer learning, which forms the foundation of fine-tuning.

2.1 Core Concepts of Fine-Tuning

2.1.1 Transfer Learning

Transfer learning is a powerful approach in machine learning that involves taking a model pre-trained on one task and adapting it to perform a different but related task. The primary benefits of transfer learning are efficiency in training and improved performance, especially when dealing with limited data for the new task. Here’s an in-depth look at how transfer learning works and why it’s so effective:

  • Pre-trained Models: These are models that have been trained on large datasets for general tasks. For example, models like BERT or GPT-3 have been trained on vast amounts of text data to understand language patterns and relationships. Such models capture a broad range of knowledge and features that are useful for various tasks.
  • Feature Extraction: In the transfer learning process, the pre-trained model serves as a feature extractor. It has learned to identify and represent complex patterns in data, which can be reused for different tasks. The model’s internal layers have learned representations that capture a range of features from the input data, which can be applied to new tasks.
  • Adaptation to New Tasks: To adapt a pre-trained model to a new task, you typically modify the model by adding new layers or adjusting the existing ones. For instance, in fine-tuning, you might replace the output layer of a pre-trained model to match the specific number of classes or outputs required for your new task. The rest of the model’s layers, which have already learned useful features, remain unchanged or are fine-tuned with a smaller learning rate (a minimal code sketch follows this list).
  • Fine-Tuning: This involves continuing the training of the pre-trained model on a new, task-specific dataset. Fine-tuning is usually done with a smaller learning rate compared to the initial training phase. This is because you want to adjust the model slightly to adapt to the new task without drastically altering the learned features. Fine-tuning helps the model specialize its learned knowledge to the specifics of the new task while retaining the general features acquired during the initial training.
  • Benefits of Transfer Learning:
      • Reduced Training Time: Training a model from scratch can be computationally expensive and time-consuming. Transfer learning significantly reduces this time by leveraging the pre-existing knowledge of a pre-trained model.
      • Improved Performance: Pre-trained models often perform better on new tasks because they start from a point of having learned useful features from a large and diverse dataset. This is particularly beneficial when working with smaller datasets for the new task.
      • Resource Efficiency: Transfer learning saves on computational resources and data requirements, making it a cost-effective solution for many machine learning problems.
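
To make the adaptation step concrete, here is a minimal PyTorch sketch (an illustration, not code from the fine-tuning pipeline below) that freezes a pre-trained network and swaps in a new output head. torchvision's ResNet-18 stands in for any pre-trained model:

import torch
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet (a stand-in for any pre-trained model)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pre-trained layers so their learned features are preserved
for param in model.parameters():
    param.requires_grad = False

# Replace the output layer to match the new task (here, a 5-class problem)
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the new head's parameters are handed to the optimizer
optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)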

2.1.2 Fine-Tuning Techniques

  • Full Model Fine-Tuning: Adjusting all layers of the model.
  • Layer-Wise Fine-Tuning: Fine-tuning specific layers of the model.
  • Parameter-Efficient Fine-Tuning (PEFT): Techniques like LoRA (Low-Rank Adaptation) that adjust a small subset of parameters to adapt the model.

2.2 Key Parameters in Fine-Tuning

2.2.1 Learning Rate

The learning rate controls how much to change the model’s weights in response to the estimated error. A higher learning rate can speed up training but may cause instability, while a lower rate ensures stable but slower convergence.

2.2.2 Batch Size

Batch size refers to the number of training examples utilized in one iteration. Larger batch sizes can lead to more accurate estimates of gradients but require more memory.

2.2.3 Epochs

An epoch is one complete pass through the entire training dataset. More epochs can lead to better training but may also cause overfitting if not managed properly.

2.2.4 Regularization

Techniques like dropout and weight decay are used to prevent overfitting by reducing the model’s ability to memorize the training data.
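
As a quick illustration (a minimal sketch, not part of the pipeline below), both techniques map directly onto PyTorch primitives: dropout is a layer, and weight decay is an optimizer argument:

import torch
import torch.nn as nn

# Dropout randomly zeroes activations during training, discouraging memorization
classifier = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # drop 10% of activations on each forward pass
    nn.Linear(64, 2),
)

# Weight decay penalizes large weights at every optimizer step
optimizer = torch.optim.AdamW(classifier.parameters(), lr=2e-4, weight_decay=0.01)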

2.3 Advanced Fine-Tuning Techniques

2.3.1 LoRA (Low-Rank Adaptation)

LoRA introduces low-rank matrices into the model architecture, allowing fine-tuning of a small number of parameters while keeping the rest of the model frozen.
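
Concretely, LoRA learns two small matrices A and B whose product approximates the weight update, scaled by alpha / r. The sketch below is a simplified illustration of the idea, not Unsloth's actual implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # the pre-trained weights stay frozen
        # Low-rank factors: A projects down to rank r, B projects back up
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Output = frozen path + scaled low-rank path
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale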

2.3.2 Prompt Tuning

Prompt tuning involves modifying the input prompts given to the model to achieve better performance on specific tasks.

3. Preparing for Fine-Tuning

3.1 Setting Up Your Environment

3.1.1 Hardware Requirements

  • GPUs vs. CPUs: GPUs are preferred for their parallel processing capabilities, which significantly speed up training.
  • Cloud Solutions: Platforms like AWS and Google Cloud offer scalable resources for training large models.

3.1.2 Software Requirements

  • Libraries: PyTorch, Hugging Face Transformers, and other libraries are essential for implementing fine-tuning.
  • Installation: Ensure all required packages are installed and configured correctly.

3.2 Loading and Configuring the Model

3.2.1 Model Selection

Choose a pre-trained model that aligns with your task. For example, BERT is often used for classification tasks, while GPT models are suitable for generation tasks.

3.2.2 Configuration Parameters

  • Max Sequence Length: Defines the maximum length of input sequences.
  • Precision: Determines the numerical precision of the model’s parameters (e.g., 4-bit precision).

3.3 Data Preparation

3.3.1 Dataset Loading

  • Sources: Use datasets from repositories like Hugging Face Datasets or custom datasets.
  • Formats: Ensure data is in the correct format for the task (e.g., input-output pairs for classification).

3.3.2 Data Augmentation

  • Synthetic Data: Generate additional data to improve model robustness.
  • Balancing: Ensure the dataset is balanced to prevent bias in the model.

1. Setting Up the Environment

# Install necessary libraries
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install gradio
  • Unsloth: This package is used for model fine-tuning and managing training workflows.
  • Xformers: Provides implementations for Flash Attention and other attention mechanisms that can optimize training.
  • TRL (Transformers Reinforcement Learning): Useful for advanced training techniques.
  • PEFT (Parameter-Efficient Fine-Tuning): For methods like LoRA, which adjust a small subset of model parameters.
  • Accelerate: A library to facilitate distributed and mixed-precision training.
  • Bitsandbytes: Provides efficient implementations of quantization and optimization algorithms.
Next, import the core classes used throughout the rest of the guide:

from unsloth import FastLanguageModel
import torch

Explanation:

  • FastLanguageModel from Unsloth: This is the class used to load and work with the language model.
  • torch: PyTorch library for tensor operations and model training.

2. Loading and Configuring the Model

2.1 Set Configuration Parameters

max_sequence_length = 2048
dtype = None
load_in_4bit = True

Explanation:

  • max_sequence_length: The maximum length of input sequences the model can handle. Setting this appropriately helps in managing memory and performance.
  • dtype: Data type for model parameters. None means default precision, but can be set to torch.float16 for half-precision training.
  • load_in_4bit: A boolean to indicate whether to load the model in 4-bit precision to save memory and computational resources.

2.2 Load Pre-trained Model and Tokenizer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=max_sequence_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

Explanation:

  • model_name="unsloth/Meta-Llama-3.1-8B": Specifies the pre-trained model to use. LLaMA (Large Language Model Meta AI) 3.1 with 8 billion parameters is selected here.
  • max_seq_length: Ensures the model can handle sequences up to the specified length.
  • dtype: Sets the data type for model weights (e.g., float16 for mixed-precision).
  • load_in_4bit: Loads the model using 4-bit quantization for efficiency.

2.3 Apply Parameter-Efficient Fine-Tuning

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3047,
    use_rslora=False,
    loftq_config=None,
)
  • FastLanguageModel.get_peft_model: Method to apply Parameter-Efficient Fine-Tuning (PEFT) techniques (a quick check of the resulting trainable-parameter count follows this list).
  • r=16: Rank parameter for LoRA. It controls the rank of the low-rank matrices used in adaptation.
  • target_modules: List of model modules to which LoRA will be applied. These modules handle different aspects of attention and projection in the model.
  • lora_alpha=16: Scaling factor for LoRA. This adjusts the magnitude of the updates applied to the model.
  • lora_dropout=0: Dropout rate for LoRA. Set to 0 here, meaning no dropout.
  • bias="none": Specifies whether to include bias terms in the fine-tuning. "none" means no additional bias.
  • use_gradient_checkpointing="unsloth": Enables gradient checkpointing to save memory during training.
  • random_state=3047: Seed for random number generation to ensure reproducibility.
  • use_rslora=False: Whether to use RSLORA, an advanced version of LoRA. Set to False here.
  • loftq_config=None: Configuration for other fine-tuning parameters, not used in this setup.
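
Because get_peft_model returns a PEFT-wrapped model, it should expose PEFT's print_trainable_parameters() helper, which is a quick way to confirm that only a small fraction of the 8B weights will actually be trained:

# Report trainable vs. total parameters; with LoRA this is typically
# well under 1% of the full model.
model.print_trainable_parameters()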

3. Preparing the Dataset

3.1 Load and Inspect the Dataset

from datasets import load_dataset

dataset = load_dataset("MattCoddity/dockerNLcommands", split='train')

# Print the column names and the number of rows in each
print("Columns and their lengths:")
for column in dataset.column_names:
    print(f"Column: {column}, Length: {len(dataset[column])}")

Explanation:

  • load_dataset("MattCoddity/dockerNLcommands", split='train'): Loads a custom dataset of Docker commands. The split='train' argument specifies that we are loading the training subset of the data.
  • dataset.column_names: Lists the column names in the dataset.
  • len(dataset[column]): Prints the length of each column (i.e., the number of rows), which helps in understanding the dataset structure (a peek at one row follows this list).
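
Printing a single row is also a quick way to see the exact fields the prompt template will consume. The field names below (instruction, input, output) are the ones used throughout the rest of this guide:

# Inspect one example to see the raw fields
print(dataset[0])
# Expected keys: 'instruction', 'input', 'output'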

3.2 Formatting the Dataset

def formatting_prompt_func(example):
    return {
        "input": example["input"],
        "instruction": example["instruction"],
        "output": example["output"],
    }

dataset = dataset.map(formatting_prompt_func)

print("Dataset Features:", dataset.features)
  • formatting_prompt_func: Function to format each example in the dataset. It extracts relevant fields (input, instruction, output) to fit the model's requirements.
  • dataset.map(formatting_prompt_func): Applies the formatting function to each example in the dataset.
  • dataset.features: Displays the features of the dataset after formatting.

4. Designing the Prompt

4.1 Define the Prompt Template

docker_prompt = """below is an instruction describes a task, paired with an input that provides further context, write a response that approves the instruction .

### Instruction:
{}

### Input:
{}

### Response:
{}"""

docker_prompt: Template for the input to the model. It includes placeholders for the instruction, input, and expected output.
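
Rendered with a hypothetical example pair, a single training text would look like this:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate this sentence into a Docker command.

### Input:
List all running containers

### Response:
docker ps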

4.2 Format the Prompt for the Model

EOS_Token = tokenizer.eos_token

def formatting_prompt_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Append the EOS token so the model learns where a response ends
        text = docker_prompt.format(instruction, input, output) + EOS_Token
        texts.append(text)
    return {"texts": texts}

formatted_dataset = dataset.map(formatting_prompt_func, batched=True)
  • EOS_Token: End-of-sequence token appended to each example so the model learns where a response ends.
  • formatting_prompt_func(examples): Formats examples into the prompt template. The function iterates over instructions, inputs, and outputs, and creates a formatted text for each example.
  • formatted_dataset: The mapped dataset whose new "texts" column holds the complete training strings (a sanity check below prints one of them).
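
A quick sanity check is to print one rendered example and confirm it follows the template and ends with the EOS token:

# Print the first fully formatted training string
print(formatted_dataset["texts"][0])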

5. Training the Model

5.1 Set Up Training Arguments

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=50,
    learning_rate=2e-4,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
)
  • TrainingArguments: Configures training parameters.
  • per_device_train_batch_size=2: Batch size per device. A small batch size helps manage memory.
  • gradient_accumulation_steps=4: Accumulates gradients over multiple steps before updating model weights, giving a larger effective batch size (here 2 × 4 = 8 examples per update).
  • warmup_steps=5: Number of steps to gradually increase the learning rate from 0 to the set value.
  • max_steps=50: Total number of training steps.
  • learning_rate=2e-4: Learning rate for the optimizer.
  • fp16 / bf16: Uses float16 mixed precision when the GPU lacks bfloat16 support, and bfloat16 when it is available (Unsloth's is_bfloat16_supported() performs the check).
  • logging_steps=1: Frequency of logging training progress.
  • optim="adamw_8bit": Optimizer used, here AdamW with 8-bit optimizer states for memory savings.
  • lr_scheduler_type="linear": Learning rate scheduler type.
  • seed=3407: Random seed for reproducibility.
  • output_dir="outputs": Directory to save model checkpoints and logs.

5.2 Create the Trainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=formatted_dataset,
    dataset_text_field="texts",
    max_seq_length=max_sequence_length,
    tokenizer=tokenizer,
)

Explanation:

  • SFTTrainer: Trainer class for supervised fine-tuning. It handles the training loop, gradient updates, and logging.
  • model=model: Model to be fine-tuned.
  • args=training_args: Training arguments configured earlier.
  • train_dataset=formatted_dataset: Dataset formatted for training.
  • dataset_text_field="texts": The column that holds the fully formatted training strings.
  • max_seq_length=max_sequence_length: Caps the tokenized length of each example.
  • tokenizer=tokenizer: Tokenizer used to encode text inputs.

5.3 Train the Model

trainer.train()

Explanation:

  • trainer.train(): Starts the training process using the configured trainer and arguments (capturing its return value, as shown below, exposes summary metrics).
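
Optionally, capture the return value: trainer.train() returns a TrainOutput whose metrics dictionary summarizes the run (final loss, runtime, and so on):

trainer_stats = trainer.train()

# Summary statistics collected during training
print(trainer_stats.metrics)  # e.g., train_loss, train_runtime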

6. Saving and Evaluating the Model

6.1 Save the Model

model.save_pretrained("path_to_save_model")
tokenizer.save_pretrained("path_to_save_model")

Explanation:

  • model.save_pretrained("path_to_save_model"): Saves the fine-tuned LoRA adapter weights to the specified path (see the note below on saving merged weights).
  • tokenizer.save_pretrained("path_to_save_model"): Saves the tokenizer to the same path for consistency.
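
Note that save_pretrained on a PEFT model stores only the small LoRA adapter weights, not the full 8B model. If you need a standalone model with the adapters folded into the base weights, Unsloth provides save_pretrained_merged (verify that your installed Unsloth version supports it):

# Merge the LoRA adapters into the base weights and save a standalone
# 16-bit model (much larger on disk than the adapter-only checkpoint).
model.save_pretrained_merged(
    "path_to_save_model_merged",
    tokenizer,
    save_method="merged_16bit",
)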

6.2 Load the Model for Evaluation

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="path_to_save_model",
    max_seq_length=max_sequence_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode

Explanation:

  • Because save_pretrained stored LoRA adapter weights rather than a full model, reloading through FastLanguageModel.from_pretrained lets Unsloth attach the adapters to their base model automatically.
  • FastLanguageModel.for_inference(model): Switches the model into Unsloth's faster inference mode before generation.

6.3 Evaluate the Model

def evaluate_model(input_text):
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(evaluate_model("Write a Docker command to start a container"))

Explanation:

  • evaluate_model(input_text): Function to evaluate the model’s response to an input text (a prompt-matched example follows this list).
  • tokenizer(input_text, return_tensors="pt").to(model.device): Tokenizes the input text and moves the tensors to the model’s device.
  • model.generate(**inputs, max_new_tokens=64): Generates up to 64 new tokens in response.
  • tokenizer.decode(outputs[0], skip_special_tokens=True): Decodes the generated tokens into human-readable text.
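
Because the model was fine-tuned on the docker_prompt template, generation is most reliable when the evaluation input follows the same format, with the response slot left empty for the model to fill. The instruction wording below is a hypothetical example:

# Build the same prompt used during training, leaving the response blank
prompt = docker_prompt.format(
    "Translate this sentence into a Docker command.",  # instruction (example)
    "List all running containers",                     # input
    "",                                                # response: left empty
)
print(evaluate_model(prompt))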

Conclusion

In this comprehensive guide, we delved into the intricate process of fine-tuning a language model using a step-by-step approach. We started by setting up the necessary environment, including installing required packages and importing essential libraries. We then proceeded to load and configure a pre-trained model, utilizing parameter-efficient fine-tuning techniques such as LoRA.

Next, we explored the preparation and formatting of a dataset, followed by designing a prompt to guide the model’s responses. Detailed instructions were provided on setting up training arguments, creating a trainer, and initiating the training process. Finally, we discussed how to save the fine-tuned model and evaluate its performance on new input text.

By breaking down each code block and explaining its function, this guide aimed to provide a thorough understanding of the fine-tuning process. Whether you are a novice or an experienced practitioner, this guide offers valuable insights and practical knowledge to help you fine-tune language models effectively.

Fine-tuning allows for customization and optimization of models for specific tasks, making them more accurate and efficient. With the growing importance of AI and NLP applications, mastering fine-tuning techniques is essential for developing state-of-the-art solutions.
