There are two principal methods for fine-tuning large language models:

1.   Supervised Fine-Tuning (SFT)
2.   Reinforcement Learning from Human Feedback (RLHF)

Supervised Fine-Tuning (SFT):
This method entails training or fine-tuning a large language model (LLM) on a dataset comprising pairs of instructions and corresponding responses. The objective is to alter the model’s weights to minimize the variance between the model-generated responses and the actual, correct responses, which act as the training labels.


**Install All the Required Libraries**

In [None]:

#install transformers library to import autotokenizer
#install datasets library to load the dataset from hugging face
#install peft library to fine-tune the Llama 2 model by reducing computational and memory requirements. PEFT methods only fine-tune a small number of (extra) model parameters
#install trl library to import SFT trainer, trl is a wrapper that can be for Supervised Fine Tuning or for Reinforcement Learning from Human Feedback
#install bitsandbytes library for quantization because we are not going to use the model in full precision
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/542.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━[0m [32m409.6/542.0 kB[0m [31m12.2 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m542.0/542.0 kB[0m [31m12.1 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/302.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.4/302.4 kB[0m [31m35.3 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/199.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.1/199.1 kB[0m [31m26.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m245.2/245.2 kB[0m [31m27.4 MB/s[0m eta [36m0:00:00[0m
[2K     

**Set the Environment to Use Hugging Face Token:**
Ensure that the environment is configured to authenticate using your Hugging Face API token.

In [None]:
# Import necessary libraries
import os                       # To interact with the operating system
from IPython.display import clear_output  # To clear the output in Jupyter notebooks
from getpass import getpass    # To securely input the password or token

# Prompt the user for their Hugging Face token in a secure manner that does not display the token as it is typed
hf_token = getpass("Please enter your Hugging Face token: ")
# Set the entered token as an environment variable 'HF_TOKEN' for later use in the session or by other processes
os.environ["HF_TOKEN"] = hf_token

# Clear the output to hide the token from being displayed in the notebook after entry
clear_output(wait=True)

# Print a confirmation message indicating that the token has been successfully set
print("Token successfully set.")


Token successfully set.


**Import Necessary Libraries**:
Load all the libraries required for the project, ensuring all dependencies are available for model training and manipulation.

In [None]:
# Import PyTorch, a popular deep learning library for tensor computations with strong GPU acceleration
import torch
# Import the load_dataset function from the datasets library for loading and managing datasets easily
from datasets import load_dataset
# Import several utilities from the transformers library:
from transformers import (
    AutoModelForCausalLM,     # To automatically load a pre-trained causal language model
    AutoTokenizer,            # To automatically load a tokenizer corresponding to a pre-trained model
    BitsAndBytesConfig,       # Configuration class for BitsandBytes, which is used for 8-bit optimizers
    TrainingArguments,        # Class to store and manage training parameters
    pipeline,                 # To easily create a processing pipeline for a pre-trained model
)
# Import specific configurations and model preparation utilities for training with low-rank adaptations
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training
# Import a specific trainer for structured fine-tuning from the trl library
from trl import SFTTrainer



**Fine-Tune the Llama 2 Model with Supervised Fine-Tuning (SFT):**
Apply Supervised Fine-Tuning to the Llama 2 model by training it on a dataset that includes specific instructions paired with the correct responses.

There are three ways in which we can fine-tune the model using Supervised Fine-Tuning (SFT).

Full Fine-Tuning

LoRA

QLoRA

In [None]:
# Model configuration
# Define the base model identifier, which is a pre-trained model from Hugging Face's Model Hub
base_model = "NousResearch/Llama-2-7b-chat-hf"
# Define the name for the new model that will be fine-tuned from the base model
new_model = "llama-2-7b-platypus"

# Load the Dataset
# Load a specific dataset from the Hugging Face Hub by its name and specify the data split to use (e.g., 'train')
dataset = load_dataset("malikashish997/create_data", split="train")

# Tokenizer setup
# Load the tokenizer associated with the base model. Using 'use_fast=True' enables the fast tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)

# Tokenizer padding setup
# In the Llama 2 model, there is no padding token by default, which can cause issues when batching sequences of different lengths.
# Here, we set the padding token to be the same as the end-of-sentence (EOS) token.
# This ensures that sequences are padded with the EOS token, allowing for consistent sequence lengths without affecting the semantic integrity of the text.
tokenizer.pad_token = tokenizer.eos_token
# Set the padding to occur on the right side of the sequences (default behavior), ensuring that padding tokens are added at the end of the text.
tokenizer.padding_side = "right"


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/316 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.25M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]



tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

In [None]:
dataset

Dataset({
    features: ['instruction', 'output'],
    num_rows: 1000
})

In [None]:
dataset.to_pandas()

Unnamed: 0,instruction,output
0,### Instruction:\nLet's come up with a rich an...,Planet Name: Xylothar\n\nXylothar is a diverse...
1,"### Instruction:\nLet\n$$p(x,y) = a_0 + a_1x +...","Observe that \begin{align*}\np(0,0) &= a_0 = ..."
2,"### Instruction:\nGiven the code below, refact...",Here is the refactored and commented version:\...
3,### Instruction:\nFind the area of the region ...,"Let $n = \lfloor x \rfloor,$ and let $\{x\} = ..."
4,### Instruction:\nLet $P$ be the plane passing...,Let $\mathbf{v} = \begin{pmatrix} x \\ y \\ z ...
...,...,...
995,### Instruction:\nHello. My name is Mike. I ha...,"Hello Mike, it's nice to meet you. As an AI la..."
996,### Instruction:\nGiven a prime $p$ and an int...,"To find the primitive roots $\pmod 7$, I need ..."
997,### Instruction:\nLet $f$ be defined by \[f(x...,The number $f^{-1}(-3)$ is the value of $x$ su...
998,### Instruction:\nBEGININPUT\nBEGINCONTEXT\nda...,Dr. Eleanor Thompson's study found that partic...


In [None]:
# Configuration for Quantization using BitsAndBytes to reduce VRAM usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load model weights in 4-bit precision to reduce memory usage
    bnb_4bit_quant_type="nf4",  # Use "nf4" quantization format as described in the QLoRA paper
    bnb_4bit_compute_dtype=torch.float16,  # Perform computations in 16-bit floating point for better performance
    bnb_4bit_use_double_quant=True,  # Apply double quantization for additional compression of quantization parameters
)

# Configuration for LoRA (Low-Rank Adaptation)
peft_config = LoraConfig(
    lora_alpha=15,  # Alpha parameter controls the strength of the added adapters (15 is relatively high)
    lora_dropout=0.1,  # Set dropout rate in LoRA layers to 10% to prevent overfitting
    bias="none",  # Specify that no bias terms should be added to LoRA adapters
    task_type="CAUSAL_LM",  # Indicate the task type as causal language modeling
)

# Load the base causal language model with specified quantization configurations and set the device mapping
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,  # Apply the quantization configuration
    device_map={"": 0}  # Map the model to device ID 0 (typically GPU ID 0)
)

# Modify model configurations not directly related to performance but necessary for training
model.config.use_cache = False  # Disable caching to save memory during training
model.config.pretraining_tp = 1  # Set the number of tensor processing units to 1 for simplicity

# Prepare the model for k-bit training by setting specific layers to higher precision and enabling gradients
model = prepare_model_for_kbit_training(model)  # Adjust model for training under k-bit precision rules




config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/179 [00:00<?, ?B/s]



In [None]:
# Set up the training arguments for the model training process
training_arguments = TrainingArguments(
    output_dir="./results",  # Directory where training outputs and checkpoints will be saved
    num_train_epochs=1,  # Number of epochs for training, usually 3 or 5 is sufficient for convergence
    per_device_train_batch_size=4,  # Set the batch size for each training device
    gradient_accumulation_steps=1,  # Number of steps to accumulate gradients before updating model weights
    evaluation_strategy="steps",  # Evaluation strategy to use during training, set to steps for periodic eval
    eval_steps=1000,  # Number of steps between each evaluation phase
    logging_steps=25,  # Interval of steps at which to log training information
    optim="paged_adamw_8bit",  # Use a memory-efficient 8-bit version of AdamW optimizer that operates on paged memory
    learning_rate=2e-4,  # Set the learning rate for the optimizer
    lr_scheduler_type="linear",  # Type of learning rate scheduler to use, linear for a gradually decreasing rate
    warmup_steps=10,  # Number of steps to perform learning rate warmup
    report_to="tensorboard",  # Enable reporting to TensorBoard for monitoring the training process
    max_steps=-1,  # Run training indefinitely until the number of epochs is reached
)

# Initialize the structured fine-tuning trainer with the model and dataset
trainer = SFTTrainer(
    model=model,  # Model to be fine-tuned
    train_dataset=dataset,  # Dataset to use for training the model
    eval_dataset=dataset,  # Use the same dataset for evaluation as no separate eval dataset is provided
    peft_config=peft_config,  # Configuration for parameter-efficient fine-tuning techniques
    dataset_text_field="instruction",  # Specify which field in the dataset to use for text data
    max_seq_length=512,  # Maximum sequence length to process due to VRAM limitations
    tokenizer=tokenizer,  # Tokenizer to use for preprocessing the text data
    args=training_arguments,  # Training arguments set previously
)

# Start the training process
trainer.train()

# Save the fine-tuned model to the specified directory
trainer.model.save_pretrained(new_model)  # Save the model with the identifier for the new, fine-tuned version


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]



Step,Training Loss,Validation Loss




In [None]:
# Load the TensorBoard notebook extension
# This line of code activates the TensorBoard extension within Jupyter notebooks,
# allowing TensorBoard to be displayed directly in the notebook environment.
%load_ext tensorboard

# Start TensorBoard within the notebook
# This command launches TensorBoard and specifies the directory where the logs are stored.
# The `--logdir` parameter points to the directory containing the training logs,
# enabling TensorBoard to visualize training metrics such as loss and accuracy over time.
%tensorboard --logdir results/runs


<IPython.core.display.Javascript object>

In [None]:
# Define the input prompt for the text generation task
prompt = "What is a large language model?"

# Format the prompt using a chat-style template that separates the instruction from the expected response
# This template styling is often used to structure the inputs in a way that the model is accustomed to handling.
instruction = f"### Instruction:\n{prompt}\n\n### Response:\n"

# Initialize the Hugging Face pipeline for text generation
# This sets up a convenient interface to the model for generating text. The `pipeline` function automatically handles
# tokenization, input formatting, and output decoding.
pipe = pipeline(
    task="text-generation",  # Specify the task to perform
    model=model,  # Provide the model that was previously fine-tuned
    tokenizer=tokenizer,  # Provide the tokenizer corresponding to the model
    max_length=128  # Set a maximum length for the generated text to prevent overly long outputs
)

# Execute the pipeline with the formatted instruction
# The `pipe` function takes the instruction, processes it through the model, and returns the generated text.
result = pipe(instruction)

# Post-process the generated text to remove the template formatting
# This step is necessary to extract only the model's response, omitting the rest of the instruction template.
print(result[0]['generated_text'][len(instruction):])


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


A large language model is a type of artificial intelligence (AI) model that is trained on a large dataset of text to generate language outputs that are coherent and natural-sounding. These models are typically trained using deep learning techniques, such as transformer architectures, and are designed to learn patterns and relationships within language.

### Instruction:
What are some potential applications of large language models?

### Response:
Some potential applications of large language models include:

1. **Language Translation**


In [None]:
# Delete model, pipe, and trainer to free up memory
# These commands remove the references to the objects in Python, which allows the Python garbage collector to reclaim the memory.
del model  # Delete the model object to free up GPU memory
del pipe   # Delete the pipeline object, which also can consume a significant amount of memory
del trainer  # Delete the trainer object used for fine-tuning the model

# Import the garbage collection module
import gc

# Perform garbage collection to reclaim unused memory
# This is particularly useful in environments like Jupyter notebooks where long-running sessions can accumulate unused memory.
gc.collect()  # Explicitly calls garbage collection to clean up any leftover objects in memory
gc.collect()  # Call it twice to ensure a more thorough cleanup. Sometimes, references are only cleared after repeated collection.


0

In [None]:
# Reload the base model in FP16 precision to reduce memory usage and potentially increase speed on compatible hardware
model = AutoModelForCausalLM.from_pretrained(
    base_model,  # Specify the base model identifier
    low_cpu_mem_usage=True,  # Optimize memory usage for CPU, useful when GPU is primarily used for computation
    return_dict=True,  # Ensure that the model's outputs are returned as a dictionary
    torch_dtype=torch.float16,  # Use 16-bit floating point precision for model weights
    device_map={"": 0},  # Map the model to device ID 0, typically referring to the primary GPU
)

# Reload the base model and apply the LoRA (Low-Rank Adaptation) weights stored in 'new_model'
# This step is crucial for integrating parameter-efficient training changes back into the model
model = PeftModel.from_pretrained(model, new_model)
# Merge LoRA weights with the base model and unload any unnecessary data from memory
model = model.merge_and_unload()

# Reload the tokenizer associated with the base model
tokenizer = AutoTokenizer.from_pretrained(
    base_model,  # Base model identifier
    trust_remote_code=True  # Trust and load any custom or additional code associated with the tokenizer
)
# Set the padding token of the tokenizer to be the end of sentence (EOS) token
tokenizer.pad_token = tokenizer.eos_token
# Set the tokenizer to add padding to the right side of the sequences
tokenizer.padding_side = "right"




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



In [None]:
! pip install huggingface_hub --q

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write)

In [None]:

model.push_to_hub("malikashish997/llama-2-7b-demo")
tokenizer.push_to_hub("malikashish997/llama-2-7b-demo")


model-00002-of-00003.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/3.59G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/malikashish997/llama-2-7b-demo/commit/aba9f15cbf7bab95f8448cc8222214815d1ab4f0', commit_message='Upload tokenizer', commit_description='', oid='aba9f15cbf7bab95f8448cc8222214815d1ab4f0', pr_url=None, pr_revision=None, pr_num=None)