## Notebook 1: PDF Pre-processing

In the series, we will be going from a PDF to Podcast using all open models. 

The first step in getting to the podcast is finding a script, right now our logic is:
- Use any PDF on any topic
- Prompt `Llama-3.2-1B-Instruct` model to process it into a text file
- Re-write this into a podcast transcript in next notebook.

In this notebook, we will upload a PDF and save it into a `.txt` file using the `PyPDF2` library, later we will process chunks from the text file using our featherlight model.

Most of us shift-enter pass the comments to realise later we need to install libraries. For the few that read the instructions, please remember to do so:

In [1]:
!pip install PyPDF2
!pip install rich ipywidgets
!pip install -r requirements.txt

[0m

Assuming you have a PDF uploaded on the same machine, please set the path for the file. 

Also, if you want to flex your GPU-please switch to a bigger model although the featherlight models work perfectly for this task:

In [1]:
pdf_path = './resources/2402.13116v4.pdf'
DEFAULT_MODEL = "meta-llama/Llama-3.2-1B-Instruct"

In [2]:
import PyPDF2
from typing import Optional
import os
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

from tqdm.notebook import tqdm
import warnings

warnings.filterwarnings('ignore')

os.environ["HF_TOKEN"] = "replace"

Let's make sure we don't stub our toe by checking if the file exists

In [3]:
def validate_pdf(file_path: str) -> bool:
    if not os.path.exists(file_path):
        print(f"Error: File not found at path: {file_path}")
        return False
    if not file_path.lower().endswith('.pdf'):
        print("Error: File is not a PDF")
        return False
    return True

Convert PDF to a `.txt` file. This would simply read and dump the contents of the file. We set the maximum characters to 100k. 

For people converting their favorite novels into a podcast, they will have to add extra logic of going outside the Llama models context length which is 128k tokens.

In [4]:
def extract_text_from_pdf(file_path: str, max_chars: int = 100000) -> Optional[str]:
    if not validate_pdf(file_path):
        return None
    
    try:
        with open(file_path, 'rb') as file:
            # Create PDF reader object
            pdf_reader = PyPDF2.PdfReader(file)
            
            # Get total number of pages
            num_pages = len(pdf_reader.pages)
            print(f"Processing PDF with {num_pages} pages...")
            
            extracted_text = []
            total_chars = 0
            
            # Iterate through all pages
            for page_num in range(num_pages):
                # Extract text from page
                page = pdf_reader.pages[page_num]
                text = page.extract_text()
                
                # Check if adding this page's text would exceed the limit
                if total_chars + len(text) > max_chars:
                    # Only add text up to the limit
                    remaining_chars = max_chars - total_chars
                    extracted_text.append(text[:remaining_chars])
                    print(f"Reached {max_chars} character limit at page {page_num + 1}")
                    break
                
                extracted_text.append(text)
                total_chars += len(text)
                print(f"Processed page {page_num + 1}/{num_pages}")
            
            final_text = '\n'.join(extracted_text)
            print(f"\nExtraction complete! Total characters: {len(final_text)}")
            return final_text
            
    except PyPDF2.PdfReadError:
        print("Error: Invalid or corrupted PDF file")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {str(e)}")
        return None


Helper function to grab meta info about our PDF

In [5]:
# Get PDF metadata
def get_pdf_metadata(file_path: str) -> Optional[dict]:
    if not validate_pdf(file_path):
        return None
    
    try:
        with open(file_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            metadata = {
                'num_pages': len(pdf_reader.pages),
                'metadata': pdf_reader.metadata
            }
            return metadata
    except Exception as e:
        print(f"Error extracting metadata: {str(e)}")
        return None

Finally, we can run our logic to extract the details from the file

In [6]:
# Extract metadata first
print("Extracting metadata...")
metadata = get_pdf_metadata(pdf_path)
if metadata:
    print("\nPDF Metadata:")
    print(f"Number of pages: {metadata['num_pages']}")
    print("Document info:")
    for key, value in metadata['metadata'].items():
        print(f"{key}: {value}")

# Extract text
print("\nExtracting text...")
extracted_text = extract_text_from_pdf(pdf_path)

# Display first 500 characters of extracted text as preview
if extracted_text:
    print("\nPreview of extracted text (first 500 characters):")
    print("-" * 50)
    print(extracted_text[:500])
    print("-" * 50)
    print(f"\nTotal characters extracted: {len(extracted_text)}")

# Optional: Save the extracted text to a file
if extracted_text:
    output_file = './resources/extracted_text.txt'
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(extracted_text)
    print(f"\nExtracted text has been saved to {output_file}")

Extracting metadata...

PDF Metadata:
Number of pages: 43
Document info:
/Author: 
/CreationDate: D:20241022021202Z
/Creator: LaTeX with hyperref
/Keywords: 
/ModDate: D:20241022021202Z
/PTEX.Fullbanner: This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5
/Producer: pdfTeX-1.40.25
/Subject: 
/Title: 
/Trapped: /False

Extracting text...
Processing PDF with 43 pages...
Processed page 1/43
Processed page 2/43
Processed page 3/43
Processed page 4/43
Processed page 5/43
Processed page 6/43
Processed page 7/43
Processed page 8/43
Processed page 9/43
Processed page 10/43
Processed page 11/43
Processed page 12/43
Processed page 13/43
Processed page 14/43
Processed page 15/43
Processed page 16/43
Reached 100000 character limit at page 17

Extraction complete! Total characters: 100016

Preview of extracted text (first 500 characters):
--------------------------------------------------
1
A Survey on Knowledge Distillation of Large
Language Models
Xiaohan Xu1, M

### Llama Pre-Processing

Now let's proceed to justify our distaste for writing regex and use that as a justification for a LLM instead:

At this point, have a text file extracted from a PDF of a paper. Generally PDF extracts can be messy due to characters, formatting, Latex, Tables, etc. 

One way to handle this would be using regex, instead we can also prompt the feather light Llama models to clean up our text for us. 

Please try changing the `SYS_PROMPT` below to see what improvements you can make:

In [7]:
device = "cuda" if torch.cuda.is_available() else "cpu"

SYS_PROMPT = """
You are a world class text pre-processor, here is the raw data from a PDF, please parse and return it in a way that is crispy and usable to send to a podcast writer.

The raw data is messed up with new lines, Latex math and you will see fluff that we can remove completely. Basically take away any details that you think might be useless in a podcast author's transcript.

Remember, the podcast could be on any topic whatsoever so the issues listed above are not exhaustive

Please be smart with what you remove and be creative ok?

Remember DO NOT START SUMMARIZING THIS, YOU ARE ONLY CLEANING UP THE TEXT AND RE-WRITING WHEN NEEDED

Be very smart and aggressive with removing details, you will get a running portion of the text and keep returning the processed text.

PLEASE DO NOT ADD MARKDOWN FORMATTING, STOP ADDING SPECIAL CHARACTERS THAT MARKDOWN CAPATILISATION ETC LIKES

ALWAYS start your response directly with processed text and NO ACKNOWLEDGEMENTS about my questions ok?
Here is the text:
"""

In [8]:
print(device)

cuda


Instead of having the model process the entire file at once, as you noticed in the prompt-we will pass chunks of the file. 

One issue with passing chunks counted by characters is, we lose meaning of words so instead we chunk by words:

In [9]:
def create_word_bounded_chunks(text, target_chunk_size):
    """
    Split text into chunks at word boundaries close to the target chunk size.
    """
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    
    for word in words:
        word_length = len(word) + 1  # +1 for the space
        if current_length + word_length > target_chunk_size and current_chunk:
            # Join the current chunk and add it to chunks
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_length = word_length
        else:
            current_chunk.append(word)
            current_length += word_length
    
    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

Let's load in the model and start processing the text chunks

In [10]:
accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained(
    DEFAULT_MODEL,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained(DEFAULT_MODEL, use_safetensors=True)
model, tokenizer = accelerator.prepare(model, tokenizer)

In [11]:
def process_chunk(text_chunk, chunk_num):
    """Process a chunk of text and return both input and output for verification"""
    conversation = [
        {"role": "system", "content": SYS_PROMPT},
        {"role": "user", "content": text_chunk},
    ]
    
    prompt = tokenizer.apply_chat_template(conversation, tokenize=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        output = model.generate(
            **inputs,
            temperature=0.7,
            top_p=0.9,
            max_new_tokens=512
        )
    
    processed_text = tokenizer.decode(output[0], skip_special_tokens=True)[len(prompt):].strip()
    
    # Print chunk information for monitoring
    #print(f"\n{'='*40} Chunk {chunk_num} {'='*40}")
    print(f"INPUT TEXT:\n{text_chunk[:500]}...")  # Show first 500 chars of input
    print(f"\nPROCESSED TEXT:\n{processed_text[:500]}...")  # Show first 500 chars of output
    print(f"{'='*90}\n")
    
    return processed_text

In [14]:
INPUT_FILE = "./resources/extracted_text.txt"
CHUNK_SIZE = 1000

chunks = create_word_bounded_chunks(extracted_text, CHUNK_SIZE)
num_chunks = len(chunks)

processed_text = []

In [15]:
num_chunks

101

In [18]:
# Read the file
with open(INPUT_FILE, 'r', encoding='utf-8') as file:
    text = file.read()

# Calculate number of chunks
num_chunks = (len(text) + CHUNK_SIZE - 1) // CHUNK_SIZE

# Cell 6: Process the file with ordered output
# Create output file name
output_file = f"./resources/clean_{os.path.basename(INPUT_FILE)}"

In [19]:
with open(output_file, 'w', encoding='utf-8') as out_file:
    for chunk_num, chunk in enumerate(tqdm(chunks, desc="Processing chunks")):
        # Process chunk and append to complete text
        processed_chunk = process_chunk(chunk, chunk_num)
        processed_text += processed_chunk + "\n"
        
        # Write chunk immediately to file
        out_file.write(processed_chunk + "\n")
        out_file.flush()

Processing chunks:   0%|          | 0/101 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
1 A Survey on Knowledge Distillation of Large Language Models Xiaohan Xu1, Ming Li2, Chongyang Tao3, Tao Shen4, Reynold Cheng1, Jinyang Li1, Can Xu5, Dacheng Tao6, Tianyi Zhou2 1The University of Hong Kong2University of Maryland3Microsoft 4University of Technology Sydney5Peking University6The University of Sydney {shawnxxh,chongyangtao,hishentao }@gmail.com {minglii,tianyi }@umd.edu ckcheng@cs.hku.hk jl0725@connect.hku.hk Abstract —In the era of Large Language Models (LLMs), Knowledge Distillati...

PROCESSED TEXT:
as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT -4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, Knowledge Distillation plays a crucial role in compressing these models, and facilitating their self-improvement by employing themselves as teachers....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
advanced knowledge to smaller models and its utility in model compression and self- improvement. Our survey is meticulously structured around three foundational pillars: algorithm ,skill, and verticalization – providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the interaction between data augmentation (DA) and KD, illustrating how DA emerges as a powerful ...

PROCESSED TEXT:
their impact on various fields, including but not limited to machine learning (ML) and deep learning (DL). A meticulous examination of key mechanisms underlying knowledge discovery and deep learning, specifically within the context of knowledge graph models (KGMs), is presented. The role of data augmentation (DA) in enhancing the performance of these models is explored, highlighting its potential to improve contextual understanding, alignment, and semantic insigh

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
distillation and proposing future research directions. By bridging the gap between proprietary and open-source LLMs, this survey underscores the potential for more accessible, efficient, and powerful AI solutions. Most importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated Github repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs. Index Terms —...

PROCESSED TEXT:
idging the gap between proprietary and open-source LLMs, this survey underscores the potential for more accessible, efficient, and powerful AI solutions. Most importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of knowledge distillation.

Associated Github repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs. Index Terms —Large language mo

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
complexity, have unlocked new realms of possibility, from generating human-like text to offering sophisticated problem-solving capabilities. The core significance of these LLMs lies in their emergent abil- ities (Wei et al., 2022a,b; Xu et al., 2024a), a phenomenon where the models display capabilities beyond their explicit training objectives, enabling them to tackle a diverse array of tasks with remarkable proficiency. These models excel in understanding and generation, driving applications fr...

PROCESSED TEXT:
ophisticated problem-solving capabilities. The core significance of these LLMs lies in their emergent abilities (Wei et al., 2022a,b; Xu et al., 2024a), enabling them to tackle a diverse array of tasks with remarkable proficiency. These models excel in understanding and generation, driving applications from creative generation to complex problem-solving (OpenAI et al., 2023; Liang et al., 2022)....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
redefine our interaction with technology. Despite the remarkable capabilities of proprietary LLMs like GPT-4 and Gemini, they are not without their shortcom- ings, particularly when viewed in light of the advantages offered by open-source models. A significant drawback is their limited accessibility and higher cost (OpenAI et al., 2023). These proprietary models often come with substantial usage fees and restricted access, making them less attain- able for individuals and smaller organizations. ...

PROCESSED TEXT:
se proprietary models are limited in accessibility and come with higher costs. They are often tied to specific usage fees, making them inaccessible to individuals and smaller organizations. The use of these models raises concerns about data privacy and security, particularly when sensitive information is involved. This is especially pertinent for users handling confidential data....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
present significant challenges in leveraging the full potential of proprietary LLMs. In contrast to proprietary LLMs, open-source models like LLaMA (Touvron et al., 2023) and Mistral (Jiang et al., 2023a) bring several notable advantages. One of the primaryarXiv:2402.13116v4 [cs.CL] 21 Oct 2024 2 benefits of open-source models is their accessibility and adaptability. Without the constraints of licensing fees or restrictive usage policies, these models are more readily available to a broader rang...

PROCESSED TEXT:
, restrictive usage policies, and limited scalability....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
resources compared to their proprietary counterparts. One of the most significant limitations is the smaller model scale, which often results in lower per- formance on real-world tasks with a bunch of instruc- tions (Zheng et al., 2023a). These models, with fewer pa- rameters, may struggle to capture the depth and breadth of knowledge embodied in larger models like GPT-4. Ad- ditionally, the pre-training investment in these open-source models is typically less substantial. This reduced investmen...

PROCESSED TEXT:
significant limitations. One of the most notable is the smaller model scale, which often results in lower performance on real-world tasks with a large number of instructions. These models, with fewer parameters, may struggle to capture the depth and breadth of knowledge in larger models like GPT-4. Additionally, the pre-training investment in these open-source models is typically less substantial. This reduced investment can lead to a narrower range of pre-traini

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
particularly evident when these models are compared to the highly fine-tuned proprietary LLMs, which are often tailored to excel in a wide array of complex scenarios (OpenAI et al., 2023). Primarily, recognizing the disparities between propri- etary and open-source LLMs, KD techniques have surged as a means to bridge the performance gap between these models (Gou et al., 2021; Gupta and Agrawal, 2022). Knowl- edge distillation, in this context, involves leveraging the more advanced capabilities o...

PROCESSED TEXT:
like GPT-4 and Gemini outperforming open-source models. Knowledge distillation bridges the gap between these models by leveraging proprietary models' advanced capabilities as a framework. This process is akin to transferring a teacher's expertise to a student, where the student (open-source model) learns to replicate the teacher's performance characteristics....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
paradigm to achieve knowledge distillation of LLMs, where a small seed of knowledge is used to prompt the LLM to generate more data with respect to a specific skill or domain (Taori et al., 2023). Secondly, KD still retains its fundamental role in compressing LLMs, making them more efficient without significant loss in performance. (Gu et al., 2024; Agarwal et al., 2024). More recently, the strategy of employing open-source LLMs as teachers for their own self-improvement has emerged as a promisi...

PROCESSED TEXT:
mpt the LLM to generate more data with respect to a specific skill or domain. Secondly, KD retains its fundamental role in compressing LLMs, making them more efficient without significant loss in performance....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
compression for efficiency, and 3) an emerging trend of self-improvement via self-generated knowledge. (e.g., in-context learning (Huang et al., 2022a) and in- struction following (Taori et al., 2023)), improved align- ment with user intents (e.g., human values/principles (Cui et al., 2023a), and thinking patterns like chain-of-thought (CoT) (Mukherjee et al., 2023)), and NLP task specialization (e.g., semantic understanding (Ding et al., 2023a), and code generation (Chaudhary, 2023)). These ski...

PROCESSED TEXT:
owards in-context learning and instruction, improving alignment with user intents and thought patterns like chain-of-thought and NLP task specialization, essential for various applications, including casual conversations and complex problem-solving, particularly in specialized domains such as healthcare, law, and science....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
been extensively trained and fine-tuned in these areas. The benefits of knowledge distillation in the era of LLMs are multifaceted and transformative (Gu et al., 2024). Through a suite of distillation techniques, the gap between proprietary and open-source models is significantly nar- rowed (Chiang et al., 2023; Xu et al., 2023a) and even filled (Zhao et al., 2023a). This process not only streamlines computational requirements but also enhances the environ- mental sustainability of AI operations...

PROCESSED TEXT:
ceted and transformative. Through a suite of distillation techniques, the gap between proprietary and open-source models narrows significantly and is filled. This process streamlines computational requirements, enhancing the environmental sustainability of AI operations, as open-source models become more adept at lower overhead....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
and research domains. The escalating need for a comprehensive survey on the knowledge distillation of LLMs stems from the rapidly evolving landscape of AI (OpenAI et al., 2023; Team et al., 2023) and the increasing complexity of these models. As AI continues to penetrate various sectors, the ability to effi- ciently and effectively distill knowledge from proprietary LLMs to open-source ones becomes not just a technical aspiration but a practical necessity. This need is driven by the growing dema...

PROCESSED TEXT:
h a growing number of organizations and researchers actively exploring the possibilities and limitations of Large Language Models (LLMs). The increasing complexity of these models has created a pressing need for more efficient and effective methods for distilling knowledge from proprietary LLMs to open-source variants, driving a growing demand for accessible, cost-effective, and adaptable AI solutions....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
ReinforcementLearningoutputsreward RM!(·)distill SupervisedFine-tuningX,Y preferenceRankOptimizationy,1y,2y3y1y2y3≻≻rank…… DataCuration X,YrawdatasynthesizefeedbackFeedback input outputSelf-Knowledge outputinputinput YlabelLabelingExpansion X,YdemonstrationsexpandFeature featureinput,outputextractSec.4Sec.5 Sec.3.1Sec.3.2①②③④ Fig. 2: An overview of this survey on knowledge distillation of large language models. Note that ‘Section’ is abbreviated as ‘Sec.’ in this figure. RM S(·)denotes the stude...

PROCESSED TEXT:
k
Data curation, X, Y, raw data, synthesizing, feedback, input, output, self-knowledge, output, input, label, expansion, feature, input, output, extract, sec. 4, sec. 5, sec. 3.1, sec. 3.2, ①②③④
Fig. 2: An overview of this survey on knowledge distillation of large language models....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
future research. Survey Organization. The remainder of this survey is orga- nized into several comprehensive sections, each designed to offer a deep dive into the multifaceted aspects of knowledge distillation within the realm ofLLMs. Following this intro- duction, §2 provides a foundational overview of knowledge distillation, comparing traditional techniques with those emerging in the era of LLMs and highlighting the role of data augmentation (DA) in this context. §3 delves into the approaches ...

PROCESSED TEXT:
s organized into several comprehensive sections, each designed to offer a deep dive into the multifaceted aspects of knowledge distillation within the realm of LLMs. Following this introduction, Section 2 provides a foundational overview of knowledge distillation, comparing traditional techniques with those emerging in the era of LLMs and highlighting the role of data augmentation (DA) in this context.

Section 3 delves into the approaches to elicit knowledge fro

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
(NLU), genera- tion (NLG), information retrieval, recommendation systems, and the evaluation of text generation. In §5, we venture into domain-specific vertical distillation, showcasing how knowledge distillation techniques are applied within spe- cialized fields such as law, healthcare, finance, and science,illustrating the practical implications and transformative impact of these approaches. The survey suggests open problems in §6, identifying current challenges and gaps in knowledge distillat...

PROCESSED TEXT:
d the evaluation of text generation. In §5, we delve into domain-specific vertical distillation, showcasing how knowledge distillation techniques are applied within specialized fields such as law, healthcare, finance, and science, illustrating the practical implications and transformative impact of these approaches. The survey reveals open problems in §6, identifying current challenges and gaps in knowledge distillation research that offer opportunities for futur

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
model (teacher) to a smaller, more efficient model (student) (Gou et al., 2021). This technique is pivotal in mitigating the challenges posed by the computational demands and resource constraints of deploying large-scale models in practical applications. Historically, knowledge distillation techniques, prior to the era of LLMs, primarily concentrated on transferring knowledge from complex, often cumbersome neural net- works to more compact and efficient architectures (Sanh et al., 2019; Kim and ...

PROCESSED TEXT:
...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
CoT-Distill (Hsieh et al., 2023) Orca (Mukherjee et al., 2023), Orca 2 (Mitra et al., 2023), Baize (Xu et al., 2023b), Mammoth (Yue et al., 2023a), Mixed Distill (Chenglin et al., 2023) ExpansionSelf-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Code Alpaca (Chaudhary, 2023) Self-Align (Sun et al., 2024b), WizardLM (Xu et al., 2023a), WizardCoder (Luo et al., 2023a), WizardMath (Luo et al., 2023b), AugGPT (Dai et al., 2023a), TDG (He et al., 2023b) CurationUltraChat (Ding et al., 2...

PROCESSED TEXT:
or Baize (Xu et al., 2023b), or Mammoth (Yue et al., 2023a), or Mixed Distill (Chenglin et al., 2023) or ExpansionSelf-Instruct (Wang et al., 2022a), or Alpaca (Taori et al., 2023), or Code Alpaca (Chaudhary, 2023), or Self-Align (Sun et al., 2024b), or WizardLM (Xu et al., 2023a), or WizardCoder (Luo et al., 2023a), or WizardMath (Luo et al., 2023b), or AugGPT (Dai et al., 2023a), or TDG (He et al., 2023b), or CurationUltraChat (Ding et al., 2023b), or Phi-1 (Gu

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
(Tunstall et al., 2023), CycleAlign (Hong et al., 2023), RLAIF (Lee et al., 2023a), Lion (Jiang et al., 2023b), PERsD (Chen et al., 2023a), GKD (Agarwal et al., 2024) Self-KnowledgeSelf-Instruct (Wang et al., 2022a), Self-Align (Sun et al., 2024b), RLCD (Yang et al., 2024), ImpDistill (Jung et al., 2023), LMSI (Huang et al., 2023a), ReST (Gulcehre et al., 2023), Self-Rewarding (Yuan et al., 2024a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022) DistillationSupervised Fine-TuningAlpaca (T...

PROCESSED TEXT:
l., 2023b), PERsD (Chen et al., 2023a), Self-KnowledgeSelf-Instruct (Wang et al., 2022a), Self-Align (Sun et al., 2024b), RLCD (Yang et al., 2024), ImpDistill (Jung et al., 2023), LMSI (Huang et al., 2023a), ReST (Gulcehre et al., 2023), Self-Rewarding (Yuan et al., 2024a), Baize (Xu et al., 2023b), STaR (Zelikman et al., 2022) DistillationSupervised Fine-TuningAlpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Self-Instruct (

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
(Gu et al., 2024), GKD (Agarwal et al., 2024), GPT3 Reward (Kwon et al., 2023) Rank Optimization Zephyr (Tunstall et al., 2023), CycleAlign (Hong et al., 2023), Skill DistillationContext FollowingInstruction FollowingSelf-Instruct (Wang et al., 2022a), Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), WizardLM (Xu et al., 2023a), Orca (Mukherjee et al., 2023), Orca 2 (Mitra et al., 2023), WizardMath (Luo et al., 2023b), Llama-GPT4 (Peng et al., 2023a), Multi-turn DialogueVicuna (Chiang ...

PROCESSED TEXT:
3), CycleAlign (Hong et al., 2023), Skill DistillationContext FollowingInstruction FollowingSelf-Instruct (Wang et al., 2022a); Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023); WizardLM (Xu et al., 2023a); Orca (Mukherjee et al., 2023); Orca 2 (Mitra et al., 2023); WizardMath (Luo et al., 2023b); Llama-GPT4 (Peng et al., 2023a); Multi-turn DialogueVicuna (Chiang et al., 2023); Baize (Xu et al., 2023b); UltraLLaMA (Ding et al., 2023b); CAMEL (Li et al., 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Reward (Kwon et al., 2023), ILF (Scheurer et al., 2023), ALMoST (Kim et al., 2023a), RLEF (Roit et al., 2023), RLAIF (Lee et al., 2023a), Zephy (Tunstall et al., 2023), UltraFeedback (Cui et al., 2023a), ValueCAI (Bai et al., 2022a), Align Honesty (Yang et al., 2023a), SANDBOX (Liu et al., 2023b), Self-Align (Sun et al., 2024b), UltraFeedback (Cui et al., 2023a), RLCD (Yang et al., 2024) AgentTool UsingToolformer (Schick et al., 2023), Graph-ToolFormer (Zhang, 2023), Gorilla (Patil et al., 2023)...

PROCESSED TEXT:
23), RLAIF (Lee et al., 2023a), Zephy (Tunstall et al., 2023), UltraFeedback (Cui et al., 2023a), ValueCAI (Bai et al., 2022a), Align Honesty (Yang et al., 2023a), SANDBOX (Liu et al., 2023b), Self-Align (Sun et al., 2024b), UltraFeedback (Cui et al., 2023a), RLCD (Yang et al., 2024), AgentTool UsingToolformer (Schick et al., 2023), Graph-ToolFormer (Zhang, 2023), Gorilla (Patil et al., 2023), ToolAlpaca (Tang et al., 2023a), ToolLLM (Qin et al., 2023a), CRAFT (Y

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
2023a), Mix Distill (Chenglin et al., 2023), Annollm (He et al., 2023a), UDG (Wang et al., 2021a), ZeroGen (Ye et al., 2022), NLGInheritSumm (Xu et al., 2023c), RECOMP (Xu et al., 2024b), MaRio (Ramnath et al., 2023), ID (Jung et al., 2023), GPT-3 Labeling (Wang et al., 2021b), BioGPT (Guo et al., 2023a), ChatGPT NMT (Yang and Nicolai, 2023), Information RetrievalQUILL (Srinivasan et al., 2022), Promptgator (Dai et al., 2023b), InPars (Bonifacio et al., 2022), AugTriever (Meng et al., 2023), (Su...

PROCESSED TEXT:
(Ye et al., 2022), NLGInheritSumm (Xu et al., 2023c), RECOMP (Xu et al., 2024b), MaRio (Ramnath et al., 2023), ID (Jung et al., 2023), GPT-3 Labeling (Wang et al., 2021b), BioGPT (Guo et al., 2023a), ChatGPT NMT (Yang and Nicolai, 2023), Information RetrievalQUILL (Srinivasan et al., 2022), Promptgator (Dai et al., 2023b), InPars (Bonifacio et al., 2022), AugTriever (Meng et al., 2023), RankVicuna (Pradeep et al., 2023a), RankZephyr (Pradeep et al., 2023b), ExaRa

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
al., 2023) Phi-1 (Gunasekar et al., 2023), PERsD (Chen et al., 2023a), MFTCoder (Liu et al., 2023d), WaveCoder (Yu et al., 2024), Code Clean (Jain et al., 2023), Multi-ModalityLLaVA (Liu et al., 2023e), SVIT (Zhao et al., 2023b), LVIS-Instruct4V (Wang et al., 2023e), Shikra (Chen et al., 2023c), LSKD (Park et al., 2023), DetGPT (Pi et al., 2023; Zhao et al., 2023c), LRV (Liu et al., 2023f), NExT-GPT (Wu et al., 2023b), Valley (Luo et al., 2023d), ILuvUI (Jiang et al., 2023d), StableLLaVA (Li et ...

PROCESSED TEXT:
r (Yu et al., 2024), Code Clean (Jain et al., 2023), Multi-ModalityLLaVA (Liu et al., 2023e), SVIT (Zhao et al., 2023b), LVIS-Instruct4V (Wang et al., 2023e), Shikra (Chen et al., 2023c), LSKD (Park et al., 2023), DetGPT (Pi et al., 2023; Zhao et al., 2023c), LRV (Liu et al., 2023f), NExT-GPT (Wu et al., 2023b), Valley (Luo et al., 2023d), ILuvUI (Jiang et al., 2023d), StableLLaVA (Li et al., 2023c), PointLLM (Xu et al., 2023e), Verticalization DistillationLaw (H

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
to mimic the output of a larger teacher network, often through techniques like soft target training, where the student learns from the softened softmax output of the teacher. Please refer to the survey (Gou et al., 2021) for more details on general knowledge distillation techniques in AI and DL. In contrast, the advent of LLMs has revolutionized the knowledge distillation landscape. The current era of knowledge distillation in LLMs shifts the focus from mere architecture compression to knowledge...

PROCESSED TEXT:
g, where the student learns from the softened softmax output of the teacher. The survey (Gou et al., 2021) provides a general overview of knowledge distillation techniques in AI and deep learning. In contrast, the advent of large language models (LLMs) has significantly altered the knowledge distillation landscape. The current era of knowledge distillation in LLMs shifts the focus from mere architecture compression to knowledge elicitation and transfer (Taori et 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
focus in LLM-based knowledge distillation is to elicit the specific knowledge these models have. The key to this modern approach lies in heuristic and carefully designed prompts, which are used to elicit specific knowledge (Ding et al., 2023b) or capabilities (Chaudhary, 2023) from the LLMs. These prompts are crafted to tap into the LLM’s understanding and capabilities in various domains, ranging from natural language understanding (He et al., 2023a) to more complex cognitive tasks like reason- ...

PROCESSED TEXT:
ey to this modern approach lies in carefully designed prompts used to elicit specific knowledge from LLMs. These prompts tap into the LLM's understanding and capabilities in various domains. From natural language understanding to complex cognitive tasks like reasoning and problem-solving, the use of prompts as a means of knowledge elicitation offers a more flexible and dynamic approach to distillation....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
distillation also em- phasizes the transfer of more abstract qualities such as reasoning patterns (Mitra et al., 2023), preference align- ment (Cui et al., 2023a), and value alignment (Sun et al., 2024b). This is in stark contrast to the earlier focus on output replication (Taori et al., 2023), indicating a shift towards a more holistic and comprehensive transfer of cognitive capabilities. The current techniques involve not just the replication of outputs, but also the emulation of the thought p...

PROCESSED TEXT:
tputs. It emphasizes the integration of abstract reasoning patterns, preference alignment, and value alignment into the transfer process. This shift is driven by the need to enhance problem-solving and decision-making capabilities in students....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
knowledge distillation. Unlike traditional DA techniques such as paraphrasing (Gangal et al., 2022) or back-translation (Longpre et al., 2019), which primarily aim at expanding the training dataset in a somewhat mechanical manner, DA within the context of LLMs focuses on the generation of novel, context-rich training data tailored to specific domains and skills.The relationship between DA and KD in LLMs is both symbiotic and foundational. By leveraging a set of seed knowledge, KD employs DA to p...

PROCESSED TEXT:
LLMs) by generating novel, context-rich training data tailored to specific domains and skills. It leverages a seed knowledge set to prompt LLMs to create high-quality datasets that are both diverse and specific, bridging the knowledge gap between proprietary and open-source models....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
ensuring that the distilled models not only replicate the teacher model’s output behavior but also embody its deep-seated understanding and cognitive strategies. DA acts as a force multiplier, enabling the distilled mod- els to acquire and refine capabilities that would otherwise require exponentially larger datasets and computational re- sources. It facilitates a more effective transfer of knowledge, focusing on the qualitative aspects of learning rather than quantitative expansion. This strate...

PROCESSED TEXT:
derstanding and cognitive strategies, and facilitating a more effective transfer of knowledge....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Building on the discussions introduced earlier, this survey aims to comprehensively explore the landscape of knowl- edge distillation within the context of LLMs, following a meticulously structured taxonomy as in Figure 3. The survey’s scope is delineated through three primary facets: KD Algorithms, Skill Distillation, and Verticalization Dis- tillation. Each facet encapsulates a range of subtopics and methodologies. It’s important to note that KD algorithms provide the technical foundations for...

PROCESSED TEXT:
ndscape of knowledge distillation within the context of LLMs, following a meticulously structured taxonomy as in Figure 3. The survey's scope is delineated through three primary facets: KD Algorithms, Skill Distillation, and Verticalization Distillation. Each facet encapsulates a range of subtopics and methodologies.

KD Algorithms provide the technical foundations for skill distillation and verticalization distillation. This segment focuses on the technical foun

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
al., 2023), curation (Gu- nasekar et al., 2023), feature understanding (Agarwal et al., 2024), feedback mechanisms (Tunstall et al., 2023), and self- knowledge generation (Wang et al., 2022a). This exploration seeks to uncover the various ways in which knowledge can be identified, expanded, and curated for effective dis- tillation. The ‘ distillation ’ subsection examines learning ap- proaches like supervised fine-tuning (SFT) (Wang et al., 2022a), divergence minimization (Agarwal et al., 2024),...

PROCESSED TEXT:
s to simpler ones, is explored in this research. The process involves training a model on a dataset and then distilling the knowledge from that model into a simpler format. This can be achieved through various techniques, including supervised fine-tuning, divergence minimization, reinforcement learning, and rank optimization strategies.

Supervised fine-tuning is a technique where a model is trained on a dataset and then fine-tuned to adapt to new tasks or enviro

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
retrieval-augmented generation (RAG) Capa- bility. In the realm of alignment (Mitra et al., 2023; Tun- stall et al., 2023), the survey investigates thinking patterns, persona/preference modeling, and value alignment. The ‘agent’ category delves into skills such as Tool Using and Planning. NLP task specialization (Dai et al., 2023a; Jung et al., 2023; Chaudhary, 2023) is scrutinized through lenses like natural language understanding (NLU), natural lan- guage generation (NLG), information retrieva...

PROCESSED TEXT:
hinking patterns, persona/preference modeling, and value alignment. The 'agent' category examines skills such as tool usage and planning. NLP task specialization is scrutinized through natural language understanding, language generation, information retrieval, recommendation systems, text generation evaluation, and code generation. The survey addresses multi-modality, exploring how knowledge discovery (KD) enhances language models' ability to integrate multiple f

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
2023a), Finance (Zhang and Yang, 2023), Science (Zhang et al., 2024), among others. This exploration not only showcases the practical implications of KD tech- niques but also highlights their transformative impact on domain-specific AI solutions. Through these facets, this survey provides a compre- hensive analysis of KD in LLMs, guiding researchers and practitioners through methodologies, challenges, and op- portunities in this rapidly evolving domain. Declaration. This survey represents our ea...

PROCESSED TEXT:
hniques and their transformative impact on domain-specific AI solutions. Through these facets, this survey provides a comprehensive analysis of KD in LLMs, guiding researchers and practitioners through methodologies, challenges, and opportunities in this rapidly evolving domain. Declaration. This survey represents our earnest effort to provide a comprehensive and insightful overview of knowledge distillation techniques applied to LLMs, focusing on algorithms, ski

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
foundational paradigms of knowledge dis- tillation, highlighting key methodologies and their impacts across a range of applications. 2.4 Distillation Pipeline in LLM Era SeedKnowledgeSkill/Domain TeacherLLMKnowledgeElicitationStudentModelDistillationAlgorithmsteer driveGeneratedKnowledgeLearningObjectivetrain Fig. 4: An illustration of a general pipeline to distill knowl- edge from a large language model to a student model. The general distillation pipeline of LLMs is a structured and methodical...

PROCESSED TEXT:
across a range of applications, 2.4 Distillation Pipeline in LLM Era, SeedKnowledgeSkill/Domain TeacherLLMKnowledgeElicitationStudentModelDistillationAlgorithmsteer driveGeneratedKnowledgeLearningObjectivetrain...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
seen in Figure 2. I. Target Skill or Domain Steering Teacher LLM. The first stage involves directing the teacher LLM towards a specific target skill or domain. This is achieved through care- fully crafted instructions or templates that guide the LLM’s focus. These instructions are designed to elicit responses that demonstrate the LLM’s proficiency in a particular area, be it a specialized domain like healthcare or law, or a skill such as reasoning or language understanding. II. Seed Knowledge as...

PROCESSED TEXT:
stage involves directing the teacher LLM towards a specific target skill or domain. This is achieved through carefully crafted instructions or templates that guide the LLM's focus. These instructions are designed to elicit responses that demonstrate the LLM's proficiency in a particular area, such as healthcare or law, or a skill like reasoning or language understanding.

Seed Knowledge as Input

Once the target area is defined, the next step is to feed the teach

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
thereby creating more comprehensive and in-depth knowledge examples. III. Generation of Distillation Knowledge. In response to the seed knowledge and steering instructions, the teacher LLM generates knowledge examples. These examples are predominantly in the form of question-and-answer (QA) dialogues or narrative explanations, aligning with the nat- ural language processing/understanding capabilities of the LLM. In certain specialized cases, the outputs may also in- clude logits or hidden featur...

PROCESSED TEXT:
e to the seed knowledge and instructions, the teacher LLM generates knowledge examples predominantly in the form of question-and-answer dialogues or narrative explanations, aligning with the natural language processing/understanding capabilities of the LLM. In certain specialized cases, the outputs may also include logits or hidden features, although this is less common due to the complexity and specific requirements of such data forms. The generated knowledge ex

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
learning objectives. The loss function quantifies the student model’s performance in replicating or adapting the knowledge from the teacher model. By minimizing this loss, the student model learns to emulate the target skills or domain knowledge of the teacher, thereby acquiring similar capabilities. The process involves iteratively adjusting the student model’s parameters to reduce the discrepancy be- tween its outputs and those of the teacher model, ensuring the effective transfer of knowledge...

PROCESSED TEXT:
adapting the knowledge from the teacher model. By minimizing this loss, the student model learns to emulate the target skills or domain knowledge of the teacher, thereby acquiring similar capabilities. The process involves iteratively adjusting the student model's parameters to reduce the discrepancy between its outputs and those of the teacher model, ensuring the effective transfer of knowledge. This can be abstracted as two formulations: (1) D(kd) = {Parse(o, s

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
example ( e.g., (x, y)) from the teacher LLM’s output o(plus the input sin some cases), andpTrepresents the teacher LLM with parameters θT. Given the datasets D(kd) Ibuilt for distillation, we then define a learning objective as L=X ILI(D(kd) I;θS), (2) whereP Idenotes there could be multiple tasks or skills being distilled into one student model, LI(·;·)stands for a specific learning objective, and θSparameterizes the student model. Following our exploration of the distillation pipeline and the...

PROCESSED TEXT:
rformance of a teacher model. It involves two primary steps: 

1. Eliciting knowledge from teacher models 
2. Injecting knowledge into the student model...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
We will elaborate on these two processes in the subsequent sections. 3.1 Knowledge This section focuses on the approaches to elicit knowledge from teacher LLMs. According to the manners to acquire knowledge, we divided them into Labeling ,Expansion ,Data Curation ,Feature ,Feedback , and Self-Knowledge . Figure 5 shows an illustration of these knowledge elicitation meth- ods. 3.1.1 Labeling Labeling knowledge refers to using a teacher LLM to label the output yfor a given input xas the seed knowl...

PROCESSED TEXT:
tion or demonstrations....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
y∼pT(y|I⊕c⊕x)}. (3) Input xcould be sourced from existing NLP task datasets, which serve as typical reservoirs for distillation efforts. Numerous works have sought to harness the capa- bilities of powerful LLMs as teachers for annotating dataset samples across a range of tasks. For instance, efforts in natural language understanding involve using LLMs to cat- egorize text (Gilardi et al., 2023; Ding et al., 2023a; He et al., 2023a), while in natural language generation, LLMs assist in generating...

PROCESSED TEXT:
reservoirs for distillation efforts. Numerous works have sought to harness the capabilities of powerful LLMs as teachers for annotating dataset samples across a range of tasks. For instance, efforts in natural language understanding involve using LLMs to classify text. In natural language generation, LLMs assist in generating sequences for outputs. Text generation evaluation tasks leverage LLMs to label results. Reasoning tasks utilize LLMs for labeling Chains of

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
current works focus on labeling outputs based on instructions, thereby teaching student models to solve tasks in a more flexible way by following in- structions. Collections of various NLP tasks, complemented by instructional templates, serve as valuable input sources forx. For instance, FLAN-v2 collections (Longpre et al., 2023) offers extensive publicly available sets of tasks with instructions, which are labeled with responses generated by teacher LLMs in Orca (Mukherjee et al., 2023; Mitra e...

PROCESSED TEXT:
ctions, teaching student models to solve tasks in a more flexible way by following instructions. Collections of various NLP tasks, complemented by instructional templates, serve as valuable input sources for training.

FLAN-v2 collections, Longpre et al. (2023), offers extensive publicly available sets of tasks with instructions, labeled with responses generated by teacher LLMs in Orca (Mukherjee et al., 2023; Mitra et al., 2023). These tasks are built from prede

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
instructions Ior demonstrations c. A commonly used in- struction type for guiding labeling is chain-of-thought (CoT) prompt (Hsieh et al., 2023; Fu et al., 2023; Magister et al., 2023). Mukherjee et al. (2023) add multiple system messages (e.g. “You must generate a detailed and long answer.” or “explain like I’m five, think step-by-step”) to elicit rich signals. Yue et al. (2023a) and Chenglin et al. (2023) la- bel a hybrid of knowledge of chain-of-thought (CoT) and program-of-thought (PoT) rati...

PROCESSED TEXT:
-of-thought (CoT) prompt Hsieh et al. 2023 Fu et al. 2023 Magister et al. 2023 Mukherjee et al. 2023 Yue et al. 2023a Chenglin et al. 2023 Xu et al. 2023b...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
involved. To address these limitations, various expansion methods have been proposed (Wang et al., 2022a; Taori et al., 2023; Chaud- hary, 2023; Si et al., 2023; Ji et al., 2023a; Luo et al., 2023b,a; Wu et al., 2023c; Sun et al., 2024b; Xu et al., 2023a; Guo et al., 2023c; Rozi `ere et al., 2023; West et al., 2022). These methods take the demonstrations as seed knowledge and aim to expand a large scale and various data by in-context learning. A key characteristic of these expansion methods is t...

PROCESSED TEXT:
g the in-context learning ability of Large Language Models (LLMs) to generate data similar to the provided demonstrations. These methods aim to expand a large-scale and diverse dataset by in-context learning. A key characteristic of these expansion methods is the utilization of the in-context learning ability of LLMs to generate data similar to the provided demonstrations, as described in the following equation:

(x, y) ∈ D(exp) = {(x, y) | x∼PT(x|I⊕c) y∼PT(y|I⊕x

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
𝑚Meta-Information𝑐Demonstrations𝑥𝐼 𝑦 FilterFeedback ExtractFeature𝑥𝑦 DistributionIntermediateFeature 𝑥Input𝑦Output𝐼Instruction𝑦! 𝑦" 𝑦# 𝑥GuideFeedback𝑦#∗ 𝑦# Feedback Self-Knowledge StudentTeacher Generate≻≻𝑦" 𝑦! 𝑦# 𝑥 𝑥& CorrectExpand𝑐 Fig. 5: An illustration of different knowledge elicitation methods from teacher LLMs. Labeling : The teacher generates the output from the input; Expansion : The teacher generates samples similar to the given demonstrations through in- context learning; Data Curatio...

PROCESSED TEXT:
ibutionIntermediateFeature InputOutputInstruction! # GuideFeedback# ∗ Feedback Self-Knowledge StudentTeacher Generate≻≻# CorrectExpand Fig. 5: An illustration of different knowledge elicitation methods from teacher LLMs. Labeling : The teacher generates the output from the input; Expansion : The teacher generates samples similar to the given demonstrations through in-context learning; Data Curation : The teacher synthesizes data according to meta-information, suc

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
xand yrepresent the new input- output pairs generated by the teacher LLM. The input x is generated based on a set of input-output demonstrations c. The output yis then generated in response to the new input xunder the guidance of an instruction I. Note that the demonstrations could be predefined or dynamically updated by adding the newly generated samples. Expansion techniques have been widely utilized to extract extensive instruction-following knowledge from teacher LLMs. Wang et al. (2022a) fi...

PROCESSED TEXT:
troduced an iterative bootstrapping method, Self-Instruct, to utilize LLMs to generate a wide array of instructions based on several demonstrations sampled from 175 manually-written instructions

y The newly generated instructions are then added back to the initial pool, benefiting subsequent expansion iterations

Taori et al. (2023) applied this expansion method to a more powerful teacher LLM, text- davinci-003, to distill 52K high-quality data...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
diversity and coverage during expansion, Wu et al. (2023c) and (Sun et al., 2024b) prompt the teacher LLM to generate instructions corresponding to some specific topics. Xu et al. (2023a) propose an Evol-Instruct method to ex- pand the instructions from two dimensions: difficulty (e.g. rewriting the question to be more complex) and diversity (e.g. generating more long-tailed instructions). This Evol- Instruct method is domain-agnostic and has been used to expand the distillation of coding (Luo e...

PROCESSED TEXT:
acher LLM to generate instructions corresponding to some specific topics. Xu et al. (2023a) proposed an Evol-Instruct method to expand the instructions from two dimensions: difficulty and diversity. This Evol-Instruct method is domain-agnostic and has been used to expand the distillation of coding and math. Additionally, expansion methods can significantly augment NLP task datasets with similar samples, thereby enhancing task performance....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
automatically identifies challenging sub- groups within data and generates new samples for these subgroups using LLMs through in-context learning. In summary, the expansion method leverages the in- context learning strengths of LLMs to produce more var- ied and extensive datasets with both inputs and outputs. However, the quality and diversity of the generated data are heavily reliant on the teacher LLMs and the initial seed demonstrations. This dependence can lead to a dataset with inherent bia...

PROCESSED TEXT:
Data and Generates New Samples for These Subgroups Using LLMs Through In-Context Learning**

**Expansion Method Leverages In-Context Learning Strengths of LLMs to Produce Varied and Extensive Datasets**

However, the quality and diversity of the generated data are heavily reliant on the teacher LLMs and the initial seed demonstrations. This dependence can lead to a dataset with inherent bias from LLMs, and a homogeneity issue where the generated models may be pro

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
process can be meticulously controlled to yield datasets that are not only large in scale but also of high quality. The formulation for Data Curation can be represented as: D(cur)={(x, y)|x∼pT(x|I⊕m), y∼pT(y|I⊕x)}.(5) In this formulation, mrepresents the diverse meta- information used to guide the synthesis of x, and Iis the instruction guiding teacher LLMs to generate xory. Different studies primarily vary in their source and method of leveraging meta-information. UltraChat (Ding et al., 2023b)...

PROCESSED TEXT:
Ding et al., 2023b) demonstrates the process of curating high-quality and diverse data by distilled knowledge. They collect extensive meta-information across three main areas: Questions about the World, Creation and Generation, and Assistance on Existing Materials.

Under Questions about the World, they explore 30 meta-topics like Technology and Food and Drink....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
substantial scale of 1.5 million instances. UltraChat stands out with its lexical and topical diversity. The UltraLLaMA model, fine- tuned on this data, consistently surpasses other open-source models. Another notable series, phi(Gunasekar et al., 2023; Li et al., 2023a; Mar, 2023), focuses on distilling smaller, high-quality datasets akin to ”textbooks.” Phi-1 (Gunasekar et al., 2023) experiments with synthesizing ”textbook qual- ity” data in the coding domain. Their approach involves distillin...

PROCESSED TEXT:
rsity. The UltraLLaMA model, fine-tuned on this data, consistently surpasses other open-source models....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
times smaller in model size and 100 times smaller in dataset size. MFTCoder (Liu et al., 2023d) utilizes hundreds of Python knowledge points as meta-information to create a CodeExercise Dataset. In contrast, Magicoder (Wei et al., 2023) and WaveCoder (Yu et al., 2024) get raw code collections from open-source code datasets, using this as meta-information for generating instructional data. In the context of NLU tasks, certain studies (Ye et al., 2022; Gao et al., 2023a; Wang et al., 2021a) explor...

PROCESSED TEXT:
knowledge points as meta-information to create a code exercise dataset. In contrast Magicoder (Wei et al., 2023) and WaveCoder (Yu et al., 2024) get raw code collections from open-source code datasets, using this as meta-information for generating instructional data....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
large in scale. The success of models like phi-1 in specialized domains underscores the efficacy of this method. The abilityto create synthetic datasets will become a crucial technical skill and a key area of focus in AI (Li et al., 2023a). 3.1.4 Feature The previously discussed knowledge elicitation methods are typically applied to powerful black-box models, which are expensive and somewhat unreproducible due to calling API. In contrast, white-box distillation offers a more trans- parent and ac...

PROCESSED TEXT:
efficacy of this method. The ability to create synthetic datasets is a crucial technical skill and a key focus in AI. 3.1.4 Feature The previously discussed knowledge elicitation methods are typically applied to powerful black-box models, which are expensive and somewhat unreproducible due to calling API. In contrast, white-box distillation offers a more parent and accessible approach for researchers. It involves leveraging the output distributions, intermediate 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
2023; Liang et al., 2023a; Gu et al., 2024; Agarwal et al., 2024; Liu et al., 2023a; Wen et al., 2023; Wan et al., 2024a; Zhao and Zhu, 2023; Qin et al., 2023b; Boizard et al., 2024; Zhong et al., 2024). The typical method for acquiring this feature knowledge involves teacher LLMs annotating the output sequence y with its internal representations. These annotations are then distilled into the student model using methods such as Kullback-Leibler Divergence (KLD). The process of eliciting feature ...

PROCESSED TEXT:
n et al., 2024a; Zhao and Zhu, 2023; Qin et al., 2023b; Boizard et al., 2024; Zhong et al., 2024....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
distributions (Sanh et al., 2019; Wen et al., 2023). To leverage the rich semantic and syntactic knowledge in intermediate layers of the teacher model, TED (Liang et al., 2023a) designs task-aware layer-wise distillation. They align the student’s hidden representations with those of the teacher at each layer, selectively extracting knowledge pertinent to the target task. Gu et al. (2024) and Agarwal et al. (2024) introduce a novel approach where the student model first generates sequences, terme...

PROCESSED TEXT:
nowledge in intermediate layers of the teacher model, TED (Liang et al., 2023a) designs task-aware layer-wise distillation. They align the student’s hidden representations with those of the teacher at each layer, selectively extracting knowledge pertinent to the target task. Gu et al. (2024) and Agarwal et al. (2024) introduce a novel approach where the student model first generates sequences, termed ‘self-generated sequences.’ The student then learns by using fe

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
quantizing the LLMs, ensuring minimal loss of performance. Additionally, feature knowledge could serve as a potent source for multi-teacher knowledge distil- lation. Timiryasov and Tastet (2023) leverages an ensemble of GPT-2 and LLaMA as teacher models to extract output distributions. Similarly, FuseLLM (Wan et al., 2024a) inno- vatively combines the capabilities of various LLMs through a weighted fusion of their output distributions, integrating them into a singular LLM. This approach has the ...

PROCESSED TEXT:
erve as a potent source for multi-teacher knowledge distillation. Timiryasov and Tastet (2023) leverages an ensemble of GPT-2 and LLaMA as teacher models to extract output distributions. Similarly, FuseLLM (Wan et al., 2024a) inno- vatively combines the capabilities of various LLMs through a weighted fusion of their output distributions, integrating them into a singular LLM. This approach has the potential to significantly enhance the student model's capabilities

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
smaller models, its application is not suitable for black-box LLMs where internal parame- ters are inaccessible. Furthermore, student models distilled from white-box LLMs may underperform compared to their black-box counterparts, as the black-box teacher LLMs (e.g. GPT-4) tend to be more powerful. 3.1.5 Feedback Most previous works predominantly focus on one-way knowledge transfer from the teacher to the student for imitation, without considering feedback from the teacher on the student’s genera...

PROCESSED TEXT:
naccessible. Furthermore, student models distilled from white-box LLMs may underperform compared to their black-box counterparts, as the black-box teacher LLMs tend to be more powerful. 3.1.5 Feedback Most previous works predominantly focus on one-way knowledge transfer from the teacher to the student for imitation, without considering feedback from the teacher on the student's generation. The feedback from the teacher typically offers guidance on student-generat

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
where ydenotes the output generated by the student model in response to x, and ϕfb(·;θT))represents providing feedback from teacher LLMs. This operation evaluates the student’s output ygiven the input x, by offering assess- ment, corrective information, or other forms of guidance. This feedback knowledge can not only be distilled into the student to also generate feedback (such as creating a student preference model) but, more importantly, enable the student to refine its responses based on the ...

PROCESSED TEXT:
iding feedback from teacher LLMs
This operation evaluates the student's output ygiven the input x, by offering assessment, corrective information, or other forms of guidance
This feedback knowledge cannot only be distilled into the student to also generate feedback such as creating a student preference model
More importantly enables the student to refine its responses based on the feedback
Various methods have been explored to elicit this advanced knowledge
Bai e

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
teachers by prompting it with specific criteria. Bai et al. (2022a) in- troduce RLAIF for distilling harmlessness preferences from LLMs. This involves using an SFT-trained LLM to generate response pairs for each prompt, then ranking them for harmlessness to create a preference dataset. This dataset is distilled into a Preference Model (PM), which then guides the RL training of a more harmless LLM policy. Wizard- Math (Luo et al., 2023b) places emphasis on mathematical reasoning. They employ Chat...

PROCESSED TEXT:
harmlessness preferences from LLMs
This involves using an SFT-trained LLM to generate response pairs for each prompt, then ranking them for harmlessness to create a preference dataset
This dataset is distilled into a Preference Model (PM), which then guides the RL training of a more harmless LLM policy
Wizard- Math (Luo et al., 2023b) places emphasis on mathematical reasoning
They employ ChatGPT as teacher to directly provide process supervision and evaluate the 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
and helpfulness. Beyond merely assessing student generations, teachers can also furnish extensive feedback on instances where students underperform. In Lion (Jiang et al., 2023b), teacher model pinpoints instructions that pose challenges to the student model, generating new, more difficult instructions aimed at bolstering the student’s abilities. PERsD (Chen et al., 2023a) showcases a method where teacher offers tailored refinement feedback on incorrect code snippets gen- erated by students, gui...

PROCESSED TEXT:
nces where students underperform. In Jiang et al. (2023b), teacher model pinpoints instructions that pose challenges to the student model, generating new, more difficult instructions aimed at bolstering the student’s abilities. Perceived Performance Rating (PERsD) (Chen et al., 2023a) showcases a method where teacher offers tailored refinement feedback on incorrect code snippets generated by students, guided by specific execution errors encountered. SelFee (Ye et

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
wherein the student model initially generates sequences, followed by teacher model producing an output distribution as feedback. This method leverages the teacher’s insight to directly inform and refine the student model’s learning process. 3.1.6 Self-Knowledge The knowledge could also be elicited from the student itself, which we refer to as Self-Knowledge . In this setting, the same model acts both as the teacher and the student, iteratively improving itself by distilling and refining its own ...

PROCESSED TEXT:
roducing an output distribution as feedback. This method leverages the teacher's insight to inform and refine the student model's learning process. 3.1.6 Self-Knowledge The knowledge can also be obtained from the student itself, which we refer to as Self-Knowledge. In this setting, the same model acts both as teacher and student, iteratively improving itself by distilling and refining its previous outputs. This knowledge bypasses the need for an external, potenti

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
which could include but is not limited to filtering, rewarding, or any other mechanisms for enhancing or evaluating y. It could be governed by external tools or the student itself θS. Recent research in this area has proposed various innovative methodologies to elicit self-knowledge, demonstrating its potential for creating more efficient and autonomous learn- ing systems. (Allen-Zhu and Li, 2020; Wang et al., 2022a; Sun et al., 2024b; Yang et al., 2024; Jung et al., 2023; Huang et al., 2023a; G...

PROCESSED TEXT:
e, demonstrating its potential for creating more efficient and autonomous learning systems...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
model. Other methods aim to elicit targeted knowledge 11 from student models by modifying prompts, and leveraging these data for further refinement. In Self-Align (Sun et al., 2024b), they find that models fine-tuned by Self-Instruct data tend to generate short or indirect responses. They prompt this model with verbose instruction to produce in- depth and detailed responses. Then, they employ context- distillation (Askell et al., 2021) to distill these responses paired with non-verbose instructi...

PROCESSED TEXT:
m student models by modifying prompts, and leveraging these data for further refinement. In Self-Align (Sun et al., 2024b), they find that models fine-tuned by Self-Instruct data tend to generate short or indirect responses. They prompt this model with verbose instructions to produce in-depth and detailed responses. Then, they employ context distillation (Askell et al., 2021) to distill these responses paired with non-verbose instructions back to the model. Simil

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
summarization tasks, implementing filters based on entailment, length, and diversity to screen self-generated summaries. LMSI (Huang et al., 2023a) generates multiple CoT reasoning paths and answers for each question, and then retains only those paths that lead to the most consistent answer. Note that refined self-knowledge can be iteratively ac- quired as the student model continuously improves, further enhancing the student’s capabilities. This is Gulcehre et al. (2023) introduces a Reinforced...

PROCESSED TEXT:
ecifically focusing on Reinforced Self-Training (ReST) and its application in learning and improving self-knowledge. The Reinforced Self-Training approach combines two stages: Grow and Improve, where the student model generates multiple predictions, is ranked and filtered, and undergoes fine-tuning. This process enhances self-knowledge and improves the model's capabilities....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
2024a) introduces a framework resembling iterative DPO, where the language model is fine-tuned to differentiate the self-generated responses from the human-annotated data. These self-generated responses could be seen as “negative knowledge” to promote the student to better align with the target distribution. Self-Rewarding (Yuan et al., 2024a) explores a novel and promising approach by utilizing the language model itself as a reward model. It employs LLM- as-a-Judge prompting to autonomously ass...

PROCESSED TEXT:
differentiate between generated responses and human-annotated data, where self-generated responses can be seen as "negative knowledge" to promote alignment with the target distribution. Self-Rewarding explores a novel and promising approach by utilizing the language model itself as a reward model. The process can then be iterated to improve instruction following and reward modeling capabilities. 3.2 Distillation focuses on methodologies for transferring knowledge

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
and Rank Optimization , as shown in Figure 3. 3.2.1 Supervised Fine-Tuning Supervised Fine-Tuning (SFT), or called Sequence-Level KD (SeqKD) (Kim and Rush, 2016), is the simplest and one of the most effective methods for distilling powerful black-boxDivergence Type D(p, q)Function Forward KLDPp(t) logp(t) q(t) Reverse KLDPq(t) logq(t) p(t) JS Divergence1 2Pp(t) log2p(t) p(t)+q(t)+Pq(t) log2q(t) p(t)+q(t) TABLE 1: Functional forms of Dfor various divergence types. p: reference Similarity Functi...

PROCESSED TEXT:
, or called Sequence-Level KD (SeqKD) (Kim and Rush, 2016), is a simple and effective method for distilling powerful black-box Divergence Type D(p, q)Function Forward KLDPp(t) logp(t) q(t) Reverse KLDPq(t) logq(t) p(t) JS Divergence1 2Pp(t) log2p(t) p(t)+q(t)+Pq(t) log2q(t) p(t)+q(t)

**SeqKD**...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
formulated as minimizing the objective function: LSFT=Ex∼X,y∼pT(y|x)[−logpS(y|x)], (9) where yis the output sequence produced by the teacher model. This simple yet highly effective technique forms the basis of numerous studies in the field. Numerous re- searchers have successfully employed SFT to train student models using sequences generated by teacher LLMs (Taori et al., 2023; Chiang et al., 2023; Wu et al., 2023c; Xu et al., 2023a; Luo et al., 2023b). Additionally, SFT has been ex- plored in ...

PROCESSED TEXT:
the output sequence produced by the teacher model. This simple yet highly effective technique forms the basis of numerous studies in the field. Numerous re- searchers have successfully employed SFT to train student models using sequences generated by teacher LLMs (Taori et al., 2023; Chiang et al., 2023; Wu et al., 2023c; Xu et al., 2023a; Luo et al., 2023b). Additionally, SFT has been explored in many self-distillation works (Wang et al., 2022a; Huang et al., 20

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
groups: those minimizing divergence in probability distributions and those aimed at enhancing the similarity of hidden states. Divergence. Divergence-based methods minimize diver- gence between the probability distributions of the teacher and student models, represented by a general divergence function D: LDiv= E x∼X,y∼Y[D(pT(y|x), pS(y|x))], (10) The specific form of Dvaries depending on the type of divergence employed. Table 1 outlines the functional forms ofDfor different divergence measures....

PROCESSED TEXT:
tions and those aimed at enhancing the similarity of hidden states. Divergence-based methods minimize divergence between the probability distributions of the teacher and student models, represented by a general divergence function D: L_Div = ∑∼X, y∼Y[D(pT(y|x), pS(y|x)]. (10) The specific form of D varies depending on the type of divergence employed. Table 1 outlines the functional forms of D for different divergence measures. The commonly-used standard Kullback-

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
predom- inantly on the most prominent mode, thereby exhibiting a “mode-seeking” behavior. Wen et al., 2023; Timiryasov and Tastet, 2023; Liang et al., 2023a; Chen et al., 2024d) , which forces pSto cover all the modes of pT. However, when a student model is unable to learn all modes of a highly complex teacher, the re- sultant “mode-covering” behavior might cause the student to assign probability mass to tokens with low probability under the teacher’s distribution (cf. Figure 6 blue curve). This...

PROCESSED TEXT:
2023; Timiryasov and Tastet, 2023; Liang et al., 2023a; Chen et al., 2024d) which forces the teacher to cover all the modes of the teacher. However, when a student model is unable to learn all modes of a highly complex teacher, the resultant "mode-covering" behavior may cause the student to assign probability mass to tokens with low probability under the teacher's distribution. This mode-covering phenomenon can potentially lead to hallucinations and low-quality g

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
low-probability regions of the teacher’s distribution, employing Policy Gradient methods for optimization. Both Agarwal et al. (2024) and Sason and Verd ´u (2016) assess the effect of different divergence func- tions in LLM distillation, finding the optimal divergence to be task-dependent. For instance, forward KL divergence is more suitable for tasks like Machine Translation, where the output has fewer modes or variations, while reverse KL divergence is preferable for tasks like dialogue genera...

PROCESSED TEXT:
zation both Agarwal et al. (2024) and Sason and Verd´u (2016) assess the effect of different divergence functions in LLM distillation finding the optimal divergence to be task dependent For instance forward KL divergence is more suitable for tasks like Machine Translation where output has fewer modes or variations while reverse KL divergence is preferable for tasks like dialogue generation and instruction tuning which involve multiple modes and a wider range of p

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
two models. The objective is to ensure that the student model not only produces similar outputs to the teacher but also processes information in a comparable manner. The formulation for a similarity-based objective might look like this: LSim= E x∼X,y∼Y[LF(ΦT(fT(x, y)),ΦS(fS(x, y)))],(11) where fT(x, y)andfS(x, y)are the feature maps of the teacher and student models, respectively. The transforma-tion functions ΦTandΦSare applied to these feature maps to ensure they are in the same shape, facilit...

PROCESSED TEXT:
the teacher but also processes information in a comparable manner. The formulation for a similarity-based objective might look like this: LSim= E x∼X,y∼Y[LF(ΦT(fT(x, y),ΦS(fS(x, y)))],(11) where fT(x, y)andfS(x, y)are the feature maps of the teacher and student models, respectively. The transformation functions ΦTandΦSare applied to these feature maps to ensure they are in the same shape, facilitating direct comparison. The similarity function LFis used to match 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
discrepancy between the filtered representations in both teacher and student models. While similarity-based approaches are common in encoder-based LMs (Sun et al., 2019, 2020; Jiao et al., 2020; Hou et al., 2020; Zuo et al., 2022; Liang et al., 2021), their application in LLM knowledge distillation is not as widespread. However, considering their effectiveness, we anticipate an increase in research exploring these methods for LLM distillation in the near future. 3.2.3 Reinforcement Learning This...

PROCESSED TEXT:
ity-based approaches are common in encoder-based LMs (Sun et al., 2019, 2020; Jiao et al., 2020; Hou et al., 2020; Zuo et al., 2022; Liang et al., 2021), their application in LLM knowledge distillation is not as widespread. However, considering their effectiveness, we anticipate an increase in research exploring these methods for LLM distillation in the near future. 3.2.3 Reinforcement Learning This section explores advanced methods of distilling knowledge into s

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
involves training a reward model rϕusing the feedback data D(fd) generated by teacher LLMs. Preference data, as one of the typical feedback, is employed to train the student reward model (Bai et al., 2022a; Cui et al., 2023a; Lee et al., 2023a; Kim et al., 2023a). They usually consist of input-output pairs (x, yw, yl). Here, ywandylrepresent “winning” and “losing” outputs relative to the teacher’s preferences. The loss function for the reward model is defined as: LRM(rϕ,D(fd)) =− E (x,yw,yl)∼D(f...

PROCESSED TEXT:
ated by teacher LLMs. Typically, this involves input-output pairs (x, yw, yl) where ywandyl represent "winning" and "losing" outputs relative to the teacher's preferences. The loss function for the reward model is defined as: LRM(rϕ,D(fd)) = E (x,yw,yl)∼D(fd)[-logσ(rϕ(x, yw) - rϕ(x, yl))] (12) This formulation guides the reward model to distinguish between more and less preferable outputs based on the teacher's criteria. Instead of learning instance-level rewards

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Learning Optimization. In the second stage, the student model, represented by a policy πθ, is optimized to maximize the expected reward as per the trained reward model. Simultaneously, it minimizes the divergence from a reference policy πref, typically the initial policy of the student model trained by SFT, controlled by a factor β. The RL objective is given by: 13 max πθE x∼X,y∼πθ(y|x)[rϕ(x, y)]−βDKL[πθ(y|x)∥πref(y|x)] (13) This RL framework not only ensures that the student model learns the ex...

PROCESSED TEXT:
teacher model, represented by a policy πθ. The objective is to maximize the expected reward as per the trained reward model, while simultaneously minimizing divergence from the reference policy πref, typically the initial policy of the student model trained by SFT, controlled by a factor β. The RL objective is given by:

13 max θE x∼X,y∼πθ(y|x)[rϕ(x, y)] − βDKL[πθ(y|x)∥πref(y|x)] (13)...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
performance, it comes at a higher computational cost compared to employing a smaller distilled reward model. 3.2.4 Ranking Optimization Ranking optimization presents a stable and computationally efficient alternative to RL for injecting preference feedback into language models (Rafailov et al., 2023; Song et al., 2023a; Yuan et al., 2023b). This method, diverging from traditional RL approaches, directly incorporates ranking information into language models from a fixed preference dataset during ...

PROCESSED TEXT:
ard model. 3.2.4 Ranking Optimization presents a stable and computationally efficient alternative to RL for injecting preference feedback into language models. This method, diverging from traditional RL approaches, directly incorporates ranking information into language models from a fixed preference dataset during fine-tuning. Intuitively, it directly updates policy to increase the relative likelihood of preferred over less favored responses. This direct optimiz

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Preference Optimization (DPO) (Rafailov et al., 2023) to distill the preference alignment in teacher LLMs. DPO streamlines the objective of reinforcement learning (as in Eq. 13), which involves reward maximization with a KL-divergence constraint, into a single-stage policy training. Specifically, DPO’s training goal is to maximize the following expecta- tion: E (x,yw,yl)∼D(fd) logσ βlogπθ(yw|x) πref(yw|x)−βlogπθ(yl|x) πref(yl|x) , (14) where ywis preferred over ylaccording to the teacher LLM...

PROCESSED TEXT:
expectation: E(x,yw,yl)∼D(fd) log(βlogπθ(yw|x) πref(yw|x)−βlogπθ(yl|x) πref(yl|x)...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
probabilities under the policy πθ. This approach emphasizes direct comparison and ranking of responses based on the teacher’s preferences. PRO (Song et al., 2023a) expands the concept of pairwisecomparison to handle preference rankings of any length. For a given instruction xand a sequence of responses ordered by teacher preference as y1≻y2≻...≻yn, the RPO training objective is: LPRO=−n−1X k=1logexp (pk)Pn i=kexp (pi), (16) where pkrepresents the conditional log probabilities for ykunder the stu...

PROCESSED TEXT:
es direct comparison and ranking of responses based on teacher's preferences. PRO (Song et al., 2023a) expands the concept of pair-wise comparison to handle preference rankings of any length. For a given instruction x and a sequence of responses ordered by teacher preference as y1≻y2≻...≻yn, the RPO training objective is: 
LPRO=−n−1X k=1log(πk)Pn i=kexp(πi), (17) where pk represents the conditional log probabilities for yk under the student policy πθ. By iterativ

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
range of skills exhibited by LLMs, including Context Following ,Alignment ,Agent ,NLP Task Specializa- tion and Multi-Modality .Context Following focuses on the student’s ability to comprehend and respond effectively to input information. Alignment delves into the student’s capability to align its output with the teacher’s responses. Moving forward, Agent underscores the autonomous nature of language models. NLP Task Specialization highlights the LLM’s versatility in specializing across various ...

PROCESSED TEXT:
ion and Multi-Modality. Context Following focuses on the student's ability to comprehend and respond effectively to input information. Alignment delves into the student's capability to align its output with the teacher's responses. Moving forward, Agent underscores the autonomous nature of language models. NLP Task Specialization highlights the LLM's versatility in specializing across various Natural Language Processing tasks, demonstrating its adaptability. Fina

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
transferring the ability of LLMs to handle a variety of complex contexts — such as few-shot demonstrations, intricate instructions, dia- logue history, and retrieval-augmented information — into smaller models. Many research efforts in this domain aim to imbue smaller models with these sophisticated, context- following capabilities. Our discussion here will dissect this facet of skill distillation, categorizing it based on different types of context and elaborating on how each is distilled and i...

PROCESSED TEXT:
ven instructions. This ability significantly enhances human-AI interaction, allowing for seamless understanding and execution of tasks as directed by users. A primary method for acquiring this skill involves construct instruction-like prompt-response pairs and employing Supervised Fine Tuning (SFT) for model training....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
can be manually curated by human experts or transformed from existing NLP tasks into instructional 14 Methods Skill Seed Knowledge Teacher LLM Student Model Knowledge Elicitation Objective Context Following Self-Instruct (Wang et al., 2022a) IF 175 human-curated tasks GPT3 GPT3 Expansion + Self-Knowledge SFT Alpaca (Taori et al., 2023) IF 175 human-curated tasks GPT3 LLaMA Expansion + Self-Knowledge SFT LaMini-LM (Wu et al., 2023c) IF3.5K Wikipedia Categories + Mixed DatasetChatGPT Various Model...

PROCESSED TEXT:
l Knowledge Elicitation Objective Context Following Self-Instruct Wang et al., 2022a IF 175 human-curated tasks GPT3 GPT3 Expansion + Self-Knowledge SFT Alpaca (Taori et al., 2023) IF 175 human-curated tasks GPT3 LLaMA Expansion + Self-Knowledge SFT LaMini-LM (Wu et al., 2023c) IF3.5K Wikipedia Categories + Mixed Dataset ChatGPT Various Models Expansion SFT WizardLM (Xu et al., 2023a) IF Alpaca Data ChatGPT LLaMA Expansion SFT Lion (Jiang et al., 2023b) IF Alpaca

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Self-Rewarding (Yuan et al., 2024a) IF Human-written Samples LLaMA LLaMA Self-Knowledge SFT + RL STaR (Zelikman et al., 2022) IF Arithmetic + CommonsenseQA + GSM8K GPT-J GPT-J Self-Knowledge SFT Llama-GPT4 (Peng et al., 2023a) IF Alpaca Dataset GPT4 LLaMA Labeling SFT Reflection-Tuning (Li et al., 2023e) IF Alpaca/WizardLM Dataset ChatGPT LLaMA Labeling SFT Selective Reflection-Tuning (Li et al., 2024d) IF Alpaca/WizardLM Dataset ChatGPT LLaMA Labeling SFT Vicuna (Chiang et al., 2023) IF/MD Huma...

PROCESSED TEXT:
(Zelikman et al., 2022) IF Arithmetic + CommonsenseQA + GSM8K GPT-J GPT-J Self-Knowledge SFT Llama-GPT4 (Peng et al., 2023a) IF Alpaca Dataset GPT4 LLaMA Labeling SFT Reflection-Tuning (Li et al., 2023e) IF Alpaca/WizardLM Dataset ChatGPT LLaMA Labeling SFT Selective Reflection-Tuning (Li et al., 2024d) IF Alpaca/WizardLM Dataset ChatGPT LLaMA Labeling SFT Vicuna (Chiang et al., 2023) IF/MD Human Conversation ChatGPT + GPT4 LLaMA Labeling SFT Koala (Geng et al., 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
et al., 2023) IF/TP Human Conv, Flan/Code/Math Collection ChatGPT LLaMA Labeling SFT CoT-Distill (Hsieh et al., 2023) IF/TP e-SNLI + ANLI + CQA + SVAMP PaLM T5 Labeling SFT KnowPAT (Zhang et al., 2023a) IF/TP CPKG + QA Data ChatGPT + ChatGLM + Vicuna-7B LLaMA Labeling SFT DEBATunE (Li et al., 2024e) IF/TP Controversial Topics ChatGPT LLaMA Labeling SFT Phi-1 (Gunasekar et al., 2023) IF/Code - GPT3.5 phi-1 Curation SFT Phi-1.5 (Li et al., 2023a) IF/Code 20k Topics from Web GPT3.5 phi-1 Curation +...

PROCESSED TEXT:
Hsieh et al., 2023) IF/TP e-SNLI + ANLI + CQA + SVAMP PaLM T5 Labeling SFT KnowPAT (Zhang et al., 2023a) IF/TP CPKG + QA Data ChatGPT + ChatGLM + Vicuna-7B LLaMA Labeling SFT DEBATunE (Li et al., 2024e) IF/TP Controversial Topics ChatGPT LLaMA Labeling SFT Phi-1 (Gunasekar et al., 2023) IF/Code - GPT3.5 phi-1 Curation SFT Phi-1.5 (Li et al., 2023a) IF/Code 20k Topics from Web GPT3.5 phi-1 Curation + Labeling SFT SAIL (Luo et al., 2023c) IF/RAG Alpaca Data + Web C

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Human-written Prompts LLaMA LLaMA Expansion + Labeling SFT + RL RLCD (Yang et al., 2024) IF/Preference Human-written Prompts LLaMA LLaMA Labeling SFT + RL RLAIF (Lee et al., 2023a) IF/Preference Human-written Prompts PaLM 2 PaLM 2 Labeling + Feedback RL GPT3 Reward (Kwon et al., 2023) Preference Human-written Prompts GPT3 GPT3 Labeling RL ILF (Scheurer et al., 2023) Preference Task-specific Datasets GPT3 + FeedME GPT3 Labeling RL ULTRAFEEDBACK (Cui et al., 2023a) Preference Mixed Datasets GPT4 L...

PROCESSED TEXT:
ence Human-written Prompts LLaMA LLaMA Labeling SFT + RL RLAIF (Lee et al., 2023a) IF/Preference Human-written Prompts PaLM 2 PaLM 2 Labeling + Feedback RL GPT3 Reward (Kwon et al., 2023) Preference Human-written Prompts GPT3 GPT3 Labeling RL ILF (Scheurer et al., 2023) Preference Task-specific Datasets GPT3 + FeedME GPT3 Labeling RL ULTRAFEEDBACK (Cui et al., 2023a) Preference Mixed Datasets GPT4 LLaMA Labeling RL Constitutional AI (Bai et al., 2022a) Preference

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
API Documentation GPT4 LLaMA Expansion SFT GPT4Tools (Yang et al., 2023b) Tool Image Content ChatGPT LLaMA Curation + Expansion SFT ToolAlpaca (Tang et al., 2023a) Tool Public-apis Repository ChatGPT LLaMA Curation SFT ToolLLM (Qin et al., 2023a) Tool Real-world APIs ChatGPT LLaMA Curation SFT MLLM-Tool (Wang et al., 2024) Tool HuggingFace Model Cards GPT4 LLaMA Curation SFT FireAct (Chen et al., 2023b) Planning Mixed QA Dataset GPT4 LLaMA Labeling SFT AgentTuning (Zeng et al., 2023a) Planning 6...

PROCESSED TEXT:
PT LLaMA Curation + Expansion SFT ToolAlpaca (Tang et al., 2023a) Tool Public-apis Repository ChatGPT LLaMA Curation SFT ToolLLM (Qin et al., 2023a) Tool Real-world APIs ChatGPT LLaMA Curation SFT MLLM-Tool (Wang et al., 2024) Tool HuggingFace Model Cards GPT4 LLaMA Curation SFT FireAct (Chen et al., 2023b) Planning Mixed QA Dataset GPT4 LLaMA Labeling SFT AgentTuning (Zeng et al., 2023a) Planning 6 Agent Tasks GPT4 + ChatGPT LLaMA Labeling + Expansion SFT Lumos 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
al., 2021a) NLU NLU Tasks GPT3 BERT Expansion SFT InheritSumm (Xu et al., 2023c) NLG Pile + ArXiv + CNN/DM + WikiHow GPT3.5 ZCode++ Label SFT DIMSUM+ (Jung et al., 2023) NLG None GPT2 + CTRL + BioGPT T5 Curation + Self-Knowledge SFT Genie (Yehudai et al., 2024) NLG ELI5 + ASQA + NQ + CNN/DM Falcon + LLaMA FLAN + LLaMA Label SFT GKD (Agarwal et al., 2024) NLG/NLU/IF XSum+WMT14 en-de+GSM8K+FLAN2021 T5-XL T5 Feature + Feedback D&S + RL QUILL (Srinivasan et al., 2022) IR IR Datasets T5 4-layer Trans...

PROCESSED TEXT:
kiHow GPT3.5 ZCode++ Label SFT DIMSUM+ (Jung et al., 2023) NLG None GPT2 + CTRL + BioGPT T5 Curation + Self-Knowledge SFT Genie (Yehudai et al., 2024) NLG ELI5 + ASQA + NQ + CNN/DM Falcon + LLaMA FLAN + LLaMA Label SFT GKD (Agarwal et al., 2024) NLG/NLU/IF XSum+WMT14 en-de+GSM8K+FLAN2021 T5-XL T5 Feature + Feedback D&S + RL QUILL (Srinivasan et al., 2022) IR IR Datasets T5 4-layer Transformer Internal Knowledge D&S RankVicuna (Pradeep et al., 2023a) IR IR Dataset

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
PandaLM (Wang et al., 2023b) Evaluation Alpaca Data ChatGPT LLaMA Labeling SFT Prometheus (Kim et al., 2024) Evaluation 50 Seed Rubrics GPT4 LLaMA Labeling SFT InstructScore (Xu et al., 2023d) Evaluation Mixed Dataset GPT4 LLaMA Labeling SFT WizardMath (Luo et al., 2023b) Math GSM8k + MATH ChatGPT LLaMA Expansion + Feedback SFT + RL Mammoth (Yue et al., 2023a) Math/TP Mixed Math Dataset GPT4 LLaMA Labeling SFT Mixed Distill (Chenglin et al., 2023) Math/TP SVAMP + GSM8K + ASDIV + StrategyQA ChatG...

PROCESSED TEXT:
l., 2024) Evaluation of GPT4 LLaMA Labeling SFT InstructScore (Xu et al., 2023d) Evaluation of Mixed Dataset GPT4 LLaMA Labeling SFT WizardMath (Luo et al., 2023b) Evaluation of Math GSM8k + MATH ChatGPT LLaMA Expansion + Feedback SFT + RL Mammoth (Yue et al., 2023a) Evaluation of Math/TP Mixed Math Dataset GPT4 LLaMA Labeling SFT Mixed Distill (Chenglin et al., 2023) Evaluation of Math/TP SVAMP + GSM8K + ASDIV + StrategyQA ChatGPT LLaMa Labeling SFT WizardCoder 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
al., 2023) Code Code Datasets ChatGPT LLaMA Labeling SFT Multi-Modality LLaVA (Liu et al., 2023e) Vision-Language COCO GPT4 LLaMA Labeling SFT SVIT (Zhao et al., 2023b) Vision-Language Visual Genome + COCO GPT4 LLaMA Labeling SFT LVIS-Instruct4V (Wang et al., 2023e) Vision-Language LVIS GPT4V LLaMA Labeling SFT LLaVAR (Zhang et al., 2023d) Vision-Language LAION GPT4 LLaMA Labeling SFT Macaw-LLM (Lyu et al., 2023) Multiple Modalities Image/Video with Caption ChatGPT LLaMA Labeling SFT MIMIC-IT (L...

PROCESSED TEXT:
2023b) Vision-Language Visual Genome + COCO GPT4 Labeling SFT LVIS-Instruct4V (Wang et al., 2023e) Vision-Language LVIS GPT4V Labeling SFT LLaVAR (Zhang et al., 2023d) Vision-Language LAION GPT4 Labeling SFT Macaw-LLM (Lyu et al., 2023) Multiple Modalities Image/Video with Caption...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Divergence and Similarity, RL: Reinforcement Learning, RO: Ranking Optimization. formats with templates, such as prefacing machine transla- tion data with ”Translate this sentence to Spanish:” . However, these approaches have limitations. Manual data creation is labor-intensive, while template-based transformation lacks diversity in instructions and may not align well with natural human input. LLMs like GPT-4 offer an efficient alternative for creating diverse and controlled SFT data by their ca...

PROCESSED TEXT:
s, such as prefacing machine translation data with “Translate this sentence to Spanish:” 
However, these approaches have limitations. Manual data creation is labor-intensive, while template-based transformation lacks diversity in instructions and may not align well with natural human input. 
LLMs like GPT-4 offer an efficient alternative for creating diverse and controlled SFT data by their capabilities of in-context learning and instruction following. Most relev

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
GPT-3 to expand 15 a seed pool of 175 tasks to 52K task-agnostic instructions, ensuring a broad spectrum of general instructions. Addi- tionally, a filtering and post-processing stage is introduced to eliminate redundant or similar instructions. Notably, through training with this enriched dataset, GPT-3 acquires the ability to follow instructions, enabling it to perform comparably to InstructGPT in zero-shot instruction tasks and when provided with expert-written instructions for novel tasks. B...

PROCESSED TEXT:
nstructions through training on 52k demonstrations, enabling a broad spectrum of general instructions, and filtering redundant or similar instructions to improve overall instruction quality"...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Complex Instructions. Some works promote students to solve more complex instructions (Xu et al., 2023a; Luo et al., 2023b,a; Guo et al., 2023c). According to Xu et al. (2023a), in- struction datasets derived from human-written seeds often exhibit low to moderate complexity. To enhance the com- plex instruction-following capabilities of smaller models, WizardLM (Xu et al., 2023a) introduces Evol-Instruct . This method gradually transforms instructions into more com- plex forms through a multi-ste...

PROCESSED TEXT:
complex instructions (Xu et al., 2023a; Luo et al., 2023b, a; Guo et al., 2023c). According to Xu et al. (2023a), instruction datasets derived from human-written seeds often exhibit low to moderate complexity. To enhance the complexity of smaller models, WizardLM introduces Evol-Instruct. This method gradually transforms instructions into more complex forms through a multi-step evolution process, focusing on increasing difficulty levels and expanding topic divers

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
preliminary studies revealing the effectiveness of increasing instruction complexity. Instruction Fusion (Guo et al., 2023c) further uses teacher LLMs to increase the complexity by fusing two distinct evolved instructions. Furthermore, this concept of “evolving” instructions has been extended to distill specific skills such as coding (Luo et al., 2023a) and mathematics (Luo et al., 2023b). Human Instructions. In contrast to works that rely on gener- ating instructions from ChatGPT, which may lac...

PROCESSED TEXT:
xpands to include specific skills such as coding and mathematics....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
capture the reasoning process of the original teacher (Gudibande et al., 2023; Mukherjee et al., 2023). System Instructions. To encourage student models to learn the reasoning process, Orca and Orca 2 (Mukherjee et al., 2023; Mitra et al., 2023) enhance the prompt, response data pairs by introducing a system message (e.g., ”explain like I’m five, think step-by-step”) to encourage student mod- els to grasp the reasoning process. This system messageprompts GPT-4 to provide explanation traces that ...

PROCESSED TEXT:
student models, prompting them to explain step-by-step. This message will be:

"Explain like I'm five, think step-by-step"

This will help the student models understand the teacher's thought process and provide more accurate explanations.

We will further train the student models to identify the most effective solution strategy for each task, guided by the performance of Orca's model. This will improve the student models' ability to follow instructions that invol

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
2023b) distills large-scale data with high-quality and di- verse instructions from teacher LLMs by various meta- information. The UltraLLaMA model, fine-tuned on this data, consistently surpasses other open-source models. The Phi series models (Gunasekar et al., 2023; Li et al., 2023a; Mar, 2023) prioritize data quality and employ synthetic methods to generate data of “textbook quality” to enhance the learning experience for smaller models. Notably, Phi exhibits the ability to follow instruction...

PROCESSED TEXT:
r open-source models by distilling large-scale data with high-quality diverse instructions. The UltraLLaMA model is fine-tuned on this data, which prioritizes data quality and employs synthetic methods to generate "textbook quality" data for smaller models. Notably, Phi series models (Gunasekar et al., 2023; Li et al., 2023a; Mar, 2023) follow instructions effectively, even without fine-tuning. Phi-2, with 2.7 billion parameters, outperforms Mistral and Llama-2 m

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
the quality of responses. ExpertLLaMA (Xu et al., 2023f) improves the quality of responses by augment- ing vanilla instructions with specialized Expert Identity descriptions. Reflection-Tuning (Li et al., 2023e) improves both the instruction and response sequentially by reflecting on specific criteria. DEITA (Liu et al., 2023h) proposes to enhance and score instructions in three directions includ- ing complexity, quality, and diversity to get high-quality distillation data. MUFFIN (Lou et al., 2...

PROCESSED TEXT:
d descriptions. Reflection-Tuning (Li et al., 2023e) improves instruction sequentially by reflecting on specific criteria. DEITA (Liu et al., 2023h) enhances instructions in three directions including complexity, quality, and diversity to obtain high-quality distilled data. MUFFIN (Lou et al., 2023) scales instructions according to input diversifying tasks with various facets. Selective Reflection- Tuning (Li et al., 2024d) involves the student model in the pipel

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
Cur- rent small models have made strides in enhancing var- ious aspects of instruction-following ability, like diver- sity, complexity and explanation. However, student mod- els trained on instruction data expanded by ChatGPT of- ten mimic ChatGPT’s style without replicating its factual accuracy (Gudibande et al., 2023). Achieving a more ca- pable instruction-following capability requires a stronger teacher LLM (Gudibande et al., 2023) and access to di- verse, high-quality instruction data, such...

PROCESSED TEXT:
ersity, complexity, and explanation, but student models trained on instruction data expanded by ChatGPT often mimic its style without replicating factual accuracy (Gudibande et al., 2023). Achieving a more capable instruction-following capability requires a stronger teacher LLM (Gudibande et al., 2023) and access to diverse, high-quality instruction data, such as the one used in Orca (Mukherjee et al., 2023; Mitra et al., 2023), which incorporates extensive task 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
dialogue turns. Some works have been dedicated to train to small chat models by distilling multi-turn knowl- edge from teacher LLMs (Chiang et al., 2023; Xu et al., 2023b; Ding et al., 2023b; Li et al., 2023b; Wang et al., 2023c; Tunstall et al., 2023). ShareGPT serves as a platform for users to share their conversations with ChatGPT, offering a vast repository of multi-turn conversations readily available. Some small chat models are trained using this data to acquire the capability for engaging...

PROCESSED TEXT:
nstrating their ability to engage in multi-turn dialogues."...



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
aiming to incentivize student models to produce high-quality responses. Addi- tionally, Ye et al. (2023) enhance the quality of multi-turn data from ShareGPT by generating self-feedback on model responses and iteratively refining the responses based on the received feedback. To enhance the multi-turn capabilities of student models, another line of research focuses on expanding conversa- tional datasets through self-chat and using them to train smaller models (Xu et al., 2023b; Ding et al., 2023b...

PROCESSED TEXT:
023) enhance the quality of multi-turn data from ShareGPT by generating self-feedback on model responses and iteratively refining the responses based on the received feedback. To enhance the multi-turn capabilities of student models, another line of research focuses on expanding conversational datasets through self-chat and using them to train smaller models. For instance, Xu et al. (2023b) initiate their work by using questions sourced from Quora and Stack Overf

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
dialogues from ChatGPT. Notably, UltraChat encom- passes a wide range of topics and instructions. Building upon the UltraChat dataset, they fine-tune a LLaMA model, resulting in the creation of a powerful chat model known as UltraLLaMA. UltraLLaMA consistently outperforms other open-source chat models, including Vicuna and Baize. Fur- thermore, UltraChat is employed in conjunction with an AI preference-aligned chat model named Zephyr (Tunstall et al., 2023). Zephyr enhances intent alignment thro...

PROCESSED TEXT:
model known as UltraLLaMA. UltraLLaMA consistently outperforms other open-source chat models, including Vicuna and Baize....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
of retrieved information is also a non- trivial skill of LLMs. Several approaches to distill RAG capabilities have been proposed (Kang et al., 2023a; Luo et al., 2023c; Asai et al., 2023). SAIL (Luo et al., 2023c) starts by retrieving search results for each training case using search APIs, creating search- augmented instructions that include both the instruction and grounding information. To encourage the language model to prioritize informative retrieval results, they input each retrieved pass...

PROCESSED TEXT:
proposed. SAIL, for example, retrieves search results for each training case using search APIs and generates search-augmented instructions. These instructions include the instruction and grounding information, which are then input into teacher LLMs to generate responses. The search-augmented instructions and relevance labels are fine-tuned to encourage the model to prioritize informative retrieval results....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
rationales are then utilized to train two models: a student LM and a Reranker. For training the student LM, the rationales serve as a means to retrieve relevant knowledge d, and the student LM is subsequently fine-tuned using the rationales along- side questions and knowledge. However, during inference, only questions are available. To address this, the Reranker is trained to mimic how the retriever scores passages with the rationale by minimizing the KL divergence between Retriever (d|r)andRera...

PROCESSED TEXT:
critic model to minimize KL divergence between the critic and the retriever, thereby improving the overall performance of the student LMs....



Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
‘reflection to- kens.’ For instance, Self-Rag initiates the retrieval operation when generating the reflection token Retrieve . To distill this critic data, GPT-4 is prompted to assess the need for retrieval using few-shot demonstrations I, the task input x, and output yto predict a reflection token ras follows: p(r|I, x, y ). 4.2 Alignment 4.2.1 Thinking Pattern Most existing methods mainly focus on directly aligning the direct responses of the student models to the responses of teacher models ...

PROCESSED TEXT:
lf-Rag initiates the operation. To assess the need for retrieval, GPT-4 is prompted with input x and output y, and predicts a reflection token ras. 4.2.1 The current methods focus on matching direct responses to teacher models (Taori et al., 2023), but these models may learn to imitate the response style, not the reasoning process (Mukherjee et al., 2023). To improve this, novel thinking patterns are proposed that go beyond mere imitation (Ye et al., 2023; Mukher

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
by the effectiveness of LLMs in generat- ing their own feedback without relying on external mod- els (Schick et al., 2022; Madaan et al., 2023; Saunders et al., 2022), SelFee (Ye et al., 2023) proposes to train a 17 model that has been fine-tuned to continuously revise its own answer until it provides a high-quality response in a single inference. During training, it utilizes both the final response and feedback chain as the fitting target. This pat- tern, response with the revision process, sho...

PROCESSED TEXT:
SelFee proposes a 17 model that continuously revises its answer until it provides a high-quality response in a single inference. During training, it utilizes the final response and feedback chain as the target. This pattern shows a promising performance gain. Following SelFee, Reflection-Tuning also utilizes the reflection process as the learning pattern. Noting the lack of reasoning imitation of previous methods, Orca proposes Explanation Tuning, which learns th

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


INPUT TEXT:
experiments verify the effectiveness of distilling with this thinking pattern. The following Orca2 (Mitra et al., 2023) further presents to equip the student models with the ability to utilize different solution strategies for different tasks, mo- tivated by the capability discrepancies between the smaller and larger models. By employing this training pattern, the student models are able to gain a better reasoning ability. Be- sides learning with the corresponding revision or reflection process,...

PROCESSED TEXT:
al., 2023) study demonstrates that student models can be equipped with the ability to employ different problem-solving strategies for various tasks, motivated by the discrepancies between smaller and larger models. By implementing this training pattern, student models can improve their reasoning capabilities. This thinking pattern involves learning through the revision and reflection process....

INPUT TEXT:
distilled into the student models. 4.2.2 Preference The

Let's print out the final processed versions to make sure things look good

In [20]:
print(f"\nProcessing complete!")
print(f"Input file: {INPUT_FILE}")
print(f"Output file: {output_file}")
print(f"Total chunks processed: {num_chunks}")

# Preview the beginning and end of the complete processed text
print("\nPreview of final processed text:")
print("\nBEGINNING:")
print(processed_text[:1000])
print("\n...\n\nEND:")
print(processed_text[-1000:])


Processing complete!
Input file: ./resources/extracted_text.txt
Output file: ./resources/clean_extracted_text.txt
Total chunks processed: 101

Preview of final processed text:

BEGINNING:
['T', 'a', 'o', ' ', 'S', 'h', 'e', 'n', '4', ',', ' ', 'R', 'e', 'y', 'n', 'o', 'l', 'd', ' ', 'C', 'h', 'e', 'n', 'g', '1', ',', ' ', 'J', 'i', 'n', 'y', 'a', 'n', 'g', ' ', 'L', 'i', '1', ',', ' ', 'C', 'a', 'n', ' ', 'X', 'u', '5', ',', ' ', 'D', 'a', 'c', 'h', 'e', 'n', 'g', ' ', 'T', 'a', 'o', '6', ',', ' ', 'T', 'i', 'a', 'n', 'y', 'i', ' ', 'Z', 'h', 'o', 'u', '2', ' ', '1', 'T', 'h', 'e', ' ', 'U', 'n', 'i', 'v', 'e', 'r', 's', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'H', 'o', 'n', 'g', ' ', 'K', 'o', 'n', 'g', '2', 'U', 'n', 'i', 'v', 'e', 'r', 's', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'M', 'a', 'r', 'y', 'l', 'a', 'n', 'd', '3', 'M', 'i', 'c', 'r', 'o', 's', 'o', 'f', 't', ' ', '4', 'U', 'n', 'i', 'v', 'e', 'r', 's', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'T', 'e', 'c', 'h', 'n', 'o', 'l', 'o', 'g', 

### Next Notebook: Transcript Writer

Now that we have the pre-processed text ready, we can move to converting into a transcript in the next notebook

In [17]:
#fin