Updated: 24 Jul 2024
We couldn’t contain our excitement after the release of Llama 3.1. According to Meta, the model sets new performance marks for open models and is competitive with, and on some benchmarks slightly ahead of, other prominent models such as GPT-4, GPT-4o, Mistral 7B, Gemma and Claude 3.5 Sonnet. What’s even more exciting? Meta describes Llama 3.1 as its most capable open-source AI model, one that can be fine-tuned, distilled and deployed anywhere. Continue reading as we explore the capabilities of Llama 3.1 and show you how to get started with it on Hyperstack.
About Llama 3.1
Llama 3.1 is Meta’s latest and most capable open-source AI model to date. The new model marks a significant leap forward in the capabilities and accessibility of AI technology and continues Meta's commitment to open-source AI development. The Llama 3.1 release introduces six new open LLMs based on the Llama 3 architecture.
Llama 3.1 Models
These models come in three sizes: 8 billion, 70 billion and 405 billion parameters, each available in both base (pre-trained) and instruct-tuned versions. The full list of Llama 3.1 models includes:
- Meta-Llama-3.1-8B: Base 8B model
- Meta-Llama-3.1-8B-Instruct: Instruct fine-tuned version of the base 8B model
- Meta-Llama-3.1-70B: Base 70B model
- Meta-Llama-3.1-70B-Instruct: Instruct fine-tuned version of the base 70B model
- Meta-Llama-3.1-405B: Base 405B model
- Meta-Llama-3.1-405B-Instruct: Instruct fine-tuned version of the base 405B model
In addition to these language models, Meta has also released two specialised models:
- Llama Guard 3: An updated safety model fine-tuned on Llama 3.1 8B.
- Prompt Guard: A small 279M parameter BERT-based classifier for detecting prompt injection and jailbreaking attempts.
Key Features of Llama 3.1
According to Meta, Llama 3.1 is the world’s largest and most capable openly available foundation model. Its key features include:
Multilingual Support
All Llama 3.1 variants support eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish and Thai. This expanded language support makes Llama 3.1 more accessible and useful for a global audience.
Increased Context Length and GQA
One of the most significant improvements in Llama 3.1 is the extended context length of 128K tokens. This substantial increase allows the models to process much longer pieces of text, which supports more complex tasks and analyses. Llama 3.1 also retains grouped-query attention (GQA), an efficient attention mechanism in which several query heads share one set of key and value heads, reducing the memory needed to handle long contexts.
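To make the link between GQA and long contexts concrete, here is a rough back-of-the-envelope sketch of key-value cache size at the full 128K context. The layer count, head counts and head dimension are assumptions based on the publicly released 8B model configuration and are not stated in this article. The function `kv_cache_bytes` is a hypothetical helper, and the estimate ignores weights, activations and framework overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor 2: one key vector and one value vector per KV head, per layer, per token.
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed 8B-style configuration (illustrative, not taken from this article).
NUM_LAYERS = 32
NUM_QUERY_HEADS = 32   # what the cache would need with plain multi-head attention
NUM_KV_HEADS = 8       # GQA: four query heads share each key/value head
HEAD_DIM = 128
CONTEXT_TOKENS = 128 * 1024  # 128K tokens

GIB = 1024 ** 3
gqa_gib = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT_TOKENS) / GIB
mha_gib = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, CONTEXT_TOKENS) / GIB

print(f"KV cache with GQA at 128K tokens: {gqa_gib:.0f} GiB")
print(f"KV cache with multi-head attention at 128K tokens: {mha_gib:.0f} GiB")
```

Under these assumptions, sharing key and value heads shrinks the BF16 cache by a factor of four, which is a large part of why GQA helps at long context lengths.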
Tool Calling
The instruct-tuned models in Llama 3.1 are fine-tuned for tool calling, making them suitable for agentic use cases. They come with two built-in tools (search and mathematical reasoning with Wolfram Alpha) and support custom JSON functions for further extensibility. The Llama 3.1 paper includes an example of Llama 3 executing multi-step planning, reasoning and tool calling to complete a task.
Source: Llama 3.1 Paper
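As a rough illustration of how an application might handle a custom JSON function call, the sketch below parses a tool call and dispatches it to a registered Python function. The JSON shape (`name` plus `parameters`), the `get_weather` tool and the `dispatch_tool_call` helper are all assumptions made for this example. The exact output format depends on the prompt template and serving stack you use.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical custom tool; a real one would call a weather service."""
    return f"Weather lookup requested for {city}"


# Registry of the custom functions the application exposes to the model.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(model_output: str) -> str:
    """Parse a JSON tool call and run the matching registered function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](**call.get("parameters", {}))


# Assumed example of what the model might emit when it decides to call a tool.
example_output = '{"name": "get_weather", "parameters": {"city": "London"}}'
result = dispatch_tool_call(example_output)
print(result)
```

In an agentic loop, the returned string would typically be sent back to the model as the tool result so it can continue the task.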
Improved Instruction and Safety Measures
The instruct models have been optimised to follow user instructions more effectively. With the introduction of Llama Guard 3 and Prompt Guard, Meta offers tools to improve the safety and security of AI applications built with Llama 3.1. The Llama 3.1 paper reports Prompt Guard's performance on in- and out-of-distribution evaluations, a multilingual jailbreak set built using machine translation, and a dataset of indirect injections from CyberSecEval.
Source: Llama 3.1 Paper
Model Evaluations of Llama 3.1
Meta has evaluated Llama 3.1 8B, Llama 3.1 70B and Llama 3.1 405B to assess their performance across various tasks and domains. Meta claims its flagship 405B model is competitive with leading foundation models, including Mistral 7B, GPT-4, GPT-4o and Claude 3.5 Sonnet. The full evaluation results are published in the Llama 3.1 paper.
Source: Llama 3.1 Paper
Model Architecture of Llama 3.1
The architecture of Llama 3.1 includes several key improvements:
- Decoder-Only Transformer: Llama 3.1 maintains a standard decoder-only transformer model architecture with minor adaptations. This choice was made to maximise training stability, opting for a more straightforward approach over more complex architectures like mixture-of-experts models.
- Scalable Training Process: To handle the massive scale of the 405B model, which was trained on over 15 trillion tokens, Meta significantly optimised its full training stack. The training process utilised over 16K H100 GPUs, making it the first Llama model trained at this scale.
- Iterative Post-Training Procedure: The development process included multiple rounds of supervised fine-tuning and direct preference optimisation. This iterative approach allowed for the creation of high-quality synthetic data for each round, progressively improving the performance of various capabilities.
- Improved Data Quality and Quantity: Compared to previous Llama versions, both the pre-training and post-training data saw improvements in quality and quantity. This included the development of more careful pre-processing and curation pipelines for pre-training data, as well as more rigorous quality assurance and filtering approaches for post-training data.
- Quantisation for Efficient Inference: To support large-scale production inference for the 405B model, Meta quantised the models from 16-bit (BF16) to 8-bit (FP8) numerics. This reduction in precision effectively lowered the compute requirements, allowing the model to run within a single server node.
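The quantisation point above can be checked with simple arithmetic. The sketch below compares the memory needed for the 405B model's weights alone in BF16 and FP8 against a single 8-GPU node. The 80 GB per-GPU figure and the 8-GPU node size are assumptions for illustration, and the estimate ignores the KV cache, activations and runtime overhead, so real headroom is smaller.

```python
NUM_PARAMS = 405_000_000_000   # 405B parameters
BYTES_BF16 = 2                 # 16-bit weights
BYTES_FP8 = 1                  # 8-bit weights

# Assumed node: 8 GPUs with 80 GB of memory each (not stated in this article).
GPUS_PER_NODE = 8
GPU_MEMORY_GB = 80


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


node_memory_gb = GPUS_PER_NODE * GPU_MEMORY_GB
bf16_gb = weight_memory_gb(NUM_PARAMS, BYTES_BF16)
fp8_gb = weight_memory_gb(NUM_PARAMS, BYTES_FP8)

print(f"Node memory: {node_memory_gb} GB")
print(f"BF16 weights: {bf16_gb:.0f} GB, fits in one node: {bf16_gb < node_memory_gb}")
print(f"FP8 weights:  {fp8_gb:.0f} GB, fits in one node: {fp8_gb < node_memory_gb}")
```

Under these assumptions the BF16 weights exceed the node's total memory while the FP8 weights leave room for the KV cache, which is consistent with running the 405B model on a single server node only after quantisation.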
What’s New About the Open-Source Case for Llama 3.1
With Llama 3.1, you now have unprecedented control over these models. You can customise them to fit your needs, train them on your datasets and conduct additional fine-tuning. This level of flexibility means you can tailor Llama 3.1 to your exact use case, whether it's natural language processing, code generation or specialised domain tasks.
One of the most exciting aspects is the deployment flexibility. You can run Llama 3.1 in virtually any environment. Need to keep your data on-premises? No problem. Want to leverage cloud scalability? Go for it. You can even run the smaller variants locally on a laptop for testing or small-scale applications. And here's the kicker - you can do all this without sharing your data with Meta.
While some might argue for the cost-effectiveness of closed models, the Llama 3.1 open-source AI models are proving to be highly competitive. According to Artificial Analysis, they offer some of the lowest costs per token in the industry. In practice, this means that generating text, analysing data or running AI applications costs less per unit of text processed than with many other models. The savings add up quickly for large-scale deployments and high-volume applications that process vast amounts of text.
Getting Started with Llama 3.1 on Hyperstack
On Hyperstack, getting started with Llama 3.1 is a straightforward process. After setting up your environment, you can download the Llama 3.1 model from the Hugging Face repository. Once it is downloaded, you can launch the web UI and load the model. Hyperstack's powerful hardware resources make it an ideal platform for fine-tuning, running inference on and experimenting with capable open-source AI models like Llama 3.1.
- For fine-tuning Llama 3.1, we recommend using NVIDIA H100 PCIe or NVIDIA H100 SXM GPUs.
- For inference, the recommended GPUs for various Llama 3.1 models are:
  - Meta-Llama-3.1-8B-Instruct: 1x NVIDIA A100 or NVIDIA L40. For budget-conscious users, we recommend the NVIDIA RTX A6000.
  - Meta-Llama-3.1-70B-Instruct: 4x NVIDIA A100
  - Meta-Llama-3.1-405B-Instruct-FP8: 8x NVIDIA H100 in FP8
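The download step described above could look roughly like the sketch below, which uses `snapshot_download` from the `huggingface_hub` library. The `meta-llama` organisation prefix, the local directory and the helper names `hub_repo_id` and `download_model` are assumptions for illustration. The Llama 3.1 repositories are gated, so you would likely need to accept the licence on Hugging Face and supply an access token.

```python
from typing import Optional


def hub_repo_id(model_name: str, org: str = "meta-llama") -> str:
    """Build a Hugging Face repository ID; the org prefix is an assumption."""
    return f"{org}/{model_name}"


def download_model(model_name: str, local_dir: str, token: Optional[str] = None) -> str:
    """Download every file in the model repository and return the local path."""
    # Imported lazily so the rest of this sketch runs without the library installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=hub_repo_id(model_name),
        local_dir=local_dir,
        token=token,  # access token for gated repositories
    )


repo_id = hub_repo_id("Meta-Llama-3.1-8B-Instruct")
print(repo_id)

# Example call (needs network access, the huggingface_hub package and a valid token):
# path = download_model("Meta-Llama-3.1-8B-Instruct", "./models/llama-3.1-8b-instruct", token="hf_...")
```

The download itself is left commented out because it transfers many gigabytes. Run it only on the machine where you intend to load the model.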
Sign up now to get started with Hyperstack.
FAQs
What is Llama 3.1?
Llama 3.1 is Meta’s latest open-source AI model. The release includes six new open LLMs ranging from 8 billion to 405 billion parameters.
What sizes are available for Llama 3.1 models?
Llama 3.1 models are available in three sizes: 8 billion, 70 billion, and 405 billion parameters, each offered in both base and instruct-tuned versions.
What are some key features of Llama 3.1?
The key features of Llama 3.1 include multilingual support for eight languages, an extended context length of 128K tokens, and tool calling capabilities with built-in tools for search and mathematical reasoning.