
Fine-Tuning Llama 3 on AMD Radeon GPUs

15 October 2024, by Garrett Byrd and Joe Schoonover.

Introduction

Source code and Presentation

This blog is a companion piece to the ROCm Webinar of the same name presented by Fluid Numerics, LLC on 15 October 2024. The source code for these materials is provided on GitHub.

What is Fine-Tuning?

Quantization

IEEE 754 / FP8 / FP4

Reduced-precision floating-point formats such as FP8 and FP4 keep the IEEE 754 structure of a sign bit, exponent bits, and mantissa bits; only the number of bits in each field shrinks. NormalFloat formats such as NF8 instead map values to int8 using quantile quantization.
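As a concrete illustration, here is a minimal sketch of selecting a quantized data type when loading a model with Hugging Face transformers and bitsandbytes. The 4-bit NF4 settings and the model ID are illustrative assumptions, not necessarily the exact configuration used in the webinar materials.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative 4-bit quantization config: weights are stored in the NF4
# (NormalFloat, quantile-quantized) data type, while matrix multiplies are
# computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # gated model; requires accepting Meta's license
    quantization_config=bnb_config,
    device_map="auto",
)
```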

Installation

Requirements

Install Steps

Hugging Face

Llama 3

Llama 3 likely needs no introduction; it is the model we use in this example. The Llama family, developed by Meta Platforms, Inc., is a collection of some of the most popular open-weight models available today. Meta offers 8 billion, 70 billion, and now even 405 billion parameter versions. Llama 3 was trained on 15 trillion tokens and supports a context window of 8K tokens. It also achieves consistently high scores on a variety of LLM benchmarks, such as MMLU, HumanEval, and GSM8K. At the time of publication, the Llama family has approximately 15 million downloads on the Hugging Face Hub. Today, we'll be using Llama-3-8B-Instruct, an 8 billion parameter variant that has been fine-tuned to follow instructions, i.e., to answer questions.
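Because the Instruct variant expects prompts in a specific chat format, a quick way to see that format is to render a conversation through the tokenizer's chat template. This is a minimal sketch assuming access to the gated meta-llama/Meta-Llama-3-8B-Instruct tokenizer; the messages are placeholders.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is fine-tuning?"},
]

# Render the messages into the model's expected prompt format,
# including its special tokens, and append the assistant header
# so the model knows to generate a reply.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```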

Example
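The full worked example is provided in the GitHub repository linked above. As a rough sketch of the general approach, parameter-efficient fine-tuning with LoRA adapters via the Hugging Face peft library looks roughly like the following; the rank, target modules, and dtype choices here are placeholder assumptions, not the webinar's actual settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Attach low-rank adapters to the attention projections; only the adapter
# weights are trained, the base 8B parameters stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the small trainable fraction
```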

Acknowledgments

Special thanks to Garrett Byrd and Dr. Joe Schoonover at Fluid Numerics, LLC for contributing this blog. The ROCm software ecosystem is strengthened by community projects that enable you to use AMD GPUs in new ways. If you have a project you would like to share here, please raise an issue or PR.

Find Fluid Numerics online at:

- fluidnumerics.com
- Reddit
- GitHub
- YouTube
- LinkedIn

References

Source Code

Available on GitHub

Papers

Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

Dettmers, Tim, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. "8-bit optimizers via block-wise quantization." arXiv preprint arXiv:2110.02861 (2021).

Dettmers, Tim, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. "QLoRA: Efficient finetuning of quantized LLMs." Advances in Neural Information Processing Systems 36 (2024).

Micikevicius, Paulius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, et al. "FP8 formats for deep learning." arXiv preprint arXiv:2209.05433 (2022).

Installation Pages

ROCm quick start installation guide

Miniconda Quick Command Line Install

PyTorch Installation Configurator

git-lfs

Disclaimers

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED “AS IS” WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.