As demand for the technologies that enable generative AI continues to skyrocket, the computing that trains these models must keep pace, including its ability to tolerate failures. Researchers from the University of Michigan have designed a solution tailored specifically to modern AI workloads.
The research team developed Oobleck, an open-source framework for training large models that uses the concept of pipeline templates to provide fast, guaranteed failure recovery without degrading training throughput.
The results were presented in October 2023 at the 29th Symposium on Operating Systems Principles (SOSP) in Koblenz, Germany.
"Oobleck is a general-purpose solution for adding efficient resilience to any large model pre-training. As a result, its impact will be felt across foundation model pre-training for the full range of its applications, from big tech to high-performance computing and its uses in science and medicine," said Mosharraf Chowdhury, associate professor of electrical engineering and computer science and corresponding author of the paper.
Large language models require huge GPU clusters for long periods of time during pre-training, and the likelihood of experiencing failures grows with the size and duration of training. When a failure occurs, the synchronized nature of large language model pre-training amplifies the problem, as all participating GPUs sit idle until the failure is resolved.
Existing frameworks offer little systematic support for fault tolerance during large language model pre-training. Current solutions rely on checkpointing or recomputation to recover from failures, but both methods are time-consuming, cause cluster-wide slowdowns during recovery, and provide no formal fault-tolerance guarantees.
Pipeline templates are the core of Oobleck's design. A pipeline template is a specification for executing a training pipeline on a given number of nodes, and it is used to instantiate pipeline replicas. All pipeline templates are logically equivalent (i.e., they can be used together to train the same model) but physically heterogeneous (i.e., they use different numbers of nodes).
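As a rough illustration of the idea (not the actual Oobleck API; the class and field names below are hypothetical), a pipeline template can be thought of as a reusable recipe that maps a model's layers onto a fixed number of nodes, with several such recipes prepared for the same model:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PipelineTemplate:
    """Hypothetical sketch: a recipe for running one training pipeline
    on a fixed number of nodes. All templates for the same model are
    logically equivalent but physically heterogeneous (different node counts)."""
    num_nodes: int
    num_gpus_per_node: int
    # stage_layer_ranges[i] = (first_layer, last_layer) handled by pipeline stage i
    stage_layer_ranges: List[Tuple[int, int]]

    def num_stages(self) -> int:
        return len(self.stage_layer_ranges)

# Example: three heterogeneous templates for the same 48-layer model.
templates = [
    PipelineTemplate(num_nodes=2, num_gpus_per_node=8,
                     stage_layer_ranges=[(0, 23), (24, 47)]),
    PipelineTemplate(num_nodes=3, num_gpus_per_node=8,
                     stage_layer_ranges=[(0, 15), (16, 31), (32, 47)]),
    PipelineTemplate(num_nodes=4, num_gpus_per_node=8,
                     stage_layer_ranges=[(0, 11), (12, 23), (24, 35), (36, 47)]),
]
```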
"Oobleck is the first work that exploits the redundancy inherent in large language model training for fault tolerance through a combination of pre-generated heterogeneous templates. Together, this framework provides high throughput with maximum utilization, guaranteed fault tolerance, and fast recovery without the overhead of checkpointing- or recomputation-based approaches," said Insu Jang, a doctoral student in computer science and engineering and first author on the paper.
Given a training job that specifies a maximum number of simultaneous failures to tolerate, f, Oobleck's execution engine instantiates at least f + 1 heterogeneous pipelines from the generated set of templates. The fixed global batch is distributed across the pipeline replicas in proportion to their computing power, so that no replica becomes a straggler at training synchronization.
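A minimal sketch of the proportional load balancing described above, assuming compute power is approximated by each pipeline's node count (the function name and rounding scheme are illustrative, not taken from the paper):

```python
def split_global_batch(global_batch: int, pipeline_node_counts: list) -> list:
    """Hypothetical sketch: split a fixed global batch across pipeline
    replicas in proportion to their compute (approximated by node count),
    while keeping the total exactly equal to global_batch."""
    total_nodes = sum(pipeline_node_counts)
    shares = [global_batch * n // total_nodes for n in pipeline_node_counts]
    # Hand out any rounding remainder one sample at a time to the largest pipelines.
    remainder = global_batch - sum(shares)
    largest_first = sorted(range(len(shares)),
                           key=lambda i: pipeline_node_counts[i], reverse=True)
    for i in largest_first[:remainder]:
        shares[i] += 1
    return shares

# e.g. tolerating f = 2 failures means instantiating at least f + 1 = 3 pipelines:
print(split_global_batch(512, [4, 3, 2]))  # -> [228, 171, 113]
```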
When failures occur, Oobleck simply re-instantiates pipelines from the precomputed pipeline templates, avoiding an expensive search for a new optimal configuration at runtime. Because the set of pipeline templates is computed ahead of time, Oobleck is provably able to recover as long as f or fewer nodes fail simultaneously.
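To make the recovery step concrete, here is a hedged sketch (reusing the hypothetical PipelineTemplate objects from the earlier example; the greedy covering strategy is an illustration, not Oobleck's actual reconfiguration algorithm) of how surviving nodes can be re-covered with precomputed templates instead of re-planning from scratch:

```python
def rebuild_pipelines(surviving_nodes: int, templates) -> list:
    """Hypothetical sketch of template-based recovery: cover the surviving
    nodes with precomputed templates (largest first) rather than searching
    for a new optimal configuration at runtime."""
    by_size = sorted(templates, key=lambda t: t.num_nodes, reverse=True)
    plan, remaining = [], surviving_nodes
    for template in by_size:
        while template.num_nodes <= remaining:
            plan.append(template)
            remaining -= template.num_nodes
    return plan

# After losing nodes, e.g. 7 survivors and templates for 2/3/4 nodes:
# the plan becomes one 4-node pipeline plus one 3-node pipeline,
# assembled directly from the precomputed templates.
```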
Resilience to unexpected events is a classic problem in computer science. Instead of reacting to failures after they happen, which is slow, or planning for every possible scenario, which is practically impossible, pipeline templates strike a balance between speed and efficiency in elastic distributed computing.
"Oobleck provides the first proof of the effectiveness of this idea, but it can potentially be applied to any distributed computing setting where the same dichotomy exists. Going forward, we want to apply pipeline templates to improve elasticity in all aspects of generative AI applications, starting with inference serving systems," Chowdhury said.
Oobleck is open source and available on GitHub.
More information:
Insu Jang et al., Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates, Proceedings of the 29th Symposium on Operating Systems Principles (2023). DOI: 10.1145/3600006.3613152
Provided by the University of Michigan College of Engineering
Citation: Open source training framework speeds up large language model pretraining when failures arise (2023, December 18) Retrieved December 18, 2023 from https://techxplore.com/news/2023-12-open-source-framework-large-language-pretraining.html
This document is subject to copyright. Notwithstanding any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for informational purposes only.