Optimizing AI Training Platform with Large Model Support
Apply Large Model Support (LMS) with the PyTorch and TensorFlow deep learning frameworks to “extend” GPU memory into system memory on IBM POWER AC922
Artificial Intelligence: Advanced
One of the challenges when training a deep learning model (especially one with a large neural network architecture, a large dataset, or both) is the limited memory available in the GPU (Graphics Processing Unit) within one system. Because a typical training run requires many epochs (hundreds or thousands), it usually takes a substantial amount of time (days, even weeks) to complete the multiple runs needed to find a suitable model across various hyperparameter settings.
Today, data scientists approach this by training on part of the dataset at a time (that is, dividing the total dataset into several batches within one epoch). The ideal way to speed up training would be to load all the data into GPU memory for each epoch; however, this is rarely possible because the memory of a single GPU (an NVIDIA Tesla GPU, for example) is limited to 8, 12, 16, 24 or 32 GB. The alternative is to adjust the batch size to fit into the available GPU memory.
As models grow in complexity (recent neural networks typically consist of more and more layers) and datasets become larger, data scientists struggle to stay within the limited GPU memory of a single system, sometimes pushing to the limit by lowering the training batch size as far as possible.
Large datasets usually come in the form of high-definition video and images, or large collections of text such as those used for the recent GPT-2 (Generative Pre-trained Transformer 2) language model.
Of course, it is possible to use multiple high-end GPUs within a single machine (for training or inference) or even across multiple machines; however, such a high-end configuration is accessible to only a select few. For example, this kind of distributed deep learning platform within a High-Performance Computing (HPC) environment is provided by the IBM Summit supercomputer, the fastest supercomputer in the world. A few similar HPC systems are also available at selected research, government, and higher-education institutions.
One practical approach provided with the IBM POWER server configured for AI (IBM AC922, Accelerated Computing on IBM POWER9) is a software feature called Large Model Support (LMS). LMS is designed to ‘extend’ GPU memory into system memory (CPU memory) through software, giving flexibility in managing GPU-CPU memory allocation in heavily GPU-bound applications such as training a deep learning model.
This article explores the LMS capability and its practical application with the PyTorch and TensorFlow deep learning frameworks, using Python on an IBM POWER AC922 server.
Preparing The Platform: IBM POWER AC922 on IBM CECC
Preparing the platform involves reserving the machine (IBM POWER AC922 on IBM CECC) and accessing the active machine over VPN.
Reserving The Machine
Unless we have direct access to a physical IBM POWER AC922 machine, we need another way to experiment with LMS on an IBM POWER AC922 server. The practical way is to reserve a machine in an IBM-provided cloud, IBM CECC (IBM Systems Worldwide Client Experience Center). The machine can be reserved for a short period, typically just 7 days.
CECC is the worldwide cloud platform of IBM Systems (IBM's hardware division) for creating temporary virtual environments, and it has several locations worldwide. CECC delivers a single interface with an offering catalog and provides automated provisioning of, and access to, Power Systems resources. These resources can be used for testing, Proof of Technology, Proof of Concept, development or other purposes in support of IBM, IBM Business Partners and IBM customers.
A CECC portal user needs an IBMID to access cloud resources. Following successful authentication, the user is redirected back to the CECC portal. Only IBMers and IBM Business Partners (BP) are currently allowed access to the portal. A user who is not identified as an IBMer or registered BP is redirected to a page with an IBM PartnerWorld registration link, in case they would like to become a registered partner.
We explore LMS in a temporary Watson Machine Learning Community Edition (WML-CE) virtual environment (reserved for 7 days) running on an IBM POWER AC922 server hosted by IBM CECC. The commercial version of WML-CE is WMLA (Watson Machine Learning Accelerator).
The following is a step-by-step guide to creating a temporary virtual environment on IBM AC922, hosted by CECC.
1. Select “IBM WML CE” (PaaS) from Image Categories.
2. Click the “Add to cart” button.
3. System settings and maximum value are presented in a pop-up window.
4. Select and change any settings available in the drop downs.
5. Select “Add” to add the system to the cart.
6. When we are satisfied with the systems and choices in our cart, select “Checkout”.
The checkout process will prompt us to select a start date for the reservation. The default start date options and durations are based on the system configuration settings in the CECC Portal.
The default start date is 50 minutes from the current time (provided that the earliest slot is available). Any change to the start date must be at least 45 minutes in the future. Times are automatically displayed in the browser's time zone. The default duration is usually 7 days; it can be decreased to no less than one day or increased up to the maximum value (on request), which is displayed when the system is selected with “Add to cart”. Scheduling for longer than the system's maximum duration is not allowed. The “Start date is flexible” box is checked by default.
This provides the option to select other dates/times if the initial dates/times that we provided are not available. This option will save time, so we do not have to guess what times are available.
If specific times are required for the project, uncheck the “Start date is flexible” box and the CECC portal will inform if those dates are available. If created, the checkout overlay is displayed to change dates/times. If the initial dates/times are not available and the “Start date is flexible” box is checked, three alternate available dates are displayed. These dates are held for a limited time.
If the dates/times are still available, a notification is displayed on the screen that the project has been successfully committed, as in illustration-4. An email with the related information will be sent, followed by another email when each system in the project becomes active.
Once the project is active, visit the “My Projects” page to review the project status, scheduled start and end dates/times, and connection information. Items in our cart are discarded when our IBMID session expires. An email is also sent before the reservation ends so that it can be extended if needed.
Connecting the VPN
At this point we have the VPN client installed. To log in to the center, start the client, enter asa003b.cebters.ihost.com in the box and click Connect. When prompted, choose CECC as the group (unless told otherwise), and enter the VPN ID and password.
We are now able to access the assigned resources.
LMS, Large Model Support in IBM POWER AC922
LMS directly addresses a large-scale challenge in deep learning: limited GPU memory. When data scientists develop a deep learning model, the matrices that make up the neural network and the data elements used to train it (in a batch) must fit in GPU memory. As models grow in complexity and datasets increase in size, data scientists are forced to make tradeoffs to stay within the GPU's memory limit (16 GB or 32 GB, for example). On IBM POWER AC922 with LMS, we can train models with higher-quality data (for example, higher-resolution images), ultimately making it possible to develop more accurate models.
AC922 gives LMS an even bigger boost because it is equipped with NVIDIA's NVLink 2.0. Specifically, with AC922 and NVLink 2.0, CPUs and GPUs communicate over a coherent interface. NVLink gives the GPUs near-direct access to system memory, which lets the GPUs and data workflows use system memory for larger, more complex, higher-resolution models.
With LMS, enabled by AC922's unique NVLink connection between CPU (memory) and GPU, the entire model and dataset can be quickly loaded into system memory and cached down to the GPU for action. According to IBM's internal testing (Chris Eaton, Scott Soutter, Paul Zikopoulos), LMS has enabled increases in model size (more layers, larger matrices), in data element size (higher-definition images), and in batch size (for faster time to convergence).
LMS enables data scientists to load models that span nearly an entire terabyte of system memory across 4 GPUs, so much more can be done within a cluster of AC922 servers. LMS is also supported on AC922 with 6 GPUs.
Imagine that we are using a 480 x 480 pixel image to try to detect a faulty part. If that image uses the R (red), G (green) and B (blue) color channels, each picture feeds into the neural network's first layer as 691,200 inputs (480 x 480 x 3) for every class of faulty part the network is trying to detect. With the state-of-the-art EfficientNet-B7 architecture, the figure is 600 x 600 x 3, or 1,080,000 inputs to the first layer.
Now consider that higher-resolution images could make it possible to spot characteristics in the image that we could not observe before (we experience this all the time when watching HDTV, where we can see every wrinkle on someone's face compared with older, lower-resolution TVs). Consider a picture captured on a 6-megapixel camera, which produces an image of 3,000 x 2,000 pixels. In our example, that would increase the inputs by a whopping 26x: (3000 x 2000 x 3) divided by (480 x 480 x 3). Without a feature like LMS, we are forced to settle for lower-quality input data, which means the model cannot be trained as well as it could be with better resolution.
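The input counts quoted above are easy to check. The short sketch below recomputes them; the helper name `input_count` is ours and purely illustrative.

```python
def input_count(width: int, height: int, channels: int = 3) -> int:
    """Number of values one image feeds into the first layer."""
    return width * height * channels


small = input_count(480, 480)          # the 480 x 480 RGB example
efficientnet = input_count(600, 600)   # the 600 x 600 RGB example
camera = input_count(3000, 2000)       # the 3,000 x 2,000 camera image

ratio = camera / small

print(f"480 x 480 x 3   = {small:,} inputs")
print(f"600 x 600 x 3   = {efficientnet:,} inputs")
print(f"3000 x 2000 x 3 = {camera:,} inputs")
print(f"increase over 480 x 480: about {ratio:.0f}x")
```

The ratio comes out at roughly 26, which matches the figure in the text.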
We will now try LMS with the PyTorch and TensorFlow deep learning frameworks.
In this case, we connect to the virtual environment over SSH using the MobaXterm application, at IP address 22.214.171.124 with cecuser as the user ID. WML-CE 1.7 is provided on this system in a conda virtual environment with Python 3. The base Anaconda environment is automatically activated from the provided .bashrc file.
Preparing the Environment
Illustration-9, 10 and 11 show the virtual environment (IBM POWER AC922) that we are using, running RHEL Linux Server v7.6. The system uses the ppc64le (little-endian) architecture with 640 GB of memory and 3.8 TB of disk. The GPU information is: CUDA version 10.1, CUDA driver version 418.87.00, and 4 GPUs (Tesla V100, 32 GB memory each).
Run the “conda list” command to show which Python packages are available in the base environment. To list the available conda environments, run the “conda env list” command (illustration-12).
To activate the Python 3 based WML-CE environment, run “conda activate wmlce_env3”. If we want to make any changes to the provided environments (most users will need to modify something), we should first clone the provided WML-CE environment (as in illustration-12).
Then run “conda activate wmlce_cahya” to activate the clone (wmlce_cahya is the name used for it here). The provided WML-CE environment has all the frameworks and Python packages that are installed by running “conda install powerai”.
Before testing LMS with PyTorch and TensorFlow on the AC922 virtual environment, clone the samples from GitHub (linked from the IBM Knowledge Center) into our environment:
A. Test LMS with PyTorch Framework
Large Model Support (LMS) is a feature provided in WML-CE/WMLA PyTorch that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with “out of memory” errors. LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed. One or more elements of a deep learning model can lead to GPU memory exhaustion, including model depth and complexity, base data size (for example, high-resolution images) and batch size.
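To make the swapping idea concrete, here is a toy simulation in plain Python. It is not IBM's implementation: the class name `ToyDevicePool`, the megabyte sizes and the least-recently-used eviction policy are all assumptions chosen for illustration. It only shows the general pattern of spilling idle tensors to host memory when a size-limited device pool is oversubscribed, and bringing them back when they are needed again.

```python
from collections import OrderedDict


class ToyDevicePool:
    """Toy model of a size-limited 'GPU' pool that spills idle tensors to 'host'."""

    def __init__(self, capacity_mb: int):
        self.capacity_mb = capacity_mb
        self.device = OrderedDict()  # name -> size, least recently used first
        self.host = {}               # tensors currently swapped out to host memory
        self.swap_outs = 0
        self.swap_ins = 0

    def _used_mb(self) -> int:
        return sum(self.device.values())

    def _make_room(self, size_mb: int) -> None:
        if size_mb > self.capacity_mb:
            raise MemoryError("tensor is larger than the whole device pool")
        # Assumed policy: evict the least recently used tensor until the new one fits.
        while self._used_mb() + size_mb > self.capacity_mb:
            name, size = self.device.popitem(last=False)
            self.host[name] = size
            self.swap_outs += 1

    def allocate(self, name: str, size_mb: int) -> None:
        self._make_room(size_mb)
        self.device[name] = size_mb

    def access(self, name: str) -> None:
        if name in self.device:
            self.device.move_to_end(name)  # mark as most recently used
            return
        size = self.host.pop(name)         # raises KeyError if never allocated
        self._make_room(size)
        self.device[name] = size
        self.swap_ins += 1


pool = ToyDevicePool(capacity_mb=100)
pool.allocate("act1", 40)
pool.allocate("act2", 40)
pool.allocate("act3", 40)   # does not fit: act1 is swapped out to host
pool.access("act1")         # needed again: act2 is swapped out, act1 comes back

print("on device:", list(pool.device))
print("on host:  ", list(pool.host))
print("swap-outs:", pool.swap_outs, "swap-ins:", pool.swap_ins)
```

Without the spill to host, the third allocation would simply fail, which is the “out of memory” abort described above.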
1st - PyTorch without LMS
Traditionally, the solution to this problem has been to modify the neural network model until it fits in GPU memory. This approach, however, can hurt accuracy, especially if concessions are made by reducing data fidelity or model complexity. With LMS, deep learning models can scale significantly beyond what was previously possible and, ultimately, generate more accurate results. The PyTorch ImageNet example (from the PyTorch examples on GitHub) provides a simple illustration of LMS in action. ResNet-152 is a deep residual network that requires a significant amount of GPU memory, and this experiment uses the ResNet-152 model. First, we run training with different batch sizes without LMS. Illustration-13 and 14 show a test that uses a batch size of 16.
Table-1 shows the results of all sample tests with batch sizes of 16, 32, 64, 128 and 256.
We use a substantial sample from the Microsoft COCO and Kaggle datasets for the “train” and “val” directories. Why does training succeed with batch sizes of up to 256? Because the Python script includes an image resizing step. As a result we never hit an “out of memory” error on the AC922, and training with a batch size of 256 is faster than with the smaller batch sizes.
2nd - PyTorch with LMS
Next we use the LMS feature. Before running the Python script (main.py), modify its “main” function as in Illustration-15.
Then run the script with a sample test.
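As a rough idea of what such a change can look like, the sketch below switches LMS on only when the running PyTorch build exposes an LMS switch. This is an assumption on our part, not a reproduction of the illustration: the call name `torch.cuda.set_enabled_lms` is how we recall the WML-CE build of PyTorch exposing the feature, stock PyTorch does not have it, and the wrapper `enable_lms_if_available` is a hypothetical helper.

```python
def enable_lms_if_available() -> bool:
    """Turn LMS on when the PyTorch build supports it; otherwise do nothing."""
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed in this environment

    # ASSUMPTION: set_enabled_lms is the switch provided by the WML-CE PyTorch build.
    # Upstream PyTorch has no such attribute, so look it up before calling it.
    set_enabled = getattr(torch.cuda, "set_enabled_lms", None)
    if set_enabled is None:
        return False
    set_enabled(True)
    return True


lms_on = enable_lms_if_available()
print("LMS enabled:", lms_on)
```

Guarding the call this way keeps the same training script usable on both an LMS-enabled build and a stock PyTorch install, where it reports that LMS is not available.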
Why does training again succeed with batch sizes of up to 256? For the same reason as before: the Python script includes an image resizing step. In the script, all images from the “train” directory are resized to 224 x 224 and those from the “val” directory are resized to 256 x 256.
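A quick calculation shows how small the resized input batches are. The sketch below assumes 3-channel images stored as 32-bit floats (4 bytes per value), which is our assumption rather than something stated in the script. It counts only the input tensor, not the weights, activations or gradients, which take up far more GPU memory.

```python
def batch_input_mib(batch_size: int, height: int = 224, width: int = 224,
                    channels: int = 3, bytes_per_value: int = 4) -> float:
    """Size in MiB of one input batch (input tensor only, float32 assumed)."""
    return batch_size * channels * height * width * bytes_per_value / 2**20


for batch_size in (16, 32, 64, 128, 256):
    print(f"batch size {batch_size:>3}: {batch_input_mib(batch_size):7.2f} MiB")
```

Even at a batch size of 256, the 224 x 224 input batch is well under 1 GB, tiny next to the 32 GB on each V100.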
Of course, the current AC922 with 640 GB of installed memory runs the training very well. However, the training time differs between “with LMS” and “without LMS”. With LMS, training is faster, and at batch-size=256 it is significantly faster.
While the training process is running, it uses all GPUs in the machine and between roughly 15 GB and 20 GB of CPU RAM. We can monitor this with “nvidia-smi” and “free -m”, as in illustration-18.
We monitor regularly (every 3 seconds), using “nvidia-smi” to check GPU status and GPU memory usage and “free -m” to check system RAM.
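The same periodic check can be scripted. The sketch below is one possible way to do it, not the method used for the illustration: it calls `nvidia-smi` and `free -m` through `subprocess`, skips a tool that is not installed, and pulls the numbers out of the `Mem:` row. The function names are ours.

```python
import shutil
import subprocess
import time


def parse_free_m(output: str) -> dict:
    """Extract total/used/free megabytes from the 'Mem:' row of `free -m` output."""
    for line in output.splitlines():
        if line.startswith("Mem:"):
            fields = line.split()
            return {
                "total_mb": int(fields[1]),
                "used_mb": int(fields[2]),
                "free_mb": int(fields[3]),
            }
    raise ValueError("no 'Mem:' row found in output")


def run_if_present(cmd: list):
    """Run cmd and return its stdout, or None when the tool is not installed."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout


def monitor(samples: int = 3, interval_s: float = 3.0) -> None:
    """Print GPU memory and system RAM usage every interval_s seconds."""
    gpu_cmd = [
        "nvidia-smi",
        "--query-gpu=index,memory.used,memory.total",
        "--format=csv,noheader",
    ]
    for i in range(samples):
        gpu = run_if_present(gpu_cmd)
        ram = run_if_present(["free", "-m"])
        print("GPU:", gpu.strip() if gpu else "nvidia-smi not available")
        print("RAM:", parse_free_m(ram) if ram else "free not available")
        if i < samples - 1:
            time.sleep(interval_s)
```

Calling `monitor()` in a second terminal while training runs prints three samples, 3 seconds apart; raise `samples` to watch a longer run.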
Next, we will try LMS with the TensorFlow framework.
B. Test LMS with TensorFlow Framework
TensorFlow Large Model Support (TFLMS) is a feature in the TensorFlow build provided by IBM WML-CE/WMLA that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with “out-of-memory” errors.
1st - TensorFlow without LMS
We build a model using the sample TensorFlow Python script. The default model is ResNet-50 and the default image size is 500 x 500 pixels. First, we test without LMS.
Before settling on the ResNet-50 model, we tried the ResNet-152 model, but the results with and without LMS were the same: an out-of-memory condition occurred at a batch size of 32. For the ResNet-50 model, as table-3 shows, the out-of-memory condition starts at a batch size of 64.
2nd - TensorFlow with LMS
TensorFlow Large Model Support (TFLMS) provides an approach to training large models that cannot fit into GPU memory. It takes a computational graph defined by the user and automatically adds swap-in and swap-out nodes for transferring tensors from the GPUs to the host and vice versa. The computational graph is statically modified. If a TensorFlow Keras model is used in v1 compatibility mode in TensorFlow 2, with TensorFlow 2 behavior disabled through “tf.disable_v2_behavior()”, add the change shown in illustration-23 after “import tensorflow.compat.v1 as tf”.
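The graph rewriting described above can be pictured with a toy example in plain Python. This is not the TFLMS algorithm: the function `add_swap_nodes`, the node naming and the rule "swap a tensor when its consumer runs more than `threshold` steps after its producer" are assumptions made for illustration. The graph is a list of `(name, inputs)` pairs in execution order, where the gradient operations reuse tensors produced much earlier in the forward pass.

```python
def add_swap_nodes(ops, threshold):
    """Return a new op list with swap-out/swap-in nodes around long-lived tensors.

    ops: list of (name, [input names]) in execution order.
    threshold: assumed rule; an edge is 'far' when the consumer runs more than
               this many steps after the producer.
    """
    position = {name: i for i, (name, _) in enumerate(ops)}

    # Pass 1: find producer -> consumer edges that span too many steps.
    far_edges = {
        (src, name)
        for name, inputs in ops
        for src in inputs
        if position[name] - position[src] > threshold
    }
    swapped = {src for src, _ in far_edges}

    # Pass 2: rebuild the list, inserting the extra nodes.
    rewritten = []
    for name, inputs in ops:
        new_inputs = []
        for src in inputs:
            if (src, name) in far_edges:
                swap_in = f"swap_in:{src}->{name}"
                rewritten.append((swap_in, [f"swap_out:{src}"]))  # host -> GPU
                new_inputs.append(swap_in)
            else:
                new_inputs.append(src)
        rewritten.append((name, new_inputs))
        if name in swapped:
            rewritten.append((f"swap_out:{name}", [name]))        # GPU -> host
    return rewritten


graph = [
    ("x", []),
    ("conv1", ["x"]),
    ("conv2", ["conv1"]),
    ("loss", ["conv2"]),
    ("grad_conv2", ["loss", "conv1"]),
    ("grad_conv1", ["grad_conv2", "x"]),
]

new_graph = add_swap_nodes(graph, threshold=2)
for name, inputs in new_graph:
    print(f"{name:<28} <- {inputs}")
```

In this toy graph, `x` and `conv1` are copied to the host right after they are produced and copied back just before the gradient steps that need them, so they do not have to occupy GPU memory in between.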
Then run the Python script. As table-4 shows, the out-of-memory condition now occurs at a batch size of 256.
Comparing all the tests with and without LMS, training with LMS clearly gives the better result. Without LMS, the out-of-memory condition occurs at a batch size of 64; with LMS, it does not occur until a batch size of 256. In summary, Large Model Support pays off when we have a complex model and a high-resolution dataset (for example, images or video).
IBM WML-CE/WMLA (IBM Watson Machine Learning Community Edition / IBM Watson Machine Learning Accelerator) addresses a fundamental limitation of deep learning (especially during training): the limited memory in GPUs. When training complex models or training with a huge dataset (for example, many high-definition images), managing the available GPU memory can be a real challenge.
Instead of being forced into building less complex, shallower deep learning models, we can develop better models by using the Large Model Support (LMS) feature, available on the IBM POWER AC922 AI-enabled server.
With LMS, enabled by IBM’s unique NVLink connection between CPU (memory) and GPU, the entire model and dataset can be loaded into system memory and cached down to the GPU for action (provided we have enough system memory). We can now address bigger challenges and get much more work done within a cluster of WML-CE/WMLA servers, therefore increasing efficiency in training deep learning models.
Note that coherent access to system memory means that the GPU can access system memory much as it accesses its own memory, which enables a game-changing simplification of application programming and support for larger model sizes.
What’s next then?
With IBM POWER9, we are not limited to LMS. Other components are available for using multiple GPUs across multiple servers, such as Distributed Deep Learning (DDL), Elastic Distributed Training (EDT) and Elastic Distributed Inference (EDI).
Distributed Deep Learning (DDL) is part of both WML-CE and WMLA.
In the community edition (WML-CE), we can cluster up to 4 nodes together, while in Accelerator we can cluster thousands of servers together to perform training or inference tasks. This solution is ideal for very large scale-out training of a single model at a time. The resource allocation is static.
For multi-tenancy and dynamic resource sharing, EDT delivers the most flexible solution. This feature allows us to set up resource groups, prioritization and consumer groups to create a very flexible and scalable training platform for data scientists.
EDI is a component of WMLA (applied in IBM Power IC922 for example, a machine that is designed specifically for doing inferencing). EDI enables us to publish inference models as services across a scalable cluster of servers, from which we can consume the services.
At some point in the future, we may face increasingly complex models or huge dataset requirements (high-resolution images or large volumes of numbers or text). If LMS alone is no longer enough, DDL or EDT and EDI may provide a better way to get faster training with the best possible accuracy on an easy-to-deploy, scalable platform.
Will Feng, “ImageNet training in PyTorch”.
Sam Matzek, “TensorFlow Large Model Support”.
Chris Eaton, Scott Soutter, Paul Zikopoulos, “IBM Watson Machine Learning Accelerator”.
Techelite solutions, “Power9 IC922”.
Kaggle, “Dogs vs. Cats”.