Unexpected Performance for AMD vs NVIDIA GPUs Running Disco Diffusion and Stable Diffusion Part 1

I am trying to understand how different GPU architectures perform when running image generation AI scripts such as Disco Diffusion and Stable Diffusion. For this test, I am using an NVIDIA M40 GPU and an AMD Radeon Instinct MI25 GPU. Both GPUs are installed in a single Supermicro 1028GR-TR server with PCIe 3.0 x16 risers. The server has 256 GB of RAM and two Intel Xeon E5-2660 v3 CPUs with 10 cores each. The table below shows the nominal specifications of the two GPUs:

Specification | AMD Radeon Instinct MI25 | NVIDIA M40
Cores | 4096 | 3072
Clock | 1500 MHz | 1112 MHz
Memory bus width | 2048 bits | 384 bits
Memory bandwidth | 436.2 GB/s | 288.4 GB/s
FP16 FLOPS | 24.58 TFLOPS | ???
FP32 FLOPS | 12.29 TFLOPS | 6.832 TFLOPS
FP64 FLOPS | 768 GFLOPS | 213.5 GFLOPS
GPU specs for AMD Radeon Instinct MI25 compared to NVIDIA M40
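
The FP32 figures in the table can be cross-checked from the core count and clock. The sketch below assumes the usual convention that each core completes one fused multiply-add per cycle, counted as two floating-point operations; the function name is only illustrative.

```python
def peak_fp32_tflops(cores: int, clock_mhz: float) -> float:
    """Theoretical peak FP32 throughput, assuming 2 FLOPs (one FMA) per core per cycle."""
    flops_per_second = 2 * cores * clock_mhz * 1e6
    return flops_per_second / 1e12


mi25 = peak_fp32_tflops(4096, 1500)  # cores and clock from the table
m40 = peak_fp32_tflops(3072, 1112)

print(f"MI25: {mi25:.2f} TFLOPS")
print(f"M40:  {m40:.3f} TFLOPS")
print(f"MI25 / M40 on paper: {mi25 / m40:.2f}x")
```

This is a peak figure only; it says nothing about how well a given script keeps the cores busy, which is what the benchmarks measure.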

Based on the specs alone, the MI25 appears to be far faster than the M40. However, the two designs differ in so many ways that the specs alone cannot settle the comparison. I picked two very different AI programs, Disco Diffusion and Stable Diffusion, to benchmark the two GPUs and find out which one performs better. Before reading the results, you should know what each of these programs is good at.

What Is Disco Diffusion?

Disco Diffusion is a very customizable image generation AI script that can create fairly large images with little VRAM, such as a 1280 by 720 image in 16 GB while still leaving extra VRAM for models. Disco Diffusion also has many compatible CLIP and diffusion models to choose from, since it is a common AI script to train models for.

What Is Stable Diffusion?

Stable Diffusion is very different from Disco Diffusion. It is not especially customizable: apart from the prompt, there are only a few settings you can change. It also uses a lot of VRAM for small images; you can barely fit a 512 by 512 image in 16 GB of VRAM. However, Stable Diffusion is a very fast AI script. One other difference is that Stable Diffusion cannot be run on a CPU, while Disco Diffusion can. After running this script many times, I discovered that the model was trained on some watermarked images, so some of the results contain a watermark. I removed all watermarked results from this test.

To benchmark the two GPUs with Disco Diffusion and Stable Diffusion, I will test a few different settings for each script on both GPUs. I will also show the images created, to demonstrate that the images are of similar quality on both GPUs.

RESULTS

The results of the tests are split by AI script, and each of those sections is split by the settings used. All tests measure the time to generate one individual image. I used the same prompt for all images, to ensure that the results are similar enough to compare. This is the prompt I used for all tests: “A beautiful, highly detailed oil painting of a mysterious green emerald tower next to a glowing blue lake in the middle of a dark forest at dusk in the style of Greg Rutkowski and Afremov, highly detailed oil painting”
All other settings not listed are either the defaults, the same in every test, or have no effect on speed. All seeds were randomly generated to produce many different example images; different seeds do not affect generation time.

STABLE DIFFUSION

This section covers tests of the Stable Diffusion AI script. Results are split into categories by image size and precision mode. Precision mode normally means a choice between FP16, FP32, or even FP64, where FP stands for floating point. The number after FP is the number of bits used for each value in the script's operations on the GPU: a higher number gives higher precision, but is often slower. For this script, rather than a choice between FP16 and FP32, the setting allows either “full” precision or “autocast” precision. The names don’t make it entirely clear what each precision mode is, but the choice clearly makes a huge difference in speed. As a comparison, running this script using the default settings (256×256, 50 DDIM steps) on an A100 takes 3 seconds.
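
To make the FP16, FP32, and FP64 distinction concrete, here is a small Python sketch that stores the same value in each format and reads it back. It illustrates the number formats only; it does not show how a GPU executes them or what the “full” and “autocast” settings actually select. The helper name `round_trip` is made up for this example.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


x = 0.1
as_fp16 = round_trip(x, "<e")  # half precision, 16 bits
as_fp32 = round_trip(x, "<f")  # single precision, 32 bits
as_fp64 = round_trip(x, "<d")  # double precision, 64 bits (Python's native float)

for name, stored in [("FP16", as_fp16), ("FP32", as_fp32), ("FP64", as_fp64)]:
    print(f"{name}: stored {stored!r}, rounding error {abs(stored - x):.3e}")
```

The fewer bits a format has, the further the stored value drifts from the original, which is the trade-off behind any precision setting.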

AUTOCAST PRECISION

Autocast precision is the default precision mode in Stable Diffusion. It is most likely the equivalent of FP16, although this is hard to confirm from the script. Although the result images come in sets of four, the generation time is per individual image, rounded to whole seconds. Generation time does not include the time to load the model, since generating more images without reloading the model adds only generation time, not loading time. The result images are linked to each generation time. All images in this section were generated using autocast precision.
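
The measurement described here (time per individual image, model loading excluded, rounded to whole seconds) could be sketched as follows. This is not the code used for the benchmark: `seconds_per_image` and `fake_batch` are hypothetical names, and the sleep is a stand-in for a real generation call made with an already loaded model.

```python
import time


def seconds_per_image(generate_batch, images_per_batch: int) -> float:
    """Time one call to generate_batch and return the seconds spent per image.

    generate_batch is a zero-argument callable that produces images_per_batch
    images. The model must be loaded before calling this function, so that
    loading time stays out of the measurement.
    """
    start = time.perf_counter()
    generate_batch()
    elapsed = time.perf_counter() - start
    return elapsed / images_per_batch


def fake_batch():
    """Stand-in workload that pretends to generate a set of four images."""
    time.sleep(0.2)


per_image = seconds_per_image(fake_batch, images_per_batch=4)
print(f"{per_image:.3f} s per image, reported as {round(per_image)} s")
```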

Settings | AMD Radeon Instinct MI25 generation time (s) | NVIDIA M40 generation time (s)
256×256, DDIM, 50 steps | 9.0 | 9.0
256×256, PLMS, 50 steps | 9.0 | 9.0
256×256, DDIM, 250 steps | 48.0 | 46.0
256×256, PLMS, 250 steps | 48.0 | 46.0
512×512, DDIM, 25 steps | 11.0 | 14.0
512×512, PLMS, 25 steps | 12.0 | 15.0
512×512, DDIM, 50 steps | 23.0 | 29.0
512×512, PLMS, 50 steps | 23.0 | 29.0
Time to generate images using the Stable Diffusion AI script with autocast precision, on an AMD Radeon Instinct MI25 and an NVIDIA M40

FULL PRECISION

Full precision is the alternative to autocast precision. It is unclear exactly what it does, but it most likely switches from FP16 to FP32. Based on the results, it is clear that the precision mode makes a major difference when running Stable Diffusion.

Settings | AMD Radeon Instinct MI25 generation time (s) | NVIDIA M40 generation time (s)
256×256, DDIM, 50 steps | 4.0 | 7.0
256×256, PLMS, 50 steps | 4.0 | 7.0
256×256, DDIM, 250 steps | 21.0 | 38.0
256×256, PLMS, 250 steps | 21.0 | 39.0
512×512, DDIM, 25 steps | 9.0 | 13.0
512×512, PLMS, 25 steps | 9.0 | 14.0
512×512, DDIM, 50 steps | 18.0 | 27.0
512×512, PLMS, 50 steps | 19.0 | 28.0
Time to generate images using the Stable Diffusion AI script with full precision, on an AMD Radeon Instinct MI25 and an NVIDIA M40

As we can see, with the default settings (autocast, 256×256, 50 DDIM steps), the MI25 and M40 perform similarly, and both are about 3 times slower than an A100. However, when switching to full precision mode, the AMD Radeon Instinct MI25 nearly matches an A100 (4 seconds versus 3), even though it costs almost 50 times less. Given this speed difference, you might expect the images generated with the default settings to be better, but all images created using Stable Diffusion are of equal quality unless settings such as resolution or DDIM steps are changed. Switching the sampler from DDIM to PLMS has little to no effect on time: no more than a one second difference, even for large images.
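
The ratios behind these statements can be recomputed from the DDIM rows of the two tables above. The following Python sketch only does that arithmetic; the dictionary keys are shorthand labels chosen for this example, and the A100 time is the 3 second reference quoted earlier.

```python
# Seconds per image from the DDIM rows of the two tables: (MI25, M40).
autocast = {
    "256x256, 50 steps": (9, 9),
    "256x256, 250 steps": (48, 46),
    "512x512, 25 steps": (11, 14),
    "512x512, 50 steps": (23, 29),
}
full = {
    "256x256, 50 steps": (4, 7),
    "256x256, 250 steps": (21, 38),
    "512x512, 25 steps": (9, 13),
    "512x512, 50 steps": (18, 27),
}
A100_DEFAULT_SECONDS = 3  # reference time for 256x256, 50 DDIM steps

for setting in autocast:
    mi25_auto, m40_auto = autocast[setting]
    mi25_full, m40_full = full[setting]
    print(
        f"{setting}: autocast-to-full speedup MI25 {mi25_auto / mi25_full:.2f}x, "
        f"M40 {m40_auto / m40_full:.2f}x; "
        f"M40 time / MI25 time at full precision {m40_full / mi25_full:.2f}x"
    )

default = "256x256, 50 steps"
mi25_vs_a100_autocast = autocast[default][0] / A100_DEFAULT_SECONDS
mi25_vs_a100_full = full[default][0] / A100_DEFAULT_SECONDS
print(f"MI25 vs A100 at the default size: {mi25_vs_a100_autocast:.2f}x the time with autocast, "
      f"{mi25_vs_a100_full:.2f}x with full precision")
```

Because the source times are rounded to whole seconds, the ratios for the short runs are coarse and should be read as rough indications.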

Part 2

2 Comments

  1. hi, thank you so much for sharing the performance difference in such a detail! I’m familiar with running codes but not so much with hardwares, especially GPUs. Could you help me clarify one thing, please? From my understanding, the CUDA operations were built for NVIDIA so the currently distributed version of DD v5.6 cannot be run on Apple machines / AMDs. (It would help me tremendously if there’s a way to run DD on AMD + Mac Pro instead of having to buy a PC…)
