Troubleshooting LMDeploy 0.9.2 Qwen2.5 VL 32B AWQ Inference Errors And Multi-turn Image Issues

Hey guys! Today, we're diving into a bug report concerning LMDeploy 0.9.2, specifically with the Qwen2.5 VL 32B AWQ model. It seems like there are some issues when dealing with image inference, especially in multi-turn conversations. Let’s break down the problem and see what’s going on.

The Issue

The user reported two main errors when trying to run image inference with the Qwen2.5 VL 32B AWQ model using LMDeploy 0.9.2. These issues seem to revolve around memory management, leading to Out-of-Memory (OOM) errors. This is a common headache when working with large models and images, so let’s dig into the details.

Error 1: Image Size Too Large

The first error occurs when the image is relatively large, around 1 MB. The system throws an OOM error, indicating that it can't allocate enough memory to process the image. Interestingly, the user pointed out that the same model and image work fine with vLLM or SGLang. Here's the error message they encountered:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 15.56 GiB of which 66.50 MiB is free. Including non-PyTorch memory, this process has 15.49 GiB memory in use. Of the allocated memory 1.17 GiB is allocated by PyTorch, and 200.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

This error message is pretty explicit: the allocation that failed was only 76 MiB, but the GPU, with a total capacity of 15.56 GiB, had just 66.50 MiB free. In other words, almost all of the card's memory was already claimed before the image could be processed. The message also suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation. That's a good tip, but let's see if we can find a more comprehensive solution.
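
If you want to try the allocator suggestion, the variable has to be set before the first CUDA allocation. Here's a minimal sketch for the Python API; if you launch via the lmdeploy serve CLI, export the same variable in your shell before starting the server instead.

import os

# Must be set before torch/lmdeploy is imported, i.e. before any CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the allocator config is in place

print("allocator config:", os.environ["PYTORCH_CUDA_ALLOC_CONF"])
print("CUDA available:", torch.cuda.is_available())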

Error 2: Multi-turn Conversation Failure

The second error pops up during multi-turn conversations involving images. The user reported that the first image interaction goes smoothly, but when they try to engage with a second image, the system crashes with another OOM error. Here’s the error message:

[TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/core/allocator.cc:49
Aborted (core dumped)

This error suggests that memory is not being released between turns: the first image fits, but by the time the second one arrives there is nothing left to allocate, and TurboMind's allocator (allocator.cc) aborts the whole process. Because the failure happens inside the serving engine rather than in user code, it points to a bug, or at least very tight memory headroom, in how LMDeploy handles multi-turn vision inputs. It's like trying to fill a glass that's already full; eventually, it's going to overflow.
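
To confirm that the crash is tied to the second image rather than to image size, a small client-side script can replay the scenario against the running server. This is only a sketch: it assumes the api_server from the report is listening on port 7777 and accepts OpenAI-style image_url content (LMDeploy's OpenAI-compatible endpoint does for VL models); first.jpg and second.jpg are placeholder files.

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:7777/v1", api_key="dummy")

def user_turn(image_path, prompt):
    # Encode a local image as a data URL, the format the OpenAI-style API expects.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"role": "user", "content": [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ]}

history = []
for i, path in enumerate(["first.jpg", "second.jpg"]):  # placeholder images
    history.append(user_turn(path, f"Describe image {i + 1}."))
    reply = client.chat.completions.create(
        model="Qwen2___5-VL-32B-Instruct-AWQ", messages=history)
    history.append({"role": "assistant",
                    "content": reply.choices[0].message.content})
    print(f"turn {i + 1} succeeded")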

Environment Details

Before we dive into potential solutions, let's take a look at the environment in which these errors occurred. This can give us valuable clues about what might be going wrong.

  • Operating System: Linux
  • Python Version: 3.10.18
  • CUDA: Available and running
  • GPUs: Two Quadro RTX 5000 (each with 16GB of memory)
  • CUDA Version: 12.4
  • PyTorch Version: 2.7.1+cu126
  • LMDeploy Version: 0.9.2+
  • Transformers: 4.54.0

The system has two Quadro RTX 5000 GPUs with 16 GB each, 32 GB in total. The serving command uses tensor parallelism across both cards (more on that below), so the AWQ-quantized weights are split, but keep in mind that every allocation still has to land on a single 16 GB card: weights, KV cache, vision-encoder activations, and runtime buffers all compete for that per-GPU budget. The OOM errors therefore point to memory that isn't being budgeted or released properly rather than to hardware that is simply too small.
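
As a quick sanity check before starting the server, it helps to confirm that both GPUs are visible and see how much memory is actually free on each, since other processes on the cards eat into the 16 GiB. A minimal sketch:

import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)   # bytes
    name = torch.cuda.get_device_properties(i).name
    print(f"GPU {i} ({name}): "
          f"{free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")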

Potential Causes and Solutions

Now that we’ve laid out the problem, let’s brainstorm some potential causes and solutions. These are a few things that might be contributing to the OOM errors:

  1. Inefficient Memory Management: LMDeploy might not be releasing memory properly after processing each image or turn in a conversation. This can lead to a buildup of memory usage over time.
  2. Large Image Sizes: Even though the images are around 1MB, the model might be creating large intermediate tensors during processing, which consume significant memory.
  3. Configuration Issues: The LMDeploy configuration might not be optimized for the available hardware, leading to inefficient resource utilization.
  4. Bugs in LMDeploy: There might be underlying bugs in LMDeploy 0.9.2 that cause memory leaks or other memory-related issues.

Given these potential causes, here are some solutions and strategies to try:

1. Optimize LMDeploy Configuration

The first step is to ensure that LMDeploy is configured optimally for the available hardware. The user provided the following command:

lmdeploy serve api_server \
    /home/drc-whlab/james/Qwen2___5-VL-32B-Instruct-AWQ \
    --model-name Qwen2___5-VL-32B-Instruct-AWQ \
    --server-port 7777 \
    --tp 2 \
    --cache-max-entry-count 0.9 \
    --session-len 20000 \
    --max-batch-size 4

Let’s break down these parameters and see if we can tweak them for better performance:

  • --tp 2: This specifies tensor parallelism with 2 GPUs. This is good for utilizing both GPUs, but make sure the model is properly sharded across them.
  • --cache-max-entry-count 0.9: Despite the name, this is not a count; it is the fraction of free GPU memory (after the weights are loaded) that TurboMind reserves for the KV cache. At 0.9, 90% of whatever is left goes to the cache, which leaves almost no headroom for the vision encoder and runtime buffers and is a very likely contributor to both OOM errors. Try a much lower value such as 0.5 and work upward.
  • --session-len 20000: This sets the maximum session length. A large session length can consume a lot of memory, especially in multi-turn conversations. Consider reducing this value if your use case doesn't require extremely long sessions.
  • --max-batch-size 4: This sets the maximum batch size. A smaller batch size can reduce memory consumption but might also decrease throughput. Experiment with lower values to see if it resolves the OOM issues.

Fine-tuning parameters like --cache-max-entry-count, --session-len, and --max-batch-size has a direct impact on memory usage; the goal is to find the balance between throughput and headroom that your hardware can actually sustain.
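
If you prefer to experiment from Python rather than restart the CLI server for every tweak, here is a sketch of roughly the same deployment through LMDeploy's pipeline API with a smaller KV-cache fraction. The model path mirrors the user's command, and example.jpg is a placeholder; adjust both for your setup.

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

engine_cfg = TurbomindEngineConfig(
    tp=2,                        # tensor parallelism across both GPUs
    cache_max_entry_count=0.5,   # fraction of free GPU memory given to the KV cache
    session_len=20000,
    max_batch_size=4,
)

pipe = pipeline("/home/drc-whlab/james/Qwen2___5-VL-32B-Instruct-AWQ",
                backend_config=engine_cfg)

image = load_image("example.jpg")            # placeholder image
response = pipe(("Describe this image.", image))
print(response.text)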

2. Reduce Image Size and Complexity

Large images consume more memory. What matters is not the ~1 MB file size but the pixel dimensions: Qwen2.5-VL scales its vision-token count with image resolution, so bigger images produce more tokens and larger intermediate activations. Resizing images to a sensible maximum edge length, or cropping them to the region you actually care about, before sending them to the model can significantly reduce the memory footprint.
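
A minimal preprocessing sketch using Pillow; the file names and the 1280 px cap are arbitrary examples, not values from the report.

from PIL import Image

def shrink(path, out_path, max_side=1280):
    # Cap the longest side while keeping the aspect ratio; thumbnail() only downscales.
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side))
    img.save(out_path, "JPEG", quality=90)
    return out_path

shrink("large_input.jpg", "resized_input.jpg")  # hypothetical file names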

3. Implement Memory Management Techniques

Ensure that memory is being properly managed within your application: explicitly delete large variables and tensors when they are no longer needed, and call torch.cuda.empty_cache() to return PyTorch's cached blocks to the driver. Keep in mind that empty_cache() only releases memory that is no longer referenced, so it can relieve pressure but cannot fix a genuine leak inside the serving engine.
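
A short sketch of the pattern; the big tensor here is just a stand-in for whatever intermediate results your own code holds onto between turns.

import gc
import torch

x = torch.randn(4096, 4096, device="cuda")   # stand-in for a large intermediate tensor
del x                                        # drop the only reference
gc.collect()                                 # make sure Python has let go of it
torch.cuda.empty_cache()                     # return cached blocks to the driver
print(f"{torch.cuda.memory_allocated() / 2**20:.1f} MiB still allocated")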

4. Update or Downgrade LMDeploy Version

If the issue is a bug in LMDeploy 0.9.2, try updating to the latest release, or roll back to a version that is known to work for your model. Check the LMDeploy release notes and issue tracker for fixes related to memory management or Qwen2.5-VL before deciding which direction to go.

5. Explore Mixed Precision Training

If you're training or fine-tuning the model yourself, consider mixed precision (FP16 or BF16). Running the bulk of the computation in half precision roughly halves activation memory and speeds things up, which is particularly helpful with large models on limited GPU memory. (For the served AWQ model the weights are already quantized, so this mainly applies to your own training or preprocessing code.)
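
A generic sketch of mixed precision with PyTorch's autocast; the tiny Linear model and random batch are placeholders, not part of the LMDeploy stack.

import torch

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder model
batch = torch.randn(8, 1024, device="cuda")

# Matmuls inside the autocast region run in half precision, shrinking activations.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    out = model(batch)

print(out.dtype)   # torch.float16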

6. Use Gradient Accumulation

Gradient accumulation lets you simulate a larger batch size without a matching increase in memory: gradients from several small micro-batches are summed before a single optimizer step. This is useful when memory constraints prevent you from running the batch size you actually want.
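
A sketch with placeholder model and data showing the accumulation loop: the loss is scaled by the number of micro-batches, and the optimizer only steps once the gradients from all of them have been summed.

import torch

model = torch.nn.Linear(1024, 10).cuda()          # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 4

for step in range(8):
    x = torch.randn(2, 1024, device="cuda")        # small micro-batch
    y = torch.randint(0, 10, (2,), device="cuda")
    loss = torch.nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()                                # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad()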

7. Profile Memory Usage

Use profiling tools to find out where the memory actually goes. PyTorch ships with built-in allocator statistics and a profiler that can record memory events, which makes it much easier to tell whether the vision encoder, the KV cache, or something else is eating the budget, and to focus your optimization effort there.
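
A sketch using PyTorch's built-in allocator statistics: reset the peak counter, run the suspect code, then inspect how high memory climbed. Replace the placeholder matmul with your actual inference call.

import torch

torch.cuda.reset_peak_memory_stats()

x = torch.randn(2048, 2048, device="cuda")   # placeholder workload
y = x @ x

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.1f} MiB")
print(torch.cuda.memory_summary(abbreviated=True))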

Diving Deeper into the Errors

Let's revisit the error messages to see if we can extract more information. The first one suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. This environment variable can reduce fragmentation, but it's more of a workaround than a solution; it's like putting a bandage on a wound that needs stitches, and it won't help if the memory is genuinely exhausted rather than fragmented.

The second error message, CUDA runtime error: out of memory, is more generic, but because it comes from TurboMind's own allocator (allocator.cc) and kills the whole process, it points to memory being exhausted inside the serving engine itself, whether through a leak, an oversized reservation, or both.

Community and Support

If you're still facing issues after trying these solutions, don't hesitate to reach out to the LMDeploy community, for example by opening an issue on the project's GitHub repository with your environment details, the serve command, and the full error output. Others may have hit the same problem with Qwen2.5-VL AWQ, and the maintainers can tell you whether it's a known bug.

Conclusion

Dealing with OOM errors can be frustrating, but a systematic approach usually pays off. By tuning the LMDeploy configuration (especially the KV-cache fraction), keeping input images to a sensible size, releasing memory between turns, and staying current with releases, you can mitigate these issues and get your image inference pipeline running smoothly. It's all about understanding where the memory goes, experimenting with one change at a time, and leaning on the community when you hit a wall.

I hope this helps you guys tackle those OOM errors! Let me know if you have any other questions or run into further issues. Happy debugging!