Troubleshooting Intermittent API Connection Errors With Azure OpenAI And LiteLLM

by JurnalWarga.com

Introduction

Hey guys, we've been wrestling with some tricky API connection errors while sending asynchronous requests to Azure OpenAI through LiteLLM, especially with reasoning models. It's been a real headache, so we want to share our setup, the symptoms, the diagnostic steps we've taken, and a full traceback — both to help anyone facing the same thing and to ask the LiteLLM team and the broader community for insights.

Our Setup

We run LiteLLM as a proxy, routing requests from our applications to Azure OpenAI so we can manage and monitor our usage in one place. The OpenAI instances are hosted inside our Azure Private Network, with Azure Private Endpoints configured for a secure, low-latency path. Fundamental connectivity is sound: our applications talk to these instances successfully the vast majority of the time. The intermittent nature of the errors therefore points elsewhere — most likely at the interaction between LiteLLM and Azure OpenAI when handling the more demanding reasoning models.
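For context, our routing looks roughly like the config sketch below. The model alias, deployment name, resource URL, and API version are placeholders, not our real values; the key names (`model_list`, `litellm_params`, `litellm_settings.request_timeout`) follow LiteLLM's documented proxy config format.

```yaml
model_list:
  - model_name: o3-mini                   # alias our apps call through the proxy
    litellm_params:
      model: azure/o3-mini-deployment     # hypothetical Azure deployment name
      api_base: https://example-resource.openai.azure.com/   # resolves via our Private Endpoint
      api_key: os.environ/AZURE_API_KEY   # LiteLLM's env-var reference syntax
      api_version: "2024-12-01-preview"   # placeholder version

litellm_settings:
  request_timeout: 600   # generous ceiling, since reasoning calls can run for minutes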

The Issue: Intermittent API Connection Errors

The core issue is regular API connection errors that started after we deployed reasoning models. Most requests go through fine, but a consistent fraction fail, and the failures are sporadic enough to be hard to reproduce and diagnose. They matter: each one disrupts our applications and degrades the user experience. Because the errors began exactly when the reasoning models arrived, we suspect they're tied to those models' longer inference times or to the specific way they interact with the API, rather than a simple network hiccup.
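The failing pattern is essentially concurrent async calls like the sketch below. The model name, timeout, and concurrency cap are our hypothetical choices; `litellm.acompletion` is LiteLLM's real async entry point. The import is done lazily so the fan-out helper itself is testable anywhere.

```python
import asyncio


async def call_model(prompt: str, model: str = "azure/o3-mini-deployment"):
    # Lazy import so the helper below can run even where litellm isn't installed.
    import litellm
    return await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        timeout=600,  # generous ceiling; reasoning calls can run for minutes
    )


async def fan_out(prompts, call=call_model, limit: int = 8):
    # Cap in-flight requests so long-running reasoning calls don't pile up
    # behind one another on the proxy.
    sem = asyncio.Semaphore(limit)

    async def one(prompt):
        async with sem:
            try:
                return await call(prompt)
            except Exception as exc:  # litellm.APIConnectionError in our case
                return exc

    return await asyncio.gather(*(one(p) for p in prompts))
```

Returning the exception instead of raising keeps one bad request from cancelling the whole batch, which is how we collect the error timings discussed below.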

Key Observations

We've made several observations that narrow things down, though a definitive answer remains elusive:

- Azure Support says the error looks client-side: they see no corresponding logs on their end, so the request may be failing before it ever reaches (or is recorded by) the Azure OpenAI service — possibly within our infrastructure or LiteLLM itself.
- Our networking team checked and found no latency or connectivity problems, ruling out network congestion as the primary cause.
- We log the duration of every API call. Most connection errors occur after a request has been running for 250+ seconds, which suggests a timeout or resource-exhaustion issue; however, some errors appear in under 5 seconds, hinting at more than one underlying cause.
- The Azure endpoint logs do show timeouts, but those don't correlate directly with our API connection errors.

Taken together, these point to a complex problem that needs a multi-faceted approach.
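The duration logging mentioned above is essentially a wrapper like this. The names are ours, and the 250-second threshold simply reflects the pattern we observed in our own logs:

```python
import asyncio
import logging
import time

logger = logging.getLogger("api_timing")
SLOW_THRESHOLD_S = 250.0  # most of our connection errors hit after this point


async def timed(label: str, coro):
    """Await `coro`, logging wall-clock duration and outcome either way."""
    outcome = "unknown"
    start = time.monotonic()
    try:
        result = await coro
        outcome = "ok"
        return result
    except Exception as exc:
        outcome = f"error:{type(exc).__name__}"
        raise
    finally:
        elapsed = time.monotonic() - start
        level = logging.WARNING if elapsed >= SLOW_THRESHOLD_S else logging.INFO
        logger.log(level, "%s finished in %.1fs (%s)", label, elapsed, outcome)
```

Logging in `finally` is what let us see that failures cluster at 250+ seconds while a minority die in under 5.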

Diagnostic Steps Taken

To tackle these errors, we've taken a systematic approach, starting with the experts. We filed a ticket with Azure Support, but they couldn't provide a fix: the failing requests aren't logged on their backend at all, which removes our ability to correlate errors with server-side events. We then enabled logging on the Azure endpoints; it did reveal some timeouts, but they don't line up with the connection errors — so either the timeouts are a symptom of a different issue, or the connection errors happen before the point where timeouts get logged. These steps have ruled out some causes, but they've mostly highlighted the complexity of the problem. We're now focused on the interaction between LiteLLM and Azure OpenAI, and on our client-side configuration, to pin down the root cause.
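While we dig for the root cause, we mitigate with retries (LiteLLM's `acompletion` also accepts a `num_retries` parameter). A hand-rolled equivalent for connection-type errors looks like this — the exception tuple, attempt count, and backoff values are our choices, not recommendations, and in practice the retryable type would be `litellm.APIConnectionError`:

```python
import asyncio
import random

RETRYABLE = (ConnectionError, OSError)  # stand-in for litellm.APIConnectionError


async def with_retries(make_coro, attempts: int = 3, base_delay: float = 1.0):
    """Retry a coroutine factory on connection-type errors with jittered backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_coro()
        except RETRYABLE:
            if attempt == attempts:
                raise
            # Exponential backoff with jitter, so a burst of failures
            # doesn't retry in lockstep against the proxy.
            delay = base_delay * 2 ** (attempt - 1) * (0.5 + random.random() / 2)
            await asyncio.sleep(delay)
```

Note it takes a factory (`make_coro`) rather than a coroutine, since a coroutine object can only be awaited once.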

Seeking Assistance and Insights

We're now reaching out to the community — and the LiteLLM team in particular — for any insight into diagnosing and resolving these errors. The team's knowledge of the proxy's internals and its interactions with the various LLM providers would be invaluable here, and we'd equally welcome suggestions from anyone who has hit similar issues or has experience troubleshooting network and API connectivity problems. A fresh perspective may be exactly what this needs, and any advice is greatly appreciated.

Detailed Traceback Analysis

To make the errors concrete, let's walk through how a single failure propagates through the stack. The chain starts with aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer — a network-level signal that the remote host forcibly closed the connection, whether from a timeout, a server-side error, or a network interruption. LiteLLM's custom aiohttp transport maps that into httpx.ConnectError: [Errno 104] Connection reset by peer, confirming the failure happened at the HTTP connection layer. From there the error surfaces in litellm/llms/azure/azure.py, in the acompletion function that handles asynchronous completion requests — i.e., inside LiteLLM's Azure integration layer. The OpenAI client library then wraps it as openai.APIConnectionError: Connection error, and LiteLLM finally raises litellm.exceptions.APIConnectionError: litellm.APIConnectionError: AzureException APIConnectionError - Connection error., which is what our application ultimately receives, with a message that summarizes the whole chain.
Tracing the error through these layers shows that it originates at the network level and is merely propagated up through the HTTP client, LiteLLM's Azure integration, and the OpenAI library. That tells us where to focus debugging: on the connection lifecycle underneath LiteLLM's aiohttp transport, not on the layers above it.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 59, in map_aiohttp_exceptions
    yield
  File "/usr/local/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 213, in handle_async_request
    response = await client_session.request(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<13 lines>...
    ).__aenter__()
    ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/aiohttp/client.py", line 1488, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/aiohttp/client.py", line 770, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/aiohttp/client.py", line 748, in _connect_and_send_request
    await resp.start(conn)
  File "/usr/local/lib/python3.13/site-packages/aiohttp/client_reqrep.py", line 532, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/aiohttp/streams.py", line 672, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/openai/_base_client.py", line 1519, in request
    response = await self._client.send(
               ^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/httpx/_client.py", line 1629, in send
    response = await self._send_handling_auth(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/httpx/_client.py", line 1657, in _send_handling_auth
    response = await self._send_handling_redirects(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/httpx/_client.py", line 1694, in _send_handling_redirects
    response = await self._send_single_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/httpx/_client.py", line 1730, in _send_single_request
    response = await transport.handle_async_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 206, in handle_async_request
    with map_aiohttp_exceptions():
         ~~~~~~~~~~~~~~~~~~~~~~^^
  File "/usr/local/lib/python3.13/contextlib.py", line 162, in __exit__
    self.gen.throw(value)
    ~~~~~~~~~~~~~~~~~~~~~~^^^
  File "/usr/local/lib/python3.13/site-packages/litellm/llms/custom_httpx/aiohttp_transport.py", line 73, in map_aiohttp_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/litellm/llms/azure/azure.py", line 412, in acompletion
    headers, response = await self.make_azure_openai_chat_completion_request(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<4 lines>...
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/litellm/litellm_core_utils/logging_utils.py", line 135, in async_wrapper
    result = await func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/litellm/llms/azure/azure.py", line 178, in make_azure_openai_chat_completion_request
    raise e
  File "/usr/local/lib/python3.13/site-packages/litellm/llms/azure/azure.py", line 165, in make_azure_openai_chat_completion_request
    raw_response = await azure_client.chat.completions.with_raw_response.create(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        **data, timeout=timeout
        ^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/openai/_legacy_response.py", line 381, in wrapped
    return cast(LegacyAPIResponse[R], await func(*args, **kwargs))
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/openai/resources/chat/completions/completions.py", line 2454, in create
    return await self._post(
           ^^^^^^^^^^^^^^^^^
    ...<45 lines>...
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/openai/_base_client.py", line 1784, in post
    return await self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/openai/_base_client.py", line 1551, in request
    raise APIConnectionError(request=request) from err
openai.APIConnectionError: Connection error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/litellm/main.py", line 541, in acompletion
    response = await init_response
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/litellm/llms/azure/azure.py", line 466, in acompletion
    raise AzureOpenAIError(status_code=500, message=message, body=body)
litellm.llms.azure.common_utils.AzureOpenAIError: Connection error.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/litellm/utils.py", line 1410, in wrapper_async
    result = await original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/litellm/main.py", line 560, in acompletion
    raise exception_type(
          ~~~~~~~~~~~~~~^
        model=model,
        ^^^^^^^^^^^^
    ...<3 lines>...
        extra_kwargs=kwargs,
        ^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2293, in exception_type
    raise e
  File "/usr/local/lib/python3.13/site-packages/litellm/litellm_core_utils/exception_mapping_utils.py", line 2058, in exception_type
    raise APIConnectionError(
    ...<4 lines>...
    )
litellm.exceptions.APIConnectionError: litellm.APIConnectionError: AzureException APIConnectionError - Connection error.

Conclusion

In short: we're facing intermittent API connection errors when using LiteLLM with Azure OpenAI reasoning models. We've laid out our setup, the symptoms, our observations, the diagnostic steps we've taken, and a full traceback showing the error's path through the stack. If you have insights or suggestions — especially if you're on the LiteLLM team, or have debugged similar Errno 104 resets on long-running requests — we'd love to hear from you. Thanks for reading!