Intel OpenCL Runtime Performance Issue With WPA in Hashcat
Introduction
Guys, let's dive into a peculiar issue with Intel's OpenCL runtime when cracking WPA in Hashcat. A recent test on commit 33445583961bcfd4c94987959e41a6198d2f82de revealed a significant performance bottleneck that seems specific to Intel's OpenCL implementation: cracking speed drops when the non-PMKID AUX kernels are enabled. This article covers the problem, a diagnostic patch, the measured performance difference, and possible causes, and explains why a deeper understanding of Intel's OpenCL JIT compiler is needed.
The Problem: Performance Hit with Non-PMKID AUX Kernels
In password cracking, performance is paramount. Every kilohash per second (kH/s) counts, and an unexpected slowdown directly lengthens the time needed to crack a password. The core issue is how the Intel OpenCL runtime behaves when Hashcat's WPA code contains several kinds of kernels. Put simply, the presence of the non-PMKID AUX kernels (AUX1, AUX2, and AUX3) appears to slow down the PMKID kernel (AUX4), even though these kernels should operate independently. This is unexpected and points to a potential inefficiency in how Intel's OpenCL JIT (Just-In-Time) compiler optimizes or executes them.
The problem was discovered during testing with Hashcat: disabling the non-PMKID AUX kernels produced a substantial increase in cracking speed. That was surprising, because the PMKID kernel (AUX4) should not be influenced by the presence or absence of other kernels. The noticeable improvement in AUX4 after disabling AUX1, AUX2, and AUX3 suggested a deeper issue within the Intel OpenCL runtime and prompted further investigation.
The Patch: A Temporary Workaround
To demonstrate the issue and quantify the performance impact, a patch was created. This patch is not intended for production use; it serves solely as a diagnostic tool to isolate and highlight the problem. It disables all non-PMKID AUX kernels (m22000_aux1, m22000_aux2, and m22000_aux3) in the OpenCL/m22000-pure.cl file. This restricts cracking to PMKIDs, which bypasses the performance bottleneck.
The code modification is straightforward: a return; statement is added at the beginning of each non-PMKID AUX kernel function. The kernels then exit immediately, which isolates the performance of the PMKID kernel (AUX4) from any interference. While this gives a significant speed boost in this specific scenario, it is not a long-term solution. Disabling these kernels removes the ability to crack passwords using those methods, which is obviously not desirable in real-world use. The patch is still valuable for understanding the problem and measuring the extent of the performance impact.
diff --git a/OpenCL/m22000-pure.cl b/OpenCL/m22000-pure.cl
index f6462637c..6aa7f7c30 100644
--- a/OpenCL/m22000-pure.cl
+++ b/OpenCL/m22000-pure.cl
@@ -403,6 +403,7 @@ KERNEL_FQ KERNEL_FA void m22000_comp (KERN_ATTR_TMPS_ESALT (wpa_pbkdf2_tmp_t, wp

 KERNEL_FQ KERNEL_FA void m22000_aux1 (KERN_ATTR_TMPS_ESALT (wpa_pbkdf2_tmp_t, wpa_t))
 {
+return;
   const u64 gid = get_global_id (0);

   if (gid >= GID_CNT) return;
@@ -596,6 +597,7 @@ KERNEL_FQ KERNEL_FA void m22000_aux1 (KERN_ATTR_TMPS_ESALT (wpa_pbkdf2_tmp_t, wp

 KERNEL_FQ KERNEL_FA void m22000_aux2 (KERN_ATTR_TMPS_ESALT (wpa_pbkdf2_tmp_t, wpa_t))
 {
+return;
   const u64 gid = get_global_id (0);

   if (gid >= GID_CNT) return;
@@ -779,6 +781,7 @@ KERNEL_FQ KERNEL_FA void m22000_aux2 (KERN_ATTR_TMPS_ESALT (wpa_pbkdf2_tmp_t, wp

 KERNEL_FQ KERNEL_FA void m22000_aux3 (KERN_ATTR_TMPS_ESALT (wpa_pbkdf2_tmp_t, wpa_t))
 {
+return;
   /**
    * aes shared
    */
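For anyone who wants to repeat the experiment on a fresh checkout, or to stub the kernels one at a time to see which one matters most, the same edit can be scripted. The following is a minimal Python sketch and not part of Hashcat: stub_kernels is a hypothetical helper, and it assumes each kernel signature sits on a single line with the opening brace alone on the next line, as in the diff above.

```python
import re
from typing import Iterable


def stub_kernels(source: str, kernel_names: Iterable[str]) -> str:
    """Return `source` with `return;` inserted as the first statement
    of every kernel listed in `kernel_names`.

    Assumption: the kernel signature is on one line and the opening
    brace is alone on the following line.
    """
    for name in kernel_names:
        # Capture the signature line plus the opening-brace line.
        pattern = re.compile(
            r"(KERNEL_FQ\s+KERNEL_FA\s+void\s+"
            + re.escape(name)
            + r"\s*\([^\n]*\)[ \t]*\n\{[ \t]*\n)"
        )
        # Re-emit the captured text followed by the early return.
        source, count = pattern.subn(
            lambda m: m.group(1) + "  return;\n", source, count=1
        )
        if count != 1:
            raise ValueError(f"kernel {name!r} not found")
    return source
```

Reading OpenCL/m22000-pure.cl, passing its text through stub_kernels with ["m22000_aux1", "m22000_aux2", "m22000_aux3"], and writing the result back should reproduce the effect of the patch. Passing a single name at a time is one way to check whether one kernel in particular triggers the slowdown.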
For those who wish to reproduce the results, it's essential to clear the kernel cache before and after applying the patch. This ensures that the OpenCL JIT compiler recompiles the kernels, providing a consistent baseline for comparison. The kernel cache is typically located in the kernels/ folder within the Hashcat directory. Deleting this folder forces a fresh compilation, eliminating any influence from previously compiled kernels.
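Clearing the cache can be scripted so it is not forgotten between runs. This is a small Python sketch under the assumption stated above, namely that the cache lives in a kernels/ folder directly under the Hashcat directory; clear_kernel_cache and its hashcat_dir parameter are hypothetical names, and a given installation may keep the cache somewhere else.

```python
import shutil
from pathlib import Path


def clear_kernel_cache(hashcat_dir) -> bool:
    """Delete <hashcat_dir>/kernels so the next run recompiles all kernels.

    Returns True if a cache folder was removed, False if none existed.
    Assumption: the cache sits in a kernels/ folder under hashcat_dir.
    """
    cache_dir = Path(hashcat_dir) / "kernels"
    if not cache_dir.is_dir():
        return False
    shutil.rmtree(cache_dir)
    return True
```

Calling clear_kernel_cache with the path of the Hashcat checkout once before the unpatched run and once before the patched run keeps both measurements on freshly compiled kernels.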
Performance Improvement: A Significant Boost
The results obtained after applying the patch were striking. Before the patch, the cracking speed was approximately 459.9 kH/s, with a processing time of 34.34 ms. After the patch, the speed jumped to 704.0 kH/s, with a processing time of 21.73 ms. That is an increase of roughly 53% in speed (704.0 / 459.9 ≈ 1.53), which shows how large the bottleneck caused by the non-PMKID AUX kernels is and how much room there is for optimization.
Both figures were obtained with the same parameters: Accel:512, Loops:512, Thr:16, and Vec:1. These parameters define the workload distribution and execution characteristics of the cracking process. The exact numbers may vary with the hardware, OpenCL driver version, and other system configuration, but the relative improvement observed after applying the patch is consistent. This suggests that the underlying issue is not specific to a particular configuration but rather a general problem with how the Intel OpenCL runtime handles these kernels.
Before:
Speed.#01........: 459.9 kH/s (34.34ms) @ Accel:512 Loops:512 Thr:16 Vec:1
After:
Speed.#01........: 704.0 kH/s (21.73ms) @ Accel:512 Loops:512 Thr:16 Vec:1
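The improvement can be computed directly from the two status lines. The sketch below is illustrative Python, with parse_speed as a hypothetical helper whose regular expression is written only against the line format shown above; other Hashcat output formats are not covered.

```python
import re

# Matches e.g. "Speed.#01........: 459.9 kH/s (34.34ms) @ ..."
SPEED_LINE = re.compile(
    r"Speed\.#\d+\.*:\s*([\d.]+)\s*([kMGT]?)H/s\s*\(([\d.]+)ms\)"
)
UNIT = {"": 1.0, "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}


def parse_speed(line: str):
    """Return (hashes_per_second, reported_ms) from a Speed status line."""
    match = SPEED_LINE.search(line)
    if match is None:
        raise ValueError("not a recognised Speed line")
    value, prefix, millis = match.groups()
    return float(value) * UNIT[prefix], float(millis)


before = "Speed.#01........: 459.9 kH/s (34.34ms) @ Accel:512 Loops:512 Thr:16 Vec:1"
after = "Speed.#01........: 704.0 kH/s (21.73ms) @ Accel:512 Loops:512 Thr:16 Vec:1"

speed_before, ms_before = parse_speed(before)
speed_after, ms_after = parse_speed(after)

speedup = speed_after / speed_before
time_saved = 1 - ms_after / ms_before

print(f"speedup: {speedup:.3f}x ({(speedup - 1) * 100:.1f}% faster)")
# prints "speedup: 1.531x (53.1% faster)"
print(f"reported time: {time_saved * 100:.1f}% shorter")
# prints "reported time: 36.7% shorter"
```

The two ratios agree with each other, as expected when the work per kernel call is unchanged: 21.73 / 34.34 ≈ 0.633 is close to 459.9 / 704.0 ≈ 0.653.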
Testing Across Different Platforms: Intel OpenCL's Uniqueness
To further investigate the issue, tests were conducted across various platforms, including CUDA (NVIDIA), HIP (AMD), and other OpenCL runtimes. The results were consistent: the patch had no discernible performance impact on these platforms. This indicates that the problem is specific to Intel's OpenCL implementation. The other platforms correctly isolate the kernels, ensuring that the presence or absence of non-PMKID kernels does not affect the performance of the PMKID kernel.
This observation strengthens the hypothesis that the Intel OpenCL JIT compiler is the root cause of the issue. The JIT compiler is responsible for translating the OpenCL code into machine code optimized for the specific hardware. It appears that Intel's JIT compiler is not handling these kernels optimally, leading to the observed performance degradation. The fact that other platforms, with their respective compilers and runtimes, do not exhibit this behavior suggests that there is room for improvement in Intel's OpenCL implementation.
The Mystery of the AUX4 Kernel: Why the Impact?
The perplexing aspect of this issue is that the AUX4 kernel (PMKID) should not be affected by the presence of AUX1-3 kernels. These are logically and functionally isolated kernels, each responsible for a specific part of the password cracking process. There is no explicit dependency or data sharing between these kernels that would explain the observed performance impact. The Intel OpenCL JIT compiler seems to be introducing some form of overhead or interference that is not present in other OpenCL implementations.
One possible explanation is that the JIT compiler is performing some form of global optimization that is not beneficial in this case. For example, it might be attempting to share resources or optimize memory access across all kernels, even though this is not necessary or efficient. Another possibility is that the JIT compiler is generating suboptimal code for the PMKID kernel due to the presence of the other kernels, perhaps by introducing unnecessary branching or synchronization overhead. Further investigation and profiling of the compiled code would be necessary to pinpoint the exact cause.
A Historical Parallel: DESCRYPT on NVIDIA
Interestingly, this isn't the first time a similar issue has been encountered. A long time ago, a comparable problem existed with OpenCL on NVIDIA, specifically with the DESCRYPT kernels. In that case, the presence of certain DESCRYPT kernels negatively impacted the performance of others, even though they should have been independent. This historical precedent suggests that the current issue with Intel OpenCL is not entirely unprecedented and that similar challenges can arise in the complex world of GPU programming and JIT compilation.
The DESCRYPT issue on NVIDIA was eventually resolved through driver updates and optimizations in the OpenCL runtime. This gives hope that the current issue with Intel OpenCL can also be addressed through similar means. However, it also underscores the need for careful testing and optimization of OpenCL implementations to ensure that kernels perform as expected and that performance is not negatively impacted by unexpected interactions between seemingly independent code sections.
The Search for a Solution: No Easy Fix in Sight
As of now, there isn't a straightforward or clean solution to this problem. The issue appears to be deeply rooted in the Intel OpenCL JIT compiler and its interaction with the specific kernels used in Hashcat. A simple workaround, like the patch discussed earlier, can provide a temporary performance boost, but it comes at the cost of disabling functionality. A proper solution would require a more fundamental understanding of the JIT compiler's behavior and the identification of the specific optimization or code generation pattern that is causing the bottleneck.
Ideally, Intel would address this issue in a future driver update or OpenCL runtime release. Until that happens, users of Hashcat on Intel GPUs may see suboptimal performance when cracking WPA. In the meantime, further research and experimentation are needed to find workarounds or alternative coding strategies that mitigate the impact. This could involve restructuring the OpenCL code, using different optimization flags, or exploring other ways to influence the JIT compiler's behavior.
Conclusion: A Call for Further Investigation
The Intel OpenCL runtime speed issue on WPA highlights the complexities of GPU programming and the challenges of achieving optimal performance across different platforms. The unexpected performance degradation caused by non-PMKID AUX kernels points to a potential inefficiency in Intel's OpenCL JIT compiler. While a temporary patch can alleviate the issue, a comprehensive solution requires a deeper understanding of the underlying cause and a more targeted optimization strategy. This issue serves as a reminder of the importance of thorough testing and performance analysis in the development of GPU-accelerated applications.
Further investigation is needed to pinpoint the specific code generation or optimization pattern that is causing the bottleneck. This could involve profiling the compiled code, examining the JIT compiler's output, and experimenting with different coding styles and optimization flags. Collaboration between the Hashcat developers and Intel's OpenCL engineers would be invaluable in finding a lasting solution to this problem. Ultimately, addressing this issue will not only improve the performance of Hashcat on Intel GPUs but also contribute to a more robust and efficient OpenCL ecosystem.
Keywords
Intel OpenCL, WPA, Hashcat, performance, runtime, speed, PMKID, AUX kernels, JIT compiler, password cracking