RVCK Original Version UEFI Boot Issues And Solutions Discussion

by JurnalWarga.com

Introduction

This article looks at booting the original version of RVCK (RISC-V Common Kernel) in UEFI (Unified Extensible Firmware Interface) mode, the problems that show up along the way, and the fixes under discussion. Two issues get the attention here: kernel panics early in boot, and a system freeze once the kernel hands off to user space. We'll walk through the error logs, look at the suspected root causes, and go over the proposed fixes. So, if you're diving into RVCK and hitting roadblocks with UEFI, you're in the right place, guys!

Problem 1: Kernel Panic During UEFI Boot

The initial hurdle in booting the original RVCK version in UEFI mode is the occurrence of kernel panics. These panics manifest at seemingly random locations during the boot sequence, making it challenging to pinpoint the exact cause. Let's break down the error scenarios observed.

Error Scenario 1

One prevalent panic scenario involves a kernel paging request issue, as indicated by the following error log:

[    0.000000] Unable to handle kernel paging request at virtual address ffffaf8074bff7f4
[    0.000000] Oops [#1]
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.97-gf3cc9d65c887-dirty #11
[    0.000000] epc : memcmp+0x16/0x30
[    0.000000]  ra : start_kernel+0xa6/0x76c
[    0.000000] epc : ffffffff80987dce ra : ffffffff80a00774 sp : ffffffff81403e80
[    0.000000]  gp : ffffffff8150cb58 tp : ffffffff8140db00 t0 : 0000000000000061
[    0.000000]  t1 : 0000000000000073 t2 : 0000000000000003 s0 : ffffffff81403e90
[    0.000000]  s1 : ffffaf8074bff7f4 a0 : ffffaf8074bff7f4 a1 : ffffffff80fd7fd8
[    0.000000]  a2 : ffffaf8074bff800 a3 : 0000000000000ea0 a4 : 0000000000000001
[    0.000000]  a5 : ffffaf8074bff7f4 a6 : 0000000000000000 a7 : 0000000052464e43
[    0.000000]  s2 : ffffaf8074bff800 s3 : ffffffff8150e058 s4 : ffffaf8074bff7f0
[    0.000000]  s5 : 00000000f4c00000 s6 : 00000000ff078b18 s7 : ffffffffffffffff
[    0.000000]  s8 : 0000000000000000 s9 : 00000000f6ec0b90 s10: 0000000000000000
[    0.000000]  s11: 0000000000000000 t3 : ffffffffffffffff t4 : ffffffffffffffff
[    0.000000]  t5 : ffffffff81520918 t6 : ffffffff81403ab8
[    0.000000] status: 0000000200000100 badaddr: ffffaf8074bff7f4 cause: 000000000000000d
[    0.000000] [<ffffffff80987dce>] memcmp+0x16/0x30
[    0.000000] [<ffffffff80a00774>] start_kernel+0xa6/0x76c
[    0.000000] Code: e022 e406 0800 c215 87aa 962a a021 0585 8963 00c7 (c503) 0007 
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

This log shows a page fault on a kernel address: the kernel could not handle a paging request at virtual address ffffaf8074bff7f4. The cause value 0xd is the RISC-V code for a load page fault, so the failing access was a read. The fault happens inside memcmp, called directly from start_kernel, which puts it very early in kernel initialization. Note that badaddr matches a0, which holds memcmp's first pointer argument, so the buffer being compared appears not to be mapped at all. This type of panic, my friends, can be super tricky because the code that faults is usually fine; the real problem tends to be in how the memory was mapped earlier.
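To make that kind of reading repeatable, here is a minimal sketch that pulls the faulting symbol, bad address, and cause out of an oops log and decodes the cause. The cause table follows the exception codes in the RISC-V privileged specification; the function name summarize_oops and the regular expressions are illustrative assumptions, not part of any kernel tooling.

```python
import re

# RISC-V scause exception codes (interrupt bit clear).
SCAUSE_NAMES = {
    0x0: "instruction address misaligned",
    0x1: "instruction access fault",
    0x2: "illegal instruction",
    0x3: "breakpoint",
    0x4: "load address misaligned",
    0x5: "load access fault",
    0x6: "store/AMO address misaligned",
    0x7: "store/AMO access fault",
    0x8: "environment call from U-mode",
    0x9: "environment call from S-mode",
    0xC: "instruction page fault",
    0xD: "load page fault",
    0xF: "store/AMO page fault",
}

# Patterns written against the log format shown in this article (an assumption).
STATUS_RE = re.compile(r"badaddr: ([0-9a-f]+) cause: ([0-9a-f]+)")
EPC_RE = re.compile(r"epc : ([A-Za-z_]\w*)\+0x")


def summarize_oops(log_text):
    """Return the faulting symbol, bad address, and decoded cause of an oops."""
    status = STATUS_RE.search(log_text)
    if status is None:
        raise ValueError("no 'badaddr ... cause ...' line found")
    epc = EPC_RE.search(log_text)
    cause = int(status.group(2), 16)
    return {
        "symbol": epc.group(1) if epc else None,
        "badaddr": int(status.group(1), 16),
        "cause": SCAUSE_NAMES.get(cause, "unknown (%#x)" % cause),
    }


# Two lines copied from Error Scenario 1.
sample = """\
[    0.000000] epc : memcmp+0x16/0x30
[    0.000000] status: 0000000200000100 badaddr: ffffaf8074bff7f4 cause: 000000000000000d
"""
summary = summarize_oops(sample)
print(summary["symbol"], hex(summary["badaddr"]), summary["cause"])
# prints: memcmp 0xffffaf8074bff7f4 load page fault
```

Running the same helper over every captured panic makes it easier to see whether the "random" crash sites share a pattern, such as always being page faults on similar address ranges.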

Error Scenario 2

Another instance of kernel panic presents a different error signature:

[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.97-gf3cc9d65c887-dirty #11
[    0.000000] epc : __memset+0x60/0x100
[    0.000000]  ra : memblock_alloc_try_nid+0x74/0x84
[    0.000000] epc : ffffffff809911c0 ra : ffffffff80a15dd6 sp : ffffffff81403d70
[    0.000000]  gp : ffffffff8150cb58 tp : ffffffff8140db00 t0 : ff60000074800000
[    0.000000]  t1 : 0000001000000000 t2 : 206572617774666f s0 : ffffffff81403db0
[    0.000000]  s1 : 0000000004000000 a0 : ff60000074800000 a1 : 0000000000000000
[    0.000000]  a2 : 0000000004000000 a3 : ff60000078800000 a4 : 0000000000000000
[    0.000000]  a5 : ff5fffff80000000 a6 : 0000000004000000 a7 : 0000000000000000
[    0.000000]  s2 : ff60000074800000 s3 : ffffffffffffffff s4 : 0000000000000001
[    0.000000]  s5 : ffffffff81541518 s6 : 0000000000000fff s7 : 0000000000000000
[    0.000000]  s8 : ffffffff8150c5e0 s9 : fffffffffffff000 s10: 0000000000000000
[    0.000000]  s11: 0000000000000000 t3 : ffffffff81545ae0 t4 : ffffffff81545ae0
[    0.000000]  t5 : ffffffff815458b8 t6 : ffffffff81545ae0
[    0.000000] status: 0000000200000100 badaddr: ff60000074800000 cause: 000000000000000f
[    0.000000] [<ffffffff809911c0>] __memset+0x60/0x100
[    0.000000] [<ffffffff80a0d970>] swiotlb_init_remap+0xc2/0x272
[    0.000000] [<ffffffff80a0db32>] swiotlb_init+0x12/0x1a
[    0.000000] [<ffffffff80a06b46>] mem_init+0x2a/0x224
[    0.000000] [<ffffffff80a13288>] mm_core_init+0x112/0x2d0
[    0.000000] [<ffffffff80a00a9e>] start_kernel+0x3d0/0x76c
[    0.000000] Code: 1007 82b3 40e2 0797 0000 8793 00e7 8305 97ba 8782 (b023) 00b2 
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

Here the fault is inside __memset, and the return address (ra) points into memblock_alloc_try_nid, which zeroes the memory it has just allocated. The backtrace shows the allocation was requested by swiotlb_init_remap while setting up the SWIOTLB (Software I/O Translation Lookaside Buffer), the bounce buffer the kernel uses for DMA (Direct Memory Access) to devices that cannot reach all of memory. The cause value 0xf is a store/AMO page fault, and badaddr ff60000074800000 equals a0, so the very first write to the new buffer faulted. The length in a2 is 0x4000000 (64 MiB), which matches the kernel's default SWIOTLB size. This one's a bit of a head-scratcher at first, but it suggests memblock handed out a region that the linear mapping does not actually cover, which points at how memory was mapped rather than at memset itself.
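The register dump can be sanity-checked with a few lines of arithmetic. This is only a sketch: the values are copied from the log above, and reading a0 as the destination, a2 as the length, and a3 as the end of the range is an interpretation based on the calling convention, not something the log states.

```python
# Register values copied from the __memset oops above.
a0 = 0xFF60000074800000       # assumed: destination pointer passed to memset
a2 = 0x0000000004000000       # assumed: number of bytes to clear
a3 = 0xFF60000078800000       # assumed: end of the range being cleared
badaddr = 0xFF60000074800000  # faulting address reported by the kernel

size_mib = a2 // (1024 * 1024)
range_is_consistent = (a0 + a2 == a3)
fault_at_first_byte = (badaddr == a0)

print("buffer size:", size_mib, "MiB")
print("a0 + a2 == a3:", range_is_consistent)
print("fault at first byte:", fault_at_first_byte)
```

If the three checks hold, the write faulted before a single byte of the buffer was cleared, which fits an unmapped region better than a corrupted pointer.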

Root Cause Analysis: Suspected Patch Issue

Initial analysis suggests that these panics may be attributed to a faulty patch, specifically the one identified by the commit hash 42bceff74928191e2d4d1243b024b8874129602e. This commit, titled "riscv: kexec: Add image loader for kexec file," introduces changes to the arch/riscv/mm/init.c file. The problematic code snippet is:

--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1294,7 +1294,7 @@ static void __init create_linear_mapping_page_table(void)
                    __pa(PAGE_OFFSET) < end)
                        start = __pa(PAGE_OFFSET);
 
-               create_linear_mapping_range(start, end, 0);
+               create_linear_mapping_range(start, end, PMD_SIZE);
        }

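The suspect is the change of the third argument to create_linear_mapping_range from 0 to PMD_SIZE. That argument is a fixed mapping size: with 0, the kernel picks the largest page size that fits each chunk's alignment and length, while PMD_SIZE forces a 2 MiB mapping at every step. A plausible failure mode (not yet confirmed) is that UEFI memory maps contain many regions aligned only to 4 KiB, and a forced 2 MiB mapping that starts at such an address is misaligned, which the hardware treats as an invalid mapping. That would fit both panics above, since both bad addresses fall in the kernel's linear mapping. It's like changing a key setting in your car's engine and suddenly it won't start: a small change can have big consequences!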
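To see how a forced mapping size could go wrong, here is a simplified model of the mapping loop. It is a sketch, not the kernel code: best_map_size below only knows about 4 KiB and 2 MiB pages, the function names linear_mappings and problems are made up for illustration, and the example region is hypothetical.

```python
PAGE_SIZE = 4 * 1024
PMD_SIZE = 2 * 1024 * 1024


def best_map_size(pa, size):
    """Simplified stand-in: use 2 MiB only if pa is 2 MiB-aligned and enough is left."""
    if pa % PMD_SIZE == 0 and size >= PMD_SIZE:
        return PMD_SIZE
    return PAGE_SIZE


def linear_mappings(start, end, fixed_map_size=0):
    """Walk [start, end) and return the (pa, map_size) pairs that would be created."""
    mappings = []
    pa = start
    while pa < end:
        map_size = fixed_map_size or best_map_size(pa, end - pa)
        mappings.append((pa, map_size))
        pa += map_size
    return mappings


def problems(mappings, end):
    """Mappings that are misaligned for their size or that run past the region end."""
    return [(pa, size) for pa, size in mappings
            if pa % size != 0 or pa + size > end]


# Hypothetical memory region that starts on a 4 KiB boundary only.
start, end = 0x80201000, 0x80600000

adaptive = linear_mappings(start, end, 0)
forced = linear_mappings(start, end, PMD_SIZE)

print("adaptive:", len(adaptive), "mappings,", len(problems(adaptive, end)), "problems")
print("forced  :", len(forced), "mappings,", len(problems(forced, end)), "problems")
```

In this model the adaptive walk uses small pages until it reaches a 2 MiB boundary and never produces a bad entry, while the forced walk produces only misaligned 2 MiB entries, the last of which also runs past the end of the region.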

Problem 2: System Freeze After Kernel Boot

Beyond the kernel panics, a second issue shows up on boots where the kernel gets through initialization. The system freezes after printing "Run sbin/init as init process" and does not respond to any input. That means no shell and no commands, just a frozen screen. It's like reaching the finish line only to find the door locked!

Symptoms and Observations

The system log provides the following clues:

[    1.438894] Freeing initrd memory: 61944K
[    1.440902] Warning: unable to open an initial console.
[    1.500939] Freeing unused kernel image (initmem) memory: 2248K
[    1.501818] Run sbin/init as init process
[    1.502077]   with arguments:
[    1.502205]     sbin/init
[    1.502305]   with environment:
[    1.502419]     HOME=/
[    1.502512]     TERM=linux

The log shows the kernel freeing the initrd (initial RAM disk) memory and the unused kernel image memory, then starting sbin/init as the first user-space process. The warning "unable to open an initial console" is the red flag: the kernel prints it when it cannot open /dev/console for init, typically because no console device was registered or the device node is missing from the initramfs. This is a key point, guys, because without a console the system may well be running but has no way to show output or take input, which from the outside looks exactly like a freeze.
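One quick thing to rule out is a missing console setting on the kernel command line. The helper below is a minimal sketch that lists the console= and earlycon parameters in a command-line string; the function name console_args and the sample command line are hypothetical, and on UEFI systems the console can also come from firmware tables, so an empty result is a hint rather than a verdict.

```python
def console_args(cmdline):
    """Return the values of console= and earlycon parameters in a kernel command line."""
    found = {"console": [], "earlycon": []}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in found:
            found[key].append(value)
    return found


# Hypothetical command line, not taken from the failing system.
example_cmdline = "root=/dev/vda2 rw earlycon console=ttyS0,115200"
result = console_args(example_cmdline)
print(result)

# A command line with no console settings at all.
bare = console_args("root=/dev/vda2 rw")
print(bare)
```

On a running system the string would come from /proc/cmdline; for a system that never reaches a shell, it has to be taken from the boot loader configuration instead.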

Suspected Cause: Incomplete AIA Code Integration

The root cause of this freeze is not confirmed yet, but the preliminary suspicion is incomplete integration of the AIA (Advanced Interrupt Architecture) code. Previous RVCK-OLK versions did not show this issue, which supports the hypothesis. AIA defines the interrupt controllers that deliver interrupts to the harts, and if they are not set up correctly, the system can hang waiting for events that never arrive. It's like a traffic controller missing, causing a massive jam!

Proposed Solutions and Next Steps

Addressing these issues requires a multi-pronged approach. Here's what we need to do:

For Kernel Panic Issues:

  1. Revert the Suspect Patch: The immediate step is to revert the commit 42bceff74928191e2d4d1243b024b8874129602e and test if the kernel panics are resolved. This will help confirm whether the patch is indeed the root cause.
  2. Analyze Memory Mapping: If reverting the patch resolves the issue, a deeper analysis of the memory mapping changes introduced by the patch is necessary. We need to understand why the change from 0 to PMD_SIZE is causing the memory access violations.
  3. Implement Alternative Solutions: If the patch is necessary for other functionalities, alternative solutions that don't introduce the panics need to be explored. This might involve tweaking the memory mapping logic or finding a different way to achieve the desired outcome.

For System Freeze Issue:

  1. Review AIA Code Integration: A thorough review of the AIA code integration is crucial. This involves comparing the current implementation with the one in the previous RVCK-OLK versions to identify any missing or incorrectly implemented components.
  2. Console Initialization Debugging: Investigate the "unable to open an initial console" warning. This might involve checking the console driver configuration, the device tree settings, and any other relevant initialization code.
  3. Interrupt Handling Analysis: Analyze the interrupt handling mechanisms to ensure that interrupts are being correctly routed and processed. This involves checking the interrupt controller configuration and the interrupt handlers.
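When comparing against a previous RVCK-OLK boot, it helps to diff the logs with timestamps stripped. The sketch below reports messages that appear in one log but not the other. The failing log lines are copied from this article; the "good" log is a hypothetical stand-in (the same lines minus the warning, with different timestamps), and the function names are illustrative.

```python
import re

# Matches the "[    1.438894] " prefix used in the logs in this article.
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def messages(log_text):
    """Return log messages with timestamps removed, skipping blank lines."""
    return [TIMESTAMP_RE.sub("", line).rstrip()
            for line in log_text.splitlines() if line.strip()]


def only_in_first(first_log, second_log):
    """Messages from first_log that never appear in second_log, in original order."""
    seen = set(messages(second_log))
    return [m for m in messages(first_log) if m not in seen]


failing_log = """\
[    1.438894] Freeing initrd memory: 61944K
[    1.440902] Warning: unable to open an initial console.
[    1.500939] Freeing unused kernel image (initmem) memory: 2248K
[    1.501818] Run sbin/init as init process
"""

# Hypothetical log from a working boot (assumption for illustration only).
good_log = """\
[    1.401200] Freeing initrd memory: 61944K
[    1.470100] Freeing unused kernel image (initmem) memory: 2248K
[    1.471000] Run sbin/init as init process
"""

extra_in_failing = only_in_first(failing_log, good_log)
print(extra_in_failing)
# prints: ['Warning: unable to open an initial console.']
```

Swapping the arguments shows messages that only the working boot printed, which is the direction to look in for interrupt controller or console drivers that never probed. Exact string matching is crude, since messages containing addresses or sizes can differ between boots, so treat the result as a starting list.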

Conclusion

Booting RVCK in UEFI mode currently runs into two problems: kernel panics early in boot and a freeze at the hand-off to init. The way forward is careful debugging and root cause analysis: test a revert of the suspect patch, then review the AIA code integration and the console setup. So, let's roll up our sleeves, dive into the code, and get these issues sorted out, guys! Every bug fixed is a step closer to a more robust and reliable system, and that's what we're all aiming for, right?