ScyllaDB 2025.3.0-rc1 Reactor Stall Investigation in row_cache::do_update()
Hey everyone! We've got an interesting issue to dive into today regarding a reactor stall observed in ScyllaDB version 2025.3.0-rc1. This happened after a node upgrade, and it's crucial we understand what went down and how to prevent it in the future. Let's break it down in a way that's easy to follow, even if you're not a ScyllaDB expert.
Issue Description
Regression Alert
First off, it's important to note that this issue is flagged as a regression. This means the problem wasn't present in the previous release and was introduced (or re-introduced) by this one. Identifying regressions is super important because it helps us pinpoint exactly what changes might have caused the problem. In this case, the reactor stall occurred after upgrading a node from ScyllaDB version 2025.2.0-20250625.33e947e75342 to 2025.3.0~rc1-20250710.f3297824e397. Understanding the specific builds involved helps narrow down the search for the root cause.
The Reactor Stall Event
So, what exactly happened? After upgrading node-2, the system experienced a reactor stall. A reactor stall happens when a single task monopolizes the Seastar reactor (the per-shard event loop) for too long without yielding, so no other task on that shard can make progress. This is a big deal because it can lead to performance degradation and, in severe cases, even downtime. The specific error message logged was:
2025-07-14 09:09:25.784 <2025-07-14 09:09:24.961>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=1d9cf4c6-105f-4974-a601-3701456a0023: type=REACTOR_STALLED regex=Reactor stalled line_number=48711 node=rolling-upgrade--ubuntu-focal-db-node-81e52e9b-eastus-2
2025-07-14T09:09:24.961+00:00 rolling-upgrade--ubuntu-focal-db-node-81e52e9b-eastus-2 !INFO | scylla[4104]: Reactor stalled for 845 ms on shard 0, in scheduling group main.
This message tells us a few key things:
- The stall lasted for 845 milliseconds, which is a pretty long time in the world of databases.
- It happened on shard 0, indicating the issue might be specific to how that shard is handling operations.
- The stall occurred within the main scheduling group, suggesting it's a core process that's affected.
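To make the detection side of this concrete, here's a minimal, self-contained sketch (plain C++, not Seastar code) of the general idea behind a stall detector: the event loop records a "last progress" timestamp after each short task, and a watchdog complains if too much time passes without an update. Seastar's real cpu_stall_detector is timer- and signal-driven and prints a decoded backtrace, as the cpu_stall_detector::on_signal() frames in the backtrace below show; everything in this sketch (the threshold, last_progress, the watchdog thread) is purely illustrative.

```cpp
// Illustrative sketch of a stall detector, not Seastar's implementation.
// A watchdog thread reports when the "event loop" goes too long without
// recording progress -- the same symptom the real detector flags.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    using clk = std::chrono::steady_clock;
    std::atomic<clk::rep> last_progress{clk::now().time_since_epoch().count()};
    std::atomic<bool> running{true};
    const auto threshold = std::chrono::milliseconds(500);

    // Watchdog: plays the role of the cpu_stall_detector.
    std::thread watchdog([&] {
        while (running) {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            auto last = clk::time_point(clk::duration(last_progress.load()));
            auto stalled = std::chrono::duration_cast<std::chrono::milliseconds>(clk::now() - last);
            if (stalled > threshold) {
                std::printf("Reactor stalled for %lld ms\n", static_cast<long long>(stalled.count()));
            }
        }
    });

    // "Event loop": well-behaved tasks update last_progress frequently...
    for (int i = 0; i < 5; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        last_progress = clk::now().time_since_epoch().count();
    }
    // ...then one task runs for ~900 ms without yielding, and the watchdog fires.
    std::this_thread::sleep_for(std::chrono::milliseconds(900));
    last_progress = clk::now().time_since_epoch().count();

    running = false;
    watchdog.join();
    return 0;
}
```

In the real system the flagged code runs on the reactor thread itself, which is exactly why an 845 ms stall on shard 0 means nothing else on that shard ran for almost a second.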
Diving into the Backtrace
The log also includes a backtrace, which is a snapshot of the call stack at the moment the stall was detected. It shows the sequence of function calls that led to the stall, and analyzing it is crucial for pinpointing the exact code path that's causing the problem. The backtrace provided points to the row_cache::do_update() function as the culprit:
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89
seastar::backtrace_buffer::append_backtrace_oneline() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:831
(inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:853
seastar::internal::cpu_stall_detector::generate_trace() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:1464
seastar::internal::cpu_stall_detector::maybe_report() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:1199
(inlined by) seastar::internal::cpu_stall_detector::on_signal() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:1219
?? ??:0
cache_entry::set_continuous(bool) at ././db/row_cache.hh:131
(inlined by) cache_entry::on_evicted(cache_tracker&) at ././db/row_cache.cc:1321
(inlined by) rows_entry::on_evicted(cache_tracker&) at ././db/row_cache.cc:1365
std::_Function_handler<seastar::memory::reclaiming_result (), cache_tracker::cache_tracker(utils::updateable_value<double>, mutation_application_stats&, seastar::bool_class<register_metrics_tag>)::$_0>::_M_invoke(std::_Any_data const&) at ././utils/lru.hh:137
logalloc::tracker::impl::compact_and_evict_locked(unsigned long, unsigned long, seastar::bool_class<is_preemptible_tag>) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591
logalloc::allocating_section::reserve(logalloc::tracker::impl&) at ././utils/logalloc.cc:2732
seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::async<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}>(seastar::thread_attributes, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&, basic_preemption_source&)::$_0, basic_preemption_source&)::{lambda()#1}::operator()() const::{lambda()#2}&&)::{lambda()#1}>::call(seastar::noncopyable_function<void ()> const*) at ././utils/logalloc.hh:473
seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:215
(inlined by) seastar::thread_context::main() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/thread.cc:318
Reading the application frames from the bottom up, the stall happens inside row_cache::do_update(): while reserving memory from the log-structured allocator (logalloc::allocating_section::reserve), the allocator has to compact and evict cache entries, and that eviction work (rows_entry::on_evicted, cache_entry::set_continuous) is what ran long enough to stall the reactor. In other words, the row cache, the in-memory area that stores frequently accessed rows for faster retrieval, is implicated through its eviction path during an update. Let's dig deeper into what this means.
Key Functions in the Backtrace
To really understand what's happening, let's look at some of the key functions in the backtrace:
- cache_entry::set_continuous(bool): This function likely sets a flag indicating whether a cache entry is part of a continuous sequence. It could be related to how ScyllaDB manages memory fragmentation within the cache.
- cache_entry::on_evicted(cache_tracker&): This is called when a cache entry is evicted (removed) from the cache, usually to make space for new entries. The cache_tracker is probably responsible for managing the cache's overall state.
- rows_entry::on_evicted(cache_tracker&): Similar to the previous function, but specific to entries that store rows of data. This indicates the eviction process for row data might be involved.
- std::_Function_handler<seastar::memory::reclaiming_result (), ...>::_M_invoke(std::_Any_data const&): This is a C++ function object handler, suggesting that a function is being called as part of the eviction process.
- logalloc::tracker::impl::compact_and_evict_locked(...): This function is responsible for compacting and evicting entries from the log-structured allocator, which is a memory management technique used in ScyllaDB. The locked suffix suggests this operation requires a lock, which could be a potential source of contention.
- seastar::async<row_cache::do_update<...>(...)>::{lambda()#1}::operator()() const::{lambda()#2}: This looks like an asynchronous operation related to updating the row cache. Asynchronous operations are designed to prevent blocking, but if they're not handled efficiently, they can still contribute to stalls.
- row_cache::do_update(...): This is the main suspect! It suggests the stall is happening during the process of updating the row cache. Understanding how this function works is key to solving the issue. (A simplified sketch of how the eviction loop above can run away is shown right after this list.)
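Putting those frames together: the reservation in allocating_section::reserve() triggers compact_and_evict_locked(), which walks the LRU and calls on_evicted() on entry after entry. Each call is cheap, but if the loop has to evict a huge number of entries before enough memory is reclaimed, and it doesn't yield in between, the whole thing runs on the reactor thread in one go. Here's a deliberately simplified, hypothetical sketch of that pattern (lru_list, cache_item, and the numbers are invented; this is not ScyllaDB's code):

```cpp
// Hypothetical illustration only (not ScyllaDB source): an eviction loop that
// reclaims memory by walking an LRU list. Run synchronously on the reactor
// thread without yielding, a loop like this is exactly what a stall looks like.
#include <cstdio>
#include <list>

struct cache_item {
    bool continuous = true;
    void on_evicted() {
        // Stand-in for rows_entry::on_evicted() / cache_entry::set_continuous():
        // per-entry bookkeeping that is cheap on its own but adds up over
        // millions of evictions.
        continuous = false;
    }
};

int main() {
    std::list<cache_item> lru_list(1'000'000);  // a large, warm cache
    std::size_t evicted = 0;
    const std::size_t reclaim_goal = 900'000;   // "free this much before returning"

    // Stand-in for compact_and_evict_locked(): evict from the cold end of the
    // LRU until the reservation can be satisfied. There is no preemption check
    // here, so the reactor cannot run anything else until the loop finishes.
    while (evicted < reclaim_goal && !lru_list.empty()) {
        lru_list.back().on_evicted();
        lru_list.pop_back();
        ++evicted;
    }
    std::printf("evicted %zu entries in one uninterrupted burst\n", evicted);
    return 0;
}
```

A preemption-friendly version of such a loop would check a preemption source every so often and yield back to the reactor, which is the kind of behavior the basic_preemption_source visible in the backtrace is meant to enable.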
The Big Picture: Row Cache Operations
Based on the backtrace, the reactor stall seems to be occurring during an update operation on the row cache. The row cache is a critical component for performance, as it stores frequently accessed rows in memory. When the cache needs to be updated (e.g., when data changes or when entries need to be evicted), these operations need to be performed quickly and efficiently. If the update process gets bogged down, it can lead to reactor stalls like the one we're seeing here.
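The backtrace also shows that do_update() runs inside seastar::async() and is handed a basic_preemption_source, which hints that the update path is meant to be able to yield while it works. As a rough, hedged sketch of that shape, here's what a preemptible update loop looks like using Seastar's public API; this is not ScyllaDB's actual row_cache code, and apply_updates() plus the integer "partitions" are invented placeholders.

```cpp
// Hedged sketch of a preemptible update loop using Seastar's public API
// (seastar::async, seastar::need_preempt, seastar::thread::yield). NOT
// row_cache::do_update(); apply_updates() and its data are illustrative.
#include <seastar/core/app-template.hh>
#include <seastar/core/thread.hh>
#include <seastar/core/preempt.hh>
#include <vector>

seastar::future<> apply_updates(std::vector<int> pending) {
    // seastar::async gives us a stackful thread context we can yield from.
    return seastar::async([pending = std::move(pending)] {
        for (int partition : pending) {
            (void)partition;              // merge this "partition" into the cache (elided)
            if (seastar::need_preempt()) {
                seastar::thread::yield(); // let the reactor run other tasks on this shard
            }
        }
    });
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        return apply_updates(std::vector<int>(1000, 42));
    });
}
```

The open question for the investigation is why, in this run, the work between yield points, here represented by the eviction triggered inside the memory reservation, grew large enough to hold the reactor for 845 ms.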
Impact of the Reactor Stall
Performance Degradation
The main impact of a reactor stall is, of course, performance degradation. When the reactor stalls, it means that the database node is unable to process requests efficiently. This can manifest as increased latency, slower query execution times, and overall reduced throughput. For users, this means a sluggish and unresponsive application.
Potential for Downtime
In more severe cases, prolonged or frequent reactor stalls can lead to downtime. If the reactor is stalled for too long, the node might become unresponsive, requiring a restart. This can disrupt services and lead to data unavailability. Therefore, it's crucial to address reactor stalls promptly to prevent them from escalating into more serious issues.
How Frequently Does It Reproduce?
To understand the severity of this issue, it's important to know how frequently it occurs. Is it a one-off event, or does it happen consistently? This information helps prioritize the investigation and determine the urgency of a fix. Unfortunately, the provided information doesn't specify the exact reproduction frequency. However, the fact that it was observed after a node upgrade suggests it might be related to the upgrade process itself or to specific conditions triggered by the new version.
Installation Details
Cluster Configuration
Let's talk about the setup where this issue occurred. The cluster consists of 6 nodes, each running on a Standard_L8s_v3 Azure instance. These instances are pretty beefy, so it's unlikely that resource constraints are the primary cause of the stall, but it's still good to know the hardware specs. Each node has 7 shards, which is the number of parallel processing units within ScyllaDB. This information helps us understand the scale of the system and how the workload is distributed.
Nodes Involved
The reactor stall was specifically observed on rolling-upgrade--ubuntu-focal-db-node-81e52e9b-eastus-2, but it's worth noting all the nodes in the cluster:
- rolling-upgrade--ubuntu-focal-db-node-81e52e9b-eastus-6 (10.0.0.10)
- rolling-upgrade--ubuntu-focal-db-node-81e52e9b-eastus-5 (10.0.0.9)
- rolling-upgrade--ubuntu-focal-db-node-81e52e9b-eastus-4 (10.0.0.8)
- rolling-upgrade--ubuntu-focal-db-node-81e52e9b-eastus-3 (10.0.0.7)
- rolling-upgrade--ubuntu-focal-db-node-81e52e9b-eastus-2 (10.0.0.6) - The culprit!
- rolling-upgrade--ubuntu-focal-db-node-81e52e9b-eastus-1 (10.0.0.5)
Knowing the specific node where the stall occurred helps focus the investigation. We can examine logs and metrics from that node in detail to understand what was happening at the time of the stall.
Operating System and Image
The nodes are running on an Ubuntu Focal image (/CommunityGalleries/scylladb-7e8d8a04-23db-487d-87ec-0e175c0615bb/Images/scylla-2025.2/Versions/2025.2.0) within Azure. The OS and image version can sometimes play a role in performance issues, so this is good information to have. It ensures that the environment is consistent across the cluster, which is important for troubleshooting.
Test Environment
This issue was encountered during a rolling upgrade test (rolling-upgrade-azure-image-test). This type of test is designed to simulate a real-world upgrade scenario, where nodes are upgraded one at a time while the cluster remains online. This makes the discovery of this stall during the test particularly valuable, as it highlights a potential problem that could affect users performing upgrades.
Test Details
The test ID is 81e52e9b-470d-46ca-9daa-9dd7b72dc6e5, and the test name is scylla-2025.3/rolling-upgrade/rolling-upgrade-azure-image-test. The test method used was upgrade_test.UpgradeTest.test_rolling_upgrade, which further confirms that the issue is related to the upgrade process. The test configuration file, rolling-upgrade.yaml, can provide additional details about the test setup and workload.
Logs and Commands
The provided information includes a treasure trove of logs and commands that can help in the investigation. Here's a quick rundown:
- Restore Monitor Stack command:
$ hydra investigate show-monitor 81e52e9b-470d-46ca-9daa-9dd7b72dc6e5
This command allows you to view the monitoring data collected during the test run. Monitoring data can provide valuable insights into system performance and resource utilization.
- Restore monitor on AWS instance using Jenkins job: This Jenkins job automates the process of restoring the monitoring stack on an AWS instance, making it easier to analyze the data.
- Show all stored logs command:
$ hydra investigate show-logs 81e52e9b-470d-46ca-9daa-9dd7b72dc6e5
This command retrieves all the logs generated during the test run. Logs are a goldmine of information for troubleshooting, as they often contain error messages, warnings, and other clues about what went wrong.
Available Logs
Several log files are available for analysis:
- db-cluster-81e52e9b.tar.zst: Contains logs from the ScyllaDB cluster itself.
- schema-logs-81e52e9b.tar.zst: Includes logs related to schema changes and operations.
- sct-runner-events-81e52e9b.tar.zst: Contains events from the Scylla Cluster Test (SCT) runner, which is used for running the tests.
- sct-81e52e9b.log.tar.zst: The main log file for the Scylla Cluster Test.
- loader-set-81e52e9b.tar.zst: Logs related to data loading operations.
- monitor-set-81e52e9b.tar.zst: Logs from the monitoring system.
- ssl-conf-81e52e9b.tar.zst: Logs related to SSL configuration.
- builder-81e52e9b.log.tar.gz: Logs from the build process.
These logs provide a comprehensive view of what happened during the test, from the initial setup to the execution of the workload and the final results. Analyzing these logs is crucial for understanding the root cause of the reactor stall.
Additional Resources
The information also includes links to the Jenkins job and Argus, which are valuable resources for further investigation:
- Jenkins job URL: This link provides access to the Jenkins job that ran the test. You can view the job history, configuration, and results.
- Argus: Argus is a monitoring and analysis platform used by ScyllaDB. This link provides access to the monitoring data collected during the test run, allowing you to visualize performance metrics and identify anomalies.
Conclusion
So, there you have it! A deep dive into a reactor stall issue in ScyllaDB 2025.3.0-rc1. We've covered the error message, the backtrace, the impact on performance, and the installation details. The next step is to analyze those logs and monitoring data to pinpoint the exact cause of the stall in row_cache::do_update(). This kind of detailed analysis is what helps keep ScyllaDB rock-solid. Stay tuned for further updates as the investigation progresses!