Scripted Rebuild of Corrupt Splunk Buckets
Throughout my Splunk adventures, I’ve run into various situations where I’ve had to rescue data for one reason or another. Recently, one of them required a good deal of automation to save thousands of Splunk buckets after an entire cluster was rendered inoperable (more on this from Steve Salisbury soon). We will walk through some of the tools that were created to perform that emergency surgery.
Quick Primer – What’s in a Bucket?
Splunk’s basic unit of index storage is a bucket. Within a bucket we have a few key components:
- journal.gz is where events are stored. This should be considered the most critical piece of the bucket, as just about everything else can be rebuilt from it.
- TSIDX files are the “magic sauce” for Splunk. They’re time series term index files that contain pointers to the raw data. It’s how Splunk can be so quick with something like “index=firewall 126.96.36.199”. These files can be rebuilt from the journal.
There are a handful of other files that make up a bucket, but we really just care about the journal.gz for the purposes of this article.
As Splunk receives events, it will categorize them into indexes and further into buckets based on the metadata associated with the event (host, index, source, sourcetype). Splunk selects a bucket and writes out the journal.gz on disk, a slice at a time. A slice is simply a chunk of events (default is 128K) compressed and glued to the journal.gz. As a result, Splunk can use pointers to reach into places within the journal.gz and only decompress 128K at a time.
Armed with the knowledge that a journal.gz is essentially a bunch of glued together gzip files, we are going to use this to our advantage when recovering data.
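To make the "glued together gzip files" idea concrete, here is a minimal Python sketch. It does not read a real Splunk journal.gz (the actual on-disk layout has more to it); it only demonstrates the general property of concatenated gzip members. The names build_journal, read_slice and recover are hypothetical, and the offsets list stands in for the pointers a TSIDX file would hold.

```python
import gzip
import zlib


def build_journal(slices):
    """Glue independently gzip-compressed slices together.

    Returns the concatenated bytes plus the start offset of each slice.
    """
    blob = bytearray()
    offsets = []
    for data in slices:
        offsets.append(len(blob))
        blob += gzip.compress(data)
    return bytes(blob), offsets


def read_slice(blob, start, end):
    """Decompress only the gzip member stored in blob[start:end]."""
    decomp = zlib.decompressobj(wbits=31)  # 31 = expect gzip header and trailer
    data = decomp.decompress(blob[start:end])
    if not decomp.eof:
        raise ValueError("truncated slice")
    return data


def recover(blob, offsets):
    """Return every slice that still decompresses; None marks a bad slice."""
    bounds = list(offsets) + [len(blob)]
    recovered = []
    for start, end in zip(bounds, bounds[1:]):
        try:
            recovered.append(read_slice(blob, start, end))
        except (zlib.error, ValueError):
            recovered.append(None)
    return recovered


events = [b"event one\n", b"event two\n", b"event three\n"]
journal, offsets = build_journal(events)

# Flip one byte in the CRC trailer of the middle slice to simulate corruption.
damaged = bytearray(journal)
damaged[offsets[2] - 8] ^= 0xFF
result = recover(bytes(damaged), offsets)
```

Because each slice is its own gzip member, the damaged middle slice fails its checksum while the slices on either side still decompress cleanly. That is the property the recovery process below leans on.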
Recovering Corrupted Buckets
Understanding the basics of a Splunk bucket is half the battle. Knowing that a corrupted bucket is not fatal, we can utilize Splunk’s built-in tools to recover.
To be clear here, this process is something that should be used as a last resort when fsck fails to rebuild a bucket. It’s a very expensive process (lots of CPU and disk used), so be aware of the storage requirements when exporting data. A normal 10GB bucket could easily be 40GB after export.
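Given that growth, it can be worth checking free space before kicking off an export. The following is a rough Python sketch, not part of the tooling described in this post. The function names are hypothetical, and the default ratio of 4 is only an assumption taken from the 10GB to 40GB example above; real expansion will vary with the data.

```python
import os
import shutil


def bucket_size(bucket_dir):
    """Total size in bytes of every file under a bucket directory."""
    total = 0
    for root, _dirs, files in os.walk(bucket_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def has_room_for_export(bucket_dir, temp_dir, ratio=4.0):
    """True if temp_dir has at least ratio * bucket size free.

    ratio=4.0 is an assumed rule of thumb, not a Splunk guarantee.
    """
    needed = bucket_size(bucket_dir) * ratio
    free = shutil.disk_usage(temp_dir).free
    return free >= needed
```

A check like this is cheap compared to an export that dies halfway through because the temp volume filled up.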
Splunk’s export tool exports data from a bucket to a raw data format. We’ll want to use a CSV.
splunk cmd exporttool <path_to_bucket> <path_to_temp_storage> -csv
This will export an entire bucket to CSV for us. The key here is the tool handles slice errors gracefully. That means we can have 1000s of corrupted slices and still recover the remaining valid data!
Splunk’s import tool can now be used to create a fresh bucket that does not contain any corrupted slices.
splunk cmd importtool <path_to_bucket> <path_to_csv> --create-bloomfilter
This will recreate an entirely new bucket based on the CSV we created with exporttool. It will be a fully searchable bucket, ready to be moved into an index location.
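The two steps chain together naturally. Below is a minimal Python sketch of that chain, assuming splunk is on the PATH and that the same CSV path can be handed to both tools. The function names are hypothetical, and only the arguments shown in the commands above are used; any additional importtool options are left to the caller through extra_args rather than hard-coded.

```python
import subprocess


def export_command(bucket_dir, csv_path, splunk="splunk"):
    """Command line for exporting a bucket to CSV, as shown in this post."""
    return [splunk, "cmd", "exporttool", bucket_dir, csv_path, "-csv"]


def import_command(new_bucket_dir, csv_path, splunk="splunk", extra_args=()):
    """Command line for building a fresh bucket from an exported CSV."""
    return [splunk, "cmd", "importtool", new_bucket_dir, csv_path, *extra_args]


def rebuild_bucket(bucket_dir, csv_path, new_bucket_dir, run=subprocess.run):
    """Export then import; check=True stops the chain if a step fails.

    run is injectable so the sequence can be dry-run or tested without Splunk.
    """
    run(export_command(bucket_dir, csv_path), check=True)
    run(import_command(new_bucket_dir, csv_path), check=True)
```

Keeping the command construction separate from execution makes it easy to print the commands first and eyeball them before pointing this at real buckets.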
Now Automate it!
Following the steps above works well for a small number of buckets. Now imagine if you had to do this for an entire indexer cluster! We needed to automate this to run it against a large number of buckets. Our use case was saving a Windows Indexer Cluster, but the concepts should apply just as well to Linux (and let me know if there’s a desire for some tooling around Linux). I’m by no means a PowerShell expert, so pull requests against the code are welcomed! I hope some of these tools come in handy.
The recovery script (splunk_bucket_recovery_threaded.ps1) is the bread and butter of what needs to be done. It wraps the export and import tools in PowerShell and allows you to specify a directory to operate on. You could move all the known bad buckets to a specific directory and fix them while Splunk runs or, if 100% of your buckets are broken, operate on them in place. I’d highly recommend backups, but we wouldn’t be here if you had backups anyways!
.\splunk_bucket_recovery_threaded.ps1 -old_path <PATH> -new_path <PATH> -threads <THREADS> -log <LOGFILE>
This script takes a source (where you want to run splunk exporttool) and a destination (where you want the well formed buckets to land).
This script depends on Invoke-Parallel. If you’re running with many threads, this WILL use a lot of memory.
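For readers who want the shape of the threaded approach without reading PowerShell, here is a hedged Python sketch of the same idea. It is not a port of the script above. It assumes bucket directories start with "db_", and it takes the per-bucket work as a worker callable (hypothetical; for example, something that runs exporttool followed by importtool). The names find_buckets and recover_all are also hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def find_buckets(old_path):
    """Collect directories that look like buckets (assumed 'db_' prefix)."""
    found = []
    for root, dirs, _files in os.walk(old_path):
        for name in dirs:
            if name.startswith("db_"):
                found.append(os.path.join(root, name))
    return sorted(found)


def recover_all(old_path, new_path, worker, threads=4):
    """Run worker(bucket, target) for every bucket, a few at a time.

    Returns a dict mapping each source bucket to "ok" or "failed: <reason>",
    so one bad bucket does not stop the rest of the run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {}
        for bucket in find_buckets(old_path):
            # Keep the index/db layout under new_path to avoid name clashes.
            target = os.path.join(new_path, os.path.relpath(bucket, old_path))
            futures[pool.submit(worker, bucket, target)] = bucket
        for future in as_completed(futures):
            bucket = futures[future]
            try:
                future.result()
                results[bucket] = "ok"
            except Exception as exc:
                results[bucket] = f"failed: {exc}"
    return results
```

Each thread holds one export in flight, so the thread count is the knob that trades wall-clock time against memory and disk pressure.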
The rename script (bucket_rename.ps1) is pretty simple – it takes a path that contains non-clustered buckets and renames the buckets to cluster-aware buckets (not multi-site though).
.\bucket_rename.ps1 -src <PATH> -guid <GUID>
This script takes a path and guid as arguments, finds non-clustered buckets and makes them clustered.
Path: The Splunk index path (“c:\program files\splunk\var\lib\splunk” for example). Right now it operates on an entire volume (not a specific index).
Guid: The guid you’d like to append to buckets. This MUST include the leading “_”.
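The renaming logic can be sketched in a few lines of Python. This is an illustration rather than the script itself, and it assumes a non-clustered bucket is named db_<newest>_<oldest>_<id> with purely numeric fields, with the clustered form being that name plus the guid suffix. The function names and the all-zero guid in the usage line are hypothetical.

```python
import os
import re

# Assumed shape of a non-clustered bucket name: db_<newest>_<oldest>_<id>
STANDALONE = re.compile(r"^db_\d+_\d+_\d+$")


def clustered_name(name, guid):
    """Return the clustered name for a standalone bucket, else None."""
    if not guid.startswith("_"):
        raise ValueError("guid must include the leading underscore")
    if STANDALONE.match(name):
        return name + guid
    return None


def rename_buckets(src, guid, dry_run=True):
    """Plan (and optionally perform) renames for every bucket under src."""
    planned = []
    for root, dirs, _files in os.walk(src):
        for name in list(dirs):
            new_name = clustered_name(name, guid)
            if new_name is None:
                continue
            old_path = os.path.join(root, name)
            new_path = os.path.join(root, new_name)
            planned.append((old_path, new_path))
            if not dry_run:
                os.rename(old_path, new_path)
            dirs.remove(name)  # no need to descend into a bucket
    return planned
```

Calling rename_buckets(path, "_00000000-0000-0000-0000-000000000000") with the default dry_run=True only returns the planned renames, which is a sensible first pass before touching anything on disk.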
The bad-bucket script (bad_bucket.ps1) makes a judgement based on a bucket’s size and moves buckets that appear to have no events in the journal. This comes in handy because the export / import tools will make a bucket regardless of whether there are any non-corrupted events to operate on.
.\bad_bucket.ps1 -src <Source directory> -dest <where to copy bad buckets>
This script looks for buckets under 6000 bytes total and moves them.
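The size check is simple enough to sketch in Python. As with the other examples, this is an illustration under assumptions, not the script itself: buckets are assumed to be directories starting with "db_", the function names are hypothetical, and the 6000-byte threshold is the one quoted above.

```python
import os
import shutil


def dir_size(path):
    """Total size in bytes of every file under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def move_small_buckets(src, dest, threshold=6000, dry_run=True):
    """Find buckets smaller than threshold bytes; move them unless dry_run."""
    small = []
    for root, dirs, _files in os.walk(src):
        for name in list(dirs):
            if not name.startswith("db_"):
                continue
            dirs.remove(name)  # treat the bucket as a unit, do not descend
            bucket = os.path.join(root, name)
            if dir_size(bucket) < threshold:
                small.append(bucket)
                if not dry_run:
                    os.makedirs(dest, exist_ok=True)
                    shutil.move(bucket, os.path.join(dest, name))
    return sorted(small)
```

Running it with dry_run=True first gives a list of suspect buckets to review before anything is moved out of the index.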