You can find the source code referenced throughout this post here - though it is now quite different from what is written below.
Redis is an in-memory, key-value database. Its primary use is for caching. It does this over the network, but it’s only really meant to be used by “trusted clients”:
Redis is designed to be accessed by trusted clients inside trusted environments.
What if you want to make sure nothing bad happens to your Redis instance? Say you want to analyze the traffic going to and from that Redis instance just to be extra sure it’s secure. Well, you can - Zeek and Spicy have you covered. This is the first of a two-part miniseries of posts in which I’ll be diving into how I went about writing a Spicy Redis parser.
As for myself, I’m Evan, the newest member of Corelight’s open source Zeek team. I joined in August 2024. My background is mostly with parsers and compilers, so working with Spicy came relatively quickly for me.
I won’t be dwelling too much on how to actually parse the Redis serialization protocol - that’s too specific to be worthwhile for any general audience. For what it’s worth, the Spicy docs have a good tutorial that can get you up and running. Instead, I want to focus more on what isn’t covered in documentation, like design decisions and how I chose to solve certain edge cases.
As a word of warning, this is not a tutorial and it's not meant to be perfect Zeek or Spicy code. It's an exploration of how to approach writing a Spicy analyzer for Redis from nothing. This was my first analyzer, after all.
Redis RESP
When a Redis client and server talk, it's all just serialized data. Any two communicating endpoints need a structured way to transmit data - like protocol buffers. For Redis, this serialization format is the Redis serialization protocol (or RESP).
It turns out RESP is very straightforward to parse. There are a couple weird cases: a bulk string starts with its length, but if you give a length of -1, it’s a special “null bulk string.” From what I can tell, though, those are mostly documented. The edge cases aren’t really hard to plan for.
But, what in the world can you do with a bunch of serialized data? At this point, there are no commands, keys, or values. It’s all just data. You can store it and use your human brain, but that is just an interpretation of the serialized data. Nothing in its structure indicates what it does. So we need another pass!
Spicy’s role
Spicy is a binary parser generator (that is, it generates parsers that create structure from binary data). Spicy is specialized for network traffic - like Redis! All you have to do is describe the structure of the data, then Spicy will generate a nice parser that structures the stream of binary data.
So what data are we parsing? RESP in the general case looks like this:
<type><data>\r\n
Where <type> is a byte representing how to interpret <data>, with the whole thing terminated by a carriage return and line feed (CRLF).
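For example, the simple string OK goes over the wire as +OK\r\n, the integer 1000 as :1000\r\n, and a bulk string hello as $5\r\nhello\r\n (that special null bulk string from earlier is just $-1\r\n).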
If you are parsing and the first byte you see is +, that means what follows is a simple string. Here's what that might look like in Spicy:
public type Data = unit {
    ty: uint8; # This will eventually become an enum for each possible type

    if (self.ty == '+') { # … and this will become a switch over those types
        simple_string: # TODO
    };
};
So this parses a byte (uint8), then if that is a ‘+’, it starts trying to parse some simple_string field (left unimplemented). What is a simple string in RESP?
Simple strings are encoded as a plus (+) character, followed by a string. The string mustn’t contain a CR (\r) or LF (\n) character and is terminated by CRLF (i.e., \r\n).
Simple! So we just parse until a CRLF. That’s super convenient in Spicy:
simple_string: bytes &until=b"\x0d\x0a";
That'll parse until hex code 0x0D followed by 0x0A - CRLF.
Later iterations of the parser will defensively limit the size of each field. Leaving this as-is would allow simple_string to grow until you run out of memory - which could be very bad.
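A later revision might bound the field with Spicy's &max-size attribute, something like this (the 1024-byte cap is an arbitrary number for illustration, not the analyzer's actual choice):
simple_string: bytes &until=b"\x0d\x0a" &max-size=1024; # cap is illustrative, tune as needed
Then a missing CRLF yields a parse error instead of an ever-growing buffer.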
We can then shove this into some resp.spicy file:
# resp.spicy
module RESP;

public type Data = unit {
    ty: uint8; # This will eventually become an enum for each possible type

    if (self.ty == '+') {
        simple_string: bytes &until=b"\x0d\x0a";
    };

    on %done {
        print self;
    }
};
And run it with some test data to see what it prints:
$ printf "+Hi there\r\n" | spicy-driver resp.spicy
[$ty=43, $simple_string=b"Hi there"]
With that, we can parse simple strings passed via RESP. (That $ty=43 is just the ASCII value of the + byte.)
I won't go through every case. It turns out that the remaining cases are pretty simple to parse from here. We have a bunch of types - the first byte. Then, we use a big switch statement that determines how to parse the remainder based on that first byte. Each part is split with CRLF. This works well!
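To make that concrete, here's a hedged sketch of that shape - the enum labels and field names are illustrative, and only a few of RESP's types are covered:
# Illustrative subset - the real parser covers every RESP type.
type DataType = enum {
    SIMPLE_STRING = 0x2b, # '+'
    SIMPLE_ERROR = 0x2d, # '-'
    INTEGER = 0x3a, # ':'
};

public type Data = unit {
    ty: uint8 &convert=DataType($$);

    switch ( self.ty ) {
        DataType::SIMPLE_STRING -> simple_string: bytes &until=b"\x0d\x0a";
        DataType::SIMPLE_ERROR -> simple_error: bytes &until=b"\x0d\x0a";
        DataType::INTEGER -> integer: bytes &until=b"\x0d\x0a" &convert=$$.to_int();
    };
};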
Aside: Unrecognized first byte
There is, sadly, one case where you can’t “parse the remainder based on that first byte.” In fact, the server works perfectly fine if you just send some command followed by CRLF. You see this in pipelining - the example that sends three PING commands never even serializes those PING commands:
$ (printf "PING\r\nPING\r\nPING\r\n"; sleep 1) | nc localhost 6379
+PONG
+PONG
+PONG
This is called inlining - just arbitrary text followed by CRLF.
How can we account for that? Spicy will just throw an exception if a switch statement doesn't have an explicit case for the provided value. For example, if we don't have a default (*) arm for the switch, sending an inline PING command will make the Spicy parser throw an exception:
$ printf "PING\x0d\x0a" | spicy-driver resp.spicy --parser RESP::Data
[error] terminating with uncaught exception of type spicy::rt::ParseError: no matching case in switch statement for value 'DataType::<unknown-80>' (resp.spicy:12:5-35:10)
It turns out, inlining is a bit annoying to deal with. If we don’t recognize the first byte, we should go back and store all of it until CRLF. For now, I went a bit overboard and made this monstrosity:
* -> not_serialized: bytes &convert=(pack(cast<uint8>(self.ty), spicy::ByteOrder::Network) + $$) &until=b"\x0d\x0a";
… all that does is prepend the self.ty byte to the bytes that get parsed. That can look nicer, but hey, it works.
Note that this changed in the current version to using random access in Spicy to reparse the first byte, which is almost certainly better. You can read more about that in the second post on the Spicy Redis analyzer.
Two-stage parsing
Now we have some serialized data, but no idea what that data means. So, we need to parse it again! All a parser does is consume less-structured input and produce more-structured output. With that in mind, let's take a detour. Consider the following construct in any programming language:
if (i > 3) {
}
Parsing the RESP data resembles tokenizing a program in a compiler. Tokenizing just splits the bytes from a file into “tokens” which represent meaningful pieces. For example, the ‘(‘ character would get transformed into an LPAREN token. After tokenizing, the program may have the following token stream:
IF LPAREN ID GREATER_THAN NUMBER RPAREN LBRACE RBRACE
But, this is unstructured. It's just a stream of tokens that would love to get put into some object. A compiler would then parse this stream of tokens and create some sort of node. Here's how that might look (in as intuitive a format as I can muster):
|-statement.If
  |-condition: expression.BinaryOp
    |-lhs: expression.Identifier "i"
    |-rhs: expression.Number 3
    |-kind: GT
  |-body: statement.CompoundStatement {}
  |-else: null
So back to RESP, we parsed the binary data on the wire into proper RESP format - much like a compiler tokenizing a program. Now we want to parse that “tokenized” data in order to get useful information, like commands sent over the wire.
Well, at this point, you can do that wherever you want (Zeek script, C++, Spicy). But, for this analyzer, the second pass parsing is done in Spicy.
Doing more in Spicy
Commands in Redis are just serialized data in RESP. From the Redis documentation:
Clients send commands to a Redis server as an array of bulk strings.
So, if we want to parse a command, we need to grab some array (of bulk strings) and process that. We can then shove that into a data structure, then off into the void (well, to a Zeek event).
First, we need to determine whether a given array is indeed a command. As an example, I'll use a GET command, like GET my_key. After the Spicy parser gets its hands on it, the serialized data gets printed like this:
[$ty=DataType::ARRAY, $array=[$num_elements=2, $elements=[
[$ty=DataType::BULK_STRING, $bulk_string=[$length=3, $content=b"GET"]],
[$ty=DataType::BULK_STRING, $bulk_string=[$length=6, $content=b"my_key"]]
]]]
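For reference, the Array unit behind that output might look roughly like this (a hedged sketch - the real unit also has to handle things like null arrays, whose length is -1):
# Hedged sketch - field names chosen to match the printed output above.
type Array = unit {
    num_elements: bytes &until=b"\x0d\x0a" &convert=$$.to_uint();
    elements: Data[self.num_elements];
};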
So we can crudely determine if this is a GET command. Does the array have two elements? Is the first element GET? That means it’s a GET command!
That’s pretty intuitive in Spicy code:
public function is_get(arr: RESP::Array): bool {
    # GET key
    if (arr.num_elements != 2)
        return False;

    local cmd = command_from(arr);
    return cmd && (*cmd == Command::GET);
}
This uses some command_from function whose details I won't bore you with - it returns an optional enum value associated with the command, which we get by just comparing the first array element to known commands.
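If you are curious, a hypothetical sketch could look like this (the Command enum is heavily abbreviated here, and the real code is more careful):
type Command = enum { GET, SET }; # abbreviated; the real enum has far more labels

public function command_from(arr: RESP::Array): optional<Command> {
    local cmd: optional<Command>;

    # Bail out if there's no first bulk string to inspect.
    if (arr.num_elements < 1 || !arr.elements[0]?.bulk_string)
        return cmd;

    local name = arr.elements[0].bulk_string.content.upper();
    if (name == b"GET")
        cmd = Command::GET;
    else if (name == b"SET")
        cmd = Command::SET;

    return cmd;
}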
Here we're using Spicy like a general-purpose programming language! This has nothing to do with generating a parser for network traffic yet - it's simply a function that takes in a value and returns another. Spicy isn't just a parser generator, it's a programming language.
Now that we know it’s a GET command, we need to take that data and put it in some structure to pass along to Zeek. That looks pretty simple too:
type Get = struct {
    key: bytes;
};

public function make_get(arr: RESP::Array): Get {
    return [$key = arr.elements[1].bulk_string.content];
}
When this Get struct is passed into Zeek, it magically transforms into the following record in Zeek script:
type GetCommand: record {
    key: string;
};
So how does it get sent into Zeek world? With a definition in a .evt file, of course!
on RESP::Array if ( Zeek_RESP::is_get(self) ) -> event RESP::get_command($conn, $is_orig, Zeek_RESP::make_get(self));
So if our is_get function returns true, then we trigger the event get_command and send the result of make_get off into Zeek. In Zeek script, that event signature looks like:
event RESP::get_command(c: connection, is_orig: bool, command: GetCommand)
The Get struct in Spicy and GetCommand in Zeek script aren't related; they just automatically translate to each other since they look the same - cool.
That is the foundation for any specific command event we could want. If we want an event for AUTH commands, we'd do the same thing for that. I'm not quite sure that an event is useful for GET commands, but it's one of the simplest commands that you'll frequently use.
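For instance, an AUTH event could pair is_auth and make_auth the same way. This is a hypothetical sketch, assuming the Command enum gains an AUTH label (AUTH takes a password and, since Redis 6, an optional username):
# Hypothetical - Command::AUTH is assumed to exist in the enum.
type Auth = struct {
    password: bytes;
};

public function is_auth(arr: RESP::Array): bool {
    # AUTH [username] password
    if (arr.num_elements != 2 && arr.num_elements != 3)
        return False;

    local cmd = command_from(arr);
    return cmd && (*cmd == Command::AUTH);
}

public function make_auth(arr: RESP::Array): Auth {
    # The password is always the last element.
    return [$password = arr.elements[arr.num_elements - 1].bulk_string.content];
}
Plus a matching line in the .evt file, of course.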
What do we log?
One of the first questions you need to answer when making a new analyzer is “what do I log?” For RESP (that is, the serialization protocol), the answer is honestly “nothing.” You simply don’t have enough information from just some serialized data over the wire. If you shove everything into some “data log” then it’s just too much to do anything meaningful with.
But what about Redis commands? Most of the communication you see is a client sending commands to a server. Those commands may set or get keys, publish to a channel, or get configuration values. Having those seems useful - just a list of what commands were sent to the server would give a pretty good idea of what was going on. Many commands act on keys and values (Redis is a key-value database, after all). Having that generic info would also be nice.
So what is a command? We know from before it’s an array of bulk strings - so any time the client sends one, it’s a command.
Well, we're missing a case - inlining! This acts like a command for my locally running Redis server:
$ (printf "PING\r\nPING\r\nPING\r\n"; sleep 1) | nc localhost 6379
+PONG
+PONG
+PONG
All Redis actually handles from the client is an array of bulk strings or inlined data. We discussed how to handle the array of bulk strings before, and it’s similar here. Here’s a simple Spicy function that will “tokenize” that inline data, if we get it:
public function inline_command(inline: RESP::Data): vector<bytes> {
    # Only call this if it's inline :)
    assert inline?.not_serialized;
    return inline.not_serialized.split();
}
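For instance, if a client inlines SET mykey 5, the not_serialized bytes are b"SET mykey 5", and split() - which splits at runs of whitespace when called with no separator - returns [b"SET", b"mykey", b"5"].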
Well, we can leave it there. Don’t ask what happens if you have a “quoted string with spaces.”
How do we log this?
So now: every array of bulk strings will be treated as a command. Also, every unserialized request will be treated as a command. The rest is actually pretty straightforward: we search through every Redis command, note which ones operate on keys and their values, then structure the data. That slightly structured data is sent into Zeek world in a record.
In order to turn the tokenized data (either from the bulk strings composing the array or the split inline data) into a Command, we have a new function:
function parse_command(raw: vector<bytes>): Command {
The resulting Command just contains the generic info in one structure:
type Command = struct {
    raw: vector<bytes>;
    command: bytes;
    key: optional<bytes>;
    value: optional<bytes>;
    known: optional<KnownCommand>;
};
There are a few cases where this may not work - for example, the command may be multiple tokens. Also, it's impossible to enumerate every known command, so the known field is an optional enum that is only set if we recognize the command.
This gets transformed into some analogous record in Zeek script auto-magically. That Zeek script record can choose to log only parts of this (like the command, key, and value fields).
The remainder of this is a simple solution: large switch statements! We parse the command, set the known enum field if we recognize it, then switch on that to figure out which arguments correlate to a key or value. Quite frankly, I'd show it, but it's ugly and unruly enough that I don't think that's a good idea.
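Still, to give a flavor of the shape without the full monstrosity, here's a heavily trimmed, hypothetical sketch - if/else standing in for the real switch statements, and only two commands handled:
function parse_command(raw: vector<bytes>): Command {
    # Hypothetical sketch; assumes the caller already made sure raw is non-empty.
    local cmd: Command = [$raw = raw, $command = raw[0].upper()];

    if (cmd.command == b"GET") {
        cmd.known = KnownCommand::GET;
        if (|raw| > 1)
            cmd.key = raw[1];
    } else if (cmd.command == b"SET") {
        cmd.known = KnownCommand::SET;
        if (|raw| > 2) {
            cmd.key = raw[1];
            cmd.value = raw[2];
        }
    }

    return cmd;
}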
The end result
Finally, we're left with something pretty nice. We'll just log the client's commands, so there is no additional upkeep - just send the data into logs. Here's an example where an application tries to GET a factorial, but if it's not cached, it'll calculate and cache all of them up to the given number:
GET :1:factorial_50 -
SET :1:factorial_1 1
SET :1:factorial_2 2
SET :1:factorial_3 6
SET :1:factorial_4 24
SET :1:factorial_5 120
SET :1:factorial_6 720
SET :1:factorial_7 5040
SET :1:factorial_8 40320
SET :1:factorial_9 362880
SET :1:factorial_10 3628800
SET :1:factorial_11 39916800
SET :1:factorial_12 479001600
SET :1:factorial_13 6227020800
SET :1:factorial_14 87178291200
SET :1:factorial_15 1307674368000
SET :1:factorial_16 20922789888000
SET :1:factorial_17 355687428096000
SET :1:factorial_18 6402373705728000
SET :1:factorial_19 121645100408832000
SET :1:factorial_20 2432902008176640000
SET :1:factorial_21 51090942171709440000
SET :1:factorial_22 1124000727777607680000
SET :1:factorial_23 25852016738884976640000
SET :1:factorial_24 620448401733239439360000
I removed the timestamp, uid, and 4-tuple for your viewing pleasure.
That’s about as clean as I’d hope for a generic approach to logging “something useful” about Redis. So, I’m happy with it.
Overall, RESP itself is a terribly uninteresting protocol. What Redis does with that RESP data, though, could be very interesting. Parsing RESP is more like tokenizing. The second parse step is what actually turns that unstructured data into commands that are useful for the user.
The analyzer explored here is nowhere near complete. The events are a bit minimal and the client/server relationship could be fleshed out more. That's what we'll explore next time.