Need help understanding serializer

Hi Robin,

I've been looking at the serialization code in Bro for a while now and
I've hit a dead end. It'd be really cool if you could help me out
because I think I simply don't get it :slight_smile:

I think I've basically understood the concepts of SerialObj, Serializer
and SerializationFormat -- I think I would have structured things the
same way. I'm getting lost in the details though:

- When exactly and why does a class have to implement Unserialize() and
Serialize()? What's their relationship to DoSerialize() and
DoUnserialize()? The comments in SerialObj.h are a bit vague in that
regard.

- It seems the idea of sending ahead an identifier to the receiving end
that tells it what to do with the following data exists three times:

      * in the SER_xxx constants and the factory approach in
        IMPLEMENT_SERIAL
      * in the character constants 'i', 'e', 's' etc in Serializer.cc
      * in the MSG_xxx constants in RemoteSerializer.cc.

I think the latter are partially internal to the remote<->local
communication and can hence mostly be ignored for understanding the
serialization code, right? If you could quickly explain the difference
between the first two that'd be great.

The next thing I noticed is the set of hardcoded (un)serialization methods in
Serializer:

        bool Serialize(const ID& id, const SerialInfo& info);
        bool Serialize(const char* func, val_list* args);
        bool Serialize(const StateAccess& s);
        bool Serialize(const Connection& c);
        bool Serialize(const Timer& t);

        virtual void GotID(ID* id, Val* val) = 0;
        virtual void GotEvent(const char* name, double time,
                                EventHandlerPtr event, val_list* a) = 0;
        virtual void GotFunctionCall(const char* name, double time,
                                Func* func, val_list* args) = 0;
        virtual void GotStateAccess(StateAccess* s) = 0;
        virtual void GotTimer(Timer* t) = 0;
        virtual void GotConnection(Connection* c) = 0;

        bool UnserializeID();
        bool UnserializeCall();
        bool UnserializeStateAccess();
        bool UnserializeTimer();
        bool UnserializeConnection();

Are these special in a way to have them implemented this way? Couldn't
there be a "received" callback per SER_xxx constant that resides as
a static method in the serializable classes themselves? So we can avoid
hardcoding anything?

- Following the comments in SerialObj.h, I see what I need to do to make
a class's objects serializable. I presume that the correct way to ship
an object to a serializer is by calling SerialObj::Serialize() with the
appropriate serializer. What are my options for picking them up at the
receiving end?

Oh and RemoteSerializer::ProcessSerialization() calls Unserialize()
passing a SerialInfo, but Serializer::Unserialize() expects a bool -- is
that intended?

The reason why I'm looking at this is that I'm trying to find the right
knobs to tweak to allow arbitrary *local* client applications to feed
information into Bro (like a tuned sshd that can feed its events and
traffic to a local Bro) without reinventing the wheel ...

Thanks so much!

Best,
Christian.

> I think I've basically understood the concepts of SerialObj, Serializer
> and SerializationFormat -- I think I would have structured things the
> same way. I'm getting lost in the details though:

First of all, there are still a few loose ends which need some
clean-up. I just haven't found time to do that yet. :frowning:

Most importantly, some classes are serialized slightly differently
than others, because there are two types of interfaces, an old one
and a new one (the new one correctly handles shallow-copied objects,
which the old doesn't). For example, Val uses the old
interface while Conn uses the new one. You can differentiate the two
by looking for a DECLARE_SERIAL in the class definition; if it's
there it's the new interface. If the semantics are different, the
explanations below refer to the new one. I am going to adapt the
other classes asap (and I intend to write a short doc about this
stuff).

> - When exactly and why does a class have to implement Unserialize() and
> Serialize()? What's their relationship to DoSerialize() and
> DoUnserialize()? The comments in SerialObj.h are a bit vague in that
> regard.

Unserialize()/Serialize() are (non-virtual) methods which are to be
called when you want to actually serialize/unserialize an object,
i.e. that's the "user-interface". They are only defined inside the
base class of a hierarchy[1] (e.g. Conn has a Serialize(), but
TCP_Connection doesn't).

Unserialize()/Serialize(), in turn, call the (virtual)
DoSerialize()/DoUnserialize() which need to be implemented in every
class derived from such a base class. DoSerialize()/DoUnserialize()
are supposed to call their parent's implementation first, and then
read/write their own attributes.

[1] I did not use BroObj as the base but the classes on the next
layer. So, for this discussion Val, Stmt, Expr, Conn, etc. all start
their own hierarchy.
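
The split between the non-virtual entry point and the virtual Do*() methods can be sketched roughly as follows. This is only an illustration of the pattern, not Bro's code: `Sink`, `BaseConn` and `TcpConn` are hypothetical stand-ins, and the "attributes" are placeholder strings.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the output side: records fields in write order.
struct Sink {
    std::vector<std::string> fields;
    bool Write(const std::string& v) { fields.push_back(v); return true; }
};

// Base class of a hierarchy: the only place that defines Serialize().
class BaseConn {
public:
    virtual ~BaseConn() = default;

    // Non-virtual "user interface"; callers never call DoSerialize() directly.
    bool Serialize(Sink* s) const { return DoSerialize(s); }

protected:
    // Each class writes only its own attributes.
    virtual bool DoSerialize(Sink* s) const { return s->Write("base-attrs"); }
};

// Derived class: no Serialize() of its own, only DoSerialize().
class TcpConn : public BaseConn {
protected:
    bool DoSerialize(Sink* s) const override {
        // Parent's implementation first, then this class's own attributes.
        return BaseConn::DoSerialize(s) && s->Write("tcp-attrs");
    }
};
```

Calling `Serialize()` on a `TcpConn` writes the base attributes first and the TCP attributes second, which is the ordering the reading side would have to mirror in its DoUnserialize() chain.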

>       * in the SER_xxx constants and the factory approach in
>         IMPLEMENT_SERIAL
>       * in the character constants 'i', 'e', 's' etc in Serializer.cc
>       * in the MSG_xxx constants in RemoteSerializer.cc.
>
> I think the latter are partially internal to the remote<->local
> communication and can hence mostly be ignored for understanding the
> serialization code, right?

Right.

> If you could quickly explain the difference
> between the first two that'd be great.

The types of objects that a Serializer handles are different from
the classes containing Serialize/Unserialize methods; e.g.
Serializer serializes function calls for which there is no
corresponding class. The characters indicate which kind of
"top-level" serialization follows, while the SER_* constants specify
a concrete class.

Perhaps this could be avoided somehow, but it would introduce more
dependencies between the Serializer and Serialize/Unserialize
methods. And the additional overhead is only small.
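
To make the two levels concrete, here is a small sketch under invented assumptions: the tag values, class ids and the one-byte layout below are hypothetical and do not reflect Bro's real wire format. The first byte names the kind of top-level item; only for object-like items does a second byte select a concrete class.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical top-level tags, playing the role of the character constants.
constexpr std::uint8_t kTagObject = 'o';  // a class id follows
constexpr std::uint8_t kTagCall   = 'c';  // a function call: no class behind it

// Hypothetical class ids, playing the role of the SER_* constants.
constexpr std::uint8_t kClassVal   = 1;
constexpr std::uint8_t kClassTimer = 2;

// Inspect the tag(s) at the front of a message and say what follows.
std::string Describe(const std::vector<std::uint8_t>& msg) {
    if (msg.empty()) return "empty";
    switch (msg[0]) {
    case kTagCall:
        return "function call";  // handled by the serializer itself
    case kTagObject:
        if (msg.size() < 2) return "truncated";
        switch (msg[1]) {        // second level: pick the concrete class
        case kClassVal:   return "object: Val";
        case kClassTimer: return "object: Timer";
        default:          return "object: unknown class";
        }
    default:
        return "unknown tag";
    }
}
```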

> Are these special in a way to have them implemented this way? Couldn't
> there be a "received" callback per SER_xxx constant that resides as
> a static method in the serializable classes themselves? So we can avoid
> hardcoding anything?

Putting them into the serializable classes themselves doesn't work
as it depends on the serializer what we need to do (e.g. the
RemoteSerializer treats a received ID differently than the
PersistenceSerializer).

It could be an alternative to use only one of the
Serialize()/Unserialize()/Got() methods which would handle all
cases. But I don't think that would be much nicer: first, each
serializer would use some switch-construct anyway, and second, we
would lose the static type checking.
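
A reduced sketch of why the callbacks live in the serializer rather than in the serialized class: the same received item is treated differently depending on which serializer picked it up. The class names and the "announce"/"restore" behaviour are made up for illustration; only the idea of one typed Got*() method per top-level entity comes from the discussion above.

```cpp
#include <string>
#include <vector>

// Hypothetical, heavily reduced serializer interface with one typed callback.
class MiniSerializer {
public:
    virtual ~MiniSerializer() = default;
    virtual void GotID(const std::string& name) = 0;

    std::vector<std::string> log;  // records what each callback decided to do
};

// A remote-style serializer might announce the received ID to the local side.
class MiniRemote : public MiniSerializer {
public:
    void GotID(const std::string& name) override { log.push_back("announce " + name); }
};

// A persistence-style serializer might install the ID as restored state.
class MiniPersistence : public MiniSerializer {
public:
    void GotID(const std::string& name) override { log.push_back("restore " + name); }
};
```

Because each callback has its own signature, passing the wrong kind of argument is caught at compile time, which a single catch-all handler would not give you.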

> - Following the comments in SerialObj.h, I see what I need to do to make
> a class's objects serializable. I presume that the correct way to ship
> an object to a serializer is by calling SerialObj::Serialize() with the
> appropriate serializer.

Correct.

> What are my options for picking them up at the
> receiving end?

You need a Serializer-derived class at the receiving end that
implements the Serializer::Got*() methods, and calls
Serializer::Unserialize() to actually do the work.
Serializer::Unserialize() gets its data from its member "io" which is
an instance of ChunkedIO and has to be initialized before (e.g. from
a fd by using a ChunkedIOFd).

Try taking a look at the implementation of FileSerializer to see an
example of how this works to read data back from a file.

> Oh and RemoteSerializer::ProcessSerialization() calls Unserialize()
> passing a SerialInfo, but Serializer::Unserialize() expects a bool -- is
> that intended?

Your eyes are quite good. :slight_smile:

No, that's indeed wrong (and actually I'm quite surprised that the
compiler implicitly converts the pointer to a bool here; but I guess
it does this at all places and not just inside an "if (...)" :-).

Unserialize(false) should be better.
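
For reference, the conversion in question is standard C++: any pointer converts implicitly to bool when passed as an argument, with non-null becoming true. The sketch below uses hypothetical stand-ins (`Info`, a free function `Unserialize`) rather than the real Bro signatures.

```cpp
// Hypothetical stand-in for SerialInfo; only the parameter types matter here.
struct Info {};

// Returns the flag it received so the conversion is observable.
bool Unserialize(bool block) { return block; }

bool CallWithPointer(Info* info) {
    // Compiles without complaint: the pointer is implicitly converted to
    // bool (non-null -> true, null -> false), so the callee never sees it.
    return Unserialize(info);
}
```

So passing a valid SerialInfo pointer where a bool is expected would behave like passing `true`.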

> The reason why I'm looking at this is that I'm trying to find the right
> knobs to tweak to allow arbitrary *local* client applications to feed
> information into Bro (like a tuned sshd that can feed its events and
> traffic to a local Bro) without reinventing the wheel ...

If you'd like to pass in data from other applications than Bro, it
could perhaps make sense to think about a more well-defined data
format. The serializations are a representation of internal Bro
structures which could be quite hard to generate externally.

Robin

> First of all, there are still a few loose ends which need some
> clean-up. I just haven't found time to do that yet. :frowning:

Okay sure -- this is neat stuff, and if I'm a pita then just tell me to
bugger off and I'll come back once you're happy with it :slight_smile:

> Most importantly, some classes are serialized slightly differently
> than others, because there are two types of interfaces, an old one
> and a new one (the new one correctly handles shallow-copied objects,
> which the old doesn't). For example, Val uses the old
> interface while Conn uses the new one. You can differentiate the two
> by looking for a DECLARE_SERIAL in the class definition; if it's
> there it's the new interface. If the semantics are different, the
> explanations below refer to the new one. I am going to adapt the
> other classes asap

I see, thanks!

> (and I intend to write a short doc about this
> stuff).

That'd be cool. Actually I think a Hacker's Guide to Bro would be really
useful. The thing is really quite big now ...

> Unserialize()/Serialize() are (non-virtual) methods which are to be
> called when you want to actually serialize/unserialize an object,
> i.e. that's the "user-interface". They are only defined inside the
> base class of a hierarchy[1] (e.g. Conn has a Serialize(), but
> TCP_Connection doesn't).

Okay. I'm still a bit confused because Connections declare

  bool Serialize(Serializer* s) const;
  static Connection* Unserialize(Serializer* ser);

but SerialObj's also have those, but different signatures:

  bool Serialize(Serializer* s, SerialInfo* i, bool cache = true) const;
  static SerialObj* Unserialize(Serializer* s, SerialType type,
                                bool cache = true);

Is that related to the different APIs?

> Unserialize()/Serialize(), in turn, call the (virtual)
> DoSerialize()/DoUnserialize() which need to be implemented in every
> class derived from such a base class. DoSerialize()/DoUnserialize()
> are supposed to call their parent's implementation first, and then
> read/write their own attributes.

Yeah I saw that -- nice!

> The types of objects that a Serializer handles are different from
> the classes containing Serialize/Unserialize methods; e.g.
> Serializer serializes function calls for which there is no
> corresponding class.

<enlightenment> Aaaaaaah! </enlightenment>

> The characters indicate which kind of
> "top-level" serialization follows, while the SER_* constants specify
> a concrete class.
>
> Perhaps this could be avoided somehow, but it would introduce more
> dependencies between the Serializer and Serialize/Unserialize
> methods. And the additional overhead is only small.

Okay sure. I just want to understand it! :slight_smile:

>> Are these special in a way to have them implemented this way? Couldn't
>> there be a "received" callback per SER_xxx constant that resides as
>> a static method in the serializable classes themselves? So we can avoid
>> hardcoding anything?

> Putting them into the serializable classes themselves doesn't work
> as it depends on the serializer what we need to do (e.g. the
> RemoteSerializer treats a received ID differently than the
> PersistenceSerializer).

Oh, I see. Thanks.

> It could be an alternative to use only one of the
> Serialize()/Unserialize()/Got() methods which would handle all
> cases. But I don't think that would be much nicer: first, each
> serializer would use some switch-construct anyway, and second, we
> would lose the static type checking.

True. And if I understand you correctly, I normally won't have to deal
with a serializer's internals anyway because I only need the
Serialize()/Unserialize() stuff at some point in the hierarchies you're
mentioning above. I think I'm starting to get it.

>> What are my options for picking them up at the
>> receiving end?

> You need a Serializer-derived class at the receiving end that
> implements the Serializer::Got*() methods, and calls
> Serializer::Unserialize() to actually do the work.

Mhmm that still confuses me. You're saying above that the types of
objects that Serializers handle are different from the
Serialize()/Unserialize() hierarchies. And the Got* methods aren't for
arbitrary serializable objects but just for the specific types that
their names indicate, right?

So say I have a class Foo that implements DoSerialize() and
DoUnserialize() following the comments in SerialObj.h, and higher up
Foo's hierarchy is Bar that implements Unserialize() as you're
describing above. Now I ship a Foo instance using Bar::Serialize(s,
...). How do I get from the Serializer at the receiving end noticing
that something arrives, to the Bar::Unserialize() call at the far end?

> Serializer::Unserialize() gets its data from its member "io" which is
> an instance of ChunkedIO and has to be initialized before (e.g. from
> a fd by using a ChunkedIOFd).

Yep thanks for writing the chunk stuff, that's really useful.

> No, that's indeed wrong (and actually I'm quite surprised that the
> compiler implicitly converts the pointer to a bool here; but I guess

So was I!

> If you'd like to pass in data from other applications than Bro, it
> could perhaps make sense to think about a more well-defined data
> format. The serializations are a representation of internal Bro
> structures which could be quite hard to generate externally.

Well for now I was really just looking for a way to pump back and forth
simple structs and maybe an occasional dynamic length byte string. What
I had in mind is roughly this:

- Bro subsystem X can handle input from client application X.
- Some LocalSerializer handles local comms through domain sockets
- Client app X registers with Bro at startup
- Client app X sends some data
- Bro side recognizes that data for subsystem X are coming up and
notifies subsystem X
- Subsystem X extracts next data item from the link and processes it

And vice versa. I'd leave it entirely up to the subsystems how to define
the data layout. As long as I could use the various Read/Write methods
of the Serializer I'd be happy .. The main hurdle I didn't manage in the
current code was how to get from the Serializer noticing that data
arrives to an arbitrary subsystem.
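
The routing step described in the list above could look roughly like this. It is only a sketch of the idea: `Dispatcher`, the integer subsystem ids and the string payload are all hypothetical, and nothing like this exists in the Bro sources discussed here.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical dispatcher: subsystems register a handler under an id, and
// incoming data is routed by the id that precedes it on the link.
class Dispatcher {
public:
    using Handler = std::function<void(const std::string& payload)>;

    // Called once per subsystem, e.g. when a client app registers at startup.
    void Register(int subsystem_id, Handler h) {
        handlers_[subsystem_id] = std::move(h);
    }

    // Hands the payload to the matching subsystem.
    // Returns false if no subsystem has claimed the id.
    bool Deliver(int subsystem_id, const std::string& payload) const {
        auto it = handlers_.find(subsystem_id);
        if (it == handlers_.end()) return false;
        it->second(payload);
        return true;
    }

private:
    std::map<int, Handler> handlers_;
};
```

In this arrangement the data layout stays entirely up to the subsystem; the dispatcher only needs to read the id and pass the rest along.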

I even have a name for the client library already: libbroccoli. BRO
Client COmmunications LIbrary :wink:

Thanks Robin!

Best,
Christian.

Robin,

> And vice versa. I'd leave it entirely up to the subsystems how to define
> the data layout. As long as I could use the various Read/Write methods
> of the Serializer I'd be happy .. The main hurdle I didn't manage in the
> current code was how to get from the Serializer noticing that data
> arrives to an arbitrary subsystem.

I think I just thought of a reasonable way to do this ... would it make
sense to you to add one extra step in the hierarchy below Serializer,
say BroSerializer, and keep the toplevel Serializer class free of any
Bro-Bro specific stuff? And let the toplevel Serializer be a very
bare-bones interface that only lets you read/write primitives (ie the
various Read/Write methods)?

I think I could derive what's best for my needs from such a base class
(for example code to dispatch to subsystems etc), while you could keep
working on the inter-Bro stuff from your BroSerializer class that would
contain the various Serialize() methods, plus the GotXXX callbacks...
does that make any sense?

Regards,
Christian.

>   bool Serialize(Serializer* s) const;
>   bool Serialize(Serializer* s, SerialInfo* i, bool cache = true) const;

Good point. The difference is that you don't call
SerialObj::Serialize() directly but Conn::Serialize() which, in
turn, uses SerialObj::Serialize() to do its work; i.e.
SerialObj::Serialize() does not really belong to the user-interface.

I guess SerialObj::Serialize() should better be protected
(and perhaps even renamed). I'll change this. Thanks!

> arbitrary serializable objects but just for the specific types that
> their names indicate, right?

Right.

> ...). How do I get from the Serializer at the receiving end noticing
> that something arrives, to the Bar::Unserialize() call at the far end?

If I understand your question correctly, you're expecting some
call-back mechanism, right? That's not the case here. The receiving
Serializer sees that one of the basic entities (i.e. those with
character constants) has arrived. Then it will call the
corresponding Unserialize() method. If that is, e.g., a Connection,
then Connection::Unserialize() is called. If the connection contains
a Foo object, Connection::DoUnserialize() will call
Bar::Unserialize().
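
The receive path just described can be sketched as below, under simplifying assumptions: `Source` is a hypothetical token stream standing in for the real I/O layer, strings stand in for the tag characters and class ids, and `MiniConnection` is not Bro's Connection class. Only the call chain (top-level tag, then Unserialize(), then DoUnserialize(), then the nested object's Unserialize()) follows the explanation.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical input: a list of already-decoded tokens.
struct Source {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    bool Read(std::string* out) {
        if (pos >= tokens.size()) return false;
        *out = tokens[pos++];
        return true;
    }
};

// Bar is the base of its own hierarchy; Foo derives from it.
struct Bar {
    virtual ~Bar() = default;
    virtual std::string Name() const { return "Bar"; }
    // Reads a class id and instantiates the matching class (the factory role).
    static std::unique_ptr<Bar> Unserialize(Source* src);
};

struct Foo : Bar {
    std::string Name() const override { return "Foo"; }
};

std::unique_ptr<Bar> Bar::Unserialize(Source* src) {
    std::string cls;
    if (!src->Read(&cls)) return nullptr;
    if (cls == "Foo") return std::make_unique<Foo>();
    if (cls == "Bar") return std::make_unique<Bar>();
    return nullptr;
}

// A top-level entity that contains a Bar-derived member.
struct MiniConnection {
    std::unique_ptr<Bar> payload;

    static std::unique_ptr<MiniConnection> Unserialize(Source* src) {
        auto c = std::make_unique<MiniConnection>();
        if (!c->DoUnserialize(src)) return nullptr;
        return c;
    }

    bool DoUnserialize(Source* src) {
        payload = Bar::Unserialize(src);  // nested object: no callback involved
        return payload != nullptr;
    }
};

// The serializer only knows the top-level tags.
std::unique_ptr<MiniConnection> Receive(Source* src) {
    std::string tag;
    if (!src->Read(&tag) || tag != "connection") return nullptr;
    return MiniConnection::Unserialize(src);
}
```

A Foo therefore never arrives on its own in this model; it is reconstructed as a side effect of unserializing the top-level entity that references it.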

> I even have a name for the client library already: libbroccoli. BRO
> Client COmmunications LIbrary :wink:

Cool! :slight_smile:

Hi!

>> bool Serialize(Serializer* s) const;
>> bool Serialize(Serializer* s, SerialInfo* i, bool cache = true) const;

> Good point. The difference is that you don't call
> SerialObj::Serialize() directly but Conn::Serialize() which, in
> turn, uses SerialObj::Serialize() to do its work; i.e.
> SerialObj::Serialize() does not really belong to the user-interface.

All is clear :slight_smile:

>> ...). How do I get from the Serializer at the receiving end noticing
>> that something arrives, to the Bar::Unserialize() call at the far end?

> If I understand your question correctly, you're expecting some
> call-back mechanism, right?

Yup sort of. I guess my problem was that I approached the thing with a
different idea in mind how it would work.

> That's not the case here. The receiving
> Serializer sees that one of the basic entities (i.e. those with
> character constants) has arrived. Then it will call the
> corresponding Unserialize() method. If that is, e.g., a Connection,
> then Connection::Unserialize() is called. If the connection contains
> a Foo object, Connection::DoUnserialize() will call
> Bar::Unserialize().

Ah I think I'm getting there. You're using a callback mechanism for the
basic entities only, and then the DoSerialize/DoUnserialize API to
handle the objects referenced by the basic entities.

>> Bro-Bro specific stuff? And let the toplevel Serializer be a very
>> bare-bones interface that only lets you read/write primitives (ie the
>> various Read/Write methods)?

> In general, adding an extra step would be fine with me. But I am not
> sure if I got your idea right. Do you indeed want to access only the
> Read/Write methods (not any of the Serialize/Unserialize)? If yes,
> perhaps the SerializationFormat itself would suffice?

Well see I need a mechanism to communicate with local sensors (in
intrusion detection parlance), and I was thinking of communicating
through unix domain sockets (largely because I have code that I could
reuse for that). So at the Bro end, another Serializer sounds like a
good idea.

The problem is that the existing Serializer class already implies that I
want to send connections, IDs, etc (ie things that are largely
interesting for Bro-Bro communication only) which is not necessarily the
case for what I need. The SerializationFormat gives me what I need to do
the marshalling, but it doesn't handle I/O. And a format plus I/O are
pretty much the core ingredients of a Serializer, right? Also I could
build the callback API that I was looking for in my own Serializer...

Thanks,
Christian.

To clarify what I mean I've put up a patch at

http://www.cl.cam.ac.uk/~cpk25/BroSerializer.diff.gz

that moves all the callback and Serialize()/Unserialize() stuff down to
a new child class that I called BroSerializer (as it's geared towards
Bro-Bro communication). The new Serializer class has only the basic
read/write methods, the cache subsystem, and the error stuff left.
RemoteSerializer is now derived from BroSerializer.
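
In outline, the proposed layering might look like the sketch below. It is not taken from the patch: the class names other than the roles they play, the in-memory buffer and the single `GotTimer` callback are assumptions chosen to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bare-bones base: only primitive reads and writes.
class BasicSerializer {
public:
    virtual ~BasicSerializer() = default;

    void Write(std::uint32_t v) { buf_.push_back(v); }

    // Returns false once all written values have been consumed.
    bool Read(std::uint32_t* v) {
        if (next_ >= buf_.size()) return false;
        *v = buf_[next_++];
        return true;
    }

private:
    std::vector<std::uint32_t> buf_;  // hypothetical stand-in for the I/O layer
    std::size_t next_ = 0;
};

// Bro-to-Bro layer: this is where the typed callbacks would live.
class PeerSerializer : public BasicSerializer {
public:
    virtual void GotTimer(std::uint32_t expiry) = 0;
};

// A local-client serializer builds directly on the primitives and is free
// to define its own record layout.
class LocalSerializer : public BasicSerializer {
public:
    bool ReadPair(std::uint32_t* a, std::uint32_t* b) {
        return Read(a) && Read(b);
    }
};
```

The point of the split is that `LocalSerializer` does not inherit any obligation to implement callbacks for connections, IDs or timers.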

I could now derive my own serializer from this Serializer class without
much trouble. The patch should apply cleanly against 0.8a82.

The separation was pretty easy -- maybe that's an indication that this
would be a good move ... does this look reasonable?

Thanks,
Christian.

> case for what I need. The SerializationFormat gives me what I need to do
> the marshalling, but it doesn't handle I/O. And a format plus I/O are
> pretty much the core ingredients of a Serializer, right? Also I could
> build the callback API that I was looking for in my own Serializer...

Ok, I see.

> http://www.cl.cam.ac.uk/~cpk25/BroSerializer.diff.gz

Thanks! I skimmed over it and it seems to be fine. So, feel free to
go ahead.

Robin

Hi, a quick update on this: Broccoli can now speak a good subset of the
Bro communication protocol (as per ChunkedIO, RemoteSerializer etc).
Broccoli clients look like normal Bros, and I have thus no longer found
the need for the BroSerializer split I mentioned earlier in this thread.

You can have a look at Broccoli here:
http://www.cl.cam.ac.uk/~cpk25/vern/broccoli-0.2.tar.gz

[ usual danger sign here for very alpha code :slight_smile: ]

Cheers,
Christian.