THE COGNICAST TRANSCRIPTS
In this episode, we talk to Paula Gearon about RDF and Naga.
The complete transcript of this episode is below.
The audio of this episode of The Cognicast is available here.
CRAIG: Hello, and welcome to Episode 114 of The Cognicast, a podcast by Cognitect, Inc. about software and the people who create it. I'm your host, Craig Andera.
Well, I really only have one announcement for you today, and it is this. I have decided to accept a position at another company, and so I will no longer be the host of The Cognicast.
Fear not, though. The Cognicast will go on with the same great production staff that we’ve had all along, and it’s going to be hosted by Carin Meier from time-to-time, Stuart Sierra from time-to-time, Tim Baldridge from time-to-time, and I think they will all do a fantastic job. I’m really actually pretty excited to see some fresh perspectives that they’ll bring and looking really forward to continuing to listen to the show and the great things that they’re going to do to carry it forward.
After 114 episodes, it is time for me to say good bye as the host, but The Cognicast will continue. You will, of course, maybe still hear my voice a few times, as we do have some episodes recorded. But as I put this down on tape, on virtual tape, on electronic tape, this is my final message to you.
It’s been really great having all of you as listeners as I’ve been the host, but I’m also looking forward to the fact that I know that you will continue to treat the new hosts of the show in the same fantastic way that you’ve treated me. As I said, the production team, Michael Parenteau and Russ Olsen, and Daemian Mack and Kim Foster are all going to continue to make the show, so I expect that this show will continue to be the fantastic thing that I think we’ve built up over the years.
I just want to say thank you so much to everybody for, like I said, treating me so well as the host, but the time has come for me to move on, so I will leave it there and we will go on to Episode 114 of The Cognicast.
[Music: "Thumbs Up (for Rock N' Roll)" by Kill the Noise and Feed Me]
CRAIG: Are you all set?
CRAIG: All right, great. Then let’s kick it off. All right. Welcome, everybody. Today is Thursday, October 27th in 2016, and this is The Cognicast. Today on this show we have a software engineer in the advanced threats division of Cisco Systems, Paula Gearon. Welcome to the show, Paula.
PAULA: Thanks, Craig.
CRAIG: You and I have known each other for quite a long time. I think we may even have met at a Clojure meetup before I even started at Relevance way back when, so you and I have both been orbiting the same portion of the Clojure world for quite some time, so it’s great to finally have you on the show. I know you listen, actually, because you’ve mentioned it several times, which we really, really appreciate it, so it’s even more excellent to have you here with us today.
PAULA: Well, at this point I think the latest episode was with Micha and--
PAULA: Alan Dipert.
PAULA: And I was just listening to them this morning.
CRAIG: Well, awesome. Well, like I say, we really appreciate that, and we’re very, very glad to have you on because you’re doing some interesting things. But we’ll get to those in a minute.
We do want to ask you the question that we ask everyone at the beginning of the show, which is to do with art. Of course, we always ask our guest to relate some experience of art, whatever that means to them. People have said all sorts of interesting things, but it could be anything from a book you read to a painting you saw or whatever. I imagine you’ve had a chance to think about this, and hopefully you have something you’d like to share with us today.
PAULA: Well, I have, but it’s really more of a personal experience. It comes back to what you’ve often said about art. It’s when I was about 12, I got taken by my parents to the Queensland Art Gallery. In the entryway there was this big painting, and it was like 40 feet long or something. And it was ugly.
Oh, I cannot describe how awful this thing was. It was all browns and oranges and ochres, and had a really disturbing imagery throughout. It was a collage as well. It had bits of newspaper up there and been put down with red paint over the top of it. As I’m walking along looking at this thing and just marveling at how awful it was, sticking out of the top of it was a preserved, partially decomposed head of a dead cat.
My understanding of art at that point, my appreciation of it I suppose, was things like photorealism. But seeing something like this, which was extraordinarily abstract, I just didn’t get it. I didn’t understand how anyone could call this art.
A couple of months later -- this was over the summer break. A couple of months later I was at high school and I spoke to our art teacher. Just in a conversation with her, I commented on this particular painting, which had really bothered me. I was saying, you know, how is it that art is so all encompassing that it’s got awful things like this in it? I don’t get that.
She explained to me, well, how did it make me feel? Did it make me feel sort of sickly and icky on the inside? And I almost imagined something spikely inside of me when I was remembering this thing, almost a flavor. Yeah. Yeah, definitely like that. Yes. She said, well, that’s what the artist wanted you to feel.
I think at that moment I got it. So from them on I’ve realized that art is about this form of communication, which isn’t about images and sound or anything. It’s about trying to convey an impression to another person. Sometimes that impression isn’t necessarily even something that the artist themselves thought of, but to inspire feelings in other people. The more you can inspire, the more love or hatred or disgust or feelings of passion and beauty. If what you produce can inspire feelings in others, then that’s real art. And so I like to look for that sort of thing now whenever I’m experiencing anything in the world.
CRAIG: Yeah, that’s really interesting. I mean I can just absolutely relate to the feeling that you’re describing and clearly this is something that happened to you as a child and has stuck with you, so it was obviously a very strong impression. That’s just -- I mean I think the artist would really take that as a compliment that it’s something that here you are, it’s years later and you still remember the emotional associations with that work.
PAULA: Absolutely. Yeah. Yeah. In fact, I have no idea what the piece was.
CRAIG: Right. All right, well, that’s very cool. But I think we will turn from that, as interesting as it is, and I’m sure we could discuss it quite a bit more, to matters technical because you’ve actually -- so as usual, we’re happy to talk about whatever you would like to talk about today. However, you had a talk accepted at the Conj, which is the thing that reminded me that we needed to have you on the show.
Like I said, you and I have spoken many times over the last, oh, most of a decade now at various Clojure and Clojure related events. But I saw your talk, and I’m like, oh, yeah, right. Paula is doing this thing, and it’s actually something that we here at Cognitect have many times said, boy, there’s a certain shape of problem that comes up over and over again to do with rules. I’ll let you explain your project more in a minute, but you’re doing work there. I think it makes it a great time to have you on and talk about Naga, the project that you will be speaking on at the Conj and that you’ve been working on lately.
Maybe we can stop me babbling on about it and have you explain what that is if you think it makes sense to talk about that project now.
PAULA: Yeah, sure. I actually had this talk accepted at Clojure/west earlier this year, but I had some family issues come up, and I wasn’t able to make it. But the project has progressed quite a bit since then, so it may be better that I’m talking about it now anyway.
The idea of Naga is that it’s a production rules engine. This let’s us build on Datalog a little differently than the way something like Datomic talks about Datalog, which is how a pattern matches against the database and bring all that information by them joining and performing various transformations on it. It allows you to pick up information through that process, but then use that to assert new information, which goes back into your database. This can then be an iterative process, which allows you to build out a whole lot of new information based on your original set of information, plus a set of rules about how that information would interrelate.
It works really nicely with the concept of graph databases because graph databases allow you to describe things. Depending on how the terminology for the database works, you’ll have like entity property value or subject predicate object. But these are different ways of describing triple statement and whether it be the word “property” or the word “predicate” where these come back to that branch of mathematics, which is predicate logic. By working with predicate logic, this is exactly the sort of thing that we expect to get out of our rules engine. We can declare that certain things exist, and these end up being assertions in the database.
Then we can talk about how different entities relate to one another depending on their properties. A really simple example that I like to use is in a simplified model of the family because people typically know how families work. For instance, if you have two people who are siblings, then you know that they’re going to share parents. If person A has a parent and person A is a sibling of person B, then we know that person B will have that same parent. If the parent has a brother, then we know that both A and B are going to have an uncle. We can look for these sorts of relationships, which are being described in the database. We can describe how, well, if I see this relationship, I can now infer these new relationships.
Then we can start building it out because family relationships can be quite complex. You can start figuring out, well, if I have an uncle, I can also figure out what a nephew is. Then if I start adding gender, then I’ve got aunts, and I’ve got sisters, and I’ve got lots of different sorts of relationships, which all start building out if I just describe a couple of relationships. I can say one person has a brother, as a mother. The mother has a brother, and suddenly we know the gender of several of those people.
We know we’ve got inverse relationships because if one person has a sibling then that sibling has a sibling of the first person. If somebody has a parent, then we know that they’re also going to be a child, so these are inverse relationships. And we can expand the whole thing out once we start getting a little more. We get cousins, and we can figure out all sorts of ancestry and third cousins twice removed and all sorts of things like that based on only a small amount of information to start with and the set of rules, which builds on that.
Naga is a way of both allowing you to describe what those rules are and an engine to run those rules as efficiently as possible across your database. In this case the style of database that I want to use is a graph store, specifically taking advantage of the fact that we have these triple representations of what predicate logic calls and assertion. Because I try to hide the storage then behind a protocol, this means that we’ve both got an in-memory version that I’ve got locally that Naga is running on, but also wrappers around other databases like Datomic, SPARQL stores, looking at OrientDB and things like this.
There’s a little more to it than that, but is there sort of part of this that you’d like me to explore?
CRAIG: Oh, well, no, it’s all very interesting. I mean, yeah, so where to even start. A couple things that come to my mind. First of all, just to make sure that I understand. You said it’s a production rules database.
PAULA: That’s right.
CRAIG: My understanding of that is that’s describing that property you were talking about where you have some facts and things that are merely asserted, right?
CRAIG: Like Craig has a brother named Kevin. That’s not derived information. That’s something that the database is simply told. Then you have other information that derives from that, right? We might say two things, like you were giving an example of I have a brother Kevin and my mother’s name is Maddie. Therefore my brother’s mother’s name is Maddie. But that last fact, that’s what it refers to by production, right? It’s the production rules is the thing where we’re able to arrive at new information based on rule and existing information. Is that a correct description?
PAULA: Yes, it is.
CRAIG: Okay. Great. Then I guess the question that leads me to is, like I said, around here I have definitely heard some peers, meaning Cognitects, I’ve heard some people saying, “Oh, gosh. I ran into yet another problem where we wound up needing a rules system, a rules engine, and there are a bunch of choices out there.”
I guess what I’m wondering, though, is aside from this example of families, which is a really good one for illustrating the point, but help me understand what sorts of problems that are more -- I don’t want to say real world because I’m sure somebody has that problem in the real world, right, like if you’re writing Ancestry.com, you have that problem. But like what are other concrete examples of problems that I can solve using a production rule system? Maybe one of them will be something that’s familiar to various of our listeners, if that makes any sense.
PAULA: Yeah. Yeah, only production ruled systems aren’t the only way to approach this. Sometimes we’re looking at things where we want to not assert that new data because maybe the data will explode. Sometimes it’s better to just describe these things as relationships. And so you end up doing a query against relationships, and the query gets transformed in the background for you. You can then question your database, and it’ll give you a solution to what you want, but you haven’t had to assert all of that new information.
CRAIG: Sorry. That’s something that -- you’re saying that’s something that Naga does or that’s, like, the general--?
CRAIG: Okay. Gotcha. Okay.
PAULA: Naga does not do that, but we do see that in other databases. Prolog will do that kind of thing. Prolog also lets you work backwards from some kinds of information and this works forward only.
But when it comes to practical applications, there are, for instance, when you’re looking at an inventory system. I want to build an engine, and I know that the engine contains these components. I then want to say, well, this component, the carburetor, it’s going to use these parts inside, and I need a butterfly valve. I need lots of different parts, and they’re all going to have serial numbers, et cetera.
Now what are all the parts that I needed to make the engine? What’s the overall budget on this? And I might not have put all of that information directly in about the engine, but it’s all related. Sometimes the system can get complex enough that it’s not easy to simply say, “Give me the breakdown of this engine,” because I might need to iterate over the database over and over until I’ve explored all possible breakdowns of the components to know what the costing of the entire thing will be in terms of parts. How many engines am I able to build with the parts that I’ve got?
PAULA: Inventory systems I know are something that NASA works with, specifically with this kind of thing because their engines, for instance, are incredibly complex devices with massive breakdowns. I mean we won’t say breakdowns. That’s a bad term, isn’t it?
PAULA: But I mean breaking the componentry down into its parts. You want to be able to describe things at a simpler level in each part of it and then allow the computer to reconstruct the rest of the relationships for your entire system. Having a nice, clear description of how things relate, you know, if this is in this component and that component is in this bigger component, then the small item is also in the bigger component. These kind of relationships of transitivity and the like are great to find upfront because they can be expensive to calculate on the fly if you’ve got a lot of data to work across.
That’s part of the issue with all of this as well. While there are systems, which can calculate these things on the fly, they often won’t scale particularly well because you end up needing to fill memory with some of this information. It can be difficult to get to that next step without having to calculate all of this stuff in between. But if you have some kind of scalable data solution on disk or in the cloud or something, then you can -- disk is cheap, so you can build up a lot there. So long as you kind of talk to that data efficiently, then you can do a lot of this work relatively scalably and you can do a lot of the work upfront so that, later on when you need an answer to your question, you can get the answer very quickly.
CRAIG: Okay. That makes a lot of sense, actually. I like that example. That actually helps me because one of the things that I’ve learned in the work that I’ve been doing over the last few years is how many problems are in fact graph shaped. Right? That’s just a really natural shape for a lot of problems. It’s an interconnected world, and being able to say this is true about that thing, and this other thing is related to that thing.
I can see where things that are graph shaped, you mentioned graph databases as a natural complement to or integration point with or backend of Naga, however you want to put that. I can see how this would be useful in that context. Did all that kind of hang together for you? Did it make sense?
PAULA: Absolutely. You’re sort of triggering on a few words for me there because my background is in the semantic web, and I know that RDF got mentioned a couple of times with the release of Datomic. Do you know about my background with graph databases, because I’m not sure if we’ve discussed that very much?
CRAIG: You know, I mean I might know a little bit, but I don’t think our listeners all know, so go ahead and share with them.
PAULA: Sure. Well, about 15 years ago now, a little more, we had this issue actually with Xerox Corporation, these things that led to Sarbanes-Oxley, the fraud, which was going on at various levels in organizations. When they were taken to court over it, there were massive paper trails that had to be drilled through, and the company that I was working for got some of this work to try to make sense of the data that was coming through.
While trying to figure it out, well, by then my boss David Wood was figuring that we needed to store all this metadata that we were extracting in some way. He started looking across all the standards, which were out there. One of the ones, which it just had been released and was starting to get some headway at that point, was the resource description framework or RDF, and that’s a graph model of your data.
Initially it was being touted as specifically for metadata, although over time it became clear that you can store anything in it. We tried to put it all into various databases, which were available to us at the time. With the computing hardware that we had in our little company, we couldn’t load up any more than something like 80 triples per second, and we had millions. We were getting more every week, and so we were simply unable to ingest the amount of data that we needed to.
A colleague of mine, David Makepeace -- and I wanted to mention him because he was a mentor to me throughout my career -- we came up with this idea that maybe because we knew that our data was in the shape of triples and it specifically could only hold certain data types in those three columns, then maybe we could optimize a system to hold that. And so we ended up building what ended up becoming one of the first commercial RDF databases. At that point it was called Tucana. A few years later we released an open source version of it, Kowari, and then with various machinations with the company finally being wound up. We renamed the open source system to Mulgara.
And so I ended up working on this triplestore database for some years, and I worked on various parts of it in the query engine. Initially worked in the storage lab, and it became this really interesting intersection between the efficient representation of data on disk and the data model that was going to be applied to it. Having to work at each layer of that system brought in new insights on how it all fit together.
Initially we needed to just put RDF in there, but then we needed some way to query it. We invented our own query language for it. Then it turned out that some of the other groups, which were also trying to build RDF databases like Hewlett Packard, and they created the Jena system, which was open source.
We all decided to get together and, under the auspices of the W3C, World Wide Web Consortium, we all joined to create a committee to query these sorts of databases. What came out of that was the SPARQL system, which was what the SPARQL protocol and RDF query language.
CRAIG: Love it. Recursive.
PAULA: Yeah. Exactly. There are some weird acronyms when it comes to these sorts of things because other than RDF, there’s a schema language, but there’s a much more complete concept than what a schema offers, and that’s called an ontology. That’s the Web Ontology Language, WOL. It’s pronounced “owl,” and the reason it’s called “owl” is because, in Winnie the Pooh the character of Owl knew how to read, and he could write his own name. He spelled it W-O-L.
PAULA: So the Web Ontology Language is “owl.”
CRAIG: I suppose it’s not surprising that these various semantic web technologies are themselves rather meta in their naming schemes.
PAULA: Exactly. Yeah. Of course all of these technologies were being defined out of the W3C, and apparently Sir Tim Berners Lee had this vision upfront that web pages weren’t simply supposed to be the be all and end all of the web because they were really about documents that people read, but they’re not really helping computers. The main information, particularly early on that was embedded in a web page, was about the presentation. It also had information about links to other documents. But really the computer had no knowledge of what was going on in these. You know, in the pages it didn’t know what was being described.
Sir Tim was thinking that what we really wanted was a web of information so that computers had some idea of not only the information that they held in their own documents, but also how that information relates to other information and, you know, so not simply where to find other information, but what information means and how those relationships occur. In order to do that, RDF was invented specifically to address some of these issues. This is why all of the identifiers in RDF are -- well, now they’re IRIs.
They started out as URIs, of which URLs are a subset. But over time they moved it up to Internationalized Resource Identifiers, so IRIs. That’s the latest standard. We have these IRIs for identifying anything, really. These are the nodes in your graph. Best practices will then have these IRIs refer to information across the web found in databases or in other documents on a website and things like this. This all gets described using RDF.
That’s great that we have this description of how things connect and what relationships are, but then what do those relationships mean? They created a schema, which describes, like, the domain and the range for relationships. So if I see this sort of relationship, then I know that it’s referring to something that talks about a vehicle. I see that sort of relationship that’s going to be talking about a person identifier and that kind of thing.
But that was very limited, so the more complete descriptive system then goes on to describe relationships between relationships. This more complete concept, which goes way beyond what database schemas one would expect, that was referred to as an ontology. The system built up around that was the Web Ontology Language or WOL. That was all semantically a lot higher level than where I’ve been because I’ve been really in the low level getting data and off this quickly, how to optimize queries best to address this sort of thing, and I hadn’t really considered those higher levels until one day the engineering manager turned to me and said, “We need WOL support. Can you do that?”
And I said, “Yeah, sure. I’ll look up the standards and have a go at it.” As it turned out, it was much more complex than I had anticipated. And I said, “Well, how have other people done this?”
We’ve got two other open source projects out there working on this: Sesame and Jena. What do they do? It turns out that they didn’t have a complete solution either. It was a post-doc research problem.
At the time, I was doing my postgraduate physics because I started as an electrical engineer and, when the first quantum computing algorithms came out, I was really inspired by this, and I went back to university to do physics. After graduating, getting an undergraduate degree, then I moved into postgraduate physics. I was just finding that postgraduate physics and full-time work was challenging.
Then when I discovered that my full-time work was now something that other people around the world were researching, I went to the university and said, “Do you mind if maybe I swapped disciplines here and moved into computer science, please?” So I did, and that was really good for teaching me how to do a lot of the research that I needed for this sort of problem.
At the time, because we were building a graph database, I was after scalable solutions, which there are some good scalable approaches to a more complete system, but the one that really stood out back at that point were rule systems. And so that was around 2004 that I started getting into rules and looking at it from the perspective of doing everything on this graph database. Back then I built one, and it still works, and it’s still not out in open source that’s built into Mulgara.
Initially the whole thing was configured using XML because that was -- well, it wasn’t XML. The whole thing was configured using RDF. At that point the only officially recognized serialization of RDF was in XML, which was a mistake, and no one should ever use XML to serialize RDF. But because it was official, I went with that.
And so I was configuring my rule system with RDF, and I found that I was writing out my rules using horn clauses, which is something that we find in Datalog or Prolog. Then I would be, by hand, converting them into the RDF structure that I needed to describe the rules that my system was configured with. I thought one day, “Why am I doing this? Why not just write this out? Why not write the horn clauses in directly and write a parser that does that conversion for me?”
Of course I was going to need to find dependencies between rules because if a rule produces statements that look like this, then I needed to have an efficient mechanism for saying, well, I’ve now asserted these new statements. Which rules should be run as a consequence of that, because some rules don’t care about those new statements, but others do? I was going through this process of doing so much by hand, and the more I did the more I realized I could automate, including the entire process of identifying where those dependencies were.
Then once I started finding where the dependencies were, I realized that I was duplicating what a Rete system would be doing. Rete -- I’m sorry because a lot of people don’t know. Rete is the fundamental algorithm for rule systems whereby it minimizes the amount of work done between the initial assertion of data and how often the rules need to be run to make sure that that data has propagated entirely throughout the system. It does this by remembering data as it flows through, and it’s a whole network of memory nodes, which build up the data that comes through. As I’m working with this, it became clear that instead of memory nodes, I was using lookups in these indexes on triples, which are really efficiently indexed, and so it was really fast to simply do a quick lookup of a pattern, and I essentially had just duplicated what this particular node in a Rete network would end up doing.
Because everything was already pre-indexed, it meant that the network had already been built for me, and so I got a lot of leverage out of the way that these graph databases would index triples or most of them, the way they index triples and then could leverage this using the standard query infrastructure into creating the sort of rule out counts that the Rete network was expecting. And then looking for changes in here, so rather than saying, well, a new assertion has come into the system and it’s going to end up creating, you know, flowing through the network and building, you know, adding into these nodes in the network. And because they have been modified, we now need to see whether or not that rule fires. Well, now I can do an efficient count on a particular pattern that’s been indexed, so a particular pattern, which has an associated index with it. Those counts were very efficient to do as well.
Using the standard querying infrastructure, all of this just sort of started falling out. It turned out that many of the labels in the research literature started applying when I approached this, like doing a count on this pattern where applied to an index is what we call semi-naïve reasoning. The reason for that is that we’re presuming that we’re creating new data all the time, and that means that if we see a change in the count on that data that means that we have something new asserted.
I’m rambling now.
CRAIG: No. This is good, actually. You’re taking us through the history of kind of how you arrived at where Naga is today, and it’s very interesting, so please continue.
PAULA: Well, so I built the whole thing for the Mulgara database, and it runs. My paths are red, something that looked like horn clauses, but instead of taking standard atoms, it was looking for queue names, which is a way of representing a URI or IRI as a namespace and labeled pair separated by a colon because almost all URIs that I was using, you know, followed that particular pattern. I found -- I didn’t know a lot of Prolog at that point. But once I started learning some Prolog, I tried typing it in and it just ran.
Now there’s things like Prolog. There are things that you can put into Prolog, which require backward chaining where you can’t simply say, “If A and B then C.” Sometimes you might have a B and a C, and you can work backwards to say, well, I must have had an A. Forward chaining rule systems can’t do those backward sorts of things. Prolog can. Prolog has greater capability, but it also has less scalability.
Anyway, it was really great to have built this. Once I had the rule system going, I started plugging stuff into it. And because there were lots of people who were using the database because they wanted to apply these schemas and ontologies that they’d already built, and I said, “Well, what happens if I put this in?” The first thing that came out was a whole lot of data, which I thought was just random crap, that I had a bug in the system, and I needed to find out what had gone wrong. I started tracing it through to figure out where did this new data come from.
As I traced it through, I discovered that none of that data was actually all correct, and I had never seen through to the consequences of some of the rules, which were a part of this vocabulary. That was possibly one of the coolest experiences in my computing career when I discovered that I had built something that was capable of discovering things, not simply to the scale, but it was capable of reasoning out answers that I hadn’t considered.
That was what got me interested in computers as a kid. I had this notion that computers could think, that they could figure out things that we can’t figure out. But as I got older I realized, oh, no, it’s just really fast at working through a whole lot of information.
Yeah, that’s what is happening here, but there are so many layers of abstraction and semantics, which have been applied, that now that we’ve gone through reasoning through all of this bulk of information with these rules, the consequences that come out are non-intuitive and can be beyond what you were expecting. To see new information like this just come out of it, for me, was kind of magical and it really inspired me to keep doing it.
It is difficult sometimes to take that step back. I do like working with Al for this reason because it lifts me up so many layers semantically away from the story. I sometimes have to swap up and down this whole stack. One of the things that brought me all the way back down to the bottom again, actually, was when Datomic came out.
When we first built Mulgara, it was Tucana and then it was Kowari, and now it’s Mulgara. It’s still Mulgara. But when we first built it, we were trying to find the best way to store these triples on disk. Everyone in the office who was interested in these sorts of things had a go over a weekend of trying to store these things, index these things onto disk. We benchmarked it to see how quickly we could get data on and off. At the end of the weekend, it turned out that the youngest kid in the office who had only been out of university for a few months had the winner. He built it using AVL trees.
AVL trees have this really interesting property. They may be a little slower to read because they’re a binary tree, but they’re super fast to write because any update on the tree is a O-1 operation. You might have to do lots of reading to find everything to identify what you’re looking for, but when you finally do a write to the tree, it’s just a O-1. There’s only a couple of write operations and you’re done.
The tree always stays balanced. This means that AVL trees are actually really, really fast for writing information. We’ve been using this now for several years, and while shallower trees have real benefits for reading, we’ve still been getting a lot of benefit out of AVL trees because usually the bottleneck for people has been in loading data.
I’m looking at fractal trees after seeing the talk on them at Strange Loop because that definitely looks like it’s optimized for writing as well. But the reason I mention the tree structures is that while we were looking at this, one morning on the Linux Kernel mailing list we saw this post on the release of the Tux2 file system, and that file system--tell me if you’ve heard this before--was built for a tree where whenever you made a modification, you never modified the tree. Instead, you would write out new nodes, which then did structural sharing with the original tree, and it would only need to duplicate from a particular point in the tree up to the root, which would only be a couple of levels deep. Any modifications to the file system ended up just being a few new writes, and you never had to -- what is it? You never needed to modify anything in the file system that existed.
CRAIG: Yeah, that sounds familiar.
PAULA: Yeah, and then when you wanted to -- well, when you’re finished with the write, you did this commit operation, which simply moved from the old root of the tree to the new root of the tree, and then the old root of the tree didn’t exist any more. We’re talking about 2001, and so you then went through from the root and you had to identify those nodes, which were no longer in use. You’d sweep them up so that they could be reused.
What it meant was that the file system would be guaranteed to be in a consistent state at all times. We thought this was a wonderful idea, and so Mulgara is built on that concept. Now we put an awful lot of effort into finding nodes, which were no longer being used in trees and sweeping them up in the background and then making them available for reuse later.
Then years and years later I started saying, well, we had this mantra, “This is cheap.” Why are we sweeping these things up? Why not just leave it around because it’s taking up so much effort to identify this stuff, and we have to do all this bookkeeping. Let’s throw away the bookkeeping. We’ll just leave it on disk, and we can just build it up. If we ever need to save space, then we’ll read from the index and just rewrite it into a new index. That can be done as an open odd operation or something.
Then Datomic came out, which when he was presenting it, essentially presented the same structure. Only then he said, “And you can query all the old version.” And the little light went on for me saying, “Oh, my goodness. I’ve got all the old versions there, but I don’t have any API to access them.”
I was sort of working since then into creating the APIs to access all of this old data. But Datomic was a very natural pick for me because it was almost exactly the same principles in architecture of many of the things that I’ve being looking at for many years already. So when Datomic came out, it was something that I naturally just wanted to try to apply these things to as well.
I’d heard comments from people--I don’t want to name names--talking about how Datomic wasn’t going to be really compatible with RDF because RDF has the whole open world model and Datomic doesn’t. At that basic storage layer, that doesn’t actually really apply. It’s always a closed world. You’ve got what you’ve asserted on disk and you don’t have what you haven’t asserted on disk.
And you know when I say old disk, this is an abstraction, of course, for where the data is really being stored. It could be in React or something like this. But you -- yeah, the whole storage mechanism is simply whether or not these triples or, in fact, they’re not triples because every database I know of actually has extra information on each assertion. But we call them triples because we typically have that subject predicate object or entity property value.
All of these statements on disk, they either exist or they don’t exist. Then we apply semantics at some higher layer as to how that works. The first thing I did was to build a wrapper around Datomic to try to load my RDF into. That’s still floating around called Kiara. Because I loved the recursive acronyms, that one was Kiara is a recursive acronym.
CRAIG: Good one. That’s awesome.
PAULA: Thank you.
PAULA: That worked quite nicely as well because it turns out that keywords -- well, there was an issue with this, but keywords and queue names are kind of looked at as analogous to one another because they had a name space and a label. The main issue there is that with large enough -- you know, a large enough RDF document, you might end up with millions of URLs. If I tried to map that into millions of keywords, then they will get in turned in Datomic, and that has a scalability issue. And the whole point of this is to scale, so Kiara has some issues there that need to be addressed.
But I was finding that the general model was working really well for me. Working with RDF, working with the semantic web, all of these things brought me to the U.S. in, what, 2006. And then a couple of years later I moved to a nonprofit who had me working on the database. After the financial crisis, they were running short on funding and asked if I could find something else. They gave me a good lead time to do that because I was on a Visa.
I happened to start at Revelytix, which is where Alex Miller was working. At that point I was sick to death of RDF. Sorry. Not RDF. I was sick to death of Java, and all of my Java code was starting to look extraordinarily functional. I was copying many of the things that I’d been learning about from Common Lisp and Ruby, and writing libraries in Mulgara to try to do that kind of thing. Everything was being done in these anonymous inner classes and functions, and it was becoming -- Java lets you do this, but it’s messy.
I’d been moving towards Scala at that point, and I started to at Revelytix. The boss there, the owner of the company said, “Well, do you know Clojure?” And I said, “Well, I’ve heard it’s a functional language, and I believe it’s a Lisp, isn’t it?” He said, “Yeah, you need to learn Clojure.” So I picked up Stu Halloway’s book and got through that and then started working in Clojure and really don’t like many other languages now.
What they had me do at Revelytix was, strangely enough, build a rules engine in Clojure, which I did and learned a lot in that process. It was my first real Clojure project, so there were things, which I could have done better, and I always intended to. Unfortunately, Revelytix eventually moved out of semantic web technology. They got into big data. All of the work that I was doing in big data ended up we stored all the metadata for that in Datomic. That was my job, and so I became really familiar with Datomic doing that.
Then we went -- that company got sold to Teradata, and they wrapped up that division last year. Now I decided I was really -- I wanted to do another rules system, this time open source, and I wanted to abstract it away from the databases, which I’ve been connecting it to in the past, because I was now starting to see this common element across all of the databases, the graph databases that I’d been working in.
I came up with the concept, which is now Naga. Initially I called it Kanones, which is Greek for rules. But then earlier this year I started at Cisco, and they were very interested in my semantic web background and in my rules system background.
Then I had a discussion with my manager about what I was doing with Kanones, and he got really interested. He said, “Look. I’d like you to do this on company time for Cisco.” We had this theme with the advanced services division of all of these -- all of our software, which has been named after characters in Avatar, the cartoon series.
PAULA: So I had to rename it, and I picked the polar bear dog in--what is it--the Korra group.
CRAIG: Oh, yeah, Legends of Korra, I think it is.
PAULA: Legends of Korra, yeah, so that’s where the name Naga comes from. In order to write rules, I wanted a quick, easy way to do it, so I wrote a parser using Parsatron that Nate Young put together a few years ago. I used to work with Nate at Revelytix. The language there, which is essentially a re-implementation of Prolog, I called that Pabu, and that was the little fire ferret.
Anyway, I was putting it together. I wanted to build it over Datomic, but the decision was made that we wanted to have something in house that wasn’t relying on another database. And sure I could leave my protocol in place to talk to other databases, but could I please build something in memory itself. I thought, but there are all these databases out there, and it’s all this functionality that I don’t want to have to re-implement. But I was asked to do so, and I was actually, despite seeing the process of Clojure for many years, I was surprised at just how quickly a query engine with an optimizer can be built in Clojure, basically under a week with one person, which really that was a new level for me.
It doesn’t do everything, but it does handle the inner joins. It does handle filtering. It does do projection. It does have a query optimizer, so it figures out the elements of the query. Incidentally, queries syntactically are identical to Datomic queries because I’ve been using that syntax now for some years. It will reorder the query to be some level of optimization unless you ask it not to. And it does all of this and all of the code to do that was by myself in under a week. That’s a call out to Clojure again because I just could not have done that in any other language.
CRAIG: I want to jump in there for a second because there’s this query language, and I’m trying to understand how that relates to the rules. I mean are you expressing the rules in terms of query or the rules produce data and then you do query to extract the information you need? What’s the relationship there?
PAULA: Sure. Well, a rule might be going with the family example. We might say if A has a parent B and B has a brother C, then A has an uncle C. Is that--?
CRAIG: Yep, I’m with you. Yep.
PAULA: Okay. We might query that out as select A and C, and we’re going to know that A is related to C through uncle. Select A and C where A parent B and then B brother C.
PAULA: And so that’s to select where that you might use in Datomic. That involves -- what’s needed there is the lookup of those relationships where A parent B. That’s a lookup in an index. Then B brother C is another lookup in an index. We need to be able to join those results according to the B that we found. Then we need to project that result down to just the A and C. That projection is what the select clause does.
We then want that to be reasserted into the database as triples, and we’re going to be asserting A uncle C as our assertion.
The syntax that I have a rule if you’re calling this as an API, which is what I was anticipating to start with for Naga, that syntax uses where clauses, which look identical to the where clause of a Datomic query, and it uses a keyword of colon dash to separate that from the production in the same way that Prolog does. Then the production is a triple of the same pattern that you might use in the where clause. When I’m doing it in the Prolog like language that I called Pabu, that ends up being structured quite differently, but it ends up being translated into the same thing.
Now internally the API for storage needs to be able to handle a query, which is given these patterns, which I’m going to join, and these filters. Then I want to see a projection of just these variables. That’s a really minor tweak to turn that into a Datomic query. But it also applies to the in-memory version that we needed that I was asked to implement with it.
CRAIG: Okay. If I understand this correctly, the query engine that you wrote is how you implement some aspect of the rules, but the surface area of Naga, the user facing stuff is still the Pabu language, if I have that correct.
PAULA: Yeah. Yeah, so Pabu is basically -- I’m in the process of doing another language because we’ve got more to it than what I’ve described. But Pabu is a re-implementation of a subset of Prolog. I’ve taken Prolog code and I’ve plugged it into Naga, and it just runs it fine. I’ve taken Naga programs, and I’ve put them into Prolog and they pop out just the same way.
The difference is that with Prolog I have to ask, what did you get here? Whereas in Naga it generates it all and ends up in the database. The database doesn’t, at the moment, have and front-facing interface for just executing queries on. You need to use an API. I’m thinking of wrapping a web interface around that just to provide easy access for it.
CRAIG: That’s all right.
PAULA: All of the function will be there.
PAULA: And I mean the rules engine takes what looks like relatively standard queries and their outcomes, and it deconstructs them. It figures out what those patterns were. It figures out what the interdependencies are between the rules. It then figures out that if I’m going to run this rule, which other rules could possibly have been triggered. Then it’ll check to see if the path of the rule that was dependent on the outcome of this one that’s just run, if that, if anything changed in relation to that. If it did, then okay, this rule is now a candidate for being run.
Yeah, and we can move forward with that sort of thing. That actually allows us to have negations in our rules. Not where we’re saying I want to remove these statements, but it allows us to say “I’m looking for these statements, but not where these statements exist.” A standard rule Datalog system won’t let you do that because they’re typically looking for the entire outcome of “once patterns have been joined across your data, did we see a change in the number of elements that we have?”. You can’t look for a change in the elements that you had because that could be something, which has scaled beyond what you can remember. But you can remember how large that result set was.
Typically Datalog has this restriction on it where you can’t say “I’m looking for data without this” because sometimes you’ll assert new data. Then when you assert other data, which applies to the filter, it means we’ve got new things, which have come in. We can actually keep our count not changing or even going down. Then the whole mechanism of trying to spot changes in the database based on how big the resolution of a particular pattern is against an index, that breaks down. But by doing -- sorry. The resolution of the entire query against the database. If we’re basing it on counts there, then that can break down. But if we’re looking for just particular patterns against the database, that’s much more efficient for looking up in an index and for counting, which is part of what I was trying to get at in the first few iterations of what I was doing here.
CRAIG: Does that always? Does that always work? I’m just trying to think like--
PAULA: There are certain cases where it doesn’t, and I try to filter that sort of thing out. The reason why we see restrictions on languages like Datalog, they’ll say, “Oh, we do this and this and this just like Prolog, but you’re not allowed to do that.” Then you look at it, and you said, “Why aren’t I allowed to do this?” Well, it often comes back to reasons like, well, all of the implementations do it by counting this result and comparing this number to that number. Everybody has done it that way, so we have to put this restriction on the language.
There are certain decisions that I’ve made in various parts, which allow for more flexibility. It’d be great to allow for even more, but then I need to start using memory to record certain things, which are happening. And I’m not doing that at this point, so there are restrictions on how far you can go with the language. But certainly in terms of what exists and what doesn’t exist, yeah, it allows mitigation, whereas normal Datalog doesn’t.
CRAIG: Sorry. Can I clarify something? I don’t think everybody is aware of this because I think some of our listeners at least might think Datalog is Datomic’s database language. In fact, it’s a whole family of languages. Like the things that you’re talking about that are properties of Datalog are not necessarily properties of Datomic’s datalog. For example things around negation. I just wanted to clarify that because I know when I first came to it, I was like, oh, Datalog. That’s how Datomic queries. Well, it’s actually been around since the ‘70s.
PAULA: Datalog is essentially a subset of Prolog aimed at databases, hence the name.
CRAIG: Mm-hmm. Cool.
PAULA: Prolog being programmable logic. Well, now this is database logic.
PAULA: It’s based on querying and assertions, which is exactly what I’m doing here. The thing is that back then everything was based on these much more flexible relational database tables and the like. It had performance issues and all sorts of things. There was a lot of research done on it back in the ‘70s. But they kind of moved away from it and much more toward what SQL can do. SQL does have a relationship to Datalog, but SQL doesn’t allow for things like looping and Datalog does.
PAULA: I can -- I’m able to say, like, I can declare something to be transitive. That’s a very simple rule where I say A property B and then B property C. Therefore A property C.
CRAIG: Yeah, like with your inventory system where you were saying, show me all the things that are subcomponents of this component, right?
PAULA: Yeah. Yeah.
PAULA: Transitivity is something, which is a very simple rule, and it’s a really common rule that you see in these sorts of systems.
CRAIG: I want to make sure that we maybe kind of -- we’ve done a great job of covering the past. I think we’ve hit on many of the kind of present features and status of Naga, but maybe we could just briefly make sure we cover future, like what you have in mind. Then we can kind of start to loop back and see if there’s anything else we should hit and bring it down from there.
PAULA: Well, the future for it, one of the things that we have is a set of modules of JSON importing and exporting. The data that we’re processing at Cisco, which is about identifying network events and what’s going on inside machines based on our software, which is moved across the network. We’re looking for threats to the system, so malware. As we accumulate this information, it’s all represented as JSON.
The import mechanism turns all the JSON into a graph. It turns it into a graph using a schema that is relevant to what we’re trying to do with the rules engine. Then once rules run over it, we can then extract everything back out of the graph as JSON again.
Now the Pabu language isn’t really optimized for that sort of thing. It works, but we have these various levels of indirection because we’re not referring necessarily from one entity directly to another entity, and instead we’re referring to an entity by a label that entity may contain. This creates indirection when we’re doing it with Pabu and also some knowledge of how the JSON was encoded into the graph. I’m building a new rule language, which will handle that sort of thing and make it much more intuitive.
I’m trying to expand on the number of databases that I can wrap on. I haven’t finished doing the Datomic one at the moment, for instance, and I want to move straight into SPARQL. Hopefully I’ll have Datomic done before the Conj. It shouldn’t be very much because I’ve got half of it done now.
I definitely want to add more features into the rules engine itself so that I can handle some of these issues around negation or others calling out to external sources. Production rules right now only assert new data into the graph. We also want to send singles out on a messaging API so that you can say, trigger events to occur somewhere else in your system. There’s a lot more work to come out of this, and I’m definitely having fun with it.
CRAIG: Is it ready? If I said I have a production system I’m considering using it, would you advise me to do that or to wait?
PAULA: It will definitely handle certain sorts of things quite well already. So if you have a limited amount of data and you have a set of rules, whether written in Pabu or you want to program directly against the API, it will definitely load that data into memory, run against it, and give you all of the new information out again. If your requirements look like that, then yeah, it’s ready right now.
The sort of requirements that I have are a little beyond that, and I’m not quite there yet. Like I said, I wanted to be able to talk to Datomic or at least I want that persistent scalability in the background. But in terms of doing logic, forward chain logic across your data, then it does that quite well right now.
CRAIG: Okay. Great. It sounds like definitely one to keep an eye on, either as a possible use in a project where those requirements are met or for future use when you expand it out to meet any other requirements. Very, very cool.
Awesome. Well, that is a really excellent overview of Naga. As we mentioned briefly at the beginning of the show, you will be talking about this at the Clojure/conj coming up in, boy, just about a month now. Getting close.
PAULA: Well, I don’t need to now. I’ve told you everything.
CRAIG: Well, I’m sure people will still very much enjoy your talk.
PAULA: Thank you.
CRAIG: As we were discussing kind of offline before the show, you’ll be able to have pictures and wave your hands and all that good stuff, so I’m sure the talk will be well worth watching.
PAULA: I was already waving my hands while I was talking now.
CRAIG: Yeah, so people will be able to see you do that on the video, and I’m sure that will help enormously, although your explanation was quite clear for me, so I appreciate that.
PAULA: There are some extra -- I’m not going to mention now, but there are some extra interactions of ontologies with the rule system, which are quite interesting in terms of complexity that comes out of it.
CRAIG: Cool. Well, we can definitely have you back on as you move the thing forward and maybe check in whenever makes sense and see whether you’ve gotten to. But I do want to leave a little bit of time here as we start to wind down for anything else. We didn’t even talk about Gherkin. I don’t think it’d make sense to talk about that. That’s been awhile, but you know that’s a favorite of mine. People can go look up what that is online.
PAULA: They have to look at Alan Dipert’s talk, and it’s really a shame that that video doesn’t capture the laughter that was going on in the room.
PAULA: Nobody believed it was real to start with.
CRAIG: I know. Just not to tease people. Very, very briefly, this is a Lisp that is implemented in Bash, and Alan presented this a few years ago. People literally thought he was joking. I wasn’t sure myself, but you’ve actually put a bunch of work in on it. Anyway, we’re not going to talk about that today, but if there’s anything else that you wanted to share before we wind down, that’s cool. If not, like I said--
PAULA: Well, there was one comment that I wanted to put out there because I’ve been running into it a little bit lately. I’ve been getting involved with Women Who Code up in D.C., which is an organization, which supports and promotes women working in the technology industry. There’s a lot of people who have come into the industry sort of sideways through, like, boot camps or lots of different backgrounds into coding. Working with some of these people, there are some really, really bright women.
What I’m discovering is that they’re often going out for job interviews and, because of their background through a boot camp or through learning through some other mechanism, they’re not being given the look-in that I think they probably deserve. I’m just sort of -- I’m still figuring this one out, but I’m thinking that the interview process that we have is very much sell yourself to the person who is interviewing you, and yet that has very little to do with the kind of job that people need to do when they finally show up in the workplace and sit down to start writing code or designing or whatever.
There’s this big disconnect between how we’re selecting people to work and the people who would be best working in our company or really have something valuable to offer our company and their ability to slot into that interview process. I know there’s been some research into how women will present and people of different cultures will present in interviews, and how this doesn’t necessarily lead to people who may be able to offer the most necessarily getting the job. But it’s still something, which is unclear to me as to what the best approach is, but I’d like anyone who is looking to hire anybody at the moment to consider what their requirements are and really look at a person not necessarily how they look in the interview, how confident they are, how well they can sell themselves, but to really consider what they may need for their job because we’re seeing a lot of research coming out that shows that if you had greater diversity in your workplace, then you will have greater return on investment for your company. It would also be giving a lot of people who deserve opportunities the chance that they really should be getting.
CRAIG: Yeah. Thanks for that a lot. I mean I’m a white male, so I have a certain perspective. But my daughter, my older daughter says she wants to be a programmer. Who knows if that will survive her teen years, but she is neither male nor arguably white. Her mother is Chinese, so a number of dimensions there in which she is not the stereotypical candidate. I’m really, really interested, at least at a minimum as her father, in the way that our industry that she’s interested in treats people outside kind of the white male. I don’t know, stereotype, if that’s the right word.
I guess my question is, given that, one of the things that we do at Cognitect, our interview process is -- maybe interview is the wrong word, but our selection process includes what I think to be a substantial emphasis on actually working together. We have multiple hours where the candidate and one of our employees sit together and work on real problems.
For instance, one of the things I might do is, if I’m talking to somebody on a Friday, I say, “Well, this is a problem that I’m solving. Let’s spend two hours together actually making progress,” not setting up situations where I’m trying to determine some specific aspect of their knowledge, but really can this person and I sit together and progress this actual technical problem in the exact same way that we would be asking them to do should we decide to work together.
Is that the sort of thing that, in your opinion, is going to be more appropriate than the kind of interview, answer these questions, and convince me in a sales perspective that you should work for us?
PAULA: It definitely is an improvement, but there are also issues, which show up with that. One of the things, which tends to come out there, is that there’s a bias towards enjoying working with people who are like yourself. And if you’re finding that maybe you’re making progress, but you’re being challenged a lot by another person, are you necessarily? Now you may, but is every interviewer necessarily going to appreciate that development process? It may be that it was a better development process even though it wasn’t as comfortable.
A lot of the processes that we have now, even that sort of thing, tends to bias towards hiring people who are already like us because you say, well, I want to work with this sort of person, you know, whoever we employ, and I want to make sure that they enjoy the sorts of things that I enjoy. That’s not necessarily true across the board, but definitely I’m seeing a lot of that sort of bias showing up all the time with people making statements like that.
Is this the sort of person that I would want to share a drink with at the end of work on a Friday? We can end up with unintended biases coming through when we have that sort of approach. Like I said earlier, I don’t necessarily have the solutions here, but I really would like to think that people could be aware that they have this bias for wanting to work with people like themselves and to try to look a little further afield because it’s not comfortable. You know I’m going to want to look for people like myself as well, but some of the most productive interactions I’ve ever had were with people who challenged me to the point of discomfort.
CRAIG: Mm-hmm. That’s a really great perspective. I really appreciate you sharing that. And as so often happens, we have managed to wander into a topic that, by itself, would be excellent advice. Yet here I am left with the question that I ask all of our guests at the end of the show, which is, do you have any other advice--I suppose I should put it--that you would like to share with our audience?
PAULA: I have a motto that I go by, and I wouldn’t necessarily provide it as advice for others, but it certainly works for me. It’s problematic in and of itself because it’s a little judgmental in some ways. But I heard this phrase once: Only the mediocre person is always at their best.
PAULA: The thing is that I think everybody has the capacity for greatness on occasion, and so declaring that we have mediocre people isn’t necessarily a very nice thing, but I do like it in that it both inspires me to have those bursts where I do something really great, but also it allows me to be a little more kind to myself when I feel that I haven’t produced something or I’m not being very productive right now or I’ve just made a really stupid mistake and, you know, what kind of idiot does this make me. Just remembering that phrase reminds me that I am allowed to be not at my best sometimes because when I am at my best, maybe I’ll be really, really good.
I think it also goes the other way around. Every person out there, including myself, especially myself, makes horrible mistakes on occasion. If we can acknowledge that, it can allow us greater compassion for ourselves when we make those mistakes and greater compassion for the people around us when they make those mistakes because everybody does.
CRAIG: Yeah. It’s definitely true for me. I like that a lot. That’s really great. Well, Paula, thank you so much for taking the time to come on the show with us today. I really enjoyed the discussion of Naga and your perspective on the other things we talked about. It’s just good stuff. I mean I think you did a great job of explaining it, and it’s clear we’re going to check back in with you at some point to see where Naga has gotten to, the other interesting things that you’ve done because, as I said, there’s at least -- at least, and I know there’s more -- but there’s at least one other interesting thing that you’ve done that we didn’t touch on at all, so thanks a ton for coming on the show.
PAULA: Thanks for having me.
CRAIG: Yeah, this was great. A great time. Always good to talk to you, but I think we will wrap it up there. Great show. Thanks again, and we will thank our audience. This has been The Cognicast.
[Music: "Thumbs Up (for Rock N' Roll)" by Kill the Noise and Feed Me]
CRAIG: You have been listening to The Cognicast. The Cognicast is a production of Cognitect, Inc. Cognitect are the makers of Datomic, and we provide consulting services around it, Clojure, and a host of other technologies to businesses ranging from the smallest startups to the Fortune 50. You can find us on the web at cognitect.com and on Twitter, @Cognitect. You can subscribe to The Cognicast, listen to past episodes, and view cover art, show notes, and episode transcripts at our home on the web, cognitect.com/podcast. You can contact the show by tweeting @Cognicast or by emailing us at firstname.lastname@example.org.
Our guest today was Paula Gearon, on Twitter @quoll. Episode cover art is by Michael Parenteau, audio production by Russ Olsen and Daemian Mack. The Cognicast is produced by Kim Foster. Our theme music is Thumbs Up (for Rock N' Roll) by Kill the Noise with Feed Me. I'm your host, Craig Andera. Thanks for listening.