BSDTalk Podcast #53: Interview with Matt Dillon

Original audio available at bsdtalk.blogspot.com. This document is available in other formats at http://derek.trideja.com/bsdtalk/.

BSDTalk: Hello and welcome to BSDTalk, number 53. It's Wednesday, July 12th, 2006. I just have an interview for you today, so we'll go right to it.

Today on BSDTalk, we're speaking with Matthew Dillon from the DragonFly BSD project, so I want to welcome you to the podcast.

MD: Thank you very much, Will.

BSDTalk: And the reason I wanted to speak with you was the upcoming 1.6 release. Perhaps you could start by describing the release numbers and how those work.

MD: We're kind of using, I'm not sure if it's exactly equivalent to the original Linux release numbering, but it's pretty close. We're using even numbers for releases and odd numbers for development, so the last release was 1.4, the current development tree is 1.5, and it will branch into 1.6-RELEASE and then the new development tree will become 1.7.
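
[Note: the even/odd numbering scheme described above can be sketched in a few lines of Python. The function names are hypothetical and exist only for this illustration; they are not part of any DragonFly tooling.]

```python
def parse_version(version: str) -> tuple:
    """Split a 'major.minor' string such as '1.6' into two integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def branch_kind(version: str) -> str:
    """Even minor numbers are releases; odd minor numbers are development."""
    _, minor = parse_version(version)
    return "release" if minor % 2 == 0 else "development"


def after_branching(dev_version: str) -> tuple:
    """Return (release, next_development) for an odd development version."""
    major, minor = parse_version(dev_version)
    if minor % 2 == 0:
        raise ValueError(f"{dev_version} is not a development version")
    return f"{major}.{minor + 1}", f"{major}.{minor + 2}"


print(branch_kind("1.4"))      # release
print(after_branching("1.5"))  # ('1.6', '1.7')
```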

BSDTalk: So, for people who are running, you know, roughly stable systems, the assumption is that they've been sticking to 1.4, and then waiting for the 1.6 release.

MD: Yeah. We do, of course, commit to the 1.4 branch when there are major bug fixes, but usually once we get about halfway through development cycles, the changes get too extensive to really be able to MFC them. But, you know, then they wind up in the next release, and that's how it's worked this time too. There have been a substantial number of bug fixes that we simply couldn't bring into 1.4 due to their complexity, which will be in 1.6.

BSDTalk: So once 1.6 comes out, then there won't be security updates, at least by your team, for 1.4?

MD: Uhm, you know, that's really ... really depends on the developers' time availability. We don't have a whole lot of developers. [laughs] It's not for a lack of wanting to bring them, but yeah, generally speaking, we think people using DragonFly really have to upgrade to whatever the latest release is. And we actually spend a lot of effort trying to ensure that the releases are forwards-compatible.

BSDTalk: So for those moving up to 1.6, what can they be looking forward to?

MD: A whole tonne of stuff! First of all, you could consider this mostly a bug fix release. There have just been a tonne of bug fixes, especially to the filesystem code, and in addition to the bug fixes, we've also made significant progress in the infrastructure work. The buffer cache is now split into what's called a buffer and what's called a BIO, which is the actual entity that represents an I/O. And we've done that separation, similar to what FreeBSD did about a year ago, and we've also completely ripped out the block number scheme in the buffer cache, and we now just use straight 64-bit offsets which simplifies a lot of things. And we've pushed the big GIANT lock in quite a ways, so most of the file descriptor routines are MP safe and will run as such.

But mostly it's been bug fixes, new random number generator, major code cleanups, new drivers support, tonnes of fixes to softupdates... I'm going through the list here as I tell you... the list is, like two or three hundred entries long, so I... [laughs] won't repeat them all... malloc() improvements, fixes to NFS, you know, things like that. Fixes to the floating point subsystem, fixes to threading...
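
[Note: the move from block numbers to byte offsets mentioned above can be illustrated with a small Python sketch. A block number only means something together with a block size, so layers with different block sizes must keep translating, whereas a byte offset means the same thing at every layer. The block sizes and function names below are assumptions chosen for illustration, not DragonFly's actual structures.]

```python
FS_BLOCK_SIZE = 8192   # assumed filesystem block size, in bytes
DEV_BLOCK_SIZE = 512   # assumed device sector size, in bytes
MAX_OFFSET = 2**63 - 1  # largest value of a signed 64-bit offset (assumed signed)


def block_to_offset(blkno: int, block_size: int) -> int:
    """Convert a block number into a byte offset for a given block size."""
    offset = blkno * block_size
    if not 0 <= offset <= MAX_OFFSET:
        raise OverflowError("offset does not fit in a signed 64-bit integer")
    return offset


def offset_to_block(offset: int, block_size: int) -> int:
    """Convert a byte offset back to a block number; it must be aligned."""
    if offset % block_size != 0:
        raise ValueError("offset is not aligned to the block size")
    return offset // block_size


# Filesystem block 10 and device sector 160 name the same byte position.
offset = block_to_offset(10, FS_BLOCK_SIZE)
print(offset)                                   # 81920
print(offset_to_block(offset, DEV_BLOCK_SIZE))  # 160
```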

BSDTalk: How long ago was the 1.4 release?

MD: Uhm, we usually try and get a release out once every six months, so the 1.4 release was basically January, and 1.6 is, well, July, and 1.8 will probably be at the end of this year or early next year.

BSDTalk: All right, so, it's a terrible thing to do to bring up old wishes, but you did have a document a while back, I think it was December 2005, where you had plans for the 1.5...

MD: Oh yeah!

BSDTalk: Uhm...

MD: Yeah, those, yeah, those really got pushed back.

BSDTalk: Some of those, like the ZFS filesystem from Sun Microsystems, stuff like that, maybe you could talk about some of the things that were pushed back, and maybe why some of that happened.

MD: Well, there are three items we consider really big-ticket items that we want to get into DragonFly: userland VFS, which would be an API to userland to allow a filesystem to run as a user process; ZFS, which kind of needs the userland VFS to be able to do the port. Trying to port ZFS straight clean into the kernel would be fairly difficult. And then, everything related to clustering, which is kind of the big enchilada.

Originally when I started DragonFly, I didn't think it would take this long to be able to get to the point where we could actually start integrating the clustering code in. But it's turned out that there's just been a whole lot of infrastructure work that has had to be done in the DragonFly kernel to even come close to supporting something like the clustering. I'll give you one example to kind of show you [laughs] how difficult this is.

To do the clustering properly, we need to be able to, for example, share a file between two machines, but it's not quite that simple. It's not as simple as NFS. There has to be complete, 100% cache coherency between an open file that's being read and written on one physical machine and an open file that's being read and written on another physical machine, because in a clustered system, a single program that's threaded might have a thread running on one machine, and another thread running on another machine, and we want that to be transparent. Now, in order to do that, we need a cache coherency management system, which basically comes down to dealing with range locks, but range locks in a situation very similar to what you have when you have an SMP system and you have the cache coherency between the CPU and memory caches, that is in hardware. We have to do something very similar in software to really make the type of clustering we want work.

And that creates a whole chain of issues. For example, traditionally, a BSD system will lock a vnode exclusively for a write operation. Well, you can't lock a vnode exclusively in a clustered system--[it] doesn't work, you have to lock the entire range of the file or it kind of defeats the idea of having, you know, shared caches and parallel operation and all of that. So all of the vnode locks have to be converted from a single global shared or exclusive lock per file to, really, a range lock per file, and it just creates a whole chain of infrastructure changes that are hard. And I've been slogging through it, basically, and that's where we are.
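
[Note: the difference between one lock per file and a range lock per file can be sketched as follows. This is a minimal, single-process Python illustration under assumed names (RangeLock, FileRangeLocks, try_lock); it does not block, does not talk to other machines, and is not DragonFly's implementation.]

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RangeLock:
    owner: str       # who holds the lock (a label only, for readability)
    start: int       # first byte covered
    end: int         # one past the last byte covered
    exclusive: bool  # True for a write lock, False for a shared read lock


@dataclass
class FileRangeLocks:
    """All range locks currently held on one file."""
    held: list = field(default_factory=list)

    def try_lock(self, owner, start, end, exclusive):
        """Grant the lock and return it, or return None on a conflict."""
        if start >= end:
            raise ValueError("empty or inverted byte range")
        for lock in self.held:
            overlaps = start < lock.end and lock.start < end
            # Overlapping ranges conflict unless both locks are shared.
            if overlaps and (exclusive or lock.exclusive):
                return None
        granted = RangeLock(owner, start, end, exclusive)
        self.held.append(granted)
        return granted

    def unlock(self, lock):
        self.held.remove(lock)


locks = FileRangeLocks()
first = locks.try_lock("machine-a", 0, 4096, exclusive=True)
# A writer on a disjoint range proceeds in parallel.
second = locks.try_lock("machine-b", 4096, 8192, exclusive=True)
# A reader inside the first writer's range has to wait.
blocked = locks.try_lock("machine-b", 1000, 2000, exclusive=False)
print(second is not None, blocked is None)  # True True
```

With a single exclusive lock per file, the second writer would have been refused as well, which is the loss of parallelism described above.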

BSDTalk: I guess it's now time to ask you to look in the crystal ball again. [laughs]

MD: Guess? [laughs] Well, I don't think we're going to have either clustering or ZFS up by 1.8, which would be the end of this year. But I think there's a very good chance that we'll have either ZFS or clustering, but probably not both by July next year, one year from now. But it really depends on how well I'm able to make progress on this core infrastructure that's required to support these mechanisms.

BSDTalk: What about amd64 support?

MD: It's not on the table at the moment, or at least not for me. I can only do one thing at a time. If another developer wants to do an amd64 port, they're certainly welcome to and I'll definitely support that.

BSDTalk: Did I see something about the make world for amd64 partially succeeding?

MD: That was basically makefile support, not really any infrastructure or any actual code. But it's certainly a prerequisite to doing any amd64 work. We aren't actually very far away from an amd64 port, but you know, it comes down to developer resources, and at the moment, I think there are only three people that are actually doing really heavy kernel work, and none of that right now is amd64. I'm really focused on userland VFS and clustering.

BSDTalk: It looks like there was some noise around removing ipfw, the IP firewall tool...

MD: Yeah, there's been talk about it. I think it's doable. We definitely want to try and reduce the number of firewalls we've got in the system in order to be able to get the big giant lock pushed through, all the way through the networking system. Right now, MP support within the system is kind of spotty. We've got fairly good coverage of the networking subsystem, but there are pieces that aren't MP safe yet, and the firewall is one of them. There are a few things that have to be implemented in Packet Filter before we can remove ipfw, and that's why it's not going to be removed, probably at least for six months. For example, we want dummynet support in Packet Filter, pipes, rate limiting and that sort of thing, before we remove ipfw.

BSDTalk: And when you talk about MP safe, and multiple processors, has there been some recent testing about how well DragonFly scales?

MD: Probably not very well in an MP test. [laughs] Basically what it comes down to is the big giant lock, because if you have one place in the code path, like a read call or a write call, that requires a big giant lock, it will create cache contention and MP issues in an SMP system, even if it only applies to a small portion of the code that's actually run in the kernel. That's pretty much where we are. There are still some pieces in almost every system call that still have to be made MP safe. But our ability, or when we're able to remove them, that will allow the whole codepath to be MP safe.
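
[Note: one rough way to see why a small locked portion limits scaling is Amdahl's law, which bounds the speedup when a fraction of the work must run serially. The 10% serial fraction and 8 CPUs below are made-up numbers for illustration, not DragonFly measurements, and the formula ignores the cache contention mentioned above, so real results would be worse.]

```python
def amdahl_speedup(serial_fraction: float, cpus: int) -> float:
    """Upper bound on speedup when serial_fraction of the work cannot
    run in parallel (for example, because it holds a global lock)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cpus)


# Hypothetical: 10% of each call runs under the global lock, on 8 CPUs.
print(round(amdahl_speedup(0.10, 8), 2))  # 4.71
# With no serialized portion, the bound is the number of CPUs.
print(amdahl_speedup(0.0, 8))             # 8.0
```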

BSDTalk: pkgsrc, another topic.

MD: Yes! Actually quite a success so far.

BSDTalk: Yeah, so that's been working out well for you? The importation of pkgsrc?

MD: It's been working out pretty well, yeah. We're still, I think our coverage is a few thousand packages at the moment that will compile and run on DragonFly. We've got a good chunk of, well, we've got X working, we've got a good chunk of the X base utilities and UIs working. There are still plenty of issues, but the nice thing about pkgsrc is that it's effectively a multi-vendor project, a multi-operating system project, and we have several developers that are pkgsrc committers as well. Generally when we're able to get problems fixed, it's fixed permanently. You know, it's pretty much the developers that encounter these problems get the bug fixes in and it cycles through and it winds up in a quarterly pkgsrc distribution, and we're getting them fixed.

BSDTalk: As before, is the installer going to be through a live CD?

MD: It's always been through a live CD.

BSDTalk: And any big changes or additional tools that are going to be going into the next live CD?

MD: Uhm, probably, well, we're going to include kernel source. We try and keep the live CD as minimal as possible. We will include kernel source in the next live CD. We probably will not include an X environment, mainly because the X support in pkgsrc is just stabilized for DragonFly, and it's probably, it's still a moving target. But hopefully in the December release, in 1.8, we will get some kind of integrated UI support, not for the installer, but at least on the CD so that people can get X up and running without having to access the Internet to download the packages.

BSDTalk: Great, well, are there any other topics you wanted to talk about today?

MD: You know, the big thing for me, at least for the DragonFly project is clustering support, and it's still my goal, and I still think it's very achievable, even though it's been pushed back a couple of times. We've made huge progress in the kernel, I think the 1.6 release--[the release] we're going to do in a week, is probably the most stable release we've ever had. And at this point, I would consider it far more stable than any of the FreeBSD 4.x releases, which are kind of the benchmark for stability in the past. So I'm really happy about, you know, the way the kernel is progressing, especially on the stability front. That's been a very important feature for me in doing all this MP work and all the clustering work and everything; to be able to keep the kernel stable while we're doing it. And I think we've succeeded.

BSDTalk: I think that a lot of the concepts you guys are attempting really provide for some wonderful proving ground. There's no better way to discuss a theory than to actually try and implement it.

MD: That's, yeah, that's absolutely true. We've certainly proved the viability of many of the networking mechanisms we've implemented. The parallel route table works fine. The threaded networking stacks, they work pretty well. It's not, for performance testing-wise, they probably won't test out very well simply because the MP lock has not been completely removed from them. But we have been able to test certain code paths.

For example, you know, thread switch times are under 1 microsecond. Inter-processor messaging is under a microsecond, and it can be batched, and in batch mode, you'd have a 1 microsecond delay for the first message, and basically just 10, 15 nanoseconds for each successive message that's handled in a batch. And those are very important concepts to be able to prove out because they allow the kernel to be coded in a much more modular fashion, and a much more debuggable and understandable fashion.
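
[Note: the batching figures above lend themselves to a quick worked calculation. The Python sketch below treats the "under a microsecond" figure as a flat 1,000 ns for the first message and takes 15 ns, the upper end of the quoted range, for each later message in the batch. Both are simplifying assumptions, and the function name is hypothetical.]

```python
FIRST_MESSAGE_NS = 1_000  # assumed cost of the first message in a batch


def batch_latency_ns(messages: int, per_extra_ns: int = 15) -> int:
    """Total time for one batch: full cost once, then a small
    incremental cost for every additional message."""
    if messages < 1:
        raise ValueError("a batch needs at least one message")
    return FIRST_MESSAGE_NS + (messages - 1) * per_extra_ns


batched = batch_latency_ns(100)
unbatched = 100 * FIRST_MESSAGE_NS
print(batched)        # 2485
print(unbatched)      # 100000
print(batched / 100)  # 24.85
```

Under these assumptions, 100 batched messages cost about 2.5 microseconds in total, against 100 microseconds if each paid the full first-message cost.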

BSDTalk: Would you consider yourself a hybrid, micro/monolithic kernel, or how would you describe your kernel?

MD: [takes a breath] Well, I'm not, I'm not really in either camp. I support the idea of having a modular kernel, but at the same time, I recognize that certain large infrastructure pieces of the kernel really have to be monolithic in order to get any real performance out of them. You know, you have to have an infrastructure that you can depend on, and that really means monolithic. And in order to have stability, you really need to have something that's integrated.

It doesn't have to be integrated in a monolithic fashion, but it certainly has to be integrated in that all the code's in one place, and you're not trying to, you know, take a Windows box for example. You've got 50,000 drivers from different vendors and it's kind of hit-or-miss whether they'll work or not with any given version of Windows. So, you know, there are different levels of, you know, being monolithic or modular, and I'm kind of in the middle ground.

BSDTalk: Thank you very much for speaking with me today, and I want to wish you luck on the 1.6 release, and I look forward to trying it out.

MD: Sure, thanks a lot, Will.

BSDTalk: And maybe we'll catch up with you again, maybe after ZFS, or single system imaging...

MD: Uh, well, for 1.8 and, I don't know, you know, maybe, maybe we'll have a 1.10 or maybe it will become 2.0. It really depends on whether we're able to get userland VFS in or some level of the clustering in. So cross your fingers. [laughs]

BSDTalk: All right! Great. Thank you!

MD: Great!

BSDTalk: ... if you'd like to leave comments on the website, or reach the show archives, you can reach it at bsdtalk.blogspot.com, or if you'd like to send me an e-mail, you can reach me at bitgeist@yahoo.com. This is BSDTalk #53. Thank you for listening.

[transcribed 12-Aug-2006 by Derek Warren - derek@trideja.com - http://derek.trideja.com/]