Recently I was working on a socket pool for a new scheduler for OCaml 5 (multicore baby!) and I ran into a strange issue.
This new socket pool works by spinning up a series of lightweight processes to accept connections. Every one of those will wait for a client to connect, and create a new lightweight process to handle a connection. Eventually, the client will terminate the connection and the relevant processes are terminated.
All good so far.
All of this accepting and connecting is done via file descriptors (a Unix.file_descr). In some cases they correspond to listening sockets, when used to accept new connections, and when connected to a client they are streaming sockets (so a socket used to send/receive data). But really, all you have is the integer behind the Unix.file_descr type: the Unix file descriptor.
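To make the two kinds of descriptor concrete, here is a minimal sketch (not the scheduler's actual code) that plays both roles in a single process. It assumes the unix library is linked, and the loopback address and port 0 (any free port) are choices made only for this demo.

```ocaml
(* Build with the unix library, e.g. `ocamlfind ocamlopt -package unix ...`. *)

(* A listening socket: bound to an address and used only to accept. *)
let listener = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0

let () =
  Unix.setsockopt listener Unix.SO_REUSEADDR true;
  (* Port 0 asks the OS for any free port (demo assumption). *)
  Unix.bind listener (Unix.ADDR_INET (Unix.inet_addr_loopback, 0));
  Unix.listen listener 16

(* A client connecting to that listener from the same process. *)
let client = Unix.socket Unix.PF_INET Unix.SOCK_STREAM 0
let () = Unix.connect client (Unix.getsockname listener)

(* accept returns a *new* descriptor: the connected (streaming) socket. *)
let conn, _client_addr = Unix.accept listener

(* Send four bytes from the client and read them on the accepted socket. *)
let received =
  let n = Unix.write_substring client "ping" 0 4 in
  let buf = Bytes.create n in
  let m = Unix.read conn buf 0 n in
  Bytes.sub_string buf 0 m

let () =
  print_endline received;
  List.iter Unix.close [ conn; client; listener ]
```

Note that listener, client, and conn all have the same type, Unix.file_descr. Nothing in the types tells you which role a descriptor plays.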
Okay, so what went wrong?
In one of my load tests, I could consistently reproduce the entire application just exiting. No error messages, no prints, no stack traces. It was running and then at some point, it just wasn't.
I can't emphasize enough how much I dug through the entire runtime, adding more logging, and more safety nets, just to see if I was doing something wrong. A good day of work was lost to this.
After exploring all the options I could think of, I asked around on the #multicore channel of OCaml Labs, and I got an answer from Stephen Dolan.
Turns out that:
- if you have a streaming socket
- and you write to it
- but the client has closed it
- your program will receive a Unix signal: SIGPIPE
- which if you didn't know about, and didn't specifically set to ignore, will TERMINATE YOUR PROGRAM.
No return value, no exception, nothing of the sort. You have this entirely out-of-band input to your program that even in an impure functional language like OCaml feels like a sucker punch.
Why does this happen? Let's see.
The Unix module
The Unix module is the default way to interact with your operating system in OCaml. You've probably used Lwt_unix before, or the -unix flavor of your favorite lib if you aren't using promises yet. All of those rely on the Unix module.
But really, this module is just super low-level bindings to syscalls.
Syscalls, or "system calls", are little bridges between the boundaries of User space and Kernel space in your operating system:
- Kernel space is where your operating system implements all sorts of things to make your computer run, like how to write to disks, or read from the network. If something is buggy here it will BRICK your computer.
- User space is where you and I write our buggy software. Buggy is cool here. It won't brick the computer (and it keeps us employed).
And our Unix module is full of bindings to syscalls like write(2), which lets User space programs actually write files by asking Kernel space code to do the writing. Neat, right?
The fact that we are using these syscalls isn't obvious, but as you can see in this snippet for Unix.write, we are making an external call to a function called caml_unix_write:
(* lowest-level binding, directly calling C code *)
external unsafe_write : file_descr -> bytes -> int -> int -> int
  = "caml_unix_write"

(* slightly higher-level binding that checks the offset and length fit the buffer *)
let write fd buf ofs len =
  if ofs < 0 || len < 0 || ofs > Bytes.length buf - len
  then invalid_arg "Unix.write"
  else unsafe_write fd buf ofs len
caml_unix_write will in turn call a C function called write, which comes from libc on Unix-like operating systems that follow the POSIX standard, and will call WriteFile from the Windows APIs when compiling on Windows.
write from libc, and WriteFile from the Windows APIs: those are the syscalls.
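To see both layers of that binding at work, here is a small sketch that uses a pipe as a stand-in for a file or socket. The pipe and the message are assumptions made for the example; Unix.write, Unix.read, and Unix.pipe are the real functions from the Unix module.

```ocaml
(* Build with the unix library linked. *)

(* A pipe gives us a pair of descriptors: one to read from, one to write to. *)
let read_end, write_end = Unix.pipe ()

let msg = Bytes.of_string "hello"

(* The happy path: the bounds check passes and the C stub calls write(2). *)
let written = Unix.write write_end msg 0 (Bytes.length msg)

(* Read the bytes back from the other end of the pipe. *)
let echoed =
  let buf = Bytes.create written in
  let n = Unix.read read_end buf 0 written in
  Bytes.sub_string buf 0 n

(* The OCaml-side check: offset 3 plus length 10 does not fit in 5 bytes,
   so this raises Invalid_argument before any C code runs. *)
let rejected =
  match Unix.write write_end msg 3 10 with
  | _ -> false
  | exception Invalid_argument _ -> true

let () =
  Unix.close read_end;
  Unix.close write_end
```

The only safety net on the OCaml side is that bounds check. Everything after it is the operating system's business.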
The important thing to know is that when you are using this module, many of the functions you will call there are not OCaml code. They are C code, and they reach into the depths of your operating system to do dangerous, wonderful, weird things.
On Unix systems, one of those is signals.
Unix has a way of interrupting a process with a mechanism called signals. A process in turn can tell Unix how it's going to react to those signals, by setting a signal handler.
It's essentially a configurable, OS-triggered callback.
Some of these signals are very common. For example, when you press Ctrl+C to exit a long-running program, you're really sending a SIGINT signal, also known as an interrupt signal.
You can of course override this, and you see many REPLs do it, so that if you accidentally press Ctrl+C you get a chance to confirm this and exit or return to the program.
Signals, however, are not a part of the Unix module. If we want to configure them (and their handlers) we need to use the Sys module.
In particular, we need to use Sys.set_signal. This function lets you set the behavior for a particular signal, which can be one of:
- Default – whatever POSIX decides is the default
- Ignore – just do nothing with it
- Set a custom handler – use this handler to do something that fits your program
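The three behaviors map onto the constructors of Sys.signal_behavior: Signal_default, Signal_ignore, and Signal_handle. The sketch below tries each one on SIGUSR1, which is an arbitrary choice for the demo (and not available on Windows). Sending the signal to our own process with Unix.kill is also just a way to trigger it without a second program.

```ocaml
(* Build with the unix library linked. *)

let handled = ref false

(* Custom handler: the callback receives the signal number. *)
let () =
  Sys.set_signal Sys.sigusr1 (Sys.Signal_handle (fun _signo -> handled := true));
  Unix.kill (Unix.getpid ()) Sys.sigusr1;
  (* OCaml runs handlers at safe points, so wait briefly until it has run. *)
  let tries = ref 0 in
  while (not !handled) && !tries < 100 do
    incr tries;
    Unix.sleepf 0.01
  done

(* Ignore: the signal is dropped, and the program keeps running. *)
let () =
  Sys.set_signal Sys.sigusr1 Sys.Signal_ignore;
  Unix.kill (Unix.getpid ()) Sys.sigusr1

(* Default: Sys.signal works like Sys.set_signal but also returns the
   behavior that was installed before, here the Signal_ignore from above. *)
let previous = Sys.signal Sys.sigusr1 Sys.Signal_default

let previous_was_ignore =
  match previous with
  | Sys.Signal_ignore -> true
  | Sys.Signal_default | Sys.Signal_handle _ -> false
```

With Signal_default restored, a further SIGUSR1 would terminate the process, which is the same default that bites you with SIGPIPE.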
Fixing The SIGPIPEs
The error described earlier is fixed with a single line of OCaml at the top of our program:
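The first line of the sketch below is one way to write that fix, assuming the goal is simply to ignore SIGPIPE rather than handle it. The rest is an illustration only: it uses a Unix socket pair in place of a real client and checks that the failed write now surfaces as a Unix.Unix_error exception carrying EPIPE.

```ocaml
(* Build with the unix library linked. *)

(* The fix: ignore SIGPIPE for the whole program. *)
let () = Sys.set_signal Sys.sigpipe Sys.Signal_ignore

(* Illustration: write to a stream socket whose peer has gone away. *)
let outcome =
  let ours, theirs = Unix.socketpair Unix.PF_UNIX Unix.SOCK_STREAM 0 in
  (* The "client" closes its end of the connection. *)
  Unix.close theirs;
  let result =
    match Unix.write_substring ours "hello" 0 5 with
    | _ -> "written"
    | exception Unix.Unix_error (Unix.EPIPE, _, _) -> "EPIPE"
  in
  Unix.close ours;
  result

let () = print_endline outcome
```

Without the first line, the same write would deliver SIGPIPE and the process would exit before reaching the match.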
But the knowledge required to put that line there isn't trivial.
You need to know that the Unix module is just a wrapper around OS syscalls. And in here you'll want to know exactly which one, which may involve digging through some of the OCaml C libraries.
You need to know where to find the right doc for that syscall (is it BSD, since macOS inherited a lot from it? That doesn't mention anything about SIGPIPEs, so maybe the Linux syscall manual is relevant here?).
And then you have to learn about signals, how to catch them, and how to use the Sys module to do that. Granted, this last part is the easiest since it's more actionable, but that second step?! Not as easy a leap to make.
This is most definitely not the kind of surprise you want to find when writing in a type-safe, high-level functional programming language like OCaml.
Hell, I think Python does this better by throwing an IOError instead. That would've saved me hours of self-doubt.
If you really need this level of control, you may find it useful to mentally frame it as writing garbage-collected C, and behave accordingly. And please shield your users from all the gory details.
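As one way to do that shielding, here is a sketch of a wrapper that turns a closed peer into an ordinary value. The write_error type and the write_all function are hypothetical names made up for this example, not part of the Unix module or any library, and the sketch assumes SIGPIPE is ignored for the whole program.

```ocaml
(* Build with the unix library linked. *)

(* Assumption: SIGPIPE is ignored, so a closed peer shows up as EPIPE. *)
let () = Sys.set_signal Sys.sigpipe Sys.Signal_ignore

(* Hypothetical error type for this sketch. *)
type write_error =
  | Closed                (* the peer closed the connection *)
  | Other of Unix.error   (* any other failure reported by the OS *)

(* Hypothetical helper: write the whole string, reporting failures as values. *)
let write_all fd s : (unit, write_error) result =
  let len = String.length s in
  let rec go ofs =
    if ofs >= len then Ok ()
    else
      match Unix.write_substring fd s ofs (len - ofs) with
      | n -> go (ofs + n)
      | exception Unix.Unix_error ((Unix.EPIPE | Unix.ECONNRESET), _, _) ->
          Error Closed
      | exception Unix.Unix_error (Unix.EINTR, _, _) -> go ofs
      | exception Unix.Unix_error (e, _, _) -> Error (Other e)
  in
  go 0

(* Usage: one write while the peer is open, one after it has closed. *)
let open_result, closed_result =
  let a, b = Unix.socketpair Unix.PF_UNIX Unix.SOCK_STREAM 0 in
  let while_open = write_all a "hello" in
  Unix.close b;
  let after_close = write_all a "hello" in
  Unix.close a;
  (while_open, after_close)
```

Callers now get a result they have to match on, so the type checker reminds them that the other side can disappear.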
Otherwise, stay happy and away from the Unix module and look for alternatives. Use Bos for your OS interactions, stick to a higher-level library for sockets, and if it comes to it, isolate that part of your system.
I hope this gotcha won't get you the next time you're writing network code, and if you have any stories like this one, I'd be happy to share them on Practical OCaml too.