Between two Sun workstations: decoding the X11 wire protocol

I have a basement full of Sun workstations, all with keyboards, graphics cards, and displays. But it would be nice to have a good X server running on my Mac. Yes, I can run XQuartz, but I’m never really happy with that solution. I want anti-aliased fonts, Retina awareness, rootless rendering, and a scalable display for resolution matching. Real Mac windows surrounding Sun pixels.

I still write a lot of software, even in retirement, but honestly I’d shelved this one. Building an X server is a real piece of work, and this project went on the “someday” list and stayed there.

But Claude Code changed the math. Over the last few months I’ve put it to work on a stack of projects I never would have attempted, and the productivity is amazing.

This morning I thought: could Claude and I actually build an X server for the Mac? I’m not interested in a completely modern one with all the extensions, but rather one that can render X11R5/R6 clients and actually look good and work well, using the Mac’s Core Graphics routines, which are really quite excellent.

The full answer is going to take a while, and the larger arc of this project isn’t ready to tell yet. But the first piece is already running, and that’s what this article is about.

Creating a man-in-the-middle attack, on purpose

Before you can write an X server, you have to know byte-for-byte what a real X client expects to talk to. The X.org protocol docs are solid, and I’ve read them carefully. It’s all binary frames of data, so not super easy to decode without writing even more software.

But docs only describe what ought to happen on the wire. Real clients and real servers have version quirks and sequencing that only show up when you watch them talk. Validating the docs against actual traffic is what closes the gap. A spec is not a conversation.

So today Claude and I wrote a passive proxy in Swift. Kind of like pcap or tcpdump, but just for the X wire protocol.

[Figure: Swift proxy capturing X11 protocol traffic between the SPARCstation 2 and the Ultra 5]

The two Sun boxes are my SPARCstation 2 running a color-capable xterm (xterm with the ANSI color extensions) and my Ultra 5 running Solaris 2.6 and CDE. The proxy runs on my Mac laptop, in between. The SS2 thinks it’s connecting to an X server at the Mac’s IP. The Mac accepts the connection, opens a second connection to the real X server on the Ultra 5, and moves packets (X protocol frames) in both directions, writing every one to a capture file as it goes, decoding the ones it understands, and passing through the ones it doesn’t.

From the SS2’s side, the Mac is the X server. From the U5’s side, the Mac is the client. Neither end knows the Mac is sitting in the middle. You know you’re dealing with old software when you can so easily do this.

[Figure: network diagram. The SPARCstation 2 (xterm client) connects to a Mac laptop running a Swift proxy, which forwards X11 traffic to the Ultra 5 (Xsun + dtwm server). The Mac writes every byte that passes through to a capture file. Caption: two TCP connections, one capture file, no idea on either end that the other isn't who it claims to be.]

This is, strictly speaking, a man-in-the-middle attack. It is also a perfectly good way to record protocol traffic when both ends belong to you.

The xterm invocation on the SS2 was:

[ss2:[tvernon]:/home2/tvernon] $ xterm \
       -bg black -fg green -fn 8x13 -display 192.168.7.126:0

192.168.7.126:0 is the Mac. The proxy listens on TCP port 6000 (X display :0), forwards to the U5’s X server on its own port 6000, and writes every byte that passes through to disk. A second Swift program decodes that capture file and prints one line per protocol message: timestamp from connection open, direction, message kind, key fields. → is client to server, ← is server to client.
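The forwarding half of a proxy like this is small. Here is a minimal sketch in Python, not the Swift code from this project. The names `pump` and `serve`, the capture record layout (direction byte, microseconds since open, chunk length, then the payload), and the `upstream_host` parameter are all assumptions made for illustration; ASCII `>` and `<` stand in for the → and ← arrows.

```python
import socket
import struct
import threading
import time

# Assumed capture record: direction byte, microseconds since open, chunk length.
RECORD_HEADER = struct.Struct(">cQI")
_capture_lock = threading.Lock()

def pump(src, dst, capture, direction, t0):
    """Forward bytes from src to dst until EOF, logging every chunk to capture."""
    while True:
        chunk = src.recv(65536)
        if not chunk:                      # the sender closed its side
            break
        dst.sendall(chunk)                 # pass the bytes through untouched
        elapsed_us = int((time.monotonic() - t0) * 1_000_000)
        with _capture_lock:                # both directions share one file
            capture.write(RECORD_HEADER.pack(direction, elapsed_us, len(chunk)))
            capture.write(chunk)
    try:
        dst.shutdown(socket.SHUT_WR)       # propagate the EOF to the other end
    except OSError:
        pass

def serve(upstream_host, capture_path, port=6000):
    """Accept one X client on TCP 6000 (display :0) and relay it upstream."""
    with socket.create_server(("", port)) as listener:
        client, _ = listener.accept()
    server = socket.create_connection((upstream_host, port))
    t0 = time.monotonic()
    with open(capture_path, "wb") as capture:
        replies = threading.Thread(
            target=pump, args=(server, client, capture, b"<", t0))
        replies.start()
        pump(client, server, capture, b">", t0)   # requests, client to server
        replies.join()
    client.close()
    server.close()
```

Calling `serve` with the real X server’s address and a capture filename would relay a single session. Note that this sketch logs raw TCP chunks, not decoded messages; splitting chunks into protocol messages is a separate step.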

What follows is the first 980 milliseconds of one xterm session, captured exactly that way.

Handshake

  0.000ms → SetupRequest   msbFirst proto=11.0 auth=(none)
  5.118ms ← SetupAccepted  "Sun Microsystems, Inc." rel=3600
                           1280x1024 depth=8

The Sun says hello in big-endian byte order. The X wire protocol can be either big-endian or little-endian; it’s decided by the client. SPARC is natively MSB-first, so xterm announces itself that way and the server accepts. The vendor string in the reply is “Sun Microsystems, Inc.” Release 3600 corresponds to OpenWindows 3.6, the Xsun build that shipped with Solaris 2.6. Five milliseconds to negotiate over the LAN.

The screen on my Ultra 5 is set to 1280×1024 at 8-bit depth, so that’s what the X server presents to the client.

X11 does its own access control at connection setup: the client typically presents a shared-secret cookie that the server has to recognize before any real traffic flows. auth=(none) in the SetupRequest means the client sent no authorization data, so there was no MIT-MAGIC-COOKIE-1 check on this connection. I had xhost + set on the server for this capture, and the Mac proxy didn’t care about security, so the LAN was the trust boundary.
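The fields in that SetupRequest line all come from the fixed 12-byte header that opens every X11 connection: a byte-order byte (0x42 "B" for MSB-first, 0x6C "l" for LSB-first), the protocol major and minor versions, and the lengths of the authorization name and data. A minimal Python sketch of decoding it follows. The function name and the returned dictionary are illustrative assumptions, not the project’s Swift decoder.

```python
import struct

def parse_setup_request(data):
    """Decode the fixed header of an X11 connection setup request."""
    order = data[0:1]
    if order == b"B":        # 0x42: most significant byte first
        endian, label = ">", "msbFirst"
    elif order == b"l":      # 0x6C: least significant byte first
        endian, label = "<", "lsbFirst"
    else:
        raise ValueError("not an X11 setup request")
    # Byte 1 is unused; four CARD16 fields follow, then two unused bytes.
    major, minor, name_len, data_len = struct.unpack_from(endian + "HHHH", data, 2)
    auth_name = data[12:12 + name_len].decode("latin-1")
    return {
        "byte_order": label,
        "proto": f"{major}.{minor}",
        "auth": auth_name or "(none)",
        "auth_data_len": data_len,
    }
```

With both length fields zero, as in this capture, the request is exactly 12 bytes and the decoder reports auth as (none).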

Cursors and colors

xterm spends the next 200ms doing housekeeping. It reads the user’s resource defaults, allocates foreground and background colors, opens the X “cursor” font, and then carves seven glyph cursors out of it:

  156.894ms → OpenFont          fid=0x4400001 name="cursor"
  156.894ms → CreateGlyphCursor cid=0x4400002 src=0x4400001 ch=152
  156.894ms → CreateGlyphCursor cid=0x4400003 src=0x4400001 ch=116
  156.894ms → CreateGlyphCursor cid=0x4400004 src=0x4400001 ch=108
  156.894ms → CreateGlyphCursor cid=0x4400005 src=0x4400001 ch=114
  156.894ms → CreateGlyphCursor cid=0x4400006 src=0x4400001 ch=106
  156.894ms → CreateGlyphCursor cid=0x4400007 src=0x4400001 ch=110
  156.894ms → CreateGlyphCursor cid=0x4400008 src=0x4400001 ch=112

Char 152 is the I-beam (XC_xterm in the cursor font), the cursor xterm shows inside the terminal area. The other six are all scrollbar cursors: down-arrow (106), horizontal double-arrow (108), left-arrow (110), right-arrow (112), up-arrow (114), and vertical double-arrow (116). Every cursor xterm could possibly display through its lifetime is allocated up front, before the window is even visible. That’s a performance trick: pay the allocation cost during startup so you never pay it during interactive use.
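A CreateGlyphCursor request is a fixed 32 bytes: opcode 94, a length of 8 four-byte units, the new cursor ID, the source and mask fonts, the source and mask glyph indices, and two RGB triples. The sketch below, in Python, decodes one and looks the glyph up by its name in X11’s cursorfont.h. The function name and return shape are assumptions for illustration.

```python
import struct

# Glyph indices as defined in <X11/cursorfont.h>.
CURSOR_GLYPHS = {
    106: "XC_sb_down_arrow",
    108: "XC_sb_h_double_arrow",
    110: "XC_sb_left_arrow",
    112: "XC_sb_right_arrow",
    114: "XC_sb_up_arrow",
    116: "XC_sb_v_double_arrow",
    152: "XC_xterm",
}

def decode_create_glyph_cursor(req, endian=">"):
    """Decode the ID and glyph fields of a CreateGlyphCursor request."""
    opcode, _, length, cid, src_font, mask_font, src_char, mask_char = \
        struct.unpack_from(endian + "BBHIIIHH", req, 0)
    if opcode != 94 or length != 8:
        raise ValueError("not a CreateGlyphCursor request")
    return {
        "cid": cid,
        "src": src_font,
        "ch": src_char,
        "mask_ch": mask_char,
        "name": CURSOR_GLYPHS.get(src_char, "unknown"),
    }
```

In the standard cursor font the mask for each shape sits at the next glyph index, so a request for glyph 152 would normally carry mask glyph 153.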

A window is born

Then xterm builds its top-level window:

  203.247ms → CreateWindow    wid=0x440000D parent=0x28
                              1x1 at (0,0) inputOutput
  203.247ms → ChangeProperty  WM_NAME = "xterm"
  203.247ms → ChangeProperty  WM_ICON_NAME = "xterm"
  203.247ms → ChangeProperty  WM_COMMAND =
              "xterm -bg black -fg green -fn 8x13
               -display 192.168.7.126:0"
  203.247ms → ChangeProperty  WM_CLIENT_MACHINE = "ss2"
  203.247ms → ChangeProperty  WM_NORMAL_HINTS  (72 bytes)
  203.247ms → ChangeProperty  WM_HINTS         (36 bytes)
  203.247ms → ChangeProperty  WM_CLASS = "xterm/XTerm"
  203.247ms → OpenFont        fid=0x440000E name="8x13"
  203.247ms → QueryFont       font=0x440000E
  231.239ms ← QueryFontReply  ascent=11 descent=2
                              chars=256 properties=21

The window is created at 1×1 pixels as a placeholder. xterm tells the window manager its preferred size through WM_NORMAL_HINTS rather than baking it into CreateWindow, and lets the WM finalize geometry from there. Then the seven ICCCM (Inter-Client Communication Conventions Manual) properties go up: title, icon-name, the literal command line, the host ("ss2" is my SPARCstation 2), size hints, input hints, class. This is the convention every X window manager since the late-80s ICCCM drafts has looked at to populate its title bar and decorations.
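Each of those property lines is one ChangeProperty request (opcode 18): a 24-byte header carrying the window, the property atom, the type atom, the format (8, 16, or 32 bits per item), and the item count, followed by the data. All seven property names are predefined atoms, so a decoder can name them without ever seeing an InternAtom. Here is a minimal Python sketch; the function name and the output strings (joining NUL-separated strings with "/", printing a byte count for non-string data) are assumptions that only imitate the dump format above.

```python
import struct

# A few predefined atoms, numbered as in <X11/Xatom.h>.
PREDEFINED_ATOMS = {
    31: "STRING", 34: "WM_COMMAND", 35: "WM_HINTS", 36: "WM_CLIENT_MACHINE",
    37: "WM_ICON_NAME", 39: "WM_NAME", 40: "WM_NORMAL_HINTS",
    41: "WM_SIZE_HINTS", 67: "WM_CLASS",
}

def decode_change_property(req, endian=">"):
    """Return a one-line description of a ChangeProperty request."""
    opcode, mode, length, window, prop, type_, fmt, count = \
        struct.unpack_from(endian + "BBHIIIB3xI", req, 0)
    if opcode != 18:
        raise ValueError("not a ChangeProperty request")
    nbytes = count * fmt // 8              # count is in units of the format
    value = req[24:24 + nbytes]
    name = PREDEFINED_ATOMS.get(prop, f"atom#{prop}")
    if PREDEFINED_ATOMS.get(type_) == "STRING":
        parts = value.split(b"\0")
        if parts and parts[-1] == b"":     # drop the trailing terminator
            parts.pop()
        text = "/".join(p.decode("latin-1") for p in parts)
        return f'{name} = "{text}"'
    return f"{name} ({nbytes} bytes)"
```

WM_CLASS is two NUL-terminated strings back to back (instance, then class), which is why it shows up as a pair. The 72 bytes of WM_NORMAL_HINTS are 18 items of format 32.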

The 8x13 font is a classic monospace X bitmap font. The QueryFont reply comes back 28ms later: ascent=11, descent=2, chars=256, properties=21. The 21 properties are XLFD metadata: POINT_SIZE=120 (in tenths of a point, so 12-point), WEIGHT=10 (medium), RESOLUTION_X=75 and RESOLUTION_Y=75 (75dpi, the standard X resolution of the era). Several of those property names resolve to predefined atoms whose numeric IDs were assigned in X11 itself and have not changed since.

The window actually appears

xterm creates an inner window for the actual text rendering, grabs the three mouse buttons twice each (I read somewhere that this avoids accidental clicks at startup), and asks the server to map everything:

  497.466ms → MapWindow       window=0x440000D
  497.466ms → ImageText8      drawable=0x4400011
                              gc=0x440000F at (2,13) " "
  497.466ms → PolyLine        drawable=0x4400011
                              gc=0x440000F points=5
  509.177ms ← ConfigureNotify window=0x440000D
                              644x316 at (0,0)
  509.177ms ← ReparentNotify  window=0x440000D
                              parent=0x3800106
  509.177ms ← ConfigureNotify [SendEvent]
                              644x316 at (225,225)
  514.764ms ← MapNotify       window=0x440000D
  514.764ms ← Expose          window=0x4400011
                              (0,0) 644x316
  515.728ms ← FocusIn         window=0x440000D
                              detail=nonlinear

Three things happen in that block.

MapWindow makes the top-level visible. The two requests right after it draw a placeholder character and a five-point PolyLine. The PolyLine is xterm’s text cursor outline being stamped into the inner window, so the cursor is on screen the instant the window itself appears.

Then ReparentNotify shows CDE’s dtwm inserting xterm into a frame window. From the window manager’s perspective xterm is no longer the top-level; it is now a child of 0x3800106, which is dtwm’s title-bar frame.

The first ConfigureNotify, at offset (0,0), is the real geometry event the server sent. The second is synthesized by the window manager and re-sent, with absolute screen coordinates (225,225) instead of position-within-parent. That is the ICCCM-mandated synthetic ConfigureNotify, specced in 1989 so X clients can know where they sit on the root window without traversing the parent chain themselves. Every X window manager still does it. I am watching a SPARCstation perform a 1989 handshake, with the bytes decoded in 2026 by Swift code that didn’t exist the night before.
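On the wire the two ConfigureNotify events differ in one bit. Every X event is 32 bytes, the low seven bits of the first byte are the event code (22 for ConfigureNotify), and the high bit is set when the event was generated by a SendEvent request. A minimal Python sketch of that decode follows; the function name and dictionary keys are illustrative assumptions.

```python
import struct

def decode_configure_notify(event, endian=">"):
    """Decode a 32-byte ConfigureNotify event, flagging synthetic ones."""
    (code, sequence, event_window, window, above_sibling,
     x, y, width, height, border_width, override_redirect) = \
        struct.unpack(endian + "BxHIIIhhHHHB5x", event)
    if code & 0x7F != 22:
        raise ValueError("not a ConfigureNotify event")
    return {
        "synthetic": bool(code & 0x80),    # set by SendEvent
        "window": window,
        "x": x, "y": y,
        "width": width, "height": height,
    }
```

A server-generated event reports x and y relative to the parent; the window manager’s synthetic copy carries root-relative coordinates, which is the only way to tell what the numbers mean.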

First prompt

  980.222ms → ImageText8 drawable=0x4400011 gc=0x440000F
                         at (2,13)
                         "[ss2:[tvernon]:/home2/tvernon] "

Half a second after the window appeared, xterm finally drew the shell’s prompt. ImageText8 sent the literal prompt string into the inner window at pixel (2,13) using graphics context 0x440000F. That was the first thing on screen the user could read.
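ImageText8 (opcode 76) is about as compact as a text-drawing request can be: a 16-byte header holding the string length, drawable, graphics context, and position, then the string itself padded to a multiple of four bytes. Here is a round-trip sketch in Python; both function names are assumptions for illustration.

```python
import struct

def encode_image_text8(drawable, gc, x, y, text, endian=">"):
    """Build an ImageText8 request for a string of at most 255 bytes."""
    data = text.encode("latin-1")
    if len(data) > 255:
        raise ValueError("ImageText8 carries at most 255 bytes")
    pad = -len(data) % 4                       # pad to a 4-byte boundary
    length = 4 + (len(data) + pad) // 4        # request length in 4-byte units
    header = struct.pack(endian + "BBHIIhh", 76, len(data), length,
                         drawable, gc, x, y)
    return header + data + bytes(pad)

def decode_image_text8(req, endian=">"):
    """Return (drawable, gc, x, y, text) from an ImageText8 request."""
    opcode, n, length, drawable, gc, x, y = \
        struct.unpack_from(endian + "BBHIIhh", req, 0)
    if opcode != 76:
        raise ValueError("not an ImageText8 request")
    return drawable, gc, x, y, req[16:16 + n].decode("latin-1")
```

The 31-character prompt above needs one pad byte, so the whole request that put it on screen is 48 bytes.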

From the moment the TCP connection opened to the moment [ss2:[tvernon]:/home2/tvernon] was visible on the Sun’s screen: 980ms. 62 requests went out, 7 events came back, and the whole conversation in both directions fit in about 15 KB.

The X protocol is brutally efficient this way. Handshake, font metadata, every cursor, every property, every event: all of it fit in those 15 KB. A single uncompressed 1280×1024 screenshot of the Ultra 5’s screen at 8-bit depth would be 1.3 MB on its own, roughly 85× larger than the entire conversation that drew the window. X sends drawing commands, not pixels. That’s the design Bob Scheifler and his team at MIT shipped in 1987.
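The arithmetic behind that comparison is easy to check. The Python sketch below reads "about 15 KB" as 15 × 1024 bytes, which is an assumption; the helper name is illustrative.

```python
def raw_image_bytes(width, height, depth_bits):
    """Size of an uncompressed image at the given pixel depth."""
    return width * height * depth_bits // 8

conversation = 15 * 1024                       # "about 15 KB", read as KiB
full_screen = raw_image_bytes(1280, 1024, 8)   # the Ultra 5's whole screen
xterm_window = raw_image_bytes(644, 316, 8)    # the size from ConfigureNotify

print(full_screen, round(full_screen / conversation))
print(xterm_window, round(xterm_window / conversation))
```

The full screen comes to 1,310,720 bytes, about 85 times the conversation (about 87 times if 15 KB is read as 15,000 bytes). Even the 644×316 xterm window by itself would be roughly 200 KB of raw pixels, about 13 times the traffic that drew it.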

Testing more apps

I went on to capture sessions from several of the standard MIT X11 apps: xeyes, xclock, xcalc, xlogo, xload, and xbiff. Each capture filled in more holes in the protocol coverage.

The final app I tried was one I wrote myself, back in 1994 (I think), on SunOS 4.1.4 running the Motif window manager. I had bought the Xm libraries and mwm from one of the third-party Motif vendors of the era; the name escapes me. Pre-CDE, that was the only way to do Motif programming on a Sun. (Disk images and setup notes for that stack live at “SunOS 4.1.4 disk images” and “Getting SunOS 4.1.4 working”, if you want to follow along.)

[Figure: QuickPlot, a 2D time-history plotting application written in Motif on SunOS 4.1.4 in 1994. Used at NASA for two decades after I left.]

The program is QuickPlot, a 2D time-history plotting application that ended up being used at NASA for at least two decades after I left. NASA still has the source if you ask. It leans heavily on Motif and does a lot of the things X clients do once they get serious: grabbing the pointer for rubber-band zoom selections, juggling keyboard modifier state, manipulating the colormap, and so on.

Adding QuickPlot to the captures pushed the combined coverage to roughly 70% of the core X11 request opcodes, a pretty healthy number for one morning’s work. The rest are the niche corners of the protocol: keyboard- and pointer-remapping (SetModifierMapping, SetPointerMapping), PseudoColor colormap manipulation (AllocColorPlanes, StoreColors), motion-event history queries (GetMotionEvents), and a handful of oddballs like NoOperation, which is literally a padding opcode.
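A coverage figure like that can be computed from the set of request opcodes seen across all captures. The core protocol defines 120 requests: opcodes 1 through 119, plus 127 for NoOperation, with 120 through 126 unused. A minimal Python sketch, with a hypothetical `coverage` helper:

```python
# The 120 core request opcodes: 1..119, plus 127 (NoOperation).
CORE_OPCODES = frozenset(range(1, 120)) | {127}

def coverage(seen_opcodes):
    """Return (fraction of core opcodes seen, sorted list of missing ones)."""
    seen_core = CORE_OPCODES & set(seen_opcodes)
    missing = sorted(CORE_OPCODES - seen_core)
    return len(seen_core) / len(CORE_OPCODES), missing

# The requests named above, by opcode.
NAMED_GAPS = {
    39: "GetMotionEvents", 87: "AllocColorPlanes", 89: "StoreColors",
    116: "SetPointerMapping", 118: "SetModifierMapping", 127: "NoOperation",
}
```

Feeding it the first byte of every captured request gives both the percentage and a to-do list of opcodes that still need a real-world sample.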

What I’m doing with this

The capture tool that produced this dump is the first piece of a Swift X server I’m building so my Suns can render properly on my Mac. The next piece is the server itself. The framer that decoded these messages goes into that server unchanged. Every byte of every message in this dump becomes a test fixture, alongside the other captures I took this morning: xeyes, xclock, xcalc, and a Motif graphing app I wrote 30+ years ago. The server will eventually have to consume and produce these exact byte sequences in response to these exact clients. Real captures from real Suns are the hardest thing to argue with.
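The framing step is what lets one piece of code serve both the capture decoder and a server. After the connection setup exchange, every core request starts with the same 4-byte header: opcode, one data byte, and a 16-bit length counted in 4-byte units. A burst of requests that arrives in a single TCP read can therefore be split without understanding any of them. Here is a minimal Python sketch, not the project’s Swift framer; the function name is an assumption, and the BIG-REQUESTS extension is deliberately left out.

```python
import struct

def split_requests(buf, endian=">"):
    """Split buf into complete requests; return (requests, leftover bytes)."""
    requests, offset = [], 0
    while len(buf) - offset >= 4:
        # Bytes 2-3 of every request hold its length in 4-byte units.
        units = struct.unpack_from(endian + "H", buf, offset + 2)[0]
        if units == 0:
            raise ValueError("zero length (BIG-REQUESTS is not handled here)")
        size = units * 4
        if len(buf) - offset < size:       # request not fully received yet
            break
        requests.append(buf[offset:offset + size])
        offset += size
    return requests, buf[offset:]
```

The leftover bytes are carried into the next read, which is how a framer copes with a request that straddles two TCP segments.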

There’s a larger arc to this project that I’ll write up when it’s ready. For now, this is what one morning’s worth of curiosity looks like on the wire.

If you have a SPARCstation in a closet and you want to know what its xterm does in its first 980 milliseconds, the answer is: this.