Play Test Recap: Rolling a One.

Sometimes to take one step forward you first take three steps back.

Every other week or so we get the whole team together to play test. One of us DMs a one-shot style session within Tabyltop. Jonathan was DMing this one, his first time in that seat, and had his materials prepared. Based on our previous play test sessions, we were all looking forward to diving in and playing a few hours of D&D, with periodic breaks to discuss features, record bugs, or brainstorm next steps. Our play test two weeks prior went so smoothly that for a while there we forgot we were working and were just playing D&D.

The software sprint leading up to this session was very active. I was working pretty deep in our videoconferencing code, addressing several issues around reconnecting after a refresh. I also added a WebRTC data channel alongside the audio and video channels to carry information about features extracted from the video stream. Initially this includes face location and orientation, which will be tied into a virtual-camera pan and zoom feature in the near future. The point is that this was pretty invasive surgery on a very complex and critical portion of our codebase.
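
For context, the data channel piece looks roughly like the sketch below. It's a simplified, hypothetical version (the channel label, message shape, and the detectFace and updateVirtualCamera helpers are all stand-ins, not our actual code), but it shows the idea: a lightweight RTCDataChannel rides alongside the audio and video tracks and carries per-frame face metadata as JSON.

    // Hypothetical sketch: attach a data channel for face metadata to an
    // existing RTCPeerConnection. Names like "face-data", detectFace(), and
    // updateVirtualCamera() are illustrative stand-ins.
    const faceChannel = pc.createDataChannel("face-data", { ordered: false });

    // Sender side: push the latest face location/orientation for each frame.
    function sendFaceUpdate(frame) {
        const face = detectFace(frame); // e.g. { x, y, width, height, yaw, pitch }
        if (face && faceChannel.readyState === "open") {
            faceChannel.send(JSON.stringify(face));
        }
    }

    // Receiver side: feed the metadata into the (future) virtual-camera logic.
    pc.ondatachannel = (event) => {
        event.channel.onmessage = (msg) => {
            const face = JSON.parse(msg.data);
            updateVirtualCamera(face); // pan/zoom toward the detected face
        };
    };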

Two issues surfaced almost as soon as the five of us were logged in. Video was working well, but some of us couldn’t hear each other. For example, I could hear Ben but he couldn’t hear me, even though Brian could hear both of us. We went through a ritual of hitting refresh in our browsers one at a time until everyone could hear everyone else. Hoping it was something rare we could work around, we carried on.

Not a Fluke

Shortly thereafter, Tom’s video started lagging, first by a few hundred milliseconds and then by several seconds. At first I was skeptical this was a real issue, as Tom likes to stress-test our system by using Cold War-era AV equipment and an internet connection he managed to bounce off a Russian weather satellite. He logged out and logged back in and things got better, but then Brian started having the same issue. <sigh>

Ben and I logged into the server to start looking at logs. When I tried to copy one of the logs for later review, I got an unexpected “No space left on device” error. <sigh>

It turned out we had debug recordings enabled, and over the weekend I had accidentally left a session running for something like 36 hours after my kids pulled me away from the computer. Oops. A quick check confirmed the server's disk was 100% full and its logs were littered with filesystem-related errors. Easy fix: I deleted the 12 GB video file from the weekend and we were back in business. Or so we thought.

Everyone logged back in and for a brief glorious moment everything was perfect. Then someone’s video flickered. And their sound disappeared. And then Tom logged out and I was seeing double, with Ben appearing in two windows. WTF.

Time to dive into the console logs. I had left all of the WebRTC debugging set to its most verbose level in case something like this came up. There was plenty of detail there, but it was overwhelming, with dozens of messages spewing into the log every second. This was going to take some digging.
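
For reference, that verbosity comes from how the Janus JavaScript library gets initialized. The snippet below is a minimal sketch rather than our actual init call, but janus.js does take a debug option at init time:

    // janus.js log verbosity is set at init time; we had it cranked all the
    // way up. (Minimal sketch; our real init call has more going on.)
    Janus.init({
        debug: "all", // or an array of levels, e.g. ["warn", "error"], for quieter logs
        callback: function() {
            // safe to create Janus sessions once init completes
        }
    });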

Time for a New Strategy

At this point, maybe 20 minutes into our scheduled playtest time, we pivoted to a full-fledged troubleshooting session. The group captured descriptions of what they were seeing, along with the corresponding console logs, and sent them to me for review. Ben added some debug code on the fly so that we could observe the media stream assignments and compare them against the server’s expectations. We spent another ten or fifteen minutes collecting this data before I decided to call it; the most productive path forward was for me to focus solo on the logs and the code.
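
I don’t have Ben’s exact snippet, but the gist of that debug code was something like the sketch below: dump what each peer connection thinks it is receiving so we could line it up against the stream assignments the server expected. The function name and how it gets called are assumptions on my part.

    // Rough sketch of the on-the-fly debug code: log every track a given
    // RTCPeerConnection is currently receiving, keyed by transceiver mid,
    // so the output can be compared against the server's expected mapping.
    function logStreamAssignments(label, pc) {
        const assignments = pc.getTransceivers().map((t) => ({
            mid: t.mid,
            kind: t.receiver.track ? t.receiver.track.kind : "none",
            trackId: t.receiver.track ? t.receiver.track.id : null,
            muted: t.receiver.track ? t.receiver.track.muted : null
        }));
        console.table(assignments);
        console.log(label, "total receivers:", pc.getReceivers().length);
    }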

The first 30 minutes diving in were not terribly fruitful. WebRTC is a complex beast, and with 3 streams per participant, 5 participants, and almost constant refreshing or media changes, it is very time-consuming to correlate the various logs and piece together what happened. I did discover that most of the clients ended up subscribed to dozens of media streams, several times what was expected. I decided to dig back into the code with this in mind and see what I could find.

This is when I came upon this little gem:


    } else if (msg["streams"]) {
        Janus.warn("Received streams event -- Not currently supported, may cause issues");
        // var streams = msg["streams"];
        // for(var i in streams) {
        //     var stream = streams[i];
        ...

I checked the console logs, and sure enough, that warning was sprinkled throughout, just as things started to go to hell for each user.

It all makes sense now…

So here’s what happened: a number of weeks ago we did some major refactoring of the front-end client code. We did a ton of testing on it, but primarily with only two clients, and in that testing the code above never appeared to be hit. Rather than take the time to refactor that block along with the rest, we assumed it was unnecessary for our application and commented it out for the time being.

Last week, while making my changes, I came across this block, thought it might be an issue down the road, and added the warning. I had no idea it would come back to bite us so soon. It turns out these “streams” messages from the server become really important once you have more than two clients, especially when they refresh, and the problem seems to be exacerbated by the additional data streams.

Easy enough: I uncommented the code, adapted it to our refactored client code, ran a quick test locally, and deployed to the server. Jonathan was still up to help test, and everything worked perfectly. Phew.
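
For the curious, the shape of the fix, heavily simplified, is below. The field names on each stream entry and the remapStreamToFeed helper are stand-ins for our actual client code; the essential change is simply to stop ignoring the event, walk the streams array, and update which remote feed each incoming stream belongs to.

    } else if (msg["streams"]) {
        // The server is telling us how its media streams now map to feeds.
        // Walk the list and update our local bookkeeping instead of warning
        // and dropping it. Field names and remapStreamToFeed() below are
        // illustrative stand-ins for our actual client code.
        var streams = msg["streams"];
        for (var i in streams) {
            var stream = streams[i];
            remapStreamToFeed(stream["mid"], stream);
        }
    }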

So, What Have We Learned?

Moral of the story? Well, we did some things well:

Here’s where we can improve next time:

So, not much of a play test, but a bit of a learning experience, and I think we’ll be set up great for our next session. I expect the next recap will be far less developer-focused and will highlight some of the great features we’ve been working on!