The Bug Fix Chronicles: Marge’s Fight
Good morning, CVW-20. If you’ve had the time to read up on some of our recent blog posts on cvw20.net, and also on our Discord, you’re well aware that DCS 2.9 has been throwing us for a loop. The good news is that we can now call some of our previous theories into question. While 2.9 has been heavily affecting many other servers, it isn’t what brought ours down. Ours has been down all the same, though, and unfortunately, we now know the reason for that.
We’re going to break this into 3 sections. We’ll cover what went wrong, how it could have been avoided, and the steps we’re taking to prevent this from happening again.
During our recent troubleshooting session, we came across the following error:
00001.417 STATUS: Can't run C:\Program Files\Eagle Dynamics\DCS World OpenBeta Server\bin-mt/DCS.exe: (2) The system cannot find the file specified.
00008.048 === Log closed.
which was odd, to say the least. As you can see, it looks as if the DCS server executable is no longer in the files. Well, that was an obvious falsehood. The executable had to be there; it’s just that when we tried to launch the exe directly, we got the same error. Well, maybe this was something on our end.
Now, a fun fact about Marge: her storage is broken up into 2 main RAID 1 arrays, the “C” or boot array and the “Storage Drive” array. DCS and ARMA 3, as well as our many media transcoders, physically run on the C array, but most of their files, like mission files, mods, and liveries, are kept on the Storage Drive array. The Storage Drive array consists of 4 drives, while the C array contains only 2. A mild trigger warning for any IT geeks out there, because this next part is gonna hurt. Through CHKDSK and other utilities, we were quickly able to determine that 2 drives in the Storage Drive array had failed entirely, and one in the C array was reporting serious errors. Furthermore, while a RAID 1 array should be able to continue functioning and simply rebuild the data, ours wasn’t. We initially attempted a drop-in replacement for the drives, with the plan of restoring the data from our off-site backups, but that has now also failed.
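For anyone who hits a similar “cannot find the file specified” message and wants to tell a truly missing file apart from one sitting on a dying disk, here’s a rough sketch of the kind of check involved. To be clear, this is not the tool we used (that was CHKDSK and friends); the `probe_file` function and its return labels are made up for illustration, and the path is simply copied from the log above.

```python
import os
import tempfile


def probe_file(path, chunk_size=1024 * 1024):
    """Return 'missing', 'unreadable', or 'ok' for the file at `path`.

    Hypothetical helper for illustration only. Note that os.path.isfile
    also returns False when the OS cannot stat the path at all, so
    'missing' here means "the OS says it is not there", which is exactly
    the ambiguity a failing drive can cause.
    """
    if not os.path.isfile(path):
        return "missing"
    try:
        # Read the whole file in chunks; bad sectors usually surface
        # as an OSError partway through the read.
        with open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except OSError:
        return "unreadable"
    return "ok"


if __name__ == "__main__":
    # Path taken from the log above; on any other machine this will
    # most likely just report 'missing'.
    exe = r"C:\Program Files\Eagle Dynamics\DCS World OpenBeta Server\bin-mt/DCS.exe"
    print(exe, "->", probe_file(exe))

    # Sanity check against a file we know is healthy.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
    print(tmp.name, "->", probe_file(tmp.name))
    os.remove(tmp.name)
```

A clean result from something like this only says the file could be read once. It says nothing about the health of the array underneath, which is why the disk utilities were the real source of truth for us.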
I’ll also take a second to mention that we (and when I say “we” I mean “I”) have royally screwed up when it came to getting this information out. I didn’t want anyone in the dark, so I put out the information I had at the time, which was more often than not incorrect. Even now, I’m gonna attach the disclaimer of “this is just what I know right now”. While initial theories included that the drives may have failed due to a physical shock, or an unreasonably high amount of moisture, we’ve now changed our position to suggest that repeated power outages may have been to blame, especially since our bad C array drive looks to be suffering from corruption.
Now, some of you who have been around for a while remember that Marge was never my first choice for a server setup. Her design was compromised by financial limitations and the parts we could get our hands on at the time. I was initially in favor of a large multi-bay unit that we were referring to as “The Monster” or “MAE” in honor of the server management tool we used with some frequency. Unfortunately, I was in an apartment at the time, and MAE just was not coming together in a way that made sense, so Marge was proposed as a way to use spare parts to get anything up and running while a more permanent solution could be found. Ultimately though, Marge continued to knock everything out of the park, so we didn’t feel a need to upgrade quite yet, although there were plans on the drawing board. Marge 2 would have featured a number of improvements, including long-term battery backups, ground-up server software (not jury-rigged Windows), and the ability to drop in upgrades as our needs changed. Marge 2 absolutely could not be funded though. We’re talking about a 2.5k USD server that I couldn’t afford and that our revenue streams at the time couldn’t subsidize.
With this most recent incident though, we’re aware that Marge is ready to take a break. She’s worked her CPU Cooler to the point where the copper heat pipes are working. So the following steps are underway.
I’m heading to Microcenter and replacing our bad drives. Although 1 to 1 replacements aren’t possible, we’re working with what we can.
The slow and painful process of rebuilding our data is going to start over the coming weeks. Thanks to some of our partners, there’s a massive chunk of our data stored in off-site backups that we can use. While we still lost a disastrous amount of data, we’ll be able to mitigate any long-term consequences.
Before rebooting Marge, we’re going to upgrade her to Marge 1.5. We’ll move to our dedicated server software and drop-in upgrade system, but ditch some of the upgrades that aren’t cost-effective, like current-gen parts. Marge will still need to be upgraded in the near future, but this will keep her running smoothly until that point and should ultimately reduce our barrier to 2.0 status.
Lastly, I want to take a second to thank you guys. The fact is, without your support this could have been so much worse than it was. If we hadn’t been able to build Marge in the first place and this had happened on my personal PC, I could have been unable to work for a while, which would have been devastating to me. Thankfully, my most recent failure only involved my boot drive (mostly failing), and I could use Marge to recover. Also, if you guys hadn’t been supporting us with your CVW-20 merch purchases, there would have been absolutely no way that I could just run out the next day and pick up $250 worth of hard drives. Like I’ve always said, that war chest exists BECAUSE of you. Even when word of Marge’s memory loss first started to spread, I got practically smothered with words of support, encouragement, and even a couple of you telling your own stories of times when your own systems suffered similar failures. Special shoutout to those, by the way, because we actually used some of them in troubleshooting.
I had planned on writing this whole sappy bit about how far CVW-20 has come from its early days because of your support, but to cut to the chase: I know we hit one hell of a speed bump, and a lot of us are moving through a bit of a rough patch right now. But we’re all gonna land on our feet together, and get right back at it again.
So basically, thank you, CVW-20.
Sincerely,
Spaceman