I Cooked a SYSRES!

But before we all start panicking, let me preface this by saying it was on my employer's z/D&T (Development and Test) environment, not a client system. No mainframes were hurt (much) in the making of this blog post.

So what happened?

TLDR:

It turns out the SYS1.NUCLEUS dataset can't spill into secondary extents. It has to sit in a single extent because (as I understand it) it needs to be loaded into contiguous memory at IPL time. This is probably common knowledge for experienced mainframers, but for baby mainframers like me this is completely new information (hence the post about it, I must share this exciting news).

So our system issued the following message upon re-IPL:

AWSEMI307I Warning! Disabled Wait CPU 0 = 00020000 00000000 00000000 00000081

And if we go to the IBM z/OS 3.1 documentation and look up the disabled wait codes, it tells us exactly that: “Initial program load (IPL) tried to load a module from the SYS1.NUCLEUS data set. The SYS1.NUCLEUS data set or an IEANUC0x or IEAVEDAT member occupies more than one extent.” [1]
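For anyone who wants to pull the wait state code out of that message programmatically, here is a minimal Python sketch. It assumes the code is the low-order 12 bits of the last PSW word, which matches the 081 in the message above. The function name `wait_state_code` is made up for this post.

```python
def wait_state_code(message: str) -> str:
    """Return the 3-hex-digit wait state code from a disabled-wait PSW line."""
    if "=" not in message:
        raise ValueError("no '=' found, so no PSW to parse")
    # Everything after '=' is the PSW, printed as four 8-digit hex words.
    psw_words = message.split("=", 1)[1].split()
    if len(psw_words) != 4 or any(len(w) != 8 for w in psw_words):
        raise ValueError("expected four 8-digit hex words after '='")
    # Assumption: the wait state code is the low-order 12 bits of the last word.
    code = int(psw_words[-1], 16) & 0xFFF
    return f"{code:03X}"


line = "AWSEMI307I Warning! Disabled Wait CPU 0 = 00020000 00000000 00000000 00000081"
print(wait_state_code(line))  # prints 081
```

The three-digit result is what you look up in the wait state codes documentation.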

To make myself feel better, I will tell myself that IBM wouldn’t have this documented if people didn’t do it on the regular.

And now for the long version:

Let’s lay down some facts first.

  • Our test system runs z/OS from the target libraries of the SMP/E zone it is installed in.
  • Is this a bad thing? Yeah, a little bit.
  • Did I make an inactive copy of the target libraries and apply maintenance to those instead? No way, this is a test system and it is prime cowboy hours over here (yeehaw).

And thus, my debacle began.

I applied maintenance with a standard SMP/E apply job. The job came back and reported that SYS1.MIGLIB ran out of space during the apply. No worries. We’ll increase the size.

Usually I do this by copying the dataset into a *.NEW with a larger allocation. Maybe I even throw a little secondary space on there so I don’t have to come back later (hint: first mistake). Just because some system datasets can have secondary space doesn’t mean it’s wise or best practice to give it to them.

I now have my slightly larger SYS1.MIGLIB.NEW allocated. Hooray for me.

Now comes the second big no-no of the day. I want to swap my SYS1.MIGLIB.NEW with SYS1.MIGLIB. For datasets with no ENQs against them, I can just rename the original SYS1.MIGLIB to SYS1.MIGLIB.OLD and rename SYS1.MIGLIB.NEW to SYS1.MIGLIB. And the system is (hopefully) none the wiser.

Except SYS1.MIGLIB has ENQs on our system – specifically from XCFAS and LLA. That’s the cross-system coupling facility and the library look-aside.

I can stop LLA temporarily with a /P LLA console command and start it again with an /S LLA. I’ve done it before and the world hasn’t ended, so that’s our LLA ENQ taken care of.
Next, the LPAR being worked on is a monoplex, so XCFAS is probably not vital to its existence. But I can’t just stop this one: XCFAS is a core system address space, and if we stop that the whole system is kaput. Luckily (or unluckily) the internet gives me unfettered access to articles about system commands that can do very bad things. And after searching for a way to kick XCFAS off my dataset I found an article from IBM. [2]

Specifically, I found this set of commands:

SETPROG LNKLST,UNALLOCATE
SETPROG LNKLST,ALLOCATE

Did I read the rest of the article? Of course not, who has time for that (mistake the third). I took my linklist unallocate command and went on my merry way.

I stop LLA, I unallocate the entire linklist, I swap the dataset names, then I reallocate the linklist and start LLA back up. Everything seems great so far, so I rerun my apply job to get the rest of that maintenance on now that there’s enough space.

The maintenance fails two more times for out of space in SYS1.LINKLIB and SYS1.NUCLEUS – but that’s ok because I know the workaround I used for SYS1.MIGLIB! Let’s do that two more times and give these datasets loads of secondary space (I’m going to stop counting the mistakes before we get into the double digits, but suffice to say – please do not do this. Ever.).
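A pre-flight check along these lines would have caught the problem before the IPL did. This is only a sketch: the extent counts below are invented for illustration, and `ipl_blockers` is a hypothetical helper, not a real tool. On a real system the counts would come from a dataset listing.

```python
# Hypothetical extent counts, hard-coded purely for illustration.
EXTENTS_IN_USE = {
    "SYS1.NUCLEUS": 2,
    "SYS1.LINKLIB": 3,
    "SYS1.MIGLIB": 1,
}

# Datasets this post identifies as limited to one extent (wait state 081).
SINGLE_EXTENT_ONLY = {"SYS1.NUCLEUS"}


def ipl_blockers(extents, single_extent_only=SINGLE_EXTENT_ONLY):
    """Return the single-extent-only datasets that currently span more than one extent."""
    return sorted(name for name in single_extent_only if extents.get(name, 0) > 1)


print(ipl_blockers(EXTENTS_IN_USE))  # prints ['SYS1.NUCLEUS']
```

An empty list means nothing in the single-extent set has spilled into a second extent.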

This was around the time I started getting abends all over the place. Certain commands weren’t working. It was chaos! LLA could no longer find the commands I was running, because its in-storage record of where each member lived was out of date: the datasets had expanded into their secondary space to accommodate the new members installed by the SMP/E apply job, and everything had moved around.
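If the idea of a stale lookup is hard to picture, here is a toy Python analogy. It is not how LLA is implemented; the class names and the integer "locations" are made up. It only shows how a remembered copy of a directory stops matching once the dataset underneath it changes.

```python
class ToyLibrary:
    """A dataset directory: member name -> location (a made-up integer)."""

    def __init__(self, members):
        self.directory = dict(members)


class ToyLookaside:
    """Keeps a snapshot of a library's directory, taken at start time."""

    def __init__(self, library):
        self.library = library
        self.snapshot = dict(library.directory)

    def is_stale(self, member):
        # True when the remembered location no longer matches the directory.
        return self.snapshot.get(member) != self.library.directory.get(member)


lib = ToyLibrary({"PROGA": 100, "PROGB": 200})
cache = ToyLookaside(lib)
lib.directory["PROGA"] = 900  # maintenance rewrites the member somewhere else
print(cache.is_stale("PROGA"))  # prints True
print(cache.is_stale("PROGB"))  # prints False
```

Anything that trusts the snapshot for PROGA now goes looking in the wrong place, which is roughly the flavour of chaos I was seeing.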

According to some very fast panic-googling, this problem can be solved with an /F LLA,REFRESH (a modify command that forces LLA to update the locations of all the members). That turned out to be a lie: jobs and commands were still abending. A second source recommended rebuilding the linklist. No problem, I’ve had a little experience using SETPROG commands for dynamic changes. Only now the SETPROG command isn’t working either.

It is at this point that I decide to consult one of my more senior co-workers (you know who you are), who very kindly puts up with my 24/7 dumb questions and breakage of things. I ask whether there is any way out of this besides an IPL. No, there is not.

Which circles nicely back to the beginning of this post. Wait State 081.

So how do we fix this?

I have no way of mounting this SYSRES on any of the other LPARs, because they’ve all been set up as copies of this first one and all the SYSRES volumes have the exact same VOLSER.

BUT. That also means I can just pinch one of the other LPARs’ SYSRES volumes quickly and IPL off that. Huzzah, the LPAR is saved (ish).

And now begins the process of making an inactive target copy of the z/OS SMP/E zone, so that we can apply maintenance to that without cooking a second SYSRES. Once we’re ready to build a new one, we can initialize the volume that holds the dodgy SYS1.NUCLEUS, redeploy a copy of the new target datasets to it, and hopefully nothing else goes sideways.

To be continued…?

References:

[1] IBM, Wait State Codes: https://www.ibm.com/docs/en/zos/3.1.0?topic=wsc-081
[2] IBM, Using Dynamic Lnklst Facility: https://www.ibm.com/support/pages/using-dynamic-lnklst-facility-safely-and-properly
