I'm not sure why you are seeing this issue. We have had 24/7 printing for years. There are 400+ users printing everything from 150-page color documents to 42"x6' full color posters. Even during finals, when the printers are backed up for hours due to the volume and size of jobs being sent, the server doesn't really break a sweat. What you describe sounds like a system issue, or you're running on 10-year-old hardware. We use a 3-year-old Dell server with no issue, and all services are on one machine, not split up between servers as you describe.
I had similar questions when installing Pharos in our environment. I, too, work at an academic institution. I wanted to know what could be done about backup and redundancy, and Pharos' answer was "We've never had a problem." Not quite the answer I was looking for. I put in a feature request for some kind of backup and/or redundancy, but that was 4 years ago.
What I did in my environment was to install 2 principal servers and 2 print servers. Then I put half my queues on one server and half on the second server. I also installed "backup" queues on each server, so if one server were to go down, users could print to the backup queues. This does not handle already queued jobs, but it is better than nothing. The biggest problem with this setup is keeping everything in sync. I have to make sure that all primary queues have backup queues on the other server. I also have to have a way to switch the release stations in an area to point to the backup server in case of a problem. I use Ghost for the latter.
This solution is not ideal, but the closest I could get for some kind of redundancy. I've only had to use it once so far but I was glad it was there.
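As a rough illustration of the sync check described above (every primary queue needing a backup queue on the other server), here is a minimal Python sketch. It assumes the queue names on each server can be exported to plain lists by some means, and that backup queues follow a `-backup` suffix naming convention. The convention, the queue names, and the function name `missing_backups` are all hypothetical and not part of Pharos.

```python
def missing_backups(primary_queues, other_server_queues, suffix="-backup"):
    """Return primary queue names that have no backup queue on the other server.

    Assumption: a backup queue is named <primary name> + suffix. This naming
    convention is hypothetical; adjust it to whatever your site uses.
    """
    available = {name.lower() for name in other_server_queues}
    return sorted(
        queue for queue in primary_queues
        if (queue + suffix).lower() not in available
    )


# Example data (made up): primary queues on server A, all queues on server B.
server_a_primary = ["lib-color", "lib-bw", "lab1-bw"]
server_b_all = ["dorm-bw", "lib-color-backup", "lib-bw-backup"]

# Lists the primaries on A that still need a backup queue created on B.
print(missing_backups(server_a_primary, server_b_all))
```

Running the same check in both directions (A's primaries against B, then B's primaries against A) would cover the whole setup.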
A monitoring service (something like BMC Patrol, although this should not imply a recommendation) to monitor services and/or TCP/UDP ports is a great option. If a service exceeds a utilization threshold for some period of time, you can force fail-over to the other node, which will clean up the issue on the current node.
Printing is a somewhat reactive process model in most operating systems; there's very little that is predictive. I hope that this helps.
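To make the "threshold for some period of time" idea concrete, here is a minimal Python sketch, not a substitute for a real monitoring product. It assumes you already have some way to sample a utilization figure and to trigger the fail-over itself; the class name `SustainedThreshold`, the 90% threshold, and the sample values are hypothetical. The `port_is_open` helper only shows a basic TCP reachability check using the standard library.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class SustainedThreshold:
    """Signal only after the threshold is exceeded on N consecutive samples."""

    def __init__(self, threshold, samples_required):
        self.threshold = threshold
        self.samples_required = samples_required
        self.streak = 0

    def record(self, value):
        """Record one sample; return True when fail-over should be considered."""
        if value > self.threshold:
            self.streak += 1
        else:
            self.streak = 0  # a single healthy sample resets the count
        return self.streak >= self.samples_required


# Example with made-up utilization samples (percent), one per polling interval.
trigger = SustainedThreshold(threshold=90.0, samples_required=3)
for sample in [50, 95, 97, 99, 40]:
    if trigger.record(sample):
        print("sustained high utilization at sample", sample)
```

Requiring several consecutive samples avoids failing over on a single short spike, such as one large job being spooled.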
We had issues like this when we first started, but we doubled the number of CPUs available and increased memory in the servers, which took care of the problem. We are also running 3 print servers to distribute the load and rarely have an issue. We run all our servers in a VM, which makes increasing system resources fairly easy, and recovery is easy if anything fails.
We are currently running with 6GB of RAM and 2 processors in a VM environment.
We have 25 physical queues coupled with 19 virtual queues. We hold jobs for three hours before they are deleted.
Going by the Pharos deployment guidelines, we fall into the "medium sized" deployment range, as far as I can tell.
When I bring the clustered print server online, I will have faster CPUs, faster disk and faster RAM, all of which I hope will help.
I think some of the biggest culprits behind the performance hit are very large documents that students, in an effort to reduce page costs, print with 4 pages to the sheet; the spool size for these documents skyrockets.
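One way to confirm which jobs are responsible is to look at the spool files directly. The Python sketch below lists spool files above a size cutoff, largest first. It assumes the print server writes `.SPL` spool files into a directory you can read (on a default Windows install this is typically `C:\Windows\System32\spool\PRINTERS`, but check your own configuration); the function name `large_spool_files` and the 100 MB cutoff are hypothetical choices.

```python
from pathlib import Path


def large_spool_files(spool_dir, min_bytes=100 * 1024 * 1024):
    """Return (file name, size in bytes) for .SPL files >= min_bytes, largest first."""
    hits = []
    for path in Path(spool_dir).iterdir():
        if path.is_file() and path.suffix.lower() == ".spl":
            size = path.stat().st_size
            if size >= min_bytes:
                hits.append((path.name, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Example call (the path is an assumption; run with rights to read the folder):
# for name, size in large_spool_files(r"C:\Windows\System32\spool\PRINTERS"):
#     print(f"{name}: {size / (1024 * 1024):.0f} MB")
```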
Thanks for all the helpful suggestions on what is working for people.