The obscurity of the error message in this case (in fact Google spat this one out) was the starting point for an unusual set of circumstances while trying to resolve a crashed Domino server.
What does FATAL (44): unable to open file ‘Files’ even point towards? Some background first. This was an IBM (soon to be HCL) Domino 9 server running on Windows, not yet migrated to V10 but running for several years without incident, with one large, complex main application. That is unusual, but not so much as to suggest trying to debug this a different way.
Reports of the server being unavailable and checks of a replication partner showed that the problem had probably occurred since the last replication.
Restarting the Domino server application showed no errors, leading to an initial false sense of security, which was cruelly shattered after 15 minutes when the cryptic message “Fatal (44)” was displayed and the server stopped running. Repeated attempts simply showed a consistency scan on the databases that were open at the time of the crash, followed by another serving of crash sauce after another 15 minutes.
At this stage I decided to rename the application NSF files, as there were only a couple, and restarted the server to find that it remained running. By a process of elimination, one database (of course it had to be the largest, at around 20GB) was shown to be the source of the error. Watching the console after renaming it back, I noted that a number of agents in the application ran after the server started, and that one status message identified the agent that was “last on the scene” before the hang. So the suspect became an agent processing documents.
Starting the server but stopping the Agent Manager with tell amgr quit before the agents ran allowed the database to be inspected while discussing any recent changes with the developer in charge of it. The developer confirmed that no code changes had been made in the month preceding the crash, but when questioned further about any non-programming changes or issues, he mentioned that a data quality issue had prevented an agent from completing on the day the problem started.
Digging deeper, it transpired that the issue related to a malformed file name held in a document field. That had my spidey senses tingling, so we took a look at the form in question and the views holding the documents to be processed by another agent. The developer had previously simply corrected the field to provide a correct file name for the production of an output file used by another application; basically, he removed a space to correct the error.
So we looked through the view of documents awaiting processing. There we found a similar problem, but this time the field from which the file name was to be generated contained a large amount of text rather than the 8.3 string intended to name a file.
The code had previously detected an invalid file name and thrown the appropriate warning. After reviewing it, we realised that the much larger field value was in fact breaking the validation code itself, which could only come up with Fatal (44) as a crash exception.
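The agent's actual validation code is not reproduced here, so the following is only a rough sketch of the defensive idea, written in Python rather than in the agent's own language. It rejects an over-long value before doing any other processing, then checks the 8.3 shape. The field name FileName, the allowed character set, and the (doc_id, fields) document structure are all assumptions made for illustration.

```python
import re

# Assumption: "8.3" means a 1-8 character base name, a dot, and a
# 1-3 character extension, using a deliberately conservative character set.
MAX_LEN = 12  # 8 + 1 + 3
_PATTERN = re.compile(r"[A-Za-z0-9_\-]{1,8}\.[A-Za-z0-9]{1,3}")


def check_filename(value):
    """Return (ok, reason) for a candidate 8.3 file name."""
    if not isinstance(value, str):
        return False, "not a string"
    # Check the length first, so later steps never see a huge value.
    if len(value) > MAX_LEN:
        return False, f"too long ({len(value)} characters, limit {MAX_LEN})"
    if not _PATTERN.fullmatch(value):
        return False, "not in 8.3 form"
    return True, "ok"


def find_bad_documents(docs, field="FileName"):
    """List (doc_id, reason) for documents whose file name field is invalid.

    docs is an iterable of (doc_id, fields) pairs, where fields is a dict.
    The field name "FileName" is hypothetical.
    """
    bad = []
    for doc_id, fields in docs:
        ok, reason = check_filename(fields.get(field, ""))
        if not ok:
            bad.append((doc_id, reason))
    return bad
```

Running something like find_bad_documents over the documents awaiting processing, before the agent touches them, would flag both the stray space and the oversized text as ordinary warnings.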
So the moral is: if you give them enough rope (i.e. space in a field that shouldn’t be needed), they will probably hang themselves.