Memory resource annotation passed incorrectly to FW_submit script
Consider the following test model with resource annotations:
f(x) = x
a1 = f(1) on 2 cores with 2 [GB] for 2.0 [minutes]
a2 = f(1) on 2 cores for 2.0 [minutes]
a3 = f(1) on 2 cores
The model is then executed in CLI-script mode and evaluated with the session manager via texts session -a -r -w -u some_sample_UUID
The nodes that contain cores and time resource annotations, but no memory annotation, are evaluated without any problems (queued to the batch system and set to COMPLETED status with correct results).
These correspond to variables a2 and a3.
By contrast, the nodes whose variables carry a memory resource annotation, such as a1, are set to the READY state but are never submitted to the batch system.
With debug logging enabled (--enable_logging --logging_level DEBUG), the session manager prints the following message:
2024-03-12 17:40:39,374 INFO submitting queue script
2024-03-12 17:40:39,400 ERROR ----|vvv|----
2024-03-12 17:40:39,400 ERROR Error writing/submitting queue script!
2024-03-12 17:40:39,402 ERROR Traceback (most recent call last):
File "/hkfs/home/project/hk-project-consulting/an9294/python_VirtEnvs/vre-lang/lib/python3.9/site-packages/fireworks/queue/queue_launcher.py", line 150, in launch_rocket_to_queue
raise RuntimeError(
RuntimeError: queue script could not be submitted, check queue script/queue adapter/queue server status!
2024-03-12 17:40:39,402 ERROR ----|^^^|----
2024-03-12 17:40:39,402 INFO Un-reserving FW with fw_id, launch_id: 13044, 11157
Manually submitting the generated script via the sbatch command does not fare any better; the following error message is printed:
sbatch: error: Invalid --mem-per-cpu specification
The FW_submit.script file of the node stuck in READY contains the directives listed below, with the mem-per-cpu line marked by an arrow:
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=2
#SBATCH --partition=dev_cpuonly
#SBATCH --job-name=10540661e7c444b5b22c19b3099da70d
#SBATCH --output=10540661e7c444b5b22c19b3099da70d-%j.out
#SBATCH --error=10540661e7c444b5b22c19b3099da70d-%j.error
#SBATCH --mem-per-cpu=2000.0 <----
After some testing, it was found that the following variations of mem-per-cpu are accepted without error on (manual) job submission:
#SBATCH --mem-per-cpu=2000
#SBATCH --mem-per-cpu=2000Mb
#SBATCH --mem-per-cpu=2000MB
#SBATCH --mem-per-cpu=2000M
The cause of the issue therefore appears to be the formatting of the memory-per-CPU amount: SLURM rejects the specification if it contains a decimal point. The module that passes this value from the model to the generated SLURM script should be revised to emit an integer value.
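As a minimal sketch of the suspected cause and a possible fix (the function name below is illustrative and not part of the actual vre-language or FireWorks code): the trailing ".0" is what Python's default string conversion of a float produces, and casting or rounding the value to an integer before it is written into the queue script yields a specification SLURM accepts.

import math

def format_mem_per_cpu(mem_mb):
    """Format a memory-per-CPU amount (in MB) so that SLURM accepts it.

    SLURM expects an integer, optionally followed by a unit suffix;
    a bare float such as '2000.0' triggers
    'sbatch: error: Invalid --mem-per-cpu specification'.
    """
    # Round up so the job never requests less memory than the model asked for.
    return f"{math.ceil(mem_mb)}M"

# Value as it appears in the FW_submit.script of the example above
mem_per_cpu = 2000.0
print(f"#SBATCH --mem-per-cpu={mem_per_cpu}")                      # rejected: ...=2000.0
print(f"#SBATCH --mem-per-cpu={format_mem_per_cpu(mem_per_cpu)}")  # accepted: ...=2000M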