Running Harmonie under ecFlow
Introduction
This document describes how to run Harmonie under the ecFlow scheduler at ECMWF. ecFlow is the ECMWF workflow manager; it is written in C++ and provides a Python API, which improves maintainability, allows easier modification and introduces object-oriented features compared to the old scheduler, SMS. ecFlow can be used with any HARMONIE version from harmonie-40h1.1.beta.1 onwards.
New users
On the ECMWF Atos machine in Bologna, each user has a virtual machine on which ecFlow is running. If you don't have a VM yet, ask ECMWF to set it up for you. If you are starting ecFlow for the first time at ECMWF, you may have to add your ssh key to the authorized_keys file to allow passwordless access, as ssh is used to communicate between the servers:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
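Running the command above more than once appends the same key repeatedly. The sketch below shows one way to make the step idempotent; the function name add_key_once is hypothetical and both file paths are passed as arguments, so it can be tried on scratch files first.

```shell
# add_key_once PUBKEY AUTHFILE  (hypothetical helper, not part of Harmonie)
# Append the public key to AUTHFILE only if that exact line is not there yet.
add_key_once() (
    pubkey=$1
    authfile=$2
    [ -r "$pubkey" ] || { echo "no public key at $pubkey" >&2; exit 1; }
    touch "$authfile"
    # -F fixed strings, -x whole-line match, -q quiet, -f read patterns from file
    if ! grep -qxFf "$pubkey" "$authfile"; then
        cat "$pubkey" >> "$authfile"
    fi
    # With StrictModes (the sshd default) the file must not be writable by others
    chmod 600 "$authfile"
)
```

A typical call would be add_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys.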
Start your experiment supervised by ecFlow
Launch the experiment in the usual manner by giving the start time DTG, the end time DTGEND, and other optional arguments:
~hlam/Harmonie start DTG=YYYYMMDDHH
If successful, ecFlow will identify your experiment name, start building your binaries and run your forecast. If not, examine the ecFlow log file $HM_DATA/ECF.log. $HM_DATA is defined in your Env_system file. At ECMWF, $HM_DATA=$SCRATCH/hm_home/$EXP, where $EXP is your experiment name.
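As a minimal sketch of how the log file can be located and inspected, the helpers below build the path from the variables named above. The function names ecf_log_path and show_ecf_log are hypothetical, and the fallback to $SCRATCH/hm_home/$EXP assumes the ECMWF layout.

```shell
# ecf_log_path: print the ECF.log location for the current experiment.
# Uses HM_DATA when it is set; otherwise falls back to the ECMWF layout
# $SCRATCH/hm_home/$EXP (an assumption for sites other than ECMWF).
ecf_log_path() {
    echo "${HM_DATA:-$SCRATCH/hm_home/$EXP}/ECF.log"
}

# show_ecf_log [N]: print the last N lines (default 50) of the log, if present.
show_ecf_log() (
    log=$(ecf_log_path)
    if [ -f "$log" ]; then
        tail -n "${1:-50}" "$log"
    else
        echo "no log file at $log" >&2
        exit 1
    fi
)
```

For example, show_ecf_log 100 prints the last 100 lines of the log once an experiment has been started.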
The ecFlow viewer starts automatically. To view any suite for your server or other servers, the server must be added to the ecFlow viewer (via Servers -> Manage servers, Add server) and selected in Servers. See below on how to find the port and server name.
- Two experiments with the same name cannot be monitored on the same server, so Harmonie will start the server and delete the previous non-active suite for you.
- To delete a suite manually, use
ecflow_client --port XXXX --host XXXX --delete force yes /suite
or use the GUI: right-click on the suite, then click "Remove" (if you don't see the Remove option, go to Tools -> Preferences -> Menus, and make yourself Administrator).
- If other manual intervention in the server or client is needed, you can use ecflow_client commands; see the ecFlow documentation.
ecFlow control
Finding the port and host of the ecFlow server
The server on which ecFlow is running is defined by the variable ECF_HOST and the port by ECF_PORT; both are set in Env_system or derived. On the VMs on the ECMWF Atos machine in Bologna, ECF_HOST=ecflow-gen-${USER}-001 and ECF_PORT=3141 for all users.
NOTE: ECMWF has introduced a new naming convention for the ecFlow servers. The old server name will remain available for some users. If the new naming applies to your user, update Env_system with ECF_HOST=ecfg-${USER}-1 and ECF_PORT=3141.
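The two naming conventions can be captured in a small helper, sketched here under the assumption that only the user name varies. The function name ecf_host_name and the labels old and new are hypothetical.

```shell
# ecf_host_name USER [old|new]  (hypothetical helper)
# Print the ecFlow server host name for USER under the chosen convention.
ecf_host_name() {
    case "${2:-old}" in
        old) echo "ecflow-gen-$1-001" ;;   # original naming
        new) echo "ecfg-$1-1" ;;           # new ECMWF naming
        *)   echo "unknown convention: $2" >&2; return 1 ;;
    esac
}
```

Which convention applies to a given user is decided by ECMWF, so the result should be checked against the server actually assigned to you before it is written to Env_system.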
Information about server variables can be found by running:
- On ECMWF's Atos:
ssh ecflow-gen-${USER}-001 ecflow_server status
- Or if ecFlow is running on the machine you are logged into:
ecflow_server status
You can also find ECF_PORT/ECF_HOST by checking the files under $ECF_HOME, like:
> ls -rlt ~/ecflow_server
total 12
-rw-r--r-- 1 hlam accord 2529 Jun 15 16:20 ecflow-gen-hlam-001.3141.ecf.check.b
-rw-r--r-- 1 hlam accord 2529 Jun 20 17:36 ecflow-gen-hlam-001.3141.ecf.check
-rw-r--r-- 1 hlam accord 3113 Jun 20 17:38 ecflow-gen-hlam-001.log
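Judging from the listing above, the checkpoint file appears to be named HOST.PORT.ecf.check. Assuming that pattern holds, a sketch for extracting host and port from such a file name could look like this (parse_check_file is a hypothetical name):

```shell
# parse_check_file NAME  (hypothetical helper)
# Print "HOST PORT" for a file named HOST.PORT.ecf.check; fail for other names.
parse_check_file() (
    name=$(basename "$1")
    case "$name" in
        *.ecf.check) ;;
        *) exit 1 ;;
    esac
    stem=${name%.ecf.check}      # strip the suffix, leaving HOST.PORT
    port=${stem##*.}             # text after the last dot
    host=${stem%.*}              # text before the last dot
    case "$port" in
        ''|*[!0-9]*) exit 1 ;;   # the port must consist of digits only
    esac
    echo "$host $port"
)

# Possible use over the directory shown above:
# for f in ~/ecflow_server/*.ecf.check; do parse_check_file "$f"; done
```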
Check the status of your server
To check the status of your server you can use
ecflow_client --stats --port ECF_PORT --host ECF_HOST
or
ecflow_client --port ECF_PORT --host ECF_HOST --ping
or go to the "Info" tab in the ecFlow viewer.
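For use in scripts, the ping command can be wrapped so that its exit status drives the next step. This is a minimal sketch that assumes ecflow_client returns a non-zero status when the ping fails; the function names are hypothetical.

```shell
# server_alive HOST PORT: succeed if the server answers a ping.
# Assumes ecflow_client exits non-zero when the server cannot be reached.
server_alive() {
    ecflow_client --host "$1" --port "$2" --ping >/dev/null 2>&1
}

# report_server HOST PORT: print a one-line summary and return the ping status.
report_server() {
    if server_alive "$1" "$2"; then
        echo "ecFlow server $1:$2 is up"
    else
        echo "ecFlow server $1:$2 is not responding"
        return 1
    fi
}
```

A call such as report_server "$ECF_HOST" "$ECF_PORT" can then precede any command that needs a running server.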
Open the viewer of a running ecFlow server
If you know that your ecFlow server is running but you have no viewer attached to it you can restart the viewer:
ecflow_ui &
Stop your ecFlow server
If you are sure you're running the server on the login node of your machine you can simply run
ecflow_stop.sh
A more complete and robust way is
export ECF_PORT=<your port>
export ECF_HOST=<your server name>
ecflow_client --halt=yes
ecflow_client --check_pt
ecflow_client --terminate=yes
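The three commands above only make sense in that order, so a script may want to stop at the first failure rather than terminate a server that could not write its checkpoint. A sketch of that idea follows; stop_ecflow_server is a hypothetical name and the error messages are illustrative.

```shell
# stop_ecflow_server HOST PORT  (hypothetical helper)
# Halt, checkpoint and terminate the server, stopping at the first failure.
# Runs in a subshell so the exported variables do not leak into the caller.
stop_ecflow_server() (
    ECF_HOST=$1
    ECF_PORT=$2
    export ECF_HOST ECF_PORT
    ecflow_client --halt=yes      || { echo "halt failed" >&2; exit 1; }
    ecflow_client --check_pt      || { echo "checkpoint failed, server left halted" >&2; exit 1; }
    ecflow_client --terminate=yes || { echo "terminate failed" >&2; exit 1; }
)
```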
Restart your ecFlow server
The ecFlow servers on the virtual machines at ECMWF should restart automatically. If yours does not, you may need to restart it with:
ssh ecflow-gen-${USER}-001 sudo systemctl restart ecflow-server
On other systems, if the server is not running you can start it again using the script:
ecflow_start.sh [-d $ECF_HOME]
If ecFlow is running on a different machine you have to log in and start it on that machine:
ssh <your server name>
module load ecflow
ecflow_start.sh [-d $ECF_HOME]
As an alternative you can let Harmonie start the server for you when starting your next experiment, or type
~hlam/Harmonie mon
Keep your ecFlow server alive
If you are not using ecFlow on the ECMWF VMs, the ecFlow server will eventually die, causing an unexpected disruption in your experiments. To prevent this you can add a cron job that restarts the server, e.g. every fifth minute:
> crontab -l
*/5 * * * * /home/$USER/bin/cronrun.sh ecflow_start.sh -d $ECF_HOME > ~/ecflow_start.out 2>&1
where the small script cronrun.sh makes sure you get the right environment:
#!/bin/bash
source ~/.bash_profile
module unload ecflow
module load ecflow/5.7.0
# Run the command passed as arguments; quoting keeps arguments with spaces intact
"$@"
The ecFlow version loaded here (5.7.0) may change over time.
Add another user to your ecFlow viewer
Sometimes it's handy to be able to follow, and control, your colleagues' experiments. To do this:
- Find the port number of your colleague as described above.
- In the ecFlow viewer choose Servers -> Manage servers, click on "Add server" and fill in the appropriate host and port and give it a useful name. Click on OK to save it.
- If you click on Servers in the viewer the name should appear and you can make it visible by clicking on it.
Changing the port
By default, the port is set by
export ECF_PORT=$((1500+usernumber))
in mSMS.job (40h1.1), Start_ecFlow.sh (up to #b6d58dd), or Main (currently).
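The default port rule can be reproduced outside Harmonie to see which port a given user would get. The sketch below assumes that usernumber is the numeric user ID as returned by id -u, which the text above does not state explicitly; default_ecf_port is a hypothetical name. It also rejects results above 65535, since those are not valid TCP ports.

```shell
# default_ecf_port [UID]  (hypothetical helper)
# Print 1500 + UID, using the caller's numeric user ID when none is given.
# Assumption: "usernumber" in the rule above is the numeric user ID.
default_ecf_port() (
    uid=${1:-$(id -u)}
    port=$((1500 + uid))
    if [ "$port" -gt 65535 ]; then
        echo "computed port $port is outside the valid TCP range" >&2
        exit 1
    fi
    echo "$port"
)
```

For a user ID of 1234, for instance, the rule gives port 2734.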
For the VMs at ECMWF it is set to 3141 in Env_system. If you want to change this number (for example, if that port is already in use), you will also need to add a -p flag when calling ecflow_start.sh, as follows:
ecflow_start.sh -p $ECF_PORT -d $JOBOUTDIR
Otherwise, ecflow_start.sh tries to open the default port.
Note: if you already have an ecFlow server running at your new port number before launching an experiment, this won't be an issue.