Running Harmonie under ecFlow

Introduction

This document describes how to run Harmonie under the ecFlow scheduler at ECMWF. ecFlow is the ECMWF workflow manager; it has been written using Python to improve maintainability, allow easier modification and introduce object-oriented features compared to the old scheduler, SMS. ecFlow can be used with any HARMONIE version from harmonie-40h1.1.beta.1 onwards.

New users

On the ECMWF Atos machine in Bologna, each user has a virtual machine on which ecFlow is running. If you don't have a VM yet, ask ECMWF to set it up for you. If you are starting ecFlow for the first time at ECMWF, you may have to add your ssh key to the authorized_keys file to allow passwordless access, as ssh is used to communicate between the servers:

      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
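Running the command above more than once appends the same key repeatedly. A minimal sketch of an idempotent variant is given below; the helper name add_key_once is hypothetical and not part of Harmonie or ecFlow, and the key is assumed to live in ~/.ssh/id_rsa.pub as in the command above.

```shell
#!/bin/bash
# Append a public key to an authorized_keys file only if it is not there yet.
# add_key_once is a hypothetical helper, not part of Harmonie or ecFlow.
add_key_once() {
  local pubkey_file=$1 auth_file=$2
  [ -r "$pubkey_file" ] || { echo "No public key at $pubkey_file" >&2; return 1; }
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  # -F: fixed strings, -x: match whole lines, -q: no output
  if ! grep -qxF -f "$pubkey_file" "$auth_file"; then
    cat "$pubkey_file" >> "$auth_file"
  fi
  chmod 600 "$auth_file"   # keep the file private to the owner
}

# Usage: add_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```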

Start your experiment supervised by ecFlow

Launch the experiment in the usual manner by giving the start time (DTG), the end time (DTGEND) and any other optional arguments:

      ~hlam/Harmonie start DTG=YYYYMMDDHH

If successful, ecFlow will identify your experiment name, start building your binaries and run your forecast. If not, examine the ecFlow log file $HM_DATA/ECF.log. $HM_DATA is defined in your Env_system file; at ECMWF, $HM_DATA=$SCRATCH/hm_home/$EXP, where $EXP is your experiment name.
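A mistyped DTG is an easy way to end up reading ECF.log. As a sketch only, the hypothetical helper below checks that a value has the YYYYMMDDHH shape before the experiment is launched. It checks digit ranges only, so it does not catch dates such as 30 February.

```shell
#!/bin/bash
# Check that a date-time group looks like YYYYMMDDHH.
# is_valid_dtg is a hypothetical helper; it validates the shape only
# (month 01-12, day 01-31, hour 00-23), not the calendar.
is_valid_dtg() {
  [[ $1 =~ ^[0-9]{4}(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])([01][0-9]|2[0-3])$ ]]
}

# Usage (the dates are placeholders):
#   DTG=2023061500; DTGEND=2023061600
#   if is_valid_dtg "$DTG" && is_valid_dtg "$DTGEND"; then
#     ~hlam/Harmonie start DTG=$DTG DTGEND=$DTGEND
#   fi
#   tail -n 50 "$HM_DATA/ECF.log"   # inspect the log if the start fails
```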

The ecFlow viewer starts automatically. To view any suite for your server or other servers, the server must be added to the ecFlow viewer (via Servers -> Manage servers, Add server) and selected in Servers. See below on how to find the port and server name.

  • Two experiments with the same name cannot be monitored on the same server, so Harmonie will start the server and delete any previous non-active suite with that name for you.
  • To delete a suite manually, use ecflow_client --port XXXX --host XXXX --delete force yes /suite, or use the GUI: right-click on the suite, then click "Remove" (if you don't see the Remove option, go to Tools -> Preferences -> Menus and make yourself Administrator).
  • If other manual intervention in server or client is needed you can use ecflow commands. See here.

ecFlow control

Finding the port and host of the ecFlow server

The server on which ecFlow is running is defined by the variable ECF_HOST and the port by ECF_PORT; both are set in Env_system or derived. On the VMs on the ECMWF Atos machine in Bologna, ECF_HOST=ecflow-gen-${USER}-001 and ECF_PORT=3141 for all users.

NOTE: ECMWF has introduced a new naming convention for the ecFlow servers. The old server name will remain available for some users. If the new naming applies to your user, update Env_system to ECF_HOST=ecfg-${USER}-1 and ECF_PORT=3141.

Information about server variables can be found by running:

  • On ECMWF's Atos:
      ssh ecflow-gen-${USER}-001 ecflow_server status 
  • Or if ecFlow is running on the machine you are logged into:
      ecflow_server status 

You can also find ECF_PORT/ECF_HOST by checking the files under $ECF_HOME, like:

> ls -rlt ~/ecflow_server
total 12
-rw-r--r-- 1 hlam accord 2529 Jun 15 16:20 ecflow-gen-hlam-001.3141.ecf.check.b
-rw-r--r-- 1 hlam accord 2529 Jun 20 17:36 ecflow-gen-hlam-001.3141.ecf.check
-rw-r--r-- 1 hlam accord 3113 Jun 20 17:38 ecflow-gen-hlam-001.log
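The check-point file names in the listing above follow the pattern <host>.<port>.ecf.check, so host and port can be recovered with shell parameter expansion. A minimal sketch, with parse_check_file as a hypothetical helper name:

```shell
#!/bin/bash
# Extract host and port from a check-point file name of the form
# <host>.<port>.ecf.check, as seen in the listing above.
# parse_check_file is a hypothetical helper name.
parse_check_file() {
  local base name host port
  base=$(basename "$1")
  name=${base%.ecf.check}             # drop the suffix -> <host>.<port>
  [ "$name" != "$base" ] || return 1  # the suffix was not present
  port=${name##*.}                    # text after the last dot
  host=${name%.*}                     # text before the last dot
  [[ $port =~ ^[0-9]+$ ]] || return 1 # the port must be numeric
  echo "ECF_HOST=$host ECF_PORT=$port"
}

# Usage:
#   for f in ~/ecflow_server/*.ecf.check; do parse_check_file "$f"; done
```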

Check the status of your server

To check the status of your server you can use

ecflow_client --stats --port $ECF_PORT --host $ECF_HOST

or

ecflow_client --port $ECF_PORT --host $ECF_HOST --ping

or go to the "Info" tab in the ecFlow viewer.
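For use in scripts, the ping command can be wrapped in a small function. The sketch below assumes that ecflow_client exits with a non-zero status when the server does not answer; the function names server_is_up and report_server are hypothetical.

```shell
#!/bin/bash
# Return success if the ecFlow server at host:port answers a ping.
# Assumption: ecflow_client exits non-zero when the ping fails.
server_is_up() {
  local host=$1 port=$2
  ecflow_client --host "$host" --port "$port" --ping >/dev/null 2>&1
}

# Print a one-line status and propagate the result as the exit status.
report_server() {
  if server_is_up "$1" "$2"; then
    echo "ecFlow server $1:$2 is responding"
  else
    echo "ecFlow server $1:$2 is not responding"
    return 1
  fi
}

# Usage: report_server "$ECF_HOST" "$ECF_PORT"
```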

Open the viewer of a running ecFlow server

If you know that your ecFlow server is running but you have no viewer attached to it you can restart the viewer:

ecflow_ui &

Stop your ecFlow server

If you are sure you're running the server on the login node of your machine you can simply run

ecflow_stop.sh

A more complete and robust way is

export ECF_PORT=<your port>
export ECF_HOST=<your server name>
ecflow_client  --halt=yes
ecflow_client  --check_pt
ecflow_client  --terminate=yes
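The same sequence can be wrapped so that it stops at the first failing step, for example to avoid terminating a server whose check point could not be written. This is a sketch only; stop_ecflow_server is a hypothetical name, and host and port are passed explicitly instead of through the environment.

```shell
#!/bin/bash
# Halt, check-point and terminate an ecFlow server, stopping at the
# first step that fails. stop_ecflow_server is a hypothetical helper.
stop_ecflow_server() {
  local host=$1 port=$2
  local client=(ecflow_client --host "$host" --port "$port")
  "${client[@]}" --halt=yes  || return 1   # stop scheduling new jobs
  "${client[@]}" --check_pt  || return 1   # write the check-point file
  "${client[@]}" --terminate=yes           # shut the server down
}

# Usage: stop_ecflow_server "$ECF_HOST" "$ECF_PORT"
```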

Restart your ecFlow server

The ecFlow servers on the virtual machines at ECMWF should be restarted automatically. If yours is not, you may need to restart it with:

ssh ecflow-gen-${USER}-001 sudo systemctl restart ecflow-server

On other systems, if the server is not running you can start it again using the script:

 ecflow_start.sh [-d $ECF_HOME]

If ecFlow is running on a different machine, you have to log in and start it on that machine:

 ssh <your server name>
 module load ecflow
 ecflow_start.sh [-d $ECF_HOME]

As an alternative you can let Harmonie start the server for you when starting your next experiment, or type

~hlam/Harmonie mon

Keep your ecFlow server alive

If you are not using ecFlow on the ECMWF VMs, the ecFlow server will eventually die, causing an unexpected disruption in your experiments. To prevent this you can add a cron job that restarts the server, e.g. every fifth minute.

> crontab -l
*/5 * * * * /home/$USER/bin/cronrun.sh ecflow_start.sh -d $ECF_HOME > ~/ecflow_start.out 2>&1

where the small script cronrun.sh makes sure you get the right environment:

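#!/bin/bash
# Load the login environment, which cron does not provide
source ~/.bash_profile
# Make sure the intended ecFlow version is loaded
module unload ecflow
module load ecflow/5.7.0
# Run the command given as arguments, preserving their quoting
"$@"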

The ecFlow version loaded here (5.7.0) may need updating, as the installed server version changes over time.

Add another user to your ecFlow viewer

Sometimes it's handy to be able to follow, and control, your colleagues' experiments. To do this, follow these steps:

  • Find the host and port number of your colleague's server as described above.
  • In the ecFlow viewer choose Servers -> Manage servers, click on "Add server" and fill in the appropriate host and port and give it a useful name. Click on OK to save it.
  • If you click on Servers in the viewer the name should appear and you can make it visible by clicking on it.

Changing the port

By default, the port is set by

export ECF_PORT=$((1500+usernumber))

in mSMS.job (40h1.1), Start_ecFlow.sh (up to #b6d58dd), or Main (currently).
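As an illustration of the default rule, the sketch below computes the port under the assumption that usernumber is the numeric user ID reported by id -u; the source does not spell this out. The helper name ecf_port_for_uid is hypothetical.

```shell
#!/bin/bash
# Compute the default ecFlow port as 1500 + user number.
# Assumption: "usernumber" is the numeric user ID from `id -u`.
# ecf_port_for_uid is a hypothetical helper name.
ecf_port_for_uid() {
  local uid=${1:-$(id -u)}
  local port=$((1500 + uid))
  # TCP ports end at 65535; refuse anything beyond that
  if [ "$port" -gt 65535 ]; then
    echo "Computed port $port is out of range" >&2
    return 1
  fi
  echo "$port"
}

# Usage: export ECF_PORT=$(ecf_port_for_uid)
```

For a user ID of 1000 this gives port 2500.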

For the VMs at ECMWF it is set to 3141 in Env_system. If you want to change this number (for example, if that port is in use already), you will also need to add a -p flag when calling ecflow_start.sh as follows:

ecflow_start.sh -p $ECF_PORT -d $JOBOUTDIR

Otherwise, ecflow_start.sh tries to open the default port.

Note: if an ecFlow server is already running on your new port before you launch an experiment, this won't be an issue.

More info