
TensorBoard

FlowML provides a TensorBoard logger that tracks PyTorch/XGBoost model training loss and validation loss by default. These logs are saved in the workflow artifacts directory: on local disk when running locally, or in the configured distributed storage when running distributed. To view them, launch TensorBoard with:

tensorboard --logdir path/to/artifacts/<study_identifier>/

where <study_identifier> is the value specified in the configuration file. This starts a local server, typically reachable at http://127.0.0.1:6006 in a browser.
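
If you prefer to launch TensorBoard from Python rather than the shell, the sketch below uses TensorBoard's public tensorboard.program API. The logdir path is a placeholder; substitute your own artifacts directory and study identifier.

from tensorboard import program

# Minimal sketch: start a TensorBoard server programmatically.
# "path/to/artifacts/my_study" is a placeholder for your study's directory.
tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "path/to/artifacts/my_study"])
url = tb.launch()  # non-blocking; returns the serving URL, e.g. http://127.0.0.1:6006/
print(f"TensorBoard listening on {url}")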

Adding custom metrics

FlowML models expose the TensorBoard logger by default; users can log custom metrics by adding the following lines to their training loop:

# Train loop here:
for epoch in epochs:
    train_loss = 0.0  # reset the running loss at the start of each epoch
    for batch in batches:
        x, y_target = batch
        y_pred = model(x)
        loss = criterion(y_pred, y_target)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    # Log the accumulated epoch loss through FlowML's logger
    self.log("train_loss", train_loss, step=epoch)

This logs train_loss through the logger; the resulting event files appear in the artifacts directory and can be post-processed later.
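
Assuming the logger writes standard TensorBoard event files under the study directory (the path below is a placeholder), the logged scalars can be read back with the tensorboard package's EventAccumulator, for example:

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Load the event files written during training.
acc = EventAccumulator("path/to/artifacts/my_study")
acc.Reload()  # parse everything currently on disk

# Iterate over the logged train_loss scalars (one ScalarEvent per step).
for event in acc.Scalars("train_loss"):
    print(event.step, event.value)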

If you wish to start a new TensorBoard log, simply call self.reset_logger() at the end of each training run.
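
For example, when training several models in sequence (say, one per cross-validation fold), resetting the logger between runs keeps each run's curves in its own log. A minimal sketch; train_one_fold and folds are hypothetical stand-ins for your own training routine and data splits:

for fold, (train_data, val_data) in enumerate(folds):
    # Hypothetical helper wrapping the training loop shown above.
    self.train_one_fold(train_data, val_data)
    # Start a fresh TensorBoard log before the next fold begins.
    self.reset_logger()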