Schedule data pipeline

Scheduling a data pipeline allows us to run the pipeline automatically at specific intervals. For instance, we can define a pipeline to run at 2 pm every day or 8 am every Monday.

To schedule a pipeline, go to the Load view and click Schedule pipeline.


The Schedule data pipeline window appears.


Enter the pipeline name. We can also choose the start date of the pipeline. Click the calendar icon to select a start date.


We can choose the frequency of the pipeline runs: hourly, daily, weekly, or monthly. We can also specify the time at which the pipeline will run.
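To make the four frequencies concrete, the sketch below computes when a run would next be due. It is only an illustration and not how Apromore implements scheduling: the function name `next_run` is hypothetical, and clamping a monthly run to the last day of a shorter month is an assumption, since this page does not describe that case.

```python
import calendar
from datetime import datetime, timedelta


def next_run(last_run: datetime, frequency: str) -> datetime:
    """Return the next run time after last_run (hypothetical helper)."""
    if frequency == "hourly":
        return last_run + timedelta(hours=1)
    if frequency == "daily":
        return last_run + timedelta(days=1)
    if frequency == "weekly":
        return last_run + timedelta(weeks=1)
    if frequency == "monthly":
        # Assumption: keep the same day of the month, clamped to the
        # last day when the next month is shorter (e.g. Jan 31 -> Feb 29).
        year = last_run.year + last_run.month // 12
        month = last_run.month % 12 + 1
        day = min(last_run.day, calendar.monthrange(year, month)[1])
        return last_run.replace(year=year, month=month, day=day)
    raise ValueError(f"unsupported frequency: {frequency}")


# A pipeline that ran at 2 pm on 1 January 2024 with a daily schedule
print(next_run(datetime(2024, 1, 1, 14, 0), "daily"))
```

With a weekly schedule, the same call would return the same weekday and time one week later, which matches the "8 am every Monday" example.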


Note

When scheduling a pipeline from an S3 folder, the schedule frequency of the pipeline should match the frequency at which new micro-batches arrive in the S3 folder. For instance, if an external extraction process automatically uploads a micro-batch every hour, we can set the pipeline schedule to run hourly. To learn more about ETL pipelines from an S3 folder, see Extract micro-batches from an S3 folder.

If we want the pipeline to run immediately after scheduling it, tick Run pipeline now in addition to the scheduled time.


Loading strategies and eviction periods

The loading strategy is an important parameter when scheduling a pipeline. There are three loading strategies.

  • Generate a new log every time the pipeline is executed creates a new log without overwriting the existing ones. When we choose this option, we will be required to enter the log name for the new files.


  • Always append data to the same log adds the new rows to the existing log file. Each time the pipeline runs, it appends data to the initial log file.

If the schema of the appended log changes at any time, the pipeline run will fail. To prevent this, we can tick Create new log when the schema is altered. Apromore will instead create another log when it notices a change in the log schema.


To prevent the resulting log file from growing too large, we can optionally specify the data range to retain.


If one month is selected, data older than one month in the previous dataset will be discarded.

Note

If we delete the log file created from previous pipeline runs, a new log file will be created in the following pipeline run.

  • Overwrite the log every time the pipeline is executed always replaces the existing log file with the output of the pipeline execution. To prevent pipeline run failure due to a change in the schema, we can tick Create new log when the schema is altered.
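The three loading strategies can be compared with a small in-memory model. This is a sketch under stated assumptions, not Apromore's implementation: the `load` function, the dictionary used as a log store, the strategy keywords, and the `name_runid` naming of new logs are all hypothetical, and a schema is simplified to the set of column names in the first row.

```python
from typing import Dict, List

Row = Dict[str, str]
Log = List[Row]


def schema_of(rows: Log) -> frozenset:
    """Simplified schema: the column names of the first row."""
    return frozenset(rows[0].keys()) if rows else frozenset()


def load(store: Dict[str, Log], name: str, rows: Log, strategy: str,
         run_id: int = 0, new_log_on_schema_change: bool = False) -> str:
    """Apply one pipeline run's output and return the log name written."""
    if strategy not in ("new", "append", "overwrite"):
        raise ValueError(f"unknown strategy: {strategy}")
    if strategy == "new":
        # Generate a new log on every run; existing logs are untouched.
        target = f"{name}_{run_id}"  # hypothetical naming scheme
        store[target] = list(rows)
        return target
    existing = store.get(name)
    if existing and rows and schema_of(existing) != schema_of(rows):
        if not new_log_on_schema_change:
            raise ValueError("schema changed; the pipeline run fails")
        # "Create new log when the schema is altered" is ticked.
        target = f"{name}_{run_id}"
        store[target] = list(rows)
        return target
    if strategy == "append":
        # A missing (for example deleted) log is simply created again.
        store.setdefault(name, []).extend(rows)
    else:
        store[name] = list(rows)  # overwrite
    return name


store: Dict[str, Log] = {}
load(store, "orders", [{"case": "1"}], "append")
load(store, "orders", [{"case": "2"}], "append")
print(len(store["orders"]))  # two rows in the same log
```

In this model, appending to a log that was deleted recreates it on the following run, which mirrors the behavior described in the note above.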

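The optional data range to retain, available with the append strategy, can be pictured as a filter applied to the existing log. The sketch below is illustrative only: the function `evict_old_rows`, the `timestamp` field, and treating one month as 30 days are assumptions, since this page does not state which timestamp is used or how a month is measured.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def evict_old_rows(rows: List[Dict], now: datetime,
                   retain: timedelta) -> List[Dict]:
    """Keep only rows whose timestamp falls inside the retention window."""
    cutoff = now - retain
    return [row for row in rows if row["timestamp"] >= cutoff]


log = [
    {"case": "A", "timestamp": datetime(2024, 1, 5)},
    {"case": "B", "timestamp": datetime(2024, 3, 1)},
]
# Assumption: "one month" is approximated as 30 days.
kept = evict_old_rows(log, now=datetime(2024, 3, 10), retain=timedelta(days=30))
print([row["case"] for row in kept])
```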

Lastly, enter the name of the log file created by the pipeline and specify its path.


Click Schedule to schedule the pipeline. Once scheduled, the pipeline can be managed in the Data pipeline management window.