Setting up remote Spark job execution
To enable running Spark jobs from the Curation Interface (e.g. for restoring versions or committing changes), do the following:
- Create an SSH key pair for the user that is running the Curation Interface API server (see the key setup sketch after this list)
- Copy the created **public** key (`~/.ssh/id_rsa.pub`) into the cluster's `~/.ssh/authorized_keys` file (for the user that will run the Spark jobs)
- Try logging in via SSH (e.g. `ssh <user>@<cluster host>`) and check that it works without a password (if it doesn't, configure your SSH server for public key authentication or fix the access rights of the `.ssh` folder)
- Customize the `run_job.sh` file in the Curation repository with your username and hostname (a sketch of such a script follows below)
- On the cluster node, add a build of the Ingestion pipeline containing the jobs that need to be run from Curation as `~/jars/curation_jobs.jar`, and add the `spark.sh` script from the Ingestion repository as `~/scripts/spark.sh` (see the deployment commands below)
- Modify the paths in the `run_job.sh` script if necessary (e.g. for a different user name)
- Test the job execution by navigating to `http://<curation host>:3000/api/run/versiondiff/667ccd90-5cc4-11e7-9047-dfcf226f2431,aa8ac8e0-5ca9-11e7-aea9-c37dbfcb3b83` (or use the `curl` command shown below)
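The first three steps can be done with standard OpenSSH tools. A minimal sketch, assuming the default RSA key location and an OpenSSH server on the cluster:

```sh
# On the machine running the Curation Interface API server:
ssh-keygen -t rsa                  # create the key pair (accept the default ~/.ssh/id_rsa)

# Copy the PUBLIC key into the cluster user's authorized_keys:
ssh-copy-id <user>@<cluster host>

# If ssh-copy-id is unavailable, append the key manually:
# cat ~/.ssh/id_rsa.pub | ssh <user>@<cluster host> 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'

# Verify that login works without a password:
ssh <user>@<cluster host> echo ok

# If it still asks for a password, fix the access rights on the cluster:
# chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
```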
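The actual `run_job.sh` lives in the Curation repository; the following is only a sketch of what such a script plausibly does (SSH into the cluster and launch the job via `spark.sh`). The variable names and the `spark.sh` argument order here are assumptions, not the repository's real interface:

```sh
#!/usr/bin/env bash
# Hypothetical sketch of run_job.sh -- adjust user, host, and paths to your setup.
CLUSTER_USER="<user>"              # the cluster user that runs the Spark jobs
CLUSTER_HOST="<cluster host>"

JOB="$1"                           # job name, e.g. versiondiff
shift                              # remaining arguments are passed through to the job

# Assumed invocation: spark.sh <jar> <job> <args...>; check the Ingestion
# repository for the real argument order of spark.sh.
ssh "${CLUSTER_USER}@${CLUSTER_HOST}" "~/scripts/spark.sh ~/jars/curation_jobs.jar ${JOB} $*"
```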
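Publishing the jar and the script to the cluster node can be done with `scp`; the local file names below are placeholders for wherever your Ingestion build and `spark.sh` actually live:

```sh
# Create the target directories on the cluster node:
ssh <user>@<cluster host> 'mkdir -p ~/jars ~/scripts'

# Copy the Ingestion build (local file name is an example) and spark.sh:
scp path/to/ingestion-build.jar <user>@<cluster host>:~/jars/curation_jobs.jar
scp path/to/spark.sh            <user>@<cluster host>:~/scripts/spark.sh
ssh <user>@<cluster host> 'chmod +x ~/scripts/spark.sh'
```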
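Instead of the browser, the final test can also be run from the command line; the two UUIDs are the versions to diff, taken from the example URL above:

```sh
curl "http://<curation host>:3000/api/run/versiondiff/667ccd90-5cc4-11e7-9047-dfcf226f2431,aa8ac8e0-5ca9-11e7-aea9-c37dbfcb3b83"
```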