Jenkins Setup

See AWS Setup to set up AWS resources and prepare secure configuration.

There are two ways to set up the Jenkins scheduler. The patched edx-analytics-configuration playbook is being phased out in favor of the edx/configuration/playbook/analytics-jenkins.yml, but many of our existing instances use the old method, so be careful when updating an existing instance.

If you're updating an existing Jenkins instance, check your vars-analytics.yml file. If it contains JENKINS_ANALYTICS_* variables, then use Option 1: ansible-jenkins.yml. Otherwise, use the patched edx-analytics-configuration.

If you're setting up a new Jenkins instance, use Option 1: ansible-jenkins.yml.

Option 1. ansible-jenkins

Use the edx/configuration repository on the director instance. Ensure that it contains the file playbook/analytics-jenkins.yml. If not, consider merging the changes from edx:master, or using the edx-analytics-configuration option below.

Variables and SSH Keys

Update your vars-analytics.yml to use the Jenkins scheduler configuration, and remove the edx-analytics-configuration section. See the Jenkins Analytics README for more information about the variables in that file.

Ensure that the SSH key (e.g. analytics.pem) file can be used to shell into your Jenkins EC2 instance.

Run the playbook, e.g.:

workon edx-configuration
cd configuration/playbooks
ansible-playbook -i '1.2.3.4,' \
                 -e @/home/ubuntu/secure-config/vars-analytics.yml \
                 -u ubuntu \
                 --private-key=/home/ubuntu/secure-config/analytics.pem \
                 --skip-tags="jenkins-seed-job" \
                 analytics-jenkins.yml

Jenkins Seed Job

The steps in the analytics-jenkins.yml playbook tagged with jenkins-seed-job rely on the edx-ops/edx-jenkins-job-dsl repo, which is currently private. The edX Analytics team are working on making this repository public. It contains DSL scripts which create the Jenkins jobs automatically, and configured by files in your secure configuration repo.

However, until that repo is opened up, we need to manually create the Jenkins jobs using the instructions in Setting up Jenkins, and copy the relevant files to /home/jenkins so they can be accessed by the analytics tasks.

To avoid seeing ansible errors about the seed job when running this playbook, we use the --skip-tags="jenkins-seed-job" argument in the ansible command above.

Troubleshooting

  • Jenkins is bound on IPv6: Make sure the vars-analytics.yml file defines jenkins_jvm_args to contain "-Djava.net.preferIPv4Stack=true".

Option 2. edx-analytics-configuration

Checkout open-craft/edx-analytics-configuration/analytics-sandbox on the director instance, install requirements in dedicated virtual environment, and run the jenkins/scheduler.yml playbook, passing vars from the vars.yml file. Make sure to apply analytics-configuration patches (commit range 46ea75c..9d6ea8f) for the playbook to run correctly.

Update your vars-analytics.yml to use the edx-analytics-configuration Jenkins scheduler section, not the edx/configration Jenkins scheduler section.

Commands should look roughly like this:

workon edx-analytics-ansible
cd edx-analytics-configuration
ansible-playbook -i /path/to/analytics_hosts.lst \
                  -e @/path/to/vars.yml \
                  -u ubuntu \
                  --private-key=/path/to/edxapp.pem \
                  jenkins/scheduler.yml

Troubleshooting

  • locale.Error: unsupported locale setting - by default Analytics/Insights instance have broken locale - this AskUbuntu answer was helpful.
  • Jenkins is bound on IPv6: by default Jenkins binds to localhost, which on some systems is translated to ::1 (IPv6 notation). However, nginx proxy_pass might not work with that.
    Make sure this commit is applied/cherry-picked and set jenkins_prefer_ipv4 to true in vars-analytics.yml
  • jenkins_port, jenkins_external_port and INSIGHTS_NGINX_PORT - first controls which port Jenkins app listens to. Second controls which port nginx reverse proxy uses to forward requests to Jenkins app. Third is used by nginx reverse proxy to forward requests to Insights app. The three might conflict, preventing either Jenkins or Insights to be acessible from outside, or even fail to start nginx. Default values, are 8080, 80 and 18110, respectively. In this tutorial, we're confine Jenkins to AWS VPC only, by not exposing it to the world, while also allowind Insights to listen to 80 and 443 ports (default HTTP and HTTPS, in case you don't know :)). In this scenario, values should be 8080, 8081 and 80.

SSH Tunneling to Jenkins

With current setup, Jenkins listens to port 8080, but this port is only accessible from members of director Security Group. In order to access it, we need to set up SSH tunneling:

ssh -i director.pem -L 8080:analytics_ip:8080 [email protected]_ip

After establishing the tunnels you should be able to access Jenkins at localhost:8080

Troubleshooting

  • Can't connect to Jenkins:
  • Check tunneling is enabled
  • SSH to jenkins VM, and try connecting to ::1:8080 (localhost in IPv6 notation). If you get Jenkins response, try connecting to 127.0.0.1:8080 (localhost in IPv4 notation). If no response is available Jenkins have bound on IPv6 only. Finally try connecting to localhost:8080 - if no response is received, than system does not resolve localhost as both (?) IPv4 and IPv6 address. See the Troubleshooting section in your preferred Jenkins setup section for how to fix this issue.

Configuring EMR clusters

Updating from EMR 2.x to EMR 4.x required many changes to the analytics pipeline code, its dependencies, and configuration. See also AWS: Differences introduced in EMR 4.x

Any questions or issues related to the analytics configuration should be posted to the openedx-ops mailing list.

Configuration S3 bucket

EMR 4.7.1

To set up the runtime environment for EMR 4.7.1, download these files from the OpenCraft AWS account, and upload to the client-name-analytics-emr bucket:

  • mysql-connector-java-5.1.35.tar.gz - java library for connecting to mysql. If we ever need an updated version, obtain one from mysql.com, and modify the install-sqoop step in emr-vars.yml to pass --mysql-connector-version=x.y.z as a step argument.
  • edx-analytics-hadoop-util.jar - java library for handling manifest files. Path referenced in analytics-override.cfg setting [manifest] lib_jar, and requires analytics-override.cfg setting [manifest] input_format = org.edx.hadoop.input.ManifestTextInputFormat. Replaces oddjob-1.0.1-standalone.jar.
  • install-sqoop - use a version that supports EMR release 4.x.x

EMR 2.4.11

AWS has dropped support for EMR 2.x.x, so use only when maintaining legacy analytics systems. See resources/emr-2.x for example configuration files appropriate for this version.

Download these files from the OpenCraft AWS account, and upload to the client-name-analytics-emr bucket:

  • mysql-connector-java-5.1.35.tar.gz - java library for connecting to mysql.
  • oddjob-1.0.1-standalone.jar - java library for handling manifest files. Path referenced in analytics-override.cfg setting [manifest] lib_jar, and requires analytics-override.cfg setting [manifest] input_format = oddjob.ManifestTextInputFormat.
  • packages/*.deb - store under a separate packages folder.
  • install-sqoop
  • security.sh - Before uploading to the client's S3, modify this script to fetch its .deb packages from the client's S3 bucket.

Pipeline S3 buckets

Pipeline S3 bucket (named: client-name-edxanalytics) should contain the following files:

  • edxapp_creds - contains credentials to be used to access edxapp DBs (edxapp, ecommerce, etc.). Readonly access is enough and preferred. Example: creds_example
  • edxanalytics_creds - contains credentials to be used to access analytics DBs (analytics-api, reports, etc.). Read-write access is required. Example: creds_example
  • GeoIP.dat - file that maps IP adresses to countries; used by InsertToMysqlCourseEnrollByCountryWorkflow task. A copy is provided here.

File names can be overridden in analytics-override.cfg

Jenkins Env and Configuration Overrides

Ensure these files exist on the analytics instance, and are backed up in the secure config repo:

  • /home/jenkins/jenkins_env: environment variables used when running analytics tasks via Jenkins. See jenkins_env below for details.
  • /home/jenkins/emr-vars.yml: extra variables used to provision the EMR cluster. See emr-vars.yml below for details.
  • /home/jenkins/analytics-override.cfg: configuration for the analytics pipeline. See analytics-override.cfg below for details.

Also, ensure these repositories are cloned and readable by the jenkins user:

  • /home/jenkins/analytics-configuration: clone the client's fork and branch, e.g.:

yaml analytics_configuration_repo: 'https://github.com/xxx/edx-analytics-configuration.git' analytics_configuration_version: 'master'

  • /home/jenkins/analytics-tasks: clone the client's fork and branch, e.g.:

yaml analytics_pipeline_repo: 'https://github.com/xxx/edx-analytics-pipeline.git' analytics_pipeline_version: 'master'

jenkins_env

Running pipeline jobs requires some environment variables to be set - these are listed in jenkins_env. Variables include the S3 buckets created above and others:

  • TRACKING_LOGS_S3_BUCKET="s3://client-name-tracking-logs" - bucket containing edxapp tracking logs
  • HADOOP_S3_BUCKET="s3://client-name-edxanalytics" - bucket for temporary/intermediate storage of hadoop files
  • TASK_CONFIGURATION_S3_BUCKET="s3://client-name-analytics-emr" - bucket containing task configuration files
  • EXTRA_VARS="@/home/jenkins/emr-vars.yml" - ansible configuration for provisioning EMR cluster. See emr-vars.yml below.
  • CLUSTER_NAME="Client Name Analytics Cluster" - default cluster name. See CLUSTER_NAME below.
  • OVERRIDE_CONFIG - provides secure configuration variables to the EMR cluster. See analytics-override.cfg below.

CLUSTER_NAME

The emr-vars.yml file defines a name variable, which is the identifier for the EMR cluster. However, the analytics scripts use CLUSTER_NAME to lookup the cluster, and so these variables must match, otherwise the lookup will fail. Additionally, it's a good idea to use a different CLUSTER_NAME for each analytics task, to allow them to run in parallel on different clusters. To achieve this, we override the default CLUSTER_NAME with a unique name for each analytics task in its Jenkins Job Command.

So to ensure that the emr-vars.yml cluster name matches the CLUSTER_NAME environment variable, use a lookup:

---
name: "{{ lookup('env', 'CLUSTER_NAME') }}"

analytics-override.cfg

The OVERRIDE_CONFIG in jenkins_env points to this file. It is used to provide secure configuration variables to EMR cluster and should look like analytics-overide.cfg. This file contains links to S3 EMR config, EMR log and tracking log S3 buckets - example uses S3 bucket names suggested in this walkthrough.

Note that default edx-analytics-pipeline uses different approach to provide secure config:

  • --secure-config-repo $SECURE_REPO - specifies GIT repo with secure configuration.
  • --secure-config-branch $SECURE_BRANCH - specifies branch in that repo to be used.
  • --secure-config $SECURE_CONFIG - specifies configuration file in that repo to be used.

Make sure to check what approach is used in current setup branches and alter jenkins_env accordingly.

emr-vars.yml

The emr-vars.yml file is passed to the ansible playbook that handles EMR provisioning. See emr-vars.yml for an example. You'll need to update the S3 bucket names as per the S3 buckets you created.

Jenkins analytics jobs

See Jenkins Jobs for how to manually create the jenkins jobs.

See the Jenkins Seed Job section for information on automatic Jenkins job creation.

Troubleshooting

  • EMR provisioning: ansible unable to access new EMR instance. See SSH Access to EMR. Ensure the ElasticMapReduduce-master security group has inbound SSH access from the analytics security group.

Another way to provide Jenkins with access is to to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables in jenkins_env, however it's very important to ensure that your Jenkins logs are not publicly visible, because these variables will be echoed to the Console output.

To use AWS keys, create a new analytics IAM user and use the ID and KEY for these variables. Note that this user should have provision_emr_clusters policy attached, otherwise trying to provision the cluster will fail with:

ClientError: An error occurred (AccessDeniedException) when calling the ListClusters operation: User: arn:aws:iam::123456789012:user/analytics_user is not authorized to perform: elasticmapreduce:ListClusters

bash ClientError: An error occurred (AccessDeniedException) when calling the ListClusters operation: User: arn:aws:iam::123456789012:user/analytics_user is not authorized to perform: elasticmapreduce:ListClusters * java.lang.UnsupportedClassVersionError: org/edx/hadoop/input/ManifestTextInputFormat : Unsupported major.minor version 51.0 or 52.0. This error occurs if the edx-analytics-hadoop-util.jar you're using for your manifest.lib_jar was compiled using a different version of java than what's running on the EMR cluster. The easiest way to rebuild the edx-analytics-hadoop-util.jar using the correct java version, and the required hadoop libraries, is to:

1. Launch an EMR cluster using the version of EMR configured for your analytics tasks.

    Alternately, run one of the failing tasks with `export TERMINATE=false`
    in the environment, and this will leave the EMR cluster running after
    the job has failed.

    Note the EMR Cluster ID for the `aws emr` step below.
1. Create a virtualenv, and install awscli:
    ```bash
    pip install awscli
    ```
1. Create an IAM user and attach the `provision_emr_clusters` policy you created above.
1. Using the AWS Access key ID and secret, authenticate your awscli:
    ```bash
    aws configure
    ```
  1. Shell into the EMR cluster using the analytics.pem file: bash aws emr ssh --cluster-id j-xxxxxxxxxxxx --key-pair-file=analytics.pem
  2. Clone the edx-analytics-hadoop-util repo, and build the jar file: bash git clone https://github.com/edx/edx-analytics-hadoop-util cd edx-analytics-hadoop-util javac -cp "/usr/lib/hadoop/client/*" org/edx/hadoop/input/ManifestTextInputFormat.java jar cf edx-analytics-hadoop-util.jar org/edx/hadoop/input/ManifestTextInputFormat.class

  3. EMR provisioning fails on the hive_install step with the following in stderr log: bash Exception in thread "main" com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Moved Permanently (Service: Amazon S3; Status Code: 301; Error Code: 301 Moved Permanently; Request ID: A80C873649993B68), S3 Extended Request ID: z0mA1W5N329bG+Sznq/j7G2g5gsRgKWlzqdoRmYVoCIyELiv0CNk+hmbcm2fkd7G30c7Gzs7xXk= May occur if you're running on a region other than us-east-1. See emr-vars.yml configuration for core-site to set the fs.s3n.endpoint.

  4. EMR provisioning fails during provisioning with: bash The subnet configuration was invalid: No route to any external sources detected in Route Table for Subnet: subnet-xxxxx for VPC: vpc-xxxxx This could mean you have not created an Internet Gateway for your VPC. See VPC DNS Hostname