WMArchive Tools

Here we outline the useful tools available in WMArchive. All tools are written in Python, and an associated bash wrapper script is provided for each of them (mostly for convenience, e.g. to set up a proper PYTHONPATH). The majority of users will only need the wma_client tool, while advanced users can use either myspark (to submit Spark jobs) or mrjob (to submit MapReduce jobs).


The wma_client.py tool queries data in WMArchive.
Usage: wma_client.py [options]
For more help please visit https://github.com/dmwm/WMArchive/wiki

Options:
  -h, --help         show this help message and exit
  --verbose=VERBOSE  verbosity level
  --spec=SPEC        specify query spec file, default query.json
  --host=HOST        host name of WMArchive server, default is
                     https://vocms013.cern.ch
  --key=CKEY         specify private key file name, default $X509_USER_PROXY
  --cert=CERT        specify private certificate file name, default
                     $X509_USER_PROXY
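
Example: query WMArchive with a spec file. The spec layout shown here (a
"spec" section with a timerange plus a "fields" list) is an illustrative
assumption; please consult the wiki above for the exact query syntax.

wma_client.py --spec=query.json

where query.json may look like

{"spec": {"lfn": "/store/data/file.root", "timerange": [20160101, 20160102]}, "fields": ["wmaid"]}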

The myspark tool can be used to execute Python code on the Spark platform. The end user is responsible for writing his/her own executor (mapper) function; a sketch of such a script is given after the example below.
myspark --help
usage: PROG [-h] [--hdir HDIR] [--schema SCHEMA] [--script SCRIPT]
            [--spec SPEC] [--yarn] [--store STORE] [--wmaid WMAID]
            [--ckey CKEY] [--cert CERT] [--verbose]

optional arguments:
  -h, --help       show this help message and exit
  --hdir HDIR      Input data location on HDFS, e.g. hdfs:///path/data
  --schema SCHEMA  Input schema, e.g. hdfs:///path/fwjr.avsc
  --script SCRIPT  python script with custom mapper/reducer functions
  --spec SPEC      json file with query spec
  --yarn           run job on analytics cluster via yarn resource manager
  --store STORE    store results into WMArchive, provide WMArchive URL
  --wmaid WMAID    provide wmaid for store submission
  --ckey CKEY      specify private key file name, default $X509_USER_PROXY
  --cert CERT      specify private certificate file name, default
                   $X509_USER_PROXY
  --verbose        verbose output

Example: execute a Spark job on HDFS with a provided script and query JSON file
myspark --hdir=hdfs:///hdfspath/data
        --schema=hdfs:///hdfspath/schema.avsc
        --script=/path/WMArchive/Tools/RecordFinder.py
        --spec=query.json
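
The --script file carries the user code executed on Spark. Below is a minimal
sketch of such an executor; the MapReduce class name, the mapper(records)
signature, and the record fields used here are assumptions for illustration
only, see WMArchive/Tools/RecordFinder.py for the actual contract.

# Illustrative myspark executor (Python); the interface shown, a MapReduce
# class whose mapper receives an iterable of records, is an assumption, see
# WMArchive/Tools/RecordFinder.py for the real one.
class MapReduce(object):
    def __init__(self, ispec=None):
        # ispec holds the query spec passed via --spec, if any (assumption)
        self.spec = ispec or {}

    def mapper(self, records):
        "Collect the wmaid of every record whose task matches the spec"
        query = self.spec.get('task', '')
        out = []
        for rec in records:
            if rec and query and query in rec.get('task', ''):
                out.append(rec.get('wmaid'))
        return out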

The mrjob tool generates (and optionally executes) an MR bash script for the end user; a sketch of a user MR file and an example invocation are given after the help output below.
mrjob --help
usage: mrjob [-h] [--hdir HDIR] [--odir ODIR] [--schema SCHEMA] [--mrpy MRPY]
             [--pydoop PYDOOP] [--avro AVRO] [--execute] [--verbose]

Tool to generate and/or execute a MapReduce (MR) script. The code is generated
from an MR skeleton provided by WMArchive and a user-supplied MR file. The
latter must contain two functions, mapper(ctx) and reducer(ctx), for the given
context. Their simplest implementation can be found in
WMArchive/MapReduce/mruser.py. Based on that code, please create your own
mapper/reducer functions and use this tool to generate the final MR script.

optional arguments:
  -h, --help       show this help message and exit
  --hdir HDIR      HDFS input data directory
  --odir ODIR      HDFS output directory for MR jobs
  --schema SCHEMA  Data schema file on HDFS
  --mrpy MRPY      MapReduce python script
  --pydoop PYDOOP  pydoop archive file, e.g. /path/pydoop.tgz
  --avro AVRO      avro archive file, e.g. /path/avro.tgz
  --execute        Execute generated MR job script
  --verbose        Verbose output
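
As described above, the user MR file must define mapper(ctx) and reducer(ctx).
A minimal sketch follows; the ctx attributes used here (ctx.value, ctx.key,
ctx.values, ctx.emit) mirror a pydoop-style context and are assumptions,
WMArchive/MapReduce/mruser.py is the authoritative reference.

# Illustrative user MR file for mrjob (Python); the ctx attributes are
# assumptions, see WMArchive/MapReduce/mruser.py for the real context API.
def mapper(ctx):
    "Emit a (wmaid, 1) pair for every record read from the context"
    rec = ctx.value  # assumption: the skeleton exposes the Avro record here
    if rec and 'wmaid' in rec:
        ctx.emit(rec['wmaid'], 1)

def reducer(ctx):
    "Sum the counts gathered for a single key"
    ctx.emit(ctx.key, sum(ctx.values))

Example: generate the MR script from a user file and execute it via this tool
(all paths are illustrative)
mrjob --hdir=hdfs:///hdfspath/data
      --odir=hdfs:///hdfspath/out
      --schema=hdfs:///hdfspath/fwjr.avsc
      --mrpy=mruser.py
      --pydoop=/path/pydoop.tgz
      --avro=/path/avro.tgz
      --execute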