WMArchive Tools
Here we outline useful tools available in WMArchive. All tools are written in Python,
and an associated bash wrapper script is provided for each of them (mostly for
convenience, e.g. to set up a proper PYTHONPATH). The majority of users will only
need the wma_client tool, while advanced users can use either myspark (to submit
Spark jobs) or mrjob (to submit MapReduce jobs).
wma_client
Tool to query data in WMArchive.
Usage: wma_client.py [options]
For more help please visit https://github.com/dmwm/WMArchive/wiki
Options:
-h, --help show this help message and exit
--verbose=VERBOSE verbosity level
--spec=SPEC specify query spec file, default query.json
--host=HOST host name of WMArchive server, default is
https://vocms013.cern.ch
--key=CKEY specify private key file name, default $X509_USER_PROXY
--cert=CERT specify private certificate file name, default
$X509_USER_PROXY
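The file passed via --spec is a JSON query spec. Below is a minimal sketch of such
a file; the spec/fields layout with lfn and timerange keys follows the common
WMArchive query pattern, but the exact keys and field names depend on the record
schema, so treat them as illustrative.

{"spec": {"lfn": "/store/data/file.root", "timerange": [20160101, 20160102]},
 "fields": ["wmaid"]}

With the spec in place, a query against the default server looks like:

wma_client.py --spec=query.json --host=https://vocms013.cern.ch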
myspark
This tool can be used to execute Python code on the Spark platform. End users are responsible for writing their own executor (mapper) function; a sketch is provided after the example below.
myspark --help
usage: PROG [-h] [--hdir HDIR] [--schema SCHEMA] [--script SCRIPT]
[--spec SPEC] [--yarn] [--store STORE] [--wmaid WMAID]
[--ckey CKEY] [--cert CERT] [--verbose]
optional arguments:
-h, --help show this help message and exit
--hdir HDIR Input data location on HDFS, e.g. hdfs:///path/data
--schema SCHEMA Input schema, e.g. hdfs:///path/fwjr.avsc
--script SCRIPT python script with custom mapper/reducer functions
--spec SPEC json file with query spec
--yarn run job on analytics cluster via yarn resource manager
--store STORE    store results into WMArchive, provide WMArchive URL
--wmaid WMAID provide wmaid for store submission
--ckey CKEY specify private key file name, default $X509_USER_PROXY
--cert CERT specify private certificate file name, default
$X509_USER_PROXY
--verbose verbose output
Example: execute a Spark job on HDFS with the provided script and query JSON file
myspark --hdir=hdfs:///hdfspath/data \
    --schema=hdfs:///hdfspath/schema.avsc \
    --script=/path/WMArchive/Tools/RecordFinder.py \
    --spec=query.json
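The --script file holds the user executor. Below is a minimal sketch of its
structure, assuming (as in the RecordFinder.py example above) that myspark loads a
class named MapReduce from the script and passes the parsed --spec JSON to its
constructor; verify the exact interface against WMArchive/Tools/RecordFinder.py.
The 'task' field name is an illustrative assumption, not guaranteed by the schema.

class MapReduce(object):
    "Hypothetical user executor for myspark; the interface is a sketch"
    def __init__(self, spec=None):
        # spec is assumed to be the parsed content of the --spec JSON file
        self.spec = spec or {}

    def mapper(self, records):
        "Select records matching the query spec"
        matched = []
        for rec in records:
            if not rec:
                continue
            # 'task' is an illustrative field name
            if rec.get('task') == self.spec.get('task'):
                matched.append(rec)
        return matched

    def reducer(self, records, init=0):
        "Aggregate mapper output; here a simple record count"
        return len(records) + init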
mrjob
This tool generates (and optionally executes) an MR bash script for the end user.
mrjob --help
usage: mrjob [-h] [--hdir HDIR] [--odir ODIR] [--schema SCHEMA] [--mrpy MRPY]
[--pydoop PYDOOP] [--avro AVRO] [--execute] [--verbose]
Tool to generate and/or execute a MapReduce (MR) script. The code is generated
from an MR skeleton provided by WMArchive and a user-supplied MR file. The latter
must contain two functions, mapper(ctx) and reducer(ctx), for the given context.
Their simplest implementation can be found in WMArchive/MapReduce/mruser.py.
Based on this code, please create your own mapper/reducer functions and use this
tool to generate the final MR script; a sketch is provided after the option list
below.
optional arguments:
-h, --help show this help message and exit
--hdir HDIR HDFS input data directory
--odir ODIR HDFS output directory for MR jobs
--schema SCHEMA Data schema file on HDFS
--mrpy MRPY MapReduce python script
--pydoop PYDOOP pydoop archive file, e.g. /path/pydoop.tgz
--avro AVRO avro archive file, e.g. /path/avro.tgz
--execute        Execute the generated MR job script
--verbose Verbose output
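Below is a minimal sketch of a user MR file (the --mrpy argument), following the
mapper(ctx)/reducer(ctx) contract described above. The context attributes used
here (ctx.value, ctx.emit, ctx.key, ctx.values) follow the usual pydoop MapReduce
API and should be checked against WMArchive/MapReduce/mruser.py; the 'task' field
name is an illustrative assumption.

def mapper(ctx):
    "Read one avro record from the context and emit a key/value pair"
    rec = ctx.value  # with avro input the record arrives as a dict (assumption)
    if rec:
        # 'task' is an illustrative field name
        ctx.emit(rec.get('task', 'unknown'), 1)

def reducer(ctx):
    "Sum the counts emitted by the mapper for each key"
    ctx.emit(ctx.key, sum(ctx.values))

A hypothetical invocation, where usercode.py stands in for your MR file:

mrjob --hdir=hdfs:///hdfspath/data --odir=hdfs:///hdfspath/mrout \
    --schema=hdfs:///hdfspath/schema.avsc --mrpy=usercode.py --execute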