WMArchive is a RESTful service, therefore clients can use any tool or language to contact it and issue GET/POST HTTP requests to retrieve or post their data. Below we provide examples for curl and python clients.
POST requests can be used to query documents in WMArchive:
# define some parameters
OUT=/dev/stdout
HEADERS="Content-type: application/json"
URL=http://hostname:port/wmarchive/data

# prepare your query.json file
{"spec":{"task":"AbcCde09[0-9]", "timerange":[20160101,20160102]}, "fields":["task"]}

# single document injection, POST request; note the @ which tells
# curl to read the payload from the query.json file
curl -D $OUT -X POST -H "$HEADERS" -d @query.json $URL

# depending on the provided timerange you'll either get results
# or you'll be given the UID of your request, which can be used later
# to locate your results
GET requests can be used to fetch results from WMArchive that were previously placed via a POST request:
# single document retrieval, GET request for the provided UID
curl -D $OUT -H "$HEADERS" $URL/UID
import json, httplib

# connection parameters of the WMArchive service (placeholders)
host = "hostname"
port = 1234
path = "/wmarchive/data"
headers = {"Content-type": "application/json"}

# get connection to our service
conn = httplib.HTTPConnection(host, port)

# get data from WMArchive
query = dict(spec={"task":"taskName", "timerange":[20160101,20160102]},
             fields=["task"])
conn.request("POST", path, json.dumps(query), headers)
response = conn.getresponse()
print("STATUS", response.status, "REASON", response.reason)

# read our results
res = response.read()
data = json.loads(res)
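For completeness, here is a minimal python sketch of the corresponding GET request, assuming the POST above returned a UID (host, port and headers are the same placeholders as before; the UID value is illustrative):

import json, httplib

host = "hostname"   # placeholder, adjust to your WMArchive setup
port = 1234         # placeholder
headers = {"Content-type": "application/json"}
uid = "6b0bace20fc732563316198d1ed2b94e"  # UID returned by the POST request

# single document retrieval for the given UID
conn = httplib.HTTPConnection(host, port)
conn.request("GET", "/wmarchive/data/%s" % uid, None, headers)
response = conn.getresponse()
print("STATUS", response.status, "REASON", response.reason)
data = json.loads(response.read())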
MapReduce jobs are supported by the WMArchive service.
Users need to write two functions, a mapper and a reducer, for their task; see the example below. To run MR jobs you will need an account on the analytix.cern.ch node: log in there, set up the WMArchive environment, and run the mrjob script.
mrjob --hdir=hdfs://host:port/path/data \
      --odir=hdfs://host:port/path/out \
      --schema=hdfs://host:port/path/schema.avsc \
      --mrpy=mr.py --pydoop=/path/pydoop.tgz --avro=/path/avro.tgz
def mapper(ctx):
    "Read given context and yield key (job-id) and values (task)"
    rec = ctx.value
    jid = rec["jobid"]
    if jid is not None:
        ctx.emit(jid, rec["fwjr"]["task"])

def reducer(ctx):
    "Emit empty key and some data structure via given context"
    ctx.emit("", {"jobid": ctx.key, "task": ctx.values})
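The ctx object is supplied by the MR framework at run time. For a quick local sanity check of your functions before submitting a job, you can emulate it with a small stub; the FakeContext class below is purely illustrative and not part of WMArchive:

class FakeContext(object):
    "Minimal stand-in for the MR context, for local testing only"
    def __init__(self, key=None, value=None, values=None):
        self.key = key          # reducer input key
        self.value = value      # mapper input record
        self.values = values    # reducer input values
        self.emitted = []       # collected (key, value) pairs
    def emit(self, key, value):
        self.emitted.append((key, value))

# feed a toy record through the mapper
ctx = FakeContext(value={"jobid": 1, "fwjr": {"task": "/Abc/RECO"}})
mapper(ctx)
print(ctx.emitted)   # [(1, '/Abc/RECO')]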
Queries in WMArchive should be placed via POST requests in the form of a Query Language (QL). The QL specifies condition keys and the document fields to be retrieved. We ended up using JSON as the QL data format, similar to the MongoDB QL. For example:
{"spec": {"lfn":"bla*123.root", "task": "bla", "run":[1,2,3]}, "fields":["logs"]}
Here the spec is a dictionary in which we define the actual conditions, such as an lfn pattern, a task name and a run list. The output field would be the logs key of the FWJR documents stored in WMArchive.
WMArchive will support three attributes which the user may provide:
- spec defines the condition dictionary; each spec must have a timerange key-value pair
  pointing to the specific time range the user wants to look up, e.g.
  "timerange":[20160101,20160102], where the values in the timerange list must be in
  YYYYMMDD format
- fields defines the list of attributes to be retrieved from the FWJR and returned to the user
- aggregate defines the list of attributes we need to aggregate

A combined example is shown after this list.
The QL should be flexible enough to accommodate user queries, see the next section. For that we can use a JSON document to define our queries, see the example in the previous section. Since we use nested structures in our FWJR documents, we need to refer to nested attributes via dot notation, e.g. output.inputDataset, meta_data.agent_ver, etc. JSON queries can easily be translated into the MongoDB QL. Here are the simplest QL rule definitions (mostly aligned with the MongoDB QL):
{"$or": [JSON1, JSON2]}structure.
The WMArchive code will translate a user query either into a MongoDB query or into one suitable for querying documents on HDFS. Here are a few examples of the QL syntax:
# use multiple keys
{"task":"abc", "run":1}

# use patterns
{"lfn":"/a/v/b*/123.root"}

# use or conditions
{"$or": [{"task":"abc", "lfn":"/a/v/b*/123.root"}, {"run":[1,2,3]}]}

# use array values, i.e. find docs with all given run numbers
{"run":[1,2,3]}

# usage of $gt, $lt operators
{"run":{"$gt":1}}
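To illustrate the translation step, here is a rough sketch of how asterisk patterns and value lists could be mapped onto MongoDB operators. This is a hypothetical illustration, not the actual WMArchive implementation:

import re

def compile_spec(spec):
    """
    Hypothetical QL -> MongoDB translation sketch, assuming asterisk
    patterns map onto $regex and lists onto $all.
    """
    mongo = {}
    for key, val in spec.items():
        if isinstance(val, str) and "*" in val:
            # translate the asterisk wildcard into an anchored regex
            mongo[key] = {"$regex": "^" + re.escape(val).replace("\\*", ".*") + "$"}
        elif isinstance(val, list):
            # find docs containing all given values
            mongo[key] = {"$all": val}
        else:
            # exact match or operator dict, e.g. {"$gt": 1}
            mongo[key] = val
    return mongo

print(compile_spec({"lfn": "/a/v/b*/123.root", "run": [1, 2, 3]}))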
# User query
{"spec":{"task": "/AbcCde_Task_Data_test_6735909/RECO"}, "fields":["wmaid"]}

# client call and output
wma_client.py --spec=query.json
{"result": [{"status": "ok",
             "input": {"fields": ["wmaid"],
                       "spec": {"task": "/AbcCde_Task_Data_test_6735909/RECO"}},
             "storage": "mongodb",
             "results": [{"wmaid": "6b0bace20fc732563316198d1ed2b94e"}]}]}
Here we outline the useful tools available in WMArchive. All tools are written in python, and an associated bash wrapper script is provided for each of them (mostly for convenience, e.g. to set up a proper PYTHONPATH). The majority of users will only need the wma_client tool, while advanced users can use either myspark (to place Spark jobs) or mrjob (to place MapReduce jobs).
wma_client.py is a tool to query data in WMArchive.
Usage: wma_client.py [options]
For more help please visit https://github.com/dmwm/WMArchive/wiki

Options:
  -h, --help         show this help message and exit
  --verbose=VERBOSE  verbosity level
  --spec=SPEC        specify query spec file, default query.json
  --host=HOST        host name of WMArchive server, default is
                     https://vocms013.cern.ch
  --key=CKEY         specify private key file name, default $X509_USER_PROXY
  --cert=CERT        specify private certificate file name, default
                     $X509_USER_PROXY
The myspark tool can be used to execute python code on the Spark platform. End-users are responsible for writing their own executor (mapper) function.
myspark --help
usage: PROG [-h] [--hdir HDIR] [--schema SCHEMA] [--script SCRIPT]
            [--spec SPEC] [--yarn] [--store STORE] [--wmaid WMAID]
            [--ckey CKEY] [--cert CERT] [--verbose]

optional arguments:
  -h, --help       show this help message and exit
  --hdir HDIR      Input data location on HDFS, e.g. hdfs:///path/data
  --schema SCHEMA  Input schema, e.g. hdfs:///path/fwjr.avsc
  --script SCRIPT  python script with custom mapper/reducer functions
  --spec SPEC      json file with query spec
  --yarn           run job on analytics cluster via yarn resource manager
  --store STORE    store results into WMArchive, provide WMArchive url
  --wmaid WMAID    provide wmaid for store submission
  --ckey CKEY      specify private key file name, default $X509_USER_PROXY
  --cert CERT      specify private certificate file name, default
                   $X509_USER_PROXY
  --verbose        verbose output

Example: executes Spark job on HDFS with provided script and query JSON file

myspark --hdir=hdfs:///hdfspath/data \
        --schema=hdfs:///hdfspath/schema.avsc \
        --script=/path/WMArchive/Tools/RecordFinder.py \
        --spec=query.json
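The interface the --script file must implement is defined by myspark itself; WMArchive/Tools/RecordFinder.py is the reference example shipped with WMArchive. As a rough, hypothetical sketch (class name, method names and signatures should be verified against RecordFinder.py), a custom script may look like this:

class MapReduce(object):
    "Hypothetical skeleton of a myspark user script; see RecordFinder.py"
    def __init__(self, spec=None):
        self.spec = spec or {}   # query spec passed via --spec
    def mapper(self, records):
        "Select records whose task matches the one in our spec"
        task = self.spec.get("spec", {}).get("task")
        return [r for r in records if r and r.get("task") == task]
    def reducer(self, records, init=0):
        "Collapse the selected records into a single summary"
        return {"nrecords": len(list(records))}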
mrjob generates (and optionally executes) an MR bash script for the end-user.
mrjob --help
usage: mrjob [-h] [--hdir HDIR] [--odir ODIR] [--schema SCHEMA] [--mrpy MRPY]
             [--pydoop PYDOOP] [--avro AVRO] [--execute] [--verbose]

Tool to generate and/or execute MapReduce (MR) script. The code is generated
from an MR skeleton provided by WMArchive and a user-based MR file. The latter
must contain two functions: mapper(ctx) and reducer(ctx) for a given context.
Their simplest implementation can be found in WMArchive/MapReduce/mruser.py.
Based on this code please create your own mapper/reducer functions and use
this tool to generate the final MR script.

optional arguments:
  -h, --help       show this help message and exit
  --hdir HDIR      HDFS input data directory
  --odir ODIR      HDFS output directory for MR jobs
  --schema SCHEMA  Data schema file on HDFS
  --mrpy MRPY      MapReduce python script
  --pydoop PYDOOP  pydoop archive file, e.g. /path/pydoop.tgz
  --avro AVRO      avro archive file, e.g. /path/avro.tgz
  --execute        Execute generated MR job script
  --verbose        Verbose output