WMArchive APIs

WMArchive is a RESTful service, therefore clients can use any tool or language to contact the service and issue GET/POST HTTP requests to retrieve or post their data. Below we provide examples for curl and Python clients.


Curl client

POST requests can be used to query documents in WMArchive:

# define some parameters
OUT=/dev/stdout
HEADERS="Content-type: application/json"
URL=http://hostname:port/wmarchive/data

# prepare your query.json file
{"spec":{"task":"AbcCde09[0-9]", "timerange":[20160101,20160102]}, "fields":["task"]}

# single document injection, POST request
curl -D $OUT -X POST -H "$HEADERS" -d @query.json $URL

# depending on the provided timerange you'll either get results back
# directly or be given a UID for your request, which can be used
# later to locate your results

GET requests can be used to fetch results of a query previously placed in WMArchive via a POST request:

# single document retrieval, GET request for provided UID
curl -D $OUT -H "$HEADERS" $URL/UID

Python client

import json, httplib

# connection parameters of the WMArchive service (adjust to your server)
host = "hostname"
port = 1234  # placeholder, use the port of your WMArchive server
path = "/wmarchive/data"
headers = {"Content-type": "application/json"}

# get connection to our service
conn = httplib.HTTPConnection(host, port)

# post our query to WMArchive
query = dict(spec={"task": "taskName", "timerange": [20160101, 20160102]}, fields=["task"])
conn.request("POST", path, json.dumps(query), headers)
response = conn.getresponse()
print("STATUS", response.status, "REASON", response.reason)

# read our results
res = response.read()
data = json.loads(res)
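
If the POST request returns a UID instead of the results themselves (see the curl example above), the results can be fetched later with a GET request against the same endpoint. Below is a minimal sketch of such a lookup, reusing the connection and variables from the example above; the uid value is hypothetical and would come from the POST response.

# fetch results for a previously assigned UID (hypothetical value)
uid = "6b0bace20fc732563316198d1ed2b94e"
conn.request("GET", "%s/%s" % (path, uid), headers=headers)
response = conn.getresponse()
data = json.loads(response.read())
conn.close()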

Map-Reduce

Map-Reduce (MR) jobs are supported by the WMArchive service.

Please note that there is not yet a web UI for MR jobs.

Users are required to write two functions, a mapper and a reducer, for their task; see the example below. To run MR jobs you will need an account on the analytix.cern.ch node; log in there, set up the WMArchive environment and run the mrjob script.

mrjob --hdir=hdfs://host:port/path/data \
      --odir=hdfs://host:port/path/out \
      --schema=hdfs://host:port/path/schema.avsc \
      --mrpy=mr.py --pydoop=/path/pydoop.tgz --avro=/path/avro.tgz

Example of mapper/reducer functions

def mapper(ctx):
    "Read given context and yield key (job-id) and values (task)"
    rec = ctx.value
    jid = rec["jobid"]
    if jid is not None:
        ctx.emit(jid, rec["fwjr"]["task"])
def reducer(ctx):
    "Emit empty key and some data structure via given context"
    ctx.emit("", {"jobid": ctx.key, "task": ctx.values})
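
The reducer above simply forwards the collected values. If some aggregation is needed, it can be done inside the reducer using the same context interface; here is a minimal sketch, assuming ctx.values iterates over the task names emitted by the mapper for a given job id.

def reducer(ctx):
    "Count the values collected for a given job id and emit a summary"
    tasks = list(ctx.values)
    ctx.emit("", {"jobid": ctx.key, "ntasks": len(tasks), "tasks": tasks})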

Queries

Queries in WMArchive are placed via a POST request in the form of a Query Language (QL). The QL specifies the condition keys and the document fields to be retrieved. We ended up using JSON as the QL data format, similar to the MongoDB QL. For example:

{"spec": {"lfn":"bla*123.root", "task": "bla", "run":[1,2,3]}, "fields":["logs"]}

Here our conditions spec is a dictionary holding the actual conditions, such as an lfn pattern, a task name and a run list. The output fields correspond to the logs key of the FWJR documents stored in WMArchive.

WMArchive supports three attributes which the user may provide:

spec defines the condition dictionary. Each spec must have a timerange key-value pair which points to the specific timerange in which the user wants to look up the data, e.g.

"timerange":[20160101, 20160102]

The values in the timerange list should be given in YYYYMMDD format.

fields defines the list of attributes to be retrieved from the FWJR documents and returned to the user.

aggregate defines the list of attributes to aggregate.
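
For example, a query file using all three attributes could be prepared with a small Python snippet like the sketch below; the choice of fields and aggregated attributes is purely illustrative.

import json

# minimal sketch of a query using spec, fields and aggregate
# (the field names below are illustrative, not a fixed schema)
query = {
    "spec": {"task": "AbcCde09[0-9]", "timerange": [20160101, 20160102]},
    "fields": ["task", "logs"],
    "aggregate": ["task"],
}
with open("query.json", "w") as fobj:
    json.dump(query, fobj)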

Query Language

The QL should be flexible enough to accommodate user queries, see the next section. For that we use a JSON document to define our queries, as shown in the example in the previous section. Since FWJR documents have a nested structure, nested attributes are referenced via dot notation, e.g. output.inputDataset, meta_data.agent_ver, etc. JSON queries can be easily translated into the MongoDB QL, and the simplest QL rules are mostly aligned with it.

The WMArchive code translates a user query either into a MongoDB query or into one suitable for querying documents on HDFS. Here are a few examples of the QL syntax:

# use multiple keys
{"task":"abc", "run":1}

# use patterns
{"lfn":"/a/v/b*/123.root"}

# use or conditions
{"$or": [{"task":"abc", "lfn":"/a/v/b*/123.root"}, {"run":[1,2,3]}]}

# use array values, i.e. find docs with all given run numbers
{"run":[1,2,3]}

# usage of $gt, $lt operators
{"run":{"$gt":1}}
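
As mentioned above, nested FWJR attributes are referenced via dot notation; a condition dictionary using such attributes could look like the sketch below (the attribute names come from the examples above, their values are illustrative):

# use dot notation for nested attributes
{"meta_data.agent_ver": "1.0.9", "output.inputDataset": "/a/b/RECO"}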

Example of a user query

# User query
{"spec":{"task": "/AbcCde_Task_Data_test_6735909/RECO"}, "fields":["wmaid"]}

# client call and output
wma_client.py --spec=query.json
{"result": [{"status": "ok",
             "input": {"fields": ["wmaid"],
             "spec": {"task": "/AbcCde_Task_Data_test_6735909/RECO"}},
             "storage": "mongodb",
             "results": [{"wmaid": "6b0bace20fc732563316198d1ed2b94e"}]}]}

Tools

Here we outline the useful tools available in WMArchive. All tools are written in Python, and an associated bash wrapper script is provided for each of them (mostly for convenience, e.g. to set up the proper PYTHONPATH). The majority of users will only need the wma_client tool, while advanced users can use either the myspark tool (to submit Spark jobs) or the mrjob tool (to submit MapReduce jobs).


The wma_client.py tool is used to query data in WMArchive.

Usage: wma_client.py [options]
For more help please visit https://github.com/dmwm/WMArchive/wiki

Options:
  -h, --help         show this help message and exit
  --verbose=VERBOSE  verbosity level
  --spec=SPEC        specify query spec file, default query.json
  --host=HOST        host name of WMArchive server, default is
                     https://vocms013.cern.ch
  --key=CKEY         specify private key file name, default $X509_USER_PROXY
  --cert=CERT        specify private certificate file name, default
                     $X509_USER_PROXY
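
For example, a typical invocation against a specific WMArchive server could look like the sketch below; the host matches the default shown above, and both key and cert fall back to $X509_USER_PROXY anyway, so they are listed only for completeness.

wma_client.py --spec=query.json --host=https://vocms013.cern.ch \
              --key=$X509_USER_PROXY --cert=$X509_USER_PROXY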

The myspark tool can be used to execute Python code on the Spark platform. The end-user is responsible for writing his/her own executor (mapper) function.

myspark --help
usage: PROG [-h] [--hdir HDIR] [--schema SCHEMA] [--script SCRIPT]
            [--spec SPEC] [--yarn] [--store STORE] [--wmaid WMAID]
            [--ckey CKEY] [--cert CERT] [--verbose]

optional arguments:
  -h, --help       show this help message and exit
  --hdir HDIR      Input data location on HDFS, e.g. hdfs:///path/data
  --schema SCHEMA  Input schema, e.g. hdfs:///path/fwjr.avsc
  --script SCRIPT  python script with custom mapper/reducer functions
  --spec SPEC      json file with query spec
  --yarn           run job on analytics cluster via yarn resource manager
  --store STORE    store results into WMArchive, provide WMArchive url
  --wmaid WMAID    provide wmaid for store submission
  --ckey CKEY      specify private key file name, default $X509_USER_PROXY
  --cert CERT      specify private certificate file name, default
                   $X509_USER_PROXY
  --verbose        verbose output

Example: execute a Spark job on HDFS with the provided script and query JSON file
myspark --hdir=hdfs:///hdfspath/data \
        --schema=hdfs:///hdfspath/schema.avsc \
        --script=/path/WMArchive/Tools/RecordFinder.py \
        --spec=query.json
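
To run the same job on the analytics cluster via the yarn resource manager, the documented --yarn flag can be added (and results can optionally be pushed back into WMArchive via the --store and --wmaid options); a sketch of such an invocation:

myspark --hdir=hdfs:///hdfspath/data \
        --schema=hdfs:///hdfspath/schema.avsc \
        --script=/path/WMArchive/Tools/RecordFinder.py \
        --spec=query.json --yarn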

The mrjob tool generates (and optionally executes) an MR bash script for the end-user.

mrjob --help
usage: mrjob [-h] [--hdir HDIR] [--odir ODIR] [--schema SCHEMA] [--mrpy MRPY]
             [--pydoop PYDOOP] [--avro AVRO] [--execute] [--verbose]

Tool to generate and/or execute a MapReduce (MR) script. The code is generated
from an MR skeleton provided by WMArchive and a user-based MR file. The latter
must contain two functions, mapper(ctx) and reducer(ctx), for the given
context. Their simplest implementation can be found in
WMArchive/MapReduce/mruser.py. Based on this code please create your own
mapper/reducer functions and use this tool to generate the final MR script.

optional arguments:
  -h, --help       show this help message and exit
  --hdir HDIR      HDFS input data directory
  --odir ODIR      HDFS output directory for MR jobs
  --schema SCHEMA  Data schema file on HDFS
  --mrpy MRPY      MapReduce python script
  --pydoop PYDOOP  pydoop archive file, e.g. /path/pydoop.tgz
  --avro AVRO      avro archive file, e.g. /path/avro.tgz
  --execute        Execute the generated MR job script
  --verbose        Verbose output
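
Putting it together, generating and immediately executing the MR script for the earlier example could look like the following sketch (the paths are the same placeholders as above):

mrjob --hdir=hdfs://host:port/path/data \
      --odir=hdfs://host:port/path/out \
      --schema=hdfs://host:port/path/schema.avsc \
      --mrpy=mr.py --pydoop=/path/pydoop.tgz --avro=/path/avro.tgz \
      --execute --verbose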