Queried 5 CMS data-services, placing 800K queries that span 220K datasets, 900 releases, 500 sites and 5000 user DNs
All data were anonymized and factorized
One year of meta-data translates into a 78x600000 dataframe
id,cpu,creator,dataset,dbs,dtype,era,naccess,nblk,nevt,nfiles,nlumis,nrel,nsites,nusers,parent,primds,proc_evts,procds,rel1_0,rel1_1,rel1_2,rel1_3,rel1_4,rel1_5,rel1_6,rel1_7,rel2_0,rel2_1,rel2_10,rel2_11,rel2_2,rel2_3,rel2_4,rel2_5,rel2_6,rel2_7,rel2_8,rel2_9,rel3_0,rel3_1,rel3_10,rel3_11,rel3_12,rel3_13,rel3_14,rel3_15,rel3_16,rel3_17,rel3_18,rel3_19,rel3_2,rel3_20,rel3_21,rel3_22,rel3_23,rel3_3,rel3_4,rel3_5,rel3_6,rel3_7,rel3_8,rel3_9,relt_0,relt_1,relt_2,rnaccess,rnusers,rtotcpu,s_0,s_1,s_2,s_3,s_4,size,tier,totcpu,wct
999669242,207737071.0,2186,20186,3,0,759090,14251.0,6,21675970,2158,72274,1,10,11.0,5862538,335429,30667701,373256,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0.6,0.4,3.9,0,3,8,0,0,8002,5,64280.0,216946588.0
332990665,114683734.0,2186,176521,3,1,759090,21311.0,88,334493030,32621,86197,1,4,8.0,6086362,968016,123342232,1037052,0,0,0,0,1,2,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,1,0,2,0.8,0.3,3.5,0,6,9,0,0,96689,3,58552.0,276683510.0
....
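To give an idea of how such a dataframe could be consumed, here is a minimal pandas sketch; the file name dataframe.csv is a placeholder, not the actual layout of the repository below.

import pandas as pd

# Load one dataframe snapshot (the file name is a placeholder)
df = pd.read_csv("dataframe.csv")

# Keep the dataset id aside and use the remaining columns as ML features
ids = df["id"]
features = df.drop(columns=["id"])
print(features.shape)   # roughly 600000 rows for one year of meta-data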
The 2013/2014 datasets are available at
https://git.cern.ch/web/CMS-DMWM-Analytics-data.git
Data are added throughout 2014 on a weekly basis
Dataset popularity
The left plot shows a few random datasets, while the right one
summarizes the 100 most accessed datasets through 2014.
Dataset access behaves like the stock market, but
N(datasets) >> N(stocks @ NASDAQ)
Different dataset popularity metrics
Left: popular datasets by the nusers metric,
Right: popular datasets by the totcpu metric.
From data to prediction
- Generate dataframes
- Transform the data into a format suitable for ML
- Build an ML model
  - use classification or regression techniques
  - train and validate the model, e.g. use 2013 data for training
    and 2014 data for validation (a minimal sketch follows this list)
- Generate new data and transform it as in step #2
- Apply the best model to the new data to make a prediction
- Verify the prediction against the popularity DB once data metrics become available
- Iterate again
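A minimal sketch of steps 2-4 with scikit-learn, assuming yearly dataframes stored as dataframe_2013.csv and dataframe_2014.csv (placeholder names) and a binary "popular" label derived from naccess with an arbitrary cut:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def load(path, naccess_cut=10):
    # Derive a binary popularity label; the naccess cut is an assumption
    df = pd.read_csv(path)
    y = (df["naccess"] > naccess_cut).astype(int)
    # Drop the id and the popularity metrics themselves from the feature set
    X = df.drop(columns=["id", "naccess", "nusers", "totcpu",
                         "rnaccess", "rnusers", "rtotcpu"])
    return X, y

# Train on 2013 data, validate on 2014 data
X_train, y_train = load("dataframe_2013.csv")
X_valid, y_valid = load("dataframe_2014.csv")

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_valid, clf.predict(X_valid)))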
Model prediction
Run 5 different models:
- Random Forest Classifier
- Linear SVC
- SGDClassifier
- VW online-classifier
- xgboost classifier
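As a sketch, the scikit-learn based classifiers from this list can be compared on the same train/validation split built above (the VW online classifier runs outside scikit-learn and is omitted; xgboost is assumed to be installed):

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# X_train, y_train, X_valid, y_valid come from the previous sketch
models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100),
    "LinearSVC": LinearSVC(),
    "SGDClassifier": SGDClassifier(),
    "XGBClassifier": XGBClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    score = accuracy_score(y_valid, model.predict(X_valid))
    print("%-22s accuracy = %.3f" % (name, score))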
Datasets we predict : 605691
Predicted as popular : 66583
Datasets in popdb sample : 23397
We predicted almost 3 times more datasets than the popularity DB sample contains,
so we may need to choose different classifiers for different data-tiers.
TIER TP(%) TN(%) FP(%) FN(%)
--------------------------------------
AOD 49.65 31.58 0.62 18.15
AODSIM 24.35 52.42 0.76 22.47
MINIAOD 0.59 95.31 0.88 3.23
MINIAODSIM 11.99 59.37 0.67 27.98
USER 9.65 72.16 1.25 16.93
......
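The per-tier rates above can be reproduced from a joined prediction/popularity table; a sketch assuming hypothetical columns tier, predicted and actual (0/1 flags):

import pandas as pd

def tier_rates(df):
    # df is assumed to carry 'tier', 'predicted' and 'actual' columns (0/1 flags)
    rows = []
    for tier, grp in df.groupby("tier"):
        n = float(len(grp))
        tp = ((grp["predicted"] == 1) & (grp["actual"] == 1)).sum()
        tn = ((grp["predicted"] == 0) & (grp["actual"] == 0)).sum()
        fp = ((grp["predicted"] == 1) & (grp["actual"] == 0)).sum()
        fn = ((grp["predicted"] == 0) & (grp["actual"] == 1)).sum()
        rows.append(dict(TIER=tier, TP=100*tp/n, TN=100*tn/n,
                         FP=100*fp/n, FN=100*fn/n))
    return pd.DataFrame(rows)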
We have a problem to solve:
- Build/tune the model as we go
- Mine HyperNews/twikis to find physicists' interests
- Mine HEP events, e.g. the CERN Indico calendar, and be predictive
If you are interested, take the challenge