notebook to store data in json files (#604)

* notebook to store data in json files
* actual notebook
PCMDI · May 30, 2019 · 151a0a9 · 151a0a9
1 parent 22dc990
commit 151a0a9
Showing 6 changed files with 270 additions and 12 deletions.
diff --git a/doc/jupyter/Jsons/JsonClass.ipynb b/doc/jupyter/Jsons/ReadInJsonFiles.ipynb
diff --git a/doc/jupyter/Jsons/WriteToJson.ipynb b/doc/jupyter/Jsons/WriteToJson.ipynb
@@ -0,0 +1,201 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Writing Tables Into Re-Usable Json Files\n",
+    "\n",
+    "This notebook demonstrates how to use PMP's Json class to write easily parsable and reusable json files. See [this notebook](ReadInJsonFiles.ipynb) to learn how to take advantage of this json format.\n",
+    "\n",
+    "## Key Concepts\n",
+    "\n",
+    "\n",
+    "### Structure\n",
+    "\n",
+    "This essentially helps store possibly complex tables in a json format that can later be easily parsed back into cdms/numpy variables.\n",
+    "\n",
+    "The idea is that the user ran a set of metrics looping over different parameters and wants to store these results.\n",
+    "\n",
+    "For example, for a given set of ***models***, loop through a given set of ***variables*** and for each variable compute a set of ***statistics***.\n",
+    "\n",
+    "`model`, `variable` and `statistic` would represent what we call the json file's **structure**.\n",
+    "\n",
+    "Another example is to loop through models and realizations, test each against a set of references, and loop through modes and seasons to produce a statistic.\n",
+    "\n",
+    "Here the structure would be:\n",
+    "\n",
+    "`model`, `realization`, `reference`, `mode`, `season`, `statistic`\n",
+    "\n",
+    "Python code to generate this would probably look similar to this:\n",
+    "\n",
+    "```python\n",
+    "for model in [\"A\", \"B\", \"C\"]:\n",
+    "    for realization in [\"a\", \"b\", \"c\", \"d\"]:\n",
+    "        for reference in [\"ref1\", \"ref2\"]:\n",
+    "            for mode in [\"NAM\", \"NAO\", \"NPGO\", \"PDO\", \"PNA\"]:\n",
+    "                for season in [\"DJF\", \"JJA\", \"MAM\"]:\n",
+    "                    for stat in [\"rms\", \"average\"]:\n",
+    "                        value = compute_some_stat(model, realization, reference, mode, season, stat)\n",
+    "```\n",
+    "\n",
+    "### Dictionary\n",
+    "\n",
+    "If stored in an array, the final shape would be `(3, 4, 2, 5, 3, 2)`, which is 720 values.\n",
+    "\n",
+    "But in reality the user may run a different set of statistics for each mode, and these can also depend on the variable. Storing this in an array would produce a lot of missing values. This is not necessary when using dictionaries.\n",
+    "\n",
+    "(If your data comes as a cdms2 variable, our package comes with a utility function, `pcmdi_metrics.io.MV2Json`, to convert it back to a dictionary.)\n",
+    "\n",
+    "\n",
+    "As described above, the \"Structure\" defines what each layer of keys represents.\n",
+    "\n",
+    "In the example above, to access the first value one would do:\n",
+    "\n",
+    "```python\n",
+    "\n",
+    "value = results[\"A\"][\"a\"][\"ref1\"][\"NAM\"][\"DJF\"][\"rms\"]\n",
+    "\n",
+    "```\n",
+    "\n",
+    "Additionally, the \"results\" are expected to be in a field named \"RESULTS\".\n",
+    "\n",
+    "## Example\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "INFO::2019-05-23 14:00::pcmdi_metrics:: Results saved to a json file: /1TB/git/pcmdi_metrics/doc/jupyter/Jsons/myfile.json\n"
+     ]
+    }
+   ],
+   "source": [
+    "results = {\"RESULTS\": {\"A\": {\"rms\": .2, \"mean\":.5}, \"B\": {\"mean\":.123, \"rms\": .67}}}\n",
+    "\n",
+    "import pcmdi_metrics\n",
+    "\n",
+    "out = pcmdi_metrics.io.base.Base(\".\", \"myfile.json\")\n",
+    "out.write(results, json_structure=[\"model\", \"Statisitc\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{\"RESULTS\": {\"A\": {\"rms\": 0.2, \"mean\": 0.5}, \"B\": {\"mean\": 0.123, \"rms\": 0.67}},\n",
+      " \"json_version\": 3.0, \"json_structure\": [\"model\", \"Statisitc\"], \"provenance\": {\"\n",
+      "platform\": {\"OS\": \"Linux\", \"Version\": \"4.15.0-50-generic\", \"Name\": \"drdoom\"}, \"u\n",
+      "serId\": \"doutriaux1\", \"osAccess\": false, \"commandLine\": \"/1Tb/miniconda3/envs/ju\n",
+      "pyter-vcdat/lib/python3.6/site-packages/ipykernel_launcher.py -f /run/user/1000/\n",
+      "jupyter/kernel-76cecce7-1761-432d-915f-fc0bfd45647d.json\", \"date\": \"2019-05-23 1\n",
+      "4:00:21\", \"conda\": {}, \"packages\": {}, \"openGL\": {\"GLX\": {\"server\": {}, \"client\"\n",
+      ": {}}}}}\n"
+     ]
+    }
+   ],
+   "source": [
+    "!more myfile.json"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "J = pcmdi_metrics.io.base.JSONs(files=[\"myfile.json\",], oneVariablePerFile=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[   id: model\n",
+       "    Length: 2\n",
+       "    First:  A\n",
+       "    Last:   B\n",
+       "    Python id:  0x7f18d1163a90,    id: Statisitc\n",
+       "    Length: 2\n",
+       "    First:  mean\n",
+       "    Last:   rms\n",
+       "    Python id:  0x7f18d1163160]"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "J.getAxisList()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "variable_5\n",
+       "masked_array(\n",
+       "  data=[[0.5  , 0.2  ],\n",
+       "        [0.123, 0.67 ]],\n",
+       "  mask=False,\n",
+       "  fill_value=1e+20)"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "J()"
+   ]
+  }
+ ],
+ "metadata": {
+  "data_variable_file_paths": {},
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  },
+  "selected_variables": [],
+  "variable_source_names": {},
+  "vcdat_file_path": "",
+  "vcdat_loaded_variables": []
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
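
Note on the notebook above: the layout it describes can be sketched without PMP at all. The snippet below is a minimal illustration using only the standard `json` module, not the PMP writer itself. The `compute_some_stat` helper, the variable names `tas` and `pr`, and the `stats_per_variable` mapping are hypothetical. The real `Base.write` call also adds `json_version` and `provenance`, which are omitted here.

```python
import json


def compute_some_stat(model, variable, stat):
    # Hypothetical stand-in for a real metric computation
    return len(model) + len(variable) + len(stat)


models = ["A", "B"]
# Each variable has its own set of statistics, so the table is ragged
stats_per_variable = {"tas": ["rms", "mean"], "pr": ["rms"]}

results = {}
for model in models:
    for variable, stats in stats_per_variable.items():
        for stat in stats:
            # Only combinations that were actually computed get a key
            leaf = results.setdefault(model, {}).setdefault(variable, {})
            leaf[stat] = compute_some_stat(model, variable, stat)

document = {
    "RESULTS": results,
    # One name per layer of keys, outermost first
    "json_structure": ["model", "variable", "statistic"],
}

text = json.dumps(document, indent=2)
reloaded = json.loads(text)
```

Because `pr` only has an `rms` entry, no placeholder is stored for its missing `mean`, which is the advantage over a dense array that the notebook points out.
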
diff --git a/pcmdi_metrics/io/__init__.py b/pcmdi_metrics/io/__init__.py
@@ -1,2 +1,3 @@
 # init for pcmdi_metrics.io
 from . import base  # noqa
+from .base import MV2Json  # noqa
diff --git a/pcmdi_metrics/io/base.py b/pcmdi_metrics/io/base.py
@@ -8,7 +8,7 @@
 import cdms2
 import hashlib
 import numpy
-import collections
+from collections import OrderedDict, Mapping
 import pcmdi_metrics
 import cdp.cdp_io
 import subprocess
@@ -33,6 +33,23 @@
     basestring = str
 
 
+# Convert cdms MVs to json
+def MV2Json(data, dic=None, struct=None):
+    dic = {} if dic is None else dic  # avoid a shared mutable default
+    struct = [] if struct is None else struct
+    if not isinstance(data, cdms2.tvariable.TransientVariable) and dic != {}:
+        raise RuntimeError("MV2Json needs a cdms2 transient variable as input")
+    if not isinstance(data, cdms2.tvariable.TransientVariable):
+        return data, struct  # we reach the end
+    else:
+        axis = data.getAxis(0)
+        if axis.id not in struct:
+            struct.append(axis.id)
+        for i, name in enumerate(axis):
+            dic[name], _ = MV2Json(data[i], {}, struct)
+    return dic, struct
+
+
 # Group merged axes
 def groupAxes(axes, ids=None, separator="_"):
     if ids is None:
@@ -56,7 +73,7 @@ def groupAxes(axes, ids=None, separator="_"):
 # cdutil region object need a serializer
 def update_dict(d, u):
     for k, v in u.items():
-        if isinstance(v, collections.Mapping):
+        if isinstance(v, Mapping):
             r = update_dict(d.get(k, {}), v)
             d[k] = r
         else:
@@ -88,9 +105,9 @@ def populate_prov(prov, cmd, pairs, sep=None, index=1, fill_missing=False):
 
 
 def generateProvenance():
-    prov = collections.OrderedDict()
+    prov = OrderedDict()
     platform = os.uname()
-    platfrm = collections.OrderedDict()
+    platfrm = OrderedDict()
     platfrm["OS"] = platform[0]
     platfrm["Version"] = platform[2]
     platfrm["Name"] = platform[1]
@@ -110,7 +127,7 @@ def generateProvenance():
     prov["osAccess"] = bool(os.access('/', os.W_OK) * os.access('/', os.R_OK))
     prov["commandLine"] = " ".join(sys.argv)
     prov["date"] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
-    prov["conda"] = collections.OrderedDict()
+    prov["conda"] = OrderedDict()
     pairs = {
         'Platform': 'platform ',
         'Version': 'conda version ',
@@ -140,7 +157,7 @@ def generateProvenance():
         'vcs': 'vcs ',
         'vtk': 'vtk-cdat ',
     }
-    prov["packages"] = collections.OrderedDict()
+    prov["packages"] = OrderedDict()
     populate_prov(prov["packages"], "conda list", pairs, fill_missing=None)
     pairs = {
         'vcs': 'vcs-nox ',
@@ -159,11 +176,11 @@ def generateProvenance():
         "version": "OpenGL version string",
         "shading language version": "OpenGL shading language version string",
     }
-    prov["openGL"] = collections.OrderedDict()
+    prov["openGL"] = OrderedDict()
     populate_prov(prov["openGL"], "glxinfo", pairs, sep=":", index=-1)
     prov["openGL"]["GLX"] = {
-        "server": collections.OrderedDict(),
-        "client": collections.OrderedDict()}
+        "server": OrderedDict(),
+        "client": OrderedDict()}
     pairs = {
         "version": "GLX version",
     }
@@ -294,7 +311,6 @@ def write(self, data, type='json', *args, **kwargs):
             data["json_structure"] = json_structure
             f = open(file_name, 'w')
             data["provenance"] = generateProvenance()
-#           data["user_notes"] = "BLAH"
             json.dump(data, f, cls=CDMSDomainsEncoder, *args, **kwargs)
             f.close()
 

diff --git a/pcmdi_metrics/version.py b/pcmdi_metrics/version.py
@@ -1,3 +1,3 @@
 __version__ = 'v1.2'
-__git_tag_describe__ = 'v1.2-45-g6fef135'
-__git_sha1__ = '6fef1358acba0e4c5617143fbf2fe25ad4e0f406'
+__git_tag_describe__ = 'v1.2-50-gef54524'
+__git_sha1__ = 'ef54524c9a3845afadc9f1312393d0f68734a4be'
diff --git a/tests/test_pmp_mv2json.py b/tests/test_pmp_mv2json.py
@@ -0,0 +1,40 @@
+import unittest
+from pcmdi_metrics.io import MV2Json
+import MV2
+import cdms2
+
+
+class TestMV2Json(unittest.TestCase):
+    def test2D(self):
+        a = MV2.array(range(6))
+        a = MV2.resize(a, (2, 3))
+        ax1 = cdms2.createAxis(["A", "B"], id="UPPER")
+        ax2 = cdms2.createAxis(["a", "b", "c"], id="lower")
+        a.setAxis(0, ax1)
+        a.setAxis(1, ax2)
+        jsn, struct = MV2Json(a)
+        self.assertEqual(
+            jsn, {'A': {'a': 0, 'b': 1, 'c': 2}, 'B': {'a': 3, 'b': 4, 'c': 5}})
+        self.assertEqual(struct, ['UPPER', 'lower'])
+
+    def test3D(self):
+        self.maxDiff = None
+        a = MV2.array(range(24))
+        a = MV2.resize(a, (2, 4, 3))
+        ax1 = cdms2.createAxis(["A", "B"], id="UPPER")
+        ax2 = cdms2.createAxis(["1", "2", "3", "4"], id="numbers")
+        ax3 = cdms2.createAxis(["a", "b", "c"], id="lower")
+        a.setAxis(0, ax1)
+        a.setAxis(1, ax2)
+        a.setAxis(2, ax3)
+        jsn, struct = MV2Json(a)
+        self.assertEqual(jsn, {'A': {'1': {'a': 0, 'b': 1, 'c': 2},
+                                      '2': {'a': 3, 'b': 4, 'c': 5},
+                                      '3': {'a': 6, 'b': 7, 'c': 8},
+                                      '4': {'a': 9, 'b': 10, 'c': 11}},
+                                'B': {'1': {'a': 12, 'b': 13, 'c': 14},
+                                      '2': {'a': 15, 'b': 16, 'c': 17},
+                                      '3': {'a': 18, 'b': 19, 'c': 20},
+                                      '4': {'a': 21, 'b': 22, 'c': 23}}})
+
+        self.assertEqual(struct, ['UPPER', 'numbers', 'lower'])
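
Note on the tests above: the recursion they exercise can be illustrated without cdms2. The sketch below is a pure-Python analogue, not the `MV2Json` implementation. The function name `nested_to_dict` and the `(axis_id, labels)` pair representation are assumptions made for the example. It peels off the outermost axis, keys the result by that axis's labels, and recurses until only scalars remain.

```python
def nested_to_dict(values, axes):
    """Convert nested lists to nested dicts.

    values: nested lists, one nesting level per axis
    axes:   list of (axis_id, labels) pairs, outermost axis first
    """
    if not axes:
        return values  # no axes left: this is a scalar leaf
    (_, labels), rest = axes[0], axes[1:]
    return {
        label: nested_to_dict(values[i], rest)
        for i, label in enumerate(labels)
    }


# Same 2x3 layout as test2D, with plain lists instead of an MV2 array
axes = [("UPPER", ["A", "B"]), ("lower", ["a", "b", "c"])]
values = [[0, 1, 2], [3, 4, 5]]

table = nested_to_dict(values, axes)
structure = [axis_id for axis_id, _ in axes]
```

Here `structure` plays the role of the `json_structure` list, and `table` is what would be stored under `RESULTS`.
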