PyD2M

PyD2M is a dependency management tool for pandas DataFrames. In brief, it does two things:

  1. Automatically extracts and joins data fields from different data files (in different formats: CSV, msgpack, etc.)
  2. Generates new data and saves it to files according to pre-defined recipes

Get Started

Here, we use PyD2M to manage a dataset used for the simulation of a transshipment container port. In a container port, a transshipment container is discharged from its first vessel, stored somewhere in the terminal's storage yard for a few days, and then loaded onto its second vessel.

Loading data

Initially, suppose we have a vessel_info.csv file generated by some other tool, which contains a port's vessel visiting information for one month. There are three columns in this file: the vessel IDs, the vessel lengths and the arrival times. Each vessel visits the port in a periodic pattern, so a VesselID may appear multiple times in the file. The ArrivalTime is the number of seconds since the beginning of the month.

VesselID,Length,ArrivalTime
7,240,7148
2,195,9907
0,195,32867
6,216,47490
8,319,49142
3,182,59537
...

To use PyD2M, let's create a directory named dataset and put vessel_info.csv under dataset/raw. We also create a sub-directory conf containing a file d2m.rc.

dataset
+-- conf
|   +-- d2m.rc
+-- raw
|   +-- vessel_info.csv
  • Note: conf/d2m.rc is the default location for a PyD2M dataset's configuration, which is in YAML format.

Now, add the following content to d2m.rc:

- DATA:
    raw:
      vessel_info.csv:
        TYPE: csv
        DECLARE_NEW_FIELDS: True
        LOCAL_FIELDS_ONLY: False
        FREE_FIELDS: False
        FIELDS:
          - VesselID: str
          - Length: int
          - ArrivalTime: int

Okay, the data is ready to be managed by PyD2M!

To use PyD2M, first, we import the module and create a DataSource object.

>>> from pyd2m.datasource import DataSource
>>> ds = DataSource("./dataset")

Then, we can use the DataSource.load method to load the file as a pandas DataFrame.

>>> ds.load("raw/vessel_info.csv")
	VesselID 	Length 	ArrivalTime
0 	4 	344 	7257
1 	6 	294 	16757
2 	8 	339 	18753
3 	7 	298 	31737
4 	0 	318 	44082
5 	9 	270 	49759
...

Or, we can directly use the field names to load the data:

>>> ds["VesselID", "Length", "ArrivalTime"]
Base:  raw/vessel_info.csv

	VesselID 	Length 	ArrivalTime
0 	4 	344 	7257
1 	6 	294 	16757
2 	8 	339 	18753
3 	7 	298 	31737
4 	0 	318 	44082
5 	9 	270 	49759
...

Also, we can access only part of the data and change the order of the fields:

>>> ds["ArrivalTime", "Length"]
Base:  raw/vessel_info.csv

 	ArrivalTime 	Length
0 	7257 	344
1 	16757 	294
2 	18753 	339
3 	31737 	298
4 	44082 	318
5 	49759 	270
...

However, one may find the original data not so convenient for data analytics. For example, we would like the ArrivalTime to be in the more human-readable datetime64 format instead of plain integers. We would also like to generate a unique VesselArrivalID for each vessel visit, defined as {MMDD}V{VesselID}.

This can be done by adding a hook to the file. First, we create a hook file named vessels.hk and put it into the conf sub-directory.

dataset
+-- conf
|   +-- d2m.rc
|   +-- vessels.hk
+-- raw
|   +-- vessel_info.csv
  • Note: The name of a hook file can be arbitrary, as long as its extension is .hk. There can be more than one .hk file in the conf directory.

Add the following content to the vessels.hk file.

from pyd2m import hooks
import pandas as pd


@hooks.load("raw/vessel_info.csv")
def vel_load_hook(df):
    df.ArrivalTime = pd.to_timedelta(df.ArrivalTime, unit="s") + pd.to_datetime("2019")
    df["VesselArrivalID"] = df.ArrivalTime.dt.strftime("%m%d") + "V" + df.VesselID.astype(str)
    return df

Also, do not forget to change the type of the ArrivalTime field and to add the VesselArrivalID field in the configuration file.

- DATA:
    raw:
      vessel_info.csv:
        TYPE: csv
        DECLARE_NEW_FIELDS: True
        LOCAL_FIELDS_ONLY: False
        FREE_FIELDS: False
        FIELDS:
          - VesselID: str
          - Length: int
          - ArrivalTime: datetime64[s]
          - VesselArrivalID: str

Let's re-create the DataSource and load these fields again.

>>> ds = DataSource("./dataset")
>>> ds["VesselArrivalID", "ArrivalTime", "Length"]
Base:  raw/vessel_info.csv

	VesselArrivalID 	ArrivalTime 	Length
0 	0101V7 	2019-01-01 04:13:07 	101
1 	0101V0 	2019-01-01 06:03:26 	335
2 	0101V5 	2019-01-01 06:27:19 	150
3 	0101V3 	2019-01-01 12:34:50 	314
4 	0101V2 	2019-01-01 14:19:19 	396
5 	0101V8 	2019-01-01 14:38:38 	140

You may have noticed that, in the vel_load_hook function, the ArrivalTime column of the transformed DataFrame actually has the type datetime64[ns]. However, since we have declared ArrivalTime to be datetime64[s] in the configuration, its type has already been auto-converted during loading!

Now, suppose we have another CSV file, box_info.csv, which records the containers' information, including their BoxID, UnloadingVesselArrivalID and LoadingVesselArrivalID. Let's add this file's information to the configuration file as well.

- DATA:
    raw:
      vessel_info.csv:
        ...
        
      box_info.csv:
        TYPE: csv
        DECLARE_NEW_FIELDS: True
        LOCAL_FIELDS_ONLY: False
        FREE_FIELDS: False
        FIELDS:
          - BoxID: str
          - UnloadingVesselArrivalID: str
          - LoadingVesselArrivalID: str
  • Note: We can also define a DEFAULTS section in the configuration file and move the common attributes of different files into it, so that we don't need to declare them every time.
- DEFAULTS:
    TYPE: csv
    DECLARE_NEW_FIELDS: True
    LOCAL_FIELDS_ONLY: False
    FREE_FIELDS: False

- DATA:
    raw:
      vessel_info.csv:
        FIELDS:
          - VesselID: str
          - Length: int
          - ArrivalTime: datetime64[s]
          - VesselArrivalID: str

      box_info.csv:
        FIELDS:
          - BoxID: str
          - UnloadingVesselArrivalID: str
          - LoadingVesselArrivalID: str

Saving data

When a vessel arrives at the container port, the berth planner decides when and where the vessel will be berthed. Here, we will not dive into complicated berthing algorithms. Instead, let's write a simple random function to berth the vessels: make each vessel wait for a random time of up to 2 hours after its arrival, and then put it at a random position along the port's linear quay. After being berthed, each vessel has a random handling time between 4 and 12 hours.

First, let's define some constant values in the dataset's configuration file.

- PARAMS:
    QUAY_LENGTH: 3000
    MAX_WAITING_TIME: 7200 

After that, these constants can be accessed as ds.{var_name}.
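
For example, after re-creating the DataSource so that it picks up the new configuration, the parameters become plain attributes (the values shown are the ones declared above):

>>> ds = DataSource("./dataset")
>>> ds.QUAY_LENGTH
3000
>>> ds.MAX_WAITING_TIME
7200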

The simple random strategy is implemented as follows.

>>> import numpy as np
>>> import pandas as pd
>>> df = ds["VesselArrivalID", "Length", "ArrivalTime"]
>>> df["MooringPosition"] = df.apply(lambda v: np.random.randint(0, ds.QUAY_LENGTH - v.Length), axis=1)
>>> df["MooringTime"] = df.ArrivalTime + pd.to_timedelta(np.random.random(size=len(df)) * ds.MAX_WAITING_TIME, unit="s")
>>> df["HandlingTime"] = pd.to_timedelta(np.random.uniform(4, 12, size=len(df)), unit="h")

Now, we can save the berthing plan into another file. Let's declare a file berthing.msg in the configuration file. This time, we want it to be in the msgpack format.

- DATA:
    raw:
      ... 
      
    plan:
      berthing.msg:
        TYPE: msgpack
        FIELDS:
          - VesselArrivalID: str
          - MooringPosition: int
          - MooringTime: datetime64[s]
          - HandlingTime: timedelta64[s]

Once the file is declared, use the DataSource.dump method to save the data.

>>> ds.dump("plan/berthing.msg", df)
  • Tip: a hook can also be added to the dumping process. Just use @hooks.dump to decorate a function in any .hk file.
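
For example, here is a minimal sketch of such a dump hook, assuming @hooks.dump mirrors @hooks.load (the decorated function receives the DataFrame about to be written and returns the DataFrame that is actually dumped):

from pyd2m import hooks


@hooks.dump("plan/berthing.msg")
def berthing_dump_hook(df):
    # Illustrative only: store the berthing plan sorted by mooring time.
    return df.sort_values("MooringTime").reset_index(drop=True)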

Once the file exists, we can use load to load it again, or access its fields directly.

>>> ds.load("plan/berthing.msg")
 	VesselArrivalID 	MooringPosition 	MooringTime 	HandlingTime
0 	0101V8 	816 	2019-01-01 02:25:22 	04:42:51
1 	0101V7 	243 	2019-01-01 05:57:56 	07:32:13
2 	0101V0 	630 	2019-01-01 06:12:13 	07:12:19
3 	0101V9 	140 	2019-01-01 07:51:36 	11:30:00
4 	0101V6 	570 	2019-01-01 09:29:30 	07:43:05
5 	0101V3 	588 	2019-01-01 13:05:37 	08:27:04
...
>>> ds["VesselArrivalID", "MooringTime"]
Base:  plan/berthing.msg

	VesselArrivalID 	MooringTime
0 	0101V8 	2019-01-01 02:25:22
1 	0101V7 	2019-01-01 05:57:56
2 	0101V0 	2019-01-01 06:12:13
3 	0101V9 	2019-01-01 07:51:36
4 	0101V6 	2019-01-01 09:29:30
5 	0101V3 	2019-01-01 13:05:37
...

Auto joining

Now, what if we want to analyse the relationship between vessels' lengths and their handling times? To do so, we need the values of both Length and HandlingTime. However, they are in different files with different formats! Do we need to load these two files separately and join them manually? Of course not. Instead, we can retrieve these fields directly.

>>> ds["VesselArrivalID", "Length", "HandlingTime"]
Base:  raw/vessel_info.csv
Joining: plan/berthing.msg

	VesselArrivalID 	Length 	HandlingTime
0 	0101V8 	345 	09:12:38
1 	0101V7 	217 	09:47:53
2 	0101V0 	293 	08:55:18
3 	0101V9 	177 	05:42:23
4 	0101V6 	115 	10:06:20
5 	0101V3 	375 	10:11:21
...

See? PyD2M has done this joining automatically!
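
For comparison, here is a sketch of the equivalent manual work with plain pandas, loading the two files separately and merging them on VesselArrivalID:

>>> vessels = ds.load("raw/vessel_info.csv")
>>> berthing = ds.load("plan/berthing.msg")
>>> vessels.merge(berthing, on="VesselArrivalID")[["VesselArrivalID", "Length", "HandlingTime"]]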

Cookbook

A cookbook contains a series of recipes. A recipe is a function which generates new DataFrames (dishes) from existing DataFrames (ingredients).

Now, let's generate each container's unloading/loading time and position at the quay according to the vessel information. The results will be saved in plan/box_pos_time.msg. First, add the file's information to the configuration file.

- DATA:
    ...
    
    plan:
      ...
      
      box_pos_time.msg:
        TYPE: msgpack
        FIELDS:
          - BoxID: str
          - UnloadingPosition: int
          - UnloadingTime: datetime64[s]
          - LoadingPosition: int
          - LoadingTime: datetime64[s]

Then, create a file plan.cb in the conf directory and add the following code to it. The filename can be arbitrary as long as its extension is .cb.

@recipe("plan/box_pos_time.msg")
def gen_box_pos_time(cb):
    vel_info = cb.DS["VesselArrivalID", "MooringPosition", "Length", "MooringTime", "HandlingTime"]

    df_u = cb.DS["BoxID", "UnloadingVesselArrivalID"].merge(
        vel_info, left_on="UnloadingVesselArrivalID", right_on="VesselArrivalID")
    df_u["UnloadingPosition"] = df_u.Length * np.random.random(size=len(df_u)) + df_u.MooringPosition
    df_u["UnloadingTime"] = df_u.HandlingTime * np.random.random(size=len(df_u)) + df_u.MooringTime

    df_l = cb.DS["BoxID", "LoadingVesselArrivalID"].merge(
        vel_info, left_on="LoadingVesselArrivalID", right_on="VesselArrivalID")
    df_l["LoadingPosition"] = df_l.Length * np.random.random(size=len(df_l)) + df_l.MooringPosition
    df_l["LoadingTime"] = df_l.HandlingTime * np.random.random(size=len(df_l)) + df_l.MooringTime

    return df_u.merge(df_l, on="BoxID")
  • Note 1: We can also indicate the recipe's ingredients and dishes manually as follows.
@recipe(ingredients=["raw/vessel_info.csv", "plan/berthing.msg"], dishes=["plan/box_pos_time.msg"])
def gen_box_pos_time(cb, vel, bth):
    vel_info = vel.merge(bth, on="VesselArrivalID")
    ...

In this case, the parameters after cb receive the DataFrames in the ingredient list, in order. There can also be more than one generated dish: if there are multiple dishes, the decorated function should return a tuple containing all the dishes in order (see the sketch after these notes).

  • Note 2: When dumping data to a file, PyD2M discards the fields not defined in the configuration file. So you do not need to call DataFrame's drop method manually.
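
As a minimal sketch of the multi-dish convention mentioned in Note 1 (the file paths and field selections below are hypothetical and would need to be declared in the configuration file first):

from pyd2m.cookbook import recipe


@recipe(ingredients=["plan/box_pos_time.msg"], dishes=["tmp/unloading.csv", "tmp/loading.csv"])
def split_box_plan(cb, box_plan):
    # One ingredient DataFrame comes in; two dishes go out,
    # returned in the same order as listed in `dishes`.
    df_unloading = box_plan[["BoxID", "UnloadingPosition", "UnloadingTime"]]
    df_loading = box_plan[["BoxID", "LoadingPosition", "LoadingTime"]]
    return df_unloading, df_loading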

Now, let's try fetching boxes' unloading information directly.

>>> ds["BoxID", "UnloadingVesselArrivalID", "UnloadingPosition", "UnloadingTime]
Generating plan/box_pos_time.msg
[] => ['plan/box_pos_time.msg'] By <CookBook.gen_box_pos_time>
Base:  plan/berthing.msg
Joining: raw/vessel_info.csv
Base:  raw/box_info.csv
Base:  raw/box_info.csv
Base:  plan/box_pos_time.msg
Joining: raw/box_info.csv

	BoxID 	UnloadingVesselArrivalID 	UnloadingPosition 	UnloadingTime
0 	0 	0101V0 	2510 	2019-01-01 13:20:21
1 	105 	0101V0 	2737 	2019-01-01 13:54:03
2 	142 	0101V0 	2544 	2019-01-01 08:08:59
3 	224 	0101V0 	2388 	2019-01-01 14:47:23
4 	283 	0101V0 	2727 	2019-01-01 11:25:09
5 	324 	0101V0 	2707 	2019-01-01 07:26:47

The file box_pos_time.msg has been generated automatically, and the fields are extracted and joined correctly! The structure of the dataset directory is now as follows.

dataset
+-- conf
|   +-- d2m.rc
|   +-- vessels.hk
|   +-- plan.cb
+-- raw
|   +-- vessel_info.csv
|   +-- box_info.csv
+-- plan
|   +-- berthing.msg
|   +-- box_pos_time.msg

We can also add a recipe for the file berthing.msg to the cookbook, by adding the following function to plan.cb (or to another .cb file; there can be as many .cb files as you want in the conf directory).

@recipe("plan/berthing.msg")
def gen_berthing_plan(cb):
    df = cb.DS["VesselArrivalID", "Length", "ArrivalTime"]
    df["MooringPosition"] = df.apply(lambda v: np.random.randint(0, cb.DS.QUAY_LENGTH - v.Length), axis=1)
    df["MooringTime"] = df.ArrivalTime + pd.to_timedelta(np.random.random(size=len(df)) * cb.DS.MAX_WAITING_TIME, unit="s")
    df["HandlingTime"] = pd.to_timedelta(np.random.uniform(4, 12, size=len(df)), unit="h")
    df["HandlingTime"] = pd.to_timedelta(np.random.uniform(4, 12, size=len(df)), unit="h")
    return df

Now, let's delete the whole plan directory and then access boxes' information again.
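
One way to do this from the same session is with Python's standard library (the path below assumes the directory layout shown above):

>>> import shutil
>>> shutil.rmtree("./dataset/plan")
>>> ds = DataSource("./dataset")  # re-create the DataSource after removing the files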

>>> ds["BoxID", "LoadingVesselArrivalID", "LoadingPosition", "LoadingTime"]
Generating plan/box_pos_time.msg
[] => ['plan/box_pos_time.msg'] By <CookBook.gen_box_pos_time>
Generating plan/berthing.msg
['raw/vessel_info.csv'] => ['plan/berthing.msg'] By <CookBook.gen_berthing_plan>
Base:  plan/berthing.msg
Joining: raw/vessel_info.csv
Base:  raw/box_info.csv
Base:  raw/box_info.csv
Base:  plan/box_pos_time.msg
Joining: raw/box_info.csv

	BoxID 	LoadingVesselArrivalID 	LoadingPosition 	LoadingTime
0 	0 	0106V3 	625 	2019-01-06 11:02:05
1 	84 	0128V4 	2648 	2019-01-28 08:12:10
2 	105 	0110V1 	2002 	2019-01-11 00:01:40
3 	129 	0119V1 	1144 	2019-01-19 23:47:53
4 	132 	0116V6 	2626 	2019-01-16 19:53:44
5 	142 	0101V2 	2406 	2019-01-02 07:28:23
...

Whoosh! The data are generated again!

Variables

We can also use variables in configuration entries.

Say we want to generate different instances of berthing.msg and box_pos_time.msg for different experiments. We can modify the configuration file and the recipes correspondingly, as follows.

- DATA:
    ... 
    
    'plan_{exp}':
        berthing.msg:
            ...

        box_pos_time.msg:
            ...
@recipe("plan_{exp}/berthing.msg")
def gen_berthing_plan(cb):
    ...

@recipe("plan_{exp}/box_pos_time.msg")
def gen_box_pos_time(cb):
    ...

We can then specify the {exp} value when creating the DataSource object or when calling the load/dump methods.

>>> ds = DataSource("./dataset", exp="exp_1")
>>> ds["BoxID", "LoadingPosition"]
Generating plan_exp_1/box_pos_time.msg
[] => ['plan_{exp}/box_pos_time.msg'] By <CookBook.gen_box_pos_time>
Generating plan_exp_1/berthing.msg
[] => ['plan_{exp}/berthing.msg'] By <CookBook.gen_berthing_plan>
Base:  raw/vessel_info.csv
Base:  plan_{exp}/berthing.msg
Joining: raw/vessel_info.csv
Base:  raw/box_info.csv
Base:  raw/box_info.csv
Base:  plan_{exp}/box_pos_time.msg

	BoxID 	LoadingPosition
0 	0 	1437
1 	84 	2007
2 	105 	733
3 	129 	2060
4 	132 	1476
5 	142 	2145
...

>>> ds = DataSource("./dataset", exp="exp_2")
>>> ds["BoxID", "LoadingPosition"]
Generating plan_exp_2/box_pos_time.msg
[] => ['plan_{exp}/box_pos_time.msg'] By <CookBook.gen_box_pos_time>
Generating plan_exp_2/berthing.msg
[] => ['plan_{exp}/berthing.msg'] By <CookBook.gen_berthing_plan>
Base:  raw/vessel_info.csv
Base:  plan_{exp}/berthing.msg
Joining: raw/vessel_info.csv
Base:  raw/box_info.csv
Base:  raw/box_info.csv
Base:  plan_{exp}/box_pos_time.msg

	BoxID 	LoadingPosition
0 	0 	659
1 	84 	1771
2 	105 	2261
3 	129 	2136
4 	132 	2208
5 	142 	2140
...
 
>>> ds.load("plan_{exp}/box_pos_time.msg", exp="exp_1")
 	BoxID 	UnloadingPosition 	UnloadingTime 	LoadingPosition 	LoadingTime
0 	0 	868 	2019-01-01 22:53:41 	1437 	2019-01-06 13:52:45
1 	84 	909 	2019-01-01 21:50:22 	2007 	2019-01-28 07:29:07
2 	105 	968 	2019-01-01 19:56:14 	733 	2019-01-10 19:35:00
3 	129 	909 	2019-01-01 23:29:08 	2060 	2019-01-19 17:08:25
4 	132 	696 	2019-01-01 23:10:09 	1476 	2019-01-16 14:07:04
5 	142 	702 	2019-01-01 19:33:32 	2145 	2019-01-02 01:58:43
...

After executing this code, the structure of the dataset directory becomes

dataset
+-- conf
|   +-- d2m.rc
|   +-- vessels.hk
|   +-- plan.cb
+-- raw
|   +-- vessel_info.csv
|   +-- box_info.csv
+-- plan_exp_1
|   +-- berthing.msg
|   +-- box_pos_time.msg
+-- plan_exp_2
|   +-- berthing.msg
|   +-- box_pos_time.msg

Auto-recipe, Steps and Quick-recipes

Sometimes, PyD2M can automatically generate recipes for us. For example, suppose we want to save the boxes' unloading information separately in a file named box_unloading_info.csv, defined as follows.

- DATA:
    ...
    
    tmp:
        "{exp}":
            box_unloading_info.csv:
                FIELDS:
                    - BoxID: str
                    - UnloadingVesselArrivalID: str
                    - UnloadingPosition: int
                    - UnloadingTime: datetime64[s]

Instead of writing a function like

@recipe("tmp/{exp}/box_unloading_info.csv")
def gen_box_unloading_info(cb):
    return cb.DS["BoxID", "UnloadingVesselArrivalID", "UnloadingPosition", "UnloadingTime"]

in a .cb file, we can do

@recipe("tmp/{exp}/box_unloading_info.csv")
def gen_box_unloading_info(cb):
    return cb.DS.autogen('tmp/{exp}/box_unloading_info.csv')

or simply

from pyd2m.cookbook import auto_recipe

auto_recipe("tmp/{exp}/box_unloading_info.csv")

On the other hand, Steps are pre-defined operations that can simplify recipes. Say we want to save each vessel's berthing info in a separate file, defined as follows.

- DATA:
    ...
    
    'plan_{exp}':
        '{VesselID}':
            berthing.csv:
                FIELDS:
                    - ArrivalTime: datetime64[s]
                    - MooringPosition: int
                    - MooringTime: datetime64[s]

We can then write a recipe like

from pyd2m.cookbook import recipe
from pyd2m.cookbook.steps import groupby
import pandas as pd

@recipe("plan_{exp}/{VesselID}/berthing.csv")
def gen_vessel_berthing_info(cb):
    vels = cb.DS["VesselID", "ArrivalTime", "VesselArrivalID", "MooringPosition", "MooringTime"]
    return groupby(vels, "VesselID", drop_index=True)

Finally, quick recipes can also generate recipes automatically, by following some pre-defined strategies. Say we want to concatenate the contents of all plan_{exp}/{VesselID}/berthing.csv files into another file, plan_{exp}/yet_another_berthing_info_file.csv, which has the same fields as these berthing.csv files. The recipe can be generated automatically by

from pyd2m.cookbook import quick_recipe

quick_recipe("concat", 
    ingredients=[plan_{exp}/{VesselID}/berthing.csv"], 
    dishes=[plan_{exp}/yet_another_berthing_info_file.csv"], 
    axis=0)
