Refactor/RenamingClasses #47

Merged
merged 15 commits into from
Mar 18, 2021
Wiki + small comment changes
JavierMonton committed Mar 17, 2021
commit 5e381bbb74a44694493577e4befb71040e7e28a3
3 changes: 2 additions & 1 deletion .github/workflows/bigquery-integration.yml
@@ -33,4 +33,5 @@ jobs:
uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
- verbose: true # optional (default = false)
+ verbose: false # optional (default = false)
+ flags: bq-integration
3 changes: 2 additions & 1 deletion .github/workflows/ci-tests.yml
@@ -25,4 +25,5 @@ jobs:
uses: codecov/codecov-action@v1
with:
token: ${{ secrets.CODECOV_TOKEN }}
- verbose: true # optional (default = false)
+ verbose: false # optional (default = false)
+ flags: unittests
@@ -39,7 +39,7 @@ object SqlInstanceToBigQuery {
}

/** Simplify syntax for SqlType, allows:
- * BigQueryTypesInstance[SqlType].bigQueryFields(mySqlTypeInstance)
+ * SqlInstanceToBigQuery[SqlType].bigQueryFields(mySqlTypeInstance)
*/
implicit def fieldsFromSqlType(implicit f: Formats): SqlInstanceToBigQuery[SqlType] =
new SqlInstanceToBigQuery[SqlType] {
@@ -51,7 +51,7 @@ object SqlTypeToSpark {
getSchemaWithName(f.transformKeys(name), sqlType) :: getSchema(SqlStruct(records, mode))
}

- /** Basic SqlTypes conversions to BigQuery Fields
+ /** Basic SqlTypes conversions to Spark Types
*/
private def getSchemaWithName(name: String, sqlType: SqlType)(implicit f: Formats): StructField = sqlType match {
case SqlInt(mode) =>
291 changes: 282 additions & 9 deletions wiki.md
An important note of nomenclature:
- We call `Conversion` anything that converts `from the generic SqlType into a specific type`
- We call `Reverse Conversion` anything that converts `from a specific type into the generic SqlType`

### SqlType ADT
[SqlType](https://github.com/data-tools/big-data-types/blob/main/core/src/main/scala/org/datatools/bigdatatypes/types/basic/SqlType.scala)
is an Algebraic Data Type that aims to generalize all the cases that we can have from different types.

It is used internally for all the conversions.

It consists of a few case classes, each carrying a `Mode` (`Nullable`, `Required` or `Repeated`), plus an `SqlStruct` that contains
pairs of `String` and `SqlType`, where the `String` is the name of the field.
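
To make this concrete, here is a simplified sketch of the shape of the ADT (the real definition lives in the linked file and covers more types; names here are approximations):

```scala
// Simplified sketch of the SqlType ADT, not the real source (see SqlType.scala in core)
sealed trait SqlTypeMode
case object Nullable extends SqlTypeMode
case object Required extends SqlTypeMode
case object Repeated extends SqlTypeMode

sealed trait SqlType
case class SqlInt(mode: SqlTypeMode) extends SqlType
case class SqlLong(mode: SqlTypeMode) extends SqlType
case class SqlFloat(mode: SqlTypeMode) extends SqlType
// a struct holds (field name, field type) pairs plus its own mode
case class SqlStruct(records: List[(String, SqlType)], mode: SqlTypeMode) extends SqlType
```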


### Conversion / Reverse Conversion
There are two types of conversions that we should cover:

- `Conversion` = Generic SqlType -> Our Type
- `Reverse Conversion` = Our type -> Generic SqlType

#### Conversion
`SqlType -> our type` = `All types -> our type`
For the _normal_ _Conversion_ we have to create two new _Type Classes_.
The **Core** of the library already has a _Type Class_ that converts Case Classes into `SqlType`s, so our new _Type Classes_ only need to derive from it.

- `SqlTypeToOurType` for types only (converting a Case Class into our new type) -> e.g: `Conversion[A].myNewType`
- `SqlInstanceToOurType` for converting instances into our new type -> e.g: `Conversion[A](myInstance: A).myNewType`
(where `A` can be an instance of any type implemented in the library)

One will be based on the other, so don't worry: one of them will be really short.
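
As a preview, with placeholder names (`MyType` and `MyTypeField` are hypothetical; the real Spark and BigQuery versions appear later on this page), the pair of _Type Classes_ could look like this:

```scala
// Hypothetical pair of Type Classes for a new type called "MyType" (names are placeholders)
trait SqlTypeToMyType[A] {
  /** Schema derived from the type A alone (Case Classes, at compile time) */
  def myTypeFields: List[MyTypeField]
}

trait SqlInstanceToMyType[A] {
  /** Schema derived from a runtime value of A (e.g. an existing schema instance) */
  def myTypeFields(value: A): List[MyTypeField]
}
```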

#### Reverse Conversion
`our type -> SqlType` = `our type -> all types`

For the _Reverse Conversion_ we only need to implement an existing _Type Class_ called `SqlTypeConversion`.
By doing this, we automatically get conversions to the rest of the types in the library.
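
The exact definitions live in the **Core** module; as a rough sketch (the real signatures may differ slightly), the two _Type Classes_ we rely on, `SqlTypeConversion` for types and `SqlInstanceConversion` for instances, look something like this:

```scala
// Rough sketch of the Core Type Classes (real code lives in the core module)
trait SqlTypeConversion[A] {
  def getType: SqlType                // generic SqlType for the type A itself
}

trait SqlInstanceConversion[A] {
  def getType(value: A): SqlType      // generic SqlType derived from a runtime value of A
}
```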


## How to do it

As covered in [Conversion](#conversion), we have to implement two type classes: one for types and another for instances.
Both will derive the `SqlTypeConversion` type class into our specific type, and by doing so we will automatically get all conversions into our new type.

### Create a new subproject in SBT
For a new type, we need a new submodule in SBT. By doing so, we don't force anyone to import all the types implemented in the library; everyone can pick only the types they are interested in.

An example from Spark. This defines a new module, `big-data-types-spark`.
Options can be changed as desired:
- `publishSettings` is there because we want to publish the new type
- `crossScalaVersions` specifies the Scala versions the artifact will be built for. The more the better
- `crossVersionSharedSources` is a method to specify different source folders for Scala 2.13- and 2.13+
- The dependencies we need, defined in a variable
- `dependsOn` to depend on Core, for both compile and test. (There are a few useful classes in Core's tests that we can use to test our new type)

```scala
lazy val spark = (project in file("spark"))
.settings(
name := projectName + "-spark",
publishSettings,
crossScalaVersions := List(scala212),
crossVersionSharedSources,
libraryDependencies ++= sparkDependencies
)
.dependsOn(core % "test->test;compile->compile")
```

Add the dependency to the root project, inside `aggregate`:

```scala
//Project settings
lazy val root = (project in file("."))
.configs(IntegrationTest)
.settings(noPublishSettings)
.aggregate(
core,
bigquery,
spark,
examples
)
```

Now you can create a new root folder with your type name, following the typical structure (`src/main/scala/...`)

### Conversion: Type Class - SqlType -> New Type

#### Defining the syntax
- First, create a new package called something like `org.datatools.bigdatatypes.{mynewtype}`
- Add the Type Class (a trait) with a method `() => A`, where `A` is our new type/schema
For example:
```scala
trait SqlTypeToSpark[A] {

/** @return a list of [[StructField]]s that represents [[A]]
*/
def sparkFields: List[StructField]
}
```
In this case, `sparkFields` will be the name of the method we use to obtain our new type.
Pick a representative name, but don't worry too much: at the end we can create a wrapper (an implicit class) that uses our _Type Class_ and improves the syntax.

#### Implementing the Type Class
We can start by creating a companion object with `apply` and a factory constructor that makes construction of instances easier:
````scala
object SqlTypeToSpark {

/** Summoner method */
def apply[A](implicit instance: SqlTypeToSpark[A]): SqlTypeToSpark[A] = instance

/** Factory constructor - allows easier construction of instances */
def instance[A](fs: List[StructField]): SqlTypeToSpark[A] =
new SqlTypeToSpark[A] {
def sparkFields: List[StructField] = fs
}
}
````
Then, the code will depend a lot on the type we are constructing, but keep in mind that we are converting `SqlType`s into our type,
so we should write a method like `SqlType => ourType`.

As types can be recursive (nested objects), we can start by defining a method for the recursion that:
- Takes an `SqlType`
- Uses an implicit `Formats` for optional key transformations
- Returns our desired type
```scala
/** Creates the schema (list of fields)
* Applies an implicit [[Formats.transformKeys]] in the process
* @param sqlType [[SqlType]]
* @param f [[Formats]] to apply while constructing the schema
* @return List of [[StructField]] representing the schema of the given type
*/
private def getSchema(sqlType: SqlType)(implicit f: Formats): List[StructField] = sqlType match {
case SqlStruct(Nil, _) => Nil
case SqlStruct((name, sqlType) :: records, mode) =>
getSchemaWithName(f.transformKeys(name), sqlType) :: getSchema(SqlStruct(records, mode))
}
```
And another method (`getSchemaWithName` in this example) to handle the specific types.
In this case, we show an example from BigQuery, as it seems simpler to understand:
```scala
/** Basic SqlTypes conversions to BigQuery Fields
*/
private def getSchemaWithName(name: String, sqlType: SqlType)(implicit f: Formats): Field = sqlType match {
case SqlInt(mode) =>
Field.newBuilder(name, StandardSQLTypeName.INT64).setMode(sqlModeToBigQueryMode(mode)).build()
case SqlLong(mode) =>
Field.newBuilder(name, StandardSQLTypeName.INT64).setMode(sqlModeToBigQueryMode(mode)).build()
case SqlFloat(mode) =>
Field.newBuilder(name, StandardSQLTypeName.FLOAT64).setMode(sqlModeToBigQueryMode(mode)).build()
...
...
}
```
Same example from Spark:
```scala
/** Basic SqlTypes conversions to Spark Types
*/
private def getSchemaWithName(name: String, sqlType: SqlType)(implicit f: Formats): StructField = sqlType match {
case SqlInt(mode) =>
StructField(name, sparkType(mode, IntegerType), isNullable(mode))
case SqlLong(mode) =>
StructField(name, sparkType(mode, LongType), isNullable(mode))
case SqlFloat(mode) =>
StructField(name, sparkType(mode, FloatType), isNullable(mode))
...
...
}
```

We have to understand `Mode` at this point.

##### Mode inside Types
There are different ways to handle arrays or repeated values in a structure, and two are the most common:
- Define a `Mode` that can be `Nullable`, `Required` or `Repeated` (this is how `SqlType` works).
  - An Array will look like a normal type with `Repeated` mode -> e.g: `SqlInt(mode = Repeated)`
- Define an `Array` structure that wraps another type.
  - An Array will look like `Array(Int)` or similar -> e.g: Spark -> `ArrayType(IntegerType)`
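
As an illustration (a sketch, not code from the library), a repeated integer field would be represented differently under each approach:

```scala
import org.apache.spark.sql.types.{ArrayType, IntegerType}
// SqlInt and Repeated come from the core module of this library

// Mode-based representation (SqlType's approach): the element type carries a Repeated mode
val asSqlType = SqlInt(mode = Repeated)

// Wrapper-based representation (Spark's approach): an ArrayType wraps the element type
val asSparkType = ArrayType(IntegerType)
```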

So, in the BigQuery example, we use the `sqlModeToBigQueryMode` method as follows:
```scala
private def sqlModeToBigQueryMode(sqlTypeMode: SqlTypeMode): Mode = sqlTypeMode match {
case Nullable => Mode.NULLABLE
case Repeated => Mode.REPEATED
case Required => Mode.REQUIRED
}
```
Pretty straightforward.

For types that use an Array structure, like Spark, we use something like:
```scala
private def sparkType(mode: SqlTypeMode, sparkType: DataType): DataType = mode match {
case Repeated => ArrayType(sparkType, containsNull = isNullable(mode))
case _ => sparkType
}
```
If the mode is not `Repeated`, we return the type as it is; if it is `Repeated`, we wrap it in an `ArrayType`.

###### Everything together
Finally, we create a method that derives an instance of our Type Class for anything that has an `SqlTypeConversion` instance, using our new methods. As simple as:
```scala
/** Instance derivation via SqlTypeConversion.
*/
implicit def fieldsFromSqlTypeConversion[A: SqlTypeConversion](implicit f: Formats): SqlTypeToSpark[A] =
instance(getSchema(SqlTypeConversion[A].getType))
```
To understand it:
- `SqlTypeConversion[A].getType` will return the `SqlType` for any type that implements `SqlTypeConversion`,
so we don't need to care about other types.
- `getSchema` is our new method that generates the complete type
- `instance` is our factory constructor that makes it easier to construct the Type Class implementation

And, that's it! Now **we can convert any Case Class into our new type**!

To do it, e.g:
```scala
SqlTypeToOurType[MyCaseClass].getMyNewType
```
Again, if you wonder how this happens: the [SqlTypeConversion](https://github.com/data-tools/big-data-types/blob/main/core/src/main/scala/org/datatools/bigdatatypes/conversions/SqlTypeConversion.scala)
Type Class derives an `SqlType` for any Case Class using [Shapeless](https://github.com/milessabin/shapeless).


**But** one piece is still missing. Converting types is really cool, and it happens at compile time,
but we also want to convert values that only exist at runtime, i.e. when we have an instance.
e.g: A Spark schema is not just a type, it is an instance of `StructType`, so we need to pass an instance to our new converter.
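
To make the difference concrete, here is a sketch using the generic placeholder names from this page (`MyCaseClass` and `mySparkSchema` are hypothetical, and the instance version assumes a reverse conversion exists for Spark's `StructType`):

```scala
// Compile time: a Case Class is only a type, so nothing needs to be passed in
SqlTypeToOurType[MyCaseClass].getMyNewType

// Runtime: a Spark schema only exists as a value (an instance of StructType),
// so the instance has to be handed to the converter
SqlInstanceToOurType[StructType].getMyNewType(mySparkSchema)
```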


### Conversion: Type Class - SqlInstance -> New Type

This will be quick: we already have methods that convert an `SqlType` into our new type, so we only need to extend them to accept an instance as an argument.

We should create a new Type Class called `SqlInstanceToOurType`

As before, we create the syntax (to make things easier, we use the same method name as before):
```scala
/** Type class to convert generic SqlTypes received as instance into BigQuery specific fields
* This uses [[SqlTypeToBigQuery]] to create BigQuery Fields
*
 * @tparam A the type we want to obtain a schema from
*/
trait SqlInstanceToBigQuery[A] {

/** @param value an instance of [[A]]
* @return a list of [[Field]]s that represents [[A]]
*/
def bigQueryFields(value: A): List[Field]
}
```
In this case, `bigQueryFields` expects an input parameter of type `A`.

Next, we create the companion object with a few methods, starting with the summoner (`apply`):
```scala
object SqlInstanceToBigQuery {

/** Summoner method
*/
def apply[A](implicit a: SqlInstanceToBigQuery[A]): SqlInstanceToBigQuery[A] = a
```

Then, we add a method that will derive any type with an `SqlInstanceConversion` instance into our new type:
```scala
/** Instance derivation via SqlInstanceConversion. It uses `getSchema` from the SqlTypeToBigQuery Type Class
*/
implicit def fieldsFromSqlInstanceConversion[A: SqlInstanceConversion](implicit f: Formats): SqlInstanceToBigQuery[A] =
new SqlInstanceToBigQuery[A] {

override def bigQueryFields(value: A): List[Field] =
SqlTypeToBigQuery.getSchema(SqlInstanceConversion[A].getType(value))
}
```
**Note!**: This is the full code of the implementation, but in Scala we can shorten it a lot. As there is only one abstract method inside our Type Class,
Scala can infer the anonymous class for us, so there is no need to write all the code; it can be written as follows:
```scala
implicit def fieldsFromSqlInstanceConversion[A: SqlInstanceConversion](implicit f: Formats): SqlInstanceToBigQuery[A] =
(value: A) => SqlTypeToBigQuery.getSchema(SqlInstanceConversion[A].getType(value))
```
Both snippets are equivalent!

One more (optional) method, to be able to use instances of `SqlType` directly:
```scala
implicit def fieldsFromSqlType(implicit f: Formats): SqlInstanceToBigQuery[SqlType] =
(value: SqlType) => SqlTypeToBigQuery.getSchema(value)
```

And that's it! We can convert an instance of any type in the library into our new type, e.g: a Spark schema into our new type by doing:
```scala
SqlInstanceToOurType[StructType](mySparkSchema).getMyNewType
```

Finally, an extension method could help to improve the syntax. (Note: Scala 3 is improving the syntax for [extension methods](https://dotty.epfl.ch/docs/reference/contextual/extension-methods.html))

Inside the same object, we create an implicit class with our new syntax:
```scala
/** Allows the syntax myInstance.bigQueryFields for any instance of type A: SqlInstanceConversion
*/
implicit class InstanceSyntax[A: SqlInstanceToBigQuery](value: A) {
def bigQueryFields: List[Field] = SqlInstanceToBigQuery[A].bigQueryFields(value)
}
```
And we gain a new syntax like:
```scala
anyInstance.myNewType
mySparkSchema.myNewType
```

## Implement Conversion

As covered in [Conversion](#conversion), we have to implement two type classes: one for types, another for instances.
### Reverse conversion, Type Class implementation
Implement the `SqlTypeConversion` type class to get conversions from the new type into `SqlType`.
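
For schema-like types that only exist at runtime (such as Spark's `StructType`), the instance-based variant `SqlInstanceConversion` is the one to implement. A minimal, illustrative sketch (the library's real Spark implementation may differ, and only a few Spark types are covered here):

```scala
import org.apache.spark.sql.types._
// SqlType, SqlStruct, SqlInt, SqlLong, SqlFloat and the modes come from the core module

object SparkReverseConversionSketch {

  /** Reverse conversion for runtime values: a StructType instance into the generic SqlType */
  implicit val structTypeConversion: SqlInstanceConversion[StructType] =
    new SqlInstanceConversion[StructType] {
      def getType(value: StructType): SqlType =
        SqlStruct(value.fields.toList.map(f => f.name -> convert(f.dataType, fieldMode(f))), Required)
    }

  private def fieldMode(f: StructField): SqlTypeMode = if (f.nullable) Nullable else Required

  /** Basic Spark types into SqlType; a real implementation would cover every Spark DataType */
  private def convert(dataType: DataType, mode: SqlTypeMode): SqlType = dataType match {
    case IntegerType         => SqlInt(mode)
    case LongType            => SqlLong(mode)
    case FloatType           => SqlFloat(mode)
    case ArrayType(inner, _) => convert(inner, Repeated) // arrays become the Repeated mode
    case StructType(fields)  => SqlStruct(fields.toList.map(f => f.name -> convert(f.dataType, fieldMode(f))), mode)
    // ... remaining Spark types would be handled here
  }
}
```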

### Type Conversion
