Reporting Example

Annie B. Burgess edited this page May 21, 2019 · 1 revision

Introduction

This page collects all project reports related to the 2018 GSoC project titled Recurrent Neural Networks applied in the time-series classification over a high resolution data.

Student: Evandro Carrijo
Mentor: Lewis John McGibbney
ESIP PoC: Annie Bryant Burgess

Community Bonding Period

Week 1 (23rd April – 30th April)

Previous Action Items : None

Weekly Activity:

  1. Had a first talk with Lewis to establish a communication protocol;
  2. Walked through the source code as well as the mailing-list timeline to get familiar with the coding standards and workflow used within the COAL project;
  3. Put the project's testing toolkit (nosetests and Travis) on my study list;
  4. Read this discussion about feature extraction over very high-dimensional data and pointed out a potential algorithm to tackle the issue;
  5. Probed the threads related to Neural Networks and found the ones I can use as resources within the project's scope;
  6. Also read this interview to better understand the nontechnical aspects of the project, a good resource for becoming less of a layperson;
  7. Probing the Issues page, I found that the Classifier Callback would be the natural issue to solve, as the user will use this feature to choose the desired classification algorithm, mine included. I think this is also a good opportunity to get used to the source code and learn the sanity-test workflow.

Next Week’s agenda :

  1. Run the examples of the examples module;
  2. Get familiar with ENVI datasets;
  3. Discuss how orbital imagery can be used within the project scope;
  4. Analyze the Classifier Callback issue and figure out how to collaborate.

Blockers/Problem Issues:

None yet.

Mentor’s comments : Things are going well... I am looking forward to Evandro taking on Classifier Callback next week.

Weeks 2, 3 (1st May – 13th May)

Previous Action Items :

  1. Run the examples of the examples module;
  2. Get familiar with ENVI datasets;
  3. Discuss how orbital imagery can be used within the project scope;
  4. Analyze the Classifier Callback issue and figure out how to collaborate.

Weekly Activity:

  1. Tried to run the code from the Usage section of the Documentation Page in order to reproduce the same outcomes; hit many issues. Also ran the example_mineral.py code successfully, but it took longer than expected and is still unfinished after about 87 hours of execution;
  2. After reading some docs (e.g. ENVI Image Files, ENVI Header Files) and fetching many datasets from the AVIRIS Data Portal, I got a feel for the main features of the ENVI format, such as the layout of the file structure, interpolation methods and (lack of) compression;
  3. It was agreed that we are going to start working with some of the AVIRIS and AVIRIS-NG flight lines that have some temporal and spatial coverage overlap, but Lewis will also look for suitable orbital ENVI imagery;
  4. Due to the complications while running through datasets, there was little time to perform the tests needed to evaluate an eventual change of the mineral API. I'll talk to Lewis about grabbing a smaller dataset so that I can perform the sanity tests on the coming changes.

Next Week’s agenda :

  1. Grab a smaller dataset to be able to run and evaluate all available classification algorithms after the ISSUE-122 changes;
  2. Talk to Lewis about a region of study with good spatiotemporal overlap and fetch it;
  3. Perform the activities of the 1st coding week of the project proposal, namely:
  • Get remotely sensed imagery from the same source acquired on different dates, all sharing at least one common region;
  • Use a shapefile to precisely specify a region to be clipped in all images (optional if all provided images have the exact same boundaries);
  • Create a "daily basis" mosaic for those (possibly complementary) images that have more than one record within a day and overlapping regions.

Blockers/Problem Issues:

  1. The time spent trying to make things work was costly but, on the optimistic side, it was good for understanding both the usage of the library and the computer resource constraints;
  2. It will be very important to work, at first, with a smaller scenario, at the risk of otherwise being unable to perform the sanity tests in a reasonable time.

Mentor’s comments :

Coding Period

Week 1 (14th May – 20th May)

Previous Action Items :

  1. Grab a smaller dataset to be able to run and evaluate all available classification algorithms after the ISSUE-122 changes;
  2. Talk to Lewis about a region of study with good spatiotemporal overlap and fetch it;
  3. Perform the activities of the 1st coding week of the project proposal, namely:
  • Get remotely sensed imagery from the same source acquired on different dates, all sharing at least one common region;
  • Use a shapefile to precisely specify a region to be clipped in all images (optional if all provided images have the exact same boundaries);
  • Create a "daily basis" mosaic for those (possibly complementary) images that have more than one record within a day and overlapping regions.

Weekly Activity:

  1. Downloaded the ang20150914t171542 flight line and was able to run example_mineral.py on it with the in_memory flag set to True;
  2. Researching possible regions with spatial overlap on different dates, there are not many choices to work with. Below, some considerations about what I found:
  • Firstly, as the available flight scenes with some spatial overlap on different dates were not conceived to cover the exact same area, the effective overlapping areas we could use to extract pixel time series are very small.
  • Working with time series means classifying targets that have a minimal temporal signature, like vegetation (e.g. croplands, pasturelands), or performing so-called "change detection" (e.g. deforestation, increase of methane concentration, land-use change). If we reduce the scope of the analysis to these types of targets, the number of available areas with reasonable temporal consistency shrinks even further. It would be easier to aim at vegetation targets, whose seasonal aspects could be discriminated, assuming the use/cover does not change through time so that a static reference map could be used to label the training pixels' time series. On the other hand, working with change detection makes labeling the ground-truth samples more complicated, because we have to analyze a span of time to label the training samples, not to mention the imbalance between changing and non-changing pixels. If I'm not wrong, this kind of analysis has not been performed by the JPL folks yet simply because analysis over time was not in the project's scope.
  • If we choose a small available region of study (as a proof of concept) with reasonable time resolution and generate time series from its pixels, our RNN classifier will be constrained to work only within that region, which is the only area having the same distribution of pixel timestamps for all the time series inside it.
  • I think that working with AVIRIS data through time would only make sense if there were a predictable repeat period of observations and large overlapping areas. That would be feasible by properly scheduling AVIRIS flight lines to meet those restrictions, which is not the case for the available flights. Also, I don't know whether the aforementioned targets would fit the COAL project's goals.
  3. After some discussion with my mentor, it was decided that we'll dismiss the temporal dimension while still taking LSTM into account. Instead of using it through time, we can use it across the bands of the spectral dimension. This approach can work as well as if it were applied to the temporal dimension, because what really matters to RNNs is the order relationship between observations and, in the case of spectra, order matters. This suits the AVIRIS dataset very well, as it has very good spectral consistency but poor temporal consistency. Besides, working only with spectra allows us to compare this approach against SAM "shoulder-to-shoulder".
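
The spectral-sequence framing above can be illustrated with a minimal NumPy sketch (not the project's actual Keras code): a single LSTM cell with random, untrained weights consumes a pixel's reflectance values band by band, in spectral order, and its final hidden state summarizes the whole spectrum. The 224-band length matches classic AVIRIS; all weights and sizes here are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step for input x (shape (1,)), hidden state h, cell state c."""
    H = h.shape[0]
    z = W @ x + U @ h + b            # stacked pre-activations, shape (4H,)
    i = sigmoid(z[:H])               # input gate
    f = sigmoid(z[H:2 * H])          # forget gate
    o = sigmoid(z[2 * H:3 * H])      # output gate
    g = np.tanh(z[3 * H:])           # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_bands, hidden = 224, 8             # 224 bands as in classic AVIRIS
spectrum = rng.random((n_bands, 1))  # one pixel's reflectance per band
W = 0.1 * rng.standard_normal((4 * hidden, 1))
U = 0.1 * rng.standard_normal((4 * hidden, hidden))
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for x in spectrum:                   # iterate bands in spectral order
    h, c = lstm_step(x, h, c, W, U, b)
# h is a fixed-size summary of the ordered spectrum; a dense sigmoid
# layer on top of it would give the binary class probability.
```

In the real project a Keras LSTM layer plays this role; the point here is only that the recurrence runs over wavelength order instead of time order.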

Next Week’s agenda :

Considering the proposal shift, next week's agenda has changed. Instead of transforming mosaics into time cubes and normalizing them, I'm going to grab ground-truth pixels and compute some statistical parameters over them in order to create a healthy (balanced) training dataset. A data augmentation method can be applied to increase the number of less representative subsamples. A normalization strategy will also be studied.

Blockers/Problem Issues:

Besides the issues mentioned above, I had to stay away for some days due to an unexpected trip. This forced me to push ISSUE-122 to next week's agenda.

Mentor’s comments :

Week 2 (21st May – 27th May)

Previous Action Items :

  1. ISSUE-122;
  2. Grab ground-truth pixels and compute some statistical parameters over them in order to create a healthy (balanced) training dataset;
  3. Apply a data augmentation method to increase the number of less representative subsamples;
  4. Study a normalization strategy.

Weekly Activity:

  1. Opened a pull request to address ISSUE-122;
  2. I've planned a way to pick reliable labeled pixels in case of a lack of ground-truth examples. The idea is to classify many regions with SAM, select only the pixel classifications that have a high degree of confidence, and start training on them.
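
The selection idea sketched in item 2 can be expressed in a few lines of NumPy (a hypothetical illustration; the array names, class id and threshold are assumptions, not Pycoal API):

```python
import numpy as np

# Stand-ins for SAM outputs: a class-id map and a per-pixel confidence map.
rng = np.random.default_rng(1)
classified = rng.integers(0, 5, size=(100, 100))  # SAM class id per pixel
scores = rng.random((100, 100))                   # SAM confidence per pixel

threshold = 0.9                                   # keep only confident pixels
target = 3                                        # hypothetical class id
rows, cols = np.where((classified == target) & (scores >= threshold))
# (rows, cols) index the reliable training pixels for the target class.
```
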

Next Week’s agenda :

  1. Ensure the availability of training points and normalize them;
  2. If there is time, perform some statistical analyses to assess the status of the training dataset;
  3. Apply a data augmentation method to increase the number of less representative subsamples.

Blockers/Problem Issues:

As ground-truth samples are not yet available, it was not possible to perform items #2 and #3 of the Previous Action Items but, since some activities of the original agenda were dropped, there's still enough time to finish the first midterm objective.

Mentor’s comments :

Week 3 (28th May – 3rd June)

Previous Action Items :

  1. Ensure the availability of training points and normalize them;
  2. If there is time, perform some statistical analyses to assess the status of the training dataset;
  3. Apply a data augmentation method to increase the number of less representative subsamples.

Weekly Activity:

  1. I've hacked Pycoal's mineral API to create a 'score' image, that is, to store SAM's confidence value for each classified pixel. We can then use the pixel scores as a reference to prune unreliable pixels and pick only those above a confidence threshold as training pixels. This way, by adjusting the threshold, we can customize the number of training samples of each considered class and therefore balance the training dataset;
  2. The above approach was tried on ang20150914t171542 for testing purposes. Some statistical calculations were performed, and I found a huge imbalance among the classes of the USGS Spectral Library v6 within that scene;
  3. Updated the pull request addressing ISSUE-122.

Next Week’s agenda :

  1. While we wait for feedback from the JPL scientific team regarding the availability of ground-truth labels, find out how we could use those potential datasets;
  2. Concurrently, run the approach developed this week on the wider ang20150420t182050 scene and use it as a baseline for our classification benchmarks.

Blockers/Problem Issues:

Mentor’s comments :

Week 4 (4th June – 10th June)

Previous Action Items :

  1. While we wait for feedback from the JPL scientific team regarding the availability of ground-truth labels, find out how we could use those potential datasets;
  2. Concurrently, run the approach developed this week on the wider ang20150420t182050 scene and use it as a baseline for our classification benchmarks.

Weekly Activity:

  1. We received the JPL scientific team's feedback. Among other options, they recommended taking a look at some older datasets (e.g. Indian Pines, Cuprite, DC Mall) but, as stated, these are old, and the available ground truths were gathered and compiled using libraries other than the ENVI USGS Spectral Library v6 or v7. After a wide survey, I found that there are no AVIRIS ground-truth datasets that use the same spectral libraries that are standard in Pycoal. Considering that adapting other non-standard spectral libraries isn't the focus of this work, it was decided that the earlier training-sample selection approach will be used instead;
  2. After more than 60 hours of execution, the score, classified and RGB images of the ang20150420t182050 scene were successfully processed;
  3. A wide analysis was performed on the created images, and it was found that some classes are extremely abundant while others have very few samples. Also, some classes inherently have lower SAM confidence values than others within the same scene, and some sites show much confusion;
  4. As a proof of concept, it was decided, for now, that the classifier will restrict the spectral space to only one class, that is, a binary classifier specialized in identifying an important class such as Schwertmannite or Goethite. This is due to the lack of a sizable number of training-sample candidates for most of the classes considered by the library;
  5. A prune was performed to cut off all Schwertmannite points with scores below 0.85, which yielded 118,181 training samples for that class. Another 118,181 non-Schwertmannite training points were drawn from the remaining pixels (without pruning).
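
The pruning-and-balancing step in items 4 and 5 amounts to the following sketch (illustrative NumPy with made-up sizes and class ids; 0.85 is the threshold quoted above):

```python
import numpy as np

rng = np.random.default_rng(2)
labels = rng.integers(0, 10, size=50_000)  # class id of each candidate pixel
scores = rng.random(50_000)                # SAM confidence of each pixel

target = 4                                 # stand-in for the Schwertmannite id
# Positives: target-class pixels pruned at the 0.85 confidence threshold.
pos = np.where((labels == target) & (scores >= 0.85))[0]
# Negatives: an equally sized draw from all remaining pixels, unpruned.
neg_pool = np.where(labels != target)[0]
neg = rng.choice(neg_pool, size=len(pos), replace=False)
train_idx = np.concatenate([pos, neg])     # balanced binary training set
```
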

Next Week’s agenda :

  1. Prepare chosen data points in order to start modeling, training and performing sensitivity analysis.

Blockers/Problem Issues:

Mentor’s comments :

Week 5 (11th June – 17th June)

Previous Action Items :

  1. Prepare chosen data points in order to start modeling, training and performing sensitivity analysis.

Weekly Activity:

  1. First trial to fetch the selected pixels' spectra and use them to train an RNN model in Keras. Ran into many problems;
  2. Fixed issues regarding the affine geotransformation of some selected points;
  3. 45,390 of the 118,181 Schwertmannite points filtered last week were selected; likewise for the non-Schwertmannite points.

Next Week’s agenda :

  1. Sensitivity Analysis

Blockers/Problem Issues:

I tried to use QGIS 2.18 to draw the 236,362 points, but the operation was taking so long that I gave up. I then asked a colleague to perform the operation with ArcGIS, and it was accomplished in a reasonable time but, for unknown reasons, many of the layer's features in the ESRI shapefile lost their geometry field. Because of that (and for the sake of performance), I had to filter out many of the original points, resulting in a total of 90,780 Schwertmannite and non-Schwertmannite points.

Mentor’s comments :

Week 6 (18th June – 24th June)

Previous Action Items :

  1. Sensitivity Analysis.

Weekly Activity:

  1. Created this fork in order to generalize the code and documentation to work not only with time series, but with any kind of features along the rasters' layers, treating them as sequences regardless of their nature. The repository contains the data points selected in past weeks as well as the instructions necessary to reproduce the procedures I used to evaluate the first version of the model;
  2. Got 96.44% accuracy after training for 29 epochs;
  3. Currently running the prediction phase to generate the classified map of the example scene.

Next Week’s agenda :

  1. Continue Sensitivity Analysis

Blockers/Problem Issues:

Mentor’s comments :

Week 7 (25th June – 1st July)

Previous Action Items :

  1. Continue Sensitivity Analysis

Weekly Activity:

  1. After concluding the whole-scene classification, at first glance we can see a lot of commission error compared with the map produced by SAM for the same area. This is probably due to the lack of proper balancing of the non-Schwertmannite subsamples.
  2. It looks like the training dataset selection is the most critical part of the sensitivity analysis. Some data balancing needs to be performed in order to inspect its impact on prediction.
  3. A binary raster was produced for each class present in the scene using the following command (where classes is a file containing the list of class ids):

```shell
while read -r line; do
  gdal_calc.py --NoDataValue=-50 --type=Float32 \
    -A ang20150420t182050_corr_v1e_img_class.img \
    --overwrite --outfile="masks/$line.tif" --calc="A==$line"
done < classes
```

  4. Called gdal_polygonize for each file created in the last step, to delimit the regions where the training points of each class are to be drawn:

```shell
while read -r line; do
  gdal_polygonize.py -f "ESRI Shapefile" masks/$line.tif \
    -mask masks/$line.tif masks/$line.shp
done < classes
```
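
The gdal_calc loop in item 3 is just a per-class equality mask; in NumPy terms (a toy illustration, not the actual GDAL invocation) it looks like this:

```python
import numpy as np

# Toy classified raster with three class ids.
class_img = np.array([[1, 2, 2],
                      [3, 1, 2]])

masks = {}
for cid in np.unique(class_img):
    # 1.0 where the pixel belongs to class `cid`, 0.0 elsewhere, as
    # gdal_calc's --calc="A==$line" produces for each id in the list.
    masks[cid] = (class_img == cid).astype(np.float32)
```
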

Next Week’s agenda:

  1. Continue Sensitivity Analysis

Blockers/Problem Issues:

Ran into performance issues during the gdal_calc and gdal_polygonize phases, which took much more time than expected.

Mentor’s comments :

Week 8 (2nd July – 8th July)

Previous Action Items :

  1. Continue Sensitivity Analysis

Weekly Activity:

  1. With all geometries in hand, I used QGIS algorithms to try to draw 100 random points inside the polygons created last week, one class at a time. I ran into a number of problems, and this strategy proved to be unfeasible. Firstly, it was hard to ensure a one-to-one relation between the drawn points and the raster's pixels. Secondly, the operation consumes a huge amount of time when there are many polygons within the same class, and QGIS stopped responding.
  2. The failure of the last strategy to efficiently draw the same number of points for all non-Schwertmannite classes led me to come up with another approach. First, it was decided to clip the example scene to a smaller site, for the sake of performance but without loss of generality. Second, I used the "Regular Points" QGIS algorithm to pin one point at the center of every pixel of the whole scene in order to ensure a one-to-one relation between the drawn points and the raster's pixels. This process created a Point shapefile called grid.shp, which is being used in conjunction with the classes' Polygon shapefiles to filter the points belonging to each class. The following command performs this operation:

```shell
while read -r class; do
  ogr2ogr -clipsrc masks/$class.shp pts_$class.shp grid.shp
done < classes
```
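
The "Regular Points" grid in item 2 places one point at each pixel center; given a GDAL-style geotransform this is simple arithmetic (an illustrative sketch with a made-up transform and raster size):

```python
import numpy as np

# GDAL geotransform: (origin_x, pixel_w, 0, origin_y, 0, -pixel_h)
gt = (500000.0, 10.0, 0.0, 4200000.0, 0.0, -10.0)
rows, cols = 4, 5

cc, rr = np.meshgrid(np.arange(cols), np.arange(rows))
xs = gt[0] + (cc + 0.5) * gt[1]   # easting of each pixel center
ys = gt[3] + (rr + 0.5) * gt[5]   # northing of each pixel center
# Each (xs[r, c], ys[r, c]) pair is one point of the grid shapefile,
# guaranteeing the one-to-one point-to-pixel relation described above.
```
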

Next Week’s agenda:

  1. Continue Sensitivity Analysis

Blockers/Problem Issues:

As pointed out in #1, the early approach revealed itself to be very inefficient. A lot of parameterization of the QGIS "Random Points" algorithm was tried to yield a good distribution of drawn points, without success. The whole trial consumed a big amount of time. I'm also running into another performance issue when computing the intersection between the grid points and the classes' polygons. To address this, I'll manually run many threads in parallel (one for each class) and create a spatial index to evaluate the performance impact.

Mentor’s comments :

Weeks 9, 10 (9th July – 22nd July)

Previous Action Items :

  1. Continue Sensitivity Analysis

Weekly Activity:

  1. The strategies to automate the stratified sampling of the training samples didn't work well, and I had to do extensive research to find an alternative.
  2. After a long while, I found an efficient way to achieve my goals. The steps of the procedure are as follows:
  • Create a Point shapefile with a grid of equally spaced points, each placed right at the center of an image pixel. This can be accomplished using the "Regular Points" algorithm of QGIS.
  • Extract the values of the classified-image and score-image rasters at the points created in the previous step. To do that, I used a QGIS plugin called Point sampling tool.
  • Create a script that finds all the USGS Spectral Library (v6) classes spotted in the classified raster and prints them into a file.
  • Sort the points by confidence value in descending order and grab the top 100 of each class, through the following command (where classes.txt is the list created in the previous step and grid_classified.shp contains the points whose field values were extracted from the rasters two steps earlier):

```shell
while read -r line; do
  ogr2ogr -progress -overwrite \
    -sql "SELECT SAMscore,geometry FROM grid_classified WHERE clsUSGSv6=$line ORDER BY SAMscore DESC LIMIT 100" \
    -dialect "sqlite3" data/selected_points/$line.shp grid_classified.shp
done < classes.txt
```

  • Merge all the non-Schwertmannite points into a single file; fetch the same number of points to represent the Schwertmannite class, also picking the most significant scores.
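
The top-100-per-class selection performed by the ogr2ogr/SQL loop can be sketched in plain Python (hypothetical arrays; in the real workflow the labels and scores come from the sampled shapefile fields):

```python
import numpy as np

rng = np.random.default_rng(3)
labels = rng.integers(0, 6, size=5_000)  # class id of each grid point
scores = rng.random(5_000)               # SAM score of each grid point

top = {}
for cid in np.unique(labels):
    idx = np.where(labels == cid)[0]
    # Sort this class's points by score, descending, and keep the top 100,
    # mirroring `ORDER BY SAMscore DESC LIMIT 100`.
    order = idx[np.argsort(scores[idx])[::-1]]
    top[cid] = order[:100]
```
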

Next Week’s agenda :

  1. Find some AVIRIS scenes that contain, besides Schwertmannite, Reynolds_Tnl_Sludge, Jarosite and Goethite.
  2. Perform the aforementioned steps to build models able to classify these scenes into these classes.

Blockers/Problem Issues:

Mentor’s comments :

Weeks 11, 12 (23rd July – 5th August)

Previous Action Items :

  1. Find some AVIRIS scenes that contain, besides Schwertmannite, Reynolds_Tnl_Sludge, Jarosite and Goethite.
  2. Perform the aforementioned steps to build models able to classify these scenes into these classes.

Weekly Activity:

  1. Opened the following PR: Issue 157 – Create a score map. This aids the stratified sampling used to build the models.
  2. I conducted an extensive survey trying to discover some sites I could use to increase the available number of training samples for Reynolds_Tnl_Sludge and the other classes of interest, with the guidance of the following resources:

I tried to classify some AVIRIS and AVIRIS-NG scenes with SAM but, besides the overhead of clipping the scenes and the performance issues, I couldn't get a number of the desired classes, as expected. It is almost impossible to find, in a reasonable amount of time, sites that have both the desired classes and some overlapping AVIRIS flight lines. The paucity of some classes makes it hard to properly build a model based on supervised learning, regardless of whether the rare class is the target class or not.

  3. Nevertheless, the proof of concept has worked very well. Not only was I able to build an RNN model capable of accurately distinguishing between Schwertmannite and non-Schwertmannite pixels, but I also have a complete workflow for gathering training points and training the model on them. Even though I used Schwertmannite as the canonical example, the developed workflow can be applied arbitrarily; that is, one can train a specific model (using the same architecture used in the Schwertmannite example) by inputting a number of points representing the target class and others representing its complementary "classes". In the end, the whole procedure will yield a trained model capable of generalizing, given that the right training points were selected. I am going to put the whole workflow (selecting, training and predicting) into a Jupyter Notebook as part of the Pycoal documentation.

Next Week’s agenda :

  1. Build final version of the Schwertmannite binary classifier.
  2. Make it available from within Pycoal.
  3. Build final documentation.

Blockers/Problem Issues:

Mentor’s comments :

Week 13 (6th August – 14th August)

Previous Action Items :

  1. Build final version of the Schwertmannite binary classifier.
  2. Make it available from within Pycoal.
  3. Build final documentation.

Weekly Activity:

  1. Got 99% accuracy when predicting the test dataset for the Schwertmannite classifier and points within the example scene (ang20150420t182050);
  2. Integrated the model into Pycoal and developed all the changes necessary to allow users to make their own models and include them in Pycoal. The changes can be seen in this PR: Add a new LSTM based classification algorithm #159;
  3. Created a tutorial with Jupyter Notebook to explain all the steps, from gathering the training samples to including the model in the Pycoal tree.

Blockers/Problem Issues:

Currently the Jupyter Notebook is offline, but it will be available in the Pycoal repository shortly.

Mentor’s comments :