
Lyft Perception Challenge

The Lyft Perception Challenge, in association with Udacity, posed an image segmentation task: candidates had to submit an algorithm that could segment road and car pixels precisely, in real time. The challenge ran from May 1st, 2018 through June 3rd, 2018.

Approach

Although this was a semantic segmentation problem and did not require instance segmentation, I went ahead with Mask R-CNN, as it was the state-of-the-art algorithm in image segmentation and I had always been intrigued to learn about it. Also, I started on the 28th, just after finishing my first term, so transfer learning was my only shot. 😓

Click to watch on YouTube

Mask R-CNN (a brief overview)

Mask R-CNN, implemented in Facebook AI Research's Detectron platform for object detection, is essentially a modification of Faster R-CNN with a segmentation branch parallel to the class predictor and bounding-box regressor. A vanilla ResNet is used in an FPN setting as the backbone of Faster R-CNN, so that features can be extracted at multiple levels of the feature pyramid. The network heads consist of the mask branch, which predicts the mask, and a classification-with-bounding-box-regression branch. The FPN variant of the architecture was used for this competition.

Figure: Feature Pyramid Network with a ResNet backbone (left); the head architectures with and without FPN (right).

The loss function is the sum of three terms, L = L_class + L_box + L_mask, where

  • L_class uses log loss over the true classes
  • L_box uses the smooth L1 loss defined in [Fast RCNN]
  • L_mask uses average binary cross-entropy loss

The masks are predicted by a Fully Convolutional Network (FCN) for each RoI. This maintains the m×m dimension of each mask, and thus we get a distinct mask for each instance of an object.
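To make the mask term concrete, here is a minimal NumPy sketch of average binary cross-entropy over a single m×m mask (illustrative only; the actual graph-mode implementation used by the network is shown later in this README):

import numpy as np

def mask_bce(y_true, y_pred, eps=1e-7):
    """Average binary cross-entropy over one m x m mask.
    y_true: (m, m) array of 0/1 ground-truth pixels.
    y_pred: (m, m) array of predicted probabilities.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1.0 - y_true) * np.log(1.0 - y_pred))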

The model description after compilation can be found at model.

Training

The trained weights can be downloaded from weights. The model was trained on an Alienware 13 R3 with an Nvidia GTX 1060.

Pre-processing Data

The provided CARLA dataset contains train and test images in PNG format, where the class labels are stored in the red channel of the ground-truth mask.

import numpy as np

def process_labels(self, labels):
    """Convert a CARLA label image into binary road and car masks."""
    # label 6  - lane-line pixels
    # label 7  - lane (road) pixels
    # label 10 - car pixels

    labels_new = np.zeros(labels.shape)      # road mask
    labels_new_car = np.zeros(labels.shape)  # car mask

    lane_line_idx = (labels == 6).nonzero()
    lane_idx = (labels == 7).nonzero()
    car_pixels = (labels == 10).nonzero()

    # remove the ego-car hood pixels (image rows >= 495)
    car_hood_idx = (car_pixels[0] >= 495).nonzero()[0]
    car_hood_pixels = (car_pixels[0][car_hood_idx],
                       car_pixels[1][car_hood_idx])

    # road = lane lines + lane surface
    labels_new[lane_line_idx] = 1
    labels_new[lane_idx] = 1

    labels_new_car[car_pixels] = 1
    labels_new_car[car_hood_pixels] = 0

    # stack into (H, W, 2): channel 0 = road, channel 1 = car
    return np.dstack([labels_new, labels_new_car])
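A minimal usage sketch (the file path and dataset object are hypothetical; note that OpenCV loads images as BGR, so the red channel is index 2):

import cv2

gt = cv2.imread("Train/CameraSeg/0.png")  # hypothetical path to a ground-truth mask
labels = gt[:, :, 2]                      # CARLA label ids live in the red channel
masks = dataset.process_labels(labels)    # (600, 800, 2): road mask, car mask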

MaskRCNN Configuration

For this application, ResNet-50 was used by setting BACKBONE = "resnet50" in the config. NUM_CLASSES = 1 + 2 covers the two foreground classes (car, road) plus background, and IMAGE_MAX_DIM = 1024 accommodates the 800x600 input images.

class LyftChallengeConfig(Config):
    """Configuration for training on the toy shapes dataset.
    Derives from the base Config class and overrides values specific
    to the toy shapes dataset.
    """
    # Give the configuration a recognizable name
    NAME = "road_car_segmenter"

    # Backbone network architecture
    # Supported values are: resnet50, resnet101
    # BACKBONE = "resnet101"
    BACKBONE = "resnet50"

    # Train on 1 GPU and 1 image per GPU. Batch size is 1
    # (GPUs * images/GPU).
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

    # Number of classes (including background)
    NUM_CLASSES = 1 + 2  # background + 2 classes (car, road)

    # Set the limits of the small side and the large side;
    # that determines the image shape.
    IMAGE_MIN_DIM = 128
    IMAGE_MAX_DIM = 1024

    # Use smaller anchors because our image and objects are small
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)  # anchor side in pixels

    # Reduce training ROIs per image because the images are small and have
    # few objects. Aim to allow ROI sampling to pick 33% positive ROIs.
    TRAIN_ROIS_PER_IMAGE = 32

    # Use a small epoch since the data is simple
    STEPS_PER_EPOCH = 100

    # use small validation steps since the epoch is small
    VALIDATION_STEPS = 5
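As a quick sanity check, the base Config class in matterport's Mask_RCNN provides a display() helper that prints every resolved hyperparameter:

config = LyftChallengeConfig()
config.display()  # prints NAME, BACKBONE, IMAGE_SHAPE, BATCH_SIZE, anchor scales, etc.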

Data Augmentation

As very few samples were provided (about 1K), data augmentation was necessary to avoid overfitting. imgaug is a Python module that came in handy for adding augmentation to the dataset. The train function of MaskRCNN takes an augmentation object and augments the images in its generator according to the rules defined in that object. Here, iaa.SomeOf((0, None), ...) applies a random subset of the listed augmenters (possibly none, possibly all) to each image.

augmentation = iaa.SomeOf((0, None), [
        iaa.Fliplr(0.5),
        iaa.Flipud(0.5),
        iaa.OneOf([iaa.Affine(rotate=45),
                   iaa.Affine(rotate=90),
                   iaa.Affine(rotate=135)]),
        iaa.Multiply((0.8, 1.5)),
        iaa.GaussianBlur(sigma=(0.0, 5.0)),
        iaa.Affine(scale=(0.5, 1.5)),
        iaa.Affine(scale={"x": (0.5, 1.5), "y": (0.5, 1.5)}),
    ])

Examples of augmentation are given in https://github.com/matterport/Mask_RCNN

Training Loss

The model was trained starting from pretrained COCO weights. Instead of a single training loop, it was trained multiple times in smaller runs of epochs, to observe how the loss changed with parameter changes and to avoid overfitting. As the data was limited, the network would saturate quickly and required more augmentation to make progress. I also did not want to go overboard on the augmentation, so I observed which augmenters worked best. Below are the logs of the final training setting with the augmentation given above.

heads epochs | all epochs | loss | val_loss
10 | 40 | (plot) | (plot)
40 | 100 | (plot) | (plot)
10 | 40 | (plot) | (plot)
20 | 60 | (plot) | (plot)

(The loss and val_loss columns were training-curve plots in the original README.)
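A hedged sketch of how one such staged run could map onto the matterport API (the weights path is a placeholder; dataset_train, dataset_val, and augmentation are assumed to be prepared as in the snippets in this README; the exclude list follows matterport's transfer-learning examples, since the COCO heads were trained for a different number of classes):

import mrcnn.model as modellib

config = LyftChallengeConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")

# start from COCO weights, skipping head layers whose shapes depend on NUM_CLASSES
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# train the heads first, then fine-tune all layers at a lower learning rate
model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE,
            epochs=10, layers="heads", augmentation=augmentation)
model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE / 10,
            epochs=40, layers="all", augmentation=augmentation)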

For this competition, after two training runs I froze the bounding-box and class-label layers and trained only the mask head. This was done by adding two new regexes (just_mrcnn_mask, heads_mask) to layer_regex in the train method of Model, found in Mask_RCNN/mrcnn/model.py; some more augmentation was also added.

 layer_regex = {
            # all layers but the backbone
            "heads": r"(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
            
            # just to train the branch dealing with masks
            "just_mrcnn_mask": r"(mrcnn\_mask.*)",
            "heads_mask": r"(mrcnn\_mask.*)|(rpn\_.*)|(fpn\_.*)",
            
            # From a specific Resnet stage and up
            "3+": r"(res3.*)|(bn3.*)|(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
            "4+": r"(res4.*)|(bn4.*)|(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
            "5+": r"(res5.*)|(bn5.*)|(mrcnn\_.*)|(rpn\_.*)|(fpn\_.*)",
            
            # All layers
            "all": ".*",
        }

The following is a sample snippet of the training calls:

model.train(dataset_train, dataset_val, 
            learning_rate=config.LEARNING_RATE,
            epochs=2,
            augmentation=augmentation, 
            layers="just_mrcnn_mask")

model.train(dataset_train, dataset_val, 
            learning_rate=config.LEARNING_RATE/10.0,
            epochs=10,
            augmentation=augmentation, 
            layers="heads_mask")

Inference and Submission

Inference

The inference configuration uses a batch size of one image. To improve the FPS, batches of frames from the video were passed to the pipeline, but this did not improve the FPS drastically and rather increased complexity, since the test video had an odd number of frames.

class InferenceConfig(LyftChallengeConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
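A sketch of running detection with this config (the weights path is a placeholder; detect() is the matterport API and returns one result dict per input image):

import mrcnn.model as modellib

inference_config = InferenceConfig()
model = modellib.MaskRCNN(mode="inference", config=inference_config,
                          model_dir="./logs")
model.load_weights("mask_rcnn_lyft.h5", by_name=True)  # placeholder weights file

r = model.detect([image], verbose=0)[0]  # image: an RGB frame from the test video
# r["masks"]: (H, W, N) boolean instance masks; r["class_ids"]: (N,) labels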

Submission

Submission requires the results to be encoded in JSON. test_inference.py contains the inference and submission code. In an attempt to increase the FPS, the encode function was replaced with the following, which was shared on the challenge forum:

import base64
import cv2

def encode(array):
    # PNG-compress the mask, then base64-encode the buffer for the JSON payload
    retval, buffer = cv2.imencode('.png', array)
    return base64.b64encode(buffer).decode("utf-8")
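A sketch of how the encoded masks could be assembled into the submission JSON (variable names are illustrative; the per-frame [car, road] ordering follows the challenge's answer-key format):

import json

answer_key = {}
for frame_id, (car_mask, road_mask) in enumerate(frame_masks, start=1):
    answer_key[frame_id] = [encode(car_mask.astype('uint8')),
                            encode(road_mask.astype('uint8'))]
print(json.dumps(answer_key))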

Results

Your program runs at 1.703 FPS

  • Car F score: 0.519 | Car Precision: 0.509 | Car Recall: 0.521
  • Road F score: 0.961 | Road Precision: 0.970 | Road Recall: 0.926
  • Averaged F score: 0.740

The above score was obtained using only the dataset provided. I observed that the class frequencies were imbalanced: there were far fewer samples with cars than with roads. This could have been addressed by weighting the loss function by class, but I faced dimension issues while implementing it. Another approach would be to generate more data using the CARLA simulator, or to use popular datasets such as Cityscapes.

The loss function for masks is given as

import tensorflow as tf
import keras.backend as K

def mrcnn_mask_loss_graph(target_masks, target_class_ids, pred_masks):
    """Mask binary cross-entropy loss for the masks head.

    target_masks: [batch, num_rois, height, width].
        A float32 tensor of values 0 or 1. Uses zero padding to fill array.
    target_class_ids: [batch, num_rois]. Integer class IDs. Zero padded.
    pred_masks: [batch, proposals, height, width, num_classes] float32 tensor
                with values from 0 to 1.
    """
    # Reshape for simplicity. Merge first two dimensions into one.
    target_class_ids = K.reshape(target_class_ids, (-1,))
    mask_shape = tf.shape(target_masks)
    target_masks = K.reshape(target_masks, (-1, mask_shape[2], mask_shape[3]))
    pred_shape = tf.shape(pred_masks)
    pred_masks = K.reshape(pred_masks,
                           (-1, pred_shape[2], pred_shape[3], pred_shape[4]))
    # Permute predicted masks to [N, num_classes, height, width]
    pred_masks = tf.transpose(pred_masks, [0, 3, 1, 2])

    # Only positive ROIs contribute to the loss. And only
    # the class specific mask of each ROI.
    positive_ix = tf.where(target_class_ids > 0)[:, 0]
    positive_class_ids = tf.cast(
        tf.gather(target_class_ids, positive_ix), tf.int64)
    indices = tf.stack([positive_ix, positive_class_ids], axis=1)

    # Gather the masks (predicted and true) that contribute to loss
    y_true = tf.gather(target_masks, positive_ix)
    y_pred = tf.gather_nd(pred_masks, indices)

    # Compute binary cross entropy. If no positive ROIs, then return 0.
    # shape: [batch, roi, num_classes]
    loss = K.switch(tf.size(y_true) > 0,
                    K.binary_crossentropy(target=y_true, output=y_pred),
                    tf.constant(0.0))
    loss = K.mean(loss)
    return loss
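For reference, the class weighting mentioned above could look roughly like the sketch below. This is an assumption-laden illustration, not the author's (abandoned) implementation, and the weight values are hypothetical:

import tensorflow as tf
import keras.backend as K

def weighted_mask_bce(y_true, y_pred, positive_class_ids):
    """Per-ROI binary cross-entropy, weighted by each ROI's class id.
    y_true, y_pred: [num_positive_rois, height, width]
    positive_class_ids: [num_positive_rois] integer class ids
    """
    class_weights = tf.constant([1.0, 1.0, 4.0])  # hypothetical: bg, road, car
    per_roi = K.mean(K.binary_crossentropy(target=y_true, output=y_pred),
                     axis=[1, 2])                 # one scalar loss per ROI
    w = tf.gather(class_weights, tf.cast(positive_class_ids, tf.int32))
    return tf.reduce_sum(w * per_roi) / (tf.reduce_sum(w) + K.epsilon())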

Code Execution

  • Training on the CARLA dataset:
    python train_mrcnn.py
  • Test inference on test_video.mp4:
    python inference.py
  • Submission script (test_inference.py):
    grader 'python Lyft_challenge/test_inference.py'

Requirements

Python 3.4, TensorFlow 1.3, Keras 2.0.8, pycocotools, and the other dependencies required for https://github.com/matterport/Mask_RCNN

Acknowledgement

This code heavily uses https://github.com/matterport/Mask_RCNN by waleedka. Thanks for the contribution.

Reference

  • Mask R-CNN: K. He, G. Gkioxari, P. Dollár, R. Girshick, ICCV 2017.
  • Fast R-CNN: R. Girshick, ICCV 2015.

Author

Ameya Wagh [email protected]