AI Competition

The goal of this competition is to improve the performance of the TensorFlow and/or Caffe2 distributed training implementations over RDMA and GPUDirect technologies.

TensorFlow is an open-source software library for numerical computation using data flow graphs. TensorFlow 1.5 supports RDMA and GPUDirect to accelerate distributed training. The code can be found at: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/verbs. A reference deployment guide for distributed TensorFlow over RDMA can be found at: https://community.mellanox.com/docs/DOC-3067
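
As a rough illustration of how a distributed TensorFlow 1.x job might select the RDMA transport, the sketch below builds a cluster description and passes the "grpc+verbs" protocol string to tf.train.Server. It assumes a TensorFlow 1.x build compiled with verbs support. The host names, ports, and the helper names build_cluster and start_server are hypothetical and are not part of the competition material.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Return a cluster dictionary mapping job names to host:port lists."""
    return {"ps": list(ps_hosts), "worker": list(worker_hosts)}


def start_server(cluster, job_name, task_index, use_rdma=True):
    """Start an in-process TensorFlow 1.x server for one task of the cluster."""
    # Imported lazily so build_cluster can be used without TensorFlow installed.
    import tensorflow as tf

    # "grpc+verbs" selects the RDMA verbs transport and is only available when
    # TensorFlow was built with verbs support. "grpc" is the default transport.
    protocol = "grpc+verbs" if use_rdma else "grpc"
    return tf.train.Server(
        tf.train.ClusterSpec(cluster),
        job_name=job_name,
        task_index=task_index,
        protocol=protocol,
    )


# Hypothetical host names and ports, for illustration only.
cluster = build_cluster(
    ps_hosts=["node1:2222"],
    worker_hosts=["node2:2222", "node3:2222"],
)
```

Each node would then call start_server with its own job name and task index, for example start_server(cluster, "worker", 0). Passing use_rdma=False falls back to plain gRPC, which can serve as a baseline when comparing results.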

Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by Berkeley AI Research (BAIR) and by community contributors. Caffe2 adds support for RDMA. The code can be found at: https://github.com/caffe2/caffe2/tree/master/caffe2/contrib/gloo.

To test distributed training performance, we will use the ImageNet dataset and the following networks:

  • Inception v3
  • ResNet-152
  • VGG16

The TensorFlow RDMA user guide can be found at: https://community.mellanox.com/docs/DOC-2852