Deep Learning Frameworks
Category: Machine Intelligence
Major Frameworks
- Tensorflow
- Torch / PyTorch
- Caffe / Caffe2
- MXNet
- CNTK
- DL4J
- Theano
- Chainer
Minor Frameworks
- CoreML
- Nervana Neon
- DyNet
- DSSTNE
- PaddlePaddle
- SystemML
- BigDL
Higher Level APIs / Wrappers
- Keras [Theano, Tensorflow, MXNet, CNTK]
- Sonnet [Tensorflow]
- TFLearn [Tensorflow]
- TFSlim [Tensorflow]
- Gluon [MXNet]
- Lasagne [Theano]
- ScalNet [DL4J]
Technical Differentiating Factors
- Static vs. Dynamic Graph Declaration
- Autograd (Reverse Mode)
- Model Parallelism
- Compiler
- Language API
- RNN Support
- Productionizability / Model Serving
- Long Tail Support
- Localization
- Segmentation
- Regression
- Generation
- ...
- Custom Extensions
- Speed
- Chainer Benchmarks
- https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html
- Convnet-benchmarks (Soumith)
- https://github.com/soumith/convnet-benchmarks
- DeepBench (Baidu)
- https://github.com/baidu-research/DeepBench
- Benchmarking DL Software Tools (Hong Kong Baptist University Paper)
- https://arxiv.org/pdf/1608.07249.pdf
- Hardware
- Model / Data Parallelism
- Adoption
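One of the factors above, reverse-mode autograd, is simple enough to sketch in a few lines of pure Python. The `Var` class below is purely illustrative, not any framework's actual API; real frameworks do the same bookkeeping over tensors rather than scalars.

```python
# Minimal sketch of reverse-mode automatic differentiation ("autograd").
# Each operation records its inputs and local gradients; backward() then
# walks the recorded graph in reverse, accumulating gradients by the chain rule.

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local gradient)
        self.grad = 0.0

    def __mul__(self, other):
        # d(a*b)/da = b.value, d(a*b)/db = a.value
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, grad=1.0):
        # Accumulate this node's gradient, then propagate to its parents.
        self.grad += grad
        for parent, local_grad in self.parents:
            parent.backward(grad * local_grad)

x = Var(3.0)
y = Var(4.0)
z = x * y + x   # z = x*y + x
z.backward()    # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

Because gradients are accumulated additively, a variable used in several places (like `x` here) correctly sums the contributions from every path through the graph.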
Other Overviews & Comparisons
- Chainer Comparisons
- https://docs.chainer.org/en/latest/comparison.html
- DL4J [2016]
- https://deeplearning4j.org/compare-dl4j-torch7-pylearn
- DeepFrameworks [2016]
- https://github.com/zer0n/deepframeworks
Tensorflow
- CPU, GPU, Distributed CPU, Distributed GPU support.
- TPU support - a notable downside, since TPU-optimized code targets hardware unavailable outside Google Cloud.
- RNN support, though the static graph makes variably structured inputs difficult, and variable input lengths require hard-coded workarounds. LSTM / GRU / RNN performance is substantially slower than in Torch / CNTK / others.
- CNN support, full. Early to have new architectures implemented as custom kernels.
- Slower than other frameworks, in part due to the lack of in-place matrix operations: a matrix must be copied before it can be operated on.
- Python API. Too low-level for most data scientists, which has driven the popularity of many wrappers (TFLearn, Keras, Sonnet). Other APIs are available, e.g. C++ for production.
- Somewhat bloated API.
- Strongly supported by the Google Brain team.
- Compiler for linear algebra operations - XLA.
- Computational graph tooling still closed-source.
- Excellent visualization w/ Tensorboard.
- Autograd for reverse mode automatic differentiation
- Static computational graph. Upsides for production, downsides for flexible inputs - see bottom comparison for more information.
- Model and data parallelism.
- Extremely robust documentation
- Broad and robust model serving options, but through Tensorflow Serving. Mobile serving for Android and iOS.
- Extremely strong adoption in research and industry.
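The model and data parallelism bullet above can be made concrete with a framework-agnostic sketch of synchronous data parallelism. All the names here are invented for illustration, and the sequential list comprehension stands in for workers that a real framework would run in parallel on separate GPUs or machines.

```python
# Sketch of synchronous data parallelism for a 1-parameter linear model
# (y_hat = w * x) trained with mean squared error.

def grad_on_shard(w, shard):
    # Gradient of MSE on one worker's shard of the batch.
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)

def data_parallel_step(w, data, n_workers, lr=0.1):
    # 1. Split the batch across workers.
    shards = [data[i::n_workers] for i in range(n_workers)]
    # 2. Each worker computes a gradient on its shard (in parallel, in reality).
    grads = [grad_on_shard(w, s) for s in shards if s]
    # 3. Average the gradients and apply one synchronized update.
    return w - lr * sum(grads) / len(grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, data, n_workers=2)
# w converges to 2.0
```

Model parallelism, by contrast, would split the parameters themselves (e.g. different layers on different devices) rather than the data.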
Torch / PyTorch
- CPU, GPU, Distributed CPU, Distributed GPU support.
- Strong, fast RNN / LSTM / GRU support.
- Full CNN support.
- The API for Torch is in LuaJIT; the API for PyTorch is in Python. The back-ends also differ, with PyTorch's being more modern.
- Supported by Facebook, Twitter, and various academic labs. Formerly supported by DeepMind.
- No linear algebra compiler, and unlikely to get one due to the dynamic computational graph.
- Autograd for reverse mode automatic differentiation through Twitter Cortex implementation for Torch, and built into PyTorch core.
- Dynamic computation graph. (See comparison below)
- Model and Data parallelism.
- Poorly suited to production (Python must run in production, rather than faster C / C++ / Java)
- A converter to Caffe 2 is planned
- Integrates cleanly with NumPy / Cython and the Python data ecosystem
- Somewhat spotty documentation
- Strong adoption in research communities, weak adoption from industry.
Caffe / Caffe 2
- CPU, GPU, Distributed CPU, Distributed GPU support.
- Extremely weak RNN / LSTM / GRU support.
- Extremely strong CNN support.
- Python / C++ API.
- The API is inelegant for very deep networks (the architecture is kept in a config file, with details spelled out for each layer)
- Supported by Berkeley, Caffe 2 created by Facebook.
- No Autograd / Reverse Mode Automatic Differentiation.
- Static computation.
- Model and Data Parallelism.
- Caffe 2 optimized for production, including on mobile.
- Strong adoption for image processing in industry and research.
MXNet
- CPU, GPU, Distributed CPU, Distributed GPU support.
- RNN / LSTM / GRU Support.
- CNN Support.
- Python API / others.
- Supported by Amazon, created by researchers at University of Washington
- Autograd through NNVM, a C++-based computational graph library for networks, abstracted out so that DL libraries can be customized and made more modular.
- Mixed Static / Dynamic computational graph, taking advantage of both paradigms.
- Model and Data Parallelism.
- A single library for both research and production (rather than separate libraries like PyTorch / Caffe 2), aiming to provide one solution for industry and research.
- Limited adoption in industry.
CNTK
- CPU, GPU, Distributed CPU, Distributed GPU support.
- 1-bit SGD (a distributed computation algorithm) is restrictively licensed; standard asynchronous SGD can still be used
- Strong RNN / LSTM / GRU Support.
- Strong CNN Support.
- Python / C++ API
- Backed by Microsoft.
- Works well on Windows
- No compiler.
- Implements Autograd with reverse mode automatic differentiation.
- Static Computational Graph.
- Model and Data Parallelism.
- Production through Azure, as well as Windows production options.
- Limited adoption in industry.
DL4J
- CPU, GPU, Distributed CPU, Distributed GPU support.
- Weak RNN / LSTM / GRU Support.
- Strong CNN Support.
- Inelegant API.
- Java / Scala / Python APIs, focused on Hadoop / Spark ecosystem.
- Supported by Skymind. ~15 person startup, headed by Adam Gibson and Chris Nicholson.
- No linear algebra compiler.
- Backed by ND4J, a linear algebra tensor library connected to BLAS that abstracts over hardware.
- No autograd.
- Static computation
- Data Parallelism, no model parallelism.
- Production through Skymind Intelligence Layer, which is not open source
- Some adoption in Japan / China.
Theano
- An old and foundational deep learning library, created by MILA
- CPU and GPU support. No distributed CPU, no distributed GPU support.
- Support for RNNs and LSTMs.
- Python API. Very low level. Wrapped by Keras and Lasagne
- Supported by MILA, but relatively inactive due to new frameworks.
- Does support autograd.
- No linear algebra compiler.
- Static computation.
- Poor for production, since Python must run in production, though a compiled Theano function can be pickled and loaded into a server.
- Moderate adoption in research.
Chainer
- CPU, GPU, Distributed CPU, Distributed GPU support.
- Known as an innovator with a dynamic computational graph and great speed.
- Strong RNN / LSTM / GRU support.
- CNN support.
- Python API
- Supported by Intel, in partnership with Preferred Networks (which created Chainer) in Japan.
- No linear algebra compiler, and unlikely to get one due to dynamic computational graph.
- Model and data parallelism.
- Few to no examples of Chainer’s use in production.
- Weak adoption.
Minor Frameworks
- CoreML is Apple's iOS ML library, optimized for mobile production use
- Nervana Neon was acquired by Intel; it works on abstracting over hardware and has great speed.
- DyNet is a new NLP-focused library out of a number of universities, primarily CMU, applying a dynamic computation graph.
- DSSTNE is an Amazon library focused on deep recommender systems in production; it has fallen out of favor relative to MXNet.
- PaddlePaddle is Baidu’s ML library, with strong machine translation (Chinese <-> English) applications.
- BigDL is Intel’s Spark focused distributed CPU library, working on Intel’s chips through MKL.
DyNet Fact List
- Focused on NLP
- Dynamic computational graph
- ‘Dy’ in the name stands for Dynamic
- Developed by Carnegie Mellon and many others
- Applications
- Syntactic Parsing [https://github.com/clab/lstm-parser]
- Machine Translation [https://github.com/neubig/lamtram]
- Morphological Inflection [https://github.com/mfaruqui/morph-trans]
- Python bindings over C++
Positive Feedback
There is a positive feedback loop: being popular means that new research and custom kernels are implemented first in your framework, and as you accumulate more base code, more people want to use your framework, which in turn draws still more new research and custom code to it. The Fast / Faster R-CNN example is a clean one. Tensorflow is benefiting the most from this.
Static vs. Dynamic Graph Declaration
Static Declaration
- Definition
- A computational architecture is defined (typically as a computational graph), which can be differentiated using autodiff and optimized. In a separate step, the graph is executed.
- Advantages
- Computational graph can be optimized for performance
- Ease of scheduling computation across many workers
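A minimal pure-Python sketch of static declaration, with an invented toy graph representation: the graph is built once as data, then executed in a separate step, which is exactly what gives a framework room to optimize or distribute it before running.

```python
# Toy "define then run" graph: nodes are (op, inputs) pairs keyed by name.

# Step 1: define the graph once, with no concrete values yet.
graph = {
    "x":   ("input", ()),
    "y":   ("input", ()),
    "xy":  ("mul", ("x", "y")),
    "out": ("add", ("xy", "x")),   # out = x*y + x
}

OPS = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}

# Step 2: execute the fixed graph, "feeding" concrete values to the inputs.
def run(graph, node, feed):
    op, inputs = graph[node]
    if op == "input":
        return feed[node]
    return OPS[op](*(run(graph, i, feed) for i in inputs))

result = run(graph, "out", {"x": 3.0, "y": 4.0})   # 3*4 + 3 = 15.0
```

Because the graph exists as data before execution, a framework can analyze it: fuse operations, plan memory, or assign subgraphs to different workers.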
Dynamic Declaration
- Definition
- No separate steps for graph definition and execution. The necessary computation graph is created on the fly, with a new graph for each training instance.
- Advantages
- Variable sized inputs
- Naturally allows for different sizes to the input, say images of differing size or sequences of different lengths
- TF and Theano have special dynamic_rnn and scan operations for variable length sequences in RNNs
- Variably structured inputs
- Say, tree or graph inputs
- Nontrivial inference algorithms
- Say, Bayes risk or marginal likelihood approximation
- Variably structured outputs
- The structure of the output can change as a function of computations earlier in the graph.
- Easy to debug
- Ability to inspect, visualize and generally interact with components of the graph
- Very difficult to implement attentional models without it, and these are crucial for seq2seq models in NLP; doing SOTA neural machine translation, question answering, or dialogue is hard without them.
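A minimal pure-Python sketch of dynamic declaration, using an invented RNN-like fold for illustration: ordinary control flow builds a fresh computation per instance, so variable-length inputs need no padding and no special scan operator.

```python
# Toy "define by run" computation: the loop length, and hence the graph,
# differs per input, which is exactly what static declaration makes awkward.

def encode(sequence, w=0.5):
    # An RNN-like fold over the sequence: one "cell" per element,
    # created on the fly as the Python loop runs.
    h = 0.0
    for x in sequence:
        h = w * h + x
    return h

short = encode([1.0, 2.0])            # a 2-step computation
long  = encode([1.0, 2.0, 3.0, 4.0])  # a 4-step computation, same code
```

The same pattern extends to trees or graphs as inputs: recursion over the structure builds exactly the computation that instance needs, and debugging is ordinary Python debugging.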
Source: Original Google Doc