Deep Learning Frameworks
Category: Machine Intelligence
Major Frameworks
- Tensorflow
- Torch / PyTorch
- Caffe / Caffe2
- MXNet
- CNTK
- DL4J
- Theano
- Chainer
Minor Frameworks
- CoreML
- Nervana Neon
- DyNet
- DSSTNE
- PaddlePaddle
- SystemML
- BigDL
Higher Level APIs / Wrappers
- Keras [Theano, Tensorflow, MXNet, CNTK]
- Sonnet [Tensorflow]
- TFLearn [Tensorflow]
- TFSlim [Tensorflow]
- Gluon [MXNet]
- Lasagne [Theano]
- ScalNet [DL4J]
Technical Differentiating Factors
- Static vs. Dynamic Graph Declaration
- Autograd (Reverse Mode)
- Model Parallelism
- Compiler
- Language API
- RNN Support
- Productionizability / Model Serving
- Long Tail Support
- Localization
- Segmentation
- Regression
- Generation
- ...
- Custom Extensions
- Speed
- Chainer Benchmarks
- https://chainer.org/general/2017/02/08/Performance-of-Distributed-Deep-Learning-Using-ChainerMN.html
- Convnet-benchmarks (Soumith)
- https://github.com/soumith/convnet-benchmarks
- DeepBench (Baidu)
- https://github.com/baidu-research/DeepBench
- Benchmarking DL Software Tools (Hong Kong Baptist University Paper)
- https://arxiv.org/pdf/1608.07249.pdf
- Hardware
- Model / Data Parallelism
- Adoption
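One of the factors above, reverse-mode autograd, is simple enough to sketch in a few lines of pure Python. The `Var` class below is purely illustrative, not any framework's actual API; real frameworks do the same bookkeeping over tensors rather than scalars.

```python
# Minimal sketch of reverse-mode automatic differentiation ("autograd").
# Each operation records its inputs and local gradients; backward() then
# walks the recorded graph in reverse, accumulating gradients by the chain rule.

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local gradient)
        self.grad = 0.0

    def __mul__(self, other):
        # d(a*b)/da = b.value, d(a*b)/db = a.value
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, grad=1.0):
        # Accumulate this node's gradient, then propagate to its parents.
        self.grad += grad
        for parent, local_grad in self.parents:
            parent.backward(grad * local_grad)

x = Var(3.0)
y = Var(4.0)
z = x * y + x   # z = x*y + x
z.backward()    # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

Because gradients are accumulated additively, a variable used in several places (like `x` here) correctly sums the contributions from every path through the graph.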
Other Overviews & Comparisons
- Chainer Comparisons
- https://docs.chainer.org/en/latest/comparison.html
- DL4J [2016]
- https://deeplearning4j.org/compare-dl4j-torch7-pylearn
- DeepFrameworks [2016]
- https://github.com/zer0n/deepframeworks
Tensorflow
- CPU, GPU, Distributed CPU, Distributed GPU support.
- TPU support - a notable downside, since TPU-optimized code targets hardware unavailable outside Google Cloud.
- RNN support, though the static graph makes variably structured inputs difficult, and variable input lengths require hard-coded workarounds. LSTM / GRU / RNN performance is substantially slower than in Torch / CNTK / others.
- CNN support, full. Early to have new architectures implemented as custom kernels.
- Slower than other frameworks, in part due to the lack of in-place matrix operations: a matrix must be copied before it can be operated on.
- Python API. Too low-level for most data scientists, which has driven the popularity of many wrappers (TFLearn, Keras, Sonnet). Other APIs are available, e.g. C++ for production.
- Somewhat bloated API.
- Strongly supported by the Google Brain team.
- Compiler for linear algebra operations - XLA.
- Computational graph tooling still closed-source.
- Excellent visualization w/ Tensorboard.
- Autograd for reverse mode automatic differentiation
- Static computational graph. Upsides for production, downsides for flexible inputs - see bottom comparison for more information.
- Model and data parallelism.
- Extremely robust documentation
- Broad and robust model serving options, but through Tensorflow Serving. Mobile serving for Android and iOS.
- Extremely strong adoption in research and industry.
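The model and data parallelism bullet above can be made concrete with a framework-agnostic sketch of synchronous data parallelism. All the names here are invented for illustration, and the sequential list comprehension stands in for workers that a real framework would run in parallel on separate GPUs or machines.

```python
# Sketch of synchronous data parallelism for a 1-parameter linear model
# (y_hat = w * x) trained with mean squared error.

def grad_on_shard(w, shard):
    # Gradient of MSE on one worker's shard of the batch.
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)

def data_parallel_step(w, data, n_workers, lr=0.1):
    # 1. Split the batch across workers.
    shards = [data[i::n_workers] for i in range(n_workers)]
    # 2. Each worker computes a gradient on its shard (in parallel, in reality).
    grads = [grad_on_shard(w, s) for s in shards if s]
    # 3. Average the gradients and apply one synchronized update.
    return w - lr * sum(grads) / len(grads)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, data, n_workers=2)
# w converges to 2.0
```

Model parallelism, by contrast, would split the parameters themselves (e.g. different layers on different devices) rather than the data.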
Torch / PyTorch
- CPU, GPU, Distributed CPU, Distributed GPU support.
- Strong, fast RNN / LSTM / GRU support.
- Full CNN support.
- The API for Torch is in LuaJIT; the API for PyTorch is in Python. The back-ends also differ, with PyTorch's being more modern.
- Supported by Facebook, Twitter, and various academic labs. Formerly supported by DeepMind.
- No linear algebra compiler, and unlikely to get one due to the dynamic computational graph.
- Autograd for reverse mode automatic differentiation through Twitter Cortex implementation for Torch, and built into PyTorch core.
- Dynamic computation graph. (See comparison below)
- Model and Data parallelism.
- Poorly suited to production (Python must run in production, rather than faster C / C++ / Java)
- A converter to Caffe 2 is planned
- Integrates cleanly with NumPy / Cython and the Python data ecosystem
- Somewhat spotty documentation
- Strong adoption in research communities, weak adoption from industry.
Caffe / Caffe 2
- CPU, GPU, Distributed CPU, Distributed GPU support.
- Extremely weak RNN / LSTM / GRU support.
- Extremely strong CNN support.
- Python / C++ API.
- The API is inelegant for very deep networks (the architecture is kept in a config file, with details spelled out for each layer)
- Supported by Berkeley, Caffe 2 created by Facebook.
- No Autograd / Reverse Mode Automatic Differentiation.
- Static computation.
- Model and Data Parallelism.
- Caffe 2 optimized for production, including on mobile.
- Strong adoption for image processing in industry and research.
MXNet
- CPU, GPU, Distributed CPU, Distributed GPU support.
- RNN / LSTM / GRU Support.
- CNN Support.
- Python API / others.
- Supported by Amazon, created by researchers at University of Washington
- Autograd through NNVM, a C++-based computational graph library for networks, abstracted out so that DL libraries can be customized and made more modular.
- Mixed Static / Dynamic computational graph, taking advantage of both paradigms.
- Model and Data Parallelism.
- A single library for both research and production (rather than separate libraries like PyTorch / Caffe 2), aiming to provide one solution for industry and research.
- Limited adoption in industry.
CNTK
- CPU, GPU, Distributed CPU, Distributed GPU support.
- 1-bit SGD (a distributed computation algorithm) is restrictively licensed; standard asynchronous SGD can still be used
- Strong RNN / LSTM / GRU Support.
- Strong CNN Support.
- Python / C++ API
- Backed by Microsoft.
- Works well on Windows
- No compiler.
- Implements Autograd with reverse mode automatic differentiation.
- Static Computational Graph.
- Model and Data Parallelism.
- Production through Azure, as well as Windows production options.
- Limited adoption in industry.
DL4J
- CPU, GPU, Distributed CPU, Distributed GPU support.
- Weak RNN / LSTM / GRU Support.
- Strong CNN Support.
- Inelegant API.
- Java / Scala / Python APIs, focused on Hadoop / Spark ecosystem.
- Supported by Skymind. ~15 person startup, headed by Adam Gibson and Chris Nicholson.
- No linear algebra compiler.
- Backed by ND4J, a linear algebra tensor library connected to BLAS that abstracts over hardware.
- No autograd.
- Static computation
- Data Parallelism, no model parallelism.
- Production through Skymind Intelligence Layer, which is not open source
- Some adoption in Japan / China.
Theano
- An old and foundational deep learning library, created by MILA
- CPU and GPU support. No distributed CPU, no distributed GPU support.
- Support for RNNs and LSTMs.
- Python API. Very low level. Wrapped by Keras and Lasagne
- Supported by MILA, but relatively inactive due to new frameworks.
- Does support autograd.
- No linear algebra compiler.
- Static computation.
- Poor for production, since Python must run in production, though a compiled Theano function can be pickled and loaded into a server.
- Moderate adoption in research.
Chainer
- CPU, GPU, Distributed CPU, Distributed GPU support.
- Known as an innovator with a dynamic computational graph and great speed.
- Strong RNN / LSTM / GRU support.
- CNN support.
- Python API
- Supported by Intel, in partnership with Preferred Networks (which created Chainer) in Japan.
- No linear algebra compiler, and unlikely to get one due to dynamic computational graph.
- Model and data parallelism.
- Few to no examples of Chainer’s use in production.
- Weak adoption.
Minor Frameworks
- CoreML is Apple's iOS ML library, optimized for mobile production use
- Nervana Neon was acquired by Intel; it works on abstracting over hardware and has great speed.
- DyNet is a new NLP-focused library out of a number of universities, primarily CMU, applying a dynamic computation graph.
- DSSTNE is an Amazon library focused on deep recommender systems in production; it has fallen out of favor relative to MXNet.
- PaddlePaddle is Baidu’s ML library, with strong machine translation (Chinese <-> English) applications.
- BigDL is Intel’s Spark focused distributed CPU library, working on Intel’s chips through MKL.
DyNet Fact List
- Focused on NLP
- Dynamic computational graph
- ‘Dy’ in the name stands for Dynamic
- Developed by Carnegie Mellon and many others
- Applications
- Syntactic Parsing [https://github.com/clab/lstm-parser]
- Machine Translation [https://github.com/neubig/lamtram]
- Morphological Inflection [https://github.com/mfaruqui/morph-trans]
- Python bindings over C++
Positive Feedback
There is a positive feedback loop: being popular means that new research and custom kernels are implemented first in your framework, and as you accumulate more base code, more people want to use your framework, which in turn draws still more new research and custom code to it. The Fast / Faster R-CNN example is a clean one. Tensorflow is benefiting the most from this.
Static vs. Dynamic Graph Declaration
Static Declaration
- Definition
- A computational architecture is defined (typically as a computational graph), which can be differentiated using autodiff and optimized. In a separate step, the graph is executed.
- Advantages
- Computational graph can be optimized for performance
- Ease of scheduling computation across many workers
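A minimal pure-Python sketch of static declaration, with an invented toy graph representation: the graph is built once as data, then executed in a separate step, which is exactly what gives a framework room to optimize or distribute it before running.

```python
# Toy "define then run" graph: nodes are (op, inputs) pairs keyed by name.

# Step 1: define the graph once, with no concrete values yet.
graph = {
    "x":   ("input", ()),
    "y":   ("input", ()),
    "xy":  ("mul", ("x", "y")),
    "out": ("add", ("xy", "x")),   # out = x*y + x
}

OPS = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}

# Step 2: execute the fixed graph, "feeding" concrete values to the inputs.
def run(graph, node, feed):
    op, inputs = graph[node]
    if op == "input":
        return feed[node]
    return OPS[op](*(run(graph, i, feed) for i in inputs))

result = run(graph, "out", {"x": 3.0, "y": 4.0})   # 3*4 + 3 = 15.0
```

Because the graph exists as data before execution, a framework can analyze it: fuse operations, plan memory, or assign subgraphs to different workers.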
Dynamic Declaration
- Definition
- No separate steps for graph definition and execution. The necessary computation graph is created on the fly, with a new graph for each training instance.
- Advantages
- Variable sized inputs
- Naturally allows for different sizes to the input, say images of differing size or sequences of different lengths
- TF and Theano have special dynamic_rnn and scan operations for variable length sequences in RNNs
- Variably structured inputs
- Say, tree or graph inputs
- Nontrivial inference algorithms
- Say, Bayes risk or marginal likelihood approximation
- Variably structured outputs
- The structure of the output can change as a function of computations earlier in the graph.
- Easy to debug
- Ability to inspect, visualize and generally interact with components of the graph
- Very difficult to implement attentional models without it, and these are crucial for seq2seq models in NLP; doing SOTA neural machine translation, question answering, or dialogue is hard without them.
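A minimal pure-Python sketch of dynamic declaration, using an invented RNN-like fold for illustration: ordinary control flow builds a fresh computation per instance, so variable-length inputs need no padding and no special scan operator.

```python
# Toy "define by run" computation: the loop length, and hence the graph,
# differs per input, which is exactly what static declaration makes awkward.

def encode(sequence, w=0.5):
    # An RNN-like fold over the sequence: one "cell" per element,
    # created on the fly as the Python loop runs.
    h = 0.0
    for x in sequence:
        h = w * h + x
    return h

short = encode([1.0, 2.0])            # a 2-step computation
long  = encode([1.0, 2.0, 3.0, 4.0])  # a 4-step computation, same code
```

The same pattern extends to trees or graphs as inputs: recursion over the structure builds exactly the computation that instance needs, and debugging is ordinary Python debugging.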
Source: Original Google Doc