
.. _crei2013_theano:

******
Theano
******

Pointers
--------

* http://deeplearning.net/software/theano/
* Announcements mailing list: http://groups.google.com/group/theano-announce
* User mailing list: http://groups.google.com/group/theano-users
* Deep Learning Tutorials: http://www.deeplearning.net/tutorial/
* Installation: https://deeplearning.net/software/theano/install.html


Description
-----------

* Mathematical symbolic expression compiler
* Dynamic C/CUDA code generation
* Efficient symbolic differentiation

  * Theano computes derivatives of functions with one or many inputs.

* Speed and stability optimizations

  * Gives the right answer for ``log(1+x)`` even if x is really tiny.

* Works on Linux, Mac and Windows
* Transparent use of a GPU

  * float32 only for now (working on other data types)
  * Still in experimental state on Windows
  * On GPU data-intensive calculations are typically between 6.5x and 44x faster. We've seen speedups up to 140x

* Extensive unit-testing and self-verification

  * Detects and diagnoses many types of errors

* On CPU, common machine learning algorithms are 1.6x to 7.5x faster than competitive alternatives

  * including specialized implementations in C/C++, NumPy, SciPy, and Matlab

* Expressions mimic NumPy's syntax & semantics
* Statically typed and purely functional
* Some sparse operations (CPU only)

Simple example
--------------

>>> import theano
>>> a = theano.tensor.vector("a")      # declare symbolic variable
>>> b = a + a ** 10                    # build symbolic expression
>>> f = theano.function([a], b)        # compile function
>>> f([0, 1, 2])
array([    0.,     2.,  1026.])

======================================================  =====================================================
        Unoptimized graph                                    Optimized graph
======================================================  =====================================================
.. image:: ../hpcs2011_tutorial/pics/f_unoptimized.png   .. image:: ../hpcs2011_tutorial/pics/f_optimized.png
======================================================  =====================================================

Symbolic programming = *Paradigm shift*: people need to use it to understand it.

Exercise 1
-----------

.. code-block:: python

  import theano
  a = theano.tensor.vector()      # declare variable
  out = a + a ** 10               # build symbolic expression
  f = theano.function([a], out)   # compile function
  print f([0, 1, 2])
  # prints `array([0, 2, 1026])`

  theano.printing.pydotprint_variables(b, outfile="f_unoptimized.png", var_with_name_simple=True)
  theano.printing.pydotprint(f, outfile="f_optimized.png", var_with_name_simple=True)

Modify and execute the example to do this expression: ``a ** 2 + b ** 2 + 2 * a * b``

Real example
------------

**Logistic Regression**

* GPU-ready
* Symbolic differentiation
* Speed optimizations
* Stability optimizations

.. literalinclude:: logreg.py


**Optimizations:**

Where are those optimization applied?

* ``log(1+exp(x))``

* ``1 / (1 + tt.exp(var))`` (sigmoid)

* ``log(1-sigmoid(var))`` (softplus, stabilisation)

* GEMV (matrix-vector multiply from BLAS)

* Loop fusion


.. code-block:: python

  p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))
  # 1 / (1 + tt.exp(var)) -> sigmoid(var)
  xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)
  # Log(1-sigmoid(var)) -> -sigmoid(var)
  prediction = p_1 > 0.5
  cost = xent.mean() + 0.01 * (w ** 2).sum()
  gw, gb = tt.grad(cost, [w, b])

  train = theano.function(
            inputs=[x, y],
            outputs=[prediction, xent],
            # w - 0.1 * gw: GEMV with the dot in the grad
            updates=[(w, w - 0.1 * gw),
                     (b, b - 0.1 * gb)])


Theano flags
------------

Theano can be configured with flags. They can be defined in two ways

* With an environment variable: ``THEANO_FLAGS="profile=True,profile_memory=True"``

* With a configuration file that defaults to ``~/.theanorc``


Exercise 2
-----------

.. code-block:: python

    import numpy
    import theano
    import theano.tensor as tt
    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
    rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = tt.matrix("x")
    y = tt.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()


    # Construct Theano expression graph
    p_1 = 1 / (1 + tt.exp(-tt.dot(x, w) - b))  # Probability of having a one
    prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
    xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
    cost = xent.mean() + 0.01 * (w**2).sum()  # The cost to optimize
    gw,gb = tt.grad(cost, [w, b])

    # Compile expressions to functions
    train = theano.function(
                inputs=[x, y],
                outputs=[prediction, xent],
                updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
                name="train")
    predict = theano.function(inputs=[x], outputs=prediction,
                              name="predict")

    if any([x.op.__class__.__name__=='Gemv' for x in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([x.op.__class__.__name__=='GpuGemm' for x in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()



    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]

    print "prediction on D"
    print predict(D[0])

    # Print the graph used in the slides
    theano.printing.pydotprint(predict,
                               outfile="pics/logreg_pydotprint_predic.png",
                               var_with_name_simple=True)
    theano.printing.pydotprint_variables(prediction,
                               outfile="pics/logreg_pydotprint_prediction.png",
                               var_with_name_simple=True)
    theano.printing.pydotprint(train,
                               outfile="pics/logreg_pydotprint_train.png",
                               var_with_name_simple=True)

Modify and execute the example to run on CPU with floatX=float32

* You will need to use: ``theano.config.floatX`` and ``ndarray.astype("str")``

GPU
---

* Only 32 bit floats are supported (being worked on)
* Only 1 GPU per process. Wiki page on using multiple process for multiple GPU
* Use the Theano flag ``device=gpu`` to tell to use the GPU device

 * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one
 * Shared variables with float32 dtype are by default moved to the GPU memory space

* Use the Theano flag ``floatX=float32``

 * Be sure to use ``floatX`` (``theano.config.floatX``) in your code
 * Cast inputs before putting them into a shared variable
 * Cast "problem": int32 with float32 to float64

  * Insert manual cast in your code or use [u]int{8,16}
  * The mean operator is worked on to make the output stay in float32.

* Use the Theano flag ``force_device=True``, to exit if Theano isn't able to use a GPU.

  * Theano 0.6rc4 will have the combination of ``force_device=True``
    and ``device=cpu`` disable the GPU.



Exercise 3
-----------

* Modify and execute the example of `Exercise 2`_ to run with floatX=float32 on GPU

* Time with: ``time python file.py``

Symbolic variables
------------------

* # Dimensions

 * tt.scalar, tt.vector, tt.matrix, tt.tensor3, tt.tensor4

* Dtype

 * tt.[fdczbwil]vector (float32, float64, complex64, complex128, int8, int16, int32, int64)

 * tt.vector to floatX dtype

 * floatX: configurable dtype that can be float32 or float64.

* Custom variable

 * All are shortcuts to: ``tt.tensor(dtype, broadcastable=[False]*nd)``

 * Other dtype: uint[8,16,32,64], floatX

Creating symbolic variables: Broadcastability

* Remember what I said about broadcasting?

* How to add a row to all rows of a matrix?

* How to add a column to all columns of a matrix?


Details regarding symbolic broadcasting...

* Broadcastability must be specified when creating the variable

* The only shorcut with broadcastable dimensions are: **tt.row** and **tt.col**

* For all others: ``tt.tensor(dtype, broadcastable=([False or True])*nd)``


Differentiation details
-----------------------

>>> gw, gb = tt.grad(cost, [w,b])  # doctest: +SKIP

* tt.grad works symbolically: takes and returns a Theano variable

* tt.grad can be compared to a macro: it can be applied multiple times

* tt.grad takes scalar costs only

* Simple recipe allows to compute efficiently vector x Jacobian and vector x Hessian

* We are working on the missing optimizations to be able to compute efficently the full Jacobian and Hessian and Jacobian x vector


Old Benchmarks
--------------

:ref:`Example: <cifar2011_benchmark>`

* Multi-layer perceptron
* Convolutional Neural Networks
* Misc Elemwise operations

Competitors: NumPy + SciPy, MATLAB, EBLearn, Torch5, numexpr

* EBLearn, Torch5: specialized libraries written by practitioners specifically for these tasks
* numexpr: similar to Theano, 'virtual machine' for elemwise expressions

New Benchmarks
--------------

`Example <http://arxiv.org/pdf/1211.5590v1.pdf>`_ (Page 7 and 9):

* Logistic regression, MLP with 1 and 3 layers
* Recurrent neural networks

Competitors: Torch7, RNNLM

* Torch7, RNNLM: specialized libraries written by practitioners specifically for these tasks
