Lecture 3

  • FLOPS
  • Profiling
  • Python classes
  • How many floating point operations are required?

FLOPS = Floating point operations per second

Classes = Object oriented programming

Example estimates for the number of floating point operations: multiplying two matrices, N-body simulations, hydro simulations

In [2]:
# using python profiler

import cProfile
import numpy as np

# nit is the number of iterations
def foo(nit,x):
    ssum = 1.0
    for i in range(nit):
        ssum += x*np.sin(x)   # this is an arbitrary operation
        
    return ssum

How many floating point operations are required in this routine?

Answer: at least 3 per loop: one multiply, one function call (which might be fast), and one add.

In [3]:
x = 1/np.sqrt(2.0)
cProfile.run('foo(int(1e7),x)')  # you insert the entire command within the quotes
         4 function calls in 12.739 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.739   12.739   12.739   12.739 <ipython-input-2-192acdfe8e74>:7(foo)
        1    0.000    0.000   12.739   12.739 <string>:1(<module>)
        1    0.000    0.000   12.739   12.739 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


In [5]:
#How fast was this?
print('{:.1e} FLOPS'.format( 3e7/12.7)) 
2.4e+06 FLOPS
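The same estimate can be made without the profiler by timing the call directly with `time.perf_counter`. This is a minimal sketch with a reduced iteration count so it runs quickly; the measured rate will vary by machine:

```python
import time
import numpy as np

def foo(nit, x):
    ssum = 1.0
    for i in range(nit):
        ssum += x * np.sin(x)
    return ssum

nit = int(1e5)   # fewer iterations than above so the check is quick
t0 = time.perf_counter()
foo(nit, 1/np.sqrt(2.0))
elapsed = time.perf_counter() - t0
print('{:.1e} FLOPS'.format(3 * nit / elapsed))  # ~3 floating point ops per iteration
```

`perf_counter` has less overhead than `cProfile`, so it is useful for a sanity check even when you also profile.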

foo_file.py contains the following

import numpy as np

def foo_file(nit,x):
    ssum = 1.0
    for i in range(nit):
        ssum += x*np.sin(x)

    return ssum

If we run an external file can we speed up the same calculation?

In [6]:
from foo_file import *
cProfile.run('foo_file(int(1e7),x)')
         4 function calls in 12.742 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.742   12.742 <string>:1(<module>)
        1   12.742   12.742   12.742   12.742 foo_file.py:4(foo_file)
        1    0.000    0.000   12.742   12.742 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


My laptop

2.6 GHz 6-Core Intel Core i7

Suppose a single operation occurs per clock cycle. Each clock cycle is $dt = \frac{1}{2.6\times 10^9} $ s long.

If I do $3 \times 10^7$ operations that each take 1 clock cycle the total length of time would be 0.01 s or so.

In [4]:
3e7/2.6e9
Out[4]:
0.011538461538461539

We should be able to run our function about 1000 times faster.

How many floating point operations and FLOPS were needed for the Illustris cosmological simulations?

https://www.illustris-project.org/about/#astronomers-three

Apparently: "Evolving the main simulation to z=0 used 8,192 compute cores, a peak memory of 25 TB, and 19 million CPU hours". Tera is $10^{12}$.

In [41]:
#What was the total number of floating point operations (approximately)?
#I think CPU hours refers to single cores.
hour = 60*60 # hour in seconds
time = 19e6*hour  # total time in seconds adding all CPUs together 
# suppose that each CPU does FP operations at 100 GHz, i.e. each one takes 10^-11 s
FPO = time/1e-11
print(FPO)
# this is probably an overestimate as some of the simulation time
# involves data transfer
6.84e+21

What are the FLOPS for Illustris?

If each core does $10^{11}$ FLOPS and there are 8,192 cores (about $10^4$ cores), the cluster does $10^{15}$ FLOPS = 1000 TFLOPS (tera floating point operations per second).
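This arithmetic can be checked directly, using the assumed per-core rate of $10^{11}$ FLOPS from the estimate above:

```python
cores = 8192            # from the Illustris project description
flops_per_core = 1e11   # assumed rate, as in the estimate above
cluster_flops = cores * flops_per_core
print('{:.1e} FLOPS'.format(cluster_flops))  # ~1e15, i.e. roughly 1000 TFLOPS
```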

Can we speed up our foo routine using compiled code?

The same foo routine in C in a file called foo.c

#include<stdio.h>
#include<math.h>
#include<time.h>

// for seeing about how fast it takes routines to do things
// comparing python vs c

double foo(int nit, double x);

int main()
{
   clock_t t0 =  clock();
   double ssum = foo((int)(1e7), 1./sqrt(2.));  // same x as the python version
   clock_t t1 =  clock();
   double time_diff = (double)(t1-t0)/(double)CLOCKS_PER_SEC;
   printf("clocks t0=%lu t1=%lu (t1-t0)=%lu\n",t0,t1,(t1-t0));
   // printf("%u \n",CLOCKS_PER_SEC);

   printf("%.12f seconds\n",time_diff);
}

double foo(int nit,double x){
   double ssum = 1.0;
   for(int i=0;i < nit;i++){
      ssum+= x*sin(x);
   }
   return ssum;
}

compiled with

gcc foo.c -lm -o foo.out

and run on the command line with

./foo.out

python to C comparison

I run the same routine in C and get

0.096453000000 seconds

How fast in FLOPS was this?

In [38]:
#How fast was the c routine in FLOPS ?
print('{:.1e} FLOPS'.format( 3e7/0.1)) 
3.0e+08 FLOPS

Note: if the compiler was smart it could have bypassed the loop.

Code speed

The sin, +, and * operations on my laptop probably take more than 1 clock cycle each.

There is overhead. There is more overhead in python.

You can estimate how long code should take by counting operations.

Using a profiler or with analysis you can identify which part of your code is most computationally intensive and attempt to optimize that part of the code.

There are trade-offs between code simplicity, accuracy, speed, flexibility and development time.

Some practical ways to speed up python code

  • vector arithmetic
  • vector indexing instead of looping and conditional statements
  • load information into vectors before plotting as the plotting routines involve overhead
  • off-load computationally intensive operations into compiled subroutines and libraries

Is foo computed with vectors faster?

In [12]:
# the same set of operations computed with vectors
def foo_vec(nit,x):
    xs = np.zeros(nit)
    x_vec = np.zeros(nit) + x  # is a vector
    sin_xvec = np.zeros(nit) + np.sin(x)  # is a vector
    xs += x_vec*sin_xvec
    ssum = np.sum(xs)
    return ssum
In [13]:
x = 1/np.sqrt(2.0)
cProfile.run('foo_vec(int(1e7),x)')
         16 function calls in 0.092 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.005    0.005 <__array_function__ internals>:2(sum)
        1    0.048    0.048    0.077    0.077 <ipython-input-12-c405b0065846>:2(foo_vec)
        1    0.015    0.015    0.092    0.092 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:2087(_sum_dispatcher)
        1    0.000    0.000    0.005    0.005 fromnumeric.py:2092(sum)
        1    0.000    0.000    0.005    0.005 fromnumeric.py:73(_wrapreduction)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:74(<dictcomp>)
        1    0.000    0.000    0.092    0.092 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.005    0.005 {built-in method numpy.core._multiarray_umath.implement_array_function}
        3    0.025    0.008    0.025    0.008 {built-in method numpy.zeros}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        1    0.005    0.005    0.005    0.005 {method 'reduce' of 'numpy.ufunc' objects}


In [14]:
print(12.7/0.1)
126.99999999999999

Speed up

We sped up the routine by a factor of 12.7/0.1 $\sim$ 100 using vectors (and not using a loop)

Many numpy functions will work on arrays.

Many math functions will not.
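A quick illustration of the difference: numpy's `sin` is a ufunc that broadcasts over an array, while `math.sin` expects a single scalar and raises a `TypeError` on an array of more than one element.

```python
import math
import numpy as np

arr = np.array([0.0, np.pi/2, np.pi])
print(np.sin(arr))    # numpy ufuncs broadcast over the whole array

try:
    math.sin(arr)     # math functions only accept a single scalar
except TypeError as e:
    print('math.sin failed:', e)
```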

Conditional statements on arrays

indexing with arrays of booleans

Question: Is it possible to do conditional statements on arrays without doing a loop?

In [3]:
xvec = np.array([0.1,0.3,0.7,0.3,0.9,0.1])
ii = (xvec < 0.5)  # a conditional statement
## ii is now an array of booleans
print(ii)
[ True  True False  True False  True]
In [4]:
#if (x < 0.5) then multiply x by -1
xvec[ii] *= -1
print(xvec)
[-0.1 -0.3  0.7 -0.3  0.9 -0.1]

This procedure is very useful when working with data tables.
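As a related sketch, `np.where` performs the same conditional replacement in a single call, without modifying the original array in place:

```python
import numpy as np

xvec = np.array([0.1, 0.3, 0.7, 0.3, 0.9, 0.1])
# where the condition holds take -xvec, otherwise keep xvec
result = np.where(xvec < 0.5, -xvec, xvec)
print(result)   # [-0.1 -0.3  0.7 -0.3  0.9 -0.1]
```

Boolean-index assignment and `np.where` are interchangeable here; `np.where` is convenient when you want a new array rather than an in-place edit.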

Python classes

  • Object Oriented Programming
  • C++ like
  • A means of bundling data and functionality together

For a tutorial see https://docs.python.org/3/tutorial/classes.html

Class definition

class Point():
    ...
    ...

Defining an object to be of that class

pt1 = Point()
In [45]:
class Point():
    def __init__(self, x, y):   # constructor
        self.x = x
        self.y = y   # internal data (x,y)
        
    # rotate x,y by angle (in radians)
    def rotate_by(self, angle):   # function operating on object
        cos_a = np.cos(angle)
        sin_a = np.sin(angle)
        newx = cos_a * self.x - sin_a * self.y 
        newy = sin_a * self.x + cos_a * self.y 
        self.x = newx
        self.y = newy
      
    # return atan2(y,x)
    def theta(self):   # a function operating on the object that returns something
        return np.arctan2(self.y,self.x)
    
    # return atan2(y,x) in degrees
    def theta_deg(self):   # using above function previously defined
        return self.theta()*180.0/np.pi
    
    # return sqrt(x^2+y^2)
    def length(self):
        return np.sqrt(self.x*self.x + self.y*self.y)
    
    def __str__(self):  # so can print()
        return "({},{})".format(self.x, self.y)

    def __repr__(self):  # if just entered on command line, what is printed out
        return "Point({},{})".format(self.x, self.y)  # should return a string
    
    # return a dot product with another point object
    def dot_prot_internal(self,pt):
        return self.x*pt.x + self.y*pt.y
    
In [46]:
pt1 = Point(1,0)  # using the constructor
# pt1 is now a point with x=1,y=0
In [47]:
print(pt1)   # __str__()
(1,0)
In [48]:
pt1   # __repr__()
Out[48]:
Point(1,0)
In [49]:
pt1.rotate_by(np.pi/2)
print(pt1)
(6.123233995736766e-17,1.0)
In [50]:
pt1.length()
Out[50]:
1.0
In [51]:
pt1.rotate_by(np.pi/4)
pt1.theta_deg()
Out[51]:
135.0
In [52]:
# Function defined outside the class
# take Point (x,y) and make it become (rx,ry)
def scaleby(self, r):
    self.x *= r
    self.y *= r

print(pt1)
scaleby(pt1,2)
print(pt1)
(-0.7071067811865475,0.7071067811865476)
(-1.414213562373095,1.4142135623730951)
In [53]:
# an external function that operates on two points and returns something
def dot_product(self1,self2):
    return self1.x*self2.x + self1.y*self2.y
In [54]:
pt1 = Point(1,0)
pt2 = Point(3,4)
dot_product(pt1,pt2)
Out[54]:
3
In [55]:
pt1.dot_prot_internal(pt2)  # the internal function for a dot product
Out[55]:
3
In [56]:
type(pt1)
Out[56]:
__main__.Point
In [57]:
# how do you check that you have 
# the expected class?
if (type(pt1) == Point):   # isinstance(pt1, Point) is the more idiomatic check
    print(True)
True

Question:

Can you define a class without the __init__ constructor?

Answer: yes.
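A minimal sketch (the class name `Bag` is made up for illustration): when no `__init__` is defined, Python supplies a default constructor taking no arguments, and attributes can be attached afterwards.

```python
class Bag():            # no __init__: Python supplies a default constructor
    def add(self, name, value):
        setattr(self, name, value)   # attach an attribute by name

b = Bag()               # construct with no arguments
b.add('x', 3)
print(b.x)              # 3
```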

How many floating point operations are required?

Some examples

Multiplying two NxN matrices

$${\bf C} = {\bf A} {\bf B} $$

$N$ gives the size of the square matrices.

For example $$ \begin{pmatrix} C_{00} & C_{01} & ... \\ C_{10} & C_{11} & ... \\ .. & .. & .. \end{pmatrix} = \begin{pmatrix} A_{00} & A_{01} & ... \\ A_{10} & A_{11} & ... \\ .. & .. & .. \end{pmatrix} \begin{pmatrix} B_{00} & B_{01} & ... \\ B_{10} & B_{11} & ... \\ .. & .. & .. \end{pmatrix} $$

$$C_{ij}= \sum_{k=0}^{N-1} A_{ik} B_{kj} $$

This must be done $N^2$ times, once for each of the possible values of $i,j$.

The sum itself has $N$ terms. Each term in the sum requires 1 multiplication giving $N$ total multiplications. The sum itself requires $N-1$ additions.
So each sum requires $2N$ floating point operations.

The total number of floating point operations is $N^2 \times 2N \approx 2 N^3$.
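The count can be verified with a direct triple-loop multiply that tallies its own operations (a sketch for small $N$ only; in practice numpy's `matmul` should be used):

```python
import numpy as np

def matmul_count(A, B):
    """Naive N x N matrix multiply that counts floating point operations."""
    N = A.shape[0]
    C = np.zeros((N, N))
    ops = 0
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i, j] += A[i, k] * B[k, j]   # one multiply + one add
                ops += 2
    return C, ops

N = 8
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C, ops = matmul_count(A, B)
print(ops, 2 * N**3)          # the tally matches the 2N^3 estimate
print(np.allclose(C, A @ B))  # and the result matches numpy's matmul
```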

N-body simulations (all pairs)

Each particle feels the gravitational force from every other particle. $${\bf F}_{ij} = {G m_i m_j} \frac{({\bf x}_j - {\bf x}_i) } {|{\bf x}_i - {\bf x}_j|^3}$$

$N$ is the number of particles.

Computing a single term in the force vector: 3 differences (x,y,z), a sum (3 terms), a power, and 4 multiplications. About 10 operations.

The force is equal and opposite so it only needs to be computed once for each pair of particles, or $N(N-1)/2$ times.

To do an integration step, particle positions and velocities must be updated. That's at least 6 floating point operations.

Of order $10 N^2$ per timestep.
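The pair loop above can be sketched as follows, with $G$ and the masses set to 1 for illustration. Each unordered pair is visited once and Newton's third law supplies the opposite force:

```python
import numpy as np

def force_all_pairs(x):
    """x: (N,3) positions; returns (N,3) forces with G = m = 1 (illustrative units)."""
    N = x.shape[0]
    F = np.zeros_like(x)
    for i in range(N):
        for j in range(i + 1, N):      # each unordered pair once: N(N-1)/2 pairs
            dx = x[j] - x[i]           # 3 differences
            r2 = np.dot(dx, dx)        # 3 multiplies + 2 adds
            Fij = dx / r2**1.5         # inverse-square force along dx
            F[i] += Fij                # equal and ...
            F[j] -= Fij                # ... opposite
    return F

x = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
print(force_all_pairs(x))   # equal and opposite unit forces along x
```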

Hydrodynamics on a grid

$$ \begin{pmatrix}\rho \\ {\bf v} \end{pmatrix}_t = {\bf F}( \rho, {\bf v}) $$

Dimension $n$ and grid size $N$. Number of grid points $N^n$.

The function $\bf F$ depends on the gradient of the velocity field $\nabla {\bf v}$. Nearest neighbor interactions.

Computing the gradient: about 3 floating point operations for a low order approximation per grid point and for a single dimension. Total: $3 n N^n$.

Evaluate the function $F$: Total a few times $n N^n$.

Update the arrays: Total a few times $n N^n$.

Total: About $10 n N^n$
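Plugging in example numbers (a 3D run with 256 grid points per side, chosen here just for illustration):

```python
n = 3      # dimension
N = 256    # grid points per side
ops_per_step = 10 * n * N**n      # the rough estimate above
print('{:.1e} operations per timestep'.format(ops_per_step))  # ~5e8
```

At a few GFLOPS per core, a single timestep of this size takes of order a second, which is why production hydro runs are parallelized.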

In [ ]: