Lecture 3

  • FLOPS
  • Profiling
  • Python classes
  • How many floating point operations are required?

FLOPS = Floating point operations per second

Classes = Object oriented programming

Example estimates for the number of floating point operations: multiplying two matrices, N-body simulations, hydro simulations

In [2]:
# using python profiler

import cProfile
import numpy as np

# nit is the number of iterations
def foo(nit,x):
    ssum = 1.0
    for i in range(nit):
        ssum += x*np.sin(x)   # this is an arbitrary operation
        
    return ssum

How many floating point operations are required in this routine?

Answer: at least 3 per loop: one multiply, one function call (which might be fast), and one add.

In [3]:
x = 1/np.sqrt(2.0)
cProfile.run('foo(int(1e7),x)')  # you insert the entire command within the quotes
         4 function calls in 12.739 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   12.739   12.739   12.739   12.739 <ipython-input-2-192acdfe8e74>:7(foo)
        1    0.000    0.000   12.739   12.739 <string>:1(<module>)
        1    0.000    0.000   12.739   12.739 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


In [5]:
#How fast was this?
print('{:.1e} FLOPS'.format( 3e7/12.7)) 
2.4e+06 FLOPS
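The same estimate can be made without the profiler by timing the call directly with `time.perf_counter`. This is a minimal sketch with a reduced iteration count so it runs quickly; the measured rate will vary by machine:

```python
import time
import numpy as np

def foo(nit, x):
    ssum = 1.0
    for i in range(nit):
        ssum += x * np.sin(x)
    return ssum

nit = int(1e5)   # fewer iterations than above so the check is quick
t0 = time.perf_counter()
foo(nit, 1/np.sqrt(2.0))
elapsed = time.perf_counter() - t0
print('{:.1e} FLOPS'.format(3 * nit / elapsed))  # ~3 floating point ops per iteration
```

`perf_counter` has less overhead than `cProfile`, so it is useful for a sanity check even when you also profile.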

foo_file.py contains the following

import numpy as np

def foo_file(nit,x):
    ssum = 1.0
    for i in range(nit):
        ssum += x*np.sin(x)

    return ssum

If we run an external file can we speed up the same calculation?

In [6]:
from foo_file import *
cProfile.run('foo_file(int(1e7),x)')
         4 function calls in 12.742 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   12.742   12.742 <string>:1(<module>)
        1   12.742   12.742   12.742   12.742 foo_file.py:4(foo_file)
        1    0.000    0.000   12.742   12.742 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


My laptop

2.6 GHz 6-Core Intel Core i7

Suppose a single operation occurs per clock cycle. Each clock cycle is $dt = \frac{1}{2.6\times 10^9} $ s long.

If I do $3 \times 10^7$ operations that each take 1 clock cycle the total length of time would be 0.01 s or so.

In [4]:
3e7/2.6e9
Out[4]:
0.011538461538461539

We should be able to run our function about 1000 times faster.

How many floating point operations and FLOPS were needed for the Illustris cosmological simulations?

https://www.illustris-project.org/about/#astronomers-three

Apparently: "Evolving the main simulation to z=0 used 8,192 compute cores, a peak memory of 25 TB, and 19 million CPU hours". Tera is $10^{12}$.

In [41]:
#What was the total number of floating point operations (approximately)?
#I think CPU hours refers to single cores.
hour = 60*60 # hour in seconds
time = 19e6*hour  # total time in seconds adding all CPUs together 
# suppose that each CPU does FP operations at 100 GHz, i.e. each one takes 10^-11 s
FPO = time/1e-11
print(FPO)
# this is probably an overestimate as some of the simulation time
# involves data transfer
6.84e+21

What are the FLOPS for Illustris?

If each core does $10^{11}$ FLOPS and there are 8,192 cores (about $10^4$ cores), the cluster does $10^{15}$ FLOPS = 1000 TFLOPS (tera floating point operations per second).
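This arithmetic can be checked directly, using the assumed per-core rate of $10^{11}$ FLOPS from the estimate above:

```python
cores = 8192            # from the Illustris project description
flops_per_core = 1e11   # assumed rate, as in the estimate above
cluster_flops = cores * flops_per_core
print('{:.1e} FLOPS'.format(cluster_flops))  # ~1e15, i.e. roughly 1000 TFLOPS
```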

Can we speed up our foo routine using compiled code?

The same foo routine in C in a file called foo.c

#include<stdio.h>
#include<math.h>
#include<time.h>

// for seeing about how fast it takes routines to do things
// comparing python vs c

double foo(int nit, double x);

int main()
{
   clock_t t0 =  clock();
   double ssum = foo((int)(1e7), 1./sqrt(2.));  // same x as the python version
   clock_t t1 =  clock();
   double time_diff = (double)(t1-t0)/(double)CLOCKS_PER_SEC;
   printf("clocks t0=%lu t1=%lu (t1-t0)=%lu\n",t0,t1,(t1-t0));
   // printf("%u \n",CLOCKS_PER_SEC);

   printf("%.12f seconds\n",time_diff);
}

double foo(int nit,double x){
   double ssum = 1.0;
   for(int i=0;i < nit;i++){
      ssum+= x*sin(x);
   }
   return ssum;
}

compiled with

gcc foo.c -lm -o foo.out

and run on the command line with

./foo.out

python to C comparison

I run the same routine in C and get

0.096453000000 seconds

How fast in FLOPS was this?

In [38]:
#How fast was the c routine in FLOPS ?
print('{:.1e} FLOPS'.format( 3e7/0.1)) 
3.0e+08 FLOPS

Note: if the compiler was smart it could have bypassed the loop.

Code speed

The sin, +, and * operations on my laptop probably take more than 1 clock cycle each.

There is overhead. There is more overhead in python.

You can estimate how long code should take by counting operations.

Using a profiler or with analysis you can identify which part of your code is most computationally intensive and attempt to optimize that part of the code.

There are trade-offs between code simplicity, accuracy, speed, flexibility and development time.

Some practical ways to speed up python code

  • vector arithmetic
  • vector indexing instead of looping and conditional statements
  • load information into vectors before plotting as the plotting routines involve overhead
  • off-load computationally intensive operations into compiled subroutines and libraries

Is foo computed with vectors faster?

In [12]:
# the same set of operations computed with vectors
def foo_vec(nit,x):
    xs = np.zeros(nit)
    x_vec = np.zeros(nit) + x  # is a vector
    sin_xvec = np.zeros(nit) + np.sin(x)  # is a vector
    xs += x_vec*sin_xvec
    ssum = np.sum(xs)
    return ssum
In [13]:
x = 1/np.sqrt(2.0)
cProfile.run('foo_vec(int(1e7),x)')
         16 function calls in 0.092 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.005    0.005 <__array_function__ internals>:2(sum)
        1    0.048    0.048    0.077    0.077 <ipython-input-12-c405b0065846>:2(foo_vec)
        1    0.015    0.015    0.092    0.092 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:2087(_sum_dispatcher)
        1    0.000    0.000    0.005    0.005 fromnumeric.py:2092(sum)
        1    0.000    0.000    0.005    0.005 fromnumeric.py:73(_wrapreduction)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:74(<dictcomp>)
        1    0.000    0.000    0.092    0.092 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.005    0.005 {built-in method numpy.core._multiarray_umath.implement_array_function}
        3    0.025    0.008    0.025    0.008 {built-in method numpy.zeros}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        1    0.005    0.005    0.005    0.005 {method 'reduce' of 'numpy.ufunc' objects}


In [14]:
print(12.7/0.1)
126.99999999999999

Speed up

We sped up the routine by a factor of 12.7/0.1 $\sim$ 100 using vectors (and not using a loop)

Many numpy functions will work on arrays.

Many math functions will not.
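A quick illustration of the difference: numpy's `sin` is a ufunc that broadcasts over an array, while `math.sin` expects a single scalar and raises a `TypeError` on an array of more than one element.

```python
import math
import numpy as np

arr = np.array([0.0, np.pi/2, np.pi])
print(np.sin(arr))    # numpy ufuncs broadcast over the whole array

try:
    math.sin(arr)     # math functions only accept a single scalar
except TypeError as e:
    print('math.sin failed:', e)
```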

Conditional statements on arrays

indexing with arrays of booleans

Question: Is it possible to do conditional statements on arrays without doing a loop?

In [3]:
xvec = np.array([0.1,0.3,0.7,0.3,0.9,0.1])
ii = (xvec < 0.5)  # a conditional statement
## ii is now an array of booleans
print(ii)
[ True  True False  True False  True]
In [4]:
#if (x < 0.5) then multiply x by -1
xvec[ii] *= -1
print(xvec)
[-0.1 -0.3  0.7 -0.3  0.9 -0.1]

This procedure is very useful when working with data tables.
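As a related sketch, `np.where` performs the same conditional replacement in a single call, without modifying the original array in place:

```python
import numpy as np

xvec = np.array([0.1, 0.3, 0.7, 0.3, 0.9, 0.1])
# where the condition holds take -xvec, otherwise keep xvec
result = np.where(xvec < 0.5, -xvec, xvec)
print(result)   # [-0.1 -0.3  0.7 -0.3  0.9 -0.1]
```

Boolean-index assignment and `np.where` are interchangeable here; `np.where` is convenient when you want a new array rather than an in-place edit.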

Python classes

  • Object Oriented Programming
  • C++ like
  • A means of bundling data and functionality together

For a tutorial see https://docs.python.org/3/tutorial/classes.html

Class definition

class Point():
    ...
    ...

Defining an object to be of that class

pt1 = Point()
In [45]:
class Point():
    def __init__(self, x, y):   # constructor
        self.x = x
        self.y = y   # internal data (x,y)
        
    # rotate x,y by angle (in radians)
    def rotate_by(self, angle):   # function operating on object
        cos_a = np.cos(angle)
        sin_a = np.sin(angle)
        newx = cos_a * self.x - sin_a * self.y 
        newy = sin_a * self.x + cos_a * self.y 
        self.x = newx
        self.y = newy
      
    # return atan2(y,x)
    def theta(self):   # a function operating on the object that returns something
        return np.arctan2(self.y,self.x)
    
    # return atan2(y,x) in degrees
    def theta_deg(self):   # using above function previously defined
        return self.theta()*180.0/np.pi
    
    # return sqrt(x^2+y^2)
    def length(self):
        return np.sqrt(self.x*self.x + self.y*self.y)
    
    def __str__(self):  # so can print()
        return "({},{})".format(self.x, self.y)

    def __repr__(self):  # if just entered on command line, what is printed out
        return "Point({},{})".format(self.x, self.y)  # should return a string
    
    # return a dot product with another point object
    def dot_prot_internal(self,pt):
        return self.x*pt.x + self.y*pt.y
    
In [46]:
pt1 = Point(1,0)  # using the constructor
# pt1 is now a point with x=1,y=0
In [47]:
print(pt1)   # __str__()
(1,0)
In [48]:
pt1   # __repr__()
Out[48]:
Point(1,0)
In [49]:
pt1.rotate_by(np.pi/2)
print(pt1)
(6.123233995736766e-17,1.0)
In [50]:
pt1.length()
Out[50]:
1.0
In [51]:
pt1.rotate_by(np.pi/4)
pt1.theta_deg()
Out[51]:
135.0
In [52]:
# Function defined outside the class
# take Point (x,y) and make it become (rx,ry)
def scaleby(self, r):
    self.x *= r
    self.y *= r

print(pt1)
scaleby(pt1,2)
print(pt1)
(-0.7071067811865475,0.7071067811865476)
(-1.414213562373095,1.4142135623730951)
In [53]:
# an external function that operates on two points and returns something
def dot_product(self1,self2):
    return self1.x*self2.x + self1.y*self2.y
In [54]:
pt1 = Point(1,0)
pt2 = Point(3,4)
dot_product(pt1,pt2)
Out[54]:
3
In [55]:
pt1.dot_prot_internal(pt2)  # the internal function for a dot product
Out[55]:
3
In [56]:
type(pt1)
Out[56]:
__main__.Point
In [57]:
# how do you check that you have 
# the expected class?
if (type(pt1) == Point):   # isinstance(pt1, Point) is the more idiomatic check
    print(True)
True

Question:

Can you define a class without the __init__ constructor?

Answer: yes.
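A minimal sketch (the class name `Bag` is made up for illustration): when no `__init__` is defined, Python supplies a default constructor taking no arguments, and attributes can be attached afterwards.

```python
class Bag():            # no __init__: Python supplies a default constructor
    def add(self, name, value):
        setattr(self, name, value)   # attach an attribute by name

b = Bag()               # construct with no arguments
b.add('x', 3)
print(b.x)              # 3
```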

How many floating point operations are required?

Some examples

Multiplying two NxN matrices

$${\bf C} = {\bf A} {\bf B} $$

$N$ gives the size of the square matrices.

For example $$ \begin{pmatrix} C_{00} & C_{01} & ... \\ C_{10} & C_{11} & ... \\ .. & .. & .. \end{pmatrix} = \begin{pmatrix} A_{00} & A_{01} & ... \\ A_{10} & A_{11} & ... \\ .. & .. & .. \end{pmatrix} \begin{pmatrix} B_{00} & B_{01} & ... \\ B_{10} & B_{11} & ... \\ .. & .. & .. \end{pmatrix} $$

$$C_{ij}= \sum_{k=0}^{N-1} A_{ik} B_{kj} $$

This must be done $N^2$ times, once for each of the possible values of $i,j$.

The sum itself has $N$ terms. Each term in the sum requires 1 multiplication giving $N$ total multiplications. The sum itself requires $N-1$ additions.
So each sum requires $2N$ floating point operations.

The total number of floating point operations is $N^2 \times 2N \approx 2 N^3$.
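The count can be verified with a direct triple-loop multiply that tallies its own operations (a sketch for small $N$ only; in practice numpy's `matmul` should be used):

```python
import numpy as np

def matmul_count(A, B):
    """Naive N x N matrix multiply that counts floating point operations."""
    N = A.shape[0]
    C = np.zeros((N, N))
    ops = 0
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i, j] += A[i, k] * B[k, j]   # one multiply + one add
                ops += 2
    return C, ops

N = 8
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C, ops = matmul_count(A, B)
print(ops, 2 * N**3)          # the tally matches the 2N^3 estimate
print(np.allclose(C, A @ B))  # and the result matches numpy's matmul
```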

N-body simulations (all pairs)

Each particle feels the gravitational force from every other particle. $${\bf F}_{ij} = {G m_i m_j} \frac{({\bf x}_j - {\bf x}_i) } {|{\bf x}_i - {\bf x}_j|^3}$$

$N$ is the number of particles.

Computing a single term in the force vector: 3 differences (x,y,z), a sum (3 terms), a power, and 4 multiplications. About 10 operations.

The force is equal and opposite so it only needs to be computed once for each pair of particles, or $N(N-1)/2$ times.

To do an integration step, particle positions and velocities must be updated. That's at least 6 floating point operations.

Of order $10 N^2$ per timestep.
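The pair loop above can be sketched as follows, with $G$ and the masses set to 1 for illustration. Each unordered pair is visited once and Newton's third law supplies the opposite force:

```python
import numpy as np

def force_all_pairs(x):
    """x: (N,3) positions; returns (N,3) forces with G = m = 1 (illustrative units)."""
    N = x.shape[0]
    F = np.zeros_like(x)
    for i in range(N):
        for j in range(i + 1, N):      # each unordered pair once: N(N-1)/2 pairs
            dx = x[j] - x[i]           # 3 differences
            r2 = np.dot(dx, dx)        # 3 multiplies + 2 adds
            Fij = dx / r2**1.5         # inverse-square force along dx
            F[i] += Fij                # equal and ...
            F[j] -= Fij                # ... opposite
    return F

x = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
print(force_all_pairs(x))   # equal and opposite unit forces along x
```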

Hydrodynamics on a grid

$$ \begin{pmatrix}\rho \\ {\bf v} \end{pmatrix}_t = {\bf F}( \rho, {\bf v}) $$

Dimension $n$ and grid size $N$. Number of grid points $N^n$.

The function $\bf F$ depends on the gradient of the velocity field $\nabla {\bf v}$. Nearest neighbor interactions.

Computing the gradient: about 3 floating point operations for a low order approximation per grid point and for a single dimension. Total: $3 n N^n$.

Evaluate the function $F$: Total a few times $n N^n$.

Update the arrays: Total a few times $n N^n$.

Total: About $10 n N^n$
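Plugging in example numbers (a 3D run with 256 grid points per side, chosen here just for illustration):

```python
n = 3      # dimension
N = 256    # grid points per side
ops_per_step = 10 * n * N**n      # the rough estimate above
print('{:.1e} operations per timestep'.format(ops_per_step))  # ~5e8
```

At a few GFLOPS per core, a single timestep of this size takes of order a second, which is why production hydro runs are parallelized.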

In [ ]: