FLOPS = Floating point operations per second
Classes = Object oriented programming
Example estimates for the number of floating point operations: multiplying two matrices, N-body simulations, hydro simulations
# using the python profiler
import cProfile
import numpy as np
# nit is the number of iterations
def foo(nit, x):
    ssum = 1.0
    for i in range(nit):
        ssum += x*np.sin(x)  # this is an arbitrary operation
    return ssum
Answer: at least 3 per loop iteration: one multiply, one add, and one call to sin (which might be fast).
x = 1/np.sqrt(2.0)
cProfile.run('foo(int(1e7),x)') # you insert the entire command within the quotes
#How fast was this? The profiled run above took about 12.7 s for 3e7 operations.
print('{:.1e} FLOPS'.format( 3e7/12.7))
# contents of foo_file.py
import numpy as np
def foo_file(nit, x):
    ssum = 1.0
    for i in range(nit):
        ssum += x*np.sin(x)
    return ssum
If we run an external file can we speed up the same calculation?
from foo_file import *
cProfile.run('foo_file(int(1e7),x)')
2.6 GHz 6-Core Intel Core i7
Suppose a single operation occurs per clock cycle. Each clock cycle is $dt = \frac{1}{2.6\times 10^9} $ s long.
If I do $3 \times 10^7$ operations that each take 1 clock cycle the total length of time would be 0.01 s or so.
3e7/2.6e9
https://www.illustris-project.org/about/#astronomers-three
Apparently: "Evolving the main simulation to z=0 used 8,192 compute cores, a peak memory of 25 TB, and 19 million CPU hours". Tera is $10^{12}$.
#What was the total number of floating point operations (approximately)?
#I think CPU hours refers to single cores.
hour = 60*60 # hour in seconds
time = 19e6*hour # total time in seconds adding all CPUs together
# suppose that each CPU does FP operations at 100 GHz, with each taking 1e-11 s
FPO = time/1e-11
print(FPO)
# this is probably an overestimate as some of the simulation time
# involves data transfer
If each core does $10^{11}$ FLOPS and there are 8,192 cores (about $10^4$ cores), the cluster does about $10^{15}$ FLOPS = 1000 TFLOPS (tera floating point operations per second).
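The arithmetic above can be checked with a short sketch (the per-core rate of $10^{11}$ FLOPS, i.e. $10^{-11}$ s per operation, is the same assumption made in the text):

```python
# Back-of-the-envelope check of the Illustris estimate above.
cores = 8192
rate = 1e11                      # assumed FLOPS per core
hour = 60 * 60                   # seconds in an hour
cpu_seconds = 19e6 * hour        # 19 million CPU hours in seconds
total_ops = cpu_seconds * rate   # total floating point operations
cluster_flops = cores * rate     # rate of the whole cluster
print('{:.1e} total operations'.format(total_ops))
print('{:.1e} FLOPS for the cluster'.format(cluster_flops))
```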
#include <stdio.h>
#include <math.h>
#include <time.h>
// for seeing about how fast it takes routines to do things
// comparing python vs c
double foo(int nit, double x);

int main()
{
    clock_t t0 = clock();
    double ssum = foo((int)(1e7), sqrt(2.));
    clock_t t1 = clock();
    double time_diff = (double)(t1 - t0)/(double)CLOCKS_PER_SEC;
    printf("clocks t0=%lu t1=%lu (t1-t0)=%lu\n",
           (unsigned long)t0, (unsigned long)t1, (unsigned long)(t1 - t0));
    // printf("%u \n", CLOCKS_PER_SEC);
    printf("%.12f seconds\n", time_diff);
    return 0;
}

double foo(int nit, double x)
{
    double ssum = 1.0;
    for (int i = 0; i < nit; i++) {
        ssum += x*sin(x);
    }
    return ssum;
}
compiled with
gcc foo.c -lm -o foo.out
and run on the command line with
./foo.out
I run the same routine in C and get
0.096453000000 seconds
How fast in FLOPS was this?
#How fast was the c routine in FLOPS ?
print('{:.1e} FLOPS'.format( 3e7/0.1))
Note: if the compiler was smart it could have bypassed the loop.
sin, + and * operations on my laptop probably take more than 1 clock cycle each.
There is overhead. There is more overhead in python.
You can estimate how long code should take by counting operations.
Using a profiler or with analysis you can identify which part of your code is most computationally intensive and attempt to optimize that part of the code.
There are trade-offs between code simplicity, accuracy, speed, flexibility and development time.
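For quick one-off timings (without the full profiler output) the standard library timeit module also works; a minimal sketch, using a smaller iteration count so it runs fast (the helper name foo_small is just for illustration):

```python
import timeit
import numpy as np

# same arbitrary operation as foo above, smaller problem size
def foo_small(nit, x):
    ssum = 1.0
    for i in range(nit):
        ssum += x * np.sin(x)
    return ssum

x = 1 / np.sqrt(2.0)
# run the snippet 5 times; timeit.timeit returns the total time in seconds
t = timeit.timeit(lambda: foo_small(int(1e5), x), number=5)
print('{:.1e} s per call'.format(t / 5))
```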
# the same set of operations computed with vectors
def foo_vec(nit, x):
    xs = np.zeros(nit)
    x_vec = np.zeros(nit) + x             # a vector, every element = x
    sin_xvec = np.zeros(nit) + np.sin(x)  # a vector, every element = sin(x)
    xs += x_vec*sin_xvec
    ssum = 1.0 + np.sum(xs)  # the 1.0 matches foo's initial value
    return ssum
x = 1/np.sqrt(2.0)
cProfile.run('foo_vec(int(1e7),x)')
print(12.7/0.1)
We sped up the routine by a factor of $12.7/0.1 \sim 100$ by using vectors (and not using a loop).
Many numpy functions will work on arrays.
Many math functions will not.
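For example (illustrating the point above): numpy's sin is a ufunc that operates elementwise on an array, while math.sin accepts only scalars and raises a TypeError on an array:

```python
import math
import numpy as np

xvec = np.array([0.0, np.pi / 2])
print(np.sin(xvec))  # numpy ufuncs work elementwise on arrays

# math.sin only accepts scalars; on an array it raises TypeError
try:
    math.sin(xvec)
except TypeError as e:
    print('math.sin failed on an array:', e)
```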
indexing with arrays of booleans
Question: Is it possible to do conditional statements on arrays without doing a loop?
xvec = np.array([0.1,0.3,0.7,0.3,0.9,0.1])
ii = (xvec < 0.5) # a conditional statement
## ii is now an array of booleans
print(ii)
#if (x < 0.5) then multiply x by -1
xvec[ii] *= -1
print(xvec)
This procedure is very useful when working with data tables.
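As a sketch of that use with a made-up miniature data table (the column names and values are illustrative): a mask built from one column can select rows of another, and masks combine with & and |:

```python
import numpy as np

# hypothetical table: magnitudes and redshifts for a few objects
mag = np.array([18.2, 21.5, 19.9, 22.1, 17.3])
z   = np.array([0.10, 0.45, 0.22, 0.60, 0.05])

bright = (mag < 20.0)  # boolean mask, one entry per row
print(z[bright])       # redshifts of the bright objects only

# masks combine with & (and) and | (or); parentheses are required
nearby_bright = (mag < 20.0) & (z < 0.2)
print(mag[nearby_bright])
```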
For a tutorial see https://docs.python.org/3/tutorial/classes.html
class Point():
    def __init__(self, x, y):  # constructor
        self.x = x
        self.y = y             # internal data (x,y)

    # rotate x,y by angle (in radians)
    def rotate_by(self, angle):  # function operating on object
        cos_a = np.cos(angle)
        sin_a = np.sin(angle)
        newx = cos_a*self.x - sin_a*self.y
        newy = sin_a*self.x + cos_a*self.y
        self.x = newx
        self.y = newy

    # return atan2(y,x)
    def theta(self):  # a function operating on the object that returns something
        return np.arctan2(self.y, self.x)

    # return atan2(y,x) in degrees
    def theta_deg(self):  # using the function defined above
        return self.theta()*180.0/np.pi

    # return sqrt(x^2+y^2)
    def length(self):
        return np.sqrt(self.x*self.x + self.y*self.y)

    def __str__(self):  # so we can print()
        return "({},{})".format(self.x, self.y)

    def __repr__(self):  # what is shown if just entered on the command line
        return "Point({},{})".format(self.x, self.y)  # should return a string

    # return the dot product with another Point object
    def dot_prot_internal(self, pt):
        return self.x*pt.x + self.y*pt.y
pt1 = Point(1,0) # using the constructor
# pt1 is now a point with x=1,y=0
print(pt1) # __str__()
pt1 # __repr__()
pt1.rotate_by(np.pi/2)
print(pt1)
pt1.length()
pt1.rotate_by(np.pi/4)
pt1.theta_deg()
# Function defined outside the class
# take Point (x,y) and make it become (rx,ry)
def scaleby(self, r):
    self.x *= r
    self.y *= r
print(pt1)
scaleby(pt1,2)
print(pt1)
# an external function that operates on two points and returns something
def dot_product(self1, self2):
    return self1.x*self2.x + self1.y*self2.y
pt1 = Point(1,0)
pt2 = Point(3,4)
dot_product(pt1,pt2)
pt1.dot_prot_internal(pt2) # the internal function for a dot product
type(pt1)
# how do you check that you have
# the expected class?
if type(pt1) == Point:
    print(True)
# isinstance is the idiomatic check (it also accepts subclasses)
print(isinstance(pt1, Point))
Some examples
$N$ gives the size of the square matrices.
For example $$ \begin{pmatrix} C_{00} & C_{01} & ... \\ C_{10} & C_{11} & ... \\ .. & .. & .. \end{pmatrix} = \begin{pmatrix} A_{00} & A_{01} & ... \\ A_{10} & A_{11} & ... \\ .. & .. & .. \end{pmatrix} \begin{pmatrix} B_{00} & B_{01} & ... \\ B_{10} & B_{11} & ... \\ .. & .. & .. \end{pmatrix} $$
$$C_{ij}= \sum_{k=0}^{N-1} A_{ik} B_{kj} $$ This must be done $N^2$ times, once for each of the possible values of $i,j$.
The sum has $N$ terms. Each term requires 1 multiplication, giving $N$ multiplications in total.
The sum itself requires $N-1$ additions.
So each entry $C_{ij}$ requires about $2N$ floating point operations.
The total number of floating point operations is $N^2 \times 2N \approx 2 N^3$.
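The $2N^3$ estimate can be checked by counting operations in an explicit triple loop (a sketch; matmul_flop_count is just an illustrative helper):

```python
import numpy as np

def matmul_flop_count(A, B):
    """Multiply square matrices with explicit loops, counting operations."""
    N = A.shape[0]
    C = np.zeros((N, N))
    flops = 0
    for i in range(N):
        for j in range(N):
            s = 0.0
            for k in range(N):
                s += A[i, k] * B[k, j]  # one multiply + one add
                flops += 2
            C[i, j] = s
    return C, flops

N = 8
A = np.random.rand(N, N)
B = np.random.rand(N, N)
C, flops = matmul_flop_count(A, B)
print(flops, 2 * N**3)        # counted ops vs. the 2 N^3 estimate
print(np.allclose(C, A @ B))  # check the result against numpy
```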
Each particle feels the gravitational force from every other particle. $${\bf F}_{ij} ={G m_i m_j} \frac{({\bf x}_j - {\bf x}_i) } {|{\bf x}_i - {\bf x}_j|^3}$$ (The $|{\bf x}_i - {\bf x}_j|^3$ in the denominator, together with the vector numerator, gives the usual $1/r^2$ force magnitude.)
$N$ is the number of particles.
Computing a single term in the force vector: 3 differences (x,y,z), a sum of 3 terms, a power, and 4 multiplications. About 10 operations.
The force is equal and opposite, so it need only be computed once per pair of particles, or $N(N-1)/2$ times.
To do an integration step, particle positions and velocities must be updated. That's at least 6 floating point operations.
Of order $10 N^2$ per timestep.
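A minimal sketch of the pairwise $O(N^2)$ force loop, using the equal-and-opposite property so each pair is visited once (the helper pairwise_forces and the random test data are illustrative):

```python
import numpy as np

def pairwise_forces(x, m, G=1.0):
    """Direct-sum gravity: O(N^2) pair loop, using F_ji = -F_ij."""
    N = len(m)
    F = np.zeros_like(x)           # (N, 3) force accumulator
    npairs = 0
    for i in range(N):
        for j in range(i + 1, N):  # visit each pair once
            dx = x[j] - x[i]       # 3 differences
            r2 = np.dot(dx, dx)    # 3 multiplies + 2 adds
            f = G * m[i] * m[j] * dx / r2**1.5  # |f| scales as 1/r^2
            F[i] += f              # equal and opposite
            F[j] -= f
            npairs += 1
    return F, npairs

N = 5
rng = np.random.default_rng(1)
x = rng.random((N, 3))
m = np.ones(N)
F, npairs = pairwise_forces(x, m)
print(npairs, N * (N - 1) // 2)     # pair count matches N(N-1)/2
print(np.abs(F.sum(axis=0)).max())  # net force ~ 0 (Newton's 3rd law)
```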
Dimension $n$ and grid size $N$. Number of grid points $N^n$.
The function $\bf F$ depends on the gradient of the velocity field $\nabla {\bf v}$. Nearest neighbor interactions.
Computing the gradient: about 3 floating point operations per grid point per dimension for a low order approximation. Total: $3 n N^n$.
Evaluate the function $F$: Total a few times $n N^n$.
Update the arrays: Total a few times $n N^n$.
Total: About $10 n N^n$
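Plugging illustrative numbers into the $10\, n\, N^n$ estimate (the grid size and step count below are assumptions, not from any particular code):

```python
# n = 3 dimensions, N = 512 grid points per side, ~10 n N^n ops per step
n, N = 3, 512
ops_per_step = 10 * n * N**n
nsteps = 1e4  # assumed number of timesteps
total_ops = ops_per_step * nsteps
print('{:.1e} ops per step'.format(float(ops_per_step)))
print('{:.1e} total ops'.format(total_ops))
```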