Otherwise, it seems like you have four to five floating-point instructions per iteration (including the int-to-float conversion, once or twice, depending on the optimizer), so $4\cdot10^9$ or $5\cdot10^9$ in total. Divide this by the run time of 0.431635 seconds to get your FLOPS, which will be around $10^{10}$, or 10 GFLOPS.

A little bit late, but maybe it helps some visitors in the future. For your example I successfully tested the following snippet:

    # Uses the TensorFlow 1.x API (tf.RunMetadata, tf.random_normal, tf.profiler).
    import tensorflow as tf

    g = tf.Graph()
    run_meta = tf.RunMetadata()
    with g.as_default():
        A = tf.Variable(tf.random_normal([25, 16]))
        B = tf.Variable(tf.random_normal([16, 9]))
        C = tf.matmul(A, B)  # shape=[25,9]

        # Restrict the profiler to counting floating-point operations.
        opts = tf.profiler.ProfileOptionBuilder.float_operation()
        flops = tf.profiler.profile ...

    QDP:FlopCount:invcg2 Total performance: 7554.93149027877 Mflops = 7.55493149027877 Gflops = 0.00755493149027877 Tflops
    CG_SOLVER: 37 iterations. Rsd = 2.06699506385389e-09 Relative Rsd = 5.77136063168358e-13
    CG_SOLVER_TIME: 2.72206 sec

But given a higher core count, that actually adds up to less cache per core than Intel's previous designs.

A 1 GFLOPS machine performs a billion floating-point operations per second. Take the i7-3632QM as an example: 16 (SP FLOPS/cycle) × 4 (quad-core) × 2. 8 GFlops and 3870 is 496 GFlops.
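The arithmetic used in these snippets (operation count divided by run time, unit conversion between Mflops and Gflops, and peak throughput as FLOPS per cycle × cores × clock) can be sketched in Python. This is only an illustration: the function names are made up for this example, and the machine passed to `peak_gflops` is a hypothetical one (8 FLOPS/cycle, 2 cores, 3.0 GHz), not any CPU mentioned above.

```python
def achieved_flops(operations: float, seconds: float) -> float:
    """Achieved rate: floating-point operations divided by run time in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return operations / seconds


def mflops_to_gflops(mflops: float) -> float:
    """Convert Mflops to Gflops (1 Gflops = 1000 Mflops)."""
    return mflops / 1000.0


def peak_gflops(flops_per_cycle: int, cores: int, clock_ghz: float) -> float:
    """Theoretical peak in GFLOPS: FLOPS per cycle x cores x clock in GHz."""
    return flops_per_cycle * cores * clock_ghz


if __name__ == "__main__":
    # Figures from the first answer: 4e9 to 5e9 operations in 0.431635 s.
    low = achieved_flops(4e9, 0.431635)
    high = achieved_flops(5e9, 0.431635)
    print(f"achieved: {low / 1e9:.2f} to {high / 1e9:.2f} GFLOPS")

    # Figure from the solver log: 7554.93149027877 Mflops.
    print(f"solver: {mflops_to_gflops(7554.93149027877):.3f} Gflops")

    # Hypothetical machine, chosen only to illustrate the formula.
    print(f"peak: {peak_gflops(8, 2, 3.0):.1f} GFLOPS")
```

Comparing the achieved figure with the theoretical peak of the same machine shows how far a loop is from the hardware limit. The achieved number depends on how many operations you decide to count per iteration.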

