[Math] Given records with three data fields, how to predict or guess the third field when only two are given

data analysis

I'm tracking data on a backup job that runs nightly on our server and using the historical data to predict growth in data volume and job time. I have the following three data fields for most of the records: Data Backed Up (in bytes), Total Job Time (hh:mm:ss), and Transfer Speed (in bytes/minute).

Total Job Time does not equal (Data Backed Up)/(Transfer Speed) because there is necessary overhead for the job starting, transitioning, and completing. I have created a fourth data field, Work Time, recording the time actually spent transferring data, computed from the formula above, but it does not relate directly or consistently to the Total Job Time. Server use, network latency, and other resource-bound factors all affect the relationship.
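For concreteness, here is a minimal sketch of that Work Time calculation in Mathematica (the system used in the answer below), applied to the first row of the sample table:

bytes = 383542111073;            (* data backed up, bytes *)
jobTime = 381.22;                (* total job time, minutes *)
speed = 1273000000;              (* transfer speed, bytes/minute *)
workTime = N[bytes/speed]        (* time actually transferring: ~301.3 min *)
overhead = jobTime - workTime    (* remaining job overhead: ~79.9 min *)

So for that record roughly 80 of the 381 minutes are overhead, which is why Speed cannot be recovered as simply Data/Time.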

For some of the older records I am missing the Transfer Speed data, and I would like to know what formula to apply to the other two fields (Data and Job Time) to make a reasonable guess at what the Transfer Speed might have been.

Below is a representative sample of the data; I've converted all the values to bytes and minutes for ease of calculation (a sketch of the conversion follows the table):

Data(bytes)   Time(min) Speed(bytes/min)
383542111073    381.22  1273000000
383676323632    382.72  1267000000
383875888842    378.55  1283000000
384088122257    382.15  1268000000
384247013724    378.40  1282000000
384457413287    378.68  1285000000
384652849842    381.42  1272000000
384973213219    380.15  1278000000
385188544442    380.13  1280000000
385504302010    377.80  1291000000
385628091021    377.97  1289000000
386061561686    384.77  1264000000
386853481337    383.98  1270000000
387117610212    381.90  1278000000
387679368117    385.80  1262000000
388015187994    386.50  1261000000
388240874769    385.20  1265000000
391312996783    383.15  1282000000
392497055973    384.73  1280000000
392877252269    387.13  1269000000
392988498970    386.52  1274000000
393236837467    385.33  1279000000
392386489223    366.32  1363000000
392626640464    370.68  1341000000
392772670262    366.68  1363000000
391049505322    366.60  1360000000
391308127859    365.62  1362000000
391683916463    365.53  1367000000
391868818660    367.87  1355000000
392029291293    366.82  1356000000
392028073259    370.40  1341000000
392143518314    366.07  1365000000
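For reference, a minimal sketch of that hh:mm:ss-to-minutes conversion (the timestamp below is a reconstructed example; 06:21:13 works out to the 381.22 minutes of the first row):

toMinutes[t_String] := With[{p = ToExpression /@ StringSplit[t, ":"]},
  p[[1]]*60 + p[[2]] + p[[3]]/60.]

toMinutes["06:21:13"]  (* 381.217 minutes *)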

For any given combination of Data and Time, I'd like to be able to guess Speed.

UPDATE for comment regarding graphing:

I have graphed the data but, having forgotten most of my math when I moved into technology as a career, I'm not exactly sure how the shape of the graph points me toward a particular function. The graph of the entire data set is below. Note how the MB/min and Work Time data are missing from mid-July and before.

Part of the problem with my (meager, unpracticed) thoughts on which formula is best is that, a month into the data collection, I changed the time at which the backup occurred. Moving it to a period when fewer things were running on the server lowered the resulting job time by what appears to be about 12 minutes. You can see this in the data set above, where the final block of time values is clustered in the high 360s while the earlier points are closer to 380.
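If that schedule change matters, the fitting model from the answer below could be applied to each regime separately; a hedged sketch, assuming data holds the {bytes, minutes, speed} triples and splitting at a guessed 375-minute threshold:

oldSchedule = Select[data, #[[2]] > 375 &];  (* times near 380 *)
newSchedule = Select[data, #[[2]] < 375 &];  (* times in the high 360s *)
FindFit[Log[oldSchedule], a logx1 + b logx2 + c, {a, b, c}, {logx1, logx2}]
FindFit[Log[newSchedule], a logx1 + b logx2 + c, {a, b, c}, {logx1, logx2}]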

[Graph: the entire data set]

Best Answer

Having copied the data from your post, a 3D plot suggests that the data points lie on a hyperplane, which suggests a fitting model. I am using Mathematica:

[Image: 3D plot of the data and the Mathematica fit]
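A minimal sketch of how such a plot might be reproduced, assuming data holds the {bytes, minutes, speed} triples from the question (Log threads over the list elementwise, and points falling on a plane in log coordinates indicate the power-law model fitted below):

ListPointPlot3D[Log[data],
 AxesLabel -> {"log data", "log time", "log speed"}]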

Added: Per the OP's additional question, the model changes only a little with the 7th record removed:

In[37]:= x3 == Exp[c] x1^a x2^b /. 
 FindFit[Log[data], a logx1 + b logx2 + c, {a, b, c}, {logx1, logx2}]

Out[37]= x3 == (1444.8 x1^0.800888)/x2^1.29128

In[38]:= x3 == Exp[c] x1^a x2^b /. 
 FindFit[Delete[Log[data], 7], 
  a logx1 + b logx2 + c, {a, b, c}, {logx1, logx2}]

Out[38]= x3 == (1581.69 x1^0.797488)/x2^1.29123
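As a quick sanity check (my arithmetic, not part of the original answer), plugging the first sample row into the first fitted model:

1444.8*383542111073^0.800888/381.22^1.29128
(* ≈ 1.271*10^9 bytes/min, vs. the observed 1273000000 for that row *)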