reliability (no poker content)


tylerdurden
08-02-2005, 09:37 PM
Assume you are in charge of management for 2000 computers in a datacenter. Over one year, you expect 400 hardware failures. Each failure will take an average of 3 hours to resolve.

Per year, how many failure events would you expect to overlap?

spaminator101
08-02-2005, 09:59 PM
What, might I ask, is the point of this post?

uuDevil
08-02-2005, 10:53 PM
[ QUOTE ]
What, might I ask, is the point of this post?

[/ QUOTE ]

Considering your screenname and post history, there is no small irony here.

uuDevil
08-03-2005, 02:21 AM
I'm unreliable, but my answer is 25.

Method:

The expected #failures in a 3-hr period is 400/(24*365/3)=.137

The probability of 2 or more failures in a 3-hr period is

1-P(X=0)-P(X=1) = 1 - exp(-.137)*(.137)^0/0! - exp(-.137)*(.137)^1/1! = .00857

The expected number of times 2 or more failures will occur in the same 3-hr period over a year is

.00857*(24*365/3) = 25.0
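
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the fixed-bucket calculation in this post. The variable and function names are made up for illustration, and it assumes what the post assumes: failures follow a Poisson distribution and the year is cut into fixed 3-hour periods.

```python
import math

HOURS_PER_YEAR = 24 * 365
FAILURES_PER_YEAR = 400
REPAIR_HOURS = 3

# Number of fixed 3-hour periods in a year, and expected failures per period.
periods_per_year = HOURS_PER_YEAR / REPAIR_HOURS
rate = FAILURES_PER_YEAR / periods_per_year

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Probability that a given period contains two or more failures.
p_two_or_more = 1 - poisson_pmf(0, rate) - poisson_pmf(1, rate)

# Expected number of periods per year with two or more failures.
expected_periods = p_two_or_more * periods_per_year

print(f"rate per period     = {rate:.3f}")
print(f"P(2 or more)        = {p_two_or_more:.5f}")
print(f"periods per year    = {expected_periods:.1f}")
```

Note that this counts 3-hour periods containing at least two failures, which is not quite the same thing as counting pairs of failures that overlap in time.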

emp1346
08-04-2005, 05:00 AM
i think uuDevil is on the right track... I simply used the Poisson formula, and got a bit different number, with the probability being ~.0081, resulting in about 23.9, so 24... basically the same though...

and as for you spaminator, i simply agree with uuDevil...

tylerdurden
08-04-2005, 10:32 PM
Thanks guys. I have heard of the Poisson Distribution, but wasn't sure how to apply it here. I did a little reading and I think I have the hang of it now.

irchans
08-05-2005, 07:28 AM
uuDevil,

I think your method underestimates the number of overlaps because it implies that failures start exactly on a three hour boundary.

Here is a second method for estimating the number of overlaps. Suppose there are exactly 400 failures. We will say that the ith failure and the jth failure overlap if their start times differ by 6 hours or less. The probability that the ith failure overlaps the jth failure is approximately

6/(24*365) = 0.000684932.

There are 400*399/2 = 79800 possible pairs of i's and j's, so the expected number of overlaps (with "overlap" defined as above) is

6/(24*365) * 400*399/2 = 54.6575.

tylerdurden
08-05-2005, 03:03 PM
[ QUOTE ]
I think your method underestimates the number of overlaps because it implies that failures start exactly on a three hour boundary.

Here is a second method for estimating the number of overlaps. Suppose there are exactly 400 failures. We will say that the ith failure and the jth failure overlap if their start times differ by 6 hours or less.

[/ QUOTE ]

I think you're onto something (I see the point about starting exactly on a three hour boundary), but I think your remedy is off-base. If the start times are more than three hours apart they don't overlap. For our purposes we can assume the variance on the repair length is zero, and that all failures always take exactly three hours to fix.

irchans
08-05-2005, 04:36 PM
pvn,
You are correct! There was a typo in my previous post. Below is the corrected version substituting 3 hours for 6. The expected number of overlaps did not change when I made the correction.

We really should do a simulation.

---- corrected post -----

uuDevil,

I think your method underestimates the number of overlaps because it implies that failures start exactly on a three hour boundary.

Here is a second method for estimating the number of overlaps. Suppose there are exactly 400 failures. We will say that the ith failure and the jth failure overlap if their start times differ by 3 hours or less. The jth failure can start up to 3 hours before or up to 3 hours after the ith, which is a 6-hour window, so the probability that the ith failure overlaps the jth failure is approximately

6/(24*365) = 0.000684932.

There are 400*399/2 = 79800 possible pairs of i's and j's, so the expected number of overlaps (with "overlap" defined as above) is

6/(24*365) * 400*399/2 = 54.6575.
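
The pair-counting estimate can be reproduced in a few lines of Python. This is only a sketch of the arithmetic under the post's assumptions (exactly 400 failures, start times uniform over the year, every repair exactly 3 hours); the names are illustrative.

```python
from math import comb

HOURS_PER_YEAR = 24 * 365
N_FAILURES = 400
REPAIR_HOURS = 3

# Number of unordered pairs (i, j) of failures.
pairs = comb(N_FAILURES, 2)

# Two failures overlap if their start times differ by at most 3 hours,
# i.e. the second start lands in a 6-hour window around the first.
# Edge effects at the start and end of the year are ignored.
p_overlap = 2 * REPAIR_HOURS / HOURS_PER_YEAR

expected_overlaps = pairs * p_overlap

print(pairs)                        # 79800
print(f"{p_overlap:.9f}")
print(f"{expected_overlaps:.4f}")
```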

tylerdurden
08-05-2005, 07:51 PM
[ QUOTE ]
uuDevil,

I think your method underestimates the number of overlaps because it implies that failures start exactly on a three hour boundary.

Here is a second method for estimating the number of overlaps. Suppose there are exactly 400 failures. We will say that the ith failure and the jth failure overlap if their start times differ by 3 hours or less. The jth failure can start up to 3 hours before or up to 3 hours after the ith, which is a 6-hour window, so the probability that the ith failure overlaps the jth failure is approximately

6/(24*365) = 0.000684932.

There are 400*399/2 = 79800 possible pairs of i's and j's, so the expected number of overlaps (with "overlap" defined as above) is

6/(24*365) * 400*399/2 = 54.6575.

[/ QUOTE ]


I came up with a similar number in a different manner.

As uuDevil pointed out, the expected number of failures in a three hour period is 400/(24*365/3)=.137

We expect to have 400 failure events (averaging three hours each) in a year. During each one of those, the probability that another machine will fail is 0.137.

Now, if we take 400*0.137 = 54.8, that means we'd actually have 454.8 failures, not 400 (we're counting duplicates twice).

We just need to solve this for x: (x*0.137)+x=400

That gives us x=351.8. 351.8 single failure events.

351.8*0.137=48.2

48.2 overlapping events.

351.8+48.2=400.
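
The algebra in this post is easy to check numerically. The sketch below only reproduces the arithmetic as written (solving x*0.137 + x = 400); it does not say whether the model itself is the right one, and the variable names are illustrative.

```python
FAILURES_PER_YEAR = 400
HOURS_PER_YEAR = 24 * 365
REPAIR_HOURS = 3

# Expected number of failures starting during one 3-hour repair (~0.137).
rate = FAILURES_PER_YEAR / (HOURS_PER_YEAR / REPAIR_HOURS)

# Solve x*rate + x = 400 for x, the number of "single" failure events.
x = FAILURES_PER_YEAR / (1 + rate)

# Overlapping events implied by that x.
overlapping = x * rate

print(f"x           = {x:.1f}")
print(f"overlapping = {overlapping:.1f}")
print(f"total       = {x + overlapping:.1f}")
```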

irchans
08-05-2005, 09:22 PM
pvn,

I wrote a quick simulation for an additional comparison. I don't think my Mathematica code is really readable, but it beats writing a longer c++ program. Here is the code and the results:

Code:

(* deln computes the differences between elements of the
   vector v whose indices differ by n *)
deln[v_List, n_Integer] := Drop[v, n] - Drop[v, -n];

vOverlaps = Table[
  (* generate 400 start times in seconds past Jan 1 *)
  vStartTimes = Table[ Random[Integer, {0, 365*24*60*60}], {400}] // Sort;

  (* count the overlaps where the start times differ
     by less than 3600*3 seconds *)
  iOverlaps = Count[ Join @@ Table[
      deln[vStartTimes, n], {n, 5}],
    x_ /; Abs[x] < 3600*3],
  {10000}];

(* print the average number of overlaps *)
Print[ Plus @@ vOverlaps /10000. ];

I ran the code above twice and got the following results:

first run - ave 54.6250 overlaps/year over 10000 years.
second run - ave 54.6962 overlaps/year over 10000 years.

I think I can use the reasoning in your previous post to generate a number near 54.

As uuDevil pointed out, the expected number of failures in a three hour period is 400/(24*365/3)=0.136986.

We expect to have 400 failure events (averaging three hours each) in a year. During each one of those, the expected number of other machines that will start to fail is 399*3/365/24 = 0.136644. (Except after 9 pm December 31st.)

Now if we multiply 400*0.136644 = 54.6575 we get the expected number of overlaps. We have not counted any overlaps twice because for each failure start time we only counted overlaps with failures that began after the original start time.
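
For readers without Mathematica, here is a rough Python counterpart to the simulation in this post. It is a sketch, not a translation: the function names, the fixed seed, and the smaller trial count (2000 instead of 10000, to keep the run short) are all my own choices. It keeps the same assumptions: exactly 400 failures per year, start times uniform over the year in whole seconds, and two failures overlap when their start times differ by less than 3 hours.

```python
import random

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
REPAIR_SECONDS = 3 * 60 * 60

def count_overlaps(start_times, window=REPAIR_SECONDS):
    """Count pairs of failures whose start times differ by less than window."""
    times = sorted(start_times)
    count = 0
    j = 0
    for i in range(len(times)):
        # j walks forward to the first failure starting a full window
        # (or more) after failure i; it never needs to move backwards.
        j = max(j, i + 1)
        while j < len(times) and times[j] - times[i] < window:
            j += 1
        # Failures i+1 .. j-1 all start while failure i is still open.
        count += j - i - 1
    return count

def simulate(trials=2000, n_failures=400, seed=0):
    """Average number of overlapping pairs per simulated year."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        starts = [rng.randint(0, SECONDS_PER_YEAR) for _ in range(n_failures)]
        total += count_overlaps(starts)
    return total / trials

average_overlaps = simulate()
print(f"average overlaps per year: {average_overlaps:.2f}")
```

Unlike the Mathematica version, which only compares start times up to 5 positions apart in the sorted list, this sketch counts every pair inside the window, so it also handles unusually dense clusters.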

tylerdurden
08-05-2005, 09:52 PM
[ QUOTE ]
Now if we multiply 400*0.136644 = 54.6575 we get the expected number of overlaps. We have not counted any overlaps twice because for each failure start time we only counted overlaps with failures that began after the original start time.

[/ QUOTE ]

Very nice. I agree with your line.

Now, if I want to expand this to see about 3 simultaneous failures, it should be pretty simple, right?

When two machines fail simultaneously, the average length of overlap should be 1.5 hours.

So we have 54.6575 events where two failures overlap. The expected number of failures during that overlap period should be 398 * 1.5/365/24 = 0.06815

54.6575*0.06815= 3.725 triple failures. This seems about right.

Extending from here should be obvious.
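
The triple-failure arithmetic above, spelled out as a short Python sketch. The 1.5-hour average overlap is the assumption stated in this post; the variable names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365
N_FAILURES = 400
REPAIR_HOURS = 3

# Expected number of overlapping pairs, as computed earlier in the thread.
expected_pairs = (N_FAILURES * (N_FAILURES - 1) / 2) * (2 * REPAIR_HOURS / HOURS_PER_YEAR)

# Assumption from the post: two overlapping repairs share 1.5 hours on average.
mean_overlap_hours = REPAIR_HOURS / 2

# Expected number of the remaining 398 failures that start inside that shared time.
third_failures_per_pair = (N_FAILURES - 2) * mean_overlap_hours / HOURS_PER_YEAR

expected_triples = expected_pairs * third_failures_per_pair

print(f"{expected_pairs:.4f}")
print(f"{third_failures_per_pair:.5f}")
print(f"{expected_triples:.3f}")
```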

tylerdurden
08-05-2005, 09:58 PM
[ QUOTE ]
(Except after 9 pm December 31st.)

[/ QUOTE ]

BTW, I don't think this matters.

irchans
08-05-2005, 10:48 PM
[ QUOTE ]



Now, if I want to expand this to see about 3 simultaneous failures, it should be pretty simple, right?

When two machines fail simultaneously, the average length of overlap should be 1.5 hours.

So we have 54.6575 events where two failures overlap. The expected number of failures during that overlap period should be 398 * 1.5/365/24 = 0.06815

54.6575*0.06815= 3.725 triple failures. This seems about right.
Extending from here should be obvious.

[/ QUOTE ]

Your reasoning looks good to me. I ran some more simulations and got an average near 3.55 triple failures per year, but I am not confident that the code was perfectly correct.
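
The simulation code for the triple case was not posted, so the following Python sketch is only one possible way to run such a check, not irchans's method. It assumes a "triple failure" means three failures that are all down at some common instant, which with fixed 3-hour repairs is the same as the earliest and latest of the three start times differing by less than 3 hours. The names, seed, and trial count are illustrative.

```python
import random

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
REPAIR_SECONDS = 3 * 60 * 60

def count_triples(start_times, window=REPAIR_SECONDS):
    """Count sets of three failures whose start times span less than window."""
    times = sorted(start_times)
    total = 0
    j = 0
    for i in range(len(times)):
        j = max(j, i + 1)
        while j < len(times) and times[j] - times[i] < window:
            j += 1
        # m later failures start while failure i is still being repaired;
        # any 2 of them, together with failure i, form one triple.
        m = j - i - 1
        total += m * (m - 1) // 2
    return total

def simulate_triples(trials=2000, n_failures=400, seed=0):
    """Average number of triple failures per simulated year."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        starts = [rng.randint(0, SECONDS_PER_YEAR) for _ in range(n_failures)]
        total += count_triples(starts)
    return total / trials

average_triples = simulate_triples()
print(f"average triple failures per year: {average_triples:.2f}")
```

Each triple is counted exactly once, at its earliest member, so no correction for double counting is needed. A different definition of "triple failure" would give a different count.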

emp1346
08-06-2005, 08:13 PM
[ QUOTE ]
i think uuDevil is on the right track... I simply used the Poisson formula, and got a bit different number, with the probability being ~.0081, resulting in about 23.9, so 24... basically the same though...


[/ QUOTE ]

look, will one of you two tell me why you're disregarding the poisson? it was designed for situations of this sort...

and btw, for a triple occurrence, only approximately 1 will occur in a year...

tylerdurden
08-06-2005, 08:39 PM
[ QUOTE ]
look, will one of you two tell me why you're disregarding the poisson? it was designed for situations of this sort...

and btw, for a triple occurrence, only approximately 1 will occur in a year...

[/ QUOTE ]

The poisson distribution assumes discrete intervals, not continuous ones (sorry if my terminology is awkward) - i.e. poisson will give you the expected number of failures in three hour intervals such as 00:00 - 03:00, 03:00 - 06:00 etc. A failure at 2:30 and another at 3:30 would overlap, but in the poisson they would be in different three-hour periods, and wouldn't get counted.

I think a better way to phrase it might be that the poisson is concerned with events in a given period, whereas I'm concerned with the proximity of events.
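
The 2:30 / 3:30 example can be made concrete with a tiny Python sketch. The two helper functions are hypothetical names introduced here; times are in hours since midnight, and every repair is assumed to take exactly 3 hours.

```python
REPAIR_HOURS = 3

def same_bucket(t1, t2, width=REPAIR_HOURS):
    """True if both start times fall in the same fixed 3-hour period."""
    return int(t1 // width) == int(t2 // width)

def overlaps(t1, t2, duration=REPAIR_HOURS):
    """True if two repairs of equal duration are in progress at the same time."""
    return abs(t1 - t2) < duration

first, second = 2.5, 3.5   # failures at 02:30 and 03:30

print(same_bucket(first, second))   # False: 00:00-03:00 vs 03:00-06:00
print(overlaps(first, second))      # True: both are down from 03:30 to 05:30
```

Counting by fixed periods misses pairs like this one, which is the proximity-versus-period distinction made above.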