Consider this interview question reportedly given to data scientist candidates:
‘There is a building with 100 floors. You are given two identical eggs. How do you use two eggs to find the threshold floor, where the egg will definitely break from any floor above floor n, including floor n itself?’
I would imagine that this does not feature in real interviews, since there are more direct ways of determining a candidate’s skills and overall suitability for a job. It also tempts ‘trick’ answers like ‘Most eggs will break from 1-2m high’ ensuring a complete waste of time for both the interviewer and the candidate.
The key thing here is to not focus on semantics. You need to assume that all eggs behave the same and that if you drop and egg and it survives, then you can drop it again without it getting damaged. In the real world, you might be searching a database for specific data from billions of other values, and searching them all one by one is just not feasible. This will become clear later!
Here’s how a data scientist might approach the problem:
1. Linear search
Drop the first egg from the 1st floor. If it breaks, then you’ve found n! If it doesn’t, then drop the egg from the second floor and repeat until you find the floor at which the egg breaks. This will find n with just one egg.
But what if n = 100? You would have to drop the egg 100 times before you find the correct answer. If a database worked like this with billions of rows of data then it will take a massive amount of time. You should find something more efficient.
We could try dropping the first egg from the 10th floor instead. If it breaks, then you know that n<10 and you can begin searching every floor below the 10th floor, starting at the first and working your way up.
Similarly, if it survives the fall from 10 floors, move up to floor 20 and repeat the process. For example, if the egg breaks at 60 after going up in intervals of 10, then I know that n is greater than 50 but less than 60. This reduces the maximum number of egg drops required to find out n, which makes our algorithm (a formula for analysing data) a lot faster.
The maximum number of drops required for this method is 19, because the worst case scenario is that n = 99, and you would have to try all intervals of 10 and then all cases between 91 and 99 before you could find out the answer.
So what’s the optimum number of drops?
2. Binary search
We can find out the minimum number of eggs that we would need to break by splitting the floors in half. Start at floor 50: if it breaks, move down to floor 25; if it survives, move up to floor 75. You are dividing the number of possible floors by half with each trial, so you should get to the right answer in the fewest number of drops possible.
This also applies for larger buildings. For example, if we divided a building of 512 floors this way, we would get to the right answer in a maximum of 9 drops.
But this isn’t a good solution for our two egg problem yet. If n = 1, then we would have to break 7 eggs before we found our optimal solution. Let’s adjust this slightly to find an optimal solution for two eggs.
The formula to work this out relies on a quadratic equation. The explanation is quite long for this post, but if you would like to read the full explanation then you can read it here.
The short answer is to start at floor 14. If it breaks, use the second egg to search the lower 13 floors for a maximum number of 14 drops. If it survives, instead of moving up in regular intervals to 28, move up to floor 27. Because this is our second drop, with each interval we move higher up the building we want to try and reduce the number of floors to search. If the egg breaks at floor 27, it means we only have an extra 12 floors to search, giving us a maximum number of 14drops required to find n.
3. A more imaginative approach
Although the interviewer is testing knowledge of algorithms, it might be fun to look at a more practical approach to the problem. After all, determining the strength of eggs is a real-world problem!
- Data mining: discover the relative strength of eggs by combine data on egg size, egg shape and the species it came from. Mine the data for relationships between these factors and others (in fact, this data is already available and well documented – fun link). Cluster eggs on various traits.
- Modelling: Calculate the momentum for each egg with x weight and y shape at each floor, so that we can determine a specific height for n rather than the number of floors. Start to create an model incorporating all of these variables.
- Predictive analytics: combine poultry factory data, maybe even data from school science projects (surely there are plenty of egg dropping experiments to be found?) and predict the likelihood of an egg breaking from n based on it’s characteristics.
- Machine learning: teach a machine to identify egg cluster types from images, and compute n based on its other traits.
There is certainly a free range of other technologies and statistical methods you can use to build from this and predict n with greater certainty, which is why data science and analytics is so exciting right now. You are unlikely to see a challenge like this on Kaggle anytime soon (in fact, all of these active competitions absolutely trump the egg problem and are worth reading about) but it is just an interesting glimpse into the analytical mind of a data scientist.