My Experience with Problem 2 of the 2021 Bioinformatics Contest
I really enjoyed working on the second problem of the qualification round of the 2021 Bioinformatics Contest hosted on the Stepik platform. Not only did it help me learn about a real-life application of bioinformatics, but it also reinforced the importance of efficient code. Even though there was no official time limit on the amount of time for my solution to run, I ended up optimizing my code anyway because the data input was so large in some of the test cases that had I not done so, my code would have taken days if not weeks to run. I experienced first-hand the importance of efficiency in a real-world application with real-world time constraints rather than the seemingly arbitrary time constraints of most competitive programming contests.
I initially wrote a brute force solution for this problem, which consisted of three nested for loops. That was good enough for the first two test cases: for the first one each loop made no more than 10 iterations for a max of 1000 operations that ran in less than a second, and for the second each loop made no more than 500 iterations for a max of 5003 or 125 million operations that took my Python code about two minutes to run. However, the third test case was another story. For that one the outer loop made 100,000 iterations and each of the other two loops made 1000 iterations. The total number of operations had I allowed the program to finish running would have been 105 × 103 × 103 for a total of 1011 operations in the worst case. If 108 operations took two minutes, 1011 operations would take 2000 minutes, which is about 33 hours. On top of that there were still two other test cases, each of which I had calculated would do 1012 operations with my brute force solution. Those would each need about two weeks to run, and there were only six more days left in the seven-day contest by the time I started.
After waiting five minutes and seeing that the program had not finished running, I decided to do some research to see if I could discover a more efficient solution. In the problem I was working on, we are given a list of measurements of signals consisting of positive floating-point numbers. We are also given a list of masses of metabolites (biochemical compounds) and a list of charged fragments (adducts) that metabolites can gain or lose in order to become ionized. The list of masses of metabolites consists of positive floating-point (decimal) numbers, while the list of adducts are floating-point numbers that can be positive, negative, or zero. The task was to find for each of the signals the pair of metabolite and adduct whose sum is closest to the measurement of that signal. In a similar problem I found while researching, we’re given a sorted array and a target number with the goal of finding two numbers in the array with a sum closest to the target number. The efficient solution involved using two pointers, one at the start of the array and the other at the end, and adjusting the positions of one pointer or the other based on whether the resulting sum of the two numbers pointed at was too big or too small.
I realized I could adapt this solution to my original problem by sorting the lists used in the inner loops of my brute force solution, then using a pointer at the beginning of one list and at the end of the other list and adjusting the pointers in the same way as in the solution I had found. That way, instead of using three nested loops, I could use just two nested loops, one to iterate through each signal, and the other to find the combination of metabolite and adduct whose sum comes closest to the signal’s measurement. That would bring the number of operations involved for the third test case down to 105 × 2 × 103, which is 2 × 108 or 200 million operations in the worst case.
By that point, half an hour had passed, and my brute force code was still running. I stopped it and coded up the more efficient solution. Now it took my code less than four minutes to run the third test case. That’s a huge improvement over the 33 hours that would have been needed for the brute force method! I calculated that test case #5 would be the faster of the two remaining test cases to run, so I ran that one next. For that one there were up to 104 signals, 104 metabolites, and 104 adducts, so the number of operations needed with the updated code was 104 × 2 × 104 or 200 million in the worst case, the same amount as test case #3. Indeed, that one also ran in less than four minutes. For test case #4, there were 1000 signals, 1 million metabolites, and 1000 adducts; the updated code made 103 × 106 or one billion operations in the worst case. This one took about twenty minutes to run. It was definitely worth the time I took to search for a more efficient solution as there is no way my program would have finished running for those last two test cases within the duration of the contest otherwise.
I am really glad I discovered this contest. I learned about a whole new field that I am now interested in, and I have further developed my knowledge of the ways in which I can make my code more efficient.