In this post I will go back to an article that I read around 30 years ago. At the time I was working for a Fortune 500 company. After lunch some of our team members would stop by the technical library in our building. The library had a nice collection of books and magazines.
I believe the article of interest appeared in IEEE Spectrum magazine and dealt with a technique to troubleshoot computer cards. If I recall correctly, it was about the Xerox company.
The problem was how to determine whether a computer card was faulty, and beyond that, which component was at fault so the issue could be corrected.
Today we develop and maintain massive software products. Believe it or not, things go wrong quite often. When that happens, we would like to determine the cause so it can be addressed in a short period of time.
Could we make use of an approach similar to the one used with computer cards 30 years ago? Things have changed in 30 years, and software differs from hardware, but ideas that worked on similar problems can work again when properly adapted.
As far as I recall, the idea was to take a computer card and inject a specific signal with a signal generator at a specified point. You would then check a specified set of points with an oscilloscope and look for specific waveforms. The procedure identified the labeled posts on the card where the signal should be injected and the posts where the scope should be connected to check the waveform. The technician debugging the card would follow a diagram and a sequence of steps (a decision tree) to connect the scope and verify each waveform.
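To make the decision-tree idea concrete, here is a minimal sketch in Python. The test points (J1, T3, T9) and the diagnoses are made up for illustration; the article did not give specifics that I recall.

```python
# A sketch of the technician's decision tree, with hypothetical test
# points and diagnoses. Each node says where to inject the signal,
# where to probe, and which node (or final diagnosis) comes next,
# depending on whether the observed waveform matches the expected one.

DECISION_TREE = {
    "start": {"inject": "post J1", "probe": "post T3",
              "ok": "check_amp", "bad": "replace U7"},
    "check_amp": {"inject": "post J1", "probe": "post T9",
                  "ok": "card is good", "bad": "replace Q2"},
}

def diagnose(waveform_matches):
    """Walk the tree; waveform_matches(probe) returns True when the
    scope shows the expected waveform at that probe point."""
    node = "start"
    while node in DECISION_TREE:
        step = DECISION_TREE[node]
        print(f"Inject signal at {step['inject']}, probe {step['probe']}")
        node = step["ok"] if waveform_matches(step["probe"]) else step["bad"]
    return node  # a leaf: the diagnosis

# Example run: pretend the waveform at T9 looks wrong.
print(diagnose(lambda probe: probe != "post T9"))  # -> replace Q2
```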
So how could we translate that approach to a complex software system with tens of millions of lines of code (LOC)?
Today most large systems periodically emit information about what they are seeing. For example, in a previous life I developed a storage server. With all components included, the production system was about 6M LOC. The approach was to have some methods in some classes periodically write metrics to log files. For example, the dispatcher could write how many requests were pending in its queue every few seconds, the processor for new objects could write how many objects it generated in the same interval, and other classes could report the amount of memory free and used. I guess you get the idea.
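A minimal sketch of that pattern follows. The class name, metric name, and log file are all assumptions for illustration; the real storage server was of course far more involved.

```python
import logging
import threading
import time

# Hypothetical metrics log; in practice each service would have its own.
logging.basicConfig(filename="metrics.log",
                    format="%(asctime)s %(message)s", level=logging.INFO)

class Dispatcher:
    """Toy dispatcher that emits its queue depth every few seconds."""

    def __init__(self, interval_sec=5.0):
        self.pending = 0
        self.interval_sec = interval_sec
        # Background thread that periodically writes the metric.
        threading.Thread(target=self._emit_metrics, daemon=True).start()

    def _emit_metrics(self):
        while True:
            logging.info("dispatcher pending_requests=%d", self.pending)
            time.sleep(self.interval_sec)
```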
When all is well, we can collect the data and establish what healthy averages look like. Then, when something goes wrong, we can look at the metrics, determine which class or module caused the issue, and pass it to an on-call engineer or the responsible team to resolve.
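Here is a sketch of that idea, assuming we have already parsed the logs into per-metric samples. Flagging anything more than a few standard deviations from its healthy mean is just one simple way to do it; the metric names and threshold are made up.

```python
import statistics

def build_baseline(healthy_samples):
    """From healthy-period samples, record (mean, stdev) per metric,
    e.g. {"dispatcher.pending": [3, 5, 4, 6, 5], ...}."""
    return {name: (statistics.mean(vals), statistics.stdev(vals))
            for name, vals in healthy_samples.items()}

def find_suspects(baseline, current, threshold=3.0):
    """Flag metrics more than `threshold` standard deviations from
    their healthy mean; the metric name points at the class or module."""
    suspects = []
    for name, value in current.items():
        mean, stdev = baseline[name]
        if stdev and abs(value - mean) / stdev > threshold:
            suspects.append((name, value, mean))
    return suspects

healthy = {"dispatcher.pending": [3, 5, 4, 6, 5],
           "memory.free_mb": [900, 880, 910, 895, 905]}
baseline = build_baseline(healthy)
# The dispatcher queue is way off its baseline; memory looks fine.
print(find_suspects(baseline, {"dispatcher.pending": 250,
                               "memory.free_mb": 890}))
```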
Thirty years ago the process had to rely on humans to diagnose the issue. Today, perhaps with the use of AI and ML, we could get to the offending class or module in a very short period of time.
If we think a little further, the AI could predict that a problem is starting to brew in a specific part of the software and let the on-call engineers know that something is not right. Appropriate actions could then be taken and the health of the software restored quickly, without customers being affected. Something to think about!
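One very simple way to catch a problem that is still brewing, rather than one that has already happened, is to watch the trend of a metric instead of its absolute value. The sketch below fits a least-squares slope to the last few samples of a queue-depth metric; the window values and alerting threshold are made up for illustration.

```python
def trend_slope(values):
    """Least-squares slope of a metric over its last N samples;
    a sustained positive slope on queue depth suggests trouble brewing."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Queue depth is still within its normal range, but climbing steadily.
window = [4, 6, 9, 13, 18, 25]
if trend_slope(window) > 2.0:   # hypothetical alerting threshold
    print("dispatcher queue is trending up; page the on-call engineer")
```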
As usual, keep on reading and experimenting; you would be surprised how often something you learned and added to your toolkit becomes useful in the future.
I would like to hear your thoughts. Please leave me a note in the comment section.
John