On [date], we noticed that air temperatures in both the X°C and Y°C chillers had begun to rise. Some air temperature variation is normal and has many causes, including scheduled defrost cycles, during which the chillers shut down to eliminate ice buildup on the unit. Over the next several days, we realized that the chiller in the y chamber would not be able to maintain optimal temperatures for storing fresh meat and seafood. On [date], we determined that some products had been stored outside the recommended ranges from [date] through [date], and we initiated a full recall of the [product] and [product] products sold and delivered to customers between those dates.
Upon initiating the recall, we immediately took the following actions:
It took x days to return the y chamber to normal operation. On [date], we started stocking up [products] so we could resume selling [product] products. The y chiller failed again during this first stock-up attempt. We had to discard additional product, but none of it reached customers, so a second recall was not necessary. Our chilled rooms are now operating normally again. We have put short-term fixes in place and are working on longer-term solutions to ensure this problem does not happen again.
Customer and Financial Impact
The recall directly affected x customers. The direct cost of the recall was approximately $x: $x in refunds, $x in goodwill credits towards future purchases ($x credits to x customers), and $x in inventory that was disposed of. Additionally, we were unable to sell much of our chilled range from [date] until the full range was back in place at the end of [date]. We lost potential sales during that timeframe and, more importantly, customer trust.
Background
For background, the fresh and frozen operation on level was started on [date], operating across 3 temperature regimes: x, y and z. The cold rooms were designed and built by [mfr] based on assumptions given to [company] on [date] to calculate the heat load. The chambers are designed to work at x-y, z-q and r. The insulated envelope was constructed by a contractor using standard sandwich panels and an insulated floor.
The operation and maintenance of the system is the responsibility of [company]. The temperature-controlled operation is under the umbrella of the [organization] license for the building. The chilled chambers are cooled by the [mfr] central ammonia system with a single direct expansion blower in each chamber. Whilst there is a standby compressor, there is no backup for the blowers. The central system is operating at capacity, and the freezer is supplied from a standalone Freon system. Overall, there had been no major issues with temperature control until [date].
The food quality team has manually recorded the temperature for each of the 3 chambers on an hourly basis since the operation started.
Triggering event
The triggering event was a buildup of ice and dust on the blower that reduced the output of the chiller in the y chamber to the point where it could no longer maintain the desired temperature given the heat load in the room. There were several root causes, most of which stem from an insufficient understanding of, and operational control over, the effective chilling capacity, our heat load, and how those two factors interacted. The chiller's effective output gradually declined due to ice and dust buildup. Concurrently, we had been gradually introducing a higher heat load into the y chamber because we needed more material, people and activity to handle increasing order volumes. Once the heat load passed the effective capacity, we were unable to recover without impacting normal operations.
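The interaction between declining effective capacity and rising heat load can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `first_overload_day`, the linear rates of change, and all numeric figures are hypothetical and are not measurements from our chillers.

```python
def first_overload_day(capacity_kw, capacity_loss_per_day,
                       heat_load_kw, load_growth_per_day, max_days=365):
    """Return the first day on which heat load exceeds effective capacity.

    Assumes (hypothetically) that effective capacity falls and heat load
    rises linearly. Returns None if no overload occurs within max_days.
    """
    for day in range(max_days + 1):
        effective = capacity_kw - capacity_loss_per_day * day
        load = heat_load_kw + load_growth_per_day * day
        if load > effective:
            return day
    return None


# Hypothetical figures only: 20 kW nominal capacity, 14 kW starting load.
with_fouling = first_overload_day(20.0, 0.25, 14.0, 0.25)
without_fouling = first_overload_day(20.0, 0.0, 14.0, 0.25)
print(with_fouling, without_fouling)
```

With these made-up numbers, ice and dust buildup brings the overload forward, but a growing heat load alone still reaches the nominal capacity eventually.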
Root Causes
Why did the chilling system fail?
1. The chilling system was not designed for our type of operation
When our vendor originally designed the chiller system, they did not anticipate the heat load our operations would place on these rooms. During the outage, the vendor was surprised to see the amount of activity we had to perform to receive, put away, pick, pack, and ship chilled product. They were also unaware that we would have blast freezers to chill our eutectic plates. As we worked with the vendor to understand their assumptions and chilling output specifications, we realized that the capacity of the y chamber is not sufficient for the growth of our business. So even if there had been no reduction in effective chilling output due to dust and ice, we would eventually have hit the nominal capacity.
2. We did not have effective temperature monitoring systems in place
The temperature in the chilled rooms was monitored, just not in a way that was as effective as we needed it to be. Security recorded the temperature regularly but there was no process to escalate changes. We also had a QA process that measured the temperature of outbound totes. However, we did not have the appropriate feedback loops in place to make sure this data made it back to the appropriate people. Moreover, during the chiller event, the QA person involved in testing and monitoring moved to another role. The role was not immediately filled. Also, since food temperature is such a key element in product safety, it should be the responsibility of the operations managers, not the security staff.
3. We did not have clear guidelines and escalation procedures on how to handle temperature variance
With proper escalation procedures, we would have recognized and reacted to the issue sooner. There were no clear guidelines on what to do when a temperature variance occurs. As mentioned above, some air temperature variance is normal. A short-term air temperature variance may have little or no effect on the temperature of the product. However, a sustained air temperature variance may require action. There was no way for the person recording the temperature to know whether a variance required follow-up action. Examples of follow-up actions include measuring specific food items, escalating to the appropriate internal people or groups, and requesting that our vendor immediately check and/or service the chilling unit.
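The distinction between a short-term and a sustained variance could be turned into a simple rule that the person recording the hourly temperatures can apply. The sketch below is one possible shape for such a rule, not our adopted procedure: the function name `classify_readings`, the 3-hour threshold, and the example range and readings are all hypothetical.

```python
def classify_readings(hourly_temps, low, high, sustained_hours=3):
    """Classify a series of hourly air temperature readings.

    Looks at the longest run of consecutive readings outside [low, high].
    sustained_hours is an assumed threshold, not an agreed guideline.
    """
    longest = run = 0
    for temp in hourly_temps:
        if low <= temp <= high:
            run = 0  # back in range, so the run of variances ends
        else:
            run += 1
            longest = max(longest, run)
    if longest == 0:
        return "normal"
    if longest >= sustained_hours:
        return "sustained_variance"
    return "short_variance"


# Follow-up actions taken from the examples in the text above.
ACTIONS = {
    "normal": "no action",
    "short_variance": "measure specific food items",
    "sustained_variance": "escalate internally and request a vendor check",
}

# Hypothetical readings against an assumed allowed range of 2 to 5.
status = classify_readings([3.0, 6.0, 6.5, 7.0, 7.2], low=2.0, high=5.0)
print(status, "->", ACTIONS[status])
```

A single out-of-range reading, such as one taken during a defrost cycle, would be classed as a short variance, while several consecutive readings would trigger escalation.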
4. We relied on a vendor to provide and maintain a mission critical piece of our infrastructure and did not have sufficient understanding, expertise, equipment, and controls to deal with the range of problems that could occur
During our discovery process, we learned that our vendor did not have a regular maintenance schedule for the chilling units. They monitored and recorded temperatures, but did not inform us when the temperature was outside the design limit of 2 - 5, as it is our internal operation. We have also learned that, with the increased heat load, the chiller starts to ice up every two weeks. In addition, we did not have a scissor lift to inspect the chilling unit ourselves.
5. We had a single point of failure and could have taken more risk mitigation steps
The cooling system has a number of single points of failure. For instance, there is only one blower in each room. When it goes down, whether for planned maintenance such as a ~1 hour defrost cycle or for an unplanned outage, there is no other chilling source for that room.
We have made or are in the process of making the following changes to address the nominal and effective capacity of our cold rooms:
We took the following actions to reduce the heat load:
We have made or are in the process of making the following changes to address the monitoring and escalation of temperature in our chiller rooms:
Most of the lessons below can be applied to many areas in [company] outside of the chiller and operations area.
Here are some things we did well during the event.