The Software That (Sometimes) Killed You With Radiation
The Therac-25 Incident: A Software Failure in Radiation Therapy
Overview of the Therac-25 Case
- The Therac-25, a radiation therapy machine used for cancer treatment, became infamous because software errors and poor engineering practices led to six patients receiving massive radiation overdoses, several of them fatal.
- On June 3, 1985, a 61-year-old woman was treated with the Therac-25 for her twelfth radiotherapy session, unaware of the programming flaws affecting the machine.
Initial Incident and Patient Experience
- During treatment, an error in data entry caused the machine to deliver an overdose of radiation. The patient experienced severe pain and reported it to the operator.
- Despite her complaints and the visible burns that appeared two weeks later, AECL (Atomic Energy of Canada Limited, the manufacturer) denied any possibility that its machine had delivered an overdose.
Consequences of Overdoses
- The patient received between 15,000 and 20,000 rads; normal treatments are around 200 rads. This resulted in significant health issues including loss of mobility and breast removal.
- Another incident occurred shortly afterwards: error messages appeared on the control computer so often that operators had learned to ignore them.
Repeated Errors and Lack of Understanding
- An operator mistakenly believed no treatment had been administered due to a software error message. They repeatedly pressed 'P' to proceed without realizing high doses were being delivered.
- The second patient died months later; her death was attributed to cancer despite evidence suggesting fatal radiation exposure.
Investigation Challenges
- AECL attempted to replicate the errors but failed. It identified a faulty microswitch as a potential cause but could not confirm its role in the overdoses.
- AECL then made adjustments and claimed they improved safety by five orders of magnitude (a factor of 100,000), yet it still did not understand the root cause of the earlier incidents.
Continued Failures Despite Corrections
- Even with supposed improvements, another overdose occurred at a different hospital. This raised questions about how such machines remained operational after multiple incidents.
- At that time, software was perceived as infallible compared to hardware which required maintenance; this misconception contributed significantly to ongoing risks.
Communication Failures Among Hospitals
- Because AECL could not identify the exact causes, it did not effectively inform other hospitals about the earlier incidents.
- When asked about safety after the incidents, AECL maintained that its device was not responsible and attributed any harm to operator error or equipment failure.
Understanding Radiation Therapy and the Therac-25 Incident
Overview of Radiation Types Used in Treatment
- A medical physicist's research led to understanding and correcting issues with the linear accelerator, Therac-25, which provided two types of radiation: accelerated electrons for superficial tumors and X-ray photons for deeper tumors.
- Both types of radiation are ionizing, meaning they have enough energy to remove electrons from atoms, destroy DNA, and kill cells. This property is crucial for targeting cancer cells.
Mechanism of Action Against Cancer Cells
- Cancer cells divide more frequently than healthy cells, making them more susceptible to radiation during their division phase (M phase). Treatments aim at maximizing damage to cancerous cells while allowing healthy ones time to recover.
Design Features of Therac-25
- The Therac-25 had a rotating plate with three positions: one for accelerated electrons, another for X-rays, and a third with a mirror and ordinary visible light, used to position the patient correctly.
- Its predecessor, the Therac-20, had hardware interlocks that prevented overdoses by blowing fuses. The Therac-25 removed these redundant hardware safety features and relied solely on software controls.
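The value of an independent interlock can be illustrated with a minimal sketch. Everything here is hypothetical: the names `Turntable`, `MAX_BEAM_LEVEL`, `interlock_permits` and `fire_beam`, and the beam levels, are invented for illustration. The real interlocks were electrical hardware, not Python. The point is only that the final check relies on the measured plate position rather than on what the control software believes.

```python
from enum import Enum


class Turntable(Enum):
    """The three plate positions described above."""
    ELECTRON = "electron"
    XRAY = "xray"
    LIGHT = "light"


# Hypothetical beam limits per position, in arbitrary units (not real machine values).
MAX_BEAM_LEVEL = {
    Turntable.ELECTRON: 1,
    Turntable.XRAY: 100,
    Turntable.LIGHT: 0,  # the positioning light never allows a beam
}


def interlock_permits(measured_position: Turntable, beam_level: int) -> bool:
    """Independent check: trust the measured plate position, not the software's belief."""
    return 0 < beam_level <= MAX_BEAM_LEVEL[measured_position]


def fire_beam(measured_position: Turntable, beam_level: int) -> str:
    """Refuse to fire when the interlock check fails."""
    if not interlock_permits(measured_position, beam_level):
        raise RuntimeError(
            f"interlock tripped: level {beam_level} not allowed at {measured_position.value}"
        )
    return f"beam on at level {beam_level} ({measured_position.value})"
```

With this check in place, an X-ray-level beam requested while the plate sits in the electron position raises an error instead of firing.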
Software Development Concerns
- The software was based on code from the Therac-20 and was written entirely by a single programmer, with no internal or external review. This lack of oversight raised significant safety concerns.
Incident Analysis and Consequences
- In one incident, an operator mistakenly entered 'X' instead of 'E' and then corrected it. The machine displayed error 54, an undocumented error that turned out to signal a possible overdose. The operator proceeded without realizing the patient had received excessive radiation.
- The initial investigation found no fault with the machine, but when the same error occurred again shortly afterwards it became evident that there was a serious underlying problem.
Repeated Errors Leading to Fatalities
- Another overdose followed, caused by the same input error ('X' instead of 'E'). It resulted in severe harm to the patient, who died months later. Investigations continued but failed to identify the systemic problems in the software and the operating procedures.
Attempts to Replicate Issues
Understanding the Therac-25 Incident
Overview of the Problem
- The error appeared only when the operator changed the entry from 'x' to 'e' in under 8 seconds. Experienced operators typed that fast, but people trying to reproduce the fault did not, which made it hard to replicate.
- AECL confirmed that a specific sequence of steps triggered error 54 and could deliver a dangerous radiation overdose (25,000 rads). The software ran several concurrent processes, which made the order of execution hard to reason about.
Software Execution and Errors
- The software had several processes: one for treatment stages (data entry, progress), another for keyboard control, and a third for controlling the rotating plate based on screen information.
- A race condition occurred when processes executed out of order. For instance, if data entry finished too quickly, it could mislead other processes about the machine's state.
Sequence of Events Leading to Overdose
- When the operator switched modes too quickly (in under 8 seconds), the energy setting and the position of the rotating plate no longer matched, which could lead to an overdose.
- X-ray mode requires more beam energy than electron mode. Because the configuration was left inconsistent during the mode switch, the high energy level was applied without the attenuation that the X-ray position normally provides.
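The sequence above can be approximated with a simplified, deterministic model. This is a sketch under stated assumptions, not the machine's actual code: the function `simulate`, the constant `SETUP_SECONDS`, and the idea that one task reads the mode once while another always follows the screen are illustrative simplifications of the behaviour described here.

```python
SETUP_SECONDS = 8  # window taken from the text: edits made inside it were missed


def simulate(edits):
    """Model two tasks reacting to operator input.

    edits: list of (time_in_seconds, mode) pairs, with the first entry at the start.
    Returns (beam_mode, turntable_mode) as seen when treatment begins.
    """
    first_time, first_mode = edits[0]
    beam_mode = first_mode              # beam task reads the mode once, then starts setup
    setup_done = first_time + SETUP_SECONDS
    turntable_mode = first_mode
    for t, mode in edits[1:]:
        turntable_mode = mode           # turntable task always follows the screen
        if t >= setup_done:
            beam_mode = mode            # a late edit is noticed and setup restarts
            setup_done = t + SETUP_SECONDS
        # an edit made before setup_done is silently ignored by the beam task
    return beam_mode, turntable_mode


fast = simulate([(0, "X"), (5, "E")])   # correction typed within the window
slow = simulate([(0, "X"), (12, "E")])  # correction typed after the window
print(fast, slow)
```

In the fast case the model ends with the beam configured for 'X' while the plate is positioned for 'E', which is the mismatch described above. In the slow case both agree. A safer design would re-validate both values together immediately before enabling the beam.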
Response and Consequences
- A letter was sent to clients advising against using the "up" key for data editing without explaining why. This lack of transparency raised concerns among users.
- At an annual conference, hospitals learned about the earlier overdose incidents and demanded changes from AECL. Hardware interlocks were also discussed as a safety measure.
Further Incidents and Regulatory Changes
- In January 1987, another incident occurred where a patient received a 10,000 rad overdose despite not using the "up" key as instructed.
- The software continuously checked if components were correctly positioned using a counter stored in memory. An arithmetic overflow reset this counter unexpectedly.
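A minimal sketch of how such an overflow can silently disable a check, assuming the counter was one byte wide and that a nonzero value meant "position check still required". The function names are hypothetical and the code is an illustration, not the original implementation.

```python
def check_required(passes: int) -> bool:
    """Flag kept as an 8-bit counter that is incremented on every pass."""
    counter = 0
    for _ in range(passes):
        counter = (counter + 1) & 0xFF  # one byte: 255 + 1 wraps around to 0
    return counter != 0                 # zero is read as "no check needed"


def check_required_fixed(passes: int) -> bool:
    """Same intent, but the flag is set to a constant instead of incremented."""
    flag = False
    for _ in range(passes):
        flag = True                     # cannot wrap around
    return flag


print(check_required(255), check_required(256), check_required_fixed(256))
```

On the 256th pass the incremented counter returns to zero, so the position check is skipped for that one pass. Setting the flag to a fixed value avoids the wrap-around entirely.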
Final Outcomes and Lessons Learned
- During treatment preparation, an overflow led to incorrect assumptions about readiness; thus maximum energy was applied directly without proper safeguards.
- Following these incidents, media scrutiny increased and legal action was taken against the hospitals and AECL. The corrections that were eventually mandated included software updates and hardware interlocks.
Understanding Software Safety Failures
Importance of Comprehensive Reviews
- Programming errors must be taken seriously from the outset. Blaming a single bug is an easy but inadequate response. Common programming errors can be mitigated through extensive internal or external reviews, including safety audits.
Lessons Learned from Safety Analysis
- Nancy Leveson, a specialist in software and system safety, highlights critical lessons: relying on hardware testing while neglecting software assessment left serious hazards undetected. Software should never be assumed reliable without thorough scrutiny.
Misconceptions about Reliability and Safety
- The Therac-25 had delivered numerous successful treatments before its first incident, which gave operators a false sense of security. They had become used to ignoring error messages because the software safeguards were assumed to be robust. This reflects a lack of defensive design in the software.
Inadequate Incident Response Practices
- There was insufficient follow-up on incidents; despite notifications after the first occurrence, delays in informing hospitals and doctors were noted. Poor engineering practices included inadequate documentation and uninformative error messages.
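One way to picture more defensive error handling is a fault table in which every code carries a readable message and an explicit rule about whether the operator may resume. This is a hedged sketch: the `Fault` class, the message texts, and fault code 7 are invented for illustration, and only the number 54 comes from the incidents described earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    code: int
    message: str      # wording is illustrative, not the machine's real text
    may_resume: bool  # whether a simple "proceed" keypress is acceptable


FAULTS = {
    54: Fault(54, "Delivered dose disagrees with prescription: stop and call the physicist", False),
    7: Fault(7, "Treatment room door open: close the door to resume", True),  # hypothetical code
}


def describe(code: int) -> Fault:
    """Look up a fault; anything undocumented is treated as unsafe."""
    fault = FAULTS.get(code)
    if fault is None:
        return Fault(code, "Unknown fault: treat as unsafe and stop", False)
    return fault


def can_proceed(code: int) -> bool:
    """Allow the operator to continue only for faults marked as resumable."""
    return describe(code).may_resume
```

Under this scheme an undocumented code can never be dismissed with a single keypress, because unknown faults default to "stop".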
Issues with Software Integration and Reuse
- Hardware and software were poorly integrated and were only tested together after installation at the hospitals. In addition, code reused from the Therac-20 was assumed safe without proper validation, and it was later found to contain similar defects that could lead to failures.
Embracing Technology Despite Challenges