A.J.S. Rayl • Jan 26, 2004
Spirit is "In Recovery"
The first Mars Exploration Rover Spirit has gone from "critical" to "serious" and now is "in recovery," MER Project Manager Pete Theisinger has announced, and the robot's Anomaly Team is pretty confident now that they know what went wrong last Wednesday.
Spirit stopped transmitting good data last Wednesday, but never -- contrary to some reports -- went silent. It continued communicating with the MER team, although some of that communication was the return of bad data. It is communicating with the team now and returning good information, with which the team is working.
"Our truths are often temporary," Theisinger said, quoting EDL Manager Rob Manning, who experienced a couple of "temporary truths" during Opportunity's landing Saturday night.
Given that caveat, Theisinger said at the daily news briefing yesterday that the leading hypothesis for Spirit's woes is that the robot field geologist's flash memory, a 256-megabyte, short-term memory system similar to the memory sticks used in digital cameras, became corrupted, specifically in the file management software module. That theory still leads the list of possible glitches. "The software had gotten into a condition that it could not cope with," Theisinger said. "It was not robust enough for the operations we were engaged in when we had the flaw on Wednesday." As a result of that "flaw," Spirit rebooted itself more than 125 times, was unable to shut down and go to sleep, and wound up going to low power.
Once the Anomaly Team (A-Team) members had regained control of the errant robot Saturday, however, they began requesting -- and getting -- the engineering data needed to figure out possible scenarios and plans for how to deal with the problem. In all likelihood, if their hypothesis is correct, the view is that they will be able to correct the problem by relaying new software. To solve the problem, the MER team is considering deleting hundreds of files, including many of the data files collected during Spirit's seven-month cruise to Mars.
Two other theories, though, "cannot be discounted," said Theisinger, and are still being investigated by the A-Team. "One is that some kind of error or hardware issue occurred on the motor control board, the circuit board with the electronics that controls the motors," he expounded. The other is a solar event or flare that blasted the Martian surface with heavy ions and neutrons. This event -- which was detected by instruments on board Mars Odyssey -- was not detected by any instruments on Earth. "Flash memories are sensitive to high energy ions and neutrons when they're being read from and written to and we were certainly engaged in a lot of that activity that day," he noted.
In any event, as of yesterday, Spirit's prognosis was vastly improved, and the outlook looks the same if not better today. "The patient is well on the way to recovery and there is a very good chance that we will have a very good rover when we get it back up," Theisinger said.
Earlier today, Mission Manager Jennifer Trosper, who is leading the A-Team, recounted the events and analyses of Spirit's troubles of the last few days. She cautioned, however, that the problem is associated with the rover's ability to collect and maintain recorded data, and that the flash memory, where the data that will tell them what happened is stored, is actually part of the problem.
The team has secured and is now further analyzing a few bits of data, including the last information collected by the rover on Wednesday, Sol 18, a kind of status report of the information on the rover as of yesterday, and the changes in that status seen through today. "We're making a lot of progress now that we've gotten telemetry from the vehicle," she said. "As we get more data some of these things will change."
"On Sol 18 [Wednesday], we had weather problems at the station," she began. The station is the Canberra, Australia dish, which is part of the Deep Space Network, used for communications relays from space to Earth. During the morning antenna pass, they lost the signal 10 minutes early. "It wasn't clear whether that was the result of a spacecraft problem or a station problem. It's entirely possible that was a spacecraft problem at that time. We believe that was possibly a reset [rebooting of the rover's computer] on the spacecraft -- that would have caused our signal to be lost."
As a result, the morning activities did not complete. "If you recall, we were using the Instrument Deployment Device (IDD) [the robot arm], and we were getting ready to RAT" [use the rock abrasion device on Adirondack, before which Spirit now still stands]. "The IDD arm position is actually in the same place it was on Sol 18 before we attempted to do that move."
"Sometime in the late morning, early afternoon of Sol 18 we encountered the problem, which initially was most likely a reset," Trosper said. Although they still do not understand exactly what caused that rebooting problem, which became decidedly exacerbated as time passed, it did steer them to the belief that the flash system was corrupted in a way that threw Spirit into a continuous reset loop.
That afternoon, the mission team sent a command sequence to Spirit, "with a little 'bleep' in it to tell us if the sequence got there - and we received the 'bleep' with no problem," she recounted. "Twenty minutes after that, we expected to see a session from the vehicle on the high gain antenna (HGA) communicating with us as it had been since Sol 2. We didn't see that communication. That, in addition to the 10-minute drop-out early in the morning, was one of the early indications there was something wrong. Then, if you recall, from the afternoon Odyssey pass, we did not see any data from the vehicle."
From the early morning Mars Global Surveyor (MGS) pass on Sol 19 [Thursday], the team managed to receive about 2 minutes of data from the rover. "But it wasn't really data -- it was the-UHF-radio-is-on-and-nobody-was-home kind of data," Trosper explained. The next morning pass by Odyssey returned nothing.
"Then on Sol 20 [Friday], we attempted to command the rover at the nominal uplink rate where it should be if everything is fine and we received no data. We had pre-loaded communications windows where the rover should attempt to communicate with us [if something went wrong]. And those windows did not execute on the morning of Sol 20. We commanded at our nominal uplink rate and did not receive any data Sol 20," she recalled.
One of the things the rovers' computer systems will do if they encounter a system level fault is change the rate at which they communicate. "That's for the vehicle's protection, as well as our knowledge," Trosper said. "So in the afternoon, we sent a command at a different rate to tell the vehicle to send us a 'beep,' and we actually got that 'beep' back. The rate we sent it at was a rate that the software would have autonomously put us in if it had some sort of system level fault."
At that point, they determined four scenarios could have initiated that particular rate of data flow. "And we started to go down the path of those four scenarios," she explained. The overnight UHF passes that night returned nothing.
"On Sol 21 [Saturday], we were actually trying to establish the commandability we had the previous day -- we now knew there was a system level fault - we didn't know if it was a power issue or a thermal issue or an X-band communications issue and so we sent essentially the same command to get a 'beep' on Sol 21 and we didn't get the 'beep.' Then, as we were getting ready to send the next 'beep' command, the vehicle decided to communicate with us in one of its nominal communications windows, at which point we got a little bit of data that had very little information. We started to decode it and it was from the year 2053. We said, 'This is not good,'" she remembered with a chuckle. They quickly found that the data was corrupted. "We were all cheering at that point, because there weren't a lot of scenarios that would put us in 2053 on Mars."
That signal, which was returning information at the excruciatingly slow rate of 10 bits per second, dropped out after 9 minutes, so they received very little data, and what data they did receive was corrupted.
"We then immediately sent another command to the spacecraft to give us a 30-minute communication session at 120 bits per second," she explained. "That command was received and we got the signal on the ground and we got one frame of data which told us it was sending us data and then it stopped." The team modified some of the parameters in the command to try to get a different set of data and sent the revised command. "And it actually gave us a very limited [report] of the current state of the vehicle . . . we had some channelized telemetry that told us how many flight software resets had happened over the course of those two nights and that's where the big numbers [of resets or reboots] came from." That was when they realized they had a reset problem that was keeping them from being able to conduct tests and communications.
"As a result of that knowledge, we also realized the vehicle may not have shut down, because the reset could be associated with the shutdown of the vehicle," Trosper continued. "So we attempted to shut the vehicle down and then we sent a 'beep' after we shut it down to make sure it shut down and we got the 'beep' - and we shut it down again and we sent a 'beep' and got the 'beep.' The vehicle was clearly not able to shut itself down and the reset was causing a problem with the shutdown."
By then, they knew too that the power system was struggling and the battery wasn't charged as much as was desired. So they decided to delete all the UHF overnight passes. "But in the same way the reset cycle had caused [previous] commands not to get in," it prevented this command. Much to their chagrin, the team received the first Odyssey UHF pass despite not wanting to hear from Spirit. "We wanted her to be asleep and recharging the batteries."
Following that, the team requested that Odyssey and MGS turn off their radios, so they could ensure Spirit would not use any more energy that night trying to transmit. "We were getting close to entering our low power mode . . . a mode that will 'safe' the vehicle and take the batteries off-line and let it simply bask in the Sun until the voltage gets high enough for the rover to be able to 'wake up.'"
So Spirit woke that morning via her solar arrays. "We saw then that we had indeed entered low power mode and the fault system worked exactly as designed," Trosper pointed out. "We don't get our morning communications until about 11 a.m., and in that we realized that we did have this reset problem. Based on the hunch of our lead software architect, [who] believed that the problem was probably associated with the mounting of flash and the initialization," the team decided to send a command to the hardware to bypass the software, to not allow the use of flash on initialization.
"The next day we sent the command to do that and the software initialized normally and was behaving like the software we had always known," Trosper said. "It was a fantastic moment."
Once they managed to get into that mode where they could command Spirit to go into a software state that they understood, they were able to start collecting data. "That's the path we're on right now."
The issue has been further narrowed down, Trosper added. "The amount of space required in RAM to manage all of the files we have in flash is apparently more than we anticipated," she offered. Spirit's computer has 128 megabytes of RAM, which is used to temporarily store data and administer computer operations. It is erased, as previously reported, when the computer shuts down. "We have been collecting data and collecting data and had lots and lots of files on the spacecraft. We intended for that, and this was a new problem that we encountered based on having so many files."
Although they had conducted numerous operational readiness tests, the longest operational test was 10 days -- "and we were on Sol 18 when the problem occurred," she noted.
As a result of those findings, the A-Team is moving in a more specific "de-bugging activity," Trosper said. "Today, we started to dump out some of flash and we're actually loading a script that will get some of the past traits on the software and identify exactly where the problem was in the code so that we can make sure our hunch was correct. Tomorrow we might try to access flash and do a little bit of a health check on it and then next day we might try to delete some files to see if our hunch is correct. But it is really due to the number of files we're trying to manage on the flash file system. That was our read on Spirit."
Meanwhile, the A-Team has begun trying to recreate the problem in JPL's Mars yard rover testbed. If possible, the team would like to keep the files stored in flash. But time -- and more work in figuring out a workaround or solution -- will tell if they can keep, or must delete, some of or all of the flash files.
"The folks working on the details are the best of the best in the world that we have and every day I come in to work their innovation and persistence and talent and hard work almost overwhelms me and certainly humbles me. But that's what got us to where we are today and that's what's going to get us to having a healthy Spirit rover on the surface shortly," Trosper added.
"We don't know yet whether Spirit will be perfect again. Our current theory is that software could fix the problem, but we still need to work with HGA to make sure our theory works out."
If their theory does hold, it will still be two to three weeks, Trosper estimates, before Spirit will be collecting science or out roving again, because it will take some time to completely characterize the problem and check out the functionality on the rover. "Adding any variable to system [right now] might cloud the answers we're getting," she elaborated. "Most likely we'll have to have the engineering problem fully understood before we go after the science."
Provided Spirit can be brought back online and restored to normal functioning, there will be little, if any, disruption to the planned mission objectives. "That 90-day figure [cited as mission duration for both Spirit and Opportunity] is when the warranty expires," lead scientist Steve Squyres pointed out. "These vehicles are performing magnificently. Let's say this stand-down lasts 30 Sols. Before that, we were 17 for 17. We put margin on top of margin specifically to allow for things to go wrong."
"I think we have a very good chance now that we'll have a very good rover when we get done and get this thing back up," offered Theisinger.
Although Spirit has returned no spectacular images since last week, Trosper has found that her favorite image to date actually emerged from these troubled times. "My favorite image from Mars so far has been the one that [we got] after about a day and a half of not hearing from the rover -- the image on the afternoon of Sol 20, where the signal went from a flatline to a 'beep.'"