Making the right decisions during system design to account for the unknown
One afternoon, one of our OnTrack™ Managed Services customers called in a panic and said “I sure wish we would have implemented Flashback on the prod databases”. I knew right then that something was very wrong. As it turns out, someone logged-in to their production database and truncated two of their key tables. Just like that… *poof* millions of rows of data were gone.
So we started discussing options.
yes, they were in place and successful the night before. We had archivelogs available to roll the Database forward, so there was a way to get the data back. It involved taking the entire database back to the pre-truncate state but, by now, the database had processed thousands of new transactions that would be lost.
Nope. This database was too big for regular exports containing data to be useful.
yes, we have several. The customer checked this first and reported that the Standby lag was < 1 minute. Therefore, the changes had already been shipped. It was then that the customer realized I was directing his attention to the Active Standby Database, not the first local standby.
You see, about a year ago, we began a Greenfield project and we designed as much usable redundancy as was financially tolerable in the environment. Sure we discussed Real Application Clusters (RAC), but knowing the customer can tolerate 30 minutes of unplanned outage, and several hours of a planned outage, we came to the conclusion that clustering the database with RAC wasn’t cost effective. Instead, we made a more calculated decision for which we built out:
- a local primary Database
- a local Standby Database using DataGuard
- a remote standby Database for disaster recovery using DataGuard
- a local active standby (ASB) database using DataGuard to offload reporting and ETL
- maintained a 2-hour apply lag on the ASB for added protection.
Forced apply lag determines the amount of time we wait after the transactions commit on the primary database to apply those changes on the standby database. The logs are shipped to the standby, which means the logs for the truncate were present on the ASB server, but had not yet been applied.
I clearly remember, during the design phase, recommending the customer keep this database in a perpetual lag state, but they pushed back. After some discussion about User Error, they agreed that since it would not affect the reporting they wanted to offload, we could maintain a 120 minute lag behind production.
With this knowledge in hand, we set out to find the fastest way to recover the missing data. It turns out that the two truncated tables are only used when certain inbound transactions are received, and only a few of those transactions were received since the error occurred. This was the best case scenario for such an ‘oops’ moment.
- We rolled the Active Standby (ASB) database forward to a time just before the truncate was performed and verified the data.
- Then, we copied the few new transactions off to another table and inserted the data that we extracted from the ASB.
- Once that was finished, we copied the few new transactions back into the production tables.
This entire episode took place within a 2 hour window, and they did not experience any downtime or lose a single row of data related to this user error.
What I learned that day as a DBA is that you can never have too many ways to get your data back. As a consultant, I learned that our customers pay us for our knowledge and expertise, and sometimes they need some extra guidance in order to see what I see from our years of experience in the industry (like having a standby that lags behind production). And as a businessman, I learned that putting flashback in place is a key element to maintain minimal downtime. I believe if we had flashback we could have put the data back in less than an hour.
About the author
Dan Elliott has been with Eagle since its inception in 2003. In Eagle’s early days, Dan was “the tech guy,” responsible for everything tech-related, while Chuck Egerter handled everything admin. Dan is now our leading Database Systems Architect and Senior DBA responsible for managing the Technical Delivery Team. Dan is a results-driven technical leader with experience in implementing a vast variety of IT solutions. He oversees the team that manages and monitors hundreds of databases world wide on various versions of Oracle providing installations, maintenance, test environments, and primary DBA services.