Recently I had a kick-off meeting at a customer with a very nice question: is it possible to have a system that gives us 99.999% or 99.9999% uptime for our Oracle database? My first reaction was, "But can your stack on top handle these requirements as well?" Apparently they already have this, so now it was time to bring this to the database level too.
I have to say that the MAA team does a wonderful job, and lots of info can be found here: https://www.oracle.com/database/technologies/high-availability/maa.html I highlighted the full link, so you can see Oracle does everything to make it easy for you. You can drill down to the component you want; in my case I'll need the "Oracle Database" section, but it answers MAA questions for Exadata and cloud as well.
With the preparation for, or move to, the cloud, design comes up more and more. This is the time to think about the environment again.
My personal opinion is "Everything starts with a service catalog". To create one, it is best to define your requirements first. And yes people, this means talking to the end users to define their needs.
Two important terms come into play: RTO (Recovery Time Objective) and RPO (Recovery Point Objective).
RTO means (freely translated): "How long may it take to get the service back online?"
RPO means (freely translated): "How much data loss are you willing to accept, at most?"
It is clear that the stricter these requirements are, the more complex and the more expensive the environment will be. So it is basically a careful balance, a trade-off to make.
A common habit is to define the service catalog in four parts: Bronze, Silver, Gold and Platinum.
Depending on your or the customer's needs, the database will fall into one of these categories. One of the advantages of using this classification is that you gently push the end user towards standardisation. With that comes automation, and that in the end also leads to a more stable environment.
For this example, I'll assume Oracle Enterprise Edition, correctly installed according to good/best practices, and the database configured correctly (multiplexed/mirrored redo, copies of the controlfiles, … you get the picture).
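As a quick sanity check, this is roughly what I mean. A sketch only; the `+FRA` diskgroup name and group number are examples, not from any particular setup:

```sql
-- Verify redo log multiplexing: every group should have >= 2 members
SELECT group#, member FROM v$logfile ORDER BY group#;

-- Verify there is more than one copy of the controlfile
SELECT name FROM v$controlfile;

-- Add a second member to a redo group that has only one, e.g.:
ALTER DATABASE ADD LOGFILE MEMBER '+FRA' TO GROUP 1;
```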
Oracle EE already has a lot of features on its own, which makes the database pretty available. The following is based on personal favourites, so it might differ slightly from the Oracle vision, but hey … we can discuss that.
Oracle Restart used to be deprecated, but … it is back! And yes, I am a fan. Basically it is Grid Infrastructure, but on just one node. If for some reason (one or another) an Oracle process fails, Grid Infrastructure restarts the failed process. This leads to lower outage time and thus higher availability. You can read more about Oracle Restart in the Oracle documentation.
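A minimal sketch of putting a single instance under Oracle Restart control; the database name `ORCL` and the Oracle home path are just example values:

```shell
# Register the database with Oracle Restart, so the local HA stack
# restarts it (and its listener, ASM, ...) after a failure
srvctl add database -db ORCL \
  -oraclehome /u01/app/oracle/product/19.0.0/dbhome_1
srvctl start database -db ORCL

# Show what the local HA stack is monitoring
crsctl status resource -t
```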
Automatic storage management (ASM)
This goes hand in hand with Grid Infrastructure. ASM is a kind of volume manager. One of its benefits is that ASM can provide mirroring and striping, which means that in case of a disk failure the data is rebalanced across the remaining disks, and you have more time to replace the disk. This also leads to higher availability.
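A sketch of what that mirroring looks like, assuming two failure groups; the diskgroup name and disk paths are examples:

```sql
-- NORMAL redundancy keeps two copies of each extent, so losing one
-- disk triggers a rebalance instead of an outage
CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '/dev/oracleasm/disks/DATA1'
  FAILGROUP fg2 DISK '/dev/oracleasm/disks/DATA2';

-- After swapping a failed disk, ASM rebalances the data again
ALTER DISKGROUP data REBALANCE POWER 4;
```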
Fast recovery area / Flashback
Unfortunately I still see databases with flashback not turned on. I agree, you need to do some homework to size the fast recovery area (FRA) properly, but once you have done that, you can recover from a lot of errors. From the Oracle website, a small list:
- Flashback Database
- Flashback Table
- Flashback Drop
- Flashback Transaction
- Flashback Transaction Query
- Flashback Query
- Flashback Versions Query
- Total Recall
I like to compare this with going back in time and undoing the past.
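To give an idea, enabling it looks like this. The size and diskgroup name are example values, and the FRA sizing homework still comes first:

```sql
-- Size and point the fast recovery area, then enable flashback
ALTER SYSTEM SET db_recovery_file_dest_size = 100G;
ALTER SYSTEM SET db_recovery_file_dest = '+FRA';
ALTER DATABASE FLASHBACK ON;

-- "Undo the past": e.g. restore a dropped table from the recycle bin
FLASHBACK TABLE employees TO BEFORE DROP;

-- Or rewind the whole database (run from the MOUNT state):
-- FLASHBACK DATABASE TO TIMESTAMP (SYSTIMESTAMP - INTERVAL '15' MINUTE);
```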
It sounds simple, but RMAN has a very efficient way of backing up the database.
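A minimal example of such a backup, relying on the FRA as the backup destination; the retention window of 7 days is an example value:

```shell
rman target / <<'EOF'
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
CONFIGURE CONTROLFILE AUTOBACKUP ON;
BACKUP DATABASE PLUS ARCHIVELOG;
DELETE NOPROMPT OBSOLETE;
EOF
```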
I take only a few small events to compare the catalog tiers, but you get the idea.
This is also just an example, but it can easily be applied to the Oracle cloud as well, which makes it easy to move towards it.
The basic, Bronze, tier. This is a very simple instance, based on the features above. This way there is "kind of" high availability, but in the cheapest way possible. This implies a simple single-instance database with a backup taken locally and replicated to a failover data center or the cloud.
| Event | RTO | RPO |
|---|---|---|
| Site outage | hours to days | since last backup |
| DB upgrades | minutes to hours | zero |
The Silver tier includes all of the Bronze features and adds more.
This is also a service level where two options are possible:
- Oracle RAC, with a local backup and a replicated backup to a failover data center.
- An Oracle Data Guard protected DB in the local data center, with a local backup and the replicated backup to the failover data center.
The choice between RAC and Data Guard depends on the brownout requirements of the application: how much brownout time can we afford?
When you choose RAC and an instance faces an outage, the failover to a surviving instance is immediate. This keeps the brownout to a minimum before the service resumes. Regarding patching the system, this option has the advantage that you can patch in a rolling fashion, which again reduces downtime.
The Data Guard option does not have the same transparency for hardware outages; however, it protects the database against other failures, like data corruptions and human error (a Data Guard standby can lag behind intentionally). Another advantage is that a Data Guard physical standby is included in the EE license.
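That intentional lag can be set with the DELAY attribute of the redo transport. The destination number, service name and 60-minute delay are example values; note that a standby using real-time apply ignores the delay, so the standby must run delayed (archived-log) apply for this to work:

```sql
-- On the primary: ship redo to the standby, but let the standby
-- wait 60 minutes before applying it, as a safety net for human error
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=stby ASYNC DELAY=60
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stby';
```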
| Event | RTO (RAC) | RTO (Data Guard) | RPO (RAC) | RPO (Data Guard) |
|---|---|---|---|---|
| Site outage | hours to days | hours to days | since last backup | since last backup |
| DB upgrades | minutes to hours | seconds | zero | zero |
The Gold tier includes all of the Silver features and adds more.
When choosing Gold, the primary database is a clustered database. Oracle's reference architecture is a clustered database on the primary site with a local backup; that primary database has a standby database on the failover site, with its own backup.
With such a setup, you can opt for a far sync instance and/or Active Data Guard to minimise downtime or risk even further.
Two arguments for using Active Data Guard instead of a normal physical standby: you can offload read-only queries to the standby database, and you minimise the risk of block corruptions. When you hit a block corruption on the primary database, Active Data Guard fetches the good block from the standby and tries to repair the primary automatically.
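Active Data Guard boils down to opening the standby read-only while redo apply keeps running; a sketch of the standby-side commands:

```sql
-- Stop redo apply, open the standby read-only, restart redo apply
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE OPEN READ ONLY;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  DISCONNECT FROM SESSION USING CURRENT LOGFILE;

-- Read-only reporting can now run here, while automatic block repair
-- can serve good blocks to a corrupted primary
```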
| Event | RTO | RPO |
|---|---|---|
| Site outage | seconds to minutes | zero to seconds |
| DB upgrades | seconds to minutes | zero |
The Platinum tier includes all of the Gold features and adds more.
I slightly disagree with the Oracle reference architecture on this one, but the idea remains the same. Oracle proposes to put a second clustered database, protected with Active Data Guard, in the same data center, and to replicate the environment with Active Data Guard towards the failover data center.
IMHO, and for most of our customers, the requirements are "low" enough to have one RAC cluster on the primary site and another RAC cluster on the failover site, of course protected by (Active) Data Guard.
Other features worth investigating at this level of availability are: Application Continuity, EBR (edition-based redefinition), GoldenGate, …
This way, by carefully combining technology and taking care of the design, you see that a lot of availability can be reached.
As always: questions, remarks? Find me on twitter @vanpupi