560 likes | 815 Views
Overview of high availability in Microsoft SQL Server. Szymon Wójcik. Agenda. Introduction What is availability? What is and why to choose high availability? Key factors to consider for high availability scenario High availability techniques in Microsoft SQL Server Replication
E N D
Overview of high availability in Microsoft SQL Server Szymon Wójcik
Agenda • Introduction • What is availability? • What is and why to choose high availability? • Key factors to consider for high availability scenario • High availability techniques in Microsoft SQL Server • Replication • Log shipping • Mirroring • Failover clustering • Discussion
Agenda • Introduction • What is availability? • What is and why to choose high availability? • Key factors to consider for high availability scenario • High availability techniques in Microsoft SQL Server • Replication • Log shipping • Mirroring • Failover clustering • Discussion
Introduction • Szymon Wójcik • Experience with MS SQL Server since 2000 (dev/admin) • MCITP: DBA SQL Server 2005 • Interests: • Performance tuning • High availability • Blog – sqlphobosq.wordpress.com • Twitter - @phobosq
Agenda • Introduction • What is availability? • What is and why to choose high availability? • Key factors to consider for high availability scenario • High availability techniques in Microsoft SQL Server • Replication • Log shipping • Mirroring • Failover clustering • Discussion
Availability [1/5] • One of the concepts defined within ITIL • Ability to perform its agreed function when required • Determined by: • Reliability – how long (MTBF) • Maintainability – how quickly restored (MTRS) • Serviceability – contract conditions • Performance • Security • Confidentiality • Integrity • Availability
Availability [2/5] • Best practice – measure in %: • Agreed Service Time – defined in SLA (Service Level Agreement) • Downtime – duration of service unavailability during Agreed Service Time • Important when planning/deploying a service to understand availability concept
Availability [3/5] – figures for one week Allowed downtime duration per week [hh:mm:ss format]
Availability [4/5] – figures for one year Allowed downtime duration per year [DD.hh:mm:ss format]
Availability [5/5] – important notes • Availability != Uptime (service may be up but unavailable) • Scheduled downtime does not have to cause unavailability (up to definition in SLA)
Agenda • Introduction • What is availability? • What is and why to choose high availability? • Key factors to consider for high availability scenario • High availability techniques in Microsoft SQL Server • Replication • Log shipping • Mirroring • Failover clustering • Discussion
High availability - definition • System design approach and service implementation that ensures a certain level of operational performance (Wikipedia) • Masks the effects of hardware or software failure • Maintains availability of applications so that perceived downtime is minimized (Microsoft)
High availability != disaster recovery • High availability is used for ensuring for meeting Service Level Target for availability • Disaster recovery is ensuring operational continuity • They can be used complementary – HA can minimize the need of invoking DR, but never replace it
Why to choose high availability • For users: • Minimizes downtime probability • Allows to sustain a failure if properly designed • For administrators: • Simplifies migration effort • Minimizes risk of continuity
Agenda • Introduction • What is availability? • What is and why to choose high availability? • Key factors to consider for high availability scenario • High availability techniques in Microsoft SQL Server • Replication • Log shipping • Mirroring • Failover clustering • Discussion
Single point of failure • A whole system is as strong as the weakest link
Hardware redundancy • Introduce additional hardware to minimize risk of failure
Hardware redundancy • Not only whole machines may be multiplicated to become fault tolerant • Also components: • Power supplies • CPUs • Hard disks • Network interface cards • Storage controllers
Standby node • A standby node is a machine in a HA system that takes over in case of primary server failure • Three types: • Cold standby – Unplugged, needs to be prepared before use • Warm standby – Ready to use, but requires manual switch • Hot standby – Ready to use, takes over automatically • Fail over = switching from primary to standby • Fail back = return to primary • There may be more than one standby in HA scenario!
Load balancing vs failover • Load balancing – distributing of workload between several peer servers • If one goes down, others take over • Workload distributed by load balancer • Failover – automatic switch to standby • Standby is not active • Switch initiated upon loss of heartbeat
Other points • High availability requires additional costs – multiple components must be present according to design in order to meet requirements • It may become complex to maintain – additional CIs present in environment that need to be kept up-to-date • Hardware design must be followed by software to fully benefit from HA • KISS – Keep It Simple and Stupid
Agenda • Introduction • What is availability? • What is and why to choose high availability? • Key factors to consider for high availability scenario • High availability techniques in Microsoft SQL Server • Replication • Log shipping • Mirroring • Failover clustering • Discussion
High availability in Microsoft SQL Server • SQL Server, as a RDBMS, provides means for failover scenario • Load balancing is difficult and must be properly designed in order to work • High availability in SQL Server does not prevent logical data corruption – periodic DBCC checks are advised
Replication • Three server roles in replication: • Publisher • Distributor • Subscriber • Three types: • Snapshot • Transactional • Merge • Two subscription methods: • Push – Distributor pushes articles to Subscribers • Pull – Subscribers downloads from Distributor
Possible application of replication • Create a second copy of data to be used in case of emergency (DR) • Create a copy of data to offload the server (load balancing) • Allow offline users to work with data and upload their changes later (high availability)
Replication agents • External programs which are used to implement replication: • Snapshot Agent: • creates snapshots • Log Reader Agent: • Reads transaction log • Marks transactions for replication • Distribution Agent: • Dispatches transactions to Subscriber • Merge Agent: • Downloads remote and uploads local changes • Resolves conflicts in merge replication
Snapshot replication • Publisher makes a copy of a database which is applied at Subscriber • Good for small, static data: • Whole snapshot is applied every time – the changes which appear after snapshot will be applied with next snapshot • Requires sufficient bandwidth
Transactional replication • Starts with a snapshot • Transactions are recorded at Publisher and replayed at Subscriber • May allow for updatable subcriptions • If Subscriber is offline, records are stored at the Distributor
Merge replication • Starts with a snapshot • Merges changes between Publisher and Subscribers • Allows synchronization via HTTPS (since SQL Server 2008) • Allows the most autonomous design – e.g. mobile users, multiple branch offices working on the same data
Replication how-to • Configure Distributor • Configure Publisher: • Select replication type • Select articles to be published • [Optional] Set up article filtering • Set up security • Configure Subscribers: • Connect to Distributor • Select subscription method • Apply snapshot • [Transactional/merge] Synchronize changes
Failover in replication • Stop subscription • Direct all traffic from Publisher to Subscriber: • Change application connection strings • Change DNS aliases, if required, or • Change IP addresses
Failback in replication • After restoring Publisher, restore a copy of database from Subscriber • Direct all traffic from Subscriber to Publisher • Reestablish the replication
Log shipping • Keeps a standby by automating backup, copy and restore process • Three server roles in log shipping: • Primary • Secondary • Monitor
How it works? [1/2] • Restore a full backup from Primary to Secondary and then: • A job runs on Primary which backs up transaction log • Second job copies the log backup to Secondary • Third job on Secondary restores the log after it’s copied • [Optional] Monitor server tracks performance and incidents
Failover in log shipping • Copy transaction log backups from primary to secondary • Backup tail of the log on primary • Restore all backups except tail-log with NORECOVERY • Restore tail-log with RECOVERY • Disable log shipping jobs • Redirect client traffic to secondary
Drawbacks of log shipping • You can’t miss a transaction log backup • Network traffic generated has to be considered • You are always behind on Secondary • Secondary is read-only
Database mirroring • Allows to keep your standby up-to-date • Allows automatic failover • Cost-effective alternative to clustering • Available in Standard Edition (2005 – 2008 R2) • Does not require cluster capable hardware • Might be in implemented when Windows Authentication mode is not possible (using certificates)
Database mirroring modes • High availability (with witness) • Automatic failover • Synchronous transaction commit (principal commits after mirror confirms it’s commit) • High protection (without witness) • Manual failover • Synchronous transaction commit • High performance (without witness) • Manual failover • Asynchronous transaction commit
Manual failover in database mirroring • Can be done with one mouse click in SSMS • Requires client traffic redirection: • Possible within connection string using Failover Partner command
Automatic failover in database mirroring • Initiated automatically by witness if there is no quorum: • If principal is unavailable, fails over to mirror • Does nothing if mirror becomes unavailable • Fails over also if principal is up but unreachable from network! • Requires client traffic redirection: • Possible within connection string using Failover Partner command
Failover clustering • Provides protection on a server level: • Automatic failover in case of server failure • Fails over logins, endpoints and jobs • Combines multiple machines (nodes) in a single virtual server • Requires cluster-capable hardware: • Shared or common storage • Certified server hardware