Monday 5 October 2015

SQL Server Checkpoint Mystery

SQL Server Checkpoint Mystery
Checkpoint is the hidden Mystery in SQL Server,Thanks to Paul Randal for very good explanation.

Whenever a checkpoint occurs either auto/manual/indirect/direct following set of operations are occurred for the database:

1. All dirty data file pages for the database are written to disk (all pages that have changed in memory since they were read from disk or since the last checkpoint), regardless of the state of the transaction that made the change.
2. Before a page is written to disk, all log records up to and including the most recent log record describing a change to that page are written to disk (yes, log records can be cached in memory too). This guarantees recovery can work and is called write-ahead logging. Log records are written to the log sequentially, and log records from multiple transactions will be interspersed in the log. The log cannot be selectively written to disk, so writing a dirty page to disk that only has a single log record affecting it may mean writing many more previous log records to disk as well.
3. Log records describing the checkpoint are generated.
4. The LSN of the checkpoint is recorded in the database boot page in the dbi_checkptLSN field.

If in the SIMPLE recovery model, the VLFs in the log are checked to see whether they can be marked inactive (called clearing or truncating the log – both of which are terrible misnomers, as nothing is either physically cleared or truncated).

Checkpoints are not really tracked in the transaction log – it just serves as a useful repository for information about which transactions are active at the time of the checkpoint. The LSN of the last checkpoint is recorded in the database boot page. This is where recovery starts, and if this page is inaccessible, the database cannot be attached, opened, or processed in any way – partly because it’s the boot page that knows whether the database was cleanly shut down or not, and partly because it’s the only place that records the LSN of the last checkpoint record. You may say, well it’s recorded in the transaction log too, but what if the log is corrupt in some way?
One area of confusion I’ve seen is that the checkpoint log records are overwritten by subsequent checkpoints. Absolutely not – once written, a log record is NEVER updated or overwritten – it will only be overwritten when the log wraps and the VLFs are re-used.. This has led to further confusion about when checkpoint information is retrievable from the log, using commands such as fn_dblog.

Now What happens if a checkpoint starts but doesn’t finish before a crash occurs? Will that checkpoint be used for crash recovery?
The answer is no, it won’t. The purpose of a checkpoint is to bring the pages in the data files up-to-date with what’s in the transaction log. When a checkpoint ends, there’s a guarantee that as of the LSN of the LOP_BEGIN_CKPT log record, all changes from log records before that point are persisted in the data files on disk. There’s no guarantee about logged changes after that point, only before it. In other words, all the log records before the LSN of the LOP_BEGIN_CKPT log record are no longer required for crash recovery, unless there’s a long running transaction that started before that LSN.

When the checkpoint ends, the boot page of the database (page 9 in file 1) is updated with the beginning LSN of the checkpoint (and then if in the SIMPLE recovery mode, any log clearing/truncating can occur).

So if a checkpoint started but didn’t end before a crash, it’s LSN would not be in the boot page and so crash recovery would start from the previous checkpoint. This is good, because an incomplete checkpoint means there’s no guarantee about which logged changes are persisted in the data files, and so crash recovery wouldn’t be able to work correctly from only starting at the beginning of the incomplete checkpoint.

How does SQL Server guarantee that there’s always one complete checkpoint in the active portion of the log, in case a crash occurs?
The answer is that log clearing/truncation of a VLF containing an LOP_BEGIN_CKPT log record cannot happen until another complete checkpoint has occurred. In other words, a complete checkpoint has to occur since the last log clearing/truncation before the next one can happen. If a checkpoint hasn’t occurred, the log_reuse_wait_desc for the database in sys.databases will return CHECKPOINT. It’s not common to see this occur, but you might see it if there’s a very long running checkpoint (e.g. a very large update on a system with a slow I/O subsystem so the flushing of data file pages takes a long time) and very frequent log backups, so two log backups occur over the time taken by a single checkpoint operation. It could also happen if you’ve messed with the sp_configure recovery interval and set it higher than the default.

How Checkpoint works for tempDB database?
A checkpoint is only done for tempdb when the tempdb log file reaches 70% full – this is to prevent the tempdb log from growing if at all possible (note that a long-running transaction can still essentially hold the log hostage and prevent it from clearing, just like in a user database).
You know very well, when a checkpoint occurs for a user database, all dirty pages for that database are flushed to disk (as well as other operations). This does not happen for tempdb. Tempdb is not recovered in the event of a crash, and so there is no need to force dirty tempdb pages to disk, except in the case where the lazywriter process (part of the buffer pool) has to make space for pages from other databases. Of course, when you issue a *manual* CHECKPOINT, all the dirty pages are flushed, but for automatic checkpoints they’re not. You can easily prove this to yourself with a large transaction inserting into a user table on tempdb, and watch the Databases/Log File(s) Used Size (KB) and Databases/Log Truncations for tempdb, when you see them go up, check to see if the number of dirty pages in tempdb has reduced – it won’t have.


1 comment: