Skip to content
This repository has been archived by the owner on Mar 9, 2019. It is now read-only.

ext3/ext4 is not fully POSIX, to be safe there, need to fsync after file size changes #284

Closed
tv42 opened this issue Jan 16, 2015 · 5 comments · Fixed by #286
Closed

ext3/ext4 is not fully POSIX, to be safe there, need to fsync after file size changes #284

tv42 opened this issue Jan 16, 2015 · 5 comments · Fixed by #286

Comments

@tv42
Copy link
Contributor

tv42 commented Jan 16, 2015

@cespare pointed out this conversation to me: https://www.openldap.org/lists/openldap-devel/201411/msg00000.html (I've seen the talk before, but didn't pay enough attention).

Reading https://linux.die.net/man/2/fdatasync says (my emphasis) "fdatasync() [...] does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. [...] On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush."

Quoting the paper (my emphasis): "On ext3 with the default “ordered” journaling mode, the file data is forced directly out to the main file system prior to its metadata being committed to the journal. This is why we observe the journaling of the length update (op#399, 400, and 402) after the file data updates(op#342–398)." "[...] means fdatasync on ext3 does not wait for the completion of journaling (similar behavior has been observed on ext4)."

Currently, bolt seems to rely on individual page writes increasing the st_size of the file; there's no file size change where it actually notices it needs more space:

bolt/db.go

Line 554 in 15a58b0

// Resize mmap() if we're at the end.

So, my takeaway from this is, ext3, and likely ext4, have a bug / "design tradeoff". If bolt wants to be pragmatic it should probably accommodate for that bug.

I think it would be enough to do something like this:

    if minsz >= db.datasz {
        if err := db.file.Truncate(minsz); err != nil {
            return nil, fmt.Errorf("file resize error: %s", err)
        }
        if err := db.file.Sync(); err != nil {
            return nil, fmt.Errorf("file sync error: %s", err)
        }
        if err := db.mmap(minsz); err != nil {
            return nil, fmt.Errorf("mmap allocate error: %s", err)
        }
    }

Doing the resize once, and not page-by-page when writing out dirty pages, might even be more efficient.

@benbjohnson
Copy link
Member

Awesome, thanks for the fix! I moved the code over to bolt_unix.go#mmap(). That way it occurs after the max mmap size check and, since it's in the Unix-specific file, I don't have to re-test on Windows. :)

#286

@benbjohnson
Copy link
Member

@tv42 Is the Truncate() necessary in this change? Shouldn't the fsync() be enough to flush the metadata? I'm seeing really long wait times on Truncate() sometimes on XFS.

@tv42
Copy link
Contributor Author

tv42 commented May 1, 2015

@benbjohnson The truncate is the part that resizes the file; if you skip that, the file remains small until you write to it, and since data is only synced with fdatasync, that doesn't need to write the new size -> after a crash, the part that was after the old size of the file may be lost. The delay you're seeing is probably xfs actually allocating space for the file, so it can later guarantee writes to it will be "easy".

If you remove either the truncate or the sync, it's as if this fix never went in.

@benbjohnson
Copy link
Member

@tv42 Thanks, my question was dumb now that I read it again. :)

I'm going to add a DB.NoTruncate flag to skip truncation. It'll be false by default but turning it on will avoid the truncation step. I'm still not able to figure out why xfs is hanging for me. It's taking minutes with no events out of strace or any io stats tools. The flag will also allow bolt to not allocate space up front when it doesn't need to.

@benbjohnson
Copy link
Member

/cc @mkobetic @bouk

benbjohnson added a commit to benbjohnson/bolt that referenced this issue May 4, 2015
This commit adds the DB.NoTruncate flag to optionally revert mmap()
calls to how they were implemented before the ext3/ext4 fix. When
NoTruncate is true, remapping the data file will not force the file
system to resize it immediately. This works for non-ext3/4 file
systems.

The default value of NoTruncate is false so it is still safe for
ext3/ext4 file systems by default.

See also: boltdb#284
benbjohnson added a commit to benbjohnson/bolt that referenced this issue May 5, 2015
This commit adds the DB.NoTruncate flag to optionally revert mmap()
calls to how they were implemented before the ext3/ext4 fix. When
NoTruncate is true, remapping the data file will not force the file
system to resize it immediately. This works for non-ext3/4 file
systems.

The default value of NoTruncate is false so it is still safe for
ext3/ext4 file systems by default.

See also: boltdb#284
benbjohnson added a commit to benbjohnson/bolt that referenced this issue May 5, 2015
This commit adds the DB.NoTruncate flag to optionally revert mmap()
calls to how they were implemented before the ext3/ext4 fix. When
NoTruncate is true, remapping the data file will not force the file
system to resize it immediately. This works for non-ext3/4 file
systems.

The default value of NoTruncate is false so it is still safe for
ext3/ext4 file systems by default.

See also: boltdb#284
benbjohnson added a commit to benbjohnson/bolt that referenced this issue May 5, 2015
This commit adds the DB.NoTruncate flag to optionally revert mmap()
calls to how they were implemented before the ext3/ext4 fix. When
NoTruncate is true, remapping the data file will not force the file
system to resize it immediately. This works for non-ext3/4 file
systems.

The default value of NoTruncate is false so it is still safe for
ext3/ext4 file systems by default.

See also: boltdb#284
benbjohnson added a commit to benbjohnson/bolt that referenced this issue May 5, 2015
This commit adds the DB.NoTruncate flag to optionally revert mmap()
calls to how they were implemented before the ext3/ext4 fix. When
NoTruncate is true, remapping the data file will not force the file
system to resize it immediately. This works for non-ext3/4 file
systems.

The default value of NoTruncate is false so it is still safe for
ext3/ext4 file systems by default.

See also: boltdb#284
benbjohnson added a commit to benbjohnson/bolt that referenced this issue May 6, 2015
This commit adds the DB.NoGrowSync flag to optionally revert mmap()
calls to how they were implemented before the ext3/ext4 fix. When
NoGrowSync is true, remapping the data file will not force the file
system to resize it immediately. This works for non-ext3/4 file
systems.

The default value of NoGrowSync is false so it is still safe for
ext3/ext4 file systems by default.

See also: boltdb#284
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants