NC_SHARE: call sync after fill operations #107

Open

adammoody wants to merge 1 commit into master
Conversation

adammoody (Contributor) commented Jul 10, 2023

While running PnetCDF tests on a file system that requires stricter adherence to the MPI-I/O consistency semantics (UnifyFS), I found that several test cases involving fill operations were not calling MPI_File_sync after the writes associated with the fill calls. This led to data inconsistency when other ranks later wrote different data to the same region of the file. This was true even after modifying the test to create the file with NC_SHARE set.

One example shows up when running test/testcases/flexible.c with 4 ranks on 1 node due to the fill calls here:

/* fill 2 records with default fill values */
err = ncmpi_fill_var_rec(ncid, varid1, 0); CHECK_ERR
err = ncmpi_fill_var_rec(ncid, varid1, 1); CHECK_ERR
err = ncmpi_fill_var_rec(ncid, varid2, 0); CHECK_ERR
err = ncmpi_fill_var_rec(ncid, varid2, 1); CHECK_ERR
err = ncmpi_fill_var_rec(ncid, varid3, 0); CHECK_ERR
err = ncmpi_fill_var_rec(ncid, varid3, 1); CHECK_ERR

Adding NC_SHARE to the cmode argument of this ncmpi_create call does not help:

err = ncmpi_create(MPI_COMM_WORLD, filename, NC_CLOBBER, MPI_INFO_NULL,
                   &ncid); CHECK_ERR

To get this test to pass, I had to add a call to ncmpi_sync after the fill calls.
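
For reference, the test-side workaround is roughly the following (a sketch; the exact placement of the call in flexible.c may differ):

/* added after the last ncmpi_fill_var_rec call above */
err = ncmpi_sync(ncid); CHECK_ERR  /* flush the fill writes before other ranks overwrite that region */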

The intent with this PR is to call MPI_File_sync/MPI_Barrier after writing data during a fill call if the NC_SHARE mode bit was set when creating the file.
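
As a rough illustration of that intent (a sketch only; the variable names below are hypothetical, not the actual PnetCDF internals or the exact patch), the write issued by a fill operation would be followed by a sync and barrier when NC_SHARE is set:

/* hypothetical sketch, not the actual patch */
MPI_File_write_at_all(fh, offset, buf, count, MPI_BYTE, &status);  /* write the fill values */
if (cmode & NC_SHARE) {            /* NC_SHARE was given at create/open time */
    MPI_File_sync(fh);             /* make the fill data visible to all ranks */
    MPI_Barrier(comm);             /* no rank proceeds until the data is visible */
}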

I wanted to open the PR for discussion.

Are you open to a change like this?

If so, are there other locations where MPI_File_sync should be called during fills?

adammoody (Contributor, Author) commented Jul 10, 2023

I can confirm that this PR allows the test/testcases/flexible.c test to pass if one adds NC_SHARE to the create call.

Without that, the test produces errors like the following:

+ srun --overlap -n 4 -N 1 ./flexible /unifyfs/testfile.nc
Error at flexible.c line 201: expect buf[0][0]=11 but got -2147483647
Error at flexible.c line 201: expect buf[0][1]=11 but got -2147483647
Error at flexible.c line 201: expect buf[0][2]=11 but got -2147483647
Error at flexible.c line 201: expect buf[0][3]=11 but got -2147483647
Error at flexible.c line 201: expect buf[0][4]=11 but got -2147483647
Error at flexible.c line 201: expect buf[1][0]=10 but got -2147483647
Error at flexible.c line 201: expect buf[1][1]=10 but got -2147483647
Error at flexible.c line 201: expect buf[1][2]=10 but got -2147483647
Error at flexible.c line 201: expect buf[1][3]=10 but got -2147483647
Error at flexible.c line 201: expect buf[1][4]=10 but got -2147483647
Error at line 209 in flexible.c: (NC_ERANGE)
Error at flexible.c line 217: expect buf[0][0]=11 but got -2147483647
Error at flexible.c line 217: expect buf[0][1]=11 but got -2147483647
Error at flexible.c line 217: expect buf[0][2]=11 but got -2147483647
Error at flexible.c line 217: expect buf[0][3]=11 but got -2147483647
Error at flexible.c line 217: expect buf[0][4]=11 but got -2147483647
Error at flexible.c line 217: expect buf[1][0]=10 but got -2147483647
Error at flexible.c line 217: expect buf[1][1]=10 but got -2147483647
Error at flexible.c line 217: expect buf[1][2]=10 but got -2147483647
Error at flexible.c line 217: expect buf[1][3]=10 but got -2147483647
Error at flexible.c line 217: expect buf[1][4]=10 but got -2147483647
Error at line 246 in flexible.c: (NC_ERANGE)
Error at line 275 in flexible.c: (NC_ERANGE)
Error at line 276 in flexible.c: (NC_ERANGE)
*** TESTING C   flexible for flexible put and get                  ------ fail with 72 mismatches

wkliao (Member) commented Jul 11, 2023

To get this test to pass, I had to add a call to ncmpi_sync after the fill calls.

I assume that adding a single ncmpi_sync call after all of the ncmpi_fill_var_rec calls is enough for flexible.c to pass on UnifyFS, rather than one after each ncmpi_fill_var_rec. Is that the case?

adammoody (Contributor, Author) commented Jul 11, 2023

To get this test to pass, I had to add a call to ncmpi_sync after the fill calls.

I assume adding one ncmpi_sync call after all ncmpi_fill_var_rec calls will pass flexible.c on UnifyFS, instead of one for each ncmpi_fill_var_rec. Is this the case?

Thanks @wkliao. Yes, that's right. A single ncmpi_sync call after all the fill calls in this particular test case is sufficient.

There are a number of test cases like this that I've come across so far; see, e.g., e47759a

wkliao (Member) commented Jul 11, 2023

Thanks. That is very useful information.

PnetCDF is built on top of MPI-IO, so it follows the MPI-IO consistency semantics.
When using a non-POSIX or non-UNIX-compliant file system, users are required
to call MPI_File_sync, MPI_Barrier, and MPI_File_sync (in that order) after each
I/O call. Thus, for PnetCDF users, I suggest adding ncmpi_sync, MPI_Barrier,
and ncmpi_sync after the last call to ncmpi_fill_var_rec.
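
In a test program, that sequence looks like the following (the same three-call fragment shown again later in this thread):

    ncmpi_sync(ncid);             /* flush this rank's writes */
    MPI_Barrier(MPI_COMM_WORLD);  /* wait until every rank has flushed */
    ncmpi_sync(ncid);             /* ensure subsequent reads see data written by other ranks */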

Your patch will affect all applications that are using POSIX-compliant file
systems, and ncmpi_sync is a very expensive call. I wonder if there is an
alternative solution, or one from the UnifyFS side.

roblatham00 (Contributor) commented:

How common is NC_SHARE? It could be a nice shorthand for "try really hard to keep this in sync across processes"?

Another option would be to consult the romio_visibility_immediate hint. See pmodels/mpich#6148 and pmodels/mpich#5902 for some recent Unify-inspired enhancements to ROMIO.

ROMIO is not the only MPI-IO implementation, so it does feel wrong to have PnetCDF rely on a non-standard hint like this.

@adammoody where did we leave https://github.com/roblatham00/mpich/tree/dev-unify? To wei-keng's point, trying to use a POSIX-assuming ROMIO driver on a non-POSIX storage system is going to get you into trouble, but the goal with the Unify ROMIO driver was that Unify can (cheaply) sync and flush if needed, when needed.

adammoody (Contributor, Author) commented Jul 11, 2023

I figured the change in this PR aligns well with the PnetCDF data consistency doc:

https://github.com/Parallel-NetCDF/PnetCDF/blob/master/doc/README.consistency.md#note-on-parallel-io-data-consistency

If users would like PnetCDF to enforce a stronger consistency, they should add NC_SHARE flag when open/create the file. By doing so, PnetCDF adds MPI_File_sync() after each MPI I/O calls.

  • For PnetCDF collective APIs, an MPI_Barrier() will also be called right after MPI_File_sync().
  • For independent APIs, there is no need for calling MPI_Barrier(). Users are warned that the I/O performance when using NC_SHARE flag could become significantly slower than not using it.

It seems like most calls that lead to an MPI_File_write are followed by MPI_File_sync when NC_SHARE is set, so I wondered whether the fill calls could be included in that set.

adammoody (Contributor, Author) commented Jul 11, 2023

By the way, I think it would help to add an empty line on that page to fix the GitHub markdown display:

diff --git a/doc/README.consistency.md b/doc/README.consistency.md
index cdbf42f8..56a44587 100644
--- a/doc/README.consistency.md
+++ b/doc/README.consistency.md
@@ -15,6 +15,7 @@ MPI_File_sync() after each MPI I/O calls.
   * For PnetCDF collective APIs, an MPI_Barrier() will also be called right
     after MPI_File_sync().
   * For independent APIs, there is no need for calling MPI_Barrier().
+
 Users are warned that the I/O performance when using NC_SHARE flag could become
 significantly slower than not using it.
 

As it is, the "Users are warned" statement gets attached to the second bullet rather than standing on its own.

adammoody (Contributor, Author) commented:

@adammoody where did we leave https://github.com/roblatham00/mpich/tree/dev-unify? To wei-keng's point, trying to use a POSIX-assuming ROMIO driver on a non-POSIX storage system is going to get you into trouble, but the goal with the Unify ROMIO driver was that Unify can (cheaply) sync and flush if needed, when needed.

Thanks @roblatham00. Right, the MPICH dev-unify branch would also resolve this, though I haven't tried it yet. I have been testing against our default system MPI libraries to get a baseline. For that, I'm taking extra steps like disabling ROMIO data sieving and other features that are known to be problematic for UnifyFS.

wkliao (Member) commented Jul 11, 2023

NC_SHARE is a flag inherited from NetCDF. NetCDF before version 4 supported
only serial I/O, and this flag was designed for the situation where two or
more (serial) applications access the same file. When PnetCDF added the
parallel I/O feature, we chose to follow the MPI-IO consistency semantics,
which are defined only within a single MPI application. Therefore, NC_SHARE is
no longer valid in the MPI-IO realm, and keeping it can be misleading.

My plan is to retire this flag in the future and leave it to users to enforce
the desired consistency, e.g., by calling sync-barrier-sync or by setting an
MPI-IO hint to turn a feature on/off underneath PnetCDF.

Requiring PnetCDF and UnifyFS users to always set NC_SHARE (and hence call
sync and barrier in every write operation) is not an ideal solution, especially
when the application's processes do not overwrite each other's data after the
fill. A better solution would be for the application program itself to add a
sync-barrier-sync after all the fill calls, so that the remaining writes can
still enjoy high performance.

wkliao (Member) commented Jul 13, 2023

Hi, @adammoody
Can you please provide a complete list of test programs that failed?
I plan to add a sync-barrier-sync after the fill calls.

adammoody (Contributor, Author) commented:

Hi, @adammoody
Can you please provide a complete list of test programs that failed?
I plan to add a sync-barrier-sync after the fill calls.

Sure. I'm still working through the different test cases. I'll let you know which tests I had to modify when I'm done.

adammoody (Contributor, Author) commented:

Here is the latest list of changes to the test cases:

master...adammoody:PnetCDF:unifyfs

In some tests, I may have added more ncmpi_sync() calls than strictly necessary. Also, my branch mixes in some unrelated changes that I was working on, so just consider the changes involving ncmpi_sync().

@wangvsa is using Recorder and VerifyIO to develop a more comprehensive list of test cases.

wkliao (Member) commented Feb 11, 2024

PR #120 has been created to add the following code fragment to the test programs
reported to fail on UnifyFS.

    ncmpi_sync(ncid);
    MPI_Barrier(MPI_COMM_WORLD);
    ncmpi_sync(ncid);

Please give it a try and let me know if you find more test programs that need a similar fix.

FYI, the git commands to fetch PR #120 are:

git fetch origin pull/120/head:unifyfs
git checkout unifyfs

wkliao force-pushed the master branch 2 times, most recently from 051cdc1 to f1db1d6 on May 23, 2024