IP Clearance #128
Comments
@justinmclean Can you help me understand our requirements here a little bit more with a couple examples:
General Questions:
I know we have other files with other license or cases to go through, but this should cover the vast majority and can get us moving in the right direction. |
@justinmclean any thoughts on these examples. I'm trying to be 100% sure I understand what we need to do here to move this forward in a meaningful way. |
|
Note that with a WIP disclaimer none of this actually blocks a release. |
License clearing wiki page (with draft process and tools): https://cwiki.apache.org/confluence/display/NUTTX/License+Clearing This was used in release 9.0.0 and 9.1.0. |
@adamfeuer do you have enough free time to collect the statistics information? My team leader has reserved a dedicated resource, @PeterBee97, to help you improve the tools and generate the report. |
Thanks @xiaoxiang781216 – I should have enough time to do a high-level analysis this week or next, and I could definitely use the help! @PeterBee97 are you able to help me do this? If so, reply here or send me an email (it's on my profile), and we'll work out what to do. 🙂 |
@adamfeuer Hi Adam, sure I'm here to help. BTW I spent some time yesterday on a script that doesn't modify anything yet but only tries to extract information. Hope this helps :) |
@PeterBee97 Great work with the script and database! I'll update my tools branch and post it here– would you be willing to do a PR to that, so we can have a single branch that we're working on? I'm hoping we can merge these tools to master so that others can help us or continue our work. Here's a few questions:
|
@PeterBee97 I updated my license-clearing tools branch to upstream/master, here's where I've put my tools: https://github.com/starcat-io/incubator-nuttx/tree/feature/license-clearing-tools/tools/license-clearing |
@PeterBee97 Let's try running the process that we did on the You can see what we did on |
By typing sched/ in the DB Browser filter I can see that these files either have apache license already or only owe copyrights to Greg or Xiaomi & Pinecone, which should have already approved the license change. The csv files are uploaded && PR created. https://github.com/PeterBee97/incubator-nuttx/tree/feature/license-clearing-tools/tools/license-clearing |
@PeterBee97 Cool, thanks– I didn't realize the script already used git to find the authors, sorry for missing that. We will need all the authors, not just the top 3. I'll take a closer look tomorrow. Re: Xiaomi and Pinecone already approving the license change, do you know if they have filed an Apache Software Grant Agreement (SGA)? Would you be willing to run your tool on fs and mm directories, and see if you can extract a report of the authors for each section and file? That way we can see if we're dealing with 10 authors, 100 authors, etc. I think another next step is to get you an account on the NuttX Fossology instance. At some point we'll need to get the data into there. I'll email Brennan and you on the list. Thanks again for being willing to help with this! |
Top 3 was my idea, given that some one-commit contributors can be ignored (can't they?). For the license issue I don't know the exact details; @xiaoxiang781216 knows better. I already ran the tool on the whole project, so those two directories can just be filtered. I'll try to get a report for particular files. |
@PeterBee97 Cool, thanks– I didn't
realize the script already used git to find the authors, sorry for
missing that. We will need all the authors, not just the top 3. I'll
take a closer look tomorrow.
I mentioned this before, but it bears repeating. The NuttX project was 13 years old in February of 2020. For the first 6 to 6 and a half years, the project used CVS and SVN. You will find no authorship or contact information for the first half of the project's life in the current GIT authors. The log will show me as the sole author during that time.
I did by far most of the changes in those days, but not all. Prior to GIT, contributors were noted only in commit comments. It should be possible to get the names, or in most cases just user handles, from the comments, but with no contact information.
Github apparently does not even know how to parse that early activity. If you look at
https://github.com/apache/incubator-nuttx/graphs/contributors you would conclude that the project has only existed since sometime in 2013. The project was actually created in February of 2007. This is clearer in the Bitbucket statistics[1]: https://bitbucket.org/nuttx/nuttx/addon/bitbucket-graphs/graphs-repo-page#!graph=contributors&uuid=4430abf9-a782-49ff-bd16-bc1df696048e&type=c&group=weeks which goes all the way back to the day the project was created.
I think that is because prior to GIT, authors were NOT referenced by email address, but rather with some UUID.
[1]Note you have to be logged into Bitbucket to see the statistics there.
|
@patacongo Are the original CVS and SVN archives saved anywhere? |
No |
@patacongo Ok. I'll see if I can look through the commit message to see if I can see what's going on there. I'm logged in to Bitbucket, but for some reason I can't view the graph link you posted. Maybe it's a permissions issue or I don't have access to the graphs addon? |
@PeterBee97 can we add a column in the database to indicate whether the source code existed before git was used? @patacongo, we need to gather the statistics first, convert the unambiguous code base automatically (of course we need to review the PR carefully), and then work on the rest case by case; otherwise NuttX can never become a top-level project. |
@xiaoxiang781216 @patacongo @PeterBee97 I cloned the Bitbucket repo last night (https://bitbucket.org/nuttx/nuttx/src/master/), looked through the commit logs, and I can see what @patacongo is talking about. I didn't compare to the github log, but we should probably also do that. Then we can see if we can do anything with the information there. It seems like we should be able to come up with a strategy for dealing with this:
Let me know if you have other thoughts about this. @PeterBee97 Will you clone the Bitbucket repo and look at the logs to see if you have some insight about it? |
This is also informative: git log | grep author This will produce over 30 thousand lines, but you can clearly see that the last several thousand commits have author: patacongo patacongo@42af7a65-404d-4744-a932-0658087f49c3 That, I think, is a bogus email that was created when the SVN repository was converted to GIT. Then there are several thousand with author: Gregory Nutt [email protected] That is GIT, but from when I was still using GIT as though it were SVN with no authors. The first author that is not me appears at:
So it appears that there is no authorship information for the first 8 years. Only for the last 5 years. |
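The author breakdown described in this comment can be approximated with a short script. This is a minimal sketch, assuming standard `git log` output in which each commit carries an `Author: Name <email>` line; the function names are hypothetical, and the UUID check is only a heuristic for identities that were synthesized during the SVN-to-git conversion.

```python
import re
from collections import Counter

# Standard `git log` prints one "Author: Name <email>" line per commit.
AUTHOR_RE = re.compile(r"^Author:\s*(.*?)\s*<([^>]*)>\s*$", re.MULTILINE)

# Heuristic (assumption): converted SVN identities use a repository UUID
# in place of a mail domain, e.g. user@42af7a65-404d-4744-a932-0658087f49c3.
SVN_UUID_RE = re.compile(r"[^@]+@[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def count_authors(log_text):
    """Count commits per (name, email) pair found in `git log` output."""
    return Counter(AUTHOR_RE.findall(log_text))


def looks_like_svn_uuid(email):
    """Return True if the email looks like a converted SVN identity."""
    return SVN_UUID_RE.fullmatch(email) is not None


def split_by_era(counts):
    """Split commit counts into (pre_git, git) totals using the UUID heuristic."""
    pre_git = sum(n for (_, email), n in counts.items() if looks_like_svn_uuid(email))
    return pre_git, sum(counts.values()) - pre_git
```

Feeding it the output of `git log` for the whole repository gives a rough count of how many commits carry usable author identities and how many only carry a converted SVN handle.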
@patacongo @PeterBee97 If do
There are others. They seem to indicate patches or other code from contributors, committed by Greg. |
@patacongo Thanks for pointing this out again, I am sorry I didn't remember this. |
David Hewson I know. We are connected on LinkedIn. He just started working for HPE. He did some of the LPC31 port in the 2010 timeframe but has not been involved significantly since. |
"by" or "from" would both be good search keys. I also recorded the authors in the old ChangeLog files that were recently removed from the repositories because they are not used in the current workflow. That should be a complete list of authors except for a few trivial things like typo fixes that weren't normally included in the ChangeLog. |
I cloned the Bitbucket repo today, but the git log seems to be the same as the one on GitHub... So I found the latest ChangeLog from NuttX 9.0.0 RC0, tried to filter out the names with the keywords from|by and the help of an NLP library, and put the results in names-changelog.txt. I also processed the git log in the same way; the result is names-gitlog.txt. Still, the commit messages of earlier SVN commits are incomplete and many commits are authorless. This may help cover some corner cases. Maybe we can open an issue and mention these users? But before that, let's filter out the "safe" files first, as @xiaoxiang781216 suggests. |
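The from|by keyword filter mentioned here can be sketched with a regular expression alone. This is a rough illustration, not the script that produced names-changelog.txt: the pattern, the function name, and the rule that a name is one to three capitalized words are all assumptions, and lowercase user handles are deliberately missed.

```python
import re
from collections import Counter

# Keyword "from" or "by" (any case), followed by one to three capitalized
# words on the same line. Purely heuristic: it will miss lowercase handles
# and can produce false positives such as "by The".
ATTRIBUTION_RE = re.compile(
    r"\b(?i:from|by)[ \t]+([A-Z][\w.'-]*(?:[ \t]+[A-Z][\w.'-]*){0,2})"
)


def extract_attributions(text):
    """Count candidate contributor names in ChangeLog or commit-message text."""
    names = Counter()
    for match in ATTRIBUTION_RE.finditer(text):
        # Drop sentence punctuation that trails the last word of the name.
        names[match.group(1).rstrip(".,;:")] += 1
    return names
```

The resulting counter is only a list of candidates; each name still needs a human check before it is treated as a contributor.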
Any updates here? I think this is the only blocker issue preventing us
from graduating, so let's try to make progress.
It seems to me that there are people who have interest and good ideas,
but there is not significant progress being made. The job is really too
large for a couple of people to accomplish working now and then.
|
Could we start with the easy cases? I feel that reducing the size of the problem also makes it less intimidating to approach. What confuses me though is that we're worrying about git authors whereas I believe that if someone contributes a file without listing themselves as the authors in the header (for the BSD case), didn't the author concede rights over the code by doing so? At least that was my understanding at the time when I submitted patches to existing files and I did not include an extra line to add me as author to every affected file. In case this is not the correct assumption, I agree that a "best effort" approach (by comparing git author to authors on header) is the only remaining possibility. |
Hi
What confuses me though is that we're worrying about git authors whereas I believe that if someone contributes a file without listing themselves as the authors in the header (for the BSD case), didn't the author concede rights over the code by doing so?
Without an ICLA (or an equivalent) this is not the case. Copyright automatically applies. They may not even own rights to the code they commit if their employment contract says otherwise.
Thanks,
Justin
|
Hi,
BTW Apache doesn’t use author tags in any new code, doing so implies ownership by a person rather than the whole project.
Thanks,
Justin
|
So @justinmclean is it safe to do the batch conversion if the source code meets all of the following criteria? |
Hi,
So @justinmclean is it safe to do the batch conversion if the source code meets all of the following criteria?
1. The source code isn't converted from SVN or CVS
I’m not sure what you mean by that.
2. All committers (or their companies) in the git log have signed an ICLA or SGA
Small contributions don’t have to have a CLA, but the person who committed that contribution takes responsibility for ensuring the code’s IP. If possible it's best to have one.
3. The copyright holder in the source code has signed an ICLA or SGA
Take care with this. The copyright holder in source may or may not be the correct one.
Thanks,
Justin
|
Similarly, the author in GIT may not be the author of the file. Often the copyright holder in the source file header is the correct one, even though that person may not appear in GIT history. Many people copy files that I wrote into different locations (very often for new architectures and for new boards which are very similar to older architectures and boards). Very often, I am the author of the file in these cases. Bottom line: There is no magic, automated way to correctly determine the author. It requires collecting data and then also applying human insight. @justinmclean For many cases there are multiple contributors of changes to a file. There is an original author, the original committer (who might be a different person), and people who have made trivial changes (as trivial as a spelling fix) or who have made substantial enhancements or re-designs. The former would not be treated as authors or copyright holders, but the latter may be. Is there any rule of thumb for what constitutes a significant change warranting rights to the file? Or does this also require human insight? There are thousands of files involved here. This is potentially multiple man years of effort. I don't see how we can ever accomplish this. |
We can only operate on the information we have. If authorship information was lost from the CVS and SVN era (git author is Greg) and the header does not list anyone other than Greg, we can either "play safe" and leave the BSD header (we would be respecting the original author's license even if we don't know who it really was), or assume that without further information the original author cannot prove authorship either, so we are safe to change to Apache. For these "unknown" cases, I don't see any other way. We just need to decide and then act. For other cases where there is indeed information, I think we can script a header change based on various scenarios of git author/header author/author aliases where all have ICLAs. This change can be made to create one commit per file change and add the reason for the safety of the change to the commit message for traceability. Then, we can review each commit in a PR and decide if manual intervention is needed (throwing out unsafe changes, for example). |
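The scenario-based decision proposed in this comment could look roughly like the following. It is a sketch under stated assumptions, not an agreed policy: the `FileRecord` fields, the `Verdict` names, and the `covered` set (people or companies with an ICLA or SGA on record) are hypothetical, and the returned reason string is meant to be pasted into the per-file commit message for traceability.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    CONVERT = "convert"  # every known contributor is covered by an ICLA/SGA
    REVIEW = "review"    # at least one contributor is not covered
    UNKNOWN = "unknown"  # authorship cannot be derived from git or the header


@dataclass
class FileRecord:
    path: str
    git_authors: set = field(default_factory=set)
    header_authors: set = field(default_factory=set)
    pre_git: bool = False  # history starts in the CVS/SVN era


def classify(record, covered, aliases=None):
    """Return (verdict, reason) for one file.

    `aliases` maps alternative spellings or handles to a canonical name.
    """
    aliases = aliases or {}
    people = {aliases.get(n, n) for n in record.git_authors | record.header_authors}
    if record.pre_git or not people:
        return Verdict.UNKNOWN, "authorship not recoverable from git or header"
    missing = sorted(people - covered)
    if missing:
        return Verdict.REVIEW, "no ICLA/SGA on record for: " + ", ".join(missing)
    return Verdict.CONVERT, "all contributors covered: " + ", ".join(sorted(people))
```

Only `CONVERT` results would feed the scripted header change; `REVIEW` and `UNKNOWN` stay with the existing BSD header until someone looks at them by hand.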
In the SVN/CVS days, I did always give credit to the contributor in comments. However, the task of reading all comments in those 15 thousand or so commits is a very onerous task. The information is there, just not easily accessible. AFAIK there are no un-credited changes in the repositories. |
We can try to see what wording you used in general and use some regular expression to try to match the attribution. What I'm thinking is that in any case we will always need to analyze a file by looking at its complete git history to extract git author + header author + commit msg attribution, right? The "easy" cases would then be files only touched by current committers. |
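For the "header author" part of that triple, a small parser over the file's leading comment block may be enough. The sketch below assumes a BSD-style header with lines of the form `Copyright (C) <years> <holder>. All rights reserved.` and `Author: <name> <email>`; that layout, the function name, and the 40-line limit are assumptions for illustration, so headers that deviate from it will need manual handling.

```python
import re

# Assumed header layout (not confirmed for every file):
#   Copyright (C) 2007-2009 Some Holder. All rights reserved.
#   Author: Some Name <some@address>
COPYRIGHT_RE = re.compile(r"Copyright\s+\(C\)\s+([0-9][0-9,\s-]*)\s+(.*)", re.IGNORECASE)
AUTHOR_RE = re.compile(r"Authors?:\s*(.+)")


def header_holders(source, max_lines=40):
    """Return (copyright_holders, authors) found near the top of a source file."""
    holders, authors = [], []
    for line in source.splitlines()[:max_lines]:
        match = COPYRIGHT_RE.search(line)
        if match:
            holder = match.group(2).replace("All rights reserved.", "")
            holders.append(holder.strip(" ."))
            continue
        match = AUTHOR_RE.search(line)
        if match:
            # Drop the "<email>" part and keep only the name.
            authors.append(re.sub(r"\s*<[^>]*>", "", match.group(1)).strip())
    return holders, authors
```

The two lists can then be compared with the git authors and the commit-message attributions for the same file.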
Hi,
@justinmclean For many cases there are multiple contributors of changes to a file. There is an original author, the original committer (who might be a different person) and people who have made trivial changes (as trivial as a spelling fix) or who have made substantial enhancements or re-designs.
Ideally we would have CLAs for those who have made significant changes or who owned the IP on the original contribution, whose owner may or may not be the author.
There are thousands of files involved here. This is potentially multiple man years of effort. I don't see how we can ever accomplish this.
I would try solving for the low-hanging fruit first, e.g. files that you know only people who currently have a CLA have contributed to, change those licenses to ALv2, and work from there. I think this has already been suggested. Other code is under a compatible license, so that’s the fallback position.
Thanks,
Justin
|
Let's clear the license for the files we own first. I think it is OK to have some files under compatible licenses for an ASF project. You just need to mention them in the NOTICE file. Another possible solution is to rewrite these files so we can change the license. Anyway, this depends on the number of files whose license we cannot change. Thanks. |
Let's clear the license for the files we own first. I think it is OK
to have some files under compatible licenses for an ASF project. You
just need to mention them in the NOTICE file. Another
possible solution is to rewrite these files so we can change the
license. Anyway, this depends on the number of files whose license we
cannot change.
I don't think anyone has committed to do that work. Adam and Peter
have, I guess, but they don't apparently have the bandwidth required to
do that effectively.
I think that even the first baby steps would require a substantial,
committed, full time effort.
|
I think @xiaoxiang781216 has already found someone who wishes to help here? But anyway, we need at least one committer to review the work... |
I've been writing some scripts which convert the output of git log (over a given file) into JSON format, to obtain metadata for each revision of the file. The final JSON contains (among other information): commit author, commit message and blob hash for the file. I will work a bit more on this and open a draft PR (to add the script inside tools/). |
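The same idea can be sketched in Python. This is not the script from the draft PR; it is a minimal illustration that assumes `git` is on the PATH and uses the `%H`, `%an`, `%ae`, `%B` and `%xNN` placeholders of `git log --format`. The function names and the JSON keys are hypothetical.

```python
import json
import subprocess

FIELD_SEP = "\x1f"   # ASCII unit separator between fields
RECORD_SEP = "\x1e"  # ASCII record separator between commits
LOG_FORMAT = "%H%x1f%an%x1f%ae%x1f%B%x1e"


def parse_log(raw):
    """Turn `git log --format=LOG_FORMAT` output into a list of dicts."""
    revisions = []
    for chunk in raw.split(RECORD_SEP):
        chunk = chunk.strip("\n")
        if not chunk:
            continue
        commit, name, email, message = chunk.split(FIELD_SEP, 3)
        revisions.append(
            {"commit": commit, "author": name, "email": email, "message": message.strip()}
        )
    return revisions


def file_history(path, repo="."):
    """Return per-revision metadata for one file, including its blob hash."""
    raw = subprocess.run(
        ["git", "-C", repo, "log", f"--format={LOG_FORMAT}", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout
    revisions = parse_log(raw)
    for rev in revisions:
        # `<commit>:<path>` names the blob; this fails if the commit deleted the file.
        blob = subprocess.run(
            ["git", "-C", repo, "rev-parse", f"{rev['commit']}:{path}"],
            capture_output=True, text=True,
        )
        rev["blob"] = blob.stdout.strip() if blob.returncode == 0 else None
    return revisions


def to_json(revisions):
    """Serialize the revision list as indented JSON text."""
    return json.dumps(revisions, indent=2)
```

Calling `to_json(file_history("some/file.c"))` from inside a clone would print the history of that (placeholder) path. Using control characters as separators avoids most of the escaping problems that arise when commit messages contain quotes or newlines.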
People have been using Fossology to get historical information: https://www.fossology.org/ |
Yeah, life intervened and I haven't been able to get back to this. I have less time for it than I thought. @PeterBee97 made some progress in parsing out the list of contributors from the Git log messages. I will see if I can take his list and see if I can get a list of files and also number of lines of code for each contribution... anyway that seems to be the next steps:
There are several other approaches. This is just the one that seems most straightforward to me. If anyone wants to help, we could use help with:
|
Please see #1834 I know @PeterBee97 started some of this work but to be honest it was quite difficult for me to take advantage of those, considering it was based on sqlite databases. I chose JSON format since it is quite easy to read and parse with different programming languages. |
I have to be in favor of anything that makes forward progress. |
@patacongo Re: anything that makes forward progress, me too. @v01d yes, text-based json or csv/tsv formats would be great. The scripts in #1834 look cool. Maybe we combine them into one python script with the sh module. I'll try them out. |
There's quite a bit of escaping going on in the bash script, so embedding it inside python would probably require some work. Not sure if it is worth it, but we can think about it. |
Comment moved to #1834 |
Oops, thought I was on the PR, I'll move the comments there |
Hi guys, we made some progress and are posting it here. Basically, we collected the list of authors and companies who have not signed the agreement. So the next step is to contact them via email and get them to sign the agreement. My questions are the following:
|
ICLAs are emailed to [email protected] see https://www.apache.org/licenses/contributor-agreements.html |
@justinmclean Thanks! One more question: how would you normally contact companies to get their SGA signed? Do you contact people you know from the company to get introduced? What department is normally responsible for this? For other authors, shall we just send emails automatically to contact them? |
@justinmclean One more question, shall we ask authors to send ICLA directly to [email protected]? Will someone from Apache Secretary process the mails and update the list and sync with us on the author list? |
I think this issue can be closed:
If there is something I am missing please just re-open. |
All files developed at the ASF need to have an ASF header [1], 3rd party headers for the most part need to be retained [2]