Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tables.jl integration #25

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Tables.jl integration #25

wants to merge 1 commit into from

Conversation

tbeason
Copy link
Contributor

@tbeason tbeason commented Jul 27, 2020

Something I stumbled across while working on the previous PR... free Tables.jl integration! Because the existing (+CIF) nonparametric estimators return vectors of equal length, why not go one extra step and make it table friendly?

Before this PR:

julia> ev = EventTime.(rand(10),rand(Bool,10))
10-element Array{EventTime{Float64},1}:
 0.35179991680800793
 0.40759565633515504
 0.8775546088667416
 0.45056768643591116+
 0.7839147900638677
 0.5359791316287386
 0.6307186426517366
 0.7518185123916523
 0.2001353457567876
 0.0664477717088301+

julia> km = fit(KaplanMeier,ev)
KaplanMeier{Float64}([0.0664477717088301, 0.2001353457567876, 0.35179991680800793, 0.40759565633515504, 0.45056768643591116, 0.5359791316287386, 0.6307186426517366, 0.7518185123916523, 0.7839147900638677, 0.8775546088667416], [0, 1, 1, 1, 0, 1, 1, 1, 1, 1], [1, 0, 0, 0, 1, 0, 0, 0, 0, 0], [10, 9, 8, 7, 6, 5, 4, 3, 2, 1], [1.0, 0.8888888888888888, 0.7777777777777777, 0.6666666666666666, 0.6666666666666666, 0.5333333333333333, 0.4, 0.2666666666666667, 0.13333333333333336, 0.13333333333333336], [0.0, 0.11785113019775792, 0.1781741612749496, 0.23570226039551584, 0.23570226039551584, 0.32489314482696546, 0.4346134936801766, 0.5962847939999438, 0.9249624617007738, 0.9249624617007738])

Now with it...

julia> using DataFrames

julia> km |> DataFrame
10×6 DataFrame
│ Row │ times     │ nevents │ ncensor │ natrisk │ survival │ stderr   │
│     │ Float64   │ Int64   │ Int64   │ Int64   │ Float64  │ Float64  │
├─────┼───────────┼─────────┼─────────┼─────────┼──────────┼──────────┤
│ 10.066447801101.00.0      │
│ 20.2001351090.8888890.117851 │
│ 30.35181080.7777780.178174 │
│ 40.4075961070.6666670.235702 │
│ 50.4505680160.6666670.235702 │
│ 60.5359791050.5333330.324893 │
│ 70.6307191040.40.434613 │
│ 80.7518191030.2666670.596285 │
│ 90.7839151020.1333330.924962 │
│ 100.8775551010.1333330.924962

Needs tests and I guess a mention somewhere that this is possible.

@nignatiadis
Copy link

That's super useful! Makes it also very easy to use Efron-style [1] survival analysis based on logistic regression: just pass the Table to GLM.jl+StatsModels.jl.

[1] Efron, Bradley. "Logistic regression, survival analysis, and the Kaplan-Meier curve." Journal of the American statistical Association 83.402 (1988): 414-425.

@AntoineHus
Copy link

AntoineHus commented May 2, 2022

Hello,

Which tests are you looking for before doing the merge ? I think this is a nice feature

Thanks

@tbeason
Copy link
Contributor Author

tbeason commented May 3, 2022

I think it is unlikely that I come back to finish this PR any time soon. Feel free to take it over and add whatever needs to be done to consider it ready to merge!

@ararslan
Copy link
Member

Apologies for letting this languish for so long!

I gave this a fair bit of thought and while I definitely understand the appeal, I think I'm overall not in favor, at least with the current implementation. It effectively forces us to make guarantees about the particular fields of the types, namely that they all need to be vectors of equal length and have some kind of externally meaningful names and values. If we were to add another field to store some kind of intermediate computation or other kind of value, that would either end up getting exposed to the user in a rather surprising way or break this altogether. Now, it's possible to control this stuff with some getproperty/propertynames tomfoolery but I'd much prefer we not go down that route.

Something we could do is define things explicitly for each NonparametricEstimator subtype rather than generically for all of them to expose only particular things in the resulting table schema, which could be fields of the type or could be values derived from fields.

Note that in the meantime, you can do e.g. DataFrame(; (f => getfield(estimator, f) for f in fieldnames(typeof(estimator)))...), or omit DataFrame and construct a NamedTuple that way, which is a valid column table.

ararslan added a commit that referenced this pull request Jul 25, 2022
Both Kaplan-Meier and Nelson-Aalen compute the same set of basic
counts at each unique time prior to computing their respective
quantities of interest. The counts have a notably table-like format,
so much so that they can implement the Tables.jl interface with minimal
effort.

Credit for the idea of integration with Tables.jl goes entirely to Tyler
Beacon, author of PR #25, who has been added as a co-author of this
commit.

Co-Authored-By: Tyler Beason <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants