Prior Knowledge (https://www.priorknowledge.com) has a very similar API but a more interesting underlying model. They model the full joint distribution of the data, so any variables can be missing not just the outcome. They also are able to return the joint probability distribution over unknowns, which is extremely useful in terms of quantifying uncertainty.
I'm very curious to see (a) what sort of generative model they're using under the hood, and (b) how they do inference efficiently enough to not dedicate a cluster to each customer.
This looks fun and certainly simple - but I would guess that for many, the actual training of the model is not the show-stopper before "automated, data-driven decisions and data-driven applications are going to change the world."
If you already have clean data in tabular form, a single target class to predict, and enough training data, the last step was always sort of easy. Much harder is the fact that people expect Big Data and ML to be fairy dust: just give it my DB password and MAGIC comes out. And instead of a clean two-class classification problem you have some ill-defined mix of a bit of clustering, a bit of visualisation, and a bit of pure guessing.
<quote>This looks fun and certainly simple - but I would guess that for many, the actual training of the model is not the show-stopper before "automated, data-driven decisions and data-driven applications are going to change the world."</quote>
Totally agree - whenever I train a machine learning model (for a ranker or a classifier), I spend most of the time building the workflow to generate the datasets and extract and compute the features. I actually haven't found a good open source product that addresses that yet; the last time I worked on ML-related stuff I relied on Makefiles and a few Python scripts to distribute the computation across a small cluster.

I wanted a more powerful tool for this, so in my spare time I've tried to build something like what I had in mind. I came up with a prototype here: https://bitbucket.org/duilio/streamr . The code was mostly written on the first day; then I did a few commits to try out how it could work in a distributed environment. It is at a very early stage and needs massive refactoring - it is just a proof of concept.

I'd like my workflows to look like https://bitbucket.org/duilio/streamr/src/26937b99e083/tests/... . The tool should take care of distributing the workflow nodes and caching the results, so that you can slightly change your script and avoid recomputing all the data. I hadn't used celery before; maybe much of what I did for this prototype could have been avoided (e.g. the storage system could have been implemented as a celery cache).
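The caching idea described above - rerun a workflow node only when its inputs change, otherwise reuse the stored result - can be sketched with a content-addressed on-disk cache. Everything here (the `cached_step` decorator, the pickle-based store in a `cache/` directory) is illustrative, not how streamr actually implements it:

```python
import hashlib
import os
import pickle

CACHE_DIR = "cache"  # hypothetical on-disk store, not streamr's storage system

def cached_step(func):
    """Skip recomputation when a step is called with inputs it has seen before."""
    def wrapper(*args):
        # Key the result by the step name plus its (picklable) inputs.
        key = hashlib.sha1(pickle.dumps((func.__name__, args))).hexdigest()
        path = os.path.join(CACHE_DIR, key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)  # cache hit: no recomputation
        result = func(*args)
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result
    return wrapper

@cached_step
def extract_features(rows):
    # Stand-in for an expensive feature-extraction step.
    return [len(r) for r in rows]
```

Changing one step's code or inputs would then invalidate only that step's cache entries, which is the "slightly change your script and avoid recomputing all the data" behavior.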
Cool project. I'm spending a bit of time myself trying to put together a compositional tool chain for machine learning tasks. Are there any major design choices you've thought through for your project that you'd care to expound upon?
What type of algorithm (or at least what general class of algorithm) is generating the model? I'm curious whether the input data has to be linearly separable, and whether there is a limit on the number of classes (I think the iris dataset is only linearly separable for one of the classes).
As skystorm notes, these are decision trees. We have a limit on the number of classes (it's in the hundreds). We'll increase this as we improve the algorithms.
Luckily, our universe has exploitable structure. The No Free Lunch theorem is as much of a headache to search as the halting problem is to programming - not really one.
Seems a little too simple -- how can you generate predictions without even specifying which is the class variable and which are the predictor variables?
In the absence of any input or objective field arguments, the API assumes that the final column in a flat file is the objective field, while the rest are input fields. It will also try to determine the appropriate types for all fields.
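That default - last column is the objective field, everything before it is an input field - is easy to picture. A minimal sketch of the convention (generic code, not BigML's client library):

```python
import csv
import io

def split_fields(flat_file_text):
    """Default convention: last column is the objective field, rest are inputs."""
    rows = list(csv.reader(io.StringIO(flat_file_text)))
    header, data = rows[0], rows[1:]
    input_names, objective_name = header[:-1], header[-1]
    inputs = [row[:-1] for row in data]
    objective = [row[-1] for row in data]
    return input_names, objective_name, inputs, objective

# e.g. for iris: the measurements are inputs, the final "species" column
# is taken as the objective field.
```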
Sincere question: is your only data-ingest mechanism uploading gzipped CSVs or other files? It seems that if people really have big data, then by definition that approach won't work.
Most of my data is too big for CSVs but too small to justify distributed storage. I use HDF5 with chunking and column compression. I think many other people in the sciences and finance also do this (along with using NetCDF).
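For anyone unfamiliar with the setup described above, here is a minimal sketch using `h5py` (my library choice for illustration; the commenter doesn't name one) of storing a column as a chunked, compressed HDF5 dataset:

```python
import numpy as np
import h5py

# One dataset per column, chunked and gzip-compressed: reading a slice of
# one column only decompresses the chunks that cover it, and never touches
# other columns' data.
data = np.random.rand(100_000)
with h5py.File("table.h5", "w") as f:
    f.create_dataset("price", data=data, chunks=(10_000,), compression="gzip")

with h5py.File("table.h5", "r") as f:
    col = f["price"][:5_000]  # reads only the first chunk
```

This tends to be the sweet spot the commenter mentions: far smaller and faster than CSV, without the operational overhead of a distributed store.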
How does one get control over the predictive model? What classifier gets used, for instance. Maybe there is something in the API, but I didn't see it in the article.
We provide basic classification and regression trees for now, and we can decide which one is appropriate from the objective field type. Once we start adding in other types of models we will add a model type parameter for the relevant API method.
Do you plan on exposing parameters that control the fitting process? E.g. loss function / tree depth / min samples per leaf? Or will the fitting process always be a black-box automagic call with no user-controllable knobs?
Is there any plan to provide some assessment of model accuracy via the API - e.g. K-fold cross validation with respect to some specified loss function?
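Until something like that is exposed server-side, k-fold cross validation is straightforward to run client-side over any model. A sketch with scikit-learn's `cross_val_score` (again just an illustration, not the BigML API):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# 5-fold CV: each fold is held out once while a tree is fit on the other
# four; the mean accuracy estimates out-of-sample performance under
# zero-one loss.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean())
```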
We do a little automagic currently, but we'll expose some of the knobs soon, probably first via the API. Expressing model confidence and handling loss functions are being worked on right now.
Did I mention we're hiring? Someone with the right combination of big data and machine learning skills can make a big impact.
https://bigml.com/team
The model itself appears very flexible:
https://blog.priorknowledge.com/blog/beyond-correlation/
I'm not affiliated with these guys, but they are clearly doing the most interesting work in this area.