Adding caveats for DataFrame.iloc under Pandas Dataframes #598

Open
candemircan opened this issue Mar 31, 2022 · 4 comments
Labels
good first issue Good issue for first-time contributors type:enhancement Propose enhancement to the lesson

Comments

@candemircan

candemircan commented Mar 31, 2022

Hi!

In the Pandas DataFrames episode, under the DataFrame.iloc[..., ...] section, it might be worth mentioning the caveats of this method: if you add new columns to your data later, a position-based selection (as opposed to selecting by column name) can end up pointing at a different column than intended. If this is worth adding, I would be happy to make the edit and open a pull request.
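
A minimal sketch of the kind of problem meant here, assuming a toy table (the column names and values are made up for illustration and are not taken from the lesson data). It assumes the new column is inserted ahead of the one being selected; a column appended at the end would not shift earlier positions.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame(
    {"country": ["Albania", "Austria"], "gdp_1952": [1601.06, 6137.08]}
)

by_position_before = df.iloc[:, 1]        # second column, selected by position
by_label_before = df.loc[:, "gdp_1952"]   # same column, selected by name

# Later, a new column is inserted ahead of the one we care about.
df.insert(1, "continent", ["Europe", "Europe"])

by_position_after = df.iloc[:, 1]         # position 1 is now a different column
by_label_after = df.loc[:, "gdp_1952"]    # the label still finds the same data

print(by_position_before.name, "->", by_position_after.name)
print(by_label_before.name, "->", by_label_after.name)
```

No error is raised in either case, which is what makes the position-based version easy to overlook.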

Thanks,

Can

@alee added the type:enhancement and good first issue labels Apr 20, 2022
@richiehodel

I think a 'caveats' section would be a great addition, @candemircan. The lesson workflow does a good job of introducing learners to different approaches to slicing data frames. However, I wonder if learners may run into trouble when applying some of the knowledge from this lesson (e.g., adding columns to data later and running into index-based selection problems, as you mention). Perhaps a section could be added immediately after "Result of slicing can be used in further operations" to demonstrate the caveats and how learners might run into trouble. After completing the lesson, if learners start chaining further operations onto .iloc, they might run into a "SettingWithCopyWarning" and be unsure why it is happening. Addressing this specific warning may be beyond the scope of the lesson, but a brief section demonstrating the caveats would be valuable.
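
A rough sketch of how that warning can come about, assuming a small made-up frame. The chained assignment below is the pattern that typically triggers it; which warning (if any) is emitted, and whether the data is actually modified, depends on the pandas version, so the snippet only records the warnings instead of asserting a particular outcome.

```python
import warnings

import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Chained indexing: .iloc[...] followed by [...] are two separate steps,
# so the assignment may land on a temporary object instead of `chained`.
chained = df.copy()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chained.iloc[0:2]["x"] = 0
print([w.category.__name__ for w in caught])

# Single indexing step: rows and column are given to one .iloc call,
# so the assignment is applied to `direct` itself.
direct = df.copy()
direct.iloc[0:2, 0] = 0
print(direct)
```

Doing the selection in one .iloc (or .loc) call avoids the ambiguity altogether.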

@chillenzer

Hi there,

I agree, and I would be happy to write something up in that direction. However, I would like to discuss a few things before making an attempt.
Before getting to the point, I want to suggest going even further and teaching the .loc method before .iloc because it is more pandasothic (in the sense of pythonic): if you reach for pandas instead of numpy, that should be because the rows and columns of your data carry a special meaning (beyond their positions) and, if so, you should assign meaningful labels and use those.

Anyway, both methods have their caveats. A list off the top of my head (including yours):

  1. Choose .loc over .iloc if possible, because operations might change the index in a non-obvious way (see also the reasoning above).
  2. Refrain from chaining more than one .loc and/or .iloc because of the SettingWithCopyWarning mentioned above. I think that should be best practice even when not setting anything, because you might later copy that expression to set the same elements. Also, try never to return .(i)loc[...] from a function, because there is no guarantee what will happen to the result outside the function (another .(i)loc, perhaps?).
  3. pandas' slicing includes the end point for .loc but excludes it for .iloc. That is particularly nasty for trivial integer indices, because df.loc[0:1] != df.iloc[0:1] even in situations where both expressions are valid and 0 and 1 refer to the same rows.
  4. Indexing with lists and tuples is not semantically equivalent (although I can't come up with an example on the spot where both are valid but yield different results).
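
Points 3 and 4 could be illustrated roughly as follows. This is only a sketch on a made-up frame with default integer labels for both rows and columns, chosen so that every expression is valid.

```python
import pandas as pd

# Hypothetical toy data: rows are labelled 0, 1, 2 and columns 0, 1 by default.
df = pd.DataFrame([[10, 11], [20, 21], [30, 31]])

# Point 3: the same slice selects a different number of rows.
label_slice = df.loc[0:1]      # label-based: stop label is included
position_slice = df.iloc[0:1]  # position-based: stop position is excluded
print(len(label_slice), len(position_slice))

# Point 4: a list selects several rows, while a tuple is read as (row, column).
with_list = df.loc[[0, 1]]     # rows labelled 0 and 1
with_tuple = df.loc[(0, 1)]    # row label 0, column label 1: a single value
print(with_list.shape, with_tuple)
```

The tuple case works here only because a column labelled 1 happens to exist; with named columns the same expression would fail with a KeyError.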

There are probably even more subtle things to be aware of. My questions:

  • Anything missing? Anything irrelevant or too advanced?
  • As my points go a bit beyond the original scope, it might be better to put them at the end of the "Use/Select ..." sections (i.e. before "GroupBy...")?
  • How verbose should that be? More explicit information and reasoning or rather "do this because see this link"? (pandas has very informative docs on those things.) Include a MWE exhibiting a confusing scenario?
  • Should we have dedicated exercises about these pitfalls?

Best,
Julian

@KristinaGagalova

Hi,
I cannot access the link posted earlier. Could you point me to the correct page, in case anything has changed?
Thanks

@alee
Member

alee commented Jul 2, 2024

Hi Kristina, I've updated the link in the original issue; it should now point here: https://swcarpentry.github.io/python-novice-gapminder/08-data-frames.html
