At this point you know how to load CSV data in Python. In this lesson you will learn how to access rows, columns, cells and subsets of rows and columns from a pandas dataframe. Let’s open the CSV file again, but this time we will work smarter. We will not download the CSV from the web manually. We will let Python directly access the CSV download URL. Here is the code:
import pandas as pd df1=pd.read_csv("http://pythonhow.com/wp-content/uploads/2016/01/Income_data.csv")
The dataframe will be identical to the dataframe we used in the previous lesson. Again, once you have the dataframe loaded in your iPython session, you can perform operations to your dataframe. Just for reference, here is how the complete dataframe looks like:
And before extracting data from the dataframe, it would be a good practice to assign a column with unique values as the index of the dataframe. The State column would be a good choice:
As you see you needed to store the result in a new dataframe because this is not an in-place operation. The df2 dataframe would look like this now:
Now, let’s extract a subset of the dataframe. Here is the general syntax rule to subset portions of a dataframe:
If you can’t wrap your mind around that, here is a nice example that extract the values for the rows from Alaska through Arkansas for years 2005 to 2007:
And here is how to slice a column:
df2.loc[: , "2005"]
Note that when you extract a single row or column, you get a one-dimensional object as output. That is called a pandas Series. The values on the left are just labels taken from the dataframe index. On the other hand, when we extracted portions of a pandas dataframe like we did earlier, we got a two-dimensional DataFrame type of object. Just something to keep in mind for later.
So, the formula to extract a column is still the same, but this time we didn’t pass any index name before and after the first colon. That tells Python to include all the rows. And we passed only one column name.
To extract only a row you would do the inverse:
df2.loc["California", : ]
And for a single cell:
You could apply methods to the subsets:
That for example would return the mean income value for year 2005 for all states of the dataframe.
Position based indexing
Now, sometimes, you don’t have row or column labels. In such case you will have to rely on position based indexing which is implemented with iloc instead of loc:
Note that when we used label based indexing both the start and the end labels were included in the subset. With position based slicing, only the start index is included. So, in this case Alabama had an index of 0, Alaska 1, and Arizona 2. Same goes for the columns.
And one more thing you should now about indexing is that when you have labels for either the rows or the columns, and you want to slice a portion of the dataframe, you wouldn’t know whether to use loc or iloc. In this case, you would want to use ix:
It’s recommended though that you use loc and iloc whenever you can.
Great! Now you know how to access data from your dataframe. You can then use those extracts to perform analysis and visualizations.