Panda's query method
Pandas is great when we need to select or filter our data according to some criteria. Generally, no loops are needed. A clear statement of what we want is just enough.
While still experimental, pandas query
method offers a simple way for making selections. The main advantage of this method, is that it allows writing cleaner and more readable code for getting the exact pieces of data you want. It also makes easier the task of index based selection.
In what follows, I give a brief overview of this method based on its documentation. If you want a thorough overview, read the docs.
Index based selection
Dealing with indices, is not an easy task. Indices are the main responsible for most of the speed and consistency that pandas offers (e.g. it makes sure that operations are for same observation). In the beginning, however, it might take some time to adjust to their logic.
Having used other statistical software, one of the things I could not get my head around was when I needed to use the index to filter the data according to some attribute. In the end, I would just convert the index into a column (or reset index) and go about with selections as I would regularly do with columns.
The good thing about the query
method is that it allows users to make selections directly with indices. Say that you want to select those indices bigger than a column. Well, no need to reset your index and refer to the column. We can simply use query
:
Example Data
Using indices as conditions
The query method makes index selections very easy. You just need to refer to the index for comparisons. When your index has a name, you just refer to that name in your query statement.
The query method can also be applied to multi index data frames.
In this case we have included another index level, with indices ‘bar’ and ‘foo’. Say that you want to select only those observations corresponding to ‘bar’ and from those takes the ones whose second index is greater than the values in column B.
Note that in case we have named our indices, we would replace “ilevel” for the corresponding index name. Also, take into account that in case the index has the same name as the column, the column will be given precedence. In other words, comparisons will be made with columns.
Succinct syntax
Another advantage of the query
method is that it makes our conditions more succinct and clear. For example, suppose we are checking the grades of a group of students and we are interested in finding out which students have shown increasing improvement in their grades. We will compare the regular and query
way of performing this selection.
Example Data
Checking for continuous progress:
Unsurprisingly, all lines of code lead to the same result. Most important to notice, however, is how easy was to get this result with the query
method.
In terms of comparison, a first thing to notice is that we do not use the parenthesis. The comparison operators in query
have precedence over and
and or
. Furthermore, note that in the last example, we do not even need to use and
. Personally, I find this makes the code very readable. As a result, the selection criteria is more clear, not only to the one performing the analysis, but also to people that will collaborate/read it.
There are many other things that are easier with query
. For example, imagine that you want to filter your data according to if a value is present in a column. As before, I present the “regular” way to perform this and the one with query
. I will use the grades data set presented before.
Again, the query
syntax is more straightforward. Note that for selections based on particular values, it is possible to use in/not in
or ==/!=
. Both render the same results.
There are many other things one can do using query
. I hope this has whet your appetite. As usual, you can check the docs to discover all its uses.
To recap, the query
method:
- Makes data selections succinct and easier to read.
- Selections based on indices are straightforward and do not need further operations (e.g. converting index to columns).