Data manipulation is nothing new, but when you constantly work across two languages such as R and Python, as I do, a side-by-side comparison of the most frequently used commands can be really handy.
This piece is for me, the future me, and YOU.
When you get a new dataset, the first thing you'll want to do is look at it. You need to load the library/package you want to use, read the data, and check its basic properties. I have listed the commands in R and Python side by side (R first, then Python), so you can easily spot the similarities and differences.
library(data.table)  # R
import pandas as pd  # Python
heart <- fread('PathToYourFile/heart.csv')  # R
heart = pd.read_csv('PathToYourFile/heart.csv')  # Python
In R, I prefer 'fread' (from data.table) to 'read_csv' (from readr) because it's much faster for reading large datasets.
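If you want to try the Python commands in this piece without a file on disk, here is a minimal sketch that reads CSV text from memory instead. The column names and values are toy data I made up for illustration; they are not the real contents of heart.csv.

```python
import io

import pandas as pd

# Toy CSV text standing in for heart.csv (made-up values, for illustration only).
csv_text = """age,sex,chol,target
63,1,233,1
37,1,250,1
41,0,204,0
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
heart = pd.read_csv(io.StringIO(csv_text))

print(heart.shape)             # (3, 4)
print(heart.columns.tolist())  # ['age', 'sex', 'chol', 'target']
```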
class(heart)  # R
type(heart)  # Python
dim(heart)  # R
heart.shape  # Python
Coming from R, you might find the syntax of the Python code a bit unusual. As I learned more about Python, I came to understand that Pandas dataframes have attributes (properties) and methods (behaviours). 'shape' is one of the attributes, so it takes no brackets.
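To make the attribute/method distinction concrete, here is a small sketch on a toy dataframe (the values are made up). It only shows that 'shape' is a stored value, while 'head' is something you call.

```python
import pandas as pd

# Toy data, made up for illustration.
heart = pd.DataFrame({"age": [63, 37, 41], "sex": [1, 1, 0]})

n_rows, n_cols = heart.shape   # attribute: a plain tuple, no brackets
first_rows = heart.head(2)     # method: called with brackets (and arguments)

print(heart.shape)             # (3, 2)
print(callable(heart.head))    # True
print(callable(heart.shape))   # False
```

Writing heart.head without the brackets does not fail; it just gives you back the method itself instead of the first rows, which is an easy slip to make when coming from R.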
head(heart)  # R
heart.head()  # Python
The head functions are very similar in both languages. Note that the Python command is a method, so it needs brackets.
dplyr::glimpse(heart)  # R; glimpse() comes from dplyr, not data.table
heart.info()  # Python
Apart from providing the data type of each column, the Python command also reports the number of non-null values per column, from which you can work out how many values are missing. The R command does not.
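If you want the missing-value counts directly rather than reading them off the info() output, one common approach is to chain isna() and sum(). A minimal sketch on toy data with a few gaps I put in on purpose:

```python
import pandas as pd

# Toy data with deliberately missing entries (made up for illustration).
heart = pd.DataFrame({
    "age": [63, None, 41, 56],
    "chol": [233, 250, None, None],
})

heart.info()                   # prints dtypes and non-null counts per column

missing = heart.isna().sum()   # number of missing values per column
print(missing["age"])          # 1
print(missing["chol"])         # 2
```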
summary(heart)  # R
heart.describe()  # Python
The summary statistics generated by the two commands are similar but the Python command also provides count and standard deviation.
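As a quick check of what describe() returns for numeric columns, the sketch below prints the row labels of its output on toy data (values made up). The result is itself a dataframe, so you can pick out individual statistics with loc.

```python
import pandas as pd

# Toy numeric data, made up for illustration.
heart = pd.DataFrame({"age": [63, 37, 41, 56], "chol": [233, 250, 204, 236]})

stats = heart.describe()

print(stats.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Pull out just the two statistics that R's summary() does not show.
print(stats.loc[["count", "std"]])
```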
colnames(heart)  # R
heart.columns.values  # Python
The Python command is longer, and the outputs of the two are of different data types. The R command returns a character vector (not a list), while the Python one gives a NumPy array. For comparison, lists can contain items of different data types, while vectors and arrays hold elements of a single type.
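If you would rather have the column names as a plain Python list, the sketch below shows one way. Note that I use to_numpy() and tolist() here as explicit conversions; that is my own choice for the illustration, not the command listed above, and the data is a toy example.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
heart = pd.DataFrame({"age": [63, 37], "sex": [1, 0]})

cols_index = heart.columns             # a pandas Index object
cols_array = heart.columns.to_numpy()  # explicit conversion to a NumPy array
cols_list = heart.columns.tolist()     # plain Python list

print(isinstance(cols_index, pd.Index))    # True
print(isinstance(cols_array, np.ndarray))  # True
print(cols_list)                           # ['age', 'sex']
```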
rownames(heart)  # R
heart.index.values  # Python
Instead of 'row', Pandas uses 'index', which is a little counter-intuitive and takes some time to get used to.
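To get a feel for the index, here is a small sketch. By default Pandas labels rows 0, 1, 2 and so on, and you can promote a column to be the index with set_index. The 'patient' column is a hypothetical identifier I added for the example; it is not part of the heart dataset.

```python
import pandas as pd

# Toy data; 'patient' is a hypothetical ID column added for illustration.
heart = pd.DataFrame({"patient": ["p1", "p2", "p3"], "age": [63, 37, 41]})

print(heart.index.tolist())          # [0, 1, 2]  (default labels start at 0)

# Use the 'patient' column as row labels instead.
by_patient = heart.set_index("patient")

print(by_patient.index.tolist())     # ['p1', 'p2', 'p3']
print(by_patient.loc["p2", "age"])   # 37
```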
Once you are familiar with your data, you can start manipulating it: adding new features, replacing values, and slicing/subsetting the dataframe. Let's see how to do this in both languages.
heart$new <- "hello"  # R: add a column
heart['new'] = 'hello'  # Python: add a column
heart <- heart[, !"new"]  # R: remove a column
heart.pop('new')  # Python: remove a column
Pay attention to the Python command here. You don't need to re-assign it as we normally do in R. The original dataframe is already changed, popped in this case. Further, 'pop' only works for a single column.
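A short sketch of the in-place behaviour described above, on toy data (values made up). It also shows that pop hands the removed column back to you, which is useful if you still need it.

```python
import pandas as pd

# Toy data, made up for illustration.
heart = pd.DataFrame({"age": [63, 37], "sex": [1, 0]})
heart["new"] = "hello"

# pop removes the column from 'heart' itself and returns it as a Series.
popped = heart.pop("new")

print(heart.columns.tolist())   # ['age', 'sex']
print(popped.tolist())          # ['hello', 'hello']
```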
heart <- heart[, !c("thal", "target")]  # R
heart = heart.drop(['thal', 'target'], axis='columns')  # Python
As mentioned above, for dropping multiple columns in Python, it's better to use 'drop'. For this task, the R code is neater.
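Unlike pop, drop returns a new dataframe and leaves the original alone, which is why the result is re-assigned. A minimal sketch on toy data (values made up), using the columns= keyword as a shorter equivalent of axis='columns':

```python
import pandas as pd

# Toy data, made up for illustration.
heart = pd.DataFrame({"age": [63, 37], "thal": [1, 2], "target": [1, 0]})

# drop returns a new dataframe; 'heart' is not modified.
trimmed = heart.drop(columns=["thal", "target"])

print(trimmed.columns.tolist())   # ['age']
print(heart.columns.tolist())     # ['age', 'thal', 'target']
```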
heart$sex[heart$sex == 1] <- "male"  # R
heart$sex[heart$sex == 0] <- "female"  # R
heart = heart.replace({'sex': {1: 'male', 0: 'female'}})  # Python
The Python command is more flexible when replacing multiple values, benefiting from the use of dictionaries.
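To illustrate that flexibility, the sketch below recodes two columns in a single replace call by nesting one dictionary per column. The data is a toy example, and the 'yes'/'no' labels for 'target' are hypothetical; I made them up just to show the pattern.

```python
import pandas as pd

# Toy data, made up for illustration.
heart = pd.DataFrame({"sex": [1, 0, 1], "target": [1, 1, 0]})

# Outer keys are column names; inner dictionaries map old values to new ones.
labels = {
    "sex": {1: "male", 0: "female"},
    "target": {1: "yes", 0: "no"},   # hypothetical labels
}

heart = heart.replace(labels)

print(heart["sex"].tolist())      # ['male', 'female', 'male']
print(heart["target"].tolist())   # ['yes', 'yes', 'no']
```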
As you can see, for simple data wrangling tasks, there are some similarities between the R and Python commands, yet each language has its own advantages. I hope the examples above ease the difficulties of starting to work with both languages.