Analyze Bike-Sharing Data in a JavaScript/TypeScript Environment

Use Danfo.js with Node.js Notebooks in Visual Studio Code

Sep 23, 2021


Danfo.js is an open-source, JavaScript library providing high-performance, intuitive, and easy-to-use data structures for manipulating and processing structured data.

Danfo.js is heavily inspired by the Pandas library and provides a similar interface and API. - Danfo.js Docs

A Tool Setup Similar to Pandas & Jupyter Notebook in Python

In this tutorial, we will set up an environment for data analysis with JavaScript/TypeScript and write our code in an interactive environment similar to Jupyter Notebook.

I wrote a similar article here. What I didn’t like about using danfo.js in tslab was the output formatting, and autocompletion is missing there as well. When tables had more than a few columns, the output was not very readable.

Here is a comparison of the df.describe().print() output in both environments:

tslab
Node.js Notebooks

Installing Node.js Notebooks is very easy. If you already have VS Code installed on your machine, you just need to open the extension page here and click the “Install” button. You may need to reload VS Code to use the new extension.

First, we will look at an example that comes with the extension. Press the shortcut ⌘ + Alt + P and, in the input that opens, run the command > Open a sample node.js notebook. You will get the list below. Select the marked item.

This simple example shows how danfo.js can be used. You can see how a CSV file is loaded, how the Date column of the DataFrame is used as the index, and how two other columns, ["AAPL.Open", "AAPL.High"], are used to plot the lines. You can see the result below.

Install Danfo.js in your current VS Code working directory.

npm i danfojs-node

With the file extension *.nnb you can add your own files and VS Code will automatically show the controls for the notebook.

We will be using the Bike Sharing Dataset

This dataset contains the hourly and daily count of rental bikes between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.

Please download the dataset here and save it beside the sample.nnb file that we will add next.

Add a new file named sample.nnb in VS Code. In it, we will run some basic data analysis tasks.

In the first cell import Danfo.js

import * as dfd from "danfojs-node"

In the next cell, we load the CSV file into a DataFrame. It’s like a table that gives us some out-of-the-box methods for data analysis.

const df = await dfd.read_csv("./Bike-Sharing-Dataset/hour.csv", {csvConfigs:{delimiter: ','}})
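To make the "DataFrame is like a table" idea more concrete, here is a minimal plain-JavaScript sketch of what loading a CSV boils down to. It is not how Danfo.js implements read_csv. The function parseCsv and the three-line sample text are made up for illustration, and the parser assumes there are no quoted fields.

```javascript
// Hypothetical mini CSV; the real hour.csv has 17 columns.
const csvText = [
  "instant,dteday,hr,cnt",
  "1,2011-01-01,0,16",
  "2,2011-01-01,1,40",
].join("\n");

// Naive parser: assumes no quoted fields and no embedded delimiters.
function parseCsv(text, delimiter = ",") {
  const [headerLine, ...lines] = text.trim().split("\n");
  const columns = headerLine.split(delimiter);
  const rows = lines.map((line) => {
    const cells = line.split(delimiter);
    return Object.fromEntries(
      columns.map((name, i) => {
        const raw = cells[i];
        const asNumber = Number(raw);
        // Keep non-numeric cells (such as dates) as strings.
        const value = raw !== "" && !Number.isNaN(asNumber) ? asNumber : raw;
        return [name, value];
      })
    );
  });
  return { columns, rows };
}

const table = parseCsv(csvText);
console.log(table.columns);
console.log(table.rows[0]);
console.log("Shape:", [table.rows.length, table.columns.length]);
```

Each row becomes an object keyed by column name, which is roughly the mental model to keep when calling DataFrame methods later on.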

Let’s take a look at the first 10 rows to get a feel for what’s inside our DataFrame.

df.head(10).print()

Print the type of each column. In this case, we have the types string, int32, and float32.

df.ctypes.print()

Print the shape of our DataFrame, [rows, columns]

console.log('Shape of data:', df.shape)
// output: Shape of data: [ 17379, 17 ]

Print the count for each column.

It should be the same as the row count above, 17379; if not, there are missing values. For example, if a column had a count of 17376, it would have 3 missing values (NaN or undefined).

df.isna().count().print()
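The reasoning behind the count check can be sketched in plain JavaScript, independent of Danfo.js. The rows below are invented, and the helper names isMissing and countNonMissing are only for this illustration: a column's count is the number of values that are not missing, so the difference to the row count is the number of missing values.

```javascript
// Hypothetical rows; null, undefined and NaN stand for missing values.
const rows = [
  { temp: 0.24, hum: 0.81, windspeed: 0.0 },
  { temp: 0.22, hum: null, windspeed: 0.0 },
  { temp: NaN, hum: 0.8, windspeed: undefined },
];

const isMissing = (value) =>
  value === null || value === undefined || Number.isNaN(value);

// Count the non-missing values of every column.
function countNonMissing(rows) {
  const counts = {};
  for (const row of rows) {
    for (const [column, value] of Object.entries(row)) {
      counts[column] = (counts[column] ?? 0) + (isMissing(value) ? 0 : 1);
    }
  }
  return counts;
}

const counts = countNonMissing(rows);
for (const [column, count] of Object.entries(counts)) {
  console.log(`${column}: ${count} values, ${rows.length - count} missing`);
}
```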

Print some statistical values like mean, standard deviation (std), median, variance, min, max.

df.describe().T.print()
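If you are curious what these statistics mean for a single column, here is a small plain-JavaScript sketch. The function describeColumn and the sample values are assumptions for illustration, and the sketch uses the sample variance (dividing by n - 1); Danfo.js may compute some of these values differently.

```javascript
// Summary statistics for one numeric column.
function describeColumn(values) {
  const n = values.length;
  const sorted = [...values].sort((a, b) => a - b);
  const mean = values.reduce((sum, v) => sum + v, 0) / n;
  // Sample variance: squared deviations divided by n - 1 (an assumption).
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (n - 1);
  const mid = Math.floor(n / 2);
  const median =
    n % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
  return {
    count: n,
    mean,
    std: Math.sqrt(variance),
    min: sorted[0],
    median,
    max: sorted[n - 1],
    variance,
  };
}

// Hypothetical ride counts for five hours.
const stats = describeColumn([16, 40, 32, 13, 1]);
console.log(stats);
```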

We will make some data more human-readable.

First, we will copy our DataFrame into a new DataFrame to keep the original intact.

const hourlyDataFrame = df.copy()
hourlyDataFrame.head().print()

We rename the columns to something more readable.

hourlyDataFrame.rename({
  mapper: {
    instant: "record_id",
    dteday: "datetime",
    holiday: "is_holiday",
    workingday: "is_workingday",
    weathersit: "weather_condition",
    hum: "humidity",
    mnth: "month",
    cnt: "total_count",
    hr: "hour",
    yr: "year",
    temp: "temperature",
    atemp: "feeling_temperature",
  },
  inplace: true,
});

Map the numeric values 1, 2, 3, 4 of the season column to winter, spring, summer, fall. The last line prints out the unique values of our mapped data.

const seasonNames = {
  1: 'winter',
  2: 'spring',
  3: 'summer',
  4: 'fall'
}
const mapSeasonNumberToSeasonName = (seasonNumber) => {
  return seasonNames[seasonNumber]
}
// apply the mapping to the copied DataFrame and write the result back
const preprocessedSeasonSeries = hourlyDataFrame['season'].apply(mapSeasonNumberToSeasonName)
hourlyDataFrame['season'] = preprocessedSeasonSeries
preprocessedSeasonSeries.unique().print()

Map the index 0 and 1 to the years 2011 and 2012.

const yearMapping = {
  0: 2011,
  1: 2012
}
const mapYearIndexToYear = (yearIndex) => {
  return yearMapping[yearIndex]
}
let preprocessedYearSeries = hourlyDataFrame['year'].apply(mapYearIndexToYear)
hourlyDataFrame['year'] = preprocessedYearSeries
preprocessedYearSeries.unique().print()

Map the numbers 0 to 6 to readable weekday names.

const weekNames = {
  0: 'Sunday',
  1: 'Monday',
  2: 'Tuesday',
  3: 'Wednesday',
  4: 'Thursday',
  5: 'Friday',
  6: 'Saturday'
}
const mapWeekNumberToWeekName = (weekNumber) => {
  return weekNames[weekNumber]
}
let preprocessedWeekSeries = hourlyDataFrame['weekday'].apply(mapWeekNumberToWeekName)
hourlyDataFrame['weekday'] = preprocessedWeekSeries
preprocessedWeekSeries.unique().print()

Map weather condition numbers 1 to 4 to clear, cloudy, light_rain_snow, and heavy_rain_snow.

const weatherConditions = {
  1: 'clear',
  2: 'cloudy',
  3: 'light_rain_snow',
  4: 'heavy_rain_snow',
}
const mapWeatherConditionNumberToConditionName = conditionNumber => {
  return weatherConditions[conditionNumber]
}
let preprocessedWeatherSeries = hourlyDataFrame['weather_condition'].apply(mapWeatherConditionNumberToConditionName)
hourlyDataFrame['weather_condition'] = preprocessedWeatherSeries
preprocessedWeatherSeries.unique().print()
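All four mappings above follow the same pattern: look up a numeric code in an object. One thing to keep in mind is that an unknown code silently becomes undefined. A possible generic variant, sketched in plain JavaScript on a made-up array of codes, fails loudly instead. The helper name mapCodes is an assumption for this illustration and not part of Danfo.js.

```javascript
// Map numeric codes to labels and fail on codes that have no label.
function mapCodes(codes, labels) {
  return codes.map((code) => {
    if (!Object.prototype.hasOwnProperty.call(labels, code)) {
      throw new Error(`No label defined for code ${code}`);
    }
    return labels[code];
  });
}

const seasonLabels = { 1: "winter", 2: "spring", 3: "summer", 4: "fall" };

// Hypothetical season codes as they could appear in the season column.
const mappedSeasons = mapCodes([1, 1, 2, 3, 4, 4], seasonLabels);

// Same idea as calling unique() on the mapped Series.
const uniqueSeasons = [...new Set(mappedSeasons)];
console.log(uniqueSeasons);
```

Inside the notebook, the same check could live in the function passed to apply.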

Scale the humidity values from the range 0 to 1 to the range 0 to 100.

const preprocessedHumiditySeries = hourlyDataFrame['humidity']
  .apply(h => h * 100)
hourlyDataFrame['humidity'] = preprocessedHumiditySeries

Scale the wind speed values from the range 0 to 0.8507 to the range 0 to ~57.

const preprocessedWindspeedSeries = hourlyDataFrame['windspeed'].apply(h => h * 67)
hourlyDataFrame['windspeed'] = preprocessedWindspeedSeries
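The arithmetic behind both scalings is a simple multiplication by the factor used in the code above (100 for humidity, 67 for wind speed). Here is a tiny plain-JavaScript sketch with a hypothetical helper named denormalize, mainly to check where the upper bound of the wind speed range ends up.

```javascript
// Factors taken from the code above.
const HUMIDITY_FACTOR = 100;
const WINDSPEED_FACTOR = 67;

// Scale a normalized value back by multiplying with the factor.
const denormalize = (value, factor) => value * factor;

// 0.8507 is the largest normalized wind speed mentioned in the text.
const maxWindspeed = denormalize(0.8507, WINDSPEED_FACTOR);
console.log(maxWindspeed.toFixed(2));

// A normalized humidity of 0.81 corresponds to about 81 percent.
const humidity = denormalize(0.81, HUMIDITY_FACTOR);
console.log(humidity.toFixed(0));
```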

Print 10 random rows of our DataFrame to check our transformations.

let randomSample = await hourlyDataFrame.sample(10);
randomSample.print();
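For the curious, drawing random rows can be sketched in plain JavaScript with a partial Fisher-Yates shuffle. This is one common way to sample without replacement; whether Danfo.js does the same internally is not something this article covers. The function name sampleRows and the population of 100 numbers are assumptions for the example.

```javascript
// Draw `size` distinct items without modifying the input array.
function sampleRows(rows, size) {
  if (size > rows.length) {
    throw new RangeError("Sample size is larger than the number of rows");
  }
  const pool = [...rows];
  for (let i = 0; i < size; i++) {
    // Pick a random position from the not yet chosen part of the pool.
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, size);
}

// Hypothetical "rows": the numbers 0 to 99.
const population = Array.from({ length: 100 }, (_, i) => i);
const picked = sampleRows(population, 10);
console.log(picked);
```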

The registered column holds the number of users who registered before using a bike, and casual holds the number of users who did not. The total_count column (originally cnt) holds the sum of casual and registered.

Let’s verify that the numbers are correct.

// get values of the columns
const casual = hourlyDataFrame['casual'].values
const registered = hourlyDataFrame['registered'].values
const totalCount = hourlyDataFrame['total_count'].values
// create new data frames from values
const casualDataFrame = new dfd.DataFrame(casual)
const registeredDataFrame = new dfd.DataFrame(registered)
const countDataFrame = new dfd.DataFrame(totalCount)
// concatenate the casual and registered data
const concatedDataFrame = dfd.concat({ df_list: [casualDataFrame, registeredDataFrame], axis: 1 })
// sum up casual and registered
const summedCasualAndRegistered = concatedDataFrame.sum({axis: 0})
// calculate the difference: (casual + registered) - totalCount
const diffBetweenSummedAndCount = summedCasualAndRegistered.sub(countDataFrame)
// this should print 0, which proves that the sums match
diffBetweenSummedAndCount.unique().print()
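The same check can be expressed row by row in plain JavaScript, which may be easier to follow than the DataFrame version. The rows are invented for the example, and findMismatches is a hypothetical helper name.

```javascript
// Hypothetical rows with the renamed columns.
const rows = [
  { casual: 3, registered: 13, total_count: 16 },
  { casual: 8, registered: 32, total_count: 40 },
  { casual: 5, registered: 27, total_count: 32 },
];

// Return every row where casual + registered differs from total_count.
function findMismatches(rows) {
  return rows.filter(
    (row) => row.casual + row.registered !== row.total_count
  );
}

const mismatches = findMismatches(rows);
console.log(
  mismatches.length === 0
    ? "All rows add up"
    : `${mismatches.length} rows do not add up`
);
```

An empty result corresponds to the single unique value 0 in the DataFrame version.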

Visualize information

Distribution of registered vs casual users per hour. We see the most usage at 17:00 (5 pm) and the least at 4:00 (4 am).

const { Plotly } = require('node-kernel');
const data = [
  {
    y: hourlyDataFrame["casual"].values,
    histfunc: "sum",
    x: hourlyDataFrame["hour"].values,
    type: "histogram",
    name: "Casual",
  },
  {
    y: hourlyDataFrame["registered"].values,
    histfunc: "sum",
    x: hourlyDataFrame["hour"].values,
    type: "histogram",
    name: "Registered",
  },
];
const layout = {
  title: "Registered vs Casual rides per hour of the day",
  xaxis: {
    title: "Hour of the day",
  },
  yaxis: {
    title: "Bike rides count",
  },
};
Plotly.newPlot('myDiv', data, layout);
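With histfunc set to "sum", the chart adds up the y values that share the same x value. To see the numbers behind the bars, the aggregation can be sketched in plain JavaScript. The rows are made up, and sumByKey and peakKey are hypothetical helper names.

```javascript
// Hypothetical hourly rows.
const rows = [
  { hour: 4, casual: 0, registered: 1 },
  { hour: 17, casual: 30, registered: 200 },
  { hour: 17, casual: 25, registered: 180 },
  { hour: 4, casual: 1, registered: 2 },
  { hour: 8, casual: 10, registered: 150 },
];

// Sum valueColumn for every distinct value of keyColumn.
function sumByKey(rows, keyColumn, valueColumn) {
  const totals = new Map();
  for (const row of rows) {
    const key = row[keyColumn];
    totals.set(key, (totals.get(key) ?? 0) + row[valueColumn]);
  }
  return totals;
}

// Find the key with the largest total.
function peakKey(totals) {
  return [...totals.entries()].reduce((best, entry) =>
    entry[1] > best[1] ? entry : best
  )[0];
}

const registeredPerHour = sumByKey(rows, "hour", "registered");
const casualPerHour = sumByKey(rows, "hour", "casual");
console.log(registeredPerHour);
console.log("Peak hour (registered):", peakKey(registeredPerHour));
```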

Total bike rides per season. Winter has the fewest rides and summer the most.

const { Plotly } = require("node-kernel");
const data = [
  {
    y: hourlyDataFrame["total_count"].values,
    x: hourlyDataFrame["season"].values,
    type: "bar",
  },
];
const layout = {
  title: "Total bike rides per season",
  xaxis: {
    title: "Season",
  },
  yaxis: {
    title: "Total bike rides",
  },
};
Plotly.newPlot("myDiv", data, layout);

Total bike rides per month in 2011 and 2012. August is the month with the most bike rides.

const { Plotly } = require("node-kernel");
const data = [
  {
    y: hourlyDataFrame["total_count"].values,
    x: hourlyDataFrame["month"].values,
    type: "bar",
  },
];
const layout = {
  title: "Total bike rides per month",
  xaxis: {
    title: "Month",
  },
  yaxis: {
    title: "Total bike rides",
  },
};
Plotly.newPlot("myDiv", data, layout);

Total bike rides per month 2011 vs 2012.

In 2011, the best-performing months, all at a very similar level, are May, June, July, and August. In 2012 the picture is very different: the best-performing months are September and October. In both years, the weakest months are January and February.

const { Plotly } = require("node-kernel");
const groupedByYear = hourlyDataFrame.groupby(["year"]);
const data = [
  {
    y: groupedByYear.get_groups(["2011"])["total_count"].values,
    x: hourlyDataFrame["month"].values,
    type: "bar",
    name: "2011",
  },
  {
    y: groupedByYear.get_groups(["2012"])["total_count"].values,
    x: hourlyDataFrame["month"].values,
    type: "bar",
    name: "2012",
  },
];
const layout = {
  title: "Total bike rides per month 2011 vs 2012",
  xaxis: {
    title: "Month",
  },
  yaxis: {
    title: "Total bike rides",
  },
};
Plotly.newPlot("myDiv", data, layout);
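If you want the monthly totals as plain numbers, for example to double-check which month is strongest in each year, the grouping can also be done by hand. This is a plain-JavaScript sketch on invented rows; monthlyTotalsByYear is a hypothetical helper, and the resulting traces keep x and y aligned month by month.

```javascript
// Hypothetical rows with the renamed columns.
const rows = [
  { year: 2011, month: 1, total_count: 16 },
  { year: 2011, month: 1, total_count: 40 },
  { year: 2011, month: 8, total_count: 300 },
  { year: 2012, month: 1, total_count: 50 },
  { year: 2012, month: 9, total_count: 500 },
  { year: 2012, month: 9, total_count: 450 },
];

// For every year, build an array with 12 monthly totals (index 0 = January).
function monthlyTotalsByYear(rows) {
  const totals = {};
  for (const { year, month, total_count } of rows) {
    if (!totals[year]) {
      totals[year] = new Array(12).fill(0);
    }
    totals[year][month - 1] += total_count;
  }
  return totals;
}

const MONTHS = Array.from({ length: 12 }, (_, i) => i + 1);
const totals = monthlyTotalsByYear(rows);

// One bar trace per year, with x and y of the same length.
const traces = Object.entries(totals).map(([year, y]) => ({
  x: MONTHS,
  y,
  type: "bar",
  name: year,
}));
console.log(traces);
```

The traces array has the same shape as the data array passed to Plotly.newPlot above.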

Do users rent bikes more on working or nonworking days?

const { Plotly } = require("node-kernel");
const isWorkingDayDataFrame = hourlyDataFrame.query({
  column: "is_workingday",
  is: "==",
  to: 1,
});
const isNotWorkingDayDataFrame = hourlyDataFrame.query({
  column: "is_workingday",
  is: "!=",
  to: 1,
});
const data = [
  {
    values: [
      isWorkingDayDataFrame["total_count"].sum(),
      isNotWorkingDayDataFrame["total_count"].sum(),
    ],
    labels: ["WorkingDay", "Non-WorkingDay"],
    type: "pie",
  },
];
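// define a layout for this chart instead of reusing one from an earlier cell
const layout = {
  title: "Bike rides on working vs non-working days",
};
Plotly.newPlot("myDiv", data, layout);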

Do users on working days use the rental service more before 9 am or after 4 pm?

const { Plotly } = require("node-kernel");
const morningDataFrame = hourlyDataFrame
  .query({
    column: "hour",
    is: "<",
    to: "9",
  })
  .query({
    column: "is_workingday",
    is: "==",
    to: 1,
  });
const eveningDataFrame = hourlyDataFrame
  .query({
    column: "hour",
    is: ">",
    to: "16",
  })
  .query({
    column: "is_workingday",
    is: "==",
    to: 1,
  });
const data = [
  {
    values: [
      morningDataFrame["total_count"].sum(),
      eveningDataFrame["total_count"].sum(),
    ],
    labels: ["Before 9am", "After 4pm"],
    type: "pie",
  },
];
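// define a layout for this chart instead of reusing one from an earlier cell
const layout = {
  title: "Bike rides on working days: before 9 am vs after 4 pm",
};
Plotly.newPlot("myDiv", data, layout);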
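The two chained query calls act like a logical AND of two conditions. As a plain-JavaScript sketch on made-up rows, the same filtering and the resulting pie shares could look like this. The predicate names and the helper sumRides are assumptions for illustration.

```javascript
// Hypothetical rows with the renamed columns.
const rows = [
  { hour: 8, is_workingday: 1, total_count: 300 },
  { hour: 7, is_workingday: 0, total_count: 20 },
  { hour: 17, is_workingday: 1, total_count: 500 },
  { hour: 18, is_workingday: 1, total_count: 400 },
  { hour: 12, is_workingday: 1, total_count: 150 },
  { hour: 19, is_workingday: 0, total_count: 90 },
];

const isWorkingDay = (row) => row.is_workingday === 1;
const isBefore9am = (row) => row.hour < 9;
const isAfter4pm = (row) => row.hour > 16;

// Sum total_count over the rows that satisfy ALL given predicates.
function sumRides(rows, ...predicates) {
  return rows
    .filter((row) => predicates.every((predicate) => predicate(row)))
    .reduce((sum, row) => sum + row.total_count, 0);
}

const morningRides = sumRides(rows, isWorkingDay, isBefore9am);
const eveningRides = sumRides(rows, isWorkingDay, isAfter4pm);

// Shares as they would appear in the pie chart.
const morningShare = (morningRides / (morningRides + eveningRides)) * 100;
console.log(`Before 9am: ${morningShare.toFixed(1)} %`);
console.log(`After 4pm: ${(100 - morningShare).toFixed(1)} %`);
```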

Distribution of counts per year (2011 and 2012) via violin plot.

const { Plotly } = require("node-kernel");
const data = [
  {
    type: "violin",
    x: hourlyDataFrame["year"].values,
    y: hourlyDataFrame["total_count"].values,
    box: {
      visible: true,
    },
    meanline: {
      visible: true,
    },
  },
];
const layout = {
  title: "Distribution of counts per year",
};

Plotly.newPlot("myDiv", data, layout);

I hope you enjoyed reading this article as much as I enjoyed writing it. Your questions & feedback are very welcome!


Loving web development and learning something new. Always curious about new tools and ideas.