How to Join Two Data Frames with Date Ranges

Suppose you have Github issues data frame with resolved dates (i.e. closed_at column). You have another data frame with sprints that have a sprint number and begin/end dates in one month duration. Now, You want to add the sprint number to the Github issues data frame based on which sprint the resolved date (i.e. closed_at) falls in.

For example, Github issue #1002 is resolved on Oct 17, 2021 so it should fall in Sprint 5 which started on Oct 1 and ended on Oct 30.)

  • Issues Data Frame

So the Issue Data Frame contains closed_at column which contains the date the the issue is closed.

  • Sprints Data Frame

Sprints Data Frame contains begin and end Dates columns like below.

So you want to join GitHub issue data frame with Sprints Data Frame with the closed_at is between begin and end condition.

Prerequisite - fuzzyjoin Package

So to address the issue, first you need to install fuzzyjoin Package. Click three bar icon and select Manage R Packages.

Click CRAN tab and type in fuzzyjoin and click Install.

fuzzy_left_join

And now we are ready to try the fuzzy_left_join to make the date range join condition.

Click the Plus (+) button next to Steps then select Custom R Command.

Enter an R Script like this.

fuzzy_left_join(., Sprints,
by=c("closed_at"="begin", "closed_at"="end"),
match_fun=list(`>=`, `<=`))

it means it left joins Github issues with Sprints data frame by using closed_at column and below conditions:

  1. closed_at >= begin
  2. closed_at <= end

With these conditions, hat it can detect which sprint is associated with each Github issue.

And after you run the command, yon can see each Github issue has a sprint number.