Function Reference

Logical

! Reverses the meaning of the following condition. It is a logical negation and has a similar meaning as 'Not'.
all Return TRUE if the condition is satisfied for all the rows or a group of rows.
any Return TRUE if the condition is satisfied for any of the rows or a group of rows.
duplicated Returns a logical value (TRUE or FALSE) indicating if it is a duplicated value or rows.
ifelse Returns different values based on the conditional result.
is.character Returns true if a given object is character type.
is.double Returns true if a given object is double type.
is.factor Returns true if a given object is factor type.
is.list Returns true if a given object is list type.
is.logical Returns true if a given object is logical type.
is.matrix Returns true if a given object is matrix type.
is.na Returns TRUE when it is NA.
is.null Returns TRUE when it is null.
is.numeric Returns true if a given object is numeric type.
isTRUE Returns true if and only if a given value is a length-one logical vector whose only element is TRUE.
which Returns the TRUE indices of a logical object or based on the condition.
xor Performs an exclusive-or check across two values and return TRUE when only one of them is TRUE.
is.Date Returns true if a given object is Date type.
is.difftime Returns true if a given object is difftime type.
is.duration Returns true if a given object is duration type.
is.POSIXct Returns true if a given object is POSIXct type.
str_detect Return TRUE or FALSE based on whether Text data contains a given text or not
if_else Returns values based on whether a given condition is satisfied or not.
case_when Returns values based on multiple conditions. This is similar to CASE WHEN conditions in SQL. You can use two sided formula to define the condition and the mapping values, and use TRUE as ELSE condition. See examples below.
coalesce Replace the missing values with a given value or the values that are at the same position in other columns.
na_if Replaces the matching values to NA.
near Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance.
recode Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.
recode_factor Recodes (or replaces) the original values with given values and create a factor column. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.
parse_logical Convert data to Logical (Boolean) data type.
is.hms Returns true if a given object is hms type.

Text

abbreviate Abbreviate strings to at least minlength characters, such that they remain unique (if they were). First all spaces at the beginning of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed (starting at the right) followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters are stripped.
c This is a generic function which combines its arguments.
iconv Convert a given text from one encoding to another.
nchar Returns # of letters or size of a given value.
relevel Sets the base level of a given factor column by making the level specified by ```ref``` argument becomes the first and the others are moved down.
str_c Concatenates multiple text from multiple columns.
str_conv Convert text to a specified encoding. e.g. UTF-8
str_count Return number of letters or words in text data. You can also set a pattern to count only the matched pattern or text.
str_detect Return TRUE or FALSE based on whether Text data contains a given text or not
str_dup Repeat text values.
str_extract Extract only letters that match with a given letters or patterns
str_extract_all Extract all characters that match with a given letters or patterns
str_length Return number of letters in text data.
str_pad Add white spaces to text values to make the total length to be exact a specified number.
str_replace Replace letters that matches with a given letters or expressions
str_replace_all Replace letters that matches with a given letters or expressions
str_replace_na Replace NA values to a specific Text
str_split Split a given text into multiple text by a given separator. It will return a list so you want to unnest the newly created column with unnest() function.
str_sub Extract letters from Text values based on the position
str_to_lower Convert text to lowercase
str_to_title Convert text to Title-case
str_to_upper Convert text to UPPERCASE
str_trim Trim (or remove) white spaces in Text values
word Extract First / Last Nth Word in Text data
extract_numeric Extract only the number from a given text. This is useful for strings that are supposed to be numbers with extra formatting (e.g. $1,200.34, -12%, X1, etc.).
url_parameters Returns decoded query parameters from url.
na_if Replaces the matching values to NA.
recode Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.
parse_number Parse characters, extract numeric values, and convert to number data type including
anonymize Anonymize values by hashing algorithms.
str_clean Cleans up text by removing escape characters (e.g. \n, \t), extra white spaces, extra period, and leading and trailing spaces.
str_count_all Count patterns from texts
get_stopwords Returns stop words like "a", "the", "and", etc.
word_to_sentiment Returns sentiment types of positive or negative based on a given word(s).
get_sentiment Returns the sentiment score for a given text sentence. The positive score indicates the positive sentence and the negative score indicates the opposite. 0 means 'neutral.'
stem_word Stem word so that almost the same words which have a little different spells can be recognized as the same words.
is_stopword Returns TRUE if a given word is included in a list of the stop words defined by one of the dictionary.
is_empty Returns TRUE if a given text is empty string or NA.
is_alphabet Returns TRUE if a given text contains only alphabets.
ip_to_country Returns country names from IP addresses.
url_domain Returns domain (ex. "exploratory.io") from url.
url_fragment Returns fragment from url.
url_path Returns path from url.
url_port Returns port from url.
url_scheme Returns scheme (ex. "http", "https") from url.
url_suffix Returns suffix (ex. "com", "org") from url.
url_subdomain Returns subdomain (ex. "www", "blog") from url.
url_tld Returns top-level domain (ex. "com", "co") from url.
url_param Returns decoded query parameter from url.
countrycode Map country names or codes to other codes or names (country names, continent names, etc.).
statecode This function takes a column that has US State information and returns either US State names, abbreviations, numeric codes, division, or region, based on 'output_type' argument setting. The US State information can be either US State names, US State abbreviations (FIPS / ANSI codes), or US State numeric code (FIPS) in any combination.
countycode Generate US county code (FIPS - Federal Information Processing Standard) based on US State and County names.

Number

abs Returns the absolute value.
acos Returns the inverse cosine of a value, in radians
acosh Returns the inverse hyperbolic cosine of a number.
asin Returns the inverse sine of a value, in radians.
asinh Returns the inverse hyperbolic sine of a number.
atan Returns the inverse tangent of a value, in radians.
atan2 Returns the angle between the x-axis and a line segment from the origin (0,0) to specified coordinate pair (`x`,`y`), in radians.
atanh Returns the inverse hyperbolic tangent of a number.
ceiling Rounds a number up to the nearest integer. ceiling(3.475) is 4
cos Returns the cosine of an angle provided in radians.
cosh Returns the hyperbolic cosine of any real number.
cospi Returns the cos(pi * ).
cut Divides the range of values into intervals and return the values in x according to which interval they fall.
exp Returns Euler's number, e (~2.718) raised to a power.
floor Rounds a number down to the nearest integer. ceiling(3.475) is 3.
log Returns logarithms, by default natural logarithms.
log10 Computes common (i.e., base 10) logarithms
log1p Computes log(1+x) accurately also for |x| << 1.
log2 computes binary (i.e., base 2) logarithms
pmax Returns the parallel maxima of the input values.
pmin Returns the parallel minima of the input values.
round Rounds a number to a certain number of decimal places.
sign Returns either 1, 0, or -1 depending on whether the input value is positive, zero, or negative, respectively.
signif Rounds the values in its first argument to the specified number of significant digits.
sin Returns the sine of an angle provided in radians.
sinh Returns the hyperbolic sine of any real number.
sinpi Returns the sin(pi * ).
sqrt Returns the square root of the values.
tan Returns the tangent of an angle provided in radians.
tanh Returns the hyperbolic tangent of any real number.
tanpi Returns the tan(pi * ).
trunc Truncates the numbers by omitting digits.
near Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance.
normalize Centers and/or scales the numeric values of a given column.

Aggregate

all Return TRUE if the condition is satisfied for all the rows or a group of rows.
any Return TRUE if the condition is satisfied for any of the rows or a group of rows.
length Returns the length of the values.
max Returns the maximum value in a numeric column.
mean Returns the numerical average (mean) value.
min Returns the minimum value in a numeric column.
prod Returns the product value by multiplying a series of numbers in a given data together.
sum Returns the sum of all the values.
unique Returns number of unique values.
n Returns the count of the rows for each group.
n_distinct Returns the count of unique values.
nth returns the nth value of all the values.
IQR computes interquartile range of the x values
mad returns the median absolute deviation of the values.
median Returns the numerical median value.
quantile sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.
sd returns the standard deviation of the values.
var returns the variance of the values.

Type Conversion

as.character Convert a given data to character data type.
as.Date Convert a given data to Date data type. Date data type doesn't include Time. The default formats follow the rules of the ISO 8601 international standard which expresses a day as "2001-02-03". If the given data is number as 'internal system number (# of days since some origin)' then you want to provide which date to be used as 'origin'.
as.double Convert data to double precision number data type. It is identical to numeric.
as.factor Encode a column as a factor.
as.integer Convert data to integer number data type.
as.logical Convert data to Logical (Boolean) data type.
as.numeric Convert data to
as.POSIXct Convert data to Date/Time data type.
factor Encode a column as a factor.
parse_date_time Convert data to Date/Time data type.
parse_number Parse characters, extract numeric values, and convert to number data type including
parse_double Convert data to double precision number data type.
parse_euro_double Convert data to double precision number data type.
parse_integer Convert data to integer number data type.
parse_time Convert data to Time data type.
parse_character Convert a given data to character data type.
parse_factor Parse data and convert to factor data tyep.
parse_logical Convert data to Logical (Boolean) data type.
as_datetime Convert a given data to date.
as.hms Convert a given data to hms data type. It's difftime dealt as seconds.
excel_numeric_to_date Convert number of date from excel to Date.
unixtime_to_datetime Convert unix time numeric values to POSIXct.

Others

chartr Translates each character in x that is specified in old to the corresponding character specified in new.
list Construct a list
rep Repeats given numbers.
rev Reverses the entries in a given data.
desc Change the sorting order to a descending order. e.g. 9 to 1 instead of 1 to 9, z to a instead of a to z.

Window Calculation

cummax Returns the cumulative maxima.
cummin Returns the cumulative minima.
cumprod Returns the cumulative products.
cumsum Returns the cumulative sums.
cumall Returns TRUE if all the values up to the current position are TRUE.
cumany Returns TRUE if any of the values up to the current posiiton is TRUE.
cume_dist Cumulative distribution. Proportion of all values less than or equal to the current rank. Missing values are left as is.
cummean Returns the cumulative mean (average).
dense_rank Ranks with no gaps. Missing values are left as is.
first returns the first value of all the values.
lag Provides access to a row at a given physical offset prior to that position.
last returns the last value of all the values.
lead Provides access to a row at a given physical offset that follows the current row.
min_rank Ranks. Ties get min rank. Missing values are left as is.
ntile Breaks the column values into n buckets. Missing values are left as is.
percent_rank Ranks rescaled to [0, 1]. Missing values are left as is.
row_number Returns row numbers. Equivalent to Rank.
roll_max Returns the rolling (or moving) max value.
roll_mean Returns the rolling (or moving) mean value.
roll_median Returns the rolling (or moving) median value.
roll_min Returns the rolling (or moving) min value.
roll_prod Returns the rolling (or moving) prod value.
roll_sd Returns the rolling (or moving) standard deviation value.
roll_sum Returns the rolling (or moving) sum value.
roll_var Returns the rolling (or moving) variance value.

Date

difftime Calculates the difference between two given dates/times.
months Convert a number to month periods so you can use it for further calculations.
ceiling_date Rounds date/time up to the nearest integer value of the specified time unit.
date_decimal Converts numeric data calculated as a fraction of the year to date data type.
day Extract Day
days Convert a number to day periods so you can use it for further calculations.
decimal_date Converts date data to numeric data type by calculating the date to a fraction of the year.
dmy Convert Character or Number to Date / Period when data contains only Date, but not Time.
dmy_h Convert Character or Number to Date / Period when data contains Date and Time.
dmy_hm Convert Character or Number to Date / Period when data contains Date and Time.
dmy_hms Convert Character or Number to Date / Period when data contains Date and Time.
duration Create a duration
dym Convert Character or Number to Date / Period when data contains only Date, but not Time.
floor_date Rounds date/time down to the nearest integer value of the specified time unit.
force_tz Re-registers the data with a given timezone.
here Get current time in your local timezone
hm Convert text to Period data type if a given column holds values that look like Hours and Minutes
hms Create hms data. It's difftime dealt as seconds.
hour Extract hour
hours Convert a number to hour periods so you can use it for further calculations.
interval Return intervals between two dates. You can use this result to get the duration by either dividing by a given units such as days, weeks, etc, OR converting it to numeric data type with "as.numeric()".
mday Extract Day of Month
mdy Convert Character or Number to Date / Period when data contains only Date, but not Time.
mdy_h Convert Character or Number to Date / Period when data contains Date and Time.
mdy_hm Convert Character or Number to Date / Period when data contains Date and Time.
mdy_hms Convert Character or Number to Date / Period when data contains Date and Time.
milliseconds Convert a number to milliseconds periods so you can use it for further calculations.
minute Extract minute
minutes Convert a number to minute periods so you can use it for further calculations.
month Extract Month
ms Convert text to Period data type if a given column holds values that look like Minutes and Seconds
myd Convert Character or Number to Date / Period when data contains only Date, but not Time.
now Returns current date and time. A similar function 'today()' returns only date.
parse_date_time Convert data to Date/Time data type.
qday Extract Day of Quarter
quarter Extract Quarter
rollback Calculate the last day of the previous month or to the first day of the month based on a given date.
round_date Rounds date/time to the nearest integer value of the specified time unit.
second Extract second
seconds Convert a number to second periods so you can use it for further calculations.
time_length Calculate the exact time length between two dates.
today Returns current date without time. A similar function 'now()' returns date including time.
wday Extract day of the week as a decimal number (01-07, Sunday is 1) or as Text (e.g. Sunday). The Text can be a full text or an abbreviated text. Sunday vs. Sun
week Extract Week
weeks Convert a number to week periods so you can use it for further calculations.
with_tz Returns Date / Time in a given time zone
yday Extract Day of Year
ydm Convert Character or Number to Date / Period when data contains only Date, but not Time.
ydm_h Convert Character or Number to Date / Period when data contains Date and Time.
ydm_hm Convert Character or Number to Date / Period when data contains Date and Time.
ydm_hms Convert Character or Number to Date / Period when data contains Date and Time.
year Extract Year
years Convert a number to year periods so you can use it for further calculations.
ymd Convert Character or Number to Date / Period when data contains only Date, but not Time.
ymd_h Convert Character or Number to Date / Period when data contains Date and Time.
ymd_hm Convert Character or Number to Date / Period when data contains Date and Time.
ymd_hms Convert Character or Number to Date / Period when data contains Date and Time.
parse_time Convert data to Time data type.
yq Convert text to Date data type if a given column holds values that look like Year and Quoter.
as.hms Convert a given data to hms data type. It's difftime dealt as seconds.

Data Frame

intersect Keep the rows that appear in all of the given data frames.
setdiff Keep the rows that appear in the first data frame but not the other data frames.
union Keep the rows that appear in one of the given data frames.
anti_join Return all rows from the current data frame where there are not matching values in the target, keeping just columns from the current.
arrange Sort rows by given column(s).
bind_cols Bind multiple data frames by column.
bind_rows Bind multiple data frames by row.
distinct Select distinct/unique rows. Only the first row will be preserved. Set .keep_all argument to TRUE to keep all the original columns.
filter Select rows with conditions.
full_join Return all rows and all columns from both the current data frame and the target data frame.
group_by Converts the data frame into a grouped data frame where the following operations will be performed based on the groups. Grouping should be done before you want to aggregate values.
inner_join Return all rows from the current data frame where there are matching values in the current, and all columns from the current and the target.
left_join Return all rows from the current data frame, and all columns from the current and the target. Rows in the current with no match in the target will have NA values in the new columns. If there are multiple matches between the current and the target, all combinations of the matches are returned.
mutate Mutate creates new column(s) with given expressions and preserves existing . You can use transmute to create new columns while removing the original ones.
rename Rename existing column names.
right_join Return all rows from the target data frame, and all columns from the current and the target
sample_frac Sample n fraction of rows from the data frame.
sample_n Sample n rows from the data frame.
select Select column(s) by column names.
semi_join Return all rows from the current data frame where there are matching values in the target data frame, keeping just columns from the current.
slice Select rows by positions. This is not supported for relational databases, in which case you should use filter() with row_number() function.
summarize Summarize a column values to a single value by using aggregate functions. Make sure you have grouped the data frame already using group_by() already.
top_n Select the top n entries based on a given measure in each group.
transmute Transmute adds new columns and drops existing columns.
ungroup Ungroup existing grouping
predict Returns data augmented with fitted values and residuals.
complete Complete a data frame by adding missing combinations of data.
expand Expand a data frame by adding all the combination of given columns.
fill Fills missing values in using the previous entry. This is convenient especially when values are not presented in some cells to avoid duplicated entries.
gather Takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.
nest Nest a set of columns together as a list column.
separate Separates a column with delimited values into multiple columns.
spread Spread a key-value pair across multiple columns. Missing values will be replaced with NA.
unite Unite multiple columns together into one column with given uniting characters. It will concatenate values by “\_" and remove the original column by default. Multiple columns specified in the list can be combined together at once.
unnest Unnest a list column or a list of data frames by making each element of the list to be presented in its own row.
augment Returns the original data augmented with fitted values and residuals.
glance Returns a single row summary of a given model.
tidy Returns a summary information of a given model in a tidy (normalized data frame) format.
select_if Select column(s) with a predicate (conditional) function.
mutate_all Apply functions to all the columns.
mutate_at Apply functions to specified columns.
mutate_if Apply functions to only the columns that match with a given condition.
summarize_all Apply functions to all the columns.
summarize_at Apply functions to specified columns.
summarize_if Apply functions to only the columns that match with a given condition.
separate_rows Separates a column with delimited values into multiple rows.
drop_na Drop rows that have NA value.
type_convert Heuristically guess the data type for each column by reading the first 1000 rows, parse the data, and set appropriate data types for all the columns of the data frame.
remove_empty_rows Remove rows whose column values are all NAs.
remove_empty_cols Remove columns whose values are all NAs.
clean_names Make column names clean by using only _ character, lowercase letters, and numbers.
get_dupes Get rows that are duplicated.
convert_to_NA Convert specific values to NA.
tabyl Create frequency table.
do_svd.kv Calculates coordinations by reducing dimensionality using SVD (Singular Value Decomposition).
build_lm Builds a linear regression model (lm) and store it in a data frame.
build_lr Builds logistic regression model and store it in a data frame.
build_glm Builds generalized linear models (glm) and store it in a data frame.
build_multinom Builds multinomial logistic regression model and store it in a data frame.
prediction Returns a data frame with regression information about a given model.
prediction_binary Returns a data frame with binary classification information about a given model.
prediction_coxph Returns a data frame with predicted values of Cox Proportional Hazard Model.
model_coef Returns a data frame with 'Parameter Estimates (Coefficients)' information about a given model including the below.* term - Term in lm that is estimated.* estimate - Estimated coefficients. The larger the number, the more positive effect the variable has.* std_error - Standard Error. How much prediction error there is in the variable.* t_ratio - estimate / std_error. This value in t distribution is p_value.* p_value - The probability that the variable is **not** effective.* conf_low - The lower bound of confidence interval.* conf_high - The upper bound of confidence interval.
model_stats Returns a data frame with 'Summary of Fit' information about a given model including the below.* r_square - R Square.* r_square_adj - Adjusted R Square.* root_mean_square_error - Root mean square error.* f_ratio - F ratio.* p_value - P value.* df - Degree of freedom.* null_deviance - Null deviance.* df_for_null_model - Degree of freedom for null model.* log_likelihood - Log likelifood.* deviance - Deviance.* AIC - Akaike's information criterion.* BIC - Bayesian information criterion.* deviance - Deviance.* residual_df - Residual degree of freedom.
model_anova Returns a data frame with anova test information about a given model including the below.* df - Degree of freedom.* sum_of_squares* mean_square* f_ratio* p_value* deviance* residual_df* residual_deviance
evaluate_regression Returns a data frame with evaluation score of regression including the below.* r_squared* explained_variance* mean_square_error* misclassification_error* root_mean_square_error* mean_absolute_error* mean_absolute_percentage_error
evaluate_binary Returns a data frame with evaluation score of binary classification including the below.* AUC* f_score* accuracy* misclassification_rate* precision* recall* specificity* true_positive - Number of positive predictions that actually are positive.* false_positive - Number of positive predictions that actually are negative.* true_negative - Number of negative predictions that actually are negative.* false_negative - Number of negative predictions that actually are positive.* test_size - The number of tested data.* threshold - threshold value for prediction.
evaluate_multi Returns a data frame with evaluation score of multi classification including the below.* micro_f_score* macro_f_score* accuracy* misclassification_rate
do_roc Returns coordinates of roc curve.
build_kmeans.cols Builds a clustering model (k-means) from variable columns and returns the summary of the model or the augmented data depending on the parameter value.
build_kmeans.kv Builds a clustering model (k-means) from key-value columns and store the model into a generated data frame or augment the original data with the clustered ID.
build_kmeans Builds a clustering model (k-means). Stores the model into a generated data frame or the augmented data depending on the parameter value.
do_t.test Execute t-test, which checks differences of means of variables.
do_var.test Execute F-test, which checks the differences of variances between groups.
do_chisq.test Execute chi-squared contingency table tests and goodness-of-fit tests.
do_apriori Find rules of what tend to occur at the same time from transaction data.
do_anomaly_detection Detect anomaly in time series data frame.
row_as_header Use a row as column names.
pivot Pivot columns into rows and columns. Values are count of pairs of rows and columns or aggregation of another column.
do_cor.cols Calculates correlations for all the pairs of the variables (columns).
do_cor.kv Calculates correlations for all the pairs of subject columns.
do_cor Calculates correlations for all the pairs of the variables or subjects.
do_dist.kv Calculate the distances between each of the pairs.
do_dist.cols Calculate distances of each of the pairs.
do_dist Calculate distances of each of the pairs of the variables or subjects.
do_cmdscale Execute multidimensional scaling (MDS). Calculate approximated coordinations from distances of entity pairs.

Factor

levels Returns the levels of a given factor.
as_factor Convert a given column to a factor. Compared to base R's as.factor, this function creates levels in the order in which they appear, which will be the same on every platform (base R sorts in the current locale which can vary from place to place).
fct_anon Anonymize factor levels. Replaces factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved.
fct_expand Add additional levels to a factor
fct_drop Drop unused levels.
fct_explicit_na This gives missing value an explicit factor level, ensuring that they appear in the charts.
fct_lump Makes least/most common factor levels into "other".
fct_inorder Reorder factors levels by first appearance.
fct_infreq Reorder factors levels by frequency (the most frequently appears to the least.)
fct_relevel Sets the level of a given factor column by moving any number of levels to any location.
fct_reorder Reorder factor levels by sorting based on another variable
fct_rev Reverse order of factor levels.

Range

range Returns a min and max values of a given data.
between Return TRUE or FALSE based on a given value is within a given range (between).

Statistics

cor Computes the correlation of two column values. The covariance divided by the product of the standard deviations of the two column values.
cov Computes the covariance of two column values.

List

str_detect Return TRUE or FALSE based on whether Text data contains a given text or not
list_extract Extract a value of a given list data type column based on a position or name when the data inside the list is data frame.
list_to_text Concatenates texts from all the elements of a given list data type column.
list_concat Concatenates values from multiple columns into a list.
list_n Returns number of elements inside a given list data type column for each row.

tidyr sub-function

full_seq Generate a sequence of numbers based on the values in a given column.
nesting Fix multiple columns so that only the combination of actually existing values in those columns will be used for complete() or expand() operation.

URL

param_remove Removes parameters from a given URL.
url_decode Decodes a given encoded URL.
url_encode Encodes a given URL.
url_parameters Returns decoded query parameters from url.
url_domain Returns domain (ex. "exploratory.io") from url.
url_fragment Returns fragment from url.
url_path Returns path from url.
url_port Returns port from url.
url_scheme Returns scheme (ex. "http", "https") from url.
url_suffix Returns suffix (ex. "com", "org") from url.
url_subdomain Returns subdomain (ex. "www", "blog") from url.
url_tld Returns top-level domain (ex. "com", "co") from url.
url_param Returns decoded query parameter from url.

Column

- Removes column(s). It can be used along with any column(s) selection operation like select(), gather(), etc.

Column Selection

starts_with Returns the column names that starts with a given text.
ends_with Returns the column names that ends with a given text.
contains Returns the column names that contain a given text.
matches Returns the column names that matches with a given text.
num_range Returns the column names that starts with a given text with numbers that are within a given range. It's useful especially when you have column names like X1, X2, X3, X4, etc.
one_of Returns the column names that are one of the given names.
everything Returns all the column names. It's useful when you want to have particular column(s) first before everything else.

dplyr-subfunction

vars Returns a list of user selected columns.

Model Function

xgboost_reg Create extreme gradient boosting model for regression.
xgboost_binary Create extreme gradient boosting model for binary classification.
xgboost_multi Create extreme gradient boosting model for binary classification.
randomForestReg Create random forest model for regression.
randomForestBinary Create random forest model for binary classification.
randomForestMulti Create random forest model for multi class classification.

Survival Analysis

do_survfit Calculates Survival Curve from survival time and survival status.
prediction_survfit Simulates Survival Curve for specified cohort based on a survival model.
build_coxph Builds Cox Proportional Hazard Model for survival analysis and store it in a data frame.
prediction_coxph Returns a data frame with predicted values of Cox Proportional Hazard Model.

Geo

ip_to_country Returns country names from IP addresses.

Text Analysis

do_tokenize Returns one token (e.g. word) per row after tokenizing a given text.
pair_count Count pairs of words (tokens) that cooccur within a group
do_tfidf Calculates TF-IDF for each term against a given group. TF-IDF is a weighting mechanism that calculates the importance of each word to each document by increasing the importance based on the term frequency while decreasing the importance based on the document frequency.
do_ngram Create columns of n-grams connected in sentences.
do_cosine_sim.kv Calculates the similarity between each pair of the documents using the cosine similarity algorithm. Cosine similarity measures the cosine of the angle between two vectors in the multi-dimensional space.

!

Summary
Reverses the meaning of the following condition. It is a logical negation and has a similar meaning as 'Not'.

Syntax
!

Arguments

Return Value

Example
filter(!is.na(Population))
Keep the row if Population value is NOT NA
filter(!str_detect(City, "Francisco"))
Keep the row if City value does NOT include text 'Francisco'.
filter(!Sales > 500)
Keep the row if Sales value is NOT greater than 500.

abbreviate

Summary
Abbreviate strings to at least minlength characters, such that they remain unique (if they were). First all spaces at the beginning of the string are stripped. Then (if necessary) any other spaces are stripped. Next, lower case vowels are removed (starting at the right) followed by lower case consonants. Finally if the abbreviation is still longer than minlength upper case letters are stripped.

Syntax
abbreviate(<column>, minlength = <number>, dot = <logical>)

Arguments

Return Value
Character

Example
abbreviate("Exploratory")
Returns "Expl"
abbreviate("Exploratory", 3)
Returns "Exp"

abs

Summary
Returns the absolute value.

Syntax
abs(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(abs = abs(ARR_DELAY))

acos

Summary
Returns the inverse cosine of a value, in radians

Syntax
acos(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = acos(X1))
Create a new column for calculating the acos.

acosh

Summary
Returns the inverse hyperbolic cosine of a number.

Syntax
acosh(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = acosh(X1))
Create a new column for calculating the acosh.

all

Summary
Return TRUE if the condition is satisfied for all the rows or a group of rows.

Syntax
all(<condition>, na.rm = <logical>)

Arguments

Return Value
Logical

Example
summarize(is_matched_all = all(userid == username))
Return TRUE if all the userid values and username values are same, otherwise FALSE. summarize(is_matched_all = all(revenue > 500))
Return TRUE if all the revenue values are greater than 500, otherwise FALSE.

any

Summary
Return TRUE if the condition is satisfied for any of the rows or a group of rows.

Syntax
any(<condition>, na.rm = <logical>)

Arguments

Return Value
Logical

Example
summarize(is_matched_all = any(userid == username))
Return TRUE if any of the userid values and username values are same, otherwise FALSE. summarize(is_matched_all = all(revenue > 500))
Return TRUE if any of the revenue values are greater than 500, otherwise FALSE.

as.character

Summary
Convert a given data to character data type.

Syntax
as.character(<column>)

Arguments

Return Value
Character

Example
as.character(123.11)
Returns "123.11" as character.

as.Date

Summary
Convert a given data to Date data type. Date data type doesn't include Time. The default formats follow the rules of the ISO 8601 international standard which expresses a day as "2001-02-03". If the given data is number as 'internal system number (# of days since some origin)' then you want to provide which date to be used as 'origin'.

Syntax
as.Date(<column>, format = <date_format>, tz = <timezone>, origin = <text>)

Arguments

Return Value
Date

Example
as.Date("2015-10-10")
Returns "2015-10-10"
as.Date("2015-10-10", tz = "America/Los_Angeles")
Returns "2015-10-10"
as.Date("10-10-10", format = "%d-%m-%y")
Returns "2010-10-10"
as.Date(35981, origin = "1899-12-30") # This is recommended for Windows Excel data.
Returns "1998-07-05"
as.Date(34519, origin = "1904-01-01") # This is recommended for Mac Excel data.
Returns "1998-07-05"

as.double

Summary
Convert data to double precision number data type. It is identical to numeric.

Syntax
as.double(<column>)

Arguments

Return Value
Double

Example
as.double("12345")
Returns 12345
as.double("12345.10")
Returns 12345.1

as.factor

Summary
Encode a column as a factor.

Syntax
factor(<column>, levels = <text>, labels = <text>, exclude = <text>, ordered = <logical>, nmax = <number>)

Arguments

Return Value
Factor

Example
columnA - "iMac", "iPod", "iPhone", "iPod", "iPhone"
as.factor(columnA)
Returns iMac, iPod, iPhone, iPod, iPhone (It has Levels information of : iMac iPhone iPod)

as.integer

Summary
Convert data to integer number data type.

Syntax
as.integer(<column>)

Arguments

Return Value
Integer

Example
as.integer("12345")
Returns 12345
as.integer("12345.10")
Returns 12345

as.logical

Summary
Convert data to Logical (Boolean) data type.

Syntax
as.logical(<column>)

Arguments

Return Value
Logical

Example
as.logical("TRUE")
Returns TRUE
as.logical("true")
Returns TRUE
as.logical("T")
Returns TRUE
as.logical("True")
Returns TRUE
as.logical(1)
Returns TRUE

as.numeric

Summary
Convert data to

Syntax
as.numeric(<column>, units = <difftime_units>)

Arguments

Return Value
Numeric

Example
as.numeric("12345")
Returns 12345
as.numeric("12345.10")
Returns 12345.1
as.numeric(as.Date("2015-01-30") - as.Date("2015-01-15"), units = "days")
Returns 15
as.numeric(as.Date("2015-01-30") - as.Date("2015-01-15"), units = "weeks")
Returns 2.142857

as.POSIXct

Summary
Convert data to Date/Time data type.

Syntax
as.POSIXct(<column>, format = <date_time_format>, tz = <timezone>, origin = <text>)

Arguments

Return Value
POSIXct

Example
as.POSIXct("2015-10-10 13:10:05") Returns "2015-10-10 13:10:05 PDT"
as.POSIXct("2015-10-10 13:10:05", tz = "America/Los_Angeles") Returns "2015-10-10 13:10:05 PDT"
as.POSIXct("2015-10-10 13:10:05", tz = "Asia/Tokyo") Returns "2015-10-10 13:10:05 JST"
as.POSIXct("05-10-15T13:10:05", format = "%d-%m-%yT%H:%M") Returns "2015-10-05 13:10:00 PDT"

asin

Summary
Returns the inverse sine of a value, in radians.

Syntax
asin(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = asin(X1))
Create a new column for calculating the asin.

asinh

Summary
Returns the inverse hyperbolic sine of a number.

Syntax
asinh(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = asinh(X1))
Create a new column for calculating the asinh.

atan

Summary
Returns the inverse tangent of a value, in radians.

Syntax
atan(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = atan(X1))
Create a new column for calculating the atan.

atan2

Summary
Returns the angle between the x-axis and a line segment from the origin (0,0) to specified coordinate pair (x,y), in radians.

Syntax
atan2(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = atan2(X1, Y1))
Create a new column for calculating the atan2.

atanh

Summary
Returns the inverse hyperbolic tangent of a number.

Syntax
atanh(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = atanh(X1))
Create a new column for calculating the atanh.

c

Summary
This is a generic function which combines its arguments.

Syntax
c(<value1>, <value2>, ... , recursive = <logical>)

Arguments

Example
c(1:10)
Returns 1 2 3 4 5 6 7 8 9 10
c(1:10, 1:3)
Returns 1 2 3 4 5 6 7 8 9 10 1 2 3
c("a", "b", "c")
Returns "a" "b" "c"

ceiling

Summary
Rounds a number up to the nearest integer. ceiling(3.475) is 4

Syntax
ceiling(<column_num>)

Arguments

Return Value
Numeric

Example
ceiling(3.475)
Returns 4

chartr

Summary
Translates each character in x that is specified in old to the corresponding character specified in new.

Syntax
chartr(<old_text>, <new_text>, <column_text>)

Arguments

Return Value
Character

Example
x = "abcdef 123456" chartr("abc", "xyz", x)
Returns "xyzdef 123456"
chartr("a-f", "u-z", x)
Returns "uvwxyz 123456"
chartr("a-f", "xxxxxx", x)
Returns "xxxxxx 123456"
chartr("123", "000", x)
Returns "abcdef 000456"

cos

Summary
Returns the cosine of an angle provided in radians.

Syntax
cos(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = cos(X1))
Create a new column for calculating the cos.

cosh

Summary
Returns the hyperbolic cosine of any real number.

Syntax
cosh(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = tanh(X1))
Create a new column for calculating the cosh.

cospi

Summary
Returns the cos(pi * ).

Syntax
cospi(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = cospi(X1))
Create a new column for calculating the cospi.

cummax

Summary
Returns the cumulative maxima.

Syntax
cummax(<column_num>)

Arguments

Return Value
Numeric

Example
// X = c(1, 2, 3, 2, 1) cummax(X)
Returns 1, 2, 3, 3, 3

cummin

Summary
Returns the cumulative minima.

Syntax
cummin(<column_num>)

Arguments

Return Value
Numeric

Example
// X = c(1, 2, 3, 2, 1) cummin(X)
Returns 1, 1, 1, 1, 1

cumprod

Summary
Returns the cumulative products.

Syntax
cumprod(<column_num>)

Arguments

Return Value
Numeric

Example
// X = c(1, 2, 3, 2, 1) cumprod(X)
Returns 1, 2, 6, 12, 12

cumsum

Summary
Returns the cumulative sums.

Syntax
cumsum(<column_num>)

Arguments

Return Value
Numeric

Example
// X = c(1, 2, 3, 2, 1) cumsum(X)
Returns 1, 3, 6, 8, 9

cut

Summary
Divides the range of values into intervals and return the values in x according to which interval they fall.

Syntax
cut(<column_num>, breaks = <number>, labels = <text>, include.lowest = <logical>, right = <logical>, dig.lab = <number>), ordered_result = <logical>)

Arguments

Example
//x = c(1,2,3,4,5,6,7,8,9,10)
cut(x, breaks = 5)
Returns (0.991,2.8] (0.991,2.8] (2.8,4.6] (2.8,4.6] (4.6,6.4] (4.6,6.4] (6.4,8.2] (6.4,8.2] (8.2,10] (8.2,10]
cut(x, breaks = 2)
Returns (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (0.991,5.5] (5.5,10] (5.5,10] (5.5,10] (5.5,10] (5.5,10]
cut(x, breaks = 2, labels = c(1,2))
Returns 1 1 1 1 1 2 2 2 2 2
cut(x, breaks = 2, labels = c("A", "B"))
Returns A A A A A B B B B B
mutate(a = cut(FL_DATE, breaks = 5, start.on.monday = FALSE))
Returns one of the five buckets based on the date value

difftime

Summary
Calculates the difference between two given dates/times.

Syntax
difftime(<column_date>, <column_date>, tz = <timezone>, units = <difftime_units>)

Arguments

Return Value
Difftime

Example
difftime("2015-12-10", "2015-12-30"))
Returns Time difference of -20 days.
difftime("2015-12-10", "2015-12-30", unit = "hours")
Returns Time difference of -480 hours.
as.numeric(difftime("2015-12-10", "2015-12-30"))
Returns -20.
as.numeric(difftime("2015-12-10", "2015-12-30"), units = "hours")
Returns -480.

duplicated

Summary
Returns a logical value (TRUE or FALSE) indicating if it is a duplicated value or rows.

Syntax
duplicated(<column(s)>, incomparables = <logical>)

Arguments

Return Value
Logical

Example
// x <- c("a", "b", "c", "a", "b")
duplicated(x)
Returns FALSE FALSE FALSE TRUE TRUE duplicated(x, incomparables = "a")
Returns FALSE FALSE FALSE FALSE TRUE
mutate(a = duplicated(ARR_TIME)) Returns TRUE or FALSE for each row
filter(duplicated(ARR_TIME, ARR_DELAY))
Keep only the duplicated rows
summarise(a = sum(!duplicated(ARR_TIME)))
Returns # of NOT duplicated values for each group. This is same as the below.
summarise(a = sum(length(unique(ARR_TIME))))

exp

Summary
Returns Euler's number, e (~2.718) raised to a power.

Syntax
exp(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(revenue_exp = exp(revenue))
Returns the exponential value of revenue

factor

Summary
Encode a column as a factor.

Syntax
factor(<column>, levels = <text>, labels = <text>, exclude = <text>, ordered = <logical>, nmax = <number>)

Arguments

Return Value
Factor

Example

mutate(color = factor(color, levels = c("Gold","Silver","Bronze"), ordered=TRUE))

Before:

color counts
Bronze 8
Silver 12
Gold 15

After:

color counts
Gold 15
Silver 12
Bronze 8

floor

Summary
Rounds a number down to the nearest integer. ceiling(3.475) is 3.

Syntax
ceiling(<column_num>)

Arguments

Return Value
Numeric

Example
ceiling(3.475)
Returns 3

iconv

Summary
Convert a given text from one encoding to another.

Syntax
iconv(<column>, from = <encoding>, to = <encoding>, sub = <text>, mark = <logical>, toRaw = <logical>)

Arguments

Return Value
Character

Example
iconv(x, "ISO_8859-1", "UTF-8")
Converts a given text from "ISO8859-1" to "UTF-8"._

ifelse

Summary
Returns different values based on the conditional result.

Syntax
ifelse(<condition>, <return_value_when_TRUE>, <return_value_when_FALSE>)

Arguments

Return Value

Example
//x <- 1:5
ifelse(x > 3, "Bigger", "Smaller")
Returns "Smaller" "Smaller" "Smaller" "Bigger" "Bigger"

intersect

Summary
Keep the rows that appear in all of the given data frames.

Syntax
Intersect(<data_set(s)>, ...)

Example
Intersect(DATA_2015)
Keep the rows that appear in the original data and DATA_2015.

is.character

Summary
Returns true if a given object is character type.

Syntax
is.character

Arguments

Return Value
Logical

Example

mutate_if(is.character, str_to_lower)
Convert all the letters to lowercase for all the character type columns.

is.double

Summary
Returns true if a given object is double type.

Syntax
is.double

Arguments

Return Value
Logical

Example

mutate_if(is.double, round)
Round all the double type columns.
mutate_if(is.double, funs(. * 0.2))
Multiply all the double columns by 0.2 overriding existing columns.
mutate_if(is.double, funs(calc = . * 0.2))
Multiply all the double columns by 0.2 as new columns.
summarize_if(is.double, mean) Calculate the mean (average) for all the double columns.

is.factor

Summary
Returns true if a given object is factor type.

Syntax
is.factor

Arguments

Return Value
Logical

Example

mutate_if(is.factor, as.character)
Convert all the factor type columns to character type.

is.list

Summary
Returns true if a given object is list type.

Syntax
is.list

Arguments

Return Value
Logical

Example

select_if(is.list)
Select only list type columns.

is.logical

Summary
Returns true if a given object is logical type.

Syntax
is.logical

Arguments

Return Value
Logical

Example

mutate_if(is.logical, as.numeric)
Convert the logical values (TRUE, FALSE) to numeric values (1,0).
summarize_if(is.logical, sum) Count the number of TRUE values for all the logical type columns.

is.matrix

Summary
Returns true if a given object is matrix type.

Syntax
is.matrix

Arguments

Return Value
Logical

Example

mutate_if(is.matrix, scale)
Scale and center all the columns of matrices in all the matrix type columns.

is.na

Summary
Returns TRUE when it is NA.

Syntax
is.na(<column>)

Arguments

Return Value
Logical

Example
filter(is.na(ARR_DELAY))
Keep only rows where ARR_DELAY is NA
filter(!is.na(ARR_DELAY))
Keep only rows where ARRDELAY is NOT NA_

is.null

Summary
Returns TRUE when it is null.

Syntax
is.null(<column>)

Arguments

Return Value
Logical

Example
filter(is.null(ARR_DELAY))
Keep only rows where ARR_DELAY is null
filter(!is.null(ARR_DELAY))
Keep only rows where ARRDELAY is NOT null_

is.numeric

Summary
Returns true if a given object is numeric type.

Syntax
is.numeric

Arguments

Return Value
Logical

Example

mutate_if(is.numeric, funs(. * 0.2))
Multiply all the numeric columns by 0.2 overriding existing columns.
mutate_if(is.numeric, funs(calc = . * 0.2))
Multiply all the numeric columns by 0.2 as new columns.
summarize_if(is.numeric, mean) Calculate the mean (average) for all the numeric columns.

isTRUE

Summary
Returns true if and only if a given value is a length-one logical vector whose only element is TRUE.

Syntax
isTRUE()

Arguments

Return Value
Logical

Example
isTRUE(TRUE)
Returns TRUE
isTRUE(FALSE)
Returns FALSE
summarise(a = isTRUE(mean(ARR_DELAY) > 120))
Returns TRUE or FALSE based on the average of ARRDELAY for each group._

length

Summary
Returns the length of the values.

Syntax
length(<column>)

Arguments

Return Value
Numeric

Example
summarize(text_length = length(TAIL_NUM))
Returns # of values for each group
summarize(text_length = length(unique(TAIL_NUM)))
Returns # of unique values of 'TAIL_NUM' for each group

levels

Summary
Returns the levels of a given factor.

Syntax
levels(<column>)

Arguments

Example
// x <- as.factor(c("a", "b", "c", "d", "e", "a", "b", "a"))
levels(x)
Returns "a" "b" "c" "d" "e"
summarise(n = length(levels(as.factor(flight$CARRIER))))
Returns 16

list

Summary
Construct a list

Syntax
list(<column>)

Arguments

Return Value
List

Example
list(sales = 0, profit = 10)
Create a named list of [sales = 0, profit = 10]
complete(year, product, fill = list(sales = 0))
Set 0 for 'sales' column for missing values.

log

Summary
Returns logarithms, by default natural logarithms.

Syntax
log(<column_num>, base = <number>)

Arguments

Return Value
Numeric

Example
mutate(Sales_log = log(Sales))

log10

Summary
Computes common (i.e., base 10) logarithms

Syntax
log10(<column_num>, base = <number>)

Arguments

Return Value
Numeric

Example
mutate(Sales_log = log10(Sales))

log1p

Summary
Computes log(1+x) accurately also for |x| << 1.

Syntax
log1p(<column_num>, base = <number>)

Arguments

Return Value
Numeric

Example
mutate(Sales_log = log1p(Sales))

log2

Summary
computes binary (i.e., base 2) logarithms

Syntax
log2(<column_num>, base = <number>)

Arguments

Return Value
Numeric

Example
mutate(Sales_log = log2(Sales))

max

Summary
Returns the maximum value in a numeric column.

Syntax
max(<column>, na.rm = <logical>)

Arguments

Example
summarize(revenue_max = max(revenue))
Create a new column for calculating the max value of of revenue for each group.

mean

Summary
Returns the numerical average (mean) value.

Syntax
mean(<column_num_logic_date>, na.rm = <logical>, trim = )

Arguments

Return Value
Numeric

Example
summarize(revenue_mean = mean(revenue))
Create a new column for calculating the average value of revenue for each group.

min

Summary
Returns the minimum value in a numeric column.

Syntax
min(<column>, na.rm = <logical>)

Arguments

Example
summarize(revenue_min = min(revenue))
Create a new column for calculating the minimum of revenue for each group.

months

Summary
Convert a number to month periods so you can use it for further calculations.

Syntax
months(<column_num>)

Return Value
Period

Example
months(1) returns "1m 0d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + months(1)
returns "2015-11-01 06:15:30 UTC"

nchar

Summary
Returns # of letters or size of a given value.

Syntax
nchar(<column>, type = "chars"|"bytes"|"width", allowNA = <logical>)

Arguments

Return Value
Numeric

Example
nchar("Exploratory")
Returns 11
nchar("西田")
Returns 2
nchar("西田", type = "bytes")
Returns 6

pmax

Summary
Returns the parallel maxima of the input values.

Syntax
pmax(<column_num_logic_date>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
x <- c(1, 2, 3, 4, 5) y <- c(3, 3, 3, 3, 3) pmax(x, y)
Returns 3 3 3 4 5
mutate(a = pmax(ARR_DELAY, DEP_DELAY, na.rm = TRUE))
Returns a bigger value between ARR_DELAY and DEP_DELAY values.

pmin

Summary
Returns the parallel minima of the input values.

Syntax
pmin(<column_num_logic_date>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
x <- c(1, 2, 3, 4, 5)
y <- c(3, 3, 3, 3, 3)
pmin(x, y)
Returns 1 2 3 3 3
mutate(a = pmin(ARR_DELAY, DEP_DELAY, na.rm = TRUE))
Returns a smaller value between ARR_DELAY and DEP_DELAY values.

prod

Summary
Returns the product value by multiplying a series of numbers in a given data together.

Syntax
prod(<<column_num_logic>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
// x <- 1:5
prod(x)
Returns 120

range

Summary
Returns a min and max values of a given data.

Syntax
range(<column>, na.rm = <logical>)

Arguments

Return Value

Example
// x <- c(1, 2, 3, 4, 5)
Returns 1 5
filter(ARR_DELAY %in% range(DEP_DELAY, na.rm = TRUE))
Keep only rows with ARR_DELAY is in a range of DEP_DELAY. This is same as the below. filter(between(ARR_DELAY, range(DEP_DELAY, na.rm = TRUE)[1], range(DEP_DELAY, na.rm = TRUE)[2]))

rep

Summary
Repeats given numbers.

Syntax
rep(<column>, times = <number>, length_out = <number>, each = <number>)

Arguments

Return Value
Numeric

Example
// x <- c(1, 2, 3, 4, 5)
rep(x, 2)
Returns 1 2 3 4 5 1 2 3 4 5
rep(x, times = 2)
Returns 1 2 3 4 5 1 2 3 4 5
rep(x, length.out = 8)
Returns 1 2 3 4 5 1 2 3
rep(x, times = 2, each = 2)
Returns 1 1 2 2 3 3 4 4 5 5 1 1 2 2 3 3 4 4 5 5
rep(x, length.out = 8, each = 2)
Returns 1 1 2 2 3 3 4 4

rev

Summary
Reverses the entries in a given data.

Syntax
rev(<column>)

Arguments

Return Value

Example
// x <- c(1, 2, 3, 4, 5)
rev(x)
Returns 5 4 3 2 1

round

Summary
Rounds a number to a certain number of decimal places.

Syntax
round(<column_num>, digits = <number>)

Arguments

Return Value
Numeric

Example
round(3.475, digits=2)
Returns 3.48

seq

Summary
Returns a sequence of numbers.

Syntax
seq(from = <number>, to = <number>, by = <number>, length.out = <number>, along.with = <number>)

Arguments

Return Value
Numeric

Example
seq(1, 5)
Returns 1 2 3 4 5 seq(1, 5, by = 2)
Returns 1 3 5
seq(1:5)
Returns 1 2 3 4 5
seq(0, 1, length.out = 11)
Returns 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

setdiff

Summary
Keep the rows that appear in the first data frame but not the other data frames.

Syntax
setdiff(<data_set(s)>, ...)

Example
setdiff(DATA_2015)
Keep the rows that appear in the original data frame but not in DATA_2015

sign

Summary
Returns either 1, 0, or -1 depending on whether the input value is positive, zero, or negative, respectively.

Syntax
sign(<column_num>)

Arguments

Return Value
Numeric

Example
// x <- c(1, 2, 10, 0, -1, -2, -10)
sign(x)
Returns 1 1 1 0 -1 -1 -1

signif

Summary
Rounds the values in its first argument to the specified number of significant digits.

Syntax
signif(<column_num>, digits = <number>)

Arguments

Return Value
Numeric

Example
signif(3.475, digits=2)
Returns 3.5

sin

Summary
Returns the sine of an angle provided in radians.

Syntax
sin(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = sinh(X1))
Create a new column for calculating the sinh.

sinh

Summary
Returns the hyperbolic sine of any real number.

Syntax
sinh(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = sinh(X1))
Create a new column for calculating the sinh.

sinpi

Summary
Returns the sin(pi * ).

Syntax
sinpi(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = sinpi(X1))
Create a new column for calculating the sinpi.

sqrt

Summary
Returns the square root of the values.

Syntax
sqrt(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(revenue_sq = sqrt(revenue))
Returns the square root of revenue values

sum

Summary
Returns the sum of all the values.

Syntax
sum(<column_num_logic>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
summarize(total = sum(revenue))
Create a new column for calculating the sum of revenue for each group.

tan

Summary
Returns the tangent of an angle provided in radians.

Syntax
tan(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = sinh(X1))
Create a new column for calculating the sinh.

tanh

Summary
Returns the hyperbolic tangent of any real number.

Syntax
tanh(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = tanh(X1))
Create a new column for calculating the tanh.

tanpi

Summary
Returns the tan(pi * ).

Syntax
tanpi(<column_num>)

Arguments

Return Value
Numeric

Example
mutate(a = tanpi(X1))
Create a new column for calculating the tanpi.

trunc

Summary
Truncates the numbers by omitting digits.

Syntax
trunc(<column_num>)

Arguments

Return Value
Numeric

Example
trunc(3.475)
Returns 3

union

Summary
Keep the rows that appear in one of the given data frames.

Syntax
union(<data_set(s)>, ...)

Example
union(DATA_2015)
Keep the rows that appear in either or both the original data frame and DATA_2015.

unique

Summary
Returns number of unique values.

Syntax
unique(<column>, incomparables = <logical>, fromLast = <logical>, nmax = <number>)

Arguments

Return Value
Numeric

Example
summarize(text_length = length(unique(TAIL_NUM)))
Returns # of unique values of 'TAILNUM' for each group_

which

Summary
Returns the TRUE indices of a logical object or based on the condition.

Syntax
which(<condition>, arr.ind = <logical>, useNames = <logical>)

Arguments

Return Value
Numeric

Example
x <- c("a", "b", "c", "d", "e")
which(x == "c")
Returns 3
select(which(sapply(., is.numeric)))
Returns only numeric columns.

xor

Summary
Performs an exclusive-or check across two values and return TRUE when only one of them is TRUE.

Syntax
xor(<value1>, <value2>)

Arguments

Return Value
Logical

Example
xor(TRUE, TRUE)
Returns FALSE
xor(FALSE, TRUE)
Returns TRUE
xor(FALSE, FALSE)
Returns FALSE

anti_join

Summary
Return all rows from the current data frame where there are not matching values in the target, keeping just columns from the current.

Syntax
anti_join(<data_set>, by = "<column>")
anti_join(<data_set>, by = c("<source_column>" = "<target_column>"))

Arguments

Example
anti_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
anti_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
anti_join(AIRPORT, by = c("DEST" = "CODE")
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.

arrange

Summary
Sort rows by given column(s).

Syntax
arrange(<column(s)>)

Used Together
desc

Example
arrange(POPULATION)
Sort by POPULATION values in a ascending order.
arrange(desc(POPULATION))
Sort by POPULATION values in a descending order.
arrange(POPULATION, PRODUCTION)
Sort by POPULATION, PRODUCTION columns in a ascending order.
arrange(-POPULATION, -PRODUCTION)
Sort by POPULATION, PRODUCTION columns in both descending order. This is an equivalent to the following
arrange(desc(POPULATION), desc(PRODUCTION))

between

Summary
Return TRUE or FALSE based on a given value is within a given range (between).

Syntax
between(<column_num_date>, <left>, <right>)

Arguments

Return Value
Logical

Example
filter(between(revenue, 1000, 6000))
Keep data whose revenue is between 1000 and 6000.

bind_cols

Summary
Bind multiple data frames by column.

Syntax
bind_cols(<data_set(s)>)

Example
bind_cols(ECONOMY)
Add columns from a data frame 'ECONOMY' to the existing data frame.

bind_rows

Summary
Bind multiple data frames by row.

Syntax
bind_rows(<data_set(s)>, .id = <text>)

Arguments

Example
bind_rows(DATA2015)
Add a data frame 'DATA2015' to the existing data.
bind_rows(DATA2014, DATA2015)
Add rows from a data frame 'DATA2014' and 'DATA2015' to the existing data.
bind_rows(DATA2014, DATA2015, .id = "id_column")
It will create a column called 'id_column' to link each row to its original data frame.
bind_rows("d2014" = DATA2014, "d2015" = DATA2015, .id = "id_column")
It will create a column called 'id_column' and use "d2014" and "d2015" to mark each row from two data frames.

count

Summary
Summarize the data by either calling n() or sum() for each group.

Syntax
count(<column(s)>, wt = <column_num_or_base_aggregate_expression>, sort = <logical>)

Arguments

Example
count(CARRIER)
Groups by 'CARRIER' and returns the number of rows for each 'CARRIER'.
count(CARRIER, sort = TRUE)
Groups by 'CARRIER' and returns the number of rows for each 'CARRIER' from the highest number to the lowest.
count(CARRIER, wt = ARR_DELAY)
Groups by 'CARRIER' and returns the sum total of 'ARR_DELAY'.
count(CARRIER, TAIL_NUM, wt = ARR_DELAY)
Groups by 'CARRIER' and 'TAIL_NUM', and returns the sum total of 'ARR_DELAY'.
count(CARRIER, wt = ARR_DELAY, sort = TRUE)
Groups by 'CARRIER' and returns the sum total of 'ARR_DELAY' and sorted by the total number.
count(CARRIER, wt = mean(ARR_DELAY, na.rm = TRUE))
Groups by 'CARRIER' and returns the average of 'ARRDELAY' after removing NA values._

cumall

Summary
Returns TRUE if all the values up to the current position are TRUE.

Syntax
cumall(<condition>)

Arguments

Return Value
Logical

Example
// X = TRUE, TRUE, FALSE, TRUE, FALSE
cumall(X)
Returns TRUE, TRUE, FALSE, FALSE, FALSE

cumany

Summary
Returns TRUE if any of the values up to the current posiiton is TRUE.

Syntax
cumany(<condition>)

Arguments

Return Value
Logical

Example
// X = TRUE, TRUE, FALSE, TRUE, FALSE
cumany(X)
Returns TRUE, TRUE, TRUE, TRUE, TRUE

cume_dist

Summary
Cumulative distribution. Proportion of all values less than or equal to the current rank. Missing values are left as is.

Syntax
cume_dist(<column_num>)

Return Value
Numeric

Example
mutate( = cume_dist())
Create a new column to have the cumulative distribution number.
filter(cume_dist() < .25)
Select rows whose cumulative distribution number are less .25.

cummean

Summary
Returns the cumulative mean (average).

Syntax
cummean(<column_num>)

Arguments

Return Value
Numeric

Example
// X = 1, 2, 3, 2, 1
cummean(X)
Returns 1, 3, 6, 8, 9

dense_rank

Summary
Ranks with no gaps. Missing values are left as is.

Syntax
dense_rank(<column_num>)

Return Value
Numeric

Example
mutate(population_rank = dense_rank(population))
Create a new column to have the rank number based on 'population' values.
filter(dense_rank(population) < 10)
Select rows whose rank numbers are less 10.

desc

Summary
Change the sorting order to a descending order. e.g. 9 to 1 instead of 1 to 9, z to a instead of a to z.

Syntax
desc(<column>)

Example
arrange(desc(POPULATION))
Sort by POPULATION values in a descending order.
arrange(desc(POPULATION), desc(PRODUCTION))
Sort by POPULATION, PRODUCTION columns in both descending order.

distinct

Summary
Select distinct/unique rows. Only the first row will be preserved. Set .keep_all argument to TRUE to keep all the original columns.

Syntax
distinct(<column(s)>, .keep_all = <logical>)

Arguments

Example
distinct()
Keep only unique/distinct rows based on all the columns.
distinct(CARRIER, TAIL_NUM, FL_NUM)
Keep only unique/distinct rows based on CARRIER, TAIL_NUM, FL_NUM columns.
distinct(CARRIER, .keep_all = TRUE)
Keep only unique/distinct rows based on CARRIER and return all the columns.

filter

Summary
Select rows with conditions.

Syntax
filter(<condition>)

Used Together
is.na, between,

Example
filter(revenue < 6000)
Keep data whose revenue is less than 6000.
filter(revenue < 6000 & region == “West")
Keep data whose revenue is less than 6000 AND region is "West".
filter(revenue < 6000 | region == “West")
Keep data whose revenue is less than 6000 OR region is "West".
filter(between(revenue, 1000, 6000))
Keep data whose revenue is between 1000 and 6000.
filter(is.na(revenue))
Keep data whose revenue is NA.
filter(!is.na(revenue))
Keep data whose revenue is NOT NA.

first

Summary
returns the first value of all the values.

Syntax
first(<column>, order_by = <column(s)>, default = <text>)

Arguments

Example
summarize(total = first(name))
Create a new column for returning the first value for each group.

full_join

Summary
Return all rows and all columns from both the current data frame and the target data frame.

Syntax
full_join(<data_set>, by = "<column>", suffix = c("<source_suffix>", "<target_suffix>"))
full_join(<data_set>, by = c("<source_column>" = "<target_column>"), suffix = c("<source_suffix>", "<target_suffix>"))

Arguments

Example
full_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
full_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
full_join(AIRPORT, by = c("DEST" = "CODE")
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.

funs

Summary
funs provides a flexible way to generate a named list of functions for input to other functions like summarise_each.

Syntax
funs(<function_name(s)>)

Arguments

A list of functions specified by: * Their name, "mean" * The function itself, mean * A call to the function with . as a dummy parameter, mean(., na.rm = TRUE)

Example
funs(mean, "mean", mean(., na.rm = TRUE))
# Overide default names
funs(m1 = mean, m2 = "mean", m3 = mean(., na.rm = TRUE))

group_by

Summary
Converts the data frame into a grouped data frame where the following operations will be performed based on the groups. Grouping should be done before you want to aggregate values.

Syntax
group_by(<column(s)>)

Example
group_by(COUNTRY)
Group the data by COUNTRY.
group_by(REGION, COUNTRY)
Group the data by REGION and COUNTRY.
group_by(FIRSTNAME + LASTNAME)
Group by a dynamically generated value of 'FIRSTNAME + LASTNAME'

inner_join

Summary
Return all rows from the current data frame where there are matching values in the current, and all columns from the current and the target.

Syntax
inner_join(<data_set>, by = "<column>", suffix = c("<source_suffix>", "<target_suffix>"))
inner_join(<data_set>, by = c("<source_column>" = "<target_column>"), suffix = c("<source_suffix>", "<target_suffix>"))

Arguments

Example
inner_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
inner_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
inner_join(AIRPORT, by = c("DEST" = "CODE")
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.

lag

Summary
Provides access to a row at a given physical offset prior to that position.

Syntax
lag(<column>, n = <number>, default = <text>, order_by = <column(s)>)

Arguments

Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
lag(x)
Returns NA 1 2 3 4 5 6 7 8 9
lag(x, 2)
Returns NA NA 1 2 3 4 5 6 7 8

last

Summary
returns the last value of all the values.

Syntax
last(<column>, order_by = <column(s)>, default = <text>)

Arguments

Example
summarize(total = last(name))
Create a new column for returning the last value for each group.

lead

Summary
Provides access to a row at a given physical offset that follows the current row.

Syntax
lead(<column>, n = <number>, default = <text>, order_by = <column(s)>)

Arguments

Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
lead(x)
Returns 2 3 4 5 6 7 8 9 10 NA
lead(x, 2)
Returns 3 4 5 6 7 8 9 10 NA NA

left_join

Summary
Return all rows from the current data frame, and all columns from the current and the target. Rows in the current with no match in the target will have NA values in the new columns. If there are multiple matches between the current and the target, all combinations of the matches are returned.

Syntax
left_join(<data_set>, by = "<column>", suffix = c("<source_suffix>", "<target_suffix>"))
left_join(<data_set>, by = c("<source_column>" = "<target_column>"), suffix = c("<source_suffix>", "<target_suffix>"))

Arguments

Example
left_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
left_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
left_join(AIRPORT, by = c("DEST" = "CODE")
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.
left_join(AIRPORT, by = c("DEST" = "CODE", "MONTH" = "MON")
Join with AIRPORT data frame using DEST and MONTH columns from the original data frame and CODE and MON from the target respectively.

min_rank

Summary
Ranks. Ties get min rank. Missing values are left as is.

Syntax
min_rank(<column_num>)

Return Value
Numeric

Example
mutate(population_rank = min_rank(population))
Create a new column to have the rank number based on 'population' values. The smaller the numbers are the higher the ranks are.
mutate(population_rank = min_rank(desc(population)))
Create a new column to have the rank number based on 'population' values. The bigger the numbers are the higher the ranks are.
filter(min_rank(population) < 10)
Select rows whose rank numbers are less 10.

mutate

Summary
Mutate creates new column(s) with given expressions and preserves existing . You can use transmute to create new columns while removing the original ones.

Syntax
mutate(<new_column> = <expression>, ...)

Example
mutate(profit = revenue - expense)
Create 'profit' column based on the calculation result of 'revenue - expense'.

n

Summary
Returns the count of the rows for each group.

Syntax
n()

Arguments

Return Value
Numeric

Example
summarize(total = n())
Create a new column for calculating the number of entries for each group.

n_distinct

Summary
Returns the count of unique values.

Syntax
n_distinct(<column(s)>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
summarize(total = n_distinct(TAIL_NUM))
Create a new column for calculating the count of unique values of TAILNUM for each group._

summarize(total = n_distinct(STATE, CITY))
Create a new column for calculating the count of unique combinations of STATE and CITY for each group.

nth

Summary
returns the nth value of all the values.

Syntax
nth(<column>, n = <number>, order_by = <column(s)>, default = <text>)

Arguments

Example
summarize(total = nth(name, 5))
Returns the fifth value for each group.

summarize(total = nth(name, -5))
Returns the fifth value from the end for each group.

ntile

Summary
Breaks the column values into n buckets. Missing values are left as is.

Syntax
ntile(<column>, <number>)

Arguments

Return Value
Numeric

Example
mutate(quartile = ntile(revenue, 4))
Breaks 'revenue' column values into 4 buckets to generate quartile.

percent_rank

Summary
Ranks rescaled to [0, 1]. Missing values are left as is.

Syntax
percent_rank(<column_num>)

Return Value
Numeric

Example
mutate(population_rank = percent_rank(population))
Create a new column to have the percent rank number based on 'population' values.
filter(percent_rank(population) < .25)
Select rows whose percent rank numbers are less .25.

rename

Summary
Rename existing column names.

Syntax
rename(<new_column_name> = <column>, ...)

Example
rename(state = X1)
Rename X1 with 'state'
rename(state = X1, country = X2)
Rename X1 with 'state' and X2 with 'country'

right_join

Summary
Return all rows from the target data frame, and all columns from the current and the target

Syntax
right_join(<data_set>, by = "<column>", suffix = c("<source_suffix>", "<target_suffix>"))
right_join(<data_set>, by = c("<source_column>" = "<target_column>"), suffix = c("<source_suffix>", "<target_suffix>"))

Arguments

Example
right_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
right_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
right_join(AIRPORT, by = c("DEST" = "CODE")
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.

row_number

Summary
Returns row numbers. Equivalent to Rank.

Syntax
row_number()

Arguments

Return Value
Numeric

Example
mutate(rowid = row_number())
Returns an unique row number for each row.

sample_frac

Summary
Sample n fraction of rows from the data frame.

Syntax
sample_frac(<number>, weight=<column_num>, replace=<logical>)

Arguments

Example
sample_frac(.2)
Select randomly selected 20% of of data.
sample_frac(.2, weight = ARR_DELAY)
Select randomly selected 20% of of data with a weight on ARRDELAY column._

sample_n

Summary
Sample n rows from the data frame.

Syntax
sample_n(<number>, weight=<column_num>, replace=<logical>)

Arguments

Example
sample_n(100)
Select 100 randomly selected rows.
sample_n(100, weight = ARR_DELAY)
Select 100 randomly selected rows from the data with a weight on ARRDELAY column._

select

Summary
Select column(s) by column names.

Syntax
select(<column(s)>,...)
select(-<column(s)>,...)
select(starts_with(<text>, ignore.case = <logical>))
select(ends_with(<text>, ignore.case = <logical>))
select(contains(<text>, ignore.case = <logical>))
select(matches(<text>, ignore.case = <logical>))
select(num_range(<text>, <start_num>:<end_num>))
select(one_of(<text1>, <text2>, ...))
select(<column>, everything())

Arguments

Used Together
starts_with, ends_with, contains, matches, num_range, one_of, everything

Example
select(X, Y)
Select column X and Y.
select(-X,- Y)
Remove column X and Y.
select(starts_with("X", ignore.case = TRUE))
Select only columns whose name starts with "X"
select(ends_with("ABR", ignore.case = TRUE))
Select only columns whose name ends with "ABR"
select(contains("ABR", ignore.case = TRUE))
Select only columns whose name contains with "ABR"
select(matches("[:digit:]+", ignore.case = TRUE))
Select only columns whose names contains digits (regular expression) select(num_range("X", 1:6))
Selects all columns (numerically) from X1 to X5. 
select(one_of("x", "y", "z”))
Selects columns provided inside the brackets.
select(X, everything())
Reorder columns: keep the column “X" in the front
select_if(is.numeric)
Select only numeric columns.

semi_join

Summary
Return all rows from the current data frame where there are matching values in the target data frame, keeping just columns from the current.

Syntax
semi_join(<data_set>, by = "<column>")
semi_join(<data_set>, by = c("<source_column>" = "<target_column>"))

Arguments

Example
semi_join(AIRPORT)
Join with AIRPORT data frame using all variables with common names.
semi_join(AIRPORT, by = "CODE")
Join with AIRPORT data frame using CODE columns.
semi_join(AIRPORT, by = c("DEST" = "CODE")
Join with AIRPORT data frame using DEST column from the original data frame and CODE from the target.

slice

Summary
Select rows by positions. This is not supported for relational databases, in which case you should use filter() with row_number() function.

Syntax
slice(<numeric>)
slice(<numeric>:<numeric>)

Arguments

Example
slice(1) : equivalent to -> filter(row_number() == 1)
Select only the first row.
slice(5:n()) : equivalent to -> filter(between(row_number(), 5, n()))
Select only rows starting from the 5th to the last.
slice(n()) : equivalent to -> filter(row_number() == n())
Select only the last row
slice(n()-10:n())
Select only the last 10 rows

summarize

Summary
Summarize a column values to a single value by using aggregate functions. Make sure you have grouped the data frame already using group_by() already.

Syntax
summarize(<new_column> = <aggregate_expression>, ...)

Example
summarise(avg = mean(revenue))
Create a new column 'avg' to store average values of 'revenue' for each group.
summarize(avg = mean(revenue), total = sum(revenue))
Create new columns 'avg' for 'average of revenue' and 'total' for 'sum of revenue' for each group.

tally

Summary
Summarize the data by either calling n() or sum() on top of the grouped data. group_by() operation should have been done before.

Syntax
tally(wt = <column>, sort = <logical>)

Arguments

Example
tally()
Returns the number of rows for each group.
tally(wt = ARR_DELAY)
Returns the sum total of 'ARR_DELAY' for each group.
tally(wt = ARR_DELAY, sort = TRUE)
Returns the sum total of 'ARR_DELAY' and sorted by the total number.

top_n

Summary
Select the top n entries based on a given measure in each group.

Syntax
top_n(<number>, <column_num>)
top_n(<number>, desc(<column_num>))

Arguments

Used Together
desc

Example
top_n(10, POPULATION)
Select the top 10 countries based on POPULATION.
top_n(-10, POPULATION)
Select the bottom 10 countries based on POPULATION. This is equivalent to the next command.
top_n(10, desc(POPULATION))

transmute

Summary
Transmute adds new columns and drops existing columns.

Syntax
transmute(<new_column> = <expression>, ...)

Example
transmute(profit = revenue - expense)
Create 'profit' column based on the calculation result of 'revenue - expense', and drop 'revenue' and 'expense' columns.

ungroup

Summary
Ungroup existing grouping

Syntax
ungroup()

Example
ungroup()
Ungroup the existing groups.

ceiling_date

Summary
Rounds date/time up to the nearest integer value of the specified time unit.

Syntax
ceiling_date(<column_date>, unit = <unit>)

Arguments

Return Value
POSIXct

Example
ceiling_date(A, "second")
returns "2015-10-01 06:10:15 UTC"
ceiling_date(A, "minute")
returns "2015-10-01 06:11:00 UTC"
ceiling_date(A, "hour")
returns "2015-10-01 07:00:00 UTC"
ceiling_date(A, "day")
returns "2015-10-02 UTC"
ceiling_date(A, "week")
returns "2015-10-04 UTC"
ceiling_date(A, "month")
returns "2015-11-01 UTC"
ceiling_date(A, "quarter")
returns "2016-01-01 UTC"
ceiling_date(A, "year")
returns "2016-01-01 UTC"

date_decimal

Summary
Converts numeric data calculated as a fraction of the year to date data type.

Syntax
date_decimal(<column_num>, tz = <timezone>)

Arguments

Return Value
POSIXct

Example
date_decimal(2015.953)
Returns "2015-12-14 20:16:47 UTC"

day

Summary
Extract Day

Syntax
day(<column_date>)

Return Value
Numeric

Example
day("2015-10-01 06:15:30")
returns 1

days

Summary
Convert a number to day periods so you can use it for further calculations.

Syntax
days(<column_num>)

Return Value
Period

Example
days(1) returns "1d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + days(1)
returns "2015-10-02 06:15:30 UTC"

decimal_date

Summary
Converts date data to numeric data type by calculating the date to a fraction of the year.

Syntax
decimal_date(<column_date>)

Arguments

Return Value
Numeric

Example
decimal_date(ymd("2015-12-15"))
Returns 2015.953

dmy

Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.

Syntax
dmy(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
Date

Example
dmy("10-01-2015")
returns "2015-10-01 UTC"
dmy("10/01/2015")
returns "2015-10-01 UTC"
dmy("Created on 10 1 2015")
returns "2015-10-01 UTC"

dmy_h

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
dmy_h(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
dmy_h("10-01-2015 06:10")
returns "2015-10-01 06:10:00 UTC"
dmy_h("Created on 10 1 2015, at 6:10AM")
returns "2015-10-01 06:10:00 UTC"
dmy_h("10-01-2015 06", truncated = 1)
returns "2015-10-01 06:00:00 UTC"

dmy_hm

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
dmy_hm(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
dmy_hm("01-10-2015 06:10")
returns "2015-10-01 06:10:00 UTC"
dmy_hm("Created on 1 10 2015, at 6:10AM")
returns "2015-10-01 06:10:00 UTC"
dmy_hm("01-10-2015 06", truncated = 1)
returns "2015-10-01 06:00:00 UTC"

dmy_hms

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
dmy_hms(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
dmy_hms("01-10-2015 06:10:15")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("01/10/2015 06:10:15")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("Created on 1 10 2015, at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
dmy_hms("01-10-2015 06")
returns NA, because there is not enough data for Date and Time parsing.
dmy_hms("01-10-2015 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected

duration

Summary
Create a duration

Syntax
duration(<column_num>, units = <units>)

Arguments

Return Value
Duration

Example
duration(1.5, "minutes") returns 90s
duration(mins = 1.5) returns 90s

dym

Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.

Syntax
dym(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
Date

Example
dym("01-2015-10")
returns "2015-10-01 UTC"
dym("01/2015/10")
returns "2015-10-01 UTC"
dym("Created on 1 2015 10")
returns "2015-10-01 UTC"

floor_date

Summary
Rounds date/time down to the nearest integer value of the specified time unit.

Syntax
floor_date(<column_date>, unit = <unit>)

Arguments

Return Value
POSIXct

Example
floor_date(A, "second")
returns "2015-10-01 06:10:15 UTC"
floor_date(A, "minute")
returns "2015-10-01 06:10:00 UTC"
floor_date(A, "hour")
returns "2015-10-01 06:00:00 UTC"
floor_date(A, "day")
returns "2015-10-01 UTC"
floor_date(A, "week")
returns "2015-09-27 UTC"
floor_date(A, "month")
returns "2015-10-01 UTC"
floor_date(A, "quarter")
returns "2015-10-01 UTC"
floor_date(A, "year")
returns "2015-01-01 UTC"

force_tz

Summary
Re-registers the data with a given timezone.

Syntax
force_tz(<column_date>, tzone = <timezone>)

Arguments

Return Value
POSIXct

Example

as.POSIXct("2014-01-01 00:15:00", tz = "UTC")
Returns "2014-01-01 00:15:00 UTC". force_tz(as.POSIXct("2014-01-01 00:15:00", tz = "America/Los_Angeles"))
Returns "2014-01-01 00:15:00 PST". Notice that the timezone is now "PST" instead of the original "UTC"

here

Summary
Get current time in your local timezone

Syntax
here()

Return Value
POSIXct

Example
here()
returns "2015-12-31 00:04:59 PST"

hm

Summary
Convert text to Period data type if a given column holds values that look like Hours and Minutes

Syntax
hm(<column>, tz = <timezone>)

Arguments

Return Value
Period

Example
hm("06:10")
returns "6H 10M 0S"
hm("06:10", "UTC")
returns "6H 10M 0S"

hms

Summary
Create hms data. It's difftime dealt as seconds.

Syntax
hms(seconds = <column_num>, minutes = <column_num>, hours = <column_num>, days = <column_num>)

Arguments

Return Value
hms

Example
hms(seconds = 20, minutes = 1, hours = 5)
returns 20 + 1 * 60 + 5 * 60 * 60 = 18080 seconds difftime

hour

Summary
Extract hour

Syntax
hour(<column_date>)

Return Value
Integer

Example
hour("2015-10-01 06:15:30")
returns 6

hours

Summary
Convert a number to hour periods so you can use it for further calculations.

Syntax
hours(<column_num>)

Return Value
Period

Example
hours(1) returns "1H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + hours(1) returns "2015-10-01 07:15:30 UTC"

interval

Summary
Return intervals between two dates. You can use this result to get the duration by either dividing by a given units such as days, weeks, etc, OR converting it to numeric data type with "as.numeric()".

Syntax
interval(<column_date>, <column_date>)

Return Value
Interval

Example
interval("2015-10-01", "2015-10-18")
returns 2015-10-01 UTC--2015-10-18 UTC
interval("2015-10-01", "2015-10-18") %/% days(1)
returns 17
as.numeric(interval("2015-10-01", "2015-10-18"), units = "days")
returns 17
interval("2015-10-01", "2015-10-18") %/% weeks(1)
returns 2

is.Date

Summary
Returns true if a given object is Date type.

Syntax
is.Date

Arguments

Return Value
Logical

Example

mutate_if(is.Date, wday)
Extract day of the week information from Date columns.
mutate_if(is.Date, funs(week_day = wday))
Extract day of the week information from Date columns and creating new columns.
mutate_if(is.Date, wday, label = TRUE)
Extract day of the week information using wday function with 'label = TRUE' argument from Date columns.

is.difftime

Summary
Returns true if a given object is difftime type.

Syntax
is.difftime

Arguments

Return Value
Logical

Example

mutate_if(is.difftime, as.numeric)
Convert difftime type columns to numeric type.

is.duration

Summary
Returns true if a given object is duration type.

Syntax
is.duration

Arguments

Return Value
Logical

Example

mutate_if(is.duration, as.numeric)
Convert duration type columns to numeric type.

is.POSIXct

Summary
Returns true if a given object is POSIXct type.

Syntax
is.POSIXct

Arguments

Return Value
Logical

Example

mutate_if(is.POSIXct, wday)
Extract day of the week information from Date columns.
mutate_if(is.POSIXct, funs(week_day = wday))
Extract day of the week information from Date columns and creating new columns.
mutate_if(is.POSIXct, wday, label = TRUE)
Extract day of the week information using wday function with 'label = TRUE' argument from Date columns.

mday

Summary
Extract Day of Month

Syntax
mday(<column_date>)

Return Value
Numeric

Example
mday("2015-10-01 06:15:30")
returns 1

mdy

Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.

Syntax
mdy(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
Date

Example
mdy("01-10-2015")
returns "2015-10-01 UTC"
mdy("01/10/2015")
returns "2015-10-01 UTC"
mdy("Created on 1 10 2015")
returns "2015-10-01 UTC"

mdy_h

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
mdy_h(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
mdy_h("10-01-2015 06")
returns "2015-10-01 06:00:00 UTC"
mdy_h("Created on 10 1 2015 6 AM")
returns "2015-10-01 06:00:00 UTC"

mdy_hm

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
mdy_hm(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
mdy_hm("10-01-2015T06:10")
returns "2015-10-01 06:10:00 UTC"
mdy_hm("Created on 10 1 2015, at 6:10 AM")
returns "2015-10-01 06:10:00 UTC"

mdy_hms

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
mdy_hms(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
mdy_hms("10-01-2015T06:10:15")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("10/01/2015T06:10:15")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("Created on 10 1 2015")
returns "2015-10-01 06:10:15 UTC"
mdy_hms("10-01-2015 06")
returns NA, because there is not enough data for Date and Time parsing.
mdy_hms("10-01-2015 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected

milliseconds

Summary
Convert a number to milliseconds periods so you can use it for further calculations.

Syntax
milliseconds(<column_num>)

Return Value
Period

Example
milliseconds(1) returns "1S"
ymd_hms("2015-10-01 06:15:30") + milliseconds(1)
returns "2015-10-01 06:15:31 UTC"

minute

Summary
Extract minute

Syntax
minute(<column_date>)

Return Value
Integer

Example
minute("2015-10-01 06:15:30")
returns 15

minutes

Summary
Convert a number to minute periods so you can use it for further calculations.

Syntax
minutes(<column_num>)

Return Value
Period

Example
minutes(1)_
returns "1M 0S"
ymd_hms("2015-10-01 06:15:30") + minutes(1)
returns "2015-10-01 06:16:30 UTC"

month

Summary
Extract Month

Syntax
month(<column_date>, label = <logical>, abbr = <logical>)

Arguments

Return Value
Numeric or Text

Example
month("2015-10-01 06:15:30")
returns 10 month("2015-10-01 06:15:30", label = TRUE)
returns October month("2015-10-01 06:15:30", label = TRUE, abbr = TRUE)
returns Oct

ms

Summary
Convert text to Period data type if a given column holds values that look like Minutes and Seconds

Syntax
ms(<column>, tz = <timezone>)

Arguments

Return Value
Period

Example
ms("10:15")
returns "10M 15S"
ms("10:15", "UTC")
returns "10M 15S"

myd

Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.

Syntax
myd(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
Date

Example
myd("01-2015-10")
returns "2015-01-10 UTC"
myd("01/2015/10")
returns "2015-01-10 UTC"
myd("Created on 1 2015 10")
returns "2015-01-10 UTC"

now

Summary
Returns current date and time. A similar function 'today()' returns only date.

Syntax
now(tzone = <timezone>)

Arguments

Return Value
POSIXct

Example
now()
returns "2015-12-31 08:04:59 PST"
now("GMT")
returns "2015-12-31 08:04:59 PST"
now("America/Los_Angeles")
returns "2015-12-31 00:04:59 PST"

parse_date_time

Summary
Convert data to Date/Time data type.

Syntax
parse_date_time(<column>, orders = <date_time_order>, locale = <locale>, tz = <timezone>, truncated = <integer>, quiet = <logical>, exact = <logical>, lt = <logical>)

Arguments

Return Value
POSIXct

Example
parse_date_time("03/15/05 1:30 PM", "mydHMp")
Returns "2015-03-05 13:30:00 PDT"
lubridate::parse_date_time("1:30 PM", "HMp")
Returns "0000-01-01 13:30:00 PDT. The default to each parameters are 0 and it always returns POSIXct"

qday

Summary
Extract Day of Quarter

Syntax
qday(<column_date>)

Return Value
Numeric

Example
qday("2015-10-01 06:15:30")
returns 7

quarter

Summary
Extract Quarter

Syntax
quarter(<column_date>)

Return Value
Numeric

Example
quarter("2015-08-01 06:15:30") returns 3
quarter("2015-06-16 07:42:29") returns 2

rollback

Summary
Calculate the last day of the previous month or to the first day of the month based on a given date.

Syntax
rollback(<column_date>, roll_to_first = <logical>, preserve_hms = <logical>)

Arguments

Return Value
POSIXct

Example
rollback("2015-03-03 06:15:30")
returns "2010-02-28 06:15:30 UTC"

rollback("2015-03-03 06:15:30", roll_to_first = TRUE)
returns "2010-03-01 06:15:30 UTC"

rollback("2015-03-03 06:15:30", preserve_hms = FALSE)
returns "2010-02-28 UTC"

round_date

Summary
Rounds date/time to the nearest integer value of the specified time unit.

Syntax
round_date(<column_date>, unit = <unit>)

Arguments

Return Value
POSIXct

Example
round_date("2015-10-01 06:10:15 UTC", "second")
returns "2015-10-01 06:10:15 UTC"
round_date("2015-10-01 06:10:15 UTC", "minute")
returns "2015-10-01 06:10:00 UTC" round_date("2015-10-01 06:10:15 UTC", "hour")
returns "2015-10-01 06:00:00 UTC"
round_date("2015-10-01 06:10:15 UTC", "day")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "week")
returns "2015-10-04 UTC"
round_date("2015-10-01 06:10:15 UTC", "month")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "quarter")
returns "2015-10-01 UTC"
round_date("2015-10-01 06:10:15 UTC", "year")
returns "2015-01-01 UTC"

second

Summary
Extract second

Syntax
second(<column_date>)

Return Value
Integer

Example
second("2015-10-01 06:15:30")
returns 30

seconds

Summary
Convert a number to second periods so you can use it for further calculations.

Syntax
seconds(<column_num>)

Return Value
Period

Example
seconds(1) returns "1S"
ymd_hms("2015-10-01 06:15:30") + seconds(1)
returns "2015-10-01 06:15:31 UTC"

time_length

Summary
Calculate the exact time length between two dates.

Syntax
time_length(<interval_or_duration>, unit = <unit>)

Arguments

Return Value
Numeric

Example
time_length(interval("2015-10-01", "2015-10-18"), "day")
returns 17
time_length(interval("2015-10-01", "2015-10-18"), "month")
returns 0.5483871
trunc(time_length(interval("2015-10-01", "2015-10-18"), "month"))
returns 0
round(time_length(interval("2015-10-01", "2015-10-18"), "month"))
returns 1

today

Summary
Returns current date without time. A similar function 'now()' returns date including time.

Syntax
today(tzone = <timezone>)

Arguments

Return Value
Date

Example
today("GMT")
returns "2015-12-31"

today("America/Los_Angeles")
returns "2015-12-31"

wday

Summary
Extract day of the week as a decimal number (01-07, Sunday is 1) or as Text (e.g. Sunday). The Text can be a full text or an abbreviated text. Sunday vs. Sun

Syntax
wday(<column_date>, label = <logical>, abbr = <logical>)

Arguments

Return Value
Numeric or Text

Example
wday("2015-10-01 06:15:30")
returns 5
wday(A, label = TRUE)
returns Thurs
wday(A, label = TRUE, abbr=FALSE)
returns Thursday

week

Summary
Extract Week

Syntax
week(<column_date>)

Return Value
Numeric

Example
week("2015-10-01 06:15:30")
returns 40

weeks

Summary
Convert a number to week periods so you can use it for further calculations.

Syntax
weeks(<column_num>)

Return Value
Period

Example
weeks(1) returns "7d 0H 0M 0S"
ymd_hms("2015-10-01 06:15:30") + weeks(1)
returns "2015-10-08 06:15:30 UTC"

with_tz

Summary
Returns Date / Time in a given time zone

Syntax
with_tz(<column_date>, tzone = <timezone>)

Arguments

Return Value
POSIXct

Example

with_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"), tzone = "America/Los_Angeles")
Returns "2013-12-31 16:15:00 PST". The original data is registered in UTC, and with_tz() function is returning it in Pacific Timezone. with_tz(as.POSIXct("2014-01-01 00:15:00", tz = "UTC"))
Returns "2013-12-31 16:15:00 PST". with_tz() function returns the value in the default timezone (e.g. "America/LosAngeles")._

yday

Summary
Extract Day of Year

Syntax
yday(<column_date>)

Return Value
Numeric

Example
yday("2015-10-01 06:15:30")
returns 274

ydm

Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.

Syntax
ydm(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
Date

Example
ydm("2015-10-01")
returns ""2015-01-10 UTC""
ydm("2015/10/01")
returns ""2015-01-10 UTC""
ydm("Created on 2015 10 1")
returns ""2015-01-10 UTC""

ydm_h

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
ydm_h(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
ydm_h("2015-01-10T06")
returns "2015-10-01 06:00:00 UTC"
ydm_h("2015/01/10 06")
returns "2015-10-01 06:00:00 UTC"
ydm_h("Created on 15-01-10 at 6 AM")
returns "2015-10-01 06:00:00 UTC"

ydm_hm

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
ydm_hm(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
ydm_hm("2015-01-10 06:10")
returns "2015-10-01 06:10:00 UTC"
ydm_hm("2015/01/10 06:00 UTC")
returns "2015-10-01 06:10:15 UTC"
ydm_hm("Created on 15-01-10 at 06:10 AM")
returns "2015-10-01 06:10:00 UTC"

ydm_hms

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
ydm_hms(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
ydm_hms("2015-01-10T06:10:15")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("2015/01/10 06:10:15 UTC")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("Created on 15-01-10 at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
ydm_hms("2015-01-10 06")
returns NA, because there is not enough data for Date and Time parsing.
ydm_hms("2015-01-10 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected

year

Summary
Extract Year

Syntax
year(<column_date>)

Return Value
Numeric

Example
year("2015-10-01 06:15:30")
returns 2015

years

Summary
Convert a number to year periods so you can use it for further calculations.

Syntax
years(<column_num>)

Return Value
Period

Example
ymd_hms("2015-10-01 06:15:30") + years(1)
returns "2016-10-01 06:15:30 UTC"

ymd

Summary
Convert Character or Number to Date / Period when data contains only Date, but not Time.

Syntax
ymd(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
Date

Example
ymd("2015-10-01")
returns "2015-10-01 UTC"
ymd("2015/10/01")
returns "2015-10-01 UTC"
ymd("Created on 2015 10 1")
returns "2015-10-01 UTC"

ymd_h

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
ymd_h(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
ymd_h("2015-10-01T06")
returns "2015-10-01 06:00:00 UTC"
ymd_h("2015/10/01 06")
returns "2015-10-01 06:00:00 UTC"
ymd_h("Created on 15-10-01 at 6 AM")
returns "2015-10-01 06:00:00 UTC"

ymd_hm

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
ymd_hm(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
ymd_hm("2015-10-01 06:10")
returns "2015-10-01 06:10:00 UTC"
ymd_hm("2015/10/01 06:00 UTC")
returns "2015-10-01 06:10:15 UTC"
ymd_hm("Created on 15-10-01 at 06:10 AM")
returns "2015-10-01 06:10:00 UTC"

ymd_hms

Summary
Convert Character or Number to Date / Period when data contains Date and Time.

Syntax
ymd_hms(<column>, tz = <timezone>, locale = <locale>, truncated = <number>)

Arguments

Return Value
POSIXct

Example
ymd_hms("2015-10-01T06:10:15")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("2015/10/01 06:10:15 UTC")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("Created on 15-10-01 at 06:10:15 AM")
returns "2015-10-01 06:10:15 UTC"
ymd_hms("2015-10-01 06")
returns NA, because there is not enough data for Date and Time parsing.
ymd_hms("2015-10-01 06", truncated = 2)
returns "2015-10-01 06:00:00 UTC" as expected

cor

Summary
Computes the correlation of two column values. The covariance divided by the product of the standard deviations of the two column values.

Syntax
cor(<column_num>, <column_num>, use = <cor_na_operation>, method = <cor_method>)

Arguments

Example
summarize(correlation = cor(ARR_DELAY, DEP_DELAY))
Return the correlation value for ARRDELAY and DEP_DELAY columns._

cov

Summary
Computes the covariance of two column values.

Syntax
cov(<column_num>, <column_num>, use = <cor_na_operation>, method = <cor_method>)

Arguments

Example
summarize(covariance = cov(ARR_DELAY, DEP_DELAY))
Return the covariance value for ARRDELAY and DEP_DELAY columns._

IQR

Summary
computes interquartile range of the x values

Syntax
IQR(<column_num_logic_date>, na.rm = <logical>, type = 1|2|3|4|5|6|7|8|9)

Arguments

Return Value
Numeric

Example
// x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
IQR(x)
Returns 4.5

mad

Summary
returns the median absolute deviation of the values.

Syntax
mad(<column_num_logic_date>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
summarize(mad = mad(revenue))
Create a new column for calculating the median absolute deviation of revenue for each group.

median

Summary
Returns the numerical median value.

Syntax
median(<column_num_logic_date>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
summarize(revenue_median = median(revenue))
Create a new column for calculating the median value of revenue for each group.

predict

Summary

Returns data augmented with fitted values and residuals.

Syntax

predict(<column_model>, data = <column_data_set>, newdata = <data_set>, type.predict = "link"|"response")

Arguments

Example

predict(model) - assume that 'model' column contains a fitted model of 'lm', 'glm' or 'kmeans'. If it's lm, the output is like below.

.rownames ARR_DELAY DEP_DELAY .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
1 -15 -9 -13.934 0.0247 -1.066 0.0000035503 13.0886 7.8503e-9 -0.0814
2 -28 -19 -22.4116 0.035 -5.5884 0.0000071333 13.0886 4.3348e-7 -0.427

quantile

Summary
sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1.

Syntax
quantile(<column_num_logic_date>, probs = <percent_number>, na.rm = <logical>, [type = 1|2|3|4|5|6|7|8|9])

Arguments

Return Value
Numeric

Example
//x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
quantile(x, 0.25)
Returns 3.25
quantile(x, .5)
Returns 5.5

relevel

Summary
Sets the base level of a given factor column by making the level specified by ref argument becomes the first and the others are moved down.

Syntax
relevel(<column_factor>, ref = <value>)

Arguments

Return Value
Factor

Example
x <- factor(level = c("France", "Japan", "Australia", "United States"))
This would create a factor with the following order as, France Japan Australia United States.

relevel(x, ref = "United States")
This would change the order to be United States France Japan Australia.

sd

Summary
returns the standard deviation of the values.

Syntax
sd(<column_num_logic_date>, na.rm = <logical>))

Arguments

Return Value
Numeric

Example
summarize(sd = sd(revenue))
Create a new column for calculating the standard deviation of revenue for each group.

var

Summary
returns the variance of the values.

Syntax
var(<column_num_logic_date>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
summarize(total = sum(revenue))
Create a new column for calculating the variance of revenue for each group.

str_c

Summary
Concatenates multiple text from multiple columns.

Syntax
str_c(<column(s)>, sep = <text>, collapse = <text>)

Arguments

Return Value
Character

Example
str_c(CARRIER, TAIL_NUM)
Concatenate CARRIER and TAIL_NUM values.
str_c(CARRIER, TAIL_NUM, sep = "-")
Use dash as a separator to concatenate CARRIER and TAIL_NUM values.
summarize(combined_text = str_c(CARRIER, collapse = ",")
Use comma ',' as a separator to combine all the CARRIER names into one text for each group.

str_conv

Summary
Convert text to a specified encoding. e.g. UTF-8

Syntax
str_conv(<column>, <encoding>)

Arguments

Return Value
Character

Example
str_conv("Exploratory", "UTF-8")

str_count

Summary
Return number of letters or words in text data. You can also set a pattern to count only the matched pattern or text.

Syntax
str_count(<column>, pattern = [<regular_expression>])
str_count(<column>, pattern = [boundary("word")])

Arguments

Return Value
Numeric

Example
str_count("I'm traveling to New York on 1/1.")
returns 33 including punctuations and space
str_count("I'm traveling to New York on 1/1.", "New")
returns 1 by matching text 'New'
str_count("I'm traveling to New York on 1/1.", "[:alpha:]")
returns 22 counting only alphabet letters
str_count("I'm traveling to New York on 1/1.", "[:digit:]")
returns 2 counting only digit letters
str_count("I'm traveling to New York.", boundary("word"))
returns 5 counting as words
str_count("I'm traveling to New York on 1/1.", ".")
returns 33 because dot matches anything
str_count("I'm traveling to New York on 1/1.", fixed("."))
returns 1 because dot is searching dot as letter with 'fixed()'

str_detect

Summary
Return TRUE or FALSE based on whether Text data contains a given text or not

Syntax
str_detect(<column>, <pattern>)

Arguments

Return Value
Logical

Example
str_detect("New York", "New")
returns TRUE because there is 'New' in the data.
str_detect("New York on 1/1.", "[:alpha:]")
returns TRUE because there are alphabet letters in the data.
str_detect("New York on 1/1.", "[:digit:]")
returns TRUE because there are numeric (digit) letters in the data.
str_detect("New York on 1/1.", "^New")
returns TRUE because there is "New" text at the beginning of the data.
str_detect("New York on 1/1.", "New$")
returns TRUE because there is no "New" text at the end of the data.
str_detect("New York on 1/1.", "N.w")
returns TRUE because 'New' starts with 'N' and ends with 'w'
str_detect("New York on 1/1.", fixed("N.w"))
returns FALSE because 'N.w' exact matching letters don't exist
str_detect("New York on 1/1.", c("New", "Old"))
returns TRUE because a given text matches with one of matching texts

str_dup

Summary
Repeat text values.

Syntax
str_dup(<column_text>, times = <numeber>)

Arguments

Return Value
Character

Example
str_dup("wow", 2)
returns "wowwow".

str_extract

Summary
Extract only letters that match with a given letters or patterns

Syntax
str_extract(<column_text>, <pattern>)

Arguments

Return Value
Character

Example
str_extract("ABCabc123", "[A-Za-z]+")
returns "ABCabc" extracting only alphabet letters.
str_extract("ABCabc123", "[:alpha:]+")
returns "ABCabc" extracting only alphabet letters.
str_extract("ABCabc123", "[0-9]+")
returns "123" extracting only numbers
str_extract("ABCabc123", "[:digit:]+")
returns "123" extracting only numbers

str_extract_all

Summary
Extract all characters that match with a given letters or patterns

Syntax
str_extract_all(<column_text>, <pattern>)

Arguments

Return Value
List

Example

str_extract_all("Hello world", "[:alpha:]+")
Extracts all matching texts and returns c("Hello", "world").

str_extract("Hello world", "[:alpha:]+")
Extracts the first matching text and returns "Hello".

str_length

Summary
Return number of letters in text data.

Syntax
str_length(<column>)

Return Value
Numeric

Example
str_length("I'm traveling to New York.")
Returns 26 including punctuations and space

str_pad

Summary
Add white spaces to text values to make the total length to be exact a specified number.

Syntax
str_pad(<column_text>, width = <number>, side = "left"|"right"|"both", pad = <text>)

Arguments

Return Value
Character

Example
str_pad("Exploratory", 3)
returns "Exploratory" because 3 is less than number of the original given text.
str_pad("Exploratory", 30, "left")
returns " Exploratory" to make the length of the text to be 30 by adding white spaces.

str_replace

Summary
Replace letters that matches with a given letters or expressions

Syntax
str_replace(<column>, <pattern>, <replacement>)

Arguments

Return Value
Character

Example
str_replace("I am traveling to New York!!", "York", "Jersey")
returns "I am traveling to New Jersey"
str_replace("I am traveling to New York!!", "[:punct:]", "")
returns "I am traveling to New York!" replacing only the first matching punctuation. You want to use str_replace_all() to replace them all.
str_replace("New York New", "^New", "Old")
returns "Old York New" by finding "New" at the beginning and replace it with "Old"
str_replace("New York New.", "New$", "Old")
returns "New York Old" by finding "New" at the end and replace it with "Old"
str_replace("New York", str_sub("New York", 4, 8), "")
returns "New".

str_replace_all

Summary
Replace letters that matches with a given letters or expressions

Syntax
str_replace_all(<column>, <pattern>, <replacement>)

Arguments

Return Value
Character

Example
str_replace_all("I am traveling to New York!!", "[:punct:]", "")
returns "I am traveling to New York"

str_replace_na

Summary
Replace NA values to a specific Text

Syntax
str_replace_na(<column>, <replacement>)

Arguments

Return Value
Character

Example
str_replace_na(ARR_DELAY, "Not Available")
Returns "Not Available" for NA in ARRDELAY column._

str_split

Summary
Split a given text into multiple text by a given separator. It will return a list so you want to unnest the newly created column with unnest() function.

Syntax
str_split(<column_text>, pattern = <pattern>, n = <numeric>)

Arguments

Return Value
List

Example
str_split("a, b, c", ",")
returns "a" " b" " c" in a list format.
str_split("a, b, c", ",", 2)
returns "a" and " b, c" in a list format.
mutate(y = str_split(text_a, ",")) %>% unnest()
First split a given text column, this will create a list column. Then, unnest() function will break the list by given each element of the list its own row. unnest(y = str_split(text_a, ","))
This is equivalent of the above command.

str_sub

Summary
Extract letters from Text values based on the position

Syntax
str_sub(<column_text>, <start>, <end>)

Arguments

Return Value
Character

Example
str_sub("Kevin Spacey", start = 7, end = 12)
returns "Spacey". Starts from 7 to 12.
str_sub("Kevin Spacey", start = 7)
returns "Spacey". Starts from 7 to the end.
str_sub(Text, start = -6)
returns "Spacey". Last 6 letters.

str_to_lower

Summary
Convert text to lowercase

Syntax
str_to_lower(<column>, locale = <locale>)

Arguments

Return Value
Character

Example
str_to_upper("CALIFORNIA")
Returns "california"

str_to_title

Summary
Convert text to Title-case

Syntax
str_to_title(<column>, locale = <locale>)

Arguments

Return Value
Character

Example
str_to_title("CALIFORNIA")
Returns "California"

str_to_upper

Summary
Convert text to UPPERCASE

Syntax
str_to_upper(<column>, locale = <locale>)

Arguments

Return Value
Character

Example
str_to_upper("california")
Returns "CALIFORNIA"
str_to_upper("i", "en")
Seting English locale would return 'i'
str_to_upper("i", "tr")
Setting Turkish locale would return dotted 'I' in capital

str_trim

Summary
Trim (or remove) white spaces in Text values

Syntax
str_trim(<column_text>, side = "both"|"left"|"right")

Arguments

Return Value
Character

Example
str_trim(" String with trailing and leading white space\t")
returns "String with trailing and leading white space".
str_trim("\n\nString with trailing and leading white space\n\n")
returns "String with trailing and leading white space".

word

Summary
Extract First / Last Nth Word in Text data

Syntax
word(<column_text>, <start>, <end>, sep = <text>)

Arguments

Return Value
Character

Example
word("I am traveling to New York", 1)
returns "I".
word("I am traveling to New York", 1, 3)
returns "I am traveling"
word("I am traveling to New York", 2)
returns "am".
word("I am traveling to New York", -1)
returns "York"
word("I am traveling to New York", 2, -1)
returns "am traveling to New York"
word("123-456-789", 1, sep = "-")
returns "123". word("123.456.789", 1, sep = ".")
returns null. You want to use fixed() to make the dot literally mean dot.
word("123.4568.789", 1, sep = fixed("."))
returns "123".

complete

Summary
Complete a data frame by adding missing combinations of data.

Syntax
complete(<column(s)>, fill = <list>)

Arguments

Used Together
full_seq, nesting

Example

Sample data:

year product type sales
2013 iPhone 5 100
2013 iPad mini 50
2015 iPhone 6 200

complete(year, product)
Returns all the combinations of year and product values.

year product type sales
2013 iPad mini 50
2013 iPhone 5 100
2015 iPad NA NA
2015 iPhone 6 200

complete(year, product, fill = list(sales = 0))
Returns all the combinations of year and product values and fill with 0 for NA for sales column.

year product type sales
2013 iPad mini 50
2013 iPhone 5 100
2015 iPad NA 0
2015 iPhone 6 200

complete(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).

year product type sales
2013 iPad mini 50
2013 iPhone 5 100
2014 iPad NA NA
2014 iPhone NA NA
2015 iPad NA NA
2015 iPhone 6 200

complete(year, product, type)
Returns all the combinations of year, product, and type values.

year product type sales
2013 iPad 5 NA
2013 iPad 6 NA
2013 iPad mini 50
2013 iPhone 5 100
2013 iPhone 6 NA
2013 iPhone mini NA
2015 iPad 5 NA
2015 iPad 6 NA
2015 iPad mini NA
2015 iPhone 5 NA
2015 iPhone 6 200
2015 iPhone mini NA

complete(year, nesting(product, type))
By using nesting() function you can set the combination of values to be the ones that really exist in data for some columns.

year product type sales
2013 iPad mini 50
2013 iPhone 5 100
2013 iPhone 6 NA
2015 iPad mini NA
2015 iPhone 5 NA
2015 iPhone 6 200

expand

Summary
Expand a data frame by adding all the combination of given columns.

Syntax
expand(<column(s)>)

Arguments

Used Together
full_seq, nesting

Example

Sample data:

year product type sales
2013 iPhone 5 100
2013 iPad mini 50
2015 iPhone 6 200

expand(year, product)
Returns all the combinations of year and product values.

year product
2013 iPad
2013 iPhone
2015 iPad
2015 iPhone

expand(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).

year product
2013 iPad
2013 iPhone
2014 iPad
2014 iPhone
2015 iPad
2015 iPhone

expand(year, product, type)
Returns all the combinations of year, product, and type values.

year product type
2013 iPad 5
2013 iPad 6
2013 iPad mini
2013 iPhone 5
2013 iPhone 6
2013 iPhone mini
2015 iPad 5
2015 iPad 6
2015 iPad mini
2015 iPhone 5
2015 iPhone 6
2015 iPhone mini

expand(year, nesting(product, type))
By using nesting() function you can set a certain combination of column values to be fixed and have them nested inside other columns.

year product type
2013 iPad mini
2013 iPhone 5
2013 iPhone 6
2015 iPad mini
2015 iPhone 5
2015 iPhone 6

extract_numeric

Summary
Extract only the number from a given text. This is useful for strings that are supposed to be numbers with extra formatting (e.g. $1,200.34, -12%, X1, etc.).

Syntax
extract_numeric(<column>)

Return Value
Numeric

Example
extract_numeric("$56,500.00")
Returns 56500
extract_numeric("-45%")
Returns -45
mutate( = extract_numeric()
Crate a new column to have only the extracted numbers from the original column.

fill

Summary
Fills missing values in using the previous entry. This is convenient especially when values are not presented in some cells to avoid duplicated entries.

Syntax
fill(<column(s)>, .direction = <fill_direction>)

Arguments

Example
fill(POPULATION)
Fills using the previous appearing value for POPULATION column.
fill(POPULATION, .direction = "up")
Fills using the next appearing value for POPULATION column.
fill(POPULATION, PRODUCTION)
Fills using the previous appearing value for POPULATION and PRODUCTION columns.
fill(-UNEMPLOYMENT)
Fills using the previous appearing value for all the columns other than UNEMPLOYMENT column.

full_seq

Summary
Generate a sequence of numbers based on the values in a given column.

Syntax
full_seq(<column_num>, period = <number>)

Arguments

Used Together
complete, expand,

Example
Sample data:

year product type sales
2013 iPhone 5 100
2013 iPad mini 50
2015 iPhone 6 200

expand(year = full_seq(year, period = 1), product)
Returns all the combinations of year and product values plus fill missing years (2014).

year product
2013 iPad
2013 iPhone
2014 iPad
2014 iPhone
2015 iPad
2015 iPhone

Without full_seq() function.

expand(year, product)
Returns all the combinations of year and product values.

year product
2013 iPad
2013 iPhone
2015 iPad
2015 iPhone

gather

Summary
Takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed.

Syntax
gather(<new_key_column>, <new_value_column>, <column(s)>, na.rm = <logical>, convert = <logical>)

Arguments

Example
gather(COUNTRY, POPULATION, 2:50, na.rm = FALSE)
Takes the 2nd column to the 50th column into two new columns, COUNTRY and POPULATION.
gather(COUNTRY, POPULATION, -YEAR)
Takes all the columns other than YEAR column into two new columns, COUNTRY and POPULATION.
gather(COUNTRY, POPULATION, starts_with("country"))
Takes the columns whose names start with "country" into two new columns, COUNTRY and POPULATION.

nest

Summary
Nest a set of columns together as a list column.

Syntax
nest(<column(s)> , .key = <text>)

Arguments

Example
Sample Data

year product type sales
2013 iPhone 5 100
2013 iPad mini 50
2015 iPhone 6 200

nest(product, type, sales, .key = a)
This will create a new column that is a list of data frames and each row of the data frame holds 'product', 'type', 'sales' values.

year a
2013 list()
2015 list()

nesting

Summary
Fix multiple columns so that only the combination of actually existing values in those columns will be used for complete() or expand() operation.

Syntax
nesting(<column(s)>)

Arguments

Used Together
complete, expand,

Example
Sample data:

year product type sales
2013 iPhone 5 100
2013 iPad mini 50
2015 iPhone 6 200

expand(year, nesting(product, type))
By using nesting() function you can set a certain combination of column values to be fixed and have them nested inside other columns.

year product type
2013 iPad mini
2013 iPhone 5
2013 iPhone 6
2015 iPad mini
2015 iPhone 5
2015 iPhone 6

Without nesting():

expand(year, product, type)
Returns all the combinations of year, product, and type values.

year product type
2013 iPad 5
2013 iPad 6
2013 iPad mini
2013 iPhone 5
2013 iPhone 6
2013 iPhone mini
2015 iPad 5
2015 iPad 6
2015 iPad mini
2015 iPhone 5
2015 iPhone 6
2015 iPhone mini

separate

Summary
Separates a column with delimited values into multiple columns.

Syntax
separate(<column_text_date>, into = c("<new_column_name(s)>"), sep = <text>, remove=<logical>, convert=<logical>, extra = "warn"|"drop"|"merge", fill = "warn"|"right"|"left")

Arguments

Example
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), extra="drop")
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will drop the rest if any.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), remove = FALSE)
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will keep NAME column.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), extra="merge")
Separates NAME column into FIRSTNAME and LASTNAME columns by non-alphanumeric values. It will merge the rest into the last column.
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), sep = "-")
Separates NAME column into FIRSTNAME and LASTNAME columns by dash "-".
separate(NAME, into = c("FIRSTNAME", "LASTNAME"), sep = 2)
Separates NAME column into FIRSTNAME and LASTNAME by a position of 2.
separate(DATE, into = c("Y", "M", "D"), convert = TRUE)
Separate DATE into Year, Month, and Day columns and set the data type for each column based on the heuristics.

spread

Summary
Spread a key-value pair across multiple columns. Missing values will be replaced with NA.

Syntax
spread(<key_column>, <value_column>, fill = <value>, convert = <logical>, drop = <logical>, sep=<text>)

Arguments

Example
spread(COUNTRY, POPULATION)
Spread a COUNTRY-POPULATION pair across multiple columns of COUNTRY.
spread(COUNTRY, POPULATION, fill = “NA")
Spread a COUNTRY-POPULATION pair across multiple columns of COUNTRY, and fill with "NA" text for missing values.

unite

Summary
Unite multiple columns together into one column with given uniting characters. It will concatenate values by “_" and remove the original column by default. Multiple columns specified in the list can be combined together at once.

Syntax
unite(<new_column>, <column(s)>, sep = <text>, remove = <logical>)

Arguments

Example
unite(NAME, FIRSTNAME, LASTNAME)
Unite FIRSTNAME and LASTNAME into a new column called NAME using _ as a uniting character.
unite(NAME, FIRSTNAME, LASTNAME, sep = "-")
Unite FIRSTNAME and LASTNAME into a new column called NAME using _ as a uniting character.

unnest

Summary
Unnest a list column or a list of data frames by making each element of the list to be presented in its own row.

Syntax
unnest(<column_list> , .drop = <logical>, .id = <text>, .sep = <text>)

Arguments

Example
Sample Data with a list column

issue_id assignee
720 list(Kan, Kei)
721 list(Kan, Marie, Simon)

unnest() or unnest(assignee)
Unnest 'assignee' column which is a list. This will give each value of the list its own row.

issue_id assignee
720 Kan
720 Kei
721 Kan
721 Marie
721 Simon

Sample Data with a list of Data Frames:

issue_id assignee
720 list(name = Kan, id = 105, country = US )
721 list(name = Kei, id = 106, country = Japan )

unnest()
Unnest 'assignee' column which is a list of data frames. This will create columns for each column of the nested data frame.

issue_id name id country
720 Kan 105 US
721 Kei 106 Japan

unnest(.sep="-")
Unnest With .sep argument, the result columns are with original list column name connected by .sep argument.

issue_id assignee-name assignee-id assignee-country
720 Kan 105 US
721 Kei 106 Japan

Sample Data with two list columns

issue_id assignee category
720 list(Kan, Kei) list(Critical, Security)
721 list(Kan, Marie, Simon) list(Critical, Minor Issue)

unnest(assignee, .drop = FALSE)
Unnest 'assignee' column which is a list. This will give each value of the list its own row.

issue_id assignee category
720 Kan list(Critical, Security)
720 Kei list(Critical, Security)
721 Kan list(Critical, Security)
721 Marie list(Critical, Minor Issue)
721 Simon list(Critical, Minor Issue)

param_remove

Summary
Removes parameters from a given URL.

Syntax
param_remove(<column_text>, keys = <text>)

Arguments

Return Value
Character

Example
param_remove(urls = "https://github.com/hideaki/tam/issues?q=is%3Aopen&ticket=open", keys = c("q","ticket"))
Returns "https://github.com/hideaki/tam/issues"

url_decode

Summary
Decodes a given encoded URL.

Syntax
url_decode(<column_text>)

Arguments

Return Value
Character

Example
url_decode("http://exploratory.io%3fname%3dflight%20data%20analysis%20%28V1%29")
Returns "http://exploratory.io?name=flight data analysis (V1)"

url_encode

Summary
Encodes a given URL.

Syntax
url_encode(<column_text>)

Arguments

Return Value
Character

Example
url_encode("http://exploratory.io?name=flight data analysis (V1)")
Returns "http://exploratory.io%3fname%3dflight%20data%20analysis%20%28V1%29"

url_parameters

Summary
Returns decoded query parameters from url.

Syntax
url_parameters(<column_text>)

Arguments

Return Value
Character

Example
url_parameters("https://exploratory.io/?debug=%3Ctrue%3E#test")
returns "debug=<true>".

augment

Summary

Returns the original data augmented with fitted values and residuals.

Syntax

augment(<column_model>, data = <column_data_set>, newdata = <data_set>)

Arguments

Example

augment(model) - assume that 'model' column contains a fitted model of 'lm' by do_lm() command.

.rownames ARR_DELAY DEP_DELAY .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
1 -15 -9 -13.934 0.0247 -1.066 0.0000035503 13.0886 7.8503e-9 -0.0814
2 -28 -19 -22.4116 0.035 -5.5884 0.0000071333 13.0886 4.3348e-7 -0.427

glance

Summary
Returns a single row summary of a given model.

Syntax
glance(<column_model>)

Arguments

Returned Values

Example

glance(model) - assume that 'model' column contains a fitted model of 'lm' by do_lm() command.

r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.8879 0.8879 13.0886 1716441.58 0 3 -1729152.62 3458313.24 3458357.15 74228528.48 433295

tidy

Summary

Returns a summary information of a given model in a tidy (normalized data frame) format.

Syntax

tidy(<column_model>, ...)

Arguments

Returned Values

Example

tidy(model) - assume that 'model' column contains a fitted model of 'lm' by do_lm() command.

term estimate std.error statistic p.value
(Intercept) -3.0966 0.0341 -90.749 0
2 DEP_DELAY 1.003 0.0005 1851.9463 0
3 DISTANCE -0.0037 0 -114.3364 0

roll_max

Summary
Returns the rolling (or moving) max value.

Syntax
roll_max(<column_num>, n = <numeric>, weights = NULL, by = <numeric>, fill = <numeric>, align = <alignment>, normalize = <logical>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
rolling_max = roll_max(revenue, n = 3, align="center", fill = 0)
Returns a max value among three values starting from one previous and one next value, setting NA to 0.

year revenue rolling_max
2012 100 0
2013 200 300
2014 300 400
2015 400 500
2016 500 0

rolling_max = roll_max(revenue, n = 3, align="left", fill = 0)
Returns a max value among three values starting from the current value and the next two values, setting NA to 0.

year revenue rolling_max
2012 100 300
2013 200 400
2014 300 500
2015 400 0
2016 500 0

roll_mean

Summary
Returns the rolling (or moving) mean value.

Syntax
roll_mean(<column_num>, n = <numeric>, weights = NULL, by = <numeric>, fill = <numeric>, align = <alignment>, normalize = <logical>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
rolling_mean = roll_mean(revenue, n = 3, align="center", fill = 0)
Returns a mean value among three values starting from one previous and one next value, setting NA to 0.

year revenue rolling_mean
2012 100 0
2013 200 200
2014 300 300
2015 400 400
2016 500 0

rolling_mean = roll_mean(revenue, n = 3, align="left", fill = 0)
Returns a mean value among three values starting from the current value and the next two values, setting NA to 0.

year revenue rolling_mean
2012 100 200
2013 200 300
2014 300 400
2015 400 0
2016 500 0

roll_median

Summary
Returns the rolling (or moving) median value.

Syntax
roll_median(<column_num>, n = <numeric>, weights = NULL, by = <numeric>, fill = <numeric>, align = <alignment>, normalize = <logical>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
rolling_median = roll_median(revenue, n = 3, align="center", fill = 0)
Returns a median value among three values starting from one previous and one next value, setting NA to 0.

year revenue rolling_median
2012 100 0
2013 200 200
2014 300 300
2015 400 400
2016 500 0

rolling_median = roll_median(revenue, n = 3, align="left", fill = 0)
Returns a median value among three values starting from the current value and the next two values, setting NA to 0.

year revenue rolling_median
2012 100 200
2013 200 300
2014 300 400
2015 400 0
2016 500 0

roll_min

Summary
Returns the rolling (or moving) min value.

Syntax
roll_min(<column_num>, n = <numeric>, weights = NULL, by = <numeric>, fill = <numeric>, align = <alignment>, normalize = <logical>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
rolling_max = roll_min(revenue, n = 3, align="center", fill = 0)
Returns a min value among three values starting from one previous and one next value, setting NA to 0.

year revenue rolling_min
2012 100 0
2013 200 100
2014 300 200
2015 400 300
2016 500 0

rolling_max = roll_min(revenue, n = 3, align="left", fill = 0)
Returns a min value among three values starting from the current value and the next two values, setting NA to 0.

year revenue rolling_min
2012 100 100
2013 200 200
2014 300 300
2015 400 0
2016 500 0

roll_prod

Summary
Returns the rolling (or moving) prod value.

Syntax
roll_prod(<column_num>, n = <numeric>, weights = NULL, by = <numeric>, fill = <numeric>, align = <alignment>, normalize = <logical>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
rolling_prod = roll_prod(revenue, n = 3, align="center", fill = 0)
Returns a prod value among three values starting from one previous and one next value, setting NA to 0.

year revenue rolling_prod
2012 1 0
2013 2 6
2014 3 24
2015 4 60
2016 5 0

rolling_prod = roll_prod(revenue, n = 3, align="left", fill = 0)
Returns a prod value among three values starting from the current value and the next two values, setting NA to 0.

year revenue rolling_prod
2012 1 6
2013 2 24
2014 3 60
2015 4 0
2016 5 0

roll_sd

Summary
Returns the rolling (or moving) standard deviation value.

Syntax
roll_sd(<column_num>, n = <numeric>, weights = NULL, by = <numeric>, fill = <numeric>, align = <alignment>, normalize = <logical>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
rolling_sd = roll_sd(revenue, n = 3, align="center", fill = 0)
Returns a standard deviation value among three values starting from one previous and one next value, setting NA to 0.

year revenue rolling_sd
2012 100 0
2013 200 100
2014 300 100
2015 400 100
2016 500 0

rolling_sd = roll_sd(revenue, n = 3, align="left", fill = 0)
Returns a min value among three values starting from the current value and the next two values, setting NA to 0.

year revenue rolling_sd
2012 100 100
2013 200 100
2014 300 100
2015 400 0
2016 500 0

roll_sum

Summary
Returns the rolling (or moving) sum value.

Syntax
roll_sum(<column_num>, n = <numeric>, weights = NULL, by = <numeric>, fill = <numeric>, align = <alignment>, normalize = <logical>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
rolling_sum = roll_sum(revenue, n = 3, align="center", fill = 0)
Returns a sum value among three values starting from one previous and one next value, setting NA to 0.

year revenue rolling_sum
2012 100 0
2013 200 600
2014 300 900
2015 400 1200
2016 500 0

rolling_sum = roll_sum(revenue, n = 3, align="left", fill = 0)
Returns a sum value among three values starting from the current value and the next two values, setting NA to 0.

year revenue rolling_sum
2012 100 600
2013 200 900
2014 300 1200
2015 400 0
2016 500 0

roll_var

Summary
Returns the rolling (or moving) variance value.

Syntax
roll_var(<column_num>, n = <numeric>, weights = NULL, by = <numeric>, fill = <numeric>, align = <alignment>, normalize = <logical>, na.rm = <logical>)

Arguments

Return Value
Numeric

Example
rolling_var = roll_var(revenue, n = 3, align="center", fill = 0)
Returns a variance value among three values starting from one previous and one next value, setting NA to 0.

year revenue rolling_var
2012 100 0
2013 200 10000
2014 300 10000
2015 400 10000
2016 500 0

rolling_var = roll_var(revenue, n = 3, align="left", fill = 0)
Returns a variance value among three values starting from the current value and the next two values, setting NA to 0.

year revenue rolling_var
2012 100 10000
2013 200 10000
2014 300 10000
2015 400 0
2016 500 0

-

Summary
Removes column(s). It can be used along with any column(s) selection operation like select(), gather(), etc.

Syntax
-

Arguments

Return Value

Example
select(-Country)
Remove Country column.
select(-starts_with("user"))
Remove all the columns whose names start with "user".
gather(Year, Population, -Country)
Gather all the columns except Country column into Year and Population columns.

select_if

Summary
Select column(s) with a predicate (conditional) function.

Syntax
select_if(<predicate_condition>,...)

Arguments

Example
select_if(is.numeric)
Select only numeric data type columns.
select_if(is.character)
Select only character data type columns.
select_if(function(x) is.numeric(x) && mean(x) > 5)
Select only the columns that are numeric data type and their mean values are greater than 5.

starts_with

Summary
Returns the column names that starts with a given text.

Syntax
starts_with(<text>, ignore.case = <logical>)

Arguments

Example
select(starts_with("X", ignore.case = TRUE))
Select only columns whose name starts with "X"

ends_with

Summary
Returns the column names that ends with a given text.

Syntax
ends_with(<text>, ignore.case = <logical>)

Arguments

Example
select(ends_with("ABR", ignore.case = TRUE))
Select only columns whose name ends with "ABR"

contains

Summary
Returns the column names that contain a given text.

Syntax
contains(<text>, ignore.case = <logical>)

Arguments

Example
select(contains("ABR", ignore.case = TRUE))
Select only columns whose name contains with "ABR"

matches

Summary
Returns the column names that matches with a given text.

Syntax
matches(<text>, ignore.case = <logical>)

Arguments

Example
select(matches("[:digit:]+", ignore.case = TRUE))
Select only columns whose names contains digits (regular expression)

num_range

Summary
Returns the column names that starts with a given text with numbers that are within a given range. It's useful especially when you have column names like X1, X2, X3, X4, etc.

Syntax
num_range(<text>, <start_num>:<end_num>)

Arguments

Example
select(num_range("X", 1:6))
Selects all columns (numerically) from X1 to X5.

one_of

Summary
Returns the column names that are one of the given names.

Syntax
one_of(<text1>, <text2>, ...)

Arguments

Example
select(one_of("x", "y", "z”))
Selects columns whose names are "x", "y", and "z".

everything

Summary
Returns all the column names. It's useful when you want to have particular column(s) first before everything else.

Syntax
everything()

Arguments

Example
select(X, everything())
Reorder columns: keep the column “X" before all the other columns.

mutate_all

Summary
Apply functions to all the columns.

Syntax
mutate_all(funs(<function(s)>), ...)

Arguments

Example
mutate_all(funs(. * 0.5))
Multiply 0.5 to all the column values and override the existing columns.
mutate_all(funs(half = . * 0.5))
Multiply 0.5 to all the column values and create new columns whose names will end with "_half".
mutate_all(funs(min_rank, dense_rank)) Apply min_rank and dense_rank functions to all the column values and create new columns whose names will end with the function name (minrank, dense_rank)._

mutate_at

Summary
Apply functions to specified columns.

Syntax
mutate_at(vars(<column(s)>), funs(<function(s)>), ...)

Arguments

Example
mutate_at(vars(ends_with("DELAY")), min_rank)
Apply 'min_rank' function to the columns whose name ending with "DELAY".
mutate_at(vars(ends_with("DELAY")), funs(. * .5))
Multiply 0.5 to the columns whose name ending with "DELAY".
mutate_at(c("ARR_DELAY", "DEP_DELAY"), min_rank)
Apply 'min_rank' function to "ARR_DELAY" and "DEP_DELAY" columns.
mutate_at(c(1,3), min_rank)
Apply 'min_rank' function to 1st and 3rd columns.

mutate_if

Summary
Apply functions to only the columns that match with a given condition.

Syntax
mutate_if(<predicate_condition>, funs(<function(s)>), ...)

Arguments

Example
mutate_if(is_numeric, mean)
Apply 'mean' function to all the columns that are numeric data type.
mutate_if(is.numeric, funs(. * 0.2))
Multiply all the numeric columns by 0.2 overriding existing columns.
mutate_if(is.numeric, funs(calc = . * 0.2))
Multiply all the numeric columns by 0.2 as new columns.

summarize_all

Summary
Apply functions to all the columns.

Syntax
summarize_all(funs(<function(s)>), ...)

Arguments

Example
summarize_all(mean)
Apply mean function to all the columns values per group and override the existing columns.
summarize_all(mean, na.rm = TRUE)
Apply mean function with argument "na.rm = TRUE" to all the columns values per group.
summarize_all(funs(min, max))
Apply min and max functions to all the columns values per group and create new columns.
summarize_all(funs(average = mean), na.rm = TRUE)
Apply mean function to all the columns values per group and create new columns whose names ending with 'average'.

summarize_at

Summary
Apply functions to specified columns.

Syntax
summarize_at(vars(<column(s)>), funs(<function(s)>), ...)

Arguments

Example
summarize_at(vars(ends_with("DELAY")), mean)
Apply 'mean' function to the columns whose name ending with "DELAY".
summarize_at(vars(ends_with("DELAY")), mean, na.rm = TRUE)
Apply 'mean' function with "na.rm = TRUE" argument to the columns whose name ending with "DELAY".
summarize_at(c("ARR_DELAY", "DEP_DELAY"), mean)
Apply 'mean' function to "ARR_DELAY" and "DEP_DELAY" columns.
summarize_at(c(1,3), mean)
Apply 'mean' function to 1st and 3rd columns.

summarize_if

Summary
Apply functions to only the columns that match with a given condition.

Syntax
summarize_if(<predicate_condition>, )

Arguments

Example
summarize_if(is_numeric, mean)
Apply 'mean' function to all the columns that are numeric data type for each group.

vars

Summary
Returns a list of user selected columns.

Syntax
vars(<column(s)>)

Arguments

Example
vars(starts_with("ARR"))
Select columns whose name starting with "ARR".
vars(ARR_DELAY, ARR_TIME)
Select "ARR_DELAY" and "ARRTIME" columns._

if_else

Summary Returns values based on whether a given condition is satisfied or not.

Syntax

if_else(<condition>, <return_value_when_TRUE>, <return_value_when_FALSE>, missing = NULL)

Arguments

Return Value
Vector

Example

mutate(category = ifelse(price >= 500, "Expensive", "Cheap"))

price category
1000 Expensive
500 Expensive
100 Cheap
50 Cheap

case_when

Summary Returns values based on multiple conditions. This is similar to CASE WHEN conditions in SQL. You can use two sided formula to define the condition and the mapping values, and use TRUE as ELSE condition. See examples below.

Syntax
case_when(...)

Arguments

Return Value
Vector

Example

mutate(category = case_when(.$price > 900 ~ "Super Expensive", .$price >= 500 ~ "Expensive", .$price >= 100 ~ "Mild")))

price category
1000 Super Expensive
500 Expensive
100 Mild
50 NA

mutate(category = case_when(.$price > 900 ~ "Super Expensive", .$price >= 500 ~ "Expensive", .$price >= 100 ~ "Mild", TRUE ~ "Cheap")))

price category
1000 Super Expensive
500 Expensive
100 Mild
50 Cheap

coalesce

Summary
Replace the missing values with a given value or the values that are at the same position in other columns.

Syntax
coalesce(<column>, <column(s)>)

Arguments

Return Value
Vector

Example

Original data:

price price_old
1500 1000
NA 500
NA 100
50 10

mutate(price = coalesce(price, 0))

price price_old
1500 1000
0 500
0 100
50 10

mutate(price = coalesce(price, price_old))

price price_old
1500 1000
500 500
100 100
50 10

na_if

Summary
Replaces the matching values to NA.

Syntax

na_if(<column>, <column_value>)

Arguments

Return Value
Vector

Example

mutate(after = na_if(before, 999))

before after
A A
G G
999 NA

near

Summary
Compares two numeric vectors (columns). This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal. This is safer than using ==, because it has a built in tolerance.

Syntax
near(<column_num>, <column_num>, tol = <numeric>)

Arguments

Return Value
Logical

Example

sqrt(2) ^ 2 == 2 Returns FALSE.

near(sqrt(2) ^ 2, 2) Returns TRUE.

recode

Summary

Recodes (or replaces) the original values with given values. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.

Syntax

recode(<column>, ..., .default = <text>, .missing = <text>)

Arguments

Return Value
Vector

Example

Original data is numeric:

ID
4716
4755
2432

mutate(name = recode(ID, `4716` = "Oracle", `4755` = "Rakuten"))

ID name
4716 Oracle
4755 Rakuten
2432 NA

Since the original value 2432 doesn't have a replacement value yet its data type (numeric) is different from the other replacement values (character) its replacement value becomes NA. If the data type is same then it will use the original value.

mutate(name = recode(ID, `4716` = "Oracle", `4755` = "Rakuten", .default = "Not Known"))

You can use '.default' argument to set the default value.

ID name
4716 Oracle
4755 Rakuten
2432 Not known

Original data is character:

ID
A
G
M

mutate(name = recode(ID, "A" = "Apple", "G" = "Google"))

ID name
A Apple
G Google
M M

Since the original value 'M' doesn't have a replacement value yet its data type (character) is same as the other replacement values (character) it will use the original value as 'M'.

mutate(name = recode(ID, A = "Apple", G = "Google", .default = "Not known"))

You can use '.default' argument to set the default value.

ID name
A Apple
G Google
M Not Known

recode_factor

Summary

Recodes (or replaces) the original values with given values and create a factor column. You can replace numeric values either by the name with backticks or based on their position, and character values by their name.

Syntax

recode_factor(<column>, ..., .default = <text>, .missing = <text>, .ordered = <logical>)

Arguments

Return Value
Factor

Example

Original Data:

ID
M
T
S

mutate(name = recode_factor(ID, S = "Sunday", M = "Mondy", T = "Tuesday", .ordered = TRUE))

ID name
M Monday
T Tuesday
S Sunday

separate_rows

Summary
Separates a column with delimited values into multiple rows.

Syntax
separate_rows(<column_text_date>, sep = <text>, convert=<logical>)

Arguments

Example

Original data:

company investors
Uber Lowercase Capital, Benchmark Capital, Google Ventures
Xiaomi Digital Sky Technologies, QiMing Venture Partners, Qualcomm Ventures

seperate_rows(investors, sep=", ")

company investors
Uber Lowercase Capital
Uber Benchmark Capital
Uber Google Ventures
Xiaomi Digital Sky Technologies
Xiaomi QiMing Venture Partners
Xiaomi Qualcomm Ventures

drop_na

Summary
Drop rows that have NA value.

Syntax
drop_na(<column(s)>,...)
drop_na(-<column(s)>,...)
drop_na(starts_with(<text>, ignore.case = <logical>))
drop_na(ends_with(<text>, ignore.case = <logical>))
drop_na(contains(<text>, ignore.case = <logical>))
drop_na(matches(<text>, ignore.case = <logical>))
drop_na(num_range(<text>, <start_num>:<end_num>))
drop_na(one_of(<text1>, <text2>, ...))
drop_na(<column>, everything())

Arguments

Example

Sample data:

Original data:

location year plant_date harvest_date
Als 1900 NA 35.02899
Als 1901 10.00000 24.02899
Als 1902 12.58863 45.02899
Bea 1900 5.00000 18.00000
Bea 1901 3.31119 10.00000
Bea 1902 29.12891 NA
Bor 1900 12.52136 18.14896
Bor 1901 NA 10.64896
Bor 1902 10.23056 20.64896

drop_na(ends_with("date"))
Returns a data frame without NA values in plant_date and harvest_date columns.

location year plant_date harvest_date
Als 1901 10.00000 24.02899
Als 1902 12.58863 45.02899
Bea 1900 5.00000 18.00000
Bea 1901 3.31119 10.00000
Bor 1900 12.52136 18.14896
Bor 1902 10.23056 20.64896

add_row

Summary
Add rows with given column values.

Syntax
add_row()

Arguments

Return Value
Data Frame

Example

Original Data:

ID name
4716 Oracle
4755 Rakuten
2432 DeNA

add_row(ID = 4689)

ID name
4716 Oracle
4755 Rakuten
2432 DeNA
4689 NA

add_row(ID = 4689, name = "Yahoo")

ID name
4716 Oracle
4755 Rakuten
2432 DeNA
4689 Yahoo

parse_number

Summary
Parse characters, extract numeric values, and convert to number data type including

Syntax
parse_number(<column>, na = c("", "NA"), locale = <locale>)

Arguments

Return Value
Numeric

Example
parse_number("12345")
Returns 12345
parse_number("12345.10")
Returns 12345.1
parse_number("$12,345.10")
Returns 12345.1
parse_number("$12.20M")
Returns 12.2
parse_number(as.Date("2015-01-30") - as.Date("2015-01-15"))
Returns 15

parse_double

Summary
Convert data to double precision number data type.

Syntax
parse_double(<column>, na = c("", "NA"), locale = <locale>)

Arguments

Return Value
Double

Example
parse_double("12345")
Returns 12345
parse_double("12345.10")
Returns 12345.1

parse_euro_double

Summary
Convert data to double precision number data type.

Syntax
parse_euro_double(<column>, na = c("", "NA"))

Arguments

Return Value
Double

Example
parse_euro_double("12345")
Returns 12345
parse_euro_double("12345.10")
Returns 12345.1

parse_integer

Summary
Convert data to integer number data type.

Syntax
parse_integer(<column>, na = c("", "NA"), locale = <locale>)

Arguments

Return Value
Integer

Example
parse_integer("12345")
Returns 12345
parse_integer("12345.10")
Returns 12345

parse_time

Summary
Convert data to Time data type.

Syntax
parse_time(<column>, format = <time_format>, locale = <locale>)

Arguments

Return Value
time

Example
parse_time("13:10:05")
Returns "13:10:05"
parse_time("13:10:05", locale = locale(tz = "America/Los_Angeles"))
Returns "13:10:05"
parse_time("10:20:15 PM") Returns "22:20:15"
parse_time("10:20:15 午後", local = locale("ja"))
Returns "22:20:15"

parse_character

Summary
Convert a given data to character data type.

Syntax
parse_character(<column>, na = c("", "NA"), locale = <locale>)

Arguments

Return Value
Character

Example
parse_character(123.11)
Returns "123.11" as character.

parse_factor

Summary
Parse data and convert to factor data tyep.

Syntax
parse_factor(x, levels, ordered = <logical>, na = c("", "NA"), locale = <locale>)

Arguments

Return Value
Factor

Example
columnA - "iMac", "iPod", "iPhone", "iPod", "iPhone"
parse_factor(columnA)
Returns iMac, iPod, iPhone, iPod, iPhone (It has Levels information of : iMac iPhone iPod)

parse_logical

Summary
Convert data to Logical (Boolean) data type.

Syntax
parse_logical(<column>, na = c("", "NA"), locale = <locale>)

Arguments

Return Value
Logical

Example
parse_logical("TRUE")
Returns TRUE
parse_logical("T")
Returns TRUE

type_convert

Summary
Heuristically guess the data type for each column by reading the first 1000 rows, parse the data, and set appropriate data types for all the columns of the data frame.

Syntax
type_convert()

Arguments

Return Value
Data Frame

Example
type_convert()

yq

Summary
Convert text to Date data type if a given column holds values that look like Year and Quoter.

Syntax
yq(<column>, tz = <timezone>, locale = <locale>)

Arguments

Return Value
Date

Example
yq("2016.2")
returns "2016-04-01"
yq("2016-03") returns "2016-07-01"

as_datetime

Summary
Convert a given data to date.

Syntax
as_datetime(<column_date>, format = <date_time_format>, tz = <timezone>, origin = <text>)

Arguments

Return Value
POSIXct

Example

as_datetime("2015-10-10 13:10:05") Returns "2015-10-10 13:10:05 PDT"
as_datetime("2015-10-10 13:10:05", tz = "America/Los_Angeles") Returns "2015-10-10 13:10:05 PDT"
as.as_datetime("2015-10-10 13:10:05", tz = "Asia/Tokyo") Returns "2015-10-10 13:10:05 JST"
as_datetime("05-10-15T13:10:05", format = "%d-%m-%yT%H:%M") Returns "2015-10-05 13:10:00 PDT"

as.hms

Summary
Convert a given data to hms data type. It's difftime dealt as seconds.

Syntax
as.hms(<column>)

Arguments

Return Value
hms

Example
as.hms(20)
returns 20 seconds difftime

is.hms

Summary
Returns true if a given object is hms type.

Syntax
is.hms(<column>)

Arguments

Return Value
Logical

Example
mutate_if(is.hms, as.character)
Convert hms columns into character columns. An example of the value is "05:01:20".

remove_empty_rows

Summary
Remove rows whose column values are all NAs.

Syntax
remove_empty_rows()

Example

Original data:

item_name transaction_id
banana 1
apple 1
NA NA
carrot 2
apple 2
NA NA
orange 3
NA 3

remove_empty_rows()
Returns the data without all NA rows.

item_name transaction_id
banana 1
apple 1
carrot 2
apple 2
orange 3
NA 3

remove_empty_cols

Summary
Remove columns whose values are all NAs.

Syntax
remove_empty_cols()

Example

Original data:

item_name transaction_id tag
banana 1 NA
apple 1 NA
carrot 2 NA
apple 2 NA
orange 3 NA
NA 3 NA

remove_empty_cols()
Returns the data without all NA columns.

item_name transaction_id
banana 1
apple 1
carrot 2
apple 2
orange 3
NA 3

clean_names

Summary
Make column names clean by using only _ character, lowercase letters, and numbers.

Syntax
clean_names()

Example

Original data:

Item Name .transaction
banana 1
apple 1
carrot 2
apple 2
orange 3
NA 3

clean_names()
Returns data with clean column names.

item_name _transaction
banana 1
apple 1
carrot 2
apple 2
orange 3
NA 3

get_dupes

Summary
Get rows that are duplicated.

Syntax
get_dupes(<column(s)>)

Example

Original data:

name n
banana 1
apple 1
apple 1
apple 2

get_dupes(name)
Returns rows that are duplicated in name column.

name n
apple 1
apple 1
apple 2

get_dupes()
Returns rows that are duplicated in all columns.

name n
apple 1
apple 1

excel_numeric_to_date

Summary
Convert number of date from excel to Date.

Syntax
excel_numeric_to_date(<column_num>, date_system = "modern"|"mac pre-2011")

Arguments

Example

Original data:

user last_login
Lisa 42370
Emily 42318
John 42489

mutate(last_login = excel_numeric_to_date(last_login)) Returns data frame with lastlogin column as Date._

user last_login
Lisa 2016-01-01
Emily 2015-11-10
John 2016-04-29

Return Value
Date

convert_to_NA

Summary
Convert specific values to NA.

Syntax
convert_to_NA(strings = <character_vector>)

Arguments

Example

Original data:

item_name transaction_id
banana 1
apple 1
-99 -99
carrot 2
apple 2
-99 -99
orange 3
-99 -99

convert_to_NA("-99")
Convert -99 to NA even if it's numeric.

item_name transaction_id
banana 1
apple 1
NA NA
carrot 2
apple 2
NA NA
orange 3
NA NA

tabyl

Summary
Create frequency table.

Syntax
tabyl(, sort = FALSE, show_na = TRUE)

Arguments

Example

Original data:

item_name transaction_id
banana 1
apple 1
NA NA
carrot 2
apple 2
NA NA
orange 3
NA 3

tabyl(item_name)
Show ratio of item_name with NA and without it.

item_name n percent valid_percent
apple 1 0.125 0.2
apple 1 0.125 0.2
banana 1 0.125 0.2
carrot 1 0.125 0.2
orange 1 0.125 0.2
NA 3 0.375 NA

as_factor

Summary
Convert a given column to a factor. Compared to base R's as.factor, this function creates levels in the order in which they appear, which will be the same on every platform (base R sorts in the current locale which can vary from place to place).

Syntax
as_factor(<column_char_num_date>)

Arguments

Example

x <- c("FR", "US", "JP")
as_factor(x)
Returns a factor with FR, US, JP with levels of FR, US, JP.

fct_anon

Summary
Anonymize factor levels. Replaces factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved.

Syntax
fct_anon(<column_char_num_factor>, prefix = <text>)

Arguments

Example

x <- factor(c("FR", "US", "JP"))
fct_anon(x)
Returns a factor with 1, 2, 3.

fct_expand

Summary
Add additional levels to a factor

Syntax
fct_expand(<column_char_num_factor>, <text>)

Arguments

Example

x <- factor(c("FR", "US"))
fct_expand(x, "JP", "CA")
Returns a factor with levels, FR US JP CA.

fct_drop

Summary
Drop unused levels.

Syntax
fct_drop(<column_char_num_factor>, only = <text>)

Arguments

Example

x <- factor(c("FR", "US"), levels = c("FR", "US", "JP", "CA"))
fct_drop(x)
Returns a factor with levels of FR, US.

fct_drop(x, only = "CA")
Returns a factor with levels of FR, US, JP.

fct_explicit_na

Summary
This gives missing value an explicit factor level, ensuring that they appear in the charts.

Syntax
fct_explicit_na(<column_char_num_factor>, na_level = <text>)

Arguments

Example

x <- factor(c("FR", NA, "US", NA, "JP"))
fct_explicit_na(x, "No value")
Returns a factor with FR, No value, US, No value, JP and with levels of FR JP US No value

fct_lump

Summary
Makes least/most common factor levels into "other".

Syntax
fct_lump(<column_char_num_factor>, n = <number>, prop = <number>, other_level = <text>, ties.method = <ties_method_type>)

Arguments

If both n and prop are missing, fct_lump lumps together the least frequent levels into "other", while ensuring that "other" is still the smallest level.

Example

mutate(STATE = fct_lump(STATE))

This will make STATE column to keep most of the values in order to make the 'Other' level will have the smallest counts compared to all the other levels.

mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, n = 10))

This will make STATE column to keep only 10 States from the most frequent ones and add 'Other' level for putting all the others together.

mutate(ORIGIN_STATE_ABR = fct_lump(ORIGIN_STATE_ABR, prop = .1))

This will make STATE column to keep only States that have more than 10% proportion of all the entries and add 'Other' level for putting all the others together.

fct_inorder

Summary
Reorder factors levels by first appearance.

Syntax
fct_inorder(<column_char_num_factor>, ordered = <logical>)

Arguments

Example

x <- factor(c("FR", "US", "US", "JP", "US", "JP"))
fct_inorder(x)
Returns a factor with levels in the following order, FR US JP.

x <- c("FR", "US", "US", "JP", "US", "JP") -- it can be used for Character data type column. fct_inorder(x)
Returns a factor with levels in the following order, FR US JP.

fct_infreq

Summary
Reorder factors levels by frequency (the most frequently appears to the least.)

Syntax fct_infreq(<column_char_num_factor>, ordered = <logical>)

Arguments

Example

x <- factor(c("FR", "US", "US", "JP", "US", "JP"))
fct_infreq(x)
Returns a factor with levels in the following order, US JP FR.

x <- c("FR", "US", "US", "JP", "US", "JP") -- it can be used for Character data type column. fct_infreq(x)
Returns a factor with levels in the following order, US JP FR.

fct_relevel

Summary
Sets the level of a given factor column by moving any number of levels to any location.

Syntax
fct_relevel(<column_char_num_factor>, ...)

Arguments

Example
x <- c("France", "Japan", "Australia", "United States") -- creating a character column. fct_relevel(x, "Japan")
Return as Japan, Australia, France, United States. The character column is converted to factor with the default level setting, which is an alphabetical order.

x <- factor(level = c("France", "Japan", "Australia", "United States")) -- creating a factor column with levels. fct_relevel(x, "Japan")
Return as Japan, France, Australia, United States

fct_relevel(x, "Japan", "United States")
Return as Japan, United States, France, Australia

fct_reorder

Summary
Reorder factor levels by sorting based on another variable

Syntax

fct_reorder(<column_char_num_factor>, <column_num>, fun = <aggregate_function>, ..., .desc = <logical>)

Arguments

Example

mutate(CARRIER = fct_reorder(CARRIER, ARR_DELAY))

This will make CARRIER column to be factor with the following level order. Levels: VX 9E DL FL UA AS AA B6 MQ US WN YV EV OO HA F9 (VX has the smallest median ARRDELAY value._

mutate(CARRIER = fct_reorder(CARRIER, ARR_DELAY, .desc = TRUE))

This will make CARRIER column to be factor with the following level order. Levels: F9 HA EV OO B6 MQ US WN YV AA AS UA 9E DL FL VX (F9 has the largest median ARRDELAY value._

mutate(CARRIER = fct_reorder(CARRIER, DISTANCE))

This will make CARRIER column to be factor with the following level order. Levels: HA YV OO 9E EV MQ WN US FL DL F9 AS AA B6 UA VX (HA has the smallest median distance.

mutate(CARRIER = fct_reorder(CARRIER, ARR_DELAY, fun = sum, na.rm= TRUE))

This will make CARRIER column to be factor with the following level order. Levels: DL FL VX AS UA HA US WN YV F9 9E B6 AA MQ OO EV (DL has the smallest total ARRDELAY value._

fct_rev

Summary
Reverse order of factor levels.

Syntax

fct_rev(<column_factor>)

Arguments

Example
x <- factor(c("FR", "US", "JP"))
This creates a factor column with levels in an order of, FR, JP, US. (alphabetical order as default.)

fct_rev(x)
Returns a factor with levels in an order of US, JP, FR.

xgboost_reg

Summary
Create extreme gradient boosting model for regression.

Syntax
xgboost_reg(output_type = <xgb_reg_output_type>, nrounds = <integer>, booster = <xgboost_booster_type>, eval_metric = <xgb_reg_evaluation_type>, weight = <column_num>, early_stopping_rounds = <integer>)

Arguments

Return Value
Model

Example

Original data:

CARRIER ARR_DELAY DEP_DELAY DISTANCE
UA 10 7 300
UA 20 30 1000
UA -5 0 200
AA 20 20 500
AA -5 3 700
AA 10 8 1000

%>% group_by(CARRIER)

%>% build_model(model_func = xgboost_reg, formula = DEP_DELAY ~ ARR_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a xgb.Booster model based on a formula defined as 'Predict DEP_DELAY based on ARR_DELAY and DISTANCE.' It also returns a column of original data.

CARRIER source_data model .test_index
UA source dataframe xgb.Booster model c(1)
AA source dataframe xgb.Booster model c(2)

You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.

xgboost_binary

Summary
Create extreme gradient boosting model for binary classification.

Syntax
xgboost_binary(output_type = <xgb_binary_output_type>, nrounds = <integer>, booster = <xgboost_booster_type>, eval_metric = <xgb_binary_evaluation_type>, weight = <column_num>, early_stopping_rounds = <integer>)

Arguments

Return Value
Model

xgboost_multi

Summary
Create extreme gradient boosting model for binary classification.

Syntax
xgboost_multi(output_type = <xgb_multi_output_type>, nrounds = <integer>, booster = <xgboost_booster_type>, eval_metric = <xgb_multi_evaluation_type>, weight = <column_num>, early_stopping_rounds = <integer>)

Arguments

Return Value
Model

Example

randomForestReg

Summary
Create random forest model for regression.

Syntax
randomForestReg(subset = <column_logical>, na.action = <'na_action_type'>, ntree = <integer>, mtry = <numeric>, replace = <logical>, strata = <column>, sampsize = <integer>, nodesize = <integer>, maxnodes = <integer>, importance = <logical>, localImp = <logical>, nPerm = <integer>, proximity = <logical>, oob.prox = <logical>, keep.forest = <logical>, corr.bias = <logical>, keep.inbag = <logical>)

Arguments

Return Value
Model

Example

randomForestBinary

Summary
Create random forest model for binary classification.

Syntax
randomForestReg(subset = <column_logical>, na.action = <'na_action_type'>, ntree = <integer>, mtry = <numeric>, replace = <logical>, classwt = <numeric>, cutoff = <numeric>, strata = <column>, sampsize = <integer>, nodesize = <integer>, maxnodes = <integer>, importance = <logical>, localImp = <logical>, nPerm = <integer>, proximity = <logical>, oob.prox = <logical>, norm.votes = <logical>, keep.forest = <logical>, corr.bias = <logical>, keep.inbag = <logical>)

Arguments

Return Value
Model

Example

randomForestMulti

Summary
Create random forest model for multi class classification.

Syntax
randomForestReg(subset = <column_logical>, na.action = <'na_action_type'>, ntree = <integer>, mtry = <numeric>, replace = <logical>, classwt = <numeric>, cutoff = <numeric>, strata = <column>, sampsize = <integer>, nodesize = <integer>, maxnodes = <integer>, importance = <logical>, localImp = <logical>, nPerm = <integer>, proximity = <logical>, oob.prox = <logical>, norm.votes = <logical>, keep.forest = <logical>, corr.bias = <logical>, keep.inbag = <logical>)

Arguments

Return Value
Model

Example

list_extract

Summary
Extract a value of a given list data type column based on a position or name when the data inside the list is data frame.

Syntax
list_extract(<column_list>, position = <numeric_text>, rownum = <numeric>)

Arguments

Return Value
Character

Example

1) Column is List of Character

mutate(text = list_extract(categories, 1))
Returns the following.

category text
c("Doctors", "Health & Medical") Doctors
c("Bars", "American (New)", "Nightlife") Bars

mutate(text = list_extract(categories, -1))
Returns the value of the 1st element counted from the last.

category text
c("Doctors", "Health & Medical") Health & Medical
c("Bars", "American (New)", "Nightlife") Nightlife

2) Column is List of Data Frame

mutate(text = list_extract(labels, 1))
Returns the value of the 1st column and 1st row. The default is 1st row when not specified.

labels text
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) Regression

mutate(text = list_extract(labels, -1))
Returns the value of the last column and 1st row..

labels text
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) 5319e7

mutate(text = list_extract(labels, 1, 2))
Returns the value of the 1st column and 2nd row.

labels text
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) UI

mutate(text = list_extract(labels, "name"))
Returns the value of a column, 'name', and the 1st row. The default is 1st row when not specified.

labels text
list(name = c("Regression", "UI"), color = c("5319e7", "207de5")) Regression

list_to_text

Summary
Concatenates texts from all the elements of a given list data type column.

Syntax
list_to_text(<column_list> , sep = <text>)

Arguments

Return Value
character

Example
mutate(category_text = list_to_text(category))
Returns the following.

category category_text
c("Doctors", "Health & Medical") Doctors, Health & Medical
Nightlife Nightlife
c("Active Life", "Mini Golf", "Golf") Active Life, Mini Golf, Golf
c("Bars", "American (New)", "Nightlife") Bars, American (New), Nightlife
c("Bars", "American (Traditional)", "Nightlife") Bars, American (Traditional), Nightlife

list_concat

Summary
Concatenates values from multiple columns into a list.

Syntax
list_concat(<column(s)>, collapse = <logical>)

Arguments

Return Value
list

Example

mutate(items = list_concat(item1, item2))
Returns the following.

Before:

name item1 item2
Cathy c("Banana", "Apple") "Apple"
Mary "Lemon" c("Lemon", "Banana")
Lisa c("Lemon", "Banana") c("Banana", "Apple")

After:

name item1 item2 items
Cathy c("Banana", "Apple") "Apple" c("Banana", "Apple", "Apple")
Mary "Lemon" c("Lemon", "Banana") c("Lemon", "Lemon", "Banana")
Lisa c("Lemon", "Banana") c("Banana", "Apple") c("Lemon", "Banana", "Banana", "Apple")

summarize(categories = list_concat(category, collapse = TRUE))
Returns the following.

Before:

grouped by "region"

region category
East c("Doctors", "Health & Medical")
East Nightlife
East c("Active Life", "Mini Golf", "Golf")
West c("Bars", "American (New)", "Nightlife")
West c("Bars", "American (Traditional)", "Nightlife")

After:

region categories
East c("Doctors", "Health & Medical", "Nightlife", "Active Life", "Mini Golf", "Golf")
West c("Bars", "American (New)", "Nightlife", "Bars", "American (Traditional)", "Nightlife")

list_n

Summary
Returns number of elements inside a given list data type column for each row.

Syntax
list_n(<column_list>)

Arguments

Return Value
Numeric

Example
mutate(count = list_n(categories)) Returns the following.

category count
c("Doctors", "Health & Medical") 2
Nightlife 1
c("Active Life", "Mini Golf", "Golf") 3
c("Bars", "American (New)", "Nightlife", "Lounges", "Restaurants") 5
c("Bars", "American (Traditional)", "Nightlife", "Restaurants") 4

normalize

Summary
Centers and/or scales the numeric values of a given column.

Syntax
normalize(<column_num>, center = <logical>, scale = <logical>)

Arguments

Return Value
Numeric

Example
// x <- c(1, 2, 3, 4, 5, 100)
normalize(x, center = 0) Returns -0.4584610 -0.4332246 -0.4079882 -0.3827518 -0.3575154 2.0399410. These are same as the below.
(x - mean(x)) / sd(x)

normalize(x, center = FALSE) Returns 0.02229944 0.04459888 0.06689832 0.08919776 0.11149720 2.22994404. This is same as the below.
x / sqrt(sum(x^2)/(length(x)-1))

normalize(x, center = 0, scale = max(x)))
Returns 0.01 0.02 0.03 0.04 0.05 1.00

anonymize

Summary
Anonymize values by hashing algorithms.

Syntax
anonymize(<column>, algo = <hash_algorithms>, seed = <integer>, chars = <character>, n_chars = <integer>)

Arguments

Example

Original data:

user login
Lisa 1482148710
Lisa 1484148710
Emily 1479913692
John 1481128318

mutate(user = anonymize(user, algo = "murmur32") Returns anonymized values .

user login
6fe21ea2 1482148710
6fe21ea2 1484148710
2f968d4a 1479913692
9b5f08f7 1481128318

Return Value
Character

unixtime_to_datetime

Summary
Convert unix time numeric values to POSIXct.

Syntax
unixtime_to_datetime(<column_num>)

Example

Original data:

user last_login
Lisa 1482148710
Emily 1479913692
John 1481128318

mutate(last_login = unixtime_to_datetime(last_login)) Returns data frame with lastlogin column as Date._

user last_login
Lisa 2016-12-19 11:58:30 GMT
Emily 2016-11-23 15:08:12 GMT
John 2016-12-07 16:31:58 GMT

Return Value
Date, POSIXct

do_svd.kv

Summary
Calculates coordinations by reducing dimensionality using SVD (Singular Value Decomposition).

Syntax
do_svd.kv(<subject_column>, <key_column>, <value_column>, type = <svd_type>, fill = <numeric>, fun.aggregate = <aggregate_function>, n_component = <integer>, centering = <logical>, output = <svd_output>)

Arguments

Return Value
Data frame

Example

Original data:

author_name keywords tfidf_value
Cathy party 3.2
Cathy exciting 1.3
Cathy friends 0.8
Mary study 2.4
Mary exam 2.1
Mary hard 1.5
Lisa sports 1.9
Lisa exciting 1.8
Lisa hard 1.6

do_svd.kv(document_name, keywords, tfidf_value, n_component=2)
Reducing the number of the dimensions (keywords) to 2 and returning the result in a 'long' data form.

author_name new.dimension value
Cathy 1 -0.7131255
Cathy 2 0.3976414
Lisa 1 0.0121952
Lisa 2 -0.8164055
Mary 1 0.7009303
Mary 2 0.4187641

do_svd.kv(name, key, value, n_component=2, output="wide")
Reducing the number of the dimensions (keywords) to 2 and returning the result in a 'wide' data form.

author_name axis1 axis2
Cathy -0.7131255 0.3976414
Lisa 0.0121952 -0.8164055
Mary 0.7009303 0.4187641

do_svd.kv(name, key, value, n_component=2, type="dimension")
Reducing the number of the dimensions (keywords) to 2 and returning the result by 'dimensions' instead of by 'group'.

top_key_words new.dimension value
exam 1 0.4163953
exam 2 0.3259924
exciting 1 -0.2560436
exciting 2 -0.3531243
friends 1 -0.1613866
friends 2 0.1179235
hard 1 0.3029449
hard 2 -0.2513704
party 1 -0.6455466
party 2 0.4716940
sports 1 0.0065547
sports 2 -0.5750138
study 1 0.4758803
study 2 0.3725628

do_svd.kv(name, key, value, n_component=2, type="variance")
Reducing the number of the dimensions (keywords) to 2 and returning the 'variance' values for each dimension

new.dimension value
1 3.534991
2 2.697623

build_lm

Summary
Builds a linear regression model (lm) and store it in a data frame.

Syntax
build_lm(<formula>, test_rate = <numeric>, subset = <numeric_vector>, weights = <numeric_vector>, na.action = <na_action_type>, method = <lm_method_type>, model = <logical>, x = <logical>, y = <logical>, qr = <logical>, singular.ok = <logical>, contrasts = <list>, offset = <numeric_vector>)

Arguments

Return Value
Data frame

Example

Original data:

CARRIER ARR_DELAY DEP_DELAY DISTANCE
UA 10 7 300
UA 20 30 1000
UA -5 0 200
AA 20 20 500
AA -5 3 700
AA 10 8 1000

%>% group_by(CARRIER)

%>% build_lm(ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a lm (Linear Regression) model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.' It also returns a column of original data.

CARRIER source_data model .test_index
UA source dataframe lm model c(1)
AA source dataframe lm model c(2)

You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.

do_survfit

Summary
Calculates Survival Curve from survival time and survival status.

Syntax
do_survfit(<column_num>, <column_num_logic>, weights = <numeric_vector>, subset = <numeric_vector>, na.action = <na_action_type>, id = <id_column>, istate = <column>, type = <survival_curve_type>, error = <survfit_error_type>, conf.type = <survfit_conf_int_type>, conf.lower = <survfit_lower_conf_int_type>, conf.int = <numeric>, se.fit = <logical>, influence = <logical>)

Arguments

Return Value
Data frame

Example

Original data:_

status time
1 40
1 30
0 35
0 8

do_survfit(time, status) Returns a data frame as follows.

time n_risk n_event n_censor estimate std_error conf_high conf_low
8 4 0 1 1.0000000 0.0000000 1 1.0000000
30 3 1 0 0.6666667 0.4082483 1 0.2995071
35 2 0 1 0.6666667 0.4082483 1 0.2995071
40 1 1 0 0.0000000 Inf NA NA

prediction_survfit

Summary
Simulates Survival Curve for specified cohort based on a survival model.

Syntax
prediction_survfit(newdata = <cohort_data_set>, na.action = <na_action_type>, type = <coxph_survival_curve_type>, conf.type = <survfit_conf_int_type>, conf.int = <numeric>, se.fit = <logical>, censor = <logical>, start.time = <numeric>)

Arguments

Return Value
Data frame

Example

Original data: model data frame.

prediction_survfit(newdata = data.frame(age = c(50, 60), sex = (0, 1))) Returns a data frame as follows.

time n_risk n_event n_censor age_sex estimate std_error conf_high conf_low
8 4 0 1 50_0: 1.0000000 0.0000000 1 1.0000000
30 3 1 0 50_0: 0.6666667 0.4082483 1 0.2995071
35 2 0 1 50_0: 0.6666667 0.4082483 1 0.2995071
40 1 1 0 50_0: 0.0000000 Inf NA NA

build_lr

Summary
Builds logistic regression model and store it in a data frame.

Syntax
build_lr(<formula>, test_rate = <numeric>, weights = <numeric_vector>, subset = <numeric_vector>, na.action = <na_action_type>, start = <numeric_vector>, etastart = <numeric_vector>, mustart = <numeric_vector>, offset = <numeric_vector>, epsilon = <numeric>, maxit = <integer>, trace = <logical>, model = <logical>, method = <glm_method_type>, contrasts = <list>, x = <logical>, y = <logical>)

Arguments

Example
Original data:

CARRIER IS_DELAYED DEP_DELAY DISTANCE
UA TRUE 7 300
UA TRUE 30 1000
UA FALSE 0 200
AA TRUE 20 500
AA FALSE 3 700
AA TRUE 8 1000

%>% group_by(CARRIER)

%>% build_lr(IS_DELAYED ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a GLM model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.'

CARRIER source_data model .test_index
UA source dataframe glm model c(1)
AA source dataframe glm model c(2)

You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.

build_glm

Summary
Builds generalized linear models (glm) and store it in a data frame.

Syntax
build_glm(<formula>, test_rate = <numeric>, family = <glm_family_type>, weights = <numeric_vector>, subset = <numeric_vector>, na.action = <na_action_type>, start = <numeric_vector>, etastart = <numeric_vector>, mustart = <numeric_vector>, offset = <numeric_vector>, epsilon = <numeric>, maxit = <integer>, trace = <logical>, model = <logical>, method = <glm_method_type>, contrasts = <list>, x = <logical>, y = <logical>)

Arguments

Example

Original data:

CARRIER ARR_DELAY DEP_DELAY DISTANCE
UA 10 7 300
UA 20 30 1000
UA -5 0 200
AA 20 20 500
AA -5 3 700
AA 10 8 1000

%>% group_by(CARRIER)

%>% build_glm(ARR_DELAY ~ DEP_DELAY + DISTANCE, test_rate = 0.1)
Returns a data frame that stores a GLM model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.'

CARRIER source_data model .test_index
UA source dataframe lm model c(1)
AA source dataframe lm model c(2)

You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.

build_multinom

Summary
Builds multinomial logistic regression model and store it in a data frame.

Syntax
build_multinom(<formula>, test_rate = <numeric>, subset = <numeric_vector>, weights = <numeric_vector>, na.action = <na_action_type>, Hess = <logical>, summ = <multinom_summ_type>, censored = <logical>, model = <logical>, contrasts = <list>)

Arguments

Example

Original data:

CARRIER ARR_DELAY DEP_DELAY DISTANCE
UA 10 7 300
UA 20 30 1000
DL -5 0 200
AA 20 20 500
AA -5 3 700
AA 10 8 1000

%>% build_multinom(CARRIER ~ ARR_DELAY + DEP_DELAY, test_rate = 0.1)
Returns a data frame that stores a GLM model based on a formula defined as 'Predict ARR_DELAY based on DEP_DELAY and DISTANCE.'

source_data model .test_index
source dataframe multinom model c(1)

You can use model_coef or model_stats function to get the summary information about the models. Also, you can use prediction function to predict with the data that is stored in the same data frame.

build_coxph

Summary
Builds Cox Proportional Hazard Model for survival analysis and store it in a data frame.

Syntax
build_coxph(<formula>, test_rate = <numeric>, subset = <numeric_vector>, weights = <numeric_vector>, na.action = <na_action_type>, init = <numeric>, ties = <coxph_tie_type>, singular.ok = <logical>, model = <logical>, x = <logical>, y = <logical>, tt = <function>)

Arguments

Example

Original data:

time status age sex
40 2 55 female
30 2 68 male
35 1 59 male
8 1 80 female

build_coxph(survival::Surv(time, status) ~ age + sex) Builds a Cox Proportional Hazard Model that estimates survival of the subjects based on age and sex. The resulting model is stored in returning data frame. You can use model_coef or **modelstats** function to get the summary information about the models. Also, you can use prediction_coxph function to predict with the data that is stored in the same data frame._

prediction

Summary
Returns a data frame with regression information about a given model.

Syntax
prediction(data = "training"|"test")

Arguments * data (Optional) - The default is "training". Type of data to use in the prediction.

Return Value
Data Frame

Example

Original data: model data frame.

prediction(data = "test")
Returns a data frame with predicted values.

prediction_binary

Summary
Returns a data frame with binary classification information about a given model.

Syntax
prediction_binary(<predicted_probability_column>, <actual_value_column>, data = "training"|"test", threshold = <binary_metrics>)

Arguments

Return Value
Data Frame

Example

Original data: binary classification model data frame.

prediction_binary(data = "test", threshold = 0.2)
Returns a data frame with predicted values of binary classification.

prediction_coxph

Summary
Returns a data frame with predicted values of Cox Proportional Hazard Model.

Syntax
prediction_coxph(data = "training"|"test", type.predict = <coxph_predict_type>, type.residuals = <coxph_residuals_type>)

Arguments * data (Optional) - The default is "training". Type of data to use in the prediction. * type.predict (Optional) - The default is "lp". The type of prediction. This can be * "lp" * "risk" * "expected" * type.residuals (Optional) - The default is "martingale". The type of residuals. This can be * "martingale" * "deviance" * "score" * "schoenfeld" * "scaledsch" * "dfbeta" * "dfbetas"

Return Value
Data Frame

Example

Original data: coxph model data frame.

prediction_coxph(data = "test")
Returns a data frame with predicted values.

model_coef

Summary
Returns a data frame with 'Parameter Estimates (Coefficients)' information about a given model including the below.

Syntax
model_coef(conf_int = <conf_int_type>, conf.level = <numeric>)

Arguments

Return Value
Data Frame

Example

Original data: lm model or glm model

model_coef(conf_int = "default", conf.level = 0.90)
Returns a data frame with 'Parameter Estimates' information.

model_stats

Summary
Returns a data frame with 'Summary of Fit' information about a given model including the below.

Syntax
model_stats()

Example

Original data: lm model or glm model

model_stats()
Returns a data frame with 'Summary of Fit' information.

Return Value
Data Frame

model_anova

Summary
Returns a data frame with anova test information about a given model including the below.

Syntax
model_anova()

Return Value
Data Frame

Example

Original data: lm model or glm model

model_anova()
Returns a data frame with anova test information.

evaluate_regression

Summary
Returns a data frame with evaluation score of regression including the below.

Syntax
evaluate_regression(<predicted_value_column>, <actual_value_column>)

Arguments

Return Value
Data Frame

Example

Original data:

ARR_DELAY predicted_value
10 7
20 30
-5 0
20 20
-5 3
10 8

evaluate_regression(predicted_value, ARR_DELAY)
Returns a data frame with evaluation scores of regression.

evaluate_binary

Summary
Returns a data frame with evaluation score of binary classification including the below.

Syntax
evaluate_binary(<predicted_probability_column>, <actual_value_column>, threshold = <binary_metrics>)

Arguments

Return Value
Data Frame

Example

Original data:

IS_DELAYED predicted_probability
TRUE 0.82
TRUE 0.67
FALSE 0.34
TRUE 0.98
FALSE 0.19
TRUE 0.26

evaluate_binary(predicted_value, ARR_DELAY, threshold = "accuracy")
Returns a data frame with evaluation scores of binary classification.

evaluate_multi

Summary
Returns a data frame with evaluation score of multi classification including the below.

Syntax
evaluate_multi(<predicted_label_column>, <actual_value_column>)

Arguments

Return Value
Data Frame

Example

Original data:

CARRIER predicted_label
AA AA
9E 9E
AS AA
9E 9E
AA AA
AS FL

evaluate_binary(predicted_value, ARR_DELAY, threshold = "accuracy")
Returns a data frame with evaluation scores of multi classification.

micro_f_score macro_f_score accuracy misclassification_rate
0.6666667 0.45 0.6666667 0.3333333

do_roc

Summary
Returns coordinates of roc curve.

Syntax
do_roc(<predicted_probability_column>, <actual_value_column>)

Arguments

Return Value
Data Frame

Example

Original data:

IS_DELAYED predicted_probability
TRUE 0.82
TRUE 0.67
FALSE 0.34
TRUE 0.98
FALSE 0.19
TRUE 0.26

do_roc(predicted_value, ARR_DELAY)
Returns a data frame with true_positive_rate and false_positive_rate.

true_positive_rate false_positive_rate
0.00 0.0
0.25 0.0
0.50 0.0
0.75 0.0
0.75 0.5
1.00 0.5
1.00 1.0

With this data, you can assign "false_positive_rate" to X axis and "truepositive_rate" to Y axis to draw ROC curve._

build_kmeans.cols

Summary
Builds a clustering model (k-means) from variable columns and returns the summary of the model or the augmented data depending on the parameter value.

Syntax
build_kmeans.cols(<column(s)>, centers = <integer>, fill = <numeric>, fun.aggregate = <aggregate_function>, seed = <integer>, iter.max = <integer>, nstart = <integer>, algorithm = <kmeans_algorithms>, trace = <logical>, augment = <logical>, keep.source = <logical>)

Arguments

Return Value
Data frame

Example

Original data:

location 1900 1901 1902
Als 35.02899 24.02899 45.02899
Bea 18.00000 10.00000 27.00000
Bor 18.14896 10.64896 20.64896

build_kmeans.cols(-location, centers=2, augment=TRUE)
Builds a K-means clustering model selecting all the columns other than 'location' and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.

location X1900 X1901 X1902 .cluster
Als 35.02899 24.02899 45.02899 2
Bea 18.00000 10.00000 27.00000 1
Bor 18.14896 10.64896 20.64896 1

build_kmeans.kv

Summary
Builds a clustering model (k-means) from key-value columns and store the model into a generated data frame or augment the original data with the clustered ID.

Syntax
build_kmeans.kv(<subject_column>, <key_column>, <value_column>, centers = <integer>, seed = <integer>, iter.max = <integer>, nstart = <integer>, algorithm = <kmeans_algorithms>, trace = <logical>, augment = <logical>, keep.source = <logical>)

Arguments

Return Value
Data frame

Example

Original data:

location year harvest_date
Als 1900 35.02899
Als 1901 24.02899
Als 1902 45.02899
Bea 1900 18.00000
Bea 1901 10.00000
Bea 1902 27.00000
Bor 1900 18.14896
Bor 1901 10.64896
Bor 1902 20.64896

build_kmeans.kv(location, year, harvest_date, centers=2, augment=TRUE)

Builds a K-means clustering model selecting 'location' as the subject to cluster, 'year' as the dimension, and 'harvest_date' as the value, and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.

location year harvest_date .cluster
Als 1900 35.02899 2
Als 1901 24.02899 2
Als 1902 45.02899 2
Bea 1900 18.00000 1
Bea 1901 10.00000 1
Bea 1902 27.00000 1
Bor 1900 18.14896 1
Bor 1901 10.64896 1
Bor 1902 20.64896 1

build_kmeans

Summary
Builds a clustering model (k-means). Stores the model into a generated data frame or the augmented data depending on the parameter value.

Syntax
build_kmeans(<column(s)>, skv = <character_vector>, fill = <numeric>, fun.aggregate = <aggregate_function>, centers = <integer>, iter.max = <integer>, nstart = <integer>, algorithm = <kmeans_algorithms>, trace = <logical>, seed = <integer>, augment = <logical>, keep.source = <logical>)

Arguments

Return Value
Data frame

Example

Original data:

location 1900 1901 1902
Als 35.02899 24.02899 45.02899
Bea 18.00000 10.00000 27.00000
Bor 18.14896 10.64896 20.64896

build_kmeans(-location, centers=2, augment=TRUE)
Builds a K-means clustering model selecting all the columns other than 'location' and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.

location X1900 X1901 X1902 .cluster
Als 35.02899 24.02899 45.02899 2
Bea 18.00000 10.00000 27.00000 1
Bor 18.14896 10.64896 20.64896 1

Original data:

location year harvest_date
Als 1900 35.02899
Als 1901 24.02899
Als 1902 45.02899
Bea 1900 18.00000
Bea 1901 10.00000
Bea 1902 27.00000
Bor 1900 18.14896
Bor 1901 10.64896
Bor 1902 20.64896

build_kmeans(skv = c("location", "year", "harvest_date"), centers = 2, augment = TRUE)

Builds a K-means clustering model selecting 'location' as the subject to cluster, 'year' as the dimension, and 'harvest_date' as the value, and setting 2 as the number of the clusters, and scoring the original data right away. If you want to evaluate the model set 'augment' to FALSE.

location year harvest_date .cluster
Als 1900 35.02899 2
Als 1901 24.02899 2
Als 1902 45.02899 2
Bea 1900 18.00000 1
Bea 1901 10.00000 1
Bea 1902 27.00000 1
Bor 1900 18.14896 1
Bor 1901 10.64896 1
Bor 1902 20.64896 1

do_t.test

Summary
Execute t-test, which checks differences of means of variables.

Syntax
do_t.test(<value_column>, <group_column>, alternative = <alternative_type>, mu = <integer>, paired = <logical>, var.equal = <logical>, conf.level = <numeric>, subset = <column_logical>, na.action = <na_action_type>)

Arguments

Return Value
Data frame

Example

Original data:

location year harvest_date
Als 1900 35.02899
Als 1901 24.02899
Als 1902 45.02899
Bor 1900 18.14896
Bor 1901 10.64896
Bor 1902 20.64896

do_t.test(harvest_date, location)
Returns a data frame with one row that has t-test result.

do_var.test

Summary
Execute F-test, which checks the differences of variances between groups.

Syntax
do_var.test(<value_column>, <group_column>, ratio = <numeric>, alternative = <alternative_type>, conf.level = <numeric>, na.action = <na_action_type>)

Arguments

Return Value
Data frame

Example

Original data:

location year harvest_date
Als 1900 35.02899
Als 1901 24.02899
Als 1902 45.02899
Bor 1900 18.14896
Bor 1901 10.64896
Bor 1902 20.64896

do_var.test(harvest_date, location)
Returns a data frame with one row that has t-test result.

do_chisq.test

Summary
Execute chi-squared contingency table tests and goodness-of-fit tests.

Syntax
do_chisq.test(<column(s)>, correct = <logical>, p = <column_num>, rescale.p = <logical>, simulate.p.value = <logical>, B = <numeric>)

Arguments

Return Value
Data frame

Example

Original data:

clarity GIA HRD IGI
IF 6 4 34
VS1 61 13 7
VS2 36 15 2
VVS1 15 23 14
VVS2 33 24 21

do_chisq.test(-clarity)
Returns a data frame with one row that has chi-squared test result.

statistic p.value parameter method
112.7472 0 8 Pearson's Chi-squared test

do_apriori

Summary
Find rules of what tend to occur at the same time from transaction data.

Syntax
do_apriori(<subject_column>, <key_column>, minlen = <integer>, maxlen = <integer>, min_support = <numeric>, max_support = <numeric>, min_confidence = <numeric>, lhs = <character_vector>, rhs = <character_vector>)

Arguments

Return Value
Data frame

Example

Original data:

item_name transaction_id
banana 1
apple 1
carrot 2
apple 2
orange 3
carrot 3
apple 4
carrot 4

do_apriori(item_name, transaction_id)
Returns association rules about item names.

lhs rhs support confidence lift
apple 0.75 0.7500000 1.0000000
carrot 0.75 0.7500000 1.0000000
banana apple 0.25 1.0000000 1.3333333
orange carrot 0.25 1.0000000 1.3333333
apple carrot 0.50 0.6666667 0.8888889
carrot apple 0.50 0.6666667 0.8888889

do_apriori(item_name, transaction_id, minlen=2)
Returns association rules where the total number of item names in lhs and rhs are more than 2.

lhs rhs support confidence lift
banana apple 0.25 1.0000000 1.3333333
orange carrot 0.25 1.0000000 1.3333333
apple carrot 0.50 0.6666667 0.8888889
carrot apple 0.50 0.6666667 0.8888889

do_apriori(item_name, transaction_id, lhs=c( "apple", "orange"))
Returns association rules where apple or orange come to lhs.

lhs rhs support confidence lift
orange carrot 0.25 1.0000000 1.3333333
apple carrot 0.50 0.6666667 0.8888889

do_anomaly_detection

Summary
Detect anomaly in time series data frame.

Syntax
do_anomaly_detection(<column_date>, <column_num>, direction = <anomaly_direction_type>, e_value = <logical>, max_anoms = <numeric>, alpha = <numeric>, only_last = <anomaly_last_type>, threshold = <anomaly_threshold_type>, longterm = <logical>, piecewise_median_period_weeks = <integer>)

Arguments

Return Value
Data frame

Example

Original data:

dateHour newUsers
... ...
2017-01-20 09:00:00 4
2017-01-20 10:00:00 2
2017-01-20 11:00:00 5
2017-01-20 12:00:00 31
2017-01-20 13:00:00 3
2017-01-20 14:00:00 3
... ...

do_anomaly_detection(dateHour, newUsers, direction = "both", e_value = TRUE)
Returns data frame with columns about anomaly data.

dateHour newUsers pos_anomaly pos_value neg_anomaly neg_value expected_value
... ... ... ... ... ... ...
2017-01-20 09:00:00 4 FALSE NA FALSE NA 4
2017-01-20 10:00:00 2 FALSE NA FALSE NA 2
2017-01-20 11:00:00 5 FALSE NA FALSE NA 5
2017-01-20 12:00:00 31 TRUE 31 FALSE NA 5
2017-01-20 13:00:00 3 FALSE NA FALSE NA 3
2017-01-20 14:00:00 3 FALSE NA FALSE NA 3
... ... ... ... ... ... ...

row_as_header

Summary
Use a row as column names.

Syntax
row_as_header(row_index = <integer>, prefix = <character>, clean_names = <logical>)

Arguments

Return Value
Data frame

Example

Original data:

V1 V2
NA NA
item.name transaction.id
banana 1
apple 1
carrot 2
apple 2
orange 3
carrot 3
apple 4
carrot 4

row_as_header(row_index = 2, prefix = "c_", clean_names = TRUE)
Returns data frame using the 2nd row as column names with prefix and dot is replaced by underscore.

c_item_name c_transaction_id
NA NA
banana 1
apple 1
carrot 2
apple 2
orange 3
carrot 3
apple 4
carrot 4

pivot

Summary
Pivot columns into rows and columns. Values are count of pairs of rows and columns or aggregation of another column.

Syntax
pivot(<formula>, value = <column>, fill = <numeric>, fun.aggregate = <character>, na.rm = <logical>)

Arguments

Return Value
Data frame

Example

Original data:

clarity certification colour price
VVS2 HRD F 10796
VS1 GIA F 1551
VS1 GIA F 4291
VVS2 GIA F 5881
VS2 GIA H 4585
VVS2 GIA H 5193

pivot(clarity+colour~certification, value = price, fill=0, fun.aggregate = sum)
Returns data frame with rows of combinations of clarity and colour and with columns of certification. Values are sum of price and missing values are filled by 0.

clarity_colour GIA HRD
VS1_F 5842 0
VS2_H 4585 0
VVS2_F 5881 10796
VVS2_H 5193 0

do_cor.cols

Summary
Calculates correlations for all the pairs of the variables (columns).

Syntax
do_cor.cols(<column(s)>, use = <cor_na_operation>, method = <cor_method>, distinct = <logical>, diag = <logical>)

Arguments

Return Value
Data frame

Example

Original data:

year Als Bea Bor
1900 35.02899 18 18.14896
1901 24.02899 10 10.64896
1902 45.02899 27 20.64896

do_cor.cols(-year)

pair.name.1 pair.name.2 value
Als Bea 0.9981135
Als Bor 0.9680283
Bea Als 0.9981135
Bea Bor 0.9508014
Bor Als 0.9680283
Bor Bea 0.9508014

do_cor.kv

Summary
Calculates correlations for all the pairs of subject columns.

Syntax
do_cor.kv(<subject_column>, <key_column>, <value_column>, use = <cor_na_operation>, method = <cor_method>, distinct = <logical>, diag = <logical>, fill = <numeric>, fun.aggregate = <aggregate_function>)

Arguments

Return Value
Data frame

Example

Original data:

location year harvest_date
Als 1900 35.02899
Als 1901 24.02899
Als 1902 45.02899
Bea 1900 18.00000
Bea 1901 10.00000
Bea 1902 27.00000
Bor 1900 18.14896
Bor 1901 10.64896
Bor 1902 20.64896

do_cor.kv(location, year, harvest_date)

location.x location.y value
Als Bea 0.9981135
Als Bor 0.9680283
Bea Als 0.9981135
Bea Bor 0.9508014
Bor Als 0.9680283
Bor Bea 0.9508014

do_cor

Summary
Calculates correlations for all the pairs of the variables or subjects.

Syntax
do_cor(<column(s)>, skv = <character_vector>, use = <cor_na_operation>, method = <cor_method>, distinct = <logical>, diag = <logical>, fill = <numeric>, fun.aggregate = <aggregate_function>)

Arguments

Return Value
Data frame

Example

Original data:

year Als Bea Bor
1900 35.02899 18 18.14896
1901 24.02899 10 10.64896
1902 45.02899 27 20.64896

do_cor(-year)

pair.name.1 pair.name.2 value
Als Bea 0.9981135
Als Bor 0.9680283
Bea Als 0.9981135
Bea Bor 0.9508014
Bor Als 0.9680283
Bor Bea 0.9508014

Original data:

location year harvest_date
Als 1900 35.02899
Als 1901 24.02899
Als 1902 45.02899
Bea 1900 18.00000
Bea 1901 10.00000
Bea 1902 27.00000
Bor 1900 18.14896
Bor 1901 10.64896
Bor 1902 20.64896

do_cor( skv = c("location", "year", "harvest_date"))

location.x location.y value
Als Bea 0.9981135
Als Bor 0.9680283
Bea Als 0.9981135
Bea Bor 0.9508014
Bor Als 0.9680283
Bor Bea 0.9508014

do_dist.kv

Summary
Calculate the distances between each of the pairs.

Syntax
do_dist.kv(<subject_column>, <key_column>, <value_column>, distinct = <logical>, diag = <logical>, method = <dist_method>, fill = <numeric>, fun.aggregate = <aggregate_function>, p = <numeric> )

Arguments

Return Value
Data frame

Example

Original data:

date name product num
2015-06-12 Lisa orange 5
2015-06-12 Emily apple 3
2015-06-12 John carrot 3
2015-06-13 Emily apple 4
2015-06-13 John apple 6
2015-06-16 Lisa orange 2
2015-06-16 John carrot 4
2015-06-16 Emily orange 7

do_dist.kv(name, product, num)

This considers matrix like this. Duplicated entries is aggregated to mean as default and you can change it by fun.aggregate argument.

Emily John Lisa
apple 3.5 6.0 0.0
carrot 0.0 3.5 0.0
orange 7.0 0.0 3.5

Then calculate euclidean distances between pairs of each column.

name.x name.y value
Emily John 8.215838
Emily Lisa 4.949747
John Emily 8.215838
John Lisa 7.778175
Lisa Emily 4.949747
Lisa John 7.778175

do_dist.kv(name, product, num, fun.aggregate=sum, method="manhattan")

This considers matrix like this. Duplicated entries is aggregated to sum.

Emily John Lisa
apple 7 6 0
carrot 0 7 0
orange 7 0 7

Then calculate manhattan distances between pairs of each column.

name.x name.y value
Emily John 15
Emily Lisa 7
John Emily 15
John Lisa 20
Lisa Emily 7
Lisa John 20

do_dist.cols

Summary
Calculate distances of each of the pairs.

Syntax
do_dist.cols(<column(s)>, distinct = <logical>, diag = <logical>, method = <dist_method>, p = <numeric> )

Arguments

Return Value
Data frame

Example

Original data:

year Als Bea Bor
1900 35.02899 18 18.14896
1901 24.02899 10 10.64896
1902 45.02899 27 20.64896

do_dist.cols(-year)

pair.name.1 pair.name.2 value
Als Bea 28.492868
Als Bor 32.532238
Bea Als 28.492868
Bea Bor 6.385847
Bor Als 32.532238
Bor Bea 6.385847

do_dist

Summary
Calculate distances of each of the pairs of the variables or subjects.

Syntax
do_dist(<column(s)>, skv = <character_vector>, distinct = <logical>, diag = <logical>, method = <dist_method>, p = <numeric>, fill = <numeric>, fun.aggregate = <aggregate_function>)

Arguments

Return Value
Data frame

Example

Original data:

year Als Bea Bor
1900 35.02899 18 18.14896
1901 24.02899 10 10.64896
1902 45.02899 27 20.64896

do_dist(-year)

pair.name.1 pair.name.2 value
Als Bea 28.492868
Als Bor 32.532238
Bea Als 28.492868
Bea Bor 6.385847
Bor Als 32.532238
Bor Bea 6.385847

Original data:

date name product num
2015-06-12 Lisa orange 5
2015-06-12 Emily apple 3
2015-06-12 John carrot 3
2015-06-13 Emily apple 4
2015-06-13 John apple 6
2015-06-16 Lisa orange 2
2015-06-16 John carrot 4
2015-06-16 Emily orange 7

do_dist(skv = c("name", "product", "num"))

This considers matrix like this. Duplicated entries is aggregated to mean as default and you can change it by fun.aggregate argument.

Emily John Lisa
apple 3.5 6.0 0.0
carrot 0.0 3.5 0.0
orange 7.0 0.0 3.5

Then calculate euclidean distances between pairs of each column.

name.x name.y value
Emily John 8.215838
Emily Lisa 4.949747
John Emily 8.215838
John Lisa 7.778175
Lisa Emily 4.949747
Lisa John 7.778175

do_dist(skv = c("location", "year", "harvest_date"), fun.aggregate=sum, method="manhattan")

This considers matrix like this. Duplicated entries is aggregated to sum.

Emily John Lisa
apple 7 6 0
carrot 0 7 0
orange 7 0 7

Then calculate manhattan distances between pairs of each column.

name.x name.y value
Emily John 15
Emily Lisa 7
John Emily 15
John Lisa 20
Lisa Emily 7
Lisa John 20

str_clean

Summary
Cleans up text by removing escape characters (e.g. \n, \t), extra white spaces, extra period, and leading and trailing spaces.

Syntax
str_clean(<column_text>)

Arguments

Return Value
Character

Example
str_clean(" Exploratory ..io ")
returns "Exploratory.io".

str_count_all

Summary
Count patterns from texts

Syntax
str_count_all(<column_text>, patterns = <character_vector>, remove.zero = <logical>)

Arguments

Return Value
List of data frame

Example
str_count_all("I ate banana, apple, and peach yesterday, and banana, peach today.", patterns=c("apple", "banana"), remove.zero=TRUE)
Returns a list column of data frames with 'apple' and 'banana' columns.

get_stopwords

Summary
Returns stop words like "a", "the", "and", etc.

Syntax
get_stopwords(lang = <language>, include = <character_vector>, exclude = <character_vector>)

Arguments

Return Value
Character

Example
get_stopwords()
Return a list of the stop words like "a", "the", "and".

word_to_sentiment

Summary
Returns sentiment types of positive or negative based on a given word(s).

Syntax
word_to_sentiment(<column_text>, lexicon = <sentiment_lexicon>)

Arguments

Return Value
If lexicon argument is "bing" (default), Character. If lexicon argument is "AFINN", Numeric. If lexicon argument is "nrc", List.

Example
word_to_sentiment("good")
Return "positive".

word_to_sentiment("bad", lexicon="AFINN")
Return -3.

word_to_sentiment("cry", lexicon="nrc")
Return c("negative" "sadness").

get_sentiment

Summary
Returns the sentiment score for a given text sentence. The positive score indicates the positive sentence and the negative score indicates the opposite. 0 means 'neutral.'

Syntax
get_sentiment(<column_text>)

Arguments

Return Value
Numeric

Example
get_sentiment("I'm happy")
Return 0.7071068.

get_sentiment("I'm sick.")
Return -0.7071068.

get_sentiment("I'm not so good.")
Return -0.5000000.

stem_word

Summary
Stem word so that almost the same words which have a little different spells can be recognized as the same words.

Syntax
stem_word(<column_text>, language = <stem_language>)

Arguments

Return Value
Character

Example
stem_word(c("stand","stands", "cheerful", "cheering"))
Return c("stand","stand","cheer","cheer").

is_stopword

Summary
Returns TRUE if a given word is included in a list of the stop words defined by one of the dictionary.

Syntax
is_stopword(<column_text>, lang = <language>, include = <character_vector>, exclude = <character_vector>)

Arguments

Return Value
Logical

Example
is_stopword(c("a", "and", "stopword", "the"))
Return c(TRUE, TRUE, FALSE, TRUE).

is_empty

Summary
Returns TRUE if a given text is empty string or NA.

Syntax
is_empty(<column_text>)

Arguments

Return Value
Logical

Example
is_empty(c("", " ", NA, "The", "\n"))
Return c(TRUE, TRUE, TRUE, FALSE, TRUE).

is_alphabet

Summary
Returns TRUE if a given text contains only alphabets.

Syntax
is_alphabet(<column_text>)

Arguments

Return Value
Data frame

Example
is_alphabet(c("1", "132", "32MB", "hello"))
Return c(FALSE, FALSE, FALSE, TRUE).

ip_to_country

Summary
Returns country names from IP addresses.

Syntax
ip_to_country(<column_text>)

Arguments

Return Value
Character

Example
ip_to_country("133.43.96.45")
returns "Japan".

url_domain

Summary
Returns domain (ex. "exploratory.io") from url.

Syntax
url_domain(<column_text>)

Arguments

Return Value
Character

Example
url_domain("https://exploratory.io/login/")
returns "exploratory.io".

url_fragment

Summary
Returns fragment from url.

Syntax
url_fragment(<column_text>)

Arguments

Return Value
Character

Example
url_fragment("https://exploratory.io/?debug=true#test")
returns "test".

url_path

Summary
Returns path from url.

Syntax
url_path(<column_text>)

Arguments

Return Value
Character

Example
url_path("https://exploratory.io/reference/")
returns "reference/".

url_port

Summary
Returns port from url.

Syntax
url_port(<column_text>)

Arguments

Return Value
Character

Example
url_port("https://exploratory.io:443")
returns "443".

url_scheme

Summary
Returns scheme (ex. "http", "https") from url.

Syntax
url_scheme(<column_text>)

Arguments

Return Value
Character

Example
url_scheme("https://exploratory.io")
returns "https".

url_suffix

Summary
Returns suffix (ex. "com", "org") from url.

Syntax
url_suffix(<column_text>)

Arguments

Return Value
Character

Example
url_suffix("https://exploratory.io")
returns "io".

url_suffix("http://sample.edu.co")
returns "edu.co".

url_subdomain

Summary
Returns subdomain (ex. "www", "blog") from url.

Syntax
url_subdomain(<column_text>)

Arguments

Return Value
Character

Example
url_subdomain("https://blog.exploratory.io")
returns "blog".

url_tld

Summary
Returns top-level domain (ex. "com", "co") from url.

Syntax
url_tld(<column_text>)

Arguments

Return Value
Character

Example
url_tld("http://sample.edu.co")
returns "co".

url_param

Summary
Returns decoded query parameter from url.

Syntax
url_param(<column_text>, parameter_name = <character>)

Arguments
* parameter_name - Name of query. If there's "?user=ferret" in the end of the url, the name is "user".

Return Value
Character

Example
url_param("http://example.com/?user=ferret", "user")
returns "ferret".

countrycode

Summary
Map country names or codes to other codes or names (country names, continent names, etc.).

Syntax
countrycode(<column>, origin = <origin_countrycode_type>, destination = <destination_countrycode_type>)

Arguments

Values for origin and destination arguments:

Return Value
Country name, id, etc.

Example
countrycode("U.S.", "country.name", "iso2c")
Return US.

countrycode("United States", "country.name", "iso2c")
Return US.

countrycode("US", "iso2c", "country.name")
Return USA.

countrycode("USA", "cowc", "country.name")
Return United States.

countrycode(2, "cown", "country.name")
Return United States.

statecode

Summary
This function takes a column that has US State information and returns either US State names, abbreviations, numeric codes, division, or region, based on 'output_type' argument setting. The US State information can be either US State names, US State abbreviations (FIPS / ANSI codes), or US State numeric code (FIPS) in any combination.

Syntax
statecode(<state_column>, output_type = <statecode_type>)

Arguments

Output Types:

Return Value
State name, State code, etc.

Example
statecode("CA", "name")
Return California.

statecode("CA", "num_code")
Return 06.

statecode("California", "num_code")
Return 06.

statecode("CA", "region")
Return West.

countycode

Summary
Generate US county code (FIPS - Federal Information Processing Standard) based on US State and County names.

Syntax
countycode(state = <state_column>, county = <county_column>)

Arguments

Return Value
County code

Example
countycode("California", "San Mateo")
Return 06081.

countycode("CA", "San Mateo")
Return 06081.

countycode("CA", "San Mateo County")
Return 06081.

do_tokenize

Summary
Returns one token (e.g. word) per row after tokenizing a given text.

Syntax
do_tokenize(<column_text>, token = <token_type>, keep_cols = <logical>, to_lower = <logical>, drop = <logical>, output = <new_column_name>, with_id = <logical>, pattern = <pattern>)

Arguments

Return Value
Data frame

Example

Original data:

index text
First It was a great party.
Second She has just left. She will be off tomorrow.

do_tokenize(text)

index document_id sentence_id token
First 1 1 it
First 1 1 was
First 1 1 a
First 1 1 great
First 1 1 party
Second 2 1 she
Second 2 1 has
Second 2 1 just
Second 2 1 left
Second 2 2 she
Second 2 2 will
Second 2 2 be
Second 2 2 off
Second 2 2 tomorrow

do_tokenize(text, token="sentences")

index token
First it was a great party.
Second she has just left.
Second she will be off tomorrow.

pair_count

Summary
Count pairs of words (tokens) that cooccur within a group

Syntax
pair_count(group = <column>, value = <column>, distinct = <logical>, diag = <logical>, sort = <logical>)

Arguments

Return Value
Data frame

Example
pair_count(group=title, value=word, distinct=FALSE, diag = FALSE, sort=TRUE)
Return a data frame with word.x and word.y which is from "word" column and value that has count.

do_tfidf

Summary
Calculates TF-IDF for each term against a given group. TF-IDF is a weighting mechanism that calculates the importance of each word to each document by increasing the importance based on the term frequency while decreasing the importance based on the document frequency.

Syntax
do_tfidf(<document_column>, <token_column>, tf_weight = <tf_weight_type>, idf_log_scale = <function>, norm = "l2"|"l1"|FALSE)

Arguments

Return Value
Data frame

Example

Original data:

document_id token
1 this
1 is
1 what
1 it
1 is
2 which
2 is
2 better

do_tfidf(document_id, token)

document_id token count_per_doc count_of_docs tfidf
1 is 2 2 0.0000000
1 it 1 1 0.5773503
1 this 1 1 0.5773503
1 what 1 1 0.5773503
2 better 1 1 0.7071068
2 is 1 2 0.0000000
2 which 1 1 0.7071068

do_ngram

Summary
Create columns of n-grams connected in sentences.

Syntax
do_ngram(<token_column>, <document_column>, <sentence_column>, maxn = <numeric>, sep = <text>)

Arguments

Return Value
Data frame

Example

Original data:

document_id sentence_id token
1 1 it
1 1 is
1 1 good
2 1 she
2 1 left
2 2 she
2 2 will
2 2 come
2 2 tomorrow

do_ngram(token, document_id, sentence_id, maxn=3)

document_id sentence_id gram token
1 1 1 it
1 1 1 is
1 1 1 good
1 1 2 it_is
1 1 2 is_good
1 1 3 it_is_good
2 1 1 she
2 1 1 left
2 1 2 she_left
2 2 1 she
2 2 1 will
2 2 1 come
2 2 1 tomorrow
2 2 2 she_will
2 2 2 will_come
2 2 2 come_tomorrow
2 2 3 she_will_come
2 2 3 will_come_tomorrow

do_cosine_sim.kv

Summary
Calculates the similarity between each pair of the documents using the cosine similarity algorithm. Cosine similarity measures the cosine of the angle between two vectors in the multi-dimensional space.

Syntax
do_cosine_sim.kv(<subject_column>, <key_column>, <value_column>, distinct = <logical>, diag = <logical>, fun.aggregate = <aggregate_function>)

Arguments

Return Value
Data frame

Example

Original data:

document_id token value
Lisa it 0.8966972
Lisa was 0.2655087
Lisa good 0.3721239
Emily she 0.5728534
Emily is 0.9082078
Emily nice 0.2016819
John she 0.8983897
John is 0.9446753
John good 0.6607978

do_cosine_sim.kv(document_id, token, value)

document_id.x document_id.y value
Lisa Emily 0.0000000
Lisa John 0.1671574
Emily Lisa 0.0000000
Emily John 0.8595770
John Lisa 0.1671574
John Emily 0.8595770

do_cmdscale

Summary
Execute multidimensional scaling (MDS). Calculate approximated coordinations from distances of entity pairs.

Syntax
do_cmdscale(<name1_column>, <name2_column>, <value_column>, k = <integer>)

Arguments

Return Value
Data frame

Example

Original data:

name.x name.y value
cow horse 36
cow turtle 82
horse turtle 48

do_cmdscale(pair.name.1, pair.name.2, value)

name V1 V2
cow -38.915551 -3.435417
horse -4.163866 5.962475
turtle 43.079417 -2.527058