R grep


The grep() function searches each element of a character vector for matches to a pattern. It returns the indices of the matching elements or, with value = TRUE, the matching elements themselves.


grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)


• pattern: string to be matched; regular expressions are supported
• x: character vector in which matches are sought (a single string is a vector of length 1)
• ignore.case: logical. If FALSE, matching is case sensitive; if TRUE, case is ignored during matching
• perl: logical. If TRUE, Perl-compatible regular expressions are used
• value: logical. If FALSE, the indices of the matching elements are returned; if TRUE, the matching elements themselves are returned
• fixed: logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments
• useBytes: logical. If TRUE, matching is done byte-by-byte rather than character-by-character
• invert: logical. If TRUE, return indices or values for elements that do not match
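
As a rough illustration of how these arguments change the result, the sketch below runs grep() on a small made-up vector. The vector fruits and the variable names are assumptions chosen for this example only.

```r
# Hypothetical sample data (not from the examples on this page)
fruits <- c("Apple", "apple pie", "banana", "a.b", "axb")

# Case sensitive by default: only "apple pie" contains lowercase "apple"
idx_default <- grep("apple", fruits)                      # 2

# ignore.case = TRUE also matches "Apple"
idx_nocase <- grep("apple", fruits, ignore.case = TRUE)   # 1 2

# As a regular expression, "." matches any character
idx_regex <- grep("a.b", fruits)                          # 4 5

# fixed = TRUE treats "a.b" as a literal string
idx_fixed <- grep("a.b", fruits, fixed = TRUE)            # 4

# invert = TRUE returns the elements that do NOT match
idx_invert <- grep("apple", fruits, ignore.case = TRUE, invert = TRUE)  # 3 4 5
```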


grep(value = FALSE) returns an integer vector of the indices of the elements of x that yielded a match (or not, for invert = TRUE).

> grep("rect", "draw a rectangle")
[1] 1
> str <- c("Regular", "expression", "examples of R language")
> x <- grep("ex",str,value=F)
> x

[1] 2 3

> x <- "line 4322: He is now 25 years old, and weights 130lbs"
> x <- grep("\\d", x)
> x

[1] 1


• grep(value = TRUE) returns a character vector containing the selected elements of x (after coercion, preserving names but no other attributes).

> grep("rect", "draw a rectangle", value=T)
[1] "draw a rectangle"
> x <- grep("ex",str,value=T)
> x

[1] "expression" "examples of R language"

• grepl returns a logical vector (match or not for each element of x).

> x <- grepl("ex",str)
> x
[1] FALSE TRUE TRUE
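
Because grepl() returns one logical value per element, its result can be used directly as a subsetting index. The following is a minimal sketch using the same str vector; the names has_ex, matched, unmatched and n_matched are introduced here for illustration.

```r
str <- c("Regular", "expression", "examples of R language")

# Logical index: TRUE where the element contains "ex"
has_ex <- grepl("ex", str)

# Keep the matching elements (same result as grep("ex", str, value = TRUE))
matched <- str[has_ex]

# Keep the non-matching elements
unmatched <- str[!has_ex]

# Count the matching elements (TRUE counts as 1)
n_matched <- sum(has_ex)
```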


R has several functions for regular-expression matching and replacement. The grep, grepl, regexpr and gregexpr functions search for matches, while sub and gsub perform replacements.

• sub replaces the first match in each element, and gsub replaces all matches. Both return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of x that are not substituted are returned unchanged (including any declared encoding). If useBytes = FALSE, a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g. if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE).

> str <- c("Regular", "expression", "examples of R language")
> x <- sub("x.ress","",str)
> x

[1] "Regular" "eion" "examples of R language"

> x <- sub("x.+e","",str)
> x

[1] "Regular" "ession" "e"

> x <- "line 4322: He is now 25 years old, and weights 130lbs";
> x <- gsub("[[:digit:]]","",x)
> x

[1] "line : He is now  years old, and weights lbs"


> x <- "line 4322: He is now 25 years old, and weights 130lbs";
> x <- gsub("\\d+","",x)
> x

[1] "line : He is now  years old, and weights lbs"
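
The difference between sub and gsub is easiest to see on a string with several matches. The sketch below uses a made-up date string and a shortened version of the sentence above; all variable names are assumptions for this example. It also shows one possible way to collapse the leftover run of spaces after digits are removed.

```r
x <- "2021-03-15"   # hypothetical input

# sub() replaces only the first match
first_only <- sub("-", "/", x)     # "2021/03-15"

# gsub() replaces every match
all_matches <- gsub("-", "/", x)   # "2021/03/15"

# Removing digits can leave two adjacent spaces behind
y <- gsub("[[:digit:]]+", "", "line 4322: He is now 25 years old")

# Collapse any run of spaces into a single space
y_clean <- gsub(" +", " ", y)      # "line : He is now years old"
```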


• regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is none, with attribute "match.length", an integer vector giving the length of the matched text (or -1 for no match). The match positions and lengths are in characters unless useBytes = TRUE is used, when they are in bytes.

> str <- c("Regular", "expression", "examples of R language")
> x <- regexpr("x*ress",str)
> x

[1] -1 4 -1
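
The start positions and the "match.length" attribute can be combined to pull out the matched text. The sketch below shows two ways of doing this, assuming the same str vector: the base function regmatches(), and a manual substring() call. The variable names are chosen here for illustration.

```r
str <- c("Regular", "expression", "examples of R language")
m <- regexpr("x*ress", str)

# Start positions without the attributes, and the match lengths
start <- as.integer(m)
len <- attr(m, "match.length")

# regmatches() returns the matched text for the elements that matched
found <- regmatches(str, m)

# The same extraction done by hand with substring()
hit <- start > 0
manual <- substring(str[hit], start[hit], start[hit] + len[hit] - 1)
```

Both found and manual contain the single string "ress", since only the second element matches.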

• gregexpr returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.

> str <- c("Regular", "expression", "examples of R language")
> x <- gregexpr("x*ress",str)
> x

[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 4
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE
[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
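
Since gregexpr reports every match, it is commonly paired with regmatches() to extract all occurrences of a pattern. The following sketch, which reuses the sentence from the gsub examples, collects every run of digits; the names m, nums and values are assumptions for this example.

```r
x <- "line 4322: He is now 25 years old, and weights 130lbs"

# One list element per input string, holding all match positions
m <- gregexpr("[0-9]+", x)

# regmatches() returns a list; [[1]] selects the matches for the first string
nums <- regmatches(x, m)[[1]]   # "4322" "25" "130"

# Convert the matched text to numbers
values <- as.integer(nums)
```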



Regular Expression Syntax:

Syntax       Description
\\d          Digit: 0, 1, 2 ... 9
\\D          Not a digit
\\s          Whitespace character
\\S          Not a whitespace character
\\w          Word character (letter, digit or underscore)
\\W          Not a word character
\\t          Tab
\\n          New line
^            Beginning of the string
$            End of the string
\            Escapes a special character; in an R string the backslash is doubled, e.g. "\\+" matches "+" and "\\\\" matches "\"
|            Alternation, e.g. (e|d)n matches "en" and "dn"
.            Any single character (with perl = TRUE, any character except a newline)
[ab]         a or b
[^ab]        Any character except a and b
[0-9]        All digits
[A-Z]        All uppercase letters A to Z
[a-z]        All lowercase letters a to z
[A-Za-z]     All uppercase and lowercase letters A to Z (not [A-z], which in ASCII also matches [ \ ] ^ _ `)
i+           i at least one time
i*           i zero or more times
i?           i zero or one time
i{n}         i occurs n times in sequence
i{n1,n2}     i occurs n1 to n2 times in sequence
i{n1,n2}?    Non-greedy match: i occurs n1 to n2 times, as few times as possible
i{n,}        i occurs >= n times
[:alnum:]    Alphanumeric characters: [:alpha:] and [:digit:]
[:alpha:]    Alphabetic characters: [:lower:] and [:upper:]
[:blank:]    Blank characters: e.g. space, tab
[:cntrl:]    Control characters
[:digit:]    Digits: 0 1 2 3 4 5 6 7 8 9
[:graph:]    Graphical characters: [:alnum:] and [:punct:]
[:lower:]    Lower-case letters in the current locale
[:print:]    Printable characters: [:alnum:], [:punct:] and space
[:punct:]    Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:space:]    Space characters: tab, newline, vertical tab, form feed, carriage return, space
[:upper:]    Upper-case letters in the current locale
[:xdigit:]   Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f

The [:name:] classes are valid only inside a bracket expression, e.g. [[:digit:]].
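
A few of the syntax elements above can be tried out with the functions from this page. The sketch below is only illustrative: the input strings and variable names are made up for this example. It contrasts greedy and non-greedy quantifiers and shows anchors, a counted repetition, and a POSIX class inside a bracket expression.

```r
x <- "<b>bold</b> and <i>italic</i>"   # hypothetical input

# Greedy: .+ extends to the last ">" so the whole string is removed
greedy <- sub("<.+>", "", x)     # ""

# Non-greedy: .+? stops at the first ">" so only "<b>" is removed
lazy <- sub("<.+?>", "", x)      # "bold</b> and <i>italic</i>"

# Anchors: ^ matches at the start, $ at the end
starts_ex <- grepl("^ex", c("expression", "regex"))   # TRUE FALSE
ends_ex <- grepl("ex$", c("expression", "regex"))     # FALSE TRUE

# Counted repetition: exactly three digits and nothing else
three_digits <- grepl("^[0-9]{3}$", c("123", "1234", "12a"))   # TRUE FALSE FALSE

# POSIX class used inside a bracket expression
no_punct <- gsub("[[:punct:]]", "", "Hello, world!")   # "Hello world"
```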