Patterns tutorial

Jmv38 Mod
edited March 2015 in Examples Posts: 3,295

For Codea Showcase, I have to decode a little bit of HTML, and I've decided to switch to patterns for that. Here are my first patterns, explained. Since I am a noob at patterns, this may help some of you jump into them more quickly.


function setup()
    local str = "12 add FGv"
    for s in charSplit(str," ") do print(s) end
    str = "12|add|FGv"
    for s in charSplit(str,"|") do print(s) end
    somePatterns()
end

-- splits str on runs of the separator c
-- c is inserted into the pattern as-is, so it must not be a magic character such as '.' or '%'
function charSplit(str,c)
    str = str .. c
    return string.gmatch(str,"(.-)"..c.."+")
end

function somePatterns()
    -- some patterns explained:

    local pattern = '(%b<>)' -- finds all <html tags>
    -- pattern explained step by step:
    -- '%b<>' a balanced substring delimited by '<' and '>'
    -- '(%b<>)' capture a substring delimited by '<' and '>', and include delimiters
    local data = "<xxx> yyy </xxx>"
    print("pattern: '"..pattern .. "'" )
    print("input: '" .. data .. "'")
    print("result:")
    for tag in string.gmatch(data, pattern) do
        print("'" .. tag .. "'" )
    end

    local pattern = '<xxx>(.-)</xxx>' -- finds content of one html tag xxx
    -- pattern explained step by step:
    -- '<xxx>.</xxx>' exactly one char between '<xxx>' and '</xxx>'
    -- '<xxx>(.)</xxx>' capture that char, and exclude delimiters
    -- '<xxx>(.-)</xxx>' capture a substring delimited by '<xxx>' and '</xxx>', the smallest possible, and exclude delimiters
    local data = "<xxx> yyy </xxx>"
    print("pattern: '"..pattern .. "'" )
    print("input: '" .. data .. "'")
    print("result:")
    local tag = string.match(data, pattern)
    print("'" .. tag .. "'" )

    local pattern = '(<.->)([^<]*)' -- returns all pairs: <html tags>, text between html tags
    -- pattern explained step by step:
    -- '<.>' exactly one char between '<' and '>'
    -- '<.->' a substring delimited by '<' and '>', the smallest possible
    -- '(<.->)' capture a substring delimited by '<' and '>', the smallest possible, and include delimiters
    -- '[^<]' a single char that is not '<'
    -- '[^<]*' a substring of chars that are not '<', the biggest possible, minimum 0 char
    -- '([^<]*)' capture a substring of chars that are not '<', the biggest possible, minimum 0 char
    -- let's call these 2 patterns [A] and [B].
    -- The whole pattern is:
    -- '(<.->)([^<]*)' capture a substring according to [A], then, starting after the last char matched by [A], capture a substring according to [B]
    local data = "<xxx> yyy </xxx>"
    print("pattern: '"..pattern .. "'" )
    print("input: '" .. data .. "'")
    print("result:")
    for tag, inner in string.gmatch(data,pattern) do
        print("'" .. tag .. "', '" .. inner .. "'" )
    end
end

--[[
-- I found this nice summary on the web: http://www.gammon.com.au/scripts/doc.php?lua=string.find

Patterns

The standard patterns (character classes) you can search for are:

. --- (a dot) represents all characters.
%a --- all letters.
%c --- all control characters.
%d --- all digits.
%l --- all lowercase letters.
%p --- all punctuation characters.
%s --- all space characters.
%u --- all uppercase letters.
%w --- all alphanumeric characters.
%x --- all hexadecimal digits.
%z --- the character with hex representation 0x00 (null).
%% --- a single '%' character.
%1 --- captured pattern 1.
%2 --- captured pattern 2 (and so on).
%f[s] transition from not in set 's' to in set 's'.
%b() balanced pair ( ... )

Important! - the uppercase versions of the above represent the complement of the class. eg. %U represents everything except uppercase letters, %D represents everything except digits.

Also important! If you are using string.find (or string.match etc.) in MUSHclient, and inside "send to script" in a trigger or alias, then the % sign has special meaning there (it is used to identify wildcards, such as %1 is wildcard 1). Thus the % signs in string.find need to be doubled or they won't work properly (so use %%d instead of %d in "send to script").

There are some "magic characters" (such as %) that have special meanings. These are:

^ $ ( ) % . [ ] * + - ?

If you want to use those in a pattern (as themselves) you must precede them by a % symbol. eg. %% would match a single %

You can build your own pattern classes (sets) by using square brackets, eg.

[abc] ---> matches a, b or c
[a-z] ---> matches lowercase letters (same as %l)
[^abc] ---> matches anything except a, b or c
[%a%d] ---> matches all letters and digits
[%a%d_] ---> matches all letters, digits and underscore
[%[%]] ---> matches square brackets (had to escape them with %)
-- the two closing brackets in the set above end the block comment, so it is reopened here
--[[

You can use pattern classes in the form %x in the set. If you use other characters (like periods and brackets, etc.) they are simply themselves.

You can specify a range of characters inside a set by using simple characters (not pattern classes like %a) separated by a hyphen. For example, [A-Z] or [0-9]. These can be combined with other things. For example [A-Z0-9] or [A-Z,.].

The end-points of a range must be given in ascending order. That is, [A-Z] would match upper-case letters, but [Z-A] would not match anything.

You can negate a set by starting it with a "^" symbol, thus [^0-9] is everything except the digits 0 to 9. The negation applies to the whole set, so [^%a%d] would match anything except letters or digits. Anywhere except the first position of a set, the "^" symbol is simply itself.

Inside a set (that is a sequence delimited by square brackets) the only "magic" characters are:

] ---> to end the set, unless preceded by %
% ---> to introduce a character class (like %a), or magic character (like "]")
^ ---> in the first position only, to negate the set (eg. [^A-Z])
- ---> between two characters, to specify a range (eg. [A-F])

Thus, inside a set, characters like "." and "?" are just themselves.

The repetition characters, which can follow a character, class or set, are:

+ ---> 1 or more repetitions (greedy)
* ---> 0 or more repetitions (greedy)
- ---> 0 or more repetitions (non greedy)
? ---> 0 or 1 repetition only

A "greedy" match will match on as many characters as possible, a non-greedy one will match on as few as possible.

The standard "anchor" characters apply:

^ ---> anchor to start of subject string (must be the very first character)
$ ---> anchor to end of subject string

You can also use round brackets to specify "captures":

You see (.*) here

Here, whatever matches (.*) becomes the first pattern.

You can also refer to matched substrings (captures) later on in an expression:

print (string.find ("You see dogs and dogs", "You see (.*) and %1")) --> 1 21 dogs
print (string.find ("You see dogs and cats", "You see (.*) and %1")) --> nil

This example shows how you can look for a repetition of a word matched earlier, whatever that word was ("dogs" in this case).

As a special case, an empty capture string returns as the captured pattern, the position of itself in the string. eg.

print (string.find ("You see dogs and cats", "You .* ()dogs .*")) --> 1 21 9

What this is saying is that the word "dogs" starts at column 9.

Finally you can look for nested "balanced" things (such as parentheses) by using %b, like this:

print (string.find ("I see a (big fish (swimming) in the pond) here", "%b()")) --> 9 41

After %b you put 2 characters, which indicate the start and end of the balanced pair. If it finds a nested version it keeps processing until we are back at the top level. In this case the matching string was "(big fish (swimming) in the pond)".

Examples of string.find:

print (string.find ("the quick brown fox", "quick")) --> 5 9
print (string.find ("the quick brown fox", "(%a+)")) --> 1 3 the
print (string.find ("the quick brown fox", "(%a+)", 10)) --> 11 15 brown
print (string.find ("the quick brown fox", "fruit")) --> nil

See Also ...

Lua functions

string.byte - Converts a character into its ASCII (decimal) equivalent
string.char - Converts ASCII codes into their equivalent characters
string.dump - Converts a function into binary
string.format - Formats a string
string.gfind - Iterate over a string (obsolete in Lua 5.1)
string.gmatch - Iterate over a string
string.gsub - Substitute strings inside another string
string.len - Return the length of a string
string.lower - Converts a string to lower-case
string.match - Searches a string for a pattern
string.rep - Returns repeated copies of a string
string.reverse - Reverses the order of characters in a string
string.sub - Returns a substring of a string
string.upper - Converts a string to upper-case
--]]
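
One thing worth trying after reading the summary above: charSplit puts the separator straight into the pattern, so a magic character like "." as separator will not behave as a literal. Below is a minimal sketch of one way around that, plus a quick greedy vs non-greedy check. The helper names escapePattern and splitOn are made up for this example (they are not in the post or in the Lua string library), and splitOn assumes the separator is a single character.

```lua
-- Sketch only: escapePattern and splitOn are hypothetical helper names.

-- Put a '%' in front of every non-alphanumeric character so that
-- magic characters (. % + - * ? [ ] ^ $ ( )) are matched literally.
local function escapePattern(s)
    return (s:gsub("%W", "%%%0"))
end

-- Split str on runs of the single-character separator sep.
-- Same idea as charSplit, but the separator is escaped first
-- and the pieces are collected into a table.
local function splitOn(str, sep)
    local parts = {}
    local pat = "(.-)" .. escapePattern(sep) .. "+"
    for piece in (str .. sep):gmatch(pat) do
        parts[#parts + 1] = piece
    end
    return parts
end

local parts = splitOn("12.add.FGv", ".")
for _, p in ipairs(parts) do print(p) end

-- Greedy vs non-greedy on the same input:
local html = "<b>bold</b> and <i>italic</i>"
local greedy = html:match("<.*>") -- runs from the first '<' to the last '>'
local lazy = html:match("<.->")   -- stops at the first '>'
print(greedy)
print(lazy)
```

Escaping with "%W" is a blunt choice: it also escapes characters that did not need it, but in a Lua pattern a '%' followed by any non-alphanumeric character just stands for that character, so it should be harmless.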
