These are the strings that I should not allow in my address:
"PO BOX","P0 DRAWER","POSTOFFICE", " PO ", " BOX ", "C/O","C.O."," ICO "," C/O "," C\0 ","C/0","P O BOX", "P 0 BOX","P 0 B0X","P0 B0X","P0 BOX","P0BOX","P0B0X", "POBX","P0BX","POBOX","P.0.","P.O","P O "," P 0 ", "P.O.BOX","P.O.B","POB ","P0B","P 0 B","P O B", " CARE ","IN CARE"," APO "," CPO "," UPO ", "GENDEL", "GEN DEL", "GENDELIVERY","GEN DELIVERY","GENERALDEL", "GENERAL DEL","GENERALDELIVERY","GENERAL DELIVERY"
I created a regular expression, but it only validates the PO BOX part. Please correct it so that none of the strings above are allowed in my address field:
"([\\w\\s*\\W]*((P(O|OST)?.?\\s*((O(FF(ICE)?)?)?.?\\s*(B(IN|OX|.?))|B(IN|OX))+))[\\w\\s*\\W]*)+
|([\\w\\s*\\W]* (IN \s*(CARE)?\\s*)|\s*[\\w\\s*\\W]*((.?(APO)?|.?(cPO)?|.?(uPO))?.?\s*) [\\w\\s*\\W]*|([\\w\\s*\\W]*(GEN(ERAL)?)?.?\s*(DEL(IVERY)?)?.?\s* [\\w\\s*\\W]*))";
I'm guessing you're trying to see if an address string contains any restricted phrases.
Please do not do this in one single regex.
Doing one single massive regex matching query means it's hard to understand what you did to create the regex, hard to extend if more restrictions pop up, and generally not good code practice.
Here's a (hopefully) more sane approach:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static final String[] RESTRICTIONS = { " P[0O] ", " B[0O]X ", "etc, etc" };

public static boolean containsRestrictions(String testString) {
    // Try each restriction in turn and stop at the first one that is found
    for (String expression : RESTRICTIONS) {
        Matcher restriction = Pattern.compile(expression).matcher(testString);
        if (restriction.find()) return true;
    }
    return false;
}

You're still doing regex matching, so you can put your fancy schmancy regex into the restrictions list, but it works on plain old strings too. Now you only need to verify that each individual regex works, instead of verifying one giant regex against all possible cases. If you wanna add a new restriction, just add it to the list. If you're real fancy, you can load the restrictions from a configuration file or inject them using Spring, so your pesky product people can add address restrictions without touching a single line of code.
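The list-of-restrictions idea can also be sketched with the patterns compiled once up front, which is roughly what you would do if the expressions came from a configuration file. The class name `RestrictionList` and the sample expressions are assumptions for illustration, not part of the original answer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: holds one precompiled Pattern per restriction.
final class RestrictionList {
    private final List<Pattern> patterns = new ArrayList<>();

    // The expressions could come from a config file; here they are passed in directly.
    RestrictionList(List<String> expressions) {
        for (String expression : expressions) {
            patterns.add(Pattern.compile(expression));
        }
    }

    // Returns true as soon as any restriction is found in the address.
    boolean containsRestriction(String address) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(address).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RestrictionList restrictions =
                new RestrictionList(Arrays.asList(" P[0O] ", " B[0O]X "));
        System.out.println(restrictions.containsRestriction("12 MAIN ST P0 BOX 7"));
        System.out.println(restrictions.containsRestriction("12 MAIN ST"));
    }
}
```

Compiling in the constructor means a malformed expression fails at startup rather than on the first address checked.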
Edit: To make this even easier to read, and to do what you really want (restricting strings separated from other strings using whitespace), you can remove regexes altogether from the restrictions and do some basic matching work in your method.
// No regexes here, just words you wanna restrict
public static final String[] RESTRICTIONS = { "PO", "PO BOX", "etc, etc" };

public static boolean containsRestrictions(String testString) {
    for (String word : RESTRICTIONS) {
        // Pattern.quote keeps characters such as '.' in a word from being read as regex syntax
        String expression = "(^|\\s)" + Pattern.quote(word) + "(\\s|$)";
        Matcher restriction = Pattern.compile(expression).matcher(testString);
        if (restriction.find()) return true;
    }
    return false;
}

So, you want to search substrings like a pro? I'd suggest using the Aho-Corasick algorithm, which solves exactly the kind of problem you have.
Selling point:
It is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the "dictionary") within an input text. It matches all patterns simultaneously.
Luckily, a Java implementation exists. You can get it here.
Here's how to use it:
// this is the part you have to do only once
AhoCorasick tree = new AhoCorasick();
String[] terms = {"PO BOX","P0 DRAWER",...};
for (int i = 0; i < terms.length; i++) {
    tree.add(terms[i].getBytes(), terms[i]);
}
tree.prepare();
// here comes the part you use for every address you want to check
String text = "The ga3 mutant of Arabidopsis is a gibberellin-responsive. In UPO, that is...";
boolean restrictedWordFound = false;
@SuppressWarnings("unchecked")
Iterator<SearchResult> search = (Iterator<SearchResult>)tree.search(text.getBytes());
if (search.hasNext()) {
    restrictedWordFound = true;
}

If a match has been found, restrictedWordFound will be true.
Note: this search is case-sensitive. Since your strings are all in upper case, I'd suggest first converting the address to a temporary upper-case copy and running the match on that. That way, you cover every combination of upper and lower case.
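The upper-casing step does not depend on which search you use. Here is a minimal sketch that uses plain `String.contains` as a stand-in for the Aho-Corasick search; the class name `CaseNormalizedCheck` and the shortened term list are assumptions for illustration:

```java
import java.util.Locale;

// Hypothetical helper showing the "upper-case first, then search" step.
final class CaseNormalizedCheck {
    // Shortened, illustrative term list; all entries are upper case.
    private static final String[] TERMS = { "PO BOX", "GENERAL DELIVERY", "C/O" };

    static boolean containsRestrictedTerm(String address) {
        // Locale.ROOT avoids locale-specific case mappings (e.g. the Turkish dotless i).
        String upper = address.toUpperCase(Locale.ROOT);
        for (String term : TERMS) {
            if (upper.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsRestrictedTerm("po box 12"));
        System.out.println(containsRestrictedTerm("12 Main Street"));
    }
}
```

The original address string is left untouched; only the temporary copy is upper-cased.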
From my tests, Aho Corasick is faster than regex based search and in most cases faster than naive string searching using contains and other String based methods. You can add even more filter words; Aho Corasick is the way to go.
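Instead of using such a complicated regular expression, you can simply list the strings as alternatives in one regex:
"PO BOX|P0 DRAWER|POSTOFFICE| PO | BOX |C/O|C.O.| ICO | C/O | C\0 |C/0|P O BOX|P 0 BOX|P 0 B0X|P0 B0X|P0 BOX|P0BOX|P0B0X|POBX|P0BX|POBOX|P.0.|P.O|P O | P 0 |P.O.BOX|P.O.B|POB |P0B|P 0 B|P O B| CARE |IN CARE| APO | CPO | UPO |GENDEL|GEN DEL|GENDELIVERY|GEN DELIVERY|GENERALDEL|GENERAL DEL|GENERALDELIVERY|GENERAL DELIVERY"
Then negate the result: the address is valid only if the regex finds no match. Note that an unescaped . matches any character, so write \\. (in a Java string literal) wherever you mean a literal dot.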
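Compile the regex once (in Java, with Pattern.compile) and reuse the resulting Pattern for every address, which is cheaper than recompiling it on each check. Be aware that java.util.regex is a backtracking engine and does not build a minimised DFA, so the alternatives are tried in order rather than all at once.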
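A minimal sketch of this alternation approach, assuming you build the regex from the list of strings rather than typing it by hand. The class name `AlternationFilter` and the shortened term list are illustrative; `Pattern.quote` is used so that characters such as `.` are matched literally:

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

// Hypothetical helper: one precompiled alternation of literal terms.
final class AlternationFilter {
    // Shortened, illustrative term list.
    private static final String[] TERMS = { "PO BOX", "P.O", "C/O", " APO " };

    // Compiled once and reused for every address.
    private static final Pattern RESTRICTED = buildPattern(TERMS);

    static Pattern buildPattern(String[] terms) {
        StringJoiner joiner = new StringJoiner("|");
        for (String term : terms) {
            // Quote each term so regex metacharacters are treated as plain text.
            joiner.add(Pattern.quote(term));
        }
        return Pattern.compile(joiner.toString());
    }

    // "Negate the answer": an address is allowed only if nothing matches.
    static boolean isAllowed(String address) {
        return !RESTRICTED.matcher(address).find();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("12 MAIN ST"));
        System.out.println(isAllowed("P.O 44"));
    }
}
```

Because each term is quoted, an address such as "PXO 44" is not rejected by the "P.O" entry, which it would be with an unescaped dot.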