J'utilise RegexBuddy tout en travaillant avec des expressions régulières. De sa bibliothèque, j'ai copié l'expression régulière pour correspondre aux URL. J'ai testé avec succès dans RegexBuddy. Cependant, lorsque je l'ai copié en tant que String
saveur Java et que je l'ai collé dans du code Java, cela ne fonctionne pas. La classe suivante imprime false
:
public class RegexFoo {
public static void main(String[] args) {
String regex = "\\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]";
String text = "http://google.com";
System.out.println(IsMatch(text,regex));
}
private static boolean IsMatch(String s, String pattern) {
try {
Pattern patt = Pattern.compile(pattern);
Matcher matcher = patt.matcher(s);
return matcher.matches();
} catch (RuntimeException e) {
return false;
}
}
}
Quelqu'un sait-il ce que je fais mal?
java
regex
regexbuddy
Sergio del Amo
la source
la source
Réponses:
Essayez plutôt la chaîne de regex suivante. Votre test a probablement été effectué en respectant la casse. J'ai ajouté les alphas minuscules ainsi qu'un espace réservé de début de chaîne approprié.
String regex = "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Cela fonctionne aussi:
String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Remarque:
String regex = "<\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // matches <http://google.com> String regex = "<^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // does not match <http://google.com>
la source
La meilleure façon de le faire maintenant est:
EDIT: Code de
Patterns
partir https://github.com/android/platform_frameworks_base/blob/master/core/java/android/util/Patterns.java :/* * Copyright (C) 2007 The Android Open Source Project * * Licensed under the Apache License, Version 2.0 (the "License"); * you may not use this file except in compliance with the License. * You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ package android.util; import java.util.regex.Matcher; import java.util.regex.Pattern; /** * Commonly used regular expression patterns. */ public class Patterns { /** * Regular expression to match all IANA top-level domains. * List accurate as of 2011/07/18. List taken from: * http://data.iana.org/TLD/tlds-alpha-by-domain.txt * This pattern is auto-generated by frameworks/ex/common/tools/make-iana-tld-pattern.py * * @deprecated Due to the recent profileration of gTLDs, this API is * expected to become out-of-date very quickly. Therefore it is now * deprecated. */ @Deprecated public static final String TOP_LEVEL_DOMAIN_STR = "((aero|arpa|asia|a[cdefgilmnoqrstuwxz])" + "|(biz|b[abdefghijmnorstvwyz])" + "|(cat|com|coop|c[acdfghiklmnoruvxyz])" + "|d[ejkmoz]" + "|(edu|e[cegrstu])" + "|f[ijkmor]" + "|(gov|g[abdefghilmnpqrstuwy])" + "|h[kmnrtu]" + "|(info|int|i[delmnoqrst])" + "|(jobs|j[emop])" + "|k[eghimnprwyz]" + "|l[abcikrstuvy]" + "|(mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])" + "|(name|net|n[acefgilopruz])" + "|(org|om)" + "|(pro|p[aefghklmnrstwy])" + "|qa" + "|r[eosuw]" + "|s[abcdeghijklmnortuvyz]" + "|(tel|travel|t[cdfghjklmnoprtvwz])" + "|u[agksyz]" + "|v[aceginu]" + "|w[fs]" + "|(\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae|\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435|\u0440\u0444|\u0441\u0440\u0431|\u05d8\u05e2\u05e1\u05d8|\u0622\u0632\u0645\u0627\u06cc\u0634\u06cc|\u0625\u062e\u062a\u0628\u0627\u0631|\u0627\u0644\u0627\u0631\u062f\u0646|\u0627\u0644\u062c\u0632\u0627\u0626\u0631|\u0627\u0644\u0633\u0639\u0648\u062f\u064a\u0629|\u0627\u0644\u0645\u063a\u0631\u0628|\u0627\u0645\u0627\u0631\u0627\u062a|\u0628\u06be\u0627\u0631\u062a|\u062a\u0648\u0646\u0633|\u0633\u0648\u0631\u064a\u0629|\u0641\u0644\u0633\u0637\u064a\u0646|\u0642\u0637\u0631|\u0645\u0635\u0631|\u092a\u0930\u0940\u0915\u094d\u0937\u093e|\u092d\u093e\u0930\u0924|\u09ad\u09be\u09b0\u09a4|\u0a2d\u0a3e\u0a30\u0a24|\u0aad\u0abe\u0ab0\u0aa4|\u0b87\u0ba8\u0bcd\u0ba4\u0bbf\u0baf\u0bbe|\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8|\u0b9a\u0bbf\u0b99\u0bcd\u0b95\u0baa\u0bcd\u0baa\u0bc2\u0bb0\u0bcd|\u0baa\u0bb0\u0bbf\u0b9f\u0bcd\u0b9a\u0bc8|\u0c2d\u0c3e\u0c30\u0c24\u0c4d|\u0dbd\u0d82\u0d9a\u0dcf|\u0e44\u0e17\u0e22|\u30c6\u30b9\u30c8|\u4e2d\u56fd|\u4e2d\u570b|\u53f0\u6e7e|\u53f0\u7063|\u65b0\u52a0\u5761|\u6d4b\u8bd5|\u6e2c\u8a66|\u9999\u6e2f|\ud14c\uc2a4\ud2b8|\ud55c\uad6d|xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-3e0b707e|xn\\-\\-45brj9c|xn\\-\\-80akhbyknj4f|xn\\-\\-90a3ac|xn\\-\\-9t4b11yi5a|xn\\-\\-clchc0ea0b2g2a9gcd|xn\\-\\-deba0ad|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-g6w251d|xn\\-\\-gecrj9c|xn\\-\\-h2brj9c|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-j6w193g|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-kprw13d|xn\\-\\-kpry57d|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1ai|xn\\-\\-pgbs0dh|xn\\-\\-s9brj9c|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-yfro4i67o|xn\\-\\-ygbi2ammx|xn\\-\\-zckzah|xxx)" + "|y[et]" + "|z[amw])"; /** * Regular expression pattern to match all IANA top-level domains. * @deprecated This API is deprecated. See {@link #TOP_LEVEL_DOMAIN_STR}. */ @Deprecated public static final Pattern TOP_LEVEL_DOMAIN = Pattern.compile(TOP_LEVEL_DOMAIN_STR); /** * Regular expression to match all IANA top-level domains for WEB_URL. * List accurate as of 2011/07/18. List taken from: * http://data.iana.org/TLD/tlds-alpha-by-domain.txt * This pattern is auto-generated by frameworks/ex/common/tools/make-iana-tld-pattern.py * * @deprecated This API is deprecated. See {@link #TOP_LEVEL_DOMAIN_STR}. */ @Deprecated public static final String TOP_LEVEL_DOMAIN_STR_FOR_WEB_URL = "(?:" + "(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])" + "|(?:biz|b[abdefghijmnorstvwyz])" + "|(?:cat|com|coop|c[acdfghiklmnoruvxyz])" + "|d[ejkmoz]" + "|(?:edu|e[cegrstu])" + "|f[ijkmor]" + "|(?:gov|g[abdefghilmnpqrstuwy])" + "|h[kmnrtu]" + "|(?:info|int|i[delmnoqrst])" + "|(?:jobs|j[emop])" + "|k[eghimnprwyz]" + "|l[abcikrstuvy]" + "|(?:mil|mobi|museum|m[acdeghklmnopqrstuvwxyz])" + "|(?:name|net|n[acefgilopruz])" + "|(?:org|om)" + "|(?:pro|p[aefghklmnrstwy])" + "|qa" + "|r[eosuw]" + "|s[abcdeghijklmnortuvyz]" + "|(?:tel|travel|t[cdfghjklmnoprtvwz])" + "|u[agksyz]" + "|v[aceginu]" + "|w[fs]" + "|(?:\u03b4\u03bf\u03ba\u03b9\u03bc\u03ae|\u0438\u0441\u043f\u044b\u0442\u0430\u043d\u0438\u0435|\u0440\u0444|\u0441\u0440\u0431|\u05d8\u05e2\u05e1\u05d8|\u0622\u0632\u0645\u0627\u06cc\u0634\u06cc|\u0625\u062e\u062a\u0628\u0627\u0631|\u0627\u0644\u0627\u0631\u062f\u0646|\u0627\u0644\u062c\u0632\u0627\u0626\u0631|\u0627\u0644\u0633\u0639\u0648\u062f\u064a\u0629|\u0627\u0644\u0645\u063a\u0631\u0628|\u0627\u0645\u0627\u0631\u0627\u062a|\u0628\u06be\u0627\u0631\u062a|\u062a\u0648\u0646\u0633|\u0633\u0648\u0631\u064a\u0629|\u0641\u0644\u0633\u0637\u064a\u0646|\u0642\u0637\u0631|\u0645\u0635\u0631|\u092a\u0930\u0940\u0915\u094d\u0937\u093e|\u092d\u093e\u0930\u0924|\u09ad\u09be\u09b0\u09a4|\u0a2d\u0a3e\u0a30\u0a24|\u0aad\u0abe\u0ab0\u0aa4|\u0b87\u0ba8\u0bcd\u0ba4\u0bbf\u0baf\u0bbe|\u0b87\u0bb2\u0b99\u0bcd\u0b95\u0bc8|\u0b9a\u0bbf\u0b99\u0bcd\u0b95\u0baa\u0bcd\u0baa\u0bc2\u0bb0\u0bcd|\u0baa\u0bb0\u0bbf\u0b9f\u0bcd\u0b9a\u0bc8|\u0c2d\u0c3e\u0c30\u0c24\u0c4d|\u0dbd\u0d82\u0d9a\u0dcf|\u0e44\u0e17\u0e22|\u30c6\u30b9\u30c8|\u4e2d\u56fd|\u4e2d\u570b|\u53f0\u6e7e|\u53f0\u7063|\u65b0\u52a0\u5761|\u6d4b\u8bd5|\u6e2c\u8a66|\u9999\u6e2f|\ud14c\uc2a4\ud2b8|\ud55c\uad6d|xn\\-\\-0zwm56d|xn\\-\\-11b5bs3a9aj6g|xn\\-\\-3e0b707e|xn\\-\\-45brj9c|xn\\-\\-80akhbyknj4f|xn\\-\\-90a3ac|xn\\-\\-9t4b11yi5a|xn\\-\\-clchc0ea0b2g2a9gcd|xn\\-\\-deba0ad|xn\\-\\-fiqs8s|xn\\-\\-fiqz9s|xn\\-\\-fpcrj9c3d|xn\\-\\-fzc2c9e2c|xn\\-\\-g6w251d|xn\\-\\-gecrj9c|xn\\-\\-h2brj9c|xn\\-\\-hgbk6aj7f53bba|xn\\-\\-hlcj6aya9esc7a|xn\\-\\-j6w193g|xn\\-\\-jxalpdlp|xn\\-\\-kgbechtv|xn\\-\\-kprw13d|xn\\-\\-kpry57d|xn\\-\\-lgbbat1ad8j|xn\\-\\-mgbaam7a8h|xn\\-\\-mgbayh7gpa|xn\\-\\-mgbbh1a71e|xn\\-\\-mgbc0a9azcg|xn\\-\\-mgberp4a5d4ar|xn\\-\\-o3cw4h|xn\\-\\-ogbpf8fl|xn\\-\\-p1ai|xn\\-\\-pgbs0dh|xn\\-\\-s9brj9c|xn\\-\\-wgbh1c|xn\\-\\-wgbl6a|xn\\-\\-xkc2al3hye2a|xn\\-\\-xkc2dl3a5ee0h|xn\\-\\-yfro4i67o|xn\\-\\-ygbi2ammx|xn\\-\\-zckzah|xxx)" + "|y[et]" + "|z[amw]))"; /** * Good characters for Internationalized Resource Identifiers (IRI). * This comprises most common used Unicode characters allowed in IRI * as detailed in RFC 3987. * Specifically, those two byte Unicode characters are not included. */ public static final String GOOD_IRI_CHAR = "a-zA-Z0-9\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF"; public static final Pattern IP_ADDRESS = Pattern.compile( "((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(25[0-5]|2[0-4]" + "[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]" + "[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}" + "|[1-9][0-9]|[0-9]))"); /** * RFC 1035 Section 2.3.4 limits the labels to a maximum 63 octets. */ private static final String IRI = "[" + GOOD_IRI_CHAR + "]([" + GOOD_IRI_CHAR + "\\-]{0,61}[" + GOOD_IRI_CHAR + "]){0,1}"; private static final String GOOD_GTLD_CHAR = "a-zA-Z\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF"; private static final String GTLD = "[" + GOOD_GTLD_CHAR + "]{2,63}"; private static final String HOST_NAME = "(" + IRI + "\\.)+" + GTLD; public static final Pattern DOMAIN_NAME = Pattern.compile("(" + HOST_NAME + "|" + IP_ADDRESS + ")"); /** * Regular expression pattern to match most part of RFC 3987 * Internationalized URLs, aka IRIs. Commonly used Unicode characters are * added. */ public static final Pattern WEB_URL = Pattern.compile( "((?:(http|https|Http|Https|rtsp|Rtsp):\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)" + "\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_" + "\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?" + "(?:" + DOMAIN_NAME + ")" + "(?:\\:\\d{1,5})?)" // plus option port number + "(\\/(?:(?:[" + GOOD_IRI_CHAR + "\\;\\/\\?\\:\\@\\&\\=\\#\\~" // plus option query params + "\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?" + "(?:\\b|$)"); // and finally, a word boundary or end of // input. This is to stop foo.sure from // matching as foo.su public static final Pattern EMAIL_ADDRESS = Pattern.compile( "[a-zA-Z0-9\\+\\.\\_\\%\\-\\+]{1,256}" + "\\@" + "[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}" + "(" + "\\." + "[a-zA-Z0-9][a-zA-Z0-9\\-]{0,25}" + ")+" ); /** * This pattern is intended for searching for things that look like they * might be phone numbers in arbitrary text, not for validating whether * something is in fact a phone number. It will miss many things that * are legitimate phone numbers. * * <p> The pattern matches the following: * <ul> * <li>Optionally, a + sign followed immediately by one or more digits. Spaces, dots, or dashes * may follow. * <li>Optionally, sets of digits in parentheses, separated by spaces, dots, or dashes. * <li>A string starting and ending with a digit, containing digits, spaces, dots, and/or dashes. * </ul> */ public static final Pattern PHONE = Pattern.compile( // sdd = space, dot, or dash "(\\+[0-9]+[\\- \\.]*)?" // +<digits><sdd>* + "(\\([0-9]+\\)[\\- \\.]*)?" // (<digits>)<sdd>* + "([0-9][0-9\\- \\.]+[0-9])"); // <digit><digit|sdd>+<digit> /** * Convenience method to take all of the non-null matching groups in a * regex Matcher and return them as a concatenated string. * * @param matcher The Matcher object from which grouped text will * be extracted * * @return A String comprising all of the non-null matched * groups concatenated together */ public static final String concatGroups(Matcher matcher) { StringBuilder b = new StringBuilder(); final int numGroups = matcher.groupCount(); for (int i = 1; i <= numGroups; i++) { String s = matcher.group(i); if (s != null) { b.append(s); } } return b.toString(); } /** * Convenience method to return only the digits and plus signs * in the matching string. * * @param matcher The Matcher object from which digits and plus will * be extracted * * @return A String comprising all of the digits and plus in * the match */ public static final String digitsAndPlusOnly(Matcher matcher) { StringBuilder buffer = new StringBuilder(); String matchingRegion = matcher.group(); for (int i = 0, size = matchingRegion.length(); i < size; i++) { char character = matchingRegion.charAt(i); if (character == '+' || Character.isDigit(character)) { buffer.append(character); } } return buffer.toString(); } /** * Do not create this static utility class. */ private Patterns() {} }
la source
java
solution, pasandroid
Je vais essayer une norme "Pourquoi faites-vous cela de cette façon?" réponse ... Tu le sais
java.net.URL
?URL url = new URL(stringURL);
Ce qui précède lancera un
MalformedURLException
s'il ne peut pas analyser l'URL.la source
Le problème avec toutes les approches suggérées: tout RegEx est en cours de validation
Tout le code basé sur RegEx est sur-conçu: il ne trouvera que des URL valides! À titre d'exemple, il ignorera tout ce qui commence par "http: //" et contenant des caractères non ASCII.
Encore plus: j'ai rencontré des temps de traitement de 1 à 2 secondes (monothread, dédié) avec le package Java RegEx (filtrage des adresses e-mail à partir du texte) pour des phrases très petites et simples, rien de spécifique; éventuellement bug dans Java 6 RegEx ...
La solution la plus simple / la plus rapide serait d'utiliser StringTokenizer pour diviser le texte en jetons, pour supprimer les jetons commençant par "http: //" etc., et pour concaténer à nouveau les jetons en texte.
Si vous voulez filtrer les e-mails à partir du texte (parce que plus tard vous ferez du personnel PNL, etc.) - supprimez simplement tous les jetons contenant "@" à l'intérieur.
Il s'agit d'un texte simple où RegEx de Java 6 échoue. Essayez-le dans différentes variantes de Java. Cela prend environ 1000 millisecondes par appel RegEx, dans une application de test à thread unique de longue durée:
pattern = Pattern.compile("[A-Za-z0-9](([_\\.\\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\\.\\-]?[a-zA-Z0-9]+)*)\\.([A-Za-z]{2,})", Pattern.CASE_INSENSITIVE); "Avalanna is such a sweet little girl! It would b heartbreaking if cancer won. She's so precious! #BeliebersPrayForAvalanna"); "@AndySamuels31 Hahahahahahahahahhaha lol, you don't look like a girl hahahahhaahaha, you are... sexy.";
Ne vous fiez pas aux expressions régulières si vous avez seulement besoin de filtrer les mots avec "@", "http: //", "ftp: //", "mailto:"; il s'agit d'une surcharge d'ingénierie énorme.
Si vous voulez vraiment utiliser RegEx avec Java, essayez Automaton
la source
it will find only valid URLs!
- c'est le but de la question d'OP. Est-ce que je manque quelque chose?Conformément à la réponse de billjamesdev, voici une autre approche pour valider une URL sans utiliser de RegEx:
À partir de la bibliothèque Apache Commons Validator , regardez la classe UrlValidator . Quelques exemples de code:
Construisez un UrlValidator avec des schémas valides de "http" et "https".
String[] schemes = {"http","https"}. UrlValidator urlValidator = new UrlValidator(schemes); if (urlValidator.isValid("ftp://foo.bar.com/")) { System.out.println("url is valid"); } else { System.out.println("url is invalid"); } prints "url is invalid"
Si à la place le constructeur par défaut est utilisé.
UrlValidator urlValidator = new UrlValidator(); if (urlValidator.isValid("ftp://foo.bar.com/")) { System.out.println("url is valid"); } else { System.out.println("url is invalid"); }
affiche "l'URL est valide"
la source
Cela fonctionne aussi:
String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Remarque:
String regex = "<\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // matches <http://google.com> String regex = "<^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]>"; // does not match <http://google.com>
Donc probablement le premier est plus utile pour un usage général.
la source
((http?|https|ftp|file)://)?((W|w){3}.)?[a-zA-Z0-9]+\.[a-zA-Z]+
vérifiez ici: - https://www.freeformatter.com/java-regex-tester.html#ad-output
Il trie correctement ces entrées
la source
Lorsque vous utilisez des expressions régulières de la bibliothèque de RegexBuddy, assurez-vous d'utiliser les mêmes modes de correspondance dans votre propre code que l'expression régulière de la bibliothèque. Si vous générez un extrait de code source dans l'onglet Utiliser, RegexBuddy définira automatiquement les options de correspondance correctes dans l'extrait de code source. Si vous copiez / collez l'expression régulière, vous devez le faire vous-même.
Dans ce cas, comme d'autres l'ont souligné, vous avez manqué l'option d'insensibilité à la casse.
la source